Example-Driven User Intent Discovery: Empowering Users to Cross the SQL Barrier Through Query by Example
Anna Fariha, Lucy Cousins, Narges Mahyar, Alexandra Meliou
December 29, 2020

Abstract. Traditional data systems require specialized technical skills: users need to understand the data organization and write precise queries to access data. Novice users who lack this technical expertise therefore face hurdles in perusing and analyzing data. Existing tools assist in formulating queries through keyword search, query recommendation, and query auto-completion, but still require some technical expertise. An alternative method for accessing data is Query by Example (QBE), where users express their data exploration intent simply by providing examples of their intended data. We study a state-of-the-art QBE system called SQuID, and contrast it with traditional SQL querying. Our comparative user studies demonstrate that users with varying expertise are significantly more effective and efficient with SQuID than with SQL. We find that SQuID eliminates the barriers in studying the database schema, formalizing task semantics, and writing syntactically correct SQL queries, and thus substantially alleviates the need for technical expertise in data exploration.

Consider a sales executive who needs to produce a sales report from an enterprise database, where the data is spread across multiple tables and entities are stored under internal encodings (e.g., a numeric product ID instead of product name). To generate the sales report, the sales executive will have to (1) familiarize themselves with the data organization (schema) to locate relevant tables and understand the name encoding schemes, and (2) pose a query in the SQL language that is both syntactically and semantically correct to obtain the desired sales records in the correct format (e.g., customer names and product names). These steps are challenging for a sales executive who lacks a technical background, and thus they would prefer to bypass such complexities. However, an enterprise information worker, such as this sales executive, is often aware of a few examples that should be present in the report. They might remember that John Smith bought an iPad and Nora Shankar bought a Samsung smartphone last week. Certainly, they might not remember all sales records, but can an example-based interaction mechanism effectively assist this sales executive in their task, with just these examples? Furthermore, for users with some technical skills, would such an interaction model still be useful?

Example-based interactions have been explored as a method to bridge the usability gap of computational systems that typically require precise programs from users, such as in our user scenario above. Under the programming by example (PBE) paradigm (also known as programming by demonstration), instead of writing a precise program to specify their intent, users only need to provide a few examples of the mechanism or result they desire [7, 24, 43, 56]. Prior work conducted user studies to contrast PBE tools against traditional alternatives [11, 41, 47, 55]. However, none of them considered PBE tools that are specifically designed for data exploration over relational databases. We argue that query by example (QBE), a facet of PBE focused on access and exploration of relational data, has unique characteristics and poses distinct challenges compared to general PBE methods.
The focus of our work in this paper is to study the effectiveness and usability of state-of-the-art QBE against the traditional relational data access methods that rely on SQL programs, through comparative user studies. We proceed to provide some background on PBE and QBE systems, highlight the unique aspects of QBE that have not been addressed by prior work and call for a targeted study, and summarize our method and the contributions we make in this paper.

Programming by example (PBE): background and applications. The PBE paradigm is based on the intuitive premise that users who may lack or have low technical skills, but have expertise in a particular domain, can more easily express their computational desire by providing examples than by writing programs under strict language specifications. This is in contrast with traditional program synthesis [26, 33, 54], which requires a high-level formal specification (e.g., first-order logic) of the desired program. Example-driven program synthesis has been effectively used for a variety of tasks, such as code synthesis for data scientists [11]; data wrangling [23], integration [31], extraction [5, 39], transformation [22, 28], and filtering [67]; data structure transformation [18]; text processing [70], normalization [38], and summarization [15]; querying relational databases [58]; and so on.

Query by example (QBE): the need for a new study. Example-driven interactions have also been explored in the context of retrieving and exploring relational data, which led to the development of query by example (QBE) systems [8, 16, 17, 52, 58]. In QBE systems, a user is expected to provide examples of the data records they would like to retrieve, in place of providing a well-formed query in the SQL language. The QBE system then infers the query the user likely intended, and uses it to retrieve additional records from the database. QBE is a special category of PBE that brings forth unique aspects and challenges. We proceed to describe three significant distinctions that motivate our comparative study evaluation of QBE systems.

First, the traditional mechanism for retrieving relational data requires not only strong technical skills over the SQL language, but also familiarity with the structural organization of the data, called a schema. Schemas can be very complex, may contain domain-specific abstractions, differ from one database to the next, and may also change over time. As a result, even expert users with prior SQL experience can struggle to familiarize themselves with the schema of a previously unseen dataset, leading to difficulties in data exploration. Therefore, QBE needs to be studied from the perspective of users with varied levels of expertise, and the study needs to investigate the pain points specific to relational data access and exploration.

Second, the operational mechanisms in QBE systems fundamentally differ from those in general PBE systems. Traditional PBE approaches often rely on demonstration, where the mechanism to solve the intended task is demonstrated by the user. In contrast, in QBE, the user gives examples of the intended output and not the querying mechanism. Other PBE approaches rely on complete input-output specifications: the user needs to provide, typically small, sample inputs and outputs, and the system infers their intended program.
This mechanism is also not possible in a data exploration setting, where the input data is predetermined and typically large, and the user can only provide a small set of examples of their intended query output. Since the set of examples in the QBE setting is naturally incomplete, there is typically a much larger number of queries (programs) that could be compatible with them, compared to the general PBE setting; thus, the effectiveness of QBE systems needs to be explored with a targeted study.

Third, the setting of data exploration has two characteristics that can have a significant impact on the performance of a QBE system: (1) Since the user needs to provide example records from the dataset at hand, domain expertise can have a bigger impact on the user experience than in the general PBE setting. (2) Data exploration tasks can be vague and subjective, where a strict specification is often hard or even impossible to derive even by experts; this is a perspective not relevant to general PBE and not explored by prior studies.

Our scope and method. In this paper, we present findings from our comparative user studies over a QBE tool and the traditional SQL-based mechanism. For our study, we picked SQuID [16, 17], a state-of-the-art QBE system; its query intent discovery approach relates to several other lines of work, such as data exploration [30], query reverse engineering [64], and recommendation systems [46]. We conducted two comparative user studies: (1) a controlled experiment study involving 35 participants, and (2) an interview study involving 7 interviewees to gain a richer understanding of users' issues and preferences. All participants and interviewees had varying levels of SQL expertise and experience, but were required to have at least basic SQL skills. Our studies focused on the task of data exploration and explored how SQuID compares against the traditional SQL querying mechanism, over a variety of objective and subjective data exploration tasks. Specifically, our study aimed to identify the most critical issues users face when interacting with the traditional SQL querying mechanism, to what extent a QBE system like SQuID can alleviate these challenges, how effective SQuID is over a variety of data exploration tasks, and what are the possible pain points of SQuID.

Contributions. We summarize our contributions below:
• Through an analysis of the SQL queries issued by the controlled experiment study participants and quantitative analysis of the data collected from the study, we found that participants were significantly more effective (achieved more accurate results) and efficient (required less time and fewer attempts) over a diverse set of subjective and objective tasks using SQuID compared to manual SQL programming.
• From observations made from the behavior of the interviewees during our interview study, and their qualitative feedback, we identified three key challenges that SQL poses to the users: familiarizing oneself with the database schema, formally expressing the semantics of the task, and writing syntactically correct queries. From the qualitative feedback of the interviewees, we confirmed that SQuID removes these SQL challenges altogether and assists the users in effective data exploration. Notably, even some of the SQL experts reported that certain subjective queries were extremely hard to encode in SQL and that they would prefer SQuID over SQL in those circumstances.
• Finally, we discuss how SQuID and traditional SQL mechanisms complement each other, under what circumstances the users prefer one over the other, and how QBE tools should be expanded to achieve more user acceptance.
While our results validate some findings of prior studies over other PBE approaches [55], we contribute new empirical insights gained from our studies that indicate that even a limited level of domain expertise (knowledge of a small subset of the desired data) can substantially help overcome the lack of technical expertise (knowledge of SQL and schema) in data exploration.

Organization. The rest of the paper is organized as follows: We discuss the related work in Section 2. Section 3 gives an overview of the dataset and the two systems used in our studies: SQL and SQuID. In Section 4, we describe the design choices and methods of our comparative user studies. Sections 5 and 6 describe the quantitative findings and the qualitative feedback from the user studies, respectively. We discuss the key take-aways from the user study and provide guidelines to improve QBE tools with additional features in Section 7. Finally, we conclude in Section 8.

In this section, we provide an overview of the existing PBE and QBE approaches, discuss alternative mechanisms that also aid users in data exploration, and discuss prior literature on comparative user studies over other PBE approaches.

Many PBE approaches have been developed in the literature to aid novices or semi-experts in a variety of data management tasks. The focus of PBE is to not only solve the task, but also provide the mechanism that can solve the task. To this end, all PBE tools learn from the user examples and synthesize programs that can produce the desired results. To help data scientists write complex data-wrangling and data-transformation code, WREX [11] proposes an example-driven program synthesis approach. To enable integration of web data with spreadsheets, WebRelate [31] facilitates joining semi-structured web data with relational data in spreadsheets using input-output examples. FlashRelate [5] and FlashExtract [39] enable extraction of relational data from semi-structured spreadsheets, text files, and web pages, using examples. Data-transformation-by-example approaches [22, 28] led to the development of the FlashFill [19] feature in Microsoft Excel, which can learn the user's data transformation intent from only a few examples. Beyond data management tasks, PBE has recently been used for text processing [70], text normalization [38], and personalized text summarization [15]. Live programming [57] helps novice programmers understand their code: they can manipulate the input by directly editing the code and manipulate the output by providing examples of the desired output. Beyond computational tasks, PBE tools also support creative tasks such as music creation by example [20], where the software takes a song as an example and allows the user to interactively mix the AI-generated music.

Some QBE systems [52, 58] focus on identifying relevant relations and joins to compensate for the user's lack of schema understanding, but are limited to project-join queries. These systems only exploit the structural similarities of the examples and ignore the semantic similarities. QPlain [8] requires provenance of the examples from the users to better learn the join paths. However, this requires understanding of the schema, content, and domain of the data, which novice users often lack. Unlike QBE approaches, which can work with only a partial output (the examples), query reverse engineering (QRE) approaches require the entire output with respect to the original database.
With this complete output specification, QRE can target more expressive queries [65, 73], but it only works for very small databases and fails to scale to large ones. Some QRE approaches require the user to specify a small input database, the corresponding output, and constants in the query [65]. However, this requires complete schema knowledge and precise domain knowledge. QRE [4, 36, 50, 61-63, 68, 72] is less challenging than QBE, because it is aware of the entire output, while typically only a small fraction of the output is available for QBE. Thus, QRE systems can build data classification models on denormalized tables [63], treating the user-provided examples as positive and the rest as negative. However, due to the lack of sufficient annotated data, similar techniques do not apply to QBE. A problem similar to QBE in relational databases is set expansion in knowledge bases [66, 69, 74]. SPARQLByE [9] allows querying datasets in the resource description framework (RDF) by annotated (positive/negative) examples. In semantic knowledge graphs, systems exist to address the entity set expansion problem using maximal-aspect-based entity models, semantic-feature-based graph queries, entity co-occurrence information, etc. [27, 32, 44, 49]. Although not applicable in the relational domain, these approaches also exploit the semantic context of the examples; however, they cannot learn new semantic properties that are not explicit in the knowledge base.

Beyond by-example methods, alternative approaches exist to aid novice users in exploring relational databases. Keyword-based search [2, 29, 71] allows accessing relational data without knowledge of the schema and SQL syntax, but does not facilitate search by examples. Other notable systems that aim to assist novice users in data exploration and complex query formulation are: QueRIE, a query recommendation system based on collaborative filtering [13]; SnipSuggest, a context-aware SQL autocompletion system [37]; SQL-Sugg, a keyword-based query suggestion system [14]; YmalDB, a "you-may-also-like"-style data exploration system [12]; and SnapToQuery, an exploratory query specification assistance tool [34]. These approaches focus on assisting users in query formulation, but assume that the users have sufficient knowledge about the schema and the data. VIDA [40], ShapeSearch [60], and Zenvisage [59] are visual query systems that allow visual data exploration, but they require the user to be aware of the trend within the output. Some approaches exploit user interaction to assist users in query formulation and result delivery [1, 6, 10, 21, 42]. There, the user has to provide relevance feedback on system-generated tuples. However, such highly interactive approaches are not suitable for data exploration, as users often lack knowledge about the system-provided tuples and thus fail to provide correct feedback reflecting their query intent. Moreover, such systems often require a large number of user interactions.

Drosos et al. [11] present a comparative user study contrasting WREX against manual programming. The study results indicate that data scientists are more effective and efficient at data wrangling with WREX than with manual programming. Mayer et al. [47] present a comparative study of two user interaction models (program navigation and conversational clarification) that can help resolve the ambiguities in the examples in by-example interaction models. Lee et al. [41] present an online user study on how PBE systems help users solve complex tasks.
They identify seven types of mistakes commonly made by users of PBE systems, and also suggest an actionable feedback mechanism based on unsuccessful examples. Santolucito et al. [55] studied the impact of PBE on real-world users over a tool for shell scripting by example. Their study results indicate that while users are quicker to solve the task using the PBE tool, they trust the traditional approach more. However, none of these studies focus on QBE in particular, which is a class of PBE systems tailored to data exploration over relational databases. The performance of a QBE tool is affected by additional factors, such as the subjectivity of the data exploration task and the domain knowledge of the user. Moreover, traditional data access and exploration methods pose hurdles not only to novices, but to expert users as well. These factors indicate the need for a new study that targets QBE systems in particular. In our comparative user studies, we studied how users perceive a state-of-the-art QBE system, SQuID, compared to the traditional SQL querying mechanism, over a variety of subjective and objective data exploration tasks.

In this section, we provide an overview of the dataset we used in our studies, along with a brief description of both systems. For our comparative user studies, our goal was to emulate data exploration tasks in a controlled experiment setting. Generally, people explore data they are interested in and within a domain they are somewhat familiar with. Moreover, data exploration with QBE expects some basic domain familiarity, as users need to be able to provide examples. Therefore, our goal in selecting a dataset was to identify a domain of general interest, where most study participants can be expected to have a basic level of domain familiarity. Furthermore, the dataset needs to be sufficiently large to emulate the practical challenges that users face during data exploration. We selected the Internet Movie Database (IMDb), which satisfies these goals. The IMDb website is a well-known source of movie and entertainment facts, has over 83 million registered users, and receives about 927 million page visits yearly. The database contains information on over 10 million personalities, along with their demographic information, and about 6 million movies and TV series, along with their genre, language, country, certificate, production company, cast and crew, etc.

The traditional way to query a relational database is to write a query in the structured query language (SQL). SQL is one of the most widely used programming languages (54.7% of developers use SQL [45]) for handling structured data, is specifically designed to query relational databases, and has been in use for decades. SQL is a declarative query language and is primarily based on relational algebra. The SQL language consists of several elements, such as clauses, expressions, predicates, statements, integrity constraints, etc. Different vendors (e.g., Oracle, Microsoft SQL Server, MySQL, PostgreSQL) implement SQL slightly differently, but fundamentally all implementations work the same way. For our comparative user studies, we picked PostgreSQL, which is a free and open-source relational database management system. Relational databases usually organize data in a normalized form, to avoid redundancy. This is in contrast with the flat data format, where all attributes of an entity are stored together within the same row.
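To make this contrast concrete, consider the following minimal sketch. The movie, genre, and movietogenre relations mirror the IMDb schema described next, but the flat table and the column names of movietogenre are our own illustrative assumptions:

-- Flat format: all attributes of a movie in a single row; the genre value
-- must be repeated for every movie-genre combination.
CREATE TABLE movie_flat (title TEXT, year INT, genre TEXT);

-- Normalized format: each entity lives in its own table, linked by keys.
CREATE TABLE movie        (id INT PRIMARY KEY, title TEXT, year INT);
CREATE TABLE genre        (id INT PRIMARY KEY, name TEXT);
CREATE TABLE movietogenre (movie_id INT REFERENCES movie(id),
                           genre_id INT REFERENCES genre(id));

Normalization stores each fact only once, but it also means that recombining the facts requires joins, which is where much of the querying difficulty discussed below originates.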
For example, the detailed schema of the IMDb database, split into 15 relational tables, is shown in Figure 1. Here, the relation movie contains only three attributes about movies: a numerical record id (called the primary key), a text attribute specifying the title of the movie, and the production year of the movie. However, information about the associated genres of a movie is not present in the movie table. To figure out the genres of a movie, one would need to write a SQL query to JOIN the tables movie, movietogenre, and genre. The query would also need to specify the logic behind this join, i.e., which rows in the genre table are relevant to a particular movie in the movie table.

SQL is a relatively simple language with a limited set of operators (e.g., SELECT, PROJECT, JOIN, etc.). While this simplicity enables users to quickly learn how to express easy intents using SQL (e.g., the SQL query SELECT title FROM movie would retrieve all movie titles), it comes at the cost that complex intents are hard to express in SQL. Specifically, the restrictions in the data organization (normalized schema) and the simplicity of the SQL operators make complex tasks harder to translate into SQL: the users must specify the entire data-retrieval logic. Overall, writing a successful SQL query for a data exploration task requires several skills: (1) familiarity with the database schema, (2) understanding of the table semantics, (3) understanding of the SQL operators, (4) knowledge of the SQL syntax, and (5) expertise in translating task intents to SQL.

SQuID [16, 17] is an end-to-end system that automatically formulates complex SQL queries over commonly used relational databases, given only a few example tuples of the intended result. SQuID expresses the problem of query intent discovery using a probabilistic model that infers the most likely query, given the examples. To mathematically derive the intended query, SQuID applies abduction [35, 48], an inference method that seeks the best explanation for a set of observations. Consider a user who provides a few funny actors as examples: SQuID may detect several similarities among them, such as demographic attributes that the examples happen to share, along with the fact that all of them appear in a large number of Comedy movies. Similarities of the first kind are shared by large portions of the database and could easily be coincidental. In contrast, a very small fraction of actors in the database are associated with such a high number of Comedy movies; this means that it is unlikely for this similarity to be coincidental, as opposed to the others. Based on abduction, SQuID selects this similarity as the best explanation of the observed example tuples, and produces a SQL query that retrieves all actors appearing in a sufficiently large number of Comedy movies (a hedged reconstruction of such a query is sketched below). SQuID then executes this query and presents the results, containing two well-known funny actors, Adam Sandler and Ben Stiller, among others (Figure 3).

In our user studies, our goal was to quantitatively compare the efficacy and efficiency of SQuID and SQL over a variety of data exploration tasks, while also gathering qualitative feedback from users regarding their experiences with the systems. To this end, we opted for two separate comparative user studies: (1) a controlled experiment study, with a fixed set of tasks, over a group of participants of sufficient size to support quantitative evaluation; (2) an interview study, with a flexible set of tasks, over a small group to gather qualitative user feedback. Due to the COVID-19 pandemic, both studies were conducted online: the controlled experiment was conducted through a website, hosted on our university servers, and the interview study was conducted over Zoom. For both studies, we provided the database schema (Figure 1) and a graphical user interface with a text box, where the participants could write SQL queries to interact with a PostgreSQL database.
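A plausible form of the query SQuID synthesizes in the funny-actors example above, sketched over the IMDb schema: the castinfo relation and the join-attribute names are our assumptions, and the threshold of 40 is purely illustrative (SQuID abduces such constants from the examples rather than asking the user for them):

SELECT person.name
FROM person, castinfo, movietogenre, genre
WHERE person.id = castinfo.person_id            -- actor appears in a movie
  AND castinfo.movie_id = movietogenre.movie_id
  AND movietogenre.genre_id = genre.id
  AND genre.name = 'Comedy'                     -- only Comedy credits count
GROUP BY person.id, person.name
HAVING COUNT(*) >= 40;                          -- "appears in many comedies"

Writing such a query manually requires knowing all four relations, their join paths, and a sensible value for the threshold; these are precisely the burdens that SQuID lifts from the user.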
For SQuID, we provided a graphical user interface to allow the participants to interact with the system (Figure 3). We now proceed to describe the settings, design choices, and methods of our comparative user studies. We first describe our controlled experiment study over a user group of 35 participants, followed by our interview study with a smaller group of 7 interviewees.

Participants. For our controlled experiment study, we recruited students who were enrolled in an undergraduate computer science course on Data Management Systems at our university during the Spring 2020 semester. The course offers an introduction to data management systems and the SQL language. This ensured that our study participants would have basic familiarity with SQL, which is required to compare the two systems: SQuID and SQL. We invited all 89 students enrolled in this course to take part in the study, and 35 of them agreed to participate. We offered extra credit for study participation; students who opted not to participate were given alternative opportunities for extra credit. We labeled these participants P1-P35. The average grade the participants achieved in the course was 86.3 (out of 100), with a minimum grade of 45 and a maximum grade of 100; the standard deviation of the grades was 9.87. This indicates a broad range in our participants' SQL skills, which was one of our goals. While all of them had prior experience and exposure, some had only very basic skills (and failed the class) and some achieved advanced skills.

Tasks. We designed 4 data exploration tasks over the IMDb database. Our goal was to observe what challenges a set of diverse tasks poses to the participants, and how the challenges vary based on the subjectivity of the tasks and the mechanism (SQuID or SQL) used to solve the tasks. To this end, we designed two objective tasks: (1) to find Disney movies and (2) to find Marvel movies; and two subjective tasks: (1) to find funny actors and (2) to find strong and muscular actors. We provided a detailed description for each task to the study participants. (Details are in our supplementary materials.)

Task-assignment mechanism. Each participant was assigned all four tasks in the sequence: Disney, Marvel, funny, and strong. This order was enforced to ensure that they perform the objective tasks first, which are easier, and then move to the more complex and subjective tasks. We randomized task-system pairings to make sure that for each task, about half of the participants use SQuID while the other half use SQL. The task-assignment mechanism was as follows: for each user, we randomized which system (SQuID or SQL) they were allowed to use for each task. Everyone did the tasks (Disney, Marvel, funny, strong) in that order, but there were two possible system assignment orders: (a) SQuID, SQL, SQuID, SQL, or (b) SQL, SQuID, SQL, SQuID. Each participant was randomly given one of these assignments. This resulted in randomized task-system pairings, with the constraint that each participant must solve one objective and one subjective task using SQL and the remaining two tasks (also one objective and one subjective) using SQuID. This mechanism also eliminated any potential order bias with respect to the treatment system, as half of the participants interacted with SQuID before SQL, while the other half interacted with SQL before SQuID. For each task (e.g., Disney), a participant used either SQuID or SQL, but not both.
Study procedure. This study was conducted online: the participants took the study over the Internet on a dedicated website, hosted on our university servers. We sent out the URL of the website during recruitment. At the beginning of the study, participants were asked a series of questions about their familiarity with SQL. The questions asked the participants to provide answers using a 5-point Likert scale ranging from "Not familiar (1)" to "Very familiar (5)". Next, there was a question asking them at what frequency they watch movies, followed by a question about overall movie and actor familiarity, where participants could select multiple options. After this survey, participants were given an interactive tutorial, divided into two sections, walking them through the steps to obtain results with both SQuID and SQL. The tutorial took about 2-5 minutes to complete. After the tutorial, the participants started the tasks. They had 10 minutes for each task, but could finish before the time was up if they chose to. Participants were asked to avoid using Internet search, but if they did, they were encouraged to report it. After each task, the participants were asked to answer a post-task survey with two questions, both on 5-point Likert scales: (1) the difficulty of the task, ranging from "Very difficult (1)" to "Very easy (5)"; and (2) their satisfaction with the results, ranging from "Very unsatisfied (1)" to "Very satisfied (5)". After completing all four tasks, the participants were asked to answer four survey questions, again on 5-point Likert scales: (1) their preference between SQuID and SQL, ranging from "Definitely SQL (1)" to "Definitely SQuID (5)"; (2) a usability comparison between SQL and SQuID, ranging from "SQL was a lot easier (1)" to "SQuID was a lot easier (5)"; (3) their satisfaction with the results obtained using SQuID, ranging from "Very unsatisfied (1)" to "Very satisfied (5)"; and (4) the accuracy of the results obtained using SQL, ranging from "Very inaccurate (1)" to "Very accurate (5)".

Ground truth. To measure the correctness of the results, we compiled ground-truth answers for each task from publicly available, user-curated IMDb lists, collecting several lists for each task. We selected lists that meet the following criteria: (1) they have a number of entries that is representative of the task (e.g., there are more than five Marvel movies, thus the list should contain more than five entries), (2) they are frequently viewed, and (3) they contain entries that match the task objectives. For instance, we collected a list of 300 funny actors, which was compiled from 7 shorter lists of funny actors. One of these lists, titled "Funny Actors", has over 400,000 views and includes 60 well-known comedians, including Jim Carrey, Robin Williams, Eddie Murphy, Mel Brooks, and Will Ferrell. We provide all the lists we used in our supplementary materials.

We conducted a comparative interview study to gain richer insights on users' behavior, their preferences, and the issues they faced while solving the data exploration tasks using both systems.
Interviewees. We recruited 7 interviewees for this study by targeting a diverse set of computer science graduate students directly working or collaborating with the data management research lab at our university. Out of the 7 interviewees, 4 were male and 3 were female; 6 of them were international students; and their ages ranged from 25 to 30 years old. All of them had experience using SQL for at least one year; however, their expertise varied from moderate to expert. We label the interviewees I1-I7. We provide further details on the interviewees in Figure 4.

Tasks. For this study, we asked the interviewees to pick one objective task from the following list: (1) Disney movies, (2) Marvel movies, (3) animation movies, (4) sci-fi movies, (5) action movies, (6) movies by an actor of their own choice, or (7) movies by a country of their own choice. We also asked them to select one subjective task from this list: (1) funny actors, (2) physically strong actors, or (3) serious actors. The variety of tasks allowed interviewees to pick tasks based on their interests and enabled us to observe how the two systems compare over a variety of data exploration tasks. This study was within-subject, i.e., all of the interviewees were required to use both systems (SQuID and SQL) to solve each task.

Study procedure. For each interview, two of our research team members were present: one as primary, to lead the interview and ask questions, and another as secondary, to take notes and ask potential follow-up questions. At the beginning of the study, we provided the interviewees the URL of the study website over the chat feature of Zoom. During the study, the interviewees first completed an interactive tutorial and then were asked to pick two tasks. The interviewees were then asked to solve each task using both SQuID and SQL, so that they could directly contrast the two systems. We asked them to complete each task first using SQuID and then using SQL, so that the examples they would provide while using SQuID would be free from biases due to observing the results of their SQL query outputs. We did not expose through the SQuID interface the query that SQuID generates, thus avoiding biases when the interviewees were completing the SQL tasks. The interviewees followed a think-aloud protocol and shared their screen over Zoom during the study. They were observed by two interviewers, who also asked open-ended questions on completion of each of the two tasks with both systems. The questions aimed to gather information on which of the two systems the interviewees prefer, under what circumstances they prefer one over the other, and their justification for why. They were also asked what challenges they faced while using the systems and whether some particular task exacerbated these challenges. Finally, they were asked what type of results they prefer during data exploration: specific or general.

Data analysis. We thematically analyzed the responses, using a spreadsheet as our coding tool. Two coders from our team independently coded the data. The following six themes emerged after several rounds of analysis: (1) struggle in task understanding, (2) struggle in familiarizing oneself with the schema while using SQL, (3) difficulties with writing syntactically correct SQL queries, (4) struggle with solving vague/subjective tasks using SQL, (5) struggle due to lack of domain familiarity while using SQuID, and (6) preference between precision and recall of the results.
Inter-coder reliability was 0.98, calculated using Krippendorff's alpha.

In this section, we present the quantitative results of the controlled experiment study, summarizing our findings. The distribution of self-reported movie-watching frequency among the participants is shown in Figure 5a, with the most common response being 'once or twice a month', followed by 'once or twice every few months'.

To quantitatively measure the quality of the results produced by both SQuID and SQL, we checked them against the ground-truth results (discussed in Section 4.1). We used three widely used correctness metrics to quantify result quality: precision, recall, and F1 score. These metrics capture different aspects: precision captures "preciseness", i.e., the fraction of retrieved tuples that are relevant; recall captures "coverage", i.e., the fraction of relevant tuples that are correctly retrieved; and the F1 score, the harmonic mean of precision and recall, maintains a balance between them. On average, we found SQuID to be more effective in generating accurate results than SQL (Figure 6). For all four tasks, on average across participants, results obtained with SQuID achieved significantly higher precision than the results obtained with SQL. SQuID achieved higher recall than SQL for the two objective tasks (Disney and Marvel). While SQuID's recall for the subjective tasks (Funny and Strong) was lower than SQL's, note that SQL's precision for those tasks was close to 0. This is simply because the SQL queries the participants wrote for those tasks were very imprecise and returned a very large number of results (e.g., all actors in the database). While such general queries can happen to contain a large portion of the correct results (hence the high recall), they contain an extremely large number of irrelevant results, making them poorly suited for this retrieval task. In terms of F1 score, SQuID always achieved higher values than SQL, implying its effectiveness over SQL for generating more accurate results. The results of t-tests for these findings are shown in Figure 7. Out of the 12 findings, 7 are statistically significant with a p-value less than 0.05.

SQuID helped the participants solve the tasks more quickly (Figure 8a) and with fewer attempts (Figure 8b) than SQL. On average, the participants were able to solve the tasks using SQuID about 200 seconds faster than when using SQL. Participants were also able to solve the tasks with about 4 fewer attempts while using SQuID compared to SQL. The results of t-tests for these findings, shown in Figure 9, indicate that most are statistically significant with a p-value less than 0.05.

Figures 10a and 10b show self-reported overall satisfaction with the results produced by SQuID and SQL, respectively. Generally, participants found the results produced by SQuID more satisfying than the results produced by SQL. Out of the 35 participants, 23 were somewhat or very satisfied with SQuID. In contrast, 18 reported that the results produced by SQL were somewhat or very accurate. However, we found that the self-reported satisfaction does not correlate with the actual correctness of the results (measured in terms of precision, recall, and F1 score); in fact, the participants generally did better with SQuID than SQL, although they did not always realize it.
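For reference, the three correctness metrics above follow their standard definitions. Writing $R$ for the set of retrieved tuples and $G$ for the set of ground-truth tuples:

\[
\text{precision} = \frac{|R \cap G|}{|R|}, \qquad
\text{recall} = \frac{|R \cap G|}{|G|}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.
\]

These definitions explain the pattern above: an overly broad query that returns every actor in the database can have recall close to 1, but its precision, and with it the F1 score, collapses toward 0.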
Figure 10c shows the self-reported overall evaluation comparing SQuID and SQL in terms of ease of use: out of the 35 participants, 19 reported that SQuID was easier, 6 reported that the two had the same level of difficulty, and 10 reported that SQL was easier. However, despite reporting that SQuID was easier to use and its results more satisfying, the participants still leaned towards SQL as their preferred mechanism for data exploration. Figure 10d shows self-reported overall preference between SQuID and SQL: 11 participants reported that they would prefer SQuID, while 19 reported that they would prefer SQL. Five participants reported no preference.

We now report the results of our interview study and describe six main themes that emerged from our qualitative analysis.

All seven of our interviewees commented that it was difficult to become acquainted with the database schema. "As a user, I have to explore the schema", I1 said. I1 continued, "The query itself was not complicated. It was time consuming to get familiar with the schema itself. Even for experienced users, reading through the schema [takes time] . . . [I thought a table] might have the attribute, but it didn't. Then I had to go through joining five tables!"

All interviewees struggled, to varying degrees, to write a SQL query because of different issues; e.g., some of them could not figure out the correct spelling of attributes. For instance, one would query for the genres 'scifi' or 'comedic', neither of which exists in the database. I4 said, "The difficult part was to get the accurate predicate for the query, and I had to [explore the database] for that." SQL requires strict string matching, which can be extremely difficult to overcome for someone who is unfamiliar with the database constants and SQL syntax. While it is possible to query a table and view its content to see how the names are spelled, very few interviewees did this. It appears that the ability to write a SQL query is based on experience and recent exposure to SQL. Interviewees noted that they do not use SQL on a daily basis (some even said they had not used SQL in months), and thus it was difficult to recall specific syntax. For instance, two of the interviewees, who had relatively lower SQL expertise, could not remember the requirements for joining tables. I7 had to use Google for help with this syntax, and I2 did not recall that SQL could join more than two tables. I5 also struggled to recall specifics.

In contrast, SQuID's search interface acts as an auto-correct when a name is spelled incorrectly: the auto-completion feature allows the users to type what they know and scroll through the suggestions until they find the proper name. We observed that several of our interviewees initially spelled a movie name incorrectly, but were helped by the auto-completion feature. For example, I2 initially typed 'Spiderman' in the search bar, but the title is spelled 'Spider-Man' in the database. I2 was able to correct the spelling when he typed 'Spider' in the search box and autocomplete showed the entire title. The search bar also helped I5, who noted, "If I was missing some spellings, there were some suggestions."

Some exploration tasks can be subjective and inherently vague; e.g., how does one define a "funny" actor precisely? How many comedies, exactly, does an actor have to star in before they are considered funny? These questions have no clear answers, and such parameters can vary from person to person and from day to day. In practice, it may be very difficult, if not impossible, to think of objective measures for a subjective concept, which makes subjective tasks very complicated to specify with SQL. I2 said, "Even if I forget about syntax . . .
figuring out how to go about writing the pseudo code query for funny actors [is difficult]". One of the most common blunders of interviewees who used SQL to find "funny" actors was to query all actors who had been in some comedy movie. I3 was the first to acknowledge this. "I had to play around with a lot of smaller queries," he said, "to get the one that I eventually had, which I was still not satisfied with. It seems like I pulled many actors and actresses that happened to be in some comedy." I3 elaborated, "Vague tasks are generally a lot more open to interpretation. Coding up a query that meets someone's vague specifications [is] hard . . . It was very hard to nail down what the correct definition of funny is." I4 also recognized that vague tasks are difficult to define. She even said, "This probably isn't a query that I should write in SQL!" She continued, "strong and muscular are very vague descriptors, and SQL needs clear rules. I have to use genre as a proxy, and that makes the query very nasty." On the other hand, SQuID can interpret complex parameters without any involvement from the user, sparing them the mental burden of defining and implementing a complex query. I4 also said, "In order to write a SQL query, you need to understand the schema well, know your data well, and know your question well . . . But if the task is exploratory and you only have a vague idea in mind, like 'strong actors' . . . it would be very hard, if not impossible, to write a SQL query." Indicating how SQuID helped in the subjective tasks, I3 said, "SQuID is a lot more user-oriented. You could just put in some actor names and it would infer what you really want."

We asked interviewees whether they would prefer a long list that includes all relevant names but may also include many irrelevant names (high recall), or a shorter list that includes exclusively relevant names with very few irrelevant names but may miss some relevant names (high precision). Six out of seven interviewees reported that they would prefer having a shorter list with higher precision, while one interviewee had no preference. "I think I'm okay with not having all Marvel movies listed here," I2 said, "but I definitely don't want anything outside of Marvel movies."

In this section, we summarize the significant findings from the quantitative and qualitative analysis of our comparative user studies and highlight the key take-aways. From our interviews, we identified three key pain points of the traditional SQL querying mechanism, all of which are removed when using SQuID:

Schema complexity. One significant difficulty that we observed during the use of SQL was the requirement of schema understanding. To issue a SQL query over a relational database, the user must first familiarize themselves with the database schema [3, 16, 58]. The schema is often complex, such as the IMDb schema shown in Figure 1, and requires significant effort to understand. The user also needs to correctly specify the constant values (e.g., Comedy and not Comedic), the names of the relations (e.g., movietogenre and not movie_to_genre), and the names of the attributes (e.g., id and not movie_id) in the SQL query. Moreover, some attributes reside in the main relation (e.g., person.name) while others reside in a different relation (e.g., the names of a movie's genres reside in the relation genre and not in the relation movie), as the sketch below illustrates.
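Even listing a single movie's genres already requires a three-way join. This is a minimal sketch over the relations named above; the attribute names of movietogenre are our assumption, since Figure 1 is not reproduced here:

-- Which genres is 'Spider-Man' associated with?
SELECT g.name
FROM movie m, movietogenre mg, genre g
WHERE m.id = mg.movie_id      -- link each movie to its genre entries
  AND mg.genre_id = g.id      -- link genre entries to genre names
  AND m.title = 'Spider-Man';

A user who guesses movie_to_genre for the relation name, or movie_id for the key attribute of movie, receives only an error message in return.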
From a closer look at some of the user-issued SQL queries, we observed futile efforts to guess keywords, incorrectly trying values such as "comedic", "superhero comics", and "funny", which do not exist in the database and result in semantically incorrect queries. In structured databases, if one does not know the exact keywords, they end up issuing an incorrect SQL query, which returns an empty result. In contrast, SQuID frees the user from this additional overhead, as it leverages the database content and schema and associates them automatically with the user-provided examples.

Semantic translation. After studying the schema, the next task is to translate the task's semantics formally into a language (e.g., SQL) that computational systems understand. While this is relatively easy for objective tasks (e.g., finding all movies produced by Disney), the same is not true for subjective tasks (e.g., finding all "funny" actors). As our qualitative feedback indicates, expressing subjective or vague tasks is hard in any formal language, not only in SQL. For example, for the task of finding all "funny" actors, even the SQL experts struggled to encode the concept "funny" in SQL. Many participants wrote a SQL query to retrieve all actors who appeared in at least one movie whose genre is Comedy. However, upon observing the output of such an ill-formed query, they were not satisfied with the results. This is because appearing in only one comedy movie does not necessarily make an actor funny. Usually, actors who appear in "many" comedy movies are considered funny. The key struggle here is to figure out the right threshold for "many", i.e., in how many comedy movies should an actor appear to be considered "funny" (the sketch following this discussion contrasts the ill-formed query with the intended one). In contrast, SQuID is able to discover these implicit constants from the provided examples automatically.

Language syntax. SQL is a programming language with several operators and keywords, and like all programming languages, it requires strict syntax. When issuing a SQL query, even a minor syntactic error results in complete failure and returns no result. Moreover, the syntax error messages that the SQL engine provides are often ambiguous and confusing to novice users. We observed that one of our interviewees could not recall the correct syntax of the JOIN operation. This stringent syntax requirement poses significant hurdles to novice and even intermediate SQL users. In contrast, SQuID completely bypasses SQL, eliminating this challenge.

In our controlled experiments, we noted that SQuID is generally more effective than SQL in deriving accurate results. For objective tasks, we found that SQuID outperforms SQL in all three correctness metrics: precision, recall, and F1 score. However, it is important to highlight that our interviewees noted that SQuID is particularly useful and preferable to SQL for subjective tasks. This does not contradict our quantitative analysis. While SQL has higher recall than SQuID for subjective tasks, SQuID achieves much higher F1 scores, because SQL's precision for these tasks is close to 0. This is because an extremely general SQL query (e.g., one that returns all the data) may have very high recall, but it will not be of use to an exploration task that expects targeted results. Furthermore, SQuID significantly boosts the user's efficiency in data exploration.
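To make the semantic-translation pain point concrete, here is the kind of under-constrained query that many participants wrote for the "funny actors" task, using the same assumed schema as in the earlier sketch:

-- "Funny actors": every actor with at least one Comedy credit.
SELECT DISTINCT p.name
FROM person p, castinfo c, movietogenre mg, genre g
WHERE p.id = c.person_id
  AND c.movie_id = mg.movie_id
  AND mg.genre_id = g.id
  AND g.name = 'Comedy';

Compared with the threshold query sketched in Section 3, the only semantic difference is the missing GROUP BY/HAVING clause with an appropriate threshold; yet conceiving that threshold is exactly the step participants struggled with, and the step SQuID infers from the examples.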
Our controlled experiment study confirmed this efficiency boost: participants achieved their goal much faster (about 200 seconds faster) and with less effort (about 4 fewer attempts) while using SQuID compared to SQL.

Lack of domain knowledge is a handicap for SQuID, as it requires at least a few initial examples for its inference. This is a general issue with all query-by-example mechanisms [16, 22, 58]. The specific properties of SQuID, namely interactivity, providing feedback, and promoting serendipitous discovery, make it a significantly better choice for solving subjective tasks, which are usually ambiguous and vague, and are very hard to solve using SQL. For example, in our studies, we used "strong actors" and "funny actors" as two examples of subjective tasks. Participants of both our controlled experiment study and interview study found thinking of examples easier than expressing their intent using SQL, especially for subjective tasks. Our results indicate that SQuID provides an easier mechanism for data retrieval and helps users overcome the difficulty of writing overly complex SQL queries for subjective tasks. In contrast, for objective tasks, we found both SQuID and SQL equally effective, given that the user has basic SQL expertise.

Trust in a system depends on prior exposure, expertise, the type of the tasks, and system explainability. During our controlled experiment, we wanted to measure how much the participants trust the mechanism that produces the results by asking the questions: "how well do you think SQuID did in generating the desired results?" and "how accurate were the SQL results?" While some participants reported that they were more satisfied with the results produced by SQuID than SQL, interestingly, many of them reported that they prefer SQL over SQuID even though they generally did better with SQuID (Figure 10d). This result is in line with prior work that compared a PBE tool against traditional shell scripting and found that despite performing better using the PBE tool, users tend to trust the traditional shell scripting more [55]. We validated this by checking against ground-truth results, where the SQuID groups achieved results with higher precision (more specific) and F1 score (more accurate), as shown in Figure 6. Since the participants performed better when using SQuID compared to SQL, we interpret their preference for SQL to be due to three possible sources of bias: (1) Familiarity: The participants were at the time taking a course on relational databases and SQL, which may have artificially increased their confidence in their SQL skills. They had prior experience with SQL, but were experiencing SQuID for the first time through the study. (2) Explainability: SQL exposes the precise mechanism (the code) that produces the results, while we did not provide participants with an explanation of the inner workings of SQuID, nor did we expose the query it produces. (3) Domain expertise: Low domain expertise poses a hurdle in producing examples for SQuID; we posit that the users may consider SQL a more versatile mechanism for such circumstances.

We further investigated the issue of trust during our interview study by asking all our interviewees the question: "Which of these two systems, SQuID or SQL, do you trust more?" We expected SQL experts to trust SQL more, but did not observe any strong trend. Rather, the interviewees mentioned that for objective tasks, they were more confident about the SQL queries they wrote, and hence, they trusted SQL more.
In contrast, for the subjective tasks, they reported that they trusted the results produced by SQuID more; there, the most common complaint was that SQL produced too many results (less specific) and perhaps retrieved the entire database content. Ultimately, SQuID can also provide explanations, by exposing the SQL query it synthesizes to generate the results and the underlying mechanism used to synthesize the query. We shed more light on this in the discussion of future work below.

A desired property for any system is learnability: how easy it is to get used to the system. From our study, we found that participants learned how to use SQuID almost instantly. SQuID's interface is intuitive, and both novices and experts learned how to use it just by observing its behavior. In contrast, when participants did not know how to write certain classes of SQL queries, they simply gave up and mentioned that they could not express their logic in SQL. This is particularly significant considering that all our study participants and interviewees had prior exposure to and experience with SQL, while this was their first experience using SQuID.

Our study results indicate that SQuID effectively helped users with various levels of SQL familiarity perform their tasks faster and more efficiently. However, our work explored only one example of a QBE system, and admittedly with a limited number of participants. Additional work is needed to study the impact of QBE systems further. While our goal was to draw a comparison between traditional and QBE systems, additional studies might investigate how complete novices (users with no SQL expertise) use QBE systems. Furthermore, future studies can expand the list of tasks to better tease apart the impact of using QBE systems for various task types.

From the interviewees' feedback, we extracted a few directions for future work to improve the user experience of QBE systems:

Exposing the synthesized SQL query for explainability. One shortcoming of SQuID is that the user is unaware of the mechanism SQuID uses to generate the results. Under the hood, SQuID synthesizes a SQL query from the user-provided examples, which it uses to produce the results. A possible direction for QBE systems is to expose this SQL query and allow the users to fine-tune the query parameters to suit their specific purposes.

Exposing the internal mechanism for further explainability. In addition to exposing the SQL query, QBE systems can expose the underlying mechanism they use to synthesize the query from the examples.

Interaction with the results for feedback. Another direction of future work is to allow the users to interact with the results produced by the QBE system: the user would accept or reject a few result tuples, which would act as feedback to the system. This would help the QBE system learn the user intent better.

Extensive user study. More extensive user studies are needed in the future to evaluate all these additional features and determine whether they contribute positively to the users' trust and satisfaction in QBE systems.

Our comparative user studies found that database users, with varied levels of prior SQL expertise, are significantly more effective and efficient at a variety of data exploration tasks with SQuID than with the traditional SQL querying mechanism, which requires database schema understanding and manual programming. Our results indicate that SQuID eliminates the barriers of familiarizing oneself with the database schema, formally expressing the semantics of an intended task, and writing syntactically correct SQL queries.
The key take-away of this work is that in a programming-by-example tool like SQuID, even a limited level of domain expertise (knowledge of a subset of the desired data) can substantially help overcome the lack of technical expertise (knowledge of SQL and schema) in data exploration and retrieval. This indicates that programming by example can lead to the democratization of complex computational systems and make these systems accessible to novice users while aiding expert users as well. Our studies validate some prior results over other PBE approaches, but also contribute new empirical insights and suggest future directions for QBE systems to further increase system explainability and user trust.

References
Learning and Verifying Quantified Boolean Queries by Example
DBXplorer: a system for keyword-based search over relational databases
Duoquest: A Dual-Specification System for Expressive SQL Queries
The Complexity of Reverse Engineering Problems for Conjunctive Queries
FlashRelate: extracting relational data from semi-structured spreadsheets using examples
Learning Join Queries from User Examples
Eager: Programming repetitive tasks by example
QPlain: Query by explanation
SPARQLByE: Querying RDF data by example
AIDE: An Active Learning-Based Approach for Interactive Data Exploration
Wrex: A Unified Programming-by-Example Interaction for Synthesizing Readable Code for Data Scientists
YmalDB: Exploring Relational Databases via Result-driven Recommendations
QueRIE: Collaborative Database Exploration
Interactive SQL query suggestion: Making databases user-friendly
SuDocu: Summarizing Documents by Example
Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity
SQuID: Semantic Similarity-Aware Query Intent Discovery
Synthesizing data structure transformations from input-output examples
Music Creation by Example
REQUEST: A scalable framework for interactive construction of exploratory queries
Automating string processing in spreadsheets using input-output examples
Programming by Examples - and its applications in Data Wrangling
Programming by Examples: Applications, Algorithms, and Ambiguity Resolution
Programming by Examples: PL meets ML
Program synthesis
Discovering Neighborhood Pattern Queries by sample answers in knowledge base
Transform-Data-by-Example (TDE): Extensible Data Transformation in Excel
DISCOVER: Keyword Search in Relational Databases
Overview of data exploration techniques
WebRelate: integrating web data with spreadsheets using examples
Querying Knowledge Graphs by Example Entity Tuples
Oracle-guided component-based program synthesis
SnapToQuery: Providing Interactive Feedback during Exploratory Query Specification. PVLDB
Abduction
FastQRE: Fast Query Reverse Engineering
SnipSuggest: Context-Aware Autocompletion for SQL
FlashNormalize: Programming by Examples for Text Normalization
FlashExtract: a framework for data extraction by examples
The Case for a Visual Discovery Assistant: A Holistic Solution for Accelerating Visual Data Exploration
Towards understanding human mistakes of programming by example: an online user study
Query From Examples: An Iterative, Data-Driven Approach to Query Construction
Programming by example (introduction)
Semantic Queries by Example
Most used languages among software developers globally 2020
Recommender system application developments: a survey
User Interaction Models for Disambiguation in Programming by Example
Applications of abduction: knowledge-level modelling
QBEES: query-by-example entity search in semantic knowledge graphs based on maximal aspects, diversity-awareness and relaxation
Reverse Engineering Top-k Join Queries
Democratizing Data Science and Lessons Learned Along the Way
Top-k Spreadsheet-Style Search for Query Discovery. In SIGMOD
Disjunctive Program Synthesis: A Robust Approach to Programming by Example
Programming by Example: Efficient, but Not "Helpful"
Live programming by example
Live Programming By Example
Discovering Queries Based on Example Tuples
Effortless Data Exploration with zenvisage: An Expressive and Interactive Visual Analytics System
ShapeSearch: A Flexible and Efficient System for Shape-based Exploration of Trendlines
REGAL+: Reverse Engineering SPJA Queries
Query Reverse Engineering
Query reverse engineering
Synthesizing Highly Expressive SQL Queries from Input-output Examples
Language-Independent Set Expansion of Named Entities Using the Web
FIDEX: filtering spreadsheet data using examples
Reverse Engineering SPJ-Queries from Examples
A colorful approach to text processing by example
Answering Keyword Queries involving Aggregates and GROUPBY on Relational Databases
Reverse Engineering Complex Join Queries
Automatically synthesizing SQL queries from input-output examples
Entity Set Expansion via Knowledge Graphs