Semantic Parsing of Ambiguous Input through Paraphrasing and Verification

Philip Arthur, Graham Neubig, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
{philip.arthur.om0, neubig, ssakti, tomoki, s-nakamura}@is.naist.jp

Abstract

We propose a new method for semantic parsing of ambiguous and ungrammatical input, such as search queries. We do so by building on an existing semantic parsing framework that uses synchronous context free grammars (SCFG) to jointly model the input sentence and output meaning representation. We generalize this SCFG framework to allow not one, but multiple outputs. Using this formalism, we construct a grammar that takes an ambiguous input string and jointly maps it into both a meaning representation and a natural language paraphrase that is less ambiguous than the original input. This paraphrase can be used to disambiguate the meaning representation via verification using a language model that calculates the probability of each paraphrase.[1]

[1] Tools to replicate our experiments can be found at http://isw3.naist.jp/~philip-a/tacl2015/index.html.

1 Introduction

Semantic parsing (SP) is the problem of parsing a given natural language (NL) sentence into a meaning representation (MR) conducive to further processing by applications. One of the major challenges in SP stems from the fact that NL is rife with ambiguities. For example, even the simple sentence "Where can we eat a steak in Kobe?" contains syntactic ambiguities ("eat in Kobe" or "steak in Kobe"?), quantifier scope ambiguities (do we all eat one steak, or each eat one steak?), and word sense ambiguities (is Kobe a city in Japan, or an NBA basketball player?). Previous work using statistical models along with formalisms such as combinatory categorial grammars, synchronous context free grammars, and dependency-based compositional semantics has shown notable success in resolving these ambiguities (Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Liang et al., 2011; Kwiatkowski et al., 2013).

Much previous work on SP has focused on the case of answering natural language queries to a database of facts, where the queries generally take the form of full sentences such as "What is the height of Kobe Bryant?" While answering these questions provides an excellent first step toward natural language information access, in many cases the input is not a full sentence, but something more underspecified and ungrammatical. For example, this is the case for keyword-based search queries (Sajjad et al., 2012) or short dialogue utterances (Zettlemoyer and Collins, 2007). Taking the specific example of search queries, users tend to omit function words and grammatical constructs to make a more concise query. The first column of Table 1 illustrates several search queries of the pattern "Kobe X" where X is another word. From these queries and their MRs in column two, we can see that there are several kinds of ambiguity, including not only the distinction between Kobe as a city or a basketball player as in the previous example, but also more pernicious problems unique to this more ambiguous input.
Focusing on the queries "Kobe hotels" and "Kobe flight," we can see that it is also necessary to estimate the latent relationship between words, such as "location" or "destination."

Search Query | Meaning Representation            | Paraphrase
Kobe Hotel   | λx (hotel(x) ∧ in(x, kobe city))  | Hotel in Kobe city
Kobe Flight  | λx (flight(x) ∧ to(x, kobe city)) | Flight to Kobe city
Kobe Height  | height(kobe bryant)               | Height of Kobe Bryant

Table 1: Example of a search query, MR, and its paraphrase.

However, it should be noted that if we take the keyword query and re-express it as a more explicit paraphrase, we can reduce this ambiguity to the point where there is only one reasonable interpretation. For example, in the second line, if we add the preposition "to," the user is likely asking for flights arriving in Kobe, and if we add "from," the user is asking for departures.

In this paper, we focus on SP of ambiguous input and propose a new method for dealing with the problem of ambiguity. We propose a framework where an ambiguous input (Column 1 in Table 1) is simultaneously transformed into both its MR (Column 2) and a more explicit, less ambiguous paraphrase (Column 3). The advantage of this method is that it is then possible to verify that the paraphrase indeed expresses the intended meaning of the underspecified input. This verification can be done either manually by the system user or automatically using a probabilistic model trained to judge the naturalness of the paraphrases.

As a concrete approach, we build upon the formalism of synchronous context free grammars (SCFGs). Unlike traditional SCFGs, which usually generate only one target string (in semantic parsing, an MR), we introduce a new variety of SCFG that generates multiple strings on the target side. This allows us to not only generate the MR, but also jointly generate the more explicit paraphrase. We then use a language model over the paraphrases generated by each derivation to help determine which derivations, and consequently which MRs, are more likely.

We perform an evaluation using the standard Geoquery benchmark of 880 query-logic pairs. First we note that the baseline SCFG parser achieves reasonable accuracy on regular questions, but when the same method is used with underspecified input, the system accuracy decreases significantly. On the other hand, when incorporating the proposed tri-synchronous grammar to generate paraphrases and verify them with a language model, we find that it is possible to recover this loss of accuracy, resulting in a model that is able to parse the ambiguous input with significantly better accuracy.

2 Semantic Parsing using Context Free Grammars

As a baseline SP formalism, we follow Wong and Mooney (2006) in casting SP as a problem of translation from a natural language query into its MR. This translation is done using synchronous context free grammars, which we describe in detail in the following sections.

2.1 Synchronous Context Free Grammars

Synchronous context free grammars are a generalization of context-free grammars (CFGs) that generate pairs of related strings instead of single strings. Slightly modifying the notation of Chiang (2007), we can formalize SCFG rules as:

X → ⟨γs, γt⟩    (1)

where X is a non-terminal and γs and γt are strings of terminals and indexed non-terminals on the source and target side of the grammar.
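To make the rule form in Equation (1) concrete, the following is a minimal sketch of one way such synchronous rules could be represented in code. The Python types and the toy rule are illustrative assumptions, not part of the systems described in this paper.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A right-hand-side symbol is either a terminal token (str) or an indexed
# non-terminal such as ("STATE", 1); identical indices link the source and
# target occurrences of the same non-terminal.
Symbol = Union[str, Tuple[str, int]]

@dataclass
class SCFGRule:
    """A synchronous rule X -> <gamma_s, gamma_t> as in Equation (1)."""
    lhs: str               # the head non-terminal X
    source: List[Symbol]   # gamma_s, e.g. NL tokens
    target: List[Symbol]   # gamma_t, e.g. MR tokens

# A purely illustrative semantic-parsing rule mapping an NL fragment to an
# MR fragment: STATE -> <virginia, virginia>.
rule = SCFGRule(lhs="STATE", source=["virginia"], target=["virginia"])
```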
SCFGs have recently come into favor as a tool for statistical machine translation (SMT). In SMT, a synchronous rule could, for example, take the form of:

X → ⟨X0 eats X1, X0 wa X1 wo taberu⟩    (2)

where γs is an English string and γt is a Japanese string. Each non-terminal on the right side is indexed, with non-terminals with identical indices corresponding to each other.

Given the SCFG grammar, we can additionally assign a score to each rule, where higher scored rules are more likely to participate in a derivation. Given the grammar of scored rules and an input sentence S, the highest scoring parse and output sentence T can be calculated using the CKY+ algorithm (Chiang, 2007).

Grammar
r0 QUERY → ⟨give me the CONJ0, answer(x1, CONJ0)⟩
r1 CONJ → ⟨FORM0 FORM1 STATE2, (FORM0, FORM1, const(x2, stateid(STATE2)))⟩
r2 FORM → ⟨cities, city(x1)⟩
r3 FORM → ⟨in, loc(x1, x2)⟩
r4 STATE → ⟨virginia, virginia⟩

Derivation
⟨QUERY0, QUERY0⟩
r0 ⇒ ⟨give me the CONJ1, answer(x1, CONJ1)⟩
r1 ⇒ ⟨give me the FORM2 FORM3 STATE4, answer(x1, (FORM2, FORM3, const(x2, stateid(STATE4))))⟩
r2 ⇒ ⟨give me the cities FORM3 STATE4, answer(x1, (city(x1), FORM3, const(x2, stateid(STATE4))))⟩
r3 ⇒ ⟨give me the cities in STATE4, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(STATE4))))⟩
r4 ⇒ ⟨give me the cities in virginia, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(virginia))))⟩

Figure 1: Example of SP using SCFGs. The left hand and right hand sides are generated simultaneously.

2.2 Semantic Parsing with SCFGs

In the simplest form of SP with SCFGs, γs is used to construct a natural language string S and γt is used to construct the MR T (Wong and Mooney, 2006). Figure 1 shows an example of using an SCFG to simultaneously generate a natural language string and its MR. In this figure, the bold symbols are non-terminals, which can be substituted with other non-terminal productions. A production ends when all of its tokens are terminals. The collection of rules used to generate a particular ⟨S, T⟩ pair is a derivation D = d1, d2, ..., d|D|.

Wong and Mooney (2007) further extended this formalism to λ-SCFGs, which treat γs as the natural language query and γt as an MR based on λ calculus. SCFG rules are automatically learned from pairs of input text and the corresponding MR, where the MR is expressed as a parse tree whose internal nodes are predicates, operators, or quantifiers.

In this paper, we follow Li et al. (2013)'s approach to extract a grammar from this parallel data. In this approach, for each pair, statistical word alignment aligns natural language tokens with the corresponding elements in the MR; then, according to the alignment, minimal rules are extracted with the GHKM algorithm (Galley et al., 2004; Li et al., 2013). Next, up to k minimal rules are composed to form longer rules (Galley et al., 2006), while considering the relationships between logical variables. Finally, unaligned NL tokens are incorporated by attaching them to the highest node in the tree that does not break the consistency of the alignment, as specified in Galley et al. (2006).
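As a worked illustration of how the two sides of Figure 1 are built up in lockstep, the toy sketch below replays that derivation with simple string substitution, rewriting each indexed non-terminal on the source and target sides simultaneously. It is for exposition only and bears no relation to the CKY+ parser used in practice.

```python
def expand(pair, nonterminal, rule):
    """Rewrite one indexed non-terminal on both sides of the derivation at once."""
    src, tgt = pair
    rule_src, rule_tgt = rule
    return src.replace(nonterminal, rule_src), tgt.replace(nonterminal, rule_tgt)

# The rules r0-r4 of Figure 1, with indices already renamed so that every
# non-terminal occurrence in the derivation has a unique name.
pair = ("QUERY0", "QUERY0")
steps = [
    ("QUERY0", ("give me the CONJ1", "answer(x1, CONJ1)")),                   # r0
    ("CONJ1", ("FORM2 FORM3 STATE4",
               "(FORM2, FORM3, const(x2, stateid(STATE4)))")),                # r1
    ("FORM2", ("cities", "city(x1)")),                                        # r2
    ("FORM3", ("in", "loc(x1, x2)")),                                         # r3
    ("STATE4", ("virginia", "virginia")),                                     # r4
]
for nonterminal, rule in steps:
    pair = expand(pair, nonterminal, rule)

print(pair[0])  # give me the cities in virginia
print(pair[1])  # answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(virginia))))
```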
2.3 Additional Rules

While the basic rules extracted above are quite effective in parsing the training data,[2] we found several problems when we attempt to parse unseen queries. To make our parser more robust, we add two additional varieties of rules.

[2] We achieve almost 100% F-measure in closed testing.

First, we add a deletion rule which allows us to delete an arbitrary word w under any head symbol X, formally:

X → ⟨w X, X⟩.    (3)

This rule gives our grammar the option of ignoring words that it does not know what to do with.

In addition, to ensure that all of the facts in the database can be accessed by our semantic parser, we provide some additional SCFG rules based on the given database of facts. The Geoquery dataset provides a database of facts represented as logical assertions. For every assertion provided in the database, we produce a single rule using the function name as the label of the non-terminal and one parameter of the assertion as the terminal, depending on the assertion's type. For example, Geoquery provides some details about the state of Michigan with the form state('michigan',...), and thus we add STATE → ⟨michigan, michigan⟩ as an additional rule in the grammar.

3 Semantic Parsing of Keyword Queries

As explained in Section 1, when users input keyword queries, they often ignore grammatical structure and omit function words. Because of this, a traditional SP model can be problematic. To give a concrete example, consider the synchronous parse in Figure 1. If we try to parse only the keywords (e.g. "cities virginia") with a standard grammar, the parser will not be able to recover the latent relationship "loc(x1, x2)" between the two words. Unfortunately, we lack evidence to recover this relationship, because the token "in" associated with the predicate "loc" will often not occur in a keyword query.

In this work, we perform experiments on this particular variety of ambiguous input, both to examine the effect that it has on parsing accuracy under the baseline model, and to examine whether this sort of ambiguity can be reduced. In order to do so, we need examples of keyword queries. In this work, we simulate the keyword query K by altering the original question S to make it more closely match the style of keyword queries. In particular, following the analysis of Leveling (2010), we make two changes to the original queries: stop word deletion and word order shuffling.

Stop word deletion, as its name implies, simply deletes all stop words from the input sentence. We use a stop word list,[3] making a few subjective changes to make the simulated keyword output more realistic. Specifically, we add "give" and "show," which often occur in statements such as "give me ..." or "show me ..." but are unnatural in keyword queries. We also exclude from the list "us," which often refers to "United States," and function words such as "many," "most," and "much."

[3] http://www.lextek.com/manuals/onix/stopwords2.html

Word order shuffling permutes the order of the keywords remaining after stop word deletion, to simulate the fact that keyword queries often do not have a strict order. First we shuffled the tokens randomly, then had a human annotator fix the order of the keywords manually, making the minimal number of changes necessary to ensure that the queries are natural and fluent. This produced a single keyword query K for each question/MR pair in the Geoquery database, which will be used to train and verify our system. At the end we have a 3-parallel corpus consisting of 880 tuples of keyword query, question, and meaning representation.
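The automatic part of this simulation, stop word deletion followed by a random shuffle (the manual re-ordering pass is not shown), can be sketched as follows. The short stop word list here is an illustrative stand-in for the full list used in the paper.

```python
import random

# Illustrative stop word subset; the paper uses a published list with "give"
# and "show" added and "us", "many", "most", "much" removed.
STOP_WORDS = {"give", "show", "me", "the", "what", "is", "are", "of", "in", "a"}

def simulate_keyword_query(question: str, seed: int = 0) -> str:
    """Delete stop words from a question and shuffle the remaining keywords."""
    keywords = [w for w in question.lower().split() if w not in STOP_WORDS]
    random.Random(seed).shuffle(keywords)
    return " ".join(keywords)

print(simulate_keyword_query("give me the cities in virginia"))
# e.g. "virginia cities"
```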
We should note that while shortening and reordering are prominent features of search queries (Leveling, 2010), these are not the only phenomena distinguishing queries from standard text. For example, humans also tend to replace content words with equivalent, simpler words of their preference (Gurský et al., 2009). While collecting data with such substitutions is beyond the scope of the present work, if a corpus of real keyword inputs and question paraphrases were available, it would in principle be possible for our proposed method to learn from this data as well.

4 Joint Semantic Parsing and Paraphrasing using Tri-Synchronous Grammars

In this section we describe our proposed method to parse underspecified and ungrammatical input while jointly generating a paraphrase that can be used to disambiguate the meaning of the original query.

4.1 Generalized Synchronous Context Free Grammars

Before defining the actual parsing framework, we first present a generalization of SCFGs, the n-synchronous context free grammar (n-SCFG) (Neubig et al., 2015). In an n-SCFG, the elementary structures are rewrite rules with n − 1 target sides:

X → ⟨γ1, γ2, ..., γn⟩    (4)

where X is a non-terminal symbol, γ1 is the source-side string of terminal and non-terminal symbols, and γ2, ..., γn are the target-side strings. At each derivation step, one non-terminal in γ1 is chosen, and all the corresponding non-terminals with the same index in {γ2, ..., γn} are rewritten using a single rule.

4.2 Tri-Synchronous Grammars for Joint Parsing and Paraphrasing

Based on this framework, we propose a model for joint semantic parsing and paraphrasing using tri-synchronous grammars, or 3-SCFGs. In this framework, the input γ1 corresponds to a keyword query K, and the outputs γ2 and γ3 correspond to the paraphrase and MR respectively. An example of jointly generating a keyword query, question, and MR with a 3-SCFG is shown in Figure 2.

Grammar
r0 QUERY → ⟨CONJ0, give me the CONJ0, answer(x1, CONJ0)⟩
r1 CONJ → ⟨FORM0 STATE1, FORM0 in STATE1, (FORM0, loc(x1, x2), const(x2, stateid(STATE1)))⟩
r2 FORM → ⟨cities, cities, city(x1)⟩
r3 STATE → ⟨virginia, virginia, virginia⟩

Derivation
⟨QUERY0, QUERY0, QUERY0⟩
r0 ⇒ ⟨CONJ0, give me the CONJ0, answer(x1, CONJ0)⟩
r1 ⇒ ⟨FORM2 STATE3, give me the FORM2 in STATE3, answer(x1, (FORM2, loc(x1, x2), const(x2, stateid(STATE3))))⟩
r2 ⇒ ⟨cities STATE3, give me the cities in STATE3, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(STATE3))))⟩
r3 ⇒ ⟨cities virginia, give me the cities in virginia, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(virginia))))⟩

Figure 2: An example of 3-SCFG rules and productions. Here there are two target sides: one is the paraphrase and the other is the MR.

In this work, we construct the tri-synchronous grammar by transforming the basic SCFG for semantic parsing G into a 3-SCFG. Specifically, we first assume that the source question γs and target MR γt of the original SCFG become the two outputs γ2 and γ3 of the new 3-SCFG grammar. γ1 is the newly added keyword query input.

During model training, we first extract rules consisting of γ2 and γ3 using the algorithm in Section 2.2, then generate γ1 from γ2 by first deleting the stop words and then rearranging the order of the remaining words based on the word alignments between the keyword query and the original question. This is done by assigning each word in K a range of words in S to which it is aligned, then sorting the words in γ1 in ascending order of these ranges. It is possible for some words in K to have no alignment in S; such rules are filtered out. Finally, we use the tuple ⟨γ1, γ2, γ3⟩ to form rules in our tri-synchronous grammar.
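A minimal sketch of the γ1 construction just described, under the assumption that the alignment is given as a mapping from each token to the span it aligns to: delete stop words, sort the survivors by their aligned ranges, and discard the rule if any remaining word is unaligned. The data layout and function name are illustrative.

```python
from typing import Dict, List, Optional, Tuple

def keyword_side(gamma_2: List[str],
                 stop_words: set,
                 align: Dict[str, Tuple[int, int]]) -> Optional[List[str]]:
    """Derive gamma_1 (the keyword side of a rule) from gamma_2 (the question side).

    align maps a token to the (start, end) range of positions it is aligned to;
    returning None marks the rule as filtered out.
    """
    kept = [w for w in gamma_2 if w not in stop_words]
    if any(w not in align for w in kept):
        return None                      # an unaligned keyword: drop the rule
    return sorted(kept, key=lambda w: align[w])

print(keyword_side(["the", "cities", "in", "virginia"],
                   {"the", "in"},
                   {"cities": (1, 1), "virginia": (3, 3)}))
# ['cities', 'virginia']
```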
Because of the stop word deletion, we may find that some rules have an empty source side, and consequently cannot be used in an SCFG. For example, in r3 in Figure 1, "in" is in the stop word list, and thus will be deleted from the source side, leaving it empty. In order to solve this problem, we compose all rules with empty inputs together with their parent rule. It should be noted that this introduces a large amount of ambiguity into the grammar, as the content represented by the deleted content word must now be generated essentially out of thin air, based only on its parent context.

4.3 Integrating Language Models with Tri-SCFGs

When using SCFGs for machine translation, the power of language models (LMs) to improve translation accuracy is widely acknowledged. The LM ensures fluent SMT output by assigning a probability to the target sentence. In the case of n-gram language models, this probability is defined as:

pLM(W) = ∏_{i=1}^{l} p(wi | wi−1, wi−2, ..., wi−n+1)    (5)

where the probability of a sentence W of length l is calculated as the product of the probabilities of its words, each conditioned on the previous n − 1 words. Integrating these language models makes the search space larger, precluding the use of the full CKY-style parsing algorithm, but efficient approximate search algorithms such as cube pruning (Chiang, 2007) or incremental search (Heafield et al., 2013) can help ameliorate this problem.

We could also consider constructing a probabilistic LM over the MR T for semantic parsing. However, constructing a language model for the MR is less straightforward for several reasons. First, the order of the words of an MR within the same rooted logical tree does not make a difference to the final result (e.g. for a commutative operator node). Second, while language models for natural text benefit from the large amounts of text data available on the web, obtaining correct MRs to train a model is less trivial.

On the other hand, in our tri-synchronous grammar framework, in addition to the MR itself, we are generating a paraphrase that holds some disambiguating power over the MR, as described in Section 1. The naturalness of this paraphrase output, like the output of an MT system, can easily be judged by a language model, and might have some correlation with the naturalness of the MR itself. Thus, in this work we add a language model over the paraphrase output as a feature of the scoring model described in the next section.

5 Parse Scoring

Given this SCFG-based parsing model, we must now assign a score that decides which derivations are better or worse than others.

5.1 Scoring Function

Our scoring function is a standard log-linear model with feature functions defined over ⟨K, S, T, D⟩ tuples:

score(K, S, T, D) = w · Φ(K, S, T, D)    (6)

where Φ(K, S, T, D) is a vector of feature functions and w is the weight vector.

5.2 Features

For the baseline model, our feature vector Φ(K, S, T, D) is simply defined as the element-wise sum of the feature vectors for each rule in the derivation:

Φ(K, S, T, D) = Σ_{d∈D} Φ(d)    (7)

where d takes the form in Equation (4). We score each basic rule using features widely used in translation, as follows:

• Forward Probability: The log probability of the source side given all the target sides, p(γ1|γ2, ..., γn), calculated from rule counts in the training corpus as c(γ1, ..., γn)/c(γ2, ..., γn).
• Backward Probability: Similarly, the log probability of all the target sides given the source side, p(γ2, ..., γn|γ1).
• Joint Probability: The log probability of the source and target sides, p(γ1, γ2, ..., γn).
• Terminal Rule: Equal to one if there is no non-terminal symbol in the rule. This feature is useful for deciding whether the model prefers entirely lexicalized rules.
• Deletion: Binary feature for deletion rules.
• Knowledge Base Rule: Binary feature for rules produced from the knowledge base.

For the proposed tri-synchronous grammar with LM verification, we additionally add three features defined over the generated paraphrase:

• Language Model: The log language model probability of the paraphrase.
• Unknown: The number of tokens in the paraphrase that are unknown to the language model.
• Paraphrase Length: The number of words in the paraphrase, which can be calculated for each rule as the number of terminals in the paraphrase. This feature helps compensate for the fact that language models prefer shorter sentences.
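To tie Equations (5)–(7) together, the following sketch sums per-rule feature vectors over a derivation, adds toy paraphrase features (an unsmoothed bigram log probability standing in for Equation (5), plus a length feature), and combines them with a weight vector as in Equation (6). It is a simplified illustration, not the Travatar-based implementation used in the experiments.

```python
import math
from collections import defaultdict
from typing import Dict, List

def ngram_logprob(sentence: List[str], counts: Dict[tuple, Dict[str, int]],
                  order: int = 2) -> float:
    """Toy, unsmoothed version of Equation (5) in log space."""
    logp, padded = 0.0, ["<s>"] * (order - 1) + sentence
    for i in range(order - 1, len(padded)):
        context, word = tuple(padded[i - order + 1:i]), padded[i]
        logp += math.log(counts[context][word] / sum(counts[context].values()))
    return logp

def derivation_score(rule_features: List[Dict[str, float]],
                     paraphrase: List[str],
                     weights: Dict[str, float],
                     counts: Dict[tuple, Dict[str, int]]) -> float:
    """Equations (6) and (7): sum rule features, then add paraphrase LM features."""
    phi: Dict[str, float] = defaultdict(float)
    for feats in rule_features:                 # Phi(K,S,T,D) = sum over rules d
        for name, value in feats.items():
            phi[name] += value
    phi["language_model"] = ngram_logprob(paraphrase, counts)
    phi["paraphrase_length"] = float(len(paraphrase))
    return sum(weights.get(name, 0.0) * value for name, value in phi.items())
```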
5.3 Learning Feature Weights

Now that we have defined the feature space, we need to optimize the weights. For this we use minimum error rate training (MERT) (Och and Ney, 2003), maximizing the number of correct answers over the entire corpus.[4]

[4] We also tried gradient-based optimization methods and large feature sets as in Wong and Mooney (2007) and Li et al. (2013), but the dense feature set and MERT achieved similar results with shorter training time.
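As a concrete illustration of the probability features of Section 5.2, the sketch below estimates them by relative frequency over counts of extracted rule tuples ⟨γ1, γ2, γ3⟩. The count table is a toy stand-in.

```python
import math
from collections import Counter

# Toy counts over extracted rule tuples (gamma_1, gamma_2, gamma_3).
rule_counts = Counter({
    ("cities", "cities", "city(x1)"): 40,
    ("cities", "towns", "city(x1)"): 10,
})

def probability_features(rule):
    """Forward, backward, and joint log probabilities as in Section 5.2."""
    g1, g2, g3 = rule
    joint = rule_counts[rule]
    total = sum(rule_counts.values())
    target_count = sum(c for (a, b, d), c in rule_counts.items() if (b, d) == (g2, g3))
    source_count = sum(c for (a, _, _), c in rule_counts.items() if a == g1)
    return {
        "forward": math.log(joint / target_count),   # p(g1 | g2, ..., gn)
        "backward": math.log(joint / source_count),  # p(g2, ..., gn | g1)
        "joint": math.log(joint / total),            # p(g1, g2, ..., gn)
    }

print(probability_features(("cities", "cities", "city(x1)")))
```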
6 Experiment and Analysis

We evaluate our system using the Geoquery corpus (Zelle and Mooney, 1996), which contains 880 sentences representing natural language questions about U.S. geography, together with their corresponding MRs.

6.1 Setup

• Data: We use the full Geoquery dataset with the same 10 folds of 792 training and 88 test examples used by Wong and Mooney (2007). We created keyword queries according to the process described in Section 3. We follow the standard procedure of removing punctuation from all natural language text, regardless of whether it is a keyword or a full question. We also perform stemming on all natural language text, both in the keyword and question queries.

• Rule Extraction: Alignment is performed by pialign (Neubig et al., 2011) with the setting forcing one-to-many alignments. The algorithm to extract the tri-synchronous grammar is as discussed in Section 4.2, and the maximum size of the rules for composition is 4.

• Decoding: To query the database, we use prolog queries fired against the Geoquery database. The parsing problem can thus be considered the task of decoding from underspecified natural language queries into prolog queries. This is done by decoding with the SCFG-based parsing model to translate the input query into an MR including λ calculus expressions, performing β-reduction to remove the λ functions, and then firing the query against the database. Before querying the database, we also apply Wong and Mooney (2007)'s type-checking to ensure that all MRs are logically valid. For parsing, we implemented CKY-based parsing of tri-synchronous grammars on top of the Travatar (Neubig, 2013) decoder. Unless otherwise specified, the default settings of the decoder are used.

• Language Model: For all 3-SCFG systems we use a 4-gram Kneser-Ney smoothed language model trained using the KenLM toolkit (Heafield, 2011). Standard preprocessing such as lowercasing and tokenization is performed before training the models. As it is of interest whether or not the type of data used to train the language model affects the resulting performance, we build language models on several types of data. First, we use a corpus of news data from the Workshop on Machine Translation evaluation data (Callison-Burch et al., 2011) (News). This data represents standard English text unrelated to questions. Second, we use a part of the question paraphrase data gathered by Fader et al. (2013) (Questions).[5] This data consists entirely of questions, and is thus a better representative of the latent questions behind the input queries. Finally, we use the full questions from the Geoquery sentences to build a language model, building a different language model for each fold, completely separate from the test set. Table 2 gives the details of each dataset.

[5] We use only data set 0 from the 30 sets released at http://knowitall.cs.washington.edu/afader/paraphrases.

Data      | Sent. | Tok.  | LM Size
News      | 44.0M | 891M  | 5.5G
Questions | 20.2M | 174M  | 1.5G
Geoquery  | 792   | ∼1.6K | ∼96K

Table 2: Details of the data used to build LMs.

In addition, because the Geoquery data is useful but small, for all 3-SCFG systems we perform experiments using an additional 4-gram feed-forward neural network language model (NNLM) (Bengio et al., 2003) feature, which is possibly better equipped to handle sparse data than standard n-grams. The NNLM is built on the Geoquery sentences, excluding the test sentences for each fold. This feature is not produced during parsing, but is separately scored and used to re-rank the n-best list generated by the parser. Integration with the paraphrase n-gram language model is performed using incremental search (Heafield et al., 2013). For parsing with the NNLM, we recalculate the scores of the paraphrases by adding the NNLM score as one more feature in Equation (6) and taking the parse with the best score.

• Parameter Optimization: For learning the parameters of the scoring function we use 10-fold cross validation on the training data, i.e. each fold iteration uses a model trained on 712 examples to parse the remaining 79. First we run decoding for all folds and gather the results. Then we run MERT on the combined results to update the parameters. We use the standard evaluation measure of question answering accuracy as our objective function and set the n-best list to the top 300 derivations. To learn the weights for rescoring with the NNLM, we first generate an n-best list with the base model, not using the NNLM feature. We then calculate the NNLM feature for each hypothesis in the n-best list, and run one more round of MERT with this feature to obtain the weights used in the rescoring model.

• Evaluation: Following the definitions of Zettlemoyer and Collins (2005) and Wong and Mooney (2007), we use question answering accuracy as our evaluation measure. We define recall as the number of correct answers divided by the number of test examples, precision as the number of correct answers divided by the number of parsed queries, and F-measure as the harmonic mean of the two. A query is judged to be parsed if and only if the SCFG can generate a valid parse tree and the resulting query does not produce any syntax errors when accessing the database through a prolog query. Note that all 880 questions are used for testing through cross validation, so a recall improvement of 0.001 is approximately equal to answering one more question correctly.
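The n-best rescoring with the NNLM feature described in the Language Model setting above can be sketched as follows: every hypothesis's feature vector gains one extra NNLM log-probability feature, and the highest-scoring hypothesis under the re-tuned weights (Equation (6)) is returned. The hypothesis layout and the nnlm_logprob callable are assumptions for illustration.

```python
from typing import Callable, Dict, List

Hypothesis = Dict[str, object]   # holds "paraphrase" (tokens), "mr", "features"

def rerank_with_nnlm(nbest: List[Hypothesis],
                     weights: Dict[str, float],
                     nnlm_logprob: Callable[[List[str]], float]) -> Hypothesis:
    """Re-pick the 1-best hypothesis after adding an NNLM feature to each one."""
    def rescore(hyp: Hypothesis) -> float:
        feats = dict(hyp["features"])                     # copy baseline features
        feats["nnlm"] = nnlm_logprob(hyp["paraphrase"])   # extra rescoring feature
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return max(nbest, key=rescore)
```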
6.2 Parsing Accuracy Results

Input    | Method | P    | R    | F
Question | Direct | .880 | .878 | .879
Keyword  | Direct | .792 | .790 | .791
Keyword  | Tri-LM | .804 | .790 | .797
Keyword  | Tri+LM | .830 | .820 | .827

Table 3: Parsing accuracy, where Keyword Direct is the baseline for semantic parsing on keyword queries, and Tri with the LM for verification is our proposed method. Bold indicates a significant gain over both Direct and Tri-LM for keyword input according to bootstrap resampling (Koehn, 2004) (p < 0.05).

First, in this section, we examine the effect of the proposed method on the accuracy of parsing ambiguous keyword queries. Specifically, in Table 3 we show the baseline "Direct" method of training a standard SCFG-based semantic parser, the proposed method without language model verification "Tri-LM," and the proposed method using the Questions language model with NNLM reranking "Tri+LM."

Looking at the baseline accuracy over full questions (first row), we can see the recall is slightly superior to Wong and Mooney (2007)'s 86.6% and Li et al. (2013)'s 87.6%, demonstrating that our baseline is comparable to previous work. When we apply the same method to parse the keyword queries (second row), however, the recall drops by almost 9%, showing that the ambiguity in the keyword query input causes large decreases in the accuracy of a semantic parser built according to the baseline method. This ambiguity is also reflected in the number of MRs that the parser can generate for any particular input: in the top 300 lists generated by each parser, there were on average 16.54 and 36.77 unique MRs for question and keyword input respectively.

If we now look at the 3-SCFG without LM verification (third row), we can see the results are similar to the baseline. Then, when adding the language model to the 3-SCFG system (fourth row), we see a significant gain of 3-4% over the Direct and Tri-LM systems, demonstrating that the proposed method of paraphrasing and verification is indeed able to resolve some of the ambiguity in the keyword queries.

To illustrate how the language model helps, we provide two examples in Table 4. The first example shows that considering the original question when parsing from keywords can help improve alignment with the MR for more plausible results. The second example shows the effect of adding the language model to disambiguate the keyword query. Here there are several interpretations of the keyword query "largest capital state," which can mean "state that has the largest capital" or "largest state in the capital." The system without the language model incorrectly chooses the latter interpretation, but the system with the language model correctly disambiguates the query, as it considers the phrase "state in capital" unlikely, showing the effectiveness of our method.

Ex. 1
  Direct: answer(A,(capital(A),loc(A,B),largest(C,population(B,C)))) — no
  Tri-LM: answer(A,largest(B,(capital(A),population(A,B)))) — yes
  Tri+LM: what capital has the largest population
  Original Question: what capital has the largest population
  Original MR: answer(A,largest(B,(capital(A),population(A,B))))
  Keyword: largest population capital

Ex. 2
  Direct(-): answer(A,largest(A,(capital(A),city(A),loc(A,B),state(B)))) — no
  Tri-LM: answer(A,largest(A,(state(A),loc(A,B),capital(B)))) — no
    what is the largest state in capital
  Tri+LM: answer(A,(state(A),loc(B,A),largest(B,capital(B)))) — yes
    what state has the largest capital
  Original Question: what state has the largest capital
  Original MR: answer(A,(state(A),loc(B,A),largest(B,capital(B))))
  Keyword: largest capital state

Table 4: Examples of paraphrase outputs produced by the direct keyword-MR system, and the proposed systems without and with a language model.

6.3 Analysis

We first examine the effect of the choice of language model in the first two columns of Table 5.
The first column is the full model with NNLM re-ranking, and the second column is without it. The rows show the effect of using different data to train the n-gram LM. All the systems using LMs are generally better than the system using neither an n-gram LM nor the NNLM. Looking at the differences between the n-gram LMs, we can see that the Questions LM tends to be the most effective. This is particularly encouraging, as the Questions language model does not contain any domain-specific content but is still able to outperform the domain-specific Geoquery LM. We also found that, as expected, the more sophisticated neural network language model raises the system accuracy by approximately 2%, which further supports our idea that a better LM raises system accuracy.

Input: Keyword, Output: Q+MR

LM    |  Full (P / R / F)    |  -NNLM (P / R / F)  |  -Empty (P / R / F)
-     |  .828  .813  .821    |  .804  .790  .797   |  .820  .784  .801
News  |  .823  .814  .819    |  .806  .797  .802   |  .817  .786  .802
Quest |  .830† .820† .827†   |  .809  .804  .806   |  .806  .780  .793
Geo   |  .821  .812  .817    |  .804  .794  .799   |  .824  .794  .808

Table 5: The results of the experiment with and without the neural network language model for the proposed 3-SCFG framework. The Questions LM with the NNLM achieved the best accuracy. Bold indicates a significant gain over the baseline Direct Keyword system (second row of Table 3) and a dagger indicates a significant gain over the 3-SCFG baseline without a language model (-NNLM column, first row). The Full and -Empty columns use the NNLM as the language model. The first row of the -NNLM column is the experiment without any language model.

The proposed method aims at reducing nonsensical interpretations, and another trivial baseline that can achieve a similar effect is to filter out the queries that produce empty answers, under the assumption that empty answers are generated by invalid queries. This simple filtering method reduced the average number of unique queries to 11.74 for questions and 20.16 for keywords. However, as shown in the "-Empty" results in Table 5, we found that this filtering method is not effective, causing the system's performance to drop by around 2%. This is caused by the fact that the correct answer is sometimes an empty answer, for example for "what states border hawaii?"
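The empty-answer filtering baseline examined above amounts to the following; the execute callable standing in for the Prolog back end is an assumption for illustration.

```python
from typing import Callable, List

def filter_empty_answers(nbest_mrs: List[str],
                         execute: Callable[[str], list]) -> List[str]:
    """Drop candidate MRs whose execution against the database returns nothing.

    As noted above, this can discard correct parses whose true answer is empty,
    which is why the filter hurts accuracy in Table 5.
    """
    return [mr for mr in nbest_mrs if execute(mr)]
```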
6.4 Human Evaluation

While all evaluation up to this point has used language models to disambiguate paraphrases, we can assume that human users will be even better at judging whether or not a paraphrase makes sense. Thus, we perform an additional evaluation in which human annotators evaluate the paraphrases generated by the systems. First, we took the 1-best parse and 7 random parses from the Tri+LM and Tri-LM systems for inputs where both systems produced a non-empty n-best list. Then we showed both the keyword queries and all the paraphrases to human evaluators to annotate: i) a fluency score of 0, 1, or 2, where 0 is completely unnatural English, 1 indicates minor grammatical errors, and 2 indicates flawless English; ii) a letter starting from "A," "B," etc. for the paraphrase that matches their preferred interpretation of the search query.[6] If the input has multiple interpretations, then a different letter is assigned for each possible interpretation, in the order in which the annotator believes that interpretation to be the correct one, and only one paraphrase is chosen per interpretation. If the human annotator does not find a paraphrase that matches his or her preferred interpretation, the letter "A" is not assigned and annotation starts from "B." Three annotators were each asked to annotate 300 keyword queries and their paraphrases. There are a total of 866 keyword queries (out of 880) that produced a non-empty n-best list in both systems, so we chose random duplications of 34 inputs to bring the total to 900.

[6] Here the letters are simply indicators of ranking, with the letter "A" marking the most plausible interpretation of the search query according to the user.

System       | Precision
Tri-LM       | .803
Tri+LM       | .834
Tri+LM+Human | .846

Table 6: System precision with additional human help.

Table 6 shows the improvement of the system with human help. We take all the parses whose paraphrases were annotated with "A" and use them to replace the answers of the Tri+LM system. Overall, there were 35 questions whose answers changed between the 1-best and the human choices, with 23 improving and 12 degrading accuracy. This experiment suggests that it is possible to show the generated paraphrases to human users to improve the accuracy of the semantic parser.

System | Fluency | Ratio | Precision
Tri-LM | 0       | .163  | .415
       | 1       | .425  | .819
       | 2       | .411  | .940
Tri+LM | 0       | .083  | .367
       | 1       | .372  | .811
       | 2       | .544  | .918

Table 7: Fluency, ratio, and precision statistics for the one-best outputs of both systems.

We now look at the relationship between the fluency of the paraphrase and the accuracy of the semantic parsers in Table 7. The statistics are gathered from the one-best output of both systems. Tri+LM had a significantly larger percentage of fluent paraphrases with score "2" (54% vs. 41%) compared to the system without the language model. Of the paraphrases that were assigned a score of "2," 91% corresponded to correct MRs, indicating that the subjective fluency of the paraphrase is a good indicator of parsing accuracy.

Letter | System | Count | Total | Precision
A      | Tri-LM | 547   | 721   | .919
       | Tri+LM | 674   |       | .902
B      | Tri-LM | 308   | 452   | .772
       | Tri+LM | 340   |       | .752
C      | Tri-LM | 57    | 94    | .631
       | Tri+LM | 52    |       | .557
D      | Tri-LM | 7     | 13    | .428
       | Tri+LM | 7     |       | .142

Table 8: Results for letter accuracy from the human evaluation. Note that the counts do not sum to the totals because both systems may generate the same paraphrase.

Finally, Table 8 shows the relationship between the rank of the human interpretation and the accuracy of semantic parsing. Out of the 900 problems shown to the annotators, 721 were ranked "A." This experiment showed that the interpretation of the paraphrase judged as most likely by the annotators achieves high precision, confirming our hypothesis that humans are able to use paraphrases to accurately judge whether an interpretation is likely to be correct or not.

6.5 Other Methods for Using Paraphrase Data

In addition to the method described up to this point, there are several other ways to potentially incorporate paraphrasing into semantic parsing of underspecified input. In this section we briefly outline two other (unsuccessful) attempts to do so: creation of a pipelined paraphrasing/semantic parsing system, and addition of features from a large paraphrase database.

First, regarding the pipelined system, we build the paraphrasing system using the parallel keyword-question data, with standard settings of hierarchical phrase-based translation (Chiang, 2007) and standard SMT features. We use the Geoquery n-gram model as the language model during decoding and the NNLM language model to rerank the final n-best list. In our experiments, even though this system obtained a respectable BLEU score of 57.5, the parsing accuracies were much lower than those of the direct keyword-MR system, at 64.8 F-measure.
An analysis showed that, perhaps as expected, this was caused by cascading errors, with unnatural paraphrases also resulting in failed semantic parses.

In addition, we also attempted to use the external Questions data to calculate additional features for our Tri+LM system. We do this by first simulating the keyword version of each sentence in the Questions data by performing shuffling and stop word deletion.[7] Next, we train a hierarchical phrase-based system on this data to create a paraphrasing model. We then intersect this model with our existing model by matching the source side and the target side of the rules and, if they match, taking the union of the feature sets. Unfortunately, this setting also did not yield a gain in accuracy, likely due to the low recall (15%) of the matching between the paraphrasing grammar and the semantic parsing rules. This low recall stemmed from a number of factors, including restrictions on the standard Hiero paraphrasing grammars (no more than 2 non-terminals, no consecutive non-terminals on the source side, and no rules without at least one terminal), as well as a simple lack of coverage of the words in the paraphrase database. This result does indicate room for improvement by developing algorithms that extract paraphrases that are closely related to the semantic parsing rules, but it also suggests potential difficulties in simply applying paraphrasing resources such as PPDB (Ganitkevitch et al., 2013).

[7] Because the shuffling process is random, we could conceivably generate and train with multiple shuffled versions, but because the Questions data is relatively large already, we only train the paraphrasing system with the single permutation of keywords generated by the shuffling.
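The grammar intersection attempted above can be sketched roughly as follows: paraphrasing rules and tri-synchronous parsing rules are matched on their shared sides, and the feature sets of matching rules are merged. The dictionary-keyed rule representation is a simplified assumption.

```python
from typing import Dict, Tuple

# Rules are keyed by their (keyword_side, question_side) string pair and carry
# a dictionary of feature values.
RuleTable = Dict[Tuple[str, str], Dict[str, float]]

def intersect(parsing_rules: RuleTable, paraphrase_rules: RuleTable) -> RuleTable:
    """Keep the parsing rules, adding paraphrase features where the sides match."""
    merged: RuleTable = {}
    for key, features in parsing_rules.items():
        merged[key] = dict(features)
        if key in paraphrase_rules:                    # source and target sides match
            merged[key].update(paraphrase_rules[key])  # union of the feature sets
    return merged
```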
7 Related Work

Interpretation of search queries is a major concern in the field of information retrieval (IR), as it can affect the choice of retrieved documents. Underspecified queries are commonly entered into search engines, leading to large result sets that are difficult for users to navigate (Sajjad et al., 2012). Studies have shown that there are several ways to deal with this problem, including query reformulation, which can fall into the categories of query expansion or query substitution (Shokouhi et al., 2014; Xue and Croft, 2013). Leveling (2010) proposed a paraphrasing method that tries to reconstruct original questions given keyword inputs in the IR context, but did not model this reformulation together with semantic parsing. In addition, Wang et al. (2013) showed that paraphrasing the queries for web search is able to reduce the mismatch between queries and documents, resulting in a gain in search accuracy.

Using paraphrasing to resolve ambiguity is not new, as it has been used to resolve ambiguity interactively with a user's input (McKeown, 1983). Ge and Mooney (2009) and Miller et al. (1994) have also used the guidance of natural language syntax for semantic parsing. However, using natural language syntax for semantic parsing of keyword queries is not trivial. For example, the approach of Ge and Mooney (2009), which uses the syntax tree of the input side, cannot be directly applied to keyword queries, as syntactically parsing a keyword query is itself a non-trivial problem.

There have also been a few methods proposed to combine paraphrasing with semantic parsing. Fader et al. (2013) proposed a method to map from full questions to more canonical forms of these questions, with the canonical NL questions being trivially convertible to an MR. Berant and Liang (2014) extract entities from a full-text question, map these entities into a set of candidate MRs, and generate canonical utterances accordingly. Then the canonical utterance that best paraphrases the input is chosen, thereby outputting the corresponding MR. Our approach is similar but orthogonal to these works in that we focus on situations where the original user input is underspecified, and try to generate a natural language paraphrase that more explicitly states the user intention for disambiguation purposes. A second difference is that we do not use a separate model to do paraphrasing, instead using the same model to do paraphrasing and semantic parsing synchronously. This has the advantage of being able to scale more easily to complicated and highly compositional questions such as the ones found in Geoquery.

In addition to being useful for semantic parsing, SCFGs have also been used for paraphrasing. A variety of research has used SCFG-based paraphrases for text-to-text generation tasks like sentence compression (Cohn and Lapata, 2009; Ganitkevitch et al., 2011), or expanding the set of reference translations for machine translation evaluation (Madnani et al., 2007). In this paper we have introduced a novel use of 3-way SCFGs that allows us to simultaneously perform semantic parsing and text-to-text generation. To our knowledge, this is the first method to parse an underspecified input by reconstructing a more explicit paraphrase of the input and validating the naturalness of the paraphrase to disambiguate the meaning of the original input.

8 Conclusion and Future Work

In this paper we introduced a method for constructing a semantic parser for ambiguous input that paraphrases the ambiguous input into a more explicit form and verifies its correctness using a language model. We do so through a generalization of synchronous context free grammars that allows for the generation of multiple output strings at one time. An evaluation showed that our method is effective in helping compensate for the 9% loss of system accuracy due to the ambiguity of the keyword queries, providing a 3% improvement. Human evaluation also confirmed that manually evaluating the paraphrases generated by our framework can improve the accuracy of the semantic parser further.

There are a number of future directions for this study. First, we plan to scale the proposed method to open domain semantic parsing of search queries over extensive knowledge bases such as Freebase (Bollacker, 2007). In addition, previous works have tackled semantic parsing directly from question and answer pairs (Liang et al., 2011; Poon and Domingos, 2009; Artzi and Zettlemoyer, 2011). The idea of learning from unannotated data is attractive, and incorporating this learning framework into our model is a promising direction for future work.

Acknowledgments

We thank Professor Yuji Matsumoto and Assistant Professor Kevin Duh for the discussions and insightful ideas for the paper. We also thank Professor Chris Callison-Burch and the anonymous reviewers for their suggestions and discussions. This project is supported by grants from the Ministry of Education, Culture, Sports, Science, and Technology of Japan and from the Microsoft CORE program.

References

Yoav Artzi and Luke Zettlemoyer. 2011.
Bootstrapping semantic parsers from conversations. In Proceedings of the 2011 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pages 421–432. Yoshua Bengio, Ducharme Réjean, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic lan- guage model. The Journal of Machine Learning Re- search, 3:1137–1155. Jonathan Berant and Percy Liang. 2014. Semantic pars- ing via paraphrasing. In Proceedings of the 52th An- nual Meeting of the Association for Computational Linguistics (ACL), pages 1415–1425. Kurt Bollacker. 2007. A platform for scalable, collabo- rative, structured information integration. In Proceed- ings of the 22nd Association ofr Advancement of Arti- ficial Intelligence, pages 22–27. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F Zaidan. 2011. Findings of the 2011 work- shop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Transla- tion, pages 22–64. David Chiang. 2007. Hierarchical phrase-based transla- tion. Computational Linguistics, (2):201–228. Trevor Cohn and Mirella Lapata. 2009. Sentence com- pression as tree transduction. Journal of Artificial In- telligence Research (JAIR), 34:637–674. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51th Annual Meet- ing of the Association for Computational Linguistics (ACL), pages 1608–1618. Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What’s in a translation rule? In Proceedings of the 2004 Human Language Technol- ogy Conference of the North American Chapter of the Association for Computational Linguistics (HLT- NAACL), pages 273–280. Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceed- ings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL), pages 961–968. Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme. 2011. Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 1168–1179. Asso- ciation for Computational Linguistics. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pages 758–764, Atlanta, Georgia, June. As- sociation for Computational Linguistics. Ruifang Ge and Raymond J Mooney. 2009. Learning a compositional semantic parser using an existing syn- tactic parser. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 611–619. Peter Gurský, Tomás Horváth, Peter Vojtás, Jozef Jirásek, Stanislav Krajci, Robert Novotny, Jana Pribolová, and Veronika Vaneková. 2009. User preference web search – experiments with a system connecting web and user. Computing and Informatics, 28(4):515–553. Kenneth Heafield, Philipp Koehn, and Alon Lavie. 2013. Grouping language model boundary words to speed k- best extraction from hypergraphs. 
In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, pages 958–968. Kenneth Heafield. 2011. KenLM: faster and smaller lan- guage model queries. In Proceedings of the 2011 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 187–197. Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP). Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1545–1556. Johannes Leveling. 2010. A comparative analysis: QA evaluation questions versus real-world queries. In Pro- ceedings of the 7th International Conference on Lan- guage Resources and Evaluation (LREC). Peng Li, Yang Liu, and Maosong Sun. 2013. An ex- tended GHKM algorithm for inducing lambda-SCFG. In Proceedings of the 27th Association for Advance- ment of Artificial Intelligence, pages 605–611. Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the As- sociation for Computational Linguistics (ACL), pages 590–599. Nitin Madnani, Necip Fazil Ayan, Philip Resnik, and Bonnie Dorr. 2007. Using paraphrases for parameter tuning in statistical machine translation. In Proceed- ings of the Workshop on Statistical Machine Transla- tion (WMT07), Prague, Czech Republic. Kathleen R. McKeown. 1983. Paraphrasing questions using given and new information. Computational Lin- guistics, 9(1):1–10. Scott Miller, Robert Bobrow, Robert Ingria, and Richard Schwartz. 1994. Hidden understanding models of natural language. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguis- tics (ACL), pages 25–32. Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shin- suke Mori, and Tatsuya Kawahara. 2011. An unsuper- vised model for joint phrase alignment and extraction. In Proceedings of the 49th Annual Meeting of the As- sociation for Computational Linguistics: Human Lan- guage Technologies (HLT-ACL), pages 632–641. Graham Neubig, Philip Arthur, and Kevin Duh. 2015. Multi-target machine translation with multi- synchronous context-free grammars. In Meeting of the North American Chapter of the Association for Com- putational Linguistics (NAACL), Denver, USA, May. Graham Neubig. 2013. Travatar: A forest-to-string ma- chine translation engine based on tree transducers. In ACL (Conference System Demonstrations), pages 91– 96. Franz Josef Och and Hermann Ney. 2003. A system- atic comparison of various statistical alignment mod- els. Computational Linguistics, (1):19–51. Hoifung Poon and Pedro Domingos. 2009. Unsuper- vised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 1–10. Hassan Sajjad, Patrick Pantel, and Michael Gamon. 2012. Underspecified query refinement via natural language question generation. In Proceedings of the 24th International Conference on Computational Lin- guistics (COLING), pages 2341–2356. Milad Shokouhi, Rosie Jones, Umut Ozertem, Karthik Raghunathan, and Fernando Diaz. 2014. Mobile query reformulations. 
In Proceedings of the 37th In- ternational ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1011– 1014. Chenguang Wang, Nan Duan, Ming Zhou, and Ming Zhang. 2013. Paraphrasing adaptation for web search ranking. In Proceedings of the 51th Annual Meeting of the Association for Computational Linguistics (ACL), pages 41–46. Yuk Wah Wong and Raymond J Mooney. 2006. Learn- ing for semantic parsing with statistical machine trans- lation. In Proceedings of the 2006 Human Language Technology Conference of the North American Chap- ter of the Association for Computational Linguistics (HLT-NAACL), pages 439–446. Yuk Wah Wong and Raymond J Mooney. 2007. Learn- ing synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguis- tics (ACL), number 1, pages 960–967. Xiaobing Xue and W. Bruce Croft. 2013. Modeling re- formulation using query distributions. ACM Transac- tion on Information Systems, (2):6:1–6:34. John M. Zelle and Raymond J. Mooney. 1996. Learn- ing to parse database queries using inductive logic pro- gramming. In Proceedings of the 13th National Con- ference on Artificial Intelligence, pages 1050–1055. Luke S Zettlemoyer and Michael Collins. 2005. Learn- ing to map sentences to logical form: Structured clas- sification with probabilistic categorial grammars. Un- certainty in Artificial Intelligence (UAI), pages 658– 666. Luke S Zettlemoyer and Michael Collins. 2007. On- line learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Confer- ence on Empirical Methods in Natural Language Pro- cessing and Computational Natural Language Learn- ing (EMNLP-CoNLL), pages 678–687.