Semantic Parsing of Ambiguous Input through Paraphrasing and Verification

Philip Arthur, Graham Neubig, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
{philip.arthur.om0, neubig, ssakti, tomoki, s-nakamura}@is.naist.jp

Abstract

We propose a new method for semantic parsing of ambiguous and ungrammatical input, such as search queries. We do so by building on an existing semantic parsing framework that uses synchronous context free grammars (SCFG) to jointly model the input sentence and output meaning representation. We generalize this SCFG framework to allow not one, but multiple outputs. Using this formalism, we construct a grammar that takes an ambiguous input string and jointly maps it into both a meaning representation and a natural language paraphrase that is less ambiguous than the original input. This paraphrase can be used to disambiguate the meaning representation via verification using a language model that calculates the probability of each paraphrase.[1]

[1] Tools to replicate our experiments can be found at http://isw3.naist.jp/~philip-a/tacl2015/index.html.

1 Introduction

Semantic parsing (SP) is the problem of parsing a given natural language (NL) sentence into a meaning representation (MR) conducive to further processing by applications. One of the major challenges in SP stems from the fact that NL is rife with ambiguities. For example, even the simple sentence "Where can we eat a steak in Kobe?" contains syntactic ambiguities ("eat in Kobe" or "steak in Kobe"?), quantifier scope ambiguities (do we all eat one steak, or each eat one steak?), and word sense ambiguities (is Kobe a city in Japan, or an NBA basketball player?). Previous work using statistical models along with formalisms such as combinatory categorial grammars, synchronous context free grammars, and dependency-based compositional semantics has shown notable success in resolving these ambiguities (Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Liang et al., 2011; Kwiatkowski et al., 2013).

Much previous work on SP has focused on the case of answering natural language queries to a database of facts, where the queries generally take the form of full sentences such as "What is the height of Kobe Bryant?" While answering these questions provides an excellent first step toward natural language information access, in many cases the input is not a full sentence, but something more underspecified and ungrammatical. For example, this is the case for keyword-based search queries (Sajjad et al., 2012) or short dialogue utterances (Zettlemoyer and Collins, 2007). Taking the specific example of search queries, users tend to omit function words and grammatical constructs to make a more concise query. The first column of Table 1 illustrates several search queries of the pattern "Kobe X" where X is another word. From these queries and their MRs in column two, we can see that there are several kinds of ambiguity, including not only the distinction between Kobe as a city or a basketball player as in the previous example, but also more pernicious problems unique to this more ambiguous input.
Focusing on the queries "Kobe hotels" and "Kobe flight," we can see that it is also necessary to estimate the latent relationship between words, such as "location" or "destination."

Search Query | Meaning Representation            | Paraphrase
Kobe Hotel   | λx (hotel(x) ∧ in(x, kobe city))  | Hotel in Kobe city
Kobe Flight  | λx (flight(x) ∧ to(x, kobe city)) | Flight to Kobe city
Kobe Height  | height(kobe bryant)               | Height of Kobe Bryant

Table 1: Example of a search query, MR, and its paraphrase.

However, it should be noted that if we take the keyword query and re-express it as a more explicit paraphrase, we can reduce this ambiguity to the point where there is only one reasonable interpretation. For example, in the second line, if we add the preposition "to," the user is likely asking for flights arriving in Kobe, and if we add "from," the user is asking for departures.

In this paper, we focus on SP of ambiguous input and propose a new method for dealing with the problem of ambiguity. We propose a framework where an ambiguous input (Column 1 in Table 1) is simultaneously transformed into both its MR (Column 2) and a more explicit, less ambiguous paraphrase (Column 3). The advantage of this method is that it is then possible to verify that the paraphrase indeed expresses the intended meaning of the underspecified input. This verification can be done either manually by the system user or automatically using a probabilistic model trained to judge the naturalness of the paraphrases.

As a concrete approach, we build upon the formalism of synchronous context free grammars (SCFGs). Unlike traditional SCFGs, which usually generate only one target string (in semantic parsing, an MR), we introduce a new variety of SCFG that generates multiple strings on the target side. This allows us to not only generate the MR, but also jointly generate the more explicit paraphrase. We then use a language model over the paraphrases generated by each derivation to help determine which derivations, and consequently which MRs, are more likely.

We perform an evaluation using the standard Geoquery benchmark of 880 query-logic pairs. First we note that the baseline SCFG parser achieves reasonable accuracy on regular questions, but when the same method is used with underspecified input, the system accuracy decreases significantly. On the other hand, when incorporating the proposed tri-synchronous grammar to generate paraphrases and verify them with a language model, we find that it is possible to recover this loss of accuracy, resulting in a model that is able to parse the ambiguous input with significantly better accuracy.

2 Semantic Parsing using Context Free Grammars

As a baseline SP formalism, we follow Wong and Mooney (2006) in casting SP as a problem of translation from a natural language query into its MR. This translation is done using synchronous context free grammars, which we describe in detail in the following sections.

2.1 Synchronous Context Free Grammars

Synchronous context free grammars are a generalization of context-free grammars (CFGs) that generate pairs of related strings instead of single strings. Slightly modifying the notation of Chiang (2007), we can formalize SCFG rules as:

X → ⟨γs, γt⟩    (1)

where X is a non-terminal and γs and γt are strings of terminals and indexed non-terminals on the source and target side of the grammar.
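To make the rule form in Equation (1) concrete, the following is a minimal sketch of one way such synchronous rules could be represented in code. The Python types and the toy rule are illustrative assumptions, not part of the systems described in this paper.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A right-hand-side symbol is either a terminal token (str) or an indexed
# non-terminal such as ("STATE", 1); identical indices link the source and
# target occurrences of the same non-terminal.
Symbol = Union[str, Tuple[str, int]]

@dataclass
class SCFGRule:
    """A synchronous rule X -> <gamma_s, gamma_t> as in Equation (1)."""
    lhs: str               # the head non-terminal X
    source: List[Symbol]   # gamma_s, e.g. NL tokens
    target: List[Symbol]   # gamma_t, e.g. MR tokens

# A purely illustrative semantic-parsing rule mapping an NL fragment to an
# MR fragment: STATE -> <virginia, virginia>.
rule = SCFGRule(lhs="STATE", source=["virginia"], target=["virginia"])
```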
SCFGs have recently come into favor as a tool for statistical machine translation (SMT). In SMT, a synchronous rule could, for example, take the form of:

X → ⟨X0 eats X1, X0 wa X1 wo taberu⟩    (2)

where γs is an English string and γt is a Japanese string. Each non-terminal on the right side is indexed, with non-terminals with identical indices corresponding to each other.

Given the SCFG grammar, we can additionally assign a score to each rule, where higher scored rules are more likely to participate in a derivation. Given the grammar of scored rules and an input sentence S, the highest scoring parse and output sentence T can be calculated using the CKY+ algorithm (Chiang, 2007).

Grammar
r0 QUERY → ⟨give me the CONJ0, answer(x1, CONJ0)⟩
r1 CONJ → ⟨FORM0 FORM1 STATE2, (FORM0, FORM1, const(x2, stateid(STATE2)))⟩
r2 FORM → ⟨cities, city(x1)⟩
r3 FORM → ⟨in, loc(x1, x2)⟩
r4 STATE → ⟨virginia, virginia⟩

Derivation
⟨QUERY0, QUERY0⟩
r0 ⇒ ⟨give me the CONJ1, answer(x1, CONJ1)⟩
r1 ⇒ ⟨give me the FORM2 FORM3 STATE4, answer(x1, (FORM2, FORM3, const(x2, stateid(STATE4))))⟩
r2 ⇒ ⟨give me the cities FORM3 STATE4, answer(x1, (city(x1), FORM3, const(x2, stateid(STATE4))))⟩
r3 ⇒ ⟨give me the cities in STATE4, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(STATE4))))⟩
r4 ⇒ ⟨give me the cities in virginia, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(virginia))))⟩

Figure 1: Example of SP using SCFGs. The left hand and right hand sides are generated simultaneously.

2.2 Semantic Parsing with SCFGs

In the simplest form of SP with SCFGs, γs is used to construct a natural language string S and γt is used to construct the MR T (Wong and Mooney, 2006). Figure 1 shows an example of using an SCFG to simultaneously generate a natural language string and its MR. In this figure, the bold symbols are non-terminals, which can be substituted with other non-terminal productions. A production ends when all of its tokens are terminals. The collection of rules used to generate a particular ⟨S, T⟩ pair is a derivation D = d1, d2, ..., d|D|.

Wong and Mooney (2007) further extended this formalism to λ-SCFGs, which treat γs as the natural language query and γt as an MR based on λ calculus. SCFG rules are automatically learned from pairs of input text and the corresponding MR, where the MR is expressed as a parse tree whose internal nodes are predicates, operators, or quantifiers.

In this paper, we follow Li et al. (2013)'s approach to extract a grammar from this parallel data. In this approach, for each pair, statistical word alignment aligns natural language tokens with the corresponding elements in the MR; then, according to the alignment, minimal rules are extracted with the GHKM algorithm (Galley et al., 2004; Li et al., 2013). Next, up to k minimal rules are composed to form longer rules (Galley et al., 2006), while considering the relationships between logical variables. Finally, unaligned NL tokens are incorporated by attaching them to the highest node in the tree that does not break the consistency of the alignment, as specified in Galley et al. (2006).
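As a worked illustration of how the two sides of Figure 1 are built up in lockstep, the toy sketch below replays that derivation with simple string substitution, rewriting each indexed non-terminal on the source and target sides simultaneously. It is for exposition only and bears no relation to the CKY+ parser used in practice.

```python
def expand(pair, nonterminal, rule):
    """Rewrite one indexed non-terminal on both sides of the derivation at once."""
    src, tgt = pair
    rule_src, rule_tgt = rule
    return src.replace(nonterminal, rule_src), tgt.replace(nonterminal, rule_tgt)

# The rules r0-r4 of Figure 1, with indices already renamed so that every
# non-terminal occurrence in the derivation has a unique name.
pair = ("QUERY0", "QUERY0")
steps = [
    ("QUERY0", ("give me the CONJ1", "answer(x1, CONJ1)")),                   # r0
    ("CONJ1", ("FORM2 FORM3 STATE4",
               "(FORM2, FORM3, const(x2, stateid(STATE4)))")),                # r1
    ("FORM2", ("cities", "city(x1)")),                                        # r2
    ("FORM3", ("in", "loc(x1, x2)")),                                         # r3
    ("STATE4", ("virginia", "virginia")),                                     # r4
]
for nonterminal, rule in steps:
    pair = expand(pair, nonterminal, rule)

print(pair[0])  # give me the cities in virginia
print(pair[1])  # answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(virginia))))
```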
2.3 Additional Rules

While the basic rules extracted above are quite effective in parsing the training data,[2] we found several problems when we attempt to parse unseen queries. To make our parser more robust, we add two additional varieties of rules.

[2] We achieve almost 100% F-measure in closed testing.

First, we add a deletion rule which allows us to delete an arbitrary word w under any head symbol X, formally:

X → ⟨w X, X⟩.    (3)

This rule gives our grammar the option of ignoring words that it does not know what to do with.

In addition, to ensure that all of the facts in the database can be accessed by our semantic parser, we provide some additional SCFG rules based on the given database of facts. The Geoquery dataset provides a database of facts represented as logical assertions. For every assertion provided in the database, we produce a single rule using the function name as the label of the non-terminal and one parameter of the assertion as the terminal, depending on the assertion's type. For example, Geoquery provides some details about the state of Michigan with the form state('michigan',...), and thus we add STATE → ⟨michigan, michigan⟩ as an additional rule in the grammar.

3 Semantic Parsing of Keyword Queries

As explained in Section 1, when users input keyword queries, they often ignore grammatical structure and omit function words. Because of this, a traditional SP model can be problematic. To give a concrete example, consider the synchronous parse in Figure 1. If we try to parse only the keywords (e.g. "cities virginia") with a standard grammar, the parser will not be able to recover the latent relationship "loc(x1, x2)" between the two words. Unfortunately, we lack evidence to recover this relationship, because the token "in" associated with the predicate "loc" will often not occur in a keyword query.

In this work, we perform experiments on this particular variety of ambiguous input, both to examine the effect that it has on parsing accuracy under the baseline model, and to examine whether this sort of ambiguity can be reduced. In order to do so, we need examples of keyword queries. In this work, we simulate the keyword query K by altering the original question S to make it more closely match the style of keyword queries. In particular, following the analysis of Leveling (2010), we make two changes to the original queries: stop word deletion and word order shuffling.

Stop word deletion, as its name implies, simply deletes all stop words from the input sentence. We use a stop word list,[3] making a few subjective changes to make the simulated keyword output more realistic. Specifically, we add "give" and "show," which often occur in statements such as "give me ..." or "show me ..." but are unnatural in keyword queries. We also exclude from the list "us," which often refers to "United States," and function words such as "many," "most," and "much."

[3] http://www.lextek.com/manuals/onix/stopwords2.html

Word order shuffling permutes the order of the keywords remaining after stop word deletion, to simulate the fact that keyword queries often do not have a strict order. First we shuffled the tokens randomly, then had a human annotator fix the order of the keywords manually, making the minimal number of changes necessary to ensure that the queries are natural and fluent. This produced a single keyword query K for each question/MR pair in the Geoquery database, which will be used to train and verify our system. At the end we have a 3-parallel corpus consisting of 880 tuples of keyword query, question, and meaning representation.
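The automatic part of this simulation, stop word deletion followed by a random shuffle (the manual re-ordering pass is not shown), can be sketched as follows. The short stop word list here is an illustrative stand-in for the full list used in the paper.

```python
import random

# Illustrative stop word subset; the paper uses a published list with "give"
# and "show" added and "us", "many", "most", "much" removed.
STOP_WORDS = {"give", "show", "me", "the", "what", "is", "are", "of", "in", "a"}

def simulate_keyword_query(question: str, seed: int = 0) -> str:
    """Delete stop words from a question and shuffle the remaining keywords."""
    keywords = [w for w in question.lower().split() if w not in STOP_WORDS]
    random.Random(seed).shuffle(keywords)
    return " ".join(keywords)

print(simulate_keyword_query("give me the cities in virginia"))
# e.g. "virginia cities"
```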
We should note that while shortening and reordering are prominent features of search queries (Leveling, 2010), these are not the only phenomena distinguishing queries from standard text. For example, humans also tend to replace content words with equivalent, simpler words of their preference (Gurský et al., 2009). While collecting data with such substitutions is beyond the scope of the present work, if a corpus of real keyword inputs and question paraphrases were available, it would in principle be possible for our proposed method to learn from this data as well.

4 Joint Semantic Parsing and Paraphrasing using Tri-Synchronous Grammars

In this section we describe our proposed method to parse underspecified and ungrammatical input while jointly generating a paraphrase that can be used to disambiguate the meaning of the original query.

4.1 Generalized Synchronous Context Free Grammars

Before defining the actual parsing framework, we first present a generalization of SCFGs, the n-synchronous context free grammar (n-SCFG) (Neubig et al., 2015). In an n-SCFG, the elementary structures are rewrite rules with n − 1 target sides:

X → ⟨γ1, γ2, ..., γn⟩    (4)

where X is a non-terminal symbol, γ1 is the source-side string of terminal and non-terminal symbols, and γ2, ..., γn are the target-side strings. At each derivation step, one non-terminal in γ1 is chosen, and all the corresponding non-terminals with the same index in {γ2, ..., γn} are rewritten using a single rule.

4.2 Tri-Synchronous Grammars for Joint Parsing and Paraphrasing

Based on this framework, we propose a model for joint semantic parsing and paraphrasing using tri-synchronous grammars, or 3-SCFGs. In this framework, the input γ1 corresponds to a keyword query K, and the outputs γ2 and γ3 correspond to the paraphrase and MR respectively. An example of jointly generating a keyword query, question, and MR with a 3-SCFG is shown in Figure 2.

Grammar
r0 QUERY → ⟨CONJ0, give me the CONJ0, answer(x1, CONJ0)⟩
r1 CONJ → ⟨FORM0 STATE1, FORM0 in STATE1, (FORM0, loc(x1, x2), const(x2, stateid(STATE1)))⟩
r2 FORM → ⟨cities, cities, city(x1)⟩
r3 STATE → ⟨virginia, virginia, virginia⟩

Derivation
⟨QUERY0, QUERY0, QUERY0⟩
r0 ⇒ ⟨CONJ0, give me the CONJ0, answer(x1, CONJ0)⟩
r1 ⇒ ⟨FORM2 STATE3, give me the FORM2 in STATE3, answer(x1, (FORM2, loc(x1, x2), const(x2, stateid(STATE3))))⟩
r2 ⇒ ⟨cities STATE3, give me the cities in STATE3, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(STATE3))))⟩
r3 ⇒ ⟨cities virginia, give me the cities in virginia, answer(x1, (city(x1), loc(x1, x2), const(x2, stateid(virginia))))⟩

Figure 2: An example of 3-SCFG rules and productions. Here there are two target sides: one is the paraphrase and the other is the MR.

In this work, we construct the tri-synchronous grammar by transforming the basic SCFG for semantic parsing G into a 3-SCFG. Specifically, we first assume that the source question γs and target MR γt of the original SCFG become the two outputs γ2 and γ3 of the new 3-SCFG grammar. γ1 is the newly added keyword query input.

During model training, we first extract rules consisting of γ2 and γ3 using the algorithm in Section 2.2, then generate γ1 from γ2 by first deleting the stop words and then rearranging the order of the remaining words based on the word alignments between the keyword query and the original question. This is done by assigning each word in K a range of words in S to which it is aligned, then sorting the words in γ1 in ascending order of these ranges. It is possible for some words in K to have no alignment in S; such rules are filtered out. Finally, we use the tuple ⟨γ1, γ2, γ3⟩ to form rules in our tri-synchronous grammar.
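A minimal sketch of the γ1 construction just described, under the assumption that the alignment is given as a mapping from each token to the span it aligns to: delete stop words, sort the survivors by their aligned ranges, and discard the rule if any remaining word is unaligned. The data layout and function name are illustrative.

```python
from typing import Dict, List, Optional, Tuple

def keyword_side(gamma_2: List[str],
                 stop_words: set,
                 align: Dict[str, Tuple[int, int]]) -> Optional[List[str]]:
    """Derive gamma_1 (the keyword side of a rule) from gamma_2 (the question side).

    align maps a token to the (start, end) range of positions it is aligned to;
    returning None marks the rule as filtered out.
    """
    kept = [w for w in gamma_2 if w not in stop_words]
    if any(w not in align for w in kept):
        return None                      # an unaligned keyword: drop the rule
    return sorted(kept, key=lambda w: align[w])

print(keyword_side(["the", "cities", "in", "virginia"],
                   {"the", "in"},
                   {"cities": (1, 1), "virginia": (3, 3)}))
# ['cities', 'virginia']
```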
Because of the stop word deletion, we may find that some rules have an empty source side, and consequently cannot be used in an SCFG. For example, in r3 in Figure 1, "in" is in the stop word list, and thus will be deleted from the source side, leaving it empty. In order to solve this problem, we compose all rules with empty inputs together with their parent rule. It should be noted that this introduces a large amount of ambiguity into the grammar, as the content represented by the deleted content word must now be generated essentially out of thin air, based only on its parent context.

4.3 Integrating Language Models with Tri-SCFGs

When using SCFGs for machine translation, the power of language models (LMs) to improve translation accuracy is widely acknowledged. The LM ensures fluent SMT output by assigning a probability to the target sentence. In the case of n-gram language models, this probability is defined as:

pLM(W) = ∏_{i=1}^{l} p(wi | wi−1, wi−2, ..., wi−n+1)    (5)

where the probability of a sentence W of length l is calculated as the product of the probabilities of its words, each conditioned on the previous n − 1 words. Integrating these language models makes the search space larger, precluding the use of the full CKY-style parsing algorithm, but efficient approximate search algorithms such as cube pruning (Chiang, 2007) or incremental search (Heafield et al., 2013) can help ameliorate this problem.

We could also consider constructing a probabilistic LM over the MR T for semantic parsing. However, constructing a language model for the MR is less straightforward for several reasons. First, the order of the words of an MR within the same rooted logical tree does not make a difference to the final result (e.g. for a commutative operator node). Second, while language models for natural text benefit from the large amounts of text data available on the web, obtaining correct MRs to train a model is less trivial.

On the other hand, in our tri-synchronous grammar framework, in addition to the MR itself, we are generating a paraphrase that holds some disambiguating power over the MR, as described in Section 1. The naturalness of this paraphrase output, like the output of an MT system, can easily be judged by a language model, and might have some correlation with the naturalness of the MR itself. Thus, in this work we add a language model over the paraphrase output as a feature of the scoring model described in the next section.

5 Parse Scoring

Given this SCFG-based parsing model, we must now assign a score that decides which derivations are better or worse than others.

5.1 Scoring Function

Our scoring function is a standard log-linear model with feature functions defined over ⟨K, S, T, D⟩ tuples:

score(K, S, T, D) = w · Φ(K, S, T, D)    (6)

where Φ(K, S, T, D) is a vector of feature functions and w is the weight vector.

5.2 Features

For the baseline model, our feature vector Φ(K, S, T, D) is simply defined as the element-wise sum of the feature vectors for each rule in the derivation:

Φ(K, S, T, D) = Σ_{d∈D} Φ(d)    (7)

where d takes the form in Equation (4). We score each basic rule using features widely used in translation, as follows:

• Forward Probability: The log probability of the source side given all the target sides, p(γ1|γ2, ..., γn), calculated from rule counts in the training corpus as c(γ1, ..., γn)/c(γ2, ..., γn).
• Backward Probability: Similarly, the log probability of all the target sides given the source side, p(γ2, ..., γn|γ1).
• Joint Probability: The log probability of the source and target sides, p(γ1, γ2, ..., γn).
• Terminal Rule: Equal to one if there is no non-terminal symbol in the rule. This feature is useful for deciding whether the model prefers entirely lexicalized rules.
• Deletion: Binary feature for deletion rules.
• Knowledge Base Rule: Binary feature for rules produced from the knowledge base.

For the proposed tri-synchronous grammar with LM verification, we additionally add three features defined over the generated paraphrase:

• Language Model: The log language model probability of the paraphrase.
• Unknown: The number of tokens in the paraphrase that are unknown to the language model.
• Paraphrase Length: The number of words in the paraphrase, which can be calculated for each rule as the number of terminals in the paraphrase. This feature helps compensate for the fact that language models prefer shorter sentences.
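To tie Equations (5)–(7) together, the following sketch sums per-rule feature vectors over a derivation, adds toy paraphrase features (an unsmoothed bigram log probability standing in for Equation (5), plus a length feature), and combines them with a weight vector as in Equation (6). It is a simplified illustration, not the Travatar-based implementation used in the experiments.

```python
import math
from collections import defaultdict
from typing import Dict, List

def ngram_logprob(sentence: List[str], counts: Dict[tuple, Dict[str, int]],
                  order: int = 2) -> float:
    """Toy, unsmoothed version of Equation (5) in log space."""
    logp, padded = 0.0, ["<s>"] * (order - 1) + sentence
    for i in range(order - 1, len(padded)):
        context, word = tuple(padded[i - order + 1:i]), padded[i]
        logp += math.log(counts[context][word] / sum(counts[context].values()))
    return logp

def derivation_score(rule_features: List[Dict[str, float]],
                     paraphrase: List[str],
                     weights: Dict[str, float],
                     counts: Dict[tuple, Dict[str, int]]) -> float:
    """Equations (6) and (7): sum rule features, then add paraphrase LM features."""
    phi: Dict[str, float] = defaultdict(float)
    for feats in rule_features:                 # Phi(K,S,T,D) = sum over rules d
        for name, value in feats.items():
            phi[name] += value
    phi["language_model"] = ngram_logprob(paraphrase, counts)
    phi["paraphrase_length"] = float(len(paraphrase))
    return sum(weights.get(name, 0.0) * value for name, value in phi.items())
```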
5.3 Learning Feature Weights

Now that we have defined the feature space, we need to optimize the weights. For this we use minimum error rate training (MERT) (Och and Ney, 2003), maximizing the number of correct answers over the entire corpus.[4]

[4] We also tried gradient-based optimization methods and large feature sets as in Wong and Mooney (2007) and Li et al. (2013), but the dense feature set and MERT achieved similar results with shorter training time.
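As a concrete illustration of the probability features of Section 5.2, the sketch below estimates them by relative frequency over counts of extracted rule tuples ⟨γ1, γ2, γ3⟩. The count table is a toy stand-in.

```python
import math
from collections import Counter

# Toy counts over extracted rule tuples (gamma_1, gamma_2, gamma_3).
rule_counts = Counter({
    ("cities", "cities", "city(x1)"): 40,
    ("cities", "towns", "city(x1)"): 10,
})

def probability_features(rule):
    """Forward, backward, and joint log probabilities as in Section 5.2."""
    g1, g2, g3 = rule
    joint = rule_counts[rule]
    total = sum(rule_counts.values())
    target_count = sum(c for (a, b, d), c in rule_counts.items() if (b, d) == (g2, g3))
    source_count = sum(c for (a, _, _), c in rule_counts.items() if a == g1)
    return {
        "forward": math.log(joint / target_count),   # p(g1 | g2, ..., gn)
        "backward": math.log(joint / source_count),  # p(g2, ..., gn | g1)
        "joint": math.log(joint / total),            # p(g1, g2, ..., gn)
    }

print(probability_features(("cities", "cities", "city(x1)")))
```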
6 Experiment and Analysis

We evaluate our system using the Geoquery corpus (Zelle and Mooney, 1996), which contains 880 sentences representing natural language questions about U.S. geography, together with their corresponding MRs.

6.1 Setup

• Data: We use the full Geoquery dataset with the same 10 folds of 792 training and 88 test examples used by Wong and Mooney (2007). We created keyword queries according to the process described in Section 3. We follow the standard procedure of removing punctuation from all natural language text, regardless of whether it is a keyword or a full question. We also perform stemming on all natural language text, both in the keyword and question queries.

• Rule Extraction: Alignment is performed by pialign (Neubig et al., 2011) with the setting forcing one-to-many alignments. The algorithm to extract the tri-synchronous grammar is as discussed in Section 4.2, and the maximum size of the rules for composition is 4.

• Decoding: To query the database, we use prolog queries fired against the Geoquery database. The parsing problem can thus be considered the task of decoding from underspecified natural language queries into prolog queries. This is done by decoding with the SCFG-based parsing model to translate the input query into an MR including λ calculus expressions, performing β-reduction to remove the λ functions, and then firing the query against the database. Before querying the database, we also apply Wong and Mooney (2007)'s type-checking to ensure that all MRs are logically valid. For parsing, we implemented CKY-based parsing of tri-synchronous grammars on top of the Travatar (Neubig, 2013) decoder. Unless otherwise specified, the default settings of the decoder are used.

• Language Model: For all 3-SCFG systems we use a 4-gram Kneser-Ney smoothed language model trained using the KenLM toolkit (Heafield, 2011). Standard preprocessing such as lowercasing and tokenization is performed before training the models. As it is of interest whether or not the type of data used to train the language model affects the resulting performance, we build language models on several types of data. First, we use a corpus of news data from the Workshop on Machine Translation evaluation data (Callison-Burch et al., 2011) (News). This data represents standard English text unrelated to questions. Second, we use a part of the question paraphrase data gathered by Fader et al. (2013) (Questions).[5] This data consists entirely of questions, and is thus a better representative of the latent questions behind the input queries. Finally, we use the full questions from the Geoquery sentences to build a language model, building a different language model for each fold, completely separate from the test set. Table 2 gives the details of each dataset.

[5] We use only data set 0 from the 30 sets released at http://knowitall.cs.washington.edu/afader/paraphrases.

Data      | Sent. | Tok.  | LM Size
News      | 44.0M | 891M  | 5.5G
Questions | 20.2M | 174M  | 1.5G
Geoquery  | 792   | ∼1.6K | ∼96K

Table 2: Details of the data used to build LMs.

In addition, because the Geoquery data is useful but small, for all 3-SCFG systems we perform experiments using an additional 4-gram feed-forward neural network language model (NNLM) (Bengio et al., 2003) feature, which is possibly better equipped to handle sparse data than standard n-grams. The NNLM is built on the Geoquery sentences, excluding the test sentences for each fold. This feature is not produced during parsing, but is separately scored and used to re-rank the n-best list generated by the parser. Integration with the paraphrase n-gram language model is performed using incremental search (Heafield et al., 2013). For parsing with the NNLM, we recalculate the scores of the paraphrases by adding the NNLM score as one more feature in Equation (6) and taking the parse with the best score.

• Parameter Optimization: For learning the parameters of the scoring function we use 10-fold cross validation on the training data, i.e. each fold iteration uses a model trained on 712 examples to parse the remaining 79. First we run decoding for all folds and gather the results. Then we run MERT on the combined results to update the parameters. We use the standard evaluation measure of question answering accuracy as our objective function and set the n-best list to the top 300 derivations. To learn the weights for rescoring with the NNLM, we first generate an n-best list with the base model, not using the NNLM feature. We then calculate the NNLM feature for each hypothesis in the n-best list, and run one more round of MERT with this feature to obtain the weights used in the rescoring model.

• Evaluation: Following the definitions of Zettlemoyer and Collins (2005) and Wong and Mooney (2007), we use question answering accuracy as our evaluation measure. We define recall as the number of correct answers divided by the number of test examples, precision as the number of correct answers divided by the number of parsed queries, and F-measure as the harmonic mean of the two. A query is judged to be parsed if and only if the SCFG can generate a valid parse tree and the resulting query does not produce any syntax errors when accessing the database through a prolog query. Note that all 880 questions are used for testing through cross validation, so a recall improvement of 0.001 is approximately equal to answering one more question correctly.
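The n-best rescoring with the NNLM feature described in the Language Model setting above can be sketched as follows: every hypothesis's feature vector gains one extra NNLM log-probability feature, and the highest-scoring hypothesis under the re-tuned weights (Equation (6)) is returned. The hypothesis layout and the nnlm_logprob callable are assumptions for illustration.

```python
from typing import Callable, Dict, List

Hypothesis = Dict[str, object]   # holds "paraphrase" (tokens), "mr", "features"

def rerank_with_nnlm(nbest: List[Hypothesis],
                     weights: Dict[str, float],
                     nnlm_logprob: Callable[[List[str]], float]) -> Hypothesis:
    """Re-pick the 1-best hypothesis after adding an NNLM feature to each one."""
    def rescore(hyp: Hypothesis) -> float:
        feats = dict(hyp["features"])                     # copy baseline features
        feats["nnlm"] = nnlm_logprob(hyp["paraphrase"])   # extra rescoring feature
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return max(nbest, key=rescore)
```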
6.2 Parsing Accuracy Results

Input    | Method | P    | R    | F
Question | Direct | .880 | .878 | .879
Keyword  | Direct | .792 | .790 | .791
Keyword  | Tri-LM | .804 | .790 | .797
Keyword  | Tri+LM | .830 | .820 | .827

Table 3: Parsing accuracy, where Keyword Direct is the baseline for semantic parsing on keyword queries, and Tri with the LM for verification is our proposed method. Bold indicates a significant gain over both Direct and Tri-LM for keyword input according to bootstrap resampling (Koehn, 2004) (p < 0.05).

First, in this section, we examine the effect of the proposed method on the accuracy of parsing ambiguous keyword queries. Specifically, in Table 3 we show the baseline "Direct" method of training a standard SCFG-based semantic parser, the proposed method without language model verification "Tri-LM," and the proposed method using the Questions language model with NNLM reranking "Tri+LM."

Looking at the baseline accuracy over full questions (first row), we can see the recall is slightly superior to Wong and Mooney (2007)'s 86.6% and Li et al. (2013)'s 87.6%, demonstrating that our baseline is comparable to previous work. When we apply the same method to parse the keyword queries (second row), however, the recall drops by almost 9%, showing that the ambiguity in the keyword query input causes large decreases in the accuracy of a semantic parser built according to the baseline method. This ambiguity is also reflected in the number of MRs that the parser can generate for any particular input: in the top 300 lists generated by each parser, there were on average 16.54 and 36.77 unique MRs for question and keyword input respectively.

If we now look at the 3-SCFG without LM verification (third row), we can see the results are similar to the baseline. Then, when adding the language model to the 3-SCFG system (fourth row), we see a significant gain of 3-4% over the Direct and Tri-LM systems, demonstrating that the proposed method of paraphrasing and verification is indeed able to resolve some of the ambiguity in the keyword queries.

To illustrate how the language model helps, we provide two examples in Table 4. The first example shows that considering the original question when parsing from keywords can help improve alignment with the MR for more plausible results. The second example shows the effect of adding the language model to disambiguate the keyword query. Here there are several interpretations of the keyword query "largest capital state," which can mean "state that has the largest capital" or "largest state in the capital." The system without the language model incorrectly chooses the latter interpretation, but the system with the language model correctly disambiguates the query, as it considers the phrase "state in capital" unlikely, showing the effectiveness of our method.

Ex. 1
  Direct: answer(A,(capital(A),loc(A,B),largest(C,population(B,C)))) — no
  Tri-LM: answer(A,largest(B,(capital(A),population(A,B)))) — yes
  Tri+LM: what capital has the largest population
  Original Question: what capital has the largest population
  Original MR: answer(A,largest(B,(capital(A),population(A,B))))
  Keyword: largest population capital

Ex. 2
  Direct(-): answer(A,largest(A,(capital(A),city(A),loc(A,B),state(B)))) — no
  Tri-LM: answer(A,largest(A,(state(A),loc(A,B),capital(B)))) — no
    what is the largest state in capital
  Tri+LM: answer(A,(state(A),loc(B,A),largest(B,capital(B)))) — yes
    what state has the largest capital
  Original Question: what state has the largest capital
  Original MR: answer(A,(state(A),loc(B,A),largest(B,capital(B))))
  Keyword: largest capital state

Table 4: Examples of paraphrase outputs produced by the direct keyword-MR system, and the proposed systems without and with a language model.

6.3 Analysis

We first examine the effect of the choice of language model in the first two columns of Table 5.
The first column is the full model with NNLM re-ranking, and the second column is without it. The rows show the effect of using different data to train the n-gram LM. All the systems using LMs are generally better than the system using neither an n-gram LM nor the NNLM. Looking at the differences between the n-gram LMs, we can see that the Questions LM tends to be the most effective. This is particularly encouraging, as the Questions language model does not contain any domain-specific content but is still able to outperform the domain-specific Geoquery LM. We also found that, as expected, the more sophisticated neural network language model raises the system accuracy by approximately 2%, which further supports our idea that a better LM raises system accuracy.

Input: Keyword, Output: Q+MR

LM    |  Full (P / R / F)    |  -NNLM (P / R / F)  |  -Empty (P / R / F)
-     |  .828  .813  .821    |  .804  .790  .797   |  .820  .784  .801
News  |  .823  .814  .819    |  .806  .797  .802   |  .817  .786  .802
Quest |  .830† .820† .827†   |  .809  .804  .806   |  .806  .780  .793
Geo   |  .821  .812  .817    |  .804  .794  .799   |  .824  .794  .808

Table 5: The results of the experiment with and without the neural network language model for the proposed 3-SCFG framework. The Questions LM with the NNLM achieved the best accuracy. Bold indicates a significant gain over the baseline Direct Keyword system (second row of Table 3) and a dagger indicates a significant gain over the 3-SCFG baseline without a language model (-NNLM column, first row). The Full and -Empty columns use the NNLM as the language model. The first row of the -NNLM column is the experiment without any language model.

The proposed method aims at reducing nonsensical interpretations, and another trivial baseline that can achieve a similar effect is to filter out the queries that produce empty answers, under the assumption that empty answers are generated by invalid queries. This simple filtering method reduced the average number of unique queries to 11.74 for questions and 20.16 for keywords. However, as shown in the "-Empty" results in Table 5, we found that this filtering method is not effective, causing the system's performance to drop by around 2%. This is caused by the fact that the correct answer is sometimes an empty answer, for example for "what states border hawaii?"
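The empty-answer filtering baseline examined above amounts to the following; the execute callable standing in for the Prolog back end is an assumption for illustration.

```python
from typing import Callable, List

def filter_empty_answers(nbest_mrs: List[str],
                         execute: Callable[[str], list]) -> List[str]:
    """Drop candidate MRs whose execution against the database returns nothing.

    As noted above, this can discard correct parses whose true answer is empty,
    which is why the filter hurts accuracy in Table 5.
    """
    return [mr for mr in nbest_mrs if execute(mr)]
```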
6.4 Human Evaluation

While all evaluation up to this point has used language models to disambiguate paraphrases, we can assume that human users will be even better at judging whether or not a paraphrase makes sense. Thus, we perform an additional evaluation in which human annotators evaluate the paraphrases generated by the systems. First, we took the 1-best parse and 7 random parses from the Tri+LM and Tri-LM systems for inputs where both systems produced a non-empty n-best list. Then we showed both the keyword queries and all the paraphrases to human evaluators to annotate: i) a fluency score of 0, 1, or 2, where 0 is completely unnatural English, 1 indicates minor grammatical errors, and 2 indicates flawless English; ii) a letter starting from "A," "B," etc. for the paraphrase that matches their preferred interpretation of the search query.[6] If the input has multiple interpretations, then a different letter is assigned for each possible interpretation, in the order in which the annotator believes that interpretation to be the correct one, and only one paraphrase is chosen per interpretation. If the human annotator does not find a paraphrase that matches his or her preferred interpretation, the letter "A" is not assigned and annotation starts from "B." Three annotators were each asked to annotate 300 keyword queries and their paraphrases. There are a total of 866 keyword queries (out of 880) that produced a non-empty n-best list in both systems, so we chose random duplications of 34 inputs to bring the total to 900.

[6] Here the letters are simply indicators of ranking, with the letter "A" marking the most plausible interpretation of the search query according to the user.

System       | Precision
Tri-LM       | .803
Tri+LM       | .834
Tri+LM+Human | .846

Table 6: System precision with additional human help.

Table 6 shows the improvement of the system with human help. We take all the parses whose paraphrases were annotated with "A" and use them to replace the answers of the Tri+LM system. Overall, there were 35 questions whose answers changed between the 1-best and the human choices, with 23 improving and 12 degrading accuracy. This experiment suggests that it is possible to show the generated paraphrases to human users to improve the accuracy of the semantic parser.

System | Fluency | Ratio | Precision
Tri-LM | 0       | .163  | .415
       | 1       | .425  | .819
       | 2       | .411  | .940
Tri+LM | 0       | .083  | .367
       | 1       | .372  | .811
       | 2       | .544  | .918

Table 7: Fluency, ratio, and precision statistics for the one-best outputs of both systems.

We now look at the relationship between the fluency of the paraphrase and the accuracy of the semantic parsers in Table 7. The statistics are gathered from the one-best output of both systems. Tri+LM had a significantly larger percentage of fluent paraphrases with score "2" (54% vs. 41%) compared to the system without the language model. Of the paraphrases that were assigned a score of "2," 91% corresponded to correct MRs, indicating that the subjective fluency of the paraphrase is a good indicator of parsing accuracy.

Letter | System | Count | Total | Precision
A      | Tri-LM | 547   | 721   | .919
       | Tri+LM | 674   |       | .902
B      | Tri-LM | 308   | 452   | .772
       | Tri+LM | 340   |       | .752
C      | Tri-LM | 57    | 94    | .631
       | Tri+LM | 52    |       | .557
D      | Tri-LM | 7     | 13    | .428
       | Tri+LM | 7     |       | .142

Table 8: Results for letter accuracy from the human evaluation. Note that the counts do not sum to the totals because both systems may generate the same paraphrase.

Finally, Table 8 shows the relationship between the rank of the human interpretation and the accuracy of semantic parsing. Out of the 900 problems shown to the annotators, 721 were ranked "A." This experiment showed that the interpretation of the paraphrase judged as most likely by the annotators achieves high precision, confirming our hypothesis that humans are able to use paraphrases to accurately judge whether an interpretation is likely to be correct or not.

6.5 Other Methods for Using Paraphrase Data

In addition to the method described up to this point, there are several other ways to potentially incorporate paraphrasing into semantic parsing of underspecified input. In this section we briefly outline two other (unsuccessful) attempts to do so: creation of a pipelined paraphrasing/semantic parsing system, and addition of features from a large paraphrase database.

First, regarding the pipelined system, we build the paraphrasing system using the parallel keyword-question data, with standard settings of hierarchical phrase-based translation (Chiang, 2007) and standard SMT features. We use the Geoquery n-gram model as the language model during decoding and the NNLM language model to rerank the final n-best list. In our experiments, even though this system obtained a respectable BLEU score of 57.5, the parsing accuracies were much lower than those of the direct keyword-MR system, at 64.8 F-measure.
An analysis showed that, perhaps as expected, this was caused by cascading errors, with unnatural paraphrases also resulting in failed semantic parses.

In addition, we also attempted to use the external Questions data to calculate additional features for our Tri+LM system. We do this by first simulating the keyword version of each sentence in the Questions data by performing shuffling and stop word deletion.[7] Next, we train a hierarchical phrase-based system on this data to create a paraphrasing model. We then intersect this model with our existing model by matching the source side and the target side of the rules and, if they match, taking the union of the feature sets. Unfortunately, this setting also did not yield a gain in accuracy, likely due to the low recall (15%) of the matching between the paraphrasing grammar and the semantic parsing rules. This low recall stemmed from a number of factors, including restrictions on the standard Hiero paraphrasing grammars (no more than 2 non-terminals, no consecutive non-terminals on the source side, and no rules without at least one terminal), as well as a simple lack of coverage of the words in the paraphrase database. This result does indicate room for improvement by developing algorithms that extract paraphrases that are closely related to the semantic parsing rules, but it also suggests potential difficulties in simply applying paraphrasing resources such as PPDB (Ganitkevitch et al., 2013).

[7] Because the shuffling process is random, we could conceivably generate and train with multiple shuffled versions, but because the Questions data is relatively large already, we only train the paraphrasing system with the single permutation of keywords generated by the shuffling.
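The grammar intersection attempted above can be sketched roughly as follows: paraphrasing rules and tri-synchronous parsing rules are matched on their shared sides, and the feature sets of matching rules are merged. The dictionary-keyed rule representation is a simplified assumption.

```python
from typing import Dict, Tuple

# Rules are keyed by their (keyword_side, question_side) string pair and carry
# a dictionary of feature values.
RuleTable = Dict[Tuple[str, str], Dict[str, float]]

def intersect(parsing_rules: RuleTable, paraphrase_rules: RuleTable) -> RuleTable:
    """Keep the parsing rules, adding paraphrase features where the sides match."""
    merged: RuleTable = {}
    for key, features in parsing_rules.items():
        merged[key] = dict(features)
        if key in paraphrase_rules:                    # source and target sides match
            merged[key].update(paraphrase_rules[key])  # union of the feature sets
    return merged
```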
7 Related Work

Interpretation of search queries is a major concern in the field of information retrieval (IR), as it can affect the choice of retrieved documents. Underspecified queries are commonly entered into search engines, leading to large result sets that are difficult for users to navigate (Sajjad et al., 2012). Studies have shown that there are several ways to deal with this problem, including query reformulation, which can fall into the categories of query expansion or query substitution (Shokouhi et al., 2014; Xue and Croft, 2013). Leveling (2010) proposed a paraphrasing method that tries to reconstruct original questions given keyword inputs in the IR context, but did not model this reformulation together with semantic parsing. In addition, Wang et al. (2013) showed that paraphrasing the queries for web search is able to reduce the mismatch between queries and documents, resulting in a gain in search accuracy.

Using paraphrasing to resolve ambiguity is not new, as it has been used to resolve ambiguity interactively with a user's input (McKeown, 1983). Ge and Mooney (2009) and Miller et al. (1994) have also used the guidance of natural language syntax for semantic parsing. However, using natural language syntax for semantic parsing of keyword queries is not trivial. For example, the approach of Ge and Mooney (2009), which uses the syntax tree of the input side, cannot be directly applied to keyword queries, as syntactically parsing a keyword query is itself a non-trivial problem.

There have also been a few methods proposed to combine paraphrasing with semantic parsing. Fader et al. (2013) proposed a method to map from full questions to more canonical forms of these questions, with the canonical NL questions being trivially convertible to an MR. Berant and Liang (2014) extract entities from a full-text question, map these entities into a set of candidate MRs, and generate canonical utterances accordingly. Then the canonical utterance that best paraphrases the input is chosen, thereby outputting the corresponding MR. Our approach is similar but orthogonal to these works in that we focus on situations where the original user input is underspecified, and try to generate a natural language paraphrase that more explicitly states the user intention for disambiguation purposes. A second difference is that we do not use a separate model to do paraphrasing, instead using the same model to do paraphrasing and semantic parsing synchronously. This has the advantage of being able to scale more easily to complicated and highly compositional questions such as the ones found in Geoquery.

In addition to being useful for semantic parsing, SCFGs have also been used for paraphrasing. A variety of research has used SCFG-based paraphrases for text-to-text generation tasks like sentence compression (Cohn and Lapata, 2009; Ganitkevitch et al., 2011), or expanding the set of reference translations for machine translation evaluation (Madnani et al., 2007). In this paper we have introduced a novel use of 3-way SCFGs that allows us to simultaneously perform semantic parsing and text-to-text generation. To our knowledge, this is the first method to parse an underspecified input by reconstructing a more explicit paraphrase of the input and validating the naturalness of the paraphrase to disambiguate the meaning of the original input.

8 Conclusion and Future Work

In this paper we introduced a method for constructing a semantic parser for ambiguous input that paraphrases the ambiguous input into a more explicit form and verifies its correctness using a language model. We do so through a generalization of synchronous context free grammars that allows for the generation of multiple output strings at one time. An evaluation showed that our method is effective in helping compensate for the 9% loss of system accuracy due to the ambiguity of the keyword queries, providing a 3% improvement. Human evaluation also confirmed that manually evaluating the paraphrases generated by our framework can improve the accuracy of the semantic parser further.

There are a number of future directions for this study. First, we plan to scale the proposed method to open domain semantic parsing of search queries over extensive knowledge bases such as Freebase (Bollacker, 2007). In addition, previous works have tackled semantic parsing directly from question and answer pairs (Liang et al., 2011; Poon and Domingos, 2009; Artzi and Zettlemoyer, 2011). The idea of learning from unannotated data is attractive, and incorporating this learning framework into our model is a promising direction for future work.

Acknowledgments

We thank Professor Yuji Matsumoto and Assistant Professor Kevin Duh for the discussions and insightful ideas for the paper. We also thank Professor Chris Callison-Burch and the anonymous reviewers for their suggestions and discussions. This project is supported by grants from the Ministry of Education, Culture, Sports, Science, and Technology of Japan and from the Microsoft CORE program.

References

Yoav Artzi and Luke Zettlemoyer. 2011.
Bootstrapping semantic parsers from conversations. In Proceedings of the 2011 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pages 421–432. Yoshua Bengio, Ducharme Réjean, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic lan- guage model. The Journal of Machine Learning Re- search, 3:1137–1155. Jonathan Berant and Percy Liang. 2014. Semantic pars- ing via paraphrasing. In Proceedings of the 52th An- nual Meeting of the Association for Computational Linguistics (ACL), pages 1415–1425. Kurt Bollacker. 2007. A platform for scalable, collabo- rative, structured information integration. In Proceed- ings of the 22nd Association ofr Advancement of Arti- ficial Intelligence, pages 22–27. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F Zaidan. 2011. Findings of the 2011 work- shop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Transla- tion, pages 22–64. David Chiang. 2007. Hierarchical phrase-based transla- tion. Computational Linguistics, (2):201–228. Trevor Cohn and Mirella Lapata. 2009. Sentence com- pression as tree transduction. Journal of Artificial In- telligence Research (JAIR), 34:637–674. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51th Annual Meet- ing of the Association for Computational Linguistics (ACL), pages 1608–1618. Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What’s in a translation rule? In Proceedings of the 2004 Human Language Technol- ogy Conference of the North American Chapter of the Association for Computational Linguistics (HLT- NAACL), pages 273–280. Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceed- ings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL), pages 961–968. Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme. 2011. Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 1168–1179. Asso- ciation for Computational Linguistics. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pages 758–764, Atlanta, Georgia, June. As- sociation for Computational Linguistics. Ruifang Ge and Raymond J Mooney. 2009. Learning a compositional semantic parser using an existing syn- tactic parser. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 611–619. Peter Gurský, Tomás Horváth, Peter Vojtás, Jozef Jirásek, Stanislav Krajci, Robert Novotny, Jana Pribolová, and Veronika Vaneková. 2009. User preference web search – experiments with a system connecting web and user. Computing and Informatics, 28(4):515–553. Kenneth Heafield, Philipp Koehn, and Alon Lavie. 2013. Grouping language model boundary words to speed k- best extraction from hypergraphs. 
In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, pages 958–968. Kenneth Heafield. 2011. KenLM: faster and smaller lan- guage model queries. In Proceedings of the 2011 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 187–197. Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP). Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1545–1556. Johannes Leveling. 2010. A comparative analysis: QA evaluation questions versus real-world queries. In Pro- ceedings of the 7th International Conference on Lan- guage Resources and Evaluation (LREC). Peng Li, Yang Liu, and Maosong Sun. 2013. An ex- tended GHKM algorithm for inducing lambda-SCFG. In Proceedings of the 27th Association for Advance- ment of Artificial Intelligence, pages 605–611. Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the As- sociation for Computational Linguistics (ACL), pages 590–599. Nitin Madnani, Necip Fazil Ayan, Philip Resnik, and Bonnie Dorr. 2007. Using paraphrases for parameter tuning in statistical machine translation. In Proceed- ings of the Workshop on Statistical Machine Transla- tion (WMT07), Prague, Czech Republic. Kathleen R. McKeown. 1983. Paraphrasing questions using given and new information. Computational Lin- guistics, 9(1):1–10. Scott Miller, Robert Bobrow, Robert Ingria, and Richard Schwartz. 1994. Hidden understanding models of natural language. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguis- tics (ACL), pages 25–32. Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shin- suke Mori, and Tatsuya Kawahara. 2011. An unsuper- vised model for joint phrase alignment and extraction. In Proceedings of the 49th Annual Meeting of the As- sociation for Computational Linguistics: Human Lan- guage Technologies (HLT-ACL), pages 632–641. Graham Neubig, Philip Arthur, and Kevin Duh. 2015. Multi-target machine translation with multi- synchronous context-free grammars. In Meeting of the North American Chapter of the Association for Com- putational Linguistics (NAACL), Denver, USA, May. Graham Neubig. 2013. Travatar: A forest-to-string ma- chine translation engine based on tree transducers. In ACL (Conference System Demonstrations), pages 91– 96. Franz Josef Och and Hermann Ney. 2003. A system- atic comparison of various statistical alignment mod- els. Computational Linguistics, (1):19–51. Hoifung Poon and Pedro Domingos. 2009. Unsuper- vised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 1–10. Hassan Sajjad, Patrick Pantel, and Michael Gamon. 2012. Underspecified query refinement via natural language question generation. In Proceedings of the 24th International Conference on Computational Lin- guistics (COLING), pages 2341–2356. Milad Shokouhi, Rosie Jones, Umut Ozertem, Karthik Raghunathan, and Fernando Diaz. 2014. Mobile query reformulations. 
In Proceedings of the 37th In- ternational ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1011– 1014. Chenguang Wang, Nan Duan, Ming Zhou, and Ming Zhang. 2013. Paraphrasing adaptation for web search ranking. In Proceedings of the 51th Annual Meeting of the Association for Computational Linguistics (ACL), pages 41–46. Yuk Wah Wong and Raymond J Mooney. 2006. Learn- ing for semantic parsing with statistical machine trans- lation. In Proceedings of the 2006 Human Language Technology Conference of the North American Chap- ter of the Association for Computational Linguistics (HLT-NAACL), pages 439–446. Yuk Wah Wong and Raymond J Mooney. 2007. Learn- ing synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguis- tics (ACL), number 1, pages 960–967. Xiaobing Xue and W. Bruce Croft. 2013. Modeling re- formulation using query distributions. ACM Transac- tion on Information Systems, (2):6:1–6:34. John M. Zelle and Raymond J. Mooney. 1996. Learn- ing to parse database queries using inductive logic pro- gramming. In Proceedings of the 13th National Con- ference on Artificial Intelligence, pages 1050–1055. Luke S Zettlemoyer and Michael Collins. 2005. Learn- ing to map sentences to logical form: Structured clas- sification with probabilistic categorial grammars. Un- certainty in Artificial Intelligence (UAI), pages 658– 666. Luke S Zettlemoyer and Michael Collins. 2007. On- line learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Confer- ence on Empirical Methods in Natural Language Pro- cessing and Computational Natural Language Learn- ing (EMNLP-CoNLL), pages 678–687.