Mapping to Declarative Knowledge for Word Problem Solving

Subhro Roy* (Massachusetts Institute of Technology, subhro@csail.mit.edu)
Dan Roth* (University of Pennsylvania, danroth@seas.upenn.edu)

*Most of the work was done when the authors were at the University of Illinois, Urbana-Champaign.

Transactions of the Association for Computational Linguistics, vol. 6, pp. 159–172, 2018. Action Editor: Luke Zettlemoyer. Submission batch: 10/2017; Revision batch: 12/2017; Published 3/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Math word problems form a natural abstraction to a range of quantitative reasoning problems, such as understanding financial news, sports results, and casualties of war. Solving such problems requires the understanding of several mathematical concepts such as dimensional analysis, subset relationships, etc. In this paper, we develop declarative rules which govern the translation of natural language descriptions of these concepts to math expressions. We then present a framework for incorporating such declarative knowledge into word problem solving. Our method learns to map arithmetic word problem text to math expressions, by learning to select the relevant declarative knowledge for each operation of the solution expression. This provides a way to handle multiple concepts in the same problem while, at the same time, supporting interpretability of the answer expression. Our method models the mapping to declarative knowledge as a latent variable, thus removing the need for expensive annotations. Experimental evaluation suggests that our domain knowledge based solver outperforms all other systems, and that it generalizes better in the realistic case where the training data it is exposed to is biased in a different way than the test data.

1 Introduction

Many natural language understanding situations require reasoning with respect to numbers or quantities – understanding financial news, sports results, or the number of casualties in a bombing. Math word problems form a natural abstraction to a lot of these quantitative reasoning problems. Consequently, there has been a growing interest in developing automated methods to solve math word problems (Kushman et al., 2014; Hosseini et al., 2014; Roy and Roth, 2015).

Arithmetic Word Problem: Mrs. Hilt baked pies last weekend for a holiday dinner. She baked 16 pecan pies and 14 apple pies. If she wants to arrange all of the pies in rows of 5 pies each, how many rows will she have?
Solution: (16 + 14) / 5 = 6

Figure 1: An example arithmetic word problem and its solution, along with the concepts required to generate each operation of the solution. (The original figure also labels the math concept needed for each operation.)

Understanding and solving math word problems involves interpreting the natural language description of mathematical concepts, as well as understanding their interaction with the physical world. Consider the elementary school level arithmetic word problem shown in Fig 1. To solve the problem, one needs to understand that "apple pies" and "pecan pies" are kinds of "pies", and hence, the number of apple pies and pecan pies needs to be summed up to get the total number of pies. Similarly, detecting that "5" represents "the number of pies per row" and applying dimensional analysis or unit compatibility knowledge helps us infer that the total number of pies needs to be divided by 5 to get the answer.
Besides part-whole relationships and dimensional analysis, there are several other concepts that are needed to support reasoning in math word problems. Some of these involve understanding comparisons, transactions, and the application of math or physics formulas. Most of this knowledge can be encoded as declarative rules, as illustrated in this paper.

This paper introduces a framework for incorporating this "declarative knowledge" into word problem solving. We focus on arithmetic word problems, whose solution can be obtained by combining the numbers in the problem with basic operations (addition, subtraction, multiplication or division). For combining a pair of numbers or math sub-expressions, our method first predicts the math concept that is needed for it (e.g., subset relationship, dimensional analysis, etc.), and then predicts a declarative rule under that concept to infer the mathematical operation. We model the selection of declarative rules as a latent variable, which removes the need for expensive annotations for the intermediate steps.

The proposed approach has some clear advantages compared to existing work on word problem solving. First, it provides interpretability of the solution, without expensive annotations. Our method selects a declarative knowledge based inference rule for each operation needed in the solution. These rules provide an explanation for the operations performed. In particular, it learns to select relevant rules without explicit annotations for them. Second, each individual operation in the solution expression can be generated independently by a separate mathematical concept. This allows our method to handle multiple concepts in the same problem.

We show that existing datasets of arithmetic word problems suffer from significant vocabulary biases and, consequently, existing solvers do not do well on conceptually similar problems that are not biased in the same way. Our method, on the other hand, learns the right abstractions even in the presence of biases in the data. We also introduce a novel approach to gather word problems without these biases, creating a new dataset of 1492 problems.

The next section discusses related work. Section 3 introduces the mathematical concepts required for arithmetic word problems, as well as the declarative rules for each concept. Section 4 describes our model – how we predict answers using declarative knowledge – and provides the details of our training paradigm. Finally, we provide an experimental evaluation of our proposed method in Section 6, and then conclude with a discussion of future work.

2 Related Work

Our work is primarily related to three major strands of research – automatic word problem solving, semantic parsing, and approaches incorporating background knowledge in learning.

2.1 Automatic Word Problem Solving

There has been a growing interest in automatically solving math word problems, with various systems focusing on particular types of problems. These can be broadly categorized into two types: arithmetic and algebra.

Arithmetic Word Problems: Arithmetic problems involve combining numbers with basic operations (addition, subtraction, multiplication and division), and are generally directed towards elementary school students. Roy and Roth (2015), Roy and Roth (2017) and this work focus on this class of word problems. The works of Hosseini et al. (2014) and Mitra and Baral (2016) focus on arithmetic problems involving only addition and subtraction.
Some of these approaches also try to incorporate some form of declarative or domain knowledge. Hosseini et al. (2014) incorporate the transfer phenomenon by classifying verbs; Mitra and Baral (2016) map problems to a set of formulas. Both require extensive annotations for intermediate steps (verb classification for Hosseini et al. (2014), alignment of numbers to formulas for Mitra and Baral (2016), etc.). In contrast, our method can handle a more general class of problems, while training only requires problem-equation pairs coupled with rate component annotations. Roy and Roth (2017) focus only on using dimensional analysis knowledge, and handle the same class of problems as we do. In contrast, our method provides a framework for including any form of declarative knowledge, exemplified here by incorporating common concepts required for arithmetic problems.

Algebra Word Problems: Algebra word problems are characterized by the use of (one or more) variables in constructing (one or more) equations. These are typically middle or high school problems. Koncel-Kedziorski et al. (2015) look at single equation problems, and Shi et al. (2015) focus on number word problems. Kushman et al. (2014) introduce a template based approach to handle general algebra word problems, and several works have later proposed improvements over this approach (Zhou et al., 2015; Upadhyay et al., 2016; Huang et al., 2017). There has also been work on generating rationales for word problem solving (Ling et al., 2017). More recently, some focus turned to pre-university exam questions (Matsuzaki et al., 2017; Hopkins et al., 2017), which require handling a wider range of problems and often more complex semantics.

2.2 Semantic Parsing

Our work is also related to learning semantic parsers from indirect supervision (Clarke et al., 2010; Liang et al., 2011). The general approach here is to learn a mapping of sentences to logical forms, with the only supervision being the response of executing the logical form on a knowledge base. Similarly, we learn to select declarative rules from supervision that only includes the final operation (and not which rule generated it). However, in contrast to the semantic parsing work, in our case the selection of each declarative rule usually requires reasoning across multiple sentences. Further, we do not require an explicit grounding of words or phrases to logical variables.

2.3 Background Knowledge in Learning

Approaches to incorporate knowledge in learning started with Explanation Based Learning (EBL) (DeJong, 1993; DeJong, 2014). EBL uses domain knowledge based on observable predicates, whereas we learn to map text to predicates of our declarative knowledge. More recent approaches tried to incorporate knowledge in the form of constraints or expectations on the output (Roth and Yih, 2004; Chang et al., 2007; Chang et al., 2012; Ganchev et al., 2010; Smith and Eisner, 2006; Naseem et al., 2010; Bisk and Hockenmaier, 2012; Gimpel and Bansal, 2014).

Finally, we note that there has been some work in the context of Question Answering on perturbing questions or answers as a way to test, or ensure, the robustness of an approach (Khashabi et al., 2016; Jia and Liang, 2017). We make use of similar ideas in order to generate an unbiased test set for math word problems (Sec. 6).

3 Knowledge Representation

Here, we introduce our representation of domain knowledge.
We organize the knowledge hierarchically in two levels – concepts and declarative rules. A math concept is a phenomenon which needs to be understood to apply reasoning over quantities. Examples of concepts include part-whole relations, dimensional analysis, etc. Under each concept, there are a few declarative rules, which dictate which operation is needed in a particular context. An example of a declarative rule under the part-whole concept is: if two numbers quantify "parts" of a larger quantity, the operation between them must be addition. These rules use concept specific predicates, which we exemplify in the following subsections.

Since this work focuses on arithmetic word problems, we consider the 4 math concepts which are most common in these problems, as follows:

1. Transfer: This involves understanding the transfer of objects from one person to another. For example, the action described by the sentence "Tim gave 5 apples to Jim" results in Tim losing "5 apples" and Jim gaining "5 apples".

2. Dimensional Analysis: This involves understanding compatibility of units or dimensions. For example, "30 pies" can be divided by "5 pies per row" to get the number of rows.

3. Part-Whole Relation: This includes asserting that if two numbers quantify parts of a larger quantity, they are to be added. For example, the problem in Section 1 involves understanding that "pecan pies" and "apple pies" are parts of "pies", and hence must be added.

4. Explicit Math: Word problems often mention explicit math relationships among quantities or entities in the problem. For example, "Jim is 5 inches taller than Tim". This concept captures the reasoning needed for such relationships.

Each of these concepts comprises a small number of declarative rules which determine the math operations; we describe them below.

3.1 Transfer

Consider the following excerpt of a word problem exhibiting a transfer phenomenon: "Stephen owns 5 books. Daniel gave him 4 books." The goal of the declarative rules is to determine which operation is required between 5 and 4, given that we know that a transfer is taking place. We note that a transfer usually involves two entities, which occur as the subject and indirect object in a sentence. The direction of the transfer is determined by the verbs associated with the entities. We define a set of variables to denote these properties: Subj1, Verb1, IObj1 denote the subject, verb and indirect object associated with the first number, and Subj2, Verb2, IObj2 the subject, verb and indirect object related to the second number. For the above example, the assignment of the variables is shown below:

[Stephen]Subj1 [owns]Verb1 5 books. [Daniel]Subj2 [gave]Verb2 [him]IObj2 4 books.

In order to determine the direction of the transfer, we require some classification of verbs. In particular, we classify each verb into one of five classes: HAVE, GET, GIVE, CONSTRUCT and DESTROY. The HAVE class consists of all verbs which signify the state of an entity, such as "have", "own", etc. The GET class contains verbs which indicate the gaining of things for the subject; examples of such verbs are "acquire", "borrow", etc. The GIVE class contains verbs which indicate the loss of things for the subject; verbs like "lend" and "give" belong to this class. Finally, the CONSTRUCT class comprises verbs indicating construction or creation, like "build", "fill", etc., while the DESTROY class contains destruction related verbs like "destroy", "eat", "use", etc.
This verb classification is largely based on the work of Hosseini et al. (2014).

Finally, the declarative rules for this concept have the following form:

[Verb1 ∈ HAVE] ∧ [Verb2 ∈ GIVE] ∧ [Coref(Subj1, IObj2)] ⇒ Addition

where Coref(A, B) is true when A and B represent the same entity or are coreferent, and is false otherwise. In the example above, Verb1 is "own", and hence [Verb1 ∈ HAVE] is true. Verb2 is "give", and hence [Verb2 ∈ GIVE] is true. Finally, Subj1 and IObj2 both refer to Stephen, so [Coref(Subj1, IObj2)] returns true. As a result, the above declarative rule dictates that addition should be performed between 5 and 4.

We have 18 such inference rules for transfer, covering all combinations of verb classes and Coref() values. All these rules generate addition or subtraction operations.
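To make the form of these rules concrete, the sketch below encodes a few of the transfer rules in Python. The verb lexicon, the `coref` stub, and the input representation are illustrative simplifications of ours, not the authors' implementation; in the full system, Coref is a learned probabilistic function (Sections 4 and 5).

```python
# Illustrative sketch of transfer-rule application (not the released code).
VERB_CLASS = {
    "have": "HAVE", "own": "HAVE",
    "get": "GET", "acquire": "GET", "borrow": "GET",
    "give": "GIVE", "lend": "GIVE",
    "build": "CONSTRUCT", "fill": "CONSTRUCT",
    "destroy": "DESTROY", "eat": "DESTROY", "use": "DESTROY",
}

def coref(a, b):
    # Stub: true if mentions a and b refer to the same entity. We assume
    # pronouns ("him") have already been resolved to their antecedents.
    return a is not None and b is not None and a.lower() == b.lower()

def transfer_op(n1, n2):
    """Infer '+' or '-' for two numbers under the transfer concept,
    following rules like [Verb1 in HAVE] ^ [Verb2 in GIVE] ^
    Coref(Subj1, IObj2) => Addition."""
    v1, v2 = VERB_CLASS[n1["verb"]], VERB_CLASS[n2["verb"]]
    same_subj = coref(n1["subj"], n2["subj"])
    # [Verb1 in HAVE] ^ [Verb2 in GIVE] ^ Coref(Subj1, IObj2) => +
    if v1 == "HAVE" and v2 == "GIVE" and coref(n1["subj"], n2["iobj"]):
        return "+"
    # [Verb1 in HAVE] ^ [Verb2 in (GET u CONSTRUCT)] ^ Coref(Subj1, Subj2) => +
    if v1 == "HAVE" and v2 in ("GET", "CONSTRUCT") and same_subj:
        return "+"
    # [Verb1 in HAVE] ^ [Verb2 in (GIVE u DESTROY)] ^ Coref(Subj1, Subj2) => -
    if v1 == "HAVE" and v2 in ("GIVE", "DESTROY") and same_subj:
        return "-"
    raise ValueError("no rule matched in this abbreviated rule set")

# "Stephen owns 5 books. Daniel gave him 4 books." -> '+'
print(transfer_op({"subj": "Stephen", "verb": "own", "iobj": None},
                  {"subj": "Daniel", "verb": "give", "iobj": "Stephen"}))
```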
3.2 Dimensional Analysis

We now look at the use of dimensional analysis knowledge in word problem solving. To use dimensional analysis, one needs to extract the units of numbers as well as the relations between the units. Consider the following excerpt of a word problem: "Stephen has 5 bags. Each bag has 4 apples." Knowing that the unit of 5 is "bag" and the effective unit of 4 is "apples per bag" allows us to infer that the numbers can be multiplied to obtain the total number of apples.

To capture these dependencies, we first introduce a few terms. Whenever a number has a unit of the form "A per B", we refer to "A" as the unit of the number, and to "B" as the rate component of the number. In our example, the unit of 4 is "apple", and the rate component of 4 is "bag". We define variables Unit1 and Rate1 to denote the unit and the rate component of the first number, respectively. We similarly define Unit2 and Rate2. For the above example, the assignment of variables is shown below:

Stephen has 5 [bags]Unit1. Each [bag]Rate2 has 4 [apples]Unit2.

Finally, the declarative rule applicable for our example has the following form:

[Coref(Unit1, Rate2)] ⇒ Multiplication

We only have 3 rules for dimensional analysis. They generate multiplication or division operations.

3.3 Explicit Math

In this subsection, we want to capture the reasoning behind explicit math relationships expressed in word problems such as: "Stephen has 5 apples. Daniel has 4 more apples than Stephen." We denote by Math1 and Math2 any explicit math term associated with the first and second numbers, respectively. As was the case for transfers, we also define Subj1, IObj1, Subj2, and IObj2 to denote the entities participating in the math relationship. The assignment of these variables in our example is:

[Stephen]Subj1 has 5 apples. [Daniel]Subj2 has 4 [more apples than]Math2 [Stephen]IObj2.

We classify explicit math terms into one of three classes – ADD, SUB and MUL. ADD comprises terms for addition, like "more than", "taller than" and "heavier than". SUB consists of terms for subtraction, like "less than", "shorter than", etc., and MUL contains terms indicating multiplication, like "times", "twice" and "thrice". Finally, the declarative rule that applies for our example is:

[Coref(Subj1, IObj2)] ∧ [Math2 ∈ ADD] ⇒ Addition

We have only 7 rules for explicit math.

3.4 Part-Whole Relation

Understanding part-whole relationships entails understanding whether two quantities are hyponyms, hypernyms or siblings (that is, co-hyponyms, or parts of the same quantity). For example, in the excerpt "Mrs. Hilt has 5 pecan pies and 4 apple pies", determining that pecan pies and apple pies are parts of all pies helps infer that addition is needed. We have 3 simple rules which directly map from Hyponym, Hypernym or Sibling detection to the corresponding math operation. For the above example, the applicable declarative rule is:

[Sibling(Number1, Number2)] ⇒ Addition

The rules for the part-whole concept can generate addition and subtraction operations.

Table 1 gives a list of all the declarative rules. Note that all the declarative rules are designed to determine an operation between two numbers only. We introduce a strategy in Section 4, which facilitates combining sub-expressions with these rules.

4 Mapping of Word Problems to Declarative Knowledge

Given an input arithmetic word problem x, the goal is to predict the math expression y which generates the correct answer. In order to derive the expression y from the word problem x, we leverage the math concepts and declarative rules that we introduced in Section 3. In order to combine two numbers mentioned in x, we first predict a concept k, and then choose a declarative knowledge rule r from k. The rule r generates the math operation needed to combine the two numbers. Consider the first example in Table 2. To combine 6 and 9, we first decide on the transfer concept, and then choose an appropriate rule under the transfer concept to generate the operation.

Next we need to combine the sub-expression (6 + 9) with the number 3. However, our inference rules were designed for the combination of two numbers only. In order to combine a sub-expression, we choose a representative number from the sub-expression, and use that number to determine the operation. In our example, we choose the number 6 as the representative number for (6 + 9), and decide the operation between 6 and 3, following a similar procedure as before. This operation is now used to combine (6 + 9) and 3.

The representative number for a sub-expression is chosen such that it preserves the reasoning needed for the combination of this sub-expression with other numbers. We follow a heuristic to choose a representative number from a sub-expression:

1. For transfers and part-whole relationships, we choose the representative number of the left subtree.

2. In the case of a rate relationship, we choose the number which does not have a rate component.
3. In the case of explicit math, we choose the number which is not directly associated with the explicit math expression.

Transfer
[Verb1 ∈ HAVE] ∧ [Verb2 ∈ HAVE] ∧ [Coref(Subj1, Subj2)] ⇒ −
[Verb1 ∈ HAVE] ∧ [Verb2 ∈ (GET ∪ CONSTRUCT)] ∧ [Coref(Subj1, Subj2)] ⇒ +
[Verb1 ∈ HAVE] ∧ [Verb2 ∈ (GIVE ∪ DESTROY)] ∧ [Coref(Subj1, Subj2)] ⇒ −
[Verb1 ∈ (GET ∪ CONSTRUCT)] ∧ [Verb2 ∈ HAVE] ∧ [Coref(Subj1, Subj2)] ⇒ −
[Verb1 ∈ (GET ∪ CONSTRUCT)] ∧ [Verb2 ∈ (GET ∪ CONSTRUCT)] ∧ [Coref(Subj1, Subj2)] ⇒ +
[Verb1 ∈ (GET ∪ CONSTRUCT)] ∧ [Verb2 ∈ (GIVE ∪ DESTROY)] ∧ [Coref(Subj1, Subj2)] ⇒ −
[Verb1 ∈ (GIVE ∪ DESTROY)] ∧ [Verb2 ∈ HAVE] ∧ [Coref(Subj1, Subj2)] ⇒ +
[Verb1 ∈ (GIVE ∪ DESTROY)] ∧ [Verb2 ∈ (GET ∪ CONSTRUCT)] ∧ [Coref(Subj1, Subj2)] ⇒ −
[Verb1 ∈ (GIVE ∪ DESTROY)] ∧ [Verb2 ∈ (GIVE ∪ DESTROY)] ∧ [Coref(Subj1, Subj2)] ⇒ +
We also have another rule for each rule above, which states that if Coref(Subj1, IObj2) or Coref(Subj2, IObj1) is true, and none of the verbs is CONSTRUCT or DESTROY, the final operation is changed from addition to subtraction, or vice versa.

Dimensional Analysis
[Coref(Unit1, Rate2) ∨ Coref(Unit2, Rate1)] ⇒ ×
[Coref(Unit1, Unit2)] ∧ [Rate2 ≠ null] ⇒ ÷
[Coref(Unit1, Unit2)] ∧ [Rate1 ≠ null] ⇒ ÷ (reverse order)

Explicit Math
[Coref(Subj1, IObj2) ∨ Coref(Subj2, IObj1)] ∧ [Math1 ∈ ADD ∨ Math2 ∈ ADD] ⇒ +
[Coref(Subj1, IObj2) ∨ Coref(Subj2, IObj1)] ∧ [Math1 ∈ SUB ∨ Math2 ∈ SUB] ⇒ −
[Coref(Subj1, Subj2)] ∧ [Math1 ∈ ADD ∨ Math2 ∈ ADD] ⇒ −
[Coref(Subj1, Subj2)] ∧ [Math1 ∈ SUB ∨ Math2 ∈ SUB] ⇒ +
[Coref(Subj1, Subj2)] ∧ [Math1 ∈ MUL] ⇒ ÷ (reverse order)
[Coref(Subj1, Subj2)] ∧ [Math2 ∈ MUL] ⇒ ÷
[Coref(Subj1, IObj2) ∨ Coref(Subj2, IObj1)] ∧ [Math1 ∈ MUL ∨ Math2 ∈ MUL] ⇒ ×

Part-Whole Relationship
[Sibling(Number1, Number2)] ⇒ +
[Hyponym(Number1, Number2)] ⇒ −
[Hypernym(Number1, Number2)] ⇒ −

Table 1: List of declarative rules used in our system. ÷ (reverse order) indicates the second number being divided by the first. To determine the order of subtraction, we always subtract the smaller number from the larger number.

4.1 Scoring Answer Derivations

Given the input word problem x, the solution math expression y is constructed by combining numbers in x with operations. We refer to the set of operations used in an expression y as O(y). Each operation o in O(y) is generated by first choosing a concept k_o, and then selecting a declarative rule r_o from that concept.

In order to discriminate between multiple candidate solution expressions of a word problem x, we score them using a linear model over features extracted from the derivation of the solution. Our scoring function has the following form:

$$\mathrm{SCORE}(x, y) = \sum_{o \in O(y)} w_k \cdot \phi_k(x, k_o) + w_r \cdot \phi_r(x, r_o)$$

where $\phi_k(x, k_o)$ and $\phi_r(x, r_o)$ are feature vectors related to concept $k_o$ and declarative rule $r_o$, respectively, and $w_k$ and $w_r$ are the corresponding weight vectors. The term $w_k \cdot \phi_k(x, k_o)$ is the score for the selection of $k_o$, and the term $w_r \cdot \phi_r(x, r_o)$ is the score for the selection of $r_o$. Finally, the total score is the sum of the scores of all concept and rule choices, over all operations of y.
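As a minimal sketch (our own notation, not the released code), the scorer reduces to a sum of dot products over the derivation; `phi_k`, `phi_r` and the weight vectors stand in for the learned model of Sections 4.2 and 5.2.

```python
import numpy as np

def score(x, derivation, w_k, w_r, phi_k, phi_r):
    """SCORE(x, y): sum, over the operations o of y, of the concept-selection
    score w_k . phi_k(x, k_o) and the rule-selection score w_r . phi_r(x, r_o).
    `derivation` lists one (k_o, r_o) pair per operation of y."""
    total = 0.0
    for k_o, r_o in derivation:
        total += float(w_k @ phi_k(x, k_o))  # score of choosing concept k_o
        total += float(w_r @ phi_r(x, r_o))  # score of choosing rule r_o
    return total
```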
Word Problem: Tim's cat had 6 kittens. He gave 3 to Jessica. Then Sara gave him 9 kittens. How many kittens does he now have? (Knowledge based answer derivation shown in the original table.)

Word Problem: Mrs. Hilt baked pies last weekend for a holiday dinner. She baked 16 pecan pies and 14 apple pies. If she wants to arrange all of the pies in rows of 5 pies each, how many rows will she have? (Knowledge based answer derivation shown in the original table.)

Table 2: Two examples of arithmetic word problems, and derivation of the answer. For each combination, first a math concept is chosen, and then a declarative rule from that concept is chosen to infer the operation.

4.2 Learning

We wish to estimate the parameters of the weight vectors $w_k$ and $w_r$, such that our scoring function assigns a higher score to the correct math expression, and a lower score to other competing math expressions. For learning the parameters, we assume access to word problems paired with the correct math expression. In Section 5, we show that certain simple heuristics and rate component annotations can be used to create somewhat noisy annotations for the concepts needed for individual operations. Hence, we will assume for our formulation access to concept supervision as well. We thus assume access to m examples of the following form: $\{(x_1, y_1, \{k_o\}_{o \in O(y_1)}), (x_2, y_2, \{k_o\}_{o \in O(y_2)}), \ldots, (x_m, y_m, \{k_o\}_{o \in O(y_m)})\}$.

We do not have any supervision for declarative rule selection, which we model as a latent variable.

Two Stage Learning: A straightforward solution for our learning problem could be to jointly learn $w_k$ and $w_r$ using a latent structured SVM. However, we found that this model does not perform well. Instead, we chose a two stage learning protocol. In the first stage, we only learn $w_r$, the weight vector for scoring the declarative rule choice. Once learned, we fix the parameters of $w_r$, and then learn the parameters of $w_k$. In order to learn the parameters of $w_r$, we solve:

$$\min_{w_r} \; \frac{1}{2}\|w_r\|^2 + C \sum_{i=1}^{m} \sum_{o \in O(y_i)} \Big[ \max_{\hat{r} \in k_o,\, \hat{r} \Rightarrow \hat{o}} \big( w_r \cdot \phi_r(x, \hat{r}) + \Delta(\hat{o}, o) \big) \;-\; \max_{\hat{r} \in k_o,\, \hat{r} \Rightarrow o} w_r \cdot \phi_r(x, \hat{r}) \Big]$$

where $\hat{r} \in k_o$ implies that $\hat{r}$ is a declarative rule for concept $k_o$, $\hat{r} \Rightarrow o$ signifies that the declarative rule $\hat{r}$ generates operation $o$, and $\Delta(\hat{o}, o)$ represents a measure of dissimilarity between operations $o$ and $\hat{o}$. The above objective is similar to that of a latent structured SVM. For each operation $o$ in the solution expression $y_i$, the objective tries to minimize the difference between the highest scoring rule from its concept $k_o$ and the highest scoring rule from $k_o$ which explains or generates the operation $o$.

Next we fix the parameters of $w_r$, and solve:

$$\min_{w_k} \; \frac{1}{2}\|w_k\|^2 + C \sum_{i=1}^{m} \Big[ \max_{y \in Y} \big( \mathrm{SCORE}(x_i, y) + \Delta(y, y_i) \big) - \mathrm{SCORE}(x_i, y_i) \Big]$$

This is equivalent to a standard structured SVM objective. We use a 0-1 loss for $\Delta(\hat{o}, o)$. Note that fixing the parameters of $w_r$ determines the scores for rule selection, removing the need for any latent variables at this stage.

4.3 Inference

Given an input word problem x, inferring the best math expression involves computing $\arg\max_{y \in Y} \mathrm{SCORE}(x, y)$, where Y is the set of all math expressions that can be created by combining the numbers in x with basic math operations. The size of Y is exponential in the number of quantities mentioned in x. As a result, we perform approximate inference using beam search. We initialize the beam with the set E of all numbers mentioned in the problem x. At each step of the beam search, we choose two numbers (or sub-expressions) e1 and e2 from E, and then select a math concept and a declarative rule to infer an operation o. We create a new sub-expression e3 by combining the sub-expressions e1 and e2 with operation o. We finally create a new set E′ from E by removing e1 and e2 from it, and adding e3 to it. We remove E from the beam, and add all such modified sets E′ to the beam. We continue this process until all sets in the beam have only one element in them. We choose the highest scoring expression among these elements as the solution expression.
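This procedure lends itself to a compact sketch. The Python below is our simplification, not the released implementation: `candidate_ops` stands in for proposing operations via (concept, rule) choices, and `score_state` stands in for the cumulative scorer of Section 4.1.

```python
import heapq

def beam_search(numbers, candidate_ops, score_state, beam_size=1000):
    """Approximate inference over expression trees (sketch of Section 4.3).
    A beam entry is (score, state), where a state is a tuple of numbers
    and sub-expressions; sub-expressions are (op, e1, e2) triples."""
    beam = [(0.0, tuple(numbers))]
    while any(len(state) > 1 for _, state in beam):
        expanded = []
        for s, state in beam:
            if len(state) == 1:          # already a full expression; keep it
                expanded.append((s, state))
                continue
            for i in range(len(state)):
                for j in range(i + 1, len(state)):
                    for op in candidate_ops(state[i], state[j]):
                        e3 = (op, state[i], state[j])   # combine e1 and e2
                        rest = tuple(e for k, e in enumerate(state)
                                     if k not in (i, j))
                        new_state = rest + (e3,)
                        expanded.append((score_state(new_state), new_state))
        beam = heapq.nlargest(beam_size, expanded, key=lambda t: t[0])
    # all states are singletons now; return the highest scoring expression
    return max(beam, key=lambda t: t[0])[1][0]
```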
5 Model and Implementation Details

5.1 Supervision

Each word problem in our dataset is annotated with the solution math expression, along with an alignment of numbers from the problem to the solution expression. In addition, we also have annotations for the numbers which possess a rate component. An example is shown in Fig 2. This is the same level of supervision used in Roy and Roth (2017). Many of the annotations can be extracted semi-automatically: the number list is extracted automatically by a number detector, the alignments require human supervision only when the same numeric value is mentioned multiple times in the problem, and most of the rate component annotations can also be extracted automatically; see Roy and Roth (2017) for details.

Problem: Mrs. Hilt baked pies last weekend for a holiday dinner. She baked 16 pecan pies and 14 apple pies. If she wants to arrange all of the pies in rows of 5 pies each, how many rows will she have?
Number List: 16, 14, 5
Solution: (16[1] + 14[2]) / 5[3] = 6
Rates: 5

Figure 2: Annotations in our dataset. Number List refers to the numbers detected in the problem. The subscripts in the solution indicate the position of the numbers in the number list.

We apply a few heuristics to obtain noisy annotations for the math concepts for operations. Consider the case of combining two numbers num1 and num2 by operation o. We apply the following rules:

1. If we detect an explicit math pattern in the neighborhood of num1 or num2, we assign the concept ko to be Explicit Math.

2. If o is multiplication or division, and one of num1 or num2 has a rate component, we assign ko to be Dimensional Analysis.

3. If o is addition or subtraction, we check if the dependent verbs of both numbers are identical. If they are, we assign ko to be a Part-Whole relationship; otherwise, we assign it to be Transfer. We extract the dependent verb using the Stanford dependency parser (Chen and Manning, 2014).

The annotations obtained via these rules are of course not perfect. We could not detect certain uncommon rate patterns like "dividing the cost 4 ways" and "I read the same number of books 4 days running". There were also part-whole relationships exhibited with complementary verbs, as in "I won 4 games, and lost 3." Both of these cases lead to noisy math concept annotations. However, we tested a small sample of these annotations, and found less than 5% of them to be wrong. As a result, we assume these annotations to be correct in our problem formulation.

5.2 Features

We use dependency parse labels and a small set of rules to extract the subject, indirect object, dependent verb, unit and rate component of each number mentioned in the problem. Details of these extractions can be found in the released codebase. Using these extractions, we define two feature functions φk(x, ko) and φr(x, ro), where x is the input word problem, and ko and ro are the concept and the declarative rule for operation o, respectively. φr(x, ro) comprises the following features:

1. If ro contains the Coref(·) function, we add features related to the similarity of the arguments of Coref(·) (Jaccard similarity score and presence of a pronoun in one of the arguments).

2. For part-whole relationships, we add indicators for a list of words like "remaining", "rest", "either", "overall", "total", conjoined with the part-whole function in ro (Hyponymy, Hypernymy, Sibling).

3. Unigrams from the neighborhood of the numbers being combined.

Finally, φk(x, ko) generates the following features:

1. If ko is related to dimensional analysis, we add features indicating the presence of a rate component in the combining numbers.

2. If ko is part-whole, we add features indicating whether the verbs of the combining numbers are identical.

Note that these features capture several interpretable functions like coreference, hyponymy, etc.

We do not learn three components of our system – verb classification for transfer knowledge, categorization of explicit math terms, and irrelevant number detection. For verb classification, we use a seed list of around 10 verbs for each category. Given a new verb v, we choose the most similar verb v′ from the seed lists according to GloVe vector (Pennington et al., 2014) based similarity, and assign v the category of v′. This can be replaced by a learned component (Hosseini et al., 2014); however, we found that the seed list based categorization worked well in most cases.
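This nearest-seed lookup admits a short sketch. In the Python below (our own illustration, not the released code), `glove` is an assumed function from a word to its vector, and the seed lists are abbreviated examples following the classes of Section 3.1.

```python
import numpy as np

SEEDS = {
    "HAVE": ["have", "own", "hold"],
    "GET": ["get", "acquire", "borrow", "receive"],
    "GIVE": ["give", "lend", "donate"],
    "CONSTRUCT": ["build", "fill", "make"],
    "DESTROY": ["destroy", "eat", "use"],
}

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def classify_verb(verb, glove):
    """Assign `verb` the class of its most GloVe-similar seed verb."""
    best_class, best_sim = None, -1.0
    for cls, seeds in SEEDS.items():
        for s in seeds:
            sim = cosine(glove(verb), glove(s))
            if sim > best_sim:
                best_class, best_sim = cls, sim
    return best_class
```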
For explicit math, we check for a small list of patterns to detect and categorize math terms. Note that for both of the cases above, we still have to learn the Coref(·) function to determine the final operation. Finally, to detect irrelevant numbers (numbers which are not used in the solution), we use a set of rules based on the units of numbers. Again, this can be replaced by a learned model (Roy and Roth, 2015).

6 Experiments

6.1 Results on Existing Dataset

We first evaluate our approach on the existing datasets AllArith, AllArithLex, and AllArithTmpl (Roy and Roth, 2017). AllArithLex and AllArithTmpl are subsets of the AllArith dataset, created to test robustness to new vocabulary and new equation forms, respectively. We compare to the top performing systems for arithmetic word problems, which are as follows:

1. TEMPLATE: Template based algebra word problem solver of Kushman et al. (2014).

2. LCA++: System of Roy and Roth (2015) based on lowest common ancestors of math expression trees.

3. UNITDEP: Unit dependency graph based solver of Roy and Roth (2017).

We refer to our approach as KNOWLEDGE. For all solvers, we use the system released by the respective authors. The TEMPLATE system expects an equation as the answer, whereas our dataset contains only math expressions. We converted expressions to equations by introducing a single variable and assigning the math expression to it. For example, an expression "(2 + 3)" gets converted to "X = (2 + 3)".

The first few columns of Table 3 show the performance of the systems on the aforementioned datasets.[1]

[1] Results on the AllArith datasets are slightly different from (Roy and Roth, 2017), since we fixed several ungrammatical sentences in the dataset.

System    | AllArith | AllArithLex | AllArithTmpl | Aggregate | AggregateLex | AggregateTmpl | Train on AllArith, Test on Perturb
TEMPLATE  | 71.96    | 64.09       | 70.64        | 54.62     | 45.05        | 54.69         | 24.2
LCA++     | 78.34    | 66.99       | 75.66        | 65.21     | 53.62        | 63.0          | 43.57
UNITDEP   | 79.67    | 71.33       | 77.11        | 69.9      | 57.51        | 68.64         | 46.29
KNOWLEDGE | 77.86    | 72.53       | 74.7         | 73.32*    | 66.63*       | 68.62         | 65.66*

Table 3: Accuracy in solving arithmetic word problems. All columns except the last report 5-fold cross validation results. * indicates statistically significant improvement (p = 0.05) over the second highest score in the column.

Problem | Solved correctly, trained on AllArith | Solved correctly, trained on Aggregate
Adam has 70 marbles. Adam gave 27 marbles to Sam. How many marbles does Adam have now? | TEMPLATE, UNITDEP, LCA, KNOWLEDGE | LCA, UNITDEP, KNOWLEDGE
Adam has 70 marbles. Sam gave 27 marbles to Adam. How many marbles does Adam have now? | KNOWLEDGE | TEMPLATE, KNOWLEDGE
Adam has 5 marbles. Sam has 6 more marbles than Adam. How many marbles does Sam have? | LCA, UNITDEP, KNOWLEDGE | LCA, UNITDEP, KNOWLEDGE
Adam has 11 marbles. Adam has 6 more marbles than Sam. How many marbles does Sam have? | TEMPLATE, KNOWLEDGE | TEMPLATE, KNOWLEDGE

Table 4: Pairs of perturbed problems, along with the systems which solve them correctly.

The performance of KNOWLEDGE is on par with or lower than that of some of the existing systems. We analyzed the systems, and found most of them to not be robust to perturbations of the problem text; Table 4 shows a few examples. We further analyzed the datasets, and identified several biases in the problems (in both train and test). Systems which remember these biases get an undue advantage in evaluation. For example, the verb "give" only appears with subtraction, and hence the models are learning an erroneous correlation of "give" with subtraction.
Since the test set also exhibits the same bias, these systems get all the "give"-related questions correct. However, they fail to solve the problem in Table 4 where "give" results in addition.

We also tested KNOWLEDGE on the addition-subtraction problems dataset released by Hosseini et al. (2014). It achieved a cross validation accuracy of 77.19%, which is competitive with the state of the art accuracy of 78% achieved with the same level of supervision. The system of Mitra and Baral (2016) achieved 86.07% accuracy on this dataset, but requires rich annotations for formulas and alignment of numbers to formulas.

6.2 New Dataset Creation

In order to remove the aforementioned biases from the dataset, we augment it with new word problems collected via a crowdsourcing platform. The new word problems are created by perturbing the original problems minimally, such that the answer is different from the original problem. For each word problem p with an answer expression a in our original dataset AllArith, we replace one operation in a to create a new math expression a′. We ask annotators to modify problem p minimally, such that a′ is now the solution to the modified word problem.

We create a′ from a either by replacing an addition with subtraction or vice versa, or by replacing multiplication with division or vice versa. We do not replace addition or subtraction with multiplication or division, since there might not be an easy perturbation that supports this conversion. We only allowed perturbed expressions which evaluate to values greater than 1. For example, we generate the expression "(3+2)" from "(3-2)"; we generated the expressions "(10+2)/4" and "(10-2)*4" for the expression "(10-2)/4". We generate all possible perturbed expressions for a given answer expression, and ask for a problem text modification for each one of them.

We show the annotators the original problem text p paired with a perturbed answer a′. The instructions advised them to copy over the given problem text, and modify it as little as possible so that the given math expression is now the solution to the modified problem. They were also instructed not to add or delete the numbers mentioned in the problem: if the original problem mentions two "3"s and one "2", the modified problem should also contain two "3"s and one "2". We manually pruned problems which did not yield the desired solution a′, or were too different from the input problem p. This procedure gave us a set of 661 new word problems, which we refer to as Perturb. Finally, we augment AllArith with the problems of Perturb, and call this new dataset Aggregate. Aggregate has a total of 1492 problems.

The addition of the Perturb problems ensures that the dataset now has problems with similar lexical items generating different answers. This minimizes the bias discussed in subsection 6.1. To quantify this, consider the probability distribution over operations for a quantity q, given that word w is present in the neighborhood of q. For an unbiased dataset, one would expect the entropy of this distribution to be high, since the presence of a single word in a number's neighborhood will seldom be completely informative for the operation. We compute the average of this entropy value over all numbers and neighborhood words in our dataset. AllArith and Perturb have an average entropy of 0.34 and 0.32, respectively, whereas Aggregate's average entropy is 0.54, indicating that, indeed, the complete dataset is significantly less biased.
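A sketch of one way to compute this measure, under our reading that the average is taken over all (number, neighborhood word) occurrences; extracting the (word, operation) pairs from the dataset is assumed done upstream.

```python
import math
from collections import Counter, defaultdict

def avg_conditional_entropy(pairs):
    """For each neighborhood word w, compute the entropy of the empirical
    distribution p(operation | w); return the average weighted by how
    often each word occurs next to a number. `pairs` yields
    (neighborhood_word, operation) tuples."""
    by_word = defaultdict(Counter)
    for word, op in pairs:
        by_word[word][op] += 1
    total, weighted = 0, 0.0
    for ops in by_word.values():
        n = sum(ops.values())
        h = -sum((c / n) * math.log2(c / n) for c in ops.values())
        total += n
        weighted += n * h
    return weighted / total
```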
6.3 Generalization from Biased Datasets

First, we evaluate the ability of systems to generalize from biased datasets. We train all systems on AllArith, and test them on Perturb (which was created by perturbing AllArith problems). The last column of Table 3 shows the performance of the systems in this setting. KNOWLEDGE outperforms all other systems in this setting, with around 19% absolute improvement over UNITDEP. This shows that declarative knowledge allows the system to learn the correct abstractions, even from biased datasets.

6.4 Results on the New Dataset

Finally, we evaluate the systems on the Aggregate dataset. Following previous work (Roy and Roth, 2017), we compute two subsets of Aggregate comprising 756 problems each, using the MAWPS (Koncel-Kedziorski et al., 2016) system. The first, called AggregateLex, is one with low lexical repetition, and the second, called AggregateTmpl, is one with low repetition of equation forms. We also evaluate on these two subsets with 5-fold cross validation. Columns 4-6 of Table 3 show the performance of the systems in this setting. KNOWLEDGE significantly outperforms the other systems on Aggregate and AggregateLex, and is similar to UNITDEP on AggregateTmpl. There is a 9% absolute improvement on AggregateLex, showing that KNOWLEDGE is significantly more robust to low lexical overlap between train and test. The last column of Table 4 also shows that the other systems do not learn the right abstraction, even when trained on Aggregate.

6.5 Analysis

Coverage of the Declarative Rules: We chose math concepts and declarative rules based on their prevalence in arithmetic word problems. We found that the four concepts introduced in this paper cover almost all the problems in our dataset, missing only 4 problems which involve the application of area formulas. We also checked the earlier arithmetic problem datasets from the works of Hosseini et al. (2014) and Roy and Roth (2015), and found that the math concepts and declarative rules introduced in this paper cover all of their problems.

A major challenge in applying these concepts and rules to algebra word problems is the use of variables in constructing equations. Variables are often implicitly described, and it is difficult to extract units, dependent verbs, and associated subjects and objects for the variables. However, we need these extractions in order to apply our declarative rules to combine variables. There has been some work on extracting the meaning of variables in algebra word problems (Roy et al., 2016); an extension of this can possibly support the application of rules in algebra word problems. We leave this exploration to future work.

Word problems from higher grades often require the application of math formulas like the ones related to area, interest, probability, etc. Extending our approach to handle such problems will involve encoding these math formulas in terms of concepts and rules, as well as adding concept specific features to the learned predictors. The declarative rules under the Explicit Math category currently handle simple cases; this set needs to be augmented to handle the complex number word problems found in algebra datasets.
Gains achieved by Declarative Rules: Table 5 shows examples of problems which KNOWLEDGE gets right, but UNITDEP does not. The gains can be attributed to the injection of declarative knowledge. Earlier systems like UNITDEP try to learn the reasoning required for these problems from the data alone. This is often difficult in the presence of limited data and noisy output from NLP tools. In contrast, we learn probabilistic models for interpretable functions like coreference, hyponymy, etc., and then use declarative knowledge involving these functions to perform reasoning. This reduces the complexity of the target function to be learned considerably, and hence we end up with a more robust model.

Isabel had 2 pages of math homework and 4 pages of reading homework. If each page had 5 problems on it, how many problems did she have to complete total?
Tim's cat had kittens. He gave 3 to Jessica and 6 to Sara. He now has 9 kittens. How many kittens did he have to start with?
Mrs. Snyder made 86 heart cookies. She made 36 red cookies, and the rest are pink. How many pink cookies did she make?

Table 5: Examples which KNOWLEDGE gets correct, but UNITDEP does not.

Effect of Beam Size: We used a beam size of 1000 in all our experiments. However, we found that varying the beam size does not affect the performance significantly: even lowering the beam size to 100 reduced performance by only 1%.

Weakness of the Approach: A weakness of our method is the requirement to have all relevant declarative knowledge during training. Many of the component functions (like coreference) are learned through latent alignments with no explicit annotations. If too many problems are not explained by the knowledge, the model will learn noisy alignments for the component functions.

Table 6 shows the major categories of errors with examples. 26% of the errors are due to extraneous number detection. We use a set of rules based on the units of numbers to detect such irrelevant numbers. As a result, we fail to detect numbers which are irrelevant due to other factors, like the associated entities or the associated verb. We can potentially expand our rule based system to detect those, or replace it by a learned module like that of Roy and Roth (2015). Another major source of errors is parsing of rate components; that is, understanding that "earns $46 cleaning a home" should be normalized to "$46 per home". Although we learn a model for the coreference function, we make several mistakes related to coreference. For the example in Table 6, we fail to detect the coreference between "team member" and "people".

Irrelevant Number Detection (26%): Sally had 39 baseball cards, and 9 were torn. Sara bought 24 of Sally's baseball cards. How many baseball cards does Sally have now?
Parsing Rate Component (26%): Mary earns $46 cleaning a home. How many homes did she clean, if she made 276 dollars?
Coreference (22%): There are 5 people on the Green Bay High track team. If a relay race is 150 meters long, how far will each team member have to run?

Table 6: Examples of errors made by KNOWLEDGE.

7 Conclusion

In this paper, we introduce a framework for incorporating declarative knowledge in word problem solving. Our knowledge based approach outperforms all other systems, and also learns better abstractions from biased datasets. Given that the variability in text is much larger than the number of declarative rules that govern math word problems, we believe that this is a good way to introduce math knowledge to a natural language understanding system. Consequently, future work will involve extending our approach to handle a wider range of word problems, possibly by supporting better grounding of implicit variables and including a larger number of math concepts and declarative rules.
An orthogonal exploration direction is to apply these techniques to generate summaries of financial or sports news, or to generate statistics of war or gun violence deaths from news corpora. A straightforward approach can be to augment news documents with a question asking for the required information, and to treat this augmented news document as a math word problem.

Code and dataset are available at https://github.com/CogComp/arithmetic.

Acknowledgments

We are grateful to the anonymous reviewers for their insightful comments. This work is funded by DARPA under agreement number FA8750-13-2-0008, and a grant from the Allen Institute for Artificial Intelligence (allenai.org).

References

Yonatan Bisk and Julia Hockenmaier. 2012. Simple robust grammar induction with combinatory categorial grammars. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI-12), pages 1643–1649, Toronto, Canada, July.

Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 280–287, Prague, Czech Republic, June. Association for Computational Linguistics.

Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2012. Structured learning with constrained conditional models. Machine Learning, 88(3):399–431, June.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. Association for Computational Linguistics.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), July.

Gerald DeJong. 1993. Investigating Explanation-Based Learning. Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers.

Gerald DeJong. 2014. Explanation-based learning. In T. Gonzalez, J. Diaz-Herrera, and A. Tucker, editors, CRC Computing Handbook: Computer Science and Software Engineering, pages 66.1–66.26. CRC Press, Boca Raton.

Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research.

Kevin Gimpel and Mohit Bansal. 2014. Weakly-supervised learning with cost-augmented contrastive estimation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1329–1341, Doha, Qatar, October. Association for Computational Linguistics.

Mark Hopkins, Cristian Petrescu-Prahova, Roie Levin, Ronan Le Bras, Alvaro Herrasti, and Vidur Joshi. 2017. Beyond sentential semantic parsing: Tackling the math SAT with a cascade of tree transducers. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 806–815, Copenhagen, Denmark, September. Association for Computational Linguistics.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Danqing Huang, Shuming Shi, Chin-Yew Lin, and Jian Yin. 2017. Learning fine-grained expressions to solve math word problems.
In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 816–825, Copenhagen, Denmark, September. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, September. Association for Computational Linguistics.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics.

Nate Kushman, Luke Zettlemoyer, Regina Barzilay, and Yoav Artzi. 2014. Learning to automatically solve algebra word problems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 271–281.

Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Takuya Matsuzaki, Takumi Ito, Hidenao Iwane, Hirokazu Anai, and Noriko H. Arai. 2017. Semantic parsing of pre-university math problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2131–2141, Vancouver, Canada, July. Association for Computational Linguistics.

Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1234–1244, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dan Roth and Wen-Tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Hwee Tou Ng and Ellen Riloff, editors, Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pages 1–8. Association for Computational Linguistics.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Subhro Roy and Dan Roth. 2017. Unit dependency graph and its application to arithmetic word problem solving.
In Proceedings of the Conference on Artificial Intelligence (AAAI).

Subhro Roy, Shyam Upadhyay, and Dan Roth. 2016. Equation parsing: Mapping sentences to grounded equations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Noah Smith and Jason Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 569–576, Stroudsburg, PA, USA. Association for Computational Linguistics.

Shyam Upadhyay, Ming-Wei Chang, Kai-Wei Chang, and Wen-tau Yih. 2016. Learning from explicit and implicit supervision jointly for algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.