Transactions of the Association for Computational Linguistics, 1 (2013) 179–192. Action Editor: Johan Bos. Submitted 1/2013; Revised 3/2013; Published 5/2013. ©2013 Association for Computational Linguistics.

Combined Distributional and Logical Semantics

Mike Lewis (School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK; mike.lewis@ed.ac.uk)
Mark Steedman (School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK; steedman@inf.ed.ac.uk)

Abstract

We introduce a new approach to semantics which combines the benefits of distributional and formal logical semantics. Distributional models have been successful in modelling the meanings of content words, but logical semantics is necessary to adequately represent many function words. We follow formal semantics in mapping language to logical representations, but differ in that the relational constants used are induced by offline distributional clustering at the level of predicate-argument structure. Our clustering algorithm is highly scalable, allowing us to run on corpora the size of Gigaword. Different senses of a word are disambiguated based on their induced types. We outperform a variety of existing approaches on a wide-coverage question answering task, and demonstrate the ability to make complex multi-sentence inferences involving quantifiers on the FraCaS suite.

1 Introduction

Mapping natural language to meaning representations is a central challenge of NLP. There has been much recent progress in unsupervised distributional semantics, in which the meaning of a word is induced based on its usage in large corpora. This approach is useful for a range of key applications including question answering and relation extraction (Lin and Pantel, 2001; Poon and Domingos, 2009; Yao et al., 2011). Because such a semantics can be automatically induced, it escapes the limitation of depending on relations from hand-built training data, knowledge bases or ontologies, which have proved of limited use in capturing the huge variety of meanings that can be expressed in language.
However, distributional semantics has largely developed in isolation from the formal semantics literature. Whilst distributional semantics has been effective in modelling the meanings of content words such as nouns and verbs, it is less clear that it can be applied to the meanings of function words. Semantic operators, such as determiners, negation, conjunctions, modals, tense, mood, aspect, and plurals are ubiquitous in natural language, and are crucial for high performance on many practical applications—but current distributional models struggle to capture even simple examples. Conversely, computational models of formal semantics have shown low recall on practical applications, stemming from their reliance on ontologies such as WordNet (Miller, 1995) to model the meanings of content words (Bobrow et al., 2007; Bos and Markert, 2005).

For example, consider what is needed to answer a question like Did Google buy YouTube? from the following sentences:

1. Google purchased YouTube
2. Google's acquisition of YouTube
3. Google acquired every company
4. YouTube may be sold to Google
5. Google will buy YouTube or Microsoft
6. Google didn't takeover YouTube

All of these require knowledge of lexical semantics (e.g. that buy and purchase are synonyms), but some also need interpretation of quantifiers, negatives, modals and disjunction. It seems unlikely that distributional or formal approaches can accomplish the task alone.

We propose a method for mapping natural language to first-order logic representations capable of capturing the meanings of function words such as every, not and or, but which also uses distributional statistics to model the meaning of content words. Our approach differs from standard formal semantics in that the non-logical symbols used in the logical form are cluster identifiers. Where standard semantic formalisms would map the verb write to a write′ symbol, we map it to a cluster identifier such as relation37, which the noun author may also map to. This mapping is learnt by offline clustering.

Unlike previous distributional approaches, we perform clustering at the level of predicate-argument structure, rather than syntactic dependency structure. This means that we abstract away from many syntactic differences that are not present in the semantics, such as conjunctions, passives, relative clauses, and long-range dependencies. This significantly reduces sparsity, so we have fewer predicates to cluster and more observations for each. Of course, many practical inferences rely heavily on background knowledge about the world—such knowledge falls outside the scope of this work.

2 Background

Our approach is based on Combinatory Categorial Grammar (CCG; Steedman, 2000), a strongly lexicalised theory of language in which lexical entries for words contain all language-specific information. The lexical entry for each word contains a syntactic category, which determines which other categories the word may combine with, and a semantic interpretation, which defines the compositional semantics. For example, the lexicon may contain the entry:

    write ⊢ (S\NP)/NP : λy λx. write′(x, y)

Crucially, there is a transparent interface between the syntactic category and the semantics. For example the transitive verb entry above defines the verb syntactically as a function mapping two noun phrases to a sentence, and semantically as a binary relation between its two argument entities. This means that it is relatively straightforward to deterministically map parser output to a logical form, as in the Boxer system (Bos, 2008). This form of semantics captures the underlying predicate-argument structure, but fails to license many important inferences—as, for example, write and author do not map to the same predicate. In addition to the lexicon, there is a small set of binary combinators and unary rules, which have a syntactic and semantic interpretation. Figure 1 gives an example CCG derivation.

    Every : NP↑/N : λp λq. ∀x[p(x) → q(x)]
    dog : N : λx. dog′(x)
    barks : S\NP : λx. bark′(x)
    Every dog (>) : NP↑ : λq. ∀x[dog′(x) → q(x)]
    Every dog barks (>) : S : ∀x[dog′(x) → bark′(x)]

Figure 1: A standard logical form derivation using CCG, with each line giving a constituent, its category, and its interpretation; (>) marks forward application. The NP↑ notation means that the subject is type-raised, taking the verb phrase as an argument—so it is an abbreviation of S/(S\NP). This is necessary in part to support a correct semantics for quantifiers.
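To make this interface concrete, the sketch below mimics the Figure 1 derivation in Python, treating semantic interpretations as functions that build logical-form strings and forward application as ordinary function application. The representation is ours, for illustration only, not the system's actual machinery.

    # Lexical semantics as Python functions over predicate strings.
    every = lambda p: lambda q: f"forall x.({p('x')} -> {q('x')})"  # NP^/N
    dog   = lambda x: f"dog'({x})"                                  # N
    barks = lambda x: f"bark'({x})"                                 # S\NP

    # "Every dog" combines by forward application (>), giving the
    # type-raised NP^, which then takes the verb phrase as its argument.
    every_dog = every(dog)
    sentence = every_dog(barks)

    print(sentence)  # forall x.(dog'(x) -> bark'(x))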
3 Overview of Approach

We attempt to learn a CCG lexicon which maps equivalent words onto the same logical form—for example learning entries such as:

    author ⊢ N/PP[of] : λx λy. relation37(x, y)
    write ⊢ (S\NP)/NP : λx λy. relation37(x, y)

The only change to the standard CCG derivation is that the symbols used in the logical form are arbitrary relation identifiers. We learn these by first mapping to a deterministic logical form (using predicates such as author′ and write′), using a process similar to Boxer, and then clustering predicates based on their arguments. This lexicon can then be used to parse new sentences, and integrates seamlessly with CCG theories of formal semantics.

Typing predicates—for example, determining that writing is a relation between people and books—has become standard in relation clustering (Schoenmackers et al., 2010; Berant et al., 2011; Yao et al., 2012). We demonstrate how to build a typing model into the CCG derivation, by subcategorizing all terms representing entities in the logical form with a more detailed type. These types are also induced from text, as explained in Section 5, but for convenience we describe them with human-readable labels, such as PER, LOC and BOOK.

A key advantage of typing is that it allows us to model ambiguous predicates. Following Berant et al. (2011), we assume that different type signatures of the same predicate have different meanings, but given a type signature a predicate is unambiguous. For example a different lexical entry for the verb born is used in the contexts Obama was born in Hawaii and Obama was born in 1961, reflecting a distinction in the semantics that is not obvious in the syntax. (Whilst this assumption is very useful, it does not always hold—for example, the genitive in Shakespeare's book is ambiguous between ownership and authorship relations even given the types of the arguments.) Typing also greatly improves the efficiency of clustering, as we only need to compare predicates with the same type during clustering (for example, we do not have to consider clustering a predicate between people and places with predicates between people and dates).

In this work, we focus on inducing binary relations. Many existing approaches have shown how to produce good clusterings of (non-event) nouns (Brown et al., 1992), any of which could be simply integrated into our semantics—but relation clustering remains an open problem (see Section 9).

N-ary relations are binarized, by creating a binary relation between each pair of arguments. For example, for the sentence Russia sold Alaska to the United States, the system creates three binary relations—corresponding to sellToSomeone(Russia, Alaska), buyFromSomeone(US, Alaska), and sellSomethingTo(Russia, US). This transformation does not exactly preserve meaning, but still captures the most important relations (a code sketch of this binarization follows Figure 2). Note that this allows us to compare semantic relations across different syntactic types—for example, both transitive verbs and argument-taking nouns can be seen as expressing binary semantic relations between entities. Figure 2 shows the layers used in our model.

    Input sentence: Shakespeare wrote Macbeth
      ⇓ Initial semantic analysis
    write_{arg0,arg1}(shakespeare, macbeth)
      ⇓ Entity typing
    write_{arg0:PER,arg1:BOOK}(shakespeare:PER, macbeth:BOOK)
      ⇓ Distributional semantic analysis
    relation37(shakespeare:PER, macbeth:BOOK)

Figure 2: Layers used in our model.
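The binarization itself is mechanical. A minimal sketch follows, with invented predicate-naming conventions (the paper's sellToSomeone-style names are glosses, not the system's actual identifiers):

    from itertools import combinations

    def binarize(lemma, args):
        """Create one binary predicate per pair of arguments of an n-ary
        relation. `args` is a list of (argument_key, entity) pairs."""
        facts = []
        for (key1, ent1), (key2, ent2) in combinations(args, 2):
            predicate = f"{lemma}_{key1},{key2}"
            facts.append((predicate, ent1, ent2))
        return facts

    # "Russia sold Alaska to the United States" yields three binary relations.
    for fact in binarize("sell", [("arg0", "Russia"),
                                  ("arg1", "Alaska"),
                                  ("argTo", "US")]):
        print(fact)
    # ('sell_arg0,arg1', 'Russia', 'Alaska')
    # ('sell_arg0,argTo', 'Russia', 'US')
    # ('sell_arg1,argTo', 'Alaska', 'US')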
4 Initial Semantic Analysis

The initial semantic analysis maps parser output onto a logical form, in a similar way to Boxer. The semantic formalism is based on Steedman (2012).

The first step is syntactic parsing. We use the C&C parser (Clark and Curran, 2004), trained on CCGBank (Hockenmaier and Steedman, 2007), using the refined version of Honnibal et al. (2010) which brings the syntax closer to the predicate-argument structure. An automatic post-processing step makes a number of minor changes to the parser output, which converts the grammar into one more suitable for our semantics. PP (prepositional phrase) and PR (phrasal verb complement) categories are sub-categorised with the relevant preposition. Noun compounds with the same MUC named-entity type (Chinchor and Robinson, 1997) are merged into a single non-compositional node (we otherwise ignore named-entity types); for example, this allows us to give Barack Obama the semantics λx.barack_obama(x) instead of λx.barack(x) ∧ obama(x), which is more convenient for collecting distributional statistics. All argument NPs and PPs are type-raised, allowing us to represent quantifiers. All prepositional phrases are treated as core arguments (i.e. given the category PP, not adjunct categories like (N\N)/NP or ((S\NP)\(S\NP))/NP), as it is difficult for the parser to distinguish arguments and adjuncts.

Initial semantic lexical entries for almost all words can be generated automatically from the syntactic category and POS tag (obtained from the parser), as the syntactic category captures the underlying predicate-argument structure. We use a Davidsonian-style representation of arguments (Davidson, 1967), which we binarize by creating a separate predicate for each pair of arguments of a word. These predicates are labelled with the lemma of the head word and a Propbank-style argument key (Kingsbury and Palmer, 2002), e.g. arg0, argIn. We distinguish noun and verb predicates based on POS tag—so, for example, we have different predicates for effect as a noun or verb.
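To make the automatic generation concrete, here is a toy sketch for one category shape. Real categories are recursive, and the naming scheme and template format below are our own illustrative assumptions, not the system's actual output format:

    def initial_entry(lemma, category, pos):
        """Build a logical-form template from the lemma, category and POS."""
        # Distinguish noun and verb predicates, so that e.g. "effect" gets
        # different predicates in its noun and verb uses.
        head = f"{lemma}.{'v' if pos.startswith('V') else 'n'}"
        if category == r"(S\NP)/NP":
            # Two arguments, labelled with Propbank-style keys arg0 and arg1.
            return rf"\y.\x.{head}_arg0,arg1(x,y)"
        raise NotImplementedError(category)

    print(initial_entry("write", r"(S\NP)/NP", "VBD"))
    # \y.\x.write.v_arg0,arg1(x,y)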
This algorithm can be overridden with manual lexical entries for specific closed-class function words. Whilst it may be possible to learn these from data, our approach is pragmatic as there are relatively few such words, and the complex logical forms required would be difficult to induce from distributional statistics. We add a small number of lexical entries for words such as negatives (no, not, etc.) and quantifiers (numbers, each, every, all, etc.). Some example initial lexical entries are shown in Figure 3.

    Automatic:
      author ⊢ N/PP[of] : λx λy. author_{arg0,argOf}(y, x)
      write ⊢ (S\NP)/NP : λx λy. write_{arg0,arg1}(y, x)
    Manual:
      every ⊢ NP↑/N : λp λq. ∀x[p(x) → q(x)]
      not ⊢ (S\NP)/(S\NP) : λp λx. ¬p(x)

Figure 3: Example initial lexical entries.

5 Entity Typing Model

Our entity-typing model assigns types to nouns, which is useful for disambiguating polysemous predicates. Our approach is similar to O'Seaghdha (2010) in that we aim to cluster entities based on the noun and unary predicates applied to them (it is simple to convert from the binary predicates to unary predicates). For example, we want the pair (born_argIn, 1961) to map to a DAT type, and (born_argIn, Hawaii) to map to a LOC type. This is non-trivial, as both the predicates and arguments can be ambiguous between multiple types—but topic models offer a good solution (described below).

5.1 Topic Model

We assume that the type of each argument of a predicate depends only on the predicate and argument, although Ritter et al. (2010) demonstrate an advantage of modelling the joint probability of the types of multiple arguments of the same predicate. We use the standard Latent Dirichlet Allocation model (Blei et al., 2003), which performs comparably to more complex models proposed in O'Seaghdha (2010).

In topic-modelling terminology, we construct a document for each unary predicate (e.g. born_argIn), based on all of its argument entities (words). We assume that these arguments are drawn from a small number of types (topics), such as PER, DAT or LOC (types are induced from the text, but we give human-readable labels here for convenience). Each type j has a multinomial distribution φ_j over arguments (for example, a LOC type is more likely to generate Hawaii than 1961). Each unary predicate i has a multinomial distribution θ_i over topics, so the born_argIn predicate will normally generate a DAT or LOC type. Sparse Dirichlet priors α and β on the multinomials bias the distributions to be peaky. The parameters are estimated by Gibbs sampling, using the Mallet implementation (McCallum, 2002). The generative story to create the data is:

    For every type k:
      Draw the p(arg|k) distribution φ_k from Dir(β)
    For every unary predicate i:
      Draw the p(type|i) distribution θ_i from Dir(α)
      For every argument j:
        Draw a type z_ij from Mult(θ_i)
        Draw an argument w_ij from Mult(φ_{z_ij})
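A toy illustration of this setup follows, with scikit-learn's variational LDA standing in for the Mallet Gibbs sampler used in the paper; the predicates, entities, and counts are invented for illustration:

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    predicates = ["born_argIn", "mayor_argOf", "published_argIn"]
    entities   = ["hawaii", "1961", "chicago", "2008"]

    # counts[i][j] = how often entity j is an argument of unary predicate i;
    # each row is the "document" for one predicate.
    counts = np.array([[5, 4, 3, 2],   # born_argIn: LOC- and DAT-like args
                       [2, 0, 6, 0],   # mayor_argOf: LOC-like args
                       [0, 3, 0, 7]])  # published_argIn: DAT-like args

    lda = LatentDirichletAllocation(n_components=2,       # number of types
                                    doc_topic_prior=0.1,  # sparse alpha
                                    topic_word_prior=0.1, # sparse beta
                                    random_state=0).fit(counts)

    theta = lda.transform(counts)  # p(type | predicate), the theta_i
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    print(np.round(theta, 2))      # phi rows are p(arg | type), the phi_k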
5.2 Typing in Logical Form

In the logical form, all constants and variables representing entities x can be assigned a distribution over types p_x(t) using the type model. An initial type distribution is applied in the lexicon, using the φ distributions for the types of nouns, and the θ_i distributions for the types of arguments of binary predicates (inverted using Bayes' rule). Then at each β-reduction in the derivation, we update the probabilities of the types to be the product of the type distributions of the terms being reduced. If two terms x and y combine to a term z:

    p_z(t) = p_x(t) p_y(t) / Σ_{t′} p_x(t′) p_y(t′)

For example, in wore a suit and file a suit, the variable representing suit may be lexically ambiguous between CLOTHES and LEGAL types, but the variables representing the objects of wear and file will have preferences that allow us to choose the correct type when the terms combine. Figure 4 shows an example derivation using the type model for disambiguation. (Our implementation follows Steedman (2012) in using Generalized Skolem Terms rather than existential quantifiers, in order to capture quantifier scope alternations monotonically, but we omit these from the example to avoid introducing new notation.)

[Figure 4: Using the type model for disambiguation in the derivation of file a suit; the derivation itself is not reproduced here. Type distributions are shown after the variable declarations. Both suit (CLOTHES=0.6, LEGAL=0.3, DOC=0.001, ...) and the object of file (DOC=0.5, LEGAL=0.4, CLOTHES=0.01, ...) are lexically ambiguous between different types, but after the β-reduction only one interpretation is likely (LEGAL=0.94, CLOTHES=0.05, DOC=0.004, ...). If the verb were wear, a different interpretation would be preferred.]
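This update is a pointwise product followed by renormalization. A minimal sketch, using the toy numbers from the file a suit example above:

    def combine_types(px, py):
        """p_z(t) proportional to p_x(t) * p_y(t) over the type inventory."""
        joint = {t: px.get(t, 0.0) * py.get(t, 0.0) for t in set(px) | set(py)}
        total = sum(joint.values())
        return {t: p / total for t, p in joint.items() if p > 0.0}

    # The object of "file" prefers DOC/LEGAL types, while "suit" is
    # ambiguous between CLOTHES and LEGAL.
    obj_of_file = {"DOC": 0.5, "LEGAL": 0.4, "CLOTHES": 0.01}
    suit = {"CLOTHES": 0.6, "LEGAL": 0.3, "DOC": 0.001}

    print(combine_types(obj_of_file, suit))
    # LEGAL dominates after the reduction: ~{LEGAL: 0.94, CLOTHES: 0.05, DOC: 0.004}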
6 Distributional Relation Clustering Model

The typed binary predicates can be grouped into clusters, each of which represents a distinct semantic relation. Note that because we cluster typed predicates, born_{arg0:PER,argIn:LOC} and born_{arg0:PER,argIn:DAT} can be clustered separately.

6.1 Corpus Statistics

Typed binary predicates are clustered based on the expected number of times they hold between each argument-pair in the corpus. This means we create a single vector of argument-pair counts for each predicate (not a separate vector for each argument). For example, the vector for the typed predicate write_{arg0:PER,arg1:BOOK} may contain non-zero counts for entity-pairs such as (Shakespeare, Macbeth), (Dickens, Oliver Twist) and (Rowling, Harry Potter). The entity-pair counts for author_{arg0:PER,argOf:BOOK} may be similar, on the assumption that both are samples from the same underlying semantic relation.

To find the expected number of occurrences of argument-pairs for typed binary predicates in a corpus, we first apply the type model to the derivation of each sentence, as described in Section 5.2. This outputs untyped binary predicates, with distributions over the types of their arguments. The type of the predicate must match the type of its arguments, so the type distribution of a binary predicate is simply the joint distribution of the two argument type distributions. For example, if the arguments in a born_{arg0,argIn}(obama, hawaii) derivation have the respective type distributions (PER=0.9, LOC=0.1) and (LOC=0.7, DAT=0.3), the distribution over binary typed predicates is (born_{arg0:PER,argIn:LOC}=0.63, born_{arg0:PER,argIn:DAT}=0.27, etc.). The expected counts for (obama, hawaii) in the vectors for born_{arg0:PER,argIn:LOC} and born_{arg0:PER,argIn:DAT} are then incremented by these probabilities.

6.2 Clustering

Many algorithms have been proposed for clustering predicates based on their arguments (Poon and Domingos, 2009; Yao et al., 2012). The number of relations in the corpus is unbounded, so the clustering algorithm should be non-parametric. It is also important that it remains tractable for very large numbers of predicates and arguments, in order to give us a greater coverage of language than can be achieved by hand-built ontologies.

We cluster the typed predicate vectors using the Chinese Whispers algorithm (Biemann, 2006)—although somewhat ad hoc, it is both non-parametric and highly scalable. (We also experimented with a Dirichlet Process Mixture Model (Neal, 2000), but even with the efficient A* search algorithms introduced by Daumé III (2007), the cost of inference was found to be prohibitively high when run at large scale.) Chinese Whispers has previously been used for noun-clustering by Fountain and Lapata (2011), who argue it is a cognitively plausible model for language acquisition. The collection of predicates and arguments is converted into a graph with one node per predicate, and edge weights representing the similarity between predicates. Predicates with different types have zero similarity, and otherwise similarity is computed as the cosine-similarity of the tf-idf vectors of argument-pairs. We prune nodes occurring fewer than 20 times, edges with weights less than 10⁻³, and a short list of stop predicates. The algorithm proceeds as follows (a sketch in code follows the steps):

1. Each predicate p is assigned to a different semantic relation r_p.
2. Iterate over the predicates p in a random order.
3. Set r_p = argmax_r Σ_{p′} 1_{r=r_{p′}} sim(p, p′), where sim(p, p′) is the distributional similarity between p and p′, and 1_{r=r′} is 1 iff r = r′ and 0 otherwise.
4. Repeat from (2.) until convergence.
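A compact sketch of the procedure on an invented three-predicate graph; here `sim` maps unordered predicate pairs to similarities (zero across different types, pruned below 10⁻³, as above), and a fixed iteration cap stands in for a convergence test:

    import random
    from collections import defaultdict

    def chinese_whispers(predicates, sim, iterations=20, seed=0):
        rng = random.Random(seed)
        # 1. Each predicate starts in its own relation cluster.
        label = {p: i for i, p in enumerate(predicates)}
        for _ in range(iterations):       # 4. repeat (fixed cap here)
            order = predicates[:]
            rng.shuffle(order)            # 2. visit in random order
            for p in order:
                # 3. adopt the label with highest total edge weight to p.
                weight = defaultdict(float)
                for q in predicates:
                    if q is not p and (w := sim.get(frozenset((p, q)), 0.0)) > 0:
                        weight[label[q]] += w
                if weight:
                    label[p] = max(weight, key=weight.get)
        return label

    preds = ["write_PER,BOOK", "author_PER,BOOK", "born_PER,LOC"]
    sims = {frozenset(("write_PER,BOOK", "author_PER,BOOK")): 0.8}
    print(chinese_whispers(preds, sims))  # write/author end up sharing a label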
7 Semantic Parsing using Relation Clusters

The final phase is to use our relation clusters in the lexical entries of the CCG semantic derivation. This is slightly complicated by the fact that our predicates are lexically ambiguous between all the possible types they could take, and hence the relations they could express. For example, the system cannot tell whether born in is expressing a birthplace or birthdate relation until later in the derivation, when it combines with its arguments. However, all the possible logical forms are identical except for the symbols used, which means we can produce a packed logical form capturing the full distribution over logical forms. To do this, we make the predicate a function from argument types to relations.

For each word, we first take the lexical semantic definition produced by the algorithm in Section 4. For binary predicates in this definition (which will be untyped), we perform a deterministic lookup in the cluster model learned in Section 6, using all possible corresponding typed predicates. This allows us to represent the binary predicates as packed predicates: functions from argument types to relations. For example, if the clustering maps born_{arg0:PER,argIn:LOC} to rel49 ("birthplace") and born_{arg0:PER,argIn:DAT} to rel53 ("birthdate"), our lexicon contains the following packed lexical entry (type distributions on the variables are suppressed):

    born ⊢ (S\NP)/PP[in] : λy λx. { (x:PER, y:LOC) ⇒ rel49; (x:PER, y:DAT) ⇒ rel53 }(x, y)

The distributions over argument types then imply a distribution over relations. For example, if the packed predicate for born_{arg0,argIn} is applied to arguments Obama and Hawaii, with respective type distributions (PER=0.9, LOC=0.1) and (LOC=0.7, DAT=0.3) (these distributions are composed from the type distributions for both the predicate and argument, as explained in Section 5), the distribution over relations will be (rel49=0.63, rel53=0.27, etc.). If 1961 has a type distribution (LOC=0.1, DAT=0.9), the output packed logical form for Obama was born in Hawaii in 1961 will be:

    {rel49=0.63, rel53=0.27, ...}(obama, hawaii) ∧ {rel49=0.09, rel53=0.81, ...}(obama, 1961)

The probability of a given logical form can be read from this packed logical form.
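A sketch of how a packed predicate, applied to argument type distributions, yields a distribution over relations; the cluster map and numbers follow the running born example, and the dictionary representation is ours:

    # A packed predicate is a function from argument types to relation
    # clusters, here encoded as a lookup table.
    cluster = {("born", "PER", "LOC"): "rel49",   # "birthplace"
               ("born", "PER", "DAT"): "rel53"}   # "birthdate"

    def relation_distribution(lemma, p_arg1, p_arg2):
        dist = {}
        for t1, p1 in p_arg1.items():
            for t2, p2 in p_arg2.items():
                rel = cluster.get((lemma, t1, t2))
                if rel is not None:
                    dist[rel] = dist.get(rel, 0.0) + p1 * p2
        return dist

    obama = {"PER": 0.9, "LOC": 0.1}
    hawaii = {"LOC": 0.7, "DAT": 0.3}
    print(relation_distribution("born", obama, hawaii))
    # {'rel49': 0.63, 'rel53': 0.27}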
8 Experiments

Our approach aims to offer a strong model of both formal and lexical semantics. We perform two evaluations, aiming to target each of these separately, but using the same semantic representations in each.

We train our system on Gigaword (Graff et al., 2003), which contains around 4 billion words of newswire. The type model is trained using 15 types, and 5,000 iterations of Gibbs sampling (using the distributions from the final sample). The number of types was chosen by examination of models trained with different numbers of types: the algorithm produces semantically coherent clusters for much larger numbers of types, but many of these are fine-grained categories of people, which introduces sparsity in the relation clustering. Table 1 shows some example types. The relation clustering uses only proper nouns, to improve precision (sparsity problems are partly offset by the large input corpus). Aside from parsing, the pipeline takes around a day to run using 12 cores.

    Type  Top Words
    1     suspect, assailant, fugitive, accomplice
    2     author, singer, actress, actor, dad
    5     city, area, country, region, town, capital
    8     subsidiary, automaker, airline, Co., GM
    10    musical, thriller, sequel, special

Table 1: Most probable terms in some clusters induced by the Type Model.

8.1 Question Answering Experiments

As yet, there is no standard way of evaluating lexical semantics. Existing tasks like Recognising Textual Entailment (RTE; Dagan et al., 2006) rely heavily on background knowledge, which is beyond the scope of this work. Intrinsic evaluations of entailment relations have low inter-annotator agreement (Szpektor et al., 2007), due to the difficulty of evaluating relations out of context.

Our evaluation is based on that performed by Poon and Domingos (2009). We automatically construct a set of questions by sampling from text, and then evaluate how many answers can be found in a different corpus. From dependency-parsed newswire, we sample X ←nsubj– verb –dobj→ Y, X ←nsubj– verb –pobj→ Y, or X ←nsubj– be –dobj→ noun –pobj→ Y patterns, where X and Y are proper nouns and the verb is not on a list of stop verbs, and deterministically convert these to questions. For example, from Google bought YouTube we create the questions What did Google buy? and What bought YouTube?. The task is to find proper-noun answers to these questions in a different corpus, which are then evaluated by human annotators based on the sentence the answer was retrieved from. (Common nouns are filtered automatically. To focus on evaluating the semantics, annotators ignored garbled sentences due to errors pre-processing the corpus; these are excluded from the results. We also automatically exclude weekday and month answers, which are overwhelmingly syntax errors for all systems—e.g. treating Tuesday as an object in Obama announced Tuesday that...) Systems can return multiple answers to the same question (e.g. What did Google buy? may have many valid answers), and all of these contribute to the result. As none of the systems model tense or temporal semantics, annotators were instructed to annotate answers as correct if they were true at any time. This approach means we evaluate on relations in proportion to corpus frequency. We sample 1000 questions from the New York Times subset of Gigaword from 2010, and search for answers in the New York Times from 2009.

We evaluate the following approaches:

• CCG-Baseline The logical form produced by our CCG derivation, without the clustering.
• CCG-WordNet The CCG logical form, plus WordNet as a model of lexical semantics.
• CCG-Distributional The logical form including the type model and clusters.
• Relational LDA An LDA-based model for clustering dependency paths (Yao et al., 2011). We train on the New York Times subset of Gigaword (around 35% of Gigaword, and the largest scale possible on our resources), using their setup of 50 iterations with 100 relation types.
• Reverb A sophisticated Open Information Extraction system (Fader et al., 2011).

Unsupervised Semantic Parsing (USP; Poon and Domingos, 2009; Poon and Domingos, 2010; Titov and Klementiev, 2011) would be another obvious baseline. However, memory requirements mean it is not possible to run at this scale (our system is trained on 4 orders of magnitude more data than the USP evaluation). Yao et al. (2011) found it had comparable performance to Relational LDA.

For the CCG models, rather than performing full first-order inference on a large corpus, we simply test whether the question predicate subsumes a candidate answer predicate, and whether the arguments match. (We do this as it is much more efficient than full first-order theorem-proving. We could in principle make additional inferences with theorem-proving, such as answering What did Google buy? from Google bought the largest video website and YouTube is the largest video website.) For CCG-WordNet, we check if the question predicate is a hypernym of the candidate answer predicate (using any WordNet sense of either term). In the case of CCG-Distributional, we calculate the probability that the two packed predicates are in the same cluster, marginalizing over their argument types, and rank answers by this probability, as in the sketch below.
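A sketch of this ranking step: given two distributions over relation clusters (as produced by a lookup like the one sketched in Section 7), sum the mass they place on shared clusters. The relation identifiers and probabilities below are invented:

    def same_relation_probability(question_rels, answer_rels):
        """Probability that two packed predicates express the same relation,
        marginalized over their argument types via the cluster distributions."""
        return sum(p * answer_rels.get(rel, 0.0)
                   for rel, p in question_rels.items())

    buy_dist = {"rel12": 0.9, "rel30": 0.1}      # e.g. "What did Google buy?"
    acquire_dist = {"rel12": 0.8, "rel45": 0.2}  # candidate: "acquisition of"
    print(same_relation_probability(buy_dist, acquire_dist))  # 0.72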
    System                    Answers  Correct
    Relational-LDA            7046     11.6%
    Reverb                    180      89.4%
    CCG-Baseline              203      95.8%
    CCG-WordNet               211      94.8%
    CCG-Distributional@250    250      94.1%
    CCG-Distributional@500    500      82.0%

Table 2: Results on the wide-coverage Question Answering task. CCG-Distributional ranks question/answer pairs by confidence—@250 means we evaluate the top 250 of these. It is not possible to give a recall figure, as the total number of correct answers in the corpus is unknown.

Results are shown in Table 2. Relational-LDA induces many meaningful clusters, but predicates must be assigned to one of 100 relations, so results are dominated by large, noisy clusters (it is not possible to take the N-best answers as the cluster assignments do not have a confidence score). The CCG-Baseline errors are mainly caused by parser errors, or relations in the scope of non-factive operators. CCG-WordNet adds few answers to CCG-Baseline, reflecting the limitations of hand-built ontologies. CCG-Distributional substantially improves recall over other approaches whilst retaining good precision, demonstrating that we have learnt a powerful model of lexical semantics. Table 3 shows some correctly answered questions. The system improves over the baseline by mapping expressions such as merge with and acquisition of to the same relation cluster. Many of the errors are caused by conflating predicates where the entailment only holds in one direction, such as was elected to with ran for. Hierarchical clustering could be used to address this.

    Question                    Answer     Sentence
    What did Delta merge with?  Northwest  The 747 freighters came with Delta's acquisition of Northwest
    What spoke with Hu Jintao?  Obama      Obama conveyed his respect for the Dalai Lama to China's president Hu Jintao during their first meeting...
    What arrived in Colorado?   Zazi       Zazi flew back to Colorado...
    What ran for Congress?      Young      ...Young was elected to Congress in 1972

Table 3: Example questions correctly answered by CCG-Distributional.

8.2 Experiments on the FraCaS Suite

We are also interested in evaluating our approach as a model of formal semantics—demonstrating that it is possible to integrate the formal semantics of Steedman (2012) with our distributional clusters. The FraCaS suite (Cooper et al., 1996; we use the version converted to machine-readable format by MacCartney and Manning (2007)) contains a hand-built set of entailment problems designed to be challenging in terms of formal semantics. We use Section 1, which contains 74 problems requiring an understanding of quantifiers (excluding 6 problems without a defined solution). They do not require any knowledge of lexical semantics, meaning we can evaluate the formal component of our system in isolation. However, we use the same representations as in our previous experiment, even though the clusters provide no benefit on this task. Figure 5 gives an example problem.

    Premises: Every European has the right to live in Europe.
              Every European is a person.
              Every person who has the right to live in Europe can travel freely within Europe.
    Hypothesis: Every European can travel freely within Europe.
    Solution: Yes

Figure 5: Example problem from the FraCaS suite.

The only previous work we are aware of on this dataset is by MacCartney and Manning (2007). This approach learns the monotonicity properties of words from a hand-built training set, and uses this to transform a sentence into a polarity-annotated string. The system then aims to transform the premise string into a hypothesis. Positively polarized words can be replaced with less specific ones (e.g. by deleting adjuncts), whereas negatively polarized words can be replaced with more specific ones (e.g. by adding adjuncts). Whilst this is high-precision and often useful, this logic is unable to perform inferences with multiple premise sentences (in contrast to our first-order logic).

Development consists of adding entries to our lexicon for quantifiers. For simplicity, we treat multi-word quantifiers like at least a few as multi-word expressions—although a more compositional analysis may be possible. Following MacCartney and Manning (2007), we do not use held-out data—each problem is designed to test a different issue, so it is not possible to generalize from one subset of the suite to another. As we are interested in evaluating the semantics, not the parser, we manually supply gold-standard lexical categories for sentences with parser errors (any syntactic mistake causes incorrect semantics). Our derivations produce a distribution over logical forms—we license the inference if it holds in any interpretation with non-zero probability. We use the Prover9 (McCune, 2005) theorem prover for inference, returning yes if the premise implies the hypothesis, no if it implies the negation of the hypothesis, and unknown otherwise.

    System                    Single Premise  Multiple Premises
    MacCartney&Manning 07     84%             -
    MacCartney&Manning 08     98%             -
    CCG-Dist (parser syntax)  70%             50%
    CCG-Dist (gold syntax)    89%             80%

Table 4: Accuracy on Section 1 of the FraCaS suite. Problems are divided into those with one premise sentence (44) and those with multiple premises (30).

Results are shown in Table 4. Our system improves on previous work by making multi-sentence inferences. Causes of errors include missing a distinct lexical entry for plural the, only taking existential interpretations of bare plurals, failing to interpret mass-noun determiners such as a lot of, and not providing a good semantics for non-monotone determiners such as most. We believe these problems will be surmountable with more work. Almost all errors are due to incorrectly predicting unknown—the system makes just one error on yes or no predictions (with or without gold syntax). This suggests that making first-order logic inferences in applications will not harm precision. We are less robust than MacCartney and Manning (2007) to syntax errors but, conversely, we are able to attempt more of the problems (i.e. those with multi-sentence premises). Other approaches based on distributional semantics seem unable to tackle any of these problems, as they do not represent quantifiers or negation.
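To make the inference step concrete, the sketch below poses the Figure 5 problem to Prover9 through NLTK's interface. It assumes the Prover9 binary is installed where NLTK can find it, and the predicate names are our own glosses of the English, not the system's cluster identifiers:

    from nltk.sem import Expression
    from nltk.inference import Prover9

    read = Expression.fromstring
    premises = [
        read("all x.(european(x) -> right_to_live_in_europe(x))"),
        read("all x.(european(x) -> person(x))"),
        read("all x.((person(x) & right_to_live_in_europe(x))"
             " -> can_travel_freely(x))"),
    ]
    hypothesis = read("all x.(european(x) -> can_travel_freely(x))")

    # True means the premises entail the hypothesis, so we answer "yes";
    # to answer "no" one would instead try to prove the negated hypothesis.
    print(Prover9().prove(hypothesis, premises))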
9 Related Work

Much work on semantics has taken place in a supervised setting—for example the GeoQuery (Zelle and Mooney, 1996) and ATIS (Dahl et al., 1994) semantic parsing tasks. This approach makes sense for generating queries for a specific database, but means the semantic representations do not generalize to other datasets. There have been several attempts to annotate larger corpora with semantics—such as Ontonotes (Hovy et al., 2006) or the Groningen Meaning Bank (Basile et al., 2012). These typically map words onto senses in ontologies such as WordNet, VerbNet (Kipper et al., 2000) and FrameNet (Baker et al., 1998). However, limitations of these ontologies mean that they do not support inferences such as X is the author of Y → X wrote Y.

Given the difficulty of annotating large amounts of text with semantics, various approaches have attempted to learn meaning without annotated text. Distant Supervision approaches leverage existing knowledge bases, such as Freebase (Bollacker et al., 2008), to learn semantics (Mintz et al., 2009; Krishnamurthy and Mitchell, 2012). Dependency-based Compositional Semantics (Liang et al., 2011) learns the meaning of questions by using their answers as denotations—but this appears to be specific to question parsing. Such approaches can only learn the pre-specified relations in the knowledge base.

The approaches discussed so far in this section have all attempted to map language onto some pre-specified set of relations. Various attempts have been made to instead induce relations from text by clustering predicates based on their arguments. For example, Yao et al. (2011) propose a series of LDA-based models which cluster relations between entities based on a variety of lexical, syntactic and semantic features. Unsupervised Semantic Parsing (Poon and Domingos, 2009) recursively clusters fragments of dependency trees based on their arguments. Although USP is an elegant model, it is too computationally expensive to run on large corpora. It is also based on frame semantics, so does not cluster equivalent predicates with different frames. To our knowledge, our work is the first such approach to be integrated within a linguistic theory supporting formal semantics for logical operators.

Vector space models represent words by vectors based on co-occurrence counts. Recent work has tackled the problem of composing these vectors to build up the semantics of phrases or sentences (Mitchell and Lapata, 2008). Another strand (Coecke et al., 2010; Grefenstette et al., 2011) has shown how to represent meanings as tensors, whose order depends on the syntactic category, allowing an elegant correspondence between syntactic and semantic types. Socher et al. (2012) train a composition function using a neural network—however their method requires annotated data. It is also not obvious how to represent logical relations such as quantification in vector-space models. Baroni et al. (2012) make progress towards this by learning a classifier that can recognise entailments such as all dogs ⇒ some dogs, but this remains some way from the power of first-order theorem proving of the kind required by the problem in Figure 5.
An alternative strand of research has attempted to build computational models of linguistic theories based on formal compositional semantics, such as the CCG-based Boxer (Bos, 2008) and the LFG-based XLE (Bobrow et al., 2007). Such approaches convert parser output into formal semantic representations, and have demonstrated some ability to model complex phenomena such as negation. For lexical semantics, they typically compile lexical resources such as VerbNet and WordNet into inference rules—but still achieve only low recall on open-domain tasks, such as RTE, mostly due to the low coverage of such resources. Garrette et al. (2011) use distributional statistics to determine the probability that a WordNet-derived inference rule is valid in a given context. Our approach differs in that we learn inference rules not present in WordNet. Our lexical semantics is integrated into the lexicon, rather than being implemented as additional inference rules, meaning that inference is more efficient, as equivalent statements have the same logical form.

Natural Logic (MacCartney and Manning, 2007) offers an interesting alternative to symbolic logics, and has been shown to be able to capture complex logical inferences by simply identifying the scope of negation in text. This approach achieves similar precision and much higher recall than Boxer on the RTE task. Their approach also suffers from such limitations as only being able to make inferences between two sentences. It is also sensitive to word order, so cannot make inferences such as Shakespeare wrote Macbeth ⇒ Macbeth was written by Shakespeare.

10 Conclusions and Future Work

This is the first work we are aware of that combines a distributionally induced lexicon with formal semantics. Experiments suggest our approach compares favourably with existing work in both areas. Many potential areas for improvement remain. Hierarchical clustering would allow us to capture hypernym relations, rather than the synonyms captured by our flat clustering. There is much potential for integrating existing hand-built resources, such as Ontonotes and WordNet, to improve the accuracy of clustering. There are cases where the existing CCGBank grammar does not match the required predicate-argument structure—for example in the case of light verbs. It may be possible to re-bank CCGBank, in a way similar to Honnibal et al. (2010), to improve it on this point.

Acknowledgements

We thank Christos Christodoulopoulos, Tejaswini Deoskar, Mark Granroth-Wilding, Ewan Klein, Katrin Erk, Johan Bos and the anonymous reviewers for their helpful comments, and Limin Yao for sharing code. This work was funded by ERC Advanced Fellowship 249520 GRAMPLUS and IP EC-FP7-270273 Xperience.

References

C.F. Baker, C.J. Fillmore, and J.B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90. Association for Computational Linguistics.

M. Baroni, R. Bernardi, N.Q. Do, and C. Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of EACL, pages 23–32.

V. Basile, J. Bos, K. Evang, and N. Venhuizen. 2012.
Developing a large semantically annotated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).

Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, HLT '11, pages 610–619. Association for Computational Linguistics.

C. Biemann. 2006. Chinese Whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pages 73–80. Association for Computational Linguistics.

D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

D.G. Bobrow, C. Condoravdi, R. Crouch, V. De Paiva, L. Karttunen, T.H. King, R. Nairn, L. Price, and A. Zaenen. 2007. Precision-focused textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, RTE '07, pages 16–21. Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA. ACM.

J. Bos and K. Markert. 2005. Recognising textual entailment with logical inference. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 628–635. Association for Computational Linguistics.

Johan Bos. 2008. Wide-coverage semantic analysis with Boxer. In Johan Bos and Rodolfo Delmonte, editors, Semantics in Text Processing. STEP 2008 Conference Proceedings, Research in Computational Semantics, pages 277–286. College Publications.

P.F. Brown, P.V. Desouza, R.L. Mercer, V.J.D. Pietra, and J.C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

N. Chinchor and P. Robinson. 1997. MUC-7 named entity task definition. In Proceedings of the 7th Conference on Message Understanding.

Stephen Clark and James R. Curran. 2004. Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL '04. Association for Computational Linguistics.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2010. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis: A Festschrift for Joachim Lambek, 36(1–4):345–384.

Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. FraCaS Deliverable D, 16.

Ido Dagan, O. Glickman, and B. Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190.

D.A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pages 43–48. Association for Computational Linguistics.

Hal Daumé III. 2007. Fast search for Dirichlet process mixture models.
In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AIStats), San Juan, Puerto Rico.

D. Davidson. 1967. The logical form of action sentences. Essays on Actions and Events, 1(9):105–149.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1535–1545. Association for Computational Linguistics.

T. Fountain and M. Lapata. 2011. Incremental models of natural language category acquisition. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society.

D. Garrette, K. Erk, and R. Mooney. 2011. Integrating logical representations with probabilistic information using Markov logic. In Proceedings of the Ninth International Conference on Computational Semantics, pages 105–114. Association for Computational Linguistics.

D. Graff, J. Kong, K. Chen, and K. Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.

Edward Grefenstette, Mehrnoosh Sadrzadeh, Stephen Clark, Bob Coecke, and Stephen Pulman. 2011. Concrete sentence spaces for compositional distributional models of meaning. Computational Semantics IWCS 2011, page 125.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

M. Honnibal, J.R. Curran, and J. Bos. 2010. Rebanking CCGbank for improved NP interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 207–215. Association for Computational Linguistics.

E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60. Association for Computational Linguistics.

P. Kingsbury and M. Palmer. 2002. From TreeBank to PropBank. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002), pages 1989–1993.

K. Kipper, H.T. Dang, and M. Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of the National Conference on Artificial Intelligence, pages 691–696. AAAI Press.

Jayant Krishnamurthy and Tom M. Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 754–765. Association for Computational Linguistics.

P. Liang, M.I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the Association for Computational Linguistics (ACL).

Dekang Lin and Patrick Pantel. 2001. DIRT—Discovery of Inference Rules from Text. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 323–328.

Bill MacCartney and Christopher D. Manning. 2007. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, RTE '07, pages 193–200. Association for Computational Linguistics.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

W. McCune. 2005. Prover9 and Mace4. http://cs.unm.edu/~mccune/mace4/.

G.A. Miller. 1995. WordNet: a lexical database for English.
Communications of the ACM, 38(11):39–41.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 2, pages 1003–1011. Association for Computational Linguistics.

J. Mitchell and M. Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244.

R.M. Neal. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.

D.O. O'Seaghdha. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 435–444. Association for Computational Linguistics.

Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 1, EMNLP '09, pages 1–10. Association for Computational Linguistics.

Hoifung Poon and Pedro Domingos. 2010. Unsupervised ontology induction from text. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 296–305. Association for Computational Linguistics.

A. Ritter, O. Etzioni, et al. 2010. A latent Dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 424–434. Association for Computational Linguistics.

Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1088–1098. Association for Computational Linguistics.

R. Socher, B. Huval, C.D. Manning, and A.Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

Mark Steedman. 2000. The Syntactic Process. MIT Press.

Mark Steedman. 2012. Taking Scope: The Natural Semantics of Quantifiers. MIT Press.

Idan Szpektor, Eyal Shnarch, and Ido Dagan. 2007. Instance-based evaluation of entailment rule acquisition. In Proceedings of ACL 2007, volume 45, page 456.

Ivan Titov and Alexandre Klementiev. 2011. A Bayesian model for unsupervised semantic parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1445–1455, Portland, Oregon, USA, June. Association for Computational Linguistics.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1456–1466. Association for Computational Linguistics.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In ACL (1), pages 712–720.

J.M. Zelle and R.J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050–1055.