From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier
Department of Computer Science, University of Illinois at Urbana-Champaign
{pyoung2, aylai2, mhodosh2, juliahmr}@illinois.edu

Abstract

We propose to use the visual denotations of linguistic expressions (i.e. the set of images they describe) to define novel denotational similarity metrics, which we show to be at least as beneficial as distributional similarities for two tasks that require semantic inference. To compute these denotational similarities, we construct a denotation graph, i.e. a subsumption hierarchy over constituents and their denotations, based on a large corpus of 30K images and 150K descriptive captions.

1 Introduction

The ability to draw inferences from text is a prerequisite for language understanding. These inferences are what makes it possible for even brief descriptions of everyday scenes to evoke rich mental images. For example, we would expect an image of people shopping in a supermarket to depict aisles of produce or other goods, and we would expect most of these people to be customers who are either standing or walking around. But such inferences require a great deal of commonsense world knowledge. Standard distributional approaches to lexical similarity (Section 2.1) are very effective at identifying which words are related to the same topic, and can provide useful features for systems that perform semantic inferences (Mirkin et al., 2009), but are not suited to capture precise entailments between complex expressions. In this paper, we propose a novel approach for the automatic acquisition of denotational similarities between descriptions of everyday situations (Section 2). We define the (visual) denotation of a linguistic expression as the set of images it describes. We create a corpus of images of everyday activities (each paired with multiple captions; Section 3) to construct a large scale visual denotation graph which associates image descriptions with their denotations (Section 4). The algorithm that constructs the denotation graph uses purely syntactic and lexical rules to produce simpler captions (which have a larger denotation). But since each image is originally associated with several captions, the graph can also capture similarities between syntactically and lexically unrelated descriptions. We apply these similarities to two different tasks (Sections 6 and 7): an approximate entailment recognition task for our domain, where the goal is to decide whether the hypothesis (a brief image caption) refers to the same image as the premises (four longer captions), and the recently introduced Semantic Textual Similarity task (Agirre et al., 2012), which can be viewed as a graded (rather than binary) version of paraphrase detection. Both tasks require semantic inference, and our results indicate that denotational similarities are at least as effective as standard approaches to similarity. Our code and data set, as well as the denotation graph itself and the lexical similarities we define over it, are available for research purposes at http://nlp.cs.illinois.edu/Denotation.html.
2 Towards Denotational Similarities

2.1 Distributional Similarities

The distributional hypothesis posits that linguistic expressions that appear in similar contexts have a similar meaning (Harris, 1954). This has led to the definition of vector-based distributional similarities, which represent each word w as a vector w derived from counts of w's co-occurrence with other words. These vectors can be used directly to compute the lexical similarities of words, either via the cosine of the angle between them, or via other, more complex metrics (Lin, 1998). More recently, asymmetric similarities have been proposed as more suitable for semantic inference tasks such as entailment (Weeds and Weir, 2003; Szpektor and Dagan, 2008; Clarke, 2009; Kotlerman et al., 2010). Distributional word vectors can also be used to define the compositional similarity of longer strings (Mitchell and Lapata, 2010). To compute the similarity of two strings, the lexical vectors of the words in each string are first combined into a single vector (e.g. by element-wise addition or multiplication), and then an appropriate vector similarity (e.g. cosine) is applied to the resulting pair of vectors.

Figure 1: Two images from our data set and their five captions.
Image 1: "Gray haired man in black suit and yellow tie working in a financial environment." / "A graying man in a suit is perplexed at a business meeting." / "A businessman in a yellow tie gives a frustrated look." / "A man in a yellow tie is rubbing the back of his neck." / "A man with a yellow tie looks concerned."
Image 2: "A butcher cutting an animal to sell." / "A green-shirted man with a butcher's apron uses a knife to carve out the hanging carcass of a cow." / "A man at work, butchering a cow." / "A man in a green t-shirt and long tan apron hacks apart the carcass of a cow while another man hoses away the blood." / "Two men work in a butcher shop; one cuts the meat from a butchered cow, while the other hoses the floor."

2.2 Visual Denotations

Our approach is inspired by truth-conditional semantic theories in which the denotation of a declarative sentence is assumed to be the set of all situations or possible worlds in which the sentence is true (Montague, 1974; Dowty et al., 1981; Barwise and Perry, 1980). Restricting our attention to visually descriptive sentences, i.e. non-negative, episodic (Carlson, 2005) sentences that can be used to describe an image (Figure 1), we propose to instantiate the abstract notions of possible worlds or situations with concrete sets of images. The interpretation function ⟦·⟧ maps a sentence s to its visual denotation ⟦s⟧, the set of images in a 'universe' of images U that s describes:

    ⟦s⟧ = {i ∈ U | s is a truthful description of i}    (1)

Similarly, we map nouns and noun phrases to the set of images that depict the objects they describe, and verbs and verb phrases to the set of images that depict the events they describe.

2.3 Denotation Graphs

Denotations induce a partial ordering over descriptions: if s (e.g. "a poodle runs on the beach") entails a description s′ (e.g. "a dog runs"), its denotation is a subset of the denotation of s′ (⟦s⟧ ⊆ ⟦s′⟧), and we say that s′ subsumes the more specific s (s′ ⊑ s).
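To make the set-theoretic reading concrete, the following Python sketch represents denotations as sets of image identifiers and tests the subsumption relation just defined. The captions and image IDs are invented for illustration only; they are not drawn from our corpus or code.

# Minimal illustration of visual denotations as sets of image IDs.
# The image IDs and captions below are invented for illustration.
denotation = {
    "a poodle runs on the beach": {"img_017"},
    "a dog runs on the beach": {"img_017", "img_203"},
    "a dog runs": {"img_017", "img_203", "img_411", "img_532"},
}

def subsumes(general: str, specific: str) -> bool:
    """general ⊑ specific holds iff ⟦specific⟧ ⊆ ⟦general⟧."""
    return denotation[specific] <= denotation[general]

print(subsumes("a dog runs", "a poodle runs on the beach"))   # True
print(subsumes("a poodle runs on the beach", "a dog runs"))   # False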
In our domain of descriptive sentences, we can obtain more generic descriptions by simple syntactic and lexical operations ω ∈ O ⊂ S × S that preserve upward entailment, so that if ω(s) = s′, ⟦s⟧ ⊆ ⟦s′⟧. We consider three types of operations: the removal of optional material (e.g. PPs like on the beach), the extraction of simpler constituents (NPs, VPs, or simple Ss), and lexical substitutions of nouns by their hypernyms (poodle → dog). These operations are akin to the atomic edits of MacCartney and Manning (2008)'s NatLog system, and allow us to construct large subsumption hierarchies over image descriptions, which we call denotation graphs. Given a set of (upward entailment-preserving) operations O ⊂ S × S, the denotation graph DG = 〈E, V〉 of a set of images I and a set of strings S represents a subsumption hierarchy in which each node v = 〈s, ⟦s⟧〉 corresponds to a string s ∈ S and its denotation ⟦s⟧ ⊆ I. Directed edges e = (s, s′) ∈ E ⊆ V × V indicate a subsumption relation s ⊑ s′ between a more generic expression s and its child s′. An edge from s to s′ exists if there is an operation ω ∈ O that reduces the string s′ to s (i.e. ω(s′) = s) and its inverse ω⁻¹ expands the string s to s′ (i.e. ω⁻¹(s) = s′).

2.4 Denotational Similarities

Given a denotation graph over N images, we estimate the denotational probability of an expression s with a denotation of size |⟦s⟧| as P⟦⟧(s) = |⟦s⟧| / N, and the joint probability of two expressions analogously as P⟦⟧(s, s′) = |⟦s⟧ ∩ ⟦s′⟧| / N. The conditional probability P⟦⟧(s | s′) indicates how likely s is to be true when s′ holds, and yields a simple directed denotational similarity. The (normalized) pointwise mutual information (PMI) (Church and Hanks, 1990) defines a symmetric similarity:

    nPMI⟦⟧(s, s′) = log( P⟦⟧(s, s′) / (P⟦⟧(s) P⟦⟧(s′)) ) / ( −log P⟦⟧(s, s′) )

We set P⟦⟧(s | s) = nPMI⟦⟧(s, s) = 1, and, if s or s′ are not in the denotation graph, nPMI⟦⟧(s, s′) = P⟦⟧(s, s′) = 0.
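These quantities are straightforward to compute once denotations are available as sets of image identifiers. The sketch below is a minimal illustration of the definitions above, not the released implementation; the function and argument names are ours.

import math

def p(den, n_images):
    """Denotational probability P⟦⟧(s) = |⟦s⟧| / N."""
    return len(den) / n_images

def p_joint(den_s, den_t, n_images):
    """Joint probability P⟦⟧(s, s′) = |⟦s⟧ ∩ ⟦s′⟧| / N."""
    return len(den_s & den_t) / n_images

def p_cond(den_s, den_t):
    """Directed similarity P⟦⟧(s | s′): how likely s is to hold when s′ does."""
    if not den_t:
        return 0.0
    return len(den_s & den_t) / len(den_t)

def npmi(den_s, den_t, n_images):
    """Normalized PMI over denotations; 0 if the expressions never co-occur."""
    joint = p_joint(den_s, den_t, n_images)
    if joint == 0.0:
        return 0.0
    if joint == 1.0:
        # Degenerate case: both expressions describe every image.  We treat them
        # as identical, by analogy with the convention nPMI(s, s) = 1 (our choice).
        return 1.0
    pmi = math.log(joint / (p(den_s, n_images) * p(den_t, n_images)))
    return pmi / -math.log(joint)

Note that for s = s′ the sketch returns exactly 1, matching the convention stated above.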
3 Our Data Set

Our data set (Figure 1) consists of 31,783 photographs of everyday activities, events and scenes (all harvested from Flickr) and 158,915 captions (obtained via crowdsourcing). It contains and extends Hodosh et al. (2013)'s corpus of 8,092 images. We followed Hodosh et al. (2013)'s approach to collect images. We also use their annotation guidelines, and use similar quality controls to correct spelling mistakes and eliminate ungrammatical or non-descriptive sentences. Almost all of the images that we add to those collected by Hodosh et al. (2013) have been made available under a Creative Commons license. Each image is described independently by five annotators who are not familiar with the specific entities and circumstances depicted in them, resulting in captions such as "Three people setting up a tent", rather than the kind of captions people provide for their own images ("Our trip to the Olympic Peninsula"). Moreover, different annotators use different levels of specificity, from describing the overall situation (performing a musical piece) to specific actions (bowing on a violin). This variety of descriptions associated with the same image is what allows us to induce denotational similarities between expressions that are not trivially related by syntactic rewrite rules.

4 Constructing the Denotation Graph

The construction of the denotation graph consists of the following steps: preprocessing and linguistic analysis of the captions, identification of applicable transformations, and generation of the graph itself.

Preprocessing and Linguistic Analysis We use the Linux spell checker, the OpenNLP tokenizer, POS tagger and chunker (http://opennlp.apache.org), and the Malt parser (Nivre et al., 2006) to analyze the captions. Since the vocabulary of our corpus differs significantly from the data these tools are trained on, we resort to a number of heuristics to improve the analyses they provide. Since some heuristics require us to identify different entity types, we developed a lexicon of the most common entity types in our domain (people, clothing, bodily appearance (e.g. hair or body parts), containers of liquids, food items and vehicles). After spell-checking, we normalize certain words and compounds with several spelling variations, e.g. barbecue (barbeque, BBQ), gray (grey), waterski (water ski), brown-haired (brown haired), and tokenize the captions using the OpenNLP tokenizer. The OpenNLP POS tagger makes a number of systematic errors on our corpus (e.g. mistagging main verbs as nouns). Since these errors are highly systematic, we are able to correct them automatically by applying deterministic rules (e.g. climbs is never a noun in our corpus; stand is a noun if it is preceded by vegetable, but a verb when preceded by a noun that refers to people). These fixes apply to 27,784 (17%) of the 158,915 image captions. Next, we use the OpenNLP chunker to create a shallow parse. Fixing its (systematic) errors affects 28,587 captions. We then analyze the structure of each NP chunk to identify heads, determiners and pre-nominal modifiers. The head may include more than a single token if WordNet (or our hypernym lexicon, described below) contains a corresponding entry (e.g. little girl). Determiners include phrases such as a couple or a few. Although we use the Malt parser (Nivre et al., 2006) to identify subject-verb-object dependencies, we have found it more accurate to develop deterministic heuristics and lexical rules to identify the boundaries of complex (e.g. conjoined) NPs, allowing us to treat "a man with red shoes and a white hat" as an NP followed by a single PP, but "a man with red shoes and a white-haired woman" as two NPs, and to transform e.g. "standing by a man and a woman" into "standing" and not "standing and a woman" when dropping the PP.

Hypernym Lexicon We use our corpus and WordNet to construct a hypernym lexicon that allows us to replace head nouns with more generic terms. We only consider hypernyms that occur themselves with sufficient frequency in the original captions (replacing "adult" with "person", but not with "organism"). Since the language in our corpus is very concrete, each noun tends to have a single sense, allowing us to always replace it with the same hypernyms.¹ But since WordNet provides us with multiple senses for most nouns, we first have to identify which sense is used in our corpus. To do this, we use the heuristic cross-caption coreference algorithm of Hodosh et al. (2010) to identify coreferent NP chunks among the original five captions of each image.² For each ambiguous head noun, we consider every non-singleton coreference chain it appears in, and reduce its synsets to those that stand in a hypernym-hyponym relation with at least one other head noun in the chain. Finally, we apply a greedy majority voting algorithm to iteratively narrow down each term's senses to a single synset that is compatible with the largest number of coreference chains it occurs in.

¹ Descriptions of people that refer to both age and gender (e.g. "man") can have multiple distinct hypernyms ("adult"/"male"). Because our annotators never describe young children or babies as "persons", we only allow terms that are likely to describe adults or teenagers (including occupations) to be replaced by the term "person". This means that the term "girl" has two senses: a female child (the default) or a younger woman. We distinguish the two senses in a preprocessing step: if the other captions of the same image do not mention children, but refer to teenaged or adult women, we assign girl the woman-sense. Some nouns that end in -er (e.g. "diner", "pitcher") also violate our monosemy assumption.

² Coreference resolution has also been used for word sense disambiguation by Preiss (2001) and Hu and Liu (2011).
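The chain-based sense selection can be pictured roughly as follows. This is a simplified sketch rather than our implementation: WordNet lookup is replaced by a toy synset table and relatedness predicate, and the greedy iterative step is collapsed into a single vote count.

from collections import Counter

# Toy stand-ins for WordNet synsets and hypernym/hyponym links (illustrative only).
SYNSETS = {"pitcher": ["pitcher.n.01_container", "pitcher.n.02_ballplayer"],
           "jug": ["jug.n.01_container"],
           "player": ["player.n.01_ballplayer"]}
RELATED = {("pitcher.n.01_container", "jug.n.01_container"),
           ("pitcher.n.02_ballplayer", "player.n.01_ballplayer")}

def related(s1, s2):
    """Toy hypernym/hyponym test between two synsets."""
    return (s1, s2) in RELATED or (s2, s1) in RELATED

def pick_sense(noun, chains):
    """Choose the synset of `noun` compatible with the most coreference chains.

    `chains` is a list of non-singleton chains, each a list of co-referring head nouns.
    """
    votes = Counter()
    for chain in chains:
        others = [s for head in chain if head != noun for s in SYNSETS.get(head, [])]
        for cand in SYNSETS[noun]:
            if any(related(cand, o) for o in others):
                votes[cand] += 1
    return votes.most_common(1)[0][0] if votes else SYNSETS[noun][0]

# "pitcher" corefers with "jug" in two images' captions -> the container sense wins.
print(pick_sense("pitcher", [["pitcher", "jug"], ["pitcher", "jug"], ["pitcher", "player"]]))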
Caption Normalization In order to increase the recall of the denotations we capture, we drop all punctuation marks, and lemmatize nouns, verbs, and adjectives that end in "-ed" or "-ing" before generating the denotation graph. In order to distinguish between frequently occurring homonyms where the noun is unrelated to the verb, we change all forms of the verb dress to dressed, all forms of the verb stand to standing, and all forms of the verb park to parking. Finally, we drop sentence-initial there/here/this is/are (as in there is a dog splashing in the water), and normalize the expressions in X and dressed (up) in X (where X is an article of clothing or a color) to wear X. We reduce plural determiners to {two, three, some}, and drop singular determiners except for no.

4.1 Rule Templates

The denotation graph contains a directed edge from s to s′ if there is a rule ω that reduces s′ to s, with an inverse ω⁻¹ that expands s to s′. Reduction rules can drop optional material, extract simpler constituents, or perform lexical substitutions.

Drop Pre-Nominal Modifiers: "red shirt" → "shirt" In an NP of the form "X Y Z", where X and Y both modify the head Z, we only allow X and Y to be dropped separately if "X Z" and "Y Z" both occur elsewhere in the corpus. Since "white building" and "stone building" occur elsewhere in the corpus, we generate both "white building" and "stone building" from the NP "white stone building". But since "ice player" is not used, we replace "ice hockey player" only with "hockey player" (which does occur) and then "player".

Drop Other Modifiers: "run quickly" → "run" We drop ADVP chunks and adverbs in VP chunks. We also allow a prepositional phrase (a preposition followed by a possibly conjoined NP chunk) to be dropped if the preposition is locational ("in", "on", "above", etc.), directional ("towards", "through", "across", etc.), or instrumental ("by", "for", "with"). Similarly, we also allow the dropping of all "wear NP" constructions. Since the distinction between particles and prepositions is often difficult, we also use a predefined list of phrasal verbs that commonly occur in our corpus to identify constructions such as "climb up a mountain", which is transformed into "climb a mountain", or "walk down a street", which is transformed into "walk".
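One reading of the attestation condition on pre-nominal modifier drops is sketched below; the chunk representation and the attested set are illustrative assumptions, not the actual corpus machinery, and each call produces only single-step reductions (one dropped modifier per generated edge).

def drop_prenominal_modifiers(np_tokens, head, attested):
    """Generate reduced NPs "X Y Z" -> "X Z" / "Y Z", keeping only corpus-attested results.

    `np_tokens` is the chunk as a token list, `head` its head noun, and `attested`
    a set of noun phrases observed in the original captions (all assumed inputs).
    """
    modifiers = [t for t in np_tokens if t != head]
    reduced = []
    for m in modifiers:
        candidate = " ".join(t for t in np_tokens if t != m)
        # A drop is licensed if the reduced NP is attested, or if only the bare head remains.
        if candidate in attested or candidate == head:
            reduced.append(candidate)
    return reduced

attested = {"white building", "stone building", "hockey player", "player"}
print(drop_prenominal_modifiers(["white", "stone", "building"], "building", attested))
# ['stone building', 'white building']  -> both drops are licensed
print(drop_prenominal_modifiers(["ice", "hockey", "player"], "player", attested))
# ['hockey player']                     -> "ice player" is unattested, so only "ice" can be dropped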
Replace Nouns by Hypernyms: "red shirt" → "red clothing" We iteratively use our hypernym lexicon to make head nouns more generic. We only allow head nouns to be replaced by their hypernyms if any age-based modifiers have already been removed: "toddler" can be replaced with "child", but not "older toddler" with "older child".

Handle Partitive NPs: "cup of tea" → "cup", "tea" In most partitive NP1-of-NP2 constructions ("cup of tea", "a team of football players") the corresponding entity can be referred to by both the first and the second NP. Exceptions include the phrase "body of water", and expressions such as "a kind/type/sort of", which we treat similarly to determiners.

Handle VP1-to-VP2 Cases Depending on the first verb, we replace VPs of the form X to Y with both X and Y if X is a movement or posture (jump to catch, etc.). Otherwise we distinguish between cases we can only replace with X (wait to jump) and those we can only replace with Y (seem to jump).

Extract Simpler Constituents Any noun phrase or verb phrase can also be used as a node in the graph and simplified further. We use the Malt dependencies (and the person terms in the entity type lexicon) to identify and extract subject-verb-object chunks which correspond to simpler sentences that we would otherwise not be able to obtain: from "man laugh(s) while drink(ing)", we extract "man laugh" and "man drink", and then further split those into "man", "laugh(s)", and "drink".

4.2 Graph Generation

The naive approach to graph generation would be to generate all possible strings for each caption. However, this would produce far more strings than can be processed in a reasonable amount of time, and most of these strings would have uninformative denotations, consisting of only a single image. To make graph generation tractable, we use a top-down algorithm which generates the graph from the most generic (root) nodes, and stops at nodes that have a singleton denotation (Figure 2). We first identify the set of rules that can apply to each original caption (GenerateRules). These rules are then used to reduce each caption as much as possible. The resulting (maximally generic) strings are added as root nodes to the graph (RootNodes), and added to the queue Q. Q keeps track of all currently possible node expansions. It contains items 〈c, s〉, which pair the ID of an original caption and its image (c) with a string (s) that corresponds to an existing node in the graph and can be derived from c's caption. When 〈c, s〉 is processed, we check how many captions have generated s so far (Captions(s)). If s has more than a single caption, we use each of the applicable rewrite rules of c's caption to create new strings s′ that correspond to the children of s in the graph, and push all resulting 〈c, s′〉 onto Q. If c is the second caption of s, we also use all of the applicable rewrite rules from the first caption c′ to create its children. A post-processing step (not shown in Figure 2) attaches each original caption to all leaf nodes of the graph to which it can be reduced.

GENERATEGRAPH():
    Q, Captions, Rules ← ∅
    for all c ∈ ImageCorpus do
        Rules(c) ← GenerateRules(s_c)
        pushAll(Q, {c} × RootNodes(s_c, Rules(c)))
    while ¬empty(Q) do
        (c, s) ← pop(Q)
        Captions(s) ← Captions(s) ∪ {c}
        if |Captions(s)| = 2 then
            for all c′ ∈ Captions(s) do
                pushAll(Q, {c′} × Children(s, Rules(c′)))
        else if |Captions(s)| > 2 then
            pushAll(Q, {c} × Children(s, Rules(c)))

Figure 2: Generating the graph

Finally, we obtain the denotation of each node s from the set of images whose captions are in Captions(s).
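For concreteness, the loop of Figure 2 can be rendered in Python as follows. The rule machinery of Section 4.1 (GenerateRules, RootNodes, Children) is passed in as opaque callables, so this is a schematic rendering rather than the released implementation.

from collections import defaultdict, deque

def generate_graph(image_corpus, generate_rules, root_nodes, children):
    """Top-down graph generation following Figure 2.

    `generate_rules`, `root_nodes`, and `children` are assumed callables standing in
    for the rule machinery of Section 4.1; `image_corpus` maps caption IDs to captions.
    Returns a mapping from node strings to the set of caption IDs that generate them.
    """
    captions = defaultdict(set)   # string -> set of caption IDs that generate it
    rules = {}
    queue = deque()
    for c, caption in image_corpus.items():
        rules[c] = generate_rules(caption)
        for s in root_nodes(caption, rules[c]):
            queue.append((c, s))
    while queue:
        c, s = queue.popleft()
        captions[s].add(c)
        if len(captions[s]) == 2:
            # The node just became informative: expand it for every caption seen so far.
            for c2 in captions[s]:
                for s2 in children(s, rules[c2]):
                    queue.append((c2, s2))
        elif len(captions[s]) > 2:
            # Already expanded once; only the newly arrived caption contributes children.
            for s2 in children(s, rules[c]):
                queue.append((c, s2))
    return captions

The denotation of a node s is then read off as the set of images whose captions appear in captions[s], as described above.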
5 The Denotation Graph

Size and Coverage On our corpus of 158,439 unique captions and 31,783 images, the denotation graph contains 1,749,097 captions, of which 230,811 describe more than a single image. Table 1 provides the distribution of the size of denotations. It is perhaps surprising that the 161 captions that each describe over 1,000 images do not just consist of nouns such as person, but also contain simple sentences such as woman standing, adult work, person walk street, or person play instrument. Since the graph is derived from the original captions by very simple syntactic operations, the denotations it captures are most likely incomplete: ⟦soccer player⟧ contains 251 images, ⟦play soccer⟧ contains 234 images, and ⟦soccer game⟧ contains 119 images. We have not yet attempted to identify variants in word order ("stick tongue out" vs. "stick out tongue") or equivalent choices of preposition ("look into mirror" vs. "look in mirror"). Despite this brittleness, the current graph already gives us a large number of semantic associations.

Table 1: Distribution of the size of denotations in our graph
    Size of denotations    |⟦s⟧| ≥ 1    |⟦s⟧| ≥ 2    |⟦s⟧| ≥ 5    |⟦s⟧| ≥ 10    |⟦s⟧| ≥ 100    |⟦s⟧| ≥ 1000
    Nr. of captions        1,749,096    230,811      53,341       22,683        1,921          161

Denotational Similarities The following examples of the similarities found by nPMI⟦⟧ and P⟦⟧ show that denotational similarities do not simply find topically related events, but instead find events that are related by entailment:

    P⟦⟧(x|y)    x              y
    0.962       sit            eat lunch
    0.846       play guitar    strum
    0.811       surf           catch wave
    0.800       ride horse     rope calf
    0.700       listen         sit in classroom

If someone is eating lunch, it is likely that they are sitting, and people who sit in a classroom are likely to be listening to somebody.
These entailments can be very precise: "walk up stair" entails "ascend", but not "descend"; the reverse is true for "walk down stair":

    P⟦⟧(x|y)               x = ascend    x = descend
    y = walk up stair      32.0          0.0
    y = walk down stair    0.0           30.8

nPMI⟦⟧ captures paraphrases as well as closely related events: people look in a mirror when shaving their face, and baseball players may try to tag someone who is sliding into base:

    nPMI⟦⟧    x                    y
    0.835     open present         unwrap
    0.826     lasso                try to rope
    0.791     get ready to kick    run towards ball
    0.785     try to tag           slide into base
    0.777     shave face           look in mirror

Comparing the expressions that are most similar to "play baseball" or "play football" according to the denotational nPMI⟦⟧ and the compositional Σ similarities reveals that the denotational similarity finds a number of actions that are part of the particular sport, while the compositional similarity finds events that are similar to playing baseball (football):

    play baseball
    nPMI⟦⟧                       Σ
    0.674  tag him               0.859  play softball
    0.637  hold bat              0.782  play game
    0.616  try to tag            0.768  play ball
    0.569  slide into base       0.741  play catch
    0.516  pitch ball            0.739  play cricket

    play football
    nPMI⟦⟧                       Σ
    0.623  tackle person         0.826  play game
    0.597  hold football         0.817  play rugby
    0.545  run down field        0.811  play soccer
    0.519  wear white jersey     0.796  play on field
    0.487  avoid                 0.773  play ball

6 Task 1: Approximate Entailment

A caption never provides a complete description of the depicted scene, but commonsense knowledge often allows us to draw implicit inferences: when somebody mentions a bride, it is quite likely that the picture shows a woman in a wedding dress; a picture of a parent most likely also has a child or baby, etc. In order to compare the utility of denotational and distributional similarities for drawing these inferences, we apply them to an approximate entailment task, which is loosely modeled after the Recognizing Textual Entailment problem (Dagan et al., 2006), and consists of deciding whether a brief caption h (the hypothesis) can describe the same image as a set of captions P = {p1, ..., pN} known to describe the same image (the premises).

Data We generate positive and negative items 〈P, h, ±〉 (Figure 3) as follows: Given an image, any subset of four of its captions forms a set of premises. A hypothesis is either a short verb phrase or sentence that corresponds to a node in the denotation graph. By focusing on short hypotheses, we minimize the possibility that they contain extraneous details that cannot be inferred from the premises. Positive examples are generated by choosing a node h as hypothesis and an image i ∈ ⟦h⟧ such that exactly one caption of i generates h and the other four captions of i are not descendants of h and hence do not trivially entail h (which would give an unfair advantage to denotational approaches). Negative examples are generated by choosing a node h as hypothesis and selecting four of the captions of an image i ∉ ⟦h⟧.
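This sampling procedure can be sketched as follows. The graph accessors (denotation, generates, descendant) and the captions_of mapping are assumed interfaces, not the released code, and the sketch ignores corner cases such as empty candidate sets.

import random

def make_item(graph, captions_of, h, positive, rng=random):
    """Sample one approximate-entailment item <P, h, ±> for hypothesis node h.

    `graph.denotation(h)` is the set of images described by h, `graph.generates(c, h)`
    says whether caption c reduces to h, and `graph.descendant(c, h)` says whether c is
    a descendant of h (and hence trivially entails it); all three are assumed accessors.
    `captions_of` maps each image to its five captions.
    """
    if positive:
        # Images where exactly one caption generates h and no other caption entails it.
        candidates = [i for i in graph.denotation(h)
                      if sum(graph.generates(c, h) for c in captions_of[i]) == 1
                      and not any(graph.descendant(c, h)
                                  for c in captions_of[i] if not graph.generates(c, h))]
    else:
        # Any image outside the denotation of h.
        candidates = [i for i in captions_of if i not in graph.denotation(h)]
    image = rng.choice(candidates)
    premises = [c for c in captions_of[image] if not graph.generates(c, h)][:4]
    return premises, h, positive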
Figure 3: Positive examples from the Approximate Entailment tasks.

    Premises: "A woman with dark hair in bending, open mouthed, towards the back of a dark headed toddler's head." / "A dark-haired woman has her mouth open and is hugging a little girl while sitting on a red blanket." / "A grown lady is snuggling on the couch with a young girl and the lady has a frightened look." / "A mom holding her child on a red sofa while they are both having fun."
    VP Hypothesis: make face

    Premises: "A man editing a black and white photo at a computer with a pencil in his ear." / "A man in a white shirt is working at a computer." / "A guy in white t-shirt on a mac computer." / "A young main is using an Apple computer."
    S Hypothesis: man sit

Since our items are created automatically, a positive hypothesis is not necessarily logically entailed by its premises. We have performed a small-scale human evaluation on 300 items (200 positive, 100 negative), each judged independently by the same three judges (inter-annotator agreement: Fleiss κ = 0.74). Our results indicate that over half (55%) of the positive hypotheses can be inferred from their premises alone without looking at the original image, while almost none of the negative hypotheses can be inferred from their premises (100% of the sentence hypotheses and 96% of the verb phrase hypotheses were judged not inferable). The training items are generated from the captions of 25,000 images, and the test items are generated from a disjoint set of 3,000 images. The VP data set consists of 290,000 training items and 16,000 test items, while the S data set consists of 400,000 training items and 22,000 test items. Half of the items in each set are positive, and the other half are negative.

Models All of our models are binary MaxEnt classifiers, trained using MALLET (McCallum, 2002). We have two baseline models: a plain bag-of-words model (BOW) and a bag-of-words model where we add all hypernyms in our lexicon to the captions before computing their overlap (BOW-H). This is intended to minimize the advantage the denotational features obtain from the hypernym lexicon used to construct the denotation graph. In both cases, a global BOW feature captures the fraction of tokens in the hypothesis that are contained in the premises. Word-specific BOW features capture the product of the frequencies of each word in h and P. All other models extend the BOW-H model.

Denotational Similarity Features We compute the denotational similarities nPMI⟦⟧ and P⟦⟧ (Section 2.4) over the pairs of nodes in a denotation graph that is restricted to the training images. We only consider pairs of nodes n, n′ if their denotations contain at least 10 images and their intersection contains at least 2 images. To map an item 〈P, h〉 to denotational similarity features, we represent the premises as the set of all nodes P that are ancestors of its captions. A sentential hypothesis is represented as the set of nodes H = {h_S, h_sbj, h_VP, h_v, h_dobj} that correspond to the sentence (h itself), its subject, its VP, its verb, and its direct object. A VP hypothesis has only the nodes H = {h_VP, h_v, h_dobj}. In both cases, h_dobj may be empty. Both of the denotational similarities nPMI⟦⟧(h, p) and P⟦⟧(h|p) for h ∈ H, p ∈ P lead to two constituent-specific features, sum_x and max_x (e.g. sum_sbj = Σ_p sim(h_sbj, p), max_dobj = max_p sim(h_dobj, p)), and two global features sum_{p,h} = Σ_{p,h} sim(h, p) and max_{p,h} = max_{p,h} sim(h, p). Each constituent type also has a set of node-specific sum_{x,s} and max_{x,s} features that are on when constituent x in h is equal to the string s and whose value is equal to the constituent-based feature. For P⟦⟧, each constituent (and each constituent-node pair) has an additional feature P(h|P) = 1 − Π_n (1 − P⟦⟧(h|p_n)) that estimates the probability that h is generated by some node in the premise.
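The feature computation can be summarized by the following sketch, which covers the constituent-level sum and max features and the noisy-or combination; the argument names and the simplified treatment (node-specific features are omitted) are our assumptions rather than the released feature extractor.

def denotational_features(hyp_nodes, premise_nodes, sim, p_cond=None):
    """Constituent-level sum/max similarity features plus the noisy-or P(h|P).

    `hyp_nodes` maps constituent types (e.g. "sbj", "vp", "dobj") to graph nodes,
    `premise_nodes` is the set of ancestor nodes of the four premises, `sim(h, p)`
    is nPMI⟦⟧ or P⟦⟧(h|p), and `p_cond` (optional) supplies P⟦⟧(h|p) for the noisy-or.
    """
    feats = {}
    all_scores = []
    for ctype, h in hyp_nodes.items():
        if h is None:          # e.g. a VP hypothesis with no direct object
            continue
        scores = [sim(h, p) for p in premise_nodes]
        feats[f"sum_{ctype}"] = sum(scores)
        feats[f"max_{ctype}"] = max(scores, default=0.0)
        all_scores.extend(scores)
        if p_cond is not None:
            prod = 1.0
            for p in premise_nodes:
                prod *= 1.0 - p_cond(h, p)
            feats[f"noisyor_{ctype}"] = 1.0 - prod   # P(h|P) = 1 - ∏(1 - P(h|p_n))
    feats["sum_all"] = sum(all_scores)
    feats["max_all"] = max(all_scores, default=0.0)
    return feats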
Lexical Similarity Features We use two symmetric lexical similarities: the standard cosine similarity (cos), and Lin (1998)'s similarity (Lin):

    cos(w, w′) = (w · w′) / (‖w‖ ‖w′‖)

    Lin(w, w′) = Σ_{i: w(i)>0 ∧ w′(i)>0} (w(i) + w′(i)) / ( Σ_i w(i) + Σ_i w′(i) )

We use two directed lexical similarities: Clarke (2009)'s similarity (Clk), and Szpektor and Dagan (2008)'s balanced precision (Bal), which builds on Lin and on Weeds and Weir (2003)'s similarity (W):

    Clk(w | w′) = Σ_{i: w(i)>0 ∧ w′(i)>0} min(w(i), w′(i)) / Σ_i w(i)

    Bal(w | w′) = sqrt( W(w | w′) × Lin(w, w′) )

    W(w | w′) = Σ_{i: w(i)>0 ∧ w′(i)>0} w(i) / Σ_i w(i)

We also use two publicly available resources that provide precomputed similarities, Kotlerman et al. (2010)'s DIRECT noun and verb rules and Chklovski and Pantel (2004)'s VERBOCEAN rules. Both are motivated by the need for numerically quantifiable semantic inferences between predicates. We only use entries that correspond to single tokens (ignoring e.g. phrasal verbs). Each lexical similarity results in the following features: each word w in the hypothesis is represented by a max-sim_w feature which captures its maximum similarity with any word in the premises (max-sim_w = max_{w′∈P} sim(w, w′)) and by a sum-sim_w feature which captures the sum of its similarities to the words in the premises (sum-sim_w = Σ_{w′∈P} sim(w, w′)). Global max_sim and sum_sim features capture the maximal (resp. total) similarity of any word in the hypothesis to the premises.

We compute distributional and compositional similarities (cos, Lin, Bal, Clk, Σ, Π) on our image captions ("cap"), the BNC and Gigaword. For each corpus C, we map each word w that appears at least 10 times in C to a vector w_C of the non-negative normalized pointwise mutual information scores (Section 2.4) of w and the 1,000 words (excluding stop words) that occur in the most sentences of C. We generally define P(w) (and P(w, w′)) as the fraction of sentences in C in which w (and w′) occur. To allow a direct comparison between distributional and denotational similarities, we first define P(w) (and P(w, w′)) over individual captions ("cap"), and then, to level the playing field, we redefine P(w) (and P(w, w′)) as the fraction of images in whose captions w (and w′) occur ("img"), and then we use our lexicon to augment captions with all hypernyms ("+hyp"). Finally, we include BNC and Gigaword similarity features ("all").

Table 2: Test accuracy on Approximate Entailment.
                                          VP task           S task
    Baseline 1: BoW                       58.7              71.2
    Baseline 2: BoW-H                     59.0              73.6
    External 1: DIRECT                    59.2              73.5
    External 2: VerbOcean                 60.8              74.0
                                          Cap      All      Cap      All
    Distributional cos                    67.5     71.9     76.1     78.9
    Distributional Lin                    62.6     70.2     75.4     77.8
    Distributional Bal                    62.3     69.6     74.7     75.3
    Distributional Clk                    62.4     69.2     75.4     77.5
    Compositional Π                       68.4     70.3     75.3     77.3
    Compositional Σ                       67.8     71.4     76.9     79.2
    Compositional Π, Σ                    69.8     72.7     77.0     79.6
    Denotational nPMI⟦⟧                        74.9              80.2
    Denotational P⟦⟧                           73.8              79.5
    Denotational nPMI⟦⟧, P⟦⟧                   75.5              81.2
    Combined cos, Π, Σ                    71.1     72.6     77.4     79.2
    Combined nPMI⟦⟧, P⟦⟧, Π, Σ            75.6     75.9     80.2     80.7
    Combined nPMI⟦⟧, P⟦⟧, cos             75.6     75.7     80.2     81.2
    Combined nPMI⟦⟧, P⟦⟧, cos, Π, Σ       75.8     75.9     81.2     80.5

Compositional Similarity Features We use two standard compositional baselines to combine the word vectors of a sentence into a single vector: addition (s_Σ = w_1 + ... + w_n, which can be interpreted as a disjunctive operation), and element-wise (Hadamard) multiplication (s_Π = w_1 ⊙ ... ⊙ w_n, which can be seen as a conjunctive operation). In both cases, we represent the premises (which consist of four captions) as the sum of each caption's vector, p = p_1 + ... + p_4.
This gives two compositional similarity features: Σ = cos(p_Σ, h_Σ) and Π = cos(p_Π, h_Π).

6.1 Experimental Results

Table 2 provides the test accuracy of our models on the VP and S tasks. Adding hypernyms (BOW-H) yields a slight improvement over the basic BOW model. Among the external resources, VERBOCEAN is more beneficial than DIRECT, but neither helps as much as in-domain distributional similarities (this may be due to sparsity). Table 2 shows only the simplest ("Cap") and the most complex ("All") distributional and compositional models, but Table 3 provides accuracies of these models as we go from standard sentence-based co-occurrence counts towards more denotation graph-like co-occurrence counts that are based on all captions describing the same image ("Img"), include hypernyms ("+Hyp"), and add information from other corpora ("All").

Table 3: Accuracy on hypotheses as various additions are made to the vector corpora. Cap is the image corpus with caption co-occurrence. Img is the image corpus with image co-occurrence. +Hyp augments the image corpus with hypernyms and uses image co-occurrence. All adds the BNC and Gigaword corpora to +Hyp.
                        VP task                            S task
                Cap     Img     +Hyp    All        Cap     Img     +Hyp    All
    cos         67.5    69.3    69.8    71.9       76.1    76.8    77.5    78.9
    Lin         62.6    63.4    61.3    70.0       75.4    74.8    75.2    77.8
    Bal         62.3    61.9    62.8    69.6       74.7    75.5    75.1    75.3
    Clk         62.4    67.3    68.0    69.2       75.4    75.5    76.0    77.5
    Π           68.4    70.5    70.5    70.3       75.3    76.6    77.1    77.3
    Σ           67.8    71.4    71.6    71.4       76.9    78.1    79.1    79.2
    Π, Σ        69.8    72.7    72.9    72.7       77.0    78.6    79.3    79.6
    nPMI⟦⟧                      74.9                                80.2
    P⟦⟧                         73.8                                79.5
    nPMI⟦⟧, P⟦⟧                 75.5                                81.2

The "+Hyp" column in Table 3 shows that the denotational metrics clearly outperform any distributional metric when both have access to the same information. Although the distributional models benefit from the BNC and Gigaword-based similarities ("All"), their performance is still below that of the denotational models. Among the distributional models, the simple cos performs better than Lin, or the directed Clk and Bal similarities. In all cases, giving models access to different similarity features improves performance.

Table 4 shows the results by hypothesis length. As the length of h increases, classifiers that use similarities between pairs of words (BOW-H and cos) continue to improve in performance relative to the classifiers that use similarities between phrases and sentences (Σ and nPMI⟦⟧). Most likely, this is due to the lexical similarities having a larger set of features to work with for longer h. nPMI⟦⟧ does especially well on shorter h, likely due to the shorter h having larger denotations.

Table 4: Accuracy on hypotheses of varying length.
                        VP task                    S task
    Words in h          1       2       3+         2       3       4+
    % of items          72.8    13.9    13.3       65.3    22.8    11.9
    BoW-H               52.0    75.0    80.1       69.1    80.8    84.4
    cos (All)           68.8    79.4    81.1       75.9    83.9    85.7
    Σ (All)             68.1    80.8    79.5       76.5    83.9    85.1
    nPMI⟦⟧              72.0    82.9    82.2       77.3    85.4    86.2

7 Task 2: Semantic Textual Similarity

To assess how the denotational similarities perform on a more established task and domain, we apply them to the 1500 sentence pairs from the MSR Video Description Corpus (Chen and Dolan, 2011) that were annotated for the SemEval 2012 Semantic Textual Similarity (STS) task (Agirre et al., 2012). The goal of this task is to assign scores between 0 and 5 to a pair of sentences, where 5 indicates equivalence and 0 unrelatedness. Since this is a symmetric task, we do not consider directed similarities.
And because the goal of this experiment is not to achieve the best possible performance on this task, but to compare the effectiveness of denotational and more established similarities, we only compare the impact of denotational similarities with compositional similarities computed on our own corpus. Since the MSR Video corpus associates each video with multiple sentences, it is in principle also amenable to a denotational treatment, but the STS task description explicitly forbids its use.

7.1 Models

Baseline and Compositional Features Our starting point is Bär et al. (2013)'s DKPro Similarity, one of the top-performing models from the 2012 STS shared task, which is available and easily modified. It consists of a log-linear regression model trained on multiple text features (word and character n-grams, longest common substring and longest common subsequence, Gabrilovich and Markovitch (2007)'s Explicit Semantic Analysis, and Resnik (1995)'s WordNet-based similarity). We investigate the effects of adding compositional (computed on the vectors obtained from the image-caption training data) and denotational similarity features to this state-of-the-art system.

Denotational Features Since the STS task is symmetric, we only consider nPMI⟦⟧ similarities. We again represent each sentence s by features based on 5 types of constituents: S = {s_S, s_sbj, s_VP, s_v, s_dobj}. Since sentences might be complex, they might contain multiple constituents of the same type, and we therefore think of each feature as a feature over sets of nodes. For each constituent C we consider two sets of nodes in the denotation graph: C itself (typically leaf nodes), and C^anc, their parents and grandparents. For each pair of sentences, C-C similarities compute the similarity of the constituents of the same type, while C-all similarities compute the similarity of a C constituent in one sentence against all constituents in the other sentence. For each pair of constituents we consider three similarity features: sim(C_1, C_2), max(sim(C_1, C_2^anc), sim(C_1^anc, C_2)), and sim(C_1^anc, C_2^anc). The similarity of two sets of nodes is determined by the maximal similarity of any pair of their elements: sim(C_1, C_2) = max_{c_1∈C_1, c_2∈C_2} nPMI⟦⟧(c_1, c_2). This gives us 15 C-C features and 15 C-all features.

7.2 Experiments

We use the STS 2012 train/test data, normalized in the same way as the image captions for the denotation graph (i.e. we re-tokenize, lemmatize, and remove determiners). Table 5 shows experimental results for four models: DKPro is the off-the-shelf DKPro Similarity model (Bär et al., 2013). From our corpus, we either add additive and multiplicative compositional features (Σ, Π) from Section 6 (img), the C-C and C-all denotational features based on nPMI⟦⟧, or both compositional and denotational features. Systems are evaluated by the Pearson correlation (r) of their predicted similarity scores to the human-annotated ones. We see that the denotational similarities outperform the compositional similarities, and that including compositional similarity features in addition to denotational similarity features has little effect.

Table 5: Performance on the STS MSRvid task: DKPro (Bär et al., 2013) plus compositional (Σ, Π) and/or denotational similarities (nPMI⟦⟧) from our corpus.
                 DKPro    +Σ, Π (img)    +nPMI⟦⟧    +both
    Pearson r    0.868    0.880          0.888      0.890
For additional comparison, the published numbers for the TakeLab Semantic Text Similarity System (Šarić et al., 2012), another top-performing model from the 2012 shared task, are r = 0.880 on this dataset.

8 Conclusion

Summary of Contributions We have defined novel denotational metrics of linguistic similarity (Section 2), and have shown them to be at least competitive with, if not superior to, distributional similarities for two tasks that require simple semantic inferences (Sections 6, 7), even though our current method of computing them is somewhat brittle (Section 5). We have also introduced two new resources: a large data set of images paired with descriptive captions, and a denotation graph that pairs generalized versions of these captions with their visual denotations, i.e. the sets of images they describe. Both of these resources are freely available (http://nlp.cs.illinois.edu/Denotation.html). Although the aim of this paper is to show their utility for a purely linguistic task, we believe that they should also be of great interest for people who aim to build systems that automatically associate images with sentences that describe them (Farhadi et al., 2010; Kulkarni et al., 2011; Li et al., 2011; Yang et al., 2011; Mitchell et al., 2012; Kuznetsova et al., 2012; Gupta et al., 2012; Hodosh et al., 2013).

Related Work and Resources We believe that the work reported in this paper has the potential to open up promising new research directions. There are other data sets that pair images or video with descriptive language, but we have not yet applied our approach to them. Chen and Dolan (2011)'s MSR Video Description Corpus (of which the STS data is a subset) is most similar to ours, but its curated part is significantly smaller. Instead of several independent captions, Grubinger et al. (2006)'s IAPR TC-12 data set contains longer descriptions. Ordonez et al. (2011) harvested 1 million images and their user-generated captions from Flickr to create the SBU Captioned Photo Dataset. These captions tend to be less descriptive of the image. The denotation graph is similar to Berant et al. (2012)'s 'entailment graph', but differs from it in two ways: first, entailment relations in the denotation graph are defined extensionally in terms of the images described by the expressions at each node, and second, nodes in Berant et al.'s entailment graph correspond to generic propositional templates (X treats Y), whereas nodes in our denotation graph correspond to complete propositions (a dog runs).

Acknowledgements

We gratefully acknowledge the support of the National Science Foundation under NSF awards 0803603 "INT2-Medium: Understanding the meaning of images", 1053856 "CAREER: Bayesian Models for Lexicalized Grammars", and 1205627 "CI-P: Collaborative Research: Visual entailment data set and challenge for the Language and Vision Community", as well as via an NSF Graduate Research Fellowship to Alice Lai.

References

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: a pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 385–393.

Daniel Bär, Torsten Zesch, and Iryna Gurevych. 2013. DKPro Similarity: An Open Source Framework for Text Similarity.
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 121–126, Sofia, Bulgaria, August.

Jon Barwise and John Perry. 1980. Situations and attitudes. Journal of Philosophy, 78:668–691.

Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2012. Learning entailment relations by global graph structure optimization. Computational Linguistics, 38(1):73–111.

Greg Carlson. 2005. The Encyclopedia of Language and Linguistics, chapter Generics, Habituals and Iteratives. Elsevier, 2nd edition.

David Chen and William Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200, Portland, Oregon, USA, June.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 33–40, Barcelona, Spain, July.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Daoud Clarke. 2009. Context-theoretic semantics for natural language: an overview. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 112–119, Athens, Greece, March.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment challenge. In Machine Learning Challenges, volume 3944 of Lecture Notes in Computer Science, pages 177–190. Springer.

David Dowty, Robert Wall, and Stanley Peters. 1981. Introduction to Montague Semantics. Reidel, Dordrecht.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision (ECCV), Part IV, pages 15–29, Heraklion, Greece, September.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606–1611.

Michael Grubinger, Paul Clough, Henning Müller, and Thomas Deselaers. 2006. The IAPR benchmark: A new evaluation resource for visual information systems. In OntoImage 2006, Workshop on Language Resources for Content-based Image Retrieval during LREC 2006, pages 13–23, Genoa, Italy, May.

Ankush Gupta, Yashaswi Verma, and C. Jawahar. 2012. Choosing linguistics over vision to describe images. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, July.

Zellig S. Harris. 1954. Distributional structure. Word, 10:146–162.

Micah Hodosh, Peter Young, Cyrus Rashtchian, and Julia Hockenmaier. 2010. Cross-caption coreference resolution for automatic image understanding. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 162–171, Uppsala, Sweden, July.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research (JAIR), 47:853–899.

Shangfeng Hu and Chengfei Liu. 2011. Incorporating coreference resolution into word sense disambiguation. In Alexander F.
Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 6608 of Lecture Notes in Computer Science, pages 265–276. Springer Berlin Heidelberg.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1601–1608.

Polina Kuznetsova, Vicente Ordonez, Alexander Berg, Tamara Berg, and Yejin Choi. 2012. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 359–368, Jeju Island, Korea, July.

Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL), pages 220–228, Portland, OR, USA, June.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pages 296–304, Madison, WI, USA, July.

Bill MacCartney and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 521–528, Manchester, UK, August.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://www.cs.umass.edu/~mccallum/mallet.

Shachar Mirkin, Ido Dagan, and Eyal Shnarch. 2009. Evaluating the inferential utility of lexical-semantic resources. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 558–566, Athens, Greece, March.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg, and Hal Daume III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 747–756, Avignon, France, April.

Richard Montague. 1974. Formal philosophy: papers of Richard Montague. Yale University Press, New Haven. Edited by Richmond H. Thomason.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 2216–2219.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems 24, pages 1143–1151.

Judita Preiss. 2001. Anaphora resolution with word sense disambiguation. In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 143–146, Toulouse, France, July.

Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy.
In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'95, pages 448–453.

Idan Szpektor and Ido Dagan. 2008. Learning entailment rules for unary templates. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 849–856, Manchester, UK, August. Coling 2008 Organizing Committee.

Frane Šarić, Goran Glavaš, Mladen Karan, Jan Šnajder, and Bojana Dalbelo Bašić. 2012. TakeLab: Systems for measuring semantic text similarity. In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 441–448, Montréal, Canada, 7-8 June.

Julie Weeds and David Weir. 2003. A general framework for distributional similarity. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 81–88.

Yezhou Yang, Ching Teo, Hal Daume III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 444–454, Edinburgh, UK, July.