A Bayesian Model of Grounded Color Semantics

Brian McMahan, Rutgers University, brian.mcmahan@rutgers.edu
Matthew Stone, Rutgers University, matthew.stone@rutgers.edu

Abstract

Natural language meanings allow speakers to encode important real-world distinctions, but corpora of grounded language use also reveal that speakers categorize the world in different ways and describe situations with different terminology. To learn meanings from data, we therefore need to link underlying representations of meaning to models of speaker judgment and speaker choice. This paper describes a new approach to this problem: we model variability through uncertainty in categorization boundaries and distributions over preferred vocabulary. We apply the approach to a large data set of color descriptions, where statistical evaluation documents its accuracy. The results are available as a Lexicon of Uncertain Color Standards (LUX), which supports future efforts in grounded language understanding and generation by probabilistically mapping 829 English color descriptions to potentially context-sensitive regions in HSV color space.

1 Introduction

Grounding natural language semantics in real-world data at large scale requires researchers to confront the vocabulary problem (Furnas et al., 1987). Much of what people say falls in a long tail of increasingly infrequent and specialized items. Moreover, the choice of how to categorize and describe real-world data varies across people. We can't account for this complexity by deriving one definitive mapping between words and the world.

We see this complexity already in free-text descriptions of color patches. English has fewer than a dozen basic color words (Berlin, 1991), but people's descriptions of colors are much more variable than this would suggest. Measured on the corpus described in Section 4.1, there's an average of 3.845 bits of information in a color description given the color it describes—comparable to rolling a 14-sided die. Figure 1 summarizes the data and plots the entropy of descriptions encountered within small bins of color space. The bins are aggregated over the Saturation and Value dimensions and indexed on the x-axis by the Hue dimension.

Figure 1: A visualization of the variability of the descriptions used to name colors within small bins of color space. For each Hue value, the entropy values for each bin along the Saturation and Value dimensions are grouped and plotted as box plots. The dotted line corresponds to a random choice out of fourteen items and to the perplexity of a histogram model trained on the corpus.
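The per-bin statistic in Figure 1 is an ordinary Shannon entropy. A minimal sketch of the computation (ours, not the authors' code; it assumes the corpus responses have already been grouped into HSV bins):

```python
import collections
import math

def entropy_bits(labels):
    """Shannon entropy (in bits) of the descriptions observed in one color bin."""
    counts = collections.Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# 3.845 bits corresponds to a uniform choice among 2 ** 3.845, or about 14, items.
```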
There's little reason to think that this variability conceals consistent meanings. In formal semantics, one of the hallmarks of vague language is that speakers can make it more precise in alternative, incompatible ways (Barker, 2002). We see this in practice as well, for example with the image of Figure 2, where subjects comprehensibly describe either of two dogs as the tan one. Systems that robustly understand or generate descriptions of colors in situated dialogue need models of meaning that capture this variability.

Figure 2: Image by flickr user Joanne Bacon (jlbacon) from the data set of Young et al. (2014), whose subjects describe these dogs as a brown dog and a tan one or a tan dog and a white one.

This paper makes two key contributions towards this challenge. First, we present a methodology to infer a corpus-based model of meaning that accounts for possible differences in word usage across different speakers. As we explain in Section 2, our approach differs from the typical perspective in grounded semantics (Tellex et al., 2011a; Matuszek et al., 2012; Krishnamurthy and Kollar, 2013), where a meaning is reduced to a single classifier that collapses patterns of variation. Instead, our model allows for variability in meaning by positing uncertainty in classification boundaries that can get resolved when a speaker chooses to use a word on a specific occasion. We explain the model and its theoretical rationale in Section 3.

Second, we develop and release a Lexicon of Uncertain Color Standards (LUX) by applying our methodology to color descriptions. LUX is an interpretation of 829 distinct English color descriptions as distributions over regions of the Hue–Saturation–Value color space that describe their possible meanings. As we describe in Section 4, the model is trained by machine learning methods from a subset of Randall Munroe's 2010 publicly available corpus of 3.4 million crowdsourced free-text descriptions of color patches (Munroe, 2010). Data, models, and visualization software are available at http://mcmahan.io/lux/.

Statistical evaluation of our model against two alternative approaches documents its effectiveness. The model makes better quantitative predictions than a brute-force memorization model; it seems to generalize to unseen data in more meaningful ways. At the same time, our meanings work as well as special-purpose models to explain speaker choice, even though our model supports diverse other reasoning. See Section 5.

We see color as the first of many applications of our methodology, and are optimistic about learning vague meanings for other continuous domains such as quantity, space, and time. At the same time, the methodology opens up new prospects for research on negotiating meaning interactively (Larsson, 2013) with principled representations and with broad coverage. In fact, many practical situated dialogue systems already identify unfamiliar objects by color. We expect that LUX will provide a broadly useful resource to extend the range of descriptions such systems can generate and understand.

2 Related Work

Grounded semantics is the task of mapping representations of linguistic meaning to the physical world, whether by perceptual mechanisms (Harnad, 1990) or with the assistance of social interaction (DeVault et al., 2006). In this paper, we are particularly concerned with grounding the meanings of primitive vocabulary. However, the ultimate test of grounded semantics—whether it is understanding commands (Winograd, 1970; Tellex et al., 2011b), describing states of the world (Chen and Mooney, 2008), or identifying objects (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013; Dawson et al., 2013)—is the ability to interpret or generate utterances using lexical and compositional semantics so as to evoke appropriate real-world referents. Grounded semantics therefore involves more than just quantifying the associations between words and perceptual representations, as Chuang et al. (2008) and Heer and Stone (2012) do for color. Grounded semantics involves interpreting semantic primitives in terms of composable categories that let systems discriminate between cases where a word applies and cases where the word does not apply.
(Our evaluation compares models of grounded semantics to more direct models of word–world associations.) Previous research has modeled these categories as regions of suitable perceptual feature spaces. Researchers have explored explicit spaces of high-level perceptual attributes (Farhadi et al., 2009; Silberer et al., 2013), approximations to such spaces (Matuszek et al., 2012), or low-level feature spaces such as Bag of Visual Words (Bruni et al., 2012) or Histogram of Gradients (Krishnamurthy and Kollar, 2013). We specifically follow Gärdenfors (2000) and Jäger (2010) in assuming that color categories are convex regions in an underlying color space, and are not just determined by prototypical color values, as in Andreas and Klein (2014).

However, unlike previous work in grounded semantics, we do not assume that words name categories unequivocally. Speakers may vary in how they interpret a word, so we treat the link between words and categories probabilistically. The difference makes training our model more indirect than previous approaches to grounded meaning. In particular, our model introduces a new layer of uncertainty that describes what category the speaker uses. Similar kinds of uncertainty can be found in Bayesian models of speaker strategy, such as that of Smith et al. (2013). However, this research has assumed that speakers aim to be as informative as possible. We have no evidence that our speakers do that. We assume only that speakers' utterances are reliable and mirror prevailing usage.

Prior work by cognitive scientists has studied color terms extensively, but focused on basic ones—monolexemic, top-level color words with general application and high frequency in a language (Kay et al., 2009; Lammens, 1994). These color categories seem to shape people's expectations and memory for colors (Persaud and Hemmer, 2014), and patterns of color naming can therefore enhance software for helping people organize and interact with color (Chuang et al., 2008; Heer and Stone, 2012). Moreover, crosslinguistic evidence suggests that the human perceptual system places strong biases on the meanings of the basic color terms (Regier et al., 2005), perhaps because basic terms must partition the perceptual space in an efficient way (Regier et al., 2007). We depart from research on basic color naming in considering a much wider range of terms, much like Andreas and Klein (2014). We consider subordinate, non-basic terms like beige or lavender; modified colors like light blue or bright green; and named subcategories like olive green, navy blue or brick red.

In order to use semantic primitives for understanding, it's necessary to combine them into an integrated sentence-level representation: this is the problem of semantic parsing. Semantic parsers can be built by hand (Winograd, 1970), induced through inductive logic programming (Zelle and Mooney, 1996), or treated as a structured classification problem (Zettlemoyer and Collins, 2005). Once a suitable logical form is derived, interpretation typically involves a recursive process of finding referents that fit lexical categories and relationships (Mavridis and Roy, 2006; Tellex et al., 2011a). While this paper does not explicitly address how our meanings might be used in conjunction with such techniques, we see no fundamental obstacle to doing so—for example, by resolving references probabilistically and marginalizing over uncertainty in meaning.
3 Using Vague Color Terms: A Model

Our model involves two significant innovations over previous approaches to grounded meaning. The first is to capture the vagueness and flexibility of grounded meaning with semantic representations that treat meaning as uncertain. We represent the semantics of a color description with a distribution over color categories, which weights possible meanings by the relative likelihood of a speaker using this meaning on any particular occasion. For example, speakers might associate yellowish green with a range of possible meanings, differing in how far the color category extends into green hues. By representing uncertainty about meaning, our model makes room to capture variability in language use. For example, it implicitly quantifies how likely speakers are to use words differently, as with the two interpretations of tan in Figure 2.

Our second contribution is our simple model of the relationship between semantics and pragmatics. We assume that speakers' choices mirror established patterns. In particular, the model learns a measure of availability for each color term that tracks how frequently speakers tend to use it when it is applicable. For example, although the expressions yellowish green and chartreuse are associated with very similar color categories, people say yellowish green much more often: it has a higher availability. Empirically, we find a few terms with high availability and a long tail of terms with lower availabilities. We assume speakers simply sample applicable terms from this distribution, which predicts the long tail of observed responses.

Mathematically, we develop our approach through the rational analysis methodology for explaining human behavior proposed by Anderson (1991), along with methodological insights from the linguistics and philosophy of vagueness. In the remainder of this section, we explain the theoretical antecedents in perceptual science, linguistics, and cognitive modeling that inform our approach.

3.1 Color Categories

Color can be defined as the sensations by which the perceptual system tracks the diffuse reflectance of objects, despite variability, uncertainty, and ambiguity in the visual input. Red, green, and blue cones in the retina allow the visual system to coarsely estimate frequency bands in the spectrum of incoming light. Cameras and screens that use the red–green–blue (RGB) color space are designed roughly to correspond to these responses. However, colors in the visual system summarize spectral profiles rather than mere wavelengths of light. For example, we see colors like cyan (green plus blue without red), magenta (blue plus red without green) and yellow (red plus green without blue) as intermediate saturated colors between the familiar primaries. This naturally leads to a wheel of hues describing the relative prominence of different spectral components along a continuum. Fairchild (2013) provides an overview of color appearance.

To capture this variation, we'll work in the simple hue–saturation–value (HSV) color space that's common in computer graphics and color picker user interfaces (Hughes et al., 2013) and implemented in python's native colorsys package. This coordinate system represents colors with three distinct qualitative dimensions: Hue (H) represents changes in tint around a color wheel, Saturation (S) represents the relative proportion of color versus gray, and Value (V) represents the location on the white–black continuum.
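For concreteness, the conversion into these coordinates is a one-liner with the standard library; a quick sketch (the RGB triple here is just an illustrative yellow-green patch, not a value from the corpus):

```python
import colorsys

# colorsys expects RGB components in [0, 1] and returns (h, s, v) in [0, 1],
# with h measured as a fraction of a full turn around the hue wheel.
r, g, b = 154, 205, 50  # an 8-bit RGB patch in the yellow-green range
h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
print(h, s, v)  # roughly (0.22, 0.76, 0.80)
```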
We will associate color categories with rectangular box-shaped regions in HSV space. More sophisticated color spaces have been developed to describe the psychophysics of color more precisely, but they depend on the photometric illumination and other aspects of the viewing context that were not controlled in the collection of the data we are using (Fairchild, 2013).

3.2 Semantic Representation

Our assumption is that color terms are associated probabilistically with color categories. We illustrate the idea for the color label yellowish-green through the plot in Figure 3. The plot shows variation in use of the term across the Hue dimension: the bar graph is a scaled histogram of the responses in the data we use. There is a range of colors where people use yellowish green often, surrounded by borderline cases where it becomes increasingly infrequent.

Figure 3: The LUX model for "yellowish green" on the Hue axis plotted against the scaled histogram of the responses in the data. The φ curve represents the likelihood of "yellowish green" for different Hue values. The τ curves represent possible boundaries.

We represent this variability by assuming that the boundaries that delimit the color are uncertain. In any utterance, yellowish green fits only those Hue values that are above a minimum threshold τ_Lower and below a maximum threshold τ_Upper. However, it is uncertain which thresholds a speaker will use. The model describes this variability with probability density functions. They are shown for yellowish green in Figure 3 as the τ distributions. The figure shows that there is a central range of hues, between the τ distributions, that is definitely yellowish green. The τ distributions peak at the most likely boundaries for yellowish green, encompassing a broad region that's frequently called yellowish green. Further away, threshold values and yellowish green utterances alike become rapidly less likely.

Our representation is motivated by Barker (2002) and Lassiter (2009), who show how sets of possible thresholds[1] can account for many of our intuitions about the use of vague language. Their analysis invites us to capture semantic variability through two geometric constructs. First, there is a certain interval, parameterized by two points, μ_Lower and μ_Upper, within which a color description definitely applies. Outside this interval are regions of borderline cases, delimited by probabilistically varying thresholds τ_Lower and τ_Upper, where the color description sometimes applies. We represent the position of the threshold with a Γ(α, β) distribution, a standard statistical tool to model processes that start, continue indefinitely, and stop, like waiting times.[2] We can determine a likelihood that a description fits a color by marginalizing over the thresholds: this gives the black curve visualized in Figure 3. As we describe in Section 3.3, we can use this to account for the graded responses from subjects that we observe near color boundaries.

[1] We treat the terms "boundary", "threshold", and "standard" as synonymous, but useful in different contexts.

[2] Γ distributions rise quickly away from the origin point, then trail off from the peak in an open-ended exponential decay. One intuition for applying them in this case is Graff Fara's (2000) suggestion that a particular categorization decision involves waiting to find a natural break among salient colors. However, we choose them for mathematical convenience rather than psychological or linguistic considerations.

We summarize with a formal definition of our semantic representation. Let X be the 3D space of HSV colors and let x ∈ X be a measured color value. Each color label k has definite boundaries, μ_Lower and μ_Upper in X, delimiting a box of HSV color space. Surrounding the definite region are regions of uncertainty: the set of possible boundaries beyond μ.
These are represented by probability distributions over lower and upper threshold values in each dimension. We'll represent these thresholds by τ^{j,d}_k, where k ∈ K indexes the color label, j ∈ {Lower (L), Upper (U)} indexes the boundary, and d ∈ {H, S, V} indexes color components. We assume the thresholds are distributed as follows:

    τ^{Lower,d}_k ∼ μ^{Lower,d}_k − Γ(α^{Lower,d}_k, β^{Lower,d}_k)
    τ^{Upper,d}_k ∼ μ^{Upper,d}_k + Γ(α^{Upper,d}_k, β^{Upper,d}_k)    (1)

The meaning of a color term is thus a "blurry box". The distribution lets us determine the probability of a point x falling into the color category k as in Eq. 2. We also use the compact notation in Eq. 3.

    P(τ^{Lower,H}_k < x^H < τ^{Upper,H}_k) × P(τ^{Lower,S}_k < x^S < τ^{Upper,S}_k) × P(τ^{Lower,V}_k < x^V < τ^{Upper,V}_k)    (2)

    = ∏_d P(τ^{L,d}_k < x^d < τ^{U,d}_k)    (3)

3.3 Rational Observer Model

Our goal is to learn probabilistic representations of the meanings of color terms from subjects' responses. To do this, we need not only a framework for representing colors but also a model of how subjects choose color terms. Inspired by rational analysis (Anderson, 1991), we assume that speakers' choices match their communicative goals and their semantic knowledge. We leverage this assumption to derive a Bayes Rational Observer model linking semantics to observed color descriptions.

The graphical model in Figure 4 formalizes our approach.

Figure 4: The Rational Observer observes a color patch, x. The applicability of each label (k^true) is based upon the label parameters (α, β, μ) and x. The label (k^said) is sampled proportional to the applicability and a background weight: how often a label is said when it applies.

We start from an observed color patch, x. The Rational Observer uses the τ distributions for each color description k to determine the likelihood that the speaker judges k applicable. As defined in Eq. 3, the likelihood is the probability mass of possible boundaries that contain the target color value. Normally, many descriptions will be applicable. Which the speaker chooses depends further on the availability of the label—a background measure of how frequently a label is chosen when it's applicable. Intuitively, availability creates a bias for easy descriptions, capturing how natural or ordinary a description is in language use, how easily it springs to mind or how easily it is understood.

We formalize this as a generative model. As we explain in Section 4, we infer the parameters from our data. In Eq. 4, we consider the conditional distribution of a subject observing a color patch given HSV value x and labeling it k:

    P(k^said, k^true | x) = P(k^said | k^true) P(k^true | x)    (4)

In this equation, k^said is the event that the subject responds to x with label k and k^true is the event that the subject judges k true of the HSV value x. The two factors of Eq. 4 are respectively the availability and applicability of the color label.
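To make Eqs. 1–3 concrete, here is a minimal sketch of the per-dimension threshold marginalization (our illustration, not the released LUX code; it reads β as a rate parameter, matching the shape-and-rate wording of Section 4.2):

```python
from scipy.stats import gamma

def applicability_1d(x, mu_lo, mu_hi, a_lo, b_lo, a_hi, b_hi):
    """P(tau_lo < x < tau_hi) for one HSV dimension (one factor of Eq. 3).

    Eq. 1 puts tau_lo = mu_lo - g_lo and tau_hi = mu_hi + g_hi with
    g ~ Gamma(shape=a, rate=b), so P(tau_lo < x) = P(g_lo > mu_lo - x) and
    P(x < tau_hi) = P(g_hi > x - mu_hi). Since tau_lo <= mu_lo <= mu_hi <= tau_hi,
    at most one factor is ever below 1, so the product is exact.
    """
    p_lo = gamma.sf(mu_lo - x, a_lo, scale=1.0 / b_lo)  # survival fn = 1 - CDF; 1 if x >= mu_lo
    p_hi = gamma.sf(x - mu_hi, a_hi, scale=1.0 / b_hi)  # 1 if x <= mu_hi
    return p_lo * p_hi

def applicability(x_hsv, params):
    """Product over the H, S, V dimensions (Eq. 3); params maps each
    dimension name to its (mu_lo, mu_hi, a_lo, b_lo, a_hi, b_hi) tuple."""
    prob = 1.0
    for d in ("H", "S", "V"):
        prob *= applicability_1d(x_hsv[d], *params[d])
    return prob
```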
Availability: The prior P(k^said | k^true) quantifies the rate at which label k is used when it applies. We refer to this quantity as the availability and denote it as α_k. Availability captures the observed bias for frequent color terms. When multiple color labels fit a color value, those with higher availability will be used more often, but those with lower availability will still get used. This effect is partially responsible for the long tail of subjects' responses.

Applicability: The second factor, P(k^true | x), is the probability that k is true of, or applies to, the color value x. We calculate the applicability by marginalizing over all possible thresholds as in Eq. 3. In other words, we calculate the probability mass of the boundaries that allow this description to apply. We treat each applicability judgment as independent of others. This implies that the relative frequency at which we see a color description used is directly proportional to the proportion of boundaries which license it. For clearer notation and parameter estimation, we track thresholds with a piecewise function φ^d_k(x^d) as in Eq. 5 and Figure 3.

    φ^d_k(x^d) = { P(x^d > τ^{L,d}_k)   if x^d ≤ μ^{L,d}_k
                 { P(x^d < τ^{U,d}_k)   if x^d ≥ μ^{U,d}_k
                 { 1                    otherwise            (5)

Finally, Eq. 6 rewrites Eq. 4 to make the applicability and availability explicit. The model treats this equation as the probability of success for a Bernoulli trial and the data as sampled from Categorical distributions formed by the set of K Bernoulli random variables. This is discussed further in Section 4.2.

    P(k^said, k^true | x) = α_k ∏_d φ^d_k(x^d)    (6)
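Read generatively, Eq. 6 says the speaker scores every label by availability times applicability and then samples from the normalized scores. A minimal sketch of that speaker model, building on the applicability function sketched above (the names and data layout are our assumptions):

```python
import random

def sample_label(x_hsv, lexicon):
    """Sample a description for a patch per Eq. 6.

    lexicon maps each label k to (alpha_k, params_k); the scores
    alpha_k * phi_k(x) are normalized into a categorical distribution.
    """
    scores = {k: alpha * applicability(x_hsv, params)
              for k, (alpha, params) in lexicon.items()}
    labels, weights = zip(*scores.items())
    return random.choices(labels, weights=weights, k=1)[0]
```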
4 Learning Experiment

We worked with Randall Munroe's crowdsourced corpus of color judgments, and fit the model using Metropolis–Hastings Markov chain Monte Carlo with Gaussian random-walk proposals. This form of approximate Bayesian inference is described in Section 4.2.

4.1 Munroe Color Corpus

In 2010, Munroe elicited descriptions of color patches over the web. His platform asked users for background information such as sex, colorblindness, and monitor type, then presented color patches and let the user freely name them. The setup didn't ensure that users saw controlled colors or that users' responses were reliable, but the experiment collected over 3.4M items pairing RGB values with text descriptions. Munroe's methodology, data and results are published online (Munroe, 2010).[3]

[3] http://blog.xkcd.com/2010/05/03/color-survey-results/

Munroe summarizes his results with 954 idealized colors—RGB values that best exemplify high-frequency color labels. In effect, Munroe's summary offers a prototype theory of color vocabulary, like that of Andreas and Klein (2014). An alternative theory, which we explore, is that variability in the applicability of labels is an important part of people's knowledge of color semantics. We compare the two theories explicitly in Section 5.

Our experiments focus on a subset of Munroe's data comprising 2,176,417 data points and 829 color descriptions, divided into a training set of 70%, a 5% development set, and a held-out test set of 25%. To minimize variability in language use, we selected data from users who self-report as non-colorblind English speakers. This accounts for 2.5M of Munroe's 3.4M items. To get our subset, we further restrict attention to labels used 100 times or more, to ensure that there's substantial evidence of each term's breadth of applicability. We hand curated the responses to correct some minor spelling variations involving a single-character change ("yellow green" vs "yellow-green"; "fuchsia" vs "fuschia", "fushia", "fuchia", and "fucsia") and to remove high-frequency spam labels. We are left with 829 color labels that fit these restrictions. Finally, we used python's colorsys to convert from RGB to HSV, where we hypothesize color meanings can be represented more simply. We include these data sets with our release at http://mcmahan.io/lux/ so our results can be replicated.

4.2 Fitting the Model Parameters

Optimization of the model's parameters is framed in a Bayesian framework and interpreted as maximizing the likelihood of the data given the parameters. We fit each label and each dimension independently. The data on each dimension is binned, as in Figure 3, so we have Binomial random variables for each bin. For each color label k, the probability of success is based on the model's parameters. Non-k data in the bin are observations of failure. This gives Eq. 7:

    n^d_{i,k} | n^d_i, Z^d_k, φ_k ∼ Bin(n^d_i, Z^d_k φ^d_k(i))    (7)

Here n^d_i is the number of data points in bin i on dimension d, n^d_{i,k} is the number of data points for label k in bin i on dimension d, and Z^d_k is a normalization constant, implicitly reflecting both the availability α_k and the distribution of responses of the term across other color dimensions. The optimization process is a parameter search method which uses as an objective function the probability of n^d_{i,k} in Eq. 7 for all d, i, and k.

Parameter Search: We adopt a Bayesian coordinate descent which sequentially samples the certain-region parameter, μ, and the shape and rate parameters (α and β) of the Γ distributions for all d and k independently. It also samples the estimated normalization constant, Z^d_k. More specifically, the sampling is done using Metropolis–Hastings Markov chain Monte Carlo (Metropolis et al., 1953; Chib and Greenberg, 1995), which performs a Gaussian random walk on the parameters.[4] For each sample, the likelihood of the data, derived from the Binomial variables, is compared for the new and old set of parameters. The new parameters are accepted proportionally to the ratio of the two likelihoods. Multiple chains were run using 4 different bin sizes per dimension and monitored for convergence using the generalized Gelman–Rubin diagnostic method (Brooks and Gelman, 1998). This methodology leaves us not only with the Monte Carlo estimate of the expected value for each parameter, but also a sampling distribution that quantifies the uncertainty in the parameters themselves.

[4] We set the standard deviation of the sampling Gaussian to be 1 for each μ and 0.3 for each α and β after finding experimentally that it led to effective parameter search (Gelman et al., 1996).

Availability: Availability is estimated as the ratio of the observed frequency of a label to its expected frequency given the parameters which define its distribution. The expected frequency, a marginalization of the φ function over the color space, is calculated using the midpoint integration approximation.

    α_k = P(k^said, k^true) / P(k^true) = (count(k)/N) / ∫_x P(k^true | x) P(x) dx    (8)
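A sketch of the estimate in Eq. 8 (ours; it assumes P(x) is uniform over the HSV unit cube, which the paper leaves unspecified, and uses a midpoint grid for the integral):

```python
import numpy as np

def estimate_availability(count_k, n_total, phi_k, step=0.02):
    """Eq. 8: observed frequency of label k over its expected frequency.

    phi_k is a vectorized applicability function over (H, S, V) arrays.
    With P(x) uniform, the integral reduces to the mean of phi_k evaluated
    on a midpoint grid over the unit cube.
    """
    pts = np.arange(step / 2, 1.0, step)
    H, S, V = np.meshgrid(pts, pts, pts, indexing="ij")
    expected = phi_k(H, S, V).mean()  # midpoint approximation of the integral
    return (count_k / n_total) / expected
```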
5 Model Evaluation

LUX explains Munroe's data via speakers' rational use of probabilistic meanings, represented as simple "blurry boxes". In this section, we assess the effectiveness of this explanation. We anticipate two arguments against our model: first, that the representation is too simple; second, that factoring speakers' choices through a model of meaning is too cumbersome. We rebut these arguments by providing metrics and results that suggest that LUX escapes these objections and captures almost all of the structure in subjects' responses.

5.1 Alternative Models

To test LUX's representations, we built a brute-force histogram model (HM) that discretizes HSV space and tracks frequency distributions of labels directly in each discretized bin. Similar histogram models have been developed by Chuang et al. (2008) and Heer and Stone (2012) to build interfaces for interacting with color that are informed by human categorization and naming. More precisely, our HM uses a linear interpolation method (Chen and Goodman, 1996) to combine three histograms of various granularity.[5] This amounts to predicting responses by querying the training data. HM has the potential to expose whether LUX is missing important features of the distribution of color descriptions.

[5] Specifically, the histograms are of size (90,10,10), (45,5,5), and (1,1,1) across Hue, Saturation, and Value with interpolation weights of 0.322, 0.643, and 0.035 respectively. These parameters were determined by taking the training set as 5-fold validation sets.

We also built a direct model of subjects' choices of color terms. Instead of appealing to the applicability and availability of a color label, it works with the observed frequency of a color label and a Gaussian model of the probability of a color value for each label, as in Eq. 9:

    P(k^said, k^true | x) ∝ P(x | k^true) P(k^said, k^true)    (9)

This Gaussian model (GM) generalizes Munroe's pairing of labels with prototypical colors: P(x | k^true) is a Gaussian with diagonal covariance, so it associates each color term with a mean HSV value and with variances in each dimension that determine a label-specific distance metric. GM predicts speaker choice by weighting these distances probabilistically against the priors. GM completely sidesteps the need to model meaning categorically. It therefore has the potential to expose whether our assumptions about semantic representations and speaker choices hinder LUX's performance.

5.2 Evaluation Metrics

We evaluate the models using two classes of metrics on a held-out test set consisting of 25% of the corpus. The first type is based upon the posterior distribution over labels and the ranked position of subjects' actual labels of color values. The second type is based upon the log likelihood of the models, which quantifies model fit.

5.2.1 Decision-Based Metrics

To answer how accurate a model's predictions are, we can locate subjects' responses in the weighted rankings computed by the models.

The TOPK Measures: Each model provides a posterior distribution over the possible labels. The most likely label of this posterior is the maximum likelihood estimate (MLE). We track how often the MLE color label is what the user actually said as the TOP1 measure. For the Histogram Model, the TOP1 approximates the most frequent label observed in the data for a color value. We also measure how often the correct label appears in the first 5 and 10 most likely labels. These are denoted TOP5 and TOP10 respectively.
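Concretely, the TOPK measures can be computed from a model's posteriors as follows (a sketch; the array layout is our assumption):

```python
import numpy as np

def topk_accuracy(posteriors, gold, ks=(1, 5, 10)):
    """posteriors: (N, K) model scores over K labels; gold: (N,) label indices.

    Returns the fraction of test points whose actual label ranks in the top k.
    """
    order = np.argsort(-posteriors, axis=1)            # most likely label first
    ranks = np.argmax(order == gold[:, None], axis=1)  # rank of the gold label
    return {k: float((ranks < k).mean()) for k in ks}
```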
5.2.2 Likelihood-Based Metrics

We can also measure how well a model explains speaker choice using the log likelihood of the labels given the model and the color values, denoted as LL_V(M). This is calculated using Eq. 10 across all N data points in the held-out test set. LL_V(M) is used when computing perplexity and the Akaike Information Criterion (AIC). We report all measures in bits.

    LL_V(M) = log₂ P_M(K^true, K^said | X) = Σ_i log₂ P_M(k^true_i, k^said_i | x_i)    (10)

A more general measure of model fit is the joint log likelihood of the color values and their labels, LL(M), given the model. It is defined and calculated analogously.

Perplexity: Perplexity has been used in past research to measure the performance of statistical language models (Jelinek et al., 1977; Brown et al., 1992). Lower perplexity means that the model is less surprised by the data and so describes it more precisely. We use it here to measure how well a model encodes the regularities in color descriptions.

Akaike Information Criterion: AIC is derived from information theory (Akaike, 1974) and balances the model's fit to the data with the complexity of the model by penalizing a larger number of parameters. The intuition is that a smaller AIC indicates a better balance of parameters and model fit.
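Both statistics are simple functions of LL_V(M); a sketch in the paper's bit-valued convention (with likelihoods in bits, AIC here uses log base 2 rather than the textbook natural log):

```python
def perplexity(loglik_bits, n_points):
    """Per-item perplexity from a total log likelihood measured in bits."""
    return 2.0 ** (-loglik_bits / n_points)

def aic_bits(loglik_bits, n_params):
    """AIC, 2k - 2 log L, computed base-2 to match bit-valued likelihoods."""
    return 2.0 * n_params - 2.0 * loglik_bits
```

As a sanity check, plugging in Table 2's values for LUX (a log likelihood of −2.05×10^6 bits over 544,764 test points and 15751 parameters) reproduces its reported perplexity of about 13.6 and AIC of about 4.13×10^6.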
5.3 Evaluation Results

Table 1 summarizes the decision-based evaluation results.[6]

Table 1: Decision-based results. The percentage of correct responses of 544,764 test-set data points are shown.

         TOP1     TOP5     TOP10
  LUX    39.55%   69.80%   80.46%
  HM     39.40%   71.89%   82.53%
  GM     39.05%   69.25%   79.99%

We see little penalty for LUX and GM's constrained frameworks for modeling choices. However, the differences in the table, though numerically small, are significant (by Binomial test) at p < .02 or less. In particular, the fact that LUX wins TOP1 hints that its representations enable better generalization than HM or GM. The success of HM at TOP5 and TOP10, meanwhile, suggests that some qualitative aspects of people's use of color words do escape the strong assumptions of LUX and GM—a point we return to below.

[6] There is a caveat to these performance measures. All of the reported numbers are for the final data subset which we discuss in Section 4.1. We choose to use a subset which did not include color labels that had less than 100 occurrences. In the English-speaking and American-citizenship subset, the rare description tail accounts for 13% of the data—roughly one third of the tail data is unique descriptions. If the tail represents real-world circumstances, our model is only applicable 87% of the time, and thus the performance metrics should be scaled down. We do not explicitly report the scaled numbers.

At the same time, we draw a general lesson from the overall patterns of results in Table 1. Language users must be quite uncertain about how speakers will describe colors. Speakers do not seem to choose the most likely color label in a majority of responses; their behavior shows a long tail. These results are in line with the probabilistic models of meaning and speaker choice we have developed.

Table 2 summarizes the likelihood-based metrics.

Table 2: Likelihood-based evaluation results: negative log likelihood of the data, negative log likelihood of labels given points, Akaike Information Criterion, and perplexity of labels given color values. Parameter counts for AIC are 15751 for LUX, 315669 for HM and 5803 for GM.

         −LL        −LL_V      AIC        Perp
  LUX    1.13×10^7  2.05×10^6  4.13×10^6  13.61
  HM     1.13×10^7  2.09×10^6  4.82×10^6  14.41
  GM     1.34×10^7  2.08×10^6  4.17×10^6  14.14

GM's estimates don't fit the distribution of the test data as a whole: GM is a good model of what labels speakers give but not a good model of the points that get particular labels. By contrast, LUX gives the best value on every measure in the table. HM is flexible enough in principle to mirror LUX's predictions; HM must suffer from sparse data, given its vast number of parameters. By contrast, LUX is able to capture the distributions of speaker responses in deeper and more flexible ways by using semantics as an abstraction.

Our analysis of patterns of error in LUX suggests that LUX would be best improved by more faithful models of linguistic meaning, rather than more elaborate models of subjects' choices or more powerful learning methods. For one thing, neither LUX nor the simple prototype model captures ambiguity, which sometimes arises in Munroe's data. An example is the color label melon, which has a multimodal distribution in the reddish-orange and green areas of color space shown in Figure 5—most likely corresponding to people thinking about the distinct colors of the flesh of watermelon, cantaloupe and honeydew. Interestingly, our model captures the more common usage.

Figure 5: For the Hue dimension, the data for "melon" is plotted against the LUX model's φ curve.

A different modeling challenge is illustrated by the behavior of greenish in Figure 6. Greenish seems to be an exception to the general assumption that color terms label convex categories. Actually, greenish seems to fit the boundary of green—the areas that are not definitely green but not definitely not green. (Linguists often appeal to such concepts in the literature on vagueness.) This is not a convex area so, not surprisingly, our model finds a poor match. Additional research is needed to understand when it's appropriate to give meanings more complex representations and how they can be learned.

Figure 6: For the Hue dimension, the data for "greenish" is plotted against the LUX model's φ curve.

6 Discussion and Conclusion

Natural language color descriptions provide an expressive, precise, but open-ended vocabulary to characterize real-world objects. This paper documents and releases the Lexicon of Uncertain Color Standards (LUX), which provides semantic representations of 829 English color labels, derived from a large corpus of attested descriptions. Our evaluation shows that LUX provides a precise description of speakers' free-text labels of color patches. Our expectation therefore is that LUX will serve as a useful resource for building systems for situated language understanding and generation that need to describe colors to English-speaking users.

Our work in LUX has built closely on linguistic approaches to color meaning and psychological approaches to modeling experimental subjects. Because LUX bridges linguistic theory, psychological data, and system building, LUX also affords a unique set of resources for future research at the intersection of semantics and pragmatics of dialogue. For example, our work explains subjects' decisions as a straightforward reflection of their communicative goals in a probabilistic setting.
Our measures of availability and applicability can be seen as offering computational interpretations of the Gricean Maxims of Manner and Quality (Grice, 1975). However, these particular interpretations don't give rise to implicatures on our model—largely because our Rational Observer is so inclusive and variable in the descriptions it offers. To show this, we can analyze what an idealized hearer learns about an underlying color x when the speaker uses a color term k: this is P(x | k^said). The model predictions are formalized in Eq. 11.

    P(x | k^said) = P(x | k^said, k^true)
                  = P(k^said, k^true | x) P(x) / P(k^said, k^true)
                  = P(k^said | k^true) P(k^true | x) P(x) / (P(k^said | k^true) P(k^true))
                  = α_k P(k^true | x) P(x) / (α_k P(k^true))
                  = P(x | k^true)    (11)

We apply Bayes's rule, exploiting our model assumption that the speaker says k only when the speaker first judges that k is true. Our model also tells us that, given that k is true, the speaker's choice of whether to say k depends only on the availability α_k of the term k. Simplifying, we find that the pragmatic posterior—what we think the speaker was looking at when she said this word—coincides with the semantic posterior—what we think the word is true of. Intuitively, the hearer knows that the term is true because the speaker has used the word, independent of the color x the speaker is describing. Similarly, in our model of speaker choice, the speaker does not take x into account in choosing one of the applicable words to say (one way the speaker could do this, for example, would be to prefer terms that were more informative about the target color x). Instead, the speaker simply samples from the candidates. That's why the speaker's choice reveals only what the semantics says about x.

Technically, this makes semantics a Nash equilibrium, where the information the hearer recovers from an utterance is exactly the information the speaker intends to express—in keeping with a longstanding tradition in the philosophy of language (Lewis, 1969; Cumming, 2013). By contrast, researchers such as Smith et al. (2013) adopt broadly similar formal assumptions but predict asymmetries where sophisticated listeners can second-guess naive speakers' choices and recover "extra" information that the speaker has revealed incidentally and unintentionally. The difference between this approach and ours eventually leads to a difference in the priors over utterances, but it's best explained through the different utilities that motivate speakers' different choices in the first place. Smith et al. (2013) assume speakers want to be informative; we assume they want to fit in. The empirical success of our approach on Munroe's data motivates a larger project to elicit data that can explicitly probe subjects' communicative goals in relation to semantic coordination.
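The cancellation of α_k in Eq. 11 is easy to check numerically. A toy sketch (entirely illustrative: a one-dimensional stand-in for color space, a made-up applicability curve, and a uniform prior):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)                          # 1-D stand-in for color space
p_x = np.full_like(x, 1.0 / x.size)                     # uniform prior over patches
phi = np.clip(1.0 - 10.0 * np.abs(x - 0.3), 0.0, 1.0)   # toy applicability curve

posteriors = []
for alpha in (0.05, 0.9):                 # availability scales the joint ...
    joint = alpha * phi * p_x
    posteriors.append(joint / joint.sum())  # ... but cancels when we normalize
print(np.allclose(posteriors[0], posteriors[1]))  # True: P(x|k_said) = P(x|k_true)
```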
Meanwhile, our work formalizes probabilistic theories of vagueness with new scale and precision. These formalizations naturally suggest testing predictions about the dynamics of conversation drawn from the semantic literature on vagueness. For example, in hearing a description for an object, we come to know more about the standards governing the applicability of the description. Barker (2002) outlines this as a meta-semantic effect on the common ground among interlocutors. For example, hearing a yellow-green object called yellowish green should make objects in the same color range more likely to be referred to as yellowish green. We could use LUX straightforwardly to represent such conceptual pacts (Brennan and Clark, 1996) via a posterior over threshold parameters. It's natural to look for empirical evidence to assess the effectiveness of such context-dependent representations.

A particularly important case involves descriptive material that distinguishes a target referent from salient alternatives, as in the understanding or generation of referring expressions (Krahmer and van Deemter, 2012). Following Kyburg and Morreau (2000), we could represent this using LUX via a posterior over the threshold parameters that fit the target but exclude its alternatives. Again, our model associates such goals with quantitative measures that future research can explore empirically. Meo et al. (2014) present an initial exploration of this idea.

These open questions complement the key advantage that makes uncertainty about meaning crucial to the success of the model and experiments we have reported here. Many kinds of language use seem to be highly variable, and approaches to grounded semantics need ways to make room for this variability both in the semantic representations they learn and the algorithms that induce these representations from language data. We have argued that uncertainty about meaning is a powerful new tool to do this. We look forward to future work addressing uncertainty in grounded meanings in a wide range of continuous domains—generalizing from color to quantity, scales, space and time—and pursuing a wide range of reasoning efforts, to corroborate our results and to leverage them in grounded language use.

Acknowledgments

This work was supported in part by NSF DGE-0549115. This work has benefited from discussion and feedback from the reviewers of TACL, Maneesh Agrawala, David DeVault, Jason Eisner, Tarek El-Gaaly, Katrin Erk, Vicky Froyen, Joshua Gang, Pernille Hemmer, Alex Lascarides, and Tim Meo.

References

Hirotugu Akaike. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.

John R. Anderson. 1991. The adaptive nature of human categorization. Psychological Review, 98(3):409.

Jacob Andreas and Dan Klein. 2014. Grounding language with points and paths in continuous spaces. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 58–67, June.

Chris Barker. 2002. The dynamics of vagueness. Linguistics and Philosophy, 25(1):1–36.

Brent Berlin. 1991. Basic Color Terms: Their Universality and Evolution. Univ of California Press.

Susan E. Brennan and Herbert H. Clark. 1996. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory and Cognition, 22(6):1482–1493.

Stephen P. Brooks and Andrew Gelman. 1998. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.

Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 136–145.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling.
In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 310–318.

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 128–135.

Siddhartha Chib and Edward Greenberg. 1995. Understanding the Metropolis–Hastings algorithm. The American Statistician, 49(4):327–335.

Jason Chuang, Maureen Stone, and Pat Hanrahan. 2008. A probabilistic model of the categorical association between colors. In Color Imaging Conference, pages 6–11.

Sam Cumming. 2013. Coordination and content. Philosophers' Imprint, 13(4):1–16.

Colin R. Dawson, Jeremy Wright, Antons Rebguns, Marco Valenzuela Escárcega, Daniel Fried, and Paul R. Cohen. 2013. A generative probabilistic framework for learning spatial language. In 2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL), pages 1–8. IEEE.

David DeVault, Iris Oved, and Matthew Stone. 2006. Societal grounding is essential to meaningful language use. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 747–754.

Mark D. Fairchild. 2013. Color Appearance Models. The Wiley-IS&T Series in Imaging Science and Technology. Wiley.

Delia Graff Fara. 2000. Shifting sands: An interest-relative theory of vagueness. Philosophical Topics, 28(1):45–81.

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785, June.

George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais. 1987. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971.

Peter Gärdenfors. 2000. Conceptual Spaces. MIT Press.

Andrew Gelman, Gareth O. Roberts, and Walter R. Gilks. 1996. Efficient Metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. Smith, editors, Bayesian Statistics 5, pages 599–607. Oxford University Press.

Herbert P. Grice. 1975. Logic and conversation. In P. Cole and J. Morgan, editors, Syntax and Semantics III: Speech Acts, pages 41–58. Academic Press.

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1–3):335–346.

Jeffrey Heer and Maureen Stone. 2012. Color naming models for color selection, image editing and palette design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1007–1016.

John F. Hughes, Andries van Dam, Morgan McGuire, David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt Akeley. 2013. Computer Graphics: Principles and Practice (3rd Edition). Addison-Wesley Professional.

Gerhard Jäger. 2010. Natural color categories are convex sets. In Maria Aloni, Harald Bastiaanse, Tikitu de Jager, and Katrin Schulz, editors, Logic, Language and Meaning – 17th Amsterdam Colloquium, Amsterdam, The Netherlands, December 16–18, 2009, Revised Selected Papers, volume 6042 of Lecture Notes in Computer Science, pages 11–20. Springer.

Fred Jelinek, Robert L. Mercer, Lalit R. Bahl, and James K. Baker. 1977. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62:S63.

Paul Kay, Brent Berlin, Luisa Maffi, William R. Merrifield, and Richard Cook. 2009. The World Color Survey. CSLI.

Emiel Krahmer and Kees van Deemter. 2012.
Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics, 1(2):193–206.

Alice Kyburg and Michael Morreau. 2000. Fitting words: Vague words in context. Linguistics and Philosophy, 23(6):577–597.

Johan Maurice Gisele Lammens. 1994. A computational model of color perception and color naming. Ph.D. thesis, SUNY Buffalo.

Staffan Larsson. 2013. Formal semantics for perceptual classification. Journal of Logic and Computation. Advance online publication. doi:10.1093/logcom/ext059.

Daniel Lassiter. 2009. Vagueness as probabilistic linguistic knowledge. In Rick Nouwen, Robert van Rooij, Uli Sauerland, and Hans-Christian Schmitz, editors, Vagueness in Communication – International Workshop, ViC 2009, held as part of ESSLLI 2009, Bordeaux, France, July 20–24, 2009, Revised Selected Papers, volume 6517 of Lecture Notes in Computer Science, pages 127–150. Springer.

David K. Lewis. 1969. Convention: A Philosophical Study. Harvard University Press, Cambridge, MA.

Cynthia Matuszek, Nicholas Fitzgerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1671–1678.

Nikolaos Mavridis and Deb Roy. 2006. Grounded situation models for robots: Where words and percepts meet. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 4690–4697. IEEE.

Timothy Meo, Brian McMahan, and Matthew Stone. 2014. Generating and resolving vague color references. In SEMDIAL 2014: The 18th Workshop on the Semantics and Pragmatics of Dialogue, pages 107–115.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.

Randall Munroe. 2010. Color survey results. Online at http://blog.xkcd.com/2010/05/03/color-survey-results/.

Kimele Persaud and Pernille Hemmer. 2014. The influence of knowledge and expectations for color on episodic memory. In P. Bello, M. Guarini, M. McShane, and B. Scassellati, editors, Proceedings of the 36th Annual Conference of the Cognitive Science Society, pages 1162–1167.

Terry Regier, Paul Kay, and Richard S. Cook. 2005. Focal colors are universal after all. Proceedings of the National Academy of Sciences, 102:8386–8391.

Terry Regier, Paul Kay, and Naveen Khetarpal. 2007. Color naming reflects optimal partitions of color space. Proceedings of the National Academy of Sciences, 104:1436–1441.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 572–582.

Nathaniel J. Smith, Noah D. Goodman, and Michael C. Frank. 2013. Learning and using language via recursive pragmatic reasoning about other agents. In Advances in Neural Information Processing Systems 26, pages 3039–3047.

Stefanie Tellex, Thomas Kollar, and Steven Dickerson. 2011a. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine, 32(4):64–76.
Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011b. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 1507–1514.

Terry Winograd. 1970. Procedures as a representation for data in a computer program for understanding natural language. Ph.D. thesis, MIT.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, pages 658–666.