A Bayesian Model of Grounded Color Semantics

Brian McMahan, Rutgers University, brian.mcmahan@rutgers.edu
Matthew Stone, Rutgers University, matthew.stone@rutgers.edu

Abstract

Natural language meanings allow speakers to encode important real-world distinctions, but corpora of grounded language use also reveal that speakers categorize the world in different ways and describe situations with different terminology. To learn meanings from data, we therefore need to link underlying representations of meaning to models of speaker judgment and speaker choice. This paper describes a new approach to this problem: we model variability through uncertainty in categorization boundaries and distributions over preferred vocabulary. We apply the approach to a large data set of color descriptions, where statistical evaluation documents its accuracy. The results are available as a Lexicon of Uncertain Color Standards (LUX), which supports future efforts in grounded language understanding and generation by probabilistically mapping 829 English color descriptions to potentially context-sensitive regions in HSV color space.

1 Introduction

Grounding natural language semantics in real-world data at large scale requires researchers to confront the vocabulary problem (Furnas et al., 1987). Much of what people say falls in a long tail of increasingly infrequent and specialized items. Moreover, the choice of how to categorize and describe real-world data varies across people. We can't account for this complexity by deriving one definitive mapping between words and the world.

We see this complexity already in free-text descriptions of color patches. English has fewer than a dozen basic color words (Berlin, 1991), but people's descriptions of colors are much more variable than this would suggest. Measured on the corpus described in Section 4.1, there's an average of 3.845 bits of information in a color description given the color it describes—comparable to rolling a 14-sided die. Figure 1 summarizes the data and plots the entropy of descriptions encountered within small bins of color space. The bins are aggregated over the Saturation and Value dimensions and indexed on the x-axis by the Hue dimension.

Figure 1: A visualization of the variability of the descriptions used to name colors within small bins of color space. For each Hue value, the entropy values for each bin along the Saturation and Value dimensions are grouped and plotted as box plots. The dotted line corresponds to a random choice out of fourteen items and to the perplexity of a histogram model trained on the corpus.
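The per-bin statistic in Figure 1 is an ordinary Shannon entropy. A minimal sketch of the computation (ours, not the authors' code; it assumes the corpus responses have already been grouped into HSV bins):

```python
import collections
import math

def entropy_bits(labels):
    """Shannon entropy (in bits) of the descriptions observed in one color bin."""
    counts = collections.Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# 3.845 bits corresponds to a uniform choice among 2 ** 3.845, or about 14, items.
```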
There's little reason to think that this variability conceals consistent meanings. In formal semantics, one of the hallmarks of vague language is that speakers can make it more precise in alternative, incompatible ways (Barker, 2002). We see this in practice as well, for example with the image of Figure 2, where subjects comprehensibly describe either of two dogs as the tan one. Systems that robustly understand or generate descriptions of colors in situated dialogue need models of meaning that capture this variability.

Figure 2: Image by flickr user Joanne Bacon (jlbacon) from the data set of Young et al. (2014), whose subjects describe these dogs as a brown dog and a tan one or a tan dog and a white one.

This paper makes two key contributions towards this challenge. First, we present a methodology to infer a corpus-based model of meaning that accounts for possible differences in word usage across different speakers. As we explain in Section 2, our approach differs from the typical perspective in grounded semantics (Tellex et al., 2011a; Matuszek et al., 2012; Krishnamurthy and Kollar, 2013), where a meaning is reduced to a single classifier that collapses patterns of variation. Instead, our model allows for variability in meaning by positing uncertainty in classification boundaries that can get resolved when a speaker chooses to use a word on a specific occasion. We explain the model and its theoretical rationale in Section 3.

Second, we develop and release a Lexicon of Uncertain Color Standards (LUX) by applying our methodology to color descriptions. LUX is an interpretation of 829 distinct English color descriptions as distributions over regions of the Hue–Saturation–Value color space that describe their possible meanings. As we describe in Section 4, the model is trained by machine learning methods from a subset of Randall Munroe's 2010 publicly available corpus of 3.4 million crowdsourced free-text descriptions of color patches (Munroe, 2010). Data, models, and visualization software are available at http://mcmahan.io/lux/.

Statistical evaluation of our model against two alternative approaches documents its effectiveness. The model makes better quantitative predictions than a brute-force memorization model; it seems to generalize to unseen data in more meaningful ways. At the same time, our meanings work as well as special-purpose models to explain speaker choice, even though our model supports diverse other reasoning. See Section 5.

We see color as the first of many applications of our methodology, and are optimistic about learning vague meanings for other continuous domains such as quantity, space, and time. At the same time, the methodology opens up new prospects for research on negotiating meaning interactively (Larsson, 2013) with principled representations and with broad coverage. In fact, many practical situated dialogue systems already identify unfamiliar objects by color. We expect that LUX will provide a broadly useful resource to extend the range of descriptions such systems can generate and understand.

2 Related Work

Grounded semantics is the task of mapping representations of linguistic meaning to the physical world, whether by perceptual mechanisms (Harnad, 1990) or with the assistance of social interaction (DeVault et al., 2006). In this paper, we are particularly concerned with grounding the meanings of primitive vocabulary. However, the ultimate test of grounded semantics—whether it is understanding commands (Winograd, 1970; Tellex et al., 2011b), describing states of the world (Chen and Mooney, 2008), or identifying objects (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013; Dawson et al., 2013)—is the ability to interpret or generate utterances using lexical and compositional semantics so as to evoke appropriate real-world referents. Grounded semantics therefore involves more than just quantifying the associations between words and perceptual representations, as Chuang et al. (2008) and Heer and Stone (2012) do for color. Grounded semantics involves interpreting semantic primitives in terms of composable categories that let systems discriminate between cases where a word applies and cases where the word does not apply.
(Our evaluation compares models of grounded semantics to more direct models of word–world associations.) Previous research has modeled these categories as regions of suitable perceptual feature spaces. Researchers have explored explicit spaces of high-level perceptual attributes (Farhadi et al., 2009; Silberer et al., 2013), approximations to such spaces (Matuszek et al., 2012), or low-level feature spaces such as Bag of Visual Words (Bruni et al., 2012) or Histogram of Gradients (Krishnamurthy and Kollar, 2013). We specifically follow Gärdenfors (2000) and Jäger (2010) in assuming that color categories are convex regions in an underlying color space, and are not just determined by prototypical color values, as in Andreas and Klein (2014).

However, unlike previous work in grounded semantics, we do not assume that words name categories unequivocally. Speakers may vary in how they interpret a word, so we treat the link between words and categories probabilistically. The difference makes training our model more indirect than previous approaches to grounded meaning. In particular, our model introduces a new layer of uncertainty that describes what category the speaker uses. Similar kinds of uncertainty can be found in Bayesian models of speaker strategy, such as that of Smith et al. (2013). However, this research has assumed that speakers aim to be as informative as possible. We have no evidence that our speakers do that. We assume only that speakers' utterances are reliable and mirror prevailing usage.

Prior work by cognitive scientists has studied color terms extensively, but focused on basic ones—monolexemic, top-level color words with general application and high frequency in a language (Kay et al., 2009; Lammens, 1994). These color categories seem to shape people's expectations and memory for colors (Persaud and Hemmer, 2014), and patterns of color naming can therefore enhance software for helping people organize and interact with color (Chuang et al., 2008; Heer and Stone, 2012). Moreover, crosslinguistic evidence suggests that the human perceptual system places strong biases on the meanings of the basic color terms (Regier et al., 2005), perhaps because basic terms must partition the perceptual space in an efficient way (Regier et al., 2007). We depart from research on basic color naming in considering a much wider range of terms, much like Andreas and Klein (2014). We consider subordinate, non-basic terms like beige or lavender; modified colors like light blue or bright green; and named subcategories like olive green, navy blue or brick red.

In order to use semantic primitives for understanding, it's necessary to combine them into an integrated sentence-level representation: this is the problem of semantic parsing. Semantic parsers can be built by hand (Winograd, 1970), induced through inductive logic programming (Zelle and Mooney, 1996), or treated as a structured classification problem (Zettlemoyer and Collins, 2005). Once a suitable logical form is derived, interpretation typically involves a recursive process of finding referents that fit lexical categories and relationships (Mavridis and Roy, 2006; Tellex et al., 2011a). While this paper does not explicitly address how our meanings might be used in conjunction with such techniques, we see no fundamental obstacle to doing so—for example, by resolving references probabilistically and marginalizing over uncertainty in meaning.
3 Using Vague Color Terms: A Model

Our model involves two significant innovations over previous approaches to grounded meaning. The first is to capture the vagueness and flexibility of grounded meaning with semantic representations that treat meaning as uncertain. We represent the semantics of a color description with a distribution over color categories, which weights possible meanings by the relative likelihood of a speaker using this meaning on any particular occasion. For example, speakers might associate yellowish green with a range of possible meanings, differing in how far the color category extends into green hues. By representing uncertainty about meaning, our model makes room to capture variability in language use. For example, it implicitly quantifies how likely speakers are to use words differently, as with the two interpretations of tan in Figure 2.

Our second contribution is our simple model of the relationship between semantics and pragmatics. We assume that speakers' choices mirror established patterns. In particular, the model learns a measure of availability for each color term that tracks how frequently speakers tend to use it when it is applicable. For example, although the expressions yellowish green and chartreuse are associated with very similar color categories, people say yellowish green much more often: it has a higher availability. Empirically, we find a few terms with high availability and a long tail of terms with lower availabilities. We assume speakers simply sample applicable terms from this distribution, which predicts the long tail of observed responses.

Mathematically, we develop our approach through the rational analysis methodology for explaining human behavior proposed by Anderson (1991), along with methodological insights from the linguistics and philosophy of vagueness. In the remainder of this section, we explain the theoretical antecedents in perceptual science, linguistics, and cognitive modeling that inform our approach.

3.1 Color Categories

Color can be defined as the sensations by which the perceptual system tracks the diffuse reflectance of objects, despite variability, uncertainty, and ambiguity in the visual input. Red, green, and blue cones in the retina allow the visual system to coarsely estimate frequency bands in the spectrum of incoming light. Cameras and screens that use the red–green–blue (RGB) color space are designed roughly to correspond to these responses. However, colors in the visual system summarize spectral profiles rather than mere wavelengths of light. For example, we see colors like cyan (green plus blue without red), magenta (blue plus red without green) and yellow (red plus green without blue) as intermediate saturated colors between the familiar primaries. This naturally leads to a wheel of hues describing the relative prominence of different spectral components along a continuum. Fairchild (2013) provides an overview of color appearance.

To capture this variation, we'll work in the simple hue–saturation–value (HSV) color space that's common in computer graphics and color picker user interfaces (Hughes et al., 2013) and implemented in python's native colorsys package. This coordinate system represents colors with three distinct qualitative dimensions: Hue (H) represents changes in tint around a color wheel, Saturation (S) represents the relative proportion of color versus gray, and Value (V) represents the location on the white–black continuum.
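For concreteness, the conversion into these coordinates is a one-liner with the standard library; a quick sketch (the RGB triple here is just an illustrative yellow-green patch, not a value from the corpus):

```python
import colorsys

# colorsys expects RGB components in [0, 1] and returns (h, s, v) in [0, 1],
# with h measured as a fraction of a full turn around the hue wheel.
r, g, b = 154, 205, 50  # an 8-bit RGB patch in the yellow-green range
h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
print(h, s, v)  # roughly (0.22, 0.76, 0.80)
```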
We will associate color categories with rectangular box-shaped regions in HSV space. More sophisticated color spaces have been developed to describe the psychophysics of color more precisely, but they depend on the photometric illumination and other aspects of the viewing context that were not controlled in the collection of the data we are using (Fairchild, 2013).

3.2 Semantic Representation

Our assumption is that color terms are associated probabilistically with color categories. We illustrate the idea for the color label yellowish-green through the plot in Figure 3. The plot shows variation in use of the term across the Hue dimension: the bar graph is a scaled histogram of the responses in the data we use. There is a range of colors where people use yellowish green often, surrounded by borderline cases where it becomes increasingly infrequent.

Figure 3: The LUX model for "yellowish green" on the Hue axis plotted against the scaled histogram of the responses in the data. The φ curve represents the likelihood of "yellowish green" for different Hue values. The τ curves represent possible boundaries.

We represent this variability by assuming that the boundaries that delimit the color are uncertain. In any utterance, yellowish green fits only those Hue values that are above a minimum threshold τ_Lower and below a maximum threshold τ_Upper. However, it is uncertain which thresholds a speaker will use. The model describes this variability with probability density functions. They are shown for yellowish green in Figure 3 as the τ distributions. The figure shows that there is a central range of hues, between the τ distributions, that is definitely yellowish green. The τ distributions peak at the most likely boundaries for yellowish green, encompassing a broad region that's frequently called yellowish green. Further away, threshold values and yellowish green utterances alike become rapidly less likely.

Our representation is motivated by Barker (2002) and Lassiter (2009), who show how sets of possible thresholds[1] can account for many of our intuitions about the use of vague language. Their analysis invites us to capture semantic variability through two geometric constructs. First, there is a certain interval, parameterized by two points, μ_Lower and μ_Upper, within which a color description definitely applies. Outside this interval are regions of borderline cases, delimited by probabilistically varying thresholds τ_Lower and τ_Upper, where the color description sometimes applies. We represent the position of the threshold with a Γ(α, β) distribution, a standard statistical tool to model processes that start, continue indefinitely, and stop, like waiting times.[2] We can determine a likelihood that a description fits a color by marginalizing over the thresholds: this gives the black curve visualized in Figure 3. As we describe in Section 3.3, we can use this to account for the graded responses from subjects that we observe near color boundaries.

[1] We treat the terms "boundary", "threshold", and "standard" as synonymous, but useful in different contexts.

[2] Γ distributions rise quickly away from the origin point, then trail off from the peak in an open-ended exponential decay. One intuition for applying them in this case is Graff Fara's (2000) suggestion that a particular categorization decision involves waiting to find a natural break among salient colors. However, we choose them for mathematical convenience rather than psychological or linguistic considerations.

We summarize with a formal definition of our semantic representation. Let X be the 3D space of HSV colors and let x ∈ X be a measured color value. Each color label k has definite boundaries, μ_Lower and μ_Upper in X, delimiting a box of HSV color space. Surrounding the definite region are regions of uncertainty: the set of possible boundaries beyond μ.
These are represented by probability distributions over lower and upper threshold values in each dimension. We'll represent these thresholds by τ^{j,d}_k, where k ∈ K indexes the color label, j ∈ {Lower (L), Upper (U)} indexes the boundary, and d ∈ {H, S, V} indexes color components. We assume the thresholds are distributed as follows:

    τ^{Lower,d}_k ∼ μ^{Lower,d}_k − Γ(α^{Lower,d}_k, β^{Lower,d}_k)
    τ^{Upper,d}_k ∼ μ^{Upper,d}_k + Γ(α^{Upper,d}_k, β^{Upper,d}_k)    (1)

The meaning of a color term is thus a "blurry box". The distribution lets us determine the probability of a point x falling into the color category k as in Eq. 2. We also use the compact notation in Eq. 3.

    P(τ^{Lower,H}_k < x^H < τ^{Upper,H}_k) × P(τ^{Lower,S}_k < x^S < τ^{Upper,S}_k) × P(τ^{Lower,V}_k < x^V < τ^{Upper,V}_k)    (2)

    = ∏_d P(τ^{L,d}_k < x^d < τ^{U,d}_k)    (3)

3.3 Rational Observer Model

Our goal is to learn probabilistic representations of the meanings of color terms from subjects' responses. To do this, we need not only a framework for representing colors but also a model of how subjects choose color terms. Inspired by rational analysis (Anderson, 1991), we assume that speakers' choices match their communicative goals and their semantic knowledge. We leverage this assumption to derive a Bayes Rational Observer model linking semantics to observed color descriptions.

The graphical model in Figure 4 formalizes our approach.

Figure 4: The Rational Observer observes a color patch, x. The applicability of each label (k^true) is based upon the label parameters (α, β, μ) and x. The label (k^said) is sampled proportional to the applicability and a background weight: how often a label is said when it applies.

We start from an observed color patch, x. The Rational Observer uses the τ distributions for each color description k to determine the likelihood that the speaker judges k applicable. As defined in Eq. 3, the likelihood is the probability mass of possible boundaries that contain the target color value. Normally, many descriptions will be applicable. Which the speaker chooses depends further on the availability of the label—a background measure of how frequently a label is chosen when it's applicable. Intuitively, availability creates a bias for easy descriptions, capturing how natural or ordinary a description is in language use, how easily it springs to mind or how easily it is understood.

We formalize this as a generative model. As we explain in Section 4, we infer the parameters from our data. In Eq. 4, we consider the conditional distribution of a subject observing a color patch given HSV value x and labeling it k:

    P(k^said, k^true | x) = P(k^said | k^true) P(k^true | x)    (4)

In this equation, k^said is the event that the subject responds to x with label k and k^true is the event that the subject judges k true of the HSV value x. The two factors of Eq. 4 are respectively the availability and applicability of the color label.
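To make Eqs. 1–3 concrete, here is a minimal sketch of the per-dimension threshold marginalization (our illustration, not the released LUX code; it reads β as a rate parameter, matching the shape-and-rate wording of Section 4.2):

```python
from scipy.stats import gamma

def applicability_1d(x, mu_lo, mu_hi, a_lo, b_lo, a_hi, b_hi):
    """P(tau_lo < x < tau_hi) for one HSV dimension (one factor of Eq. 3).

    Eq. 1 puts tau_lo = mu_lo - g_lo and tau_hi = mu_hi + g_hi with
    g ~ Gamma(shape=a, rate=b), so P(tau_lo < x) = P(g_lo > mu_lo - x) and
    P(x < tau_hi) = P(g_hi > x - mu_hi). Since tau_lo <= mu_lo <= mu_hi <= tau_hi,
    at most one factor is ever below 1, so the product is exact.
    """
    p_lo = gamma.sf(mu_lo - x, a_lo, scale=1.0 / b_lo)  # survival fn = 1 - CDF; 1 if x >= mu_lo
    p_hi = gamma.sf(x - mu_hi, a_hi, scale=1.0 / b_hi)  # 1 if x <= mu_hi
    return p_lo * p_hi

def applicability(x_hsv, params):
    """Product over the H, S, V dimensions (Eq. 3); params maps each
    dimension name to its (mu_lo, mu_hi, a_lo, b_lo, a_hi, b_hi) tuple."""
    prob = 1.0
    for d in ("H", "S", "V"):
        prob *= applicability_1d(x_hsv[d], *params[d])
    return prob
```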
Availability: The prior P(k^said | k^true) quantifies the rate at which label k is used when it applies. We refer to this quantity as the availability and denote it as α_k. Availability captures the observed bias for frequent color terms. When multiple color labels fit a color value, those with higher availability will be used more often, but those with lower availability will still get used. This effect is partially responsible for the long tail of subjects' responses.

Applicability: The second factor, P(k^true | x), is the probability that k is true of, or applies to, the color value x. We calculate the applicability by marginalizing over all possible thresholds as in Eq. 3. In other words, we calculate the probability mass of the boundaries that allow this description to apply. We treat each applicability judgment as independent of others. This implies that the relative frequency at which we see a color description used is directly proportional to the proportion of boundaries which license it. For clearer notation and parameter estimation, we track thresholds with a piecewise function φ^d_k(x^d) as in Eq. 5 and Figure 3.

    φ^d_k(x^d) = { P(x^d > τ^{L,d}_k)   if x^d ≤ μ^{L,d}_k
                 { P(x^d < τ^{U,d}_k)   if x^d ≥ μ^{U,d}_k
                 { 1                    otherwise            (5)

Finally, Eq. 6 rewrites Eq. 4 to make the applicability and availability explicit. The model treats this equation as the probability of success for a Bernoulli trial and the data as sampled from Categorical distributions formed by the set of K Bernoulli random variables. This is discussed further in Section 4.2.

    P(k^said, k^true | x) = α_k ∏_d φ^d_k(x^d)    (6)
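Read generatively, Eq. 6 says the speaker scores every label by availability times applicability and then samples from the normalized scores. A minimal sketch of that speaker model, building on the applicability function sketched above (the names and data layout are our assumptions):

```python
import random

def sample_label(x_hsv, lexicon):
    """Sample a description for a patch per Eq. 6.

    lexicon maps each label k to (alpha_k, params_k); the scores
    alpha_k * phi_k(x) are normalized into a categorical distribution.
    """
    scores = {k: alpha * applicability(x_hsv, params)
              for k, (alpha, params) in lexicon.items()}
    labels, weights = zip(*scores.items())
    return random.choices(labels, weights=weights, k=1)[0]
```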
4 Learning Experiment

We worked with Randall Munroe's crowdsourced corpus of color judgments, and fit the model using Metropolis–Hastings Markov chain Monte Carlo with Gaussian random-walk proposals. This form of approximate Bayesian inference is described in Section 4.2.

4.1 Munroe Color Corpus

In 2010, Munroe elicited descriptions of color patches over the web. His platform asked users for background information such as sex, colorblindness, and monitor type, then presented color patches and let the user freely name them. The setup didn't ensure that users saw controlled colors or that users' responses were reliable, but the experiment collected over 3.4M items pairing RGB values with text descriptions. Munroe's methodology, data and results are published online (Munroe, 2010).[3]

[3] http://blog.xkcd.com/2010/05/03/color-survey-results/

Munroe summarizes his results with 954 idealized colors—RGB values that best exemplify high-frequency color labels. In effect, Munroe's summary offers a prototype theory of color vocabulary, like that of Andreas and Klein (2014). An alternative theory, which we explore, is that variability in the applicability of labels is an important part of people's knowledge of color semantics. We compare the two theories explicitly in Section 5.

Our experiments focus on a subset of Munroe's data comprising 2,176,417 data points and 829 color descriptions, divided into a training set of 70%, a 5% development set, and a held-out test set of 25%. To minimize variability in language use, we selected data from users who self-report as non-colorblind English speakers. This accounts for 2.5M of Munroe's 3.4M items. To get our subset, we further restrict attention to labels used 100 times or more, to ensure that there's substantial evidence of each term's breadth of applicability. We hand curated the responses to correct some minor spelling variations involving a single-character change ("yellow green" vs "yellow-green"; "fuchsia" vs "fuschia", "fushia", "fuchia", and "fucsia") and to remove high-frequency spam labels. We are left with 829 color labels that fit these restrictions. Finally, we used python's colorsys to convert from RGB to HSV, where we hypothesize color meanings can be represented more simply. We include these data sets with our release at http://mcmahan.io/lux/ so our results can be replicated.

4.2 Fitting the Model Parameters

Optimization of the model's parameters is framed in a Bayesian framework and interpreted as maximizing the likelihood of the data given the parameters. We fit each label and each dimension independently. The data on each dimension is binned, as in Figure 3, so we have Binomial random variables for each bin. For each color label k, the probability of success is based on the model's parameters. Non-k data in the bin are observations of failure. This gives Eq. 7:

    n^d_{i,k} | n^d_i, Z^d_k, φ_k ∼ Bin(n^d_i, Z^d_k φ^d_k(i))    (7)

Here n^d_i is the number of data points in bin i on dimension d, n^d_{i,k} is the number of data points for label k in bin i on dimension d, and Z^d_k is a normalization constant, implicitly reflecting both the availability α_k and the distribution of responses of the term across other color dimensions. The optimization process is a parameter search method which uses as an objective function the probability of n^d_{i,k} in Eq. 7 for all d, i, and k.

Parameter Search: We adopt a Bayesian coordinate descent which sequentially samples the certain-region parameter, μ, and the shape and rate parameters (α and β) of the Γ distributions for all d and k independently. It also samples the estimated normalization constant, Z^d_k. More specifically, the sampling is done using Metropolis–Hastings Markov chain Monte Carlo (Metropolis et al., 1953; Chib and Greenberg, 1995), which performs a Gaussian random walk on the parameters.[4] For each sample, the likelihood of the data, derived from the Binomial variables, is compared for the new and old set of parameters. The new parameters are accepted proportionally to the ratio of the two likelihoods. Multiple chains were run using 4 different bin sizes per dimension and monitored for convergence using the generalized Gelman–Rubin diagnostic method (Brooks and Gelman, 1998). This methodology leaves us not only with the Monte Carlo estimate of the expected value for each parameter, but also a sampling distribution that quantifies the uncertainty in the parameters themselves.

[4] We set the standard deviation of the sampling Gaussian to be 1 for each μ and 0.3 for each α and β after finding experimentally that it led to effective parameter search (Gelman et al., 1996).

Availability: Availability is estimated as the ratio of the observed frequency of a label to its expected frequency given the parameters which define its distribution. The expected frequency, a marginalization of the φ function over the color space, is calculated using the midpoint integration approximation.

    α_k = P(k^said, k^true) / P(k^true) = (count(k)/N) / ∫_x P(k^true | x) P(x) dx    (8)
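A sketch of the estimate in Eq. 8 (ours; it assumes P(x) is uniform over the HSV unit cube, which the paper leaves unspecified, and uses a midpoint grid for the integral):

```python
import numpy as np

def estimate_availability(count_k, n_total, phi_k, step=0.02):
    """Eq. 8: observed frequency of label k over its expected frequency.

    phi_k is a vectorized applicability function over (H, S, V) arrays.
    With P(x) uniform, the integral reduces to the mean of phi_k evaluated
    on a midpoint grid over the unit cube.
    """
    pts = np.arange(step / 2, 1.0, step)
    H, S, V = np.meshgrid(pts, pts, pts, indexing="ij")
    expected = phi_k(H, S, V).mean()  # midpoint approximation of the integral
    return (count_k / n_total) / expected
```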
5 Model Evaluation

LUX explains Munroe's data via speakers' rational use of probabilistic meanings, represented as simple "blurry boxes". In this section, we assess the effectiveness of this explanation. We anticipate two arguments against our model: first, that the representation is too simple; second, that factoring speakers' choices through a model of meaning is too cumbersome. We rebut these arguments by providing metrics and results that suggest that LUX escapes these objections and captures almost all of the structure in subjects' responses.

5.1 Alternative Models

To test LUX's representations, we built a brute-force histogram model (HM) that discretizes HSV space and tracks frequency distributions of labels directly in each discretized bin. Similar histogram models have been developed by Chuang et al. (2008) and Heer and Stone (2012) to build interfaces for interacting with color that are informed by human categorization and naming. More precisely, our HM uses a linear interpolation method (Chen and Goodman, 1996) to combine three histograms of various granularity.[5] This amounts to predicting responses by querying the training data. HM has the potential to expose whether LUX is missing important features of the distribution of color descriptions.

[5] Specifically, the histograms are of size (90,10,10), (45,5,5), and (1,1,1) across Hue, Saturation, and Value with interpolation weights of 0.322, 0.643, and 0.035 respectively. These parameters were determined by taking the training set as 5-fold validation sets.

We also built a direct model of subjects' choices of color terms. Instead of appealing to the applicability and availability of a color label, it works with the observed frequency of a color label and a Gaussian model of the probability of a color value for each label, as in Eq. 9:

    P(k^said, k^true | x) ∝ P(x | k^true) P(k^said, k^true)    (9)

This Gaussian model (GM) generalizes Munroe's pairing of labels with prototypical colors: P(x | k^true) is a Gaussian with diagonal covariance, so it associates each color term with a mean HSV value and with variances in each dimension that determine a label-specific distance metric. GM predicts speaker choice by weighting these distances probabilistically against the priors. GM completely sidesteps the need to model meaning categorically. It therefore has the potential to expose whether our assumptions about semantic representations and speaker choices hinder LUX's performance.

5.2 Evaluation Metrics

We evaluate the models using two classes of metrics on a held-out test set consisting of 25% of the corpus. The first type is based upon the posterior distribution over labels and the ranked position of subjects' actual labels of color values. The second type is based upon the log likelihood of the models, which quantifies model fit.

5.2.1 Decision-Based Metrics

To answer how accurate a model's predictions are, we can locate subjects' responses in the weighted rankings computed by the models.

The TOPK Measures: Each model provides a posterior distribution over the possible labels. The most likely label of this posterior is the maximum likelihood estimate (MLE). We track how often the MLE color label is what the user actually said as the TOP1 measure. For the Histogram Model, the TOP1 approximates the most frequent label observed in the data for a color value. We also measure how often the correct label appears in the first 5 and 10 most likely labels. These are denoted TOP5 and TOP10 respectively.
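Concretely, the TOPK measures can be computed from a model's posteriors as follows (a sketch; the array layout is our assumption):

```python
import numpy as np

def topk_accuracy(posteriors, gold, ks=(1, 5, 10)):
    """posteriors: (N, K) model scores over K labels; gold: (N,) label indices.

    Returns the fraction of test points whose actual label ranks in the top k.
    """
    order = np.argsort(-posteriors, axis=1)            # most likely label first
    ranks = np.argmax(order == gold[:, None], axis=1)  # rank of the gold label
    return {k: float((ranks < k).mean()) for k in ks}
```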
5.2.2 Likelihood-Based Metrics

We can also measure how well a model explains speaker choice using the log likelihood of the labels given the model and the color values, denoted as LL_V(M). This is calculated using Eq. 10 across all N data points in the held-out test set. LL_V(M) is used when computing perplexity and the Akaike Information Criterion (AIC). We report all measures in bits.

    LL_V(M) = log₂ P_M(K^true, K^said | X) = Σ_i log₂ P_M(k^true_i, k^said_i | x_i)    (10)

A more general measure of model fit is the joint log likelihood of the color values and their labels, LL(M), given the model. It is defined and calculated analogously.

Perplexity: Perplexity has been used in past research to measure the performance of statistical language models (Jelinek et al., 1977; Brown et al., 1992). Lower perplexity means that the model is less surprised by the data and so describes it more precisely. We use it here to measure how well a model encodes the regularities in color descriptions.

Akaike Information Criterion: AIC is derived from information theory (Akaike, 1974) and balances the model's fit to the data with the complexity of the model by penalizing a larger number of parameters. The intuition is that a smaller AIC indicates a better balance of parameters and model fit.
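Both statistics are simple functions of LL_V(M); a sketch in the paper's bit-valued convention (with likelihoods in bits, AIC here uses log base 2 rather than the textbook natural log):

```python
def perplexity(loglik_bits, n_points):
    """Per-item perplexity from a total log likelihood measured in bits."""
    return 2.0 ** (-loglik_bits / n_points)

def aic_bits(loglik_bits, n_params):
    """AIC, 2k - 2 log L, computed base-2 to match bit-valued likelihoods."""
    return 2.0 * n_params - 2.0 * loglik_bits
```

As a sanity check, plugging in Table 2's values for LUX (a log likelihood of −2.05×10^6 bits over 544,764 test points and 15751 parameters) reproduces its reported perplexity of about 13.6 and AIC of about 4.13×10^6.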
5.3 Evaluation Results

Table 1 summarizes the decision-based evaluation results.[6]

Table 1: Decision-based results. The percentage of correct responses of 544,764 test-set data points are shown.

         TOP1     TOP5     TOP10
  LUX    39.55%   69.80%   80.46%
  HM     39.40%   71.89%   82.53%
  GM     39.05%   69.25%   79.99%

We see little penalty for LUX and GM's constrained frameworks for modeling choices. However, the differences in the table, though numerically small, are significant (by Binomial test) at p < .02 or less. In particular, the fact that LUX wins TOP1 hints that its representations enable better generalization than HM or GM. The success of HM at TOP5 and TOP10, meanwhile, suggests that some qualitative aspects of people's use of color words do escape the strong assumptions of LUX and GM—a point we return to below.

[6] There is a caveat to these performance measures. All of the reported numbers are for the final data subset which we discuss in Section 4.1. We choose to use a subset which did not include color labels that had less than 100 occurrences. In the English-speaking and American-citizenship subset, the rare description tail accounts for 13% of the data—roughly one third of the tail data is unique descriptions. If the tail represents real-world circumstances, our model is only applicable 87% of the time, and thus the performance metrics should be scaled down. We do not explicitly report the scaled numbers.

At the same time, we draw a general lesson from the overall patterns of results in Table 1. Language users must be quite uncertain about how speakers will describe colors. Speakers do not seem to choose the most likely color label in a majority of responses; their behavior shows a long tail. These results are in line with the probabilistic models of meaning and speaker choice we have developed.

Table 2 summarizes the likelihood-based metrics.

Table 2: Likelihood-based evaluation results: negative log likelihood of the data, negative log likelihood of labels given points, Akaike Information Criterion, and perplexity of labels given color values. Parameter counts for AIC are 15751 for LUX, 315669 for HM and 5803 for GM.

         −LL        −LL_V      AIC        Perp
  LUX    1.13×10^7  2.05×10^6  4.13×10^6  13.61
  HM     1.13×10^7  2.09×10^6  4.82×10^6  14.41
  GM     1.34×10^7  2.08×10^6  4.17×10^6  14.14

GM's estimates don't fit the distribution of the test data as a whole: GM is a good model of what labels speakers give but not a good model of the points that get particular labels. By contrast, LUX gives the best value on every measure in the table. HM is flexible enough in principle to mirror LUX's predictions; HM must suffer from sparse data, given its vast number of parameters. By contrast, LUX is able to capture the distributions of speaker responses in deeper and more flexible ways by using semantics as an abstraction.

Our analysis of patterns of error in LUX suggests that LUX would be best improved by more faithful models of linguistic meaning, rather than more elaborate models of subjects' choices or more powerful learning methods. For one thing, neither LUX nor the simple prototype model captures ambiguity, which sometimes arises in Munroe's data. An example is the color label melon, which has a multimodal distribution in the reddish-orange and green areas of color space shown in Figure 5—most likely corresponding to people thinking about the distinct colors of the flesh of watermelon, cantaloupe and honeydew. Interestingly, our model captures the more common usage.

Figure 5: For the Hue dimension, the data for "melon" is plotted against the LUX model's φ curve.

A different modeling challenge is illustrated by the behavior of greenish in Figure 6. Greenish seems to be an exception to the general assumption that color terms label convex categories. Actually, greenish seems to fit the boundary of green—the areas that are not definitely green but not definitely not green. (Linguists often appeal to such concepts in the literature on vagueness.) This is not a convex area so, not surprisingly, our model finds a poor match. Additional research is needed to understand when it's appropriate to give meanings more complex representations and how they can be learned.

Figure 6: For the Hue dimension, the data for "greenish" is plotted against the LUX model's φ curve.

6 Discussion and Conclusion

Natural language color descriptions provide an expressive, precise, but open-ended vocabulary to characterize real-world objects. This paper documents and releases the Lexicon of Uncertain Color Standards (LUX), which provides semantic representations of 829 English color labels, derived from a large corpus of attested descriptions. Our evaluation shows that LUX provides a precise description of speakers' free-text labels of color patches. Our expectation therefore is that LUX will serve as a useful resource for building systems for situated language understanding and generation that need to describe colors to English-speaking users.

Our work in LUX has built closely on linguistic approaches to color meaning and psychological approaches to modeling experimental subjects. Because LUX bridges linguistic theory, psychological data, and system building, LUX also affords a unique set of resources for future research at the intersection of semantics and pragmatics of dialogue. For example, our work explains subjects' decisions as a straightforward reflection of their communicative goals in a probabilistic setting.
Our measures of availability and applicability can be seen as offering computational interpretations of the Gricean Maxims of Manner and Quality (Grice, 1975). However, these particular interpretations don't give rise to implicatures on our model—largely because our Rational Observer is so inclusive and variable in the descriptions it offers. To show this, we can analyze what an idealized hearer learns about an underlying color x when the speaker uses a color term k: this is P(x | k^said). The model predictions are formalized in Eq. 11.

    P(x | k^said) = P(x | k^said, k^true)
                  = P(k^said, k^true | x) P(x) / P(k^said, k^true)
                  = P(k^said | k^true) P(k^true | x) P(x) / (P(k^said | k^true) P(k^true))
                  = α_k P(k^true | x) P(x) / (α_k P(k^true))
                  = P(x | k^true)    (11)

We apply Bayes's rule, exploiting our model assumption that the speaker says k only when the speaker first judges that k is true. Our model also tells us that, given that k is true, the speaker's choice of whether to say k depends only on the availability α_k of the term k. Simplifying, we find that the pragmatic posterior—what we think the speaker was looking at when she said this word—coincides with the semantic posterior—what we think the word is true of. Intuitively, the hearer knows that the term is true because the speaker has used the word, independent of the color x the speaker is describing. Similarly, in our model of speaker choice, the speaker does not take x into account in choosing one of the applicable words to say (one way the speaker could do this, for example, would be to prefer terms that were more informative about the target color x). Instead, the speaker simply samples from the candidates. That's why the speaker's choice reveals only what the semantics says about x.

Technically, this makes semantics a Nash equilibrium, where the information the hearer recovers from an utterance is exactly the information the speaker intends to express—in keeping with a longstanding tradition in the philosophy of language (Lewis, 1969; Cumming, 2013). By contrast, researchers such as Smith et al. (2013) adopt broadly similar formal assumptions but predict asymmetries where sophisticated listeners can second-guess naive speakers' choices and recover "extra" information that the speaker has revealed incidentally and unintentionally. The difference between this approach and ours eventually leads to a difference in the priors over utterances, but it's best explained through the different utilities that motivate speakers' different choices in the first place. Smith et al. (2013) assume speakers want to be informative; we assume they want to fit in. The empirical success of our approach on Munroe's data motivates a larger project to elicit data that can explicitly probe subjects' communicative goals in relation to semantic coordination.
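The cancellation of α_k in Eq. 11 is easy to check numerically. A toy sketch (entirely illustrative: a one-dimensional stand-in for color space, a made-up applicability curve, and a uniform prior):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)                          # 1-D stand-in for color space
p_x = np.full_like(x, 1.0 / x.size)                     # uniform prior over patches
phi = np.clip(1.0 - 10.0 * np.abs(x - 0.3), 0.0, 1.0)   # toy applicability curve

posteriors = []
for alpha in (0.05, 0.9):                 # availability scales the joint ...
    joint = alpha * phi * p_x
    posteriors.append(joint / joint.sum())  # ... but cancels when we normalize
print(np.allclose(posteriors[0], posteriors[1]))  # True: P(x|k_said) = P(x|k_true)
```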
Meanwhile, our work formalizes probabilistic theories of vagueness with new scale and precision. These formalizations naturally suggest testing predictions about the dynamics of conversation drawn from the semantic literature on vagueness. For example, in hearing a description for an object, we come to know more about the standards governing the applicability of the description. Barker (2002) outlines this as a meta-semantic effect on the common ground among interlocutors. For example, hearing a yellow-green object called yellowish green should make objects in the same color range more likely to be referred to as yellowish green. We could use LUX straightforwardly to represent such conceptual pacts (Brennan and Clark, 1996) via a posterior over threshold parameters. It's natural to look for empirical evidence to assess the effectiveness of such context-dependent representations.

A particularly important case involves descriptive material that distinguishes a target referent from salient alternatives, as in the understanding or generation of referring expressions (Krahmer and van Deemter, 2012). Following Kyburg and Morreau (2000), we could represent this using LUX via a posterior over the threshold parameters that fit the target but exclude its alternatives. Again, our model associates such goals with quantitative measures that future research can explore empirically. Meo et al. (2014) present an initial exploration of this idea.

These open questions complement the key advantage that makes uncertainty about meaning crucial to the success of the model and experiments we have reported here. Many kinds of language use seem to be highly variable, and approaches to grounded semantics need ways to make room for this variability both in the semantic representations they learn and the algorithms that induce these representations from language data. We have argued that uncertainty about meaning is a powerful new tool to do this. We look forward to future work addressing uncertainty in grounded meanings in a wide range of continuous domains—generalizing from color to quantity, scales, space and time—and pursuing a wide range of reasoning efforts, to corroborate our results and to leverage them in grounded language use.

Acknowledgments

This work was supported in part by NSF DGE-0549115. This work has benefited from discussion and feedback from the reviewers of TACL, Maneesh Agrawala, David DeVault, Jason Eisner, Tarek El-Gaaly, Katrin Erk, Vicky Froyen, Joshua Gang, Pernille Hemmer, Alex Lascarides, and Tim Meo.

References

Hirotugu Akaike. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.

John R. Anderson. 1991. The adaptive nature of human categorization. Psychological Review, 98(3):409.

Jacob Andreas and Dan Klein. 2014. Grounding language with points and paths in continuous spaces. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 58–67, June.

Chris Barker. 2002. The dynamics of vagueness. Linguistics and Philosophy, 25(1):1–36.

Brent Berlin. 1991. Basic Color Terms: Their Universality and Evolution. Univ of California Press.

Susan E. Brennan and Herbert H. Clark. 1996. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory and Cognition, 22(6):1482–1493.

Stephen P. Brooks and Andrew Gelman. 1998. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.

Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 136–145.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling.
In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 310–318.

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 128–135.

Siddhartha Chib and Edward Greenberg. 1995. Understanding the Metropolis–Hastings algorithm. The American Statistician, 49(4):327–335.

Jason Chuang, Maureen Stone, and Pat Hanrahan. 2008. A probabilistic model of the categorical association between colors. In Color Imaging Conference, pages 6–11.

Sam Cumming. 2013. Coordination and content. Philosophers' Imprint, 13(4):1–16.

Colin R. Dawson, Jeremy Wright, Antons Rebguns, Marco Valenzuela Escárcega, Daniel Fried, and Paul R. Cohen. 2013. A generative probabilistic framework for learning spatial language. In 2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL), pages 1–8. IEEE.

David DeVault, Iris Oved, and Matthew Stone. 2006. Societal grounding is essential to meaningful language use. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 747–754.

Mark D. Fairchild. 2013. Color Appearance Models. The Wiley-IS&T Series in Imaging Science and Technology. Wiley.

Delia Graff Fara. 2000. Shifting sands: An interest-relative theory of vagueness. Philosophical Topics, 28(1):45–81.

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785, June.

George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais. 1987. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971.

Peter Gärdenfors. 2000. Conceptual Spaces. MIT Press.

Andrew Gelman, Gareth O. Roberts, and Walter R. Gilks. 1996. Efficient Metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. Smith, editors, Bayesian Statistics 5, pages 599–607. Oxford University Press.

Herbert P. Grice. 1975. Logic and conversation. In P. Cole and J. Morgan, editors, Syntax and Semantics III: Speech Acts, pages 41–58. Academic Press.

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1–3):335–346.

Jeffrey Heer and Maureen Stone. 2012. Color naming models for color selection, image editing and palette design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1007–1016.

John F. Hughes, Andries van Dam, Morgan McGuire, David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt Akeley. 2013. Computer Graphics: Principles and Practice (3rd Edition). Addison-Wesley Professional.

Gerhard Jäger. 2010. Natural color categories are convex sets. In Maria Aloni, Harald Bastiaanse, Tikitu de Jager, and Katrin Schulz, editors, Logic, Language and Meaning – 17th Amsterdam Colloquium, Amsterdam, The Netherlands, December 16–18, 2009, Revised Selected Papers, volume 6042 of Lecture Notes in Computer Science, pages 11–20. Springer.

Fred Jelinek, Robert L. Mercer, Lalit R. Bahl, and James K. Baker. 1977. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62:S63.

Paul Kay, Brent Berlin, Luisa Maffi, William R. Merrifield, and Richard Cook. 2009. The World Color Survey. CSLI.

Emiel Krahmer and Kees van Deemter. 2012.
Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics, 1(2):193–206.

Alice Kyburg and Michael Morreau. 2000. Fitting words: Vague words in context. Linguistics and Philosophy, 23(6):577–597.

Johan Maurice Gisele Lammens. 1994. A computational model of color perception and color naming. Ph.D. thesis, SUNY Buffalo.

Staffan Larsson. 2013. Formal semantics for perceptual classification. Journal of Logic and Computation. Advance online publication. doi:10.1093/logcom/ext059.

Daniel Lassiter. 2009. Vagueness as probabilistic linguistic knowledge. In Rick Nouwen, Robert van Rooij, Uli Sauerland, and Hans-Christian Schmitz, editors, Vagueness in Communication – International Workshop, ViC 2009, held as part of ESSLLI 2009, Bordeaux, France, July 20–24, 2009, Revised Selected Papers, volume 6517 of Lecture Notes in Computer Science, pages 127–150. Springer.

David K. Lewis. 1969. Convention: A Philosophical Study. Harvard University Press, Cambridge, MA.

Cynthia Matuszek, Nicholas Fitzgerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1671–1678.

Nikolaos Mavridis and Deb Roy. 2006. Grounded situation models for robots: Where words and percepts meet. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 4690–4697. IEEE.

Timothy Meo, Brian McMahan, and Matthew Stone. 2014. Generating and resolving vague color references. In SEMDIAL 2014: The 18th Workshop on the Semantics and Pragmatics of Dialogue, pages 107–115.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.

Randall Munroe. 2010. Color survey results. Online at http://blog.xkcd.com/2010/05/03/color-survey-results/.

Kimele Persaud and Pernille Hemmer. 2014. The influence of knowledge and expectations for color on episodic memory. In P. Bello, M. Guarini, M. McShane, and B. Scassellati, editors, Proceedings of the 36th Annual Conference of the Cognitive Science Society, pages 1162–1167.

Terry Regier, Paul Kay, and Richard S. Cook. 2005. Focal colors are universal after all. Proceedings of the National Academy of Sciences, 102:8386–8391.

Terry Regier, Paul Kay, and Naveen Khetarpal. 2007. Color naming reflects optimal partitions of color space. Proceedings of the National Academy of Sciences, 104:1436–1441.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 572–582.

Nathaniel J. Smith, Noah D. Goodman, and Michael C. Frank. 2013. Learning and using language via recursive pragmatic reasoning about other agents. In Advances in Neural Information Processing Systems 26, pages 3039–3047.

Stefanie Tellex, Thomas Kollar, and Steven Dickerson. 2011a. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine, 32(4):64–76.
Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011b. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 1507–1514.

Terry Winograd. 1970. Procedures as a representation for data in a computer program for understanding natural language. Ph.D. thesis, MIT.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, pages 658–666.