Transactions of the Association for Computational Linguistics, 1 (2013) 279–290. Action Editor: Lillian Lee. Submitted 11/2012; Revised 1/2013; Published 7/2013. ©2013 Association for Computational Linguistics.

Good, Great, Excellent: Global Inference of Semantic Intensities

Gerard de Melo (ICSI, Berkeley; demelo@icsi.berkeley.edu)
Mohit Bansal (CS Division, UC Berkeley; mbansal@cs.berkeley.edu)

Abstract

Adjectives like good, great, and excellent are similar in meaning, but differ in intensity. Intensity order information is very useful for language learners as well as in several NLP tasks, but is missing in most lexical resources (dictionaries, WordNet, and thesauri). In this paper, we present a primarily unsupervised approach that uses semantics from Web-scale data (e.g., phrases like good but not excellent) to rank words by assigning them positions on a continuous scale. We rely on Mixed Integer Linear Programming to jointly determine the ranks, such that individual decisions benefit from global information. When ranking English adjectives, our global algorithm achieves substantial improvements over previous work on both pairwise and rank correlation metrics (specifically, 70% pairwise accuracy as compared to only 56% by previous work). Moreover, our approach can incorporate external synonymy information (increasing its pairwise accuracy to 78%) and extends easily to new languages. We also make our code and data freely available.¹

¹ http://demelo.org/gdm/intensity/

1 Introduction

Current lexical resources such as dictionaries and thesauri do not provide information about the intensity order of words. For example, both WordNet (Miller, 1995) and Roget’s 21st Century Thesaurus (thesaurus.com) present acceptable, great, and superb as synonyms of the adjective good. However, a native speaker knows that these words represent varying intensity and can in fact generally be ranked by intensity as acceptable < good < great < superb. Similarly, warm < hot < scorching are identified as synonyms in these resources. Ranking information, however, is crucial because it allows us to differentiate, e.g., between various intensities of an emotion, and is hence very useful for humans when learning a language or judging product reviews, as well as for automatic text understanding and generation tasks such as sentiment and subjectivity analysis, recognizing textual entailment, question answering, summarization, and coreference and discourse analysis.

In this work, we attempt to automatically rank sets of related words by intensity, focusing in particular on adjectives. This is made possible by the vast amounts of world knowledge that are now available. We use lexico-semantic information extracted from a Web-scale corpus in conjunction with an algorithm based on a Mixed Integer Linear Program (MILP). Linguistic analyses have identified phrases such as good but not great or hot and almost scorching in a text corpus as sources of evidence about the relative intensities of words. However, pure information extraction approaches often fail to provide enough coverage for real-world downstream applications (Tandon and de Melo, 2010), unless some form of advanced inference is used (Snow et al., 2006; Suchanek et al., 2009).
In our work, we address this sparsity problem by relying on Web-scale data and using an MILP model that extends the pairwise scores to a more complete joint ranking of words on a continuous scale, while maintaining global constraints such as transitivity and giving more weight to the order of word pairs with higher corpus evidence scores. Instead of considering intensity ranking as a pairwise decision process, we thus exploit the fact that individual decisions may benefit from global information, e.g. about how two words relate to some third word.

Previous work (Sheinman and Tokunaga, 2009; Schulam and Fellbaum, 2010; Sheinman et al., 2012) has also used lexico-semantic patterns to order adjectives. They mainly evaluate their algorithm on a set of pairwise decisions, but also present a partitioning approach that attempts to form scales by placing each adjective to the left or right of pivot words. Unfortunately, this approach often fails because many pairs lack order-based evidence even on the Web, as explained in more detail in Section 3.

In contrast, our MILP jointly uses information from all relevant word pairs and captures complex interactions and inferences to produce intensity scales. We can thus obtain an order between two adjectives even when there is no explicit evidence in the corpus (using evidence for related pairs and transitive inference). Our global MILP is flexible and can also incorporate additional synonymy information if available (which helps the MILP find an even better ranking solution). Our approach also extends easily to new languages. We describe two approaches for this multilingual extension: pattern projection and cross-lingual MILPs.

We evaluate our predicted intensity rankings using both pairwise classification accuracy and ranking correlation coefficients, achieving strong results, significantly better than the previous approach by Sheinman & Tokunaga (32% relative error reduction) and quite close to human-level performance.

2 Method

In this section, we describe each step of our approach to ordering adjectives on a single, relative scale. Our method can also be applied to other word classes and to languages other than English.

2.1 Web-based Scoring Model

2.1.1 Intensity Scales

Near-synonyms may differ in intensity, e.g. joy vs. euphoria, or drizzle vs. rain. This is particularly true of adjectives, which can represent different degrees of a given quality or attribute such as size or age. Many adjectives are gradable and thus allow for grading adverbial modifiers to express such intensity degrees, e.g., a house can be very big or extremely big. Often, however, completely different adjectives refer to varying degrees on the same scale, e.g., huge, gigantic, gargantuan. Even adjectives like enormous (or superb, impossible) that are considered non-gradable from a syntactic perspective can be placed on such a scale.

Weak-Strong Patterns            Strong-Weak Patterns
? (,) but not ?                 not ? (,) just ?
? (,) if not ?                  not ? (,) but just ?
? (,) although not ?            not ? (,) still ?
? (,) though not ?              not ? (,) but still ?
? (,) (and/or) even ?           not ? (,) although still ?
? (,) (and/or) almost ?         not ? (,) though still ?
not only ? but ?                ? (,) or very ?
not just ? but ?

Table 1: Ranking patterns used in this work. Among the patterns represented by the regular expressions above, we use only those that capture at most five words (to fit in the Google n-grams, see Section 2.1.2). Articles (a, an, the) are allowed to appear before the wildcards wherever possible.
2.1.2 Intensity Patterns

Linguistic studies have found lexical patterns like ‘? but not ?’ (e.g. good but not great) to reveal order information between a pair of adjectives (Sheinman and Tokunaga, 2009). We assume that we have two sets of lexical patterns that allow us to infer the most likely ordering between two words when encountered in a corpus. A first pattern set, Pws, contains patterns that reflect a weak-strong order between a pair of words (the first word is weaker than the second), and a second pattern set, Psw, captures the strong-weak order. See Table 1 for the adjective patterns that we used in this work (and see Section 4.1 for implementation details regarding our pattern collection). Many of these patterns also apply to other parts of speech (e.g. ‘drizzle but not rain’, ‘running or even sprinting’), with significant discrimination on the Web in the right direction.

2.1.3 Pairwise Scores

Given an input set of words to be placed on a scale, we first collect evidence of their intensity order by using the above-mentioned intensity patterns and a large, Web-scale text corpus.

Previous work on information extraction from limited-sized raw text corpora revealed that coverage is often limited (Hearst, 1992; Hatzivassiloglou and McKeown, 1993). Some studies (Chklovski and Pantel, 2004; Sheinman and Tokunaga, 2009) used hit counts from an online search engine, but this is unstable and irreproducible (Kilgarriff, 2007). To avoid these issues, we use the largest available static corpus of counts, the Google n-grams corpus (Brants and Franz, 2006), which contains English n-grams (n = 1 to 5) and their observed frequency counts, generated from nearly 1 trillion word tokens and 95 billion sentences.

We consider each pair of words (a1, a2) in the input set in turn. For each pattern p in the two pattern sets (weak-strong Pws and strong-weak Psw), we insert the word pair into the pattern as p(a1, a2) to get a phrasal query like “big but not huge”. This is done by replacing the two wildcards in the pattern by the two words in order. Finally, we scan the Web n-grams corpus in a batch approach similar to Bansal and Klein (2011) and collect frequencies of all our phrase queries. Table 2 depicts some examples of useful intensity-based phrase queries and their frequencies in the Web-scale corpus. We also collect frequencies for the input word unigrams and the patterns for normalization purposes.

(good, great)                       (great, good)                       (small, minute)
good, but not great → 24492.0       not great, just good → 248.0        small, almost minute → 97.0
good, if not great → 1912.0         great or very good → 89.0           small, even minute → 41.0
good, though not great → 504.0      not great but still good → 47.0
good, or even great → 338.0         not just good but great → 181.0
good, almost great → 156.0

Table 2: Some examples from the Web-scale corpus of useful intensity-based phrases on adjective pairs.

Given a word pair (a1, a2) and a corpus count function cnt, we define

  W1 = (1/P1) Σ_{p1 ∈ Pws} cnt(p1(a1, a2))      S1 = (1/P2) Σ_{p2 ∈ Psw} cnt(p2(a1, a2))
  W2 = (1/P1) Σ_{p1 ∈ Pws} cnt(p1(a2, a1))      S2 = (1/P2) Σ_{p2 ∈ Psw} cnt(p2(a2, a1))      (1)

with

  P1 = Σ_{p1 ∈ Pws} cnt(p1)      P2 = Σ_{p2 ∈ Psw} cnt(p2)      (2)

such that the final overall weak-strong score is

  score(a1, a2) = [ (W1 − S1) − (W2 − S2) ] / [ cnt(a1) · cnt(a2) ]      (3)

Here W1 and S1 represent Web evidence of a1 and a2 being in the weak-strong and strong-weak relation, respectively. W2 and S2 fit the reverse pair (a2, a1) into the patterns and hence represent the strong-weak and weak-strong relations, respectively, in the opposite direction. Hence, overall, (W1 − S1) − (W2 − S2) represents the total weak-strong score of the pair (a1, a2), i.e. the score of a1 being on the left of a2 on a relative intensity scale, such that score(a1, a2) = −score(a2, a1). The raw frequencies in the score are divided by counts of the patterns and by individual word unigram counts to obtain a pointwise mutual information (PMI) style normalization and hence avoid any bias in the score due to high-frequency patterns or word unigrams.²

² In preliminary experiments on a development set, we also evaluated other intuitive forms of normalization.
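For concreteness, the following is a minimal sketch of this scoring step in Python. The count function cnt and the pattern strings (here written with {0}/{1} placeholders instead of the ? wildcards) are assumptions standing in for whatever n-gram lookup machinery is actually used; the sketch merely mirrors Equations (1)–(3).

    def weak_strong_score(a1, a2, cnt, ws_patterns, sw_patterns):
        """score(a1, a2) > 0 suggests a1 is weaker than a2 (cf. Eq. 1-3).

        cnt(phrase)  -- assumed corpus frequency lookup for a phrase, word, or pattern
        ws_patterns  -- weak-strong patterns, e.g. "{0} but not {1}" (cf. Table 1)
        sw_patterns  -- strong-weak patterns, e.g. "not {0} just {1}"
        """
        P1 = sum(cnt(p) for p in ws_patterns) or 1.0   # pattern normalizers (Eq. 2)
        P2 = sum(cnt(p) for p in sw_patterns) or 1.0

        W1 = sum(cnt(p.format(a1, a2)) for p in ws_patterns) / P1
        S1 = sum(cnt(p.format(a1, a2)) for p in sw_patterns) / P2
        W2 = sum(cnt(p.format(a2, a1)) for p in ws_patterns) / P1
        S2 = sum(cnt(p.format(a2, a1)) for p in sw_patterns) / P2

        # PMI-style normalization by the two unigram counts (Eq. 3).
        return ((W1 - S1) - (W2 - S2)) / (cnt(a1) * cnt(a2))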
2.2 Global Ordering with an MILP

2.2.1 Objective and Constraints

Given pairwise scores, we now aim at producing a global ranking of the input words that is much more informative than the original pairwise scores. Joint inference from multiple word pairs allows us to benefit from global information: due to the sparsity of the pattern evidence, determining how two adjectives relate to each other can sometimes only be inferred by observing how each of them relates to some third adjective.

We assume that we are given N input words A = a1, ..., aN that we wish to place on a linear scale, say [0, 1]. Thus each word ai is to be assigned a position xi ∈ [0, 1] based on the pairwise weak-strong weights score(ai, aj). A positive value for score(ai, aj) means that ai is supposedly weaker than aj and hence we would like to obtain xi < xj. A negative value for score(ai, aj) means that ai is assumed to be stronger than aj, so we would want to obtain xi > xj. Therefore, intuitively, our goal corresponds to maximizing the objective

  Σ_{i,j} sgn(xj − xi) · score(ai, aj)      (4)

Note that it is important to use the signum function sgn() here, because we only care about the relative order of xi and xj. Maximizing Σ_{i,j} (xj − xi) · score(ai, aj) would lead to all words being placed at the edges of the scale, because the highest scores would dominate over all other ones. We do include the score magnitudes in the objective, because they help resolve contradictions in the pairwise scores (e.g., see Figure 1). This is discussed in more detail in Section 2.2.2.

Figure 1: The input weak-strong data may contain one or more cycles, e.g. due to noisy patterns, so the final ranking will have to choose which input scores to honor and which to remove.

In order to maximize this non-differentiable objective, we use Mixed Integer Linear Programming (MILP), a variant of linear programming in which some but not all of the variables are constrained to be integers. Using an MILP formalization, we can find a globally optimal solution in the joint decision space, and unlike previous work, we jointly exploit global information rather than just individual local (pairwise) scores. To encode the objective in an MILP, we need to introduce additional variables dij, wij, sij to capture the effect of the signum function, as explained below.

We additionally enable our MILP to make use of any external equivalence (synonymy) information E ⊆ {1, ..., N} × {1, ..., N} that may be available. In this context, two words are considered synonymous if they are close enough in meaning to be placed at (almost) the same position on the intensity scale. If (i, j) ∈ E, we can safely assume that ai, aj have near-equivalent intensity, so we should encourage xi, xj to remain close to each other.
The MILP is defined as follows:

  maximize    Σ_{(i,j) ∉ E} (wij − sij) · score(ai, aj)  −  Σ_{(i,j) ∈ E} (wij + sij) · C

  subject to  dij = xj − xi              ∀ i, j ∈ {1, ..., N}
              dij − wij C ≤ 0            ∀ i, j ∈ {1, ..., N}
              dij + (1 − wij) C > 0      ∀ i, j ∈ {1, ..., N}
              dij + sij C ≥ 0            ∀ i, j ∈ {1, ..., N}
              dij − (1 − sij) C < 0      ∀ i, j ∈ {1, ..., N}
              xi ∈ [0, 1]                ∀ i ∈ {1, ..., N}
              wij ∈ {0, 1}               ∀ i, j ∈ {1, ..., N}
              sij ∈ {0, 1}               ∀ i, j ∈ {1, ..., N}

The difference variables dij simply capture differences between xi, xj. C is any very large constant greater than Σ_{i,j} |score(ai, aj)|; the exact value is irrelevant. The indicator variables wij and sij are jointly used to determine the value of the signum function sgn(dij) = sgn(xj − xi). Variables wij become 1 if and only if dij > 0 and hence serve as indicator variables for weak-strong relationships in the output. Variables sij become 1 if and only if dij < 0 and hence serve as indicator variables for strong-weak relationships in the output. The objective encourages wij = 1 for score(ai, aj) > 0 and sij = 1 for score(ai, aj) < 0.³ When equivalence (synonymy) information is available, then for (i, j) ∈ E both sij = 0 and wij = 0 are encouraged.

³ In order to avoid numeric instability issues due to very small score(ai, aj) values after frequency normalization, in practice we have found it necessary to rescale them by a factor of 1 over the smallest |score(ai, aj)| > 0.

2.2.2 Discussion

Our MILP uses intensity evidence of all input pairs together and assimilates all the scores via global transitivity constraints to determine the positions of the input words on a continuous real-valued scale. Hence, our approach addresses drawbacks of local or divide-and-conquer approaches, where adjectives are scored with respect to selected pivot words, and hence many adjectives that lack pairwise evidence with the pivots are not properly classified, although they may have order evidence with some third adjective that could help establish the ranking. Optional synonymy information can further help, as shown in Figure 2.

Figure 2: Equivalence information: knowing that am, a2 are synonyms gives the MILP an indication of where to place an on the scale with respect to a1, a2, a3.

Moreover, our MILP also gives higher weight to pairs with higher scores, which is useful when breaking global constraint cycles as in the simple example in Figure 1. If we need to break a constraint-violating triangle or cycle, we would have to make arbitrary choices if we were ranking based on sgn(score(a, b)) alone. Instead, we can choose a better ranking based on the magnitude of the pairwise scores. A stronger score between an adjective pair does not necessarily mean that the two words should be further apart in the ranking. It means that these two words are attested together on the Web with respect to the intensity patterns more than with other candidate words. Therefore, we try to respect the order of such word pairs more in the final ranking when we are breaking constraint-violating cycles.
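As an illustration only, here is a minimal sketch of how such an MILP could be instantiated with an off-the-shelf solver. The use of PuLP is our own assumption rather than part of the original system, and the strict inequalities of the formulation are approximated with a small epsilon, as is common when passing such constraints to a solver.

    import pulp

    def rank_on_scale(words, score, synonyms=frozenset()):
        """Place words on [0, 1] by maximizing the MILP objective of Section 2.2.

        words    -- list of words a_1 .. a_N
        score    -- dict mapping ordered pairs (a_i, a_j) to weak-strong scores
        synonyms -- optional set E of near-equivalent pairs (both orderings included)
        """
        C = sum(abs(v) for v in score.values()) + 1.0   # large "big-M" constant
        eps = 1e-6                                      # stands in for strict inequalities
        prob = pulp.LpProblem("intensity_ranking", pulp.LpMaximize)

        x = {a: pulp.LpVariable("x_%s" % a, 0, 1) for a in words}
        pairs = [(a, b) for a in words for b in words if a != b]
        w = {p: pulp.LpVariable("w_%s_%s" % p, cat="Binary") for p in pairs}
        s = {p: pulp.LpVariable("s_%s_%s" % p, cat="Binary") for p in pairs}

        # Reward w_ij for positive scores and s_ij for negative scores; for
        # synonymous pairs, push both indicators towards 0 (i.e. x_i close to x_j).
        prob += pulp.lpSum(
            -(w[p] + s[p]) * C if p in synonyms
            else (w[p] - s[p]) * score.get(p, 0.0)
            for p in pairs)

        for (a, b) in pairs:
            d = x[b] - x[a]
            prob += d - C * w[(a, b)] <= 0              # w_ij = 1  iff  d_ij > 0
            prob += d + (1 - w[(a, b)]) * C >= eps
            prob += d + C * s[(a, b)] >= 0              # s_ij = 1  iff  d_ij < 0
            prob += d - (1 - s[(a, b)]) * C <= -eps

        prob.solve()
        return {a: x[a].value() for a in words}

Since every ordered pair receives its own pair of binary indicator variables, such a program grows quadratically with the cluster size, which is unproblematic for the small clusters considered in this work (see Figure 3).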
3 Related Work

Hatzivassiloglou and McKeown (1993) presented the first step towards automatic identification of adjective scales, thoroughly discussing the background of adjective semantics and a means of discovering clusters of adjectives that belong on the same scale, thus providing one way of creating the input for our ranking algorithm.

Inkpen and Hirst (2006) study near-synonyms and nuances of meaning differentiation (such as stylistic, attitudinal, etc.). They attempt to automatically acquire a knowledge base of near-synonym differences via an unsupervised decision-list algorithm. However, their method depends on a special dictionary of synonym differences to learn the extraction patterns, while we use only a raw Web-scale corpus.

Mohammad et al. (2013) proposed a method of identifying whether two adjectives are antonymous. This problem is related but distinct, because the degree of antonymy does not necessarily determine their position on an intensity scale. Antonyms (e.g., little, big) are not necessarily on the extreme ends of scales.

Sheinman and Tokunaga (2009) and Sheinman et al. (2012) present the most closely related previous work on adjective intensities. They collect lexico-semantic patterns via bootstrapping from seed adjective pairs to obtain pairwise intensities, albeit using search engine ‘hits’, which are unstable and problematic (Kilgarriff, 2007). While their approach is primarily evaluated in terms of a local pairwise classification task, they also suggest the possibility of ordering adjectives on a scale using a pivot-based partitioning approach. Although intuitive in theory, this often fails in practice because the extracted pairwise scores are frequently too sparse: many adjectives have no score with a particular headword. In our experiments, we reimplemented this approach and show that our MILP method improves over it by allowing individual pairwise decisions to benefit more from global information. Schulam and Fellbaum (2010) apply the approach of Sheinman and Tokunaga (2009) to German adjectives. Our method extends easily to various foreign languages as described in Section 5.

Another related task is the extraction of lexico-syntactic and lexico-semantic intensity-order patterns from large text corpora (Hearst, 1992; Chklovski and Pantel, 2004; Tandon and de Melo, 2010). Sheinman and Tokunaga (2009) follow Davidov and Rappoport (2008) to automatically bootstrap adjective scaling patterns using seed adjectives and Web hits. These methods can thus be used to provide the input patterns for our algorithm.

VerbOcean by Chklovski and Pantel (2004) extracts various fine-grained semantic relations (including the stronger-than relation) between pairs of verbs, using lexico-syntactic patterns over the Web. Our approach of jointly ranking a set of words using pairwise evidence is also applicable to the VerbOcean pairs, and should help address similar sparsity issues of local pairwise decisions. Such scales will again be quite useful for language learners and language understanding tools.

de Marneffe et al. (2010) infer yes-or-no answers to questions with responses involving scalar adjectives in a dialogue corpus. They correlate adjectives with ratings in a movie review corpus to find that good appears in lower-rated reviews than excellent.

Finally, there has been a lot of work on measuring the general sentiment polarity of words (Hatzivassiloglou and McKeown, 1997; Hatzivassiloglou and Wiebe, 2000; Turney and Littman, 2003; Liu and Seneff, 2009; Taboada et al., 2011; Yessenalina and Cardie, 2011; Pang and Lee, 2008). Our work instead aims at producing a large, unrestricted number of individual intensity scales for different qualities and hence can help in fine-grained sentiment analysis with respect to very particular content aspects.
4 Experiments

4.1 Data

Input Clusters. In order to obtain input clusters for evaluation, we started out with the satellite cluster or ‘dumbbell’ structure of adjectives in WordNet 3.0, which consists of two direct antonyms as the poles and a number of other satellite adjectives that are semantically similar to each of the poles (Gross and Miller, 1990). For each antonymy pair, we determined an extended dumbbell set by looking up synonyms and words in related (satellite adjective and ‘see-also’) synonym sets. We cut such an extended dumbbell into two antonymous halves and treated each of these halves as a potential input adjective cluster.

Most of these WordNet clusters are noisy for the purpose of our task, i.e. they contain adjectives that appear unrelatable on a single scale due to polysemy and semantic drift, e.g. violent with respect to supernatural and affected. Motivated by Sheinman and Tokunaga (2009), we split such hard-to-relate adjectives into smaller scale-specific subgroups using the corpus evidence.⁴ For this, we consider an undirected edge between each pair of adjectives that has a non-zero intensity score (based on the Web-scale scoring procedure described in Section 2.1.3). The resulting graph is then partitioned into connected components, such that any adjectives in a subgraph are at least indirectly connected via some path and thus much more likely to belong to the same intensity scale. While this does break up partitions whenever there is no corpus evidence connecting them, ordering the adjectives within each such partition remains a challenging task. This is because the Web evidence will still not necessarily directly relate all adjectives (in a partition) to each other. Additionally, the Web evidence may still indicate the wrong direction. Figure 3 shows the size distribution of the resulting partitions.

⁴ Note that we do not use the WordNet dataset of Sheinman and Tokunaga (2009) for evaluation, as it does not provide full scales. Instead, their annotators only made pairwise comparisons with select words, using a 5-way classification scheme (neutral, mild, very mild, intense, very intense).

Figure 3: The histogram of cluster sizes after partitioning (chain length → number of chains: 2 → 438, 3 → 115, 4 → 60, 5 → 35, 6 → 19, 7 → 12, 8 → 14, 9 → 5, 10–14 → 4, 15–17 → 3).
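As a small illustration of this partitioning step (the use of networkx here is our own choice, not something prescribed by the method), the evidence graph and its connected components can be obtained as follows:

    import networkx as nx

    def partition_cluster(words, score):
        """Split an adjective cluster into subgroups whose members are at least
        indirectly connected by non-zero pairwise intensity evidence."""
        g = nx.Graph()
        g.add_nodes_from(words)
        for a in words:
            for b in words:
                if a < b and (score.get((a, b), 0.0) != 0.0 or score.get((b, a), 0.0) != 0.0):
                    g.add_edge(a, b)
        return [sorted(component) for component in nx.connected_components(g)]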
Patterns. To construct our intensity pattern set, we started with a couple of common rankable adjective seed pairs such as (good, great) and (hot, boiling) and used the Web-scale n-grams corpus (Brants and Franz, 2006) to collect the few most frequent patterns between and around these seed pairs (in both directions). Among these, we manually chose a small set of intuitive patterns that are linguistically useful for ordering adjectives, several of which had not been discovered in previous work. These are shown in Table 1. Note that we only collected patterns that are not ambiguous with respect to the two orders; for example, the pattern ‘? , not ?’ is ambiguous because it can be used as both ‘good, not great’ and ‘great, not good’. Alternatively, one can also use fully automatic bootstrapping techniques based on seed word pairs (Hearst, 1992; Chklovski and Pantel, 2004; Yang and Su, 2007; Turney, 2008; Davidov and Rappoport, 2008). However, our semi-automatic approach is a simple and fast process that extracts a small set of high-quality and very general adjective-scaling patterns. This process can quickly be repeated from scratch in any other language. Moreover, as described in Section 5.1, the English patterns can also be projected automatically to patterns in other languages.

Development and Test Sets. Section 2.1 describes the method for collecting the intensity scores for adjective pairs, using Web-scale n-grams (Brants and Franz, 2006). We relied on a small development set to test the MILP structure and the pairwise score setup. For this, we manually chose 5 representative adjective clusters from the full set of clusters.

The final test set, distinct from this development set, consists of 569 word pairs in 88 clusters, each annotated by two native speakers of English. Both the gold test data and our code are freely available.⁵ To arrive at this data, we randomly drew 30 clusters each for cluster sizes 3, 4, and 5+ from the histogram of partitioned adjective clusters in Figure 3. While labeling a cluster, annotators could exclude words that they deemed unsuitable to fit on a single shared intensity scale with the rest of the cluster. Fortunately, the partitioning described earlier had already separated most such cases into distinct clusters. The annotators ordered the remaining words on a scale. Words that seemed indistinguishable in strength could share positions in their annotation.

⁵ http://demelo.org/gdm/intensity/

As our goal is to compare scale formation algorithms, we did not include trivial clusters of size 2. On such trivial clusters, the Web evidence alone determines the output and hence all algorithms, including the baseline, obtain the same pairwise accuracy (defined below) of 93.3% on a separate set of 30 random clusters of size 2.

Figure 4 shows the distribution of cluster sizes in our main gold set. The inter-annotator agreement in terms of Cohen’s κ (Cohen, 1960) on the pairwise classification task with 3 labels (weaker, stronger, or equal/unknown) was 0.64. In terms of pairwise accuracy, the agreement was 78.0%.

Figure 4: The histogram of cluster sizes in the test set (chain length → number of chains: 3 → 41, 4 → 27, 5 → 12, 6 → 3, 7 → 3, 8 → 2).

4.2 Metrics

In order to thoroughly evaluate the performance of our adjective ordering procedure, we rely on both pairwise and ranking-correlation evaluation metrics. Consider a set of input words A = {a1, a2, ..., an} and two rankings for this set: a gold-standard ranking rG(A) and a predicted ranking rP(A).

4.2.1 Pairwise Accuracy

For a pair of words ai, aj, we may consider the classification task of choosing one of three labels (<, >, =?) for the case of ai being weaker, stronger, or equal (or unknown) in intensity, respectively, compared to aj:

  L(ai, aj) =  <   if r(ai) < r(aj)
               >   if r(ai) > r(aj)
               =?  if r(ai) = r(aj)

For each pair (ai, aj), we compute gold-standard labels LG(ai, aj) and predicted labels LP(ai, aj) as above, and then the pairwise accuracy PW(A) for a particular ordering on A is simply the fraction of pairs that are correctly classified, i.e. for which the predicted label is the same as the gold-standard label:

  PW(A) = |{(i, j) : i < j, LP(ai, aj) = LG(ai, aj)}| / (n(n−1)/2)

4.2.2 Ranking Correlation Coefficients

Kendall’s tau coefficient. Kendall’s τ (Kendall, 1938) compares two rankings of the same set by examining pairs of elements. A pair of words ai, aj is considered:
• concordant iff the two rankings agree on the strict order of the two elements, i.e., rG(ai) > rG(aj) and rP(ai) > rP(aj), or rG(ai) < rG(aj) and rP(ai) < rP(aj).
• discordant iff the two rankings have an inverted strict order of the two elements, i.e., rG(ai) > rG(aj) and rP(ai) < rP(aj), or rG(ai) < rG(aj) and rP(ai) > rP(aj).
• tied iff rG(ai) = rG(aj) or rP(ai) = rP(aj).
The tie-corrected τ coefficient is then computed from the numbers of concordant, discordant, and tied pairs (Kendall, 1938; Kruskal, 1958), as illustrated in the sketch below.
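A minimal sketch of these pairwise and rank-based measures, under the assumption that gold and predicted scale positions are stored in dictionaries; scipy’s kendalltau computes the tie-corrected τ-b variant, which we take to be an adequate stand-in here:

    from scipy.stats import kendalltau

    def label(rank, a, b):
        """3-way pairwise label: '<' weaker, '>' stronger, '=?' equal or unknown."""
        if rank[a] < rank[b]:
            return "<"
        if rank[a] > rank[b]:
            return ">"
        return "=?"

    def pairwise_accuracy(words, gold, pred):
        """Fraction of unordered pairs whose predicted label matches the gold label."""
        pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
        correct = sum(label(gold, a, b) == label(pred, a, b) for a, b in pairs)
        return correct / len(pairs)

    def kendall(words, gold, pred):
        """Tie-corrected Kendall correlation between the two rankings."""
        tau, _ = kendalltau([gold[a] for a in words], [pred[a] for a in words])
        return tau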
Spearman’s rho correlation coefficient. For two n-sized ranked lists {xi} and {yi}, the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranks of the variables:

  ρ = Σ_i (xi − x̄)(yi − ȳ) / √( Σ_i (xi − x̄)² · Σ_i (yi − ȳ)² )

Here, x̄ and ȳ denote the means of the values in the respective lists. We use the standard procedure for handling ties correctly: tied values are assigned the average of all ranks of items sharing the same value in the ranked list sorted in ascending order of the values.

Handling Inversions. While annotating, we sometimes observed that the ordering itself was very clear but the annotators disagreed about which end of a particular scale was to count as the strong one, e.g. when transitioning from soft to hard or from alpha to beta. We thus also report average absolute values of both correlation coefficients, as these properly account for anticorrelations. Our test set only contains clusters of size 3 or larger, so there is no need to account for inversions in clusters of size 2.

4.3 Results

In Table 3, we use the evaluation metrics mentioned above to compare several different approaches.

Method                           Pairwise Accuracy   Avg. τ   Avg. |τ|   Avg. ρ   Avg. |ρ|
Web Baseline                     48.2%               N/A      N/A        N/A      N/A
Divide-and-Conquer               50.6%               0.45     0.53       0.52     0.62
Sheinman and Tokunaga (2009)     55.5%               N/A      N/A        N/A      N/A
MILP                             69.6%               0.57     0.65       0.64     0.73
MILP with synonymy               78.2%               0.57     0.66       0.67     0.80
Inter-Annotator Agreement        78.0%               0.67     0.76       0.75     0.86

Table 3: Main test results.

Web Baseline. The first baseline simply reflects the original pairwise Web-based intensity scores. We classify (with one of the 3 labels) a given pair of adjectives using the Web-based intensity scores (as described in Section 2.1.3) as follows:

  Lbaseline(ai, aj) =  <   if score(ai, aj) > 0
                       >   if score(ai, aj) < 0
                       =?  if score(ai, aj) = 0

Since score(ai, aj) represents the weak-strong score of the two adjectives, a more positive value means a higher likelihood of ai being weaker (<, on the left) in intensity than aj. In Table 3, we observe that the (micro-averaged) pairwise accuracy, as defined earlier, for the original Web baseline is 48.2%, while the ranking measures are undefined because the individual pairs do not lead to a coherent scale.

Divide-and-Conquer. The divide-and-conquer baseline recursively splits a set of words into three subgroups, placed to the left (weaker), on the same position (no evidence), or to the right (stronger) of a given randomly chosen pivot word. While this approach shows only a minor improvement in terms of the pairwise accuracy (50.6%), its main benefit is that one obtains well-defined intensity scales rather than just a collection of pairwise scores.

Sheinman and Tokunaga. The approach by Sheinman and Tokunaga (2009) involves a similar divide-and-conquer based partitioning in the first phase, except that their method makes use of synonymy information from WordNet and uses all synonyms in WordNet’s synset for the headword as neutral pivot elements (if the headword is not in WordNet, then the word with the maximal unigram frequency is chosen). In the second phase, their method performs pairwise comparisons within the more intense and less intense subgroups. We reimplement their approach here, using the Google n-grams dataset instead of online Web search engine hits. We observe a small improvement over the Web baseline in terms of pairwise accuracy.
Note that the rank correlation measure scores are undefined for their approach. This is because in some cases their method placed all words on the same position in the scale, which these measures cannot handle even in their tie-corrected versions. Overall, the Sheinman and Tokunaga approach does not aggregate information sufficiently well at the global level and often fails to make use of transitive inference.

MILP. Our MILP exploits the same pairwise scores to induce significantly more accurate pairwise labels with 69.6% accuracy, a 41% relative error reduction over the Web baseline, 38% over Divide-and-Conquer, and 32% over Sheinman and Tokunaga (2009). We further see that our MILP method is able to exploit external synonymy (equivalence) information (using synonyms marked by the annotators). The accuracy of the pairwise scores as well as the quality of the overall ranking increase even further, to 78.2%, approaching the human inter-annotator agreement.

In terms of average correlation coefficients, we observe similar improvement trends from the MILP, but of different magnitudes, because these averages give small clusters the same weight as larger ones.

4.4 Analysis

Confusion Matrices. For a given approach, we can study the confusion matrix obtained by cross-tabulating the gold classification with the predicted classification of every unique pair of adjectives in the ground truth data. Table 4 shows the confusion matrix for the Web baseline. We observe that due to the sparsity of pairwise intensity order evidence, the baseline method predicts too many ties.

                       Predicted Class
True Class      Weaker   Tie   Stronger
Weaker           117     127      15
Tie                5      42      15
Stronger          11     122     115

Table 4: Confusion matrix (Web baseline).

Table 5 provides the confusion matrix for the MILP (without external equivalence information) for comparison. Although the middle column still shows that the MILP predicts more ties than human annotators, we find that a clear majority of all unique pairs are now correctly placed along the diagonal. This confirms that our MILP successfully infers new ordering decisions, although it uses the same input (corpus evidence) as the baseline. The remaining ties are mostly just the result of pairs for which there simply is no evidence at all in the input Web counts. Note that this problem could, for instance, be circumvented by relying on a crowdsourcing approach: a few dispersed tie-breakers are enough to allow our MILP to correct many other predictions.

                       Predicted Class
True Class      Weaker   Tie   Stronger
Weaker           177      29      53
Tie                9      24      29
Stronger          15      38     195

Table 5: Confusion matrix (MILP).

Predicted Examples. Finally, in Table 6, we provide a selection of real results obtained by our algorithm. For instance, it correctly inferred that terrifying is more intense than creepy or scary, although the Web pattern counts did not provide any explicit information about these word pairs. In some cases, however, the Web evidence did not suffice to draw the right conclusions, or it was misleading due to issues like polysemy (as for the word funny).
Accuracy   Prediction                                                    Gold Standard
Good       hard < painful < hopeless                                     hard < painful < hopeless
           full < stuffed < (overflowing, overloaded)                    full < stuffed < overflowing < overloaded
           unusual < uncommon < rare < exceptional < extraordinary       uncommon < unusual < rare < extraordinary < exceptional
Average    creepy < scary < sinister < frightening < terrifying          creepy < (scary, frightening) < terrifying < sinister
Bad        (awake, conscious) < alive < aware                            alive < awake < (aware, conscious)
           strange < (unusual, weird) < (funny, eerie)                   (strange, funny) < unusual < weird < eerie

Table 6: Some examples (of bad, average, and good accuracy) of our MILP predictions (without synonymy information) and the corresponding gold-standard annotation.

While we show results on gold-standard chains here for evaluation purposes, in practice one can also recombine two [0, 1] chains for a pair of antonymic clusters to form a single scale from [−1, 1] that visualizes the full spectrum of available adjectives along a dimension, from adjacent all the way to removed, or from black to glaring.

5 Extension to Multilingual Ordering

Our method for globally ordering words on a scale can easily be applied to languages other than English. The entire process is language-independent as long as the required resources are available and a small number of patterns are chosen. For morphologically rich languages, the information extraction step may of course require additional morphological analysis tools for stemming and aggregating frequencies across different forms.

Alternatively, a cross-lingual projection approach is possible at multiple levels, utilizing information from the English data and ranking. As a first step, the set of words in the target language that we wish to rank can be projected from the English word set if necessary, e.g., as shown in de Melo and Weikum (2009). Next, we outline two projection methods for the ordering step. The first method is based on projecting the English intensity-ordering patterns to the new language and then using the same MILP as described in Section 2.2. In the second method, we also change the MILP and add cross-lingual constraints to better inform the target language’s adjective ranking. A detailed empirical evaluation of these approaches remains future work.

5.1 Cross-Lingual Pattern Projection

Instead of creating new patterns, in many cases we obtain quite adequate intensity patterns by using cross-lingual projection. We simply take several adjective pairs, instantiate the English patterns with them, and obtain new patterns using a machine translation system. Filling the wildcards in a pattern, say ‘? but not ?’, with good/excellent results in ‘good but not excellent’. This phrase is then translated into the target language using the translation system, say into German ‘gut aber nicht ausgezeichnet’. Finally, we put the wildcards back in the place of the translations of the adjective words, here gut and ausgezeichnet, to get the corresponding German pattern ‘? aber nicht ?’. Table 7 shows various German intensity patterns that we obtain by projecting from the English patterns as described. The process is repeated with multiple adjective pairs in case different variants are returned, e.g. due to morphology. Most of these translations deliver useful results.

Weak-Strong Patterns                              Strong-Weak Patterns
English              German                       English                  German
? but not ?          ? aber nicht ?               not ? just ?             nicht ? gerade ?
? if not ?           ? wenn nicht ?               not ? but just ?         nicht ? aber nur ?
? and almost ?       ? und fast ?                 not ? though still ?     nicht ? aber immer noch ?
not just ? but ?     nicht nur ? sondern ?        ? or very ?              ? oder sehr ?

Table 7: Examples of German intensity patterns projected (translated) directly from the English patterns.
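A rough sketch of this projection step follows; the translate function below is purely hypothetical and stands in for whichever machine translation system is available:

    def project_pattern(pattern, seed_pair, translate, src="en", tgt="de"):
        """Project an English intensity pattern into a target language by
        instantiating it with a seed pair, translating the resulting phrase,
        and re-inserting the wildcards in place of the translated seed words.

        pattern    -- e.g. "? but not ?"
        seed_pair  -- e.g. ("good", "excellent")
        translate  -- hypothetical MT function: translate(text, src, tgt) -> str
        """
        weak, strong = seed_pair
        phrase = pattern.replace("?", weak, 1).replace("?", strong, 1)
        translated = translate(phrase, src, tgt)    # e.g. "gut aber nicht ausgezeichnet"
        weak_t = translate(weak, src, tgt)          # e.g. "gut"
        strong_t = translate(strong, src, tgt)      # e.g. "ausgezeichnet"
        return translated.replace(weak_t, "?").replace(strong_t, "?")

In practice, one would repeat this for several seed pairs and keep the pattern variants that recur, mirroring the procedure described above.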
Now that we have the target language adjectives and the ranking patterns, we can compute the pairwise intensity scores using large-scale data in that language. We can use the Google n-grams corpora for 10 European languages (Brants and Franz, 2009), and also for Chinese (LDC2010T02) and Japanese (LDC2009T08). For other languages, one can use available large raw-text corpora or Web crawling tools.

5.2 Cross-Lingual MILP

To improve the rankings for lesser-resourced languages, we can further use a joint MILP approach for the new language we want to transfer this process to. Additional constraints between the English words and their corresponding target language translations, in combination with the English ranking information, allow the algorithm to obtain better rankings for the target words whenever the non-English target language corpus does not provide sufficient intensity order evidence.

In this case, the input set A contains words in multiple languages. The Web intensity scores score(ai, aj) should be set to zero when comparing words across languages. We instead link them using a translation table T ⊆ {1, ..., N} × {1, ..., N} from a translation dictionary or phrase table. Here, (i, j) ∈ T signifies that ai is a translation of aj. We do not require a bijective relationship between them (i.e., translations need not be unique). The objective function is augmented by adding the new term

  Σ_{(i,j) ∈ T} (w′ij + s′ij) · CT      (5)

for a constant CT > 0 that determines how much weight we assign to translations as opposed to the corpus count scores. The MILP is extended by adding the following extra constraints:

  dij − w′ij CT < −dmax          ∀ i, j ∈ {1, ..., N}
  dij + (1 − w′ij) CT ≥ −dmax    ∀ i, j ∈ {1, ..., N}
  dij + s′ij CT > dmax           ∀ i, j ∈ {1, ..., N}
  dij − (1 − s′ij) CT ≤ dmax     ∀ i, j ∈ {1, ..., N}
  w′ij ∈ {0, 1}                  ∀ (i, j) ∈ T
  s′ij ∈ {0, 1}                  ∀ (i, j) ∈ T

The variables dij, as before, encode distances between positions of words on the scale, but now also include cross-lingual pairs of words in different languages. The new constraints encourage translational equivalents to remain close to each other, preferably within a desired (but not strictly enforced) maximum distance dmax. The new variables w′ij, s′ij are similar to wij, sij in the standard MILP. However, the w′ij become 1 if and only if dij ≥ −dmax, and the s′ij become 1 if and only if dij ≤ dmax. If both w′ij and s′ij are 1, then the two words have a small distance −dmax ≤ dij ≤ dmax. The augmented objective function explicitly encourages this for translational equivalents. Overall, this approach thus allows evidence from a language with more Web evidence to improve the process of adjective ordering in lesser-resourced languages.
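Continuing the earlier illustrative PuLP sketch from Section 2.2 (again an assumption on our part, with the strict inequalities approximated by a small epsilon), the cross-lingual constraints could be added along the following lines:

    import pulp

    def add_translation_constraints(prob, x, translations, C_T, d_max, eps=1e-6):
        """Encourage translational equivalents to stay within d_max of each other.

        prob         -- the pulp.LpProblem built by the monolingual sketch
        x            -- position variables for the words of *both* languages
        translations -- the translation table T as a set of word pairs
        C_T          -- translation weight / big-M constant (assumed > 1 + d_max)
        d_max        -- preferred maximum distance between translational equivalents
        """
        bonus = []
        for (a, b) in translations:
            d = x[b] - x[a]
            wp = pulp.LpVariable("wT_%s_%s" % (a, b), cat="Binary")
            sp = pulp.LpVariable("sT_%s_%s" % (a, b), cat="Binary")
            # w'_ij = 1 iff d_ij >= -d_max ;  s'_ij = 1 iff d_ij <= d_max
            prob += d - wp * C_T <= -d_max - eps
            prob += d + (1 - wp) * C_T >= -d_max
            prob += d + sp * C_T >= d_max + eps
            prob += d - (1 - sp) * C_T <= d_max
            bonus.append((wp + sp) * C_T)
        # Reward placing each translation pair within the desired distance band.
        prob.setObjective(prob.objective + pulp.lpSum(bonus))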
6 Conclusion

In this work, we have presented an approach to the challenging and little-studied task of ranking words in terms of their intensity on a continuous scale. We address the issue of sparsity of the intensity order evidence in two ways. First, pairwise intensity scores are computed using linguistically intuitive patterns in a very large, Web-scale corpus. Next, a Mixed Integer Linear Program (MILP) expands on this further by inferring new relative relationships. Instead of making ordering decisions about word pairs independently, our MILP considers the joint decision space and factors in, e.g., how two adjectives relate to some third adjective, thus enforcing global constraints such as transitivity.

Our approach is general enough to allow additional evidence such as synonymy in the MILP, and can straightforwardly be applied to other word classes (such as verbs) and to other languages (monolingually as well as cross-lingually). The overall results across multiple metrics are substantially better than previous approaches, and fairly close to human agreement on this challenging task.

Acknowledgments

We would like to thank the editor and the anonymous reviewers for their helpful feedback.

References

Mohit Bansal and Dan Klein. 2011. Web-scale features for full-scale parsing. In Proceedings of ACL 2011.

Thorsten Brants and Alex Franz. 2006. The Google Web 1T 5-gram corpus version 1.1. LDC2006T13.

Thorsten Brants and Alex Franz. 2009. Web 1T 5-gram, 10 European languages, version 1. LDC2009T25.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of EMNLP 2004.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Dmitry Davidov and Ari Rappoport. 2008. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In Proceedings of ACL 2008.

Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2010. Was it good? It was provocative. Learning the meaning of scalar adjectives. In Proceedings of ACL 2010.

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of CIKM 2009.

Zhicheng Dou, Ruihua Song, Xiaojie Yuan, and Ji-Rong Wen. 2008. Are click-through data adequate for learning web search rankings? In Proceedings of CIKM 2008.

Derek Gross and Katherine J. Miller. 1990. Adjectives in WordNet. International Journal of Lexicography, 3(4):265–277.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1993. Towards the automatic identification of adjectival scales: Clustering adjectives according to meaning. In Proceedings of ACL 1993.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the semantic orientation of adjectives. In Proceedings of ACL 1997.

Vasileios Hatzivassiloglou and Janyce M. Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of COLING 2000.

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING 1992.

Diana Inkpen and Graeme Hirst. 2006. Building and using a lexical knowledge base of near-synonym differences. Computational Linguistics, 32(2):223–262.

Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Adam Kilgarriff. 2007. Googleology is bad science. Computational Linguistics, 33(1).

William H. Kruskal. 1958. Ordinal measures of association. Journal of the American Statistical Association, 53(284):814–861.

Jingjing Liu and Stephanie Seneff. 2009. Review sentiment scoring via a parse-and-paraphrase paradigm. In Proceedings of EMNLP 2009.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, and Peter D. Turney. 2013. Computing lexical contrast. Computational Linguistics.
Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Peter F. Schulam and Christiane Fellbaum. 2010. Automatically determining the semantic gradation of German adjectives. In Proceedings of KONVENS 2010.

Vera Sheinman and Takenobu Tokunaga. 2009. AdjScales: Visualizing differences between adjectives for language learners. IEICE Transactions on Information and Systems, 92(8):1542–1550.

Vera Sheinman, Takenobu Tokunaga, I. Julien, P. Schulam, and C. Fellbaum. 2012. Refining WordNet adjective dumbbells using intensity relations. In Proceedings of the Global WordNet Conference 2012.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of COLING/ACL 2006.

Charles Spearman. 1904. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. 2009. SOFIE: A self-organizing framework for information extraction. In Proceedings of WWW 2009.

Maite Taboada, Julian Brooke, Milan Tofiloski, and Kimberly Voll. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics.

Niket Tandon and Gerard de Melo. 2010. Information extraction from web-scale n-gram data. In Proceedings of the SIGIR 2010 Web N-gram Workshop.

Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21(4):315–346.

Peter D. Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of COLING 2008.

Xiaofeng Yang and Jian Su. 2007. Coreference resolution using semantic relatedness information from automatically discovered patterns. In Proceedings of ACL 2007.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of EMNLP 2011.