Transactions of the Association for Computational Linguistics, 1 (2013) 279–290. Action Editor: Lillian Lee. Submitted 11/2012; Revised 1/2013; Published 7/2013. ©2013 Association for Computational Linguistics.

Good, Great, Excellent: Global Inference of Semantic Intensities

Gerard de Melo (ICSI, Berkeley; demelo@icsi.berkeley.edu)
Mohit Bansal (CS Division, UC Berkeley; mbansal@cs.berkeley.edu)

Abstract

Adjectives like good, great, and excellent are similar in meaning, but differ in intensity. Intensity order information is very useful for language learners as well as in several NLP tasks, but is missing in most lexical resources (dictionaries, WordNet, and thesauri). In this paper, we present a primarily unsupervised approach that uses semantics from Web-scale data (e.g., phrases like good but not excellent) to rank words by assigning them positions on a continuous scale. We rely on Mixed Integer Linear Programming to jointly determine the ranks, such that individual decisions benefit from global information. When ranking English adjectives, our global algorithm achieves substantial improvements over previous work on both pairwise and rank correlation metrics (specifically, 70% pairwise accuracy as compared to only 56% by previous work). Moreover, our approach can incorporate external synonymy information (increasing its pairwise accuracy to 78%) and extends easily to new languages. We also make our code and data freely available.¹

¹ http://demelo.org/gdm/intensity/

1 Introduction

Current lexical resources such as dictionaries and thesauri do not provide information about the intensity order of words. For example, both WordNet (Miller, 1995) and Roget’s 21st Century Thesaurus (thesaurus.com) present acceptable, great, and superb as synonyms of the adjective good. However, a native speaker knows that these words represent varying intensity and can in fact generally be ranked by intensity as acceptable < good < great < superb. Similarly, warm < hot < scorching are identified as synonyms in these resources. Ranking information, however, is crucial because it allows us to differentiate, e.g., between various intensities of an emotion, and is hence very useful for humans when learning a language or judging product reviews, as well as for automatic text understanding and generation tasks such as sentiment and subjectivity analysis, recognizing textual entailment, question answering, summarization, and coreference and discourse analysis.

In this work, we attempt to automatically rank sets of related words by intensity, focusing in particular on adjectives. This is made possible by the vast amounts of world knowledge that are now available. We use lexico-semantic information extracted from a Web-scale corpus in conjunction with an algorithm based on a Mixed Integer Linear Program (MILP). Linguistic analyses have identified phrases such as good but not great or hot and almost scorching in a text corpus as sources of evidence about the relative intensities of words. However, pure information extraction approaches often fail to provide enough coverage for real-world downstream applications (Tandon and de Melo, 2010), unless some form of advanced inference is used (Snow et al., 2006; Suchanek et al., 2009).
In our work, we address this sparsity problem by relying on Web-scale data and using an MILP model that extends the pairwise scores to a more complete joint ranking of words on a continuous scale, while maintaining global constraints such as transitivity and giving more weight to the order of word pairs with higher corpus evidence scores. Instead of considering intensity ranking as a pairwise decision process, we thus exploit the fact that individual decisions may benefit from global information, e.g. about how two words relate to some third word.

Previous work (Sheinman and Tokunaga, 2009; Schulam and Fellbaum, 2010; Sheinman et al., 2012) has also used lexico-semantic patterns to order adjectives. They mainly evaluate their algorithm on a set of pairwise decisions, but also present a partitioning approach that attempts to form scales by placing each adjective to the left or right of pivot words. Unfortunately, this approach often fails because many pairs lack order-based evidence even on the Web, as explained in more detail in Section 3.

In contrast, our MILP jointly uses information from all relevant word pairs and captures complex interactions and inferences to produce intensity scales. We can thus obtain an order between two adjectives even when there is no explicit evidence in the corpus (using evidence for related pairs and transitive inference). Our global MILP is flexible and can also incorporate additional synonymy information if available (which helps the MILP find an even better ranking solution). Our approach also extends easily to new languages. We describe two approaches for this multilingual extension: pattern projection and cross-lingual MILPs.

We evaluate our predicted intensity rankings using both pairwise classification accuracy and ranking correlation coefficients, achieving strong results, significantly better than the previous approach by Sheinman & Tokunaga (32% relative error reduction) and quite close to human-level performance.

2 Method

In this section, we describe each step of our approach to ordering adjectives on a single, relative scale. Our method can also be applied to other word classes and to languages other than English.

2.1 Web-based Scoring Model

2.1.1 Intensity Scales

Near-synonyms may differ in intensity, e.g. joy vs. euphoria, or drizzle vs. rain. This is particularly true of adjectives, which can represent different degrees of a given quality or attribute such as size or age. Many adjectives are gradable and thus allow for grading adverbial modifiers to express such intensity degrees, e.g., a house can be very big or extremely big. Often, however, completely different adjectives refer to varying degrees on the same scale, e.g., huge, gigantic, gargantuan. Even adjectives like enormous (or superb, impossible) that are considered non-gradable from a syntactic perspective can be placed on such a scale.

Weak-Strong Patterns            Strong-Weak Patterns
? (,) but not ?                 not ? (,) just ?
? (,) if not ?                  not ? (,) but just ?
? (,) although not ?            not ? (,) still ?
? (,) though not ?              not ? (,) but still ?
? (,) (and/or) even ?           not ? (,) although still ?
? (,) (and/or) almost ?         not ? (,) though still ?
not only ? but ?                ? (,) or very ?
not just ? but ?

Table 1: Ranking patterns used in this work. Among the patterns represented by the regular expressions above, we use only those that capture at most five words (to fit in the Google n-grams, see Section 2.1.2). Articles (a, an, the) are allowed to appear before the wildcards wherever possible.
2.1.2 Intensity Patterns

Linguistic studies have found lexical patterns like ‘? but not ?’ (e.g. good but not great) to reveal order information between a pair of adjectives (Sheinman and Tokunaga, 2009). We assume that we have two sets of lexical patterns that allow us to infer the most likely ordering between two words when encountered in a corpus. A first pattern set, Pws, contains patterns that reflect a weak-strong order between a pair of words (the first word is weaker than the second), and a second pattern set, Psw, captures the strong-weak order. See Table 1 for the adjective patterns that we used in this work (and see Section 4.1 for implementation details regarding our pattern collection). Many of these patterns also apply to other parts of speech (e.g. ‘drizzle but not rain’, ‘running or even sprinting’), with significant discrimination on the Web in the right direction.

2.1.3 Pairwise Scores

Given an input set of words to be placed on a scale, we first collect evidence of their intensity order by using the above-mentioned intensity patterns and a large, Web-scale text corpus.

Previous work on information extraction from limited-sized raw text corpora revealed that coverage is often limited (Hearst, 1992; Hatzivassiloglou and McKeown, 1993). Some studies (Chklovski and Pantel, 2004; Sheinman and Tokunaga, 2009) used hit counts from an online search engine, but this is unstable and irreproducible (Kilgarriff, 2007). To avoid these issues, we use the largest available static corpus of counts, the Google n-grams corpus (Brants and Franz, 2006), which contains English n-grams (n = 1 to 5) and their observed frequency counts, generated from nearly 1 trillion word tokens and 95 billion sentences.

We consider each pair of words (a1, a2) in the input set in turn. For each pattern p in the two pattern sets (weak-strong Pws and strong-weak Psw), we insert the word pair into the pattern as p(a1, a2) to get a phrasal query like “big but not huge”. This is done by replacing the two wildcards in the pattern by the two words in order. Finally, we scan the Web n-grams corpus in a batch approach similar to Bansal and Klein (2011) and collect frequencies of all our phrase queries. Table 2 depicts some examples of useful intensity-based phrase queries and their frequencies in the Web-scale corpus. We also collect frequencies for the input word unigrams and the patterns for normalization purposes.

(good, great)                       (great, good)                       (small, minute)
good, but not great → 24492.0       not great, just good → 248.0        small, almost minute → 97.0
good, if not great → 1912.0         great or very good → 89.0           small, even minute → 41.0
good, though not great → 504.0      not great but still good → 47.0
good, or even great → 338.0         not just good but great → 181.0
good, almost great → 156.0

Table 2: Some examples from the Web-scale corpus of useful intensity-based phrases on adjective pairs.

Given a word pair (a1, a2) and a corpus count function cnt, we define

  W1 = (1/P1) Σ_{p1 ∈ Pws} cnt(p1(a1, a2))      S1 = (1/P2) Σ_{p2 ∈ Psw} cnt(p2(a1, a2))
  W2 = (1/P1) Σ_{p1 ∈ Pws} cnt(p1(a2, a1))      S2 = (1/P2) Σ_{p2 ∈ Psw} cnt(p2(a2, a1))      (1)

with

  P1 = Σ_{p1 ∈ Pws} cnt(p1)      P2 = Σ_{p2 ∈ Psw} cnt(p2)      (2)

such that the final overall weak-strong score is

  score(a1, a2) = [ (W1 − S1) − (W2 − S2) ] / [ cnt(a1) · cnt(a2) ]      (3)

Here W1 and S1 represent Web evidence of a1 and a2 being in the weak-strong and strong-weak relation, respectively. W2 and S2 fit the reverse pair (a2, a1) into the patterns and hence represent the strong-weak and weak-strong relations, respectively, in the opposite direction. Hence, overall, (W1 − S1) − (W2 − S2) represents the total weak-strong score of the pair (a1, a2), i.e. the score of a1 being on the left of a2 on a relative intensity scale, such that score(a1, a2) = −score(a2, a1). The raw frequencies in the score are divided by counts of the patterns and by individual word unigram counts to obtain a pointwise mutual information (PMI) style normalization and hence avoid any bias in the score due to high-frequency patterns or word unigrams.²

² In preliminary experiments on a development set, we also evaluated other intuitive forms of normalization.
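For concreteness, the following is a minimal sketch of this scoring step in Python. The count function cnt and the pattern strings (here written with {0}/{1} placeholders instead of the ? wildcards) are assumptions standing in for whatever n-gram lookup machinery is actually used; the sketch merely mirrors Equations (1)–(3).

    def weak_strong_score(a1, a2, cnt, ws_patterns, sw_patterns):
        """score(a1, a2) > 0 suggests a1 is weaker than a2 (cf. Eq. 1-3).

        cnt(phrase)  -- assumed corpus frequency lookup for a phrase, word, or pattern
        ws_patterns  -- weak-strong patterns, e.g. "{0} but not {1}" (cf. Table 1)
        sw_patterns  -- strong-weak patterns, e.g. "not {0} just {1}"
        """
        P1 = sum(cnt(p) for p in ws_patterns) or 1.0   # pattern normalizers (Eq. 2)
        P2 = sum(cnt(p) for p in sw_patterns) or 1.0

        W1 = sum(cnt(p.format(a1, a2)) for p in ws_patterns) / P1
        S1 = sum(cnt(p.format(a1, a2)) for p in sw_patterns) / P2
        W2 = sum(cnt(p.format(a2, a1)) for p in ws_patterns) / P1
        S2 = sum(cnt(p.format(a2, a1)) for p in sw_patterns) / P2

        # PMI-style normalization by the two unigram counts (Eq. 3).
        return ((W1 - S1) - (W2 - S2)) / (cnt(a1) * cnt(a2))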
2.2 Global Ordering with an MILP

2.2.1 Objective and Constraints

Given pairwise scores, we now aim at producing a global ranking of the input words that is much more informative than the original pairwise scores. Joint inference from multiple word pairs allows us to benefit from global information: due to the sparsity of the pattern evidence, determining how two adjectives relate to each other can sometimes only be inferred by observing how each of them relates to some third adjective.

We assume that we are given N input words A = a1, ..., aN that we wish to place on a linear scale, say [0, 1]. Thus each word ai is to be assigned a position xi ∈ [0, 1] based on the pairwise weak-strong weights score(ai, aj). A positive value for score(ai, aj) means that ai is supposedly weaker than aj and hence we would like to obtain xi < xj. A negative value for score(ai, aj) means that ai is assumed to be stronger than aj, so we would want to obtain xi > xj. Therefore, intuitively, our goal corresponds to maximizing the objective

  Σ_{i,j} sgn(xj − xi) · score(ai, aj)      (4)

Note that it is important to use the signum function sgn() here, because we only care about the relative order of xi and xj. Maximizing Σ_{i,j} (xj − xi) · score(ai, aj) would lead to all words being placed at the edges of the scale, because the highest scores would dominate over all other ones. We do include the score magnitudes in the objective, because they help resolve contradictions in the pairwise scores (e.g., see Figure 1). This is discussed in more detail in Section 2.2.2.

Figure 1: The input weak-strong data may contain one or more cycles, e.g. due to noisy patterns, so the final ranking will have to choose which input scores to honor and which to remove.

In order to maximize this non-differentiable objective, we use Mixed Integer Linear Programming (MILP), a variant of linear programming in which some but not all of the variables are constrained to be integers. Using an MILP formalization, we can find a globally optimal solution in the joint decision space, and unlike previous work, we jointly exploit global information rather than just individual local (pairwise) scores. To encode the objective in an MILP, we need to introduce additional variables dij, wij, sij to capture the effect of the signum function, as explained below.

We additionally enable our MILP to make use of any external equivalence (synonymy) information E ⊆ {1, ..., N} × {1, ..., N} that may be available. In this context, two words are considered synonymous if they are close enough in meaning to be placed at (almost) the same position on the intensity scale. If (i, j) ∈ E, we can safely assume that ai, aj have near-equivalent intensity, so we should encourage xi, xj to remain close to each other.
The MILP is defined as follows:

  maximize    Σ_{(i,j) ∉ E} (wij − sij) · score(ai, aj)  −  Σ_{(i,j) ∈ E} (wij + sij) · C

  subject to  dij = xj − xi              ∀ i, j ∈ {1, ..., N}
              dij − wij C ≤ 0            ∀ i, j ∈ {1, ..., N}
              dij + (1 − wij) C > 0      ∀ i, j ∈ {1, ..., N}
              dij + sij C ≥ 0            ∀ i, j ∈ {1, ..., N}
              dij − (1 − sij) C < 0      ∀ i, j ∈ {1, ..., N}
              xi ∈ [0, 1]                ∀ i ∈ {1, ..., N}
              wij ∈ {0, 1}               ∀ i, j ∈ {1, ..., N}
              sij ∈ {0, 1}               ∀ i, j ∈ {1, ..., N}

The difference variables dij simply capture differences between xi, xj. C is any very large constant greater than Σ_{i,j} |score(ai, aj)|; the exact value is irrelevant. The indicator variables wij and sij are jointly used to determine the value of the signum function sgn(dij) = sgn(xj − xi). Variables wij become 1 if and only if dij > 0 and hence serve as indicator variables for weak-strong relationships in the output. Variables sij become 1 if and only if dij < 0 and hence serve as indicator variables for strong-weak relationships in the output. The objective encourages wij = 1 for score(ai, aj) > 0 and sij = 1 for score(ai, aj) < 0.³ When equivalence (synonymy) information is available, then for (i, j) ∈ E both sij = 0 and wij = 0 are encouraged.

³ In order to avoid numeric instability issues due to very small score(ai, aj) values after frequency normalization, in practice we have found it necessary to rescale them by a factor of 1 over the smallest |score(ai, aj)| > 0.

2.2.2 Discussion

Our MILP uses intensity evidence of all input pairs together and assimilates all the scores via global transitivity constraints to determine the positions of the input words on a continuous real-valued scale. Hence, our approach addresses drawbacks of local or divide-and-conquer approaches, where adjectives are scored with respect to selected pivot words, and hence many adjectives that lack pairwise evidence with the pivots are not properly classified, although they may have order evidence with some third adjective that could help establish the ranking. Optional synonymy information can further help, as shown in Figure 2.

Figure 2: Equivalence information: knowing that am, a2 are synonyms gives the MILP an indication of where to place an on the scale with respect to a1, a2, a3.

Moreover, our MILP also gives higher weight to pairs with higher scores, which is useful when breaking global constraint cycles as in the simple example in Figure 1. If we need to break a constraint-violating triangle or cycle, we would have to make arbitrary choices if we were ranking based on sgn(score(a, b)) alone. Instead, we can choose a better ranking based on the magnitude of the pairwise scores. A stronger score between an adjective pair does not necessarily mean that the two words should be further apart in the ranking. It means that these two words are attested together on the Web with respect to the intensity patterns more than with other candidate words. Therefore, we try to respect the order of such word pairs more in the final ranking when we are breaking constraint-violating cycles.
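As an illustration only, here is a minimal sketch of how such an MILP could be instantiated with an off-the-shelf solver. The use of PuLP is our own assumption rather than part of the original system, and the strict inequalities of the formulation are approximated with a small epsilon, as is common when passing such constraints to a solver.

    import pulp

    def rank_on_scale(words, score, synonyms=frozenset()):
        """Place words on [0, 1] by maximizing the MILP objective of Section 2.2.

        words    -- list of words a_1 .. a_N
        score    -- dict mapping ordered pairs (a_i, a_j) to weak-strong scores
        synonyms -- optional set E of near-equivalent pairs (both orderings included)
        """
        C = sum(abs(v) for v in score.values()) + 1.0   # large "big-M" constant
        eps = 1e-6                                      # stands in for strict inequalities
        prob = pulp.LpProblem("intensity_ranking", pulp.LpMaximize)

        x = {a: pulp.LpVariable("x_%s" % a, 0, 1) for a in words}
        pairs = [(a, b) for a in words for b in words if a != b]
        w = {p: pulp.LpVariable("w_%s_%s" % p, cat="Binary") for p in pairs}
        s = {p: pulp.LpVariable("s_%s_%s" % p, cat="Binary") for p in pairs}

        # Reward w_ij for positive scores and s_ij for negative scores; for
        # synonymous pairs, push both indicators towards 0 (i.e. x_i close to x_j).
        prob += pulp.lpSum(
            -(w[p] + s[p]) * C if p in synonyms
            else (w[p] - s[p]) * score.get(p, 0.0)
            for p in pairs)

        for (a, b) in pairs:
            d = x[b] - x[a]
            prob += d - C * w[(a, b)] <= 0              # w_ij = 1  iff  d_ij > 0
            prob += d + (1 - w[(a, b)]) * C >= eps
            prob += d + C * s[(a, b)] >= 0              # s_ij = 1  iff  d_ij < 0
            prob += d - (1 - s[(a, b)]) * C <= -eps

        prob.solve()
        return {a: x[a].value() for a in words}

Since every ordered pair receives its own pair of binary indicator variables, such a program grows quadratically with the cluster size, which is unproblematic for the small clusters considered in this work (see Figure 3).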
3 Related Work

Hatzivassiloglou and McKeown (1993) presented the first step towards automatic identification of adjective scales, thoroughly discussing the background of adjective semantics and a means of discovering clusters of adjectives that belong on the same scale, thus providing one way of creating the input for our ranking algorithm.

Inkpen and Hirst (2006) study near-synonyms and nuances of meaning differentiation (such as stylistic, attitudinal, etc.). They attempt to automatically acquire a knowledge base of near-synonym differences via an unsupervised decision-list algorithm. However, their method depends on a special dictionary of synonym differences to learn the extraction patterns, while we use only a raw Web-scale corpus.

Mohammad et al. (2013) proposed a method of identifying whether two adjectives are antonymous. This problem is related but distinct, because the degree of antonymy does not necessarily determine their position on an intensity scale. Antonyms (e.g., little, big) are not necessarily on the extreme ends of scales.

Sheinman and Tokunaga (2009) and Sheinman et al. (2012) present the most closely related previous work on adjective intensities. They collect lexico-semantic patterns via bootstrapping from seed adjective pairs to obtain pairwise intensities, albeit using search engine ‘hits’, which are unstable and problematic (Kilgarriff, 2007). While their approach is primarily evaluated in terms of a local pairwise classification task, they also suggest the possibility of ordering adjectives on a scale using a pivot-based partitioning approach. Although intuitive in theory, this often fails in practice because the extracted pairwise scores are frequently too sparse: many adjectives have no score with a particular headword. In our experiments, we reimplemented this approach and show that our MILP method improves over it by allowing individual pairwise decisions to benefit more from global information. Schulam and Fellbaum (2010) apply the approach of Sheinman and Tokunaga (2009) to German adjectives. Our method extends easily to various foreign languages as described in Section 5.

Another related task is the extraction of lexico-syntactic and lexico-semantic intensity-order patterns from large text corpora (Hearst, 1992; Chklovski and Pantel, 2004; Tandon and de Melo, 2010). Sheinman and Tokunaga (2009) follow Davidov and Rappoport (2008) to automatically bootstrap adjective scaling patterns using seed adjectives and Web hits. These methods can thus be used to provide the input patterns for our algorithm.

VerbOcean by Chklovski and Pantel (2004) extracts various fine-grained semantic relations (including the stronger-than relation) between pairs of verbs, using lexico-syntactic patterns over the Web. Our approach of jointly ranking a set of words using pairwise evidence is also applicable to the VerbOcean pairs, and should help address similar sparsity issues of local pairwise decisions. Such scales will again be quite useful for language learners and language understanding tools.

de Marneffe et al. (2010) infer yes-or-no answers to questions with responses involving scalar adjectives in a dialogue corpus. They correlate adjectives with ratings in a movie review corpus to find that good appears in lower-rated reviews than excellent.

Finally, there has been a lot of work on measuring the general sentiment polarity of words (Hatzivassiloglou and McKeown, 1997; Hatzivassiloglou and Wiebe, 2000; Turney and Littman, 2003; Liu and Seneff, 2009; Taboada et al., 2011; Yessenalina and Cardie, 2011; Pang and Lee, 2008). Our work instead aims at producing a large, unrestricted number of individual intensity scales for different qualities and hence can help in fine-grained sentiment analysis with respect to very particular content aspects.
4 Experiments

4.1 Data

Input Clusters. In order to obtain input clusters for evaluation, we started out with the satellite cluster or ‘dumbbell’ structure of adjectives in WordNet 3.0, which consists of two direct antonyms as the poles and a number of other satellite adjectives that are semantically similar to each of the poles (Gross and Miller, 1990). For each antonymy pair, we determined an extended dumbbell set by looking up synonyms and words in related (satellite adjective and ‘see-also’) synonym sets. We cut such an extended dumbbell into two antonymous halves and treated each of these halves as a potential input adjective cluster.

Most of these WordNet clusters are noisy for the purpose of our task, i.e. they contain adjectives that appear unrelatable on a single scale due to polysemy and semantic drift, e.g. violent with respect to supernatural and affected. Motivated by Sheinman and Tokunaga (2009), we split such hard-to-relate adjectives into smaller scale-specific subgroups using the corpus evidence.⁴ For this, we consider an undirected edge between each pair of adjectives that has a non-zero intensity score (based on the Web-scale scoring procedure described in Section 2.1.3). The resulting graph is then partitioned into connected components, such that any adjectives in a subgraph are at least indirectly connected via some path and thus much more likely to belong to the same intensity scale. While this does break up partitions whenever there is no corpus evidence connecting them, ordering the adjectives within each such partition remains a challenging task. This is because the Web evidence will still not necessarily directly relate all adjectives (in a partition) to each other. Additionally, the Web evidence may still indicate the wrong direction. Figure 3 shows the size distribution of the resulting partitions.

⁴ Note that we do not use the WordNet dataset of Sheinman and Tokunaga (2009) for evaluation, as it does not provide full scales. Instead, their annotators only made pairwise comparisons with select words, using a 5-way classification scheme (neutral, mild, very mild, intense, very intense).

Figure 3: The histogram of cluster sizes after partitioning (chain length → number of chains: 2 → 438, 3 → 115, 4 → 60, 5 → 35, 6 → 19, 7 → 12, 8 → 14, 9 → 5, 10–14 → 4, 15–17 → 3).
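As a small illustration of this partitioning step (the use of networkx here is our own choice, not something prescribed by the method), the evidence graph and its connected components can be obtained as follows:

    import networkx as nx

    def partition_cluster(words, score):
        """Split an adjective cluster into subgroups whose members are at least
        indirectly connected by non-zero pairwise intensity evidence."""
        g = nx.Graph()
        g.add_nodes_from(words)
        for a in words:
            for b in words:
                if a < b and (score.get((a, b), 0.0) != 0.0 or score.get((b, a), 0.0) != 0.0):
                    g.add_edge(a, b)
        return [sorted(component) for component in nx.connected_components(g)]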
Patterns. To construct our intensity pattern set, we started with a couple of common rankable adjective seed pairs such as (good, great) and (hot, boiling) and used the Web-scale n-grams corpus (Brants and Franz, 2006) to collect the few most frequent patterns between and around these seed pairs (in both directions). Among these, we manually chose a small set of intuitive patterns that are linguistically useful for ordering adjectives, several of which had not been discovered in previous work. These are shown in Table 1. Note that we only collected patterns that are not ambiguous with respect to the two orders; for example, the pattern ‘? , not ?’ is ambiguous because it can be used as both ‘good, not great’ and ‘great, not good’. Alternatively, one can also use fully automatic bootstrapping techniques based on seed word pairs (Hearst, 1992; Chklovski and Pantel, 2004; Yang and Su, 2007; Turney, 2008; Davidov and Rappoport, 2008). However, our semi-automatic approach is a simple and fast process that extracts a small set of high-quality and very general adjective-scaling patterns. This process can quickly be repeated from scratch in any other language. Moreover, as described in Section 5.1, the English patterns can also be projected automatically to patterns in other languages.

Development and Test Sets. Section 2.1 describes the method for collecting the intensity scores for adjective pairs, using Web-scale n-grams (Brants and Franz, 2006). We relied on a small development set to test the MILP structure and the pairwise score setup. For this, we manually chose 5 representative adjective clusters from the full set of clusters.

The final test set, distinct from this development set, consists of 569 word pairs in 88 clusters, each annotated by two native speakers of English. Both the gold test data and our code are freely available.⁵ To arrive at this data, we randomly drew 30 clusters each for cluster sizes 3, 4, and 5+ from the histogram of partitioned adjective clusters in Figure 3. While labeling a cluster, annotators could exclude words that they deemed unsuitable to fit on a single shared intensity scale with the rest of the cluster. Fortunately, the partitioning described earlier had already separated most such cases into distinct clusters. The annotators ordered the remaining words on a scale. Words that seemed indistinguishable in strength could share positions in their annotation.

⁵ http://demelo.org/gdm/intensity/

As our goal is to compare scale formation algorithms, we did not include trivial clusters of size 2. On such trivial clusters, the Web evidence alone determines the output and hence all algorithms, including the baseline, obtain the same pairwise accuracy (defined below) of 93.3% on a separate set of 30 random clusters of size 2.

Figure 4 shows the distribution of cluster sizes in our main gold set. The inter-annotator agreement in terms of Cohen’s κ (Cohen, 1960) on the pairwise classification task with 3 labels (weaker, stronger, or equal/unknown) was 0.64. In terms of pairwise accuracy, the agreement was 78.0%.

Figure 4: The histogram of cluster sizes in the test set (chain length → number of chains: 3 → 41, 4 → 27, 5 → 12, 6 → 3, 7 → 3, 8 → 2).

4.2 Metrics

In order to thoroughly evaluate the performance of our adjective ordering procedure, we rely on both pairwise and ranking-correlation evaluation metrics. Consider a set of input words A = {a1, a2, ..., an} and two rankings for this set: a gold-standard ranking rG(A) and a predicted ranking rP(A).

4.2.1 Pairwise Accuracy

For a pair of words ai, aj, we may consider the classification task of choosing one of three labels (<, >, =?) for the case of ai being weaker, stronger, or equal (or unknown) in intensity, respectively, compared to aj:

  L(ai, aj) =  <   if r(ai) < r(aj)
               >   if r(ai) > r(aj)
               =?  if r(ai) = r(aj)

For each pair (ai, aj), we compute gold-standard labels LG(ai, aj) and predicted labels LP(ai, aj) as above, and then the pairwise accuracy PW(A) for a particular ordering on A is simply the fraction of pairs that are correctly classified, i.e. for which the predicted label is the same as the gold-standard label:

  PW(A) = |{(i, j) : i < j, LP(ai, aj) = LG(ai, aj)}| / (n(n−1)/2)

4.2.2 Ranking Correlation Coefficients

Kendall’s tau coefficient. Kendall’s τ (Kendall, 1938) compares two rankings of the same set by examining pairs of elements. A pair of words ai, aj is considered:
• concordant iff the two rankings agree on the strict order of the two elements, i.e., rG(ai) > rG(aj) and rP(ai) > rP(aj), or rG(ai) < rG(aj) and rP(ai) < rP(aj).
• discordant iff the two rankings have an inverted strict order of the two elements, i.e., rG(ai) > rG(aj) and rP(ai) < rP(aj), or rG(ai) < rG(aj) and rP(ai) > rP(aj).
• tied iff rG(ai) = rG(aj) or rP(ai) = rP(aj).
The tie-corrected τ coefficient is then computed from the numbers of concordant, discordant, and tied pairs (Kendall, 1938; Kruskal, 1958), as illustrated in the sketch below.
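A minimal sketch of these pairwise and rank-based measures, under the assumption that gold and predicted scale positions are stored in dictionaries; scipy’s kendalltau computes the tie-corrected τ-b variant, which we take to be an adequate stand-in here:

    from scipy.stats import kendalltau

    def label(rank, a, b):
        """3-way pairwise label: '<' weaker, '>' stronger, '=?' equal or unknown."""
        if rank[a] < rank[b]:
            return "<"
        if rank[a] > rank[b]:
            return ">"
        return "=?"

    def pairwise_accuracy(words, gold, pred):
        """Fraction of unordered pairs whose predicted label matches the gold label."""
        pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
        correct = sum(label(gold, a, b) == label(pred, a, b) for a, b in pairs)
        return correct / len(pairs)

    def kendall(words, gold, pred):
        """Tie-corrected Kendall correlation between the two rankings."""
        tau, _ = kendalltau([gold[a] for a in words], [pred[a] for a in words])
        return tau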
Spearman’s rho correlation coefficient. For two n-sized ranked lists {xi} and {yi}, the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranks of the variables:

  ρ = Σ_i (xi − x̄)(yi − ȳ) / √( Σ_i (xi − x̄)² · Σ_i (yi − ȳ)² )

Here, x̄ and ȳ denote the means of the values in the respective lists. We use the standard procedure for handling ties correctly: tied values are assigned the average of all ranks of items sharing the same value in the ranked list sorted in ascending order of the values.

Handling Inversions. While annotating, we sometimes observed that the ordering itself was very clear but the annotators disagreed about which end of a particular scale was to count as the strong one, e.g. when transitioning from soft to hard or from alpha to beta. We thus also report average absolute values of both correlation coefficients, as these properly account for anticorrelations. Our test set only contains clusters of size 3 or larger, so there is no need to account for inversions in clusters of size 2.

4.3 Results

In Table 3, we use the evaluation metrics mentioned above to compare several different approaches.

Method                           Pairwise Accuracy   Avg. τ   Avg. |τ|   Avg. ρ   Avg. |ρ|
Web Baseline                     48.2%               N/A      N/A        N/A      N/A
Divide-and-Conquer               50.6%               0.45     0.53       0.52     0.62
Sheinman and Tokunaga (2009)     55.5%               N/A      N/A        N/A      N/A
MILP                             69.6%               0.57     0.65       0.64     0.73
MILP with synonymy               78.2%               0.57     0.66       0.67     0.80
Inter-Annotator Agreement        78.0%               0.67     0.76       0.75     0.86

Table 3: Main test results.

Web Baseline. The first baseline simply reflects the original pairwise Web-based intensity scores. We classify (with one of the 3 labels) a given pair of adjectives using the Web-based intensity scores (as described in Section 2.1.3) as follows:

  Lbaseline(ai, aj) =  <   if score(ai, aj) > 0
                       >   if score(ai, aj) < 0
                       =?  if score(ai, aj) = 0

Since score(ai, aj) represents the weak-strong score of the two adjectives, a more positive value means a higher likelihood of ai being weaker (<, on the left) in intensity than aj. In Table 3, we observe that the (micro-averaged) pairwise accuracy, as defined earlier, for the original Web baseline is 48.2%, while the ranking measures are undefined because the individual pairs do not lead to a coherent scale.

Divide-and-Conquer. The divide-and-conquer baseline recursively splits a set of words into three subgroups, placed to the left (weaker), on the same position (no evidence), or to the right (stronger) of a given randomly chosen pivot word. While this approach shows only a minor improvement in terms of the pairwise accuracy (50.6%), its main benefit is that one obtains well-defined intensity scales rather than just a collection of pairwise scores.

Sheinman and Tokunaga. The approach by Sheinman and Tokunaga (2009) involves a similar divide-and-conquer based partitioning in the first phase, except that their method makes use of synonymy information from WordNet and uses all synonyms in WordNet’s synset for the headword as neutral pivot elements (if the headword is not in WordNet, then the word with the maximal unigram frequency is chosen). In the second phase, their method performs pairwise comparisons within the more intense and less intense subgroups. We reimplement their approach here, using the Google n-grams dataset instead of online Web search engine hits. We observe a small improvement over the Web baseline in terms of pairwise accuracy.
Note that the rank correlation measure scores are undefined for their approach. This is because in some cases their method placed all words on the same position in the scale, which these measures cannot handle even in their tie-corrected versions. Overall, the Sheinman and Tokunaga approach does not aggregate information sufficiently well at the global level and often fails to make use of transitive inference.

MILP. Our MILP exploits the same pairwise scores to induce significantly more accurate pairwise labels with 69.6% accuracy, a 41% relative error reduction over the Web baseline, 38% over Divide-and-Conquer, and 32% over Sheinman and Tokunaga (2009). We further see that our MILP method is able to exploit external synonymy (equivalence) information (using synonyms marked by the annotators). The accuracy of the pairwise scores as well as the quality of the overall ranking increase even further, to 78.2%, approaching the human inter-annotator agreement.

In terms of average correlation coefficients, we observe similar improvement trends from the MILP, but of different magnitudes, because these averages give small clusters the same weight as larger ones.

4.4 Analysis

Confusion Matrices. For a given approach, we can study the confusion matrix obtained by cross-tabulating the gold classification with the predicted classification of every unique pair of adjectives in the ground truth data. Table 4 shows the confusion matrix for the Web baseline. We observe that due to the sparsity of pairwise intensity order evidence, the baseline method predicts too many ties.

                       Predicted Class
True Class      Weaker   Tie   Stronger
Weaker           117     127      15
Tie                5      42      15
Stronger          11     122     115

Table 4: Confusion matrix (Web baseline).

Table 5 provides the confusion matrix for the MILP (without external equivalence information) for comparison. Although the middle column still shows that the MILP predicts more ties than human annotators, we find that a clear majority of all unique pairs are now correctly placed along the diagonal. This confirms that our MILP successfully infers new ordering decisions, although it uses the same input (corpus evidence) as the baseline. The remaining ties are mostly just the result of pairs for which there simply is no evidence at all in the input Web counts. Note that this problem could, for instance, be circumvented by relying on a crowdsourcing approach: a few dispersed tie-breakers are enough to allow our MILP to correct many other predictions.

                       Predicted Class
True Class      Weaker   Tie   Stronger
Weaker           177      29      53
Tie                9      24      29
Stronger          15      38     195

Table 5: Confusion matrix (MILP).

Predicted Examples. Finally, in Table 6, we provide a selection of real results obtained by our algorithm. For instance, it correctly inferred that terrifying is more intense than creepy or scary, although the Web pattern counts did not provide any explicit information about these word pairs. In some cases, however, the Web evidence did not suffice to draw the right conclusions, or it was misleading due to issues like polysemy (as for the word funny).
Accuracy   Prediction                                                    Gold Standard
Good       hard < painful < hopeless                                     hard < painful < hopeless
           full < stuffed < (overflowing, overloaded)                    full < stuffed < overflowing < overloaded
           unusual < uncommon < rare < exceptional < extraordinary       uncommon < unusual < rare < extraordinary < exceptional
Average    creepy < scary < sinister < frightening < terrifying          creepy < (scary, frightening) < terrifying < sinister
Bad        (awake, conscious) < alive < aware                            alive < awake < (aware, conscious)
           strange < (unusual, weird) < (funny, eerie)                   (strange, funny) < unusual < weird < eerie

Table 6: Some examples (of bad, average, and good accuracy) of our MILP predictions (without synonymy information) and the corresponding gold-standard annotation.

While we show results on gold-standard chains here for evaluation purposes, in practice one can also recombine two [0, 1] chains for a pair of antonymic clusters to form a single scale from [−1, 1] that visualizes the full spectrum of available adjectives along a dimension, from adjacent all the way to removed, or from black to glaring.

5 Extension to Multilingual Ordering

Our method for globally ordering words on a scale can easily be applied to languages other than English. The entire process is language-independent as long as the required resources are available and a small number of patterns are chosen. For morphologically rich languages, the information extraction step may of course require additional morphological analysis tools for stemming and aggregating frequencies across different forms.

Alternatively, a cross-lingual projection approach is possible at multiple levels, utilizing information from the English data and ranking. As a first step, the set of words in the target language that we wish to rank can be projected from the English word set if necessary, e.g., as shown in de Melo and Weikum (2009). Next, we outline two projection methods for the ordering step. The first method is based on projecting the English intensity-ordering patterns to the new language and then using the same MILP as described in Section 2.2. In the second method, we also change the MILP and add cross-lingual constraints to better inform the target language’s adjective ranking. A detailed empirical evaluation of these approaches remains future work.

5.1 Cross-Lingual Pattern Projection

Instead of creating new patterns, in many cases we obtain quite adequate intensity patterns by using cross-lingual projection. We simply take several adjective pairs, instantiate the English patterns with them, and obtain new patterns using a machine translation system. Filling the wildcards in a pattern, say ‘? but not ?’, with good/excellent results in ‘good but not excellent’. This phrase is then translated into the target language using the translation system, say into German ‘gut aber nicht ausgezeichnet’. Finally, we put the wildcards back in the place of the translations of the adjective words, here gut and ausgezeichnet, to get the corresponding German pattern ‘? aber nicht ?’. Table 7 shows various German intensity patterns that we obtain by projecting from the English patterns as described. The process is repeated with multiple adjective pairs in case different variants are returned, e.g. due to morphology. Most of these translations deliver useful results.

Weak-Strong Patterns                              Strong-Weak Patterns
English              German                       English                  German
? but not ?          ? aber nicht ?               not ? just ?             nicht ? gerade ?
? if not ?           ? wenn nicht ?               not ? but just ?         nicht ? aber nur ?
? and almost ?       ? und fast ?                 not ? though still ?     nicht ? aber immer noch ?
not just ? but ?     nicht nur ? sondern ?        ? or very ?              ? oder sehr ?

Table 7: Examples of German intensity patterns projected (translated) directly from the English patterns.
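A rough sketch of this projection step follows; the translate function below is purely hypothetical and stands in for whichever machine translation system is available:

    def project_pattern(pattern, seed_pair, translate, src="en", tgt="de"):
        """Project an English intensity pattern into a target language by
        instantiating it with a seed pair, translating the resulting phrase,
        and re-inserting the wildcards in place of the translated seed words.

        pattern    -- e.g. "? but not ?"
        seed_pair  -- e.g. ("good", "excellent")
        translate  -- hypothetical MT function: translate(text, src, tgt) -> str
        """
        weak, strong = seed_pair
        phrase = pattern.replace("?", weak, 1).replace("?", strong, 1)
        translated = translate(phrase, src, tgt)    # e.g. "gut aber nicht ausgezeichnet"
        weak_t = translate(weak, src, tgt)          # e.g. "gut"
        strong_t = translate(strong, src, tgt)      # e.g. "ausgezeichnet"
        return translated.replace(weak_t, "?").replace(strong_t, "?")

In practice, one would repeat this for several seed pairs and keep the pattern variants that recur, mirroring the procedure described above.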
Now that we have the target language adjectives and the ranking patterns, we can compute the pairwise intensity scores using large-scale data in that language. We can use the Google n-grams corpora for 10 European languages (Brants and Franz, 2009), and also for Chinese (LDC2010T02) and Japanese (LDC2009T08). For other languages, one can use available large raw-text corpora or Web crawling tools.

5.2 Cross-Lingual MILP

To improve the rankings for lesser-resourced languages, we can further use a joint MILP approach for the new language we want to transfer this process to. Additional constraints between the English words and their corresponding target language translations, in combination with the English ranking information, allow the algorithm to obtain better rankings for the target words whenever the non-English target language corpus does not provide sufficient intensity order evidence.

In this case, the input set A contains words in multiple languages. The Web intensity scores score(ai, aj) should be set to zero when comparing words across languages. We instead link them using a translation table T ⊆ {1, ..., N} × {1, ..., N} from a translation dictionary or phrase table. Here, (i, j) ∈ T signifies that ai is a translation of aj. We do not require a bijective relationship between them (i.e., translations need not be unique). The objective function is augmented by adding the new term

  Σ_{(i,j) ∈ T} (w′ij + s′ij) · CT      (5)

for a constant CT > 0 that determines how much weight we assign to translations as opposed to the corpus count scores. The MILP is extended by adding the following extra constraints:

  dij − w′ij CT < −dmax          ∀ i, j ∈ {1, ..., N}
  dij + (1 − w′ij) CT ≥ −dmax    ∀ i, j ∈ {1, ..., N}
  dij + s′ij CT > dmax           ∀ i, j ∈ {1, ..., N}
  dij − (1 − s′ij) CT ≤ dmax     ∀ i, j ∈ {1, ..., N}
  w′ij ∈ {0, 1}                  ∀ (i, j) ∈ T
  s′ij ∈ {0, 1}                  ∀ (i, j) ∈ T

The variables dij, as before, encode distances between positions of words on the scale, but now also include cross-lingual pairs of words in different languages. The new constraints encourage translational equivalents to remain close to each other, preferably within a desired (but not strictly enforced) maximum distance dmax. The new variables w′ij, s′ij are similar to wij, sij in the standard MILP. However, the w′ij become 1 if and only if dij ≥ −dmax, and the s′ij become 1 if and only if dij ≤ dmax. If both w′ij and s′ij are 1, then the two words have a small distance −dmax ≤ dij ≤ dmax. The augmented objective function explicitly encourages this for translational equivalents. Overall, this approach thus allows evidence from a language with more Web evidence to improve the process of adjective ordering in lesser-resourced languages.
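Continuing the earlier illustrative PuLP sketch from Section 2.2 (again an assumption on our part, with the strict inequalities approximated by a small epsilon), the cross-lingual constraints could be added along the following lines:

    import pulp

    def add_translation_constraints(prob, x, translations, C_T, d_max, eps=1e-6):
        """Encourage translational equivalents to stay within d_max of each other.

        prob         -- the pulp.LpProblem built by the monolingual sketch
        x            -- position variables for the words of *both* languages
        translations -- the translation table T as a set of word pairs
        C_T          -- translation weight / big-M constant (assumed > 1 + d_max)
        d_max        -- preferred maximum distance between translational equivalents
        """
        bonus = []
        for (a, b) in translations:
            d = x[b] - x[a]
            wp = pulp.LpVariable("wT_%s_%s" % (a, b), cat="Binary")
            sp = pulp.LpVariable("sT_%s_%s" % (a, b), cat="Binary")
            # w'_ij = 1 iff d_ij >= -d_max ;  s'_ij = 1 iff d_ij <= d_max
            prob += d - wp * C_T <= -d_max - eps
            prob += d + (1 - wp) * C_T >= -d_max
            prob += d + sp * C_T >= d_max + eps
            prob += d - (1 - sp) * C_T <= d_max
            bonus.append((wp + sp) * C_T)
        # Reward placing each translation pair within the desired distance band.
        prob.setObjective(prob.objective + pulp.lpSum(bonus))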
6 Conclusion

In this work, we have presented an approach to the challenging and little-studied task of ranking words in terms of their intensity on a continuous scale. We address the issue of sparsity of the intensity order evidence in two ways. First, pairwise intensity scores are computed using linguistically intuitive patterns in a very large, Web-scale corpus. Next, a Mixed Integer Linear Program (MILP) expands on this further by inferring new relative relationships. Instead of making ordering decisions about word pairs independently, our MILP considers the joint decision space and factors in, e.g., how two adjectives relate to some third adjective, thus enforcing global constraints such as transitivity.

Our approach is general enough to allow additional evidence such as synonymy in the MILP, and can straightforwardly be applied to other word classes (such as verbs) and to other languages (monolingually as well as cross-lingually). The overall results across multiple metrics are substantially better than previous approaches, and fairly close to human agreement on this challenging task.

Acknowledgments

We would like to thank the editor and the anonymous reviewers for their helpful feedback.

References

Mohit Bansal and Dan Klein. 2011. Web-scale features for full-scale parsing. In Proceedings of ACL 2011.

Thorsten Brants and Alex Franz. 2006. The Google Web 1T 5-gram corpus version 1.1. LDC2006T13.

Thorsten Brants and Alex Franz. 2009. Web 1T 5-gram, 10 European languages, version 1. LDC2009T25.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of EMNLP 2004.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Dmitry Davidov and Ari Rappoport. 2008. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In Proceedings of ACL 2008.

Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2010. Was it good? It was provocative. Learning the meaning of scalar adjectives. In Proceedings of ACL 2010.

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of CIKM 2009.

Zhicheng Dou, Ruihua Song, Xiaojie Yuan, and Ji-Rong Wen. 2008. Are click-through data adequate for learning web search rankings? In Proceedings of CIKM 2008.

Derek Gross and Katherine J. Miller. 1990. Adjectives in WordNet. International Journal of Lexicography, 3(4):265–277.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1993. Towards the automatic identification of adjectival scales: Clustering adjectives according to meaning. In Proceedings of ACL 1993.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the semantic orientation of adjectives. In Proceedings of ACL 1997.

Vasileios Hatzivassiloglou and Janyce M. Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of COLING 2000.

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING 1992.

Diana Inkpen and Graeme Hirst. 2006. Building and using a lexical knowledge base of near-synonym differences. Computational Linguistics, 32(2):223–262.

Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Adam Kilgarriff. 2007. Googleology is bad science. Computational Linguistics, 33(1).

William H. Kruskal. 1958. Ordinal measures of association. Journal of the American Statistical Association, 53(284):814–861.

Jingjing Liu and Stephanie Seneff. 2009. Review sentiment scoring via a parse-and-paraphrase paradigm. In Proceedings of EMNLP 2009.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, and Peter D. Turney. 2013. Computing lexical contrast. Computational Linguistics.
Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Peter F. Schulam and Christiane Fellbaum. 2010. Automatically determining the semantic gradation of German adjectives. In Proceedings of KONVENS 2010.

Vera Sheinman and Takenobu Tokunaga. 2009. AdjScales: Visualizing differences between adjectives for language learners. IEICE Transactions on Information and Systems, 92(8):1542–1550.

Vera Sheinman, Takenobu Tokunaga, I. Julien, P. Schulam, and C. Fellbaum. 2012. Refining WordNet adjective dumbbells using intensity relations. In Proceedings of the Global WordNet Conference 2012.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of COLING/ACL 2006.

Charles Spearman. 1904. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. 2009. SOFIE: A self-organizing framework for information extraction. In Proceedings of WWW 2009.

Maite Taboada, Julian Brooke, Milan Tofiloski, and Kimberly Voll. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics.

Niket Tandon and Gerard de Melo. 2010. Information extraction from web-scale n-gram data. In Proceedings of the SIGIR 2010 Web N-gram Workshop.

Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21(4):315–346.

Peter D. Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of COLING 2008.

Xiaofeng Yang and Jian Su. 2007. Coreference resolution using semantic relatedness information from automatically discovered patterns. In Proceedings of ACL 2007.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of EMNLP 2011.