Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora

Andrius Mudinas, Dell Zhang, and Mark Levene
Department of Computer Science and Information Systems
Birkbeck, University of London
London WC1E 7HX, UK
andrius@dcs.bbk.ac.uk, dell.z@ieee.org, mark@dcs.bbk.ac.uk

Abstract

There is often the need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier or a pre-built one for a different domain, the effectiveness would be inferior. In this paper, we explore the possibility of building domain-specific sentiment classifiers with unlabeled documents only. Our investigation indicates that in the word embeddings learned from the unlabeled corpus of a given domain, the distributed word representations (vectors) for opposite sentiments form distinct clusters, though those clusters are not transferable across domains. Exploiting such a clustering structure, we are able to utilize machine learning algorithms to induce a quality domain-specific sentiment lexicon from just a few typical sentiment words ("seeds"). An important finding is that simple linear model based supervised learning algorithms (such as linear SVM) can actually work better than more sophisticated semi-supervised/transductive learning algorithms which represent the state-of-the-art technique for sentiment lexicon induction. The induced lexicon could be applied directly in a lexicon-based method for sentiment classification, but a higher performance could be achieved through a two-phase bootstrapping method which uses the induced lexicon to assign positive/negative sentiment scores to unlabeled documents first, and then uses those documents found to have clear sentiment signals as pseudo-labeled examples to train a document sentiment classifier via supervised learning algorithms (such as LSTM). On several benchmark datasets for document sentiment classification, our end-to-end pipelined approach, which is overall unsupervised (except for a tiny set of seed words), outperforms existing unsupervised approaches and achieves an accuracy comparable to that of fully supervised approaches.

1 Introduction

Sentiment analysis (Liu, 2015) is a popular research topic which has a wide range of applications, such as summarizing customer reviews, monitoring social media, and predicting stock market trends (Bollen et al., 2011). A basic task in sentiment analysis is to classify the sentiment polarity of a given piece of text (document), i.e., whether the opinion expressed in the text is positive or negative (Pang et al., 2002), which is the focus of this paper.

There are many different approaches to sentiment classification in the Natural Language Processing (NLP) literature — from simple lexicon-based methods (Ding et al., 2008; Thelwall et al., 2010; Thelwall et al., 2012) to learning-based approaches (Pang and Lee, 2004; Turney, 2002; Jo and Oh, 2011; Argamon et al., 2007; Lin and He, 2009), and also hybrid methods in between (Mudinas et al., 2012; Zhang et al., 2011). No matter which approach is taken, a sentiment classifier built for its target domain would work well only within that specific domain, but suffer a serious performance loss once the domain boundary is crossed. The same word could drastically change its sentiment polarity (and/or strength) if it is used in a different domain.
For example, being "small" is likely to be negative for a hotel room but positive for a digital camcorder, being "unexpected" may be a good thing for the ending of a movie but not for the engine of a car, and we will probably enjoy "interesting" books but not necessarily "interesting" food. Here, the domain could be defined not by the topic of the documents but by the style of writing. For example, the meanings of words like "gay" and "terrific" would depend on whether the text was written in a historical era or modern times.

When we need to perform sentiment classification in a new domain unseen before, there is usually neither a labeled dictionary with which to employ lexicon-based sentiment classifiers nor a labeled corpus with which to train learning-based sentiment classifiers. It is, of course, possible to resort to a general-purpose off-the-shelf sentiment classifier, or a pre-built one for a different domain. However, the effectiveness would often be unsatisfactory because of the reasons mentioned above. There have been some studies on domain adaptation or transfer learning for sentiment classification (Blitzer et al., 2007; Tan et al., 2009; Pan et al., 2010; Glorot et al., 2011; Yoshida et al., 2011; Bollegala et al., 2013; Xia et al., 2013; Yang and Eisenstein, 2015), but they still require a large amount of labeled training data from a fairly similar source domain, which is not always feasible. Those algorithms also tend to be computationally expensive and time-consuming (Mohammad and Turney, 2010; Fast et al., 2016).

In this paper, we propose an end-to-end pipelined nearly-unsupervised approach to domain-specific sentiment classification of documents for a new domain based on distributed word representations (vectors). As shown in Fig. 1, the proposed approach consists of three main stages (components): (1) domain-specific sentiment word embedding, (2) domain-specific sentiment lexicon induction, and (3) domain-specific sentiment classification of documents. Briefly speaking, given a large unlabeled corpus for a new domain, we would first set up the vector space for that domain via word embedding, then induce a sentiment lexicon in the discovered vector space from a very small set of seed words as well as a general-purpose lexicon, and finally exploit the induced lexicon in a lexicon-based document sentiment classifier to bootstrap a more effective learning-based document sentiment classifier for that domain. The second stage of our approach outperforms the state-of-the-art unsupervised method for sentiment lexicon induction (Hamilton et al., 2016), which is the most closely related work (see Section 2). The key to the superior performance of our method compared with theirs is the insight gained from our first stage that positive and negative sentiment words are largely clustered in the domain-specific vector space but these two clusters have a non-negligible overlap; therefore, semi-supervised/transductive learning algorithms could be easily misled by the examples in the overlap and would actually not work as well as simple supervised classification algorithms.
Overall, the document sentiment classifier resulting from our nearly-unsupervised approach does not require any labeled document to be trained, and it can outperform the state-of-the-art unsupervised method for document sentiment classification (Eisenstein, 2017). The source code for our implemented system and the datasets for our experiments are open to the research community (https://goo.gl/8K9PbE).

The rest of this paper is organized as follows. In Section 2, we review previous studies on this topic. In Sections 3 to 5, we describe the three main stages of our approach respectively. In Section 6, we draw conclusions and discuss future work.

Figure 1: Our nearly-unsupervised approach to domain-specific sentiment classification. (The pipeline runs from unlabeled training documents to word embeddings; together with the sentiment seeds, a probabilistic word classifier produces a sentiment lexicon; a lexicon-based sentiment classifier then yields pseudo-labeled training documents, on which a learning-based sentiment classifier is trained to classify the unlabeled test documents.)

2 Related Work

Most of the early sentiment analysis systems took lexicon-based approaches to document sentiment classification which rely on pre-compiled sentiment lexicons (Owsley et al., 2006). Various methods have been proposed to automatically produce such sentiment lexicons (Hu and Liu, 2004; Ding et al., 2008). Later, the focus of research shifted to learning-based approaches (Pang et al., 2002; Pang and Lee, 2004), as supervised learning algorithms usually deliver a much higher accuracy in sentiment classification than pure lexicon-based methods. However, lexicons have not completely lost their attractiveness: they are usually easier to understand and to maintain by non-experts, and they can also be integrated into learning-based sentiment classifiers (Mudinas et al., 2012; Eisenstein, 2017). The lexicon-based sentiment classifier used in our experiments is a publicly-available system called pSenti (https://goo.gl/pj4XAQ) (Mudinas et al., 2012). In addition to a customizable sentiment lexicon, it also uses shallow NLP techniques like part-of-speech (POS) tagging and the detection of sentiment inverters and other modifiers (intensifying and diminishing adverbs).

The introduction of modern word embedding techniques like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) has opened the possibility of new sentiment analysis methods. Given a large unlabeled corpus, such techniques can learn from word co-occurrence information and produce a vector space of hundreds of dimensions, with each word being assigned a corresponding vector. The resulting vector space helps in understanding the semantic relationships between words and allows grouping of words based on their linguistic similarities. Recently, Rothe et al. (2016) proposed the DENSIFIER method that can reduce the dimensionality of word embeddings without losing semantic information, and explored its application in various domains. For the SemEval-2015 task (Rosenthal et al., 2015), DENSIFIER performed slightly worse than word2vec, though its training time was shorter by a factor of 21. In fact, previous studies such as (Rothe et al., 2016; Cliche, 2017) suggest that word2vec usually provides the best word embeddings for sentiment analysis tasks.

In their recent work, Hamilton et al.
(2016) demonstrated that by starting from a small set of seed words and conducting label propagation over the lexical graph derived from the pairwise proximities of word embeddings, they could induce a domain-specific sentiment lexicon comparable to a hand-curated one. Intuitively, the success of their method, named SentProp, requires a relatively clear separation between sentiment words of opposite polarity in the vector space which, as we will show later, is not very realistic. Moreover, they have focused on the induction of sentiment lexicons alone, while we are trying to design an end-to-end pipeline that can turn unlabeled documents in a new domain directly into their sentiment classifications, with domain-specific sentiment lexicon induction as a key component.

Recent advances in deep learning (LeCun et al., 2015) have elevated sentiment analysis to new performance levels (Kim, 2014; Dai and Le, 2015; Hong and Fang, 2015). As reported by Dai and Le (2015), the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) Recurrent Neural Network (RNN) can reach or surpass the performance levels of all previous baselines for sentiment classification of documents. One of the many appeals of LSTM is that it can connect previous information to the current context and allow seamless integration of pre-trained word embeddings as the first (projection) layer of the neural network. Moreover, Radford et al. (2017) discovered the "sentiment unit", the single unit which can learn the perfect representation of sentiment, in a multiplicative LSTM with 4096 units, despite the fact that the LSTM was only trained for a completely different purpose — to predict the next character in the text of Amazon reviews. Our results are in line with those findings and confirm the superiority of LSTM in building document-level sentiment classifiers.

Zhang et al. (2011) tried to address the low recall problem of lexicon-based methods for Twitter sentiment classification by training a learning-based sentiment classifier using the noisy labels generated by a lexicon-based sentiment classifier (Ding et al., 2008). Although the basic idea of their work is similar to what we do in the third stage of our approach (see Section 5), there exist several notable differences. First, they adopted a single general-purpose sentiment lexicon provided by Ding et al. (2008) and used it for all domains, while we induce a different lexicon for each different domain. Consequently, their method could have a relatively large variance in document sentiment classification performance because of the domain mismatch (e.g., F1 = 0.874 for the "Tangled" tweets and F1 = 0.647 for the "Obama" tweets), whereas our approach performs quite consistently over different domains. Second, they needed to strip all the previously-known opinion words in their single general-purpose sentiment lexicon from the training documents in order to prevent training bias and force their document sentiment classifier to exploit domain-specific features, but doing so obviously loses the very valuable sentiment signals carried by those opinion words. In contrast, we are able to utilize all terms in the training documents as features when building our document sentiment classifiers, including those opinion words that appear in our automatically induced domain-specific lexicons.
Third, they designed their method specifically for Twitter sentiment classification, while our approach works not only for short texts such as tweets (see Section 5.2) but also for long texts such as customer reviews (see Section 5.1). Fourth, they had to use an intermediate step to identify additional opinionated tweets (according to the opinion indicators extracted through the χ2 test on the results of their lexicon-based sentiment classifier) in order to handle the neutral class, but we do not require that time-consuming step as we use the calibrated probabilistic outputs of our document sentiment classifier to detect the neutral class (see Section 5.3).

3 Domain-Specific Sentiment Word Embedding

Our approach to domain-specific document-level sentiment classification is built on top of word embeddings — distributed word representations (vectors) that can be learned from an unlabeled corpus to encode the semantic similarities between words (Goldberg, 2017).

In this section, we investigate how the embeddings of sentiment words for a particular domain would look in the domain-specific vector space. To ensure a fair comparison with the state-of-the-art sentiment lexicon induction technique SentProp (Hamilton et al., 2016; https://goo.gl/BFkY8N) later in Section 4, we adopt the same publicly-available pre-trained word embeddings for the following three domains together with the corresponding sets of sentiment words (i.e., sentiment lexicons).

• Standard-English. We use the Google News word embeddings (https://goo.gl/5r79l6) and the 'General Inquirer' lexicon (Stone et al., 1966) with the sentiment polarity scores collected by Warriner et al. (2013).

• Twitter. We use the word embeddings constructed by Rothe et al. (2016) and the sentiment lexicon from the SemEval-2015 Task 10E (Rosenthal et al., 2015).

• Finance. We use the word embeddings learned using an SVD-based method (Manning et al., 2008) from a collection of "8-K" financial reports (https://goo.gl/7ntr2V) (Lee et al., 2014) and the finance sentiment lexicon hand-crafted by Hamilton et al. (2016).

Note that the above three sentiment lexicons are used both for the inspection of sentiment word distributions in this section and for the evaluation of sentiment lexicon induction in the next section. Furthermore, to facilitate a fair comparison with the state-of-the-art unsupervised document sentiment classification technique ProbLex-DCM (Eisenstein, 2017; https://goo.gl/Qr993F) later in Section 5, we also adopt the following two document collections which they have used.

• IMDB. We use 50k movie reviews in English from IMDB (Maas et al., 2011) with 25k labeled training documents.

• Amazon. We use about 28k product reviews in English across four product categories from Amazon (Blitzer et al., 2007; McAuley and Leskovec, 2013) with 8k labeled training documents.

The word embeddings for the above two domains were trained by us on the respective corpora using word2vec (Mikolov et al., 2013), which employs a two-layer neural network and is by far the most widely used word embedding technique. Specifically, we ran word2vec with skip-gram and a five-word window to construct word vectors of 500 dimensions, as recommended by previous studies (https://goo.gl/SyAdej). The sentiment lexicon made by Liu (2015) is consistently one of the best for analyzing reviews (Ribeiro et al., 2016), so it is used for both of those domains.
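As a concrete illustration, the following is a minimal sketch (not the authors' released code) of training such domain-specific embeddings with gensim's word2vec implementation, using the hyperparameters stated above; the toy `reviews` corpus and the gensim 4.x parameter names are assumptions.

```python
from gensim.models import Word2Vec

# `reviews` stands in for the tokenized review corpus of the target domain.
reviews = [["the", "room", "is", "clean", "and", "tidy"],
           ["the", "room", "is", "small", "and", "messy"]]

model = Word2Vec(
    sentences=reviews,
    sg=1,             # skip-gram, as used in the paper
    window=5,         # five-word context window
    vector_size=500,  # 500-dimensional word vectors (gensim 4.x name)
    min_count=1,      # keep rare words in this toy example
)

vec = model.wv["room"]  # the 500-dimensional embedding of a word
```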
Drawing an analogy to the well-known cluster hypothesis in Information Retrieval (IR) (Manning et al., 2008), here we put forward the cluster hypothesis for sentiment analysis: words in the same cluster behave similarly with respect to sentiment polarity in a specific domain. That is to say, we expect positive and negative sentiment words to form distinct clusters, given that they have been represented in an appropriate vector space. To verify this hypothesis, it would be useful to visualize the high-dimensional sentiment word vectors in a 2D plane. We have tried a number of dimensionality reduction techniques including t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008), but found that simply using the classic Principal Component Analysis (PCA) (Bishop, 2006) works very well for this purpose.

We have found that in general, the above cluster hypothesis holds for word embeddings within a specific domain. Fig. 2a shows that in the Standard-English domain, the sentiment words with opposite polarities form two distinct clusters. However, it can also be seen that those two clusters overlap with each other. That is because each word carries not only a sentiment value but also its linguistic and semantic information.

Figure 2: Visualisation of the sentiment words in the Standard-English domain: (a) the global vector space showing two clusters; (b) a local region of the vector space zoomed in.

Zooming into one of the word vector space regions (Fig. 2b) can help us understand why sentiment words with different polarities could be grouped together: 'hail', 'stormy' and 'sunny' are linguistically similar as they all describe weather conditions, yet they convey very different sentiment values. Moreover, as described by Plutchik (1984), sentiment can be grouped into multiple dimensions such as joy–sadness, anger–fear, trust–disgust and anticipation–surprise. Putting that aside, certain sentiment words can be classified sometimes as positive and sometimes as negative, depending on the context.
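The kind of 2D visualization shown in Fig. 2 can be reproduced with a few lines of code. Below is a sketch assuming the word2vec `model` from the earlier sketch and two hypothetical lists of lexicon words; it projects the 500-dimensional vectors onto their first two principal components.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pos_words = ["good", "clean", "tidy"]  # toy stand-ins for lexicon entries
neg_words = ["bad", "small", "messy"]

words = [w for w in pos_words + neg_words if w in model.wv]
X = np.array([model.wv[w] for w in words])

xy = PCA(n_components=2).fit_transform(X)  # project onto the top 2 components
for (x, y), w in zip(xy, words):
    plt.scatter(x, y, marker="+" if w in pos_words else "_")
plt.show()
```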
These reasons lead to the phenomenon that many sentiment words are located in the overlapping, noisy region between the two clusters in the domain-specific vector space.

On visual inspection of the Finance (Fig. 3a) sentiment words and IMDB (Fig. 4a) sentiment words in their respective vector spaces, we can see that positive and negative words form distinct clusters which are largely separable. However, if we consider Finance sentiment words in the IMDB vector space (see Fig. 3b), positive and negative words would be mixed together and could not be separated easily.

Figure 3: Sentiment words of Finance in the same/different domain vector space: (a) in the Finance (same domain) vector space; (b) in the IMDB (different domain) vector space.

Figure 4: Sentiment words about movies in the IMDB vector space before/after filtering: (a) original/full; (b) filtered.

One may be surprised that positive and negative sentiment words form their respective clusters, because most of the time they could be used in exactly the same context, which might suggest that they would result in similar word embeddings. For example, we could say "the room is good" and also "the room is bad": both are legitimate sentences. The probable reason for the cluster hypothesis to be true is that in reality people tend to use positive sentiment words together much more often than to mix them with negative sentiment words, and vice versa. For example, it would be much more common for us to see sentences like "the room is clean and tidy" than "the room is clean but messy". It is a long-established fact in computational linguistics that words with similar meanings tend to occur near each other (Miller and Charles, 1991); sentiment words are no exception (Turney, 2002). Moreover, it has been widely observed that online customer reviews are affected by the so-called love–hate self-selection bias: users tend to rate only products which they either like or hate, leading to a lot more 1-star and 5-star ratings than other (moderate) ratings; if the product is just average or so-so, they probably will not bother to leave reviews. The polarization of online customer reviews would also encourage the clustering of sentiment words into opposite polarities.

4 Domain-Specific Sentiment Lexicon Induction

Given the word embeddings for a specific domain, we can induce a customized sentiment lexicon from a few typical sentiment words ("seeds") frequently used in that particular domain. Such an induced domain-specific sentiment lexicon plays a crucial role in the pipeline towards domain-specific document-level sentiment classification.

Table 1 shows the seed words for five different domains, which are identical to those used by Hamilton et al. (2016) except for the two additional domains IMDB and Amazon. The induction of a sentiment lexicon could then be formulated as a simple word sentiment classification problem with two classes (positive vs. negative). Each word is represented as a vector via domain-specific word embedding; the seed words are labeled with their corresponding classes while all the other words (i.e., "candidates") are unlabeled; the task here is to learn a classifier from the labeled examples first and then apply it to predict the sentiment polarity of each unlabeled candidate word. The probabilistic outputs of such a word sentiment classifier could be regarded as the measure of confidence about the predicted sentiment polarity. In the end, those candidate words with a high probability of being either positive or negative would be added to the sentiment lexicon.
The final induced sentiment lexicon would include both the seed words and the selected candidate words.

As pointed out by Mudinas et al. (2012), if we simply consider all words from the given corpus as candidate words, the above described word sentiment classifier tends to assign sentiment values not only to the actual sentiment words but also to their associated product features or, more generally, the aspects of the expressed view. For example, if a lot of customers do not like the weight of a product, the word sentiment classifier may assign strong negative sentiment to "weight", yet this is not stable — the sentiment polarity of "weight" may be different when a new version of the product is released or the customer population has changed, and furthermore it probably does not apply to other products. To avoid this potential issue, it would be necessary to consider only a high-quality list of candidate words which are likely to be genuine sentiment words. Such a list of candidate words could be obtained directly from general-purpose sentiment lexicons. It is also possible to perform NLP on the target domain corpus and extract frequently-occurring adjectives or other typical sentiment indicators like emoticons as candidate words, which is beyond the scope of this paper.

To examine the effectiveness of different machine learning algorithms for building such domain-specific word sentiment classifiers, we attempt to recreate known sentiment lexicons in three domains: Standard-English, Twitter, and Finance (see Section 3), in the same way as Hamilton et al. (2016) did. Put differently, for the purpose of evaluation, we would just use a known sentiment lexicon in the corresponding domain as the list of candidate words and see how different machine learning algorithms would classify those candidate words based on their domain-specific word embeddings. For those lexicons with ternary sentiment classification (positive vs. neutral vs. negative), the class-mass normalization method (Zhu et al., 2003) used by Hamilton et al. (2016) has been applied here to identify the neutral category. The quality of each induced lexicon for a specific domain is evaluated by comparing it with its corresponding known lexicon as the ground-truth, according to the performance metrics which are the same as in (Hamilton et al., 2016): Area Under the Receiver-Operating-Characteristic (ROC) Curve (AUC) for the binary classifications (ignoring the neutral class, as is common in previous work) and Kendall's τ rank correlation coefficient with continuous human-annotated polarity scores. Note that Kendall's τ is not suitable for the Finance domain, as its known sentiment lexicon is only binary. Therefore, our experimental setting and performance measures are all identical to those of Hamilton et al. (2016), which ensures the validity of the empirical comparison between our approach and theirs.
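To make the formulation above concrete, here is a minimal sketch of the induction step, using logistic regression as the word sentiment classifier; the word2vec `model` from the earlier sketches is assumed, and the seed and candidate lists are toy stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

pos_seeds = ["clean", "tidy"]              # toy seeds present in the toy corpus
neg_seeds = ["small", "messy"]
candidates = ["room", "is", "the", "and"]  # normally from a general-purpose lexicon

# Train on the embedded seed words (1 = positive, 0 = negative).
X_train = np.array([model.wv[w] for w in pos_seeds + neg_seeds])
y_train = np.array([1] * len(pos_seeds) + [0] * len(neg_seeds))
clf = LogisticRegression().fit(X_train, y_train)

# Keep the seeds, plus candidates classified with high confidence.
lexicon = {w: y for w, y in zip(pos_seeds + neg_seeds, y_train)}
for w in candidates:
    p_pos = clf.predict_proba(model.wv[w].reshape(1, -1))[0, 1]
    if max(p_pos, 1.0 - p_pos) >= 0.7:     # cut-off probability threshold
        lexicon[w] = int(p_pos >= 0.5)
```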
Corpus | Positive | Negative
Standard-English | good, lovely, excellent, fortunate, pleasant, delightful, perfect, loved, love, happy | bad, horrible, poor, unfortunate, unpleasant, disgusting, evil, hated, hate, unhappy
Twitter | love, loved, loves, awesome, nice, amazing, best, fantastic, correct, happy | hate, hated, hates, terrible, nasty, awful, worst, horrible, wrong, sad
Finance | successful, excellent, profit, beneficial, improving, improved, success, gains, positive | negligent, loss, volatile, wrong, losses, damages, bad, litigation, failure, down, negative
IMDB | good, excellent, perfect, happy, interesting, amazing, unforgettable, genius, gifted, incredible | bad, bland, horrible, disgusting, poor, banal, shallow, disappointed, disappointing, lifeless, simplistic, bore
Amazon | IMDB domain seeds (as above) plus positive, fortunate, correct, nice | IMDB domain seeds (as above) plus negative, unfortunate, wrong, terrible, inferior

Table 1: The "seeds" for domain-specific sentiment lexicon induction.

In Table 2, we compare a number of typical supervised and semi-supervised/transductive learning algorithms for word sentiment classification in the context of domain-specific sentiment lexicon induction:

• kNN — k Nearest Neighbors (Hastie et al., 2009),
• LR — Logistic Regression (Hastie et al., 2009),
• SVMlin — Support Vector Machine with the linear kernel (Joachims, 1998),
• SVMrbf — Support Vector Machine with the non-linear RBF kernel (Joachims, 1998),
• TSVM — Transductive Support Vector Machine (Joachims, 1999),
• S3VM — Semi-Supervised Support Vector Machine (Gieseke et al., 2012),
• CPLE — Contrastive Pessimistic Likelihood Estimation (Loog, 2016),
• SGT — Spectral Graph Transducer (Joachims, 2003),
• SentProp — a label propagation based classification method proposed for the SocialSent system (Hamilton et al., 2016).

The suitable parameter values of the above learning algorithms (such as the C for SVM) are found via grid search with cross-validation, and the probabilistic outputs are given by Platt scaling (Platt, 2000) if they are not provided by the original learning algorithm.

The experimental results shown in Table 2 demonstrate that in almost every single domain, simple linear model based supervised learning algorithms (LR and SVMlin) can achieve the optimal or near-optimal accuracy for the sentiment lexicon induction task, and they outperform the state-of-the-art sentiment lexicon induction method SentProp (Hamilton et al., 2016) by a large margin. The performance improvements are statistically significant (p-value < 0.05) according to the sign test. There does not seem to be any benefit in utilizing non-linear models (kNN and SVMrbf) or semi-supervised/transductive learning algorithms (TSVM, S3VM, CPLE, SGT, and SentProp). A qualitative analysis of the sentiment lexicons induced by different methods shows that they differ only on those borderline, ambiguous words (such as "soft") residing in the noisy overlapping region between the two clusters in the vector space (see Section 3). In particular, SentProp is based on label propagation over the lexical graph of words, so it could be easily misled by noisy borderline words when the sentiment clusters have considerable overlap with each other, resulting in a kind of "over-fitting" (Bishop, 2006). Furthermore, according to our experiments on the same machine, those simple linear models are 70+ times faster than SentProp. The speed difference is mainly due to the fact that supervised learning algorithms only need to train on a small number of labeled words ("seeds" in our context), while semi-supervised/transductive learning algorithms need to train on not only a small number of labeled words but also a large number of unlabeled words.
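The model-selection step described above can be sketched as follows: a grid search with cross-validation over the SVM's C parameter, followed by Platt scaling (sklearn's sigmoid calibration) to obtain probabilistic outputs. `X_train` and `y_train` are the embedded seed words from the previous sketch; the parameter grid shown is an illustrative assumption.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Grid search with cross-validation for a suitable C (tiny folds here,
# since the toy seed set has only four words).
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=2)
grid.fit(X_train, y_train)

# LinearSVC has no predict_proba, so wrap it with Platt scaling:
# a sigmoid fitted to held-out decision values via cross-validation.
svm = CalibratedClassifierCV(grid.best_estimator_, method="sigmoid", cv=2)
svm.fit(X_train, y_train)
p_pos = svm.predict_proba(X_train)[:, 1]
```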
Metric | Corpus | kNN | LR | SVMlin | SVMrbf | TSVM | S3VM | CPLE | SGT | SentProp
AUC | Standard-English | 0.892 | 0.931 | 0.939 | 0.941 | 0.901 | 0.540 | 0.680 | 0.852 | 0.906
AUC | Twitter | 0.849 | 0.900 | 0.895 | 0.895 | 0.770 | 0.521 | 0.651 | 0.725 | 0.860
AUC | Finance | 0.711 | 0.944 | 0.942 | 0.932 | 0.665 | 0.561 | 0.836 | 0.725 | 0.916
τ | Standard-English | 0.469 | 0.495 | 0.498 | 0.495 | 0.487 | 0.038 | 0.162 | 0.409 | 0.440
τ | Twitter | 0.490 | 0.569 | 0.548 | 0.547 | 0.522 | 0.001 | 0.211 | 0.437 | 0.500

Table 2: Comparing the induced lexicons with their corresponding known lexicons (ground-truth) according to the ranking of sentiment words, measured by AUC and Kendall's τ.

It has also been observed in our experiments that there is a typical precision/recall trade-off (Manning et al., 2008) for the automatic induction of semantic lexicons. Assuming that the classified candidate words are added to the lexicon in the descending order of their probabilities (of being either positive or negative), the induced lexicon will become noisier and noisier as it becomes bigger and bigger. Fig. 5 shows that imposing a higher cut-off probability threshold (for candidate words to enter the induced lexicon) would decrease the size of the induced lexicon but increase its quality (accuracy). On one hand, the induced lexicon needs to contain a sufficient number of sentiment words, especially when detecting sentiment from short texts, as a lexicon-based method cannot reasonably classify documents with none or too few sentiment words. On the other hand, the noise (misclassified sentiment words) in the induced lexicon would obviously have a detrimental impact on the accuracy of the document sentiment classifier built on top of it. Contrary to most previous work, like that of Qiu et al. (2011), which tries to expand the sentiment lexicon as much as possible and thus maintain a high recall, we put more emphasis on precision and keep a tight control of the lexicon size. For us, having a small sentiment lexicon is affordable, because our proposed approach to document sentiment classification is able to mitigate the low recall problem of lexicon-based methods by combining them with learning-based methods, which we shall talk about next.

Figure 5: How the accuracy and size of an induced lexicon are influenced by the cut-off probability threshold.

5 Domain-Specific Sentiment Classification of Documents

A domain-specific sentiment lexicon, automatically induced using the above technique, provides a solid basis for building domain-specific document sentiment classifiers. For the experiments here, we use a list of 7866 candidate words constructed by merging two well-known general-purpose sentiment lexicons that are both publicly available — the 'General Inquirer' (Stone et al., 1966) and the sentiment lexicon from Liu (2012). This set of candidate words is itself a combined, general-purpose sentiment lexicon, so we name it the GI+BL lexicon. Moreover, we set the cut-off probability threshold to a generally good value, 0.7, in our sentiment lexicon induction algorithm.
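The trade-off behind Fig. 5 and the choice of 0.7 can be illustrated with a short sweep over candidate cut-off values, reusing `clf`, `model`, and `candidates` from the earlier sketches; the ground-truth labels here are toy assumptions.

```python
true_label = {"room": 1, "is": 1, "the": 0, "and": 0}  # toy ground truth

probs = {w: clf.predict_proba(model.wv[w].reshape(1, -1))[0, 1]
         for w in candidates}

for cutoff in (0.5, 0.6, 0.7, 0.8, 0.9):
    kept = {w: p for w, p in probs.items() if max(p, 1 - p) >= cutoff}
    correct = sum(int(p >= 0.5) == true_label[w] for w, p in kept.items())
    acc = correct / len(kept) if kept else float("nan")
    print(f"cutoff={cutoff:.1f}  lexicon size={len(kept)}  accuracy={acc:.2f}")
```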
Comparing the IMDB vector space including all the candidate words (Fig. 4a) with that including only the high-probability candidate words (Fig. 4b), it is obvious that the positive and negative sentiment clusters become more clearly separated in the latter.

The induced sentiment lexicon on its own could be applied directly in a lexicon-based method for sentiment classification of documents, and a reasonably good performance could be achieved, as we will show later in Table 4. However, most of the time, lexicon-based sentiment classifiers are not as effective as learning-based sentiment classifiers. One reason is that the former tend to suffer from poor recall. For example, with a limited-size sentiment lexicon, lexicon-based methods would often fail to detect the sentiment present in short texts, e.g., from Twitter, due to the lexical gap.

Given the induced sentiment lexicon, we propose to use a lexicon-based sentiment classifier to classify unlabeled documents, and then use those classified documents containing at least three sentiment words as pseudo-labeled documents for the later training of a learning-based sentiment classifier (a minimal sketch is given below). The condition of "at least three sentiment words" is to ensure that only reliably classified documents are further utilised as training examples.

5.1 Sentiment Classification of Long Texts

First, we try the induced sentiment lexicons in the lexicon-based sentiment classifier pSenti (Mudinas et al., 2012) to see how good they are. Given a sentiment lexicon, pSenti is able to perform not only binary sentiment classification but also ordinal sentiment classification on a five-point scale. To measure the binary classification performance, we use both micro-averaged F1 (miF1) and macro-averaged F1 (maF1), which are commonly used in text categorization (Yang and Liu, 1999). To measure the five-point scale classification performance, we use both Cohen's κ coefficient (Manning et al., 2008) and the Root-Mean-Square Error (RMSE) (Bishop, 2006). As the baseline, we use the combined general-purpose sentiment lexicon, GI+BL, mentioned previously in Section 4. As we can see from the results shown in Table 3, using the induced sentiment lexicon for the target domain makes the lexicon-based sentiment classifier pSenti perform better than simply employing an existing general-purpose sentiment lexicon. Moreover, using a sentiment lexicon induced from the same domain leads to a much better performance than using a sentiment lexicon induced from a different domain.

Second, to evaluate the proposed two-phase bootstrapping method, we make empirical comparisons on the IMDB and Amazon datasets using a number of representative methods for document sentiment classification:

• pSenti — a concept-level lexicon-based sentiment classifier (Mudinas et al., 2012),
• ProbLex-DCM — a probabilistic lexicon-based classification using the Dirichlet Compound Multinomial (DCM) likelihood to reduce effective counts for repeated words (Eisenstein, 2017),
• SVMlin — Support Vector Machine with the linear kernel (Joachims, 1998),
• CNN — Convolutional Neural Network (Kim, 2014),
• LSTM — Long Short-Term Memory, a Recurrent Neural Network (RNN) that can remember values over arbitrary time intervals (Hochreiter and Schmidhuber, 1997; Dai and Le, 2015).
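Returning to the two-phase bootstrapping itself, the sketch below shows its core logic under simplifying assumptions: a plain lexicon vote stands in for pSenti (which additionally handles negation and modifiers), and `lexicon` and `docs` are toy stand-ins.

```python
lexicon = {"good": 1, "clean": 1, "tidy": 1, "bad": -1, "messy": -1, "poor": -1}
docs = [["good", "clean", "tidy", "room"],
        ["bad", "messy", "poor", "service"],
        ["room", "was", "fine"]]

def lexicon_classify(tokens, lexicon):
    """Return a 0/1 pseudo-label, or None if the document is unreliable."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if len(hits) < 3:            # fewer than three sentiment words: skip
        return None
    score = sum(hits)
    if score == 0:               # no clear sentiment signal: skip
        return None
    return 1 if score > 0 else 0

# Phase one: pseudo-label the unlabeled documents.
pseudo_labeled = [(tokens, label) for tokens in docs
                  if (label := lexicon_classify(tokens, lexicon)) is not None]
# Phase two: train a supervised classifier (e.g., an LSTM) on `pseudo_labeled`.
```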
To apply the deep learning algorithms CNN and LSTM that have a word embedding projection layer, we fix the review size to 500 words, truncating reviews longer than that and padding reviews shorter than that with null values. As pointed out by Greff et al. (2017), the hidden layer size is an important hyperparameter of LSTM: usually the larger the network, the better the performance but the longer the training time. In our experiments, we have used an LSTM network with 400 units on the hidden layer, which is the capacity that a PC with one Nvidia GTX 1080 Ti GPU can afford, and a dropout (Wager et al., 2013) rate of 0.5, which is the most common setting in the research literature (Srivastava et al., 2014; Hong and Fang, 2015; Cliche, 2017).

Lexicon | miF1 | maF1 | F1 (pos) | F1 (neg) | Cohen's κ | RMSE
general-purpose: GI+BL | 0.745 | 0.744 | 0.764 | 0.722 | 0.235 | 1.325
domain-specific: same domain (Kitchen) | 0.761 | 0.761 | 0.772 | 0.750 | 0.236 | 1.310
domain-specific: different domain (Electronics) | 0.749 | 0.749 | 0.750 | 0.749 | 0.215 | 1.373
domain-specific: different domain (Video) | 0.736 | 0.735 | 0.752 | 0.717 | 0.206 | 1.372

Table 3: Lexicon-based sentiment classification of Amazon Kitchen product reviews (miF1, maF1, and the per-class F1 scores are for binary classification; Cohen's κ and RMSE are for the five-point scale).

As shown in Table 4, the above described two-phase bootstrapping method has been demonstrated to be beneficial: the learning-based sentiment classifiers trained on pseudo-labeled data are superior to lexicon-based sentiment classifiers, including the state-of-the-art unsupervised sentiment classifier ProbLex-DCM (Eisenstein, 2017). Furthermore, the two-phase bootstrapping method is a general framework which can utilize any lexicon-based sentiment classifier to produce pseudo-labeled data. Therefore the more sophisticated ProbLex-DCM could also be used instead of pSenti in this framework, which is likely to deliver an even higher performance. Among the three learning-based sentiment classifiers, LSTM achieved the best performance on both datasets, which is consistent with the observations in other studies like Dai and Le (2015).

Method | IMDB AUC | IMDB F1 | Amazon AUC | Amazon F1
Unsupervised, lexicon-based: pSenti with existing general-purpose lexicon | 0.808 | 0.705 | 0.818 | 0.747
Unsupervised, lexicon-based: pSenti with induced domain-specific lexicon | 0.841 | 0.768 | 0.839 | 0.771
Unsupervised, lexicon-based: ProbLex-DCM (Eisenstein, 2017) | 0.884 | 0.806 | 0.836 | 0.756
Unsupervised, learning-based: SVMlin trained on pseudo-labeled data | 0.863 | 0.771 | 0.845 | 0.763
Unsupervised, learning-based: CNN trained on pseudo-labeled data | 0.879 | 0.781 | 0.849 | 0.773
Unsupervised, learning-based: LSTM trained on pseudo-labeled data | 0.890 | 0.810 | 0.850 | 0.776
Supervised, learning-based: LSTM trained on real labeled data (full size) | 0.971 | 0.912 | 0.878 | 0.802
Supervised, learning-based: LSTM trained on real labeled data (1/2 size) | 0.934 | 0.862 | 0.852 | 0.752
Supervised, learning-based: LSTM trained on real labeled data (1/4 size) | 0.892 | 0.821 | 0.841 | 0.744
Supervised, learning-based: LSTM trained on real labeled data (1/8 size) | 0.850 | 0.746 | 0.831 | 0.735

Table 4: Sentiment classification of long texts.

Comparing the LSTM-based sentiment classifiers trained on pseudo-labeled and real labeled data, we can also see that using a large number of pseudo-labeled examples could achieve a similar effect as using 25/4 ≈ 6k and 8/2 = 4k real labeled examples for IMDB and Amazon respectively. This suggests that the unsupervised approach is actually preferable to the supervised approach if there are only a few thousand (or fewer) labeled examples.
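For concreteness, the LSTM document classifier described at the start of this subsection can be sketched as follows, assuming the tf.keras 2.x API: a frozen embedding projection layer initialized from the domain word2vec model, inputs padded or truncated to 500 tokens, an LSTM with 400 hidden units, and a dropout rate of 0.5. The embedding matrix here is a zero-valued stand-in.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM

vocab_size, emb_dim, max_len = 10000, 500, 500
emb_matrix = np.zeros((vocab_size, emb_dim))  # would be filled from word2vec

net = Sequential([
    Embedding(vocab_size, emb_dim, weights=[emb_matrix],
              input_length=max_len, trainable=False),  # projection layer
    LSTM(400),                        # 400 hidden units, as in the paper
    Dropout(0.5),                     # dropout rate of 0.5, as in the paper
    Dense(1, activation="sigmoid"),   # positive-vs-negative output
])
net.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# net.fit(X_pseudo, y_pseudo, ...) where X_pseudo holds token-id sequences
# padded/truncated to 500 and y_pseudo holds the pseudo-labels.
```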
5.2 Sentiment Classification of Short Texts

To evaluate our proposed approach to sentiment classification of short texts, we have carried out experiments on the Twitter sentiment classification benchmark dataset from SemEval-2017 Task 4B (Rosenthal et al., 2017), which is to classify 6,185 tweets as either positive or negative. Other than the training set of 20,508 tweets, we also collected unlabeled tweets using the Twitter API. All the tweets were pre-processed by replacing emoticons with their corresponding text representations and encoding URLs by tokens. In addition to the Twitter-domain seed words listed in Table 1, we have also made use of common positive/negative emoticons, which are ubiquitous on Twitter, as additional seeds for the task of sentiment lexicon induction. Note that in all our experiments, we do not use the sentiment labels and the topic information provided in the training data.

Making use of the provided training data and our own unlabeled data collected from Twitter, we have constructed the domain-specific word embeddings, induced the sentiment lexicon, and bootstrapped the pseudo-labeled tweet data to train the binary tweet sentiment classifier. As the learning algorithm we have chosen LSTM with a hidden layer of 150 units, which is enough for tweets as they are quite short (with an average length of only 20 words).

The official performance measures for this short text sentiment classification task (Rosenthal et al., 2017) include Accuracy (Acc) and F1. Although our approach is nearly-unsupervised (without any reliance on labeled documents), its performance on this benchmark dataset is comparable to that of supervised methods: it would be placed roughly in the middle of all the participating systems in this competition (see Table 5).

System | Acc | F1
Unsupervised: Baseline (all positive) | 0.398 | 0.285
Unsupervised: Baseline (all negative) | 0.602 | 0.376
Unsupervised: Ours (LSTM) | 0.804 | 0.795
Supervised: Worst system | 0.412 | 0.372
Supervised: Median system | 0.802 | 0.801
Supervised: Best system | 0.897 | 0.890

Table 5: Sentiment classification of short texts into two categories — SemEval-2017 Task 4B.

5.3 Detecting Neutral Sentiment

Many real-world applications of sentiment classification (e.g., on social media) are not simply a binary classification task, but involve a neutral category as well. Although many lexicon-based sentiment classifiers including pSenti can detect neutral sentiment, extending the above learning-based sentiment classifier (trained on pseudo-labeled data) to recognize neutral sentiment is challenging. To investigate this issue, we have done experiments on the Twitter sentiment classification benchmark dataset from SemEval-2017 Task 4C (Rosenthal et al., 2017), which is to classify 12,379 tweets on an ordinal five-point scale (−2, −1, 0, +1, +2) where 0 represents the neutral class.

One common way to handle neutral sentiment is to treat the set of neutral documents as a separate class for the classification algorithm, which is the method advocated by Koppel and Schler (2006). With the pseudo-labeled training examples of three classes (−1: negative, 0: neutral, and +1: positive), we tried both standard multi-class classification (Hsu and Lin, 2002) and ordinal classification (Frank and Hall, 2001). However, neither of them could deliver a reasonable performance. After carefully inspecting the classification results, we realised that it is very difficult to have a set of representative training examples with good coverage for the neutral class. This is because the neutral class is not homogeneous: a document could be neutral because it is equally positive and negative, or because it does not contain any sentiment.
In practice, the latter case is seen more often than the former, and it implies that the neutral class is more often defined by the absence of sentiment word features rather than their presence, which is problematic for most supervised learning algorithms.

What we discovered is that the simple method of identifying neutral documents from the binary sentiment classifier's decision boundary works surprisingly well, as long as the right thresholds are found. Specifically, we take the probabilistic outputs of a binary sentiment classifier trained as before, and then put all the documents whose probability of being positive lies not close to 0, not close to 1, but in the middle range into the neutral class.

It turns out that probability calibration (Niculescu-Mizil and Caruana, 2005) is crucially important for this simple method to work. Some supervised learning algorithms for classification give poor estimates of the class probabilities, and some do not support probability prediction at all. For instance, maximum-margin learning algorithms such as SVM focus on hard samples that are close to the decision boundary (the support vectors), which makes their probability prediction biased. The technique of probability calibration allows us to better calibrate the probabilities of a given classifier, or to add support for probability prediction. If a classifier is well calibrated, its probabilistic output can be directly interpreted as a confidence level on the prediction. For example, among the documents to which such a calibrated binary classifier gives a probabilistic output close to 0.8, approximately 80% of the documents would actually belong to the positive class. Using the sigmoid model of Platt (2000) with cross-validation on the pseudo-labeled training data, we carry out probability calibration for our LSTM-based binary sentiment classifier. Fig. 6 shows that the calibrated probability prediction aligns with the true confidence of prediction much better than the raw probability prediction. In this case, the Brier loss (Brier, 1950), which measures the mean squared difference between the predicted probability and the actual outcome, could be reduced from 0.182 to 0.153 by probability calibration.

If we rank the estimated probabilities of being positive from low to high, the curve of probabilities would be in an "S"-shape with a distinct middle range where the slope is steeper than at the two ends, as shown in Fig. 7. The documents with their probabilities of being positive in such a middle range should be neutral. Therefore the two elbow points in the probability curve would make appropriate thresholds for the identification of neutral sentiment, and they could be found automatically by a simple algorithm using the central difference to approximate the second derivative.
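A minimal sketch of this thresholding step is given below: sort the calibrated probabilities of being positive, smooth the resulting "S"-shaped curve a little, approximate its second derivative by central differences, and take the strongest convex and concave bends as the thresholds. The toy probability sample and the smoothing window are assumptions.

```python
import numpy as np

# A toy stand-in for the calibrated P(positive) of each test document;
# Beta(0.5, 0.5) samples give an "S"-shaped sorted curve.
rng = np.random.default_rng(0)
p = np.sort(rng.beta(0.5, 0.5, size=2000))

k = 51                                             # smoothing window (assumed)
ps = np.convolve(p, np.ones(k) / k, mode="valid")  # smooth the sorted curve

d2 = ps[2:] - 2 * ps[1:-1] + ps[:-2]  # central-difference second derivative
pL = ps[1 + int(np.argmax(d2))]       # strongest convex bend (flat to steep)
pU = ps[1 + int(np.argmin(d2))]       # strongest concave bend (steep to flat)

# Assign -1 / 0 / +1 using the identified thresholds, as described below.
labels = np.where(p < pL, -1, np.where(p > pU, +1, 0))
```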
Let pL and pU denote the identi- fied thresholds (pL < pU ), then we assign class label “−1” to all those documents with the probability be- low pL , “+1” to all those documents with the proba- 280 System MAEµ MAEM miF1 maF1 Unsupervised Baseline all -2 1.895 2.000 0.006 0.014 Baseline all -1 0.923 1.400 0.089 0.286 Baseline all 0 0.525 1.200 0.133 0.500 Baseline all +1 1.127 1.400 0.063 0.188 Baseline all +2 2.105 2.000 0.004 0.011 Lexicon-based 0.939 1.135 0.253 0.189 OursLSTM 0.536 0.815 0.537 0.326 Supervised Worst system 0.985 1.325 0.250 0.121 Median system 0.509 0.823 0.545 0.299 Best system 0.554 0.481 0.504 0.405 Table 6: Sentiment classification of short texts on a five-point scale — SemEval-2017 Task 4C. 0.0 0.2 0.4 0.6 0.8 1.0 Mean predicted value 0.0 0.2 0.4 0.6 0.8 1.0 Fr ac tio n of p os iti ve s raw (0.182) calibrated (0.153) perfectly calibrated (0) Figure 6: The probability calibration plot of our LSTM-based sentiment classifier on the SemEval- 2017 Task 4C dataset. bility above pU , and “0” to all those documents with the probability within [pL,pU]. The official performance measures for this sen- timent classification task (Rosenthal et al., 2017) are MAEµ and MAEM which stand for micro- averaged and macro-averaged Mean Absolute Er- ror (MAE), respectively. We would also like to report the micro-averaged and macro-averaged F1 scores which are denoted as miF1 and maF1 respec- tively. As shown in Fig. 7, the thresholds identi- fied from the raw probability curve are roughly at 55 percentile and 75 percentile, which would yield MAEµ = 0.632 and MAEM = 0.832; the thresh- olds identified from the calibrated probability curve are roughly at 40 percentile and 80 percentile, which would yield much better scores MAEµ = 0.536 and MAEM = 0.815. So with the help of probabil- ity calibration, our proposed approach would be able to comfortably beat all the baselines including the lexicon-based method pSenti (Mudinas et al., 2012) and compete with the average (median) participat- ing systems (see Table 6). Please note that this is not a fair comparison: our approach is at a great disadvantage because (i) it is nearly-unsupervised, without any reliance on labeled documents while all the other systems are supervised; and (ii) it performs only ternary classification while all the other sys- tems make classification on the full five-point scale. 6 Conclusions How far can we go in sentiment classification for a new domain, given only unlabeled data? This pa- per presents our exploration towards answering the above research question. Specifically, the main con- tributions of this paper are as follows. • We have formulated the cluster hypothesis for sentiment analysis (i.e., words with different sen- timent polarities form distinct clusters) and veri- fied that in general it holds for word embeddings within a specific domain but not across domains. • We have demonstrated that a quality domain- specific sentiment lexicon can be induced from the word embeddings of that domain together with just a few seed words. Surprisingly, simple lin- ear model based supervised learning algorithms (such as linear SVM) are good enough for this purpose; there is no benefit of utilizing non-linear models or semi-supervised/transductive learning algorithms due to the noise at the borders of sen- timent word clusters. 
Using such linear models, our system clearly outperforms the state-of-the-art sentiment lexicon induction method, SentProp (Hamilton et al., 2016).

[Figure 7: The probability curve with a region of intermediate probabilities representing the neutral class; probability vs. percentile for (a) raw and (b) calibrated predictions.]

• We have shown that a lexicon-based sentiment classifier can be enhanced by using its outputs as pseudo-labels and employing supervised learning algorithms such as LSTM to train a learning-based sentiment classifier on the pseudo-labeled documents. Our end-to-end pipelined approach, which is overall unsupervised (except for a very small set of seed words), works better than the state-of-the-art unsupervised technique for document sentiment classification, ProbLex-DCM (Eisenstein, 2017), and its performance is at least on par with an average fully supervised sentiment classifier trained on real labeled data (Rosenthal et al., 2017).

• We have revealed the crucial importance of probability calibration to the detection of neutral sentiment, which was overlooked in previous studies (Koppel and Schler, 2006). With the right thresholds found, neutral documents can simply be identified at the binary sentiment classifier's decision boundary.

One promising way to further enhance the LSTM-based sentiment classifier in the proposed approach with the induced sentiment lexicon would be to concatenate the word embeddings with an indicator feature that tells whether the current word is positive, neutral, or negative (Ebert et al., 2015). We leave this for future work.

Acknowledgements

The Titan X Pascal GPU used for this research was kindly donated by the NVIDIA Corporation. We thank the reviewers for their constructive and helpful comments. We also gratefully acknowledge the support of Geek.AI for this work.

References

Shlomo Argamon, Casey Whitelaw, Paul J. Chase, Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan. 2007. Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology (JASIST), 58(6):802–822.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer-Verlag.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 440–447, Prague, Czech Republic.

Danushka Bollegala, David J. Weir, and John A. Carroll. 2013. Cross-domain sentiment classification using a sentiment sensitive thesaurus. IEEE Transactions on Knowledge and Data Engineering (TKDE), 25(8):1719–1731.

Johan Bollen, Huina Mao, and Xiao-Jun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.

Glenn W. Brier. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.

Mathieu Cliche. 2017. BB twtr at SemEval-2017 Task 4: Twitter sentiment analysis with CNNs and LSTMs. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval@ACL), pages 573–580, Vancouver, Canada.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning.
In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS), pages 3079–3087, Montreal, Canada.

Xiaowen Ding, Bing Liu, and Philip S. Yu. 2008. A holistic lexicon-based approach to opinion mining. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), pages 231–240, Palo Alto, CA, USA.

Sebastian Ebert, Ngoc Thang Vu, and Hinrich Schütze. 2015. A linguistically informed convolutional neural network. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA@EMNLP), pages 109–114, Lisbon, Portugal.

Jacob Eisenstein. 2017. Unsupervised learning for lexicon-based classification. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pages 3188–3194, San Francisco, CA, USA.

Ethan Fast, Binbin Chen, and Michael S. Bernstein. 2016. Empath: Understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI), pages 4647–4657, San Jose, CA, USA.

Eibe Frank and Mark A. Hall. 2001. A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning (ECML), pages 145–156, Freiburg, Germany.

Fabian Gieseke, Antti Airola, Tapio Pahikkala, and Oliver Kramer. 2012. Sparse quasi-Newton optimization for semi-supervised support vector machines. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods (ICPRAM), pages 45–54, Vilamoura, Algarve, Portugal.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 513–520, Bellevue, WA, USA.

Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 28(10):2222–2232.

William L. Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky. 2016. Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 595–605, Austin, TX, USA.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

James Hong and Michael Fang. 2015. Sentiment analysis with deeply learned distributed representations of variable length texts. Technical report, Stanford University.

Chih-Wei Hsu and Chih-Jen Lin. 2002. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks (TNN), 13(2):415–425.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 168–177, Seattle, WA, USA.

Yohan Jo and Alice H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the 4th International Conference on Web Search and Web Data Mining (WSDM), pages 815–824, Hong Kong, China.
Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML), pages 137–142, Chemnitz, Germany.

Thorsten Joachims. 1999. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (ICML), pages 200–209, Bled, Slovenia.

Thorsten Joachims. 2003. Transductive learning via spectral graph partitioning. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 290–297, Washington, DC, USA.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar.

Moshe Koppel and Jonathan Schler. 2006. The importance of neutral examples for learning sentiment. Computational Intelligence, 22(2):100–109.

Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Heeyoung Lee, Mihai Surdeanu, Bill MacCartney, and Dan Jurafsky. 2014. On the importance of text analysis for stock price prediction. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pages 1170–1175, Reykjavik, Iceland.

Chenghua Lin and Yulan He. 2009. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), pages 375–384, Hong Kong, China.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.

Bing Liu. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.

Marco Loog. 2016. Contrastive pessimistic likelihood estimation for semi-supervised classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(3):462–475.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 142–150, Portland, OR, USA.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Julian J. McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys), pages 165–172, Hong Kong, China.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: Annual Conference on Neural Information Processing Systems (NIPS), pages 3111–3119, Lake Tahoe, NV, USA.

George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (CAAGET), pages 26–34, Los Angeles, CA, USA.

Andrius Mudinas, Dell Zhang, and Mark Levene. 2012.
Combining lexicon and learning based approaches for concept-level sentiment analysis. In Proceedings of the 1st International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM@KDD), pages 5:1–5:8, Beijing, China.

Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 625–632, Bonn, Germany.

Sara Owsley, Sanjay Sood, and Kristian J. Hammond. 2006. Domain specific affective classification of documents. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 181–183, Stanford, CA, USA.

Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 751–760, Raleigh, NC, USA.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 271–278, Barcelona, Spain.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, Stroudsburg, PA, USA.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

John Platt. 2000. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers. MIT Press.

Robert Plutchik. 1984. Emotions: A general psychoevolutionary theory. In Approaches to Emotion, pages 197–219. Psychology Press.

Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 37(1):9–27.

Alec Radford, Rafal Józefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. CoRR, abs/1704.01444.

Filipe N. Ribeiro, Matheus Araújo, Pollyanna Gonçalves, Marcos André Gonçalves, and Fabrício Benevenuto. 2016. SentiBench: A benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Science, 5(1):1–29.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), pages 451–463, Denver, CO, USA.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval@ACL), pages 502–518, Vancouver, Canada.

Sascha Rothe, Sebastian Ebert, and Hinrich Schütze. 2016. Ultradense word embeddings by orthogonal transformation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL), pages 767–777, San Diego, CA, USA.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958.

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

Songbo Tan, Xueqi Cheng, Yuefen Wang, and Hongbo Xu. 2009. Adapting naïve Bayes to domain adaptation for sentiment analysis. In Proceedings of the 31st European Conference on IR Research (ECIR), pages 337–349, Toulouse, France.

Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology (JASIST), 61(12):2544–2558.

Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. 2012. Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology (JASIST), 63(1):163–173.

Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 417–424, Philadelphia, PA, USA.

Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(Nov):2579–2605.

Stefan Wager, Sida I. Wang, and Percy Liang. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26: Annual Conference on Neural Information Processing Systems (NIPS), pages 351–359, Lake Tahoe, NV, USA.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Rui Xia, Chengqing Zong, Xuelei Hu, and Erik Cambria. 2013. Feature ensemble plus sample selection: Domain adaptation for sentiment classification. IEEE Intelligent Systems, 28(3):10–18.

Yi Yang and Jacob Eisenstein. 2015. Putting things in context: Community-specific embedding projections for sentiment analysis. CoRR, abs/1511.06052.

Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 42–49, Berkeley, CA, USA.

Yasuhisa Yoshida, Tsutomu Hirao, Tomoharu Iwata, Masaaki Nagata, and Yuji Matsumoto. 2011. Transfer learning for multiple-domain sentiment analysis: Identifying domain dependent/independent word polarity. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI), San Francisco, CA, USA.

Lei Zhang, Riddhiman Ghosh, Mohamed Dekhil, Meichun Hsu, and Bing Liu. 2011. Combining lexicon-based and learning-based methods for Twitter sentiment analysis. Technical Report HPL-2011-89, HP Laboratories.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 912–919, Washington, DC, USA.