Discovering Filter Keywords for Company Name Disambiguation in Twitter

Damiano Spina*, Julio Gonzalo, Enrique Amigó
UNED NLP & IR Group, Juan del Rosal 16, 28040 Madrid, Spain
http://nlp.uned.es

Abstract

A major problem in monitoring the online reputation of companies, brands, and other entities is that entity names are often ambiguous (apple may refer to the company, the fruit, the singer, etc.). The problem is particularly hard in microblogging services such as Twitter, where texts are very short and there is little context to disambiguate. In this paper we address the filtering task of determining, out of a set of tweets that contain a company name, which ones do refer to the company. Our approach relies on the identification of filter keywords: those whose presence in a tweet reliably confirms (positive keywords) or discards (negative keywords) that the tweet refers to the company. We describe an algorithm to extract filter keywords that does not use any previously annotated data about the target company. The algorithm classifies 58% of the tweets with 75% accuracy; these tweets can then be used to feed a machine learning algorithm that produces a complete classification of all tweets with an overall accuracy of 73%. In comparison, a 10-fold validation of the same machine learning algorithm provides an accuracy of 85%, i.e., our unsupervised algorithm has a 14% loss with respect to its supervised counterpart. Our study also shows that (i) filter keywords for Twitter do not derive directly from the public information about the company on the Web: a manual selection of keywords from relevant web sources only covers 15% of the tweets with 86% accuracy; and (ii) filter keywords can indeed be a productive way of classifying tweets: the five best possible keywords cover, on average, 28% of the tweets for a company in our test collection.

Keywords: Twitter, online reputation management, name disambiguation, filtering

This research was partially supported by the Spanish Ministry of Education (FPU grant nr. AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010-21128-C02), the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), and the European Community's FP7 Programme under grant agreement nr. 288024 (LiMoSINe).

* Corresponding author. Email addresses: damiano@lsi.uned.es (Damiano Spina), julio@lsi.uned.es (Julio Gonzalo), enrique@lsi.uned.es (Enrique Amigó)

1. Introduction

The vast use of social media to share facts and opinions about entities such as companies, brands and public figures has generated the opportunity – and the necessity – of managing the online reputation of those entities. Online Reputation Management consists of monitoring – and handling – the opinion of Internet users (also referred to as electronic word of mouth, eWOM) on people, companies and products, and is already a fundamental tool in corporate communication [76, 20, 64, 32]. Over the last years, a wide variety of tools have been developed to facilitate the monitoring task of online reputation managers1. The process followed by these tools typically consists of three tasks:

1. Retrieval of potential mentions. The user gives as input a set of keywords (e.g. the company name, products, CEO's name, etc.), and the service uses these keywords to retrieve documents and content generated by users from different sources: broadcast news sources, social media, blogs, site reviews, etc.

2. Analysis of results.
Retrieved documents are automatically processed in order to provide relevant information to the user: sentiment, authority and influence, background topics, etc.

3. Results visualization. Analyzed data is presented to the user in different ways: ranking documents, drawing graphics, generating tag clouds, etc.

A major problem concerning the first task (retrieval of potential mentions) is that brand names are often ambiguous. For instance, the query "Ford" retrieves information about the motor company, but might also retrieve results about Ford Models (the modeling agency), Tom Ford (the film director), etc. One might think that the query is too general, and that the user should provide a more specific query, such as "ford motor" or "ford cars". In fact, some tools explicitly suggest that the user refine possibly ambiguous queries2. This approach has two main disadvantages: (i) users have to make an additional effort when defining unambiguous queries, and (ii) query refinement harms recall over the mentions of the brand on the Web, which can be particularly misleading in an online reputation management scenario.

Note that filtering out the mentions that do not refer to the monitored entity is also crucial when estimating its visibility. Quantifying the number of mentions on the Web about an entity, and how this number changes over time, is essential to track marketing or Public Relations campaigns. When the entity name is ambiguous, indicators given by tools such as Google Trends3 or Topsy4 can be misleading.

We think that a component capable of filtering out mentions that do not refer to the entity being monitored (specified by the user as a keyword plus a representative URL) would be a substantial enhancement of current online reputation management tools, and would also facilitate the analysis of the online presence/visibility of a brand.

1 At the time of writing this paper, some popular reputation management tools are trackur (http://www.trackur.com/), BrandsEye (http://www.brandseye.com/), Alterian SM2 (http://socialmedia.alterian.com/) or SocialMention (http://socialmention.com), among others.
2 Alterian SM2 service: http://www.sdlsm2.com/social-listening-for-business/industry/
3 http://www.google.com/trends
4 http://www.topsy.com

Among the most important information sources monitored by reputation managers are microblogging services [17, 36]. As of today, Twitter5 is the most popular microblogging service and provides a communication environment that deserves serious attention as a form of eWOM. Users send short posts (tweets) where they can describe things of interest and express attitudes that they are willing to share with others.

The work presented in this paper deals with the challenge addressed by the Online Reputation Management Task of WePS-3 [3]. Given a set of Twitter entries containing an (ambiguous) company name, and given the home page of the company, the task is to filter out irrelevant information, providing a binary classification of tweets as related or unrelated to the company. Notice that ambiguity resolution is particularly challenging in Twitter: tweets are minimal (140 characters at most) and little context is available for resolving name ambiguity.
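Stated as an input/output contract, the filtering task receives the ambiguous company name, its representative home page URL and the retrieved tweets, and must label every tweet as related or unrelated. The sketch below only illustrates this contract with a trivial "all related" baseline body; the class and function names are ours and are not part of the WePS-3 specification.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Tweet:
    tweet_id: str
    author: str
    text: str

def filter_company_tweets(company_name: str,
                          homepage_url: str,
                          tweets: List[Tweet]) -> Dict[str, bool]:
    """Return, for each tweet id, True if the tweet refers to the company.

    Placeholder body: the trivial 'all related' baseline. A real system would
    exploit filter keywords derived from the name, the homepage and the tweet
    stream itself, as studied in the rest of this paper.
    """
    return {tweet.tweet_id: True for tweet in tweets}
```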
The main objective of this paper is to validate an intuitive observation derived from the set-up and the analysis of the results of the WePS-3 Online Reputation Management Task [3]. The observation is that manual annotation can be simplified by picking special keywords – henceforth called filter keywords – that reliably signal positive or negative information. For instance, "ipod" is a positive filter keyword for the Apple company, because its presence is a highly reliable indicator that the tweet is about the company. Conversely, "crumble" is a negative filter keyword for Apple, because it correlates with unrelated tweets. The intuition is that automatic detection of such filter keywords can be a valuable signal to design an automatic solution to the problem. Our goal is to provide quantitative evidence supporting (or rejecting) our intuition, and to answer some related questions:

• Where should we look for filter keywords in order to find them automatically?
• How much recall can we expect from such keywords?
• Can we use the notion of filter keywords to build a binary classifier that solves the task?

In order to answer these questions we will use the WePS-3 Task 2 test collection6, which is, to our knowledge, the first dataset built explicitly to address this problem.

The paper is organized as follows: in Section 2 we start by introducing some background concepts and discussing the state of the art: we review some related work on Twitter, and we summarize previous work on Named Entity Disambiguation and Automatic Keyword Extraction (which is one of the techniques we want to apply to the problem). Then we focus on the goals and results of the WePS-3 Online Reputation Management Task, whose dataset is the test collection used in our experiments. In Section 3 we detail and validate our filter keywords hypothesis by computing its upper bounds on our test collection. In Section 4, we study how to automatically discover filter keywords. Then, in Section 5 we discuss how to perform the classification task by using automatically extracted filter keywords. Finally, we discuss our results in Section 6 and conclude in Section 7.

5 http://twitter.com
6 http://nlp.uned.es/weps/weps-3

2. Related Work

Filtering out mentions that do not refer to a given company name can be seen as a named entity disambiguation problem. Named Entity Disambiguation (NED) has been receiving a lot of attention from the Natural Language Processing (NLP) research community in the last decade [30, 11, 19, 13, 39, 6]. Most recent work tackles this problem in contexts where disambiguation is more challenging, as in the case of Twitter: the textual context is reduced (limited to 140 characters) and the language resembles chat-speak or SMS style, with widespread use of phonetic spelling and acronyms for common phrases [42, 29].

In this section we provide some background about Twitter and its related datasets. Then we introduce some previous work on NED, focusing on scenarios where NED is crucial, such as Entity Linking, Document Enrichment by Wikipedia Links and Web People Search. Related work about automatic keyword extraction – one of the techniques we want to apply to our problem – is also presented, and then we describe the definition and the results of the WePS-3 Online Reputation Management Task, which provides the test collection used in our experiments.

2.1. Twitter

Twitter7 is a relatively new social networking site [44] which stands out as the prototypical microblogging service.
Its particularity is that posts do not exceed 140 characters and there are no privacy restrictions. Therefore, Twitter reflects opinions in real time and is very sensitive to burstiness phenomena [61, 12, 46]. Tweets are particularly challenging for Natural Language Processing tasks, given that (i) tweets are short (at most 140 characters), and (ii) users post text using a nonstandard language with similarities to SMS style [42, 47, 29].

7 http://www.twitter.com

2.1.1. Twitter-related Tasks

Up to now, most of the work on Twitter has focused on analyzing the microblogging phenomenon [37, 46, 12], modeling the propagation of information in the social network [78, 44, 79] and analyzing the content of tweets [36, 70, 66, 35, 15]. Some previous work addresses the problem of tweet classification for different purposes, such as sentiment analysis [36, 63] and information filtering [70].

Sriram et al. [70] classify tweets into five categories (news, events, opinions, deals, and private messages) in order to improve information filtering. They compare different tweet representations, including bag-of-words and features like the author name, the presence of mentions of other users in the tweet and features regarding the writing style (e.g. opinionated words, currency and percentage signs). Then, they train a Naive Bayes classification model to evaluate the different representations, obtaining high accuracy scores when combining bag-of-words with the other features.

Jansen et al. [36] use a multinomial Bayes model to determine the sentiment of a set of tweets. The model is built using a lexicon of around 200,000 unigrams and bigrams of words and phrases that have a probability distribution to determine the sentiment of a topic. The system calculates the probability of each tweet by checking the occurrence of polarized words and then picks the class with the greatest probability in a winner-takes-all scenario. In this work we use a similar approach, in that we test, among others, a winner-takes-all strategy in order to classify all the tweets for a company name either as related or unrelated.

Zhao et al. [90] propose an automatic keyphrase extraction model to summarize topics in Twitter. Firstly, they identify topics using Twitter-LDA [89], which assumes a single topic assignment for each tweet. Secondly, they use a variation of Topical PageRank [48] where edges between two words are weighted taking into account co-occurrences in tweets assigned to the same topic. Then, they generate phrase candidates by looking for combinations of the top topic keywords that co-occur as frequent phrases in the text collection. Finally, topical keyphrases are ranked by using a probabilistic model that takes into account (i) the specificity of a phrase given a topic and (ii) the retweet ratio of tweets containing a keyphrase.

To the best of our knowledge, the Online Reputation Management Task of WePS-3, held at CLEF 2010, was the first campaign of NLP-centered tasks over Twitter. TREC 2011 and TREC 2012 held a track about Twitter, the TREC Microblog Track8. There, the problem addressed is a real-time search task (Realtime Adhoc Task): given a query at a specific time, systems should return a list of both recent and relevant tweets [67].

2.1.2. Twitter Datasets

There are several Twitter datasets that are suitable for a variety of research purposes.
A corpus of 900,000 tweets has been provided by the Content Analysis in Web 2.0 Workshop (CAW 2.0)9 to tackle the problem of text normalization on user-generated content. Yang & Leskovec [80] use a dataset of 580 million tweets to identify temporal patterns in the content published in tweets, and Kwak et al. [46] use a representative sample of 467 million tweets from 20 million users, covering a 7-month period from June 1, 2009 to December 31, 2009, to study information diffusion and the topological characteristics of Twitter10. Cha et al. [14] built a dataset comprising 54,981,152 users, connected to each other by 1,963,263,821 social links and including a total of 1,755,925,520 tweets, to analyze users' influence and how to measure it on the Twittersphere.

Finally, the Tweets2011 corpus is the dataset used in the TREC Microblog Track. The corpus is a representative sample of the Twittersphere that also includes spam tweets. It consists of approximately 16 million tweets over a period of two weeks (24th January 2011 until 8th February, inclusive), which covers both the time period of the Egyptian revolution and the US Superbowl, among other events.

To the best of our knowledge, the WePS-3 Task 2 test collection (described in Section 2.4) is the first dataset specifically built to address the problem of disambiguation of organization names in tweets. Recently, the RepLab evaluation campaign [4], held in CLEF 2012, has addressed the same problem but in a multilingual scenario: the RepLab dataset contains tweets written in English and Spanish.

8 https://sites.google.com/site/microblogtrack
9 http://caw2.barcelonamedia.org/node/41
10 These datasets are no longer available as per request from Twitter.

2.2. Named Entity Disambiguation

In Natural Language Processing, Named Entity Disambiguation (NED) is a step further from Named Entity Recognition. The latter consists of identifying and classifying phrases in a text that refer to people, places or organizations, as well as temporal expressions or numeric quantities [30, 62]. The former involves the association of mentions of an entity in one or more texts (also called references or surface forms) with the concrete object that they are actually referencing [11]. For instance, in the sentence "The Big Apple's new Apple retail store was officially opened today." the two occurrences of the word Apple refer to two different entities. The former refers to New York City (nicknamed "The Big Apple"11), while the latter refers to the consumer electronics and software company Apple Inc12.

11 http://en.wikipedia.org/wiki/Big_Apple
12 http://en.wikipedia.org/wiki/Apple_Inc

Word Sense Disambiguation (WSD) is another NLP task related to NED, which deals with the polysemy of common words and consists of assigning the appropriate sense to an occurrence of a word in a given context [1]. While in WSD the senses of a word are often looked up in a dictionary, thesaurus or lexical resource such as WordNet [22], we will see that in NED the use of Wikipedia as a knowledge source is widespread.

In the following, we review three main problems – closely related to ours – in which entity disambiguation has been applied: Entity Linking, Document Enrichment by Linking to Wikipedia Articles, and Web People Search. Finally, we conclude this section by describing some previous work about named entity disambiguation on Twitter.
2.2.1. Entity Linking

Entity Linking consists of associating a mention in a text with the corresponding entity in a knowledge base. In the Knowledge Base Population (KBP) scenario [53, 39, 21, 38], this is a first step towards discovering facts about entities and augmenting a knowledge base with these facts and with newly discovered entities. Given an entity name and a background document where the name occurs, systems typically perform three steps to link the entity to the knowledge base: (i) query expansion (enriching the query by mining the Wikipedia structure or resolving co-reference in the background document); (ii) candidate generation (finding all possible entries in the knowledge base that the query might link to); and (iii) candidate ranking (ranking candidate entities by computing the similarity between the represented query and the entities, and fixing a threshold to decide when the entity does not exist in the knowledge base).

2.2.2. Document Enrichment by Linking to Wikipedia Articles

There are a variety of systems that automatically annotate a document by linking names appearing in the document with Wikipedia articles [13, 57, 71, 60, 45]. In general, named entity disambiguation in these systems is carried out in three steps:

– Mention or surface form representation. In this step, the context of the mention to disambiguate is defined. The most common representations used are the vector space model [13, 57, 19], the set of named entities that occur in the text [27], and the resolved Wikipedia links of unambiguous entities next to the mention [60, 45, 24, 54].

– Candidate entities retrieval and representation. The system retrieves from the knowledge base (i.e. Wikipedia pages) all possible entities that could be referenced by the mention and represents each entity as a bag-of-words from the page [13, 57, 19], extracting features from the page structure (such as the categories to which the page belongs) [13, 19, 27] or syntactic features [57], or exploiting the hyperlink structure of Wikipedia by retrieving all pages that link to the candidate entity page [60, 45, 24, 54].

– Best candidate selection. In the final step, the best candidate entity is selected by computing a distance or similarity function between the surface form and each of the candidate entities. The most common functions are cosine [13] and other vector similarity functions [57, 19], random walk graph models that compute semantic relatedness [27], and relatedness or coherence functions that involve all the entity links made in the text [60, 45, 24].

Unlike document enrichment, our task focuses on a single organization name and does not require linking every mention of entities in the collection, but just deciding whether each mention refers or not to the entity.

2.2.3. Web People Search

Web People Search is defined as the task of clustering a set of web pages, which are the result of a Web search for a person name, into as many groups as there are entities sharing that name [6]. This was the original goal of the Web People Search campaign (WePS) [9], complemented in subsequent editions by the tasks of person attribute extraction [10, 7] and company name disambiguation in Twitter [3]. In the web people search scenario, Hierarchical Agglomerative Clustering seems to be the most competitive (and most frequently used) clustering technique [28, 9, 10, 7].
Documents are typically represented as bag-of-words [11, 40, 9]. Other works use smaller portions of the document, such as the sentences where the ambiguous name occurs or pre-defined windows of words [28, 49, 10]. Named entities are also a frequently used feature for people name disambiguation [40, 9, 10], and biographical features (e.g. title, organization, email address, phone number, etc.) are used to a lesser extent [50, 2, 75]. The most common similarity measure in this scenario is cosine [11, 9, 10], while some works also use Kullback-Leibler divergence [28, 52].

2.2.4. Named Entity Disambiguation in Twitter

Most Named Entity Disambiguation techniques have been applied to disambiguate entities in reasonably long texts such as news articles or blog posts. However, little work has been done on NED over microblog posts [24, 54, 55]. In this scenario, disambiguation is harder due to the fact that texts are short (limited to 140 characters) and hence the context of a mention is minimal.

Ferragina & Scaiella [24] propose a system capable of annotating short texts (including tweets) by linking entities to Wikipedia pages. The system relies on the hyperlink structure of Wikipedia, exploiting the links between pages and the anchor texts of the links. When the system receives a text, it detects the anchors in the text and retrieves the possible senses for each anchor. Ambiguous senses are disambiguated by a collective agreement function among all senses associated with the anchors detected in the text [45], taking advantage of the unambiguous anchors to boost the selection of those senses for the ambiguous anchors [60]. They do not evaluate the accuracy of their disambiguation component over tweets. However, they report that almost 95% of the 5,000 analyzed tweets have at least 3 phrases with an entry in Wikipedia (which is not necessarily an entity). This result shows that Wikipedia has a high coverage as a catalog of senses for tweet disambiguation.

Differently from [24], Meij et al. [54] use supervised machine learning techniques to refine a list of candidate Wikipedia concepts that are potentially relevant to a given tweet. The candidate ranking list is generated by matching n-grams in the tweet with anchor texts in Wikipedia articles, taking into account the inter-Wikipedia link structure to compute the most probable Wikipedia concept for each n-gram.

Michelson & Macskassy [55] focus on discovering the topics of interest of a particular Twitter user. Given the stream of tweets corresponding to the user, they first find the Wikipedia pages of the entities mentioned in the tweets and then build a topic profile from the high-level categories that cover these entities. Entity disambiguation is performed by calculating the overlap between the terms in the tweet and the terms on the page of each candidate entity. In this scenario, the accuracy of the disambiguation process is not critical, since the system takes into account the categories that occur frequently across all the entities found in order to produce a topic profile of a stream.

2.3. Automatic Keyphrase Extraction

We include here an overview of Automatic Keyphrase Extraction techniques, as this is a technique that plays a crucial role in our approach to company name disambiguation on Twitter. Automatic Keyphrase Extraction is the task of identifying a set of relevant terms or phrases that summarize and characterize one or more given documents [77].
Most of the literature about automatic keyphrase extraction is focused on (well-written) technical documents, such as scientific and medical articles, since the keywords given by the authors can be used as a gold standard [74, 77, 25, 59, 58, 43]. Some authors address automatic keyword extraction as a way of automatically summarizing web sites [86, 87, 88, 85]. In [85] different keyword extraction methods are compared, including tf*idf, supervised methods, and heuristics based on both statistical and linguistic features of candidate terms. While Zhang et al. [85] study automatic keyword extraction from website descriptions, in our work we explore a semi-supervised keyword extraction approach that extracts filter keywords over a stream of tweets.

Automatic keyword extraction is typically used to characterize the content of one or more documents, using features intrinsically associated with those documents. In order to detect both positive and negative filter keywords, however, we need to look into external resources in order to discriminate between related and unrelated keywords. Thus, automatic keyword extraction methods are not directly applicable to our filter keyword approach.

2.4. The WePS-3 Online Reputation Management Task

Disambiguation of company names in text streams (and in particular in microblog posts) is a necessary step in the monitoring of opinions about a company. However, it is not tackled explicitly in most research on the subject: most previous work assumes that query terms are not ambiguous in the retrieval process. The disambiguation task has been explicitly addressed in the WePS-3 evaluation campaign [3]. In this section we summarize the outcome of that campaign, analyzing the test collection and comparing system results.

2.4.1. Task Definition and Test Collection

The Online Reputation Management task of the WePS-3 evaluation campaign consists of filtering Twitter posts containing a given company name, depending on whether the post is actually related to the company or not. The task is defined to deal with the scenario where an online system accepts any company name as input, and has to learn to disambiguate entries about that company without supervision (i.e. without a set of previously disambiguated tweets). Therefore, the set of organization names in the training corpus is different from the set of companies in the test set.

For each organization in the dataset, systems are provided with the company name c (e.g. apple) used as query to retrieve the stream of tweets to annotate, Tc, and a representative URL (e.g. http://www.apple.com) that univocally identifies the target company. The input information per tweet consists of a tuple containing: the tweet identifier, the organization name, the query used to retrieve the tweet, the author identifier, the date and the tweet content. Systems must label each tweet as related (i.e. the tweet refers to the company) or unrelated (the tweet does not refer to the given company).

The WePS-3 ORM task dataset comprises 52 training and 47 test cases, each of them including a company name, its URL, and an average of 435 tweets manually annotated as related/unrelated to the company. The dataset has been annotated using the Mechanical Turk crowdsourcing marketplace13, with the following manual annotation options: related, unrelated and undecidable. Each HIT has been redundantly annotated by five Mechanical Turk workers. A total of 902 annotators have participated in the annotation of 43,730 tweets.
Finally, an agreement analysis was done in order to decide the final annotation for each tweet.

13 https://www.mturk.com

An interesting property of the dataset is that there is a great variability in the degree of ambiguity across the training and test cases. That is, there are companies with a low presence in the tweets (e.g. Delta Holdings, Zoo Entertainment), companies with medium ambiguity (e.g. Luxor Hotel and Casino, Edmunds.com) and companies with a high presence in the tweets (e.g. Yamaha, Lufthansa).

2.4.2. WePS-3 Results

A total of five research groups participated in the campaign. The best two systems were LSIR [81] and ITC-UT [83]. The LSIR system builds a set of profiles for each company, made of keywords extracted from external resources such as WordNet or the company homepage, as well as a set of manually defined keywords for the company and the most frequent unrelated senses of the company name. These profiles are used to extract tweet-specific features that are added to other generic features that give information about the quality of the profiles, in order to label the tweets as related or unrelated with an SVM classifier. The ITC-UT system is based on a two-step classification: first, it predicts the class of each query/company name according to the ratio of related tweets for each company name, and then it applies a different heuristic for each class, based essentially on the PoS tags and the named entity label of the company name.

The SINAI system [26] also uses a set of heuristic rules based on the occurrence of named entities both in the tweets and in external resources like Wikipedia, DBpedia and the company homepage. The UvA system [72] does not employ any resource related to the company, but uses features that involve the use of the language in the collection of tweets (URLs, hashtags, capital characters, punctuation, etc.). Finally, the KALMAR system [41] builds an initial model based on the terms extracted from the homepage to label a seed of tweets and then uses them in a bootstrapping process, computing the point-wise mutual information between each word and the target's label.

In the WePS exercise, accuracy (ratio of correctly classified tweets) was used to rank systems. The best overall system (LSIR) obtained 0.83, but it includes manually produced filter keywords. The best automatic system (ITC-UT) reaches an accuracy of 0.75 (0.5 being the accuracy of a random classification), and includes a query classification step in order to predict the ratio of positive/negative tweets.

2.4.3. Other Work Using WePS-3 Datasets

Recently, Yerva et al. [82] have explored the impact of extending the company profiles presented in [81] by using the related ratio and considering new tweets retrieved from the Twitter stream. By estimating the degree of ambiguity per entity from a subset of 50 tweets per entity, they reach 0.73 accuracy14. Using this related ratio and considering co-occurrences with a given company profile, the original profile is extended with new terms extracted from tweets retrieved by querying the company name in Twitter. The expanded profile outperforms the original, achieving 0.81 accuracy.

14 Note that, unlike the original formulation of the WePS-3 task, this is a supervised system, as it uses part of the test set for training. Hence, their results cannot be directly compared with the results in our work.

Zhang et al. [84] present a two-stage disambiguation system that combines supervised and semi-supervised methods. First, they train a generic classifier using the training set, similarly to [81].
Then, they use this classifier to annotate a seed of tweets in the test set, and apply the Label Propagation algorithm to annotate the remaining tweets in the test set. Using Naïve Bayes and Label Propagation they achieve an accuracy of 0.75, which matches the performance of the best automatic system in WePS-3.

The RepLab evaluation campaign held at CLEF 2012 addressed, as a subtask, the same filtering problem introduced in WePS-3. The main difference was that while the WePS-3 dataset contains only tweets written in English, the RepLab collection [4] contains tweets both in English and Spanish. Finally, the WePS-3 ORM task dataset has been extended with manual annotations for the task of entity profiling in microblog posts, which includes identifying entity aspects and opinion targets [69].

2.5. Wrap Up

There are two main findings on entity name disambiguation that motivate our research:

1. Use of filter keywords. Artiles et al. [8] studied the impact of query refinement in the Web People Search clustering task and concluded that "although in most occasions there is an expression that can be used as a near-perfect refinement to retrieve all and only those documents referring to an individual, the nature of these ideal refinements is unpredictable and very unlikely to be hypothesized by the user" [8]. This is both a positive indication – there are keywords able to isolate relevant information well – and a suggestion that finding optimal keywords automatically might be a challenging task. Further evidence in favor of using filter keywords is that the best results in the WePS-3 ORM Task were achieved by a system that used a set of manually produced – both positive and negative – filter keywords [81]. One of the goals of our work is to analyze query refinement in the scenario of company name disambiguation in Twitter. In other words, we explore the impact of defining a set of keywords to filter tweets that are related or unrelated to a given company.

2. Use of knowledge bases to represent candidate entities. Both entity linking and document enrichment systems use knowledge bases in order to characterize the possible entities that a mention may refer to. Most systems use Wikipedia as knowledge base [57, 13, 19, 27, 60, 45, 24, 39, 21]. In this direction, we believe that looking at Wikipedia pages related to the company to disambiguate could give useful information to characterize positive filter keywords. We also explore the use of other resources such as the Open Directory Project (ODP), the company website and the Web in general. Note, however, that not every company is listed in ODP or has an entry in Wikipedia.

3. Potential Benefits of Filter Keywords for Name Disambiguation in Twitter

A positive/negative filter keyword is an expression that, if present in a tweet, indicates a high probability that the tweet is related/unrelated to the company. In this section, we investigate the upper bound of the filter keyword strategy in two ways: first, by considering manual annotations in the WePS-3 collection to derive oracle (optimal) keywords; and second, by manually extracting keywords from representative web pages about the company. Finally, we study how to use these filter keywords to solve the WePS-3 ORM task.
3.1. Upper Bound Performance of Filter Keywords

The most useful filter keywords are those with a high coverage, i.e., those which appear in as many tweets as possible. As all tweets in the WePS-3 collection are manually annotated as related/unrelated to their respective company name, we can find exactly how many filter keywords there are (by definition, filter keywords are those terms that only appear in either the positive or the negative tweets), and how much recall they provide. Figure 1 shows the coverage of the first n filter keywords (for n = 1 . . . 20) in the test collection. Coverage at step n is the proportion of tweets covered after adding the keyword that filters the most tweets among those not yet covered by the first n−1 keywords. We will hereafter refer to this optimal keyword selection as oracle keywords.

Figure 1: Upper bound of the filter keywords strategy.

The graph shows that, on average, the best five oracle keywords cover around 30% of the tweets, and the best ten cover around 40% of the tweets. Note that only five discriminative terms directly cover, on average, 130 out of 435 tweets in each stream, and those could in turn be used to build a supervised classifier for the rest of the tweet stream. This indicates that filter keywords are, potentially, a relevant source of information to address the problem.

In principle, the natural place to find filter keywords is the Web: the company's web domain, references to this domain in Wikipedia, ODP, etc., and the Web at large. Using the company's URL and web search results for the company name, we performed a manual selection of positive and negative keywords for all the companies in the WePS-3 corpus. Note that the annotator inspected pages in the web search results, but did not have access to the tweets in the corpus. Tables 1 and 2 show some examples of positive and negative keywords, respectively.

Note that in the set of oracle keywords there are expressions that a human would hardly choose to describe a company (at least, without previously analyzing the tweet stream). For instance, the best positive oracle keywords for the Fox Entertainment Group do not include intuitive keywords such as tv or broadcast; instead, they include keywords closer to breaking news (leader, denouncing, etc.).

Company name | Oracle Positive Keywords | Manual Positive Keywords
amazon | sale, books, deal, deals, gift | electronics, apparel, books, computers, buy
apple | gizmodo, ipod, microsoft, itunes, auction | store, ipad, mac, iphone, computer
fox | money, weather, leader, denouncing, viewers | tv, broadcast, shows, episodes, fringe, bones
kiss | fair, rock, concert, allesegretti, stanley | tour, discography, lyrics, band, rockers, make up design

Table 1: Differences between oracle and manual positive keywords for some of the company names in the test collection.
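The oracle selection described above is essentially a greedy covering procedure over the annotated stream: at each step, the still-unused filter keyword covering the most uncovered tweets is picked. The following is a minimal illustrative sketch of that procedure, assuming each tweet is given as a (set of terms, related flag) pair; it is not the exact code used to produce Figure 1.

```python
from typing import List, Set, Tuple

def oracle_keywords(tweets: List[Tuple[Set[str], bool]], n: int = 20) -> List[str]:
    """Greedy selection of oracle filter keywords for one company stream.

    tweets: list of (terms, is_related) pairs.
    A term qualifies as a filter keyword if it appears only in related
    tweets or only in unrelated tweets.
    """
    # Candidate filter keywords: terms whose tweets all share the same label.
    label_sets = {}
    for terms, related in tweets:
        for t in terms:
            label_sets.setdefault(t, set()).add(related)
    candidates = {t for t, labels in label_sets.items() if len(labels) == 1}

    uncovered = set(range(len(tweets)))
    selected: List[str] = []
    for _ in range(n):
        # Pick the keyword that covers the most still-uncovered tweets.
        best, best_cov = None, set()
        for t in candidates:
            cov = {i for i in uncovered if t in tweets[i][0]}
            if len(cov) > len(best_cov):
                best, best_cov = t, cov
        if best is None:
            break
        selected.append(best)
        uncovered -= best_cov
        candidates.discard(best)
    return selected
```

Truncating the selection at different values of n corresponds to the "n oracle keywords" configurations used in the rest of this section.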
Looking at negative keywords (Table 2), we find occasional oracle keywords that are closely related to the vocabulary used in microblogging services, such as followdaibosyu, nowplaying or wanna, while intuitive manual keywords like wildlife for jaguar are unlikely to occur in the Twitter collection.

Company name | Oracle Negative Keywords | Manual Negative Keywords
amazon | followdaibosyu, pest, plug, brothers, pirotta | river, rainforest, deforestation, bolivian, brazilian
apple | juice, pie, fruit, tea, fiona | fruit, diet, crunch, pie, recipe
fox | megan, matthew, lazy, valley, michael | animal, terrier, hunting, Volkswagen, racing
kiss | hug, nowplaying, lips, day, wanna | french, Meg Kevin, bang bang, Ryan Kline

Table 2: Differences between oracle and manual negative keywords for some of the company names in the test collection.

Remarkably, manual keywords extracted from the Web (around 10 per company) only reach 15% coverage of the tweets (compare with 40% coverage using 10 oracle keywords extracted from the tweet stream), with an accuracy of 0.86 (which is lower than expected for manually selected filter keywords). This seems to indicate that the vocabulary and topics of microblogging are different from those found on the Web. Our experiments in Section 4 corroborate this finding.

3.2. Using Keywords to Solve the Task

So far, we have seen that keywords do not cover all tweets in the collection, but only a part of them. Therefore, we need an additional step in order to complete the task: given a seed of tweets, annotate the tweets that remain uncovered by filter keywords in each test case. To this aim, we use a standard bootstrapping method. Tweets are represented as bag-of-words (produced after tokenization, lowercasing and stop word removal) and term occurrence is used as the weighting function; we then employ a C4.5 Decision Tree classification model15 [65] – with its default parameters – using the implementation provided by the RapidMiner toolkit [56]. For each stream, we use the tweets retrieved by the keywords as a seed (training set) in order to automatically classify the rest of the tweets.

Table 3 displays results for different amounts of filter keywords: the bootstrapping strategy ranges from 0.81 (with 5 keywords) up to 0.87 (with 20 keywords). On the other hand, using the tweets covered by manual keywords as the training set, the bootstrapping achieves only 0.67 accuracy.

keyword selection strategy | seed set coverage | seed set accuracy | bootstrapping overall accuracy
5 oracle keywords | 28% | 1.00 | 0.81
10 oracle keywords | 40% | 1.00 | 0.85
15 oracle keywords | 47% | 1.00 | 0.86
20 oracle keywords | 53% | 1.00 | 0.87
manual keywords | 15% | 0.86 | 0.67

Table 3: Quality of seed sets and the bootstrapping classification strategy when applying oracle/manual filter keywords.

15 We also tried other machine learning methods, such as linear SVM and Naive Bayes, obtaining similar results; therefore we only report results on C4.5 for the sake of clarity.

Figure 2: Fingerprints for the bootstrapping classification strategy when applying manual keywords (left, avg. accuracy 0.67) or 20 oracle keywords (right, avg. accuracy 0.87).

In order to better understand the results, Figure 2 shows the fingerprint representation [68] for manual keywords and 20 oracle keywords. This visualization technique consists of displaying the accuracy of the system (vertical axis) for each company/test case (dots) vs. the ratio of related (positive) tweets for the company (horizontal axis).
The three basic baselines (all related, all unrelated and random) are represented as three fixed lines: y = x, y = 1 − x and y = 0.5, respectively. The fingerprint visualization method helps in understanding and comparing systems' behavior, especially when class skews vary across test cases.

Using 20 oracle keywords (see Figure 2b), the obtained average accuracy is 0.87. The fingerprint shows that the improvement resides in cases with a related ratio around 0.5, i.e. the cases where it is more likely to have enough training samples for both classes. Manual keywords, on the other hand, lead to annotations that tend to stick to the "all related" or "all unrelated" baselines, which indicates that they tend to describe only one class, and thus the learning process is biased.

In summary, our results validate the idea that filter keywords can be a powerful tool in our filtering task, but also suggest that they will not be easy to find: descriptive web sources that can be attributed to the company do not lead to the keywords that are most useful or accurate in the Twitter domain.

4. Automatic Discovery of Filter Keywords

Our next step is to automatically discover the terms which are most strongly associated with the company name (positive filter keywords) or with the alternative meanings of the company name (negative filter keywords), and to discard those which are not discriminative (skip terms). For this, we take a machine learning approach in which the training data corresponds to a set of company names and the test data to a different set of companies. Thus, the learning process must be able to generalize across companies. Each term is represented by features that take into account the company's website, Wikipedia, ODP, the Web at large and the WePS-3 collection itself. In this section we first discuss the features that we propose to represent terms; then we perform a statistical analysis of the features, and finally we report the results of experiments with our dataset.

4.1. Term Features

Table 4 summarizes the notation that we will use in this section to describe the features. We only work with terms which are not stop words and which appear in at least five different tweets in the set Tc, for a given company c.

Item | Description
t, ti | term
c | (ambiguous) name that identifies a company (e.g. jaguar)
T | set of tweets in the WePS-3 collection (for each company name, only the dataset to which the company belongs is used, either training or test)
Tc | set of tweets in the collection for a given company name c
dft(Tc) | document frequency of term t in the collection Tc
dfweb(q) | total number of hits returned by the Yahoo! Search BOSS API (http://developer.yahoo.com/search/boss/) for the query q
M | an approximation of the size of the search engine index (30 · 10^9)
domainc | domain of the website given as reference for the company c
wikipedia(q) | set of Wikipedia pages returned by the MediaWiki API (http://www.mediawiki.org/wiki/API) for the query q
dmoz(q) | set of items (composed of a URL, a title, a description and a category) returned by searching q on the Open Directory Project (http://www.dmoz.org/search)

Table 4: Notation used to describe the features used to represent terms.

For each of these terms, we have considered 18 features grouped in three classes:

1. Collection-based features (col_*): Terms that co-occur frequently with the (ambiguous) company name c, or terms written as hashtags, should be more likely to be (positive/negative) keywords than others.
These features combine information about the occurrence of the term in the collection: document frequency in the whole corpus, document frequency in the set of tweets for the company, how many times the term occurs as a Twitter hashtag, and the average distance between the term and the company name in the tweets.

(a) col_c_df: Normalized document frequency in the collection of tweets for the company Tc:

    col_c_df(t) = dft(Tc) / |Tc|   (1)

(b) col_c_specificity: Ratio of the document frequency in the tweets for the company Tc over the document frequency in the whole corpus T:

    col_c_specificity(t) = dft(Tc) / dft(T)   (2)

(c) col_hashtag: Number of occurrences of the term as a hashtag (e.g. #jobs, #football) in Tc.

(d) col_c_prox_avg, col_c_prox_sd, col_c_prox_median: Mean, standard deviation and median of the distance (number of terms) between the term and the company name in the tweets.

2. Web-based features (web_*): These features are computed from information about the term in the Web at large (approximated by search counts), the website of the company, Wikipedia and the Open Directory Project (ODP).16

(a) web_c_assoc: Intuitively, a term which is close to the company name has more chances of being a keyword (either positive or negative) than more generic terms. This feature represents the association, according to the search counts, between the term t and a company name c:

    web_c_assoc(t, c) = (dfweb(t OR c) / dfweb(c)) / (dfweb(t) / M)   (3)

(b) web_c_ngd: The Normalized Google Distance [16] (applied to the Yahoo! search engine), which is a measure of the semantic distance between two terms derived from search counts. For a term t and a company name c, the Normalized Google Distance is given by (4):

    web_c_ngd(t, c) = (max(log(f(c)), log(f(t))) − log(f(t AND c))) / (log(M) − min(log(f(t)), log(f(c))))   (4)

    where f(x) = dfweb(x).

(c) web_dom_df: Frequent terms in the company website should be meaningful to characterize positive keywords. web_dom_df is the normalized document frequency of the term in the website of the company:

    web_dom_df(t, c) = dfweb(t AND site:domainc) / dfweb(site:domainc)   (5)

(d) web_dom_assoc: Degree of association of the term with the website, in comparison with the use of the term on the Web. This feature is analogous to web_c_assoc, using the website domain instead of the company name c:

    web_dom_assoc(t, c) = (dfweb(t AND site:domainc) / dfweb(site:domainc)) / (dfweb(t) / M)   (6)

(e) web_odp_occ: Number of occurrences of the term in all the items in dmoz(domainc). Each item is composed of a URL, a title, a description and the ODP category to which it belongs.

(f) web_wiki_occ: Number of occurrences of the term in the first 100 results in wikipedia(domainc). In order to filter out pages returned by the API that could be unrelated to the company, only pages that contain the string domainc are considered.

16 Some companies in the WePS-3 collection have a Wikipedia page as reference page instead of the company website. In these cases, the feature web_dom_df (which is also the numerator in the feature web_dom_assoc) is computed as the presence of the term t in the Wikipedia page. Also, the query used to get the values of the features web_odp_occ and web_wiki_occ is the title of the Wikipedia page.

3. Features expanded with co-occurrence: In order to avoid false zeros in the web-based features, we expand some of the previous term features with the values obtained by the five most co-occurring terms.
Given a feature f, a new feature is computed as the Euclidean norm (7) of the vector with components f(ti) · w(t, ti) for the five terms that most co-occur with t in the set of tweets Tc (8), where f(ti) is the value of the web-based feature f for the term ti and w(t, ti) is the degree of co-occurrence of each term (9):

    cooc_agg(t, f) = sqrt( Σ_{ti ∈ cooct} (f(ti) · w(t, ti))^2 )   (7)

    cooct = set of the five terms which most co-occur with t   (8)

    w(t, ti) = |co-occurrencesTc(t, ti)| / |Tc|   (9)

This formula is applied to web_c_assoc, web_c_ngd, web_dom_df, web_dom_assoc, web_odp_occ and web_wiki_occ, resulting in the features enumerated below:

(a) cooc_c_assoc = cooc_agg(t, web_c_assoc)
(b) cooc_c_ngd = cooc_agg(t, web_c_ngd)
(c) cooc_dom_df = cooc_agg(t, web_dom_df)
(d) cooc_dom_assoc = cooc_agg(t, web_dom_assoc)
(e) cooc_odp_occ = cooc_agg(t, web_odp_occ)
(f) cooc_wiki_occ = cooc_agg(t, web_wiki_occ)

4.2. Feature Analysis

The first step of the feature analysis is to develop a gold standard set of positive and negative keywords. In order to get sufficient training data and to deal with possible mis-annotations in the corpus, we set a precision of 0.85 of a term in the related/unrelated set of tweets as a feasible threshold to annotate a term as a keyword. Terms with a precision lower than 0.85 in both classes are labeled as skip terms (10):

    label(t) = positive   if |related(Tc,t)| / |Tc,t| > 0.85
               negative   if |unrelated(Tc,t)| / |Tc,t| > 0.85
               skip       otherwise   (10)

where Tc,t is the set of tweets in Tc that contain the term t, and related(Tc,t) and unrelated(Tc,t) are, respectively, the tweets in Tc,t annotated as related and unrelated.

Labeling all suitable terms of the WePS-3 training dataset, we end up with a total of 6,410 terms, of which 34% were labeled as positive keywords, 44% as negative keywords and the remaining 22% as skip. The test dataset, on the other hand, produces a total of 4,653 candidate terms, of which 33% were labeled as positive keywords, 40% as negative keywords and 27% as skip.

In order to study feature behavior, we calculate the distribution of each feature in the three classes: positive, negative and skip. We rely on box/whisker plots to show these distributions and the differences or similarities between classes (see Figure 3). Each box/whisker plot shows the distribution of values of a feature for the three classes. The bottom and top of the box are the 25th and 75th percentiles (the Q1 and Q3 quartiles, respectively), and the band near the middle is the 50th percentile (the median, Q2). The whiskers extend to the most extreme data points within 1.5 times the length of the box (i.e. within −1.5 · IQR and +1.5 · IQR, where IQR = |Q3 − Q1|).
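As a concrete reading of the labeling rule in Eq. (10), the following sketch assigns the gold-standard label of a candidate term from the annotated tweets of a company; it is a minimal sketch assuming tweets are given as (term set, related flag) pairs, not the exact gold-standard construction code.

```python
from typing import List, Set, Tuple

def term_label(term: str,
               tweets: List[Tuple[Set[str], bool]],
               threshold: float = 0.85) -> str:
    """Label a candidate term as 'positive', 'negative' or 'skip' (Eq. 10).

    tweets: annotated stream Tc for one company, as (terms, is_related) pairs.
    The precision of the term in each class is computed over the tweets
    that actually contain the term (Tc,t).
    """
    containing = [is_related for terms, is_related in tweets if term in terms]
    if not containing:
        return "skip"
    related_precision = sum(containing) / len(containing)
    if related_precision > threshold:
        return "positive"
    if (1 - related_precision) > threshold:
        return "negative"
    return "skip"
```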
Figure 3: Box plots representing the distribution of each of the features in the positive, negative and skip classes. The bottom and top of each box are the Q1 and Q3 quartiles, respectively, and the band near the middle is the Q2 quartile, i.e. the median. The whiskers extend to the most extreme data points within 1.5 times the length of the box (−1.5 · IQR and +1.5 · IQR, where IQR = |Q3 − Q1|). Panels (a)–(r) correspond to the features col_c_df, col_c_specificity, col_hashtag, col_c_prox_median, col_c_prox_avg, col_c_prox_sd, web_odp_occ, web_wiki_occ, cooc_odp_occ, cooc_wiki_occ, web_c_assoc, web_c_ngd, web_dom_df, web_dom_assoc, cooc_c_assoc, cooc_c_ngd, cooc_dom_df and cooc_dom_assoc.

These plots help visualize the range of values for each feature, as well as where most of the values lie, allowing for a qualitative analysis of the features. We can see that the features col_hashtag (Fig. 3c), web_odp_occ (Fig. 3g) and cooc_odp_occ (Fig. 3i) are not informative, because almost all of their values are zero: less than 1% of the terms in the test set occur at least once as a hashtag, and less than 1% appear in the descriptions and titles of ODP search results.

Features describing term-company distance seem to capture differences between keywords and skip terms: both negative and positive keywords generally occur closer to the company name than skip terms. While positive and negative keywords share a similar median and standard deviation of proximity to the company name (Figs. 3d and 3f), the average distance for positive keywords is slightly smaller than for negative keywords (Fig. 3e).

The features col_c_df, col_c_specificity, web_c_assoc, web_c_ngd and their expanded (by co-occurrence) versions cooc_c_assoc and cooc_c_ngd were defined to discriminate filter keywords from skip terms. The most discriminative feature seems to be col_c_specificity (Fig. 3b). On the other hand, the features web_wiki_occ, web_dom_df, web_dom_assoc, cooc_dom_df, cooc_dom_assoc and cooc_wiki_occ were designed to distinguish between positive and negative filter keywords. At first glance, positive and negative keywords have different distributions in all these features. Skip terms, on the other hand, tend to have distributions similar to those of positive keywords.
The features cooc_dom_assoc (Fig. 3r) and cooc_wiki_occ (Fig. 3j) seem to be the best at discriminating positive keywords from negative and skip terms. Remarkably, the features expanded by co-occurrence seem to be more informative than the original features, which tend to concentrate on low values (the median is near zero). When the original values are expanded by co-occurrence, positive terms receive higher values more consistently.

In order to quantitatively evaluate the quality of the features, we compute the Mann-Whitney U test [51], which is a non-parametric test used in statistical feature selection when a normal distribution of the features cannot be assumed. The p-value can be used to rank the features: the smaller the p-value, the more informative the feature [31].

feature | Filter keywords vs. skip terms: p-value | rank | Positive vs. negative filter keywords: p-value | rank
col_c_df | 2.11e-19 | 7 | 4.55e-31 | 8
col_c_specificity | 8.18e-50 | 1 | 4.40e-22 | 9
col_hashtag | 1.49e-02 | 15 | 5.40e-04 | 12
col_c_prox_sd | 1.87e-33 | 2 | 4.20e-02 | 15
col_c_prox_avg | 2.03e-20 | 6 | 4.18e-02 | 14
col_c_prox_median | 5.79e-14 | 8 | 1.62e-01 | 16
web_c_assoc | 4.76e-06 | 12 | 8.39e-19 | 10
web_dom_df | 6.67e-30 | 4 | 5.33e-92 | 6
web_dom_assoc | 7.14e-14 | 9 | 7.01e-138 | 4
web_c_ngd | 5.12e-12 | 10 | 3.38e-08 | 11
web_odp_occ | 1.96e-01 | 16 | 3.63e-01 | 18
web_wiki_occ | 1.68e-20 | 5 | 4.86e-115 | 5
cooc_c_assoc | 2.91e-30 | 3 | 2.04e-54 | 7
cooc_dom_df | 1.53e-05 | 13 | 1.42e-189 | 3
cooc_dom_assoc | 8.35e-01 | 18 | 7.27e-233 | 1
cooc_c_ngd | 3.54e-07 | 11 | 2.41e-01 | 17
cooc_odp_occ | 3.15e-01 | 17 | 3.92e-03 | 13
cooc_wiki_occ | 2.36e-03 | 14 | 2.13e-211 | 2

Table 5: U test p-value and ranking position of the features, comparing filter keywords (both positive and negative) with skip terms, and comparing positive with negative filter keywords.

Table 5 shows the p-value and the rank of each feature for the U test. The most remarkable aspect of this table is that – in agreement with the boxplot analysis – the col_c_specificity feature discriminates between filter keywords and skip terms better than the other features. In addition, the feature cooc_dom_assoc, which measures the association between the term and the company website, is the best feature to discriminate between positive and negative filter keywords. These results confirm our assumptions that salient terms in the set of tweets of the company tend to be discriminative, and that salient terms associated with the company in tweets are also associated with the company website. Although the features analyzed above are signals that help to differentiate between positive, negative and skip terms, it seems that the vocabulary that characterizes a company in microblog streams is different from the vocabulary associated with the company on its website, in ODP entries or in Wikipedia.

4.3. Keyword Discovery

The features described above have been combined in three different ways to automatically discover filter keywords. The first one (machine learning - all features) consists of training a positive-negative-skip classifier over the WePS-3 training corpus using all the features. We combine two classifiers, positive versus others and negative versus others, using the confidence thresholds learned by the classifiers. Terms which are simultaneously under/over both thresholds are tagged as skip terms.

The second approach (heuristic) is inspired by the analysis of the signal provided by each of the features in the previous section. It is a simple heuristic which looks at only the best two features according to the Mann-Whitney U test: first, we define a threshold to remove skip terms according to their specificity with respect to
4.3. Keyword Discovery

The features described above have been combined in three different ways to automatically discover filter keywords. The first one (machine learning-all features) consists of training a positive-negative-skip classifier over the WePS-3 training corpus using all the features. We combine two classifiers, positive versus others and negative versus others, using the confidence thresholds learned by the classifiers; terms that are simultaneously under both thresholds, or over both, are tagged as skip terms.

The second approach (heuristic) is inspired by the analysis of the signal provided by each of the features in the previous section. It is a simple heuristic that looks only at the best two features according to the Mann-Whitney U test: first, we define a threshold to remove skip terms according to the specificity w.r.t. the collection of tweets for the company (col_c_specificity feature). Then, for the feature that measures association with the website (cooc_dom_assoc feature), we set a lower bound to capture positive filter keywords and an upper bound to capture negative filter keywords. These three thresholds have been manually optimized on the training data set.

Finally, we also explored a third option: applying machine learning using only the best two features instead of the whole feature set. We will refer to this method hereafter as machine learning-2features.

We have experimented with several machine learning methods using RapidMiner [56]: Neural Nets, C4.5 and CART Decision Trees, Linear Support Vector Machines (SVM) and Naive Bayes. All methods have been used with out-of-the-box parameters. All the terms labeled over the WePS-3 training dataset were used to train the models; likewise, terms extracted from the test dataset were used as the test set. Table 6 shows the Area Under the ROC Curve (AUC) of each of the binary classifiers evaluated. AUC is an appropriate metric to measure the quality of binary classification models independently of the confidence threshold [23]. We analyzed three different subsets of features to represent the terms: (i) all but the six features expanded by co-occurrence, (ii) only the best two features (those used by the heuristic and machine learning-2features classifiers), and (iii) all the features.

machine learning        features not expanded        2 best features        all features
algorithm               by co-occurrence
                        pos      neg                 pos      neg           pos      neg
Neural Net              0.68     0.67                0.73     0.72          0.75     0.73
CART Dec. Trees         0.58     0.61                0.72     0.71          0.63     0.64
Linear SVM              0.50     0.50                0.73     0.71          0.50     0.50
Naive Bayes             0.64     0.64                0.71     0.71          0.72     0.72
C4.5 Dec. Trees         0.50     0.61                0.50     0.50          0.59     0.66

Table 6: Area Under the ROC Curve (AUC) values for the five classification models and the three feature sets used to classify positive and negative keywords.

The results obtained are similar for all models, except for C4.5 and SVM, which in several cases do not provide any useful information for classification (AUC = 0.5). Leaving out the "expanded by co-occurrence" features, the performance is in general lower for all the algorithms, which corroborates the results of our previous feature analysis. In the following experiments, we focus on the Neural Net algorithm to train both classifiers (positive versus others and negative versus others), because it is consistently the best performing algorithm according to the AUC measure.

Below we analyze the results obtained with each of the feature combinations described at the beginning of this section (machine learning-all features, heuristic and machine learning-2features). The methods were trained using terms from the WePS-3 training dataset and evaluated on the WePS-3 test set.

Table 7 shows the confusion matrix obtained for the machine learning-all features method. The precision for the positive and negative classes is 62% and 56%, respectively, while recall is 52% and 72%. In order to obtain a few representative keywords, these recall levels are good enough, but the precision may compromise the final accuracy of the filtering process.

                               Actual class
                               positive    negative    skip      Class precision
Predicted class   positive     790         190         304       62%
                  negative     483         1334        583       56%
                  skip         242         330         375       40%
Class recall                   52%         72%         30%

Table 7: Confusion matrix for the machine learning-all features classifier.
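The way the two binary classifiers are combined into a positive/negative/skip decision can be sketched as follows. This is only an illustration: the experiments use RapidMiner Neural Nets, whereas the snippet below uses scikit-learn's MLPClassifier as a stand-in, and the threshold values are placeholders rather than the ones learned in our experiments.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Two binary classifiers, as in machine learning-all features:
# positive-vs-others and negative-vs-others.
pos_clf = MLPClassifier(max_iter=1000, random_state=0)
neg_clf = MLPClassifier(max_iter=1000, random_state=0)

def fit(X_train, y_train):
    # y_train: array of "positive" / "negative" / "skip" labels for the terms
    pos_clf.fit(X_train, y_train == "positive")
    neg_clf.fit(X_train, y_train == "negative")

def classify_terms(X, pos_threshold=0.5, neg_threshold=0.5):
    """Return "positive", "negative" or "skip" for each term."""
    p_pos = pos_clf.predict_proba(X)[:, 1]   # confidence of being a positive keyword
    p_neg = neg_clf.predict_proba(X)[:, 1]   # confidence of being a negative keyword
    is_pos = p_pos >= pos_threshold
    is_neg = p_neg >= neg_threshold
    labels = np.full(len(X), "skip", dtype=object)
    labels[is_pos & ~is_neg] = "positive"
    labels[is_neg & ~is_pos] = "negative"
    # Terms under both thresholds, or over both, remain "skip".
    return labels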
Table 8 shows the confusion matrix for the heuristic method, which only uses the col_c_specificity and cooc_dom_assoc features. This method is more precision-oriented than machine learning-all features: the precision values for the positive and negative classes are higher (68% and 75%), but recall is significantly lower (26% and 19%).

                               Actual class
                               positive    negative    skip      Class precision
Predicted class   positive     391         60          122       68%
                  negative     23          352         94        75%
                  skip         1102        1453        1056      29%
Class recall                   26%         19%         83%

Table 8: Confusion matrix for the heuristic classifier.

Finally, Table 9 shows the confusion matrix for the machine learning-2features method, which represents terms with the features col_c_specificity and cooc_dom_assoc and uses the Neural Net algorithm to build the model.

                               Actual class
                               positive    negative    skip      Class precision
Predicted class   positive     438         102         139       65%
                  negative     85          399         101       68%
                  skip         993         1364        1032      30%
Class recall                   29%         21%         81%

Table 9: Confusion matrix for the machine learning-2features classifier.

As expected, its performance lies between the machine learning-all features and heuristic methods, with a precision higher than the former (65% and 68%) and a recall higher than the latter (29% and 21%).

These results indicate that automatic detection of keywords is plausible and challenging at the same time. Which of the three approaches is better for our problem depends on their performance on the final task: tweet classification. In the next section we explore how to use these filter keywords to classify tweets.

5. Automatic Tweet Classification

After detecting the filter keywords automatically, we directly classify the subset of tweets that contain only negative or only positive keywords. Then, the bootstrapping strategy described in Section 3.2 is used to complete the task. It is also interesting to compare the bootstrapping strategy with the naive winner-takes-all baseline, which directly classifies all the tweets as related or unrelated depending on which is the dominant class in the seed of tweets, and with the winner-takes-remainder strategy, which consists of applying the winner-takes-all strategy only to those tweets that were not covered by any of the filter keywords.

[Figure 4 about here.]

Figure 4: Fingerprints for each of the keyword selection strategies combined with each of the different tweet classification strategies.

Figure 4 shows the fingerprint of each of the combinations tested and Table 10 shows the results.

keyword                    seed set                 overall accuracy
selection strategy         coverage    acc.         wta       wtr       bootstrapping
20 oracle keywords         53%N        1.00N        0.80M     0.85N     0.87N
manual keywords            15%H        0.86         0.61      0.63      0.67
supervised bootstr.                                                     0.85N
m. learning - all feat.    58%         0.75         0.69      0.71      0.73
heuristic                  27%H        0.79         0.64      0.65      0.71
m. learning - 2 feat.      39%H        0.78         0.70      0.72      0.72

Table 10: Results for automatic keyword detection strategies (wta=winner-takes-all, wtr=winner-takes-remainder). Statistical significance w.r.t. the ml-all features selection strategy was computed using a two-tailed Student's t-test. Significant differences are indicated using N (or H) for α = 0.01 and M (or O) for α = 0.05.

The best automatic method, which combines machine learning-all features to discover keywords and bootstrapping with the tweets annotated using those keywords, gives an accuracy of 0.73, which is higher than using manual keywords from the Web (0.67) and is close to the best automatic result reported in the WePS-3 competition (0.75).
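The tweet classification strategies compared in Table 10 can be sketched as follows. This is a simplified illustration assuming plain substring matching of keywords in lowercased tweets; the bootstrapping step itself (training a bag-of-words classifier on the seed, as described in Section 3.2) is omitted.

def seed_classification(tweets, positive_kw, negative_kw):
    """Directly label tweets that contain only positive or only negative keywords."""
    labels = {}
    for t in tweets:
        text = t.lower()
        has_pos = any(k in text for k in positive_kw)
        has_neg = any(k in text for k in negative_kw)
        if has_pos and not has_neg:
            labels[t] = "related"
        elif has_neg and not has_pos:
            labels[t] = "unrelated"
    return labels  # uncovered tweets are left out of the seed

def winner_takes_all(tweets, seed_labels):
    """Assign every tweet the dominant class of the seed."""
    related = sum(1 for lab in seed_labels.values() if lab == "related")
    dominant = "related" if related >= len(seed_labels) - related else "unrelated"
    return {t: dominant for t in tweets}

def winner_takes_remainder(tweets, seed_labels):
    """Keep the seed labels; apply the dominant class only to uncovered tweets."""
    final = winner_takes_all(tweets, seed_labels)
    final.update(seed_labels)
    return final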
In addition, the bootstrapping process almost doubles the coverage (from 58% to 100%) with only a 2.7% loss in accuracy. In general, the more tweets are covered by filter keywords (seed coverage), the lower the loss in accuracy: the heuristic keyword selection covers 27% of the tweets with 0.79 accuracy and achieves 0.71 accuracy after the bootstrapping process, while machine learning-2features covers 39% of the tweets with 0.78 accuracy and finishes with 0.72 accuracy. Remarkably, the bootstrapping process outperforms the winner-takes-all and winner-takes-remainder baselines in all cases.

[Figure 5 about here; the plot is annotated with avg: 0.85.]

Figure 5: Fingerprint for the bootstrapping upper bound (10-fold cross-validation).

An interesting question is how our approach, which does not use training data from the companies in the test set, compares with a truly supervised counterpart (i.e., one which uses the same machine learning algorithm and the same bag-of-words features, but uses perfect training material, taken from the test set, for each of the companies). To this aim, we carried out a 10-fold validation of the machine learning algorithm on the test set. This supervised upper bound achieves 0.85 accuracy, which is only 14% higher than our unsupervised algorithm (0.73). Figure 5 shows the fingerprint representation of the supervised upper bound; this fingerprint is similar to the fingerprint of the 20 oracle keywords shown in Section 3.2.

Discovery of filter keywords has proved to be challenging using signals from the Web: the accuracy of the resulting seed set ranges between 0.75 and 0.79, with a potentially useful coverage (58%) in the case of machine learning-all features. Overall, this result reinforces the conclusion that the characterization of companies in Twitter, in terms of vocabulary, is probably different from the characterization that can be derived from the Web.

6. Discussion

6.1. How Much of The Problem is Solved?

In order to shed light on the trade-off between quality and quantity of filter keywords, here we analyze the relation between accuracy and coverage of the classified tweets when considering different sets of filter keywords. Figure 6 shows the coverage/accuracy curves for oracle, manual and automatic filter keywords.

[Figure 6 about here: x-axis is tweet coverage with keywords (%), y-axis is tweet classification accuracy; one curve per keyword classification strategy (oracle keywords, manual keywords, machine learning - all features, heuristic, machine learning - 2 features).]

Figure 6: Coverage/accuracy curves for oracle, manual and automatic filter keywords.

The curves were generated as follows:

Oracle keywords. At step n, we consider the nth positive/negative oracle keyword that maximizes accuracy and, in case of ties, coverage of tweets (i.e., if two keywords have the same accuracy, the one that covers more tweets is considered first).

Manual keywords. At each step n we consider the nth positive/negative manual keyword that maximizes coverage of tweets.

Machine learning - all features.
The set of terms considered in this analysis are those that were classified as positive or negative using the confidence thresholds learned by the two classifiers of the "machine learning - all features" method. Skip terms are (i) those classified as skip by both binary classifiers, and (ii) those classified simultaneously as positive and negative keywords. We then use the maximum of the confidence scores returned by the two classifiers (i.e., max(conf(positive), conf(negative))) to sort the keywords, and the keyword with the highest confidence score is added at each step. The point of the curve with highest coverage corresponds to the classifier used in the experiments described in Section 4.

Machine learning - 2 features. This curve is generated using the two classifiers learned in "machine learning - 2 features", analogously to the curve for "machine learning - all features".

Heuristic. Since this classifier consists of thresholds manually defined on the training set, it does not provide any confidence score for the test cases. Hence, in the plot it is represented as a single point (×).

The curve for oracle keywords provides a statistical upper bound on how many tweets can be directly covered using filter keywords. Considering the best 100 oracle keywords for each test case (company name), it is possible to directly tag 85% of the tweets with 0.99 accuracy. A more realistic upper bound is given by manual keywords: here we can observe how the accuracy remains stable around 0.85, while the coverage grows from approximately 10% to 15%. In the best possible case, with more keywords the curve would continue along the line y = 0.85. Note that manual keywords have been annotated by inspecting representative Web pages (from Google search results) rather than by inspecting tweets. Therefore, an automatic keyword classifier relying on the same kind of Web evidence can hardly be expected to achieve an accuracy above 0.85. Considering this, our automatic approaches establish a strong lower bound of 0.7 accuracy. In conclusion, it seems that a filter keyword classifier should reach an accuracy between 0.7 and 0.85 to be competitive.

6.2. Comparing Systems with Different Metrics

We have seen that related/unrelated tweets are not balanced in most of the test cases in WePS-3, and that the proportion does not follow a normal distribution (extreme values seem to be as plausible as values around the mean). Because of this, accuracy may not be sufficient to understand the quality of systems, which is why we have complemented it with the fingerprint representation [68]. In this section, we evaluate (and compare) results with the most popular alternative evaluation metrics found in the literature. Considering the confusion matrix given by each system, where TP = true related tweets, FP = false related tweets, TN = true unrelated tweets, and FN = false unrelated tweets, we compute the following metrics in addition to accuracy:

Normalized Utility. Utility has been used to evaluate document filtering tasks in TREC [34, 33] and is commonly used with a relative weight α between true positives and false positives:

u(S,T) = α · TP − FP

As in the TREC-8 filtering task [33], here utility is normalized by means of the following scaling function:

u*_s(S,T) = [max(u(S,T), U(s)) − U(s)] / [MaxU(T) − U(s)]

where u(S,T) is the original utility of system output S for topic T, U(s) is the utility of retrieving s non-relevant documents, and MaxU(T) = α · (TP + FN) is the maximum possible utility score for topic T. In this paper, we set α = 2 and U(s) = −25.
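As a worked illustration of the scaling above, the per-topic normalized utility could be computed as in the following sketch (a small helper written for this paper's setting of α = 2 and U(s) = −25; the function name and signature are ours, not part of any evaluation toolkit).

def normalized_utility(tp, fp, fn, alpha=2.0, u_s=-25.0):
    """Normalized utility for one topic, following the TREC-8 scaling:
    u(S,T)    = alpha * TP - FP
    MaxU(T)   = alpha * (TP + FN)          # utility of a perfect system
    u*_s(S,T) = (max(u, U(s)) - U(s)) / (MaxU(T) - U(s))
    """
    u = alpha * tp - fp
    max_u = alpha * (tp + fn)
    return (max(u, u_s) - u_s) / (max_u - u_s)

# Example: a topic with 80 related tweets, of which the system returns 60,
# plus 30 unrelated tweets wrongly returned as related:
# normalized_utility(tp=60, fp=30, fn=20)  ->  approximately 0.622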
lam%. lam% (logistic average misclassification percentage) has been used in TREC to evaluate the spam detection problem [18]. It is defined as the geometric mean of the odds of hm% (ham misclassification percentage) and sm% (spam misclassification percentage). More precisely,

lam% = logit^-1((logit(hm%) + logit(sm%)) / 2)

where

hm% = FN / (FN + TP)
sm% = FP / (FP + TN)
logit(x) = log(x / (1 − x))
logit^-1(x) = e^x / (1 + e^x)

Note that lam% is an error-based metric, i.e., maximum scores represent minimum quality. One remarkable property of this metric is that, when a system shows non-informative behavior (that is, it classifies documents randomly, or assigns all documents to a single class), its lam% score is around 0.5.

Reliability & Sensitivity. Reliability and Sensitivity [5] have recently been proposed as two complementary measures to evaluate document organization tasks involving classification, clustering and ranking, and were the official evaluation measures of RepLab, held at CLEF 2012 [4]. When evaluating a binary classification task, Reliability corresponds to the product of the precision of both classes, and Sensitivity to the product of the recall of both classes:

R = (TP / (TP + FP)) · (TN / (TN + FN))
S = (TP / (TP + FN)) · (TN / (TN + FP))

and

F1(R,S) = 2 · R · S / (R + S)

Like lam%, Reliability & Sensitivity penalize systems that do not provide any useful information. Unlike lam%, which assigns such systems a low but not the minimum score, Reliability & Sensitivity give them zero, the minimum score.

F1 measure. The most standard combination of Precision and Recall is F1, or balanced F measure. Here we focus on the "related" class, where

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

and

F1 = 2 · Precision · Recall / (Precision + Recall)

Table 11 reports the results of the baselines, the WePS-3 systems and our proposed systems for the metrics described above. All metrics were macro-averaged over topics, and undefined scores were counted as zero.

System                          accuracy    utility    lam%     F1(R,S)    F1
Gold standard                   1.00N       1.00N      0.00N    1.00N      1.00N
supervised bootstr.             0.85N       0.69N      0.28N    0.30N      0.62N
WePS-3: LSIR (manual)           0.83N       0.66N      0.28N    0.27       0.62N
ml-all feat. + bootstr.         0.73        0.47       0.37     0.27       0.49
ml-2 feat. + wtr                0.72        0.49       0.43     0.16O      0.49
ml-2 feat. + bootstr.           0.72        0.47       0.43     0.17O      0.50
heuristic + bootstr.            0.71        0.45       0.42     0.11H      0.46
ml-all feat. + wtr              0.71        0.44       0.39H    0.21H      0.43H
ml-2 feat. + wta                0.70        0.48       0.50H    0.00H      0.39O
ml-all feat. + wta              0.69O       0.40O      0.50H    0.00H      0.27H
heuristic + wtr                 0.65H       0.46       0.42     0.10H      0.46
heuristic + wta                 0.64H       0.44       0.50H    0.00H      0.39O
WePS-3: ITC-UT                  0.75        0.52       0.37     0.20       0.49
WePS-3: SINAI                   0.64O       0.38O      0.35     0.12H      0.30H
WePS-3: UvA                     0.58H       0.22H      0.46H    0.17H      0.36H
WePS-3: KALMAR                  0.47H       0.35H      0.43     0.16O      0.48
baseline: all unrelated         0.57H       0.20H      0.50H    0.00H      0.00H
baseline: random                0.49H       0.20H      0.49H    0.16H      0.37H
baseline: all related           0.43H       0.40       0.50H    0.00H      0.52

Table 11: Results for the proposed systems, the WePS-3 systems and the baselines, compared with different evaluation metrics. Best automatic runs are in boldface. (ml=machine learning, wta=winner-takes-all, wtr=winner-takes-remainder). Statistical significance w.r.t. the ml-all feat. + bootstrapping run was computed using a two-tailed Student's t-test. Significant differences are indicated using N (or H) for α = 0.01 and M (or O) for α = 0.05.
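For reference, all of the metrics in Table 11 can be derived from a single per-topic confusion matrix. The helper below is a minimal sketch of that computation (our own code, not the official evaluation scripts); macro-averaging over topics, and the boundary handling for lam%, are simple conventions of this sketch.

import math

def metrics(tp, fp, tn, fn, alpha=2.0, u_s=-25.0):
    """Accuracy, normalized utility, lam%, F1(R,S) and F1 for one topic."""
    def safe(num, den):
        return num / den if den else 0.0          # undefined scores count as zero

    accuracy = safe(tp + tn, tp + fp + tn + fn)

    u = alpha * tp - fp
    utility = safe(max(u, u_s) - u_s, alpha * (tp + fn) - u_s)

    def logit(x):
        return math.log(x / (1.0 - x))
    hm, sm = safe(fn, fn + tp), safe(fp, fp + tn)
    eps = 1e-6
    hm = min(max(hm, eps), 1.0 - eps)             # clamp to avoid infinite logits
    sm = min(max(sm, eps), 1.0 - eps)             # (a smoothing choice of this sketch)
    z = (logit(hm) + logit(sm)) / 2.0
    lam = math.exp(z) / (1.0 + math.exp(z))

    reliability = safe(tp, tp + fp) * safe(tn, tn + fn)
    sensitivity = safe(tp, tp + fn) * safe(tn, tn + fp)
    f1_rs = safe(2.0 * reliability * sensitivity, reliability + sensitivity)

    precision, recall = safe(tp, tp + fp), safe(tp, tp + fn)
    f1 = safe(2.0 * precision * recall, precision + recall)

    return {"accuracy": accuracy, "utility": utility, "lam%": lam,
            "F1(R,S)": f1_rs, "F1": f1}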
Results show that, according to Reliability & Sensitivity, our best automatic system, ml-all features + bootstrapping, achieves the same score (0.27) as the WePS-3 LSIR semi-automatic system, which is the best result at the competition and involves manual processing, and outperforms the best automatic system in WePS-3 (ITC-UT = 0.20), a 35% relative improvement. In terms of lam%, the SINAI system achieves the best automatic score (0.35), followed by ITC-UT and ml-all features + bootstrapping, which both reach 0.37. Note that lam% and R&S penalize non-informative, baseline-like behaviors; because of this, the winner-takes-all systems and the "all (un)related" baselines get the worst scores on these metrics. According to utility, ITC-UT is still the best automatic system (0.52); our best runs are between 0.47 and 0.49, ml-2 features + bootstrapping being the best of them. Finally, F1 rewards systems that tend to return all tweets as related: indeed, the best score among automatic systems is achieved by the "all related" baseline, which has perfect recall and enough precision to obtain the highest score.

A number of empirical observations can be made from this comparison of metrics:

• In general, high F1(R,S) implies high accuracy, but not vice versa: F1(R,S) is a stricter metric, at least on this dataset.

• Metrics such as lam% and F1(R,S) are suitable for identifying baseline-like behaviors, while F1 is not.

• ITC-UT and ml-all features + bootstrapping perform consistently well across metrics.

• Different metrics illustrate different aspects of system behavior: if we need to penalize non-informative behavior, we should look at lam% or F1(R,S); accuracy and utility directly reflect misclassification errors, but are sensitive to collections where class skews vary across test cases, such as our dataset.

6.3. Web vs. Twitter

In our experiments, we have found two results indicating that the vocabulary that characterizes a company on Twitter substantially differs from the vocabulary associated with the company on the Web:

• Low recall of manual keywords. As we saw in Section 3.1, manually selecting around 10 salient terms from Web search results retrieved using the company name and its representative URL only covers 15% of the tweets.

• Web-based features are useful but inconclusive. Analyzing the features (see Section 4.2), we found that the web-based features that may discriminate positive from negative keywords tend to receive low values. Moreover, the limited quality (≤ 0.75 AUC) of the automatic classification of filter keywords indicates that building an accurate classifier with features extracted from the Web is challenging, to say the least (see Section 4.3).

In order to reach a better understanding of the problem, we have explored the overlap between the best 10 oracle keywords for each tweet stream and the terms occurring in both the company's homepage and its Wikipedia article (footnote 17). The terms from each page have been extracted using the lynx -dump url Linux command. Table 12 shows the average percentage of the best 10 oracle keywords that occur on the company's homepage, on its Wikipedia page, and on both.
Footnote 17: We manually extended the input data of each organization in the WePS-3 dataset with its Wikipedia page (or with its homepage in the cases in which the Wikipedia page is already provided as the representative page).

Filter keywords                 Homepage    Wikipedia    Both
related oracle keywords         36%         68%          33%
unrelated oracle keywords       9%          19%          6%

Table 12: Percentage of the 10 best oracle keywords extracted from the tweet stream that are covered by the company's homepage, its Wikipedia article, and both.

Overall, the only substantial overlap is for positive keywords in Wikipedia, indicating that representative Web pages are not the ideal place to look for effective filter keywords for Twitter. Note that the overlap of related oracle keywords with the company's Wikipedia page roughly doubles the overlap with its homepage; the same happens with unrelated keywords: almost 20% on Wikipedia versus almost 10% on the homepage. The percentage of oracle keywords that occur both in the homepage and in the Wikipedia article is similar to that of the homepage alone, indicating that Wikipedia essentially extends the keywords already present in the homepage.

In summary, exploring the nature of filter keywords leads us to the conclusion that the vocabulary characterizing a company in Twitter is substantially different from the vocabulary associated with the company in its homepage, in Wikipedia, and apparently in the Web at large. These findings are in line with the "vocabulary gap" that has been shown to exist between Twitter and other Web sources such as Wikipedia or news comments [73]. One way of alleviating this problem is the co-occurrence expansion of web-based features, which makes filter keywords easier to recognize automatically. While the company's Wikipedia article seems to have better coverage of (perfect) filter keywords than the company's homepage, further investigation is needed on how to automatically infer the company's Wikipedia page from its homepage URL in order to extract additional keyword features from it.

7. Conclusion

In this paper we tackled the problem of company name disambiguation in Twitter, defined in WePS-3 as a binary classification problem. We have studied the use of filter keywords: expressions that, if present in a tweet, indicate a high probability that it is related/unrelated to the company. In our experiments, automatically discovered filter keywords are able to classify a subset of 30%-60% of the tweets with an accuracy ranging from 0.75 to 0.79. We defined features that characterize terms using the Twitter dataset, the company's website, ODP, Wikipedia and the searchable Web. We found that (i) term specificity in the tweet stream of each company name is a feature that discriminates between filter keywords and skip terms, and (ii) the association between the term and the company website is useful to differentiate positive vs. negative filter keywords, especially when it is averaged over the term's most co-occurrent terms. Tweets classified by these filter keywords can be used to feed a supervised machine learning process to obtain a complete classification of all tweets, with an overall accuracy of 0.73. In comparison, a 10-fold validation of the same machine learning algorithm provides an accuracy of 0.85, i.e., our unsupervised algorithm has a 14% loss with respect to its supervised counterpart. We also found that, on average, the best five optimal keywords can directly classify around 30% of the tweets.
Nevertheless, keywords defined by a human inspecting web search results relevant to the company name only cover 15% of the tweets, with 0.86 accuracy. Exploring the nature of filter keywords also led us to the conclusion that there is a gap between the vocabulary characterizing a company in Twitter and the vocabulary associated with the company in its homepage, in Wikipedia, and apparently in the Web at large.

Note that all our experimentation is based on the WePS-3 dataset, which is an English-only dataset. As immediate future work, we plan to explore cross-lingual strategies to deal with multilingual data and so be able to test our approach on the RepLab 2012 dataset, which is the other test collection that fits our problem.

References

[1] Agirre, E., Edmonds, P., 2006. Word sense disambiguation: Algorithms and applications. Springer.
[2] Al-Kamha, R., Embley, D., 2004. Grouping search-engine returned citations for person-name queries. In: Proceedings of the 6th annual ACM international workshop on Web information and data management. pp. 96–103.
[3] Amigó, E., Artiles, J., Gonzalo, J., Spina, D., Liu, B., Corujo, A., 2010. WePS-3 Evaluation Campaign: Overview of the Online Reputation Management Task. In: CLEF 2010 Labs and Workshops Notebook Papers.
[4] Amigó, E., Corujo, A., Gonzalo, J., Meij, E., de Rijke, M., 2012. Overview of RepLab 2012: Evaluating Online Reputation Management Systems. In: CLEF 2012 Labs and Workshop Notebook Papers.
[5] Amigó, E., Gonzalo, J., Verdejo, F., 2012. Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks. Tech. rep., UNED.
[6] Artiles, J., October 2009. Web people search. Ph.D. thesis, UNED University.
[7] Artiles, J., Borthwick, A., Gonzalo, J., Sekine, S., Amigó, E., 2010. WePS-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks.
[8] Artiles, J., Gonzalo, J., Amigó, E., 2009. The impact of query refinement in the web people search task. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, pp. 361–364.
[9] Artiles, J., Gonzalo, J., Sekine, S., 2007. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. Proceedings of SemEval.
[10] Artiles, J., Gonzalo, J., Sekine, S., 2009. WePS 2 evaluation campaign: overview of the web people search clustering task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference.
[11] Bagga, A., Baldwin, B., 1998. Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1. pp. 79–85.
[12] Boyd, D., Golder, S., Lotan, G., 2010. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In: Proceedings of the 2010 43rd Hawaii International Conference on System Sciences. HICSS '10. IEEE Computer Society, Washington, DC, USA, pp. 1–10.
[13] Bunescu, R., Pasca, M., 2006. Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL. Vol. 6. pp. 9–16.
[14] Cha, M., Haddadi, H., Benevenuto, F., Gummadi, K. P., 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. In: Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM). Washington DC, USA.
[15] Cheng, Z., Caverlee, J., Lee, K., 2010. You are where you tweet: a content-based approach to geo-locating Twitter users. In: Proceedings of the 19th ACM international conference on Information and knowledge management. pp. 759–768.
[16] Cilibrasi, R. L., Vitanyi, P. M., 2007. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering.
[17] Comm, J., Robbins, A., 2009. Twitter power: How to dominate your market one tweet at a time. John Wiley & Sons Inc.
[18] Cormack, G., Lynam, T., 2005. TREC 2005 spam track overview. In: The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings.
[19] Cucerzan, S., 2007. Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL. Vol. 2007. pp. 708–716.
[20] Dellarocas, C., Awad, N., Zhang, X., 2004. Exploring the value of online reviews to organizations: Implications for revenue forecasting and planning. In: Proceedings of the International Conference on Information Systems.
[21] Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T., 2010. Entity disambiguation for knowledge base population. In: Proceedings of the 23rd International Conference on Computational Linguistics. pp. 277–285.
[22] Edmonds, P., Cotton, S., 2001. Senseval-2: Overview. In: Proceedings of The Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2). pp. 1–6.
[23] Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861–874.
[24] Ferragina, P., Scaiella, U., 2010. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In: Proceedings of the 19th ACM international conference on Information and knowledge management. pp. 1625–1628.
[25] Frank, E., Paynter, G., Witten, I., Gutwin, C., Nevill-Manning, C., 1999. Domain-specific keyphrase extraction. In: International Joint Conference on Artificial Intelligence. Vol. 16. pp. 668–673.
[26] García-Cumbreras, M. A., García-Vega, M., Martínez-Santiago, F., Perea-Ortega, J. M., 2010. SINAI at WePS-3: Online Reputation Management. In: CLEF 2010 Labs and Workshops Notebook Papers.
[27] Gentile, A., Zhang, Z., Xia, L., Iria, J., 2010. Semantic relatedness approach for named entity disambiguation. In: Digital Libraries. pp. 137–148.
[28] Gooi, C., Allan, J., 2004. Cross-document coreference on a large scale corpus. In: Proceedings of HLT/NAACL. Vol. 4.
[29] Gouws, S., Metzler, D., Cai, C., Hovy, E., Marina del Rey, C., 2011. Contextual bearing on linguistic variation in social media.
[30] Grishman, R., Sundheim, B., 1996. Message understanding conference-6: A brief history. In: Proceedings of the 16th conference on Computational linguistics - Volume 1. pp. 466–471.
[31] Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L., 2006. Feature extraction: foundations and applications. Springer Verlag, Ch. 2.2, pp. 67–78.
[32] Hoffman, T., 2008. Online reputation management is hot - but is it ethical? Computerworld, February.
[33] Hull, D., Robertson, S., 1999. The TREC-8 filtering track final report. In: Proceedings of the Eighth Text REtrieval Conference (TREC-8).
[34] Hull, D., et al., 1998. The TREC-7 filtering track: description and analysis. NIST Special Publication SP, 45–68.
[35] Hurlock, J., Wilson, M. L., 2011. Searching Twitter: Separating the tweet from the chaff. In: Proceedings of ICWSM 2011, the 5th International AAAI Conference on Weblogs and Social Media. AAAI.
[36] Jansen, B., Zhang, M., Sobel, K., Chowdury, A., 2009. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology 60 (11), 2169–2188.
[37] Java, A., Song, X., Finin, T., Tseng, B., 2007. Why we twitter: understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. pp. 56–65.
[38] Ji, H., Grishman, R., 2011. Knowledge base population: Successful approaches and challenges. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Vol. 1. pp. 1148–1158.
[39] Ji, H., Grishman, R., Dang, H., Griffitt, K., Ellis, J., 2010. Overview of the TAC 2010 knowledge base population track. In: TAC (Text Analysis Conference) 2010 Workshop.
[40] Kalashnikov, D., Mehrotra, S., Chen, Z., Nuray-Turan, R., Ashish, N., 2007. Disambiguation algorithm for people search on the web. In: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on. pp. 1258–1260.
[41] Kalmar, P., 2010. Bootstrapping Websites for Classification of Organization Names on Twitter. In: CLEF 2010 Labs and Workshops Notebook Papers.
[42] Kaufmann, M., Kalita, J., 2010. Syntactic normalization of Twitter messages. In: International Conference on Natural Language Processing, Kharagpur, India.
[43] Kim, S., Medelyan, O., Kan, M., Baldwin, T., 2010. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In: Proceedings of the 5th International Workshop on Semantic Evaluation. pp. 21–26.
[44] Krishnamurthy, B., Gill, P., Arlitt, M., 2008. A few chirps about Twitter. In: WOSP '08: Proceedings of the first workshop on Online social networks. ACM, New York, NY, USA, pp. 19–24.
[45] Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S., 2009. Collective annotation of Wikipedia entities in web text. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 457–466.
[46] Kwak, H., Lee, C., Park, H., Moon, S., 2010. What is Twitter, a social network or a news media? In: WWW '10: Proceedings of the 19th international conference on World Wide Web. ACM, New York, NY, USA, pp. 591–600.
[47] Laboreiro, G., Sarmento, L., Teixeira, J., Oliveira, E., 2010. Tokenizing micro-blogging messages using a text classification approach. In: Proceedings of the fourth workshop on Analytics for noisy unstructured text data. pp. 81–88.
[48] Liu, Z., Huang, W., Zheng, Y., Sun, M., 2010. Automatic keyphrase extraction via topic decomposition. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. pp. 366–376.
[49] Mann, G., 2006. Multi-document statistical fact extraction and fusion. Ph.D. thesis, The Johns Hopkins University.
[50] Mann, G., Yarowsky, D., 2003. Unsupervised personal name disambiguation. In: Proceedings of HLT-NAACL '03. pp. 33–40.
[51] Mann, H., Whitney, D., 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18 (1), 50–60.
[52] Martinez-Romo, J., Araujo, L., 2009. Web people search disambiguation using language model techniques.
[53] McNamee, P., Dang, H., 2009. Overview of the TAC 2009 knowledge base population track. In: Text Analysis Conference (TAC).
[54] Meij, E., Weerkamp, W., de Rijke, M., 2012. Adding semantics to microblog posts. In: Proceedings of the fifth ACM international conference on Web search and data mining.
[55] Michelson, M., Macskassy, S., 2010. Discovering users' topics of interest on Twitter: a first look. In: Proceedings of the fourth workshop on Analytics for noisy unstructured text data. pp. 73–80.
[56] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., 2006. YALE: Rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD.
[57] Mihalcea, R., Csomai, A., 2007. Wikify!: linking documents to encyclopedic knowledge. In: CIKM. Vol. 7. pp. 233–242.
[58] Mihalcea, R., Tarau, P., 2004. TextRank: Bringing order into texts. In: Proceedings of EMNLP. Vol. 4. pp. 404–411.
[59] Milios, E., Zhang, Y., He, B., Dong, L., 2003. Automatic term extraction and document similarity in special text corpora. In: Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics. pp. 275–284.
[60] Milne, D., Witten, I., 2008. Learning to link with Wikipedia. In: Proceedings of the 17th ACM conference on Information and knowledge management. pp. 509–518.
[61] Milstein, S., Chowdhury, A., Hochmuth, G., Lorica, B., Magoulas, R., 2008. Twitter and the micro-messaging revolution: Communication, connections, and immediacy – 140 characters at a time. O'Reilly Media, Inc.
[62] Nadeau, D., Sekine, S., 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (1), 3–26.
[63] Pak, A., Paroubek, P., 2010. Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of LREC 2010.
[64] Pollach, I., 2006. Electronic word of mouth: A genre analysis of product reviews on consumer opinion web sites. In: Proceedings of the 39th Annual Hawaii International Conference on System Sciences. Vol. 3.
[65] Quinlan, J., 1993. C4.5: programs for machine learning. Morgan Kaufmann.
[66] Sakaki, T., Okazaki, M., Matsuo, Y., 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World Wide Web. WWW '10. pp. 851–860.
[67] Soboroff, I., McCullough, D., Lin, J., Macdonald, C., Ounis, I., McCreadie, R., 2012. Evaluating real-time search over tweets.
[68] Spina, D., Amigó, E., Gonzalo, J., 2011. Filter keywords and majority class strategies for company name disambiguation in Twitter. In: CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation. Springer Berlin/Heidelberg, pp. 50–61.
[69] Spina, D., Meij, E., Oghina, A., Bui, M. T., Breuss, M., de Rijke, M., 2012. A Corpus for Entity Profiling in Microblog Posts. In: LREC Workshop on Language Engineering for Online Reputation Management.
[70] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M., 2010. Short text classification in Twitter to improve information filtering. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10. ACM, New York, NY, USA, pp. 841–842.
[71] Strube, M., Ponzetto, S., 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In: Proceedings of the National Conference on Artificial Intelligence. Vol. 21. p. 1419.
[72] Tsagkias, M., Balog, K., 2010. The University of Amsterdam at WePS3. In: CLEF 2010 Labs and Workshops Notebook Papers.
[73] Tsagkias, M., de Rijke, M., Weerkamp, W., 2011. Linking online news and social media. In: Proceedings of the fourth ACM International Conference on Web Search and Data Mining.
[74] Turney, P., 1999. Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057.
[75] Wan, X., Gao, J., Li, M., Ding, B., 2005. Person resolution in person search results: WebHawk. In: Proceedings of the 14th ACM international conference on Information and knowledge management. pp. 163–170.
[76] Wilson, R., 2003. Keeping a watch on corporate reputation. Strategic Communications Management 7 (2).
[77] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., Nevill-Manning, C. G., 1999. KEA: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on Digital libraries. DL '99. pp. 254–255.
[78] Wu, S., Hofman, J. M., Mason, W. A., Watts, D. J., 2011. Who says what to whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web. WWW '11. ACM, New York, NY, USA, pp. 705–714.
[79] Yang, J., Leskovec, J., 2010. Modeling information diffusion in implicit networks. In: IEEE International Conference on Data Mining. pp. 599–608.
[80] Yang, J., Leskovec, J., 2011. Patterns of temporal variation in online media. In: Proceedings of the fourth ACM international conference on Web search and data mining. WSDM '11. pp. 177–186.
[81] Yerva, S. R., Miklós, Z., Aberer, K., 2010. It was easy when apples and blackberries were only fruits. In: CLEF 2010 Labs and Workshops Notebook Papers.
[82] Yerva, S. R., Miklós, Z., Aberer, K., 2012. Entity-based classification of Twitter messages. IJCSA 9 (1), 88–115.
[83] Yoshida, M., Matsushima, S., Ono, S., Sato, I., Nakagawa, H., 2010. ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management. In: CLEF 2010 Labs and Workshops Notebook Papers.
[84] Zhang, S., Wu, J., Zheng, D., Meng, Y., Xia, Y., Yu, H., 2012. Two stages based organization name disambiguity. Computational Linguistics and Intelligent Text Processing, 249–257.
[85] Zhang, Y., Milios, E., Zincir-Heywood, N., 2007. A comparative study on key phrase extraction methods in automatic web site summarization. Journal of Digital Information Management 5 (5), 323.
[86] Zhang, Y., Zincir-Heywood, N., Milios, E., 2004. Term-based clustering and summarization of web page collections. Advances in Artificial Intelligence, 60–74.
[87] Zhang, Y., Zincir-Heywood, N., Milios, E., 2004. World Wide Web site summarization. Web Intelligence and Agent Systems 2, 39–54.
[88] Zhang, Y., Zincir-Heywood, N., Milios, E., 2005. Narrative text classification for automatic key phrase extraction in web document corpora. In: Proceedings of the 7th annual ACM international workshop on Web information and data management. pp. 51–58.
[89] Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X., 2011. Comparing Twitter and traditional media using topic models. In: Proceedings of the 33rd European conference on Advances in Information Retrieval.
[90] Zhao, X., Jiang, J., He, J., Song, Y., Achananuparp, P., Lim, E., Li, X., 2011. Topical keyphrase extraction from Twitter.