Exploiting Parallel News Streams for Unsupervised Event Extraction

Congle Zhang, Stephen Soderland & Daniel S. Weld
Computer Science & Engineering, University of Washington, Seattle, WA 98195, USA
{clzhang, soderlan, weld}@cs.washington.edu

Abstract

Most approaches to relation extraction, the task of extracting ground facts from natural language text, are based on machine learning and thus starved by scarce training data. Manual annotation is too expensive to scale to a comprehensive set of relations. Distant supervision, which automatically creates training data, only works with relations that already populate a knowledge base (KB). Unfortunately, KBs such as Freebase rarely cover event relations (e.g., "person travels to location"). Thus, the problem of extracting a wide range of events — e.g., from news streams — is an important, open challenge. This paper introduces NEWSSPIKE-RE, a novel, unsupervised algorithm that discovers event relations and then learns to extract them. NEWSSPIKE-RE uses a novel probabilistic graphical model to cluster sentences describing similar events from parallel news streams. These clusters then comprise training data for the extractor. Our evaluation shows that NEWSSPIKE-RE generates high-quality training sentences and learns extractors that perform much better than rival approaches, more than doubling the area under a precision-recall curve compared to Universal Schemas.

1 Introduction

Relation extraction, the process of extracting structured information from natural language text, grows increasingly important for Web search and question answering. Traditional supervised approaches, which can achieve high precision and recall, are limited by the cost of labeling training data and are unlikely to scale to the thousands of relations on the Web. Another approach, distant supervision (Craven and Kumlien, 1999; Wu and Weld, 2007), creates its own training data by matching the ground instances of a knowledge base (KB), e.g., Freebase, to unlabeled text.

Unfortunately, while distant supervision can work well in some situations, the method is limited to relatively static facts (e.g., born-in(person, location) or capital-of(location, location)) where there is a corresponding knowledge base. But what about dynamic event relations (also known as fluents), such as travel-to(person, location) or fire(organization, person)? Since these time-dependent facts are ephemeral, they are rarely stored in a pre-existing KB. At the same time, knowledge of real-time events is crucial for making informed decisions in fields like finance and politics. Indeed, news stories report events almost exclusively, so learning to extract events is an important open problem.

This paper develops a new unsupervised technique, NEWSSPIKE-RE, to both discover event relations and extract them with high precision. The intuition underlying NEWSSPIKE-RE is that the texts of articles from two different news sources are not independent, since each is conditioned on the same real-world events. By looking for rarely described entities that suddenly "spike" in popularity on a given date, one can identify paraphrases. Such temporal correspondence (Zhang and Weld, 2013) allows one to cluster diverse sentences, and the resulting clusters may be used to form training data in order to learn event extractors. Furthermore, one can also exploit parallel news to obtain direct negative evidence.
To see this, suppose one day the news includes the following: (a) "Snowden travels to Hong Kong, off southeastern China." (b) "Snowden cannot stay in Hong Kong as Chinese officials will not allow ..." Since news stories are usually coherent, it is highly unlikely that travel to and stay in (which is negated) are synonymous. By leveraging such direct negative phrases, we can learn extractors capable of distinguishing heavily co-occurring but semantically different phrases, thereby avoiding many extraction errors. Our NEWSSPIKE-RE system encapsulates these intuitions in a novel graphical model, making the following contributions:

• We develop a method to discover a set of distinct, salient event relations from news streams.
• We describe an algorithm that exploits parallel news streams to cluster sentences belonging to the same event relation. In particular, we propose the temporal negation heuristic to avoid conflating co-occurring but non-synonymous phrases.
• We introduce a probabilistic graphical model that generates training data for a sentential event extractor without requiring any human annotations.
• We present detailed experiments demonstrating that the event extractors, learned from the generated training data, significantly outperform several competitive baselines; e.g., our system more than doubles the area under the micro-averaged PR curve (0.80 vs. 0.30) compared to Riedel's Universal Schemas (Riedel et al., 2013).

2 Previous Work

Supervised learning approaches have been widely developed for event extraction tasks such as MUC-4 and ACE. They often focus on a hand-crafted ontology and train the extractor with manually created training data. While they can offer high precision and recall, they are often domain-specific (e.g., biological events (Riedel et al., 2011; McClosky et al., 2011) and entertainment events (Benson et al., 2011; Reichart and Barzilay, 2012)), and are hard to scale to the events on the Web.

Open IE systems extract open-domain relations (e.g., (Banko et al., 2007; Fader et al., 2011)) and events (e.g., (Ritter et al., 2012)). They often perform self-supervised learning of relation-independent extractions, which allows them to scale but leaves them unable to output canonicalized relations.

Distantly supervised approaches learn extractors by exploiting the facts existing in a knowledge base, thus avoiding human annotation. Wu and Weld (2007) and Reschke et al. (2014) learned Infobox relations from Wikipedia, while Mintz et al. (2009) heuristically matched Freebase facts to text. Since the training data generated by heuristic matching is often imperfect, multi-instance learning approaches (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) have been developed to combat this problem. Unfortunately, most facts in KBs are static facts like geographical or biographical data; they fall short for learning extractors for fluent facts such as sports results or a person's travels and meetings.

Bootstrapping is another common extraction technique (Brin, 1999; Agichtein and Gravano, 2000; Carlson et al., 2010; Nakashole et al., 2011; Huang and Riloff, 2013). This typically takes a set of seeds as input, which can be ground instances or key phrases.
The algorithms then iteratively generate more positive instances and phrases. While there are many successful examples of bootstrapping, the challenge is to avoid semantic drift. Large-scale systems, therefore, often require extra processing, such as manual validation between iterations or additional negative seeds as input.

Unsupervised approaches have been developed for relation discovery and extraction. These algorithms are usually based on clustering assumptions over a large unlabeled corpus. Common assumptions include the distributional hypothesis used by (Hasegawa et al., 2004; Shinyama and Sekine, 2006), the latent topic assumption of (Yao et al., 2012; Yao et al., 2011), and the low-rank assumption of (Takamatsu et al., 2011; Riedel et al., 2013). Since these assumptions largely rely on co-occurrence, previous unsupervised approaches tend to confuse correlated but semantically different phrases during extraction. In contrast, our work largely avoids these errors by exploiting the temporal negation heuristic in parallel news streams. In addition, unlike many unsupervised algorithms that require human effort to canonicalize the clusters, our work automatically discovers events with readable names.

Paraphrasing techniques inspire our work. Some techniques, such as DIRT (Lin and Pantel, 2001) and Resolver (Yates and Etzioni, 2009), are based on the distributional hypothesis. Another common approach is to use parallel corpora, including news streams (Barzilay and Lee, 2003; Dolan et al., 2004; Zhang and Weld, 2013), multiple translations of the same story (Barzilay and McKeown, 2001), and bilingual sentence pairs (Ganitkevitch et al., 2013), to generate paraphrases. Although these algorithms create many good paraphrases, they cannot be directly used to generate enough training data for a relation extractor, for two reasons: first, the semantics of a paraphrase is often context dependent; second, the generated paraphrases often form small clusters, and it remains challenging to merge them for the purpose of training an extractor. Our work extends previous paraphrasing techniques, notably that of Zhang and Weld (2013), but we focus on generating high-quality positive and negative training sentences for the discovered events in order to learn extractors with high precision and recall.

Figure 1: During its training phase, NEWSSPIKE-RE first groups parallel sentences as NewsSpikes. Next, the system automatically discovers a set of event relations. Then, a probabilistic graphical model clusters sentences from the NewsSpikes as training data for each discovered relation, which is used to learn sentential event extractors. During the testing phase, the extractor takes test sentences as input and predicts event extractions.

3 System Overview

News articles report an enormous number of events every day. Our system, NEWSSPIKE-RE, aligns parallel news streams to identify and extract these events, as shown in Figure 1. NEWSSPIKE-RE has both training and testing phases. Its training phase has two main steps: event-relation discovery and training-set generation.
Section 4 describes our event relation discovery algorithm, which processes time-stamped news articles to discern a set of salient, distinct event relations of the form E = e(t1, t2), where e is a representative event phrase and the ti are the types of the two arguments. NEWSSPIKE-RE generates the event phrases using an Open Information Extraction (IE) system (Fader et al., 2011), and uses a fine-grained entity recognition system, FIGER (Ling and Weld, 2012), to generate type descriptors such as "company", "politician", and "medical treatment".

The second part of NEWSSPIKE-RE's training phase, described in Section 5, is a method for building extractors for the discovered event relations. Our approach is motivated by the intuition, adapted from Zhang and Weld (2013), that articles from different news sources typically use different sentences to describe the same event, and that corresponding sentences can be identified when they mention a unique pair of real-world entities. For example, when an unusual entity pair (Selena, Norway) is suddenly seen in three articles on a single day:

Selena traveled to Norway to see her ex-boyfriend.
Selena arrived in Norway for a rendezvous with Justin.
Selena's trip to Norway was no coincidence.

it is likely that all three refer to the same event relation, travel-to(person, location)¹, and can be used as positive training examples for that relation. As in Zhang and Weld (2013), we group parallel sentences sharing the same argument pair and date in a structure called a NewsSpike. However, we include all sentences mentioning the arguments (e.g., Selena's trip to Norway) in the NewsSpike (not just those yielding OpenIE extractions), and use the lexicalized dependency path between the arguments (e.g., <-[poss]-trip-[prep-to]->)² as the event phrase. In this way, we can generalize extractors beyond the scope of OpenIE. Formally, a NewsSpike is a tuple (a1, a2, d, S), where a1 and a2 are arguments (e.g., Selena), d is a date, and S is a set of argument-labeled sentences {(s, a1, a2, p), ...} in which s is a sentence with arguments ai and event phrase p.

¹ For clarity, we refer to this relation as travel-to throughout the paper, even though the phrase arrive in is actually more frequent and is selected as the name of this relation by our event discovery algorithm, as shown in Table 2.
² This dependency path will be referred to as "'s trip to".
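To make the data structure concrete, the sketch below renders the NewsSpike tuple in Python. The class and field names are our own illustration, not part of the released system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class LabeledSentence:
    text: str     # the raw sentence s
    arg1: str     # first argument a1, e.g. "Selena"
    arg2: str     # second argument a2, e.g. "Norway"
    phrase: str   # event phrase p, e.g. the path "'s trip to"

@dataclass
class NewsSpike:
    arg1: str                          # shared argument a1
    arg2: str                          # shared argument a2
    day: date                          # publication date d
    sentences: List[LabeledSentence]   # the parallel sentences S

# A NewsSpike groups all same-day sentences mentioning the same
# (a1, a2) pair, e.g. (Selena, Norway) on one date.
```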
It is important that non-synonymous sentences like "Selena stays in Norway" be excluded from the training data for travel-to(person, location), even if a travel-to event did apply to that argument pair. In order to select only the synonymous sentences, we develop a probabilistic graphical model, described in Section 5.2, to accurately assign sentences from NewsSpikes to each discovered event relation E. Given this annotated data, NEWSSPIKE-RE trains extractors using a multi-class logistic regression classifier.

During the testing phase, NEWSSPIKE-RE accepts arbitrary sentences (no date-stamp required), uses FIGER to identify possible arguments, and uses the classifier to predict which events (if any) hold between an argument pair. We describe the extraction process in Section 6.

Note that NEWSSPIKE-RE is an unsupervised algorithm that requires no manual labeling of the training instances. Like distant supervision, the key is to automatically generate the training data, at which point a traditional supervised classifier may be applied to learn an extractor. Because distant supervision creates very noisy annotations, researchers often use specialized learners that model the correctness of a training example with a latent variable (Riedel et al., 2010; Hoffmann et al., 2011), but we found this unnecessary because NEWSSPIKE-RE creates high-quality training data.

4 Discovering Salient Events

The first step of NEWSSPIKE-RE is to discover a set of event relations of the form E = e(t1, t2), where e is an event phrase and the ti are fine-grained argument types generated by FIGER, augmented with the important types "number" and "money", which are recognized by the Stanford named entity recognition system (Finkel et al., 2005). To be most useful, the discovered event relations should cover salient events that are frequently reported in news articles. Formally, we say that a NewsSpike η = (a1, a2, d, S) mentions E = e(t1, t2) if the type of ai is ti for each i, and one of its sentences has e as the event phrase between the arguments. To maximize the salience of the events, NEWSSPIKE-RE prefers event relations that are mentioned by more NewsSpikes.

In addition, the set of event relations should be distinct. For example, if the relation travel-to(person, location) is already in the set, then visit(person, location) should not be selected as a separate relation. To reduce overlap, discovered event relations should not be mentioned by the same NewsSpike.

Let E be the set of all candidate event relations and N be the set of all NewsSpikes. Our goal is to select the K most salient relations from E while minimizing overlap between relations. We can frame this task as a variant of the bipartite graph edge-cover problem. Let a bipartite graph G have one node Ei for each event relation in E and one node ηj for each NewsSpike in N. There is an edge between Ei and ηj if ηj mentions Ei. The edge-cover problem is to select a largest subset of edges subject to two constraints: (1) at most K nodes Ei are chosen, and all edges incident to them are chosen as the covered edges; (2) each node ηj is incident to at most one chosen edge. The first constraint guarantees that exactly K event relations are discovered; the second ensures that no NewsSpike participates in two event relations.

Figure 2: A simple example of the edge-cover algorithm with K = 2, where the Ei are event relations and the ηj are NewsSpikes. The optimal solution selects E1 with edges to η1 and η2, and E3 with an edge to η3. These two event relations cover all the NewsSpikes.

Figure 2 shows the optimal solution for a simple graph with K = 2, which covers 3 edges with 2 event relations that have no overlapping NewsSpikes. Since both the objective function and the constraints are linear, we can optimize this edge-cover problem with integer linear programming (Nemhauser and Wolsey, 1988), as sketched below. By solving the optimization problem, NEWSSPIKE-RE finds a salient set of event relations incident to the covered edges. The discovered relations with K set to 30 are shown in Table 2 in Section 7. In addition, the covered edges provide the initial mapping between the event relations and NewsSpikes, which is used to train the probabilistic model in Section 5.3.
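One possible ILP encoding is shown below, using the PuLP library; the paper does not specify a solver, so the library choice and variable names are our assumptions. The encoding relaxes constraint (1) to "a chosen relation may cover its incident edges": since the objective maximizes covered edges, every edge of a chosen relation is covered unless the one-edge-per-spike constraint forbids it.

```python
import pulp

def discover_events(edges, relations, spikes, K):
    """edges: iterable of (relation, spike) pairs where the spike
    mentions the relation; returns up to K covering relations."""
    prob = pulp.LpProblem("edge_cover", pulp.LpMaximize)
    # x[r] = 1 iff relation r is one of the K chosen relations
    x = {r: pulp.LpVariable(f"x_{i}", cat="Binary")
         for i, r in enumerate(relations)}
    # y[(r, s)] = 1 iff the edge between relation r and spike s is covered
    y = {e: pulp.LpVariable(f"y_{i}", cat="Binary")
         for i, e in enumerate(edges)}

    prob += pulp.lpSum(y.values())           # maximize covered edges
    prob += pulp.lpSum(x.values()) <= K      # at most K relations chosen
    for (r, s), var in y.items():
        prob += var <= x[r]                  # only chosen relations cover
    for s in spikes:                         # each spike covered at most once
        incident = [var for (r2, s2), var in y.items() if s2 == s]
        if incident:
            prob += pulp.lpSum(incident) <= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [r for r in relations if x[r].value() == 1]
```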
5 Generating the Training Sentences

After NEWSSPIKE-RE has discovered a set of event relations, it generates training instances to learn an extractor for each relation. In this section, we present our algorithm for generating the training sentences. As shown in Figure 1, the generator takes N NewsSpikes {ηi = (a1i, a2i, di, Si) | i = 1...N} and K event relations {Ek = ek(t1k, t2k) | k = 1...K} as input. For each event relation Ek, the generator identifies a subset of sentences from ∪Si expressing the event relation as training sentences. Below, we first characterize the paraphrased event phrases and the parallel sentences in NewsSpikes, and then show how to encode these observations in a probabilistic graphical model that jointly paraphrases the event phrases and identifies a set of training sentences.

5.1 Exploiting Properties of Parallel News

Previous work (Zhang and Weld, 2013) proposed several heuristics that are useful for finding similar sentences in a NewsSpike. For example, the temporal functionality heuristic says that sentences in a NewsSpike with the same tense tend to be paraphrases. Unfortunately, these methods are too weak to generate enough data for training high-quality event extractors: (1) they are "in-spike" heuristics that tend to generate small clusters from individual NewsSpikes, and it remains unclear how to merge similar events occurring on different days and between different entities to increase cluster size; (2) they include heuristics that "gain precision at the expense of recall" (e.g., news articles do not state the same fact twice), because it is hard to obtain direct negative phrases inside one NewsSpike. In this paper, we exploit news streams in a cross-spike, global manner to obtain accurate positive and negative signals. This allows us to dramatically improve recall while maintaining high precision.

Our system starts from the basic observation that parallel sentences tend to be coherent. So if a NewsSpike η = (a1, a2, d, S) is an instance of an event relation E = e(t1, t2), the event phrases in its parallel sentences tend to be paraphrases. But sometimes the sentences in a NewsSpike are related yet not paraphrases. For example, one day "Snowden will stay in Hong Kong ..." appears together with "Snowden travels to Hong Kong ...". Although the fact stay-in(Snowden, Hong Kong) is true, it is harmful to include "Snowden will stay in Hong Kong" in the training data for travel-to(person, location).

Detecting paraphrases remains a challenge for most unsupervised approaches, because they tend to cluster heavily co-occurring phrases that may turn out to be semantically different or even antonymous. Zhang and Weld (2013) presented a method to avoid confusion between antonyms and synonyms in NewsSpikes, but did not address the problem of related but different phrases, like travel to and stay in, within a NewsSpike.

To handle this, our method rests on a simple observation: when you read "Snowden travels to Hong Kong" and "Snowden cannot stay in Hong Kong as Chinese officials do not allow ..." in the same NewsSpike, it is unlikely that travel to and stay in are synonymous event phrases, because otherwise the two news stories would be describing opposite events. This observation leads to:

Temporal Negation Heuristic. Two event phrases p and q tend to be semantically different if they co-occur in a NewsSpike but one of them is in negated form.

The temporal negation heuristic helps in two ways: (1) it provides some direct negative phrases for the event relations, which NEWSSPIKE-RE uses to heuristically label some variables in the model; (2) it creates useful features that implement a form of transitivity. For example, if we find that live in and stay in frequently co-occur, and the temporal negation heuristic tells us that travel to and stay in are not paraphrases, this is evidence that live in is unlikely to be a paraphrase of travel to, even if the two are heavily co-occurring.
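As a rough illustration, the heuristic can be implemented as a scan over each NewsSpike's phrase pairs, reusing the NewsSpike sketch above. The token-window negation test here is a crude stand-in of our own devising; in practice one would check for a dependency neg edge on the phrase's head.

```python
from itertools import combinations

NEGATORS = {"not", "n't", "cannot", "never", "no"}

def is_negated(tokens, phrase_head):
    """Crude negation check: does a negator appear just before the
    phrase's head word? (A dependency-based test would be better.)"""
    for i, tok in enumerate(tokens):
        if tok == phrase_head and NEGATORS & set(tokens[max(0, i - 2):i]):
            return True
    return False

def negation_pairs(newsspike):
    """Return phrase pairs (p, q) judged non-synonymous because they
    co-occur in one NewsSpike with exactly one of them negated."""
    polarity = {}  # phrase -> set of observed negation flags
    for sent in newsspike.sentences:
        neg = is_negated(sent.text.split(), sent.phrase.split()[-1])
        polarity.setdefault(sent.phrase, set()).add(neg)
    pairs = set()
    for (p, pol_p), (q, pol_q) in combinations(polarity.items(), 2):
        if (True in pol_p) != (True in pol_q):  # one negated, one never
            pairs.add((p, q))
    return pairs
```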
The following section describes our implementation, which uses these properties to generate high-quality training data. Our goal is the following: a sentence (s, a1, a2, p) from NewsSpike η = (a1, a2, d, S) should be included in the training data for event relation E = e(t1, t2) if the event phrase p is a paraphrase of e and the event relation E holds for the argument pair (a1, a2) at time d.

5.2 Joint Cluster Model

As discussed above, to identify a high-quality set of training sentences from NewsSpikes, one needs to combine evidence that event phrases are paraphrases with evidence from the NewsSpikes themselves. For this purpose, we define an undirected graphical model to jointly reason about paraphrasing the event phrases and identifying the training sentences from NewsSpikes. We first list the notation used in this section:

E        an event relation
p ∈ P    an event phrase
s ∈ Sp   a sentence with event phrase p
Y^p      is p a paraphrase of e?
Z^p_s    is s (with phrase p) a good training sentence for E?
Φ        factors

Let P be the union of all the event phrases from every NewsSpike. For each p ∈ P, let Sp be the set of sentences having p as their event phrase. Figure 3(a) shows the model in plate form. There are two kinds of random variables, corresponding to phrases and sentences, respectively. For each event relation E = e(t1, t2), there is a connected component for every event phrase p ∈ P that models (1) whether p is a paraphrase of e (a Boolean phrase variable Y^p), and (2) whether each sentence of Sp is a good training sentence for E (|Sp| Boolean sentence variables {Z^p_s | s ∈ Sp}). Intuitively, the goal of the model is to find the set of good training sentences, those with Z^p_s = 1. The union of such sentences over the different phrases, ∪p {s | Z^p_s = 1}, defines the training sentences for the event.

Figure 3: (a) The connected components depicted as a plate model, where each Y is a Boolean variable for a relation phrase and each Z is a Boolean variable for a training sentence with that phrase; (b) and (c) are example connected components for the event phrases 's trip to and stay in, respectively. The goal of the model is to set Y = 1 for good paraphrases of a relation and to set Z = 1 for good training sentences.

Figures 3(b) and 3(c) show two example connected components for the event phrases 's trip to and stay in, respectively. Now we can define the joint distribution over the event phrases and the sentences. The joint distribution is a function defined on factors that encode our observations about NewsSpikes as features and constraints. The phrase factor Φ_phrase is a log-linear function attached to Y^p with paraphrasing features, such as whether p and e co-occur in the NewsSpikes, or whether p shares the same head word with e. These features are used to distinguish whether p is a good event phrase.
A sentence should not be identified as a good training sentence if it does not contain a positive event phrase. For example, if Y^{stay in} in Figure 3(c) takes the value 0, then all sentences with the event phrase stay in should also take the value 0. We implement this constraint with a joint factor Φ_joint between Y^p and the Z^p_s variables.

In addition, good training sentences occur when the NewsSpike is an instance of the event. To encode this observation, we need to featurize the NewsSpikes and let them bias the assignments. Our model implements this with two types of log-linear factors: (1) the unary in-spike factor Φ_in depends on a single sentence variable and contains features about the corresponding NewsSpike; it is used to distinguish whether the NewsSpike is an instance of e(t1, t2), e.g., whether the argument types of the NewsSpike match the designated types t1 and t2; (2) the pairwise cross-spike factors Φ_cross connect pairs of sentences, using features such as whether the pair of NewsSpikes for the two sentences have high textual similarity, and whether the two NewsSpikes contain negated event phrases.

We define the joint distribution for the connected component for p as follows. Let Z be the vector of sentence variables and let x be the features. The joint distribution is

p(Y = y, Z = z | x; Θ) ≝ (1/Z_x) Φ_phrase(y, x) × Φ_joint(y, z) × ∏_s Φ_in(z_s, x) × ∏_{s,s'} Φ_cross(z_s, z_{s'}, x),

where the parameter vector Θ consists of the weights of the features in Φ_in and Φ_cross, which are log-linear functions, and Z_x is the partition function. The joint factor Φ_joint is zero when Y^p = 0 but some Z^p_s = 1; otherwise it is 1. We use integer linear programming to perform MAP inference on the model, finding the predictions y, z that maximize the probability.
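To make the factor structure concrete, the sketch below scores an assignment for one connected component and finds the MAP assignment by enumeration. The paper performs MAP inference with ILP; exhaustive search is our simplification and is only viable for small components. The feature functions (phrase_feats, in_feats, cross_feats) are placeholder callables returning feature-name-to-value dicts.

```python
import itertools
import math

def log_score(y, z, x, theta, phrase_feats, in_feats, cross_feats):
    """Unnormalized log-probability of (y, z) for one component."""
    if y == 0 and any(z):       # hard joint factor: a non-paraphrase
        return -math.inf        # phrase admits no positive sentence
    score = sum(theta.get(f, 0.0) * v for f, v in phrase_feats(y, x).items())
    for s, zs in enumerate(z):                       # in-spike factors
        score += sum(theta.get(f, 0.0) * v
                     for f, v in in_feats(s, zs, x).items())
    for s, t in itertools.combinations(range(len(z)), 2):  # cross-spike
        score += sum(theta.get(f, 0.0) * v
                     for f, v in cross_feats(s, z[s], t, z[t], x).items())
    return score

def map_assignment(n_sentences, x, theta, *feats):
    """Brute-force MAP over Y and the n sentence variables."""
    best, best_score = None, -math.inf
    for y in (0, 1):
        for z in itertools.product((0, 1), repeat=n_sentences):
            sc = log_score(y, z, x, theta, *feats)
            if sc > best_score:
                best, best_score = (y, z), sc
    return best
```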
5.3 Learning from Heuristic Labels

We now present the learning algorithm for our joint cluster model. The goal of the learning algorithm is to set Θ for the log-linear functions in the factors so as to maximize the likelihood. We do this in a totally unsupervised manner, since manual annotation is expensive and does not scale to large numbers of event relations. The weights are learned in three steps: (1) NEWSSPIKE-RE creates a set of heuristic labels for a subset of the variables in the graphical model; (2) it uses the heuristic labels as supervision for the model; (3) it updates Θ with the perceptron learning algorithm. The weights are then used to infer the values of the variables that have no heuristic labels. The procedure is summarized in Figure 4.

Input: NewsSpikes and the connected components of the model.
Heuristic Labels:
  1. Find positive and negative phrases and sentences P+, P−, S+, S−.
  2. Label the connected components accordingly, creating {(Y_i^label, Z_i^label)}, i = 1...M.
Learning: Update Θ with the perceptron learning algorithm.
Output: the values of all variables in the connected components, via MAP inference.

Figure 4: Learning from Heuristic Labels.

For each event relation E = e(t1, t2), NEWSSPIKE-RE creates heuristic labels as follows. (1) P+: the temporal functionality heuristic (Zhang and Weld, 2013) says that if an event phrase p co-occurs with e in the NewsSpikes, it tends to be a paraphrase of e. We add the most frequently co-occurring event phrases to P+; P+ also includes e itself. (2) P−: the temporal negation heuristic says that if p and e co-occur in a NewsSpike but one of them is in negated form, p should be negatively labeled. We add those event phrases to P−. If a phrase p appears in both P+ and P−, we remove it from both sets. (3) S+: we first obtain the positive NewsSpikes from the solution of the edge-cover problem in Section 4, treating a NewsSpike η as positive if the edge between η and E is covered. Then every sentence with p ∈ P+ is added to S+. (4) S−: since the event relations discovered in Section 4 tend to be distinct, a sentence is treated as a negative sentence for E if it is heuristically labeled as positive for some E′ ≠ E. In addition, S− includes all sentences with p ∈ P−.

With P+, P−, S+, and S−, we define the heuristically labeled set to be {(Y_i^label, Z_i^label)}, i = 1...M, where M is the number of connected components whose event phrases p are in P+ ∪ P−; Y_i^label = 1 if p ∈ P+ and Y_i^label = 0 if p ∈ P−. Z_i is labeled similarly, except that if a sentence in the connected component is not in S+ ∪ S−, NEWSSPIKE-RE does not include the corresponding variable in Z_i^label. With {(Y_i^label, Z_i^label)}, learning can be done by maximum likelihood estimation:

L(Θ) = log ∏_i p(Y_i = y_i^label, Z_i = z_i^label | x_i, Θ).

Following Collins (2002), we use a fast perceptron learning approach to update Θ. It iterates two steps: (1) MAP inference given the current weights; (2) penalizing the weights if the inferred assignments differ from the heuristically labeled assignments, as sketched below.
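A minimal sketch of this structured-perceptron-style update follows, assuming the map_assignment and feature helpers from the previous sketch, plus a features(y, z, x) function (our naming) that returns the global feature vector of an assignment. For simplicity the sketch assumes fully labeled components, whereas the actual algorithm leaves some Z variables unlabeled.

```python
def perceptron_train(components, theta, features, feats, epochs=10):
    """components: list of (n_sentences, x, y_label, z_label) tuples
    built from the heuristic labels; theta: dict feature -> weight;
    feats: the (phrase_feats, in_feats, cross_feats) tuple."""
    for _ in range(epochs):
        for n, x, y_gold, z_gold in components:
            y_hat, z_hat = map_assignment(n, x, theta, *feats)
            if (y_hat, z_hat) == (y_gold, z_gold):
                continue  # inference agrees with the heuristic labels
            # reward features of the labeled assignment,
            # penalize features of the (wrong) MAP assignment
            for f, v in features(y_gold, z_gold, x).items():
                theta[f] = theta.get(f, 0.0) + v
            for f, v in features(y_hat, z_hat, x).items():
                theta[f] = theta.get(f, 0.0) - v
    return theta
```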
6 Sentential Event Extraction

As shown in Figure 1, we learn the extractors from the generated training sentences. Note that most distantly supervised approaches (Hoffmann et al., 2011; Surdeanu et al., 2012) use multi-instance, aggregate-level training (i.e., the supervision comes from labeled sets of instances instead of individually labeled sentences), and coping with the noise inherent in these multi-instance bags remains a big challenge for distant supervision. In contrast, our sentence-level training data is more direct and minimizes noise. Therefore, we implement the event extractor as a simple multi-class, L2-regularized logistic regression classifier.

As features of the classifier, we use the lexicalized dependency path, the OpenIE phrases, the minimal subtree of the dependency parse, and the bag of words between the arguments. We also augment these with the fine-grained argument types produced by FIGER (Ling and Weld, 2012). The learned event extractor takes individual test sentences (s, a1, a2) as input and predicts whether the sentence expresses an event between (a1, a2).
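Such a classifier can be sketched with scikit-learn as below. The feature extraction here is reduced to the bag of words between the arguments, the dependency path, and the two FIGER types; this is a simplification of the full feature set listed above, and the function names are ours.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def featurize(between_tokens, dep_path, arg1_type, arg2_type):
    feats = {f"bow={t}": 1.0 for t in between_tokens}  # bag of words
    feats[f"path={dep_path}"] = 1.0                    # lexicalized path
    feats[f"t1={arg1_type}"] = 1.0                     # FIGER types
    feats[f"t2={arg2_type}"] = 1.0
    return feats

# X: list of feature dicts from the generated training sentences;
# y: relation names, including a NONE class for argument pairs
# that express no discovered event.
clf = make_pipeline(
    DictVectorizer(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
# clf.fit(X, y); clf.predict([featurize(tokens, path, t1, t2)])
```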
7 Empirical Evaluation

Our evaluation addresses two questions. Section 7.2 considers whether our training-generation algorithm identifies accurate and diverse sentences. Section 7.3 then investigates whether the event extractor, learned from the training sentences, outperforms other extraction approaches.

7.1 Experimental Setup

We follow the procedure described in Zhang and Weld (2013) to collect parallel news streams and generate the NewsSpikes: first, we get news seeds and query the Bing newswire search engine to gather additional time-stamped news articles on similar topics; next, we extract OpenIE tuples from the news articles and group the sentences that share the same arguments and date into NewsSpikes. We collected the news stream corpus from March 1, 2013 to July 1, 2014, and split the dataset into two parts. In the training phase, we use the news streams from 2013 (named NS13) to generate the training sentences; NS13 has 33k NewsSpikes containing 173k sentences. We evaluate extraction performance on news articles collected in 2014 (named NS14); in this way, we make sure the test sentences are unseen during training. There are 15 million sentences in NS14, from which we randomly sampled 100k unique sentences having two different arguments recognized by the named entity recognition system.

For our event discovery algorithm, we set the number of event relations to 30 and ran the algorithm on NS13. The algorithm takes 6 seconds on a 2.3GHz CPU. Note that most previous unsupervised relation discovery algorithms require additional manual post-processing to assign names to the output clusters. In contrast, NEWSSPIKE-RE discovers the event relations fully automatically, and the output is self-explanatory. We list the discovered relations, together with the by-event extraction performance, in Table 2. From the table, we can see that most of the discovered event relations are salient, with little overlap between relations. While we arbitrarily set K to 30 in our experiments, there is no inherent limit to the number of relations, as long as the news corpus provides sufficient support to learn an extractor for each relation. In future work, we plan to explore much larger sets of event relations to see whether extraction accuracy is maintained.

The joint cluster model that identifies training sentences for each event relation E = e(t1, t2) uses the cosine similarity between the event phrase p of a sentence and the canonical phrases of each relation as features in the phrase factors of Figure 3(a). It also includes the cosine similarity between p and a set of "anti-phrases" for the event relation, which are recognized by the temporal negation heuristic. For the in-spike factor, we measure whether the fine-grained argument types of the sentence, as returned by FIGER, match the required types ti. In addition, we implement the features from Zhang and Weld (2013) to measure whether the sentence is describing the event of the NewsSpike. For the cross-spike factors, we use textual-similarity features between the two sets of parallel sentences to measure the distance between the pair of NewsSpikes.
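For instance, the phrase-similarity features might be computed as below; representing a phrase as a bag of its tokens is our assumption, since the paper does not specify the vector representation.

```python
import math
from collections import Counter

def cosine(p, q):
    """Cosine similarity between two phrases as token-count vectors."""
    cp, cq = Counter(p.split()), Counter(q.split())
    dot = sum(cp[t] * cq[t] for t in cp)
    norm = (math.sqrt(sum(v * v for v in cp.values()))
            * math.sqrt(sum(v * v for v in cq.values())))
    return dot / norm if norm else 0.0

def phrase_factor_features(p, canonical_phrases, anti_phrases):
    """Phrase-factor features: similarity to the relation's canonical
    phrases and to its negation-derived anti-phrases."""
    return {
        "max_sim_canonical": max(cosine(p, c) for c in canonical_phrases),
        "max_sim_anti": max((cosine(p, a) for a in anti_phrases),
                            default=0.0),
    }
```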
7.2 Quality of the Generated Training Set

The key to a good learning system is a high-quality training set. In this section, we compare our joint model against pipeline systems that consider paraphrases and argument-type matching sequentially, based on the following paraphrasing techniques.

Basic is based on the temporal functionality heuristic of Zhang and Weld (2013); it treats all event phrases appearing in the same NewsSpike as paraphrases. Yates09 uses Resolver (Yates and Etzioni, 2009) to create clusters of phrases; Resolver measures the similarity between phrases by means of both distributional and textual features. We convert the sentences in NewsSpikes into tuples of the form (a1, p, a2) and run Resolver on these tuples to generate paraphrases. Zhang13 uses the paraphrase set generated by Zhang and Weld (2013). Ganit13 uses the large paraphrase database (PPDB) that Ganitkevitch et al. (2013) built by exploiting bilingual parallel corpora. Note that some of these paraphrasing systems do not handle dependency paths; when p is a dependency path, we use the surface string between the arguments as the phrase. We also conduct ablation tests on NEWSSPIKE-RE to measure the effect of the cross-spike factors and the temporal negation heuristic: w/o cross removes the cross-spike factors from NEWSSPIKE-RE; w/o negation uses the same joint cluster model as NEWSSPIKE-RE but removes the features and heuristic labels coming from the temporal negation heuristic.

We measured the micro- and macro-accuracy of each system by manually labeling 1,000 randomly chosen outputs from each system.³ Annotators read each training sentence and decided whether it was a good example for the particular event. We also report the number of generated sentences. Since the extractor should generalize over sentences with dissimilar expressions, it is crucial to identify sentences with diverse event phrases. Therefore we also measured accuracy and count under a "diverse" condition that considers only the subset of sentences with distinct event phrases.

³ Two oDesk workers were asked to label the dataset; a graduate student then reconciled any disagreements.

system          all: #   mi.  ma.    diverse: #  mi.  ma.
Basic           43,718   .50  .62    12,701      .38  .51
Yates09         15,212   .78  .76       586      .48  .50
Ganit13         14,420   .74  .71     1,210      .53  .53
Zhang13         14,804   .76  .75       890      .63  .61
NEWSSPIKE-RE    20,105   .88  .89     2,156      .71  .72
  w/o cross     16,463   .86  .86     1,883      .67  .69
  w/o negation  33,548   .76  .81     4,019      .64  .68

Table 1: Quality of the generated training sentences (count, micro-accuracy, and macro-accuracy), where "all" includes sentences with all event phrases and "diverse" counts only those with distinct event phrases.

Table 1 shows the accuracy and the number of training examples. The basic temporal system achieves 0.50/0.62 micro-/macro-accuracy overall and 0.38/0.51 in the diverse condition. This shows that NewsSpikes are a promising resource for generating training data, but that elaboration is necessary. Yates09 reaches 0.78/0.76 accuracy overall because its textual features help it recognize many good sentences with similar phrases; in the diverse condition, however, its precision is lower because the distributional hypothesis fails to distinguish correlated but different phrases.

Although Ganit13 and Zhang13 leverage existing paraphrase databases, it is interesting that their accuracy is still not good, largely because paraphrasing often depends on context: e.g., "Cutler hits Martellus Bennett with TD in closing seconds." is not a good example for the beat(team, team) relation, even though hit is a synonym for beat in general. These two systems show that an off-the-shelf paraphrase database is not enough for extraction.

The ablation tests show the effectiveness of the temporal negation heuristic: after turning off the relevant features and heuristic labels, precision drops by about 10 percentage points. In addition, the cross-spike factors bring NEWSSPIKE-RE about 22% more training sentences and also increase accuracy.

We performed bootstrap sampling to test the statistical significance of NEWSSPIKE-RE's improvement in accuracy over each comparison system and each ablation. For each system we computed the accuracy of 10 samples of 100 labeled outputs, then ran a paired t-test comparing each other system's accuracy numbers to NEWSSPIKE-RE's. For all systems but w/o cross, the improvement is strongly significant, with p-value less than 1%. The increase in accuracy compared to w/o cross has borderline significance (p-value 5.5%), but is a clear win given the 22% increase in training-set size.
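A sketch of this significance test with SciPy follows; the sampling scheme shown (resampling batches of 100 labels with replacement) is our reading of the procedure described above.

```python
import random
from scipy import stats

def batch_accuracies(labels, n_batches=10, batch_size=100, seed=0):
    """labels: list of 0/1 correctness judgments for a system's output;
    returns bootstrap accuracies over n_batches resampled batches."""
    rng = random.Random(seed)
    return [sum(rng.choices(labels, k=batch_size)) / batch_size
            for _ in range(n_batches)]

def significance(ours_labels, other_labels):
    """Paired t-test between our accuracies and a baseline's."""
    ours = batch_accuracies(ours_labels)
    other = batch_accuracies(other_labels)
    t_stat, p_value = stats.ttest_rel(ours, other)
    return p_value
```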
7.3 Performance of the Event Extractors

Most previous relation extraction approaches either require a manually labeled training set or work only on a pre-defined set of relations that have ground instances in KBs. The closest work to NEWSSPIKE-RE is Universal Schemas (Riedel et al., 2013), which addresses the limitation of distant supervision that the relations must exist in KBs. Their solution is to treat surface strings, dependency paths, and relations from KBs as equal "schemas", and then to exploit the correlation between the instances and the schemas over a very large unlabeled corpus. In their paper, Riedel et al. evaluated only on static relations from Freebase and achieved state-of-the-art performance. But Universal Schemas can be adapted to handle events by introducing the events as schemas and heuristically finding seed instances. We set up a competing system (R13) as follows: (1) we take the NYTimes corpus published between 1987 and 2007 (Sandhaus, 2008), the dataset used by Riedel et al. (2013), containing 1.8 million New York Times articles; (2) the instances (i.e., the rows of the matrix) are the entity pairs from the news articles; (3) there are two types of columns: some are the extraction features used by NEWSSPIKE-RE, including the lexicalized dependency paths described by Riedel et al.; the others are event relations E = e(t1, t2); (4) for an entity pair (a1, a2), if there is an OpenIE extraction (a1, e, a2) and the entity types of (a1, a2) match (t1, t2), we assume the event relation E is observed on that instance.

As shown in Table 1, parallel news streams are a promising resource for clustering because of the strong correlation between the instances and the event phrases. We therefore train another version of Universal Schemas, R13P, on the parallel news streams NS13. In particular, entity pairs from different NewsSpikes are used as different rows in the matrix.

We would like to measure the precision and recall of the extractors, but it is impossible to fully label all the sentences, so we follow the "pooling" technique described in Riedel et al. (2013) to create the labeled dataset. For every competing system, we sample 100 top outputs for every event relation and add them to the pool. Annotators are shown these sentences and asked to judge whether each sentence expresses the event relation or not. The labeled set then becomes a "gold" standard that can be used to measure precision and pseudo-recall. There are in all 6,178 distinct sentences in the pool, since some outputs are produced by multiple systems; among them, 2,903 sentences are labeled as positive. In Table 2, the # columns show the number of true extractions in the pool for each event relation.

As with the diverse condition in Table 1, it is important that the extractor can correctly predict on diverse sentences that are dissimilar to each other. Thus we conducted a "diverse pooling": for each system, we report numbers for the sentences with distinct dependency paths between the arguments for every discovered event.

Event                                  #     F1 @ max recall      area u/ PR curve     #    area u/ diverse PR curve
                                             R13   R13P  N-RE     R13   R13P  N-RE          R13   R13P  N-RE
acquire(organization,person)           59    0.34  0.33  0.58     0.26  0.26  0.57     20   0.26  0.17  0.58
arrive in(organization,location)       95    0.11  0.40  0.56     0.01  0.12  0.42     18   0.01  0.02  0.50
arrive in(person,location)             130   0.61  0.86  0.86     0.35  0.67  0.93     18   0.26  0.33  0.80
beat(organization,organization)        178   0.42  0.85  0.90     0.14  0.64  0.84     24   0.06  0.53  0.58
beat(person,person)                    107   0.57  0.82  0.94     0.21  0.53  0.91     14   0.08  0.25  0.77
buy(organization,organization)         84    0.47  0.47  0.78     0.25  0.50  0.82     34   0.19  0.40  0.79
defend(person,person)                  41    0.37  0.38  0.52     0.36  0.47  0.65     12   0.13  0.06  0.47
die at(person,number)                  158   0.53  0.97  0.98     0.31  0.93  0.97     17   0.33  0.83  0.94
die(person,time)                       179   0.85  0.91  0.97     0.66  0.80  0.96     16   0.22  0.63  0.87
fire(organization,person)              39    0.36  0.33  0.53     0.32  0.45  0.88     8    0.20  0.10  0.66
hit(event,location)                    33    0.00  0.42  0.64     0.00  0.51  0.48     24   0.00  0.45  0.50
lead(person,organization/sports team)  119   0.77  0.86  0.87     0.57  0.73  0.77     14   0.30  0.36  0.62
leave(person,organization)             61    0.40  0.52  0.59     0.14  0.38  0.57     14   0.07  0.13  0.38
meet with(person,person)               137   0.74  0.86  0.92     0.48  0.73  0.88     14   0.28  0.56  0.93
nominate(person/politician,person)     44    0.12  0.38  0.54     0.13  0.44  0.77     27   0.11  0.53  0.75
pay(organization,money)                134   0.77  0.91  0.93     0.52  0.85  0.90     17   0.33  0.90  0.56
place(organization,person)             34    0.17  0.28  0.50     0.24  0.23  0.95     16   0.19  0.21  0.94
play(person/artist,person)             173   0.92  0.89  0.87     0.88  0.79  0.73     15   0.63  0.56  0.47
release(organization,person)           30    0.18  0.22  0.60     0.08  0.25  0.72     16   0.06  0.15  0.81
replace(person,person)                 115   0.82  0.89  0.94     0.62  0.75  0.87     18   0.46  0.58  0.89
report(government agency,time)         140   0.37  0.84  0.91     0.09  0.74  0.83     35   0.06  0.52  0.70
report(written work,time)              130   0.64  0.85  0.83     0.43  0.82  0.74     22   0.38  0.58  0.51
return to(person/athlete,location)     45    0.14  0.34  0.50     0.03  0.30  0.49     21   0.08  0.23  0.78
shoot(person,number)                   101   0.71  0.89  0.92     0.49  0.74  0.84     8    0.35  0.37  0.48
sign with(person,organization)         129   0.47  0.62  0.89     0.25  0.46  0.85     44   0.15  0.17  0.91
sign(organization,person)              110   0.45  0.71  0.85     0.26  0.63  0.79     26   0.15  0.27  0.66
unveil(organization,product)           88    0.43  0.71  0.44     0.26  0.52  0.30     22   0.31  0.22  0.63
vote(government,time)                  32    0.29  0.24  0.74     0.32  0.25  0.77     19   0.35  0.22  0.83
win at(person,location)                100   0.24  0.68  0.85     0.08  0.60  0.90     40   0.01  0.42  0.90
win(person,event)                      107   0.54  0.77  0.86     0.22  0.63  0.77     19   0.03  0.26  0.78
micro average                          2,903 0.53  0.70  0.81     0.30  0.59  0.80     609  0.15  0.31  0.71
macro average                          97    0.46  0.64  0.76     0.30  0.56  0.76     20   0.20  0.37  0.70

Table 2: Performance of extractors by event relation, reporting F1 at maximum recall and the area under the PR curve. The # columns show the number of true extractions in the pool of sampled output. NEWSSPIKE-RE (labeled N-RE) outperforms two implementations of Riedel's Universal Schemas (see Section 7.3 for details). The advantage of NEWSSPIKE-RE over Universal Schemas is greatest on a diverse test set where each sentence has a distinct event phrase.

Figure 5(a) shows the precision/pseudo-recall curve over all sentences for the three systems. NEWSSPIKE-RE outperforms the competing systems by a large margin. For example, the area under the curve (AUC) of NEWSSPIKE-RE for all sentences is 0.80, while those of R13P and R13 are 0.59 and 0.30; this is a 35% increase over R13P and 2.7 times the area of R13. Similar increases in AUC are observed on the diverse sentences. Table 2 further lists the breakdown for each event relation, as well as the micro and macro averages. Although Universal Schemas had some success on several relations, NEWSSPIKE-RE achieved the best F1 for 26 of the 30 event relations, and the best AUC for 26 of 30; the advantage is even greater in the diverse condition. It is interesting that R13P performs much better than R13, since the data coming from NYTimes is much noisier. A closer look shows that Universal Schemas tends to confuse correlated but different phrases. NEWSSPIKE-RE, however, rarely makes these errors, because our model can effectively exploit negative evidence to distinguish them.
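The evaluation curves can be computed from the pooled gold set as sketched below; the pool's positive count serves as the pseudo-recall denominator, and the function names are ours.

```python
def pr_curve(ranked_outputs, gold, pool_positives):
    """ranked_outputs: extractions sorted by confidence (descending);
    gold: set of pooled outputs labeled correct. Pseudo-recall divides
    by the pool's positive count, not the true (unknown) total."""
    tp, points = 0, []
    for i, out in enumerate(ranked_outputs, start=1):
        tp += out in gold
        points.append((tp / pool_positives, tp / i))  # (recall, precision)
    return points

def auc(points):
    """Area under the PR curve via the trapezoidal rule over recall."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area
```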
7.3.1 Comparing to Distant Supervision

Although most of the event relations in Table 2 cannot be handled by distantly supervised approaches, it is possible to match buy(org, org) to Freebase relations using appropriate database operators such as join and select (Zhang et al., 2012). To evaluate how distant supervision performs, we introduce the system DS on NYT, based on a manual mapping of buy(org, org) to the joined relation⁴ in Freebase. We then match its instances to NYTimes articles and follow the steps of Surdeanu et al. (2012) to train the extractor. The matching to NYTimes yields 264 positive instances with 5,333 sentences, but unfortunately the sentence-level accuracy is only 13%, based on examination of 100 random sentences. Figure 5(b) shows the PR curves for all the competing systems. Distant supervision predicts the top extractions correctly, because the multi-instance technique recognizes some common expressions (e.g., buy, acquire), but precision then drops dramatically, since most positive expressions are overwhelmed by the noise.

⁴ /organization/organization/companies_acquired joined with /business/acquisition/company_acquired

Figure 5: Precision/pseudo-recall curves for (a) all event relations; (b) buy(org, org), where the figure also includes the distant supervision algorithm MIML, learned by matching the Freebase relation⁴ to The New York Times. NEWSSPIKE-RE has AUC 0.80, more than doubling R13 (0.30) and 35% higher than R13P (0.59) over all event relations.

8 Conclusions and Future Work

Popular distantly supervised approaches have limited ability to handle event extraction, since fluent facts are highly time-dependent and often do not exist in any KB. This paper presents a novel unsupervised approach for event extraction that exploits parallel news streams. Our NEWSSPIKE-RE system automatically identifies a set of argument-typed event relations from a news corpus, and then learns a sentential (micro-reading) extractor for each event.

We introduced a novel temporal negation heuristic for parallel news streams that identifies event phrases that are correlated but are not paraphrases. We encoded this heuristic in a probabilistic graphical model that clusters sentences, generating high-quality training data from which to learn a sentential extractor. This provides the negative evidence crucial to achieving high-precision training data.

Experiments show the high quality of the generated training sentences and confirm the importance of our negation heuristic. Our most important experiment shows that we can learn accurate event extractors from this training data: NEWSSPIKE-RE outperforms comparable extractors by a wide margin, more than doubling the area under a precision-recall curve compared to Universal Schemas.

In future work we plan to implement our system as an end-to-end online service. This would allow users to conveniently define events of interest, learn extractors for each event, and return extracted facts from news streams.

Acknowledgments

We thank Hal Daumé III, Xiao Ling, Luke Zettlemoyer, and the reviewers.
This work was supported by ONR grant N00014-12-1-0211, the WRF/Cable Professorship, a gift from Google, and the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In ACM DL, pages 85–94.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), pages 2670–2676.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL), pages 16–23.

Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (ACL), pages 50–57.

Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL), pages 389–398.

Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases, pages 172–183.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-10).

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 1–8.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB), pages 77–86.

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), page 350.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1535–1545.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL), pages 363–370.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2013), pages 758–764.
Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. 2004. Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (ACL), page 415.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT-ACL), pages 541–550.

Ruihong Huang and Ellen Riloff. 2013. Multi-faceted event recognition with bootstrapped dictionaries. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL), pages 41–51.

Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question-answering. Natural Language Engineering, 7(4):343–360.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Association for the Advancement of Artificial Intelligence (AAAI).

David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event extraction as dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT-ACL), pages 1626–1635.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1003–1011.

Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. 2011. Scalable knowledge harvesting with high precision and high recall. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pages 227–236.

George L. Nemhauser and Laurence A. Wolsey. 1988. Integer and Combinatorial Optimization, volume 18. Wiley, New York.

Roi Reichart and Regina Barzilay. 2012. Multi-event extraction guided by global constraints. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL), pages 70–79.

Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher D. Manning, and Daniel Jurafsky. 2014. Event extraction using distant supervision. In Language Resources and Evaluation Conference (LREC).

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases (ECML), pages 148–163.

Sebastian Riedel, David McClosky, Mihai Surdeanu, Andrew McCallum, and Christopher D. Manning. 2011. Model combination for event extraction in BioNLP 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, pages 51–55.

Sebastian Riedel, Limin Yao, Benjamin M. Marlin, and Andrew McCallum. 2013. Relation extraction with matrix factorization and universal schemas. In Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL).

Alan Ritter, Oren Etzioni, Sam Clark, et al. 2012. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1104–1112.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium.

Yusuke Shinyama and Satoshi Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL), pages 304–311.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), pages 455–465.

Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2011. Probabilistic matrix factorization leveraging contexts for unsupervised relation extraction. In Advances in Knowledge Discovery and Data Mining, pages 87–99.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 41–50.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1456–1466.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pages 712–720.

Alexander Yates and Oren Etzioni. 2009. Unsupervised methods for determining object and relation synonyms on the web. Journal of Artificial Intelligence Research, 34(1):255.

Congle Zhang and Daniel S. Weld. 2013. Harvesting parallel news streams to generate paraphrases of event relations. In Proceedings of the 2013 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), pages 455–465.

Congle Zhang, Raphael Hoffmann, and Daniel S. Weld. 2012. Ontological smoothing for relation extraction with minimal supervision. In Association for the Advancement of Artificial Intelligence (AAAI).