A Hierarchical Distance-dependent Bayesian Model for Event Coreference Resolution

Bishan Yang, Claire Cardie
Department of Computer Science, Cornell University
{bishan, cardie}@cs.cornell.edu

Peter Frazier
School of Operations Research and Information Engineering, Cornell University
pf98@cornell.edu

Abstract

We present a novel hierarchical distance-dependent Bayesian model for event coreference resolution. While existing generative models for event coreference resolution are completely unsupervised, our model allows for the incorporation of pairwise distances between event mentions — information that is widely used in supervised coreference models — to guide the generative clustering process for better event clustering both within and across documents. We model the distances between event mentions using a feature-rich learnable distance function and encode them as Bayesian priors for nonparametric clustering. Experiments on the ECB+ corpus show that our model outperforms state-of-the-art methods for both within- and cross-document event coreference resolution.

1 Introduction

The task of event coreference resolution consists of identifying text snippets that describe events, and then clustering them such that all event mentions in the same partition refer to the same unique event. Event coreference resolution can be applied within a single document or across multiple documents and is crucial for many natural language processing tasks including topic detection and tracking, information extraction, question answering and textual entailment (Bejan and Harabagiu, 2010). More importantly, event coreference resolution is a necessary component in any reasonable, broadly applicable computational model of natural language understanding (Humphreys et al., 1997).

In comparison to entity coreference resolution (Ng, 2010), which deals with identifying and grouping noun phrases that refer to the same discourse entity, event coreference resolution has not been extensively studied. This is, in part, because events typically exhibit a more complex structure than entities: a single event can be described via multiple event mentions, and a single event mention can be associated with multiple event arguments that characterize the participants in the event as well as spatio-temporal information (Bejan and Harabagiu, 2010). Hence, the coreference decisions for event mentions usually require the interpretation of event mentions and their arguments in context. See, for example, Figure 1, in which five event mentions across two documents all refer to the same underlying event: Plane bombs Yida camp.

Figure 1: Examples of event coreference. Mutually coreferent event mentions are underlined and in boldface; participant and spatio-temporal information for the highlighted event (Plane bombs Yida camp) is marked by curly brackets. (The figure shows sentences from two documents, among them: "The {Yida refugee camp} {in South Sudan} was bombed {on Thursday}."; "The {Yida refugee camp} was the target of an air strike {in South Sudan} {on Thursday}."; "{Two bombs} fell {within the Yida camp}, including {one} {close to the school}."; "{At least four bombs} were reportedly dropped."; "{Four bombs} were dropped within just a few moments - {two} {inside the camp itself}, while {the other two} {near the airstrip}.")

Most previous approaches to event coreference resolution (e.g., Ahn (2006), Chen et al. (2009)) operated by extending the supervised pairwise classification model that is widely used in entity coreference resolution (e.g., Ng and Cardie (2002)).
In this framework, pairwise distances between event mentions are modeled via event-related features (e.g., that indicate event argument compatibility), and agglomerative clustering is applied to greedily merge event mentions into clusters. A major drawback of this general approach is that it makes hard decisions on the merging and splitting of clusters based on heuristics derived from the pairwise distances. In addition, it only captures pairwise coreference decisions within a single document and can not account for signals that commonly appear across documents. More recently, Bejan and Harabagiu (2010; 2014) proposed several nonparametric Bayesian models for event coreference resolution that probabilistically infer event clusters both within a document and across multiple documents. Their method, however, is completely unsupervised, and thus can not encode any readily available supervisory information to guide the model toward better event clustering.

To address these limitations, we propose a novel Bayesian model for within- and cross-document event coreference resolution. It leverages supervised feature-rich modeling of pairwise coreference relations and generative modeling of cluster distributions, and thus allows for both probabilistic inference over event clusters and easy incorporation of pairwise linking preferences. Our model builds on the framework of the distance-dependent Chinese restaurant process (DDCRP) (Blei and Frazier, 2011), which was introduced to incorporate data dependencies into nonparametric clustering models. Here, however, we extend the DDCRP to allow the incorporation of feature-based, learnable distance functions as clustering priors, thus encouraging event mentions that are close in meaning to belong to the same cluster. In addition, we introduce to the DDCRP a representational hierarchy that allows event mentions to be grouped within a document and within-document event clusters to be grouped across documents.

To investigate the effectiveness of our approach, we conduct extensive experiments on the ECB+ corpus (Cybulska and Vossen, 2014b), an extension to EventCorefBank (ECB) (Bejan and Harabagiu, 2010) and the largest corpus available that contains event coreference annotations within and across documents. We show that integrating pairwise learning of event coreference relations with unsupervised hierarchical modeling of event clustering achieves promising improvements over state-of-the-art approaches for within- and cross-document event coreference resolution.

2 Related Work

Coreference resolution in general is a difficult natural language processing (NLP) task and typically requires sophisticated inferentially-based knowledge-intensive models (Kehler, 2002). Extensive work in the literature focuses on the problem of entity coreference resolution and many techniques have been developed, including rule-based deterministic models (e.g. Cardie and Wagstaff (1999), Raghunathan et al. (2010), Lee et al. (2011)) that traverse over mentions in certain orderings and make deterministic coreference decisions based on all available information at the time; supervised learning-based models (e.g. Stoyanov et al.
(2009), Rahman and Ng (2011), Durrett and Klein (2013)) that make use of rich linguistic features and the annotated corpora to learn more powerful coreference functions; and fi- nally, unsupervised models (e.g. Bhattacharya and Getoor (2006), Haghighi and Klein (2007, 2010)) that successfully apply generative modeling to the coreference resolution problem. Event coreference resolution is a more complex task than entity coreference resolution (Humphreys et al., 1997) and also has been relatively less stud- ied. Existing work has adapted similar ideas to those used in entity coreference. Humphreys et al. (1997) first proposed a deterministic cluster- ing mechanism to group event mentions of pre- specified types based on hard constraints. Later ap- proaches (Ahn, 2006; Chen et al., 2009) applied learning-based pairwise classification decisions us- ing event-specific features to infer event clustering. Bejan and Harabagiu (2010; 2014) proposed sev- eral unsupervised generative models for event men- tion clustering based on the hierarchical Dirichlet process (HDP) (Teh et al., 2006). Our approach is related to both supervised clustering and gener- ative clustering approaches. It is a nonparametric Bayesian model in nature but encodes rich linguis- tic features in clustering priors. More recent work 518 modeled both entity and event information in event coreference. Lee et al. (2012) showed that itera- tively merging entity and event clusters can boost the clustering performance. Liu et al. (2014) demon- strated the benefits of propagating information be- tween event arguments and event mentions during a post-processing step. Other work modeled event coreference as a predicate argument alignment prob- lem between pairs of sentences, and trained clas- sifiers for making alignment decisions (Roth and Frank, 2012; Wolfe et al., 2015). Our model also leverages event argument information into the de- cisions of event coreference but incorporates it into Bayesian clustering priors. Most existing coreference models, both for events and entities, focus on solving the within-document coreference problem. Cross-document coreference has attracted less attention due to lack of annotated corpora and the requirement for larger model capac- ity. Hierarchical models (Singh et al., 2010; Wick et al., 2012; Haghighi and Klein, 2007) have been pop- ular choices for cross-document coreference as they can capture coreference at multiple levels of gran- ularities. Our model is also hierarchical, capturing both within- and cross-document coreference. Our model is also closely related to the distance-dependent Chinese Restaurant Process (DDCRP) (Blei and Frazier, 2011). The DDCRP is an infinite clustering model that can account for data dependencies (Ghosh et al., 2011; Socher et al., 2011). But it is a flat clustering model and thus can- not capture hierarchical structure that usually exists in large data collections. Very little work has ex- plored the use of DDCRP in hierarchical clustering models. Kim and Oh (2011; Ghosh et al. (2011) combined a DDCRP with a standard CRP in a two- level hierarchy analogous to the HDP with restricted distance functions. Ghosh et al. (2014) proposed a two-level DDCRP with data-dependent distance- based priors at both levels. Our model is also a two- level DDCRP model but differs in that its distance function is learned using a feature-rich log-linear model. We also derive an effective Gibbs sampler for posterior inference. 
3 Problem Formulation

We adopt the terminology from ECB+ (Cybulska and Vossen, 2014b), a corpus that extends the widely used EventCorefBank (ECB (Bejan and Harabagiu, 2010)). An event is something that happens or a situation that occurs (Cybulska and Vossen, 2014a). It consists of four components: (1) an Action: what happens in the event; (2) Participants: who or what is involved; (3) a Time: when the event happens; and (4) a Location: where the event happens. We assume that each document in the corpus consists of a set of mentions — text spans — that describe event actions, their participants, times, and locations. Table 1 shows examples of these in the sentence "Sudan bombs Yida refugee camp in South Sudan on Thursday, Nov 10th, 2011."

Table 1: Mentions of event components
  Action       bombs
  Participant  Sudan, Yida refugee camp
  Time         Thursday, Nov 10, 2011
  Location     South Sudan

In this paper, we also use the term event mention to refer to the mention of an event action, and event arguments to refer collectively to mentions of the participants, times and locations involved in the event. Event mentions are usually noun phrases or verb phrases that clearly describe events. Two event mentions are considered coreferent if they refer to the same actual event, i.e. a situation involving a particular combination of action, participants, time and location. Note that in text, not all event arguments are always present for an event mention; they may even be distributed over different sentences. Thus whether two event mentions are coreferential should be determined based on the context. For example, in Figure 1, the event mention dropped in DOCUMENT 1 corefers with air strike in the same document as they describe the same event, Plane bombs Yida camp, in the discourse context; it also corefers with dropped in DOCUMENT 2 based on the contexts of both documents.

The problem of event coreference resolution can be divided into two sub-problems: (1) event extraction: extracting event mentions and event arguments, and (2) event clustering: grouping event mentions into clusters according to their coreference relations. We consider both within- and cross-document event coreference resolution and hypothesize that leveraging context information from multiple documents will improve both within- and cross-document coreference resolution. In the following, we first describe the event extraction step and then focus on the event clustering step.

4 Event Extraction

The goal of event extraction is to extract from a text all event mentions (actions) and event arguments (the associated participants, times and locations). One might expect that event actions could be extracted reasonably well by identifying verb groups; and event arguments, by applying semantic role labeling (SRL) to identify, for example, the Agent and Patient of each predicate. Unfortunately, most SRL systems only handle verbal predicates and so would miss event mentions described via noun phrases. In addition, SRL systems are not designed to capture event-specific arguments. Accordingly, we found that a state-of-the-art SRL system (SwiRL (Surdeanu et al., 2007)) extracted only 56% of the actions, 76% of participants, 65% of times and 13% of locations for events in a development set of ECB+ based on a head word matching evaluation measure. (We provide dataset details in Section 6.)
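The head-word matching measure used to produce these recall figures is straightforward to operationalize. The following is a minimal sketch under assumed data structures (the Mention container and the sentence-level matching criterion are our own simplifications), not the evaluation script used in the paper.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Mention:
        doc_id: str
        sent_id: int
        head: str          # lemmatized head word of the mention span

    def headword_recall(gold, predicted):
        """Fraction of gold mentions whose head word is matched by some
        predicted mention in the same sentence (head-word matching)."""
        pred_keys = {(m.doc_id, m.sent_id, m.head.lower()) for m in predicted}
        if not gold:
            return 0.0
        hits = sum((g.doc_id, g.sent_id, g.head.lower()) in pred_keys for g in gold)
        return hits / len(gold)

    gold = [Mention("d1", 0, "bomb"), Mention("d1", 0, "camp")]
    pred = [Mention("d1", 0, "bomb")]
    print(headword_recall(gold, pred))   # 0.5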
To produce higher recall, we adopt a supervised approach and train an event extractor using sentences from ECB+, which are annotated for event actions, participants, times and locations. Because these mentions vary widely in their length and grammatical type, we employ semi-Markov CRFs (Sarawagi and Cohen, 2004) using the loss-augmented objective of Yang and Cardie (2014) that provides more accurate detection of mention boundaries. We make use of a rich feature set that includes word-level features such as unigrams, bigrams, POS tags, WordNet hypernyms, synonyms and FrameNet semantic roles, and phrase-level features such as phrasal syntax (e.g., NP, VP) and phrasal embeddings (constructed by averaging word embeddings produced by word2vec (Mikolov et al., 2013)). Our experiments on the same (held-out) development data show that the semi-CRF-based extractor correctly identifies 95% of actions, 90% of participants, 94% of times and 74% of locations, again based on head word matching.

Note that the semi-CRF extractor identifies event mentions and event arguments but not relationships among them, i.e. it does not associate arguments with an event mention. Lacking supervisory data in the ECB+ corpus for training an event action-argument relation detector, we assume that all event arguments identified by the semi-CRF extractor are related to all event mentions in the same sentence and then apply SRL-based heuristics to augment and further disambiguate intra-sentential action-argument relations (using the SwiRL SRL). More specifically, we link each verbal event mention to the participants that match its ARG0, ARG1 or ARG2 semantic role fillers; similarly, we associate with the event mention the time and locations that match its AM-TMP and AM-LOC role fillers, respectively. For each nominal event mention, we associate those participants that match the possessor of the mention since these were suggested in Lee et al. (2012) as playing the ARG0 role for nominal predicates.
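A rough sketch of this association heuristic is given below. The EventMention and Argument containers and the "possessor" role label are hypothetical conveniences for illustration; the actual system operates on semi-CRF and SwiRL output rather than these simplified structures.

    from dataclasses import dataclass

    @dataclass
    class EventMention:
        id: int
        sent_id: int
        is_verbal: bool

    @dataclass
    class Argument:
        id: int
        sent_id: int
        type: str  # "participant", "time", or "location"

    def associate_arguments(event_mentions, arguments, srl_roles):
        """Attach arguments to event mentions in the same sentence, using SRL
        role fillers (e.g., from SwiRL) to disambiguate where available.
        srl_roles maps (mention_id, role_label) -> set of argument ids."""
        role_map = {"participant": ("ARG0", "ARG1", "ARG2"),
                    "time": ("AM-TMP",),
                    "location": ("AM-LOC",)}
        links = {m.id: set() for m in event_mentions}
        for m in event_mentions:
            for a in (a for a in arguments if a.sent_id == m.sent_id):
                if m.is_verbal:
                    roles = role_map[a.type]
                else:
                    # nominal predicates: possessor participants play the ARG0 role
                    roles = ("possessor",) if a.type == "participant" else ()
                if any(a.id in srl_roles.get((m.id, r), set()) for r in roles):
                    links[m.id].add(a.id)
        return links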
5 Event Clustering

Now we describe our proposed Bayesian model for event clustering. Our model is a hierarchical extension of the distance-dependent Chinese Restaurant Process (DDCRP). It first groups event mentions within a document to form within-document event clusters and then groups these event clusters across documents to form global clusters. The model can account for the similarity between event mentions during the clustering process, putting a bias toward clusters comprised of event mentions that are similar to each other based on the context. To capture event similarity, we use a log-linear model with rich syntactic and semantic features, and learn the feature weights using gold-standard data.

5.1 Distance-dependent Chinese Restaurant Process

The Distance-dependent Chinese Restaurant Process (DDCRP) is a generalization of the Chinese Restaurant Process (CRP) that models distributions over partitions. In a CRP, the generative process can be described by imagining data points as customers in a restaurant and the partitioning of data as tables at which the customers sit. The process randomly samples the table assignment for each customer sequentially: the probability of a customer sitting at an existing table is proportional to the number of customers already sitting at that table, and the probability of sitting at a new table is proportional to a scaling parameter. For each customer sitting at the same table, an observation can be drawn from a distribution determined by the parameter associated with that table. Despite the sequential sampling process, the CRP makes the assumption of exchangeability: the permutation of the customer ordering does not change the probability of the partitions.

The exchangeability assumption may not be reasonable for clustering data that has clear inter-dependencies. The DDCRP allows the incorporation of data dependencies in infinite clustering, encouraging data points that are closer to each other to be grouped together. In the generative process, instead of directly sampling a table assignment for each customer, it samples a customer link, linking the customer to another customer or itself. The clustering can be uniquely constructed once the customer links are determined for all customers: two customers belong to the same cluster if and only if one can reach the other by traversing the customer links (treating these links as undirected).

More formally, consider a sequence of customers 1, ..., n, and denote a = (a_1, ..., a_n) as the assignments of the customer links. a_i ∈ {1, ..., n} is drawn from

    \[ p(a_i = j \mid F, \alpha) \propto \begin{cases} F(i, j), & j \neq i \\ \alpha, & j = i \end{cases} \tag{1} \]

where F is a distance function and F(i, j) is a value that measures the distance between customers i and j. α is a scaling parameter, measuring self-affinity. For each customer, the observation is generated by the per-table parameters as in the CRP. A DDCRP is said to be sequential if F(i, j) = 0 when i < j, so customers may link only to themselves and to previous customers.

5.2 A Hierarchical Extension of the DDCRP

We can model within-document coreference resolution using a sequential DDCRP. Imagining customers as event mentions and the restaurant as a document, each mention can either refer to an antecedent mention in the document or to no other mention, starting the description of a new event. However, coreference relations may also exist across documents — the same event may be described in multiple documents. Thus it is ideal to have a two-level clustering model that can group event mentions within a document and further group them across documents. Therefore we propose a hierarchical extension of the DDCRP (HDDCRP) that employs a DDCRP twice: the first-level DDCRP links mentions based on within-document distances and the second-level DDCRP links the within-document clusters based on cross-document distances, forming larger clusters in the corpus.

The generative process of an HDDCRP can be described using the same "Chinese Restaurant" metaphor. Imagine a collection of documents as a collection of restaurants, and the event mentions in each document as customers entering a restaurant. The local (within-document) event clusters correspond to tables. The global (within-corpus) event clusters correspond to menus (tables that serve the same menu belong to the same cluster). The hidden variables are the customer links and the table links. Figure 2 shows a configuration of these variables and the corresponding clustering structure.

Figure 2: A cluster configuration generated by the HDDCRP. Each restaurant is represented by a rectangle. The small green circles represent customers. The ovals represent tables and the colors reflect the clustering. Each customer is assigned a customer link (a solid arrow), linking to itself or another customer in the same restaurant. The customer who first sits at a table is assigned a table link (a dashed arrow), linking to itself or another customer in a different restaurant, resulting in the linking of two tables.
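As a concrete illustration of the linking prior in Equation (1), the sketch below samples customer links under a flat sequential DDCRP. The distance function here is a toy placeholder rather than the learned, feature-based function introduced in Section 5.4.

    import random

    def sample_customer_link(i, F, alpha, rng=random):
        """Draw a_i for customer i under the sequential DDCRP prior of Eq. (1):
        p(a_i = j) is proportional to F(i, j) for previous customers j < i,
        and to alpha for j = i (a self-link starts a new table)."""
        candidates = list(range(i + 1))
        weights = [alpha if j == i else F(i, j) for j in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]

    # Toy distance: mentions with nearby indices are more likely to link.
    F = lambda i, j: 1.0 / (1.0 + abs(i - j))
    links = [sample_customer_link(i, F, alpha=0.5) for i in range(6)]
    print(links)   # e.g., [0, 0, 1, 3, 3, 4]; links[i] == i marks a new table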
More formally, the generative process for the HDDCRP can be described as follows:

1. For each restaurant d ∈ {1, ..., D}, for each customer i ∈ {1, ..., n_d}, sample a customer link using a sequential DDCRP:

    \[ p(a_{i,d} = (j, d)) \propto \begin{cases} F_d(i, j), & j < i \\ \alpha_d, & j = i \\ 0, & j > i \end{cases} \tag{2} \]

2. For each restaurant d ∈ {1, ..., D}, for each table t, sample a table link for the customer (i, d) who first sits at t using a DDCRP:

    \[ p(c_{i,d} = (j, d')) \propto \begin{cases} F_0((i, d), (j, d')), & j \in \{1, \ldots, n_{d'}\},\ d' \neq d \\ \alpha_0, & j = i,\ d' = d \end{cases} \tag{3} \]

3. Calculate clusters z(a, c) by traversing all the customer links a and the table links c. Two customers are in the same cluster if and only if there is a path from one to the other along the links, where we treat both table and customer links as undirected.

4. For each cluster k ∈ z(a, c), sample parameters φ_k ∼ G_0(λ).

5. For each customer i in cluster k, sample an observation x_i ∼ p(·|φ_{z_i}) where z_i = k.

F_{1:D} and F_0 are distance functions that map a pair of customers to a distance value. We will discuss them in detail in Section 5.4.

5.3 Posterior Inference with Gibbs Sampling

The central computation problem for the HDDCRP model is posterior inference — computing the conditional distribution of the hidden variables given the observations, p(a, c | x, α_0, F_0, α_{1:D}, F_{1:D}). The posterior is intractable due to a combinatorial number of possible link configurations. Thus we approximate the posterior using Markov Chain Monte Carlo (MCMC) sampling, and specifically using a Gibbs sampler. In developing this Gibbs sampler, we first observe that the generative process is equivalent to one that, in step 2, samples a table link for all customers, and then in step 3, when calculating z(a, c), includes only those table links c_{i,d} originating at customers (i, d) that started a new table, i.e. that chose a_{i,d} = (i, d).

The Gibbs sampler for the HDDCRP iteratively samples a customer link for each customer (i, d) from

    \[ p(a^*_{i,d} \mid a_{-(i,d)}, c, x, \lambda) \propto p(a^*_{i,d})\, H_a(x, z, \lambda) \tag{4} \]

where

    \[ H_a(x, z, \lambda) = \frac{p(x \mid z(a_{-(i,d)} \cup a^*_{i,d}, c), \lambda)}{p(x \mid z(a_{-(i,d)}, c), \lambda)}. \]

After sampling all the customer links, it samples a table link for all customers (i, d) according to

    \[ p(c^*_{i,d} \mid a, c_{-(i,d)}, x, \lambda) \propto p(c^*_{i,d})\, H_c(x, z, \lambda) \tag{5} \]

where

    \[ H_c(x, z, \lambda) = \frac{p(x \mid z(a, c_{-(i,d)} \cup c^*_{i,d}), \lambda)}{p(x \mid z(a, c_{-(i,d)}), \lambda)}. \]

For those customers (i, d) that did not start a new table, i.e. with a_{i,d} ≠ (i, d), the table link c^*_{i,d} does not affect the clustering, and so H_c(x, z, λ) = 1 in this case.

Referring back to the event coreference example in Figure 1, Figure 3 shows an example of variable configuration for the HDDCRP model and the corresponding coreference clusters.

Figure 3: An example of event clustering and the corresponding variable assignments (customer links a1=1, a2=2, a3=3, a4=4, a5=4 and table links c1=3, c2=2, c3=2, c4=2, c5=5 [ina]). The assignments of a induce tables, or within-document (WD) clusters, and the assignments of c induce menus, or cross-document (CD) clusters. [ina] denotes that the variable is inactive and will not affect the clustering.
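Step 3 of the generative process (and the configuration in Figure 3) amounts to a connected-components computation over the link graph. The sketch below is a minimal illustration under assumed inputs: mentions are (document, index) pairs, and, following the observation made for the Gibbs sampler, a table link is treated as active only when its customer started a new table. The document membership in the toy example is illustrative, not taken from the figure.

    def clusters_from_links(customer_links, table_links):
        """customer_links: dict (d, i) -> (d, j), within-document links (j <= i).
        table_links:       dict (d, i) -> (d', j), cross-document (or self) links.
        Returns a dict mapping every mention to a cluster representative: two
        mentions corefer iff they are connected by undirected links (step 3)."""
        nodes = (set(customer_links) | set(customer_links.values())
                 | set(table_links) | set(table_links.values()))
        parent = {x: x for x in nodes}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        def union(x, y):
            parent[find(x)] = find(y)

        for i, j in customer_links.items():       # customer links always count
            union(i, j)
        for i, j in table_links.items():          # table links count only for
            if customer_links.get(i) == i:        # customers that started a table
                union(i, j)
        return {x: find(x) for x in nodes}

    # A configuration in the spirit of Figure 3 (document membership illustrative):
    a = {("doc1", 1): ("doc1", 1), ("doc1", 2): ("doc1", 2),
         ("doc2", 3): ("doc2", 3), ("doc2", 4): ("doc2", 4), ("doc2", 5): ("doc2", 4)}
    c = {("doc1", 1): ("doc2", 3), ("doc1", 2): ("doc1", 2),
         ("doc2", 3): ("doc1", 2), ("doc2", 4): ("doc1", 2), ("doc2", 5): ("doc2", 5)}
    print(clusters_from_links(a, c))   # all five mentions fall into a single cluster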
In implementation, we can simplify the computations of both H_a(x, z, λ) and H_c(x, z, λ) by using the fact that the likelihood under clustering z(a, c) can be factorized as

    \[ p(x \mid z(a, c), \lambda) = \prod_{k \in z(a,c)} p(x_{z=k} \mid \lambda) \]

where x_{z=k} denotes all customers that belong to the global cluster k. p(x_{z=k}|λ) is the marginal probability. It can be computed as

    \[ p(x_{z=k} \mid \lambda) = \int p(\phi \mid \lambda) \prod_{i : z_i = k} p(x_i \mid \phi)\, d\phi \]

where x_i is the observation associated with customer i. In our problem, the observation corresponds to the lemmatized words in the event mention. We model the observed word counts using cluster-specific multinomial distributions with symmetric Dirichlet priors.

5.4 Feature-based Distance Functions

The distance functions F_{1:D} and F_0 encode the priors for the clustering distribution, preferring clustering data points that are closer to each other. We consider event mentions as the data points and encode the similarity (or compatibility) between event mentions as priors for event clustering. Specifically, we use a log-linear model to estimate the similarity between a pair of event mentions (x_i, x_j):

    \[ f_\theta(x_i, x_j) \propto \exp\{\theta^{T} \psi(x_i, x_j)\} \tag{6} \]

where ψ is a feature vector, containing a rich set of features based on event mentions i and j: (1) head word string match, (2) head POS pair, (3) cosine similarity between the head word embeddings (we use the pre-trained 300-dimensional word embeddings from word2vec, https://code.google.com/p/word2vec/), (4) similarity between the words in the event mentions (based on term frequency (TF) vectors), (5) the Jaccard coefficient between the WordNet synonyms of the head words, and (6) similarity between the context words (a window of three words before and after each event mention). If both event mentions involve participants, we consider the similarity between the words in the participant mentions based on the TF vectors, and similarly for the time mentions and the location mentions. If the SRL role information is available, we also consider the similarity between words in each SRL role, i.e. Arg0, Arg1, Arg2.

Training. We train the parameter θ using logistic regression with an L2 regularizer. We construct the training data by considering all ordered pairs of event mentions within a document, and also all pairs of event mentions across similar documents. To measure document similarity, we collect all mentions of events, participants, times and locations in each document and compute the cosine similarity between the TF vectors constructed from all the event-related mentions. We consider two documents to be similar if their TF-based similarity is above a threshold σ (we set it to 0.4 in our experiments).

After learning θ, we set the within-document distances as F_d(i, j) = f_θ(x_i, x_j), and the across-document distances as F_0((i, d), (j, d')) = w(d, d') f_θ(x_{i,d}, x_{j,d'}), where w(d, d') = exp(γ sim(d, d')) captures document similarity, sim(d, d') is the TF-based similarity between documents d and d', and γ is a weight parameter. Higher γ leads to a higher effect of document-level similarities on the linking probabilities. We set γ = 1 in our experiments.
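To make the distance construction concrete, the sketch below scores a mention pair with the log-linear model of Equation (6) and turns it into the within- and cross-document priors F_d and F_0. The feature names and values are hypothetical stand-ins (Table 4 lists the kinds of features actually used), and the weights here are for illustration only, not the trained parameters.

    import math

    def pair_score(theta, psi):
        """Log-linear mention-pair similarity, Eq. (6): exp(theta . psi)."""
        return math.exp(sum(theta.get(k, 0.0) * v for k, v in psi.items()))

    def F_within(theta, psi_ij):
        """Within-document distance F_d(i, j) = f_theta(x_i, x_j)."""
        return pair_score(theta, psi_ij)

    def F_cross(theta, psi_ij, doc_sim, gamma=1.0):
        """Cross-document distance F_0((i,d),(j,d')) = w(d,d') * f_theta,
        with w(d,d') = exp(gamma * sim(d,d'))."""
        return math.exp(gamma * doc_sim) * pair_score(theta, psi_ij)

    # Hypothetical feature vector psi(x_i, x_j) and illustrative weights theta.
    theta = {"head_embedding_sim": 4.5, "string_match": 2.77, "context_sim": 1.75}
    psi = {"head_embedding_sim": 0.8, "string_match": 1.0, "context_sim": 0.4}
    print(F_within(theta, psi), F_cross(theta, psi, doc_sim=0.6))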
6 Experiments

We conduct experiments using the ECB+ corpus (Cybulska and Vossen, 2014b), the largest available dataset with annotations of both within-document (WD) and cross-document (CD) event coreference resolution. It extends ECB 0.1 (Lee et al., 2012) and ECB (Bejan and Harabagiu, 2010) by adding event argument and argument type annotations as well as adding more news documents. The cross-document coreference annotations only exist in documents that describe the same seminal event (the event that triggers the topic of the document and has interconnections with the majority of events from its surrounding textual context (Bejan and Harabagiu, 2014)). We divide the dataset into a training set (topics 1-20), a development set (topics 21-23), and a test set (topics 24-43). Table 2 shows the statistics of the data.

Table 2: Statistics of the ECB+ corpus
                               Train    Dev    Test    Total
  # Documents                    462     73     447      982
  # Sentences                  7,294    649   7,867   15,810
  # Annotated event mentions   3,555    441   3,290    7,286
  # Cross-document chains        687     47     486    1,220
  # Within-document chains     2,499    316   2,137    4,952

We performed event coreference resolution on all possible event mentions that are expressed in the documents. Using the event extraction method described in Section 4, we extracted 53,429 event mentions, 43,682 participant mentions, 5,791 time mentions and 3,836 location mentions in the test data, covering 93.5%, 89.0%, 95.0% and 72.8% of the annotated event mentions, participants, times and locations, respectively.

We evaluate both within- and cross-document event coreference resolution. As in previous work (Bejan and Harabagiu, 2010), we evaluate cross-document coreference resolution by merging all documents from the same seminal event into a meta-document and then evaluate the meta-document as in within-document coreference resolution. However, during inference time, we do not assume knowledge of the mapping of documents to seminal events.

We consider three widely used coreference resolution metrics: (1) MUC (Vilain et al., 1995), which measures how many gold (predicted) cluster merging operations are needed to recover each predicted (gold) cluster; (2) B3 (Bagga and Baldwin, 1998), which measures the proportion of overlap between the predicted and gold clusters for each mention and computes the average scores; and (3) CEAF (Luo, 2005), specifically CEAFe, which measures the best alignment of the gold-standard and predicted clusters. We also consider the CoNLL F1, which is the average F1 of the above three measures. All the scores are computed using the latest version (v8.01) of the official CoNLL scorer (Pradhan et al., 2014).
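As an illustration of one of these metrics, the sketch below computes mention-level B3 precision and recall from gold and predicted clusterings, each given as a mapping from mention to cluster id. This is a simplified rendering of the metric for intuition; the reported numbers come from the official CoNLL scorer.

    from collections import defaultdict

    def b_cubed(gold, pred):
        """B3 precision/recall: for each mention, the fraction of its predicted
        (resp. gold) cluster that is correct, averaged over mentions.
        gold, pred: dict mention -> cluster id, over the same mention set."""
        gold_clusters, pred_clusters = defaultdict(set), defaultdict(set)
        for m, c in gold.items():
            gold_clusters[c].add(m)
        for m, c in pred.items():
            pred_clusters[c].add(m)
        prec = rec = 0.0
        for m in gold:
            g, p = gold_clusters[gold[m]], pred_clusters[pred[m]]
            overlap = len(g & p)
            prec += overlap / len(p)
            rec += overlap / len(g)
        n = len(gold)
        return prec / n, rec / n

    gold = {"m1": "A", "m2": "A", "m3": "B"}
    pred = {"m1": 1, "m2": 2, "m3": 2}
    p, r = b_cubed(gold, pred)
    print(p, r, 2 * p * r / (p + r))   # precision, recall, F1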
6.1 Baselines

We compare our proposed HDDCRP model (HDDCRP) to five baselines:

• LEMMA: a heuristic method that groups all event mentions, either within or across documents, which have the same lemmatized head word. It is usually considered a strong baseline for event coreference resolution.

• AGGLOMERATIVE: a supervised clustering method for within-document event coreference (Chen et al., 2009). We extend it to within- and cross-document event coreference by performing single-link clustering in two phases: first grouping mentions within documents and then grouping within-document clusters to larger clusters across documents. We compute the pairwise-linkage scores using the log-linear model described in Section 5.4.

• HDP-LEX: an unsupervised Bayesian clustering model for within- and cross-document event coreference (Bejan and Harabagiu, 2010). (We re-implement the proposed HDP-based models, namely HDP1f, HDPflat (including HDPflat (LF), (LF+WF), and (LF+WF+SF)) and HDPstruct, but found that HDPflat with lexical features (LF) performs the best in our experiments; we refer to it as HDP-LEX.) It is a hierarchical Dirichlet process (HDP) model with the likelihood of all the lemmatized words observed in the event mentions. In general, the HDP can be formulated using a two-level sequential CRP. Our HDDCRP model is a two-level DDCRP that generalizes the HDP to allow data dependencies to be incorporated at both levels. (Note that HDP-LEX is not a special case of HDDCRP because we define the table-level distance function as the distances between customers instead of between tables. In our model, the probability of linking a table t to another table s depends on the distance between the head customer at table t and all other customers who sit at table s. Defining the table-level distance function this way allows us to derive a tractable inference algorithm using Gibbs sampling.)

• DDCRP: a DDCRP model we develop for event coreference resolution. It applies the distance prior in Equation 1 to all pairs of event mentions in the corpus, ignoring the document boundaries. It uses the same likelihood function and the same log-linear model to learn the distance values as HDDCRP. But it has fewer link variables than HDDCRP and it does not distinguish between the within-document and cross-document link variables. For the same clustering structure, HDDCRP can generate more possible link configurations than DDCRP.

• HDDCRP∗: a variant of the proposed HDDCRP that only incorporates the within-document dependencies but not the cross-document dependencies. The generative process of HDDCRP∗ is similar to the one described in Section 5.2, except that in step 2, for each table t, we sample a cluster assignment c_t according to

    \[ p(c_t = k) \propto \begin{cases} n_k, & k \le K \\ \alpha_0, & k = K + 1 \end{cases} \]

where K is the number of existing clusters, n_k is the number of existing tables that belong to cluster k, and α_0 is the concentration parameter. And in step 3, the clusters z(a, c) are constructed by traversing the customer links and looking up the cluster assignments for the obtained tables. We also use Gibbs sampling for inference.

6.2 Parameter settings

For all the Bayesian models, the reported results are averaged results over five MCMC runs, each for 500 iterations. We found that mixing happens before 500 iterations in all models by observing the joint log-likelihood. For the DDCRP, HDDCRP∗ and HDDCRP, we randomly initialized the link variables. Before initialization, we assume that each mention belongs to its own cluster. We assume mentions are ordered according to their appearance within a document, but we do not assume any particular ordering of documents. We also truncated the pairwise mention similarity to zero if it is below 0.5, as we found that this leads to better performance on the development set. We set α_1 = ... = α_D = 0.5 and α_0 = 0.001 for HDDCRP, α_0 = 1 for HDDCRP∗, α = 0.1 for DDCRP, and λ = 10⁻⁷. All the hyperparameters were set based on the development data.

6.3 Main Results

Table 3 shows the event coreference results. We can see that LEMMA-matching is a strong baseline for event coreference resolution. HDP-LEX provides noticeable improvements, suggesting the benefit of using an infinite mixture model for event clustering. AGGLOMERATIVE further improves the performance over HDP-LEX for WD resolution; however, it fails to improve CD resolution. We conjecture that this is due to the combination of ineffective thresholding and the prediction errors on the pairwise distances between mention pairs across documents. Overall, HDDCRP∗ outperforms all the baselines in CoNLL F1 for both WD and CD evaluation. The clear performance gains over HDP-LEX demonstrate that it is important to account for pairwise mention dependencies in the generative modeling of event clustering. The improvements over AGGLOMERATIVE indicate that it is more effective to model mention-pair dependencies as clustering priors than as heuristics for deterministic clustering.
Table 3: Within- and cross-document coreference results on the ECB+ corpus
                     MUC                B3                 CEAFe              CoNLL
                     P     R     F1     P     R     F1     P     R     F1     F1
  Cross-document Event Coreference Resolution (CD)
  LEMMA              75.1  55.4  63.8   71.7  39.6  51.0   36.2  61.1  45.5   53.4
  HDP-LEX            75.5  63.5  69.0   65.6  43.7  52.5   34.8  60.2  44.1   55.2
  AGGLOMERATIVE      78.3  59.2  67.4   73.2  40.2  51.9   30.2  65.6  41.4   53.6
  DDCRP              79.6  58.2  67.1   78.1  39.6  52.6   31.8  69.4  43.6   54.4
  HDDCRP∗            77.5  66.4  71.5   69.0  48.1  56.7   38.2  63.0  47.6   58.6
  HDDCRP             80.3  67.1  73.1   78.5  40.6  53.5   38.6  68.9  49.5   58.7
  Within-document Event Coreference Resolution (WD)
  LEMMA              60.9  30.2  40.4   78.9  57.3  66.4   63.6  69.0  66.2   57.7
  HDP-LEX            50.0  39.1  43.9   74.7  67.6  71.0   66.2  71.4  68.7   61.2
  AGGLOMERATIVE      61.9  39.2  48.0   80.7  67.6  73.5   65.6  76.0  70.4   63.9
  DDCRP              71.2  36.4  48.2   85.4  64.9  73.8   61.8  76.1  68.2   63.4
  HDDCRP∗            58.1  42.8  49.3   78.4  68.7  73.2   67.6  74.5  70.9   64.5
  HDDCRP             74.3  41.7  53.4   85.6  67.3  75.4   65.1  79.8  71.7   66.8

Comparing among the HDDCRP-related models, we can see that HDDCRP clearly outperforms DDCRP, demonstrating the benefits of incorporating the hierarchy into the model. HDDCRP also performs better than HDDCRP∗ in WD CoNLL F1, indicating that incorporating cross-document information helps within-document clustering. We can also see that HDDCRP performs similarly to HDDCRP∗ in CD CoNLL F1 due to the lower B3 F1, in particular, the decrease in B3 recall. This is because applying the DDCRP prior at both within- and cross-document levels results in more conservative clustering and produces smaller clusters. This could be potentially improved by employing more accurate similarity priors.

To further understand the effect of modeling mention-pair dependencies, we analyze the impact of the features in the mention-pair similarity model. Table 4 lists the learned weights of some top features (sorted by weights). We can see that they mainly serve to discriminate event mentions based on the head word similarity (especially embedding-based similarity) and the context word similarity. Event argument information such as SRL Arg1, SRL Arg0, and Participant are also indicative of the coreferential relations.

Table 4: Learned weights for selected features
  Feature               Weight
  Head Embedding sim    4.5
  String match          2.77
  Context sim           1.75
  Synonym sim           1.56
  TF sim                1.17
  SRL Arg1 sim          1.10
  SRL Arg0 sim          0.89
  Participant sim       0.68

6.4 Discussion

We found that HDDCRP corrects many errors made by the traditional agglomerative clustering model (AGGLOMERATIVE) and the unsupervised generative model (HDP-LEX). AGGLOMERATIVE easily suffers from error propagation as the errors made by the supervised distance learner cannot be corrected. HDP-LEX often mistakenly groups mentions together based on word co-occurrence statistics but not the apparent similarity features in the mentions. In contrast, HDDCRP avoids such errors by performing probabilistic modeling of clustering and making use of rich linguistic features trained on available annotated data. For example, HDDCRP correctly groups the event mention "unveiled" in "Apple's Phil Schiller unveiled a revamped MacBook Pro today" together with the event mention "announced" in "this notebook isn't the only laptop Apple announced for the MacBook Pro lineup today", while both HDP-LEX and AGGLOMERATIVE models fail to make such connection.

By looking further into the errors, we found that a lot of mistakes made by HDDCRP are due to the errors in event extraction and pairwise linkage prediction. The event extraction errors include false positive and false negative event mentions and event arguments, boundary errors for the extracted mentions, and argument association errors.
The pairwise linking errors often come from the lack of semantic and world knowledge, and this applies to both event mentions and event arguments, especially for time and location arguments, which are less likely to be repeatedly mentioned and in many cases require external knowledge to resolve their meanings, e.g., "May 3, 2013" is "Friday" and "Mount Cook" is "New Zealand's highest peak".

7 Conclusion

In this paper we propose a novel Bayesian model for within- and cross-document event coreference resolution. It leverages the advantages of generative modeling of coreference resolution and feature-rich discriminative modeling of mention reference relations. We have shown its power in resolving event coreference by comparing it to a traditional agglomerative clustering approach and a state-of-the-art unsupervised generative clustering approach. It is worth noting that our model is general and can be easily applied to other clustering problems involving feature-rich objects and cluster sharing across data groups. While the model can effectively cluster objects of a single type, it would be interesting to extend it to allow joint clustering of objects of different types, e.g., events and entities.

Acknowledgments

We thank Cristian Danescu-Niculescu-Mizil, Igor Labutov, Lillian Lee, Moontae Lee, Jon Park, Chenhao Tan, and other Cornell NLP seminar participants and the reviewers for their helpful comments. This work was supported in part by NSF grant IIS-1314778 and DARPA DEFT Grant FA8750-13-2-0015. The third author was supported by NSF CAREER CMMI-1254298, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and the ACSF AVF. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, DARPA or the U.S. Government.

References

David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, volume 1, pages 563–6.

Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In ACL, pages 1412–1422.

Cosmin Adrian Bejan and Sanda Harabagiu. 2014. Unsupervised event coreference resolution. Computational Linguistics, 40(2):311–347.

Indrajit Bhattacharya and Lise Getoor. 2006. A latent Dirichlet model for unsupervised entity resolution. In SDM, volume 5, page 59.

David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. The Journal of Machine Learning Research, 12:2461–2488.

Claire Cardie and Kiri Wagstaff. 1999. Noun phrase coreference as clustering. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 82–89.

Zheng Chen, Heng Ji, and Robert Haralick. 2009. A pairwise event coreference model, feature impact and evaluation for event coreference resolution. In Proceedings of the Workshop on Events in Emerging Text Types, pages 17–22.

Agata Cybulska and Piek Vossen. 2014a.
Guidelines for ECB+ annotation of events and their coreference. Technical report, NWR-2014-1, VU University Ams- terdam. Agata Cybulska and Piek Vossen. 2014b. Using a sledgehammer to crack a nut? lexical diversity and event coreference resolution. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC2014), pages 26–31. Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In EMNLP, pages 1971–1982. Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, and David M. Blei. 2011. Spatial distance depen- dent Chinese restaurant processes for image segmen- tation. In Advances in Neural Information Processing Systems, pages 1476–1484. Soumya Ghosh, Michalis Raptis, Leonid Sigal, and Erik B. Sudderth. 2014. Nonparametric clustering with distance dependent hierarchies. Aria Haghighi and Dan Klein. 2007. Unsupervised coreference resolution in a nonparametric Bayesian model. In ACL, volume 45, page 848. Aria Haghighi and Dan Klein. 2010. Coreference reso- lution in a modular, entity-centered model. In NAACL, pages 385–393. Kevin Humphreys, Robert Gaizauskas, and Saliha Az- zam. 1997. Event coreference for information extrac- tion. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pages 75–81. Andrew Kehler. 2002. Coherence, Reference, and the Theory of Grammar. CSLI publications Stanford, CA. Dongwoo Kim and Alice Oh. 2011. Accounting for data dependencies within a hierarchical Dirichlet process mixture model. In Proceedings of the 20th ACM Inter- national Conference on Information and Knowledge Management, pages 873–878. Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford’s multi-pass sieve coreference resolution sys- tem at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empir- ical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500. Zhengzhong Liu, Jun Araki, Eduard Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In Pro- ceedings of the International Conference on Language Resources and Evaluation. Xiaoqiang Luo. 2005. On coreference resolution perfor- mance metrics. In EMNLP, pages 25–32. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word represen- tations in vector space. Proceedings of Workshop at ICLR. Vincent Ng and Claire Cardie. 2002. Improving ma- chine learning approaches to coreference resolution. In ACL, pages 104–111. 527 Vincent Ng. 2010. Supervised noun phrase coreference research: The first fifteen years. In ACL, pages 1396– 1411. Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Ed- uard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In ACL, pages 22–27. Karthik Raghunathan, Heeyoung Lee, Sudarshan Ran- garajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A multi- pass sieve for coreference resolution. In EMNLP, pages 492–501. Altaf Rahman and Vincent Ng. 2011. Coreference reso- lution with world knowledge. In ACL, pages 814–824. 
Michael Roth and Anette Frank. 2012. Aligning pred- icate argument structures in monolingual comparable texts: A new corpus for a new task. In SemEval, pages 218–227. Sunita Sarawagi and William W. Cohen. 2004. Semi- markov conditional random fields for information ex- traction. In Advances in Neural Information Process- ing Systems, pages 1185–1192. Sameer Singh, Michael Wick, and Andrew McCallum. 2010. Distantly labeling data for large scale cross- document coreference. arXiv:1005.4298. Richard Socher, Andrew L. Maas, and Christopher D. Manning. 2011. Spectral Chinese restaurant pro- cesses: Nonparametric clustering based on similari- ties. In International Conference on Artificial Intel- ligence and Statistics, pages 698–706. Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloff. 2009. Conundrums in noun phrase coref- erence resolution: Making sense of the state-of-the- art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 656–664. Mihai Surdeanu, Lluı́s Màrquez, Xavier Carreras, and Pere R. Comas. 2007. Combination strategies for se- mantic role labeling. Journal of Artificial Intelligence Research, pages 105–151. Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet pro- cesses. Journal of the American Statistical Associa- tion, 101(476). Marc Vilain, John Burger, John Aberdeen, Dennis Con- nolly, and Lynette Hirschman. 1995. A model- theoretic coreference scoring scheme. In Proceed- ings of the 6th Conference on Message Understanding, pages 45–52. Michael Wick, Sameer Singh, and Andrew McCallum. 2012. A discriminative hierarchical model for fast coreference at large scale. In ACL, pages 379–388. Travis Wolfe, Mark Dredze, and Benjamin Van Durme. 2015. Predicate argument alignment using a global coherence model. In NAACL, pages 11–20. Bishan Yang and Claire Cardie. 2014. Joint modeling of opinion expression extraction and attribute classifi- cation. Transactions of the Association for Computa- tional Linguistics, 2:505–516. 528