A Hierarchical Distance-dependent Bayesian Model for Event Coreference Resolution

Bishan Yang, Claire Cardie
Department of Computer Science, Cornell University
{bishan, cardie}@cs.cornell.edu

Peter Frazier
School of Operations Research and Information Engineering, Cornell University
pf98@cornell.edu

Abstract

We present a novel hierarchical distance-dependent Bayesian model for event coreference resolution. While existing generative models for event coreference resolution are completely unsupervised, our model allows for the incorporation of pairwise distances between event mentions — information that is widely used in supervised coreference models — to guide the generative clustering process for better event clustering both within and across documents. We model the distances between event mentions using a feature-rich learnable distance function and encode them as Bayesian priors for nonparametric clustering. Experiments on the ECB+ corpus show that our model outperforms state-of-the-art methods for both within- and cross-document event coreference resolution.

1 Introduction

The task of event coreference resolution consists of identifying text snippets that describe events, and then clustering them such that all event mentions in the same partition refer to the same unique event. Event coreference resolution can be applied within a single document or across multiple documents and is crucial for many natural language processing tasks including topic detection and tracking, information extraction, question answering and textual entailment (Bejan and Harabagiu, 2010). More importantly, event coreference resolution is a necessary component in any reasonable, broadly applicable computational model of natural language understanding (Humphreys et al., 1997).

In comparison to entity coreference resolution (Ng, 2010), which deals with identifying and grouping noun phrases that refer to the same discourse entity, event coreference resolution has not been extensively studied. This is, in part, because events typically exhibit a more complex structure than entities: a single event can be described via multiple event mentions, and a single event mention can be associated with multiple event arguments that characterize the participants in the event as well as spatio-temporal information (Bejan and Harabagiu, 2010). Hence, the coreference decisions for event mentions usually require the interpretation of event mentions and their arguments in context. See, for example, Figure 1, in which five event mentions across two documents all refer to the same underlying event: Plane bombs Yida camp.

Figure 1: Examples of event coreference. Mutually coreferent event mentions are underlined and in boldface; participant and spatio-temporal information for the highlighted event (Plane bombs Yida camp) is marked by curly brackets. (The figure shows sentences from two documents, among them: "The {Yida refugee camp} {in South Sudan} was bombed {on Thursday}."; "The {Yida refugee camp} was the target of an air strike {in South Sudan} {on Thursday}."; "{Two bombs} fell {within the Yida camp}, including {one} {close to the school}."; "{At least four bombs} were reportedly dropped."; "{Four bombs} were dropped within just a few moments - {two} {inside the camp itself}, while {the other two} {near the airstrip}.")

Most previous approaches to event coreference resolution (e.g., Ahn (2006), Chen et al. (2009)) operated by extending the supervised pairwise classification model that is widely used in entity coreference resolution (e.g., Ng and Cardie (2002)).
In this framework, pairwise distances between event mentions are modeled via event-related features (e.g., that indicate event argument compatibility), and agglomerative clustering is applied to greedily merge event mentions into clusters. A major drawback of this general approach is that it makes hard decisions on the merging and splitting of clusters based on heuristics derived from the pairwise distances. In addition, it only captures pairwise coreference decisions within a single document and can not account for signals that commonly appear across documents. More recently, Bejan and Harabagiu (2010; 2014) proposed several nonparametric Bayesian models for event coreference resolution that probabilistically infer event clusters both within a document and across multiple documents. Their method, however, is completely unsupervised, and thus can not encode any readily available supervisory information to guide the model toward better event clustering.

To address these limitations, we propose a novel Bayesian model for within- and cross-document event coreference resolution. It leverages supervised feature-rich modeling of pairwise coreference relations and generative modeling of cluster distributions, and thus allows for both probabilistic inference over event clusters and easy incorporation of pairwise linking preferences. Our model builds on the framework of the distance-dependent Chinese restaurant process (DDCRP) (Blei and Frazier, 2011), which was introduced to incorporate data dependencies into nonparametric clustering models. Here, however, we extend the DDCRP to allow the incorporation of feature-based, learnable distance functions as clustering priors, thus encouraging event mentions that are close in meaning to belong to the same cluster. In addition, we introduce to the DDCRP a representational hierarchy that allows event mentions to be grouped within a document and within-document event clusters to be grouped across documents.

To investigate the effectiveness of our approach, we conduct extensive experiments on the ECB+ corpus (Cybulska and Vossen, 2014b), an extension to EventCorefBank (ECB) (Bejan and Harabagiu, 2010) and the largest corpus available that contains event coreference annotations within and across documents. We show that integrating pairwise learning of event coreference relations with unsupervised hierarchical modeling of event clustering achieves promising improvements over state-of-the-art approaches for within- and cross-document event coreference resolution.

2 Related Work

Coreference resolution in general is a difficult natural language processing (NLP) task and typically requires sophisticated inferentially-based knowledge-intensive models (Kehler, 2002). Extensive work in the literature focuses on the problem of entity coreference resolution and many techniques have been developed, including rule-based deterministic models (e.g. Cardie and Wagstaff (1999), Raghunathan et al. (2010), Lee et al. (2011)) that traverse over mentions in certain orderings and make deterministic coreference decisions based on all available information at the time; supervised learning-based models (e.g. Stoyanov et al.
(2009), Rahman and Ng (2011), Durrett and Klein (2013)) that make use of rich linguistic features and the annotated corpora to learn more powerful coreference functions; and fi- nally, unsupervised models (e.g. Bhattacharya and Getoor (2006), Haghighi and Klein (2007, 2010)) that successfully apply generative modeling to the coreference resolution problem. Event coreference resolution is a more complex task than entity coreference resolution (Humphreys et al., 1997) and also has been relatively less stud- ied. Existing work has adapted similar ideas to those used in entity coreference. Humphreys et al. (1997) first proposed a deterministic cluster- ing mechanism to group event mentions of pre- specified types based on hard constraints. Later ap- proaches (Ahn, 2006; Chen et al., 2009) applied learning-based pairwise classification decisions us- ing event-specific features to infer event clustering. Bejan and Harabagiu (2010; 2014) proposed sev- eral unsupervised generative models for event men- tion clustering based on the hierarchical Dirichlet process (HDP) (Teh et al., 2006). Our approach is related to both supervised clustering and gener- ative clustering approaches. It is a nonparametric Bayesian model in nature but encodes rich linguis- tic features in clustering priors. More recent work 518 modeled both entity and event information in event coreference. Lee et al. (2012) showed that itera- tively merging entity and event clusters can boost the clustering performance. Liu et al. (2014) demon- strated the benefits of propagating information be- tween event arguments and event mentions during a post-processing step. Other work modeled event coreference as a predicate argument alignment prob- lem between pairs of sentences, and trained clas- sifiers for making alignment decisions (Roth and Frank, 2012; Wolfe et al., 2015). Our model also leverages event argument information into the de- cisions of event coreference but incorporates it into Bayesian clustering priors. Most existing coreference models, both for events and entities, focus on solving the within-document coreference problem. Cross-document coreference has attracted less attention due to lack of annotated corpora and the requirement for larger model capac- ity. Hierarchical models (Singh et al., 2010; Wick et al., 2012; Haghighi and Klein, 2007) have been pop- ular choices for cross-document coreference as they can capture coreference at multiple levels of gran- ularities. Our model is also hierarchical, capturing both within- and cross-document coreference. Our model is also closely related to the distance-dependent Chinese Restaurant Process (DDCRP) (Blei and Frazier, 2011). The DDCRP is an infinite clustering model that can account for data dependencies (Ghosh et al., 2011; Socher et al., 2011). But it is a flat clustering model and thus can- not capture hierarchical structure that usually exists in large data collections. Very little work has ex- plored the use of DDCRP in hierarchical clustering models. Kim and Oh (2011; Ghosh et al. (2011) combined a DDCRP with a standard CRP in a two- level hierarchy analogous to the HDP with restricted distance functions. Ghosh et al. (2014) proposed a two-level DDCRP with data-dependent distance- based priors at both levels. Our model is also a two- level DDCRP model but differs in that its distance function is learned using a feature-rich log-linear model. We also derive an effective Gibbs sampler for posterior inference. 
3 Problem Formulation

We adopt the terminology from ECB+ (Cybulska and Vossen, 2014b), a corpus that extends the widely used EventCorefBank (ECB (Bejan and Harabagiu, 2010)). An event is something that happens or a situation that occurs (Cybulska and Vossen, 2014a). It consists of four components: (1) an Action: what happens in the event; (2) Participants: who or what is involved; (3) a Time: when the event happens; and (4) a Location: where the event happens. We assume that each document in the corpus consists of a set of mentions — text spans — that describe event actions, their participants, times, and locations. Table 1 shows examples of these in the sentence "Sudan bombs Yida refugee camp in South Sudan on Thursday, Nov 10th, 2011."

Table 1: Mentions of event components
  Action       bombs
  Participant  Sudan, Yida refugee camp
  Time         Thursday, Nov 10, 2011
  Location     South Sudan

In this paper, we also use the term event mention to refer to the mention of an event action, and event arguments to refer collectively to mentions of the participants, times and locations involved in the event. Event mentions are usually noun phrases or verb phrases that clearly describe events. Two event mentions are considered coreferent if they refer to the same actual event, i.e. a situation involving a particular combination of action, participants, time and location. Note that in text, not all event arguments are always present for an event mention; they may even be distributed over different sentences. Thus whether two event mentions are coreferential should be determined based on the context. For example, in Figure 1, the event mention dropped in DOCUMENT 1 corefers with air strike in the same document as they describe the same event, Plane bombs Yida camp, in the discourse context; it also corefers with dropped in DOCUMENT 2 based on the contexts of both documents.

The problem of event coreference resolution can be divided into two sub-problems: (1) event extraction: extracting event mentions and event arguments, and (2) event clustering: grouping event mentions into clusters according to their coreference relations. We consider both within- and cross-document event coreference resolution and hypothesize that leveraging context information from multiple documents will improve both within- and cross-document coreference resolution. In the following, we first describe the event extraction step and then focus on the event clustering step.

4 Event Extraction

The goal of event extraction is to extract from a text all event mentions (actions) and event arguments (the associated participants, times and locations). One might expect that event actions could be extracted reasonably well by identifying verb groups; and event arguments, by applying semantic role labeling (SRL) to identify, for example, the Agent and Patient of each predicate. Unfortunately, most SRL systems only handle verbal predicates and so would miss event mentions described via noun phrases. In addition, SRL systems are not designed to capture event-specific arguments. Accordingly, we found that a state-of-the-art SRL system (SwiRL (Surdeanu et al., 2007)) extracted only 56% of the actions, 76% of participants, 65% of times and 13% of locations for events in a development set of ECB+ based on a head word matching evaluation measure. (We provide dataset details in Section 6.)
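The head-word matching measure used to produce these recall figures is straightforward to operationalize. The following is a minimal sketch under assumed data structures (the Mention container and the sentence-level matching criterion are our own simplifications), not the evaluation script used in the paper.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Mention:
        doc_id: str
        sent_id: int
        head: str          # lemmatized head word of the mention span

    def headword_recall(gold, predicted):
        """Fraction of gold mentions whose head word is matched by some
        predicted mention in the same sentence (head-word matching)."""
        pred_keys = {(m.doc_id, m.sent_id, m.head.lower()) for m in predicted}
        if not gold:
            return 0.0
        hits = sum((g.doc_id, g.sent_id, g.head.lower()) in pred_keys for g in gold)
        return hits / len(gold)

    gold = [Mention("d1", 0, "bomb"), Mention("d1", 0, "camp")]
    pred = [Mention("d1", 0, "bomb")]
    print(headword_recall(gold, pred))   # 0.5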
To produce higher recall, we adopt a supervised approach and train an event extractor using sentences from ECB+, which are annotated for event actions, participants, times and locations. Because these mentions vary widely in their length and grammatical type, we employ semi-Markov CRFs (Sarawagi and Cohen, 2004) using the loss-augmented objective of Yang and Cardie (2014) that provides more accurate detection of mention boundaries. We make use of a rich feature set that includes word-level features such as unigrams, bigrams, POS tags, WordNet hypernyms, synonyms and FrameNet semantic roles, and phrase-level features such as phrasal syntax (e.g., NP, VP) and phrasal embeddings (constructed by averaging word embeddings produced by word2vec (Mikolov et al., 2013)). Our experiments on the same (held-out) development data show that the semi-CRF-based extractor correctly identifies 95% of actions, 90% of participants, 94% of times and 74% of locations, again based on head word matching.

Note that the semi-CRF extractor identifies event mentions and event arguments but not relationships among them, i.e. it does not associate arguments with an event mention. Lacking supervisory data in the ECB+ corpus for training an event action-argument relation detector, we assume that all event arguments identified by the semi-CRF extractor are related to all event mentions in the same sentence and then apply SRL-based heuristics to augment and further disambiguate intra-sentential action-argument relations (using the SwiRL SRL). More specifically, we link each verbal event mention to the participants that match its ARG0, ARG1 or ARG2 semantic role fillers; similarly, we associate with the event mention the time and locations that match its AM-TMP and AM-LOC role fillers, respectively. For each nominal event mention, we associate those participants that match the possessor of the mention since these were suggested in Lee et al. (2012) as playing the ARG0 role for nominal predicates.
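A rough sketch of this association heuristic is given below. The EventMention and Argument containers and the "possessor" role label are hypothetical conveniences for illustration; the actual system operates on semi-CRF and SwiRL output rather than these simplified structures.

    from dataclasses import dataclass

    @dataclass
    class EventMention:
        id: int
        sent_id: int
        is_verbal: bool

    @dataclass
    class Argument:
        id: int
        sent_id: int
        type: str  # "participant", "time", or "location"

    def associate_arguments(event_mentions, arguments, srl_roles):
        """Attach arguments to event mentions in the same sentence, using SRL
        role fillers (e.g., from SwiRL) to disambiguate where available.
        srl_roles maps (mention_id, role_label) -> set of argument ids."""
        role_map = {"participant": ("ARG0", "ARG1", "ARG2"),
                    "time": ("AM-TMP",),
                    "location": ("AM-LOC",)}
        links = {m.id: set() for m in event_mentions}
        for m in event_mentions:
            for a in (a for a in arguments if a.sent_id == m.sent_id):
                if m.is_verbal:
                    roles = role_map[a.type]
                else:
                    # nominal predicates: possessor participants play the ARG0 role
                    roles = ("possessor",) if a.type == "participant" else ()
                if any(a.id in srl_roles.get((m.id, r), set()) for r in roles):
                    links[m.id].add(a.id)
        return links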
5 Event Clustering

Now we describe our proposed Bayesian model for event clustering. Our model is a hierarchical extension of the distance-dependent Chinese Restaurant Process (DDCRP). It first groups event mentions within a document to form within-document event clusters and then groups these event clusters across documents to form global clusters. The model can account for the similarity between event mentions during the clustering process, putting a bias toward clusters comprised of event mentions that are similar to each other based on the context. To capture event similarity, we use a log-linear model with rich syntactic and semantic features, and learn the feature weights using gold-standard data.

5.1 Distance-dependent Chinese Restaurant Process

The Distance-dependent Chinese Restaurant Process (DDCRP) is a generalization of the Chinese Restaurant Process (CRP) that models distributions over partitions. In a CRP, the generative process can be described by imagining data points as customers in a restaurant and the partitioning of data as tables at which the customers sit. The process randomly samples the table assignment for each customer sequentially: the probability of a customer sitting at an existing table is proportional to the number of customers already sitting at that table, and the probability of sitting at a new table is proportional to a scaling parameter. For each customer sitting at the same table, an observation can be drawn from a distribution determined by the parameter associated with that table. Despite the sequential sampling process, the CRP makes the assumption of exchangeability: the permutation of the customer ordering does not change the probability of the partitions.

The exchangeability assumption may not be reasonable for clustering data that has clear inter-dependencies. The DDCRP allows the incorporation of data dependencies in infinite clustering, encouraging data points that are closer to each other to be grouped together. In the generative process, instead of directly sampling a table assignment for each customer, it samples a customer link, linking the customer to another customer or itself. The clustering can be uniquely constructed once the customer links are determined for all customers: two customers belong to the same cluster if and only if one can reach the other by traversing the customer links (treating these links as undirected).

More formally, consider a sequence of customers 1, ..., n, and denote a = (a_1, ..., a_n) as the assignments of the customer links. a_i ∈ {1, ..., n} is drawn from

    \[ p(a_i = j \mid F, \alpha) \propto \begin{cases} F(i, j), & j \neq i \\ \alpha, & j = i \end{cases} \tag{1} \]

where F is a distance function and F(i, j) is a value that measures the distance between customers i and j. α is a scaling parameter, measuring self-affinity. For each customer, the observation is generated by the per-table parameters as in the CRP. A DDCRP is said to be sequential if F(i, j) = 0 when i < j, so customers may link only to themselves and to previous customers.

5.2 A Hierarchical Extension of the DDCRP

We can model within-document coreference resolution using a sequential DDCRP. Imagining customers as event mentions and the restaurant as a document, each mention can either refer to an antecedent mention in the document or to no other mention, starting the description of a new event. However, coreference relations may also exist across documents — the same event may be described in multiple documents. Thus it is ideal to have a two-level clustering model that can group event mentions within a document and further group them across documents. Therefore we propose a hierarchical extension of the DDCRP (HDDCRP) that employs a DDCRP twice: the first-level DDCRP links mentions based on within-document distances and the second-level DDCRP links the within-document clusters based on cross-document distances, forming larger clusters in the corpus.

The generative process of an HDDCRP can be described using the same "Chinese Restaurant" metaphor. Imagine a collection of documents as a collection of restaurants, and the event mentions in each document as customers entering a restaurant. The local (within-document) event clusters correspond to tables. The global (within-corpus) event clusters correspond to menus (tables that serve the same menu belong to the same cluster). The hidden variables are the customer links and the table links. Figure 2 shows a configuration of these variables and the corresponding clustering structure.

Figure 2: A cluster configuration generated by the HDDCRP. Each restaurant is represented by a rectangle. The small green circles represent customers. The ovals represent tables and the colors reflect the clustering. Each customer is assigned a customer link (a solid arrow), linking to itself or another customer in the same restaurant. The customer who first sits at a table is assigned a table link (a dashed arrow), linking to itself or another customer in a different restaurant, resulting in the linking of two tables.
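As a concrete illustration of the linking prior in Equation (1), the sketch below samples customer links under a flat sequential DDCRP. The distance function here is a toy placeholder rather than the learned, feature-based function introduced in Section 5.4.

    import random

    def sample_customer_link(i, F, alpha, rng=random):
        """Draw a_i for customer i under the sequential DDCRP prior of Eq. (1):
        p(a_i = j) is proportional to F(i, j) for previous customers j < i,
        and to alpha for j = i (a self-link starts a new table)."""
        candidates = list(range(i + 1))
        weights = [alpha if j == i else F(i, j) for j in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]

    # Toy distance: mentions with nearby indices are more likely to link.
    F = lambda i, j: 1.0 / (1.0 + abs(i - j))
    links = [sample_customer_link(i, F, alpha=0.5) for i in range(6)]
    print(links)   # e.g., [0, 0, 1, 3, 3, 4]; links[i] == i marks a new table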
More formally, the generative process for the HDDCRP can be described as follows:

1. For each restaurant d ∈ {1, ..., D}, for each customer i ∈ {1, ..., n_d}, sample a customer link using a sequential DDCRP:

    \[ p(a_{i,d} = (j, d)) \propto \begin{cases} F_d(i, j), & j < i \\ \alpha_d, & j = i \\ 0, & j > i \end{cases} \tag{2} \]

2. For each restaurant d ∈ {1, ..., D}, for each table t, sample a table link for the customer (i, d) who first sits at t using a DDCRP:

    \[ p(c_{i,d} = (j, d')) \propto \begin{cases} F_0((i, d), (j, d')), & j \in \{1, \ldots, n_{d'}\},\ d' \neq d \\ \alpha_0, & j = i,\ d' = d \end{cases} \tag{3} \]

3. Calculate clusters z(a, c) by traversing all the customer links a and the table links c. Two customers are in the same cluster if and only if there is a path from one to the other along the links, where we treat both table and customer links as undirected.

4. For each cluster k ∈ z(a, c), sample parameters φ_k ∼ G_0(λ).

5. For each customer i in cluster k, sample an observation x_i ∼ p(·|φ_{z_i}) where z_i = k.

F_{1:D} and F_0 are distance functions that map a pair of customers to a distance value. We will discuss them in detail in Section 5.4.

5.3 Posterior Inference with Gibbs Sampling

The central computation problem for the HDDCRP model is posterior inference — computing the conditional distribution of the hidden variables given the observations, p(a, c | x, α_0, F_0, α_{1:D}, F_{1:D}). The posterior is intractable due to a combinatorial number of possible link configurations. Thus we approximate the posterior using Markov Chain Monte Carlo (MCMC) sampling, and specifically using a Gibbs sampler. In developing this Gibbs sampler, we first observe that the generative process is equivalent to one that, in step 2, samples a table link for all customers, and then in step 3, when calculating z(a, c), includes only those table links c_{i,d} originating at customers (i, d) that started a new table, i.e. that chose a_{i,d} = (i, d).

The Gibbs sampler for the HDDCRP iteratively samples a customer link for each customer (i, d) from

    \[ p(a^*_{i,d} \mid a_{-(i,d)}, c, x, \lambda) \propto p(a^*_{i,d})\, H_a(x, z, \lambda) \tag{4} \]

where

    \[ H_a(x, z, \lambda) = \frac{p(x \mid z(a_{-(i,d)} \cup a^*_{i,d}, c), \lambda)}{p(x \mid z(a_{-(i,d)}, c), \lambda)}. \]

After sampling all the customer links, it samples a table link for all customers (i, d) according to

    \[ p(c^*_{i,d} \mid a, c_{-(i,d)}, x, \lambda) \propto p(c^*_{i,d})\, H_c(x, z, \lambda) \tag{5} \]

where

    \[ H_c(x, z, \lambda) = \frac{p(x \mid z(a, c_{-(i,d)} \cup c^*_{i,d}), \lambda)}{p(x \mid z(a, c_{-(i,d)}), \lambda)}. \]

For those customers (i, d) that did not start a new table, i.e. with a_{i,d} ≠ (i, d), the table link c^*_{i,d} does not affect the clustering, and so H_c(x, z, λ) = 1 in this case.

Referring back to the event coreference example in Figure 1, Figure 3 shows an example of variable configuration for the HDDCRP model and the corresponding coreference clusters.

Figure 3: An example of event clustering and the corresponding variable assignments (customer links a1=1, a2=2, a3=3, a4=4, a5=4 and table links c1=3, c2=2, c3=2, c4=2, c5=5 [ina]). The assignments of a induce tables, or within-document (WD) clusters, and the assignments of c induce menus, or cross-document (CD) clusters. [ina] denotes that the variable is inactive and will not affect the clustering.
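Step 3 of the generative process (and the configuration in Figure 3) amounts to a connected-components computation over the link graph. The sketch below is a minimal illustration under assumed inputs: mentions are (document, index) pairs, and, following the observation made for the Gibbs sampler, a table link is treated as active only when its customer started a new table. The document membership in the toy example is illustrative, not taken from the figure.

    def clusters_from_links(customer_links, table_links):
        """customer_links: dict (d, i) -> (d, j), within-document links (j <= i).
        table_links:       dict (d, i) -> (d', j), cross-document (or self) links.
        Returns a dict mapping every mention to a cluster representative: two
        mentions corefer iff they are connected by undirected links (step 3)."""
        nodes = (set(customer_links) | set(customer_links.values())
                 | set(table_links) | set(table_links.values()))
        parent = {x: x for x in nodes}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        def union(x, y):
            parent[find(x)] = find(y)

        for i, j in customer_links.items():       # customer links always count
            union(i, j)
        for i, j in table_links.items():          # table links count only for
            if customer_links.get(i) == i:        # customers that started a table
                union(i, j)
        return {x: find(x) for x in nodes}

    # A configuration in the spirit of Figure 3 (document membership illustrative):
    a = {("doc1", 1): ("doc1", 1), ("doc1", 2): ("doc1", 2),
         ("doc2", 3): ("doc2", 3), ("doc2", 4): ("doc2", 4), ("doc2", 5): ("doc2", 4)}
    c = {("doc1", 1): ("doc2", 3), ("doc1", 2): ("doc1", 2),
         ("doc2", 3): ("doc1", 2), ("doc2", 4): ("doc1", 2), ("doc2", 5): ("doc2", 5)}
    print(clusters_from_links(a, c))   # all five mentions fall into a single cluster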
In implementation, we can simplify the computations of both H_a(x, z, λ) and H_c(x, z, λ) by using the fact that the likelihood under clustering z(a, c) can be factorized as

    \[ p(x \mid z(a, c), \lambda) = \prod_{k \in z(a,c)} p(x_{z=k} \mid \lambda) \]

where x_{z=k} denotes all customers that belong to the global cluster k. p(x_{z=k}|λ) is the marginal probability. It can be computed as

    \[ p(x_{z=k} \mid \lambda) = \int p(\phi \mid \lambda) \prod_{i : z_i = k} p(x_i \mid \phi)\, d\phi \]

where x_i is the observation associated with customer i. In our problem, the observation corresponds to the lemmatized words in the event mention. We model the observed word counts using cluster-specific multinomial distributions with symmetric Dirichlet priors.

5.4 Feature-based Distance Functions

The distance functions F_{1:D} and F_0 encode the priors for the clustering distribution, preferring clustering data points that are closer to each other. We consider event mentions as the data points and encode the similarity (or compatibility) between event mentions as priors for event clustering. Specifically, we use a log-linear model to estimate the similarity between a pair of event mentions (x_i, x_j):

    \[ f_\theta(x_i, x_j) \propto \exp\{\theta^{T} \psi(x_i, x_j)\} \tag{6} \]

where ψ is a feature vector, containing a rich set of features based on event mentions i and j: (1) head word string match, (2) head POS pair, (3) cosine similarity between the head word embeddings (we use the pre-trained 300-dimensional word embeddings from word2vec, https://code.google.com/p/word2vec/), (4) similarity between the words in the event mentions (based on term frequency (TF) vectors), (5) the Jaccard coefficient between the WordNet synonyms of the head words, and (6) similarity between the context words (a window of three words before and after each event mention). If both event mentions involve participants, we consider the similarity between the words in the participant mentions based on the TF vectors, and similarly for the time mentions and the location mentions. If the SRL role information is available, we also consider the similarity between words in each SRL role, i.e. Arg0, Arg1, Arg2.

Training. We train the parameter θ using logistic regression with an L2 regularizer. We construct the training data by considering all ordered pairs of event mentions within a document, and also all pairs of event mentions across similar documents. To measure document similarity, we collect all mentions of events, participants, times and locations in each document and compute the cosine similarity between the TF vectors constructed from all the event-related mentions. We consider two documents to be similar if their TF-based similarity is above a threshold σ (we set it to 0.4 in our experiments).

After learning θ, we set the within-document distances as F_d(i, j) = f_θ(x_i, x_j), and the across-document distances as F_0((i, d), (j, d')) = w(d, d') f_θ(x_{i,d}, x_{j,d'}), where w(d, d') = exp(γ sim(d, d')) captures document similarity, sim(d, d') is the TF-based similarity between documents d and d', and γ is a weight parameter. Higher γ leads to a higher effect of document-level similarities on the linking probabilities. We set γ = 1 in our experiments.
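To make the distance construction concrete, the sketch below scores a mention pair with the log-linear model of Equation (6) and turns it into the within- and cross-document priors F_d and F_0. The feature names and values are hypothetical stand-ins (Table 4 lists the kinds of features actually used), and the weights here are for illustration only, not the trained parameters.

    import math

    def pair_score(theta, psi):
        """Log-linear mention-pair similarity, Eq. (6): exp(theta . psi)."""
        return math.exp(sum(theta.get(k, 0.0) * v for k, v in psi.items()))

    def F_within(theta, psi_ij):
        """Within-document distance F_d(i, j) = f_theta(x_i, x_j)."""
        return pair_score(theta, psi_ij)

    def F_cross(theta, psi_ij, doc_sim, gamma=1.0):
        """Cross-document distance F_0((i,d),(j,d')) = w(d,d') * f_theta,
        with w(d,d') = exp(gamma * sim(d,d'))."""
        return math.exp(gamma * doc_sim) * pair_score(theta, psi_ij)

    # Hypothetical feature vector psi(x_i, x_j) and illustrative weights theta.
    theta = {"head_embedding_sim": 4.5, "string_match": 2.77, "context_sim": 1.75}
    psi = {"head_embedding_sim": 0.8, "string_match": 1.0, "context_sim": 0.4}
    print(F_within(theta, psi), F_cross(theta, psi, doc_sim=0.6))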
6 Experiments

We conduct experiments using the ECB+ corpus (Cybulska and Vossen, 2014b), the largest available dataset with annotations of both within-document (WD) and cross-document (CD) event coreference resolution. It extends ECB 0.1 (Lee et al., 2012) and ECB (Bejan and Harabagiu, 2010) by adding event argument and argument type annotations as well as adding more news documents. The cross-document coreference annotations only exist in documents that describe the same seminal event (the event that triggers the topic of the document and has interconnections with the majority of events from its surrounding textual context (Bejan and Harabagiu, 2014)). We divide the dataset into a training set (topics 1-20), a development set (topics 21-23), and a test set (topics 24-43). Table 2 shows the statistics of the data.

Table 2: Statistics of the ECB+ corpus
                               Train    Dev    Test    Total
  # Documents                    462     73     447      982
  # Sentences                  7,294    649   7,867   15,810
  # Annotated event mentions   3,555    441   3,290    7,286
  # Cross-document chains        687     47     486    1,220
  # Within-document chains     2,499    316   2,137    4,952

We performed event coreference resolution on all possible event mentions that are expressed in the documents. Using the event extraction method described in Section 4, we extracted 53,429 event mentions, 43,682 participant mentions, 5,791 time mentions and 3,836 location mentions in the test data, covering 93.5%, 89.0%, 95.0% and 72.8% of the annotated event mentions, participants, times and locations, respectively.

We evaluate both within- and cross-document event coreference resolution. As in previous work (Bejan and Harabagiu, 2010), we evaluate cross-document coreference resolution by merging all documents from the same seminal event into a meta-document and then evaluate the meta-document as in within-document coreference resolution. However, during inference time, we do not assume knowledge of the mapping of documents to seminal events.

We consider three widely used coreference resolution metrics: (1) MUC (Vilain et al., 1995), which measures how many gold (predicted) cluster merging operations are needed to recover each predicted (gold) cluster; (2) B3 (Bagga and Baldwin, 1998), which measures the proportion of overlap between the predicted and gold clusters for each mention and computes the average scores; and (3) CEAF (Luo, 2005), specifically CEAFe, which measures the best alignment of the gold-standard and predicted clusters. We also consider the CoNLL F1, which is the average F1 of the above three measures. All the scores are computed using the latest version (v8.01) of the official CoNLL scorer (Pradhan et al., 2014).
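As an illustration of one of these metrics, the sketch below computes mention-level B3 precision and recall from gold and predicted clusterings, each given as a mapping from mention to cluster id. This is a simplified rendering of the metric for intuition; the reported numbers come from the official CoNLL scorer.

    from collections import defaultdict

    def b_cubed(gold, pred):
        """B3 precision/recall: for each mention, the fraction of its predicted
        (resp. gold) cluster that is correct, averaged over mentions.
        gold, pred: dict mention -> cluster id, over the same mention set."""
        gold_clusters, pred_clusters = defaultdict(set), defaultdict(set)
        for m, c in gold.items():
            gold_clusters[c].add(m)
        for m, c in pred.items():
            pred_clusters[c].add(m)
        prec = rec = 0.0
        for m in gold:
            g, p = gold_clusters[gold[m]], pred_clusters[pred[m]]
            overlap = len(g & p)
            prec += overlap / len(p)
            rec += overlap / len(g)
        n = len(gold)
        return prec / n, rec / n

    gold = {"m1": "A", "m2": "A", "m3": "B"}
    pred = {"m1": 1, "m2": 2, "m3": 2}
    p, r = b_cubed(gold, pred)
    print(p, r, 2 * p * r / (p + r))   # precision, recall, F1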
6.1 Baselines

We compare our proposed HDDCRP model (HDDCRP) to five baselines:

• LEMMA: a heuristic method that groups all event mentions, either within or across documents, which have the same lemmatized head word. It is usually considered a strong baseline for event coreference resolution.

• AGGLOMERATIVE: a supervised clustering method for within-document event coreference (Chen et al., 2009). We extend it to within- and cross-document event coreference by performing single-link clustering in two phases: first grouping mentions within documents and then grouping within-document clusters to larger clusters across documents. We compute the pairwise-linkage scores using the log-linear model described in Section 5.4.

• HDP-LEX: an unsupervised Bayesian clustering model for within- and cross-document event coreference (Bejan and Harabagiu, 2010). (We re-implement the proposed HDP-based models, namely HDP1f, HDPflat (including HDPflat (LF), (LF+WF), and (LF+WF+SF)) and HDPstruct, but found that HDPflat with lexical features (LF) performs the best in our experiments; we refer to it as HDP-LEX.) It is a hierarchical Dirichlet process (HDP) model with the likelihood of all the lemmatized words observed in the event mentions. In general, the HDP can be formulated using a two-level sequential CRP. Our HDDCRP model is a two-level DDCRP that generalizes the HDP to allow data dependencies to be incorporated at both levels. (Note that HDP-LEX is not a special case of HDDCRP because we define the table-level distance function as the distances between customers instead of between tables. In our model, the probability of linking a table t to another table s depends on the distance between the head customer at table t and all other customers who sit at table s. Defining the table-level distance function this way allows us to derive a tractable inference algorithm using Gibbs sampling.)

• DDCRP: a DDCRP model we develop for event coreference resolution. It applies the distance prior in Equation 1 to all pairs of event mentions in the corpus, ignoring the document boundaries. It uses the same likelihood function and the same log-linear model to learn the distance values as HDDCRP. But it has fewer link variables than HDDCRP and it does not distinguish between the within-document and cross-document link variables. For the same clustering structure, HDDCRP can generate more possible link configurations than DDCRP.

• HDDCRP∗: a variant of the proposed HDDCRP that only incorporates the within-document dependencies but not the cross-document dependencies. The generative process of HDDCRP∗ is similar to the one described in Section 5.2, except that in step 2, for each table t, we sample a cluster assignment c_t according to

    \[ p(c_t = k) \propto \begin{cases} n_k, & k \le K \\ \alpha_0, & k = K + 1 \end{cases} \]

where K is the number of existing clusters, n_k is the number of existing tables that belong to cluster k, and α_0 is the concentration parameter. And in step 3, the clusters z(a, c) are constructed by traversing the customer links and looking up the cluster assignments for the obtained tables. We also use Gibbs sampling for inference.

6.2 Parameter settings

For all the Bayesian models, the reported results are averaged results over five MCMC runs, each for 500 iterations. We found that mixing happens before 500 iterations in all models by observing the joint log-likelihood. For the DDCRP, HDDCRP∗ and HDDCRP, we randomly initialized the link variables. Before initialization, we assume that each mention belongs to its own cluster. We assume mentions are ordered according to their appearance within a document, but we do not assume any particular ordering of documents. We also truncated the pairwise mention similarity to zero if it is below 0.5, as we found that this leads to better performance on the development set. We set α_1 = ... = α_D = 0.5 and α_0 = 0.001 for HDDCRP, α_0 = 1 for HDDCRP∗, α = 0.1 for DDCRP, and λ = 10⁻⁷. All the hyperparameters were set based on the development data.

6.3 Main Results

Table 3 shows the event coreference results. We can see that LEMMA-matching is a strong baseline for event coreference resolution. HDP-LEX provides noticeable improvements, suggesting the benefit of using an infinite mixture model for event clustering. AGGLOMERATIVE further improves the performance over HDP-LEX for WD resolution; however, it fails to improve CD resolution. We conjecture that this is due to the combination of ineffective thresholding and the prediction errors on the pairwise distances between mention pairs across documents. Overall, HDDCRP∗ outperforms all the baselines in CoNLL F1 for both WD and CD evaluation. The clear performance gains over HDP-LEX demonstrate that it is important to account for pairwise mention dependencies in the generative modeling of event clustering. The improvements over AGGLOMERATIVE indicate that it is more effective to model mention-pair dependencies as clustering priors than as heuristics for deterministic clustering.
Table 3: Within- and cross-document coreference results on the ECB+ corpus
                     MUC                B3                 CEAFe              CoNLL
                     P     R     F1     P     R     F1     P     R     F1     F1
  Cross-document Event Coreference Resolution (CD)
  LEMMA              75.1  55.4  63.8   71.7  39.6  51.0   36.2  61.1  45.5   53.4
  HDP-LEX            75.5  63.5  69.0   65.6  43.7  52.5   34.8  60.2  44.1   55.2
  AGGLOMERATIVE      78.3  59.2  67.4   73.2  40.2  51.9   30.2  65.6  41.4   53.6
  DDCRP              79.6  58.2  67.1   78.1  39.6  52.6   31.8  69.4  43.6   54.4
  HDDCRP∗            77.5  66.4  71.5   69.0  48.1  56.7   38.2  63.0  47.6   58.6
  HDDCRP             80.3  67.1  73.1   78.5  40.6  53.5   38.6  68.9  49.5   58.7
  Within-document Event Coreference Resolution (WD)
  LEMMA              60.9  30.2  40.4   78.9  57.3  66.4   63.6  69.0  66.2   57.7
  HDP-LEX            50.0  39.1  43.9   74.7  67.6  71.0   66.2  71.4  68.7   61.2
  AGGLOMERATIVE      61.9  39.2  48.0   80.7  67.6  73.5   65.6  76.0  70.4   63.9
  DDCRP              71.2  36.4  48.2   85.4  64.9  73.8   61.8  76.1  68.2   63.4
  HDDCRP∗            58.1  42.8  49.3   78.4  68.7  73.2   67.6  74.5  70.9   64.5
  HDDCRP             74.3  41.7  53.4   85.6  67.3  75.4   65.1  79.8  71.7   66.8

Comparing among the HDDCRP-related models, we can see that HDDCRP clearly outperforms DDCRP, demonstrating the benefits of incorporating the hierarchy into the model. HDDCRP also performs better than HDDCRP∗ in WD CoNLL F1, indicating that incorporating cross-document information helps within-document clustering. We can also see that HDDCRP performs similarly to HDDCRP∗ in CD CoNLL F1 due to the lower B3 F1, in particular, the decrease in B3 recall. This is because applying the DDCRP prior at both within- and cross-document levels results in more conservative clustering and produces smaller clusters. This could be potentially improved by employing more accurate similarity priors.

To further understand the effect of modeling mention-pair dependencies, we analyze the impact of the features in the mention-pair similarity model. Table 4 lists the learned weights of some top features (sorted by weights). We can see that they mainly serve to discriminate event mentions based on the head word similarity (especially embedding-based similarity) and the context word similarity. Event argument information such as SRL Arg1, SRL Arg0, and Participant are also indicative of the coreferential relations.

Table 4: Learned weights for selected features
  Feature               Weight
  Head Embedding sim    4.5
  String match          2.77
  Context sim           1.75
  Synonym sim           1.56
  TF sim                1.17
  SRL Arg1 sim          1.10
  SRL Arg0 sim          0.89
  Participant sim       0.68

6.4 Discussion

We found that HDDCRP corrects many errors made by the traditional agglomerative clustering model (AGGLOMERATIVE) and the unsupervised generative model (HDP-LEX). AGGLOMERATIVE easily suffers from error propagation as the errors made by the supervised distance learner cannot be corrected. HDP-LEX often mistakenly groups mentions together based on word co-occurrence statistics but not the apparent similarity features in the mentions. In contrast, HDDCRP avoids such errors by performing probabilistic modeling of clustering and making use of rich linguistic features trained on available annotated data. For example, HDDCRP correctly groups the event mention "unveiled" in "Apple's Phil Schiller unveiled a revamped MacBook Pro today" together with the event mention "announced" in "this notebook isn't the only laptop Apple announced for the MacBook Pro lineup today", while both HDP-LEX and AGGLOMERATIVE models fail to make such connection.

By looking further into the errors, we found that a lot of mistakes made by HDDCRP are due to the errors in event extraction and pairwise linkage prediction. The event extraction errors include false positive and false negative event mentions and event arguments, boundary errors for the extracted mentions, and argument association errors.
The pairwise linking errors often come from the lack of semantic and world knowledge, and this applies to both event mentions and event arguments, especially for time and location arguments, which are less likely to be repeatedly mentioned and in many cases require external knowledge to resolve their meanings, e.g., "May 3, 2013" is "Friday" and "Mount Cook" is "New Zealand's highest peak".

7 Conclusion

In this paper we propose a novel Bayesian model for within- and cross-document event coreference resolution. It leverages the advantages of generative modeling of coreference resolution and feature-rich discriminative modeling of mention reference relations. We have shown its power in resolving event coreference by comparing it to a traditional agglomerative clustering approach and a state-of-the-art unsupervised generative clustering approach. It is worth noting that our model is general and can be easily applied to other clustering problems involving feature-rich objects and cluster sharing across data groups. While the model can effectively cluster objects of a single type, it would be interesting to extend it to allow joint clustering of objects of different types, e.g., events and entities.

Acknowledgments

We thank Cristian Danescu-Niculescu-Mizil, Igor Labutov, Lillian Lee, Moontae Lee, Jon Park, Chenhao Tan, and other Cornell NLP seminar participants and the reviewers for their helpful comments. This work was supported in part by NSF grant IIS-1314778 and DARPA DEFT Grant FA8750-13-2-0015. The third author was supported by NSF CAREER CMMI-1254298, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and the ACSF AVF. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, DARPA or the U.S. Government.

References

David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, volume 1, pages 563–6.

Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In ACL, pages 1412–1422.

Cosmin Adrian Bejan and Sanda Harabagiu. 2014. Unsupervised event coreference resolution. Computational Linguistics, 40(2):311–347.

Indrajit Bhattacharya and Lise Getoor. 2006. A latent Dirichlet model for unsupervised entity resolution. In SDM, volume 5, page 59.

David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. The Journal of Machine Learning Research, 12:2461–2488.

Claire Cardie and Kiri Wagstaff. 1999. Noun phrase coreference as clustering. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 82–89.

Zheng Chen, Heng Ji, and Robert Haralick. 2009. A pairwise event coreference model, feature impact and evaluation for event coreference resolution. In Proceedings of the Workshop on Events in Emerging Text Types, pages 17–22.

Agata Cybulska and Piek Vossen. 2014a.
Guidelines for ECB+ annotation of events and their coreference. Technical report, NWR-2014-1, VU University Ams- terdam. Agata Cybulska and Piek Vossen. 2014b. Using a sledgehammer to crack a nut? lexical diversity and event coreference resolution. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC2014), pages 26–31. Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In EMNLP, pages 1971–1982. Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, and David M. Blei. 2011. Spatial distance depen- dent Chinese restaurant processes for image segmen- tation. In Advances in Neural Information Processing Systems, pages 1476–1484. Soumya Ghosh, Michalis Raptis, Leonid Sigal, and Erik B. Sudderth. 2014. Nonparametric clustering with distance dependent hierarchies. Aria Haghighi and Dan Klein. 2007. Unsupervised coreference resolution in a nonparametric Bayesian model. In ACL, volume 45, page 848. Aria Haghighi and Dan Klein. 2010. Coreference reso- lution in a modular, entity-centered model. In NAACL, pages 385–393. Kevin Humphreys, Robert Gaizauskas, and Saliha Az- zam. 1997. Event coreference for information extrac- tion. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pages 75–81. Andrew Kehler. 2002. Coherence, Reference, and the Theory of Grammar. CSLI publications Stanford, CA. Dongwoo Kim and Alice Oh. 2011. Accounting for data dependencies within a hierarchical Dirichlet process mixture model. In Proceedings of the 20th ACM Inter- national Conference on Information and Knowledge Management, pages 873–878. Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford’s multi-pass sieve coreference resolution sys- tem at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empir- ical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500. Zhengzhong Liu, Jun Araki, Eduard Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In Pro- ceedings of the International Conference on Language Resources and Evaluation. Xiaoqiang Luo. 2005. On coreference resolution perfor- mance metrics. In EMNLP, pages 25–32. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word represen- tations in vector space. Proceedings of Workshop at ICLR. Vincent Ng and Claire Cardie. 2002. Improving ma- chine learning approaches to coreference resolution. In ACL, pages 104–111. 527 Vincent Ng. 2010. Supervised noun phrase coreference research: The first fifteen years. In ACL, pages 1396– 1411. Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Ed- uard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In ACL, pages 22–27. Karthik Raghunathan, Heeyoung Lee, Sudarshan Ran- garajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A multi- pass sieve for coreference resolution. In EMNLP, pages 492–501. Altaf Rahman and Vincent Ng. 2011. Coreference reso- lution with world knowledge. In ACL, pages 814–824. 
Michael Roth and Anette Frank. 2012. Aligning pred- icate argument structures in monolingual comparable texts: A new corpus for a new task. In SemEval, pages 218–227. Sunita Sarawagi and William W. Cohen. 2004. Semi- markov conditional random fields for information ex- traction. In Advances in Neural Information Process- ing Systems, pages 1185–1192. Sameer Singh, Michael Wick, and Andrew McCallum. 2010. Distantly labeling data for large scale cross- document coreference. arXiv:1005.4298. Richard Socher, Andrew L. Maas, and Christopher D. Manning. 2011. Spectral Chinese restaurant pro- cesses: Nonparametric clustering based on similari- ties. In International Conference on Artificial Intel- ligence and Statistics, pages 698–706. Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloff. 2009. Conundrums in noun phrase coref- erence resolution: Making sense of the state-of-the- art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 656–664. Mihai Surdeanu, Lluı́s Màrquez, Xavier Carreras, and Pere R. Comas. 2007. Combination strategies for se- mantic role labeling. Journal of Artificial Intelligence Research, pages 105–151. Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet pro- cesses. Journal of the American Statistical Associa- tion, 101(476). Marc Vilain, John Burger, John Aberdeen, Dennis Con- nolly, and Lynette Hirschman. 1995. A model- theoretic coreference scoring scheme. In Proceed- ings of the 6th Conference on Message Understanding, pages 45–52. Michael Wick, Sameer Singh, and Andrew McCallum. 2012. A discriminative hierarchical model for fast coreference at large scale. In ACL, pages 379–388. Travis Wolfe, Mark Dredze, and Benjamin Van Durme. 2015. Predicate argument alignment using a global coherence model. In NAACL, pages 11–20. Bishan Yang and Claire Cardie. 2014. Joint modeling of opinion expression extraction and attribute classifi- cation. Transactions of the Association for Computa- tional Linguistics, 2:505–516. 528