title: Corpus-based Open-Domain Event Type Induction
authors: Shen, Jiaming; Zhang, Yunyi; Ji, Heng; Han, Jiawei
date: 2021-09-07

Traditional event extraction methods require predefined event types and their corresponding annotations to learn event extractors. These prerequisites are often hard to satisfy in real-world applications. This work presents a corpus-based open-domain event type induction method that automatically discovers a set of event types from a given corpus. As events of the same type can be expressed in multiple ways, we propose to represent each event type as a cluster of ⟨predicate sense, object head⟩ pairs. Specifically, our method (1) selects salient predicates and object heads, (2) disambiguates predicate senses using only a verb sense dictionary, and (3) obtains event types by jointly embedding and clustering ⟨predicate sense, object head⟩ pairs in a latent spherical space. Our experiments on three datasets from different domains show that our method can discover salient, high-quality event types, according to both automatic and human evaluations. 1

1 The programs, data, and resources are publicly available for research purposes at https://github.com/mickeystroller/ETypeClus.

One step towards converting massive unstructured text into structured, machine-readable representations is event extraction: the identification and typing of event triggers and arguments in text. Most event extraction methods (Ahn, 2006; Ji and Grishman, 2008; Du and Cardie, 2020; Li et al., 2021) assume that a set of predefined event types and their corresponding annotations is curated by human experts. This annotation process is expensive and time-consuming. Besides, those manually-defined event types often fail to generalize to new domains. For example, the widely used ACE 2005 event schemas 2 do not contain any event type
2 https://www.ldc.upenn.edu/collaborations/past-projects/ace

about Transmit Virus or Treat Disease and thus cannot be readily applied to extract pandemic events.

Figure 1: Example sentences and their induced event types, represented as clusters of ⟨predicate sense, object head⟩ pairs: ⟨detain_1, people⟩ and ⟨arrest_1, people⟩ form "Arrest-Jail"; ⟨arrest_2, spread⟩ and ⟨stop_1, transmission⟩ form "Stop-Spread"; ⟨stop_1, planning⟩ forms "Stop-Plan". Example sentences include: "Hundreds of people are detained for distributing purported false information online."; "Researchers say that vaccinating 46 percent of Haitians could arrest the cholera spread."; "The Zimbabwe CTU said 69 people were arrested during Wednesday's demonstrations."; "More censorship of social media posts are enforced to stop protest planning online."

To automatically induce event schemas from raw text, researchers have studied ad-hoc clustering-based algorithms (Sekine, 2006; Chambers and Jurafsky, 2011) and probabilistic generative methods (Chambers, 2013; Cheung et al., 2013; Nguyen et al., 2015) to discover a set of event types and argument roles. These methods typically utilize bag-of-words text representations and impose strong statistical assumptions. Huang et al. (2016) relax those restrictions using a pipelined approach that leverages extensive lexical and semantic resources (e.g., FrameNet (Baker et al., 1998), VerbNet (Schuler and Palmer, 2005), and PropBank) to discover event schemas. While effective, this method is limited by the scope of its external resources and the accuracy of its preprocessing tools. Recently, some studies (Huang et al., 2018; Lai and Nguyen, 2019; Huang and Ji, 2020) have used transfer learning to extend traditional event extraction models to new types without explicitly deriving schemas of the new event types. Nevertheless, these methods still require many annotations for a set of seen types.

Table 1: Statistics of verb-triggered event types in three popular event extraction datasets. Event types triggered by verbs more than 5 times are considered "Verb Frequently Triggered Event Types".
In this work, we study the problem of event type induction, which aims to discover a set of salient event types based on a given corpus. We observe that about 90% of event types can be frequently triggered by predicate verbs (cf. Table 1) and thus propose to take a verb-centric view toward inducing event types. We use the five sentences (S1-S5) in Figure 1 to motivate our design of the event type representation. First, we observe that a verb lemma itself might be ambiguous. For example, the two mentions of the lemma "arrest" in S2 and S3 have different senses and indicate different event types. Second, even for predicates with the same sense, their different associated object heads 3 could lead them to express different event types. Taking S4 and S5 as examples, the two "stop" mentions have the same sense but belong to different types because of their corresponding object heads. Finally, we can see that people have multiple ways to communicate the same event type due to language variability. From the above observations, we propose to represent an event type as a cluster of ⟨predicate sense, object head⟩ pairs (P-O pairs for short). 4 We present a new event type induction framework ETYPECLUS to automatically discover event types, customized for a specific input corpus. ETYPECLUS requires no human-labeled data other than an existing general-domain verb sense dictionary such as VerbNet (Schuler and Palmer, 2005) or OntoNotes Sense Groupings (Hovy et al., 2006). ETYPECLUS contains four major steps.

3 Intuitively, the object head is the most essential word in the object, such as "people" in the object "hundreds of people".

4 Subjects are intentionally left out here because Allerton (1979) finds that objects play a more important role in determining predicate semantics. Also, many P-O pairs indicate the same event type but share different subjects (e.g., "police capture X" and "terrorists capture X" are considered two different events but belong to the same event type Capture Person).
Adding subjects may help divide the current event types into more fine-grained ones, and we leave this for future work.

First, ETYPECLUS extracts ⟨predicate, object head⟩ pairs from the input corpus based on sentence dependency tree structures. As some extracted pairs could be too general (e.g., ⟨say, it⟩) or too specific (e.g., ⟨document, microcephaly⟩), the second step of ETYPECLUS identifies salient predicates and object heads in the corpus. After that, we disambiguate the sense of each predicate verb by comparing its usage with the example sentences in a given verb sense dictionary. Finally, ETYPECLUS clusters the remaining salient P-O pairs into event types using a latent space generative model. This model jointly embeds P-O pairs into a latent spherical space and performs clustering within this space. By doing so, we can guide the latent space learning with the clustering objective and enable the clustering process to benefit from the well-separated structure of the latent space. We show that our ETYPECLUS framework can save annotation costs and output corpus-specific event types on three datasets. The first two are the benchmark datasets ACE 2005 and ERE (Entity Relation Event) (Song et al., 2015). ETYPECLUS can successfully recover predefined types and identify new event types such as Build in ACE and Bombing in ERE. Furthermore, to test the performance of ETYPECLUS in new domains, we collect a corpus about the disease outbreak scenario. Results show that ETYPECLUS can identify many interesting fine-grained event types (e.g., Vaccinate, Test) that align well with human annotations. Contributions.
The major contributions of this paper are summarized as follows: (1) a new event type representation is created as a cluster of ⟨predicate sense, object head⟩ tuples; (2) a novel event type induction framework, ETYPECLUS, is proposed that automatically disambiguates predicate senses and learns a latent space with the desired event cluster structures; and (3) extensive experiments on three datasets verify the effectiveness of ETYPECLUS in terms of both automatic and human evaluations.

In this section, we first introduce some important concepts and then present our task definition. A corpus S = {S_1, . . . , S_N} is a set of sentences where each sentence S_i ∈ S is a word sequence [w_{i,1}, . . . , w_{i,n}]. A predicate is a verb mention in a sentence and can optionally have an associated object in the same sentence. We follow previous studies (Corbett et al., 1993; O'Gorman et al., 2016) and refer to the most important word in the object as the object head.
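To make these notions concrete, the voice-sensitive pair extraction described in the next section can be sketched over a toy, hand-rolled parse. The token format and the SUBJECT/OBJECT label sets below are illustrative assumptions, not the paper's implementation (which runs on a real spaCy parse):

```python
# Toy dependency-parse records: (index, text, pos, dep_label, head_index).
# This hand-rolled structure stands in for a real spaCy parse.
SUBJECT_DEPS = {"nsubj", "nsubjpass"}   # assumed label sets, for illustration
OBJECT_DEPS = {"dobj", "obj"}

def extract_pairs(tokens):
    """Return (predicate, object_head) pairs: for passive predicates take the
    syntactic subject before the verb, otherwise the object after it."""
    pairs = []
    for i, text, pos, dep, head in tokens:
        if pos != "VERB" or dep in ("aux", "auxpass"):
            continue  # only non-auxiliary verbs are candidate predicates
        deps = [(j, t, d) for j, t, _, d, h in tokens if h == i]
        passive = any(d == "auxpass" for _, _, d in deps)
        for j, t, d in deps:
            if passive and j < i and d in SUBJECT_DEPS:
                pairs.append((text, t))
            elif not passive and j > i and d in OBJECT_DEPS:
                pairs.append((text, t))
    return pairs
```

In the real pipeline, spaCy supplies the part-of-speech tags, dependency labels, and head indices that this toy structure mimics.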
For example, one predicate from the first sentence in Figure 1 is "detain" and its corresponding object is "hundreds of people", with the word "people" being the object head. As predicates with the same lemma may have different senses, we disambiguate each predicate verb based on a verb sense dictionary V wherein each verb lemma has a list of candidate senses with example usage sentences. One illustrative example of our verb sense dictionary is shown in Figure 3. We refer to the sense of a predicate verb lemma as the predicate sense.

Figure 3: An example verb sense dictionary entry for "arrest" with three senses, each illustrated with example sentences. Sense 1: catch and take into custody (e.g., "The police arrested her for drinking and driving."). Sense 2: stop or interrupt something (e.g., "The treatment has so far done little to arrest the spread of the cancer."). Sense 3: take a hold and capture suddenly.

Task Definition. Given a corpus S and a verb sense dictionary V, our task of event type induction is to identify a set of K event types where each type T_j is represented by a cluster of ⟨predicate sense, object head⟩ pairs.

The ETYPECLUS framework (outlined in Figure 2) induces event types in four major steps: (1) predicate and object head extraction, (2) salient predicate lemma and object head selection, (3) predicate sense disambiguation, and (4) latent space joint predicate sense and object head clustering.

We propose a lightweight method to extract predicates and object heads in sentences without relying on manually-labeled training data. Specifically, given a sentence S_i, we first use a dependency parser (the spaCy en_core_web_lg model) to obtain its dependency parse tree and select all non-auxiliary verb tokens (part-of-speech tag VERB and dependency label not equal to aux or auxpass) as our candidate predicates. Then, for each candidate predicate, we check its dependent words, and if any of them has a
dependency label auxpass, we believe this predicate verb is in passive voice and find its object heads among its syntactic children that occur before it and have a dependency label in the SUBJECT label set. Otherwise, we consider this predicate to be in active voice and identify its object heads among its dependents that occur after it and have a dependency label in the OBJECT label set. Finally, we aggregate all ⟨predicate, object head⟩ pairs along with their frequencies in the corpus.

The extracted ⟨predicate, object head⟩ pairs are of varying quality. Some are too general and contain little information, while others are too specific and hard to generalize. Thus, this step of ETYPECLUS selects salient predicate lemmas and object heads from the input corpus. We compute the salience of a word (either a predicate lemma or an object head) based on two criteria. First, it should appear frequently in our corpus. Second, it should not be too frequent in a large general-domain background corpus. Computationally, we follow the TF-IDF idea and define word salience as follows:

salience(w) = log(1 + freq(w)) · log(N_bs / bsf(w)),

where freq(w) is the frequency of word w in our corpus, N_bs is the number of background sentences, and bsf(w) is the background sentence frequency of word w. Finally, we select the terms with salience scores ranked in the top 80% as our salient predicate lemmas and object heads. Table 2 lists the top 5 most salient predicate lemmas and object heads in the three datasets. The first two datasets contain news articles about wars, and thus terms like "kill" and "weapon" are ranked at the top. The third dataset includes articles about disease outbreaks, and thus its most salient terms include "infect", "virus", and "outbreak".
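A minimal sketch of this salience criterion follows; the exact smoothing (the +1 terms) is an illustrative assumption rather than the paper's precise formula:

```python
import math

def salience(freq_w, n_bg_sentences, bg_sent_freq):
    """TF-IDF-style salience: reward corpus frequency, penalize words
    that are also frequent in the background corpus."""
    return math.log(1 + freq_w) * math.log(n_bg_sentences / (1 + bg_sent_freq))

def select_salient(word_freqs, n_bg, bg_freqs, keep_ratio=0.8):
    """Keep the words whose salience ranks in the top keep_ratio fraction."""
    scored = {w: salience(f, n_bg, bg_freqs.get(w, 0))
              for w, f in word_freqs.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```

A domain-specific term like "infect" that is common in the corpus but rare in the background thus outscores an everywhere-common verb like "say".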
As verbs typically exhibit large sense ambiguities, we disambiguate each predicate's sense in its sentence. Huang et al. (2016) achieve this goal by utilizing a supervised word sense disambiguation tool (Zhong and Ng, 2010) to link each predicate to a WordNet sense (Miller, 1995) and then mapping that sense back to an OntoNotes sense grouping (Hovy et al., 2006). In this work, we propose to remove such extra complexity and present a lightweight sense disambiguation method that requires only a verb sense dictionary. The key idea of our method is to compare the usage of a predicate with each verb sense's example sentences in the dictionary. Given a predicate verb v in sentence S_i, we compute two types of features to capture both its content and context information. The first one, denoted as v^emb, is obtained by feeding the sentence S_i into the BERT-Large model (Devlin et al., 2019) and retrieving the predicate's corresponding contextualized embedding. The second feature, v^mwp, is a ranked list of 10 alternative words that can be used to replace v in sentence S_i. Specifically, we replace the original word v in S_i with a special [MASK] token and feed the masked sentence S_i^mask into BERT-Large for masked word prediction. From the prediction results, we select the top 10 most likely words and sort them into v^mwp. After obtaining the predicate representation, we compute the representations of its candidate senses in the dictionary. Suppose the lemma of predicate v has N_v candidate senses in the dictionary, and each sense E_j, j ∈ [1, . . . , N_v], has N_j example sentences {S_{j,k}}, k ∈ [1, . . . , N_j]. Then, within each example sentence S_{j,k}, we locate where the predicate lemma v occurs and compute its corresponding features v^emb_{j,k} and v^mwp_{j,k} as discussed before. After that, we obtain two types of features for each sense E_j as follows:

v^emb_j = (1 / N_j) · Σ_{k=1}^{N_j} v^emb_{j,k},    v^mwp_j = RA({v^mwp_{j,k}}_{k=1}^{N_j}),

where RA(·) stands for the rank aggregation operation based on mean reciprocal rank.
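The rank aggregation RA(·), and the sense-selection score it feeds into (cosine similarity on content features plus rank-biased overlap on masked-word lists), can be sketched as follows. This is a minimal illustration: the RBO here is a simple truncated variant without the extrapolation term of Webber et al. (2010), and tie-breaking in the aggregation is arbitrary:

```python
import math
from collections import defaultdict

def aggregate_ranks(ranked_lists):
    """RA(.): fuse several ranked word lists by mean reciprocal rank.
    Items absent from a list contribute nothing for that list."""
    scores = defaultdict(float)
    for lst in ranked_lists:
        for rank, item in enumerate(lst, start=1):
            scores[item] += 1.0 / rank
    n = len(ranked_lists)
    return sorted(scores, key=lambda w: scores[w] / n, reverse=True)

def rbo(list1, list2, p=0.9):
    """Truncated rank-biased overlap between two finite ranked lists."""
    depth = min(len(list1), len(list2))
    seen1, seen2, score = set(), set(), 0.0
    for d in range(1, depth + 1):
        seen1.add(list1[d - 1])
        seen2.add(list2[d - 1])
        score += (p ** (d - 1)) * len(seen1 & seen2) / d
    return (1 - p) * score

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def pick_sense(v_emb, v_mwp, sense_embs, sense_mwps):
    """Return the index of the candidate sense maximizing cos + rbo."""
    scores = [cosine(v_emb, e) + rbo(v_mwp, m)
              for e, m in zip(sense_embs, sense_mwps)]
    return max(range(len(scores)), key=scores.__getitem__)
```

In the full pipeline, the embedding vectors come from BERT-Large and the ranked lists from masked word prediction; here any vectors and word lists work.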
This aggregation method is widely used in previous literature (Shen et al., 2017, 2020; Zhang et al., 2020; Huang et al., 2020) for fusing ranked lists. Finally, we choose the sense that is most similar to the predicate v as follows:

j* = argmax_{j ∈ [1, ..., N_v]} ( cos(v^emb, v^emb_j) + rbo(v^mwp, v^mwp_j) ),

where cos(x, y) is the cosine similarity between two vectors x and y, and rbo(a, b) is the rank-biased overlap similarity (Webber et al., 2010) between two ranked lists. We evaluate our method on the verb subset of standard word sense disambiguation benchmarks (Navigli et al., 2017). Our method achieves a 55.7% F1 score. In comparison, the supervised IMS method used in (Huang et al., 2016) gets a 56.9% F1 score. Thus, our method is comparable to the supervised IMS while being more lightweight and requiring no training data.

After obtaining salient ⟨predicate sense, object head⟩ pairs (P-O pairs for short), we aim to cluster them into event types. Below, we first discuss how to obtain the initial features for predicate senses and object heads (Section 3.4.1). As those predicate senses and object heads live in two separate spaces, we aim to fuse them into one joint feature space wherein the event cluster structures are better preserved. Inspired by (Meng et al., 2022), we achieve this goal by proposing a latent space generative method that jointly embeds P-O pairs into a unified spherical space and performs clustering in this space. Finally, we discuss how to train this generative model in Section 3.4.3.

We obtain two types of features for each term w (either a predicate sense w_p or an object head w_o) by first locating its mentions in the corpus and then aggregating mention-level representations into term-level features. Suppose term w appears M_w times; for each of its mentions m_{w,l}, l ∈ [1, . . . , M_w], we extract this mention's content feature m^emb_{w,l} and context feature m^mwp_{w,l}, following the same process discussed in Section 3.3.
Then, we average all mentions' content features into the term's content feature m^emb_w = (1 / M_w) · Σ_{l=1}^{M_w} m^emb_{w,l}. The aggregation of mention context features is more difficult, as each m^mwp_{w,l} is not a numerical vector but instead a set of words predicted by BERT to replace m_{w,l}. In this work, we propose the following aggregation scheme. For each term w, we first construct a pseudo document D_w using the bag union operation (namely, D_w contains a word T times if this word appears in T different m^mwp_{w,l}, l ∈ [1, . . . , M_w]). Then, we obtain the vector representations of the pseudo documents based on a TF-IDF transformation and apply Principal Component Analysis (PCA) to reduce the dimensionality of the document vectors. A similar idea is discussed in (Amrami and Goldberg, 2018). The resulting vector is taken as the term's context feature vector m^mwp_w. Finally, we concatenate m^emb_w with m^mwp_w to obtain the initial feature vectors of predicate senses (denoted as h^p) and object heads (denoted as h^o).

To cluster P-O pairs into K event types based on two separate feature spaces (H_p for predicate senses and H_o for object heads), one straightforward approach is to represent each P-O pair by concatenating its two feature vectors and directly apply clustering algorithms to all pairs. However, this approach cannot guarantee that the concatenated space H = [H_p, H_o] will be naturally suited for clustering. Therefore, we propose to jointly embed and cluster P-O pairs in a latent space Z. By doing so, we can unify the two feature spaces H_p and H_o. More importantly, the latent space learning is guided by the clustering objective, and the clustering process can benefit from the well-separated structure of the latent space, which achieves a mutually-enhanced effect. We design the latent space to have a spherical topology because cosine similarity captures word/event semantic similarities more naturally than Euclidean/L2 distance.
Previous studies (Meng et al., 2019a, 2020) also show that learning spherical embeddings directly is better than first learning Euclidean embeddings and normalizing them later. Thus, we assume there is a spherical latent space Z with K clusters. Each cluster in this space corresponds to one event type and is associated with a von Mises-Fisher (vMF) distribution (Banerjee et al., 2005) from which event type representative P-O pairs are generated. The vMF distribution of an event type c is parameterized by a mean vector c and a concentration parameter κ. A unit-norm vector z is generated from vMF_d(c, κ) with probability

p(z | c) = c_d(κ) · exp(κ · c^T z),

where c_d(κ) is a normalization constant and d is the dimensionality of the latent space Z. Each P-O pair ⟨p_i, o_i⟩ is assumed to be generated as follows: (1) an event type c_k is sampled from a uniform distribution over the K types; (2) a latent embedding z_i is generated from the vMF distribution associated with c_k; and (3) a function g_p (g_o) maps the latent embedding z_i to the original embedding h^p_i (h^o_i) corresponding to the predicate sense p_i (object head o_i). Namely, we have:

h^p_i = g_p(z_i),    h^o_i = g_o(z_i).

We parameterize g_p and g_o as two deep neural networks and jointly learn the mapping functions f_p : H_p → Z and f_o : H_o → Z from the original spaces to the latent space. Such a setup closely follows the autoencoder architecture (Hinton and Zemel, 1993), which is shown to be effective for preserving input information. We learn our generative model by jointly optimizing two objectives. The first one is a reconstruction objective:

O_rec = Σ_{i=1}^{N} ( log p(h^p_i | g_p(f_p(h^p_i))) + log p(h^o_i | g_o(f_o(h^o_i))) ).

This objective encourages our model to preserve the input space semantics and generate the original data faithfully. The second, clustering-promoting objective enforces our model to learn a latent space with K well-separated cluster structures. Specifically, we use an expectation-maximization (EM) algorithm to sharpen the posterior event type distribution of each input P-O pair.
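The EM sharpening just introduced can be sketched in numpy as follows. Because the type prior is uniform and κ is shared across clusters, the vMF normalizing constants cancel and the E-step posterior reduces to a softmax over scaled cosine scores; this sketch ignores the neural mapping functions and works directly on latent vectors:

```python
import numpy as np

def vmf_posterior(Z, C, kappa=10.0):
    """E-step posterior p(c_k | z_i) for unit vectors Z (N x d) and cluster
    means C (K x d). With a uniform prior and a shared concentration kappa,
    the vMF normalizing constants cancel and the posterior is a softmax."""
    logits = kappa * Z @ C.T                      # (N, K) scaled cosine scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def sharpen(P):
    """Squaring-then-normalizing target distribution q(c_k | z_i):
    skews each row toward its most confident cluster assignment."""
    weight = P ** 2 / P.sum(axis=0, keepdims=True)   # s_k = sum_i p(c_k | z_i)
    return weight / weight.sum(axis=1, keepdims=True)
```

In the full model, these posteriors drive the clustering-promoting objective while the autoencoder reconstruction keeps the latent space faithful to the inputs.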
In the expectation step, we first compute the posterior distribution based on the current model parameters as follows:

p(c_k | z_i) = p(z_i | c_k) · p(c_k) / Σ_{k'=1}^{K} p(z_i | c_{k'}) · p(c_{k'}) = exp(κ · c_k^T z_i) / Σ_{k'=1}^{K} exp(κ · c_{k'}^T z_i),

where the second equality holds because the prior over event types is uniform and κ is shared across clusters. We then compute a new estimate of each P-O pair's cluster assignment q(c_k | z_i) and use it to update the model in the maximization step. Instead of making hard cluster assignments like K-means, which directly assigns each z_i to its closest cluster, we compute a soft assignment q(c_k | z_i) as follows:

q(c_k | z_i) = ( p(c_k | z_i)^2 / s_k ) / Σ_{k'=1}^{K} ( p(c_{k'} | z_i)^2 / s_{k'} ),

where s_k = Σ_{i=1}^{N} p(c_k | z_i). This squaring-then-normalizing formulation has a sharpening effect that skews the distribution towards its most confident cluster assignment, as shown in (Xie et al., 2016; Meng et al., 2018, 2019b). The formulation encourages unambiguous assignment of P-O pairs to event types so that the learned latent space will gradually develop well-separated cluster structures. Finally, in the maximization step, we update the model parameters to maximize the expected log-probability of the current cluster assignments under the new cluster assignment estimates:

O_clus = Σ_{i=1}^{N} Σ_{k=1}^{K} q(c_k | z_i) · log p(c_k | z_i),

where p is updated to approximate the fixed target q. We summarize our training procedure in Algorithm 1. We first pretrain the model using only the reconstruction objective, which provides a stable initialization of all parameterized mapping functions. Then, we apply the EM algorithm to iteratively update all mapping functions and the event type parameters C with the joint objective O_rec + λ · O_clus, where the hyper-parameter λ balances the two objectives. The algorithm is considered converged if fewer than δ = 5% of the P-O pairs change cluster assignment between two iterations or a maximum iteration number is reached. Finally, we output each P-O pair's distribution over the K event types.

We first evaluate ETYPECLUS on two widely used event extraction datasets: ACE (Automatic Content Extraction) 2005 and ERE (Entity Relation Event) (Song et al., 2015).
For both datasets, we follow the same preprocessing steps as (Lin et al., 2020; Li et al., 2021) and use the sentences in the training split as our input corpus. The ACE dataset contains 17,172 sentences with 33 event types and the ERE dataset has 14,695 sentences with 38 types. We test the performance of ETYPECLUS on event type discovery and event mention clustering.

Table 3: Example discovered event types with their top-ranked P-O pairs and occurring sentences.
• Arrest-Jail: ⟨arrest_0, protester⟩, ⟨arrest_0, militant⟩, ⟨arrest_0, suspect⟩. "For the most part the marches went off peacefully, but in New York a small group of protesters were arrested after they refused to go home at the end of their rally, police sources said."; "On Tuesday, Saudi security officials said three suspected al-Qaida militants were arrested in Jiddah, Saudi Arabia."
• Build ∇: ⟨build_0, facility⟩, ⟨build_0, center⟩, ⟨build_0, housing⟩. "Plans were underway to build destruction facilities at all other locations but now the Bush junta has removed from its proposed defense budget for fiscal year 2006 all but the minimum funding."; "Virginia is apparently going to be build a data center in Richmond, a back-up data center, and a help desk/call center as a follow-on to the creation of VITA, the Virginia Information Technology Agency."
• Transfer-Money: ⟨fund_0, activity⟩, ⟨fund_0, operation⟩, ⟨fund_0, people⟩. "The grants will fund advisory activities, including local capacity building, infrastructure development and product development."; "The White House had hoped to hold off asking for more money to fund military operations in Iraq and Afghanistan until after the election, but with costs rising faster than expected, it sent a request for an early installment of $25 billion to Congress this week."

Table 4: Event mention clustering results. All values are in percentage. We run each method 10 times and report the averaged result for each metric with the standard deviation. Note that ACC is not applicable for Triframes because it assumes an equal number of clusters in the ground truth and the generated results.
We apply ETYPECLUS on each input corpus to discover 100 candidate event clusters and follow (Huang et al., 2016) to manually check whether the discovered clusters can reconstruct the ground-truth event types. On ACE, we recover 24 out of 33 event types (19 out of the 20 most frequent types), and 7 of the 9 missing types have a frequency of less than 10. On ERE, we recover 28 out of 38 event types (18 out of the 20 most frequent types). We show some example clusters in Table 3, which includes top-ranked P-O pairs and their occurring sentences. We observe that ETYPECLUS successfully identifies human-defined event types (e.g., Arrest-Jail in ACE and Transfer-Money in ERE). It can also identify finer-grained types compared with the original ground-truth types (e.g., the 4th row of Table 3 shows one discovered event type, Bombing, in ERE, which is finer-grained than "Conflict:Attack", the closest human-annotated type in ERE). Further, ETYPECLUS is able to identify new salient event types (e.g., the new event type Build in ACE). Finally, ETYPECLUS not only induces event types but also provides their example sentences, which serve as corpus-specific annotation guidance.

We evaluate the effectiveness of our latent space generative model via the event mention clustering task. We first match each event mention with one extracted P-O pair if possible, and select the 15 event types with the most matched results. 13 Then, for each selected type, we collect its associated mentions and add them into a candidate pool. We represent each mention using the feature of its corresponding P-O pair. Finally, we cluster all mentions in the candidate pool into 15 groups and evaluate whether they align well with the original 15 types. The event mention clustering quality also serves as a good proxy of the event type quality: if a method can discover good event types from a corpus, it should also be able to generate good event mention clusters when the ground-truth number of clusters is given.
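One of the clustering metrics used in the evaluation below, BCubed-F1, can be sketched as follows for flat cluster assignments given as equal-length label lists (a minimal illustration; the paper's exact formulas are in its Appendix):

```python
from collections import defaultdict

def bcubed_f1(pred, gold):
    """BCubed precision/recall/F1 (Bagga and Baldwin, 1998): for each element,
    compare the overlap of its predicted cluster with its gold cluster."""
    n = len(pred)
    pred_clusters, gold_clusters = defaultdict(set), defaultdict(set)
    for i, (p, g) in enumerate(zip(pred, gold)):
        pred_clusters[p].add(i)
        gold_clusters[g].add(i)
    precision = recall = 0.0
    for i in range(n):
        same_pred = pred_clusters[pred[i]]
        same_gold = gold_clusters[gold[i]]
        correct = len(same_pred & same_gold)
        precision += correct / len(same_pred)
        recall += correct / len(same_gold)
    precision /= n
    recall /= n
    return 2 * precision * recall / (precision + recall)
```

A perfect clustering (any relabeling of the gold partition) scores 1.0, and merging or splitting gold clusters lowers the score through the per-element precision or recall terms.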
13 More details are discussed in Appendix Section D.

Compared Methods. We compare the following methods: (1) Kmeans: a standard clustering algorithm that works in the Euclidean feature space; we run it with the ground-truth number of clusters. (2) sp-Kmeans: a variant of Kmeans that clusters unit-normalized features on the sphere using cosine similarity. (3) AggClus: agglomerative clustering with ward linkage, stopped at the target number of clusters. (4) Triframes (Ustalov et al., 2018): a frame induction method that clusters a k-NN graph built over triples. (5) JCSC (Huang et al., 2016): a joint constrained spectral clustering algorithm over predicates and objects.

Evaluation Metrics. (1) ARI (Hubert and Arabie, 1985) measures the similarity between two cluster assignments based on the number of pairs in the same/different clusters. (2) NMI denotes the normalized mutual information between two cluster assignments. (3) BCubed-F1 (Bagga and Baldwin, 1998) estimates the quality of the generated cluster assignment by aggregating the precision and recall of each element. (4) ACC measures the clustering quality by finding the permutation function from predicted cluster IDs to ground-truth IDs that gives the highest accuracy. The mathematical formulas of these metrics are in Appendix Section E. For all four metrics, higher values indicate better model performance.

Experiment Results. Table 4 shows that ETYPECLUS outperforms all the baselines on both datasets in terms of all metrics. The major advantage of ETYPECLUS is the latent event space: different types of information can be projected into the same space for effective clustering. We also observe that JCSC is the strongest among all baselines. We think the reason is that it uses a joint clustering strategy where event types are defined as predicate clusters and the constraint function enables objects to refine the predicate clusters. Thus, a predicate-centric clustering algorithm can outperform all other baselines, which supports our verb-centric view of events.

To evaluate the portability of ETYPECLUS to a new open domain, we collect a new dataset that includes 98,000 sentences about disease outbreak events. We run the top-3 performing baselines and ETYPECLUS to generate 30 candidate event types and evaluate their quality using an intrusion test.
Specifically, we inject a negative sample from other clusters into each cluster's top-5 results and ask three annotators to identify the outlier. More details on how we construct the intrusions are in the Appendix. The intuition behind this test is that annotators will find it easier to identify the intruders if the clustering results are clean and the tuples are semantically coherent. As shown in Table 6, ETYPECLUS achieves the highest accuracy among all the methods, indicating that it generates semantically coherent types in each cluster. Table 5 shows some discovered event types of ETYPECLUS. 15 Interesting examples include tuples with the same predicate sense but object heads of different granularities (e.g., ⟨spread_2, virus⟩ and ⟨spread_2, coronavirus⟩ for the Spread-Virus type), tuples with the same object head but different predicate senses (e.g., ⟨prevent_1, spread⟩ and ⟨mitigate_1, spread⟩ for the Prevent-Spread type), and event types with predicate verb lemmas that are not directly linkable to OntoNotes Sense Groupings (e.g., "immunize" and "vaccinate" for the Vaccinate type).

6 Related Work

Event Schema Induction. Early studies on event schema induction adopt rule-based approaches (Lehnert et al., 1992; Chinchor et al., 1993) and classification-based methods (Chieu et al., 2003; Bunescu and Mooney, 2004) to induce templates from labeled corpora. Later, unsupervised methods were proposed that leverage relation patterns (Sekine, 2006; Qiu et al., 2008) and coreference chains (Chambers and Jurafsky, 2011) for event schema induction. Typical approaches use probabilistic generative models (Chambers, 2013; Cheung et al., 2013; Nguyen et al., 2015; Li et al., 2020, 2021) or ad-hoc clustering algorithms (Huang et al., 2016; Sha et al., 2016) to induce predicate and argument clusters. In particular, (Liu et al., 2019) takes an entity-centric view toward event schema induction.
15 More example outputs are in Appendix Section H.

It clusters entities into semantic slots and finds predicates for the entity clusters in a post-processing step. (Yuan et al., 2018) studies the event profiling task and includes one module that leverages a Bayesian generative model to cluster predicate:role:label triplets into event types. These methods typically rely on discrete hand-crafted features derived from bag-of-words text representations and impose strong statistical assumptions, whereas our method uses pre-trained language models to reduce the feature generation complexity and relaxes stringent statistical assumptions via latent space clustering.

Weakly-Supervised Event Extraction. Some studies on event extraction (Bronstein et al., 2015; Ferguson et al., 2018; Chan et al., 2019) propose to leverage annotations for a few seen event types to help extract mentions of new event types specified by just a few keywords. These methods reduce the annotation effort but still require all target new types to be given. Recently, some studies (Huang et al., 2018; Lai and Nguyen, 2019; Huang and Ji, 2020) use transfer learning techniques to extend traditional event extraction models to new types without explicitly deriving the schemas of the new event types. Compared to our study, these methods still require many annotations for a set of seen types, and their resulting vector-based event type representations are less human-interpretable. Another related work by (Wang et al., 2019) uses a GAN to extract events from an open-domain corpus. It clusters entity:location:keyword:date quadruples related to the same event rather than finding event types.

In this paper, we study the event type induction problem, which aims to automatically generate salient event types for a given corpus.
We define a novel event type representation and propose ETYPECLUS, which can extract and select salient predicates and object heads, disambiguate predicate senses, and jointly embed and cluster P-O pairs in a latent space. Experiments on three datasets show that ETYPECLUS can recover human-curated types and identify new salient event types. In the future, we plan to explore the following directions: (1) improve predicate and object extraction quality with tools of higher semantic richness (e.g., an SRL labeler or an AMR parser); (2) leverage more information from lexical resources to enhance event representations; and (3) cluster objects into argument roles for each discovered event type.

Quan Yuan, Xiang Ren, Wenqi He, Chao Zhang, Xinhe Geng, Lifu Huang, Heng Ji, Chin-Yew Lin, and Jiawei Han. 2018. Open-schema event profiling for massive news corpora. In CIKM.
Yunyi Zhang, Jiaming Shen, Jingbo Shang, and Jiawei Han. 2020. Empower entity set expansion via language model probing. In ACL.
Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In ACL.

We use the OntoNotes sense grouping16 as our input verb sense dictionary. The background Wikipedia corpus is obtained from (Shen et al., 2020). We implement our latent space clustering model in PyTorch 1.7.0 with the Huggingface library (Wolf et al., 2020). To obtain each predicate sense's context representation (c.f. Section 3.4.1 in the main text), we apply PCA to reduce the original pseudo-document multi-hot representations to 500-dimensional vectors.
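As a concrete illustration of this reduction step, the multi-hot pseudo-document vectors can be projected with a plain SVD-based PCA. This is a minimal sketch: the matrix sizes, sparsity, and variable names below are illustrative placeholders, not the paper's actual data or implementation.

```python
import numpy as np

# Toy pseudo-document matrix: one multi-hot row per predicate sense over a
# small illustrative vocabulary (sizes here are placeholders).
rng = np.random.default_rng(0)
pseudo_docs = (rng.random((600, 2000)) < 0.02).astype(np.float64)

def pca_reduce(x, n_components):
    """Project rows of x onto the top principal components via SVD."""
    x_centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, sorted by explained variance.
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:n_components].T

# 500-dimensional context vectors, matching the dimensionality in the paper.
context_vectors = pca_reduce(pseudo_docs, 500)
print(context_vectors.shape)  # (600, 500)
```

The same reduction is available as `sklearn.decomposition.PCA`; the explicit SVD form is shown only to make the computation transparent.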
The hyperparameters of our latent space generative model are set as follows: the latent space dimension d = 100; the DNN hidden dimensions are 500-500-1000 for the encoders f_p/f_o and 1000-500-500 for the decoders g_p/g_o; the shared concentration parameter of event type clusters κ = 10; the weight of the clustering-promoting objective λ = 0.02; the convergence threshold δ = 0.05; and the maximum iteration number is 100. We learn the generative model using the Adam optimizer with learning rate 0.001 and batch size 64. We implement Kmeans and AggClus based on the Scikit-learn codebase (Pedregosa et al., 2011), using L2 distance for both methods. For Kmeans, we use the k-means++ strategy for model initialization and keep the result with the best inertia among 10 initializations. We use ward linkage for AggClus and stop when the target number of clusters is reached. For spherical Kmeans, we use an open-source implementation17; similar to Kmeans, we use k-means++ to initialize the model and select the best result among 10 initializations. For Triframes (Ustalov et al., 2018), we use its authors' original implementation18 and tune the parameter k in the k-NN graph construction step for each task and dataset to get a reasonable number of clusters. Specifically, we use k = 30 for the event mention clustering task, which gives the overall best evaluation results on both ACE and ERE; on the Pandemic corpus, we take k = 100, which generates 35 clusters that contain at least 40 tuples. For JCSC, we implement the clustering algorithm based on Algorithm 1 in (Huang et al., 2016); the spectral clustering used in JCSC is based on Scikit-learn's implementation, and the label-assigning strategy is K-means with 30 random initializations each time. We run all experiments on a single server with 80 CPU cores and a Quadro RTX 8000 GPU.
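The Kmeans and AggClus baseline setups described above can be sketched with scikit-learn as follows. The toy embeddings and cluster count are placeholders, and the spherical variant is approximated here by L2-normalizing vectors before Euclidean k-means, rather than by the open-source implementation the paper actually uses.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy embeddings standing in for P-O pair vectors (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 100))
n_clusters = 15

# Kmeans baseline: k-means++ init, best inertia kept over 10 initializations.
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                random_state=0)
kmeans_labels = kmeans.fit_predict(x)

# AggClus baseline: ward linkage, stopping at the target number of clusters.
agg = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
agg_labels = agg.fit_predict(x)

# Spherical Kmeans approximation: L2-normalize so Euclidean k-means
# effectively clusters directions on the unit hypersphere.
x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
sph_labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(x_unit)

print(len(set(kmeans_labels)), len(set(agg_labels)), len(set(sph_labels)))
```

Ward linkage requires Euclidean distance, which matches the L2 setting stated above; for cosine-style objectives, the normalization trick is a common stand-in for a dedicated spherical k-means.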
The BERT model is moved to the GPU for initial predicate sense and object head feature extraction and consumes about 11GB of GPU memory. We also train our latent space generative model on the GPU, where it consumes about 14GB of GPU memory. In principle, ETYPECLUS should also be runnable on CPU.

We create the evaluation dataset for event mention clustering as follows. First, we select event mentions whose trigger is a single-token verb. Then, for each selected event mention, we construct a P-O pair by choosing its non-pronoun argument that has some overlap with the object of our extracted ⟨predicate, object⟩ pair with the same verb trigger. After that, we select the top-15 event types with the most matched results for both datasets to avoid types with too few mentions, and their corresponding event mentions are used as ground truth clusters. We denote the ground truth clusters as C*, the predicted clusters as C, and the total number of event mentions as N.

• ARI (Hubert and Arabie, 1985) measures the similarity between two cluster assignments. Let TP (TN) denote the number of element pairs in the same (different) cluster(s) in both C* and C, and let RI = (TP + TN) / (N(N − 1)/2). Then, ARI is calculated as follows:

ARI = (RI − E(RI)) / (max(RI) − E(RI)),

where E(RI) is the expected RI of random assignments.

• NMI denotes the normalized mutual information between two cluster assignments and is widely used in previous studies. Let MI(·; ·) be the mutual information between two cluster assignments, and H(·) denote the entropy. Then the NMI is formulated as follows:

NMI(C*, C) = 2 · MI(C*; C) / (H(C*) + H(C)).

• BCubed (Bagga and Baldwin, 1998) estimates the quality of the generated cluster assignment by aggregating the precision and recall of each element. B-Cubed precision, recall, and F1 are calculated as follows:

P = (1/N) Σ_{i=1}^{N} |C*(i) ∩ C(i)| / |C(i)|,
R = (1/N) Σ_{i=1}^{N} |C*(i) ∩ C(i)| / |C*(i)|,
F1 = 2 · P · R / (P + R),

where C*(·) (C(·)) is the mapping function from an element to its ground truth (predicted) cluster.

• ACC measures the quality of the clustering results by finding the permutation function from predicted cluster IDs to ground truth IDs that gives the highest accuracy.
Let y_i (y*_i) denote the i-th element's predicted (ground truth) cluster ID. The ACC is formulated as follows:

ACC = max_{σ ∈ Perm(k)} (1/N) Σ_{i=1}^{N} 1(σ(y_i) = y*_i),

where k is the number of clusters for both C* and C, Perm(k) is the set of all permutation functions on the set {1, 2, . . . , k}, and 1(·) is the indicator function.

We follow an approach similar to (Li et al., 2021) to construct our Pandemic dataset. First, we resort to Wikipedia lists to get a set of Wikipedia articles related to disease outbreaks19. Then, we extract the news article links from the "References" section of those Wikipedia article pages. Finally, we crawl these news articles using the extracted links20 and construct a corpus related to disease outbreaks.

Given the top-5 tuples of each detected type, we inject a randomly sampled tuple from the top results of other types to serve as a negative sample. For methods that have cluster centers, we rank tuples within each cluster by their distances to the center; otherwise, we rank tuples according to their frequencies in the corpus. Then, the intrusion questions from all compared methods are randomly shuffled to avoid bias. Three annotators21 are asked to identify the injected tuples independently, and we take the average of their labeling accuracy to show the quality of the generated event types.

21 All three annotators are not in the author list of this paper and provide independent judgements of the tuple quality.

• For the most part the marches went off peacefully, but in New York a small group of protesters were arrested after they refused to go home at the end of their rally, police sources said.
• On Tuesday, Saudi security officials said three suspected al-Qaida militants were arrested in Jiddah, Saudi Arabia, in sweeps following the near-simultaneous suicide attacks on three residential compounds on the outskirts of Riyadh on May 12.
• can owe tell us exactly the details, the precise details of how you arrested the suspect?
Build∇ ⟨build_0, facility⟩, ⟨build_0, center⟩, ⟨build_0, housing⟩
• Plans were underway to build destruction facilities at all other locations but now the Bush junta has removed from its proposed defense budget for fiscal year 2006 all but the minimum funding for these destruction projects.
• Virginia is apparently going to be build a data center in Richmond, a back-up data center, and a help desk/call center as a follow-on to the creation of VITA, the Virginia Information Technology Agency.
• The Habitat for Humanity might be a good one to consider, since their expertise is in building housing, which of course is so beadly needed over there at this time.

Transfer-Money ⟨fund_0, activity⟩, ⟨fund_0, operation⟩, ⟨fund_0, people⟩
• The grants will fund advisory activities, including local capacity building, infrastructure development, product development, and development of local insurance companies' capacity to provide index-based insurance products.
• The White House had hoped to hold off asking for more money to fund military operations in Iraq and Afghanistan until after the election, but with costs rising faster than expected, it sent a request for an early installment of $25 billion to Congress this week.
• Watch 'Secret Pakistan' on the BBC iPlayer, it's an awesome two part documentary about how Pakistan has been supporting and funding these people for years.

Bombing∇ ⟨bomb_0, factory⟩, ⟨bomb_0, checkpoint⟩, ⟨bomb_0, base⟩
• He bombed the Aspirin factory in 1998 (which turned out to have nothing to do with Bin Laden) the week he revealed he had been lying to us for eight months about Lewinsky.
• Prosecutors then also pointed to the men's suicide bomber training in 2011 in Somalia and association with Beledi, who prosecutors said bombed a government checkpoint in Mogadishu that year.
• Once the war breaks out, Iran will immediately use all kinds of missiles to bomb the military bases of the United States in the Gulf and Israel to pieces.
Table 7: Example outputs of ETYPECLUS-discovered event types with their associated sentences in the ACE and ERE datasets. The first two types come from ACE and the remaining two are from ERE. Event types with the superscript "∇" do not exist in the human-labeled schemas and are newly discovered by the ETYPECLUS framework. Predicates are in bold and object heads are underlined and in italics.

• Pence chose not to wear a face mask during the tour despite the facility's policy.
• It should not be necessary for workers to wear facemasks routinely when in contact with the public.
• The WHO offers a conditional recommendation that health care providers also wear a separate head cover that protects the head and neck.

Prevent-Spread ⟨prevent_1, spread⟩, ⟨mitigate_1, spread⟩, ⟨mitigate_1, transmission⟩
• Infection prevention and control measures are critical to prevent the possible spread of MERS-CoV in health care facilities.
• A vaccine can mitigate spread, but not fully prevent the virus circulating.
• Asymptomatic infection could also potentially be directly harnessed to mitigate transmission.

Delay-Gathering ⟨delay_1, gathering⟩, ⟨postpone_1, gathering⟩, ⟨suspend_1, gathering⟩
• The 2020 edition of the Cannes Film Festival was left in limbo following an announcement from the festival's organizers that the gathering could be delayed until late June or early July.
• States with EVD should consider postponing mass gatherings until EVD transmission is interrupted.
• On Thursday, leaders of The Church of Jesus Christ of Latter-day Saints told its 15 million members worldwide all public gatherings would be suspended until further notice.

Provide-Testing ⟨provide_1, testing⟩, ⟨conduct_1, testing⟩, ⟨perform_1, testing⟩
• Governments are racing to buy medical equipment as a debate intensifies over providing adequate testing, when it's advisable to wear masks, and whether stricter lockdowns should be imposed.
• Additional testing is being conducted to confirm that the family members had H1N1 and to try to verify that the flu was transmitted from human to cat.
• Additional laboratories perform antiviral testing and report their results to CDC.

Warn-Country ⟨warn_1, country⟩, ⟨warn_1, authority⟩, ⟨warn_1, government⟩
• WHO uses six phases of alert to communicate the seriousness of infectious threats and to warn countries of the need to prepare and respond to outbreaks.
• The message showed a photo of a letter, written by the operators of the hospital's oxygen supply plant, warning the authorities that the supply was running dangerously low.
• WHO staff concluded there was a high risk of further spread, and issued a global alert to warn all member governments of the existence of a new and highly infectious form of "atypical pneumonia" on March 12th.

Vaccinate-People ⟨vaccinate_0, person⟩, ⟨immunize_0, people⟩, ⟨vaccinate_0, family⟩
• All persons in a recommended vaccination target group should be vaccinated with the 2009 H1N1 monovalent vaccine and the seasonal influenza vaccine.
• U.K. Will Start Immunizing People Against COVID-19 On Tuesday, Officials Say.
• "In the Samoan language there is no word for bacteria or virus," says Henrietta Aviga, a nurse travelling around villages to vaccinate and educate families.
References

• The stages of event extraction
• Essentials of grammatical theory: A consensus view of syntax and morphology
• Word sense induction with neural biLM and symmetric patterns
• Entity-based cross-document coreferencing using the vector space model
• The Berkeley FrameNet project
• Clustering on the unit hypersphere using von Mises-Fisher distributions
• Seed-based event trigger labeling: How far can event descriptions get us
• Collective information extraction with relational Markov networks
• Event schema induction with a probabilistic entity-driven model
• Template-based information extraction without the templates
• Rapid customization for event extraction
• Probabilistic frame induction
• Closing the gap: Learning-based information extraction rivaling knowledge-engineering methods
• Evaluating message understanding systems: An analysis of the third message understanding conference (MUC-3)
• Heads in grammatical theory
• Weakly-supervised neural text classification
• Weakly-supervised hierarchical text classification
• Topic discovery via latent space clustering of pretrained language model representations
• Hierarchical topic mining via joint spherical tree and text embedding
• WordNet: A lexical database for English
• Word sense disambiguation: A unified evaluation framework and empirical comparison
• Generative event schema induction with entity disambiguation
• Richer event description: Integrating event coreference with temporal, causal and bridging annotation
• The proposition bank: An annotated corpus of semantic roles
• Scikit-learn: Machine learning in Python
• Modeling context in scenario template creation
• VerbNet: A broad-coverage, comprehensive verb lexicon
• On-demand information extraction
• Joint learning templates and slots for event schema induction
• SynSetExpan: An iterative framework for joint entity set expansion and synonym discovery
• SetExpan: Corpus-based set expansion via context feature selection and rank ensemble
• From light to rich ERE: Annotation of entities, relations, and events
• Unsupervised semantic frame induction using triclustering
• Open event extraction from online text using a generative adversarial network
• A similarity measure for indefinite rankings
• Transformers: State-of-the-art natural language processing
• Unsupervised deep embedding for clustering analysis

Acknowledgements

Research was supported in part by US DARPA KAIROS Program No. FA8750-19-2-1004, SocialSim Program No. W911NF-17-C-0099, and INCAS Program No. HR001121C0165, NSF IIS-19-56151, IIS-17-41317, and IIS-17-04532, and the Molecule Maker Lab Institute: an AI Research Institutes program supported by NSF under Award No. 2019897. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily represent the views, either expressed or implied, of DARPA or the U.S. Government. We want to thank Martha Palmer and Ghazaleh Kazeminejad for their help with VerbNet and the OntoNotes sense groupings. We also would like to thank Sha Li, Yu Meng, and Lifu Huang for insightful discussions, and the anonymous reviewers for valuable feedback.

Ethical Considerations

Both event extraction and event type induction are standard tasks in NLP, and we do not see any significant ethical concerns. The expected usage of our work is to identify interesting event types from a user-provided corpus, such as a set of news articles or a collection of scientific papers.