key: cord-0043141-ik11jmok
authors: Lai, Viet Dac; Dernoncourt, Franck; Nguyen, Thien Huu
title: Exploiting the Matching Information in the Support Set for Few Shot Event Classification
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_18
sha: 7a1b250142e047eaa008e7ada45b24a9fe1f415d
doc_id: 43141
cord_uid: ik11jmok

The existing event classification (EC) work primarily focuses on the traditional supervised learning setting in which models are unable to extract event mentions of new/unseen event types. Few-shot learning has not been investigated in this area although it enables EC models to extend their operation to unobserved event types. To fill this gap, we investigate event classification under the few-shot learning setting. We propose a novel training method for this problem that extensively exploits the support set during the training process of a few-shot learning model. In particular, in addition to matching the query example with those in the support set for training, we seek to further match the examples within the support set themselves. This method provides more training signals for the models and can be applied to any metric-learning-based few-shot learning method. Our extensive experiments on two benchmark EC datasets show that the proposed method can improve the best reported few-shot learning models by up to 10% in accuracy for event classification.

Event Classification (EC) is an important task of Information Extraction (IE) in Natural Language Processing (NLP). The target of EC is to classify event mentions into some set of event types (i.e., classes). Event mentions are often associated with some words/phrases that are responsible for triggering the corresponding events in the sentences. For example, consider the following two sentences:

(1) The companies fire the employee who wrote anti-diversity memo.
(2) The troops were ordered to cease fire.

In these examples, an EC system should be able to classify the word "fire" in the two sentences above as an Employment-Termination event and an Attack event, respectively. As demonstrated by the examples, a notable challenge in EC is that similar surface forms of words might convey different events depending on the context.

Two main methods have been employed for EC. The first approach explores linguistic features (e.g., syntactic and semantic properties) to train statistical models [9]. The second approach, on the other hand, focuses on developing deep neural network models (e.g., convolutional neural networks (CNN) and recurrent neural networks (RNN)) to automatically learn effective features from large-scale datasets [5, 13]. Due to the development of deep learning models, the performance for EC has been improved significantly [14, 16, 17, 19, 23].

The current EC models mainly employ the traditional supervised learning setting [17, 19] where the set of event types for classification is predetermined. However, once a model is trained on datasets with the given set of event types, it is unable to detect event mentions of unseen event types. To extend EC to new event types, a common solution is to annotate additional training data for such new event types and re-train the models, which is extremely expensive. It is thus desirable to formalize EC in the few-shot learning setting where the systems need to learn to recognize event mentions for new event types from a handful of examples. This is, in fact, closer to how humans learn to do tasks and makes EC models more applicable in practice. However, to our knowledge, there has been no prior work on few-shot learning for EC.

In few-shot learning, we are given a support set and a query instance. The support set contains examples from a set of classes (e.g., events in EC).
A learning model needs to predict the class, to which the query instance belongs, among the classes presented in the support set. This is done based on the matching information between the query example and those in the support set. To apply this setting to extract the examples of some new type, we need to collect just a few examples of the new type and add them to the support set to form a new class. Afterward, whenever we need to predict whether a new example has the new type or not, we can set it as the query example and perform the models in this setting. In practice, we often have some existing datasets (denoted by D) with examples for some pre-defined types. The previous work on few-shot learning has thus exploited such datasets to simulate the aforementioned few-shot learning setting to train the models [26] . Basically, in each episode of the training process, a subset of the types in D is sampled for which a few examples are selected for each type to serve as the support set. Some other examples are also chosen from the remaining examples of each sampled type to establish the query points. The models would then be trained to correctly map the query examples to their corresponding types in the support set based on the context matching of the examples [7] . One potential issue with this training procedure is that the training signals for the models only come from the matching information between the query examples and the examples in the support set. The available matching information between the examples in the support set themselves is not yet explored in the existing few-shot learning work [26, 28] , especially for the NLP tasks [7] . While this approach can be acceptable for the tasks in computer vision, it might not be desirable for NLP applications, especially for EC. Overall, datasets in NLP are much smaller than those in computer vision, thus limiting the variety of the context for training purposes. 
Ignoring the matching information between the examples in the support set might lead to inefficient use of the training data for EC: the models cannot fully exploit the available information and fail to achieve good performance. Consequently, in this work, we propose to simultaneously exploit the matching information between the examples in the support set and between the query examples and the examples in the support set to train the few-shot learning models for EC. This is done by adding terms to the loss function (i.e., the auxiliary losses) that capture the matching knowledge between the examples in the support set. We expect that this new training technique can better utilize the training data and improve the performance of few-shot learning in EC. We extensively apply the proposed training method to different metric learning models for few-shot learning on two benchmark EC datasets. The experiments show that the new training technique significantly improves all the considered few-shot learning methods over the two datasets with a large performance gap. In summary, the contributions of this work include: (i) for the first time in the literature, we study the few-shot learning problem for event classification; (ii) we propose a novel training technique for few-shot learning models based on metric learning, which exploits the matching information between the examples in the support set as additional training signals; and (iii) we achieve state-of-the-art performance for EC in the few-shot learning setting, providing baselines for future research in this area.

Early studies in event classification mainly focus on designing linguistic features [1, 9, 12] for statistical models. Due to the development of deep learning, many advanced network architectures have been investigated to advance the event classification accuracy [5, 13, 17, 18, 19, 21, 22].
However, none of them investigates the few-shot learning problem for EC as we do in this work. Although some recent studies have considered a related setting where event types are augmented with some keywords [3, 11, 24], these works do not explicitly examine the few-shot learning setting. Some other efforts on zero-shot learning for event classification [8] are also related to our work.

Few-shot learning enables models to learn effective latent features without large-scale data. The early studies apply transfer learning to fine-tune pre-trained models, exploiting the latent information from the common classes with adequate instances [2, 4]. Metric learning, on the other hand, learns to model the distance distribution among the observed classes [10, 26, 28]. Recently, the idea of a fast learner that can generalize to a new concept quickly has been introduced in meta-learning [6, 25]. Among these methods, metric learning is more explainable and easier to train and implement than transfer learning and meta-learning. Notably, the prototypical networks in metric learning achieve state-of-the-art performance on several few-shot learning (FSL) benchmarks and show their robustness against noisy data [7, 26]. Although many FSL methods have been proposed for image recognition [6, 10, 25, 26, 28], there have been few studies investigating this setting for NLP problems [7, 29].

The task of few-shot event classification is to predict the event type of a query example $x$ given a support set $S$ and a set of event types $T = \{t_1, t_2, \ldots, t_N\}$ ($N$ is the number of event types). In few-shot learning, $S$ contains a few examples for each event type in $T$. For convenience, we denote the support set as:

$$S = \{(s_1^1, a_1^1, t_1), \ldots, (s_1^{K_1}, a_1^{K_1}, t_1), \ldots, (s_N^1, a_N^1, t_N), \ldots, (s_N^{K_N}, a_N^{K_N}, t_N)\} \tag{1}$$

where $(s_i^j, a_i^j, t_i)$ indicates that the $a_i^j$-th word in the sentence $s_i^j$ is the trigger word of an event mention with the event type $t_i$, and $K_1, K_2, \ldots, K_N$ are the numbers of examples in the support set for each type $t_1, t_2, \ldots, t_N$ respectively. For simplicity, we use $w_1, w_2, \ldots, w_l$ to represent the word sequence for some sentence with length $l$ in this work. Similarly, the query example $x$ can also be represented by $x = (q, p, t)$ where $q$, $p$ and $t$ represent the query sentence, the position of the trigger word in the sentence, and the true event type for this event mention respectively. Note that $t \in T$ is only provided at training time and the models need to predict this event type at test time.

In practice, the numbers of support examples in $S$ (i.e., $K_1, \ldots, K_N$) may vary. However, to ease the processing and speed up the training process with GPU, similar to recent studies in FSL [7], we employ the N-way K-shot FSL setting. In this setting, the numbers of instances per class in the support set are equal ($K_1 = \ldots = K_N = K > 1$) and small ($K \in \{5, 10\}$).

Note that to evaluate the few-shot learning models for EC, we would need the training data $D_{train}$ and the test data $D_{test}$. For few-shot learning, it is crucial that the sets of event types in $D_{train}$ and $D_{test}$ are disjoint. The event type set $T$ in each episode would then be a sample of the set of event types in $D_{train}$ or $D_{test}$, at training and evaluation time respectively. Also, as mentioned in the introduction, in one episode of the training process, a set of query examples (i.e., the query set) would be sampled so that it involves the same event types $T$ as the support set, and the examples for each type in the query set would be different from those in the support set. At test time, the classification accuracy of the models over all the examples in the test set would be evaluated.

The few-shot learning framework for EC in this work follows the typical metric learning structures in the prototypical networks [7, 26], involving three major components: instance encoder, prototypical module, and classifier module.

Instance Encoder: Given a sentence s = {w_1, w_2, ...
, w_l} and the position of the trigger word $a$ (i.e., $w_a$ is the trigger word of the event mention in $s$, and $(s, a)$ can belong to an example in $S$ or the query example), following the common practice in EC [5, 19], we first convert each word $w_i \in s$ into a real-valued vector to facilitate the neural computation in the following steps. In particular, in this work, we represent each word $w_i$ using the concatenation of the following two vectors:

- The pre-trained word embedding of $w_i$: this vector is expected to capture the hidden syntactic and semantic information for $w_i$ [15].
- The position embedding of $w_i$: this vector is obtained by mapping the relative distance of $w_i$ to the trigger word $w_a$ (i.e., $i - a$) to an embedding vector in the position embedding table. The position embedding table is initialized randomly and updated during the training process of the models. The purpose of the position embedding vectors is to explicitly inform the models of the position of the trigger word in the sentence [5].

After converting $w_i$ into a representation vector $e_i$, the input sentence $s$ becomes a sequence of representation vectors $E = e_1, e_2, \ldots, e_l$. Based on this sequence of vectors, a neural network architecture $f$ would be used to transform $E$ into an overall representation vector $v$ to encode the input example $(s, a)$ (i.e., $v = f(s, a)$). In this work, we investigate two network architectures for the encoding function $f$, i.e., one early architecture for EC based on CNN and one recent popular architecture for NLP based on Transformers:

CNN Encoder: This model applies the temporal convolution operation with some window size $k$ and multiple filters over the input vector sequence $E$, producing a hidden vector for each position in the input sentence. Such hidden vectors are then aggregated via the max-pooling operation to obtain the overall representation vector $v$ for $(s, a)$ [5, 7].
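As an illustration of the instance encoder with the CNN architecture described above, the following NumPy sketch concatenates word and position embeddings, applies a temporal convolution, and max-pools over positions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `encode_cnn`, the clipping of relative distances to `max_dist`, the zero padding, and the ReLU non-linearity are all choices made here for illustration, and the toy dimensions are much smaller than those used in the experiments.

```python
import numpy as np

def encode_cnn(token_ids, trigger_pos, word_emb, pos_emb, filters, bias, max_dist):
    """Encode one (sentence, trigger position) pair into a vector v.

    word_emb: (vocab, d_w) word embedding table
    pos_emb:  (2 * max_dist + 1, d_p) position embedding table
    filters:  (n_filters, k, d_w + d_p) convolution filters with odd window size k
    """
    l = len(token_ids)
    # Relative distance i - a to the trigger word, clipped (assumption) and
    # shifted so that it indexes a row of the position embedding table.
    rel = np.clip(np.arange(l) - trigger_pos, -max_dist, max_dist) + max_dist
    # e_i = [word embedding ; position embedding]
    E = np.concatenate([word_emb[token_ids], pos_emb[rel]], axis=1)   # (l, d)
    n_filters, k, _ = filters.shape
    pad = k // 2
    E_pad = np.pad(E, ((pad, pad), (0, 0)))          # zero padding (assumption)
    # One flattened window of k consecutive vectors per sentence position.
    windows = np.stack([E_pad[i:i + k].reshape(-1) for i in range(l)])
    # Temporal convolution followed by ReLU (the non-linearity is an assumption).
    H = np.maximum(windows @ filters.reshape(n_filters, -1).T + bias, 0.0)
    # Max-pooling over positions gives the overall representation v.
    return H.max(axis=0)

# Toy usage with small, hypothetical dimensions.
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(100, 8))
pos_emb = rng.normal(size=(2 * 10 + 1, 4))
filters = rng.normal(size=(5, 3, 12))
v = encode_cnn([4, 17, 42, 8, 3], 2, word_emb, pos_emb, filters, np.zeros(5), max_dist=10)
```

Here `v` has one entry per filter; in a trained model the embedding tables and filters would be learned rather than sampled at random.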
Transformer Encoder: This is an advanced model to encode sequences of vectors based on the attention mechanism without recurrent neural networks [27]. The transformer encoder involves multiple layers; each of them consumes the sequence of hidden vectors from the previous layer to generate the sequence of hidden vectors for the current layer. The first layer takes $E$ as the input while, from the hidden vector sequence returned by the last layer, the vector at the position $a$ of the trigger word is used as the overall representation vector $v$ in this case. Each layer in the transformer encoder is composed of two sublayers (i.e., a multi-head self-attention layer and a feedforward layer) augmented with a residual connection around them [27].

Prototypical Module: The prototypical module aims to compute a single prototype vector to represent each class in $T$ of the support set. In this work, we consider two versions of this prototypical module in the literature. The first version is from the original prototypical networks [26]. It simply obtains the prototype vector $c_i$ for a class $t_i$ using the average of the representation vectors of the examples with the event type $t_i$ in the support set $S$:

$$c_i = \frac{1}{K} \sum_{j=1}^{K} f(s_i^j, a_i^j) \tag{2}$$

The second version, on the other hand, comes from the hybrid attention-based prototypical networks [7]. The prototype vector is a weighted sum of the representation vectors of the examples in the support set. The example weights (i.e., the attention weights) are determined by the similarity of the examples in the support set with respect to the query example $x = (q, p, t)$:

$$c_i = \sum_{j=1}^{K} \alpha_{ij} f(s_i^j, a_i^j), \quad \alpha_{ij} = \frac{\exp(b_{ij})}{\sum_{k=1}^{K} \exp(b_{ik})}, \quad b_{ij} = \mathrm{sum}\left[\sigma\left(f(s_i^j, a_i^j) \odot f(q, p)\right)\right] \tag{3}$$

In this formula, $\sigma$ is an activation function, $\odot$ is the element-wise multiplication and sum is the summation operation done over all the dimensions of the input vector.
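The two versions of the prototypical module can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the support examples are assumed to be already encoded into an array of shape (N, K, d), and `tanh` is assumed as the activation applied before the summation in the attention scores.

```python
import numpy as np

def mean_prototypes(support):
    """support: (N, K, d) encoded support examples -> (N, d) class prototypes."""
    return support.mean(axis=1)

def attention_prototypes(support, query):
    """Query-dependent prototypes as attention-weighted sums of support vectors.

    support: (N, K, d) encoded support examples
    query:   (d,) encoded query example
    """
    # Score of each support example: sum over dimensions of
    # tanh(support vector * query vector); tanh is an assumption.
    scores = np.tanh(support * query).sum(axis=-1)            # (N, K)
    scores = scores - scores.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)    # softmax over the K examples
    return (weights[..., None] * support).sum(axis=1)         # (N, d)

# Toy usage: a 5-way 5-shot support set with 16-dimensional encodings.
rng = np.random.default_rng(0)
support = rng.normal(size=(5, 5, 16))
query = rng.normal(size=16)
c_mean = mean_prototypes(support)
c_att = attention_prototypes(support, query)
```

Note that the attention-based prototypes depend on the query, so they have to be recomputed for every query example, whereas the averaged prototypes are computed once per support set.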
Classifier Module: In this module, we compute the probability distribution over the possible types for $x$ in $T$ using the distances from the query example $x = (q, p, t)$ to the prototypes of the classes/event types $T$ in the support set:

$$P(y = t_i \mid x, S) = \frac{\exp(-d(f(q, p), c_i))}{\sum_{j=1}^{N} \exp(-d(f(q, p), c_j))} \tag{4}$$

where $d$ is a distance function, and $c_i$ and $c_j$ are the prototype vectors obtained in either Eq. (2) or Eq. (3).

In this paper, we consider three popular distance functions in different few-shot learning models using metric learning:

- Cosine similarity in matching networks (called Matching) [28].
- Euclidean distance in the prototypical networks. Depending on whether the prototype vectors are computed with Eq. (2) or Eq. (3), we have two variations of this distance function, called Proto [26] and Proto+Att (i.e., in hybrid attention-based prototypical networks [7]) respectively.
- Learnable distance function using convolutional neural networks in relation networks (called Relation).

Given the probability distribution $P(y \mid x, S)$, the typical way to train the few-shot learning framework is to optimize the negative log-likelihood function for $x$ (with $t$ as the ground-truth event type for $x$) [7, 26]:

$$L_{query}(x, S) = -\log P(y = t \mid x, S) \tag{5}$$

Let $Q$ be some integer that is less than $K$ (i.e., $1 \le Q < K$). For each type $t_i$, we randomly select $Q$ examples (called the auxiliary query examples) from $S_i$, the set of examples of type $t_i$ in $S$, forming the auxiliary query set $S_i^Q$; the remaining examples of $S_i$ form the set $S_i^S$. We unify the sets $S_i^S$ to constitute an auxiliary support set $S^S$ while the union of $S_i^Q$ serves as the auxiliary query set:

$$S^S = \bigcup_{i=1}^{N} S_i^S, \qquad S^Q = \bigcup_{i=1}^{N} S_i^Q$$

Given the auxiliary support set $S^S$, we seek to enhance the training signals for the few-shot models by matching the examples in the auxiliary query set $S^Q$ with $S^S$. Specifically, we first use the same networks in the instance encoder and prototypical modules to compute the auxiliary prototypes for the classes in $T$ of the auxiliary support set $S^S$.
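The classifier module, the query loss, and the split of the support set into auxiliary query and auxiliary support parts can be sketched as below. This is only an illustration under assumptions: the function names are hypothetical, the squared Euclidean distance stands in for the distance function (one of several options considered in this work), the examples are assumed to be already encoded, and averaged prototypes are used for the auxiliary support set.

```python
import numpy as np

def log_type_distribution(query_vec, prototypes):
    """Log-softmax over negative distances from the query to each prototype."""
    dist = ((prototypes - query_vec) ** 2).sum(axis=1)   # squared Euclidean (assumption)
    logits = -dist
    logits = logits - logits.max()                       # numerical stability
    return logits - np.log(np.exp(logits).sum())

def query_loss(query_vec, true_type, prototypes):
    """Negative log-likelihood of the ground-truth type index for one query."""
    return -log_type_distribution(query_vec, prototypes)[true_type]

def leave_out_split(support, Q, rng):
    """Randomly hold out Q examples per class from (N, K, d) support vectors."""
    N, K, _ = support.shape
    assert 1 <= Q < K
    aux_query, aux_support = [], []
    for i in range(N):
        perm = rng.permutation(K)
        aux_query.append(support[i, perm[:Q]])     # auxiliary query examples of class i
        aux_support.append(support[i, perm[Q:]])   # remaining K - Q examples of class i
    return np.stack(aux_query), np.stack(aux_support)   # (N, Q, d), (N, K - Q, d)

# Toy usage: 5-way 5-shot support set, Q = 2 held-out examples per class.
rng = np.random.default_rng(0)
support = rng.normal(size=(5, 5, 16))
aux_q, aux_s = leave_out_split(support, Q=2, rng=rng)
aux_prototypes = aux_s.mean(axis=1)                      # auxiliary prototypes
loss_first = query_loss(aux_q[0, 0], 0, aux_prototypes)  # one held-out example of class 0
```

The same `query_loss` function applies unchanged to a regular query example and the prototypes of the full support set, which is what makes the held-out support examples usable as additional training signals.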
For each auxiliary example $z = (s_z, a_z, t_z) \in S^Q$ ($s_z$, $a_z$ and $t_z$ are the sentence, the trigger word position and the event type in $z$ respectively), we use the network in the classifier module to obtain the probability distribution $P(\cdot \mid z, S^S)$ over the possible event types for $z$ based on the auxiliary support set $S^S$. Afterward, we enforce that the models can correctly predict the event types for all the examples in the auxiliary query sets $S_i^Q$ given the support set $S^S$ by introducing the auxiliary loss function:

$$L_{aux}(S) = -\sum_{z \in S^Q} \log P(y = t_z \mid z, S^S)$$

Eventually, the overall loss function to be optimized to train the models in this work is:

$$L(x, S) = L_{query}(x, S) + \lambda L_{aux}(S)$$

where $\lambda$ is a trade-off parameter between the main loss function and the auxiliary loss function. For convenience, we call the training method with the auxiliary loss function for few-shot learning in this section LoLoss (i.e., leave-out loss) in the following experiments.

We evaluate all the models in this study on the ACE 2005 and TAC KBP 2015 datasets. ACE 2005 involves 33 event subtypes which are categorized into 8 event types: Business, Contact, Conflict, Justice, Life, Movement, Personnel, and Transaction. The TAC KBP dataset, on the other hand, contains 38 event subtypes for 9 event types. Due to the larger numbers of event subtypes, we use the subtypes in these datasets as the classes for our few-shot learning problem.

For the hyper-parameters, similar to the prior work [7], we evaluate all the models using N-way K-shot FSL settings with $N, K \in \{5, 10\}$. For training, we avoid feeding the same set of event subtypes in every batch to make training batches more diverse. Thus, following [7], we sample 20 event subtypes for each training batch while still keeping either 5 or 10 classes at test time. We initialize the word embeddings using the pre-trained GloVe embeddings with 300 dimensions. The word embeddings are updated during training as in [20].
We also randomly initialize the position embedding vectors with 50 dimensions. The other parameters are selected based on the development data of the datasets, leading to similar parameters for both ACE 2005 and TAC KBP 2015. In particular, the CNN encoder contains a single CNN layer with window size 3 and 250 filters. We manage to use this simple CNN encoder to have a fair comparison with the previous study [7] . The Transformer encoder contains 2 layers with a context size of 512 and 10 heads in the attention mechanism. The number of examples per class in the auxiliary query sets Q is set to 2 while the trade-off parameter λ in the loss function is 0.1. Table 1 shows the accuracy of the models (i.e., Matching, Proto, Proto+Att, and Relation) on the ACE 2005 test dataset, using the CNN encoder and Transformer encoder. There are several observations from the table. First, comparing the instance encoders, it is clear that the transformer encoder is significantly better than the CNN encoder across all the possible few-shot learning models and settings for EC. Second, comparing the few-shot learning models, the prototypical networks significantly outperform Matching and Relation with a large performance gap across all the settings. Among the prototypical networks, Proto+Att achieves better performance than Proto, thus confirming the benefits of the attention-based mechanism for the prototypical module. Third, comparing the pairs (5-way 5-shot vs 5-way 10-shot) and (10-way 5 shot vs 10 way 10 shot), we see that the performance of the models would be almost always better with larger K (i.e., the number of examples per class in the support set) on different settings, consistent with the natural intuition about the benefit of having more examples for training. Most importantly, we see that training the models with the LoLoss procedure would significantly improve the models' performance. 
This is true across different few-shot learning models, N-way K-shot settings, and encoder choices. The results clearly demonstrate the effectiveness of the proposed training procedure to exploit the matching information between examples in the support set for few-shot learning for EC. For simplicity, we only focus on the best few-shot learning models (i.e., the prototypical networks) and the Transformer encoder under 5-way 5-shot and 10-way 10-shot in the following analysis. Even though we show the results in fewer settings and models in Table 2 and 3, the same trends are observed for the other models and settings as well. Table 2 additionally reports the accuracy of Transformer-based models on the TAC KBP 2015 dataset. As we can see from the table, most of our observations for the ACE 2005 dataset still hold for TAC KBP 2015, once again confirming the advantages of the proposed LoLoss technique in this work. In this section, we seek to evaluate the robustness of the few-shot learning models against the possible noise in the training data. In particular, in each training episode where a set of examples is sampled for each type in T to form the query set Q, we simulate the noisy data by randomly selecting a portion of the examples in Q for label perturbation. Essentially, for each example in the selected subset of Q, we change its original label to another random one in T , making it a noisy example with an incorrect label. By varying the size of the selected portion in Q for label perturbation, we can control the level of noise in the training process for FSL in EC. Table 3 shows the accuracy of the Proto+Att model on the ACE 2005 test set that employs the Transformer encoder with or without the LoLoss training procedure for different noise rates. As we can see from the table, the introduction of noisy data would, in general, degrade the accuracy of the models (i.e., comparing the cells in Table 3 with the Proto+Att based model in Table 1) . 
However, across noise rates and N-way K-shot settings, the Proto+Att model trained with LoLoss is always significantly better than the one trained without LoLoss. The performance gap is substantial: at least 4.5% in every setting. In fact, we see that LoLoss improves Proto+Att more in the noisy setting (i.e., at least 4.5%) than in the setting without noisy data (i.e., at most 3.3% on the 5-way 5-shot and 10-way 10-shot settings in Table 1). Such evidence further confirms the effectiveness of LoLoss for few-shot learning and its robustness against noisy data, due to its exploitation of the matching information between the examples in the support set.

In this paper, we perform the first study on few-shot learning for event classification. We investigate different metric learning methods for this problem, featuring the typical prototypical network framework with several choices for the instance encoder (i.e., CNN and Transformer). In addition, we propose a novel technique, called LoLoss, to train the few-shot learning models for EC based on the matching information for the examples in the support set. The proposed LoLoss technique is applied to different few-shot learning methods on different datasets and settings, and it significantly improves the performance of the baseline models throughout. In the future, we plan to examine LoLoss for few-shot learning for other NLP and vision problems (e.g., relation extraction, image classification).

[1] The stages of event extraction
[2] Deep learning of representations for unsupervised and transfer learning
[3] Seed-based event trigger labeling: how far can event descriptions get us?
[4] Learning many related tasks at the same time with backpropagation
[5] Event extraction via dynamic multipooling convolutional neural networks
[6] Model-agnostic meta-learning for fast adaptation of deep networks
[7] Hybrid attention-based prototypical networks for noisy few-shot relation classification
[8] Zero-shot transfer learning for event extraction
[9] Refining event extraction through cross-document inference
[10] Siamese neural networks for one-shot image recognition
[11] Extending event detection to new types with learning from keywords
[12] Constructing information networks using one single model
[13] Exploiting argument information to improve event detection via supervised attention mechanisms
[14] Similar but not the same: word sense disambiguation improves event detection via neural representation matching
[15] Distributed representations of words and phrases and their compositionality
[16] New York University 2016 system for KBP event nugget: a deep learning approach
[17] Joint event extraction via recurrent neural networks
[18] A two-stage approach for extending event detection to new types via neural networks
[19] Event detection and domain adaptation with convolutional neural networks
[20] Relation extraction: perspective from convolutional neural networks
[21] Modeling skip-grams for event detection with convolutional neural networks
[22] Graph convolutional networks with argument-aware pooling for event detection
[23] One for all: neural joint modeling of entities and events
[24] Event detection and co-reference with minimal supervision
[25] Meta-learning with memory-augmented neural networks. In: ICML
[26] Prototypical networks for few-shot learning
[27] Attention is all you need
[28] Matching networks for one shot learning
[29] Diverse few-shot text classification with multiple metrics