key: cord-0331286-inhg64s5 authors: Kaushal, Ayush; Vaidhya, Tejas title: Leveraging Event Specific and Chunk Span features to Extract COVID Events from tweets date: 2020-12-18 journal: nan DOI: 10.18653/v1/2020.wnut-1.79 sha: 933a193b4a6cd778ba35482bfcddb9a963dc4fb0 doc_id: 331286 cord_uid: inhg64s5

Twitter has acted as an important source of information during disasters and pandemics, especially during COVID-19. In this paper, we describe our system entry for WNUT 2020 Shared Task-3. The task aimed at automating the extraction of a variety of COVID-19 related events from Twitter, such as individuals who recently contracted the virus, people with symptoms who were denied testing, and believed remedies against the infection. The system consists of separate multi-task models for slot-filling subtasks and sentence-classification subtasks, while leveraging the useful sentence-level information for the corresponding event. The system uses COVID-Twitter-Bert with attention-weighted pooling of candidate slot-chunk features to capture the useful information chunks. The system ranks 1st on the leaderboard with an F1 of 0.6598, without using any ensembles or additional datasets. The code and trained models are available at this https URL.

The World Health Organization declared COVID-19 a global pandemic on March 11, 2020. As of 2020/09/21, there were over 30 million cases and 900,000 deaths due to the infection. With the imposed lockdowns, work from home, and physical distancing, social media platforms such as Twitter saw increased usage. A large part of this use was posting and consuming information on the novel infection. This information includes potential reasons for contracting the disease, such as exposure to a family member who tested positive, as well as reports of people who show COVID symptoms but were denied testing.
Accompanying the pandemic was an infodemic of misinformation about COVID-19, including fake remedies, treatments, and prevention suggestions in social media (Alam et al., 2020). Zong et al. (2020) show that it is possible to automatically extract structured knowledge on COVID-19 events from Twitter and release a dataset of COVID-related tweets across 5 event types. We used this dataset in our experiments for the shared task. These tweets are annotated for whether they belong to an event (we refer to this as the event-prediction task in this paper) and for their event-specific questions (factual or opinion). We group these event-specific questions into two types of subtasks: slot-filling and sentence classification. Our system consists of separate multi-task models for slot-filling subtasks and sentence-classification subtasks. Our contribution improves upon the baseline (mentioned in section 2) in three ways:
• We incorporate the event-prediction task as an auxiliary subtask and fuse its features into all the event-specific subtasks.
• We perform attention-weighted pooling over the candidate chunk span, enabling the model to attend to subtask-specific cues.
• We use the domain-specific COVID-Twitter-Bert (CT-Bert) (Müller et al., 2020).

Sentence classification tasks (such as opinion or sentiment mining) as well as slot-filling tasks have progressed greatly with deep learning advancements such as LSTM (Hochreiter and Schmidhuber, 1997), Tree-LSTM (Tai et al., 2015), and transfer learning over pre-trained models (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). Among these, CT-Bert outperforms the others on COVID-related Twitter tasks (Müller et al., 2020). Taking inspiration from this, we use CT-Bert as part of our architecture. A variety of slot-filling approaches have been built on top of these deep learning advancements (Kurata et al., 2016; Qin et al., 2019).
We build upon the baseline proposed for our task (Zong et al., 2020).

Extraction of structured knowledge from tweets pertaining to events (Benson et al., 2011) has been studied for disaster and crisis management (Abhik and Toshniwal, 2013; Rudra et al., 2018) and in pandemic scenarios (Al-Garadi et al., 2016). Extracting such entities can be useful to epidemiologists, for deciding policies, and for preventing spread (Al-Garadi et al., 2016; Zong et al., 2020). Due to the fast-spreading nature of the infection, it is also difficult to manually trace the spread of the pandemic. However, with event-specific entity extraction from Twitter and geo-location, one could potentially build a real-time pandemic surveillance system (Lwowski and Najafirad, 2020; Al-Garadi et al., 2020). Bal et al. (2020) show that health-related misinformation is prevalent in social media, while Alam et al. (2020) discuss COVID-specific misinformation. Systems that extract structured knowledge from tweets about potential cures for COVID will help study how users perceive COVID misinformation.

In §3, we describe the dataset and the problem statement. Then in §4, we discuss the details of our two multi-task models, followed by experiments, results, and conclusion.

We now briefly go over the dataset. The reader may refer to Zong et al. (2020) for full details. Each of the 7500 tweets in the dataset belongs to one of the 5 event types: tested-positive, tested-negative, can-not-test, death, and cure. The first four events aim at extracting structured reports of coronavirus-related events, such as self-reported cases or news stories about public figures who were exposed to the virus. Each tweet was first annotated for whether it belongs to its respective event (e.g., is a tweet in the tested-positive event actually talking about someone who tested positive?). Throughout this paper, we refer to this as the Event-Prediction task.
The tweets that correspond to their event were then annotated for event-specific questions, or subtasks, about factual information and the user's opinions. All annotations were done by multiple Amazon Mechanical Turk workers, with inter-annotator agreement. The event-specific questions or subtasks (e.g., name, age, gender of the person who tested positive) vary depending on the event. These subtasks fall into two categories: slot-filling (e.g., Who tested positive/negative?, Where are they located?, Who is in close contact with the person contracting the disease?) and sentence classification (e.g., Is the author related to the infected person?, Does the author experience any symptoms?, Does the author believe a cure method is effective?).

The dataset releases tweet IDs and their annotations. We obtain the text corresponding to the tweets using the official Twitter API. Table 1 shows the statistics for the dataset we scraped in early July. Figure 1 shows an annotated example from the dataset. We divide the event-specific subtasks into the two categories shown in Table 2.

Table 2: The proposed event-specific subtasks split into two subtask types: slot-filling and sentence classification.

We now formally describe the two types of event-specific subtasks.

Slot-filling subtasks: Assume $n$ slot-filling subtasks $\{S_1, S_2, \dots, S_n\}$. We set up each slot-filling subtask $S_i$ as a supervised binary classification problem. Given the tweet $t$ and the candidate slot $s$, the model $f(t, s) \to \{0, 1\}$ predicts whether $s$ answers its designated question. As in the baseline, we extract the list of candidate slots, consisting of all noun chunks and named entities in each tweet, using a Twitter tagging tool (Ritter et al., 2011).

Sentence classification subtasks: Assume $m$ sentence classification subtasks $\{C_1, C_2, \dots, C_m\}$. Each sentence classification subtask $C_i$ aims to learn a model $g(t) \to \{l_1, l_2, \dots, l_k\}$, where $t$ is a tweet and $l_j$ is a label.
Here the number of labels can vary depending on the subtask. For example, gender is labelled with {Male, Female, Others/Not Specified}, Relation with {Yes, No}, Opinion with {effective, no cure, not effective, no opinion}, and so on. All these subtasks are supervised classification problems.

The dataset is also annotated with whether a tweet corresponds to its respective event or not. We treat this as an additional Event-Prediction task. This is a binary classification task that aims to learn a model $h(t) \to \{0, 1\}$, where $t$ is a tweet.

In the following subsections §4.1 and §4.2, we describe our multi-task models for slot-filling and sentence classification, respectively.

We improve upon the baseline (Zong et al., 2020) by using a domain-specific Bert, using attention-weighted pooling over the candidate chunk feature sequence, incorporating the auxiliary Event-Prediction task, and utilizing its logits for all the slot-filling subtasks. Before describing the approach, we first describe the Bert baseline. Our slot-filling model can be seen in figure 2.

The baseline consists of a Bert-based classifier. It takes a tweet $t$ as input and encloses the candidate slot $s$, within the tweet, inside special entity start <E> and end </E> markers. The Bert hidden representation of the token <E> is then processed through a fully connected layer with softmax activation to make the binary prediction for a task (Baldini Soares et al., 2019). Since many slot-filling tasks within an event are semantically related to each other, the baseline jointly trains the final softmax layers of all the subtasks $S_i$ in an event by sharing their Bert model parameters.

COVID Twitter Bert (CT-Bert) is a Bert-Large model pretrained on a Twitter corpus on COVID-19 topics, yielding marginal improvements over Bert on tasks based on Twitter datasets (Müller et al., 2020). This motivates us to use CT-Bert instead of the Bert used in the baseline model.
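The marker-based input construction used by the baseline can be illustrated with a short sketch. It assumes that each candidate slot is given as a character span inside the tweet; the function names `mark_candidate` and `build_inputs`, the span representation, and the spaces placed around the markers are illustrative choices, not details taken from the baseline code.

```python
def mark_candidate(tweet, start, end, open_tag="<E>", close_tag="</E>"):
    """Wrap the candidate slot tweet[start:end] in entity markers.

    The character-span interface is an assumption made for this sketch.
    """
    if not (0 <= start < end <= len(tweet)):
        raise ValueError("candidate span must lie inside the tweet")
    marked = f"{open_tag} {tweet[start:end]} {close_tag}"
    return tweet[:start] + marked + tweet[end:]


def build_inputs(tweet, candidate_spans):
    """Create one marked copy of the tweet per candidate slot."""
    return [mark_candidate(tweet, s, e) for s, e in candidate_spans]


if __name__ == "__main__":
    tweet = "My uncle in Ohio tested positive"
    for text in build_inputs(tweet, [(0, 8), (12, 16)]):
        print(text)
```

Because every candidate produces its own marked copy of the tweet, a tweet with k candidates requires k encoder forward passes under this scheme.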
The baseline uses the Bert hidden representation of the token <E> for classification. Here, however, we use an attention-weighted pool of the CT-Bert hidden representations of the tokens between <E> and </E> (both inclusive). Formally, let $\{x_0, \dots, x_p, \dots, x_q, \dots, x_n\}$ be the output vectors from the hidden representation of CT-Bert, where $p$ and $q$ are the indices of <E> and </E> respectively. Then, for any slot-filling subtask $S_j$, we obtain its pooled vector $x_{S_j}$ as an attention-weighted sum of $x_p, \dots, x_q$ (1), with attention scores based on $x_i^T a_{S_j}$, where $T$ denotes the transpose of $x_i$ and $a_{S_j}$ is a trainable vector. The motivation for attention-weighted pooling is that, depending on the task, the model can attend to different portions of the candidate slot chunk. Next, we obtain the binary classification score vector $h_{S_j}$ from $x_{S_j}$ (2), using trainable parameters $W_{S_j}$ and $b_{S_j}$.

We treat the Event-Prediction task as an auxiliary task and fuse its logits into each of the other slot-filling subtasks. The motivation is that a task-specific entity should be present in a tweet only if the tweet belongs to its respective event. To predict the label for the Event-Prediction task, we take the CT-Bert features of the [CLS] token and pass them through a Multi-Layer Perceptron (MLP) to get the logits $h_{ces}$. We fuse the $h_{ces}$ prediction into each subtask $S_j$ by adding it to $h_{S_j}$ (from (2)) to get the final logits $h^{f}_{S_j}$ (3). The fusion uses an MLP, $MLP_{S_j}$; in practice, we share its parameters across all the slot-filling subtasks $S_j$.

Given a tweet $t$ and slot $s$, our loss for the slot-filling model over the $n$ slot-filling subtasks $\{S_1, S_2, \dots, S_n\}$ and the Event-Prediction task is:

$$\mathrm{Loss}(t, s, y_{ces}, (y_1, y_2, \dots, y_n)) = \lambda_1 \, CE_{Loss}(h_{ces}, y_{ces}) + \sum_{k=1}^{n} CE_{Loss}(h^{f}_{S_k}, y_k) \qquad (4)$$

where $CE_{Loss}$ is the softmax cross-entropy loss, $y_{ces}$ is the ground-truth label for the Event-Prediction task, and $(y_1, y_2, \dots, y_n)$ are the labels for the candidate slot $s$ of tweet $t$ for the subtasks $\{S_1, S_2, \dots, S_n\}$. We keep $\lambda_1 = 1$. Our preprocessing for this model is the same as the baseline's.

Our sentence classification model is shown in figure 3.
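Both the slot-filling model above and the sentence classification model rely on attention-weighted pooling, so a small dependency-free sketch of that operation may help. It assumes one plausible reading of the description: weights are a softmax over the scores $x_i^T a$ for tokens $p$ through $q$, and the pooled vector is the weighted sum of those tokens. The fusion step here adds the event logits directly to the slot logits and leaves out the small MLP mentioned in the text. All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(hidden, a, p, q):
    """Pool hidden[p..q] (both inclusive) with weights softmax(x_i . a).

    Assumed form of the pooling; `a` plays the role of the trainable
    per-subtask attention vector.
    """
    span = hidden[p:q + 1]
    weights = softmax([dot(x, a) for x in span])
    dim = len(span[0])
    return [sum(w * x[d] for w, x in zip(weights, span)) for d in range(dim)]


def linear(W, b, x):
    """Affine map W x + b producing the subtask logits."""
    return [dot(row, x) + bias for row, bias in zip(W, b)]


def fuse_logits(h_slot, h_event):
    """Simplified fusion: add event-prediction logits to slot logits."""
    return [s + e for s, e in zip(h_slot, h_event)]


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy token vectors
    pooled = attention_pool(hidden, a=[0.0, 0.0], p=0, q=1)
    logits = linear([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], pooled)
    print(fuse_logits(logits, [1.0, -1.0]))
```

Passing `p=0` and `q=len(hidden) - 1` pools over the whole sequence, which is the variant the sentence classification model uses.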
We use a Bert-based sentence classifier and improve it by using CT-Bert, incorporating the auxiliary Event-Prediction task, and applying attention-weighted pooling over the entire sequence. This model uses CT-Bert instead of Bert, and the auxiliary Event-Prediction task, for the same reasons as the slot-filling model. For any sentence classification subtask $C_j$, we obtain its pooled vector $x_{C_j}$ by attention-weighted pooling over the entire sequence of CT-Bert output vectors (5), where $a_{C_j}$ and $c_{C_j}$ are a trainable vector and a trainable scalar, respectively.

For the Event-Prediction task, we take the CT-Bert vector representation of the [CLS] token and pass it through an MLP. Let the MLP's final and hidden states be $v_{ces}$ and $h_{ces}$. Next, we incorporate information from the Event-Prediction task into each sentence classification subtask $C_j$. Since the sentence classification subtasks are not binary classification problems, we cannot, unlike in the slot-filling model, merely add the Event-Prediction logits to all tasks. Additionally, we want sentence-level, event-specific features for each of the sentence-level predictions. Hence, we concatenate the hidden state features $h_{ces}$ from the MLP of the Event-Prediction task with the pooled vector $x_{C_j}$ from (5) to get the logits $h^{f}_{C_j}$ for each subtask $C_j$ (6). Here $T$ denotes transpose, $[\,;\,]$ denotes vector concatenation, and $W_{C_j}$ and $b_{C_j}$ are trainable.

Given a tweet $t$, the loss for our sentence classification model is computed over the $m$ sentence classification subtasks $\{C_1, C_2, \dots, C_m\}$ and the Event-Prediction task, where $CE_{Loss}$ is the softmax cross-entropy loss, $y_{ces}$ is the ground-truth label for the Event-Prediction task, and $(y_1, y_2, \dots, y_m)$ are the labels for tweet $t$ for the subtasks $\{C_1, C_2, \dots, C_m\}$. We keep $\lambda_2 = 1$.

Preprocessing for sentence classification is done using the ekphrasis library (Baziotis et al., 2017). We remove emojis, URLs, email addresses, and punctuation, and normalize the text through word segmentation, lower-casing, and expansion of contractions.
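The multi-task objective used by both models can be sketched as a weighted sum of softmax cross-entropy terms. The sketch assumes that the sentence classification loss has the same shape as the slot-filling loss in (4), with the event term scaled by a coefficient (`lam`, standing in for $\lambda_1$ or $\lambda_2$) and one unweighted term per subtask. The function names and the list-based interface are illustrative.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy of one example: log-sum-exp minus true logit."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def multitask_loss(event_logits, event_label,
                   subtask_logits, subtask_labels, lam=1.0):
    """lam * CE(event) + sum over subtasks of CE(subtask).

    Subtasks may have different numbers of labels, as in the sentence
    classification setting; assumed to mirror the form of Equation (4).
    """
    loss = lam * cross_entropy(event_logits, event_label)
    for logits, label in zip(subtask_logits, subtask_labels):
        loss += cross_entropy(logits, label)
    return loss


if __name__ == "__main__":
    # One binary event task, one 3-way subtask and one 4-way subtask.
    total = multitask_loss(
        event_logits=[0.2, 1.3], event_label=1,
        subtask_logits=[[2.0, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]],
        subtask_labels=[0, 3],
    )
    print(round(total, 4))
```

With the coefficient fixed at 1, as in the text, the auxiliary Event-Prediction term contributes to the gradient with the same weight as each event-specific subtask.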
All the experiments were performed using PyTorch (Paszke et al., 2019) and Hugging Face's transformers (Wolf et al., 2019). We use git and wandb (Biewald, 2020) for experiment tracking. Optimization is done using Adam (Kingma and Ba, 2014) with a learning rate of 2e-5. Slot-filling models are trained for 8 epochs and the sentence classification model for 10 epochs. The average training time per epoch on a Tesla P100 is ≈ 4 minutes for slot-filling and ≈ 30 seconds for sentence classification. We use a 70-30 train-valid split. The valid set is used to obtain the best threshold for each of the slot classification tasks over the grid {0.1, 0.2, ..., 0.9}. We exclude labels with "No consensus" from our data. All the MLPs have 1 hidden layer and 0.1 dropout. $MLP_{S_j}$ has a hidden size of 4 and LeakyReLU activation (Maas et al., 2013) with a negative slope of 0.1; the rest of the MLPs have a hidden size of 50 and Tanh activation.

Our performance on the held-out test set is shown in Table 3. Our system ranks 1st in the W-NUT 2020 Shared Task-3 (Zong et al., 2020). We also independently rank 1st for 3 of the 5 events: 'Can Not Test', 'Death', and 'Cure'. We now discuss our various experiments.

We experimented with a variety of architectures for the slot-filling model. Our (SF) is our slot-filling model from §4.1. Our (SF) w/o pool is our slot-filling model that classifies using the CT-Bert hidden representation of the token <E> instead of attention-weighted pooling. Our (SF) w/o CES is our slot-filling model without the Event-Prediction task. CT-Bert and Bert-large are baseline models that use CT-Bert and Bert-large instead of Bert-base. Table 4 shows the performance of these models. Using CT-Bert instead of Bert yields a considerable performance difference, demonstrating the benefits of domain-specific pre-training. Our (SF) w/o pool and Our (SF) w/o CES both outperform CT-Bert, demonstrating the importance of the Event-Prediction task and of attention-weighted pooling over the slot chunk, respectively.
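The per-task threshold selection on the valid set, mentioned in the experimental setup above, can be sketched as a small grid search. The sketch assumes that the selection criterion is F1 on the valid set and that a candidate is predicted positive when its probability is at least the threshold; neither detail is stated in the text, and the function names are illustrative.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, y_true, grid=None):
    """Return (threshold, F1) with the highest valid-set F1 on the grid.

    Ties are resolved in favour of the smallest threshold (an assumption).
    """
    if grid is None:
        grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, ..., 0.9
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


if __name__ == "__main__":
    valid_probs = [0.95, 0.85, 0.35, 0.15]   # toy positive-class probabilities
    valid_labels = [1, 1, 0, 0]
    print(best_threshold(valid_probs, valid_labels))
```

In a multi-task setting this search would be run once per slot-filling subtask, since each subtask has its own class balance.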
Our (SF), which uses CT-Bert with the Event-Prediction task and attention-weighted pooling, performs the best among these models.

Sentence level tasks: We experimented with various architectures for sentence-level tasks. Our (SC) is our sentence classification architecture from §4.2. Our (SC) w/o CES is our sentence classification model without the Event-Prediction task. The Bert multitask model predicts using the [CLS] representation from Bert (Devlin et al., 2019). We also build an LSTM model (Hochreiter and Schmidhuber, 1997) with GloVe embeddings (Pennington et al., 2014) and Twitter tokenization using the WordTokenizers package (Kaushal et al., 2020). Table 5 shows the performance of these architectures. Our (SC) outperforms the others on macro F1 and micro F1, followed by Our (SC) w/o CES. The performance difference between these two shows the benefit of including the Event-Prediction task, while the performance difference between CT-Bert multitask and Our (SC) w/o CES shows the gains from attention-weighted pooling. CT-Bert also outperforms Bert multitask, showing its usefulness in our proposed system over using Bert. Lastly, Bert multitask and all the models using Bert/CT-Bert outperform the LSTM by a very large margin, demonstrating the superiority of these pretrained language models.

Separate sentence classification and slot-filling models: Consider Bert separate, a simple system that treats the two categories of tasks separately. It has the Bert baseline as its slot-filling model and a simple Bert sentence classifier using features from [CLS] for sentence prediction. Bert separate does not have the event-prediction auxiliary task or any attention-weighted pooling. Table 6 shows the performance of Bert separate against the baseline. Bert separate outperforms the Bert baseline by a considerable margin, showing the importance of treating the two types of subtasks differently.
                 Micro F1   Macro F1
Bert Separate    .631       .545
Bert Baseline    .608       .512

Table 6: Results comparing the systems treating the sentence classification and slot-filling subtasks separately vs. those treating them similarly. We report results on the valid set across all the subtasks of both categories across the 5 events.

In this paper, we presented our system, which achieved 1st position in the WNUT-2020 Shared Task-3 on Extracting COVID Entities from Twitter. We divided the event-specific subtasks into slot-filling and sentence classification subtasks, building separate architectures for the two. For both architectures, we used COVID-Twitter Bert, attention-weighted pooling over chunk spans or the sentence, and fused logits and features from the auxiliary Event-Prediction task. Our ablation studies demonstrated the usefulness of each component in our system.

There is considerable scope for improvement on subtasks with few positive labels. Pretraining on relevant data (such as COVID-misinformation datasets for the cure event) is a promising direction. Another direction would be to reduce the training and inference time of the slot-filling model by not enclosing the candidate chunk within the special start <E> and end </E> tokens. We could instead apply the attention-weighted pooling directly over each candidate slot chunk. This would reduce the number of Bert forward passes from O(k) to O(1), where k is the number of candidate chunks in a tweet.
Sub-event detection during natural hazards using features of social media data
Using online social networks to track a pandemic: A systematic review
Text classification approach for the automatic detection of twitter posts containing self-reported covid-19 symptoms
Fighting the covid-19 infodemic in social media: A holistic perspective and a call to arms
Analysing the extent of misinformation in cancer related tweets
Matching the blanks: Distributional similarity for relation learning
DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis
Event discovery in social media feeds
Experiment tracking with weights and biases. Software available from wandb
BERT: Pre-training of deep bidirectional transformers for language understanding
Long short-term memory
Universal language model fine-tuning for text classification
Wordtokenizers.jl: Basic tools for tokenizing natural language in julia
Adam: A method for stochastic optimization
Leveraging sentence-level information with encoder LSTM for semantic slot filling
Covid-19 surveillance through twitter using self-supervised learning and few shot learning
Rectifier nonlinearities improve neural network acoustic models
Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter
Pytorch: An imperative style, high-performance deep learning library
GloVe: Global vectors for word representation
Deep contextualized word representations
A stack-propagation framework with token-level intent detection for spoken language understanding
Named entity recognition in tweets: An experimental study
Identifying sub-events and summarizing disaster-related information from microblogs
Improved semantic representations from tree-structured long short-term memory networks
Huggingface's transformers: State-of-the-art natural language processing
Extracting covid-19 events from twitter

We are very grateful for the invaluable suggestions given by Nikhil Shah, Dibya Prakash Das and Sayan Sinha. We also thank the organizers of the Shared Task-3 at WNUT, EMNLP-2020.