key: cord-0160390-5o8mtkhc
authors: Nguyen, Anh Tuan
title: TATL at W-NUT 2020 Task 2: A Transformer-based Baseline System for Identification of Informative COVID-19 English Tweets
date: 2020-08-28
journal: nan
DOI: nan
sha: 096b3a89b028ea605b1b7b8a8a444a3e8c99154b
doc_id: 160390
cord_uid: 5o8mtkhc

Abstract: As the COVID-19 outbreak continues to spread throughout the world, more and more information about the pandemic is being shared publicly on social media. For example, a huge number of COVID-19 English Tweets are posted on Twitter every day. However, the majority of those Tweets are uninformative, so it is important to be able to automatically select only the informative ones for downstream applications. In this short paper, we present our participation in the W-NUT 2020 Shared Task 2: Identification of Informative COVID-19 English Tweets. Inspired by recent advances in pretrained Transformer language models, we propose a simple yet effective baseline for the task. Despite its simplicity, our proposed approach achieves very competitive results on the leaderboard, ranking 8th out of 56 participating teams.

The COVID-19 pandemic has been spreading rapidly across the globe and has infected more than 20 million people. As a result, more and more people have been sharing a wide variety of information related to COVID-19 publicly on social media. For example, a huge number of COVID-19 English Tweets are posted on Twitter every day. However, the majority of those Tweets are uninformative and do not contain useful information; therefore, the community needs systems that can automatically filter out uninformative Tweets. Tweets generally differ from traditional written text such as Wikipedia or news articles due to their short length and informal use of words and grammar (e.g., abbreviations, hashtags, markers). These special characteristics of Tweets may pose a challenge for many NLP techniques that focus solely on formally written texts.

In this paper, we present our participation in the W-NUT 2020 Shared Task 2: Identification of Informative COVID-19 English Tweets (Nguyen et al., 2020b). Inspired by the recent success of Transformer-based pre-trained language models on many NLP tasks (Devlin et al., 2019; Nguyen and Nguyen, 2020; Lai et al., 2020), we propose a simple yet effective baseline for the task. Despite its simplicity, our proposed approach achieves very competitive results. In the following sections, we first describe the task definition in Section 2 and the proposed methods in Section 3. We then describe the experiments and their results in Section 4. Finally, in Section 5, we conclude this work and discuss potential future research directions.

The goal of Shared Task 2 is to identify whether a COVID-19 English Tweet is informative or not. An informative Tweet provides information about recovered, suspected, confirmed, and death cases as well as the location and history of each case. The dataset introduced in this shared task consists of 10K COVID-19 English Tweets. Dataset statistics can be found in Table 1.

3 Method

The task is formulated as a binary classification of Tweets into informative or uninformative classes. Figure 1 gives a high-level overview of our proposed approach. Given a Tweet consisting of $n$ tokens $x = \{x_1, x_2, \dots, x_n\}$, we first form a contextualized representation for each token using a Transformer-based encoder such as BERT (Devlin et al., 2019). Following common conventions, we append special tokens to the beginning and end of the input Tweet before feeding it to the Transformer model. For example, if we use BERT, $x_1$ will be the special [CLS] token and $x_n$ will be the special [SEP] token. Let $H = \{h_1, h_2, \dots, h_n\}$ denote the contextualized representations produced by the Transformer model. We then use $h_1$ as an aggregate representation of the original input and feed it to a linear layer to calculate the final output:

$$y = \sigma(W h_1 + b)$$

where the transformation matrix $W$ and the bias term $b$ are model parameters, and $\sigma$ denotes the sigmoid function, which squashes the score to a probability between 0 and 1. Here, $y$ is the predicted probability of the input Tweet being informative.
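To make the formulation above concrete, the following is a minimal sketch using PyTorch and the Hugging Face transformers library (Wolf et al., 2019). The class name TweetClassifier, the roberta-base checkpoint, and the example Tweet are our own illustrative assumptions, not the authors' released code.

```python
# Minimal sketch: Transformer encoder + linear layer + sigmoid over the first token's representation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TweetClassifier(nn.Module):
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)          # Transformer encoder
        self.linear = nn.Linear(self.encoder.config.hidden_size, 1)   # W and b from the equation above

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h1 = outputs.last_hidden_state[:, 0, :]            # h_1: representation of the first special token
        return torch.sigmoid(self.linear(h1)).squeeze(-1)  # y: probability of being informative

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = TweetClassifier("roberta-base")
batch = tokenizer(["Officials reported 120 new confirmed COVID-19 cases in the city today."],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(batch["input_ids"], batch["attention_mask"])  # one probability per input Tweet
```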
In this work, we experiment with various state-of-the-art Transformer models, including BERTweet (Nguyen et al., 2020a), XLM-RoBERTa (Conneau et al., 2020), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020). We briefly describe each of these models below.

RoBERTa (Liu et al., 2019) improved over BERT (Devlin et al., 2019) by modifying the training objectives in ways that lead to more robust optimization, i.e., removing next-sentence prediction and using dynamic masking for masked language modelling. Liu et al. (2019) also show that training the language model longer and on more data substantially benefits performance on downstream tasks.

Inspired by the success of multilingual language models (Devlin et al., 2019; Lample and Conneau, 2019), XLM-RoBERTa (Conneau et al., 2020) significantly scaled up the amount of multilingual training data used in unsupervised MLM pretraining compared to previous work (Lample and Conneau, 2019) and achieved state-of-the-art performance on both monolingual and cross-lingual benchmarks.

BERTweet (Nguyen et al., 2020a) is a domain-specific language model pre-trained on a large corpus of English Tweets. Similar to the success of BioBERT in the BioNLP domain and of SciBERT (Beltagy et al., 2019) in the scientific NLP domain, BERTweet achieved state-of-the-art performance across many Tweet NLP tasks, outperforming its counterparts RoBERTa (Liu et al., 2019) and XLM-RoBERTa (Conneau et al., 2020).

ELECTRA (Clark et al., 2020) proposed a new pretraining objective that differs from masked language modelling (Devlin et al., 2019; Liu et al., 2019). Instead of masking input tokens, ELECTRA corrupts them using a small generator network that produces a distribution over tokens, while a discriminator tries to guess which tokens were actually corrupted by the generator. ELECTRA achieved state-of-the-art results across many tasks in the GLUE benchmark while using much less compute than other pre-training methods (Devlin et al., 2019; Liu et al., 2019).

To further boost the performance of our baseline models, we leverage ensemble learning. We perform ensemble learning over all of the Transformer models mentioned above and employ two different ensemble schemes, namely Unweighted Averaging and Majority Voting.

In the Unweighted Averaging scheme, the final prediction is estimated from the unweighted average of the posterior probabilities of all of our models:

$$\hat{y} = \operatorname*{argmax}_{c \in \{1, \dots, C\}} \frac{1}{M} \sum_{i=1}^{M} p_{i,c}$$

where $C$ is the number of classes, $M$ is the number of models, and $p_i$ is the probability vector computed using the softmax function of model $i$.

Majority Voting counts the votes of all the models and selects the class with the most votes as the prediction. Formally, the final prediction is given by:

$$v_c = \sum_{i=1}^{M} \mathbb{1}[F_i = c], \qquad \hat{y} = \operatorname*{argmax}_{c} v_c$$

where $v_c$ denotes the number of votes for class $c$ from all the different models and $F_i$ is the binary decision of model $i$, which is either 0 or 1.
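As an illustration of the two ensemble schemes, here is a small NumPy sketch. The array shapes and the example probabilities are hypothetical; it only shows how Unweighted Averaging and Majority Voting combine per-model probability vectors.

```python
# Sketch of the two ensemble schemes over M model outputs for a single Tweet.
import numpy as np

def unweighted_averaging(probs):
    """probs: array of shape (M, C), one softmax probability vector per model."""
    avg = probs.mean(axis=0)          # unweighted average over the M models
    return int(np.argmax(avg))        # class with the highest averaged probability

def majority_voting(probs):
    """probs: array of shape (M, C); each model casts one vote for its argmax class."""
    votes = np.argmax(probs, axis=1)                        # F_i: each model's decision
    counts = np.bincount(votes, minlength=probs.shape[1])   # v_c: number of votes per class
    return int(np.argmax(counts))                           # class with the most votes

# Hypothetical outputs of M = 3 models for one Tweet (C = 2 classes).
probs = np.array([[0.35, 0.65],
                  [0.55, 0.45],
                  [0.20, 0.80]])
print(unweighted_averaging(probs))  # -> 1 (informative)
print(majority_voting(probs))       # -> 1 (informative)
```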
To fine-tune our baseline models, we employ the transformers library (Wolf et al., 2019). We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a fixed batch size of 32 and learning rates in the set {1e-5, 2e-5, 5e-5}. We fine-tune the models for 30 epochs and select the best checkpoint based on the performance of the model on the validation set.
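A rough sketch of this fine-tuning setup is given below (AdamW, batch size 32, a learning rate from {1e-5, 2e-5, 5e-5}, 30 epochs, best checkpoint by validation score). The binary cross-entropy loss, the F1-based model selection, and the dataset format (dicts with input_ids, attention_mask, labels) are our assumptions; the paper does not spell out these details.

```python
# Sketch of fine-tuning a TweetClassifier-style model; details not stated in the paper are assumed.
import torch
from sklearn.metrics import f1_score
from torch.optim import AdamW
from torch.utils.data import DataLoader

def evaluate_f1(model, dataset, batch_size=32):
    model.eval()
    preds, golds = [], []
    with torch.no_grad():
        for batch in DataLoader(dataset, batch_size=batch_size):
            probs = model(batch["input_ids"], batch["attention_mask"])
            preds.extend((probs >= 0.5).long().tolist())   # threshold the sigmoid output at 0.5
            golds.extend(batch["labels"].tolist())
    return f1_score(golds, preds)

def fine_tune(model, train_dataset, val_dataset, lr=2e-5, epochs=30, batch_size=32):
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()                 # binary cross-entropy over the sigmoid output (assumed)
    best_f1, best_state = 0.0, None
    for _ in range(epochs):
        model.train()
        for batch in loader:
            optimizer.zero_grad()
            probs = model(batch["input_ids"], batch["attention_mask"])
            loss = loss_fn(probs, batch["labels"].float())
            loss.backward()
            optimizer.step()
        f1 = evaluate_f1(model, val_dataset)
        if f1 > best_f1:                         # keep the checkpoint with the best validation F1
            best_f1 = f1
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_f1
```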
Table 2 shows the overall results on the validation set. The large version of RoBERTa achieves the highest F1 score on the validation set among the individual models. To our surprise, we find that BERTweet does not outperform the base version of RoBERTa on the validation set, even though BERTweet was trained on English Tweets using the same training procedure as RoBERTa. Finally, XLM-RoBERTa achieves a lower F1 score than both RoBERTa and ELECTRA, suggesting that using a multilingual pretrained language model may not improve performance, since the shared task is mainly about English Tweets.

We also evaluate the performance of our ensemble models. The results show that ensemble learning improves the F1 score compared to each individual model, and that Unweighted Averaging performs better than Majority Voting on the validation set. We also submitted the predictions of both ensemble schemes to the competition; the final results on the leaderboard are shown in Table 3. We notice that Majority Voting performs slightly better than Unweighted Averaging on the hidden test set.

In this paper, we introduce a simple but effective approach for identifying informative COVID-19 English Tweets. Despite the simplicity of our approach, it achieves very competitive results on the leaderboard, ranking 8th out of 56 participating teams. In future work, we will conduct a thorough error analysis and apply visualization techniques to gain a better understanding of our models (Murugesan et al., 2019). Furthermore, we will extend our approach to other languages. Finally, we will investigate the use of advanced techniques such as transfer learning, few-shot learning, and self-training to further improve the performance of our system (Pan et al., 2017; Huang et al., 2018; Lai et al., 2018; Xie et al., 2020).

References

SciBERT: A Pretrained Language Model for Scientific Text
BERT for Joint Intent Classification and Slot Filling
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Unsupervised Cross-lingual Representation Learning at Scale
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Zero-shot Transfer Learning for Event Extraction
Supervised Transfer Learning for Product Information Question Answering
A Gated Self-attention Memory Network for Answer Selection
A Simple but Effective BERT Model for Dialog State Tracking on Resource-Limited Systems
Cross-lingual Language Model Pretraining
BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Decoupled Weight Decay Regularization
DeepCompare: Visual and Interactive Comparison of Deep Learning Model Performance
PhoBERT: Pre-trained Language Models for Vietnamese
BERTweet: A Pre-trained Language Model for English Tweets
WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets
Cross-lingual Name Tagging and Linking for 282 Languages
GLUE: A Multi-task Benchmark and Analysis Platform for Natural Language Understanding
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Self-training with Noisy Student Improves ImageNet Classification
A Compare-Aggregate Model with Latent Clustering for Answer Selection