key: cord-0122366-vtfa127o
authors: Hoang, Thai Quoc; Vu, Phuong Thu
title: Not-NUTs at W-NUT 2020 Task 2: A BERT-based System in Identifying Informative COVID-19 English Tweets
date: 2020-09-14
journal: nan
DOI: nan
sha: b5ad99b458be2f6c9326690c3d624fbc4fbe14d2
doc_id: 122366
cord_uid: vtfa127o

As of 2020, with the COVID-19 pandemic full-blown on a global scale, people's need for access to legitimate information regarding COVID-19 is more urgent than ever, especially via online media, where an abundance of irrelevant information overshadows the more informative content. In response, we propose a model that, given an English Tweet, automatically identifies whether that Tweet bears informative content regarding COVID-19. By ensembling different BERTweet model configurations, we achieved competitive results that fall short of those of the top-performing teams by only about 1% in F1 score on the informative class. In the post-competition period, we also experimented with various other approaches that could potentially improve generalization to a new dataset.

Following the rise of smart technology and increasingly wide Internet coverage, social network websites have become ubiquitous. Besides serving as a platform for various types of entertainment, social media is particularly helpful in spreading information, which can be leveraged to keep most of its users well-informed during a natural disaster or a pandemic like COVID-19. One major advantage of sourcing information via social media is that information is updated in real time. Anyone with a social media account can post or share information at the moment they witness a noteworthy event.
This is a much faster way to obtain information than reading newspapers, watching the news on TV, or consulting other official sources of information, since most of these tend to be updated only at midday or at the end of the day. Nevertheless, information on social media platforms is mostly unverified, often heavily colored by the opinions of the person who posted it, and at worst completely inaccurate. This highlights the need for a system that can automatically identify legitimate information within the huge pool of available information.

* Equal contribution with the first author

To address this need, in this paper we tackle W-NUT 2020 Task 2: Identification of Informative COVID-19 English Tweets (Nguyen et al., 2020b). As stated in the task description paper, the task requires participants to build and refine systems that, given an English Tweet carrying COVID-19-related content, automatically classify whether it is informative or not. In the context of this shared task, being informative is defined as bearing information regarding suspected, confirmed, recovered, or death cases related to COVID-19, as well as the location or travel history of these cases.

Text classification is a simple but practical task in natural language processing. Early models such as Naive Bayes, Logistic Regression, and Support Vector Machines are widely known and commonly used as a starting point for classification experiments because they are simple and fast to train while still achieving reasonable performance. The rise of modern neural networks brought deep learning to classification tasks in language processing, as such networks help induce features for learning. The further development of recurrent networks made it possible to handle sequences of varying lengths, which greatly improved text classification performance.
When classifying texts, it is essential that the model capture the characteristics of the input sequences in depth. A well-performing system for embedding text sequences is therefore an important prerequisite for building a good text classification model. Recently, pre-trained language models have made it possible to obtain high-quality text embeddings, which can then be used for downstream tasks. Among the best-known pre-trained contextual language models are BERT (Devlin et al., 2018), ELMo (Peters et al., 2018), and XLNet (Yang et al., 2019).

We use the pre-trained language model BERTweet (Nguyen et al., 2020a), an English Tweet domain-specific model inspired by the original BERT model (Devlin et al., 2018), as the core of our system (more details are given later). To identify the informativeness of COVID-19 English Tweets, we attach a classification block, consisting of one or more linear layers, on top of the BERTweet block.

BERTweet (Nguyen et al., 2020a) is a large-scale language model pre-trained on English Tweets. Being a domain-specific model, BERTweet has achieved state-of-the-art performance on many downstream Tweet NLP tasks such as part-of-speech tagging, named entity recognition, and text classification, outperforming strong models such as RoBERTa-base (Liu et al., 2019) and XLM-R-base (Conneau et al., 2019). Pre-trained on 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the COVID-19 pandemic, BERTweet has an advantage over other models for classifying COVID-19-related English Tweets.

Before feeding input sequences into the BERTweet model, we first tokenize them with a BPE tokenizer (Sennrich et al., 2015), then add the [CLS] and [SEP] tokens at their beginning and end, respectively.
To ensure that all sequences have a uniform length, we also append padding tokens to the end of the input sequences. The tokenized and padded input sequences are then fed directly into the Transformer block to obtain contextualized sequence embeddings.

Each Transformer layer within the BERTweet model learns different information. We experiment with different ways of extracting the pooled token from our BERTweet model, which corresponds to the encoded [CLS] token in our implementation, and analyze the effect on performance for this downstream task. More details are discussed in the "Experiments" section.

Through close manual inspection of the dataset provided for the task, we found that many Tweets carry their noteworthy information in particular parts. Following that reasoning, paying special attention to smaller parts of the Tweets is also important. Inspired by this idea, we propose a method that trains 3 BERTweet models simultaneously: one that produces contextualized embeddings over the whole input sequence, one that produces embeddings over the first part of the Tweet, and one that produces embeddings over the remaining part. The pooled token from each model is then extracted, and the three are concatenated so that the system learns both global and local information about the Tweets. Please refer to Figure 2 for a visualization of the model.

The classification block contains one or more linear layers stacked on top of each other. The final layer is used to classify whether a Tweet is informative or not.

We use the dataset released by the competition organizers, consisting of 10,000 COVID-19 English Tweets. Each Tweet in the dataset is annotated independently by 3 annotators, and the overall inter-annotator agreement, measured by Fleiss' Kappa, is 0.818. The dataset is divided into 3 distinct sets for training, validation, and testing, with a ratio of 70/10/20, respectively. Table 1 shows the division of the dataset.
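To make the input preparation for the global and local encoders concrete, the following is a minimal sketch of how the three views of a Tweet (whole sequence, first part, remaining part) could be built, wrapped with [CLS] and [SEP], and padded to a uniform length. It is an illustration under stated assumptions, not our actual implementation: whitespace tokens stand in for BPE subwords, the midpoint split is hypothetical (the position where the first part ends is a free choice here), and the helper names `wrap_and_pad`, `global_local_views`, and `concat_pooled` are introduced only for this example.

```python
from typing import List, Tuple

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def wrap_and_pad(tokens: List[str], max_len: int) -> List[str]:
    """Add [CLS]/[SEP] around the tokens, then right-pad to max_len."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    body = tokens[: max_len - 2]  # truncate so the special tokens still fit
    wrapped = [CLS] + body + [SEP]
    return wrapped + [PAD] * (max_len - len(wrapped))


def global_local_views(
    tokens: List[str], max_len: int
) -> Tuple[List[str], List[str], List[str]]:
    """Return (whole, first part, remaining part), each wrapped and padded.

    Assumption: the Tweet is split at its midpoint. This split point is
    hypothetical and chosen only for illustration.
    """
    mid = len(tokens) // 2
    return (
        wrap_and_pad(tokens, max_len),
        wrap_and_pad(tokens[:mid], max_len),
        wrap_and_pad(tokens[mid:], max_len),
    )


def concat_pooled(pooled_vectors: List[List[float]]) -> List[float]:
    """Concatenate the pooled [CLS] vectors produced by the three encoders."""
    return [x for vec in pooled_vectors for x in vec]


# Toy example; whitespace tokens stand in for BPE subwords.
tokens = "3 new confirmed cases reported in Hanoi today".split()
whole, first, rest = global_local_views(tokens, max_len=12)
```

Each view would be fed to its own BERTweet encoder, and `concat_pooled` shows that the classification block then receives a vector three times the size of a single pooled embedding.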
During the final evaluation phase, we re-split the dataset by combining the training and validation sets and then dividing the result randomly with a ratio of 90/10. The test set is not modified.

We mainly rely on the transformers library (Wolf et al., 2019) with the PyTorch framework (Paszke et al., 2017) to run our code. We divide the training process into two phases. In the first phase, we freeze all the BERTweet parameters and train only the classification block. In the second phase, we unfreeze all parameters of our end-to-end model for fine-tuning.

For all models within the scope of our project, we use the AdamW optimizer as implemented in the transformers library. This is a third-party implementation of the algorithm originally proposed in the paper Decoupled Weight Decay Regularization (Loshchilov and Hutter, 2019). The maximum length for padding input sequences before feeding them into the BERTweet model is set to 256. We trained our models on 1 NVIDIA Tesla V100 and 1 NVIDIA GeForce RTX 2080 Ti, using batch sizes of 16 and 32. We use an initial learning rate of 5e-4 for 12 epochs in the first phase and 4e-5 for 6 epochs in the second phase, with linear learning rate decay, and then choose the best checkpoint.

For the baselines, we pre-process the input data by tokenizing it, recording the number of occurrences of each token in a count matrix, and then transforming that count matrix into a tf-idf representation. To do so, we use CountVectorizer() and TfidfTransformer() as implemented in sklearn (Pedregosa et al., 2011). We then use 3 different classifiers, namely SVM, Naive Bayes, and Logistic Regression, to obtain results on the original validation set. We acknowledge that the performance of these baselines is relatively poor; nevertheless, this reflects a trade-off between accuracy and efficiency, since they follow a non-deep-learning approach that requires little time for training and tuning.
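The count-then-tf-idf pre-processing used for the baselines can be illustrated with a small hand-rolled sketch in plain Python. This is not the sklearn code path itself: it assumes whitespace tokenization and lowercasing, and it uses a smoothed idf followed by L2 row normalization, which to our understanding corresponds to the default settings of TfidfTransformer. The function names `build_counts` and `tfidf` and the three toy documents are hypothetical and exist only for this example.

```python
import math
from collections import Counter
from typing import List, Tuple


def build_counts(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted vocabulary and a document-by-term count matrix."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for tok, n in Counter(toks).items():
            counts[i][index[tok]] = n
    return vocab, counts


def tfidf(counts: List[List[int]]) -> List[List[float]]:
    """Weight counts by a smoothed idf, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    out = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        out.append([x / norm for x in weighted] if norm > 0 else weighted)
    return out


# Toy documents, invented for illustration only.
docs = [
    "two new cases confirmed in the city",
    "stay safe everyone",
    "new cases new deaths reported",
]
vocab, counts = build_counts(docs)
features = tfidf(counts)
```

The rows of `features` are the vectors that a classifier such as SVM, Naive Bayes, or Logistic Regression would be trained on.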
As mentioned above, we experiment with different ways of extracting embeddings after feeding Tweets into the BERTweet model. Besides this, we also experiment with different configurations of our Global Local BERTweet model.

Define $p_i$ (of dimension $1 \times 2$) to be the softmax vector predicted by the $i$-th model for each Tweet, $c$ to be a class (namely Informative or Uninformative), and $N$ to be the number of models. Let $C$ be a function that takes a softmax vector as input and returns the corresponding binary classification result as output. The output $o_{mv}$ of majority voting is calculated as follows:

$$o_{mv} = \arg\max_{c} \sum_{i=1}^{N} \mathbb{1}\left[C(p_i) = c\right]$$

The output $o_a$ of averaging is calculated as follows:

$$o_a = C\left(\frac{1}{N} \sum_{i=1}^{N} p_i\right)$$

We ensemble all the models shown in Table 3 and Table 4 by majority voting and by averaging softmax vectors. The results on the original validation set are summarized in Table 5.

Ensembling Method    F1
Majority Voting      0.9130
Averaging            0.9111

During the final evaluation phase, we used the majority-voted prediction of our BERTweet models after training on the re-split training set and obtained an F1 score of 0.8991 on the hidden test set, which ranked 12th out of 56 participating teams. The first-placed team obtained a score of 0.9096.

To investigate our assumption that Tweet length affects the classification result, we analyze the Tweets in the given dataset and devise a way to choose the best models for ensembling when dealing with Tweets of a particular length. Specifically, we divide the Tweets into 3 categories: short Tweets (0-22 words), medium Tweets (23-44 words), and long Tweets (more than 44 words). For each category, we choose the 7 models that make the most correct predictions on our training set and use these models for prediction. With this, we obtain an F1 score of 0.9182 on the original validation set.
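The two ensembling rules and the length categories described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than our released code. Several details are assumptions made for the example: class index 0 is taken to mean Uninformative and index 1 Informative, ties are broken toward the lower class index, word counts come from a whitespace split, and the names `classify`, `majority_vote`, `average_vote`, and `length_bucket` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence

# Assumption: index 0 = Uninformative, index 1 = Informative.
CLASSES = ("UNINFORMATIVE", "INFORMATIVE")


def classify(p: Sequence[float]) -> int:
    """C(p): index of the largest softmax entry (ties go to the lower index)."""
    return max(range(len(p)), key=lambda c: p[c])


def majority_vote(probs: List[Sequence[float]]) -> int:
    """Return the class predicted by the largest number of models."""
    votes = Counter(classify(p) for p in probs)
    # Assumption: ties are broken toward the lower class index.
    return max(range(len(CLASSES)), key=lambda c: votes[c])


def average_vote(probs: List[Sequence[float]]) -> int:
    """Average the softmax vectors over models, then apply C once."""
    n = len(probs)
    mean = [sum(p[c] for p in probs) / n for c in range(len(CLASSES))]
    return classify(mean)


def length_bucket(tweet: str) -> str:
    """Map a Tweet to short (0-22 words), medium (23-44), or long (>44)."""
    n_words = len(tweet.split())
    if n_words <= 22:
        return "short"
    if n_words <= 44:
        return "medium"
    return "long"


# Two models lean weakly towards Informative, one strongly towards
# Uninformative, so the two ensembling rules disagree on this input.
example = [[0.45, 0.55], [0.45, 0.55], [0.95, 0.05]]
mv = majority_vote(example)
avg = average_vote(example)
```

On `example`, majority voting follows the two weak votes, while averaging is dominated by the single confident model, which shows why the two rules can yield different scores.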
Indeed, the reported result shows that selectively ensembling BERTweet models chosen for a particular range of input Tweet lengths does boost classification performance.

In this paper, we proposed a system that automatically identifies informative versus uninformative Tweets. While this system is simple, it leverages recent advances and state-of-the-art results in natural language processing and deep learning, namely BERT-based models. In future work, we will extend this system so that it can handle various forms of information circulating on social media, such as Facebook statuses, Reddit posts, and Instagram captions.

Devlin et al. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.
Loshchilov and Hutter. 2019. Decoupled weight decay regularization.
Nguyen et al. 2020a. BERTweet: A pre-trained language model for English Tweets.
Nguyen et al. 2020b. WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets.
Paszke et al. 2017. Automatic differentiation in PyTorch.
Pedregosa et al. 2011. Scikit-learn: Machine learning in Python.
Sennrich et al. 2015. Neural machine translation of rare words with subword units.
Wolf et al. 2019. HuggingFace's Transformers.
Yang et al. 2019. XLNet: Generalized autoregressive pretraining for language understanding.