key: cord-0450954-ax2i1t07 authors: Huynh, Tin Van; Nguyen, Luan Thanh; Luu, Son T. title: BANANA at WNUT-2020 Task 2: Identifying COVID-19 Information on Twitter by Combining Deep Learning and Transfer Learning Models date: 2020-09-06 journal: nan DOI: nan sha: 50ffa665e7317512b2e9945775863f31f9bcdfb5 doc_id: 450954 cord_uid: ax2i1t07

The outbreak of the COVID-19 virus has had a significant impact on the health of people all over the world. It is therefore essential that everyone has constant access to accurate information about the disease. This paper describes our prediction system for WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. The dataset for this task contains 10,000 English tweets labeled by humans. An ensemble model built from our three transformer and deep learning models is used for the final prediction. The experimental results show that our system achieves an F1 score of 88.81% for the INFORMATIVE label on the test set.

The rapid spread of the coronavirus has caused a global health crisis. The virus is hazardous to people's health and has caused widespread panic all over the world. Statistics show that 4 million tweets related to COVID-19 are posted on Twitter each day (Lamsal, 2020). It is therefore essential to keep track of the information associated with this disease. With the development of social networking platforms such as Twitter and Facebook, social media has become the primary way for people to receive regular information about COVID-19. However, a large amount of content appears on these platforms every day, and most of it carries no information about the status of COVID-19, such as the number of suspected cases or cases near the user's area. In this article, we present our approach to WNUT-2020 Task 2, which identifies whether or not Tweets on the social networking platform Twitter contain information about COVID-19.
A Tweet is considered informative if it includes information such as recovered, suspected, confirmed, and death cases, or the location or travel history of the patients. Specifically, we describe the problem as follows.
• Input: An English Tweet from the social networking platform.
• Output: One of two labels (INFORMATIVE and UNINFORMATIVE) predicted by our system.
Several examples are shown in Table 1.

Table 1
Tweet | Label
A New Rochelle rabbi and a White Plains doctor are among the 18 confirmed coronavirus cases in Westchester. HTTPURL | 0
Day 5: On a family bike ride to pick up dinner at @USER Broadway, we encountered our pre-COVID-19 Land Park happy hour crew keeping up the tradition at an appropriate #SocialDistance.HTTPURL | 1

In this paper, we make two main contributions.
• Firstly, we implemented four different models based on neural networks and transformers, namely Bi-GRU-CNN, BERT, RoBERTa, and XLNet, to solve WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets.
• Secondly, we propose a simple ensemble model that combines multiple deep learning and transformer models. This model outperforms each of the single models, with an F1 score of 88.81% on the test set and 90.65% on the development set.

During the COVID-19 pandemic, information about the number of infected cases and the number of patients is vital for governments. Dong et al. (2020) constructed a real-time database for tracking COVID-19 around the world. This dataset is collected by experts from the World Health Organization (WHO), the US CDC, and other medical agencies worldwide, and is operated by Johns Hopkins University. There are also many other COVID-19 datasets, such as multilingual data collected on Twitter since January 2020 (Chen et al., 2020) and the Real World Worry Dataset (RWWD) (Kleinberg et al., 2020). Moreover, COVID-19 information spreads extremely fast and in enormous volume on social media, which sometimes leads to misinformation.
Shahi et al. (2020) conducted a pilot study on detecting COVID-19 misinformation on Twitter by analyzing tweets with standard social media analytics techniques. With their results, the authors aim to help authorities and social media users counter misinformation. Moreover, according to Depoux et al. (2020), the rumors and conspiracy theories that emerged as COVID-19 spread caused fear and panic in communities, which led to racism against COVID-19 patients and citizens of infected countries, mass purchases of face masks, and shortages of necessities. It is thus necessary to identify the right information in social media text.

The dataset provided by the WNUT-2020 Task 2 organizers contains 10,000 English Tweets about COVID-19 and is used to automatically identify whether a tweet contains useful information about COVID-19 (informative) or not (uninformative). There are 4,719 INFORMATIVE tweets and 5,281 UNINFORMATIVE tweets in the dataset, and each tweet is annotated by three different annotators. The inter-annotator agreement of the dataset, measured by Fleiss' Kappa, is 81.80%. The dataset is split into training, development, and test sets in the proportion 7-1-2. Table 2 gives an overview of the dataset.

Table 2
Set | INFORMATIVE | UNINFORMATIVE
Training | 3,303 | 3,697
Development | 472 | 528
Test | 944 | 1,056

In this paper, we propose an ensemble method that combines deep learning models with transfer learning models to identify information about COVID-19 from users' tweets.

We implement the Bi-GRU-CNN model, which was previously used for salary prediction and for job prediction (Van Huynh et al., 2020), with the GloVe-300d word embedding (Pennington et al., 2014). This model consists of three main layers: the word representation layers (word embedding), the 1D convolutional layers (CONV-1D), and the bidirectional GRU layer (Bi-GRU). This model also achieved high performance in previous studies (Van Huynh et al., 2019).
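
To make the three-layer structure concrete, the following is a minimal PyTorch sketch and not the authors' implementation. It assumes the layers are applied in the order they are listed above (embedding, then CONV-1D, then Bi-GRU). The class name `BiGRUCNN`, the filter count, the kernel size, the GRU hidden size, and the max-pooling over time are all illustrative assumptions, and the embedding is randomly initialised here rather than loaded from GloVe-300d. The dropout rate of 0.2 is the value the paper reports for this model.

```python
import torch
import torch.nn as nn


class BiGRUCNN(nn.Module):
    """Illustrative classifier: embedding -> Conv1D -> Bi-GRU -> linear."""

    def __init__(self, vocab_size, embed_dim=300, conv_channels=128,
                 kernel_size=3, gru_hidden=64, num_classes=2, dropout=0.2):
        super().__init__()
        # Word representation layer; in practice the weights would be
        # initialised from GloVe-300d vectors (assumption: random here).
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # 1D convolution over the token axis; padding keeps the length.
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size,
                              padding=kernel_size // 2)
        # Bidirectional GRU over the convolved sequence.
        self.gru = nn.GRU(conv_channels, gru_hidden, batch_first=True,
                          bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * gru_hidden, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                  # (batch, seq, embed)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, channels, seq)
        x, _ = self.gru(x.transpose(1, 2))             # (batch, seq, 2*hidden)
        x = x.max(dim=1).values                        # max-pool over time
        return self.classifier(self.dropout(x))        # (batch, num_classes)


model = BiGRUCNN(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 32))  # 4 dummy tweets, 32 token ids each
logits = model(batch)                    # one score per label for each tweet
```

The two output scores per tweet correspond to the INFORMATIVE and UNINFORMATIVE labels; which index maps to which label is a free choice in this sketch.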
Fig 1 illustrates the Bi-GRU-CNN model.

Inspired by the success of transfer learning on many NLP tasks, such as text classification (Do and Ng, 2006; Rizoiu et al., 2019) and machine reading comprehension (Devlin et al., 2019), we used the SOTA transfer learning models BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019) with fine-tuning techniques for the problem of identifying informative tweets about COVID-19. In our experiments, we used the pre-trained language models described in Table 3. All of these pre-trained models are built on English texts.

Motivated by the success of ensemble models in previous tasks, we combine the single models above into an ensemble model for the final prediction.

In this study, we experimented with the datasets provided by WNUT-2020 Task 2. The training, development, and test sets are divided as described in Section 3. To evaluate our models, we use four metrics: accuracy, precision, recall, and F1.

To prepare data for the model training and model evaluation phases, we apply simple and effective pre-processing to the input data as follows:
• Step 1: Converting the tweet into a lowercase string.
• Step 2: Removing the user names in the tweets.
• Step 3: Deleting all URLs in the tweets.
• Step 4: Representing words as vectors with pre-trained word embedding sets for the deep neural network models.

Based on an analysis of the tweet lengths in the data, we set the maximum length of the models to 512. We train for 15 epochs for the Bi-GRU-CNN and XLNet models, and for 3 epochs for the BERT and RoBERTa models. After an extensive hyperparameter search, we set the learning rate to 1e-3 and dropout to 0.2 for the Bi-GRU-CNN model, and the learning rate to 1e-5 and dropout to 0.1 for the three models BERT, RoBERTa, and XLNet.

Experimental results of the single models and the ensemble model on the development set are presented in Table 4. Among the single models, the Bi-GRU-CNN model gives the lowest performance, with 85.66% F1 and 86.10% accuracy.
The best-performing single model is XLNet, which attained 89.86% F1 and 90.30% accuracy. In addition, the BERT model gives the highest precision at 89.53%, and the RoBERTa model achieves the highest recall at 90.74%. Our proposed ensemble model gives the best performance by combining the strengths of the single models: according to Table 4, it achieves 90.65% F1, 91.00% accuracy, and 92.37% recall. Specifically, our model improves F1 by 0.79% and accuracy by 0.70% over the best single model (XLNet), and recall by 0.63% over the RoBERTa model.

After the system evaluation of WNUT-2020 Task 2, Table 5 displays our ensemble model's results on the test set. These results are compared with those of the top 5 teams and the baseline model (BASELINE - FASTTEXT). Our model's F1 is 88.81%, which is 2.15% lower than that of the first-ranked team and 13.78% higher than that of the baseline model. For accuracy, we obtain 89.40%, which is 2.10% lower than the first-ranked team and 12.10% higher than the baseline model.

In addition, Table 6 displays some examples of prediction errors from the dataset. Most of the wrong predictions occurred because of the appearance of special tokens such as hashtags and the HTTPURL phrases, which stand for the URL links in the tweets. For INFORMATIVE tweets, the appearance of the HTTPURL phrase and the hashtag #coronavirus makes the classification model predict the wrong label. The same mistake occurs for UNINFORMATIVE tweets, where the HTTPURL phrase and hashtags related to the coronavirus affect the results of the prediction model.

This paper has presented our work on WNUT-2020 Task 2: Identifying COVID-19 Information on Twitter. We proposed an ensemble model combining deep learning models and transfer learning models for detecting information about COVID-19 from users' tweets.
Our ensemble model achieved 91.00% accuracy and 90.65% F1 on the development set, and 89.04% accuracy and 88.81% F1 on the public test set, which ranked #25 in the competition. In the future, we will improve our model's performance by exploring different features of users' tweets and transfer learning models with fine-tuning techniques. Finally, we hope our study can be applied in practice to detect COVID-19 information on social networks and support the battle against COVID-19 all over the world.

Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set
Annelies Wilder-Smith, and Heidi Larson. 2020. The pandemic of social media panic travels faster than the COVID-19 outbreak
BERT: Pre-training of deep bidirectional transformers for language understanding
Transfer learning for text classification
An interactive web-based dashboard to track covid-19 in real time
Measuring emotions in the covid-19 real world worry dataset
WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets
Nlp@uit at vlsp 2019: A simple ensemble model for vietnamese dependency parsing
Glove: Global vectors for word representation
Transfer learning for hate speech detection in social media
An exploratory study of covid-19 misinformation on twitter
Hate speech detection on vietnamese social media text using the bi-gru-lstm-cnn model
Job prediction: From deep neural network models to applications
New vietnamese corpus for machine reading comprehension of health news articles
Salary prediction using bidirectional-gru-cnn model
Xlnet: Generalized autoregressive pretraining for language understanding