Constraint 2021: Machine Learning Models for COVID-19 Fake News Detection Shared Task
Thomas Felber
2021-01-11

Abstract. In this system paper we present our contribution to the Constraint 2021 COVID-19 Fake News Detection Shared Task, which poses the challenge of classifying COVID-19-related social media posts as either fake or real. In our system, we address this challenge by applying classical machine learning algorithms together with several linguistic features, such as n-grams, readability, emotional tone and punctuation. In terms of pre-processing, we experiment with various steps like stop word removal, stemming/lemmatization, link removal and more. We find our best performing system to be based on a linear SVM, which obtains a weighted average F1 score of 95.19% on test data, placing us in the middle of the leaderboard (place 80 of 167).

The internet functions as a valuable resource for individuals who seek information online. A majority of people, for instance, use it when searching for health-related information [11]. With the rise of social media in recent years, platforms like Facebook, Instagram and Twitter have become a key resource for following updates during crisis events [12] such as the COVID-19 pandemic. On these platforms any registered user can publish and disseminate unverified content. This can act as a breeding ground for disinformation and fake news, potentially exposing millions of users to harmful misinformation in a short period of time [10]. In order to prevent the negative impact of misinformation spread, there is an incentive to develop new methods capable of assessing the validity of social media posts.

Constraint 2021 is a workshop on combating online hostile posts (hate speech, fake news, cyberbullying, etc.) in regional languages during emergency situations, held as part of the 35th AAAI Conference on Artificial Intelligence on February 8, 2021. It encourages researchers from interdisciplinary domains to think of new ways to combat online hostile posts during emergencies such as the 2020 US presidential election or the COVID-19 pandemic. In alignment with that, the workshop organizes a shared task [13] and invites participation in its two subtasks, which focus on COVID-19 Fake News Detection in English (Subtask 1) and Hostile Post Detection in Hindi (Subtask 2). In this paper, we discuss our take on Subtask 1, where we apply classical machine learning algorithms with diverse groups of linguistic features, covering n-grams, readability, emotional tone and punctuation, in order to classify social media posts as either fake or real.

Interest in fake news detection has increased in recent years, and there are various shared tasks in this domain, like Profiling Fake News Spreaders on Twitter [19], the Fake News Challenge (FNC-1) [17], RumorEval [3,8], CheckThat! [1], and Fact Extraction and VERification (FEVER) [21,22]. In Profiling Fake News Spreaders on Twitter, the task aims at identifying possible fake news spreaders on social media as a first step towards preventing fake news from being propagated among online users.
The organizers of FNC-1 propose that their task be tackled from a stance perspective, i.e. estimating the stance of a body text from a news article relative to a headline, where the body text may agree with, disagree with, discuss or be unrelated to the headline. This is similar to task A of RumorEval, whose objective is to track how other sources orient to the accuracy of a rumourous story. In both task B of RumorEval and task B of CheckThat!, the goal is to predict the veracity of a given rumor or claim. Task A of CheckThat! is concerned with the detection of check-worthy tweets, i.e. tweets that include a claim that is of interest to a large audience (especially journalists), might have a harmful effect, etc. The idea behind FEVER is to evaluate the ability of systems to verify factual claims by extracting evidence from Wikipedia. Based on the extracted evidence, the claims are labeled Supported, Refuted, or NotEnoughInfo (if there is not sufficient evidence to either support or refute them).

Through these shared tasks, numerous methods for fake news detection have been developed. Zhou et al. [24] propose GEAR, a graph-based evidence aggregating and reasoning framework that enables information to be transferred on a fully-connected evidence graph and then utilizes different aggregators to collect multi-evidence information. In conjunction with BERT [4], an effective pre-trained language representation model, they are able to achieve a promising FEVER score of 67.10%. Umer et al. [23] address the problem of fake news stance detection in FNC-1 with a hybrid neural network approach that combines the capabilities of CNN and LSTM, together with two dimensionality reduction approaches, Principal Component Analysis (PCA) and Chi-Square. Their experimental results show that PCA outperforms Chi-Square, and they obtain state-of-the-art performance with 97.8% accuracy. Ghanem et al. [7] use classical machine learning algorithms together with stylistic, lexical, emotional, sentiment, meta-structural and Twitter-based features for the task of rumor stance classification in RumorEval, and introduce two novel feature sets that proved to be successful for this task.

The Constraint 2021 Shared Task is composed of two individual subtasks:

- Subtask 1: COVID-19 Fake News Detection in English
- Subtask 2: Hostile Post Detection in Hindi

Since our contribution to the shared task is solely concerned with Subtask 1, we now give a more precise definition of that task: given a collection of COVID-19-related posts from various social media platforms such as Twitter, Facebook, Instagram, etc., classify each post as either fake or real, as illustrated in Table 1. The social media posts are drawn from a manually annotated COVID-19 fake news dataset, which we describe in the following section.

The social media posts provided for Subtask 1 come from a fake news dataset [14] containing 10,700 manually annotated social media posts and articles of real and fake news on COVID-19. The dataset is split into training, validation and test sets in a 3:1:1 ratio, as illustrated in Table 2. Regarding class label distribution, a consistent balance across the training, validation and test sets has been established, with 52.34% of the samples attributed to real news and the remaining 47.66% to fake news. Further dataset metrics can be derived by determining the total number of unique words, the average number of words per post and the average number of characters per post, as shown in Table 3. Based on the data in that table, it can be observed that real news posts are, on average, longer than fake news posts in terms of both characters per post and words per post. The size of the vocabulary, given by the unique words in the dataset, is 37,505, where 5,141 of these words are contained in both fake and real posts.
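For illustration, statistics of this kind can be recomputed directly from the released data. The following is a minimal pandas sketch, assuming a hypothetical file name ("Constraint_Train.csv") and column names ("tweet", "label", with labels "fake"/"real"); the actual release may be laid out differently.

import pandas as pd

# Assumed file and column names; the released CSV layout may differ.
df = pd.read_csv("Constraint_Train.csv")

# Average words and characters per post, per class (cf. Table 3).
for label, group in df.groupby("label"):
    words = group["tweet"].str.split()
    print(f"{label}: "
          f"avg words/post = {words.str.len().mean():.1f}, "
          f"avg chars/post = {group['tweet'].str.len().mean():.1f}")

# Vocabulary size and overlap between the two classes.
vocab = {label: {w for post in group["tweet"] for w in post.lower().split()}
         for label, group in df.groupby("label")}
print("unique words overall:", len(set.union(*vocab.values())))
print("words in both classes:", len(set.intersection(*vocab.values())))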
Each team participating in Subtask 1 has access to the training, validation and test splits of the dataset, where the training and validation data are labeled and the test data are unlabeled. Participating systems are expected to predict class labels for the social media posts contained in the test data. Performance is measured in terms of weighted average F1 score; weighted average precision, weighted average recall and accuracy are also reported. The best performing baseline provided by the organizers is a linear SVM with a weighted average F1 score of 93.46%.

In our approach, we experiment with different pre-processing pipelines and apply classical machine learning algorithms with diverse groups of linguistic features, such as n-grams, readability, emotional tone and punctuation. Regarding machine learning, we utilize scikit-learn [15], a Python module providing a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. For some of our pre-processing steps we utilize functionality provided by the Natural Language Toolkit (NLTK) [9]. Furthermore, to extract some of the linguistic features, we use Linguistic Inquiry and Word Count (LIWC) [16], a text analysis program capable of reading a given text and counting the percentage of words that reflect different emotions, thinking styles, social concerns, and even parts of speech.

Using scikit-learn's pipeline capability, we create and experiment with different pre-processing pipelines, in each of which the following steps may be applied:

- Stop word removal: If this step is applied, English stop words are removed from the social media posts. The stop word dictionary is provided by NLTK.
- Link removal: In this step, hypertext links are removed from the social media posts. This is accomplished via regular expressions.
- Lemmatization or stemming: During this step, either lemmatization or stemming is performed. Lemmatization is achieved with NLTK's WordNet Lemmatizer and stemming with NLTK's Snowball Stemmer implementation, which is based on the Porter2 stemming algorithm [18].
- Reply removal: In this step, words starting with @ (mostly used for Twitter replies) are removed. This is also accomplished via regular expressions.

Apart from the optional pre-processing steps described above, the following steps are always applied to every social media post:

- Lowercase transformation: To account for differences in capitalization, every word is transformed to lowercase.
- XML entity replacement: Some social media posts contain XML entities such as "&amp;", "&gt;" and "&lt;". In this step, those entities are replaced by their corresponding text symbols.
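To make the pipeline concrete, here is a minimal Python sketch of the steps above, built on NLTK and the standard library. The regular expressions and the function interface are illustrative choices, not the exact implementation used in the system, and the relevant NLTK corpora (stopwords, wordnet) are assumed to have been downloaded.

import re
import html

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer

STOP_WORDS = set(stopwords.words("english"))  # NLTK's English stop word list
LEMMATIZER = WordNetLemmatizer()              # WordNet Lemmatizer
STEMMER = SnowballStemmer("english")          # Snowball stemmer (Porter2)

def preprocess(post, remove_stop_words=True, remove_links=True,
               remove_replies=True, mode="lemmatize"):
    # Always applied: XML entity replacement, then lowercasing.
    post = html.unescape(post).lower()
    if remove_links:
        # Assumed link pattern; the exact regex used in the system is not given.
        post = re.sub(r"https?://\S+", "", post)
    if remove_replies:
        # Words starting with @ (mostly Twitter replies).
        post = re.sub(r"@\w+", "", post)
    tokens = post.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if mode == "lemmatize":
        tokens = [LEMMATIZER.lemmatize(t) for t in tokens]
    elif mode == "stem":
        tokens = [STEMMER.stem(t) for t in tokens]
    return " ".join(tokens)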
In terms of features, we focus on linguistic aspects such as n-grams, readability, emotional tone and punctuation. Based on these features, together with scikit-learn's pipeline capability, we create and experiment with different feature extraction pipelines, in each of which the following features may be contained:

- N-grams: We extract unigrams and bigrams derived from the bag-of-words representation of each pre-processed social media post. To address differences in content length, we encode these features as TF-IDF values. To accomplish this, we make use of scikit-learn's CountVectorizer in conjunction with its TfidfTransformer. Regarding tokenization, we experiment with NLTK's TweetTokenizer and WordPunctTokenizer.
- Readability: Here we extract features that reflect text understandability. These features are based on properties such as words per sentence, characters per word, syllables per word, number of complex words, and so on. Various readability metrics that calculate a readability score from this information already exist. In our experiments, we calculate several of these metrics, including the Automated Readability Index (ARI) [20], Flesch-Kincaid [5], Coleman-Liau [2], Flesch Reading Ease [6] and more.
- Emotional tone: We use the Linguistic Inquiry and Word Count text analysis software (LIWC, version 1.6.0, 2019) [16] to extract the proportions of words that fall into certain psycholinguistic categories (e.g., positive emotions, negative emotions, cognitive processes, social words). In our work, we experiment with features derived from LIWC summary categories (e.g. emotional tone, analytical thinking) and psychological categories (e.g. social words, cognitive processes).
- Punctuation: Also using LIWC, we construct a punctuation feature set composed of various types of punctuation as derived from LIWC's punctuation categories. These categories cover punctuation characters such as question marks, exclamation marks, periods, commas, dashes, etc.
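As a concrete example of the n-gram feature extraction, the sketch below chains CountVectorizer and TfidfTransformer with NLTK's TweetTokenizer; the parameter values shown are illustrative rather than the configuration ultimately selected by grid search, and the example posts are placeholders.

from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Chain raw n-gram counts with TF-IDF weighting, tokenizing tweets with NLTK.
ngram_features = Pipeline([
    ("counts", CountVectorizer(tokenizer=TweetTokenizer().tokenize,
                               ngram_range=(1, 2))),  # unigrams and bigrams
    ("tfidf", TfidfTransformer()),
])

# Placeholder posts standing in for pre-processed social media posts.
posts = ["covid-19 vaccine rollout begins", "miracle cure claims spread online"]
X = ngram_features.fit_transform(posts)
print(X.shape)  # (n_posts, n_ngram_features)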
We conduct several experiments with different combinations of pre-processing steps and feature sets. Regarding machine learning algorithms, we use a Support Vector Machine (SVM) with linear kernel, Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB) and a Multilayer Perceptron (MLP), all of which are provided by scikit-learn [15]. Every classifier is trained on the predefined training split of the dataset and validated against the predefined validation split in terms of accuracy, precision, recall and F1 measure. In order to tune the classifiers, we employ the grid search method offered by scikit-learn. Within this method, we first predefine sets of pre-processing steps, features and hyperparameters. Then the grid search is started, in which elements of these sets are automatically combined into different classification pipelines, which are then trained and evaluated using the respective training and validation data. After the grid search has finished, the best performing combination is retrieved (a sketch of this setup is given below, after the results). Table 4 shows the results of the highest performing models for each machine learning approach as found via grid search.

Table 4. Performance on validation data in terms of accuracy (Acc), weighted average precision (P), weighted average recall (R) and weighted average F1 score (F1) of the highest performing models in each machine learning approach as found via grid search.

Overall, only the SVM and LR models were able to beat the baseline. The confusion matrix corresponding to the SVM is illustrated in Figure 1. In Table 5, we take a closer look at the concrete pre-processing and feature combination that led to the best performance in our SVM model and compare it to other combinations.

Table 5. Different feature and pre-processing combinations as used in our linear SVM and the corresponding model performance in terms of weighted average F1 score on validation data. Model A denotes the best feature and pre-processing combination as found by grid search (N-grams: unigrams and bigrams are extracted from pre-processed social media posts).

We observe that using readability, punctuation and emotional features alone does not perform very well. In combination with each other, performance increases noticeably; however, overall performance is still far from the top. Interestingly, using n-gram features with pre-processing alone works fairly well and is already enough to outperform the baseline. With the additional introduction of the remaining features we reach our maximum performance.
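The grid search setup described above can be sketched as follows. This is a minimal example, assuming scikit-learn's PredefinedSplit is used to mimic tuning against the fixed validation split; the toy data and the parameter grid are placeholders rather than the search space actually used, which also varies pre-processing steps and feature sets.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the labeled training and validation splits.
train_posts, train_labels = ["post one", "post two"], ["real", "fake"]
val_posts, val_labels = ["post three"], ["real"]

# PredefinedSplit: -1 marks samples that always stay in training,
# 0 marks samples that form the single validation fold.
split = PredefinedSplit([-1] * len(train_posts) + [0] * len(val_posts))

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
])

# Illustrative search space over feature settings and hyperparameters.
param_grid = {
    "counts__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=split)
search.fit(train_posts + val_posts, train_labels + val_labels)
print(search.best_params_, search.best_score_)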
In total, shared task participants were allowed to submit five system runs. We decided to submit one system run for each of our five types of machine learning models. To do so, we used the models from Table 4 and had them predict labels for the social media posts contained in the test set. Among our five submissions, the one using the linear SVM yielded the best result, with a weighted average F1 score of 95.19%. This is consistent with our expectations based on the validation results, and places us in the middle of the leaderboard, at rank 80 of 167.

In this paper we presented an overview of our participation in the Constraint 2021 COVID-19 Fake News Detection Shared Task. We followed a classical machine learning approach, in which we used a Support Vector Machine, Random Forest, Logistic Regression, Naive Bayes and a Multilayer Perceptron for classification. Feature-wise, we experimented with several linguistic features, such as n-grams, readability, emotional tone and punctuation. Our pre-processing pipeline consisted of steps like stop word removal, stemming/lemmatization, link removal and more. Each of the five classifiers we used was tuned by conducting a grid search over different combinations of feature sets, pre-processing steps and hyperparameters. In the end, we submitted the best run of each classifier and achieved our best result (a 95.19% weighted average F1 score on test data) with our Support Vector Machine, landing at place 80 of 167 on the leaderboard.

References

[1] Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims
[2] A computer readability formula designed for machine scoring
[3] SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours
[4] BERT: Pre-training of deep bidirectional transformers for language understanding
[5] Flesch-Kincaid readability test
[6] A new readability yardstick
[7] UPV-28-UNITO at SemEval-2019 Task 7: Exploiting post's nesting and syntax information for rumor stance classification
[8] SemEval-2019 Task 7: RumourEval, determining rumour veracity and support for rumours
[9] NLTK: The Natural Language Toolkit
[10] Media manipulation and disinformation online
[11] Online health information seeking: The influence of age, information trustworthiness, and search challenges
[12] Online social media in crisis events
[13] Overview of CONSTRAINT 2021 shared tasks: Detecting English COVID-19 fake news and Hindi hostile posts
[14] Fighting an infodemic: COVID-19 fake news dataset
[15] Scikit-learn: Machine learning in Python
[16] The development and psychometric properties of LIWC2015
[17] The Fake News Challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news
[18] The English (Porter2) stemming algorithm
[19] Overview of the 8th author profiling task at PAN 2020: Profiling fake news spreaders on Twitter
[20] Automated Readability Index
[21] The Fact Extraction and VERification (FEVER) shared task
[22] The second Fact Extraction and VERification (FEVER2.0) shared task
[23] Fake news stance detection using deep learning architecture (CNN-LSTM)
[24] GEAR: Graph-based evidence aggregating and reasoning for fact verification