title: Leveraging Commonsense Knowledge on Classifying False News and Determining Checkworthiness of Claims
authors: Schlicht, Ipek Baris; Sezerer, Erhan; Tekir, Selma; Han, Oul; Boukhers, Zeyd
date: 2021-08-08
* Corresponding author: ibarsch@doctor.upv.es. Work done during the time at the University of Koblenz-Landau, Germany.

Widespread and rapid dissemination of false news has made fact-checking an indispensable requirement. Given its time-consuming and labor-intensive nature, the task calls for automated support to meet the demand. In this paper, we propose to leverage commonsense knowledge for the tasks of false news classification and check-worthy claim detection. Arguing that commonsense knowledge is a factor in human believability, we fine-tune the BERT language model with a commonsense question answering task and the aforementioned tasks in a multi-task learning environment. For predicting fine-grained false news types, we compare the proposed fine-tuned model's performance with that of false news classification models on a public dataset as well as a newly collected dataset. To evaluate check-worthy claim detection, we compare the model's performance with that of the single-task BERT model and a state-of-the-art check-worthy claim detection tool. Our experimental analysis demonstrates that commonsense knowledge can improve performance in both tasks.

The increase of social media usage in recent years has changed the way news is consumed. Although social media is useful for following updates on breaking events such as the COVID-19 pandemic, it also misleads through false news that spreads rapidly and globally on these platforms (Karlova and Fisher, 2013; Vosoughi et al., 2018), which results in negative emotions, confusion, and anxiety in society (Budak et al., 2011) and even in the manipulation of major outcomes, such as political elections (Vosoughi et al., 2018; Lazer et al., 2018).

To combat false news, the number of fact-checking initiatives around the world has increased. However, manual fact-checking can satisfy only a small share of the demand because it is a labor-intensive and time-consuming task. It requires fact-checkers to contact people or organizations mentioned in the claim, consult experts who provide background knowledge if needed, seek source validation, and so on. Fact-checking a single news item typically takes about one day to research the facts and report the result (Graves, 2018). This yields a time lag between the spread of false news and the delivery of the fact-checked article (Hassan et al., 2015a).

In recent years, we have seen growing interest in developing computational approaches for combating false news. Some studies focus on automating the steps of manual fact-checking (Cazalens et al., 2018; Thorne and Vlachos, 2018; Graves, 2018). As a seminal study, ClaimBuster (Hassan et al., 2015b, 2017) is the first end-to-end automatic fact-checking framework and is widely used by professional fact-checkers and journalists (Adair et al., 2019).
ClaimBuster identifies check-worthy factual claims in texts from various data sources (e.g., social media, websites) using a classification and scoring model trained on human-labeled political debates, searches the identified claims against fact-checking websites, collects supporting/debunking evidence, and, in a final step, creates a report that combines the search results, the collected evidence, and the claim check-worthiness score. Although ClaimBuster is able to spot simple declarative claims, it misses those implicitly stated in sentences (Graves, 2018).

Other studies focus on false news detection based on the style of the news content or the social media context (Shu et al., 2017; Zhou and Zafarani, 2020). Most of these studies treat false news detection as a binary classification problem that labels a news article as fake or true. As an example, Singhania et al. (2017) propose a three-level hierarchical attention network model (3HAN) that exploits the article structure. The authors evaluated it on a dataset constructed from a small list of fake and legitimate news sources, where it outperformed models such as the single-level hierarchical attention network (HAN) (Yang et al., 2016). However, on the web, multiple types of false news can be found, such as satire or propaganda (Zannettou et al., 2019; Rashkin et al., 2017b). While satire conveys irony or parody and contains unrealistic situations, propaganda mimics the style of real news and can mislead readers with malicious intent. Furthermore, political polarization propels biases in news (Lazer et al., 2018). False news detection models must therefore be more fine-grained than the aforementioned binary models in order to address diverse types of false news. Therefore, the literature is in quest of new techniques and methodologies (Bozarth and Budak, 2020) as well as datasets (Torabi Asr and Taboada, 2019).

Addressing the diversity of false news, we identify one common trait that enables them all. We point out that the impression of veracity is created by the seeming plausibility of news stories, since humans believe depending on how much a story fits their prior knowledge (Connell and Keane, 2004, 2003). This background knowledge is termed commonsense knowledge. Goldwasser and Zhang (2016) already demonstrated that commonsense knowledge can improve satire detection over traditional text classification models. With this motivation, we integrate commonsense knowledge into the tasks of false news classification and check-worthy claim detection, applied to diverse news.

In our study, commonsense knowledge is captured by a commonsense question answering (CSQA) task (Talmor et al., 2019), where the inputs are questions and multiple choices as answer candidates, and the output is an answer (Task A in Table 1). For classifying different types of news on the web, we leverage the false news taxonomy proposed by Zannettou et al. (2019). In this task, the inputs are the body and title of a news article and the output is the news type, namely satire, conspiracy, propaganda, bias-right, bias-left, or neutral (Task B in Table 1). In check-worthy claim detection, the input is a sentence and the output is a label indicating its check-worthiness, namely check-worthy factual statement (CFS), non-factual statement (NFS), or unimportant factual statement (UFS) (Task C in Table 1). To transfer commonsense knowledge to Tasks B and C, we fine-tune a pre-trained BERT model jointly on the CSQA task and the respective target task in a multi-task training setup (Liu et al., 2019).
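To make the task formats concrete, the snippet below sketches how the three kinds of inputs could be tokenized for a shared BERT encoder with the HuggingFace tokenizer. It is a minimal illustration with made-up example texts; the input formatting, sequence lengths, and the CSQA-style question shown are assumptions, not the paper's verbatim setup.

```python
# Sketch: encoding the three task inputs for a shared BERT encoder
# (hypothetical formatting; the paper's exact encoding may differ).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Task A (CSQA): the question is paired with each answer candidate, as in
# standard multiple-choice fine-tuning; the model scores each pair.
question = "Where would you store a pillow case that is not in use?"
choices = ["drawer", "kitchen cupboard", "bedding store", "england", "bedroom"]
csqa_inputs = [
    tokenizer(question, choice, truncation=True, max_length=64, return_tensors="pt")
    for choice in choices
]

# Task B (false news classification): title and body form a sentence pair,
# truncated to BERT's maximum sequence length; the label is the news type.
task_b = tokenizer("Article title", "Article body ...",
                   truncation=True, max_length=512, return_tensors="pt")

# Task C (check-worthy claim detection): a single sentence, labelled as
# CFS, UFS, or NFS.
task_c = tokenizer("We have cut unemployment in half.",
                   truncation=True, max_length=64, return_tensors="pt")
```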
The main contributions of this study are as follows:

• To the best of our knowledge, this is the first attempt to leverage commonsense knowledge to classify fake news and detect check-worthy claims.

• We collected a new community interest news dataset (CIND) from the social media platform Reddit. Unlike the publicly available false news collection NELA 2019 (Gruppi et al., 2020), CIND is a collection (1) of news that news consumers found interesting or plausible, (2) that features a diverse number of sources, which makes the experiment of predicting news from an unseen source more reliable, and (3) that covers news events which occurred from 2016 to 2019, allowing for forecasting. We label both datasets with the false news taxonomy (Zannettou et al., 2019).

• We conducted an extensive set of experiments to validate our hypothesis. The results show that commonsense knowledge could improve (1) the predictions of 4 out of 6 classes on CIND and the predictions of bias-right articles from the NELA dataset in the experiment of predicting news articles from unseen sources, (2) the predictions of 3 out of 6 classes on CIND in the forecasting experiment, and (3) the predictions of all classes in the check-worthy claim detection task.

The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 presents the proposed models. Section 4 introduces the new dataset collected for this study along with the other datasets used for comparison. Finally, Section 5 discusses the experimental results.

In this section, we present the studies related to our research. Sections 2.1 and 2.2 outline the studies in check-worthy claim detection and false news classification, Section 2.3 presents the studies encoding commonsense knowledge for text classification tasks, and Section 2.4 presents the studies leveraging multi-task learning for fact-checking and false news classification.

Check-worthy claim detection is the first step of the fact-checking pipeline (Cazalens et al., 2018; Graves, 2018; Thorne and Vlachos, 2018). The component of ClaimBuster (Hassan et al., 2015b, 2017) that detects check-worthy claims is trained with an SVM classifier using a tf-idf bag of words, named entity types, POS tags, sentiment, and sentence length as the feature set. Gencheva et al. (2017) proposed a fully connected neural network model trained on claims and their related political debate content. Additionally, the CLEF CheckThat! Lab (CTL) has organized shared tasks to tackle this problem in political debates (Atanasova et al., 2018, 2019) and in social media (Barrón-Cedeño et al., 2020).

Style-based approaches for false news classification attempt to capture the writing style or deceptive clues in news articles (Shu et al., 2017; Zhou and Zafarani, 2020; Potthast et al., 2018). The methods range from hand-crafted feature-based methods (Volkova et al., 2017; Pérez-Rosas et al., 2018) to sophisticated deep neural networks (Singhania et al., 2017; Riedel et al., 2017; Karimi and Tang, 2019). Riedel et al. (2017) focus on the first part of the fake news detection problem: stance detection. Their model, RDEL, extracts the most frequent unigrams and bigrams, constructs tf-idf vectors for article headlines and bodies, and also computes the cosine similarity of headline and body. Finally, all of these features are fed into a multilayer perceptron for classification.
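For illustration, the RDEL-style feature construction described above could be sketched with scikit-learn as follows. This is a rough sketch only: the vocabulary size, MLP configuration, and example data are assumptions and do not reproduce the original implementation.

```python
# Sketch of RDEL-style features: tf-idf vectors over frequent uni/bigrams for
# headline and body, plus their cosine similarity, fed into an MLP classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neural_network import MLPClassifier

headlines = ["Senate passes budget bill", "Aliens endorse candidate"]
bodies = ["The senate approved the budget ...", "Sources say extraterrestrials ..."]
labels = [0, 1]  # illustrative stance labels (e.g., agree vs. unrelated)

# Fit a shared vocabulary of frequent unigrams and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
vectorizer.fit(headlines + bodies)

X_head = vectorizer.transform(headlines)
X_body = vectorizer.transform(bodies)
cos = np.array([cosine_similarity(X_head[i], X_body[i])[0, 0]
                for i in range(len(headlines))]).reshape(-1, 1)

# Concatenate headline tf-idf, body tf-idf, and the cosine similarity feature.
X = np.hstack([X_head.toarray(), X_body.toarray(), cos])
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200).fit(X, labels)
```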
The model 3HAN (Singhania et al., 2017) encodes the body of news articles similarly to HAN (Yang et al., 2016): the words in each sentence are encoded with a BiGRU (Cho et al., 2014), and an attention mechanism (Bahdanau et al., 2015) then identifies informative words in the sentence. Finally, sentences are passed through an attention layer to find informative sentences in the document before classifying the document with a dense layer. In addition to HAN, 3HAN also concatenates the encoded headline with the processed body of the article and runs an attention mechanism on the concatenated features before feeding them to the dense layer. Most of these aforementioned studies tackle the problem as binary classification. However, the datasets used in both of these studies (Rashkin et al., 2017a; Ghanem et al., 2020) cover only a few sources for each category. In our study, we increase the number of sources for each category and include articles from biased sources.

Incorporating commonsense knowledge into text representations can improve many tasks in NLP and NLU, such as machine comprehension (e.g., Ma et al., 2018). Closest to this paper is the study by Goldwasser and Zhang (2016), who leverage commonsense knowledge for satire detection. Their approach constructs a narrative representation of an article by extracting main actors, events, and statements, and then makes inferences to quantify the likelihood of those entities appearing in a real or satirical context. To leverage commonsense knowledge, one approach is to use knowledge-aware distributional word embeddings such as Numberbatch (Speer et al., 2017), which is built on the commonsense knowledge base ConceptNet. Alternatively, commonsense knowledge can be transferred by using multi-task learning (Bosselut et al., 2019). In our study, we evaluate both approaches.

Multi-task learning is motivated by human learning: while learning new tasks, we apply the knowledge gained from related tasks. In contrast to single-task learning, multi-task learning can learn a more general representation by leveraging the knowledge of auxiliary tasks when the original task has noisy or few samples (Ruder, 2017). There are several attempts to apply multi-task learning to tasks that aid fact-checking or detect false news. Kochkina et al. (2018) applied a multi-task learning model that encodes inputs with a shared LSTM and then jointly learns the tasks in a rumour verification pipeline (stance detection, veracity prediction, and rumour identification).

To test whether the exploitation of commonsense knowledge improves false news classification and check-worthy claim detection, we use the MTBERT model of Liu et al. (2019). Figure 1 illustrates the resulting model architecture. The model consists of two shared lower layers and two task-specific layers. The lower layers incorporate the original BERT architecture: the first layer maps the inputs to the tokens required by BERT, and the second layer contains the transformer encoders of BERT. The upper two layers represent the specific task configurations along with their loss functions. The model is trained in two steps. First, the shared BERT model is trained on the pre-training tasks of masked word prediction and next sentence prediction. In the second phase, all samples belonging to all tasks are shuffled. Then, each sample is used to train/fine-tune the shared parameters of BERT with respect to the loss function of its specific task. This training scheme enables the model to learn a task by transferring the information gained from the other tasks.
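The second training phase described above can be sketched as a shared encoder with task-specific heads trained on shuffled mini-batches. The following is a simplified illustration, assuming PyTorch and HuggingFace Transformers; the head definitions, task names, and training details are assumptions rather than the authors' exact implementation.

```python
# Simplified multi-task fine-tuning: a shared BERT encoder with task-specific
# heads, trained on mini-batches shuffled across tasks.
import random
import torch
from torch import nn
from transformers import BertModel

class MultiTaskBert(nn.Module):
    def __init__(self, num_news_classes=6, num_claim_classes=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # shared layers
        hidden = self.bert.config.hidden_size
        self.heads = nn.ModuleDict({
            "csqa": nn.Linear(hidden, 1),            # scores one (question, choice) pair
            "news": nn.Linear(hidden, num_news_classes),
            "claim": nn.Linear(hidden, num_claim_classes),
        })

    def forward(self, task, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output                   # [CLS]-based representation
        return self.heads[task](pooled)

def train_epoch(model, task_batches, optimizer):
    """task_batches: list of (task_name, input_ids, attention_mask, labels)."""
    random.shuffle(task_batches)                     # mix mini-batches across tasks
    loss_fn = nn.CrossEntropyLoss()
    for task, input_ids, attention_mask, labels in task_batches:
        logits = model(task, input_ids, attention_mask)
        if task == "csqa":
            # choices are grouped per question; the label is the correct choice index
            logits = logits.view(labels.size(0), -1)
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```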
In our work, instead of training the BERT model from scratch, we use the HuggingFace library to obtain the pre-trained BERT model that would result from the first training phase. In the second phase, we fine-tune the pre-trained model using CSQA as the first task and false news classification or check-worthy claim detection as the second task, depending on the objective. This allows the model to perform false news classification or check-worthy claim detection using the information gained from the CSQA task.

To test the performance of the fine-tuned MTBERT in false news classification, we compare our results against the state-of-the-art false news classification models RDEL and 3HAN. For both models, we encode the inputs with GloVe embeddings (Pennington et al., 2014), as originally stated in their papers. These embeddings, however, do not incorporate commonsense knowledge. In order to make them comparable, we also feed them Numberbatch embeddings, which incorporate ConceptNet commonsense knowledge (Speer et al., 2017). Additionally, we report the baseline performances of HAN and an SVM on this task. Furthermore, we compare the performance against the original BERT to confirm that the performance gains depend not only on the classifier but also on the use of commonsense knowledge. For the SVM, we used the 25,000 most frequent unigrams and bigrams as features.

We propose the community interest news dataset CIND to overcome the aforementioned limitations of NELA 2019. Instead of relying on the news sources' selection of articles, it respects the interest of news consumers. As the collection source, we choose Reddit because (1) it is popular across various communities, (2) it is openly accessible, and (3) it contains articles from a variety of news sources. Reddit users can share news and discuss them in online communities called subreddits. Each subreddit has its own discussion theme and moderation system. For instance, users share satire-like news articles from mainstream news sources in r/nottheonion and discuss conspiracy theories in r/conspiracy. Articles from unreliable sources are removed by the moderators of r/nottheonion, while only clickbait articles are allowed to be shared in r/savedyouaclick. We selected subreddits that were analyzed in Zannettou et al. (2017) or Horne et al. (2018), or were previously used as a dataset source (Nakamura et al., 2020), and that were active within 2016-2019. Additionally, we added r/fakenews, where false news stories are highlighted and fact-checks are shared, with respect to the specified time frame. Table 2 lists the subreddits used in this study.

We crawled posts with the Pushshift API (Baumgartner et al., 2020), ignoring those removed by moderators or users. We filtered out the posts whose metadata contain flair link text 1, which is used for posts that violate subreddit rules. We extracted the articles by using Newspaper3k 2. We filtered out articles that are not in English, non-news sources such as Facebook and YouTube, and sources that were not accessible due to technical issues, such as huffpost.com.

We categorized the sources of the news articles in both datasets based on the false news taxonomy proposed by Zannettou et al. (2019). The news types we selected from the taxonomy are satire, conspiracy, propaganda, and biased (left and right) as false news types, and additionally neutral news as the most credible type.
For identifying the news sources in each category, we leveraged Media Bias Fact Check (MBFC) 3, an independent organization that manually annotates the factuality and political leaning of media sources. Labels provided by MBFC have been widely used by the research community (e.g., Baly et al., 2019). We scraped MBFC labels 4 and augmented the list with satire sources from the r/satire subreddit 5. We describe the source types below in terms of the traits of online information (Wardle and Derakhshan, 2017; Zannettou et al., 2019).

Satire sources use irony, exaggeration, and humour. Satirical articles do not aim to deceive the news consumer, but to entertain. However, if satire is taken seriously, it misinforms. We used the sources in the r/satire subreddit and excluded sources that are not mutually exclusive (e.g., https://www.newyorker.com/humor is also a biased source).

Conspiracy sources are not credible and mostly consist of articles that are not verifiable. These sources fabricate content with the intention to disinform. We extracted conspiracy sources from the conspiracy-pseudoscience category of MBFC.

Propaganda sources influence the news consumer in favor of a particular agenda. They may mislead in order to frame issues or individuals. We manually checked the questionable source category of MBFC to identify propaganda sources. MBFC provides tags that explain why a source is questionable; one such tag is propaganda. Thus, we removed the sources whose tags contain labels other than propaganda.

Neutral sources are the most credible. They are least biased, and their reporting is factual and verifiable. We extracted least biased sources from MBFC as neutral sources.

Biased sources are strongly biased toward one ideology (typically conservative or liberal) in their story selection and framing. We extracted biased sources from MBFC as bias sources. The bias category of the dataset contains 61% bias-left and 39% bias-right sources.

After identifying the source type of each article in both datasets, we removed sources with fewer than 10 articles and down-sampled sources with more than 250 documents. Additionally, for the CIND dataset, we removed the outliers in each category by computing the token lengths of the article bodies and applying the local outlier factor algorithm (Breunig et al., 2000), yielding the final dataset. The details of CIND and NELA 2019 are shown in Table 4, and example articles from CIND are shown for each source type in Table 3.

To evaluate the model at the claim level, we used the ClaimBuster dataset (Arslan et al., 2020). It contains 23k human-annotated short statements with metadata, drawn from all U.S. presidential debates between 1960 and 2016. The dataset has been used for developing ClaimBuster. As part of the annotation process, the authors asked the coders to label sentences as check-worthy factual claims (CFS) if they contain factual claims whose veracity the public would be interested in learning. Similarly, if sentences contain factual statements but are not worth being fact-checked, they are annotated as unimportant factual sentences (UFS). Lastly, the coders labeled subjective sentences, such as opinions, as non-factual sentences (NFS).

We used the CSQA dataset (Talmor et al., 2019) for the model to learn commonsense knowledge. The dataset was created based on the commonsense knowledge encoded in ConceptNet (Speer et al., 2017).
The dataset contains 12k multiple-choice questions, each with one correct answer and four distractor choices.

In our experiments, we investigate whether the CSQA task helps with (1) robustness to new events and style changes by news publishers (Section 5.1), (2) classifying news from previously unseen publishers (Section 5.2), and (3) discovering check-worthy claims (Section 5.3). We report per-class and average macro-F1 scores for each experiment.

In this analysis, we test the models' robustness against new events or style changes by the news publishers. For this, we adopt the forecasting experiment proposed by Bozarth and Budak (2020): first, we extract from the CIND dataset the sources that published articles between 2015 and 2019. We use the samples published before 2019 as the training set and the rest as the test set. Table 6 shows the per-category macro-F1 scores of the models, and Figure 2 shows the macro-F1 scores of the BERT variants for each month in 2019. Overall, the BERT models outperform SVM, HAN, and RDEL in this task; however, detecting bias-right articles is hard for all of the models. Additionally, even though MTBERT using the merged features cannot outperform the single-task model in terms of macro-F1, it yields performance gains for the conspiracy (3.42%), propaganda (2.87%), and bias-left (1.21%) classes. As shown in Figure 2, the performance of the BERT variants on the satire class degrades over the year, but their performance on the propaganda and bias-left types is stable. Furthermore, MTBERT (with the merged features) fluctuates less than single-task BERT on propaganda across the year, suggesting that it may generalize better at identifying propaganda throughout a year. Moreover, as shown in Figure 5, MTBERT produces a higher number of true positives than single-task BERT on conspiracy articles. However, single-task BERT is better at identifying neutral articles in the forecasting task.

To meet the condition of unseen publishers, we organize our tests such that the publisher of a news article in the test set has not been encountered before. To see how MTBERT generalizes in such a scenario, we adopt the evaluation scheme for predicting unseen sources (Bozarth and Budak, 2020) and apply it to the NELA and CIND datasets. First, we group news articles by source (Reuters, Fox News, etc.) under each source type (conspiracy, propaganda, and so on). From each source type, we randomly sample 90% of the sources for the training set and use the rest as the test set, repeating this 5 times. We report the mean and standard deviation of the macro-F1 scores across the splits in Table 7.

Similar to the forecasting experiments, both BERT-based models outperform all baseline models, although a significant performance drop is observed for all models compared to the forecasting experiments. This is expected: in the forecasting experiments, the test set contains articles from sources that may also have samples in the training set, which makes the predictions easier because the models can also learn from the stylistic features of the sources. We observe that satire sources are the best detected type in the CIND dataset, but not in the NELA dataset, whose test sets contain only one satire source, which adds a new challenge for the models. We additionally ran a paired t-test on the scores. MTBERT combining all features significantly outperformed the single-task version on CIND (at a p-value of 0.001, with a large effect size: Cohen's d of 2.339).
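For reference, such a paired comparison can be computed as in the sketch below, which uses placeholder per-split macro-F1 scores; the numbers are illustrative only and are not the values reported in Table 7.

```python
# Sketch: paired t-test and Cohen's d for paired samples, comparing the
# per-split macro-F1 scores of two models (placeholder numbers).
import numpy as np
from scipy import stats

mtbert_f1 = np.array([0.52, 0.49, 0.55, 0.51, 0.53])   # hypothetical per-split scores
single_f1 = np.array([0.45, 0.44, 0.47, 0.46, 0.45])

t_stat, p_value = stats.ttest_rel(mtbert_f1, single_f1)

diff = mtbert_f1 - single_f1
cohens_d = diff.mean() / diff.std(ddof=1)               # effect size for paired samples

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, Cohen's d = {cohens_d:.3f}")
```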
Yet single-task BERT achieves a 0.39% better F1 score than MTBERT on the NELA dataset, and no significant difference is observed. The reason could be differences in how the datasets were constructed. For example, CIND contains samples selected by Reddit users, which might contain clickbait or check-worthy statements, whereas NELA samples are collected from the RSS feeds of websites, where publisher preference plays a significant role. That may also explain why the title of an article does not improve the results for NELA. As in the forecasting task, commonsense knowledge has a positive effect on true-positive conspiracy samples on CIND, as seen in Figure 6. However, both BERT models mostly misclassified samples from propaganda sources as bias-right articles (Figures 6 and 7).

Identifying Check-worthy Claims

Finally, we assess the help of commonsense knowledge in detecting check-worthy claims. To this end, we evaluate MTBERT on the check-worthy claim detection task (see Section 4.2). We use the splits provided by the authors (Arslan et al., 2020). Table 8 shows the performance of the models. Utilizing commonsense knowledge significantly improved the scores of each class in the task. As shown in Figure 8, while single-task BERT confuses unimportant and important factual sentences, MTBERT correctly classifies some of these misclassified samples. This implies that commonsense knowledge can help the detection of check-worthy claims among all types of statements in the dataset.

Adding CSQA as an auxiliary task was more effective on true positives in check-worthy claim detection than in multi-class false news classification. This could be because the data formats of the commonsense question answering task and the check-worthy claim detection task are similar, which has a positive effect on knowledge transfer between the tasks. Both tasks contain short texts of one or two sentences, while the samples in the false news detection datasets are composed of long texts, which may lead to less token overlap between the task inputs.

In the experiments on the false news detection task, we found that detecting bias-right articles was, in general, difficult for all classifiers. The reason for this difficulty could be a potential data bias. Social media and crowd-sourced platforms tend to lend high visibility to viral content, which includes right-leaning and left-leaning news publishers that are by definition extreme in their worldviews, coupled with a sensationalist tone. This issue is also reflected in the existing NELA dataset and our newly collected CIND dataset. In our study, we observed that right-leaning articles are mostly misclassified as left-leaning articles. The difficulty of detecting right-leaning articles is also observed by Potthast et al. (2018) for the hyperpartisan news task. Additionally, Bozarth and Budak (2020) found that fake news is mostly misclassified as right-leaning news and observed that right-leaning news publishers sometimes campaigned with false information. The significant difference between our study and the prior studies (Potthast et al., 2018; Bozarth and Budak, 2020) is that we tackle the problem as multi-class news type classification, because different fake news types may have different implications. Biased sources can also misinform readers (Zannettou et al., 2019; Wardle and Derakhshan, 2017), and fine-grained detection is vital for prioritizing what should be fact-checked.
We attempt to transfer commonsense knowledge to BERT representations implicitly with a multi-task approach. We achieved better performance on the check-worthy claim detection task because its data format is similar to that of CSQA. A sentence-based approach could be utilized to improve the performance in false news detection tasks and transfer knowledge more effectively than the current method. Also, explicit methods could be used for false news detection tasks. For example, a new task could be introduced: a plausibility detection task on news articles, in which annotators evaluate the degree of believability of articles. This task could then be used as an auxiliary task in the same way as CSQA.

In this paper, we explore the impact of commonsense knowledge on the tasks of false news classification and check-worthy claim detection. To learn commonsense knowledge implicitly, we fine-tune BERT jointly with CSQA for each task. The results show that the proposed model can improve the predictions of minority classes in the datasets (e.g., conspiracy in CIND and CFS in the check-worthy claim detection task). Also, similar input formats, such as those of CSQA and check-worthy claim detection, can have a positive effect on performance. In conclusion, we introduced a new, challenging dataset for the false news classification task, and, to our knowledge, this is the first work that examines the effects of using commonsense knowledge on the false news classification and check-worthy claim detection tasks.

References

The human touch in automated fact-checking: How people can help algorithms expand the production of accountability journalism
A benchmark dataset of check-worthy factual claims
Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness
Overview of the CLEF-2019 CheckThat! lab: Automatic identification and verification of claims. Task 1: Check-worthiness
Generating fact checking explanations
Neural machine translation by jointly learning to align and translate
Predicting factuality of reporting and bias of news media sources
Multitask ordinal regression for jointly predicting the trustworthiness and the leading political ideology of news media
Overview of CheckThat! 2020: Automatic identification and verification of claims in social media
Proppy: Organizing the news based on their propagandistic content
The Pushshift Reddit dataset
COMET: Commonsense transformers for automatic knowledge graph construction
Toward a better performance evaluation framework for fake news classification
LOF: Identifying density-based local outliers
Limiting the spread of misinformation in social networks
A content management perspective on fact-checking
Learning phrase representations using RNN encoder-decoder for statistical machine translation
PAM: A cognitive model of plausibility
What plausibly affects plausibility? Concept coherence and distributional word coherence as factors influencing plausibility judgments
Commonsense knowledge enhanced memory network for stance classification
A context-aware approach for detecting worth-checking claims in political debates
An emotional analysis of false information in social media and news articles
Understanding satirical articles using common-sense
Understanding the promise and limits of automated fact-checking
NELA-GT-2019: A large multi-labelled news dataset for the study of misinformation in news articles
The quest to automate fact-checking
Detecting check-worthy factual claims in presidential debates
ClaimBuster: The first-ever end-to-end fact-checking system
Assessing the news landscape: A multi-module toolkit for evaluating the credibility of news
Learning hierarchical discourse-level structure for fake news detection
A social diffusion model of misinformation and disinformation for understanding human information behaviour
All-in-one: Multi-task learning for rumour verification
The science of fake news
Multi-task deep neural networks for natural language understanding
Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM
Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection
NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles
GloVe: Global vectors for word representation
Automatic detection of fake news
A stylometric inquiry into hyperpartisan and fake news
Multilingual connotation frames: A case study on social media for targeted sentiment analysis and forecast
Truth of varying shades: Analyzing language in fake news and political fact-checking
A simple but tough-to-beat baseline for the fake news challenge stance detection task
Deception detection for news: Three types of fakes
An overview of multi-task learning in deep neural networks
DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter
Fake news detection on social media: A data mining perspective
3HAN: A deep neural network for fake news detection
ConceptNet 5.5: An open multilingual graph of general knowledge
CommonsenseQA: A question answering challenge targeting commonsense knowledge
Automated fact checking: Task formulations, methods and future directions
Big data and quality data for fake news and misinformation detection
Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on Twitter
The spread of true and false news online
Yuanfudao at SemEval-2018 Task 11: Three-way attention and relational knowledge for commonsense machine comprehension
Information disorder: Toward an interdisciplinary framework for research and policy making
HuggingFace's Transformers: State-of-the-art natural language processing
Opinion-aware knowledge embedding for stance detection
Hierarchical attention networks for document classification
The web centipede: Understanding how web communities influence each other through the lens of mainstream and alternative news sources
The web of false information: Rumors, fake news, hoaxes, clickbait, and various other shenanigans
Knowledge-enriched transformer for emotion detection in textual conversations
Improving question answering by commonsense-based pre-training
A survey of fake news: Fundamental theories, detection methods, and opportunities

Appendix

7.1 Details on CIND

This section gives more details about CIND.
Figures 3 and 4 show the distributions of source types in each subreddit that we used. In the diagrams, the bias tag combines bias-right and bias-left articles. Even though one could expect some subreddits to contain only specific source types, some source types are shared across multiple subreddits. For instance, neutral sources appear in every subreddit.

Tables 9 and 10 give an overview of the class distributions for the forecasting and unseen prediction tasks on CIND. Table 10 (data splits for the unseen prediction task) reports the per-class article counts for the training and test sets of each split:

Split 1 - Test: 60 271 219 105 192 77
Split 2 - Train: 689 1467 1330 1189 1198 1724; Test: 33 414 35 21 81 252
Split 3 - Train: 493 1500 1301 1167 1226 1864; Test: 229 381 64 40 53 112
Split 4 - Train: 666 1653 1072 1173 1198 1686; Test: 56 228 293 34 81 290
Split 5 - Train: 648 1770 1162 1094 1057 1859; Test: 74 111 203 113 222 117

For the false news detection task, we apply the following preprocessing steps to the news articles using the clean-text library 6 before encoding them as input to the models (a sketch of these steps is given at the end of this appendix):

• We lowercase tokens.
• We replace URLs, emails, phone numbers, numbers, and currency symbols with specific tags.
• We fix the unicode characters and remove extra whitespace.

We list the hyperparameters of each model and the parameters used for training them in Table 11.
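A minimal sketch of the preprocessing listed above, assuming the clean-text Python package; the exact option values are not specified in the text, so the flags and replacement tags below are illustrative.

```python
# Sketch: cleaning an article body with the clean-text package before
# feeding it to the models (illustrative option values).
from cleantext import clean

def preprocess(text: str) -> str:
    return clean(
        text,
        fix_unicode=True,            # fix broken unicode
        lower=True,                  # lowercase tokens
        no_urls=True,                # replace URLs with a tag
        no_emails=True,              # replace emails with a tag
        no_phone_numbers=True,       # replace phone numbers with a tag
        no_numbers=True,             # replace numbers with a tag
        no_currency_symbols=True,    # replace currency symbols with a tag
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUMBER>",
        replace_with_currency_symbol="<CUR>",
        lang="en",
    )

print(preprocess("Contact us at info@example.com or visit https://example.com!"))
```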