key: cord-0275009-yup7b9ig authors: Sheng, Qiang; Zhang, Xueyao; Cao, Juan; Zhong, Lei title: Integrating Pattern- and Fact-based Fake News Detection via Model Preference Learning date: 2021-09-23 journal: nan DOI: 10.1145/3459637.3482440 sha: bc251481aa5566b1e86a8dbd0417cdf858205e3b doc_id: 275009 cord_uid: yup7b9ig

To defend against fake news, researchers have developed various methods based on texts. These methods can be grouped as 1) pattern-based methods, which focus on shared patterns among fake news posts rather than the claim itself; and 2) fact-based methods, which retrieve evidence from external sources to verify the claim's veracity without considering patterns. The two groups of methods, which prefer different textual clues, play complementary roles in detecting fake news. However, few works consider their integration. In this paper, we study the problem of integrating pattern- and fact-based models into one framework by modeling their preference differences, i.e., making the pattern- and fact-based models focus on their respective preferred parts of a post and mitigating interference from non-preferred parts as much as possible. To this end, we build a Preference-aware Fake News Detection Framework (Pref-FEND), which learns the respective preferences of pattern- and fact-based models for joint detection. We first design a heterogeneous dynamic graph convolutional network to generate the respective preference maps, and then use these maps to guide the joint learning of pattern- and fact-based models for final prediction. Experiments on two real-world datasets show that Pref-FEND effectively captures model preferences and improves the performance of models based on patterns, facts, or both.

Figure 1: A motivating example. Ideally, given the same news post, the pattern-based and the fact-based model have different preferences on textual clues to predict whether the post is fake. The post is translated into English. Post: "[Good news for protecting small animals!] Shanghai has the first hotline for reporting dog meat selling in China! As long as you call 12331 and report, the government will reward you ¥500!!! You who love dogs, repost quickly!!!" Pattern clues: frequent use of "!"; urging readers to spread using "repost quickly". Relevant article: "State FDA: 12331 does not accept the report of dog meat restaurants…". Prediction: FAKE.

Fake news that spreads on "online" social media continually causes "offline" real-world harms in crucial domains, such as politics [11], finance [25], and public security [36]. The most recent example is the COVID-19 infodemic [30], where thousands of fake news pieces spread through social media [52]. Under such severe circumstances, developing fake news detection systems has become critical for maintaining a trustworthy online news ecosystem. To detect fake news on social media, researchers propose to extract hand-crafted features or deep-learning features [4] from contents, social contexts, propagation networks, etc. In this paper, we focus on the deep learning methods based on textual contents, which can be grouped as: 1) Pattern-based methods (e.g., [13, 21, 51, 62]), which aim at learning shared features (patterns) among fake news posts and expect these features to generalize to unseen news posts. Once trained, they can operate without reliance on external resources. 2) Fact-based methods (e.g., [27, 34, 48, 56]), which focus on the veracity of the claim itself with the help of external fact-checking sources. The key difference between these two groups of methods lies in their different preferences for textual clues.
As Figure 1 shows, given the post about a newly opened hotline that accepts reports of dog meat selling, an ideal pattern-based model tends to predict the veracity by relying more on the highly frequent use of exclamation marks or the words that urge readers to repost ("repost quickly"), while an ideal fact-based one retrieves evidence to check whether the hotline accepts reports of dog meat selling. From the motivating example, we see that the different preferences of the two models lead to their complementary roles. This inspires us to integrate pattern- and fact-based models while considering their preferences, which may bring additional gain for fake news detection. However, how to effectively integrate them remains under-explored by existing works.

In this paper, we make the first attempt to study the problem of integrating the pattern- and fact-based models into one framework. The challenge lies in preference modeling: The models, though having different preferences, generally lack the constraints that would make them focus on preferred parts and ignore non-preferred parts of inputs. As a consequence, a pattern-based model may overfit by memorizing non-preferred words that appear frequently in the training set (e.g., event-specific words), and a fact-based one may be distracted from the part that describes a verifiable event. Moreover, the preference of each model should be dynamically determined within its context, making rule-based modeling inapplicable.

To address these challenges, we propose to learn the models' preferences simultaneously with joint fake news detection and build the Preference-aware Fake News Detection Framework (Pref-FEND). As Figure 2(a) shows, Pref-FEND generates preference maps to help each model focus on its preferred part. Specifically, we exploit the prior knowledge verified by existing works (e.g., [5, 46, 62]) to recognize cue tokens for patterns and facts, and obtain three sets of tokens (i.e., stylistic tokens, entities, and others).
Then, we use a graph-based preference learner to dynamically learn the preferences within the contexts, as presented in Figure 2(b). We construct a heterogeneous graph using these sets and design a Heterogeneous Dynamic Graph Convolutional Network (HetDGCN) for node correlation learning. The final correlation matrix is used by two preference-aware readout functions to generate the Fact and the Pattern Preference Map, respectively. For joint fake news detection, we feed the post and the corresponding preference map to each model and fuse their last-layer features for final prediction. During training, besides the normal classification loss, we design two auxiliary losses as enhancements, which respectively minimize the similarity between the two maps and the classification loss when the input maps are exchanged and ground-truth labels are reversed. Experimental results on two real-world datasets show that our proposed Pref-FEND can effectively learn the models' preferences and improve the performance of both single-preference (pattern- or fact-based) and integrated (pattern-and-fact-based) models. Our contributions are summarized as follows:
• To the best of our knowledge, our work is the first that combines pattern- and fact-based fake news detection. We discuss their complementary roles in fake news detection and propose to consider their preferences for better integration.
• We propose a novel framework, Pref-FEND, which leverages a heterogeneous dynamic GCN to learn model preferences and effectively integrates them for fake news detection.
• Extensive experiments on two newly constructed datasets demonstrate the effectiveness of Pref-FEND in learning models' preferences and improving the detection performance for both single-preference models and integrated models.
The code and datasets are available at https://github.com/ICTMCG/Pref-FEND.

Fake news detection aims at automatically classifying a news piece as real or fake.
Existing methods mostly capture features from contents (texts and/or images) and social contexts generated in the spreading process, such as propagation networks [32, 41, 44, 63], user profiles [43], metadata [28], and crowd feedback [37, 40, 62]. In this paper, we focus on the text-based methods, which can be grouped as:

Pattern-based Fake News Detection. As fake news often contains opinionated and inflammatory language to attract readers [42], fake news pieces of different topics share common patterns that differ from those in real news. In the very first work on evaluating information credibility on social media, Castillo et al. [5] list a series of post-based features, including the length, whether the post contains exclamation or question marks, etc. Following this line, Volkova et al. [49] inject subjectivity, psycholinguistic, and moral foundations features into deep neural networks (CNNs and RNNs). Przybyla [35] focuses on writing styles. Some works attempt to differentiate the patterns across multiple topical categories [31, 45]. A recent trend of pattern-based methods is to refocus on sentiment and emotional patterns [1, 13, 14, 62], as the use of eye-catching terms in deceptive and fake posts may manipulate the readers' emotions [6].

Fact-based Fake News Detection. These methods judge the veracity of a news piece more objectively, with references to pre-constructed external resources such as knowledge graphs [7, 60] and online encyclopedias [46]. A more flexible way is to directly use articles retrieved by search engines as evidence to predict the news veracity [2, 34]. Popat et al. [34] use post-specific attention to model the post-article interactions, while the following works [27, 48, 56, 57] consider textual entailment, such as coherence and conflicts, using the attention mechanism.
Note that the claims provided by the datasets for evaluation of fact-based methods are generally normalized by human fact-checkers to be declarative and concise, so they are not suitable for evaluating pattern-based ones. In this paper, we construct two new datasets (in English and Chinese) by referring to existing datasets and external sources for evaluation of pattern-and-fact-based methods. Different from the above methods, our work does not develop better pattern- or fact-based methods, but integrates the existing ones for comprehensively detecting fake news based on texts.

[Figure 2 (caption not recovered). Panel (a), Framework: Preference Map Generation, in which a Preference Learner takes the post and outputs the Pattern Preference Map and the Fact Preference Map, followed by Preference-aware Joint Fake News Detection. Graph panel labels: nodes E1, E2, T1, T2, T3, S1, S2; node features H^(l); heterogeneous graph convolution from Layer 0 to Layer l+1.]

Due to their expressive power for integrating structural and semantic information, graph neural networks (GNNs) have been widely used in text mining applications such as information extraction [17] and sentiment analysis [50]. Most works use homogeneous GNNs, which treat all nodes as the same type. Hu et al. [19] leverage a heterogeneous GNN to handle multiple types of nodes such as topics and entities for text classification. Similarly, we use a heterogeneous GNN to obtain the preference scores of each token, but our graph is dynamic as its node correlation matrix is adjustable (inspired by [59]). The final adjusted correlations are aggregated to obtain preference scores.

Consider a news post on social media containing n tokens, together with a set of relevant articles retrieved for the post from a fact-checking source D. Following most existing works, we treat fake news detection as a binary classification problem. The ground-truth label y is 1 if the post is fake, otherwise 0. We formulate the following tasks:

Pattern-based Fake News Detection: Given the post, learn a function that maps the post to a prediction ŷ, such that it maximizes the predictive accuracy w.r.t. y.
Fact-based Fake News Detection: Given the post, retrieve relevant articles from D and learn a function that maps the post and the articles to a prediction ŷ, such that it maximizes the predictive accuracy w.r.t. y.

Joint Pattern-and-Fact-based Fake News Detection: Given the post, the relevant articles, a pattern-based model, and a fact-based model, learn a function that maps them to a prediction ŷ, such that it maximizes the predictive accuracy w.r.t. y.

Figure 2(a) overviews the architecture of the proposed Pref-FEND, whose goal is to learn the models' preferences and employ them for better joint fake news detection. Given a post, Pref-FEND first generates a preference map (i.e., token-level preference scores) for each of the pattern- and fact-based models with a heterogeneous dynamic GCN. Then, each preference map is fed into the corresponding model along with the post to help the model focus on its preferred information. Finally, the models' output features are fused to predict whether the post is real or fake. Besides the normal classification loss, we design two auxiliary losses as enhancements, whose goals are to minimize the similarity between the two maps and to minimize the classification loss when the input maps are exchanged and ground-truth labels are reversed, respectively (see Section 4.2).

[Table 1 residue: NEC Emotion Lexicon [29]]

Assuming that the post has n tokens, a preference map is a score distribution of length n where the i-th score represents to what extent the i-th token is preferred by the corresponding fake news detection model. For the pattern- and the fact-based model, we respectively generate the Pattern Preference Map and the Fact Preference Map, where all scores are in [0, 1] and the sum of each map is 1. As illustrated in Section 1, a pattern-based model focuses on common patterns (generally, writing styles) while a fact-based one focuses on verifiable objective claims. To guide the map generation, we exploit prior knowledge from the existing pattern- and fact-based works. Specifically, we recognize tokens that are likely to represent writing styles or key objective elements.
To indicate patterns, we recognize a set of stylistic tokens (e.g., emotional words, pronouns, punctuation marks) [62]; and to indicate facts, we extract the set of entities, because a verifiable claim generally contains at least one entity [46]. These indicating tokens are derived using pre-constructed dictionaries and public tools. In detail, to recognize stylistic tokens, we follow [62], which summarizes diverse emotion-related features and other useful linguistic features to represent textual patterns, and then generate a stylistic token table for each dataset. The types and references are shown in Table 1.

Although the stylistic tokens and entities recognized by general dictionaries or tools provide a good prior on what tokens might be preferred, directly using the recognition result for map generation is insufficient: First, the coverage is limited, leading the map to overlook some other preferred and useful tokens for detection models; Second, a token's preference score should be dynamically determined in its context (i.e., the post) rather than by static rules. To enable the information of different types of nodes to dynamically and sufficiently interact with each other, we design a graph-based preference learner, the Heterogeneous Dynamic Graph Convolutional Network (HetDGCN). As shown in Figure 2(b), we first construct a heterogeneous graph that contains multi-type nodes (tokens) with a learnable correlation matrix (i.e., adjacency matrix). Then, we leverage a heterogeneous graph convolution to enable message passing among different types of nodes. The final preference scores are obtained using the learned correlation matrix. The stages are as follows:

Graph Initialization. Recall that we have divided the tokens in the post into three parts: stylistic tokens, entities, and others.
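The three-way token split recalled above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function `partition_tokens`, the toy lexicon, the toy entity set, and the rule that an entity tag takes priority when a token matches both sources. The paper derives these sets from pre-constructed dictionaries and public tools, which this sketch does not reproduce.

```python
from typing import Dict, List, Set


def partition_tokens(
    tokens: List[str],
    stylistic_lexicon: Set[str],
    entity_tokens: Set[str],
) -> Dict[str, List[int]]:
    """Assign every token index to exactly one group: entity, stylistic, or other."""
    groups: Dict[str, List[int]] = {"stylistic": [], "entity": [], "other": []}
    for idx, tok in enumerate(tokens):
        if tok in entity_tokens:
            # Assumption: the entity tag wins if a token matches both sources.
            groups["entity"].append(idx)
        elif tok.lower() in stylistic_lexicon:
            groups["stylistic"].append(idx)
        else:
            groups["other"].append(idx)
    return groups


# Toy post inspired by Figure 1; the lexicon and entity set are made up.
tokens = ["Shanghai", "has", "a", "hotline", "!", "repost", "quickly", "!"]
groups = partition_tokens(tokens, {"!", "repost", "quickly"}, {"Shanghai"})
print(groups)
```

Each group of indices would then become one node type of the heterogeneous graph.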
To preserve their different roles, we construct a heterogeneous graph where each node corresponds to a stylistic token, an entity, or another token, and the weight of each edge represents the correlation between the connected tokens. The node representation is initialized with the pre-trained language model (here, BERT [8]), denoted as $H^{(0)} \in \mathbb{R}^{n \times d}$, where $d$ is the dimensionality of each node vector. Note that this matrix stacks the representations of the stylistic tokens, the entities, and the other tokens. The edge weights (correlations) are initialized with the cosine similarity of each token pair [19], scaled to [0, 1]: for the $i$-th and the $j$-th node with initial features $h_i^{(0)}$ and $h_j^{(0)}$, the initial weight of the connecting edge is $A^{(0)}(i, j) \in [0, 1]$. Following [23], we define a normalized correlation matrix for each layer.

Graph Convolution & Correlation Update. Different types of nodes describe different aspects of the given text, which we expect to distinguish for preference learning. Therefore, instead of using standard graph convolution for node interaction [23], we use a heterogeneous graph convolution [19], which separately handles the neighbors of different types and then aggregates the interacted features. Further, we use a dynamic correlation matrix that is updated at each layer according to the present node similarity, and we expect the final correlations (edge weights) to reflect the bias of the nodes in the context. In detail, the feature matrix of the $(l+1)$-th layer is calculated from type-specific terms: for each node type (stylistic, entity, or other), the convolution uses the submatrix of the $l$-th layer's correlation matrix whose rows contain all the nodes and whose columns record their correlation with nodes of that type, together with a learnable weight matrix $W^{(l)}$ of that type in this layer. Then, the correlation matrix is updated using a learnable weight matrix $W^{(l+1)}$ for updating correlations, the sigmoid function, and a trade-off factor in [0, 1].

Preference-aware Readout.
After the last HetDGCN layer, we obtain the final correlation matrix, from which we estimate how strongly each token is preferred by the pattern- and fact-based models. For the $i$-th node, the pattern preference score is calculated from its correlations with all nodes except those representing entity tokens. Similarly, the fact preference score excludes the correlations with the stylistic nodes. Finally, the preference maps are obtained by normalizing the correlation sums of each token.

As the fact-based and pattern-based models are diverse, we here use the typical pattern- and fact-based detection process to illustrate how to integrate the generated preference maps into them. Most specific models can be easily reformulated in a similar way to accommodate our framework. As shown in Figure 2(c), a typical pattern-based model uses a textual feature extractor to obtain a vector for final prediction. Here, we use the Pattern Preference Map as attention weights to make the model attend to its preferred tokens in the post. For example, if the extractor is a BERT [8] or an LSTM [18] whose output is the sequence of token vectors $[p_1; \ldots; p_n]$, the aggregated vector is calculated as the sum of these vectors weighted by the Pattern Preference Map. Note that our preference map is at the token level. For an extractor that does not output per-token vectors, such as TextCNN [22], the map can be applied before the extractor, right after we obtain token embeddings from pre-trained models.

In a typical fact-based model, the post is first used to query a fact-checking source and collect the related articles (or, evidence). Assuming $k$ articles are returned, we represent them as $[d_1; \ldots; d_k]$. Then the post and evidence vectors are fed into an inference module, which is often designed to capture complicated interactions such as coherence and conflicts between the post and the evidence (e.g., [27, 56]). The output vector $f$ of the inference module, which implicitly represents the relationship of the post-evidence pairs, is used for final prediction.
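The correlation initialization, the preference-aware readout, and the map-weighted aggregation described above can be illustrated with a minimal pure-Python sketch. It rests on several assumptions that are not confirmed by the text: cosine similarity is rescaled to [0, 1] with (cos + 1) / 2, the correlation sums are normalized by dividing by their total, and the readout is applied to the initial correlation matrix instead of the one produced by the final HetDGCN layer (the graph convolution and the correlation update are omitted). All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def init_correlation(features: List[Sequence[float]]) -> List[List[float]]:
    # Assumed rescaling of cosine similarity from [-1, 1] to [0, 1].
    return [[(cosine(u, v) + 1.0) / 2.0 for v in features] for u in features]


def _normalise(scores: List[float]) -> List[float]:
    # Assumed normalisation: divide by the total so that the map sums to 1.
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def preference_maps(
    corr: List[List[float]], node_types: List[str]
) -> Tuple[List[float], List[float]]:
    n = len(corr)
    # Pattern score of node i: correlations with every node except entity nodes.
    pattern = [sum(corr[i][j] for j in range(n) if node_types[j] != "entity")
               for i in range(n)]
    # Fact score of node i: correlations with every node except stylistic nodes.
    fact = [sum(corr[i][j] for j in range(n) if node_types[j] != "stylistic")
            for i in range(n)]
    return _normalise(pattern), _normalise(fact)


def weighted_pool(token_vectors: List[Sequence[float]],
                  pref_map: List[float]) -> List[float]:
    # Use the preference map as attention weights over token vectors.
    dim = len(token_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(pref_map, token_vectors))
            for d in range(dim)]


# Toy example with three nodes, one of each type (values are made up).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
node_types = ["stylistic", "entity", "other"]
corr = init_correlation(features)
pattern_map, fact_map = preference_maps(corr, node_types)
pooled = weighted_pool(features, pattern_map)
```

In this toy case the stylistic node receives a higher score in the pattern map than in the fact map, and the entity node the reverse, which is the behavior the readout is designed to encourage.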
To avoid the interference of non-check-worthy parts (e.g., the publisher's remark), the Fact Preference Map guides the inference module by using the attention mechanism to aggregate the token vectors of the post before post-evidence inference. The final vector is calculated as the sum of the token representations weighted by the Fact Preference Map, where $q_i$ is the representation of the $i$-th token in the post for fact-based methods.

For final prediction, we concatenate the output vectors of the pattern- and fact-based models and feed the result into a multi-layer perceptron (MLP) to obtain the prediction: $\hat{y} = \mathrm{MLP}([p; f])$.

During training, we use three losses to supervise 1) the prediction of binary (fake and real) classification; and 2) the differentiation of the two preference maps. For the first goal, we minimize the cross-entropy loss between the prediction $\hat{y}$ and the label $y$, $\mathcal{L}(y, \hat{y}) = \mathrm{CELoss}(y, \hat{y})$, where $\mathrm{CELoss}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log (1 - \hat{y})$. For the second goal, we consider the reciprocal roles of the two models and let them supervise each other. In detail, we minimize the cosine similarity between the Pattern and the Fact Preference Map, and the cross-entropy loss under the condition that the input maps for the two models are exchanged and the ground-truth label is reversed, $\mathcal{L}(\bar{y}, \hat{y}') = \mathrm{CELoss}(\bar{y}, \hat{y}')$, where $\bar{y} = |1 - y|$ and the predictive value $\hat{y}' = \mathrm{MLP}([p'; f'])$. $p'$ and $f'$ are respectively the output of the pattern-based and the fact-based model with each other's preference map as input. When receiving non-preferred information, the models are expected to be misled and generate non-distinctive features. The total loss of a sample is the weighted sum of the three losses, where the three weights are trade-off factors in [0, 1]. We average the loss of samples in each mini-batch before backpropagation.

We conduct experiments on two datasets to answer the following evaluation questions:
EQ1: Can Pref-FEND improve the performance of fake news detection models with a single preference?
EQ2: Can Pref-FEND improve the performance of fake news detection that integrates pattern- and fact-based models?
EQ3: How effective are the designed components of Pref-FEND?
EQ4: How different are the Fact and the Pattern Preference Map?

As no existing dataset of fake news detection provides social media posts and relevant articles (as the fact-checking source) simultaneously, we construct two datasets of different languages (Chinese and English) based on the existing data and external sources. The statistics are shown in Table 2. The details are as follows:

Weibo Dataset. Post. We utilize the Weibo-20 dataset [62], which contains 6,362 news posts with a ratio of fake to real news posts of roughly 1:1. We keep its original temporal split with a ratio of 6:2:2 for the train, validation, and test sets. Relevant Articles. We collect fact-checking articles and other relevant articles to construct our fact-checking source. In detail, we use the fact-checking articles crawled in [39] from multiple websites such as Jiaozhen, Zhuoyaoji, and Baidu Piyao. Then, we crawl other relevant articles from Baidu News, with the keywords in the Weibo posts as queries. The keywords are extracted using jieba. For each query, we obtain at most 30 items and attempt to download full articles using Newspaper3k. Finally, the de-duplication of all accessible articles leads to an article base containing 17,849 articles.

Twitter Dataset. Post. We first combine two datasets for detecting previously fact-checked claims released by Shaar et al. [38] and Vo and Lee [47], respectively, as they provide not only tweets but also relevant articles from Snopes. As our task is formulated as binary classification, we merge true, mostly-true, and correct-attribution into real, and false, mostly-false, misattributed, and legend into fake. The other categories are dropped. As these two datasets are largely imbalanced (1,047 real and 8,992 fake), we use the PHEME [24] dataset as a supplement, whose annotation files provide some referred news links. For PHEME, we merge real and non-rumor into real and obtain 5,090 real and 638 fake news posts.
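The label merging described above amounts to a small lookup. The sketch below is illustrative only: the function name `to_binary_label` is hypothetical, "mixture" is a made-up example of a dropped category, and the 1 = fake, 0 = real encoding follows the problem formulation given earlier.

```python
from typing import List, Optional

# Category names as listed in the text; anything else is dropped.
REAL_LABELS = {"true", "mostly-true", "correct-attribution"}
FAKE_LABELS = {"false", "mostly-false", "misattributed", "legend"}


def to_binary_label(raw_label: str) -> Optional[int]:
    """Return 1 for fake, 0 for real, and None for categories that are dropped."""
    label = raw_label.strip().lower()
    if label in FAKE_LABELS:
        return 1
    if label in REAL_LABELS:
        return 0
    return None


# "mixture" is a hypothetical category outside both sets.
raw: List[str] = ["mostly-true", "legend", "mixture", "False"]
binary = [to_binary_label(r) for r in raw]
kept = [b for b in binary if b is not None]
print(binary, kept)
```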
After pre-processing using TexSmart and dropping failure cases, we obtain 14,709 posts. Relevant Articles. Because the Twitter dataset has fewer topics than the Weibo dataset, we start from the articles in these datasets to construct the relevant article base. First, we incorporate the fact-checking articles from the datasets released in [38] and [47], and the referred news articles in the PHEME dataset (if accessible). Then, we use their titles (tokenized using NLTK [3]) as queries and search on Google News using GNews. After post-processing, we obtain an article base containing 12,419 articles. Note that we do not use the existing DeClarE [34] and MultiFC [2] datasets, which provide both claims (posts) and relevant articles (or webpages), because their claims are normalized and thus carry only weak patterns of social media posts. We split the train, validation, and test sets temporally with a ratio of 6:2:2.

We use six representative text-based models as base models:

Pattern-based Models
• Bi-LSTM [15] is widely used in many existing works on our task for text encoding [16, 21, 37]. We implement a one-layer Bi-LSTM with a maximum sequence length of 100 and a hidden size of 128. We average all the hidden states as representations of posts, which are further fed into an MLP for prediction.
• EANN-Text [51] is a model that tries to keep the fake news detection model from memorizing event-specific features. It uses TextCNN for text representation and adds an auxiliary task of event classification for adversarial learning using a gradient reversal layer [12]. We re-implement the model according to the public code. The complete EANN is a multi-modal model, but we here use its text-only version. For TextCNN, the number of filters is 20 and the window sizes are {1, 2, 3, 4}. The labels for the auxiliary event classification task are derived by clustering the training set with K-means into 300 clusters.
• BERT-Emo [62] is a model that uses BERT to encode the text and captures the emotion that news publishers express. As we focus on the contents rather than social contexts, we adopt a simplified version where emotions in comments are not considered. We use the author-released code. The maximum sequence length is 150 and the size of embedding vectors is 768.

Fact-based Models
• DeClarE [34] is a model which uses claim-specific attention to focus on salient words in relevant articles. We remove the source embedding, which is unavailable in the datasets. We re-implement the model according to the third-party code. The text encoder is a one-layer Bi-LSTM with a hidden size of 128.
• EVIN [56] is an evidence inference network, which captures the semantic conflicts between the post and relevant articles using the attention mechanism. We re-implement the model according to the paper as no public code is available. The hidden size of the one-layer Bi-LSTM is 60. The maximum sequence length is 200.
• MAC [48] is a hierarchical multi-head attentive network that combines word- and article-level attention. We re-implement the model according to the author-released code. We use one-layer Bi-LSTM networks with a hidden size of 300 to build MAC. The two multi-head attention modules have 5 and 2 heads, respectively.
Note that when base models are used as a module in Pref-FEND, we extract the last-layer feature before the MLP layer.

Evaluation Metrics. We report accuracy (Acc.) and macro F1 score (macF1). For each class (fake and real), we also report precision, recall, and F1 score.

Implementation Details. In Pref-FEND, the number of layers in HetDGCN is 2. We perform grid search in a small interval and finally set the trade-off factor for the correlation update to 0.5 and the three loss weights to 2, 1, and 1. For all base models and our Pref-FEND, the initial token embeddings are obtained from pre-trained models in HuggingFace's Transformers [55] (specifically, bert-base-chinese and bert-base-uncased).
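As a reference for the evaluation metrics named above, the following minimal sketch computes accuracy, per-class precision, recall, and F1, and macro F1 for binary labels. It assumes the 1 = fake, 0 = real encoding from the problem formulation and defines a score as 0 when its denominator is 0; the function names and toy labels are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def prf_for_class(y_true: Sequence[int], y_pred: Sequence[int],
                  cls: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 when `cls` is treated as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    # Macro F1 is the unweighted mean of the per-class F1 scores.
    f1_scores: List[float] = [prf_for_class(y_true, y_pred, c)[2] for c in classes]
    return sum(f1_scores) / len(f1_scores)


# Toy labels: 1 = fake, 0 = real.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
acc = accuracy(y_true, y_pred)
mac = macro_f1(y_true, y_pred)
print(acc, mac)
```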
For all fact-based models, the top 5 retrieved articles are considered. Other hyperparameters have been described in Section 5.2. The methods are implemented with PyTorch [33] and PyTorch Geometric [10].

5.4.1 Comparing with Pattern- or Fact-based Methods. To fairly compare with existing single-preference (i.e., pattern- or fact-based) models, we reduce our framework to a single-model version (referred to below as the single-model Pref-FEND). In detail, when comparing with a pattern-based model, we remove the fact-based model but preserve the Fact Preference Map for training; and vice versa. From the results in Table 3, we have the following observations:

First, the single-model Pref-FEND improves the performance of all the pattern-based and fact-based models on the two datasets. This supports our observation that the original base models might be distracted by non-preferred information, which limits their generalizability to unseen samples. With the help of the single-model Pref-FEND, the base models are more focused during training.

Second, BERT-Emo outperforms Bi-LSTM and EANN-Text. This is as expected because BERT can generate expressive representations and the additional emotion-related features have proved helpful for this task. With the guidance of the single-model Pref-FEND, it gains a boost of 3.6 percentage points in macro F1 on Weibo and a boost of 1.4 percentage points on Twitter. This reveals the importance of preference modeling for alleviating overfitting to specific features.

Third, MAC outperforms DeClarE and EVIN, though they are all based on the attention mechanism. This might be because it effectively uses multi-head attention to capture multi-aspect information. However, some heads might be distracted from the event description in the post, which can be alleviated by our framework.
We implement the following methods which fuse the information from pattern- and fact-based models:
• Last-layer Fusion, which uses the post as input and concatenates the last-layer features of the two models for final prediction;
• Logits Average, which averages the models' logits (which are in [0, 1]) for final prediction.
We implement these fusion methods and Pref-FEND with two groups of base models, Bi-LSTM+DeClarE and BERT-Emo+MAC. The results are shown in Table 4. Our observations are as follows:

First, Pref-FEND outperforms the two pattern-and-fact-based methods, which validates its effectiveness for integrating pattern- and fact-based models.

Second, compared with the results in Table 3, the full Pref-FEND brings further improvements over the already strong single-model Pref-FEND with the same base models. For example, on the Weibo dataset, Pref-FEND with Bi-LSTM and DeClarE gains a further 0.3 percentage points in macro F1 over the single-model Pref-FEND with Bi-LSTM and 1.1 percentage points over the single-model Pref-FEND with DeClarE. This indicates that our framework is applicable to both the single-preference models and the integrated models based on them.

Third, the last-layer fusion does not necessarily perform better than the simple logits average. This suggests that last-layer fusion may be insufficient to align the feature spaces of the pattern- and the fact-based model, which leads to negative fusion effects.

We study the effectiveness of our designed components or strategies based on the Pref-FEND models in Table 4.

Model Preference Learning. Instead of recognizing the entities and stylistic tokens according to the prior knowledge, we randomly initialize preference maps (named Pref-FEND w/ rand init maps). That forces the generation of preference maps to rely only on the supervision of ground-truth labels.
The results show that although Pref-FEND w/ rand init maps is superior or comparable to the base models on both datasets in terms of accuracy and macro F1, it falls behind the complete Pref-FEND. This indicates the effectiveness of our model preference learning, which exploits prior knowledge in a dynamic graph representation learning process.

We remove one of the two losses which aim at differentiating the two preference maps, or both. The variants are denoted with the suffixes w/o the cosine-similarity loss, w/o the reversed classification loss $\mathcal{L}(\bar{y}, \hat{y}')$, and w/ only the classification loss $\mathcal{L}(y, \hat{y})$, respectively. We see that removing these losses brings performance drops w.r.t. accuracy. The largest drop occurs when removing both losses. This indicates that the auxiliary losses are effective and necessary to integrate the two models with different preferences.

5.6.1 Analysis on Most Frequent Token Set. To explore how different the Fact and the Pattern Preference Map are, we analyze the frequently preferred tokens in the maps. For each post in the validation and test sets of Weibo, we first divide the tokens into a pattern group and a fact group, according to whether a token is scored higher in the Pattern or the Fact Preference Map. Then we extract the top 10 tokens in each group across all the posts and construct two token sets for frequency analysis. The frequent tokens in each set are shown with fine-grained categories in Table 6. We see that:

First, in the pattern-preferred token set, punctuation marks and negation words are important as they express the publishers' tones and emotions. The other frequent tokens are closely related to self-expression, like "think", "may", and "kind of".

Second, in the fact-preferred set, evidence-related tokens that indicate materials and actions (e.g., "video", "webpage", "picture", "claim", and "uncover") and entity-related tokens (places, positions, etc.) receive more focus.
Some of the other words do not directly describe an event, but often surround the elements of a news event (e.g., the 5W in journalism [53]), such as "already" and "when". Third, the focus on pronouns differs between the Pattern and the Fact Preference Map. Plural personal pronouns ("we", "they", and "you all") are frequently attended to by pattern-based models, while singular ones ("he", "it", and "you") are preferred by fact-based models. The reason might be that a post with significant fake news patterns often discusses groups or urges the audience to take action, while a post with an event description is generally related to specific persons or things. Our analysis suggests that the learned preference maps are highly correlated with the ideal model preferences and are thus effective for guiding the models' focuses.

5.6.2 Case study. In Table 7, we show three fake news posts that are correctly judged by Pref-FEND with Bi-LSTM and DeClarE. Case 1 conveys strong signals of emotional patterns, which are preferred by pattern-based models, such as "helplessly" and "aggressive". Case 2 contains a large number of places and event descriptions, which makes it well suited to exploiting the evidential texts in relevant articles. Due to the different dominant signals, the pattern-based Bi-LSTM judges correctly in Case 1 but fails in Case 2, and the judgments of the fact-based DeClarE are the opposite. In Case 3, however, both of them wrongly judge the post as real. In principle, a pattern-based model could attend to emotion trigger tokens like "cute" and "really", while a fact-based model could use the place ("Shanghai") and the dog breed ("Golden Retriever") to find evidence, so the two models would not both be expected to fail. We speculate that the failure is caused by negative interference from the non-preferred information. With the help of model preference learning, our Pref-FEND nevertheless succeeds in judging all three posts as fake.
These cases illustrate the necessity of model preference learning and the effectiveness of Pref-FEND.

We propose the framework Pref-FEND to integrate pattern-based and fact-based fake news detection models in a preference-aware fashion. The learned preference maps guide the models to focus more on their preferred parts with less interference from the non-preferred parts. Experiments on the two newly constructed datasets show that Pref-FEND outperforms the existing detection models. Further analysis shows that preference learning helps models with different preferences become more focused and thus improves both the single-preference and the integrated models. Future work includes enhancing the interaction between preference map generation and the specific models, and extending the framework to multi-class and multi-preference scenarios. The acquisition and exploitation of prior knowledge in this task are also worth studying further to improve overall performance.
Sentiment aware fake news detection on online social networks
MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims
NLTK: The Natural Language Toolkit
Automatic rumor detection on microblogs: A survey
Information Credibility on Twitter
Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media
DETERRENT: Knowledge Guided Graph Attention Network for Detecting Healthcare Misinformation
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
HowNet: A hybrid language and knowledge resource
Fast Graph Representation Learning with PyTorch Geometric
Pizzagate: From rumor, to hashtag, to gunfire in DC
Unsupervised Domain Adaptation by Backpropagation
FakeFlow: Fake News Detection by Modeling the Flow of Affective Information
Leveraging Emotional Signals for Credibility Detection
Framewise phoneme classification with bidirectional LSTM and other neural network architectures
Rumor detection with hierarchical social attention network
Attention Guided Graph Convolutional Networks for Relation Extraction
Long short-term memory
Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification
Chinese Lexical Analysis with Deep Bi-GRU-CRF Network
Learning Hierarchical Discourse-level Structure for Fake News Detection
Convolutional Neural Networks for Sentence Classification
Semi-supervised classification with graph convolutional networks
All-in-one: Multi-task Learning for Rumour Verification
Fake news: Evidence from financial markets. Available at SSRN
TexSmart: A System for Enhanced Natural Language Understanding
Sentence-Level Evidence Embedding for Claim Verification with Hierarchical Attention Networks
SADHAN: Hierarchical Attention Networks to Learn Latent Aspect Embeddings for Fake News Detection
Crowdsourcing a word-emotion association lexicon
The Covid-19 'infodemic': a new front for information professionals
MDFEND: Multi-domain Fake News Detection
FANG: Leveraging Social Context for Fake News Detection Using Graph Representation
PyTorch: An Imperative Style, High-Performance Deep Learning Library
DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning
Capturing the Style of Fake News
Bangladesh mobs lynch eight people over child abduction rumours
Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking
That is a Known Lie: Detecting Previously Fact-Checked Claims
Article Reranking by Memory-Enhanced Key Sentence Matching for Detecting Previously Fact-Checked Claims
dEFEND: Explainable Fake News Detection
Hierarchical propagation networks for fake news detection: Investigation and exploitation
The Role of User Profiles for Fake News Detection
Propagation2Vec: Embedding partial propagation networks for explainable fake news early detection
Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multi-modal Data
The Fact Extraction and VERification (FEVER) Shared Task
Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News
Hierarchical Multi-head Attentive Network for Evidence-aware Fake News Detection
Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter
Relational Graph Attention Network for Aspect-based Sentiment Analysis
EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection
Misinformation related to the COVID-19 pandemic
Wikipedia. 2021. Five Ws
List of emoticons
Transformers: State-of-the-Art Natural Language Processing
Evidence Inference Networks for Interpretable Claim Verification
Evidence-Aware Hierarchical Interactive Attention Networks for Explainable Claim Verification
Journal of the China society for scientific and technical information
Attention-driven dynamic graph convolutional network for multi-label image recognition
Multi-Modal Knowledge-Aware Event Memory Network for Social Media Rumor Detection
TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis
Mining Dual Emotion for Fake News Detection
Network-Based Fake News Detection: A Pattern-Driven Approach

The authors would like to thank the anonymous reviewers for their valuable comments. This work was supported by the National Key Research and Development Program of China (2017YFC0820604), and the National Natural Science Foundation of China (U1703261).