title: MetaDetector: Meta Event Knowledge Transfer for Fake News Detection
authors: Ding, Yasan; Guo, Bin; Liu, Yan; Liang, Yunji; Shen, Haocheng; Yu, Zhiwen
date: 2021-06-21

The blooming of fake news on social networks has devastating impacts on society, economy, and public security. Although numerous studies have been conducted for the automatic detection of fake news, the majority tend to utilize deep neural networks to learn event-specific features for superior detection performance on specific datasets. However, the trained models heavily rely on the training datasets and are infeasible to apply to upcoming events due to the discrepancy between event distributions. Inspired by domain adaptation theories, we propose an end-to-end adversarial adaptation network, dubbed MetaDetector, to transfer meta knowledge (event-shared features) between different events. Specifically, MetaDetector pushes the feature extractor and event discriminator to eliminate event-specific features and preserve the required event-shared features by adversarial training. Furthermore, a pseudo-event discriminator is utilized to evaluate the importance of historical event posts to obtain partial shared features that are discriminative for detecting fake news. Under the coordinated optimization among the four submodules, MetaDetector accurately transfers the meta knowledge of historical events to the upcoming event for fact checking. We conduct extensive experiments on two large-scale datasets collected from Weibo and Twitter. The experimental results demonstrate that MetaDetector outperforms the state-of-the-art methods, especially when the distribution shift between events is significant. Furthermore, we find that MetaDetector is able to learn the event-shared features and alleviate the negative transfer caused by a large distribution shift between events.

The enthusiasm about social media not only boosts the exchange of information, but also provides ideal platforms for the wide spread of false information, commonly known as fake news. Especially in major public events (e.g., political elections [3] and plague prevention [9]), the prevalence of fake news distracts decision makers' attention, causes cognitive misperception among audiences, and spreads panic to the public [32]. The mainstream social media platforms have witnessed the outbreak of the 'infodemic' [7] about the COVID-19 pandemic. For example, a Facebook post claiming that microwaves could sterilize used masks has been shared more than 7,000 times and misled a large number of people 1 . What's worse, nearly 50 Twitter accounts and 400 Facebook communities promoted the conspiracy theory that 5G wireless technology could spread the coronavirus, leading to the destruction of several base stations in Britain 2 . As fake news floods social networks, a large number of institutions and researchers are taking action to curb the dissemination of falsehoods. For instance, Facebook and Twitter encourage users to flag inaccurate or incorrect posts to alert other potential audiences [19]. Several fact-checking websites (e.g., Politifact.com 3 and Snopes.com 4 ) regularly release evidence (usually provided by experienced officers and journalists) about recently checked fake news. Although the aforementioned evidence-based detection methods are highly interpretable, they are time-consuming and labour-intensive.
Fortunately, the group behaviors, social interactions, and community dynamics triggered by news dissemination imply the credibility of contents [45], and may embrace potential characteristics of real and fake news. Therefore, automatic fake news detection based on mining content or social context features has become a research hotspot [36]. Specifically, content-based methods usually extract lexical, syntactic, or topic features of event-related posts to train binary classifiers for fake news detection [4, 5, 27, 33], while social context-based methods decide whether a piece of news is real based on its propagation patterns [2, 16], social interactions [17, 48], and the corresponding disseminators' credibility [37, 38]. Although automatic fake news detection is not a new problem, existing methods are still powerless to solve the practical detection problem due to the lack of model adaptivity to new events [13]. Most detection approaches use deep neural networks to embed the news contents of different events into a high-dimensional feature space for learning latent representations of news. However, the single fake news classification loss induces them to capture event-specific features, which are difficult to share with other events. Unfortunately, the data distributions of different news events often deviate from each other, and features consequently change dramatically between new and historical events. As shown in Figure 1, we randomly sample 800 tweets each about COVID-19 5 and the Germanwings Crash 6 on Twitter to empirically analyze their characteristics in terms of keywords, sentiments, and data distributions. The wordcloud illustrates that the tweets in each event have abundant event-specific keywords; for example, coronavirus, vaccine, Germanwings, and Airbus each appear in only one of the events. The sentiment sub-figures reveal that most tweets' sentiment values in the two events are in the neutral (equal to '0') and slightly positive (less than '3') range. In addition, we use t-SNE [26] to visualize the word embeddings of all the sampled tweets pre-trained by BERT [8] (Figure 1), which is a strong indicator of the distribution shift between different events.

[Figure 1. From left to right: the wordcloud, sentiment score, and data distribution of sampled tweets from the two events. In the wordcloud, the top figure shows hot words related to COVID-19, and the bottom one belongs to Germanwings Crash. In the sentiment score, zero, positive, and negative numbers indicate neutral, positive, and negative sentiment tendencies; a score with a larger absolute value indicates a stronger intensity, as shown on the horizontal axis. In the data distribution, the 768-dimensional word vectors pre-trained by BERT represent all the sampled tweets, visualized in a two-dimensional plane (red nodes are tweets from COVID-19, and blue nodes are tweets from Germanwings Crash), showing two obvious clusters.]

However, existing methods do not take this event distribution property into consideration and only focus on event-specific features for fake news detection, which makes it infeasible to discriminate news that has not been seen in the training datasets [49]. The development of transfer learning [31] provides a feasible solution to this problem.
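The distribution analysis behind Figure 1 can be reproduced in a few lines. Below is a minimal sketch, assuming BERT mean-pooling as the sentence embedding; the model name and the input variables (covid_tweets, crash_tweets) are our own assumptions, not the authors' exact pipeline.

```python
# Embed each sampled tweet with a pre-trained BERT model, then project the
# 768-d vectors to 2-D with t-SNE to inspect the cross-event distribution shift.
import numpy as np
import torch
from sklearn.manifold import TSNE
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    # Mean-pool the last hidden states into one 768-d vector per tweet.
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return bert(**batch).last_hidden_state.mean(dim=1).numpy()

covid_vecs = embed(covid_tweets)   # covid_tweets: list[str], assumed given
crash_vecs = embed(crash_tweets)   # crash_tweets: list[str], assumed given

# Two visually separate clusters in the projection indicate a distribution
# shift between the two events, as in Figure 1.
points = TSNE(n_components=2).fit_transform(np.vstack([covid_vecs, crash_vecs]))
```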
Although we could hardly collect abundant epidemiological data in a limited time for detecting fake news about COVID-19, we have accumulated a large amount of data on other verified events. Transferring the knowledge learned from historical data (hereinafter referred to as the "source event") to the identification of the latest news on COVID-19 (hereinafter referred to as the "target event") helps construct a detection model that generalizes to specific new events. The basis of event knowledge transfer lies in the fact that different news events consist of both event-specific and event-shared information. Obviously, social posts are meaningful contents created and manipulated by users based on their own perceptions of news events, which inevitably contain social clues about target incidents, descriptions of the causes of events, profiles of some mentioned persons, and other event-specific information. In addition, there is also common or similar information that can appear across multiple events, e.g., peculiar emotional words, psychological signals, user stances, etc. Event knowledge transfer is to transfer such event-shared features to guide fake news detection in upcoming events. For example, Wang et al. [41] proposed the EANN model to reduce event-specific features and utilized event-shared features to detect fake news in new events. It employed an event discriminator to measure the dissimilarities between events, reduced the shift in feature distributions of different events by adversarial learning, and finally disclosed transferable features for detecting fake news in the target event. However, EANN ignores the importance of each post to the target event. Due to users' publishing behaviors, the quality of social posts is uneven, so each post makes a different contribution to new events; this also undermines fake news detection methods that apply attention mechanisms to find useful temporal linguistic features [6, 14]. Consequently, assigning the same importance to all source event posts cannot effectively represent the shared feature space of events. If the discrepancy between events' feature distributions is noticeable, event-level knowledge transfer may lead to negative transfer, demonstrating that crucial transferable features need to be disentangled. In this paper, we refer to crucial event-shared (or transferable) features as meta knowledge, and focus on detecting fake news in new events by transferring the meta knowledge learned from historical events. Acquiring meta knowledge essentially means decoupling event-shared and event-specific features from the high-level feature space of different events, and learning latent representations of the target event under shared features for fact checking. The primary challenges are shown below:

• How to distinguish the event-specific features of news posts and reduce them as much as possible? A straightforward idea is to measure the dissimilarity of content features of different events: the more dissimilar the compared feature distributions, the more likely the extracted features belong to a specific event. However, it is difficult to describe the dissimilarity among events with traditional metrics over the high-dimensional content features learned by deep neural networks. Moreover, during model training, the feature embeddings of the source and target events are constantly changing; whether the detection model can capture these changes is vital to the observation of event-specific features.
• How to determine which labeled historical posts are more useful for identifying fake news in unlabeled new events? Since each post contributes to a different degree to new-event fact checking, it is appropriate to implement post-level knowledge transfer through a weighting mechanism. Unfortunately, the detection model hardly knows which historical posts are relevant and important enough to upcoming events that they should be introduced into new fake news detection (i.e., given higher weights), because all the target posts are unlabeled. If the importance of transferred knowledge cannot be reasonably estimated, it may also cause negative transfer [22, 46].

To address the above challenges, we propose an end-to-end debunking framework based on adversarial domain adaptation methods, namely MetaDetector, which automatically transfers learned meta knowledge from verified events to guide fake news detection in target events. Our model is mainly composed of the feature extractor, the fake news detector, the event discriminator, and the pseudo-event discriminator (as presented in Figure 2). In response to the first challenge, MetaDetector trains the feature extractor and event discriminator in a min-max game to gradually reduce event-specific features and retain event-shared features. When the features learned by the feature extractor are enough to prevent the event discriminator from distinguishing their source, transferable features can be extracted. For the second challenge, we incorporate a pseudo-event discrimination-based weighting mechanism to calculate the importance of source event instances. It is based on the observation that the event discriminator indicates the probability that input features come from the source event: the higher the probability score, the more likely the feature is derived from the source-event feature space, and the smaller the weight it should be given. Obviously, the transferability of each post can be measured by this indirect weighting mechanism. In order to calculate weights while extracting event-shared features, MetaDetector introduces the pseudo-event discriminator, similar to the event discriminator, to apply weights to source event posts. Then MetaDetector utilizes the weighted source posts and unlabeled target posts to jointly train the feature extractor and event discriminator for meta knowledge. Finally, the fake news detector identifies fake news in the target event based on the learned meta knowledge. In summary, the main contributions of our work are as follows:

• We explore a practical fake news detection problem of detecting fake news in upcoming events from the perspective of knowledge transfer.
• We propose a pseudo-event discrimination-based weighting mechanism to fulfill post-level knowledge transfer, which characterizes the relationship between historical events and upcoming events in a fine-grained way.
• We propose a general framework for fake news detection in new events, namely MetaDetector, which reduces the distribution discrepancy between events, learns the meta knowledge underlying different events, and improves the generalization performance on unseen events via adversarial adaptation networks.
• Our experiments on two public authoritative datasets demonstrate the effectiveness of MetaDetector in detecting fake news of new events. In addition, we conduct another test on a COVID-19 related fake news dataset 7 , and analyze the experimental results in detail.
The rest of this paper is organized as follows: we briefly review representative works in Section 2, and introduce the details of our proposed MetaDetector in Section 3. We conduct intensive experiments, and the experimental results are shown in Section 4. Finally, we conclude our work in Section 5.

In this section, we present a brief review of fake news detection; meanwhile, we present the preliminary theory of domain adaptation methods. The utilization of deep learning methods in fake news detection, which can automatically extract latent high-level feature representations of news events, ameliorates detection efficiency and accuracy. In fact, there are several related tasks, e.g., rumor / misinformation / disinformation detection. This paper adopts a broad definition of fake news, i.e., "fake news is false news, where news broadly includes claims, statements, posts, among other types of information" [49]. Subsequently, we review prior works in the following categories: content-based, social context-based, and hybrid detection. Content-based detection methods mainly construct the whole event's posts into time-series segments, and then feed them into deep neural networks (DNNs) to extract latent semantic features of news events for detection. For example, Chen et al. [6] and Ma et al. [25] divide the tweets of each event into fixed-length and variable-length time series respectively, and feed them into recurrent neural networks (RNNs) and their variants to capture temporal-linguistic features for fake news detection. In addition, Yu et al. [44] state that RNN-based methods are potentially biased toward the latest input posts and inadequate for detecting fake news in advance. Therefore, they propose a CNN-based detection model named CAMI, which flexibly extracts key features scattered in related posts through convolution operations and shapes the interaction among high-level features. Social context-based detection methods principally embed social interactions (e.g., user comments/likes/reposts) or information diffusion structures into dense vectors by neural networks for subsequent detection. For instance, Liu et al. [20] observe that real and fake news have different dissemination patterns, so they use gated recurrent units (GRUs) and CNNs to extract global and local features of the retweeting sequences for fake news detection. On this basis, Lu et al. [24] introduce graph convolutional networks (GCNs) to learn more accurate structural information of news propagation paths. Different from previous works, Bian et al. [2] emphasize the wide dispersion of fake news in addition to its deep propagation, and consequently propose the Bi-GCN model to comprehensively describe the spread of fake news in social networks. Furthermore, several works comprehensively utilize both content features and social context features (news articles, user profiles, and social interactions) [38], such as CSI [34], dEFEND [35], and FANG [29]. Existing methods tend to improve fake news detection metrics on specific datasets. However, these prior studies heavily rely on the training datasets and are infeasible to apply to unseen events due to the discrepancy between event distributions. How to use the knowledge learned from historical events to detect fake news in upcoming events has not been studied yet.
Domain adaptation mainly tackles the problem of knowledge transfer where the source domain and the target domain have different marginal probability distributions but the same conditional probability distributions [47]. Therefore, the process of transferring knowledge from the source domain to the target domain is transformed into data distribution matching. For instance, TCA [30] introduces the marginal maximum mean discrepancy (MMD) into the loss function, and then minimizes the distance between the source and target features in a reproducing kernel Hilbert space (RKHS). Existing works reveal that deep neural networks can disentangle more transferable representations than shallow models [1, 12], but deep latent representations can only narrow, not eliminate, the discrepancy between two domains [43]. Inspired by adversarial learning [23], researchers mainly focus on adversarial domain adaptation methods, which align the source domain distribution and target domain distribution by adding adversarial objectives into domain adaptation networks. When the adversarial objective confuses the two domains, the shared knowledge is considered to be adapted from the source to the target [39]. Adversarial domain adaptation methods often utilize a domain classifier as the adversarial objective. For example, DANN [11] is composed of a feature extractor, a label predictor, and a domain classifier. The domain classifier tries its best to judge the source of input features, while the feature extractor manages to deceive the domain classifier with learned features; DANN thus eliminates the domain discrepancy through the min-max game between the domain classifier and the feature extractor. With the utilization of a gradient reversal layer (GRL) in the domain classifier, all the submodules are trained jointly to fulfill knowledge transfer. Unlike DANN, ADDA [40] separately learns two feature extractors for the source domain and the target domain. Specifically, the source feature extractor and label predictor are trained by minimizing the cross-entropy loss, while the target feature extractor and domain classifier are trained in an adversarial manner with the source feature extractor's parameters fixed. Different from existing works, this paper focuses on the detection of fake news in new events. We use adversarial adaptation networks to automatically extract meta knowledge for fact checking in new events, and we utilize a weighting mechanism to calculate the importance of each social post to alleviate negative transfer.

In this section, we provide the problem definition of fake news detection, and present the framework in detail to show how to learn meta knowledge and utilize it to detect fake news in new events. The source and target events are sampled from joint distributions $P_S(\mathbf{x}^S, \mathbf{y}^S)$ and $P_T(\mathbf{x}^T, \mathbf{y}^T)$ respectively (satisfying $P_S \neq P_T$). This paper aims to design a neural network $G: \mathbf{x} \mapsto \mathbf{y}$ that minimizes the target cost $\epsilon_T(G) = \mathbb{E}_{(\mathbf{x}^T, \mathbf{y}^T) \sim P_T}\left[G(\mathbf{x}^T) \neq \mathbf{y}^T\right]$, which formally captures the transferable features of different events, reduces the event distribution discrepancy, and enhances the model's generality to new events for identifying falsehoods. We formalize the fake news detection task as follows:

Definition 3.1 (Fake News Detection). Under the supervised learning of source event data, fake news detection is defined as a binary classification task to predict whether a post $x_i^T$ $(i = 1, 2, \cdots, N_T)$ of the target event is real or fake:
$$y_i = \begin{cases} 0, & \text{if } x_i^T \text{ is a fake post} \\ 1, & \text{otherwise} \end{cases}$$

We propose MetaDetector for fake news detection in the context of historical knowledge transfer (as shown in Figure 2), which mainly consists of the feature extractor, the event discriminator, the pseudo-event discriminator, and the fake news detector. Specifically, the workflow of our proposed model is as follows:

• The feature extractor embeds the source and target event data into a feature space.
• The pseudo-event discriminator calculates the importance of source event posts to new events.
• The event discriminator learns the meta knowledge of different events by adversarial training.
• The news detector predicts the labels of target event posts based on the learned meta knowledge.

Feature Extractor. It maps news articles into dense vectors, extracts their semantic features, and passes them to subsequent modules. The feature extractor is implemented via Text-CNN [18]. For each word in both source and target posts, we utilize Word2Vec [28] to train word embeddings, and then concatenate them as the initial representation of an entire post. Let $X_i \in \mathbb{R}^{k \times n}$ denote the original representation of the $i$-th input post, where $k$ is the word embedding dimension and $n$ is the maximum sentence length. Taking into account the efficiency and quality of feature extraction, MetaDetector uses Text-CNN to learn local features between words and phrases for describing the linguistic characteristics of social posts. The detailed process of content feature extraction is as follows. In the convolutional phase, each convolutional filter ($h$ in height and $k$ in width) takes the embedding vectors of $h$ consecutive words as input and generates a corresponding feature $c_j$. Therefore, each filter generates a feature vector $\mathbf{c} = [c_1, c_2, \cdots, c_{n-h+1}]$. The max-pooling layer then selects the maximum value in $\mathbf{c}$ as the single feature extracted by that convolutional filter from the input post, i.e., $\hat{c} = \max\{\mathbf{c}\}$. A variety of multi-granularity features can be learned by using several convolutional filters with different window sizes, so we set $n_w$ window sizes and adopt $n_f$ filters for each specific window size. The output of the max-pooling layer is the concatenation of all calculated features, denoted as $\hat{\mathbf{c}}' \in \mathbb{R}^{n_w \cdot n_f}$. In order to integrate different features and fix the feature dimension, we further feed $\hat{\mathbf{c}}'$ into a fully connected layer with an activation function. Finally, the output of the feature extractor is:

$$\hat{\mathbf{c}} = \sigma\left(W_f \hat{\mathbf{c}}' + b_f\right) \quad (1)$$

where $W_f$ and $b_f$ are the weight matrix and bias of this fully connected layer, respectively. In summary, we abbreviate the feature extractor as $F(\mathbf{x})$, i.e., $\hat{\mathbf{c}} = F(\mathbf{x})$.

Fake News Detector. It judges whether the input post is real or fake news based on the corresponding features $\hat{\mathbf{c}}$ passed by the feature extractor, and consists of a single fully connected layer with softmax. We denote the fake news detector as $C(\cdot)$, and its output is:

$$\hat{\mathbf{y}} = C(\hat{\mathbf{c}}) = \mathrm{softmax}\left(W_c \hat{\mathbf{c}} + b_c\right) \quad (2)$$

where $\hat{\mathbf{y}}$ is a tensor of size $N \times 2$ (indicating the probability that each post is a piece of fake news). Since posts in the source event are labeled, we can train with the supervised classification cross-entropy loss $\mathcal{L}_c$:

$$\mathcal{L}_c = -\mathbb{E}_{(\mathbf{x}, y) \sim P_S}\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right] \quad (3)$$

A fake news detector optimized by minimizing the loss $\mathcal{L}_c$ (i.e., $\min_{\theta_f, \theta_c} \mathcal{L}_c$) can truly identify fake news in the source event, but may be invalid when facing events unseen in the training stage. In other words, the cooperation between the feature extractor and the fake news detector lays great stress on event-specific features, ignoring the transferable features that are instructive for target event fake news detection.
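For concreteness, below is a minimal PyTorch sketch of the feature extractor $F$ described above: per-window convolutions, max-pooling over time, and the final fully connected layer of Equation 1. The hyper-parameter values echo the experimental setup reported later (embedding dimension 32, 20 filters per window size, window sizes 1 to 4); the ReLU activation and all names are our assumptions.

```python
# Sketch of the Text-CNN feature extractor F: conv -> max-pool -> FC (Eq. 1).
import torch
import torch.nn as nn
import torch.nn.functional as F_

class FeatureExtractor(nn.Module):
    def __init__(self, vocab_size, k=32, n_f=20, windows=(1, 2, 3, 4), hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, k)
        # One convolution per window size h; each filter spans h words x k dims.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_f, kernel_size=(h, k)) for h in windows])
        self.fc = nn.Linear(n_f * len(windows), hidden)

    def forward(self, x):                       # x: (B, n) word ids
        e = self.embed(x).unsqueeze(1)          # (B, 1, n, k)
        # Convolve, then max-pool over time: c_hat = max{c} for each filter.
        feats = [F_.relu(conv(e)).squeeze(3).max(dim=2).values
                 for conv in self.convs]        # each: (B, n_f)
        c_cat = torch.cat(feats, dim=1)         # (B, n_f * n_w)
        return F_.relu(self.fc(c_cat))          # Eq. 1, ReLU assumed as sigma
```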
In order to adapt the fake news detector to target event samples, we need to acquire the meta knowledge underlying different events, correctly measure the dissimilarity between the source and target events, and combine the learned meta knowledge with target event-specific features to detect fake news in unlabeled new events. The event discriminator is the function that measures the dissimilarity between the source and target events; as in Wang et al. [41], it is composed of two fully connected layers. It takes the features $\hat{\mathbf{c}}$ as input, and outputs the probability that the corresponding post belongs to the source event. The higher this probability, the closer the current feature is to event-specific features, which should be reduced from the mapped feature space in order to obtain the meta knowledge. We denote the event discriminator as $D(\cdot)$, and its final output $\hat{e}$ is:

$$\hat{e} = D(\hat{\mathbf{c}}) \quad (4)$$

Assuming that source event posts are positive samples (i.e., the event label is "1") and target event posts are negative samples (i.e., the event label is "0"), the event discriminator loss can be described by:

$$\mathcal{L}_d = -\mathbb{E}_{\mathbf{x} \sim P_S}\left[\log D(F(\mathbf{x}))\right] - \mathbb{E}_{\mathbf{x} \sim P_T}\left[\log\left(1 - D(F(\mathbf{x}))\right)\right] \quad (5)$$

The event discriminator loss reflects the proximity of the deep feature distributions of the source and target events: a larger $\mathcal{L}_d$ indicates that the event discriminator cannot distinguish the source of input posts, which means that the combination of the feature extractor and the event discriminator can gradually reduce the discrepancy across events. Aiming to disentangle transferable features from the feature space as much as possible, the detection task plays a min-max game, which poses a challenge to model training by stochastic gradient descent (SGD): the feature extractor manages to confuse the event discriminator to increase $\mathcal{L}_d$, while the event discriminator strives to clarify the source of news posts (source event or target event) to decrease $\mathcal{L}_d$ (i.e., $\min_{\theta_d} \max_{\theta_f} \mathcal{L}_d$). Given the features $\hat{\mathbf{c}} = F(\mathbf{x})$ learned by the feature extractor, the optimal event discriminator $D^*$ is obtained at:

$$D^*(\hat{\mathbf{c}}) = \frac{p_S(\hat{\mathbf{c}})}{p_S(\hat{\mathbf{c}}) + p_T(\hat{\mathbf{c}})} \quad (6)$$

where $p_S$ and $p_T$ denote the source and target feature distributions. According to Equation 6, if $D^* \to 1$, the current event discriminator can easily identify news posts from the source event; if $D^*$ is small enough, $\hat{\mathbf{c}}$ is most likely from the shared feature distribution of the two events. Similar to Zhang et al. [46], we give the following proof of Equation 6:

Proof. Given posts $\mathbf{x}$ from different events, $D$ uses maximizing the objective underlying Equation 5 as its training criterion:

$$\max_D \; \mathbb{E}_{\mathbf{x} \sim P_S}\left[\log D(F(\mathbf{x}))\right] + \mathbb{E}_{\mathbf{x} \sim P_T}\left[\log\left(1 - D(F(\mathbf{x}))\right)\right] \quad (7)$$

Since the feature distributions of the source and target events lie in the same feature space, we substitute the original data distributions in Equation 7 with the feature distributions and calculate the partial derivative with respect to $D$:

$$\frac{\partial}{\partial D} \int_{\hat{\mathbf{c}}} \left[p_S(\hat{\mathbf{c}}) \log D(\hat{\mathbf{c}}) + p_T(\hat{\mathbf{c}}) \log\left(1 - D(\hat{\mathbf{c}})\right)\right] d\hat{\mathbf{c}} = \frac{p_S(\hat{\mathbf{c}})}{D(\hat{\mathbf{c}})} - \frac{p_T(\hat{\mathbf{c}})}{1 - D(\hat{\mathbf{c}})} \quad (8)$$

According to the first-order optimality condition, we set this derivative to zero, and the optimal event discriminator $D^*$ satisfies:

$$p_S(\hat{\mathbf{c}})\left(1 - D^*(\hat{\mathbf{c}})\right) - p_T(\hat{\mathbf{c}}) D^*(\hat{\mathbf{c}}) = 0 \quad (9)$$

Finally, the optimal event discriminator in Equation 6 is obtained from Equation 9. □

Pseudo-Event Discriminator. It is responsible for measuring the importance of each source event post to the target event. Like a twin of the event discriminator, it is composed of two fully connected layers and is denoted as $\tilde{D}(\cdot)$. The input of the pseudo-event discriminator is the features $\hat{\mathbf{c}}$ fed by the feature extractor, and the output is the importance scores of the source posts. Calculating the importance of source event posts actually serves to align the source and target event posts more accurately in the same feature space by re-weighting the source event posts. To solve this problem, we apply the pseudo-event discrimination mechanism to measure the needed weights.
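The following sketch shows the event discriminator $D$ and the loss of Equation 5 in PyTorch, reusing the FeatureExtractor above; the two-layer architecture follows the text, while the layer widths and names are our assumptions.

```python
# Sketch of the event discriminator D: two fully connected layers scoring the
# probability that a feature vector comes from the source event (label 1).
import torch
import torch.nn as nn

class EventDiscriminator(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, c_hat):
        return self.net(c_hat).squeeze(1)       # e_hat in (0, 1)

def event_loss(D, feat_src, feat_tgt, eps=1e-8):
    # L_d (Eq. 5): binary cross-entropy with source = 1, target = 0.
    # D minimizes this loss; F maximizes it through the min-max game.
    return -(torch.log(D(feat_src) + eps).mean()
             + torch.log(1.0 - D(feat_tgt) + eps).mean())
```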
The mechanism is based on the following basic assumption: the source probability of each post calculated by the event discriminator also reflects the degree of sharing between the feature distributions of the source and target events, so the meta knowledge can be disentangled based on a mathematical transformation of $\hat{e}$. This is why the mechanism is considered a pseudo-event discrimination mechanism: the source posts' weights are defined as a function similar to the event discriminator. According to Equation 6, when $D^*$ approaches 1 (and $1 - D^*(\hat{\mathbf{c}})$ gradually decreases), the feature extractor pays more attention to source event-specific features, hence this type of post should be given a smaller weight. When $D^*$ is small enough (and $1 - D^*(\hat{\mathbf{c}})$ gradually increases), the feature extractor captures the event-shared features required by the fake news detector, and this kind of post should be given a larger weight. Based on this trade-off relationship between $D^*$ and the weights $w^*$, we concisely define $w^*$ as follows:

$$w^* = 1 - D^* \quad (10)$$

Although the event discriminator could also measure source post importance indirectly, the reason why we additionally introduce a pseudo-event discriminator is as follows: the introduced weight $w^*$ cannot theoretically reduce the feature distribution discrepancy between events within the same optimization. Since $w^*$ is also a function of $D^*$, the final optimal event discriminator would no longer be the ratio between the source feature distribution and the sum of the source and target feature distributions (as shown in Equation 6). Compared with the event discriminator $D$, the pseudo-event discriminator $\tilde{D}$ does not undergo adversarial training, and its gradient does not need to be back-propagated to optimize $F$, considering that the gradient calculated on unweighted source posts and target posts cannot exactly reflect the corresponding event-shared feature distribution. Consequently, $\tilde{D}$ is used to calculate the importance of source posts, while $D$ is utilized to reduce the discrepancy in the feature distributions for preserving meta knowledge. According to Equation 10, we denote the output of $\tilde{D}$ as $\hat{w} = \tilde{D}(F(\mathbf{x}))$ $(\mathbf{x} \in \{S \cup T\})$, and formalize the weights of the source event posts $(\mathbf{x} \in S)$ as follows:

$$w(\mathbf{x}) = 1 - \tilde{D}(F(\mathbf{x})) \quad (11)$$

The pseudo-event discriminator can then be trained with the following cross-entropy loss $\mathcal{L}_w$, and the weights of source posts are calculated by $\min \mathcal{L}_w$:

$$\mathcal{L}_w = -\mathbb{E}_{\mathbf{x} \sim P_S}\left[\log \tilde{D}(F(\mathbf{x}))\right] - \mathbb{E}_{\mathbf{x} \sim P_T}\left[\log\left(1 - \tilde{D}(F(\mathbf{x}))\right)\right] \quad (12)$$

We now introduce the loss function and working framework of our proposed MetaDetector. While optimizing a fake news detection objective, it learns cross-event meta knowledge for verifying unlabeled target event posts, which is performed by concurrently optimizing the supervised fake news detection loss $\mathcal{L}_c$, the event discrimination loss $\mathcal{L}_d$, and the weight evaluation loss $\mathcal{L}_w$. On the basis of event-level adversarial adaptation, MetaDetector utilizes the pseudo-event discrimination-based weighting mechanism to match the key feature distributions of events, which reduces the impact of irrelevant or anomalous source posts on the unlabeled target posts, and uses the learned meta knowledge to detect fake news in new events, which facilitates post-level adversarial adaptation. In practical scenarios, the feature distributions of the source and target events are sometimes very similar; for example, some new fake news merely replaces the names and locations mentioned in historical fake news. Limited by the size of the source event data for training MetaDetector, the weights applied in such cases explicitly weaken the representations of transferable features to a certain extent (since each $w(\mathbf{x})$ is a scalar between 0 and 1), which may lead to a decline in detection accuracy.
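A minimal sketch of the weighting mechanism of Equations 10-12, assuming $\tilde{D}$ is a second instance of the EventDiscriminator class above; the detach calls encode the paper's note that $\tilde{D}$'s gradient is not back-propagated into the feature extractor.

```python
# Sketch of the pseudo-event discrimination-based weighting (Eqs. 10-12).
import torch

def post_weights(D_tilde, feat_src):
    # Eq. 11: w(x) = 1 - D_tilde(F(x)); no gradient flows back through F.
    with torch.no_grad():
        return 1.0 - D_tilde(feat_src)

def pseudo_event_loss(D_tilde, feat_src, feat_tgt, eps=1e-8):
    # L_w (Eq. 12): ordinary BCE minimized only over D_tilde's parameters;
    # detach() keeps the feature extractor out of this optimization.
    return -(torch.log(D_tilde(feat_src.detach()) + eps).mean()
             + torch.log(1.0 - D_tilde(feat_tgt.detach()) + eps).mean())
```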
Consequently, MetaDetector automatically determines the values of the source posts' weights according to the distribution discrepancy between the source event and target event, i.e.,

$$w(\mathbf{x}) = \begin{cases} 1 - \tilde{D}(F(\mathbf{x})), & \text{if } d(S, T) > d^* \\ 1, & \text{otherwise} \end{cases} \quad (13)$$

where $d^*$ is a hyper-parameter serving as the event distribution shift threshold, and $d(S, T)$ is calculated by the maximum mean discrepancy (MMD) [21]. The squared form of $d$ is defined as:

$$d^2(S, T) = \left\| \mathbb{E}_{\mathbf{x} \sim P_S}[\phi(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim P_T}[\phi(\mathbf{x})] \right\|_{\mathcal{H}}^2 \quad (14)$$

where $\phi(\cdot)$ is a feature mapping function and the two expectations represent the centers of the source and target feature distributions respectively. For transferring meta knowledge between events to detect fake news in new events, we reformulate the fake news detection loss and event discrimination loss by adding the weights to Equation 3 and Equation 5 respectively. Specifically, the weighted fake news detection loss $\mathcal{L}_{w\text{-}c}$ and the weighted event discrimination loss $\mathcal{L}_{w\text{-}d}$ are as follows:

$$\mathcal{L}_{w\text{-}c} = -\mathbb{E}_{(\mathbf{x}, y) \sim P_S}\, w(\mathbf{x})\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right] \quad (15)$$

$$\mathcal{L}_{w\text{-}d} = -\mathbb{E}_{\mathbf{x} \sim P_S}\left[w(\mathbf{x}) \log D(F(\mathbf{x}))\right] - \mathbb{E}_{\mathbf{x} \sim P_T}\left[\log\left(1 - D(F(\mathbf{x}))\right)\right] \quad (16)$$

In conclusion, MetaDetector optimizes the fake news detector by minimizing the weighted fake news detection loss $\mathcal{L}_{w\text{-}c}$, learns the event discriminator by minimizing the weighted event discrimination loss $\mathcal{L}_{w\text{-}d}$, and cultivates the pseudo-event discriminator by minimizing the weight evaluation loss $\mathcal{L}_w$. Besides, the feature extractor is optimized to minimize $\mathcal{L}_{w\text{-}c}$ while maximizing $\mathcal{L}_{w\text{-}d}$ at the same time. The final loss of MetaDetector is the linear combination of the three losses:

$$\mathcal{L} = \mathcal{L}_{w\text{-}c} + \lambda \mathcal{L}_{w\text{-}d} + \gamma \mathcal{L}_w \quad (17)$$

where $\gamma$ is a scalar that adjusts the impact of the weight evaluation loss on the final loss, and $\lambda$ is a hyper-parameter that adapts the trade-off between fake news detection and event discrimination. To perform the min-max game between the feature extractor and the event discriminator, we add a GRL [10] in the event discriminator before its fully connected layers, which has no parameters other than the hyper-parameter $\lambda$. Specifically, during forward propagation, the GRL is equivalent to an identity transformation that feeds the features extracted by $F$ into $D$. During back propagation, the GRL receives the gradient from its subsequent layer, multiplies it by $-\lambda$, and passes it on, so that the parameters of $F$ and $D$ are learned simultaneously.

In this section, we conduct extensive experiments on two large-scale datasets, and compare the performance of MetaDetector against the baselines. In addition, we perform a case study to evaluate the effectiveness of MetaDetector. To evaluate the performance of the proposed solution, we collected two large-scale fake news datasets from Sina Weibo and Twitter. We perform several fake news detection tasks by transferring shared features from $S \to T$, where $S$ corresponds to the source event and $T$ represents the target event.

4.1.1 Weibo. … news. This dataset also collects original news posts, as well as their comments. Similar to the first part, we apply the original text of this data.
• The third part is a COVID-19 fake news dataset published by Yang et al. [42], namely CHECKED, which consists of 344 fake news and 1,776 real news posts about COVID-19 collected from December 2019 to August 2020.
Since the first two parts of our dataset do not distinguish the association between news, we perform hierarchical clustering on them to characterize upcoming events after removing posts less than 10 in length, similar to Wang et al. [41]. Then we select 3 generalized events from the clustering results (abbreviated as $E_1$, $E_2$, and $E_3$) and the COVID-19 news data. Afterwards, we take each event as the source or target event to define six fake news detection tasks: $E_1 \to E_2$, $E_2 \to E_1$, $E_1 \to E_3$, $E_3 \to E_1$, $E_2 \to E_3$, and $E_3 \to E_2$.

4.1.2 Twitter. We select the tweets of three breaking news events from PHEME [50] to formalize fake news detection tasks.
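Putting the pieces together, here is a sketch of the GRL and the Equation 13 / Equation 17 logic, under the same naming assumptions as the previous snippets. The single-Gaussian-kernel MMD estimator stands in for the paper's seven-kernel version.

```python
# Sketch of the gradient reversal layer (GRL), the MMD gate (Eq. 13),
# and the final combined objective (Eq. 17).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                    # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None    # multiply the gradient by -lambda

def mmd2(a, b, sigma=1.0):
    # Biased estimate of the squared MMD (Eq. 14) with one Gaussian kernel.
    def k(x, y):
        return torch.exp(-torch.cdist(x, y).pow(2) / (2 * sigma ** 2)).mean()
    return k(a, a) + k(b, b) - 2 * k(a, b)

def source_weights(D_tilde, feat_src, feat_tgt, d_star=0.8):
    # Eq. 13: use the learned weights only when the event shift exceeds d*.
    with torch.no_grad():
        if mmd2(feat_src, feat_tgt).sqrt() > d_star:
            return 1.0 - D_tilde(feat_src)
    return torch.ones(feat_src.size(0))

def total_loss(L_wc, L_wd, L_w, lam=1.0, gamma=1.0):
    # Eq. 17; the GRL inside the event discriminator branch realizes the
    # feature extractor's maximization of the weighted discrimination loss.
    return L_wc + lam * L_wd + gamma * L_w

# GRL usage: reversed_feat = GradReverse.apply(c_hat, 1.0) before D's layers.
```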
The three representative events are Charlie Hebdo 8 (denoted as $C$), Ferguson 9 (denoted as $F$), and Sydney Siege 10 (denoted as $S$). The six fake news detection tasks are formalized as follows: $C \to F$, $F \to C$, $C \to S$, $S \to C$, $F \to S$, and $S \to F$. The statistics of the two datasets are shown in Table 1. For each detection task on Weibo, we randomly select 2,500 labeled source posts and 2,500 unlabeled target posts for training, 500 target posts for validation, and 900 target posts for testing. For the Twitter data, we choose 800 labeled source tweets and 800 unlabeled target tweets for training, 150 target tweets for validation, and between 200 and 300 target tweets for testing. We compare our proposed model with the following representative methods:

• DNN: It uses the typical nonlinear fitting ability of multiple fully connected layers to learn the textual features of news. In this experiment, the DNN model adopts two fully connected layers, with sizes of 64 and 32 respectively.
• Text-CNN [18]: It uses convolutional neural networks to model the content of news, obtaining latent linguistic features of different granularities through diverse filters. Its initial parameter setting is the same as that of our feature extractor (see details in Section 4.2.2).
• GRU-2 [25]: It treats the news content as time-sequential data, and applies two GRU hidden layers to obtain high-level feature representations of the news.

In addition, to study the performance of MetaDetector on upcoming news, we also take the following advanced domain adaptation methods (adjusted to fake news detection tasks) as baselines:

• EANN [41]: It jointly trains a multi-modal feature extractor, an event discriminator, and a fake news detector for multi-modal fake news detection. Since this article focuses on the text modality, we remove the feature extraction part for the visual modality and denote the result as EANN-text.
• ADDA [40]: It trains a feature extractor for the source domain and the target domain respectively, and combines discriminative modeling, untied weight sharing, and a GAN loss for characterizing representations of the target domain in the shared feature space. In order to make a fair comparison with our pseudo-event discrimination-based weighting mechanism, we utilize an ADDA model based on the GRL layer (abbreviated as ADDA-grl), where the min-max game is not iteratively trained with a GAN loss but is implemented through the GRL layer. Besides, both the source and target encoders of ADDA-grl are Text-CNNs.

Note that unlabeled target event instances are not used when training the non-transfer fake news detection methods, i.e., DNN, Text-CNN, and GRU-2. For training MetaDetector, we set $\lambda = \gamma = 1$ for the final loss function $\mathcal{L}$, calculate the MMD distance between events using 7 Gaussian kernels, and set the distribution shift threshold $d^*$ to 0.8 according to the calculated distances in Table 2. In the feature extractor, we set the dimension of word embeddings $k = 32$, while the maximum sentence length $n$ depends on the posts' lengths in the Sina Weibo and Twitter data respectively. In addition, the number of convolutional filters is 20, with window sizes ranging from 1 to 4. The hidden size of the fully connected layer in the feature extractor is 32. The event discriminator and the pseudo-event discriminator share the same architecture, consisting of two fully connected layers. In the fake news detector, the hidden size of the single fully connected layer is also 32.
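For reference, the model settings reported in this subsection can be collected into a single configuration sketch (the key names are ours; the training-loop settings follow in the next paragraph):

```python
# Hyper-parameters reported for MetaDetector, gathered as one config dict.
CONFIG = {
    "lambda": 1.0,            # trade-off for the event discrimination loss
    "gamma": 1.0,             # weight of the weight-evaluation loss L_w
    "mmd_kernels": 7,         # Gaussian kernels used for the MMD distance
    "shift_threshold": 0.8,   # d*, the event distribution shift threshold
    "embed_dim": 32,          # word embedding dimension k
    "num_filters": 20,        # convolutional filters per window size
    "window_sizes": (1, 2, 3, 4),
    "hidden_size": 32,        # FC hidden size in extractor and detector
}
```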
For all the baselines and the proposed model, the training batch size and number of epochs are both 100, the learning rate is 0.01, and the dropout rate is 0.2. We first use t-SNE to visualize the distribution shifts among the three breaking news events (Charlie Hebdo Shooting, Ferguson Unrest, and Sydney Siege) on Twitter. As shown in Figure 3(a), the distribution shifts among the three events are subtle. This could result from the fact that all three events are related to shooting or armed robbery crimes: apart from event properties such as locations and scenarios, the text descriptions of the events are semantically comparable, which makes them suitable for detecting fake news by transferring event meta knowledge. Figure 3(b)-(d) present the data distribution shifts between each specific event and the COVID-19 event. Each sub-figure illustrates that the source event contains information that is anomalous or not shared with COVID-19 (samples circled by dotted lines), which should be given smaller weights, while it also contains important transferable information for identifying posts in new events. Subsequently, we compare the performance of our proposed MetaDetector and the baseline models on the designed fake news detection tasks. Specifically, we compare the accuracy of MetaDetector and the baselines on detection tasks on Twitter (posts in English) and Sina Weibo (posts in Chinese), with 6 knowledge transfer-based fake news detection tasks on each dataset. Table 3 and Table 4 respectively report the fake news detection results of the different models. In addition, we take $E_1$, $E_2$, and $E_3$ as source events and the COVID-19 related posts as the target event to evaluate the detection performance of MetaDetector, characterized by accuracy (Acc.), precision (Pre.), recall (Rec.), and F1 score, as shown in Table 5. We can observe that our proposed MetaDetector shows superior performance on most tasks, and it is competitive with and even superior to the state-of-the-art domain adaptation methods. In the experimental results on the two datasets, the fake news detection accuracy of the knowledge transfer-based methods (i.e., EANN-text, ADDA-grl, and our proposed MetaDetector) is generally higher than that of the traditional deep learning models (i.e., DNN, GRU-2, and Text-CNN). Compared with the non-knowledge-transfer methods, the average detection accuracy of MetaDetector is improved by nearly 3%-14% on the Weibo tasks and 7%-10% on the Twitter tasks. In fact, these non-knowledge-transfer methods do not fully consider the distribution discrepancy between the source event and the target event, and they are more sensitive to the labeled source event posts, which leads to special attention to source event-specific features during model training. Consequently, their ability to recognize fake news in new events declines dramatically. For example, for the Twitter task with the largest MMD distance between source and target (0.6378, higher than the distances between the other events in the Twitter dataset), the detection accuracy of DNN and GRU-2 is approximately 10% lower than that of MetaDetector. In addition, the detection accuracy of Text-CNN is typically higher than that of GRU-2 and DNN, and is comparable to the knowledge transfer methods, especially on tasks $E_2 \to E_3$ and $E_3 \to E_2$, which may be explained by the latent representation of fake news adopted in this paper.
MetaDetector mainly explores the event-shared latent semantic information underlying the news contents of different events, so it does not combine the comments or user profiles of disseminators to characterize each piece of news in new events (since this type of contextual information is mostly event-specific). GRU-2 models textual information as time series data to capture discriminative features of questionable real and fake news; it pays more attention to global semantic features and ignores local linguistic features, and the lack of feature representations for modeling social reactions also has a certain negative impact on its detection accuracy. Text-CNN relies on multi-size convolutional filters to extract rich feature representations of news posts, resulting in an accuracy 6.71% higher than DNN on one of the Twitter tasks, and 14.32% higher on task $E_1 \to E_2$.

4.3.2 Effectiveness of the pseudo-event discrimination-based weighting mechanism. Compared with the other transfer learning methods, the average fake news detection accuracy of our proposed MetaDetector on the tasks of the two datasets is higher than that of the baselines (74.8% on the Weibo data and 72.19% on the Twitter data). In this experiment, we mainly compare our proposed model with EANN-text and ADDA-grl. As mentioned in Section 1, EANN-text is an adversarial nets-based fake news detection method, which utilizes a domain classifier to guide its feature extractor to learn event-shared features. In fact, its adversarial training stage only reduces the marginal distribution discrepancy between events, and does not thoroughly align the conditional distributions of the source and target events. In other words, there is misleading information underlying the learned event-shared features, which matches the target posts to some specific source posts, making EANN-text perform worse than Text-CNN on tasks $E_2 \to E_1$ and $E_3 \to E_2$, i.e., inducing negative transfer. EANN-text can be regarded as a simplified version of our proposed model without the weighting mechanism; therefore, we use the same network parameters as EANN-text to train MetaDetector. The results in Table 3 and Table 4 indicate that MetaDetector outperforms EANN-text on our designed detection tasks. The average accuracy of MetaDetector is 3.21% higher than that of EANN-text on the Weibo tasks, and notably 6.16% and 4.67% higher on tasks $E_2 \to E_1$ and $E_3 \to E_2$. As shown in Table 2, the MMD distances between $E_1$ and $E_2$, and between $E_2$ and $E_3$, are greater than the threshold $d^*$ (set by empirical analysis of the events' distributions). In these two tasks, when MetaDetector aligns the source event distribution with the target event distribution, it re-weights the source posts through the importance scores learned by the pseudo-event discriminator, which reduces the negative impact of instances similar to the circled nodes in Figure 3(c) on target event fake news detection. Therefore, compared with EANN-text, MetaDetector does not show obvious negative transfer on the 12 detection tasks. This demonstrates the effectiveness of the pseudo-event discrimination-based weighting mechanism, which enables our proposed model to comprehensively evaluate the importance of source event posts, recognize anomalous or unrelated historical instances, and reduce the discrepancy in feature distributions between the source event and an upcoming event. Although the accuracy of MetaDetector is slightly lower than that of ADDA-grl on specific tasks (i.e., one Twitter task and $E_1 \to E_3$), its overall detection performance is comparable to ADDA-grl.
In addition to lacking a weighting mechanism, ADDA-grl utilizes a non-shared feature extraction operation, using a source encoder and a target encoder to extract the source event and target event features respectively. When the discrepancy between feature distributions is narrow, the non-shared feature extractors allow ADDA-grl to preserve the part of the transferable features that is closer to the target event, which explains its highest detection accuracy (2.12% higher than our method) on task $E_1 \to E_3$ (MMD distance 0.3386). However, ADDA-grl matches the source event and target event distributions in two separate feature spaces. When the distribution discrepancy between the events is noticeable (the MMD distance is large enough), its accuracy in detecting fake news in target events is lower than that of MetaDetector, which performs event meta knowledge transfer by re-weighting the source event distribution in the same feature space. In order to illustrate the importance of event meta knowledge for detecting fake news in new events, we further utilize the MetaDetector trained on historical events (i.e., $E_1$, $E_2$, and $E_3$) to identify fake news in the COVID-19 event [42]. Considering the small amount of high-quality COVID-19 posts, we sample 1,500 labeled source posts and 1,000 unlabeled epidemic posts for training, and 400 and 600 epidemic posts for validation and testing respectively. Afterwards, we set up the following fake news detection tasks: $E_1 \to$ COVID-19, $E_2 \to$ COVID-19, and $E_3 \to$ COVID-19. We compare the detection accuracy, precision, recall, and F1 scores of each fake news detection model on the three detection tasks, as shown in Table 5. It demonstrates that MetaDetector achieves the highest accuracy on the $E_2 \to$ COVID-19 and $E_3 \to$ COVID-19 tasks, and matches the highest accuracy on task $E_1 \to$ COVID-19. A noteworthy experimental result is that the recall, precision, and F1 scores of MetaDetector on real news are all higher than those of the other baselines, but it performs poorly on these metrics for fake news. For example, on task $E_1 \to$ COVID-19, the F1 score of ADDA-grl on fake news is 0.834, while that of our model is only 0.226. Since the ratio of real to fake news in the COVID-19 data is close to 5:1, MetaDetector may pay more attention to the real news of the target event in the same feature space, resulting in a decline in the recall of fake news. Note that the MMD distance between $E_1$ and COVID-19 (0.4095, details in Table 2) is much smaller than our threshold, and their distribution discrepancy is much narrower; in this case, the setting is more conducive to ADDA-grl, which utilizes the non-shared feature extraction operation. For the other two tasks, as the distribution discrepancy increases (the MMD distance rising from 0.9161 to 1.1324), we observe that MetaDetector maintains a high accuracy and F1 score on real news, while its precision, recall, and F1 scores on fake news are all comparable to ADDA-grl.

Case studies for COVID-19 related fake news detection. We further analyze the validity of meta event knowledge transfer for fake news detection, and map the weight of each source event instance learned by MetaDetector onto the event data distribution through a gradient color. In this subsection, we take the task $E_2 \to$ COVID-19 as an example. As shown in Figure 4, there are 1,000 sampling points from the data about COVID-19 and 1,500 sampling points from $E_2$ in the training stage.
The gradient colors indicate the corresponding weight values (red represents the highest value, close to 1). It can be seen that most of the source posts far away from the target posts are cyan (with lower weights), while gold circles (with higher weights) are mostly distributed in the area overlapping with the target event. However, some circles that are farther away from the target red points are also given high weights, and vice versa, indicating that MetaDetector not only pays attention to the transferable features of different events but also integrates event-specific features. This ensures that our proposed model improves generalization performance while maintaining the accuracy of target fake news detection. In addition, we select several representative news posts based on their weight values to illustrate that the pseudo-event discrimination-based weighting mechanism helps the model capture influential source posts for detecting target fake news, as shown in Figure 5. After abstracting the semantic information of the sampled posts, we observe that social posts related to diet generally have lower weights, while virus-related posts generally have higher weights. This further reveals that directly utilizing traditional adversarial learning methods to align the source and target event data distributions can easily confuse the sub-semantic information underlying the whole event and cause the shared feature space to shift. We also notice that the sixth post of $E_2$ in Figure 5 has a relatively high weight, probably because of its similar description to the fourth post in COVID-19 (as highlighted in Figure 5).

This work focuses on upcoming fake news detection on social media, a practical problem overlooked by the research community. The major challenge is how to learn event-shared meta knowledge so as to alleviate negative transfer. To tackle this problem, we propose a weighted adversarial event adaptation network based on unsupervised adversarial domain adaptation nets, namely MetaDetector, which aims to extract transferable meta knowledge for fake news detection. MetaDetector utilizes the pseudo-event discrimination-based weighting mechanism to automatically recognize historical posts that are irrelevant or anomalous with respect to the target events, which reduces the discrepancy between events and promotes the model's generality to unseen events. Experiments on public datasets collected from Twitter and Sina Weibo demonstrate that the meta knowledge is constructive for learning the representation space of upcoming events, and that MetaDetector outperforms state-of-the-art fake news detection methods, especially when the distribution shift between events is significant, which helps alleviate negative transfer. For future work, we will explore multi-source event knowledge transfer methods for detecting fake news in new events. The meta knowledge presented in this paper is currently viewed from the perspective of transferable linguistic features; however, the transferability of information such as the logic, causality, and factual evidence underlying different events is also worth exploring.
References
[1] Representation learning: A review and new perspectives.
[2] Rumor detection on social media with bi-directional graph convolutional networks.
[3] Influence of fake news in Twitter during the 2016 US presidential election.
[4] Information credibility on Twitter.
[5] Fake news in the news: An analysis of partisan coverage of the fake news phenomenon.
[6] Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection.
[7] The COVID-19 social media infodemic.
[8] BERT: Pre-training of deep bidirectional transformers for language understanding.
[9] Assessing the risks of "infodemics" in response to COVID-19 epidemics.
[10] Unsupervised domain adaptation by backpropagation.
[11] Domain-adversarial training of neural networks.
[12] Domain adaptation for large-scale sentiment classification: A deep learning approach.
[13] The future of false information detection on social media: New perspectives and trends.
[14] Rumor detection with hierarchical social attention network.
[15] Multimodal fusion with recurrent neural networks for rumor detection on microblogs.
[16] News credibility evaluation on microblog with a hierarchical propagation model.
[17] News verification by exploiting conflicting social viewpoints in microblogs.
[18] Convolutional neural networks for sentence classification.
[19] A multi-view attention-based deep learning system for online deviant content detection. World Wide Web.
[20] Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks.
[21] Learning transferable features with deep adaptation networks.
[22] Conditional adversarial domain adaptation.
[23] Adversarial learning.
[24] GCAN: Graph-aware co-attention networks for explainable fake news detection on social media.
[25] Detecting rumors from microblogs with recurrent neural networks.
[26] Visualizing data using t-SNE.
[27] Characterizing online rumoring behavior using multi-dimensional signatures.
[28] Distributed representations of words and phrases and their compositionality.
[29] FANG: Leveraging social context for fake news detection using graph representation.
[30] Domain adaptation via transfer component analysis.
[31] A survey on transfer learning.
[32] Limited individual attention and online virality of low-quality information.
[33] Truth of varying shades: Analyzing language in fake news and political fact-checking.
[34] CSI: A hybrid deep model for fake news detection.
[35] dEFEND: Explainable fake news detection.
[36] Fake news detection on social media: A data mining perspective.
[37] Understanding user profiles on social media for fake news detection.
[38] Beyond news contents: The role of social context for fake news detection.
[39] Simultaneous deep transfer across domains and tasks.
[40] Adversarial discriminative domain adaptation.
[41] EANN: Event adversarial neural networks for multi-modal fake news detection.
[42] CHECKED: Chinese COVID-19 fake news dataset.
[43] How transferable are features in deep neural networks?
[44] A convolutional approach for misinformation identification.
[45] The emergence of social and community intelligence.
[46] Importance weighted adversarial nets for partial domain adaptation.
[47] Transfer adaptation learning: A decade survey.
[48] Enquiring minds: Early detection of rumors in social media from enquiry posts.
[49] A survey of fake news: Fundamental theories, detection methods, and opportunities.
[50] Learning reporting dynamics during breaking news for rumour detection in social media.

Acknowledgments. This work was partially supported by the National Science Fund for Distinguished Young Scholars (62025205, 61725205), the National Key R&D Program of China (2019QY0600), and the National Natural Science Foundation of China (No. 61960206008, 61902320, 61972319).