key: cord-0619140-v8m15qau
authors: Lin, Hongzhan; Ma, Jing; Chen, Liangliang; Yang, Zhiwei; Cheng, Mingfei; Chen, Guang
title: Detect Rumors in Microblog Posts for Low-Resource Domains via Adversarial Contrastive Learning
date: 2022-04-18
journal: nan
DOI: nan
sha: 8cc650fbf78e95565b21390a46e31591789f0393
doc_id: 619140
cord_uid: v8m15qau

Massive false rumors emerging along with breaking news or trending topics severely hinder the truth. Existing rumor detection approaches achieve promising performance on yesterday's news, since there is enough corpus collected from the same domain for model training. However, they are poor at detecting rumors about unforeseen events, especially those propagated in different languages, due to the lack of training data and prior knowledge (i.e., low-resource regimes). In this paper, we propose an adversarial contrastive learning framework to detect rumors by adapting the features learned from well-resourced rumor data to those of the low-resourced. Our model explicitly overcomes the restriction of domain and/or language usage via language alignment and a novel supervised contrastive training paradigm. Moreover, we develop an adversarial augmentation mechanism to further enhance the robustness of low-resource rumor representation. Extensive experiments conducted on two low-resource datasets collected from real-world microblog platforms demonstrate that our framework achieves much better performance than state-of-the-art methods and exhibits a superior capacity for detecting rumors at early stages.

With the proliferation of social media such as Twitter and Weibo, the emergence of breaking events provides opportunities for the spread of rumors, which are difficult to identify due to limited domain expertise and relevant data. For instance, along with the unprecedented COVID-19 pandemic, a false rumor claiming that "everyone who gets the vaccine will die or suffer from auto-immune diseases" was translated into many languages and spread at lightning speed on social media, which seriously confused the public and undermined the achievements of epidemic prevention in related countries or regions of the world. Although some recent works focus on collecting microblog posts corresponding to COVID-19 (Chen et al., 2020a; Zarei et al., 2020; Alqurashi et al., 2020), existing rumor detection methods perform poorly without a large-scale qualified training corpus, i.e., in a low-resource scenario (Hedderich et al., 2021). Thus there is an urgent need to develop automatic approaches to identify rumors in such low-resource domains, especially amid breaking events.

Social psychology literature defines a rumor as a story or a statement whose truth value is unverified or deliberately false (Allport and Postman, 1947). Recently, techniques using deep neural networks (DNNs) (Khoo et al., 2020; Bian et al., 2020) have achieved promising results for detecting rumors on microblogging websites by learning rumor-indicative features from a sizeable rumor corpus with veracity annotation. However, such DNN-based approaches are purely data-driven and have a major limitation in detecting emerging events concerning low-resource domains, i.e., the distinctive topic coverage and word distribution (Silva et al., 2021) required for detecting low-resource rumors are often not covered by the public benchmarks (Zubiaga et al., 2016; Ma et al., 2016, 2017). On the other hand, for rumors propagated in different languages, existing monolingual approaches are not applicable since there is not even sufficient open-domain data for model training in the target language.

In this paper, we assume that the close correlations between the well-resourced rumor data and the low-resourced could break the barriers of domain and language, substantially boosting low-resource rumor detection within a more general framework. Taking the breaking event COVID-19 as an example, we collect corresponding rumorous and non-rumorous claims with propagation threads from Twitter and Sina Weibo, which are the most popular microblogging websites in English and Chinese, respectively. Figure 1 illustrates the word clouds of rumor and non-rumor data from an open domain benchmark (i.e., TWITTER (Ma et al., 2017)) and two COVID-19 datasets (i.e., Twitter-COVID19 and Weibo-COVID19). It can be seen that both TWITTER and Twitter-COVID19 contain denial opinions towards rumors, e.g., "fake", "joke", "stupid" in Figure 1(a) and "wrong symptom", "exactly sick", "health panic" in Figure 1(b). In contrast, supportive opinions towards non-rumors can be drawn from Figure 1(d)-1(e). Moreover, considering that COVID-19 is a global disease, massive misinformation could be widely propagated in different languages such as Arabic (Alam et al., 2020), Indic (Kar et al., 2020), English (Cui and Lee, 2020) and Chinese (Hu et al., 2020). Similar patterns can be observed in Chinese on Weibo from Figure 1(c) and Figure 1(f). Although the COVID-19 data tend to use expertise words or language-related slang, we argue that aligning the representation space of identical rumor-indicative patterns of different domains and/or languages could adapt the features captured from well-resourced data to those of the low-resourced.

To this end, inspired by contrastive learning (Chen et al., 2020b,c), we propose an Adversarial Contrastive Learning approach for low-resource rumor detection (ACLR), to encourage effective alignment of rumor-indicative features in the well-resourced and low-resource data. More specifically, we first transform each microblog post into a language-independent vector by semantically aligning the source and target language in a shared vector space. As the diffusion of rumors generally follows a propagation tree that provides valuable clues on how a claim is transmitted, we resort to a structure-based neural network (Bian et al., 2020) to catch informative patterns. Then, we propose a novel supervised contrastive learning paradigm to minimize the intra-class variance of source and target instances with the same veracity, and maximize the inter-class variance of instances with different veracity. To further enhance the feature adaption of contrastive learning, we exploit adversarial attacks (Kurakin et al., 2016) to add noise to the original event-level representation by computing adversarial worst-case perturbations, forcing the model to learn non-trivial but effective features. Extensive experiments conducted on two real-world low-resource datasets confirm that (1) our model outperforms the state-of-the-art baselines by a large margin in detecting low-resource rumors; and (2) our method performs particularly well on early rumor detection, which is crucial for timely intervention and debunking, especially for breaking events. The main contributions of this paper are three-fold:

• To the best of our knowledge, we are the first to present an adversarial contrastive learning framework to study low-resource rumor detection on social media.

• We propose supervised contrastive learning for structural feature adaption between different domains and languages, with adversarial attacks employed to enhance the diversity of low-resource data for the contrastive paradigm.

• We constructed two low-resource microblog datasets corresponding to COVID-19 with propagation tree structure, respectively gathered from English tweets and Chinese microblog posts. Experimental results show that our model achieves superior performance for both rumor classification and early detection tasks under low-resource settings.

Pioneering studies for automatic rumor detection focus on learning a supervised classifier utilizing features crafted from post contents, user profiles, and propagation patterns (Castillo et al., 2011; Yang et al., 2012; Liu et al., 2015). Subsequent studies then propose new features such as those representing rumor diffusion and cascades (Kwon et al., 2013; Friggeri et al., 2014; Hannak et al., 2014). Zhao et al. (2015) alleviate the engineering effort by using a set of regular expressions to find questioning and denying tweets. DNN-based models such as recurrent neural networks (Ma et al., 2016), convolutional neural networks (Yu et al., 2017), and attention mechanisms (Guo et al., 2018) are then employed to learn the features from the stream of social media posts. However, these approaches simply model the post structure as a sequence while ignoring the complex propagation structure. To extract useful clues jointly from content semantics and propagation structures, some approaches propose kernel-learning models (Wu et al., 2015; Ma et al., 2017) to make a comparison between propagation trees. Tree-structured recursive neural networks (RvNN) and transformer-based models (Khoo et al., 2020; Ma and Gao, 2020) are proposed to generate the representation of each post along a propagation tree guided by the tree structure. More recently, graph neural networks (Bian et al., 2020; Lin et al., 2021a) have been exploited to encode the conversation thread for higher-level representations. However, such data-driven approaches fail to detect rumors in low-resource regimes (Janicka et al., 2019) because they often require sizeable training data which is not available for low-resource domains and/or languages. In this paper, we propose a novel framework to adapt existing models with the effective propagation structure for detecting rumors from different domains and/or languages.

To facilitate related fact-checking tasks in low-resource settings, domain adaption techniques are utilized to detect fake news (Wang et al., 2018; Yuan et al., 2021; Zhang et al., 2020; Silva et al., 2021) by learning features from multi-modal data such as texts and images. Lee et al. (2021) proposed a simple way of leveraging the perplexity score obtained from pre-trained language models (LMs) for the few-shot fact-checking task. Different from these works of adaption on multi-modal data and transfer learning of LMs, we focus on language and domain adaptation to detect rumors from low-resource microblog posts corresponding to breaking events.

Contrastive learning (CL) aims to enhance representation learning by maximizing the agreement among the same types of instances and distinguishing them from others with different types (Wang and Isola, 2020). In recent years, CL has achieved great success in unsupervised visual representation learning (Chen et al., 2020b; Chen et al., 2020c). Besides computer vision, recent studies suggest that CL is promising in semantic textual similarity (Gao et al., 2021), stance detection (Mohtarami et al., 2019), short text clustering, unknown intent detection (Lin et al., 2021b), and abstractive summarization (Liu and Liu, 2021). However, the above CL frameworks are specifically proposed to augment unstructured textual data such as sentences and documents, and are not suitable for the low-resource rumor detection task, which considers claims together with the more complex propagation structures of community response.

In this work, we define the low-resource rumor detection task as follows: given a well-resourced dataset as source, classify each event in the target low-resource dataset as a rumor or not, where the source and target data are from different domains and/or languages. Specifically, we define a well-resourced source dataset for training as a set of events $D_s = \{C^s_1, C^s_2, \cdots, C^s_M\}$, where $M$ is the number of source events. Each event $C^s = (y, c, T(c))$ is a tuple representing a given claim $c$ which is associated with a veracity label $y \in \{\text{rumor}, \text{non-rumor}\}$, and ideally all its relevant responsive microblog posts in chronological order, i.e., $T(c) = \{c, x^s_1, x^s_2, \cdots, x^s_{|C|}\}$, where $|C|$ is the number of responsive tweets in the conversation thread. For the target dataset with low-resource domains and/or languages, we consider a much smaller dataset for training, $D_t = \{C^t_1, C^t_2, \cdots, C^t_N\}$, where $N$ ($N \ll M$) is the number of target events and each $C^t = (y, c', T(c'))$ has a composition structure similar to that of the source dataset.

Figure 2: The overall architecture of our proposed method. For source and small target training data, we first obtain post-level representations after cross-lingual sentence encoding, then train the structure-based network with the adversarial contrastive objective. For target test data, we extract the event-level representations to detect rumors.

We formulate the task of low-resource rumor detection as a supervised classification problem that trains a domain/language-agnostic classifier $f(\cdot)$ adapting the features learned from source datasets to those of the target events, that is, $f(C^t | D_s) \rightarrow y$. Note that although the tweets are notated sequentially, there are connections among them based on their responsive relationships, so most previous works represent the conversation thread as a tree structure (Bian et al., 2020).
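As an illustration of the event tuple $(y, c, T(c))$ and the tree-shaped conversation thread, the following minimal sketch stores each post with the index of the post it responds to. The class names `Post` and `Event`, the `parent` field, and the reply texts are hypothetical choices made for this example; they are not taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Post:
    text: str
    parent: Optional[int]  # index of the post this one responds to; None for the claim


@dataclass
class Event:
    label: str                                       # "rumor" or "non-rumor"
    posts: List[Post] = field(default_factory=list)  # posts[0] is the claim c

    def edges(self) -> List[Tuple[int, int]]:
        """Return (parent, child) index pairs of the propagation tree."""
        return [(p.parent, i) for i, p in enumerate(self.posts) if p.parent is not None]


# Hypothetical thread: one claim and three responsive posts in chronological order.
event = Event(
    label="rumor",
    posts=[
        Post("Everyone who gets the vaccine will die.", None),
        Post("This is fake.", 0),
        Post("Source?", 0),
        Post("Agreed, no evidence at all.", 1),
    ],
)
print(event.edges())
```

Keeping the posts in chronological order while recording the responded-to index preserves both views of the thread: the sequence used in the notation and the tree used by structure-based models.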

In this section, we introduce our adversarial contrastive learning framework to adapt the features captured from the well-resourced data to detect rumors from low-resource events, which considers cross-lingual and cross-domain alignment. Figure 2 illustrates an overview of our proposed model, which will be depicted in the following subsections.

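Given a post in an event that could be either from source or target data, to map it into a shared semantic space where the source and target languages are semantically aligned, we utilize XLM-RoBERTa (Conneau et al., 2019) (XLM-R) to model the context interactions among tokens in the sequence for the sentence-level representation:

$$\bar{x} = \text{XLM-R}(x) \quad (1)$$

where $x$ is the original post, and we obtain the post-level representation $\bar{x}$ using the output state of the <s> token in XLM-R. We thus denote the representation of posts in the source event $C^s$ and the target event $C^t$ as matrices $X^s$ and $X^t$, respectively:

$$X^s = [\bar{x}^s_0, \bar{x}^s_1, \cdots, \bar{x}^s_{m-1}]^\top, \quad X^t = [\bar{x}^t_0, \bar{x}^t_1, \cdots, \bar{x}^t_{n-1}]^\top$$

where $X^s \in \mathbb{R}^{m \times d}$ and $X^t \in \mathbb{R}^{n \times d}$, $m$ and $n$ are the numbers of posts in the source and target events, $\bar{x}_0$ denotes the representation of the claim, and $d$ is the dimension of the output state of the sentence encoder.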

On top of the sentence encoder, we represent the propagation of each claim with the graph convolutional network (GCN) (Kipf and Welling, 2016), which achieves state-of-the-art performance on capturing both structural and semantic information for rumor classification (Bian et al., 2020). It is worth noting that the choice of propagation structure representation is orthogonal to our proposed framework: it can be easily replaced with any existing structure-based model without any other change to our supervised contrastive learning architecture.

Given an event and its initialized embedding matrix $C^*, X^*$; $* \in \{s, t\}$, we model the conversation thread of the event as a tree structure $T = \langle V, E \rangle$, where $V$ consists of the event claim and all its relevant responsive posts as nodes and $E$ refers to a set of directed edges corresponding to the response relation among the nodes in $V$. Here we consider two different propagation trees with distinct edge directions: (1) a Top-Down tree, where the edges follow the direction of information diffusion; and (2) a Bottom-Up tree, where the responsive nodes point to the nodes they respond to, similar to a citation network.

Top-Down GCN. We treat the Top-Down tree structure as a graph and transform the edge set $E$ into an adjacency matrix $A^{TD} \in \{0, 1\}^{|V| \times |V|}$. Then we utilize a layer-wise propagation rule to update the node vectors at the $l$-th layer:

$$H^{(l+1)} = \text{ReLU}(\hat{A} \cdot H^{(l)} \cdot W^{(l)}) \quad (2)$$

where $\hat{A}$ is the normalized adjacency matrix, $W^{(l)}$ is a layer-specific trainable weight matrix, and $H^{(0)}$ is initialized with $X^*$. We denote the output node states at the $L$-th graph convolutional layer as $H^{TD}$.

Bottom-Up GCN. We also leverage the structure of the Bottom-Up tree to encode the informative posts. Similar to the Top-Down GCN, we update the hidden representation of nodes in the same manner as Eq. 2 and finally get the output node states $H^{BU}$ at the $L$-th graph convolutional layer.
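The layer-wise update on the Top-Down and Bottom-Up trees can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: self-loops are added to the adjacency matrix, degrees are taken as row sums for the symmetric normalization, and the helper names (`adjacency`, `normalize`, `gcn_layer`), the toy features `H0` and the identity weight matrix `W` are all hypothetical.

```python
import math


def adjacency(parents, direction):
    """Build a |V| x |V| 0/1 matrix with self-loops from a parent-index list."""
    n = len(parents)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for child, parent in enumerate(parents):
        if parent is None:          # the claim (root) has no parent
            continue
        if direction == "top_down":
            A[child][parent] = 1.0  # a reply aggregates from the post it responds to
        else:
            A[parent][child] = 1.0  # a post aggregates from its replies
    return A


def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, with row sums as degrees."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


# Toy thread: post 0 is the claim, posts 1 and 2 reply to it, post 3 replies to post 1.
parents = [None, 0, 0, 1]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]  # stand-in post embeddings
W = [[1.0, 0.0], [0.0, 1.0]]                           # identity weights for readability

H_td = gcn_layer(normalize(adjacency(parents, "top_down")), H0, W)
H_bu = gcn_layer(normalize(adjacency(parents, "bottom_up")), H0, W)
print(H_td)
print(H_bu)
```

The two directions differ only in which neighbor each node aggregates from, so the same layer function serves both trees.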

The Overall Model. Finally, we concatenate $H^{TD}$ and $H^{BU}$ and apply mean-pooling to jointly capture the opinions expressed in both Top-Down and Bottom-Up trees:

$$o = \text{mean-pooling}([H^{TD}; H^{BU}]) \quad (3)$$

where $o \in \mathbb{R}^{2d^{(L)}}$ is the event-level representation of the entire propagation thread, $d^{(L)}$ is the output dimension of GCN and $[\cdot; \cdot]$ means concatenation.
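To make the shape of the event-level representation concrete, here is a small sketch of the pooling step. It assumes that pooling averages over the nodes of the thread; averaging each direction and then concatenating gives the same vector as concatenating per node and then averaging. The function names and toy matrices are hypothetical.

```python
def mean_pool(H):
    """Average the node states (rows) of H into a single vector."""
    n = len(H)
    return [sum(col) / n for col in zip(*H)]


def event_representation(H_td, H_bu):
    """Concatenate the pooled Top-Down and Bottom-Up states into one vector of size 2d."""
    return mean_pool(H_td) + mean_pool(H_bu)


H_td = [[1.0, 2.0], [3.0, 4.0]]  # two nodes, d = 2
H_bu = [[0.0, 2.0], [2.0, 0.0]]
o = event_representation(H_td, H_bu)
print(o)
```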

To align the representation space of rumor-indicative signals from different domains and languages, we present a novel training paradigm to exploit the labeled data, including rich source data and small-scale target data, to adapt our model to target domains and languages. The core idea is to make the representations of source and target events from the same class closer while keeping representations from different classes far away.

Given an event $C^s_i$ from the source data, we first obtain the language-agnostic encoding for all the involved posts (see Eq. 1) as well as the propagation structure representation $o^s_i$ (see Eq. 3), which is then fed into a softmax function to make rumor predictions. Then, we learn to minimize the cross-entropy loss between the prediction and the ground-truth label $y^s_i$:

$$\mathcal{L}^s_{CE} = -\frac{1}{N^s} \sum_{i=1}^{N^s} \log(p_i) \quad (4)$$

where $N^s$ is the total number of source examples in the batch and $p_i$ is the predicted probability of the correct class.
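The classification loss can be written out in a few lines. The sketch below assumes that the classifier produces one logit per class for each event; the logits used here are made-up numbers, and the function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the ground-truth class."""
    loss = 0.0
    for logits, y in zip(batch_logits, labels):
        loss -= math.log(softmax(logits)[y])
    return loss / len(labels)


# Two events, classes indexed as 0 = non-rumor, 1 = rumor (an arbitrary convention).
batch_logits = [[2.0, 0.0], [0.5, 1.5]]
labels = [0, 1]
print(cross_entropy(batch_logits, labels))
```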

To make the rumor representations in the source events more discriminative, we propose a supervised contrastive learning objective to cluster the same class and separate different classes of samples:

$$\mathcal{L}^s_{SCL} = -\frac{1}{N^s} \sum_{i=1}^{N^s} \frac{1}{N_{y^s_i} - 1} \sum_{j=1}^{N^s} \mathbb{1}_{[i \neq j]} \mathbb{1}_{[y^s_i = y^s_j]} \log \frac{\exp(\text{sim}(o^s_i, o^s_j)/\tau)}{\sum_{k=1}^{N^s} \mathbb{1}_{[i \neq k]} \exp(\text{sim}(o^s_i, o^s_k)/\tau)} \quad (5)$$

where $N_{y^s_i}$ is the number of source examples with the same label $y^s_i$ in the batch, $\mathbb{1}$ is the indicator function, $\text{sim}(\cdot)$ denotes the cosine similarity and $\tau$ is the temperature.

For an event $C^t_i$ from the target data, we also compute the classification loss $\mathcal{L}^t_{CE}$ in the same manner as Eq. 4. Although we projected the source and target languages into the same semantic space after sentence encoding, rumor detection relies not only on post-level features but also on event-level contextual features. Without constraints, the structure-based network can only extract event-level features for all samples based on their final classification signals, while these features may not be critical to the target domain and language. We make full use of the few labels in the low-resource rumor data by parameterizing our model according to the contrastive objective between the source and target instances in the event-level representation space:

$$\mathcal{L}^t_{SCL} = -\frac{1}{N^t} \sum_{i=1}^{N^t} \frac{1}{N_{y^t_i}} \sum_{j=1}^{N^s} \mathbb{1}_{[y^t_i = y^s_j]} \log \frac{\exp(\text{sim}(o^t_i, o^s_j)/\tau)}{\sum_{k=1}^{N^s} \exp(\text{sim}(o^t_i, o^s_k)/\tau)} \quad (6)$$

where $N^t$ is the total number of target examples in the batch and $N_{y^t_i}$ is the number of source examples with the same label $y^t_i$ as the target event $C^t_i$. As a result, we project the source and target samples belonging to the same class closer than those of different categories, for feature alignment with little annotation in the target domain and language.

Algorithm 1: Training process of our approach.
Input: A small set of events $C^t_i$ in the target domain and language; a set of events $C^s_i$ in the source domain and language.
Output: Assign rumor labels $y$ to given unlabeled target data.
1: for each mini-batch $N^t$ of the target events $C^t_i$ do
2:   for each mini-batch $N^s$ of the source events $C^s_i$ do
3:     Pass $C^*_i$ to the sentence encoder and then the structure-based network to obtain its event-level feature $o^*_i$, where $* \in \{s, t\}$.
4:     Compute the classification loss $\mathcal{L}^*_{CE}$ for source and target data, respectively.
5:     Apply adversarial augmentation to the target data and update $\mathcal{L}^t_{CE}$.
6:     Compute the supervised contrastive loss $\mathcal{L}^*_{SCL}$.
7:     Compute the joint loss $\mathcal{L}^*$ as Eq. 8.
8:     Jointly optimize all parameters of the model using the average loss $\mathcal{L} = \text{mean}(\mathcal{L}^s + \mathcal{L}^t)$.
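A compact way to read the source-target contrastive objective is as a softmax over source events in which the same-class source events act as positives for each target event. The sketch below follows that reading with cosine similarity and a temperature of 0.1 (the value reported in the training details); the function names are hypothetical, and skipping a target event that has no same-class source event in the batch is an assumption of this example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_domain_scl(target_feats, target_labels, source_feats, source_labels, tau=0.1):
    """Pull each target event towards same-class source events, away from the rest."""
    total = 0.0
    for o_t, y_t in zip(target_feats, target_labels):
        sims = [math.exp(cosine(o_t, o_s) / tau) for o_s in source_feats]
        denom = sum(sims)
        positives = [s for s, y_s in zip(sims, source_labels) if y_s == y_t]
        if not positives:
            continue  # assumption: no same-class source event in this batch
        total -= sum(math.log(p / denom) for p in positives) / len(positives)
    return total / len(target_feats)


source_feats = [[1.0, 0.0], [0.0, 1.0]]
source_labels = [0, 1]
aligned = cross_domain_scl([[1.0, 0.0]], [0], source_feats, source_labels)
misaligned = cross_domain_scl([[1.0, 0.0]], [1], source_feats, source_labels)
print(aligned, misaligned)
```

A target event that already sits next to its same-class source event yields a loss near zero, while one that sits next to the opposite class is penalized heavily, which is the alignment pressure described above.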

Data augmentation techniques have been successfully utilized to enhance contrastive learning models (Chen et al., 2020b). Some simple augmentation strategies are designed based on handcrafted features or rules, but they are neither efficient nor suitable for the propagation tree structures in the rumor detection task. In this section, we introduce adversarial attacks to generate pseudo target samples in the event-level latent space, increasing the diversity of views for model robustness in the contrastive learning manner. Specifically, we apply Fast Gradient Value (Miyato et al., 2016; Vedula et al., 2020) to approximate a worst-case perturbation as a noise vector of the event-level representation:

$$\tilde{o}^t_{noise} = \epsilon \frac{g}{\|g\|_2}, \quad g = \nabla_{o^t} \mathcal{L}^t_{CE} \quad (7)$$

where the gradient $g$ is the first-order differential of the classification loss $\mathcal{L}^t_{CE}$ for a target sample, i.e., the direction that rapidly increases the classification loss. We perform normalization and use a small $\epsilon$ to ensure the approximation is reasonable. Finally, we obtain the pseudo augmented sample $o^t_{adv} = o^t + \tilde{o}^t_{noise}$ in the latent space to enhance our model.
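The perturbation can be illustrated without an autograd library by assuming a linear softmax classifier on top of the event representation, for which the gradient of the cross-entropy loss with respect to the input has the closed form $W^\top(p - \text{onehot}(y))$. The linear classifier, the weights `W` and `b`, and the function names are assumptions of this sketch; the norm of 1.5 is the value reported in the training details.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(o, W, b, y):
    """Cross-entropy of a linear softmax classifier for a single event."""
    logits = [sum(w * x for w, x in zip(row, o)) + bias for row, bias in zip(W, b)]
    return -math.log(softmax(logits)[y])


def fgv_noise(o, W, b, y, eps=1.5):
    """eps * g / ||g||_2, where g is the gradient of the loss w.r.t. the representation o."""
    logits = [sum(w * x for w, x in zip(row, o)) + bias for row, bias in zip(W, b)]
    p = softmax(logits)
    delta = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    grad = [sum(W[k][j] * delta[k] for k in range(len(W))) for j in range(len(o))]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(o)  # no direction to move in
    return [eps * g / norm for g in grad]


o = [2.0, 0.0]                # hypothetical event-level representation
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical classifier weights
b = [0.0, 0.0]
noise = fgv_noise(o, W, b, y=0)
o_adv = [a + n for a, n in zip(o, noise)]
print(ce_loss(o, W, b, 0), ce_loss(o_adv, W, b, 0))
```

In this toy case the perturbed representation has a visibly higher loss than the original one, so it serves as a harder view of the same event.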

We jointly train the model with the cross-entropy and supervised contrastive objectives:

$$\mathcal{L}^* = (1 - \alpha)\mathcal{L}^*_{CE} + \alpha\mathcal{L}^*_{SCL}; \quad * \in \{s, t\} \quad (8)$$

where $\alpha$ is a trade-off parameter, which is set to 0.5 in our experiments. Algorithm 1 presents the training process of our approach. We set the number $L$ of graph convolutional layers to 2, the temperature $\tau$ to 0.1, and the adversarial perturbation norm $\epsilon$ to 1.5. Parameters are updated through back-propagation (Collobert et al., 2011) with the Adam optimizer (Loshchilov and Hutter, 2018). The learning rate is initialized as 0.0001, and the dropout rate is 0.2. Early stopping (Yao et al., 2007) is applied to avoid overfitting.
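As a small worked example of the joint objective, the sketch below combines the two losses for each domain with the trade-off weight and then averages the source and target terms. Reading the "average loss" of Algorithm 1 as the mean of the two per-domain losses is an assumption, and the loss values are made-up numbers.

```python
def joint_loss(ce, scl, alpha=0.5):
    """(1 - alpha) * cross-entropy + alpha * supervised contrastive loss."""
    return (1.0 - alpha) * ce + alpha * scl


def batch_objective(ce_s, scl_s, ce_t, scl_t, alpha=0.5):
    """Average of the source and target joint losses (assumed reading of Algorithm 1)."""
    loss_s = joint_loss(ce_s, scl_s, alpha)
    loss_t = joint_loss(ce_t, scl_t, alpha)
    return (loss_s + loss_t) / 2.0


# Made-up loss values for one pair of mini-batches.
print(batch_objective(ce_s=1.0, scl_s=3.0, ce_t=2.0, scl_t=2.0))
```

Setting `alpha=0.0` recovers plain cross-entropy training, which is a convenient sanity check when tuning the trade-off.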

To our knowledge, there are no public benchmarks available for detecting low-resource rumors with propagation tree structure in tweets. In this paper, we consider the breaking event COVID-19 as a low-resource domain and collect relevant rumors and non-rumors from Twitter in English and Sina Weibo in Chinese, respectively. For Twitter-COVID19, we resort to a COVID-19 rumor dataset (Kar et al., 2020) which only contains textual claims without propagation threads. We extend each claim by collecting its propagation threads via the Twitter academic API with the twarc2 package. For Weibo-COVID19, similar to Ma et al. (2016), a set of related rumorous claims is gathered from the Sina community management center, and non-rumorous claims are gathered by randomly filtering out the posts that are not reported as rumors. Then the Weibo API is utilized to collect all the repost/reply messages towards each claim (see Appendix for the dataset statistics).

We compare our model with several state-of-the-art baseline methods described below. 1) CNN: A CNN-based model for misinformation identification (Yu et al., 2017) by framing the relevant posts as a fixed-length sequence; 2) RNN: An RNN-based rumor detection model (Ma et al., 2016) with GRU for feature learning of relevant posts over time; 3) RvNN: A rumor detection approach based on tree-structured recursive neural networks that learn rumor representations guided by the propagation structure; 4) PLAN: A transformer-based model (Khoo et al., 2020) for rumor detection to capture long-distance interactions between any pair of involved tweets; 5) BiGCN: A GCN-based model (Bian et al., 2020) based on directed conversation trees to learn higher-level representations (see Section 4.2); 6) DANN-*: We employ and extend an existing few-shot learning technique, domain-adversarial neural network (Ganin et al., 2016), based on the structure-based model where * could be RvNN, PLAN, and BiGCN; 7) ACLR-*: our proposed adversarial contrastive learning framework on top of RvNN, PLAN, or BiGCN.

In this work, we consider the most challenging setting: detecting events (i.e., target) from a low-resource domain while also in a cross-lingual regime. Note that although English and Chinese in our datasets are not minority languages, the target domain and/or languages can be easily replaced without any change to our ACLR framework. Specifically, we use the well-resourced TWITTER (Ma et al., 2017) (or WEIBO (Ma et al., 2016)) datasets as the source data, and Weibo-COVID19 (or Twitter-COVID19) datasets as the target. We use accuracy and macro-averaged F1, as well as class-specific F1 scores, as the evaluation metrics. We conduct 5-fold cross-validation on the target datasets (see more details in Appendix).

Table 1 shows the performance of our proposed method versus all the compared methods on the Weibo-COVID19 and Twitter-COVID19 test sets with pre-determined training datasets. To make fair comparisons, all baselines take the same cross-lingual sentence encoder of our framework as input. It is observed that the baselines in the first group perform poorly because they ignore intrinsic structural patterns. The other state-of-the-art baselines exploit the structural property of community wisdom on social media, which confirms the necessity of propagation structure representations in our framework.
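For reference, the evaluation metrics named above (accuracy, class-specific F1 and macro-averaged F1) can be computed as in the following sketch. The labels and predictions are made-up examples, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the class-specific F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = ["rumor", "rumor", "non-rumor", "non-rumor"]
y_pred = ["rumor", "non-rumor", "non-rumor", "non-rumor"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Macro averaging weights both classes equally, so a model that favors the majority class is penalized even when its accuracy looks acceptable.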

Among the structure-based baselines in the second group, due to the representation power of message-passing architectures and tree structures, PLAN and BiGCN outperform RvNN with only limited labeled target data for training. The third group shows the results for DANN-based methods. It improves the performance of structure-based baselines in general since it extracts cross-domain features between source and target datasets via generative adversarial nets (Goodfellow et al., 2014). Different from that, we use the adversarial attacks to improve the robustness of our proposed contrastive training paradigm, which explicitly encourages effective alignment of rumor-indicative features from different domains and languages.

In contrast, our proposed ACLR-based approaches outperform all their counterparts, with margins ranging from 21.8% (13.4%) to 30.0% (17.7%) in terms of Macro F1 score on the Weibo-COVID19 (Twitter-COVID19) datasets, which suggests their strong judgment on low-resource rumors from different domains/languages. ACLR-BiGCN performs the best among the three ACLR-based methods by making full use of the structural property via graph modeling for conversation threads. This also explains the good performance of DANN-BiGCN and BiGCN. The results also indicate that the adversarial contrastive learning framework can effectively transfer knowledge from the source to target data at the event level, and substantiate that our method is model-agnostic with respect to different structure-based networks.

We perform ablation studies based on our best-performing approach ACLR-BiGCN. As demonstrated in Table 2, the first group shows the results for the backbone baseline BiGCN. We observe that the model performs best if pre-trained on source data and then fine-tuned on target training data (i.e., BiGCN(S,T)), compared with the poor performance when trained on either the small labeled target data only (i.e., BiGCN(T)) or the well-resourced source data only (i.e., BiGCN(S)). This suggests that our hypothesis of leveraging well-resourced source data to improve low-resource rumor detection on target data is feasible. In the second group, the DANN-based model makes better use of the source data to extract domain-agnostic features, which further leads to performance improvement. Our proposed contrastive learning approach CLR, without the adversarial augmentation mechanism, already achieves outstanding performance compared with other baselines, which illustrates its effectiveness on domain and language adaptation. We further notice that our ACLR-BiGCN consistently outperforms all baselines and improves the prediction performance of CLR-BiGCN, suggesting that training the model together with adversarial augmentation on target data provides positive guidance for more accurate rumor predictions, especially in low-resource regimes. More qualitative analyses of hyper-parameters, training data size and alternative source datasets are shown in Appendix.

Table 2: Ablation studies on our proposed model.

Early alerts of rumors are essential to minimize their social harm. By setting detection checkpoints of "delays", which can be either the count of reply posts or the time elapsed since the first posting, only contents posted no later than the checkpoints are available for model evaluation. The performance is evaluated by Macro F1 obtained at each checkpoint.
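The checkpoint mechanism amounts to truncating each test thread before it is passed to the model. The sketch below assumes that a thread is stored as a chronologically sorted list of (hours since the claim, parent index) pairs with the claim first; this representation and the function name `truncate_thread` are hypothetical.

```python
def truncate_thread(posts, max_posts=None, max_hours=None):
    """Keep the claim plus the replies available at a detection checkpoint.

    posts: chronologically sorted list of (hours_since_claim, parent_index);
           posts[0] is the claim itself.
    max_posts: keep at most this many reply posts (None means no limit).
    max_hours: keep replies posted no later than this delay (None means no limit).
    """
    kept = []
    for i, (hours, parent) in enumerate(posts):
        if i > 0:
            if max_posts is not None and i > max_posts:
                break
            if max_hours is not None and hours > max_hours:
                break
        kept.append((hours, parent))
    return kept


thread = [(0.0, None), (0.5, 0), (1.0, 0), (3.0, 1), (6.0, 2)]
print(truncate_thread(thread, max_posts=2))
print(truncate_thread(thread, max_hours=4.0))
```

Because a reply is always posted after the post it responds to, cutting a chronologically sorted thread at a checkpoint leaves a valid tree.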

To satisfy each checkpoint, we incrementally scan test data in order of time until the target time delay or post volume is reached. Figure 3 shows the performances of our approach versus DANN-BiGCN, BiGCN, PLAN, and RvNN at various deadlines. Firstly, we observe that our proposed ACLR-based approach outperforms other counterparts and baselines throughout the whole lifecycle, and reaches a relatively high Macro F1 score at a very early period after the initial broadcast. One interesting phenomenon is that the early performance of some methods fluctuates. This is because, as the claim propagates, more semantic and structural information becomes available but the amount of noisy information increases simultaneously. Our method needs only about 50 posts on Weibo-COVID19 and around 4 hours on Twitter-COVID19 to reach saturated performance, indicating the superior early detection performance of our method.

Figure 4 shows the PCA visualization of learned target event-level features on BiGCN (left) and ACLR-BiGCN (right) for analysis. The left figure represents training with only the classification loss, and the right figure uses ACLR for training. We observe that (1) due to the lack of sufficient training data, the features extracted with the traditional training paradigm are entangled, making it difficult to detect rumors in low-resource regimes; and (2) our ACLR-based approach learns more discriminative representations to improve low-resource rumor classification, reaffirming that our training paradigm can effectively transfer knowledge to bridge the gap between source and target data distributions resulting from different domains and languages.

In this paper, we proposed a novel Adversarial Contrastive Learning framework to bridge low-resource gaps for rumor detection by adapting features learned from well-resourced data to those of the low-resource breaking events. Results on two real-world benchmarks confirm the advantages of our model in the low-resource rumor detection task. In our future work, we plan to collect data from, and apply our model to, other domains and minority languages.

We run all of our experiments on one single NVIDIA Tesla T4 GPU. We set the total batch size to 64, where the batch size of source samples is set to 32, the same as that of target samples. The hidden and output dimensions of each node in the structure-based network are set to 512 and 128, respectively. Since the focus of this paper is primarily on better leveraging contrastive learning for domain and language adaptation on top of event-level representations, we choose XLM-R Base (Layer number = 12, Hidden dimension = 768, Attention head = 12, 270M params) as our sentence encoder for language-agnostic representations at the post level. We use accuracy and macro-averaged F1 score, as well as class-specific F1 score, as the evaluation metrics. Unlike the usual protocol, to conduct five-fold cross-validation on the target dataset in our low-resource settings, we use each fold (about 80 claim posts with propagation threads in the target data) in turn for training, and test on the rest of the dataset. The average runtime for our approach on five-fold cross-validation in one iteration is about 3 hours. The number of total trainable parameters is 1,117,954 for our model. We implement our model with PyTorch.
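The inverted cross-validation protocol, in which a single fold is used for training and the remaining four for testing, can be sketched as follows. The total of 400 events is a hypothetical figure chosen so that each fold holds 80 events, and the shuffling seed and function name are likewise assumptions.

```python
import random


def low_resource_folds(n_events, k=5, seed=0):
    """Yield (train, test) index lists: train on one fold, test on the other k - 1."""
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, train in enumerate(folds):
        test = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(low_resource_folds(400))
for train, test in splits:
    print(len(train), len(test))
```

Compared with standard k-fold cross-validation, the roles of the folds are swapped, which keeps the amount of labeled target data seen in training deliberately small.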

C Qualitative Analysis

C.1 Effect of Adversarial Perturbation Norm

Figure 5 shows the effect of the adversarial perturbation norm on rumor detection performance. The X-axis denotes the value of $\epsilon$, where $\epsilon = 0.0$ means no adversarial augmentation. In general, the adversarial augmentation contributes to the improvements and $\epsilon \in [1.0, 2.0)$ achieves better performances. For the Weibo-COVID19 dataset, our proposed approach ACLR with a smaller adversarial perturbation can still obtain better results, though lower than those within the optimal range of perturbation, while large norms tend to damage the effect of ACLR. In terms of Twitter-COVID19, our method still performs well over a broad range of adversarial perturbations and the performance tends to stabilize as the norm value increases.

To study the effects of the trade-off hyperparameter in our training paradigm, we conduct ablation analysis under ACLR architecture (Figure 6) . We can see that α = 0.5 achieves the best performance while the point where α = 0.3 also has good performance. Looking at the overall trend, the performance fluctuates more or less as the value of α grows. We conjecture that this is because the supervised contrastive objective, while optimizing the representation distribution, compromises the mapping relationship with labels. Multitask means optimizing two losses simultaneously. This setting leads to mutual interference between two tasks, which affects the convergence effect. This phenomenon points out the direction for our further research in the future.

C.3 Effect of Target Training Data Size. Figure 7 shows the effect of target training data size. We randomly choose a given proportion of the target data for training and use the rest for evaluation. We use the cross-domain and cross-lingual settings concurrently for model training, the same as in the main experiments. Results show that performance gradually decreases as the training data size decreases, and Weibo-COVID19 is affected most strongly. However, even when only 20 target samples are used for training, our model still achieves at least roughly 60% and 65% Macro F1 on the two target datasets Weibo-COVID19 and Twitter-COVID19, respectively, which further indicates that ACLR is well suited to improving low-resource rumor detection on social media.
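The Macro F1 scores quoted above are the unweighted mean of the per-class F1 scores, so the Rumor and Non-rumor classes count equally regardless of their frequency. A minimal sketch of the computation follows; the function names and the toy labels ("R" for Rumor, "N" for Non-rumor) are illustrative assumptions, not the evaluation script used in the paper.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1} computed from true/false positives and false negatives."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the class-specific F1 scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = ["R", "R", "R", "R", "N", "N"]
y_pred = ["R", "R", "R", "N", "N", "R"]
per_class = f1_per_class(y_true, y_pred, ["R", "N"])
print(per_class, macro_f1(y_true, y_pred, ["R", "N"]), accuracy(y_true, y_pred))
```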

In this section, we evaluate our proposed framework with different source datasets to discuss the low-resource settings in our experiments. In addition to the concurrent cross-domain and cross-lingual settings of the main experiments, we also conduct an experiment in cross-domain-only settings. Specifically, with Weibo-COVID19 as the target data, we use the richly annotated WEIBO dataset as the source data. For Twitter-COVID19, we use the TWITTER dataset as the source data. Table 4 shows the results in the different low-resource settings. Our model generally performs better in the concurrent cross-domain and cross-lingual settings than in the cross-domain-only settings, which suggests that the key to bridging the low-resource gap is to relieve the dependency on a specific language resource in addition to the specific domain. Our proposed adversarial contrastive learning framework could alleviate the low-resource issue of rumor detection as well as reduce the heavy reliance on datasets annotated with specific domain and language knowledge.

We will explore the following directions in the future:

1. We are going to explore pre-training the model with contrastive learning and then fine-tuning it with the classification loss, which may further improve the performance and stability of the model.

2. Considering that our model has explicitly overcome the restriction of both domain and language usage in different datasets, we plan to evaluate our model on the datasets about more breaking events in low-resource domains and/or languages by leveraging existing datasets with rich annotation. We believe that our work could provide new guidance for future rumor detection about breaking events on social media.

Fighting the covid-19 infodemic: modeling the perspective of journalists, factcheckers, social media platforms, policy makers, and the society

The psychology of rumor

Large arabic twitter dataset on covid-19

Rumor detection on social media with bi-directional graph convolutional networks

Information credibility on twitter

Covid-19: The first public coronavirus twitter dataset

A simple framework for contrastive learning of visual representations

Big self-supervised models are strong semi-supervised learners

Natural language processing (almost) from scratch

Unsupervised cross-lingual representation learning at scale

Coaid: Covid-19 healthcare misinformation dataset

Rumor cascades

Domain-adversarial training of neural networks. The journal of machine learning research

Simcse: Simple contrastive learning of sentence embeddings

Generative adversarial nets

Rumor detection with hierarchical social attention network

Get back! you don't know me like that: The social mediation of fact checking interventions in twitter conversations

Momentum contrast for unsupervised visual representation learning

A survey on recent approaches for natural language processing in low-resource scenarios

Weibo-cov: A large-scale covid-19 social media dataset from weibo

Cross-domain failures of fake news detection

No rumours please! a multi-indic-lingual approach for covid fake-tweet detection

Interpretable rumor detection in microblogs by attending to user interactions

Semi-supervised classification with graph convolutional networks

Adversarial examples in the physical world

Prominent features of rumor propagation in online social media

Towards few-shot fact-checking via perplexity

Rumor detection on twitter with claim-guided hierarchical graph attention networks

Boosting low-resource intent detection with in-scope prototypical networks

Real-time rumor debunking on twitter

Simcls: A simple framework for contrastive learning of abstractive summarization

Decoupled weight decay regularization

Debunking rumors on twitter with tree transformer

Detecting rumors from microblogs with recurrent neural networks

Detect rumors in microblog posts using propagation structure via kernel learning

Rumor detection on twitter with tree-structured recursive neural networks

Adversarial training methods for semi-supervised text classification

Contrastive language adaptation for crosslingual stance detection

Fake news detection on social media: A data mining perspective

Embracing domain differences in fake news: Cross-domain fake news detection using multi-modal data

Open intent extraction from natural language interactions

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

Eann: Event adversarial neural networks for multi-modal fake news detection

False rumors detection on sina weibo by propagation structures

Consert: A contrastive framework for self-supervised sentence representation transfer

Automatic detection of rumor on sina weibo

On early stopping in gradient descent learning

A convolutional approach for misinformation identification

Improving fake news detection with domain-adversarial and graph-attention neural network. Decision Support Systems

A first instagram dataset on covid-19

Supporting clustering with contrastive learning

Bdann: Bert-based domain adaptation neural network for multi-modal fake news detection

Enquiring minds: Early detection of rumors in social media from enquiry posts

Detection and resolution of rumours in social media: A survey

Learning reporting dynamics during breaking news for rumour detection in social media

The focus of this work, as in many previous studies (Ma et al., 2017; Khoo et al., 2020; Bian et al., 2020), is rumors on social media, not just "fake news" strictly defined as a news article published by a news outlet that is verifiably false (Zubiaga et al., 2018). To our knowledge, there is no public dataset available for classifying propagation trees in tweets about COVID-19, for which the tree roots, together with the corresponding propagation structure, must be appropriately annotated with ground truth. In this paper, we organize and construct two datasets, Weibo-COVID19 and Twitter-COVID19, for our experiments. For Twitter-COVID19, the original dataset of tweets (Kar et al., 2020) was released with only the source tweets and without their propagation threads, so we collected all the propagation threads using the Twitter academic API with the twarc2 package in Python. Finally, we annotated the source tweets by referring to the labels of the events they belong to in the raw COVID-19 rumor dataset (Kar et al., 2020), where rumors contain fact or misinformation to be verified while non-rumors do not. For Weibo-COVID19, data annotation is similar to Ma et al. (2016): a set of rumorous claims is gathered from the Sina community management center, and non-rumorous claims are obtained by randomly filtering out the posts that are not reported as rumors. The Weibo API is used to collect all the repost/reply messages for each claim. Both Weibo-COVID19 and Twitter-COVID19 are labeled with two classes: Rumor and Non-rumor. For Weibo-COVID19 as the target dataset, we use the TWITTER dataset (Ma et al., 2017) as the source data in our low-resource (i.e., cross-domain and cross-lingual) settings; for Twitter-COVID19 as the target dataset, we use WEIBO (Ma et al., 2016) as the source data. The statistics of the four datasets are shown in Table 3.
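Each sample in these datasets pairs a source claim (the tree root) with its propagation thread. As an illustration of that data structure, the sketch below assembles a propagation tree from reply records. The `(post_id, parent_id)` schema, the function name, and the toy identifiers are hypothetical; they do not reflect the actual fields returned by the Twitter or Weibo APIs.

```python
from collections import defaultdict


def build_propagation_tree(source_id, replies):
    """Map each post reachable from the source claim to its direct responses.

    `replies` is an iterable of (post_id, parent_id) pairs (hypothetical schema).
    """
    children = defaultdict(list)
    for post_id, parent_id in replies:
        children[parent_id].append(post_id)

    tree, stack = {}, [source_id]
    while stack:
        node = stack.pop()
        if node in tree:
            continue  # guard against malformed records that form a cycle
        tree[node] = list(children.get(node, []))
        stack.extend(tree[node])
    return tree


replies = [("r1", "s"), ("r2", "s"), ("r3", "r1"), ("x", "missing")]
tree = build_propagation_tree("s", replies)
print(tree)
```

Posts whose parent cannot be traced back to the source claim, such as `x` here, are left out of the tree.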

We set the number of graph convolutional layers L to 2, the trade-off parameter α to 0.5, and the adversarial perturbation norm to 1.5. The temperature τ is set to 0.1. Parameters are updated through back-propagation (Collobert et al., 2011) with the Adam optimizer (Loshchilov and Hutter,