title: Graph-based Modeling of Online Communities for Fake News Detection
authors: Chandra, Shantanu; Mishra, Pushkar; Yannakoudakis, Helen; Nimishakavi, Madhav; Saeidi, Marzieh; Shutova, Ekaterina
date: 2020-08-14

Over the past few years, there has been a substantial effort towards automated detection of fake news on social media platforms. Existing research has modeled the structure, style, content, and patterns in dissemination of online posts, as well as the demographic traits of users who interact with them. However, no attention has been directed towards modeling the properties of online communities that interact with the posts. In this work, we propose a novel social context-aware fake news detection framework, SAFER, based on graph neural networks (GNNs). The proposed framework aggregates information with respect to: 1) the nature of the content disseminated, 2) content-sharing behavior of users, and 3) the social network of those users. We furthermore perform a systematic comparison of several GNN models for this task and introduce novel methods based on relational and hyperbolic GNNs, which have not been previously used for user or community modeling within NLP. We empirically demonstrate that our framework yields significant improvements over existing text-based techniques and achieves state-of-the-art results on fake news datasets from two different domains.

The spread of fake news online leads to undesirable consequences in many areas of societal life, notably in the political arena and healthcare, with the most recent example being the COVID-19 "Infodemic" (Zarocostas, 2020). Its consequences include political inefficacy, polarization of society and alienation among individuals with high exposure to fake news (Balmas, 2014; Norton and Greenwald, 2016). Recent years have therefore seen a growing interest in automated methods for fake news detection, which is typically set up as a binary classification task. While a large proportion of work has focused on modeling the structure, style and content of a news article (Khan et al., 2019; Pérez-Rosas et al., 2017), no attempts have been made to understand and exploit the online community that interacts with the article.

To advance this line of research, we propose SAFER (Socially Aware Fake nEws detection fRamework), a graph-based approach to fake news detection that aggregates information from 1) the content of the article, 2) content-sharing behavior of users who shared the article, and 3) the social network of those users. We frame the task as a graph-based modeling problem over a heterogeneous graph of users and the articles shared by them. We perform a systematic comparison of several graph neural network (GNN) models as graph encoders in our proposed framework and introduce novel methods based on relational and hyperbolic GNNs, which have not been previously used for user or community modeling within NLP. By using relational GNNs, we explicitly model the different relations that exist between the nodes of the heterogeneous graph, which traditional GNNs are not designed to capture. Furthermore, the Euclidean embeddings used by traditional GNNs have a high distortion when embedding real-world hierarchical and scale-free graphs (Ravasz and Barabási, 2003; Chen et al., 2013); a scale-free network is one in which the distribution of links to nodes follows a power law, i.e., the vast majority of nodes have very few connections, while a few important nodes (hubs) have a huge number of connections.
Thus, by using hyperbolic GNNs we capture the relative distance between the node representations more precisely by operating in the hyperbolic space. Our methods generate rich community-based representations for articles. We demonstrate that, when used alongside text-based representations of articles, SAFER leads to significant gains over existing methods for fake news detection and achieves state-of-the-art performance.

Approaches to fake news detection can be categorized into three different types: content-, propagation- and social context-based. Content-based approaches model the content of articles, such as the headline, body text, images and external URLs. Some methods utilize knowledge graphs and subject-predicate-object triples (Ciampaglia et al., 2015; Shi and Weninger, 2016), while other feature-based methods model writing style, psycholinguistic properties of text, rhetorical relations and content readability (Popat, 2017; Castillo et al., 2011; Pérez-Rosas et al., 2017; Potthast et al., 2017). Others use neural networks (Ma et al., 2016), with attention-based architectures such as HAN (Okano et al., 2020) and dEFEND (Shu et al., 2019a) outperforming other neural methods. Recent multi-modal approaches encoding both textual and visual features of news articles as well as tweets (Shu et al., 2019c; Wang et al., 2018) have advanced the performance further.

Propagation-based methods analyze patterns in the spread of news based on news cascades (Zhou and Zafarani, 2018), which are tree structures that capture the content's post and re-post patterns. These methods make predictions in two ways: 1) computing the similarity between the cascades (Kashima et al., 2003; Wu et al., 2015); or 2) representing news cascades in a latent space for classification (Ma et al., 2018). However, they are not well suited to large social-network settings due to their computational complexity.

Social context-based methods employ users' meta-information obtained from their social media profiles (e.g., geo-location, total words in profile description, etc.) as features for detecting fake news (Shu et al., 2019b, 2020). Recently, several works have leveraged GNNs to learn user representations for other tasks, such as abuse (Mishra et al., 2019), political perspective (Li and Goldwasser, 2019) and stance detection (Del Tredici et al., 2019). Two works, contemporaneous to ours, have also proposed to use GNNs for the task of fake news detection. Han et al. (2020) applied GNNs to a homogeneous graph constructed in the form of news cascades, using just shallow user-level features such as the number of followers, statuses and tweet mentions. On the other hand, Nguyen et al. (2020) use features derived from the article, the news source, users and their interactions, and the timeline of posting to detect fake news. They construct two homogeneous sub-graphs (a news-source and a user sub-graph) and model them separately in an unsupervised setting for proximity relations. They also use the user's stance in relation to the shared content as additional information, via a stance detection network pre-trained on a self-curated dataset. Our formulation of the problem is distinct from these methods in three ways.
Firstly, we construct a single heterogeneous graph consisting of two kinds of nodes and edges and model them together in a semi-supervised graph learning setup. Secondly, we do not perform user profiling, but rather compute community-wide social-context features; to the best of our knowledge, no prior work has investigated the role of online communities in fake news detection. Thirdly, to capture the role of communities, we only use information about the users' networks, without the need for any personal information from users' profiles, and yet outperform the existing methods that incorporate such information. Furthermore, since our methods do not use any user-specific information, such as location, race or gender, they do not learn to associate specific population groups with specific online behavior, unlike other methods that explicitly incorporate user-specific features and personal information by design. We believe the latter poses an ethical concern, which our techniques help to alleviate.

For our experiments, we use fake news datasets from two different domains, celebrity gossip and healthcare, to show that our proposed method is domain-agnostic. All user information collected for the experiments is de-identified.

FakeNewsNet (Shu et al., 2018) is a publicly available benchmark for fake news detection. The dataset contains news articles from two fact-checking sources, PolitiFact and GossipCop, along with links to Twitter posts mentioning these articles. PolitiFact is a fact-checking website for political statements; GossipCop is a website that fact-checks celebrity and entertainment stories. GossipCop contains a substantially larger set of articles compared to PolitiFact (over 21k news articles with text vs. around 900) and is therefore the one we use in our experiments. We note that some articles have become unavailable over time. We also excluded 60 articles that are less than 25 tokens long. In total, we work with 20,350 articles of the original set (23% fake and 77% real).

FakeHealth (Dai et al., 2020; https://tinyurl.com/y36h42zu) is a publicly available benchmark for fake news detection specifically in the healthcare domain. The dataset is collected from the healthcare information review website Health News Review (https://www.healthnewsreview.org/), which reviews whether a news article is reliable according to 10 criteria and gives it a score from 1 to 5. In line with the original authors of the dataset, we consider an article as fake for scores less than 3 and as real otherwise. The benchmark is divided into two datasets based on the nature of the source of the articles. HealthStory contains articles that are news stories, i.e., reported by news media such as Reuters Health. HealthRelease contains articles that are news releases from various institutions such as universities, research centers and companies. HealthStory contains a considerably larger set of articles compared to HealthRelease (over 1,600 vs. around 600) and is therefore the one we use in our experiments. We again note that some articles have become unavailable over time. We also exclude 27 articles that are less than 25 tokens long. In total, we work with 1,611 articles of the original set (28% fake and 72% real).

For each of the datasets, we create a heterogeneous community graph $G$ consisting of two sets of nodes: user nodes $N_u$ and article nodes $N_a$.
An article node $a \in N_a$ is represented by a binary bag-of-words (BOW) vector $a = [w_1, \ldots, w_j, \ldots, w_{|V|}]$, where $|V|$ is the vocabulary size and $w_j \in \{0, 1\}$. A user node $u \in N_u$ is represented by a binary BOW vector constructed over all the articles that they have shared: $u = [a_1 \,|\, a_2 \,|\, \ldots \,|\, a_M]$, where $|$ denotes the element-wise logical OR and $M$ is the total number of articles shared by the user. Next, we add undirected edges of two types: 1) between a user and an article node if the user shared the article in a tweet/retweet (article nodes may therefore be connected to multiple user nodes), and 2) between two user nodes if there is a follower-following relationship between them on Twitter. We work with the "top N most active users" subset (N = 20K for HealthStory, N = 30K for GossipCop) and motivate this decision in Section 6.2. To avoid the effects of any bias from frequent users, we exclude users who have shared more than 30% of the articles in either class. The resulting graph has 29,962 user nodes, 16,766 article nodes (articles from the test set excluded) and over 1.2M edges for GossipCop, while the HealthStory community graph contains 12,266 user nodes, 1,291 article nodes (test-set articles excluded) and over 450K edges.

The proposed framework, detailed below and visualized in Figure 1, employs two components in its architecture, namely graph- and text-based encoders, and its operation can be broken down into two phases: training and testing.

Training phase: We first train the graph and text encoders independently on the training set. The input to the text encoder is the text of the article, and it is trained on the task of article classification for fake news detection. The trained text encoder generates the text-based features of the article content, $s_t \in \mathbb{R}^{d_t}$, where $d_t$ is the hidden dimension of the text encoder. The graph encoder is a GNN that takes as input the community graph (constructed as detailed in §4.1). The GNN is trained with a supervised loss from the article nodes that is back-propagated to the rest of the network. The trained GNN is able to generate a set of user embeddings $U_g = \{u_1, u_2, \ldots, u_m\}$, where $u_i \in \mathbb{R}^{d_g}$, $d_g$ is the hidden dimension of the graph encoder and $m$ is the total number of users that interacted with the article. These individual user representations are then aggregated into a single fixed-size vector via a normalized sum, $s_g = \sum_{i=1}^{m} u_i / m$ with $s_g \in \mathbb{R}^{d_g}$, where $s_g$ denotes the social-context features of the article. The final social context-aware representation of the article is computed as $s_{safer} = s_g \oplus s_t$, where $\oplus$ is the concatenation operator. This form of aggregation helps SAFER retain the information that each representation encodes about different aspects of the shared content. Finally, $s_{safer}$ is used to train a logistic regression (LR) classifier on the training set (sketched below). Intuitively, the trained text encoder captures the linguistic cues from the content that are crucial for the task. Similarly, the trained graph encoder learns to assign users to implicit online communities based on their content-sharing patterns and social connections.

Testing phase: To classify unseen content as fake or real, SAFER takes as input the text of the article as well as the network of users that interacted with it. It then follows the same procedure as detailed above to generate the social context-aware representation of the to-be-verified test article, $s_{safer}$, and uses the trained LR classifier to classify it.
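The aggregation and classification steps above amount to only a few lines of code. The following is a minimal sketch, assuming precomputed graph and text embeddings; all variable names (`train_pairs`, `y_train`) are hypothetical, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def safer_features(user_embs: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Aggregate the m user embeddings (m x d_g) by normalized sum and
    concatenate with the text-encoder output (d_t,) to form s_safer."""
    s_g = user_embs.sum(axis=0) / user_embs.shape[0]  # social-context features s_g
    return np.concatenate([s_g, text_emb])            # s_safer = s_g concat s_t

# Hypothetical usage: train_pairs holds (user_embeddings, text_embedding)
# per training article, y_train the fake/real labels.
# X_train = np.stack([safer_features(U, t) for U, t in train_pairs])
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```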
We experiment with two different architectures as text encoders in SAFER:

Convolutional neural network (CNN). We adopt the sentence-level encoder of Kim (2014) at the document level. This model uses multiple 1-D convolution filters of different sizes that aggregate information by sliding over the length of the article. The final fixed-length article representation is obtained via max-over-time pooling over the feature maps.

RoBERTa. As our main text encoder, we fine-tune the transformer encoder architecture RoBERTa (Liu et al., 2019b) and use it for article classification. RoBERTa is a language model pre-trained with dynamic masking. Specifically, we use it to encode the first 512 tokens of each article and use the [CLS] token as the article embedding for classification.

We experiment with six different GNN architectures for generating user embeddings, as detailed below:

Graph Convolution Networks (GCNs). GCNs (Kipf and Welling, 2016) take as input a graph $G$ defined by its adjacency matrix $A \in \mathbb{R}^{n \times n}$ (where $n$ is the number of nodes in the graph), a degree matrix $D$ such that $D_{ii} = \sum_j A_{ij}$, and a feature matrix $F \in \mathbb{R}^{n \times m}$ containing the $m$-dimensional feature vectors for the nodes. The recursive propagation step of a GCN at the $i$-th convolutional layer is given by $H^{(i)} = \sigma\big(\hat{A} H^{(i-1)} W^{(i)}\big)$, where $\hat{A}$ is the normalized adjacency matrix (the normalization we use is given in the appendix), $W^{(i)} \in \mathbb{R}^{t_{i-1} \times t_i}$ is a learnable weight matrix, $H^{(i-1)} \in \mathbb{R}^{n \times t_{i-1}}$ represents the output of the preceding convolution layer and $t_i$ is the number of hidden units in the $i$-th layer, with $t_0 = m$ and $H^{(0)} = F$.

Graph Attention Networks (GAT). GAT (Veličković et al., 2017) is a non-spectral architecture that leverages the spatial information of a node directly by learning different weights for different nodes in a neighborhood using a self-attention mechanism. GAT is composed of graph attention layers. In each layer, a shared, learnable linear transformation $W \in \mathbb{R}^{t_{i-1} \times t_i}$ is applied to the input features of every node, where $t_i$ is the number of hidden units in layer $i$. Next, self-attention is applied to the nodes: a shared attention mechanism computes attention coefficients $e_{uv}$ between pairs of nodes to indicate the importance of the features of node $v$ to node $u$. To inject graph structural information, masked attention is applied by computing $e_{uv}$ only for nodes $v \in U(u)$ that are in the first-order neighborhood of node $u$. The final node representation is obtained by linearly combining the normalized attention coefficients with their corresponding neighborhood node features.

GraphSAGE. SAGE (Hamilton et al., 2017) is an inductive framework that learns aggregator functions that generate node embeddings from a node's local neighborhood. First, each node $u \in G$ aggregates information from its local neighborhood (through either mean, sum or pooling) into a vector $h^{k-1}_{U(u)} = \mathrm{AGG}_k\big(\{h^{k-1}_v : v \in U(u)\}\big)$, where $k$ denotes the depth of the search, $h^k$ denotes the node's representation at that step and $U(u)$ is the set of neighbor nodes of $u$. Next, it concatenates the node's current representation $h^{k-1}_u$ with that of its aggregated neighborhood vector $h^{k-1}_{U(u)}$. This vector is then passed through a multi-layer perceptron (MLP) with non-linearity to obtain the new node representation $h^k_u$ to be used at depth $k + 1$. Once the aggregator weights are learned, the embedding of an unseen node can be generated from its features and neighborhood.
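To make the two-layer setup concrete, here is a minimal PyTorch Geometric sketch of a GCN graph encoder over the community graph. The module layout and the single linear head are simplifications for illustration (the paper uses a 2-layer MLP classifier), not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNEncoder(torch.nn.Module):
    """Two-layer GCN: the supervised loss comes from article-node logits;
    user embeddings are read off the final convolution layer."""
    def __init__(self, in_dim: int, hid_dim: int = 256, n_classes: int = 1):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, hid_dim)
        self.head = torch.nn.Linear(hid_dim, n_classes)  # simplified classifier head

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = F.dropout(h, p=0.1, training=self.training)
        h = self.conv2(h, edge_index)     # node embeddings (users and articles)
        return h, self.head(F.relu(h))    # embeddings and per-node logits
```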
Relational GCN/GAT. R-GCN (Schlichtkrull et al., 2018) and R-GAT are extensions of GCN and GAT for relational data and build upon the traditional differentiable message-passing framework. The networks accept input in the form of a graph $G = (V, E, R)$, where $V$ denotes the set of nodes, $E$ denotes the set of edges connecting the nodes and $R$ denotes the edge relations $(u, r, v) \in E$, where $r \in R$ is a relation type and $u, v \in V$. The R-GCN forward-pass update step is

$h^{(i)}_u = \sigma\Big( \sum_{r \in R} \sum_{v \in U^r_u} \frac{1}{c_{u,r}} W^{(i)}_r h^{(i-1)}_v + W^{(i)}_0 h^{(i-1)}_u \Big)$,

where $h^{(i)}_u$ is the final node representation of node $u$ at layer $i$, $U^r_u$ denotes the set of neighbor indices of node $u$ under relation $r \in R$, $W_r$ is the relation-specific trainable weight parameter and $c_{u,r}$ is a task-specific normalization constant that can either be learned or set in advance (such as $c_{u,r} = |U^r_u|$). Note that each node's feature at layer $i$ is also informed of its features from layer $i - 1$ by adding a self-loop to the data, with a relation type learned using the trainable parameter $W_0$. Intuitively, this propagation step aggregates transformed feature vectors of first-order neighbor nodes through a normalized sum. R-GAT follows the same setup, except that the aggregation is done using the graph attention layer described in GAT. This architecture helps us aggregate information from user and article nodes selectively from our community graph (a code sketch follows these descriptions).

Hyperbolic GCN/GAT. Chami et al. (2019) build upon previous work (Liu et al., 2019a; Ganea et al., 2018) to combine the expressiveness of GCN/GAT with hyperbolic geometry to learn improved representations for scale-free graphs. Hy-GCN/-GAT first map the Euclidean input to the hyperbolic space (we use the Poincaré ball model), which is the Riemannian manifold with constant negative sectional curvature $-1/K$. Next, analogous to the mean aggregation performed by the GCN, Hy-GCN computes the Fréchet mean (Fréchet, 1948) of a node's neighbours' embeddings, while Hy-GAT performs aggregation in tangent spaces using hyperbolic attention. Finally, Hy-GCN/-GAT use the hyperbolic non-linear activation function $\sigma^{\otimes K_{i-1}, K_i}$ given the hyperbolic curvatures $-1/K_{i-1}$ and $-1/K_i$ at layers $i - 1$ and $i$, where $\otimes$ is the Möbius scalar multiplication operator. This is crucial as it allows the model to smoothly vary the curvature at each layer.
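The relational encoder above can be sketched with PyTorch Geometric's `RGCNConv`, assigning one relation per edge type of our community graph. This is an illustration under an assumed data layout (`edge_type` tensor), not the authors' implementation:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

NUM_RELATIONS = 2  # relation 0: user shared article; relation 1: user follows user

class RGCNEncoder(torch.nn.Module):
    """Two-layer R-GCN: a separate weight matrix per edge relation lets the
    model aggregate user-article and user-user edges differently."""
    def __init__(self, in_dim: int, hid_dim: int = 512):
        super().__init__()
        self.conv1 = RGCNConv(in_dim, hid_dim, num_relations=NUM_RELATIONS)
        self.conv2 = RGCNConv(hid_dim, hid_dim, num_relations=NUM_RELATIONS)

    def forward(self, x, edge_index, edge_type):
        # edge_type[i] in {0, 1} gives the relation of edge edge_index[:, i]
        h = F.relu(self.conv1(x, edge_index, edge_type))
        return self.conv2(h, edge_index, edge_type)
```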
We compare the performance of the proposed framework with seven supervised classification methods: two purely text-based baselines, a user-sharing majority voting baseline, a GNN-based "social baseline" and three architectures from the literature.

Baselines. The setup for the baselines is detailed below:
1. Text baselines. We use the CNN and RoBERTa architectures described earlier to obtain article representations. The input to the CNN encoder is ELMo embeddings (Peters et al., 2018) of the article tokens, while RoBERTa uses its own tokenizer to generate initial token representations.
2. Majority sharing baseline. This simple baseline classifies articles as fake or real based on the sharing statistics of the users that tweeted or retweeted about them. If, on average, the users that interact with an article have shared more fake articles, then the article is tagged as fake, and real otherwise.
3. Social baseline. We introduce a graph-based model that measures the effectiveness of the purely structural aspects of the community graph captured by the GNNs (without access to text). The user node embeddings are constructed as described earlier, but with the article nodes initialized randomly. Here, the community-based features solely capture properties of the network. The classification is done using just the social-context features by an LR classifier.

Comparison systems. We compare the performance of the proposed framework with three methods from the literature:
1. HAN (Shu et al., 2019a). The hierarchical attention network first generates sentence embeddings using attention over (GRU-based) contextualized word vectors. An article embedding is then obtained in a similar manner by passing the sentence vectors through a GRU and applying attention over the hidden states.
2. dEFEND (Shu et al., 2019a). This method exploits the content of articles alongside comments from users. Comment embeddings are obtained from a single-layer bi-GRU and article embeddings are generated using HAN. A cross-attention mechanism is applied over the two embeddings to exploit users' opinions and stance to better detect fake news.
3. SAFE (Zhou et al., 2020). This method uses visual and textual features of the content. It uses a CNN to encode the textual as well as visual content of an article, initially processing the visual information using a pre-trained image2sentence model, and then concatenates these representations to better detect fake news.

Experimental setup. We use 70%, 10% and 20% of the total articles as train, validation and test splits, respectively, for both datasets. For the CNN we use 128 filters of sizes [3, 4, 5] each. For HAN and dEFEND we report the results from Shu et al. (2019a), and for SAFE from Zhou et al. (2020). We use the large version of RoBERTa and fine-tune all layers. Due to class imbalance, we weight the loss from the fake class 3 times higher (in line with the class frequency in each of the datasets) while optimizing the binary cross-entropy loss of the sigmoid output from a 2-layer MLP in all our experiments (see the sketch below). We use dropout (Srivastava et al., 2014), attention dropout and node masking (Mishra et al., 2020) for regularization. We use 2-layer-deep architectures for all the GNNs. For Hy-GCN/-GAT we train with learnable curvature. We run all experiments with 5 random seeds using the AdamW (Loshchilov and Hutter, 2017) optimizer (except for Hy-GCN/-GAT, which use Riemannian Adam; Bécigneul and Ganea, 2018) with an early-stopping patience of 10.

For GossipCop, we use a learning rate of $5 \cdot 10^{-3}$ for Hy-GCN/-GAT; $1 \cdot 10^{-4}$ for SAGE and R-GAT; $1 \cdot 10^{-3}$ for R-GCN; and $5 \cdot 10^{-4}$ for the rest. We use a weight decay of $5 \cdot 10^{-1}$ for RoBERTa; $2 \cdot 10^{-3}$ for SAGE and R-GCN; and $1 \cdot 10^{-3}$ for the rest. We use a dropout of 0.4 for GAT and R-GCN; 0.2 for SAGE and R-GAT; 0.5 for CNN; and 0.1 for the rest. We use a node-masking probability of 0.1 for all the GNNs and attention dropout of 0.4 for RoBERTa. Finally, we use a hidden dimension of 128 for SAGE; 256 for GCN and Hy-GCN; and 512 for the rest.

For HealthStory, we use a learning rate of $1 \cdot 10^{-4}$ for SAGE; $1 \cdot 10^{-3}$ for R-GAT; $5 \cdot 10^{-3}$ for GCN and Hy-GCN/-GAT; and $5 \cdot 10^{-4}$ for the rest. We use a weight decay of $5 \cdot 10^{-1}$ for RoBERTa; $2 \cdot 10^{-3}$ for GAT, SAGE and R-GCN; and $1 \cdot 10^{-3}$ for the rest. We use a dropout of 0.4 for GCN; 0.1 for Hy-GCN/-GAT and RoBERTa; 0.5 for CNN; and 0.2 for the rest. We use a node-masking probability of 0.2 for GAT and R-GCN; 0.3 for Hy-GCN/-GAT; and 0.1 for the rest. Finally, we use an attention dropout of 0.4 for RoBERTa and a hidden dimension of 128 for SAGE; 256 for Hy-GAT; and 512 for the rest.
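A minimal sketch of the weighted-loss and optimizer setup described above; the learning rate and weight decay shown are one example, with the per-model values listed in the preceding paragraphs, and the helper function is hypothetical rather than the exact training script:

```python
import torch

def make_training_objects(model: torch.nn.Module):
    """Class-weighted BCE (fake class weighted 3x, matching the class ratio)
    and AdamW, as used for the non-hyperbolic models."""
    criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-3)
    return criterion, optimizer
```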
Results. The mean F1 scores for all models are summarized in Table 1. We note that the simple majority sharing baseline achieves an F1 of 77.19 on GossipCop but just 8.20 on HealthStory. This highlights the difference in the content-sharing behavior of users between the two datasets, which we explore further in Section 6.3. We can also see this difference in the strength of the social-context information between the two datasets from the performance of the social baseline. The social baseline variants of all GNNs significantly (p < 0.05 under a paired t-test) outperform all text-based methods in the case of GossipCop, but not in the case of HealthStory. However, all the social baselines outperform the majority sharing baseline, demonstrating the contribution of GNNs beyond capturing just the average sharing behavior of interacting users. Note that we observe similar trends in experiments with CNN as the text encoder of the proposed framework.

Finally, in the case of GossipCop, all the variants of the proposed SAFER framework significantly outperform all their social baseline counterparts as well as all the text-based models. The relational GNN variants significantly outperform all the other methods, while the hyperbolic variants perform on par with the traditional GNNs. In the case of HealthStory, we see that the traditional GNN variants significantly outperform their social baseline counterparts but not the best-performing text-based baseline (i.e., RoBERTa). However, the relational and hyperbolic GNNs significantly outperform all other methods. Overall, we see that the proposed relational GNNs outperform the traditional GNN models, indicating the importance of modeling the different relations between nodes of a heterogeneous graph separately. Hyperbolic GNNs are more expressive in embedding graphs that have a (deep) hierarchical structure. Due to the nature of the datasets and a limitation of the Twitter API (all retweets are mapped to the same source tweet, rather than forming a tree structure), the community graph is just two levels deep. Thus, the hyperbolic GNNs perform similarly to the traditional GNNs under our 2-layer setup. However, if more social information were available, resulting in a deeper graph, we would expect Hy-GNNs to exhibit superior performance.

In Figure 2, we use t-SNE (Maaten and Hinton, 2008) to visualize the representations of the test articles generated by RoBERTa and SAFER (R-GCN); a sketch for reproducing such plots is given at the end of this section. We see a much cleaner and more compact segregation of fake and real articles with SAFER.

Graph sparsity can affect the performance of GNNs, as they rely on node connections to share information during training. Additionally, the presence of frequent users that share many articles of a particular class may introduce a bias in the model. In such cases, the network may learn to simply map a user to a class and use that as a shortcut for classification. To investigate the effects of these phenomena, we perform an ablation experiment on GossipCop by removing the most frequent/active users from the graph in a step-wise fashion. This makes the graph more sparse and discards many connections that the network could have learned to overfit on. Table 2 shows the performance of the GNN models when users sharing more than 10%, 5% and 1% of the articles of each class are removed from the graph. We see that the performance drops as users are removed successively; however, SAFER still outperforms all the text-based methods under the 10% and 5% settings, even without the presence of a possible bias introduced by frequent users. For the 1% setting, only the hyperbolic GNNs outperform the baselines; this setting illustrates that under extremely sparse conditions (65% of the original density of an already sparse graph), the R-GNNs struggle to learn informative user representations. Overall, we see that Hy-GNNs are resilient to user biases (if any) and can perform well even on sparse graphs.
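The Figure 2-style projections can be reproduced with off-the-shelf t-SNE. A minimal sketch, assuming precomputed article representations (the array names are hypothetical):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project article representations to 2-D and color points by class."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="coolwarm", s=8)
    plt.title(title)
    plt.show()

# plot_tsne(roberta_feats, y_test, "RoBERTa")       # hypothetical arrays
# plot_tsne(safer_feats, y_test, "SAFER (R-GCN)")
```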
GNNs learn node representations by aggregating information from their local neighborhoods. If the unsupervised nodes have very sparse connections (e.g., users that have shared just one or very few articles), then there is not enough support to learn their social-context features from. The effective neighborhood that a node uses is determined by the number of successive iterations of message passing, i.e., the number of layers in a GNN. Thus, in principle we could add more GNN layers to enable the sparse unsupervised nodes to gain access to back-propagating losses from distant supervised nodes too. However, simply stacking more layers leads to the various known problems of training deep neural networks in general (vanishing gradients and overfitting due to the large number of parameters), as well as graph-specific problems such as over-smoothing and the bottleneck phenomenon.

Over-smoothing is the phenomenon where node features tend to converge to the same vector and become nearly indistinguishable as the result of applying multiple GNN layers (Oono and Suzuki, 2019; NT and Maehara, 2019). Moreover, in social network graphs, predictions typically rely only on short-range information from the local neighbourhood of a node and do not improve by adding distant information. In our community graph, modeling information from 2 hops away is sufficient to aggregate useful community-wide information at each node and can be achieved with 2-layer GNNs. Thus, learning node representations for sparsely connected nodes from these shallow GNNs is challenging.

Bottleneck is the phenomenon of "over-squashing" of information from exponentially many neighbours into small fixed-size vectors (Alon and Yahav, 2020). Since each article is shared by many users and each user is connected to many other users, the network can suffer from a bottleneck, which affects learning. In a 2-layer GNN setup, the effective aggregation neighborhood of each article node increases exponentially, as it aggregates information from all the nodes that are within 2 hops of it.

Based on these observations during our initial experiments, we choose to use just the "top N most active users". We define "active users" as those that have shared more articles, i.e., have sufficient support to learn from, and hence can help us capture their content-sharing behavior better (a selection sketch is given below). In Figure 3, we show the validation and test performance over varying subsets of active users in HealthStory. We see that as we successively drop the least active users, the validation and test scores show a positive trend. This illustrates the effect of the bottleneck on the network. However, the scores drop after a certain threshold of users. This threshold is the optimum number of users required to learn effectively using the GNNs: adding more users leads to a bottleneck, while removing users leads to underfitting due to the lack of sufficient support to learn from. We see that the validation and test scores are correlated in this behavior, and we tune our optimal threshold of users for effective learning on the validation set of SAFER (GCN) and use the same subset for all the other GNN encoders for a fair comparison. The best validation score was achieved with the top-20K subset of most active users, while the test scores peaked in the top-8K setting. Thus, we run all our experiments with the top 20K active users for HealthStory and, similarly, the top 30K for GossipCop.
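A minimal sketch of this user-selection step; the `(user_id, article_id)` share-list layout is an assumption about the data format, not the authors' pipeline:

```python
from collections import Counter

def top_n_active_users(shares, n):
    """Return the IDs of the n users with the most shared articles.
    `shares` is an iterable of (user_id, article_id) pairs."""
    counts = Counter(user for user, _ in shares)
    return {user for user, _ in counts.most_common(n)}

# active = top_n_active_users(share_edges, 20_000)  # HealthStory threshold
# active = top_n_active_users(share_edges, 30_000)  # GossipCop threshold
```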
As discussed earlier, the results in Table 1 show that there is a difference in the article-sharing behavior of users between the two datasets. To understand user characteristics better, we visualize the article-sharing behavior of users for both datasets in Figure 4. We visualize the composition of three types of users in the datasets: (a) users that share only real articles, (b) users that share only fake articles, and (c) users that share articles from both classes. We see that the majority of the users are of type (c) in both datasets (57.18% for GossipCop and 74.15% for HealthStory). However, 38% of the users are of type (b) in GossipCop, while just 9.96% in HealthStory. Furthermore, we visualize the average number of real and fake articles shared by type (c) users on the right in Figure 4.

From these observations, we note that the GNNs are better positioned to learn user representations for detecting fake articles in the case of GossipCop, since: (1) the community graph has enough support of type (b) users (38%), which aids the GNNs in learning rich community-level features of users that help detect fake articles; and (2) even among the 57% of type (c) users, users are much more likely to share articles of a single class (here, real). This again helps the network learn distinct features for these users and assign them to a specific community. However, in the case of HealthStory, the GNNs struggle to learn equally rich user representations for detecting fake articles, since: (1) the community graph has only around 10% of type (b) users. This limits the GNNs from learning expressive community-level features for users that are more likely to share fake articles, and thereby prevents these features from being used for accurate prediction. (2) A vast majority of users (74%) share articles of both classes. Moreover, these users are considerably less likely to share articles of predominantly one class. This again restricts the GNNs from learning informative representations for these users, as the model struggles to assign them to any specific community due to mixed signals.

We presented a graph-based approach to fake news detection which leverages the information-spreading behavior of social media users. Our results demonstrate that incorporating community-based modeling leads to substantially improved performance on this task compared to purely text-based models. The proposed relational GNNs for user/community modeling outperformed the traditional GNNs, indicating the importance of explicitly modeling the relations in a heterogeneous graph. Meanwhile, the proposed hyperbolic GNNs performed on par with the other GNNs, and we leave their application to user/community modeling on truly hierarchical social network datasets as future work. In the future, it would be interesting to apply these techniques to other tasks, such as rumour detection and modeling changes in public beliefs.

The sets of best hyper-parameters for all models are reported in Table 3. We use NVIDIA Titan RTX and 2080 Ti GPUs for training multi-GPU models and a 1080 Ti for single-GPU ones. In Table 4 we report the run times (per epoch) for each model. We use the F1 score (of the target class, i.e., the fake class) to report all our performance. F1 is defined as

$F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$,

where Precision and Recall are defined as

$Precision = \frac{TP}{TP + FP}$, $Recall = \frac{TP}{TP + FN}$,

with TP, FP and FN denoting true positives, false positives and false negatives, respectively.
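For reference, the same metrics can be computed with scikit-learn; the variable names below are hypothetical, with label 1 denoting the fake class as in the paper's reporting:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def report_metrics(y_true, y_pred):
    """Score the fake class (label 1), matching the paper's reported F1."""
    return {
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "recall": recall_score(y_true, y_pred, pos_label=1),
        "f1": f1_score(y_true, y_pred, pos_label=1),
    }
```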
A.5 Training Details

1. To leverage effective batching of graph data during training, we cluster the graph into 300 dense sub-graphs using the METIS (Karypis and Kumar, 1998) graph clustering algorithm. We then train all the GNN networks with a batch size of 16, i.e., 16 of these sub-graphs are sampled at each pass, as detailed in Chiang et al. (2019). This vastly reduces the time, memory and computation complexity of handling large sparse graphs.
2. Additionally, for GCN we adopt "diagonal enhancement" by adding the identity to the original adjacency matrix $A$ (Chiang et al., 2019) and perform the normalization as $\tilde{A} = (D + I)^{-1}(A + I)$.
3. For SAGE we use "mean" aggregation and L2-normalize the output features, i.e., $h^k_u \leftarrow h^k_u / \lVert h^k_u \rVert_2$.
4. For GAT, we use 3 attention heads with an attention dropout of 0.1 to stabilize training. We concatenate their linear combinations instead of aggregating, so that the output of each layer is 3 × hidden_dim.

In Table 5 we report the performance of all the GNN variants of the proposed SAFER framework for different subsets of highly active users. A portion of the community graph is visualized in Figure 5.

Table 3 (recoverable fragment): best learning rate and weight decay per model. The remaining regularization settings are listed in the experimental setup above.

GossipCop        GCN    GAT    SAGE   R-GCN  R-GAT  Hy-GCN Hy-GAT CNN    RoBERTa
Learning rate    5e-4   5e-4   1e-4   1e-3   1e-4   5e-3   5e-3   5e-4   5e-4
Weight decay     1e-3   1e-3   2e-3   2e-3   1e-3   1e-3   1e-3   1e-3   5e-1

HealthStory      GCN    GAT    SAGE   R-GCN  R-GAT  Hy-GCN Hy-GAT CNN    RoBERTa
Learning rate    5e-3   5e-4   1e-4   5e-4   1e-3   5e-3   5e-3   5e-4   5e-4
Weight decay     1e-3   2e-3   2e-3   2e-3   1e-3   1e-3   1e-3   1e-3   5e-1

We assess the performance of the SAFER (GCN) variant on GossipCop in Figure 7a. We see that the first article is a fake article which RoBERTa incorrectly classifies as real. However, looking at the content-sharing behavior of the users that shared this article, we see that on average these users shared 5.8 fake articles and just 0.45 real ones (13 times more likely to share fake content than real), strongly indicating that the community of users involved in sharing this article is responsible for the propagation of fake news. Taking this strong community-based information into consideration, SAFER is able to correctly classify this article as fake. Similarly, the second article is a real article which is misclassified as fake by RoBERTa from the text alone. However, the GNN features show that the users that shared this article have on average shared 533 real articles and 96.7 fake ones (5.5 times more likely to share a real article than a fake one). This is taken as a strong signal that the users are reliable and do not engage in malicious sharing of content. SAFER is then able to correctly classify this article as real.

We observe similar behavior of the models on HealthStory in Figure 7b. The first article is misclassified as real by RoBERTa, but the GNN features indicate that the users interacting with the article share 16.2 fake articles and 7.8 real ones on average (2.1 times more likely to share fake). SAFER takes this information into account and classifies it correctly as fake. Similarly, for the second article, the interacting users share 40 real and 19.96 fake articles on average (2 times more likely to share real), which helps the proposed method to correctly classify it as real.
Figure 7 caption (panels (a) GossipCop and (b) HealthStory): Text in red denotes a fake article, while text in green denotes a real one. The black central node denotes the target article node that we are trying to classify, blue nodes denote the users that shared this article, while red and green nodes denote the other fake and real articles these users have interacted with, respectively. Predictions by the different models are stated on the right.

We grid-search over the following values of the parameters for the respective models and choose the best setting based on the best F1 score on the test set:
CNN: learning rate =
Transformers: learning rate =
GNNs: learning rate =

We clean the raw text of the crawled articles of the GossipCop dataset before using them for training. More specifically, we replace any URLs and hashtags in the text with the tokens [url] and [hashtag], respectively. We also replace newline characters with a blank space and make sure that the class distributions across the train-val-test splits are the same. All our code is in PyTorch and we use the HuggingFace library (Wolf et al., 2019) to train the transformer models.
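A minimal sketch of the cleaning step described above; the exact regular expressions are an assumed approximation of the paper's preprocessing, not the authors' script:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_article(text: str) -> str:
    """Replace URLs and hashtags with placeholder tokens and drop newlines."""
    text = URL_RE.sub("[url]", text)
    text = HASHTAG_RE.sub("[hashtag]", text)
    return text.replace("\n", " ")
```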