key: cord-0103154-ide0077n
authors: Zhang, Yizhou; Sharma, Karishma; Liu, Yan
title: VigDet: Knowledge Informed Neural Temporal Point Process for Coordination Detection on Social Media
date: 2021-10-28
sha: 2256d3eb873242449472b72f6024ed0a2cc8d105
doc_id: 103154
cord_uid: ide0077n

Recent years have witnessed an increasing use of coordinated accounts on social media, operated by misinformation campaigns to influence public opinion and manipulate social outcomes. Consequently, there is an urgent need to develop an effective methodology for coordinated group detection to combat misinformation on social media. However, existing works suffer from various drawbacks: either limited performance due to extreme reliance on predefined signatures of coordination, or an inability to address the natural sparsity of account activities on social media with useful prior domain knowledge. Therefore, in this paper, we propose a coordination detection framework incorporating neural temporal point process with prior knowledge such as temporal logic or pre-defined filtering functions. Specifically, when modeling the observed data from social media with neural temporal point process, we jointly learn a Gibbs-like distribution of group assignment based on how consistent an assignment is with (1) the account embedding space and (2) the prior knowledge. To address the challenge that this distribution is hard to compute and sample from efficiently, we design a theoretically guaranteed variational inference approach to learn a mean-field approximation for it. Experimental results on a real-world dataset show the effectiveness of our proposed method compared to the SOTA model in both unsupervised and semi-supervised settings. We further apply our model to a COVID-19 Vaccine Tweets dataset. The detection result suggests the presence of suspicious coordinated efforts to spread misinformation about COVID-19 vaccines.
embeddings from the observed data with deep learning (e.g. neural temporal point process) and then conducts group detection in the embedding space [20, 32]. However, data from social media has a special and important property: the appearance of accounts in the diffusion cascades usually follows a long-tail distribution [18] (an example is shown in Fig. 1b). This property brings a unique challenge: compared to a few dominant accounts, most accounts appear sparsely in the data, limiting the performance of deep representation learning based models. Some previous works exploiting pre-defined collective behaviours [2, 37, 25] can circumvent this challenge. They mainly follow a paradigm that first constructs similarity graphs from the data with some prior knowledge or hypothesis and then conducts graph based clustering. Their expressive power, however, is heavily limited, as the complicated interactions are simply represented as edges with scalar weights, and they exhibit strong reliance on predefined signatures of coordination. As a result, their performance is significantly weaker than that of the state-of-the-art deep representation learning based model [32].

To address the above challenges, we propose a knowledge informed neural temporal point process model, named Variational Inference for Group Detection (VigDet). It represents the domain knowledge of collective behaviors of coordinated accounts by defining different signatures of coordination, for example that accounts that co-appear, or are synchronized in time, are more likely to be coordinated. Unlike previous works that rely heavily on assumed prior knowledge and cannot effectively learn from the data [2, 37], VigDet encodes prior knowledge as temporal logic and power functions so that it guides the learning of the neural point process model and effectively infers coordinated behaviors.
In addition, it maintains a distribution over group assignments and defines a potential score function that measures the consistency of group assignments in terms of both the embedding space and prior knowledge. As a result, VigDet can make effective inferences over the constructed prior knowledge graph while jointly learning the account embeddings using a neural point process. A crucial challenge in our framework is that the group assignment distribution, which is a Gibbs distribution defined on a Conditional Random Field [17], contains a partition function as normalizer [16]. Consequently, it is NP-hard to compute or sample from, leading to difficulties in both learning and inference [4, 15]. To address this issue, we apply variational inference [22]. Specifically, we approximate the Gibbs distribution with a mean field distribution [24]. Then we jointly learn the approximation and the learnable parameters with the EM algorithm to maximize the evidence lower bound (ELBO) [22] of the observed data likelihood. In the E-step, we freeze the learnable parameters and infer the optimal approximation, while in the M-step, we freeze the approximation and update the parameters to maximize an objective function that is a lower bound of the ELBO with theoretical guarantee.

Our experiments on a real-world dataset [20] involving coordination detection validate the effectiveness of our model compared with other baseline models, including the current state of the art. We further apply our method to a dataset of tweets about COVID-19 vaccines without ground-truth coordinated group labels. The analysis of the detection result suggests the existence of suspicious coordinated efforts to spread misinformation and conspiracies about COVID-19 vaccines.

2 Related Work

One typical coordinated group detection paradigm is to construct a graph measuring the similarity or interaction between accounts and then conduct clustering on the graph or on the embedding acquired by factorizing the adjacency matrix.
There are two typical ways to construct the graph. One way is to measure the similarity or interaction with pre-defined features supported by prior knowledge or assumed signatures of coordinated or collective behaviors, such as co-activity, account clickstream and time synchronization [5, 29, 37]. The other way is to learn an interaction graph by fitting the data with temporal point process models that represent the mutual influence between accounts as scalar scores, as in the traditional Hawkes Process [41]. A critical drawback of both methods is that the interaction between two accounts is simply represented as an edge with a scalar weight, resulting in a poor ability to capture complicated interactions. In addition, the performance of prior knowledge based methods is unsatisfactory due to their reliance on the quality of prior knowledge or hypotheses of collective behaviors, which may vary with time [39].

To address the reliance on the quality of prior knowledge and the limited expressive power of graph based methods, recent research tries to directly learn account representations from the observed data. In [20], Inverse Reinforcement Learning (IRL) is applied to learn the reward behind an account's observed behavior, and the learnt reward is forwarded into a classifier as features. However, since different accounts' activity traces are modeled independently, it is hard for IRL to model the interactions among different accounts. The current state-of-the-art method in this direction is a neural temporal point process model named AMDN-HAGE [32]. Its backbone (AMDN), which can efficiently capture account interactions from observed activity traces, contains an account embedding layer, a history encoder and an event decoder. The account embedding vectors are optimized under the regularization of a Gaussian Mixture Model (the HAGE part). However, as a data driven deep learning model, AMDN-HAGE lacks the guidance of human prior knowledge in its learning process.
In contrast, we propose VigDet, a framework integrating neural temporal point process with prior knowledge to address the inherent sparsity of account activities.

A marked temporal point process (MTPP) is a stochastic process whose realization is a discrete event sequence $S = [(v_1, t_1), (v_2, t_2), \ldots, (v_n, t_n)]$, where $v_i \in \mathcal{V}$ is the type mark of event $i$ and $t_i \in \mathbb{R}^+$ is the timestamp [8]. We denote the historical event collection before time $t$ as $H_t = \{(v_i, t_i) \mid t_i < t\}$. Given a history $H_t$, the conditional probability that an event with mark $v \in \mathcal{V}$ happens at time $t$ is formulated as:

$$p_v(t \mid H_t) = \lambda_v(t \mid H_t) \exp\left(-\int_{t_{i-1}}^{t} \lambda_v(s \mid H_t)\, ds\right),$$

where $t_{i-1}$ is the timestamp of the most recent event before $t$, and $\lambda_v(t \mid H_t)$, also known as the intensity function, is defined as $\lambda_v(t \mid H_t) = \frac{\mathbb{E}[dN_v(t) \mid H_t]}{dt}$, i.e. the expected derivative of the total number of events with type mark $v$ happening before or at time $t$, denoted as $N_v(t)$. In social media data, the Hawkes Process (HP) [41] is a commonly applied type of temporal point process. In the Hawkes Process, the intensity function is defined as

$$\lambda_v(t \mid H_t) = \mu_v + \sum_{(v_i, t_i) \in H_t} \alpha_{v, v_i}\, \kappa(t - t_i),$$

where $\mu_v > 0$ is the self activating intensity, $\alpha_{v, v_i} > 0$ is the mutually triggering intensity modeling mark $v_i$'s influence on $v$, and $\kappa$ is a decay kernel modeling how influence decays over time. In the Hawkes Process, only $\mu$ and $\alpha$ are learnable parameters. Such weak expressive power hinders the Hawkes Process from modeling complicated interactions between events. Consequently, researchers have explored modeling the intensity function with neural networks [9, 21, 40, 44, 33, 23, 32].
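To make the Hawkes intensity concrete, the following minimal sketch evaluates it for a toy two-account history. The exponential kernel, the function name `hawkes_intensity`, and all parameter values are illustrative assumptions; the text only requires that the kernel decays over time.

```python
import math

def hawkes_intensity(v, t, history, mu, alpha, decay=1.0):
    """Hawkes intensity of mark v at time t given past events.

    history: list of (mark, timestamp) pairs.
    mu[v]: base intensity of mark v.
    alpha[v][v_i]: influence of mark v_i on mark v.
    The kernel exp(-decay * dt) is an assumed choice of decay kernel.
    """
    total = mu[v]
    for v_i, t_i in history:
        if t_i < t:  # only events strictly before t belong to H_t
            total += alpha[v][v_i] * math.exp(-decay * (t - t_i))
    return total

# Toy example with two accounts (marks 0 and 1); values are made up.
mu = [0.1, 0.2]
alpha = [[0.5, 0.3],
         [0.0, 0.4]]
history = [(0, 1.0), (1, 2.0)]
lam = hawkes_intensity(0, 3.0, history, mu, alpha)
print(round(lam, 4))
```

Each past event adds a contribution that shrinks with elapsed time, so the only quantities fitted are the entries of `mu` and `alpha`, which is the limited expressiveness the paragraph above refers to.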
Among the above works, the most recent one related to coordinated group detection is AMDN-HAGE [32], whose backbone architecture AMDN is a neural temporal point process model that encodes an event sequence $S$ with masked self-attention:

$$A = \sigma\left(\frac{Q K^{\top}}{\sqrt{d}}\right), \quad C = F(A V), \quad Q = X W_q,\; K = X W_k,\; V = X W_v,$$

where $\sigma$ is a masked activation function that avoids encoding future events into historical vectors, $X \in \mathbb{R}^{L \times d}$ ($L$ is the sequence length and $d$ is the feature dimension) is the event sequence feature, $F$ is a feedforward neural network or an RNN that summarizes the historical representation from the attentive layer into context vectors $C \in \mathbb{R}^{L \times d}$, and $W_q$, $W_k$, $W_v$ are learnable weights. Each row $X_i$ in $X$ (the feature of event $(v_i, t_i)$) is a concatenation of a learnable mark embedding $E_{v_i}$ (each mark corresponds to an account on social media), a position embedding $PE_{pos=i}$ with trigonometric integral function [35], and a temporal embedding $\phi(t_i - t_{i-1})$ using a translation-invariant temporal kernel function [38]. After acquiring $C$, the likelihood of a sequence $S$ given mark embeddings $E$ and other parameters in AMDN, denoted as $\theta_a$, can be modeled as:

In coordinated group detection, we are given a temporal sequence dataset $\mathcal{S} = \{S_1, \ldots, S_{|D|}\}$ from social media, where each sequence corresponds to a piece of information, e.g. a tweet, and each event $(v_{ij}, t_{ij})$ means that an account $v_{ij} \in \mathcal{V}$ (corresponding to a type mark in MTPP) interacts with the tweet (e.g. by commenting or retweeting) at time $t_{ij}$. Supposing that $\mathcal{V}$ consists of $M$ groups, our objective is to learn a group assignment $Y = \{y_v \mid v \in \mathcal{V}, y_v \in \{1, \ldots, M\}\}$. This task can be conducted under an unsupervised or a semi-supervised setting. In the unsupervised setting, we do not have the group identity of any account. In the semi-supervised setting, the ground-truth group identity $Y_L$ of a small account fraction $\mathcal{V}_L \subset \mathcal{V}$ is accessible. The current state-of-the-art model on this task is AMDN-HAGE with k-Means. It first learns the account embeddings $E$ with AMDN-HAGE.
Then it obtains the group assignment $Y$ using k-Means clustering on the learned $E$.

In this section, we introduce our proposed model called VigDet (Variational Inference for Group Detection), which bridges neural temporal point process and graph based methods built on prior knowledge. Unlike the existing methods, in VigDet we regularize the learning process of the account embeddings with the prior knowledge based graph so that the performance can be improved. Such a method addresses the heavy reliance of deep learning models on the quality and quantity of data as well as the poor expressive power of existing graph based methods exploiting prior knowledge.

In this framework, we aim at learning a knowledge informed data-driven model. To this end, based on prior knowledge we construct a graph describing the potential of account pairs to be coordinated. Then we alternately enhance the prediction of the data-driven model with the prior knowledge based graph and further update the model to fit the enhanced prediction as well as the observed data.

For the prior knowledge based graph construction, we apply co-activity [29] to measure the similarity of accounts. This method assumes that accounts that always appear together in the same sequences are more likely to be in the same group. Specifically, we construct a dense graph $\mathcal{G} = \langle \mathcal{V}, \mathcal{E} \rangle$ whose node set is the account set and in which the weight $w_{uv}$ of an edge $(u, v)$ is the co-occurrence:

$$w_{uv} = \sum_{S \in \mathcal{S}} \mathbb{1}\big((u \in S) \wedge (v \in S)\big)$$

However, when integrated with our model, this edge weight is problematic because the coordinated accounts may also appear in tweets that attract normal accounts. Although the co-occurrence of coordinated account pairs is statistically higher than that of other account pairs, coordinated accounts are only a small fraction of the whole account set, so our model will tend to predict an account as a normal account. Therefore, we apply one of the following two strategies to acquire a filtered weight $w_{uv}$:

Power Function based filtering: the co-occurrence of a coordinated account pair is statistically higher than that of a coordinated-normal pair. Thus, we can use a power function with exponent $p > 1$ ($p$ is a hyper-parameter) to enlarge the difference and then conduct normalization:

$$w_{uv} = \Big(\sum_{S \in \mathcal{S}} \mathbb{1}\big((u \in S) \wedge (v \in S)\big)\Big)^{p}$$

where $u \in S$ and $v \in S$ mean that $u$ and $v$ appear in the sequence, respectively. Weights with relatively low values will then be filtered via normalization (details in the next subsection).

Temporal Logic [19] based filtering: We can represent some prior knowledge as a logic expression of temporal relations, denoted as $r(\cdot)$, and then only count those samples satisfying the logic expression. Here, we assume that the active times of accounts of the same group are more likely to be similar. Therefore, we only consider the account pairs whose active time overlap is larger than a threshold $c$ (we apply half a day, i.e. 12 hours):

$$w_{uv} = \sum_{S \in \mathcal{S}} \mathbb{1}\big((u \in S) \wedge (v \in S) \wedge r(u, v, S)\big), \quad r(u, v, S) = \mathbb{1}\big(\min(t_{ul}, t_{vl}) - \max(t_{us}, t_{vs}) > c\big)$$

where $t_{ul}$, $t_{vl}$ are the last times that $u$ and $v$ appear in the sequence and $t_{us}$, $t_{vs}$ are the first (starting) times that $u$ and $v$ appear in the sequence.

To integrate prior knowledge and neural temporal point process, while maximizing the likelihood of the observed sequences $\log p(\mathcal{S} \mid E)$ given account embeddings, VigDet simultaneously learns a distribution over group assignments $Y$ defined by the following potential score function given the account embeddings $E$ and the prior knowledge based graph $\mathcal{G} = \langle \mathcal{V}, \mathcal{E} \rangle$:

$$\Phi(Y; E, \mathcal{G}) = \sum_{u \in \mathcal{V}} \phi_\theta(y_u, E_u) + \sum_{(u, v) \in \mathcal{E}} \varphi_{\mathcal{G}}(y_u, y_v, u, v)$$

where $\phi_\theta(y_u, E_u)$ is a learnable function, e.g. a feedforward neural network, measuring how consistent an account's group identity $y_u$ is with the learnt embedding. $\varphi_{\mathcal{G}}(y_u, y_v, u, v)$ is pre-defined as:

$$\varphi_{\mathcal{G}}(y_u, y_v, u, v) = \frac{w_{uv}}{\sqrt{d_u d_v}}\, \mathbb{1}(y_u = y_v)$$

where $d_u = \sum_k w_{uk}$ and $d_v = \sum_k w_{vk}$ are the degrees of $u$ and $v$, and $\mathbb{1}(y_u = y_v)$ is an indicator function that equals 1 when its input is true and 0 otherwise. By encouraging co-appearing accounts to be assigned to the same group, $\varphi_{\mathcal{G}}(y_u, y_v, u, v)$ regularizes $E$ and $\phi_\theta$ with prior knowledge.
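The graph construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cooccurrence`, `overlap_longer_than`, `power_filter`, `normalize`), the toy sequences, and the use of hours as the time unit are all hypothetical, and the normalization shown divides each weight by the square root of the product of the two node degrees.

```python
from collections import defaultdict
from itertools import combinations
import math

def cooccurrence(sequences, relation=None):
    """Count sequences in which both accounts of a pair appear.

    sequences: list of sequences, each a list of (account, time) events.
    relation: optional predicate on the two accounts' timestamps in a
    sequence; when given, only sequences satisfying it are counted.
    """
    weights = defaultdict(float)
    for seq in sequences:
        times = defaultdict(list)
        for account, t in seq:
            times[account].append(t)
        for u, v in combinations(sorted(times), 2):
            if relation is None or relation(times[u], times[v]):
                weights[(u, v)] += 1.0
    return dict(weights)

def overlap_longer_than(threshold):
    """Temporal relation: active-time overlap of two accounts exceeds threshold."""
    def relation(times_u, times_v):
        overlap = min(max(times_u), max(times_v)) - max(min(times_u), min(times_v))
        return overlap > threshold
    return relation

def power_filter(weights, p=3):
    """Raise every co-occurrence count to the power p > 1."""
    return {pair: w ** p for pair, w in weights.items()}

def normalize(weights):
    """Divide each weight by sqrt(d_u * d_v), with d_u the weighted degree of u."""
    degree = defaultdict(float)
    for (u, v), w in weights.items():
        degree[u] += w
        degree[v] += w
    return {(u, v): w / math.sqrt(degree[u] * degree[v])
            for (u, v), w in weights.items()}

# Toy data: timestamps in hours (made up for illustration).
sequences = [
    [("a", 0), ("b", 1), ("a", 20), ("b", 30)],
    [("a", 0), ("b", 2), ("c", 3), ("a", 15), ("b", 16)],
    [("a", 5), ("c", 6)],
]
raw = cooccurrence(sequences)
pf = normalize(power_filter(raw, p=3))
tl = cooccurrence(sequences, relation=overlap_longer_than(12))
print(raw, pf, tl)
```

In this toy example the pair `("b", "c")` keeps a normalized weight of 1/3 under raw counts but drops to 1/9 after cubing, while the temporal-logic variant removes every pair except `("a", "b")`, the only one whose activity overlaps by more than 12 hours.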
With the above potential score function, we can define the conditional distribution of the group assignment $Y$ given the embedding $E$ and the graph $\mathcal{G}$:

$$P(Y \mid E, \mathcal{G}) = \frac{1}{Z} \exp\big(\Phi(Y; E, \mathcal{G})\big)$$

where $Z = \sum_Y \exp(\Phi(Y; E, \mathcal{G}))$ is the normalizer keeping $P(Y \mid E, \mathcal{G})$ a distribution, also known as the partition function [16, 14]. It sums up $\exp(\Phi(Y; E, \mathcal{G}))$ over all possible assignments $Y$. As a result, calculating $P(Y \mid E, \mathcal{G})$ accurately and finding the assignment maximizing $\Phi(Y; E, \mathcal{G})$ are both NP-hard [4, 15]. Consequently, we approximate $P(Y \mid E, \mathcal{G})$ with a mean field distribution $Q(Y) = \prod_{u \in \mathcal{V}} Q_u(y_u)$. To inform the learning of $E$ and $\phi_\theta$ with the prior knowledge behind $\mathcal{G}$, we propose to jointly learn $Q$, $E$ and $\phi_\theta$ by maximizing the following objective function, which is the Evidence Lower Bound (ELBO) of the observed data likelihood $\log p(\mathcal{S} \mid E)$ given the embedding $E$:

$$O(Q, E, \phi_\theta; \mathcal{S}, \mathcal{G}) = \log p(\mathcal{S} \mid E) - D_{KL}(Q \,\|\, P)$$

In this objective function, the first term is the likelihood of the observed data given account embeddings, which can be modeled as $\sum_{S \in \mathcal{S}} \log p_{\theta_a}(S \mid E)$ with a neural temporal point process model like AMDN. The second term regularizes the model to learn $E$ and $\phi_\theta$ such that $P(Y \mid E, \mathcal{G})$ can be approximated by its mean field approximation as precisely as possible. Intuitively, this can be achieved when the two terms in the potential score function, i.e. $\sum_{u \in \mathcal{V}} \phi_\theta(y_u, E_u)$ and $\sum_{(u, v) \in \mathcal{E}} \varphi_{\mathcal{G}}(y_u, y_v, u, v)$, agree with each other on every possible $Y$.

The above lower bound can be optimized via the variational EM algorithm [22, 27, 28, 34]. In the E-step, we aim at inferring the optimal $Q(Y)$ that minimizes $D_{KL}(Q \,\|\, P)$. Note that the formulation of $\Phi(Y; E, \mathcal{G})$ is the same as that of the Conditional Random Fields (CRF) [17] model, although their learnable parameters are different. In the E-step such a difference is not important, as all parameters in $\Phi(Y; E, \mathcal{G})$ are frozen.
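The cost of the partition function and the idea of the mean field approximation can be illustrated on a graph small enough to enumerate. The sketch below is an assumption-laden toy, not the paper's code: `phi` stands in for the learnable unary scores, `edges` for the already normalized pairwise weights, and the mean-field routine uses the standard sequential coordinate update for a pairwise model whose pairwise term rewards equal labels.

```python
import itertools
import math

def potential(Y, phi, edges):
    """Unary scores plus pairwise weights of edges whose endpoints share a group."""
    score = sum(phi[u][y] for u, y in enumerate(Y))
    score += sum(w for (u, v), w in edges.items() if Y[u] == Y[v])
    return score

def exact_marginals(phi, edges, M):
    """Brute force over all M**n assignments (only feasible for tiny n)."""
    n = len(phi)
    marg = [[0.0] * M for _ in range(n)]
    Z = 0.0
    for Y in itertools.product(range(M), repeat=n):
        weight = math.exp(potential(Y, phi, edges))
        Z += weight
        for u, y in enumerate(Y):
            marg[u][y] += weight
    return [[p / Z for p in row] for row in marg], Z

def mean_field(phi, edges, M, iters=50):
    """Fixed-point iteration for a factorized Q; cost is linear in the edges."""
    n = len(phi)
    Q = [[1.0 / M] * M for _ in range(n)]
    neighbors = [[] for _ in range(n)]
    for (u, v), w in edges.items():
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))
    for _ in range(iters):
        for u in range(n):
            logits = [phi[u][m] + sum(w * Q[v][m] for v, w in neighbors[u])
                      for m in range(M)]
            top = max(logits)  # subtract the max for numerical stability
            unnorm = [math.exp(x - top) for x in logits]
            Z_u = sum(unnorm)
            Q[u] = [x / Z_u for x in unnorm]
    return Q

# Three accounts, two groups; all numbers are made up for illustration.
phi = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.5]]
edges = {(0, 1): 1.5, (1, 2): 0.2}
exact, Z = exact_marginals(phi, edges, M=2)
approx = mean_field(phi, edges, M=2)
print(exact, approx)
```

Account 1 has no preference of its own, yet both the exact marginals and the factorized approximation pull it toward the group of its strongly connected neighbor, account 0. The brute-force routine visits every assignment and therefore scales exponentially in the number of accounts, which is why an approximation is needed at realistic sizes.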
As existing works about CRF [16, 14] have theoretically proven, the following iterative updating function of belief propagation converges at a locally optimal solution:

$$Q_u(y_u = m) = \frac{1}{Z_u} \hat{Q}_u(y_u = m) = \frac{1}{Z_u} \exp\Big\{\phi_\theta(m, E_u) + \sum_{v \in \mathcal{V}} \sum_{1 \le m' \le M} \varphi_{\mathcal{G}}(m, m', u, v)\, Q_v(y_v = m')\Big\} \tag{10}$$

where $Q_u(y_u = m)$ is the probability that account $u$ is assigned to group $m$ and $Z_u = \sum_{1 \le m \le M} \hat{Q}_u(y_u = m)$ is the normalizer keeping $Q_u$ a valid distribution.

In the M-step, given a fixed inference of $Q$, we aim at maximizing $O_M$:

$$O_M = \log p(\mathcal{S} \mid E) - D_{KL}(Q \,\|\, P) = \log p(\mathcal{S} \mid E) + \mathbb{E}_{Y \sim Q} \log P(Y \mid E, \mathcal{G}) + \text{const}$$

The key challenge in the M-step is that calculating $\mathbb{E}_{Y \sim Q} \log P(Y \mid E, \mathcal{G})$ is NP-hard [4, 15]. To address this challenge, we propose to instead optimize the following theoretically justified lower bound:

Theorem 1. Given a fixed inference of $Q$ and a pre-defined $\varphi_{\mathcal{G}}$, we have the following inequality:

$$\mathbb{E}_{Y \sim Q} \log P(Y \mid E, \mathcal{G}) \ge \mathbb{E}_{Y \sim Q} \sum_{u \in \mathcal{V}} \log \frac{\exp\{\phi_\theta(y_u, E_u)\}}{\sum_{1 \le m' \le M} \exp\{\phi_\theta(m', E_u)\}} + \text{const} = -\sum_{u \in \mathcal{V}} D_{KL}\big(Q_u \,\|\, \mathrm{softmax}(\phi_\theta(E_u))\big) + \text{const}$$

The proof of this theorem is provided in the Appendix. Intuitively, the above objective function treats $Q$ as a group assignment enhanced via label propagation on the prior knowledge based graph and encourages $E$ and $\phi_\theta$ to correct themselves by fitting the enhanced prediction. Compared with the pseudolikelihood [3], which is applied to address similar challenges in recent works [27], the proposed lower bound has a computable closed-form solution. Thus, we do not need to sample $Y$ from $Q$, which reduces noise. Also, this lower bound does not contain $\varphi_{\mathcal{G}}$ explicitly in the non-constant term. Therefore, we can encourage the model to encode graph information into the embedding.

The E-step and M-step form a closed loop. To create a starting point, we initialize $E$ with the embedding layer of a pre-trained neural temporal point process model (in this paper we apply AMDN-HAGE) and initialize $\phi_\theta$ via clustering learnt on $E$ (e.g. fitting $\phi_\theta$ to the prediction of k-Means). After that we repeat the E-step and M-step to optimize the model. The pseudo code of the training algorithm is presented in Alg. 1.

Algorithm 1 Training Algorithm of VigDet.
Require: Dataset $\mathcal{S}$ and pre-defined $\mathcal{G}$ and $\varphi_{\mathcal{G}}$
Ensure: Well trained $Q$, $E$ and $\phi_\theta$
1: Initialize $E$ with the embedding layer of AMDN-HAGE pre-trained on $\mathcal{S}$.
2: Initialize $\phi_\theta$ on the initialized $E$.
3: while not converged do
4:   Acquire $Q$ by repeating Eq. 10 with $E$, $\phi_\theta$ and $\varphi_{\mathcal{G}}$ until convergence. {E-step}
5:   Update $\phi_\theta$ and $E$ by maximizing $\log p(\mathcal{S} \mid E)$ plus the lower bound in Theorem 1. {M-step}
6: end while

The above framework does not make use of ground-truth labels in the training procedure. In the semi-supervised setting, we actually have the group identity $Y_L$ of a small account fraction $\mathcal{V}_L \subset \mathcal{V}$. Under this setting, we can naturally extend the framework via the following modification to Alg. 1: for an account $u \in \mathcal{V}_L$, we set $Q_u$ as a one-hot distribution, where $Q_u(y_u = y'_u) = 1$ for the ground-truth identity $y'_u$ and $Q_u(y_u = m) = 0$ for all other $m \in \{1, \ldots, M\}$.

We utilize a Twitter dataset containing coordinated accounts from Russia's Internet Research Agency (IRA dataset [20, 32]) attempting to manipulate the U.S. 2016 Election. The dataset contains tweet sequences (i.e., tweets with account interactions like comments, replies or retweets) constructed from the tweets related to the U.S. 2016 Election. This dataset contains activities involving 2025 Twitter accounts. Among the 2025 accounts, 312 are identified through U.S. Congress investigations as coordinated accounts, and the other 1713 accounts are normal accounts joining the discussion about the Election during the period of activity of those coordinated accounts. This dataset is applied for the evaluation of coordination detection models in recent works [20, 32]. In this paper, we apply two settings: the unsupervised setting and the semi-supervised setting. For the unsupervised setting, the model does not use any ground-truth account labels in training (but for hyperparameter selection, we hold out 100 randomly sampled accounts as a validation set, and evaluate with the reported metrics on the remaining 1925 accounts as the test set). For the semi-supervised setting, we similarly hold out 100 accounts for hyperparameter selection as a validation set, and another 100 accounts with labels revealed in the training set for semi-supervised training. The evaluation is reported on the remaining test set of 1825 accounts.
The hyperparameters of the backbone of VigDet (AMDN) follow the original paper [32]. Other implementation details are in the Appendix.

In this experiment, we mainly evaluate the performance of two versions of VigDet: VigDet (PF) and VigDet (TL). VigDet (PF) applies Power Function based filtering and VigDet (TL) applies Temporal Logic based filtering. For the exponent $p$ in VigDet (PF), we apply 3. We compare them against existing approaches that utilize account activities to identify coordinated accounts.

Unsupervised Baselines: Co-activity clustering [29] and Clickstream clustering [37] are based on pre-defined similarity graphs. HP (Hawkes Process) [41] is a learnt graph based method. IRL [20] and AMDN-HAGE [32] are two recent representation learning methods.

Ablation Variants: To verify the importance of the EM-based variational inference framework and our proposed objective function in the M-step, we compare our models with two variants: VigDet-E and VigDet-PL (PL for Pseudo Likelihood). In VigDet-E, we only conduct the E-step once to acquire group assignments (the inferred distribution over labels) enhanced with prior knowledge, without alternating updates using the EM loop. It is similar to some existing works that conduct post-processing with CRF to enhance prediction based on the learnt representations [6, 12]. In VigDet-PL, we replace our proposed objective function with the pseudo likelihood function from existing works.

We compare two kinds of metrics. One kind is threshold-free: Average Precision (AP), area under the ROC curve (AUC), and maxF1 at the threshold that maximizes the F1 score. The other kind needs a threshold: F1, Precision, Recall, and MacroF1. For this kind, we apply 0.5 as the threshold for the binary (coordinated/normal account) labels.

Tables 1 and 2 provide the results of model evaluation against the baselines on the IRA dataset, averaged over five random seeds.
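The difference between the thresholded and threshold-free metrics can be seen in a small standard-library sketch. The scores and labels below are made up, the function names are hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
def precision_recall_f1(scores, labels, threshold=0.5):
    """Binary precision, recall and F1 for predictions score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def max_f1(scores, labels):
    """Best F1 over all thresholds that change the prediction set."""
    return max(precision_recall_f1(scores, labels, t)[2] for t in set(scores))

def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: predicted probability that an account is coordinated.
scores = [0.9, 0.8, 0.45, 0.4, 0.1]
labels = [1, 0, 1, 1, 0]
_, _, f1_at_half = precision_recall_f1(scores, labels, threshold=0.5)
best_f1 = max_f1(scores, labels)
auc = roc_auc(scores, labels)
print(f1_at_half, best_f1, auc)
```

On this toy data the fixed 0.5 threshold misses two coordinated accounts whose scores sit just below it, so F1 is 0.4, whereas maxF1 reaches 6/7 at a lower threshold. This gap is why both kinds of metrics are informative when model scores are not calibrated.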
As we can see, VigDet, as well as its variants, outperforms other methods in both unsupervised and semi-supervised settings, due to its ability to integrate neural temporal point process, which is the current state-of-the-art method, with prior knowledge, which is robust to data quality and quantity. It is noticeable that although GNN based methods can also integrate prior knowledge based graphs and representation learning from the state-of-the-art model, our model still outperforms them by modeling and inferring the distribution over group assignments jointly guided by consistency in the embedding and prior knowledge space.

Ablation Test: Besides baselines, we also compare VigDet with its variants VigDet-E and VigDet-PL. As we can see, for the Power Filtering strategy, compared with VigDet-E, VigDet achieves significantly better results on most of the metrics in both settings, indicating that leveraging the EM loop and the proposed M-step optimization can guide the model to learn better representations for $E$ and $\phi_\theta$. As for the Temporal Logic Filtering strategy, VigDet also brings boosts, although relatively marginal ones. This suggests that the performance of our M-step objective function may vary with the prior knowledge we apply. Meanwhile, the VigDet-PL performs not only worse than VigDet, but also

We collect tweets related to COVID-19 vaccines using the Twitter public API, which provides a 1% random sample of tweets. The dataset contains 62k activity sequences of 31k accounts, after filtering out accounts that appear fewer than 5 times in the collected tweets and sequences shorter than length 10. Although the data of tweets about COVID-19 vaccines does not have ground-truth labels, we can apply VigDet to detect suspicious groups and then analyze the collective behavior of the group. The results bolster our method by mirroring observations in existing research [11, 7].

Detection: VigDet detects 8k suspicious accounts from the 31k Twitter accounts.
We inspect tweets and account features of the detected suspicious group of coordinated accounts.

Representative tweets: We use topic mining on tweets of detected coordinated accounts and show the text contents of the top representative tweets in Table 3. The two groups (detected coordinated and normal accounts) are clearly distinguished in the comparison of the top-30 hashtags in tweets posted by the accounts in each group (presented in Fig. 3). Non-overlapping hashtags are shown in bold. The coordinated accounts seem to promote the idea that the pandemic is a hoax (#scamdemic2020, #plandemic2020), as well as anti-mask, anti-vaccine and anti-lockdown narratives (#notcoronavirusvaccines, #masksdontwork, #livingnotlockdown), and political agendas (#trudeaumustgo). The normal accounts' narratives are more general and show more positive attitudes towards vaccines, masks and prevention protocols. Also, we measure the percentage of unreliable and conspiracy news sources shared in the tweets of the detected coordinated accounts, which is 55.4%, compared to 23.2% in the normal account group. The percentage of recent accounts (created in 2020-21) is higher in the coordinated group (20.4%) compared to 15.3% otherwise. Disinformation and suspensions are not exclusive to coordinated activities, suspensions are based on a manual Twitter process and are continually updated over time, and accounts created earlier can include recently compromised accounts; therefore, these measures cannot be considered absolute ground truth.

Table 3: Representative tweets from topic clusters in tweets of detected coordinated accounts.
If mRNA vaccines can cause autoimmune problems and more severe reactions to coronavirus' maybe that's why Gates is so confident he's onto a winner when he predicts a more lethal pandemic coming down the track. The common cold could now kill millions but it will be called CV21/22? This EXPERIMENTAL "rushed science" gene therapy INJECTION of an UNKNOWN substance (called a "vaccine" JUST TO AVOID LITIGATION of UNKNOWN SIDE EFFECTS) has skipped all regular animal testing and is being forced into a LIVE HUMAN TRIAL.. it seems to be little benefit to us really! This Pfizer vax doesn't stop transmission,prevent infection or kill the virus, merely reduces symptoms. So why are they pushing it when self-isolation/Lockdowns /masks will still be required. Rather sinister especially when the completion date for trials, was/is 2023 It is-You dont own anything, including your body. -Full and absolute ownership of your biological being. -Disruption of your immune system. -Maximizing gains for #BillGatesBioTerrorist. -#Transhumanism -#Dehumanization' It is embarrassing to see Sturgeon fawning all over them. The rollout of the vaccine up here is agonisingly slow and I wouldn't be surprised if she was trying to show solidarity with the EU. There are more benefits being part of the UK than the EU. It also may be time for that "boring" O'Toole (as you label him) to get a little louder and tougher. To speak up more. To contradict Trudeau on this vaccine rollout and supply mess. O'Toole has no "fire". He can't do "blood sport". He's sidelined by far right diversions.

In this work, we proposed a prior knowledge guided neural temporal point process to detect coordinated groups on social media. Through a theoretically guaranteed variational inference framework, it integrates a data-driven neural coordination detector with prior knowledge encoded as a graph. Comparison experiments and ablation tests on the IRA dataset verify the effectiveness of our model and inference. Furthermore, we apply our model to uncover suspicious misinformation campaigns in a COVID-19 vaccine related tweet dataset. Behaviour analysis of the detected coordinated group suggests efforts to promote anti-vaccine misinformation and conspiracies on Twitter.

A.1 Proof of Theorem 1

Proof.
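To simplify the notation, let us apply the following notations:

$$\Phi_\theta(Y; E) = \sum_{u \in \mathcal{V}} \phi_\theta(y_u, E_u), \quad \Phi_{\mathcal{G}}(Y; \mathcal{G}) = \sum_{(u, v) \in \mathcal{E}} \varphi_{\mathcal{G}}(y_u, y_v, u, v)$$

Let us denote the set of all possible assignments as $\mathcal{Y}$; then we have:

$$\mathbb{E}_{Y \sim Q} \log P(Y \mid E, \mathcal{G}) = \mathbb{E}_{Y \sim Q} \Phi(Y; E, \mathcal{G}) - \log \sum_{Y \in \mathcal{Y}} \exp(\Phi(Y; E, \mathcal{G}))$$

Now, let us consider the term $\log \sum_{Y \in \mathcal{Y}} \exp(\Phi(Y; E, \mathcal{G}))$. Since $\varphi_{\mathcal{G}}$ is pre-defined, there must be an assignment $Y_{max}$ that maximizes $\Phi_{\mathcal{G}}(Y; \mathcal{G})$. Thus, we have:

$$\log \sum_{Y \in \mathcal{Y}} \exp\big(\Phi_\theta(Y; E) + \Phi_{\mathcal{G}}(Y; \mathcal{G})\big) \le \Phi_{\mathcal{G}}(Y_{max}; \mathcal{G}) + \log \sum_{Y \in \mathcal{Y}} \exp\big(\Phi_\theta(Y; E)\big)$$

Since $\varphi_{\mathcal{G}}$ is pre-defined, $\Phi_{\mathcal{G}}(Y_{max}; \mathcal{G})$ is a constant during the optimization. Note that $\sum_{Y \in \mathcal{Y}} \exp(\Phi_\theta(Y; E))$ sums over all possible assignments $Y \in \mathcal{Y}$. Thus, it is actually the expansion of the following product:

$$\sum_{Y \in \mathcal{Y}} \exp\big(\Phi_\theta(Y; E)\big) = \prod_{u \in \mathcal{V}} \sum_{1 \le m' \le M} \exp\big(\phi_\theta(m', E_u)\big)$$

Therefore, for $Q$, which is a mean-field distribution, and $\phi_\theta$, which models each account's assignment independently, we have:

$$\mathbb{E}_{Y \sim Q} \log P(Y \mid E, \mathcal{G}) \ge \mathbb{E}_{Y \sim Q} \sum_{u \in \mathcal{V}} \log \frac{\exp\{\phi_\theta(y_u, E_u)\}}{\sum_{1 \le m' \le M} \exp\{\phi_\theta(m', E_u)\}} + \text{const}$$

In the E-step, to acquire a mean field approximation $Q(Y) = \prod_{u \in \mathcal{V}} Q_u(y_u)$ that minimizes the KL-divergence between $Q$ and $P$, denoted as $D_{KL}(Q \,\|\, P)$, we repeat the following belief propagation operation until $Q$ converges:

$$Q_u(y_u = m) = \frac{1}{Z_u} \exp\Big\{\phi_\theta(m, E_u) + \sum_{v \in \mathcal{V}} \sum_{1 \le m' \le M} \varphi_{\mathcal{G}}(m, m', u, v)\, Q_v(y_v = m')\Big\}$$

Here, we provide a detailed justification based on previous works [14, 16]. Let us recall the definition of the potential function $\Phi(Y; E, \mathcal{G})$ and the Gibbs distribution $P(Y \mid E, \mathcal{G})$ defined on it:

$$\Phi(Y; E, \mathcal{G}) = \sum_{u \in \mathcal{V}} \phi_\theta(y_u, E_u) + \sum_{(u, v) \in \mathcal{E}} \varphi_{\mathcal{G}}(y_u, y_v, u, v), \quad P(Y \mid E, \mathcal{G}) = \frac{1}{Z} \exp\big(\Phi(Y; E, \mathcal{G})\big)$$

where $Z = \sum_Y \exp(\Phi(Y; E, \mathcal{G}))$. With the above definitions, we have the following theorem:

Theorem 2. (Theorem 11.2 in [14])

$$D_{KL}(Q \,\|\, P) = \log Z - \mathbb{E}_{Y \sim Q} \Phi(Y; E, \mathcal{G}) - H(Q)$$

where $H(Q)$ is the information entropy of the distribution $Q$. A more detailed derivation of the above equation can be found in the appendix of [16]. Since $Z$ is fixed in the E-step, minimizing $D_{KL}(Q \,\|\, P)$ is equivalent to maximizing $\mathbb{E}_{Y \sim Q} \Phi(Y; E, \mathcal{G}) + H(Q)$. For this objective, we have the following theorem:

Theorem 3. (Theorem 11.9 in [14]) $Q$ is a local maximum if and only if:

$$Q_u(y_u = m) = \frac{1}{Z_u} \exp\Big(\mathbb{E}_{Y - \{y_u\} \sim Q}\, \Phi(Y - \{y_u\}; E, \mathcal{G} \mid y_u = m)\Big)$$

where $Z_u$ is the normalizer and $\mathbb{E}_{Y - \{y_u\} \sim Q}\, \Phi(Y - \{y_u\}; E, \mathcal{G} \mid y_u = m)$ is the conditional expectation of $\Phi$ given that $y_u = m$ and the labels of the other nodes are drawn from $Q$.

Meanwhile, note that the expectation of all terms in $\Phi$ that do not contain $y_u$ is invariant to the value of $y_u$. Therefore, we can remove all such terms from both the numerator (the exponential function) and the denominator (the normalizer $Z_u$) of $Q_u$. Thus, we have the following corollary:

Corollary 1.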
$Q$ is a local maximum if and only if:

$$Q_u(y_u = m) = \frac{1}{Z_u} \exp\Big\{\phi_\theta(m, E_u) + \sum_{v \in \mathcal{V}} \sum_{1 \le m' \le M} \varphi_{\mathcal{G}}(m, m', u, v)\, Q_v(y_v = m')\Big\}$$

where $Z_u$ is the normalizer.

A more detailed justification of the above corollary can be found in the explanation of Corollary 11.6 in Sec 11.5.1.3 of [14]. Since the above local maximum is a fixed point of $D_{KL}(Q \,\|\, P)$, fixed-point iteration can be applied to find such a local maximum. More details, such as the stationarity of the fixed points, can be found in Chapter 11.5 of [14].

A. 1e-5 regularization (same as [32]). The number of loops in the EM algorithm is selected from {1, 2, 3} based on the performance on the validation account set. In each E-step, we repeat the belief propagation until convergence (within 10 iterations) to acquire the final inference. In each M-step, we train the model for at most 50 epochs with early stopping based on the validation objective function. The validation objective function is computed from the sequence likelihood on the 15% held-out validation sequences and the KL-divergence on the whole account set based on the inferred account embeddings in that iteration.

We apply the Cubic Function based filtering because it shows better performance on unsupervised detection on the IRA dataset. We follow all the remaining settings of VigDet (CF) in the IRA experiments except the number of GPUs (4 NVIDIA-2080Ti). Also, for this dataset, since we have no prior knowledge about how many groups exist, we first pre-train an AMDN by only maximizing its observed data likelihood on the dataset. Then we select the cluster number that maximizes the silhouette score as the group number. The final group number we select is 2. The silhouette scores are shown in Fig. 4. After that, we train VigDet on the dataset with a group number of 2. As for the final threshold we select for detection, we set it to 0.8 because it maximizes the silhouette score on the final learnt embedding.
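The group-number selection described above can be illustrated with a small standard-library sketch of the silhouette score. Everything here is an assumption for illustration: the 2-D points stand in for learnt account embeddings, the candidate labelings stand in for clustering results at different group numbers, and Euclidean distance is assumed as the metric.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters.

    For each point, a is the mean distance to its own cluster and b the
    smallest mean distance to another cluster. Singletons contribute 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette defined as 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Hypothetical embeddings forming two well separated blobs.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
scores = {k: silhouette_score(points, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Splitting one of the blobs to obtain a third group lowers the average silhouette, so the two-group labeling is selected. The same comparison applies when choosing a detection threshold, since each threshold induces a two-way labeling of the accounts.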
[1] Characterizing the 2016 Russian IRA influence campaign
[2] Cascade-based community detection
[3] Statistical analysis of non-lattice data
[4] Fast approximate energy minimization via graph cuts
[5] Uncovering large groups of active malicious accounts in online social networks
[6] Semantic image segmentation with deep convolutional nets and fully connected CRFs
[7] Clustering analysis of website usage on Twitter during the COVID-19 pandemic
[8] An introduction to the theory of point processes: volume II: general theory and structure
[9] Recurrent marked temporal point processes: Embedding event history to vector
[10] Inductive representation learning on large graphs
[11] Malicious and low credibility URLs on Twitter during the AstraZeneca COVID-19 vaccine development
[12] Incorporating network embedding into Markov random field for better community detection
[13] Semi-supervised classification with graph convolutional networks
[14] Probabilistic graphical models: principles and techniques
[15] What energy functions can be minimized via graph cuts?
[16] Efficient inference in fully connected CRFs with Gaussian edge potentials
[17] Conditional random fields: Probabilistic models for segmenting and labeling sequence data
[18] Social contagion: An empirical study of information spread on Digg and Twitter follower graphs
[19] Temporal logic point processes
[20] Detecting troll behavior via inverse reinforcement learning: A case study of Russian trolls in the 2016 US election
[21] The neural Hawkes process: A neurally self-modulating multivariate point process
[22] A view of the EM algorithm that justifies incremental, sparse, and other variants
[23] Fully neural network based model for general temporal point processes
[24] Advanced mean field methods: Theory and practice
[25] Uncovering coordinated networks on social media
[26] A survey on semi-supervised learning techniques
[27] Graph Markov neural networks
[28] Probabilistic logic neural networks for reasoning
[29] CSI: A hybrid deep model for fake news detection
[30] Combating fake news: A survey on identification and mitigation techniques
[31] Coronavirus on social media: Analyzing misinformation in Twitter conversations
[32] Identifying coordinated accounts on social media through hidden influence and group behaviours
[33] Intensity-free learning of temporal point processes
[34] An EM approach to non-autoregressive conditional sequence generation
[35] Attention is all you need
[36] Graph attention networks
[37] Unsupervised clickstream clustering for user behavior analysis
[38] Self-attention with functional time representation learning
[39] Who let the trolls out? Towards understanding state-sponsored trolls
[40] Self-attentive Hawkes process
[41] Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes

Table 5: Results on semi-supervised coordination detection (IRA) on Twitter in 2016 U.S. Election. Columns: Method, AP, AUC, F1, Prec, Rec, MaxF1, MacroF1.

In Tables 4 and 5, we show the detailed performance of our model and the baselines. Specifically, we provide the error bars of the different methods. As we can see, compared with the versions with filtering strategies, the recall scores of most variants with naive edge weight are significantly worse, leading to poor F1 scores (except VigDet-PL(NF) in unsupervised setting

This work is supported by NSF Research Grant CCF-1837131. Yizhou Zhang is also supported by the Annenberg Fellowship of the University of Southern California. We sincerely thank Professor Emilio Ferrara and his group for sharing the IRA dataset with us. We are also very thankful for the comments and suggestions from our anonymous reviewers.