title: StratLearner: Learning a Strategy for Misinformation Prevention in Social Networks author: Guangmo Tong date: 2020-09-29 Given a combinatorial optimization problem taking an input, can we learn a strategy to solve it from examples of input-solution pairs without knowing its objective function? In this paper, we consider such a setting and study the misinformation prevention problem. Given examples of attacker-protector pairs, our goal is to learn a strategy that computes protectors against future attackers, without knowing the underlying diffusion model. To this end, we design a structured prediction framework whose main idea is to parameterize the scoring function using random features constructed through distance functions on randomly sampled subgraphs, which leads to a kernelized scoring function with weights learnable via the large-margin method. As evidenced by experiments, our method can produce near-optimal protectors without using any information about the diffusion model, and it outperforms other possible graph-based and learning-based methods by an evident margin. The online social network has become an indispensable part of today's community, but it is also making misinformation such as rumors and fake news widespread [1, 2]. During COVID-19, more than 150 rumors were identified by Snopes.com [3]. Misinformation prevention (MP) limits the spread of misinformation by launching a positive cascade, assuming that users who have received the positive cascade will not be convinced by the misinformation. Such a strategy has been considered feasible [4], and fact-checking services are now trending on the web, such as Snopes.com [5] and Factcheck.org [6].
Formally, information cascades start to spread from their seed nodes, and the propagation process is governed by an underlying diffusion model. Given the seed nodes (attacker) of the misinformation, the MP problem seeks the seed nodes (protector) of the positive cascade such that the spread of misinformation is maximally limited. MP without Knowing the Diffusion Model. Existing works often assume that the parameters of the diffusion model are known, and they focus primarily on algorithmic analysis for selecting seed nodes [7, 8]. However, the real propagation process is often complicated, and in practice we may have only certain types of historical data with little to no prior knowledge of the underlying diffusion model. In this paper, we adopt the well-known triggering model [9] to formulate the diffusion process and assume that its parameters are unknown. We are given the social graph together with a collection of historical attacker-protector pairs in which the protectors were successful, and the goal is to design a learning scheme that computes the best protector against a new attacker. Given the ground set V of users, the MP problem is given by a mapping arg max_P f(M, P): 2^V → 2^V, where f is the objective function determined by the underlying diffusion model, quantifying the prevention effect of the protector P ⊆ V against the attacker M ⊆ V. Therefore, our problem is nothing but to learn a mapping from 2^V (attacker) to 2^V (protector) using training examples {(M_i, arg max_P f(M_i, P))}. See Fig. 1 for an illustration. While this problem is supervised by the attacker-protector pairs, it is somewhat different from common supervised problems in that it attempts to learn a solution to an optimization problem. One challenge in solving it is that the input and output are sets, while machine learning methods often struggle to deal with objects invariant to permutation [10].
Another challenge lies in properly integrating the graph information into the learning design. As we will see later, directly applying existing methods such as graph convolutional networks [11] cannot produce good protectors. StratLearner. We propose a method called StratLearner to solve the considered problem. StratLearner attempts to learn a scoring function f*(M, P) that approximates f(M, P) for each M, and if successful, the prediction arg max_P f*(M, P) yields a good protector. The key idea of StratLearner is to parameterize f*(M, P) by f*(M, P) = w^T G(M, P), where G(M, P) ∈ R^K is a feature function constructed through K ∈ Z^+ random subgraphs and w ∈ R^K are the tunable weights. Our parameterization is justified by the fact that, for each distribution over (M, P) and any possible f given by a triggering model, there exists a w^T G(M, P) that can be arbitrarily close to f in the Hilbert space, provided that K is sufficiently large. Therefore, StratLearner first generates a collection of random features to obtain G(M, P), and then learns the weights w through structural SVM, where a new loss-augmented inference method is designed to overcome the NP-hardness of computing the exact inference. Our experiments not only show that StratLearner can produce high-quality protectors but also verify that StratLearner indeed benefits from the proposed feature construction. We proceed by introducing the diffusion model, followed by the definition of the MP problem and the learning settings. We consider a social network given by a directed graph G = (V, E). Each node u ∈ V is associated with a distribution N_u(S) over 2^{N_u^-}, with N_u^- being the set of in-neighbors of u; each edge (u, v) ∈ E is associated with a distribution T_{(u,v)}(x) over (0, +∞) denoting the transmission time. Suppose that there are two cascades: misinformation M and positive cascade P, with seed sets M ⊆ V (attacker) and P ⊆ V (protector), respectively. We speak of each node as being in the state of M-active, P-active, or inactive.
Following the triggering model [9, 12], the diffusion process unfolds as follows: • Time 0: The nodes in M (resp., P) are M-active (resp., P-active) at time 0. Each node v samples a triggering set A_v ⊆ N_v^- according to N_v, and each edge (u, v) samples a transmission time t_{(u,v)} according to T_{(u,v)}. • Time t: When a node u becomes M-active (resp., P-active) at time t, each inactive node v such that u ∈ A_v will be activated by u and become M-active (resp., P-active) at time t + t_{(u,v)}. Each node is activated by the first in-neighbor attempting to activate it and is never deactivated. When a node v is activated by two or more in-neighbors with different states at the same time, v becomes M-active. When there is only one cascade, the above model subsumes classic models, including the discrete-time independent cascade (DIC) model [9], the discrete-time linear threshold (DLT) model [9], and the continuous-time independent cascade (CIC) model [12]. An example illustrating the diffusion process is given in Supplementary A. Given the seed sets M and P, we use f(M, P): 2^V × 2^V → R to denote the expected number of nodes that are not activated by the misinformation, and we call f the prevention function. Formally, these functions form a class. Definition 1 (Class F_PF). Over the choices of N_u and T_{(u,v)}, we use F_PF to denote the class of prevention functions, i.e., the functions f(·, ·) determined by some choice of N_u and T_{(u,v)}. When the misinformation M is detected, our goal is to launch a positive cascade such that the misinformation is maximally prevented [7, 13, 14]. Problem 1 (Misinformation Prevention). Under a budget constraint given by k ∈ Z^+, the misinformation prevention problem aims to compute arg max_{P ⊆ V, |P| ≤ k} f(M, P). In this paper, we assume that the social graph G is known but the diffusion model (i.e., N_u and T_{(u,v)}) is unknown, and given a new attacker M, we aim to solve Problem 1 from historical data: a collection of samples {(M_i, P_i)}_{i=1}^n, where P_i is an optimal or suboptimal solution to Problem 1 associated with input M_i.
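Once the triggering sets and transmission times are realized, the diffusion outcome reduces to a shortest-path computation (this characterization is made explicit in the supplementary proofs): a node ends up M-active iff the misinformation reaches it no later than the positive cascade. A minimal sketch in Python — the toy graph, seed sets, and weights below are invented for illustration, not the paper's data:

```python
import heapq

def dijkstra(adj, sources):
    """Shortest-path distance from the closest node in `sources`
    over a weighted digraph given as {node: [(neighbor, weight), ...]}."""
    dist = {v: float("inf") for v in adj}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def m_active(adj, M, P):
    """Nodes that end up M-active on a realized (live-edge) subgraph:
    v is M-active iff it is reachable from M and the misinformation
    arrives no later than the positive cascade (ties favor M)."""
    d_m, d_p = dijkstra(adj, M), dijkstra(adj, P)
    return {v for v in adj if d_m[v] < float("inf") and d_m[v] <= d_p[v]}

# Toy realized subgraph: the protector seeded at 'p' reaches 'b' first.
g = {"m": [("a", 1.0), ("b", 3.0)], "p": [("b", 1.0)], "a": [], "b": []}
print(m_active(g, {"m"}, {"p"}))   # 'b' is saved by the positive cascade
print(m_active(g, {"m"}, set()))   # without a protector, 'b' is M-active
```

Averaging such outcomes over realizations recovers the expectation that defines the prevention function.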
That is, we aim to learn a strategy F*: 2^V → 2^V that computes the protector F*(M) for a future attacker M ⊆ V, hoping that F*(M) maximizes f(M, P) with respect to P. Since f(M, P) is unknown to us, F*(M) is examined through the training pairs. For a training pair (M, P), we consider a function L(P, S) that quantifies the loss of using some S ⊆ V instead of P as the protector. Assuming that the attacker M of the misinformation follows an unknown distribution M, we aim to learn an F* such that the risk ∫_{2^V} L(F(M), F*(M)) dM(M) is minimized, where F denotes the underlying optimal strategy, and we attempt to achieve this by minimizing the empirical risk (1/n) Σ_{i=1}^n L(P_i, F*(M_i)). The overall idea is to learn a scoring function f* such that arg max_{P ⊆ V, |P| ≤ k} f*(M, P) is a good protector. Note that the prevention function f itself is the perfect scoring function, but it is not known to us and no data is available for learning it. Nevertheless, we are able to construct a hypothesis space that not only covers the class of prevention functions (Sec. 3.1) but also enables a simple and robust learning algorithm for searching for a scoring function within it (Sec. 3.2). To construct the desired hypothesis space, let us consider a function class derived through distance functions on subgraphs, where dis_g(S, v) := min_{u∈S} dis_g(u, v) and dis_g(u, v) is the length of the shortest path from u to v in g. The above result (Theorem 1) indicates that the prevention function can be factorized as an affine combination of the distance functions (i.e., f_g^v(M, P | θ)) over subgraphs, with weights given by some φ(g). While the class F_Φ is still not friendly to search as no parameterization of φ(g) is given, the functions therein can be further approximated by using subgraphs randomly drawn from some fixed distribution in Φ, as shown in the following. Definition 3 (Class F_G).
For a subset G = {g_1, ..., g_K} ⊆ Ψ, let us consider the function class F_G := {Σ_{k=1}^K w_k Σ_{v∈V} f_{g_k}^v(M, P | θ) : w ∈ R^K}. Let φ* be any distribution in Φ with φ*(g) > 0 for each g ∈ Ψ, and let G = {g_1, ..., g_K} be a collection of random subgraphs generated i.i.d. from φ*. The following result shows the convergence bound for approximating functions in F_Φ via functions in F_G, which is inspired by the standard analysis of random features [15]. Theorem 2. Let χ be any distribution over 2^V × 2^V and ε, δ > 0 be given parameters. For each f_1 ∈ F_Φ associated with a certain φ_1 ∈ Φ, when K is no less than a threshold determined by ε, δ, and C, with probability at least 1 − δ over g_1, ..., g_K, there exists an f_2 ∈ F_G such that ||f_1 − f_2||_χ ≤ ε, where C := sup_g φ_1(g)/φ*(g) measures the deviation between φ_1 and φ*. Theorems 1 and 2 together imply that each prevention function in F_PF can be well approximated by some function in F_G, provided that G has a sufficient number of random graphs and the weights are correctly chosen. Given that the underlying prevention function is the perfect scoring function, we now have a good reason to search for a scoring function in F_G, and we will do so by learning the weights w_i, guided by the empirical risk in Eq. (3). Now let us assume that the subgraphs {g_1, ..., g_K} have been generated, and we focus on learning the weights. Given the subgraphs G = {g_1, ..., g_K}, according to Eq. (5), our scoring function takes the form f*(M, P) = w^T G(M, P), where G(M, P) collects the per-subgraph distance features and w ∈ R^K are the parameters to learn. For a collection of training pairs {(M_i, P_i)}_{i=1}^n, the condition of zero training error requires that w^T G(M, P) identify P_i as the best protector corresponding to M_i, and it is therefore given by the constraints w^T G(M_i, P_i) > w^T G(M_i, S) for each S ⊆ V with |S| ≤ k and S ≠ P_i. In addition, we require the weights to be non-negative, for several reasons. First, the proof of Theorem 2 tells us that non-negative weights are sufficient to achieve the convergence bound, so such a requirement does not invalidate the function approximation guarantees.
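A minimal sketch of the feature map G(M, P) and the score w^T G(M, P): each coordinate counts the nodes of one sampled subgraph that are saved by the protector. For brevity this sketch uses unit edge weights (hop distances); the toy subgraphs, weights, and function names are invented for illustration, not the paper's implementation:

```python
from collections import deque

def hop_dist(adj, sources):
    """BFS distance from the closest source (unit edge weights for brevity)."""
    dist = {v: float("inf") for v in adj}
    q = deque()
    for s in sources:
        dist[s] = 0
        q.append(s)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == float("inf"):
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def feature(g, M, P):
    """One coordinate of G(M, P): the number of nodes of subgraph g that would
    be M-active without the protector but are saved once P is seeded
    (strictly closer to P than to M, since ties favor the misinformation)."""
    d_m, d_p = hop_dist(g, M), hop_dist(g, P)
    return sum(1 for v in g if d_m[v] < float("inf") and d_p[v] < d_m[v])

def score(subgraphs, w, M, P):
    """w^T G(M, P): weighted sum of the per-subgraph features."""
    return sum(wk * feature(g, M, P) for wk, g in zip(w, subgraphs))

# Toy subgraphs: in g1 the protector intercepts 'b'; g2 has no edges.
g1 = {"m": ["a"], "a": ["b"], "p": ["b"], "b": []}
g2 = {"m": [], "a": [], "p": [], "b": []}
```

Learning then amounts to choosing the weights w so that this score ranks the training protectors highest.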
Second, as discussed later in this section, restricting the weights to be non-negative simplifies the inference problem. Finally, as observed in experiments, such a constraint leads to fast convergence in the training process without sacrificing performance. In the case that Eq. (7) is feasible but the solution is not unique, we aim for the solution with the maximum margin. The standard analysis of SVM yields a quadratic program that minimizes ||w||^2/2 subject to the margin constraints and w ≥ 0. In general, the loss function L(P, S) can be derived from a similarity function SIM(P, S) by L(P, S) := SIM(P, P) − SIM(P, S), where SIM(P, S) ≥ 0 has a unique maximum at S = P. For example, the Hamming loss is given by the similarity function 1(S = P). For the MP problem, since the graph structure is given, we can measure the similarity of two sets in terms of the overlap of their neighborhoods. Specifically, for each S ⊆ V and j ∈ [n], we denote by H_S^j ⊆ V the set of nodes within j hop(s) of any node in S, including S itself, and the similarity between two sets V_1 and V_2 can be measured by SIM_{j-hop}(V_1, V_2) := |H_{V_1}^j ∩ H_{V_2}^j|. We call the loss function derived from such similarities the j-hop loss. Incorporating the loss function into the training process by re-scaling the margin [16], we obtain the constraints w^T G(M_i, P_i) ≥ w^T G(M_i, S) + α · L(P_i, S) − ξ_i, where α is a hyperparameter controlling the scale of the loss. While this program contains an exponential number of constraints for each pair (M_i, P_i), these constraints are equivalent to a single one involving arg min_{S ⊆ V, |S| ≤ k} H_{(M_i, P_i)}(S), where H_{(M,P)}(S) := α · SIM(P, S) − w^T G(M, S). Therefore, the number of constraints can be reduced to a polynomial provided that this minimization, the loss-augmented inference (LAI) problem, can be easily solved. Unfortunately, such a task is not trivial, even under the Hamming loss. Theorem 3. The loss-augmented inference problem is NP-hard under the Hamming loss or the j-hop loss. Furthermore, it cannot be approximated within a constant factor under the j-hop loss unless NP ⊆ DTIME(n^{poly log n}).
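The j-hop similarity and the loss it induces can be sketched as follows (a toy adjacency list; `j_hop_loss` is a hypothetical helper name, not the paper's code):

```python
def j_hop_neighborhood(adj, S, j):
    """H^j_S: all nodes within j hop(s) of S, including S itself."""
    reached, frontier = set(S), set(S)
    for _ in range(j):
        frontier = {v for u in frontier for v in adj.get(u, [])} - reached
        reached |= frontier
    return reached

def j_hop_loss(adj, P, S, j=1):
    """L(P, S) = SIM(P, P) - SIM(P, S), where SIM is the size of the
    overlap of the two j-hop neighborhoods."""
    h_p = j_hop_neighborhood(adj, P, j)
    h_s = j_hop_neighborhood(adj, S, j)
    return len(h_p) - len(h_p & h_s)

# Toy graph: 's' shares only the neighbor 'y' with the true protector 'p'.
adj = {"p": ["x", "y"], "s": ["y"], "x": [], "y": []}
```

By construction the loss is zero exactly when S covers the whole neighborhood of P, and it grows as the neighborhoods diverge.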
For the Hamming loss, minimizing H_{(M,P)}(S) is simply to maximize w^T G(M, S), which is a submodular function (see the proof of Theorem 4), and thus we can utilize the greedy algorithm for a (1 − 1/e)-approximation [17]. For the j-hop loss, the next result reveals a useful combinatorial property of H_{(M,P)}(S) for solving the LAI problem: H_{(M,P)}(S) is the difference of two submodular functions. This result immediately yields a heuristic algorithm for minimizing H_{(M,P)}(S), as shown in Alg. 1. The algorithm is adapted from the modular-modular procedure for DS programming [18], and it guarantees that H_{(M,P)}(S) does not increase after each iteration. Property 1. Alg. 1 guarantees that H(X_{t+1}) ≤ H(X_t), and each iteration takes O(K|V|^2 + K|V||E|). Once the LAI problem is solved, the weights w can be learned using a standard structural SVM. We adopt the one-slack cutting plane algorithm [19]; see Alg. 2 in Supplementary C. Putting the above modules together, we have the following learning strategy: given the social graph and a collection of samples {(M_i, P_i)}_{i=1}^n, (a) select a distribution φ* in Φ and a loss function; (b) generate K random subgraphs {g_1, ..., g_K} using φ*; (c) run the one-slack cutting plane algorithm to obtain w = {w_1, ..., w_K}, where the LAI problem is solved by Alg. 1. Given a new attacker M, the protector is computed by arg max_{S ⊆ V, |S| ≤ k} w^T G(M, S), which is the cardinality-constrained submodular maximization problem and can therefore again be approximated by the greedy algorithm [17]. Here we see that enforcing the weights w to be non-negative makes this problem much more tractable, as otherwise the objective function would not be submodular. Remark 2. Alg. 1 is conceptually simple but practically time-consuming. One can use a limit on the iterations as a simple stopping criterion. In our experiment, using only one iteration in each run of Alg. 1 is sufficient to achieve high training efficacy.
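The greedy routine used for both the Hamming-loss inference and the final prediction can be sketched generically. The coverage function below is a stand-in for the submodular objective w^T G(M, S), used only to make the sketch self-contained:

```python
def greedy(score, ground, k):
    """Greedy (1 - 1/e)-approximation for maximizing a monotone submodular
    set function under the cardinality constraint |S| <= k: repeatedly add
    the element with the largest marginal gain."""
    S = set()
    for _ in range(k):
        base = score(S)
        best, best_gain = None, 0.0
        for v in sorted(ground - S):      # sorted: deterministic tie-breaking
            gain = score(S | {v}) - base
            if best is None or gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        S.add(best)
    return S

# Toy coverage function (submodular): the value of S is the size of the
# union of the element sets its members cover.
covers = {"a": {1, 2, 3}, "b": {3, 4, 5}, "c": {5}}
cov = lambda S: len(set().union(*(covers[v] for v in S))) if S else 0
```

In the actual pipeline, `score` would evaluate w^T G(M, S), so each greedy step costs one pass over the sampled subgraphs.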
For selecting φ*, the requirement that φ*(g) > 0 for each g ∈ Ψ is more technical than practical. Given that no prior information about the diffusion model is available, generating subgraphs uniformly at random is a natural choice, which has been effective in our experiments. The experiments aim to explore: (a) the performance of StratLearner compared with other possible methods in terms of maximizing f(M, P | ∅); (b) the number of features and training pairs needed by StratLearner to achieve a reasonable performance; (c) the impact of the distribution φ used for generating random subgraphs. Social Graph and Diffusion Model. We adopt three types of social graphs: a Kronecker graph [21] with 1024 nodes and 2655 edges, an Erdős-Rényi graph with 512 nodes and 6638 edges, and a power-law graph [22] with 768 nodes and 1532 edges. Following the classic triggering model [23, 24], the transmission time T_{(u,v)} of each edge (u, v) follows a Weibull distribution with parameters randomly selected from {1, ..., 10}, and for each node v, each in-neighbor is included in the triggering set independently with probability 1/d_v, where d_v is the in-degree of v. For each attacker M, the budget k of the protector P is |M|. StratLearner. Each subgraph is generated by selecting each edge independently at random with a probability of 0.01, where each selected edge has a weight of 1.0. We denote such a distribution by φ^{1.0}_{0.01}. The number of subgraphs (i.e., features) is enumerated from {100, 400, 800, 1600}. We adopt the one-hop loss, and the hyperparameter α in Eq. (8) is fixed at 1000. Other Methods. To set some baselines, we denote by Rand the method that randomly selects the protector. Since the graph structure is known to us, we adopt two popular graph-based methods: HighDegree (HD), which selects the nodes with the highest degree as the protector, and Proximity (Pro), which selects the neighbors of the attacker as the protector.
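Drawing subgraphs from a distribution φ^w_p — keep each edge of the social graph independently with probability p and assign it the fixed weight w — can be sketched as follows (function names are made up for illustration):

```python
import random

def sample_subgraph(edges, p, weight, rng):
    """Draw one subgraph from phi^weight_p: keep each edge of the social
    graph independently with probability p, with the given fixed weight."""
    return [(u, v, weight) for (u, v) in edges if rng.random() < p]

def sample_features(edges, K, p=0.01, weight=1.0, seed=0):
    """Generate K i.i.d. subgraphs, one per coordinate of the
    feature map G(M, P). A fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return [sample_subgraph(edges, p, weight, rng) for _ in range(K)]

# Toy edge list standing in for the social graph.
edges = [(0, 1), (1, 2), (2, 0)]
```

For example, `sample_features(edges, K=400)` would correspond to the 400-feature configuration of φ^{1.0}_{0.01} used above.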
Recall that our problem can be treated as a supervised learning problem from 2^V to 2^V, so off-the-shelf learning methods are also applicable. In particular, we have implemented Naive Bayes (NB), MLP, Graph Convolutional Network (GCN) [11], and Deep Set Prediction Networks (DSPN) [10]. GCN can make use of graph information, and DSPN is designed to process set inputs. Training and Evaluation. The size of each attacker M is randomly generated following the power-law distribution with parameter 2.5, and the nodes in M are selected uniformly at random from V. The best protector P_true is computed using the method in [25], which is among the algorithms for Problem 1 that give the best possible approximation ratio. In each run, the training and testing sets, given their sizes, are randomly selected from a pool of 2500 pairs, where the training size is enumerated in {270, 540, 1080, 2160} and the testing size is 270. The subgraphs used in StratLearner are also randomly generated in each run. For each method, the whole training and testing process is repeated five times, and we report the average results with standard deviations. For each predicted protector P_pred, its quality is measured by the performance ratio f(M, P_pred)/f(M, P_true). The details of data generation and the implementations of the tested methods can be found in Supplementary E, which also includes results on a Facebook graph. On StratLearner. The main results are given in Table 1. We see that StratLearner performs better when more training examples or more features are given, and it is quite robust in terms of deviation. In addition, StratLearner is more sensitive to the number of features than to the number of training examples: the performance ratio does not increase much when more training examples are given but increases significantly with more features. Comparison between Different Methods.
With 400 features from φ^{1.0}_{0.01}, StratLearner already outperforms all other methods, regardless of the type of social graph. Plausibly, DSPN and NB are unable to utilize the information of the social graph; GCN is unable to process set structures; HD and Pro ignore the training data. While GCN can make use of the social graph, it merely uses the adjacency between nodes without considering the triggering model. In contrast, StratLearner samples subgraphs and seeks the best combination of them by learning the weights, which, according to Theorem 2, essentially approximates the diffusion process under the triggering model. This enables StratLearner to leverage the social graph to learn the unknown parameters in a more explicit way. On another note, StratLearner, with a moderate number of features, can achieve a performance ratio of no less than 0.7 on all three graphs, but the other learning methods (i.e., NB, MLP, GCN) are quite sensitive to the graph structure. In particular, they perform relatively well on Kronecker but poorly on Power-law and Erdős-Rényi. For example, MLP can achieve a ratio comparable to that of HD and Pro on Kronecker, but it is not much better than Rand on Erdős-Rényi. For the graph-based methods, HD is also sensitive to the graph structure, while Pro consistently offers moderate performance, though worse than StratLearner. Overall, the performance of StratLearner is exciting. The Impact of φ. One interesting question is how the distribution used for generating random subgraphs may affect the performance of StratLearner. First, to test the density of the subgraphs, we consider two distributions φ^{1.0}_{0.005} and φ^{1.0}_{0.1}, where each edge is selected with probability 0.005 (less dense) and 0.1 (more dense), respectively, with edge weights remaining 1.0. The results of this part are given in Fig. 2.
Comparing φ^{1.0}_{0.005} and φ^{1.0}_{0.1} to φ^{1.0}_{0.01}, on Power-law and Erdős-Rényi we observe an increased performance ratio when the subgraphs become denser, but on Kronecker, decreasing the density also results in a better performance ratio (the precise results are given in Table 2 in Supplementary E). We can imagine that increasing the subgraph density does not necessarily increase the performance. Consider the extreme setting φ^{1.0}_{1.0} where each edge is always selected: since there is only one feature, StratLearner reduces to simply maximizing the distance function over the entire graph with a uniform weight. As we can see from Fig. 2, StratLearner does not perform well under φ^{1.0}_{1.0}. This is very intuitive, as the search space is too simple to contain a good scoring function. Second, we leak some information about the underlying model and construct φ^+_+, where each edge is selected with a probability of 1/d_v, exactly the same as in the underlying model, with edge weights sampled from their associated Weibull distributions. While φ^+_+ is not obtainable under our learning setting, the goal here is to verify that StratLearner can benefit from such "cheat" subgraphs. Indeed, as shown in Fig. 2, StratLearner can then produce protectors that are almost as good as the optimal ones. With only 100 features from φ^+_+, the performance ratio is no less than 0.95 on all three graphs. This confirms that StratLearner does work the way it is supposed to, and it also suggests that such prior knowledge, if available, can be easily exploited by StratLearner. Random Features. Our parameterization method is inspired by the technique of random Fourier features [15, 26], which is an effective method for many learning problems (e.g., [27, 28, 29]). In particular, we show that subgraph sampling can be used to generate random features, and a subtle combination of them gives a kernelized scoring function w^T G(M, P) that coincides with the triggering model.
This suggests a new way of incorporating graphs into a learning process, and it differs from other methods, such as graph neural networks or attention mechanisms, that often use the entire graph. Set Function Learning. Our problem can be taken as a set function learning problem with sets as both input and output. A learning method for such problems should respect the set structure's invariance to permutation, but neural networks often take vectors as input, and their output is sensitive to the positions of the input values. One possible method is to use permutation-invariant operations like sum or max [30], and another idea is to enforce the network to learn permutation-invariant representations [10]. Our method is different. StratLearner is invariant to permutations of the input set because the constructed kernel function G(M, P) is combinatorial: it is a set function. It is also invariant to permutations of the output set because the inference method is a combinatorial algorithm that directly outputs a set. In fact, set algorithms are conceptually permutation-invariant operations that generalize sum, avg, or max. Misinformation Prevention and Learning Diffusion Models. Kempe et al. [9] formulate the discrete triggering model, and Du et al. [12] later propose the continuous model for modeling information diffusion. The MP problem is first formulated by Budak et al. [7]. Even if the diffusion model is given, the MP problem is still challenging because it is NP-hard [7] and its objective function is #P-hard to compute [31]. Later, in [32], the authors study the problem of identifying the best intervention period based on the Hawkes process. Learning the diffusion model from real data is another relevant research branch [33, 34, 35, 36]. Du et al. [23] design an algorithm to learn the diffusion function without knowing the type of the diffusion model; He et al. [24] study the same problem assuming the information is incomplete; Kalimeris et al.
[37] propose a method that parameterizes each edge using the same hyperparameter. Different from the above works, this paper aims to learn a solution to the MP problem, and it does not attempt to learn the diffusion model. The work in this paper focuses on operational diffusion models without specifying a particular social network platform. Our work proposes a framework for computing protectors, but more importantly and broadly, it suggests a new method for solving learning problems by integrating graph input into structured prediction. In addition, we do not anticipate any bias in the data used for the experiments, because the involved subgraphs, the underlying triggering model, the training examples, and the training-testing partition were all randomly determined with enough repetitions. One exception is that we have considered only three graph types, Kronecker, Power-law, and Erdős-Rényi, which may introduce a bias toward these graph structures. However, given that the results of StratLearner are robust over these graphs, we believe the observations generalize to other graph structures. The first graph in Fig. 3 shows a triggering model where each node is associated with a distribution over its in-neighbors and each edge holds a distribution over the activation time. The second graph shows one possible scenario after initialization: a weighted subgraph where (u, v) is in the graph iff u ∈ A_v. Note that the initialization step in the diffusion process is equivalent to generating a weighted subgraph with edges ∪_{u∈V} {(v, u) | v ∈ A_u}, in which each edge e has a weight of t_e. Therefore, each diffusion model defines a distribution φ over Ψ, and it suffices to prove f(M, P | ∅) = ∫_Ψ Σ_{v∈V} φ(g) · f_g^v(M, P | ∅) dg. Exchanging the summation with the integration and using the linearity of expectation, it suffices to prove that ∫_Ψ φ(g) · f_g^v(M, P | ∅) dg is equal to the probability that v will not be M-active with P but would have been M-active without P.
Since the rest of the diffusion is determined after realization, it is left to prove that for each v* ∈ V, f_g^{v*}(M, P | ∅) = 1 iff, under the initialization corresponding to g, (a) v* will not be M-active with P and (b) v* will be M-active without P. This can be easily established from the facts that: (a) a node v can be activated by a cascade only if there is a path from its seed nodes to v; (b) a node is activated by the first cascade arriving at it; (c) the arrival time of each cascade depends on the length of the shortest path from the source nodes to v. A formal argument can be obtained by induction along the shortest path from arg min_{u∈M∪P} dis_g(u, v) to v. The proof uses McDiarmid's Inequality ([38]): let X_1, ..., X_m be independent random variables with domain X, and let f: X^m → R be a function satisfying |f(x_1, ..., x_i, ..., x_m) − f(x_1, ..., x_i', ..., x_m)| ≤ c_i for each i and x_1, ..., x_m, x_i' ∈ X; then for each ε > 0, we have Pr[|f(X_1, ..., X_m) − E[f(X_1, ..., X_m)]| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1}^m c_i²). Consider h_i(M, P) := (φ_1(g_i)/φ*(g_i)) Σ_{v∈V} f_{g_i}^v(M, P) and the average function f_2 := (1/K) Σ_{i=1}^K h_i. Note that f_2 ∈ F_G and E[h_i(M, P)] = f_1(M, P) for each (M, P). Let us denote the quantity of interest by Δ_K(g_1, ..., g_K) := ||f_2 − f_1||, where the norm is taken under the Lebesgue measure associated with χ (the distribution over the pairs (M, P)). To show the stability of Δ_K(g_1, ..., g_K), for each g_1, ..., g_K, g* ∈ Ψ and i ∈ [K], replacing g_i by g*, the change is bounded, by the reverse triangle inequality, as |Δ_K(g_1, ..., g_i, ..., g_K) − Δ_K(g_1, ..., g*, ..., g_K)| ≤ ||f_2(g_1, ..., g_i, ..., g_K) − f_2(g_1, ..., g*, ..., g_K)||. An instance of the LAI problem is given by the social graph G = (V, E), a collection {g_1, ..., g_K} of subgraphs, a weight vector w, the budget k, and two node sets M and P. Recall that the objective function is H_{(M,P)}(S) := α · SIM(P, S) − w^T G(M, S). The NP-hardness can be easily established through a reduction from the max k-coverage problem.
The max k-coverage problem is given by an element set P = {p_1, ..., p_N} and a collection Q = {q_1, ..., q_M} ⊆ 2^P, and it asks for l sets in Q with the largest union. Setting K = 1 and w_1 = 1, let us consider the g_1 given in Figure 4, where one node is created for each p_i and each q_j, with an extra node z added to the graph. There is an edge from node q_j to node p_i if and only if element p_i is in set q_j, with a weight of 0.5; there is an edge from z to each p_i with a weight of 1. The social graph G can be any supergraph of g_1, and P can be any node set that does not contain any node in g_1. Setting k = l and M = {z}, we see that the S ⊆ Q minimizing H_{(M,P)}(S) corresponds to the collection with the largest union in the max k-coverage problem. To prove the approximation hardness, we use a reduction from the positive-negative set cover (k±PSC) problem. Problem 2 (k±PSC problem). An instance of k±PSC is a triplet (X, Y, Φ) with an integer l ∈ Z^+, where X and Y are two sets of elements with X ∩ Y = ∅, and Φ = {φ_1, ..., φ_q} ⊆ 2^{X∪Y} is a collection of q ∈ Z^+ subsets over X ∪ Y. For each Φ* ⊆ Φ, its cost is defined as cost(Φ*) := |X \ ∪_{φ∈Φ*} φ| + |Y ∩ ∪_{φ∈Φ*} φ|, i.e., the number of elements of X left uncovered plus the number of elements of Y covered. The k±PSC problem seeks a Φ* ⊆ Φ with |Φ*| = l such that the cost is minimized. The following hardness of k±PSC follows fairly directly from Miettinen [39]. Lemma 1 ([39]). Unless NP ⊆ DTIME(n^{poly log n}), there exists no polynomial-time approximation algorithm for the k±PSC problem with a ratio of Ω(2^{log^{1−ε} q}) for each ε > 0. Given an instance (X, Y, Φ) of k±PSC, we construct an instance of LAI as follows. The social graph, shown in Fig. 5a, is composed of the following parts: • Nodes: There are four groups of nodes X, X*, Y, and Φ, where each node in X, Y, and Φ corresponds to its counterpart in the k±PSC instance, and X* is a copy of X.
In addition, there are two extra nodes z_1 and z_2. • Edges: the edges can be grouped into several parts: – There is an edge from z_2 to each node in X* ∪ Y, and an edge from z_1 to each node in X; – There is an edge from each node in Φ to each node in X*; – There is an edge from each node in X to each node in X* ∪ Y; – There is an edge (u, v) for each pair of nodes in X* ∪ Y (this part is not shown in the graph); – There is an edge from φ_i to y_j if and only if y_j ∈ φ_i; – There is an edge from φ_i to x_j if and only if x_j ∈ φ_i. We again set K = 1, w_1 = 1, and α = 1, and g_1 is the subgraph of G that includes the nodes Φ, X, and z_1 and the edges between them, as shown in Fig. 5b. In g_1, each edge between Φ and X has a weight of 0.5; each edge between X and z_1 has a weight of 1.0. We set P as {z_2} and M as {z_1}. We consider the one-hop loss and set l = k. Due to the edges within X* ∪ Y and those from X to X* ∪ Y, to minimize SIM(P, S) (i.e., the overlap of one-hop neighborhoods), only the nodes in Φ should be selected. To maximize w^T G(M, S), according to the construction of g_1, we see again that the optimal solution must consist of nodes in Φ. Therefore, for the LAI problem, it suffices to restrict the node selection to Φ. For each subset Φ* ⊆ Φ, SIM(P, Φ*) is exactly |X*| plus the number of nodes in Y that are connected from some node in Φ*, and w^T G(M, Φ*) is the number of nodes in X that are connected from some node in Φ*. Therefore, minimizing H_{(M,P)}(Φ*) is equivalent to minimizing the cost of Φ* in the k±PSC instance, which completes the reduction. The reduction yields the hardness result immediately. A set function h over a ground set U is submodular if it has diminishing marginal returns, i.e., h(A + v) − h(A) ≤ h(B + v) − h(B) for each B ⊆ A and v ∉ A. SIM_{j-hop} is submodular as it is a coverage function [40]. We can easily verify that f_g^v(M, S | ∅) is also submodular for each subgraph g and v, and therefore w^T G(M, Φ*) is submodular as well, since it is a non-negative weighted sum of submodular functions.
It is well known that submodular functions have both tight modular upper bounds and tight modular lower bounds. A modular upper bound of SIM_{j-hop} together with a modular lower bound of w^T G(M, Φ*) gives a modular upper bound of H_(M,P)(S). In particular, for a general submodular function h over U, the constructions can be found in [18], as summarized below.

Modular Lower Bound [41]. For a permutation σ over U = [n], let us define S_σ^i := {σ(1), ..., σ(i)} and the modular function h_σ : 2^U → R given by

h_σ(S) = Σ_{i : σ(i) ∈ S} [ h(S_σ^i) − h(S_σ^(i−1)) ].

Since h is submodular, for each X and a permutation σ_X such that X = S_{σ_X}^{|X|}, the modular function satisfies h_{σ_X}(S) ≤ h(S) for each S ⊆ U, and h_{σ_X}(X) = h(X).

Modular Upper Bound [17, 18]. For each X ⊆ U, a modular upper bound of h(S) that is tight at X is given by

h^X(S) := h(X) − Σ_{v ∈ X \ S} [ h(X) − h(X \ {v}) ] + Σ_{v ∈ S \ X} [ h({v}) − h(∅) ].

The first part follows from the fact that H(X_{t+1}) ≤ H^{X_t}(X_{t+1}) ≤ H^{X_t}(X_t) = H(X_t), where H^{X_t} denotes the modular upper bound of H tight at X_t. Evaluating the function takes O(K(|E| + |V|)) time, and thus each iteration takes O(K|V|(|E| + |V|)) time.

We adopt the one-slack cutting plane algorithm (Alg. 2) in [19] for training our structural SVM, with the only modification that a nonnegativity constraint on w is added. The tie-breaking in Sec. 2.1 is done by always giving the misinformation a higher priority. If we did the reverse, the only modification to StratLearner would be to replace Eq. (4) with its counterpart under the reversed tie-breaking.

The data and source code used are available in the supplementary files. The Kronecker graph is generated using SNAP with parameters [0.9, 0.6; 0.6, 0.1]. The power-law graph and the Erdős-Rényi graph are generated using NetworkX [42]. Each edge is associated with a Weibull distribution with density (β/α)(t/α)^(β−1) exp(−(t/α)^β), where α and β are selected from {1, ..., 10} uniformly at random. To generate one pair of attacker and protector, we first sample the size of the attacker from the power-law distribution with parameter 2.5. Given the size of the attacker, the nodes in M are randomly selected from V. Given the attacker M, the protector P is computed using the method in [25].
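To make the two bound constructions of [41] and [17, 18] concrete, here is a small self-contained sketch on a toy coverage function (the function names and the toy instance are ours; the bounds themselves are the standard ones the text summarizes). It checks the tightness of both bounds at X and the bound direction elsewhere.

```python
def modular_lower_bound(h, sigma):
    """Modular lower bound of a submodular h from a permutation sigma:
    element sigma(i) gets the marginal gain h(S^i) - h(S^{i-1}) along
    the chain S^i = {sigma(1), ..., sigma(i)}."""
    gains, prev, prefix = {}, 0.0, set()
    for v in sigma:
        prefix = prefix | {v}
        cur = h(prefix)
        gains[v] = cur - prev
        prev = cur
    return lambda S: sum(gains[v] for v in S)

def modular_upper_bound(h, X):
    """Modular upper bound of a submodular h that is tight at X:
    h(X) - sum_{v in X\\S} [h(X) - h(X-{v})] + sum_{v in S\\X} [h({v}) - h({})]."""
    def m(S):
        val = h(X)
        for v in X - S:
            val -= h(X) - h(X - {v})
        for v in S - X:
            val += h({v}) - h(set())
        return val
    return m

# Toy submodular function: coverage over three sets.
sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}}
def h(S):
    return len(set().union(*[sets[i] for i in S]))

X = {0, 1, 2}
lower = modular_lower_bound(h, [0, 1, 2])  # a sigma_X with X = S^{|X|}
upper = modular_upper_bound(h, X)
```

Both bounds agree with h at X, which is exactly the tightness property the iterative minimization argument relies on.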
Repeating this process, we generate a pool of 2,500 pairs for each graph.

StratLearner. The one-slack cutting plane algorithm is implemented based on PyStruct [43] with hyperparameters ε = 0.001 and C = 0.01.

For the MLP, we adopt three hidden layers of size (512, 512, 256) with ReLU as the activation function. The node sets are encoded as one-hot vectors, and the loss function is the pointwise cross-entropy between the output layer and the truth vector, plus an L2 regularizer. We use the Adam optimizer with a dropout rate of 0.5, and the learning rate is 0.001 with exponential decay.

We adopt the vanilla GCN model [11] with two GCN layers followed by our MLP. Since the model in [11] was designed for semi-supervised learning, we slightly modify the flow to make it work for supervised learning. Other settings are the same as those in the MLP.

DSPN is proposed in [30], where the main modules are the input encoder, the set encoder, and the set decoder. Given an attacker, we encode it as a set of elements, where the feature of each element is the associated one-hot vector. The input encoder and set encoder are MLPs with three hidden layers of size 512. The inner optimization is performed for 10 steps with a rate of 1,000,000 in each round, and the outer loop is optimized with Adam with a learning rate of 0.01.

The precise results in Fig. 2 are given in Table 2. We also tested a Facebook graph with 4,039 nodes from SNAP, where StratLearner is trained with 100 subgraphs from distribution φ^{1.0}_{0.1} and 270 training examples are used in each learning-based method. Other settings are the same as in the experiments in the main paper. The results are given in Table 3. Overall, similar to Table 1 in the main paper, we observe that StratLearner outperforms the other competitors by an evident margin.
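The set encoding and loss used by the MLP baseline can be sketched as follows. This is a minimal illustration with our own helper names (the actual baseline uses standard deep-learning tooling); it shows the one-hot encoding of a node set and the pointwise cross-entropy against the truth vector.

```python
import math

def one_hot(nodes, n):
    """Encode a node set S ⊆ {0, ..., n-1} as a 0/1 indicator vector."""
    v = [0.0] * n
    for u in nodes:
        v[u] = 1.0
    return v

def pointwise_cross_entropy(pred, truth):
    """Pointwise (per-node) cross-entropy between predicted
    probabilities and the 0/1 truth vector, summed over nodes."""
    eps = 1e-12  # numerical guard against log(0)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, truth))

truth = one_hot({1, 3}, 5)                 # protector {1, 3} over 5 nodes
good = [0.01, 0.99, 0.01, 0.99, 0.01]      # confident, correct prediction
bad = [0.5] * 5                            # uninformative prediction
```

A confident correct prediction yields a much smaller loss than the uninformative one, which is the signal the baseline trains on.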
References

- False information on web and social media: A survey
- The web of false information: Rumors, fake news, hoaxes, clickbait, and various other shenanigans
- Detecting misinformation in online social networks using cognitive psychology
- Limiting the spread of misinformation in social networks
- On misinformation containment in online social networks
- Maximizing the spread of influence through a social network
- Deep set prediction networks
- Semi-supervised classification with graph convolutional networks
- Scalable influence estimation in continuous-time diffusion networks
- Influence blocking maximization in social networks under the competitive linear threshold model
- An efficient randomized algorithm for rumor blocking in online social networks
- Uniform approximation of functions with random bases
- Large margin methods for structured and interdependent output variables
- An analysis of approximations for maximizing submodular set functions—I
- Algorithms for approximate minimization of the difference between submodular functions, with applications
- Cutting-plane training of structural SVMs
- Kronecker graphs: An approach to modeling networks
- Search in power-law networks
- Influence function learning in information diffusion networks
- Learning influence functions from incomplete observations
- Beyond uniform reverse sampling: A hybrid sampling technique for misinformation prevention
- Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning
- Kernel methods for deep learning
- On random weights and unsupervised feature learning
- À la carte: Learning fast kernels
- Deep sets
- Scalable influence maximization for prevalent viral marketing in large-scale social networks
- Fake news mitigation via point process based intervention
- Learning influence probabilities in social networks
- Predicting adoption probabilities in social networks
- Selecting information diffusion models over social networks for behavioral analysis
- Influence propagation in social networks: A data mining perspective
- Learning diffusion using hyperparameters
- On the method of bounded differences
- On the positive-negative partial set cover problem
- Influence estimation and maximization in continuous-time diffusion networks
- Submodular functions and optimization
- Exploring network structure, dynamics, and function using NetworkX
- PyStruct: Learning structured prediction in Python

Algorithm 2 One-slack Cutting Plane
1: Input: (M_1, P_1), ..., (M_n, P_n), C, ε, α
2: W ← ∅
3: repeat
4:   Solve the QP over the constraints W subject to w ≥ 0
5:   for i = 1, ..., n do
6:     P̂_i ← arg min_{|S| ≤ k} SIM(P_i, S) − w^T G(M_i, S)
7:     …