Socialbots on Fire: Modeling Adversarial Behaviors of Socialbots via Multi-Agent Hierarchical Reinforcement Learning
Thai Le, Long Tran-Thanh, Dongwon Lee (2021-10-20)

Socialbots are software-driven user accounts on social platforms that act autonomously (mimicking human behavior), with the aim of influencing the opinions of other users or spreading targeted misinformation for particular goals. As socialbots undermine the ecosystem of social platforms, they are often considered harmful. As such, there have been several computational efforts to auto-detect socialbots. However, to the best of our knowledge, the adversarial nature of these socialbots has not yet been studied. This begs the question: "Can adversaries, controlling socialbots, exploit AI techniques to their advantage?" To this question, we successfully demonstrate that it is indeed possible for adversaries to exploit computational learning mechanisms such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding detection. We first formulate adversarial socialbot learning as a cooperative game between two functional hierarchical RL agents. While one agent curates a sequence of activities that can avoid detection, the other agent aims to maximize network influence by selectively connecting with the right users. Our proposed policy networks train on a vast number of synthetic graphs and generalize better than baselines on unseen real-life graphs, both in terms of maximizing network influence (up to +18%) and sustainable stealthiness (up to +40% undetectability) under a strong bot detector (with 90% detection accuracy). During inference, the complexity of our approach scales linearly, independent of a network's structure and the virality of news. This makes our approach a practical adversarial attack when deployed in a real-life setting.

Socialbots refer to automated user accounts on social platforms that attempt to behave like real human accounts, often controlled by automatic software, a human, or a combination of both, i.e., cyborgs [4]. Different from traditional spambots, which may not have proper profiles or can be easily differentiated from regular accounts, socialbots often mimic the profiles and behaviors of real-life users by using a stolen profile picture or biography, building legitimate followships, replying to others, etc. [4]. Socialbots are often blamed for spreading divisive messages, e.g., hate speech, disinformation, and other low-credibility content, which have been shown to widen political divides and distrust among both online and offline communities [4, 20, 30]. To mitigate such harmful proliferation of socialbots, there has therefore been extensive research, most of which focuses on how to effectively detect them [10, 37, 54]. However, these works usually follow a cat-and-mouse game in which they passively wait for socialbot evasion to happen before they can react and develop a suitable detector [8]. Instead of following such a reactive scheme, proactively modeling socialbots and their adversarial behaviors on social platforms can better advance the next generation of bot detection research. In particular, we pose the question: "Can socialbots exploit computational learning mechanisms such as reinforcement learning to their advantage?" To the best of our knowledge, the adversarial nature of socialbots has not yet been fully explored and studied.
However, it is plausible that adversaries who own a farm of socialbots operate them according to certain strategies (or algorithms). Therefore, proactively simulating such a computational learning mechanism and better understanding the adversarial aspects of socialbots would greatly benefit future research on socialbot detection. In general, a socialbot has two main objectives that are adversarial in nature: (i) to facilitate mass propaganda propagation through social networks and (ii) to evade and survive under socialbot detectors. The first goal can be modeled as an NP-Hard influence maximization (IM) problem [25], where the bot needs to build up its network of followers, i.e., seed users, over time such that any new message propagated from the bot through these users can effectively spread out and influence many other people. Simultaneously, it also needs to systematically constrain its online behaviors so that it does not easily expose itself to socialbot detectors. Although the IM problem has been widely studied [3, 24, 25, 27], these works only focus on maximizing the network influence given a fixed, static, and relatively small budget of seed nodes, and they assume that every node is equally acquirable. However, these assumptions are not practical in our context. Not only does a socialbot need to continuously select the next best seed node or follower over a long temporal horizon, i.e., a potentially large budget of seed nodes, it also needs to consider that gaining a followship from a very influential actor, e.g., Elon Musk, is practically much more challenging than from a normal user. At the same time, a socialbot that optimizes its network of followers must also refrain from suspicious behaviors, e.g., constantly following others, that can draw the attention of bot detectors. Thus, learning how to navigate a socialbot is a very practical yet challenging task with two intertwined goals that cannot be optimized separately. Toward this challenge, in this paper, we formulate the Adversarial Socialbot Learning (ASL) problem and design a multi-agent hierarchical reinforcement learning (HRL) framework to tackle it.

Our main contributions are as follows.
• First, we formulate the novel ASL problem as an optimization problem with constraints.
• Second, we propose a solution to the ASL problem by framing it as a cooperative game of two HRL agents that represent two distinctive functions of a socialbot, namely (i) selecting the next best activity, e.g., tweet, retweet, reply, mention, and (ii) selecting the next best follower. We carefully design the RL agents and exploit unsupervised graph representation learning to minimize the potential computational cost resulting from a long time horizon and a large graph structure.
• Third, we demonstrate that such RL agents can learn from synthetic graphs yet generalize well on real unseen graphs. Specifically, our experiments on a real-life dataset show that the learned socialbot outperforms baselines in terms of influence maximization while sustaining its longevity by continuously evading a strong black-box socialbot detector with 90% detection accuracy. During inference, in addition, the complexity of our approach scales linearly and is independent of a network's structure and the virality of news.
• Fourth, we release an environment under OpenAI's gym library [1]. This enables researchers to simulate various adversarial behaviors of socialbots and develop novel bot detectors in a proactive manner.
The majority of previous computational works on socialbots within the last decade [2, 10, 37, 39, 42, 52, 54] primarily focus on developing computer models to effectively detect bots on social networks [4, 8]. These models are usually trained on a ground-truth dataset using supervised learning algorithms, e.g., Random Forest, Decision Tree, or SVM, to classify an individual social media account into a binary label, i.e., bot or legitimate [4]. These learning algorithms usually depend on either a set of statistical engineered predictive features, such as the number of followers or tweeting frequency [5, 42, 54], or a deep learning network where the features are automatically learned from unstructured data such as an account's description text. Even though many possible features can be used to detect socialbots, statistical features that can be directly extracted from user metadata provided by official APIs, e.g., the Twitter API, are more practical due to their favorable computational speed [54]. In fact, many of the features utilized by the popular socialbot detection API Botometer fall into this category. Moreover, we later show that simple statistical features derived from user metadata can train a socialbot detector with around 90% prediction accuracy on a hold-out test set (Sec. 3.1). Regardless of how a socialbot detector extracts its predictive features, such detectors are mainly designed following a reactive schema: they learn how to detect socialbots after the bots appear (so that a training dataset can be collected).

While previous works help us better understand the detection aspect of socialbots, their learning aspect has not been widely studied [8]. Distinguished from learning how to detect socialbots using a stationary snapshot of their features, ASL computationally models the adversarial learning behaviors of socialbots over time. To the best of our knowledge, relevant work on this task is limited to [7], which adopts an evolutionary optimization algorithm to find adversarial permutations of a fixed socialbot's encoded activity sequence, e.g., "tweet→tweet→retweet→reply→...", and examines whether such permutations can help improve the detection accuracy of a bot detector. However, such permutations, even though adversarial in nature, are just static snapshots of a socialbot and do not tell the whole story of how the bot evolves. In other words, we still lack a general computational framework that models the temporal dynamics of socialbots and their adversarial behaviors. Therefore, this paper formally formulates these behaviors as a Markov Decision Process (MDP) [21] and designs an RL framework to train socialbots that optimize their adversarial goals on real-life networks.

We investigate two adversarial objectives of a socialbot: influencing people while evading socialbot detection. While the first can be modeled as an IM task on graph networks, traditional IM algorithms, e.g., [3, 25, 27], assume that the number of seed nodes is relatively small and that all nodes are equally acquirable, neither of which applies in the socialbot context as previously described. A few works, e.g., [33, 49], apply RL to the IM task, yet their scope is still limited to a single constraint on the budget number of seeds. Influence maximization under a temporal constraint, i.e., avoiding detection, which would otherwise lead to early episode termination, is a non-trivial problem.
Network Representation and Influence Diffusion Model. A social network includes users, their interactions, and how they influence each other. We model this network as a directed graph G = (V, E). An edge between two users u, v ∈ V, denoted (u, v) ∈ E, means u can have influence on v. (u, v) also indicates that a piece of news can spread from u to v, i.e., v follows u (and is thus influenced by u). As no influence model can perfectly reflect real-world behaviors, to model the influence flow through G we adopt the Independent Cascade Model (ICM) [16, 17], which is the most commonly used in the context of a social network [22, 26, 34]. ICM was originally proposed to model "word-of-mouth" behaviors, which resemble online information-sharing phenomena well. In ICM, a node is either active or inactive. Once a node is activated, it has a single opportunity to activate or influence each of its inactive neighbors N(·) with a uniform activation probability p. At first, every node is inactive except a set of seed nodes S. After that, as the environment rolls out over a sequence of discrete timesteps, the influence propagates from S through the network, activating nodes in V along the edges E with probability p. The process ends when no additional nodes are activated [24, 32]. Hence p also captures the virality of news, i.e., how fast a piece of news can travel through G. We then use G = (V, E, p) to denote the social network.

Let σ(S, p) denote the spread function that measures how many nodes in V a piece of information, e.g., fake news, can reach from S via the ICM model. Given a fixed network structure (V, E) and the news virality p, different choices of S result in different values of σ(S, p). Hence, selecting a good S is decisive in optimizing the spread of influence on G. However, choosing S to maximize σ(S, p) has been proven to be NP-Hard [25].

A socialbot is then a vertex in G that attempts to mimic human behaviors for various aims, e.g., spreading propaganda or low-credibility content through G [4, 44, 46]. It carries out a sequence of activities A to simultaneously achieve two main objectives:
Obj. 1: Optimizing its influence over G by selectively collecting good seed nodes, i.e., followers, S ⊆ V, over time.
Obj. 2: Evading bot detectors, i.e., not being detected and removed.
These two goals are often in tension: improving Obj. 1 typically hurts Obj. 2 and vice versa. While a good network of followers S enables a socialbot to spread disinformation to a large number of users at any time, high undetectability helps it sustain this advantage over a long period. Socialbots are usually deployed in groups, and later-coming socialbots can easily inherit the previously established follower network S of a current one. If a bot is detected and removed from G, not only can it lose its followers S and expose itself to being used to develop stronger detectors, it also risks revealing the identity of other bots, e.g., by way of guilt-by-association [50]. This makes the sustainability achieved through Obj. 2 distinctly important, in contrast to previous literature, e.g., [24, 25, 51], where the optimization of S plays the more central role.
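To make the ICM diffusion process and the spread function σ(S, p) defined above concrete, the following is a minimal, illustrative sketch with a Monte Carlo estimate of σ. The adjacency-dict graph representation, function names, and the number of simulation rounds are our own assumptions for exposition, not the paper's implementation.

```python
import random
from collections import deque

def icm_spread(graph, seeds, p, rng):
    """One ICM cascade on a directed graph given as {node: [out-neighbors]}.
    Each newly activated node gets a single chance to activate each of its
    inactive out-neighbors with probability p. Returns #activated nodes."""
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in graph.get(u, []):
            if v not in active and rng.random() < p:
                active.add(v)       # v is activated and may spread further
                frontier.append(v)
    return len(active)

def estimate_sigma(graph, seeds, p, n_runs=1000):
    """Monte Carlo estimate of the spread function sigma(S, p)."""
    rng = random.Random(42)
    return sum(icm_spread(graph, set(seeds), p, rng) for _ in range(n_runs)) / n_runs
```

Since each cascade touches each edge at most once, one simulation costs O(|E|); the estimate simply averages over repeated cascades.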
Relationship between A and S. A denotes the activity sequence, i.e., the DNA of the bot [6]. A includes four possible types of actions to be taken at every timestep t, namely tweet, retweet, reply, or mention, and only the last three can directly interact with others to expand S. Although these actions are in the Twitter context, other platforms provide similar functions, e.g., tweet→post, retweet→share, reply→comment, and mention→tag on Facebook. In practice, not every node requires an equal effort to convert into a follower. For example, a bot needs to accumulate its reputability over time and interact more frequently to win over an influencer, e.g., Elon Musk, rather than a normal user, as a follower. Since the true model underlying this observation is unknown, we capture it with a simple heuristic (Eqn. (1)): with a hyper-parameter λ ≥ 1, r(v, t) is the number of times the environment requires the socialbot to continuously interact with an influencer v, i.e., a node with a high N(v), before v becomes a follower at timestep t. Intuitively, a bot with a good reputation over time, i.e., a large number of followers |S_t| at timestep t, can persuade others to follow it more effortlessly than a newly created bot. Overall, A encodes when and what type of interaction, i.e., retweet, reply, or mention, is used to acquire a new follower v ∈ S, and it also determines the frequency of such interactions in A. Thus, A and S are temporally co-dependent.

Socialbot Detection Model. Bot detectors are responsible for detecting and removing socialbots from G. Let F(A_t) ∈ {0, 1} denote a model that predicts whether or not an account is a socialbot based on its activity sequence A_t up to timestep t. Socialbot detectors usually represent this sequence of ordered activities as an unordered list of statistical features, such as the number of replies or tweets per day [10, 37, 54]. In this paper, F extracts and adopts several features (Table 1) from previous works for detection; most of these features are utilized by the popular bot detection API Botometer [9]. We train F using the Random Forest [47] algorithm with supervised learning on a publicly available dataset [36, 53] of nearly 15K Twitter accounts, half of which are labelled as socialbots. This dataset is not exposed to the socialbots. Here we also assume that F(·) is a black-box model, i.e., we do not have access to its parameters. F achieves nearly 90% F1 score on an unseen test set under the standard 5-fold cross-validation (80%/20% train/test split). Since A and S are co-dependent, S also affects the detectability of a socialbot. Note that, to focus on studying the adversarial aspect of socialbots, we had to commit to a particular combination of account features and socialbot detection model; 90% F1 is also in line with SOTA detectors on a similar set of features [37].

From the above analysis, this paper proposes to study the Adversarial Socialbot Learning (ASL) problem, which pursues both Obj. 1 and Obj. 2. In other words, we aim to solve the following problem.

Problem: Adversarial Socialbot Learning (ASL) aims to develop an automatic socialbot that can exercise adversarial behaviors against a black-box bot detector F while at the same time maximizing its influence on G through a set of selective followers S.

[Figure 1 caption: Since the selected user v4 at timestep t4 is an influencer, AgentII needs to perform the acquiring action not once but λ=3 times to acquire v4 (blue arrow). Whenever |A| reaches an interval of K=7, the bot detector F(A_t) is triggered (red arrow).]
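Before turning to the objective function, here is a deliberately simplified sketch of how a detector like F can be built: a Random Forest trained on unordered statistics of an activity sequence. The feature extractor and the placeholder training data are assumptions for exposition only; the actual feature set is the one listed in Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

ACTIVITIES = ("tweet", "retweet", "reply", "mention")

def activity_features(activity_seq):
    """Collapse an ordered activity sequence A_t into unordered statistics:
    the relative frequency of each activity type plus the sequence length."""
    total = max(len(activity_seq), 1)
    return [activity_seq.count(a) / total for a in ACTIVITIES] + [total]

# Placeholder data standing in for the ~15K labelled accounts
# (1 = socialbot, 0 = legitimate); real features would follow Table 1.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
F = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("F1 on held-out accounts:", f1_score(y_te, F.predict(X_te)))
```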
Specifically, we formulate this task as an optimization problem with the following objective function.

Objective Function: Given a black-box bot detection model F and a social network environment characterized by G = (V, E, p), a detection interval K, and a horizon T, we want to optimize the objective function given in Eqn. (2). The socialbot detector F could run a prediction on the socialbot every time it performs a new activity; however, A can potentially grow very long. Thus, we assume that F only runs detection every time K new activities are added to A (Eqn. 2b). This makes T* the earliest interval timestep at which a socialbot is detected and removed by F (Eqn. 2b, c). Since T*·σ(S_T*, p) is monotonically increasing in both σ(S_T*, p) ≥ 0 and T* ≥ 1, to maximize T*·σ(S_T*, p) a socialbot cannot focus only on either Obj. 1 or Obj. 2. In other words, Eqn. (2d) encourages the socialbot to simultaneously optimize both objectives.

The ASL problem can be formulated as an MDP consisting of a state space, an action space, a transition function P, a reward function R, a discount factor γ ∈ [0, 1], and the horizon T. Since the action space can be very large, i.e., 4|V| for 4 possible activities times |V| possible seed nodes, especially on a large network, the task can become much harder to optimize due to the potential sparse-reward problem. To overcome this, we transform the problem into an HRL framework of two functional agents, AgentI and AgentII, with a global reward (Figure 1). We call this the ACORN (Adversarial soCialbOts leaRniNg) framework. While AgentI is responsible for deciding which type of activity among {tweet, retweet, reply, mention} to perform at each timestep t, AgentII is mainly responsible for S, i.e., selecting which follower to acquire, and acts only when AgentI chooses an interactive activity, i.e., retweet, reply, or mention. This reduces the overall action space to only |V|+4. Since A and S are co-dependent (Sec. 3.1), the two agents need to continuously cooperate to optimize both influence maximization and undetectability. Note that the Markov assumption behind this MDP is not violated, because both the influence function σ(·) and the detection decision F at timestep t depend only on statistical snapshots of the two agents at t−1. This HRL task is described in detail as follows.

State. Following [12, 29, 35], we assume that the state space can be factorized into a bot-specific part S_DNA and a network-specific part S_ENV, with s1_t ∈ S_DNA and s2_t ∈ S_ENV being the states of AgentI and AgentII, respectively. Specifically, s1_t encodes (i) the number of followers |S_t| of the bot and (ii) a snapshot of A at timestep t. While s1_t could directly store the actual sequence A, this potentially induces computational and space overhead, especially when t becomes very large. Instead, we compact A into a fixed vector summarizing the frequency of each tweet, retweet, reply, and mention action up to t. This effectively limits the space complexity of s1_t ∈ R^5 to O(1). Similarly, s2_t ∈ R^(4+|V|(d+1)) comprises (i) node2vec(G) [19], which encodes the structure of G into |V| vectors of size d, (ii) a statistical snapshot of A, and (iii) information regarding S_t, encoded as a binary indicator vector (1(v ∉ S_t)) over all v ∈ V. Previous works (e.g., [15, 55]) have often encoded the network structure via a parameterized Graph Convolutional Network (GCN) [28] as part of the policy network. As this approach requires frequent parameter updates during training, we instead adopt node2vec(G), an unsupervised alternative that needs to be computed only once.
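A hedged sketch of this one-time state encoding follows, using the community node2vec package (pip install node2vec); the stand-in graph generator, the embedding dimension d=64, and the walk hyper-parameters are illustrative assumptions, not the paper's settings.

```python
import networkx as nx
import numpy as np
from node2vec import Node2Vec

G = nx.erdos_renyi_graph(200, 0.05, directed=True)   # stand-in for a retweet network
n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=50, workers=2)
model = n2v.fit(window=10, min_count=1)              # computed once, before training
Z = np.stack([model.wv[str(v)] for v in G.nodes()])  # |V| x d matrix, part of s2_t

def bot_state(num_followers, activity_counts, t):
    """The O(1) bot-specific state s1_t in R^5: follower count plus the
    frequency of each of the 4 activity types up to timestep t."""
    freq = np.asarray(activity_counts, dtype=float) / max(t, 1)
    return np.concatenate(([float(num_followers)], freq))
```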
Action and Policy. Similarly, we factor the action space into two spaces for AgentI and AgentII, respectively. Their actions a1_t ∈ R^4 and a2_t ∈ R^|V| are both encoded as one-hot vectors, representing one of the four available activities and one of the |V| potential followers, respectively. We then have two policies, π1(a1_t | s1_t) and π2(a2_t | s2_t, a1_t), that control AgentI and AgentII, respectively.

Reward. Even though we could directly reward the RL agents with σ(S_t, p) ≥ 1.0 at every timestep t ≤ T*, this calculation incurs a large computational cost, especially when T* becomes large. Instead, therefore, we design an accumulative reward function that consists of a step reward and a delayed reward to incentivize the RL agents to maximize the objective (Eqn. 2), where T* ≤ T is the interval timestep at which the bot is detected and the episode is terminated. The step reward r_step, which can be computed efficiently, is the marginal gain in network influence from the new follower selected at t. Using the step reward with a discount factor γ_step < 1.0 helps avoid the sparse-reward problem and encourages good follower selection early in an episode. Since r_step ≥ 1.0, it also encourages the bot to survive longer against bot detection, i.e., to maximize T*. In other words, as long as the socialbot survives (i.e., T* increases) to make new friendships, it will be able to influence more people. However, since σ(·) is subadditive, i.e., σ({u}, p) + σ({v}, p) ≥ σ({u, v}, p) ∀u, v ∈ V, we also introduce a delayed reward r_delayed at the end of each episode, with a discount factor γ_delayed < 1.0, as a reward adjustment for each node-selection step.

The policy network π1 is a Multi-Layer Perceptron (MLP) followed by a softmax function that projects s1_t onto a probability distribution over the 4 possible activities, from which we can sample. The policy network π2 utilizes a Convolutional Neural Network (CNN) [23] to efficiently extract useful spatial features from the stack of representation vectors of all vertices v ∈ V computed by node2vec(G) (Sec. 4.1), and an MLP to extract features from the remaining components of s2_t. The resulting vectors are then concatenated into the final feature vector. Instead of directly projecting this feature onto the original action space using an MLP, we adopt the parametric-action technique [14, 38], with invalid actions at each timestep t, i.e., already-chosen nodes, being masked.

Learning algorithm. We train π1 and π2 using the actor-critic Proximal Policy Optimization (PPO) algorithm [43], which has theoretical guarantees and is known to be versatile across scenarios [15, 43, 45, 55]. The actors are π1 and π2, as described above. Their critics share the same network structure but output a single scalar value as the estimated accumulated reward at t.

Learning on synthetic and evaluating on real networks. We evaluate our method on real-world data. To make our RL model generalize well to unseen real networks (Figure 2, Left) with different possible configurations of G = (V, E), it is important to train it on a sufficient number of diverse scenarios, i.e., training graphs. However, collecting such a training dataset often requires much time and effort. Hence, we propose to train our model on synthetic graphs, which can be efficiently generated on the fly during training [24]. To avoid distribution shift between training and test graphs, we first collect a seed dataset of several news propagation networks and use their statistical properties (intra- and inter-community edge probabilities, denoted p_intra and p_inter) to spontaneously generate a synthetic graph (Figure 2, Right) for each training iteration. We describe this in detail in Section 5.
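Tying the policies and masking described above together, here is a minimal sketch (with assumed toy shapes, not the released environment or the paper's exact CNN architecture) of one hierarchical decision step: π1 samples an activity, and only for the interactive activities does π2 score candidate followers, with already-acquired nodes masked out as in the parametric-action technique.

```python
import torch
import torch.nn as nn

# Illustrative stand-in policy networks (shapes only).
pi1 = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 4))        # activities
pi2 = nn.Sequential(nn.Linear(5 + 64, 128), nn.ReLU(), nn.Linear(128, 1)) # per-node score

def step(s_dna, node_feats, chosen_mask):
    """One hierarchical step. Assumed activity indices: 0 = tweet,
    1/2/3 = retweet/reply/mention (only these can acquire followers)."""
    a1 = torch.distributions.Categorical(logits=pi1(s_dna)).sample()
    follower = None
    if a1.item() in (1, 2, 3):
        # score each candidate node given its node2vec features and s1_t
        per_node = torch.cat([node_feats, s_dna.expand(len(node_feats), -1)], dim=1)
        logits = pi2(per_node).squeeze(-1)
        logits = logits.masked_fill(chosen_mask, float("-inf"))  # mask chosen nodes
        follower = torch.distributions.Categorical(logits=logits).sample()
    return a1, follower

# toy usage: 10 candidate nodes with 64-d node2vec features, none chosen yet
a1, v = step(torch.randn(5), torch.randn(10, 64), torch.zeros(10, dtype=torch.bool))
```

In the full framework, the per-node scores would come from the CNN over the node2vec matrix, and both policies would be updated with PPO against the step and delayed rewards.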
Datasets. We collected the top-100 trending articles on Twitter from January 2021 to April 2021 and their corresponding propagation networks, each with a maximum of 1.5K nodes, using the public Hoaxy API. All the downloaded data is free of user-identifiable information. The majority of these articles relate to the events surrounding the 2020 U.S. presidential election and the COVID-19 pandemic. We also share the observation of previous literature [24, 41] that retweet networks tend to have star-like shapes. These networks have a high p_intra and a low p_inter value, suggesting multiple separate star-shaped communities with few connections among them. Thus, viral news usually originates from a few very influential actors in social networks and quickly propagates to their followers.

Training and Testing Set. Figure 3 illustrates how synthetic data is utilized during training. Since we observe that our framework generalizes better when trained with more complex graphs, i.e., more edges with high intra-community (p_intra) and inter-community (p_inter) edge probabilities, we first selected the 10% of the collected real networks with the highest p_intra and p_inter as initial seed graphs, e.g., Figure 2, Left, for generating the training set, and used the rest as the test set. Then, during training, we used the average statistics of the seed graphs (p_intra, p_inter, the number of communities, and their sizes) to generate a stochastic synthetic graph for each training episode of at most T timesteps, e.g., Figure 2, Right (see the sketch after this section). These statistics are selected because they capture well the star-like shape of a typical retweet network. Since the real activation probabilities of the collected networks are unknown, we found that using a fixed high p value during training achieves the best results. We then report results averaged across 5 different random seeds on the remaining 90 real test networks, with varied p values and on a much longer horizon than T. Note that this number of testing networks is much larger and more extensive than in previous studies [24, 25, 51].

Baselines. Since there are no previous works that address the ASL problem, we combined different approximation and heuristic approaches for the IM task with the detector-evasion capability provided by the learned AgentI as baselines:
• AgentI+C. This baseline extends the Cost-Effective Lazy Forward (CELF) algorithm [31], which exploits the submodularity of the spread function σ(·) and was the first substantial improvement over the traditional Greedy method [25] in terms of computational complexity. CELF is a standard baseline in the influence maximization literature.
• AgentI+H. Since G consists of several star-like communities, we also used the heuristic approach Degree [3, 25], which always selects the available node with the largest out-degree, i.e., the user with the largest number of followers.
• AgentI*+C and AgentI*+H train the first-level agent independently of the second-level agent and combine it with CELF or the Degree heuristic, respectively. These are introduced to examine the dependency between the trained AgentI and AgentII.
Since the Greedy approach does not scale well with a large number of seeds, we excluded it from our experiments. We used a fixed hyper-parameter setting: during training, we set K←20, λ←3, T←60, p←0.8, and γ_step, γ_delayed←0.99. We refer readers to the appendix for detailed RL agent configurations.
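A sketch of the per-episode graph generation just described: a directed stochastic block model whose block sizes and edge probabilities would be set from the seed graphs' average statistics. The concrete numbers below are made up for illustration.

```python
import networkx as nx

def synthetic_graph(community_sizes, p_intra, p_inter, seed=None):
    """Generate one star-like, multi-community training graph per episode."""
    k = len(community_sizes)
    probs = [[p_intra if i == j else p_inter for j in range(k)] for i in range(k)]
    return nx.stochastic_block_model(community_sizes, probs,
                                     seed=seed, directed=True)

# e.g., three communities with sparse cross-links (illustrative values only):
G_train = synthetic_graph([300, 250, 200], p_intra=0.05, p_inter=0.001, seed=0)
```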
We ran all experiments on machines with Ubuntu OS (v18.04), a 20-core Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 93GB of RAM, and a Titan Xp GPU 16GB. All implementations are written in Python (v3.8) with PyTorch (v1.5.1).

Network Influence Ratio. Figure 4 shows the network influence ratio, i.e., the network influence over the total number of users, under a bot detection environment, for different budgets of seeds |S| and different p values:

σ(S, p)/|V| ≤ 1.0    (5)

A high network influence ratio requires both (i) efficient followship selection and (ii) an efficient detection-evasion strategy. Overall, ACORN outperforms all baselines across different news viralities (p values). However, ACORN underperforms when |S| is low, e.g., |S|=50 in Figure 4. This is because AgentII learns not to connect with the most influential nodes early in the process; doing so helps prevent disrupting the sequence A and triggering early detection, especially close to the next prediction interval of F. The larger the p value, the further, i.e., the more hops, a piece of news can propagate through G. Hence, as p increases, i.e., the more viral a piece of news, utilizing the network structure to make new connections becomes crucial and more effective than simply selecting the most influential users. This is reflected in the inferior performance of AgentI+H compared with AgentI+C and ACORN in Figure 4, p=0.75. It shows that ACORN is able to utilize the network structure captured by node2vec and to postpone short-term incentives, i.e., befriending influential users, for the sake of long-term rewards. Overall, ACORN also behaves more predictably than the baselines in terms of the influence ratio's deviation across several runs.

Survival Timesteps. We then evaluated whether a trained socialbot can survive even after collecting all followers. Table 2 shows that while we train a socialbot with a finite horizon T=60, it can live on the network for a much longer period during testing, whereas the other baselines were detected very early. Since only three of the four activities, i.e., retweet, reply, and mention (but not tweet), allow the bot to collect new followers, socialbots naturally need to survive much longer than |S| steps, e.g., around 2.0K steps in Table 2, to accumulate all followers. This corresponds to 98%, 64%, and 56% of socialbots surviving, i.e., not being detected, after collecting the full follower budget, for ACORN, AgentI+C, and AgentI+H, respectively. Our trained socialbot can also sustain itself much longer if we keep it running during testing, even with different detection intervals K > 20. This implies that AgentI generalizes its adversarial activities against F to unseen real-life scenarios.

Dependency between RL Agents. The above results also demonstrate the effect of co-training AgentI and AgentII. First, the heuristic and CELF methods, when paired with the learned AgentI (blue & green lines, Figure 4), perform much better than when paired with an AgentI trained independently, i.e., without AgentII (yellow & black lines, Figure 4). This shows that AgentI, when trained with AgentII, becomes more versatile and can help a socialbot survive a much longer period of time, even when the socialbot only uses a heuristic node selection. However, AgentI performs best when paired with AgentII. This shows that the two RL agents successfully learn to collaborate, not only to evade socialbot detection but also to effectively maximize network influence. This further reflects the co-dependency between the roles of A and S analyzed in Sec. 3.1.
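For reference, a compact sketch of the lazy-greedy CELF selection used by the AgentI+C baseline: it exploits the submodularity of σ(·) by re-evaluating only the top cached marginal gain, and could plug in the Monte Carlo estimator `estimate_sigma` sketched earlier (with a noisy estimator, the lazy test is only approximate).

```python
import heapq

def celf(graph, k, p, sigma):
    """Select k seeds approximately maximizing sigma(S, p) via CELF [31]."""
    heap = [(-sigma(graph, {v}, p), v) for v in graph]   # initial marginal gains
    heapq.heapify(heap)
    seeds, spread = set(), 0.0
    while len(seeds) < k and heap:
        _, v = heapq.heappop(heap)
        gain = sigma(graph, seeds | {v}, p) - spread     # re-evaluate lazily
        if heap and gain < -heap[0][0]:
            heapq.heappush(heap, (-gain, v))             # stale: push back
        else:
            seeds.add(v)                                 # still the best: select
            spread += gain
    return seeds

# usage: celf(graph, k=50, p=0.8, sigma=estimate_sigma)
```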
Computational Analysis. We compared the computational complexity of AgentII with that of the CELF algorithm during inference. Even though CELF significantly improves on the traditional Greedy IM algorithm [25], with a computational complexity of O(|S||V||E|) [48] (assuming each call of σ(·) takes O(|E|) and only one round of Monte Carlo simulation is needed), its computation greatly depends on σ(·) and the size of the graph, and it becomes computationally practical only when |S| is small. The same holds for other traditional IM algorithms such as CELF++ [18], TIM [48], and ASIM [13]. To illustrate, CELF takes much longer to compute as |S| increases, especially with large p, i.e., when more nodes need to be reached while computing σ(·) (Figure 5). In contrast, with the O(1) complexity of a forward pass through π2, AgentII scales linearly, O(|S|), regardless of the network structure and the virality of the news during inference. Even though our framework requires computing the graph representation with node2vec, node2vec is specifically designed to scale to large graphs [40], and we only need to run it once.

Insights on the Learned Policies. We summarize the node-selection strategies of all methods in Figure 6. We observe that both the heuristic and CELF select very influential nodes with many followers (high out-degrees) very early. In contrast, AgentII acquires an array of normal users (low out-degrees) before connecting with influential ones. This results in the early detection and removal of the baselines and the sustained survival of our approach, and shows that AgentII learns to cope with the relationship constraint (Eqn. (1)) between A and S imposed by the environment. Moreover, the distribution of the degrees of users selected by ACORN has a long right tail, which means that ACORN overall still tries to maximize its network influence early in the process.

We have evaluated our approach on different real-life news propagation graphs. These networks can be considered sub-graphs of a much larger social network. In practice, different sub-graphs can represent communities of special interest, e.g., politics or COVID-19 news, or of different characteristics, e.g., political orientation. Since socialbots usually target a specific group of users, e.g., anti-vaxxers, it is practical to deploy several bots working in tandem on different sub-graphs. To evaluate this scenario, we aggregated all 90 test sub-graphs into a large network of 135K nodes and deployed one learned socialbot per sub-graph. Figure 7 shows that ACORN still outperforms the other baselines, especially later in the time horizon. Moreover, ACORN can efficiently scale to a real-life setting thanks to its linear running time and highly parallelizable architecture.

Our contribution goes beyond demonstrating that one can train adversarial socialbots to effectively navigate real-life networks using an HRL framework. We will also publish a multi-agent RL environment for the ASL task under the gym library [1]. This environment will enable researchers to test different RL agents and to examine and evaluate assumptions regarding the behaviors of socialbots, bot detection models, and the underlying influence diffusion models on synthetic and real-life news propagation networks. It remains a possibility that our proposed framework could be deliberately exploited to train and deploy socialbots that spread low-credibility content on social networks without being detected.
To reduce any potential misuse of our work, we have refrained from evaluating our framework against an actual socialbot detection API such as Botometer (https://botometer.osome.iu.edu/). Ultimately, however, such misuse can occur (as much as the misuse of the latest AI techniques such as GANs or GPT is unavoidable). Yet we firmly believe that the benefits of our framework, in demonstrating the possible adversarial nature of socialbots and enabling researchers to understand and develop better socialbot detection models, far outweigh the possibility of misuse for developing "smarter" socialbots. In fact, by learning and simulating various adversarial behaviors of socialbots, we can now analyze the weaknesses of current detectors. Moreover, we can also incorporate these adversarial behaviors to advance the development of novel bot detection models in a proactive manner [4]. Time-wise, this gives us a great advantage over the traditional reactive flow of developing socialbot detectors, where researchers and network administrators are always one step behind malicious bot developers [4].

One limitation of our current approach is that we only considered statistical detector features that are relevant to the four activities, i.e., tweet, retweet, reply, and mention (Table 1). While these features achieve 90% detection accuracy in F1 score on a real-life dataset, we hope to lay the foundation for further work that considers more complex network- and content-based features [11, 36, 37, 54].

This paper proposes the novel adversarial socialbot learning (ASL) problem, in which a socialbot needs to simultaneously maximize its influence on social networks and minimize its detectability under a strong black-box bot detector. We carefully designed and formulated this task as a cooperative game between two functional hierarchical reinforcement learning agents with a global reward. We demonstrated that the learned socialbots can sustain their presence on unseen real-life networks over a long period while outperforming other baselines in terms of network influence. During inference, the complexity of our approach also scales linearly with the number of followers and is independent of a network's structure and the virality of the news. Our research is also a first step towards developing more complex adversarial socialbot learning settings where multiple socialbots work together towards a common goal [4]. By simulating the learning of these socialbots under various realistic assumptions, we also hope to analyze their adversarial behaviors in order to develop effective detection models against more advanced socialbots in the future.
References.
[1] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym.
[2] Behavior enhanced deep bot detection in social media.
[3] Efficient influence maximization in social networks.
[4] A decade of social bot detection.
[5] The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race.
[6] Social fingerprinting: Detection of spambot groups through DNA-inspired behavioral modeling.
[7] Better safe than sorry: An adversarial approach to improve social bot detection.
[8] The coming age of adversarial social bot detection.
[9] BotOrNot: A system to evaluate social bots.
[10] Feature engineering for machine learning and data analytics.
[11] Supervised machine learning bot detection techniques to identify social Twitter bots.
[12] Stochastic Neural Networks for Hierarchical Reinforcement Learning.
[13] ASIM: A scalable algorithm for influence maximization under the independent cascade model.
[14] Horizon: Facebook's open source applied reinforcement learning platform.
[15] TorsionNet: A Reinforcement Learning Approach to Sequential Conformer Search.
[16] Talk of the network: A complex systems look at the underlying process of word-of-mouth.
[17] Using complex systems analysis to advance marketing theory development: Modeling heterogeneity effects on new product growth through stochastic cellular automata.
[18] CELF++: Optimizing the greedy algorithm for influence maximization in social networks.
[19] node2vec: Scalable feature learning for networks.
[20] Disinformation, and influence campaigns on Twitter.
[21] Dynamic programming and Markov processes.
[22] Two evidential data based models for influence maximization in Twitter. Knowledge-Based Systems.
[23] A Convolutional Neural Network for Modelling Sentences.
[24] Influence Maximization in Unknown Social Networks: Learning Policies for Effective Graph Sampling.
[25] Maximizing the spread of influence through a social network.
[26] Tractable models for information diffusion in social networks.
[27] A numerical evaluation of the accuracy of influence maximization algorithms.
[28] Semi-Supervised Classification with Graph Convolutional Networks.
[29] Building Portable Options: Skill Transfer in Reinforcement Learning.
[30] MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models.
[31] Cost-effective outbreak detection in networks.
[32] CLAIM: Curriculum Learning Policy for Influence Maximization in Unknown Social Networks.
[33] DISCO: Influence maximization meets network embedding and deep learning.
[34] A survey on information diffusion in online social networks: Models and methods.
[35] Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards.
[36] RTbust: Exploiting temporal patterns for botnet detection on Twitter.
[37] Malicious Bot Detection in Online Social Networks: Arming Handcrafted Features with Deep Learning.
[38] Dota 2 with Large Scale Deep Reinforcement Learning.
[39] A one-class classification approach for bot detection on Twitter.
[40] Deep inductive graph representation learning.
[41] Correcting for missing data in information cascades.
[42] Detection of novel social bots by ensembles of specialized classifiers.
[43] Proximal policy optimization algorithms.
[44] The spread of low-credibility content by social bots.
[45] Autonomous Drone Racing with Deep Reinforcement Learning.
[46] The DARPA Twitter bot challenge.
[47] Random forest: A classification and regression tool for compound classification and QSAR modeling.
[48] Influence maximization: Near-optimal time complexity meets practical efficiency.
[49] Deep reinforcement learning-based approach to tackle topic-aware influence maximization.
[50] GANG: Detecting fraudulent users in online social networks via guilt-by-association on directed graphs.
[51] Online Influence Maximization under Independent Cascade Model with Semi-Bandit Feedback.
[52] A novel framework for detecting social bots with deep neural networks and active learning (2021).
[53] Arming the public with artificial intelligence to counter social bots.
[54] Scalable and generalizable social bot detection through data selection.
[55] Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning.