key: cord-0060579-4xjsicln authors: Karpov, Ilia; Glazkova, Ekaterina title: Detecting Automatically Managed Accounts in Online Social Networks: Graph Embeddings Approach date: 2021-02-20 journal: Recent Trends in Analysis of Images, Social Networks and Texts DOI: 10.1007/978-3-030-71214-3_2 sha: 79ff563c2723505725690a494f811937bf4e6837 doc_id: 60579 cord_uid: 4xjsicln The widespread of Online Social Networks and the opportunity to commercialize popular accounts have attracted a large number of automated programs, known as artificial accounts. This paper (Project repository available at http://github.com/karpovilia/botdetection) focuses on the classification of human and fake accounts on the social network, by employing several graph neural networks, to efficiently encode attributes and network graph features of the account. Our work uses both network structure and attributes to distinguish human and artificial accounts and compares attributed and traditional graph embeddings. Separating complex, human-like artificial accounts into a standalone task demonstrates significant limitations of profile-based algorithms for bot detection and shows efficiency of network structure based methods for detecting sophisticated bot accounts. Experiments show that our approach can achieve competitive performance compared with existing state-of-the-art bot detection systems with only network-driven features. The need to quickly search for artificially created accounts is in many practical research tasks. Artificial accounts distort the popularity of groups [6] , spread fake news [16, 17] , and are used for fraud [1] activities. In this work we define an artificial account as an account created and used to generate profit for the owner by violating the rules of a social network. Thus, intuitively, accounts can be divided into two types: -technical accounts (or bots) created to collect data, increase the number of group members, the popularity of posts etc. Most of these accounts are created and controlled by software. As a rule, these are poorly filled and relatively recently created profiles, which allows to identify them with a sufficiently high accuracy using several profile features. -sophisticated accounts, created and operated semiautomatically by a human, are used primarily for information propagation, marketing and fraud activity. This category also includes hacked and used for fraud accounts. Most of the existing methods show excellent results in identifying accounts of the first type. However, they do not consider accounts of the second type due to the high complexity of dataset acquisition. At the same time, accounts of the second type are visually and in terms of profile fullness indistinguishable from legitimate user accounts, which makes the task of their classification much more challenging. The problem of finding artificial accounts of the second type can be reduced to the task of classifying network nodes. Still, its solution requires much more contextual information and learning capacity of the models used than in case of general node classification problem. This makes it appropriate to study the applicability of existing models for describing network nodes and their classification to solve this problem. This paper has two main contributions as follows: -First, we introduce the two bot detection dataset, consisting of nodes of three types: real user accounts, technical accounts and manual accounts. Unlike many existing arrays, it contains not only information about the network profile, but also information about the user's friends and groups. Since the array contains data of real people, immediately after collecting the data, all identifiers of users, groups, places of residence were hashed. An important advantage of the proposed array is that it does not contain precalculated walk models, but the initial hashed data themselves for training various models. Thus, it can be used in the future to train and compare new models. -Second, we investigate two models for constructing network embeddings to solve the problem of classifying network nodes into artificial and natural. As far as we know, we are the first who use a class of the network node attributes simultaneously with the network node embedding to solve this problem. The use of random walk algorithms optimized for large networks is also of interest. This comparison is especially important given that the problem of identifying real life bots is solved for a graph of ∼10 9 nodes, which imposes significant restrictions on the computational cost of the algorithms used. The rest of the work is organized as follows. In Sect. 2, we describe existing approaches to bot detection problem and recent advances in network embedding generation. Section 3 is devoted to the proposed model. Section 4 presents our experiments. In Sect. 5, we summarize experiment results and discuss future work. This section describes the chosen vector representations with attributes approach, the formal criteria to separate simple bots from advanced ones. Due to the lack of an established definition of the "bot" term, as well as the illegality of the very usage of bots, the formation of a training dataset causes problems. Based on the definition of the above and previously performed works [13, 18] , the main methods are (1) manual bot labeling -this approach is very resource-intensive and does not work well for sophisticated bots due to the low annotators agreement coefficient, (2) monitoring of suspended users lists -the main disadvantage of this approach are that (a) it is necessary to have time to proactively collect the entire user's network profile before blocking, (b) not all users blocked by a social network due to rules violation are bots (3) Creating accounts that attract bots other than real people, this approach looks very original, but most likely attracts bots using certain communication strategies, which does not fully give an idea of the diversity. In this work, we use the second approach mainly to form a set of technical bots along with one more previously undescribed strategy. To construct a set of sophisticated accounts, we got access to ∼700 accounts sold by 12 different sellers on special account exchanges. The purchase of accounts is prohibited by the social network rules, all scenarios for using purchased accounts are negative and stay within the proposed definition, which makes monitoring of account exchanges a fairly effective strategy for generating the dataset. At the same time, advanced accounts, as a rule, have a higher price, which allows constructing a cost scale that correlates with the difficulty of identifying such an account. With it, the part of suspended users from the first set can be attributed to sophisticated accounts using the k-nearest neighbors method. Researchers designed many bot detection approaches in order to prevent the spread of social media bots. Those approaches can be classified into 3 main categories: graph-based, feature-based and anomaly-based. The anomaly-based approaches assume that a user with odd behaviour is most likely to be a malicious user. For example, [14] took into assumption that human behaviors exhibit strong heterogeneity and a bot's behavior is less complex than a human's behavior which leads to a smaller entropy of bot actions. They analyse time-interval entropy value and clustered the accounts on Sina Weibo into humans, bots and Cyborgs. Another work [9] created empty accounts and managed to construct a bot dataset with the users that interacted with these empty accounts. There are several feature-based methods where authors identified a set of behaviour features which describes user actions, interactions, timestamps and may include text counting without deeply analyzing textual content. Then they train several supervised machine learning classifiers like SVM-NN in [8] or Random Forest Classifier in [5] . The sequence of user's actions can also be used as input to CNN-LSTM algorithm [3] . Besides behaviour features researchers use content-based features such as tweets. For example in [11] authors use sentiment analysis and build Contrast Pattern-Based classifier. Besides tweets user's nicknames also can be analysed. Thus, [2] classified user names (strings) into random and non-random and it can be used as additional feature for bot detection classifiers. Graph-based bot detection methods rely on analyzing the structure of social network. The majority of them make the assumptions that real users form one strongly connected community and refuse to interact with artificial users [4, 19, 20] . These assumptions mean that there is a sparse cut between the legitimate and bot parts. [18] describe an approach to bot detection using a stacking ensemble with classifier trained on top of a combination of node embeddings from friendship graph, models trained on user's text and bag of subscribers. They use graph embedding obtained from LINE while in our work we compare various different graph embeddings. Given a weighted graph G = (V, E, A), where V is the set of nodes, E is the edge set and A is the adjacency matrix of a graph G such as vector that captures its network connectivity structure. Resulting embedding can be used as input vector for the classification problem. After the successful application of skip-gram [12] model to the problems of modeling text sequences, this method began to be actively used to model the graph nodes based on linear sequences obtained as a result of the graph random walks. The main advantage of using neural network methods in graph problems is that they do not need the whole training set during the training phase. They optimize by reading the data by batches step by step and correcting continuously which theoretically leads to training on extremely large datasets. In practice most of the frameworks require precomputed transition probabilities, that should be stored in memory. Attri2Vec [22] generates random walks to capture structural context. After random walks are generated, attri2vec uses the representation f (x i ) of node v i with attribute x i in the new attribute subspace to predict its context nodes collected from random walk sequences. In this way, network structure is seamlessly encoded into the new attribute subspace by allowing nodes sharing similar neighbors to be located closely to each other. Node2Vec [7] is the classical second-order random walk based on the transition parameters p and q, where p controls the probability to return back to the starting node of the walk and q -to walk away from the starting node of the walk. Given p = 1 and q = 1 the algorithm models unparameterized random walk. VKontakte Social Network provides a significant amount of information about the user that can be converted into features. The features can be divided into five groups: -Descriptive profile characteristics. These features include gender, age, marital status, date of birth, number of friends, groups, subscribers, country and city of residence, school, university, user work, site, number of posts on the user's wall, number of photos and albums, audio and video records, privacy settings, confirmation by phone number, date of account creation; geotags of photos, links to profiles in other networks; -Text features received from user's messages, comments and statuses. They can include text embeddings, key topics of messages and comments, variety of vocabulary, user language; -Image features, images on the page and in the user's albums; -Graph features -user group membership, user friends structure; -Time features such as user last seen time, publication time, activity at neighbor pages such as likes, comments at friends and subscribed groups; In this work, we do not use text, image and time features due to the desire to focus on graph features. We still download text and profile information for further research and dataset enrichment. In our work we consider two sources of artificial accounts information: blocked by the VKontakte social network administration ("banned") and placed on a special exchange for the purpose of selling ("sold"). All accounts have one of the following statuses: active, deleted and banned. We consider only accounts with status banned in "banned" list, which excludes users who have independently deleted their accounts from the network from being included in the set ones. This approach has a certain error ratio, since, as noted above, not only accounts participating in increasing the rating of groups and disseminating information are subject to blocking, but also real users blocked by the administration, for example, due to incorrect behavior towards other participants or publications of prohibited materials. We suggest focusing on "sold" accounts to train more precise models. For obtaining "banned" accounts we used the following procedure: 1. Users profiles with friends and texts were collected during February 2020. Only profiles, that are accessible without the social network registration were collected. 2. In June 2020 profiles from step 1 were checked with friends and friends of friends. Two lists of users with statuses "active" and "deactivated -banned" are collected. When constructing the final graph, we examined 716 sophisticated and 360 technical accounts found from account shops and 2515 blocked users before they were banned. Blocked users were classified as "technical" and "sophisticated" based on their profile fulfillment and friends count. We obtained mutual friends for all nodes (bots and their friends) for the largest connected component, that gave us 322,917 nodes totally. The resulting connected graph consists of 270 technical accounts and 159 sophisticated accounts and 44,667 users with all information about profile, groups and friends; This node framing strategy distorts periphery nodes centrality measures, since their edges with nodes outside the frame that are not taken into account. It can cause an errors in the results, since it is bot egocentric, but a full-rated modeling of the second-order friends graph is complicated by the fact that it contains about 12 million nodes, which does not allow performing random walks in an acceptable time. Sophisticated and technical accounts were represented in the largest connected component, which made it possible to train one graph embedding model to classify both types of accounts. For "banned" users, an additional assessment of the profile filling in was carried out based on the profile filling in by such features as the number of friends, subscribers, groups, photo, account verification. To train the basic model, we used the following account attributes: age, account verification by phone, nickname, website, Facebook profile, Instagram profile, Twitter profile, photo flags, number of subscriptions to public pages, number of videos, number of audio, number of days since the last login to the network, filled status, number of friends, city, gender. In the case when individual fields were not filled in or were not available for collection, we substituted the median value. In the future, it seems appropriate to fill in these fields as the median value for k nearest neighbors of the user or to apply more advanced approaches to filling described in the work [23] . To train graph embedding, the user's friend graph was used. Features of encoder walking and training will be described in the section below. We considered two types of embeddings: Node2Vec [7] and Attri2Vec [22] . To obtain Node2Vec embeddings for our graph we used the SNAP framework [10] . For Attri2Vec embeddings we used Attri2Vec authors implementation 1 . More implementation details are presented in section Experiments. The resulting graph embedding features were combined with the user feature vector and used as input parameters of the classification algorithm. The following classification algorithms were considered: 1. Support Vector Classifier (SVC) 2. Random Forest (RF) 3. Logistic Regression (LogReg) Implementation of algorithms from scikit-learn [15] was used. Logistic regression classification algorithm was trained based on the accounts features. Features described in Sect. 3.2 were used. For "city" categorical feature we have chosen 185 cities with more than 10 citizens from our dataset and preprocessed the feature with one-hot encoding. "Gender" feature in VK has several possible values (male, female, unknown), we present it as two one-hot encoded features. Overall, with one-hot encoding of categorical features we obtained 204 features for each account. Account information is usually sparse since accounts have many blank fields. We considered two approaches for empty fields filling. The first approach is to use some constant value. The second is to use average value among the filled field values. With the first approach on all 204 features we obtained ROC AUC 0.785, with the second approach -0.762. We trained 25 Node2Vec models with different p and q parameters. The grid for each of the parameters is {0.25, 0.5, 1, 2, 4} (the same as in the original paper [7] ). The other parameters are 10 random walks of length 80 per source, context size -10 and 10 epochs of SGD optimization. We consider AUC-ROC (Area Under ROC Curve) as the main metric for our binary classification task. We train Logistic Regression Classification algorithm based on the obtained Node2Vec embeddings. The results for technical and sophisticated accounts are presented in Table 1 and 2, respectively. We also considered SVM and Random Forest classification algorithms trained on the same embeddings as in Table 1 , those approaches best results are 0.929 and 0.922 respectively. The best Logistic Regression results outperform SVM and Random Forest results. Attri2Vec approach allows to use account features with graph structure features. We used authors implementation from the original paper [22] . We considered the same length of random walks, the number of random walks and the context size parameters as for Node2Vec embeddings, i.e. 80, 10 and 10, respectively. The other training parameters are 1M samples and 0.025 initial learning rate value. We considered several types of the mapping options for constructing embeddings from attributes in Attri2Vec: ReLU mapping, Fourier mapping and Sigmoid mapping. The best results were shown by Sigmoid mapping. The best obtained results for technical and sophisticated accounts are presented in Table 3 . The obtained Attri2Vec results outperform Node2Vec results for technical accounts. Results on the sophisticated accounts are worse. This can be caused by more complex structure and profile fields completion of sophisticated accounts and large number of gaps in account features of sophisticated accounts. We considered classification based on concatenated Node2Vec best embeddings and account features. We used Logistic Regression Classification algorithm. The best results on technical and sophisticated accounts are presented in Table 4 . Concatenation of Node2Vec embeddings and account features outperforms account features results for both datasets and gives the best obtained results on sophisticated dataset. Graph based methods show the best results on technical accounts dataset, but fail to classify sophisticated accounts without profile features. In this work, we proposed simple and advanced sets for the bots detection problem. We showed the effectiveness of using graph embedding and combining it with user attribute features to solve this problem. The best ROC AUC score of 0.988 for technical accounts and 0.867 for sophisticated accounts shows the complexity difference in proposed tracks. Unfortunately, we cannot directly compare our results with existing approaches because most works do not provide the necessary datasets and code repositories. We took the best results of the bot detection from the paper by Zegzhda et al. [21] and Skorniakov et al. [18] as most close to our work. Proposed approach outperforms both approaches by 4.5% of AUC 2 . Achieved result is sufficient for application purposes, greater results can be reached with the following strategies: -use of text embeddings since a significant part of artificial accounts performs the function of promoting certain goods or disseminating information, which can be used for classification; -significant number of accounts hide their friends, but leave open groups that can be used to model a user as a bipartite graph node; -network modeling as a temporal network is of interest, taking into account such characteristics as the joint appearance of accounts on the network. Resulting anonymized dataset, pretrained embeddings and evaluating scripts can be found at project repository 3 . Frauds in online social networks: a review Its all in a name: detecting and labeling bots by their name Detecting social bots by jointly modeling deep behavior and content information Sybilinfer: detecting Sybil nodes using social networks Features combination for the detection of malicious twitter accounts Leveraging candidate popularity on twitter to predict election outcome node2vec: scalable feature learning for networks Detecting fake accounts on social media Seven months with the devils: a long-term study of content polluters on Twitter SNAP: a general-purpose network analysis and graphmining library Contrast pattern-based classification for bot detection on Twitter Efficient estimation of word representations in vector space A new approach to bot detection: striking the balance between precision and recall Discriminating bot accounts based solely on temporal features of microblog behavior Scikit-learn: machine learning in Python Truthy: mapping the spread of astroturf in microblog streams The spread of fake news by social bots Make social networks clean again: graph embedding and stacking classifiers for bot detection SybilLimit: a nearoptimal social network defense against sybil attacks Sybilguard: defending against sybil attacks via social networks The use of an artificial neural network to detect automatically managed accounts in social networks Attributed network embedding via subspace discovery Treating missing network data before partitioning Acknowledgements. This work was supported by Russian Academic Excellence Project "5-100" and through computational resources of HPC facilities provided by NRU HSE. We thank Daria Musatkina, Maksim Smirnov and Matvey Osmolovsky for their assistance with data processing and meaningful discussion.