key: cord-0204110-ejvcwc2o authors: Devle, Aniket Chandrakant; Jose, Julia Ann; Saraswathula, Abhay Shrinivas; Mehta, Shubham; Srivastava, Siddhant; Kona, Sirisha; Daggumalli, Sudheera title: BotNet Detection on Social Media date: 2021-10-12 journal: nan DOI: nan sha: ef520fa2e7b2274f850f0f487b48e90c55bb6683 doc_id: 204110 cord_uid: ejvcwc2o

As our reliance on social media platforms and web services increases day by day, exploiters view these platforms as an opportunity to manipulate our thoughts and actions. These platforms have become an open playground for social bot accounts. Social bots not only learn human conversations, manners, and presence, but also manipulate public opinion, act as scammers, manipulate stock markets, and so on. There is evidence of bots manipulating people's opinions and thoughts, which can be a great threat to democracy. Identifying and preventing the campaigns that create or release these bots has become critical. Our goal in this paper is to leverage web mining techniques to help detect fake bots on social media platforms such as Twitter, thereby mitigating the spread of disinformation.

Given the popularity of social media and the notion of it being a platform encouraging free speech, it has become an open playground for user (bot) accounts trying to manipulate other users of these platforms. Social bots not only learn human conversations, manners, and presence, but also manipulate public opinion, act as scammers, manipulate stock markets, etc. Studies such as [1] have shown that platforms like Facebook and Twitter are the most affected by this phenomenon. Networks of bots acting in this manner can pose a significant threat to the flow of data on social media platforms. Bot Networks (BotNets) can be involved in spreading malicious information and amplifying accounts or social domains on these platforms, thereby exacerbating the issue of misinformation on social media. This becomes especially dangerous in times of ongoing political crises. For example, Russian BotNets [2] were said to have played a huge role in the intense polarization between the Left and the Right during the 2020 US elections, until Microsoft stepped in. In doing so, bots play a huge role in manipulating the thoughts of citizens, which is a threat to democracy. Our goal in this paper is to leverage semantic web mining techniques to detect fake bots and bot networks on Twitter.

The major objective of this paper is to develop a system that detects fake accounts on social media platforms so that we get an honest and authentic feed. As the number of bots on the internet keeps increasing day by day, it becomes ever more cumbersome to detect and flag accounts that are part of a bot network. Bots these days are created using sophisticated machine learning algorithms and are made to sound very much like their human counterparts. Our goal is to detect bot accounts and BotNets and, further, to detect the Twitter accounts and domains that are amplified by them.

Many algorithms have been proposed for fake-account and BotNet detection on social media platforms. Below are summaries of some of the existing algorithms:

1) A major differentiating attribute of online problematic information and malicious users is coordination. According to [3], coordination is defined as the act of making people and/or things be involved in an organized activity together. The papers [4]-[6] use Coordinated Link Sharing Behaviour (CLSB) to detect entities exhibiting inauthentic behavior. CLSB refers to the coordinated actions of Facebook/Twitter accounts, groups, and verified public profiles that post the same news articles within a short period, a simple strategy for increasing content reach and gaming the algorithm that governs the dissemination of the most common content across the platform. In [5], the authors considered political news stories from the 2018 Italian general election and the 2019 European elections. The detection of networks of coordinated entities is a two-step process: the first step estimates a time threshold that identifies news items shared by different entities within a short period; the networks are then identified by grouping only those entities that repeatedly shared the same news stories within that threshold.

2) In the paper "Identifying fake accounts on social networks based on graph analysis and classification algorithms" [7], similarity matrices between accounts are calculated from the graph adjacency matrix, after which the PCA algorithm is used for feature extraction and SMOTE is used for data balancing. To classify the nodes, linear SVM, medium Gaussian SVM, and logistic regression algorithms are used, and their performance was evaluated using 10-fold cross-validation. The medium Gaussian SVM outperformed the linear SVM due to its ability to map data to higher-dimensional feature spaces (a minimal sketch of this pipeline appears after this list).

3) Attractor+ [8] focuses on exploring the synchronized and coordinated retweeting behavior of malicious retweeter groups in terms of temporal and content-based properties. The authors proposed and built detectors based on group-based features, using four subgraph detection algorithms (Cohesive, Louvain, Attractor, Attractor+) to extract retweeter groups. Inspired by [9], they plotted inter-retweet time (IRT) pairs on a log-log scale and found that malicious retweeter groups had short and similar IRTs.

4) Enhanced PeerHunter [10] describes peer-to-peer BotNet detection using community detection strategies: bots with "mutual contacts" are clustered into communities, and a network-flow-level community behavior analysis is then used to detect BotNets. The main intuition here is that bots within the same BotNet work together as a community and have features that distinguish them from other communities.
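As a concrete illustration of the pipeline in [7], the sketch below chains PCA, SMOTE, and SVM classifiers evaluated with 10-fold cross-validation, using scikit-learn and imbalanced-learn. The placeholder data, the 16-component PCA, and the RBF kernel standing in for the "medium Gaussian" SVM are assumptions made for illustration, not details taken from the paper.

```python
# Sketch of the classification pipeline described in [7]: PCA for feature
# extraction, SMOTE for class balancing, and SVM classifiers evaluated
# with 10-fold cross-validation. X stands in for features derived from the
# account similarity matrices; y marks fake (1) vs. genuine (0) accounts.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.random((500, 64))        # placeholder similarity-based features
y = rng.integers(0, 2, 500)      # placeholder labels

X_reduced = PCA(n_components=16).fit_transform(X)                  # feature extraction
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_reduced, y)   # data balancing

# An RBF kernel roughly corresponds to MATLAB's "medium Gaussian" SVM.
for name, clf in [("linear SVM", SVC(kernel="linear")),
                  ("Gaussian SVM", SVC(kernel="rbf"))]:
    scores = cross_val_score(clf, X_bal, y_bal, cv=StratifiedKFold(n_splits=10))
    print(f"{name}: mean 10-fold accuracy = {scores.mean():.3f}")
```

Note that, mirroring the paper's description, this sketch oversamples before cross-validation; a stricter evaluation would apply SMOTE inside each training fold only.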
The Semantic Web is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C); its goal is to make internet data machine-readable [11]. Web Mining is the application of data mining techniques to discover patterns from the World Wide Web [12]. Semantic Web Mining aims at combining these two research areas, namely the Semantic Web and Web Mining [13]. Among the various forms of malware, BotNets are emerging as the most serious threat to cyber-security [14]. To counter this threat, we use Semantic Web Mining techniques.

The first step in the project is dataset analysis, which in turn informs pre-processing. Next, the entire pre-processed dataset is passed to the CoorNet algorithm for the detection of CLSB. The CoorNet algorithm [6] works in two phases. In the first phase, a subset containing the top 10% of tweet groups with the shortest time span between their first and second retweets is identified. This phase yields a time threshold in seconds (Threshold 1): if the time difference between a tweet's first and second retweets falls under this threshold, the entire tweet group (i.e., all the retweets of that tweet) is treated as a suspect for coordinated behaviour.
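To make the first phase concrete, the following is a minimal pandas sketch of how Threshold 1 could be estimated; the DataFrame layout (`tweet_id`, `created_at`) and the toy data are illustrative assumptions, not CoorNet's actual interface.

```python
# Sketch of CoorNet's first phase: estimate Threshold 1 from the 10% of
# tweet groups with the shortest first-to-second-retweet gap, then flag
# those groups as coordination suspects.
import pandas as pd

retweets = pd.DataFrame({
    "tweet_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "created_at": pd.to_datetime([
        "2021-01-01 10:00:00", "2021-01-01 10:00:03", "2021-01-01 10:09:00",
        "2021-01-01 11:00:00", "2021-01-01 11:30:00",
        "2021-01-01 12:00:00", "2021-01-01 12:00:05", "2021-01-01 12:00:09",
    ]),
})

def first_gap(times):
    """Seconds between the first and second retweet of a tweet group."""
    first_two = times.sort_values().iloc[:2]
    return (first_two.iloc[1] - first_two.iloc[0]).total_seconds()

gaps = retweets.groupby("tweet_id")["created_at"].apply(first_gap)

# Threshold 1: the 10th percentile of gaps, approximating the boundary
# of the fastest 10% of tweet groups.
threshold_1 = gaps.quantile(0.10)
suspects = gaps[gaps <= threshold_1].index
print(f"Threshold 1 = {threshold_1:.1f}s; suspect tweet groups: {list(suspects)}")
```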
In the second phase, each pair of retweets in the subset rendered during the first phase is examined by building a bipartite graph: retweet IDs form the left partition, Twitter account names form the right partition, and an edge exists between the two partitions if a Twitter account has retweeted the tweet with the given retweet ID. This helps identify the second threshold, calculated as the median number of retweets that the 10% quickest tweet groups needed to reach 50% of their total number of retweets (Threshold 2). The second threshold is the number of times a pair of accounts must exhibit this abnormal retweeting behaviour to be flagged as coordinated. The algorithm also furnishes a graph of highly coordinated entities for visualization purposes. The Louvain community detection model is applied to this graph, which helps visualize the different communities amplifying a certain tweet or the Twitter account responsible for that tweet. The open-source software Gephi is utilized to visualize the graphs furnished by CoorNet.

The next step is to identify the Twitter accounts that have been most amplified by the detected BotNets. The idea is as follows: for every tweet group classified as exhibiting coordinated behaviour, the number of retweets made on that tweet is counted. These counts are then aggregated per Twitter account name, and the 10 accounts with the highest coordinated retweet activity are selected, yielding a list of the Twitter accounts most amplified by BotNets.
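This ranking step might look like the minimal sketch below, assuming `retweets` carries a `tweet_id` and the `author` of the original (amplified) tweet, and that `coordinated_ids` holds the tweet groups flagged by the two-phase procedure; all names and data are illustrative.

```python
# Sketch of the amplification ranking: count retweets in each coordinated
# tweet group, aggregate the counts per amplified account (the author of
# the original tweet), and keep the top 10.
import pandas as pd

retweets = pd.DataFrame({
    "tweet_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "author":   ["@acct_a"] * 3 + ["@acct_b"] * 2 + ["@acct_a"] * 3,
})
coordinated_ids = {1, 3}  # tweet groups flagged as coordinated earlier

coordinated = retweets[retweets["tweet_id"].isin(coordinated_ids)]
top10 = (coordinated.groupby("author")["tweet_id"]
         .count()
         .sort_values(ascending=False)
         .head(10))
print(top10)  # accounts most amplified by the detected BotNets
```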
We also extracted the retweet text and URLs to model the detection of amplified domains. As part of future scope, we want to implement topic modeling algorithms like Non-negative Matrix Factorization (NMF) to discover the abstract topics that occur in the collection of tweets shared by the BotNets. Further, using sentiment analysis techniques, we can extract the positive or negative intent of the BotNets.

For our experiments, we chose to divide the 2016 UK Election dataset into two sets of tweets: the first consisting of all tweets related to the UK elections in 2016 (the DS-1 2016 dataset), and the second consisting of all tweets related to the resignation of Boris Johnson (the DS-1 2018 dataset). In short, we divided the dataset into two experiments and aimed to detect BotNets on the two sets independently. The second dataset (DS-2) contains 1,132,525 tweets related to the US government's strategy to track the Covid vaccination status of immigrants, covering the period from Dec 5, 2020 to Jan 27, 2021. We use each dataset as input to the CoorNet algorithm, which returns a highly connected graph representing coordinated link shares between users, and we feed this fastest-retweet graph to the community detection algorithms. Bot names for the second dataset are derived by applying the same algorithm that was used for DS-1.

We evaluate the results using the dataset containing the bot names as ground truth, and we quantify the performance using accuracy, F1-score, precision, recall, and the confusion matrix. Confusion matrix: reports the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) [15]; it is shown in Figure 1. Precision: the fraction of accounts retrieved as bots that are actually bots [15], i.e., TP / (TP + FP). Recall: the fraction of actual bots that were successfully retrieved [15], i.e., TP / (TP + FN). F1-score: the harmonic mean of precision and recall, used to seek a balance between the two.

Bots are predicted based on two approaches: 1) Tier-1: accounts that satisfy both Threshold 1 and Threshold 2. 2) Tier-2: accounts that do not retweet within the shortest time-span interval but still perform coordinated retweeting behaviour. To classify both Tier-1 and Tier-2 bots, we performed experiments varying the second threshold. For the DS-1 2016 dataset, Threshold 1 was 13 seconds, and the metrics calculated are shown in Figure 2. From the experiments and results in Figures 2 and 3, we finalized the values shown in Figure 4.

C. Findings

1) We realized quite early that the 2016 UK Election dataset is heavily imbalanced: the number of bots is quite small compared to the number of humans. Hence, a number of human accounts were classified as suspects of coordinated behaviour during the first phase of CoorNet. This was expected, in the sense that, at that stage, we were only classifying bots based on the amount of time it takes an account to retweet.

2) The second phase of CoorNet was able to weed out the majority of the bots from the human accounts.

3) CoorNet also builds a highly connected graph of the input dataset, and, using Gephi, we were able to visualize it. It is worth mentioning that most BotNets form a closed community (a strongly connected component) of their own and work in tandem to amplify one or more Twitter accounts.

4) Even though CoorNet performed relatively well on our datasets, we noticed some shortcomings in the overall model performance. First, CoorNet specifically targets the fastest repetitive retweeters in a dataset. The handler of a BotNet can easily outwit our model by relaxing the time interval after which the bots have to retweet, thereby masking the overall BotNet from our model. Second, CoorNet specifically targets BotNets that form strongly connected components or disjoint communities in a network. Again, the handler can configure the bots to be part of different communities and still mask the overall BotNet from our model.

5) It is worth noting that even though the accuracy of our model is very high, accuracy is not the right metric for this particular use case. The reason it is so high is mainly the huge number of human accounts that CoorNet classifies correctly, compared to the number of bot accounts. Hence, recall is a better metric for describing the performance of CoorNet.

We used Gephi to visualize the highly connected graphs generated by CoorNet for analysis. We also used Python libraries such as Matplotlib to generate pie charts depicting the amount of amplification of the top 10 Twitter accounts amplified by bots. In Figures 6, 7, and 8, the vertices correspond to Twitter accounts, and the nodes are sized by degree: the larger a node's degree, the larger it appears. A larger degree means that a particular Twitter account is amplified by a larger number of accounts, bots and humans alike. However, BotNets appear to form highly connected communities of their own, as is evident in the graphs.
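The pie charts mentioned above could be produced with a Matplotlib snippet along these lines, assuming `top10` is the account-to-coordinated-retweet-count mapping from the aggregation sketch earlier; the counts shown are placeholders.

```python
# Sketch of the amplification pie chart: the share of coordinated
# retweets received by each of the ten most-amplified accounts.
import matplotlib.pyplot as plt

top10 = {f"account_{i}": count for i, count in
         enumerate([950, 720, 610, 540, 480, 390, 310, 260, 220, 180])}

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(list(top10.values()), labels=list(top10), autopct="%1.1f%%")
ax.set_title("Coordinated retweets of the top 10 amplified accounts")
plt.tight_layout()
plt.savefig("top10_amplified_accounts.png")
```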
In this paper, we covered the CoorNet BotNet detection model, which works on the principle of Coordinated Link Sharing Behaviour (CLSB). CLSB suggests that it is unlikely for a human to repeatedly share content on a social media platform within a small time threshold. We also covered various other BotNet detection algorithms, such as Enhanced PeerHunter [10] and Attractor+ [8], with varying complexities and results. It should be pointed out that bots keep getting better, and it remains a challenge to develop models and algorithms that effectively detect BotNets in all kinds of configurations. We also mentioned some of the shortcomings of CoorNet, namely that it will falter if BotNets become more careful about disguising their coordinated behaviour.

While implementing CoorNet, we applied thresholds to detect coordinated behaviour in two tiers. In the first tier, we looked at the top 10% fastest retweeters in the first phase and then captured the accounts with the highest coordinated retweeting activity in the second phase. In the second tier, we directly applied the idea of phase two from tier one to capture bots missed by tier one owing to their lower number of retweets. We generated highly connected graphs for each of our datasets to observe and analyse the configuration of BotNets in a network, and found that BotNets effectively form strongly connected, disjoint communities. We also analysed the top accounts amplified by BotNets for each of the datasets, and the top domains boosted by BotNets for DS-2. By experimenting with the second threshold for tiers one and two, we achieved a recall of 57.82% for the DS-1 2016 dataset and 74.54% for the DS-1 2018 dataset. The model can still be improved; as part of future scope, we want to add topic modelling and sentiment analysis features to the model to detect the abstract topics amplified by BotNets and their positive or negative intentions.

References
[1] The Social Impact of Bad Bots and What to Do About Them
[2] Microsoft Seeks to Disrupt Russian BotNet it Fears Could Seek to Sow Confusion in the Presidential Elections
[3] The Many Faces of Anonymous
[4] Coordinated Link Sharing Behavior as a Signal to Surface Sources of Problematic Information on Facebook
[5] It Takes a Village to Manipulate the Media: Coordinated Link Sharing Behavior During 2018 and 2019 Italian Elections
[6] Understanding Coordinated and Inauthentic Link Sharing Behavior on Facebook in the Run-up
[7] Identifying Fake Accounts on Social Networks Based on Graph Analysis and Classification Algorithms
[8] Revealing and Detecting Malicious Retweeter Groups
[9] Mining and Modeling Temporal Activity in Social Media
[10] Enhanced PeerHunter: Detecting Peer-to-Peer Botnets Through Network-Flow Level Community Behavior Analysis
[11] Semantic Web
[12] Web Mining
[13] Towards Semantic Web Mining
[14] A Survey of Botnet and Botnet Detection
[15] "Accuracy, Precision, Recall, or F1?", towardsdatascience