key: cord-0542545-uy4dykhg authors: Albanese, Federico; Lombardi, Leandro; Feuerstein, Esteban; Balenzuela, Pablo title: Breaking the Communities: Characterizing community changing users using text mining and graph machine learning on Twitter date: 2020-08-24 journal: nan DOI: nan sha: 8543504f276bc8128c6849a0ad514d759d6e5db5 doc_id: 542545 cord_uid: uy4dykhg

Even though the Internet and social media have increased the amount of news and information people can consume, most users are only exposed to content that reinforces their positions and isolates them from other ideological communities. This environment has real consequences with great impact on our lives, such as severe political polarization, the easy spread of fake news, political extremism, hate groups and the lack of enriching debates, among others. Therefore, encouraging conversations between different groups of users and breaking closed communities is important for healthy societies. In this paper, we characterize and study users who break their community on Twitter using natural language processing techniques and graph machine learning algorithms. In particular, we collected 9 million Twitter messages from 1.5 million users and constructed the retweet networks. We identified their communities and the topics of discussion associated with them. With this data, we present a machine learning framework for social media user classification which detects "community breakers", i.e. users that swing from their closed community to another one. A feature importance analysis on three polarized political Twitter datasets showed that these users have low PageRank values, suggesting that the changes are driven by the lack of response to their messages within their communities. This methodology also allowed us to identify their specific topics of interest, providing a full characterization of this kind of user.
Technologically mediated social networks flourished as a social phenomenon at the beginning of this century with exponents such as Friendster (2002) or Myspace (2003) [1], but other popular websites soon took their place. Twitter is an online platform where news or data can reach millions of users in a matter of minutes [2]. People with different political opinions and diverse backgrounds interact on this social network. However, this diversity does not translate into enriching debates between users with different profiles, because they tend to cluster according to their beliefs, constituting homogeneous communities known as echo chambers [52]. Aruguete et al. focused on the interaction between users in political contexts and described how Twitter users frame political events by sharing content exclusively with like-minded users, forming two well-defined communities [5]. A segregated partisan structure, with extremely limited connections between communities of users with different political orientations in the retweet networks, has been reported in multiple papers, in different contexts and countries: for instance, the 2011 parliament elections in Germany [6], the climate change debate [64], the #BlackLivesMatter movement [7], the 2010 U.S. congressional midterm elections [8], the 2011 Canadian Federal Election [44], the Egyptian pro/anti-military intervention debate [42] and tweets about the death of Venezuelan President Hugo Chavez [43]. The presence of well-defined communities can also be found on other platforms and social media [53, 54, 61].

Most scientific works have focused on the dramatic consequences and negative effects of closed communities and echo chambers, which include the increase of negative discourse, hate speech and political extremism [58, 59], confirmation bias (i.e. 
the users' tendency to seek out and receive information that strengthens their preferred narrative) [53, 54, 56, 63] and the spreading of misinformation, baseless rumors and fake news [51, 55, 57] (one of the main threats to our society according to the World Economic Forum [60]). In this context, the large consumption of information through social networks and its consequences make it essential to analyze the behavior of these communities and to think about mechanisms that break them.

In this paper, we propose a machine learning framework to characterize the community breakers (i.e. the Twitter users that first belonged to a well-defined community and then started interacting mostly with different users, swinging to another community). Moreover, once we are able to correctly determine these users, we seek to identify their topics of interest, something that may be useful not only for the sake of understanding but also to intervene in the dynamics of the discussion. Three datasets were built and used in order to show that the methodology can be easily generalized to different scenarios. Namely, we examined three Twitter network datasets constructed with tweets from the 2017 Argentina parliamentary elections, tweets from the 2019 Argentina presidential elections and 2020 tweets about Donald Trump. For each dataset, we analyzed two different time periods and identified the largest communities, corresponding to the main political forces. Using graph topological information and the topics of discussion detected in the first network, we built and trained a model that classifies whether an individual will change their community, identifying the topics of interest and the relevant features of the community breakers.

Our main contributions are the following:

1. We describe a generalized machine learning framework for social media user classification, in particular to detect and characterize community changing users. 
This framework includes natural language processing techniques and graph machine learning algorithms in order to describe the features of each individual.
2. We experimentally analyze the machine learning framework by performing a feature importance analysis. While previous works used text, Twitter profiles and some tweeting behavior characteristics to automatically classify users with machine learning [9-12], here we show the value of adding graph features in order to identify the label of a user. In particular, we assert the importance of a low value of the "PageRank" [18] measure for this specific task. A possible interpretation of this result is that a person changes their community because their message was not heard in their previous community.
3. We also identify the topics that are considerably more relevant to the community breakers. Identifying these key topics has a valuable impact for social science and politics.

The paper is organized as follows. In the Data Collection section, we describe the data used in the study. In the Methods section, we describe the graph unsupervised learning algorithms and other graph metrics that were used, the natural language processing tools applied to the tweets and a machine learning model for the task of classifying the community breakers. In the Results section, we analyze the classification model and the important characteristics of these users. Finally, we interpret these results in the Conclusions section.

Twitter has several APIs available for developers. Among them is the Streaming API, which allows the developer to download in real time a sample of the tweets that are uploaded to the social network, filtering it by language, terms, hashtags, etc. [13, 14]. The data is composed of the tweet id, the text, the date and time of the tweet, the user id and the username, among other features. In the case of a retweet, it also includes the information of the original tweet's user account.
For this research, we collected three datasets, each covering two different periods of time: 2017 Argentina parliamentary elections (2017ARG), 2019 Argentina presidential elections (2019ARG) and 2020 United States tweets about Donald Trump (2020US). For the Argentinian datasets, the Streaming API was used during the week preceding the primary elections and the week before the general elections. Keywords were chosen according to the four main political parties present in the elections. Details and context can be found in the supporting information. For the 2020US dataset, we used "realDonaldTrump" (the official account of president Donald Trump) as keyword and the weeks from May 9th to May 16th and from June 10th to June 16th of 2020 as first and second time period respectively. Twitter messages are in the public domain and only public tweets filtered by the Twitter API were collected for this work. For the purpose of this research, we analyzed more than 9 million tweets and more than 1.5 million individuals in total. The specific start and end collection dates and the total number of tweets and users of each dataset can be seen in Table 1.

In this section, we present the methodology employed to characterize the Twitter users. We start with the retweet network and the algorithm to find communities. Then, we introduce the different metrics that describe the interaction networks among users. We also extract the text features of the tweets using a natural language processing algorithm. Finally, we describe a supervised learning model which uses the individuals' characteristics as instances and predicts which users change their community. These models allow us to highlight which user features characterize the community breaking users.

We represent the interaction among individuals in terms of a graph G = (N, E), where users are nodes (N) and retweets between them are edges (E).
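As an illustration of this representation, the following minimal sketch builds a directed, weighted edge list from retweet records, where the weight of an edge is the number of retweets between the two users. The input format (pairs of retweeter and content creator), the function names and the example users are assumptions made for this sketch and are not taken from the paper's code. The edge orientation is left as a parameter because it is a modelling choice.

```python
from collections import Counter


def build_retweet_edges(retweets, creator_to_retweeter=True):
    """Return {(source, target): weight} for a directed, weighted retweet graph.

    `retweets` is an iterable of (retweeter, creator) pairs, one per retweet
    (hypothetical input format). The weight counts how many times the same
    pair of users appears.
    """
    edges = Counter()
    for retweeter, creator in retweets:
        if creator_to_retweeter:
            edges[(creator, retweeter)] += 1  # edge follows the information flow
        else:
            edges[(retweeter, creator)] += 1  # edge points to the content creator
    return dict(edges)


def nodes_of(edges):
    """Users that appear in at least one edge; isolated users never show up."""
    return {user for edge in edges for user in edge}


# Toy example with made-up users: ana retweets bob twice, and so on.
retweets = [("ana", "bob"), ("ana", "bob"), ("carl", "bob"), ("bob", "dana")]
cr = build_retweet_edges(retweets, creator_to_retweeter=True)
rc = build_retweet_edges(retweets, creator_to_retweeter=False)
```

Both dictionaries describe the same retweets; only the direction of every edge is reversed, which is enough to change direction-sensitive metrics such as in-degree or PageRank.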
Considering that a user can be retweeted multiple times by another user, this is well modelled by a directed and weighted graph. However, when a user n_1 retweets a tweet written by another user n_2, should the edge point from n_1 to n_2 or from n_2 to n_1? This definition has important implications. In the first scenario, the edges represent pointers to the "influencers" and important content generators. In the second scenario, the edges represent the flow of information through the network, going from the source to the user who spread the message. Indeed, there is no clear consensus in the scientific literature about which direction should be given to the edges: while some authors [47-49] use the first, others [8, 43, 50] prefer the second one. Although the two definitions mirror each other, they are different, and we cannot tell a priori which one is better for our purpose, so we decided to calculate the topological features in both scenarios. We named the directions of the edges CR (from content Creator to Retweeter) and RC (from Retweeter to content Creator). Isolated nodes (users who never retweeted and were never retweeted) were not taken into account for this analysis.

In Fig 1, we can visualize the retweet network for each time period and dataset. In the case of the US dataset, most of the users are concentrated in two groups, portraying the political polarization in that country. On the other hand, in the Argentinean datasets we can identify two large groups and also some smaller ones. The graph visualizations are produced with the Force Atlas 2 layout using the Gephi software [15]. Each node is a Twitter user (colored depending on its community) and each edge (directed and weighted) represents the retweets between two given users (in black).

In a given graph, a community is a set of nodes strongly connected among themselves and with little or no connection with nodes of other communities [21].
We implement an algorithm to detect communities in large networks, which allows us to characterize the users by their relationship with other users. In this context, the modularity is defined as the fraction of the edges that fall within a given community minus the expected fraction if edges were distributed at random [39]. The Louvain method for community detection [22] seeks to maximize modularity by using a greedy optimization algorithm. This method was chosen to perform the analysis due to the characteristics of the data. While other algorithms such as Label Propagation work well on large networks, their performance decreases if clusters are not well defined [23]. In contrast, in these cases the Louvain or Infomap methods obtain better results. However, for the size of our graphs (in the order of hundreds of thousands of nodes and about one million edges), the Louvain method is more efficient than the other ones in terms of the computation time required, because it scales roughly linearly with the number of edges [24].

Although several communities were found, we only considered the largest ones in each case. For the 2017ARG and 2019ARG datasets we used the four biggest communities because, when examining the text of the tweets and the users with the highest degree, each one had a clear political orientation corresponding to the four biggest political parties in the election. These communities are labeled as "Cambiemos", "Unidad Ciudadana", "Partido Justicialista" and "1 Pais" for 2017ARG and "Frente de Todos", "Juntos por el Cambio", "Consenso Federal" and "Frente de Izquierda-Unidad" for 2019ARG (electoral context is provided in the supporting information).
Regarding the 2020US dataset, we used the two biggest communities because of the bipartisan political system of the United States (Republicans and Democrats) and the clear structure present in the retweet networks, where only two big clusters concentrate almost all of the users and interactions (see Fig 1). In contrast, the Argentinean election datasets have two principal communities and some minor communities as well. This network topology with highly connected and polarized clusters has been reported in previous works [5, 7, 8, 16, 17, 36]. Given the stochasticity of the method, we follow the solution proposed by Lancichinetti et al. [38], which runs the Louvain method several times (100 in our case) and, in order to minimize the possibility of an incorrect labeling, keeps for the machine learning task only the nodes that were assigned to the same community in all iterations.

Given that the analyzed datasets comprise two snapshots of the retweet network separated in time, we need to fully characterize the users in the early networks in order to properly identify those users that change their community. With this goal, we computed the following metrics for each user in the network: degree, in-degree, out-degree, PageRank [18], betweenness centrality [19], clustering coefficient [20] and cluster affiliation (the community detected by the Louvain method). As mentioned earlier, it is important to note that the direction of the edges of the network drastically affects the value of these metrics. Consequently, we calculated them with both interpretations. All these metrics were used as features in the machine learning classification task and in the feature importance analysis.

In order to determine the topics of discussion during the first period of each dataset, we analyzed the text of the tweets using natural language processing and calculated a low-dimensional embedding for each user.
The tweets were described as vectors through the Term Frequency-Inverse Document Frequency (tf-idf) representation [26]. Each value in the vector corresponds to the frequency of a word in the tweet (the term frequency, tf) weighted by a factor which measures its degree of specificity (the inverse document frequency, idf). We used 3-grams and a modified stop-words dictionary that contained not only articles, prepositions, pronouns and some verbs but also the names of the politicians and parties and words like "election". Then, we constructed a matrix M by concatenating the tf-idf vectors, whose dimensions are the number of tweets times the number of terms. We performed topic decomposition using Non-Negative Matrix Factorization (NMF) [25] on the matrix M. NMF is an unsupervised topic model which factorizes the matrix M into two matrices H and W with the property that all three matrices have no negative elements. We selected the NMF algorithm because this non-negativity makes the resulting matrices easier to inspect and interpret. The matrix H is a representation of the tweets in the topic space, in which the columns are the degree of membership of each tweet to a given topic. On the other hand, the matrix W provides the combination of terms which describes each topic [35]. The results obtained by analyzing just the tweets corresponding to the first time period are detailed in the supporting information. The decomposition dimension was swept between 5 and 30, and for each dataset we chose the number of topics in the corpus so as to have a clear interpretation of each one. The same methodology was used and described in [4, 35]. Once we collected all this information, Twitter users were also characterized by a vector of features where each cell corresponds to one of the topics and its value to the percentage of tweets the user tweeted with that topic.
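The factorization and the per-user topic vector described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the tf-idf matrix M is taken as already computed (the toy values below are made up), the factorization uses the standard multiplicative update rules rather than whichever NMF solver the authors used, and each tweet is assigned to its highest-weight topic, which is one possible reading of "the percentage of tweets the user tweeted with that topic". All function names are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(m, h, w):
    """Squared reconstruction error ||M - HW||^2."""
    approx = matmul(h, w)
    return sum((x - y) ** 2 for rm, ra in zip(m, approx) for x, y in zip(rm, ra))


def nmf(m, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize M (tweets x terms) as H (tweets x topics) times W (topics x terms)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(m), len(m[0])
    h = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_rows)]
    w = [[rng.random() + eps for _ in range(n_cols)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (M W^T) / (H W W^T), element-wise
        wt = transpose(w)
        num, den = matmul(m, wt), matmul(h, matmul(w, wt))
        h = [[h[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_rows)]
        # W <- W * (H^T M) / (H^T H W), element-wise
        ht = transpose(h)
        num, den = matmul(ht, m), matmul(matmul(ht, h), w)
        w = [[w[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(n_topics)]
    return h, w


def user_topic_shares(h, tweet_authors, n_topics):
    """Fraction of each user's tweets whose dominant topic is k (assumed rule)."""
    counts = {}
    for row, author in zip(h, tweet_authors):
        dominant = max(range(n_topics), key=lambda k: row[k])
        counts.setdefault(author, [0] * n_topics)[dominant] += 1
    return {u: [c / sum(cs) for c in cs] for u, cs in counts.items()}


# Toy tf-idf matrix: 4 tweets x 4 terms (made-up values).
m = [[1.0, 0.8, 0.0, 0.0], [0.9, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.7], [0.0, 0.0, 0.6, 1.0]]
h, w = nmf(m, n_topics=2)
shares = user_topic_shares(h, ["u1", "u1", "u2", "u1"], n_topics=2)
```

Because the updates only multiply non-negative numbers, H and W stay non-negative throughout, which is the property that makes the topics easy to inspect. Note that the roles of H and W follow the convention of the text above, which differs from the naming used by some libraries.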
Given that our objective was to characterize users who "break" their community and start interacting with users from other clusters, we implemented a machine learning model which classifies users and then performed a feature importance analysis. The instances of the model were the Twitter users who were active during both time periods [27] and belonged to one of the biggest communities in the networks of both time periods. Consequently, the number of users considered at this stage was reduced. Individuals were characterized by a feature vector with components corresponding to the topological metrics mentioned above and others corresponding to the percentage of tweets in each one of the topics of interest extracted with Non-Negative Matrix Factorization. The information used to construct these embeddings was gathered from the whole retweet network of the first time period. The target was a binary vector that takes the value 1 if the user changed communities between the first and the second time periods (a community breaker) and 0 otherwise (not a community breaker). The summary of the datasets is shown in Table 2.

The gradient boosting technique uses an ensemble of predictive models to perform the task of supervised classification and regression [28]. These predictive models are then optimized, iteration by iteration, using the gradient of the cost function of the previous iteration. In this scenario, XGBoost, a particular implementation of this technique, has proven to be efficient in a wide variety of supervised scenarios, outperforming previous models [29]. We used a 67/33 random split between train and test. In order to tune the hyper-parameters of the XGBoost models, we used the randomized search method [30] over the training dataset with 3-fold cross-validation, which consists of trying different random combinations of parameters to find an optimum.
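The definition of instances and target given above can be sketched as follows. The sketch assumes that each period's community assignment is available as a dictionary from user to community label, and that the labels of the two periods have already been matched to the same names (for example party names); the function name and the example users are hypothetical.

```python
def build_targets(first_period, second_period, main_communities):
    """Label users active in both periods and in a main community both times.

    Returns {user: 1} for community breakers and {user: 0} otherwise.
    Community labels are assumed to be comparable across periods.
    """
    targets = {}
    for user, before in first_period.items():
        after = second_period.get(user)
        if after is None:
            continue  # user was not active in the second period
        if before in main_communities and after in main_communities:
            targets[user] = 1 if before != after else 0
    return targets


# Toy example with made-up users and the two 2020US community names.
first = {"u1": "Republican", "u2": "Democrat", "u3": "Republican", "u4": "Other"}
second = {"u1": "Republican", "u2": "Republican", "u4": "Democrat"}
targets = build_targets(first, second, {"Republican", "Democrat"})
```

Here `u3` is dropped for being inactive in the second period and `u4` for not starting in one of the main communities, which mirrors why the number of users considered at this stage is reduced.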
Finally, we performed random permutations of the feature values among users in order to understand which of them are the most important for the performance of our model (using the so-called Permutation Feature Importance algorithm [32]). In this way, we could identify the most important characteristics that separate the users that do change their community from those that do not change who they interact with.

Users tweet about different topics. Some discussed topics are more frequent among users that change their community than in the general audience. Considering that most users do not change their community and always interact with the same users, a simple analysis of the whole dataset and a listing of Twitter trending topics may not be a good representation of the community breakers' interests. Consequently, we defined a "community breaking topic" as a topic used primarily by the community breakers and not used by other users. With the intention of doing a deeper analysis of the topic embedding for each dataset, we first enumerate the main topics in each corpus.

For the 2020US dataset, the two most used topics found were the U.S. president "Donald Trump" and "the World Health Organization". However, Fig 2 shows that these topics were primarily used by users that did not change their community. In contrast, the "Obamagate" topic was used by users that changed from the Republican community to the Democrat community. On the other hand, the topic "Thank you", where people thank president Donald Trump and vindicate his health policies, was the main topic used by users that changed from the Democrat community to the Republican one. Considering that these last two topics were used more by community breakers than by others, we refer to them as "community breaking topics". In contrast, other topics such as "World Health Organization" or "Donald Trump" were commonly used by most of the users but not by the users who altered the users they interact with.
In Fig 3, the most important topics for the 2017ARG classifier are "Economy", "Once Tragedy" and "Santiago Maldonado". We can contextualize these results by looking at which are the main topics discussed in each community, as well as the ones discussed among the users that change between them. We can see that "Venezuela" is one of the most discussed topics among the people remaining in the four communities and "Santiago Maldonado" is a relevant topic in the communities "Unidad Ciudadana" and "1 Pais". When we look at the main topics discussed by users that change their communities between elections, we can observe that "Venezuela" identifies those that go from "Partido Justicialista (PJ)" to "1 Pais" and "Cambiemos", while "Santiago Maldonado" is a key topic among those who arrive at "Unidad Ciudadana" from "Partido Justicialista (PJ)" and "1 Pais". The topic "Once Tragedy" is primarily used by the users that change from "1 Pais" to "Cambiemos". Considering that these topics are used considerably more by the users who change their community than by the other users, it can be affirmed that these are "community breaking topics". In contrast, other topics such as "Economy" or "Santa Cruz" were also commonly used by most of the users but not by the users who altered the users they interact with.

When the percentage of users that changed from one community to the other was less than 1%, the corresponding arrow is not drawn.

We trained three different gradient boosting models for each dataset: the first one was trained only with the features obtained via text mining (how many tweets the user posted about each of the selected topics); a second one was trained just with the features obtained through complex network analysis (degree, PageRank, betweenness centrality, clustering coefficient and cluster affiliation); and the last one was trained with all the data.
In this way, we could compare the importance of the natural language processing and the complex network analysis for the task of classifying community changing users. In Table 3 we can see the area under the ROC (receiver operating characteristic) curve [31] of the different models for each dataset. The best performance is obtained in all cases by the machine learning model built with all the features of the users, which is able to predict the community breakers more effectively. This result is expected, since an ensemble of models manages to have sufficient depth and robustness to understand the network information, the topics of the tweets and the graph characteristics of the users. In Table 3 we can also observe that the graph features are more informative than the text ones for classifying users, since the model with only graph features has a higher score than the model with only text features.

We performed random permutations of the feature values among users in order to understand which of them are the most important for the performance of our model (using the so-called Permutation Feature Importance algorithm [32]). In Fig 5, we observed that the most important feature in all cases corresponds to the node's connectivity: PageRank_CR, where the edges point from the tweet source (the content creator) to the user who retweeted. In contrast, PageRank_RC (corresponding to the other direction of the edges) had a lower feature importance coefficient in all three datasets. This means that there is a clear privileged direction of edges for the task of detecting the community breakers. When comparing the PageRank_CR (PR) averages of the community breakers with those of the other users, we observed that the latter had higher values in all cases (Table 4). We applied the Kolmogorov-Smirnov test [62] to the PR distributions of each set and found that these differences were statistically significant in all cases (p < 0.001).
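To make the two evaluation tools used above concrete, the following minimal sketch computes the area under the ROC curve through its rank interpretation and measures permutation feature importance as the mean drop in that score when one feature column is shuffled among users. It is an illustration only: the scoring function is a hand-written stand-in (lower PageRank gives a higher score) and not a trained XGBoost model, and the feature values, names and number of repeats are made up for the example.

```python
import random


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean drop in AUC when a single feature column is shuffled among users."""
    rng = random.Random(seed)
    baseline = roc_auc(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - roc_auc(y_true, [predict(r) for r in shuffled]))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Toy data: each row is [pagerank, unrelated_feature]; 1 marks a community breaker.
rows = [[0.01, 5.0], [0.02, 1.0], [0.03, 4.0], [0.20, 2.0], [0.30, 6.0], [0.40, 3.0]]
y = [1, 1, 1, 0, 0, 0]


def predict(row):
    """Stand-in scorer (assumption): the lower the PageRank, the higher the score."""
    return -row[0]


baseline, importances = permutation_importance(predict, rows, y)
```

In this toy setting the unrelated feature gets an importance of exactly zero, because the stand-in scorer never reads it, while shuffling the PageRank column lowers the AUC. With a real model the same procedure would be applied to its predicted probabilities on held-out users.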
PageRank measures how relevant or important a user is in the retweet network, based on the retweets of their messages and the importance of the users who retweeted them. The direction of PageRank_CR represents the information flow in the network, starting from the tweet creator and then spreading through the network. The fact that the community breakers had statistically lower PageRank_CR values means that these users were less relevant to the Twitter conversation and their messages did not spread in their original community. A possible interpretation of these results is that a user changes community when their messages receive no response from their original community. The fact that PageRank_CR is the most important feature is also consistent with the model trained with network features achieving a better AUC than the model trained with the texts of the tweets in the three datasets. Previous works used text, Twitter profiles and some tweeting behavior characteristics to automatically classify users with machine learning, but none of them incorporated these graph metrics [9-12, 33]. Our work shows the importance of also including these graph features in order to identify the community breakers.

In this paper we presented a machine learning framework to identify and characterize users who break their community. The framework includes natural language processing techniques to detect their topics of interest and graph machine learning algorithms to describe how an individual interacts with other users. Three datasets were used in this analysis: 2017ARG, 2019ARG and 2020US. These datasets were constructed to cover different scenarios: tweets from two countries, different political contexts (a parliamentary election, a presidential election and a non-election period) and different party systems (a multi-party system and a two-party system).
The machine learning framework was applied to these different datasets with similar results, showing that the methodology can be easily generalized. We found that the community breakers had statistically lower values of PageRank_CR. This graph feature was also the most important indicator for the classification task in all three datasets according to the feature importance analysis. Therefore, our results indicate that this feature does a good job of characterizing whether a user is a community breaker. This feature was neglected in previous works on user classification on Twitter [9-12, 33]. In particular, our results also show that there is a clearly privileged direction on the network for this task, with the edges going from the content creator to the retweeter. A possible interpretation of these last two results is that users change who they interact with when their messages have no response and are not being "heard" by their community.

Finally, the proposed framework also identifies which topics are of interest for these users. Being able to identify the topics that encourage users to interact with other users outside their community is of vital importance in order to reduce the effects of echo chambers, such as political extremism, hate groups or the spread of fake news, and to stimulate enriching debates. Also, this methodology might be useful for a political party, helping it decide which issues should be prioritized in its agenda with the intention of maximizing the number of individuals that migrate to its community. Understanding the characteristics and the topics of interest of the community breakers in a polarized environment can provide an enormous benefit for social scientists and political parties. This research intends to supply them with tools to improve their understanding of the behavior of these users.
Supporting information.pdf Extra information: We described the political context of the datasets and specified the keywords which were used for collecting the tweets using the public Twitter API.

Lon From Friendster To MySpace To Facebook: The Evolution and Deaths Of Social Networks longislandpress
republicans and starbucks afficionados: user classification in twitter Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Virgilio Detecting spammers on twitter Collaboration, electronic messaging, anti-abuse and spam conference
Characterizing and modeling an electoral campaign in the context of Twitter: 2011 Spanish Presidential election as a case study Chaos: an interdisciplinary journal of nonlinear science
Up and running: Learn how to build applications with the Twitter
Is the sample good enough? comparing data from twitter's streaming api with twitter's firehose Seventh international AAAI conference on weblogs and social media
Emotion shapes the diffusion of moralized content in social networks
11TH International Conference On Signal-Image Technology & Internet-Based Systems (SITIS)
Terry The PageRank citation ranking: Bringing order to the web
Ulrik A faster algorithm for betweenness centrality
Lancichinetti, Andrea; Fortunato, Santo Community detection algorithms: a comparative analysis Physical review E
Katy Analysis of network clustering algorithms and cluster quality metrics at scale PloS one
Yihong Document clustering based on non-negative matrix factorization Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Using tf-idf to determine word relevance in document queries Proceedings of the first instructional conference on machine learning
Giulio Challenges in community discovery on temporal networks Temporal Network Theory
A scalable tree boosting system Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining
Didrik Tree boosting with xgboost-why does xgboost win "every" machine learning competition?
Yoshua Random search for hyper-parameter optimization
Comparing effect sizes in follow-up studies: ROC Area, Cohen's d, and r Law and human behavior
Thomas Permutation importance: a corrected feature importance measure
Jonny Identifying communicator roles in twitter Proceedings of the 21st International Conference on World Wide Web
Tim Ken Use of Machine Learning to Detect Illegal Wildlife Product Promotion and Sales on Twitter Frontiers in Big Data
Viktoriya Semeshenko; Pablo Balenzuela Analyzing mass media influence using natural language processing and time series analysis
Social Media Proceedings of the Ninth ACM International Conference on Web Search and Data Mining
Sandeepa Mining Twitter for Fine-Grained Political Opinion Polarity Classification, Ideology Detection and Sarcasm Detection Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
Measuring Polarization in Twitter Enabled in Online Political Conversation: The Case of 2016 US Presidential Election
Judgments and group discussion: Effect of presentation and memory factors on polarization Sociometry
Ingmar Content and network dynamics behind Egyptian political polarization on Twitter Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing
Measuring political polarization: Twitter shows the two sides of Venezuela Chaos
Jeffrey Investigating political polarization on Twitter: A Canadian perspective Policy & internet
Amiram Testing two classes of theories about group induced shifts in individual choice
Amiram What a person thinks upon learning he has chosen differently from others: Nice evidence for the persuasive-arguments explanation of choice shifts
Paola On the retweet decay of the evolutionary retweet graph International Conference on Smart Objects and Technologies for Social Good
Retweeting by the geographically-vulnerable during Hurricane Sandy Proceedings of the 18th ACM conference on computer supported cooperative work & social computing
Hae-Chang Finding interesting posts in twitter based on retweet graph analysis Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Jaideep From retweet to believability: Utilizing trust to identify rumor spreaders on Twitter Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Jinyoung Rumor propagation is amplified by echo chambers in social media. Scientific reports
Rush Limbaugh and the conservative media establishment
Karrie Blogs are echo chambers: Blogs are echo chambers
The spreading of misinformation online
Richard Tweeting from left to right: Is online political communication more than an echo chamber? Psychological science
Petter Echo chambers and viral misinformation: Modeling fake news as complex contagion
Emotional contagion and group polarization on facebook Scientific reports
Fabricio Inside the right-leaning echo chambers: Characterizing gab, an unmoderated social system
Michele Starnini The echo chamber effect on social media Proceedings of the National Academy of Sciences
The significance probability of the Smirnov two-sample test Arkiv för Matematik
Mapping social dynamics on Facebook: The Brexit debate
Heating up the debate? Measuring fragmentation and polarisation in a German climate change hyperlink network

We thank Sebastián Pinto for careful reading of the manuscript and helpful comments.