key: cord-227156-uy4dykhg authors: Albanese, Federico; Lombardi, Leandro; Feuerstein, Esteban; Balenzuela, Pablo title: Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter date: 2020-08-24 journal: nan DOI: nan sha: doc_id: 227156 cord_uid: uy4dykhg

The formation of majorities in public discussions often depends on individuals who shift their opinion over time. The detection and characterization of this type of individual is therefore extremely important for the political analysis of social networks. In this paper, we study changes in individuals' affiliations on Twitter using natural language processing techniques and graph machine learning algorithms. In particular, we collected 9 million Twitter messages from 1.5 million users and constructed the retweet networks. We identified communities with explicit political orientation and the topics of discussion associated with them, which provide the topological representation of the political map on Twitter in the analyzed periods. With that data, we present a machine learning framework for social media user classification which efficiently detects "shifting users" (i.e., users that may change their affiliation over time). Moreover, this machine learning framework allows us to identify not only which topics are more persuasive (using a low-dimensional topic embedding), but also which individuals are more likely to change their affiliation given their topological properties in a Twitter graph.

Technologically mediated social networks flourished as a social phenomenon at the beginning of this century with exponents such as Friendster (2002) or Myspace (2003) [1], but other popular websites soon took their place. Twitter is an online platform where news or data can reach millions of users in a matter of minutes [2]. Twitter is also of great academic interest, since individuals voluntarily and openly express their opinions and can interact with other users by retweeting their tweets.
In particular, in the last decade there has been an increase in interest from computational social scientists, and numerous political studies have been published using information from this platform [3] [4] [5] [6] [7] [8]. Previous works applied different machine learning models to these datasets. Xu et al. collected tweets using the Streaming API and implemented an unsupervised machine learning framework for detecting online wildlife trafficking using topic modeling [41]. Kurnaz et al. proposed a methodology which first extracts features from the text of a tweet and then applies deep sparse autoencoders in order to classify the sentiment of tweets [10]. Pinto et al. detected and analyzed the topics of discussion in the text of tweets and news articles, using Non-Negative Matrix Factorization [32], in order to understand the role of mass media in the formation of public opinion [11]. Kannangara, in turn, implemented a probabilistic method to identify the topic, sentiment and political orientation of tweets [44]. Other works focus on political analysis and the interaction between users. For instance, Aruguete et al. described how Twitter users frame political events by sharing content exclusively with like-minded users, forming two well-defined communities [12]. Dang-Xuan et al. downloaded tweets during the 2011 parliamentary elections in Germany and characterized the role of influencers using the retweet network [13]. Stewart et al. used community detection algorithms over a network of retweets to understand the behavior of trolls in the context of the #BlackLivesMatter movement [14]. Conover et al. [15] also used similar techniques over a retweet network and showed a segregated partisan structure, with extremely limited connection between clusters of users with different political ideologies, during the 2010 U.S. congressional midterm elections.
The same polarization of the Twitter network can be found in other contexts and countries (Canada [53], Egypt [51], Venezuela [52]). Opinion shifts in group discussions have been studied from different points of view. In particular, according to the Persuasive Arguments Theory (PAT), opinion shifts can be produced by the exchange of arguments [48, 54, 55]. Primario et al. applied this theory to measure the evolution of political polarization on Twitter during the 2016 US presidential election [47]. Along the same lines, Holthoefer et al. analyzed the dynamics of Egyptian polarization on Twitter [51]. They classified the tweets into two groups (pro/anti military intervention) based on their text and estimated the overall proportion of users that changed their position. These works analyzed the macro dynamics of polarization rather than focusing on the individuals. In contrast, we found it interesting not only to characterize the Twitter users who change their political opinion, but also to predict these "shifting voters". Therefore, this paper focuses on the individuals rather than the aggregated dynamics, using machine learning algorithms. Moreover, once we were able to correctly determine these users, we sought to distinguish between persuasive and non-persuasive topics (footnote 1).

In this paper, we examined three Twitter network datasets constructed with tweets from the 2017 Argentina parliamentary elections, the 2019 Argentina presidential elections and 2020 tweets of Donald Trump. Three datasets were constructed and used in order to show that the methodology can be easily generalized to different scenarios. For each dataset, we analyzed two different time periods and identified the largest communities, corresponding to the main political forces.
Using graph topological information and the topics of discussion detected in the first network, we built and trained a model that effectively predicts when an individual will change his/her community over time, identifying persuasive topics and relevant features of the shifting users. Our main contributions are the following:

1. We described a generalized machine learning framework for social media user classification, in particular for detecting a user's affiliation at a given time and whether the user will change it in the future. This framework includes natural language processing techniques and graph machine learning algorithms in order to describe the features of an individual.
2. We observed that the proposed machine learning model performs well on the task of predicting changes in a user's affiliation over time.
3. We experimentally analyzed the machine learning framework by performing a feature importance analysis. While previous works used text, Twitter profiles and some tweeting behavior characteristics to automatically classify users with machine learning [16] [17] [18] [19], here we showed the value of adding graph features in order to identify the label of a user. In particular, we showed the importance of PageRank for this specific task.
4. We also identified the topics that are considerably more relevant and persuasive to the shifting users. Identifying these key topics has a valuable impact for social science and politics.

The paper is organized as follows. In the Data Collection section, we describe the data used in the study. In the Methods section, we describe the graph unsupervised learning algorithms and other graph metrics that were used, the natural language processing tools applied to the tweets and the machine learning model. In the Results section, we analyze the performance of the model for the task of detecting shifting individuals. Finally, we interpret these results in the Conclusions section.
The code is on GitHub (omitted for anonymity reasons).

[Footnote 1, beginning lost] persuasive) for the topics relevant (resp. non relevant) to those individuals

Twitter has several APIs available to developers. Among them is the Streaming API, which allows the developer to download in real time a sample of the tweets that are uploaded to the social network, filtered by language, terms, hashtags, etc. [20, 21]. The data is composed of the tweet id, the text, the date and time of the tweet, the user id and username, among other features. In the case of a retweet, it also includes the information of the original tweet's user account.

For this research, we collected 3 datasets: 2017 Argentina parliamentary elections (2017ARG), 2019 Argentina presidential elections (2019ARG) and 2020 United States tweets of Donald Trump (2020US). For the Argentinean datasets, the Streaming API was used during the week before the primary elections and the week before the general elections took place. Keywords were chosen according to the four main political parties present in the elections. Details and context can be found in the Appendix. For the 2020US dataset, "realDonaldTrump" (the official account of President Donald Trump) was used as the keyword. Twitter messages are in the public domain and only public tweets filtered by the Twitter API were collected for this work. For the purpose of this research, we analyzed more than 9 million tweets and more than 1.5 million individuals in total. The specific start and end collection dates and the total number of tweets and users can be seen in Table 7.

In this section, we introduce the methodology used to characterize the Twitter users. First, we describe the retweet networks (Section 3.1) and the algorithm used to find communities (Section 3.2). Then, we present the different metrics which describe the interaction networks among users (Section 3.3). After that, we describe the features obtained by analyzing the text of the tweets (Section 3.4).
Finally, we describe the supervised learning model which uses the individuals' characteristics as instances and predicts the shifting users.

We represent the interaction among individuals as a graph, where users are nodes and retweets between them (one or more) are edges (undirected and unweighted). Isolated nodes (users who never retweeted and were never retweeted) were not taken into account in this analysis. Figure 1 shows the retweet network for each time period and dataset. In the case of the US dataset, most of the users are concentrated in two groups, which makes the political polarization visible. In the Argentinean datasets, on the other hand, we can identify two large groups and also some smaller ones. The graph visualizations were produced with the Force Atlas 2 layout using the Gephi software [22].

In a given graph, a community is a set of nodes densely connected among themselves and with little or no connection to nodes of other communities [28]. We implemented an algorithm to detect communities in large networks, which allows us to characterize the users by their relationship with other users. In this context, the modularity is defined as the fraction of the edges that fall within a given community minus the expected fraction if edges were distributed at random [46]. The Louvain method for community detection [29] seeks to maximize modularity by using a greedy optimization algorithm. This method was chosen to perform the analysis due to the characteristics of the data. While other algorithms such as label propagation work well on large networks, their performance decreases if clusters are not well defined [30]. In these cases, the Louvain or Infomap methods obtain better results. Moreover, given that the number of nodes is on the order of hundreds of thousands and the number of edges on the order of one million, the Louvain method performs better than the alternatives [31].
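The graph construction and the modularity definition above can be illustrated with a minimal, dependency-free sketch. The function names and the toy retweet pairs are hypothetical, dropping self-retweets is an assumption not stated in the text, and the random baseline is taken to be the standard degree-preserving one. In practice a library implementation of Louvain would search for the partition; here the partition is simply given.

```python
from collections import defaultdict


def build_retweet_graph(retweets):
    """Return an undirected, unweighted edge set from (retweeter, author) pairs.

    Repeated retweets between the same two users collapse into one edge.
    Self-retweets are dropped (an assumption). Users without any edge never
    appear, which mirrors the exclusion of isolated nodes.
    """
    edges = set()
    for retweeter, author in retweets:
        if retweeter != author:
            edges.add(frozenset((retweeter, author)))
    return edges


def modularity(edges, community_of):
    """Sum over communities of: fraction of edges inside the community minus
    the fraction expected if edges were placed at random with the same
    node degrees."""
    m = len(edges)
    internal = defaultdict(int)      # edges with both ends in the community
    degree_sum = defaultdict(int)    # total degree of the community's nodes
    for edge in edges:
        u, v = tuple(edge)
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy data: two triangles of users joined by a single retweet.
retweets = [
    ("a", "b"), ("b", "c"), ("c", "a"), ("a", "b"),   # duplicate collapses
    ("x", "y"), ("y", "z"), ("z", "x"),
    ("c", "x"),                                        # bridge between groups
]
edges = build_retweet_graph(retweets)
partition = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
q_split = modularity(edges, partition)
q_single = modularity(edges, {node: 0 for node in partition})
```

Splitting the toy graph into its two triangles gives a positive modularity, while placing every user in one community gives zero, which is the kind of difference a modularity-maximizing method such as Louvain exploits.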
Although several communities were found, we considered only the largest ones in each case. For the 2017ARG and 2019ARG datasets we used the four biggest communities because, when examining the text of the tweets and the users with the highest degree, each one had a clear political orientation corresponding to one of the four biggest political parties in the election. These communities are labeled as "Cambiemos", "Unidad Ciudadana", "Partido Justicialista" and "1 Pais" for 2017ARG and "Frente de Todos", "Juntos por el Cambio", "Consenso Federal" and "Frente de Izquierda-Unidad" for 2019ARG (electoral context is provided in the Appendix). For the 2020US dataset, we used the 2 biggest communities because of the bipartisan political system of the United States (Republicans and Democrats) and the clear structure present in the retweet networks, where only two big clusters concentrate almost all of the users and interactions (see Figure 1). In contrast, the Argentinean election datasets have two principal communities and some minor communities as well.

Considering that our dataset has more than 9 million tweets and more than 1.5 million users, it was not feasible to determine the true political identification of the users for this task, nor was it viable to assign labels manually. Therefore, we decided to use the community labels of the retweet network as a proxy for political membership, and to interpret changes in a user's label as changes in affiliation over time. This decision is supported by previous literature, which shows that communities identify a user's ideology and political membership [12, 14, 15, 23, 24, 43]. Moreover, taking into account the stochasticity of the Louvain method and following [45], we decided to use for the machine learning task only the nodes that were always assigned to the same community, in order to minimize the possibility of an incorrect labeling.
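Keeping only the nodes that always fall in the same community can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [45]: community labels are arbitrary in each run, so each run is first aligned to the first run by majority overlap, and the function names and toy partitions are hypothetical.

```python
from collections import Counter


def align_to_reference(reference, partition):
    """Rename each label of `partition` to the reference label carried by
    most of that community's nodes (majority overlap)."""
    overlap = {}
    for node, label in partition.items():
        overlap.setdefault(label, Counter())[reference[node]] += 1
    rename = {label: counts.most_common(1)[0][0]
              for label, counts in overlap.items()}
    return {node: rename[label] for node, label in partition.items()}


def stable_nodes(partitions):
    """Return {node: label} for nodes that receive the same aligned community
    in every run. The first run is the reference, and all runs are assumed
    to cover the same nodes."""
    reference = partitions[0]
    aligned = [reference] + [align_to_reference(reference, p)
                             for p in partitions[1:]]
    return {
        node: label
        for node, label in reference.items()
        if all(run[node] == label for run in aligned)
    }


# Three hypothetical runs over the same six users. Label names differ between
# runs, and user "c" switches sides in the third run.
runs = [
    {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1},
    {"a": 5, "b": 5, "c": 5, "x": 9, "y": 9, "z": 9},
    {"a": 1, "b": 1, "c": 0, "x": 0, "y": 0, "z": 0},
]
kept = stable_nodes(runs)
```

In this toy case user "c" is discarded because its assignment is not consistent across runs, while the other five users keep their reference label.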
Additionally, we did not use individuals with fewer than 5 retweets, since we might have insufficient data to classify them correctly. Finally, we also manually sampled and checked users from different communities to verify their political identification.

In order to characterize topologically the users of the primary election network, we computed the following metrics: the degree of each user in the network (i.e., the number of users that have retweeted a given one), PageRank [25], betweenness centrality [26], clustering coefficient [27] and cluster affiliation (the community detected by the Louvain method). We used all these metrics as features in the machine learning classification task.

In order to determine the topics of discussion during the primary election, we analyzed the text of the tweets using natural language processing and calculated a low-dimensional embedding for each user. The tweets were described as vectors through the Term Frequency-Inverse Document Frequency (tf-idf) representation [33]. Each value in the vector corresponds to the frequency of a word in the tweet (the term frequency, tf) weighted by a factor which measures its degree of specificity (the inverse document frequency, idf). We used 3-grams and a modified stop-word dictionary that contained not only articles, prepositions, pronouns and some verbs but also the names of the candidates and parties and words like "election". Then, we constructed a matrix M by concatenating the tf-idf vectors, with dimensions given by the number of tweets times the number of terms. We performed topic decomposition using Non-Negative Matrix Factorization (NMF) [32] on the matrix M. NMF is an unsupervised topic model which factorizes the matrix M into two matrices H and W, with the property that all three matrices have no negative elements. We selected the NMF algorithm because this non-negativity makes the resulting matrices easier to inspect and interpret.
The matrix H holds the representation of the tweets in the topic space, in which the columns are the degree of membership of each tweet to a given topic. The matrix W, on the other hand, provides the combination of terms which describes each topic [42]. The results obtained by analyzing just the tweets corresponding to the first time period are detailed in the Appendix. The decomposition dimension was swept between 5 and 30, and for each dataset we chose a number of topics that gave a clear interpretation of each one. The same methodology was used and described in [11, 42]. Once we collected all this information, Twitter users were also characterized by a vector of features where each cell corresponds to one of the topics and its value to the percentage of the user's tweets on that topic.

Given that our objective was to identify shifting individuals and persuasive arguments, we implemented a predictive model whose instances are the Twitter users who were active during both time periods [34] and belonged to one of the biggest communities in the networks of both time periods. Consequently, the number of users used at this stage was reduced. Individuals were characterized by a feature vector with components corresponding to the topological metrics described in Section 3.3 and others corresponding to the percentage of tweets in each one of the topics extracted in Section 3.4. The information used to construct these embeddings was gathered from the whole retweet network of the first time period. The target was a binary vector that takes the value 1 if the user changed communities between the first and the second time periods and 0 otherwise. The summary of the datasets is shown in Table 2. Considering the percentage of positive targets, this is clearly a class imbalance scenario, especially in 2020US, which is reasonable given the bipartisan retweet network with big and opposed communities [51].
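The per-user topic features and the binary target described above can be sketched in plain Python. The sketch assumes that each tweet has already been assigned to a single topic (for instance its strongest topic in H, which is an assumption, since the text does not state the assignment rule) and stores fractions rather than percentages. Function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def topic_shares(tweet_topics, n_topics):
    """Fraction of each user's first-period tweets assigned to each topic.

    `tweet_topics` holds (user, topic_index) pairs, one per tweet.
    """
    counts = defaultdict(Counter)
    for user, topic in tweet_topics:
        counts[user][topic] += 1
    return {
        user: [c[k] / sum(c.values()) for k in range(n_topics)]
        for user, c in counts.items()
    }


def shift_targets(first_period, second_period):
    """1 if a user's community differs between periods, else 0. Only users
    present in both periods are kept."""
    return {
        user: int(first_period[user] != second_period[user])
        for user in first_period
        if user in second_period
    }


tweet_topics = [("ana", 0), ("ana", 0), ("ana", 2), ("ana", 1),
                ("luis", 1), ("luis", 1)]
first = {"ana": "A", "luis": "B", "eva": "A"}    # communities in period 1
second = {"ana": "B", "luis": "B"}               # communities in period 2

features = topic_shares(tweet_topics, n_topics=3)
target = shift_targets(first, second)
```

Here "ana" is a shifting user (target 1), "luis" is not (target 0), and "eva" is dropped because she is absent from the second period.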
The gradient boosting technique uses an ensemble of predictive models to perform supervised classification and regression tasks [35]. These predictive models are optimized iteration by iteration using the gradient of the cost function of the previous iteration. XGBoost, a particular implementation of this technique, has proven to be efficient in a wide variety of supervised scenarios, outperforming previous models [36]. We used a 67/33 random split between train and test. For hyperparameter tuning, we used the randomized search method [37] over the training dataset with 3-fold cross-validation, which consists of trying different random combinations of parameters and keeping the best one.

In order to measure the efficiency and performance of our machine learning model, two other models, namely random and polar, were taken as baselines for comparison. In the former, the selected user changes community with a probability of 50%. In the latter, for a user that belongs to one of the two biggest communities in the network, we predict that he/she will stay in that community, while a user that belongs to a smaller community will change to one of the two main communities with equal probability. This polar model is inspired by the idea that, in a polarized election, members of the smallest communities shift and are attracted to the biggest communities; it was used in the Argentinean datasets.

We trained three different gradient boosting models for each dataset: the first one was trained only with the features obtained via text mining (the share of the user's tweets on each of the selected topics); the second one was trained only with features obtained through complex network analysis (degree, PageRank, betweenness centrality, clustering coefficient and cluster affiliation); and the last one was trained with all the data.
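The training setup described above (67/33 split, randomized search with 3-fold cross-validation, gradient boosting) can be sketched as follows. This is an illustration under explicit assumptions: scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so the block depends on a single library, the data is synthetic and imbalanced, and the parameter ranges and number of search iterations are hypothetical.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the user feature matrix, with roughly 10% positives
# to mimic the class imbalance of shifting users.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# 67/33 random split between train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Randomized search over hypothetical parameter ranges, 3-fold CV on the
# training set only.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 5),
        "learning_rate": uniform(0.01, 0.2),
    },
    n_iter=5, cv=3, scoring="roc_auc", random_state=0)
search.fit(X_train, y_train)

# Area under the ROC curve on the held-out users.
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
```

Restricting the columns of `X` to the text features or to the network features before fitting would reproduce the idea of the three model variants.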
Training these three variants lets us compare the importance of natural language processing and complex network analysis for this task. Figure 2 shows the ROC [38] of the different models for each dataset. The best performance is obtained in all cases by the machine learning model built with all the characteristics of the users, which is able to efficiently predict which users are shifting individuals. This result is expected, since an ensemble of models has sufficient depth and robustness to exploit the network information, the topics of the tweets and the graph characteristics of the users.

We performed a random permutation of the feature values among users in order to understand which of them are the most important for the performance of our model (the so-called Permutation Feature Importance algorithm [39]). In Figure 3, we observe that the most important feature in all cases corresponds to the node's connectivity: PageRank, meaning that shifting individuals are the peripheral and least important nodes of big communities. The result can be verified by comparing the PageRank averages of users who changed their affiliation (2017ARG PR = 8.97e-6, 2019ARG PR = 2.57e-6 and 2020US PR = 2.03e-6) with those who did not (2017ARG PR = 1.99e-5, 2019ARG PR = 8.12e-6 and 2020US PR = 5.41e-6), the latter being at least 56% higher. This is also consistent with the fact that the model trained with network features achieves a better AUC than the model trained with the texts of user tweets in all datasets. Previous works have used text, Twitter profiles and some tweeting behavior characteristics to automatically classify users with machine learning, but none of them have incorporated these graph metrics [16] [17] [18] [19] [40]. Our work shows the importance of also including these graph features in order to identify shifting individuals. This result has a relevant sociological meaning: unpopular individuals are more prone to change their opinion.
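The permutation step can be illustrated with a small, dependency-free sketch. It is a simplified version of the idea rather than the corrected measure of [39]: accuracy is used as the score instead of the area under the ROC curve, and the toy model, feature meanings and data are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Share of users whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled among users.

    `predict` maps a list of feature rows to a list of predicted labels.
    """
    rng = random.Random(seed)
    baseline = accuracy(y_true, predict(rows))
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(y_true, predict(shuffled)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy "model": a user is predicted to shift when feature 0 (a stand-in for a
# PageRank-like score) is low. Feature 1 is ignored by the model.
def predict(rows):
    return [int(row[0] < 0.5) for row in rows]


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = predict(rows)          # labels that the toy model fits exactly
scores = permutation_importance(predict, rows, labels)
```

Shuffling the feature the model relies on degrades its accuracy, while shuffling the ignored feature changes nothing, which is the signal used to rank features.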
Besides the importance of the mentioned topological properties, some discussed topics are also relevant to the classifier model. A simple analysis of the most discussed topics in the network does not differentiate between topics discussed by shifting individuals and those discussed by other users. Considering that most users do not change their affiliation, it is interesting to analyze those that do. The Persuasive Arguments Theory affirms that changes in opinion occur when people exchange strong (or persuasive) arguments [48, 54, 55]. Consequently, we defined a "persuasive topic" as a topic used primarily by shifting individuals and not used by non-shifting individuals. In order to analyze the topic embedding of the 2017ARG dataset in more depth, we first enumerate the main topics in that corpus:

An equivalent analysis can be done with the other two corpora, and the topic decomposition for each can be found in the Appendix. In Figure 3, the most important topics for the classifier are "Venezuela", "Economy" and "Santiago Maldonado". We can contextualize these results by looking at the main topics discussed in each community as well as the ones discussed among the users that change between them, as shown in Figure 4. We can see that "Venezuela" is one of the most discussed topics among the people remaining in the four communities and that "Santiago Maldonado" is a relevant topic in the communities "Unidad Ciudadana" and "1 Pais". When we look at the main topics discussed by users that change communities between elections, we observe that "Venezuela" identifies those that go from "Partido Justicialista (PJ)" to "1 Pais" and "Cambiemos", while "Santiago Maldonado" is a key topic among those who arrive at "Unidad Ciudadana" from "Partido Justicialista (PJ)" and "1 Pais". Considering that these topics are used considerably more by the shifting Twitter users than by the other users, it can be affirmed that these are "persuasive topics".
In contrast, other topics such as "Economy" or "Santa Cruz" were also commonly used by most of the users but not by the shifting individuals.

In this paper we presented a machine learning framework to identify shifting individuals and persuasive topics that, unlike previous works, focuses on the persuadable users rather than studying political polarization on social media as a whole. The framework includes natural language processing techniques and graph machine learning algorithms in order to describe the features of an individual. Three datasets were used for the experimentation: 2017ARG, 2019ARG and 2020US. These datasets were constructed with tweets from 2 countries, in different political contexts (a parliamentary election, a presidential election and a non-election period) and in both a multi-party system and a two-party system. The machine learning framework was applied to these different datasets with similar results, showing that the methodology can be easily generalized.

The implemented predictive models effectively detected whether a user will change his/her political affiliation. We showed that better performance can be achieved when representing the individuals with their community and other graph features rather than with the topic embedding. Therefore, our results indicate that the proposed features do a reasonable job of identifying user characteristics that determine whether a user changes opinion, features that were neglected in previous works on user classification on Twitter [16] [17] [18] [19] [40]. In particular, PageRank was the most relevant feature according to the permutation feature importance analysis in all datasets, showing that popular people have a lower tendency to change their opinion. Finally, the proposed framework also identifies which topics are persuasive and good predictors of individuals changing their political affiliation.
Consequently, this methodology could be useful for a political party to see which issues should be prioritized in its agenda in order to maximize the number of individuals that migrate to its community. Understanding the characteristics and the topics of interest of politically shifting individuals in a polarized environment can provide an enormous benefit for social scientists and political parties. This research provides them with tools to improve their understanding of shifting individuals and their behavior.

The percentages on the arrows are the percentages of users that changed from one community to the other (when the percentage was less than 1%, the corresponding arrow is not drawn). The topics on the arrows show the most important topics among the users that change between those communities.

• The President of Argentina and the governor of the province of Buenos Aires at the time of the elections (i.e., "mauriciomacri", "Macri" and "mariuvidal"). These last two were added, despite not being actively present in the lists, due to their political importance, their relevance and their participation during the campaign.

In addition, the tweets were restricted to be in Spanish.

The electoral context is the following: former president and opposition leader Cristina Fernández de Kirchner (formerly "Unidad Ciudadana") and Sergio Massa (formerly "1Pais") created a new party, "Frente de Todos", with Alberto Fernández as candidate for president. On the other hand, Mauricio Macri (formerly "Cambiemos") ran for reelection as the candidate of "Juntos por el Cambio". The socialist Nicolás del Caño of "Frente de Izquierda-Unidad" and Roberto Lavagna of "Consenso Federal" were also candidates for president, among others.
Considering the previous subsection and the candidates for the Senate, for deputy and for governor, the following terms were chosen as keywords for Twitter: "Elisacarrio", "OfeFernandez", "PatoBullrich", "macri", "macrismo", "mauriciomacri", "pichetto", "MiguelPichetto", "JuntosPorElCambio", "alferdez", "CFKArgentina", "CFK", "kirchner", "kirchnerismo", "FrenteTodos", "FrenteDeTodos", "Lavagna", "RLavagna", "Urtubey", "UrtubeyJM", "ConsensoFederal", "2030ConsensoFederal", "DelCaño", "NicolasdelCano", "DelPla", "RominaDelPla", "FitUnidad", "FdeIzquierda", "Fte Izquierda", "Castañeira", "ManuelaC22", "Mulhall", "NuevoMas", "Espert", "jlespert", "FrenteDespertar", "Centurion", "juanjomalvinas", "Hotton", "CynthiaHotton", "Biondini", "Venturino", "FrentePatriota", "RomeroFeris", "PartidoAutonomistaNacional", "Vidal", "mariuvidal", "Kicillof", "Kicillofok", "Bucca", "BuccaBali", "chipicastillo", "Larreta", "horaciorlarreta", "Lammens", "MatiasLammens", "Tombolini", "matiastombolini", "Solano", "Solanopo", "Lousteau", "GugaLusto", "Recalde", "marianorecalde", "RAMIROMARRA", "Maxiferraro", "fernandosolanas", "MarcoLavagna", "myriambregman", "cristianritondo", "Massa", "SergioMassa", "GracielaCamano", "nestorpitrola". In addition, the tweets were restricted to be in Spanish. Also, the topic embedding obtained with non-negative matrix factorization:

C 2020 tweets of Donald Trump

The following term was used as keyword for the Twitter API: "realDonaldTrump". In addition, the tweets were restricted to be in English.

Lon From Friendster To MySpace To Facebook: The Evolution and Deaths Of Social Networks. longislandpress.
Garcí Emotions in Health Tweets: Analysis of American Government
What the hashtag? A content analysis of Canadian politics on Twitter. Information, Communication & Society.
Linh Political communication and influence through microblogging-An empirical analysis of sentiment in
Analyzing the Digital Traces of Political Manipulation: The 2016 Russian Interference Twitter Campaign.
Politics, sentiments, and misinformation: An analysis of the Twitter discussion on the Mauricio
Interest communities and flow roles in directed networks: the Twitter network of the UK riots. Journal of The Royal Society Interface.
Donald J. Trump and the politics of debasement. Critical Studies in Media Communication.
Using machine learning to detect cyberbullying.
Ahme Sentiment Analysis in Data of Twitter using
Pablo Quantifying time-dependent Media Agenda and public opinion by topic modeling. Physica A: Statistical Mechanics and its Applications.
A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Didrik Tree boosting with xgboost - why does xgboost win "every" machine learning competition?
Random search for hyper-parameter optimization.
Comparing effect sizes in follow-up studies: ROC Area, Cohen's d, and r. Law and Human Behavior.
Permutation importance: a corrected feature importance measure.
Jonny Identifying communicator roles in twitter. Proceedings of the 21st International Conference on World Wide Web.
Use of Machine Learning to Detect Illegal Wildlife Product Promotion and Sales on Twitter. Frontiers in Big Data.
Analyzing mass media influence using natural language processing and time series analysis.
Michael Quantifying Controversy in Social Media. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining.
Sandeepa Mining Twitter for Fine-Grained Political Opinion Polarity Classification, Ideology Detection and Sarcasm Detection. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining.
Consensus clustering in complex networks. Scientific Reports.
Measuring Polarization in Twitter Enabled in Online Political Conversation: The Case of 2016 US Presidential Election.
Judgments and group discussion: Effect of presentation and memory factors on polarization. Sociometry.
Why do humans reason? Arguments for an argumentative theory.
Persuasive arguments theory, group polarization, and choice shifts. Personality and
Ingmar Content and network dynamics behind Egyptian political polarization on Twitter. Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing.
Measuring political polarization: Twitter shows the two sides of Venezuela. Chaos.
Jeffrey Investigating political polarization on Twitter: A Canadian perspective. Policy & Internet.
Testing two classes of theories about group induced shifts in individual choice.

Sergio Massa of "1Pais" (former Chief of the Cabinet of Ministers of Cristina Kirchner, then leader of the opposition against Cristina Kirchner in 2013 when he won his provincial election) and Florencio Randazzo of

Twitter Keywords

Considering the previous subsection, the following terms were chosen as keywords for Twitter:

• Candidates for Senate of the main four parties: their name and official user on Twitter

Topic decomposition

The topic embedding obtained with non-negative matrix factorization:

1. President Donald Trump: The 45th President of the United States.
2. Obamagate: The accusation that Barack Obama is conspiring against Donald Trump.
3. World Health Organization: President Trump announcing the US will pull out of the World Health Organization.
4. Thank you: Individuals thanking President Trump for his actions in regard to the COVID-19 pandemic.
5. Fake news: Individuals discussing and claiming that certain news is fake.
6. President Barack Obama: The 44th President of the United States and his administration.