title: A Machine Learning Pipeline to Examine Political Bias with Congressional Speeches
authors: Hajare, Prasad; Kamal, Sadia; Krishnan, Siddharth; Bagavathi, Arunkumar
date: 2021-09-18

Computational methods to model political bias in social media involve several challenges due to the heterogeneity, high dimensionality, multiple modalities, and scale of the data. Political bias in social media has been studied from multiple viewpoints, such as media bias, political ideology, echo chambers, and controversies, using machine learning pipelines. Most current methods rely heavily on manually labeled ground-truth data for the underlying political bias prediction tasks. Limitations of such methods include human-intensive labeling, labels tied to one specific problem, and the inability to determine the near-future bias state of a social media conversation. In this work, we address these problems and give machine learning approaches to study political bias in two ideologically diverse social media forums, Gab and Twitter, without the availability of human-annotated data. Our proposed methods exploit transcripts collected from political speeches in the US Congress to label the data, and achieve the highest accuracies of 70.5% and 65.1% on Twitter and Gab data, respectively, in predicting political bias. We also present a machine learning approach that combines features from cascades and text to forecast a cascade's political bias with an accuracy of about 85%.

Social media has become an integral part of how hundreds of millions of people perceive information and share it with others across the world in a short time. Although it offers an optimistic view of connecting people around the globe, social media forums affect the core beliefs, values, and attitudes of their consumers through the type and quantity of information they provide. Political bias, opinions, hate, and misinformation have become more prevalent on social media forums in recent years due to massive user participation and limited fact-checking of the information shared with the general population. The prevalence of confirmation bias strongly affects user opinion on social media, which in turn makes online user communities fall along a wide spectrum of political bias ranging from far-left to far-right [1]. Political bias has been widely evidenced in recent events like COVID vaccine hesitancy [2] and gun control. Studying political bias in social media can have direct implications for several research areas, such as hate speech detection, misinformation, and echo chamber modeling, as all such problems involve a large volume of biased conversations on social media.

Machine learning, with advancements in natural language processing and deep learning, has been actively used in studying political bias on social media. But the key challenge in modeling political bias is the human effort required to label the seed social media posts that train machine learning models. Although very effective, this approach suffers from a time-consuming data labeling process and the significantly higher cost of labeling enough data for machine learning models. The web offers invaluable data on political bias, ranging from biased news media outlets publishing articles on sociopolitical issues to biased user discussions on several topics in multiple social forums.
In this work, we introduce a novel approach to label the political bias of social media posts directly from US congressional speeches, without any human intervention, for downstream machine learning models. Moreover, existing works model political bias as a prediction problem: predicting the political leaning of given users, news articles, or social media posts. However, political bias in social media can shift as user opinions on topics change over time. Forecasting the political bias of given topics or conversations can have advantages in foreseeing user opinion on social media. In this work, we analyze a diverse set of features collected from social media text, users, sentiment, and cascades for traditional machine learning models to examine political bias in both prediction and forecasting settings.

We use posts from two popular social media forums, Twitter and Gab, each of which has its own political ideology regarding free speech. Gab is a relatively new social media forum that emphasizes free speech and often aligns with far-right groups' ideology. Gab has been a platform of interest for computational social science research communities studying hate speech, far-right echo chambers, and racism. Twitter, on the other hand, restricts tweets, news articles, and user accounts that glorify hate, to limit the dissemination of hate speech and misinformation on social media. We use publicly available Gab [3] and Twitter [4] datasets, aiming to design the best feature space for both tasks. In this work, we study multiple aspects of political ideology and bias in posts collected from these two ideologically diverse social media forums. We exclusively utilize openly available text data, such as real speeches and debates collected from politicians, as surrogates to study political bias. In particular, our contributions are three-fold:

1) We present a method that gives political bias score labels for social media posts by adapting representations of entities learned from congressional speeches.
2) We give a machine learning approach to predict the political bias of posts using text embeddings. We extend our study and show that a model learned from one social media forum (Twitter) can be transferred to another ideologically distinct forum (Gab).
3) We provide a list of features, including engineered features like linguistic, cascade, and user features, along with contextual text embeddings, for machine learning models to forecast the political bias of a conversation.

II. RELATED WORK

Social media produces large amounts of data that can be exploited to extract valuable information about human interactions and opinions. There is rising concern in recent years that social media forums can be responsible for causing political bias in people, with the potential to affect presidential elections [5] and news consumption [6]. Political bias detection from text using machine learning approaches has been a growing interest among researchers. Most early approaches detect bias using a traditional "bag-of-words" classifier focused on word lexicons [7]. The major obstacle for these approaches is that they rely heavily on primary-level lexical information and neglect semantic structure. Several studies have tried to detect bias in a multitude of data formats, including words [8], sentences [9], articles [10], and news media [11].
Neural network based approaches have been widely used for political bias detection and ideology detection problems in recent years. Recurrent Neural Networks (RNNs) have been used to accumulate the political leaning of each word to determine sentence-level bias [12]. RNNs have also been used to determine political bias at multiple levels, such as the word, paragraph, and discourse levels [13]. It is well established that news articles can have a left-leaning or right-leaning political bias. The work in [10] provided a method to generate an article on the same topic but with flipped political leaning using NLP methods. An attention-based multi-view model focuses on sentences to identify the corresponding political ideology [14]. A few other works study similar methods at the sentence level by extracting relations from text [15] and on news article headlines [16], along with several baseline models like SVMs and CNNs. Beyond the traditional bag-of-words based models, a few works incorporated graphs to represent user opinions and ideologies for studying political bias. A recent work introduced an opinion-aware knowledge graph which makes inferences based on circumstantial information from text and knowledge bases [17], using prior background knowledge encoded in the graph. The TIMME framework [18] handles the heterogeneity of networks formed from social media sites. GCN-based [19] approaches have also been introduced to encode social and textual information from news articles and social networks to capture political perspective [20].

We utilize publicly available datasets from both social media and politicians' speeches to study political bias.

Congressional speeches: Politics-based data that are openly available on the web are good resources for machine learning models to learn bias markers with respect to a given context like a topic, an event, or a person. Some useful resources include news articles, previously labeled politics-based social media posts, and actual political talks in the US Congress and presidential debates. The political bias of news domains [21] can distort the actual political leaning of a given topic mentioned in news articles. This may further introduce fake news, which is disadvantageous for ML algorithms mapping political bias. Similarly, only a very limited set of labeled social media posts is easily accessible in recent years; such posts are accurate for model training but difficult for ML models to generalize from for the given problem in the given timeline. Political speeches, collected from politicians' talks and debates, on the other hand, give a political party's actual viewpoints on topics, which can help in understanding the context with respect to all political parties. In this work, we utilize transcripts of congressional speeches [22], which consist of both Republican and Democratic speeches in Congress. From a large corpus of congressional speeches, we used only the 478 Democrat speeches and 389 Republican speeches that align with topics discussed in the social media posts.

Fig. 3: The number of news articles shared in the Gab and Twitter data with respect to their media bias shows that Twitter leans left and Gab supports right politics. Note that our Twitter data corpus is significantly smaller than that of Gab.

Social media posts: We use publicly available Twitter [4] and Gab [3] datasets in this work. The Twitter data [4] comprises only posts that share news articles discussing political topics from selected news media sources.
The dataset spans January 2018 to September 2018, with a total of 722,685 tweets. The dataset comprises news articles collected from news domains spanning a wide spectrum of political leanings, and public tweets that mention those news articles. Our Gab dataset [3] comprises 40 million posts, including all replies, re-posts, and quotes with URLs and hashtags, submitted between 2016 and 2018. To make a fair analysis, we only use the sampled Gab posts submitted between January 2018 and September 2018, the same time range as the Twitter data. It is evident from Figure 2 that we have a much larger number of posts available in the Gab corpus. This is primarily because the Twitter data was collected targeting the availability of politics-based news articles, while the Gab data is not focused on any such domains. Gab is widely described as a social media platform that supports far-right ideologies. To verify this, we check the political bias, as given by Ad Fontes Media [21], of the news media outlets shared in the posts in our data corpus and their corresponding frequency. Our summary is reported in Figure 3, which clearly shows that Gab posts mainly share news articles from far-right and right-leaning media outlets, while Twitter posts share more news articles from left-leaning media outlets.

We present machine learning (ML)-based approaches for two studies performed on the Twitter and Gab datasets described above. First, we propose an approach to label the political bias score (γ) of social media posts, and then we give approaches, modeled as prediction and forecasting tasks, to study political bias using the labeled data.

Machine learning models that predict the political ideology or political bias of social media posts require a rich set of training data. Current methods rely on human annotations to obtain training data, which is time-consuming and specific to a defined problem. We present a method to label social media posts with political bias scores in the range [−1, +1], where a bias score of −1 represents far-left and a bias score of +1 represents far-right. Our proposed method for labeling social media posts has two characteristics: i) it depends on entities in general rather than on a specific problem, and ii) it relies on the political parties' (both Republican and Democrat) perspectives to get the context of entities. Our proposed method is more generic than existing studies in obtaining political bias, as we use entities like hashtags, persons, events, and places to label the data. We extract entities from both social media posts and congressional speeches, after basic text pre-processing steps like case-folding, stop-word removal, and punctuation removal, with the Stanford Named Entity Recognizer (NER). In this work, we consider only common nouns and proper nouns as entities. We choose our congressional speeches data carefully to match entities available in either Gab or Twitter posts, so that entities extracted from social media posts have context in one or both of the political parties. We consider entities from social media only if their occurrence frequency is at least 100 in at least one of the social media forums. Such entities are depicted in Figure 1, which illustrates the presence of multiple topics in both social media forums. We propose a method to label social media posts based on how the extracted entities are perceived by Democrats and Republicans.
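As a minimal sketch of the entity extraction step just described, the snippet below case-folds, drops stop words and punctuation, keeps only common and proper nouns, and enforces the frequency threshold of 100. The paper uses the Stanford NER; we substitute spaCy as a stand-in tagger, and the function names and threshold plumbing are our own illustrative choices, not the authors' code.

```python
# Sketch of the entity extraction step, with spaCy substituted for
# the Stanford NER used in the paper. Names are illustrative.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[str]:
    """Case-fold, drop stop words and punctuation, and keep only
    common nouns (NOUN) and proper nouns (PROPN) as entities."""
    doc = nlp(text.lower())
    return [
        tok.text
        for tok in doc
        if tok.pos_ in ("NOUN", "PROPN") and not tok.is_stop and not tok.is_punct
    ]

def frequent_entities(posts: list[str], min_freq: int = 100) -> set[str]:
    """Keep an entity only if it occurs at least `min_freq` times in a
    forum's corpus, mirroring the paper's occurrence threshold."""
    counts = Counter(e for post in posts for e in extract_entities(post))
    return {entity for entity, freq in counts.items() if freq >= min_freq}
```

In practice the threshold would be applied per forum, with an entity retained if it clears 100 occurrences on at least one of Gab or Twitter.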
In this work, we utilize Term Frequency-Inverse Document Frequency (TF-IDF) to measure the importance of entities from both the Republican and the Democrat perspective. We consider that the political bias of a topic or entity varies according to the ideologies of Republicans and Democrats. For a given social media post S, from Twitter or Gab, with a set of n entities E(S) = {e_1, e_2, ..., e_n}, n ≥ 1, we propose to identify the political bias γ(S) using Equation 1:

γ(S) = (1/n) · Σ_{i=1}^{n} [TF_r(e_i) − TF_d(e_i)] / [TF_r(e_i) + TF_d(e_i)]    (1)

where TF_d(e_i) and TF_r(e_i) are the TF-IDF of the entity e_i in the Democrat and the Republican context, respectively. The above equation gives a political bias γ(S) ≤ 0 if the entities E(S) are supported more by Democrats, and γ(S) > 0 if Republicans support E(S). (A short code sketch of this labeling step is given at the end of this section.)

The labeled data obtained using the method above (Section IV-A) provide multiple opportunities to train ML algorithms. Since our labeling is not tied to specific features in our datasets, we need to extract features from our text data for the ML algorithms. In this work, we utilize only a very common set of stylistic features that are present in all types of text data, along with contextual text representations obtained using the FastText model [23]. FastText is an extension of the popular word representation framework Word2vec; it extracts contextual text representations by aggregating representations of character n-grams. FastText constructs better word representations, even for words that do not appear in the training corpus, and boosts performance compared to its predecessor Word2vec. Our engineered features and their descriptions are given in Table I.

The political bias of user communities in online social media tends to change as users participate in conversations and respond through multiple activities like replying/commenting, liking, and re-sharing a post. Such shifts in the political bias of a topic or a conversation depend on the bias of the users participating in the conversation, the current sentiment of the conversation, and the context of the topics under discussion. Forecasting such shifts in political bias can have a huge impact on understanding user opinion in social media, relating news media bias to user opinion, and identifying misinformation or fake news. In this work, we propose political bias shift forecasting by utilizing information cascades and machine learning. Information cascades, or cascades, act as dynamical processes in social networks that capture the complete evolution of a topic over time along with its virality and user interactions [24]. In this work, we consider each conversation thread in the given social media forum as a cascade. A cascade originates with a single user's post, and then the user's social network slowly starts a discussion about the post (in terms of replies and re-posts) along with their opinions. Thus, a cascade C can be represented as a graph C = {V_t, M, z, w}, where V_t is a post/reply/re-post that appears at time step t ∈ {0, 1, ..., T}, M = {m_1, m_2, ...} is a set of edges with m_i : V_{t_q} → V_{t_p} where t_q > t_p, z is a node property representing the sentiment score of a post v ∈ V_t, and w is the edge weight of m_i ∈ M representing the stance of post V_{t_q} toward post V_{t_p}. Given a partial cascade C_{jr} and its corresponding bias score θ with only up to d ≪ T time steps, we pose political bias shift as a binary classification problem g : θ(C_{js}) → {0, 1}, where s = d + 1.
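Here is the promised sketch of the labeling step, implementing the normalized difference of Equation 1. The mean-TF-IDF aggregation per party, the helper names, and the skipping of entities unseen in both contexts are our assumptions, not details fixed by the paper.

```python
# Sketch of the bias labeling of Equation 1, assuming mean TF-IDF
# per party as the entity weight. Names and corpora are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def entity_weights(speeches: list[str], entities: list[str]) -> dict[str, float]:
    """Mean TF-IDF weight of each (unique) entity across one party's speeches."""
    vec = TfidfVectorizer(vocabulary=entities)
    tfidf = vec.fit_transform(speeches)          # shape: (n_speeches, n_entities)
    means = np.asarray(tfidf.mean(axis=0)).ravel()
    return dict(zip(vec.get_feature_names_out(), means))

def bias_score(post_entities: list[str],
               tf_d: dict[str, float],
               tf_r: dict[str, float]) -> float:
    """gamma(S) in [-1, +1]: negative leans Democrat, positive Republican.
    Entities unseen in both party contexts are skipped (our choice)."""
    terms = []
    for e in post_entities:
        d, r = tf_d.get(e, 0.0), tf_r.get(e, 0.0)
        if d + r > 0:
            terms.append((r - d) / (r + d))
    return float(np.mean(terms)) if terms else 0.0

# Usage: tf_d = entity_weights(democrat_speeches, shared_entities)
#        tf_r = entity_weights(republican_speeches, shared_entities)
#        gamma = bias_score(extract_entities(post), tf_d, tf_r)
```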
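To make the cascade formalism concrete, the following sketch stores C = {V_t, M, z, w} in a networkx digraph and derives a binary shift label at level s = d + 1. The paper does not specify what counts as a "shift"; the threshold rule below (mean node bias at the new level departing from the partial cascade's mean by more than eps) is purely our illustrative assumption, as is the input tuple layout.

```python
# Sketch of a cascade C = {V_t, M, z, w} and a bias-shift label,
# assuming networkx as the container and an assumed threshold rule
# for the shift at level s = d + 1 (the paper leaves this unspecified).
import networkx as nx

def build_cascade(posts):
    """posts: iterable of (post_id, parent_id, level, sentiment, stance, bias).
    Nodes carry sentiment z and a bias score; edges carry stance weight w."""
    g = nx.DiGraph()
    for pid, parent, level, senti, stance, bias in posts:
        g.add_node(pid, level=level, z=senti, bias=bias)
        if parent is not None:
            g.add_edge(pid, parent, w=stance)   # reply points to its parent
    return g

def shift_label(g: nx.DiGraph, d: int, eps: float = 0.1) -> int:
    """1 if the mean bias at level d+1 departs from the partial cascade's
    mean bias (levels 0..d) by more than eps, else 0."""
    upto = [a["bias"] for _, a in g.nodes(data=True) if a["level"] <= d]
    nxt = [a["bias"] for _, a in g.nodes(data=True) if a["level"] == d + 1]
    if not upto or not nxt:
        return 0
    return int(abs(sum(nxt) / len(nxt) - sum(upto) / len(upto)) > eps)
```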
We propose to predict the presence/absence of a bias shift at the s-th time step in the cascade. We perform this study only on the Gab dataset, as we do not possess complete conversations in the Twitter data. We engineer features in multiple categories from the extracted cascades, as given in Table II, for our proposed forecasting task. Along with the engineered features, we also use text representations of the posts present in C_{jr}.

We examine and validate our proposed methods in three ways:
1) Validation: we validate our data labeling by comparing the overall labeled political bias of the social media forums with similarity measures over contextual text representations.
2) Prediction analysis: we give multiple quantitative evaluations of political bias prediction using traditional ML models with the features described in Section IV-B.
3) Forecasting analysis: with cascades from the Gab dataset, we give a quantitative evaluation of ML models that forecast political bias shifts in conversations with the features described in Section IV-C.

The entity extraction step in our data labeling method identified 131,345 and 78,546 unique political entities in the Twitter and Gab data, respectively. We observed that more than 20,154 entities are common to both platforms. The word clouds of frequent entities for both platforms are given in Figure 1. We validate the political bias score labels produced by the method of Section IV-A with quantitative reasoning on both Twitter and Gab. In particular, we use cosine similarity to compare contextual text representations of social media posts with contextualized representations of the Democrat and Republican speech transcripts. We obtain the contextual representations with the FastText model; the resulting similarity scores are reported in Figure 4. Our results align with the simple data summary given in Figure 3, which compares the news articles shared in each social media forum: we can clearly see that, overall, Twitter posts are left-leaning and Gab posts are right-leaning. The higher similarity measures also approximately correlate with the mean bias scores of posts given in Table III, which are calculated entirely with our proposed labeling method. The similarity scores of Gab are much lower than those of Twitter. This is due to noisy Gab posts in our corpus: the Twitter data is focused on politics-based posts, whereas the Gab posts are a combination of general and political posts.

With the labeled social media data available for ML models, we perform two analyses with the engineered and contextual features mentioned in Section IV-B. Both analyses are motivated by transfer learning, in which models train on data collected from one domain and are tested on data from another domain, without seeing the test data during training. We use four traditional ML models, Random Forests, Multi-Layer Perceptron, Decision Trees, and Linear Regression, for both analyses, and we evaluate all models in terms of Accuracy, Precision, Recall, F-Score, and AUROC. In the first analysis, we use the congressional transcripts as the training data and the posts from each social media forum as the test set; a minimal sketch of this transfer setup is given below. The results are summarized in Table IV. The features that we extract help ML algorithms predict the political bias of posts from both social media forums without even a fraction of the social media posts available for training. In particular, model performance on the Twitter data is comparatively better than on Gab, as the Twitter data is more focused on politics, unlike the Gab data.
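The snippet below sketches this first transfer analysis under stated assumptions: plain TF-IDF features and a LogisticRegression stand-in for the binary left/right setup, with toy corpora in place of the real transcripts and gamma-labeled posts. The paper's own models (Random Forests, MLP, Decision Trees, Linear Regression) use the richer engineered and FastText features of Section IV-B.

```python
# Sketch of the transfer analysis: train on congressional transcripts,
# test on social media posts. TF-IDF features and LogisticRegression
# are our assumptions; corpora below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins: real inputs are speech transcripts (0 = Democrat,
# 1 = Republican) and posts labeled by the sign of gamma(S).
train_texts = ["healthcare for all families", "cut taxes and regulation"]
train_labels = [0, 1]
test_texts = ["expand public healthcare", "lower taxes now"]
test_labels = [0, 1]   # 0 = left-leaning, 1 = right-leaning

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)       # same vocabulary as training

clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(test_labels, pred))
print("f1:", f1_score(test_labels, pred))
```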
Overall, the performance of Linear Regression is higher (by a minimum of 7%) than that of MLP and Decision Trees in terms of the Recall, F-Score, and AUROC measures on Twitter, while Random Forests give performance competitive with Linear Regression. Model performances on Gab are close for Linear Regression, MLP, and Random Forests, with Linear Regression performing within about 1% of the MLP. In the second analysis, we completely ignore the transcripts and compare the performance of ML models trained on one social media forum without any knowledge of the other dataset. The results are summarized in Tables V and VI. Based on the given results, we can see that models trained on Gab posts transfer well to the Twitter data, with Linear Regression achieving at least 60% in all measures. In contrast, models trained on Twitter could not give comparable performance when tested on Gab posts, with all ML models achieving around 55% in all measures. This again validates that models trained on generic data like Gab can effectively transfer to more focused datasets like Twitter, but not the other way around.

As mentioned earlier, we conduct the forecasting study only on the Gab dataset due to the availability of cascades in the Gab data. In total, we collected 3.6 million cascades, of which we considered only the 69,746 cascades that have a minimum of 5 levels. Our constraint of 5 levels on cascades is arbitrary in this work. As mentioned in Section IV-C, we used both engineered features and text features of cascade posts from the FastText model. In each cascade, we used nodes up to an arbitrary number of levels l > 5 for training, and we used ML models to predict whether there is a bias shift at the (l+1)-th level of the cascade. We present our forecasting results with engineered, auto-extracted, and combined features in Figure 5. We used four ML models in this study: Random Forests, AdaBoost, MLP, and Quadratic Discriminant Analysis (QDA).

In this paper, we provided two approaches to study political bias in online social media forums. We presented a methodology to obtain the political bias of social media posts using congressional speeches, which contain politicians' perspectives on the entities present in the posts. With publicly available large-scale social media posts from Gab and Twitter, we provided a list of features for ML models to predict political bias scores. We also engineered multiple categories of features from information cascades in Gab conversations for the bias forecasting task. Our multi-dimensional quantitative evaluation showed that our extracted set of features can substantially help the prediction and forecasting tasks. Our proposed data labeling procedure is generic enough that it can be applied from multiple perspectives to study problems like media bias and fake news in the future. Even though the given method is effective, more thorough case studies with valid ground-truth data can help to understand and update the approach toward more robust political bias modeling in social media. The set of entities considered in this work can also be extended to include a cleaner and richer set of entities to overcome noise in the data. Another future direction is to explore graph and text representation learning methodologies for features to use in the prediction and forecasting models.
REFERENCES

[1] Polarization and fake news: early warning of potential misinformation targets.
[2] Falling into the echo chamber: the Italian vaccination debate on Twitter.
[3] Shouting into the void: a database of the alternative social media platform Gab.
[4] News sharing user behaviour on Twitter: a comprehensive data collection of news articles and social interactions.
[5] Learning political polarization on social media using neural networks.
[6] Political polarization in online news consumption.
[7] Predicting legislative roll calls from text.
[8] Automated identification of bias inducing words in news articles using linguistic and context-oriented features.
[9] Detecting biased statements in Wikipedia.
[10] Learning to flip the bias of news headlines.
[11] Predicting factuality of reporting and bias of news media sources.
[12] Political ideology detection using recursive neural networks.
[13] Analyzing political bias and unfairness in news articles at different levels of granularity.
[14] Multi-view models for political ideology detection of news articles.
[15] Distant supervision for relation extraction with sentence-level attention and entity descriptions.
[16] Detecting political bias in news articles using headline attention.
[17] Opinion-aware knowledge graph for political ideology detection.
[18] TIMME: Twitter ideology-detection via multi-task multi-relational embedding.
[19] Semi-supervised classification with graph convolutional networks.
[20] Encoding social information with graph convolutional networks for political perspective detection in news media.
[21] Media Bias Chart v6.0.
[22] OnTheIssues.
[23] Bag of tricks for efficient text classification.
[24] Quantifying structural patterns of information cascades.