key: cord-0590079-p4a2zj1h
authors: Ledwaba, Mashadi; Marivate, Vukosi
title: Semi-supervised learning approaches for predicting South African political sentiment for local government elections
date: 2022-05-04
journal: nan
DOI: nan
sha: 760004955410fd626dc7d99a682bd2646e965cd0
doc_id: 590079
cord_uid: p4a2zj1h

This study aims to understand the South African political context by analysing the sentiments shared on Twitter during the local government elections. The analysis focuses on the discussions around four predominant political parties: ANC, DA, EFF and ActionSA. A semi-supervised approach, using a graph-based technique to label the vast amount of accessible Twitter data, was used to classify tweets into negative and positive sentiment. The tweets expressing negative sentiment were further analysed through latent topic extraction to uncover hidden topics of concern associated with each of the political parties. Our findings demonstrate that the general sentiment across South African Twitter users is negative towards all four predominant parties, with the most negative sentiment directed at the current ruling party, the ANC, relating to concerns centred around corruption, incompetence and loadshedding.

Sentimental language (because it evokes emotion) can influence people's perceptions, behaviour and voting decisions towards certain political parties. Analysing it can help us predict voting intentions and, ultimately, the parties that stand a good chance of winning the elections. The research questions this study aims to answer regarding South African political sentiment for the elections are as follows:

• What is the prevalent sentiment towards the four main political parties for the 2021 local government elections?

• What are the major topics of concern in the negative sentiment tweets towards the political parties?

• Does sentiment towards a political party drive voting intentions and thus voting results?

To achieve the objectives set out above, Natural Language Processing (NLP) was used to extract the public opinions of South Africans on Twitter towards the political parties to detect negative and positive sentiment. The analysis presents the application of Semi-Supervised Learning (SSL) using graph-based methods to overcome the challenge of using social media data to achieve NLP tasks, where a large amount of the data is unlabelled. This paper presents the analysis in the following structure: a review of related work and methods used in the study, methodology followed, results of the sentiment analysis and topic modelling and lastly, conclusion and limitations of the study.

Social media platforms such as Twitter, Facebook and YouTube have paved the way for government to interact and develop relationships with citizens. Growth towards e-governance and social networks has been an important driver for public participation and for transparent and collaborative governance [12]. The adoption of social media in governance has been widely studied for its effectiveness in the dissemination and communication of information as opposed to traditional government websites [29], for providing information on the opinions of citizens towards government decisions and policies [24], for its influence on citizens' perception and behaviour [11] and for its effectiveness during crisis management [8].

The focus of this study is on the use of social media to analyse the sentiment of South African citizens in the context of local elections.

Sentiment analysis is an NLP task of extracting subjective information or opinions expressed in text data, commonly negative and positive sentiment [23]. It is a growing field that has been applied in various industries. Sanders et al. [22] used sentiment analysis within the health sector to uncover the public's attitudes towards the wearing of masks during COVID-19. Mishev et al. [13] applied it for text feature extraction to acquire financial signals driven by sentiment, and Bermingham et al. [2] present work on monitoring political sentiment to predict the results of the Irish General Elections.

There has been growing use of various social media platforms to understand and predict behaviour within the political context [4, 6, 9]. Franch [6] demonstrated that using tweets to infer political sentiment gives a better idea of the political landscape of a country than traditional polls, which often suffer from sample and method bias and are more expensive to conduct than drawing on the freely accessible 'wisdom of the crowd' offered by social media platforms.

Oyebode et al. [18] presented a comparative study between three lexicon-based classifiers and five machine learning classifiers to determine sentiment in political posts. These were associated with two of the major political parties contending for the Nigerian presidential elections: the All Progressives Congress and the People's Democratic Party. For the study, 22 497 posts relating to the presidential candidates of the two political parties were extracted from their indigenous social media platform (Nairaland). From this set, 1041 posts were randomly chosen and annotated as positive, negative, and neutral to ensure that the annotated set had a balanced representation. VADER and Textblob lexicon-based models were used for the sentiment classification. The team also addressed the gaps in the coverage of lexicon-based methods by adding 8 748 more features to the VADER lexicon. The obtained results demonstrated that the extended VADER lexicon outperformed the other two (VADER and Textblob) as well as the machine learning approach.

Manuscript submitted to ACM

There are only a few cases of published work centred around South African political sentiment using machine learning approaches. Kotzé et al. [10] presented work on using social media data to extract sentiment and perceptions towards one of South Africa's Afrikaner minority communities, Orania. Twitter Archiver was used to collect over 10 000 tweets relating to the community. Due to the lack of labelled training data, a lexicon-based approach was followed for the sentiment analysis. Different sentiment analyzers that employ lexicon dictionaries for sentiment classification were used, with the lexicons containing words associated with negative, positive, and neutral sentiment. The lexicon dictionaries that were formulated from actual tweets (such as the NRC word-emotion lexicon and AFINN) outperformed the ones that were extracted from other domains, such as the Opinion Lexicon, which is based on retail reviews.

The gap identified in the work presented is that, although a lexicon-based approach can work well in identifying the polarity of a tweet, these publicly available lexicons are not trained on the context of the work studied. Political and social language differs between social structures and countries; therefore, it would be valuable to train a model on domain-specific text data for the problem. This is typically where semi-supervised learning is best suited. Given the scarcity of sentiment detection tools pre-trained on political activities in South Africa, this study employs semi-supervised methods to learn the political context relating to the recent local government elections and to predict the sentiment expressed by South Africans.

Approach: Semi-Supervised Learning. There are different methods to automatically detect subjectivity in text data such as pre-trained lexicon-based methods discussed in related work above, machine learning approaches or a hybrid of the two. Even though lexicon-based methods are quick to implement and have been pre-trained on large sets of data, the downside in their use is that they do not cover all domain-specific words. Given that the context of the South African political language is unique and has its own nuances, it would be more suitable to self-train the model on the elections data.

Most of the online pool of data comes unlabelled, making supervised learning a challenge as it would require labels that represent sentiment. Considering the cost implications of employing human expert annotators and the time constraints for this study, a machine-labelling approach was explored, i.e., semi-supervised learning. Semi-supervised learning fits in between supervised and unsupervised learning. It is typically applied when a large amount of unlabelled data is more easily accessible than labelled data. The method makes use of the small set of labelled data to derive labels for the large amount of unlabelled data. Zhu et al. [30] describe semi-supervised classification as having a labelled dataset $L$ and an unlabelled dataset $U$ such that a classifier trained on both obtains better performance than one trained on $L$ alone. This approach is particularly suited to this study, where there are few pre-trained political sentiment analyzers. The semi-supervised method explored to predict tweet sentiment is a graph-based method known as label propagation.

Label Propagation (LP) is a transductive learning approach that makes use of known labels in the training process to propagate labels to unlabelled data. LP uses graphs and a small set of labelled nodes in an n-dimensional space, connected by edges, to find labels for all the nodes in the space given the similarity between the nodes. The algorithm computes a probabilistic transition matrix T which gives the probability of a label jumping from node x to node y, i.e., $T_{xy} = P(x \rightarrow y)$. If the probability is high, then x and y are similar, and the algorithm iteratively propagates labels to unlabelled examples by spreading label information through the graph until it achieves global convergence, as proposed by Zhu et al. [30].
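To make the propagation step concrete, the following is a minimal pure-Python sketch of label propagation on a toy two-dimensional dataset. It is not the implementation used in this study: the Gaussian kernel on Euclidean distance, the `sigma` bandwidth, the column normalisation of the transition matrix and the toy points are all assumptions chosen for illustration.

```python
import math

def label_propagation(points, labels, sigma=1.0, max_iter=1000, tol=1e-6):
    """Propagate labels over a fully connected graph.

    points: list of feature vectors.
    labels: class label per point, or None where the point is unlabelled.
    """
    n = len(points)
    classes = sorted({c for c in labels if c is not None})
    k = len(classes)

    # Edge weights: Gaussian kernel on squared Euclidean distance (assumed similarity).
    def weight(a, b):
        return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / sigma ** 2)

    w = [[weight(points[i], points[j]) for j in range(n)] for i in range(n)]

    # Column-normalised transition matrix: t[i][j] is the probability of a
    # label jumping from node j to node i (assumed convention).
    col_sums = [sum(w[i][j] for i in range(n)) for j in range(n)]
    t = [[w[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

    def one_hot(label):
        return [1.0 if label == c else 0.0 for c in classes]

    # Labelled nodes start one-hot; unlabelled nodes start uniform.
    y = [one_hot(lab) if lab is not None else [1.0 / k] * k for lab in labels]

    for _ in range(max_iter):
        new_y = []
        for i in range(n):
            if labels[i] is not None:
                new_y.append(one_hot(labels[i]))  # clamp the known labels
                continue
            row = [sum(t[i][j] * y[j][c] for j in range(n)) for c in range(k)]
            total = sum(row)
            new_y.append([v / total for v in row])  # row-normalise
        delta = max(abs(a - b) for r_new, r_old in zip(new_y, y)
                    for a, b in zip(r_new, r_old))
        y = new_y
        if delta < tol:  # stop once the label distributions no longer change
            break

    return [classes[max(range(k), key=lambda c: y[i][c])] for i in range(n)]

# Toy example: two well-separated clusters, one labelled point in each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
labels = ["negative", None, None, "positive", None, None]
predicted = label_propagation(points, labels)
```

In practice the points would be tweet vectors rather than two-dimensional coordinates, and a library implementation would typically replace the explicit loops.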

LP has proven to be a robust model in many cases, such as Tai et al. [26], where a sentiment lexicon for sentiment analysis was automatically constructed using LP on unlabelled Twitter data within the financial domain. Experimental results from the study showed that the automatically constructed sentiment lexicon outperformed general-purpose sentiment dictionaries. Other studies [20, 21, 28] also report success in the use of LP for polarity analysis in different domains.

This section outlines the experimental set-up aligned to the proposed method to answer the research questions posed.

These steps include the data collection process and how it was annotated, data pre-processing and processing steps, sentiment analysis and topic modelling.

Twitter data was collected before the elections, from the beginning of September to the end of October 2021. Twitter was chosen in this study due to the ease of extracting tweets using open-source tools, as opposed to Facebook, which has more scraping restrictions to protect user data from being leaked. A majority of the leaders of the four political parties are more engaged on Twitter, possibly due to their high follower count shown in [1], which gives them more reach to citizens. To collect a wider sample of data and counter the limitations of each tool, two scraping tools were used: Twint and Twarc.

Twitter data relevant to the four political parties was extracted using various hashtags such as #MyANC and #EFFSouthAfrica as well as party leader names (e.g., JSteenhuisen and HermanMashaba). It is important to note that even though ActionSA is not the 4th biggest political party after the ANC, DA and EFF, it was added to the study mainly due to the wide political engagement, social media hype and trends relating to the party that were present on Twitter preceding the elections, coupled with its president being a former member of the DA. Additional data for six other political parties was also collected, along with more general tweets relating to the local government elections (using e.g., #LGE2021). This additional data was used to enrich our elections corpus for representation; however, sentiment was predicted only for tweets associated with the four main political parties. The collected tweets were filtered to remove non-English tweets and duplicates. From the corpus specific to the four political parties, 1669 tweets were randomly sampled for manual annotation of positive and negative sentiment using, as a guide, an online sentiment lexicon dictionary by Mohammad [15] comprising 54,129 uni-grams of negative and positive sentiment-associated words. Mohammad [14] provides semantic-based questionnaires and techniques that were also used to guide the annotation process. These annotations were validated by a political commentator and editor who is not affiliated with any of the four political parties but is well-versed in political language and semantics.

Four datasets were set aside and given identifiers (A-D) for ease of explanation:

1. Dataset A: The complete dataset (labelled and unlabelled) with all the political parties' election-related data, used for data representation purposes.

2. Dataset B: The subset of Dataset A related only to the four main political parties, used for sentiment predictions. Tagging the correct tweet to each political party was very important in this case. To do this, a keyword search and tweet mentions were used to ensure that tweets tagged to one political party did not contain any information about another party or mixed sentiment about two or more political parties. This method slightly reduces the dataset for the individual parties, as was expected given contending parties and conflicting interests, where users usually express their opinions towards more than one party in the same tweet. An example tweet of mixed sentiment and different subjects from Dataset B: "Looks like DA is set to lead SA after the anc. Meanwhile @Julius_S_Malema is happy to remain a loud propagandist."

3. Dataset C: Randomly sampled from Dataset B and manually annotated to train the semi-supervised model.

4. Dataset D: Randomly sampled from Dataset B, manually annotated, and set aside as the hold-out test set used for model evaluation and acceptance purposes.

Table 1 shows a detailed summary of the categorization of datasets.
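The single-party tagging used to build Dataset B can be sketched as a keyword search over the tokens and mentions of a tweet. The keyword sets below are hypothetical and far smaller than any list a real study would need; only the hashtags and handles quoted in this paper are taken from the text.

```python
import re

# Hypothetical keyword lists; a real study would use a much fuller set.
PARTY_KEYWORDS = {
    "ANC": {"anc", "myanc", "ramaphosa"},
    "DA": {"da", "jsteenhuisen", "steenhuisen"},
    "EFF": {"eff", "effsouthafrica", "julius_s_malema", "malema"},
    "ActionSA": {"actionsa", "hermanmashaba", "mashaba"},
}

def parties_mentioned(tweet):
    """Return the set of parties whose keywords appear as whole tokens."""
    tokens = set(re.findall(r"[a-z0-9_]+", tweet.lower()))
    return {party for party, keywords in PARTY_KEYWORDS.items() if tokens & keywords}

def tag_single_party(tweet):
    """Return the party name if exactly one party is mentioned, else None."""
    found = parties_mentioned(tweet)
    return next(iter(found)) if len(found) == 1 else None

mixed = ("Looks like DA is set to lead SA after the anc. "
         "Meanwhile @Julius_S_Malema is happy to remain a loud propagandist.")
single = "The #EFFSouthAfrica manifesto launch was packed"
```

Here `tag_single_party(mixed)` returns `None` because three parties are mentioned, so the tweet would be excluded, whereas `tag_single_party(single)` returns `"EFF"`. Whole-token matching is a deliberate choice: substring matching would confuse the short name "da" with ordinary words.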

A primary challenge associated with NLP tasks relates to handling out-of-range or non-standard text data. Social media text makes it even more of a challenge as out-of-range data comes in many forms: emojis, punctuation, misspelling, URLs, colloquial language, and other non-standard forms of texting. To handle these, various pre-processing methods were applied to make the text data suitable for analysis and prediction. These included the removal of unwanted text and symbols in the tweet such as usernames, URLs, hashtags, numbers, and punctuation. The text was standardized by converting it to lower case, expanding contractions, removing apostrophes together with the letters that follow them, and removing extra white space. Successive words like 'action' and 'sa' were joined so as not to lose the representation of ActionSA in some tweets. Stop-words were also removed from the text. These are frequent, low-information words that form part of the language syntax but do not add any semantic meaning to the text, such as 'the', 'a', 'because' or 'have'. Lastly, the text was tokenized and lemmatized. The latter reduces inflection, where words with the same meaning are used in different forms for grammatical correctness, by representing them in their common root form, e.g., the words 'corrupt', 'corrupts', 'corrupted' and 'corrupting' all stem from the one word 'corrupt'.
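The cleaning steps described above can be sketched with regular expressions as follows. This is an illustrative approximation rather than the pipeline used in the study: the stop-word and contraction lists are tiny hypothetical samples, whole hashtags are dropped (an assumption about what hashtag removal means here), and lemmatization is omitted because it needs an external NLP library.

```python
import re

# Small hypothetical samples; real lists would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "because", "have"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "don't": "do not", "it's": "it is"}

def preprocess(tweet):
    """Clean a raw tweet and return its list of tokens."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # remove usernames and hashtags
    text = text.replace("\u2019", "'")                   # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():             # expand known contractions
        text = text.replace(short, full)
    text = re.sub(r"'\w*", " ", text)                    # drop apostrophes and trailing letters
    text = re.sub(r"[^a-z\s]", " ", text)                # remove numbers and punctuation
    text = re.sub(r"\baction\s+sa\b", "actionsa", text)  # keep ActionSA as a single token
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = preprocess("@user Action SA can't fix the ANC's 10 problems!! https://t.co/abc #LGE2021")
```

For this example the result is `["actionsa", "cannot", "fix", "anc", "problems"]`. The order of the steps matters: URLs and mentions are removed before punctuation so that their fragments do not survive as stray tokens.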

Word embeddings are used to create numerical features, i.e., vector representations of the text data, to transform the text into a machine-readable format. Two word-embedding models were explored for the representation of the elections data: TF-IDF Encoding and Word2Vec Embeddings.

TF-IDF Encoding. The TF-IDF weight of a term t in a document d is $\text{tf-idf}(t, d) = tf(t, d) \times idf(t)$, where:

• Term frequency: $tf(t, d)$ is the number of occurrences of the term t in document d.

• Inverse document frequency: $idf(t) = \log\left(\frac{n}{df(t)}\right)$ measures how significant the term is in the corpus, where n is the total number of documents in the corpus and $df(t)$ is the document frequency of term t.
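As a worked illustration of these definitions, the sketch below computes TF-IDF weights for a small tokenized corpus, taking the weight of a term in a document to be its term frequency multiplied by log(n / df(t)). The unsmoothed form of the inverse document frequency is an assumption here; library implementations often add smoothing terms, so their numbers can differ. The three toy documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dictionary per tokenized document."""
    n = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    # Term frequency is the raw count of the term in the document.
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in corpus]

corpus = [
    ["anc", "corruption", "corruption"],
    ["da", "corruption"],
    ["eff", "land"],
]
weights = tf_idf(corpus)
```

In the first document, 'corruption' occurs twice and appears in two of the three documents, so its weight is 2 × log(3/2), while 'anc' occurs once in a single document and receives log(3). A term present in every document would receive a weight of zero.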

Word2Vec Embeddings. Word2Vec representations are computed from prediction-based models. Word2Vec was pre-trained on over 100 billion words from the Google News dataset and uses a shallow two-layer neural network to derive vector representations [5]. The difference between Word2Vec and TF-IDF is that Word2Vec captures semantic and syntactic similarities by considering the context in which words are used; therefore, words that share the same context will have similar vector representations. Word2Vec uses two different methods to create the vector representations: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts the target word from the surrounding context words, which serve as the input to the network. Skip-gram uses the target word to predict the context and produce the representation. Skip-gram takes the one-hot-encoded vector of the target word as input and outputs, for each word in the vocabulary, the probability of it appearing in the target word's context. CBOW does the opposite, taking in the one-hot-encoded context words to give the probability score of the output word being in the centre of the context.
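The difference between the two training schemes is easiest to see in the training examples each one generates from a sentence. The sketch below builds those examples only; it does not train a network, and the window size and sample sentence are arbitrary choices for illustration.

```python
def training_pairs(tokens, window=2):
    """Build (input, output) training examples for Skip-gram and CBOW."""
    skip_gram, cbow = [], []
    for i, target in enumerate(tokens):
        # Context: up to `window` words on each side of the target word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Skip-gram: the target word is the input, each context word an output.
        skip_gram.extend((target, c) for c in context)
        # CBOW: the context words are the input, the target word the output.
        cbow.append((context, target))
    return skip_gram, cbow

skip_gram, cbow = training_pairs(["anc", "corruption", "must", "end"], window=1)
```

With a window of one, the word 'corruption' yields the Skip-gram pairs `("corruption", "anc")` and `("corruption", "must")`, and the single CBOW example `(["anc", "must"], "corruption")`.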

To represent the elections data in vector form, both TF-IDF and Word2Vec (Skip-gram and CBOW) models were fit to Dataset A (the complete dataset across all political parties), allowing the models to be trained for sentiment predictions. Metrics used for the evaluation of the models were precision, recall and F1 score [25]. The modelling process of the semi-supervised learning is summarized in a flow diagram.

Table 2 is a comparison of the topmost cosine similarities to the name of the political party from the Word2Vec embeddings, CBOW and Skip-gram. It is observed that Skip-gram produced more contextual words commonly associated with each of the political parties than CBOW.
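For reference, the three evaluation metrics named above can be computed from true and predicted labels as in the following sketch. The label values and the choice of 'negative' as the class of interest are assumptions made for the example.

```python
def precision_recall_f1(y_true, y_pred, target="negative"):
    """Compute precision, recall and F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # true positives
    fp = sum(1 for t, p in pairs if t != target and p == target)  # false positives
    fn = sum(1 for t, p in pairs if t == target and p != target)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["negative", "negative", "positive", "positive", "negative"]
y_pred = ["negative", "positive", "positive", "negative", "negative"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here two negative tweets are correctly identified, one positive tweet is wrongly called negative and one negative tweet is missed, giving precision, recall and F1 of 2/3 each.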

Table 3 shows the mean of the 5-fold cross-validation F1 scores for the baseline and semi-supervised models trained using TF-IDF and Word2Vec representations. The validation scores show that the models generalise well, with Word2Vec-trained models yielding better performance than TF-IDF models in most cases.

The remaining examples of the datasets for the four political parties were automatically labelled using the semi-supervised model. From the obtained sentiment results (Table 6), it is observed that the general sentiment of the South African political context for the local government elections relating to the parties is negative. The ANC is the current ruling party and the most discussed party, generally generating more tweet data than the other contenders; therefore, the results are interpreted in terms of the sentiment relative to the number of tweets generated for the party (i.e., sentiment percentage). Tweets associated with the ANC show the worst negative sentiment percentage of all the parties, which highlights the level of dissatisfaction with the ruling party. The newly formed party, ActionSA, has a more positive tweet percentage compared to the other parties.

ACTIONSA. The main concern in the negative sentiment tweets expressed towards the newly formed party is centred around its stance on illegal immigration, which has been viewed by many as draconian and xenophobic.

Another topic of concern expressed in negative sentiment tweets was the omission of the party's name on the ballot paper for the elections. ActionSA's 4-grams cover the themes of immigration and expropriation of land, a topic commonly associated with the EFF. The latter is due to expressed concerns about ActionSA's lack of planning and public interest when addressing topics related to land reform in South Africa.
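The 4-grams referred to above are sequences of four consecutive tokens. As a rough illustration only, the sketch below counts the most frequent n-grams in a set of tokenized tweets; it is a plain frequency count and not the topic model used in this study, and the token lists are hypothetical.

```python
from collections import Counter

def top_ngrams(token_lists, n=4, k=3):
    """Return the k most frequent n-grams across all tokenized tweets."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for tokens in token_lists
        for i in range(len(tokens) - n + 1)
    )
    return counts.most_common(k)

# Hypothetical pre-processed negative tweets, not taken from the study's data.
docs = [
    ["expropriation", "land", "without", "compensation", "now"],
    ["support", "expropriation", "land", "without", "compensation"],
    ["illegal", "immigration", "must", "stop"],
]
most_common = top_ngrams(docs, n=4, k=1)
```

For these toy tweets the most frequent 4-gram is `("expropriation", "land", "without", "compensation")`, which occurs twice. Tweets shorter than n tokens contribute no n-grams.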

This study illustrated how social media sentiment analysis and topic modelling were used to understand the opinions shared about four different political parties in the South African context. Furthermore, based on this analysis, context for the positive and negative sentiments was highlighted. To predict political sentiment, Twitter data was processed and a small subset of the data was manually labelled for positive and negative sentiment using a lexicon dictionary as a guide. A semi-supervised approach was followed to predict political sentiment using the graph-based method, label propagation, to learn from the small manually labelled set and propagate its sentiment labels to the unlabelled data. This method was used to address the gap in the current under-resourced sentiment models for the South African political landscape.

The label propagation model using the Skip-gram vector representation resulted in the best model performance across different metrics compared to the TF-IDF and CBOW representations. This model was used to label the sentiment examples for the four political parties. The sentiment scores in this study indicated that the current ruling party (ANC) had the greatest negative score, with ActionSA having the highest share of positive sentiment of the political parties in this study. Topic modelling was used to extract topics uncovering the main concerns associated with each party's negative tweets, with the main concerns towards the ruling party being centred around corruption, incompetency, and Eskom.

The appointment of Cyril Ramaphosa came with a promise to deal with the country's insuperable corruption and, three years later, corruption is still a major concern for the South African public, as seen through topic modelling. The negative sentiment expressed towards the ruling party does not come as a surprise given that these are only a few of the many recurring issues that the ruling party has difficulty addressing. Most of these issues are mainly attributed to incompetency, leading to failure to address issues of employment, service delivery, crime and issues faced by the majority of the poor in the country. Internal factional battles within the party have also not made it easy for the party to deal with critical issues of the country.

The last research question was whether sentiment towards political parties drives voting intentions. Some studies have shown the predictive value of sentiment analysis for election turnout [3, 9, 17]. For our local government elections, this question can be answered by referencing the voting results, as a full empirical study would require additional information to be collected. Given the highly negative opinions expressed towards the ruling party, it can be assumed that this sentiment and the concerns it encapsulates are reflected in the public's actions at the voting stations. The ANC recorded its worst voting results, with an all-time electoral low below 50%. The ruling party managed to maintain votes in the poor regions of the country, such as regions within the Eastern Cape, but suffered major losses in regions where it previously had a stronghold, such as Soweto and eThekwini. ActionSA unseated the ruling party in some of its major voting districts. For a party that is less than a year old, there have to be positive contributing factors that resonate with the public for it to receive such strong support.

The elections saw the worst voter turnout post-apartheid, with youth voter turnout being the lowest, even though the youth are the majority of those who have access to social media platforms and vast online information and are capable of making informed voting decisions. The low voter turnout is a troubling concern which the ruling party's Deputy Secretary General attributed to failures of the ANC in a statement on the local government elections [7]: "it is in the main an unambiguous signal to the ANC from the electorate. The low voter turnout, especially in traditional ANC strongholds, communicates a clear message: The people are disappointed in the ANC with the slow progress in fixing local government, in ensuring quality and consistent basic services, in tackling corruption and greed."

This study gave us a good sense of the political climate for the local government elections; however, the use of social media to determine political sentiment excludes a great number of South Africans without access to technology or Twitter.

The dataset that was used for sentiment predictions of the respective parties disregarded multi-sentiment tweets where one tweet contains expressed sentiment about more than one political party. This reduced the data associated with each of the political parties. Proposed future work to address this would be using an automatic approach such as target-dependent sentiment analysis tasks which can handle different sentiment projected towards different subjects in the same tweet.

Another limitation of the study is that only positive and negative sentiment were used, which means the model is not trained to identify tweets that express neither positive nor negative sentiment.

Due to time and cost constraints, we did the manual annotation of the tweets ourselves. Even though they were verified by a political commentator, we are not experts in political linguistics.

Based on [30] and the literature [20, 21, 26, 28] presenting the success of Label Propagation, it was the only semi-supervised approach explored in the study. As part of proposed future work, it would be good to test different ways of propagating labels onto unlabelled data in a comparative study, for example using other graph-based methods or wrapper-based methods such as self-training and co-training.

To further improve model performance, we could also investigate a 'transfer-learning' approach by continuing the training of the powerful pre-trained Google models on our elections dataset to create an even better vector representation and also using these for topic modelling.

[1] 'Governing from the opposition?': tracing the impact of EFF's 'niche populist politics' on ANC policy shifts

[2] On Using Twitter to Monitor Political Sentiment and Predict Election Results

[3] Prediction and analysis of Indonesia Presidential election from Twitter using sentiment analysis

[4] Identifying Political Sentiment between Nation States with Social Media

[5] Google Code. 2015. word2vec. Retrieved

[6] Wisdom of the Crowds: 2010 UK Election Prediction with Social Media

[7] Statement of the ANC on Local Government Elections 2021 Preliminary Results

[8] Why do citizens participate on government social media accounts during crises? A civic voluntarism perspective

[9] Towards Prediction of Election Outcomes Using Social Media

[10] Employing sentiment analysis for gauging perceptions of minorities in multicultural societies: An analysis of Twitter feeds on the Afrikaner community of Orania in South Africa

[11] Attitude Toward Protective Behavior Engagement During COVID-19 Pandemic in Malaysia: The Role of E-government and Social Media

[12] Social media adoption and resulting tactics in the U.S. federal government

[13] Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers

[14] A Practical Guide to Sentiment Annotation: Challenges and Solutions

[15] Saif Mohammad. 2022. Sentiment and Emotion Lexicons

[16] Topic Modelling of News Articles for Two Consecutive Elections in South Africa

[17] A Method for Predicting the Winner of the USA Presidential Elections using Data extracted from Twitter

[18] Social Media and Sentiment Analysis: The Nigeria Presidential Election 2019

[19] Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents

[20] Identifying Users with Opposing Opinions in Twitter Debates

[21] Semi-Supervised Polarity Lexicon Induction

[22] Unmasking the conversation on masks: Natural language processing for topical sentiment analysis of COVID-19 Twitter discourse

[23] Semi-supervised Learning for Sentiment Classification using Small Number of Labeled Data

[24] Intelligent Learning based Opinion Mining Model for Governmental Decision Making

[25] Evaluation of classifiers: current methods and future research directions

[26] Automatic Domain-Specific Sentiment Lexicon Generation with Label Propagation

[27] Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

[28] Large Scale and Parallel Sentiment Analysis Based on Label Propagation in Twitter Data

[29] Social media in government offices: usage and strategies

[30] Learning from labeled and unlabeled data with label propagation

We would also like to thank the Data Science for Social Impact research group at the University of Pretoria who assisted in proof-reading the final submission. We would like to acknowledge the following funders: ABSA (who sponsor the UP ABSA Data Science Chair) and the National Research Foundation, South Africa.