key: cord-175846-aguwenwo authors: Chatsiou, Kakia title: Text Classification of Manifestos and COVID-19 Press Briefings using BERT and Convolutional Neural Networks date: 2020-10-20 journal: nan DOI: nan sha: doc_id: 175846 cord_uid: aguwenwo We build a sentence-level political discourse classifier using existing human-expert-annotated corpora of political manifestos from the Manifestos Project (Volkens et al., 2020a) and apply it to a corpus of COVID-19 Press Briefings (Chatsiou, 2020). We use manually annotated political manifestos as training data to train a local topic Convolutional Neural Network (CNN) classifier, then apply it to the COVID-19 Press Briefings Corpus to automatically classify sentences in the test corpus. We report on a series of experiments with CNNs trained on top of pre-trained embeddings for sentence-level classification tasks. We show that a CNN combined with transformers like BERT outperforms CNNs combined with other embeddings (Word2Vec, GloVe, ELMo), and that it is possible to use a pre-trained classifier to conduct automatic classification on different political texts without additional training. A substantial share of citizen involvement in politics arises through written discourse, especially in the digital space. Through advanced, novel communication strategies, the public can play its part in constructing a political agenda, which has led politicians to increasingly use social media and other types of digital broadcasting to communicate (compared to the mainstream press and traditional print media). This is especially pertinent for crisis communication discourse, and the recent COVID-19 pandemic has created a great opportunity to study how similar topics are communicated in different countries, and the narrative choices made by government and public health officials at different levels of governance (international, national, regional).
To aid fellow scholars with the systematic study of such a large and dynamic set of unstructured data, we set out to employ a text categorisation classifier trained on similar domains (such as existing manually annotated sentences from political manifestos) and use it to classify press briefings about the pandemic in a more effective and scalable way. The main attraction of using manually coded political manifestos (Volkens et al., 2020a) as training data is that the political science expert community has been systematically collecting and annotating political parties' manifestos around the world for years (since the 1960s), in order to apply content analysis methods and to advance political science. These annotations have subsequently been used as training data in semi-supervised domain-specific classification tasks with good results (Zirn et al., 2016). In this paper, we build variations of a CNN sentence-level political discourse classifier using existing annotated corpora of political manifestos from the Manifestos Project (Volkens et al., 2020a). We test different CNN and word embedding architectures on the already annotated (English-language) sentences of the Manifestos Project Corpus. We then apply them to a corpus of COVID-19 Press Briefings (Chatsiou, 2020), a subset of which was manually annotated by political scholars for the purposes of this work. The article is organised as follows: we first offer a brief overview of previous related work on the use of human-expert-annotated political manifestos for discourse classification. We then describe our framework, including the training data used, the data pre-processing performed, and the architecture used. We report on a series of experiments with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks. We conclude with an evaluation of the BERT+CNN architecture against other combinations (Word2Vec+CNN, GloVe+CNN, ELMo+CNN) for both corpora.
Experimental results show that a CNN classifier combined with transformers like BERT outperforms a CNN combined with other embeddings (Word2Vec, GloVe, ELMo). The use of NLP methods to analyse political texts is a well-established field within Political Science and Computational Social Science more generally (Lazer et al., 2009; Grimmer and Stewart, 2013; Benoit, Laver, and Mikhaylov, 2009). Researchers have used NLP methods to accomplish various classification tasks, such as political positioning on a left-to-right continuum (Slapin and Proksch, 2008; Glavas, Nanni, and Ponzetto, 2017) and identification of political ideology differences from text. Glavas, Nanni, and Ponzetto (2017) propose an approach for cross-lingual topical coding of sentences from electoral manifestos, using as training data manually coded manifestos with a total of 77,500 sentences in four languages (English, French, German and Italian), with CNNs and word embeddings, and inducing a joint multilingual embedding space. They report achieving better results than monolingual classifiers in English, French and Italian, but worse results with their multilingual classifier than with a monolingual classifier in German. More recently, Bilbao-Jayo and Almeida (2018a) built a sentence classifier using multi-scale convolutional neural networks trained in seven different languages on sentences extracted from annotated parties' election manifestos. They use the full range of the domains defined by the Manifestos Project, and show that enhancing the multi-scale convolutional neural networks with context data improves classification. For a detailed discussion of different deep learning models for text classification and their technical contributions, similarities and strengths, see Chatsiou and Mikhaylov (2020) and Minaee et al. (2020).
Using annotated political manifestos as the training dataset for classifying other types of political texts is gaining traction in the literature, especially with the boost in performance of deep learning methods for text. Nanni et al. (2016) used expert-annotated political manifestos in English, together with speeches, to train a local supervised topic classifier (an SVM with a bag-of-words approach) that combines lexical and semantic textual similarity features at sentence level. A sub-part of the training set was annotated manually by human experts, and the rest was labelled automatically, with the global optimisation step performed via a Markov Logic network presented in Zirn et al. (2016). The advantage of such a domain-transfer approach is that no manual topic annotation of the rest of the corpus is needed. They then classified the speeches from the 2008, 2012 and 2016 US presidential campaigns into the 7 domains defined by the Manifestos Project, without the need for additional topic annotation. Bilbao-Jayo and Almeida (2018b) used annotated political manifestos in Spanish and the Regional Manifestos Project taxonomy (Alonso, Gomez, and Cabeza, 2013) to train a neural network sentence-level classifier (CNN) with Word2Vec word embeddings, also taking into account the context of the phrase (such as what was previously said and the political affiliation of the speaker). They used this to analyse social media (Twitter) data from the main Spanish political parties during the 2015 and 2016 Spanish general elections, without the need for additional manual coding of the Twitter data. This paper builds on this area of research, presenting a comparison of a CNN classifier trained on the Manifestos Project annotations for English, comparing context-free (Word2Vec, GloVe) with context-sensitive (ELMo, BERT) word embeddings. We then apply this to a corpus of daily press briefings on the COVID-19 situation by government and public health authorities.
The main attraction of using manually coded political manifestos (Volkens et al., 2020a) as training data is that the political science community has been systematically collecting and annotating political parties' manifestos for decades, in a combined effort to create a resource for systematic content analysis and to advance political science. The corpus is based on the work of the Manifesto Research Group (MRG) and the Comparative Manifestos Project (CMP) (Budge et al., 2001). Classification annotations are described in the Manifesto Coding Handbook, which has evolved over the years and provides information and instructions to the human annotators on how political parties' manifestos should be coded (latest version in Volkens et al. (2020b)). The handbook also defines a specific set of 7 policy areas or 'domains' and 56 sub-areas or 'subdomains' which are available to annotators (see Figure 1). For our training corpus, we use a subset of the corpus containing 115 English manifestos with 86,500 annotated sentences. Table 1 shows the domain codes' distribution in the dataset. The Coronavirus (COVID-19) Press Briefings Corpus is a collection of daily briefings on the COVID-19 status and policies from the UK and the World Health Organisation. The corpus is still in development, but we have selected example sentences from the UK and the WHO, which were the ones available. During the peak of the pandemic, most countries around the world informed their citizens of the status of the pandemic (usually involving an update on the number of infection cases and the number of deaths) and of other policy-oriented decisions about dealing with the health crisis, such as advice on what to do to reduce the spread of the epidemic.
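Shares like those reported in Table 1 are straightforward to compute from the sentence-level annotations. A minimal sketch (the labels below are hypothetical toy data, not the real corpus counts):

```python
from collections import Counter

def domain_distribution(labels):
    """Return each domain's share of annotated sentences, as percentages."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {domain: round(100 * n / total, 2) for domain, n in counts.items()}

# Hypothetical sentence-level labels (the real training subset has 86,500
# annotated sentences across the 7 Manifestos Project domains).
labels = ["Domain 5"] * 3 + ["Domain 7 (Social groups)"]
assert domain_distribution(labels) == {
    "Domain 5": 75.0,
    "Domain 7 (Social groups)": 25.0,
}
```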
At the moment, the dataset includes briefings covering announcements between March 2020 and August 2020 from the UK (England, Scotland, Wales, Northern Ireland) and the World Health Organisation (WHO). To represent sentences, we first obtained context-free word embeddings, namely Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Word2Vec uses a shallow neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. GloVe is an unsupervised learning model for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. We also obtained more context-sensitive word embeddings, namely ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). ELMo is a deep contextualised word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics) and (2) how these uses vary across linguistic contexts (i.e., it models polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can easily be added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis. BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (for English, this includes the roughly 2.5-billion-word English Wikipedia).
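The practical difference between the two families of embeddings can be seen in a toy example: a context-free lookup table assigns one fixed vector per word type, so the same word receives the same vector in every sentence it occurs in. A minimal numpy sketch (the 4-dimensional vectors are invented for illustration, not real Word2Vec weights):

```python
import numpy as np

# Toy static-embedding lookup table: one fixed vector per word type.
# The values are hypothetical, chosen only to illustrate the lookup.
EMB = {
    "cases":  np.array([0.9, 0.1, 0.0, 0.2]),
    "rise":   np.array([0.1, 0.8, 0.3, 0.0]),
    "deaths": np.array([0.7, 0.2, 0.1, 0.4]),
}

def embed(sentence):
    """Map a whitespace-tokenised sentence to a sequence of word vectors."""
    return np.stack([EMB[w] for w in sentence.split() if w in EMB])

s1 = embed("cases rise")
s2 = embed("deaths rise")

# "rise" gets an identical vector in both sentences: a context-free model
# cannot distinguish its occurrences, unlike contextual models such as
# ELMo or BERT, which produce a different vector per occurrence.
assert np.allclose(s1[1], s2[1])
```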
Unlike previous context-free models, which generate a single word embedding representation for each word in the vocabulary, BERT takes into account the context of each occurrence of a given word, providing a contextualised embedding that is different for each sentence. Since Kim (2014) outlined the idea of using CNNs (traditionally used for recognising visual patterns in images) for text classification, CNNs have achieved very good performance in several text classification tasks (Poria, Cambria, and Gelbukh, 2015; Bilbao-Jayo and Almeida, 2018b). CNNs involve convolution operations in which moving frames or windows (filter sizes) analyse and reduce different overlapping regions in a matrix to extract different features. The ability to also bootstrap word embeddings in this type of neural network makes it an excellent candidate for extracting knowledge from, and classifying, non-annotated texts. We therefore set up 4 variations of the CNN classifier, M1, M2, M3 and M4, as follows: 1. Word vectors of the training dataset sentences are created using one of the following word embeddings: Word2Vec (M1), GloVe (M2), ELMo (M3) or BERT (M4). Sentences are fed as sequences of words, mapped first to indexes and then to a sequence of word vectors. We chose a word vector size of d = 300 and a 60 x d matrix as the space where the convolution operations are performed. 2. Vectors are fed to the neural network (CNN). We then perform convolution operations with 100 filters and three different filter sizes (2 x d, 3 x d, and 4 x d). We reduce the dimensionality of the feature maps generated by each group of filters using 1-max pooling, and the pooled features are then concatenated (Boureau, Ponce, and LeCun, 2010). A dropout rate of 0.5 is applied (Srivastava et al., 2014) as regularisation to prevent overfitting. 3. A softmax layer computes the probability distribution over the labels.
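The forward pass just described (parallel filter groups of sizes 2, 3 and 4 sliding over a 60 x 300 sentence matrix, 1-max pooling, concatenation, softmax over the 7 domains) can be sketched in plain numpy. This is an illustrative sketch with random, untrained weights, not the trained model; dropout is omitted because it applies only during training:

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, DIM = 60, 300      # 60 x d input matrix, word-vector size d = 300
N_FILTERS = 100             # filters per filter size
FILTER_SIZES = (2, 3, 4)    # window heights: 2 x d, 3 x d, 4 x d
N_CLASSES = 7               # the 7 Manifestos Project domains

def conv_1max(x, filters):
    """Slide each (h x d) filter over the sentence matrix, then 1-max-pool."""
    n_filt, h, d = filters.shape
    # All length-h windows of the sentence, each flattened to one row.
    windows = np.stack([x[i:i + h].ravel() for i in range(x.shape[0] - h + 1)])
    feature_map = np.tanh(windows @ filters.reshape(n_filt, -1).T)  # (positions, n_filt)
    return feature_map.max(axis=0)                                  # keep strongest activation

def forward(x):
    # One group of filters per size; pooled feature vectors are concatenated.
    feats = np.concatenate([
        conv_1max(x, 0.01 * rng.standard_normal((N_FILTERS, h, DIM)))
        for h in FILTER_SIZES
    ])
    logits = feats @ (0.01 * rng.standard_normal((len(FILTER_SIZES) * N_FILTERS, N_CLASSES)))
    z = np.exp(logits - logits.max())
    return z / z.sum()          # softmax: probability distribution over the 7 labels

probs = forward(rng.standard_normal((SEQ_LEN, DIM)))
assert probs.shape == (N_CLASSES,) and np.isclose(probs.sum(), 1.0)
```

In the trained model these filter and output weights are learned; the sketch only shows how the tensor shapes flow through the three stages.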
We perform optimisation using the Adam optimiser with the hyperparameters of the original paper (Kingma and Ba, 2017). Note that this is a sentence-level topic classifier, basing its predictions only on information local to the sentence. In order to evaluate the different architectures, we divided our training dataset into two subsets: a training and validation set (85%) and a test set (15%). We used a validation (development) set separate from the test set to ensure correct evaluation and that our models do not overfit, and to verify how each domain is classified, making the evaluation robust. We performed 4 experiments, one for each combination of CNN and word embeddings: M1 (CNN with Word2Vec), M2 (CNN with GloVe), M3 (CNN with ELMo) and M4 (CNN with BERT). As shown in Table 2, the performance of the classifier improves when more context-sensitive word embeddings are used. Using BERT with CNN (M4) provides a substantial increase in accuracy and F1, while ELMo also performs very well. We also tested the performance of the same pre-trained models on the COVID-19 corpus. We asked two political science scholars to annotate a subset of 20 press briefings (4 from each set), using the 7 domains of the Manifestos Project. This resulted in a dataset of 1,740 manually annotated sentences, with the domain distribution shown in Table 3. Note that the pre-trained models were trained using the annotated manifestos from the Manifestos Project, without any additional training on the press briefings corpus sentences. As shown in Table 4, the performance of the classifier improves when more context-sensitive word embeddings are used in the COVID-19 press briefings corpus as well.
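An 85/15 split of labelled sentences can be done in a few lines of standard-library Python. The sketch below stratifies by domain so each label keeps roughly the same share on both sides; stratification is an assumption for illustration, as the paper does not specify the splitting procedure:

```python
import random
from collections import defaultdict

def stratified_split(sentences, labels, test_frac=0.15, seed=42):
    """Split labelled sentences into train+validation and test sets,
    keeping each domain's share roughly equal on both sides."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s, y in zip(sentences, labels):
        by_label[y].append((s, y))
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = max(1, round(len(items) * test_frac))  # at least one test item per domain
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

# Toy data: 100 hypothetical sentences over the 7 Manifestos domains
# (the real split is 86,500 sentences, 85% train+validation / 15% test).
xs = [f"sentence {i}" for i in range(100)]
ys = [i % 7 for i in range(100)]
train, test = stratified_split(xs, ys)
assert len(train) + len(test) == len(xs)
assert {y for _, y in test} == set(range(7))   # every domain appears in the test set
```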
Using BERT with CNN (M4) provides a substantial increase in accuracy and F1, while ELMo also performs very well. As expected, there is some loss of accuracy, as we are porting the classifier to a slightly different domain of political text (from manifestos to press briefings). In this paper, we built a sentence-level political discourse classifier using existing human-expert-annotated corpora of English political manifestos from the Manifestos Project (Volkens et al., 2020a). We tested the accuracy and performance of a convolutional neural network (CNN) classifier using different word embeddings as part of the word-to-vector mapping, and we showed that sentence-level CNN classifiers combined with transformers like BERT outperform models with other embeddings (Word2Vec, GloVe, ELMo). We then applied the same pre-trained models to a different set of texts, the COVID-19 Press Briefings Corpus. We observe similar patterns in the accuracy and F1 scores, and additionally show that it is possible to use a pre-trained classifier to conduct automatic classification on different political texts without additional training. In the future, we aim to conduct similar experiments considering the 'subdomain' categories of the Manifesto Corpus annotations. We also plan to re-run these experiments for other languages in the Manifestos Project, testing the language-agnostic advantage of word embeddings to see whether we obtain different results. This paper follows the AAAI Publications Ethics and Malpractice Statement and the AAAI Code of Professional Conduct. We use publicly available text data to ensure transparency and reproducibility of the research. Additionally, all code will be made available as open source (on github.com) at the end of the submission and review process.
The paper suggests ways to automatically extract topic information from political discourse texts, employing deep learning methods which are usually associated with artificial intelligence and the ethical considerations around it. We do not envisage any ethical, social or legal considerations arising from the work outlined in this study, such as the impact of AI on humans, on economic growth or on inequality, the amplification of bias, the undermining of political stability, or other issues described in recent reports on ethics in AI (see for example Bird et al. (2020)).
Table 1: Domain codes' distribution in the English subset of the Manifestos Corpus used for training the CNN classifier.
Table 2: Domain results of all models using political manifestos.
Table 3: Manifesto Project domain codes' distribution in the manually annotated subset of the COVID-19 corpus.
Table 4: Domain results of all models using COVID-19 press briefings.
References
• Probabilistic Latent Semantic Indexing
• Mapping Policy Preferences: Estimates for Parties, Electors, and Governments
• Latent Dirichlet Allocation
• Automated Classification of Congressional Legislation
• A Scaling Model for Estimating Time-Series Party Positions from Texts
• Treating Words as Data with Error: Uncertainty in Text Statements of Policy Positions
• Life in the Network: The Coming Age of Computational Social Science
• Use of Force and Civil-Military Relations in Russia: An Automated Content Analysis
• A Theoretical Analysis of Feature Pooling in Visual Recognition
• Affective News: The Automated Coding of Sentiment in Political Texts
• Measuring Centre-Periphery Preferences: The Regional Manifestos Project
• Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
• Efficient Estimation of Word Representations in Vector Space
• Measuring Ideological Proportions in Political Speeches
• Convolutional Neural Networks for Sentence Classification
• GloVe: Global Vectors for Word Representation
• Dropout: A Simple Way to Prevent Neural Networks from Overfitting
• Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-level Multimodal Sentiment Analysis
• Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data
• Entities as Topic Labels: Combining Entity Linking and Labeled LDA to Improve Topic Interpretability and Evaluability
• Agreement and Disagreement: Comparison of Points of View in the Political Domain
• TopFish: Topic-Based Analysis of Political Position in US Electoral Campaigns
• Classifying Topics and Detecting Topic Shifts in Political Manifestos
• Understanding State Preferences with Text as Data: Introducing the UN General Debate Corpus
• Cross-Lingual Classification of Topics in Political Texts
• Adam: A Method for Stochastic Optimization
• Building Entity-Centric Event Collections
• Automatic Political Discourse Analysis with Multi-Scale Convolutional Neural Networks and Contextual Data
• Political Discourse Classification in Social Networks Using Context Sensitive Convolutional Neural Networks
• Deep Contextualized Word Representations
• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
• Topic Models Meet Discourse Analysis: A Quantitative Tool for a Qualitative Approach
• Structural Topic Modeling for Social Scientists: A Brief Case Study with Social Movement Studies Literature
• The Ethics of Artificial Intelligence: Issues and Initiatives. Study, European Parliament's Panel for the Future of Science and Technology, PE 634.452. LU: Publications Office
• COVID-19 Press Briefings Corpus (dataset)
• Deep Learning for Political Science
• Deep Learning Based Text Classification: A Comprehensive Review
• Manifesto Project Dataset (dataset, version 2020a)
• The Manifesto Data Collection. Manifesto Project (MRG/CMP/MARPOR)
The author would like to acknowledge the support of the Business and Local Government Data Research Centre (ES/S007156/1), funded by the Economic and Social Research Council (ESRC), whilst undertaking this work.