A Comparison of Approaches for Imbalanced Classification Problems in the Context of Retrieving Relevant Documents for an Analysis
Sandra Wankmuller
2022-05-03

One of the first steps in many text-based social science studies is to retrieve documents that are relevant for the analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists risks drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder, 2017), the Social Bias Inference Corpus (SBIC) (Sap et al., 2020), and the Reuters-21578 corpus (Lewis, 1997). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied on a not too small set of labeled training instances (e.g. 1,000 documents), reaches a substantially higher retrieval performance than keyword lists.

When conducting a study on the basis of textual data, at the very start of an analysis researchers are often confronted with a difficulty: Online platforms and other sources from which textual data are generated usually cover multiple topics and hence tend to contain textual references to a huge number of different entities. Social scientists, however, are typically interested in text elements referring to a single entity, e.g. a specific person, organization, object, event, or issue. Imagine, for example, that a study seeks to examine how rape incidents are framed in newspaper articles (Baum et al., 2018), or that a study seeks to detect electoral violence based on social media data (Muchlinski et al., 2021), or that a study seeks to measure attitudes expressed towards further European integration in speeches of political elites (Rauh et al., 2020). In all these studies, one of the first steps is to extract documents that refer to the entity of interest from a large, multi-thematic corpus of documents. 1 That is, researchers have to separate the relevant documents that refer to the entity of interest from the documents that focus on entities irrelevant for the analysis at hand. Newspaper articles reporting about rape incidents have to be parted from those articles that do not. Tweets relating to electoral violence have to be extracted from the stream of all other tweets. And speech elements about European integration have to be separated from elements in which the speaker talks about other entities.
Given a corpus comprising many diverse topics, it is likely that only a small proportion of documents relate to the entity of interest. Hence, the proportion of relevant documents is substantively smaller than the proportion of irrelevant documents in the data and the task of separating relevant from irrelevant documents turns into an imbalanced classification problem (Manning et al., 2008, p. 155). How researchers address this imbalanced classification problem is highly important as the selection of documents affects the inferences drawn. More precisely: If there is a systematic bias in the selection of documents such that the value on a variable of interest is related to the question of whether a document is selected for analysis or not, the inferences that are made on the basis of the documents that have been selected for analysis are likely to be biased. Selection biases can be induced when the corpus is collected in the first place. And selection biases can be induced when documents that refer to relevant entities are selected for analysis from the already collected corpus. This work focuses on the second step. The more accurately a method can separate relevant from irrelevant documents, the smaller the potential size of the bias resulting from this second selection step.

Despite the relevance of this problem, the question of how best to retrieve documents from large, heterogeneous corpora has so far received little attention in social science research. In many applications, researchers have relied on applying human-created sets of keywords and regard those documents as relevant that comprise at least one of the keywords (see e.g. Burnap et al., 2016; Jungherr et al., 2016; Beauchamp, 2017; Baum et al., 2018; Stier et al., 2018; Fogel-Dror et al., 2019; Rauh et al., 2020; Watanabe, 2021; Muchlinski et al., 2021). Yet, research indicates that humans are not good at generating comprehensive keyword lists and are highly unreliable at the task (King et al., 2017, p. 973-975). That is, the keyword list generated by one human is likely to contain only a small fraction of the universe of terms one could use to refer to a given entity of interest (King et al., 2017, p. 973-975). Moreover, the list of keywords that one human comes up with is likely to show little overlap with the keyword list generated by another human (King et al., 2017, p. 973-975). Joining forces by combining keyword lists that researchers have created independently may alleviate the problem somewhat. But still, the conventional approach of using keywords to identify relevant documents is likely to be unreliable and thus is likely to lead to very different (and potentially biased) conclusions depending on which set of keywords the researchers have used (King et al., 2017, p. 974-976).

Other approaches for identifying relevant documents-such as passive and active supervised learning, query expansion techniques, or the construction of topic model-based classification rules-are less frequently employed in social science applications. These approaches also require human input, but they detect patterns or keywords the researchers do not have to know beforehand. Except for query expansion, these methods require the researchers to recognize documents or terms related to the entity of interest rather than requiring the researchers to recall such information a priori (King et al., 2017, p. 972). This does not preclude these techniques from generating selection biases.
A supervised learning algorithm, for example, may systematically misclassify some documents as not being relevant based on word usage that could be correlated with a main variable of the analysis. Yet, as these approaches have the potential to extract patterns beyond what a team of researchers may come up with, these methods have the potential to more precisely separate relevant from non-relevant documents. And the higher the retrieval performance of a method, the smaller the potential for strongly biasing effects due to selection biases. These other techniques, however, also have a disadvantage: they are much more resource-intensive to implement. Supervised learning algorithms require labeled training documents, query expansion techniques depend on a data source to operate on, and topic model-based classification rules hinge on estimating a topic model. As the identification of relevant documents from a large heterogeneous corpus is likely to constitute only an early, small step in an elaborate text analysis, considerations regarding the costs and benefits of a retrieval method have to be taken into account. Hence, an ideal procedure reliably achieves a high retrieval performance such that it reduces the risk of incurring large selection biases and simultaneously is cost-effective enough to be conducted as a single step of an extensive study.

In practice, the performance and the cost-effectiveness of a procedure are likely to depend on the characteristics of an application (such as the length and textual style of documents, the type of the entity of interest, or the heterogeneity vs. homogeneity of the documents in the corpus). If the entity of interest is a person or organization and there is only a small set of expressions that is usually used to refer to this entity, then a list of keywords may lead to a performance similar to that of the resource-intensive application of a supervised learning algorithm. If, on the other hand, the entity of interest is not easily denominated (e.g. a policy issue such as the set of restrictions implemented to address the COVID-19 pandemic), then an acceptable retrieval performance may only be achieved by training a supervised learning algorithm.

So far, however, a systematic comparison of the performances of these different retrieval methods across social science applications is lacking. Thus, it is unclear what, if anything, could be gained in terms of retrieval performance by applying a more elaborate procedure. This study seeks to answer this question by comparing the retrieval performance of a small set of predictive keywords to (1) query expansion techniques extending this initial set, (2) topic model-based classification rules as well as (3) passive and active supervised learning. The procedures are compared on the basis of three retrieval tasks: (1) the identification of tweets referring to refugees, refugee policies, and the refugee crisis from a dataset of 24,420 German tweets (Linder, 2017), (2) the retrieval of posts that are offensive toward mentally or physically disabled people from the Social Bias Inference Corpus (SBIC) (Sap et al., 2020) that covers 44,671 potentially toxic and offensive posts from various social media platforms, and (3) the extraction of newspaper articles referring to crude oil from the Reuters-21578 corpus (Lewis, 1997) that comprises economically focused newspaper articles of which 10,377 are assigned to a topic.
The results show that, with the model settings studied here, query expansion techniques as well as topic model-based classification rules tend to decrease rather than increase retrieval performance compared to sets of predictive keywords. They only yield minimal improvements or acceptable results in specific settings. By contrast, active supervised learning-if implemented with a not too small number of labeled training documents-achieves relatively high retrieval performances across contexts. Moreover, in each application active learning substantively improves upon the mediocre to fair results reached by the best performing lists of predictive keywords. The observed differences between the mean F1-Scores achieved by active learning with 1,000 labeled training documents and the mean F1-Scores of keyword lists range between 0.194 and 0.476. Although active learning is designed to reduce the number of training documents that have to be annotated by human coders, it is nevertheless particularly resource intensive. Yet, the achieved performance enhancements are so considerable (and the consequences of selection biases potentially so severe) that researchers should consider spending more of their available resources on the step of separating relevant from irrelevant documents.

In the following Section 2, basic concepts relevant for discussing imbalanced classification problems in retrieval contexts are introduced. Afterward, the benefits and disadvantages of the usage of keyword lists, query expansion techniques, topic model-based classification rules, and passive as well as active supervised learning techniques in the context of identifying documents relevant for further analyses are discussed (Section 3). Then the procedures are applied on the datasets and their retrieval performances are inspected (Section 4). The final discussion in Section 4.3.4 summarizes what has been learned and points toward aspects that merit further study.

Before continuing, note that the vocabulary used in this study often makes use of the term retrieval. As this study focuses on contexts in which the task is to retrieve relevant documents from corpora of otherwise irrelevant documents, the usage of the term retrieval seems adequate. Yet, the task examined in this study is different from the task that is typically examined in document retrieval. Document retrieval is a subfield of information retrieval in which the task usually is to rank documents according to their relevance for an explicitly stated user query (Manning et al., 2008, p. 14, 16). In this study, in contrast, the aim is to classify-rather than rank-documents as being relevant vs. not relevant. Moreover, not all of the approaches evaluated here require the query that states the information need to be expressed explicitly in the form of keywords or phrases.

Imbalanced classification problems are common in information retrieval tasks (Manning et al., 2008, p. 155). They are characterized by an imbalance in the proportions made up by one vs. the other category. When retrieving relevant documents from large corpora, typically only a small fraction of documents falls into the positive relevant category whereas an overwhelming majority of documents is part of the negative irrelevant category (Manning et al., 2008, p. 155). When evaluating the performance of a method in a situation of imbalance, the accuracy measure that gives the share of correctly classified documents is not adequate (Manning et al., 2008, p. 155).
The reason is that a method that would assign all documents to the negative irrelevant category would get a very high accuracy value (Manning et al., 2008, p. 155). Thus, evaluation metrics that allow for a refined view, such as precision and recall, should be employed (Manning et al., 2008, p. 155). Precision and recall are defined as:

$\text{Precision} = \frac{TP}{TP + FP}$ (1)

$\text{Recall} = \frac{TP}{TP + FN}$ (2)

whereby TP, FP, and FN are defined in Table 1. Precision and recall are in the range [0, 1]. However, if none of the documents is predicted to be positive, then TP + FP = 0 and precision is undefined. If there are no truly positive documents in the corpus, then TP + FN = 0 and recall is undefined. The higher precision and recall, the better. Precision exclusively takes into account all documents that have been assigned to the positive relevant category by the classification method and informs about the share of truly positive documents among all documents that are predicted to fall into the positive category. Recall, on the other hand, exclusively focuses on the truly relevant documents and informs about the share of documents that have been identified as relevant among all truly relevant documents.

There is a trade-off between precision and recall (Manning et al., 2008, p. 156). A keyword list comprising many terms or a classification algorithm that is lenient in considering documents to be relevant will likely identify many of the truly relevant documents (high recall). Yet, as the hurdle for being considered relevant is low, they will also classify many truly irrelevant documents into the relevant category (low precision). A keyword list consisting of few specific terms or a classification algorithm with a high threshold for assigning documents to the relevant class will likely miss out many relevant instances (low recall), but among those considered relevant many are likely to indeed be relevant (high precision).

In this study's context of identifying relevant documents to be used for further analyses from a set of otherwise irrelevant documents, recall as well as precision should be as high as possible; but recall is the slightly more important metric: Recall operates on the set of all truly relevant documents and focuses on the inclusion vs. exclusion of relevant documents into the analysis-the analytic step at which selection biases may arise. If there is a correlation between the documents identified as relevant vs. not relevant and the value of the variable of interest, a selection bias is generated. That is, if truly relevant documents are systematically misclassified in the sense that the higher (or lower) the value on the variable of interest, the higher (or lower) the probability of being assigned to the negative irrelevant category, inferences made on the basis of the set of instances classified into the positive category are biased. High recall values do not guarantee that there are no systematic misclassifications. But the higher recall, the smaller the maximum size of the bias that arises from systematic misclassifications of truly relevant documents can become. Because of its exclusive focus on true and false positives, precision provides no information on the potential of selection bias due to the missing out of truly relevant documents. Nevertheless, precision should also be high. The lower precision, the fewer documents among those considered to be relevant by the classification method are indeed relevant.
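To make these metrics concrete, the following minimal Python sketch computes precision and recall for a small set of invented labels (1 = relevant, 0 = irrelevant). The use of scikit-learn here is purely illustrative and is not part of the study's replication code.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented binary labels: 1 = relevant, 0 = irrelevant.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

# Precision = TP / (TP + FP); Recall = TP / (TP + FN).
# zero_division=0 guards against the undefined cases discussed above
# (no predicted positives, or no truly positive documents).
print(precision_score(y_true, y_pred, zero_division=0))  # 2 / 3
print(recall_score(y_true, y_pred, zero_division=0))     # 2 / 3
```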
A considerable share of false positives among the set of documents classified to be relevant also has the potential to severely bias the inferences drawn or can impede the researcher from conducting any analysis at all because the retrieved documents are not those documents he or she seeks to analyze. Yet, whereas low precision can be handled by a researcher in subsequent steps, low recall implies that a substantial proportion of truly relevant documents are never to be considered for analysis. Hence, falsely classifying a truly relevant document as irrelevant can be considered to be more severe than falsely predicting an irrelevant document to be relevant.

The trade-off between precision and recall is incorporated in the Fω-measure, which is the weighted harmonic mean of precision and recall (Manning et al., 2008, p. 156):

$F_{\omega} = \frac{(1 + \omega^{2}) \cdot \text{Precision} \cdot \text{Recall}}{\omega^{2} \cdot \text{Precision} + \text{Recall}}$

The Fω-measure also is in the range [0, 1]. ω is a real-valued factor balancing the importance of precision vs. recall (Manning et al., 2008, p. 156). For ω > 1 recall is considered more important than precision and for ω < 1 precision is weighted more than recall (Manning et al., 2008, p. 156). A very common choice for ω is 1 (Manning et al., 2008, p. 156). In this case, the F1-measure (or synonymously: F1-Score) is the harmonic mean between precision and recall (Manning et al., 2008, p. 156). The F1-Score is a widely used measure to evaluate the performance of classification tasks. Although recall here is considered the slightly more important measure, the F1-Score-because it is the measure nearly always reported-will be employed to assess the performances of the retrieval approaches evaluated in the following.

3 Retrieval Approaches

In social science, a very commonly used approach to identify documents on relevant entities is to set up a set of keywords and to consider those documents as relevant that contain at least one of the keywords (see for example the studies listed in Table 2). This procedure in fact is a keyword-based boolean query in which the keywords are connected with the OR operator (Manning et al., 2008, p. 4). Slightly more advanced are boolean queries in which in addition to the OR operator also the AND operator is used. Using the AND operator is important in situations in which expressions denoting the entity of interest are composed of more than a single term (e.g. 'United States'). The ways in which the authors come up with a set of keywords range from simply using the most obvious terms (e.g. Baum et al., 2018), to collecting a set of typical denominations for the entity of interest (e.g. Burnap et al., 2016; Jungherr et al., 2016; Beauchamp, 2017), to carefully thinking about, testing, and revising sets of keywords (e.g. Stier et al., 2018; Reda et al., 2021; Gessler and Hunger, 2021), to collecting keywords empirically based on word usage in texts known to be about the entity (e.g. Zhang and Pan, 2019). Though these approaches vary in their complexity and costs, they are all still very cheap and relatively fast procedures. Another advantage of the usage of keyword lists for the extraction of relevant documents is that a researcher has full control over the terms that are included-and not included-as keywords.

[Table 2 (excerpt; only fragments could be recovered). Columns: Study | Number of keywords | How are the keywords selected? | Operators in boolean query. Recovered rows: Puglisi and Snyder (2011): 11+ keywords, likely selected by the authors, OR and AND; King et al. (2013): unspecified number, likely selected by the authors, operators unclear; Burnap et al. (2016): 33 keywords; Linder (2017, p. 5).] Note that the column 'Number of keywords' gives the number of keywords the authors in the listed studies use to extract documents relating to one entity of interest. If the authors are interested in several entities, then typically several keyword lists are applied, which is why here for some articles a range rather than a single number is given. Note also that Katagiri and Min (2019, p. 161) state that the keywords they use come from the Conflict and Peace Data Bank (COPDAB) (Azar, 2009). They do not specify how they extract keywords from this data bank.
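As a point of reference for the approaches compared later, here is a minimal, hypothetical sketch of the keyword-list approach just described: a boolean OR query over a toy corpus, evaluated with the F1-Score. The documents, labels, and keywords are invented for illustration.

```python
from sklearn.metrics import f1_score

# Invented mini-corpus; 1 = document truly refers to the entity of interest.
docs = ["refugee shelter opens in berlin",
        "stock market closes higher",
        "new asylum policy announced",
        "football results from sunday"]
y_true = [1, 0, 1, 0]

# Boolean OR query: a document is retrieved if it contains any keyword.
keywords = ["refugee", "asylum"]
y_pred = [int(any(k in doc for k in keywords)) for doc in docs]

print(f1_score(y_true, y_pred))  # 1.0 for this toy example
```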
Yet, research suggests that the human construction of keyword lists is not reliable (King et al., 2017, p. 973-975). If a researcher generates a keyword list, then another researcher or the same researcher at another point in time is likely to construct a very different set of keywords. This is problematic: Depending on which human-generated set of search terms is used to identify relevant documents, inferences drawn may vary greatly (King et al., 2017, p. 974-976). Moreover, this conventional procedure of human keyword list generation might lead to biased inferences if the terms that are used to denote an entity correlate with the values of the variable of interest. To illustrate: Imagine that a researcher is interested in attitudes toward Joe Biden as expressed in comments on an online platform during a given time period. The researcher analyzes the sentiments of all comments that contain the search term 'Biden'. The obtained results can be biased if the attitudes expressed in comments that refer to Joe Biden as 'Biden' or 'Joe Biden' differ from the attitudes in comments that refer to him as 'Sleepy Joe'. For keyword-based approaches to avoid such types of selection bias, a researcher thus has to set up a set of keywords that fully captures the universe of terms and expressions that is used to refer to the entity of interest in the given corpus. 2 But humans tend to perform very poorly when it comes to constructing an extensive set of search terms (King et al., 2017, p. 973-975).

There are several likely reasons for the problems human researchers encounter when trying to set up an extensive list of keywords. First, language is highly varied (Durrell, 2008). There are numerous ways to refer to the same entity-and entities also can be referred to indirectly without the usage of proper names or well-defined denominations (Baden et al., 2020, p. 167). Especially if the entity of interest is abstract and/or not easily denominated, the universe of terms and expressions referring to the entity is likely to be large and not easily captured (Baden et al., 2020, p. 167). Such entities are abundant in social science. Typical entities of interest, for example, are policies (e.g. the policies implemented to address the COVID-19 pandemic), concepts (e.g. European integration or homophobia), and occurrences (e.g. the 2015 European refugee crisis or the 2021 United States Capitol riot). A second likely reason for the human inability to come up with a comprehensive keyword list is inhibitory processes (Bäuml, 2007; King et al., 2017, p. 974). After a set of concepts has been retrieved from memory, inhibitory processes suppress the representation of related, non-retrieved concepts in memory and thereby reduce the probability of those concepts being recovered (Bäuml, 2007). One approach that has the potential to alleviate this second aspect is query expansion, which is discussed next.
By being able to move beyond keywords that researchers are able to recall a priori, query expansion methods can be employed to create a more comprehensive set of search terms. Query expansion techniques expand the original query (i.e. the original set of keywords) by appending related terms (Azad and Deepak, 2019, p. 1699-1700). Here, the focus is on similarity-based automatic query expansion methods, that add new terms automatically-i.e. without interactive relevance feedback from the user-and make use of the similarity between the set of query terms and potential expansion terms (Azad and Deepak, 2019, p. 1700, 1706). The underlying hypothesis used here is the association hypothesis formulated by van Rijsbergen stating that "[i]f one index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this" (van Rijsbergen, 2000, p. 11). The specific methods differ regarding

• the data source to extract candidate terms for the expansion,
• how candidate terms from this data source are ranked (such that the ranks reflect the relatedness to the original query), and
• how (many) additional terms are selected and integrated into the original query (Azad and Deepak, 2019, p. 1701).

Data sources from which expansion terms are identified can be the corpus from which relevant documents are to be retrieved, the documents retrieved by the initial query, human-created thesauri such as WordNet, knowledge bases such as Wikipedia, external corpora such as a collection of web texts, or a combination of these (Azad and Deepak, 2019, p. 1701-1704). If thesauri such as WordNet are employed as a data source, terms the thesaurus encodes to be related to the query terms can be considered candidate terms for expansion (Azad and Deepak, 2019, p. 1702). Path lengths between the synsets (word senses) in a thesaurus can be used to compute a similarity score between a query term and potential expansion terms (Azad and Deepak, 2019, p. 1705). In Wikipedia, the network of hyperlinks between articles can be used to extract articles about concepts related to the query terms (ALMasri et al., 2013). A similarity score, for example, can be computed based on shared ingoing and outgoing hyperlinks between articles (ALMasri et al., 2013, p. 6).

If the data source for query expansion is the local corpus from which documents are to be retrieved or if the data source is an external global corpus, then the similarity between terms can be assessed via similarity measures that are computed based on the terms' vector representations (Azad and Deepak, 2019, p. 1706). A frequently used measure is the cosine similarity between two term vectors a and b (Manning et al., 2008, p. 122):

$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$

If the angle between the vectors equals 0°, meaning that the vectors have the exact same orientation, the cosine is 1 (Moore and Siegel, 2013, p. 281). If the angle is 90°, meaning that the vectors are orthogonal to each other, then cos(θ) = 0 (Moore and Siegel, 2013, p. 281). 3

A frequently used term representation are word embeddings (see e.g. Diaz et al., 2016; Kuzi et al., 2016; Silva and Mendoza, 2020). A word embedding is a real-valued vector representation of a term. Important model architectures to learn word embeddings are the continuous bag-of-words (CBOW) and the skip-gram models (Mikolov et al., 2013a) as well as Global Vectors (GloVe) (Pennington et al., 2014) and fastText (Bojanowski et al., 2017).
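A minimal numerical illustration of the cosine similarity just defined; the three-dimensional 'embedding' vectors below are invented and merely stand in for real word embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented low-dimensional vectors purely for illustration.
v_refugee  = np.array([0.9, 0.1, 0.3])
v_asylum   = np.array([0.8, 0.2, 0.4])
v_football = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(v_refugee, v_asylum))    # close to 1
print(cosine_similarity(v_refugee, v_football))  # much smaller
```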
In learning the word embedding for a target term a_t, these architectures (CBOW, skip-gram, GloVe, and fastText) make use of the words that occur in a context window around a_t (Mikolov et al., 2013a, p. 4; Pennington et al., 2014, p. 1533-1535). In doing so, these procedures for learning word embeddings implicitly draw on the distributional hypothesis (Firth, 1957) stating that the meaning of a word can be deduced from the words it typically co-occurs with (Rodriguez and Spirling, 2022, p. 102). This in turn implies that semantically or syntactically similar terms are likely to have similar word embedding vectors that point in a similar direction (Bengio et al., 2003, p. 1139-1140; Mikolov et al., 2013b).

In similarity-based query expansion techniques, terms that are closest to the query terms are used as query expansion terms. The number of terms added varies from approach to approach between five and a few hundred (Azad and Deepak, 2019, p. 1714). In Silva and Mendoza (2020), for example, the original query is represented by a single vector that is computed by taking the weighted average of the word embeddings of all terms in the original query. Then the five terms whose embeddings have the highest cosine similarity with the embedding of the query are selected for expansion.

To sum up, researchers that implement query expansion methods require a data source for expansion, a way to compute a measure that captures the relatedness between terms, and a procedure that determines which and how many terms are added via which process. If they plan to represent terms as word embeddings, then either pretrained word embeddings are required or the embeddings have to be learned. Consequently, considerable resources and expertise are needed. Yet, whereas individuals due to inhibitory processes may fail to create a comprehensive list of search terms, query expansion methods can uncover terms that denote the entity of interest and are used in the corpus at hand.

As query expansion techniques have the potential to expand the initial query with synonymous and related terms, recall is likely to increase (Manning et al., 2008, p. 193). Precision, however, may decrease-especially if the added terms are homonyms or polysemes (i.e. terms that have different meanings; whereby the meanings can be conceptually distinct (homonyms) or related (polysemes)) (Manning and Schütze, 1999, p. 110; Manning et al., 2008, p. 193). It thus may be advantageous to use as a data source for query expansion a corpus or thesaurus that is specific to the domain of the retrieval task rather than a global corpus or general thesaurus (Manning et al., 2008, p. 193). Moreover, query expansion techniques require researchers to a priori come up with an initial set of query terms (which will encode the researchers' assumptions) and there is no guarantee that the expansion starting from the initial set will capture all different denominations of the entity. For example, there is no guarantee that query expansion will succeed in moving from 'Biden' to 'Sleepy Joe'. Finally, if the entity of interest is also referred to with multi-term expressions (e.g. 'United States'), then these only can be extracted if the term representations used by the expansion procedure also cover multi-term expressions. Word embeddings would have to be learned or be available also for bigrams and trigrams.
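The following sketch illustrates a similarity-based expansion step along the lines of the averaging approach just described. It assumes that a vocabulary list `vocab` and a matching embedding matrix `E` (one row per term, e.g. from pretrained GloVe or fastText vectors) are available; the function name and the unweighted averaging are simplifications and not the exact procedure of any of the cited papers.

```python
import numpy as np

def expand_query(query_terms, vocab, E, k=5):
    """Return the original query plus the k vocabulary terms whose embeddings
    are closest (by cosine similarity) to the averaged query embedding."""
    idx = [vocab.index(t) for t in query_terms]
    q = E[idx].mean(axis=0)                  # (unweighted) average query vector
    q = q / np.linalg.norm(q)
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ q                        # cosine similarity to every term
    ranked = np.argsort(-sims)
    expansion = [vocab[i] for i in ranked if vocab[i] not in query_terms][:k]
    return query_terms + expansion

# Usage (with vocab and E assumed to be loaded from pretrained embeddings):
# expanded = expand_query(["refugee", "asylum"], vocab, E, k=5)
```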
Covering multi-term expressions increases the method's complexity and the computational resources required, and it limits the availability of external, globally pretrained word embeddings. 4

Recently, Baden et al. (2020) have proposed a procedure in which documents are categorized based on classification rules that are built by researchers on the basis of topics estimated by a topic model. Baden et al. (2020) call their procedure Hybrid Content Analysis. The idea is to assign to a pre-defined category those documents for which a considerable share is estimated to come from topics that the researchers deem to be related to the category (Baden et al., 2020). Whilst Baden et al. (2020) formulate their method for multi-class or multi-label classification tasks in a descriptive manner, here the procedure is presented with precise mathematical expressions and the focus is exclusively on the binary classification task of retrieving relevant documents.

The family of topic models most widely applied in social science are Bayesian hierarchical mixed membership models that estimate a latent topic structure based on observed word frequencies in text documents (Blei et al., 2003, p. 993, 995-997; Blei and Lafferty, 2007, p. 18; Roberts et al., 2016a, p. 988; Zhao et al., 2021, p. 4713-4714). These topic models (which are here simply referred to as topic models) assume that each topic is a distribution over the terms in the corpus and each document is characterized by a distribution over topics (Blei et al., 2003, p. 995-997; Blei and Lafferty, 2007, p. 18). Given a corpus of N documents, topic models estimate a latent topic structure defined by the N × K document-topic matrix Θ and the K × U topic-term matrix B (see Figure 1). Topic-term matrix B = [β_1, ..., β_k, ..., β_K] gives for each topic k ∈ {1, ..., K} the estimated probability mass function across the U unique terms in the vocabulary: β_k = [β_k1, ..., β_ku, ..., β_kU]; whereby β_ku is the probability for the uth term to occur given topic k. Document-topic matrix Θ = [θ_1, ..., θ_i, ..., θ_N] contains for each document d_i the estimated proportion assigned to each of the K latent topics: θ_i = [θ_i1, ..., θ_ik, ..., θ_iK], with θ_ik being the estimated share of document d_i assigned to topic k.

Given the estimated latent topic structure characterized by K × U topic-term matrix B and N × K document-topic matrix Θ, the topic model-based classification rule building procedure proceeds as follows (see Figure 1) (Baden et al., 2020, p. 171-174):

1. Based on the K × U topic-term matrix B, the researcher inspects for each topic the most characteristic terms, e.g. the terms that are most likely to occur in a topic and the terms that are the most exclusive for a topic. 5 Given these terms that inform about the content of each topic, the researcher determines which topics refer to the entity of interest. The researcher then creates relevance matrix C of size K × 1 whose elements are 1 if the topic is considered relevant and 0 otherwise.
2. Then the N × K document-topic matrix Θ is multiplied with C. The resulting vector r = [r_1, ..., r_i, ..., r_N] gives for each document the sum over those topic shares that refer to relevant topics. r_i can be interpreted as the share of words in document d_i that come from relevant topics.
3. A threshold value ξ ∈ [0, 1] is set. All documents for which r_i ≥ ξ are considered to be relevant.

[Figure 1. The topic model-based classification rule can be built from any topic model that, on the basis of a corpus comprising N documents, estimates a latent topic structure characterized by two matrices: the N × K document-topic matrix Θ and the K × U topic-term matrix B. β_ku is the estimated probability for the uth term to occur given topic k. θ_ik is the estimated share assigned to topic k in the ith document. The topic model-based classification rule procedure proceeds as follows. Step 1: Researchers inspect matrix B, determine which topics are relevant, and create the K × 1 relevance matrix C. Step 2: Matrix multiplication of Θ with C yields the resulting vector r. Step 3 (not shown): Documents with r_i ≥ threshold ξ ∈ [0, 1] are retrieved.]
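A minimal numerical sketch of steps 2 and 3 of this procedure. The document-topic matrix Θ below is a random placeholder standing in for the output of a fitted topic model, and the indices of the topics deemed relevant (step 1) as well as the threshold ξ are invented.

```python
import numpy as np

# Placeholder for the N x K document-topic matrix Theta of a fitted topic model
# (in practice obtained from, e.g., LDA or STM; the K x U matrix B would be
# inspected manually in step 1 to decide which topics are relevant).
rng = np.random.default_rng(0)
N, K = 1000, 30
Theta = rng.dirichlet(np.ones(K), size=N)

# Step 1 (manual, assumed here): topics 4 and 17 are deemed relevant.
C = np.zeros(K)
C[[4, 17]] = 1

# Step 2: r_i = share of words in document i attributed to relevant topics.
r = Theta @ C

# Step 3: retrieve documents whose relevant-topic share reaches threshold xi.
xi = 0.5
relevant_docs = np.where(r >= xi)[0]
print(len(relevant_docs), "documents retrieved")
```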
The procedure utilizes a topic model as an unsupervised tool to uncover information about the latent topic structure of a corpus. Leveraging this information for the retrieval of relevant documents allows researchers to operate without a set of explicit keywords. Rather than having to come up with information about to-be-retrieved documents a priori, researchers merely have to recognize topics that refer to relevant entities. As topic models are well known and frequently developed and applied in social science (e.g. Quinn et al., 2010; Grimmer, 2013; Roberts et al., 2014; Bauer et al., 2017; Maier et al., 2018; Baerg and Lowe, 2020; Eshima et al., 2021; Schulze et al., 2021) and furthermore are implemented in corresponding software packages (e.g. Grün and Hornik, 2011; Roberts et al., 2019), the procedure of building classification rules based on topic models seems easily accessible to the social science community.

Yet, estimating a topic model in the first place induces costs. Especially the number of topics K has to be set a priori. To set a useful value for K, typically several topic models with varying K are estimated and, after a manual inspection of the most likely and most exclusive terms for each topic as well as the computation of performance metrics (e.g. held-out likelihood), researchers decide on a topic number (Roberts et al., 2016b, p. 60-62). Moreover, as topic models are unsupervised, there is no way for researchers-beyond setting parameters such as K-to guide the estimation process such that the results are related to the concepts of interest. Ideally one would like to have a topic model that produces one or several topics that refer to the entity of interest and are characterized by high semantic coherence as well as exclusivity. A coherent topic referring to the entity of interest would have high occurrence probabilities for frequently co-occurring terms that refer to the entity (Roberts et al., 2014, p. 1069; Roberts et al., 2019, p. 10). It would be clearly about the entity of interest rather than being a fuzzy topic without a nameable content. An exclusive topic would solely refer to the entity of interest and would not refer to any other entities. It is not guaranteed, however, that there is a topic that distinctly covers the relevant entity. Additionally, topic models can generate topics that relate to several entities rather than a single entity. By selecting each topic that refers to the relevant entity but also relates to several non-relevant entities, a researcher will construct a topic model-based classification rule that will be characterized by high recall but low precision. For this reason, Baden et al. (2020, p. 173) suggest setting K to a rather high value such that topics are fine-grained.
Yet, whether this will work out in a given application is unclear as the latent topic structure uncovered by the topic model cannot be forced to neatly separate topics referring to relevant entities from topics referring to non-relevant entities. And topic models also cannot be forced to produce coherent topics referring to the entity of interest at all.

Supervised learning algorithms are trained on the basis of a training data set. The training data set contains a set of documents and corresponding classes or values. In the context of retrieving relevant documents, a training set document is assigned to the relevant class if it refers to the entity of interest and is assigned to the irrelevant class otherwise. Central to the supervised learning process is the loss function. The loss function returns a cost-signifying value which is a function of the predicted and the true values of the training set documents. In an optimization process, the parameters of the supervised learning algorithm are moved toward values for which the loss function reaches a (local) minimum.

Supervised learning methods have the advantage that they come with supervision: the separation between relevant and irrelevant documents is encoded in the training data set and then learned by the model. This is a considerable advantage over automatic query expansion methods and topic model-based approaches. In the former, researchers cannot be entirely sure that the expansion really will include terms related to the initial query terms, and in the latter it is unclear whether there will be coherent and exclusive topics referring to the entity of interest. Moreover, as the true class assignments for the training set documents are known, supervised learning approaches allow researchers to use resampling techniques (e.g. cross-validation) in order to assess how well the retrieval of relevant documents works. The values for precision and recall not only provide information about the performance of the retrieval method but also indicate the nature of the (mis)classifications. (Is the model lenient in assigning documents to the positive relevant class, such that most of the relevant documents are retrieved (high recall) but there are many false positives among the retrieved documents (low precision), or is it rather the other way round?) Furthermore, just as the topic model-based approach, supervised learning techniques depend on recognizing rather than recalling: When creating the training data set, coders read the training documents and assign them to the relevant vs. irrelevant class as specified in coding instructions. Hence, supervised learning techniques require the coders to merely recognize relevant documents rather than creating information on relevant documents from scratch.

Supervised learning methods, however, also come with disadvantages. First, the labeling of training documents by human coders is extremely costly. Precise coding instructions have to be formulated, the coders have to be trained and paid, and the intercoder reliability (e.g. measured by Krippendorff's α (Krippendorff, 2013, p. 277-294)) has to be assessed. Reading an adequately large sample of documents and labeling each as relevant vs. irrelevant (or having this done by trained coders) takes time.
Second, in the context of retrieving relevant documents, it is likely that the share of relevant documents is small and thus further problems arise: If the training set documents are randomly sampled from the entire corpus from which relevant documents are to be retrieved and only a small share of documents refer to the entity of interest, then a large number of training documents has to be sampled, read, and coded such that the training data set contains a sufficiently large number of documents falling into the positive relevant class for the supervised method to effectively learn the distinctions between the relevant and the irrelevant class. If, for example, 3% of documents are relevant, then after coding 1,000 randomly sampled training documents only about 30 documents will be assigned to the relevant category. 6

What is more: If no adjustments are made, then each training set document has the same weight in the calculation of the value of the loss function. That is, the optimization algorithm attaches the same importance to the correct classification of each training set document. Yet, in a retrieval situation characterized by imbalance, researchers typically care more about the correct classification of relevant training documents than irrelevant documents (see also the argumentation in Section 2 above) (Branco et al., 2016, p. 2-4). Or put differently, missing a truly relevant document (false negative) is considered more problematic than falsely predicting an irrelevant document to be relevant (false positive) (Brownlee, 2020). So, there is the question of what to do to make the supervised learning algorithm focus on correctly detecting relevant documents.

The statistical learning community has devised a large spectrum of approaches to deal with imbalanced classification problems (for an overview see Branco et al., 2016). Among the most common and most easily applicable procedures that are employed to make the optimization algorithm put more weight on the correct classification of instances that are part of the relevant minority class are techniques that adjust the distribution of training set instances (Branco et al., 2016, p. 7-15, 21-27). This set of techniques comprises procedures such as random oversampling, random undersampling, and the synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002) (Branco et al., 2016, p. 22). 7 In random oversampling, instances of the minority class are randomly resampled with replacement and appended as duplicates to the training data set (Wang, 2020, p. 9833). In random undersampling, randomly selected instances of the majority class are removed from the training set (Wang, 2020, p. 9831). Both resampling techniques typically are applied until a user-specified distribution of class labels is reached (e.g. until the minority class contains as many instances as the majority class) (Brownlee, 2021a). Thereby, both resampling strategies make the training set more balanced and thus put more weight on the minority class than in the original training set distribution. As random oversampling implies that resampled minority instances are added as exact duplicates, random oversampling can lead to overfitting on the training data and reduced generalization performance on the test data (Branco et al., 2016, p. 22). Moreover, oversampling implies higher computational costs (Branco et al., 2016, p. 22). In random undersampling, on the other hand, information from removed majority class instances is lost (Brownlee, 2021a). 8

[Footnote 7] SMOTE (Chawla et al., 2002) is a well-known technique in which the minority class is enlarged by adding synthetically generated minority class training examples. A synthetic training instance is created by the following process: For each feature, a feature value is randomly drawn from the line joining the feature value of a randomly sampled minority class instance and the feature value of one of its Q nearest neighbors (Chawla et al., 2002, p. 328-329). This implies that SMOTE is "operating in 'feature space' rather than 'data space'" (Chawla et al., 2002, p. 328) of data that are represented in tabular form (Brownlee, 2021b). (Thus, SMOTE can be applied on a bag-of-words-based document-feature matrix but not on original sequential text data.) In contrast to a simple random oversampling procedure, SMOTE adds new instances rather than exact copies to the training data and thereby reduces the risk of overfitting (Chawla, 2005, p. 860). (Note that SMOTE typically is combined with random undersampling (Chawla et al., 2002, p. 330). There are various modifications of SMOTE and combinations of SMOTE with other techniques and models (Branco et al., 2016, p. 25-26).)

[Footnote 8] Beside the random resampling techniques mentioned here, there are also methods that perform oversampling or undersampling in an informed way, e.g. based on distance criteria (see Branco et al., 2016, p. 23-24).
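A minimal sketch of random oversampling and undersampling as just described, using the third-party imbalanced-learn package (an assumption; the study is not stated to use this package). The feature matrix and labels are synthetic placeholders with roughly the class imbalance of the Twitter task.

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))            # placeholder document features
y = np.array([1] * 30 + [0] * 970)         # ~3% relevant, 97% irrelevant

# Resample until the minority class is as large as the majority class.
X_over, y_over = RandomOverSampler(sampling_strategy=1.0, random_state=1).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(sampling_strategy=1.0, random_state=1).fit_resample(X, y)

print(Counter(y))        # roughly {0: 970, 1: 30}
print(Counter(y_over))   # roughly {0: 970, 1: 970}
print(Counter(y_under))  # roughly {0: 30, 1: 30}
```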
Beside these techniques that adjust the training set distribution, a second set of methods to address imbalanced classification problems is the usage of cost-sensitive algorithms (Branco et al., 2016, p. 27 ff.). There are specifically developed modifications of algorithms that allow for incorporating higher costs for misclassifying instances of the minority class (for an overview of these special-purpose methods see Branco et al., 2016, p. 27-29). A more general method, however, is to set up a cost matrix that specifies which cell in the confusion matrix (see Table 1 in Section 2) is associated with which cost (Elkan, 2001; Brownlee, 2020). During training, the loss of each training instance takes into account the respective cost depending on which cell the instance is in (Elkan, 2001, p. 973). In this way, higher costs can be specified for false negatives than for false positives and be directly incorporated into the training process.

The idea of the cost matrix also underlies the techniques that modify the distribution of training instances (Elkan, 2001, p. 975). The undersampling rates for the majority class or the oversampling rates for the minority class ideally should reflect the cost induced by misclassifying an instance from the respective class (Brownlee, 2020). For example, if falsely predicting an instance from the positive minority class to be negative is considered 10 times more costly than falsely predicting an instance from the negative majority class to be positive, then the cost of a false negative is 10 and the cost of a false positive is 1 (and true positives and true negatives induce no costs) (Branco et al., 2016, p. 36). Positive minority class instances then could be randomly oversampled such that their number increases by a factor of 10, or the majority class instances could be undersampled such that their number decreases by a factor of 1/10 (Branco et al., 2016, p. 36). 9 In practice, however, all discussed techniques suffer from the problem that researchers often cannot specify precise values for misclassification costs (Brownlee, 2020).

[Footnote 9] Note that the outlined relationship between cost ratios and over- or undersampling rates only holds if the threshold at which the classifier considers an instance to fall into the positive rather than the negative class is at p = 0.5 (Elkan, 2001, p. 975). Note furthermore that although it would be good practice for resampling rates to reflect an underlying distribution of misclassification costs as specified in the cost matrix, resampling with rates reflecting misclassification costs will not yield the same results as incorporating misclassification costs into the learning process (Branco et al., 2016, p. 36). One reason, for example, is that in random undersampling instances are removed entirely (Branco et al., 2016, p. 36). For information on the relationship between oversampling/undersampling, cost-sensitive learning, and domain adaptation see Kouw and Loog (2019, p. 4-5, 7).
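As a sketch of the cost-matrix idea, the snippet below approximates asymmetric misclassification costs via per-class weights in scikit-learn; the 10:1 cost ratio mirrors the illustrative example above and is not a recommendation.

```python
from sklearn.linear_model import LogisticRegression

# Illustrative cost matrix: a false negative (missed relevant document) is
# treated as 10 times as costly as a false positive; correct predictions cost 0.
cost_fn, cost_fp = 10, 1

# class_weight scales each instance's contribution to the loss by its class,
# which approximates training under the cost matrix above.
clf = LogisticRegression(class_weight={1: cost_fn, 0: cost_fp}, max_iter=1000)
# clf.fit(X_train, y_train)  # X_train, y_train are assumed to be given
```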
In the context of the task of retrieving relevant documents, researchers may be able to say that false negatives are more costly than false positives, but how much so is likely to be highly difficult to specify (Branco et al., 2016, p. 3; Brownlee, 2020).

The focus of the methods mentioned so far for imbalanced classification problems has been on the difference in the misclassification costs associated with instances from the positive minority vs. negative majority class. Yet, there are other types of cost that also should be considered: As elaborated above, the annotation of training documents is costly due to the resources required. And in the context of imbalanced classification problems, annotating a random sample of documents is inefficient as a disproportionately large number of documents has to be annotated until an acceptable number of instances from the minority class has been labeled. These training set annotation costs are the focus of active learning strategies.

Active learning refers to learning techniques in which the learning algorithm itself indicates which training instances should be labeled next (Settles, 2010, p. 4). The idea is to let the learning algorithm select instances for labeling that are likely to be informative for the learning process (Settles, 2010, p. 5). Such instances could be, for example, those instances whose prediction the learner is most uncertain about (Settles, 2010, p. 5). The underlying hypothesis is that by letting the learner actively select the instances from which it seeks to learn, as high a prediction accuracy as possible can be achieved with as small a number of annotated training instances as possible (Settles, 2010, p. 4, 5). Active learning stands in contrast to the usual supervised learning procedure in which the training set instances are randomly sampled, annotated, and then handed over to the learning algorithm. When juxtaposing active learning to this usual supervised learning procedure, the latter sometimes is called passive learning (Miller et al., 2020, p. 534). Active learning is useful in situations in which unlabeled training instances are abundant but the labeling process is costly (Settles, 2010, p. 4). There are several different scenarios in which active learning can be applied (see Settles, 2010, p. 8-12). In this study, the focus is on pool-based sampling. In pool-based sampling, a large collection of instances has been collected from some data distribution in one step (Settles, 2010, p. 11).
At the start of the learning algorithm, labels are obtained only for a very small set of instances, denoted I, whilst the other instances are part of the large pool of unlabeled instances U (Settles, 2010, p. 11). In each iteration of the active learning algorithm, the algorithm is trained on the instances in the labeled set I and makes predictions for all instances in pool U (Lewis and Gale, 1994, p. 4; Settles, 2010, p. 6, 11). The instances in pool U then are ranked according to how much information the learner would gather from an instance if it were labeled (Settles, 2010, p. 11-12). Then the most informative instances in U are selected and labeled (e.g. by human coders) (Settles, 2010, p. 6). The newly labeled instances are added to set I and a new iteration starts (Settles, 2010, p. 6). 10

In the active learning community, several different strategies of how the informativeness of an instance is defined and how the most informative instances are selected have been developed (for an overview see Settles, 2010, p. 12 ff.). These strategies are termed query strategies (Settles, 2010, p. 12). Here, the "[p]erhaps [...] simplest and most commonly used query framework" (Settles, 2010, p. 12) will be presented: uncertainty sampling (Lewis and Gale, 1994). In uncertainty sampling, those instances are considered to be the most informative about which the learning algorithm expresses the highest uncertainty (Lewis and Gale, 1994, p. 4). In the context of the binary document retrieval classification task, the uncertainty could be said to be highest for instances for which the predicted probability to belong to the relevant class is closest to 0.5 (Lewis and Gale, 1994, p. 4). 11 The usage of such a definition of uncertainty and informativeness only is possible for learning methods that return predicted probabilities (Settles, 2010, p. 12). For methods that do not, other uncertainty-based sampling strategies have been developed (see Settles, 2010, p. 14-15). With regard to SVMs, Tong and Koller (2002) have introduced three theoretically motivated query strategies. In their Simple Margin strategy, the data point that is closest to the hyperplane is selected to be labeled next (Tong and Koller, 2002, p. 53-54).

One important aspect to be kept in mind when applying active learning techniques is that because the training instances are not sampled randomly from the underlying corpus but are purposefully selected, the distribution of the class labels in training data set I and in unlabeled pool U is different from the distribution of labels in the entire corpus (Miller et al., 2020, p. 539). If the expected generalization error is to be estimated, then one option is to randomly sample a set of instances from the corpus at the very start of the analysis (Tong and Koller, 2002, p. 57; Miller et al., 2020, p. 539, 541). This set then is annotated and set aside such that it neither can become part of set I nor of set U (Tong and Koller, 2002, p. 57; Miller et al., 2020, p. 539, 541). After each learning iteration or a fixed number of iterations, the performance of the active learning algorithm then can be evaluated on this independent test set (Tong and Koller, 2002, p. 57; Miller et al., 2020, p. 539, 541).
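The following schematic sketch ties the pieces together: pool-based active learning with uncertainty sampling, where in each iteration the instances whose predicted probability of being relevant is closest to 0.5 are sent to (here hypothetical) human annotation via `human_label`. It is a simplified illustration, not the study's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(clf, X_pool, batch_size=20):
    """Indices (within X_pool) of the instances the classifier is most
    uncertain about, i.e. predicted probability of relevance closest to 0.5."""
    proba = clf.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:batch_size]

def active_learning_loop(X, human_label, n_iterations=10, batch_size=20, seed=0):
    rng = np.random.default_rng(seed)
    # Small randomly sampled seed set I (assumed to contain both classes).
    labeled = list(rng.choice(len(X), size=batch_size, replace=False))
    y = {i: human_label(i) for i in labeled}          # manual annotation
    for _ in range(n_iterations):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[labeled], [y[i] for i in labeled])
        # Pool U = all instances that have not been labeled yet.
        pool = np.array([i for i in range(len(X)) if i not in y])
        # Query the most informative (most uncertain) instances ...
        query = pool[uncertainty_sampling(clf, X[pool], batch_size)]
        for i in query:                               # ... and have them labeled.
            y[i] = human_label(i)
        labeled.extend(query)
    return clf, labeled
```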
Empirically one can say that in a majority of published works active learning reaches the same level of prediction accuracy with fewer training instances than supervised learning with random sampling of training instances (passive learning) (Lewis and Gale, 1994; Tong and Koller, 2002; Ertekin et al., 2007; Settles, 2010; Miller et al., 2020). Especially if data sets are imbalanced, active learning tends to reach the same level of classification performance with a substantively smaller number of labeled training instances than passive learning (Ertekin et al., 2007, p. 131; Ein-Dor et al., 2020, p. 7954; Miller et al., 2020, p. 543-544). Closer inspections show that during the learning process the training set I, which is selected by the active learning algorithm, is more balanced than the original data distribution (Ertekin et al., 2007, p. 133-134; Miller et al., 2020, p. 545). One likely reason for this observation is that active learning algorithms tend to pick instances for labeling from the uncertain region between the classes, and in this region of the feature space the class distribution tends to be more balanced; hence the class distribution among the instances an active learning algorithm tends to select is likely to be more balanced (Ertekin et al., 2007, p. 129, 133-134). A more balanced distribution implies that more weight is given to the minority class instances. Another likely reason is that because active learning algorithms tend to pick instances close to the boundary between the classes, they are able to learn the class boundary with a smaller number of training instances (Settles, 2010, p. 28).

In the following section, retrieving documents via keyword lists is compared to a query expansion technique, topic model-based classification rules, and active as well as passive supervised learning on the basis of three retrieval tasks. The code to replicate the analysis can be accessed via figshare at https://doi.org/10.6084/m9.figshare.19699840.v1. The analysis is conducted in R (R Core Team, 2020) and Python (Van Rossum and Drake, 2009). For the analyses pertaining to active and passive supervised learning with the pretrained language representation model BERT (standing for Bidirectional Encoder Representations from Transformers), the Python code is run in Google Colab (Google Colaboratory, 2020) in order to have access to a GPU. 12

Twitter: The first inspected retrieval task operates on a corpus comprising 24,420 German tweets. These tweets are a random sample of all German-language tweets in a larger collection of tweets that has been collected by Barberá (2016). Linder (2017) sampled the 24,420 German tweets and used CrowdFlower workers to label them. For each tweet, the label indicates whether the tweet refers to refugees, refugee policies, and the refugee crisis and thus is considered relevant or not (Linder, 2017, p. 23-24). The task of retrieving the relevant tweets from this corpus indeed is an imbalanced classification problem as only 727 out of the 24,420 tweets (2.98%) are labeled to be about the refugee topic.

SBIC: The aim of the second retrieval task is to extract all posts from the Social Bias Inference Corpus (SBIC) (Sap et al., 2020) that have been labeled to be offensive toward mentally or physically disabled people. The SBIC includes 44,671 potentially toxic and offensive posts from Reddit, Twitter, and three websites of online hate communities (Sap et al., 2020, p. 5480). 13
13 The SBIC was collected with the aim of studying implied rather than explicitly stated social biases (Sap et al., 2020, p. 5477). The subreddits and websites selected to be included in the SBIC constitute intentionally offensive online communities (Sap et al., 2020, p. 5480). The additionally included Reddit comments and tweet data sets were collected such that there is an increased likelihood that the content of the collected posts is offensive (e.g. by selecting tweets that include hashtags known to be racist or sexist) (Sap et al., 2020, p. 5480). Sap et al. (2020) used Amazon Mechanical Turk for the annotation of the posts. For each post the coder indicated, among other things, whether the post is offensive and, if so, whether the target is an individual (meaning that the post is a personal insult) or a group (implying that the post offends a social group, e.g. women, people of color) (Sap et al., 2020, p. 5479-5480). If one or several groups were targeted, the coders were asked to name the targeted group or groups (Sap et al., 2020, p. 5479-5480). The authors merge the 1,414 targeted groups into seven larger group categories (Sap et al., 2020, p. 5481). One of these group categories is mentally or physically disabled people. 2.15% of the 44,671 posts are annotated as being offensive toward the disabled. 14 The category of disabled people is selected as the focus of this study because this group category is the most coherent, capturing a well-defined group of people. Reuters: The third retrieval task is to identify all newspaper articles in the Reuters-21578 corpus (Lewis, 1997) that refer to the topic surrounding crude oil. Reuters-21578 (Lewis, 1997) is a widely used corpus for evaluating retrieval approaches (Tong and Koller, 2002; Ertekin et al., 2007; HuggingFace, 2021). The corpus contains 21,578 newspaper articles that were published on the Reuters financial newswire service in 1987 (Lewis, 1997; HuggingFace, 2021). 10,377 articles are assigned to one or several out of 135 economic subject categories called topics (Lewis, 1997). These categories are, e.g., 'gold', 'grain', and 'cotton'. Here, the 10,377 topic-annotated articles are used for the analysis. The aim is to identify the 566 (5.45%) newspaper articles that are labeled to be about the crude oil topic. The crude oil topic is the fourth largest. It is large enough to possibly contain enough documents for the algorithms to learn from and at the same time small enough such that the identification of crude oil articles can be considered an imbalanced classification problem. The three data sets employed here are selected with the aim to represent various types of retrieval tasks common in social science. Tweets, posts from online platforms, and newspaper articles are types of documents that are often analyzed in social science and whose analysis typically involves some preliminary retrieval step (see e.g. King et al., 2013; Beauchamp, 2017; Baum et al., 2018; Stier et al., 2018; Fogel-Dror et al., 2019; Zhang and Pan, 2019; Watanabe, 2021; Muchlinski et al., 2021). The entities of interest in social science studies vary widely with regard to their nature and their level of abstraction. Zhang and Pan (2019) study collective action events, Baum et al. (2018) focus on rape incidents, Puglisi and Snyder (2011) retrieve information on persons involved in political scandals, Uyheng and Carley (2020) extract tweets referring to the COVID-19 pandemic, Jungherr et al.
(2016) examine parties, candidates, and campaign events during an election campaign, and the entities of interest for Fogel-Dror et al. (2019) are Israel and the Palestinian Authority. In this study, the entities of interest range from a multi-dimensional topic that includes abstract policies, occurrences as well as a social group (refugee policies, refugee crisis, refugees), to a one-dimensional topic about a single economic product (crude oil), to a specific social group (disabled people) that is referred to in a specific (namely: offending) way. Moreover, the corpora from which documents are retrieved in social science can be thematically highly heterogeneous (as is the case with the corpus of German tweets here and with the Weibo posts studied by Zhang and Pan (2019)) or-due to the nature of the source-be more homogeneous with regard to topics, linguistic style, or attitudes (see e.g. the corpus of speeches from leaders of EU institutions and member states employed by Rauh et al. (2020) and the SBIC corpus here). Note also that the task of retrieving posts that offend disabled people involves retrieving posts that are of a specific kind (namely: offending) and refer to a specific entity (disabled people). Such a retrieval task is common in sentiment analysis, in which the aim is to extract documents that express an attitude toward a specific entity. The documents to be identified in such cases are required not only to refer to the entity of interest but also to be of a specific kind (namely: attitude-expressing in contrast to objective or fact-based). In order to compare the retrieval performance of keyword lists with the other discussed methods, keyword lists have to be generated for each of the three retrieval tasks. Due to what is known from research on the human construction of keyword lists, however, the keyword lists created by humans are likely to overlap very little and thus are likely to be unreliable (King et al., 2017, p. 973-975). This poses a problem for the planned comparison because it would be best to have a challenging and reliable basis against which the other approaches can be compared. To address this problem, the keyword lists are not constructed by humans but rather from the set of the most predictive keywords for the positive relevant class. To identify predictive keywords, for each of the three studied corpora, the documents are preprocessed into a document-feature matrix. 15 Then, logistic regression with regularization is applied. The regularization is introduced via the least absolute shrinkage and selection operator (LASSO; L1 penalty) or ridge regression (L2 penalty), depending on the outcome of hyperparameter tuning. The model is trained on the entire corpus and then the 50 most predictive terms (i.e. the terms with the highest coefficients) are extracted. The extracted terms are listed in Tables 7 to 9 in Appendix A. From each set of 50 most predictive terms, 10 keywords are randomly sampled, whereby the probability of drawing a term is proportional to the relative size of the term's coefficient. The 10 sampled keywords constitute one keyword list. The sampling of keywords from the set of predictive terms is repeated 100 times such that for each evaluated corpus there are 100 keyword lists of length 10 that serve as a basis for evaluation and comparison. 16
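The following lines sketch the sampling step just described, under the assumption that the 50 most predictive terms and their (positive) coefficients are already available; the variable names are illustrative and the seed is arbitrary.

```python
# Sketch of drawing a keyword list of 10 terms from the 50 most predictive terms,
# with sampling probability proportional to the relative coefficient size.
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for illustration

def sample_keyword_list(terms, coefficients, list_length=10):
    coefs = np.asarray(coefficients, dtype=float)   # assumed positive coefficients
    probs = coefs / coefs.sum()                     # relative coefficient size
    return list(rng.choice(terms, size=list_length, replace=False, p=probs))

# Repeating the draw 100 times yields the 100 keyword lists per corpus:
# keyword_lists = [sample_keyword_list(top50_terms, top50_coefs) for _ in range(100)]
```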
In contrast to human-constructed keyword lists, for which it would be difficult to judge whether the lists perform on the higher or lower end of all lists humans would possibly generate for the posed retrieval tasks, the keyword lists constructed here mark the situation of a good start in which the selected keywords are highly indicative of the relevant class. The keyword lists serve as the starting point for query expansion. Each keyword list is expanded via the following procedure:
1. Take a set of trained word embeddings, here denoted by {z_1, ..., z_u, ..., z_U}. 17
2. For each keyword s_v in the keyword list {s_1, ..., s_V}:
(a) Get the word embedding of the keyword: z_[s_v].
(b) Compute the cosine similarity between z_[s_v] and each word embedding z_u in the set {z_1, ..., z_u, ..., z_U}: $\cos(z_{[s_v]}, z_u) = \frac{z_{[s_v]} \cdot z_u}{\lVert z_{[s_v]} \rVert \, \lVert z_u \rVert}$.
(c) Take the M terms that are not keyword s_v itself and have the highest cosine similarity with keyword s_v. Add these M terms to the keyword list.
This query expansion strategy makes use of word embedding representations and the cosine similarity, as has been done in previous studies (e.g. Kuzi et al., 2016; Silva and Mendoza, 2020). By not merging the keyword list into a single word vector representation but rather expanding the keyword list for each keyword separately, this expansion method allows moving in a different direction for each keyword. This might help in extracting a more varied range of linguistic denominations for the entity of interest and might be especially useful if the entity is abstract or combines several dimensions (as e.g. is the case with the refugee topic that combines policies, occurrences, and a group of people). A similar procedure for query expansion has been studied by Kuzi et al. (2016).
16 For the locally trained word embeddings, the extracted predictive terms can be directly used as starting terms for query expansion. In the case of the globally pretrained word embeddings, however, not all of the highly predictive terms have a corresponding global word embedding. Hence, for the globally pretrained embeddings the 50 most predictive terms for which a globally pretrained word embedding is available are extracted: if a predictive term has no corresponding global embedding, the set of extracted predictive terms is enlarged with the next most predictive term until there are 50 extracted terms. Consequently, in Tables 7 to 9 in Appendix A, two lists of the most predictive features are shown for each corpus. Moreover, for the evaluation of the initial keyword lists of 10 predictive keywords, the local keyword lists have to be used because the global keyword lists have been adapted for the purposes of query expansion on the global word embedding space.
17 If necessary, the set of word embeddings is reduced to those embeddings whose terms occur in the corpus of interest and in the keyword list.
For each evaluated retrieval task, two different sets of word embeddings are used: embeddings that have been externally pretrained on large global corpora and embeddings trained locally on the corpus from which documents are to be retrieved. With regard to the globally pretrained embeddings for the English SBIC and the Reuters corpus, GloVe embeddings with 300 dimensions that have been trained on CommonCrawl data are made use of (Pennington et al., 2014). 18 For the German Twitter data set, 300-dimensional GloVe embeddings trained on the German Wikipedia are employed. 19 To get locally trained embeddings, a GloVe embedding model is trained on each corpus examined here.
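The following sketch illustrates the expansion step defined in the procedure above. It is an illustrative reconstruction rather than the replication code and assumes that the embedding space is available as a vocabulary list plus a matching matrix of row vectors, which holds for both the locally trained and the globally pretrained GloVe embeddings described here and in the next paragraph.

```python
# Sketch of expanding a keyword list by the M nearest terms (cosine similarity)
# per keyword. `vocab` and `embedding_matrix` are illustrative names.
import numpy as np

def expand_keyword_list(keywords, vocab, embedding_matrix, M=3):
    vocab_index = {term: i for i, term in enumerate(vocab)}
    # L2-normalize once so that dot products equal cosine similarities
    norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    normalized = embedding_matrix / np.clip(norms, 1e-12, None)
    expanded = list(keywords)
    for keyword in keywords:
        if keyword not in vocab_index:
            continue                     # e.g. no pretrained embedding available
        sims = normalized @ normalized[vocab_index[keyword]]
        added = 0
        for idx in np.argsort(-sims):
            candidate = vocab[idx]
            if candidate == keyword:
                continue                 # skip the keyword itself
            expanded.append(candidate)
            added += 1
            if added == M:
                break
    return expanded
```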
GloVe embeddings with 300 dimensions are obtained for all unigram features that occur at least 5 times in the corpus. In training, a symmetric context window size of six tokens on either side of the target feature as well as a decreasing weighting function is used; such that a token that is q tokens away from the target feature counts 1/q to the co-occurrence count (Pennington et al., 2014) . After training, following the approach in Pennington et al. (2014) , the word embedding matrix and the context word embedding matrix are summed to yield the finally applied embedding matrix. Note that in their analysis of a large spectrum of settings for training word embeddings, Rodriguez and Spirling (2022) found that the here used popular setting of using 300-dimensional embeddings with a symmetric window size of six tokens tends to be a setting that yields good performances whilst at the same time being cost-effective regarding the embedding dimensions and the context window size. The number of expansion terms M is varied from 1 to 9 such that after the expansion the lists of originally 10 keywords then comprise between 20 and 100 keywords. The original as well as the expanded keyword lists are applied on the lowercased documents. Following the logic of a boolean query with the OR operator, a document is predicted to belong to the positive relevant class if it contains at least one of the keywords in the keyword list. When constructing topic model-based classification rules, there are three steps at which researchers have to make decisions that are likely to substantively affect the results. First, after having selected a specific type of topic model that is to be used, the number of to be estimated topics K has to be set. Second, for the construction of a topic model-based classification rule, a researcher has to determine how many and which of the estimated topics are considered to be about the entity of interest (see Step 1 of the procedure described in Section 3.3). Finally, threshold value ξ ∈ [0, 1] has to be set. If the sum of topic shares relating to relevant topics of a document is ≥ ξ, the document is predicted to be relevant (see Step 2 of the procedure described in Section 3.3). In each of these decision steps a researcher may be guided by expertise and/or an exploration of the results following from deciding for one or another option. Whilst in practice a researcher has to finally settle for one of the options in each step such that a single classification rule is produced, here the aim rather is to comprehensively evaluate topic model-based classification rules and also to inspect how well topic model-based classification rules can perform if optimal decisions (w.r.t. retrieval performance) are made. Consequently, specific values for the number of topics, the number of relevant topics and threshold values are set within reasonable ranges a priori. Then, the retrieval performance for all combinations of these values is evaluated. More precisely: On each corpus seven topic models-each with a different number of topics K ∈ {5, 15, 30, 50, 70, 90, 110}-are estimated. Then, for each estimated topic model with a specific topic number, initially only one topic is considered relevant, then two topics, and then three. For each number of topics considered to be relevant, all possible combinations regarding the question which topics are considered relevant are evaluated. 
This implies that all ways of choosing one, two, and three relevant topics (irrespective of the order in which they are selected) from the overall sets of 5, 15, 30, 50, 70, 90, and 110 topics have to be determined and evaluated. This amounts to 426,725 combinations-all of which are evaluated here. 20 Finally, for each of the 426,725 combinations, four different threshold values ξ are inspected: 0.1, 0.3, 0.5, and 0.7. Whereas ξ = 0.7 only considers those documents to be relevant that have 70% of the words they contain estimated to be generated by relevant topics, ξ = 0.1 is the most lenient solution, in which all documents are classified to be relevant that have 10% of their words assigned to relevant topics. As ξ increases, recall is likely to decrease and precision is likely to increase. The type of topic model estimated here is a Correlated Topic Model (CTM) (Blei and Lafferty, 2007). The CTM extends the basic Latent Dirichlet Allocation (LDA) (Blei et al., 2003) by allowing topic proportions to be correlated. For more details on the CTM see Blei and Lafferty (2007). 21
20 For example, in a topic model with K = 15 topics, there are 15 ways to select one relevant topic from 15 topics (namely: the first, the second, ..., and the 15th); there are $\binom{15}{2} = 105$ ways of choosing two relevant topics from the set of 15 topics; and there are $\binom{15}{3} = 455$ ways to pick three topics from 15 topics.
21 The CTM is estimated via the stm R-package (Roberts et al., 2019) that originally is designed to estimate the Structural Topic Model (STM) (Roberts et al., 2016a). The STM extends the LDA by allowing document-level variables to affect the topic proportions within a document (topical prevalence) or to affect the term probabilities of a topic (topical content) (Roberts et al., 2016a, p. 989). If no document-level variables are specified (as is done here), the STM reduces to the CTM (Roberts et al., 2016a, p. 991). In estimation, the approximate variational expectation-maximization (EM) algorithm as described in Roberts et al. (2016a, p. 992-993) is employed. This estimation procedure tends to be faster and tends to produce higher held-out log-likelihood values than the original variational approximation algorithm for the CTM presented in Blei and Lafferty (2007) (Roberts et al., 2019, p. 29-30). The model is initialized via spectral initialization (Arora et al., 2013; Roberts et al., 2016b, p. 82-85; Roberts et al., 2019, p. 11). The model is considered to have converged if the relative change in the approximate lower bound on the marginal likelihood from one step to the next is smaller than 1e-04 (Roberts et al., 2016a, p. 992; Roberts et al., 2019, p. 10, 28).
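To make the classification rule and the combinatorics concrete, the following minimal sketch (illustrative names, not the replication code) applies the threshold rule to an assumed document-topic matrix and verifies the number of evaluated combinations stated above.

```python
# Sketch of the topic model-based classification rule and a check of the number
# of evaluated combinations. `doc_topic_shares` is assumed to be a documents x K
# matrix of estimated topic proportions.
from itertools import combinations
from math import comb
import numpy as np

def classify_by_topic_rule(doc_topic_shares, relevant_topics, xi):
    """Predict a document as relevant if the summed shares of the
    relevant topics are greater than or equal to threshold xi."""
    relevant_share = doc_topic_shares[:, list(relevant_topics)].sum(axis=1)
    return relevant_share >= xi

# All ways of selecting 1, 2, or 3 relevant topics from each evaluated K:
total = sum(comb(K, r) for K in (5, 15, 30, 50, 70, 90, 110) for r in (1, 2, 3))
print(total)  # 426725, i.e. the 426,725 combinations evaluated in the study

# Enumerating the actual combinations, e.g. for K = 15 and two relevant topics:
# for combo in combinations(range(15), 2): ...
```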
Two types of supervised learning methods are employed: first, Support Vector Machines (SVMs) (Boser et al., 1992; Cortes and Vapnik, 1995), and second, BERT (standing for Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019). SVMs have been applied frequently and relatively successfully to text classification tasks in social science (Diermeier et al., 2011; D'Orazio et al., 2014; Baum et al., 2018; Pilny et al., 2019; Sebők and Kacsuk, 2020; Erlich et al., 2021), and also in active learning settings (Miller et al., 2020). An SVM operates on a document-feature matrix, X, in which each document is represented as a feature vector of length U: $x_i = (x_{i1}, \dots, x_{iu}, \dots, x_{iU})$. The information contained in the vector's entries, $x_{iu}$, typically is based on the frequency with which each of the U textual features occurs in the ith document (Turney and Pantel, 2010, p. 147). Given the document feature vectors, $\{x_i\}_{i=1}^{N}$, and corresponding binary class labels, $\{y_i\}_{i=1}^{N}$, whereby $y_i \in \{-1, +1\}$, an SVM tries to find a hyperplane that separates the training documents as well as possible into the two classes (Cortes and Vapnik, 1995). 22 The document-feature matrix regards each document as a bag of words (Turney and Pantel, 2010, p. 147). A bag-of-words representation only encodes information on the weighted frequency with which the terms occur in documents, but disregards word order, contextual information, and dependencies between the tokens in a document (Turney and Pantel, 2010, p. 147). Yet, a document is a sequence-not a bag-of tokens among which dependencies exist. Moreover, the meaning of a word often depends on the context of other words in which it is embedded. (Take, for instance, the homonyms 'bank' or 'party'.) In order to also use a supervised learning method that processes a document as a sequence of tokens and captures dependencies between tokens as well as context-dependent meanings of tokens, the Transformer-based language representation model BERT is additionally employed here.
22 To create the required vector representation for each document, the following text preprocessing steps are applied: The documents are tokenized into unigram tokens. Punctuation, symbols, numbers, and URLs are removed. The tokens are lowercased and stemmed. Subsequently, terms whose mean tf-idf value across all documents in which they occur belongs to the lowest 0.1% (Twitter, SBIC) or 0.2% (Reuters) of mean tf-idf values of all terms in the corpus are discarded. Also terms that occur in only one (Twitter) or two (SBIC, Reuters) documents are removed. Finally, a boolean weighting scheme, in which only the absence (0) vs. presence (1) of a term in a document is recorded, is applied to the document-feature matrix. To determine the hyperparameter values for the SVMs, hyperparameter tuning via a grid search across sets of hyperparameter values is conducted in a stratified 5-fold cross-validation setting on one fold of the training data. A linear kernel and a Radial Basis Function (RBF) kernel are tried. Moreover, for the inverse regularization parameter C, which governs the trade-off between the slack variables and the training error, the values {0.1, 1.0, 10.0, 100.0} (linear) and {0.1, 1.0, 10.0} (RBF) are inspected. Additionally, for the RBF's parameter γ, which governs a training example's radius of influence, the values {0.001, 0.01, 0.1} are evaluated. 23 The folds are stratified such that the share of instances falling into the relevant minority class is the same across all folds. In each cross-validation iteration, in the folds used for training, random oversampling of the minority class is conducted such that the number of relevant minority class examples increases by a factor of 5. Among the inspected hyperparameter settings, the setting that achieves the highest F1-Score regarding the prediction of the relevant minority class and does not exhibit excessive overfitting is selected.
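The footnote above fully specifies the SVM tuning setup; the following sketch is one possible reconstruction of it with scikit-learn (illustrative, not the study's replication code). It assumes the relevant minority class is coded as 1 and, for brevity, runs the grid search on the data it is given rather than on a single training fold.

```python
# Sketch of a grid search over kernel, C, and gamma with stratified 5-fold
# cross-validation and random oversampling of the minority class by a factor of 5.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def oversample_minority(X, y, factor=5, seed=0):
    """Duplicate minority-class rows so that their count increases by `factor`."""
    rng = np.random.default_rng(seed)
    minority_idx = np.where(y == 1)[0]
    extra = rng.choice(minority_idx, size=(factor - 1) * len(minority_idx), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

param_grid = (
    [{"kernel": "linear", "C": C} for C in (0.1, 1.0, 10.0, 100.0)]
    + [{"kernel": "rbf", "C": C, "gamma": g}
       for C in (0.1, 1.0, 10.0) for g in (0.001, 0.01, 0.1)]
)

def tune_svm(X, y):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    mean_scores = []
    for params in param_grid:
        fold_scores = []
        for train_idx, val_idx in skf.split(X, y):
            X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx])
            model = SVC(**params).fit(X_tr, y_tr)
            fold_scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
        mean_scores.append(np.mean(fold_scores))
    return param_grid[int(np.argmax(mean_scores))]
```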
BERT is a deep neural network based on the Transformer architecture (Vaswani et al., 2017). The central element of the Transformer architecture is the (self-)attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017). This mechanism allows the representation of each token to include information from the representations of the other tokens in the same sequence (Vaswani et al., 2017, p. 6001-6002), thereby enabling the model to produce token representations that encode contextual information and token dependencies. BERT typically is applied in a sequential transfer learning setting (Devlin et al., 2019, p. 4175, 4179). In sequential transfer learning, a model first is pretrained on a source task (Ruder, 2019, p. 64). In pretraining, the aim is to learn model parameters such that the model can function as a well-generalizing input to a large range of different target tasks (Ruder, 2019, p. 64). Then, in the following adaptation phase, the pretrained model (with its pretrained parameters) serves as the input for the training process on the target task (Ruder, 2019, p. 64). The transferral of information (in the form of pretrained model parameters) to the learning process of a target task tends to reduce the number of training instances required to reach the same level of prediction performance as when not applying transfer learning and training the model from scratch (Howard and Ruder, 2018, p. 334). This potential of pretrained deep language representation models to reduce the number of required training instances is highly important for the application of deep neural networks in practice: In text classification tasks, deep neural networks tend to outperform conventional machine learning methods (such as SVMs) that often are applied on bag-of-words representations (Socher et al., 2013; Ruder, 2020). But deep neural networks have a much higher number of parameters to learn than conventional models and thus require many more training instances. In situations in which the annotation of training instances is expensive or inefficient-such as in the context of retrieval with a strong imbalance between the relevant vs. irrelevant class-applying a deep neural network from scratch may become prohibitively expensive. In a transfer learning setting, however, an already pretrained deep language representation model merely has to be fine-tuned to the target task at hand. If the pretrained model generalizes well, the number of training instances required to reach the same level of performance as a deep neural network that is not used in a transfer learning setting is reduced by several times (Howard and Ruder, 2018, p. 334). This allows deep neural networks to be applied to natural language processing tasks for which only relatively few training instances are available. Moreover, Ein-Dor et al. (2020) show that especially in imbalanced classification settings active learning strategies can further improve the prediction performance of BERT such that even fewer training instances are needed for the same performance levels. 24 There are two limiting factors when applying BERT: First, due to memory limitations, BERT cannot process text sequences that are longer than 512 tokens (Devlin et al., 2019, p. 4183). This poses no problem for the Twitter corpus, which has a maximum sequence length of 73 tokens. In the Reuters news corpus, however, whilst the largest share of articles is shorter than 512 tokens, there is a long tail of longer articles comprising up to around 1,500 tokens. 25 Following the procedure by Sun et al.
(2019), Reuters news stories that exceed 512 tokens are reduced to the maximum accepted token length by keeping the first 128 and the last 382 tokens whilst discarding the remaining tokens in the middle. 26 The maximum sequence length recorded for the SBIC is 354 tokens. In order to reduce the required memory capacities, the few posts that are longer than 250 tokens are shortened to 250 tokens by keeping the first 100 and the last 150 tokens. The second limiting factor is that the prediction performance achieved by BERT after fine-tuning on the target task can vary considerably-even if the same training data set is used for fine-tuning and only the random seeds that initialize the optimization process and set the order of the training data differ (Devlin et al., 2019, p. 4176; Phang et al., 2019, p. 5-7; Dodge et al., 2020). Especially when the training data set is small (e.g. smaller than 10,000 or 5,000 documents), fine-tuning with BERT has been observed to yield unstable prediction performances (Devlin et al., 2019, p. 4176; Phang et al., 2019, p. 5-7). Recently, Mosbach et al. (2021) established that the variance in the prediction performance of BERT models that have been fine-tuned on the same training data set with different seeds is to a large extent likely due to vanishing gradients in the fine-tuning optimization process. Mosbach et al. (2021, p. 5) also note that it is not that small training data sets per se yield unstable performances but rather that if small data sets are fine-tuned for the same number of epochs as larger data sets (typically for 3 epochs), then smaller data sets are fine-tuned for a substantively smaller number of training iterations-which in turn negatively affects the learning rate schedule and the generalization ability (Mosbach et al., 2021, p. 4-5). Finally, Mosbach et al. (2021, p. 2, 8-9) show that fine-tuning with a small learning rate (in the paper: 2e-05), with warmup, bias correction, and a large number of epochs (in the paper: 20) not only tends to increase prediction performances but also significantly decreases the performance instability in fine-tuning. Here, the advice of Mosbach et al. (2021) is followed. For BERT, the AdamW algorithm (Loshchilov and Hutter, 2019) with bias correction, a warmup period lasting 10% of the training steps, and a global learning rate of 2e-05 is used. Training is conducted for 20 epochs. Dropout is set to 0.1. The batch size is set to 16. For all applications, the pretrained BERT models are taken from HuggingFace's Transformers open source library (Wolf et al., 2020). The BERT model used as a pretrained input for the English applications based on the SBIC and the Reuters corpus has been pretrained on the English Wikipedia and the BooksCorpus (Zhu et al., 2015), as in the original BERT paper (Devlin et al., 2019). For the data set of German tweets, a German BERT model pretrained on, among other sources, Wikipedia and CommonCrawl data by the digital library team at the Bavarian State Library is used (Münchener Digitalisierungszentrum der Bayerischen Staatsbibliothek (dbmdz), 2021). All BERT models are employed in the base (rather than the large) model version and operate on lowercased (rather than cased) tokens.
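The following sketch shows how the stated fine-tuning configuration and the head/tail truncation could be set up with HuggingFace's Transformers library and PyTorch. It is an illustrative sketch rather than the study's replication code; the model identifiers and helper names are assumptions, and special tokens are assumed to be handled by the tokenizer.

```python
# Sketch of the fine-tuning setup described above: head/tail truncation,
# AdamW, a warmup period of 10% of the training steps, lr 2e-05, 20 epochs,
# batch size 16 (dropout 0.1 is the default of BERT-base configurations).
import torch
from torch.utils.data import DataLoader
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

def truncate_head_tail(token_ids, max_len=512, head=128, tail=382):
    """Keep the first `head` and last `tail` tokens of an overly long sequence."""
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[:head] + token_ids[-tail:]

model_name = "bert-base-uncased"  # e.g. "dbmdz/bert-base-german-uncased" for the German tweets
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def build_training_setup(train_dataset, epochs=20, batch_size=16, lr=2e-5, warmup_share=0.1):
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # Adam variant with bias correction
    total_steps = len(loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_share * total_steps),
        num_training_steps=total_steps,
    )
    return loader, optimizer, scheduler
```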
For both models, SVM and BERT, an active and a passive supervised learning procedure is implemented. The procedures consist of the following steps. (If the procedures differ between the active and the passive learning setting, this will be explicitly pointed out.)
• The data are randomly separated into 10 (SBIC, Twitter) or 5 (Reuters) equally sized folds.
• Then, for each fold g of the 10 (SBIC, Twitter) or 5 (Reuters) folds into which the data have been separated, the following steps are conducted:
1. Fold g is set aside as a test set.
2. From the remaining folds, 250 instances are randomly sampled to form the initial set of labeled instances I. The other instances constitute the pool of unlabeled instances U.
3. The model is trained on the instances in set I and afterward makes predictions for all instances in pool U and the set aside test fold g. Recall, precision, and the F1-Score for the predictions made for pool U and test fold g are separately recorded. During training in the passive learning setting, random oversampling of the instances falling into the positive relevant class is conducted such that the number of positive relevant instances increases by a factor of 5-thereby reflecting a cost matrix in which the cost of a false negative prediction is set to 5 and the cost of a false positive prediction is set to 1. In the active learning setting, no random oversampling is conducted.
4. A batch of 50 instances from pool U is added to the set of labeled instances in set I. In passive learning, these 50 instances are randomly sampled from pool U. In active learning, the following query strategies are applied: In the active learning setting with BERT, the 50 instances whose predicted probability of falling into the positive relevant class is closest to 0.5 are selected. When applying an SVM for active learning, the 50 instances with the smallest perpendicular distance to the hyperplane are retrieved and added to I.
Steps 3 and 4 are repeated for 15 iterations, i.e. until set I comprises 1,000 labeled instances. Hence, passive supervised learning with random oversampling and pool-based active learning with uncertainty sampling are applied. As the described learning procedures are repeated 10 (SBIC, Twitter) or 5 (Reuters) times and are evaluated on each of the 10 (SBIC, Twitter) or 5 (Reuters) folds into which the data have been separated, this allows taking the mean of the F1-Scores across the 10 (SBIC, Twitter) or 5 (Reuters) test folds as an estimate of the expected generalization error of the applied models. The results are presented in Figures 2 to 9 and Tables 3 to 6. Figure 2 visualizes, for each of the three studied retrieval tasks (Twitter, SBIC, Reuters), the F1-Scores resulting from the application of the 100 keyword lists of 10 highly predictive terms as well as the evolution of the F1-Scores across the query expansion procedure based on locally trained GloVe embeddings (top row) and globally pretrained GloVe embeddings (bottom row). In general, the retrieval performances of the initial keyword lists of 10 predictive keywords are mediocre. Only the initial keyword lists for the Reuters corpus achieve what could be called acceptable performance levels. The maximum F1-Scores reached by the initial lists of 10 predictive keywords are 0.417 (Twitter), 0.404 (SBIC), and 0.645 (Reuters). 27 Moreover, even with the empirically driven procedure for the construction of keyword lists used here, the variation in the initial keyword lists' retrieval performances is considerable. The difference between the maximum and the minimum F1-Scores is 0.267 (Twitter), 0.270 (SBIC), and 0.381 (Reuters).
Interestingly, the applied query expansion technique tends to decrease rather than increase the F1-Score and only shows some improvement of the F1-Score for the Twitter and SBIC data sets-and only if operating on the basis of word embeddings that are trained on large global external corpora rather than the local corpus at hand. There are several factors that are likely to play a role here. First, when retrieving those terms that have the highest cosine similarity with an initial starting term, the terms retrieved from the global embedding space seem semantically or syntactically related to the initial term, whereas this is not the case for the local word embeddings (as an example see Table 3). One reason why the local embedding space does not yield word embeddings that position related terms closely together could be that the three corpora used here are relatively small. The information provided by the context window-based co-occurrence counts of terms thus could be too little for the embeddings to be trained effectively. Second, in the global embedding space, terms with high cosine similarities seem to be closely related to the initial query term (see again Table 3). Adding these related terms nevertheless decreases the retrieval performance (as measured by the F1-Score) for the Reuters corpus. In the case of the Twitter and SBIC corpora, adding the terms with the highest cosine similarities slightly increases the F1-Score for some iterations at some expansion steps (especially at the beginning), whereas at other points there are decreasing or no visible effects. Here, a second factor comes into play: As is to be expected, query expansion increases recall and decreases precision (see Figures 11 and 10 in Appendix B). Hence, in general query expansion is only worthwhile if-and as long as-the increase in recall outweighs the decrease in precision. Applying the initial set of 10 highly predictive keywords on the Twitter and SBIC data sets yields a retrieval result that is characterized by low recall and high precision, whereas applying the initial set of 10 highly predictive keywords on the Reuters corpus leads to very high (sometimes even perfect) recall and low precision (see Figures 11 and 10 in Appendix B). Whereas in the second situation there is no room for query expansion to further improve the retrieval performance via increasing recall (and thus the F1-Score for the Reuters corpus is moving downward), in the low-recall-high-precision situation of the Twitter and SBIC data sets there is at least the potential for query expansion to increase recall without causing a too strong decrease in precision. This potential is realized in some iterations at some expansion steps, but the decrease in precision more often than not tends to outweigh the increase in recall. A further reason why query expansion does not perform very well even for global embeddings is the meaning conflation deficiency (Pilehvar and Camacho-Collados, 2020, p. 60): Because word embedding models such as GloVe represent one term by a single embedding vector, a polyseme or homonym is likely to have the various meanings that it refers to encoded within its single representation vector (Neelakantan et al., 2014, p. 1059). The meanings get conflated into one representation (Schütze, 1998, p. 102). This is also what happens here.
Moreover, it also seems that the conflation of meanings for the GloVe embeddings that have been pretrained on large, global corpora proceeds unequally: the global embedding space tends to position polysemous or homonymous terms close to terms that are semantically or syntactically related to the most common and general meaning of the polysemous or homonymous term (see for example the term 'vegetables' in Table 3). Query expansion in the global embedding space thus fails if an initial query term is a polyseme or homonym and its intended meaning is highly context-specific. Figure 3 presents the F1-Scores reached by topic model-based classification rules. The most notable aspect is that the retrieval performance of topic model-based classification rules is low for the Twitter and SBIC corpora and relatively high for the Reuters corpus. The highest F1-Score reached in the Twitter retrieval task is 0.253, the highest score regarding the SBIC is 0.175, whereas on the Reuters corpus a score of 0.685 is achieved. To better understand this result, the terms with the highest occurrence probabilities and the terms with the highest FREX-Score are inspected. The FREX metric is the weighted harmonic mean of a term's occurrence probability $\beta_{ku}$ and a term's exclusivity, which is given by $\beta_{ku} / \sum_{j=1}^{K} \beta_{ju}$ (Roberts et al., 2016a, p. 993):
$$\text{FREX}_{ku} = \left( \frac{\omega}{\text{ECDF}\left(\beta_{ku} / \sum_{j=1}^{K} \beta_{ju}\right)} + \frac{1-\omega}{\text{ECDF}\left(\beta_{ku}\right)} \right)^{-1},$$
whereby ECDF stands for the empirical cumulative distribution function and ω is the weight balancing the two measures. Here ω is set to 0.5. This inspection (see Tables 4, 5, and 6) helps to explain the result.
Note to Figure 3: In the Twitter data set, the two best performing combinations for a topic model with K = 70 topics both regard three topics as being related to the refugee topic, so the label for both is 70-3. (The two combinations, of course, differ regarding which three topics they assume to be relevant.) For each combination, the F1-Score for each evaluated threshold value ξ ∈ {0.1, 0.3, 0.5, 0.7} is given. Classification rules that assign none of the documents to the positive relevant class have a recall value of 0 and an undefined value for precision and the F1-Score. Undefined values here are visualized by the value 0.
Note to Table 4: The table presents the 3 terms with the highest probability (Prob.) and the 3 terms with the highest FREX-Score (FREX). See Topic 4 for the here only moderately coherent Pegida topic. Note that German umlauts are removed because the preprocessing procedure for the CTM involved stemming, which here also implied removing umlauts.
The entity of interest in the Twitter data set is multi-dimensional. It includes refugees as a social group, refugee policies, as well as actions and occurrences revolving around the refugee crisis. When examining the most likely and exclusive terms for topic models estimated on the Twitter corpus, it becomes clear that not each aspect of this multi-dimensional refugee topic is captured in a coherent and exclusive topic (see for example Table 4). In each model with K ≥ 30 there is one relatively coherent topic on Pegida, an anti-Islam and anti-immigration movement that held many demonstrations in the context of the refugee crisis. Besides that, there are further more or less integrated topics that touch upon refugees and refugee policies without, however, being exclusively about these entities. Thus, the topic models do not offer a set of topics that, taken together, cover all dimensions of the refugee topic in an exclusive manner.
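As a concrete illustration of the FREX metric defined above, the following sketch computes FREX-Scores from an estimated topic-term matrix with ω = 0.5. It is an assumed reconstruction based on the formula in Roberts et al. (2016a), not the stm package's implementation; in particular, the rank-based ECDF is one possible convention.

```python
# Sketch of computing FREX-Scores from a topic-term matrix `beta` of shape
# (K topics x V terms), with weight omega = 0.5.
import numpy as np

def frex_scores(beta, omega=0.5):
    K, V = beta.shape
    exclusivity = beta / beta.sum(axis=0, keepdims=True)   # beta_ku / sum_j beta_ju
    # rank-based ECDF values over the terms of each topic, in (0, 1]
    ecdf_freq = (np.argsort(np.argsort(beta, axis=1), axis=1) + 1) / V
    ecdf_excl = (np.argsort(np.argsort(exclusivity, axis=1), axis=1) + 1) / V
    # weighted harmonic mean of the two ECDF values
    return 1.0 / (omega / ecdf_excl + (1.0 - omega) / ecdf_freq)

# The 3 highest-FREX terms of topic k: np.argsort(-frex_scores(beta)[k])[:3]
```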
Regarding the SBIC, the situation is even more disadvantageous for the application of topic model-based classification rules. Across all CTMs estimated on the SBIC there is no topic that identifiably relates to disabled people in a disrespectful way (as an example see Table 5). The CTMs with higher topic numbers include some topics that very slightly touch upon disabilities, but these topics are not coherent. Applying topic model-based classification rules in this situation is futile. Among all evaluated 426,725 × 4 = 1,706,900 settings, an F1-Score of 0.175 is as good as it maximally gets.
Note to Table 6: The table presents the 3 terms with the highest probability (Prob.) and the 3 terms with the highest FREX-Score (FREX). See Topic 14 for the crude oil topic and Topic 13 for the military topic that at times touches crude oil.
The situation is entirely different for the crude oil topic. For K ≥ 30, each estimated topic model contains at least one coherent topic that clearly refers to aspects of crude oil (e.g. 'opec', 'bpd', 'oil'; see Table 6). These coherent crude oil topics are not completely but relatively exclusive. Some of the crude oil topics also cover another energy source (namely: 'gas') and there is one reappearing conflict topic that refers to military aspects but also touches crude oil ('gulf', 'missil', 'warship', 'oil'). Other than that, no other entities are substantially covered by the crude oil topics. Building topic model-based classification rules on the basis of these crude oil topics yields relatively high recall and precision values. Hence, topic model-based classification rules can be a useful tool-but only if the estimated topics coherently and exclusively cover the entity of interest in all its aspects. In all three applications, and as is to be expected, high recall and low precision values tend to be achieved for topic models with smaller numbers of topics and lower values for threshold ξ, whereas low recall and high precision values tend to result from topic models with a higher number of topics and higher values for ξ (see Figure 12 in Appendix C). 28 Classification rules that use topic models with a higher topic number and lower threshold ξ tend to exhibit neither the highest recall nor the highest precision values, but they tend to strike the best balance between recall and precision and achieve the highest F1-Scores (see Figure 3 here and Figure 12 in Appendix C). Across all three applications and for BERT as well as SVM, active learning with uncertainty sampling tends to dominate passive learning with random oversampling. Passive learning with random oversampling on average only shows a similar or higher F1-Score for the first learning iteration (i.e. at the start when training is conducted on the randomly sampled training set of 250 labeled instances). Then, however, the active learning retrieval performance strongly increases such that, for the same number of labeled training instances, active learning on average produces a higher F1-Score than passive learning. One likely reason for this difference between passive and active learning is revealed in Figure 5, which contains the same information as Figure 4, except that on the y-axis not the F1-Score but the share of documents from the positive relevant class in the training set is shown. The black dashed line visualizes the share of relevant documents in the entire corpus and thus would be the expected share of relevant documents in a randomly sampled training set if neither random oversampling nor active learning were conducted.
In passive learning with random oversampling (shown in blue), the 50 training instances that are added in each step to the set of labeled training instances I are randomly sampled from pool U. Then, the relevant instances in set I are randomly oversampled such that their number increases by a factor of 5. For this reason, the share of positive training instances in the passive learning setting is higher than in the corpus (black dashed line) but remains relatively constant across the training steps. In active learning (shown in red), no random oversampling is conducted-which is why at the beginning the share of relevant documents in I approximately equals the share of relevant documents in the corpus. Then, however, active learning at each step selects the 50 instances the algorithm is most uncertain about. As has been observed in other studies before (Ertekin et al., 2007, p. 133-134; Miller et al., 2020, p. 545), this implies that disproportionately many instances from the relevant minority class are selected into set I. The share of positive training instances increases substantively-which in turn tends to increase generalization performance on the test set, as shown in Figure 4. When decomposing the F1-Score into recall and precision (see Figure 13 in Appendix D), it is revealed that the supervised models' recall values gradually improve as the number of training instances increases. The precision values reach higher levels early on and exhibit a more volatile path. The observed retrieval performance enhancements hence are particularly caused by the models becoming better, from step to step, at identifying a larger share of the truly relevant documents in the corpora. Models trained in active rather than passive learning mode tend to yield higher recall values. Yet, there is the question of whether active learning exhibits a superior performance to passive learning with random oversampling simply because after a certain number of training steps the share of positive training instances is higher for active than for passive learning, or whether active learning dominates passive learning (also) because active learning, due to focusing on the uncertain region between the classes and due to operating on unique rather than duplicated positive training instances, learns a better generalizing class boundary with fewer training instances (Settles, 2010, p. 28). To inspect this question, for the SVMs passive learning with random oversampling is repeated, whereby positive relevant documents are randomly oversampled such that their number increases by a factor of 10 (Reuters), 17 (Twitter), or 20 (SBIC) (instead of a factor of 5 as before). This results in higher shares of relevant documents in the training set for passive learning (see right column in Figure 6). Yet, the prediction performance on the test set as measured by the F1-Score either does not increase or increases only minimally compared to the situation of random oversampling by a factor of 5 (see left column in Figure 6). Moreover, although the share of relevant documents in the more strongly oversampled passive learning training data sets is similar to that of active learning, active learning still yields considerably higher F1-Scores. This indicates that, from a certain point onward, merely duplicating positive instances by random oversampling has no or only a small effect on the class boundary learned by the SVM.
The finding also indicates that active learning improves upon passive learning because it is effectively able to select a large share of truly positive documents for training that are not duplicated but unique, and because its selection of uncertain documents provides crucial information on the class boundary. To be more precise: When applying SVMs to imbalanced data sets, the problem in general is that the learned hyperplane tends to be positioned too close to the positive minority class instances (Akbani et al., 2004, p. 40-44). The reason is that whereas the many negative training instances occur across the entire area belonging to the negative class, the few positive instances only occur at a few points within the area belonging to the positive class (Akbani et al., 2004, p. 40-44). Hence, the boundary of the negative area is well represented in the training data whereas the boundary of the positive area is not (for an illustration see Akbani et al., 2004, p. 43-44). The hyperplane can be moved toward the negative side by giving more weight to positive instances, e.g. by duplicating them via random oversampling or by introducing different costs for misclassifying positive vs. negative instances (Veropoulos et al., 1999). A problem that often arises when doing so, however, is that the hyperplane tends to overfit on the positive instances. Its orientation and shape too strongly tend to reflect the positions of the positive instances (Akbani et al., 2004, p. 46). In active learning, in contrast, the training instances the learning algorithm requests to be labeled and added next are unique-which has a positive effect on generalization performance. 30
30 Consider the following example: Assume that the share of positive instances in a corpus is 2.5%. Assume further that there are two training data sets, each comprising 1,000 training instances, and that the share of positive instances in each is 30%. This high share of positive instances has been generated by random oversampling in one training data set and by selection via active learning in the other. Then, in the training set generated with random oversampling, the 300 positive training instances are mere duplicates of around 25 unique instances. In the training data set selected by active learning, in contrast, each of the 300 positive training instances is a unique instance. Therefore, active learning provides much more information on the distribution of positive training instances than passive learning with random oversampling. There are more unique positive instances observed across the area belonging to the positive class, thereby providing more information on the possible boundary of the positive area. In active as compared to passive learning with random oversampling, the fact that there are more unique positive training instances is likely to make the hyperplane smoother and thus is likely to produce a better generalizing hyperplane.
Note to Figure 6: The results are presented for passive learning with a random oversampling factor of 5 (blue lines), passive learning with random oversampling factors of 17 (Twitter), 20 (SBIC), and 10 (Reuters) (golden lines), as well as pool-based active learning with uncertainty sampling (red lines). The thick and dark blue, golden, and red lines give the means across the iterations.
Another important observation concerns the performances' variability (see again Figure 4): For all models and learning modes, given a fixed number of labeled training instances, the F1-Scores on the set aside test sets can vary considerably between iterations. Which set of documents is randomly sampled to form the (initial) training set and which set aside test fold is used for evaluation thus can have a profound effect on the measured retrieval performance. A further observation is that BERT on average tends to outperform the SVM (see also Figure 14 in Appendix E). Hence, the Transformer-based pretrained language representation model BERT, with its ability to learn context-dependent meanings of tokens and with the information acquired in the pretraining phase, here tends to be better able to identify the few relevant documents than an SVM operating on bag-of-words representations. That BERT can only process sequences of at maximum 512 tokens is not a particularly limiting or performance reducing factor here. The performance difference between the two learning methods is distinct and relatively consistent with regard to the Twitter and Reuters retrieval tasks. It is less clear cut for the SBIC. This is rather surprising as in the SBIC the disrespectful remarks toward disabled people are implied rather than stated explicitly. Note also that with regard to the Twitter and SBIC retrieval tasks, BERT exhibits a higher instability from one learning step to the next as the number of labeled training instances in I increases by a batch of 50 documents (see Figure 15 in Appendix E). On the Reuters corpus, where the retrieval task seems much easier and all training and testing iterations with BERT settle early on similar, high performing solutions, instabilities are minimal. With regard to the more complex Twitter and SBIC retrieval tasks, however, after adding a new batch of 50 labeled training instances and fine-tuning BERT on this new, slightly expanded training data set, the F1-Score achieved on the test set may not only increase but also decrease considerably. The strongest decreases and increases can be observed for active learning on the Twitter data set, where drops and rises of the F1-Score by a value of about 0.85 occur. As noted in Section 4.2.4 above, BERT's prediction performance can exhibit considerable variance across random initializations-even if trained on exactly the same training data set. Here, to make predictions and performances more stable, precautions have been taken by choosing a small learning rate of 2e-05 and setting the number of epochs to 20 such that many training iterations are conducted. To evaluate how the level and the stability of BERT's prediction performance change if the hyperparameters are set to more conventional values (e.g. combining a learning rate of 2e-05 or 3e-05 with 3 or 4 epochs), BERT is also trained using hyperparameter values in these common ranges. Figure 16 in Appendix F visualizes the F1-Scores of these models. The instability of the learning paths is only slightly to moderately higher (see Figure 17 in Appendix F). Yet, for the Twitter and SBIC tasks, and especially when the training data sets are very small, the F1-Scores of BERT models with common hyperparameter values are substantively lower than the F1-Scores of BERT models that have been trained for 20 epochs. In the Twitter application, for example, the mean F1-Score of active learning with a BERT model that is trained for 20 epochs on a set of 500 labeled training instances is 0.568 higher than that of active learning with a BERT model trained for 3 epochs on 500 training instances.
Hence, although training over 20 epochs takes proportionally more computing time than training over 3 or 4 epochs, 31 when applying BERT to small data sets-a scenario which is likely in the retrieval settings focused on here-training for many epochs (and thus presenting each document in the small training data set many times to the model) seems important to enhance performance. Nevertheless, because BERT tends to exhibit a relatively high degree of variability in its performance even if training is conducted for a larger number of epochs, monitoring retrieval performance with a set aside test set seems important for researchers to detect situations in which (likely due to vanishing gradients) retrieval performance drops to low values. Such situations can be easily fixed by, for example, choosing another random seed for initialization. To conclude, if a team of researchers has the resources to retrieve documents referring to their entity of interest via supervised learning and they have a fixed number of training instances they can maximally label, then active learning is likely to yield better results than passive learning. Moreover, if none or only a small share of documents exceeds the maximum number of tokens that Transformer-based language representation models such as BERT can process, then applying a BERT-like model that is trained with a small learning rate for a large number of epochs is likely to achieve better results than applying conventional supervised machine learning methods (such as SVMs) on bag-of-words-based representations. Yet, the predictions made by BERT (and hence also BERT's retrieval performance) are prone to considerable variation depending on the initializing random seed, the initial training data set, and the changes to the training data set from one learning step to the next. This problem, however, is mitigated by the fact that mediocre performances can be easily detected if performance is monitored with a set aside test set. The central question this study seeks to answer is: What, if anything, can be gained by applying more costly retrieval approaches such as query expansion, topic model-based classification rules, or supervised learning instead of the relatively simple and inexpensive usage of a boolean query with a keyword list? In order to finally answer this question and compare the approaches against each other, Figures 7, 8, and 9 summarize the retrieval performance-as measured by the F1-Score-of the evaluated approaches on the retrieval tasks associated with the Twitter corpus (Figure 7), the SBIC (Figure 8), and the Reuters-21578 corpus (Figure 9). In each figure, the left panel gives the F1-Scores for the lists of 10 predictive keywords that then are expanded in the local and global embedding spaces. The middle panel shows the F1-Scores of topic model-based classification rules with different values for threshold ξ. The right panel visualizes the F1-Scores for active as well as passive supervised learning with SVM and BERT. In general, the direct comparison shows that, when taking keyword lists comprising 10 empirically predictive terms as the baseline, the application of more complex and more expensive retrieval techniques does not guarantee better retrieval results. Query expansion techniques here rather decrease than increase the F1-Score. Minimal improvements only occur sporadically in the embedding space trained on external global corpora, if the increase in recall outweighs the decrease in precision.
The farther the expansion, the worse the results tend to become. In general, it seems that the identification of newspaper articles referring to crude oil from the Reuters corpus is a more simple task than the retrieval of tweets relating to the multi-dimensional refugee topic or the extraction of posts that refer to an entity (disabled people) and are of a particular kind (here: disrespectful). All approaches show higher F 1 -Scores on the Reuters corpus, and lower scores for the other two evaluated retrieval tasks. Topic model-based classification rules work relatively well for the Reuters corpus but not the other corpora. Hence, if there are no coherent and exclusive topics that cover the entity of interest in all its aspects, topic model-based classification rules exhibit rather poor retrieval performances. In the Twitter and SBIC data sets, the F 1 -Score reached by the topic model-based classification rules are in the lower range of the values achieved by the lists of predictive keywords. If, on the other hand, coherent and exclusive topics relating to the entity of interest exist (as is the case for the Reuters corpus), acceptable retrieval results are possible. Here, gains over the best performing keyword lists are achieved for combinations with larger topic numbers and smaller values for threshold ξ. The best performing topic model-based classification rule on the Reuters corpus is based on a CTM with 70 topics that considers 2 topics to be relevant and predicts documents to be relevant that have 10% of their words assigned to these two relevant topics. This topic model-based classification rule reaches an F 1 -Score of 0.685, which is 0.04 higher than the F 1 -Score of the best performing keyword list. Whereas query expansion techniques and topic model-based classification rules show no or small improvements, supervised learning-if conducted in an active learning modehas the potential to yield a substantively higher retrieval performance than a list of 10 predictive keywords. The prerequisite for this, however, is that not too few training instances are used. The larger the number of training instances, the higher the F 1 -Score tends to be. Yet, as has been established above, especially for BERT this relationship is not monotonic and can exhibit considerable variability. What number of training documents is required to produce acceptable retrieval results that are better than what could be achieved with a keyword list, depends on the specifics of the retrieval task at hand and the employed learning mode and model. In the Twitter application, for example, it is likely to achieve an acceptable to good retrieval performance that improves upon keyword lists when applying active learning with BERT using ≥ 350 training documents or applying passive learning with BERT on ≥ 800 unique training documents (see again Figure 4 ). In general, active learning is to be preferred over passive learning, as across applications and learning models, active learning tends to reach a higher retrieval performance than passive learning with the same number of training documents. Moreover, the pretrained deep neural network BERT tends to yield a higher F 1 -Score compared to an SVM operating on bag-of-words-based document representations. BERT produces more unstable results but this behavior can be monitored with a set aside test set. 
Across applications, applying active learning with BERT until 1,000 training instances have been labeled here produces a good separation of relevant and irrelevant documents that considerably improves upon the separation achieved by applying a keyword list. The mean F1-Scores of BERT applied in an active learning mode with a training budget of 1,000 labeled instances are 0.712 (Twitter), 0.622 (SBIC), and 0.908 (Reuters), whereas the maximum F1-Scores reached by the empirically constructed initial keyword lists are 0.417 (Twitter), 0.404 (SBIC), and 0.645 (Reuters). Hence, the improvements in the F1-Scores that are achieved by applying active learning with BERT rather than the best performing keyword list are 0.295 (Twitter), 0.218 (SBIC), and 0.263 (Reuters). If applying BERT-like models exceeds available capacities, active learning with a conventional machine learning model such as an SVM is still a good and viable alternative. 32 The mean F1-Scores of active learning with SVM trained with 1,000 labeled documents are 0.538 (Twitter), 0.475 (SBIC), and 0.849 (Reuters). This still yields enhancements of the F1-Score by 0.121 (Twitter), 0.071 (SBIC), and 0.204 (Reuters). Note that the performance enhancements of active learning here are observed across applications: irrespective of document length, textual style, the type of the entity of interest, and the homogeneity or heterogeneity of the corpus from which the documents are retrieved, active learning with 1,000 training documents shows superior performance to keyword lists and the other approaches.
In text-based analyses, researchers typically are interested in studying documents referring to a particular entity. Yet, textual references to specific entities are often contained within multi-thematic corpora. In consequence, documents that contain references toward the entities of interest have to be separated from those that do not. A very common approach in social science to retrieve relevant documents is to apply a list of keywords. Keyword lists are inexpensive and easy to apply, but they may result in biased inferences if they systematically miss relevant documents. Query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning constitute alternative procedures for the retrieval of relevant documents that are more expensive, more complex, and rarely applied in social science. These more complex procedures theoretically have the potential to reach a higher retrieval performance than keyword lists and thus to reduce the potential size of selection biases. Until now, a systematic comparison of these approaches was lacking, and it therefore was unclear whether employing any of these more expensive methods would yield any improvement in retrieval performance and, if so, how large and how consistent across contexts the improvement would be. This study closed this gap. The comparison of the approaches on the basis of retrieval tasks associated with a data set of German tweets (Linder, 2017), the Social Bias Inference Corpus (SBIC) (Sap et al., 2020), and the Reuters-21578 corpus (Lewis, 1997) shows that none of the applied more complex approaches necessarily enhances the retrieval performance, as measured by the F1-Score, over the application of a keyword list containing 10 empirically predictive terms.
Yet, whereas across all settings and combinations evaluated for query expansion techniques and topic model-based classification rules at the very best small increases in the F1-Score can be observed, active supervised learning with the Transformer-based language representation model BERT substantively increases the F1-Scores across application contexts if the number of labeled training documents used in the active learning process is not too small.
32 If a large share of the documents in the corpus at hand is longer than the 512 tokens that can be processed by BERT, Transformer-based models that can process longer sequences of tokens, e.g. the Longformer (Beltagy et al., 2020), can be applied.
Thus, in terms of retrieval performance, supervised learning in an active learning mode, preferably with a pretrained deep neural network, is the procedure to be preferred over all other approaches. However, this procedure is also the most expensive of the evaluated methods. Supervised learning requires human, financial, and time resources for annotating the training documents. A training data set comprising 1,000 instances is very small for usual supervised learning settings; still, beyond the annotation itself, the coding process also has to be monitored and coordinated. Moreover, active learning involves a dynamic labeling process in which after each iteration those documents are annotated for which a label is requested by the model. While active learning reduces the overall number of training instances for which a label is required, the dynamic labeling process may increase coordination costs or the time coders spend on coding as they wait for the model to request the next labels.
The precise separation of documents that refer to the entity of interest, and thus are relevant for the planned study at hand, from documents that are irrelevant is an essential analytic step. This step defines the set of documents on which all following core analyses are conducted. Selection biases induced by the applied retrieval method ultimately bias the study's results. Therefore, attention and care should be devoted to extracting relevant documents. Compared to the creation of a set of keywords, active learning requires substantial amounts of additional resources. But given the considerably higher retrieval performances observed for active learning compared to keyword lists, spending these resources is likely to be worthwhile for the quality of the study.
The aim of this study was to compare different retrieval approaches: keyword lists, query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning. One problem that naturally arises when comparing such approaches is that different approaches can only be compared on the basis of specific models that follow specific procedures (e.g. for query expansion), have specific hyperparameter settings, are trained on a specific finite set of training documents, are evaluated on a specific finite set of test documents, and are initialized with specific random seed values (Reimers and Gurevych, 2018). Here, care was taken to evaluate a broad range of specific models with different settings for each approach. With regard to keyword lists and query expansion, 100 different keyword lists were expanded in local as well as global embedding spaces (whereby the number of expansion terms was varied from 1 to 9).
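Before turning to the settings evaluated for the remaining approaches, the topic model-based classification rule referred to throughout can be stated in a few lines: given an estimated document-topic proportion matrix, a document is predicted to be relevant if the share of its words assigned to the topics judged to relate to the entity of interest reaches the threshold ξ. The sketch below is a minimal illustration with made-up topic indices and a random stand-in for a CTM's document-topic matrix; the values (70 topics, two relevant topics, ξ = 0.10) merely echo the best performing setting reported for the Reuters corpus.

```python
import numpy as np

def topic_rule_predict(theta: np.ndarray, relevant_topics: list[int], xi: float) -> np.ndarray:
    """Predict a document as relevant if the summed proportion of its words
    assigned to the relevant topics is at least xi."""
    relevant_share = theta[:, relevant_topics].sum(axis=1)
    return relevant_share >= xi

# Illustration with a random stand-in for an estimated document-topic matrix:
rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=np.ones(70), size=1000)   # 1,000 documents, 70 topics
relevant = topic_rule_predict(theta, relevant_topics=[12, 45], xi=0.10)  # made-up topic indices
print(f"Share of documents predicted relevant: {relevant.mean():.3f}")
```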
For topic model-based classification rules, 4 different values for threshold ξ were evaluated for each of 426,725 combinations. With regard to active and passive supervised learning, two different types of models (SVM and BERT) were applied 10 or 5 times with different random initializations. In each of the 10 or 5 runs, a different initial training set was used, which then was enlarged by passive random sampling or active selection over 15 iterations. This broad evaluation setting makes it more likely that the conclusions drawn here on the set of models evaluated for each approach hold and that active learning indeed is superior to keyword lists for the studied tasks. Nevertheless, future studies might inspect the effect of the chosen model settings of the evaluated approaches on the obtained results: Here, for each of the evaluated approaches the simplest setting or version was used. For query expansion, GloVe embeddings that represent each term by a single vector were employed and a simple boolean query using the OR operator was conducted. More complex procedures from the field of information retrieval that make use of contextualized embeddings and pseudo-relevance feedback (e.g. Zheng et al., 2020) were not applied. For the estimation of the topics, the CTM rather than, for example, a neural topic model was used (for an overview of neural topic models see Zhao et al., 2021). For passive supervised learning, simple random oversampling was employed, and for active learning, uncertainty sampling was applied as the query strategy rather than more complex procedures such as query-by-committee or expected model change (see e.g. Settles, 2010). The simple versions applied here often present the core idea of each approach most clearly and are also the easiest to implement for researchers who seek to use one of the approaches as a first step in their analysis. Nevertheless, future studies might explore whether applying more complex procedures changes the substantive conclusions reached here. Finally, there is the question of the extent to which the conclusions drawn here on the basis of three applications travel to further contexts. The selected data sets and tasks differ with regard to textual length and style, the heterogeneity of the corpora, the characteristics of the entities of interest, and the share of relevant documents in the corpora. The finding that active supervised learning, if applied with a not too small number of training instances, considerably increases the F1-Score compared to keyword lists holds across these applications' differences. But further studies could look more closely at which contextual factors lead to which effects on retrieval performance for which procedures.
Note on Tables 7 to 9: The keyword lists comprising empirically highly predictive terms are not only applied to the corpora to evaluate the retrieval performance of keyword lists, but also form the basis for query expansion (see Section 4.2.2). The query expansion technique makes use of GloVe word embeddings (Pennington et al., 2014) trained on the local corpora at hand as well as of externally obtained GloVe word embeddings trained on large global corpora. In the case of the locally trained word embeddings, there is a learned word embedding for each predictive term; thus, the set of extracted highly predictive terms can be used directly as starting terms for query expansion.
In the case of the globally pretrained word embeddings, however, not all of the highly predictive terms have a corresponding global word embedding. Hence, for the globally pretrained embeddings the 50 most predictive terms for which a globally pretrained word embedding is available are extracted: if a predictive term has no corresponding global embedding, the set of extracted predictive terms is enlarged with the next most predictive term until there are 50 extracted terms. In consequence, for each corpus two lists of the most predictive features are shown below.
[Figure panels: Twitter-global-Recall, Twitter-global-Precision]
[Figure caption: Recall and precision scores resulting from the application of the keyword lists of 10 highly predictive terms as well as the evolution of the recall and precision scores across the query expansion procedure based on locally trained GloVe embeddings. For each of the sampled 100 keyword lists that are then expanded, one light blue line is plotted. The thick dark blue line gives the mean over the 100 lists. Note that the strong increase in recall for some keyword lists in the Twitter data set is due to the fact that the textual feature with the highest cosine similarity to the highly predictive initial term 'flüchtlinge' (translation: 'refugees') is the colon ':'.]
[Figure panels: Twitter-Recall, Twitter-Precision; Passive, Active]
[Figure caption: Differences in the F1-Scores achieved on the set aside test set as the number of unique labeled documents in set I increases from one training step to the next by a batch of 50 documents. Boxplots visualizing the distribution of differences in F1-Scores of the SVMs are presented in blue. F1-Score differences for BERT are given in red. The mean is visualized by a star dot. The value of the mean as well as the standard deviation (SD) are given below the respective boxplots.]
The hyperparameter values for the BERT models trained with hyperparameter values in conventional value ranges are determined via hyperparameter tuning. As for the SVMs, hyperparameter tuning via a grid search across sets of hyperparameter values is conducted in a stratified 5-fold cross-validation setting on one of the folds of the data. The AdamW algorithm (Loshchilov and Hutter, 2019) with a warmup period lasting 6% of the training steps is used. Dropout is set to 0.1 and the batch size to 16. The inspected hyperparameter values for the global learning rate are {2e-05, 3e-05} and for the number of epochs {2, 3, 4, 5}. The folds are stratified such that the share of instances falling into the relevant minority class is the same across all folds. In each cross-validation iteration, random oversampling of the minority class is conducted in the folds used for training such that the number of relevant minority class examples increases by a factor of 5. Among the inspected hyperparameter settings, the setting that achieves the highest F1-Score regarding the prediction of the relevant minority class and does not exhibit excessive overfitting is selected (a sketch of this selection scheme is given below).
[Figure caption: Differences in the F1-Scores achieved on the set aside test set as the number of unique labeled documents in set I increases from one training step to the next by a batch of 50 documents. Boxplots visualizing the distribution of differences in F1-Scores of BERT models trained with hyperparameter values in conventional value ranges are presented in blue. F1-Score differences for BERT trained with a global learning rate (LR) of 2e-05 for 20 epochs (Eps) are given in red. The mean is visualized by a star dot. The value of the mean as well as the standard deviation (SD) are given below the respective boxplots.]
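The following sketch illustrates the selection scheme just described: a grid of candidate settings is evaluated with stratified 5-fold cross-validation, the relevant minority class is oversampled by a factor of 5 in the training folds only, and the setting with the highest mean minority-class F1-Score is chosen. For brevity, a linear SVM with a grid over C stands in for the BERT models and their learning-rate/epoch grid, oversampling is done by plain duplication rather than random resampling, and the overfitting check mentioned above is omitted; this is a sketch under those simplifying assumptions, not the study's exact implementation.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def oversample_minority(X, y, factor=5):
    """Duplicate minority-class (label 1) rows so that their count grows by `factor`."""
    minority = np.flatnonzero(y == 1)
    idx = np.concatenate([np.arange(len(y)), np.tile(minority, factor - 1)])
    return X[idx], y[idx]

def grid_search_minority_f1(X, y, grid):
    """Return the hyperparameter setting with the highest mean minority-class F1."""
    y = np.asarray(y)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    mean_f1 = {}
    for setting in grid:                                   # e.g. {"C": 1.0}
        fold_scores = []
        for train_idx, val_idx in skf.split(X, y):
            X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx])
            clf = LinearSVC(**setting).fit(X_tr, y_tr)
            preds = clf.predict(X[val_idx])
            fold_scores.append(f1_score(y[val_idx], preds, pos_label=1))
        mean_f1[tuple(setting.items())] = float(np.mean(fold_scores))
    best = max(mean_f1, key=mean_f1.get)
    return dict(best), mean_f1

# Example grid; for BERT the grid would instead span learning rates {2e-05, 3e-05}
# and epochs {2, 3, 4, 5}, as stated in the text.
grid = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]
```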
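Finally, the pool-based active learning procedure with uncertainty sampling that the captions above refer to can be sketched as follows: starting from a small labeled seed set, the model is refit in each iteration and the 50 unlabeled documents it is least certain about (smallest absolute decision-function value) are selected for labeling and added to the training set. The batch size of 50 and the 15 iterations follow the description in the text; the seed size is illustrative, a linear SVM on bag-of-words features stands in for the actual models, the class_weight='balanced' option stands in for the oversampling step, and an oracle label array simulates the human coders.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def active_learning(texts, oracle_labels, seed_size=100, batch_size=50, iterations=15):
    """Pool-based active learning with uncertainty sampling for a binary retrieval task."""
    X = CountVectorizer(min_df=2).fit_transform(texts)   # bag-of-words document representations
    y = np.asarray(oracle_labels)                        # stands in for labels supplied by coders
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(y), size=seed_size, replace=False))
    for _ in range(iterations):
        clf = LinearSVC(class_weight="balanced").fit(X[labeled], y[labeled])
        unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
        margins = np.abs(clf.decision_function(X[unlabeled]))   # distance to the hyperplane
        query = unlabeled[np.argsort(margins)[:batch_size]]     # the 50 least certain documents
        labeled.extend(query.tolist())                          # coders would label these next
    return LinearSVC(class_weight="balanced").fit(X[labeled], y[labeled]), labeled
```

In the passive learning variant, the query step is simply replaced by drawing the next batch of 50 documents at random.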
Applying support vector machines to imbalanced datasets
RcppParallel: Parallel programming tools for 'Rcpp'. CRAN
Wikipedia-based semantic query enrichment
A practical algorithm for topic modeling with provable guarantees
Query expansion techniques for information retrieval: A survey. Information Processing and Management
Conflict and Peace Data Bank (COPDAB)
Hybrid Content Analysis: Toward a strategy for the theory-driven, computer-assisted classification of large text corpora
A textual Taylor rule: Estimating central bank preferences combining topic and scaling methods
Neural machine translation by jointly learning to align and translate
Less is more? How demographic sample weights can improve public opinion estimates based on Twitter data
Is the left-right scale a valid measure of ideology? Political Behavior
Does rape culture predict rape? Evidence from U.S. newspapers, 2000-2013
Making memories unavailable: The inhibitory power of retrieval
Predicting and interpolating state-level polls using Twitter textual data
Longformer: The long-document Transformer
A neural probabilistic language model
quanteda: An R package for the quantitative analysis of textual data
A correlated topic model of science
Latent Dirichlet Allocation
Enriching word vectors with subword information
A training algorithm for optimal margin classifiers
A survey of predictive modeling on imbalanced domains
Cost-sensitive learning for imbalanced classification
Random oversampling and undersampling for imbalanced classification
SMOTE for imbalanced classification with Python
140 characters to victory?: Using Twitter to predict the UK
Data mining for imbalanced datasets: An overview
SMOTE: Synthetic minority over-sampling technique
Support-vector networks
xtable: Export Tables to LaTeX or HTML (Version 1.8-4)
BERT: Pre-training of deep bidirectional Transformers for language understanding
Query expansion with locally-trained word embeddings
Language and ideology in Congress
Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv
Separating the wheat from the chaff: Applications of automated document classification using Support Vector Machines
data.table: Extension of 'data.frame' (Version 1.13.0)
Linguistic variable - linguistic variant
Active learning for BERT: An empirical study
The foundations of cost-sensitive learning
Multi-label prediction for political text-as-data
Learning on the border: Active learning in imbalanced data classification
Keyword assisted topic models. arXiv
Studies in Linguistic Analysis. Publications of the Philological Society
Role-based association of verbs, actions, and sentiments with entities in political discourse
How the refugee crisis and radical right parties shape party competition on immigration
Google Colaboratory Frequently Asked Questions. Google Colaboratory
Appropriators not position takers: The distorting effects of electoral incentives on Congressional representation
topicmodels: An R package for fitting topic models
Universal language model fine-tuning for text classification
Dataset card for reuters21578
Matplotlib: A 2D graphics environment
The mediation of politics through Twitter: An analysis of messages posted during the campaign for the German Federal Election
The credibility of public and private signals: A document-based approach
gdown: Download a Large File from Google Drive
Computer-assisted keyword and document set discovery from unstructured text
How censorship in China allows government criticism but silences collective expression
A review of domain adaptation without target labels
Content Analysis: An Introduction to Its Methodology
Query expansion using word embeddings
Relevance based language models
Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning
Reuters-21578 (Distribution 1.0). [Data set]
A sequential algorithm for training text classifiers
Reducing bias in online text datasets: Query expansion and active learning for better data from keyword searches
Decoupled weight decay regularization
Applying LDA topic modeling in communication research: Toward a valid and reliable methodology
Introduction to Information Retrieval
Foundations of Statistical Natural Language Processing
Data structures for statistical computing in Python
Seaborn
Efficient estimation of word representations in vector space
Linguistic regularities in continuous space word representations
Active learning approaches for labeling text: Review and assessment of the performance of active learning approaches
A Mathematics Course for Political and Social Research
On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines
We need to go deeper: Measuring electoral violence using Convolutional Neural Networks and social media
Model card for bert-base-german-uncased from dbmdz
Efficient nonparametric estimation of multiple embeddings per word in vector space
A Guide to NumPy
facetscales: facet grid with different scales per facet (Version 0.1.0.9000)
PyTorch: An imperative style, high-performance deep learning library
Scikit-learn: Machine learning in Python
GloVe: Global vectors for word representation
Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv
Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning
Using supervised machine learning in automated content analysis: An example using relational uncertainty
Newspaper coverage of political scandals
How to analyze political attention with minimal assumptions and costs
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
Undermining, defusing or defending European integration? Assessing public communication of European executives in times of EU politicisation
Mobilizing the masses: Measuring resource mobilization on Twitter
Why comparing single performance scores does not allow to draw conclusions about machine learning approaches
Beautiful Soup 4
A model of text for experimentation in the social sciences
Navigating the local modes of big data: The case of topic models
stm: An R package for Structural Topic Models
Structural Topic Models for open-ended survey responses
Word embeddings: What works, what doesn't, and how to tell the difference for applied research
Neural Transfer Learning for Natural Language Processing
NLP-Progress
Social bias frames: Reasoning about social and power implications of language
Exploring topic-metadata relationships with the STM: A Bayesian approach. arXiv
Automatic word sense discrimination
1.4. Support Vector Machines
RBF SVM Parameters
The multiclass classification of newspaper articles with machine learning: The hybrid binary snowball approach
text2vec: Modern Text Mining Framework for R. CRAN
Improving query expansion strategies with word embeddings
Recursive deep models for semantic compositionality over a sentiment treebank
plot3D: Plotting Multi-Dimensional Data. CRAN
Systematically Monitoring Social Media: The Case of the German Federal Election
How to fine-tune BERT for text classification?
Support Vector Machine active learning with applications to text classification
From frequency to meaning: Vector space models of semantics
rstudioapi: Safely Access the RStudio API (Version 0.11)
Bots and online hate during the COVID-19 pandemic: Case studies in the United States and the Philippines
Clause analysis: Using syntactic information to automatically extract source, subject, and predicate from texts with an application to the 2008-2009 Gaza War
Information Retrieval - Session 1: Introduction to Information Retrieval
Python 3 Reference Manual
Attention is all you need
Controlling the sensitivity of Support Vector Machines
Logistic regression for massive data with rare events
Neural transfer learning with Transformers for social science text analysis
Latent Semantic Scaling: A semisupervised text analysis technique for new domains and languages
ggplot2: Elegant Graphics for Data Analysis
stringr: Simple, Consistent Wrappers for Common String Operations
dplyr: A Grammar of Data Manipulation (Version 1.0.6)
lsa: Latent Semantic Analysis (Version 0.73.2)
ggridges: Ridgeline Plots in 'ggplot2'. CRAN
HuggingFace's Transformers: State-of-the-art natural language processing
CASM: A deep-learning approach for identifying collective action events with text and image data from social media
Topic modelling meets deep neural networks: A survey
BERT-QE: Contextualized Query Expansion for Document Re-ranking
Aligning books and movies: Towards story-like visual explanations by watching movies and reading books