Which side are you on? Insider-Outsider classification in conspiracy-theoretic social media

Pavan Holur, Tianyi Wang, Shadi Shahsavari, Timothy Tangherlini, Vwani Roychowdhury

March 8, 2022

Abstract: Social media is a breeding ground for threat narratives and related conspiracy theories. In these, an outside group threatens the integrity of an inside group, leading to the emergence of sharply defined group identities: Insiders, agents with whom the authors identify, and Outsiders, agents who threaten the Insiders. Inferring the members of these groups constitutes a challenging new NLP task: (i) information is distributed over many poorly constructed posts; (ii) threats and threat agents are highly contextual, with the same post potentially having multiple agents assigned to membership in either group; (iii) an agent's identity is often implicit and transitive; and (iv) phrases used to imply Outsider status often do not follow common negative sentiment patterns. To address these challenges, we define a novel Insider-Outsider classification task. Because we are not aware of any appropriate existing datasets or attendant models, we introduce a labeled dataset (CT5K) and design a model (NP2IO) to address this task. NP2IO leverages pretrained language modeling to classify Insiders and Outsiders. NP2IO is shown to be robust, generalizing to noun phrases not seen during training, and exceeding the performance of non-trivial baseline models by 20%.

Narrative models, often succinctly represented as a network of characters, their roles, their interactions (syuzhet), and associated time-sequencing information (fabula), have been a subject of considerable interest in computational linguistics and narrative theory. Stories rest on the generative backbone of narrative frameworks (Bailey, 1999; Beatty, 2016). While the details might vary from one story to another, this variation can be compressed into a limited set of domain-dependent narrative roles and functions (Dundes, 1962). Social narratives that both directly and indirectly contribute to the construction of individual and group identities are an emergent phenomenon resulting from distributed social discourse. Currently, this phenomenon is most readily apparent on social media platforms, with their large piazzas and niche enclaves. Here, multiple threat-centric narratives emerge and are often, over time, linked together into complex conspiracy theories (Tangherlini et al., 2020). Conspiracy theories and their constituent threat narratives (legend, rumor, personal experience narrative) share a signature semantic structure: an implicitly accepted Insider group; a diverse group of threatening Outsiders; specific threats from the Outsiders directed at the Insiders; details of how and why the Outsiders are threatening; and a set of strategies proposed for the Insiders to counter these threats (Tangherlini, 2018). Indeed, the Insider/Outsider groups are fundamental in most studies of belief narrative, and have been exhaustively studied in social theory and, more specifically, in the context of conspiracy theories (Bodner et al., 2020; Barkun, 2013). On social media, these narratives are negotiated one post at a time, each post expressing only a short piece of the "immanent narrative whole" (Clover, 1986).
This gives rise to a new type of computational linguistic problem: given a large enough corpus of social media text data, can one automatically distill the semantically-labeled narratives (potentially several overlapping ones) that underlie the fragmentary conversational threads?

Figure 1: A pair of inferred text segments labeled by NP2IO, showing Insider-Outsider context-sensitivity. Colored spans highlight the inferred noun phrases (red for Outsiders; blue for Insiders). POS tags are shown along with the noun phrases to illustrate an example of the syntactic and semantic hints NP2IO uses to generate the inferred labels. Note that, based solely on context, the same agents ("tech", "vaccines", "People", "Bill Gates" and "the vaccine") switch Insider-Outsider labels. Even though the training data is highly biased in terms of the identities of the Insiders/Outsiders, the pretrained language model used in our classifier allows NP2IO to learn to infer from the context phrases rather than by memorizing the labels.

Prior work (Tangherlini et al., 2020) provides an unsupervised platform for extracting the elements of such narrative frameworks: Insiders, Outsiders, strategies for dealing with Outsiders and their attendant threats and, in the case of conspiracy theories, causal chains of events that support the theory. By itself, this unsupervised platform does not "understand" the different narrative parts. Since the submodules are not trained to look for specific semantic abstractions inherent in conspiracy theories, the platform cannot automatically generate a semantically tagged narrative for downstream NLP tasks. It cannot, for example, generate a list across narratives of the various outside threats and attendant inside strategies being recommended on a social media forum, nor can it address why these threats and strategies are being discussed.

As a fundamental first step bringing supervised information to bear on automated narrative structure discovery, we introduce the Insider-Outsider classification task: to classify the noun phrases in a post as Insider, Outsider or neither. A working conceptualization of what we consider Insiders and Outsiders is provided in the following insets. As with most NLP tasks, we do not provide formal definitions of, and rules to determine, these groups. Instead, we let a deep learning model learn the representations needed to capture these notions computationally by training on data annotated with human-generated labels.

Insiders: Some combination of actors and their associated pronouns, who display full agency (people, organizations, government), partial agency (policies, laws, rules, current events) or no agency (things, places, circumstances), with whom the author identifies (including themselves). These are often ascribed beneficial status.

Outsiders: A set of actors whom the author opposes and, in many cases, perceives as threatening the author and the Insiders with disruption or harm. For our purposes, these agents need not have full agency: diseases and natural disasters, for example, would be universal Outsiders, and any manmade object/policy that works against the Insiders would be included in this group.

The partitioning of actors from a post into these different categories is inspired by social categorization, identification and comparison in the well-established Social Identity Theory (SIT) (Tajfel et al., 1979; Tajfel, 1974) and rests on established perspectives from Narrative Theory (Dundes, 1962; Labov and Waletzky, 1967; Nicolaisen, 1987).
Following are some of the reasons why this classification task is challenging and why the concepts of Insiders/Outsiders are not sufficiently captured by the existing labeled datasets used in Sentiment Analysis (SA) (discussed in more detail in Section 3):

1. Commonly-held Beliefs and Worldviews: Comprehensively incorporating shared values, crucial to the classification of Insiders and Outsiders, is a task of varying complexity. Some beliefs are easily enumerated: most humans share a perception of a nearly universal set of threats (virus, bomb, cancer, dictatorship), threatening actions ("kills millions of people", "tries to mind-control everyone") and benevolent actions ("donating to a charitable cause", "curing disease", "freeing people"). Similarly, humans perceive themselves and their close family units as close, homogeneous groups with shared values, and therefore "I", "us", "my children" and "my family" are usually Insiders. In contrast, "they" and "them" are most often Outsiders. Abstract beliefs pose a greater challenge, as the actions that encode them can be varied and subtle. For example, in the post "The microchips in vaccines track us", the noun phrase "microchips" is in the Outsider category because it violates the Insiders' right to privacy by "track[ing] us". Thus, greater attention needs to be paid in labeling datasets, highlighting ideas such as the right to freedom, religious beliefs, and notions of equality.

2. Contextuality and Transitivity: People express their opinions of Insider/Outsider affiliation through contextual clues embedded in the language of social media posts. For example, the post "We should build cell phone towers" suggests that "cell phone towers" are helpful to Insiders, whereas the post "We should build cell phone towers and show people how it fries their brains" suggests, in contrast, that "cell phone towers" are harmful to Insiders and belong, therefore, to the class of Outsiders. Insider/Outsider affiliations are also implied transitively within a post. For example, consider two posts: (i) "Bill Gates is developing a vaccine. Vaccines kill people." and (ii) "Bill Gates is developing a vaccine. Vaccines can eradicate the pandemic." In the first case, the vaccine's toxic quality and attendant Outsider status transfer to Bill Gates, making him an Outsider as well; in the second, the vaccine's beneficial qualities transfer to him, now making "Bill Gates" an Insider.

3. Biased Training Conditions: Designing effective classifiers that do not inherit bias from the training data (especially data in which particular groups or individuals are derided or dehumanized) is a challenging but necessary task. Because conspiracy theories evolve, building on earlier versions, and result in certain communities and individuals being "othered", our models must learn the phrases, contexts, and transitivity used to ascribe group membership (here, either Insider or Outsider) and not memorize the communities and/or individuals being targeted. Figure 1 illustrates an example where we probed our model to explore whether this requirement is indeed satisfied. The first text conforms to the bias in our data, where "tech", "Bill Gates", and "vaccines" are primarily Outsiders. The second text switches the context by changing the surrounding phrases. Our classifier correctly labels these same entities, now presented in a different context, as Insiders. We believe that such subtle learning is possible because of the use of pretrained language models.
We provide several such examples in Table 3 and Figure 3, and also evaluate our model for zero-shot learning in Table 1 and Figure 6.

Recent NLP efforts have examined the effectiveness of using pretrained Language Models (LMs) such as BERT, DistilBERT, RoBERTa, and XLM to address downstream classification tasks through fine-tuning (Sanh et al., 2020; Liu et al., 2019; Lample and Conneau, 2019). Pretraining establishes the contextual dependencies of language prior to addressing a more specialized task, enabling rapid and efficient transfer learning. A crucial benefit of pretraining is that, in comparison to training a model from scratch, fewer labeled samples are necessary; by fine-tuning a pretrained LM, one can achieve competitive or better performance on an NLP task. As discussed in Section 2, our model is required to be contextual and transitive, both qualities that rely on the context embedded in language, so we utilize a similar architecture.

In recent work involving span-based classification tasks, token-classification heads have proven very useful for tasks such as Parts-of-Speech (POS) tagging, Named Entity Recognition (NER) and variations of Sentiment Analysis (SA) (Yang et al., 2019; Vlad et al., 2019; Yin et al., 2020). Since the Insider-Outsider classification task is also set up as a noun phrase labeling task, our architecture uses a similar token-classification head on top of the pretrained LM backbone.

Current SA datasets' definitions of positive, negative and neutral sentiments can be thought of as a "particularized" form of the Insider-Outsider classification task. For example, among the popular datasets used for SA, Rotten Tomatoes, Yelp reviews (Socher et al., 2013) and others (Dong et al., 2014; Pontiki et al., 2014) implicitly attribute a sentiment's origin to the post's author (a single Insider) and its intended target to a movie or restaurant (a single Outsider if the sentiment is negative, or an Insider if positive). The post itself generally contains information about the target and the particular aspects that the Insider found necessary to highlight.

In more recent SA work, such as Aspect-Based Sentiment Analysis (ABSA) (Gao et al., 2021; Li et al., 2019; Dai et al., 2021), researchers have developed models to extract the sentiments (positive, negative, neutral) associated with particular aspects of a target entity. One of the subtasks of ABSA, aspect-level sentiment classification (ALSC), has a form that is particularly close to Insider-Outsider classification. Interpreted in the context of our task, the author of the post is an Insider, although now there can be multiple targets or "aspects" that need to be classified as Insiders and Outsiders. Still, the constructed tasks in ABSA do not align well with the goal of Insider-Outsider classification: 1) the datasets are not transitive: individual posts appear to have only one agent that needs classification, or a set of agents, each with their own separate sets of descriptors; 2) the ALSC data is often at the sentence level as opposed to the post level, limiting the context space for inference. Despite these differences, we quantitatively verify our intuitions in Section 7.1 and show that ABSA models do not generalize to our dataset.
Closely related to ABSA is Stance Classification (SC) (also known as Stance Detection or Identification), the task of identifying the stance of the text's author (in favor of, against or neutral) toward a target (an entity, concept, event, idea, opinion, claim, topic, etc.) (Walker et al., 2012; Zhang et al., 2017; Küçük and Can, 2021). Unlike in ABSA, the target in SC need not be embedded as a span within the context. For example, a perfect SC model given the context "This house would abolish the monarchy." and the target "Hereditary succession" would predict the Negative label (Bar-Haim et al., 2017; Du et al., 2017). While SC appears to require a higher level of abstraction, and hence models of higher complexity and better generalization power than those typically used for ABSA, current implementations of SC are limited by a finite set of queried targets; in other words, SC models currently do not generalize to unseen abstract targets. Yet, in real-time social media, potential targets and agents exhibit a continuous process of emergence, combination and dissipation. We seek to classify these shifting targets using the transitive property of language, and would like the language to provide clues about the class of one span relative to another. Ultimately, while SC models are a valuable step in the direction of better semantic understanding, they are ill-suited to our task.

Parallel to this work in SA, there are complementary efforts in threat detection on social media (Wester et al., 2016; Kandias et al., 2013; Park et al., 2018), a task that broadly attempts to classify longer segments of text, such as comments on YouTube or tweets on Twitter, as more general "threats". The nuanced instruction to the labelers of such data is to identify whether the author of the post is an Outsider, from the labeler's perspective as an Insider. Once again, this task aligns with the Insider-Outsider paradigm but does not exhaust it, and the underlying models cannot accomplish our task.

The sets of Insiders and Outsiders comprise a higher-order belief system that cannot be adequately captured by current working definitions of sentiment or by currently available datasets. This problem presents a primary motivation for creating a new dataset. For example, the post "Microchips are telling the government where we are" does not directly feature a prototypical form of sentiment associated with "microchips", "the government" and "we", yet clearly insinuates an invasion of our right to privacy, making clear the Insiders ("we") and Outsiders ("microchips", "the government") in the post.

To construct our novel dataset, Conspiracy Theory-5000 (CT5K), we designed crawlers to extract a corpus of social media posts generated by the underlying narrative framework of vaccine hesitancy (details of the crawlers are documented in Appendix A.1). Vaccine hesitancy is a remarkably resilient belief, fueled by conspiracy theories, that overlaps with multiple other narratives, including ones addressing "depopulation", "government overreach and the deep state", "limits on freedom of choice" and "Satanism". The belief's evolution on social media has already enabled researchers to take the first steps in modeling critical parts of the underlying generative models that drive anti-vaccination conversations on the internet (Tangherlini et al., 2016; Bandari et al., 2017).
Moreover, vaccine hesitancy is especially relevant in the context of the ongoing COVID-19 pandemic (Burki, 2020).

From the crawled corpus, we extract the noun chunks in each post using SpaCy's noun chunk extraction module and dependency parsers (Honnibal and Johnson, 2015). A noun chunk is a subtree of the dependency parse tree whose head word is a noun. The result is a set of post-phrase pairs $(p, n)$, where $p$ is a post and $n$ is one of the noun phrases extracted from the post. Amazon Mechanical Turk (AMT) (see Appendix A.2 for labeler instructions) was used to label the post-phrase pairs. For each pair, the labeler was asked whether, given the context, the writer of the post $p$ perceives the noun phrase $n$ to be an Insider, an Outsider or neither (N/A). The labeler then provides a label $c \in C$, where $C = \{\text{Insider}, \text{Outsider}, \text{N/A}\}$ (hence $|C| = 3$). The post-phrase pairs along with their labels form the dataset $D = \{((p_k, n_k), c_k)\}_{k=1}^{|D|}$. Note that a single post can appear in multiple triplets, because multiple different noun phrases can be extracted and labeled from a single post. The overall class distribution, and the conditional class distributions for several particular noun phrases, are provided in Figure 5 in Appendix B. Manual inspection of randomly sampled labeled examples $((p, n), c)$ suggests that the quality of the dataset is good (fewer than 10% misclassified).

The now-labeled CT5K dataset (Holur et al., 2022; data and model checkpoints are available in the accompanying repository), with $|D| = 5000$ samples, is split into training (90%) and testing (10%) sets; 10% of the training set is held out for validation. The final training set is 20-fold augmented by BERT-driven multi-token insertion (Ma, 2019).

The Noun-Phrase-to-Insider-Outsider (NP2IO) model (code available in the NP2IO repository) adopts a token-classification architecture comprising a BERT-like pretrained backbone and a softmax classifier on top of the backbone. Token-level labels are induced from the span-level labels for fine-tuning over CT5K, and the span-level labeling of noun phrases is done through a majority vote during inference. An outline of the fine-tuning pipeline is provided in Figure 2.

Given a labeled example $((p, n), c)$, the model labels each token $t_i$ in the post $p = [t_1, \ldots, t_N]$, where $N$ is the number of tokens in $p$. The BERT-like backbone embeds each token $t_i$ into a contextual representation $\Phi_i \in \mathbb{R}^d$ (for example, $d = 768$ for BERT-base or RoBERTa-base). The embedding is then passed to the softmax classification layer, $\pi_i = \mathrm{softmax}(W^\top \Phi_i + b)$, where $\pi_i \in \Delta^{|C|}$ is the Insider-Outsider classification prediction probability vector of the $i$-th token, and $W \in \mathbb{R}^{d \times |C|}$ and $b \in \mathbb{R}^{|C|}$ are the parameters of the classifier.

The ground-truth class label $c$ accounts for all occurrences of the noun phrase $n$ in the post $p$. We use this span-level label to induce the token-level labels and facilitate the computation of the fine-tuning loss. Concretely, consider the spans where the noun phrase $n$ occurs in the post $p$: $S_n = \{s_1, \ldots, s_M\}$, where $s_j \in S_n$ denotes the span of the $j$-th occurrence of $n$, and $M$ is the number of occurrences of $n$ in $p$. Each span is a sequence of one or more tokens, and the set of tokens appearing in one of these labeled spans is $T_n = \{t_i : t_i \in s_j \text{ for some } s_j \in S_n\}$. We define the fine-tuning loss $L$ of the labeled example $((p, n), c)$ as the cross-entropy (CE) loss computed over $T_n$, using $c$ as the label for each token in it: $L = -\frac{1}{|T_n|} \sum_{t_i \in T_n} \log\,(\pi_i)_c$, where $(\pi_i)_c$ denotes the prediction probability for class $c \in C$ of the $i$-th token.
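To make the span-to-token label induction and the loss computation concrete, the following is a minimal sketch, not the authors' released implementation: it assumes a HuggingFace fast tokenizer (whose character-offset mapping locates the occurrence spans $S_n$), an illustrative label map, and naive exact-string matching for phrase occurrences. The majority-vote inference of the next section is shown at the end.

```python
# Minimal sketch (assumptions: HuggingFace fast tokenizer, exact string matching
# for noun-phrase occurrences; not the authors' released code).
import re
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = {"Insider": 0, "Outsider": 1, "N/A": 2}   # illustrative class ids
IGNORE = -100                                      # index skipped by cross_entropy

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained("roberta-base", num_labels=3)

def induce_token_labels(post, noun_phrase, label_id):
    """Assign label_id to every token inside an occurrence of the phrase
    (the set T_n); all other tokens are ignored by the loss."""
    enc = tokenizer(post, return_offsets_mapping=True, return_tensors="pt",
                    truncation=True)
    offsets = enc.pop("offset_mapping")[0]          # (num_tokens, 2) char spans
    spans = [m.span() for m in re.finditer(re.escape(noun_phrase), post)]
    labels = torch.full((offsets.size(0),), IGNORE)
    for i, (start, end) in enumerate(offsets.tolist()):
        if start == end:                            # special tokens (<s>, </s>, ...)
            continue
        if any(start >= a and end <= b for a, b in spans):
            labels[i] = label_id                    # token lies in a labeled span
    return enc, labels

enc, labels = induce_token_labels(
    "Bill Gates is developing a vaccine. Vaccines kill people.",
    "Vaccines", LABELS["Outsider"])
logits = model(**enc).logits[0]                     # (num_tokens, |C|)
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE)  # CE over T_n only

# Inference-side majority vote over the same token set T_n (Section 5.2):
votes = logits.argmax(dim=-1)[labels != IGNORE]     # per-token predicted classes
c_hat = votes.mode().values.item()                  # most common token label
```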
The fine-tuning is done with mini-batch gradient descent over the classification layer and a number of self-attention layers in the backbone; the number of fine-tuned self-attention layers is a hyperparameter. The scope of the hyperparameter tuning is provided in Table 4.

During fine-tuning, we extend the label of a noun phrase to all of its constituent tokens; during inference, conversely, we summarize the constituent token labels to classify the noun phrase by majority vote. For a post and noun-phrase pair $(p, n)$, assuming the definitions of $\{t_i\}_{i=1}^{N}$, $\{\pi_i\}_{i=1}^{N}$ and $T_n$ from Section 5.1, the Insider-Outsider label prediction $\hat{c}$ is given by $\hat{c} = \arg\max_{c \in C} \big|\{\, t_i \in T_n : \arg\max_{c' \in C} (\pi_i)_{c'} = c \,\}\big|$. The ground truth $c$ can then be compared to $\hat{c}$ with a number of classification evaluation metrics. Visual display of individual inference results, such as those in Figure 1, is supported by displaCy (Honnibal and Montani, 2017).

In this section, we list the baselines to which we compare our model's performance, ordered by increasing parameter complexity.

• Naïve Bayes Model (NB / NB-L): Given a training set, the naïve Bayes classifier estimates the likelihood of each class conditioned on a noun chunk, $P_{C,N}(c \mid n)$, assuming independence with respect to the surrounding context. That is, a noun phrase labeled more frequently in the training set as an Insider will be predicted as an Insider during inference, regardless of context. For noun phrases not encountered during training, a uniform prior distribution over $C$ is used for the prediction. The noun chunk may be lemmatized (word by word) during training and testing to shrink the conditioned event space. We abbreviate the naïve Bayes model without lemmatization as NB, and the one with lemmatization as NB-L.

• GloVe+CBOW+XGBoost (CBOW-1/2/5): This baseline takes into account the context of a post but uses global word embeddings instead of contextual embeddings. A window length $w$ is fixed such that, for each noun phrase, we extract the $w$ words before and the $w$ words after the noun phrase, creating a set of context words $S_w$. Stopwords are filtered, and the remaining context words are lemmatized and encoded via 300-dimensional GloVe vectors (Pennington et al., 2014). The Continuous Bag of Words (CBOW) model (Mikolov et al., 2013) averages the GloVe vectors of the words in $S_w$ to create an aggregate contextual vector for the noun phrase, and XGBoost (Chen and Guestrin, 2016) is used to classify this aggregated vector. The same model is applied on the test set to generate labels. We consider window lengths of 1, 2 and 5 (CBOW-1, CBOW-2 and CBOW-5, respectively); a minimal sketch of this baseline is given at the end of this section.

A comparison of NP2IO to the baselines is provided in Table 1. The random (RND) and deterministic (DET-I, DET-O, DET-NA) models perform poorly; we present these results to convey the unbalanced nature of the labels in the CT5K dataset (see Figure 5). The naïve Bayes model (NB) and its lemmatized form (NB-L) outperform the trivial baselines, but perform worse than the two contextual models, GloVe+CBOW+XGBoost and NP2IO. This validates a crucial property of our dataset: despite the bias in the gold-standard labels for particular noun phrases such as "I", "they" and "microchip" (see Figure 5 in Appendix B), context dependence plays a crucial role in Insider-Outsider classification. Furthermore, NP2IO summarily outperforms GloVe+CBOW+XGBoost (CBOW-1, CBOW-2, CBOW-5): while both types of models employ context dependence to classify noun phrases, NP2IO does so more effectively.
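As referenced above, here is a minimal sketch of the GloVe+CBOW+XGBoost baseline, under stated assumptions rather than the exact training script: `glove` is assumed to be a dict mapping lowercased lemmas to 300-dimensional numpy vectors loaded elsewhere, and posts are assumed to be pre-tokenized with stopwords already removed; the function name is ours.

```python
import numpy as np
from xgboost import XGBClassifier

def cbow_vector(tokens, np_start, np_end, glove, w=2, dim=300):
    """Mean GloVe vector of up to w context words on each side of the noun
    phrase tokens[np_start:np_end]; out-of-vocabulary words are skipped."""
    context = tokens[max(0, np_start - w):np_start] + tokens[np_end:np_end + w]
    vecs = [glove[t] for t in context if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Usage sketch: each row of X is the aggregated context vector of one labeled
# (post, noun phrase) pair; y holds class ids {0: Insider, 1: Outsider, 2: N/A}.
#   X = np.stack([cbow_vector(toks, s, e, glove) for toks, s, e in train_spans])
#   clf = XGBClassifier().fit(X, y)   # multiclass objective chosen automatically
#   preds = clf.predict(X_test)       # CBOW-1/2/5 correspond to w = 1, 2, 5
```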
The fine-tuning loss convergence plot for the best-performing NP2IO model is presented in Figure 4 in Appendix B, and model checkpoints are uploaded to the data repository.

Given the limitations of current ABSA datasets for our task (see Sections 2 and 3), we computationally show that CT5K is indeed a different dataset, particularly in comparison to classical ones, in Table 2. For this experiment, we train near-state-of-the-art ABSA models with a RoBERTa-base backbone (Dai et al., 2021) on three popular ABSA datasets: Laptop reviews and Restaurant reviews from SemEval 2014 Task 4 (Pontiki et al., 2014), and Tweets (Dong et al., 2014). Each trained model is then evaluated on all three datasets as well as on the test set of CT5K, with the Insider class in CT5K mapped to the positive sentiment and the Outsider class to the negative sentiment. The F1-macro scores of the models trained and tested among the three ABSA datasets are much higher than their scores when tested on CT5K. Clearly, models that are successful on typical ABSA datasets do not generalize effectively to CT5K, suggesting that our dataset is indeed different.

A challenge for any model such as NP2IO is zero-shot performance: classifying noun phrases never tagged during training. This evaluation offers a means of validating the context-dependence requirement mentioned in Section 2, and is conducted on a subset of the test set: a sample $(p, n)$ belongs to this subset if the word-lemmatized, stopword-removed form of $n$ does not appear among the word-lemmatized, stopword-removed noun phrases seen during training (a sketch of this filtering is given at the end of this section). This subset comprises 30% of the test samples. The results are presented in Table 1. As expected, the performance of the naïve Bayes models (NB, NB-L) degrades severely, to random. The performance of the contextual models, CBOW-1/2/5 and NP2IO, stays strong, suggesting that these models make effective use of context in inferring the correct labels. A visualization of the zero-shot capabilities of NP2IO on unseen noun phrases is presented in Figure 6 in Appendix B.

We also construct a set of adversarial samples to evaluate the extent to which NP2IO accurately classifies a noun phrase that has a highly biased label distribution in CT5K. We consider three noun phrases in particular: "microchip", "government" and "chemical", each of which is largely labeled Outsider. The adversarial samples for each phrase, in contrast, are manually aggregated (5 seed posts augmented 20 times each) to suggest that the phrase is an Insider (see Table 5 in Appendix B for the seed posts). We compute the recall of NP2IO in detecting these Insider labels (results in Table 3). NP2IO is moderately robust against such adversarial attacks: highly skewed label distributions for noun phrases in our dataset do not appear to imbue a similarly drastic bias in our model. The results show that NP2IO largely learned to use contextual information in its inference logic, and did not memorize the agent bias in CT5K. We speculate that the residual bias exhibited towards "chemicals" is due to the large body of text documents discussing the adverse effects of chemicals, which is encoded in the embedding structure of pretrained LMs and which NP2IO cannot always overrule, at least not yet.
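As referenced above, the zero-shot subset can be carved out mechanically. The following is a sketch under stated assumptions (SpaCy for lemmatization and stopword removal, consistent with the noun-chunk extraction in Section 4; the function names are ours):

```python
# Sketch of the zero-shot test split: keep a test pair (p, n) only if the
# lemmatized, stopword-removed form of n never occurs among training phrases.
import spacy

nlp = spacy.load("en_core_web_sm")

def normalize(noun_phrase):
    """Canonical form of a noun phrase: lowercased lemmas, with stopwords
    and punctuation dropped."""
    doc = nlp(noun_phrase.lower())
    return " ".join(t.lemma_ for t in doc if not (t.is_stop or t.is_punct))

def zero_shot_subset(train_pairs, test_pairs):
    """train_pairs / test_pairs are lists of (post, noun_phrase) tuples."""
    seen = {normalize(n) for _, n in train_pairs}
    return [(p, n) for p, n in test_pairs if normalize(n) not in seen]
```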
We presented the challenging Insider-Outsider classification task, a novel framework necessary for addressing burgeoning misinformation and the proliferation of threat narratives on social media. We compiled the labeled CT5K dataset of conspiracy-theoretic posts from multiple social media platforms and presented a competitive NP2IO model that outperforms non-trivial baselines. We have demonstrated that NP2IO is contextual and transitive via its zero-shot performance, adversarial studies and qualitative studies, and we have shown that the CT5K dataset consists of underlying information that is different from that in existing ABSA datasets.

Given NP2IO's ability to identify Insiders and Outsiders in a text segment, we can extend the inference engine to an entire set of interrelated samples in order to extract, visualize and interpret the underlying narrative (see Figure 3). This marks a first and significant step in teasing out narratives from fragmentary social media records, with many of their essential semantic parts, such as Insider/Outsider, tagged in an automated fashion. As our extensive evaluations of NP2IO show, the engine has learned the causal phrases used to designate the labels. An immediate line of future work is to identify these causal phrases, yet another step toward semantic understanding of the parts of a narrative. Broadly, work of this kind promises to expedite the development of models that rely on a computational foundation of structured information, and that are better at explaining causal chains of inference, a particularly important feature in tackling misinformation. Indeed, NP2IO's success has answered the question "Which side are you on?" What remains to be synthesized from language is "Why?"

Figure 3: An actor-actant subnarrative network constructed from social media posts. Selected posts from anti-vaccination forums, such as qresearch on 4chan, were decomposed into relationship tuples using a state-of-the-art relationship extraction pipeline from previous work (Tangherlini et al., 2020), and these relationships are overlaid with the inferences from NP2IO. The result is a network in which the nodes are noun phrases and the edges are verb phrases, with each edge representing a relationship extracted from a post. In this network, a connected component emerged that captures a major sub-theory of vaccine hesitancy. This highlights NP2IO's ability to infer both the threat-centric orientation of the narrative space and the negotiation dynamics in play, providing qualitative insight into how NP2IO may be used in future work to extract large-scale, interpretable relationship networks. The green boxes highlight noun phrases with contradictory membership in the Insider and Outsider classes, as their affiliations are deliberated.

Labelers were reminded several times, via popups and other means, that the labels were to be chosen with respect to the author of the post and not according to the labeler's inherent biases and/or political preferences. Data was collected using verified research Application Programming Interfaces (APIs) provided by the social media companies for non-commercial study. To explore data on fringe platforms such as 4chan and 8kun, where standard APIs are not available, the data was scraped using a Selenium-based crawler. All retrieved samples were verified to be public: the posts could be accessed by anyone on the internet without requiring explicit consent from the authors.
Furthermore, we made sure to avoid using Personally Identifiable Information (PII) such as user location, time of posting and other metadata; indeed, we hid even the specific social media platform from which a particular post was mined. The extracted text was cleaned by fixing capitalization, filtering special characters, adjusting inter-word spacing and correcting punctuation, all of which further obfuscated the identity of the author of a particular post.

Table 4: A summary of the parameters considered for fine-tuning; NP2IO's best-performing parameters (by validation loss) are in bold.
Batch Size: 32, 64, 128
Trainable Layers: 0, 1, 2, 5
Learning Rate: 1E-7, 1E-6, 1E-5, 1E-4
Pretrained Backbone: BERT-base, DistilBERT-base, RoBERTa-base, RoBERTa-large

Figure 5: Histograms showing the distributions of labels in CT5K. The plot for "All" represents the full 3-category label distribution across all entities; for "I" the bias toward Insiders is evident; "They" are mostly Outsiders; and there is no clear consensus label for "Vaccine" and "Herd Immunity". "Microchips" are always tagged as Outsiders.

Seed posts (augmented to 100 posts per noun phrase):
microchip: "I love microchips.", "I feel that microchips are great.", "Microchips are lovely and extremely useful.", "I believe microchips are useful in making phones.", "Microchips have made me a lot of money."
government: "The government helps keep me safe.", "The government does a good job.", "I think that without the government, we would be worse off.", "The government keeps us safe.", "A government is important to keep our society stable."
chemical: "Chemicals save us.", "Chemicals can cure cancer.", "I think chemicals can help elongate our lives.", "I think chemicals are great and helps keep us healthy.", "Chemicals can help remove ringworms."

Table 5: The set of 5 Insider-oriented seed posts per noun phrase (in bold), for noun phrases that have a high skew toward Outsider labels in CT5K. Each seed post is augmented 20 times to create a set of 100 adversarial posts per phrase. NP2IO infers the label for the key noun phrase across these samples; the adversarial recall is presented in Table 3.

Figure 6: Zero-shot Insider-Outsider classification profile. This figure shows the consensus vote for noun phrases that do not occur in the training set. The x-axis represents the consensus-vote-count and the y-axis the indices of the noun phrases. The consensus vote is computed for each noun phrase $n$ by passing all the posts that include $n$ through NP2IO; each Insider vote counts as $+1$ and each Outsider vote as $-1$. The consensus-vote-count is also color-coded for better visualization. The zero-shot classification is qualitatively observed to correctly classify popular noun phrases such as "reeducation camps" and "depopulation" as Outsiders, and "american" and "faith" as Insiders, in the subnarrative of the anti-vaccination movement.

References

Searching for storiness: Story-generation from a reader's perspective.
A resistant strain: Revealing the online grassroots rise of the antivaccination movement.
Stance classification of context-dependent claims.
A Culture of Conspiracy: Apocalyptic Visions in Contemporary America.
What are narratives good for?
COVID-19 conspiracy theories: QAnon, 5G, the New World Order and other viral ideas.
The online anti-vaccine movement in the age of COVID-19. The Lancet Digital Health.
XGBoost: A scalable tree boosting system.
A real-time platform for contextualized conspiracy theory analysis (2021).
The long prose form.
Does syntax matter? A strong baseline for aspect-based sentiment analysis with RoBERTa.
Adaptive recursive neural network for target-dependent Twitter sentiment classification.
Stance classification with target-specific neural attention.
The Morphology of North American Indian Folktales.
Question-driven span labeling model for aspect-opinion pair extraction.
ACL 2022 supplementary data files.
Modelling social readers: novel tools for addressing reception from online book reviews.
An improved non-monotonic transition system for dependency parsing.
spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017).
Proactive insider threat detection through social media: The YouTube case.
Stance Detection: A Survey.
Narrative analysis.
Cross-lingual language model pretraining.
Exploiting BERT for end-to-end aspect-based sentiment analysis.
NLP augmentation.
Efficient estimation of word representations in vector space.
The linguistic structure of legends.
Detecting potential insider threat: Analyzing insiders' sentiment exposed in social media.
GloVe: Global vectors for word representation.
SemEval-2014 Task 4: Aspect based sentiment analysis.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
An automated pipeline for character and relationship extraction from readers' literary book reviews on Goodreads.
Conspiracy in the time of corona: automatic detection of emerging COVID-19 conspiracy theories in social media and the news.
Recursive deep models for semantic compositionality over a sentiment treebank.
Social identity and intergroup behaviour.
An integrative theory of intergroup conflict.
Toward a generative model of legend: Pizzas, bridges, vaccines, and witches.
"Mommy blogs" and the vaccination exemption narrative: Results from a machine-learning approach for story aggregation on parenting social media sites.
Ehsan Ebrahimzadeh, and Vwani Roychowdhury. 2020. An automated pipeline for the discovery of conspiracy and conspiracy theory narrative frameworks: Bridgegate, Pizzagate and storytelling on the web.
Label Studio: Data labeling software.
Sentence-level propaganda detection in news articles with transfer learning and BERT-BiLSTM-capsule model.
Stance Classification using Dialogic Properties of Persuasion.
Automated Concatenation of Embeddings for Structured Prediction.
Threat detection in online discussions.
EmotionX-KU: BERT-Max based contextual emotion classifier.
SentiBERT: A transferable transformer-based architecture for compositional sentiment semantics.
"We Make Choices We Think are Going to Save Us": Debate and Stance Identification for Online Breast Cancer CAM Discussions.

Appendix A.1

A daily data collection method (Chong et al., 2021) aggregates heterogeneous data from various social media platforms including Reddit, YouTube, 4chan and 8kun. Our implementation of this pipeline extracted potentially conspiracy-theoretic posts between March 2020 and June 2021. We select the subset of these posts that are relevant to vaccine hesitancy: those that include (a) at least one of the words in ['vaccine', 'mrna', 'pfizer', 'moderna', 'j&j', 'johnson', 'chip', 'pharm'], and (b) between 150 and 700 characters. The end-to-end data processing pipeline is uncased.

Appendix A.2

Amazon Mechanical Turk labelers were required to be at the Masters level (exceeding a trust baseline provided by Amazon), to speak English, and to be residing in the United States. No personally identifying information was collected.
Users were asked to create an account on a Label Studio platform (Tkachenko et al., 2020-2021) to answer a set of 60-80 questions, or roughly 2 hours' worth of questions. Each question included (a) a real, anonymized social media post with a highlighted sentence, and (b) the sentence highlighted in (a), with the noun phrase of interest highlighted in bold. The question prompt read: "Please let us know whether the entity highlighted in bold AS PERCEIVED BY THE WRITER is a good/bad or neutral entity."