key: cord-0500818-snq1ab0s authors: Alex, Neel; Lifland, Eli; Tunstall, Lewis; Thakur, Abhishek; Maham, Pegah; Riedel, C. Jess; Hine, Emmie; Ashurst, Carolyn; Sedille, Paul; Carlier, Alexis; Noetel, Michael; Stuhlmuller, Andreas title: RAFT: A Real-World Few-Shot Text Classification Benchmark date: 2021-09-28 journal: nan DOI: nan sha: e1227daa4877599e13de41a5207a222e1b197456 doc_id: 500818 cord_uid: snq1ab0s Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org . Few-shot learning, the capacity to complete a task given a small number of demonstrations [11] , is one of the hallmarks of human intelligence [30, 17] . As researchers, we leverage this capacity when we delegate work on crowdsourcing platforms or give a task with examples to a human research assistant. Brown et al. [6] show that large pre-trained language models exhibit few-shot learning capabilities for a wide range of natural language tasks. If those capabilities were comparable to people on economically relevant tasks, this would be important to know: a single model could be used across multiple real-world tasks, with low per-task data labeling cost. However, these models have also been shown to have inconsistent few-shot performance depending on the exact setup and task being solved [e.g. 21, 24] . The mixed evidence suggests that it would be valuable to measure and track few-shot performance on a set of tasks that is representative of what appears in practice. Natural language tasks coarsely split into generation, classification, and retrieval. We focus on classification tasks because they support high-quality automated evaluation, cover a wide range of economically valuable tasks, and yet don't have existing real-world benchmarks. Existing few-shot classification benchmarks are typically designed to highlight areas where models fall short [29] or to study particular model abilities [5, 37, 21] . The tasks and evaluation setup aren't optimized to measure progress in applied settings: • Tasks that are generated or chosen specifically to test language models may not represent some of the challenges found when applying these models in real-world settings. For example, SuperGLUE [32] and the few-shot equivalent FewGLUE [29] mainly include short texts. Doing well on applied tasks sometimes requires reasoning over long texts. Existing systems struggle with long texts due to a limited context window, especially in the few-shot setting where some systems learn from examples presented in context. 
• The evaluation does not closely mirror deployment, and may both under- and overestimate models' capabilities. It may underestimate model capability by restricting models to the closed-book setting (e.g., no retrieval from online sources) and using uninformative labels (e.g., 0/1 instead of "about literature" vs. "about movies"). It may overestimate model capability by using many more than a few examples for setting hyperparameters during validation [24]. RAFT is a real-world few-shot text classification benchmark designed to measure how much recent and upcoming NLP advances benefit applications: • The tasks are naturally occurring tasks. Their labeling is inherently valuable to someone, and they may have challenges that are not reflected in synthetic tasks. Inherent value means that, if it were sufficiently fast and cheap, it would be desirable to outsource the task to human research assistants or crowd workers. Challenges refers to the need for information retrieval, domain expertise, parsing long documents, and making use of instructions. Table 1 shows the real-world challenges presented by RAFT, including 4 datasets with long input texts. • The evaluation closely mirrors deployment. For each task, we release a public training set with 50 examples and a larger unlabeled test set. We encourage unsupervised pre-training on the unlabeled examples and open-domain information retrieval. We keep the test-set labels private and provide automated evaluation through a Hugging Face leaderboard. In addition to the gold-standard human labels, we collect automatic and crowdsourced baselines. The automatic baselines reveal areas where current techniques struggle, such as reasoning over long texts and tasks with many classes. The crowdsourced baseline reveals that RAFT includes a mix of moderate to difficult tasks. We also observe difficulties in collecting human crowdsourced baselines on some datasets, particularly when domain expertise is important, which suggests that real-world value often depends on domain knowledge. The RAFT datasets and leaderboard can be viewed and submitted to at https://raft.elicit.org. We briefly review few-shot learning in NLP, then the benchmarks that are most similar to RAFT. Pre-trained language models (PLMs) such as BERT [10] and GPT-3 [6] can learn to do some NLP tasks when prompted with a few demonstrations, including some classification tasks. The two primary approaches to few-shot classification using PLMs are in-context learning and prompt-based fine-tuning. In-context learning. A PLM is primed with labeled examples in its prompt. It classifies the example included at the end of the prompt by predicting the class conditioned on the priming. GPT-3 [6] used in-context learning to achieve promising results on a variety of classification tasks. UniFew [5] similarly achieved strong results on classification tasks via in-context learning, converting classification tasks into a multiple-choice question-answering format for prompting. Prompt-based fine-tuning. A PLM is fine-tuned with masked-language-modeling objectives to learn from a few examples. This approach is also known as Pattern-Exploiting Training (PET) [28]. While PET requires task-specific prompts, it achieves better performance than in-context GPT-3 while using smaller models [29]. LM-BFF [13] improves prompt-based fine-tuning by dynamically constructing prompts.
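To make the in-context learning setup concrete, here is a minimal sketch of packing instructions, labeled demonstrations, and a query into a single classification prompt; the field names and toy task are illustrative, not the templates used by any of the systems above.

```python
# Minimal sketch of in-context classification: pack task instructions, labeled
# demonstrations, and the query into one prompt. Field names ("Text", "Label")
# and the toy task are illustrative only.

def build_prompt(instructions, labeled_examples, query_text):
    parts = [instructions.strip(), ""]
    for text, label in labeled_examples:
        parts.append(f"Text: {text}\nLabel: {label}\n")
    parts.append(f"Text: {query_text}\nLabel:")
    return "\n".join(parts)

prompt = build_prompt(
    instructions="Possible labels: complaint, no complaint.",
    labeled_examples=[
        ("My order never arrived.", "complaint"),
        ("Thanks for the quick reply!", "no complaint"),
    ],
    query_text="The app keeps crashing when I try to pay.",
)
print(prompt)
```

The model's continuation after the final "Label:" (or the probability it assigns to each label string) is then read off as the prediction.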
The most closely related few-shot NLP benchmarks are FLEX [5], FewGLUE [29], CrossFit [37], and NaturalInstructions [21]. Each of these benchmarks includes at least some classification tasks with meaningful textual labels. These benchmarks are designed to study transfer between tasks [5, 37], pinpoint where NLP models fall short [29], and evaluate the ability of models to follow instructions [21], whereas RAFT is designed to be representative of real-world classification tasks. This difference in goals is reflected in the selection of tasks and the evaluation setup. RAFT is a few-shot classification benchmark. We focus on classification primarily because automatic evaluation is more reliable than for generation tasks. We believe (as our results later confirm) that there is still a substantial gap between even non-expert humans and automated systems in the few-shot classification setting. Both the tasks (datasets and metadata) and the evaluation (rules for submission, metrics) are chosen to mirror real-world classification problems. A classification task is a dataset with labeled natural language entries. Each label corresponds one-to-one with a natural language class name. Each task has instructions for labeling. We selected datasets based on the following criteria ("non-trivial real-world tasks"): Naturally occurring. We focus on data that are naturally occurring, rather than being synthetically generated to test and improve language models. Intrinsic value. We select datasets for which the correct labeling inherently provides real-world value. RAFT includes tasks like hate-speech detection, medical case report parsing, and literature review automation, where better performance translates into practical benefits. This criterion involves subjectivity, but we aimed to select tasks that approximate the distribution of valuable classification tasks well. Realistic class distribution. We did not exclude datasets with heavily imbalanced classes. Open-domain feasibility. As we provide an open-domain setting where information retrieved from the web may be used to augment predictions, we excluded tasks for which the correct label is extremely easily discoverable through a Google search. For example, we considered including the LIAR [35] dataset, which includes Politifact statements and their veracity. We decided against including it since it would be trivial to get 100% accuracy by running a site search on https://www.politifact.com/. In order to gather datasets meeting the above requirements, we put out a collaboration request. We also reached out to users of classification on Elicit [23]. Lastly, we conducted a search of existing datasets on the Hugging Face Hub (https://huggingface.co/datasets) and PapersWithCode (https://paperswithcode.com). In cases where the test set was over 5,000 data points, we randomly selected 5,000 to serve as the test set in order to keep the test set sizes manageable. When a dataset didn't already have textual labels, we added textual labels according to our best understanding of the task. We selected 11 datasets, accessible at https://raft.elicit.org/datasets. Table 1 presents an overview of the datasets. More details are available in the Appendix.
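As a concrete starting point, the sketch below loads one RAFT task with the Hugging Face datasets library; the dataset identifier "ought/raft" and the configuration name are assumptions based on the public leaderboard and may differ from the official release.

```python
# Sketch: loading one RAFT task with the Hugging Face `datasets` library.
# The dataset identifier "ought/raft" and the configuration name are assumptions
# and may differ from the official release.
from datasets import load_dataset

task = load_dataset("ought/raft", "twitter_complaints")
train = task["train"]   # the 50 labeled examples released for the task
test = task["test"]     # the larger test set, released without gold labels

print(len(train), "labeled training examples")
print(train[0])         # inspect the data fields and the textual label
```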
ADE Corpus V2 (ADE). The ADE corpus V2 [15] contains sentences from medical case reports annotated for relation to adverse drug effects. We focus on the binary classification task of whether a sentence is related to an adverse drug effect (ADE). Banking77 (B77). Banking77 [7] contains online banking customer service queries annotated with their intents. NeurIPS impact statement risks (NIS). We include the broader impact statements from NeurIPS 2020 papers collected in the dataset from Ashurst et al. [1]. We annotate these based on whether they mention possibly harmful applications of the research done in the paper. The raw scraped NeurIPS impact statements can be found at https://raft.elicit.org/neurips-impact. Semiconductor org types (SOT). We collect a dataset of institutions that have contributed to semiconductor conferences in the last 25 years, then classify these institutions into organization types: "university", "company", and "research institute". Systematic review inclusion (SRI). We use data from a systematic meta-review studying interventions to increase charitable donations [22]. The task is to predict whether a paper advances past the screening stage. TAI safety research (TAI). We include data from the formation of a bibliographic database for research on the safety of transformative artificial intelligence (TAI) [27]. We choose the binary task of predicting whether a work is classified as TAI safety research. Terms of Service (ToS). The Terms of Service dataset [19] contains clauses from Terms of Service agreements, annotated by whether they are potentially unfair to consumers. TweetEval Hate (TEH). We include the hate-speech detection task from the TweetEval dataset [2], which was curated from Basile et al. [3]. Twitter complaints (TC). We include a dataset of tweets annotated by whether they contain a complaint [25]. The RAFT evaluation replicates real-world few-shot classification problems by restricting models to 50 labeled examples without a validation set, providing meaningful instructions and labels, and using a no-holds-barred setting: Task-specific instructions. As an important replacement for large amounts of labeled data, instructions can specify how a task should be done. Therefore, we provide the instructions we give to human labelers so that they can be used in instructing automatic systems. The level of detail of the instructions varies. We write the instructions based on information from publications (for datasets published elsewhere) or in consultation with the dataset creator (for new datasets). Meaningful label names. Similar to instructions, textual labels are an important aspect of few-shot and especially zero-shot learning. We create default textual labels for each dataset as recommended by FLEX [5]. Transfer learning permitted. Transfer and meta-learning using other datasets is permitted, including further pre-training on other corpora. Unlabeled data permitted. Use of the unlabeled RAFT test sets is permitted, as unlabeled data are usually available in the applied setting. Open-domain retrieval permitted. Models may be augmented with information retrieved from the internet, e.g. via automated web searches. Submission requires only labels. Since some RAFT datasets have substantial class imbalances, we use F1 as our evaluation metric. We compute macro-averaged F1 scores, even for binary datasets. To get an overall score, we average across all datasets.
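For concreteness, here is a minimal sketch of this scoring scheme with scikit-learn; the gold and predicted labels below are toy values, not RAFT results.

```python
# Sketch of RAFT-style scoring: macro-averaged F1 per dataset, then a simple
# mean across datasets. The gold/predicted labels below are toy values.
from sklearn.metrics import f1_score

per_dataset = {
    "twitter_complaints": (["complaint", "no complaint", "complaint"],   # gold
                           ["complaint", "complaint", "complaint"]),     # predicted
    "ade_corpus_v2":      (["ADE-related", "not ADE-related"],
                           ["ADE-related", "not ADE-related"]),
}

scores = {name: f1_score(gold, pred, average="macro")
          for name, (gold, pred) in per_dataset.items()}
overall = sum(scores.values()) / len(scores)
print(scores, overall)
```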
The code for all automatic baselines is open-sourced at https://raft.elicit.org/baselines. We provide a simple automatic baseline using GPT-3 [6], accessed through the OpenAI API. As in Brown et al. [6], we use in-context learning, adding labeled examples to the prompt to prime GPT-3. We also run a zero-shot version with no training examples included in the prompt. We build a prompt consisting of task instructions, a set of labeled training examples, and the unlabeled test example to be classified. We truncate from a training example's data fields first, leaving the label intact. Field selection and sorting. We exclude data fields that are unlikely to contribute substantially to GPT-3's performance. These fields either deal with the authors of the textual example or are URLs. Additionally, we sort the order in which the text fields occur to put the most important fields first. When examples are truncated, the most important information is preserved. Semantic selection. To select training examples to include in the prompt for a given test example, we select the most similar training examples as in Liu et al. [20]. To perform semantic search, we use the OpenAI API search endpoint with the ada engine. With the prompt formed, we retrieve GPT-3's 100 most likely next tokens using the davinci engine. For each class, we assign the probability that its first token is generated. We then normalize the probabilities to sum to 1. For the B77 dataset, multiple labels share the same first token, so we prepend a numerical prefix such as "1. " to each class. We tune the GPT-3 baseline on the training set using leave-one-out cross validation (LOOCV): k-fold cross validation with k = n, so that only one example is held out at a time for validation. While LOOCV isn't robust with as few as 50 examples, as discussed in Perez et al. [24], it is one of the best options for parameter selection in the few-shot setting. Detailed LOOCV results are in Section A.5. Instructions. We test two modes of instruction: (a) a generic classification prompt: "Possible labels:" followed by a list of textual labels; (b) instructions similar to the ones given to human labelers, plus the list of textual labels. The instructions are taken whole when possible, and otherwise shortened and summarized manually to limit usage of the GPT-3 context window. Task-specific instructions outperform generic instructions by 0.04 average F1, so we include task-specific instructions in the baseline. Number of examples in the prompt. We select the number of examples to include in the prompt on a per-dataset basis, as our truncation strategy induces a quality-quantity trade-off. For each dataset, we test performance with 5, 10, 25, and 50 training examples and choose the number that performs best by F1. For datasets with long inputs, smaller numbers of more detailed examples often produce better performance, while datasets with shorter inputs can fit more complete labeled examples in the prompt. In-context baselines. We run further in-context baselines with GPT-Neo [4] and GPT-2 [26]. We provide code for generating predictions on RAFT using these models and any other causal language model available on the Hugging Face Hub. For semantic search, we use a MiniLM [34] fine-tuned on sentence pairs via the sentence-transformers package. Zero-shot baselines. We run two transformers in the zero-shot setting: GPT-3 (with no training examples in the prompt) and BART. Non-neural baselines. We run AdaBoost [12] to establish a strong non-neural baseline. We construct feature vectors for each example based on the counts of n-grams of 1-5 words as the input to a weighted ensemble of 100 depth-3 decision trees. These decision trees and weights are trained with AdaBoost with a learning rate of 1, and evaluated through weighted voting. We also include a plurality (most frequent) class baseline.
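A minimal sketch of the first-token scoring used by the in-context baselines above, with GPT-2 from the Hugging Face transformers library standing in for GPT-3: each class is scored by the probability assigned to the first token of its label after the prompt, and the scores are normalized. The prompt template is illustrative.

```python
# Sketch of first-token class scoring with an open causal LM (GPT-2 standing in
# for GPT-3): score each class by the probability its first label token receives
# as the next token after the prompt, then normalize. Template is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def classify(prompt, class_names):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    next_token_probs = logits[0, -1].softmax(dim=-1)
    scores = {}
    for name in class_names:
        # Leading space so the token matches how the label follows "Label:".
        first_token_id = tokenizer(" " + name)["input_ids"][0]
        scores[name] = next_token_probs[first_token_id].item()
    total = sum(scores.values())
    return {name: p / total for name, p in scores.items()}

prompt = "Tweet: The app keeps crashing when I try to pay.\nLabel:"
print(classify(prompt, ["complaint", "no complaint"]))
```

Classes whose names share a first token would need the numerical-prefix workaround described above for B77.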
To collect human baselines, we use the Surge crowdsourcing platform. Following Wang et al. [32], we randomly select 100 data points from each test set and use a 2-step labeling process: qualification, then annotation. The crowdsourced label is the plurality vote of 5 labelers. We put crowd workers in a similar situation to automated systems. We link to a sheet with the same 50 labeled examples, use the same textual labels, and give the same task-specific instructions that we are providing to practitioners to adapt for instructing language models. Humans generally outperform GPT-3. Humans outperform GPT-3 on 8 out of 11 tasks, demonstrating room for improvement for models on real-world few-shot tasks. We expect that exceeding the crowdsourced baseline will require substantial advances in model performance, and even more so for a future expert human baseline. Weaknesses of GPT-3 include: • Many classes: GPT-3 struggles on tasks with many classes, such as B77 with its 77 intents; given only 50 labeled examples and a limited context window, most classes cannot be illustrated with labeled examples. Additionally, just listing out the possible classes takes up a large portion of GPT-3's context window. • Long inputs: GPT-3 performs poorly on some tasks requiring reasoning over long inputs, such as NIS and OSE. GPT-3's context window may be a contributing factor. Crowd-sourced baselines struggle on domain-specific tasks. Crowd-sourced humans substantially outperform GPT-3 on only 1 of the 4 tasks we identified as requiring domain expertise: • Humans substantially outperform GPT-3 on ADE, which requires medical expertise. • Humans outperform GPT-3 by just .053 on ToS, which requires parsing legal language. • GPT-3 outperforms humans on Over, which requires greater legal expertise than ToS [39], and TAI, which requires expertise in AI safety research. Zero-shot performance is weak. GPT-3 zero-shot does poorly on RAFT, performing worse than the plurality class baseline. BART zero-shot exceeds the plurality class baseline but does not do so on every dataset, and it is not competitive with few-shot language models. We encourage future research on improving performance in the zero-shot setting, perhaps through improved prompt construction and transfer learning. Neural baselines besides few-shot GPT-3 perform worse than AdaBoost. Generative language models smaller than GPT-3 comfortably outperform the plurality class baseline but remain below AdaBoost. We use the same number of labeled examples in the prompt as with GPT-3 despite these models' smaller context windows; performance may improve with fewer (but less heavily truncated) examples. Linguistic diversity. The benchmark only includes English tasks. Dealing with multilingual corpora is a real-world challenge for many NLP systems, especially for those deployed in countries where there are multiple national languages. To fully capture the distribution of real-world tasks, additional languages will be needed. Possible biases in data collection. While we attempted to execute the dataset selection process described in Section 3.1.3 in an unbiased manner, the selection ultimately rested on subjective human judgments that may introduce biases. For example, the organizations we work with are disproportionately in technology and policy. Offensive content. By including a hate-speech detection dataset, we include offensive content and may harm readers of the dataset. We believe the advantages from studying hate-speech detection are likely greater than the disadvantages of publicizing hate-speech datasets. Prohibitive costs. The models best equipped to perform well on RAFT will often be the massive transformer models trained by private corporations. In advancing this benchmark as a means of evaluating models, we risk further widening the gap between what a dedicated individual or team can do, and what can only be done by industry research labs with sufficient funding. Stronger human baselines.
Human baselines are intended to tell us how well the dataset would be labeled in the absence of automated systems. For many RAFT datasets, this process would involve a stronger baseline than is easily available via a crowd-worker platform: for example, the Over dataset would be labeled by someone with law expertise. In addition to ML submissions, we welcome efforts to collect stronger human baselines for RAFT. Additional automatic baselines. We expect that systems that use prompt-based fine-tuning rather than in-context learning may provide an even stronger automatic baseline. We further expect that models that leverage the open-domain information retrieval option can surpass models that don't. Application-specific metrics. Different applications care about different metrics; e.g., in some applications it is more important to minimize false positives, whereas in others the focus is on false negatives. An ideal measure of real-world value would take that into account. Learning from natural language In this work, we focused on instructions as a supplement to labeled examples. Similarly to Mishra et al. [21] , we found that including task-specific instructions improved performance. Like humans, NLP systems could also learn from other types of natural language. For example, could including explanations with each labeled example be used to further improve few-shot performance? RAFT is a benchmark that tests language models across multiple domains on economically valuable classification tasks in the true few-shot setting. To our knowledge, this is the first multi-task benchmark designed to closely mirror how models are applied in both the task distribution and the evaluation setup. By complementing existing synthetic benchmarks designed to highlight where models fall short, it helps measure the gap between research and practice, incentivizes work that is valuable for deployed systems, and provides a template for future benchmarks that mirror deployment. TweetEval Hate (TEH). Unlicensed. Twitter complaints (TC). Unlicensed. See Table 3 for one training example from each dataset. Table 4 contains an excerpt from the instructions for each dataset. Below we provide the full instructions given to human annotators and adapted for automatic baselines for each RAFT task. Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below: Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants). Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake. The following is a banking customer service query. Classify the query into one of the 77 categories available. Label the impact statement based on whether it mentions a harmful application of the research done in the paper. Make sure the statement is sufficient to conclude there are harmful applications of the research being done, not a past risk that this research is solving. 
In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. Label the sentence based on whether it is overruling or not.
Table 4: Instructions excerpt for each dataset.
ADE Corpus V2 (ADE): Label the sentence based on whether it is related to an adverse drug effect (ADE).
Banking77 (B77): The following is a banking customer service query.
NeurIPS impact statement risks (NIS): Label the impact statement as "mentions a harmful application" or "doesn't mention a harmful application" based on whether it mentions a harmful application of the research done in the paper.
One Stop English (OSE): The following is an article sourced from The Guardian newspaper, and rewritten by teachers to suit three levels of adult English as Second Language (ESL) learners: elementary, intermediate, and advanced.
Overruling (Over): In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved.
Semiconductor org types (SOT): The dataset is a list of institutions that have contributed papers to semiconductor conferences in the last 25 years, as catalogued by IEEE and sampled randomly.
Systematic review inclusion (SRI): Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations.
TAI safety research (TAI): The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for Transformative AI.
Terms of Service (ToS): According to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties' rights and obligations, to the detriment of the consumer.
TweetEval Hate (TEH): Label whether the following tweet contains hate speech against either immigrants or women.
Twitter complaints (TC): A complaint presents a state of affairs which breaches the writer's favorable expectation.
The dataset is a list of institutions that have contributed papers to semiconductor conferences in the last 25 years, as catalogued by IEEE and sampled randomly. The goal is to classify the institutions into one of three categories: "university", "company" or "research institute". Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations. Papers should be included if they meet all of these criteria: 1. systematic reviews, scoping reviews, or similar reproducible reviews; 2. reviews describing monetary charitable donations; 3. reviews assessing any population of participants in any context; and 4. peer reviewed and written in English (due to logistical constraints). They shouldn't be included if they meet any of these criteria: 1. primary research reporting new data (e.g., randomised experiments); 2. non-systematic reviews, theory papers, or narrative reviews; 3. reviews on cause-related marketing; and 4. reviews of other kinds of prosocial behaviour (e.g., honesty, non-financial donations like volunteering, blood, or organ donations). Transformative AI (TAI) is defined as AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution. Label a paper as "TAI safety research" if: 1.
The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for TAI. The paper need not mention TAI explicitly, but it must be motivated by it, since there are far too many papers that are merely relevant to safety. Judging motivation is, unfortunately, inherently subjective, but this is necessary to avoid penalizing papers that do not explicitly mention TAI for appearance reasons, while also not including every paper on, e.g., adversarial examples (which are motivated by capabilities and near-term safety). If the paper would likely have been written even in the absence of TAI-safety concerns, it is excluded. Ultimately, we want to support researchers who are motivated by TAI safety and allow them to find each other's work 2. There is substantive content on AI safety, not just AI capabilities. That said, for more speculative papers it is harder to distinguish between safety vs. not safety, and between technical vs. meta, and we err on the side of inclusion. Articles on the safety of autonomous vehicles are generally excluded, but articles on the foundations of decision theory for AGI are generally included. 3. The intended audience is the community of researchers. Popular articles and books are excluded. Papers that are widely released but nevertheless have substantial research content (e.g., Bostrom's Superintelligence) are included, but papers that merely try to recruit researchers are excluded. 4. It meets a subjective threshold of seriousness/quality. This is intended to be a very low threshold, and would, for instance, include anything that was accepted to be placed on the ArXiv. Web content not intended for review (e.g., blog posts) is only accepted if it has reached some (inevitably subjective) threshold of notability in the community. It is of course infeasible for us to document all blog posts that are about TAI safety, but we do not want to exclude some posts that have been influential but have never been published formally. 5. Peer review is not required. White papers, preprints, and book chapters are all included. Otherwise, label it as "not TAI safety research". Label the sentence from a Terms of Service based on whether it is potentially unfair. If it seems clearly unfair, mark it as potentially unfair. According to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer. Details on types of potentially unfair clauses are found below: The jurisdiction clause stipulates what courts will have the competence to adjudicate disputes under the contract. Jurisdiction clauses giving consumers a right to bring disputes in their place of residence were marked as clearly fair, whereas clauses stating that any judicial proceeding takes a residence away (i.e. in a different city, different country) were marked as clearly unfair. The choice of law clause specifies what law will govern the contract, meaning also what law will be applied in potential adjudication of a dispute arising under the contract. Clauses defining the applicable law as the law of the consumer's country of residence were marked as clearly fair. In every other case, the choice of law clause was considered as potentially unfair. 
The limitation of liability clause stipulates that the duty to pay damages is limited or excluded, for certain kinds of losses, under certain conditions. Clauses that explicitly affirm non-excludable providers' liabilities were marked as clearly fair. Clauses that reduce, limit, or exclude the liability of the service provider were marked as potentially unfair when concerning broad categories of losses or causes of them, such as any harm to the computer system because of malware, loss of data, or the suspension, modification, discontinuance or lack of availability of the service. Liability limitation clauses containing a blanket phrase like "to the fullest extent permissible by law" were also considered potentially unfair. Clauses meant to reduce, limit, or exclude the liability of the service provider for physical injuries, intentional damages, as well as in case of gross negligence, were marked as clearly unfair. The unilateral change clause specifies the conditions under which the service provider could amend and modify the terms of service and/or the service itself. Such clauses were always considered potentially unfair. The unilateral termination clause gives the provider the right to suspend and/or terminate the service and/or the contract, and sometimes details the circumstances under which the provider claims to have a right to do so. The contract by using clause stipulates that the consumer is bound by the terms of use of a specific service simply by using the service, without even being required to mark that he or she has read and accepted them. We always marked such clauses as potentially unfair. The content removal clause gives the provider a right to modify/delete the user's content, including in-app purchases, and sometimes specifies the conditions under which the service provider may do so. The arbitration clause requires or allows the parties to resolve their disputes through an arbitration process, before the case could go to court. It is therefore considered a kind of forum selection clause. However, such a clause may or may not specify that arbitration should occur within a specific jurisdiction. Clauses stipulating that the arbitration should (1) take place in a state other than the state of the consumer's residence and/or (2) be based not on law but on the arbiter's discretion were marked as clearly unfair. Clauses defining arbitration as fully optional would have to be marked as clearly fair. WARNING: This task involves labeling offensive and hateful content, particularly toward immigrants and women. Label whether the following tweet contains hate speech against either immigrants or women. Hate Speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics. Detailed guidelines are provided below; please read them before labeling. More specifically, HS against immigrants may include: • insults, threats, denigrating or hateful expressions • incitement to hatred, violence or violation of rights to individuals or groups perceived as different for somatic traits (e.g. skin color), origin, cultural traits, language, etc.
• presumed association of origin/ethnicity with cognitive abilities, propensity to crime, laziness or other vices • references to the alleged inferiority (or superiority) of some ethnic groups with respect to others • delegitimization of social position or credibility based on origin/ethnicity • references to certain backgrounds/ethnicities as a threat to the national security or welfare or as competitors in the distribution of government resources • dehumanization or association with animals or entities considered inferior While answering the question "Is this tweet hateful?", you must take into account the following aspects: 1. the tweet content MUST have IMMIGRANTS/REFUGEES as main TARGET, or even a single individual, but considered for his/her membership in that category (and NOT for the individual characteristics) 2. we must deal with a message that spreads, incites, promotes or justifies HATRED OR VIOLENCE TOWARDS THE TARGET, or a message that aims at dehumanizing, hurting or intimidating the target The joint presence of both elements in a tweet is considered essential to determine whether the tweet has hateful contents, therefore if both of them occur, your answer will be 'Yes'. In case even just one of these conditions is not detected, HS (at least against immigrants) is assumed not to occur, then your answer will be 'No'. Here a list of other aspects that are NOT considered hate speech for our purposes: • HATE SPEECH AGAINST OTHER TARGETS • offensive language • blasphemy • historical denial • overt incitement to terrorism • offense towards public servants and police officers • defamation Label the tweet as hate speech if it is misogynous against women. A tweet is misogynous if it expresses hating towards women in particular (in the form of insulting, sexual harassment, threats of violence, stereotype, objectification and negation of male responsibility). A complaint presents a state of affairs which breaches the writer's favorable expectation. Label the tweet text based on whether it contains a complaint. We provide documentation using applicable questions from the datasheets framework [14] for the NIS, SOT, and TAI datasets. For documentation on other datasets we refer readers to the works in which the datasets were originally introduced as cited in Section 3.1.3. The labeling section of this documentation contains information on how the impact statements were annotated based on whether they mention a harmful application. The other sections largely contain information on how the original dataset of NeurIPS impact statements [1] was collected. • For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. The original dataset was created to evaluate the then new requirement for authors to include an "impact statement" in their 2020 NeurIPS papers. Had it been successful? What kind of things did authors mention the most? How long were impact statements on average? See [1] for more details. • Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? The original dataset was created as part of a project based at the Centre for the Governance of AI, which involved individual researchers and developers from the University of Oxford, Oxford Internet Institute, Harvard Kennedy School and the Alan Turing Institute. • Who funded the creation of dataset? 
If there is an associated grant, please provide the name of the grantor and the grant name and number. The project was based at the Centre for the Governance of AI. There was no grant associated with the project. Individuals were funded by their respective organisations, or as contractors. • Is any information missing from individual instances in the dataset? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No. • Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. This dataset has limitations that should be taken into consideration when using it. In particular, the method used to collect broader impact statements involved automated downloads, conversions and scraping and was not error-proof (see https://github.com/paulsedille/NeurIPS-Broader-Impact-Statements/blob/main/maindataset/notes-on-data.md for details). Although care has been taken to identify and correct as many errors as possible, not all texts have been reviewed by a human. This means it is possible some of the broader impact statements contained in the dataset are truncated or otherwise incorrectly extracted from their original article. The original dataset also contains labels describing whether authors chose to effectively "opt-out" of the requirement (for example by stating that a broader impact section is "Not Applicable"). Several statements were ambiguous in this respect, and so this label represents a subjective judgement on what constituted an opt-out. The labeling performed for this paper (whether a harmful application is mentioned) also constitutes a subjective judgment, and will contain human biases. Please see the section on Preprocessing, Cleaning, Labeling for more details. • Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description. The dataset contains authors' names. These were scraped from publicly available scientific papers submitted to NeurIPS 2020. • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. No. • Does the dataset relate to people? The dataset does not relate to people directly, although it does contain authors' names. These were scraped from publicly available scientific papers submitted to NeurIPS 2020. • How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how. The data was directly observable (raw text scraped) for the most part; although some data was taken from previous datasets (which themselves had taken it from raw text). The data was validated, but only in part, by human reviewers. 
Further details can be found here: https://github.com/paulsedille/NeurIPS-Broader-Impact-Statements/blob/main/maindataset/notes-on-data.md • What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? The main dataset was collected using software, and a combination of code iteration and human review was used to validate the results. Further details may be found here: https://github.com/paulsedille/NeurIPS-Broader-Impact-Statements/blob/main/main-dataset/notes-on-data.md. • If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? The subset annotated based on harmful applications was sampled randomly. • Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? The original dataset was created as part of a project based at the Centre for the Governance of AI, which involved individual researchers and developers as described above. The labeling for this paper (whether a harmful application is mentioned) was performed by Ought contractors. • Does the dataset relate to people? The dataset does not relate to people directly, although it does contain authors' names. These were scraped from publicly available scientific papers submitted to NeurIPS 2020. • Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? The impact statements were collected from the NeurIPS websites. Metadata included in the original dataset was collected from the NeurIPS chairs, and websites (for example where affiliated institutions are geographically based). See [1] for further details. The labeling for this paper (whether a harmful application is mentioned) was collected from the contractors directly. • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? For the original dataset [1] , the manuscript pdfs for accepted papers were obtained from the NeurIPS 2020 proceedings website. The pdfs were converted to XML, and the title and impact statement section were extracted. The dataset was appended with information about paper subject area, author names, affiliations, affiliation type and affiliation institution locations, as follows. Primary and secondary subject area, as selected by authors on submission, were supplied to us by the NeurIPS programme chairs. Author names and affiliations were obtained from separate scrapes of the NeurIPS papers. Each affiliation was tagged with a location and type (industry or academia) based on [16] and [8] respectively. Further details on the generation of the original dataset, and its assumptions and limitations, can be found at https://github.com/paulsedille/NeurIPS-Broader-Impact-Statements/blob/main/main-dataset/notes-on-data.md. Contractors paid by Ought performed the labeling of whether impact statements mention harmful applications. A majority vote was taken from three annotators. • Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? 
The original NeurIPS impact statements data is available at https://github.com/paulsedille/NeurIPS-Broader-Impact-Statements. The accepted papers containing the statements can also be found at https://proceedings.neurips.cc/paper/2020. • Has the dataset been used for any tasks already? If so, please provide a description. An analysis of the original dataset has been prepared by the dataset authors, which can be found in Ashurst et al. [1] . • What (other) tasks could the dataset be used for? Other researchers are encouraged to use the dataset to provide further analysis on the outcomes of the NeurIPS broader impact requirement. The dataset could also be used for additional meta-analysis of NeurIPS 2020 accepted papers. • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks) If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms? This dataset has limitations that should be taken into consideration when using it. In particular, the method used to collect broader impact statements involved automated downloads, conversions and scraping and was not error-proof. Although care has been taken to identify and correct as many errors as possible, not all texts have been reviewed by a human. This means it is possible some of the broader impact statements contained in the dataset are truncated or otherwise incorrectly extracted from their original article. More details may be found at https://github.com/paulsedille/NeurIPS-Broader-Impact-Statements/blob/main/main-dataset/notes-on-data.md. For this paper, individual labelers were asked whether harmful applications were mentioned in the statement, but what constitutes a harmful application is of course highly subjective, and will depend on the particular views and experiences of the labeler. For example, many applications will provide some benefits to some individuals and groups, while creating risks and harms to others. The intention was to capture a rough measure of whether the authors had intended to point out potential negative effects that could arise from the use of their work, or whether they chose to limit to potential positive impacts only. This will likely exclude applications that are typically viewed as beneficial or neutral, despite the fact that such applications can cause harm to individuals or subgroups in society. We therefore urge caution in how such labels are interpreted for future tasks. This Labeling section of this documentation contains information on how the semiconductor organizations were annotated by type. The other sections mainly contain information describing how the unlabeled dataset of semiconductor organizations was collected. • For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. The data set was originally created to understand better which countries' organisations have contributed most to semiconductor R&D over the past 25 years using three main conferences. 
Moreover, to estimate the share of academic and private sector contributions, the organisations were classified as "university", "research institute" or "company". • Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? The data science unit of Stiftung Neue Verantwortung (Berlin). • Who funded the creation of dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number. The Stiftung Mercator is funding the data science unit in general • Is any information missing from individual instances in the dataset? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. This data set is a sample of 500 out of many more organisations. Examples where the institution names contain "universit" were deleted because all language models can classify this as "university" and no discrimination is gained. • Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. The human-created labels could be wrong. • Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description. No. • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. No. • Does the dataset relate to people? No. • What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? We used the IEEE API to obtain institutions that contributed papers to semiconductor conferences in the last 25 years. This is a random sample of 500 of them with a corresponding conference paper title. The three conferences were the International Solid-State Circuits Conference (ISSCC), the Symposia on VLSI Technology and Circuits (VLSI) and the International Electron Devices Meeting (IEDM). • If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? It was probabilistic. Duplicate entries (by organisation name) were deleted. • Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? A student was involved and paid according to German law. • Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. March 2021 Preprocessing, Cleaning, Labeling • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? Yes. Contractors paid by Ought performed the labeling of organization types. A majority vote was taken from 3 annotators. 
• Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. It can only be used for non-commercial research purposes. See here and here. The annotated data is licensed under Creative Commons Attribution-NonCommercial 4.0 International. • For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. The primary motivations for assembling this database were to: (1) Aid potential donors in assessing organizations focusing on TAI safety by collecting and analyzing their research output. (2) Assemble a comprehensive bibliographic database that can be used as a base for future projects, such as a living review of the field. • Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? Angelica Deibel and myself (Jess Riedel). We did not do it on behalf of any entity. • Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation. No. • Does the dataset relate to people? It's a database of papers, which have authors. • Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? Both. • Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself. We asked authors to suggest papers that should be included in the database. • Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis)been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation. No. Preprocessing, Cleaning, Labeling • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? Yes. See the LessWrong post for more details on our labels, which was done largely by hand, on citation numbers, collected from Google Scholar by automated API call, and on the basic bibliographic information, which was collected with the automated tools from Zotero: https://www.lesswrong.com/posts/4DegbDJJiMX2b3EKm/tai-safety-bibliographicdatabase • Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? No. There was no clean distinction between raw and processed data. We used several automated tools that interacted, plus corrections and additions by hand. • Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point. 
See the link to the citation-numbers API call for Google Scholar in the LessWrong post for more details: https://www.lesswrong.com/posts/4DegbDJJiMX2b3EKm/tai-safety-bibliographicdatabase Uses • Has the dataset been used for any tasks already? If so, please provide a description. Yes, for the report we posted on LessWrong here. It was also used by "Larks" in his review. • Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point. No, this hasn't been used in any academic papers yet. • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms? No. • Are there tasks for which the dataset should not be used? If so, please provide a description. Don't use it to create a dangerous AI that could bring the end of days. • Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. As mentioned in the LessWrong post: "We release the Zotero database under the Creative Commons Attribution-ShareAlike 4.0 International License. In short, this means you are free to use, modify, and reproduce the database for anything so long as you cite us and release any derivative works under the same license." • Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions. No. The CC BY-SA license is the only restriction. • Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. No. The code for the GPT-3 baseline is available at https://raft.elicit.org/baselines under an MIT license. Running the automatic baseline of GPT-3 davinci on the test sets cost approximately $2,600. Tables 5, 6, and 7 detail the results of the parameter selection runs; they report per-dataset and average F1 scores (Avg, ADE, B77, NIS, OSE, Over, SOT, SRI, TAI, ToS, TEH, TC). All runs were done using GPT-3. We mistakenly use 50 rather than 25 training examples in the prompt for TEH when running in-context baselines, despite 25 performing better in LOOCV. When running in-context baselines besides GPT-3, we use the same number of training examples in the prompt. Note that this may be suboptimal due to other models having smaller context windows; we leave improving upon these baselines to future work. We concatenated all non-label data in every training example into a single string, separated by periods, then constructed n-grams from all words and adjacent sets of n words in the dataset for n ∈ [1, 5], after removing letter case and certain special symbols. Each training or test example was vectorized as the count of each n-gram in the example. For the base estimator, we used decision trees with a maximum depth of 3. We ensembled 100 estimators with a learning rate of 1.0.
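A minimal scikit-learn sketch of this n-gram AdaBoost baseline follows; the preprocessing is simplified relative to the description above, and the toy texts are placeholders for a RAFT training set.

```python
# Sketch of the non-neural baseline: word n-gram counts (n = 1..5) fed to an
# AdaBoost ensemble of 100 depth-3 decision trees with learning rate 1.0.
# Preprocessing is simplified; the toy texts stand in for a RAFT training set.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

train_texts = ["My order never arrived.", "Thanks for the quick reply!",
               "Still waiting on my refund.", "Great support, issue resolved."]
train_labels = ["complaint", "no complaint", "complaint", "no complaint"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 5), lowercase=True),
    # Older scikit-learn versions use `base_estimator=` instead of `estimator=`.
    AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                       n_estimators=100, learning_rate=1.0),
)
model.fit(train_texts, train_labels)
print(model.predict(["The package still has not shown up."]))
```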
Each training or test example was vectorized as the count of each n-gram in the example. For the base estimator, we used decision trees with a maximum depth of 3. We ensembled 100 estimators with a learning rate of 1.0.

We tuned several hyperparameters in our AdaBoost implementation. First, we tested the learning rate of AdaBoost, the rate at which the weights of the ensembled classifiers are changed, and found that it did not change results substantially within a reasonable range. We then tested a number of different decision-tree depths for the ensemble, finding that low depths were ideal. Finally, we tested the number of trees to ensemble, finding that around 50 to 100 trees perform best. All hyperparameters were tuned with leave-one-out cross-validation.

Table 10: LOO cross-validation performance for the number of trees; F1 scores from an AdaBoost ensemble classifier with learning rate 1.0 trained on n-grams of the dataset for n ∈ [1, 5].

We collect 5 labels for each of the 100 data points. We then take the plurality vote for each data point, breaking ties randomly. Due to extreme class imbalance, we conduct only an annotation phase of 200 data points for the SRI dataset. We attempted to mimic the annotation instructions reported by the works introducing the datasets whenever possible. The instructions we gave to annotators were as follows (parts enclosed in brackets denote variations in the instructions depending on the task or phase): [If qualification phase: This task will serve as a qualification stage for annotation on a larger set. Label at least 10 examples to be considered for qualification for the annotation task. Please only complete this qualification if you're available to label 100 more data points in the next day.] You may use info on the internet (e.g. Google searches) to help you. We know that labeling accuracy will (a) vary some based on level of background knowledge and (b) have some inherent subjectivity. Please select your best guess for each data point.

Task-specific instructions are detailed in Section A.3. We spent $2,030 compensating crowdworkers for human baselines. We conservatively estimate that workers were paid $15/hr.
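As a small illustration of the aggregation step above, the following is a sketch of plurality voting with random tie-breaking; it is not the exact script used for the human baseline, and the function name plurality_vote and the example label strings are ours.

```python
# Illustrative sketch (not the exact aggregation script) of plurality voting
# with random tie-breaking over the crowdsourced labels for one data point.
import random
from collections import Counter


def plurality_vote(labels, rng=random):
    # labels: the (e.g., 5) annotator labels collected for a single example.
    counts = Counter(labels)
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    return rng.choice(tied)  # ties are broken uniformly at random


# Example: 2 votes for "complaint", 2 for "no complaint", 1 for "unsure"
# yields a uniformly random pick between "complaint" and "no complaint".
print(plurality_vote(["complaint", "no complaint", "complaint",
                      "no complaint", "unsure"]))
```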
There's no redacted information, but there are undoubtedly tons of papers we failed to find in our literature search. Also, we kept/excluded articles based on a set of subjective criteria we invented.

• Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. See above. No redundancies that I know of.

• Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description.

• Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

• Does the dataset relate to people? If not, you may skip the remaining questions in this section. Sort of. It's a database of papers, and those papers have authors.

• Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

• Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how. It's a database of papers.

• Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

• How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how. We asked TAI safety organizations for what their employees had written, emailed some individual authors, and searched Google Scholar. See the LessWrong post for more details.

• What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? Mostly by hand. We collected citation information using an automated API call to Google Scholar. See the LessWrong post for more details.

• Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? It was Angelica Deibel and me. I volunteered and

• Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

Acknowledgments

Our automatic baseline collection was subsidized by compute credits generously provided by OpenAI. Ethan Perez, Samuel Bowman, and Long Ouyang gave feedback on early versions of the RAFT concept and dataset lists. Douwe Kiela and Stella Biderman offered helpful advice on the project direction. Ross Gruetzemacher suggested inclusion of the Twitter Complaints dataset. We thank Thomas Wolf and Simon Brandeis for discussions and advice around the design of the benchmark's infrastructure.