Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

Emily M. Bender, Department of Linguistics, University of Washington, ebender@uw.edu
Batya Friedman, The Information School, University of Washington, batya@uw.edu

Abstract

In this paper, we propose data statements as a design solution and professional practice for natural language processing technologists, in both research and development. Through the adoption and widespread use of data statements, the field can begin to address critical scientific and ethical issues that result from the use of data from certain populations in the development of technology for other populations. We present a form that data statements can take and explore the implications of adopting them as part of regular practice. We argue that data statements will help alleviate issues related to exclusion and bias in language technology; lead to better precision in claims about how NLP research can generalize and thus better engineering results; protect companies from public embarrassment; and ultimately lead to language technology that meets its users in their own preferred linguistic style and, furthermore, does not misrepresent them to others.

1 Introduction

As technology enters widespread societal use it is important that we, as technologists, think critically about how the design decisions we make and systems we build impact people, including not only users of the systems but also other people who will be affected by the systems without directly interacting with them. For this paper, we focus on natural language processing (NLP) technology. Potential adverse impacts include NLP systems that fail to work for specific subpopulations (e.g. children or speakers of language varieties which are not supported by training or test data) or systems that reify and reinforce biases present in training data (e.g. a resume-review system that ranks female candidates as less qualified for computer programming jobs because of biases present in training text). There are both scientific and ethical reasons to be concerned. Scientifically, there is the issue of generalizability of results; ethically, the potential for significant real-world harms.

While there is increasing interest in ethics in NLP,1 there remains the open and urgent question of how we integrate ethical considerations into the everyday practice of our field. This question has no simple answer, but rather will require a constellation of multi-faceted solutions. Toward that end, and drawing on value sensitive design (Friedman et al., 2006), this paper contributes one new professional practice, called data statements, which we argue will bring about improvements in engineering and scientific outcomes while also enabling more ethically responsive NLP technology. A data statement is a characterization of a dataset which provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software. In developing this practice, we draw on analogous practices from the fields of psychology and medicine that require some standardized information about the populations studied (e.g. APA 2009; Moher et al. 2010; Furler et al. 2012; Mbuagbaw et al. 2017). [Footnote 1: This interest has manifested in workshops (Fort et al., 2016; Devillers et al., 2016; Hovy et al., 2017) and papers (Hovy and Spruit, 2016) in NLP, as well as workshops in related fields, notably the FATML series (http://www.fatml.org/) held annually since 2014.]
Though the construct of data statements applies more broadly, in this paper we focus specifically on data statements for NLP systems. Data statements should be included in most writing on NLP including: papers presenting new datasets, papers reporting experimental work with datasets, and documentation for NLP systems. Data statements should help us as a field engage with the ethical issues of exclusion, overgeneralization, and underexposure (Hovy and Spruit, 2016). Furthermore, as data statements bring our datasets and their represented populations into better focus, they should also help us as a field deal with scientific issues of generalizability and reproducibility. Adopting this practice will position us to better understand and describe our results and, ultimately, do better and more ethical science and engineering.2 [Footnote 2: By arguing here that data statements promote both ethical practice and sound science, we do not mean to suggest that these two can be conflated. A system can give accurate responses as measured by some test set (scientific soundness) and yet lead to real-world harms (ethical issues). Accordingly, it is up to researchers and research communities to engage with both scientific and ethical ideals.]

We begin by defining terms (§2), discuss why NLP needs data statements (§3) and relate our proposal to current practice (§4). Next is the substance of our contribution: a detailed proposal for data statements for NLP (§5), illustrated with two case studies (§6). In §7 we discuss how data statements can mitigate bias and use the technique of 'value scenarios' to envision potential effects of their adoption. Finally, we relate data statements to similar emerging proposals (§8), make recommendations for how to implement and promote the uptake of data statements (§9), and lay out considerations for tech policy (§10).

2 Definitions

As this paper is intended for at least two distinct audiences (NLP technologists and tech policymakers), we use this section to briefly define key terms.

Dataset, Annotations An (NLP) dataset is a collection of speech or writing possibly combined with annotations.3 [Footnote 3: Multi-modal data sets combine language and video or other additional signals. Here, our focus is on linguistic data.] Annotations include indications of linguistic structure like part of speech tags or syntactic parse trees, as well as labels classifying aspects of what the speakers were attempting to accomplish with their utterances. The latter includes annotations for sentiment (Liu, 2012) and for figurative language or sarcasm (e.g. Riloff et al. 2013; Ptáček et al. 2014). Labels can be naturally occurring, such as star ratings in reviews taken as indications of the overall sentiment of the review (e.g. Pang et al. 2002) or the hashtag #sarcasm used to identify sarcastic language (e.g. Kreuz and Caucci 2007).

Speaker We use the term speaker to refer to the individual who produced some segment of linguistic behavior included in the dataset, even if the linguistic behavior is originally written.

Annotator Annotator refers to people who assign annotations to the raw data, including transcribers of spoken data.
Annotators may be crowdworkers or highly trained researchers, sometimes involved in the creation of the annotation guidelines. Annotation is often done semi-automatically, with NLP tools being used to create a first pass which is corrected or augmented by human annotators.

Curator A third role in dataset creation, less commonly discussed, is the curator. Curators are involved in the selection of which data to include, by selecting individual documents, by creating search terms that generate sets of documents, by selecting speakers to interview and designing interview questions, etc.

Stakeholders Stakeholders are people impacted directly or indirectly by a system (Friedman et al., 2006; Czeskis et al., 2010). Direct stakeholders include those who interact with the system, either by participating in system creation (developers, speakers, annotators and curators) or by using it. Indirect stakeholders do not use the system but are nonetheless impacted by it. For example, people whose web content is displayed or rendered invisible by search engine algorithms are indirect stakeholders with respect to those systems.

Algorithm We use the term algorithm to encompass both rule-based and machine learning approaches to NLP. Some algorithms (typically rule-based ones) are tightly connected to the datasets they are developed against. Other algorithms can be easily ported to different datasets.4 [Footnote 4: Datasets used during algorithm development can influence design choices in machine learning approaches too: Munro and Manning (2010) found that subword information, not helpful in English SMS classification, is extremely valuable in Chichewa, a morphologically complex language with high orthographic variability.]

System We use the term (NLP) system to refer to a piece of software that does some kind of natural language processing, typically involving algorithms trained on particular datasets. We use this term to refer to both components focused on specific tasks (e.g. the Stanford parser (Klein and Manning, 2003) trained on the Penn Treebank (Marcus et al., 1993) to do English parsing) and user-facing products such as Amazon's Alexa or Google Home.

Bias We use the term bias to refer to cases where computer systems "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others" (Friedman and Nissenbaum, 1996, 332).5 To be clear: (i) unfair discrimination does not give rise to bias unless it occurs systematically and (ii) systematic discrimination does not give rise to bias unless it results in an unfair outcome. Friedman and Nissenbaum (1996) show that in some cases, system bias reflects biases in society; these are pre-existing biases with roots in social institutions, practices and attitudes. In other cases, reasonable, seemingly neutral, technical elements (e.g. the order in which an algorithm processes data) can result in bias when used in real world contexts; these technical biases stem from technical constraints and decisions. A third source of bias, emergent bias, occurs when a system designed for one context is applied in another, e.g. with a different population. [Footnote 5: The machine learning community uses the term bias to refer to constraints on what an algorithm can learn, which may prevent it from picking up patterns in a dataset or lead it to relevant patterns more quickly (see Coppin 2004, Ch. 10). This use of the term does not carry connotations of unfairness.]

3 Why does NLP need data statements?

Recent studies have documented the fact that limitations in training data lead to ethically problematic limitations in the resulting NLP systems.
Systems trained on naturally occurring language data learn the pre-existing biases held by the speakers of that data: Typical vector-space representations of lexical semantics pick up cultural biases about gender (Bolukbasi et al., 2016) and race, ethnicity and religion (Speer, 2017). Zhao et al. (2017) show that beyond picking up such biases, machine learning algorithms can amplify them. Furthermore, these biases, far from being inert or simply a reflection of the data, can have real-world consequences for both direct and indirect stakeholders. For example, Speer (2017) found that a sentiment analysis system rated reviews of Mexican restaurants as more negative than other types of food with similar star ratings, because of associations between the word Mexican and words with negative sentiment in the larger corpus on which the word embeddings were trained. (See also Kiritchenko and Mohammad 2018.) In these and other ways, pre-existing biases can be trained into NLP systems. There are other studies showing that systems from part of speech taggers (Hovy and Søgaard, 2015; Jørgensen et al., 2015) to speech recognition engines (Tatman, 2017) perform better for speakers whose demographic characteristics better match those represented in the training data. These are examples of emergent bias.

Because the linguistic data we use will always include pre-existing biases and because it is not possible to build an NLP system in such a way that it is immune to emergent bias, we must seek additional strategies for mitigating the scientific and ethical shortcomings that follow from imperfect datasets. We propose here that foregrounding the characteristics of our datasets can help, by allowing reasoning about what the likely effects may be and by making it clearer which populations are and are not represented, for both training and test data. For training data, the characteristics of the dataset will affect how the system will work when it is deployed. For test data, the characteristics of the dataset will affect what can be measured about system performance and thus provide important context for scientific claims.

4 Current practice and challenges

Typical current practice in academic NLP is to present new datasets with a careful discussion of the annotation process as well as a brief characterization of the genre (usually by naming the underlying data source) and the language. NLP papers using datasets for training or test data tend to more briefly characterize the annotations and will sometimes leave out mention of genre and even language.6 [Footnote 6: Surveys of EACL 2009 (Bender, 2011) and ACL 2015 (Munro, 2015) found 33%–81% of papers failed to name the language studied. (It always appeared to be English.)] Initiatives such as the Open Language Archives Community (OLAC; Bird and Simons 2000), the Fostering Language Resources Network (FLaReNet; Calzolari et al. 2012) and the Text Encoding Initiative (TEI; Consortium 2008) prescribe metadata to publish with language resources, primarily to aid in the discoverability of such resources. FLaReNet also encourages documentation of language resources. And yet, it is very rare to find detailed characterization of the speakers whose data is captured or the annotators who provided the annotations, though the latter are usually characterized as being experts or crowdworkers.7 [Footnote 7: A notable exception is Derczynski et al. (2016), who present a corpus of tweets collected to sample diverse speaker communities (location, type of engagement with Twitter), at diverse points in time (time of year, month, and day), and annotated with named entity labels by crowdworker annotators from the same locations as the tweet authors.]
To fill this information gap, we argue that data statements should be included in every NLP publication which presents new datasets and in the documentation of every NLP system, as part of a chronology of system development including descriptions of the various datasets for training, tuning and testing. Data statements should also be included in all NLP publications reporting experimental results. Accordingly, data statements will need to be both detailed and concise. To meet these competing goals, we propose two variants. For each dataset there should be a long-form version in an academic paper presenting the dataset or in system documentation. Research papers presenting experiments making use of datasets with existing long-form data statements should include shorter data statements and cite the longer one.8 [Footnote 8: Older datasets can be retrofitted with citeable long-form data statements published on project web pages or archives.]

We note another set of goals in competition: While readers need as much information as possible in order to understand how the results can and cannot be expected to generalize, considerations of the privacy of the people involved (speakers, annotators) might preclude including certain kinds of information, especially with small groups. Each project will need to find the right balance, but this can be addressed in part by asking annotators and speakers for permission to collect and publish such information.

5 Proposed Data Statement Schema

We propose the following schema of information to include in long and short form data statements.

5.1 Long form

Long form data statements should be included in system documentation and in academic papers presenting new datasets, and should strive to provide the following information:

A. CURATION RATIONALE Which texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.

B. LANGUAGE VARIETY Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with:

• A language tag from BCP-47 identifying the language variety (e.g. en-US or yue-Hant-HK) [Footnote 9: https://tools.ietf.org/rfc/bcp/bcp47.txt]
• A prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g. English as spoken in Palo Alto CA (USA) or Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin)
C. SPEAKER DEMOGRAPHIC Sociolinguistics has found that variation (in pronunciation, prosody, word choice, and grammar) correlates with speaker demographic characteristics (Labov, 1966), as speakers use linguistic variation to construct and project identities (Eckert and Rickford, 2001). Transfer from native languages (L1) can affect the language produced by non-native (L2) speakers (Ellis, 1994, Ch. 8). A further important type of variation is disordered speech (e.g. dysarthria). Specifications include:

• Age
• Gender
• Race/ethnicity
• Native language
• Socio-economic status
• Number of different speakers represented
• Presence of disordered speech

D. ANNOTATOR DEMOGRAPHIC What are the demographic characteristics of the annotators and annotation guideline developers? Their own 'social address' influences their experience with language and thus their perception of what they are annotating. Specifications include:

• Age
• Gender
• Race/ethnicity
• Native language
• Socio-economic status
• Training in linguistics/other relevant discipline

E. SPEECH SITUATION Characteristics of the speech situation can affect linguistic structure and patterns at many levels. The intended audience of a linguistic performance can also affect linguistic choices on the part of speakers.10 [Footnote 10: For example, people speak differently to close friends v. strangers, to small groups v. large ones, to children v. adults and to people v. machines (e.g. Ervin-Tripp 1964).] The time and place provide broader context for understanding how the texts collected relate to their historical moment and should also be made evident in the data statement.11 [Footnote 11: Mutable speaker demographic information, such as age, is interpreted as relative to the time of the linguistic behavior.] Specifications include:

• Time and place
• Modality (spoken/signed, written)
• Scripted/edited v. spontaneous
• Synchronous v. asynchronous interaction
• Intended audience

F. TEXT CHARACTERISTICS Both genre and topic influence the vocabulary and structural characteristics of texts (Biber, 1995), and should be specified.

G. RECORDING QUALITY For data that includes audio/visual recordings, indicate the quality of the recording equipment and any aspects of the recording situation that could impact recording quality.

H. OTHER There may be other information of relevance as well (e.g. the demographic characteristics of the curators). As stated above, this is intended as a starting point and we anticipate best practices around writing data statements to develop over time.

I. PROVENANCE APPENDIX For datasets built out of existing datasets, the data statements for the source datasets should be included as an appendix.

5.2 Short form

Short form data statements should be included in any publication using a dataset for training, tuning or testing a system and may also be appropriate for certain kinds of system documentation. The short form data statement does not replace the long form one, but rather should include a pointer to it. For short form data statements, we envision 60–100 word summaries of the description included in the long form, covering most of the main points.
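Nothing in either variant requires a particular file format, but a machine-readable rendering of the long-form schema would make the digital templates (§9) and field-level aggregation (§7.2) discussed later easier to build. The following Python sketch is our own illustration under that assumption; the class and field names are hypothetical and not part of the proposed schema, which deliberately leaves the exact format open.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LanguageVariety:
    """One language variety represented in the dataset (schema item B)."""
    bcp47_tag: str          # e.g. "en-US" or "yue-Hant-HK"
    prose_description: str  # glosses the tag, e.g. "English as spoken in Palo Alto CA (USA)"


@dataclass
class LongFormDataStatement:
    """Skeleton mirroring schema items A-I; all names here are illustrative only."""
    curation_rationale: str                    # A
    language_varieties: List[LanguageVariety]  # B
    speaker_demographic: str                   # C (prose; may state what is unavailable)
    annotator_demographic: str                 # D
    speech_situation: str                      # E
    text_characteristics: str                  # F
    recording_quality: Optional[str] = None    # G ("N/A" for text-only data)
    other: Optional[str] = None                # H
    provenance: List["LongFormDataStatement"] = field(default_factory=list)  # I


def short_form_ok(short_form: str, long_form_url: str) -> bool:
    """Check the two constraints §5.2 places on a short form:
    roughly 60-100 words, plus a pointer to the long form."""
    word_count = len(short_form.split())
    return 60 <= word_count <= 100 and long_form_url in short_form

Keeping the fields as free text preserves the schema's emphasis on careful prose characterization while still exposing the language-variety tags and provenance links that catalog-level tooling could aggregate over.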
5.3 Summary

We have outlined the kind of information data statements should include, addressing the needs laid out in §3, describing both long and short versions. As the field gains experience with data statements, we expect to see a better understanding of what to include as well as best practices for writing data statements emerge.

Note that full specification of all of this information may not be feasible in all cases. For example, in datasets created from web text, precise demographic information may be unavailable. In other cases (e.g. to protect the privacy of annotators) it may be preferable to provide ranges rather than precise values. For the description of demographic characteristics, our field can look to others for best practices, such as those described in the American Psychological Association's Manual of Style.

It may seem redundant to reiterate this information in every paper that makes use of well-trodden datasets. However, it is critical to consider the data anew each time to ensure that it is appropriate for the NLP work being undertaken and that the results reported are properly contextualized. Note that the requirement is not that datasets be used only when there is an ideal fit between the dataset and the NLP goals but rather that the characteristics of the dataset be examined in relation to the NLP goals and limitations be reported as appropriate.

6 Case studies

We illustrate the idea of data statements with two case studies. Ideally, data statements are written at or close to the time of dataset creation. These data statements were constructed post hoc in conversation with the dataset curators. The first entails labels for a particular subset of all Twitter data. In contrast, the second entails all available data for an intentionally generated interview collection, including audio files and transcripts. Both illustrate how even when specific information is not available, the explicit statement of its lack of availability provides a more informative picture of the dataset.

6.1 Hate Speech Twitter Annotations

The Hate Speech Twitter Annotations collection is a set of labels for ∼19,000 tweets collected by Waseem and Hovy (2016) and Waseem (2016). The dataset can be accessed via https://github.com/zeerakw/hatespeech. [Footnote 12: This data statement was prepared based on information provided by Zeerak Waseem, p.c., Feb-Apr 2018 and reviewed and approved by him.]

A. CURATION RATIONALE In order to study the automatic detection of hate speech in tweets and the effect of annotator knowledge (crowdworkers v. experts) on the effectiveness of models trained on the annotations, Waseem and Hovy (2016) performed a scrape of Twitter data using contentious terms and topics. The terms were chosen by first crowd-sourcing an initial set of search terms on feminist Facebook groups and then reviewing the resulting tweets for terms to use and adding others based on the researcher's intuition.13 [Footnote 13: In a standalone data statement, the search terms should be given in the main text. To avoid accosting readers with slurs in this article, we instead list them in this footnote. Waseem and Hovy (2016) provide the following complete list of terms used in their initial scrape: 'MKR', 'asian drive', 'feminazi', 'immigrant', 'nigger', 'sjw', 'WomenAgainstFeminism', 'blameonenotall', 'islam terrorism', 'notallmen', 'victimcard', 'victim card', 'arab terror', 'gamergate', 'jsil', 'racecard', 'race card'.] Additionally, some prolific users of the terms were chosen and their timelines collected. For the annotation work reported in Waseem (2016), expert annotators were chosen for their attitudes with respect to intersectional feminism in order to explore whether annotator understanding of hate speech would influence the labels and classifiers built on the dataset.

B. LANGUAGE VARIETY The data was collected via the Twitter search API in late 2015. Information about which varieties of English are represented is not available, but at least Australian (en-AU) and US (en-US) mainstream Englishes are both included.

C. SPEAKER DEMOGRAPHIC Speakers were not directly approached for inclusion in this dataset and thus could not be asked for demographic information. More than 1500 different Twitter accounts are included.
Based on independent information about Twitter usage and impressionistic observation of the tweets by the dataset curators, the data is likely to include tweets from both younger (18–30) and older (30+) adult speakers, the majority of whom likely identify as white. No direct information is available about gender distribution or socioeconomic status of the speakers. It is expected that most, but not all, of the speakers speak English as a native language.

D. ANNOTATOR DEMOGRAPHIC This dataset includes annotations from both crowdworkers and experts. 1,065 crowdworkers were recruited through Crowd Flower, primarily from Europe, South America and North America. Beyond country of residence, no further information is available about the crowdworkers. The expert annotators were recruited specifically for their understanding of intersectional feminism. All were informally trained in critical race theory and gender studies through years of activism and personal research. They ranged in age from 20–40, included 3 men and 13 women, and gave their ethnicity as white European (11), East Asian (2), Middle East/Turkey (2), and South Asian (1). Their native languages were Danish (12), Danish/English (1), Turkish/Danish (1), Arabic/Danish (1), and Swedish (1). Based on income levels, the expert annotators represented upper lower class (5), middle class (7), and upper middle class (2).

E. SPEECH SITUATION All tweets were initially published between April 2013 and December 2015. Tweets represent informal, largely asynchronous, spontaneous, written language, of up to 140 characters per tweet. About 23% of the tweets were in reaction to a specific Australian TV show (My Kitchen Rules) and so were likely meant for roughly synchronous interaction with other viewers. The intended audience of the tweets was either other viewers of the same show, or simply the general Twitter audience. For the tweets containing racist hate speech, the authors appear to intend them both for those who would agree but also for people whom they hope to provoke into having an agitational and confrontational exchange.

F. TEXT CHARACTERISTICS For racist tweets the topic was dominated by Islam and Islamophobia. For sexist tweets predominant topics were the TV show and people making sexist statements while claiming not to be sexist. The majority of tweets only used one modality (text) though some included links to pictures and websites.

G. RECORDING QUALITY N/A.

H. OTHER N/A.

I. PROVENANCE APPENDIX N/A.

Twitter Hate Speech short form This dataset includes labels for ∼19,000 English tweets from different locales (Australia and North America being well-represented) selected to contain a high prevalence of hate speech.
The labels indicate the presence and type of hate speech and were provided both by experts (mostly with extensive if informal training in critical race theory and gender studies and English as a second language) and by crowdworkers primarily from Europe and the Americas. [Include a link to the long form.]

6.2 Voices from the Rwanda Tribunal (VRT)

Voices from the Rwanda Tribunal is a collection of 49 video interviews in English and French with personnel from the International Criminal Tribunal for Rwanda (ICTR) comprising 50-60 hours of material with high quality transcription throughout (Nathan et al., 2011; Nilsen et al., 2012; Friedman et al., 2016). The dataset can be downloaded from http://www.tribunalvoices.org. [Footnote 14: This data statement was prepared based on information provided by co-author Batya Friedman.]

A. CURATION RATIONALE The VRT project, funded by the United States National Science Foundation, is part of a research program on developing multi-lifespan design knowledge (Friedman and Nathan, 2010). It is independent from the ICTR, the United Nations, and the government of Rwanda. To help ensure accuracy and guard against breaches of confidentiality, interviewees had an opportunity to review and redact any material that was either misspoken or revealed confidential information. A total of two words have been redacted. No other review or redaction of content has occurred. The dataset includes all publicly released material from the collection; as of the writing of this data statement (28 September 2017) one interview and a portion of a second are currently sealed.

B. LANGUAGE VARIETY Of the interviews, 44 are conducted in English (en-US and international English on the part of the interviewees, en-US on the part of the interviewers) and 5 in French and English, with the interviewee speaking international French, the interviewer speaking English (en-US) and an interpreter speaking both.15 [Footnote 15: At the end of one interview, there is 38 seconds of untranscribed speech in Kinyarwanda (rw).]

C. SPEAKER DEMOGRAPHIC The interviewees (13 women and 36 men, all adults) are professionals working in the area of international justice, such as judges or prosecutors, and support roles of the same, such as communications, prison warden, and librarian. They represent a variety of nationalities: Argentina, Benin, Cameroon, Canada, England, The Gambia, Ghana, Great Britain, India, Italy, Kenya, Madagascar, Mali, Morocco, Nigeria, Norway, Peru, Rwanda, Senegal, South Africa, Sri Lanka, St. Kitts and Nevis, Sweden, Tanzania, Togo, Uganda, and the US. Their native languages are not known, but are presumably diverse. The 7 interviewers (2 women and 5 men) are information and legal professionals from different regions in the US. All are native speakers of US English, all are white, and at the time of the interviews they ranged in age from early 40s to late 70s. The interpreters are language professionals employed by the ICTR with experience interpreting between French and English. Their age, gender, and native languages are unknown.

D. ANNOTATOR DEMOGRAPHIC The initial transcription was outsourced to a professional transcription company, so information about these transcribers is unavailable. The English transcripts were reviewed by English speaking (en-US) members of the research team for accuracy and then reviewed a third time by an additional English speaking (en-US) member of the team.
The French/English transcripts received a second and third review for accuracy by bilingual French/English doctoral students at the University of Washington. Because of the sensitivity of the topic, the high political status of some interviewees (e.g. Prosecutor for the tribunal), and the international stature of the institution, it is very important that interviewees' comments be accurately transcribed. Accordingly, the bar for quality of transcription was set extremely high.

E. SPEECH SITUATION The interviews were conducted in Autumn 2008 at the ICTR in Arusha, Tanzania and in Rwanda, face-to-face, as spoken language. The interviewers begin with a prepared set of questions, but most of the interaction is semi-structured. Most generally, the speech situation can be characterized as a dialogue, but some of the interviewees give long replies, so stretches may be better characterized as monologues. For the interviewees, the immediate interlocutor is the interviewer, but the intended audience is much larger (see Part F below).

F. TEXT CHARACTERISTICS The interviews were intended to provide an opportunity for tribunal personnel to reflect on their experiences working at the ICTR and what they would like to share with the people of Rwanda, the international justice community, and the global public now, 50 and 100 years from now. Professionals from all organs of the tribunal (judiciary, prosecution, registry) were invited to be interviewed, with effort made to include a broad spectrum of roles (e.g. judges, prosecutor, defense counsel, but also the warden, librarian, language services). Interviewees expected their interviews to be made broadly accessible.

G. RECORDING QUALITY The video interviews were recorded with high definition equipment in closed but not sound-proof offices. There is some background noise.

H. OTHER N/A.

I. PROVENANCE APPENDIX N/A.

VRT short form The data represents well-vetted transcripts of 49 spoken interviews with personnel from the International Criminal Tribunal for Rwanda (ICTR) about their experience at the tribunal and reflections on international justice, in international English (44 interviews) and French (5 interviews with interpreters). Interviewees are adults working in international justice and support fields at the ICTR; interviewers are adult information or legal professionals, highly fluent in en-US; and transcribers are highly educated, highly fluent English and French speakers. [Include a link to the long form.]

6.3 Summary

These sample data statements are meant to illustrate how the schema can be used to communicate the specific characteristics of datasets. They were both created post hoc, in communication with the dataset curators. Once data statements are created as a matter of best practice, however, they should be developed in tandem with the datasets themselves and may even inform the curation of datasets. At the same time, data statements will need to be written for widely used, pre-existing datasets, where documentation may be lacking, memories imperfect, and dataset curators no longer accessible. While retrospective data statements may be incomplete, by and large we believe they can still be valuable.

Our case studies also underscore how curation rationales shape the specific kinds of texts included.
This is particularly striking in the case of the Hate Speech Twitter Annotations, where the specific search terms very clearly shaped the spe- cific kinds of hate speech included and the ways in which any technology or studies built on this dataset will generalize. 7 A tool for mitigating bias We have explicitly designed data statements as a tool for mitigating bias in systems that use data for training and testing. Data statements are particu- larly well suited to mitigate forms of emergent and pre-existing bias. For the former, we see benefits at the level of specific systems and of the field: When a system is paired with data statement(s) for the data it is trained on, those deploying it are empowered to assess potential gaps between the speaker populations represented in the training and test data and the populations whose language the system will be working with. At the field level, data statements enable an examination of the en- tire catalog of testing and training datasets to help identify populations who are not yet included. All of these groups are vulnerable to emergent bias, in that any system would by definition have been trained and tested on data from datasets that do not represent them well. Data statements can also be instrumental in the diagnosis (and thus mitigation) of pre-existing bias. Consider again Speer’s (2017) example of Mexican restaurants and sentiment analysis. The information that the word vectors were trained on general web text (together with knowledge of what kind of societal biases such text might contain) was key in figuring out why the system consis- tently underestimated the ratings associated with reviews of Mexican restaurants. In order to en- able both more informed system development and deployment and audits by users and others of sys- tems in action, it is critical that characterizations of the training and test data underlying systems be available. To be clear, data statements do not in and of themselves solve the entire problem of bias. Rather, they are a critical enabling infrastructure. Consider by analogy this example from Friedman (1997) about access to technology and employ- ment for people with disabilities. In terms of computer system design, we are not so privileged as to determine rigidly the values that will emerge from the systems we design. But neither can we abdicate responsibility. For example, let us for the moment agree [. . . ] that disabled people in the work place should be able to access technology, just as they should be able to access a public build- ing. As system designers we can make the choice to try to construct a tech- nological infrastructure which disabled people can access. If we do not make this choice, then we single-handedly un- dermine the principle of universal ac- cess. But if we do make this choice, and are successful, disabled people would still rely, for example, on employers to hire them. (p.3) Similarly, with respect to bias in NLP technology, if we do not make a commitment to data state- ments or a similar practice for making explicit the characteristics of datasets, then we will single- handedly undermine the field’s ability to address bias. In NLP, we expect proposals to come with some kind of evaluation. In this paper, we have demon- strated the substance and ‘writability’ of a data statement through two exemplars (§6). 
However, the positive effects of data statements that we an- ticipate (and negative effects we haven’t antici- pated) cannot be demonstrated and tested a priori, as their impact emerges through practice. Thus, we look to value sensitive design, which encour- ages us to consider what would happen if a pro- posed technology were to come into widespread use, over longer periods of time, with attention to a wide range of stakeholders, potential benefits, and harms (Friedman et al., 2006, 2017). We do this with value scenarios (Nathan et al., 2007; Czeskis et al., 2010). Specifically, we look at two kinds of value scenarios: Those concerning NLP technology that fails to take into account an appropriate match between training data and deployment con- text and those that envision possible positive as well as negative consequences stemming from the widespread use of the specific ‘technology’ we are proposing in this paper (data statements). Envi- sioning possible negative outcomes allows us to consider how to mitigate such possibilities before they occur. 7.1 Public health and NLP for social media This value scenario is inspired by Jurgens et al. (2017), who provide a similar one to motivate training language ID systems on more represen- tative datasets. Scenario. Big U Hospital in a town in the Up- per Midwest collaborates with the CS Department at Big U to create a Twitter-based early warning system for infectious disease, called DiseaseAlert. Big U Hospital finds that the system improves pa- tient outcomes by alerting hospital staff to emerg- ing community health needs and alerting physi- cians to test for infectious diseases that currently are active locally. Big U decides to make the DiseaseAlert project open source to provide similar benefits to hospi- tals across the Anglophone world and is delighted to learn that City Hospital in Abuja, Nigeria is ex- cited to implement DiseaseAlert locally. Big U supports City Hospital with installing the code, in- cluding localizing the system to draw on tweets posted from Abuja. Over time, however, City Hos- pital finds that the system is leading its physicians to order unnecessary tests and that it is not at all accurate in detecting local health trends. City Hos- pital complains to Big U about the poor system performance and reports that their reputation is be- ing damaged. Big U is puzzled, as the DiseaseAlert performs well in the Upper Midwest, and they had spent time localizing the system to use tweets from Abuja. After a good deal of frustration and in- vestigation into Big U’s system, the developers discover that the third-party language ID compo- nent they had included was trained on only highly- edited US and UK English text. As a result, it tends to misclassify tweets in regional or non- standard varieties of English as ‘not English’ and therefore not relevant. Most of the tweets posted by people living in Abuja that City Hospital’s sys- tem should have been looking at were thrown out by the system at the first step of processing. Analysis. City Hospital adopted Big U’s open source DiseaseAlert system in exactly the way Big U intended. However, the documentation for the language ID component lacked critical informa- tion needed to help ensure the localization process would be successful; namely, information about the training and test sets for the system. 
Had Big U included data statements for all system compo- nents (including third-party components) in their documentation, then City Hospital IT staff would have been positioned to recognize the potential limitation of DiseaseAlert and to work proactively with Big U to ensure the system performed well in City Hospital’s context. Specifically, in reviewing data statements for all system components, the IT staff could note that the language ID component was trained on data unlike what they were seeing in their local tweets and ask for a different lan- guage ID component or ask for the existing one to be retrained. In this manner, an emergent bias and its concomitant harms could have been iden- tified and addressed during the system adaptation process prior to deployment. 7.2 Toward an inclusive data catalog In §7.1 we consider data statements in relation to a particular system. Here, we explore their potential to enable better science in NLP overall. Scenario. It’s 2022 and ‘Data Statement’ has become a standard section heading for NLP re- search papers and system documentation. Hap- pily, reports of mismatch between dataset and community of application leading to biased sys- tems have decreased. Yet, research community members articulate an unease regarding which lan- guage communities are and which are not part of the field’s data catalog — the abstract total collec- tion of data and associated meta-data to which the field has access — and the possibility for resulting bias in NLP at a systemic level. In response, several national funding bodies jointly fund a project to discover gaps in knowl- edge. The project compares existing data state- ments to surveys of spoken languages and system- atically maps which language varieties have re- sources (annotated corpora and standard process- ing tools) and which ones lack such resources. The study turns up a large number of language varieties lacking such resources; it also produces a precise list of underserved populations, some of which are quite sizable, suggesting opportunity for impactful intervention at the academic, industry and govern- ment levels. Study results in hand, the NLP community em- barks on an intentional program to broaden the language varieties in the data catalog. Public dis- cussions lead to criteria for prioritizing language varieties and funding agencies come together to fund collaborative projects to produce state of the art resources for understudied languages. Over time, the data catalog becomes more inclusive; bias in the catalog, while not wholly absent, is sig- nificantly reduced and NLP researchers and devel- opers are able to run more comprehensive exper- iments and build technology that serves a larger portion of society. Analysis. The NLP community has recognized critical limitations in the field’s existing data cat- alog, leaving many language communities un- derserved (Bender, 2011; Munro, 2015; Jurgens et al., 2017).16 The widespread uptake of data statements positions the NLP community to docu- ment the degree to which it leaves out certain lan- guage groups and empower itself to systematically broaden the data catalog. In turn, individual NLP systems could be trained on datasets that more closely align with the language of anticipated sys- tem users, thereby averting emergent bias. Fur- thermore, NLP researchers can more thoroughly test key research ideas and systems, leading to more reliable scientific results. 
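Neither scenario presupposes specific tooling, but both become easier if data statements expose their language-variety tags (§5.1, item B) in machine-readable form. The sketch below is our own illustration and assumes a hypothetical record format with a "language_varieties" field; it shows how a deployment team (as in §7.1) or a funding body surveying the field's data catalog (as in §7.2) might flag varieties with no supporting data statement.

from collections import Counter
from typing import Iterable, List, Set


def declared_varieties(data_statements: Iterable[dict]) -> Counter:
    """Count BCP-47 tags declared across a collection of data statements.

    Each statement is assumed to expose a "language_varieties" list of
    BCP-47 tags (a hypothetical field; see §5.1, item B).
    """
    counts: Counter = Counter()
    for statement in data_statements:
        for tag in statement.get("language_varieties", []):
            counts[tag.lower()] += 1
    return counts


def coverage_gaps(data_statements: Iterable[dict],
                  deployment_varieties: Set[str]) -> List[str]:
    """Return deployment varieties that no data statement declares.

    Treating a bare "en" statement as covering "en-NG" would be a policy
    decision; this sketch deliberately requires an exact tag match.
    """
    declared = declared_varieties(data_statements)
    return sorted(v for v in deployment_varieties if declared[v.lower()] == 0)


# For example, the DiseaseAlert developers of §7.1 could have run
# coverage_gaps(statements_for_all_components, {"en-NG"}) before deployment,
# and a funding body could run declared_varieties over the field's whole
# data catalog to map underrepresented varieties, as envisioned in §7.2.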
[Footnote 16: The EU-funded project META-NET worked on identifying gaps at the level of whole languages for Europe, producing a series of 32 white papers, each concerning one European language, available from http://www.meta-net.eu/whitepapers/overview, accessed 6 August 2018.]

7.3 Anticipating and mitigating barriers

Finally, we explore one potential negative outcome and how with care it might be mitigated: that of data statements as a barrier to research.

Scenario. In response to widespread uptake, in 2026 the Association for Computational Linguistics (ACL) proposes that data statements be standardized and required components of research papers. A standards committee is formed, open public professional discussion is engaged, and in 2028 a standard is adopted. It mandates data statements as a requirement for publication, with standardized information fields and strict specifications for how these should be completed to facilitate automated meta-analysis. There is great hope that the field will experience increasing benefits from the ability to compare, contrast, and build complementary datasets.

Many of those hopes are realized. However, in a relatively short period of time papers from underrepresented regions abruptly decline. In addition, the number of papers from everywhere producing and reporting on new datasets declines as well. Distressed by this outcome, the ACL constitutes an ad hoc committee to investigate. A survey of researchers reveals two distinct causes: First, researchers from institutions not yet well represented at ACL were having their papers desk-rejected due to missing or insufficient data statements. Second, researchers who might otherwise have developed a new dataset instead chose to use existing datasets whose data statements could simply be copied. In response, the ACL executive develops a mentoring service to assist authors in submitting standards-compliant data statements and considers relaxing the standard somewhat in order to encourage more dataset creation.

Analysis. With any new technology, there can be unanticipated ripple effects; data statements are no exception. Here we envision two potential negative impacts, which could both be mitigated through other practices. Importantly, while we recommend the practice of creating data statements, we believe that they should be widely used before any standardization takes place. Furthermore, once a degree of expertise in this area is built up, we recommend that mentoring be put in place proactively. Community engagement and mentoring will also contribute to furthering ethical discourse and practice in the field.

7.4 Summary

The value scenarios described here point to key upsides to the widespread adoption of data statements and also help to provide words of caution. They are meant to be thought-provoking and plausible, but are not predictive. Importantly, the scenarios illustrate how, if used well, data statements could be an effective tool for mitigating bias in NLP systems.

8 Related work

We see three strands of related work which lend support to our proposal and to the proposition that data statements will have the intended effect: similar practices in medicine (§8.1), emerging, independent proposals around similar ideas for transparency about datasets in AI (§8.2), and proposals for 'algorithmic impact statements' (§8.3).
8.1 Guidelines for reporting medical trials

In medicine, the CONSORT (CONsolidated Standards of Reporting Trials) guidelines were developed by a consortium of journal editors, specialists in clinical trial methodology and others to improve reporting of randomized, controlled trials.17 [Footnote 17: http://www.consort-statement.org/consort-2010, accessed July 12, 2017.] They include a checklist for authors to use to indicate where in their research reports each item is handled and a statement explaining the rationale behind each item (Moher et al., 2010). CONSORT development began in 1993, with the most recent release in 2010. It has been endorsed by 70 medical journals.18 [Footnote 18: http://www.consort-statement.org/about-consort/endorsement-of-consort-statement, accessed July 12, 2017.]

Item 4a, 'Eligibility criteria for participants', is most closely related to the concerns of this paper. Characterizing the population that participated in the study is critical for gauging the extent to which the results of the study are applicable to particular patients a physician is treating (Moher et al., 2010).

The inclusion of this information has also enabled further kinds of research. For example, Mbuagbaw et al. (2017) argue that careful attention to and publication of demographic data that may correlate with health inequities can facilitate further work through meta-analyses. In particular, individual studies usually lack the statistical power to do the kind of sub-analyses required to check for health inequities, and failing to publish demographic information precludes its use in the kind of aggregated meta-analyses that could have sufficient statistical power. This echoes the field-level benefits we anticipate for data statements in building out the data catalog in the value scenario in §7.2.

8.2 Converging proposals

At least three other groups are working in parallel on similar proposals regarding bias and AI. Gebru et al. (in prep) propose 'datasheets for datasets', looking at AI more broadly (but including NLP); Chmielinski and colleagues at the MIT Media Lab propose 'dataset nutrition labels';19 [Footnote 19: http://datanutrition.media.mit.edu/, accessed April 2, 2018.] and Yang et al. (2018) describe 'Ranking Facts', a series of widgets that allow a user to explore how attributes influence a ranking. Of these, the datasheets proposal is most similar to ours in including a comparable schema.

The datasheets are inspired by those used in computer hardware to give specifications, limits and appropriate use information for components. There is important overlap in the kinds of information called for in the datasheets schema and our data statement schema: For example, the datasheets schema includes a section on 'Motivation for Dataset Creation', akin to our 'Curation Rationale'. The primary differences stem from the fact that the datasheets proposal is trying to accommodate all types of datasets used to train machine learning systems and, hence, tends toward more general, cross-cutting categories; while we elaborate requirements for linguistic datasets and, hence, provide more specific, NLP-focused categories.
Gebru et al. note, like us, that their proposal is meant as an initial starting point to be elaborated through adoption and application. Having multiple starting points for this discussion will certainly make it more fruitful.

8.3 Algorithmic impact statements

Several groups have called for algorithmic impact statements (Shneiderman, 2016; Diakopoulos, 2016; AI Now Institute, 2018), modeled after environmental impact statements. Of these, AI Now's proposal is perhaps the most developed. All three groups point to the need to clarify information about the data: "Algorithm impact statements would document [. . . ] data quality control for input sources" (Shneiderman, 2016, 13539); "One avenue for transparency here is to communicate the quality of the data, including its accuracy, completeness, and uncertainty, [. . . ] representativeness of a sample for a specific population, and assumptions or other limitations" (Diakopoulos, 2016, 60); "AIAs should cover [. . . ] input and training data" (AI Now Institute, 2018). However, none of these proposals specify how to do so. Data statements fill this critical gap.

9 Recommendations for implementation

Data statements are meant to be something practical and concrete that NLP technologists can adopt as one tool for mitigating potential harms of the technology we develop. For this benefit to come about, data statements must be easily adopted. In addition, practical uptake will require coordinated effort at the level of the field. In this section we briefly consider possible costs to writers and readers of data statements, and then propose strategies for promoting uptake.

The primary cost we see for writers is time: With the required information to hand, writing a data statement should take no more than 2–3 hours (based on our experience with the case studies). However, the time to collect the information will depend on the dataset. The more speakers and annotators that are involved, the more time it may take to collect demographic information. This can be facilitated by planning ahead, before the corpus is collected. Another possible cost is that collecting demographic information may mean that projects previously not submitted to institutional review boards for approval must now be, at least for exempt status. This process itself can take time, but is valuable in its own right. A further cost to writers is space. We propose that data statements, even the short form (60–100 words), be exempt from page limits in conference and journal publications.

As for readers, reviewers have more material to read and dataset (and ultimately system) users need to scrutinize data statements in order to determine which datasets are appropriate for their use case. But this is precisely the point: Data statements make critical information accessible that previously could only be found by users with great effort, if at all. The time invested in scrutinizing data statements prior to dataset adoption is expected to be far less than the time required to diagnose and retrofit an already deployed system should biases be identified.

Turning to uptake in the field, NLP technologists (both researchers and system developers) are key stakeholders of the technology of data statements. Practices that engage these stakeholders in the development and promotion of data statements will both promote uptake and ensure that the ultimate form data statements take are responsive to NLP technologists' needs.
Accordingly, we rec- ommend that one or more professional organiza- tions such as the Association for Computational Linguistics convene a working group on data state- ments. Such a working group would engage in several related sets of activities, which would collectively serve to publicize and cultivate the use of data statements: (i) Best practices A clear first step entails de- veloping best practices for how data statements are produced. This includes: steps to take before collecting a dataset to facilitate writing an infor- mative data statement; heuristics for writing con- cise and effective data statements; how to incorpo- rate material from institutional review board/ethics committee applications into the data statement schema; how to find an appropriate level of de- tail given privacy concerns, especially for small or vulnerable populations; and how to produce data statements for older datasets that predate this prac- tice. In doing this work, it may be helpful to distill best practices from other fields, such as medicine and psychology, especially around collecting de- mographic information. (ii) Training and support materials With best practices in place, the next step is providing train- ing and support materials for the field at large. We see several complementary strategies to undertake: Create a digital template for data statements; run tutorials at conferences; establish a mentoring net- work (see §7.3); and develop an on-line ‘how-to’ guide. (iii) Recommendations for field-level policies There are a number of field-level practices that the working group could explore to support the uptake and successful use of data statements. Funding agencies could require data statements to be in- cluded in data management plans; conferences and journals could not count data statements against page limits (similar to references) and eventually require short form data statements in submissions; conferences and journals could allocate additional space for data statements in publications; finally once data statements have been in use for a few years, a standardized form could be established. 10 Tech policy implications Transparency of datasets and systems is essential for preserving accountability and building more just systems (Kroll et al., 2017). Due process provides a critical case in point. In the United States, for example, due process requires that cit- izens who have been deprived of liberty or prop- erty by the government be afforded the opportu- nity to understand and challenge the government’s decision (Citron, 2008). Without data statements or something similar, governmental decisions that are made or supported by automated systems de- prive citizens of the ability to mount such a chal- lenge, undermining the potential for due process. In addition to challenging any specific decision by any specific system, there is a further concern about building systems that are broadly represen- tative and fair. Here too, data statements have much to contribute. As systems are being built, data statements enable developers and researchers to make informed choices about training sets and to flag potential underrepresented populations who may be overlooked or treated unfairly. Once sys- tems are deployed, data statements enable diag- nosis of systemic unfairness when it is detected in system performance. At a societal level, such transparency is necessary for government and ad- vocacy groups seeking to ensure protections and an inclusive society. 
10 Tech policy implications

Transparency of datasets and systems is essential for preserving accountability and building more just systems (Kroll et al., 2017). Due process provides a critical case in point. In the United States, for example, due process requires that citizens who have been deprived of liberty or property by the government be afforded the opportunity to understand and challenge the government's decision (Citron, 2008). Without data statements or something similar, governmental decisions that are made or supported by automated systems deprive citizens of the ability to mount such a challenge, undermining the potential for due process.

In addition to challenging any specific decision by any specific system, there is a further concern about building systems that are broadly representative and fair. Here too, data statements have much to contribute. As systems are being built, data statements enable developers and researchers to make informed choices about training sets and to flag potentially underrepresented populations who may be overlooked or treated unfairly. Once systems are deployed, data statements enable diagnosis of systemic unfairness when it is detected in system performance. At a societal level, such transparency is necessary for government and advocacy groups seeking to ensure protections and an inclusive society.

If data statements turn out to be as useful as anticipated, then the following implications for standardization and tech policy likely ensue.

Long-Form Data Statements Required in System Documentation. For academia, industry, and government, inclusion of long-form data statements as part of system documentation should be a requirement. As appropriate, inclusion of long-form data statements should also be a requirement for ISO and other certification. Even groups that create datasets they do not share (e.g. the NSA) would be well advised to write internal data statements. Moreover, under certain legal circumstances, such groups may be required to share this information.

Short-Form Data Statements Required for Academic and Other Publication. For academic publication in journals and conferences, inclusion of short-form data statements should be a requirement for publication. As highlighted in §7.3, caution must be exercised to ensure that this requirement does not become a barrier to access for some researchers.

These two recommendations will need to be implemented with care. We have already noted the potential barrier to access. Secrecy concerns may also arise in some situations; for example, some groups may be willing to share datasets but not demographic information, for fear of public relations backlash or to protect the safety of contributors to the dataset. That said, as consumers of datasets or of products trained with them, NLP researchers, developers, and the general public would be well advised to use systems only if there is access to the information we propose should be included in data statements.

11 Conclusion and future work

As researchers and developers working on technology in widespread use, capable of impacting people beyond its direct users, we have an obligation to consider the ethical implications of our work. This will only happen reliably if we find ways to integrate such thought into our regular practice. In this paper, we have put forward one specific, concrete proposal which we believe will help with issues related to exclusion and bias in language technology: the practice of including 'data statements' in all publications and documentation for all NLP systems.

We believe this practice will have beneficial effects immediately and into the future. In the short term, it will foreground how our data does and does not represent the world (and the people our systems will impact). In the long term, it should enable research that specifically addresses issues of bias and exclusion, promote the development of more representative datasets, and make it easier and more normative for researchers to take stakeholder values into consideration as they work. By foregrounding information about the data we work with, we can work toward making sure that the systems we build work for diverse populations and also toward making sure we are not teaching computers about the world based on the world views of a limited subset of people.

Granted, it will take time and experience to develop the skill of writing carefully crafted data statements. However, we see great potential benefits. For the scientific community, researchers will be better able to make precise claims about how results should generalize and to perform more targeted experiments around reproducing results for datasets that differ in specific characteristics.
For industry, we believe that incorporating data statements will encourage the kind of conscientious software development that protects companies' reputations (by avoiding public embarrassment) and makes them more competitive (by creating systems used more fluidly by more people). For the public at large, data statements are one piece of a larger collection of practices that will enable the development of NLP systems that equitably serve the interests of users and indirect stakeholders.

Acknowledgments

We are grateful to the following people for helpful discussion and critical commentary as we developed this paper: the anonymous TACL reviewers, Hannah Almeter, Stephanie Ballard, Chris Curtis, Leon Derczynski, Michael Wayne Goodman, Anna Hoffmann, Bill Howe, Kristen Howell, Dirk Hovy, Jessica Hullman, David Inman, Tadayoshi Kohno, Nick Logler, Mitch Marcus, Angelina McMillan-Major, Rob Munro, Glenn Slayden, Michelle Stamnes, Jevin West, Daisy Yoo, Olga Zamaraeva, and especially Zeerak Waseem and Ryan Calo. We have presented talks based on earlier versions of this paper at New York University (Nov 2017), Columbia University (Nov 2017), University of Washington (Nov 2017), UC San Diego (Feb 2018), Microsoft (Mar 2018), and Macquarie University (July 2018), and we thank the audiences at those talks for useful feedback. Finally, Batya Friedman's contributions to this paper were supported by the UW Tech Policy Lab and National Science Foundation Grant IIS-1302709. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

AI Now Institute. 2018. Algorithmic impact assessments: Toward accountable automation in public agencies. Medium.com, https://medium.com/@AINowInstitute/algorithmic-impact-assessments-toward-accountable-automation-in-public-agencies-bd9856e6fdde, accessed 6 April 2018.

American Psychological Association. 2009. Publication Manual of the American Psychological Association, 6th edition. Author, Washington DC.

Emily M. Bender. 2011. On achieving and evaluating language independence in NLP. Linguistic Issues in Language Technology, 6:1–26.

Douglas Biber. 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press, Cambridge.

Steven Bird and Gary Simons. 2000. White paper on establishing an infrastructure for open language archiving. In Workshop on Web-Based Language Documentation and Description, Philadelphia, PA, pages 12–15.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4349–4357. Curran Associates, Inc.
Nicoletta Calzolari, Valeria Quochi, and Claudia Soria. 2012. The strategic language resource agenda. http://www.flarenet.eu/sites/default/files/FLaReNet_Strategic_Language_Resource_Agenda.pdf, accessed 6 August 2018.

Jack K. Chambers and Peter Trudgill. 1998. Dialectology, second edition. Cambridge University Press.

Danielle Keats Citron. 2008. Technological due process. Washington University Law Review, 85:1249–1313.

TEI Consortium. 2008. TEI P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/guidelines/p5/, accessed 6 August 2018.

Ben Coppin. 2004. Artificial Intelligence Illuminated. Jones & Bartlett Publishers, Sudbury MA.

Alexei Czeskis, Ivayla Dermendjieva, Hussein Yapit, Alan Borning, Batya Friedman, Brian Gill, and Tadayoshi Kohno. 2010. Parenting from the pocket: Value tensions and technical directions for secure and private parent-teen mobile safety. In Proceedings of the Sixth Symposium on Usable Privacy and Security. ACM.

Leon Derczynski, Kalina Bontcheva, and Ian Roberts. 2016. Broad Twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1169–1179. The COLING 2016 Organizing Committee.

Laurence Devillers, Björn Schuller, Emily Mower Provost, Peter Robinson, Joseph Mariani, and Agnes Delaborde, editors. 2016. Proceedings of ETHI-CA2 2016: ETHics in Corpus Collection, Annotation & Application. LREC.

Nicholas Diakopoulos. 2016. Accountability in algorithmic decision making. Communications of the ACM, 59(2):56–62.

Penelope Eckert and John R. Rickford, editors. 2001. Style and Sociolinguistic Variation. Cambridge University Press, Cambridge.

Rod Ellis. 1994. The Study of Second Language Acquisition. Oxford University Press, Oxford.

Susan Ervin-Tripp. 1964. An analysis of the interaction of language, topic, and listener. American Anthropologist, 66(6, Part 2):86–102.

Karën Fort, Gilles Adda, and K. Bretonnel Cohen, editors. 2016. TAL et Ethique, special issue of Traitement Automatique des Langues, volume 57:2.

Batya Friedman. 1997. Introduction. In Batya Friedman, editor, Human Values and the Design of Computer Technology, pages 1–18. Stanford CA, Stanford.

Batya Friedman, David G. Hendry, and Alan Borning. 2017. A survey of value sensitive design methods. Foundations and Trends® in Human–Computer Interaction, 11(2):63–125.

Batya Friedman, Peter H. Kahn, Jr., and Alan Borning. 2006. Value sensitive design and information systems. In Ping Zhang and Dennis F. Galletta, editors, Human–Computer Interaction in Management Information Systems: Foundations, pages 348–372. M. E. Sharpe, Armonk NY.

Batya Friedman and Lisa P. Nathan. 2010. Multi-lifespan information system design: A research initiative for the HCI community. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2243–2246. ACM.

Batya Friedman, Lisa P. Nathan, and Daisy Yoo. 2016. Multi-lifespan information system design in support of transitional justice: Evolving situated design principles for the long(er) term. Interacting with Computers, 29:80–96.

Batya Friedman and Helen Nissenbaum. 1996. Bias in computer systems. ACM Transactions on Information Systems (TOIS), 14(3):330–347.
John Furler, Parker Magin, Marie Pirotta, and Mieke van Driel. 2012. Participant demographics reported in "table 1" of randomised controlled trials: A case of "inverse evidence"? International Journal for Equity in Health, 11.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. in prep. Datasheets for datasets. arXiv:1803.09010v1.

Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 483–488. Association for Computational Linguistics.

Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube, and Hanna Wallach, editors. 2017. Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. Association for Computational Linguistics.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598. Association for Computational Linguistics.

Anna Jørgensen, Dirk Hovy, and Anders Søgaard. 2015. Challenges of studying and processing dialects in social media. In Proceedings of the Workshop on Noisy User-generated Text, pages 9–18. Association for Computational Linguistics.

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 51–57. Association for Computational Linguistics.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430. Association for Computational Linguistics.
Roger Kreuz and Gina Caucci. 2007. Lexical influences on the perception of sarcasm. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 1–4. Association for Computational Linguistics.

Joshua A. Kroll, Joanna Huey, Solon Barocas, Edward W. Felten, Joel R. Reidenberg, David G. Robinson, and Harlan Yu. 2017. Accountable algorithms. University of Pennsylvania Law Review, 165. Fordham Law Legal Studies Research Paper No. 2765268. Available at SSRN: https://ssrn.com/abstract=2765268, accessed 6 August 2018.

William Labov. 1966. The Social Stratification of English in New York City. Center for Applied Linguistics, Washington, DC.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining, volume 5:1 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

Lawrence Mbuagbaw, Theresa Aves, Beverley Shea, Janet Jull, Vivian Welch, Monica Taljaard, Manosila Yoganathan, Regina Greer-Smith, George Wells, and Peter Tugwell. 2017. Considerations and guidance in designing equity-relevant clinical trials. International Journal for Equity in Health, 16(1):93.

David Moher, Sally Hopewell, Kenneth F. Schulz, Victor Montori, Peter C. Gøtzsche, P. J. Devereaux, Diana Elbourne, Matthias Egger, and Douglas G. Altman. 2010. CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomised trials. The BMJ, 340.

Robert Munro. 2015. Languages at ACL this year. Blog post, http://www.junglelightspeed.com/languages-at-acl-this-year/, accessed 22 September 2017.

Robert Munro and Christopher D. Manning. 2010. Subword variation in text message classification. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 510–518. Association for Computational Linguistics.

Lisa P. Nathan, Predrag V. Klasnja, and Batya Friedman. 2007. Value scenarios: A technique for envisioning systemic effects of new technologies. In CHI'07 Extended Abstracts on Human Factors in Computing Systems, pages 2585–2590. ACM.
Lisa P. Nathan, Milli Lake, Nell Carden Grey, Trond Nilsen, Robert F. Utter, Elizabeth J. Utter, Mark Ring, Zoe Kahn, and Batya Friedman. 2011. Multi-lifespan information system design: Investigating a new design approach in Rwanda. In Proceedings of the 2011 iConference, pages 591–597. ACM.

Trond T. Nilsen, Nell Carden Grey, and Batya Friedman. 2012. Public curation of a historic collection: A means for speaking safely in public. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work Companion, pages 277–278. ACM.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 79–86. Association for Computational Linguistics.

Tomáš Ptáček, Ivan Habernal, and Jun Hong. 2014. Sarcasm detection on Czech and English Twitter. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 213–223. Dublin City University and Association for Computational Linguistics.

Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 704–714. Association for Computational Linguistics.

Ben Shneiderman. 2016. Opinion: The dangers of faulty, biased, or malicious algorithms requires independent oversight. Proceedings of the National Academy of Sciences, 113(48):13538–13540.

Rob Speer. 2017. ConceptNet Numberbatch 17.04: better, less-stereotyped word vectors. Blog post, https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/, accessed 6 July 2017.

Rachael Tatman. 2017. Gender and dialect bias in YouTube's automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59. Association for Computational Linguistics.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93. Association for Computational Linguistics.

Ke Yang, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, H. V. Jagadish, and Gerome Miklau. 2018. A nutritional label for rankings. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 1773–1776, New York, NY, USA. ACM.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2941–2951. Association for Computational Linguistics.