Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

Emily M. Bender, Department of Linguistics, University of Washington, ebender@uw.edu
Batya Friedman, The Information School, University of Washington, batya@uw.edu

Abstract

In this paper, we propose data statements as a design solution and professional practice for natural language processing technologists, in both research and development. Through the adoption and widespread use of data statements, the field can begin to address critical scientific and ethical issues that result from the use of data from certain populations in the development of technology for other populations. We present a form that data statements can take and explore the implications of adopting them as part of regular practice. We argue that data statements will help alleviate issues related to exclusion and bias in language technology; lead to better precision in claims about how NLP research can generalize and thus better engineering results; protect companies from public embarrassment; and ultimately lead to language technology that meets its users in their own preferred linguistic style and, furthermore, does not misrepresent them to others.

1 Introduction

As technology enters widespread societal use it is important that we, as technologists, think critically about how the design decisions we make and systems we build impact people, including not only users of the systems but also other people who will be affected by the systems without directly interacting with them. For this paper, we focus on natural language processing (NLP) technology. Potential adverse impacts include NLP systems that fail to work for specific subpopulations (e.g. children or speakers of language varieties which are not supported by training or test data) or systems that reify and reinforce biases present in training data (e.g. a resume-review system that ranks female candidates as less qualified for computer programming jobs because of biases present in training text). There are both scientific and ethical reasons to be concerned. Scientifically, there is the issue of generalizability of results; ethically, the potential for significant real-world harms.

While there is increasing interest in ethics in NLP,1 there remains the open and urgent question of how we integrate ethical considerations into the everyday practice of our field. This question has no simple answer, but rather will require a constellation of multi-faceted solutions. Toward that end, and drawing on value sensitive design (Friedman et al., 2006), this paper contributes one new professional practice, called data statements, which we argue will bring about improvements in engineering and scientific outcomes while also enabling more ethically responsive NLP technology. A data statement is a characterization of a dataset which provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software. In developing this practice, we draw on analogous practices from the fields of psychology and medicine that require some standardized information about the populations studied (e.g. APA 2009; Moher et al. 2010; Furler et al. 2012; Mbuagbaw et al. 2017). [Footnote 1: This interest has manifested in workshops (Fort et al., 2016; Devillers et al., 2016; Hovy et al., 2017) and papers (Hovy and Spruit, 2016) in NLP, as well as workshops in related fields, notably the FATML series (http://www.fatml.org/) held annually since 2014.]
Though the construct of data statements applies more broadly, in this paper we focus specifically on data statements for NLP systems. Data statements should be included in most writing on NLP including: papers presenting new datasets, papers reporting experimental work with datasets, and documentation for NLP systems. Data statements should help us as a field engage with the ethical issues of exclusion, overgeneralization, and underexposure (Hovy and Spruit, 2016). Furthermore, as data statements bring our datasets and their represented populations into better focus, they should also help us as a field deal with scientific issues of generalizability and reproducibility. Adopting this practice will position us to better understand and describe our results and, ultimately, do better and more ethical science and engineering.2 [Footnote 2: By arguing here that data statements promote both ethical practice and sound science, we do not mean to suggest that these two can be conflated. A system can give accurate responses as measured by some test set (scientific soundness) and yet lead to real-world harms (ethical issues). Accordingly, it is up to researchers and research communities to engage with both scientific and ethical ideals.]

We begin by defining terms (§2), discuss why NLP needs data statements (§3) and relate our proposal to current practice (§4). Next is the substance of our contribution: a detailed proposal for data statements for NLP (§5), illustrated with two case studies (§6). In §7 we discuss how data statements can mitigate bias and use the technique of 'value scenarios' to envision potential effects of their adoption. Finally, we relate data statements to similar emerging proposals (§8), make recommendations for how to implement and promote the uptake of data statements (§9), and lay out considerations for tech policy (§10).

2 Definitions

As this paper is intended for at least two distinct audiences (NLP technologists and tech policymakers), we use this section to briefly define key terms.

Dataset, Annotations An (NLP) dataset is a collection of speech or writing possibly combined with annotations.3 [Footnote 3: Multi-modal data sets combine language and video or other additional signals. Here, our focus is on linguistic data.] Annotations include indications of linguistic structure like part of speech tags or syntactic parse trees, as well as labels classifying aspects of what the speakers were attempting to accomplish with their utterances. The latter includes annotations for sentiment (Liu, 2012) and for figurative language or sarcasm (e.g. Riloff et al. 2013; Ptáček et al. 2014). Labels can be naturally occurring, such as star ratings in reviews taken as indications of the overall sentiment of the review (e.g. Pang et al. 2002) or the hashtag #sarcasm used to identify sarcastic language (e.g. Kreuz and Caucci 2007).

Speaker We use the term speaker to refer to the individual who produced some segment of linguistic behavior included in the dataset, even if the linguistic behavior is originally written.

Annotator Annotator refers to people who assign annotations to the raw data, including transcribers of spoken data.
Annotators may be crowdworkers or highly trained researchers, sometimes involved in the creation of the annotation guidelines. Annotation is often done semi-automatically, with NLP tools being used to create a first pass which is corrected or augmented by human annotators.

Curator A third role in dataset creation, less commonly discussed, is the curator. Curators are involved in the selection of which data to include, by selecting individual documents, by creating search terms that generate sets of documents, by selecting speakers to interview and designing interview questions, etc.

Stakeholders Stakeholders are people impacted directly or indirectly by a system (Friedman et al., 2006; Czeskis et al., 2010). Direct stakeholders include those who interact with the system, either by participating in system creation (developers, speakers, annotators and curators) or by using it. Indirect stakeholders do not use the system but are nonetheless impacted by it. For example, people whose web content is displayed or rendered invisible by search engine algorithms are indirect stakeholders with respect to those systems.

Algorithm We use the term algorithm to encompass both rule-based and machine learning approaches to NLP. Some algorithms (typically rule-based ones) are tightly connected to the datasets they are developed against. Other algorithms can be easily ported to different datasets.4 [Footnote 4: Datasets used during algorithm development can influence design choices in machine learning approaches too: Munro and Manning (2010) found that subword information, not helpful in English SMS classification, is extremely valuable in Chichewa, a morphologically complex language with high orthographic variability.]

System We use the term (NLP) system to refer to a piece of software that does some kind of natural language processing, typically involving algorithms trained on particular datasets. We use this term to refer to both components focused on specific tasks (e.g. the Stanford parser (Klein and Manning, 2003) trained on the Penn Treebank (Marcus et al., 1993) to do English parsing) and user-facing products such as Amazon's Alexa or Google Home.

Bias We use the term bias to refer to cases where computer systems "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others" (Friedman and Nissenbaum, 1996, 332).5 To be clear: (i) unfair discrimination does not give rise to bias unless it occurs systematically and (ii) systematic discrimination does not give rise to bias unless it results in an unfair outcome. Friedman and Nissenbaum (1996) show that in some cases, system bias reflects biases in society; these are pre-existing biases with roots in social institutions, practices and attitudes. In other cases, reasonable, seemingly neutral, technical elements (e.g. the order in which an algorithm processes data) can result in bias when used in real world contexts; these technical biases stem from technical constraints and decisions. A third source of bias, emergent bias, occurs when a system designed for one context is applied in another, e.g. with a different population. [Footnote 5: The machine learning community uses the term bias to refer to constraints on what an algorithm can learn, which may prevent it from picking up patterns in a dataset or lead it to relevant patterns more quickly (see Coppin 2004, Ch. 10). This use of the term does not carry connotations of unfairness.]

3 Why does NLP need data statements?

Recent studies have documented the fact that limitations in training data lead to ethically problematic limitations in the resulting NLP systems.
Systems trained on naturally occurring language data learn the pre-existing biases held by the speakers of that data: Typical vector-space representations of lexical semantics pick up cultural biases about gender (Bolukbasi et al., 2016) and race, ethnicity and religion (Speer, 2017). Zhao et al. (2017) show that beyond picking up such biases, machine learning algorithms can amplify them. Furthermore, these biases, far from being inert or simply a reflection of the data, can have real-world consequences for both direct and indirect stakeholders. For example, Speer (2017) found that a sentiment analysis system rated reviews of Mexican restaurants as more negative than other types of food with similar star ratings, because of associations between the word Mexican and words with negative sentiment in the larger corpus on which the word embeddings were trained. (See also Kiritchenko and Mohammad 2018.) In these and other ways, pre-existing biases can be trained into NLP systems. There are other studies showing that systems from part of speech taggers (Hovy and Søgaard, 2015; Jørgensen et al., 2015) to speech recognition engines (Tatman, 2017) perform better for speakers whose demographic characteristics better match those represented in the training data. These are examples of emergent bias.

Because the linguistic data we use will always include pre-existing biases and because it is not possible to build an NLP system in such a way that it is immune to emergent bias, we must seek additional strategies for mitigating the scientific and ethical shortcomings that follow from imperfect datasets. We propose here that foregrounding the characteristics of our datasets can help, by allowing reasoning about what the likely effects may be and by making it clearer which populations are and are not represented, for both training and test data. For training data, the characteristics of the dataset will affect how the system will work when it is deployed. For test data, the characteristics of the dataset will affect what can be measured about system performance and thus provide important context for scientific claims.

4 Current practice and challenges

Typical current practice in academic NLP is to present new datasets with a careful discussion of the annotation process as well as a brief characterization of the genre (usually by naming the underlying data source) and the language. NLP papers using datasets for training or test data tend to more briefly characterize the annotations and will sometimes leave out mention of genre and even language.6 [Footnote 6: Surveys of EACL 2009 (Bender, 2011) and ACL 2015 (Munro, 2015) found 33%–81% of papers failed to name the language studied. (It always appeared to be English.)] Initiatives such as the Open Language Archives Community (OLAC; Bird and Simons 2000), the Fostering Language Resources Network (FLaReNet; Calzolari et al. 2012) and the Text Encoding Initiative (TEI; Consortium 2008) prescribe metadata to publish with language resources, primarily to aid in the discoverability of such resources. FLaReNet also encourages documentation of language resources. And yet, it is very rare to find detailed characterization of the speakers whose data is captured or the annotators who provided the annotations, though the latter are usually characterized as being experts or crowdworkers.7 [Footnote 7: A notable exception is Derczynski et al. (2016), who present a corpus of tweets collected to sample diverse speaker communities (location, type of engagement with Twitter), at diverse points in time (time of year, month, and day), and annotated with named entity labels by crowdworker annotators from the same locations as the tweet authors.]
To fill this information gap, we argue that data statements should be included in every NLP publication which presents new datasets and in the documentation of every NLP system, as part of a chronology of system development including descriptions of the various datasets for training, tuning and testing. Data statements should also be included in all NLP publications reporting experimental results. Accordingly, data statements will need to be both detailed and concise. To meet these competing goals, we propose two variants. For each dataset there should be a long-form version in an academic paper presenting the dataset or in system documentation. Research papers presenting experiments making use of datasets with existing long-form data statements should include shorter data statements and cite the longer one.8 [Footnote 8: Older datasets can be retrofitted with citeable long-form data statements published on project web pages or archives.]

We note another set of goals in competition: While readers need as much information as possible in order to understand how the results can and cannot be expected to generalize, considerations of the privacy of the people involved (speakers, annotators) might preclude including certain kinds of information, especially with small groups. Each project will need to find the right balance, but this can be addressed in part by asking annotators and speakers for permission to collect and publish such information.

5 Proposed Data Statement Schema

We propose the following schema of information to include in long and short form data statements.

5.1 Long form

Long form data statements should be included in system documentation and in academic papers presenting new datasets, and should strive to provide the following information:

A. CURATION RATIONALE Which texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.

B. LANGUAGE VARIETY Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with:

• A language tag from BCP-47 identifying the language variety (e.g. en-US or yue-Hant-HK) [Footnote 9: https://tools.ietf.org/rfc/bcp/bcp47.txt]
• A prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g. English as spoken in Palo Alto CA (USA) or Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin)
C. SPEAKER DEMOGRAPHIC Sociolinguistics has found that variation (in pronunciation, prosody, word choice, and grammar) correlates with speaker demographic characteristics (Labov, 1966), as speakers use linguistic variation to construct and project identities (Eckert and Rickford, 2001). Transfer from native languages (L1) can affect the language produced by non-native (L2) speakers (Ellis, 1994, Ch. 8). A further important type of variation is disordered speech (e.g. dysarthria). Specifications include:

• Age
• Gender
• Race/ethnicity
• Native language
• Socio-economic status
• Number of different speakers represented
• Presence of disordered speech

D. ANNOTATOR DEMOGRAPHIC What are the demographic characteristics of the annotators and annotation guideline developers? Their own 'social address' influences their experience with language and thus their perception of what they are annotating. Specifications include:

• Age
• Gender
• Race/ethnicity
• Native language
• Socio-economic status
• Training in linguistics/other relevant discipline

E. SPEECH SITUATION Characteristics of the speech situation can affect linguistic structure and patterns at many levels. The intended audience of a linguistic performance can also affect linguistic choices on the part of speakers.10 [Footnote 10: For example, people speak differently to close friends v. strangers, to small groups v. large ones, to children v. adults and to people v. machines (e.g. Ervin-Tripp 1964).] The time and place provide broader context for understanding how the texts collected relate to their historical moment and should also be made evident in the data statement.11 [Footnote 11: Mutable speaker demographic information, such as age, is interpreted as relative to the time of the linguistic behavior.] Specifications include:

• Time and place
• Modality (spoken/signed, written)
• Scripted/edited v. spontaneous
• Synchronous v. asynchronous interaction
• Intended audience

F. TEXT CHARACTERISTICS Both genre and topic influence the vocabulary and structural characteristics of texts (Biber, 1995), and should be specified.

G. RECORDING QUALITY For data that includes audio/visual recordings, indicate the quality of the recording equipment and any aspects of the recording situation that could impact recording quality.

H. OTHER There may be other information of relevance as well (e.g. the demographic characteristics of the curators). As stated above, this is intended as a starting point and we anticipate best practices around writing data statements to develop over time.

I. PROVENANCE APPENDIX For datasets built out of existing datasets, the data statements for the source datasets should be included as an appendix.

5.2 Short form

Short form data statements should be included in any publication using a dataset for training, tuning or testing a system and may also be appropriate for certain kinds of system documentation. The short form data statement does not replace the long form one, but rather should include a pointer to it. For short form data statements, we envision 60–100 word summaries of the description included in the long form, covering most of the main points.
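Nothing in either variant requires a particular file format, but a machine-readable rendering of the long-form schema would make the digital templates (§9) and field-level aggregation (§7.2) discussed later easier to build. The following Python sketch is our own illustration under that assumption; the class and field names are hypothetical and not part of the proposed schema, which deliberately leaves the exact format open.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LanguageVariety:
    """One language variety represented in the dataset (schema item B)."""
    bcp47_tag: str          # e.g. "en-US" or "yue-Hant-HK"
    prose_description: str  # glosses the tag, e.g. "English as spoken in Palo Alto CA (USA)"


@dataclass
class LongFormDataStatement:
    """Skeleton mirroring schema items A-I; all names here are illustrative only."""
    curation_rationale: str                    # A
    language_varieties: List[LanguageVariety]  # B
    speaker_demographic: str                   # C (prose; may state what is unavailable)
    annotator_demographic: str                 # D
    speech_situation: str                      # E
    text_characteristics: str                  # F
    recording_quality: Optional[str] = None    # G ("N/A" for text-only data)
    other: Optional[str] = None                # H
    provenance: List["LongFormDataStatement"] = field(default_factory=list)  # I


def short_form_ok(short_form: str, long_form_url: str) -> bool:
    """Check the two constraints §5.2 places on a short form:
    roughly 60-100 words, plus a pointer to the long form."""
    word_count = len(short_form.split())
    return 60 <= word_count <= 100 and long_form_url in short_form

Keeping the fields as free text preserves the schema's emphasis on careful prose characterization while still exposing the language-variety tags and provenance links that catalog-level tooling could aggregate over.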
5.3 Summary

We have outlined the kind of information data statements should include, addressing the needs laid out in §3, describing both long and short versions. As the field gains experience with data statements, we expect to see a better understanding of what to include as well as best practices for writing data statements emerge.

Note that full specification of all of this information may not be feasible in all cases. For example, in datasets created from web text, precise demographic information may be unavailable. In other cases (e.g. to protect the privacy of annotators) it may be preferable to provide ranges rather than precise values. For the description of demographic characteristics, our field can look to others for best practices, such as those described in the American Psychological Association's Manual of Style.

It may seem redundant to reiterate this information in every paper that makes use of well-trodden datasets. However, it is critical to consider the data anew each time to ensure that it is appropriate for the NLP work being undertaken and that the results reported are properly contextualized. Note that the requirement is not that datasets be used only when there is an ideal fit between the dataset and the NLP goals but rather that the characteristics of the dataset be examined in relation to the NLP goals and limitations be reported as appropriate.

6 Case studies

We illustrate the idea of data statements with two case studies. Ideally, data statements are written at or close to the time of dataset creation. These data statements were constructed post hoc in conversation with the dataset curators. The first entails labels for a particular subset of all Twitter data. In contrast, the second entails all available data for an intentionally generated interview collection, including audio files and transcripts. Both illustrate how even when specific information is not available, the explicit statement of its lack of availability provides a more informative picture of the dataset.

6.1 Hate Speech Twitter Annotations

The Hate Speech Twitter Annotations collection is a set of labels for ∼19,000 tweets collected by Waseem and Hovy (2016) and Waseem (2016). The dataset can be accessed via https://github.com/zeerakw/hatespeech. [Footnote 12: This data statement was prepared based on information provided by Zeerak Waseem, p.c., Feb-Apr 2018 and reviewed and approved by him.]

A. CURATION RATIONALE In order to study the automatic detection of hate speech in tweets and the effect of annotator knowledge (crowdworkers v. experts) on the effectiveness of models trained on the annotations, Waseem and Hovy (2016) performed a scrape of Twitter data using contentious terms and topics. The terms were chosen by first crowd-sourcing an initial set of search terms on feminist Facebook groups and then reviewing the resulting tweets for terms to use and adding others based on the researcher's intuition.13 [Footnote 13: In a standalone data statement, the search terms should be given in the main text. To avoid accosting readers with slurs in this article, we instead list them in this footnote. Waseem and Hovy (2016) provide the following complete list of terms used in their initial scrape: 'MKR', 'asian drive', 'feminazi', 'immigrant', 'nigger', 'sjw', 'WomenAgainstFeminism', 'blameonenotall', 'islam terrorism', 'notallmen', 'victimcard', 'victim card', 'arab terror', 'gamergate', 'jsil', 'racecard', 'race card'.] Additionally, some prolific users of the terms were chosen and their timelines collected. For the annotation work reported in Waseem (2016), expert annotators were chosen for their attitudes with respect to intersectional feminism in order to explore whether annotator understanding of hate speech would influence the labels and classifiers built on the dataset.

B. LANGUAGE VARIETY The data was collected via the Twitter search API in late 2015. Information about which varieties of English are represented is not available, but at least Australian (en-AU) and US (en-US) mainstream Englishes are both included.

C. SPEAKER DEMOGRAPHIC Speakers were not directly approached for inclusion in this dataset and thus could not be asked for demographic information. More than 1500 different Twitter accounts are included.
Based on independent information about Twitter usage and impressionistic observation of the tweets by the dataset curators, the data is likely to include tweets from both younger (18–30) and older (30+) adult speakers, the majority of whom likely identify as white. No direct information is available about gender distribution or socioeconomic status of the speakers. It is expected that most, but not all, of the speakers speak English as a native language.

D. ANNOTATOR DEMOGRAPHIC This dataset includes annotations from both crowdworkers and experts. 1,065 crowdworkers were recruited through Crowd Flower, primarily from Europe, South America and North America. Beyond country of residence, no further information is available about the crowdworkers. The expert annotators were recruited specifically for their understanding of intersectional feminism. All were informally trained in critical race theory and gender studies through years of activism and personal research. They ranged in age from 20–40, included 3 men and 13 women, and gave their ethnicity as white European (11), East Asian (2), Middle East/Turkey (2), and South Asian (1). Their native languages were Danish (12), Danish/English (1), Turkish/Danish (1), Arabic/Danish (1), and Swedish (1). Based on income levels, the expert annotators represented upper lower class (5), middle class (7), and upper middle class (2).

E. SPEECH SITUATION All tweets were initially published between April 2013 and December 2015. Tweets represent informal, largely asynchronous, spontaneous, written language, of up to 140 characters per tweet. About 23% of the tweets were in reaction to a specific Australian TV show (My Kitchen Rules) and so were likely meant for roughly synchronous interaction with other viewers. The intended audience of the tweets was either other viewers of the same show, or simply the general Twitter audience. For the tweets containing racist hate speech, the authors appear to intend them both for those who would agree but also for people whom they hope to provoke into having an agitational and confrontational exchange.

F. TEXT CHARACTERISTICS For racist tweets the topic was dominated by Islam and Islamophobia. For sexist tweets predominant topics were the TV show and people making sexist statements while claiming not to be sexist. The majority of tweets only used one modality (text) though some included links to pictures and websites.

G. RECORDING QUALITY N/A.

H. OTHER N/A.

I. PROVENANCE APPENDIX N/A.

Twitter Hate Speech short form This dataset includes labels for ∼19,000 English tweets from different locales (Australia and North America being well-represented) selected to contain a high prevalence of hate speech.
The labels indicate the presence and type of hate speech and were provided both by experts (mostly with extensive if informal training in critical race theory and gender studies and English as a second language) and by crowdworkers primarily from Europe and the Americas. [Include a link to the long form.]

6.2 Voices from the Rwanda Tribunal (VRT)

Voices from the Rwanda Tribunal is a collection of 49 video interviews in English and French with personnel from the International Criminal Tribunal for Rwanda (ICTR) comprising 50-60 hours of material with high quality transcription throughout (Nathan et al., 2011; Nilsen et al., 2012; Friedman et al., 2016). The dataset can be downloaded from http://www.tribunalvoices.org. [Footnote 14: This data statement was prepared based on information provided by co-author Batya Friedman.]

A. CURATION RATIONALE The VRT project, funded by the United States National Science Foundation, is part of a research program on developing multi-lifespan design knowledge (Friedman and Nathan, 2010). It is independent from the ICTR, the United Nations, and the government of Rwanda. To help ensure accuracy and guard against breaches of confidentiality, interviewees had an opportunity to review and redact any material that was either misspoken or revealed confidential information. A total of two words have been redacted. No other review or redaction of content has occurred. The dataset includes all publicly released material from the collection; as of the writing of this data statement (28 September 2017) one interview and a portion of a second are currently sealed.

B. LANGUAGE VARIETY Of the interviews, 44 are conducted in English (en-US and international English on the part of the interviewees, en-US on the part of the interviewers) and 5 in French and English, with the interviewee speaking international French, the interviewer speaking English (en-US) and an interpreter speaking both.15 [Footnote 15: At the end of one interview, there is 38 seconds of untranscribed speech in Kinyarwanda (rw).]

C. SPEAKER DEMOGRAPHIC The interviewees (13 women and 36 men, all adults) are professionals working in the area of international justice, such as judges or prosecutors, and support roles of the same, such as communications, prison warden, and librarian. They represent a variety of nationalities: Argentina, Benin, Cameroon, Canada, England, The Gambia, Ghana, Great Britain, India, Italy, Kenya, Madagascar, Mali, Morocco, Nigeria, Norway, Peru, Rwanda, Senegal, South Africa, Sri Lanka, St. Kitts and Nevis, Sweden, Tanzania, Togo, Uganda, and the US. Their native languages are not known, but are presumably diverse. The 7 interviewers (2 women and 5 men) are information and legal professionals from different regions in the US. All are native speakers of US English, all are white, and at the time of the interviews they ranged in age from early 40s to late 70s. The interpreters are language professionals employed by the ICTR with experience interpreting between French and English. Their age, gender, and native languages are unknown.

D. ANNOTATOR DEMOGRAPHIC The initial transcription was outsourced to a professional transcription company, so information about these transcribers is unavailable. The English transcripts were reviewed by English speaking (en-US) members of the research team for accuracy and then reviewed a third time by an additional English speaking (en-US) member of the team.
The French/English transcripts received a second and third review for accuracy by bilingual French/English doctoral students at the University of Washington. Because of the sensitivity of the topic, the high political status of some interviewees (e.g. Prosecutor for the tribunal), and the international stature of the institution, it is very important that interviewees' comments be accurately transcribed. Accordingly, the bar for quality of transcription was set extremely high.

E. SPEECH SITUATION The interviews were conducted in Autumn 2008 at the ICTR in Arusha, Tanzania and in Rwanda, face-to-face, as spoken language. The interviewers begin with a prepared set of questions, but most of the interaction is semi-structured. Most generally, the speech situation can be characterized as a dialogue, but some of the interviewees give long replies, so stretches may be better characterized as monologues. For the interviewees, the immediate interlocutor is the interviewer, but the intended audience is much larger (see Part F below).

F. TEXT CHARACTERISTICS The interviews were intended to provide an opportunity for tribunal personnel to reflect on their experiences working at the ICTR and what they would like to share with the people of Rwanda, the international justice community, and the global public now, 50 and 100 years from now. Professionals from all organs of the tribunal (judiciary, prosecution, registry) were invited to be interviewed, with effort made to include a broad spectrum of roles (e.g. judges, prosecutor, defense counsel, but also the warden, librarian, language services). Interviewees expected their interviews to be made broadly accessible.

G. RECORDING QUALITY The video interviews were recorded with high definition equipment in closed but not sound-proof offices. There is some background noise.

H. OTHER N/A.

I. PROVENANCE APPENDIX N/A.

VRT short form The data represents well-vetted transcripts of 49 spoken interviews with personnel from the International Criminal Tribunal for Rwanda (ICTR) about their experience at the tribunal and reflections on international justice, in international English (44 interviews) and French (5 interviews with interpreters). Interviewees are adults working in international justice and support fields at the ICTR; interviewers are adult information or legal professionals, highly fluent in en-US; and transcribers are highly educated, highly fluent English and French speakers. [Include a link to the long form.]

6.3 Summary

These sample data statements are meant to illustrate how the schema can be used to communicate the specific characteristics of datasets. They were both created post hoc, in communication with the dataset curators. Once data statements are created as a matter of best practice, however, they should be developed in tandem with the datasets themselves and may even inform the curation of datasets. At the same time, data statements will need to be written for widely used, pre-existing datasets, where documentation may be lacking, memories imperfect, and dataset curators no longer accessible. While retrospective data statements may be incomplete, by and large we believe they can still be valuable.

Our case studies also underscore how curation rationales shape the specific kinds of texts included.
This is particularly striking in the case of the Hate Speech Twitter Annotations, where the specific search terms very clearly shaped the spe- cific kinds of hate speech included and the ways in which any technology or studies built on this dataset will generalize. 7 A tool for mitigating bias We have explicitly designed data statements as a tool for mitigating bias in systems that use data for training and testing. Data statements are particu- larly well suited to mitigate forms of emergent and pre-existing bias. For the former, we see benefits at the level of specific systems and of the field: When a system is paired with data statement(s) for the data it is trained on, those deploying it are empowered to assess potential gaps between the speaker populations represented in the training and test data and the populations whose language the system will be working with. At the field level, data statements enable an examination of the en- tire catalog of testing and training datasets to help identify populations who are not yet included. All of these groups are vulnerable to emergent bias, in that any system would by definition have been trained and tested on data from datasets that do not represent them well. Data statements can also be instrumental in the diagnosis (and thus mitigation) of pre-existing bias. Consider again Speer’s (2017) example of Mexican restaurants and sentiment analysis. The information that the word vectors were trained on general web text (together with knowledge of what kind of societal biases such text might contain) was key in figuring out why the system consis- tently underestimated the ratings associated with reviews of Mexican restaurants. In order to en- able both more informed system development and deployment and audits by users and others of sys- tems in action, it is critical that characterizations of the training and test data underlying systems be available. To be clear, data statements do not in and of themselves solve the entire problem of bias. Rather, they are a critical enabling infrastructure. Consider by analogy this example from Friedman (1997) about access to technology and employ- ment for people with disabilities. In terms of computer system design, we are not so privileged as to determine rigidly the values that will emerge from the systems we design. But neither can we abdicate responsibility. For example, let us for the moment agree [. . . ] that disabled people in the work place should be able to access technology, just as they should be able to access a public build- ing. As system designers we can make the choice to try to construct a tech- nological infrastructure which disabled people can access. If we do not make this choice, then we single-handedly un- dermine the principle of universal ac- cess. But if we do make this choice, and are successful, disabled people would still rely, for example, on employers to hire them. (p.3) Similarly, with respect to bias in NLP technology, if we do not make a commitment to data state- ments or a similar practice for making explicit the characteristics of datasets, then we will single- handedly undermine the field’s ability to address bias. In NLP, we expect proposals to come with some kind of evaluation. In this paper, we have demon- strated the substance and ‘writability’ of a data statement through two exemplars (§6). 
However, the positive effects of data statements that we an- ticipate (and negative effects we haven’t antici- pated) cannot be demonstrated and tested a priori, as their impact emerges through practice. Thus, we look to value sensitive design, which encour- ages us to consider what would happen if a pro- posed technology were to come into widespread use, over longer periods of time, with attention to a wide range of stakeholders, potential benefits, and harms (Friedman et al., 2006, 2017). We do this with value scenarios (Nathan et al., 2007; Czeskis et al., 2010). Specifically, we look at two kinds of value scenarios: Those concerning NLP technology that fails to take into account an appropriate match between training data and deployment con- text and those that envision possible positive as well as negative consequences stemming from the widespread use of the specific ‘technology’ we are proposing in this paper (data statements). Envi- sioning possible negative outcomes allows us to consider how to mitigate such possibilities before they occur. 7.1 Public health and NLP for social media This value scenario is inspired by Jurgens et al. (2017), who provide a similar one to motivate training language ID systems on more represen- tative datasets. Scenario. Big U Hospital in a town in the Up- per Midwest collaborates with the CS Department at Big U to create a Twitter-based early warning system for infectious disease, called DiseaseAlert. Big U Hospital finds that the system improves pa- tient outcomes by alerting hospital staff to emerg- ing community health needs and alerting physi- cians to test for infectious diseases that currently are active locally. Big U decides to make the DiseaseAlert project open source to provide similar benefits to hospi- tals across the Anglophone world and is delighted to learn that City Hospital in Abuja, Nigeria is ex- cited to implement DiseaseAlert locally. Big U supports City Hospital with installing the code, in- cluding localizing the system to draw on tweets posted from Abuja. Over time, however, City Hos- pital finds that the system is leading its physicians to order unnecessary tests and that it is not at all accurate in detecting local health trends. City Hos- pital complains to Big U about the poor system performance and reports that their reputation is be- ing damaged. Big U is puzzled, as the DiseaseAlert performs well in the Upper Midwest, and they had spent time localizing the system to use tweets from Abuja. After a good deal of frustration and in- vestigation into Big U’s system, the developers discover that the third-party language ID compo- nent they had included was trained on only highly- edited US and UK English text. As a result, it tends to misclassify tweets in regional or non- standard varieties of English as ‘not English’ and therefore not relevant. Most of the tweets posted by people living in Abuja that City Hospital’s sys- tem should have been looking at were thrown out by the system at the first step of processing. Analysis. City Hospital adopted Big U’s open source DiseaseAlert system in exactly the way Big U intended. However, the documentation for the language ID component lacked critical informa- tion needed to help ensure the localization process would be successful; namely, information about the training and test sets for the system. 
Had Big U included data statements for all system compo- nents (including third-party components) in their documentation, then City Hospital IT staff would have been positioned to recognize the potential limitation of DiseaseAlert and to work proactively with Big U to ensure the system performed well in City Hospital’s context. Specifically, in reviewing data statements for all system components, the IT staff could note that the language ID component was trained on data unlike what they were seeing in their local tweets and ask for a different lan- guage ID component or ask for the existing one to be retrained. In this manner, an emergent bias and its concomitant harms could have been iden- tified and addressed during the system adaptation process prior to deployment. 7.2 Toward an inclusive data catalog In §7.1 we consider data statements in relation to a particular system. Here, we explore their potential to enable better science in NLP overall. Scenario. It’s 2022 and ‘Data Statement’ has become a standard section heading for NLP re- search papers and system documentation. Hap- pily, reports of mismatch between dataset and community of application leading to biased sys- tems have decreased. Yet, research community members articulate an unease regarding which lan- guage communities are and which are not part of the field’s data catalog — the abstract total collec- tion of data and associated meta-data to which the field has access — and the possibility for resulting bias in NLP at a systemic level. In response, several national funding bodies jointly fund a project to discover gaps in knowl- edge. The project compares existing data state- ments to surveys of spoken languages and system- atically maps which language varieties have re- sources (annotated corpora and standard process- ing tools) and which ones lack such resources. The study turns up a large number of language varieties lacking such resources; it also produces a precise list of underserved populations, some of which are quite sizable, suggesting opportunity for impactful intervention at the academic, industry and govern- ment levels. Study results in hand, the NLP community em- barks on an intentional program to broaden the language varieties in the data catalog. Public dis- cussions lead to criteria for prioritizing language varieties and funding agencies come together to fund collaborative projects to produce state of the art resources for understudied languages. Over time, the data catalog becomes more inclusive; bias in the catalog, while not wholly absent, is sig- nificantly reduced and NLP researchers and devel- opers are able to run more comprehensive exper- iments and build technology that serves a larger portion of society. Analysis. The NLP community has recognized critical limitations in the field’s existing data cat- alog, leaving many language communities un- derserved (Bender, 2011; Munro, 2015; Jurgens et al., 2017).16 The widespread uptake of data statements positions the NLP community to docu- ment the degree to which it leaves out certain lan- guage groups and empower itself to systematically broaden the data catalog. In turn, individual NLP systems could be trained on datasets that more closely align with the language of anticipated sys- tem users, thereby averting emergent bias. Fur- thermore, NLP researchers can more thoroughly test key research ideas and systems, leading to more reliable scientific results. 
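Neither scenario presupposes specific tooling, but both become easier if data statements expose their language-variety tags (§5.1, item B) in machine-readable form. The sketch below is our own illustration and assumes a hypothetical record format with a "language_varieties" field; it shows how a deployment team (as in §7.1) or a funding body surveying the field's data catalog (as in §7.2) might flag varieties with no supporting data statement.

from collections import Counter
from typing import Iterable, List, Set


def declared_varieties(data_statements: Iterable[dict]) -> Counter:
    """Count BCP-47 tags declared across a collection of data statements.

    Each statement is assumed to expose a "language_varieties" list of
    BCP-47 tags (a hypothetical field; see §5.1, item B).
    """
    counts: Counter = Counter()
    for statement in data_statements:
        for tag in statement.get("language_varieties", []):
            counts[tag.lower()] += 1
    return counts


def coverage_gaps(data_statements: Iterable[dict],
                  deployment_varieties: Set[str]) -> List[str]:
    """Return deployment varieties that no data statement declares.

    Treating a bare "en" statement as covering "en-NG" would be a policy
    decision; this sketch deliberately requires an exact tag match.
    """
    declared = declared_varieties(data_statements)
    return sorted(v for v in deployment_varieties if declared[v.lower()] == 0)


# For example, the DiseaseAlert developers of §7.1 could have run
# coverage_gaps(statements_for_all_components, {"en-NG"}) before deployment,
# and a funding body could run declared_varieties over the field's whole
# data catalog to map underrepresented varieties, as envisioned in §7.2.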
[Footnote 16: The EU-funded project META-NET worked on identifying gaps at the level of whole languages for Europe, producing a series of 32 white papers, each concerning one European language, available from http://www.meta-net.eu/whitepapers/overview, accessed 6 August 2018.]

7.3 Anticipating and mitigating barriers

Finally, we explore one potential negative outcome and how with care it might be mitigated: that of data statements as a barrier to research.

Scenario. In response to widespread uptake, in 2026 the Association for Computational Linguistics (ACL) proposes that data statements be standardized and required components of research papers. A standards committee is formed, open public professional discussion is engaged, and in 2028 a standard is adopted. It mandates data statements as a requirement for publication, with standardized information fields and strict specifications for how these should be completed to facilitate automated meta-analysis. There is great hope that the field will experience increasing benefits from the ability to compare, contrast, and build complementary datasets.

Many of those hopes are realized. However, in a relatively short period of time papers from underrepresented regions abruptly decline. In addition, the number of papers from everywhere producing and reporting on new datasets declines as well. Distressed by this outcome, the ACL constitutes an ad hoc committee to investigate. A survey of researchers reveals two distinct causes: First, researchers from institutions not yet well represented at ACL were having their papers desk-rejected due to missing or insufficient data statements. Second, researchers who might otherwise have developed a new dataset instead chose to use existing datasets whose data statements could simply be copied. In response, the ACL executive develops a mentoring service to assist authors in submitting standards-compliant data statements and considers relaxing the standard somewhat in order to encourage more dataset creation.

Analysis. With any new technology, there can be unanticipated ripple effects; data statements are no exception. Here we envision two potential negative impacts, which could both be mitigated through other practices. Importantly, while we recommend the practice of creating data statements, we believe that they should be widely used before any standardization takes place. Furthermore, once a degree of expertise in this area is built up, we recommend that mentoring be put in place proactively. Community engagement and mentoring will also contribute to furthering ethical discourse and practice in the field.

7.4 Summary

The value scenarios described here point to key upsides to the widespread adoption of data statements and also help to provide words of caution. They are meant to be thought-provoking and plausible, but are not predictive. Importantly, the scenarios illustrate how, if used well, data statements could be an effective tool for mitigating bias in NLP systems.

8 Related work

We see three strands of related work which lend support to our proposal and to the proposition that data statements will have the intended effect: similar practices in medicine (§8.1), emerging, independent proposals around similar ideas for transparency about datasets in AI (§8.2), and proposals for 'algorithmic impact statements' (§8.3).
8.1 Guidelines for reporting medical trials

In medicine, the CONSORT (CONsolidated Standards of Reporting Trials) guidelines were developed by a consortium of journal editors, specialists in clinical trial methodology and others to improve reporting of randomized, controlled trials.17 [Footnote 17: http://www.consort-statement.org/consort-2010, accessed July 12, 2017.] They include a checklist for authors to use to indicate where in their research reports each item is handled and a statement explaining the rationale behind each item (Moher et al., 2010). CONSORT development began in 1993, with the most recent release in 2010. It has been endorsed by 70 medical journals.18 [Footnote 18: http://www.consort-statement.org/about-consort/endorsement-of-consort-statement, accessed July 12, 2017.]

Item 4a, 'Eligibility criteria for participants', is most closely related to the concerns of this paper. Characterizing the population that participated in the study is critical for gauging the extent to which the results of the study are applicable to particular patients a physician is treating (Moher et al., 2010).

The inclusion of this information has also enabled further kinds of research. For example, Mbuagbaw et al. (2017) argue that careful attention to and publication of demographic data that may correlate with health inequities can facilitate further work through meta-analyses. In particular, individual studies usually lack the statistical power to do the kind of sub-analyses required to check for health inequities, and failing to publish demographic information precludes its use in the kind of aggregated meta-analyses that could have sufficient statistical power. This echoes the field-level benefits we anticipate for data statements in building out the data catalog in the value scenario in §7.2.

8.2 Converging proposals

At least three other groups are working in parallel on similar proposals regarding bias and AI. Gebru et al. (in prep) propose 'datasheets for datasets', looking at AI more broadly (but including NLP); Chmielinski and colleagues at the MIT Media Lab propose 'dataset nutrition labels';19 [Footnote 19: http://datanutrition.media.mit.edu/, accessed April 2, 2018.] and Yang et al. (2018) describe 'Ranking Facts', a series of widgets that allow a user to explore how attributes influence a ranking. Of these, the datasheets proposal is most similar to ours in including a comparable schema.

The datasheets are inspired by those used in computer hardware to give specifications, limits and appropriate use information for components. There is important overlap in the kinds of information called for in the datasheets schema and our data statement schema: For example, the datasheets schema includes a section on 'Motivation for Dataset Creation', akin to our 'Curation Rationale'. The primary differences stem from the fact that the datasheets proposal is trying to accommodate all types of datasets used to train machine learning systems and, hence, tends toward more general, cross-cutting categories; while we elaborate requirements for linguistic datasets and, hence, provide more specific, NLP-focused categories.
Gebru et al. note, like us, that their proposal is meant as an initial starting point to be elaborated through adoption and application. Having multiple starting points for this discussion will certainly make it more fruitful.

8.3 Algorithmic impact statements

Several groups have called for algorithmic impact statements (Shneiderman, 2016; Diakopoulos, 2016; AI Now Institute, 2018), modeled after environmental impact statements. Of these, AI Now's proposal is perhaps the most developed. All three groups point to the need to clarify information about the data: "Algorithm impact statements would document [. . . ] data quality control for input sources" (Shneiderman, 2016, 13539); "One avenue for transparency here is to communicate the quality of the data, including its accuracy, completeness, and uncertainty, [. . . ] representativeness of a sample for a specific population, and assumptions or other limitations" (Diakopoulos, 2016, 60); "AIAs should cover [. . . ] input and training data" (AI Now Institute, 2018). However, none of these proposals specify how to do so. Data statements fill this critical gap.

9 Recommendations for implementation

Data statements are meant to be something practical and concrete that NLP technologists can adopt as one tool for mitigating potential harms of the technology we develop. For this benefit to come about, data statements must be easily adopted. In addition, practical uptake will require coordinated effort at the level of the field. In this section we briefly consider possible costs to writers and readers of data statements, and then propose strategies for promoting uptake.

The primary cost we see for writers is time: With the required information to hand, writing a data statement should take no more than 2–3 hours (based on our experience with the case studies). However, the time to collect the information will depend on the dataset. The more speakers and annotators that are involved, the more time it may take to collect demographic information. This can be facilitated by planning ahead, before the corpus is collected. Another possible cost is that collecting demographic information may mean that projects previously not submitted to institutional review boards for approval must now be, at least for exempt status. This process itself can take time, but is valuable in its own right. A further cost to writers is space. We propose that data statements, even the short form (60–100 words), be exempt from page limits in conference and journal publications.

As for readers, reviewers have more material to read and dataset (and ultimately system) users need to scrutinize data statements in order to determine which datasets are appropriate for their use case. But this is precisely the point: Data statements make critical information accessible that previously could only be found by users with great effort, if at all. The time invested in scrutinizing data statements prior to dataset adoption is expected to be far less than the time required to diagnose and retrofit an already deployed system should biases be identified.

Turning to uptake in the field, NLP technologists (both researchers and system developers) are key stakeholders of the technology of data statements. Practices that engage these stakeholders in the development and promotion of data statements will both promote uptake and ensure that the ultimate form data statements take are responsive to NLP technologists' needs.
Accordingly, we rec- ommend that one or more professional organiza- tions such as the Association for Computational Linguistics convene a working group on data state- ments. Such a working group would engage in several related sets of activities, which would collectively serve to publicize and cultivate the use of data statements: (i) Best practices A clear first step entails de- veloping best practices for how data statements are produced. This includes: steps to take before collecting a dataset to facilitate writing an infor- mative data statement; heuristics for writing con- cise and effective data statements; how to incorpo- rate material from institutional review board/ethics committee applications into the data statement schema; how to find an appropriate level of de- tail given privacy concerns, especially for small or vulnerable populations; and how to produce data statements for older datasets that predate this prac- tice. In doing this work, it may be helpful to distill best practices from other fields, such as medicine and psychology, especially around collecting de- mographic information. (ii) Training and support materials With best practices in place, the next step is providing train- ing and support materials for the field at large. We see several complementary strategies to undertake: Create a digital template for data statements; run tutorials at conferences; establish a mentoring net- work (see §7.3); and develop an on-line ‘how-to’ guide. (iii) Recommendations for field-level policies There are a number of field-level practices that the working group could explore to support the uptake and successful use of data statements. Funding agencies could require data statements to be in- cluded in data management plans; conferences and journals could not count data statements against page limits (similar to references) and eventually require short form data statements in submissions; conferences and journals could allocate additional space for data statements in publications; finally once data statements have been in use for a few years, a standardized form could be established. 10 Tech policy implications Transparency of datasets and systems is essential for preserving accountability and building more just systems (Kroll et al., 2017). Due process provides a critical case in point. In the United States, for example, due process requires that cit- izens who have been deprived of liberty or prop- erty by the government be afforded the opportu- nity to understand and challenge the government’s decision (Citron, 2008). Without data statements or something similar, governmental decisions that are made or supported by automated systems de- prive citizens of the ability to mount such a chal- lenge, undermining the potential for due process. In addition to challenging any specific decision by any specific system, there is a further concern about building systems that are broadly represen- tative and fair. Here too, data statements have much to contribute. As systems are being built, data statements enable developers and researchers to make informed choices about training sets and to flag potential underrepresented populations who may be overlooked or treated unfairly. Once sys- tems are deployed, data statements enable diag- nosis of systemic unfairness when it is detected in system performance. At a societal level, such transparency is necessary for government and ad- vocacy groups seeking to ensure protections and an inclusive society. 
10 Tech policy implications

Transparency of datasets and systems is essential for preserving accountability and building more just systems (Kroll et al., 2017). Due process provides a critical case in point. In the United States, for example, due process requires that citizens who have been deprived of liberty or property by the government be afforded the opportunity to understand and challenge the government's decision (Citron, 2008). Without data statements or something similar, governmental decisions that are made or supported by automated systems deprive citizens of the ability to mount such a challenge, undermining the potential for due process.

In addition to challenging any specific decision by any specific system, there is a further concern about building systems that are broadly representative and fair. Here too, data statements have much to contribute. As systems are being built, data statements enable developers and researchers to make informed choices about training sets and to flag potentially underrepresented populations who may be overlooked or treated unfairly. Once systems are deployed, data statements enable diagnosis of systemic unfairness when it is detected in system performance. At a societal level, such transparency is necessary for government and advocacy groups seeking to ensure protections and an inclusive society.

If data statements turn out to be as useful as anticipated, then the following implications for standardization and tech policy likely ensue.

Long-Form Data Statements Required in System Documentation. For academia, industry, and government, inclusion of long-form data statements as part of system documentation should be a requirement. As appropriate, inclusion of long-form data statements should also be a requirement for ISO and other certification. Even groups that create datasets they do not share (e.g. the NSA) would be well advised to write internal data statements. Moreover, under certain legal circumstances, such groups may be required to share this information.

Short-Form Data Statements Required for Academic and Other Publication. For academic publication in journals and conferences, inclusion of short-form data statements should be a requirement for publication. As highlighted in §7.3, caution must be exercised to ensure that this requirement does not become a barrier to access for some researchers.

These two recommendations will need to be implemented with care. We have already noted the potential barrier to access. Secrecy concerns may also arise in some situations; for example, some groups may be willing to share datasets but not demographic information, for fear of public relations backlash or to protect the safety of contributors to the dataset. That said, as consumers of datasets or of products trained with them, NLP researchers, developers, and the general public would be well advised to use systems only if there is access to the information we propose should be included in data statements.

11 Conclusion and future work

As researchers and developers working on technology in widespread use, capable of impacting people beyond its direct users, we have an obligation to consider the ethical implications of our work. This will only happen reliably if we find ways to integrate such thought into our regular practice. In this paper, we have put forward one specific, concrete proposal which we believe will help with issues related to exclusion and bias in language technology: the practice of including 'data statements' in all publications and documentation for all NLP systems.

We believe this practice will have beneficial effects immediately and into the future. In the short term, it will foreground how our data does and does not represent the world (and the people our systems will impact). In the long term, it should enable research that specifically addresses issues of bias and exclusion, promote the development of more representative datasets, and make it easier and more normative for researchers to take stakeholder values into consideration as they work. By foregrounding information about the data we work with, we can work toward making sure that the systems we build work for diverse populations and also toward making sure we are not teaching computers about the world based on the world views of a limited subset of people.

Granted, it will take time and experience to develop the skill of writing carefully crafted data statements. However, we see great potential benefits. For the scientific community, researchers will be better able to make precise claims about how results should generalize and to perform more targeted experiments around reproducing results for datasets that differ in specific characteristics.
For industry, we believe that incorporating data statements will encourage the kind of conscientious software development that protects companies' reputations (by avoiding public embarrassment) and makes them more competitive (by creating systems used more fluidly by more people). For the public at large, data statements are one piece of a larger collection of practices that will enable the development of NLP systems that equitably serve the interests of users and indirect stakeholders.

Acknowledgments

We are grateful to the following people for helpful discussion and critical commentary as we developed this paper: the anonymous TACL reviewers, Hannah Almeter, Stephanie Ballard, Chris Curtis, Leon Derczynski, Michael Wayne Goodman, Anna Hoffmann, Bill Howe, Kristen Howell, Dirk Hovy, Jessica Hullman, David Inman, Tadayoshi Kohno, Nick Logler, Mitch Marcus, Angelina McMillan-Major, Rob Munro, Glenn Slayden, Michelle Stamnes, Jevin West, Daisy Yoo, Olga Zamaraeva, and especially Zeerak Waseem and Ryan Calo. We have presented talks based on earlier versions of this paper at New York University (Nov 2017), Columbia University (Nov 2017), University of Washington (Nov 2017), UC San Diego (Feb 2018), Microsoft (Mar 2018), and Macquarie University (July 2018), and we thank the audiences at those talks for useful feedback. Finally, Batya Friedman's contributions to this paper were supported by the UW Tech Policy Lab and National Science Foundation Grant IIS-1302709. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

AI Now Institute. 2018. Algorithmic impact assessments: Toward accountable automation in public agencies. Medium.com, https://medium.com/@AINowInstitute/algorithmic-impact-assessments-toward-accountable-automation-in-public-agencies-bd9856e6fdde, accessed 6 April 2018.

American Psychological Association. 2009. Publication Manual of the American Psychological Association, 6th edition. Author, Washington DC.

Emily M. Bender. 2011. On achieving and evaluating language independence in NLP. Linguistic Issues in Language Technology, 6:1–26.

Douglas Biber. 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press, Cambridge.

Steven Bird and Gary Simons. 2000. White paper on establishing an infrastructure for open language archiving. In Workshop on Web-Based Language Documentation and Description, Philadelphia, PA, pages 12–15.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4349–4357. Curran Associates, Inc.
Nicoletta Calzolari, Valeria Quochi, and Claudia Soria. 2012. The strategic language resource agenda. http://www.flarenet.eu/sites/default/files/FLaReNet_Strategic_Language_Resource_Agenda.pdf, accessed 6 August 2018.

Jack K. Chambers and Peter Trudgill. 1998. Dialectology, second edition. Cambridge University Press.

Danielle Keats Citron. 2008. Technological due process. Washington University Law Review, 85:1249–1313.

TEI Consortium. 2008. TEI P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/guidelines/p5/, accessed 6 August 2018.

Ben Coppin. 2004. Artificial Intelligence Illuminated. Jones & Bartlett Publishers, Sudbury MA.

Alexei Czeskis, Ivayla Dermendjieva, Hussein Yapit, Alan Borning, Batya Friedman, Brian Gill, and Tadayoshi Kohno. 2010. Parenting from the pocket: Value tensions and technical directions for secure and private parent-teen mobile safety. In Proceedings of the Sixth Symposium on Usable Privacy and Security. ACM.

Leon Derczynski, Kalina Bontcheva, and Ian Roberts. 2016. Broad Twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1169–1179. The COLING 2016 Organizing Committee.

Laurence Devillers, Björn Schuller, Emily Mower Provost, Peter Robinson, Joseph Mariani, and Agnes Delaborde, editors. 2016. Proceedings of ETHI-CA2 2016: ETHics in Corpus Collection, Annotation & Application. LREC.

Nicholas Diakopoulos. 2016. Accountability in algorithmic decision making. Communications of the ACM, 59(2):56–62.

Penelope Eckert and John R. Rickford, editors. 2001. Style and Sociolinguistic Variation. Cambridge University Press, Cambridge.

Rod Ellis. 1994. The Study of Second Language Acquisition. Oxford University Press, Oxford.

Susan Ervin-Tripp. 1964. An analysis of the interaction of language, topic, and listener. American Anthropologist, 66(6, Part 2):86–102.

Karën Fort, Gilles Adda, and K. Bretonnel Cohen, editors. 2016. TAL et Ethique, special issue of Traitement Automatique des Langues, volume 57:2.

Batya Friedman. 1997. Introduction. In Batya Friedman, editor, Human Values and the Design of Computer Technology, pages 1–18. Stanford CA, Stanford.

Batya Friedman, David G. Hendry, and Alan Borning. 2017. A survey of value sensitive design methods. Foundations and Trends® in Human–Computer Interaction, 11(2):63–125.

Batya Friedman, Peter H. Kahn, Jr., and Alan Borning. 2006. Value sensitive design and information systems. In Ping Zhang and Dennis F. Galletta, editors, Human–Computer Interaction in Management Information Systems: Foundations, pages 348–372. M. E. Sharpe, Armonk NY.

Batya Friedman and Lisa P. Nathan. 2010. Multi-lifespan information system design: A research initiative for the HCI community. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2243–2246. ACM.

Batya Friedman, Lisa P. Nathan, and Daisy Yoo. 2016. Multi-lifespan information system design in support of transitional justice: Evolving situated design principles for the long(er) term. Interacting with Computers, 29:80–96.

Batya Friedman and Helen Nissenbaum. 1996. Bias in computer systems. ACM Transactions on Information Systems (TOIS), 14(3):330–347.
John Furler, Parker Magin, Marie Pirotta, and Mieke van Driel. 2012. Participant demographics reported in "table 1" of randomised controlled trials: A case of "inverse evidence"? International Journal for Equity in Health, 11.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. in prep. Datasheets for datasets. arXiv:1803.09010v1.

Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 483–488. Association for Computational Linguistics.

Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube, and Hanna Wallach, editors. 2017. Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. Association for Computational Linguistics.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598. Association for Computational Linguistics.

Anna Jørgensen, Dirk Hovy, and Anders Søgaard. 2015. Challenges of studying and processing dialects in social media. In Proceedings of the Workshop on Noisy User-generated Text, pages 9–18. Association for Computational Linguistics.

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 51–57. Association for Computational Linguistics.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430. Association for Computational Linguistics.
Roger Kreuz and Gina Caucci. 2007. Lexical influences on the perception of sarcasm. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 1–4. Association for Computational Linguistics.

Joshua A. Kroll, Joanna Huey, Solon Barocas, Edward W. Felten, Joel R. Reidenberg, David G. Robinson, and Harlan Yu. 2017. Accountable algorithms. University of Pennsylvania Law Review, 165. Fordham Law Legal Studies Research Paper No. 2765268. Available at SSRN: https://ssrn.com/abstract=2765268, accessed 6 August 2018.

William Labov. 1966. The Social Stratification of English in New York City. Center for Applied Linguistics, Washington, DC.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining, volume 5:1 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

Lawrence Mbuagbaw, Theresa Aves, Beverley Shea, Janet Jull, Vivian Welch, Monica Taljaard, Manosila Yoganathan, Regina Greer-Smith, George Wells, and Peter Tugwell. 2017. Considerations and guidance in designing equity-relevant clinical trials. International Journal for Equity in Health, 16(1):93.

David Moher, Sally Hopewell, Kenneth F. Schulz, Victor Montori, Peter C. Gøtzsche, P. J. Devereaux, Diana Elbourne, Matthias Egger, and Douglas G. Altman. 2010. CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomised trials. The BMJ, 340.

Robert Munro. 2015. Languages at ACL this year. Blog post, http://www.junglelightspeed.com/languages-at-acl-this-year/, accessed 22 September 2017.

Robert Munro and Christopher D. Manning. 2010. Subword variation in text message classification. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 510–518. Association for Computational Linguistics.

Lisa P. Nathan, Predrag V. Klasnja, and Batya Friedman. 2007. Value scenarios: A technique for envisioning systemic effects of new technologies. In CHI'07 Extended Abstracts on Human Factors in Computing Systems, pages 2585–2590. ACM.
Lisa P. Nathan, Milli Lake, Nell Carden Grey, Trond Nilsen, Robert F. Utter, Elizabeth J. Utter, Mark Ring, Zoe Kahn, and Batya Friedman. 2011. Multi-lifespan information system design: Investigating a new design approach in Rwanda. In Proceedings of the 2011 iConference, pages 591–597. ACM.

Trond T. Nilsen, Nell Carden Grey, and Batya Friedman. 2012. Public curation of a historic collection: A means for speaking safely in public. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work Companion, pages 277–278. ACM.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 79–86. Association for Computational Linguistics.

Tomáš Ptáček, Ivan Habernal, and Jun Hong. 2014. Sarcasm detection on Czech and English Twitter. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 213–223. Dublin City University and Association for Computational Linguistics.

Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 704–714. Association for Computational Linguistics.

Ben Shneiderman. 2016. Opinion: The dangers of faulty, biased, or malicious algorithms requires independent oversight. Proceedings of the National Academy of Sciences, 113(48):13538–13540.

Rob Speer. 2017. ConceptNet Numberbatch 17.04: better, less-stereotyped word vectors. Blog post, https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/, accessed 6 July 2017.

Rachael Tatman. 2017. Gender and dialect bias in YouTube's automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59. Association for Computational Linguistics.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93. Association for Computational Linguistics.

Ke Yang, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, H. V. Jagadish, and Gerome Miklau. 2018. A nutritional label for rankings. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 1773–1776, New York, NY, USA. ACM.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2941–2951. Association for Computational Linguistics.