Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation
Shahar Levy, Koren Lazar, Gabriel Stanovsky
2021-09-08

Recent works have found evidence of gender bias in models of machine translation and coreference resolution, mostly using synthetic diagnostic datasets. While these quantify bias in a controlled experiment, they often do so on a small scale and consist mostly of artificial, out-of-distribution sentences. In this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in a first large-scale gender bias dataset of 108K diverse real-world English sentences. We manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. We find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. Finally, we show that our dataset lends itself to finetuning a coreference resolution model, finding that it mitigates bias on a held-out set. Our dataset and models are publicly available at www.github.com/SLAB-NLP/BUG. We hope they will spur future research into gender bias evaluation and mitigation techniques in realistic settings.

Gender bias in machine learning occurs when supervised models make predictions based on spurious societal correlations in their training data. This may result in harmful behaviour when it occurs in models deployed in real-world applications (Caliskan et al., 2017; Buolamwini and Gebru, 2018; Bender et al., 2021). Recent work has quantified bias mostly using carefully designed templates, following the Winograd schema (Levesque et al., 2012).

Figure 1: We propose a semi-automatic method to vastly extend synthetic, small diagnostic datasets. We start with the texts of Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018), specifically designed to be challenging for coreference and machine translation (top), extract syntactic patterns focusing on the salient entities in the artificial sentences (middle), and query real-world datasets for matching texts, using SPIKE (Shlain et al., 2020). The result is a large collection of diverse real-world texts exhibiting similar challenging properties, which lends itself to both finetuning and testing (bottom).

Zhao et al. (2018) and Rudinger et al. (2018) probed for gender bias in coreference resolution with templates portraying two human entities and a single pronoun. For example, given the sentence "the doctor asked the nurse to help her because she was busy", models often erroneously cluster "her" with "nurse", rather than with "doctor". Stanovsky et al. (2019) used the same data to evaluate gender bias in machine translation. When translating this sentence to a language with grammatical gender, models tend to inflect nouns based on stereotypes, e.g., in Spanish, preferring the masculine inflection over the correct feminine inflection ("doctor-a"). While these experiments are useful for quantifying gender bias in a controlled environment, we identify two shortcomings with this approach.
First, the artificially-constructed texts diverge from the natural language training distribution, which may inadvertently cause models to fall back on prior distributions when faced with such unseen constructions. Second, the small-scale templated data does not lend itself to training or finetuning to mitigate gender bias, limiting these datasets to diagnostic purposes.

In this work, outlined in Figure 1, we address both of these limitations by creating BUG, a large-scale dataset of 108K sentences, sampled semi-automatically from large corpora using lexical-syntactic pattern matching (see Figure 2 for examples). To construct BUG, we devise 14 diverse syntactic patterns, matching a wide range of sentences and ensuring that each mentions a human entity and a pronoun referring to it. Next, we use the SPIKE engine (Shlain et al., 2020; spike.apps.allenai.org) to retrieve matching sentences from three diverse domains: Wikipedia, Covid19 research, and PubMed abstracts. Finally, we filter the resulting sentences and mark each as either stereotypical or anti-stereotypical with respect to gender role assignments. The result is a large corpus which is diverse, challenging, and accurate.

We use BUG to conduct a first large-scale evaluation of gender bias on real-world texts. We find that popular machine translation and coreference models struggle with feminine entities and anti-stereotypical assignments. Furthermore, BUG enables us to identify novel insights, for example, that machine translation models tend to be more biased when there are many pronouns in the input sentence. Finally, we show that BUG can also help in mitigating gender bias. We finetune a state-of-the-art coreference resolution model on the anti-stereotypical portion of BUG and achieve a 50% error reduction on a held-out test set, at the cost of only a modest drop in overall accuracy.

To conclude, our main contributions are:
• We present BUG, a first publicly available large-scale corpus for gender bias evaluation which consists of diverse, real-world sentences.
• We evaluate gender bias at large scale on natural sentences, leading to novel insights in machine translation and coreference resolution.
• We use BUG to finetune a coreference resolution model, showing that the resulting model is less prone to making gender-biased predictions.

In this section, we present BUG, a semi-automatic collection of natural, "in the wild" English sentences which are challenging with respect to societal gender-role assignments. Similarly to some of the synthetic gender bias datasets (Zhao et al., 2018; Rudinger et al., 2018), we are looking for sentences with a human entity, identified by their profession (e.g., "cop", "dancer"), and a gendered pronoun (e.g., "he", "she"). For example, see the first sentence in Figure 2, where the cop co-refers with a feminine pronoun ("she"), while the judge in the last sentence in Figure 2 co-refers with a masculine pronoun ("his"). As opposed to previous work, we are interested in naturally occurring sentences, rather than generating artificial sentences from fixed lexical templates. The process for achieving this is outlined in Figure 1 and elaborated below. First, we perform a syntactic search for sentences with challenging syntactic properties over corpora from three domains (Section 2.1). We then filter the sentences to verify that they contain at least one entity and a corresponding pronoun (Section 2.2). Finally, we manually assess BUG, finding it to be 85% accurate (Section 2.3).
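To make the pattern-matching step concrete, the sketch below shows how one of the lexical-syntactic patterns described in the next section (a profession noun modified by a relative clause whose verb takes a gendered reflexive pronoun as direct object) could be approximated with spaCy's DependencyMatcher. This is an illustrative stand-in only: the dataset itself was collected with SPIKE over pre-indexed corpora, and the spaCy English dependency labels ("relcl", "dobj") differ slightly from the UD-style labels ("acl:relcl") used in the paper. Whether a given sentence matches also depends on the parser's output.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# A noun heading a relative clause whose verb has a gendered reflexive
# pronoun as its direct object, e.g. "the officer who distinguished himself ..."
pattern = [
    {"RIGHT_ID": "entity", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {"LEFT_ID": "entity", "REL_OP": ">", "RIGHT_ID": "relcl_verb",
     "RIGHT_ATTRS": {"DEP": "relcl"}},
    {"LEFT_ID": "relcl_verb", "REL_OP": ">", "RIGHT_ID": "reflexive",
     "RIGHT_ATTRS": {"DEP": "dobj",
                     "LOWER": {"IN": ["himself", "herself"]}}},
]
matcher.add("entity_relcl_reflexive", [pattern])

doc = nlp("The officer who distinguished himself in battle received a medal.")
for _, (entity_i, verb_i, pron_i) in matcher(doc):
    # Print the matched (profession, verb, reflexive pronoun) triple.
    print(doc[entity_i].text, doc[verb_i].text, doc[pron_i].text)
```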
We devised 14 lexical-syntactic patterns, exemplified in Figure 2, to construct BUG. All our patterns have two anchors, a pronoun and a profession, which the pattern indicates are coreferring. For example, the last pattern in the figure links a noun (e.g., "officer") with a relative clause relation ("acl:relcl") to a verb (e.g., "distinguished") modified by a direct object ("dobj") gendered reflexive pronoun ("himself" or "herself"). These patterns were constructed by examining and expanding the sentences in the synthetic coreference corpora (Rudinger et al., 2018; Zhao et al., 2018).

To match these 14 patterns against real-world texts, we used SPIKE (Shlain et al., 2020), which indexes large-scale corpora and retrieves matching instances given a lexical-syntactic pattern. We queried corpora from three domains: Wikipedia, PubMed abstracts, and Covid19 research papers (Wang et al., 2020). The examples in Figure 2 highlight the diversity of the approach: while they all adhere to one of the predefined patterns, they vary widely in vocabulary and in syntactic construction, often introducing complex phenomena such as coordination or adverbial phrases.

Following the lexical-syntactic querying, we filter BUG to make sure it contains human entities, and mark each instance as either stereotypical (bottom three examples in Figure 2), neutral (middle example), or anti-stereotypical (top three examples). This enables us to use BUG to measure gender bias in machine translation and coreference resolution models (Section 4). We filter out two types of nouns: (1) nouns which do not refer to a person (e.g., "COVID-19"); and (2) gendered English nouns (e.g., "princess", "father", or "sister"). To address both of these issues, we filtered the results with a predefined list of 183 professions, taken from the U.S. census. Next, to mark each instance as either stereotypical or anti-stereotypical, we follow Zhao et al. (2018) and Rudinger et al. (2018) and use the gender distribution per occupation from the 2015 United States census. For instance, the first example in Figure 2 is marked anti-stereotypical since "cop" is a predominantly male profession (76% in the census) and the referring pronoun is feminine.

We estimate the accuracy of BUG by randomly sampling 1,700 sentences from BUG, sampling uniformly across the data as well as from every pattern and domain. Seventeen human annotators proficient in English were asked to decide whether the gender BUG assigned to the entity matches their understanding of the sentence. The complete annotation guideline is presented in the Appendix. Overall, we found that 85% of the instances were marked correct. We publish these annotations as a separate resource of diverse sentences with gold annotations (dubbed Gold BUG).

Table 1: Error analysis of the human-validated sample.
| Error type | Example | Explanation |
| | A physician who respects her autonomy should respect Ann's right to make this decision. | Noun selection affects the coreference decision, e.g., replacing "autonomy" with "job" would lead to a correct annotation. |
| Ambiguous (23%) | Hiei's captain ordered her crew to abandon ship after further damage. | The antecedent is ambiguous (either the captain or Hiei). |
| Non-gendered pronoun (7%) | The IPP is a portfolio in which the student reflects on his/her learning and development during the production. | Refers to both masculine and feminine pronouns. |
| | We remove the comments, but this person keeps putting them back up, things like "he says he never met that woman". | Quoted pronoun which does not refer to an entity in the sentence. |
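As an illustration of the filtering and labeling step described above, the sketch below shows one way census-based stereotype labels could be assigned to an (entity, pronoun) pair. The profession-to-percentage mapping and the neutrality threshold are placeholders (only the 76% figure for "cop" comes from the text); the paper's actual list contains 183 professions from the U.S. census.

```python
# Hypothetical sketch of the stereotype labeling described above.
# census_pct_male maps a profession to the share of male workers;
# all values except "cop" are made up for illustration.
census_pct_male = {"cop": 0.76, "nurse": 0.10, "engineer": 0.85}

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def stereotype_label(profession: str, pronoun: str) -> str:
    """Label a profession/pronoun pair as stereotypical, anti-stereotypical,
    or neutral, based on the majority gender of the profession."""
    pct_male = census_pct_male.get(profession.lower())
    if pct_male is None:
        return "unknown"          # profession not in the predefined list
    if pronoun.lower() not in MASCULINE | FEMININE:
        return "unknown"          # non-gendered pronoun
    if abs(pct_male - 0.5) < 0.05:
        return "neutral"          # roughly balanced occupation (threshold assumed)
    majority_masculine = pct_male > 0.5
    pronoun_masculine = pronoun.lower() in MASCULINE
    return ("stereotypical" if majority_masculine == pronoun_masculine
            else "anti-stereotypical")

print(stereotype_label("cop", "she"))  # prints "anti-stereotypical"
```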
The collection described in the previous section resulted in 108K sentences and 1,700 human annotations. Below, we analyze key characteristics of BUG, finding it to be lexically diverse and an order of magnitude larger than previous gender bias corpora.

The error analysis in Table 1 reveals that the most common errors are due to constructions where the syntactic patterns are ambiguous with respect to coreference. For instance, in the first example in Table 1, replacing "autonomy" with "job" changes the antecedent from the physician to the patient. Future work may address this by refining our lexical-syntactic patterns to also include verb selection information. Other types of errors were less frequent and included cases where two pronouns were used as a single gender-neutral word ("he/she"), and where the pronoun was part of a named entity or reported speech. In addition, we test agreement between two annotators on a subset of 200 randomly selected sentences. We found a high level of agreement (95.5%; κ = 0.73). Disagreements mostly occur on ambiguous sentences, such as "On the night of 17 August, Charlotte reported that the child had been taken from her tent by a dingo.", where one annotator read "her" as referring to the child, while the other thought that the pronoun refers to Charlotte.

BUG statistics are presented in Table 2 in comparison with other datasets for gender bias. BUG is more than 24 times larger than the GAP coreference challenge set (Webster et al., 2018) and more than 30 times larger than WinoMT (Winogender and WinoBias combined) (Stanovsky et al., 2019). BUG consists of 110,544 unique words, while the WinoMT vocabulary size is 1,868 and GAP's vocabulary size is 31,834. BUG is more diverse and naturally distributed, as can be seen in the histogram of sentence lengths depicted in Figure 3. Furthermore, the mean distance (in words) between entity and pronoun does not differ significantly between the stereotypical (6.4 ± 4.5) and anti-stereotypical (6.3 ± 4.6) partitions, thus alleviating recent concerns about such artifacts in diagnostic datasets (Kocijan et al., 2021).

Our sentences were sampled from three corpora indexed in SPIKE. The majority were drawn from Wikipedia. Relative to the size of the original corpora, the yield from Wikipedia is 6 times higher than from PubMed and 4 times higher than from the Covid19 research domain. This is possibly because Wikipedia lends itself more to discussion of different entities in different settings. As expected, since BUG is sampled from real texts, most of the data is stereotypical and most entities are male. There are three times more sentences with masculine pronouns than with feminine pronouns, as shown in Figure 4; there are twice as many sentences with typically-male professions as with typically-female professions; and twice as many sentences classified as stereotypical as anti-stereotypical. The natural texts also present a more challenging coreference setting, as evident in Figure 5 from the large number of instances (35% of the corpus) with more than one pronoun.

Figure 5: The distribution of the number of pronouns in our corpus. 35% (41K) of the sentences have more than one pronoun, further complicating the coreference resolution task.
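Statistics of this kind can be recomputed from the released data with a few lines of pandas. The sketch below assumes a tab-separated file with columns named sentence, gender, stereotype, and distance; the file name and column names are assumptions for illustration and may differ from the released BUG layout.

```python
import pandas as pd

# Hypothetical file name and column names; adjust to the released BUG layout.
df = pd.read_csv("bug_full.tsv", sep="\t")

# Masculine vs. feminine entities (reported ratio is roughly 3:1).
print(df["gender"].value_counts(normalize=True))

# Stereotypical vs. anti-stereotypical vs. neutral instances.
print(df["stereotype"].value_counts())

# Mean (and std) token distance between entity and pronoun per partition.
print(df.groupby("stereotype")["distance"].agg(["mean", "std"]))

# Share of sentences with more than one gendered pronoun (reported as 35%).
pronouns = {"he", "him", "his", "himself", "she", "her", "hers", "herself"}
n_pronouns = df["sentence"].str.lower().str.split().apply(
    lambda toks: sum(t.strip('.,;:"') in pronouns for t in toks))
print((n_pronouns > 1).mean())
```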
To allow for more controlled evaluations, we publish two subsets of BUG. Gold BUG consists of the gold-quality human-validated samples, while Balanced BUG is randomly sampled from BUG to ensure balance between male and female entities and between stereotypical and non-stereotypical gender role assignments. We report statistics for both of these subsets in Table 2.

We evaluate the performance of machine translation and coreference resolution models on BUG, using the metrics and tools established in previous work (Rudinger et al., 2018; Zhao et al., 2018; Stanovsky et al., 2019). To the best of our knowledge, this is the first quantitative evaluation of gender bias in such systems at a large scale using naturally occurring sentences. Such inputs better resemble real-world use, where biases can affect many users.

Table 3: Results for machine translation gender bias evaluation across 8 diverse target languages on the BUG dataset. Acc represents the overall accuracy (F1) of gender translation. ∆G is the difference in accuracy between masculine and feminine entities. ∆S is the difference in performance between stereotypical and anti-stereotypical gender role assignments. Positive ∆G and ∆S values indicate that the translations are gender biased.

Machine translation. We used EasyNMT (https://github.com/UKPLab/EasyNMT) to evaluate three machine translation models: mBART50_m2m (Tang et al., 2020), m2m_100_418M (Fan et al., 2020), and Opus-MT (Tiedemann and Thottingal, 2020), representing the state of the art among publicly available neural machine translation models. We translated BUG from English into a set of eight diverse target languages with grammatical gender: Arabic, Czech, German, Spanish, Hebrew, Italian, Russian, and Ukrainian, using tools developed in previous work to infer the translated gender based on morphological inflections (Stanovsky et al., 2019; Kocmi et al., 2020), as implemented in github.com/gabrielStanovsky/mt_gender.

Coreference resolution. We use the AllenNLP (Gardner et al., 2018) implementation of SpanBERT (Joshi et al., 2020). SpanBERT introduces contextual span representations into the e2e-coreference model (Lee et al., 2018; Joshi et al., 2019) to achieve state-of-the-art results on the English portion of the popular CoNLL-2012 shared task coreference benchmark (Pradhan et al., 2012).

For each tested model we compute three metrics, following Zhao et al. (2018) and Stanovsky et al. (2019), while adapting the terminology suggested recently by Mehrabi et al. (2021).

Accuracy: denotes the F1 score of the gender prediction. For machine translation, this indicates the percentage of instances in which a correct grammatical gender inflection was produced in the target language, for example, translating a female doctor as doctor-a in Spanish. For coreference resolution, accuracy refers to the portion of instances where the entity's antecedent is correctly clustered with its pronoun, e.g., a female doctor clustered with the feminine pronoun "her".

Population bias (∆G): denotes the difference in accuracy (F1 score) between sentences with entities which co-refer with a masculine pronoun and those with entities which co-refer with a feminine pronoun. By definition, −100 ≤ ∆G ≤ 100. When ∆G > 0, the model tends to perform better when the input entities co-refer with masculine pronouns, and conversely, when ∆G < 0 it performs better when they co-refer with feminine ones.

Historical bias (∆S): denotes the difference in accuracy (F1 score) between stereotypical sentences and anti-stereotypical sentences. Similarly to population bias, ∆S ∈ [−100, 100], and positive values indicate that the model performs better on stereotypical gender role assignments.
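The sketch below illustrates how these three metrics can be computed from per-instance evaluation records. It uses plain accuracy for readability, whereas the paper reports F1 of the gender prediction; the field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    gold_gender: str        # "male" or "female" (gender of the coreferring pronoun)
    predicted_gender: str   # gender produced by the MT or coreference system
    stereotypical: bool     # True if the gender-role assignment is stereotypical

def accuracy(instances: List[Instance]) -> float:
    correct = sum(i.gold_gender == i.predicted_gender for i in instances)
    return 100.0 * correct / len(instances)

def delta_g(instances: List[Instance]) -> float:
    """Population bias: accuracy on masculine minus accuracy on feminine entities."""
    male = [i for i in instances if i.gold_gender == "male"]
    female = [i for i in instances if i.gold_gender == "female"]
    return accuracy(male) - accuracy(female)

def delta_s(instances: List[Instance]) -> float:
    """Historical bias: accuracy on stereotypical minus anti-stereotypical instances."""
    stereo = [i for i in instances if i.stereotypical]
    anti = [i for i in instances if not i.stereotypical]
    return accuracy(stereo) - accuracy(anti)
```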
The results for gender bias in machine translation are presented in Table 3, and the results for coreference resolution are presented in the first row of Table 4. We draw several findings and observations based on these results and additional analyses.

All tested models for machine translation and coreference resolution are prone to gender bias on real-world texts. Both ∆G and ∆S are larger than zero across all settings, indicating that all models perform better on entities co-referring with a masculine pronoun and over-rely on gender stereotypes, even when these conflict with the pronouns providing contextual gender indications. To the best of our knowledge, this is the first time this phenomenon has been observed and quantified at a large scale on real-world instances. This is especially important for popular NLP services, such as machine translation and coreference resolution, which are in common use in many downstream applications.

Figure 8: Coreference resolution performance as a function of the distance between pronoun and antecedent for the stereotypical (orange) and anti-stereotypical (blue) partitions. The performance on both partitions deteriorates towards random choice the farther apart the two elements are.

Machine translation models do worse on sentences with many pronouns. Figure 7 breaks down ∆S for machine translation as a function of the number of pronouns in the sentence, showing that machine translation models are more prone to fall back on their stereotypes the more pronouns appear in the sentence. This may be due to the increased syntactic complexity of such sentences.

Coreference resolution performance deteriorates towards random choice the longer the distance between pronoun and antecedent. Figure 8 shows that the larger the distance (in words) between entity and coreferring pronoun, the more SpanBERT's performance deteriorates towards random choice, for both the stereotypical and anti-stereotypical partitions, diminishing the difference in performance between them.

Performance varies across domains. We compare gender bias across each of BUG's three domains in Figure 6. m2m_100_418M appears to be the noisiest model in terms of gender bias: its accuracy is the lowest on all languages except Hebrew, and its ∆G is the highest. In contrast, mBART50_m2m is the most accurate of the three models on all languages except Spanish and Hebrew, and its ∆G is the lowest on all languages except Arabic and German. A possible explanation may be the vast difference in the number of training parameters (15B for mBART50_m2m versus 418M for m2m_100_418M). Notably, m2m_100_418M achieves a negative ∆S score on PubMed (Figure 6), indicating that it over-produces anti-stereotypical inflections (e.g., preferring to translate engineers as female). However, the model's low accuracy and high ∆G score on the same corpus may indicate that this is mostly due to noisy translation output, perhaps because of the scientific domain of the input texts in PubMed.

Our findings support previous work. The accuracy of the translations in this evaluation is much higher than that found by Stanovsky et al. (2019) and Zhao et al. (2018) (69.9% on average vs.
47.6%), because of BUG's 3:1 ratio of masculine to feminine entities and 2:1 ratio of stereotypical to anti-stereotypical sentences, representing a distribution which is closer to real-world use cases. However, ∆G and ∆S are relative measures, and their values are similar to those found in previous work, indicating that in fact all tested models were prone to gender bias. In addition, we find that all machine translation models achieve their best performance on Czech as a target language, corroborating the findings of Kocmi et al. (2020), and that Russian and Hebrew have the highest ∆G and ∆S, respectively, again confirming previous findings (Stanovsky et al., 2019). For coreference resolution, SpanBERT's gender bias metric ∆S in Table 4 is better (i.e., smaller) than that of the models reported by Zhao et al. (2018) (6.0 versus 13.5), which again may be due to the increase in the number of parameters.

Table 4: Results for gender bias in coreference resolution. The first row indicates the performance of off-the-shelf SpanBERT on our human-validated annotations (Gold BUG), showing that it tends to perform better when clustering masculine and stereotypical gender role assignments. The second row depicts results after finetuning on the anti-stereotypical portion of BUG, showing a 50% error reduction at the cost of a 1% absolute reduction in accuracy.
| Model | Acc | ∆G | ∆S |
| SpanBERT | 65.1 | 10.2 | 6.0 |
| SpanBERT + anti-stereotypical BUG | 64.1 | 5.8 | 2.9 |

Finally, we show that BUG's size and diverse instances make it amenable to finetuning, which results in more robust models that are less prone to rely on gender stereotypes. In the second row of Table 4 we report the results of finetuning SpanBERT on the anti-stereotypical portion of BUG (consisting of 29.9K instances) and re-evaluating its gender bias metrics on the held-out human-validated instances (Gold BUG, 1,720 instances). The motivation is to overexpose the coreference model to anti-stereotypical gender role assignments, where relying on stereotypes would directly hurt performance. Indeed, this yields a relative error reduction of more than 50% (3% absolute improvement). We note, however, that this comes at the cost of an absolute 1% drop in overall accuracy, which may be an expected side effect of the shift in training set distribution. Future work can explore ways to find better trade-offs between accuracy and reliance on gender stereotypes with the help of BUG.

Several works created synthetic datasets to evaluate gender bias (Kiritchenko and Mohammad, 2018; González et al., 2020; Renduchintala and Williams, 2021), e.g., in the context of coreference (Rudinger et al., 2017; Zhao et al., 2018) and machine translation (Stanovsky et al., 2019; Prates et al., 2019; Kocmi et al., 2020), and some works used synthetic datasets to debias models (Saunders et al., 2020; Zhao et al., 2018). Webster et al. (2018) and Gonen and Webster (2020) collected natural medium-scale (4.4K sentences) datasets from Wikipedia and Reddit, respectively, and used them to evaluate gender bias in models of coreference resolution and machine translation. However, their datasets focused on the difference in performance between masculine and feminine entities (population bias), while in this work we also measure historical bias, the difference in performance between stereotypical and anti-stereotypical gender role assignments. In Section 3, we compare BUG to these datasets, finding it is more diverse and challenging in various respects.
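To make the coreference evaluation described earlier easier to reproduce, the sketch below shows how an off-the-shelf AllenNLP SpanBERT coreference predictor might be run on a BUG-style sentence. The model archive path is a placeholder, and checking whether the pronoun lands in the same cluster as the profession span is left as simple post-processing on the returned clusters.

```python
from allennlp.predictors.predictor import Predictor

# Placeholder path: point this at a released AllenNLP SpanBERT
# coreference model archive (.tar.gz).
MODEL_PATH = "path/to/coref-spanbert-large.tar.gz"

predictor = Predictor.from_path(MODEL_PATH)
sentence = "The doctor asked the nurse to help her because she was busy."
output = predictor.predict(document=sentence)

tokens = output["document"]
for cluster in output["clusters"]:
    # Each cluster is a list of [start, end] token spans (inclusive).
    mentions = [" ".join(tokens[s:e + 1]) for s, e in cluster]
    print(mentions)
```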
We presented BUG, a large-scale corpus of 108K diverse real-world English sentences, collected via semi-automatic grammatical pattern matching. We use BUG to evaluate gender bias in various coreference resolution and machine translation models, finding that models tend to make predictions in accordance with gender stereotypes, even when these conflict with opposite-gendered pronouns in the sentence. Finally, we finetuned a coreference resolution model on BUG, finding that this reduces its gender bias on a held-out set. Our data and code are publicly available at www.github.com/SLAB-NLP/BUG. Future work can extend BUG by including more patterns and by extracting sentences from corpora with gold annotations for machine translation and coreference resolution. This will allow exploration of the effect that exposure to anti-stereotypical examples during finetuning has on gender bias reduction.

References
On the dangers of stochastic parrots: Can language models be too big?
Gender shades: Intersectional accuracy disparities in commercial gender classification
Semantics derived automatically from language corpora contain human-like biases
AllenNLP: A deep semantic natural language processing platform
Automatically identifying gender issues in machine translation using perturbations
Type B reflexivization as an unambiguous testbed for multilingual multi-task gender bias
spaCy: Industrial-strength Natural Language Processing in Python
SpanBERT: Improving pre-training by representing and predicting spans
BERT for coreference resolution: Baselines and analysis
Examining gender and race bias in two hundred sentiment analysis systems
The GAP on GAP: Tackling the problem of differing data distributions in bias-measuring datasets
Gender coreference and bias evaluation at WMT 2020
Higher-order coreference resolution with coarse-to-fine inference
The Winograd schema challenge
Multilingual denoising pre-training for neural machine translation
A survey on bias and fairness in machine learning
Social data: Biases, methodological pitfalls, and ethical boundaries
CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes
Assessing gender bias in machine translation: A case study with Google Translate
Investigating failures of automatic translation in the case of unambiguous gender
Social bias in elicited natural language inferences
Gender bias in coreference resolution
Neural machine translation doesn't translate gender coreference right unless you make it
Syntactic search by example
Evaluating gender bias in machine translation
A framework for understanding sources of harm throughout the machine learning life cycle
Multilingual translation with extensible multilingual pretraining and finetuning
OPUS-MT: Building open translation services for the world
CORD-19: The COVID-19 open research dataset
Mind the GAP: A balanced corpus of gendered ambiguous pronouns
Gender bias in coreference resolution: Evaluation and debiasing methods

We thank Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, and Yoav Goldberg for their help with SPIKE during our experiments, for fruitful discussions, and for their comments on earlier drafts of the paper, and the anonymous reviewers for their helpful comments and feedback. This work was supported in part by a research gift from the Allen Institute for AI.