title: Did You Just Assume My Vector? Detecting Gender Stereotypes in Word Embeddings
authors: Bakarov, Amir
date: 2021-02-20
journal: Recent Trends in Analysis of Images, Social Networks and Texts
doi: 10.1007/978-3-030-71214-3_1

Recent studies have found that supervised machine learning models can capture prejudices and stereotypes from their training data. Our study focuses on the detection of gender stereotypes in word embeddings. We review prior work on the topic and propose a comparative study of existing methods of gender stereotype detection. We evaluate various word embedding models with these methods and conclude that the amount of bias depends neither on the corpus size nor on the training algorithm, and does not correlate with embedding performance on the standard evaluation benchmarks.

Word embeddings (real-valued word representations produced by neural distributional semantic models) are ubiquitous tools in contemporary NLP. They are mostly treated as black boxes, and it is unclear how to evaluate them. Recent studies propose various approaches to their evaluation [1, 2]. We believe, though, that certain properties of word embeddings have not been considered in recent evaluation studies. One of them is the amount of bias in word vector spaces. The concept of bias in machine learning commonly refers to prior information [3], but in this work we use the notion of bias from ethics studies: regularities in training data relating to prejudices or stereotypes about a person's characteristics (race, gender, orientation, etc.). Training data historically contains implicit stereotypes and discrimination, so supervised machine learning models unintentionally capture this bias [4]. This can lead a model to make unfair decisions for people with certain characteristics, for instance, when deciding whether to approve a personal loan [5].

It is usually hard to track such stereotypes in the training data of supervised models. For instance, corpus statistics can contain bias that does not even exist in the minds of language speakers: the collocation black sheep can be more frequent than the collocation white sheep in a corpus, even though in the real world black sheep are rarer than white ones [6]. Word embeddings frequently capture such non-obvious statistics in the data. These correlations inevitably appear in the resulting models, so the word vectors produce unfair distances between words characterizing men and women. For example, a situation in which the word researcher is closer to the word man than to woman can be considered unfair, since there is no reason why women should be less related to research than men. Such situations make systems based on word embeddings "biased": for instance, in a search engine the query "researchers" may return results more related to men than to women, making it even harder for women to become recognized as researchers. This problem can be addressed from the corpus side as well as from the model side, but the main issue is how to evaluate whether a model contains such bias (and how much of it).
The current paper aims to survey recent advances in the detection of gender stereotypes and to propose a comparative analysis of existing gender bias detection methods, investigating whether the amount of bias depends on external factors (corpus size and training algorithm) or on embedding performance on the traditional evaluation benchmarks (word similarity and analogy reasoning). The contribution of our work is to systematize recent studies related to gender bias in word embeddings, to survey existing metrics for gender bias detection, and to compare them with each other. We are the first to propose a comprehensive comparison of gender bias detection metrics for word embeddings.

This work is organized as follows. In Sect. 2 we survey recent work related to the problem of gender bias in word embeddings and describe existing methods for detecting gender bias. In Sect. 3 we describe our evaluation experiments and discuss the obtained results, while Sect. 4 concludes the paper and outlines our plans for future work on the subject.

In the field of NLP, the problem of bias (and, particularly, gender bias) has been studied for various tasks, such as text classification [7], machine translation [8], and recommendation systems [9]. We refer the interested reader to a comprehensive review of bias in NLP [10] and narrow the scope of the current paper to gender bias in word embeddings. For word embeddings, the problem of bias was first raised in 2015 by a blog post that highlighted the existence of gender unfairness in word vector spaces [11]. Most of the subsequent work proposed algorithms for removing bias from word embeddings; the main challenge was to remove bias without hurting their semantic content (i.e., without hurting their performance on the standard benchmarks or downstream tasks).

The first work addressing the problem of measuring and removing gender bias in word embeddings was that of [12], which proposed two metrics for measuring gender bias and a debiasing method based on a linear transformation of word vector spaces. These metrics were used for evaluation in almost all subsequent studies. However, [13] noted that human-like semantic biases are reflected through ordinary language and proposed another metric based on association tests. [14] suggested another way to generate association tests, proposing that they be assessed by humans instead of being applied automatically. [15] introduced a triple loss function that penalizes the model for incorrect distances between gender-related word pairs. [16] investigated cultural traces of gender bias in word embeddings and concluded that the semantic structure of word vector spaces represents cultural categories rather than biases or distortions. [17] explored semantic shifts from the perspective of bias and also concluded that the origin of such shifts can reflect changes in cultural patterns. [18] developed a technique to trace the origin of bias in embeddings back to the original text. [19] suggested using a linear projection of all words based on vectors derived from common names. [20] proposed a general framework for debiasing word embeddings that considers other types of bias beyond gender. [21] created a neural-network-based encoder-decoder framework to remove gender bias from gender-stereotyped words while preserving gender-related information in feminine and masculine words. [22, 23] pointed out the problem of gender bias in contextualized word embeddings, such as ELMo.
[24] presented a debiasing method for Hindi, while [25] proposed a technique for the German language. The problem of bias in word embeddings was also investigated in [20, 26-31].

To the best of our knowledge, only three fully automatic evaluation methods are available. We do not consider non-automatic methods that require additional human assessment: for example, the "analogies exhibiting stereotypes" score, which relies on generated analogies judged by humans [12], or clustering plots of biased neutral words, which also require further interpretation [26].

Occupational Stereotypes (OS) [12]. This method explores whether the word vectors contain stereotypes connected to occupations by quantifying a neutral word's position on the axis between the vectors of the most representative female and male words (the so-called "gender axis"; it was based on the words "she" and "he" in the original study). Given a set of neutral word vectors X (the authors used occupation words, e.g. "programmer"), the projection onto the gender axis is computed as

$$\mathrm{proj}(x) = \cos\big(x,\ v_{\mathrm{female}} - v_{\mathrm{male}}\big),$$

where x is the vector of the neutral (occupation) word, and v_{\mathrm{female}} and v_{\mathrm{male}} are the vectors of the gender words (e.g. "she" and "he"). The final score is the mean of the absolute projection values over the set of occupation words X:

$$OS = \frac{1}{|X|} \sum_{x \in X} \big|\mathrm{proj}(x)\big|.$$

The lower this score, the less bias the model contains. Notably, to align the direction of this score with the other metrics, we use an inverted version of OS in our experiments.

GD [17]. The method quantifies bias with so-called "gender vectors", each calculated as the average of a set of the most representative words of one gender (e.g. "she", "mother", "woman" for the female one):

$$GD = \frac{1}{|M|} \sum_{x \in M} \big|\cos(x, v_1) - \cos(x, v_2)\big|,$$

where M is the set of neutral word vectors, v_1 is the averaged vector of the first gender's representative words, and v_2 is the averaged vector of the second gender's representative words. The higher the GD score, the more biased the model is.

Word Embedding Association Test (WEAT) [13]. The method uses vector similarity to measure the association of two given sets of neutral target words (e.g. occupation words) with two sets of gender attribute words ("she", "mother", "woman", etc.). The score is computed as the probability that a random equal-size repartition of the target words would produce the observed (or a greater) difference in sample means:

$$p = \Pr_i\big[s(X_i, Y_i, A, B) > s(X, Y, A, B)\big], \qquad s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B),$$

where X and Y are two sets of target words of equal size, (X_i, Y_i) ranges over the equal-size partitions of X ∪ Y, and the statistic s(w, A, B) measures the association of w with the two sets of attribute words A and B:

$$s(w, A, B) = \operatorname*{mean}_{a \in A} \cos(w, a) - \operatorname*{mean}_{b \in B} \cos(w, b).$$

The higher the WEAT score, the more biased the model.

To empirically compare the aforementioned methods, we evaluate them on a set of pre-trained word embeddings. We used Word2Vec (continuous skip-gram) models [32] trained on the British National Corpus (BNC; 160k-word vocabulary), Gigaword (300k), Google News (3M), and Wikipedia (300k), as well as GloVe [33] trained on Gigaword and Wikipedia [34]. For OS we used the dataset of 320 "neutral" occupation words, along with a dataset of 6 male-female word pairs from [12]. Notably, the "gender" pairs set used in our study is a modified version of the original dataset: we dropped all pronouns that were out-of-vocabulary in some of our models and, for the same reason, used the projection onto the "female"-"male" axis instead of the "she"-"he" axis of the original study. These two datasets were also used to compute the GD score. For the WEAT score we used the original dataset proposed by the authors of this metric without any changes [13].
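To make the three metrics above concrete, the following Python sketch implements them on top of a gensim KeyedVectors model. It is a minimal illustration rather than the exact implementation used in our experiments: the word lists and the model path are placeholders, the GD formula follows the cosine-based reading given above (one plausible instantiation), and the WEAT p-value is approximated with sampled permutations instead of enumerating all equal-size partitions.

```python
"""Minimal sketches of the three bias scores discussed above (OS, GD, WEAT).

Assumptions not taken from the paper: the word lists and the model path are
illustrative placeholders, cosine similarity is used throughout, and the WEAT
p-value is approximated with sampled permutations rather than an exhaustive
enumeration of all equal-size partitions of X and Y.
"""
import random

import numpy as np
from gensim.models import KeyedVectors


def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def os_score(kv, neutral_words, female_word="female", male_word="male"):
    """Occupational Stereotypes: mean absolute projection of neutral words
    onto the gender axis v_female - v_male (here, via cosine)."""
    axis = kv[female_word] - kv[male_word]
    return float(np.mean([abs(cos(kv[w], axis)) for w in neutral_words if w in kv]))


def gd_score(kv, neutral_words, female_words, male_words):
    """GD: mean absolute difference of similarities to the averaged 'gender
    vectors' v_1 and v_2 (one plausible instantiation of the metric)."""
    v1 = np.mean([kv[w] for w in female_words if w in kv], axis=0)
    v2 = np.mean([kv[w] for w in male_words if w in kv], axis=0)
    return float(np.mean([abs(cos(kv[w], v1) - cos(kv[w], v2))
                          for w in neutral_words if w in kv]))


def weat_p_value(kv, X, Y, A, B, n_permutations=10_000, seed=0):
    """WEAT: one-sided permutation p-value for the test statistic
    s(X, Y, A, B) = sum_x s(x, A, B) - sum_y s(y, A, B)."""
    def s(w):
        return (np.mean([cos(kv[w], kv[a]) for a in A])
                - np.mean([cos(kv[w], kv[b]) for b in B]))

    s_vals = {w: s(w) for w in list(X) + list(Y)}
    observed = sum(s_vals[x] for x in X) - sum(s_vals[y] for y in Y)
    pooled, rng, exceed = list(X) + list(Y), random.Random(seed), 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        Xi, Yi = pooled[:len(X)], pooled[len(X):]
        if sum(s_vals[w] for w in Xi) - sum(s_vals[w] for w in Yi) > observed:
            exceed += 1
    return exceed / n_permutations


if __name__ == "__main__":
    # Hypothetical path; any word2vec-format model can be plugged in here.
    kv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
    occupations = ["programmer", "nurse", "doctor", "teacher"]          # illustrative
    female, male = ["she", "woman", "mother"], ["he", "man", "father"]  # illustrative
    print("OS:  ", os_score(kv, occupations))
    print("GD:  ", gd_score(kv, occupations, female, male))
    print("WEAT p-value:",
          weat_p_value(kv, ["engineer", "programmer"], ["nurse", "librarian"],
                       male, female))
```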
Following the experimental setting of [12], we used four standard word embedding evaluation benchmarks: RG [35], RW [36], WS-353 [37], and MSR [38]. The first three measure whether related words have similar embeddings, while the last measures how well the embeddings perform on analogy tasks [39]. The results of our evaluation experiments are presented in Table 1 (best results are highlighted in bold).

Surprisingly, the variation in the GD scores is quite low, unlike the scores for the six other benchmarks. Considering that a "good" model should score well on the standard benchmarks, we note that the amount of bias does not depend on the performance of the model. Hence, even a model that is good from the perspective of the standard evaluation benchmarks can contain prejudices and stereotypes. The results also suggest that, while models tend to generalize better with more data, a model trained on a larger corpus does not necessarily contain more bias: for instance, according to the WEAT score, the lowest bias is found in the models trained on the BNC (the smallest corpus) and Google News (the largest corpus). The amount of bias also does not depend on the training algorithm, and two different methods for measuring bias can produce diverging scores (the OS and WEAT scores have a Spearman rank correlation coefficient of −0.2). The code for reproducing the experiments can be found at https://gitlab.com/bakarov/fair-embeddings.

Our work analyzed recent methods for measuring gender-oriented biases in word embeddings and compared their scores with word embedding performance on the standard evaluation benchmarks. The results show that the amount of bias does not depend on the performance on these benchmarks, nor on the training corpora or training algorithms. The scores of the different gender bias evaluation methods also do not correlate with each other, so we conclude that we currently cannot unequivocally say which word embedding model is more biased, nor infer the amount of bias from the corpus size or the training algorithm. Such shifts in an embedding model can arise for several reasons: first, they are present in the data itself (e.g., collocation frequencies or cultural phenomena), and second, the training method may be sensitive to certain correlations. Therefore, our study supports the idea that we cannot capture these correlations (particularly those related to gender) by analyzing external model factors (e.g. training corpus size), and that we need dedicated gender bias detection methods for this task.

In the future, we suggest extending our research from these methods to the development of new ones, as well as to evaluation experiments in other settings (e.g. other corpora, other training algorithms, or other evaluation benchmarks). Also, in the current study we used a set of pre-trained models, for which we cannot track the dependence of the results on the random initialization of weights during training. We plan to train our own models and measure the stability of the metrics over several training runs of the same model on the same corpus. Finally, we want to test the dependence of the results on the choice of gender-specific words, and plan to extend our experiments to other languages (particularly Russian) by creating datasets for them.
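As a small supplementary sketch of the kind of comparison reported in Sect. 3, the snippet below scores a model on a word-similarity and an analogy benchmark with gensim's built-in evaluators and computes the Spearman rank correlation between two lists of bias scores. The file paths and the score lists are hypothetical placeholders, not the values behind Table 1.

```python
# Sketch of the two analyses from Sect. 3: (1) standard benchmark scores for a
# model, (2) rank correlation between two bias metrics across several models.
# Paths and score lists are placeholders, not the paper's actual data.
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

kv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)  # hypothetical path

# Word-similarity benchmark (e.g. WS-353 as tab-separated "word1 word2 score" lines).
pearson, spearman, oov_ratio = kv.evaluate_word_pairs("ws353.tsv")
print("Word-similarity Spearman:", spearman)

# Analogy benchmark in the Google-analogy question format ("a b c d" lines).
accuracy, _sections = kv.evaluate_word_analogies("analogies.txt")
print("Analogy accuracy:", accuracy)

# Rank correlation between two bias metrics measured on the same set of models.
os_scores = [0.11, 0.14, 0.09, 0.13, 0.12, 0.10]    # placeholder OS scores per model
weat_scores = [0.42, 0.35, 0.51, 0.38, 0.47, 0.40]  # placeholder WEAT scores per model
rho, p_value = spearmanr(os_scores, weat_scores)
print(f"OS vs WEAT Spearman rho = {rho:.2f}")
```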
References

[1] What's in your embedding, and how it predicts task performance
[2] Semantic structure and interpretability of word embeddings
[3] Pattern Recognition and Machine Learning
[4] Unbiased look at dataset bias
[5] Equality of opportunity in supervised learning
[6] Reporting bias and knowledge extraction
[7] It's a man's Wikipedia? Assessing gender inequality in an online encyclopedia
[8] Equalizing gender biases in neural machine translation with word embeddings techniques
[9] Examining the presence of gender bias in customer reviews using word embedding
[10] Language (technology) is power: a critical survey of "bias" in NLP
[11] Rejecting the gender binary: a vector-space operation
[12] Man is to computer programmer as woman is to homemaker? Debiasing word embeddings
[13] Semantics derived automatically from language corpora contain human-like biases
[14] What are the biases in my word embedding?
[15] Learning gender-neutral word embeddings
[16] The geometry of culture: analyzing meaning through word embeddings
[17] Word embeddings quantify 100 years of gender and ethnic stereotypes
[18] Understanding the origins of bias in word embeddings
[19] Attenuating bias in word vectors
[20] A general framework for implicit and explicit debiasing of distributional word vector spaces
[21] Gender-preserving debiasing for pre-trained word embeddings
[22] Unsupervised discovery of gendered language through latent-variable modeling
[23] Extensive study on the underlying gender bias in contextualized word embeddings
[24] Debiasing gender biased Hindi words with word-embedding
[25] Bias in word embeddings
[26] Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them
[27] Neutralizing gender bias in word embedding with latent disentanglement and counterfactual generation
[28] deb2viz: debiasing gender in word embedding data using subspace visualization
[29] Double-hard debias: tailoring word embeddings for gender bias mitigation
[30] Nurse is closer to woman than surgeon? Mitigating gender-biased proximities in word embeddings
[31] Quantifying 60 years of gender bias in biomedical research with word embeddings
[32] Distributed representations of words and phrases and their compositionality
[33] GloVe: global vectors for word representation
[34] Word vectors, reuse, and replicability: towards a community repository of large-text resources
[35] Contextual correlates of synonymy
[36] Better word representations with recursive neural networks for morphology
[37] A study on similarity and relatedness using distributional and wordnet-based approaches
[38] Linguistic regularities in continuous space word representations
[39] A survey of word embeddings evaluation methods

Acknowledgments. The reported study was funded by the Russian Foundation for Basic Research project 20-37-90153 "Development of framework for distributional semantic models evaluation".