title: A Survey on Bias in Visual Datasets
authors: Fabbrizzi, Simone; Papadopoulos, Symeon; Ntoutsi, Eirini; Kompatsiaris, Ioannis
date: 2021-07-16

Computer Vision (CV) has achieved remarkable results, outperforming humans in several tasks. Nonetheless, it may result in major discrimination if not handled with proper care. CV systems highly depend on the data they are fed and can learn and amplify biases within such data. Thus, both the problem of understanding and the problem of discovering biases are of utmost importance. Yet, to date there is no comprehensive survey on bias in visual datasets. To this end, this work aims to: i) describe the biases that can affect visual datasets; ii) review the literature on methods for bias discovery and quantification in visual datasets; iii) discuss existing attempts to collect bias-aware visual datasets. A key conclusion of our study is that the problem of bias discovery and quantification in visual datasets is still open, and there is room for improvement in terms of both methods and the range of biases that can be addressed; moreover, there is no such thing as a bias-free dataset, so scientists and practitioners must become aware of the biases in their datasets and make them explicit. To this end, we propose a checklist that can be used to spot different types of bias during visual dataset collection.

In the fields of Artificial Intelligence (AI), Algorithmic Fairness and (Big) Data Ethics, the term bias has many different meanings: it might refer to a statistically biased estimator, to a systematic error in a prediction, to a disparity among demographic groups, or even to an undesired causal relationship between a protected attribute and another feature. Ntoutsi et al. [64] define bias as "the inclination or prejudice of a decision made by an AI system which is for or against one person or group, especially in a way considered to be unfair", but also identify several ways in which bias is encoded in the data (e.g. via spurious correlations, causal relationships among the variables, and unrepresentative data samples). The aim of this work is to provide the reader with a survey on the latter problem in the context of visual data (i.e. images and videos).

Thanks to Deep Learning technology, Computer Vision (CV) has gained unprecedented momentum and reached performance levels that were unimaginable before. For example, in tasks like object detection, image classification and image segmentation, CV achieves great results and sometimes even outperforms humans (e.g. in object classification tasks). Nonetheless, visual data, which CV relies heavily on for both training and evaluation, remain a challenging data type to analyse. An image encapsulates many features that require human interpretation and contextual knowledge to make sense of: the human subjects, the way they are depicted and their reciprocal position in the image frame, implicit references to culture-specific notions and background knowledge, etc. Even the colouring scheme can be used to convey different messages. Thus, making sense of visual content remains a very complex task. Furthermore, CV has recently drawn attention because of its ethical implications when deployed in a number of application settings, ranging from targeted advertising to law enforcement.
There has been mounting evidence that deploying CV systems without a comprehensive ethical assessment may result in major discrimination against minorities and protected groups. Indeed, facial recognition technologies [68], gender classification algorithms [11], and even autonomous driving systems [84] have been shown to exhibit discriminatory behaviour. While bias in AI systems is a well studied field, research on bias in CV is more limited, despite the widespread analysis of image data in the ML community and the abundance of visual data produced nowadays. Moreover, to the best of our knowledge, there is no comprehensive survey on bias in visual datasets ([78] represents a seminal work in the field, but it is limited to object detection datasets). Hence, the contributions of the present work are: i) to explore and discuss the different types of bias that arise in different visual datasets; ii) to systematically review the work that the CV community has done so far for addressing and measuring bias in visual datasets; and iii) to discuss some attempts to compile bias-aware datasets. We believe this work to be a useful tool for helping scientists and practitioners both to develop new bias-discovery methods and to collect data that are as unbiased as possible. To this end, we propose a checklist that can be used to spot the different types of bias that can enter the data during the collection process (Table 6).

The structure of the survey is as follows. First, we provide some background on bias in AI systems and on the life cycle of visual content, showing how biases can enter at different steps of this cycle (Section 2). Second, we describe in detail the different types of bias that might affect visual datasets (Section 3) and we provide concrete examples of CV applications that are affected by those biases. Third, we systematically review the methods for bias discovery in visual content proposed in the literature and provide a brief summary of each (Section 4). We also outline future streams of research based on our review. Finally, in Section 5, we discuss weaknesses and strengths of some bias-aware visual benchmark datasets.

In this section, we provide some background knowledge on bias in AI systems (Section 2.1) and describe how different types of bias appear during the life cycle of visual content (Section 2.2).

In the field of AI ethics, bias is the prejudice of an automated decision system towards individuals or groups of people on the basis of protected attributes like gender, race, age, etc. [64]. Instances of this prejudice have caused discrimination in many fields, including recidivism scoring [1], online advertisement [75], gender classification [11], and credit scoring [7]. While algorithms may also be responsible for the amplification of pre-existing biases in the training data [8], the quality of the data itself contributes significantly to the development of discriminatory AI applications, such as those mentioned above. Ntoutsi et al. [64] identified two ways in which bias is encoded in the data: correlations and causal influences among the protected attributes and other features; and the lack of representation of protected groups in the data. They also noted that bias can manifest in ways that are specific to the data type. In Section 2.2, and in more detail in Section 3, we explore bias specifically for visual data. Furthermore, it is important to note that defining the concepts of bias and fairness in mathematical terms is not a trivial task.
Indeed, Verma & Rubin [80] provide a survey of more than 20 different measures of algorithmic fairness, many of which are incompatible with each other [13, 49]. This incompatibility (the so-called impossibility theorem [13, 49]) forces scientists and practitioners to choose the measures they use based on their personal beliefs or on other constraints (e.g. business models) regarding what has to be considered fair for the particular problem/domain.

Given the impact of AI, the mitigation of bias is a crucial task. It can be achieved in several different ways, including proactive approaches to bias, mitigation approaches, and retroactive approaches [64]. Our work falls into the first category of proactive bias-aware data collection approaches (Section 5). Bias mitigation approaches can be further categorised into pre-processing, in-processing and post-processing approaches (further details can be found in [64]). Finally, explainability of black-box models [30] is among the most prominent retroactive approaches, especially since the EU introduced the "right to explanations" as part of the General Data Protection Regulation (GDPR) 1 (see also the Association for Computing Machinery's statement on Algorithmic Transparency and Accountability 2). According to this view, it is important to understand why models make certain decisions, both for debugging and improving the models themselves and for providing the final recipients of those decisions with meaningful feedback.

Real world. The journey of visual content, alongside bias, starts even before the actual content is generated. Our world is undeniably shaped by inequalities, and this is reflected in the generation of data in general and, in particular, in the generation of visual content. For example, Zhao et al. [89] found that the dataset MS-COCO [54], a large-scale object detection, segmentation, and captioning dataset which is used as a benchmark in CV competitions, was more likely to associate kitchen objects with women. While both image capturing and dataset collection come at a later stage in the life cycle of Figure 1, it is clear that, in this instance, such bias has its roots in the gender division between productive and reproductive/care labour. Nevertheless, as shown in the following paragraphs, each step of the life cycle of visual content can reproduce or amplify historical discrimination as well as introduce new biases.

Capture. The actual life of a visual content item starts with its capturing. Here the first types of bias can be introduced: selection bias and framing bias. Selection bias is the bias that arises from the selection of subjects. While usually this kind of bias can be observed in datasets, where entire groups can be under-represented or not represented at all, the selection begins with the choices of the photographer/video maker 3. Moreover, the way a photo is composed (the composition, the camera settings, the lighting conditions, etc.) is a powerful way of conveying different messages and thus a possible source of framing bias. Note that the selection of the subjects and the framing of the photo/video are both active choices of the photographer/video maker. Nevertheless, historical discrimination can turn into selection and framing bias if not actively countered. Imagine, for instance, a photographer working on a photo book about care workers: it is likely that the photographer will tend to select more women as subjects, turning historical discrimination into selection bias (in a similar way to what Zhao et al. [89] described for the MS-COCO dataset [54]).
Editing. With the advent of digital photography, image editing and post-processing are now a key step in the content life cycle. Post-processing has become a basic skill for every photographer/video maker, along with skills such as camera configuration, lighting, shooting, etc. Since photo editing tools are extremely powerful, they can give rise to a number of ethical issues: to what extent and in which contexts is it right to digitally modify the visual content of an image or video? What harms could such modifications potentially cause? The discussion around Paul Hansen's award-winning photo "Gaza Burial" 4 represents a practical example of how important these questions are to photo-journalism, in particular regarding the trade-off between effective storytelling and adherence to reality. Nonetheless, photo editing does not only concern journalism. It affects people, and especially women, in several different contexts, from the fashion industry 5 to high-school yearbooks 6. These two examples show how photo editing contributes, on the one hand, to the creation of unrealistic beauty standards and, on the other hand, serves as a means of patriarchal control over women's bodies.

Dissemination. The next step in the life of visual content is dissemination. Nowadays, images and videos are shared via social and mainstream media in such great volumes that nobody can possibly inspect them all. For instance, more than 500 hours of videos are uploaded to YouTube every minute 7. Dissemination of visual content clearly suffers from both selection and framing bias (for a general introduction to framing bias in textual and visual content the reader can refer to Entman [24] and Coleman [15]). The images selected, the medium and channels through which they are disseminated, and the captions or text they are attached to are all elements that can give rise to selection and framing bias (see Peng [66] for a case study of framing bias in the media coverage of the 2016 US presidential election). This is also a step that can lead to discrimination, since it exposes audiences to selected or targeted messages that are conveyed, intentionally or not, by the visual content.

Data collection. The part of the life cycle of visual content that is most relevant to the discussion that follows is the collection of visual datasets. Here we encounter once more selection bias, as the data collection process can exclude or under-represent certain groups from appearing in the dataset, as well as a new type of bias: label bias. Datasets are usually not mere collections of images, and great effort is expended to collect them along with additional information in the form of annotations. As we will see, this process is prone to errors, mislabelling and explicit discrimination (see Miceli et al. [60] for an analysis of power dynamics in the labelling process of visual data in the wild). Since researchers play a great role in the collection of benchmark datasets, we refer to Van Noorden [79] for a survey of ethical questions about the role of research in the field of facial recognition.

Algorithms. Finally, there is the actual step of algorithm training. Fairness and accountability of algorithms is a pressing issue, as algorithm-powered systems are used pervasively in applications and services impacting a growing number of citizens [4, 20].
Important questions arise for the AI and CV communities: how to measure fairness and bias in algorithms, which legal frameworks should be put in place by governments, and what strategies can be used to mitigate bias or to make algorithms explainable. Nevertheless, the journey of visual content does not end with the training of algorithms. Indeed, there are several ways in which biased algorithms can generate vicious feedback loops. For example, biased machine-generated labels and biased image search engines can easily turn into biased data labelling/collection if not handled with proper care. Moreover, the recent explosion in popularity of generative models, namely Generative Adversarial Networks [29], has made the process of media creation very easy and fast (see Mirsky & Lee [62] for a survey on AI-generated visual content). Such AI-generated media can then be reinserted into the content life cycle via the Web and present their own ethical issues.

In this section, we describe in detail those types of bias that pertain to the capture and collection of visual data: selection bias (Section 3.1), framing bias (Section 3.2) and label bias (Section 3.3). Note that a comprehensive analysis of both historical discrimination and algorithmic bias is beyond the scope of this work. The interested reader can refer to Bandy [4] for a survey on methods for auditing algorithmic bias. Our categorisation builds upon the one by Torralba & Efros [78], who organised the types of bias that might arise from large collections of images into four different categories: selection bias, capture bias (which we collapse into the more general concept of framing bias), label bias, and negative set bias (which arises when the negative class, say non-white in a binary feature [white people/non-white people], does not entirely reflect the population of the negative class). While it makes sense to consider it a type of its own in the field of object detection [78], we posit that negative set bias can be considered an example of selection bias, as it is caused by the under- or non-representation of a population in the negative class. Even though our bias categorisation appears on the surface to be very similar to the one in [78], their analysis focused on datasets for object detection, while we contextualise bias in a more general setting and also focus on discrimination against protected groups 8 defined on the basis of protected attributes like gender or race. Since selection, framing and label bias manifest in several different ways, we also go further by describing a sub-categorisation of these three types of bias (Table 1), including several biases commonly encountered in statistics, health studies, or psychology, adapted to the context of visual data. While in the following we describe selection, framing and label bias in general terms, we add references to Table 1 so that the reader can better contextualise the different manifestations of these three types of bias.

Definition. Selection bias is the type of bias that "occurs when individuals or groups in a study differ systematically from the population of interest leading to a systematic error in an association or outcome" 9. More generally, it refers to any "association created as a result of the process by which individuals are selected into the analysis" (Hernán & Robins [34, Chapter 8, pg. 99]) and concerns both experimental and observational studies.
In the former case, flaws in the selection might arise from the different probabilities that two groups have of volunteering for an experiment; in the latter, the sampling procedure may exclude or under-represent certain categories of subjects. In visual datasets, using the first definition would be tricky since, for instance in the case of face detection and recognition, respecting the ethnic composition of the population is generally not enough to ensure good performance across every subgroup, as we will see in the following. Hence, we adopt a slight modification of [34]: we call selection bias any disparity or association created as a result of the process by which subjects are included in a visual dataset.

Description. Torralba & Efros [78] showed that certain kinds of imagery are usually more likely to be selected during the collection of large-scale benchmark datasets, leading to selection bias (see sampling bias, Table 1). For example, in Caltech101 [52] pictures labelled as cars are usually taken from the side, while ImageNet [18] contains more racing cars. Furthermore, a strong selection bias can also be present within datasets. Indeed, Salakhutdinov et al. [70] showed that, unless a great effort is made to keep the distribution uniform during the data collection process, categories in large image datasets follow a long-tail distribution, which means that, for example, people and windows are far more common than ziggurats and coffins [91]. When, within the same category, a certain subcategory is more represented than others, we also have selection bias (for example, the category "Bus" may contain more single-decker buses than double-decker ones [91]). Selection bias becomes particularly worrisome when it concerns humans. Buolamwini & Gebru [11] pointed out that under-representation of people of certain genders and ethnic groups may result in a systematic misclassification of those people (their work concentrates on gender classification algorithms). They also showed that some popular datasets were biased towards lighter-skinned male subjects. For example, Adience [23] (resp. IJB-A [48]) contains 7.4% (resp. 4.4%) darker-skinned females, 6.4% (resp. 16.0%) darker-skinned males, 44.6% (resp. 20.2%) lighter-skinned females and 41.6% (resp. 59.4%) lighter-skinned males. Such imbalances greatly affect the performance of CV tools. For instance, Buolamwini & Gebru [11] showed that the error rate for darker-skinned individuals can be 18 times higher than for lighter-skinned ones in some commercial gender classification algorithms.

Affected applications. In summer 2020, the New York Times published the story of a Black American individual wrongfully arrested due to an error made by a facial recognition algorithm 10. While we do not know whether this exact case was caused by a bias in the training data, we do know that selection bias can lead to different error rates in face recognition. Hence, such technology should be used with much more care, especially in high-impact applications such as law enforcement. Another application that selection bias might impact is autonomous driving. It is very challenging to collect a dataset that describes every possible scene and situation a car might face. The Berkeley Driving Dataset [88], for example, contains driving scenes from only four cities in the US; it is very likely that an autonomous car trained on such a dataset will under-perform in other cities with different visual characteristics.
The effect of selection bias on autonomous driving becomes particularly risky when it affects pedestrian recognition algorithms. Wilson et al. [84] studied the impact of the under-representation of darker-skinned people on the predictive inequity of pedestrian recognition systems. They found evidence that the effect of this selection bias is two-fold: first, such imbalances "beget less statistical certainty", making the process of recognition more difficult; second, standard loss functions tend to prioritise the more represented groups, and hence some kind of mitigation measure is needed in the training procedure [84]. Moreover, Buolamwini & Gebru [11] explained that part of the collection of benchmark face datasets is often done using face detection algorithms. Therefore, any systematic bias in the training of those tools is propagated to other datasets. This is a clear example of algorithmic bias turning into selection bias (see automation bias, Table 1), as described at the end of Section 2.2. Furthermore, image search engines can also contribute to the creation of selection bias (see availability bias, Table 1) due to systematic biases in their retrieval algorithms. For example, Kay et al. [43] studied gender bias in Google's image search engine and, in particular, focused on the representation of men and women in different professions. They found that, in male-dominated occupations, the male dominance in the retrieved images was even stronger, whereas the results for female-dominated careers tended to be more balanced. This shows how data collection processes can easily end up in vicious cycles of bias: biased algorithms give rise to biased datasets, which in turn lead to the training of biased algorithms.

Remarks. Finally, Klare et al. [47] pointed out that, while demographic imbalances undoubtedly have a great impact on some facial recognition algorithms, they do not explain every disparity in the performance of algorithms. For instance, they suggested that a group of subjects (e.g. women) might be more difficult to recognise, even with balanced training data, if it is associated with higher variance (for example, due to hairstyle or make-up).
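The demographic breakdowns reported above can be made explicit with very little tooling. The following minimal sketch, in the spirit of such audits, computes the share of each intersectional group (e.g. gender by skin type) from a dataset's annotations; the record structure and the gender/skin_type field names are hypothetical.

```python
from collections import Counter

def representation_report(records, attrs=("gender", "skin_type")):
    """Share of each intersectional group (e.g. gender x skin type) in the annotations."""
    groups = Counter(tuple(r[a] for a in attrs) for r in records)
    total = sum(groups.values())
    # Least represented groups first, so gaps are easy to spot.
    return dict(sorted(((g, n / total) for g, n in groups.items()), key=lambda kv: kv[1]))

# Hypothetical per-subject annotations of a face dataset.
records = [
    {"gender": "female", "skin_type": "darker"},
    {"gender": "male", "skin_type": "lighter"},
    {"gender": "male", "skin_type": "lighter"},
    {"gender": "female", "skin_type": "lighter"},
]
for group, share in representation_report(records).items():
    print(group, f"{share:.1%}")
```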
Definition. According to the seminal work of Entman [24] on the framing of (textual) communication, "To frame is to select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation and/or treatment recommendation for the item described". Coleman [15] adopted the same definition for the framing of visual content and added that "In visual studies, framing refers to the selection of one view, scene, or angle when making the image, cropping, editing or selecting it". These two definitions highlight how framing bias has two different aspects. First, the technical aspect: framing bias derives from the way an image is captured or edited. Second, the medium: visual content is in all respects a medium, and therefore the way it is composed conveys different messages. Hence, in the following we refer to framing bias as any association or disparity that can be used to convey different messages and/or that can be traced back to the way in which the visual content has been artificially composed. Note that, while the selection process is a powerful tool for framing visual content, we keep the concepts of selection bias and framing bias distinct, as they present their own peculiarities.

Description. An example of visual framing bias (more specifically, capture bias, Table 1) has been studied by Heuer et al. [35]. They analysed images attached to obesity-related articles that appeared on major US online news websites in the time span 2002-2009. They concluded that there was a substantial difference in the way such images depicted obese people with respect to non-overweight ones. For example, 59% of obese people were portrayed headless (6% of non-overweight people) and 52% had only the abdomen portrayed (0% of non-overweight people). This portrayal as "headless stomachs" 11 [35] (see Figure 2b for an example) may have a stigmatising and de-humanising effect on the viewer. In contrast, positive characteristics were more commonly portrayed among non-overweight people. Corradi [16], while analysing the use of female bodies in advertising, talked about "semiotic mutilation" when parts of a woman's body are used to advertise a product with a de-humanising effect, similar to what Heuer et al. described in their work about obesity. The relationship between bodies and faces in image framing is a well-known problem. Indeed, Archer et al. [2] used the ratio between the height of a subject's body and the length of their face to determine whether men's faces were given more prominence than women's. They found that this was true in three different settings: contemporary American news media photographs, contemporary international news media photographs, and portraits and self-portraits from the 17th to the 20th century (interestingly, there was no disparity in earlier artworks). Furthermore, they found some evidence that people tend to draw men with higher facial prominence, and thus that this bias does not only occur in mass media or art. The fact that such a bias can affect the perceived qualities (intelligence, dominance, etc.) of the image's subject is highlighted by Archer et al. [2] and Schwarz & Kurz [71].

Affected applications. An application that can suffer greatly from framing bias (see stereotyping, Table 1) is that of image search engines. For example, Kay et al. [43] found that, when searching for construction workers on Google Image Search, women were more likely to be depicted in an unprofessional or hyper-sexualised way 12. It is not clear, though, whether the retrieval algorithms are responsible for the framing or whether they just index popular pages associated with the queries. Nonetheless, the problem remains, because images are powerful media for conveying messages and incorrect framing in a search engine can contribute to the spread of biased messages. We recall, for instance, the case of the photo of a woman "ostensibly about to remove a man's head" 13 that was retrieved after searching "feminism" on a famous stock photo website. Nevertheless, visual framing can also be used as a positive tool for challenging societal constructions. See, for example, the Italian and French feminist movements in the 70s and their re-framing of female genitalia as a symbol of struggle and liberty [17], or the Body Positivity movement, which aims to challenge the stigma around women's bodies by substituting it with a more positive view of acceptance. These can be thought of as two examples of "semiologic guerrilla" [22].

Remarks. The study of framing bias opens the door to multi-modal approaches to fairness as an effective means of analysis.
For instance, Brennen et al. [10] studied how images, even apparently neutral ones such as the stock image of a cigarette, have been intentionally attached to articles to spread misinformation on social media during the 2020 COVID-19 outbreak. In such a case, the framing of the image depends on the text attached to it and cannot be studied by considering only visual features. Note that a similar issue can lead to direct discrimination (see contextual bias, Table 1), such as in the case of photos of Asian people attached to articles discussing COVID-19 14.

Definition. For supervised learning, labelled data are required. The quality of the labels 15 is of paramount importance for learning, and producing them is a tedious task due to the complexity and volume of today's datasets. Jiang & Nachum [39] define label bias as "the bias that arises when the labels differ systematically from the ground truth", with clear implications for the generalisation capability of the model to future unseen instances. Torralba & Efros [78] highlight bias as a result of the labelling process itself, for reasons like "semantic categories are often poorly defined, and different annotators may assign differing labels to the same type of object". Torralba & Efros's work mainly focused on object detection tasks, and hence by differing label assignments they refer, for example, to "grass" labelled as "lawn" or "picture" as "painting". Nevertheless, this kind of problem arises especially when dealing with human-related features such as race, gender or even beauty. We define label bias as any error in the labelling of visual data with respect to some ground truth, or the use of poorly defined or fuzzy semantic categories.

Description. As already mentioned, a major source of label bias is the poor definition of semantic categories. Race is a particularly clear example of this: according to Barbujani & Colonna [6], "The obvious biological differences among humans allow one to make educated guesses about an unknown person's ancestry, but agreeing on a catalogue of human races has so far proved impossible". Given such an impossibility, racial categorisation in visual datasets must at best come from subjects' own perception of their race or, even worse, from the stereotypical biases of annotators. As an example of how volatile the concept of race can be, we cite an article that appeared in The New York Times 16 describing how the perception of Italian immigrants in the US changed during the 20th century, from being considered people of colour to being considered white, as the result of a socio-political process. From a CV standpoint, then, it would probably be more accurate to use actual visual attributes, if strictly necessary, such as the Fitzpatrick skin type [11], rather than fuzzy categories such as race. Note that, while skin tone can be a more objective trait to use, it does not entirely reflect human diversity. Similarly, the binary categorisation of gender has been criticised. Indeed, as gender identity is a very personal matter, it appears very difficult for a computer scientist to model it: no matter how complicated such a model is, it will not be able to completely capture every possible shade of gender identity. The above discussions cast some uncertainty on categories that, until not many years ago, were considered undeniably binary (in the case of gender) or discrete (in the case of race). Hence, it is important to take into account the fact that the use of such categories might be inherently biased.
Furthermore, the use of fuzzy categories such as race or gender poses major challenges to algorithmic fairness from both the ontological and the operational points of view [32, 42].

14 N. Roy, News outlets criticized for using Chinatown photos in coronavirus articles, NBC News, March 2020. https://www.nbcnews.com/news/asian-america/news-outlets-criticized-using-chinatown-photos-coronavirus-articles-n1150626.
15 Note that by label we mean any tabular information attached to the image data (object classes, measures, protected attributes, etc.).
16 B. Staples, How Italians became White, The NY Times, October 2019. https://www.nytimes.com/interactive/2019/10/12/opinion/columbus-day-italian-american-racism.html.

As mentioned above, Torralba & Efros [78] argued that different annotators can come up with different labels for the same object (a field can be a lawn, and so on). While this mainly applies to the labelling of huge datasets used for object detection, rather than to face datasets where labels are usually binary or discrete, it gives us an important insight about bias in CV in general: annotators' biases and preconceptions are reflected in the datasets. As an example, consider the attempt by Kärkkäinen & Joo [41] to build a race-balanced dataset: the authors asked Amazon Mechanical Turk workers to annotate the faces' race, gender and age group and chose a 2-out-of-3 agreement approach to define the ground truth of their dataset. Sacha Costanza-Chock, in a Twitter thread 17, made some pertinent points about the way gender is labelled in the above-mentioned work: putting aside the fact that gender is considered binary and other ethical aspects, Kärkkäinen & Joo assumed that humans are able to guess gender from photos and that this ability maintains the same success rate across all races, ethnicities, genders, etc. If these assumptions are false, it is clear that the methodology used for the construction of the dataset cannot be bias-free. We analyse different attempts to construct bias-aware datasets in more detail in Section 5.

We now provide two more concrete examples of label bias. First, we cite a study on non-verbal flirting communication that recently appeared in The Journal of Sex Research [31]. In this study, men were asked whether women in previously labelled photographs appeared to be flirtatious, happy or neutral. Those photos were taken by asking female posers to mimic happy, neutral or flirtatious expressions, either based on suggestions inspired by previous studies or spontaneously. One can note that, while labelled as flirtatious, the women in the study were not flirting at all: they were asked to act as if they were. This might seem a subtle difference, but it means there is no guarantee that the judgements of the male participants were based on a real ability to recognise flirting rather than on a stereotypical representation of it (see perception bias, Table 1). Second, a paper by Liang et al. [53] described the construction of a face dataset for assessing facial beauty. Since beauty and attractiveness are the prototypical examples of subjective characteristics, it appears obvious that any attempt at constructing such a dataset will be filled with the personal preconceptions of the volunteers who labelled the images (see observer bias, Table 1). We can also view what has been described so far as a problem of operationalisation of what Jacobs & Wallach [37] called unobservable theoretical constructs (see measurement bias, Table 1).
They proposed a useful framework that serves as a guideline for the mindful use of fuzzy semantic categories and answers the following questions about the validity of the operationalisations (or measurements) of a construct: "Does the operationalization capture all relevant aspects of the construct purported to be measured? Do the measurements look plausible? Do they correlate with other measurements of the same construct? Or do they vary in ways that suggest that the operationalization may be inadvertently capturing aspects of other constructs? Are the measurements predictive of measurements of any relevant observable properties (and other unobservable theoretical constructs) thought to be related to the construct, but not incorporated into the operationalization? Do the measurements support known hypotheses about the construct? What are the consequences of using the measurements[?]".

Affected applications. Since deep learning boosted the popularity of CV, a modern form of physiognomy has gained a certain momentum. Many studies have appeared in the last few years claiming to be able to classify images according to the subjects' criminal attitude, happiness or sexual orientation. A commercial tool has also been released to detect terrorists and pedophiles. While "doomed to fail" for a series of technical reasons well explained by Bowyer et al. [9], these applications rely on a precise ideology that Jake Goldenfein called computational empiricism [28]: an epistemological paradigm that claims, in spite of any scientific evidence, that the true nature of humans can be measured and unveiled by algorithms. The reader can also refer to the famous blog post "Physiognomy's New Clothes" 18 for an introduction to the problem.

Remarks. Just like selection bias, label bias can lead to a vicious cycle: a classification algorithm trained on biased labels will most likely reinforce the original bias when used to label newly collected data (see automation bias, Table 1).

The aim of this section is to understand how researchers have tackled the problem of discovering and quantifying bias in visual datasets, since high-quality visual datasets that are authentic representations of the world 19 are a critical component of fairer and more trustworthy CV systems [36]. To this end, we performed a systematic survey of papers addressing the following problem: given a visual dataset D, is it possible to discover/quantify what types of bias it manifests 20? In particular, we focus on the methodologies and measures used in the bias discovery process, of which we outline the pros and cons. Furthermore, we try to define open issues and possible future directions for research in this field. Note that this problem is critically different from both the problem of finding out whether an algorithm discriminates against a protected group and the problem of mitigating such bias, even though all three problems are closely related and a solution to one of them can sometimes give useful insights into the others. In order to systematise this review, we proceed in the following way to collect the relevant material: first, we outline a set of keywords to be used in three different scientific databases (DBLP, arXiv and Google Scholar) and select only the material relevant to our research question, following a protocol described in the following paragraph; second, we summarise the results of our review and outline pros and cons of the different methods in Section 4.5.
Our review methodology was inspired by the works of Merli et al. [59] and Kofod-Petersen [50]. Our protocol also resembles the one described by Kitchenham [46, Section 5.1.1]. Given our problem, we identified the following relevant keywords: bias, image, dataset, fairness. For each of them we also defined a set of synonyms and antonyms (see Table 3). Note that among the synonyms of the word image we include the words "face" and "facial". This is mainly motivated by the fact that in the title and the abstract of the influential work of Buolamwini & Gebru [11] there are no occurrences of the word "image"; instead, we find many occurrences of the expression "facial analysis dataset". In any case, facial analysis is an important case study for detecting bias in visual content, and thus it makes sense to include it explicitly in our search. The search queries were composed of all possible combinations of the different synonyms (antonyms) of the above keywords (for example, "image dataset bias", "visual dataset discrimination", or "image collection fairness"). The queries resulted, after manual filtering 21, in 17 different relevant papers. The list was further expanded with 6 articles by looking at papers citing the retrieved works (via Google Scholar) and at their "related work" sections. We also added to the list Lopez-Paz et al.'s work on causal signals in images [56], which was not retrieved by the protocol described above. In the following, we review all these 24 papers, dividing them into four categories according to the strategies they use to discover bias:
• Reduction to tabular data: these methods rely on the attributes and labels attached to or extracted from the visual data and try to measure bias as if the data were tabular (e.g. counting the number of male and female subjects).
• Biased image representations: these rely on lower-dimensional representations of the data to discover bias.
• Cross-dataset bias detection: these try to assess bias by comparing different datasets. The idea is to discover whether they carry some sort of "signature" due to the data collection process.
• Other methods: different methods that could not fit any of the above categories.

19 Whether this means having balanced data or statistically representative data is an open debate in the AI community. It really depends on the application, though, as some (such as CV classification algorithms) do need balance to work properly while others might require representative data.
20 Of those described in Section 3.
21 The manual filtering consisted in keeping only those papers that describe a method or a measure for discovering and quantifying bias in visual datasets. In some cases, we also kept works developing methods for algorithmic bias mitigation that could be used to discover bias in datasets as well (see [3]).
Table 1: Sub-categorisation of selection, framing and label bias.
• Sampling bias *: Bias that arises from the sampling of the visual data. It includes class imbalance.
• Negative set bias [78]: When a negative class (say non-white in a white/non-white categorisation) is not representative enough.
• Availability bias †: Distortion arising from the use of the most readily available data (e.g. using search engines).
• Bias that arises as a result of data collection being carried out on a specific digital platform (e.g. Twitter, Instagram, etc.).
• Volunteer bias †: When data is collected in a controlled setting instead of in the wild, the volunteers that participate in the data collection procedure may differ from the general population.
• Bias that arises as a result of the crawling algorithm/system used to collect images from the Web or with the use of an API (e.g. the keywords used to query an API, the seed websites used in a crawler).
• Presence of spurious correlations in the dataset that falsely associate a certain group of subjects with any other features.
• Exclusion bias *: Bias that arises when the data collection partly or completely excludes a certain group of people.
• Chronological bias †: Distortion due to temporal changes in the visual world the data is supposed to represent.
• Geographical bias [72]: Bias due to the geographic provenance of the visual content or of the photographer/video maker (e.g. brides and grooms depicted only in western clothes).
• Capture bias [78]: Bias that arises from the way a picture or video is captured (e.g. objects always in the centre, exposure, etc.).
• Apprehension bias †: Different behaviour of the subjects when they are aware of being photographed/filmed (e.g. smiling).
• Contextual bias [73]: Association between a group of subjects and a specific visual context (e.g. women and men respectively in household and working contexts).
• Stereotyping §: When a group is depicted according to stereotypes (e.g. female nurses vs. male surgeons).
• Measurement bias [37]: Every distortion generated by the operationalisation of an unobservable theoretical construct (e.g. race operationalised as a measure of skin colour).
• Observer bias †: Bias due to the way an annotator records the information.
• Perception bias †: When data is labelled according to the possibly flawed perception of an annotator (e.g. perceived gender or race), or when the annotation protocol is not specific enough or is misinterpreted.
• Automation bias §: Bias that arises when the labelling/data selection process relies excessively on (biased) automated systems.

Table 2: Brief summary of the notation used in Section 4.
• f_D: a model f trained on the dataset D
• w: bold letters denote vectors
• AP: Average Precision-Recall (AP) score
• AP_B(f_D): AP score of the model f_D when tested on the dataset B
• ||·||_2: L2-norm
• |D|: the number of elements in the dataset D
• σ(·): sigmoid function
• 𝟙(·): indicator function
• P(·): probability
• P(·|·): conditional probability
• H(·): Shannon entropy
• H(·|·): conditional Shannon entropy
• I(·,·): mutual information
• D(·): Simpson D score
• ln(·): natural logarithm
• mean(·): arithmetic mean

Table 3: Set of keywords and relative synonyms and antonyms.
• bias (synonyms: discrimination; antonyms: fair, unbiased)
• image (synonyms: visual, face, facial)
• dataset (synonyms: collection)

Reduction to tabular data. Methods in this category transform visual data into tabular format and leverage the multitude of bias detection techniques developed for tabular datasets [64]. The features for the tabular description can be extracted either directly from the images (using, for example, some image recognition tool), or indirectly from some accompanying image description/annotation (e.g. image captions, labels/hashtags), or both. In the majority of cases, feature extraction relies upon automatic processes and is therefore prone to errors and biases (in Figure 1 we described how the use of biased algorithms affects data collection/labelling). This is especially true for the extraction of protected attributes, like gender, race and age, as such systems have been found to discriminate against certain groups. As an example, many facial recognition systems have been found to be biased against certain ethnic groups 22.
Therefore, whatever biases exist in the original images (selection, framing, label) might not only be reflected but even amplified in the tabular representation due to the bias-prone feature extraction process. Below we overview different approaches under this category that extract protected attributes from visual datasets and evaluate whether bias exists. As already explained, the sources of bias in such a case are not limited to bias in the original images: bias may also exist due to the labelling process and the automatic feature extraction. The impact of such additional sources on the results is typically omitted.

The authors of [21] proposed a simple method for auditing ImageNet [18] with respect to gender and age biases. They applied a face detection algorithm to two different subsets of ImageNet (the training set of ILSVRC [69] and the person category of ImageNet [18]), and then applied an age recognition model and a gender recognition model to the detected faces. They then computed the dataset distribution across the age and gender categories, finding a prevalence of men (58.48%) and a very small share of people (1.71%) in the >60 age group. Computing the percentage of men and women across every category gave them the opportunity to find the most imbalanced classes in the two subsets of ImageNet. For example, in the ILSVRC subset, 89.33% of the images in the category bulletproof vest were labelled as men and 82.81% of the images in the category lipstick were labelled as women. Therefore, given a suitable labelling of the dataset, this method gives information not only on selection bias but also on the framing of the protected attributes. The authors noted that this method relies on the assumption that the gender and age recognition models involved are not biased. Such an assumption is violated by the gender recognition model, as the authors pointed out, and therefore the analysis cannot be totally reliable.

Yang et al. [86] performed another analysis of the person category of ImageNet [18], trying to address both selection and label bias. They addressed label bias by asking annotators, first, to flag labels that could be offensive or sensitive (e.g. sexual/racial slurs), and, second, to point out labels that do not refer to visual categories (e.g. it is difficult to tell whether an image depicts a philanthropist). They removed such categories and continued their analysis by asking annotators to further label the images according to some categories of interest (gender, age and skin colour), to understand whether the remaining data were balanced with respect to those categories and thus to address selection bias. This demographic analysis showed that women and dark-skinned people were both under-represented in the remaining non-offensive/sensitive and visual categories. Moreover, despite the overall under-representation, some categories were found to align with stereotypes (e.g. 66.4% of the people in the category rapper were dark-skinned). Hence, they also potentially addressed some framing bias. The annotation process was validated by measuring the annotators' agreement on a small controlled set of images.
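A minimal sketch of this kind of per-category audit is given below, assuming each image comes with a category label and a (possibly automatically extracted) protected-attribute annotation; the field names and records are hypothetical.

```python
from collections import defaultdict, Counter

def per_category_skew(annotations, protected="gender"):
    """For each category, compute the distribution of a protected attribute
    and the share of its most frequent value (1.0 = fully imbalanced)."""
    per_cat = defaultdict(Counter)
    for ann in annotations:
        per_cat[ann["category"]][ann[protected]] += 1
    skew = {}
    for cat, counts in per_cat.items():
        total = sum(counts.values())
        skew[cat] = {
            "distribution": {v: n / total for v, n in counts.items()},
            "majority_share": max(counts.values()) / total,
        }
    # Most imbalanced categories first.
    return dict(sorted(skew.items(), key=lambda kv: -kv[1]["majority_share"]))

# Hypothetical annotations (one record per image).
annotations = [
    {"category": "lipstick", "gender": "female"},
    {"category": "lipstick", "gender": "female"},
    {"category": "bulletproof vest", "gender": "male"},
    {"category": "bulletproof vest", "gender": "female"},
]
print(per_category_skew(annotations))
```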
Zhao et al. [89] measured the correlation between a protected attribute and the occurrences of various objects/actions. They assume a dataset labelled with a protected attribute G, and let c(o, g) count the co-occurrences of the object/scene o with the protected attribute value g. If the bias score b(o, g) = c(o, g) / Σ_{g'} c(o, g') is greater than 1/n, where n is the number of values that the protected attribute can assume, the attribute value G = g is positively correlated with the object/action O = o.

Shankar et al. [72] instead used a simple count to assess geographic bias in ImageNet [69] and Open Images [51]. They found, for example, that the great majority of images whose geographic provenance is known come from the USA or Western European countries, resulting in highly imbalanced data. Such a geography-based analysis has been expanded by Wang et al. [82]. While having balanced data is important for many applications, a mere count is usually not enough for assessing every type of bias, as a balanced dataset could still contain spurious correlations, framing bias, label bias and so on. Buolamwini & Gebru [11] constructed a benchmark dataset for gender classification. For testing discrimination in gender classification models, their dataset is balanced according to the distribution of both gender and Fitzpatrick skin type, as they noticed that the error rates of classification models tended to be higher at the intersection of those categories (e.g. black women) because of the use of imbalanced training data. Hence, while they quantify bias by simply counting the instances with certain protected attributes, the novelty of their work is that they took multiple protected attributes into account at a time.

Information theoretical. Merler et al. [58] introduced four measures to construct a balanced dataset of faces: two measures of diversity, the Shannon entropy H(X) = −Σ_{i=1}^{n} P(X = x_i) · ln(P(X = x_i)) and the Simpson index D(X) = 1 / Σ_{i=1}^{n} P(X = x_i)^2, where P(X = x_i) is the probability of an image having value x_i for the attribute X ∈ {x_i}_{i=1}^{n}; and two measures of evenness, E_Shannon = H(X) / ln(n) and E_Simpson = D(X) / n. Such measures have been applied to a set of facial attributes, ranging from craniofacial distances to gender and skin colour, computed both via automated systems and with the help of human annotators. Panda et al. [65] also proposed to use (conditional) Shannon entropy for discovering framing bias in emotion recognition datasets. Using a pre-trained model, they computed the top occurring objects/scenes in the dataset and then computed the conditional entropy of each object across the positive and negative sets of the emotions, to see whether some objects/scenes were more likely to be related to a certain emotion. For example, they found that objects like balloons or candy stores are only present in the negative set of sadness in Deep Emotion [87]. Given an object c and an emotion E = e ∈ {0, 1} (where e = 1 represents, for instance, sadness and e = 0 represents the negative set non-sadness), they computed the conditional entropy H(E | c) = −Σ_{e∈{0,1}} P(E = e | c) · ln(P(E = e | c)). When the conditional entropy of an object is zero, it means that such an object is associated only with the emotion E or, on the contrary, is never associated with it (it only appears in the negative set). This may be considered a type of framing bias.
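The quantities above can be estimated directly from annotation counts. Below is a minimal sketch (with hypothetical toy data) of the Shannon entropy, Simpson index and conditional entropy; it also derives the mutual information H(X) − H(X|Y), which appears in the definition of bias discussed next.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """H(X) = -sum_i p_i ln p_i, estimated from a list of attribute values."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def simpson_index(values):
    """D(X) = 1 / sum_i p_i^2 (inverse Simpson), a diversity measure.
    Evenness variants divide H(X) by ln(n) and D(X) by n."""
    counts = Counter(values)
    total = len(values)
    return 1.0 / sum((n / total) ** 2 for n in counts.values())

def conditional_entropy(xs, ys):
    """H(X | Y), estimated from paired lists of values."""
    total = len(xs)
    by_y = Counter(ys)
    h = 0.0
    for y, n_y in by_y.items():
        xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        h += (n_y / total) * shannon_entropy(xs_given_y)
    return h

def mutual_information(xs, ys):
    """I(X, Y) = H(X) - H(X | Y); values clearly above 0 indicate an association."""
    return shannon_entropy(xs) - conditional_entropy(xs, ys)

# Hypothetical toy example: protected attribute vs. detected object.
gender = ["f", "f", "f", "m", "m", "m"]
obj = ["kitchen", "kitchen", "kitchen", "car", "car", "kitchen"]
print(shannon_entropy(gender), simpson_index(gender))
print(conditional_entropy(gender, obj), mutual_information(gender, obj))
```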
Kim et al. [45] introduced another definition of bias inspired by information theory. They wanted to develop a method for training classification models that do not overfit due to dataset biases. In doing so, they give a precise definition of bias: a dataset contains bias when the mutual information I(X, Y) := H(X) − H(X|Y) is significantly greater than zero, where X is the protected attribute and Y is the output variable. Kim et al. proposed to minimise such mutual information during training so that the model forgets the biases and generalises well. Note that if Y is any feature in the dataset, instead of the output of a model, this measure can be used to quantify bias in the dataset.

Other. Wang et al. [83] addressed the limitation of simple counts mentioned above by defining what they called dataset leakage, which measures the extent to which a protected attribute can be retrieved using only the information contained in non-protected ones. Given a dataset D = {(x_i, y_i, g_i)}, where x_i is an image, y_i a non-protected attribute and g_i a protected attribute, the attribute y_i is said to leak information about g_i if there exists a function f(·), called the attacker, such that f(y_i) ≈ g_i. The attacker f(·) is operationally a classifier trained on {(y_i, g_i)}. The dataset leakage is then measured as the accuracy of the attacker, λ(D) = (1/|D|) · Σ_i 𝟙(f(y_i) = g_i).
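A minimal sketch of this leakage measurement follows, assuming per-image annotations are available as plain Python lists; the attacker here is a logistic regression over one-hot encoded non-protected labels, which is an illustrative choice of ours rather than necessarily the model used in [83].

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

def dataset_leakage(nonprotected_labels, protected_values, seed=0):
    """Train an 'attacker' that predicts the protected attribute from the
    non-protected labels; return (held-out attacker accuracy, majority baseline)."""
    enc = OneHotEncoder(handle_unknown="ignore")
    X = enc.fit_transform(np.array(nonprotected_labels).reshape(-1, 1))
    y = np.array(protected_values)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    attacker = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    majority_label = Counter(y_tr).most_common(1)[0][0]
    baseline = float(np.mean(y_te == majority_label))
    return attacker.score(X_te, y_te), baseline

# Hypothetical toy data: scene label (non-protected) vs. gender (protected).
scenes = ["kitchen", "kitchen", "office", "office", "kitchen",
          "office", "kitchen", "office", "kitchen", "office"]
genders = ["f", "f", "m", "m", "f", "m", "f", "m", "f", "m"]
acc, base = dataset_leakage(scenes, genders)
print(f"leakage (attacker accuracy): {acc:.2f}, majority baseline: {base:.2f}")
```

An attacker accuracy well above the majority baseline suggests that the non-protected annotations leak the protected attribute.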
Wachinger et al. [81] explicitly used causality to study spurious correlations in neuroimaging datasets. Given variables X and Y, they wanted to test whether X causes Y or whether there instead exists a confounder variable Z. Since those two hypotheses imply two different factorisations of the joint distribution P(X, Y), the factorisation with the lower Kolmogorov complexity is the one that identifies the true causal structure. Kolmogorov complexity is approximated by the Minimum Description Length (MDL).

Jang et al. [38] proposed 8 different measures for identifying gender framing bias in movies. The following measures are computed for a movie for each gender, and the means are computed across every frame in which an actor of a certain gender appears. The measures are: Emotional diversity, the Shannon entropy H(X) = −Σ_{i=1}^{s} P(X = x_i) · ln(P(X = x_i)), where P(X = x_i) is the probability that a character expresses a certain emotion x_i and the sum runs over the different emotions shown by characters of a certain gender (the list of emotions was: anger, disgust, contempt, fear, happiness, neutral, sadness, and surprise); Spatial staticity, which exploits ideas from time-series analysis to measure how static a character is during the movie and is defined as mean(PSD(p(t))), where PSD is the power spectral density of the time series of the character's position on the x-axis (resp. y-axis) (the higher the value, the less animated the character); Spatial occupancy, mean(√A), where A is the area of the face of the character; Temporal occupancy, N / N_total, where N is the number of frames in which the character appears and N_total is the total number of frames; Mean age, computed over each frame and character; Intellectual image, mean(G), where G is the presence of glasses (a debatable choice, as it might itself suffer from some label bias); Emphasis on appearance, mean(E), where E is the light exposure of faces, again calculated for each frame; finally, the type and frequency of surrounding objects is analysed. The attributes involved in the computation of these measures are mainly extracted using the Microsoft Face API, which Buolamwini & Gebru [11] demonstrated to be biased against black women.

Wang et al. [82] proposed REVISE, a comprehensive tool for bias discovery in image datasets. The authors defined three sets of metrics for that purpose, based on the information used to compute them: the first set contains metrics based solely on the bounding boxes of objects (note that a person is considered an object, as it can be classified by an object detection model); if those bounding boxes are also labelled with a protected attribute, the second set of metrics can be computed; the third set uses additional unstructured information, such as text, to discover bias when the protected attribute is not provided explicitly. REVISE implements 13 different metrics, some of which are very similar to those described in the previous paragraphs. We describe two of the most relevant: i) scene diversity, H(S), where S is the scene attribute computed by applying a pre-trained scene recognition algorithm [90]; and ii) appearance differences, computed by extracting features (with a feature extractor) from images that share the same object/scene but have different protected attribute values, and then training an SVM classifier on the extracted features to see whether it can learn different representations of subjects with different protected attribute values in the same context.

The following methods analyse the distances and geometric relations among images, exploiting their representation in a lower-dimensional space, to discover the presence of bias.

Distance-based. Kärkkäinen & Joo [41] proposed a simple measure of diversity for testing their face dataset. They studied the distribution of pairwise L1 distances calculated after embedding the images in a lower-dimensional space using a neural network pre-trained on a different benchmark face dataset. If such a distribution is skewed towards high pairwise distances, it means that the data show high diversity. Nonetheless, such an analysis is heavily influenced by the embedding. For instance, faces in a "white-oriented" dataset were also well separated in the embedding space, probably because the neural network used for the embedding had itself been trained on a similarly biased dataset.

Steed & Caliskan [74] developed a method, inspired by bias detection methods in Natural Language Processing [12], for detecting human-like biases in the latent image representations of unsupervised generative models. They discovered that the biases found in the latent space of two big models trained on ImageNet [69] match human-like biases. The authors measured bias by looking at associations between semantic concepts (for example, man-career and woman-family), measuring the cosine similarity among vectors in the latent space obtained by applying the models to controlled samples of images that resemble those visual concepts. More precisely, given a model f that maps images into a vector space R^d, two sets of images J and K (e.g. photos of men and women respectively), and two sets A and B of images representing the concepts we want to measure the association with (e.g. photos of people at work and photos of people in familiar settings), the association of J with A and K with B is measured as s(J, K, A, B) = Σ_{j∈J} s(j, A, B) − Σ_{k∈K} s(k, A, B), where s(w, A, B) = mean_{a∈A} cos(f(w), f(a)) − mean_{b∈B} cos(f(w), f(b)). Alternatively, the size of the association can be measured via Cohen's d: d = (mean_{j∈J} s(j, A, B) − mean_{k∈K} s(k, A, B)) / std_{w∈J∪K} s(w, A, B). The authors found that the representations they analysed contained several biased associations that resemble human cognitive biases (e.g. the association of flowers/insects with pleasant/unpleasant respectively, male/female with career/family, or white/black with tools/weapons). Nevertheless, it is not clear how many of those associations were present in the original training data and to what degree this is the responsibility of the CV models used for computing the representations. While this method for measuring biased associations is model agnostic, in the sense that it can be applied to any possible model, its results heavily depend on the learnt representations and the employed models.
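A minimal sketch of this association test on embedding vectors is shown below, assuming the embeddings have already been computed (the arrays are random placeholders); it follows the WEAT-style statistic described in the text rather than the authors' exact implementation.

```python
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def s(w, A, B):
    """Differential association of one embedding w with attribute sets A and B."""
    return np.mean([_cos(w, a) for a in A]) - np.mean([_cos(w, b) for b in B])

def association(J, K, A, B):
    """Test statistic: summed associations of J minus those of K."""
    return sum(s(j, A, B) for j in J) - sum(s(k, A, B) for k in K)

def effect_size(J, K, A, B):
    """Cohen's d style effect size of the association."""
    s_J = [s(j, A, B) for j in J]
    s_K = [s(k, A, B) for k in K]
    pooled = np.std(s_J + s_K, ddof=1)
    return (np.mean(s_J) - np.mean(s_K)) / pooled

# Hypothetical embeddings (rows = images), e.g. produced by a pre-trained encoder.
rng = np.random.default_rng(0)
J, K = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))   # two target groups
A, B = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))   # two attribute concepts
print(association(J, K, A, B), effect_size(J, K, A, B))
```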
Other. Balakrishnan et al. [3] developed a method for assessing algorithmic biases in image classifiers by computing the causal relationship between the protected attribute and the output of a classifier. In doing so, they developed a method for intervening on image attributes: they assumed a generator (as in Generative Adversarial Networks [29]) that produces images from latent vectors, and that hyper-planes separating different attributes had been learnt in the latent space. Sampling points along the direction orthogonal to an attribute's hyper-plane yields a version of the original image modified with respect to that attribute. The authors noted that such interventions are not completely disentangled: for example, adding long hair to images of white males also adds a beard, and changing the gender attribute in images of black males adds earrings. This is probably due to dataset bias, which is then detected as a side effect of the transformation described above. Note that, since this transformation is the result of geometric operations, the bias is encoded in the geometry of the latent space. It would be interesting to study to what extent these manipulations of the latent space can be used as a bias exploration tool.

Methods in this category derive from the realisation that the issue of generalisation in CV might be due to bias. Researchers in the field of object detection are usually able to tell with fair accuracy which famous benchmark dataset an image comes from [78]. This means that each dataset carries some sort of "signature" (bias) that makes the provenance of its images easily distinguishable and affects the ability of models to generalise well. The methods described in the following aim to detect such signatures by comparing different datasets. Note that two of the papers [77, 57] reviewed in this section could also fit Section 4.2.

Cross-dataset generalisation. Torralba and Efros [78] tested bias in object detection datasets by answering the following question: "how well does a typical object detector trained on one dataset generalize when tested on a representative set of other datasets, compared with its performance on the native test set?" The assumption is that if the performance on the native test set is much higher, the datasets exhibit some bias that is learnt by the object detector $f_D$. Hence, let us consider a dataset of interest $D$ and $n$ other benchmarks $\{B_i\}$. Performance is compared by taking the ratio between the mean performance of $f_D$ on the other benchmarks and its performance on the native test set, $\frac{1}{n}\sum_{i=1}^{n} \mathrm{perf}(f_D, B_i)\, /\, \mathrm{perf}(f_D, D)$. The closer this score is to 1, the better $f_D$ generalises. Note that if the score is low, $f_D$ does not generalise well and the dataset $D$ is probably biased, while if the score is close to 1 the datasets share a similar representation of the visual world. Furthermore, the authors also proposed a test, presented as a toy experiment but since used in many other studies, called Name the dataset: they trained a model to recognise the source dataset of a given image; the greater the accuracy of this model, the more distinguishable, and hence the more biased, the datasets are. The methods described above have also been used in [77, 65, 41].
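A minimal sketch of this cross-dataset comparison, assuming performance scores (e.g. average precision) have already been computed for the native test set and for the other benchmarks, might look as follows; the numbers are toy values for illustration.

```python
from statistics import mean

def cross_dataset_score(perf_native: float, perf_others: list[float]) -> float:
    """Mean performance on other benchmarks divided by performance on the
    native test set; values well below 1 suggest dataset-specific bias."""
    return mean(perf_others) / perf_native

# Toy numbers: a detector reaching 0.70 AP on its native test set but only
# about 0.40 AP on three other benchmarks.
print(cross_dataset_score(0.70, [0.42, 0.38, 0.40]))  # ~0.57, far from 1
```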
Tommasi et al. [77] investigated the possibility of using CNN feature descriptors to address dataset bias. In particular, they replicated the experiments of Torralba and Efros [78] and Khosla et al. [44] (see next paragraph) using DeCAF features [19]. Furthermore, they slightly modified the measure used by Torralba and Efros for evaluating cross-dataset generalisation, introducing the sigmoid function σ into its computation.

Other. Khosla et al. [44] proposed a method for both modelling and mitigating dataset bias. In particular, they train an SVM binary classifier on the union of a set of n datasets $D_1, \dots, D_n = \{(x^i_j, y^i_j)\}$, where $y^i_j \in \{-1, 1\}$ is a class common to all n datasets and the $x^i_j$ are feature vectors extracted via some feature extraction algorithm. The problem is framed as multi-task classification [25], where the algorithm learns n distinct hyper-planes $w_i \cdot x$ with $w_i = w + \Delta_i$. The vector $w$, which is common to every dataset, models the shared representation of the visual world, while the specific biases of each $D_i$ are modelled by the vectors $\Delta_i$. This is achieved by solving a minimisation problem of the form $\min_{w, \Delta_i} \frac{1}{2}\lVert w \rVert^2 + \frac{\lambda}{2}\sum_{i=1}^{n}\lVert \Delta_i \rVert^2 + C_1 \sum_{i,j} \ell\big(w_i \cdot x^i_j,\, y^i_j\big) + C_2 \sum_{i,j} \ell\big(w \cdot x^i_j,\, y^i_j\big)$, where $\ell$ is the hinge loss and λ, $C_1$ and $C_2$ are hyper-parameters. The authors showed that studying the vectors $\Delta_i$ gives useful semantic information about each dataset's bias (e.g. they discovered that Caltech101 [52] has a strong preference for side views of cars).

López-López et al. [57] also investigated the use of feature descriptors to discover biases. In particular, they wanted to understand how the images of different datasets are distributed after being embedded in the same lower-dimensional space. Given n datasets $D_1, \dots, D_n$, they sampled two sets of images from each of them, $G_1, \dots, G_n$ and $P_1, \dots, P_n$, called respectively gallery sets and probe sets. They then computed the latent space by applying a pre-trained feature descriptor $f$ to the gallery sets. After that, they computed, for each dataset, the probability $P\big(x^{*} \in f(G_j)\big)$, where $x$ is the feature vector of an image in the probe set $P_i$ and $x^{*}$ is its nearest neighbour among the feature vectors of the images in the union of the gallery sets. If this probability differs from $\frac{1}{n}$, the nearest neighbours are not equally distributed among the n datasets; hence, there must be some selection bias.

Other methods for visual dataset bias discovery that do not fit any of the three categories described above range from methods relying on crowdsourcing to those based on training ad-hoc classification models to detect bias. The authors of [63] proposed a simple method for addressing bias in object recognition datasets. They cropped a small central sub-image from each image in the original dataset. These cropped pictures were so small that humans could not recognise the objects in them. Hence, if a model trained on such images attained better-than-chance performance, the data contained distinguishable features spuriously correlated with the object categories and thus exhibited some kind of selection or framing (capture) bias.
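A minimal sketch of such a central-crop probe is shown below, using scikit-learn on a toy dataset; the crop size, the random data and the choice of a linear classifier are illustrative assumptions rather than the setup of [63].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def central_crop(images, size=8):
    """Keep only a tiny central patch of each image (array of shape N x H x W)."""
    h, w = images.shape[1], images.shape[2]
    top, left = (h - size) // 2, (w - size) // 2
    return images[:, top:top + size, left:left + size].reshape(len(images), -1)

# Toy data: 200 random 64x64 "images" with 5 object categories.
rng = np.random.default_rng(0)
images = rng.random((200, 64, 64))
labels = rng.integers(0, 5, size=200)

crops = central_crop(images, size=8)
acc = cross_val_score(LogisticRegression(max_iter=1000), crops, labels, cv=5).mean()
print(f"crop-only accuracy: {acc:.2f} vs. chance: {1/5:.2f}")
# Accuracy well above chance would indicate spurious, dataset-specific cues.
```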
Thomas & Kovashka [76] studied the political framing bias of images scraped from online news outlets. Their idea was to train a semi-supervised tool for classifying images according to their political orientation. They labelled the images according to the political orientation of the source and also used information regarding the articles hosting the images, feeding the network with both the image and the document embedding of the article. Note that the proposed architecture incorporates textual information at training time, but allows images to be classified without any additional information at testing time. Thus, the model can be used to understand whether a specific image or a collection of images is biased towards a certain political faction. Moreover, the visual explanations of such a model could give semantic information on the political framing of the dataset.

Clark et al. [14] proposed to use an ensemble classification algorithm for mitigating bias. The idea is to train a low-capacity network (i.e. with a low number of parameters) together with a high-capacity one (i.e. with more than double the parameters) so that the former learns spurious correlations while the latter learns to classify the data in an unbiased way. While this difference in capacity and the ensemble training encourage the two models to learn different patterns, the high-capacity model may still learn simpler patterns and hence bias. To avoid this, the two networks are trained so that their outputs are conditionally independent, which gives the ensemble an incentive to isolate simple and complex patterns. While the authors use this method solely to mitigate algorithmic bias, studying the low-capacity model can reveal spurious correlations in the training dataset, highlighting selection/framing bias.

Lopez-Paz et al. [56] proposed a method to discover "causal signals" among objects appearing in image datasets. In particular, given a set of images $D = \{d_i\}$, they score the presence of certain objects/attributes A and B using an object detector or a feature extractor, respectively. They then apply a Neural Causation Coefficient (NCC, [55]) network, a neural network that takes a bag of samples $\{(a_i, b_i)\}_{i=1}^{m}$ as input and returns a score $s \in [0, 1]$, where $s = 0$ means that A causes B and $s = 1$ means that B causes A. Note that, while not every causal relationship is a source of bias, some might be. Moreover, such causal relationships can be thought of as a by-product of the selection of the subjects, hence this method can detect selection bias.

Human-based. Hu et al. [36] proposed a non-automated approach for bias discovery. Their method consists of a three-step crowdsourcing workflow for bias detection (of selection and framing bias, according to our categorisation). To avoid the complexity of free-text descriptions, in the first step workers are presented with a batch of images and asked to describe the similarities among the images of the batch via a question-answer pair (e.g. if every image in the batch shows only white airplanes, and hence there is some selection bias, the worker would label the batch with the pair: What colour are the airplanes in the images? White). In the second step, each worker is asked to answer some of the questions collected in the first phase, based on different batches, to confirm the presence of such biases. Finally, the workers are asked to evaluate whether the statements are true in the real visual world or, on the contrary, constitute biases (e.g. is it true that every airplane is white? If the answer is yes, this is not considered an instance of bias).
Note that since this last step is based on "common sense knowledge and subjective belief" [36], it heavily relies upon the workers' own background and biases.

The reduction to tabular data is a reasonable and effective way of discovering bias. These methods can leverage the great amount of work already done in the field of Fair Machine Learning for tabular data. Nevertheless, as far as visual data are concerned, the methods used are rather simplistic: most works just look for balance in the protected attribute or compare its distribution with respect to other features. Furthermore, these methods heavily rely on labels that are either attached to the data or automatically extracted. Hence, any bias in the labelling process affects such discovery methods and should be taken into account. In a complementary way, the image representation methods reduce the problem to bias detection in a lower-dimensional space instead of reducing it to tabular data. While this could better capture the complexity of visual content, such methods are necessarily influenced by both the models used to compute the representation and the metric used to compute distances in it. Moreover, they are usually harder to apply, since they need some kind of supervision for computing the space. Despite their historical importance, cross-dataset detection methods suffer from major issues. First, they are only applicable when several comparable datasets are available, which is often not the case in practice. Second, they might help to unveil the presence of bias, but without further inspection they cannot reveal what kind of bias it is. Note that, while these methods might be useful to get an idea of the existence of some biases in a visual dataset, they are of little or no use if we want to discover bias within the dataset, for example if we want to understand whether there are discriminatory differences between men and women in the same dataset. Regarding the methods that do not fit any of the above-mentioned categories, we cannot outline common pros and cons, as they are very problem/domain specific.

In the following, we describe some attempts to construct datasets where the existence of bias was taken into account during the dataset creation process. These datasets were constructed for specific purposes and were probably not intended to be universally bias-free. Nonetheless, analysing which biases have been removed and which have not can be useful for understanding the general challenge of bias in visual datasets. We summarise the content of this section in Table 5. Furthermore, we propose a checklist for helping the collection of bias-aware visual datasets (Table 6).

Buolamwini & Gebru [11] released the Pilot Parliaments Benchmark (PPB) dataset. PPB is a face dataset constructed by collecting photos of the members of six national parliaments. The authors' aim was to collect data balanced with respect to both the gender of the subjects and their skin colour. To do so, they selected three African countries (Rwanda, Senegal, and South Africa) and three European countries (Iceland, Finland and Sweden) according to the gender parity rank among their Members of Parliament (MPs). The data have been labelled by three annotators (including the authors) according to (binary) gender appearance and the Fitzpatrick skin type (ranging from I to VI; these labels are used by dermatologists as a gold standard for skin classification).
The definitive skin labels were provided by a board-certified dermatologist, while the definitive gender labels were determined based also on the title, prefix or name of the parliamentarians (note that using names as a proxy for gender can cause label bias [40]). While the data collection process described above resulted in a much more balanced dataset compared to other famous benchmarks (Adience [23], IJB-A [48]), it is still not free from possible biases. For example, the selection process specifically targets a small number of African and northern European countries to ensure gender and skin tone balance, but it completely excludes, for instance, Asian and South American countries. Moreover, MPs are likely to be middle-aged, which could exclude young and old people from the selection. On the framing bias side, different countries might have different standards and dress codes for the official portraits of their MPs, and this could turn into bias as well.

Kärkkäinen and Joo [41] collected a face dataset emphasising, in particular, balance in terms of age, gender and race. They relied on a crowdsourcing workflow for annotating the images: three different workers were asked to label each image according to gender, age group, and race, and the label was kept if at least two of the three agreed. Otherwise, the image was proposed to another three workers and discarded if this again resulted in three different judgements. We can spot two sources of label and selection bias, respectively: first, as already discussed in Section 3, we cannot be sure that the workers are able to determine the three labels homogeneously across every sub-group; second, discarding the photos the workers cannot agree on might result in the missed selection of a group of people whose characteristics are difficult for the workers to determine. Finally, the taxonomy of races used by the authors (White, Black, Indian, East Asian, South East Asian, Middle East, and Latino) already introduces a form of label bias: while it is derived from the taxonomy commonly used by the US Census Bureau and might be descriptive of the composition of the US population, it hardly captures the complexity of human diversity.

Diversity in Faces (DiF) [58] and KANFaces [27] are two face datasets that try to address the issue of bias by ensuring as much diversity as possible, using the diversity measures proposed by Merler et al. [58] that we reviewed in Section 4. The attributes they control diversity for are age, gender, skin tone, a set of craniofacial ratios, and pose; the authors also take into account a metric of illumination. By collecting diverse data, the authors try to avoid selection bias (in the case of age, gender and skin tone) and some framing bias (in the case of pose and illumination).
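As a minimal sketch, the two-round majority-vote rule used by Kärkkäinen and Joo above could be implemented as follows; the data structures, helper name and example labels are hypothetical and not part of the original annotation pipeline.

```python
from collections import Counter
from typing import Optional, Sequence

def aggregate_label(round_one: Sequence[str],
                    round_two: Optional[Sequence[str]] = None) -> Optional[str]:
    """Keep a label if at least 2 of 3 annotators agree; otherwise fall back to a
    second round of 3 annotators, and discard (return None) if they all disagree."""
    label, count = Counter(round_one).most_common(1)[0]
    if count >= 2:
        return label
    if round_two is None:
        return None  # needs a second annotation round
    label, count = Counter(round_two).most_common(1)[0]
    return label if count >= 2 else None

print(aggregate_label(["male", "male", "female"]))              # 'male'
print(aggregate_label(["20-29", "30-39", "40-49"],
                      ["30-39", "30-39", "50-59"]))             # '30-39'
print(aggregate_label(["White", "Black", "Indian"],
                      ["Latino", "Middle East", "East Asian"])) # None (discarded)
```

Inspecting which images end up discarded by such a rule is one way to check whether the discarding itself introduces the selection bias discussed above.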
Barbu et al. [5] made an attempt to avoid framing bias in large-scale object detection datasets. In particular, they added controls for object rotations, viewpoints and backgrounds by asking crowd workers to photograph objects in their homes in natural settings, following instructions given by the authors. While these controls resulted in a much more diverse dataset (the authors used ImageNet [69] as a comparison), the objects appear only in indoor contexts, are rarely occluded and are often centre aligned. Thus, specific framing biases have been avoided, while others have been introduced by the collection procedure. Also, the authors removed some classes from the dataset, for reasons ranging from privacy concerns (e.g. "people") to the difficulty of moving and photographing the objects in different settings (e.g. "beds"). In principle, this might generate some selection bias (more specifically, negative class bias), since the absence of those objects could make the negative classes less representative.

Wu et al. [85] collected two benchmark datasets: the Inclusive Benchmark Database (IBD) and the Non-binary Gender Benchmark Database (NGBD). IBD contains 12,000 images of 168 different subjects, 21 of whom identify as LGBTQ, and the geographic provenance of the subjects is balanced. NGBD contains 2,000 images of 67 unique subjects, all public figures whose gender identity is known. Thus, the database contains multiple gender identities (namely: non-binary, genderfluid, genderqueer, gender non-conforming, agender, gender neutral, gender-less, third gender, and queer). The authors themselves identify two major risks of label bias: first, "Gender identity has its multifaceted aspects that a simple label could not categorize" [85] (the authors identify modelling gender as a continuum as a direction for future work); and, second, "Gender is a complex socio-cultural construct and an internal identity that is not necessarily tied to physical appearances" [85].

Hazirbas et al. [33] proposed the Casual Conversations Dataset for evaluating the performance of CV models across different demographic categories. Their dataset comprises 3,011 subjects and contains over 45,000 videos, with an average of 15 videos per person. The videos were recorded in multiple US states with a diverse set of adults in various age, gender and apparent skin tone groups. This work probably represents the greatest effort so far to build a balanced dataset addressing both selection and framing bias (the latter in the form of video illumination). Nevertheless, some forms of imbalance remain: for example, most videos present bright lighting conditions, and most subjects are labelled as either male or female (with just 0.1% of the participants identifying as "Others" and 2.1% whose gender is unknown). The label bias that categories such as gender, age, and race could create is overcome by asking the participants their age and gender and by using the Fitzpatrick Skin Type instead of race. Nonetheless, the authors state that there are "videos in which two subjects are present simultaneously" but that they provide only one set of labels, which might also be a form of label bias. Last, we note that once again the subjects are all from the US, which represents a serious selection bias, as the US population is hardly representative of humankind as a whole.

Checklist. As highlighted in this and the previous sections, dealing with bias in visual data is not a trivial task. In particular, collecting bias-aware visual datasets can be incredibly challenging. Thus, we propose a checklist (Table 6) to help scientists and practitioners spot and make explicit possible biases in the data they collect or use. Our checklist is inspired by previous work on documentation and reflexive data practices [26, 61], but adds several questions specific to selection, framing and label bias, because these have their own peculiarities and must be analysed separately. We start with a general set of questions on the purposes the data is collected for and on the collection procedures.
Then, we continue with sets of questions specific to selection, framing and label bias. In particular, we ask whether the selection of the subjects generates any imbalance or lack of diversity, whether there are spurious correlations or harmful framing, whether fuzzy categories are used for labelling, and whether the labelling process itself contributes to inserting biases.

Table 6: Checklist for bias-aware visual data collection

General
- What are the purposes the data is collected for? [26]
- Are there uses of the data that should be discouraged because of possible biases? [26]
- What kind of bias can be inserted by the way the collection process is designed? [26]

Selection bias
- Do we need balanced data or statistically representative data?
- Are negative sets representative enough?
- Is the data representative enough?
- Is there any group of subjects that is systematically excluded from the data?
- Do the data come from or depict a specific geographical area?
- Does the selection of the subjects create any spurious associations?
- Will the data remain representative for a long time?

Framing bias
- Are there any spurious correlations that can contribute to framing different subjects in different ways?
- Are there any biases due to the way images/videos are captured?
- Did the capture induce some behaviour in the subjects (e.g. smiling when photographed)?
- Are there any images that could convey different messages depending on the viewer?
- Are subjects of a certain group depicted in a particular context more often than others?
- Do the data agree with harmful stereotypes?

Label bias
- If the labelling process relies on machines: have their biases been taken into account?
- If the labelling process relies on human annotators: is there an adequate and diverse pool of annotators? Have their possible biases been taken into account?
- If the labelling process relies on crowdsourcing: are there any biases due to the workers' access to crowdsourcing platforms?
- Do we use fuzzy labels such as race or gender?
- Do we operationalise any unobservable theoretical constructs or use proxy variables? [37]

The aim of this survey is threefold: first, to provide a description of the different types of bias and to illustrate the processes through which they affect CV applications and visual datasets; second, to perform a systematic review of methods for bias discovery in visual datasets; and, third, to describe existing attempts to collect visual datasets in a bias-aware manner. We showed how the problem of bias is pervasive in CV: it accompanies the whole life cycle of visual content, involves several actors, and re-enters the life cycle through biased CV algorithms. One of our major contributions has been to provide a detailed description of the different incarnations of bias in visual data (selection, framing and label bias), along with several examples. We also went further by providing a sub-categorisation (Table 1) which includes several categories of bias commonly described in Statistics, Health studies, or Psychology, adapting them to the context of visual data. The systematic review in Section 4 allowed us to draw some considerations on the state of the art in bias discovery methods for visual data and to outline some possible future streams of research. The vast majority of the reviewed works adopt the strategy of reducing the problem to tabular data (Section 4.1).
While this seems a natural option, as it allows the use of Fair, Accountable and Transparent Machine Learning (FATML) techniques, it also means that there is room for further research that works directly on the visual features of the data; the works we categorise as "biased image representations" (Section 4.2) seem a first approach in this direction. However, these methods rely on learnt representations that can add further biases. Moreover, our review showed that most of the works are suitable for addressing selection and framing bias, while label bias is usually not taken into account. Label bias, though, should be studied more deeply, as it is not only pervasive in CV but can also result in highly discriminatory applications, as shown in Section 3.3. Hence, future research has room both for inventing new and improving existing methods and for enlarging the range of types of bias that can be detected. Another possible stream of research is that of multi-fairness/intersectional bias detection, as the vast majority of works deal with a single protected attribute; an exception is the work of Buolamwini & Gebru [11].

The review of several attempts to collect bias-aware data in Section 5 also allowed some useful conclusions. First, there is no such thing as bias-free data. Hence, it is of utmost importance, along with the development of reliable bias discovery tools, that researchers and practitioners become aware of the biases of the datasets they collect and make them explicit in a standardised way (see for example [26, 61]). Second, we noticed that most bias-aware dataset creation focuses on face images, probably because of their problematic application domains. This leaves room for improving datasets and establishing data collection practices in other fields as well, including medical imaging, self-driving cars, etc. Finally, in Table 6 we outline a checklist for the collection of visual data. We believe that such a guide will help practitioners and scientists to spot possible causes of bias, to collect data that are as unbiased as possible, and to remain aware of such biases during their analysis.

References

Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks
Face-ism: Five studies of sex differences in facial prominence
Towards causal benchmarking of bias in face analysis algorithms
Problematic machine behavior: A systematic literature review of algorithm audits
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
Genetic Basis of Human Biodiversity: An Update
Consumer-lending discrimination in the FinTech era
Man is to computer programmer as woman is to homemaker? Debiasing word embeddings
The "criminality from face" illusion
Beyond (mis)representation: Visuals in COVID-19 misinformation
Gender shades: Intersectional accuracy disparities in commercial gender classification
Semantics derived automatically from language corpora contain human-like biases
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments
Learning to model and ignore dataset bias with mixed capacity ensembles
Framing the pictures in our heads: Exploring the framing and agenda-setting effects of visual images. In: Doing Frame Analysis: Empirical and Theoretical Perspectives
Specchio delle sue brame: analisi socio-politica delle pubblicità: genere, classe, razza, età ed eterosessismo
Nel segno della vagina: dalla riappropriazione semiotica degli anni '70 alle vagina warriors negli anni '90 alla critica decolonizzante del femminismo indigeno
ImageNet: A large-scale hierarchical image database
DeCAF: A deep convolutional activation feature for generic visual recognition
Demographic bias in biometrics: A survey on an emerging challenge
Auditing ImageNet: Towards a model-driven framework for annotating demographic attributes of large-scale image datasets
Age and gender estimation of unfiltered faces. Information Forensics and Security
Framing: Toward clarification of a fractured paradigm
Regularized multi-task learning
Investigating bias in deep face analysis: The KANFace dataset and empirical study
The profiling potential of computer vision and the challenge of computational empiricism
Generative adversarial nets
A survey of methods for explaining black box models
Identifying a facial expression of flirtation and its effect on men
Towards a critical race methodology in algorithmic fairness
Towards measuring fairness in AI: the Casual Conversations dataset
Causal Inference: What If
Obesity Stigma in Online News: A Visual Content Analysis
Crowdsourcing detection of sampling biases in image datasets
Measurement and fairness
Quantification of gender representation bias in commercial films based on image analysis
Identifying and correcting label bias in machine learning
Inferring gender from names on the web: A comparative evaluation of gender detection methods
FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation
The use and misuse of counterfactuals in ethical machine learning
Unequal representation and gender stereotypes in image search results for occupations
Undoing the damage of dataset bias
Learning not to learn: Training deep neural networks with biased data
Procedures for performing systematic reviews
Face recognition performance: Role of demographic information
Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A
Inherent trade-offs in the fair determination of risk scores
How to do a structured literature review in computer science
OpenImages: A public dataset for large-scale multi-label and multi-class image classification
Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories
SCUT-FBP5500: A diverse benchmark dataset for multi-paradigm facial beauty prediction
Microsoft COCO: Common objects in context
Towards a learning theory of cause-effect inference
Discovering causal signals in images
Dataset bias exposed in face verification
Diversity in Faces. arXiv
How do scholars approach the circular economy? A systematic literature review
Between subjectivity and imposition: Power dynamics in data annotation for computer vision
Documenting computer vision datasets: An invitation to reflexive data practices
The creation and detection of deepfakes: A survey
Comparison of data set bias in object recognition benchmarks
Bias in data-driven artificial intelligence systems: An introductory survey
Contemplating visual emotions: Understanding and overcoming dataset bias
Same candidates, different faces: Uncovering media bias in visual portrayals of presidential candidates with computer vision
Large image datasets: A pyrrhic win for computer vision? arXiv
Face recognition: Too bias, or not too bias?
ImageNet large scale visual recognition challenge
Learning to share visual appearance for multiclass object detection
What's in a picture? The impact of face-ism on trait attribution
No classification without representation: Assessing geodiversity issues in open data sets for the developing world
Don't judge an object by its context: Learning to overcome contextual bias
Image representations learned with unsupervised pretraining contain human-like biases
Discrimination in online ad delivery
Predicting the politics of an image using webly supervised data
A deeper look at dataset bias
Unbiased look at dataset bias
The ethical questions that haunt facial-recognition research
Fairness definitions explained
Detect and correct bias in multi-site neuroimaging datasets
REVISE: A tool for measuring and mitigating bias in visual datasets
Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations
Predictive inequity in object detection. CoRR
Gender classification and bias mitigation in facial images
Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the ImageNet hierarchy
Building a large scale dataset for image emotion recognition: The fine print and the benchmark
BDD100K: A diverse driving dataset for heterogeneous multitask learning
Men also like shopping: Reducing gender bias amplification using corpus-level constraints
Places: A 10 million image database for scene recognition
Capturing long-tail distributions of object subcategories

Acknowledgements. We would like to thank Alaa Elobaid, Miriam Fahimi and Giorgos Kordopatis-Zilos for the fruitful discussions. This work is supported by the project "NoBias - Artificial Intelligence without Bias", which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie (Innovative Training Network) grant agreement no. 860630.