key: cord-0218511-2pwmwep3
authors: Krishnapriya, KS; King, Michael C.; Bowyer, Kevin W.
title: Analysis of Manual and Automated Skin Tone Assignments for Face Recognition Applications
date: 2021-04-29
journal: nan
DOI: nan
sha: 13c5d7cb043d5af4a9d8dcd187c34c5477e94c6a
doc_id: 218511
cord_uid: 2pwmwep3

News reports have suggested that darker skin tone causes an increase in face recognition errors. The Fitzpatrick scale is widely used in dermatology to classify sensitivity to sun exposure and skin tone. In this paper, we analyze a set of manual Fitzpatrick skin type assignments and also employ the individual typology angle to automatically estimate skin tone from face images. The set of manual skin tone rating experiments shows that there are inconsistencies between human raters that are difficult to eliminate. Efforts to automate skin tone rating suggest that it is particularly challenging on images collected without a calibration object in the scene. However, after color correction, the level of agreement between the automated and manual approaches is found to be 96% or better for the MORPH images. To our knowledge, this is the first work to: (a) examine the consistency of manual skin tone ratings across observers, (b) document that there is substantial variation in the rating of the same image by different observers even when exemplar images are given for guidance and all images are color-corrected, and (c) compare manual versus automated skin tone ratings.

Recent news articles have publicized the topic of bias in face recognition. For example, a BBC news article includes a figure with the caption "Face recognition tech is less accurate the darker your skin tone" [38], and a New York Times article states that "... the darker the skin tone, the more errors arise ..." [30]. While such articles are generally prompted by academic research results, we are not aware of any research that conclusively shows that darker skin tone is a primary causal factor in decreasing the accuracy of face recognition.

In this paper, we report on an experiment to measure skin tone from face images in accordance with the Fitzpatrick skin type (FST) and the individual typology angle (ITA). The Fitzpatrick scale is a I (lightest) to VI (darkest) rating that is widely used in dermatology [5]. It has recently been adopted in various face recognition research studies [7, 32, 34, 25]. Prior research [13, 33] has also shown that skin tone assessment can be done directly from an image using automated individual typology angle measurement. Individual typology angle measurements are categorized into six skin type groups: very light, light, intermediate, tan, brown, and dark.

The major contributions of this work are as follows. One, we present the first analysis of the consistency of human ratings of skin tone from images. The analysis of human ratings suggests that categorical labeling of skin tone by human observers is subjective, and there is some inconsistency across observers. We explore a series of refinements intended to encourage greater consistency of human ratings: first adding color rectangles for reference, then exemplar images, then color correction combined with exemplar images. Two, we customized an automated approach to skin tone assessment based on the individual typology angle. This automated rating produces a level of agreement with manual ratings that is similar to the level of consistency between two human raters using the Fitzpatrick scale.
The automated approach has obvious advantages in speed, scalability, cost, and consistency.

Due to space limits, we give only a brief overview of how skin tone has been used in exploring accuracy differences in face image analysis. Initial research on recognition accuracy differences between African-Americans and Caucasians used race meta-data without taking individual skin tone into account. Perhaps the best-known such study is Klare et al. [23]. They report that [23], "the female, Black, and younger cohorts are more difficult to recognize for all matchers used in this study ...". Krishnapriya et al. [26] report that the impostor and genuine distributions for African-Americans are centered on higher similarity values than for Caucasians. This means that, for a fixed decision threshold, African-Americans have a higher false match rate (FMR, the fraction of impostor pairs scoring above the threshold) and a lower false non-match rate (FNMR, the fraction of genuine pairs scoring below it). A recent NIST report [20] finds that false matches are generally higher for West and East Africans and lower for Eastern Europeans, and also that, with mugshot-quality images, false non-matches are higher for Caucasians and lower for African-Americans. Wang et al. [39] and Gong et al. [19] also recently reported recognition accuracy results based on race meta-data without skin tone ratings.

The Fitzpatrick scale [16] is a I (lightest) to VI (darkest) rating of skin tone, used in dermatology to classify sensitivity to sun exposure. Skin tone, of course, varies among African-American individuals and among Caucasian individuals. Also, face morphology varies between the groups (and between genders) and between individuals, independent of skin tone variation.

Lester et al. [28] reviewed the research literature on COVID-19 skin manifestations in the context of the skin tone of subjects represented in research studies. A set of images from the literature was given Fitzpatrick skin tone categorizations by a board-certified dermatologist [28]: "... a board-certified dermatologist with expertise in diagnosing and treating patients with skin of colour (Fitzpatrick type IV-VI) evaluated each of the images and categorized them based on Fitzpatrick type I-VI." Lester et al. [28] also commented on the degree of uncertainty in these Fitzpatrick ratings: "Our study is limited by the subjective assessment of skin type from a photograph. Lighting conditions including over-exposure may have made dark skin look lighter, and this may have led to some misclassification across one or two skin types. However, it is unlikely that lighting issues alone would result in skin types V or VI appearing as skin type I-III."

This work underlines several important points. One, dermatological research on important current research questions is performed using a board-certified dermatologist's subjective assessment of Fitzpatrick skin type from photos. Whatever shortcomings subjective Fitzpatrick ratings from images have, there is not yet anything better to replace them. Two, the uncertainty due to varying illumination between images is acknowledged as possibly causing "... some misclassification across one or two skin types" [28]. Our experience with multiple observers rating the same set of controlled-acquisition face images is consistent with Lester et al. [28] on this point. And it makes sense that the more varied the illumination in a set of images, the larger the range of potential misclassification.
In comparison to the imagery in dermatology research publications, in-the-wild imagery used in face recognition research should be expected to have even larger misclassification of Fitzpatrick skin type. In the context of our research in this paper, controlled-acquisition images such as those in MORPH will have less serious misclassification of Fitzpatrick skin type than would any of the in-the-wild datasets popular in face recognition research.

The use of Fitzpatrick ratings for face image analysis in the computer vision community appears to have started with the IARPA Janus dataset [24] and with Buolamwini and Gebru's study [7]. The Janus face image datasets [32] have meta-data for Fitzpatrick skin tone ratings obtained via Mechanical Turk. Lu et al. [31] analyzed the Janus dataset and reported that recognition accuracy generally degraded with darker skin tone, but that skin tone VI had only the second-worst ROC. Buolamwini and Gebru [7] generated skin tone ratings for images that they collected off the web and reported that each of the three gender classification tools studied was more accurate for lighter skin types than for darker. Muthukumar et al. [35] followed up with another study on gender classification tools and suggested that skin tone may not be the driving factor for accuracy differences. Krishnapriya et al. [25] analyzed the distribution of skin tone ratings for images sampled from the center and from the high-similarity tail of the impostor distribution for African-American males. They reported that same-skin-tone image pairs occur more frequently in the high-similarity tail of the impostor distribution, but that darker skin tone does not appear to be a driving factor [25]. Cook et al. [12] and Howard et al. [21] analyzed recognition accuracy differences based on race meta-data and on a measure of skin reflectance. Exploiting the 18% gray background in a controlled enrollment image, they computed a measure of relative skin reflectance for each subject. They report that darker skin tone is associated with longer image acquisition times and with lower similarity scores for genuine image pairs, and that the skin reflectance measure was a better predictor than self-reported race labels. While various studies have used Fitzpatrick ratings assigned by viewers looking at images [7, 21, 32, 34, 35], there is little or no research on the consistency of the ratings.

We used the MORPH dataset [2], which contains mugshot-style images. MORPH was originally assembled and distributed to support research in face aging [36]. The African-American male cohort of MORPH contains 36,838 images of 8,850 subjects in the curated version used in [26]. This cohort allows us to compute an impostor distribution in which we can analyze the effects of skin tone independent of factors of gender and race. The 8,850 African-American male subjects in MORPH far exceed the 3,531 total subjects in IJB-C, and also the 562 (363+199) subjects in the dataset in [12]. Also, the dataset in [12], in contrast to MORPH and IJB-C, is not available to the research community. Note that black rectangles are added over the eye regions in all the example images shown in this paper in an effort to respect individual anonymity and privacy.

For the analysis, we used an open-source deep CNN matcher called ArcFace [15]. One of the major advantages of ArcFace is that it optimizes the geodesic distance margin by utilizing the exact correspondence between arc and angle on a normalized hypersphere.
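For reference, the additive angular margin loss introduced in the ArcFace paper [15] makes this arc/angle correspondence concrete. In the notation of [15], theta_j is the angle between a deep feature and the j-th class weight, s is the feature scale, and m is the additive angular margin:

```latex
L = -\frac{1}{N}\sum_{i=1}^{N}
    \log \frac{e^{\,s\cos(\theta_{y_i}+m)}}
              {e^{\,s\cos(\theta_{y_i}+m)} + \sum_{j \neq y_i} e^{\,s\cos\theta_j}}
```

The margin m penalizes the angle between a feature and its own class center directly on the hypersphere, which is what gives ArcFace its geodesic interpretation.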
The pre-trained ArcFace model used in this work can be found here: https:

In this experiment, we sample 500 image pairs from each of two different regions, the center and the high-similarity tail, of the African-American male impostor distribution. Each of these two sets of image pairs has just under 1,000 unique images, as some individual images are repeated in multiple image pairs. There are 982 unique images of 915 persons from the center and 967 unique images of 872 persons from the high-similarity tail.

In our initial manual Fitzpatrick rating experiments, three different viewers independently examined each image to assign a Fitzpatrick score to the image. The three raters each assigned a Fitzpatrick score without knowing the region of the impostor distribution that an image came from and without knowing each other's ratings. If two or three of the viewers agreed on the skin tone rating, that was used as the rating for the image. If the three viewers gave different ratings, then the middle of the three ratings was used as the rating for the image (this fusion rule is sketched in code below). Fusing the results from different viewers is meant to be a possible improvement over the ratings as used in [7, 34], avoiding anomalies specific to any one viewer.

For images from the center of the African-American male impostor distribution, all three reviewers gave the same rating to 352 images (36%), two of three reviewers agreed for 586 images (60%), and all three reviewers gave different ratings for 44 images (4%). For images from the high-similarity tail, all three reviewers gave the same rating to 302 images (31%), two of three reviewers agreed for 602 images (62%), and all three reviewers gave different ratings for 63 images (7%). The distributions of skin tone ratings by the three viewers are given at the top of Figure 3.

The initial experiment results show reasonable agreement between the three raters: e.g., agreement by at least 2 of 3 raters on 90% of the images. Nevertheless, we refined the procedure for manual skin tone ratings in an effort to improve the consistency of the ratings. We posited that the rating task would give more consistent results if the raters had exemplar faces of different skin tones to compare with on each rating. Thus, we modified the software used to present images and record ratings to include exemplar images of the different Fitzpatrick ratings, as shown in Figure 1, along with the images to rate. For the exemplar images, we selected images from the well-known IJB-C dataset, which has skin tone annotations in its metadata [32]. IJB-C has 3,531 subjects with 31,334 images and 117,542 frames from 11,779 videos [32]. From the metadata for IJB-C [32], the skin tone annotation appears to be per-subject. Based on a manual review of subjects in the IJB-C dataset, we chose six exemplars to use in our rating tool. Additionally, a web-based tool was developed to provide greater flexibility in getting viewers to rate images (see Figure 1).

Figure 1: Snapshot of the web-based skin tone rating tool showing exemplar images from IJB-C that correspond to the six different Fitzpatrick ratings.

We repeated the initial experiment, with the same images rated again by the same three raters using the version of the rating tool that displays exemplar images to encourage the use of consistent reference points. The comparison plots are given in Figure 3. The results suggest that the web-based tool with exemplar images does not yield greater consistency between the raters.
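Returning to the rating-fusion rule described earlier: taking the agreed value when at least two raters agree, and the middle rating otherwise, is equivalent to taking the median of the three ratings. A minimal sketch, with hypothetical rating values for illustration:

```python
def consensus_rating(r1: int, r2: int, r3: int) -> int:
    """Fuse three Fitzpatrick ratings (1..6 for types I..VI) into one.

    Sorting and taking the middle value yields the agreed rating when at
    least two raters agree, and the middle rating when all three differ.
    """
    return sorted((r1, r2, r3))[1]

# Hypothetical examples:
assert consensus_rating(4, 5, 4) == 4  # two of three raters agree
assert consensus_rating(3, 5, 4) == 4  # all three differ: middle rating
```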
Individual raters still seem to center their rating distributions differently on the FST scale. The other observation is that the ratings seem slightly more spread across types I to VI with the new exemplar-based tool compared to the baseline tool.

MORPH images are acquired in a controlled environment with the subject standing in front of an 18% gray background. This experiment is designed to normalize the face images so that the 18% gray region is, on average, the same across all images. The motivating hypothesis is that varying color quality between images may contribute to inconsistent manual Fitzpatrick ratings. The 18% gray is defined based on reflection; i.e., an 18% gray surface reflects 18% of the light that hits it [3]. The idea of 18% gray in photography is to achieve middle gray as perceived by humans. In different color spaces, middle gray may be defined differently. For example, in CIELAB, middle gray is defined to be 46.6% brightness [18], and in 24-bit color space, it is RGB (119, 119, 119) [1]. This experiment has taken RGB (119, 119, 119) as the 18% gray value [1]. The steps followed in color correction are (a code sketch of these steps is given at the end of this discussion):

• Semantic segmentation of person and background in the given face image using a pre-trained model.
• Extract all (R, G, B) pixel values corresponding to the image background.
• Find the mean pixel values of the red, green, and blue components of the background; call them (R_avg, G_avg, B_avg), where R_avg = mean(all R components of the background), and similarly for G_avg and B_avg.
• Find the color-correction factors from the background based on 18% gray: R_const = 119 / R_avg, G_const = 119 / G_avg, B_const = 119 / B_avg.
• Apply these color-correction factors to all the pixels in the original image. The corrected pixels are R_corrected = R_const × R_original, G_corrected = G_const × G_original, B_corrected = B_const × B_original, with values clipped at 255 rather than wrapped around.

We used a pre-trained model called DeepLab V3 [10] to segment the person and the background. DeepLab has Xception [11] as its network backbone and was pre-trained on ImageNet [14]. Subjectively, this color-correction step improves the visual quality of the original image (see Figure 2) and clearly makes the background more consistent across images.

We again repeated a set of ratings with the web-based tool, this time using the color-corrected versions of the original images and also having the exemplar images for reference. This set of comparison plots is given in Figure 3. The results suggest that color correction gives improved consistency, but the effect is small. The inter-rater agreement on the color-corrected images by three different viewers follows the same pattern whether the images are from the center of the distribution or the tail (see Figure 4). Different pairs of raters have different levels of agreement, but allowing for a one-rating-level difference, all pairs have 89% or better agreement.

Our sequence of manual skin tone rating experiments suggests that there is a level of inconsistency between human raters that is difficult to eliminate. Moreover, these experiments use relatively controlled images from the MORPH dataset. The inconsistencies observed here would likely be much greater for images from an in-the-wild dataset, and the color correction as implemented here would not be feasible for such images. Our analysis of the inconsistency in manual skin tone ratings motivates us to determine whether automated skin tone ratings can be effectively used in face recognition research.
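Before turning to the automated approach, the color-correction steps listed earlier can be summarized in code. The sketch below is a minimal illustration, not the exact implementation: it assumes the person/background segmentation mask has already been produced (e.g., by the DeepLab V3 model mentioned above), and the function name and mask convention are ours.

```python
import numpy as np

GRAY_18 = 119.0  # 18% gray taken as RGB (119, 119, 119) [1]

def color_correct(image: np.ndarray, background_mask: np.ndarray) -> np.ndarray:
    """Scale an RGB image so its background averages to 18% gray.

    image:           H x W x 3 uint8 RGB array (a MORPH-style face image).
    background_mask: H x W boolean array, True where a pixel belongs to the
                     background (from a semantic segmentation model).
    """
    img = image.astype(np.float64)
    # (R_avg, G_avg, B_avg): per-channel means over the background pixels.
    bg_mean = img[background_mask].mean(axis=0)
    # Correction factors, e.g. R_const = 119 / R_avg.
    factors = GRAY_18 / bg_mean
    # Apply the factors to every pixel; clip at 255 rather than wrapping.
    return np.clip(img * factors, 0.0, 255.0).astype(np.uint8)
```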
Automated ratings would be 100% consistent across two runs on the same image, repeatable across research groups, and much faster and cheaper to acquire, and they could be applied at a scale that is not feasible for manual ratings.

Prior research [13, 33] has shown that skin tone assessment can be done directly from an image using computer-based individual typology angle (ITA) measurement. The ITA is calculated in the CIELab color space, where L represents the lightness, a represents the chromaticity coordinate from green to red, and b represents the chromaticity coordinate from blue to yellow. In this approach, we utilized the ITA to represent skin color [9]; it is calculated according to equation 2:

ITA = arctan((L − 50) / b) × (180 / π)    (2)

ITA measurements are categorized into six skin type groups: very light (skin type I), light (skin type II), intermediate (skin type III), tan (skin type IV), brown (skin type V), and dark (skin type VI).

The selection of a suitable color space for skin detection is an important factor in the probability of success. Prior research [8, 37] has shown that the RGB color space is not usually preferred for color segmentation analysis because the brightness (luminance) component is not decoupled from the color information (chrominance). We can effectively utilize the chrominance information in the YCbCr color space for modeling human skin color, and hence we propose thresholding on the YCbCr color space channels for skin detection. The Y channel, representing brightness, cannot be constrained here because we are evaluating different datasets whose images may be taken under different lighting conditions; with the Y channel, it is difficult to determine whether variations in the distribution are caused by different skin colors or different lighting conditions. In Figure 5, we can see similar Cr and Cb distributions of skin color for Caucasian (see Figure 5a) and African-American (see Figure 5b) images, and they do not seem to be affected by variations in luminance. The evaluation of the Cb and Cr channels across different sets of images showed that they are consistent across different demographic groups, and hence we can achieve better skin detection by thresholding on those two channels of an input image. The ranges given for Cb and Cr in equation 1 were found to be the most suitable and representative of skin color for the different sets of images we have tested.

This automated approach utilizes color-corrected images for skin tone assignments. Initially, face detection, alignment, and cropping are done using Dlib. We used a model called BiSeNet (Bilateral Segmentation Network) [40] for the face skin segmentation task. BiSeNet was pre-trained on the CelebAMask-HQ [27] dataset, which has 30,000 face images from CelebA [29] and CelebA-HQ [22]. The eye and lip regions are masked out intentionally to avoid noise or occlusions such as sunglasses while estimating the actual skin tone. The extracted face skin may contain over-exposed or under-exposed skin pixels due to illumination conditions. Thresholding in the YCbCr color space is done to select the skin pixels that are most representative of a person's skin tone. Then, from all the selected skin pixels, we find the mean pixel value for that face. The mean pixel value is converted to the CIELab color space for its corresponding L and b values to compute the ITA. After finding the ITA measurement, we map it to skin types I to VI.
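This thresholding, mean-pixel ITA computation, and skin type mapping can be sketched in Python as follows. This is a minimal illustration rather than our exact implementation: the RGB-to-YCbCr conversion uses the standard full-range (JFIF) formulas, the CIELab conversion comes from scikit-image, and the skin type boundaries shown are the standard ITA ranges from [9] rather than the customized ranges of Figure 7, which are given only graphically.

```python
import math
import numpy as np
from skimage import color  # scikit-image, for the RGB -> CIELab conversion

def select_skin_pixels(rgb: np.ndarray) -> np.ndarray:
    """Keep pixels satisfying the Cb/Cr thresholds of equation 1 (step 3 below).

    rgb: N x 3 float array of candidate face-skin pixels, values in 0..255.
    """
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    # Full-range (JFIF) RGB -> YCbCr chrominance channels.
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return rgb[(136 <= cr) & (cr <= 173) & (77 <= cb) & (cb <= 127)]

def ita_degrees(mean_rgb: np.ndarray) -> float:
    """ITA of the mean skin pixel, per equation 2 above."""
    lab = color.rgb2lab(mean_rgb.reshape(1, 1, 3) / 255.0)
    L, b = lab[0, 0, 0], lab[0, 0, 2]
    # atan2 behaves like arctan((L - 50) / b) for the b > 0 typical of skin.
    return math.degrees(math.atan2(L - 50.0, b))

def ita_to_skin_type(ita: float) -> int:
    """Map ITA to skin types I..VI using the standard ranges from [9]."""
    for lower_bound, skin_type in [(55, 1), (41, 2), (28, 3), (10, 4), (-30, 5)]:
        if ita > lower_bound:
            return skin_type
    return 6  # dark

# Usage on hypothetical pixels (after Dlib cropping and BiSeNet masking):
pixels = np.random.default_rng(0).uniform(60.0, 220.0, size=(1000, 3))
skin = select_skin_pixels(pixels)
if skin.size:
    print("skin type:", ita_to_skin_type(ita_degrees(skin.mean(axis=0))))
```

Thresholding on Cb and Cr only, as in equation 1, keeps the pixel selection insensitive to the luminance variation discussed above.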
The major steps in this automated skin tone rating are:

1. Face detection, alignment, and cropping using the Dlib face detector (see Figure 6a).
2. Semantic segmentation of face skin from the detected face using the pre-trained BiSeNet model (see Figure 6b).
3. Conversion to the YCbCr color space and application of the following thresholding on the Cb and Cr channels for skin pixel selection (see Figure 6c):

pixel = skin, if 136 ≤ Cr ≤ 173 and 77 ≤ Cb ≤ 127; non-skin, otherwise    (1)

4. Calculation of the mean skin pixel and its corresponding ITA using equation 2.
5. Mapping of the final ITA measurement to the six skin type groups, from very light (skin type I) to dark (skin type VI) (see Figure 7).

To compare the consistency of manual consensus ratings with the ITA ranges for skin types I to VI specified in [9], we analyzed the 982 unique images from the center and the 967 unique images from the high-similarity tail of the African-American male impostor distribution that have manual consensus ratings from three different viewers (exemplar-guided ratings on the color-corrected images). The mapping of these images on the ITA scale shows that there is substantial overlap within the ITA ranges. To give a reasonable means of assigning images to six categories based on ITA, inspired by the Fitzpatrick ratings, we examined custom threshold ranges for the ITA skin type labels (the customized ranges are shown in Figure 7).

There were six images from the center of the African-American male impostor distribution and two images from its high-similarity tail reported as failure-to-detect by Dlib. From the center, 550 images (56.4%) had the same rating, 395 images had a one-skin-tone difference, 28 images had a two-skin-tone difference, and three images had a more-than-two-skin-tone difference between the automated approach and the manual consensus ratings. From the high-similarity tail, 530 images (54.9%) had the same rating, 405 images had a one-skin-tone difference, 29 images had a two-skin-tone difference, and one image had a more-than-two-skin-tone difference between the automated approach and the manual consensus ratings. Considering the same rating or up to one skin tone of difference, there is 96% or better consistency between the manual consensus ratings and the automated ratings (see Figure 8). The manual-automated comparison based on the customized ITA ranges for MORPH images shows that the automated-to-consensus-manual consistency is as good as the consistency between any two individual raters. This, along with reproducibility and ease of use, is a good reason to use the automated method.

This paper systematically analyzes approaches to estimating the skin tone of a person from an image. It describes methods for manual rating and proposes an automated skin tone assignment approach for greater ease of use, scalability, and reproducibility. We also discuss the pros and cons of each approach. The major conclusions and takeaways from these experiments are as follows.

The categorical labeling of skin tone by human observers can be subjective and inconsistent. The same images were observed to have been rated differently by different raters, or differently at different times by the same rater. Several studies note that the rating is subjective even for trained practitioners [6, 4]. The inter-rater agreement for the manual ratings by three different viewers shows that different pairs of persons have different levels of agreement, but allowing for a one-rating-level difference, all pairs have 89% or better agreement (see Figure 4). While skin tone rating from a color image may seem simple in concept, it is complex and quite challenging.
Prior research has been conducted on re-purposed images, with skin tone ratings assessed by humans in accordance with the Fitzpatrick scale [17, 16] and, more recently, by computer-based individual typology angle measurement [33]. Both of these techniques are limited in that the human ratings (for example, see Figure 9) and automated ratings (for example, see Figure 10) are often made on non-ideal images that have been taken in non-controlled environments. The ITA skin tone annotations given in the metadata for these images, from left to right, are -80, -75, -40, -85, -51, -58, and -62, respectively, and all of those map to the ITA skin type "dark".

The effort to automate the process of Fitzpatrick skin tone labeling from images is quite challenging. The analysis of ITA distributions for images consistently rated as I to VI by manual consensus rating shows that we can do retrospective labeling of images using the Fitzpatrick scale as a guide. The results, however, will be inherently noisy. We have not found strong evidence that this noise is due to the experience level of the raters; more likely, it can be attributed to the range of colors of skin-related pixels induced by varying lighting conditions and sensor characteristics.

The automated-to-consensus-manual consistency is as good as the consistency between two individual raters. The automated skin tone rating algorithm was developed and tested using mugshot-style images. The efficacy of this approach on in-the-wild face images is unknown and will need to be evaluated in future work.

References:

[1] CIE color calculator.
[3] What is middle grey and why does it even matter?
[4] Recent advances in facial soft biometrics. The Visual Computer.
[5] Options and challenges for facial rejuvenation in patients with higher Fitzpatrick skin phototypes.
[6] Automatic skin tone extraction for visagism applications.
[7] Gender shades: Intersectional accuracy disparities in commercial gender classification.
[8] Face segmentation using skin-color map in videophone applications.
[9] Skin colour typology and suntanning pathways.
[10] Rethinking atrous convolution for semantic image segmentation.
[11] Xception: Deep learning with depthwise separable convolutions.
[12] Demographic effects in facial recognition and their dependence on image acquisition: An evaluation of eleven commercial systems.
[13] Relationship between skin response to ultraviolet exposure and skin color type.
[14] ImageNet: A large-scale hierarchical image database.
[15] ArcFace: Additive angular margin loss for deep face recognition.
[16] The validity and practicality of sun-reactive skin types I through VI.
[18] Adopting ISO standards for museum imaging.
[19] DebFace: De-biasing face recognition.
[20] Ongoing Face Recognition Vendor Test (FRVT) part 3: Demographic effects.
[21] The effect of broad and specific demographic homogeneity on the imposter distributions and false match rates in face recognition algorithm performance.
[22] Progressive growing of GANs for improved quality, stability, and variation.
[23] Face recognition performance: Role of demographic information.
[24] Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A.
[25] Issues related to face recognition accuracy varying based on race and skin tone.
[26] Characterizing the variability in face recognition accuracy relative to race.
[27] MaskGAN: Towards diverse and interactive facial image manipulation.
[28] Absence of images of skin of colour in publications of COVID-19 skin manifestations.
[29] Deep learning face attributes in the wild.
[30] Facial recognition is accurate, if you're a white guy. The New York Times.
[31] An experimental evaluation of covariates effects on unconstrained face verification.
[32] IARPA Janus Benchmark-C: Face dataset and protocol.
[33] Diversity in Faces.
[34] Color-theoretic experiments to understand unequal gender classification accuracy from face images.
[35] Understanding unequal gender classification accuracy from face images.
[36] MORPH: A longitudinal image database of normal adult age-progression.
[37] Comparative study of skin color detection and segmentation in HSV and YCbCr color space.
[38] Biased and wrong? Facial recognition tech in the dock.
[39] Racial faces in the wild: Reducing racial bias by information maximization adaptation network.
[40] BiSeNet: Bilateral segmentation network for real-time semantic segmentation.