title: WebFace260M: A Benchmark for Million-Scale Deep Face Recognition
authors: Zhu, Zheng; Huang, Guan; Deng, Jiankang; Ye, Yun; Huang, Junjie; Chen, Xinze; Zhu, Jiagang; Yang, Tian; Du, Dalong; Lu, Jiwen; Zhou, Jie
date: 2022-04-21

Face benchmarks empower the research community to train and evaluate high-performance face recognition systems. In this paper, we contribute a new million-scale recognition benchmark, containing uncurated 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) training data, as well as an elaborately designed time-constrained evaluation protocol. Firstly, we collect 4M name lists and download 260M faces from the Internet. Then, a Cleaning Automatically by Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set, and we expect it to close the data gap between academia and industry. Referring to practical deployments, the Face Recognition Under Inference Time conStraint (FRUITS) protocol and a new test set with rich attributes are constructed. Besides, we gather a large-scale masked face sub-set for biometrics assessment under COVID-19. For a comprehensive evaluation of face matchers, three recognition tasks are performed under the standard, masked and unbiased settings, respectively. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without compromising performance. Enabled by WebFace42M, we reduce the failure rate by 40% on the challenging IJB-C set and rank 3rd among 430 entries on NIST-FRVT. Even 10% of the data (WebFace4M) shows superior performance compared with public training sets. Furthermore, comprehensive baselines are established under the FRUITS-100/500/1000 milliseconds protocols. The proposed benchmark shows enormous potential in standard, masked and unbiased face recognition scenarios. Our WebFace260M website is https://www.face-benchmark.org.

Recognizing faces in the wild has achieved remarkable success due to the boom of deep neural networks. The key engine of recent face recognition consists of network architecture evolution [37], [38], [47], [70], [74], [77], [98], a variety of loss functions [21], [23], [42], [45], [46], [53], [66], [73], [75], [78], [85], [86], [89], [96], and growing face benchmarks [11], [14], [35], [41], [44], [56], [57], [58], [63], [67], [83], [105], [113], [114]. Even though growing efforts have been devoted to investigating sophisticated networks and losses, academia is restricted by limited training sets and nearly saturated test protocols. As shown in Table 1 and Figure 1, the largest public training sets in terms of identities and faces are MegaFace2 [58] and MS1M [35], respectively. MegaFace2 contains 4.7M faces of 672K subjects collected from Flickr [82]. MS1M consists of 10M faces of 100K celebrities, but its noise rate is around 50% [83]. In contrast, companies in industry can access much larger private data to train face recognition models: Google utilizes 200M images of 8M identities to train FaceNet [66], and Facebook [79] performs training with 500M faces of 10M identities.
This data gap hinders researchers from pushing the frontiers of deep face recognition. The main obstacles to obtaining tremendous training data lie in large-scale identity collection, effective and scalable cleaning, and efficient training. For example, IMDB-Face [83] takes 50 annotators working continuously for one month to obtain 59K identities and 1.7M images, which is labor-intensive and non-scalable. On the other hand, test sets and protocols play an essential role in analyzing face recognition performance. Popular evaluations including the LFW families [41], [113], [114], CFP [67], AgeDB [57], RFW [91], MegaFace [44], and IJB-C [56] mainly target the pursuit of accuracy, and have become almost saturated recently. In real-world application scenarios, face recognition is always restricted by inference time, such as unlocking mobile telephones with a smooth user experience. The Lightweight Face Recognition (LFR) Challenge [22] takes a step toward this goal by constraining model size and FLOPs, but actual inference time can vary considerably across different networks. Besides, LFR neglects the time cost of face detection and alignment. To the best of our knowledge, NIST-FRVT [2] is the only time-constrained face recognition protocol. However, the strict submission policy (no more than one submission every four calendar months) hinders researchers from freely evaluating their algorithms.

To address the above problems, this paper constructs a new ultra-large-scale face benchmark consisting of 4M identities/260M faces (WebFace260M) as well as a time-constrained assessment protocol. Firstly, a name list of 4M celebrities is gathered and 260M images are downloaded utilizing a search engine. Then, we perform the Cleaning Automatically by Self-Training (CAST) pipeline, which is scalable and does not need any human intervention. The proposed CAST procedure results in high-quality 2M identities and 42M faces (WebFace42M). Meanwhile, rich attributes are provided for further analyzing the statistics of WebFace42M. Referring to various real-world applications, we design the Face Recognition Under Inference Time conStraint (FRUITS) protocol, enabling academia to test deep face matchers comprehensively. Specifically, FRUITS includes three time-limit tracks: 100, 500, and 1000 milliseconds, which are intended to evaluate deployments on mobile devices, local devices, and clouds, respectively. Since public evaluations are almost saturated [41], [57], [67] and may contain noise [44], [56], we manually construct a new test set with rich attributes to enable FRUITS. Considering the COVID-19 coronavirus epidemic [28], [48] and reported biased face recognition deployments [31], [90], [91], three evaluation tasks are performed: Standard Face Recognition (SFR), Masked Face Recognition (MFR), and Unbiased Face Recognition (UFR). For MFR, we collect a large-scale masked face test sub-set. Based on the proposed ultra-large-scale benchmark, we delve into million-scale deep face recognition problems. With such a data size, a distributed training framework is developed for efficient optimization, which achieves nearly linear acceleration without performance drops.
Accuracy on the public and proposed test sets indicates that our training data is indispensable for pushing the frontiers of deep face recognition: WebFace42M achieves 97.70% TAR@FAR=1e-4 on the challenging IJB-C [56] under the standard ResNet-100 configuration, reducing the relative error rate by nearly 40% compared with the state of the art. 10% of our data (WebFace4M) also obtains superior performance compared with the similar-sized MS1M families [1], [21], [23] and MegaFace2 [58]. On the proposed test set, a similar conclusion can be drawn. For SFR, WebFace42M decreases FNMR@FMR=1e-5 from 9.88% (with MS1MV2) to 2.98% under the same settings, reducing the error rate by more than a factor of three. For MFR, the error rates (FNMR@FMR=1e-5) of WebFace42M and MS1MV2 are 42.97% and 69.56%, respectively. For UFR, the tremendous scale of the WebFace data provides more room for data balancing and fairness exploration. Furthermore, we participate in NIST-FRVT [2] and rank 3rd among 430 entries based on WebFace42M. At last, we discuss privacy and bias issues in this benchmark.

For baseline comparisons, comprehensive deep face recognition systems are evaluated under the FRUITS-100/500/1000 milliseconds protocols. For SFR, different settings of face detection/alignment and feature extraction are explored, covering the MobileNet [15], [38], EfficientNet [80], AttentionNet [84], ResNet [37], SENet [39], ResNeXt [100] and RegNet [64] families. For MFR, the influence of mask augmentation in the WebFace training data is studied and a strong baseline is established for this difficult problem. In order to investigate the bias in current face recognition systems, we re-sample the WebFace data to obtain an attribute-balanced sub-set and evaluate its influence on recognition fairness. With this new face benchmark, we hope to close the data gap between academia and industry, and facilitate time-constrained recognition assessment for real-world applications. The main contributions of this work can be summarized as follows:

• An ultra-large-scale face recognition dataset is constructed for the research community towards closing the data gap behind the industry. The proposed WebFace260M consists of 4M identities and 260M faces, which provides an excellent resource for million-class deep face cleaning and recognition, as shown in Figure 1. The results indicate substantial insights on three different recognition settings.

This paper is built upon our conference work [118] and significantly extended in several aspects. Firstly, we provide comprehensive visualization results to illustrate the face benchmark. This gives deeper insight into the diversity and challenges of our training/test data. Secondly, the scale of our test set is significantly increased, making it more challenging for recognition evaluation. Besides, we collect a highly-curated masked sub-set, which contains 862 subjects with real-world masks. Lastly, for evaluation and experiments, standard face recognition is extended to the masked and unbiased settings. Corresponding mask augmentation and attribute-balanced baselines are also established. In addition, we present a detailed literature review for deep face recognition and benchmarks. Privacy and bias issues in the WebFace260M benchmark are also discussed. Since the preliminary version of this work was published, we have received dataset access applications^1 from nearly 400 research groups.
Based on the WebFace260M benchmark, we organize the Face Biometrics under COVID Workshop and Masked Face Recognition Challenge [18], [117] in ICCV 2021. More than 80 teams have participated in the WebFace260M Track under the FRUITS protocol and submitted more than 1,000 solutions^2. In the InsightFace Unconstrained Track, all top-3 teams from academia and industry adopt the WebFace260M database as their training set. These results suggest that our WebFace260M is not only an effective benchmark to pursue high-performance face recognition systems but also a meaningful step to reduce the data gap between research laboratories and companies.

Face recognition has been extensively studied in the computer vision literature. Recent years have witnessed significant advances in both benchmarks and algorithms, including growing training data, evolving evaluation sets, and loss function designs. This section reviews representative progress in academia and industry.

1. https://www.face-benchmark.org/download.html
2. https://competitions.codalab.org/competitions/32478#results

A key aspect in developing face recognition systems is the training data used to learn discriminative face representations. Data collection is extremely important but usually overlooked. Even though some companies have internally labeled private face sets that scale to millions of images [78] or even millions of subjects (Google [66] and Facebook [79]), the situation is quite different for publicly available collections. As shown in Table 1, we give the detailed statistics of widely used training sets in the community, such as CASIA-WebFace [105], VGGFace2 [14], UMDFaces [11], MS1M [35], MegaFace2 [58], and IMDB-Face [83].

CASIA-WebFace [105], VGGFace2 [14] and UMDFaces [11] consist of around 10K identities. CASIA-WebFace [105] is collected by a semi-automatic method, which searches face images of celebrities from the Internet. VGGFace2 [14] is an improved version of VGGFace [63], created to mitigate the deficiencies of its predecessor. The subjects in VGGFace2 are collected from celebrities and famous people such as professors and politicians. Compared to its predecessor, VGGFace2 contains fewer images for each identity but covers a larger range of poses, ages and races. To reduce label noise as much as possible, manual and automatic processes are employed. UMDFaces [11] utilizes a mix of human annotators via Amazon Mechanical Turk (AMT) and pre-trained deep-based face analysis tools to build a face dataset that is much tougher than previously available sets.

MS1M [35], MegaFace2 [58], and IMDB-Face [83] include more identities than the above-mentioned datasets. MS1M [35] retrieves around 100 images for each identity through the Bing search engine [3] using the celebrity's name without any filtering. Therefore, the quality of MS1M is severely degraded by the label noise, duplicated images, and non-face images present in the set. All of these factors make MS1M hard to use directly. MegaFace2 [58] contains 672K subjects cleaned from Flickr. However, this dataset only collects 4.7M faces, which results in around 7 images per identity. IMDB-Face [83] claims to be the largest noise-controlled face collection, containing 1.7M images of 59K celebrities obtained by manual annotation. However, it took 50 annotators working continuously for one month to clean the dataset, which demonstrates the difficulty of obtaining a large-scale clean dataset for face recognition.
Most popular evaluation sets for face recognition target the pursuit of accuracy. CFP [67], AgeDB [57], CALFW [114] and CPLFW [113] evaluate verification accuracy under different intra-class variations (such as pose and age). MegaFace [44] and IJB-C [56] serve to evaluate the accuracy of both large-scale face verification and identification. YTF [97] and IQIYI-Video [54] compare the accuracy of video-based verification. Model ensembles and post-processing [68] can be adopted for higher performance under these protocols. However, face recognition in real-world application scenarios is always restricted by inference time, such as unlocking mobile phones with a smooth experience or processing multiple channels of surveillance videos on clouds. Recently, the LFR Challenge [22] took a step toward this goal by constraining the FLOPs and model size of submissions. Since different neural network architectures can differ considerably in real inference time, this protocol is not a complete solution. Furthermore, it does not consider face detection and alignment, which are prerequisite components in most modern face recognition systems. To the best of our knowledge, NIST-FRVT [2] is the only benchmark employing a time-constrained protocol. However, the strict submission policy (participants can only send one submission every four calendar months) hinders researchers from freely evaluating their algorithms.

With the global COVID-19 pandemic and reports of biased system deployments, recent face recognition datasets focus on masked-face comparisons and fairness considerations. RFW [91] aims to evaluate the bias among 4 racial distributions. NIST-FRVT [34] regularly evaluates the bias level of submitted algorithms caused by demographic effects. Due to the sudden outbreak of the coronavirus epidemic, there is as yet no comprehensive real-world masked face recognition benchmark. Evaluations on simulated masked images [59], [61] may lead to questionable conclusions, and small-scale masked sets [10], [12], [17] cannot comprehensively reflect the performance of algorithms. The real-world masked test set RMFRD [95] consists of 525 identities and 5K masked faces, but it contains annotation noise.

The last decade has witnessed the advance of deep convolutional face recognition techniques. A number of successful face recognition systems, such as DeepFace [78], DeepID [73], [74], [75], [76], and FaceNet [66], have achieved impressive performance on face verification and identification. Most early works rely on metric-learning based losses [16], [66], [71], and recent research has switched to margin-based softmax losses due to their efficiency on large-scale datasets. SphereFace [53], AM-softmax [85], CosFace [89] and ArcFace [21] progressively improve the performance on various benchmarks to new levels. To further improve the margin-based softmax loss, recent works focus on the exploration of adaptive parameters [51], [52], [107], [108], inter-class regularization [27], [112], sample mining [42], [94], learning acceleration [9], [46], [49], [50], [106], etc.
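These margin-based softmax losses can be summarized in a commonly used unified form (a reference summary, not the notation of any single cited paper). With s denoting the feature scale, θ_j the angle between the normalized embedding and the j-th class center, y the ground-truth class, and margins (m_1, m_2, m_3):

$$\mathcal{L} = -\log\frac{e^{\,s\,(\cos(m_1\theta_y + m_2) - m_3)}}{e^{\,s\,(\cos(m_1\theta_y + m_2) - m_3)} + \sum_{j \neq y} e^{\,s\cos\theta_j}}$$

SphereFace corresponds to a multiplicative angular margin (m_1 > 1), ArcFace to an additive angular margin (m_2 > 0), and CosFace/AM-softmax to an additive cosine margin (m_3 > 0).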
There are also many complementary methods that build better face recognition models by promoting desired properties of the produced face representations, such as robustness to noisy labels [19], [40], [93], [109], [116], occlusions [72], [92], [115] and low quality [68], [69], invariance to age [43], [88], [111] and pose [87], [110], the ability to mitigate racial bias [32], [90], [91] and domain imbalance [13], [25], [45], [49], and improved fairness of representations [51], [101].

The knowledge-graph website Freebase [4] and the well-curated website IMDB [5] provide excellent resources for collecting celebrity names. Furthermore, commercial search engines are utilized to download face images for each collected name. In data pre-processing, faces are detected and aligned through five landmarks predicted by RetinaFace [20] (a sketch of this alignment step is given below). Specifically, the detection score threshold is set as 0.7 to filter out low-confidence faces. After pre-processing, there are 4M identities/260M faces (WebFace260M), as shown in Table 1. The statistics of WebFace260M are illustrated in Figure 2, including date of birth, nationality and profession. Persons in WebFace260M come from more than 200 distinct countries/regions and more than 500 different professions, with dates of birth going back to 1846, which guarantees great diversity in our training data. During the construction of the WebFace260M dataset, privacy and bias problems are our first concerns. A detailed discussion is available in Section 7.

We perform the CAST pipeline (Section 4) to automatically clean the noisy WebFace260M and obtain a curated training set named WebFace42M, consisting of 42M faces of 2M subjects. The number of faces per identity varies from 3 to more than 300, and the average is 21 faces per identity. As shown in Table 1 and Figure 1, WebFace42M offers the largest cleaned training data for face recognition. Compared with the MegaFace2 [58] dataset, the proposed WebFace42M includes 3 times more identities (2M vs. 672K) and nearly 10 times more images (42M vs. 4.7M). Compared with the widely used MS1M [35], our training set is 20 times (2M vs. 100K) and 4 times (42M vs. 10M) larger in terms of # identities and # photos. According to [83], there are more than 30% and 50% noises in MegaFace2 and MS1M, while the noise ratio of WebFace42M is lower than 10% (similar to CASIA-WebFace [105]) based on our sampling estimation. With such a large data size, we take a significant step towards closing the data gap between academia and industry.

We further provide face attribute statistics for WebFace42M. Figure 3 illustrates four random celebrities from our WebFace data. The original WebFace260M folders downloaded/detected from Internet images are very noisy, containing various wrong detections, unrelated persons, name repetitions, etc. The proposed CAST (next section) can automatically purify each folder to obtain cleaned faces for a certain identity. Specifically, there are large hairstyle and pose variations for the celebrity Kalenna Harper in WebFace42M. Faces of Claire Ayer and Gregg Wallace show different expressions and yaw angles. Finally, the proposed CAST cleaning pipeline covers a broad age range for the celebrity Shaun Weiss. The great diversity of WebFace42M guarantees its quality for training high-performance face recognition models.

Since the images downloaded from the web are considerably noisy, it is necessary to perform a cleaning step to obtain high-quality training data.
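For reference, the five-landmark alignment used in the pre-processing above is commonly implemented as a similarity transform onto a fixed template. Below is a minimal sketch assuming the widely used 112×112 reference coordinates; the template values and this implementation are assumptions, not details stated in the paper.

```python
# Minimal five-landmark alignment sketch (assumed implementation).
import cv2
import numpy as np
from skimage.transform import SimilarityTransform

# Widely used 112x112 reference positions for
# (left eye, right eye, nose tip, left mouth, right mouth) -- an assumption.
REFERENCE_5PTS = np.array(
    [[38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
     [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def align_face(image, landmarks, size=112):
    """Warp a detected face so its five landmarks match the reference template.
    image: HxWx3 array; landmarks: (5, 2) points predicted by the detector."""
    tform = SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=np.float32), REFERENCE_5PTS)
    matrix = tform.params[:2, :]  # 2x3 similarity (rotation + scale + shift)
    return cv2.warpAffine(image, matrix, (size, size))
```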
The original MS1M [35] does not perform any dataset cleaning, resulting in a noise ratio near 50% that significantly degrades the performance of trained models. VGGFace [63], VGGFace2 [14] and IMDB-Face [83] adopt semi-automatic or manual cleaning pipelines, which require expensive labor efforts. It becomes difficult to scale up the current annotation size to even more identities. Although the purification in MegaFace2 [58] is automatic, its procedure is complicated and there is still considerably more than 30% noise [83]. Another relevant exploration is to cluster faces via unsupervised approaches [55], [62] or supervised graph-based algorithms [103], [104]. However, these methods assume the whole dataset is curated, which is not suitable for the extremely noisy WebFace260M. Recently, self-training [99], [102], a standard approach in semi-supervised learning, has been explored to significantly boost the performance of image classification. Different from closed-set ImageNet classification [65], directly generating pseudo labels on open-set face recognition is impractical. Considering this inherent limitation, we carefully design the Cleaning Automatically by Self-Training (CAST) pipeline. Our first insight is that performing self-training on open-set face recognition data is a scalable and efficient cleaning approach. Secondly, we find that the embedding feature matters in cleaning ultra-large-scale noisy face data.

The overall CAST framework is shown in Figure 4. Following the self-training pipeline: (1) a Teacher model is trained on a public dataset (MS1MV2 [21]) to clean the original 260M images, which mainly consists of intra-class and inter-class cleaning; (2) a Student model is trained on the cleaned images from (1); since the data size is much larger, this Student generalizes better than the Teacher; (3) we iterate this process by switching the Student to the Teacher until high-quality 42M faces are obtained. It is worth noting that each intra-class/inter-class cleaning is conducted on the initial WebFace260M by different Teacher models. All Teacher/Student models adopt the ResNet-100 backbone and the ArcFace loss function, and other configurations are the same as those in the WebFace42M training setting (Section 6.1). The CAST pipeline is summarized in Algorithm 1.

Since WebFace260M contains various noises such as outliers within a folder and identity overlaps between folders, it is impractical to perform unsupervised or supervised clustering on the whole dataset. Based on the observation that image search results from Google are sorted by relevance and there is always a dominant subject in each search, the initial folder structure provides strong priors to guide the cleaning strategy: one folder always contains a dominant subject, and different folders may contain considerably overlapped identities. Following these priors, we perform dataset cleaning in a two-step procedure. Firstly, face clustering is conducted in parallel over the 4M folders (subjects) to select each dominant identity. Specifically, for each face in a folder, a 512-dimensional embedding feature is extracted by the Teacher model, and then DBSCAN [29] is utilized to cluster the faces in this folder. Only the largest cluster (more than 2 faces) in each folder is reserved. ε and n of DBSCAN indicate the maximum distance for the radius of a neighborhood and the minimum number of points required within this distance, respectively. We use (1 − ε) to denote the similarity of face embeddings in this paper.
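A minimal sketch of this intra-class step, assuming scikit-learn's DBSCAN over cosine distance (the paper specifies only ε, n, and the largest-cluster rule; everything else here is illustrative). The defaults correspond to the first CAST iteration, where (1 − ε) = 0.5 and n = 3 (Section 6.1).

```python
# Intra-class cleaning sketch: keep only the dominant identity per folder.
import numpy as np
from sklearn.cluster import DBSCAN

def clean_folder(features, eps=0.5, min_samples=3):
    """features: (N, 512) L2-normalized Teacher embeddings of one noisy folder.
    eps: maximum cosine distance between neighbors, i.e. similarity >= 1 - eps.
    Returns indices of the largest cluster, or an empty array if no cluster
    contains more than 2 faces (the folder is then discarded)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="cosine").fit_predict(features)
    clusters, counts = np.unique(labels[labels >= 0], return_counts=True)
    if clusters.size == 0 or counts.max() <= 2:
        return np.array([], dtype=int)
    dominant = clusters[np.argmax(counts)]
    return np.flatnonzero(labels == dominant)
```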
With more iterations of CAST, the model learns stronger face embeddings. A higher similarity threshold (1 − ε) tends to filter out more noisy faces, which is beneficial for creating cleaner datasets. Therefore, we empirically set larger values for the similarity (1 − ε) in later iterations of CAST and keep n fixed. We also investigate other designs of CAST in Section 6.4. Secondly, we compute the feature center of each subject to perform inter-class cleaning. Two folders are merged if their cosine similarity is higher than 0.7, and the folder containing fewer faces is deleted when the cosine similarity is between 0.5 and 0.7. As shown in Algorithm 1, lines 2-5 and lines 6-13 are the intra-class and inter-class cleaning processes, respectively.

Algorithm 1 Cleaning Automatically by Self-Training
1: for the i-th iteration do
2:   for every folder do
3:     Extracting embedding features with the i-th Teacher model
4:     Clustering faces by DBSCAN and reserving the largest cluster
5:   end for
6:   for every two subjects do
7:     Computing the cosine similarity of the feature centers
8:     if cosine similarity > 0.7 then
9:       Merging the two folders
10:    else if cosine similarity > 0.5 then
11:      Deleting the folder with fewer faces
12:    end if
13:   end for
14:   Training the i-th Student model on the i-th cleaned WebFace (not for the last iteration)
15:   Converting the i-th Student into the (i+1)-th Teacher model (not for the last iteration)
16: end for
17: return Cleaned WebFace data

The effectiveness of the above intra-class and inter-class cleaning heavily depends on the quality of the embedding feature, which is guaranteed by the proposed self-training pipeline. The ArcFace model trained on MS1MV2 with ResNet-100 provides a good initial embedding feature to perform the first-round cleaning of WebFace260M. This feature is then significantly enhanced with more training data in later iterations. Figure 5 illustrates the score distribution during different stages of CAST, which indicates a cleaner training set after more iterations. Furthermore, the ablation study in Table 7 also validates the effectiveness of the CAST pipeline. It is worth noting that the proposed CAST pipeline is compatible with any intra-class and inter-class strategies.

Remove duplicates and test set overlaps. After CAST, duplicated faces are removed when their cosine similarity is higher than 0.95. Furthermore, the feature center of each subject is compared with popular benchmarks (such as the LFW families [41], [113], [114], FaceScrub [60], IJB-C [56], the proposed test set, etc.), and overlaps are removed if the cosine similarity is higher than 0.7. Figure 6 illustrates the reserved and rejected face samples in our cleaning process. One can find that noisy and mislabeled faces are successfully rejected by the proposed CAST strategy, and most true positives are reserved in the cleaned WebFace42M set. Notably, there are diverse expressions and poses among the remaining faces, which clearly shows the effectiveness of our CAST and the high quality of our training data. The statistics of identities and faces during different cleaning stages are shown in Table 2. After face pre-processing of the downloaded images, there are 4,008,130 identities and roughly 260M faces in total.

In this section, we first introduce the time-constrained face recognition evaluation protocol, which covers various practical applications. Then, three benchmarking tasks (standard, masked, and unbiased face recognition) are detailed, including the corresponding background, test sets, and metrics.
As discussed in Section 2.2, most existing face recognition evaluations [41], [44], [56], [67], [97] are not time-constrained, while the proposed FRUITS protocol restricts the whole recognition pipeline (face detection, alignment, and feature embedding) within 100/500/1000-millisecond budgets. Since public evaluations are mostly saturated and may contain noise, we manually construct an elaborately designed test set for SFR, MFR and UFR. It is well known that recognizing strangers, especially when they are similar-looking, is a difficult task even for experienced vision researchers. Therefore, our multi-ethnic annotators only select celebrities they are familiar with, which ensures the high quality of the test data. Besides, annotators are encouraged to gather attribute-balanced faces, and recognition models are introduced to guide hard sample collection. The statistics of the final test set are listed in Table 3. In total, there are 60,926 faces of 2,478 identities. Rich attributes (such as age, race, gender, and scenario) are accurately annotated. Among all collected data, 57,715 faces are utilized for SFR evaluation. In the future, we will actively maintain and update this test set. In Figure 7, we show samples of our SFR test set, including Controlled and Wild faces. One can find great diversity in ages, poses, scenarios, etc.

Based on the proposed FRUITS protocol and test set, we perform standard 1:1 face verification across various attributes. Table 3 shows the numbers of impostor and genuine pairs in different verification settings. All means impostors are paired without attention to any attribute, while the other comparisons are conducted on age and scenario sub-sets. Cross-age refers to cross-age (more than 10 or 20 years of span) verification, while Cross-scene means pairs are compared between the Controlled and Wild settings. Different algorithms are measured by the False Non-Match Rate (FNMR) [2], which is defined as the proportion of mated comparisons below a threshold set to achieve the specified False Match Rate (FMR). FMR is the proportion of impostor comparisons at or above that threshold. Lower FNMR at the same FMR is better.

SFR systems usually work with mostly non-occluded faces, which include primary facial features such as the eyes, nose, and mouth. However, there are a number of circumstances in which faces are occluded by masks, such as in pandemics, medical settings, excessive pollution, or laboratories. According to WHO statistics, there are more than 235,408,082 confirmed COVID-19 cases including 4,809,149 deaths worldwide as of October 6, 2021. During the coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to face recognition. Traditional SFR may not effectively recognize masked faces, but removing the mask for authentication increases the risk of virus infection. To cope with the above-mentioned challenging scenarios arising from wearing masks, it is crucial to improve existing face recognition approaches. Recently, some commercial vendors [61] have developed face recognition algorithms capable of handling face masks, and an increasing number of research publications [10], [24], [26], [30], [36] have surfaced on this topic. However, due to the sudden outbreak of the epidemic, there is as yet no publicly available large-scale MFR benchmark. In contrast with simulated [59], [61] or relatively small [10], [12], [17], [95] masked test sets, a real-world comprehensive benchmark for evaluating MFR is developed in this work. Based on the SFR identities, we further collect masked faces for these celebrities. Specifically, as shown in Table 3, there are 3,211 carefully selected masked faces of 862 identities.
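All three tasks are scored with this FNMR-at-fixed-FMR protocol. As a concrete illustration, here is a minimal sketch of computing FNMR@FMR=1e-5 from verification scores; the quantile-based threshold selection and the synthetic scores are assumptions for illustration only.

```python
import numpy as np

def fnmr_at_fmr(genuine, impostor, target_fmr=1e-5):
    """FNMR at the threshold set to achieve the specified FMR.
    FMR: fraction of impostor comparisons at or above the threshold.
    FNMR: fraction of mated (genuine) comparisons below it."""
    threshold = np.quantile(np.asarray(impostor), 1.0 - target_fmr)
    return float(np.mean(np.asarray(genuine) < threshold))

# Toy usage with hypothetical cosine-similarity scores:
rng = np.random.default_rng(0)
gen = rng.normal(0.65, 0.10, 200_000)    # mated-pair scores (synthetic)
imp = rng.normal(0.10, 0.10, 2_000_000)  # impostor scores (synthetic)
print(f"FNMR@FMR=1e-5: {fnmr_at_fmr(gen, imp):.4f}")
```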
Subjects with real-world masks are illustrated in Figure 7. Wearing masks causes severe occlusion, leaving only the periocular area and above visible. Besides, there are varied mask types, colors, ways of wearing, and head poses in real-world applications, which are more practical and challenging than simulated ones. For MFR, assessment is performed with Mask-Nonmask comparisons: each pair consists of one masked face and one face from the standard face set. According to the attributes of the faces without masks, we evaluate the performance of algorithms under the Controlled-Masked, Wild-Masked, and All-Masked settings in Table 3.

Bias in face recognition means that a system provides higher accuracy for certain demographic groups and lower accuracy for others. According to the NIST-FRVT report [34], most submitted recognition algorithms from academia and industry exhibit different levels of biased performance. Deploying such systems may cause significant consequences such as racism. Recent UFR research [31], [81], [90], [91] mainly focuses on balanced data collection/sampling and debiased algorithm design. For evaluation, most tests are performed on the RFW set [91], which adopts LFW-like pair comparisons. In this paper, based on the proposed test data and the FRUITS protocol, we provide a more challenging and practical UFR evaluation. To enable a trustworthy face recognition system, it is important to investigate the performance on different facial attributes. As shown in Table 3, the test set of UFR is the same as the SFR one. We manually label race and gender attributes for unbiased evaluation. Fairness assessment of ethnicity is reported on the Caucasian, East Asian, and African groups, and similarly for gender. Following common practices [90] in the community, we adopt the skewed error ratio (SER) and standard deviation (STD) as fairness metrics. Specifically, the error rate of each race and gender group g is calculated as FNMR_g at FMR=1e-5; SER is the ratio of the highest to the lowest error rate among the groups (SER = max_g FNMR_g / min_g FNMR_g), and STD is the standard deviation of the error rates across different races and genders.

Based on the constructed WebFace260M benchmark, we dive deep into million-scale face recognition in this section. Firstly, implementations are detailed, containing parameter and environment configurations. Then we analyze the speed and performance of the distributed framework, which enables large-scale face recognition training. Thirdly, the WebFace data is compared with public counterparts, covering different losses and test sets. Furthermore, the proposed CAST strategy and its key procedures are studied. Lastly, we establish comprehensive baselines for SFR, MFR, and UFR, and report the performance on NIST-FRVT.

Hyper-parameters. In order to fairly evaluate the performance of different face recognition models, we reproduce representative algorithms (CosFace [89], ArcFace [21] and CurricularFace [42]) in one Gluon codebase. The margin values in CosFace [89], ArcFace [21] and CurricularFace [42] are set as 0.35, 0.5 and 0.5, respectively. Stochastic Gradient Descent (SGD) with momentum 0.9 and weight decay 0.0005 is utilized for network optimization. For large-batch training on a cluster, we employ distributed synchronous SGD, which parallelizes the tasks across machines. The default batch size per GPU is set as 64 unless otherwise indicated.
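To make these loss settings concrete, below is a minimal NumPy sketch of the CosFace/ArcFace margin operations with the margins stated above. The feature scale s = 64 and this standalone implementation are assumptions; the authors' experiments use a Gluon codebase.

```python
import numpy as np

def margin_logits(cos_theta, label, kind="arcface", s=64.0):
    """Apply a margin to the target-class logit, then scale by s.
    cos_theta: (N, C) cosines between normalized embeddings and class centers.
    label: (N,) integer class ids."""
    logits = cos_theta.copy()
    rows = np.arange(len(label))
    if kind == "cosface":    # additive cosine margin, 0.35 in the paper
        logits[rows, label] -= 0.35
    elif kind == "arcface":  # additive angular margin, 0.5 in the paper
        theta = np.arccos(np.clip(logits[rows, label], -1.0, 1.0))
        logits[rows, label] = np.cos(theta + 0.5)
    return s * logits

def softmax_cross_entropy(logits, label):
    """Numerically stable mean cross-entropy over a batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(label)), label].mean()
```

In the full pipeline, such logits would feed the distributed synchronous SGD described above, with momentum 0.9 and weight decay 0.0005.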
The learning rate is set as 0.05 for a single node (8 GPUs), and follows the linear scaling rule [33] for training on multiple nodes (0.05 × # machines). We decrease the learning rate by 0.1× at 8, 12, and 16 epochs, and stop at 20 epochs for all models. Gradual warmup [33] is adopted during the initial 1 epoch for single-node training and 5 epochs for multi-node training. During training, we only adopt horizontal-flip data augmentation. In DBSCAN, the similarity (1 − ε) is set as 0.5, 0.55, and 0.6 for the 1st, 2nd, and 3rd iterations, respectively. n is set as 3 for all iterations.

Hardware and software. All experiments are performed on a cluster containing 35 nodes, and each node contains 8 GPUs. When using the ultra-large-scale WebFace42M as the training data and computationally demanding backbones as the embedding networks, model optimization can take several weeks on a single machine. Such a long training time makes it difficult to perform experiments efficiently. Inspired by the distributed optimization on ImageNet [33], we apportion the workload of model training across the cluster. To this end, parallelization on both the feature X and the center W, mixed-precision (FP16) and large-batch training are adopted in this work (a toy sketch of this sharding is given below). The speed and performance of our distributed training system are illustrated in Table 4 and Figure 8. Parallelization on both the feature X and the center W as well as mixed precision (FP16) significantly reduces GPU memory consumption and speeds up the training process, while similar performance can be achieved. Equipped with 8 nodes (64 GPUs), the training speed is scaled to 12K samples/s and 11K samples/s on WebFace4M (10% data) and WebFace12M (30% data), respectively. The corresponding training time is only 2 hours and 6 hours. Furthermore, the scaling efficiency of our training system is above 80% when applied to the ultra-large-scale WebFace42M on 32 nodes (256 GPUs). Figure 8 shows the ideal speed, actual speed, and corresponding performance for increasing GPU resources. We can reduce the training time of the ResNet-100 model from 233 hours (1 node) to 9 hours (32 nodes) with comparable performance.

For comprehensively benchmarking the influence of training data, the proposed WebFace42M is compared with public counterparts including the MS1M families [1], [21], [23], [35], MegaFace2 [58] and IMDB-Face [83]. 10% (WebFace4M) and 30% (WebFace12M) random selections of our full data are also employed for further analyses. The statistics of the different training sets are illustrated in Table 1. Evaluation sets used in this experiment include popular verification sets (LFW [41], CALFW [114], CPLFW [113], AgeDB [57], CFP-FP [67]), RFW [91], MegaFace [44], IJB-C [56] and our test set. As we can see from Table 5 and Figure 9, the proposed WebFace42M breaks the bottleneck of training data for deep face recognition across various loss functions and test sets. In summary, almost all best accuracies on each test set are achieved by the proposed WebFace data, as shown in Table 5. Specifically, WebFace42M reduces the error rate by 40% on the challenging IJB-C dataset compared with MS1MV2, boosting TAR@FAR=1e-4 from 96.03% to 97.70% with ResNet-100 and ArcFace. Along with the increment of data scale (10%, 30%, and 100%), there exists a consistent improvement in performance as observed in Figure 9, enveloping its public counterparts. On the MegaFace identification task, WebFace42M sets a new state of the art with a 99.11% rank-1 score.
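Returning to the distributed framework: the toy NumPy simulation below illustrates the math of sharding the classification center W across workers, as referenced above. Collective communication is only mimicked with in-process reductions; this is an illustration, not the authors' implementation.

```python
import numpy as np

def sharded_softmax_ce(X, W_shards, label):
    """X: (N, D) features gathered across data-parallel workers.
    W_shards: list of (D, C_k) center shards, one per worker (sum C_k = C).
    label: (N,) global class ids. Each worker computes logits only for its
    shard; a max-reduce and a sum-reduce then replace a full gather of W."""
    local_logits = [X @ W for W in W_shards]               # per-worker logits
    global_max = np.max([l.max(axis=1) for l in local_logits], axis=0)
    denom = sum(np.exp(l - global_max[:, None]).sum(axis=1)
                for l in local_logits)                     # all-reduce(sum)
    # Each worker owns a contiguous block of classes; fetch the target logit
    # from whichever shard owns each label.
    offsets = np.cumsum([0] + [W.shape[1] for W in W_shards])
    target = np.empty(len(label))
    for k, logits in enumerate(local_logits):
        owned = (label >= offsets[k]) & (label < offsets[k + 1])
        target[owned] = logits[owned, label[owned] - offsets[k]]
    return float(np.mean(-(target - global_max) + np.log(denom)))

# Toy usage: 15 classes split over 3 simulated workers.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W_shards = [rng.normal(size=(8, 5)) for _ in range(3)]
print(sharded_softmax_ce(X, W_shards, label=np.array([0, 6, 11, 14])))
```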
On our test set, for the All pairs comparison of SFR, WebFace42M decreases FNMR@FMR=1e-5 from 9.88% to 2.98%, reducing the error rate by more than a factor of three. For the All-Masked pairs comparison of MFR, the FNMR@FMR=1e-5 metric is reduced from 69.56% to 42.97% by the proposed training data. Meanwhile, due to the tremendous scale and diversity of WebFace, it benefits UFR according to the SER metric. Furthermore, the models trained on 10% of the data, WebFace4M, impressively achieve superior performance compared to ones trained on the MS1M families and MegaFace2, which include even more faces. Undisputedly, this comparison confirms the effectiveness and necessity of our WebFace42M in leveling the playing field for million-scale face recognition. Besides reporting the face recognition results of ResNet-100, we also train ArcFace models using a ResNet-14 network on different portions of our data (10%, 30% and 100%). As given in Table 6, there is also a consistent performance gain for ResNet-14 when more training data are progressively employed. Therefore, the proposed WebFace42M is not only beneficial to the large model (ResNet-100) but also valuable for the lightweight model. This is of significance for mobile devices like cellphones, and we explore smaller models in Section 6.5.

As shown in Table 7, the CAST pipeline is compared with other cleaning strategies on the original MS1M [35] and WebFace260M.

TABLE 5: Performance comparisons of our WebFace and public training data. ResNet-100 backbone without flip test is adopted. RFW refers to the average accuracy on [91]; MegaFace refers to rank-1 identification and verification scores on [44]; IJB-C is TAR@FAR=1e-5 and 1e-4 on [56]. On our test set, FNMR@FMR=1e-5 on All pairs of SFR, FNMR@FMR=1e-5 on All-Masked pairs of MFR, and the race SER metric of UFR are reported (lower is better). The best results are marked in bold.

Specifically, for the MS1M results, the initial Teacher model is trained on IMDB-Face [83] using ResNet-100 and ArcFace. Then, CAST is conducted on the noisy MS1M following Section 4. After several iterations, our fully automatic cleaning strategy provides highly-curated data for model training, outperforming the semi-automatic methods used in [1], [21], [23]. Compared with the most recent GCN-based cleaning [109], the data cleaned by CAST also achieves higher performance. Table 7 also shows the increasing data purity after more iterations on MS1M and WebFace260M. The accuracy gradually increases from the 1st to the 3rd iteration, while the 4th iteration shows saturated performance. Therefore, we set the iteration number as 3 for CAST.

Performance improvement reporting of CAST. From Table 7, one can find that recognition performance improves as the cleaning iterations continue. Specifically, after the first-round iteration, the WebFace data results in 97.28% TAR@FAR=1e-4 on IJB-C, which is already much higher than the MS1MV2 result (96.03%). There are still some mislabeled identities which cause performance degradation. As the cleaning iterations continue, the number of identities decreases, while the resulting WebFace datasets further obtain 97.51% and 97.70% TAR@FAR=1e-4 on IJB-C. Results on Pairs and MegaFace lead to similar conclusions.

TABLE 7: Pairs refers to the average accuracy on [41], [57], [67], [113], [114]; MegaFace and IJB-C refer to rank-1 identification and TAR@FAR=1e-4, respectively. For MS1M and WebFace cleaned by CAST, different iterations are compared; CAST-1 means the first-round iteration.
Figure 10 (a) also illustrates the performance of MobileFaceNet trained with WebFace42M, which shows very limited recognition accuracy. Considering the trade-off between cleaning cost and quality, we choose ResNet-100 for the Teacher/Student models and MS1MV2 for the initial Teacher model in CAST. Furthermore, we investigate the influence of identity duplicates between MS1M and the WebFace datasets. As described in Section 3.1, our WebFace celebrity name list consists of two parts: the first is borrowed from MS1M and the second is collected from the IMDB database. Since the MS1M identities are mostly a subset of the WebFace ones, few identities would remain in MS1M if the duplicates were removed. Alternatively, we remove the MS1M identities from WebFace, which is denoted as WebFace-no-MS1M. Then, WebFace and WebFace-no-MS1M are cleaned following the same settings (the Teacher model is initialized with the recognition model trained on MS1M). We find that these two datasets obtain similar performance: for the IJB-C TAR@FAR=1e-4 metric, training sets cleaned from WebFace and WebFace-no-MS1M result in 97.70% and 97.67%, respectively. It is worth noting that the MS1M name list has only 100K identities, while the WebFace name list has 4M identities. This comparison shows that duplicated identities between MS1M and WebFace do not influence the effectiveness of CAST.

Intra-class cleaning. In this ablation study, we compare different intra-class cleaning modules under the framework of CAST. Both unsupervised (such as K-means [55] and DBSCAN [29]) and supervised styles (such as GCN-D [104] and GCN-V [103]) are explored to find the dominant subject in each noisy folder. As shown in Table 9, DBSCAN achieves 96.55% TAR@FAR=1e-4 on IJB-C, significantly outperforming K-means (96.03%) and slightly surpassing the supervised GCN-based ones (96.48% for GCN-D and 96.42% for GCN-V). As the GCN-based strategies may be sub-optimal for the extremely noisy folders, we finally select DBSCAN [29] as our intra-class cleaning module.

TABLE 9: Comparisons of different intra-class cleaning strategies for MS1M. ResNet-100 backbone with ArcFace loss is adopted here.

In this section, we set up a series of SFR baselines under the proposed FRUITS protocol. Table 10 illustrates various face recognition systems (including different settings of detection, alignment, and feature embedding) and their inference time. In our SFR baselines, representative network architectures are explored, covering the MobileNet [15], [38], EfficientNet [80], AttentionNet [84], ResNet [37], SENet [39], ResNeXt [100] and RegNet [64] families. All models are trained on WebFace42M with ArcFace. Due to the strict time limitation, the FRUITS-100 track can only adopt lightweight architectures, including RetinaFace-MobileNet-0.25 [20] for face detection and alignment, and ResNet-14, MobileFaceNet (Flip), EfficientNet-B0 and RegNet-800MF for face feature extraction. FMR-FNMR plots on All pairs and analyses of attributes are shown in Figure 10 (a) and Figure 11 (a). Because of the weak detection and recognition modules, the best baseline (RegNet-800MF) only obtains 12.41% FNMR@FMR=1e-5 (lower is better). Therefore, there remains substantial room for future improvement under the FRUITS-100 protocol. For the FRUITS-500 protocol, we can employ more capable modern networks, such as RetinaFace-ResNet-50 [20] for pre-processing, and ResNet-100, ResNet-50 (Flip), SENet-50, ResNeXt-100, and RegNet-8GF for feature embedding extraction.
As shown in Figure 10 (b) and Figure 11 (b), ResNet-100 exhibits the best overall performance, scoring 2.98% FNMR@FMR=1e-5. For the attribute evaluations, ResNet-100 also achieves the lowest FNMR according to the indicators of age, scenario, race and gender. Recognition models under the FRUITS-1000 protocol can be more complicated and powerful, therefore we explore ResNet-100 (Flip), ResNet-200, SENet-152, AttentionNet-152 and RegNet-16GF for face feature representations. As shown in Figure 10 (c) and Figure 11 (c), ResNet-200 performs best in face verification and wins all attribute comparisons. Compared with the lightweight FRUITS-100 track, the performance of different large models is much closer. This result implies that new designs need to be explored for the heavyweight FRUITS tracks.

Recognizing identities with masks may be the most challenging face recognition problem, and it is essential for biometric authentication during COVID-19. Based on the proposed WebFace benchmark, we perform MFR in this part and establish a series of baselines for different training settings. ResNet-100 under the FRUITS-500 protocol and ArcFace are adopted, while similar conclusions can be drawn for other backbones or losses. As shown in Table 11, the best-performing public training set, MS1MV2, only scores 71.81% FNMR@FMR=1e-5 for All-Masked comparisons on our test set (with ArcFace loss). Different proportions of the WebFace data reduce FNMR to 70.97%, 56.47% and 47.25% respectively, which again shows the superiority of the proposed dataset. Moreover, we augment the training data with simulated masks to further investigate this difficult recognition scenario. Specifically, the mask renderer in [7] is applied to each face, with the wearing height and mask type randomly chosen. Both simulated and original faces are trained together for 20 epochs. In Table 11, one can find that this simple augmentation strategy effectively boosts MFR accuracy. For all data proportions, the FNMR improvements are near 10%, which builds strong baselines for future MFR research.

Generalized deployments call for robust and fair face recognition systems. In this part, we perform data sampling on the WebFace data and investigate its influence on fairness on our test set. Table 12 indicates that MS1MV2 and WebFace4M show considerable bias among race and gender: the SER scores are 1.40, 1.45, 1.88, and 1.92, respectively. The relative rank of different models (trained with WebFace42M) on race and gender attributes is also illustrated in Figure 11. Thanks to the ultra-large scale of the proposed benchmark, we can sample a balanced race/gender sub-set denoted as WebFace4M-Balanced. According to the STD and SER scores, this sampled training data reduces the recognition bias to some extent, surpassing MS1MV2 and WebFace4M. It is worth noting that there is still demographic bias on the test set even with WebFace4M-Balanced training, which is consistent with the observations in previous studies [90], [91]. More bias-mitigating solutions need to be developed, such as loss design, augmentation, and adversarial learning. In summary, the results show the great potential of the WebFace data (including training and test sets) for more fair and robust face recognition systems. Based on the proposed WebFace benchmark, we hope to spark UFR research in the future. Finally, we report the submission to NIST-FRVT.
Following the settings of FRUITS-1000, our system is built on RetinaFace-ResNet-50 for detection and alignment, and ArcFace-ResNet-200 trained on WebFace42M for feature embedding extraction. The network is accelerated by OpenVINO [8] and the flip test is adopted. The final inference time is nearly 1300 milliseconds according to the NIST-FRVT report, meeting the latest 1500-millisecond limitation. Table 13 illustrates the top-ranking entries measured by FNMR across six tracks^3. Our model trained on WebFace42M ranks 3rd overall among 430 submissions, showing impressive performance across different tracks. Specifically, the proposed solution based on WebFace ranks 5th and 3rd on the controlled Visa and Mugshot scenarios, respectively. On the more challenging cross-age comparisons (Mugshot DT≥12Years), we obtain 2nd place. For the less-controlled VisaBorder, Border and Wild tracks, state-of-the-art performances are also achieved. Considering the hundreds of company entries to NIST-FRVT, WebFace42M takes a significant step towards closing the data gap between academia and industry.

3. According to the report of October 9, 2020.

Discussion. During WebFace260M dataset construction, privacy and bias issues are our primary concerns. For privacy protection, all face images, including training and test data, are collected from public Internet resources. For data download, we provide strict access for qualified research groups that sign the license, and try our best to ensure that WebFace260M is used for research purposes only. Regarding the bias problem, our dataset has diverse birth dates, poses and ages, while gender and race are inevitably biased due to complex nationality and profession distributions. In the evaluations of this work, we specifically design the Unbiased Face Recognition (UFR) task to study the influence of balanced training data. We argue that the community could develop more bias-mitigating solutions based on the ultra-large-scale WebFace260M benchmark.

In this paper, we have dived into the million-scale face recognition problem, contributing a tremendous noisy dataset with 260M faces, a high-quality training dataset with 42M images of 2M identities obtained by automatic cleaning, a test set containing rich attributes and a large-scale masked face sub-set, a time-constrained evaluation protocol, a distributed framework with linear acceleration, a succession of baselines on various scenarios, as well as a final state-of-the-art model. Equipped with this publicly available face dataset, our model significantly reduces the failure rate by 40% on IJB-C and ranks 3rd among 430 entries on NIST-FRVT. Besides, baselines built on the proposed WebFace show great potential for masked and unbiased recognition tasks. We hope this benchmark can close the data gap behind the industry and facilitate future research on ultra-large-scale face recognition.

References
Partial FC: Training 10 million identities on a single machine
Masked face recognition for secure authentication
UMDFaces: An annotated face dataset for training deep networks
Masked face recognition competition. IJCB Workshop
Domain balancing: Face recognition on long-tailed domains
VGGFace2: A dataset for recognising faces across pose and age
MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices
Learning a similarity metric discriminatively, with application to face verification
Extended evaluation of the effect of real and simulated masks on face recognition performance
Masked face recognition challenge
Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces
RetinaFace: Single-shot multi-level face localisation in the wild
ArcFace: Additive angular margin loss for deep face recognition
Lightweight face recognition challenge
Marginal loss for deep face recognition
Masked face recognition with latent part detection
Semi-siamese training for shallow face learning
Towards NIR-VIS masked face recognition
UniformFace: Learning deep equidistributed representation for face recognition
To mask or not to mask: Modeling the potential for face mask use by the general public to curtail the COVID-19 pandemic
A density-based algorithm for discovering clusters in large spatial databases with noise
Masked face recognition with generative data augmentation and domain constrained ranking
Jointly de-biasing face recognition and demographic attribute estimation
Mitigating face recognition bias via group adaptive classifier
Accurate, large minibatch SGD: Training ImageNet in 1 hour
Face Recognition Vendor Test (FRVT): Part 3, Demographic Effects
MS-Celeb-1M: A dataset and benchmark for large-scale face recognition
Efficient masked face recognition method during the COVID-19 pandemic
Deep residual learning for image recognition
MobileNets: Efficient convolutional neural networks for mobile vision applications
Squeeze-and-excitation networks
Noise-tolerant paradigm for training face recognition CNNs
Labeled faces in the wild: A database for studying face recognition in unconstrained environments
CurricularFace: Adaptive curriculum learning loss for deep face recognition
When age-invariant face recognition meets face age synthesis: A multi-task learning framework
The MegaFace benchmark: 1 million faces for recognition at scale
GroupFace: Learning latent groups and constructing group-based representations for face recognition
BroadFace: Looking at tens of thousands of people at once for face recognition
ImageNet classification with deep convolutional neural networks
Association of social distancing and face mask use with risk of COVID-19
Dynamic class queue for large scale face recognition in the wild
Virtual fully-connected layer: Training a large-scale face recognition dataset with limited computational resources
Fair loss: Margin-aware reinforcement learning for deep face recognition
AdaptiveFace: Adaptive margin and sampling for face recognition
SphereFace: Deep hypersphere embedding for face recognition
IQIYI-VID: A large dataset for multi-modal person identification
Least squares quantization in PCM. TIT
IARPA Janus Benchmark C: Face dataset and protocol
AgeDB: The first manually collected in-the-wild age database
Level playing field for million scale face recognition
Deep neural architecture for face mask detection on simulated masked face dataset against COVID-19 pandemic
A data-driven approach to cleaning large face datasets
Ongoing Face Recognition Vendor Test (FRVT) Part 6B: Face recognition accuracy with face masks using post-COVID-19 algorithms
Clustering millions of faces by identity
Deep face recognition
Designing network design spaces
ImageNet large scale visual recognition challenge
FaceNet: A unified embedding for face recognition and clustering
Frontal to profile face verification in the wild
Probabilistic face embeddings
Towards universal representation learning for deep face recognition
Very deep convolutional networks for large-scale image recognition
Improved deep metric learning with multi-class N-pair loss objective
Occlusion robust face recognition based on mask learning with pairwise differential siamese network
Deep learning face representation by joint identification-verification
DeepID3: Face recognition with very deep neural networks
Deep learning face representation from predicting 10,000 classes
Deeply learned face representations are sparse, selective, and robust
Going deeper with convolutions
DeepFace: Closing the gap to human-level performance in face verification
Web-scale training for face identification
Rethinking model scaling for convolutional neural networks
A comprehensive study on face recognition biases beyond demographics
YFCC100M: The new data in multimedia research
The devil of face recognition is in the noise
Residual attention network for image classification
Additive margin softmax for face verification
NormFace: L2 hypersphere embedding for face verification
Pseudo facial generation with extreme poses for face recognition
Decorrelated adversarial learning for age-invariant face recognition
CosFace: Large margin cosine loss for deep face recognition
Mitigate bias in face recognition using skewness-aware reinforcement learning
Masked face recognition dataset in the wild: Reducing racial bias by information maximization adaptation network
Hierarchical pyramid diverse attention networks for face recognition
Co-mining: Deep face recognition with noisy labels
Misclassified vector guided softmax loss for face recognition
A discriminative feature learning approach for deep face recognition
Face recognition in unconstrained videos with matched background similarity
A light CNN for deep face representation with noisy labels
Self-training with noisy student improves ImageNet classification
Aggregated residual transformations for deep neural networks
Consistent instance false positive improves fairness in face recognition
Billion-scale semi-supervised learning for image classification
Learning to cluster faces via confidence and connectivity estimation
Learning to cluster faces on an affinity graph
Learning face representation from scratch
Accelerated training for massive classification via dynamic class selection
AdaCos: Adaptively scaling cosine logits for effectively learning deep face representations
P2SGrad: Refined gradients for optimizing deep face models
Global-local GCN: Large-scale label noise cleansing for face recognition
Towards pose invariant face recognition in the wild
Towards age-invariant face recognition
RegularFace: Deep face recognition via exclusive regularization
Cross-Pose LFW: A database for studying cross-pose face recognition in unconstrained environments
Cross-Age LFW: A database for studying cross-age face recognition in unconstrained environments
Learning from the web: Webly supervised meta-learning for masked face recognition
Unequal-training for deep face recognition with long-tailed noisy data
Masked face recognition challenge
WebFace260M: A benchmark unveiling the power of million-scale deep face recognition