key: cord-0472661-luviebw8 authors: Deng, Jiankang; Guo, Jia; An, Xiang; Zhu, Zheng; Zafeiriou, Stefanos title: Masked Face Recognition Challenge: The InsightFace Track Report date: 2021-08-18 journal: nan DOI: nan sha: 5293625526b83047a7a5e62573fc4ecae92dcc85 doc_id: 472661 cord_uid: luviebw8 During the COVID-19 coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to deep face recognition. In this workshop, we organize Masked Face Recognition (MFR) challenge and focus on bench-marking deep face recognition methods under the existence of facial masks. In the MFR challenge, there are two main tracks: the InsightFace track and the WebFace260M track. For the InsightFace track, we manually collect a large-scale masked face test set with 7K identities. In addition, we also collect a children test set including 14K identities and a multi-racial test set containing 242K identities. By using these three test sets, we build up an online model testing system, which can give a comprehensive evaluation of face recognition models. To avoid data privacy problems, no test image is released to the public. As the challenge is still under-going, we will keep on updating the top-ranked solutions as well as this report on the arxiv. Recently, great progress has been achieved in face recognition with large-scale training data [14, 24, 1, 39] , sophisticated network structures [26, 16] and advanced loss designs [28, 29, 26, 25, 8, 20, 31, 30, 4, 3, 18, 6] . However, existing face recognition systems are presented with mostly non-occluded faces, which include primary facial features such as the eyes, nose, and mouth. During the COVID-19 coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to existing face recognition systems. Traditional face recognition systems may not effectively recognize the masked faces, but removing the mask for authentication will increase the risk of virus infection. To cope with the above-mentioned challenging scenarios arising from wearing facial masks, it is crucial to improve the existing face recognition approaches 2 . Generally, there are two kinds of methods to overcome masked face recognition: (1) recovering unmasked faces for feature extraction and (2) producing direct occlusion-robust face feature embedding from masked face images. Based on Generative Adversarial Network (GAN) [13] , there are many identity-preserved masked face restoration methods [9, 11] . In [9] , masked face images are first segmented and then impainted with fine facial details while retaining the global coherency of face structure. Ge et al. [11] propose identity-preserved inpainting to facilitate occluded face recognition. The core idea is integrating GAN with an optimized pre-trained CNN model which serves as the third player to compete with the generator by enabling the inpainted faces to be close to their identity centers. Since occlusion recovery methods [9, 11] are more complicated to set up the online evaluation toolkit, we focus on occlusion-robust face feature embedding in this challenge. In [36] , a new partial face recognition approach is proposed by using local texture set matching to recognize persons of interest from their partial faces. In [15] , a masked-aware face feature embedding is proposed by extracting deep features from the unmasked regions (mostly eyes and forehead regions). In [22] , masked face augmentation and extra mask-usage classification loss is proposed to train mask robust facial feature embedding. 
In [19, 35], visual attention mechanisms are employed to enhance feature learning from non-occluded face regions. Even though there are some existing explorations of occluded (masked) face recognition, there is not yet a publicly available large-scale masked face recognition benchmark, due to the sudden outbreak of the epidemic. In this report, we make a significant step further and propose a new comprehensive benchmark for masked face recognition as well as non-masked face recognition. To this end, we have collected a real-world masked test set, a children test set, and a multi-racial test set (i.e. African, Caucasian, South Asian and East Asian [37, 12, 34, 33]). We define different sub-tracks with fixed training data, and each sub-track has strict constraints on computational complexity and model size. Therefore, the performance comparison between different models can be fair. Using the proposed test data, we organized the InsightFace track of the Masked Face Recognition Challenge (ICCV 2021). This report presents the details of this track, including the training data, the test sets, evaluation protocols, baseline solutions, performance analysis of the top-ranked submissions received as part of the competition, and effective strategies for masked face recognition. The report of the parallel WebFace260M track is available in [38].

As given in Tab. 1, we employ two existing datasets (i.e. MS1M [14] and Glint360K [1]) as the training data.

MS1M: The MS1M training dataset is cleaned from the MS-Celeb-1M [14] dataset. All face images are preprocessed to the size of 112 × 112 by the five facial landmarks predicted by RetinaFace [5]. Then, a semi-automatic refinement is conducted by employing the pre-trained ArcFace [4] model and ethnicity-specific annotators [7]. Finally, the refined MS1M dataset contains 5.1M images of 93K identities.

Glint360K: The Glint360K training dataset is cleaned from the MS-Celeb-1M [14] and Celeb-500K [2] datasets. All face images are downloaded from the Internet and preprocessed to the size of 112 × 112 by the five facial landmarks predicted by RetinaFace [5]. Then, an automatic refinement is conducted by employing the pre-trained ArcFace [4] model for intra-class and inter-class cleaning. Finally, the released Glint360K dataset contains 17M images of 360K individuals, which is one of the largest and cleanest training datasets [39] in academia.

The training data (i.e. MS1M and Glint360K) are fixed to facilitate performance reproduction and fair comparison. Detailed requirements:
• No external dataset is allowed and no pre-trained model is allowed.
• All participants must use the predefined training dataset for a particular challenge track. Data augmentation with facial masks is allowed, but the augmentation method needs to be reproducible.

As shown in Tab. 2 and Fig. 1, we manually collected the following three test sets for the comprehensive evaluation of different algorithms. Unlike existing face recognition test sets (e.g. LFW [17], CFP-FP [27], AgeDB [23], and IJB-C [21]), our test sets are not collected from celebrities, so we naturally avoid the identity-overlapping problem. The pre-processing step for the test sets is the same as that on the training data: all faces are normalized to 112 × 112 by using RetinaFace [5] (a sketch of this alignment step is given below). We also employ a semi-automatic method to strictly ensure that (1) the test sets are mostly noise-free and (2) there is no identity overlap between our training data and the test sets.
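The 112 × 112 normalization mentioned above is the standard five-landmark alignment: a similarity transform maps the detected landmarks onto a fixed template. The following is a minimal sketch, assuming OpenCV and the template coordinates commonly used in the open-source ArcFace/InsightFace alignment code (the report itself does not list them, so they are an assumption here):

```python
import cv2
import numpy as np

# Reference landmark template for 112x112 crops (left eye, right eye,
# nose tip, left mouth corner, right mouth corner), as commonly used
# in open-source ArcFace/InsightFace alignment code.
ARCFACE_TEMPLATE = np.array(
    [[38.2946, 51.6963],
     [73.5318, 51.5014],
     [56.0252, 71.7366],
     [41.5493, 92.3655],
     [70.7299, 92.2041]], dtype=np.float32)

def align_face(image, landmarks, size=112):
    """Warp `image` so that the five detected landmarks (a 5x2 array,
    e.g. predicted by RetinaFace) match the fixed template."""
    landmarks = np.asarray(landmarks, dtype=np.float32)
    # Estimate a similarity transform (rotation + uniform scale + shift).
    matrix, _ = cv2.estimateAffinePartial2D(landmarks, ARCFACE_TEMPLATE,
                                            method=cv2.LMEDS)
    return cv2.warpAffine(image, matrix, (size, size))
```

RetinaFace [5] supplies the five landmarks here; any detector producing the same five points could be substituted without changing the crop geometry.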
Multi-racial Test Set: Participants will also have their algorithms tested on the multi-racial test set to fairly evaluate performance on different demographic groups. The multi-racial test set consists of four demographic groups: African, Caucasian, South Asian and East Asian [37, 12, 34, 33]. In total, there are 1.6M images of 242K identities.

The test aims to determine whether, and to what degree, face recognition performance differs when systems process photographs of masked faces, child faces, and individuals from various demographic groups (e.g. African, Caucasian, South Asian and East Asian). All pairs between the gallery and probe sets are used for evaluation, and we employ 1:1 face verification as the evaluation metric.

Multi-racial Test Set: We assess accuracy by demographic group (e.g. African, Caucasian, South Asian and East Asian) and report the True Positive Rate (TPR) @ False Positive Rate (FPR) = 1e-6. The number of positive pairs for each demographic group is of the order of millions, and the number of negative pairs for each demographic group is of the order of billions.

InsightFace Track Ranking Rules: To protect data privacy and ensure fairness in the competition, we withhold all images as well as the labels of the test data. Participants can submit their models in the ONNX format to our evaluation server and obtain their results from the leaderboard after the online evaluation (usually several hours). Participants are only allowed to use the training data we provide for a particular challenge track. On the widely used V100 GPU, we set an upper bound on inference time (< 10 ms/image for the MS1M sub-track and < 20 ms/image for the Glint360K sub-track) to control model complexity, and the submitted model must be smaller than 1GB in float32 format. On our online test server, we employ cosine similarity for the verification test, and the feature embedding dimension is 512.

Training details of the baseline models are released before the challenge to facilitate participation. We re-implement a simple online masked face augmentation function [32], customize ResNet [16] for the baseline models, and employ ArcFace [4] as our loss function, which is one of the top-performing methods for deep face recognition. As shown in Fig. 2, we follow the JDAI-CV FaceX-Zoo toolkit [32] (https://github.com/JDAI-CV/FaceX-Zoo/tree/main/addition_module/face_mask_adding/FMA-3D) to implement our online masked face generation function. After 3D face reconstruction [10] on the input 2D face image, we obtain the UV texture map, the face geometry and the camera pose. Then, we randomly select one facial mask from the collected mask dataset and project it into the UV space. Based on a simple texture blending, we can easily obtain the masked facial UV texture. Finally, we combine the masked facial UV texture with the face geometry and render the masked face back into a 2D face image.

During training, we follow ArcFace [4] in setting the feature scale to 64 and the angular margin to 0.5 (a minimal sketch of this loss is given below). As shown in Tab. 3, we customize ResNet [16] as our baseline models (i.e. R18, R34, R50 and R100). More specifically, we only employ the basic residual block instead of the bottleneck residual block, following ArcFace [4]. The baseline models are implemented in PyTorch with parallel acceleration on both features and centres. We set the batch size to 1,024 and train the models on eight NVIDIA V100 (32GB) GPUs. The learning rate starts at 0.1 and is divided by 10 at epochs 10, 16 and 22; the whole training procedure finishes at epoch 24. We set the momentum to 0.9 and the weight decay to 5e-4.
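To make the loss configuration concrete, below is a minimal single-GPU sketch of an ArcFace-style additive angular margin head with the stated hyper-parameters (scale s = 64, margin m = 0.5). The class and variable names are illustrative; the actual baselines additionally use the parallel feature/centre acceleration of Partial FC [1], which this sketch does not reproduce.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """Additive angular margin loss (ArcFace), illustrative version."""
    def __init__(self, embedding_dim=512, num_classes=93000, s=64.0, m=0.5):
        super().__init__()
        self.s, self.m = s, m
        # One class centre (512-D vector) per identity in the training set.
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.normal_(self.weight, std=0.01)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and centres.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        target = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        # Scale by s and apply the standard softmax cross-entropy.
        return F.cross_entropy(self.s * logits, labels)
```

A training step then reduces to `loss = head(backbone(images), labels)`, optimized with SGD under the learning-rate schedule described above.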
During testing, we only keep the feature embedding network without the fully connected layer and extract the 512-D feature for each normalized face crop, using the cosine similarity metric for each feature pair.

As shown in Tab. 4, we first test our baseline models on public benchmarks, including LFW [17], CFP-FP [27], AgeDB [23], and IJB-C [21]. As the computational complexity increases from R18 to R100, performance rises steadily across all test sets. After changing the training data from MS1M to Glint360K, TPR@FPR=1e-4 on IJB-C significantly increases from 96.81% to 97.32% for R100. On CFP-FP, R100 trained on Glint360K outperforms the counterpart model trained on MS1M by 0.23%, indicating that frontal-to-profile face verification can benefit from more training data. By contrast, the verification accuracy on LFW and AgeDB is almost the same, which indicates that LFW and AgeDB are too saturated to distinguish between high-performing models.

In Tab. 5, we report the performance on the challenge benchmarks. As these results show, verification accuracy benefits from more training data (from MS1M to Glint360K) and heavier backbone structures (from R18 to R100) across all testing scenarios (i.e. the masked, children and multi-racial test sets). Compared to the performance gaps on the public test sets (i.e. LFW [17], CFP-FP [27], AgeDB [23], and IJB-C [21]), the performance gaps on the proposed masked test set, children test set and multi-racial test set are more obvious. In addition, we also conduct experiments with masked face augmentation. When 10% of the training faces wear facial masks during training, the verification accuracy on the masked test set significantly increases from 69.091% to 77.325% using MS1M, and from 75.567% to 83.710% using Glint360K. However, masked face augmentation is slightly harmful for non-masked face verification, as the TPR on the MR-All set drops by 0.484% for the MS1M sub-track and by 0.644% for the Glint360K sub-track. We leave the balancing of masked face augmentation to the challenge participants.

Table 5: The baseline performance of the masked face recognition challenge (the InsightFace track). "MR-All" denotes the verification accuracy on all multi-racial images. Inference time is evaluated on a Tesla V100 GPU using onnxruntime-gpu==1.6. "MA-0.1" means masked face augmentation with a probability of 10%.
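The verification figures above, and those in Tabs. 4-7, are TPRs read off at a fixed FPR. For reference, the following is a minimal sketch of how such an operating point can be computed from raw cosine-similarity scores of genuine (same-identity) and impostor (different-identity) pairs; the function name and toy data are illustrative assumptions, not part of the released evaluation toolkit.

```python
import numpy as np

def tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=1e-6):
    """TPR at a fixed FPR: find the similarity threshold at which the
    given fraction of impostor pairs is (wrongly) accepted, then measure
    how many genuine pairs pass that threshold."""
    impostor = np.sort(np.asarray(impostor_scores))
    # Threshold such that only `target_fpr` of impostor scores exceed it.
    k = int(np.ceil(len(impostor) * (1.0 - target_fpr))) - 1
    threshold = impostor[min(max(k, 0), len(impostor) - 1)]
    tpr = np.mean(np.asarray(genuine_scores) > threshold)
    return tpr, threshold

# Toy usage with random scores; the real benchmark uses millions of
# genuine pairs and billions of impostor pairs per demographic group.
rng = np.random.default_rng(0)
genuine = rng.normal(0.6, 0.1, 10_000)
impostor = rng.normal(0.1, 0.1, 1_000_000)
print(tpr_at_fpr(genuine, impostor, target_fpr=1e-4))
```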
The masked face recognition competition (InsightFace track) is conducted as part of the Masked Face Recognition Challenge & Workshop at the International Conference on Computer Vision 2021 (ICCV 2021). Participants can freely select different sub-tracks to develop a face feature embedding model, which is automatically evaluated on our test server based on the above-mentioned protocols. The competition is open worldwide, to both industry and academic institutions. By 16th August 2021, the InsightFace track had received hundreds of registrations from across the world. More specifically, the competition has received 123 valid submissions for the MS1M sub-track and 69 valid submissions for the Glint360K sub-track, where multiple submissions to one sub-track from the same participant are counted only once. As we have postponed the leaderboard submission deadline to 1st October 2021, we cannot collect the final top-ranked solutions before the camera-ready deadline. After the competition, we will close the test server and select the valid top-3 solutions for each track. We will collect the training code from these top-ranked participants and re-train the models to confirm whether the performance of each submission is valid. We will update the challenge report on arXiv with detailed team information and detailed top-ranked solutions.

By 16th August 2021, the best model of the MS1M sub-track had achieved 84.169% on the masked test set and 90.452% on the MR-All test set. As given in Tab. 6, we list the top-15 submissions from the leaderboard. Compared with the baseline models in Tab. 5, there are around 7% absolute improvements on both the masked test set and the MR-All test set. For the Glint360K sub-track, the best model has achieved 88.972% on the masked test set and 93.512% on the MR-All test set, as shown in Tab. 7. Compared with the baseline models in Tab. 5, there is around a 6% absolute improvement on the masked test set and about a 3.5% absolute improvement on the MR-All test set. Therefore, there is still ample room for training optimization to improve masked face recognition without an accuracy drop on non-masked face recognition.

Face recognition has recently been a controversial topic. There have been questions over ethical concerns about invasion of privacy, alongside how well face recognition systems recognize darker shades of skin (known as the bias problem). In the InsightFace track of this challenge, we employ existing academic data as the training datasets. Most of the identities inside the training data are well-known celebrities [14]. The pre-processed training data are compressed into a binary record and only released to relevant researchers to facilitate reproducible training. Our private test data will not be released to the public, to avoid data privacy problems and to ensure fairness for all participants. Regarding the bias concern, we follow the most authoritative evaluation, set up by NIST-FRVT (https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8280.pdf). We wish to promote fairness in deep face recognition and have thus set up the multi-racial verification benchmark.

In this InsightFace track report, we introduce our new benchmark for the evaluation of masked face recognition as well as non-masked face recognition. Based on our baseline solutions, we confirm the effectiveness of naive masked face augmentation. As the challenge is still ongoing, we will keep updating the top-ranked solutions as well as this report on arXiv. Besides the InsightFace track, there is also a parallel WebFace260M track in the Masked Face Recognition Challenge. The WebFace260M track is organized by Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Jia Guo, Jiwen Lu, Dalong Du and Jie Zhou. Details can be found in the arXiv report [38], which will also be updated in the future.
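As a practical note on the submission format described earlier: models are submitted as ONNX files and timed with onnxruntime (Tab. 5 reports timings under onnxruntime-gpu==1.6). Below is a minimal sketch of exporting a PyTorch embedding network and estimating its per-image latency; the stand-in backbone and file name are illustrative assumptions, not the challenge's official tooling.

```python
import time
import numpy as np
import torch
import torchvision
import onnxruntime as ort

# Stand-in embedding network for illustration only: a torchvision
# ResNet-18 with a 512-D output head (the actual baselines use the
# customized ResNets described above).
backbone = torchvision.models.resnet18(num_classes=512)
backbone.eval()

# Export to ONNX with a dynamic batch dimension; "model.onnx" is an
# illustrative file name, not a name required by the challenge.
dummy = torch.randn(1, 3, 112, 112)
torch.onnx.export(backbone, dummy, "model.onnx",
                  input_names=["input"], output_names=["embedding"],
                  dynamic_axes={"input": {0: "batch"}})

# Rough per-image latency measurement with onnxruntime.
session = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
batch = np.random.randn(1, 3, 112, 112).astype(np.float32)
session.run(None, {"input": batch})  # warm-up
start = time.time()
for _ in range(100):
    session.run(None, {"input": batch})
print(f"{(time.time() - start) / 100 * 1000:.2f} ms/image")
```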
References
[1] Partial FC: Training 10 million identities on a single machine.
[2] Celeb-500K: A large training dataset for face recognition.
[3] Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces.
[4] ArcFace: Additive angular margin loss for deep face recognition.
[5] RetinaFace: Single-shot multi-level face localisation in the wild.
[6] Variational prototype learning for deep face recognition.
[7] Lightweight face recognition challenge.
[8] Marginal loss for deep face recognition.
[9] A novel GAN-based network for unmasking of masked face.
[10] Joint 3D face reconstruction and dense alignment with position map regression network.
[11] Occluded face recognition in the wild by identity-diversity inpainting.
[12] Jointly debiasing face recognition and demographic attribute estimation.
[13] Generative adversarial nets.
[14] MS-Celeb-1M: A dataset and benchmark for large-scale face recognition.
[15] Efficient masked face recognition method during the COVID-19 pandemic.
[16] Deep residual learning for image recognition.
[17] Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments.
[18] CurricularFace: Adaptive curriculum learning loss for deep face recognition.
[19] Cropping and attention based approach for masked face recognition.
[20] SphereFace: Deep hypersphere embedding for face recognition.
[21] IARPA Janus Benchmark-C: Face dataset and protocol.
[22] Boosting masked face recognition with multi-task ArcFace.
[23] AgeDB: The first manually collected, in-the-wild age database.
[24] Level playing field for million scale face recognition.
[25] Deep face recognition.
[26] FaceNet: A unified embedding for face recognition and clustering.
[27] Frontal to profile face verification in the wild.
[28] Deep learning face representation by joint identification-verification.
[29] DeepFace: Closing the gap to human-level performance in face verification.
[30] Additive margin softmax for face verification. SPL.
[31] CosFace: Large margin cosine loss for deep face recognition.
[32] FaceX-Zoo: A PyTorch toolbox for face recognition.
[33] Mitigating bias in face recognition using skewness-aware reinforcement learning.
[34] Racial Faces in the Wild: Reducing racial bias by information maximization adaptation network.
[35] Hierarchical pyramid diverse attention networks for face recognition.
[36] Robust point set matching for partial face recognition.
[37] Investigating bias and fairness in facial expression recognition.
[38] Masked face recognition challenge.
[39] WebFace260M: A benchmark unveiling the power of million-scale deep face recognition.