Balanced Masked and Standard Face Recognition

Delong Qi, Kangli Hu, Weijun Tan, Qi Yao, Jingfeng Liu

October 4, 2021

Abstract

We present improved network architecture, data augmentation, and training strategies for the WebFace track and the InsightFace/Glint360K track of the ICCV2021 Masked Face Recognition challenge. One of our key goals is balanced performance on masked and standard face recognition. To prevent overfitting to masked face recognition, we limit masked faces to no more than 10% of the face images in the training dataset. We propose several key changes to the face recognition network, including a new stem unit, DropBlock, face detection and alignment with YOLO5Face, feature concatenation, and a cyclic cosine learning rate. With this strategy, we achieve good and balanced performance on both masked and standard face recognition.

1. Introduction

Face recognition, a method to identify or verify a person's identity from a face image, has made tremendous progress in recent years thanks to deep learning, particularly deep CNNs. However, some challenges remain, one of which is low image quality. Factors that cause low-quality images include pose, blur, occlusion, illumination, etc.

Since the outbreak of the disastrous COVID-19 pandemic, and motivated by the need to prevent the virus from spreading, face recognition has become one way to trace COVID-19 patients and close contacts [25]. After a patient is confirmed and their identity is recognized, safety measures can be taken to control the spread of the virus. The face mask, an effective way to prevent the virus from spreading, poses a new challenge to traditional face recognition. Since the mask covers a large part of the face where abundant features are present, traditional face recognition algorithms may not work effectively. This drives a need to understand how face recognition algorithms deal with masked faces, as a special case of occlusion.

To cope with the challenge arising from wearing masks, it is crucial to improve existing face recognition algorithms. Even though some commercial providers have claimed the availability of face recognition algorithms capable of handling face masks, and an increasing number of research publications have surfaced, there is still no publicly available masked face recognition benchmark. The ICCV2021-MFR Workshop organizes the Masked Face Recognition (MFR) challenge [7, 17, 32, 33], whose goal is to benchmark face recognition algorithms on masked faces. In this paper we present our solution for this MFR challenge [17].

We notice that one challenge is how to balance the performance of masked and standard face recognition. At the beginning of the challenge, the MFR error rate was used as the ranking metric. To achieve the best MFR performance, participants tended to put a lot of masked face images in their training datasets. This causes an overfitting problem: MFR performance becomes very good, but standard face recognition performance degrades. Later, the organizer recognized this problem and changed their ranking metric to a mixture of the error rates of masked and standard face recognition. In our solution, we address this problem from the beginning and keep MFR and standard face recognition balanced.
Our strategy is to limit the number of masked face images to no more than 10% of the total number of face images in the training dataset. Beyond this key strategy, we propose a few changes to the backbone network architecture. We use ArcFace [8] as our baseline, with ResNet [14] as the backbone. We test the ArcFace loss [8], the CosFace loss [26], and other loss functions. We design a new stem unit and add DropBlock [12] to the last two layers of the ResNet network. We also propose to use YOLO5Face [22] for face alignment, and to concatenate features from multiple models to improve recognition accuracy. As of August 3, 2021, we achieve an All-Masked MFR metric of 0.1056 on the WebFace track, and TAR@Mask of 0.84327 and TAR@Mask-All of 0.92702 on the InsightFace/Glint360K track.

2. Related Work

Face recognition has been widely studied and deployed in practical applications in recent years; for a recent review, see [27]. For face recognition in the wild, more and more attention is given to poor-quality face images caused by pose, blur, occlusion, illumination, etc. Some examples are [28, 24, 19].

MFR is a challenging task: a large part of the face is occluded, so the abundant features on the mouth, nose, and lower cheeks are lost. As a result, the face recognition algorithm has to focus on the eyes, ears, upper cheeks, and forehead to establish a person's identity. The works [2, 11] provide tools to generate masked face images as data augmentation for MFR. In [3], a normal face detector is used to detect masked faces, and a pretrained VGGFace2 [6] is used to extract features for face recognition; the authors treat MFR as a normal face recognition problem. In [13], the author first removes the masked region, then uses pre-trained networks to extract features of the eyes and forehead, and finally uses a bag of words and an MLP for classification. In [21], a ResNet50 [14] and ArcFace [8] are used; the authors introduce a probability of mask usage and add a mask-usage classification loss to the ArcFace loss. In [10], the authors collect MFR datasets and propose latent part detection to locate the latent facial part that is robust to mask wearing; the latent part is further used to extract discriminative features. In [18], the authors explore the Convolutional Block Attention Module in a ResNet50 network; they also suggest that removing the masked region helps MFR, and they test their algorithm on a variety of MFR datasets.

3. Method

All major changes are described in this section. These include the data augmentation for masked face images, a new stem unit, DropBlock, YOLO5Face [22] detection and alignment, and feature concatenation.

Masked Face Data Augmentation

Generation of masked face images is one type of data augmentation. However, because of its importance for balanced masked and standard face recognition, we present it as a separate subsection. In our early study, we find that in order to get the best MFR performance, one is tempted to use as many masked face images as possible in the training dataset. As a result, the face recognition model is overfitted for MFR but does not work well on standard face recognition. Therefore, we control the balance of MFR and standard face recognition by controlling the percentage of masked face images in the training dataset. After some experiments, we find that 10% is a good trade-off.

Instead of using publicly available MFR datasets, we use a synthetic tool, FMA-3D [11], to generate an MFR dataset. This tool can generate masked face images online or offline. We find that generating masked face images online reduces the training throughput substantially, so we generate them offline. Some examples of the generated masked face images are shown in Figure 1. A minimal sketch of how the 10% cap can be enforced is given below.
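The paper does not publish its data pipeline, so the following is only a sketch of how the 10% cap might be enforced when assembling the training list. The function name, the `add_mask_fn` wrapper around FMA-3D, and the choice to add masked copies rather than replace originals are all assumptions.

```python
import random

def build_masked_training_list(face_images, add_mask_fn, mask_ratio=0.10, seed=0):
    """Cap synthetically masked faces at `mask_ratio` of the dataset.

    face_images : list of paths to standard (unmasked) face images.
    add_mask_fn : offline mask generator, e.g. a wrapper around FMA-3D;
                  hypothetical signature: image path in, path of the
                  saved masked image out.
    mask_ratio  : 0.10, the trade-off chosen in the paper.
    """
    rng = random.Random(seed)
    n_masked = int(mask_ratio * len(face_images))
    # Generate masked copies offline, once, for a random subset.
    masked = [add_mask_fn(p) for p in rng.sample(face_images, n_masked)]
    # Adding (rather than replacing) keeps the masked share at
    # n_masked / (len(face_images) + n_masked), i.e. under 10%.
    return face_images + masked
```

Because the masked copies are written to disk once, the training loop reads them like any other sample and the throughput cost of online mask synthesis is avoided.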
Network Architecture

We use the ArcFace framework [8] with ResNet [14] as the backbone. The network architecture is shown in Figure 2, with all changed blocks highlighted in green. Figure 2(a) is the overall architecture, where YOLO5Face [22] is used only on the test data. The training data provided by WebFace [33] or Glint360K [1] are used with data augmentation. The final feature map is flattened, then a 512-neuron fully-connected (FC) layer is used to generate the feature; lastly, another FC layer is used for classification. Figure 2(b) shows the feature concatenation of two models, which is described later.

Stem Unit

Face recognition backbones like ResNet [14] have a stem unit, a component whose goal is to quickly reduce the input resolution. Typically, the input is transformed from [b, c, h, w] to [b, c*2, h//2, w//2], where b, c, h, and w are the batch, channel, height, and width of the input. The ResNet50 [14] stem comprises a stride-2 7x7 convolution followed by a max-pooling layer, which together reduce the input resolution by a factor of 4. The ResNet50-D [15] stem design is more elaborate: the 7x7 convolution is replaced by three 3x3 convolutions. The ResNet50-D stem improves accuracy, but at the cost of lower training throughput. In TResNet [23], the stem unit is a SpaceToDepth transformation layer, which replaces the traditional convolution-based downscaling unit with a fast and seamless layer that loses as little information as possible. In YOLOv5 [29], a tensor reshaping (the focus layer) is used to reduce the spatial resolution and increase the depth, reducing the cost of the Conv2d computation. This focus layer has been shown to have a positive impact on YOLOv5's performance.

Inspired by the stem unit in ResNet50 [14] and the focus layer in YOLOv5 [29], we design a new stem unit with a down-sampling rate of 2, similar to the focus layer in YOLOv5. There are two parallel stem branches, C1 and C2, whose outputs are added. In the first branch, C1, the input image is passed to an average-pooling layer with kernel size 2 and stride 2 to reduce the spatial resolution, after which a convolution with kernel size 2 and stride 2 is applied. In the second branch, C2, the input image is down-sampled 2x along the h and w dimensions: the pixels at even and odd indices along h and w are combined to form four images, which are concatenated into a tensor whose resolution is reduced 2x but whose depth is increased 4x. This tensor is passed to a convolution with kernel size 3x3, stride 2, and padding 1. In both branches, a BatchNorm layer and a PReLU layer follow the convolution. The architecture of this stem unit is shown in Figure 3, and a sketch is given below.
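A PyTorch sketch of the two-branch stem follows. The module name `FocusStem` and the channel counts are our assumptions (the paper does not give them), and we read the description literally: each branch applies two stride-2 stages, so the output resolution is one quarter of the input.

```python
import torch
import torch.nn as nn

class FocusStem(nn.Module):
    """Two-branch stem: C1 (avg-pool + conv) added to C2 (focus + conv)."""

    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        # Branch C1: average pooling (k=2, s=2), then a stride-2 conv.
        self.c1 = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
        )
        # Branch C2: focus-style space-to-depth, then a stride-2 conv.
        self.c2_conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
        )

    def forward(self, x):
        # Space-to-depth: the even/odd pixels along h and w form four
        # images, concatenated on the channel dimension (2x smaller,
        # 4x deeper).
        s2d = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.c1(x) + self.c2_conv(s2d)
```

On the paper's 112x112 input, `FocusStem()(torch.randn(1, 3, 112, 112))` returns a tensor of shape [1, 64, 28, 28]; both branches reach the same resolution, so their outputs can be added.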
DropBlock

In deep learning, dropout is a widely used regularization method. However, it is often less effective for convolutional layers, perhaps because activation units in convolutional layers are spatially correlated, so information can still flow through the network despite dropout. A structured form of dropout is therefore needed to regularize convolutional networks. DropBlock [12], in which units in a contiguous region of a feature map are dropped together, is such a method, and it has been found to increase accuracy. We add DropBlock to our face recognition network, after the convolution layer and the skip connection in the last two layers of the ResNet [14], as shown in Figure 2(a).

Face Detection and Alignment

Face images are aligned using landmarks before they are sent to face recognition. Before RetinaFace [9] became available, MTCNN [30] was widely used for face image alignment. In the WebFace dataset, RetinaFace [9] is used as the standard for face alignment. YOLO5Face [22] is a face detector carefully redesigned from the YOLOv5 [29] object detector. In addition to the bounding box and confidence score, it also outputs five-point landmarks, similar to MTCNN and RetinaFace. We use YOLO5Face for face image alignment in some of our studies. Some qualitative examples are shown in Figure 4, which compares the landmarks from YOLO5Face [22] and RetinaFace [9] on a set of face images with large poses. It is not hard to see that the landmarks from YOLO5Face are better. Note that some of the detection scores are small because these faces have large poses.

Figure 4. Examples of landmarks: the first row is from RetinaFace [9], and the second row from YOLO5Face [22].

Feature Concatenation

Left-right face flipping has been shown to improve face recognition performance, but it is not allowed in this MFR competition. We borrow the idea and instead use feature vectors extracted from multiple models: we choose the two best models, extract their feature vectors, and concatenate them to form the feature vector for face recognition, as shown in Figure 2(b). This is similar to a model ensemble, and our experiments show that it improves performance.

4. Experiments

Datasets

In the early part of the first phase, the Glint360K dataset [1] is used as the preliminary training dataset. After we meet the required baseline performance, 30% of WebFace260M [33] is released; however, it is not used in our training. Beyond the masked face images already included in the dataset, more masked face images are generated, as described in Section 3. The total number of masked face images does not exceed 10% of the total number of face images.

Implementation Details

The ArcFace framework [8] in PyTorch is used in our study. The input image size is set to 112x112. We use ResNet34 [14] as our backbone; we expect larger models like ResNet50 and ResNet101 [14] to give better performance, but for faster training we use ResNet34 in most of our studies. Other data augmentation methods we use include random cropping, random flipping, and the more complex Albumentations [5], which includes affine transforms (scaling, translation, rotation, distortion), noise, blurring, and brightness and contrast adjustment.

Our model is trained on four Nvidia RTX GPUs. The training runs for 24 epochs with a batch size of 512. We use cyclic cosine decay [20, 4] as the learning rate strategy, as shown in Figure 5. The initial learning rate is 0.1, the momentum is 0.9, and the weight decay is 5E-4. We use 0.1 epoch for warm-up, and the learning rate is reduced to the minimum decay learning rate over 16 epochs. A sketch of this schedule is given below.
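The paper states only the schedule's hyper-parameters (initial LR 0.1, 0.1-epoch warm-up, minimum reached after 16 epochs). The code below is a minimal sketch under the assumptions that the warm-up is linear, the minimum LR factor is 0.01, and the cosine cycle restarts after 16 epochs in the SGDR [20] fashion; `steps_per_epoch` depends on the dataset and batch size.

```python
import math
import torch

def cyclic_cosine_lr(step, steps_per_epoch, warmup_epochs=0.1,
                     decay_epochs=16.0, min_factor=0.01):
    """Per-step LR multiplier: linear warm-up, then cosine decay.

    `min_factor` and the restart after `decay_epochs` are assumptions;
    the paper gives only the warm-up length and the 16-epoch decay.
    """
    epoch = step / steps_per_epoch
    if epoch < warmup_epochs:
        return epoch / warmup_epochs  # linear warm-up from 0 to 1
    # Position inside the current cosine cycle (restart every 16 epochs).
    t = ((epoch - warmup_epochs) % decay_epochs) / decay_epochs
    return min_factor + 0.5 * (1 - min_factor) * (1 + math.cos(math.pi * t))

# Usage with the paper's stated hyper-parameters:
model = torch.nn.Linear(512, 10)  # stand-in for the recognition network
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=5e-4)
steps_per_epoch = 1000  # placeholder; set from len(loader)
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda s: cyclic_cosine_lr(s, steps_per_epoch))
```

Calling `sched.step()` once per iteration reproduces the warm-up ramp followed by a cosine curve that reaches its minimum at epoch 16, matching the shape shown in Figure 5.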
Ablation Studies

In the InsightFace/Glint360K track, we use the ResNet34 [14] backbone for its fast training speed. We use this track as our ablation study tool, in addition to submitting to the competition leaderboard. We test a variety of techniques in an incremental manner, and the results are listed in Table 1. In the table, the cyclic cosine learning rate is the schedule described in the implementation details above; data augmentation refers to random cropping plus the Albumentations augmentation with random horizontal flipping, blurring, Gaussian blurring, motion blurring, RGB shift, and image compression, where the probability is 0.5 for horizontal flipping and 0.05 for all others. We also test the Squeeze-and-Excitation (SE) network [16] and the Exponential Moving Average (EMA) gradient update. We see that the cyclic cosine learning rate brings a 2.2% performance improvement, the data augmentation brings the biggest improvement of nearly 10%, and all other techniques bring 0.1-0.2% improvements, except for the stem unit. The stem unit does not improve accuracy, but it does improve inference time.

Face alignment with YOLO5Face [22]. We do this ablation study on the WebFace track, using the best configuration from the previous ablation study as the baseline. Instead of ResNet34, ResNet124 is used as the backbone. The results are listed in Table 2. First we train two models, R124_1 and R124_2; then we concatenate their features to form a third model. These three models use the RetinaFace [9] face detector and alignment. Keeping the concatenation model, we then replace RetinaFace with YOLO5Face [22]. We test two models, a small-sized model YOLO5-S and a medium-sized model YOLO5-M; both give better performance than the third model. This demonstrates that YOLO5Face gives better face detection and landmark prediction, as we show qualitatively in Figure 4. A minimal sketch of the feature concatenation used by the third model is given below.
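The following is a sketch of the concatenation model of Table 2 (Figure 2(b)). Whether each 512-d feature is L2-normalized before concatenation is not stated in the paper; normalizing both before and after is our assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def concat_embedding(models, image):
    """Ensemble-style feature concatenation of two trained backbones.

    `models` holds two recognition networks, each mapping an aligned
    112x112 face batch to a [B, 512] embedding. Per-model L2
    normalization is an assumption, not a detail from the paper.
    """
    feats = [F.normalize(m(image), dim=1) for m in models]  # two [B, 512]
    fused = torch.cat(feats, dim=1)                         # [B, 1024]
    return F.normalize(fused, dim=1)
```

Verification then uses cosine similarity between the fused 1024-d descriptors, exactly as it would with a single model's 512-d feature.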
New MFR Metrics

To avoid overfitting to either masked or standard face recognition, the WebFace track organizer decided to revise the formulas for all three MFR metrics on the leaderboard. Each new metric is a weighted sum of the masked and standard face recognition error rates, so both are considered at the same time.

Results

As of August 3, 2021, our best results on the two tracks are listed in Table 3.

5. Conclusion

In this paper we present a solution for the ICCV2021 MFR challenge [17]. Our solution achieves a good balance between masked and standard face recognition.

References

[1] Bin Qin, Debing Zhang, and Ying Fu. Partial FC: Training 10 million identities on a single machine.
[2] Masked face recognition for secure authentication.
[3] Single camera masked face identification.
[4] There are many consistent explanations of unlabeled data: Why you should average.
[5] Albumentations: Fast and flexible image augmentations.
[6] VGGFace2: A dataset for recognising faces across pose and age.
[7] Xiang An, Zheng Zhu, and Stefanos Zafeiriou. Masked face recognition challenge: The WebFace260M track report. ICCV Workshop.
[8] ArcFace: Additive angular margin loss for deep face recognition.
[9] RetinaFace: Single-stage dense face localisation in the wild.
[10] Masked face recognition with latent part detection.
[11] FMA-3D. github.com/JDAI-CV/FaceX-Zoo/tree/main/addition_module/face_mask_adding/FMA-3D.
[12] DropBlock: A regularization method for convolutional networks. arXiv:1810.12890.
[13] Efficient masked face recognition method during the COVID-19 pandemic.
[14] Deep residual learning for image recognition.
[15] Bag of tricks for image classification with convolutional neural networks.
[16] Squeeze-and-excitation networks.
[17] ICCV21-MFR workshop.
[18] Cropping and attention based approach for masked face recognition.
[19] EQFace: A simple explicit quality network for face recognition.
[20] SGDR: Stochastic gradient descent with warm restarts.
[21] Boosting masked face recognition with multi-task ArcFace.
[22] YOLO5Face: Why reinventing a face detector.
[23] TResNet: High performance GPU-dedicated architecture.
[24] Towards universal representation learning for deep face recognition.
[25] Application of face recognition in tracing COVID-19 patients and close contacts.
[26] CosFace: Large margin cosine loss for deep face recognition.
[27] Deep face recognition: A survey.
[28] Neural aggregation network for video face recognition.
[29] ultralytics/yolov5. github.com/ultralytics/yolov5.
[30] Joint face detection and alignment using multitask cascaded convolutional networks.
[31] RefineFace: Refinement neural network for high performance face detection.
[32] Dalong Du, and Jie Zhou. Masked face recognition challenge: The WebFace260M track report.
[33] WebFace260M: A benchmark unveiling the power of million-scale deep face recognition.