key: cord-0519162-o2x5ygli
authors: Montero, David; Nieto, Marcos; Leskovsky, Peter; Aginako, Naiara
title: Boosting Masked Face Recognition with Multi-Task ArcFace
date: 2021-04-20
journal: nan
DOI: nan
sha: d4a075d965fc835a71d5bf56fe19613d72b47c68
doc_id: 519162
cord_uid: o2x5ygli

In this paper, we address the problem of face recognition with masks. Given the global health crisis caused by COVID-19, mouth and nose-covering masks have become an essential everyday-clothing-accessory. This sanitary measure has put the state-of-the-art face recognition models on the ropes since they have not been designed to work with masked faces. In addition, the need has arisen for applications capable of detecting whether the subjects are wearing masks to control the spread of the virus. To overcome these problems a full training pipeline is presented based on the ArcFace work, with several modifications for the backbone and the loss function. From the original face-recognition dataset, a masked version is generated using data augmentation, and both datasets are combined during the training process. The selected network, based on ResNet-50, is modified to also output the probability of mask usage without adding any computational cost. Furthermore, the ArcFace loss is combined with the mask-usage classification loss, resulting in a new function named Multi-Task ArcFace (MTArcFace). Experimental results show that the proposed approach highly boosts the original model accuracy when dealing with masked faces, while preserving almost the same accuracy on the original non-masked datasets. Furthermore, it achieves an average accuracy of 99.78% in mask-usage classification.

In recent years, advances in the field of face recognition have made it one of the most reliable biometric techniques among other existing techniques such as fingerprint recognition, hand geometry or iris scanning [1] , [2] . Furthermore, compared to its alternatives, facial recognition has the following advantages:

• It is a more affordable solution than other alternatives such as fingerprint or iris scanners, as it only needs a mono camera as a sensor. In addition, several cameras may be connected to a single processing unit to further reduce hardware costs.
• The verification can be done remotely; there is no need for the user to interact with the sensor.
• The sensor can be hidden, which can be very useful for security or aesthetic reasons.

All these features make face recognition the best choice for most applications based on human re-identification.

However, face recognition also has weaknesses. Current state-of-the-art methods are based on deep learning models that extract biometric feature vectors from the detected face images. These detected faces may have different orientations, lighting conditions, partial occlusions, low resolution, noise, etc., which can affect the robustness of the feature vectors [3]. Most of these negative conditions can often be eliminated by selecting the correct hardware location and requirements and by preprocessing face images [4], but others, such as partial occlusions caused by clothing accessories, cannot be avoided. This particular issue has recently become a major challenge in the field of face recognition, especially since the global health crisis caused by COVID-19 has made medical face masks an everyday clothing accessory.

The use of a mouth and nose-covering mask causes face recognition models to lose about half of the useful biometric information. Since they have been designed to work with the whole face information, the quality of the feature vectors extracted from masked faces is compromised and the accuracy of the re-identification process decreases considerably, as stated in [5]. In fact, the NIST agency recently presented a study [6] in which they examined 89 major commercial facial recognition algorithms. The results showed error rates of between 5% and 50% in matching photos of the same person with and without a mask.

While masked face detection has been widely studied and several robust solutions have been presented [7] , [8] , [9] , masked face recognition remains an under-researched topic. In the last months, several masked face datasets [10] , [11] and tools [12] , [13] for generating synthetic data have been released. In addition, some methods trying to tackle this issue using different approaches have been presented [14] , [13] , [15] . Nevertheless, there is still much research to be done about this topic. To contribute to this task, we propose an approach based on the ArcFace work presented by Deng et al. [16] with several modifications for the backbone and the loss function. From the original face-recognition dataset, we generate a masked version using data augmentation, and we combine both datasets during the training process. We modify the selected network, based on ResNet-50 [17] , [18] , to also output the probability that a face is wearing a mask without adding any additional computational cost. Furthermore, we combine the ArcFace loss with the mask-usage classification loss, resulting in a new function named Multi-Task ArcFace (MTArcFace).

Experimental results with non-masked and masked face-recognition validation datasets show that the proposed approach highly boosts the model accuracy when dealing with masked face recognition, while preserving almost the same accuracy on the non-masked datasets. Furthermore, the model achieves an accuracy of 99.78% in mask-usage classification.

The rest of the paper is organized as follows. First, we present a review of the related work in Section II. Section III describes the proposed method. We provide the experimental results in Section IV. Finally, conclusions are given in Section V.

State-of-the-art face recognition algorithms are based on deep learning models. These models learn to extract the important features from a face image and embed them into an n-dimensional vector with small intra-class and large inter-class distance.

These models are trained mainly following two approaches. The first one consists of training a multi-class classifier considering one class for each identity in the training dataset, normally using a softmax function [16], [19]. In the second one, the embedding is learnt directly, comparing the results of different inputs to minimize the intra-class distance and to maximize the inter-class distance, for example using the triplet loss [20].
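To make the second approach concrete, the following is a minimal sketch of a triplet loss on toy embeddings, assuming squared Euclidean distances and a margin as in the usual formulation of [20]. The function name and the default margin value are illustrative choices, not taken from this work.

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on plain Python lists (illustrative only).

    anchor and positive share an identity; negative has a different one.
    The margin default is a hypothetical value chosen for this sketch.
    """
    # Squared Euclidean distance between anchor and positive (intra-class).
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    # Squared Euclidean distance between anchor and negative (inter-class).
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    # The loss is zero once the negative is farther than the positive by the margin.
    return max(d_pos - d_neg + margin, 0.0)
```

The loss only depends on which triplets are presented, which is why this family of methods needs a careful triplet-selection step before or during training.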

Both softmax-loss-based and triplet-loss-based models suffer from face-mask occlusions in terms of accuracy, as reported by [5] and [6]. However, as stated in [16], triplet-loss-based models require a data preparation step prior to the training phase, in order to select the triplets correctly. For this reason, we decided to address the problem using a softmax-loss approach. More specifically, we selected ArcFace [16] as our baseline, since it has been proven to be the approach that reports the best results for the face recognition task.

Since the rise of COVID-19, several works have been presented in order to solve the masked face recognition task. The proposed methods tackle the problem following different approaches that can be categorized in three groups. The first group uses generative adversarial networks (GAN) to unmask faces prior to feeding them to the face recognition model [21], [13]. Using this approach it is not necessary to retrain the recognition model. However, the reconstructed faces are synthetic and their reliability depends on the quality of the data, the network and the training process. In addition, the process of removing the mask noticeably increases the computation time.

The approach adopted by the second group consists of extracting features only from the upper part of the face [15] . As the processed region of the face is smaller, the trained network performs faster. Nevertheless, this causes an important drop of information when dealing with unmasked faces, so it is not suitable for applications mixing both use cases.

Finally, the last group tackles the problem by training the face recognition network with a combination of masked and unmasked faces [12], [14]. In [12] they combine the VGG2 dataset [22] with augmented masked faces and train the model following the original pipeline described in FaceNet [20]. This way, the model learns to distinguish when a face is wearing a mask and to rely more on the features of the upper half of the face, but still extracts information from the whole face. On the other hand, Geng et al. [14] define two centers for each identity, which correspond to the full face images and the masked face images respectively. They use Domain Constrained Ranking to force the features of masked faces to get closer to their corresponding full face center and vice versa.

The method proposed in this work belongs to the third group, but uses ArcFace [16] as the baseline model. First, we generate a masked version of a face recognition dataset using data augmentation. Then, during the training process, both datasets are shuffled separately using the same seed and, for every new face image selected for the input batch, we decide whether the image is taken from the original or the masked dataset with a probability of 50%. Furthermore, we take advantage of knowing to which dataset the face belongs and modify the original network to output the probability that a face is wearing a mask without additional computational cost.

For the methods belonging to the first and the second group previously described, there is a need for masked face datasets. Some recent works have contributed to this task. For instance, Geng et al. [14] present a dataset where each identity has masked and full face images with various orientations. However, the dataset contains only 11,615 images and 1,004 identities, which is not enough data for training a complex network such as ResNet-50 [17], [18]. In [10], the authors present a dataset composed of 137,016 masked faces divided in two groups: correctly and incorrectly masked. Nevertheless, the dataset does not contain information about the identity of any of the subjects, so it cannot be used for the face recognition task. In [11], two additional datasets are presented: Real-world Masked Face Recognition Dataset (RMFRD), with 95,000 images and 525 identities, and Simulated Masked Face Recognition Dataset (SMFRD), with 500,000 images and 10,000 subjects. Although the latter dataset contains a great number of samples, it is not yet sufficient to train a complex network, for example if we compare it with the MS1MV2 dataset used in ArcFace [16], which contains 5.8 million images and 85,000 identities.

On the other hand, Anwar and Raychowdhury [12] present a tool for masking faces in images. It uses a face landmarks detector to identify the face tilt and six key features of the face necessary for adjusting and applying a mask template. This tool supports different types and colors of masks. In this work, we use this tool to generate a masked version of the face recognition datasets used for training and evaluation.

Fig. 1. Illustration of the proposed training pipeline. The image selector decides whether the next input image should be masked or not. The trained network is modified to output also the probability that the face is wearing a mask.

We consider the problem of facial recognition of subjects who may or may not wear masks. As we do not know if the subject is wearing a mask, the network must perform well in both cases. To solve this problem, we aim at increasing the accuracy of the face recognition network when dealing with masked faces, while preserving as much as possible the original accuracy with non-masked faces. In order to achieve this, the network must learn if the subject is wearing a mask to decide which facial features can be trusted in each case. We take advantage of this fact and modify the network so it also outputs the probability that the subject is wearing a mask.

We decide to generate a masked twin dataset from the original one and to combine them during the training process. Both datasets are shuffled separately using the same seed and, for every new face image selected for the input batch, we decide whether the image is taken from the original or the masked dataset with a probability of 50%. As mentioned in Section II-A, we use ArcFace [16] as the baseline work for two reasons: it uses a softmax-loss-based methodology, which does not require an exhaustive training-data-preparation stage; and it has been proven to be the approach that reports the best results for the original face recognition task. Thus, we select as the training dataset the one recommended in their work, MS1MV2, a refinement of MS-Celeb-1M [23] that contains 5.8M images and 85,000 identities. An illustration of the proposed training pipeline is shown in Figure 1.
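The image selector described above can be sketched as follows. This is a minimal illustration, assuming that the original and masked datasets are index-aligned lists of (image, identity) pairs, so that shuffling both with the same seed amounts to applying one shared permutation. The function name `mixed_batches` and its parameters are hypothetical and are not taken from the authors' implementation.

```python
import random

def mixed_batches(original, masked, batch_size, seed=0, p_masked=0.5):
    """Yield batches of (image, identity, is_masked) tuples.

    original and masked are assumed to be index-aligned: masked[i] is the
    masked twin of original[i] (hypothetical data layout for this sketch).
    """
    assert len(original) == len(masked)
    # Shuffling both datasets with the same seed is equivalent to applying
    # one shared permutation of the sample indices.
    order = list(range(len(original)))
    random.Random(seed).shuffle(order)
    # Separate generator for the per-image masked/original decision.
    chooser = random.Random(seed + 1)
    batch = []
    for idx in order:
        use_masked = chooser.random() < p_masked
        image, identity = (masked if use_masked else original)[idx]
        # The selector's decision doubles as the mask-usage label.
        batch.append((image, identity, int(use_masked)))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller, batch
```

Because the choice is made per image, the mask-usage label comes for free and each identity is seen both with and without a mask over the course of training.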

For the generation of the masked version of the dataset, as discussed in Section II-C, we use the tool MaskTheFace [12]. The types of masks considered are surgical, surgical green, surgical blue, N95, cloth and KN95. The mask type is selected randomly and there is a probability of 50% of applying a random color and a probability of 50% of applying a random texture. Some examples of the generated faces are shown in Figure 2.
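The random choice of mask appearance can be sketched as below. This only illustrates the sampling logic and does not call MaskTheFace itself; the type identifiers, the `colors` and `textures` arguments, and the assumption that the color and texture decisions are independent are all hypothetical.

```python
import random

# Mask types listed in the text; the string identifiers are illustrative.
MASK_TYPES = ["surgical", "surgical_green", "surgical_blue", "N95", "cloth", "KN95"]

def sample_mask_config(rng, colors, textures):
    """Draw one mask configuration for a face image.

    rng is a random.Random instance; colors and textures are hypothetical
    pools of options supplied by the caller.
    """
    config = {"mask_type": rng.choice(MASK_TYPES), "color": None, "texture": None}
    # 50% chance of applying a random color (assumed independent of texture).
    if rng.random() < 0.5:
        config["color"] = rng.choice(colors)
    # 50% chance of applying a random texture.
    if rng.random() < 0.5:
        config["texture"] = rng.choice(textures)
    return config
```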

We select LResNet-50 as the backbone among all the network architectures tested in the ArcFace repository, as it is the one with the best trade-off between accuracy and number of parameters. More specifically, we use our own implementation of the network in the TensorFlow deep learning framework, publicly available in a GitHub repository [24].

Starting from this network, we add another dense layer parallel to the one used to generate the feature vector, just after the dropout layer, as shown in Figure 1 . The new dense layer generates an output with two floats, which correspond to the scores related to the probability that the face is masked or not, respectively. This way, we force the network to learn when a face is wearing a mask, information that will also be used by the layer that generates the feature vector.
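The two parallel output layers can be illustrated with a toy forward pass. This is a sketch in plain Python rather than the TensorFlow implementation of [24]; the layer sizes, the random initialization and the helper names are assumptions made for the example. Following the text, index 0 of the mask output is taken to be the "masked" score and index 1 the "not masked" score.

```python
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def make_dense(rng, n_in, n_out):
    """Randomly initialised (weights, bias) pair; illustrative only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def two_head_forward(shared, emb_layer, mask_layer):
    """Apply both dense layers to the same post-dropout activations."""
    embedding = dense(shared, *emb_layer)     # feature vector for recognition
    mask_logits = dense(shared, *mask_layer)  # [masked score, not-masked score]
    return embedding, mask_logits
```

Both heads read the same shared activations, so the extra output only adds one small dense layer on top of the backbone.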

Thus, from the modified network we obtain two outputs: the logits (unnormalized predictions) of the ArcFace layer ($\mathrm{logits}_{\mathrm{ArcFace}}$) and the logits of the new dense layer ($\mathrm{logits}_{\mathrm{Mask}}$). To extract the combined error from both logits, we start by generating the ArcFace loss ($\mathrm{loss}_{\mathrm{ArcFace}}$) in the same way as in [24]:

$$\mathrm{loss}_{\mathrm{ArcFace}} = \mathrm{crossEnt}(\mathrm{Softmax}(\mathrm{logits}_{\mathrm{ArcFace}}), \mathrm{labels}_{\mathrm{ID}}) \tag{1}$$

Next, we calculate the loss associated with the probability of wearing a mask ($\mathrm{loss}_{\mathrm{Mask}}$) by applying the softmax activation function on the logits and cross-entropy with the mask-usage labels:

$$\mathrm{loss}_{\mathrm{Mask}} = \mathrm{crossEnt}(\mathrm{Softmax}(\mathrm{logits}_{\mathrm{Mask}}), \mathrm{labels}_{\mathrm{Mask}}) \tag{2}$$

The Multi-Task ArcFace loss ($\mathrm{loss}_{\mathrm{MTArcFace}}$) is obtained by adding these two losses. However, to reduce the impact of $\mathrm{loss}_{\mathrm{Mask}}$ and give more importance to the ArcFace loss, we use the logarithm of $\mathrm{loss}_{\mathrm{Mask}}$ instead of the original value:

$$\mathrm{loss}_{\mathrm{MTArcFace}} = \mathrm{loss}_{\mathrm{ArcFace}} + \log(\mathrm{loss}_{\mathrm{Mask}} + 1.0) \tag{3}$$
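Equations (1) to (3) can be sketched for a single sample as follows. This is a minimal illustration, not the authors' TensorFlow code: it assumes the ArcFace logits already include the angular margin and scaling, that the mask-usage loss is computed against a binary mask label (index 0 meaning "masked"), and that the logarithm in (3) is the natural logarithm. The function names are hypothetical.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def mt_arcface_loss(arcface_logits, id_label, mask_logits, mask_label):
    """Multi-Task ArcFace loss for one sample, following Eqs. (1)-(3)."""
    loss_arcface = softmax_cross_entropy(arcface_logits, id_label)  # Eq. (1)
    loss_mask = softmax_cross_entropy(mask_logits, mask_label)      # Eq. (2)
    # Eq. (3): log(x + 1) dampens the mask term so identity dominates.
    return loss_arcface + math.log(loss_mask + 1.0)
```

Since log(x + 1) is never larger than x for non-negative x, the mask term contributes at most its raw value and grows only slowly for large errors, which is what gives the identity loss more weight.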

Finally, we add the regularization loss (as in the original implementation) to compute the total loss that will be used for the optimization:

$$\mathrm{loss}_{\mathrm{total}} = \mathrm{loss}_{\mathrm{MTArcFace}} + \mathrm{loss}_{\mathrm{regularization}} \tag{4}$$

We train the model using 2 Tesla V100 GPUs with a total batch size of 512 for 300k steps. We use the SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.0015. The learning rate is reduced by a factor of 0.3 at steps 120k, 200k and 280k. The rest of the parameters of the network remain the same as in the original implementation. In Figure 3, we show the training loss curve and the face-recognition and mask-usage accuracy curves, compared to those of the original model.
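The learning-rate schedule described above can be written as a small step function. This sketch assumes that "reduced by a factor of 0.3" means the rate is multiplied by 0.3 at each boundary, and that the reduction takes effect at the boundary step itself; the function name is hypothetical.

```python
def learning_rate(step, base_lr=0.0015, factor=0.3,
                  boundaries=(120_000, 200_000, 280_000)):
    """Piecewise-constant learning rate for a given training step."""
    # Count how many decay boundaries have been reached so far.
    drops = sum(1 for b in boundaries if step >= b)
    return base_lr * factor ** drops
```

Under these assumptions the rate goes from 0.0015 to 0.00045 at step 120k, and is multiplied by 0.3 again at steps 200k and 280k.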

In this section, we present the results of a series of experiments aimed at demonstrating the capabilities of the proposed method. We divide the experiments into two groups: identity verification and mask-usage verification.

The first group of experiments is conducted to measure the performance improvement of the proposed method in the face verification task when dealing with masked faces. To measure this increase, we use the original model as the baseline against which the results are compared.

For the verification task, we generate masked versions of 3 well-known face recognition datasets, also used in [16] for evaluating the original models:

• LFW [25]: Labeled Faces in the Wild, a database for studying face recognition in unconstrained environments.
• CFP [26]: a dataset for frontal-to-profile face verification in the wild. Contains two verification protocols, one comparing frontal faces with each other (CFP FF) and one comparing frontal with profile faces (CFP FP). Each of the protocols is composed of 7,000 comparisons. We consider both protocols for our experiments.
• AgeDB [27]: the first manually collected, in-the-wild age database. Contains 16,488 images from 568 celebrities at different ages. Contains four verification protocols where the compared faces have an age difference of 5, 10, 20 and 30 years respectively. We select the last protocol for the experiments (AgeDB 30), as it is the most challenging one. It contains 6,000 comparisons.

In addition, we also consider for the experiment the masked face dataset MFR2 described in [12], with 269 real-world face images from 53 celebrities, where 64% of the faces wear a mask. The associated verification process is composed of 848 comparisons. Some examples of the images of the different datasets considered for the experiment are shown in Figure 4.
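A verification protocol of this kind is a list of face pairs labelled as same or different identity, and accuracy is the fraction of pairs decided correctly. The sketch below assumes pairs are scored by cosine similarity between feature vectors and decided with a fixed threshold; the scoring rule, the threshold value and the function names are assumptions for illustration, since the exact evaluation procedure is not detailed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verification_accuracy(pairs, threshold):
    """Fraction of (emb_a, emb_b, same_identity) comparisons decided correctly."""
    correct = 0
    for emb_a, emb_b, same_identity in pairs:
        # Declare "same person" when the similarity reaches the threshold.
        predicted_same = cosine_similarity(emb_a, emb_b) >= threshold
        correct += int(predicted_same == same_identity)
    return correct / len(pairs)
```

In practice the threshold would be chosen on held-out comparisons rather than fixed by hand.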

The results of the experiment are presented in Table I. It can be observed that the proposed method largely outperforms the original model in the face verification task when dealing with masked faces. This increase in performance is more evident with profile images, where the amount of face information available is reduced, as is the case with CFP FP, where the proposed model is almost 12% more accurate than the original.

Fig. 4. Some examples of the faces of the evaluation datasets and their masked versions. The first row belongs to LFW [25], the second row to CFP [26], the third row to AgeDB [27] and the last row to MFR2 [12].

We also want to test the accuracy of the new model when recognizing non-masked faces, to check whether there has been a significant drop in performance. Thus, we repeat the previous experiment with the original non-masked datasets and compare the results with those achieved by the original model. The results, presented in Table II, show that there is indeed a drop in performance for the new model, but that it is not significant (less than 2% in the worst case). Furthermore, this drop in performance is much smaller than the gain obtained with masked faces. For example, in the case of CFP FP, the model accuracy with masked faces increases by almost 12%, while its accuracy with non-masked faces decreases by less than 2%.

Finally, we want to analyze the performance of the mask-usage probability output added to the proposed method. For this task, we run the model with all the faces contained in every masked and non-masked dataset used in the previous experiments. For each face we check whether the mask-usage probability estimated by the model is correct or not with a threshold of 0.5. Table I shows the results of the experiment. For each dataset, the model achieves nearly 100% accuracy. Again, the worst result is achieved for the CFP FP dataset (98.82%), which contains profile faces. We believe that this is due to the fact that the training dataset does not contain enough profile faces. In any case, the model achieves an average accuracy of 99.78% across all datasets, so its effectiveness for this task is demonstrated.
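The mask-usage check with a threshold of 0.5 can be sketched as follows, assuming the probability is obtained by a softmax over the two mask logits with index 0 corresponding to "masked". The function names and the toy inputs in the usage note are illustrative.

```python
import math

def mask_probability(mask_logits):
    """Softmax probability of the 'masked' class (index 0, by assumption)."""
    m = max(mask_logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in mask_logits]
    return exps[0] / sum(exps)

def mask_usage_accuracy(samples, threshold=0.5):
    """Fraction of (mask_logits, is_masked) samples classified correctly."""
    correct = 0
    for mask_logits, is_masked in samples:
        predicted_masked = mask_probability(mask_logits) > threshold
        correct += int(predicted_masked == is_masked)
    return correct / len(samples)
```

For instance, `mask_usage_accuracy([([2.0, -1.0], True), ([-0.5, 1.5], False)])` counts both toy samples as correct, since the first has a masked probability above 0.5 and the second below it.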

In this work, we have presented a full training pipeline for ArcFace-based face-recognition models to adapt them for working with masked faces. This pipeline includes the generation of a synthetic masked dataset from the original training dataset. Furthermore, we have taken advantage of knowing to which dataset the face belongs and modified the original network to output the probability that a face is wearing a mask without additional computational cost. As a result, we have created a new loss function, called Multi-Task ArcFace, to teach the network to extract vectors of good quality and reliable mask-usage probabilities. Experimental results with multiple masked and non-masked datasets have demonstrated that the proposed method highly boosts the performance of the model when recognizing masked faces, while suffering just a small drop in performance with non-masked faces. Furthermore, its effectiveness for the mask-usage verification task has also been demonstrated, with an average accuracy of 99.78% across all datasets. Future work will focus on extending the applicability of this method to other types of occlusions, such as faces with the eyes covered. In addition, we will also study the possibility of adding a new output to the model to classify whether the subject is wearing the mask correctly or is wearing it under the nose or the mouth.

This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 883341, project GRACE (Global Response Against Child Exploitation).

Fundamentals and Advances in Biometrics and Face Recognition

A review on biometrics and face recognition techniques

Enhanced local texture feature sets for face recognition under difficult lighting conditions

Face normalization: Enhancing face recognition

The effect of wearing a mask on face recognition performance: an exploratory study

NIST finds flaws in facial checks on people with covid masks

Retinamask: A face mask detector

A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the covid-19 pandemic

Masked face detection via a modified lenet

Maskedface-net -a dataset of correctly/incorrectly masked face images in the context of covid-19

Masked face recognition dataset and application

Masked face recognition for secure authentication

A novel gan-based network for unmasking of masked face

Masked face recognition with generative data augmentation and domain constrained ranking

Efficient masked face recognition method during the covid-19 pandemic

Arcface: Additive angular margin loss for deep face recognition

Deep pyramidal residual networks

Deep residual learning for image recognition

Sphereface: Deep hypersphere embedding for face recognition

Facenet: A unified embedding for face recognition and clustering

Look through masks: Towards masked face recognition with de-occlusion distillation

Vggface2: A dataset for recognising faces across pose and age

Ms-celeb-1m: A dataset and benchmark for large-scale face recognition

face recognition tf2

Labeled faces in the wild: A database for studying face recognition in unconstrained environments

Frontal to profile face verification in the wild

Agedb: the first manually collected, in-the-wild age database