title: Towards NIR-VIS Masked Face Recognition
authors: Du, Hang; Shi, Hailin; Liu, Yinglu; Zeng, Dan; Mei, Tao
date: 2021-04-14
DOI: 10.1109/lsp.2021.3071663

Abstract-Near-infrared to visible (NIR-VIS) face recognition is the most common case of heterogeneous face recognition; it aims to match a pair of face images captured in two different modalities. Existing deep learning based methods have made remarkable progress in NIR-VIS face recognition, yet they encounter newly-emerged difficulties during the pandemic of COVID-19, since people are required to wear facial masks to curb the spread of the virus. We define this task as NIR-VIS masked face recognition, and find it problematic when the face in the NIR probe image is masked. First, the lack of masked face data is a challenging issue for network training. Second, most of the facial parts (cheeks, mouth, nose, etc.) are fully occluded by the mask, which causes a large loss of information. Third, the domain gap remains in the unoccluded facial parts. In this scenario, existing methods suffer significant performance degradation. In this paper, we address the challenge of NIR-VIS masked face recognition from the perspectives of training data and training method. Specifically, we propose a novel heterogeneous training method that maximizes the mutual information shared by the face representations of the two domains with the help of semi-siamese networks. In addition, a 3D face reconstruction based approach is employed to synthesize masked faces from existing NIR images.
Resorting to these practices, our solution provides a domain-invariant face representation that is also robust to mask occlusion. Extensive experiments on three NIR-VIS face datasets demonstrate the effectiveness and cross-dataset generalization capacity of our method.

Index Terms-Heterogeneous, NIR-VIS, cross modality, masked face recognition

Near-infrared to visible (NIR-VIS) face recognition has been widely adopted in many face recognition applications, especially under low-illumination conditions. It aims to match a near-infrared (NIR) probe face image with a visible (VIS) gallery face image. Existing deep learning based methods [1], [2], [3], [4], [5] have made remarkable progress in NIR-VIS face recognition. However, during the pandemic of novel coronavirus 2019 (COVID-19), people are required to wear facial masks to curb the spread of the virus. Hence, a masked NIR probe face must be matched with a VIS gallery face. We define this task as NIR-VIS masked face recognition, and find it problematic when the face in the NIR probe image is masked. First, the lack of masked face data is a challenging issue for existing training methods. Second, as shown in Fig. 1, most of the facial parts are occluded by the mask, so a large amount of information is lost. Third, we can also observe that the domain gap remains in the unoccluded facial parts. These issues result in significant performance degradation in NIR-VIS masked face recognition. Therefore, it is urgent to address these newly-emerged difficulties.

Recently, many deep learning methods have been proposed for NIR-VIS face recognition, which can be divided into two schemes. The first scheme focuses on learning face representations from heterogeneous data. These methods [6], [1], [7] aim to learn a common feature representation space in which the face representations of the same identity from the two domains are similar. The typical routine is to pretrain a CNN on large-scale visible images and fine-tune it on the heterogeneous data. Besides, learning a domain-invariant representation is another choice. IDR [2] and W-CNN [8] achieve this by dividing the high-level convolutional layers into two orthogonal subspaces. However, the above methods suffer from two issues in NIR-VIS masked face recognition. On one hand, the domain gap between NIR and VIS images is enlarged by the mask occlusion, which makes it difficult to learn face representations effectively. On the other hand, the limited training data leads to overfitting. Therefore, these methods cannot perform well on NIR-VIS masked face recognition. The second scheme synthesizes face images from one domain to the other to reduce the domain gap. These methods [9], [3], [10], [11], [4], [12], [5], [13] propose to synthesize VIS images from NIR or thermal images, and then perform a regular face recognition algorithm in the VIS domain. With the significant improvement of image generation, the above methods have achieved state-of-the-art performance in general NIR-VIS face recognition. However, in the scenario of NIR-VIS masked face recognition, they fail to synthesize photo-realistic full VIS faces from masked NIR faces, since most of the facial information is lost due to the mask occlusion. In addition, with the pandemic of COVID-19, certain methods [14], [15] have been proposed for general masked face recognition.
Geng et al. [14] introduce a GAN-based method to synthesize masked faces, together with a domain constrained loss that pulls a masked face close to its corresponding full face in the feature space. Besides, a latent part detection model [15] is proposed to locate the facial region that is robust to mask wearing and use it to extract discriminative features. The above methods mainly study general homogeneous masked face recognition, whereas the large domain gap between masked NIR faces and full VIS faces makes our task more challenging.

In this paper, we study NIR-VIS masked face recognition, a challenging task that needs to be addressed during the pandemic of COVID-19. We intend to address it from the perspectives of training data and training method. First, a heterogeneous semi-siamese training (HSST) method is proposed to maximize the mutual information shared by the face representations of the two domains with the help of semi-siamese networks [16]. Specifically, a positive pair of NIR-VIS face images is fed into the semi-siamese networks; by optimizing the learning objective, the semi-siamese networks learn to maximize the mutual information shared by the face representations of masked NIR images and VIS images. Since two heterogeneous prototype queues are employed to compute the training loss, they provide two complementary views for maximizing the mutual information. Second, to obtain realistic masked face data, we adopt a 3D face reconstruction based approach to synthesize masked faces from existing images. Resorting to the above practices, our solution provides a domain-invariant face representation that is also robust to mask occlusion. Extensive experiments on the CASIA NIR-VIS 2.0, Oulu-CASIA NIR-VIS, and BUAA-VisNir datasets demonstrate the effectiveness and cross-dataset generalization capacity of our method.

Semi-siamese training (SST) [16] was proposed to handle shallow face learning. Due to the extreme lack of intra-class diversity, traditional training methods suffer from model degeneration and overfitting on shallow face learning. To address these problems, semi-siamese training consists of a probe-net φ_p that embeds the features of probe images and a gallery-net φ_g that updates a prototype queue with the features of gallery images. The probe features and the feature-based prototype queue are used to compute the training losses. The probe-net is optimized by SGD, and the gallery-net is updated by a moving average. In contrast to shallow face learning, the intra-class diversity in NIR-VIS masked face recognition is large due to the domain gap and the loss of facial information. Based on the semi-siamese networks, we propose heterogeneous semi-siamese training (HSST) for NIR-VIS masked face recognition. Compared with the original SST, we feed a positive pair of heterogeneous faces, i.e., a masked NIR face and a full VIS face, into the two semi-siamese networks respectively (as shown in Fig. 2). Besides, we construct two heterogeneous prototype queues that contain the features of NIR faces and VIS faces inferred by the gallery-net, respectively. Then, the NIR prototype queue is employed to compute the training loss with the features of VIS images inferred by the probe-net, and the VIS prototype queue is used to compute the training loss with the features of NIR images. The two networks are pretrained on a visible face dataset. In the following, we study how HSST can perform well on NIR-VIS masked face recognition.
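To make the training procedure concrete, the following is a minimal PyTorch-style sketch of one HSST iteration, reconstructed from the description above rather than taken from a released implementation. The backbone constructor, the helper names, the scale factor s=30, and the simplified FIFO queue update are our assumptions; the exact form of the loss is derived in the next section.

```python
import torch
import torch.nn.functional as F

def hetero_softmax_loss(probe_feat, pos_proto, proto_queue, s=30.0):
    """Softmax cross-entropy over one positive pair and n queue negatives."""
    pos = (probe_feat * pos_proto).sum(dim=1, keepdim=True)  # cosine to the positive
    neg = probe_feat @ proto_queue.t()                       # cosines to queue prototypes
    logits = s * torch.cat([pos, neg], dim=1)
    labels = torch.zeros(probe_feat.size(0), dtype=torch.long,
                         device=probe_feat.device)           # the positive is class 0
    return F.cross_entropy(logits, labels)

def hsst_step(probe_net, gallery_net, nir_imgs, vis_imgs,
              nir_queue, vis_queue, optimizer, ema=0.999, queue_size=128):
    """One HSST iteration on a batch of positive masked-NIR / VIS pairs."""
    f_nir_p = F.normalize(probe_net(nir_imgs))   # probe-net embeds both modalities
    f_vis_p = F.normalize(probe_net(vis_imgs))
    with torch.no_grad():                        # gallery-net provides the prototypes
        f_nir_g = F.normalize(gallery_net(nir_imgs))
        f_vis_g = F.normalize(gallery_net(vis_imgs))

    # Two complementary views: L(I_N, I_V) uses the VIS queue, L(I_V, I_N) the NIR queue.
    loss = (hetero_softmax_loss(f_nir_p, f_vis_g, vis_queue) +
            hetero_softmax_loss(f_vis_p, f_nir_g, nir_queue))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The gallery-net follows the probe-net by moving average (weight 0.999).
    for p_g, p_p in zip(gallery_net.parameters(), probe_net.parameters()):
        p_g.data.mul_(ema).add_(p_p.data, alpha=1.0 - ema)

    # FIFO update of the heterogeneous prototype queues (size 128 in the paper).
    nir_queue = torch.cat([f_nir_g, nir_queue])[:queue_size]
    vis_queue = torch.cat([f_vis_g, vis_queue])[:queue_size]
    return loss.item(), nir_queue, vis_queue
```

In the paper, each image of a positive pair is routed to the probe-net or the gallery-net with even chance; the sketch computes both routings in one step for brevity.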
Generally, the classification scheme is a typical routine for deep face representation learning. We take the softmax cross-entropy loss function (omitting the bias term) as an example, which can be formulated as:

$$L = -\log \frac{e^{s\cos(\theta_{i,y})}}{e^{s\cos(\theta_{i,y})} + \sum_{j=1, j\neq y}^{n} e^{s\cos(\theta_{i,j})}},$$

where cos(θ_{i,y}) is the cosine similarity between the feature x_i and its ground-truth prototype w_y, s is the scale factor, and n is the number of prototypes. Let I_N and I_V be a positive pair of NIR and VIS face images, respectively. Suppose that we input I_N into the probe-net and I_V into the gallery-net. In this condition, the softmax loss can be reformulated as:

$$L(I_N, I_V) = -\log \frac{e^{s\cos(\phi_p(I_N),\,\phi_g(I_V))}}{e^{s\cos(\phi_p(I_N),\,\phi_g(I_V))} + \sum_{j=1}^{n} e^{s\cos(\phi_p(I_N),\,f^V_j)}},$$

where cos(·,·) denotes the cosine similarity between two features, φ_p(I_N) is the NIR face representation inferred by the probe-net, φ_g(I_V) is the VIS face representation inferred by the gallery-net, and f^V_j is the j-th feature in the VIS prototype queue. For each iteration, the sampled training identities are disjoint from the identities in the prototype queue; thus, there are one positive pair and n negative pairs in the above loss function. Besides, φ_g(I_V) is not identical to any f^V_j, since they are inferred by the gallery-net at different states.

Then, we denote the mutual information between the NIR and VIS face representations as I(φ_p(I_N); φ_g(I_V)). Minimizing the learning objective L(I_N, I_V) is equivalent to maximizing this mutual information, which can be formulated as:

$$I(\phi_p(I_N); \phi_g(I_V)) = H(\phi_p(I_N)) - H(\phi_p(I_N) \mid \phi_g(I_V)),$$

where H(·) denotes the entropy. By minimizing the training loss L(I_N, I_V), the face representations are spread in the feature space, while those of the same identity from the two domains become close. In other words, the training implicitly maximizes the entropy H(φ_p(I_N)) and minimizes the conditional entropy H(φ_p(I_N)|φ_g(I_V)). Since the NIR face and the VIS face have an even chance of being fed into the probe-net or the gallery-net, the heterogeneous prototype queues are utilized to compute two training losses, i.e., L(I_N, I_V) and L(I_V, I_N), which are minimized simultaneously during training. They provide two complementary views to maximize the mutual information of the NIR and VIS face representations. In this way, HSST maximizes both I(φ_p(I_N); φ_g(I_V)) and I(φ_p(I_V); φ_g(I_N)), which facilitates the semi-siamese networks to improve the quality of the domain-invariant face representation that is also robust to mask occlusion. Besides, it is worth noting that we only take the probe-net for the tests, which guarantees the same inference efficiency and comparison fairness as a single-network design.

Since collecting realistic masked faces is expensive, we present a 3D face reconstruction based method to synthesize masked faces. Our method employs PRNet [17] to extract the UV texture map and its corresponding UV position map to represent the 3D face. Fig. 3 shows the pipeline of our method for synthesizing the masked face.

Fig. 3: The pipeline of masked face synthesis. First, we obtain different kinds of mask templates. Then, we conduct 3D face reconstruction and add the mask template onto the UV texture map of the non-masked face. Finally, we recover the 2D masked face image from the masked UV texture map.

Specifically, we first segment the facial mask from real masked face images and obtain the UV texture map of the mask template T_M. Then, given a non-masked face image I, we obtain its UV texture map T_I and UV position map P_I, remove the corresponding region on T_I according to the mask template, and obtain the remaining UV texture map T̃_I. Finally, we add the mask template T_M onto T̃_I. This operation can be simply formulated as:

$$T^M_I = \tilde{T}_I + T_M,$$

where T^M_I is the UV texture map of the masked face image. Then, we recover the 2D masked face image I_M from the UV texture map T^M_I and the UV position map P_I. Compared with the 2D-landmark-based and GAN-based masked face generation methods [14], [15], we consider 3D face reconstruction a more accurate practice for masked face synthesis, especially in the large-pose case.
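For concreteness, the compositing and rendering steps can be sketched as follows, assuming PRNet-style 256×256 UV maps. The function names, the boolean mask_region template, and the nearest-point splatting (a crude stand-in for the triangle rasterization a real renderer would use) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def composite_mask(tex_face, tex_mask, mask_region):
    """Realize T^M_I = T~_I + T_M on the UV texture map.
    tex_face, tex_mask: (256, 256, 3) UV textures; mask_region: (256, 256) bool."""
    tex_out = tex_face.copy()
    tex_out[mask_region] = 0.0                     # remove the facial region (T~_I)
    tex_out[mask_region] = tex_mask[mask_region]   # paste the mask template (+ T_M)
    return tex_out

def render_masked_face(tex_masked, pos_map, h, w):
    """Project the masked UV texture back to a 2D image via the UV position map.
    pos_map: (256, 256, 3) per-UV-pixel (x, y, z) image-space coordinates from PRNet;
    in practice only the valid face region of the position map should be used."""
    img = np.zeros((h, w, 3), dtype=tex_masked.dtype)
    zbuf = np.full((h, w), -np.inf)
    xs = np.clip(np.round(pos_map[..., 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(pos_map[..., 1]).astype(int), 0, h - 1)
    zs = pos_map[..., 2]
    for u in range(pos_map.shape[0]):
        for v in range(pos_map.shape[1]):
            x, y, z = xs[u, v], ys[u, v], zs[u, v]
            if z > zbuf[y, x]:                     # keep the nearest surface point
                zbuf[y, x] = z
                img[y, x] = tex_masked[u, v]
    return img
```

The rendered face region would then be blended back into the original image; applying different mask templates to the same face is a natural way to diversify the synthesized training data.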
To demonstrate the effectiveness of our method, we employ three widely-used NIR-VIS face datasets: CASIA NIR-VIS 2.0 [18], Oulu-CASIA NIR-VIS [19], and BUAA-VisNir [20]. Among them, CASIA NIR-VIS 2.0 is the largest NIR-VIS face dataset, containing 725 identities with 17,580 face images. Oulu-CASIA NIR-VIS consists of 80 identities with 6 different expressions, and each identity contains 48 NIR images and 48 VIS images; the training set and test set each contain 20 identities. BUAA-VisNir contains 150 identities with 9 NIR images and 9 VIS images per identity; the training set contains 50 identities with 900 images, and the test set contains the remaining 1,800 images. Since the NIR face images are used as the probe images in the test protocol, we add masks to all the NIR images in both the training and test sets. Besides, we utilize MS1M-v1c [21] as the pre-training dataset. All faces are detected by FaceBoxes [22], then aligned and cropped to 144×144 images according to five facial landmarks.

We employ two kinds of basic networks: MobileFaceNet [23] in the ablation study and ResNet-50 [24] in the cross-dataset experiment. All the models are pretrained on the MS1M-v1c dataset. For comparison, we employ the plain training method [6], [1], [7] as the baseline, which pre-trains the model on MS1M-v1c and fine-tunes it on the NIR-VIS face datasets. The model is trained with two kinds of loss functions: the classification losses (softmax, AM-softmax [25], and Arc-softmax [26]) and the feature embedding loss (i.e., triplet [27]). Moreover, we also compare with two representation learning based methods, IDR [2] and W-CNN [8], in the cross-dataset experiment. In the training stage, the batch size is 64, and the learning rate is initially set to 0.0005 and divided by 10 at 6k and 8k iterations; training finishes at 10k iterations. We set the prototype queue size to 128 and the moving-average weight to 0.999. In the evaluation stage, a 512-dimension face representation is extracted from the last fully connected layer of the basic network, and the cosine similarity is used as the similarity metric. Note that we only utilize the probe-net for evaluation.
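As a reading of this evaluation protocol, the matching step can be sketched as follows; the function names are ours, and the verification thresholding shown is one common way to compute the verification rate at a target FAR, not necessarily the exact tooling used by the authors.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(net, imgs):
    """512-d face representations from the probe-net, L2-normalized for cosine scoring."""
    return F.normalize(net(imgs))

def rank1_accuracy(probe_feats, gallery_feats, probe_ids, gallery_ids):
    """Rank-1 identification: each masked NIR probe is matched to its nearest VIS gallery."""
    sims = probe_feats @ gallery_feats.t()       # cosine similarity matrix
    nearest = sims.argmax(dim=1)
    return (gallery_ids[nearest] == probe_ids).float().mean().item()

def verification_rate(genuine_sims, impostor_sims, far=0.001):
    """VR@FAR: the threshold is set so the impostor pass rate equals the target FAR."""
    thr = torch.quantile(impostor_sims, 1.0 - far)
    return (genuine_sims >= thr).float().mean().item()
```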
In the ablation study, we use MobileFaceNet [23] with the softmax loss and train the model on the CASIA NIR-VIS 2.0 [18] dataset, following the standard test protocol in view 2, which contains 10-fold experiments; each fold contains 357 identities with about 2,500 VIS images and 6,100 NIR images. For the evaluation metric, we report the rank-1 accuracy and the verification rate at FAR=0.1%. Table I shows the results on the first fold of the non-masked and the synthesized masked CASIA NIR-VIS 2.0 dataset. From the top three rows, we can observe significant performance degradation in NIR-VIS masked face recognition. Moreover, comparing with the bottom three rows, we can conclude that HSST achieves better performance not only on non-masked faces but also on masked faces. In addition, we conduct the full 10-fold experiments on CASIA NIR-VIS 2.0 with various training loss functions, and report the mean and standard deviation of the rank-1 accuracy and the verification rate at FAR=1%. As shown in Fig. 4, the results show stable improvement brought by HSST, which validates the superiority of our method on NIR-VIS masked face recognition.

In the cross-dataset experiment, we adopt ResNet-50 as the backbone and follow the protocol of PCFH [4] to train the model on the first fold of the CASIA NIR-VIS 2.0 dataset. The trained models are then evaluated on the Oulu-CASIA NIR-VIS and BUAA-VisNir datasets. 1) Results on the first fold of CASIA NIR-VIS 2.0: Table II shows the performance on the first fold of the synthesized masked CASIA NIR-VIS 2.0 dataset. From the results of the plain training method, we find that the triplet loss achieves better performance than the softmax loss and its variants; we consider the feature embedding loss more suitable than the classification loss in such a case of limited training samples. Whichever loss function is used, a significant performance improvement is obtained by HSST. 2) Results on the Oulu-CASIA NIR-VIS and BUAA-VisNir datasets: The Oulu-CASIA NIR-VIS dataset collects images from CASIA and Oulu University, and utilizes all the VIS images of the same identity as the gallery. As shown in Table II, we only report the verification rates at FAR=1% and 0.1%, since all the methods achieve 100% rank-1 accuracy. Table II also shows the performance comparison on BUAA-VisNir, for which we report the rank-1 accuracy and the verification rates at FAR=1% and 0.1%. We can observe that HSST significantly improves the performance, especially at the strict false accept rate. Benefiting from the design of heterogeneous prototypes and semi-siamese networks, our method achieves better performance on all the benchmarks, which demonstrates its better generalization capacity in the cross-dataset case.

In this paper, we aim to address NIR-VIS masked face recognition from the perspectives of training data and training method. To this end, we propose heterogeneous semi-siamese training (HSST) to maximize the mutual information shared by the face representations of masked NIR and VIS images from two views, which facilitates the model to learn a domain-invariant face representation that is also robust to mask occlusion. Moreover, we employ a 3D face reconstruction based method to synthesize masked faces to address the lack of masked face data. Extensive experiments on three NIR-VIS datasets demonstrate the superiority of our training method over the conventional training routine.
References

[1] Transferring deep representation for NIR-VIS heterogeneous face recognition
[2] Learning invariant deep representation for NIR-VIS face recognition
[3] Not afraid of the dark: NIR-VIS face recognition via cross-spectral hallucination and low-rank embedding
[4] Pose-preserving cross spectral face hallucination
[5] Cross-spectral face hallucination via disentangling independent factors
[6] Heterogeneous face recognition with CNNs
[7] Seeing the forest from the trees: A holistic approach to near-infrared heterogeneous face recognition
[8] Wasserstein CNN: Learning invariant features for NIR-VIS face recognition
[9] NIR-VIS heterogeneous face recognition via cross-spectral joint dictionary learning and reconstruction
[10] Adversarial discriminative heterogeneous face recognition
[11] Polarimetric thermal to visible face verification via attribute preserved synthesis
[12] Polarimetric thermal to visible face verification via self-attention guided synthesis
[13] Multi-scale thermal to visible face verification via attribute guided synthesis
[14] Masked face recognition with generative data augmentation and domain constrained ranking
[15] Masked face recognition with latent part detection
[16] Semi-siamese training for shallow face learning
[17] Joint 3D face reconstruction and dense alignment with position map regression network
[18] The CASIA NIR-VIS 2.0 face database
[19] Learning mappings for face synthesis from near infrared to visual light images
[20] The BUAA-VisNir face database instructions
[21] MS-Celeb-1M: A dataset and benchmark for large-scale face recognition
[22] FaceBoxes: A CPU real-time face detector with high accuracy
[23] MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices
[24] Residual attention network for image classification
[25] Additive margin softmax for face verification
[26] ArcFace: Additive angular margin loss for deep face recognition
[27] FaceNet: A unified embedding for face recognition and clustering