title: Real Masks and Spoof Faces: On the Masked Face Presentation Attack Detection
authors: Fang, Meiling; Damer, Naser; Kirchbuchner, Florian; Kuijper, Arjan
date: 2021-03-02

Face masks have become one of the main methods for reducing the transmission of COVID-19. This makes face recognition (FR) a challenging task because masks hide several discriminative features of faces. Moreover, face presentation attack detection (PAD) is crucial to ensure the security of FR systems. In contrast to the growing number of masked FR studies, the impact of masked attacks on PAD has not been explored. Therefore, we present novel attacks with real face masks placed on presentations and attacks with subjects wearing masks to reflect the current real-world situation. Furthermore, this study investigates the effect of masked attacks on PAD performance by using seven state-of-the-art PAD algorithms under different experimental settings. We also evaluate the vulnerability of FR systems to masked attacks. The experiments show that real masked attacks pose a serious threat to the operation and security of FR systems.

Since the SARS-CoV-2 coronavirus outbreak and its rapid global spread, wearing a mask has become one of the most efficient ways to protect against and prevent infection with COVID-19. However, for identity checks in crowded scenarios such as at airports, removing the mask for face recognition (FR) increases the chance of infection. Wearing masks in public might be an essential health measure and a new norm even after the COVID-19 pandemic, as most countries support the use of masks to minimize the spread of the virus. As a result, researchers have shown an increased interest in the effect of face masks on the performance of FR verification [1], [2], [3]. The results of their studies have shown that pre-COVID-19 FR algorithms suffer performance degradation owing to masked faces. However, attacks compromising the security of FR systems for subjects wearing face masks have so far been overlooked. In this study, security refers to presentation attacks (PAs). Attackers can use PAs to spoof FR systems by impersonating someone or obfuscating their identity. Common PAs include printed photos/images, replayed videos, and 3D masks [4], [5]. Driven by the ongoing COVID-19 pandemic, presentation attack detection (PAD) [6] has encountered several understudied challenges when facing masked faces. Current face PAD databases [7], [8], [9] only contain printed images or replayed videos in which subjects were not wearing face masks. Therefore, there is uncertainty about the relationship between the performance of PAD techniques and PAs with face masks. Moreover, the vulnerability of FR systems to masked attacks remains unclear. To close these gaps, researchers require well-studied masked PAs. In this study, we design and collect three types of attacks based on masked and unmasked face images collected realistically and collaboratively [2], [3]. The bona fide samples are divided into the categories BM0 (subjects wearing no masks) and BM1 (subjects wearing masks). AM0 data are unmasked print/replay attacks, the type of data commonly used in most current PAD databases. AM1 data include print/replay attacks where live subjects wore masks. In addition, we provide a novel partial attack type, called AM2, where a real medical mask is placed on printed photos or replayed videos to simulate the subject wearing a mask. This is motivated by our assumption that AM2 might be a challenging attack, as it contains both bona fide and attack presentations that may confuse PAD and/or FR systems. The data samples are presented in Fig. 1.
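For reference throughout the paper, the five sample categories can be kept straight with a few lines of code. This is a minimal sketch using our own labels and helper name; it is not tooling shipped with the database:

```python
from enum import Enum

class CRMACategory(Enum):
    """The five sample categories of the CRMA database (Sec. I and III)."""
    BM0 = "bona fide, subject wearing no mask"
    BM1 = "bona fide, subject wearing a mask"
    AM0 = "print/replay attack created from an unmasked face"
    AM1 = "print/replay attack created from a masked face"
    AM2 = "print/replay attack of an unmasked face with a real mask placed on it"

def is_attack(category: CRMACategory) -> bool:
    # Bona fide categories start with "BM", attack categories with "AM".
    return category.name.startswith("AM")
```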
The main contributions of this study are:
• The novel Collaborative Real Mask Attack Database (CRMA) is presented. Three types of PAs, called AM0 (unmasked face PA), AM1 (masked face PA), and AM2 (unmasked face PA with a real mask placed on the PA) (as shown in Fig. 1), were created for both print and replay presentation attack instruments (PAIs). To create such attacks, three high-resolution electronic tablets and three capture scales were used. Additionally, we designed three experimental protocols to explore the effect of masked attacks on PAD performance.
• Extensive experiments are conducted to explore the effect of bona fide samples, masked face attacks, and real masks (on spoof faces) on face PAD behavior. To support the comprehensive evaluation, seven PAD algorithms comprising texture-based, deep-learning-based, and hybrid methods were selected to evaluate the performance and generalizability in intra- and cross-database scenarios under three mask-related protocols. Both quantitative and qualitative analyses revealed that masked bona fides and PAs dramatically decrease the performance of PAD algorithms. Moreover, deep-learning-based methods perform worse on real mask attacks than on masked face attacks in most cases.
• An in-depth vulnerability analysis of FR systems is presented. We evaluated three deep-learning-based FR techniques on the three types of PAs. The experimental results indicate that these three FR networks exhibit significantly higher vulnerabilities to real mask attacks than to masked face attacks.
We provide a brief review of relevant works in Sec. II. Then, our novel CRMA database is described in detail in Sec. III. The face PAD algorithms and FR systems used in this study are introduced in Sec. IV. Sec. V introduces the three PAD protocols and PAD evaluation metrics, and then discusses the PAD results. Sec. VI describes the three FR experimental settings and the FR metrics used, and analyzes the vulnerability of FR systems. Finally, conclusions are presented in Sec. VII.
This section reviews the most relevant prior works from three perspectives: face PAD databases, face PAD methods, and FR and vulnerability analysis. At the end of each part, the difference between our work and prior work is pointed out.
Face PAD Databases: Data resources have become especially important ever since the advent of deep learning, because machine-learning-based algorithms risk underfitting or overfitting on limited data. Given the significance of good-quality databases, several face PAD databases have been released, such as NUAA [10], CASIA-FAS [11], Replay-Attack [12], MSU-MFSD [13], OULU-NPU [7], and SiW [8], all consisting of 2D print/replay attacks. In addition, the SiW-M [9] and CelebA-Spoof [14] databases provide multiple types of attacks, such as makeup, 3D masks, or paper cuts. Moreover, some multimodal databases are publicly available: 3DMAD [15], Msspoof [16], CASIA-SURF [17], and CSMAD [18]. These databases undoubtedly contribute to the significant progress of PAD research.
For example, the CelebA-Spoof database comprises images from various environments and illuminations with rich annotations to reflect real scenes. However, these databases also have weaknesses: 1) the multimodal databases have high hardware requirements and cannot be widely used in daily life; 2) some databases, such as CASIA-FAS [11] and MSU-MFSD [13], cannot satisfy current needs because of the lower quality of their outdated acquisition sensors; 3) OULU-NPU [7], SiW [8], SiW-M [9], and CelebA-Spoof [14] are relatively up-to-date, but they do not consider PAs with real face masks to fit the current COVID-19 pandemic. Hence, we collected the CRMA database to fill the gaps in these databases in the context of the ongoing COVID-19 pandemic; furthermore, we ensure that the database is generalizable and compatible with real scenarios. The CRMA database can be used to better analyze the effect of a real mask on PAD performance and the vulnerability of FR systems to novel attacks, such as placing a real mask on an attack presentation. Detailed information on the databases mentioned above is presented in Tab. I.
Face PAD Methods: In recent years, there has been an increasing number of studies in the field of face PAD [19], [20], [21]. These studies can be broadly grouped into three categories: texture-based methods, deep-learning-based methods, and hybrid methods. Texture features, such as the local binary pattern (LBP) [22], project faces into low-dimensional embeddings. Määttä et al. [23] proposed an approach using multi-scale LBP to encode the micro-texture patterns into an enhanced feature histogram for face PAD. The resulting histograms were then fed to a support vector machine (SVM) classifier to determine whether a sample is a bona fide or an attack. LBP features extracted from different color spaces [24] were further proposed to utilize chrominance information. They achieved competitive results on the Replay-Attack [12] (equal error rate (EER) of 0.4%) and CASIA-FAS [11] (EER of 6.2%) databases. Furthermore, Boulkenafet et al. [25] organized a face PAD competition based on the OULU-NPU database and compared 13 algorithms provided by participating teams and one color-LBP-based method (referred to as the baseline in [25]). In this competition, the GRADIANT algorithm fused multiple sources of information, namely color, texture, and motion, and achieved competitive results in the four evaluation protocols. In addition to the texture-based GRADIANT approach, a deep-learning-based method (MixedFASNet) and a hybrid method (CPqD) also achieved low error rates in all experimental protocols. CPqD fused the results of a fine-tuned Inception-v3 network and the color-LBP-based method (referred to as the baseline in [25]). Consequently, we chose to re-implement the color-LBP and CPqD methods in this study (details in Sec. IV-A), while GRADIANT and MixedFASNet are discarded in our work because they do not provide sufficient details for re-implementation. Deep-learning-based methods have been pushing the frontier of face PAD research and have shown remarkable improvements in PAD performance. Lucena et al. [26] presented an approach called FASNet, in which a pre-trained VGG16 is fine-tuned by replacing the last fully connected layer. The FASNet network achieved excellent performance on the 3DMAD [15] and Replay-Attack [12] databases. Recently, George et al. [27] proposed training a network with pixel-wise binary supervision on feature maps to exploit information from different patches.
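The following PyTorch fragment is a minimal sketch of this pixel-wise binary supervision idea, assuming a backbone that outputs a 14 × 14 feature map; it is our own simplification for illustration, not the exact DeepPixBis architecture:

```python
import torch
import torch.nn as nn

class PixelWiseBinaryHead(nn.Module):
    """Supervise every spatial location of a feature map with a binary label,
    in addition to a single binary output for the whole image."""
    def __init__(self, in_channels: int = 384):
        super().__init__()
        self.map_conv = nn.Conv2d(in_channels, 1, kernel_size=1)  # pixel-wise logits
        self.fc = nn.Linear(14 * 14, 1)                           # global binary logit

    def forward(self, features: torch.Tensor):
        pixel_logits = self.map_conv(features)            # (B, 1, 14, 14)
        binary_logit = self.fc(pixel_logits.flatten(1))   # (B, 1)
        return pixel_logits, binary_logit

def pad_loss(pixel_logits, binary_logit, label):
    # label: (B,) with 1 = bona fide, 0 = attack; every pixel of the map
    # inherits the image label, so each patch is supervised individually.
    bce = nn.BCEWithLogitsLoss()
    pixel_targets = label.view(-1, 1, 1, 1).expand_as(pixel_logits).float()
    return bce(pixel_logits, pixel_targets) + bce(binary_logit.squeeze(1), label.float())
```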
DeepPixBis [27] not only outperformed state-of-the-art algorithms in Protocol-1 of the OULU-NPU database (0.42% ACER) but also achieved significantly better results than traditional texture-based approaches in the cross-database scenario. Considering the popularity of these PAD techniques and their ease of implementation, we also chose FASNet and DeepPixBis (details in Sec. IV-A) to study the effect of real mask and masked face attacks on PAD performance.
Face Recognition and Vulnerability Analysis: As one of the most popular biometric modalities, the face has received increasing attention in authentication/security processes, such as smartphone face unlocking and automatic border control (ABC). Moreover, FR techniques [28], [29], [30] have achieved significant performance improvements, and many personal electronic products have deployed FR technology. However, the ongoing COVID-19 pandemic brings a new challenge related to the behavior of collaborative recognition techniques when dealing with masked faces. Collaborative data collection refers to a subject actively intending to use the FR system, such as when unlocking a personal device or using an ABC gate, in contrast to a non-collaborative capture scenario where the user does not intentionally use the FR service, such as in the case of surveillance. The National Institute of Standards and Technology (NIST) [1] provided a preliminary study that evaluated the performance of 89 commercial FR algorithms developed before the COVID-19 pandemic. Their results indicated that face masks digitally applied to photos decreased the recognition accuracy; for example, even the best of the 89 algorithms had error rates between 5% and 50%. It is worth noting that the masks used in those experiments were synthetically created. Damer et al. [2], [3] presented a real mask database to simulate a realistically variant collaborative face capture scenario. Each participant was asked to simulate a login scenario by actively looking toward a capture device, such as a static webcam or a mobile phone. Our attack samples were created and collected based on this masked face data, which constitutes the bona fide samples in the PAD case (as described in Sec. III). They also explored the effect of wearing a mask on FR performance and concluded that face masks significantly reduce the accuracy of algorithms. Mohammadi et al. [31] provided empirical evidence to support the claim that CNN-based FR methods are extremely vulnerable to 2D PAs. Subsequently, Bhattacharjee et al. [18] presented the first FR-vulnerability study on 3D PAs. Their experiments also clearly showed that CNN-based FR methods are vulnerable to custom 3D mask PAs. However, the vulnerability of FR systems to PAs with face masks has not been investigated. Therefore, in this study, we selected three CNN-based FR algorithms for further FR-vulnerability analysis on masked face attacks: the state-of-the-art ArcFace [28], SphereFace [29], and VGGFace [30]. These algorithms are discussed in more detail in Sec. IV-B. Our proposed CRMA database can serve as a supplement to the databases in Tab. I and, because of the COVID-19 pandemic, it can better reflect the possible issues facing real-world PAD performance.
The CRMA database includes 1) both unmasked (BM0) and masked (BM1) bona fide samples collected in a realistic scenario [2], [3], 2) conventional replay and print PAs created from faces not wearing a mask (AM0), 3) replay and print PAs created from masked face images (AM1), and 4) novel PAs where the PAs of unmasked faces are (partially) covered with real masks (AM2), as shown in Fig. 1. Damer et al. [2], [3] collected data to investigate the effect of wearing a mask on face verification performance. For PAD, such data are considered bona fide. The data presented in this study build on an extended version of the data introduced in [2], [3], by creating and capturing different types of PAs based on the bona fide data captured in [2], [3]. As a result, the bona fide data in this work are an extended version of the data introduced in [2], [3], while the attack data presented here are completely novel and have not been previously studied. Fig. 3 presents general statistical information on the CRMA database. The database contains 62% males and 38% females. The AM0, AM1, and AM2 attack ratios are 30%, 60%, and 10%, respectively, as will be described later in this section. Additionally, we counted the distribution of the proportion of the face size in the videos. The histogram shows that the proportion of the face area in the videos is mostly between 5% and 30%. This section first describes the bona fide samples provided by [2], [3], and then introduces our process of attack sample creation and collection.
To explore FR performance on masked faces, Damer et al. [2], [3] recently presented a database in which the subjects wear face masks. This database simulates a collaborative environment in which participants record videos while actively looking toward the capture device. During this process, eyeglasses were removed when the frame was considered very thick, following the International Civil Aviation Organization (ICAO) standard [32]. The videos were captured by the participants at their residences while working from home. Therefore, the types of face masks, capture devices, illumination, and backgrounds varied. For PAD, these videos are classified as bona fide and are used later to create attack samples. The final version [3] of this database contains 47 participants. Each subject recorded a total of nine videos over three days, with three different scenarios per day. In contrast to the study by Damer et al. [2], which examined the effects of both face masks and illumination variations, we focus only on the impact of face masks on PAD performance. Hence, in our study, the bona fide videos are divided into two categories: a face without a mask is denoted as BM0 (three videos per subject), and a face with a mask is denoted as BM1 (six videos per subject) (as shown in the right column of Fig. 1). Most FR databases try to collect data under various harsh conditions, such as poor lighting, strong occlusion, or low resolution. Such databases attempt to reproduce what might happen in a real-world scenario when a legitimate user seeks authorization [33]. In contrast, attackers use highly sophisticated artifacts, such as high-resolution images or videos, to maximize the success rate when impersonating someone. For this reason, we first collected the PAs in a windowless room with all lights on.
Second, three high-resolution electronic tablets were used in the acquisition process: 1) an iPad Pro (10.5-inch) with a display resolution of 2224 × 1668 pixels, 2) a Samsung Galaxy Tab S6 with a display resolution of 2560 × 1600 pixels, and 3) a Microsoft Surface Pro 6 with a display resolution of 2736 × 1824 pixels. During data collection, the capture devices and the displayed images/tablets were stationary. The videos were captured at a resolution of 1920 × 1080. In addition, each video had a minimum length of 5 seconds, and the frame rate was 30 fps. This work focuses on the two most common PAIs, print and replay attacks, owing to their ease of creation and low cost. The attack data in each PAI (see the samples in Fig. 1) are divided into three types: 1) the spoof face with no face mask (AM0), 2) the spoof face with a mask on (AM1), and 3) the spoof face with no face mask, but with a real mask placed on it to simulate a participant wearing a mask (AM2). However, the size of the face area in each video is slightly inconsistent because the videos were recorded by the participants themselves. To reproduce the appearance of wearing a mask in the real world, we cropped five face masks to fit most of the faces (see Fig. 3). The five masks are three small blue surgical masks, one slightly bigger white face mask, and one uncropped mask. When placing the mask, we selected a suitable mask according to the size of the face in the printed image or video, aiming to cover the area from the nose to the chin and the cheeks without exceeding the face boundary. The details of each PAI are as follows:
Print image attack: In the print PAI, an attacker tries to fool the FR system using a printed photo. Considering the instability of the face during the first second, such as the participant pressing the recording button or adjusting the sitting position, the 35th frame of each bona fide video was printed out as an attack artifact. Therefore, we obtained nine photos per subject. The three tablets mentioned above were used to capture the photos. Furthermore, to increase the diversity and variety of the data, each tablet captured three videos of each photo at three scales (see examples in Fig. 2). The videos captured at the first scale contained the whole photo (100%), the second scale covered most of the original photo (80%), and the third scale focused on the face area (60%) as much as possible. In addition to collecting attack data solely from printed images, we also collected data from real face masks overlaid on photos (i.e., the previously defined AM2). Theoretically, real masks reduce the region of artificial features and increase the complexity and mixture of the features in the collected attack data. Eventually, 90 print attack videos were generated for each subject, that is, a total of 4,230 videos for 47 subjects in the print PAI.
Replay video attack: In the replay PAI, an attacker tries to gain authentication by replaying a video. The print and replay PAIs share three aspects of the collection process: the use of three tablets, the use of three scales, and the AM2 data creation procedure. The difference is that the tablets were also used to display the replayed videos (see examples in Fig. 2). While one tablet was replaying the video, the other two tablets were used to capture the data. As a result, each subject corresponds to 180 replay attack videos (162 videos in the AM0 and AM1 groups and 18 videos in AM2), i.e., there are a total of 8,460 videos in this attack subset.
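As a sanity check on these counts, the per-subject bookkeeping can be reproduced in a few lines. This is a minimal sketch using only the numbers stated above:

```python
# Per-subject video bookkeeping of the CRMA database (values from Sec. III).
SUBJECTS = 47
BONA_FIDE_PER_SUBJECT = 3 + 6                 # 3 BM0 + 6 BM1 videos per subject
TABLETS, SCALES = 3, 3

# Print PAI: one photo per bona fide video (9 photos), each captured by
# 3 tablets at 3 scales; plus AM2 captures where a real mask is placed
# on an unmasked printed photo.
print_am0_am1 = 9 * TABLETS * SCALES          # 81
print_am2 = 90 - print_am0_am1                # 9 (stated total is 90 per subject)

# Replay PAI: one tablet replays while the other two capture
# (3 x 2 display/capture pairs), again at 3 scales.
replay_am0_am1 = 9 * (TABLETS * 2) * SCALES   # 162
replay_am2 = 1 * (TABLETS * 2) * SCALES       # 18

per_subject_attacks = print_am0_am1 + print_am2 + replay_am0_am1 + replay_am2
total = SUBJECTS * (BONA_FIDE_PER_SUBJECT + per_subject_attacks)
print(total)  # 13113, matching the database size reported in the conclusions
```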
Fig. 3: The statistics of the subjects and the mask shapes used for creating AM2 samples in the CRMA database. From left to right: gender, the ratios of the attack types (AM0, AM1, AM2), a histogram of the probability distribution of the face-size ratio, and the applied mask shapes.
Seven face PAD algorithms were selected, and three FR algorithms were introduced for further vulnerability analysis. In both the PAD and FR experiments, the widely used multi-task cascaded convolutional networks (MTCNN) [34] technique was adopted to detect and crop the face. A competition [25] was carried out in 2017 to evaluate and compare the generalization performance of face PAD techniques under real-world variations. In this competition [25], 14 participating teams together with the organizers contributed several state-of-the-art approaches. We chose two of these methods (as previously discussed in Sec. II), the LBP-based method (referred to as the baseline in [25]) and the hybrid CPqD, and included additional solutions. We re-implemented a total of seven face PAD algorithms in this study, which can be categorized into three groups: handcrafted features, deep-learning features, and hybrid features. For further cross-database evaluation scenarios, we used three popular publicly available databases mainly involving 2D PAs (details in Sec. II): CASIA-FAS [11], MSU-MFSD [13], and OULU-NPU [7]. A brief description of the adopted methods is provided below:
• LBP: The LBP method is the baseline method in [25], provided by the competition organizers, which utilizes the color texture technique. The face in a frame is first detected, cropped, and normalized to a size of 64 × 64 pixels. Second, the RGB face is converted into the HSV and YCbCr color spaces. Third, LBP features are extracted from each channel. The six obtained LBP histograms are then concatenated into one feature vector and fed into a softmax classifier. The final prediction score for each video is computed by averaging the output scores of all frames.
• CPqD: CPqD is based on the Inception-v3 network [35] and the above LBP method. The last layer of the pre-trained Inception-v3 model is replaced by a fully connected layer with a sigmoid activation function. The faces in the RGB frames are detected, cropped, and normalized to 299 × 299 pixels. These face images are used as inputs to fine-tune the Inception-v3 model. The model with the lowest EER on the development set among all 10 training epochs is selected. A single score for each video is obtained by averaging the output scores of all frames. To further improve the performance, the final score for each video is computed by fusing the score of the Inception-v3 model with the score of the LBP method.
• Inception FT and Inception TFS: Since CPqD uses the Inception-v3 [35] network as its basic architecture, we also report the results of the fine-tuned Inception-v3 model alone, named Inception FT. In addition to the fine-tuned model, we trained the Inception-v3 model from scratch for performance comparison, named Inception TFS. In the training phase, the binary cross-entropy loss function and the Adam optimizer with a learning rate of 10^-5 were used. The output scores of the frames were averaged to obtain a final prediction decision for each video.
• FASNet FT and FASNet TFS: FASNet [26] uses transfer learning from a pre-trained VGG16 model [36] for face PAD.
They used the pre-trained VGG16 model as a feature extractor and modified the last fully connected layer. The newly added fully connected layers with a sigmoid function were then fine-tuned for the PAD task. This fine-tuned FASNet is referred to as FASNet FT; analogously to the Inception-v3 methods, we also train FASNet from scratch, named FASNet TFS. The input images are the detected, cropped, and normalized RGB face frames with a size of 224 × 224 pixels. The Adam optimizer with a learning rate of 10^-4 was used for training, as defined in [26]. Data augmentation techniques and class weights were utilized to deal with the imbalanced data problem. To further reduce overfitting, an early stopping technique with a patience of 5 and a maximum of 30 epochs was used. The resulting frame scores were averaged to obtain the final score for each video.
• DeepPixBis: George et al. [27] proposed a densely connected network framework for face PAD with binary and deep pixel-wise supervision. The framework is based on the DenseNet architecture [37]: two dense blocks and two transition blocks, followed by a fully connected layer with sigmoid activation, produce a binary output. We used the same data augmentation techniques (horizontal flip; random jitter in brightness, contrast, and saturation) and the same hyper-parameters (Adam optimizer with a learning rate of 10^-4 and a weight decay of 10^-5) as defined in [27] for training. In addition to data augmentation, we applied class weights and an early stopping technique to avoid overfitting. The final score for each video was computed by averaging the frame scores.
For FR systems, trained CNNs are typically used as feature extractors. The feature vector extracted from a specific layer of an off-the-shelf CNN is used as the template representing the corresponding input face image. The resulting templates are then compared with each other using similarity measures. To provide a vulnerability analysis of FR systems against our novel masked attacks, we adopted the following three FR algorithms:
• ArcFace: ArcFace [28] introduced an additive angular margin loss function to obtain highly discriminative features for FR. We chose this algorithm because ArcFace consistently outperformed state-of-the-art methods, achieving 99.83% accuracy on the Labeled Faces in the Wild (LFW) [33] dataset and 98.02% on the YouTube Faces (YTF) [38] dataset. The pre-trained ArcFace model used in our study is based on the ResNet-100 [39] architecture and trained on the MS-Celeb-1M [40] dataset (MS1M-v2). The output template is a 512-dimensional feature vector extracted from the 'fc1' layer of ArcFace.
• SphereFace: Liu et al. [29] proposed a deep hypersphere embedding approach (SphereFace) for the FR task. SphereFace [29] utilizes an angular softmax loss to make CNNs learn angularly discriminative features. This method also achieved competitive performance on the LFW [33] (accuracy of 99.42%) and YTF [38] (95.00%) datasets. We extract 512-dimensional face representations from a pre-trained 20-layer SphereFace model.
• VGGFace2: The first version of VGGFace is based on the 16-layer VGG [36] network, while the second version (VGGFace2) [30] adopts ResNet-50 [39] as the backbone architecture. In this work, we use the second version, a ResNet-50 network trained on the VGGFace2 dataset [30], to extract the 512-dimensional templates.
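Since all three networks serve purely as template extractors, the comparison step reduces to a cosine similarity between 512-dimensional vectors. A minimal sketch follows; the extractor itself (here called extract_template) is assumed and abstracted away:

```python
import numpy as np

def cosine_similarity(template_a: np.ndarray, template_b: np.ndarray) -> float:
    """Compare two 512-dimensional face templates, as recommended for
    ArcFace, SphereFace, and VGGFace2 embeddings."""
    a = template_a / np.linalg.norm(template_a)
    b = template_b / np.linalg.norm(template_b)
    return float(np.dot(a, b))

# Usage: extract_template stands in for whichever pre-trained network
# is in use (e.g., ArcFace's 'fc1' layer); it is assumed, not implemented.
# reference = extract_template(reference_image)   # enrolled, day 1
# probe = extract_template(probe_image)           # day 2/3 bona fide or attack
# score = cosine_similarity(reference, probe)     # higher = more similar
```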
The vulnerability of each FR system to attacks was analyzed based on three scenarios. Regardless of the scenario, the references are scenario-specific bona fide videos captured on the first day, while bona fide videos from the second and third days or attack videos were selected as probes. The three cases, including the division of scenario-specific references and probes, are described in detail together with the results in Sec. VI. Once the reference templates are obtained, we use the cosine similarity, as recommended in [28], [29], [30], to compute the similarity scores between references and probes.
This section first describes the three protocols designed to investigate the effect of masked attacks on PAD performance under different training settings. Second, the PAD evaluation metrics are introduced for further analysis. Third, quantitative results are reported for the different PAD protocols. Finally, a qualitative analysis and visualization are presented and discussed.
A. PAD evaluation protocols
1) PAD protocols for the CRMA database
In this study, three protocols are provided to study the impact of masks on the performance of PAD solutions under different training settings. Other factors, such as the various devices, illuminations, and capture scales, are outside the scope of this study. These three protocols aim to answer three separate questions: 1) Does a PAD algorithm trained on unmasked data generalize well to masked bona fides and attacks, that is, can a previously trained model be adapted to the present-day situation? 2) Does a PAD algorithm designed before the COVID-19 pandemic still work efficiently if it is trained on additional masked data? 3) Will a network that has learned masked face attacks be confused by real masks that obscure the spoof face? Hence, we split the 47 subjects in the CRMA database into three subject-disjoint sets: the training set (19 subjects), the development set (10 subjects), and the testing set (18 subjects). Gender was balanced as much as possible across the three sets. Tab. II provides more information about the three protocols. A detailed description of the three protocols follows:
Protocol-1 (P1): This protocol measures the generalization performance of PAD solutions trained on unmasked data. The training and development sets contain only videos of subjects without masks (such as the data in most current PAD databases). The trained model is then tested on data with face masks. More specifically, only BM0 and AM0 data were used for training, while BM1, AM1, and AM2 were considered unknown mask data.
Protocol-2 (P2): In contrast to Protocol-1, which focuses on generalizability to unseen mask data, the second protocol is designed to evaluate the performance of PAD algorithms when masked data has been learned in the training phase. In this protocol, the training, development, and testing sets include masked and unmasked bona fides (BM0, BM1), masked and unmasked attacks (AM0, AM1), and spoof faces with real masks (AM2).
Protocol-3 (P3): Until now, the effect of AM2 on PAD performance has been unclear. AM2 is a special attack type in which a real face mask is placed on a spoof face, meaning it contains only partial artifacts (i.e., the unmasked spoof-face region), in contrast to AM1, which carries entire artifacts (i.e., both the spoofed face and the spoofed mask). Therefore, this protocol attempts to answer the following question: if the network has learned the masked attacks AM1, can this trained model avoid being confused by a real mask and perform similarly on an attack covered by a real mask (AM2)?
Consequently, the training and development sets include bona fides BM0 and BM1 and attacks AM0 and AM1, while AM2 is an unknown attack in the testing set. Because the data in CRMA are video sequences and the numbers of videos in the bona fide and attack classes are imbalanced, we sampled 60 frames from each bona fide video and 5 frames from each attack video to reduce data bias. In addition to the different frame sampling, we also set the class weights inversely proportional to the class frequencies to reduce overfitting in the training phase. In the test phase, the final classification decision was determined by averaging the prediction scores of all sampled frames.
2) PAD protocols for cross-database scenarios
In addition to the intra-database scenario on our CRMA database, we also perform cross-database experiments to explore the generalizability of these PAD algorithms on masked data. Because the PAIs in the CRMA database are print and replay attacks, we selected three popular publicly available databases containing the same PAIs for the evaluation: CASIA-FAS [11], MSU-MFSD [13], and OULU-NPU [7]. We conducted two cross-database experiments. In the first cross-database scenario, the PAD solutions trained on the three publicly available databases were evaluated on the test set of the CRMA database. In addition, the results on their own test sets are also reported (as shown in the left block of Tab. IV). This first setting is similar to Protocol-1 of the CRMA intra-database scenario, as no masked data are seen in the training phase; therefore, the first cross-database scenario likewise measures generalization to unseen masked data. In the second cross-database scenario, the PAD solutions trained on the CRMA database are evaluated on the test sets of the public databases; those results are reported in the supplementary material.
[Tab. II (fragment): protocol compositions, listing for each protocol the subject splits (train: subjects 1-19, dev: 20-29, test: 30-47), the included data types, and the video counts per set, e.g., P1 dev: BM0, AM0 (30 bona fide / 810 attack videos); P1 test and P2 test: all data types (162 / 4860); P2 train: all data types (171 / 5130).]
The metrics following the ISO/IEC 30107-3 [41] standard were used to measure the performance of the PAD algorithms: the Attack Presentation Classification Error Rate (APCER) and the Bona fide Presentation Classification Error Rate (BPCER). APCER is the proportion of attack images incorrectly classified as bona fide in a specific scenario, while BPCER is the proportion of bona fide images incorrectly classified as attacks in a specific scenario. The APCER and BPCER reported on the test set are based on a threshold pre-computed on the development set. In our study, we use a BPCER of 10% (on the development set) to obtain the threshold (denoted as τ_BPCER10). Additionally, the half-total error rate (HTER), corresponding to half of the sum of BPCER and APCER, is used for the cross-database evaluation. Notably, we computed a threshold on the development set of the training database; this threshold was then used to determine the HTER value on the test database. The detection EER (D-EER) value, where APCER and BPCER are equal, is also reported in the cross-database scenarios. For further analysis of PAD performance, receiver operating characteristic (ROC) curves are also presented.
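A minimal sketch of how these metrics follow from raw prediction scores (helper names are our own; scores are assumed to be higher for bona fide samples):

```python
import numpy as np

def bpcer(bona_scores: np.ndarray, thr: float) -> float:
    # Bona fide samples scoring below the threshold are rejected as attacks.
    return float(np.mean(bona_scores < thr))

def apcer(attack_scores: np.ndarray, thr: float) -> float:
    # Attack samples scoring at or above the threshold pass as bona fide.
    return float(np.mean(attack_scores >= thr))

def threshold_at_bpcer(bona_dev_scores: np.ndarray, target: float = 0.10) -> float:
    # tau_BPCER10: the score threshold at which 10% of the development-set
    # bona fide samples are (wrongly) classified as attacks.
    return float(np.quantile(bona_dev_scores, target))

def hter(bona_test: np.ndarray, attack_test: np.ndarray, thr: float) -> float:
    # Cross-database HTER: threshold fixed on the source development set,
    # error rates measured on the target test set.
    return 0.5 * (bpcer(bona_test, thr) + apcer(attack_test, thr))
```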
Protocol-1 represents the pre-COVID-19 PAD scenario, in which subjects normally do not wear a mask, and measures the generalization performance on masked data. Therefore, Protocol-1 is considered the most challenging task because of the unseen BM1, AM1, and AM2 data. Tab. III describes the results of the different protocols on the CRMA database. The bold numbers indicate the highest BPCER values between BM0 and BM1 and the highest APCER values among AM0, AM1, and AM2 in each PAI. Observing the first block, P1, in Tab. III, the BPCER values of masked bona fide samples are much higher than those of unmasked ones; however, most PAD systems achieve lower APCER values on the masked attack samples (either AM1 or AM2). The higher classification error rates on masked bona fides and the lower error rates on masked attacks are intuitively conceivable: when the model has not seen faces wearing a mask before, it is more inclined to falsely classify a masked bona fide sample (BM1) as an attack. Moreover, it is interesting to note that the networks trained from scratch and the DeepPixBis approach perform worse on attack AM2 than on AM1. These observations are consistent with the ROC curves (Fig. 4). The red curves (print AM2 with bona fide BM1) and the gray curves (replay AM2 with bona fide BM1) possess significantly smaller areas under the curve in five of the seven methods. Furthermore, training a network from scratch improves the overall performance. A possible reason for these observations is that learning from scratch is more efficient for obtaining discriminative features between bona fides and artifacts. Conversely, such approaches might be confused when realistic masks are applied to attack samples. In addition to the intra-database scenario, the first cross-database experiment (introduced in Sec. V-A2) can be seen as similar to Protocol-1, as both evaluate PAD solutions trained on unmasked data and tested on the CRMA database. In Tab. IV, the bold BPCER number is the highest BPCER (between BM0 and BM1) for each PAD method, and the bold APCER number is the highest APCER (among AM0, AM1, and AM2) for each PAD method in print and replay attacks, respectively. This bolding shows which samples are more difficult to classify correctly. We observed that the performance in the cross-database setting was relatively poor for all models. Even though the deep-learning-based methods achieved great results on their own test sets, they generalize significantly worse to masked bona fide samples; for example, most BPCER values for BM1 are close to 100%. In contrast, most algorithms achieve lower APCER values on the masked AM1 and AM2 attacks than on the unmasked AM0 attacks, which is consistent with the Protocol-1 observations from the intra-database scenario. In general, the experimental results of the intra-database Protocol-1 and the first cross-database scenario answer the first posed question (in Sec. V-A1) by showing that models trained only on unmasked data cannot properly classify images of masked faces. A subject wearing a mask has a high probability of being falsely detected as an attack by PAD systems, even if this subject is bona fide. Protocol-2 targets the performance of PAD algorithms on masked data when both unmasked and masked samples are used in the training phase. As shown in Tab. III, we can observe the following: First, despite the fact that masked bona fide samples are still more difficult to classify correctly than unmasked ones in most cases, the BPCER value of BM1 behaves more similarly to that of BM0 in Protocol-2 than in Protocol-1. Moreover, the BPCER values of BM0 and BM1 in Protocol-2 decreased in most cases compared with the results of Protocol-1. For example, the BPCER value of BM1 achieved by Inception FT was 84.21% in Protocol-1 and 7.02% in Protocol-2. This finding indicates that learning the masked data helps improve the performance of PAD methods.
This is also consistent with the ROC curves (comparing the ROCs of Protocol-1 and Protocol-2 in general); in particular, Inception TFS, FASNet TFS, and DeepPixBis achieved significant progress (larger areas under the curve). Second, six of the seven methods performed worse on masked printed faces (AM1 or AM2), while five of the seven algorithms showed inferior results on unmasked replay attacks. Moreover, AM2 in the print PAI yields higher APCER values than AM1 for the trained-from-scratch approaches. One possible reason for the different results between print and replay attacks is specular reflection. Because the attack data were collected in a windowless lab with all electric lights on, the tablets reflect the light more easily than the printed paper, and this reflection is difficult to avoid. The real face masks may also reflect light when placed on an electronic tablet, an effect that does not appear when they are applied to printed paper. In general, the experimental results of the intra-database Protocol-2 answer the second question (in Sec. V-A1), which addresses the performance changes of current PAD algorithms after complementary learning on masked data. Based on the above findings, we can conclude that PAD algorithms still perform worse on masked bona fides (BM1) than on unmasked faces (BM0), even when the PAD solutions are trained on masked data. Protocol-3 investigates the generalizability of a model trained on data that includes masked face attacks (AM1) when tested on attacks where a real mask is placed on top of the attack (AM2). For bona fide samples, we draw a conclusion similar to Protocol-1 and Protocol-2: masked bona fide samples have a higher probability of being incorrectly classified as attacks. However, the experimental results show differences in attack detection behavior (APCER) between Protocol-3 on one side and Protocols-1 and -2 on the other. First, in this protocol, the highest APCER values of most PAD algorithms appear on either the AM1 or AM2 attacks in both print and replay PAIs. Second, the traditional LBP method, Inception FT, FASNet FT, and the hybrid CPqD method achieve relatively worse results on AM0 or AM1 attacks than the other methods, which suggests that they are unable to learn or extract sufficiently discriminative features. Third, although the other methods, such as the trained-from-scratch Inception TFS and FASNet TFS or the custom-designed DeepPixBis, achieve impressive results on the seen AM0 and AM1 attacks, they do not generalize well to the unseen AM2 attacks. These observations answer the third question stated in Sec. V-A1: a network trained on masked face attacks (AM1) tends to produce confused decisions on AM2, where a real mask is placed on an attack face. To qualitatively analyze and interpret the deep-learning-based methods, the Score-CAM [42] technique was adopted to localize the discriminative areas in face images.
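A condensed sketch of the Score-CAM weighting scheme as we understand it from [42]; the hooked-layer helper acts_fn is an assumption of this sketch, not part of any library:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(model, acts_fn, image, target_class):
    """Minimal Score-CAM: weight each activation map by the model's score
    on the input masked by that (normalized, upsampled) map.
    acts_fn(image) is assumed to return the activations (1, K, h, w) of a
    chosen convolutional layer of `model`."""
    acts = acts_fn(image)                                  # (1, K, h, w)
    maps = F.interpolate(acts, size=image.shape[-2:],
                         mode="bilinear", align_corners=False)
    # Min-max normalize each map to [0, 1] so it can act as a soft mask.
    flat = maps.flatten(2)
    mins = flat.min(-1).values[..., None, None]
    maxs = flat.max(-1).values[..., None, None]
    masks = (maps - mins) / (maxs - mins + 1e-8)           # (1, K, H, W)
    scores = []
    for k in range(masks.shape[1]):                        # one forward pass per map
        logits = model(image * masks[:, k:k+1])
        scores.append(F.softmax(logits, dim=1)[0, target_class])
    weights = F.softmax(torch.stack(scores), dim=0)        # (K,)
    cam = torch.relu((weights[None, :, None, None] * maps).sum(1))
    return cam / (cam.max() + 1e-8)                        # (1, H, W) in [0, 1]
```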
The rows from top to bottom correspond to Inception FT, Inception TFS, FASNet FT, FASNet TFS, and DeepPixBis. Fig. 5a shows the results of Protocol-1 (the example subject is in the test set). Inception FT mainly focuses on the nose, including the nearby parts of the masks, whereas Inception TFS pays more attention to the upper region of the face. Similarly, FASNet TFS reduces the attention paid to the masks and increases the concentration around the forehead. DeepPixBis concentrates around the eyes for both unmasked (BM0) and masked (BM1) bona fides. However, for attack samples, its attention seems to be focused on the left eye and parts of the masks. In general, masks are noticed by all networks. The results of Protocol-2 and Protocol-3 for the same subject are shown in Fig. 5b and Fig. 5c. We notice that 1) the attention areas of the fine-tuned networks hardly change across the three protocols because of the fixed weights of the layers before the last classification layer; 2) Inception TFS in Protocol-2 appears to focus on the upper face, including much more of the eye regions than in Protocol-1; 3) FASNet TFS in Protocol-2 concentrates much more on the applied real masks than in Protocol-3, where training excludes AM2; 4) DeepPixBis still works well on bona fides, but for attack samples, its attention seems to be distracted toward the edges of the images. Although DeepPixBis produces correct decisions, this observation raises a serious concern about its reliability and generalizability. This concern is confirmed by the cross-database evaluation: DeepPixBis generally obtains worse cross-database results than the other two trained-from-scratch networks (see Tab. IV for details). Finally, looking at the attention maps in all protocols for this identity, we notice that, except for the misclassified samples (with red boxes) that appear on print/replay AM0, print AM2 attacks are more easily incorrectly detected as bona fide than AM1 attacks. This finding is in line with the previous quantitative evaluation showing that AM2 attacks may confuse the PAD, even if the network has been trained on masked face attacks. To further understand the above quantitative and qualitative results, we provide additional t-SNE plots visualizing the learned features in the supplementary material. These plots consolidate our findings here: 1) masked bona fide samples are more likely to be detected as attacks by pre-COVID-19 PAD algorithms; 2) attacks with real masks placed on the attacks (AM2) are more often falsely detected by PAD systems as bona fide than attacks with masked faces (AM1).
The vulnerability of each FR system to each type of PA is analyzed based on three experimental settings. In the first setting, BM0-BM0, we use the unmasked bona fide samples captured on the first day as references. The references are compared against the bona fide BM0 samples captured on the second and third days of the same subjects (to compute genuine scores), as well as of other subjects (to compute zero-effort imposter (ZEI) scores). Once the genuine and ZEI comparison scores are obtained, the operating threshold τ_FMR@0.01 is computed. Finally, the probe samples of each type of PA are compared against the references of the same subjects separately. In the second setting, BM0-BM1, the difference is that the bona fide BM1 data captured on the second and third days are compared against the BM0 references to obtain the corresponding genuine and ZEI scores. In the third setting, BM1-BM1, the masked bona fide faces captured on the first day serve as the references for each subject. These references are compared against the masked bona fide samples captured on the second and third days to obtain the genuine and ZEI scores.
These three experimental settings enable addressing the following four questions: 1) When having an unmasked reference and using a decision threshold that does not consider masked comparisons (BM0-BM0), how vulnerable are FR systems to the three types of attacks in CRMA (AM0, AM1, and AM2)? 2) When having an unmasked reference and using a decision threshold based on unmasked-to-masked comparisons (BM0-BM1), how vulnerable are FR systems to the three types of attacks? 3) When having a masked reference and using a decision threshold based on masked-to-masked comparisons (BM1-BM1), how vulnerable are FR systems to the three types of attacks? Additionally, we address a fourth question: 4) Does the vulnerability of FR systems differ between the AM1 and AM2 attacks? To measure the performance of the FR techniques, the genuine match rate (GMR), which refers to the proportion of correctly matched genuine samples, is used at a fixed false match rate (FMR). GMR is equal to 1 minus the false non-match rate (FNMR). Moreover, to analyze the vulnerability of FR algorithms to our masked attacks, the imposter attack presentation match rate (IAPMR), corresponding to the proportion of PAs accepted by the FR system as genuine presentations, is adopted. IAPMR follows the standard definition presented in ISO/IEC 30107-3 [41]. The threshold for GMR and IAPMR is defined by fixing the FMR at 1% (denoted as τ_FMR@0.01). Probe images with similarity scores lower than τ_FMR@0.01 are not matched. Moreover, the recognition score-distribution histograms are shown in Fig. 6, 7, and 8. In addition to these metrics, the EER value, where FMR equals FNMR, is computed to compare the FR algorithms.
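A minimal sketch of this vulnerability bookkeeping, assuming the genuine, ZEI, and attack similarity-score arrays have already been computed with the cosine comparator sketched earlier:

```python
import numpy as np

def threshold_at_fmr(zei_scores: np.ndarray, fmr: float = 0.01) -> float:
    # tau_FMR@0.01: the threshold at which 1% of zero-effort imposter
    # comparisons are (wrongly) accepted as matches.
    return float(np.quantile(zei_scores, 1.0 - fmr))

def gmr(genuine_scores: np.ndarray, thr: float) -> float:
    # Proportion of genuine comparisons accepted; GMR = 1 - FNMR.
    return float(np.mean(genuine_scores >= thr))

def iapmr(attack_scores: np.ndarray, thr: float) -> float:
    # ISO/IEC 30107-3 IAPMR: proportion of presentation attacks whose
    # scores pass the verification threshold.
    return float(np.mean(attack_scores >= thr))

# thr = threshold_at_fmr(zei)          # per setting: BM0-BM0, BM0-BM1, BM1-BM1
# print(gmr(genuine, thr), iapmr(am2_scores, thr))
```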
The performance and vulnerability of each FR system are summarized in Tab. V. SphereFace [29] obtains relatively low IAPMR values; however, its GMR values are also much lower than those of ArcFace [28] and VGGFace [30]. In general, the IAPMR values of all three FR systems are close to their GMR values. Specifically, the FR systems are vulnerable to unmasked attacks when unmasked bona fide samples are used as references (the BM0-BM0 and BM0-BM1 settings), and vulnerable to masked attacks when the references are masked bona fide data. Comparing the vulnerability analysis results for AM1 and AM2 in all three settings and all FR systems, we note that the IAPMR values of AM2 are always significantly higher than those of AM1. This indicates that applying real masks to attack presentations can further reduce the security of FR systems. This might be due to the fact that the AM2 attacks possess more realistic features than AM1. To further verify this assumption, we provide histograms of the similarity score distributions for the three settings and the three FR systems (see Fig. 6, 7, and 8). In the histograms, green refers to genuine scores, blue represents ZEI scores, and gray represents attack verification scores. The ideal situation is that there is no overlap between the green histogram and the other two. Fig. 6 shows the score distributions of ArcFace [28], where the rows from top to bottom represent the BM0-BM0, BM0-BM1, and BM1-BM1 settings and the columns from left to right refer to the AM0, AM1, and AM2 attacks. It can be seen that 1) the verification scores of attacks are higher than the ZEI scores in all settings; 2) the scores of AM0 attacks and the genuine scores almost overlap in the BM0-BM0 and BM0-BM1 settings, while the scores of AM1/AM2 attacks have many overlapping areas with the genuine scores in the BM1-BM1 setting; and 3) in all settings, the scores of AM2 overlap more with the genuine scores than those of AM1. Similar observations can be made in Fig. 7 for SphereFace and Fig. 8 for VGGFace.
Fig. 7: The similarity score distributions of the off-the-shelf SphereFace [29].
These observations are consistent with the findings presented in Tab. V. Overall, these results indicate that 1) FR systems are more vulnerable to unmasked attacks compared to masked attacks when the references are unmasked faces; 2) when the threshold is computed based on unmasked-to-masked comparisons, the vulnerability of FR systems increases for both masked and unmasked attacks; 3) when the reference is masked, FR systems are more vulnerable to masked attacks than FR systems with unmasked references. Another important finding is that 4) FR systems exhibit a higher vulnerability to spoof faces with real masks placed on them (AM2) than to masked face attacks (AM1). Such observations raise concerns about the security of FR systems when facing masked attacks.
We studied the behavior of PAD methods on different types of masked face images. To enable our study, we presented a new large-scale face PAD database, CRMA, including conventional unmasked attacks, novel attacks with faces wearing masks, and attacks with real masks placed on spoof faces. It consists of 13,113 high-resolution videos and has a large diversity of capture sensors, displays, and capture scales. To study the effect of wearing a mask on PAD algorithms, we designed three experimental protocols. The first protocol measures the generalizability of current PAD algorithms on unknown masked bona fide or attack samples. In the second protocol, masked data are used in the training phase to measure the performance of PAD solutions when face masks are known. The third protocol investigates the generalizability of models trained on masked face attacks when tested on attacks covered by a real mask. Extensive experiments were conducted using these protocols. The results showed that PAD algorithms have a high probability of detecting masked bona fide samples as attacks (the median BPCER value for BM1 in Protocol-1 is 48.25%). Moreover, even if the PAD solutions have seen masked bona fide data, the PAD algorithms still perform worse on masked bona fide samples compared with unmasked bona fides. Furthermore, the PAD solutions trained on masked face attacks (AM1) do not generalize well to attacks covered by a real mask (AM2). For example, the APCER values achieved by DeepPixBis increased from 0.62% for AM1 to 2.92% for AM2 in the print attack, and from 1.92% for AM1 to 6.43% for AM2 in the replay attack (Protocol-3). In addition, we performed a thorough analysis of the vulnerability of FR systems to such novel attacks. The results indicate that FR systems are vulnerable to both masked and unmasked attacks. For example, when the reference images and the system threshold are based on unmasked faces (BM0-BM0), the IAPMR values for unmasked attacks (AM0), masked attacks (AM1), and attacks covered by a real mask (AM2) are 98.40%, 81.60%, and 97.10%, respectively. This leads to the interesting observation that all the investigated FR systems are more vulnerable to attacks where real masks are placed on attacks (AM2) than to attacks of masked faces (AM1).
To further observe and better understand the discriminative features between bona fides and PAs in the CRMA database, we present here, as supplementary material, the visualized features of face samples from eight classes: BF-BM0, BF-BM1, PR-AM0, PR-AM1, PR-AM2, RE-AM0, RE-AM1, RE-AM2. Here, BF refers to bona fide, while PR and RE correspond to print and replay.
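The plots below were produced with standard t-SNE. A minimal sketch of the procedure, assuming per-frame PAD feature vectors and their class labels are already available (the variable names are ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (N, D) penultimate-layer PAD embeddings; labels: (N,) strings
# drawn from the eight classes listed above (BF-BM0 ... RE-AM2).
def plot_tsne(features: np.ndarray, labels: np.ndarray, seed: int = 0):
    coords = TSNE(n_components=2, random_state=seed,
                  init="pca", perplexity=30).fit_transform(features)
    for cls in np.unique(labels):
        pts = coords[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=cls)
    plt.legend(markerscale=3, fontsize=7)
    plt.show()
```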
Fig. 9 shows the t-SNE results of the fine-tuned Inception FT and FASNet FT and the trained-from-scratch Inception TFS and FASNet TFS on the first protocol. Furthermore, the results of DeepPixBis on the three experimental protocols are presented in Fig. 10.
Fig. 9: The t-SNE plots of the fine-tuned Inception FT and FASNet FT (left column) and the trained-from-scratch Inception TFS and FASNet TFS (right column) on the first protocol, which targets the unknown masked bona fides and the two types of masked attacks. These plots show that the fine-tuned Inception FT and FASNet FT cannot discriminate the features of bona fides from attacks; the 2D features representing the face images appear to be grouped based on the existence of face masks. Moreover, the AM2 attacks are surrounded by bona fide samples and other types of attacks.
First, observing the t-SNE plots of the Inception FT and FASNet FT methods in Fig. 9, we see that the BM0 and AM0 samples of the print and replay PAIs are tightly grouped, while the masked BM1 and AM1 samples in both PAIs form another cluster. Simultaneously, the print AM2 and replay AM2 attack samples tend to form a compact distribution. Moreover, it is interesting to note that the AM2 attacks are enclosed by the BM0/AM0 and BM1/AM1 clusters. This indicates that 1) the fine-tuned networks are unable to learn discriminative features between bona fides and attacks, and 2) AM2 attacks, which combine artifacts and live features, make it more difficult for the networks to reach a correct PAD decision.
Fig. 10: The t-SNE plots for the DeepPixBis method on the three protocols of our CRMA database (from left to right: DeepPixBis for Protocol-1, Protocol-2, and Protocol-3). The unknown AM2 attacks are enclosed by bona fide samples and other types of attacks.
Second, the t-SNE subplots of Inception TFS and FASNet TFS show that, compared to the fine-tuned networks, the trained-from-scratch networks perform better: the features of BM0 and BM1 form one group, the print attacks another, and the replay attacks a third. However, the 2D features of the AM2 attacks in the print and replay PAIs still lie closer together than those of the other attack types when employing FASNet TFS. Overall, AM2 attacks are slightly more difficult to detect correctly than AM1 attacks for the trained-from-scratch networks, even though bona fides and attacks are more separated than with the fine-tuned networks. Third, looking at the Protocol-1 results of the DeepPixBis method in Fig. 10(a), almost half of the BM1 data is mixed with the attack data. Besides, in the Protocol-1 plot, the print AM0 data is close to the replay AM0 data, print AM1 is close to replay AM1, and the AM2 attacks of both PAIs are grouped together. Note again that the AM2 attacks are surrounded by the bona fide and other attack data. In Fig. 10(b) and Fig. 10(c), the bona fide and attack data are more separated. However, some replay AM2 attacks are still mixed inside the bona fide groups in Protocol-2. Additionally, more AM2 attacks lie close to the bona fide samples in Protocol-3 than in Protocol-2. A possible reason is that AM2 attacks are not learned in the training phase of Protocol-3. All the above findings are consistent with the observations in our main work. Together, these results provide important insights: 1) fine-tuned networks have poorer generalizability on unknown masked bona fide and attack data than trained-from-scratch networks;
2) masked bona fide samples are more likely to be detected as attacks by the pre-COVID-19 PAD algorithms; and 3) attacks with real masks on spoof faces (AM2) are more easily detected by PAD systems as bona fide than attacks with masked faces (AM1).
The first cross-database scenario is similar to Protocol-1 and is discussed in the main work. The second cross-database scenario shows the diversity of our CRMA database in terms of face masks, various sensors, and different scales; the results are reported in Tab. VI. It can be seen that the trained-from-scratch networks outperform the traditional LBP method and the fine-tuned networks in most cases. The models trained on Protocol-1 of the CRMA database achieved better results on the OULU-NPU database than the models trained on Protocol-2 and Protocol-3 (17.92% on Protocol-2 and Protocol-3). We can conclude that, even without masked data, the CRMA database still possesses sufficient diversity in sensors and environments to boost the performance of vanilla models. For example, FASNet TFS, which trains VGG16 from scratch, achieves competitive results (13.12% HTER) on the OULU-NPU database [25], [7]. Besides, as mentioned earlier, even though DeepPixBis achieves much better results than the other methods in the intra-database evaluations, it performs mostly worse than the other deep-learning methods in the cross-database setting, especially when cross-testing on the MSU-MFSD database.
[1] Ongoing face recognition vendor test (FRVT) part 6B: Face recognition accuracy with face masks using post-COVID-19 algorithms
[2] The effect of wearing a mask on face recognition performance: an exploratory study
[3] Extended evaluation of the effect of real and simulated masks on face recognition performance
[4] A survey on 3D mask presentation attack detection and countermeasures
[5] Face spoofing detection under super-realistic 3D wax face attacks
[6] NAS-FAS: Static-dynamic central difference network search for face anti-spoofing
[7] OULU-NPU: A mobile face presentation attack database with real-world variations
[8] Learning deep models for face anti-spoofing: Binary or auxiliary supervision
[9] Deep tree learning for zero-shot face anti-spoofing
[10] Face liveness detection from a single image with sparse low rank bilinear discriminative model
[11] A face antispoofing database with diverse attacks
[12] On the effectiveness of local binary patterns in face anti-spoofing
[13] Face spoof detection with image distortion analysis
[14] CelebA-Spoof: Large-scale face anti-spoofing dataset with rich annotations
[15] Spoofing in 2D face recognition with 3D masks and anti-spoofing with Kinect
[16] Face recognition systems under spoofing attacks
[17] CASIA-SURF: A large-scale multi-modal benchmark for face anti-spoofing
[18] Spoofing deep face recognition with custom silicone masks
[19] Discriminative representation combinations for accurate face spoofing detection
[20] Unified unsupervised and semi-supervised domain adaptation network for cross-scenario face anti-spoofing
[21] Client-specific anomaly detection for face presentation attack detection
[22] Multiresolution gray-scale and rotation invariant texture classification with local binary patterns
[23] Face spoofing detection from single images using micro-texture analysis
[24] Face anti-spoofing based on color texture analysis
[25] A competition on generalized software-based face presentation attack detection in mobile scenarios
[26] Transfer learning using convolutional neural networks for face anti-spoofing
[27] Deep pixel-wise binary supervision for face presentation attack detection
[28] ArcFace: Additive angular margin loss for deep face recognition
[29] SphereFace: Deep hypersphere embedding for face recognition
[30] VGGFace2: A dataset for recognising faces across pose and age
[31] Deeply vulnerable: a study of the robustness of face recognition to presentation attacks
[32] Portrait quality (reference facial images for MRTD)
[33] Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments
[34] Joint face detection and alignment using multitask cascaded convolutional networks
[35] Rethinking the Inception architecture for computer vision
[36] Very deep convolutional networks for large-scale image recognition
[37] Densely connected convolutional networks
[38] Face recognition in unconstrained videos with matched background similarity
[39] Deep residual learning for image recognition
[40] MS-Celeb-1M: A dataset and benchmark for large-scale face recognition
[41] ISO/IEC DIS 30107-3:2016: Information Technology - Biometric presentation attack detection - Part 3: Testing and reporting
[42] Score-CAM: Score-weighted visual explanations for convolutional neural networks