key: cord-0640274-rzzf3p4k
authors: Queiroz, Leonardo; Lai, Kenneth; Yanushkevich, Svetlana; Shmerko, Vlad
title: Biometrics in the Time of Pandemic: 40% Masked Face Recognition Degradation can be Reduced to 2%
date: 2022-01-03
journal: nan
DOI: nan
sha: 595d16c5f2e41a739ad15a24bff9623a8febec92
doc_id: 640274
cord_uid: rzzf3p4k

In this study of face recognition on masked versus unmasked faces generated using the Flickr-Faces-HQ and SpeakingFaces datasets, we report a 36.78% degradation of recognition performance caused by mask-wearing during the pandemic, in particular in border checkpoint scenarios. We achieved better performance and reduced the degradation to 1.79% using advanced deep learning approaches in the cross-spectral domain.

Biometric-enabled security checkpoints deployed in mass-transit hubs such as airports and seaports are the frontiers of national and international security [1], [2], [3]. Two events had a drastic impact on the Research & Development (R&D) of security checkpoints: the 9/11 terrorist attack in 2001 and the COVID-19 pandemic in 2019 (Fig. 1). The post-9/11 R&D period focused on increasing security by introducing biometric-enabled ID and trusted traveler services [4], [5], self-service kiosks (gates) [6], [7], biometric-enabled watchlist screening [8], [9], [10], identity de-duplication detection [11], authentication machines [12], [13], implementation and deployment guides and regulations [7], [1], roadmapping [14], human rights protection [15], as well as harmonization of advanced management techniques to increase checkpoint performance [16]. The key performance measures remained the travelers' satisfaction with the service, such as time of authentication, public acceptance of the biometric traits (face, iris, fingerprints), waiting times, privacy issues, and language barriers.

The full spectrum of post-pandemic challenges is yet to be specified. However, the current technological and societal state of security checkpoints during the pandemic has been characterized by 1) degradation of biometric-enabled technologies such as facial recognition because of mask wearing, and 2) emergence of ICAO counter-epidemiological initiatives such as the immunity passport [17], [18]. Performance measures such as risk, trust, and bias are of particular interest, for example, the risk of mis-identification when using an immunity passport, masked face recognition bias due to an insufficient amount of training data, as well as traveler trustworthiness.

The counter-epidemic response by IATA (International Air Transport Association) prompted the deployment of the travel-pass screening mechanism [17], [18] as a step of critical importance in transferring from a centralized platform to a trusted distributed platform. This is only part of a technological breakthrough that addresses counter-epidemic measures. The next step is combining it with biometric technology, e.g., transferring biometric-enabled technologies onto distributed platforms. Recent work on epidemic-conditioned biometrics focused on periocular biometrics (the periocular region is the face area around the eyes) [19], [20], [21], mask detection [23], and adjusting e-interviewers to the acoustic effects of mask wearing [24]. We also refer to [22] for the legal, ethical, and privacy concerns of the counter-epidemic checkpoint extension. In particular, the counter-epidemic checkpoint will need to be 'proofed' against invasions of privacy.
Data sharing will need to satisfy the requirements for an independent audit, to ensure data are not used for purposes outside of the pandemic. There is also a concern that the emergency checkpoint mode will set a precedent and may remain beyond the end of the state of emergency.

The COVID-19 pandemic has prompted the expansion and accelerated development of epidemic-conditioned biometrics. This means that biometric trait recognition algorithms operate under constraints caused by the counter-epidemiological measures, e.g., masks, shields, gloves, and other personal protective equipment. These measures also impede the performance of biometric-enabled tools and systems. The following biometric traits are epidemic-conditioned:
− Face appearance is obstructed by a mask, safety glasses, and/or shield; the face mask prompts the usage of the periocular face region.
− Fingerprint and palmprint (both contact and touchless) usage is prevented by protective gloves or the need to use sanitizer to the point of impracticality.
− Iris biometrics may be affected by safety glasses and shields but not by the mask alone.
− Voice biometrics is impeded by the face mask or shield.
− Affective state is also conditioned by the face mask; its assessment is limited to the periocular face region.

In this Section, we show experimentally how checkpoint biometric recognition should be adjusted towards the counter-epidemiological requirements. For this, we chose two mandatory post-pandemic checkpoint recognition modes: 1) face mask detection, including whether the mask is worn correctly, and 2) authentication of the person wearing the mask. The authentication scenario, in this case, is as follows: given a person, 1) their face is acquired, and 2) the face features are extracted and matched against the data stored in the e-ID. The challenge of pandemic times is that only the periocular part of the face is available, while the lower part is hidden by a mask.

According to the ICAO-IATA recommendations, there are three kinds of biometric traits implemented within the e-ID: face, fingerprints, and iris [5]. In epidemiological scenarios, counter-epidemic measures such as personal protective equipment (masks, shields, glasses, and gloves) impact the availability of these biometric traits. Hence, it is reasonable to consider epidemic-conditioned biometric traits. In this paper, we focus on face recognition of mask-wearing persons, which is effectively replaced by periocular recognition. Intuitively, the performance of a face recognition system will be degraded if only part of the face is available. Thus, the goals of our experiments are as follows:
Goal I: Estimate the face recognition degradation given that only the periocular region is available;
Goal II: Investigate whether compensation for this degradation is possible by using additional data.

We investigate the potential of additional data, such as the face image in the infrared band, for processing the mask-obstructed part of the face; our approach is detailed below. There are several rational arguments for this approach:
− Cross-sensor periocular biometrics have been studied, in particular, in [19], [20], [21], [25];
− Thermal or infrared (IR) facial images have been used as a source of health-related indicators such as breathing function and breathing air temperature [26].

We conducted the following experiments in order to achieve the above formulated goals (Fig. 2):
Experiment I: Mouth and nose cover detection (mask detector).
Experiment II: Face verification (1:1 comparison) using the periocular region, in order to estimate the face recognition performance degradation;
Experiment III: Face identification (1:N comparison) using visual, thermal, and hybrid thermal+visual images, to examine whether capitalizing on different spectral domains can mitigate the performance degradation.

The main outcome of these experiments is the generation of a recovered face image using information from both the visual and thermal domains. We assume that images of an individual (a traveler) are taken at a checkpoint using both a visible-spectrum and a thermal camera. These images of the subject are processed in order to determine whether or not the subject is wearing a mask; this is performed using the mask detection approach (Experiment 1). If the subject is not wearing a mask, a regular visual face verification is applied to check whether the subject is on the watchlist (Experiment 2). If a face mask is detected, a thermal image of the same subject is used. The thermal face image is processed such that the masked portion of the face is recovered via image translation techniques such as generative adversarial networks. This process is performed in the thermal domain; we determined experimentally that the thermal domain reveals more detail on the masked face than the visual domain. After the subject is 'unmasked' via such processing, we extract only the lower face region and concatenate it with the upper visual face region to generate a complete face image. This image can then be compared to the images in a pre-existing legacy watchlist (Experiment 3).

In this paper, we perform three main experiments: mask detection, periocular face verification, and thermal+visual face identification. Since each task is different, the datasets required to train and test each experiment are also different. We used two main datasets, Flickr-Faces-HQ (FFHQ) [27] and the SpeakingFaces dataset [28]. Using each of these datasets, the authors of the published works, Queiroz et al. [29] and Cabani et al. [30], synthetically added a mask to each image, thus creating the Thermal-Mask and MaskedFace-Net datasets, respectively.

SpeakingFaces [28] and Thermal-Mask [29] were used in Experiments 1 and 3 for face mask detection in both the visible and thermal spectra. SpeakingFaces is a large-scale multimodal dataset that combines thermal, visual, and audio data streams. It includes data from 142 subjects, with a gender balance of 68 female and 74 male participants, with ages ranging upward from 20.

FFHQ [27] and MaskedFace-Net [30] were used in Experiment 2 for illustrating the impact of wearing masks on facial verification. FFHQ is a dataset containing 70,000 high-quality images crawled from Flickr. The images are of different subjects of varying age, ethnicity, and background, wearing different accessories. The resolution of each image is 1024×1024. The MaskedFace-Net dataset is a synthetically created dataset using FFHQ images as a base. Each image from the MaskedFace-Net dataset contains a subject wearing a mask either correctly or incorrectly. Figure 4 illustrates two images of the original subject and two images with the masks artificially 'placed over'.

In this paper, we present three unique experiments to illustrate the checkpoint biometric recognition process. Depending on the task, different performance measures are used to report the final results. In addition, each deep learning algorithm requires a suitable loss function for it to be optimally trained.
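To make the decision flow described above concrete, the following is a minimal Python sketch of the checkpoint logic spanning Experiments 1-3. The four injected callables (detect_mask, verify_face, unmask_thermal, identify_face) are hypothetical placeholders standing in for the models built in those experiments; nothing here is taken from the authors' implementation.

```python
from typing import Callable, Sequence

import numpy as np


def process_traveler(
    visual_img: np.ndarray,
    thermal_img: np.ndarray,
    watchlist: Sequence,
    detect_mask: Callable,      # Experiment 1: Cascade R-CNN mask detector
    verify_face: Callable,      # Experiment 2: Siamese 1:1 verification
    unmask_thermal: Callable,   # GAN-based recovery of the masked lower face
    identify_face: Callable,    # Experiment 3: 1:N identification on hybrid images
):
    """Sketch of the checkpoint decision flow; the callables are placeholders."""
    if not detect_mask(visual_img):
        # No mask detected: regular visible-spectrum verification against the watchlist.
        return verify_face(visual_img, watchlist)

    # Mask detected: recover the occluded lower face in the thermal domain,
    # then stitch it under the visual upper (periocular) half to form a hybrid image.
    recovered = unmask_thermal(thermal_img)
    h = visual_img.shape[0]
    hybrid = visual_img.copy()
    hybrid[h // 2:, ...] = recovered[h // 2:, ...]
    return identify_face(hybrid, watchlist)
```

The stitching step mirrors the hybrid-image construction detailed in Appendix IV: the upper half stays visual, while the lower half comes from the thermally 'unmasked' image.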
To detect masked and unmasked faces, we used the Cascade R-CNN model. As an object detector, the Cascade R-CNN applies two loss functions: the binary cross-entropy loss for classification and the smooth-L1 loss for bounding box regression.

Binary cross-entropy loss is used in two-class classification tasks, such as classifying between masked and unmasked faces. It measures the error between the ground truth and the predicted value, defined by the equation below and described in [32]:

$\mathrm{Loss}_{cls} = -\left[\, Y \log(Y_{pred}) + (1 - Y)\log(1 - Y_{pred}) \,\right]$   (1)

where $Y$ is the binary label, $Y_{pred}$ is the predicted value, and $\log$ is the natural logarithm.

Smooth-L1 loss is used for box regression in the Cascade R-CNN. This loss is less sensitive to outliers than most regression losses. It is defined by the equations below, as described in [33]:

$\mathrm{Loss}_{box} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t_i - v_i)$

$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$

where $t_i$ and $v_i$ denote the predicted and ground-truth bounding box parameters. We use a multi-task loss on each labeled region of interest to jointly train for classification and bounding-box regression:

$\mathrm{Loss} = \mathrm{Loss}_{cls} + \lambda\, \mathrm{Loss}_{box}$

where $\lambda$ balances the two terms.

For the verification network, as opposed to using the typical cross-entropy loss for a binary decision, we employed the contrastive loss, which has been shown in many works to perform well for Siamese networks. Contrastive loss is a better metric for comparison tasks as it focuses on learning from the distance metric, as opposed to cross-entropy loss, which focuses on classification error. The contrastive loss is defined as follows [34]:

$\mathrm{Loss} = (1 - Y)\,\tfrac{1}{2} D^2 + Y\,\tfrac{1}{2} \left\{ \max(0, \mathrm{Margin} - D) \right\}^2$

where $Y$ is the binary label ($Y = 0$ for a matching pair, $Y = 1$ for a non-matching pair), $D$ is the Euclidean distance between the two feature vectors, $\max$ is the maximum function choosing between the two provided values, and $\mathrm{Margin}$ represents the radius of influence.

For object/mask detection, we used the Intersection over Union (IoU) measure to assess the quality of the predicted object location. It describes the overlap between the ground truth bounding box and the predicted bounding box:

$\mathrm{IoU} = \dfrac{\text{Area of overlap}}{\text{Area of union}}$

The IoU is a value within the range of 0 to 1, with 1 being a perfect match between the ground truth and the predicted bounding boxes. In this context, the true positives (TP) are the predicted bounding boxes whose IoU is above a chosen threshold (usually 0.5), and a false positive (FP) occurs when the IoU is below this threshold. Based on the IoU, when the ground truth is present in the image and the model fails to detect the object, we classify it as a false negative (FN). A true negative (TN) accounts for every part of the image where we did not predict an object; however, this metric is not useful for object detection and will be ignored.

We used the mean Average Precision (mAP) to evaluate how well the Cascade R-CNN performs on the mask detection task. Precision represents the fraction of relevant instances (true positives) among the total number of detected instances:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$

Recall stands for the fraction of true positive cases out of the number of ground truth cases (both true positives and false negatives):

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$

To assess the entire model performance, we applied the mean Average Precision (mAP). We first calculate the average precision (AP) as the area under the curve (AUC) of the precision-recall curve for each category; it is the average value of the precision $p(r)$ over the interval from $\mathrm{Recall} = 0$ to $\mathrm{Recall} = 1$. Next, we average over the categories, given the number of categories (classes) $N$:

$\mathrm{mAP} = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$
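For reference, the two less standard quantities above, the contrastive loss and the IoU, can be written out directly. The following is a minimal Python sketch (not the authors' code), using the convention of [34] that Y = 0 labels a matching pair and Y = 1 a non-matching pair.

```python
def contrastive_loss(y: int, d: float, margin: float = 1.0) -> float:
    """Contrastive loss for a single pair.

    y:      binary label (0 = matching pair, 1 = non-matching pair, per [34])
    d:      Euclidean distance between the two feature vectors
    margin: radius of influence for non-matching pairs
    """
    return 0.5 * (1 - y) * d ** 2 + 0.5 * y * max(0.0, margin - d) ** 2


def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)
    return intersection / union if union > 0 else 0.0
```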
The performance in terms of accuracy for face identification and verification is defined as follows [35]:

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$   (10)

where TP is the number of true positives, i.e., the number of matching image pairs correctly identified by the network; TN is the number of true negatives, i.e., the number of non-matching image pairs correctly rejected by the network; FP is the number of false positives, i.e., the number of non-matching image pairs incorrectly accepted by the network; and FN is the number of false negatives, i.e., the number of matching image pairs incorrectly rejected by the network.

In this experiment, we focused on detecting faces in the visual and infrared (thermal) spectra and classifying these faces as masked or unmasked. We used the Thermal-Mask dataset [29], with unmasked and synthetically masked face images in the thermal spectrum, which was created based on the SpeakingFaces dataset [28]. We applied the same approach to the visual-spectrum images of SpeakingFaces, and for this study, we consider the set of all images (visual + thermal) as the Thermal-Mask dataset. Given 142 subjects, each recorded using 9 different head positions, we selected 42,460 masked face images (80 subjects) and 33,448 unmasked face images (62 subjects) for each spectrum, as described below, with the total number of images being 151,816. With the deep learning approach, we randomly selected 70% of the 142 subjects for training (100 subjects), 20% (28 subjects) for validation, and 10% (14 subjects) for testing. Note that these percentages may not be precise, since the subjects do not have the same number of images after the data cleaning. Table I summarizes the total number of samples in the final subset of the Thermal-Mask dataset; this amount is the same for both the visual and thermal versions.

TABLE I. Number of samples in each of the train, validation, and test data splits for the Thermal-Mask dataset (for one spectrum).

Split        Unmasked    Masked    Subjects
Train         23,188     29,842      100
Validation     5,940      8,905       28
Test           4,320      3,713       14
Total         33,448     42,460      142

To detect the masked and unmasked faces in the thermal and visual spectra, we applied the state-of-the-art Cascade R-CNN [36] model separately for each spectrum. We chose a two-stage model rather than a one-stage model, focusing on accuracy over processing speed. The Cascade R-CNN is an object detector that works as a multi-stage extension of the Faster R-CNN architecture. It is composed of a sequence of detectors trained with increasing IoU thresholds. Those detectors are trained sequentially, using the output of one as the training set for the next, as seen in Fig. 5A. Fig. 5B illustrates some of the results of this model applied to the test set of the Thermal-Mask dataset.

To assess the effectiveness of the Cascade R-CNN model, we applied the mean Average Precision (mAP) metric, which jointly assesses the localization of the faces through bounding box positions and the classification between masked and unmasked [37]. Table II reports the performance of the Cascade R-CNN for the mask detection task with images in the visual and infrared (thermal) spectra. It illustrates the results of the applied model with four different backbones: ResNet-50, ResNet-101 [38], ResNeXt-101-32x4d, and ResNeXt-101-64x4d [39]. For each backbone, the Feature Pyramid Network (FPN) was applied to overcome the low resolution of the feature maps in the upper layers. The mAP column in Table II reports the mAP calculated over different IoU thresholds (0.5:0.05:0.95), averaged over all classes and over the IoU thresholds. The subsequent columns show the mAP at IoU thresholds of 0.5 (mAP50) and 0.75 (mAP75).
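For illustration, a detector of this kind (ResNeXt-101-64x4d backbone, FPN neck, cascade of detection heads; see also Appendix IV) is typically assembled from a configuration such as the skeletal, MMDetection-style sketch below. The key names follow MMDetection conventions, the RPN/cascade heads and training settings are omitted, and none of it is taken from the authors' configuration files.

```python
# Skeletal, MMDetection-style model configuration (illustrative only).
model = dict(
    type='CascadeRCNN',
    backbone=dict(
        type='ResNeXt',
        depth=101,
        groups=64,                 # cardinality (number of parallel paths)
        base_width=4,              # bottleneck width ("4d")
        num_stages=4,
        out_indices=(0, 1, 2, 3),  # feed all four stages to the neck
    ),
    neck=dict(
        type='FPN',                # Feature Pyramid Network over the backbone stages
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5,
    ),
    # rpn_head and the three cascade detection heads (IoU thresholds 0.5/0.6/0.7),
    # as well as train_cfg/test_cfg, are omitted here for brevity.
)
```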
The ResNeXt [39] CNN contains particular attributes, such as parallel paths, which yield better performance than ResNet [38] at the same complexity (number of floating-point operations, FLOPS). This happens because the ResNeXt topology shares the same hyperparameters (width and filter sizes) between the parallel modules, reducing the total number of hyperparameters. Based on this observation, we can further improve the performance by increasing the cardinality (number of parallel paths) rather than increasing the number of layers. In the thermal spectrum, better results were obtained with ResNeXt with a cardinality of 64 and a bottleneck width of 4d in all metrics. However, in the visual spectrum, we observed a slight difference that made the model with the lower cardinality of 32 better in the mAP over multiple IoU thresholds. The authors of the ResNeXt backbone mentioned in the original article that, in some cases, increased cardinality begins to show a saturation of accuracy for more complex datasets. Since the visible-spectrum images are relatively more complex than the infrared (thermal) images, we believe this is why cardinality = 32 produced better results than cardinality = 64.

The above leads to the following conclusion: with the Cascade R-CNN model, we can locate and classify faces with or without masks at a relatively high performance (0.879 mAP). These results provide sufficient precision and can be passed to the face identification/verification module for further processing.

In this experiment, we explore the influence of wearing masks on facial verification (1:1 comparison) using the FFHQ [27] and MaskedFace-Net [30] datasets. For this experiment, we designed 3 training cases and 3 testing cases, resulting in 9 performance measures. We have the following image pairs for testing/training:
• (a) image 1: no mask, image 2: no mask
• (b) image 1: mask, image 2: mask
• (c) image 1: no mask, image 2: mask
where no mask (FFHQ) represents an image containing a subject that is not wearing a mask and mask (MaskedFace-Net) represents an image containing a subject wearing a mask.

For performance evaluation, we used a 5-fold cross-validation method, where all the samples were divided into 5 partitions. For each fold, 4 partitions are used for training while the remaining partition is used for validation. The results are then averaged across the 5 folds. Due to the limited number of images per subject, we chose to use a Siamese network [31] to perform facial verification using one-shot learning. Fig. 6 illustrates the overall architecture used in this paper for facial matching given two images. The basic idea of a Siamese network is to extract features from two images and then compare the features from each image to determine whether they are of the same person. The comparison task can be performed using a distance metric such as the Euclidean distance. Ideally, when two images are of the same subject, the extracted features from each image should be similar, and therefore the Euclidean distance between the two should be near zero; the opposite should be true when comparing different subjects. Note that since the Siamese network compares two images, the order of image 1 and image 2 does not matter, i.e., image 1 and image 2 can be swapped. The Siamese network is trained using the Adam optimizer for 10 epochs; further details regarding the training process and hyper-parameter tuning can be found in Appendix IV.
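A minimal PyTorch sketch of such a Siamese verifier is shown below. It is illustrative only: a ResNet-18 backbone stands in for the Inception v3 backbone described in Appendix IV, the match/no-match head is reduced to a plain Euclidean distance, and the embedding size is an arbitrary choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class SiameseVerifier(nn.Module):
    """Sketch of a Siamese network for 1:1 face verification (not the authors' code)."""

    def __init__(self, embedding_dim: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.backbone = backbone             # shared ("twin") weights
        self.embed = nn.Linear(512, embedding_dim)

    def embed_one(self, x: torch.Tensor) -> torch.Tensor:
        return self.embed(self.backbone(x))

    def forward(self, img1: torch.Tensor, img2: torch.Tensor) -> torch.Tensor:
        # Both images pass through the same weights; the output is the
        # Euclidean distance between the two embeddings (near zero for a match).
        e1, e2 = self.embed_one(img1), self.embed_one(img2)
        return F.pairwise_distance(e1, e2)
```

Training then follows Appendix IV (Adam, learning rate 0.001, 10 epochs, batch size 32), with the contrastive loss pulling matching pairs toward zero distance and pushing non-matching pairs beyond the margin.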
Table III reports the accuracy (Equation 10) of facial verification for the various training and testing scenarios. It can be seen that when the training and testing sets are of the same type, the performance is near perfect (> 98%), while different sets result in much-degraded performance (< 64%). This is likely because the network overfits to specific facial features shared between the images, which are obscured when a mask is worn. This results in a degraded network that is no longer capable of matching these image pairs. This degradation indicates that, by default, a network trained with uncovered faces is incapable of performing facial verification on individuals wearing masks. Note that for training case (c), all test results report fairly high accuracy, indicating that when possible, it is beneficial to train on both masked and unmasked subjects; however, such mixed training data may not always be available.

Next, we examine the performance of the same network given images with only the periocular region shown. The details of creating the modified (periocular) images are provided in Appendix IV. Table IV reports the performance of the Siamese network when using the periocular images for training and testing. The results presented in Tables III and IV show that focusing on the periocular region recovers most of the verification performance lost to mask-wearing.

To summarize, the conducted experiments illustrate the current limitation of one-shot face verification models, specifically when subjects are wearing masks. The accuracy degradation reaches 36.78% (the 99.78% accuracy in recognizing faces of subjects not wearing a mask drops to 63.00% when the subjects are wearing a mask). A possible remedy, confirmed by the experiments, is to perform face verification with emphasis on the periocular region, which greatly alleviates this problem. Comparing periocular biometrics against masked-face verification, we observe an accuracy difference of 36.34% (99.34% with periocular images versus 63.00% for the faces of subjects wearing masks). This shows that by placing emphasis on the periocular region, we are able to reduce the accuracy degradation due to masks by a huge margin without the need for a specific "masked face" dataset.

In this experiment, we are interested in examining the performance of the recognition algorithm on hybrid visual and thermal face images. A hybrid image consists of two portions: the top portion, which includes the forehead and the periocular region of the face taken in the visual spectrum, and the bottom portion, consisting of the mouth and neck region of the unmasked individual generated in the thermal spectrum. By combining the top and the bottom portions of the image, we attempt to recover a complete face image to be matched against the database face data. In Experiment 3, we focus on facial identification (1:N comparison), as opposed to Experiment 2, which focused on facial verification (1:1 comparison). As such, we chose to use the Thermal-Mask [29] and SpeakingFaces [28] datasets, as they contain facial images of subjects taken in both the visual and thermal domains. For this experiment, we used five types of images to illustrate the impact of using thermal images of masked faces on the subject recognition performance. The types of images we used include: visual face, masked visual face, thermal face, masked thermal face, and recovered face image (Fig. 7). For this experiment, a typical Convolutional Neural Network (CNN) based on Inception v3 is used for feature extraction. This CNN (Fig. 8) is trained using the Adam optimizer via the categorical cross-entropy loss function. A detailed description of the CNN architecture is provided in Appendix IV.
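As an illustration of the transfer-learning setup used for identification (global average pooling followed by 1024-, 512-, 1024-, and 80-unit fully-connected layers, per Appendix IV), here is a hedged PyTorch sketch. A torchvision ResNet-50 stands in for Inception v3 and the ReLU activations are an assumption; this is not the authors' code.

```python
import torch.nn as nn
from torchvision import models


def build_identifier(num_subjects: int = 80) -> nn.Module:
    """Transfer-learning classifier sketch for 1:N identification (Experiment 3)."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    # The backbone already ends with global average pooling; replace its
    # original classification layer with the 1024-512-1024-80 head.
    backbone.fc = nn.Sequential(
        nn.Linear(2048, 1024), nn.ReLU(),
        nn.Linear(1024, 512), nn.ReLU(),
        nn.Linear(512, 1024), nn.ReLU(),
        nn.Linear(1024, num_subjects),   # subject identity (softmax via the loss)
    )
    return backbone
```

Fine-tuning this head with Adam (learning rate 0.001) and cross-entropy, as in Appendix IV, yields the 80-way subject classifier evaluated below.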
The performance evaluation is based on the identification accuracy (Equation 10) across 5-fold cross-validation. Table V reports the identification accuracy (Equation 10) using assorted masked and unmasked faces taken in the thermal, visual, and thermal+visual domains. The column represents the image type used for testing, and the row represents the image type used for training. For example, row-1 and column-3 refer to training the CNN using unmasked visual face images and testing with unmasked thermal face images. In this experiment, we see a performance degradation of 0.9982 − 0.4189 = 0.5793 (taken from row-1, column-1 and row-1, column-2 in Table V) in the visual domain when a mask is worn. Similarly, a loss of 0.9899 − 0.3055 = 0.6844 (taken from row-3, column-3 and row-3, column-4 in Table V) is observed in the thermal domain.

There are a few observations to be made from Table V. 1) The diagonal of the matrix shows the highest performance. This exceptional accuracy is most likely because both the training and testing sets are from the same cohort (same image domain and same masked/unmasked condition). 2) Accuracy between images from the same image domain (visual or thermal) is much higher than cross-domain (visual vs. thermal). 3) When masked images are used for training and unmasked images for testing, the performance is far higher than in the reverse case.

The numbers shown in row-1, column-2 and row-5, column-2 illustrate the scenario where the model is trained on a pre-existing dataset (watchlist): its accuracy reaches only 41.89% when the same individual wears a mask. This accuracy can be improved to 47.04% if the lower mouth region (hidden under the mask in the visual domain) is replaced with the thermal region of the mouth (synthetically unmasked). Note that this compensation method via hybrid images is only applicable in the visual domain, as the periocular region in the thermal domain is often obscured by the use of clear glasses (which appear dark in the thermal domain but clear in the visual domain).

Leaders in security technologies such as the NEC Corporation have been searching for epidemic-conditioned solutions in biometrics and for ways to overcome the degradation of face recognition. Recently, NEC reported that they achieved 99.9% accuracy of periocular recognition, compared with the 98.21% reported in our experiments. However, NEC achieved these results through the process of QR immunity-passport verification, i.e., in scenarios where additional information is available. Specifically, NEC combined three authentication methods: associated possession (ID), token, and a biometric trait, thus reducing the problem of biometric identification to verification, i.e., reducing from 1:Many to 1:1 matching [40]. Based on our experimental results, we deduce that it is realistic to improve the accuracy achieved in our experiments by 1.79% in order to reach the 99.9% reported by NEC.

This Appendix provides more detail on the parameters of the experiments described in this paper.

Parameters: For this experiment (Experiment 1), our model was trained for 12 epochs using a mini-batch size of 2 and the Stochastic Gradient Descent (SGD) optimizer, with a learning rate of 0.002, β1 = 0.9, and β2 = 0.0001.
The learning rate is a tuning hyper-parameter that determines the step size at each iteration while minimizing the loss function. β1 stands for the momentum, which adds a fraction of the previous weight update to the current one to avoid local minima and speed up training. β2 stands for the weight decay, a regularization technique that adds a small penalty to the loss function. As an object detector, the model applies two loss functions: the binary cross-entropy loss [32] for classification and the smooth-L1 loss [33] for the bounding box regression (localization).

Architecture: For this experiment, we divided our model architecture into the following parts:
1) Backbone, a CNN that takes the image as input and extracts the feature map.
2) Neck, a component between the backbone and the head of the architecture that performs improvements or refinements to the feature maps.
3) Head, the target object (masked/unmasked face) detector part of the network architecture.

In our experiment, we applied the ResNet [38] and ResNeXt [39] backbones. We compared ResNet-50 and ResNet-101, with 50 and 101 layers, respectively. For the ResNeXt, we considered the ResNeXt-101-32x4d, which stands for the architecture with 101 layers, 32 parallel pathways, and a bottleneck width of 4 dimensions. We also applied the ResNeXt-101-64x4d, with a higher cardinality of 64. A sequence of several CNN layers usually leads to an increase in the semantic value of the feature maps, while the spatial dimension (resolution) decreases. To overcome the low resolution of the feature maps in the upper layers, we applied the Feature Pyramid Network (FPN) [41]. It takes an image as input and outputs feature maps at multiple levels (different sizes) in a fully convolutional fashion, which improves the detection of small objects. The head of our architecture is the Cascade R-CNN [36], which is composed of a sequence of detectors trained with increasing Intersection-over-Union (IoU) thresholds. It is implemented with four stages: one Region Proposal Network (RPN) and three detection heads with IoU thresholds of 0.5, 0.6, and 0.7.

Parameters: For this experiment (Experiment 2), we used the Adam optimizer to train our Siamese network. Adam [42] is an optimization algorithm used in place of standard stochastic gradient descent. We chose the following parameters for the training: a learning rate of 0.001, β1 = 0.9, β2 = 0.999, 10 epochs, and a batch size of 32. The learning rate is a hyper-parameter that controls how large an update is applied to the model. β1 and β2 are two coefficients that control the decay rates of the first and second moment estimates, respectively. The number of epochs is the number of times the entire training set is used for training. The batch size is a hyper-parameter that determines how many samples the model sees before an update is triggered.

Architecture: In this experiment, the machine-learning model is a Siamese network designed to perform 1-to-1 image comparisons, corresponding to facial verification (1:1 comparison). A unique component of this Siamese network is the "twin" connection between the two backbone networks. For this experiment, we used Inception v3 as the backbone network, which is a CNN proposed by Szegedy et al. [43]. It is pre-trained on the ImageNet dataset to extract image features. The "twin" connection is designed in such a way that both networks share exactly the same weights and, therefore, when presented with the same image, they should yield exactly the same output.
Each output from these two backbone networks is passed through a global average pooling layer and a 2096-unit fully-connected layer. The Euclidean distance is then calculated between the outputs of the two 2096-unit fully-connected layers. The computed distance is then analyzed through three fully-connected layers with sizes of 512, 2056, and 2 units. The resulting output of the 2-unit layer is a two-class probability representing match or no-match.

Image Rescaling: The images used in this experiment are from the FFHQ dataset, which contains images at 1024×1024 pixel resolution. Since the image resolution is not a strict requirement for this experiment, all images have been scaled down to 256×256 pixel resolution for the purpose of conserving memory usage. The rescaling is done via area-based interpolation.

Periocular Processing: The creation of periocular images is done by "blacking out" the non-essential regions of the face. The proposed modification is to divide each image into an equal 8×8 grid of regions. After the image rescaling process, each image has a pixel resolution of 256×256. We grouped 32×32 pixels together to divide the image into equal 8×8 regions and labeled them sequentially (with the top-left corner labeled 0 and the bottom-right corner labeled 63). Since the original authors centered the location of the detected face, we determined that the periocular region of the face is located between regions 25-30 (highlighted in red in Fig. 9(a)). In this paper, we propose to use a masking procedure to perform the "blacking out" process. The purpose of such a masking technique is to maintain the same image resolution as the original image while randomly "blacking out" the non-periocular regions of the face. By removing or "blacking out" these regions, we are essentially steering any network using these images to focus on the periocular regions as opposed to other features of the face (such as the nose). The mask shown in Fig. 9 has the following properties:
• borders of the image are always dropped (top/bottom rows and left/right columns are always dropped);
• periocular regions are always kept intact (regions 25 to 30);
• all remaining regions have a 50% chance of being dropped.

Fig. 9. Example of a modified face image: (a) dividing the face into 8×8 regions, and (b) applying the mask to a random subject.
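The masking procedure just described (periocular regions 25-30 always kept, border regions always dropped, every other region dropped with 50% probability) can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code.

```python
import numpy as np


def periocular_mask(img: np.ndarray, rng=None) -> np.ndarray:
    """Black out non-periocular 32x32 regions of a 256x256 face image."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    for region in range(64):                     # regions 0..63, row-major
        row, col = divmod(region, 8)
        if 25 <= region <= 30:
            continue                             # periocular band: always kept
        on_border = row in (0, 7) or col in (0, 7)
        if on_border or rng.random() < 0.5:
            out[row * 32:(row + 1) * 32, col * 32:(col + 1) * 32, ...] = 0
    return out
```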
Parameters: For this experiment (Experiment 3), we used the Adam optimizer [42] to train our CNN. Similar to Experiment 2, the following parameters were used: a learning rate of 0.001, β1 = 0.9, β2 = 0.999, 10 epochs, and a batch size of 32.

Architecture: In this experiment, our machine-learning model is a transfer-learning-based CNN such as Inception v3 [43]. The transfer learning process involves modifying the original Inception v3 by replacing the top fully-connected layers with new fully-connected layers. The network is then fine-tuned with these new fully-connected layers in order to adapt it to the new data. For this experiment, we replaced the original layers with a global average pooling layer and 1024-unit, 512-unit, 1024-unit, and 80-unit fully-connected layers. The global average pooling layer averages the features across the channel dimensions, thereby converting the feature vector from 2D to 1D. The 1024-512-1024 unit configuration is a simple multilayer perceptron shown to yield the best performance in this experiment. The last 80-unit fully-connected layer (80 units because we had 80 subjects) is the classification layer, which outputs the identity of the subject.

Image Rescaling: The images used in this experiment are from the SpeakingFaces and Thermal-Mask datasets, which contain images at various pixel resolutions (464×348 for thermal and 768×512 for visual). Similar to Experiment 2, all images are scaled to 256×256 pixel resolution using area-based interpolation.

Image Processing: For this experiment, we deployed three image domains: thermal, visual, and thermal+visual (hybrid). Both the thermal and visual images can be processed directly, while the hybrid images require additional segmentation and concatenation. The hybrid image consists of two partial images: the top half is the visual image and the bottom half is the thermal image. Since each image is pre-aligned by the authors of the dataset [28], we were able to directly crop and use the top half of the visual image and the bottom half of the thermal image. Fig. 10 illustrates the hybrid image creation process.
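A minimal sketch of this hybrid-image construction, assuming aligned visual/thermal image pairs and using OpenCV for the area-based rescaling (file paths and the function name are illustrative):

```python
import cv2
import numpy as np


def make_hybrid(visual_path: str, thermal_path: str, size: int = 256) -> np.ndarray:
    """Stack the visual top half over the thermal bottom half of an aligned pair."""
    visual = cv2.resize(cv2.imread(visual_path), (size, size),
                        interpolation=cv2.INTER_AREA)
    thermal = cv2.resize(cv2.imread(thermal_path), (size, size),
                         interpolation=cv2.INTER_AREA)
    half = size // 2
    return np.vstack([visual[:half], thermal[half:]])
```

The result is a 256×256 image with the visual forehead/periocular region on top and the thermal mouth/neck region below, corresponding to the recovered (hybrid) face type in Fig. 7.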
References:
− Transportation Security Administration, Layers of Security
− A Survey of Security and Privacy Issues in ePassport Protocols
− Machine Readable Travel Documents
− Traveller processes for biometric recognition in automated border control systems
− Automated border control: state of play and latest developments, Federal Office for Information Security
− Bridging the Gap Between Forensics and Biometric-Enabled Watchlists for e-Borders
− Cognitive checkpoint: Emerging technologies for biometric-enabled watchlist screening
− Taxonomy and modelling of impersonation in e-border authentication
− Adaptive fusion of biometric and biographic information for identity de-duplication
− Biometric-enabled authentication machines: A survey of open-set real-world applications
− Biometric Recognition in Automated Border Control: A Survey
− IATA (International Air Transport Association), Checkpoint of the future: Executive summary, 4th Proof
− Protect rights at automated borders, Nature
− A review of risk-based security and its impact on TSA PreCheck
− IATA (International Air Transport Association), Travel-Pass Initiative
− IATA (International Air Transport Association), How IATA Travel Pass is using blockchain technology to keep passengers in control of their data
− Classification of Soft Biometric Traits When Matching Near-Infrared Long-Range Face Images Against Their Visible Counterparts
− Cross-Sensor Periocular Biometrics for Partial Face Recognition in a Global Pandemic: Comparative Benchmark and Novel Multialgorithmic Approach, preprint
− Ongoing FRVT part 6a: Face recognition accuracy with face masks using pre-COVID-19 algorithms
− Digital technologies in the public-health response to COVID-19
− Identifying Facemask-Wearing Condition Using Image Super-Resolution with Classification Network to Prevent COVID-19
− Subject independent evaluation of eyebrows as a stand-alone biometric
− Subclass Heterogeneity Aware Loss for Cross-Spectral Cross-Resolution Face Recognition
− Noncontact Measurement of Breathing Function
− A Style-Based Generator Architecture for Generative Adversarial Networks
− SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams
− Thermal-Mask: A Dataset for Facial Mask Detection and Breathing Rate Measurement
− MaskedFace-Net: A dataset of correctly/incorrectly masked face images in the context of COVID-19
− SigNet: Convolutional Siamese network for writer independent offline signature verification
− Machine learning: a probabilistic perspective
− IEEE International Conference on Computer Vision (ICCV)
− Dimensionality Reduction by Learning an Invariant Mapping
− An Introduction to ROC Analysis
− High Quality Object Detection and Instance Segmentation
− Recall, precision and average precision, Department of Statistics and Actuarial Science
− Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
− Aggregated Residual Transformations for Deep Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
− Feature Pyramid Networks for Object Detection
− Adam: A Method for Stochastic Optimization
− Rethinking the Inception Architecture for Computer Vision

This project was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) through the grant "Biometric intelligent interfaces".