DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover's Distance Improves Out-Of-Distribution Face Identification
Hai Phan, Anh Nguyen
2021-12-07

Abstract: Face identification (FI) is ubiquitous and drives many high-stake decisions made by law enforcement. State-of-the-art FI approaches compare two images by taking the cosine similarity between their image embeddings. Yet, such an approach suffers from poor out-of-distribution (OOD) generalization to new types of images (e.g., when a query face is masked, cropped, or rotated) not included in the training set or the gallery. Here, we propose a re-ranking approach that compares two faces using the Earth Mover's Distance on the deep, spatial features of image patches. Our extra comparison stage explicitly examines image similarity at a fine-grained level (e.g., eyes to eyes) and is more robust to OOD perturbations and occlusions than traditional FI. Interestingly, without finetuning the feature extractors, our method consistently improves the accuracy on all tested OOD queries (masked, cropped, rotated, and adversarial) while obtaining similar results on in-distribution images.

Figure 1. Traditional face identification ranks gallery images based on their cosine distance with the query (top row) at the image-level embedding, which yields large errors upon out-of-distribution changes in the input (e.g. masks or sunglasses; b-d). We find that re-ranking the top-k shortlisted faces from Stage 1 (leftmost column) using their patch-wise EMD similarity w.r.t. the query substantially improves the precision (Stage 2) on challenging cases (b-d). The "Flow" visualization intuitively shows the patch-wise reconstruction of the query face using the most similar patches (i.e. highest flow) from the retrieved face. See Fig. S4 for a full figure with the top-5 candidates.

Who shoplifted from the Shinola luxury store in Detroit [4]? Who are you to receive unemployment benefits [5] or board an airplane [1]? Face identification (FI) today is behind the answers to such life-critical questions. Yet, the technology can make errors, leading to severe consequences, e.g. people wrongly denied unemployment benefits [5] or falsely arrested [2, 3, 4, 7]. Identifying the person in a single photo remains challenging because, in many cases, the problem is a zero-shot and ill-posed image retrieval task.
First, a deep feature extractor may not have seen a normal, non-celebrity person during training. Second, there may be too few photos of a person in the database for FI systems to make reliable decisions. Third, it is harder to identify a face in the wild (e.g. from surveillance cameras) when it is occluded [44, 54] (e.g. wearing a mask), distant, or cropped, yielding a new type of photo that appears in neither the training set of the deep networks nor the retrieval database, i.e., out-of-distribution (OOD) data. For example, face verification accuracy may drop significantly (from 99.38% to 81.12% on LFW) given an occluded query face (Fig. 1b-d) [44] or adversarial queries [8, 73].

In this paper, we propose to evaluate the performance of state-of-the-art facial feature extractors (ArcFace [19], CosFace [61], and FaceNet [47]) on OOD face identification tests. That is, our main task is to recognize the person in a query image given a gallery of known faces. Besides in-distribution (ID) query images, we also test FI models on OOD queries that contain (1) common occlusions, i.e. random crops, faces with masks or sunglasses; and (2) adversarial perturbations [73]. Our main findings are as follows (code, demo, and data are available at https://github.com/anguyen8/deepface-emd):

• Interestingly, the OOD accuracy can be substantially improved via a 2-stage approach (see Fig. 2): First, identify a set of the most globally-similar faces from the gallery using cosine distance and then re-rank these shortlisted candidates by comparing them with the query at the patch-embedding level using the Earth Mover's Distance (EMD) [45] (Sec. 3 & Sec. 4).

• Across three different models (ArcFace, CosFace, and FaceNet), our re-ranking approach consistently improves the original precision (under all metrics: P@1, R-Precision, and MAP@R) without finetuning (Sec. 4). That is, interestingly, the spatial features extracted from these models can be leveraged to compare images patch-wise (in addition to image-wise) and further improve FI accuracy.

• On masked images [59], our re-ranking method (no training) rivals ArcFace models finetuned directly on masked images (Sec. 4.3). To our knowledge, our work is the first to demonstrate the remarkable effectiveness of EMD for comparing OOD, occluded, and adversarial images at the deep feature level.

To demonstrate the generality of our method, we adopt the following simple FI formulation as in [19, 33, 71]: identify the person in a query image by ranking all gallery images based on their pair-wise similarity with the query. After ranking (Stage 1) or re-ranking (Stage 2), we take the identity of the top-1 nearest image as the predicted identity.

Evaluation: Following [38, 71], we use three common evaluation metrics: Precision@1 (P@1), R-Precision (RP), and MAP@R (M@R). See their definitions in Sec. B1 of [71].
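For concreteness, the sketch below computes these three metrics for a single query, following their standard definitions from the metric-learning literature [38, 71]; the function name and interface are our own illustration, not the paper's released evaluation code.

```python
import numpy as np

def precision_metrics(ranked_ids, query_id):
    """P@1, R-Precision, and MAP@R for one query, following the standard
    definitions in the metric-learning literature [38, 71].
    ranked_ids: gallery identity labels sorted from most to least similar."""
    relevant = np.asarray(ranked_ids) == query_id
    R = int(relevant.sum())      # number of gallery images of the query identity
    p_at_1 = float(relevant[0])
    if R == 0:
        return p_at_1, 0.0, 0.0
    r_precision = float(relevant[:R].mean())
    # MAP@R: precision at each relevant position within the top-R, averaged over R.
    precision_at_i = np.cumsum(relevant[:R]) / (np.arange(R) + 1)
    map_at_r = float((precision_at_i * relevant[:R]).mean())
    return p_at_1, r_precision, map_at_r
```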
Pre-trained models: We use three state-of-the-art PyTorch models of ArcFace, FaceNet, and CosFace, pre-trained on CASIA [65], VGGFace2 [14], and CASIA, respectively. Their architectures are ResNet-18 [24], Inception-ResNet-v1 [56], and 20-layer SphereFace [33], respectively. See Sec. S1 for more details on the network architectures and their PyTorch implementations.

Image pre-processing: For all networks, we align and crop input images following the 3D facial alignment in [11] (which uses 5 reference points, 0.7 and 0.6 crop ratios for width and height, and a similarity transformation). All images shown in this paper (e.g. Fig. 1) are pre-processed. Using MTCNN, the default pre-processing of all three networks, does not change the results substantially (Sec. S5).

Stage 1: Ranking. A common one-stage face identification approach [33, 47, 61] ranks gallery images based on their pair-wise cosine similarity with a given query in the last-linear-layer feature space of a pre-trained feature extractor (Fig. 2). Here, our image embeddings are extracted from the last linear layer of all three models and are all in R^512.

Stage 2: Re-ranking. We re-rank the top-k (where the optimal k = 100) candidates from Stage 1 by computing the patch-wise similarity for an image pair using EMD. Overall, we compare faces in two hierarchical stages (Fig. 2), first at a coarse, image level and then at a fine-grained, patch level. Via an ablation study (Sec. 3), we find our 2-stage approach (a.k.a. DeepFace-EMD) more accurate than Stage 1 alone (i.e. no patch-wise re-ranking) and also than Stage 2 alone (i.e. sorting the entire gallery using patch-wise similarity).

EMD is an edit distance between two sets of weighted objects or distributions [45]. Its effectiveness was first demonstrated in measuring pair-wise image similarity based on color histograms and texture frequencies [45] for image retrieval. Yet, EMD is also an effective distance between two text documents [29], probability distributions (where EMD is equivalent to the Wasserstein, i.e. Mallows, distance) [30], and distributions in many other domains [27, 34, 43]. Here, we propose to harness EMD as a distance between two faces, i.e. two sets of weighted facial features.

Let $Q = \{(q_1, w_{q_1}), ..., (q_N, w_{q_N})\}$ be a set of N (facial feature, weight) pairs describing a query face, where $q_i$ is a feature (e.g. left eye or nose) and the corresponding $w_{q_i}$ indicates how important the feature $q_i$ is in FI. The flow between Q and the set of weighted features of a gallery face $G = \{(g_1, w_{g_1}), ..., (g_N, w_{g_N})\}$ is any matrix $F = (f_{ij}) \in \mathbb{R}^{N \times N}$. Intuitively, $f_{ij}$ is the amount of importance weight at $q_i$ that is matched to the weight at $g_j$. Let $d_{ij}$ be a ground distance between $(q_i, g_j)$ and $D = (d_{ij}) \in \mathbb{R}^{N \times N}$ be the ground distance matrix of all pair-wise distances. We want to find an optimal flow F that minimizes the following cost function, i.e. the sum of weighted pair-wise distances across the two sets of facial features:

  $\mathrm{COST}(Q, G) = \min_{F} \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \, d_{ij}$   (1)

subject to the standard EMD constraints [45]:

  $f_{ij} \ge 0, \quad 1 \le i, j \le N$   (2)
  $\sum_{j=1}^{N} f_{ij} \le w_{q_i}, \quad \sum_{i=1}^{N} f_{ij} \le w_{g_j}$   (3)
  $\sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} = \min\Big(\sum_{i=1}^{N} w_{q_i}, \; \sum_{j=1}^{N} w_{g_j}\Big)$   (4)

As in [66, 71], we normalize the weights of a face such that the total weight of its features is 1, i.e. $\sum_{i=1}^{N} w_{q_i} = \sum_{j=1}^{N} w_{g_j} = 1$, which is also the total flow in Eq. (4). Note that EMD is a metric iff the two distributions have an equal total weight and the ground distance function is a metric [16]. We use the iterative Sinkhorn algorithm [18] to efficiently solve the linear programming problem in Eq. (1), which yields the final EMD between two faces Q and G.
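A minimal PyTorch sketch of this Stage-2 distance is shown below, using entropy-regularized Sinkhorn iterations [18] to approximate the optimal flow of Eq. (1); the regularization strength `eps`, the iteration count, and all names are our illustrative choices, not the paper's released implementation.

```python
import torch

def sinkhorn_emd(q_feats, g_feats, w_q, w_g, eps=0.05, iters=50):
    """Approximate EMD between two sets of patch embeddings.
    q_feats, g_feats: (N, C) patch embeddings of the query and gallery faces.
    w_q, w_g: (N,) importance weights, each normalized to sum to 1 (Sec. 2).
    Returns the transport cost <F, D> with F from Sinkhorn iterations."""
    q = torch.nn.functional.normalize(q_feats, dim=1)
    g = torch.nn.functional.normalize(g_feats, dim=1)
    D = 1.0 - q @ g.t()                    # ground distance: cosine distance
    K = torch.exp(-D / eps)                # Gibbs kernel of the regularized problem
    u = torch.ones_like(w_q)
    for _ in range(iters):                 # Sinkhorn fixed-point updates
        u = w_q / (K @ (w_g / (K.t() @ u)))
    v = w_g / (K.t() @ u)
    F = torch.diag(u) @ K @ torch.diag(v)  # approximate optimal flow matrix
    return (F * D).sum()                   # EMD cost of Eq. (1)
```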
Facial features: In image retrieval using EMD, a set of features $\{q_i\}$ can be a collection of dominant colors [45], spatial frequencies [45], or a histogram-like descriptor based on the local patches of reference identities [60]. Inspired by [60], we also divide an image into a grid, but we take the embeddings of the local patches from the last convolutional layer of each network. That is, in FI, face images are aligned and cropped such that the entire face covers most of the image (see Fig. 1a). Therefore, without facial occlusion, every image patch is supposed to contain useful identity information, in contrast to natural photos [66].

Our grid sizes H × W for ArcFace, FaceNet, and CosFace are, respectively, 8×8, 3×3, and 6×7, which are the corresponding spatial dimensions of their last convolutional layers (see the definitions of these layers in Sec. S1). That is, each feature $q_i$ is an embedding of size 1×1×C, where C is the number of channels (i.e. 512, 1792, and 512 for ArcFace, FaceNet, and CosFace, respectively).

Ground distance: Like [66, 71], we use the cosine distance as the ground distance $d_{ij}$ between the embeddings $(q_i, g_j)$ of two patches:

  $d_{ij} = 1 - \frac{q_i \cdot g_j}{\|q_i\| \, \|g_j\|}$   (5)

where $\cdot$ is the dot product between two feature vectors.

EMD in our FI is intuitively an optimal plan to match all weighted features across two images. Therefore, how to weight the features is an important step. Here, we thoroughly explore five different feature-weighting techniques for FI.

Uniform: Zhang et al. [66] found that it is beneficial to assign lower weight to less informative regions (e.g. background or occlusion) and higher weight to discriminative areas (e.g. those containing salient objects). Yet, assigning an equal weight to all N = H × W patches is worth testing, given that background noise is often cropped out of the pre-processed face image (Fig. 1):

  $w_{q_i} = \frac{1}{N}$   (6)

Average Pooling Correlation (APC): Instead of uniformly weighting all patch embeddings, an alternative from [66] is to weight a given feature $q_i$ proportionally to its correlation with the entire other image in consideration. That is, the weight $w_{q_i}$ is the dot product between the feature $q_i$ and the average-pooling output of all embeddings $\{g_j\}_{j=1}^{N}$ of the gallery image:

  $w_{q_i} = \max\Big(0, \; q_i \cdot \frac{1}{N} \sum_{j=1}^{N} g_j\Big)$   (7)

where max(.) keeps the weights non-negative. APC tends to assign near-zero weight to occluded regions and, interestingly, also minimizes the weight of the eyes and mouth in a non-occluded gallery image (see Fig. 3b; blue shades around both the mask and the non-occluded mouth).

Cross Correlation (CC): APC [66] is different from CC, introduced in [71], which is the same as APC except that CC uses the output vector of the last linear layer (see code) instead of the global average-pooling vector used in APC.

Spatial Correlation (SC): Both APC and CC first "summarize" the entire other gallery image into a single vector and then compute its correlation with a given patch $q_i$ of the query. In contrast, an alternative, inspired by [53], is to take the sum of the cosine similarities between the query patch $q_i$ and every patch $\{g_j\}_{j=1}^{N}$ of the gallery image:

  $w_{q_i} = \sum_{j=1}^{N} \frac{q_i \cdot g_j}{\|q_i\| \, \|g_j\|}$   (8)

We observe that SC often assigns a higher weight to occluded regions, e.g., masks and sunglasses (Fig. 3b).

Landmarking (LMK): While the previous three techniques adaptively rely on image-patch similarity (APC, CC) or patch-wise similarity (SC) to weight a given patch embedding, the points they consider important may or may not align with facial landmarks, which are known to be important for many face-related tasks. Here, as a baseline for APC, CC, and SC, we use dlib [26] to predict 68 keypoints in each face image (see Fig. 3c) and weight each patch embedding by the density of the keypoints inside the patch area. Our LMK weight distribution appears Gaussian-like, with the peak often right below the nose (Fig. 3c).
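The sketch below illustrates the Uniform, APC, and SC schemes as defined above (CC differs from APC only in the vector used to summarize the other image). The final normalization to a total weight of 1 follows Sec. 2; the non-negativity clamp in SC and the small epsilon guard are our own defensive assumptions.

```python
import torch

def uniform_weights(q_feats):
    # Eq. (6): equal weight for every patch; weights sum to 1.
    N = q_feats.shape[0]
    return torch.full((N,), 1.0 / N)

def apc_weights(q_feats, g_feats):
    # Eq. (7): correlation between each query patch and the average-pooled
    # embedding of the other image; max(0, .) keeps weights non-negative.
    g_avg = g_feats.mean(dim=0)
    w = torch.clamp(q_feats @ g_avg, min=0)
    return w / w.sum().clamp(min=1e-8)     # normalize to total weight 1 (Sec. 2)

def sc_weights(q_feats, g_feats):
    # Eq. (8): sum of cosine similarities between a query patch and every
    # patch of the other image.
    q = torch.nn.functional.normalize(q_feats, dim=1)
    g = torch.nn.functional.normalize(g_feats, dim=1)
    w = torch.clamp((q @ g.t()).sum(dim=1), min=0)  # clamp is our assumption
    return w / w.sum().clamp(min=1e-8)
```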
We perform three ablation studies to rigorously evaluate the key design choices in our 2-stage FI approach: (1) which feature-weighting technique to use (Sec. 3.1); (2) re-ranking using both EMD and cosine distance (Sec. 3.2); and (3) comparing patches vs. images in Stage 1 (Sec. 3.3).

Experiment: For all three studies, we use ArcFace to perform FI on both LFW [65] and LFW-crop. For LFW, we take all 1,680 people who have ≥ 2 images, for a total of 9,164 images. When taking each image as a query, we search in a gallery of the remaining 9,163 images. For the experiments with LFW-crop, we use all 13,233 original LFW images as the gallery. To create a query set of 13,233 cropped images, we clone the gallery, randomly crop each image to 70% of its area, and upsample it back to the original size of 128×128 (see examples in Fig. 5d). That is, LFW-crop tests identifying a cropped (i.e. close-up and misaligned) image given the unchanged LFW gallery. The LFW and LFW-crop tests thus offer contrasting insights (ID vs. OOD). In Stage 2, i.e. re-ranking the top-k candidates, we test different values of k ∈ {100, 200, 300} and do not find the performance to change substantially. At k = 100, our 2-stage precision is already close to the maximum precision of 99.88 under a perfect re-ranking (see Tab. 1a; Max prec.).

Here, we evaluate the precision of our 2-stage FI as we sweep across five different feature-weighting techniques and two grid sizes (8×8 and 4×4). In an 8×8 grid, we observe that some facial features, such as the eyes, are often split in half across two patches (see Fig. S5), which may impair the patch-wise similarity. Therefore, for each weighting technique, we also test average-pooling the 8×8 grid into 4×4 and performing EMD on the resultant 16 patches.

Results: First, we find that, on LFW, our image-similarity-based techniques (APC, SC) outperform the LMK baseline (Tab. 1a) despite not using landmarks in the weighting process, verifying the effectiveness of adaptive, similarity-based weighting schemes. Second, interestingly, in FI we find that Uniform, APC, and SC all outperform the CC weighting proposed in [66, 71]. This is in stark contrast to the finding in [71] that CC is better than Uniform (perhaps because face images do not have background noise and are close-up). Furthermore, using the global average-pooling vector (APC) yields substantially more useful spatial similarity than the last-linear-layer output used in the CC implementation (Tab. 1b; 96.16 vs. 91.31 P@1). Third, surprisingly, even though a patch in an 8×8 grid does not enclose an entire, fully-visible facial feature (e.g. an eye), all feature-weighting methods are on par with or better on an 8×8 grid than on a 4×4 grid (e.g. Tab. 1b; APC: 96.16 vs. 95.32). Note that the optimal flow visualized on a 4×4 grid is more interpretable to humans than that on an 8×8 grid (compare Fig. 1 vs. Fig. S5). Fourth, across all variants of feature weighting, our 2-stage approach consistently and substantially outperforms the traditional Stage 1 alone on LFW-crop, suggesting its robust effectiveness in handling OOD queries. Fifth, under a perfect re-ranking of the top-k candidates (where k = 100), there is only 1.4% headroom for improvement upon Stage 1 alone.

We observe that for some images, re-ranking using patch-wise similarity at Stage 2 does not help but instead hurts the accuracy. Here, we test whether linearly combining EMD (at the patch-level embeddings, as in Stage 2) and cosine distance (at the image-level embeddings, as in Stage 1) may improve re-ranking accuracy further (vs. EMD alone).

Experiment: We use the grid size of 8×8, i.e. the better setting from the previous ablation study (Sec. 3.1). For each pair of images, we linearly combine their patch-level EMD ($\theta_{\mathrm{EMD}}$) and the image-level cosine distance ($\theta_{\mathrm{Cosine}}$) as:

  $\theta = \alpha \, \theta_{\mathrm{EMD}} + (1 - \alpha) \, \theta_{\mathrm{Cosine}}$   (9)
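A minimal sketch of re-ranking a Stage-1 shortlist with Eq. (9) follows; the array names are illustrative, not the released code.

```python
import numpy as np

def rerank_top_k(theta_cosine, theta_emd, alpha=0.7):
    """Eq. (9): mix patch-level EMD with the image-level cosine distance and
    return candidate indices sorted by the combined distance (best first).
    alpha = 1 recovers pure EMD; alpha = 0 recovers the Stage-1 ranking."""
    combined = alpha * np.asarray(theta_emd) + (1 - alpha) * np.asarray(theta_cosine)
    return np.argsort(combined)   # ascending: smaller distance = better match
```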
Sweeping across α ∈ {0, 0.3, 0.5, 0.7, 1}, we find that changing α has only a marginal effect on P@1 on LFW. That is, P@1 stays within [95, 98.5], with the lowest accuracy of 95 occurring when EMD is used exclusively, i.e. α = 1 (see Fig. 4a). In contrast, for LFW-crop, we find the accuracy to increase monotonically as we increase α (Fig. 4b). That is, the higher the contribution of the patch-wise similarity, the better the re-ranking accuracy on the challenging randomly-cropped queries. We choose α = 0.7 as the best and default choice for all subsequent FI experiments. Interestingly, our proposed distance (Eq. 9) also yields a state-of-the-art face verification result on MLFW [59] (Sec. S4).

Given that re-ranking using EMD in the patch-embedding space substantially improves the precision of FI compared to Stage 1 alone (Tab. 1), here we test performing such patch-wise EMD sorting at Stage 1 instead of Stage 2.

Experiment: That is, we test ranking images using EMD at the patch level instead of the standard cosine distance at the image level. Performing patch-wise EMD at Stage 1 is significantly slower than our 2-stage approach, e.g., ∼12 times slower (729.20s vs. 60.97s, in total, for 13,233 queries). That is, Sinkhorn is a slow, iterative optimization method, and the EMD at Stage 2 has to sort only k = 100 (instead of 13,233) images. In addition, FI by comparing images patch-wise using EMD at Stage 1 yields consistently worse accuracy than our 2-stage method under all feature-weighting techniques (see Tab. S1 for details).

To demonstrate the generality and effectiveness of our 2-stage FI, we take the best hyperparameter settings (α = 0.7; APC) from the ablation studies (Sec. 3) and use them for three different models (ArcFace [19], CosFace [61], and FaceNet [47]), which have different grid sizes. We test the three models on five different OOD query types: (1) faces wearing masks; (2) faces wearing sunglasses; (3) randomly cropped faces; (4) profile (rotated) faces; and (5) adversarially perturbed faces.

Experiment: We perform our 2-stage FI on three datasets: CFP [48], CALFW [72], and AgeDB [37]. The 12,173-image CALFW and the 16,488-image AgeDB contain age-varying images of 4,025 and 568 identities, respectively. CFP has 500 people, each having 14 images (10 frontal and 4 profile). To test our models on challenging OOD queries, we use the 2,000 profile faces in CFP as queries and its 5,000 frontal faces as the gallery. To create OOD queries using CFP (we only apply masks and sunglasses to the frontal images of CFP), CALFW, and AgeDB, we automatically occlude all images with masks and sunglasses by detecting the landmarks of the eyes and mouth using dlib and overlaying black sunglasses or a mask on the faces (see examples in Fig. 1; a hedged sketch of this overlay procedure is given below). We also take these three datasets and create randomly cropped queries (as for LFW-crop in Sec. 3). For all datasets, we test identifying occluded query faces given the original, unmodified gallery. That is, for every query, there is ≥ 1 matching gallery image.
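Below is a hedged sketch of such an overlay using dlib's 68-point landmark predictor (standard iBUG-68 indexing: points 36-47 cover the eyes, 48-67 the mouth). For simplicity, it draws plain black boxes; the paper overlays actual sunglasses/mask graphics, and the padding value here is an arbitrary choice of ours.

```python
import dlib
import numpy as np
from PIL import Image, ImageDraw

# Standard dlib models; the .dat predictor file must be downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def occlude(img_path, mode="sunglasses"):
    """Overlay a black box over the eyes ('sunglasses') or the lower face
    ('mask'), located via dlib's 68-point facial landmarks."""
    img = Image.open(img_path).convert("RGB")
    arr = np.array(img)
    rects = detector(arr, 1)
    if not rects:
        return img                          # no face found; leave unchanged
    shape = predictor(arr, rects[0])
    pts = np.array([(p.x, p.y) for p in shape.parts()])
    region = pts[36:48] if mode == "sunglasses" else pts[48:68]
    x0, y0 = region.min(axis=0)
    x1, y1 = region.max(axis=0)
    pad = 10                                # margin so the box covers the feature
    ImageDraw.Draw(img).rectangle(
        [int(x0) - pad, int(y0) - pad, int(x1) + pad, int(y1) + pad],
        fill=(0, 0, 0))
    return img
```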
Results: First, for all three models and all occlusion types, i.e. masks, sunglasses, crops, and self-occlusion (profile queries in CFP), our method consistently outperforms the traditional Stage-1-alone approach under all three precision metrics (Tables 2, S8, & S4). Second, across all three datasets, we find that the largest improvement our Stage 2 provides over Stage 1 alone occurs when the queries are randomly cropped or masked (Tab. 2). In some cases, Stage 1 alone using cosine distance is not able to retrieve any relevant examples among the top-5, but our re-ranking manages to push three relevant faces into the top-5 (Fig. 5d). Third, we observe that for faces with masks or sunglasses, APC interestingly often excludes the mouth or eye regions of the fully-visible gallery faces when computing the patch-wise EMD similarity with the corresponding occluded query (Fig. 3). The same observation can be seen in the visualizations of the most similar patch pairs, i.e. highest flow, for our 2-stage approach using either 4×4 grids (Fig. 5 and Fig. 1) or 8×8 grids (Fig. S5).

Adversarial examples pose a huge challenge and a serious security threat to computer vision systems [28, 39], including FI [50, 73]. Recent research suggests that the patch representation may be the key behind the impressive robustness of ViTs to adversarial images [9, 35, 49]. Motivated by these findings, we test our 2-stage FI on TALFW [73] queries given the original 13,233-image LFW gallery.

Table 3. Our re-ranking (8×8 grid; APC) consistently improves the precision over Stage 1 alone (ST1) when identifying adversarial TALFW [73] images given an in-distribution LFW [65] gallery. The conclusions also carry over to other feature-weighting methods (more results in Tab. S5).

Experiment: TALFW contains 4,069 LFW images perturbed adversarially to cause face verifiers to mislabel them [73].

Results: Over the entire TALFW query set, we find our re-ranking to consistently outperform Stage 1 alone under all three metrics (Tab. 3). Interestingly, the improvement (∼2 to 4 points under P@1 for the three models) is larger than when tested on the original LFW queries (around 0.12 in Tab. 1a), verifying the robustness of our patch-based re-ranking when queries are perturbed with very small noise. That is, our approach can improve FI precision when the perturbation size is either small (adversarial) or large (e.g. masks).

While our approach does not involve re-training, a common technique for improving FI robustness to occlusion is data augmentation, i.e. re-training the models on occluded data in addition to the original data. Here, we compare our method with data augmentation on masked images.

Experiment: To generate augmented, masked images, we follow [10] to overlay various types of masks on CASIA images, generating ∼415K masked images. We add these images to the original CASIA training set, resulting in a total of ∼907K images (10,575 identities). We finetune ArcFace on this dataset with the same original hyperparameters [6] (see Sec. S2). We train three models and report the mean and standard deviation (Tab. 4). For a fair comparison, we evaluate the finetuned models and our no-training approach on the MLFW dataset [59] instead of our self-created masked datasets. That is, the query set has 11,959 MLFW masked-face images and the gallery is the entire 13,233-image LFW.

Table 4. Our 2-stage approach (b) using ArcFace (8×8 grid; APC) substantially outperforms Stage 1 alone (a) on identifying masked images of MLFW given the unmasked gallery of LFW. Interestingly, our method (b) also outperforms Stage 1 alone when ArcFace has been finetuned on masked images (c). In (c), we report the mean and std over three finetuned models.
Face Identification under Occlusion: Partial occlusion presents a significant, ill-posed challenge to face identification, as the AI has to rely only on incomplete or noisy facial features to make decisions [44]. Most prior methods propose to improve FI robustness by augmenting the training set of deep feature extractors with partially-occluded faces [23, 41, 57, 59, 63]. Training on augmented, occluded data encourages models to rely more on local, discriminative facial features [41]; however, it does not prevent FI models from misbehaving on new OOD occlusion types, especially under adversarial scenarios [50]. In contrast, our approach (1) does not require re-training or data augmentation; and (2) harnesses both image-level features (Stage 1) and local, patch-level features (Stage 2) for FI.

A common alternative is to learn to generate a spatial feature mask [36, 44, 52, 58] or an attention map [63] to exclude the occluded (i.e. uninformative or noisy) regions of the input image from the face-matching process. Motivated by these works, we tested five methods for inferring the importance of each image patch (Sec. 3) for the EMD computation. Early works used hand-crafted features and obtained limited accuracy [32, 36, 40]. Later attempts took advantage of deep architectures but require a separate occlusion detector [52] or a masking subnetwork in a custom architecture trained end-to-end [44, 58]. In contrast, we directly leverage the pre-trained, state-of-the-art image embeddings (of ArcFace, CosFace, & FaceNet) and EMD to exclude the occluded regions of an input image without any architectural modifications or re-training.

Another approach is to predict the occluded pixels and then perform FI on the recovered images [25, 31, 62, 64, 70, 75]. Yet, how to recover a non-occluded face while preserving the true identity remains a challenge for state-of-the-art GAN-based de-occlusion methods [13, 20, 22].

Re-ranking in Face Identification: Re-ranking is a popular 2-stage method for refining image retrieval results [69] in many domains, e.g. person re-identification [46], localization [51], or web image search [17]. In FI, Zhou et al. [74] used hand-crafted patch-level features to encode an image for ranking and then used multiple reference images in the database to re-rank each top-k candidate. The social context between two identities has also been found useful for re-ranking photo-tagging results [12]. Swearingen et al. [55] found that harnessing an external "disambiguator" network, trained to separate a query from lookalikes, is an effective re-ranking method. In contrast to the prior work, we do not use extra images [74] or external knowledge [12]. Compared to prior face re-ranking [21, 42], our method is the first to re-rank candidates based on a pair-wise similarity score computed from both the image-level and patch-level similarity of state-of-the-art deep facial features.

EMD for Image Retrieval: While EMD is a well-known metric in image retrieval [45], its applications to the deep convolutional features of images have been relatively underexplored. Zhang et al. [66, 67] recently found that classifying fine-grained images (of dogs, birds, and cars) by comparing them patch-wise using EMD in a deep feature space improves few-shot fine-grained classification accuracy. Yet, their success has been limited to few-shot, 5-way and 10-way classification with smaller networks (ResNet-12 [24]). In contrast, here we demonstrate a substantial improvement in FI using EMD without re-training the feature extractors. Concurrent to our work, Zhao et al. [71] propose DIML, which exhibits a consistent improvement of ∼2-3% in image retrieval on images of birds, cars, and products by using the sum of cosine distance and EMD as a "structural similarity" score for ranking. They found that CC is more effective than assigning uniform weights to image patches [71].
Interestingly, via a rigorous study of different feature-weighting techniques, we find novel insights specific to FI: uniform weighting is more effective than CC. Unlike prior EMD works [60, 66, 67, 71], ours is the first to show the significant effectiveness of EMD (1) on occluded and adversarial OOD images; and (2) for face identification.

Limitations: Solving patch-wise EMD via Sinkhorn is slow, which may prohibit its use for sorting much larger image sets (see the run-time reports in Tab. S1). Furthermore, here we used EMD on two distributions of equal total weight; however, the algorithm can also be used in unequal-weight cases [16, 45], which may be beneficial for handling occlusions. While substantially improving FI accuracy under the four occlusion types (i.e., masks, sunglasses, random crops, and adversarial images), re-ranking is only marginally better than Stage 1 alone on ID and profile faces, which would be interesting to understand more deeply in future research. Instead of using pre-trained models, it might also be interesting to re-train new models explicitly on patch-wise correspondence tasks, which may yield better patch embeddings for our re-ranking.

In sum, we propose DeepFace-EMD, a 2-stage approach for comparing images hierarchically: first at the image level and then at the patch level. DeepFace-EMD shows impressive robustness to occluded and adversarial faces and can be easily integrated into existing FI systems in the wild.

We describe here the hyperparameters used for finetuning ArcFace on our CASIA dataset augmented with masked images (see Fig. S6 for some samples):
• Training on 907,459 facial images (masked and non-masked).
• Number of epochs: 12.
• Optimizer: SGD.

We use the same visualization technique as in DeepEMD to generate the flow visualization showing the correspondence between two images (see the flow visualizations in Fig. 1 or Fig. S2). Given a pair of embeddings from the query and gallery images, EMD computes the optimal flows (see Eq. (1) for details). That is, given an 8×8 grid, a given patch embedding $q_i$ in the query has 64 flow values $\{f_{ij}\}$, where j ∈ {1, 2, ..., 64}. In the location of patch $q_i$ in the query image, we show the corresponding highest-flow patch $g_k$, i.e. k is the index of the gallery patch with the highest flow, $f_{i,k} = \max(f_{i,1}, f_{i,2}, ..., f_{i,64})$. For display, we normalize the flow value $f_{i,k}$ over all 64 such values (one per patch i ∈ {1, 2, ..., 64}). See Fig. S4, Fig. S5, and Fig. 5 for example flow visualizations.
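A short sketch of this display step, assuming the optimal flow matrix F from Eq. (1), is given below; since the exact normalization formula did not survive extraction, the min-max scaling here is our assumption, not necessarily the paper's.

```python
import torch

def flow_visualization_indices(F):
    """Given the optimal flow F (N x N) between query and gallery patches,
    pick, for each query patch i, the gallery patch k with the highest flow
    f_{i,k}. Returns (indices, strengths); strengths are min-max normalized
    over all N values for display (our assumption)."""
    strengths, indices = F.max(dim=1)      # f_{i,k} = max_j f_{i,j}
    s = (strengths - strengths.min()) / (strengths.max() - strengths.min() + 1e-8)
    return indices, s
```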
Figure S1. The feature-weighting heatmaps using SC, APC, and LMK for random pairs of faces across three input types (normal faces, and faces with masks or sunglasses). Here, we use ArcFace [19] and a 4×4 grid (average-pooled from 8×8). SC heatmaps often cover the entire face, including the occluded region. APC tends to assign low importance to the occlusion and to the corresponding region in the unoccluded image (see blue areas in APC). LMK results in a heatmap that covers the middle area of a face. Best viewed in color.

Figure S2. Given a pair of images, after the features are weighted (heatmaps; red corresponds to an importance weight of 1 and blue to 0), EMD computes an optimal matching or "transport" plan. The middle flow image shows the one-to-one correspondence following the format in [66] (see also the description in Sec. S3). That is, intuitively, the flow visualization shows the reconstruction of the left image using the nearest patches (i.e. highest flow) from the right image. Here, we use ArcFace and a 4×4 grid (i.e. computing the EMD between two sets of 16 patch embeddings).

Table S1. Comparison of performing patch-wise EMD ranking at Stage 1 vs. our proposed 2-stage FI approach (i.e. cosine-similarity ranking in Stage 1 and patch-wise EMD re-ranking in Stage 2). In both cases, EMD uses 8×8 patches. "EMD at Stage 1" uses EMD to rank images directly (instead of the regular cosine similarity), with no Stage 2 (re-ranking). For our method, we choose the same setup of α = 0.7. Our 2-stage approach not only outperforms using EMD at Stage 1 but is also ∼2-4× faster. The run time is the total for all 13,214 queries for both (a) and (b). The result supports our choice of performing EMD in Stage 2 instead of Stage 1.

In the main text, we find that DeepFace-EMD is effective in face identification given many types of OOD images. Here, we also evaluate DeepFace-EMD for face verification on MLFW [59], a recent benchmark that consists of masked LFW faces. As in the common verification setup for LFW [33, 47, 59], given pairs of face images and their similarity scores predicted by a verification system, we find the optimal threshold that yields the best accuracy (a minimal sketch of this threshold search is given below). We follow the setup in [59] to enable a fair comparison. First, we reproduce Table 3 of [59], which evaluates face verification accuracy on 6,000 pairs of MLFW images. Then, we run our DeepFace-EMD distance function (Eq. 9). We find that using our proposed distance consistently improves face verification for all three PyTorch models in [59]. Interestingly, with DeepFace-EMD, we obtain a state-of-the-art result (91.17%) on MLFW (see Tab. S6).

The reason we used the 3D-alignment pre-processing instead of the default MTCNN pre-processing [68] of the three models is that, for ArcFace, the 3D alignment actually results in better P@1, RP, and M@R for both our baselines and DeepFace-EMD (e.g. +3.35% on MLFW). For FaceNet, the 3D alignment does yield worse performance than MTCNN. However, we confirm that our conclusion, that DeepFace-EMD improves FI on the reported datasets, holds regardless of the pre-processing choice.
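A minimal sketch of this standard LFW-style threshold search (our illustration, not the evaluation code of [59]):

```python
import numpy as np

def best_threshold(distances, labels):
    """Given pair distances and same/different labels (1 = same identity),
    return the decision threshold with the highest verification accuracy.
    A pair is predicted 'same' when its distance falls below the threshold."""
    distances, labels = np.asarray(distances), np.asarray(labels)
    best_acc, best_t = 0.0, None
    for t in np.unique(distances):          # candidate thresholds
        acc = ((distances < t) == (labels == 1)).mean()
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc
```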
Figure S4. Traditional face identification ranks gallery images based on their cosine distance with the query (top row) at the image-level embedding, which yields large errors upon out-of-distribution changes in the input (e.g. masks or sunglasses; b-d). We find that re-ranking the top-k shortlisted faces from Stage 1 (leftmost column) using their patch-wise EMD similarity w.r.t. the query substantially improves the precision (Stage 2) on challenging cases (b-d). The "Flow" visualization (of 4×4) intuitively shows the patch-wise reconstruction of the query face using the most similar patches (i.e. highest flow) from the retrieved face.

Table S5. Our re-ranking consistently improves the precision over Stage 1 alone (ST1) when identifying adversarial TALFW [73] images given an in-distribution LFW [65] gallery. The conclusions also carry over to other feature-weighting methods and models (ArcFace, CosFace, FaceNet).

Table S6. Using our proposed similarity function consistently improves the face verification results on MLFW (i.e. OOD masked images) for models reported in Wang et al. [59]. We use the pre-trained models and code by [59]. Excerpt: MS1MV2, R100, CurricularFace [58]: 90.60%; + DeepFace-EMD: 91.17%.

Figure S5. Traditional face identification ranks gallery images based on their cosine distance with the query (top row) at the image-level embedding, which yields large errors upon out-of-distribution changes in the input (e.g. masks or sunglasses; b-d). We find that re-ranking the top-k shortlisted faces from Stage 1 (leftmost column) using their patch-wise EMD similarity w.r.t. the query substantially improves the precision (Stage 2) on challenging cases (b-d). The "flow" visualization (of 8×8) intuitively shows the patch-wise reconstruction of the query face using the most similar patches (i.e. highest flow) from the retrieved face.

Table S8. Our 2-stage approach based on ArcFace features (8×8 grid; APC) performs slightly better than the Stage-1-alone (ST1) baseline at P@1 when the query is a rotated face (i.e. profile faces from CFP [48]). See Tab. S4 for the results of occlusions on CFP.

References
• Facial recognition tech at Hartsfield-Jackson wins over most international Delta customers. Atlanta Business Chronicle.
• Flawed facial recognition leads to arrest and jail for New Jersey man. The New York Times.
• The new lawsuit that shows facial recognition is officially a civil rights issue. MIT Technology Review.
• The pandemic is testing the limits of face recognition. MIT Technology Review.
• Wrongfully arrested man sues Detroit police following false facial-recognition match. The Washington Post.
• OpenFace: A general-purpose face recognition library with mobile applications.
• Patches are all you need?
• Masked face recognition for secure authentication. arXiv.
• Faster than real-time facial alignment: A 3D spatial transformer network approach in unconstrained poses.
• Aiding face recognition with social context association rule based re-ranking.
• Semi-supervised natural face de-occlusion.
• VGGFace2: A dataset for recognising faces across pose and age.
• Finding color and shape patterns in images. Stanford University.
• Real time Google and Live image search re-ranking.
• Sinkhorn distances: Lightspeed computation of optimal transport.
• ArcFace: Additive angular margin loss for deep face recognition.
• Occlusion-aware GAN for face de-occlusion in the wild.
• Open-set face identification with index-of-max hashing by learning.
• Occluded face recognition in the wild by identity-diversity inpainting.
• Face synthesis for eyeglass-robust face recognition.
• Identity mappings in deep residual networks.
• A regularized correntropy framework for robust pattern recognition.
• Dlib-ml: A machine learning toolkit.
• Earth mover's distance pooling over Siamese LSTMs for automatic short answer grading.
• Adversarial examples in the physical world.
• From word embeddings to document distances.
• The earth mover's distance is the Mallows distance: Some insights from statistics.
• Structured sparse error coding for face recognition with occlusion.
• Nonparametric subspace analysis for face recognition.
• SphereFace: Deep hypersphere embedding for face recognition.
• A new measure of congruence: The earth mover's distance.
• On the robustness of vision transformers to adversarial examples.
• Improving the recognition of faces occluded by facial accessories.
• AgeDB: The first manually collected, in-the-wild age database.
• A metric learning reality check.
• Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.
• Occlusion invariant face recognition using selective local non-negative matrix factorization basis images. Image and Vision Computing.
• Increasing CNN robustness to occlusions by reducing filter support.
• Re-ranking high-dimensional deep local representation for NIR-VIS face recognition.
• Using earth mover's distance for audio clip retrieval.
• End2End occluded face recognition by masking corrupted features.
• The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision.
• A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking.
• FaceNet: A unified embedding for face recognition and clustering.
• Frontal to profile face verification in the wild.
• On the adversarial robustness of visual transformers.
• Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition.
• Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking.
• Occlusion robust face recognition based on mask learning with pairwise differential siamese network.
• Visualizing deep similarity networks.
• Deeply learned face representations are sparse, selective, and robust.
• Lookalike disambiguation: Improving face identification performance at top ranks.
• Inception-v4, Inception-ResNet and the impact of residual connections on learning.
• Enhancing convolutional neural networks for face recognition with occlusion maps and batch triplet loss.
• Occlusion robust face recognition based on mask learning.
• MLFW: A database for face recognition on masked faces.
• Supervised earth mover's distance learning and its computer vision applications.
• CosFace: Large margin cosine loss for deep face recognition.
• Robust face recognition via sparse representation.
• On improving the generalization of face recognition in the presence of occlusions.
• Robust sparse coding for face recognition.
• Learning face representation from scratch.
• DeepEMD: Differentiable earth mover's distance for few-shot learning.
• DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers.
• Joint face detection and alignment using multitask cascaded convolutional networks.
• Understanding image retrieval re-ranking: A graph neural network perspective.
• Robust LSTM-autoencoders for face de-occlusion in the wild.
• Towards interpretable deep metric learning with structural matching.
• Cross-age LFW: A database for studying cross-age face recognition in unconstrained environments.
• Towards transferable adversarial attacks against deep face recognition.
• AsArcFace: Asymmetric additive angular margin loss for fair face recognition.
• Face recognition with contiguous occlusion using Markov random fields.

Pre-trained models: We downloaded the three pre-trained PyTorch models of ArcFace, FaceNet, and CosFace from their publicly released repositories. These models were trained on CASIA Webface [65], VGGFace2 [14], and CASIA Webface, respectively.

Image-level embeddings for ranking: We use the following layers to extract the image embeddings for Stage 1, i.e., ranking images based on the cosine similarity between each pair of (query, gallery) embeddings:
• ArcFace: layer bn5 (see code), which outputs 512 dimensions.
• FaceNet: layer last_bn (see code), which outputs 512 dimensions.
• CosFace: layer fc (see code), which outputs 512 dimensions.

Patch-level embeddings for re-ranking: We use the following layers to extract the spatial feature maps (i.e. the embeddings {q_i}) for the patches:
• ArcFace: layer dropout (see code). The spatial dimensions are 8×8, 3×3, and 6×7 for ArcFace, FaceNet, and CosFace, respectively (Sec. 2).
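For reference, here is a generic PyTorch sketch of pulling such a spatial feature map out of a pre-trained model with a forward hook; the function and its interface are our own illustration, not the released extraction code, and the layer name must match the lists above (e.g. 'dropout' for ArcFace, giving an 8×8×512 map).

```python
import torch

def get_patch_embeddings(model, images, layer_name):
    """Grab the (B, C, H, W) feature map of a named layer via a forward hook
    and flatten it into (B, N, C) patch embeddings, N = H * W."""
    feats = {}
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(
        lambda module, inp, out: feats.update(out=out))
    with torch.no_grad():
        model(images)                       # hook captures the spatial map
    handle.remove()
    out = feats["out"]                      # (B, C, H, W)
    B, C, H, W = out.shape
    return out.permute(0, 2, 3, 1).reshape(B, H * W, C)
```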
Acknowledgments: We thank Qi Li, Peijie Chen, and Giang Nguyen for their feedback on the manuscript. We also thank Chi Zhang, Wenliang Zhao, and Chengrui Wang for releasing their DeepEMD, DIML, and MLFW code, respectively. AN was supported by NSF Grant No. 1850117 and a donation from the NaphCare Foundation.