title: Contrastive Attention Network with Dense Field Estimation for Face Completion authors: Ma, Xin; Zhou, Xiaoqiang; Huang, Huaibo; Jia, Gengyun; Chai, Zhenhua; Wei, Xiaolin date: 2021-12-20
* indicates the corresponding author. † Xin Ma and Xiaoqiang Zhou have contributed equally to this work.
Most modern face completion approaches adopt an autoencoder or its variants to restore missing regions in face images. Encoders are often utilized to learn powerful representations that play an important role in meeting the challenges of sophisticated learning tasks. Specifically, various kinds of masks often appear in face images in the wild, forming complex patterns, especially during the difficult period of COVID-19. It is hard for encoders to capture powerful representations under this complex situation. To address this challenge, we propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders. It can encode contextual semantics from full-resolution images and obtain more discriminative representations. To deal with geometric variations of face images, a dense correspondence field is integrated into the network. We further propose a multi-scale decoder with a novel dual attention fusion module (DAF), which can combine the restored and known regions in an adaptive manner. This multi-scale architecture helps the decoder translate the discriminative representations learned by the encoder into images. Extensive experiments clearly demonstrate that the proposed approach not only achieves more appealing results compared with state-of-the-art methods but also improves the performance of masked face recognition dramatically. Face completion (a.k.a. face inpainting or face hole-filling) aims at filling missing regions of a face image with plausible contents [7]. It is more difficult than general image inpainting because face images contain high-level identity information, pose variations, etc. Face completion is a fundamental low-level vision task and can be applied to many downstream applications, such as photo editing and face verification [76, 5, 64]. The target of face completion is to produce semantically meaningful content and reasonable structure information in missing areas. There have been many attempts at face completion, but they usually treat it as a general image inpainting problem. Traditional image inpainting methods [5, 23, 73] (e.g., PatchMatch) assume that the content to be filled comes from the background area. Therefore, they gradually synthesize plausible stationary contents by copying and pasting similar patches from known areas. The performance of these methods is satisfying when dealing with background inpainting tasks. But non-repetitive and complicated scenes, such as faces and objects, are the Waterloo of these traditional methods because of their limited ability to capture high-level semantics. Recently, deep convolutional neural networks (CNNs) have made great progress in many computer vision tasks [47, 35, 21, 46, 32, 56, 28]. Thus, many deep learning-based methods have been proposed. Benefiting from the powerful representation learning ability of CNNs, their performance has been significantly improved.
These approaches adopt autoencoder or its variant architectures jointly trained with generative adversarial networks (GANs) to hallucinate semantically plausible contents in missing regions [76, 72, 44] . But these methods still suffer from three problems: Firstly, various kinds of masks are often presented in face images in the wild, especially in this tough period of COVID-19, which greatly increases the difficulty of image inpainting. Previous image inpainting approaches usually train an encoder and a decoder jointly with some commonly-used loss functions (e.g., reconstruction loss, style loss, etc). But encoders still struggle to learn powerful representations from images with various kinds of masks. As a result, these CNN-based approaches will produce unsatisfactory results with obvious artifacts. A naive solution is to design a very deep network to obtain a large model capacity for learning powerful representations. However, it will increase the computational cost heavily and may not help to learn accurate latent representations. To cope with this limitation, we propose a selfsupervised Siamese inference network with contrastive learning. We assume that two identical images with different masks form a positive pair while a negative pair consists of two different images. Contrastive learning aims to maximize (minimize) the similarities of positive pairs (negative pairs) in a representation space. As explored in [27, 26] , contrastive learning can be regarded as training an encoder to perform a dictionary look-up task. An encoded 'query' should be matched with its corresponding 'key' (token) and different from others. The 'keys' (tokens) in the dictionary are usually sampled from images, patches, or other data types. In order to acquire a large and consistent dictionary, we design a queue dictionary and a momentum-updated key encoder. As demonstrated in MoCo [27] , the proposed selfsupervised inference network can learn good features from input images. Thus, the robustness and the accuracy of the encoder can be improved. Secondly, previous methods consider image inpainting as a conditional image generation task. The roles of the encoder and decoder are recognizing high-level semantic information and synthesizing low-level textures [74] , respectively. These approaches, e.g., PConv [44] and LBAM [72] , focus more on missing areas and synthesize realistic alternative contents by a well-designed architecture or some commonly-used loss functions. However, there are either obvious color contrasts or artificial edge responses, especially in the boundaries of results produced by these methods since they ignore the structural consistency. In fact, the development of biology has revealed that the human visual system is more sensitive to the topological distinction [13] . Therefore, we focus not only on the structural continuity of restored images surrounding holes but also on generating texture-rich images. To properly suppress color discrepancy and artifacts in boundaries, we propose a novel dual attention fusion module (DAF) to synthesize pixel-wise smooth contents, which can be inserted into autoencoder architectures in a plug-andplay way. The core idea of the fusion module is to calculate the similarity between the synthesized content and the known region. Some methods are proposed to address this problem, such as DFNet [29] and Perez's method [57] . 
However, these methods lack flexibility in handling different information types (e.g., different semantics), hindering learning more discriminative representations. Our proposed DAF is developed to adaptively recalibrate channel-wise features by taking interdependencies between channels into account and force CNNs to focus more on unknown regions. DAF will predict an adaptive spatial attention map to blend restored contents and original images naturally. Finally, the verification performance heavily relies on the pixel level similarity and feature level similarity according to [83] , which means that the geometric information of the output results should be similar to the input. In practice, face appearance will be influenced by a number of factors such as meshes, wearing masks [43, 83, 10] and so on. Masks can significantly destroy the facial shape and geometric information, greatly increasing the difficulty of generating visually appealing results. Therefore, it inevitably leads to a sharp decline in face verification performance. For example, healthcare workers must wear sanitary masks to avoid infection of diseases, and they will fail to pass through the face verification system. In this paper, we assume that the geometric information of the input face image should be kept intact. Inspired by recent advances in 3D face analysis [2, 1] , a dense correspondence field estimation is integrated into our network since it contains the complete geometric information of the input face. For simplicity, instead of using another network to predict the dense correspondence field separately, we make our decoder simultaneously predict the dense correspondence field and feature maps at multi-scales. Thus, we subtly employ a 3D supervision for our network provided by the dense correspondence field. Under this 3D geometric supervision, our network can generate inpainting results with reasonable structure information. Qualitative and quantitative experiments are conducted on multiple datasets to evaluate our proposed method. The experimental results demonstrate that our proposed method not only outperforms state-of-the-art methods in generating high-quality inpainting results but also improves the performance of masked face recognition dramatically. This paper is an extension of our previous conference publication [48] . We extend it in three folds: 1) A dense correspondence field is proposed to be integrated into our network for utilizing 3D prior information of human faces. It can help our network to retain the facial shape and appearance information from the input. 2) We mainly concentrate on face image completion rather than other types of images. We add an extra face dataset, Flickr-Faces-HQ (FFHQ) [38] , to demonstrate the effectiveness of our method. 3) We conduct an identity verification evaluation for face completion. It clearly shows the advantage of the proposed method compared with state-of-the-art methods. To sum up, the main contributions of this paper are as follows: • We propose a Siamese inference network based on contrastive learning for face completion. It helps to improve the robustness and accuracy of representation learning for complex mask patterns. • We propose a novel dual attention fusion module that can explore feature interdependencies in spatial and channel dimensions and blend features in missing regions and known regions naturally. Smooth contents with rich texture information can be naturally synthesized. 
• To keep structural information of the input intact, the dense correspondence field that binds 2D and 3D surface spaces is estimated in our network, which can preserve the expression and pose of the input. • Our proposed method achieves smooth inpainting results with rich texture and reasonable topological structural information on three standard datasets against state-of-the-art methods, and also greatly improves the performance of face verification. Image inpainting aims to generate alternative contents when a given image is partially occluded or corrupt. Early traditional image inpainting methods are mainly diffusionbased [7] or patch-based [5] . They often use the information of the pixels (or image patches) around the occluded area to fill the missing regions. Bertalmio et al. [7] proposed an algorithm to fill missing regions with information surrounding them automatically based on the principle that isophote lines arriving at the boundaries of the regions are completed inside. Barnes et a. [5] presented a fast nearest neighbor searching algorithm named PatchMatch, to search and paste the most similar image patches from the known regions. These methods utilize low-level image features to guide the feature propagation from known image backgrounds or image datasets to corrupted regions. Criminisi et al. [17] proposed an efficient algorithm, which combined the advantages of 'texture synthesis' techniques and 'inpainting' techniques. Specifically, they designed a bestfirst method to find the most similar patches and used them to recover the corrupted regions gradually. These methods work well when holes are small and narrow, or there are plausible matching patches in uncorrupted regions. However, when suffering from complicated scenes, it is difficult for these approaches to produce semantically plausible solutions, due to a lack of semantic understanding of images. Nowadays, deep learning techniques have made great contributions to computer vision communities. In order to accurately recover corrupted images, many methods adopt deep convolutional neural networks (CNNs) [63, 20] , especially generative adversarial networks (GANs) [24] in image inpainting. Pathak et al. [54] formulates image inpainting as a conditional image generation problem. Then, they proposed a Context Encoder to recover corrupted regions according to surrounding pixels. Iizuka et al. [34] utilized two discriminators to improve the quality of the generated images at different scales, facilitating both globally and locally consistent image completion. At the same time, some approaches designed a coarse-to-fine framework to solve the sub-problem of image inpainting in different stages [75, 50, 58] . Nazeri et al. [50] proposed to firstly recover the edge map of the corrupted image, then generate image textures in the second stage. Ren et al. [58] proposed a method in which a structure reconstructor was employed to generate the missing structures of the inputs while a texture generator yielded image details. Zhang et al. [79] proposed an iterative inpainting approach that contained a corresponding confidence map in results. They used this map as feedback and recovered holes by trusting high-confidence pixels. As a branch of image inpainting, face completion is different from general image inpainting since its target mainly focuses on restoring the topological structure and texture of the face input. Zhang et al. 
[83] argued that verification performance relies on both pixel-level similarity and feature-level similarity. Therefore, they proposed a feature-oriented blind face inpainting framework. Cai et al. [11] proposed a method named FCSR-GAN to perform face completion and face super-resolution by multi-task learning, where the generator was required to generate a high-resolution face image without occlusion from an occluded low-resolution face image. Zhou et al. [85] argued that previous works overlooked the serious impact of inaccurate attention scores. Thus, they integrated an oracle supervision signal into the attention module to produce reasonable attention scores.
Unsupervised learning has recently shown great potential for learning powerful image representations [27, 80, 15]. Compared with supervised learning, unsupervised learning utilizes unlabeled data to learn representations, an idea that dates back at least to the work of Becker and Hinton [6]. Dosovitskiy et al. [22] proposed to discriminate between a set of surrogate classes generated by applying a number of transformations. Wu et al. [71] treated instance-level discrimination as a metric learning problem; a discrete memory bank was then utilized to store the features of each instance. Zhuang et al. [88] maximized a dynamic aggregation metric, which can move similar data instances together in the embedding space and separate dissimilar instances. He et al. [27] proposed a dynamic dictionary consisting of a queue and a moving-averaged encoder from the perspective of contrastive learning, and they called this method MoCo. At the same time, Chen et al. [15] also presented a simple framework for contrastive learning of visual representations (SimCLR). Technically, they simplified recent contrastive learning-based algorithms and did not require specific structures or memory banks. Unsupervised learning strategies have also been used in many other computer vision tasks recently. Mustikovela et al. [49] used self-supervised learning for viewpoint estimation by making use of generative consistency and symmetry constraints. Zhan et al. [81] utilized a mask completion network to predict occlusion ordering with a self-supervised learning strategy.
Fig. 1: The self-supervised Siamese inference network consists of encoders E_q and E_k. This inference network encodes the new key representations on-the-fly by using the momentum-updated encoder E_k. We insert the dual attention fusion module into several decoder layers, forming a multi-scale decoder. We allow the decoder to estimate the dense correspondence field and the feature maps that are used for the DAF module at multiple scales simultaneously. The inference network is firstly trained with contrastive learning. Then the pre-trained encoder E_q and the decoder are jointly trained with the fusion module.
The attention mechanism is a hot topic in computer vision and has been widely investigated in many works [65, 14, 51, 19]. The widely-used attention mechanisms for image inpainting can be coarsely divided into two categories: spatial attention [65] and channel attention [30]. Yu et al. [75] argued that convolutional neural networks lack the ability to borrow or copy information from distant places, which leads to blurry textures in generated images. Thus, they proposed a contextual attention module to calculate the spatial attention scores between pixels in the corrupted region and the known region. Hong et al.
[29] proposed a fusion block to generate an adaptive spatial attention map α to combine features in the corrupted region and the known region. In this paper, we investigate both spatial attention and channel attention mechanisms to further improve the performance of face completion.
Nowadays, the famous 3DMM [8] is widely used to express facial shape and appearance information for face-related tasks, such as facial attribute editing, face hallucination, etc. [42, 61]. Roth et al. [59] proposed a photometric stereo-based method for unconstrained 3D face reconstruction, which benefited from a combination of landmark constraints and photometric stereo-based normals. Yin et al. [22] proposed a generative adversarial network combined with 3DMM, termed FF-GAN, to provide shape and appearance priors without requiring large training data. 2DASL [62] utilized 2D face images with noisy landmark information in the wild to assist 3D face model learning. Establishing a dense correspondence field between the 2D and 3D spaces has become a popular approach. Güler et al. [2, 1, 12] proposed a UV correspondence field to build pixel-wise correspondence between the RGB color space and the 3D surface space. These works show that the UV correspondence field can retain the geometric information of the human face.
In this section, we first present our self-supervised Siamese inference network. Subsequently, the details of the dual attention fusion (DAF) module, the dense correspondence estimation, and the learning objectives of our method are provided. The overall framework of our face completion method is shown in Fig. 1.
Our proposed self-supervised Siamese inference network consists of two identical encoders that do not share parameters [27, 26, 66], denoted as E_q and E_k, respectively. The proposed inference network is trained with contrastive learning, which can be viewed as training an encoder to perform a dictionary look-up task: a 'query' encoded by E_q should be similar to its corresponding 'key' (i.e., positive key) represented by the other encoder E_k and dissimilar to all others (i.e., negative keys). Two copies of the same image with different masks are required for the proposed inference network, denoted x_q and x_k, respectively. Thus, we can obtain a query representation z_q = E_q(x_q) and a key representation z_k = E_k(x_k). Following many previous self-supervised works [88, 4], the contrastive loss is utilized as the self-supervised objective function for training the proposed inference network and can be written as:
\mathcal{L}_q = -\log \frac{\exp(z_q \cdot z_k^{+}/\tau)}{\sum_{i=0}^{K} \exp(z_q \cdot z_{k_i}/\tau)},
where τ is the temperature hyper-parameter; the loss degrades into the original softmax when τ is equal to 1, and the output becomes less sparse as τ increases [16]. τ is set to 0.07 for an efficient training process in this work. Specifically, this loss, also known as the InfoNCE loss [27, 26], tries to classify z_q as z_k^+. Here, z_q and z_k^+ are encoded from a positive pair, and K is the number of negative samples. High-dimensional continuous images can thus be projected into a discrete dictionary by contrastive learning. There are three general mechanisms for implementing contrastive learning (i.e., end-to-end training [26], memory bank [71], and momentum updating [27]), whose main differences lie in how the keys are maintained and how the key encoder is updated. Considering GPU memory size and powerful feature learning, we follow MoCo [27] and design a consistent dictionary implemented as a queue; a minimal sketch of this training step is given below.
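For concreteness, the following PyTorch-style sketch illustrates one training step of the dictionary look-up task described above, combining the InfoNCE loss, the queue dictionary, and the momentum update of E_k. It is only an illustration under the stated hyper-parameters (τ = 0.07, queue size 65536, m = 0.9); the module names (enc_q, enc_k) and the simple queue handling are our own assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

TAU = 0.07     # temperature of the contrastive loss
M = 0.9        # momentum coefficient for the key encoder
K = 65536      # length of the queue dictionary

@torch.no_grad()
def momentum_update(enc_q, enc_k, m=M):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(enc_q.parameters(), enc_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def contrastive_step(enc_q, enc_k, queue, x_q, x_k):
    """x_q, x_k: the same images under two different masks (a positive pair).
    queue: (C, K) tensor holding the negative keys; returns (loss, new queue)."""
    z_q = F.normalize(enc_q(x_q), dim=1)                 # queries, shape (N, C)
    with torch.no_grad():
        momentum_update(enc_q, enc_k)                    # update E_k before encoding keys
        z_k = F.normalize(enc_k(x_k), dim=1)             # positive keys, shape (N, C)

    l_pos = torch.einsum("nc,nc->n", z_q, z_k).unsqueeze(-1)   # (N, 1)
    l_neg = torch.einsum("nc,ck->nk", z_q, queue)              # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / TAU
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits, labels)               # InfoNCE: positive is class 0

    # enqueue the newest keys and drop the oldest ones
    queue = torch.cat([z_k.t().detach(), queue], dim=1)[:, :K]
    return loss, queue
```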
The key representations of the current batch are enqueued into the dictionary, while the oldest representations are dequeued progressively. The length of the queue is controllable, which enables the dictionary to contain a large number of negative pairs, and such a dictionary with large-scale negative pairs facilitates representation learning. We set the length of the queue to 65536 in this work. It is worth noting that the encoder E_k is updated by a momentum strategy instead of direct back-propagation. The main reason is that it is difficult to propagate gradients to all keys in the queue. The updating process of E_k can be formulated as follows:
\theta_k \leftarrow m \theta_k + (1 - m) \theta_q,
where θ_q and θ_k denote the parameters of E_q and E_k, respectively. θ_q is updated by back-propagation. m ∈ [0, 1) is the momentum coefficient hyper-parameter and is set to 0.9 in this paper. The momentum-update mechanism makes the encoder E_k update smoothly relative to E_q, resulting in a more consistent discrete dictionary.
We now give more details about our proposed dual attention fusion module (see Fig. 2), which contains a channel attention mechanism and a spatial attention mechanism. This fusion module is embedded into the last several layers of the decoder and outputs face completion results at multi-scale resolutions [37]. Thus, constraints can be imposed on multi-scale outputs for high-quality results.
Previous CNN-based image inpainting approaches treat channel-wise features equally, which hinders the representation learning ability of the network. Meanwhile, high-level and interrelated channel features can be considered as specific class responses. For more discriminative representations, we first build a channel attention module in our proposed fusion module. As shown in Fig. 2, let a feature map F = [f_1, · · · , f_c, · · · , f_C] be one of the inputs of the fusion module, whose channel index is c and spatial size is h × w. The channel descriptor can be acquired from the channel-wise global spatial information by global average pooling. We obtain the channel-wise statistics z ∈ R^C by shrinking F:
z_c = H_{GP}(f_c) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} f_c(i, j),
where z_c is the c-th element of z, f_c(i, j) is the value at position (i, j) of the c-th feature map f_c, and H_GP denotes the global pooling function. In order to fully explore the channel-wise dependencies of the aggregated information, we introduce a gating mechanism. As illustrated in [30, 84], the sigmoid function can be used as a gating function:
\omega = \sigma(W_U \, \delta(W_D \, z)),
where σ(·) and δ(·) are the sigmoid gating and ReLU functions, respectively, and W_D and W_U are the weight sets of convolutional layers that set the channel number to C/r and C, respectively. Finally, the channel statistics ω are acquired and used to rescale the input f_c:
\hat{f}_c = \omega_c \cdot f_c,
where ω_c and f_c are the scaling factor and feature map of the c-th channel, respectively.
Long-range contextual information is essential for discriminative feature representations. We therefore propose a spatial attention module that forms the final part of the proposed fusion module. Given an input image with a mask x_q, we first obtain a resized version \bar{x}_q that matches the size of the re-scaled feature map \hat{F} ∈ R^{C×h×w}:
\bar{x}_q = W_C(\downarrow x_q),
where W_C and ↓ are the weight set of a 1 × 1 convolutional layer and a downsampling module, respectively. Then the adaptive spatial attention map α ∈ R^{C×h×w} is given by
\alpha = f(A([W_K \hat{F}, \bar{x}_q])),
where W_K is the weight set of a 1 × 1 convolutional layer that sets the channel number of \hat{F} to be the same as that of \bar{x}_q, and A is a learnable transformation function implemented by three 3 × 3 convolutional layers.
W_K \hat{F} and \bar{x}_q are first concatenated and then fed into these convolutional layers, and f(·) is the sigmoid function that makes α behave as an attention map. The final inpainting result \hat{Y} is obtained by
\hat{Y} = \alpha \odot \bar{x}_q + (1 - \alpha) \odot B(\hat{F}),
where ⊙ and B denote the Hadamard product and the fusion function, respectively. The adaptive spatial attention map α can adjust the balance between the ground truth image and the restored image to obtain a smoother transition. We can thus eliminate obvious color contrasts and artifacts, especially in boundary areas, and obtain natural face completion results with richer textures.
Masks can dramatically destroy the facial shape and structure information, such as viewing angles and facial expressions, making it quite tough to achieve visually appealing results. To keep the geometric information of the human face intact during the face completion process, we introduce a dense correspondence field that binds the 2D and 3D surface spaces into our network. The structure and texture information of a face image can be disentangled by the dense correspondence field according to [2, 1]: the geometric information is stored in the correspondence field, while the texture map represents the surface of a 3D face to some extent. In this paper, we mainly concentrate on inferring the dense correspondence field with our network. Technically, given an input image x ∈ R^{c×h×w}, the dense correspondence field C = (u; v) consists of maps in the UV space (u, v ∈ R^{h×w}). A visual illustration is shown in Fig. 3, in which the minimum is rendered as blue and the maximum as yellow. We allow our decoder to predict the dense correspondence fields and feature maps at multiple scales simultaneously, where the feature maps are fed into the proposed dual attention fusion module (please see Sec. 3.2). Thanks to the multi-scale network architecture, our decoder can better capture context information and maintain geometric information. In order to supervise C during training, we minimize the pixel-wise error between the estimated result and the ground truth C:
\mathcal{L}_{uv} = \|\hat{C} - C\|_1,
where \hat{C} denotes the dense correspondence field predicted from an input image. We employ BFM [55], a 3D shape estimation approach, to obtain the ground truth dense correspondence field C, similar to [42, 12]. We then obtain the coordinates of vertices by performing the model fitting method [87]. Finally, those vertices are mapped to the UV space by cylindrical unwrapping according to [9].
Following [20, 78, 39], to synthesize richer texture details and correct semantics, the element-wise reconstruction loss, the perceptual loss [36], the style loss, and the adversarial loss are used in our proposed method. Moreover, we also employ an identity preserving loss function to ensure that the identity information of the generated images remains unchanged. Reconstruction Loss. It is calculated as the L1-norm between the inpainting result \hat{Y} and the target image Y:
\mathcal{L}_{rec} = \|\hat{Y} - Y\|_1.
Style Loss. To obtain richer textures, we also adopt the style loss defined on the feature maps produced by the pre-trained VGG-16. Following [72, 44], the style loss can be calculated as the L1-norm between the Gram matrices of the feature maps:
\mathcal{L}_{style} = \sum_i \frac{1}{C_i \times C_i} \left\| G_i(\hat{Y}) - G_i(Y) \right\|_1,
where G_i(·) denotes the Gram matrix of the VGG-16 feature map at the i-th layer and C_i denotes the channel number of the feature map at the i-th layer.
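As an illustration of the reconstruction and style terms just described, the snippet below sketches how the L1 reconstruction loss and the Gram-matrix style loss could be computed in PyTorch. It assumes a helper vgg_feats(x) that returns the selected VGG-16 feature maps; both this helper and the normalization choices inside the Gram matrix are our own assumptions for the sketch, not necessarily the exact settings used in the paper.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (N, C, H, W) -> (N, C, C) Gram matrix, normalized by the number of positions
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)

def reconstruction_loss(pred, target):
    # element-wise L1 between the inpainting result and the target image
    return F.l1_loss(pred, target)

def style_loss(pred, target, vgg_feats):
    # sum of L1 distances between Gram matrices, scaled by 1 / (C_i * C_i) per layer
    loss = pred.new_zeros(())
    for fp, ft in zip(vgg_feats(pred), vgg_feats(target)):
        c = fp.shape[1]
        loss = loss + F.l1_loss(gram_matrix(fp), gram_matrix(ft), reduction="sum") / (c * c)
    return loss
```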
Identity Preserving Loss. To ensure that the generated face images belong to the same identity as the target face images, we adopt LightCNN [70] to extract features and then use the mean square error to constrain the embedding spaces:
\mathcal{L}_{ip} = \|\Psi(\hat{Y}) - \Psi(Y)\|_2^2,
where Ψ denotes the pre-trained LightCNN network [70]. Model Objective. The above loss functions can be grouped into two categories: the Structure Loss and the Texture Loss. The Structure Loss at the k-th decoder layer is given by
\mathcal{L}^k_{struct} = \lambda_{rec} \mathcal{L}^k_{rec} + \lambda_{uv} \mathcal{L}^k_{uv},
where λ_rec and λ_uv are weight factors and are set to 6 and 0.1 empirically; that is, L^k_struct is calculated as the weighted sum of L_rec and L_uv at the k-th layer of the decoder. Here, L_uv denotes the UV loss function (please see Sec. 3.3). The Texture Loss L^k_text is built from the remaining texture-related terms, i.e., the style, identity preserving, perceptual, and adversarial losses, where λ_style and λ_ip are trade-off factors for the style and identity preserving terms and are set to 240 and 0.1 empirically in this work. Finally, the total model objective can be formulated as
\mathcal{L}_{total} = \sum_{p \in P} \mathcal{L}^p_{struct} + \sum_{q \in Q} \mathcal{L}^q_{text},
where P and Q are the selected decoder layer sets on which the constraints are imposed. We select P as {1, 2, 3, 4, 5, 6} and Q as {1, 2, 3} for better inpainting results. Note that 1 represents the outermost layer.
To demonstrate the superiority of our approach against state-of-the-art methods, both quantitative and qualitative experiments for face completion, as well as face verification experiments, are conducted. In this section, we introduce the details of our experimental settings and the experimental results one by one.
CelebA. The CelebFaces Attributes dataset [45] is widely used for face hallucination, image-to-image translation, etc. It is a large-scale face attributes dataset containing more than 200k celebrity images, including face images with large occlusions and pose variations. We randomly select 10,000 images for testing and use the rest for training. CelebA-HQ. This is a high-resolution face image dataset established by Karras et al. [37], which contains 30,000 high-quality face images. We divide the dataset into two subsets: a training set of 28,000 images and a testing set of 2,000 images. FFHQ. The Flickr-Faces-HQ dataset [38] is a high-quality dataset containing 70,000 face images at 1024 × 1024 resolution. It also covers variations in age, ethnicity, and image background. We randomly choose 6,000 images for testing and use the rest for training. Multi-PIE. It contains more than 750,000 images that cover 15 viewpoints, 19 illumination conditions, and a number of facial expressions of 337 identities [25]. We follow Huang et al. [33] to split the dataset. In our experiments, we only utilize the training set to train our network and the compared methods for face recognition. LFW. The Labeled Faces in the Wild dataset [31] is a benchmark commonly used for face recognition, which contains 13,233 images of 5,749 people captured in unconstrained environments. LFW provides a standard protocol for face verification that contains 6,000 face image pairs (3,000 positive pairs and 3,000 negative pairs). We use these standard face image pairs to evaluate face verification performance via face completion. Specifically, face images in the gallery set remain the same, while their counterparts in the probe set are occluded by masks. We first recover the occluded face images with our proposed method and the state-of-the-art methods, and then compare the verification performance. It is worth noting that we only use LFW for testing. L2SFO. This is a large-scale synthesized face-with-occlusion dataset built by Yuan et al. [77], which we call L2SFO, in which face images are occluded by six common objects including masks, eyeglasses, sunglasses, cups, scarves, and hands.
All the occlusions are placed on the face images according to segmentation information to increase the realism of this dataset. It contains 991 different identities and more than 73,000 images. We randomly select 891 identities as the training set (about 66,000 images) and the rest as the testing set (about 7,000 images). IJB-C. The IARPA Janus Benchmark C is a dataset consisting of video still-frames and photos, used as a face recognition benchmark [69]. It contains 117,500 frames from 11,799 videos and 31,300 still images of 3,531 subjects. We use the 1:1 protocol for face verification, whose probe and gallery templates are assembled from several images and video frames for each subject. Following the same procedure as for LFW, images in the probe set are occluded, while images in the gallery set remain unchanged. We first generate clean face images from the occluded face images with our method and the other compared methods, and then compare the face verification performance. IJB-C is also only used for testing.
(Figure caption fragment: results generated by SPADE [53], GMCNN [67], CycleGAN [86], CUT [52], DFNet [29], CANet [48], and our method, respectively; (i) is the ground truth.)
In our experiments, face images are normalized to 256 × 256 and 128 × 128 for high-resolution face completion and face verification, respectively. Following Wu et al. [70], the landmarks at the centers of the eyes and mouth are used for normalizing face images. The occluded face images are generated by MaskTheFace, proposed by Anwar and Raychowdhury [3]. We randomly select mask types to occlude face images during training. Some occluded face images are shown in Fig. 4 and Fig. 8. Different datasets are utilized to train our network for different experimental settings. For face completion, we train our network on the training sets of CelebA, CelebA-HQ, FFHQ, and L2SFO, and then test on their testing sets. For face verification, we train our network on the training sets of CelebA and Multi-PIE and test on LFW and IJB-C. Our proposed method can be broken down into two stages. In the first stage, the inference network is trained through contrastive learning until convergence. In the next stage, the pre-trained encoder and the decoder are jointly trained with the fusion module. We use the SGD optimizer with a learning rate of 0.015 for training the Siamese inference network, and the Adam optimizer with a learning rate of 10^{-4} for jointly training the encoder and decoder. All the results are reported directly without any additional post-processing. Our proposed method is implemented with the PyTorch framework and trained on four NVIDIA TITAN Xp GPUs (12GB). Peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM), and the Fréchet Inception Distance (FID) are used as evaluation metrics. PSNR and SSIM measure the similarity between the inpainting result and the target image. FID measures the Wasserstein-2 distance between real and inpainted images through a pre-trained Inception-V3. We select the 'cloth #333333', 'KN95', 'N95', 'surgical blue', 'cloth #515151', 'surgical', 'surgical green', 'cloth #dadad9', and 'cloth #929292' masks to occlude the testing images for the experiments. These mask images are shown in Fig. 4 from top to bottom. We conduct quantitative experiments on the testing sets of CelebA, CelebA-HQ, and FFHQ occluded by the nine kinds of masks, and report the averaged results.
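To make the quantitative protocol concrete, the small helper below shows one way to compute PSNR and average it over a set of (inpainted, ground-truth) pairs, in the spirit of the averaged results reported above. It assumes uint8 RGB arrays and is only an illustrative sketch; SSIM and FID would be computed with their standard implementations.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    # peak signal-to-noise ratio between two images of the same shape
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

def average_psnr(pairs):
    """pairs: iterable of (inpainted, ground_truth) uint8 image arrays."""
    scores = [psnr(p, t) for p, t in pairs]
    return float(np.mean(scores))
```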
Table 1 shows the performance of our proposed method against other state-of-the-art methods, which consist of two image inpainting methods, GMCNN [67] and DFNet [29], and three image-to-image translation methods: SPADE [53], CycleGAN [86], and CUT [52]. In Table 1, we also conduct experiments to show the improvement in performance over our prior conference work [48]. For simplicity, we call it CANet, which can be regarded as a simplified version of the method proposed in this paper, without the Dense Correspondence Field Estimation and the identity preserving loss. We retrain all the compared methods on the training sets of CelebA, CelebA-HQ, and FFHQ for the sake of fairness.
Table 4. Face verification results on IJB-C. 'Masked' means face verification experiments are conducted between the masked probe set and the unchanged gallery set directly.
As shown in Table 1, the proposed method and CANet achieve the best and the second-best quantitative results in the three metrics on all the testing sets. The results suggest that the proposed method can generate very realistic face images, while the compared methods may not work well when encountering various kinds of masks. The main reasons for the relatively low performance of the compared methods (excluding CANet) are that 1) face images with various kinds of masks dramatically increase the difficulty of image inpainting, hindering the representation learning ability of the encoder; and 2) existing methods take generating realistic images into account but ignore the structural consistency of the generated image. The reason why the performance of our method is higher than that of CANet may be that the Dense Correspondence Field Estimation keeps the geometric information of the human face intact during the face completion process.
We compare our proposed method with state-of-the-art methods in terms of visual and semantic coherence. We conduct qualitative experiments on the testing sets of the three datasets with various kinds of masks. As shown in Fig. 4, we mask the testing images with the nine kinds of masks described in the last section. Among all the compared methods, there are severe artifacts in the results produced by SPADE, CUT, and DFNet; the quality of the generated images is far from satisfactory. The reason is that the various kinds of masks hinder their networks from capturing powerful representations. There are no obvious artifacts in the face images produced by CycleGAN, but it fails to maintain the geometric information of face images and produces obvious color contrasts. The reason is that CycleGAN endeavors to translate the input to its corresponding non-masked face image and ignores structural consistency. As for GMCNN, it produces relatively appealing results, but there are significant differences in color at the edges. CANet produces better results in which the facial geometric information is maintained, but there are still artifacts, especially at the corners of the mouth. Compared with the other methods, our proposed method can generate natural inpainting results with reasonable semantics and richer textures with the help of the self-supervised Siamese inference network, the dense correspondence field, and the DAF module. This demonstrates that our proposed method is superior to the compared methods in terms of consistent structures and colors. Furthermore, we also conduct experiments on a real-world masked face dataset (RMFD) [68]. Note that there are no ground truth images in it.
Therefore, we directly use our model and the compared models to evaluate on this dataset. As shown in Fig. 5 , although there is a huge domain gap between our training sets and the real-world masked face dataset, our method can still generate relatively satisfactory results, which demonstrates the superiority of our proposed method. At the same time, some compared methods can not remove masks effectively, such as (d) and (e) in Fig. 5 . We also provide the corresponding quantitative comparative experiments by using FID, Learned Perceptual Image Similarity (LPIPS) [82] , F1-Score and Realism in Table 2 . LPIPS measures the diversity of images by calculating the similarity in the feature space from the pre-trained AlexNet [40] . F1-Score is the harmonic mean of recall and precision, where precision is calculated by querying whether the each generated image is within the estimated manifold of real images and recall is calculated by querying whether the each real image is within the estimated manifold of generated images [41] . Realism is a metric that reflects the distance between the image and the manifold: the closer the image is to the manifold, the higher Realism is, and the further the image is from the manifold, the lower Realism is [41] . It clearly demonstrates the superiority of our proposed method in dealing with masked face images in real world. In the above three sections, we mainly conduct quantitative and qualitative experiments on face images with masks. In order to demonstrate the effectiveness of our method, we conduct experiments on the L2SFO dataset [77] in which face images are occluded by six common objects, i.e, masks, eyeglasses, sunglasses, cups, scarves, and hands. We conduct quantitative experiments on the testing set of L2SFO, and report the averaged results. we also retrain all the compared methods on the training sets of L2SFO for the sake of fairness. Table 3 shows the performance of our proposed method against other compared methods. Our method outperforms all the other compared methods in three metrics on the testing sets as shown in this table. The results suggest that the proposed method can still extend to other kinds of occlusions. We also compare our proposed method with the state-ofthe-art methods in terms of the visual quality on the testing set of L2SFO. As shown in Fig. 6 , we find that SPADE and GMCNN can remove occlusions, but there are serious artifacts in the generated images. CycleGAN and CUT fail to remove occlusions in some cases. Because they adopt unsupervised learning and hardly handle face images with complex occlusions. DFNet and CANet achieve relatively high-quality results. However, there are still artifacts in the generated face images produced by them. Different from all the compared methods, the proposed method can generate photo-realistic face images. In order to quantitatively evaluate the feasibility of our method for face verification, we compare the results of our method and the compared methods on LFW and IJB-C following the testing protocol as described in Sec 4.1. Face verification experiments are conducted between the recovered probe set and the unchanged gallery set. Three publicly released face recognition models are tested: the LightCNN [70] , ArcFace [18] and FaceNet [60] . We use the area under the ROC curve (AUC), true positive rates at 1% and 0.1% (TPR@FPR=1%, TPR@FPR=0.1%) as the evaluation metrics in the experiments. The results are reported in Table 5 and Table 4 . 
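For reference, the following sketch shows one common way to obtain the TPR at a fixed FPR (e.g., 1% or 0.1%) from verification similarity scores. The score and label arrays and the thresholding scheme are illustrative assumptions, not the exact evaluation code used in the paper.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr):
    """scores: similarity per pair; labels: 1 for genuine pairs, 0 for impostor pairs.
    Returns (TPR, threshold) at roughly the requested false positive rate."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    neg = np.sort(scores[labels == 0])[::-1]          # impostor scores, descending
    # threshold chosen so that roughly target_fpr of impostor pairs score at or above it
    k = max(int(np.floor(target_fpr * len(neg))), 1)
    thr = neg[k - 1]
    tpr = float(np.mean(scores[labels == 1] >= thr))
    return tpr, float(thr)

# example: tpr_at_fpr(cosine_scores, pair_labels, 0.001) gives TPR@FPR=0.1%
```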
We use the masked probe set as a baseline to demonstrate the influences of face completion on face verification. From Table 5 , we can see that our method brings dramatic improvement to face verification. Because our method can keep geometric information intact and generate face images with consistent structures and colors. Compared with the baseline, our method can achieve an increase of more than 10% in TPR@FPR=0.1% on LFW and an increase of 6.68% in TPR@FPR=0.1% on IJB-C, which demonstrates that our proposed method can ameliorate the negative impact of masks. Similar to our method, the compared methods endeavor to recover face images. However, we find that the face verification performances of some compared methods decrease actually, especially in TPR@FPR=0.1%. For instance, the performance of CycleGAN drops from 77.13% to 76.06% on LFW, a drop of about 1% when taking the metric TPR@FAR=1% and using LightCNN as the face feature extractor. From Table 4 , we can also see that the compared methods do not show obvious advantages over the baseline ('Masked') on IJB-C. For example, the performance of CUT is 91.82%, a very limited improvement of 0.006% over the baseline when taking the metric TPR@FAR=1% and using FaceNet as the face feature extractor. For the poor performances of compared methods on LFW and IJB-C, the reason may lie in two aspects. The first reason is that the compared methods can not generate high-quality face images. The other reason is that they can not recover discriminative information of a face image due to the great negative effects of masks. We also present the ROC curves on LFW in Fig. 7 . It is obvious that our method outperforms all the compared methods. We conduct the time complexity experiments on a single GPU (TITAN Xp) and CPU, respectively. To evaluate the inference time for different methods, we randomly sample 1,000 testing images and run forward one time for each image. Then we report the mean inference time for one image. As shown in Table 7 , our proposed method achieves a pleasing time performance compared with the other methods. It runs the second fast on a single TITAN Xp GPU. The fastest method is CUT on GPU. Because the number of parameters of CUT is only about a quarter of our method. However, as can be seen from Table 1 and Fig. 4 , our method outperforms CUT with a large margin. When running on CPU, our proposed method is faster than SPADE, GMCNN, Cycle-GAN and CUT and achieves the comparable performance against DFNet. Table 6 . We investigate the effectiveness of different components of the proposed method on the testing set of CelebA. We train several variants of the proposed method: remove the self-supervised Siamese inference network (denote as con-trastive learning), the DAF module, and/or the dense correspondence estimation (denoted as UV map). As shown in Table 6 , it clearly demonstrates that the self-supervised Siamese inference network, the DAF module, and the dense correspondence field estimation play important roles in determining the performance. As shown in Fig. 10 , the uncompleted models usually generate images with obvious artifacts, especially in boundaries while our full model can suppress color discrepancy and artifacts in boundaries and produce realistic inpainting results. The multi-scale decoder can progressively refine the inpainting results at each scale. We also conduct experiments on the testing set of FFHQ. Then we visualize the images predicted by the decoder at several scales. As shown in Fig. 
8 , it demonstrates that this multi-scale architecture is beneficial for decoding learned representations into generated images layer by layer. We conduct sufficient experiments on the FFHQ dataset to explore the performance variation of our model affected by the weight of the UV loss function. We plot some figures according to the experimental results (Fig. 10) . The horizontal axis represents the weight of the UV loss function. We use eight different weights to design the experiment, i.e, 0, 0.001, 0.01 0.05, 0.1, 0.5, 1 and 10. From Fig. 9 , we can see that PSNR gradually increases with the in-crease of weight, reaches the maximum value when weight is equal to 0.1, and then drops sharply. The variation of SSIM is roughly the same as that of PSNR. The value of FID decreases dramatically from about 4 at the weight of 0 to around 2.5 at the weight of 0.001 and reaches the bottom (about 1.7) at the weight of 0.1. From these experiments, we can see that the UV loss (or Dense Correspondence Field Estimation) plays an important role in determining the performance since it can keep the geometric information of the human face intact during the face completion process. In this paper, we propose a novel two-stage paradigm image inpainting method to generate smoother results with reasonable semantics and richer textures. Specifically, the proposed method boosts the ability of the representation learning of the inference network by using contrastive learning. For keeping the geometric information of the input face image intact, we introduce a dense correspondence field that binds the 2D and 3D surface spaces into our network. We further design a novel dual attention fusion module, which can be embedded into decoder layers in a plug-and-play way. Extensive experiments show the superiority of our proposed method in generating smoother, more coherent, and fine-detailed results, and demonstrate our method can greatly improve the performance of face verification. This work is partially funded by National Natural Science Foundation of China (Grant No. 62006228). Densepose: Dense human pose estimation in the wild Densereg: Fully convolutional dense shape regression in-the-wild Masked face recognition for secure authentication Learning representations by maximizing mutual information across views Patchmatch: A randomized correspondence algorithm for structural image editing Self-organizing neural network that discovers surfaces in random-dot stereograms Image inpainting A morphable model for the synthesis of 3d faces Optimal uv spaces for facial morphable model construction Semi-supervised natural face de-occlusion Fcsr-gan: Joint face completion and super-resolution via multi-task learning Learning a high fidelity pose invariant model for highresolution face frontalization Topological structure in visual perception Multi-attention augmented network for single image super-resolution A simple framework for contrastive learning of visual representations Dynamic convolution: Attention over convolution kernels Region filling and object removal by exemplar-based image inpainting Arcface: Additive angular margin loss for deep face recognition Image inpainting using nonlocal texture matching and nonlinear filtering Perceptually aware image inpainting Unsupervised contrastive photo-to-caricature translation based on auto-distortion Discriminative unsupervised feature learning with convolutional neural networks Image quilting for texture synthesis and transfer Generative adversarial nets Multi-pie. 
Image and Vision Computing Dimensionality reduction by learning an invariant mapping Momentum contrast for unsupervised visual representation learning Non-local meets global: An integrated paradigm for hyperspectral image restoration Deep fusion network for image completion Squeeze-and-excitation networks Labeled faces in the wild: A database forstudying face recognition in unconstrained environments Wavelet domain generative adversarial network for multiscale face hallucination Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis Globally and locally consistent image completion Yafeng Deng, and Ran He. Inconsistencyaware wavelet dual-branch network for face forgery detection Perceptual losses for real-time style transfer and super-resolution Progressive growing of gans for improved quality, stability, and variation A style-based generator architecture for generative adversarial networks Joon-Young Lee, and In So Kweon. Recurrent temporal aggregation framework for deep video inpainting Imagenet classification with deep convolutional neural networks Improved precision and recall metric for assessing generative models Disentangled representation learning of makeup portraits in the wild Learning disentangling and fusing networks for face completion under structured occlusions Image inpainting for irregular holes using partial convolutions Deep learning face attributes in the wild Xiaoyu Zhang, and Ran He. Fa-gan: Face augmentation gan for deformation-invariant face recognition Jie Cao, and Ran He. Partial nir-vis heterogeneous face recognition with automatic saliency search Xiaolin Wei, and Ran He. Free-form image inpainting via contrastive attention network Selfsupervised viewpoint learning from image collections Edgeconnect: Structure guided image inpainting using edge prediction Visual vs internal attention mechanisms in deep neural networks for image classification and object detection Contrastive learning for unpaired image-to-image translation Semantic image synthesis with spatially-adaptive normalization Context encoders: Feature learning by inpainting A 3d face model for pose and illumination invariant face recognition All-in-focus synthetic aperture imaging using generative adversarial network-based semantic inpainting Poisson image editing Structureflow: Image inpainting via structure-aware appearance flow Unconstrained 3d face reconstruction Facenet: A unified embedding for face recognition and clustering Joint 3d face reconstruction and dense face alignment from a single image with 2d-assisted self-supervised learning 3d face reconstruction from a single image assisted by 2d face images in the wild Multistage attention network for image inpainting Laplacian pyramid adversarial network for face completion Non-local neural networks Unsupervised learning of visual representations using videos Image inpainting via generative multi-column convolutional neural networks Masked face recognition dataset and application Iarpa janus benchmark-b face dataset. 
IEEE Conference on Computer Vision and Pattern Recognition Workshops A light cnn for deep face representation with noisy labels Unsupervised feature learning via non-parametric instance discrimination Image inpainting with learnable bidirectional attention maps Image inpainting by patch propagation using patch sparsity Generative image inpainting with contextual attention Generative image inpainting with contextual attention Free-form image inpainting with gated convolution Face de-occlusion using 3d morphable model and generative adversarial network Feature learning and patch matching for diverse image inpainting. Pattern Recognition High-resolution image inpainting with iterative confidence feedback and guided upsampling Self-supervised scene deocclusion Self-supervised scene deocclusion The unreasonable effectiveness of deep features as a perceptual metric Demeshnet: Blind face inpainting for deep meshface verification Image super-resolution using very deep residual channel attention networks Learning oracle attention for high-fidelity face completion Unpaired image-to-image translation using cycleconsistent adversarial networks Face alignment across large poses: A 3d solution Local aggregation for unsupervised learning of visual embeddings