key: cord-0447508-pdz7run6
title: Non-Deterministic Face Mask Removal Based On 3D Priors
authors: Yin, Xiangnan; Chen, Liming
date: 2022-02-20
journal: nan
DOI: nan
sha: 1657094e3174a47d7a3ca1781678be865141572e
doc_id: 447508
cord_uid: pdz7run6

This paper presents a novel image inpainting framework for face mask removal. Although current methods have demonstrated an impressive ability to recover damaged face images, they suffer from two main problems: the dependence on manually labeled missing regions and the deterministic result produced for each input. The proposed approach tackles these problems by integrating a multi-task 3D face reconstruction module with a face inpainting module. Given a masked face image, the former predicts a 3DMM-based reconstructed face together with a binary occlusion map, providing dense geometrical and textural priors that greatly facilitate the inpainting task of the latter. By gradually controlling the 3D shape parameters, our method generates high-quality dynamic inpainting results with different expressions and mouth movements. Qualitative and quantitative experiments verify the effectiveness of the proposed method.

Wearing face masks in public has become an essential hygiene practice to control the spread of COVID-19, posing new challenges for face-related computer vision tasks. Computers need to accomplish face recognition, expression recognition, landmark detection, etc., using only the minimal exposed facial texture. Although many recent studies focus on the masked scenario, most are task-specific and not universally applicable. In comparison, directly restoring mask-occluded face texture promises to be a one-stop solution to the problem. To this end, we need to tackle two sub-tasks: 1) detecting the occluded region and 2) recovering the face texture, corresponding to image segmentation and face image inpainting, respectively.

Thanks to the revolutionary emergence of deep learning, data-driven approaches have dominated computer vision with great success. However, this also leads to a reliance on high-quality training data. For mask segmentation specifically, large, diverse, and manually annotated mask datasets are in strong demand because of the targets' varying shapes, orientations, and textures. Some methods synthesize training data by overlaying masks on ordinary face images, which is a cheap interim solution until a large paired masked-face dataset becomes available.

Early image inpainting methods fill the holes by iteratively searching for nearest-neighbor textures in the background [1]. However, such copy-and-paste methods consider only information internal to the image, so they can recover only small, smooth textures and cannot handle semantic-level deficiencies such as masked noses and mouths. Data-driven approaches, on the other hand, learn the data distribution from large datasets, allowing them to restore semantic-level image patterns. Context Encoder [2] pioneered the adversarial training paradigm; [3, 4] exploit feature masking to deal with free-form missing regions. In addition, different attention modules [5, 6] have been proposed to break through the limited receptive field of the convolution kernel and thus explicitly model long-distance dependencies. Despite the improved inpainting quality, most methods produce only deterministic results, ignoring the multiple plausible ways of filling the hole.

This paper proposes a novel 3D reconstruction-guided method for removing masks from face images in the wild.
The model comprises a multi-task, mask-robust 3D face reconstruction module and a face inpainting module. The former predicts both the 3D Morphable Model (3DMM) [7] parameters and the binary occlusion map of the masked face; the latter recovers the missing facial texture conditioned on the rendered 3D prior. By changing the 3DMM parameters, we can control the shape and expression of the recovered face both accurately and smoothly.

The closest work to ours is that of Din et al. [8]: both works focus on the problem of face mask removal and divide it into mask segmentation and face inpainting. Our method surpasses theirs in two respects: 1) we labeled more mask templates (900 vs. 50) to train the mask segmentation task, and 2) our inpainting results are diverse and highly controllable. Some variational autoencoder-based methods can also produce non-deterministic outputs [9, 6] by sampling latent codes from predicted distributions. Although the stochastic nature of the VAE brings about varied results, diversity is never guaranteed: the targets are still fixed, leading to 1) sharp latent distributions and 2) a decoder that is robust to variations of the latent codes, degrading the framework into an ordinary autoencoder. Furthermore, the diversity introduced by random sampling is neither controllable nor smooth. Although some other methods conditioned on sketches [10], facial landmarks [11], or segmentation maps [12] do yield editable results, the sparsity and instability of such conditions lead to poor control accuracy.

Since collecting a large number of paired face images with and without masks is infeasible, we train on synthetic data pairs generated by overlaying masks on ordinary face images, as shown in the leftmost part of Figure 1. The proposed model is composed of a multi-task 3D face reconstruction and mask segmentation module N_3D and a face inpainting module N_G, corresponding to the left and right halves of Figure 1, respectively. Given a masked face image I_m, N_3D predicts 1) the corresponding 3DMM parameters c, from which a 3D face I_3D can be reconstructed and rendered, and 2) the occlusion mask m, indicating the mask silhouette. We then replace the mask texture with random noise according to m and obtain I_n. Finally, N_G predicts the mask-free face Î conditioned on I_n, m, and I_3D. We also employ a discriminator N_D to increase the realism of the generated images. The following presents each module in detail.

Fig. 1. We first synthesize the training data I_m by adding masks to ordinary face images I. Next, we use a multi-task model N_3D to predict the mask silhouette m and the 3D reconstructed face I_3D of the input. Finally, we synthesize the non-occluded face Î based on the noised face I_n, m, and I_3D. A VGG-shaped discriminator N_D is leveraged to distinguish Î from real images.

N_3D takes ResNet-50 [13] as its backbone and performs both 3D face reconstruction and mask segmentation. Intuitively, the network captures global shape patterns in its bottom layers and detailed texture patterns in its top layers, whereas masks usually occupy a large area of the face with relatively simple textures, so we perform mask segmentation using features from the first three residual blocks. The segmentation task is guided solely by a binary cross-entropy (BCE) loss, where m̂ and m denote the predicted and ground-truth binary masks and W and H denote the spatial dimensions of the binary map. The online hard example mining (OHEM) technique is also utilized to make the training more efficient.
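A minimal sketch of a standard per-pixel BCE formulation consistent with the notation above (the exact formulation used by the authors may differ):

\[
\mathcal{L}_{seg} = -\frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H}
\Big[\, m_{ij} \log \hat{m}_{ij} + \big(1 - m_{ij}\big) \log\big(1 - \hat{m}_{ij}\big) \,\Big].
\]

Under OHEM, only the hardest fraction of the per-pixel terms would be retained before averaging.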
To concentrate the model on visible textures while predicting the 3DMM parameters, we also integrate a gated convolution [4] layer before the last residual block of ResNet-50, which predicts a dynamic feature mask for each channel. The 3D reconstruction branch outputs a vector ĉ ∈ R^237 containing the face's shape, pose, texture, and illumination parameters. To accelerate training, we use the 3D coefficients predicted from the original, unmasked face images by the pre-trained model of [14] as the ground truth c. The following losses jointly guide the 3D reconstruction task.

The most direct term is the coefficient loss, where ĉ and c are the predicted and ground-truth 3D coefficients and N denotes the dimension of c. However, a coefficient-level loss treats the discrepancy in all dimensions equally, which is unreasonable, as some dimensions affect the reconstruction result much more than others (e.g., pose vs. illumination). Hence we introduce the photo loss, which constrains the training at the image level; here I_3D denotes the rendered reconstruction result, I the original face image, and M the binary face-region map (provided by the training dataset). As in most face reconstruction methods, we apply an identity loss to better capture the face identity, where F(·) denotes feature extraction by a pre-trained ArcFace [15] model. Finally, we leverage a landmark loss, as in [14], to loosely constrain the shape and pose of the reconstructed face, where q̂_i and q_i represent the 68 (n_pt = 68) facial landmarks indexed from the predicted 3D face and from the 3D face reconstructed from the ground-truth coefficients c, respectively, and ω_i is the weight of the i-th landmark, set to 20 for the nose and inner-mouth points and 1 for the others. The overall loss function combines these terms, with λ_id = 0.1 and λ_lm = 0.001.

The inpainting module consists of a generator N_G with stacked residual blocks and a discriminator N_D with a VGG structure. As shown in Figure 1, N_G takes the concatenation of the mask parsing map m, the 3DMM-based face I_3D, and the noised image I_n as input and outputs Î, which recovers the original mask-free face image I. Both Î and I are then fed into N_D to obtain their probabilities of being real. We utilize the following losses to train the model: a pixel-wise loss, where H, W, and C are the height, width, and number of channels of I; an identity loss L_id, formulated in the same way as the identity loss above except that I_3D is replaced by Î; a total variation loss [16], where ∇ denotes the directional gradient; and an adversarial loss, where D(·) denotes the mapping function of N_D (the larger its value, the more its input tends to be real). The full loss of N_G combines these terms with λ_pix = 10, λ_id = 0.1, λ_tv = 0.1, and λ_adv = 0.01. The loss of the discriminator N_D follows the implementation of [17] and is composed of an ordinary BCE loss and a zero-centered gradient penalty on real images.
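A minimal sketch of formulations consistent with the descriptions above; the choice of norms, the adversarial-loss variant, and the gradient-penalty weight γ are assumptions, not the authors' verified equations:

\[
\mathcal{L}_{pix} = \frac{1}{HWC}\,\lVert \hat{I} - I \rVert_{1}, \qquad
\mathcal{L}_{id} = 1 - \cos\!\big(F(\hat{I}),\, F(I)\big), \qquad
\mathcal{L}_{tv} = \frac{1}{HWC}\,\lVert \nabla \hat{I} \rVert_{1},
\]
\[
\mathcal{L}_{adv} = -\,\mathbb{E}\big[\log D(\hat{I})\big], \qquad
\mathcal{L}_{G} = \lambda_{pix}\mathcal{L}_{pix} + \lambda_{id}\mathcal{L}_{id} + \lambda_{tv}\mathcal{L}_{tv} + \lambda_{adv}\mathcal{L}_{adv},
\]
\[
\mathcal{L}_{D} = -\,\mathbb{E}\big[\log D(I)\big] - \mathbb{E}\big[\log\big(1 - D(\hat{I})\big)\big]
+ \frac{\gamma}{2}\,\mathbb{E}\big[\lVert \nabla_{I} D(I) \rVert_{2}^{2}\big].
\]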
We first present our implementation details. Then, we qualitatively compare our method's 3D face reconstruction, mask removal, and face editing abilities with the state of the art. Finally, we quantitatively compare the face restoration ability of different methods at the pixel and perceptual levels.

Most mask-related approaches synthesize masked/unmasked training pairs by overlaying mask templates on face images from existing face datasets. However, as Table 1 shows, the mask templates used by previous methods are quite limited; a few tens of variations are far from sufficient to train a robust model. Therefore, we 1) manually keyed out 900 masks from masked face images and 2) collected 800 texture patches to replace the textures of the original masks. The mask templates are combined with CelebAMask-HQ [21] and FFHQ [22] to generate data pairs on the fly during training (1,000 images from FFHQ are left out for testing). We train N_3D for 500,000 steps and N_G-N_D for 200,000 steps, both with a batch size of 8 and an initial learning rate of 1e-4. For each module, the learning rate drops to 1e-5 when training reaches its midpoint. We use Adam with betas set to [0.9, 0.999] to optimize the two modules. Training takes about 60 hours for N_3D and 40 hours for N_G-N_D on two Nvidia GTX 1080 GPUs.

Accurately reconstructing the 3D face from a masked face image is a prerequisite for the success of the subsequent inpainting module. Therefore, we first compare our method with the state-of-the-art 3D reconstruction method of Deng et al. [14]. As shown in Figure 2, their reconstructions are affected by the mask, resulting in deviations in texture and pose. In comparison, our method is robust to face masks thanks to the synthetic masked face images used for training.

Fig. 2. 3D face reconstruction ability for masked faces.

As Section 2 mentions, the method closest to ours is that of Din et al. [8]. Unfortunately, they do not release their code; we therefore use the images from their paper for a more convincing comparison. The other two methods we compare against are LaFIn [11] and the method of Zheng et al. [6], both of which can generate diverse inpainting results. We provide these methods with the mask regions detected by N_3D. As Figure 3 shows, our approach significantly outperforms Zheng et al. and Din et al. The random sampling in the latent space leads to apparent artifacts in the results of Zheng et al. Without a shape prior, the method of Din et al. may generate distorted faces (row 4, column 2); in addition, the poor accuracy of their mask segmentation module leaves residual mask edges on the face (row 4, columns 6 and 7). Our results are comparable with those of LaFIn; however, the latter requires additional binary mask maps.

We further compare the face editing ability of our model with LaFIn, the landmark-guided face inpainting method. This time we provide LaFIn with the 68 facial landmarks extracted from our predicted 3D face model. Figure 4 shows the results guided by different 3D priors (for the face in the red box in Figure 3). The first six columns are conditioned on different shapes, and the last column is conditioned on a brighter skin tone. With the guidance of our landmarks, LaFIn can generate diverse inpainting results. Nevertheless, due to the sparsity of the landmarks, the generated faces do not precisely comply with the 3D face shapes; in addition, LaFIn cannot change the skin color as we do in the last column.

We synthesized 1,000 masked face images from the test set and then used LaFIn and the method of Zheng et al. to recover the unmasked faces (with externally provided binary mask maps). The face restoration ability is evaluated by the L1 loss, the PSNR and SSIM scores, the FID score, and the cosine similarity of the identity features extracted by [15].
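As an illustration of the pixel-level and identity-level metrics above, the sketch below computes the L1 distance, PSNR, and identity cosine similarity. It is illustrative only (not the authors' evaluation code): it assumes images scaled to [0, 1] and identity embeddings produced by an external face recognizer such as ArcFace [15]; SSIM and FID are typically taken from standard library implementations.

import numpy as np

def l1_distance(img_a: np.ndarray, img_b: np.ndarray) -> float:
    # Mean absolute error between two images in [0, 1].
    return float(np.mean(np.abs(img_a.astype(np.float64) - img_b.astype(np.float64))))

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 1.0) -> float:
    # Peak signal-to-noise ratio in dB; higher is better.
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def identity_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    # Cosine similarity between identity embeddings; values close to 1 indicate
    # that the restored face preserves the identity of the original.
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return float(np.dot(a, b))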
For Din et al.'s method, since their code is not publicly available, we adopt the numbers reported in their paper. Results are shown in Table 2. The FID score of EdgeConnect [23] reported by Din et al. is similar to that of their proposed method but much lower than the one we measured; we therefore question the credibility of Din et al.'s data.

This paper proposes a novel framework for removing masks from face images. First, we manually labeled a large, high-quality dataset of face masks for synthesizing training pairs. Next, we trained a mask-robust multi-task module that reconstructs the 3D face and detects the mask region of a face image. Finally, we proposed a 3D reconstruction-guided face inpainting module that generates non-deterministic and highly controllable results. The proposed method outperforms the state of the art both qualitatively and quantitatively.

[1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman, "PatchMatch: A randomized correspondence algorithm for structural image editing," ACM Trans. Graph., vol. 28, no. 3, pp. 24, 2009.
[2] "Context encoders: Feature learning by inpainting."
[3] "Image inpainting for irregular holes using partial convolutions."
[4] "Free-form image inpainting with gated convolution."
[5] "Self-attention generative adversarial networks."
[6] "Pluralistic image completion."
[7] "A morphable model for the synthesis of 3D faces."
[8] "A novel GAN-based network for unmasking of masked face."
[9] "Learning structured output representation using deep conditional generative models."
[10] "SC-FEGAN: Face editing generative adversarial network with user's sketch and color."
[11] "LaFIn: Generative landmark guided face inpainting."
[12] "Semantic segmentation guided face inpainting based on SN-PatchGAN."
[13] "Deep residual learning for image recognition."
[14] "Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set."
[15] "ArcFace: Additive angular margin loss for deep face recognition."
[16] "Understanding deep image representations by inverting them."
[17] "StarGAN v2: Diverse image synthesis for multiple domains."
[18] "A 3D model-based approach for fitting masks to faces in the wild."
[19] "Extended labeled faces in-the-wild (ELFW): Augmenting classes for face segmentation."
[20] "Masked face recognition for secure authentication."
[21] "MaskGAN: Towards diverse and interactive facial image manipulation."
[22] "Progressive growing of GANs for improved quality, stability, and variation."
[23] "EdgeConnect: Structure guided image inpainting using edge prediction."