CoverTheFace: face covering monitoring and demonstrating using deep learning and statistical shape analysis
Yixin Hu, Xingyu Li
2021-08-23

Abstract: Wearing a mask offers strong protection against the COVID-19 pandemic, even though a vaccine has been successfully developed and is widely available. However, many people wear masks incorrectly. This observation prompts us to devise an automated approach to monitoring how people wear their masks. Unlike previous studies, our work goes beyond mask detection; it focuses on generating a personalized demonstration of proper mask-wearing, which helps people use masks better through visual demonstration rather than text explanation. The pipeline starts with the detection of face covering. For images where faces are improperly covered, our mask overlay module incorporates statistical shape analysis (SSA) and dense landmark alignment to approximate the geometry of a face and generate corresponding face-covering examples. Our results show that the proposed system successfully identifies images with properly covered faces. Our ablation study on mask overlay suggests that the SSA model helps to address variations in face shapes, orientations, and scales. The final face-covering examples, especially for half-profile face images, surpass prior arts by a noticeable margin.

Masking the face correctly reduces the spread of respiratory droplets. In particular, it has been proven to be an effective measure of protection against the COVID-19 epidemic. According to the World Health Organization (WHO) guidelines, correct masking requires that the nose, mouth, and chin are all covered. However, many people refuse to wear masks or wear them incorrectly, for example, wearing masks without covering their noses. According to an infection control epidemiologist at the University of Toronto, wearing a mask incorrectly is equivalent to not wearing one at all. Therefore, checking whether people are wearing masks correctly in densely populated public spaces has become an important problem. Prior efforts in the literature usually focus on tasks related to face mask detection and recognition. Many studies and algorithms have been developed to detect whether people are wearing masks, and a few investigate algorithms for mask removal and face inpainting. However, no further instructions are offered to people who are wearing masks improperly. This motivates us to design an automated system for face-covering monitoring and demonstration. We argue that directly demonstrating correct face-covering through visual displays, rather than text explanations, is more effective in guiding mask-wearing habits and preventing the spread of the virus, which ultimately benefits the public. For example, the proposed system could be set up at the entrances of densely populated places such as airports, subway stations, or large shopping centers. If an individual is wearing a mask improperly, the person can correct it using the visual demo as a reference. The proposed system consists of two modules: face mask detection and mask overlay. Specifically, an input facial image is first classified into one of three categories: "correctly wearing", "not wearing", and "incorrectly wearing". For "correctly wearing", the mask overlay module is bypassed, and the original image is displayed as the demonstration. Otherwise, the mask overlay module edits the input image by adding a mask to the face.
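To make the control flow of the pipeline concrete, the following is a minimal sketch of the dispatch logic; the names MaskStatus, classify_mask, remove_mask, and overlay_mask are hypothetical illustrations, not an API released with the paper.

```python
# Minimal sketch of the monitoring pipeline's routing (hypothetical names).
from enum import Enum

class MaskStatus(Enum):
    CORRECT = "correctly wearing"
    NONE = "not wearing"
    INCORRECT = "incorrectly wearing"

def demonstrate(image, classify_mask, remove_mask, overlay_mask):
    """Return a face-covering demonstration image for one input face."""
    status = classify_mask(image)      # MobileNetV2-based mask detector
    if status is MaskStatus.CORRECT:
        return image                   # already a valid demonstration
    if status is MaskStatus.INCORRECT:
        image = remove_mask(image)     # GAN-based removal and inpainting
    return overlay_mask(image)         # SSA + dense landmark alignment
```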
Adding a mask to a face image raises two challenges. First, the face is not a rigid object that always keeps the same shape. Second, faces in images are usually positioned differently due to camera positions and angles. To tackle these challenges, we incorporate SSA and dense landmark alignment in our mask overlay module so that variations in face shapes, orientations, and scales are accounted for when the mask is put on. Column (d) of Figure 1 presents several examples from the proposed mask overlay module. Compared to the prior study MaskTheFace [1], our method significantly improves the results on half-profile faces. We summarize the contributions of this paper as follows:
• An automated pipeline is proposed to monitor face-covering conditions and render a face-covering demo. The generated images are highly realistic for both frontal and profile face images.
• As an essential component of our pipeline, a novel and effective mask overlay procedure is developed by fusing statistical shape analysis and dense face landmark alignment. Results of the proposed method, which are more visually appealing than those of prior arts, are presented in the experimentation section.

The problem in this study involves several research topics: face mask detection, mask removal, and mask overlay/inpainting. In this section, we present a brief review of the most relevant works in the literature.

Face mask detection has been widely studied [2, 9, 13-17]. Given an input image or real-time video, these detectors can tell whether a person is wearing a mask properly. Applications of mask detection usually run on edge devices such as mobile phones and embedded systems. A typical example is a mobile application called "CheckYourMask" [7], which allows people to take selfies to check whether they are wearing their masks correctly. Since deep models in the MobileNet family demonstrate a good balance between accuracy and resource consumption, they are usually taken as the backbone in prior arts. Specifically, Venkateswarlu et al. [15] proposed an algorithm using MobileNet with a global pooling block for face mask detection. Vinh and Anh [16] proposed an algorithm using a Haar cascade classifier to detect the face and YOLOv3 to detect the mask, achieving a 90% detection rate. Xue et al. [17] and Jiang et al. [9] improved the RetinaFace algorithm [4] by inferring the positions of the mask. Oumina et al. [2] proposed a system using ensemble learning to detect whether a person is wearing a mask; they extracted face features using different deep learning models (VGG19, Xception, and MobileNetV2) and performed detection by fusing the classification results of SVM and KNN classifiers.

Mask removal is essential for images classified as "incorrectly wearing" in the proposed method. It removes improperly worn masks from faces and, at the same time, edits the images so that complete, non-occluded faces are reconstructed and displayed. Though there are many successful segmentation and image inpainting algorithms, Din et al. [5] argued that most prior approaches do not fit the problem of unmasking a covered face well because of the large size of masks (e.g., face masks usually extend beyond the face boundary below the chin). To tackle this problem, they investigated a two-stage method in which the first stage detects and segments masks with a modified version of U-Net, and the second stage deploys a GAN-based network with global and local discriminators for mask-area inpainting.

Mask put-on is the process of overlaying a mask on a face image. There are many ways to add masks to faces, such as manually editing an image in Photoshop, as in [5]. Among the various methods, to the best of our knowledge, MaskTheFace [1] is the only study that achieves this goal automatically. It uses six facial landmarks: one on the nose bridge, two on the cheeks, and three along the chin line. A mask template is then matched to the six landmarks to overlay the mask onto the face. Since only six facial landmarks are used, we refer to the method proposed in MaskTheFace as sparse landmark alignment (SLA) in this paper. Column (b) of Figure 1 shows examples of SLA on face images. We observe that SLA fails to follow the face boundary in half-profile images. This observation motivates us to introduce dense landmark alignment (DLA) and statistical shape analysis in our mask overlay algorithm.

An overview of our pipeline is provided in Figure 2. When an image is classified as "correctly wearing", no further action is required. When an input is classified as "not wearing", a mask is overlaid on the image. Otherwise, the wrongly worn mask is removed, and a new mask is inpainted to cover the proper face region for the visual demo. We elaborate on the technical details of the two composing modules, the mask detection module and the mask overlay module, in this section.

[Figure 2 caption: The final face-covering example is generated using statistical shape analysis and dense landmark alignment (Sec. 3.2.2). In the diagram, dash-edge boxes mark deep learning models; by contrast, the solid-edge box represents a non-deep-learning algorithm.]

In our pipeline, the mask detection module classifies an input facial image into one of three categories, "correctly wearing", "incorrectly wearing", and "not wearing", which are passed to the downstream mask overlay module. Following previous efforts in the literature, we take MobileNetV2 as the backbone of the detector. Specifically, the MobileNetV2 model is pre-trained on ImageNet. We replace the fully connected layer at the top with a 7-by-7 average pooling layer and two dense layers with ReLU activations. Before the output layer, a dropout layer with a rate of 0.25 is applied during training.
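As a concrete illustration, the detector head described above can be assembled in Keras roughly as follows. This is a minimal sketch assuming TensorFlow/Keras and 224-by-224 RGB inputs; the dense layer widths (128 and 64) are our assumptions, since the paper does not specify them.

```python
# Keras sketch of the mask detector: MobileNetV2 backbone, 7x7 average
# pooling, two dense ReLU layers (widths assumed), dropout 0.25, 3-way output.
import tensorflow as tf
from tensorflow.keras import layers, models

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

detector = models.Sequential([
    backbone,
    layers.AveragePooling2D(pool_size=(7, 7)),  # 7x7 average pooling
    layers.Flatten(),
    layers.Dense(128, activation="relu"),       # width assumed
    layers.Dense(64, activation="relu"),        # width assumed
    layers.Dropout(0.25),                       # applied in training
    layers.Dense(3, activation="softmax"),      # the three wearing categories
])

# The paper trains with categorical cross-entropy and Adam (lr 1e-4,
# decay 5e-6); the decay is omitted here for portability across Keras versions.
detector.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                 loss="categorical_crossentropy", metrics=["accuracy"])
```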
This module comprises two algorithms: mask removal using a GAN-based model, and mask put-on using statistical shape analysis and landmark alignment.

The purpose of the mask removal algorithm is to remove improperly worn masks and synthesize a non-occluded face. In this regard, we adopt the MCGAN structure proposed by Khan et al. [11]. Note that MCGAN requires a binary mask as input to recover the occluded region of the face; we therefore design a mask segmentation network to produce this binary mask. The specific structure of our mask removal net is depicted in Figure 3. In the segmentation network $G_{seg}$, the Squeeze-and-Excitation (SE) block [8], which performs recalibration of channel characteristics, is incorporated into the U-Net structure. The segmentation loss combines a Dice loss, $L_{Dice}$, and a binary cross-entropy loss, $L_{BCE}$, to evaluate the similarity between the obtained binary map $I_{mask}$ and the ground truth $I_{gt}$:

$$ L_{seg} = L_{Dice}(I_{mask}, I_{gt}) + L_{BCE}(I_{mask}, I_{gt}), \quad (1) $$

where $L_{Dice}$ measures the region-based similarity between $I_{mask}$ and $I_{gt}$, and $L_{BCE}$ measures the global distribution differences between the two masks.
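The following is a minimal sketch of this combined objective in TensorFlow; the equal weighting of the two terms and the smoothing constant in the Dice term are our assumptions, as the paper does not state them.

```python
# Sketch of Eq. (1): L_seg = L_Dice + L_BCE on predicted mask probabilities.
# Equal term weights and the smoothing epsilon are assumptions.
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    """Region-based overlap term: 1 - Dice coefficient."""
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def segmentation_loss(y_true, y_pred):
    """L_seg = L_Dice (regional overlap) + L_BCE (per-pixel distribution)."""
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    return dice_loss(y_true, y_pred) + bce
```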
The MCGAN model [11] cascades two generative nets for face inpainting: one maintains global face semantics and achieves coarse face recovery, and the other refines the face. To obtain a realistic synthesized face image, MCGAN exploits adversarial discriminators for both generative nets. This complex structure makes MCGAN training hard and unstable. We notice that, in our problem, previously covered regions in facial images will eventually be covered again. This relaxes the required level of face recovery and prompts us to simplify the MCGAN structure for easier training. Specifically, we remove the adversarial discriminator paired with the coarse inpainting net. To compensate for the performance loss, we replace all convolutional layers with gated convolution blocks [18]. With its soft-mask mechanism, gated convolution enables the model to learn the masked regions along a separate path, thus generating more realistic inpainting results. In addition, we deploy a pre-trained VGG-16 to quantify the perceptual loss between the inpainted face $I_{inp}$ and its ground truth $I_{gt}$. In sum, our target function for training the modified MCGAN model is

$$ L = L_{GAN} + \lambda_{rc} L_{rc} + \lambda_p L_p, \quad (2) $$

where $L_{GAN}$, $L_{rc}$, and $L_p$ are the adversarial loss, image reconstruction loss, and perceptual loss between $I_{inp}$ and $I_{gt}$, and $\lambda_{rc}$ and $\lambda_p$ are the weights of the reconstruction loss and perceptual loss, respectively. Our image reconstruction loss combines an $L_1$ loss and an image structural similarity (SSIM) loss:

$$ L_{rc} = \| I_{inp} - I_{gt} \|_1 + \big(1 - \mathrm{SSIM}(I_{inp}, I_{gt})\big). \quad (3) $$

Assuming $\phi_i$ is the activation map of the $i$-th layer of the pre-trained VGG-16, the perceptual loss is computed by

$$ L_p = \sum_i \| \phi_i(I_{inp}) - \phi_i(I_{gt}) \|_1. \quad (4) $$
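A sketch of these two losses follows. The SSIM settings, the VGG-16 layers used for features, and the exact norms are our assumptions where the paper does not specify them; the image tensors are assumed to be batched and scaled to [0, 1].

```python
# Sketch of the reconstruction loss (Eq. 3) and perceptual loss (Eq. 4).
# VGG-16 feature layers and SSIM settings are assumptions.
import tensorflow as tf

_vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
_features = tf.keras.Model(
    _vgg.input,
    [_vgg.get_layer(name).output
     for name in ("block2_conv2", "block3_conv3")])  # assumed layers

def reconstruction_loss(i_gt, i_inp):
    """Eq. (3): L1 + (1 - SSIM) on batched images in [0, 1]."""
    l1 = tf.reduce_mean(tf.abs(i_gt - i_inp))
    ssim = tf.reduce_mean(tf.image.ssim(i_gt, i_inp, max_val=1.0))
    return l1 + (1.0 - ssim)

def perceptual_loss(i_gt, i_inp):
    """Eq. (4): L1 distance between VGG-16 activations, summed over layers."""
    x = tf.keras.applications.vgg16.preprocess_input(i_gt * 255.0)
    y = tf.keras.applications.vgg16.preprocess_input(i_inp * 255.0)
    return tf.add_n([tf.reduce_mean(tf.abs(a - b))
                     for a, b in zip(_features(x), _features(y))])
```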
Mask put-on is an essential component of our pipeline. Instead of leveraging deep learning, we design a non-deep-learning algorithm for mask overlay, for two reasons. On the one hand, the lack of suitable training samples hinders the use of deep learning: though both MaskedFace-Net [7] and MaskTheFace [1] provide paired non-occluded facial images and face-covering images, those face-covering images are not realistic, and the noticeable distortions and artifacts in the images may bias training. On the other hand, a clear definition of proper face-covering with a mask is given in the WHO guidelines: the nose, mouth, and chin should all be covered. Since masks usually have similar shapes, we argue that mask put-on can be achieved efficiently by aligning face landmarks with mask templates.

To synthesize realistic face-covering images, accurately localizing landmarks in both face images and mask templates is crucial. We follow the 68-landmark convention in face landmark estimation and use the pre-trained facial landmark detector in the dlib library on face images. However, we notice that landmark estimation on profile faces is poor. Though increasing the number of facial landmarks used in alignment improves performance (please refer to column (c) of Figure 1 for examples of DLA), the alignment still fails to closely follow the chin line. Therefore, we propose to build a face shape model using SSA, specifically the active shape model (ASM) [3], so that variations in face shapes, orientations, and scales due to different camera positions can be accommodated. To this end, we start by estimating landmarks on the contours of face image samples. We then organize these face landmark coordinates in a matrix called the point distribution matrix (PDM), where each row corresponds to one face. After Procrustes analysis to align the coordinates and PCA to reduce the dimension of the PDM, the face ASM is formulated as

$$ f_i = \bar{f} + P b, \quad (5) $$

where $f_i$ and $\bar{f}$ represent the latent landmark representations of a specific face sample and the mean face across all training images, $P = [p_1, p_2, ..., p_t]$ is the matrix consisting of the first $t$ eigenvectors from PCA, and $b = [b_1, b_2, ..., b_t]^T$ is a vector of weights acting like "knobs" to fit a specific $f_i$. The procedure for building the shape model is briefly demonstrated in Figure 4. Rather than directly using the facial landmarks detected in a new image $f_{new}$, we use the ASM in (5) to approximate the geometry of the input and estimate the corresponding landmarks by searching for the best set of translation/rotation/scale parameters $(\tau, \theta, s)$:

$$ \min_{\tau, \theta, s, b} \; \big\| f_{new} - M(s, \theta)\,(\bar{f} + P b) - \tau \big\|^2, \quad \text{where} \; M(s, \theta) = s \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}. \quad (6) $$
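The construction in Eq. (5) reduces to a PCA over aligned landmark vectors, as in the minimal NumPy sketch below; it assumes the rows of the PDM are already Procrustes-aligned, and the number of retained modes t is left to the caller.

```python
# Sketch of building and using the face ASM (Eq. 5). Rows of `pdm` are
# Procrustes-aligned landmark vectors (x1, y1, x2, y2, ...).
import numpy as np

def build_asm(pdm, t):
    """Return the mean shape f_bar and the first t PCA eigenvectors P."""
    f_bar = pdm.mean(axis=0)
    centered = pdm - f_bar
    # SVD of the centered data gives the eigenvectors of the covariance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return f_bar, vt[:t].T           # P: one eigenvector per column

def synthesize(f_bar, P, b):
    """Eq. (5): generate a shape from the weight vector b."""
    return f_bar + P @ b

def project(f_bar, P, f_aligned):
    """Least-squares weights b for an aligned query shape."""
    return P.T @ (f_aligned - f_bar)
```

Fitting the similarity parameters (τ, θ, s) of Eq. (6) can then alternate between aligning the query landmarks to the model frame and re-estimating b with project, in the spirit of the classic ASM fitting loop [3].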
As demonstrated in Figure 4, after obtaining the landmarks of a face in a query image, we align the landmarks of the face with a mask template. In this study, we consider three mask templates: one for the frontal view and two for profiles. In contrast to MaskTheFace [1], which adopts sparse landmark alignment, we propose the use of dense landmark alignment for realistic face-masking images. Because the mask should cover the bottom of the chin and the top of the nose above the nose tip, instead of using six landmarks, we manually annotate 17 landmarks on the mask templates; these correspond to the face landmarks with indices 2 to 16, 30, and 34 in the conventional 68-landmark pattern. We illustrate the traditional 68 face landmarks and our annotated mask templates in Figure 5.
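To illustrate the alignment step, the sketch below detects the 68 dlib landmarks, selects the 17 indices above, and warps an RGBA mask template onto the face. The single global homography and the alpha blending are our simplifications of the template-to-face matching; the index list may need shifting by one depending on whether the landmark convention counts from 0 or 1, and the predictor path refers to the standard dlib model file.

```python
# Sketch of dense landmark alignment: dlib landmarks -> homography warp ->
# alpha blending. A single global homography is a simplification of the
# template matching described above.
import cv2
import dlib
import numpy as np

IDX = list(range(2, 17)) + [30, 34]   # 17 indices (shift if 1-based counting)

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def overlay_mask(face_bgr, template_rgba, template_pts):
    """Warp the annotated mask template onto the detected face."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    face = detector(gray)[0]                       # assume one face per image
    shape = predictor(gray, face)
    dst = np.float32([(shape.part(i).x, shape.part(i).y) for i in IDX])
    H, _ = cv2.findHomography(np.float32(template_pts), dst)
    warped = cv2.warpPerspective(
        template_rgba, H, (face_bgr.shape[1], face_bgr.shape[0]))
    alpha = warped[..., 3:4] / 255.0               # template transparency
    blended = face_bgr * (1 - alpha) + warped[..., :3] * alpha
    return blended.astype(np.uint8)
```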
Our whole pipeline consists of two modules: face mask detection and mask overlay. Since the two modules are relatively independent, we evaluate their performance separately.

Data set: We collect 5829 images from two public datasets, MaskedFace-Net [7] and the Flickr-Faces-HQ dataset (FFHQ) [10]. MaskedFace-Net is a large synthetic dataset consisting of paired correct and incorrect face-covering photos. Images in MaskedFace-Net contain considerable variation in age, ethnicity, and image background. We randomly pick 1903 images with correct mask-wearing and 1926 images with wrong mask-wearing from this dataset; paired images are avoided. In addition, 2000 non-occluded images are randomly selected from the FFHQ image set. In our experiment, all 5829 images are resized to 224-by-224. Images in each category are randomly divided into training and testing sets with a ratio of 8:2, resulting in 4663 training samples and 1166 testing images.

Implementation details: Categorical cross-entropy is used to train our mask detector. We use Adam with a learning rate of 10^-4 and a decay rate of 5 × 10^-6 to optimize the model. The model is trained for a maximum of 20 epochs with a batch size of 32. An early-stopping mechanism is deployed to prevent overfitting.

Results: Our face-covering monitoring model achieves a 98% detection rate. The specific results for each category are summarized in Table 1.

Data sets: Our mask overlay module is applied to images classified as either "not wearing" or "incorrectly wearing". To train the mask removal model, we randomly pick ten thousand images from the public CelebA dataset [12]. All of these images are resized to 512-by-512, with the faces centered in the images. We manually add masks to the images to obtain paired samples of non-occluded faces, masked faces, and binary mask maps. The dataset is partitioned with an 8:2 ratio for training/validation. To build our face shape model, we select a subset of 1950 samples (130 examples for each of 15 individuals) from the head pose estimation dataset Pointing'04 [6]. These photos were taken by varying the orientation of the head both vertically and horizontally. To evaluate the generalization of our mask overlay module, test images of "not wearing" and "incorrectly wearing" are drawn from CelebA [12] and MaskedFace-Net [7], respectively.

Implementation details: To train our mask removal model, we set the batch size to 4 for mask segmentation and 2 for face inpainting. The optimizer for both models is Adam, with a learning rate of 0.001. The segmentation model is trained for 200,000 iterations, and the inpainting model is trained for 300,000 iterations. Since our mask overlay algorithm is not based on deep learning, we directly feed our Pointing'04 subset into the ASM. Images with non-occluded faces are then passed to the mask overlay algorithm for image editing.

Results: Figure 6 presents examples of our mask overlay results. In the literature, MaskTheFace [1] is the only study that achieves mask overlay on non-occluded images automatically. We compare the SLA algorithm in MaskTheFace with the proposed method and present the results in columns (b) and (d) of Figure 1. Visually, our mask overlay algorithm, which incorporates SSA and DLA, obtains a noticeable improvement on faces with different orientations. Figure 7 presents the results of our mask overlay on images with incorrect mask-wearing.

In this experiment, we take the SLA algorithm in MaskTheFace [1] as the baseline and investigate the effect of DLA and SSA on the final mask overlay results. To this end, we take non-occluded facial images from the CelebA image set [12] and generate face-covering images using the SLA algorithm, the DLA algorithm based on the 17 landmarks without the face shape model, and our DLA+SSA approach. Examples of mask overlay are displayed in Figure 1. Compared to sparse landmark alignment, dense landmark alignment helps fit the mask to faces. The active shape model combined with dense landmark alignment generates the most realistic face-covering images.

We considered the important problem of monitoring and demonstrating face-covering conditions from images. Our approach is capable of detecting face images with improper mask-wearing and rendering plausible personalized face-covering demonstrations. Experimentation showed that the proposed mask overlay algorithm, based on the active shape model and dense landmark alignment, outperforms prior arts. For future work, we plan to test our pipeline on diverse images taken under different illumination conditions (e.g., shadows and reflections).

References
[1] Masked face recognition for secure authentication.
[2] Real time face mask detection system using transfer learning with machine learning method in the era of COVID-19 pandemic.
[3] Active shape models-their training and application. Computer Vision and Image Understanding.
[4] RetinaFace: Single-shot multi-level face localisation in the wild.
[5] A novel GAN-based network for unmasking of masked face.
[6] Estimating face orientation from robust detection of salient facial structures.
[7] Validating the correct wearing of protection mask by taking a selfie: Design of a mobile application "CheckYourMask" to limit the spread of COVID-19.
[8] Squeeze-and-excitation networks.
[9] RetinaMask: A face mask detector.
[10] A style-based generator architecture for generative adversarial networks.
[11] Interactive removal of microphone object in facial images.
[12] Deep learning face attributes in the wild.
[13] Prototype for integration of face mask detection and person identification model - COVID-19.
[14] Face mask detection using MobileNetV2 in the era of COVID-19 pandemic.
[15] Face mask detection using MobileNet and global pooling block.
[16] Real-time face mask detector using YOLOv3 algorithm and Haar cascade classifier.
[17] Intelligent detection and recognition system for mask wearing based on improved RetinaFace algorithm.
[18] Free-form image inpainting with gated convolution.