key: cord-0499135-of9anga3
authors: Yang, Chun-Wei; Phung, Thanh-Hai; Shuai, Hong-Han; Cheng, Wen-Huang
title: Mask or Non-Mask? Robust Face Mask Detector via Triplet-Consistency Representation Learning
date: 2021-10-01
journal: nan
DOI: nan
sha: b06d2d2a4b2b303065593ffa1ba30d0afa072303
doc_id: 499135
cord_uid: of9anga3

In the absence of vaccines or medicines to stop COVID-19, one of the effective methods to slow the spread of the coronavirus and reduce the overloading of healthcare is to wear a face mask. Nevertheless, mandating the use of face masks or coverings in public areas requires additional human resources, which is tedious and attention-intensive. To automate the monitoring process, one promising solution is to leverage existing object detection models to detect faces with or without masks. As such, security officers do not have to stare at the monitoring devices or crowds, and only have to deal with the alerts triggered by the detection of faces without masks. Existing object detection models usually focus on designing CNN-based network architectures for extracting discriminative features. However, the training datasets for face mask detection are small, while the difference between faces with and without masks is subtle. Therefore, in this paper, we propose a face mask detection framework that uses a context attention module to enable effective attention in the feed-forward convolutional neural network by adaptively refining its feature maps with the learned attention maps. Moreover, we further propose an anchor-free detector with Triplet-Consistency Representation Learning, which integrates the consistency loss and the triplet loss to deal with the small-scale training data and the similarity between masks and occlusions. Extensive experimental results show that our method outperforms the other state-of-the-art methods. The source code is released as a public download to improve public health at https://github.com/wei-1006/MaskFaceDetection.

Wearing a face mask is part of a comprehensive strategy of measures to suppress transmission and provide protection against COVID-19. Leung et al. [41] also show that surgical masks can protect people from the coronavirus by reducing the probability of airborne transmission. As such, many governments mandate the use of face masks or coverings in public areas, such as retail establishments and public transportation [5, 33]. Nevertheless, implementing such orders requires staff to monitor whether pedestrians wear masks at the entrances of restricted areas, which is a heavy burden for large-scale and long-term surveillance. With the advance of Artificial Intelligence (AI), it is promising to alleviate this burden by utilizing machine learning algorithms to implement an automatic monitoring system. Generally, facial mask detection can be regarded as a special case of object detection [3, 7, 9, 11, 19, 28, 32, 37, 38, 42, 50, 69, 77, 82-85]. That is, the goal is to determine where the faces are (object localization) and whether the faces are with or without masks (object classification) in a given image. Object detection can be categorized into two mainstreams, anchor-based approaches [3, 23, 24, 28, 37, 62, 69] and anchor-free approaches [10, 40, 57, 70, 77, 82, 83]. Anchor-based approaches leverage predefined anchor boxes of multiple sizes to detect objects with different scales and aspect ratios, and can be further categorized into one-stage detectors [7, 18, 44, 46, 57, 85] and two-stage detectors [6, 23, 43, 62].
Two-stage detectors adopt region proposals to extract the region of interest (ROI) and separate the object detection task into object localization and image classification. Due to the high computational complexity of two-stage detectors, one-stage detectors merge the tasks of object localization and image classification into a regression problem by predicting class probabilities and bounding box coordinates simultaneously. On the other hand, anchor-free detectors discard anchor boxes and directly detect vital keypoints [14, 40, 57, 70, 77, 82, 83], such as the centers and corners of objects.

Generally, existing works on face detection address a variety of challenging cases as illustrated in Figs. 1(a)-1(c), i.e., faces with different orientations, reflections, and crowded scenes at different scales. Nevertheless, it is still challenging to detect faces with or without masks, since several issues cannot be well addressed by most previous object detection works. First, masks come in different colors and styles, e.g., Fig. 1(d) contains dark blue, light blue, white, and pink masks, while the masks in Fig. 1(e) and Fig. 1(f) differ in style. To detect faces with masks of different colors and styles, one naïve solution is to increase the size of the training data. However, the number of training images with facial masks is much smaller than the number of face images in existing datasets. Therefore, it is challenging to devise a machine learning model without a large-scale dataset. Second, images of faces with and without masks differ only partially, while other occlusions are similar to facial masks, e.g., a hand covering the mouth in Fig. 1(d), or kissing a medal in Fig. 1(g). Moreover, the occlusions of faces are diverse, e.g., makeup in Fig. 1(h) and clothing occluding the mask in Fig. 1(i). One possible solution is to use fine-grained feature extraction models, e.g., [45, 51, 64, 76]. Nevertheless, the embeddings of faces with and without facial masks, as well as occluded faces, are still relatively close to each other, which easily confuses the model.

To address these issues, in this paper, we present a new framework, namely "CenterFace", for face mask detection with triplet-consistency representation learning. We first use pre-trained models, e.g., ResNet-18 [29] or MobileNet [30], to extract the basic features. Afterward, to enhance the basic features, the convolutional block attention module [78] is leveraged; it is a simple yet effective attention mechanism that blends cross-channel and spatial information by attending to informative features along these two dimensions. The input feature map is combined with the resulting attention maps to adaptively refine the features in the feed-forward convolutional neural network. As anchor-free models demonstrate promising efficiency and effectiveness, the proposed "CenterFace" uses a keypoint heatmap to find the center point of each bounding box. Moreover, we propose a new Triplet-Consistency Representation Learning scheme by integrating the consistency loss [32] and the triplet loss [63] to address the first and second challenges, respectively. Specifically, the consistency loss fully utilizes the labeled data to deal with the various orientations and visual appearance diversity, while the triplet loss emphasizes the difference between faces with masks and occluded faces.
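To make the pipeline above easier to picture, the following is a minimal PyTorch sketch of how such an anchor-free face mask detector could be organized: a backbone, a slot for an attention block, and three prediction heads for the per-class center heatmap, the center offset, and the box size. The layer sizes, head structure, and all names are illustrative assumptions for this sketch, not the authors' released architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MaskFaceDetector(nn.Module):
    """Backbone -> (attention) -> anchor-free heads, in the spirit of the
    pipeline described above (hypothetical sketch, not the paper's code)."""

    def __init__(self, num_classes=2):  # faces with / without masks
        super().__init__()
        backbone = resnet18()  # randomly initialized here; a pre-trained backbone would be used in practice
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, H/32, W/32)
        self.attention = nn.Identity()  # placeholder where an attention block would refine the features

        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, 1),
            )

        self.heatmap = head(num_classes)  # per-class center heatmap
        self.offset = head(2)             # sub-stride offset of each center
        self.size = head(2)               # width and height of each box

    def forward(self, x):
        f = self.attention(self.features(x))
        # A real detector would upsample f to a finer output stride (e.g., 4) before the heads.
        return torch.sigmoid(self.heatmap(f)), self.offset(f), self.size(f)

heat, off, size = MaskFaceDetector()(torch.randn(1, 3, 512, 512))
print(heat.shape, off.shape, size.shape)  # [1, 2, 16, 16], [1, 2, 16, 16], [1, 2, 16, 16]
```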
Experimental results on public datasets show that the proposed "CenterFace" outperforms the state-of-the-art methods by 7.4% and 6.3% in precision for faces with and without masks, respectively, while the inference time satisfies the real-time constraint. The main contributions can be summarized as follows:
• Since enforcing a mask policy is important to control diseases propagated through airborne transmission, e.g., COVID-19, and to alleviate the burden on healthcare, we propose a new framework, namely "CenterFace", which utilizes Triplet-Consistency Representation Learning to mitigate the effects of the visual diversity and various orientations of face masks.
• Experimental results on public datasets manifest that the proposed approach outperforms state-of-the-art methods, while the model size is small enough to be deployed on edge mobile devices. The source code is released as a public download for public health improvement.

The rest of the paper is organized as follows. We present the related work in Section 2. The methodology is then described in detail in Section 3. Finally, the experimental results are elaborated in Section 4, and we draw conclusions and describe future work in Section 5.

Conventional face detection methods are based on handcrafted features due to the limitation of computing resources and the lack of large-scale datasets [2, 13, 16, 17, 23-26, 48, 49, 73, 74]. For example, Viola et al. [73] propose the first unconstrained real-time human face detector by combining Haar feature selection with integral images and detection cascades. Dalal et al. [13] further propose an improved model by extracting histogram of oriented gradients (HOG) based scale-invariant features, which are robust to translation, scale, and illumination, as well as to objects of different sizes [2, 48, 49]. However, to detect objects of different sizes, HOG requires re-scaling the input image several times. Therefore, Felzenszwalb et al. [16, 17] propose an extension of HOG with the advancement of the Fourier Transform by integrating the Local Binary Pattern (LBP) to extract a scale-invariant feature map. Girshick et al. [23-26] propose the notion that objects can be modeled by parts in a deformable configuration and ensemble the detections of different object parts for the final prediction. The representative deformable part model consists of a root filter and a number of part filters, without requiring the sizes and locations of the part filters to be manually specified.

With the advance of deep learning, extracting features from images is now data-driven instead of relying on prior knowledge to handcraft the features. Object detection can be grouped into two categories: one-stage detectors and two-stage detectors. Two-stage detectors [23, 24, 35, 61, 67, 68, 87] first extract a set of candidate object boxes with a selective search algorithm [71] as region proposals; the region proposals are then warped into a square and fed into a convolutional neural network to extract discriminative features. However, the large number of overlapping region proposals leads to redundant feature computation and thus inefficient detection models. Therefore, several works improve the computation over the overlapping proposals [23, 43, 61]. Even so, the training process is still multi-stage, with computational redundancy at the subsequent detection stage.
On the other hand, one-stage detectors can be further categorized into anchor-based methods and anchor-free methods. Redmon et al. propose the You Only Look Once (YOLO) series [4, 58-60], which replaces the region proposal network (RPN) with anchor boxes whose width-to-height ratios are predefined. However, the YOLO series still lags behind two-stage detectors in localization accuracy. To address the localization and small-object issues of one-stage detection, Liu et al. propose the single shot multibox detector (SSD) [46], which focuses on multi-scale object detection with a variety of sizes and aspect ratios and computes both locations and class scores using small convolutional filters. Despite the improvements in speed and accuracy, the performance of one-stage detectors is usually inferior to that of two-stage detectors. A recent line of studies proposes anchor-free detectors, which directly find objects without using multiple anchors in the input images [37, 39, 44, 70, 86]. There are two popular kinds of anchor-free detection methods. The first is keypoint-based methods [39, 86], which bound the spatial extent of objects by locating several predefined or self-learned keypoints. The second is center-based methods [37, 70], which define positives by the center point or center region of objects and then predict their boundaries.

Self-supervised learning has become a popular learning paradigm, which creates pretext tasks on unlabeled data and learns from them in a supervised manner. In other words, the goal of self-supervised learning is to construct image representations with semantically meaningful content via pretext tasks while not requiring semantic annotations for a large training set of images.

Fig. 2. The overview architecture of CenterFace. We use an output heatmap from a convolutional network to detect a face and/or a mask as a bounding box, and the network is trained to predict similar embeddings for crops that belong to the same face.

For example, Misra et al. [52] use pretext tasks based on image transformations to encourage image representations to be invariant to image patch perturbations. Xu et al. [80] exploit the similarity from self-supervised signals as an auxiliary task, which effectively transfers hidden information from the teacher network to the student network. Gidaris et al. [21] propose a self-supervised learning method based on rotating input images by different angles, so that the framework learns to estimate the geometric transformation applied to the image, which helps downstream tasks such as object detection. Grill et al. [27] leverage two networks, where an online network is trained to predict the representation produced by another network for another augmented view of the same image.

Attention has become a popular concept and a useful tool in the deep learning community in recent years [12, 15, 31, 47, 72, 75, 78]. The basic idea is that human visual perception focuses on only some specific regions at a time and still performs well at detecting objects. Therefore, to mimic human attention as a sequence of partial glimpses, different kinds of attention mechanisms are proposed to learn "what" and "where" to attend, focusing on important features and suppressing unnecessary ones. For example, Vaswani et al. [72] use an encoder-decoder architecture with the attention mechanism to refine the feature maps.
Hu et al. [31] introduce inter-channel attention by using globally-pooled features to compute channel-wise attention weights. Woo et al. [78] demonstrate that channel-wise attention alone is insufficient and add spatial attention to decide "where" to focus, enabling the attention generation process for 3D feature maps with far fewer parameters.

Fig. 3. The convolutional block attention module (CBAM) [78] integrated with a block in ResNet [29]. We utilize CBAM on the convolutional outputs of each ResNet block.

To effectively detect faces with and without facial masks, we present a new framework, namely "CenterFace", for face mask detection with triplet-consistency representation learning. Fig. 2 shows the overview of the proposed "CenterFace". Specifically, we first use pre-trained models to extract the low-level features, since low-level features, e.g., textures and edges, are usually shared across different tasks. Afterward, to enhance the basic features without significantly increasing the number of parameters, the convolutional block attention module (CBAM) [78] is leveraged to attend to channel and spatial features. Fig. 3 illustrates the architecture of CBAM, which contains a channel module and a spatial module. The channel module utilizes both max-pooling and average-pooling outputs to find the attention weights over channels, while the spatial module uses two similar outputs pooled along the channel axis to generate the attention weights over the spatial dimensions.
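As a concrete illustration of the channel-then-spatial attention just described, below is a minimal PyTorch sketch of a CBAM-style block: max- and average-pooled channel descriptors pass through a shared MLP to gate the channels, and channel-pooled maps pass through a convolution to gate the spatial positions. The reduction ratio, kernel size, and class names are assumptions made for this example rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: squeeze spatial dims with max/avg pooling,
    pass both descriptors through a shared MLP, and gate the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))     # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))      # (B, C)
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w

class SpatialAttention(nn.Module):
    """Spatial attention: pool along the channel axis and learn a 2D mask."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)       # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)      # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class CBAMBlock(nn.Module):
    """Channel attention followed by spatial attention, as in CBAM."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

out = CBAMBlock(64)(torch.randn(2, 64, 32, 32))  # refined feature map, same shape as input
```

In our reading of the paper, such a block would be applied to the convolutional output of each ResNet block before the feature map is passed on.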
Moreover, since anchor-free models have been shown to be efficient and effective, the proposed "CenterFace" uses a keypoint heatmap to find the center point of each bounding box. In addition, we propose a new Triplet-Consistency Representation Learning scheme by integrating the consistency loss [32] and the triplet loss [63] to address the first and second challenges, respectively. Specifically, the consistency loss fully utilizes the labeled data to deal with the various orientations and visual appearance diversity, while the triplet loss emphasizes the difference between faces with masks and occluded faces. In the following, we first explain the center loss, which leverages the anchor-free models, and then the triplet loss and consistency loss for Triplet-Consistency Representation Learning. Finally, the total loss function is presented, together with the pseudocode. Throughout the paper, bold uppercase letters (e.g., $\mathbf{X}$) and bold lowercase letters (e.g., $\mathbf{x}$) denote matrices and column vectors, respectively, while non-bold letters (e.g., $x$) and calligraphic letters (e.g., $\mathcal{X}$) represent scalars and tensors, respectively.

Most of the successful object detection models require enumerating candidate object locations and classifying each candidate region, which is computationally expensive. Moreover, since the model may predict several candidate bounding boxes for one object, the predicted bounding boxes require additional post-processing, e.g., non-maximum suppression [53]. To satisfy the requirement of real-time monitoring, we adopt the anchor-free method, which performs object classification and bounding box localization simultaneously. Specifically, let $I \in \mathbb{R}^{h \times w \times 3}$ denote the input image tensor with height $h$ and width $w$.

The goal is to generate the keypoint heatmap $\hat{Y} \in [0, 1]^{\frac{h}{s} \times \frac{w}{s} \times C}$ with stride $s$, where $\hat{Y}_{x,y,c} = 1$ and $\hat{Y}_{x,y,c} = 0$ respectively represent a predicted keypoint and the background region at position $(x, y)$ belonging to class $c$, and $C$ is the number of keypoint classes. To train the model to predict the keypoints, given a ground-truth keypoint $p = (x, y) \in \mathbb{R}^2$, we first calculate its low-resolution equivalent $\tilde{p} = (\lfloor \frac{x}{s} \rfloor, \lfloor \frac{y}{s} \rfloor)$. We then smooth the ground-truth keypoint with a 2D Gaussian kernel,
$$Y_{x,y,c} = \exp\Big(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\Big),$$
where $\sigma_p$ is the object-size standard deviation, determined by the size of the face so that the pair of radius points can generate the bounding box around the face given by the ground-truth annotation for each ground-truth keypoint $p$. The training objective is a pixelwise logistic regression with a variant of focal loss [44], i.e.,
$$L_{heat} = \frac{-1}{N} \sum_{x,y,c} \begin{cases} (1 - \hat{Y}_{x,y,c})^{\alpha} \log(\hat{Y}_{x,y,c}), & \text{if } Y_{x,y,c} = 1, \\ (1 - Y_{x,y,c})^{\beta} (\hat{Y}_{x,y,c})^{\alpha} \log(1 - \hat{Y}_{x,y,c}), & \text{otherwise,} \end{cases}$$
where $\alpha$ and $\beta$ are the hyperparameters controlling the contribution of each point to the focal loss, and $N$ is the number of faces in the input image $I$, which normalizes all the positive focal instance losses to 1.

As the output stride may cause errors due to discretization, i.e., $p$ becomes $\tilde{p}$, we predict a local offset $\hat{O} \in \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times 2}$ for each center point, where the ground-truth offset can be derived by calculating $\frac{p}{s} - \lfloor \frac{p}{s} \rfloor$ for each center point $p$. Let $p_k$ denote the position of the $k$-th center. The offset loss can be derived as follows:
$$L_{off} = \frac{1}{N} \sum_{k} \Big| \hat{O}_{\tilde{p}_k} - \big(\tfrac{p_k}{s} - \tilde{p}_k\big) \Big|.$$

In addition to the centers derived from the keypoint heatmap $\hat{Y}$ and offset $\hat{O}$, the next goal is to find the bounding boxes. Let $(x_1, y_1, x_2, y_2)$ be the coordinates of the bounding box of an object with class $c$, whose center can be represented by $p_k = \big(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\big)$. We predict the bounding box size $\hat{S} \in \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times 2}$ for each center point. It is worth noting that $\hat{S}$ is shared between the classes of faces with and without masks to reduce the computational cost. The ground truth for the $k$-th center can be calculated by $s_k = (x_2 - x_1, y_2 - y_1)$. As such, the loss for the bounding box size, denoted by $L_{size}$, can be derived as follows:
$$L_{size} = \frac{1}{N} \sum_{k} \big| \hat{S}_{p_k} - s_k \big|.$$
The overall center loss for detecting faces with and without masks is
$$L_{ctr} = \lambda_{heat} L_{heat} + \lambda_{off} L_{off} + \lambda_{size} L_{size},$$
where $\lambda_{heat}$, $\lambda_{off}$, and $\lambda_{size}$ are the hyperparameters controlling the weights between the different prediction targets.

Due to the subtle difference between faces with masks and faces with occlusions, it is difficult to train a model by only using the center loss. Therefore, to further improve the performance on challenging cases, we utilize an online triplet mining method [63], which enforces that the distance between a pair of samples with the same label is smaller than that between a pair of samples with different labels. Indeed, the triplet loss works directly on features. A triplet is formed by 1) an anchor input, 2) a positive input (a sample with the same label), and 3) a negative input (a sample with a different label or a randomly cropped sample). The distance between the anchor input and the positive input is enforced to be smaller than the distance between the anchor input and the negative input. Specifically, let $R_i$, $R_i^{+}$ and $R_i^{-}$ respectively denote the anchor image region, positive image region, and negative image region in the $i$-th triplet. The triplet loss can be derived as follows:
$$L_{tri} = \sum_{i} \max\big( \| f(R_i) - f(R_i^{+}) \|_2^2 - \| f(R_i) - f(R_i^{-}) \|_2^2 + m, \; 0 \big),$$
where the input image regions $R_i$, $R_i^{+}$ and $R_i^{-}$ are embedded by the same mapping function $f$. We ensure that the features of the anchor region $R_i$ are particularly close to those of the positive samples $R_i^{+}$ and far away from those of the negative regions $R_i^{-}$ by using the margin $m$ to separate positive and negative pairs. In other words, when faces are regarded as anchors, samples with the same label serve as positives, while samples with different labels, e.g., occluded faces, or randomly cropped regions serve as negatives.
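As an illustration of the triplet objective above, the following is a minimal PyTorch sketch; the embedding head, margin value, and batch construction are assumptions for the example, not the authors' exact online-mining implementation.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_anchor, f_pos, f_neg, margin=0.2):
    """Hinge-style triplet loss on embedded regions.

    f_anchor, f_pos, f_neg: (B, D) embeddings of anchor, positive and
    negative image regions produced by the same mapping function.
    """
    d_pos = (f_anchor - f_pos).pow(2).sum(dim=1)   # squared L2 to positives
    d_neg = (f_anchor - f_neg).pow(2).sum(dim=1)   # squared L2 to negatives
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage: e.g., masked-face regions as anchors/positives, occluded faces as negatives.
emb = torch.nn.Linear(512, 128)                    # stand-in embedding head
a, p, n = (emb(torch.randn(8, 512)) for _ in range(3))
print(triplet_loss(a, p, n))
```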
In addition to the triplet loss, we further impose a consistency constraint between the predictions for an input image and those for its horizontally-flipped counterpart, i.e., between $\hat{Y}_{i,j,:}$ and $\hat{Y}'_{i',j',:}$, where $(i', j')$ denotes the position in the flipped prediction corresponding to $(i, j)$ in the original one. One way to measure the closeness is to use the L2 distance. However, the L2 loss regards all the classes equally, so irrelevant classes with a low probability may also strongly affect the results. Therefore, inspired by [32], we instead use the Jensen-Shannon divergence (JSD) to measure the difference between $\hat{Y}_{i,j,:}$ and $\hat{Y}'_{i',j',:}$. The consistency loss for classification, denoted by $L_{con\text{-}cls}$, is obtained by calculating the expectation over all bounding box pairs, i.e.,
$$L_{con\text{-}cls} = \mathbb{E}_{(i,j),(i',j')}\Big[ \mathrm{JSD}\big( \hat{Y}_{i,j,:} \, \big\| \, \hat{Y}'_{i',j',:} \big) \Big].$$

On the other hand, the localization result of a candidate box is based on the offset $\hat{O}$ and size $\hat{S}$. Let $\hat{O}'$ and $\hat{S}'$ denote the predicted offset and size of the horizontally-flipped image, respectively. Unlike the classification case, a simple modification is required to make the predictions comparable: since the flipping transformation makes the offset change in the opposite horizontal direction, a negation is applied to the horizontal component to align the two predictions. Afterward, we derive the localization consistency loss, denoted by $L_{con\text{-}loc}$, by calculating the expectation of the offset and size differences over each pair of candidate box centers at positions $p$ and $p'$. Finally, when the consistency loss is computed with all candidates, the results may easily be dominated by the background, which deteriorates the performance on the foreground classification candidates. As such, we build a mask with the same size as the heatmap from every ground-truth bounding box to exclude the boxes with a high probability of belonging to the background class. The total consistency loss, denoted by $L_{con}$, simply sums $L_{con\text{-}cls}$ and $L_{con\text{-}loc}$, i.e., $L_{con} = L_{con\text{-}cls} + L_{con\text{-}loc}$.
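To make the flip-consistency term concrete, below is a minimal PyTorch sketch that measures the Jensen-Shannon divergence between the class scores predicted for an image and those predicted for its horizontally-flipped copy. The heatmap layout, the per-location normalization into a distribution, and the function names are illustrative assumptions rather than the released implementation.

```python
import torch

def jsd(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two categorical distributions (last dim)."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def flip_consistency_cls(heat, heat_flip):
    """Classification consistency between a heatmap predicted on the original
    image and the heatmap predicted on its horizontally-flipped copy.

    heat, heat_flip: (B, C, H, W) per-class keypoint scores.
    """
    # Flip the second prediction back so positions correspond to the original image.
    heat_flip = torch.flip(heat_flip, dims=[3])
    p = heat.permute(0, 2, 3, 1)        # (B, H, W, C): class scores per location
    q = heat_flip.permute(0, 2, 3, 1)
    # Normalize per location into a distribution (one simple choice for this sketch).
    p = p / p.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    q = q / q.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    # In practice, only locations covered by ground-truth boxes would be averaged,
    # so that background positions do not dominate the loss.
    return jsd(p, q).mean()

h = torch.rand(1, 2, 8, 8)
print(flip_consistency_cls(h, torch.flip(h, dims=[3])))  # ~0 for perfectly consistent predictions
```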
The total loss function is summarized over the center loss, triplet loss, and consistency loss, i.e.,
$$L = L_{ctr} + \lambda_{tri} L_{tri} + \lambda_{con} L_{con},$$
where $\lambda_{tri}$ and $\lambda_{con}$ are hyperparameters controlling the contributions of the different loss terms (empirically, the two weights are set to 1 and 100). In summary, the keypoint heatmap prediction works as a general-purpose object detector that extends the keypoint estimator to generate the face's bounding box. On the other hand, the triplet loss, which enforces a margin between each pair of matching/non-matching faces, together with the classification and localization consistency losses, is proposed so that the model not only classifies the faces but also localizes their positions. The overall objective makes the detection model robust to the complicated situations that arise in face detection, especially masked face detection.

The pseudocode of "CenterFace" is presented in Algorithm 1. Its input is the image set with ground truth, together with the hyperparameters $\alpha$ and $\beta$ of the focal loss, the stride $s$, the margin $m$, the loss weights, and the number of iterations; its output is the bounding boxes with classes, represented by $\hat{Y}^*$, $\hat{S}^*$ and $\hat{O}^*$. In each iteration, the model predicts $\hat{Y}$, $\hat{S}$ and $\hat{O}$, computes the total loss $L$, and updates the network parameters; if $L$ is the smallest observed so far, the current predictions are kept, i.e., $\hat{Y}^*, \hat{S}^*, \hat{O}^* \leftarrow \hat{Y}, \hat{S}, \hat{O}$. After the last iteration, $\hat{Y}^*$, $\hat{S}^*$ and $\hat{O}^*$ are returned.

In this section, we first present the experimental setup in detail. Next, we evaluate the proposed framework on public datasets with diverse face mask representations, including synthetic data, visual diversity, and various orientations. Finally, ablation studies are presented to show the effectiveness of each component in the proposed framework.

Datasets. We evaluate our detection approach on the AIZOO dataset [1], WiderFace [81], and the MAFA dataset [20] to verify the effectiveness of our "CenterFace" module against the baselines. The AIZOO dataset was introduced when the pandemic had just broken out, in the hope that people around the world could defeat the pandemic as soon as possible.

Baselines. To the best of our knowledge, there is only one published paper focusing on face mask detection with available source code. Specifically, for the AIZOO dataset, the authors verify their dataset by deploying the structure of SSD [46] with a light-weight backbone network that contains 24 layers, of which 8 layers are convolution layers. The input size is set to 260 × 260, while the total number of parameters is only 1.01M.

Evaluation Metrics. We evaluate the above approaches by calculating the precision and recall metrics for the face detection model and the masked-face classifier, defined as $\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$, where $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.

In the experiments, we follow the evaluation protocol in [79] and use ResNet-18 [29] as the backbone network. The model is trained on an input image resolution of 520 × 520 with a batch size of 10 using the Adam optimizer [36]. The learning rate is $1.25 \times 10^{-4}$ for 140 epochs, with the learning rate dropped by 10× at epochs 30 and 80, respectively. We implement the proposed CenterFace with PyTorch [56]. Then, to strengthen the effective attention of the feed-forward convolutional network, we add CBAM [78] to every single block of the CNN. The attention module not only improves the effectiveness of the CNN but also offers a way to reduce the computation and time consumption of the framework.

Performance Comparison on AIZOO Dataset. We evaluate the proposed model with two light-weight backbone networks, ResNet-18 [29] and MobileNet [30], which can be deployed on edge mobile devices. Table 1 reports the comparison.

Efficiency Analysis. To evaluate the efficiency of different approaches, we use FLOPs and MACs to measure the complexity of different models. As shown in Table 2, the results manifest that the number of parameters of our model with the ResNet-18 backbone is much smaller than that of the state-of-the-art model using ResNet-50, since the model size of ResNet-50 is greater than that of ResNet-18. Nevertheless, even against a deeper network, Table 3 still shows that a shallower network with the proposed triplet-consistency representation learning performs better than a deeper network without it. Moreover, when using the same MobileNet backbone, our result still outperforms the current state-of-the-art work. This is because of the efficiency of the anchor-free technique and the attention module utilized in this work, which save computation cost by enhancing the features with fewer parameters instead of directly increasing the number of layers.

Analysis on Qualitative Results. Fig. 4 demonstrates the qualitative detection results of our method on the face class, as well as the results of the baseline. The demonstrated cases show the difficulties and challenges of face detection, such as intra-class variation, occlusion, and multi-scale detection [89]. Here, we further use the WiderFace and MAFA datasets to show the results, and compare with the recent state-of-the-art detectors on the WiderFace and MAFA test-dev sets in Table 3. We also ablate the three objective loss functions of the total loss to evaluate their individual performance. Please note that the ResNet backbone here is utilized alone, without the attention mechanism, to certify the performance of the objective loss functions. The results show that our module outperforms the baseline without adding any of the objective losses. Afterward, we add the proposed losses one by one to observe their effects on the accuracy. This study shows that the proposed objective loss functions significantly boost the performance on the WiderFace dataset.
Attention Mechanism. In this study, we compare two different ways of arranging the channel and spatial attention sub-modules in the convolutional neural network architecture. Since each sub-module serves a different function, their order may affect the overall performance: for example, the spatial attention works locally, while the channel attention is applied globally. Indeed, our experiment in Table 3 shows that arranging the channel and spatial sub-modules in the other order yields lower accuracy. The reason is that, in the conventional arrangement, the channel attention focuses on "what" is an informative part, e.g., the coordinate part in the feature space, which is complementary to the spatial attention. The informative parts are then combined with the pooled maps computed by the spatial attention module to generate an efficient feature descriptor. Reversing the order of the attention sub-modules may make the descriptor fail to learn some crucial features, such as corner areas or flipped angles of objects.

Comparisons with Self-Supervised Learning. In addition, we conduct another experiment to demonstrate the performance of the consistency loss. Specifically, we use another widely-used self-supervised pretext task for face mask detection, i.e., rotation [22]. Furthermore, following previous work [80], we also apply other augmentation methods, such as grayscale and random crop, as self-supervision tasks. Table 3 shows that all the other self-supervision tasks decrease the accuracy of the framework on the WiderFace dataset while keeping a similar accuracy on the MAFA dataset. This is due to the characteristics of the datasets: the WiderFace dataset includes people at different angles and with large appearance diversity, while most images in the MAFA dataset contain small groups or single persons with or without masks. As such, the knowledge learned from these self-supervised tasks cannot further improve, and may even deteriorate, the performance. It is also worth noting that the objects differ from those used when these self-supervised methods were originally proposed: there, knowing the rotation angle means that the model understands the semantics of the objects, whereas the subjects in face mask detection are almost all people, so knowing the rotation does not mean that the model understands the semantic information. In contrast, imposing the consistency constraint on the prediction results of the original images and their horizontally-flipped counterparts simultaneously stabilizes the predictions and enlarges the training data. Therefore, the consistency loss facilitates the learning process.

Training Curve. The default number of training epochs is 140, with learning rate drops at epochs 30 and 80. We further train the model to 200 epochs to observe whether the performance improves. As shown in Figure 7, the results on the medium and hard sets of the WiderFace dataset further increase by 0.2% and 0.3%, while the result on the MAFA dataset increases by 0.2%. The improvement is relatively minor while requiring much longer training time. Therefore, to keep the performance of the algorithm without wasting computational resources, we suggest training the proposed model for 140 epochs.

Regression Loss. The regression loss is the distance loss, which enforces the distances between the anchor and the positive and negative pairs. The Smooth L1 loss is usually a common choice for classification, regression, and distance losses in object detection problems. Similar to the observations of previous works [65, 66], the L1 loss yields better accuracy at a fine scale than the Smooth L1 loss [39]. We compare the L1 loss with the Smooth L1 loss in Table 4. The results show that the L1 loss is slightly better than the Smooth L1 loss. This is because the L1 loss has a higher tolerance and is thus more robust to noise, whereas the Smooth L1 loss is smoother but more sensitive to outliers.
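For reference, the short sketch below contrasts the two regression losses compared above using the standard PyTorch implementations; the example values are arbitrary and only illustrate the piecewise behavior, not the paper's training setup.

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([0.0, 0.5, 3.0])
target = torch.zeros(3)

# L1 loss: absolute error |x|, with the same slope for small and large residuals.
l1 = F.l1_loss(pred, target, reduction="none")
# Smooth L1 loss (beta = 1): 0.5 * x^2 for |x| < 1, and |x| - 0.5 otherwise.
smooth = F.smooth_l1_loss(pred, target, reduction="none")

print(l1)      # tensor([0.0000, 0.5000, 3.0000])
print(smooth)  # tensor([0.0000, 0.1250, 2.5000])
```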
Table 5 shows the sensitivity test on different values of the loss weights $\lambda_{tri}$ and $\lambda_{con}$, which manifests that the accuracy changes only slightly with different hyperparameter weights. Therefore, according to the results, the two weights are set to 0.01 and 100.

To prevent massive infections of the coronavirus and reduce the overloading of healthcare, we propose a new framework named "CenterFace" to automate the monitoring of face mask wearing. "CenterFace" uses a context attention module to enable effective attention in the feed-forward convolutional neural network by adaptively refining its feature maps with the learned attention maps. Moreover, we further propose an anchor-free detector with Triplet-Consistency Representation Learning, which integrates the consistency loss and the triplet loss to deal with the small-scale training data and the similarity between masks and occlusions. Experimental results show that "CenterFace" outperforms the other state-of-the-art methods, and it is released as a public download to improve public health. In the future, we plan to study the problem of face recognition with masks and to jointly consider the privacy issues.

REFERENCES
Shape matching and object recognition using shape contexts
Yolov4: Optimal speed and accuracy of object detection
YOLOv4: Optimal Speed and Accuracy of Object Detection (arXiv)
Real-time implementation of face recognition system
Cascade R-CNN: Delving into high quality object detection
D2Det: Towards high quality object detection and instance segmentation
Realtime multi-person 2D pose estimation using part affinity fields
You Only Look One-level Feature
RepPoints V2: Verification meets regression for object detection
Face mask detection using transfer learning of InceptionV3
Second-order attention network for single image super-resolution
Histograms of oriented gradients for human detection
CenterNet: Keypoint triplets for object detection
Saccader: Improving accuracy of hard attention models for vision
A discriminatively trained, multiscale, deformable part model
Object detection with discriminatively trained part-based models
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free
Detecting masked faces using region-based convolutional neural network
Detecting masked faces in the wild with LLE-CNNs
Unsupervised representation learning by predicting image rotations
Unsupervised representation learning by predicting image rotations
Fast R-CNN
Rich feature hierarchies for accurate object detection and semantic segmentation
Object detection with grammar models
Discriminatively trained deformable part models
Bootstrap your own latent: A new approach to self-supervised learning
Momentum contrast for unsupervised visual representation learning
Deep residual learning for image recognition
MobileNets: Efficient convolutional neural networks for mobile vision applications
Squeeze-and-excitation networks
Consistency-based semi-supervised learning for object detection
Face detection for security surveillance system
RetinaMask: A face mask detector
Deep learning framework to detect face masks from video footage
Adam: A method for stochastic optimization
FoveaBox: Beyound anchor-based object detection
Face detection techniques: a review
CornerNet: Detecting objects as paired keypoints
CenterMask: Real-time anchor-free instance segmentation
Respiratory virus shedding in exhaled breath and efficacy of face masks
Masked face detection via a modified LeNet
Kaiming He, Bharath Hariharan, and Serge Belongie
Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection
Bidirectional attention-recognition model for fine-grained object classification
SSD: Single shot multibox detector
Object-centric learning with slot attention
Object recognition from local scale-invariant features
Distinctive image features from scale-invariant keypoints
A novel technique for automated concealed face detection in surveillance videos
Multi-objective matrix normalization for fine-grained visual recognition
Self-supervised learning of pretext-invariant representations
Deep residual learning for image recognition
Stacked hourglass networks for human pose estimation
Towards accurate multi-person pose estimation in the wild
PyTorch: An imperative style, high-performance deep learning library
You Only Look Once: Unified, real-time object detection
You Only Look Once: Unified, real-time object detection
YOLO9000: Better, faster, stronger
YOLOv3: An incremental improvement
Faster R-CNN: Towards real-time object detection with region proposal networks
Faster R-CNN: Towards real-time object detection with region proposal networks
FaceNet: A unified embedding for face recognition and clustering
Deep high-resolution representation learning for human pose estimation
Compositional human pose regression
Integral human pose regression
Deeply learned face representations are sparse, selective, and robust
DeepFace: Closing the gap to human-level performance in face verification
EfficientDet: Scalable and efficient object detection
FCOS: Fully convolutional one-stage object detection
Segmentation as selective search for object recognition
Attention is all you need
Rapid object detection using a boosted cascade of simple features
Robust real-time face detection
Residual attention network for image classification
Deep high-resolution representation learning for visual recognition
Learning human-object interaction detection using interaction points
CBAM: Convolutional block attention module
Simple baselines for human pose estimation and tracking
Knowledge distillation meets self-supervision
WIDER FACE: A face detection benchmark
RepPoints: Point set representation for object detection
Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection
Object detection with deep learning: A review
Distance-IoU loss: Faster and better learning for bounding box regression
Bottom-up object detection by grouping extreme and center points
Recover canonical-view faces in the wild with deep neural networks
Edge boxes: Locating object proposals from edges
Object detection in 20 years: A survey