key: cord-0630050-enxqeuil
authors: Eyiokur, Fevziye Irem; Ekenel, Hazim Kemal; Waibel, Alexander
title: Unconstrained Face-Mask&Face-Hand Datasets: Building a Computer Vision System to Help Prevent the Transmission of COVID-19
date: 2021-03-16
journal: nan
DOI: nan
sha: 3c4b93b29c1c7b421f6657ca22b39083d2cb8e18
doc_id: 630050
cord_uid: enxqeuil

Health organizations advise social distancing, wearing a face mask, and avoiding touching the face to prevent the spread of coronavirus. Based on these protective measures, we developed a computer vision system to help prevent the transmission of COVID-19. Specifically, the developed system performs face mask detection, face-hand interaction detection, and measures social distance. To train and evaluate the developed system, we collected and annotated images that represent face mask usage and face-hand interaction in the real world. Besides assessing the performance of the developed system on our own datasets, we also tested it on existing datasets in the literature without performing any adaptation on them. In addition, we proposed a module to track social distance between people. Experimental results indicate that our datasets represent real-world diversity well. The proposed system achieved very high performance and generalization capacity for face mask usage detection, face-hand interaction detection, and measuring social distance in a real-world scenario on unseen data. The datasets will be available at https://github.com/iremeyiokur/COVID-19-Preventions-Control-System.

The COVID-19 pandemic has affected the whole world since the beginning of 2020. In order to decrease the transmission of the COVID-19 disease, many health institutions, particularly the World Health Organization (WHO), have recommended strict restrictions and preventive measures [1].
The fundamental precautions that individuals can take are to keep their distance from others (social distancing) [2], wear a face mask properly (covering the mouth and nose), pay attention to personal hygiene, especially hand hygiene, and avoid touching the face with unclean hands [1].

Convolutional Neural Networks (CNNs), introduced in the late 1980s [22, 37], have gained popularity during the last decade. Owing to the success of deep learning in computer vision, researchers have addressed the novel research topics that emerged from the COVID-19 pandemic with these methods. These studies focus on diagnosing COVID-19 [9, 15, 23, 29], adjusting already existing surveillance systems to COVID-19 conditions [5, 10, 12, 16, 19, 39], and building systems to monitor compliance with preventive measures [4, 6, 8, 11, 20, 26-28, 30, 31, 33, 39-41]. While some of the studies employ CT scans [9, 23] to diagnose COVID-19, others rely on chest X-ray images [15, 29]. The performance of face detection and recognition systems deteriorates when subjects wear face masks. Thus, recent face recognition and detection studies [5, 12, 19] try to improve performance under this condition. In addition, age prediction [16] has been investigated for faces wearing masks. Moreover, in order to track compliance with preventive measures against the spread of COVID-19, several works investigate whether a mask is worn properly [6, 8, 11, 20, 26-28, 30, 39, 40] and whether people keep their distance from each other [4, 30, 31, 33, 41]. In [39], a novel masked face recognition dataset is published to improve face recognition performance under occlusion caused by face masks. This dataset contains three subsets, which are the Masked Face Detection Dataset (MFDD), the Real-world Masked Face Recognition Dataset (RMFRD), and the Simulated Masked Face Recognition Dataset (SMFRD). In [8], an artificial masked face dataset named MaskedFace-Net is presented.
It contains 137016 images that are generated from the FFHQ dataset [21] using a mask-to-face deformable model. Joshi et al. [20] proposed a framework to detect whether people are wearing a mask in public areas. They utilized MTCNN [43] and MobileNetV2 [32] to detect faces and classify them on their own video dataset. In [19], a one-stage detector based on RetinaFace [14] is proposed to detect faces and classify whether they contain masks. In [28], the authors proposed a real-time face mask detector framework named SSDMNV2, which is composed of SSD [24] as a face detector and MobileNetV2 [32] as a mask classifier. A recent study [7] investigated face-hand touching behavior. In this study, the authors presented face-hand touching interaction annotations on 64 video recordings and evaluated the annotated 2M non-touching and 74K touching frames with rule-based, hand-crafted feature-based, and CNN learned feature-based models. In these evaluations, the CNN-based model obtained the best results with an 83.76% F1-score.

In this work, we present a computer vision system that monitors compliance with the preventive measures advised by health institutions. Specifically, the system detects whether people wear a face mask and whether they touch their faces, and it monitors social distance. To train and evaluate the developed system, we collected two novel face datasets, namely the Interactive Systems Labs Unconstrained Face Mask Dataset (ISL-UFMD) and the Interactive Systems Labs Unconstrained Face Hand Interaction Dataset (ISL-UFHD). These datasets are collected from the web to provide a significant amount of variation in terms of pose, illumination, resolution, and ethnicity. The system consists of three submodules, corresponding to the face mask detection, face-hand interaction detection, and social distance measurement tasks, respectively. We trained two separate CNN models to classify face images for the face mask and face-hand interaction detection tasks.
While the first model classifies the face image as wearing a mask properly, wearing a mask improperly, or not wearing a mask, the second model classifies face images as touching the face or not. The trained models were evaluated both on the collected dataset and on the existing face mask datasets in the literature without training or fine-tuning on them. We also proposed an approach to measure social distance. Our contributions can be summarized as follows: (1) We develop a computer vision system to help people follow the recommended protective measures (wearing a face mask properly, not touching the face, and keeping social distance) to avoid the spread of COVID-19. (2) We present two novel datasets, ISL-UFMD and ISL-UFHD, for the face mask and face-hand interaction detection tasks. ISL-UFMD is one of the largest face mask datasets and includes images with a significant amount of variation. ISL-UFHD is the first dataset that contains face-hand interaction images from unconstrained real-world scenes. (3) We extensively investigate several CNN models on our datasets. We also tested them on publicly available masked face datasets without performing adaptation, e.g., fine-tuning, on them to demonstrate the generalization capacity of our models. We achieved very high classification accuracies, which indicates the collected datasets' capability to represent real-world cases. Moreover, to evaluate the overall system, we utilized six different short real-world video recordings.

To train our system, we collected both masked-face images and face-hand interaction images. Recently published datasets on the tracking of COVID-19 preventive measures, which are presented in Table 1, mainly focused on collecting face mask images to develop a system that examines whether there is a mask on the face or not. Most of them contain a limited number of images or include synthetic images generated by placing a mask on the face using landmark points around the mouth and nose.
Besides, these datasets offer limited variety in subjects' ethnicity and in image conditions such as environment, resolution, and, in particular, head pose. For instance, in these datasets except MaskedFace-Net [8], Asian people are in the majority. Although MaskedFace-Net includes variation in terms of ethnicity, it consists entirely of images with artificial face masks. Besides, all face mask datasets cover a limited range of head poses, mostly from frontal to profile view along the yaw axis. These limitations led us to collect a dataset that overcomes all these drawbacks. Beyond face masks, there is only one dataset [7] in the literature that has recently been annotated to investigate face-hand interaction. However, these face-hand interaction annotations are also limited in the number of subjects. Moreover, the dataset is collected in an indoor environment under controlled conditions. In this study, on the other hand, we present a face-hand interaction dataset that is collected from unconstrained real-world scenes.

We collected a large number of face images from several different resources, such as publicly available face datasets (FFHQ [21], CelebA [25], LFW [18]), YouTube videos, and the web. These various sources enable us to collect a significant variety of face images in terms of ethnicity, age, and gender. In addition to the subject diversity, we obtained images from indoor and outdoor environments, under different lighting conditions and resolutions. We also took care to ensure large head pose variations. Moreover, another key point is to improve the performance of our COVID-19 prevention system in combined scenarios, e.g., determining mask usage when the face is being touched or detecting face-hand interaction when a mask is worn. Besides, our images include different sorts of occlusion that make the dataset more challenging.
In the end, ISL-UFMD contains 21316 face images for the face-mask detection scenario: 10618 face images with masks and 10698 images without a mask. Additionally, we gathered 500 images for improper mask usage. This class has a relatively small number of images compared to the no mask and mask classes due to the scarcity of face images with improper mask usage. ISL-UFHD is composed of face images that represent the interaction between the face and hand of the subjects. We collected 22289 negative samples (no face-hand interaction) and 10004 positive samples (face-hand interaction). Please note that, even if the hand is around the face without touching it, we annotated the image as no interaction. Therefore, the model should be able to distinguish whether the hand in the image is touching the face (or very close to the face) or not.

For labelling the collected datasets, we designed a web-based image annotation tool. We utilized crowd-sourcing to annotate each image and, after examining these annotations, we decided each image's final label. Since we formulated our tasks as classification problems, we annotated our images in that manner. While we have three classes (mask, no mask, improper mask) for the mask detection task, we have two classes for the face-hand interaction detection task. Images in which the mask does not fully cover the nose and mouth are annotated with the improper mask label. If a person has a mask under the chin, we annotated the image with the no mask label. In the face-hand annotation, we considered direct contact, or a hand too close to contact, as the existence of face-hand interaction. Many examples of annotated face images for the face mask and face-hand interaction detection tasks are shown in Fig. 1.

The proposed COVID-19 prevention control system is illustrated in Fig. 3. The proposed system consists of three submodules. Each module utilizes deep CNN models to obtain predictions.
The system performs person detection and calculates distances between detected subjects on the input image/video frame. Meanwhile, the same input is also used to detect and crop the faces of subjects to perform face mask and face-hand interaction detection. While the face mask model decides whether a person wears a mask (properly) or not, the face-hand interaction model identifies whether a hand touches the subject's face. We decided to perform person detection and face detection separately on the input image/video frame to eliminate the effect of a missing modality. For instance, even when a person's body is occluded and social distance cannot be measured for this person, the system can still detect the face of the corresponding subject to perform the face mask and face-hand interaction tasks. Similarly, if the subject's face is occluded or not turned to the camera, the system can still capture the person's body to perform the social distance task.

For the face mask and face-hand interaction tasks, we first performed face detection using the pretrained RetinaFace model [14] with a ResNet50 [17] backbone, which was trained on the large-scale Wider-Face dataset [42]. We used the RetinaFace detector since it is robust against tiny faces, challenging head poses, and faces with a mask. After detection, we cropped the detected faces with a 20% margin on each side, since the face detector's outputs are quite tight. To perform face mask and face-hand interaction detection, we employed several different CNN architectures, namely ResNet50 [17], Inception-v3 [35], MobileNetV2 [32], and EfficientNet [36]. We decided to use EfficientNet since it is a state-of-the-art model. We also included MobileNetV2 since it is a light-weight deep CNN model. Finally, we chose the ResNet and Inception-v3 models based on their high performance in the literature.
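The 20% crop margin mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_box`, the `(x1, y1, x2, y2)` box format, the choice to measure the margin relative to the box width and height, and the clamping to the image bounds are all assumptions, since the text only states that a 20% margin is added on each side.

```python
def expand_box(box, image_size, margin=0.20):
    """Expand a tight face box by `margin` of its size on every side.

    box        -- (x1, y1, x2, y2) in pixels (assumed format)
    image_size -- (width, height) of the frame
    margin     -- fraction of box width/height added per side (assumed
                  to be relative to the box, not to the image)
    """
    x1, y1, x2, y2 = box
    img_w, img_h = image_size

    # Margin in pixels, computed separately for each axis.
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin

    # Grow the box and clamp it so the crop stays inside the frame.
    new_x1 = max(0, int(round(x1 - dx)))
    new_y1 = max(0, int(round(y1 - dy)))
    new_x2 = min(img_w, int(round(x2 + dx)))
    new_y2 = min(img_h, int(round(y2 + dy)))
    return new_x1, new_y1, new_x2, new_y2


if __name__ == "__main__":
    # A 100x100 face box inside a 640x480 frame grows by 20 px per side.
    print(expand_box((100, 100, 200, 200), (640, 480)))
```

The returned coordinates can then be used to slice the frame before it is passed to the mask and face-hand classifiers.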
In training, we benefited from transfer learning and initialized our networks with the weights of models pretrained on the ImageNet dataset [13]. We employed softmax loss at the end of each network. In EfficientNet and MobileNetV2, we utilized dropout with a probability of 0.2 to avoid overfitting. We addressed the mask classification task as a multi-class classification problem (improper mask, proper mask, no mask). We handled face-hand interaction detection as a two-class classification task (interaction, no interaction). We aim to identify whether the hand touches the face using 2D images, without using predefined or estimated depth information. At first, the input data passes through the face detector to obtain the bounding box coordinates of the faces. Then, these are used to obtain face crops with suitable margins. Afterward, the face mask and face-hand interaction models are applied to the acquired face crops.

Keeping social distance from others is another crucial measure to avoid the spread of COVID-19. For this, we first detect each person in the image using a pretrained person detection model, DeepHRNet [38]. Thus, we obtain bounding boxes around the people and the estimated pose information of each person. Principally, we focus on the shoulders' coordinates to measure the approximate body width of a person in the projected image. In many studies, measurements are calculated based on the bounding box around the person. However, bounding boxes change with the angle of the body joints and the pose of the person, which may reduce the precision of the measurements. To prevent this, we propose to use the shoulders' coordinates to measure the width of the body and to take the midpoint of the shoulder line as the center of the body. After performing detection and pose estimation, we generated pairs based on the combinations of the detected persons, e.g., P(p_i, p_j).
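A minimal sketch of the pairwise social-distance check that this passage describes, under stated assumptions. The function names, the input format (one pair of shoulder keypoints per person, in pixels), and the use of the Euclidean distance between the two shoulder keypoints as the body width are illustrative choices. The threshold form, the coefficient times the average body width of the pair, is a reading of the description in the text, because the paper's Eq. 1 is not reproduced here. The coefficient value 3 is the one the paper reports for λ.

```python
import math
from itertools import combinations

LAMBDA = 3.0  # coefficient reported in the text


def body_center_and_width(left_shoulder, right_shoulder):
    """Return the shoulder midpoint and the shoulder width in pixels."""
    (lx, ly), (rx, ry) = left_shoulder, right_shoulder
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    width = math.hypot(lx - rx, ly - ry)  # assumed width definition
    return center, width


def find_violations(people, lam=LAMBDA):
    """Return index pairs (i, j) of people judged to be too close.

    people -- list of (left_shoulder, right_shoulder) keypoints,
              each keypoint an (x, y) tuple in pixels (assumed format)
    """
    stats = [body_center_and_width(left, right) for left, right in people]
    violations = []
    for i, j in combinations(range(len(stats)), 2):
        (ci, wi), (cj, wj) = stats[i], stats[j]
        # Euclidean distance between the two body centers.
        distance = math.hypot(ci[0] - cj[0], ci[1] - cj[1])
        # Per-pair adaptive threshold: lam times the average body width
        # (assumed form of the paper's Eq. 1).
        threshold = lam * (wi + wj) / 2.0
        if distance < threshold:
            violations.append((i, j))
    return violations


if __name__ == "__main__":
    people = [
        ((100, 200), (140, 200)),  # person 0, shoulder width 40 px
        ((180, 200), (220, 200)),  # person 1, shoulder width 40 px
        ((600, 200), (620, 200)),  # person 2, farther away, width 20 px
    ]
    print(find_violations(people))
```

Because the threshold scales with the apparent shoulder width of each pair, people who are far from the camera (and therefore small in the image) are compared against a proportionally smaller pixel distance.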
Then, we calculated the Euclidean distance between the shoulder midpoints of each pair of persons. In order to decide whether these persons keep social distance from each other, we adaptively calculate a threshold for each pair individually, based on the average of their body widths. Because the number of pixels that corresponds to a given real-world length changes with depth, we use the average body width of the two people as a local scale that maps real-world measurements to the pixel domain. Since the average distance between the shoulder points of an adult is around 40-50 cm in the real world and the required social distance between two persons is 1.5-2.0 meters, we empirically set the coefficient λ to 3 when calculating the threshold for social distance in the pixel domain, as in Eq. 1. Finally, if the Euclidean distance between two persons is lower than the calculated threshold, we decide that these people do not keep sufficient social distance.

We used publicly available datasets to evaluate the generalization capacity of our system and also compared our mask detection models with previous works. We used RMFD and RWMFD [39] (see footnote 1). In RMFD, the publicly available version includes around 2203 masked face images, although the paper indicates that there are 5000 face mask images. For RWMFD, we executed RetinaFace and obtained 5171 face images from 4343 images. The MaskedFace-Net dataset [8] contains 130000 face images with artificial masks. While half of the dataset (CMFD) has correctly worn face masks, the remaining half (IMFD) has incorrectly worn face masks. The Face-mask dataset (Kaggle) [3] contains 853 images, and we used the provided annotations to crop face images and obtain labels. In the end, we acquired 4080 face images.

We used 90% of the data for training; the remaining data is reserved equally for validation and testing.
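The 90% / 5% / 5% split stated above can be sketched as follows. This is only an illustration: the function name `split_dataset`, the fixed random seed, and the plain (non-stratified) shuffle are assumptions, since the text does not say how the samples were assigned to the three subsets.

```python
import random


def split_dataset(items, train_frac=0.90, seed=0):
    """Shuffle `items` and split them into train / validation / test.

    The part that is not used for training is divided equally between
    validation and test, matching the proportions stated in the text.
    The shuffle is not stratified by class (an assumption).
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # reproducible shuffle

    n_train = round(len(items) * train_frac)
    n_val = (len(items) - n_train) // 2

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(1000))
    print(len(train), len(val), len(test))
```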
We followed the same 90% / 5% / 5% split strategy for the face-hand interaction dataset. Additionally, before creating the train-validation-test splits, we put aside around 5000 images from the no face-hand interaction class to obtain a balanced dataset for face-hand interaction detection.

Footnote 1: https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset

In Table 2, we present various evaluation results using classification accuracy, precision, and recall. In the table, the first column indicates the employed CNN models, and the following columns give the evaluation results for face mask detection with these models. According to the experimental results in Table 2, although all employed models achieved high performance, the best one is the Inception-v3 model with 98.20% classification accuracy. In addition to the classification accuracy, we also present precision and recall measurements for each class separately to demonstrate the performance of the models on each class. In Table 2, although the precision and recall values are very high for the no mask and mask classes, the results for the improper mask class are slightly lower than for these two classes. Even though improper mask usage can be hard to discriminate from the (proper) mask class, the more probable reason behind this outcome is the lack of images for improper mask usage. In Fig. 4a, we present Class Activation Maps (CAM) [34] for the face mask detection task to investigate the activation of the model. It is clearly seen that the model focuses on the middle part of the faces, particularly on the nose and mouth. In the second image, the model identified improper mask usage since the nose of the subject is not covered by the face mask even though the mouth is covered. In Fig. 4c, we present some misclassified images for the face mask detection task. Although the model classified the images incorrectly, the prediction probabilities of the model are not as high as in correct predictions.
This outcome indicates that the model did not confidently misclassify images. Still, difficult head poses, illumination conditions, and occlusion caused misclassification in some cases.

In Table 3, we present cross-dataset experiments to investigate the effect of the datasets on the generalization capacity of the proposed model. First, we evaluated our MobileNetV2 and Inception-v3 models on four different public face mask datasets. Additionally, we fine-tuned the MobileNetV2 and Inception-v3 models with two different training setups to compare with our approach. The first setup contains 97842 images from the combination of the RMFD and RWMFD datasets [39]. We used them together since RMFD dataset has no improper [21] . We used the FFHQ dataset as no mask data due to the absence of a no mask class in the MaskedFace-Net dataset. While we selected the RMFD, RWMFD, MaskedFace-Net, and Face-mask (Kaggle) [3] datasets as targets for our model, we used the proposed ISL-UFMD dataset and the Face-mask (Kaggle) dataset as target datasets for the other models. Almost all the models that are trained on ISL-UFMD achieved more than 90% accuracy. These results indicate that our ISL-UFMD dataset is representative enough to provide well-generalized models for the face mask detection task. We employed two different architectures to corroborate this outcome. In comparison, the combination of RMFD and RWMFD provides accurate results, although they are not as high as our results. In contrast, the models that are trained on the MaskedFace-Net dataset show the worst performance. A possible reason for this outcome is that the artificial dataset is not as useful as real data for training.

In Table 4, we present the face-hand interaction detection results. As in the face mask detection task, all of the employed models achieved very high performance in discriminating whether there is an interaction between face and hand. The best classification accuracy, 93.35%, is obtained using the EfficientNet-b2 model.
The best recall and precision results are also achieved by the EfficientNet-b2 model. In Fig. 4b, we provide CAM [34] for face-hand interaction detection. It is clearly seen that the model focuses on the hand and the area around the hand to decide whether there is an interaction between the hand and the face. In Fig. 4d, we present some misclassified images for face-hand interaction detection. In the first image, although the model can detect the hand and the face, it cannot identify the depth between the face and the hand due to the position of the hand. In the second image, it is seen that there is an interaction between the face and the hands of someone else. For this example, the angles of the head and hands are challenging.

We utilized six different videos that we collected from the web to evaluate the proposed social distancing control module. These videos have different numbers of frames and were recorded in various environments with different camera angles. To calculate the accuracy of the social distance measurement algorithm, we utilized annotations that we decided based on the subject pairs and the distance between them. The person detector could not detect some of the subjects in the scene if they were not visible to the camera due to occlusion by other people or objects. For that reason, we ignored the missing detections when we annotated the videos' frames and calculated the accuracies. According to the results in Table 5, we achieved very high accuracies on average. However, the fundamental problem, which occurred especially in the last video, is caused by the lack of depth information. We project real-world distances to image pixels with a rule-based approach without using reference points. Therefore, depth perception can be problematic for specific angles. We evaluated the overall system performance on the same six videos and present the results in Table 5.
When we examined the face-hand interaction and face mask detection performance of our system, the results on videos that contain various people and cases indicate that the system can reach very high performance, similar to that obtained by the models on the individual test sets.

In this paper, we proposed a system to track essential COVID-19 preventive measures. We collected two unconstrained datasets, ISL-UFMD and ISL-UFHD, with high diversity. While we employed several different CNN-based models to perform the face mask and face-hand interaction detection tasks, we relied on a geometric calculation method to track the social distance between people. Experimental results showed that our proposed models achieved high performance with the help of our proposed datasets, since they contain a large amount of variation and represent various cases in real-world scenarios. The cross-dataset experiments indicate the generalization capacity of our proposed models on unseen data. The proposed system can be effectively utilized to track all of these preventive measures against the transmission of COVID-19.

Covid-19: physical distancing
Face mask detection
Gwanggil Jeon, and Sadia Din. A deep learning-based social distance monitoring framework for covid-19
Masked face recognition for secure authentication
Vitomir Štruc, and Simon Dobrišek. How to correctly detect face-masks for covid-19 from visual information? Applied Sciences
Analysis of face-touching behavior in large scale social interaction dataset
Maskedface-net-a dataset of correctly/incorrectly masked face images in the context of covid-19. Smart Health
Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography
Efficient transfer learning combined skip-connected structure for masked face poses classification
Face mask detection using transfer learning of inceptionv3
The effect of wearing a mask on face recognition performance: an exploratory study
Imagenet: A large-scale hierarchical image database
Retinaface: Single-shot multi-level face localisation in the wild
Covid-resnet: A deep learning framework for screening of covid19 from radiographs
Age detection with face mask using deep learning and facemasknet-9
Deep residual learning for image recognition
Labeled faces in the wild: Updates and new reporting procedures
Retinamask: a face mask detector
Deep learning framework to detect face masks from video footage
A style-based generator architecture for generative adversarial networks
Handwritten digit recognition with a back-propagation network
Using artificial intelligence to detect covid-19 and community-acquired pneumonia based on pulmonary ct: evaluation of the diagnostic accuracy
Ssd: Single shot multibox detector
Deep learning face attributes in the wild
Fighting against covid-19: A novel deep learning model based on yolo-v2 with resnet-50 for medical face mask detection
A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the covid-19 pandemic
Ssdmnv2: A real time dnn-based face mask detection system using single shot multibox detector and mobilenetv2. Sustainable cities and society
Automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks
Iot-based system for covid-19 indoor safety monitoring. preprint), IcETRAN
Deepsocial: Social distancing monitoring and infection risk assessment in covid-19 pandemic
Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR
Covid-robot: Monitoring social distancing constraints in crowded scenarios
Grad-cam: Visual explanations from deep networks via gradient-based localization
Rethinking the inception architecture for computer vision
Efficientnet: Rethinking model scaling for convolutional neural networks
Phoneme recognition using time-delay neural networks
Deep high-resolution representation learning for visual recognition
Masked face recognition dataset and application
Wearmask: Fast in-browser face mask detection with serverless edge computing for covid-19
A vision-based social distancing and critical density detection system for covid-19
Wider face: A face detection benchmark
Joint face detection and alignment using multitask cascaded convolutional networks

The project on which this report is based was funded by the Federal Ministry of Education and Research (BMBF) of Germany under the number 01IS18040A. The authors are responsible for the content of this publication.