title: ViDMASK dataset for face mask detection with social distance measurement
authors: Ottakath, Najmath; Elharrouss, Omar; Almaadeed, Noor; Al-Maadeed, Somaya; Mohamed, Amr; Khattab, Tamer; Abualsaud, Khalid
date: 2022-05-10
journal: Displays
DOI: 10.1016/j.displa.2022.102235

The COVID-19 outbreak has accentuated the need for an AI-based monitoring system that can check adherence to face mask wearing and social distancing. Taking advantage of existing video surveillance systems, a deep learning based method for mask detection and social distancing is proposed. State-of-the-art object detection and recognition models such as Mask RCNN, YOLOv4, YOLOv5, and YOLOR were trained for mask detection and evaluated on both existing datasets and a proposed video mask detection dataset. The obtained results achieve a comparatively high mean average precision. After mask detection, the distance between people's faces is measured. Furthermore, a new large-scale mask dataset built from videos, named ViDMASK, is introduced. It diversifies the subjects in terms of pose, environment, image quality, and subject characteristics, producing a challenging dataset. The tested models succeed in detecting face masks with high performance on the existing dataset, MOXA. On the ViDMASK dataset, however, the models are less accurate because of the complexity of the scenes and the number of people in each one. The ViDMASK dataset and base code are available at https://github.com/ViDMask/VidMask-code.git.

Abundant surveillance systems capture terabytes of video data, yet little real-time, fast inference is drawn from them. Several emerging technologies have been charted for social distance monitoring, crowd control, and mask detection. Thermal scanning, ultrasound, and wireless tracking are among those that need dedicated hardware [1]. Computer vision can be added to existing infrastructure, allowing for easier implementation and better visualization. Deep learning and machine learning models are widely used in this context to achieve efficient and autonomous results [13, 14]. Real-time data can be generated from computer vision systems [4-6]. This can support the human resources already present in making informed decisions, avoiding misinterpretations and overcoming the lack of efficiency in manual detection.

The contributions of this work are:
• Experimentation with mask detection methods using transfer learning on state-of-the-art object detection deep learning models: YOLOv4, Faster RCNN with a ResNet-101 backbone and FPN, YOLOv4-tiny, YOLOv5, and YOLOR.
• A comparison of the performance of these deep learning models on both datasets.
• A very large and challenging dataset created from more than 60 videos, containing more than 30,000 images of people wearing and not wearing face masks in diverse environments.

This paper provides a background study of mask detection using deep learning models and social distancing in Section 2. Section 3 details the proposed method, the datasets used, and a brief description of the ViDMASK dataset. Section 4 describes the experimental setup and evaluation metrics. Section 5 presents the results and compares them with the state-of-the-art literature. Section 6 concludes the paper.

One of the most effective ways to prevent COVID-19 infection is to wear masks [39].
Mask detection can be considered an object detection and recognition problem, where pre-trained models are leveraged for object detection and then trained for mask detection [23]. Several techniques have been proposed in the literature in which the authors employ hybrid machine learning and deep learning models for mask detection. They have shown that real-time, fast detection can be achieved with a deep learning model alone. The dataset, being central to deep learning models, was used for transfer learning on several state-of-the-art pre-trained models such as YOLOv3, SSD, and Faster R-CNN with a ResNet-50 backbone and FPN (Feature Pyramid Network) detection. They were evaluated on the mean average precision (mAP) scale for accuracy and frames inferred per second (FPS) for speed. Since the onset of COVID-19, mask detection has become a case study in several state-of-the-art works [8, 24, 26, 37]. Transfer learning on state-of-the-art models has been widely used, both for two-stage detectors and for single-stage detectors such as SSD with a MobileNet backbone, YOLOv3, and Inception-based networks. In [40], the authors used several object detection backbones to detect masks and measure social distance at different stages. However, a faster and more accurate model needs to be identified by leveraging current deep learning models for real-time use. A dataset with the required diversity, mixing crowded and uncrowded scenes with and without masks, is required for training and testing. Several datasets are available for face mask detection [32, 40], of which MOXA3k [32] was used in this paper for comparison. These datasets can be enhanced with a larger number of images of people in different poses, at different angles, and with different expressions, which can be extracted from videos. This can not only provide a better dataset but also improve model accuracy.

Social distancing is achieved by maintaining a certain distance between two people [12, 27, 28]. Several techniques can be used to calculate the distance between two persons and count the number of people in a region of interest [1, 9, 10, 25, 34, 40]. Deep learning has an essential impact on tools used for social distancing; several pre-trained models used for this purpose have shown their efficiency in social distance measurement, as person detection can be achieved through a pre-trained model [1, 2, 16, 30]. Hardware-based approaches for mobile crowd sensing, using schemes such as smartphone applications and IoT (Internet of Things) devices, have been used to detect social distancing and control crowds [7, 20, 33-35, 43]. The p-dispersion problem was proposed in [21] to solve the minimum-number-of-persons-in-a-place problem; however, it covers only one aspect of social distancing, namely the number of persons in an area. A virtual queue was implemented to thwart crowd gathering at queues, thereby maintaining social distance, using a machine learning model [22]. Mask RCNN was used to detect people in a frame, and frame parameters were then used to estimate the social distance between them [15]. It can be noted that in the state-of-the-art literature, social distancing and mask detection are performed as individual tasks, except in [40], where a social distancing tool was used. It would be more efficient if mask detection and social distancing were performed together in one pipeline, accomplishing both tasks at once.
The proposed model performs a two-step process: first, identifying the best-fitting open-source model for mask detection by performing transfer learning and comparing with the state-of-the-art literature, thereby exploring different deep learning models and their performance for mask detection. The best trained weights are then used to measure social distancing and identify high-risk and low-risk individuals, gaining two safety checks from one model to help prevent COVID-19. This paper also introduces the ViDMASK dataset, an image dataset created from more than 60 videos, which produces a more challenging set of data for mask detection. An analysis of the models on part of the new dataset is presented to evaluate their performance relative to the MOXA3k dataset. Social distancing is then measured with the best of these models by calculating the distance between the centroids of the bounding boxes produced by object detection.

The proposed methodology uses deep learning methods for transfer learning along with a state-of-the-art dataset. Introduced in this section is the new video dataset ViDMASK, with its description, as well as the deep learning models used for transfer learning. ViDMASK contains crowd and incidental images with and without masks. Five videos were used for experimentation from the 67 videos listed for training, testing, and validation in the initial experiment. The images extracted from the videos were shuffled after annotation. The chosen dataset contains densely crowded scenes, blurred scenes, and less crowded scenes with a majority of mask-wearing people. 20,000 instances of masks were found in the images, along with 2,500 non-mask-wearing people. The sources were YouTube and Pexel.com. The videos capture a natural environment, with people performing daily tasks or being interviewed, and therefore contain natural poses and expressions. The video dataset has over 67 videos of varying lengths, from 10 seconds to 1 minute. The videos consist of scenes from different locations and situations, with masked and non-masked people in varied poses, angles, and masks. The videos were clippings from YouTube channels, specifically TV channel clips on the COVID-19 situation, interviews related to the COVID-19 pandemic, and other situations that occurred during the pandemic where masks had to be worn and people were in public view. The dataset contains street scenes, press conference scenes, marches, general crowds, and typical busy-day scenes, originating from various locations around the globe. The videos are of different resolutions and were converted to frames, as sketched below; the faces of masked and non-masked people were then labelled with bounding boxes using CVAT [11] and the Roboflow tool for reviewing and visualisation. The bounding box coordinates are saved as .txt files as well as COCO JSON files, as annotations for masked and non-masked people. Table 1 describes the dataset, including the number of masked and non-masked individuals and dataset profiles. Figure (3) illustrates the annotated video mask dataset.
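As an illustration of the frame-extraction step described above, the following is a minimal sketch using OpenCV; the video path, sampling stride, and output directory are hypothetical placeholders, not part of the released ViDMASK tooling.

```python
import os
import cv2  # OpenCV for video decoding

def extract_frames(video_path, out_dir, stride=30):
    """Save every `stride`-th frame of a video as a JPEG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream (or unreadable file)
            break
        if index % stride == 0:
            name = os.path.join(out_dir, f"frame_{index:06d}.jpg")
            cv2.imwrite(name, frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Hypothetical usage: sample roughly one frame per second from a 30 fps clip.
n = extract_frames("videos/press_conference.mp4", "frames/clip01", stride=30)
print(f"saved {n} frames")
```

The extracted frames would then be annotated in CVAT or Roboflow and exported in the per-model annotation formats mentioned above.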
Each model has the dataset configured to its requirements in terms of annotation format and the image size needed to train the model. The following are the dataset configurations and architectures for each model used.

Object detection models that are optimal for detection need a larger input network size for smaller objects [17, 31, 38]. Multiple layers are needed to cover the increased input network size with a higher receptive field, and detecting multiple objects requires handling different sizes in a single image [3]. The following models satisfy some of these criteria and are used in this setup.

YOLOv3 is from the series of YOLO object detection models and detects multiple objects at a time. YOLOv3 consists of 53 convolutional layers, also called Darknet-53. For detection, it has 53 more layers, for a total of 106 layers [31]. Images are often resized in YOLO according to the input network size to improve detection at various resolutions. The objectness score represents the probability that an object lies inside the bounding box. The predicted probability is combined with the Intersection over Union, a measure of how well the predicted bounding box matches the ground-truth bounding box [3, 32].

YOLOv4 achieves better speed and accuracy than YOLOv3. It is a predictor with a backbone, a neck, dense predictions, and sparse predictions [3]. The neck enhances feature discriminability using methods such as the Feature Pyramid Network (FPN), Path Aggregation Network (PAN), and Receptive Field Block (RFB). The head handles the dense prediction using either a Region Proposal Network (RPN), YOLO, or SSD. For the backbone, Darknet-53 was found to be the most capable model [3]. YOLOv3 is commonly used as the head of YOLOv4. To optimise training, data augmentation happens in the backbone. YOLOv4 is proven to be faster and more accurate than YOLOv3 on the MS COCO dataset [3], and it runs on a single GPU, requiring comparatively little computation [3].

YOLOv4-tiny is a compressed version of YOLOv4 with a decreased network size and a lower number of convolutional layers in the CSPDarknet backbone. The YOLO layers are reduced to three and the anchor boxes for prediction are also reduced, enabling faster detection.

Mask R-CNN with FPN and a ResNet-101 backbone (FPN-R101) has a modular architecture: the input goes through a CNN backbone to extract features, which are used to predict region proposals. Regional features and image features are then used to predict bounding boxes [29]. The scalability of this model and its region proposal mechanism enable accurate detection [44].

YOLOv5 is a Python implementation of an improved version of YOLOv3, published in May 2020 by Glenn Jocher of Ultralytics LLC on GitHub [19]. It is an improved version of their well-known YOLOv3 implementation for PyTorch. Its design is similar to YOLOv4, incorporating techniques such as data augmentation, changes to activation functions, and post-processing on top of the YOLO architecture. It combines images for training and uses self-adversarial training (SAT), claiming accelerated inference [19, 45].

YOLOR [42] is a unified network that exploits implicit and explicit knowledge together. The implicit knowledge captures features of the deep layers, while the explicit knowledge is attained from labelled data.

The distance between two detected objects can be calculated by finding the centroids of their bounding boxes and then computing the Euclidean distance between them. Bounding boxes are rectangular, defined by four coordinates $X_{min}$, $Y_{min}$, $X_{max}$, and $Y_{max}$, with the four corners given by $(X_{min}, Y_{max})$, $(X_{max}, Y_{min})$, $(X_{min}, Y_{min})$, and $(X_{max}, Y_{max})$. The centroid $(C_x, C_y)$ of the rectangle can be calculated by Equation (1):

$$C_x = \frac{X_{min} + X_{max}}{2}, \qquad C_y = \frac{Y_{min} + Y_{max}}{2} \tag{1}$$

The distance can then be calculated as the Pythagorean distance between the centroids: given two detections, one with centroid coordinates $(C_{x_1}, C_{y_1})$ and the other with centroid coordinates $(C_{x_2}, C_{y_2})$, the distance $D$ between the two people is computed as in Equation (3):

$$D = \sqrt{(C_{x_1} - C_{x_2})^2 + (C_{y_1} - C_{y_2})^2} \tag{3}$$

Figure (4) illustrates how social distancing is calculated for monitoring, and algorithm (1) summarises the computational methodology for face-mask detection and social distance measurement.
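As a concrete illustration of Equations (1) and (3) and of the pairwise risk tagging described later, here is a minimal sketch; the pixel threshold, the box format, and the example detections are assumptions for illustration, not a reproduction of the paper's algorithm (1).

```python
import math
from itertools import combinations

def centroid(box):
    """Centroid of a bounding box (xmin, ymin, xmax, ymax); Equation (1)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)

def euclidean(c1, c2):
    """Euclidean distance between two centroids; Equation (3)."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])

def tag_risk(boxes, threshold_px=100.0):
    """Label each pair of detections high-risk when their centroid distance
    falls below a pixel threshold (assumed value, for illustration only)."""
    centroids = [centroid(b) for b in boxes]
    pairs = []
    for i, j in combinations(range(len(boxes)), 2):
        d = euclidean(centroids[i], centroids[j])
        pairs.append((i, j, d, "high-risk" if d < threshold_px else "low-risk"))
    return pairs

# Hypothetical detections (xmin, ymin, xmax, ymax) from a mask detector.
detections = [(10, 20, 60, 120), (70, 25, 120, 130), (400, 30, 450, 140)]
for i, j, d, risk in tag_risk(detections):
    print(f"person {i} - person {j}: {d:.1f}px -> {risk}")
```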
Mask detection and social distancing can be achieved through the use of transfer learning, enabling better and faster detection. A comparison of the latest deep learning models for mask detection is made using the evaluation metrics mAP and FPS, described in the following section. The most efficient model is used for social distance measurement to identify high-risk individuals given a distance threshold. Figures (2) and (3) illustrate the flow chart of the experimental setup of the proposed methodology for mask detection and social distancing.

The Moxa3K dataset is used for training and testing the different object detection models. The dataset contains around 600 images from the Kaggle dataset of medical masks.

YOLOv4: The pre-trained model uses a Darknet-53 backbone, which has 29 3x3 convolutional layers, a 725x725 receptive field, and 27.6M parameters; the next stage has Spatial Pyramid Pooling (SPP) and a Path Aggregation Network (PAN), and YOLOv3 is used for dense prediction. Both the Bag of Specials (BoS) and the Bag of Freebies (BoF) are included, covering techniques such as data augmentation and Mish activation. YOLOv4 is set up to run on a single GPU with CUDA 10.1 on a Tesla K80. A pre-trained model weight is used to custom-train the images on two classes, mask and no mask. For optimal results, the number of classes is set to 2, with the maximum number of training batches set to the number of classes x 2000. YOLOv4 training was run with a batch size of 64 and 16 subdivisions. The network size was set to 416x416, and the number of filters was defined by Equation (4):

$$filters = (num\_classes + 5) \times 3 \tag{4}$$

Here num_classes is the number of classes in the dataset, which is two. Training loss, precision, recall, and mAP are calculated during training; the mAP was calculated every 100 iterations and subsequently every 1000. FPS was calculated during inference on test images.

YOLOv4-tiny: YOLOv4-tiny was set up to train on a single GPU with CUDA 10.1 on a Tesla K80. The dataset was set up to custom-train a pre-trained model. The configuration file is similar to YOLOv4's, with the number of classes set to 2 for mask and no mask, and the number of filters determined by Equation (4). A batch size of 64 with 16 subdivisions and a network size of 416x416 was used for training. The performance metrics, training loss, precision, recall, and mAP were measured every 1000 iterations of training. FPS is calculated from the time taken for inference on test images, as sketched below.
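Since FPS is reported for every model, the following is a minimal sketch of how inference speed can be timed over a folder of test images; the detector callable and image directory are hypothetical placeholders, and the paper does not specify its exact timing code.

```python
import glob
import time
import cv2

def measure_fps(detect, image_dir, warmup=3):
    """Average frames-per-second of a detector callable over a folder of images.

    `detect` is any function taking a BGR image and returning detections;
    a few warm-up passes are excluded so one-off initialisation is not timed.
    """
    paths = sorted(glob.glob(f"{image_dir}/*.jpg"))
    images = [cv2.imread(p) for p in paths]
    if not images:
        raise ValueError("no test images found")
    for img in images[:warmup]:  # warm-up passes, not timed
        detect(img)
    start = time.perf_counter()
    for img in images:
        detect(img)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed

# Hypothetical usage with a dummy detector that does no real work:
if __name__ == "__main__":
    fps = measure_fps(lambda img: [], "frames/test")
    print(f"{fps:.1f} FPS")
```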
Mask RCNN with ResNet-101 backbone and FPN: For object detection and classification, a Mask RCNN with a Feature Pyramid Network (FPN) and a ResNet-101 backbone is used. It is GPU-accelerated using CUDA 10.1 and the PyTorch framework, and uses COCO validation parameters for inference and evaluation. Pre-trained weights were taken from a model zoo of Mask RCNN with FPN. The number of training iterations was set to 1000, and the number of classes was set to two. The number of groups in each layer is 32, with a width per group of 8 and a depth of 101 for the ResNet. The batch size per image is set to 16, matching the GPU computing capability of the training device. mAP, precision, recall, and AP were calculated during training; FPS was calculated during inference.

YOLOv5: YOLOv5 was set up to run on Torch 1.7 with CUDA 10.1 on a Tesla V100, configured for custom object detection using transfer learning with pre-trained YOLOv5 weights. It was set to train for 300 iterations, with the number of iterations reduced where this achieved a better mAP. The number of classes was set to 2, mask and no mask, in the configuration file. In addition to mAP, precision, recall, and AP were plotted.

YOLOR: YOLOR was set up to run on Torch 1.7 with CUDA 10.1 on a Tesla V100. A pre-trained model was used for transfer learning and trained for 10 epochs on the ViDMASK dataset. Noise and rotation were added to the dataset to improve detection accuracy. mAP, precision, and recall were measured for the model.

To obtain a fast and accurate model for real-time object detection, it is essential to measure model performance and compare models on common metrics. mAP and FPS are used here to choose the optimal model for mask detection: accuracy is measured in terms of detection quality, and speed in terms of frames inferred per second. To compute the mAP, precision and recall must first be computed, which in turn requires identifying the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A true positive is a detection labelled true and predicted true; a true negative is labelled false and predicted false; a false positive is labelled false but predicted true; and a false negative is labelled true but predicted false. In the context of mask detection, because a single image contains multiple objects and multiple classes, TP, FN, FP, and TN were all measured per detection based on the objectness of the detection resulting from the Intersection over Union measurement, as stated below.

The Intersection over Union (IoU) is a commonly used evaluation metric that estimates regression quality: the IoU between a predicted bounding box and its assigned ground-truth box [17]. It measures the overlap of the two boundaries, the real object boundary (ground truth) against the predicted object boundary. In object detection models, TP, FP, FN, and TN are defined by setting an IoU threshold. This paper uses a standard threshold of a 50 percent match to classify detections as true positives or false positives [29]: an IoU of 50% or more counts as a true positive (a correct detection), while an IoU below 50% is a false positive (a wrong detection). A false negative is counted for each object that is not detected.

Precision and Recall: Precision is the fraction of predictions that are correct, and recall measures how well all the actual positives are found. Equations (5) and (6) are used to calculate precision and recall [29, 36]:

$$Precision = \frac{TP}{TP + FP} \tag{5}$$

$$Recall = \frac{TP}{TP + FN} \tag{6}$$
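A minimal sketch of these computations follows, assuming axis-aligned boxes in (xmin, ymin, xmax, ymax) form; it mirrors Equations (5) and (6) and the 50% IoU rule, but is an illustration rather than the paper's evaluation code.

```python
def iou(a, b):
    """Intersection over Union of two boxes (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(predictions, ground_truths, iou_thresh=0.5):
    """Greedily match predictions to ground truths at a fixed IoU threshold.

    Each ground-truth box may be matched at most once; unmatched predictions
    are false positives, unmatched ground truths are false negatives
    (Equations (5) and (6)).
    """
    matched = set()
    tp = fp = 0
    for pred in predictions:
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(ground_truths):
            overlap = iou(pred, gt)
            if j not in matched and overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None and best_iou >= iou_thresh:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```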
The average precision can be calculated as the area enclosed under the precision-recall curve; it can be denoted as $AP = \int_0^1 p(r)\,dr$. For object detection models, the mean of the AP is taken over different recall values.

Mean Average Precision: mAP, or mean average precision, is the metric used to compare performance across object detection models; it jointly captures the classification and localization quality of the detections [29]. The ground truth of the object detection task, the class of each object, and the bounding box of each object serve as the inputs for calculating the mAP. The correctness of each detected bounding box is determined through the Intersection over Union (IoU), the ratio between the intersection and the union of the predicted box and the ground truth. A confidence threshold is kept to separate positive and negative boxes. From this, the mAP is calculated at a fixed confidence threshold and IoU value, typically 50% for Pascal VOC datasets. Different confidence thresholds are chosen such that recall varies from 0 to 1, which makes AP the mean of the precision values at the different recall levels. mAP is then the average of the AP over all classes in the dataset [3, 29].

Within the experimental setup for each deep learning model, true positives (TP), false positives (FP), recall, precision, average precision, mAP, and FPS were plotted. Each model achieved object detection at a different level of efficiency, with varying precision and recall depending on the type of images. With the requirement of optimal speed and accuracy for object detection in mind, Table 2 and Table 3 summarize the experimental results of training the models. Mask RCNN with FPN achieves high precision and recall, with a 74.717% mAP at 50% IoU. YOLOv4 has a higher average precision for masks and non-masks than the other models. Table 3 compares results on the MOXA3k dataset against the state-of-the-art literature across 7 models for mask and non-mask detection; Figure (8) illustrates sample inference results.

The initial experimental setup included the evaluation of MOXA3k and ViDMASK with new and improved models, trained on YOLOv4, YOLOv4-tiny, YOLOv5, and Mask RCNN with FPN, and the identification of the optimum model for mask detection using the metrics mAP and FPS. With comparatively poor performance noted, as seen in Table (4), a later and more accurate model according to the COCO evaluation parameters [42], YOLOR, was also experimented with. A pre-processing step added noise and rotation for more diversity and better accuracy. The following describes the training process and setup of each model.

YOLOv4 was trained for custom object detection using pre-trained weights on the mask dataset with two labelled classes, mask and no mask. Figure (6) depicts the training loss: the x-axis shows the number of training iterations, and the y-axis shows the loss, which steadily decreases over the iterations. In addition, the mAP shows a steady rise with a few dips in between, due to the different sizes used in training. The maximum mAP was found to be 68%, slightly higher than the state-of-the-art literature value of 66.84% for YOLOv3.
On the ViDMASK dataset, the mAP starts at 46.3%, stays steady around that rate, and reaches 49.03%, which is comparatively lower than the other models. The training loss drops rapidly, as shown in Figure (7). Figure 8(a) and Figure 9(d) illustrate the inference of YOLOv4 on several types of images from the MOXA3k dataset. Table (2) shows the precision and recall at the end of 4000 iterations, and Table (3) reports the results on the ViDMASK dataset after 300 iterations.

YOLOv4-tiny produced faster predictions than the other models, with an FPS rate of 139.314. At this rate, the speed of detection can be improved over the other models for real-time detection. With the light architecture and reduced computational complexity of YOLOv4-tiny, it can be used on computationally constrained devices such as mobile phones, achieving acceptable mAP at high speed. Figure 8(b) shows YOLOv4-tiny predictions on MOXA3k dataset images with bounding box results.

The Mask RCNN with FPN architecture proved far superior in accuracy, with 74.717% AP at 50% IoU. This technique gives comparatively more accurate predictions than the state-of-the-art literature and YOLOv4. However, Mask RCNN with FPN has a low FPS rate compared to the other models, which explains its slow processing of frames and makes it inadequate for real-time, fast object detection. Figure 8(c) and Figure 9(c) illustrate its inference and predictions on several types of images from the MOXA3k and ViDMASK datasets.

Model (input size)            mAP@50 MOXA   mAP@50 ViDMASK   FPS
YOLOv3 414x414 [32]           63.99%        -                21.2
YOLOv3 608x608 [32]           66.84%        -                10.9
YOLOv3 832x832 [32]           61.73%        -                6.9
YOLOv3-tiny 414x414 [32]      56.27%        -                138
YOLOv3-tiny 608x608 [32]      55.08%        -                72
YOLOv3-tiny 832x832 [32]      56.57%        -                46.5
SSD 300 MobileNetv2 [32]      46.52%        -                67.1
F-RCNN 300 Inceptionv2 [32]   60.           -                -

Figure 8(d) and Figure 9(a) illustrate the inference and predictions on several types of images in the MOXA3k and ViDMASK datasets. The precision and recall plots increase with training time. YOLOv5 has a notably high FPS compared to similar architectures, achieving similar accuracy at higher FPS.

YOLOR achieves a higher mAP of 92.1% with an inference time of 16.8 ms. Figure 9(e) shows the result of testing on the ViDMASK dataset, and Figure (9) illustrates the mAP, precision, and recall graphs. Within a few epochs, a higher mAP was achieved compared to the other models.

Comparing precision, recall, mask AP, and non-mask AP for the models on ViDMASK, the highest precision and recall are evident for YOLOR, as seen in Table (3), and it produced balanced mask and non-mask AP. It should be noted that data augmentation in the form of rotation and noise was applied to ViDMASK for YOLOR, whereas only resizing was performed for the other models. For all the other models, the imbalance in the dataset shows clearly in the non-mask AP results, which are negligible compared to the mask AP. The impact of data augmentation is thus an added advantage behind YOLOR's high accuracy, apart from its complex architecture; the augmentation step is sketched below.
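A minimal sketch of the noise-and-rotation augmentation applied for the YOLOR experiments follows; the Gaussian noise level and rotation range are assumed values for illustration, as the paper does not state the exact parameters.

```python
import random
import cv2
import numpy as np

def augment(image, max_angle=15.0, noise_sigma=10.0):
    """Rotate an image by a random angle and add Gaussian pixel noise.

    Note: in a real detection pipeline the bounding-box annotations must be
    rotated along with the image; this sketch only covers the pixels.
    """
    h, w = image.shape[:2]
    angle = random.uniform(-max_angle, max_angle)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, matrix, (w, h))
    noise = np.random.normal(0.0, noise_sigma, rotated.shape)
    noisy = np.clip(rotated.astype(np.float32) + noise, 0, 255)
    return noisy.astype(np.uint8)

# Hypothetical usage on one extracted frame:
img = cv2.imread("frames/clip01/frame_000000.jpg")
if img is not None:
    cv2.imwrite("frames/clip01/frame_000000_aug.jpg", augment(img))
```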
Table 3 compares the state-of-the-art literature results with those of the models used in this paper. According to the most recent literature and our experiments, all of the new and current models achieved nearly equal or better mAP scores for detecting objects in the MOXA3K and ViDMASK datasets. YOLOv4, YOLOv5, and Mask RCNN with FPN achieved better mask and no-mask detection accuracy than the state-of-the-art literature on the MOXA3K dataset. Those models performed worse on the ViDMASK dataset, which instead achieved high accuracy with YOLOR. In terms of FPS, YOLOv4-tiny is very close to YOLOv3-tiny, but with better speed and accuracy. With an mAP of 65.5% and an FPS rate of 83.3 on the MOXA3k dataset, YOLOv5 stands out in terms of optimal speed and accuracy for real-time object detection. Explicit and implicit learning raised YOLOR's detection mAP to 92.4% on the ViDMASK dataset. Compared to the two-stage detector Mask RCNN, which provided more accurate findings, one-shot detectors such as YOLOv5 and YOLOv4 produced faster inference. Two-stage detectors are more sophisticated models, which introduces inference delays and makes them less preferable for real-time use; the trade-off between speed and precision is evident here. As can also be observed, a larger dataset does not always yield better results; what matters is the quality of the data and the suitability of the model for it.

YOLOv5 was found to be the optimal model, with its combination of accuracy and speed of detection in terms of mAP and high FPS, and was then used for social distancing to classify individuals as high-risk or low-risk based on the pixel distance between them. Figures (10)(a) and (b) illustrate the social distance labelled in yellow for high-risk individuals and in red for low-risk individuals. This achieves a two-step process using the same model, enhancing COVID-19 prevention by combining social distancing with mask detection. The line indicates the distance between two individuals. Each individual can be marked both high-risk and low-risk depending on proximity: a first person can be high-risk with respect to a second person who is closer than 3 m, while a third person can be far away and consequently marked low-risk towards them. The indication is thus the risk distance between individuals. This method further suggests that maintaining personal distance and restricting the number of people in an area are both required for proper social distancing, and it justifies and identifies the essential requirement of social distancing. Performing the social distance measurement as part of the pipeline reduces system overhead, as a negligible change in FPS was noted when social distancing was added.

An efficient system for social distancing and mask detection was chosen through experimental analysis and comparison to the state-of-the-art literature on two datasets, MOXA3K and ViDMASK. Transfer learning was performed on five deep learning models. YOLOv5 was found to be efficient in terms of mAP and FPS for mask detection compared to the state-of-the-art literature, and its weights are used for social distance measurement. The efficiency of social distance monitoring is thereby tied to the efficiency of face mask detection, achieving intelligent mask detection and social distancing surveillance. A new, diverse, and challenging video dataset, ViDMASK, containing more than 60 videos for face mask detection, was used for training and inference. This dataset produced high mean average precision for YOLOR with added noise and image rotation, while remaining a challenge for YOLOv4, YOLOv5, and Mask RCNN with a ResNet-101 backbone.
ViDMASK can further be explored for mask detection and social distancing on the edge by compressing and quantizing the models, or by identifying lighter models with less computational complexity that require only CPU usage. Besides the evident requirement of the COVID-19 situation, this work can be applied to other epidemics that require masks, or to environments that demand proper social distancing and mask wearing, such as laboratories that may contain contaminants. Because of the different camera angles used, the dataset is not limited to a surveillance perspective; the model can also be mounted on on-ground devices such as robots. The dataset can further be used for crowd-counting applications for crowds wearing masks, as some of the videos show crowded scenes. Domain adaptation, a deep learning training process that adapts a model to specific domains, can improve the model for diverse purposes while requiring less computational complexity. Further, this can be used as a tool for social experiments in which the effectiveness of masks and social distancing can be confirmed.

References

[1] Person detection for social distancing and safety violation alert based on segmented ROI
[2] A deep learning-based social distance monitoring framework for COVID-19
[3] Optimal speed and accuracy of object detection
[4] Pose-invariant face recognition with multitask cascade networks
[5] BEMD-3DCNN-based method for COVID-19 detection
[6] A review of deep learning-based detection methods for COVID-19
[7] Mobile crowdsensing approaches to address the COVID-19 pandemic in Spain
[8] Multi-stage CNN architecture for face mask detection
[9] The visual social distancing problem
[10] A proximity-based indoor navigation system tackling the COVID-19 social distancing measures
[11] Video annotation tools: A review
[12] Does social distancing matter?
[13] PM2.5 monitoring: Use information abundance measurement and wide and deep learning
[14] Deep dual-channel neural network for image-based smoke detection
[15] SD-Measure: a social distancing detector
[16] Social distancing detection with deep learning model
[17] A survey of deep learning based object detection
[18] Face mask detection using transfer learning of InceptionV3
[19] ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements
[20] An edge-based social distancing detection service to mitigate COVID-19 propagation
[21] Social distancing as a p-dispersion problem
[22] A queue management approach for social distancing and contact tracing
[23] A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic
[24] Facial mask detection using semantic segmentation
[25] Evaluation of boarding methods adapted for social distancing when using apron buses
[26] SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustainable Cities and Society 66
[27] A comprehensive survey of enabling and emerging technologies for social distancing, part II: Emerging technologies and open issues
[28] Epidemic spread over networks with agent awareness and social distancing
[29] A survey on performance metrics for object-detection algorithms
[30] A smart image processing system for hall management including social distancing (SoDiScoP)
[31] YOLOv3: an incremental improvement
[32] Moxa: A deep learning based unmanned approach for real-time monitoring of people wearing medical masks
[33] MySD: a smart social distancing monitoring system
[34] DeepDist: a deep-learning-based IoV framework for real-time objects and distance violation detection
[35] Large area pressure sensor for smart floor sensor applications: an occupancy limiting technology to combat social distancing
[36] A system for recognition of on-line handwritten mathematical expressions
[37] Face mask detection by using optimistic convolutional neural network
[38] EfficientDet: Scalable and efficient object detection
[39] A schlieren optical study of the human cough with and without wearing masks for aerosol infection control
[40] Real-time implementation of AI-based face mask detection and social distancing measuring system for COVID-19 prevention. Scientific Programming 2021
[41] You only learn one representation: Unified network for multiple tasks
[42] You only learn one representation: Unified network for multiple tasks
[43] Flocking for multiple subgroups of multi-agents with different social distancing
[44] Deep learning in diabetic foot ulcers detection: a comprehensive evaluation

This research work was made possible by research grant support (QUEX-CENG-SCDL-19/20-1) from the Supreme Committee for Delivery and Legacy (SC) in Qatar.