key: cord-0509931-ooz0t695
authors: Khosravipour, Shayan; Taghvaei, Erfan; Charkari, Nasrollah Moghadam
title: COVID-19 personal protective equipment detection using real-time deep learning methods
date: 2021-03-27
journal: nan
DOI: nan
sha: 4f5021df819287f95e3b941b3f70b861f855000b
doc_id: 509931
cord_uid: ooz0t695

The exponential spread of COVID-19 in over 215 countries has led WHO to recommend face masks and gloves for a safe return to school or work. We used artificial intelligence and deep learning algorithms for automatic face mask and glove detection in public areas. We investigated and assessed the efficacy of two popular deep learning algorithms, YOLO (You Only Look Once) and SSD MobileNet, for detecting face masks and gloves and whether they are worn properly, trained on a dataset of 8250 images collected from the internet. YOLOv3 is implemented using the DarkNet framework, and the SSD MobileNet algorithm is applied for the development of accurate object detection. The proposed models have been developed to provide accurate multi-class detection (Mask vs. No-Mask vs. Gloves vs. No-Gloves vs. Improper). When people wear their masks improperly, the method assigns them to the Improper class. The introduced models provide accuracies of 90.6% for YOLO and 85.5% for SSD for multi-class detection. The systems' results indicate the efficiency and validity of detecting people who do not wear masks and gloves in public.

Today, AI has a crucial role in every aspect of the COVID-19 crisis response. AI comprises different techniques that are used as non-clinical approaches to mitigate the huge burden on health care systems. One of its roles is to provide an assisting tool that prevents the spread of the virus through automatic tracing and surveillance of people who do not wear masks and gloves in public areas. When the virus was at its early stages, researchers quickly provided the necessary information and data for academic purposes.
Among them were AI researchers, who worked to develop smart applications that overcome human limitations in fighting COVID-19 [1]. A notable example is a deep learning system that uses the DarkNet model as a classifier. It can classify COVID-19 cases from raw chest X-ray images with an accuracy of 98.08% [2]. Since prevention is better than treatment, a COVID-19 mask and glove detection system is a useful and feasible way to mitigate the spread of this virus.

The majority of countries in the world are going through a pandemic. At the time of writing this paper, more than 23 million COVID-19 cases have been confirmed in more than 215 countries, and the virus has caused more than 800 thousand deaths (https://www.worldometers.info/coronavirus). The virus's growth rate is exponential in more populated countries, as shown by the regression coefficient of the log-growth time-series data (i.e., the number of new people infected per day) [3]. The virus is mainly transmitted through human-to-human interaction [4]. It can spread through large droplets in the air when someone who carries the virus coughs or sneezes, or when a person touches the face with a hand exposed to the virus. The respiratory tract is the gateway for the virus to enter the human body [5], [6]. In addition to patients with visible symptoms, there are numerous asymptomatic cases with normal chest CT images and no self-reported fever who turned out to be carriers of the virus [7]. Unfortunately, due to the imbalance between supply and demand, some authorities attempt to downplay the importance of using masks [8]. While there are no randomized controlled trials (RCT) of masks as source control for SARS-CoV-2, numerous studies indicate that personal protective equipment (PPE), along with social distancing and personal hygiene, is necessary to prevent the virus from entering the body through infectious respiratory droplets and to help flatten the curve.
Furthermore, well-fitting gloves can keep the hands from picking up microorganisms during daily tasks and on contact with equipment or surfaces, whether or not these are known to be contaminated. As a result, the transmission rate of the virus drops, and the case-fatality rate (CFR) decreases [9], [10], [11]. One of the main reasons that morbidity and mortality are higher in men than in women is that women tend to use facial protection more than men [12]. Since the virus remains silent in the body, and the symptoms can go unnoticed for weeks, wearing protection is crucial to stop "Silent Spreaders" from transmitting the virus [13]. In a study undertaken in Hong Kong, findings from the Bonferroni-Dunn test with an adjusted level of significance indicate that older participants used a face mask less often than young people when they had respiratory symptoms. Also, citizens' intention in wearing a mask was more about protecting others than themselves [14]. Even if there are carriers of the virus in public areas, the virus's transmission rate becomes very low if they wear a face mask.

To further emphasize the importance of wearing facial protection as a barrier between the face and the virus, the commitment of citizens to wearing masks in public places played a great role in decreasing the spread of the contagious virus in Vietnam. In Brazil, by contrast, the lack of sufficient protection among travelers in the airports and inadequate attention to the WHO guidelines were the main reasons the virus entered and spread throughout the whole country [15]. All these statistics indicate that a positive outcome can be achieved when a high percentage of society collaborates and follows the safety guidelines [16]. In addition to masks, research comparing the surface stability of SARS-CoV-2 with that of SARS-CoV-1 has indicated that the virus can persist on surfaces for up to days, depending on the surface material [17].
Human hands come in contact with these unclean surfaces every day. When an individual does not wear gloves, the hands can be contaminated by the virus, which results in infection upon touching the face. In this regard, wearing masks and gloves is essential for the safety of healthcare workers (HCW) and other staff who are at high risk of getting infected and spreading the virus [18], [19]. To conclude, numerous studies suggest practical solutions to help reduce the spread of the virus, such as alerting the public and urging them to use personal protective equipment, which results in spread prevention as well as a decrease in public anxiety [20], [21]. Thus, people who are willing to visit public places during this severe pandemic should follow the safety guidelines if communities aim to be victorious in the battle against COVID-19.

In this research, a deep learning based method is proposed to track and supervise the proper enforcement of the health recommendations for preventing the spread of COVID-19. Deep learning is a branch of AI loosely modeled on the human brain, with many interconnected neurons. The word deep derives from the expansion of the network size, which is proportionate to the number of layers [22], [23]. Convolutional neural networks (CNN) produce state-of-the-art results on image and video data. A CNN consists of a series of convolutional layers. In convolutional layers, multiple kernels convolve with the input to produce a feature map; this layer function can be expressed using (1). $K_{ij}$ denotes the convolutional kernel, and $L_{input}$ and $L_{output}$ represent the number of input and output features, respectively. Then, this feature map passes through the activation function to introduce non-linearity into the output. In formula (2), Ψ could be any nonlinear function such as tanh or ReLU. Finally, an m × m filter with a stride n is applied to the input and outputs the maximum, minimum, or average value of each subarea; this operation is called pooling.
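
The three steps described above (convolution, activation, pooling) can be illustrated with a minimal pure-Python sketch. The function names and the nested-list representation are illustrative assumptions, not the authors' implementation. As in most deep learning frameworks, the sketch computes a cross-correlation (the kernel is not flipped), and it uses a single input channel and no bias term for brevity.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise non-linearity: negative responses are clipped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def pool2d(feature_map, m, n, reduce_fn=max):
    """Apply an m x m window with stride n; `reduce_fn` may be max, min, or a mean."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            reduce_fn([feature_map[r + i][c + j]
                       for i in range(m) for j in range(m)])
            for c in range(0, w - m + 1, n)
        ]
        for r in range(0, h - m + 1, n)
    ]


# A 4 x 4 feature map pooled with a 2 x 2 window and stride 2 shrinks to 2 x 2.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = pool2d(feature_map, m=2, n=2)
print(pooled)  # [[6, 4], [8, 9]]
```

With a 2 × 2 window and stride 2, each spatial dimension is halved, so the layers that follow process a quarter as many activations.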
Pooling decreases the input's spatial size to reduce the number of parameters and the amount of computation in the network. The functionality of the pooling layer is shown in figure 1. CNNs differ in the number of convolutional layers, the pooling function, and other internal parameters chosen to accomplish a specific task such as object detection or classification. In this regard, FRCNN [24], Mask-RCNN [25], YOLO [26], and SSD [27] are the most well-known algorithms in the object detection area. Since real-time detection is crucial in this work, YOLOv3 and SSD MobileNet are selected. After training these two deep learning models on the custom dataset, their effectiveness has been compared. These models have a similar end-to-end architecture: they compute a feature map by running a convolutional network on the input image only once, which results in detecting objects in real time with good accuracy. YOLOv3 applies a new network, Darknet-53, whereas our SSD uses the MobileNet network [28]. In settings where GPU power is scarce, SSD is recommended. The experiments' results indicate the effectiveness and validity of the mentioned methods in supervising whether people observe the regulations on using protective equipment in public areas.

YOLO is one of the best object detection models; it detects objects in real time and provides a good trade-off between speed and accuracy. YOLO can retrieve contextual information about the classes because it observes the entire image during training and testing. In contrast, the R-CNN models need a separate stage to fetch the target region [29]. Finally, it generalizes well, since it can detect one object in various poses. There are three official versions of YOLO, and we have employed the latest version in this paper. A high-level diagram of an object detector is shown in figure 2. The whole system is composed of two major components: Feature Extractor and Detector.
When an image comes in, it passes through the feature extractor first, and feature embeddings are obtained at different scales. Then, these features are fed into the branches of the detector to get bounding boxes and class information. As shown in figure 3, Darknet-53 has 53 convolutional layers. It is much deeper than the old version, which has only 19 convolutional layers [31], and it performs detection at three different scales, improving accuracy by almost 9.8% at an image size of 608 × 608. However, YOLOv2 runs faster due to its lighter architecture. Moreover, Darknet-53 has residual (shortcut) connections that allow the gradients to flow through the network without passing through the activation functions. Adding Feature Pyramids in YOLOv3 has improved the model's accuracy in detecting small objects.

The stages of the YOLO detection model are shown in figure 4. In the first step, the model divides the image into an S × S grid. Then, the grid cell containing the center of an object's ground-truth bounding box is activated for detecting the object. Each grid cell is responsible for predicting B bounding boxes, their confidence scores, and C conditional class probabilities [30]. As with all other detectors, the performance of YOLOv3 decreases as the IOU threshold increases. The mean average precision at an Intersection over Union (IOU) of at least 50% is a metric used to judge an object detector's accuracy on a particular dataset. The ground-truth bounding boxes (hand-labeled bounding boxes) and the bounding boxes predicted by the detector model are the two main inputs used to measure an object detector's accuracy. The IOU metric is computed as the ratio between the area of overlap and the area of union of the two boxes. In figure 5, the thin rectangles are the ground-truth boxes, and the thick boxes are the objects detected by YOLOv3. As observed, an appropriate result is provided.
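
As an illustration of the IOU ratio described above, the following is a minimal sketch rather than the evaluation code used in this work. It assumes boxes given as corner coordinates (x_min, y_min, x_max, y_max); the helper names `center_to_corners`, `iou`, and `is_true_positive` are hypothetical, and the converter is included because YOLO-style boxes are parameterized by center, width, and height.

```python
def center_to_corners(x_center, y_center, width, height):
    """Convert a (center, size) box into (x_min, y_min, x_max, y_max)."""
    half_w, half_h = width / 2.0, height / 2.0
    return (x_center - half_w, y_center - half_h,
            x_center + half_w, y_center + half_h)


def iou(box_a, box_b):
    """Intersection over Union of two corner-format boxes."""
    # Corners of the overlapping rectangle (it may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted_box, ground_truth_box, threshold=0.5):
    """A detection counts as correct when its IOU reaches the threshold."""
    return iou(predicted_box, ground_truth_box) >= threshold


ground_truth = center_to_corners(5, 5, 10, 10)   # (0.0, 0.0, 10.0, 10.0)
prediction = (5, 5, 15, 15)
print(iou(prediction, ground_truth))             # 25 / 175, about 0.143
print(is_true_positive(prediction, ground_truth))  # False at threshold 0.5
```

Raising `threshold` (for example to 0.75) demands tighter alignment between the predicted and the ground-truth box, which is why detector scores drop as the IOU threshold increases.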
Consequently, YOLOv3 is one of the best state-of-the-art object detectors available, with an acceptable trade-off between accuracy and speed. In this paper, YOLOv3 has been used to solve the five-class detection problem. The detection kernel shape is 1 × 1 × 30, where $30 = (k_1 + k_2 + k_3) \times 3$:
• $k_1 = 5$: the number of classes (Mask, Improper, No-mask, Glove, No-glove).
• $k_2 = 4$: the bounding box attributes (x-center, y-center, width, height).
• $k_3 = 1$: the presence of an object.
Finally, 3 is the default value for the number of bounding boxes a cell on the feature map can predict.

Similar to YOLO, SSD detects objects in a single deep learning network. After producing prediction scores that give the likelihood of an object's presence in a bounding box, SSD adjusts the box to better fit the shape of the targeted object. Since SSD uses the predictions from various activation maps, it handles images of various sizes properly. Both YOLO and SSD apply non-maximum suppression in the final detection stage. One of the main characteristics of SSD is its capability in detecting larger objects. Furthermore, the performance of SSD is less sensitive to the quality of the feature extractor than that of Faster R-CNN and R-FCN. However, it does not perform very well on small objects [32]. In a comparison of SSD and YOLOv3 with a fixed object size, YOLOv3 outperformed SSD in both accuracy and speed [33]. In this study, the dataset comprises objects of all sizes. Thus, the detector should identify small, large, near, and far objects in the image. The objective of the MobileNet architecture, which is shown in figure 6, is to make neural networks lighter and portable for mobile and embedded applications, which is achieved primarily by depth-wise separable convolutions [28]. Like other object detectors, SSD has two major parts: backbone and head.
The backbone is a pre-trained image classification network used for feature extraction. Here, we have used the MobileNet architecture as the feature extractor because of its speed and adequate accuracy, which result from combining normal convolution with depthwise convolution. Depthwise (DW) convolution is performed independently for each of the input channels. It significantly reduces the computational cost by omitting convolution in the channel domain. The SSD head consists of one or more convolutional layers added to the MobileNet. The outputs are interpreted as the bounding boxes and classes of objects within the spatial area of the final layers' activations. The main difference between YOLO and SSD is that SSD uses multi-scale convolutional feature maps at the top of the network, but YOLO uses fully connected layers. Some extra augmentation, i.e., techniques that artificially change the size and other characteristics of a training image, such as converting RGB to grayscale, vertical flipping, 90-degree rotation, and adding Gaussian noise, has been applied to bring the SSD results closer to those of YOLO.

In this section, to evaluate the robustness of our proposed methods, our test dataset has been used to compare the SSD and YOLOv3 results. The dataset is a set of human images, with and without protective equipment, as explained in the following. A total of 8250 images were carefully selected and collected for our purpose. Google Images was the source for gathering the required data. Masks of various forms, colors, and styles exist in the dataset. Due to the increased popularity of face shields, pictures of people wearing face shields were also included, with the intent that they be detected as masks. The diverse dataset comprises images containing only a few people and images from crowded places, each with different backgrounds. As a result, objects of different sizes and qualities have been labeled in the annotation process.
After the hand-labeling process was complete, the number of objects per class is given in table 2; there are 34942 objects in total. Because the numbers of glove and no-glove objects were highly imbalanced, up-sampling by repetition was done to balance the dataset. For the test phase, after training on the data and producing the necessary weights, 2000 images were randomly selected from the dataset to test the methods' accuracy. The dataset can be divided into four main groups: ideal, masks but no gloves, improper mask-wearing, and poor hygiene. The ideal group is the people wearing both masks and gloves and, as a result, having perfect hygiene. The second group protects the face by wearing masks but does not wear gloves. The third group is those wearing their masks improperly, and the last group is the people not wearing any protection.

Table 2: Number of objects per class
No.  Type      Number of Objects
1    Mask      11455
2    Improper  385
3    No-mask   8450
4    Glove     3175
5    No-glove  11477

Finally, figures 9 and 10 show people who do not obey the public rules and those who wear the protection improperly. It is also worth mentioning that some selected images in the collected dataset include several classes.

mAP is a metric used for evaluating object detectors. It is the mean of the average precision (AP) over the classes. For better understanding, precision and recall are defined first:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

• TP = a correct detection whose IOU is greater than or equal to the threshold.
• FP = a wrong detection, or a detection whose IOU is less than the threshold.
• FN = no detection for a ground truth.
Precision and recall are always between 0 and 1. AP is defined as the area under the precision-recall curve, which is a plot of precision as a function of recall. We compute the AP for each class and average them at different IOU values.
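
The definitions above can be made concrete with a short sketch. This is an illustration under stated assumptions, not the evaluation script used for the reported results: the function names are hypothetical, each detection is assumed to be already marked as a true or false positive at a fixed IOU threshold, and the area under the precision-recall curve is computed with all-point interpolation, which is one common convention among several.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    `detections` is a list of (confidence, is_true_positive) pairs;
    `num_ground_truth` is the number of labeled objects of that class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Walk down the detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # All-point interpolation: precision at each recall level is replaced by
    # the best precision reached at that or any higher recall level.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy example: three ranked detections for a class with two labeled objects.
toy_detections = [(0.9, True), (0.8, False), (0.7, True)]
print(average_precision(toy_detections, num_ground_truth=2))  # 5/6, about 0.833
```

In the toy example, the false positive ranked second lowers the precision over the second half of the recall range to 2/3, giving AP = 0.5 × 1 + 0.5 × 2/3 = 5/6.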
However, the main metric for the results is mAP@0.5, the mean of AP at IOU = 0.5, under which a detected box that overlaps its ground truth by 50% or more is considered a true positive. The results of object detection using the YOLO and SSD methods are reported in tables 3 and 4, and figure 11 shows detection results on the collected dataset. Furthermore, figure 12 shows some misclassified results.

Table 4: SSD result on our dataset

Figure 11: Image pairs are shown in which the left image is for YOLO, while the right image is for SSD. As can be seen, SSD has some weakness in detecting small objects (class: Glove) compared to YOLO.

The precision of YOLO and SSD is compared by object size in table 5. As can be seen, SSD performs better in detecting bigger objects.

Table 5: Precision of YOLO and SSD by object size (columns: small, medium, large; rows: YOLO, SSD)

In this study, two different DNN object detection algorithms have been applied to detect properly worn face masks and gloves. We have compared two popular DNNs: SSD MobileNet and YOLOv3. Both of them were trained via transfer learning. The number of training iterations is selected based on the minimum loss. The results indicate that YOLO outperforms SSD MobileNet in terms of mAP. However, when the average recall is considered, SSD shows a better result. Also, as observed in tables 3 and 4, SSD obtained a better result at IOU = 0.75 than YOLO, demonstrating better alignment with the target. In public places, however, better alignment is not a crucial factor for this problem; the ability of a method to detect objects with acceptable accuracy is more critical, even with less precise alignment. So, YOLOv3 may be the more useful detection method. However, SSD might perform better on systems with low computational power because of its lighter architecture compared with YOLO. This work has the potential to operate at a large scale, and we are confident that it can contribute to a better life during this pandemic.
For future work, COVID-19 social distancing monitoring with person detection and tracking could be merged with this model to calculate the distance between people while taking into account whether personal protective equipment is used [34].

[1] Contributions of Latin American researchers in the understanding of the novel coronavirus outbreak: a literature review
[2] Automated detection of COVID-19 cases using deep neural networks with X-ray images
[3] Global expansion of COVID-19 pandemic is driven by population size and airport connections
[4] A comparison of COVID-19, SARS and MERS
[5] COVID-19: new insights on a rapidly changing epidemic
[6] Evidence for gastrointestinal infection of SARS-CoV-2
[7] Presumed asymptomatic carrier transmission of COVID-19
[8] Facial protection in the era of COVID-19: a narrative review
[9] Face masks against COVID-19: an evidence review
[10] Interim guidance for the use of masks to control seasonal influenza virus transmission
[11] Practical guidelines for infection control in health care facilities
[12] Coronavirus: Why men are more vulnerable to COVID-19 than women?
[13] Asymptomatic carriers of COVID-19 as a concern for disease prevention and control: more testing, more follow-up
[14] Practice and technique of using face mask amongst adults in the community: a cross-sectional descriptive study
[15] Severe airport sanitarian control could slow down the spreading of COVID-19 pandemics in Brazil
[16] Ho Chi Minh City: the front line against COVID-19 in Vietnam
[17] Aerosol and surface stability of SARS-CoV-2 as compared with SARS-CoV-1
[18] Transmission of COVID-19 to health care personnel during exposures to a hospitalized patient: Solano County, California
[19] Considerations in performing endoscopy during the COVID-19 pandemic
[20] Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2)
[21] COVID-19: Taiwan's epidemiological characteristics and public and hospital responses
[22] Deep learning
[23] ImageNet classification with deep convolutional neural networks
[24] Faster R-CNN: Towards real-time object detection with region proposal networks
[25] Mask R-CNN
[26] You only look once: Unified, real-time object detection
[27] SSD: Single shot multibox detector
[28] MobileNets: Efficient convolutional neural networks for mobile vision applications
[29] Ball detection using YOLO and Mask R-CNN
[30] YOLOv3: An incremental improvement
[31] YOLO9000: better, faster, stronger
[32] Speed/accuracy trade-offs for modern convolutional object detectors
[33] Automated identification of cephalometric landmarks: Part 1, comparisons between the latest deep-learning methods YOLOv3 and SSD
[34] Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and DeepSORT techniques