A Tiny CNN Architecture for Medical Face Mask Detection for Resource-Constrained Endpoints
Puranjay Mohan, Aditya Jyoti Paul, Abhay Chirania
Date: 2020-11-30. DOI: 10.1007/978-981-16-0749-3_52

Abstract. The world is going through one of the most dangerous pandemics of all time with the rapid spread of the novel coronavirus (COVID-19). According to the World Health Organisation, the most effective way to thwart the transmission of coronavirus is to wear medical face masks. Monitoring the use of face masks in public places has been a challenge because manual monitoring could be unsafe. This paper proposes an architecture for detecting medical face masks on resource-constrained endpoints with extremely low memory footprints. A small development board with an ARM Cortex-M7 microcontroller clocked at 480 MHz and having just 496 KB of framebuffer RAM has been used for the deployment of the model. Using the TensorFlow Lite framework, the model is quantized to further reduce its size. The proposed model is 138 KB post quantization and runs at an inference speed of 30 FPS.

1 Introduction

The sudden increase in computational capability and availability of data in the last few years has allowed intelligent systems to solve various problems involving computer vision, speech, etc. Traditionally, these models were deployed on servers with high compute and storage capabilities. With the rise of the Internet of Things and edge computing, the need to deploy these systems at the edge has grown. The major roadblock in edge deployment of deep neural networks is their very high computational and memory footprint. Image classification is one such problem where edge deployment is in high demand because of the many applications that rely on it. Face mask detection is a subset of image classification where the goal is to classify an image into two classes, i.e. Mask and No-Mask. The outbreak of the novel coronavirus has significantly impacted the livelihood of people across the globe [21], and effective deployment of face mask detection in public places can help in thwarting the transmission of the virus. Face mask detection is a non-trivial problem because of its high throughput, reliability, and privacy requirements; it cannot use traditional deployment methods where the image is first sent to a server for classification and the result is sent back for further use in the application.

One application scenario the authors of this research work envision is a small camera attached to an automatic door: the camera continuously takes images of the person standing in front of it, and the door opens only if the person is wearing a mask. Such an intelligent door can be used at all public places and will safely monitor the people entering public premises. One requirement of this smart door is that it should be cheap and consume little energy. The proposed model is small enough to fit inside the memories of the smallest and cheapest microcontrollers available in the market. Many other applications requiring high-speed mask detection at the edge can be envisaged as well.

TinyML is an emerging field at the intersection of hardware, software, and machine learning algorithms that is gaining massive traction. Recent developments in this field include building deep neural networks with sizes of a few hundred kilobytes.
This paper presents the process to train and deploy an innovative architecture on the OpenMV Cam H7 development board for detecting face masks using the small on-board camera. A major challenge with TinyML is that most microcontrollers do not have a floating point unit, and hence all mathematical computations need to work on integers. This leads to a smaller model along with a change in accuracy that is difficult to predict. Earlier studies focused on deployment on more powerful edge devices using much larger models, but this paper reports how the quantized CNN model compares to existing architectures in terms of model size, accuracy, and performance. The rest of the paper is organised as follows: Section 2 covers the literature review, Section 3 discusses the technical and hardware details, Section 4 explains the experimental methodology, Section 5 reports the observations and findings, and Section 6 concludes the paper, throwing some light on possible future research avenues in this field.

2 Literature Review

In this section, some prior advances in face mask detection and quantization, the primary facets of this research work, are reviewed. Due to the coronavirus pandemic, face masks have become an integral part of our society, and numerous face mask detection implementations based on Convolutional Neural Networks [15] have come forward. [4] proposed a two-stage detection scheme, the first stage being face detection and the second a face mask classifier. [16] proposed a hybrid deep transfer learning model with two components, the first for feature extraction using ResNet50 and the other for classification using SVM and other ensemble algorithms; their Support Vector Machine (SVM) classifier achieved a testing accuracy of 99.64%. RetinaMask [13] achieved state-of-the-art results on a public face mask dataset (2.3% and 1.5% higher precision than the baseline on face and mask detection respectively) using a one-stage detector with a feature pyramid network to fuse high-level semantic information with multiple feature maps. [19] presented a model with pre-trained weights of the VGG-16 architecture for feature extraction, followed by a fully connected neural network (FCNN) to segment out faces present in an image and detect face masks on them; the model showed strong results in recognizing non-frontal faces. [17] proposed a model using YOLO-v2 with ResNet-50 which achieves higher average precision by using mean Intersection over Union (IoU). [14] proposed a model based on transfer learning with InceptionV3; it outperformed recently proposed models by achieving a testing accuracy of 100% on the Simulated Masked Face Dataset (SMFD). [23] discussed the challenge of implementing object detection on edge devices, comparing various popular object detection algorithms like YOLO-v3, YOLO-v3tiny, Faster R-CNN, etc. to determine the most efficient algorithm for real-time detection of face masks.

Leveraging quantization techniques is necessary for implementing CNNs on resource-constrained devices. [3] introduced 4-bit training quantization on both activations and weights, achieving accuracies a few percent less than state-of-the-art baselines across CNNs. [20] proposed a method which quantizes layer parameters to improve accuracy over existing post-training quantization techniques. [25] proposed an outlier channel splitting (OCS) based method to improve quantization performance without retraining.
[5] discussed low-bit quantization of neural networks by optimizing constrained Mean Squared Error (MSE) problems for performing hardware-aware quantization. [11] proposed a quantization scheme along with a co-designed training procedure, concluding that inference using integer-only arithmetic performs better than floating-point arithmetic on typical ARM CPUs. [9] proposed Differentiable Soft Quantization (DSQ) to bridge the gap between full-precision and low-bit networks. The hybrid compression model in [8] uses four major modules, Approximation, Quantization, Pruning, and Coding, which together provide a 20-30x compression rate with negligible loss in accuracy. The research in [7], [6], and [24] proposed mixed-precision quantization techniques, where more sensitive layers are kept at higher precision.

3 Experimental Setup

This section explains the technical details of the experimental setup, including the hardware, the software, and the datasets. The hardware used in this research for edge deployment is the OpenMV Cam H7 [2], housing STMicroelectronics' STM32H743VI [1], an ARM Cortex-M7 based 32-bit microcontroller, along with a small camera. The microcontroller has a clock speed of 480 MHz, 1 MB of SRAM for various applications, and 2 MB of flash memory for non-volatile storage. The development board provides a MicroPython based operating system allowing easy deployment and on-device analytics of TF-Lite models. The documentation of the board suggests keeping the model under 400 KB, but during this study [22] it was found that the biggest model that can fit successfully in memory is under 230 KB. A larger model of up to 1 MB can be stored in flash memory, but for that the model has to be converted into a FlatBuffer using STM32Cube.AI and the operating system has to be recompiled, which sacrifices the convenience of MicroPython. The models were trained using Kaggle kernels with TPU (Tensor Processing Unit) acceleration enabled, 128 TPU elements per core with 8 such cores running in parallel, which made the training time extremely short.

Four datasets were used in this research work; their details can be seen in Table 1. The first dataset, from Kaggle [12], has around 11,792 images taken on different backgrounds and cropped to the face region. The images of this dataset were merged and interpolation augmentation was applied using OpenCV's interpolation methods, INTER_AREA, INTER_CUBIC, INTER_NEAREST, INTER_LINEAR, and INTER_LANCZOS4, to augment the images to 58,960. The second dataset, also from Kaggle [18], had 440 images taken on noisy backgrounds, equally divided into mask and without-mask images; it was augmented to 22,200 using standard augmentation followed by interpolation. The third dataset was produced by the authors using the OpenMV Cam H7 camera: images of size 200x200 were taken and saved on the SD card of the development board. This dataset had 1,979 images, which were augmented to 49,895 using the augmentation techniques discussed earlier. Some images from this dataset can be seen in Fig. 1. A fourth dataset, with 594 images augmented to 4,794, was also produced using the OpenMV camera; it was held out for testing the performance of the OpenMV Cam H7. The exact usage of this dataset is novel to this research work and is elaborated in Section 4. All images of each dataset were resized to 32x32, as this was found to be the optimal image size to fit in the framebuffer of the microcontroller. A sketch of the interpolation augmentation is shown below.
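The following is a minimal sketch of the interpolation augmentation described above, assuming images are read and written with OpenCV; the file names and the target-size handling are illustrative, and only the five interpolation flags come from the text. Each source image yields five variants, which matches the five-fold growth of the first dataset (11,792 to 58,960 images).

```python
import cv2

# The five OpenCV interpolation modes named in the text.
INTERPOLATIONS = [
    cv2.INTER_AREA,
    cv2.INTER_CUBIC,
    cv2.INTER_NEAREST,
    cv2.INTER_LINEAR,
    cv2.INTER_LANCZOS4,
]

def interpolation_augment(img, target=(32, 32)):
    """Return one resized variant of `img` per interpolation mode."""
    return [cv2.resize(img, target, interpolation=mode)
            for mode in INTERPOLATIONS]

# Hypothetical usage: one source image -> five augmented variants.
img = cv2.imread("face_0001.png")  # illustrative file name
if img is not None:
    for i, variant in enumerate(interpolation_augment(img)):
        cv2.imwrite(f"face_0001_aug{i}.png", variant)
```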
4 Methodology

This section explains the steps taken in this research work for data splitting, model design, evaluation, and model comparison. After merging, the datasets comprised 131,055 images in total; these included the images from the two Kaggle datasets [12], [18] and the dataset produced with the OpenMV camera. The fourth dataset was held out and used only for testing. This regime of holding out a separate dataset for testing is not usually followed, but was considered imperative in this research work to evaluate the generalization achieved by the models running on the microcontroller. The usual regime of combining everything and then taking out a small portion for testing would not show the true model performance in the target edge scenario, because the train and test sets would then share the same distribution, unlike the images encountered in deployment.

After experimenting with different architectures and comparing their size and performance, the CNN architecture shown in Fig. 2 was found to be the best under the RAM constraints of the device. This model has 128,193 trainable parameters, and full integer quantization reduces it to 138 KB. SqueezeNet [10] was chosen for comparison with the proposed model because of its small size. A smaller version of SqueezeNet, called Modified SqueezeNet in this work, was also built by removing two fire modules and used for comparison. All three models were designed in TensorFlow 2.3.

Binary cross-entropy, shown in Eq. (1), was chosen as the loss function because the problem involves binary classification of images:

L(y, p) = -[y log(p) + (1 - y) log(1 - p)]    (1)

where y is the true label and p is the predicted probability. Adam was chosen as the optimizer with a learning rate of 0.001, first moment decay of 0.9, second moment decay of 0.999, and an epsilon of 10^-7. The ReduceLROnPlateau callback was used to reduce the learning rate by a factor of 0.2 when the validation accuracy did not improve for 5 epochs, and the ModelCheckpoint callback was used to save the best weights to a file.

TensorFlow-Lite's full integer quantization was used to convert all three models from float32 to int8 precision. This procedure uses a representative dataset, built here from a small part of the test set, to calibrate the dynamic range of the activations. All three models were evaluated using the TensorFlow-Lite Interpreter: the quantized models were loaded into the Interpreter and the OpenMV test set was used to compute the classification metrics. On-device evaluation was not possible for SqueezeNet and Modified SqueezeNet because both were bigger than 230 KB. The proposed model was loaded onto the OpenMV Cam, where a script took images of size 200x200, scaled them to 32x32, and normalized them before feeding them to the model. All images and predictions were saved on the SD card and later used for the analysis. A sketch of the quantization and evaluation steps is given below.
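The following is a minimal sketch of the full integer quantization and Interpreter-based evaluation described above, assuming a trained Keras model `model` and a list `rep_images` of calibration images drawn from the test set (both names are illustrative). Input and output tensors are left in float32 here, in line with the observation in Section 5 that this gives the best results on FPU-equipped devices.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield samples one by one so the converter can calibrate the
    # dynamic range of each activation tensor.
    for img in rep_images:
        yield [np.expand_dims(img.astype(np.float32), axis=0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to integer-only kernels; weights and activations become int8,
# while inputs/outputs stay float32 (no inference_input_type override).
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()
with open("mask_detector_int8.tflite", "wb") as f:
    f.write(tflite_model)

# Evaluating the quantized model with the TF-Lite Interpreter, as in Sec. 4:
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
sample = np.expand_dims(rep_images[0].astype(np.float32), axis=0)
interpreter.set_tensor(inp["index"], sample)
interpreter.invoke()
pred = interpreter.get_tensor(out["index"])  # probability of the Mask class
```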
5 Results and Analysis

This section presents the results and analysis, along with a comparison of the proposed model against SqueezeNet and Modified SqueezeNet. The methodology discussed in Section 4 was followed to train the SqueezeNet model. Its training accuracy reached 99.79%, with a test accuracy of 98.50% for the float32 model and 98.53% for the int8 model. The size of the float32 model was 8 MB, which shrunk to 780 KB post quantization; the details can be seen in Fig. 3. Modified SqueezeNet was trained in a similar way. The size of its float32 model was 3.84 MB, reduced to 386 KB after quantization, and its test accuracies were 98.93% and 98.99% for float32 and int8 respectively; the details can be seen in Fig. 4. The proposed model reached a training accuracy of 99.79% and achieved testing accuracies of 99.81% and 99.83% for the float32 and int8 models respectively. Its 1.52 MB float32 model was reduced to 138 KB post quantization. The proposed model thus outperformed SqueezeNet and Modified SqueezeNet in both accuracy and size; the details can be seen in Fig. 5.

On comparing SqueezeNet and Modified SqueezeNet, it was observed that the modified version, which had two fire modules removed, generalized better than the original model, keeping the precision constant. Thus it was observed, "On resource-constrained endpoints, smaller models sometimes outperform bigger ones in generalizing to new data." It was also observed, "On devices with Floating Point Unit (FPU) support, keeping inputs and outputs as float32 gives best results." The proposed model, despite being the smallest one, achieved the highest accuracy amongst all three. Since the int8 accuracy is slightly higher than the float32 accuracy for all the experimental models, the following conclusion can be drawn: "Int8 appears to generalize better than float32 for small models." Table 2 shows the size and accuracy comparison of the three models, and Table 3 illustrates the compression comparison. The proposed model, despite having the smallest size, achieved the highest accuracy, precision, recall, and F1 score, as can be seen in Tables 4 and 5, representing the int8 and float32 models respectively. Some of the model predictions can be seen in Fig. 6.

Even small CNNs may overfit when solving problems like binary classification, hence aggressive regularization is required to increase their generalization accuracy. In this research, dropout was used after every layer, and it made a significant difference in the test accuracy achieved. Observing the proposed model's architecture in Fig. 2, it was found that, "Dropout added after every layer seems to significantly improve the generalization of smaller models." Interpolation augmentation, as suggested in [22], was used for the proposed model, improving generalization and corroborating the statement, "Interpolation Augmentation seems to improve generalization for resource-constrained endpoints." An illustrative sketch of this dropout-everywhere pattern is given below.
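As an illustration of the pattern above, the following sketch shows a tiny 32x32 binary classifier with dropout after every layer. The layer widths and dropout rate are assumptions made for illustration only; this is not the exact architecture of Fig. 2, and only the input size, loss, and optimizer settings come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_mask_cnn(input_shape=(32, 32, 3), drop=0.3):
    """Illustrative tiny CNN with dropout after every layer."""
    return tf.keras.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=input_shape),
        layers.Dropout(drop),
        layers.MaxPooling2D(),
        layers.Dropout(drop),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Dropout(drop),
        layers.MaxPooling2D(),
        layers.Dropout(drop),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(drop),
        layers.Dense(1, activation="sigmoid"),  # binary: Mask / No-Mask
    ])

model = tiny_mask_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])
```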
6 Conclusion and Future Work

In this research work, an extremely small and well-generalizable CNN based solution has been proposed for face mask recognition on edge devices with extreme resource constraints. The solution has been deployed on a microcontroller development board called the OpenMV Cam H7. The model is just 138 KB in size, runs at 30 FPS on the board, and has a test accuracy of 99.83%. It has been shown that aggressive regularization through dropout might be useful for developing extremely generalizable CNN architectures for problems like binary classification. The method proposed in this paper is universal and applicable to any microcontroller architecture, and the methodology used in this research work can also be used to build and deploy architectures for other challenging problems. The pipeline followed here, which includes dataset construction, training in float32, quantization to int8, and deployment on edge devices, is applicable to a wide spectrum of resource-constrained, intelligent edge solutions. Future avenues of research include building systems that are more robust to noise and can work on even smaller microcontrollers. Work can be done on building datasets that include images from more diverse sources. Novel quantization schemes can be developed for converting float32 to int8, and smaller precisions, including 6-bit, 4-bit, and binarized neural networks, can be experimented with as well.

References

[1] 32-bit Arm® Cortex®-M7 480 MHz MCUs, up to 2 MB Flash, up to 1 MB RAM, 46 com. and analog interfaces
[2] OpenMV: A Python powered, extensible machine vision camera
[3] Post training 4-bit quantization of convolutional networks for rapid-deployment
[4] Multi-stage CNN architecture for face mask detection
[5] Low-bit quantization of neural networks for efficient inference
[6] HAWQ-V2: Hessian aware trace-weighted quantization of neural networks
[7] HAWQ: Hessian aware quantization of neural networks with mixed-precision
[8] Compressing deep neural networks for efficient visual inference
[9] Differentiable soft quantization: Bridging full-precision and low-bit neural networks
[10] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
[11] Quantization and training of neural networks for efficient integer-arithmetic-only inference
[12] Face mask 12k images dataset
[13] RetinaMask: A face mask detector
[14] Face mask detection using transfer learning of InceptionV3
[15] Gradient-based learning applied to document recognition
[16] A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic
[17] Fighting against COVID-19: A novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection. Sustainable Cities and Society
[18] Face mask classification
[19] Facial mask detection using semantic segmentation
[20] Loss aware post-training quantization
[21] Recent advances in selective image encryption and its indispensability due to COVID-19
[22] Rethinking generalization in American Sign Language prediction for edge devices with extremely low memory footprint
[23] Moxa: A deep learning based unmanned approach for real-time monitoring of people wearing medical masks
[24] HAWQV3: Dyadic neural network quantization
[25] Improving neural network quantization without retraining using outlier channel splitting

This is a post-print of a paper published in Innovations in Electrical and Electronic Engineering, Lecture Notes in Electrical Engineering, vol. 756, Springer, Singapore. DOI: 10.1007/978-981-16-0749-3_52.