key: cord-0549997-f10cc5es
authors: Fasfous, Nael; Vemparala, Manoj-Rohit; Frickenstein, Alexander; Frickenstein, Lukas; Stechele, Walter
title: BinaryCoP: Binary Neural Network-based COVID-19 Face-Mask Wear and Positioning Predictor on Edge Devices
date: 2021-02-06
sha: e3a9eed2f67a175bcd9be81d8c1fc980cafeabab
doc_id: 549997
cord_uid: f10cc5es

Face masks have long been used in many areas of everyday life to protect against the inhalation of hazardous fumes and particles. They also offer an effective solution in healthcare for bi-directional protection against air-borne diseases. Wearing and positioning the mask correctly is essential for its function. Convolutional neural networks (CNNs) offer an excellent solution for face recognition and classification of correct mask wearing and positioning. In the context of the ongoing COVID-19 pandemic, such algorithms can be used at entrances to corporate buildings, airports, shopping areas, and other indoor locations, to mitigate the spread of the virus. These application scenarios impose major challenges to the underlying compute platform. The inference hardware must be cheap, small and energy efficient, while providing sufficient memory and compute power to execute accurate CNNs at a reasonably low latency. To maintain data privacy of the public, all processing must remain on the edge-device, without any communication with cloud servers. To address these challenges, we present a low-power binary neural network classifier for correct facial-mask wear and positioning. The classification task is implemented on an embedded FPGA, performing high-throughput binary operations. Classification can take place at up to ~6400 frames-per-second, easily enabling multi-camera, speed-gate settings or statistics collection in crowd settings. When deployed on a single entrance or gate, the idle power consumption is reduced to 1.6W, improving the battery-life of the device.
We achieve an accuracy of up to 98% for four wearing positions of the MaskedFace-Net dataset. To maintain equivalent classification accuracy for all face structures, skin-tones, hair types, and mask types, the algorithms are tested for their ability to generalize the relevant features over all subjects using the Grad-CAM approach.

Convolutional neural networks (CNNs) have been applied to real-world problems since the early days of their conception [1]. In current times, the ongoing COVID-19 pandemic presents new challenges, which can be solved with the help of state-of-the-art computer vision algorithms [2], [3]. One of the simplest ways of mitigating the spread of the COVID-19 disease is wearing a face-mask, which can protect the wearer from direct exposure to the virus through the mouth and nasal passages. A correctly worn mask can also protect other people, in case the wearer is already infected with the disease. This bi-directional protection makes masks highly effective in crowded and/or indoor areas. Although face-masks have become a mandatory requirement in many public areas, it is difficult to ensure the compliance of the general public. More specifically, it is difficult to assert that the masks are worn correctly as intended, i.e. completely covering the nose, mouth and chin [4]. CNNs are the current state-of-the-art in face detection applications. Compared to classical computer vision algorithms, CNNs can provide better accuracy on problems with diverse features without having to manually extract said features [5]. This holds true only when the training dataset has a fair distribution of samples. Correctly identifying a mask on a person's face is a relatively simple task for these powerful algorithms. However, a more precise classification of the exact positioning of the mask and identifying the exposed region of the face is more challenging.
To maintain equivalent classification accuracy for all face structures, skin-tones, hair types, and mask types, the algorithms must be able to generalize the relevant features over all individuals. The deployment scenarios for the CNN should also be taken into consideration. A face-mask detector can be set at the entrance of corporate buildings, shopping areas, airport checkpoints, and speed gates. These distributed settings require cheap, battery-powered, edge devices which are limited in memory and compute power. To maintain security and data privacy of the public, all processing must remain on the edge-device without any communication with cloud servers. Minimizing power and resource utilization while maintaining a high classification accuracy is a design challenge which necessitates hardware-software co-design. In this context, we propose BinaryCoP (Binary COVID-mask Predictor), an efficient binary neural network (BNN) classifier for real-time classification of correct face-mask wear and positioning. The challenges of the described application are tackled through the following contributions:
• Training BNNs on synthetically generated data [6] to cover a wide demographic and generalize relevant task-related features. A high accuracy of ∼98% is achieved for a 4-class problem of mask wear and positioning on the MaskedFace-Net dataset.
• Deploying BNNs on a low-power, real-time embedded FPGA accelerator based on the Xilinx FINN architecture [7]. The accelerator can idle at a low power of 1.65W on single entrances and gates or operate at high performance (∼6400 frames-per-second) in crowded multi-gate settings, requiring ∼2W of power.
• Analyzing the BNNs through Gradient-weighted Class Activation Mapping (Grad-CAM) to improve interpretability and study the features being learned.

Correctly worn masks play a pivotal role in mitigating the spread of the COVID-19 disease during the ongoing pandemic [8].
Members of the general public often underestimate the importance of this simple yet effective method of disease prevention and control. Researchers and data scientists in the field of computer vision have collected data to train and deploy algorithms which help in automatically regulating masks in public spaces and indoor locations [9] , [10] . Although large-scale natural face datasets exist, the number of real-world masked images is limited [9] . Wang et al. [10] extended their masked-face dataset with a Simulated Masked Face Recognition Dataset (SMFRD), which is synthetically generated by applying virtual masks to existing natural face datasets. Cabani et al. [6] improved the generation of synthetically masked-faces by applying a deformable mask-model onto natural face images with the help of automatically detected facial key-points. The keypoints of the deformable mask-model can be matched to the key-points of the face, allowing the application of the mask in a variety of ways. This allows the dataset generation process to further generate examples of incorrectly worn masks, such as chin exposed, nose exposed or nose and mouth exposed. The memory footprint of neural networks and the complexity of their arithmetic operations on inference hardware can be reduced through parameter quantization. In the most extreme case, binarizing neural networks constrains their weights and activations to {−1, 1}, such that their memory footprint is theoretically reduced by ×32 compared to a float-32 CNN [11] . Additionally, simple XNOR and popcount operations can be used to implement multiply-accumulate (MAC) operations on inference hardware [12] . Specialized training schemes have been proposed to mitigate the loss in information capacity introduced by the low-bitwidth representation of BNNs [11] , [13] , [14] , [12] . In some cases, the low information capacity due to binarization can have a regularization effect which improves feature generalization [13] . 
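
The XNOR and popcount formulation of the MAC operation mentioned above can be illustrated with a small sketch. It assumes that −1 is encoded as bit 0 and +1 as bit 1, and it packs each vector into a single Python integer; the helper names `pack_bits` and `xnor_popcount_dot` are hypothetical and are not taken from [11] or [12].

```python
def pack_bits(values):
    """Pack a list of {-1, +1} values into an int (-1 -> bit 0, +1 -> bit 1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors given in packed form."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # XNOR: bit is 1 where the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # (#matches) - (#mismatches)


activations = [1, -1, -1, 1, 1, -1, 1, 1]
weights = [1, 1, -1, -1, 1, -1, -1, 1]

# Reference result using ordinary multiply-accumulate on the +/-1 values.
reference = sum(a * w for a, w in zip(activations, weights))
fast = xnor_popcount_dot(pack_bits(activations), pack_bits(weights), len(weights))
print(reference, fast)
```

Both paths give the same value because each matching sign pair contributes +1 and each mismatch contributes −1. Storing one bit per weight instead of a 32-bit float is also where the theoretical ×32 memory reduction comes from.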
The regularization effect of binarization is helpful in improving the classification performance on real-world data, particularly when training on synthetically generated data [15]. In [13], Courbariaux et al. introduced a scheme to train neural networks with binary weights during forward propagation while maintaining latent full-precision values during back propagation. This ensures proper gradient flow and fine adjustments through the gradients. This approach was later extended with the binarization of activations [11]. Rastegari et al. [12] proposed XNOR-Net, where both weights and activations are binarized such that the convolutions of input feature maps and weights can be approximated by a combination of XNOR operations and popcounts, followed by a multiplication with scaling factors. The introduction of scaling factors improves the information capacity of the network at the cost of more trainable parameters for each layer. This adds to the computational complexity of XNOR-Net at deployment time. For the task of face-mask detection with a single subject in the frame (e.g. gates and entrance points), more efficient forms of BNNs [11] can be applied.

Several accelerators have been designed to exploit the benefits of BNNs [16], [17], [7], [18]. The Xilinx FINN [7] framework was developed to accelerate BNNs efficiently on FPGA platforms. The framework compiles high level synthesis (HLS) code from a BNN description to create a hardware design for the network. The generated streaming architecture consists of a pipeline of individual hardware components instantiated for each layer of the BNN. In this work, we deploy BinaryCoP on FINN-based hardware architectures to achieve an efficient acceleration of the masked-face inference on embedded FPGAs. We parameterize and synthesize accelerators with different hardware requirements, geared towards individual COVID-19 mask recognition (low-power) or multi-camera (multi-gate) classification (high-performance).

The BNN method proposed by Courbariaux et al.
[11] serves as our foundation to efficiently approximate weights and activations to single-bit precision at inference time, such that the neural network's arithmetic operations can be executed as simple logic operations. Smooth model training and convergence is ensured by relying on full-precision latent weights $W$ during training time [19]. In detail, the activation tensor $A_{l-1} \in \mathbb{R}^{X_i \times Y_i \times C_i}$, with its dimensions of $X_i$ width, $Y_i$ height, and $C_i$ channels, serves as the input to the convolutional layer $l \in [1, ..., L]$. Here, $A_0$ and $A_L$ represent the input image and the network's prediction, respectively. The trainable parameters of the 2D-convolutional layers are composed of the latent weight matrix $W \in \mathbb{R}^{K \times K \times C_i \times C_o}$ required for training, with kernel dimension $K$, input channels $C_i$, and output channels $C_o$. As previously stated, the latent weights are mapped to {−1, +1} during the forward pass for loss calculation or deployment, resulting in the binarized weights $b \in B$, with $B \in \mathbb{B}^{K \times K \times C_i \times C_o}$. In the hardware implementation, −1 is expressed as a binary 0 to perform multiplications as XNOR logic operations. The sign() function in Eq. 1 is used to binarize the input feature maps and weights.

$$\mathrm{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{1}$$

The derivative of the sign() function is zero almost everywhere, resulting in insufficient gradient flow during training and back-propagation. This necessitates gradient flow approximation using a straight-through estimator (STE) [19]. Particularly for BNNs, it is crucial to adjust the input elements $a_{l-1} \in A_{l-1}$ by means of batch normalization to zero mean and unit variance before they are approximated by the binary representation $h_{l-1} \in H_{l-1}$, with $H_{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i}$.

Fig. 1: Schematic representation of the BinaryCoP accelerator. A camera captures images to be classified by the neural network. The BNN accelerator is tailored for the application scenario (single or multi-gate prediction). Binary tensors are processed in the PEs of the FPGA-based accelerator using efficient XNOR operations.
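
The interplay of sign() binarization, latent full-precision weights, and the straight-through estimator can be sketched in a few lines. This is only an illustration under stated assumptions: zero is mapped to +1, the STE uses the clipped variant that blocks gradients where |x| > 1, and the gradient values, learning rate, and function names are made up for the example.

```python
def sign(x):
    """Binarize a real value to {-1, +1}; zero maps to +1 (assumed convention)."""
    return 1.0 if x >= 0.0 else -1.0


def ste_grad(x, upstream_grad):
    """Straight-through estimator for sign(): pass the upstream gradient
    unchanged where |x| <= 1 and block it elsewhere (clipped variant)."""
    return upstream_grad if abs(x) <= 1.0 else 0.0


# Latent full-precision weights kept during training.
latent_weights = [0.7, -0.2, 1.5, -1.1, 0.0]

# Forward pass uses the binarized weights.
binary_weights = [sign(w) for w in latent_weights]

# Hypothetical loss gradients with respect to the binarized weights.
grad_wrt_binary = [0.3, -0.5, 0.2, 0.4, -0.1]

# Backward pass: the STE routes these gradients to the latent weights.
grad_wrt_latent = [ste_grad(w, g) for w, g in zip(latent_weights, grad_wrt_binary)]

# A plain SGD step updates the latent weights, not the binary ones.
learning_rate = 0.1
updated = [w - learning_rate * g for w, g in zip(latent_weights, grad_wrt_latent)]
```

Small gradient steps accumulate in the latent weights and only flip a binary weight once the latent value crosses zero, which is why the full-precision copy is needed for training but not for deployment.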
An advantage of BNNs is that the result of the batch-norm operation is followed by sign() (see Fig. 1). Since the result after applying both functions is simply {−1, 1}, the precise calculation of the batch-norm is wasteful on embedded hardware. Based on the batch-norm statistics collected at training time, a threshold point $\tau$ is defined, wherein an activation value $a_{l-1} \ge \tau$ results in 1, otherwise −1 [7]. This allows the implementation of the typically costly batch-norm operation as a simple magnitude comparison operation on hardware. Next, the binary convolution follows as:

$$A_l = \mathrm{Conv}(H_{l-1}, B_l)$$

which results in the output feature map $A_l \in \mathbb{R}^{X_o \times Y_o \times C_o}$.

The trained BNNs are conditioned for deployment on the Xilinx FINN framework [7]. The pipelined architecture offers several advantages on embedded devices, most importantly, the reduction in on-chip to off-chip memory transfers of the BNN parameters $B_l$ and intermediate activations $A_l$ and $H_l$. This is mainly feasible due to the binary format, which results in highly compact neural networks that can fit on the on-chip memory units of embedded devices. The number of processing elements (PEs), single-instruction-multiple-data (SIMD)-lanes, and other parameters can be optimized by the designer to suit the acceleration of the trained BNN. The final design is synthesized and implemented on an embedded FPGA. For each convolutional or fully-connected layer in the BNN, a matrix-vector-threshold unit (MVTU) is instantiated, which executes the XNOR, popcount and threshold operations mentioned in Sec. III-A. Each MVTU in the pipeline can be dimensioned for the number of PEs and SIMD lanes, which have a significant impact on hardware resource utilization, latency and the effective throughput of the pipeline. Based on the compute complexity of each layer, the available hardware resources need to be distributed over the corresponding MVTUs, such that all parts of the pipeline have a matched throughput.
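
The replacement of batch-norm followed by sign() with a single threshold comparison can be checked with a short sketch. It assumes the usual batch-norm form y = γ(a − μ)/√(σ² + ε) + β, and it flips the comparison when γ is negative; the function names and the randomly drawn parameters are illustrative and do not come from the FINN code base.

```python
import math
import random


def batchnorm_sign(a, gamma, beta, mean, var, eps=1e-5):
    """Reference path: full batch-norm followed by sign()."""
    y = gamma * (a - mean) / math.sqrt(var + eps) + beta
    return 1 if y >= 0.0 else -1


def fold_to_threshold(gamma, beta, mean, var, eps=1e-5):
    """Precompute tau from the batch-norm statistics. The second return value
    says whether the comparison must be flipped (negative gamma)."""
    tau = mean - beta * math.sqrt(var + eps) / gamma
    return tau, gamma < 0.0


def threshold_sign(a, tau, flipped):
    """Deployment path: one magnitude comparison per activation."""
    if flipped:
        return 1 if a <= tau else -1
    return 1 if a >= tau else -1


random.seed(0)
mismatches = 0
for _ in range(1000):
    gamma = random.choice([-1.0, 1.0]) * random.uniform(0.1, 2.0)
    beta = random.uniform(-1.0, 1.0)
    mean = random.uniform(-10.0, 10.0)
    var = random.uniform(0.5, 20.0)
    tau, flipped = fold_to_threshold(gamma, beta, mean, var)
    a = random.randint(-64, 64)  # integer input, like a popcount accumulation
    if batchnorm_sign(a, gamma, beta, mean, var) != threshold_sign(a, tau, flipped):
        mismatches += 1
print("mismatches:", mismatches)
```

Only τ and the comparison direction need to be stored per output channel, so the multiplication, division, and square root of the batch-norm disappear from the deployed design.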
A single under-dimensioned MVTU could throttle the entire pipeline, resulting in sub-optimal throughput. A single MVTU of the pipeline is shown in Fig. 1 , and a corresponding PE is detailed. For convolutional layers, an additional sliding-window unit (SWU) reshapes the binarized activation maps to create a single, wide input feature map memory, which can efficiently be accessed by the corresponding MVTU. Max-pool layers are implemented as boolean OR operations, since a single binary "1" value suffices to make the entire pool window output equal to 1. The output of the convolutional layers in a CNN contains localized information of the input image, without any prior bias on the location of objects and features during training. This information can be captured using Class Activation Mapping (CAM) [20] and Gradient-weighted Class Activation Mapping (Grad-CAM) [21] techniques. To apply CAM, the model must end with a global average pooling layer followed by a fully-connected layer, providing the logits of a particular input. The BNN models investigated in this work operate on a small input resolution of 32×32, and achieve a high reduction of spatial information without incorporating a global average pooling layer. For this reason, the Grad-CAM approach is better-suited to obtain visual interpretations of BinaryCoP's attention and determine the important regions for its predictions of different inputs and classes. To obtain the class-discriminative localization map, we consider the activations and gradients for the output of the Conv2 2 layer (see Tab. I), which has spatial dimensions of 5×5. We use average pooling for the corresponding gradients and reduce the channels by performing Einstein summation as specified in [21] . With this approach the base networks do not need any modifications or retraining. Due to the synthetically generated dataset used for training, we expect BinaryCoP models to generalize well against domain shifts. 
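
The Grad-CAM computation described above (spatially averaged gradients as channel weights, followed by a weighted channel reduction) can be sketched in plain Python. The sketch assumes the activations and the class-score gradients of the chosen layer are already available as nested lists of shape [H][W][C]; the ReLU follows [21], while the scaling to [0, 1] and the tiny 2×2×2 input are assumptions made only for readability (the Conv2_2 output in the text is 5×5).

```python
def grad_cam(activations, gradients):
    """Return an [H][W] heat map from [H][W][C] activations and gradients
    of one layer with respect to one target class."""
    h, w, c = len(activations), len(activations[0]), len(activations[0][0])

    # Channel weights: spatial average of the gradients per channel.
    alpha = [
        sum(gradients[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]

    # Weighted sum over channels (einsum 'hwc,c->hw'), followed by ReLU.
    cam = [
        [max(0.0, sum(activations[i][j][k] * alpha[k] for k in range(c)))
         for j in range(w)]
        for i in range(h)
    ]

    # Scale to [0, 1] for overlaying on the input image.
    peak = max(max(row) for row in cam)
    if peak > 0.0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy 2x2 feature map with two channels.
acts = [[[1.0, 0.0], [0.0, 1.0]],
        [[2.0, 0.0], [0.0, 0.0]]]
# Channel 0 supports the class, channel 1 opposes it.
grads = [[[1.0, -1.0], [1.0, -1.0]],
         [[1.0, -1.0], [1.0, -1.0]]]

heat = grad_cam(acts, grads)
```

Because the map is computed only from stored activations and gradients, the network itself is left untouched, which matches the statement that no modification or retraining is needed.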
BinaryCoP is able to detect the presence of a mask, as well as its position and correctness. This level of classification detail is possible through the more detailed split of the MaskedFace-Net dataset [6] test accuracy). We trained the BNN architectures shown in Tab. I according to the method described in Sec. III-A. Each convolutional (Conv) and fully-connected (FC) layer is followed by batch-norm and activation layers except for the final layer. Conv groups 1 and 2 are followed by a max-pool layer. The target System-on-Chip (SoC) platform for the experiments is the Xilinx XC7Z020 (Z7020) chip on the PYNQ-Z1 board. The µ-CNV design can also be synthesized for the more constrained XC7Z010 (Z7010) chip, when XNOR operations are offloaded to the DSP blocks as described in [23]. Power and throughput measurements are taken directly on a running system. The power is measured at the power supply of the board (includes both the processing system and programmable logic). The throughput reported is the classification rate when the accelerator's pipeline is full, while latency is measured end-to-end for a single image entering and exiting the pipeline. It should be noted that due to the pipelined architecture, throughput is not simply the reciprocal of latency. When the pipeline is full, each MVTU (layer) ideally operates on a different image of the input batch, i.e. concurrently processing L images at different stages of the accelerator. We evaluate three BinaryCoP prototypes, namely CNV, n-CNV and µ-CNV. The CNV network is based on the architecture in [7] inspired by VGG-16 [24] and BinaryNet [11]. n-CNV is a downsized version for a smaller memory footprint, and µ-CNV has fewer layers to reduce the size of the synthesized design. All designs are synthesized with a target clock frequency of 100MHz. Referring back to Tab. I, the PE counts and SIMD-lanes for each layer (MVTU) are shown in sequence.
For BinaryCoP-n-CNV, the most complex layer is Conv1_2 with 3.6M XNOR and popcount operations. In Fig. 2, we mark this layer as the throughput setter, due to its heavy influence on the final throughput of the accelerator. Allocating more PEs for this layer's MVTU increases the overall throughput of the pipeline, so long as no other layer becomes the bottleneck. We allocate enough resources for Conv1_1 to roughly match Conv1_2's latency. The FINN architecture employs a weight-stationary dataflow, since each PE has its own pre-loaded weight memory. When the total number of parameters of a given layer increases, it becomes important to map these parameters to BRAM (Block RAM) units instead of logic. The deeper layers have several orders of magnitude fewer OPs, but more parameters. For these layers, increasing the number of PEs fragments the total weight memory, leading to worse BRAM utilization and no benefit in terms of throughput. Here, choosing fewer PEs, with larger unified weight memories, leads to improved memory allocation, while maintaining rate-matching with the shallow layers (see Fig. 2), leaving the throughput gains from the initial PEs unhindered. The CNV architecture in [7] follows the same reasoning for PE and SIMD allocation. For µ-CNV, we choose fewer PEs for the throughput-setters, as this prototype is meant to fit on embedded FPGAs with less emphasis on high frame rates. In Tab. II, the hardware utilization for the BinaryCoP prototypes is provided. With µ-CNV, a significant reduction in LUTs is achieved, which makes the design synthesizable on the heavily constrained Z7010 SoC. The trade-off is a slight increase in the memory footprint of the BNN, as the shallower network has a larger spatial dimension before the fully-connected layers, increasing the total number of parameters after the last convolutional layer.
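
The reasoning about throughput setters and rate matching can be made concrete with a rough cycle model. The sketch below is a simplification under stated assumptions: each MVTU needs (output pixels) × ⌈C_o/PE⌉ × ⌈K·K·C_i/SIMD⌉ cycles per image, the pipeline throughput is set by the slowest MVTU, and single-image latency is approximated by the sum of all stage cycles (ignoring overlap between stages and FIFO effects). The three layers and their PE/SIMD values are hypothetical and are not the entries of Tab. I.

```python
import math


def mvtu_cycles(k, c_in, c_out, out_pixels, pe, simd):
    """Approximate cycles one MVTU needs per image under a simple folding model."""
    neuron_fold = math.ceil(c_out / pe)            # output channels handled per PE
    synapse_fold = math.ceil(k * k * c_in / simd)  # input chunks per output value
    return out_pixels * neuron_fold * synapse_fold


F_CLK_HZ = 100_000_000  # 100 MHz target clock, as stated in the text

# Hypothetical pipeline: name -> (K, C_in, C_out, output pixels, PE, SIMD)
layers = {
    "conv_a": (3, 3, 16, 30 * 30, 16, 3),
    "conv_b": (3, 16, 16, 28 * 28, 16, 8),
    "fc": (1, 256, 64, 1, 1, 8),
}

cycles = {name: mvtu_cycles(*cfg) for name, cfg in layers.items()}
bottleneck = max(cycles, key=cycles.get)           # the throughput setter
throughput_fps = F_CLK_HZ / cycles[bottleneck]     # steady state, pipeline full
latency_s = sum(cycles.values()) / F_CLK_HZ        # crude single-image estimate

print(f"throughput setter: {bottleneck}")
print(f"steady-state throughput: {throughput_fps:.0f} images/s")
print(f"single-image latency: {latency_s * 1e6:.1f} us")
```

In this toy configuration the fully-connected stage uses a single PE yet stays far below the cycle count of the convolutional stages, which mirrors the argument that deep layers can use fewer PEs without throttling the pipeline. The steady-state throughput also exceeds the reciprocal of the latency estimate, as expected for a pipelined design.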
The choice of PE count and SIMD lanes for the n-CNV prototype allows it to reach a maximum throughput of ∼6400 classifications per second when its pipeline is full. This high performance can be used to classify images from multiple cameras in multi-gate settings. The inference power values reported in Tab. II show a total power requirement of around 2W for all prototypes. For single entrance/gate classifications, all prototypes have an idle power of around 1.65W. In this setting, a classification needs to be triggered only when a subject is attempting to pass through the entrance where BinaryCoP is deployed. The idle power is required mostly by the processor (ARM-Cortex A9) on the SoC and the board (PYNQ-Z1). This can be reduced further by choosing a smaller processor to pair with the proposed hardware accelerator. Although the PYNQ-Z1 board has no PMBus to isolate the power measurements of the FPGA from the rest of the components, we can infer from the two measured power values in Tab. II that the hardware accelerator requires roughly 0.4W for the inference task. The current design is still dependent on the processor for pre- and post-processing, therefore we report the joint power for fairness. The confusion matrix in Fig. 3 shows the generalization of BinaryCoP-CNV on all classes after balancing the dataset. As expected, it is extremely rare to mistake nose+mouth exposed with a correctly worn mask. Less critically, nose and nose+mouth have a slight misclassification overlap, still at only 2% of the total samples given for each class. Finally, the chin exposed and the correct class have some sample misclassifications (≤1%), which could be attributed to the chin area being small in some images and hard to detect at low resolution. We further analyze the output heat maps generated by Grad-CAM to interpret the predictions of our BNNs with respect to the diverse attributes of the MaskedFace-Net dataset. In Fig. 4 and Fig. 5 -Fig.
7 , column 1 and 2 indicate the label and input image respectively. Columns 3, 4 and 5 highlight the heat maps obtained from the Grad-CAM output of BinaryCoP-CNV, BinaryCoP-n-CNV and a full-precision version of CNV with float-32 parameters (FP32). The heat maps are overlaid on the raw input images for better visualization. All raw images chosen have been classified correctly by all the networks, for fair interpretation of feature-to-prediction correlation. In Fig. 4 (a), we analyze the Region of Interest (RoI) for the correctly masked class. BinaryCoP's learning capacity allows it to focus on key facial lineaments of the human wearing the mask, rather than the mask itself. This potentially helps in generalizing on other mask types. For the child example shown in the first row, the focus of BinaryCoP lies on the nose area, asserting that it is fully covered to result in a correctly masked prediction. Similarly, for the adult in row 2, BinaryCoP-CNV focuses on the upper edge of the mask, to predict its coverage of the face. This also holds for our small version of BinaryCoP, with significantly reduced learning capacity. The RoI curves finely above the mask, tracing the exposed region of the face. In the third row example, BinaryCoP-CNV falls back to focusing on the mask, whereas BinaryCoP-n-CNV continues to focus on the exposed features. Both models achieve the same prediction by focusing on different parts of the raw image. In contrast to the BinaryCoP variants, the full-precision FP32 model seems to focus on a combination of several different features on all three examples. This can be attributed to its larger learning capacity and possible overfitting. In Fig. 4 (b), we analyze the Grad-CAM output of the uncovered nose class. BinaryCoP-CNV and BinaryCoP-n-CNV focus specifically on two regions, namely the nose and the straight upper edge of the mask. These clear characteristics cannot be observed with the oversized FP32 CNN. 
In Fig 4(c) , the results show the RoI for predicting the exposed mouth and nose class. All models seem to distribute their attention onto several exposed features of the face. Fig. 4(d) shows Grad-CAM results for chin exposed predictions. Although the top region of the mask points upwards, similar to the correctly worn mask, the BNNs pay less attention to this region and instead focus on the neck and chin. With the full-precision FP32 model, it is difficult to interpret the reason for the correct classification, as little to no focus is given to the chin region, again hinting at possible overfitting. Beyond studying the BNNs' behavior on different class predictions, we can use the attention heat maps to understand the generalization behavior of the classifier. In Fig. 5 -Fig. 7 , we test BinaryCoP's generalization over ages, hair colors and head gear, as well as complete face manipulation with double-masks, face paint and sunglasses. In Fig 5, we see that the smaller eyes of infants and elderly do not hinder BinaryCoP's ability to focus on the top region of the correctly worn masks. In Fig. 6 , BinaryCoP-CNV shows resilience to differently colored hair and head-gear, even when having a similar light-blue color as the face-masks (row 2 and 3). In contrast, the FP32 model's attention seems to shift towards the hair and head-gear for these cases. Finally, in Fig. 7 , both BinaryCoP variants focus on relevant features of the corresponding label, irrespective of the obscured or manipulated faces. This empirically shows that the complex training of BNNs, along with their lower information capacity, constrains them to focus on a smaller set of relevant features, thereby generalizing well for unprecedented cases. As mentioned in Sec II-A, detection of masks has piqued the interest of many researchers in the computer vision domain due to its relevance in the context of the ongoing COVID-19 pandemic. NVIDIA proposed mask recognition using object detection models [25] . 
These models require INT8 or Float-16 numerical precision, with ResNet-18 as a backbone for input images of 960×544. The complexity is orders of magnitude higher than the models we propose in this paper. A head-to-head comparison is difficult to make due to differences in the training approach, the CNN model, the datasets used and the application requirements. The networks are trained to predict only two classes (mask, no mask), which is a simpler problem compared to the exact positioning supported by BinaryCoP. However, the localization and higher resolution makes it a more complex task overall. With the NVIDIA Jetson Nano hardware, which typically requires ∼10W of power on intensive workloads, a framerate of 21 FPS is achieved. The more powerful 25W Jetson AGX Xavier can achieve up to 508 FPS. Compared to the NVIDIA approach [25] , BinaryCoP is targeted at low-power, embedded applications with peak inference power of ∼2W and high classification rates of up to ∼6400 FPS on smaller resolution input images. It is worth noting that BinaryCoP can also classify high resolution images containing multiple individuals, by slicing the input into many 32×32 frames and batch processing them. This application makes use of the high-throughput results presented in Tab. II. Another approach proposed by Agarwal et al. [26] achieves the task of detecting a range of personal protective equipment (PPE). Processing takes place on cloud servers, which could raise privacy and data safety concerns in public settings. Wang et al. [27] propose an in-browser server-less edge computing method, with object detection models. The browser-enabled device must support the WebAssembly instruction format. The authors benchmarked their approach on an iPad Pro (A9X), an iPhone 11 (A13) and a MacBook pro (Intel i7-9750H), achieving 5, 10 and 20 FPS respectively. Needless to say, these devices (or similar) are expensive and cannot be placed in abundance in public areas. 
Similarly, [28] offers an Android application solution, which is suitable for users self-checking their masks. In this case, low-power, edge-hardware and continuous surveillance are not emphasized. Our approach offers a unique, low-power, high-throughput solution, which is applicable to cheap, embedded FPGAs. Moreover, the BinaryCoP solution is not constrained to FPGA platforms. Software-based inference of BinaryCoP is also possible on other low-power microcontrollers with binary instructions. Training on synthetic data allows us to generate more samples with different mask colors, shapes, and sizes [29], further improving the generalizability of the BNNs, while keeping real-world data available for fine-tuning stages.

In this paper, we apply binary neural networks to the task of classifying the correctness of face-mask wear and positioning. In the context of the ongoing COVID-19 pandemic, such algorithms can be used at entrances to corporate buildings, airports, shopping areas, and other indoor locations to mitigate the spread of the virus. Applying BNNs to this application solves several challenges such as (1) maintaining data privacy of the public by processing data on the edge-device, (2) deploying the classifier on an efficient XNOR-based accelerator to achieve low-power computation, and (3) minimizing the neural network's memory footprint by representing all parameters in the binary domain, enabling deployment on low-cost, embedded hardware. The accelerator requires only ∼1.65W of power when idling on single gates/entrances. Alternatively, high performance is possible, providing fast batch classification on multiple gates and entrances with multiple cameras, at ∼6400 frames-per-second and 2W of power. We achieve an accuracy of up to 98% for four wearing positions of the MaskedFace-Net dataset. The Grad-CAM approach is used to study the features learned by the proposed BinaryCoP classifier.
The results show the classifier's high generalization ability, allowing it to perform well on different face structures, skin-tones, hair types, and age groups.

[1] Gradient-based learning applied to document recognition
[2] Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images
[3] Coronet: A deep neural network for detection and diagnosis of covid-19 from chest x-ray images
[4] When and how to use masks
[5] Deep learning vs. traditional computer vision
[6] Maskedface-net - a dataset of correctly/incorrectly masked face images in the context of covid-19
[7] Finn: A framework for fast, scalable binarized neural network inference. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17
[8] Face masks considerably reduce covid-19 cases in germany
[9] Detecting masked faces in the wild with lle-cnns
[10] Masked Face Recognition Dataset and Application
[11] Binarized neural networks
[12] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
[13] Binaryconnect: Training deep neural networks with binary weights during propagations
[14] Towards accurate binary convolutional neural network
[15] Binary DAD-Net: Binarized Driveable Area Detection Network for Autonomous Driving
[16] Yodann: An architecture for ultralow power binary-weight cnn acceleration
[17] Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 tops at 0.6 w
[18] Towards fast and energy-efficient binarized neural network inference on fpga
[19] Estimating or propagating gradients through stochastic neurons for conditional computation
[20] Learning deep features for discriminative localization
[21] Grad-cam: Visual explanations from deep networks via gradient-based localization
[22] Learning multiple layers of features from tiny images
[23] Orthruspe: Runtime reconfigurable processing elements for binary neural networks
[24] Very deep convolutional networks for large-scale image recognition
[25] Implementing a real-time, ai-based, face mask detector application for covid-19
[26] Automatically detecting personal protective equipment on persons in images using amazon rekognition
[27] Wear-Mask: Fast In-browser Face Mask Detection with Serverless Edge Computing for COVID-19
[28] Validating the correct wearing of protection mask by taking a selfie: Design of a mobile application "checkyourmask" to limit the spread of covid-19
[29] Masked Face Recognition for Secure Authentication