key: cord-0971246-i9961miz
authors: Mas, Juan; Panadero, Teodoro; Botella, Guillermo; Del Barrio, Alberto A.; García, Carlos
title: CNN Inference acceleration using low-power devices for human monitoring and security scenarios
date: 2020-10-01
journal: Comput Electr Eng
DOI: 10.1016/j.compeleceng.2020.106859
sha: 3ea011c49631e1a682e419156f1968468c027296
doc_id: 971246
cord_uid: i9961miz

Security is currently one of the top concerns in our society. From governmental installations to private companies and medical institutions, all of them have to deal directly with security issues such as access to restricted information, quarantine control, or criminal tracking. As an example, identifying patients is critical in hospitals or geriatric facilities in order to isolate infected people, which has proven to be a non-trivial issue during the COVID-19 pandemic currently affecting all countries, or to locate patients who have fled. Face recognition is thus a non-intrusive alternative for performing these tasks. Although FaceNet from Google has proved to be almost perfect, in a multi-face scenario its performance decays rapidly. In order to mitigate this loss of performance, in this paper a cluster based on the Neural Compute Stick version 2 and OpenVINO by Intel is proposed. A detailed power and runtime study is presented for two programming models, namely multithreading and multiprocessing. Furthermore, three different hosts have been considered. In the most efficient configuration, an average of 6 frames per second has been achieved using the Raspberry Pi 4 as host with a power consumption of just 11.2W, increasing the energy efficiency by a factor of 3.3X with respect to a PC-based solution in a multi-face scenario.

Year 2012 was a turning point in the classification of images. Since the appearance of AlexNet [1] in that year, the utilization of Deep Neural Networks (DNNs) for classifying millions of images has proved to be even more effective than humans [2]. While the application of DNNs mostly focused on cloud-based systems, the trend is now changing, as there is an increasing demand for running these algorithms on embedded systems at the edge [3]. In this scenario, face recognition has become a major goal of machine learning and deep learning development. It is a critical step for security cameras or, in general, any system tracking people [4]. Being able to identify a specific person in a crowd is essential for security purposes. At first, face recognition was limited to highly restricted security areas, where it was used to verify the identity of the people trying to gain access. However, it has recently widened its scope to other security issues such as criminal tracking in public areas, which has led to many technological challenges related to the increase in the number of faces to be inferred. As mentioned in [5], face recognition became popular using holistic methods such as the well-known Eigenfaces [6] or algorithms based on local features [7]. Unfortunately, these methods failed to produce the expected results. In recent years, the explosion of Big Data and Deep Learning has allowed some implementations to reach noticeably high levels of accuracy (up to 99.97%), as is the case of Google's FaceNet [8]. These methods are based on Convolutional Neural Networks (CNNs) [9] and, according to the results of the Labeled Faces in the Wild (LFW) benchmark [10], they are almost as accurate as humans.
Nevertheless, in spite of these high accuracy levels, the performance of the system depends on the number of faces to be recognised in each frame. Thus, in order to meet the performance requirements, it is necessary to increase the computing capability of the system. An example of this statement is shown in Table 1. As observed, the number of frames per second [11, 12] rapidly decreases as the number of faces appearing in a video increases, which motivates the usage of an accelerator to keep the performance high. Current efforts have moved from improving accuracy to lightening the neural networks [13] and accelerating inference times [9, 14]; even so, the performance of the system still degrades further when many faces need to be analyzed. Moreover, the previously mentioned decrease in performance is associated with an increase in power consumption, which is highly relevant for monitoring applications.

This paper focuses on the employment of energy-efficient embedded systems to perform the inference stage. Concretely, in this work a cluster of 3 Intel Neural Compute Stick 2 (NCS2) devices [15] is proposed to mitigate the aforementioned losses of performance with low-cost and low-power hardware. Together with the OpenVINO toolkit [16], these are the basic Intel technologies for accelerating inference. This toolkit makes it possible to optimize neural networks before deployment and tweak them in order to improve their performance on tightly constrained platforms. Leveraging these technologies, in this paper we propose a real-time face recognition application based on FaceNet [8]. The inference stage is accelerated through the NCS2 cluster. Results show that it is possible to increase performance by taking advantage of the horizontal scalability of this kind of accelerating hardware, resulting in interesting low-cost alternatives for the deployment of the aforementioned applications. Therefore, our major contributions can be listed as follows:
• The construction of a low-power and low-cost NCS2 cluster to run a face recognition application.
• The mitigation of the performance decay in a multi-face scenario, while providing an energy-efficient solution.
• The evaluation of two programming models to run the face recognition application, namely one based on multithreading and the other based on multiprocessing, the latter being the best match for the proposed cluster.
• A scalability study based on three different hosts: a laptop and two embedded devices, namely the Raspberry Pi 3B+ and the Raspberry Pi 4.
• The proposed cluster with 3 NCS2 sticks has achieved an average of 6 FPS (10 FPS peak) in multi-face scenarios, increasing the energy efficiency by a factor of 3.3X with respect to a CPU-based solution.

The rest of the paper is organized as follows: Section 2 describes some preliminary concepts about FaceNet; Section 3 reviews the state of the art regarding the different acceleration technologies available and the prior attempts at face recognition over the NCS used in this research; Section 4 describes the hardware architecture assembled for the experiments; Section 5 describes the software application developed and some of its main features; Section 6 explains the experiments carried out and presents the results obtained; finally, Section 7 gives our concluding remarks.

FaceNet was presented in 2015 by Schroff et al. [8]. It is a model that represents facial features in a unified format, named an embedding, composed of 128 values.
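As detailed in the next paragraphs, recognition then reduces to thresholding the Euclidean distance between two such embeddings. The following minimal Python sketch illustrates that decision rule only; the helper name and the random stand-in embeddings are hypothetical, and 1.20 is the threshold adopted later in this work:

import numpy as np

THRESHOLD = 1.20  # distance below which two embeddings are considered the same person

def same_person(embedding_a: np.ndarray, embedding_b: np.ndarray,
                threshold: float = THRESHOLD) -> bool:
    # Euclidean (L2) distance between two 128-value FaceNet embeddings
    distance = np.linalg.norm(embedding_a - embedding_b)
    return distance < threshold

# Hypothetical usage with random vectors standing in for real FaceNet outputs
if __name__ == "__main__":
    a = np.random.rand(128).astype(np.float32)
    b = np.random.rand(128).astype(np.float32)
    print(same_person(a, b))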
As can be seen in Fig. 2, an embedding is a vector-like representation that digests the discriminant features of a certain human face. This representation has proven to be very useful in the literature, as it is not necessary to apply Support Vector Machine (SVM) or Principal Component Analysis (PCA) classification algorithms after performing the inference stage, which is a very common practice in image recognition. In fact, in order to predict whether two faces are the same or not, it is only necessary to obtain the Euclidean distance between two different embeddings and check whether this distance is lower than a certain threshold fixed beforehand. This model has an accuracy of 99.96% according to the LFW benchmarks [10] and is widely used by the community. It is usually implemented using Tensorflow [17] and many Inception [18] variants. An Inception ResNet-v1 [19] has been used in this research. This architecture provides high levels of accuracy while managing to maintain a reduced computational cost. An overview of the Inception ResNet-v1 CNN is depicted in Fig. 1. As seen in the figure, the input size is 299x299 pixels. Thus, prior to the inference phase, images must be resized to this resolution in order to be digested by the CNN. This fixed input makes the inference agnostic to the original resolution of the video, so increasing this parameter would only impact the preprocessing phase. In Fig. 2, a real example consisting of 3 different image embeddings is shown, with their Euclidean distances calculated in order to check whether they are the same face or not. Face A is different from B and C, which correspond to the same person. The threshold (1.20) was initially set following the suggestions in the official documentation of the pre-trained model, and has later been verified as a suitable value after performing numerous tests.

According to Shi [20], Edge Computing is defined as "the enabling technologies allowing computation to be performed at the edge of the network, on downstream data on behalf of cloud services and upstream data on behalf of IoT services". As mentioned in Section 1, Big Data approaches boosted the performance and utilization of CNNs for implementing face recognition. However, if the inference stage is conducted at the edge, the bandwidth can be saved for other purposes and the latency of the application is reduced. For this reason, during the last years many companies like Nvidia or Intel have invested significant amounts of money in hardware specifically designed for performing the inference stage of neural networks at the edge. Intel has designed the Intel Movidius Neural Compute Stick (NCS) [21, 22] and its revision, the Neural Compute Stick 2 (NCS2). The NCS and NCS2 are small USB 3.0-based sticks which contain Myriad 2 and Myriad X Vision Processing Units (VPUs), respectively, and they can be plugged into any device running Windows, Linux, MacOS or Raspbian. Myriad 2 [22] includes 4 Gbits of LPDDR3 DRAM, imaging and vision accelerators, and an array of 12 VLIW (Very Long Instruction Word) vector processors called SHAVE processors. These processors are used to accelerate neural networks by running parts of the networks in parallel. Among many improvements, Myriad X mounts 16 SHAVE processors and is able to employ the half-precision floating-point format, i.e. 16-bit numbers. These sticks can be used along with the Intel OpenVINO toolkit [16].
This toolkit makes it possible to optimize CNNs and deploy the neural networks on one or several Intel devices (CPU, GPU, NCS, FPGA, etc.). Other examples of edge devices are the Jetson boards by Nvidia [23] or the Coral ones by Google [24], with software support from TensorRT and TensorFlow, respectively. Hardware comparisons have already been made between the NCS2 and the Coral USB [25]; nonetheless, we highlight the main differences in Table 2. Although these are quite recent technologies, there is increasing interest in them. As evidence, we can find works such as the one by Kristiani et al. [26] on image classification using the NCS2, or the work by Adnan et al. [27], which leverages a Raspberry Pi in combination with the NCS to perform object detection. On another front, the usage of CNNs for medical research using OpenVINO in combination with the new Intel Xeon CPUs [28], the efforts in deploying CNNs on FPGAs [29], or the work of Lin et al. [30], which studies OpenVINO and TensorFlow for deploying a traffic sign classification and detection system on an FPGA, are examples that show the variety of scenarios in which this framework can be used beyond embedded devices and IoT use cases.

Regarding FaceNet, there have been several attempts to implement face recognition applications on embedded systems. One example is OpenFace [32], a face recognition library specially designed for mobile platforms. In 2019, Jose et al. [33] implemented FaceNet and Multitask Cascaded Convolutional Networks (MTCNN) [34], an algorithm composed of 3 CNNs for detecting and aligning faces within an image, on a Jetson TX2 in order to implement a multicamera surveillance system. Also in 2019, Salhi et al. [35] developed an evolutionary face recognition algorithm and deployed it on a Jetson TK1 with some real-time oriented improvements. Our study differs from them in the cost and power of the employed technologies, which are lower than those of the Jetson TX2, priced at $399 in the official Nvidia store [36], while the NCS2 costs $78.35. It must be noted that the TK1 model is currently discontinued. A further comparison is shown in Table 2.

It is also possible to find works in the literature combining face recognition and the NCS. For instance, the works by Wang et al. [37] and by Boka and Morris [38] utilize an NCS and the Intel Movidius Neural Compute SDK (NCSDK) [39] to accelerate face recognition on a Raspberry Pi. In this paper we focus on its successor, the NCS2, together with OpenVINO, as the NCSDK does not support the NCS2. In this regard, there are already some studies using the NCS2 for face recognition too. That is the case of the work by Yuan Xie et al. [40], where they analyze the overall speedup of Sphereface using Mobilenet on a hardware configuration consisting of a Raspberry Pi 3B+ with an NCS2, obtaining a performance of 7.031 FPS. In this paper, the focus is not on analysing the performance of the NCS2 with face recognition algorithms, but on mitigating the loss of performance in multi-face scenarios with a cluster hardware configuration of NCS2 devices. None of the aforementioned works studies the possibility of horizontally scaling this technology, taking into account not only the performance in different situations but also the cost of the technology and the power consumption. Therefore, in this paper, a cluster of 3 NCS2 devices is proposed to mitigate the performance decay when employing FaceNet on multi-face images.
Furthermore, the power of the system will be studied under different scenarios and configurations.

The hardware architecture of the system is depicted in Fig. 3. As can be observed, the accelerating NCS2 cluster appears on the rightmost part of the figure, connected to the host through an active hub, which is powered by an external power supply. The active hub is required to provide the necessary voltage and current to the NCS2 sticks. Otherwise, especially if the host is an embedded platform such as a Raspberry Pi, the cluster is not properly powered and cannot achieve good performance (instability of the USB connection causes information losses and a degradation of the overall performance). An INA260 module is also attached to the connection between the hub and the supply in order to measure the power consumption by capturing both voltage and current. The voltage and current values are then read through the INA260 I2C interface using an independently powered Raspberry Pi 3B. Additionally, the host power supply, when the host is a Raspberry Pi 3B+, has also been monitored through the INA260. The following subsections describe every module taking part in the proposed architecture.

The neural sticks by Intel are the key elements composing the cluster. In this work, we have mounted from 1 to 3 sticks in order to evaluate different performance-power trade-offs. Each NCS2 can execute parallel inference requests through an asynchronous API. According to the official documentation [41], the optimal number of concurrent inferences is 4 for each device, so this value has been set in our framework.

In the proposed architecture, two different types of hosts have been tested. On the one hand, a laptop with an Intel Core i7-4702MQ@2.20GHz, 8GB DDR3 RAM and a 500GB SATA3 SSD, running Ubuntu 16.04.06-LTS. On the other hand, two embedded devices have been evaluated: a Raspberry Pi 3B+, which features a Broadcom BCM2837B0 SoC with a quad-core 64-bit Cortex-A53 (ARMv8) at 1.4GHz, 1GB LPDDR2 SDRAM and a 32GB Micro-SD; and the latest-generation Raspberry Pi 4, which contains a Broadcom BCM2711 SoC with a quad-core Cortex-A72 at 1.5GHz, 4GB LPDDR4-2400 SDRAM and a 32GB Micro-SD. In these cases the operating systems are Raspbian Stretch and Raspbian Buster, respectively.

The NCS2 sticks are connected to the host using a USB 3.0 active hub. By employing this hub, we ensure that no power shortage will affect the NCS2 performance. The INA260 is a precise digital current and voltage monitor, so the consumed power is easily obtained as the product of both magnitudes. It is compatible with 3V and 5V logic and can measure up to +36VDC. Thanks to the integrated shunt resistor, the chip allows measuring up to +15A on either the high or the low side. It is compatible with Arduino and Raspberry Pi through its I2C interface.

The OpenVINO toolkit allows developers to optimize and deploy CNNs for accelerating inference at the edge. In our application, the Model Optimizer and the Inference Engine components are employed. These will be explained in detail in the following subsections. The overall software architecture is represented in Fig. 4. Traditionally, when talking about implementations of AI applications, there is a division between the training/development and deployment stages. The Model Optimizer (MO) is a command-line tool that can be used for adjusting and fine-tuning neural network models in order to accelerate the inferences in later stages.
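As a concrete illustration of the conversion step described in the next paragraph, a frozen TensorFlow model could be converted to IR files with a call along the following lines. This is only a sketch: the paths and file names are hypothetical, and the exact Model Optimizer script location and flags depend on the OpenVINO release.

import subprocess

# Hypothetical paths; mo_tf.py ships with the OpenVINO Model Optimizer
subprocess.run([
    "python3", "/opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py",
    "--input_model", "facenet.pb",   # frozen TensorFlow graph of the model
    "--data_type", "FP16",           # half precision, as required by the Myriad X VPU
    "--output_dir", "ir_models/",    # where the .xml and .bin IR files are written
], check=True)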
The MO is fed with the definition files of the model (a .pb file in the case of a Tensorflow model) and produces two Intermediate Representation (IR) files: a .xml file defining the layers, sizes and connections, and a .bin file containing the weights of each parameter. When executing the optimization, the user can introduce other configuration parameters such as the floating-point precision [42] (FP32, FP16, INT8), channel inversion, layer fusing, etc. However, the method used by the MO to optimize the model is not explained in detail, so developers lose some control over the fine-tuning of the model. In this paper, an optimized CNN for face detection (face-detection-retail-004) provided by Intel, as well as David Sandberg's [43] implementation of FaceNet, have been used. The area under the precision-recall curve of the face detection network (also known as the Average Precision) is 83%, tested on the WIDER [44] dataset, which contains faces larger than 60x60 pixels. During the testing phase of the neural network, unstable behaviour has been observed when reducing the resolution of a single face below 60x60 pixels. In this scenario, we have detected cases in which the faces were not properly detected (incorrect bounding boxes) and cases in which the face was not detected at all. Finally, both models have been optimized using the FP16 format. This format has been chosen because of the computing constraints of the Myriad VPU, which does not support other formats such as FP32. Nonetheless, it is worth noting that the OpenVINO framework can of course optimize models in FP32 format when targeting CPU and GPU units as inference devices.

The MO helps developers to make the transition from the training phase to the deployment phase, but the tool that is really in charge of executing the inference is the Inference Engine (IE), which can be installed independently of the rest of the OpenVINO components, as it must be installed on the inference device. That is the reason why, on devices such as the Raspberry Pi, only the IE can be installed. This modular structure helps developers to truly separate both environments (training and deployment), requiring only two lightweight, optimized files. The IE is in charge of executing the inferences as transparently as possible for the developer. It loads the IR files and creates an executable network object, which is then loaded into the respective plugin, depending on the target device. The IE manages the plugin by itself, executing the network and balancing the inference requests among all the available devices. It always allocates each inference request to the least used of the available devices. Additionally, it is possible to extract performance data through counters, such as the inference time per layer, among other metrics. Finally, the IE produces an embedding per input image, which serves to determine who the person is. While the modular structure of the IE allows developers to deploy the model in many environments, the execution phase is carried out by the plugins of each family of devices. It is the plugin itself that decides how the inference requests are managed and distributed among the different cores and threads of each physical device. For example, the optimal number of inference requests for the Myriad X is 4. The developer is not responsible for planning the distribution of the requests; the plugin itself is.
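A rough sketch of how the IE can be driven from Python in this kind of pipeline is shown next. The method names follow the 2020-era openvino.inference_engine API and may differ in other releases (older versions expose net.inputs and request.outputs instead); the IR file names and the preprocessed face tensors are hypothetical placeholders, not the actual application code.

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# Load the two IR files produced by the Model Optimizer (hypothetical file names)
net = ie.read_network(model="facenet.xml", weights="facenet.bin")
input_blob = next(iter(net.input_info))
output_blob = next(iter(net.outputs))

# Load the network on one NCS2; the MYRIAD plugin handles the internal scheduling
exec_net = ie.load_network(network=net, device_name="MYRIAD", num_requests=4)

# Dispatch up to 4 asynchronous inference requests, one face per request
faces = [np.zeros((1, 3, 299, 299), dtype=np.float32) for _ in range(4)]  # placeholder faces
for i, face in enumerate(faces):
    exec_net.requests[i].async_infer({input_blob: face})

# Collect the 128-value embeddings once each request has finished
embeddings = []
for i in range(len(faces)):
    exec_net.requests[i].wait()
    embeddings.append(exec_net.requests[i].output_blobs[output_blob].buffer)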
Although it is possible to configure OpenVINO to use more or fewer than 4 inference requests, the developer then loses control over how the inference work is scheduled.

Taking into account the architecture described in Section 4, an end-to-end application has been implemented. Given a video taken as a sequence of frames, this application performs face recognition. As depicted in Fig. 5, first the faces belonging to a frame are detected and, second, every face is passed to FaceNet to identify which person it corresponds to. Algorithm 1 shows the pseudocode of the inference stage. The work is mainly divided into four stages: getting frames from the input, detecting the faces within every frame, applying FaceNet to every detected face and, finally, showing the results. The final outcome of the application is the same input frame, but with the faces surrounded by bounding boxes marking their position and whether they are the target face or not.

 9: if euclidDistance(embedding, targetEmbedding) < THRESHOLD then
10:     markFaceRecognized(face, frame, true)
11: else
12:     markFaceRecognized(face, frame, false)
13: end if
14: end for
15: end if
16: showResults(frame)
Algorithm 1. Pseudocode of the inference pipeline.

It must be noted that the asynchronous API of the IE has been applied for performing the face recognition. FaceNet takes as input only one face per inference; that is, for an image with n faces we must dispatch n inferences through FaceNet. The asynchronous API allows dispatching inferences without waiting for them to finish and enables retrieving the results when they are needed. Thus, it is possible to perform several face inferences in parallel on the NCS2 sticks. The application has been developed using Python 3.6. Two parallel execution methods have been implemented for the inferences: multithreading and multiprocessing. They are explained in detail in the following subsections.

In the multithreading method, a process is created for the inferences, and each inference task is then executed by a single thread. This process only manages the inference loop: receiving an image, detecting the faces and applying face recognition to the faces previously found. Finally, the resulting image is put into a shared queue used by another process. In this model, no NCS2 is specifically allocated, so the workload of a single image can be shared across different sticks. From the host point of view, this method is more lightweight and should be the ideal one for deployment on devices with low computing power. However, it may be restricted by the so-called Global Interpreter Lock (GIL) [45]. The pseudocode corresponding to the multithreading approach is shown in Algorithm 2.

The GIL was created in order to ensure the sequential execution of threads in the Python interpreter. This lock is needed because Python considers its own interpreter as a resource that must be accessed sequentially. Thus, only one thread per Python process is able to acquire the lock. This behaviour allows Python to improve the efficiency of programs running in single-core environments, because the GIL is released whenever a thread waits for an I/O operation, and execution can be optimized through time-slicing techniques. Therefore, considering the interpreter as the bottleneck on multi-core devices, we propose a multiprocessing approach employing the same strategy. This model creates several processes, each with its own Python interpreter instance, allowing the effects of the GIL to be bypassed despite the increase in resource usage.
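The difference between the two launch strategies can be summarized with the following self-contained sketch. The inference_loop worker and the device list are hypothetical placeholders standing in for the real application code; Algorithms 2 and 3 further below give the pseudocode actually used in this work.

import threading
import multiprocessing

def inference_loop(device_id: int) -> None:
    # Placeholder for the real worker: read a frame, detect the faces,
    # run FaceNet on device `device_id` and publish the results.
    pass

DEVICES = [0, 1, 2]  # one entry per NCS2 stick in the cluster

def run_multithreading() -> None:
    # One Python process, one thread per stick: all threads share a single GIL.
    threads = [threading.Thread(target=inference_loop, args=(d,)) for d in DEVICES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_multiprocessing() -> None:
    # One interpreter per stick: each process owns its own GIL and its own NCS2.
    procs = [multiprocessing.Process(target=inference_loop, args=(d,)) for d in DEVICES]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    run_multiprocessing()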
The multiprocessing method implies creating several inference processes, one per NCS2 employed to execute the inferences. Ultimately, these processes do the same work as the threads above but, in this case, the NCS2 devices cannot be shared among the processes. This method is expected to be more resource-hungry for the host, but a higher performance is also expected. Each process is bound to a single NCS2, so the sticks are not able to share the workload of the same image. The pseudocode corresponding to this approach is shown in Algorithm 3, and a more detailed graphical comparison is shown in Fig. 7.

1: createInputReaderProcess()
2: numDevices = getNumDevices()
3: for i in range(0, numDevices) do
4:     thread = createInferenceThread(i)
5:     launchThread(thread)
6: end for
7: createProcessingResultsProcess()
Algorithm 2. Pseudocode of the multithreading approach.

1: createInputReaderProcess()
2: devices = getInferenceDevices()
3: for all device in devices do
4:     createInferenceProcess(device)
5: end for
6: createProcessingResultsProcess()
Algorithm 3. Pseudocode of the multiprocessing approach.

In order to meet real-time constraints, a frame is sampled from the video only when a process or a thread has finished inferring the previous image and can take that sample immediately.

In this section we present our experiments employing the system described in Section 4. For the sake of fairness, the CPU frequency was set to the maximum value to achieve the highest performance. In order to evaluate the performance, several 299x299-pixel videos, accessible through the Google Drive URL in [46], have been processed. The following metrics have been calculated: the minimum, maximum and average frames per second (FPS); the minimum, maximum and average inference time; the average number of faces in relation to the FPS; and the total execution times and number of calculated inferences. Moreover, power figures will also be shown. These consumption metrics have been acquired using the INA260, as explained in Section 4.

In this first experiment, several configurations of the cluster have been tested depending on the number of mounted NCS2 sticks: 1, 2 and 3. Moreover, the number of faces appearing in the videos has been considered in this study, employing videos that contain 1, 2, 5, 9, 15 and a random number of faces (labeled as R). It must be noted that in all these cases the host is the i7-based laptop. Fig. 6 contains the FPS data when employing the multithreading model. As expected, the FPS decreases as the number of faces in the videos increases. It is also noticeable that additional NCS2 sticks hardly provide any speedup. In order to analyze this issue more deeply, the inference times have been measured as well, as shown in Table 3. According to the results, we observe that adding NCS2 sticks hardly varies the computation time of each inference in the accelerators, so the loss of performance is due to either the aforementioned GIL or the internal scheduler of OpenVINO. The OpenVINO scheduling policy consists of allocating each inference request to the available device with the lowest workload. However, OpenVINO scheduling can be discarded as the bottleneck, since the results with the multiprocessing model are much better (see Fig. 8). The multiprocessing results are summarized in Fig. 8. As this figure depicts, the application scales reasonably well when increasing the number of NCS2 sticks.
For example, when considering the case of just one face, the maximum performance achieved is 20, 40 and close to 58 FPS@299x299 with 1, 2 and 3 NCS2 sticks, respectively. This effect is also observed for the average number of FPS (FPS Avg in the figure), so the system built is able to provide a higher performance in the multi-face scenario. Table 4 confirms this last aspect, showing the speedup results obtained for the multiprocessing model. The speedups are close to their ideal value for 2 and 3 NCS2 sticks in most cases. It must be noted that the best accelerations are achieved when the number of faces increases, which means that more inferences are processed. The proposed implementation not only presents good scalability rates in a multi-stick system, but also behaves especially well in high-demand environments. Fig. 9 shows the average temporal evolution of the FPS and the number of faces for the random case with the multiprocessing model. As can be seen, the loss of performance due to the extra computation with more faces in the scene is more than compensated by the use of the multi-stick solution. We highlight that the worst case observed using 3 NCS2 sticks is always above 25 FPS@299x299. In other words, the use of several accelerators not only translates into a higher peak performance, but also allows higher loads to be supported, enhancing system resilience.

In this subsection, a similar study is carried out, but considering a Raspberry Pi 3B+ and a Raspberry Pi 4 as the hosts of our system. Unlike in the previous section, only the multiprocessing model has been considered. Fig. 10 shows the FPS for 1, 2 and 3 NCS2 sticks in combination with the Raspberry Pi 3B+, considering a range of faces in the scene. As can be seen, although there are some speedups with several sticks, the scalability rates differ from those observed with the i7 host. The performance boost is not as pronounced as in the prior case. In fact, there is no big improvement when increasing the number of NCS2 sticks from 2 to 3. Nevertheless, the Raspberry Pi 3B+ is a low-power platform. Even though the FPS@299x299 performance is lower than the frame rates achieved in the previous experiments, as seen in Fig. 10, the power consumption of the Raspberry Pi 3B+-based architecture has been considerably reduced. In order to obtain the power consumption, the INA-based system was utilized, as described in Section 4. The measured power consumption is shown in Fig. 11 for the three configurations of the cluster. Table 5 contains the minimum, maximum and Root Mean Square (P_RMS) values of the power figures obtained through the measurement system. As observed, there is an approximate increase of 1.2 W per stick. Table 6 then shows the power consumption of the whole system. As can be seen, the upper bound is around 10 W when employing the 3-NCS2 configuration.

Taking into account the results extracted from the Raspberry Pi 3B+, another embedded board has been tested as well. Unlike its predecessor, the Raspberry Pi 4 provides USB 3.0 connections, while being a bit more power-hungry. The performance results are shown in Fig. 12. Using the Raspberry Pi 4 has almost doubled the average FPS in multi-face environments. As with the previous board, there is no significant speedup when scaling from 2 to 3 NCS2 sticks. However, it is important to note that the Raspberry Pi 4 with 2 NCS2 sticks has reached almost the same performance rate as the isolated i7-based PC, with a much lower power requirement and at around a third of its overall cost.
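For reference, the power figures could be gathered with a short script on the measurement Raspberry Pi. The sketch below assumes the Adafruit CircuitPython INA260 library and is not necessarily the code used in this work; it samples voltage and current over I2C and reduces the samples to the minimum, maximum and P_RMS values reported in Tables 5 and 6.

import math
import time
import board
import adafruit_ina260

i2c = board.I2C()                      # I2C bus of the measurement Raspberry Pi
sensor = adafruit_ina260.INA260(i2c)   # default I2C address 0x40

samples = []
for _ in range(600):                   # roughly 60 s of samples at 10 Hz
    # In this library, .voltage is reported in volts and .current in milliamps
    power_w = sensor.voltage * (sensor.current / 1000.0)
    samples.append(power_w)
    time.sleep(0.1)

p_min = min(samples)
p_max = max(samples)
p_rms = math.sqrt(sum(p * p for p in samples) / len(samples))
print(f"min={p_min:.2f} W  max={p_max:.2f} W  P_RMS={p_rms:.2f} W")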
Despite still being far from real-time video conditions, processing 6 FPS on average (around one frame every 166 ms) in multi-face scenarios is enough for the targeted environments, such as hospitals or governmental installations. Finally, the power consumption has been measured, as in the case of the Raspberry Pi 3B+-based system, and it never surpasses 11.2 W, which is also a very low power consumption.

Comparing general-purpose hosts with embedded platforms is a difficult task, as they differ not only in features, but also in power and cost. This section addresses this problem by comparing the efficiency of the previously studied systems in terms of their performance/power ratio. These results are shown in Table 7. It must be noted that, in the case of the PC, the power consumption of the i7 is 37 W, but the maximum consumption of the device is 57.9 W according to the 3DMark06 benchmark [47]. These results have been extracted from a full benchmark of the device [48]. As can be observed, the Raspberry Pi 3B+-based cluster and the PC-based cluster are close in terms of FPS/Watt. It is noteworthy that the former is better in single-face scenarios, while the latter obtains better efficiency figures in multi-face environments. In any case, the Raspberry Pi 4 surpasses both of them in most cases. Finally, it is worth mentioning that the cluster with three NCS2 sticks and the Raspberry Pi 4 host achieves a 3.3X improvement in terms of FPS/Watt with respect to the PC-based solution in multi-face scenarios.

In this paper, an NCS2 cluster has been proposed to accelerate the face recognition task in order to address security issues in unconstrained scenarios. As has been shown, this cluster allows mitigating the loss of performance in multi-face scenarios, which is critical in situations such as the COVID-19 pandemic. The multiprocessing model has suited the features of this cluster better than the multithreading one. All in all, the cluster provides a low-power and low-cost solution to be deployed in the aforementioned scenarios, especially when combined with the Raspberry Pi 4 host, achieving an average performance of 6 FPS and improving the energy efficiency of the baseline solution by more than three times. As future work, it would be interesting to explore different implementations to overcome the limitations imposed by the Python GIL.

None.

References
ImageNet classification with deep convolutional neural networks
Deep learning
Mobile multimedia/image processing, security, and applications 2019, vol. 10993, International Society for Optics and Photonics, SPIE
FaceNet: tracking people and acquiring canonical face images in a wireless camera sensor network
Deep face recognition: a survey
Eigenfaces for recognition
Acceleration and energy consumption optimization in cascading classifiers for face detection on low-cost ARM big.LITTLE asymmetric architectures
FaceNet: a unified embedding for face recognition and clustering
Efficient Mitchell's approximate log multipliers for convolutional neural networks
Labeled faces in the wild: a database for studying face recognition in unconstrained environments
Complexity reduction in the HEVC/H265 standard based on smooth region classification
HEVC optimization based on human perception for real-time environments
Lightening the load with highly accurate storage- and energy-efficient LightNNs
Low-power implementation of Mitchell's approximate logarithmic multiplication for convolutional neural networks
Intel® Neural Compute Stick 2
Tensorflow: an end-to-end open source machine learning platform
Going deeper with convolutions
Inception-v4, Inception-ResNet and the impact of residual connections on learning
Edge computing: vision and challenges
Intel® Movidius™ Neural Compute Stick
Intel® Movidius™ Neural Compute Stick
Embedded systems for next-generation autonomous machines
EmBench: quantifying performance variations of deep neural networks across modern commodity devices. The 3rd International Workshop on Deep Learning for Mobile Systems and Applications
iSEC: an optimized deep learning model for image classification on edge computing
A new deep learning application based on Movidius NCS for embedded object detection and recognition
Lung nodule detection from low dose CT scan using optimization on Intel Xeon and Core processors with Intel Distribution of OpenVINO toolkit
An FPGA-based hardware accelerator for CNNs using on-chip memories only: design and benchmarking with Intel Movidius Neural Compute Stick
Benchmarking deep learning frameworks and investigating FPGA deployment for traffic sign classification and detection
NVIDIA Jetson TK1 Development Kit
OpenFace: a general-purpose face recognition library with mobile applications
Face recognition based surveillance system using FaceNet and MTCNN on Jetson TX2
Joint face detection and alignment using multitask cascaded convolutional networks
Implementation of an evolutionary facial recognition algorithm on Jetson TK1
Design and implementation of vehicle unlocking system based on face recognition. 2019 34th Youth Academic Annual Conference of Chinese Association of Automation (YAC)
Person recognition for access logging
Intel® Movidius™ Neural Compute SDK
An optimized face recognition for edge computing
Transitioning from Intel® Movidius™ Neural Compute SDK to Intel® Distribution of OpenVINO™ Toolkit
Ultra-low-power adder stage design for exascale floating point units
WIDER FACE: a face detection benchmark