key: cord-0047220-6zpk86h6
authors: Dokic, Kristian
title: Microcontrollers on the Edge – Is ESP32 with Camera Ready for Machine Learning?
date: 2020-06-05
journal: Image and Signal Processing
DOI: 10.1007/978-3-030-51935-3_23
sha: 6cb2db3f1b210088e635d62315360b2af3d21ffd
doc_id: 47220
cord_uid: 6zpk86h6

Most machine learning tasks need substantial computing power, but some can be done on microcontrollers. In this paper the well-known ESP32 SoC is analyzed. It is usually used in IoT devices for data measurement, but some authors have started to use simple machine learning algorithms on it. This paper analyzes the possibility of using an ESP32 with a built-in camera for machine learning algorithms. The research focuses on the durations of photographing and photograph processing, because these can be a bottleneck of a machine learning task. For this purpose, logistic regression has been implemented on an ESP32 with a camera. It has been used to differentiate two handwritten letters (“o” and “x”) on greyscale pictures. The logistic regression weights have been calculated in the cloud and then transferred to the ESP32. The output results have been analyzed. The durations of photographing and processing have been analyzed, as well as the impact of the PSRAM memory on performance. It can be concluded that an ESP32 with a camera can be used for some simple machine learning tasks, and for taking camera pictures and preparing them for other, more powerful processors. The Arduino IDE still does not provide a sufficient level of optimization for the PSRAM memory.

A microcontroller is a cheap, programmable system that generally includes memory and I/O interfaces on a single chip. Microcontrollers have been developed for decades, but the main paradigm did not change until the first decade of the 21st century.
The ubiquity of the Internet has resulted in the appearance of services that offer sending, collecting and analyzing data from microcontrollers in the cloud. In most cases the connection between the microcontroller and the Internet is made through Wi-Fi. Cloud services offer many advantages, such as reliability and data visualization, but in the last few years a new paradigm has arrived: edge computing. Some authors state that "edge computing refers to the enabling technologies allowing computation to be performed at the edge of the network, on downstream data on behalf of cloud services and upstream data on behalf of IoT services" [1]. On the other hand, in the last few years, many microcontroller producers have worked on implementing machine learning on microcontrollers. Some of them have developed special libraries with machine learning functions [2, 3], while others have implemented special hardware with enhanced machine learning capabilities [4, 5].

In this paper, a low-cost Chinese SoC, the ESP32, with a camera has been analyzed. The ability to apply a simple machine learning algorithm (logistic regression) has been tested, as well as the impact of PSRAM memory on performance. The focus of the research has been on the durations of photographing and photograph processing. The ESP32 has been chosen because it has attracted great interest since the start of its production.

In Sect. 2, a few papers in which the ESP32 is used for machine learning algorithms are presented. In Sect. 3, the logistic regression implementation on an ESP32 with a camera is presented. An ESP32 board with a camera has been used to differentiate two letters on greyscale photos taken with the on-board camera. Logistic regression has been used to solve that problem, and the first part has been realized with the Google Colaboratory service. After that, final coding and microcontroller programming have been done with the Arduino IDE. In Sect. 4, camera and picture processing speed are analyzed, as well as the impact of PSRAM memory on performance. Finally, the discussion and conclusion can be found in Sects. 5 and 6.

The ESP32 SoC is the second generation of Espressif's IoT solution, and it includes Wi-Fi and Bluetooth. It is based on the 32-bit RISC Tensilica Xtensa LX6 MCU with an FPU and DSP. The clock speed is 240 MHz, and it has 520 KB of SRAM. Ivkovic put it in the "MCU IoT Ready" group because of its integrated FPU, DSP and Wi-Fi components [6]. The ESP32 datasheet contains a list of applications in which the ESP32 can be used. Some of them are speech and picture recognition, internet radio players, energy monitors and smart lighting [7].

Only a few authors have used the ESP32 for ML tasks. Kokoulin et al. used the ESP32 to reduce the network traffic and computing load of a central face recognition server. They implemented a microcontroller-based system that processes a video stream from a public place and detects the presence of a face or silhouette fragment. Only pictures with faces or silhouettes are sent to the main server. The estimated traffic decrease is up to 80-90% [8]. Espressif Systems has developed the ESP-WHO framework for face detection and recognition, and it is available on GitHub [9, 10]. They state that the framework is based on the Multi-task Cascaded Convolutional Networks model and a new mobile architecture, MobileNetV2 [11, 12].

Logistic regression is a classification algorithm. It is used when a model has to return a limited number of values and the dependent variable is categorical. With only two possible outcomes, it is called binary logistic regression [13]. Logistic regression is similar to linear regression; the difference is that the weighted sum of the inputs is passed through a function that maps any value to the range between zero and one. The sigmoid function is used for that purpose.
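The relation described above can be written compactly. The notation below (input vector x, weight vector θ, hypothesis h_θ, sigmoid σ) is the conventional one and is assumed here, since the original equations are not reproduced in this text:

```latex
h_\theta(x) = \sigma\left(\theta^{T} x\right) = \frac{1}{1 + e^{-\theta^{T} x}},
\qquad 0 < h_\theta(x) < 1
```

The weighted sum θ^T x is the linear regression part, and the sigmoid maps it to a value that can be read as the probability of one of the two classes.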
In this paper, binary logistic regression has been used, and the θ^T values (the weights) have been calculated in the cloud. These values have been transferred to the ESP32 SoC, where they are used to decide which letter is in front of the camera connected to the microcontroller.

In this paper, an ESP32 with a camera module is used. Many vendors produce ESP32 boards with a camera, with different characteristics. Here, the ESP32-CAM produced by AI-Thinker and the M5CAMERA by M5STACK have been used. The main difference between them is that the M5CAMERA comes in a plastic case and has 4 MB of PSRAM memory. The ESP32-CAM has no PSRAM memory and has only 512 kB of RAM on the board. Both boards have the same camera, the OV2640.

for white color. It can be seen that the first row of the file contains the values for the letter "x". This is easy to verify because the value of the fifth pixel, in the center, must be close to zero. It is 0.192156863.

The next step is calculating the weights with logistic regression. The logistic regression implementation from scikit-learn has been used. The code with comments is available on GitHub. The output is a file named weights3.csv containing the weights. It is a nine-dimensional vector that has to be transferred to an ESP32 with a camera, where it is used to decide which picture is in front of the camera.

The program for the ESP32-CAM and M5CAMERA has been developed in the Arduino IDE, and the vector from the previous step has been imported as an array. Both boards have the OV2640 camera, and different resolutions can be used, the highest being 1600 × 1200. All available resolutions are rectangular rather than square, whereas the pictures used for calculating the logistic regression weights were square. The solution is to cut off parts of the captured picture on the left and right sides. All pictures and the code for both boards, with comments, are available at https://github.com/kristian1971/LogRegESP32-CAM/.
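The cropping, downsampling and classification steps described above can be sketched in portable C++ as follows. This is a minimal sketch, not the author's code: the function names downsampleTo3x3 and logisticOutput, the block-averaging scheme, the row-major 8-bit frame layout, and the separate bias term are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Downsample the centered square part of a greyscale frame to a 3x3 grid.
// Each output value is the mean of one block, scaled to [0, 1]
// (0 = black, 1 = white). Cutting equal strips from the left and right
// sides follows the text; block averaging is an assumption.
std::array<float, 9> downsampleTo3x3(const std::vector<std::uint8_t>& frame,
                                     std::size_t width, std::size_t height) {
    assert(width >= height && height >= 3 && frame.size() == width * height);
    const std::size_t side = height;                 // side of the square crop
    const std::size_t xOffset = (width - side) / 2;  // strip cut on the left
    std::array<float, 9> cells{};
    for (std::size_t cy = 0; cy < 3; ++cy) {
        for (std::size_t cx = 0; cx < 3; ++cx) {
            // Integer block boundaries cover the square without gaps.
            const std::size_t y0 = cy * side / 3, y1 = (cy + 1) * side / 3;
            const std::size_t x0 = cx * side / 3, x1 = (cx + 1) * side / 3;
            std::uint32_t sum = 0;
            for (std::size_t y = y0; y < y1; ++y)
                for (std::size_t x = x0; x < x1; ++x)
                    sum += frame[y * width + xOffset + x];
            const float count = static_cast<float>((y1 - y0) * (x1 - x0));
            cells[cy * 3 + cx] = static_cast<float>(sum) / (255.0f * count);
        }
    }
    return cells;
}

// Binary logistic regression: sigmoid of the bias plus the weighted sum.
// The bias is an assumed extra parameter; pass 0.0f if there is none.
float logisticOutput(const std::array<float, 9>& x,
                     const std::array<float, 9>& weights, float bias) {
    float z = bias;
    for (std::size_t i = 0; i < 9; ++i) z += weights[i] * x[i];
    return 1.0f / (1.0f + std::exp(-z));
}
```

On the boards, the frame would come from the camera driver and the weights from weights3.csv. The result is a value between 0 and 1, which the text reports as around 0.25 for "o" and around 0.70 for "x".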
The program output from the ESP32-CAM and M5CAMERA is a number that can be received over the serial port. The programmed boards have been tested with different pictures, and the results were similar. The boards sent numbers around 0.25 when a paper with the letter "o" was in front of the camera, and numbers around 0.70 when a paper with the letter "x" was in front of the camera. The device is sensitive to light, and changes in light intensity cause slight changes in the output values. An overview of the proposed approach can be seen in Fig. 1.

After the logistic regression implementation, the next goal was to analyze the speed of the camera and of photo processing on the ESP32-CAM and M5CAMERA boards. These boards can be used for taking pictures as well as for preparing photos for a more powerful processor connected to the ESP32 board over a serial, Wi-Fi or Bluetooth connection. This second processor can be used for more intensive calculations. This part of the research has been divided into two parts. In the first part, the ESP32-CAM and M5CAMERA have been used without the 4 MB PSRAM, and in the second part PSRAM has been enabled on the M5CAMERA board.

The measurements were performed by inserting calls to the micros() function in the program code. This function returns the number of microseconds since the microcontroller reset. Subtracting two such values gives the execution time between the inserted calls. The duration of photographing is the time between issuing the command to take a photograph and the moment the command finishes executing. An example is here:

    unsigned long time1 = micros();            // timestamp before the capture
    // taking photograph
    unsigned long time2 = micros();            // timestamp after the capture
    unsigned long Total_time = time2 - time1;  // elapsed time in microseconds

Photograph processing time is the time needed to downsample the picture and calculate the logistic regression value. Downsampling is more time-consuming than the logistic regression calculation because all used pixels have to be taken into account. Without the 4 MB PSRAM memory, the ESP32-CAM and M5CAMERA boards can use only one photo buffer.
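The same subtract-two-timestamps pattern can be wrapped in a reusable helper. The sketch below is an illustration in portable C++ rather than the author's Arduino code: microsNow() is a hypothetical stand-in for Arduino's micros() built on std::chrono, and measureMicros() is an assumed helper name.

```cpp
#include <chrono>
#include <cstdint>

// Stand-in for Arduino's micros(): microseconds elapsed since the first call.
// On the ESP32 the real micros() would be used instead.
std::uint64_t microsNow() {
    static const auto start = std::chrono::steady_clock::now();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
}

// Run a callable once and return how long it took, in microseconds.
template <typename Work>
std::uint64_t measureMicros(Work work) {
    const std::uint64_t t1 = microsNow();  // timestamp before the work
    work();                                // e.g. capture or downsampling
    const std::uint64_t t2 = microsNow();  // timestamp after the work
    return t2 - t1;
}
```

Passing the capture call and the processing call to measureMicros() separately would give the two durations discussed in the text. A monotonic clock is used so that the difference cannot be negative.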
In Table 1, four different resolutions and the corresponding durations of photographing and processing can be seen. The duration of photographing clearly depends on the photo resolution, and so does the photo processing time. This can also be seen in Figs. 2 and 3. There are two different charts because the camera takes a photo in full resolution, but the microcontroller processes only the square part of the photo, so the number of pixels in the taken photo differs from the number in the processed photo.

The results with PSRAM enabled on the M5CAMERA board are given in Table 2. It can be seen that the duration of photographing is less than a millisecond regardless of photo resolution. On the other hand, photo processing times are much higher than in Table 1. The durations of photographing are not presented in a chart, but the photo processing times are presented in Fig. 4.

The elapsed time to take a picture clearly depends on the picture resolution. When the microcontroller without PSRAM is used, there is a linear relationship between the number of pixels and the elapsed time to take a picture, as can be seen in Fig. 2. The values are between 70 ms and 540 ms. The microcontroller with PSRAM takes less than 1 ms to take a picture, but, as mentioned before, a microcontroller with PSRAM uses two buffers. While the first picture is being processed, the second is already waiting in the buffer.

Picture processing time, like the elapsed time to take a picture, depends on the picture resolution. It consists of the downsampling time and the logistic regression calculation. In our case, downsampling consists of two integer multiplications and three integer additions for each pixel. The size of the picture after downsampling does not affect the duration of the process because all pixels have to be taken into account. The microcontroller without PSRAM has a downsampling time within the expected range: about 240 ns is needed per pixel for downsampling [14].
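As a rough illustration of the per-pixel figure, assume a hypothetical square crop of 240 × 240 pixels. This size is an assumption chosen for the example, not a value taken from Table 1. The expected downsampling time is then:

```latex
t_{\text{down}} \approx N_{\text{pixels}} \cdot 240\,\text{ns}
= (240 \cdot 240) \cdot 240\,\text{ns}
= 57\,600 \cdot 240\,\text{ns}
\approx 13.8\,\text{ms}
```

Because the cost is proportional to the number of pixels in the crop, halving the side of the square would cut this estimate to about a quarter.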
Logistic regression calculation time depends on the downsampled picture size and includes one floating-point multiplication and one addition per downsampled pixel. A floating-point multiplication and addition take 54 ns with 32-bit floating-point numbers [14]. Logistic regression calculation time is very short in our case because the downsampled picture has only nine pixels. The microcontroller with PSRAM has much higher photo processing times. According to the producer's website, PSRAM uses the same cache as the external flash, but when large chunks of data (>32 KB) are accessed, speed falls back to the access speed of the external RAM, which is a little over 7 MB/s [15, 16].

In this paper, the ESP32 microcontroller has been presented because its quality is proven and because boards with cameras have been developed in the last year. The cameras on the tested boards (ESP32-CAM and M5CAMERA) have sufficient resolution for most machine learning tasks. Downsampling can be time-consuming on the ESP32, but machine learning algorithms usually use low-resolution pictures, so it is recommended to set the camera capture resolution to the lowest level. Logistic regression calculation speed depends on the number of inputs, but with single-precision floating-point numbers it takes about 100 ns per input node. In our case, with nine inputs, it is negligible. From our experience, PSRAM usage is not recommended with the Arduino core for ESP32 WiFi chip, version 1.0.2. It appears that the Arduino IDE still does not provide a sufficient level of optimization for all board types. Espressif Systems has developed the Espressif IoT Development Framework, the official development framework for the ESP32 chip, which probably provides the highest optimization level.

We can conclude that an ESP32 with a camera has enough computing power for simple machine learning tasks, and for taking camera pictures and preparing them for other, more powerful processors.
In future analysis, it will be interesting to test the ESP32 with different neural networks and to try to use both Tensilica Xtensa LX6 cores in calculations, because the ESP32 has two cores.

Edge computing: vision and challenges
Convert Neural Networks into Optimized Code for STM32
CMSIS-NN: efficient neural network kernels for Arm Cortex-M CPUs
General vision, Presentation of the CurieNeurons on Arduino/Genuino101, 6
Getting Started with the NVIDIA Jetson Nano Developer Kit
Analysis of the performance of the new generation of 32-bit microcontrollers for IoT and big data application
Hierarchical convolutional neural network architecture in distributed facial recognition system
Face Detection and Recognition on the ESP32
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Joint face detection and alignment using multi-task cascaded convolutional networks
The regression analysis of binary sequences