title: A 2-µJ, 12-Class, 91% Accuracy Spiking Neural Network Approach for Radar Gesture Recognition
authors: Ali Safa, André Bourdoux, Ilja Ocket, Francky Catthoor, Georges G.E. Gielen
date: 2021-08-05

Abstract: Radar processing via spiking neural networks (SNNs) has recently emerged as a solution in the field of ultra-low-power wireless human-computer interaction. Compared to traditional energy- and area-hungry deep learning methods, SNNs are significantly more energy-efficient and can be deployed in the growing number of compact SNN accelerator chips, making them a better solution for ubiquitous IoT applications. We propose a novel SNN strategy for radar gesture recognition, achieving more than 91% accuracy on two different radar datasets. Our work significantly differs from previous approaches in that 1) we use a novel radar-SNN training strategy, 2) we use quantized weights, enabling power-efficient implementation in real-world SNN hardware, and 3) we report the SNN energy consumption per classification, clearly demonstrating the real-world feasibility and power savings of SNN-based radar processing. We release evaluation code to help future research.

Wireless human-computer interaction using radar-based gesture recognition systems has attracted large interest during the past decade, enabling applications such as smart domotics, AR/VR headsets and many other touchless interfacing solutions that are key for a more hygienic, post-COVID-19 world [1]. In order to embed radar sensing into ubiquitous, ultra-low-power IoT devices, research on the hardware side has mainly been devoted to high-level integration of radar transceivers [2] with a focus on energy and area efficiency [3]. In contrast, research on the signal processing side has mainly been devoted to the use of high-accuracy deep neural networks (DNNs), which are known to be rather energy- and area-hungry [4], [5]. State-of-the-art DNN-based techniques rely either on an expensive desktop-grade GPU [4] or on a lower-power, lower-area embedded GPU (e.g., the 10-W Nvidia Jetson Nano) [5], still ill-suited for ultra-low-power applications like ubiquitous IoT.

Very recently, the use of energy-efficient spiking neural networks (SNNs) for radar processing has grown into an emerging topic in radar sensing and is currently being investigated by many teams [6]-[9]. Algorithm-wise, SNNs differ from DNNs in that they communicate inter-neural information asynchronously, using binary spikes that are only emitted when the neuron membrane potential reaches a specific threshold. In contrast to DNNs, SNNs do not require expensive multiply-accumulate operations at the input of each neuron, but make use of inexpensive add operations only. Hardware-wise, SNNs can be integrated near the radar sensor (see Fig. 1) as sub-threshold analog circuits, reaching more than 5 orders of magnitude lower power consumption than embedded GPUs [5], [11], [12]. Still, the development of SNN-based radar processing is at an early stage. In this letter, our aim is to propose a novel SNN architecture for radar gesture recognition using a different approach than the ones used in previously presented radar-SNN systems.
Compared to previous works [6], [8], [9], which use either µDoppler pre-processing [13] or range-Doppler pre-processing [4], we demonstrate that our novel radar-SNN approach is compatible with both pre-processing techniques. In contrast to the work in [7], our approach is purely SNN-based, while the system of [7] uses an SNN followed by classical machine learning techniques such as Random Forest, which cannot be deployed in sub-threshold analog SNN circuits. Compared to [7]-[9], our system uses implementation-ready, quantized weights (the typical bit width in SNN hardware is < 8 bits [6], [10]), while none of the aforementioned works quantize their weights, making their reported performances (85%-98%) unclear when deployed in real-world hardware (12-class 91% with 6-bit weights and 5-class 93% with 4-bit weights in our work). Finally, in contrast to most previously mentioned works [7]-[9], we report an estimate of our SNN energy consumption when deployed in dedicated SNN hardware [6]. We assess the performance of our system on two different radar gesture datasets: the 12-class Google Soli dataset of [4] and the 5-class 8-GHz dataset of [6]. We report a 3× increase in the number of gesture classes compared to state-of-the-art quantized-weight SNNs.

The dataset of [6] contains radar ADC data with N_chirps = 192 chirps per frame and a variable number of frames N_frames per gesture acquisition (step 1 in Fig. 1). µDoppler signatures [13] are acquired for each gesture acquisition in the dataset by first computing the range profiles R_n[k] for each chirp n = 1, ..., N_tot (where N_tot is the total number of chirps). R_n[k] is obtained by DFT using a Blackman window [14] (step 2 in Fig. 1). Then, we apply the Short-Time Fourier Transform (STFT) along the slow-time axis (step 3 in Fig. 1), removing the strong DC component within each analysis window [15], as follows:

$$\Theta[m, f] = \sum_{n=0}^{s-1} \left( R_{mR+n}[k^*] - \bar{R}_m[k^*] \right) g_s[n]\, e^{-j 2\pi f n / s} \quad (1)$$

where $\bar{R}_m[k^*]$ is the mean of $R_{mR+n}[k^*]$ over the m-th analysis window, k* denotes the range bin where the gestures are executed, g_s denotes a Hanning window of length s, and R is the hop size (s = 192 and R = 8 throughout this paper). k* is known a priori as the gestures are executed at 2 meters from the radar. We define the µDoppler signature as |Θ[m, f]|, which is a matrix of size (N_T × s), with N_T given by [9]:

$$N_T = \left\lfloor \frac{N_{tot} - s}{R} \right\rfloor + 1 \quad (2)$$

Several µDoppler example maps are thus obtained for each acquisition. By balancing the dataset and by removing the first and last 6 example maps of each acquisition to suppress start-up artefacts (when the human simply sits in front of the radar before performing gestures) and ending artefacts (when the human reaches out to the radar to stop it), we obtain a balanced dataset with a total of 1695 µDoppler examples. Each example map is then normalized to [0, 1]. Out-of-band noise is removed by band-limiting the Doppler frequency axis, keeping only the normalized frequency range [−0.26, 0.26]. This frequency band was identified visually by evaluating the maximal significant extent of the Doppler spectra in the dataset. Then, we use soft thresholding [16] to remove in-band noise in each Doppler spectrum (step 4 in Fig. 1): the k largest values of each spectrum are kept and the remaining ones are set to 0. We choose k heuristically, considering that more than half of the Doppler samples within the retained normalized frequency band are noise-dominated. Fig. 1 shows an example radar map resulting from the µDoppler pre-processing described above. The pre-processed radar µDoppler maps must then be converted into event streams to be compatible with the spiking nature of our SNN.
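As an illustration of the pre-processing chain described above, the following is a minimal NumPy sketch of steps 3-4 (STFT with per-window DC removal, normalization, band-limiting and per-spectrum thresholding). It assumes the range profiles have already been computed; the function name and the top-k implementation are our own illustrative choices, not the authors' released code.

```python
import numpy as np

def mu_doppler(range_profiles, k_star, s=192, R=8, band=0.26, k_keep=None):
    """Sketch of the muDoppler pre-processing (steps 3-4 in Fig. 1).

    range_profiles: complex array of shape (N_tot, n_range_bins),
        the range profiles R_n[k] obtained by windowed DFT of each chirp.
    k_star: range bin where the gestures are executed (known a priori).
    Returns a normalized, band-limited, thresholded map of shape (N_T, n_kept_bins).
    """
    x = range_profiles[:, k_star]               # slow-time signal at bin k*
    N_tot = x.shape[0]
    N_T = (N_tot - s) // R + 1                  # number of STFT frames, Eq. (2)
    g = np.hanning(s)                           # analysis window g_s
    frames = np.stack([x[m * R : m * R + s] for m in range(N_T)])
    frames -= frames.mean(axis=1, keepdims=True)          # DC removal per window
    theta = np.fft.fftshift(np.fft.fft(frames * g, axis=1), axes=1)   # Eq. (1)
    sig = np.abs(theta)                         # muDoppler signature (N_T x s)
    sig = (sig - sig.min()) / (sig.max() - sig.min() + 1e-12)         # [0, 1]
    f = np.fft.fftshift(np.fft.fftfreq(s))      # normalized Doppler axis
    sig = sig[:, np.abs(f) <= band]             # keep [-0.26, 0.26] only
    if k_keep is not None:                      # keep k largest values per spectrum,
        thr = np.sort(sig, axis=1)[:, -k_keep][:, None]   # zero the rest (step 4)
        sig = np.where(sig >= thr, sig, 0.0)
    return sig
```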
Each pixel of the map is coded as a spike train of length T_inf (the number of time steps per inference). We encode each pixel using Time-To-First-Spike (TTFS) encoding (step 5 in Fig. 1), where a pixel of value v ∈ [0, 1] is quantized into an event train containing one spike located at index T_inf − vT_inf [17] (a code sketch is given at the end of this section). If the pixel is equal to 0, no spikes are emitted. As we are aiming at low-latency inference, we choose T_inf = 4 time steps.

The Soli dataset [4] has been acquired using a 60-GHz FMCW radar and is composed of 12 classes with a total of 5500 CFAR-processed range-Doppler magnitude acquisitions. Each gesture acquisition is a collection of maps RD[t, l, m], where t is the frame index, l is the range index and m is the Doppler index (see step 1 in Fig. 2), with a varying number of time steps t ∈ [1, T_fr] per acquisition. First, we average and sub-sample each gesture acquisition RD[t, l, m] (with varying T_fr) along t (step 2 in Fig. 2) to a fixed number T_inf < T_fr ∀T_fr of frames per acquisition, as follows:

$$\overline{RD}[n, l, m] = \frac{1}{P} \sum_{t = nP}^{(n+1)P - 1} RD[t, l, m], \qquad P = \left\lfloor \frac{T_{fr}}{T_{inf}} \right\rfloor \quad (3)$$

where n is the sub-sampled time index. Then, the resulting frames are converted to binary images RD_b[n, l, m] by thresholding against 0 (step 3 in Fig. 2). Therefore, for any pixel coordinate (l*, m*), RD_b[n, l*, m*] represents a spike train of length T_inf, set to 28 (the minimum T_fr in the dataset).

To classify the spiking radar tensors, we use the SNN architecture shown in Fig. 3 with Integrate-and-Fire (IF) neurons:

$$S = \begin{cases} 1 & \text{if } V_k \geq \mu \\ 0 & \text{otherwise} \end{cases} \qquad V_{k+1} = \max\!\left(0,\, V_k + J_{in} - \mu S\right) \quad (4)$$

where V_k ≥ 0 is the neural membrane potential at time step k, J_in is the neuron input, S is the spiking output and µ is the firing threshold.

[Fig. 3: SNN architecture used for radar processing. Each spiking map slice corresponding to each time step is fed one by one to the network, and the IF neurons change state according to their self-recurrence (as denoted by the black recurrence arrows).]

As the derivative of a spike as a function of the membrane potential is ill-defined, we create a custom neuron model using the PyTorch framework [18] that behaves as (4) in the forward pass. For the backward pass, we approximate the derivative using a Gaussian function as the surrogate derivative [19]:

$$\frac{\partial S}{\partial V_k} \approx \exp\!\left( -\frac{(V_k - \mu)^2}{2\sigma^2} \right) \quad (5)$$

This enables the use of back-propagation in the spiking domain. The layer-by-layer description of our SNN architecture (Fig. 3) is the following. After the spike-train encoding of the radar maps, we use a (5, 5, 12) convolutional layer. At each time step, the convolution result is fed to the IF neuron layer σ_1. Then, the spiking tensor at the output of σ_1 is down-sampled via MaxPooling and the resulting tensor is flattened to a 1-dimensional spiking vector. Then, two fully-connected spiking layers are used, and the 12- or 5-dimensional output of σ_3 (corresponding to the 12 or 5 gesture classes) is accumulated over time in a vector A. Finally, A is transformed via SoftMax into class probabilities (a PyTorch sketch is given at the end of this section). Our network architecture search was conducted with the objective of achieving > 90% accuracy with heavily quantized weights (at most 6-bit) and a small network size.

For training, we use the Adam optimizer [20] with learning rate 10^-3. The batch size is 128, and the SNN is first trained for 14 epochs with full-precision weights and then for 1 epoch with quantized weights in the forward pass and full-precision weights in the backward pass. The accuracy of our SNN is assessed using 6-fold cross-validation. Table I reports the performance of our proposed system (entry 6 for the 8-GHz dataset and entry 7 for the Soli dataset) against the state of the art.
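For completeness, the following is a minimal sketch of the TTFS encoding described above (step 5 in Fig. 1): a pixel of value v spikes once at index T_inf − vT_inf, and zero-valued pixels never spike. The function name and the (T, B, H, W) tensor layout are our own assumptions.

```python
import torch

def ttfs_encode(maps: torch.Tensor, T_inf: int = 4) -> torch.Tensor:
    """Time-To-First-Spike encoding of normalized radar maps.

    maps: float tensor in [0, 1], shape (B, H, W).
    Returns a binary spike tensor of shape (T_inf, B, H, W) with at most one
    spike per pixel: brighter pixels spike earlier, zero pixels never spike.
    """
    spikes = torch.zeros(T_inf, *maps.shape)
    # Spike index T_inf - v*T_inf, rounded to a valid time step.
    idx = (T_inf - torch.ceil(maps * T_inf)).long().clamp(0, T_inf - 1)
    mask = maps > 0                               # zero pixels emit no spike
    b, h, w = mask.nonzero(as_tuple=True)
    spikes[idx[mask], b, h, w] = 1.0
    return spikes
```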
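The custom IF neuron and the network of Fig. 3 could be sketched in PyTorch as follows. This is our reading of Eqs. (4)-(5), not the authors' released code: the reset-by-subtraction scheme, the threshold µ = 1, the surrogate width σ = 0.5, the MaxPooling factor, the hidden-layer width and the input size are all illustrative assumptions.

```python
import torch
import torch.nn as nn

SIGMA = 0.5  # surrogate-gradient width (our assumption)

class IFSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, Eq. (4); Gaussian surrogate, Eq. (5)."""
    @staticmethod
    def forward(ctx, v, mu):
        ctx.save_for_backward(v)
        ctx.mu = mu
        return (v >= mu).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        gauss = torch.exp(-(v - ctx.mu) ** 2 / (2 * SIGMA ** 2))
        return grad_out * gauss, None

class RadarSNN(nn.Module):
    def __init__(self, n_classes=12, in_shape=(32, 32), mu=1.0):
        super().__init__()
        self.mu = mu
        self.conv = nn.Conv2d(1, 12, kernel_size=5, padding=2)  # (5, 5, 12) layer
        self.pool = nn.MaxPool2d(2)
        feat = 12 * (in_shape[0] // 2) * (in_shape[1] // 2)
        self.fc1 = nn.Linear(feat, 128)          # hidden width is an assumption
        self.fc2 = nn.Linear(128, n_classes)

    def forward(self, spikes):                   # spikes: (T_inf, B, H, W)
        v1 = v2 = v3 = 0.0                       # membrane potentials
        acc = 0.0                                # output accumulator A
        for k in range(spikes.shape[0]):         # one map slice per time step
            v1 = torch.clamp(v1 + self.conv(spikes[k].unsqueeze(1)), min=0)
            s1 = IFSpike.apply(v1, self.mu)      # IF layer sigma_1
            v1 = v1 - self.mu * s1               # reset by subtraction
            x = self.pool(s1).flatten(1)         # MaxPool + flatten
            v2 = torch.clamp(v2 + self.fc1(x), min=0)
            s2 = IFSpike.apply(v2, self.mu)      # IF layer sigma_2
            v2 = v2 - self.mu * s2
            v3 = torch.clamp(v3 + self.fc2(s2), min=0)
            s3 = IFSpike.apply(v3, self.mu)      # IF layer sigma_3
            v3 = v3 - self.mu * s3
            acc = acc + s3                       # accumulate output spikes in A
        return torch.softmax(acc, dim=1)         # class probabilities
```

Since `IFSpike` only replaces the gradient of the thresholding operation, the whole network remains trainable end-to-end with the standard Adam optimizer, as described above.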
We evaluate the energy per classification E_c of our SNN using the hardware metrics of the µBrain SNN chip described in [6]:

$$E_c = N_{spikes}\, E_{dyn} + P_{stat}\, \delta T \quad (6)$$

where N_spikes is the maximum number of spikes emitted during a classification, E_dyn = 2.1 pJ is the energy per spike, P_stat = 73 µW is the static leakage power and δT is the inference time. Even though a smaller δT can be reached by adjusting the bias voltages that control the delay cells in [6], we assume δT = 4 ms for the 8-GHz dataset (T_inf = 4) and δT = 28 ms for the Soli dataset (T_inf = 28) to provide an upper-bound estimate on E_c (a worked numerical example is given below).

In Table I, N_c denotes the number of gesture classes and N_bits the number of bits used for the network weights (f and i stand for float and integer, respectively). Among the implementation-ready SNNs using quantized weights only (entries 5, 6 and 7 in Table I), our work significantly outperforms entry 5, handling up to 3× more gesture classes N_c while achieving a similar accuracy and N_bits, with E_c of the same order of magnitude. All other entries in Table I either rely on DNNs (entry 1) or Random Forests (entry 4), which are ill-suited for ultra-low-power IoT, or do not quantize their weights (entries 2 and 3, leaving their performance in real-world SNN hardware unclear). In addition, our work achieves a recognition accuracy close to that of the DNN in entry 1 [5], while consuming more than two orders of magnitude less energy per inference. Finally, entries 6 and 7 clearly show how our system trades off N_c, E_c and N_bits for a target accuracy above 90%.

This letter has presented a novel radar-SNN architecture for ultra-low-power radar gesture recognition, significantly outperforming existing implementation-ready SNNs in terms of classification performance. The presented approach introduces several key innovations compared to previous radar-SNN systems, such as a novel radar-SNN training strategy and novel radar-to-spike encoding approaches. Radar-SNN evaluation code has also been released, helping to light the way for the emerging area of SNN-based radar processing.
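As a worked example of the upper-bound estimate of Eq. (6) with the µBrain metrics from [6], consider the sketch below. The spike count used here is a placeholder for illustration only, not a figure reported in this letter.

```python
# Upper-bound energy per classification, Eq. (6), with muBrain metrics [6].
E_DYN = 2.1e-12      # J per spike
P_STAT = 73e-6       # W static leakage

def energy_per_classification(n_spikes: int, delta_t: float) -> float:
    """E_c = N_spikes * E_dyn + P_stat * deltaT."""
    return n_spikes * E_DYN + P_STAT * delta_t

# Example: deltaT = 4 ms (8-GHz dataset, T_inf = 4) and a placeholder
# spike count of 6e5; this yields E_c ~= 1.55e-6 J, i.e., on the order
# of the ~2 uJ mentioned in the title.
print(energy_per_classification(600_000, 4e-3))
```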
References

[1] Stability of SARS-CoV-2 in Different Environmental Conditions.
[2] Low Power Low Phase Noise 60 GHz Multichannel Transceiver in 28 nm CMOS for Radar Applications.
[3] A 680-µW Burst-Chirp UWB Radar Transceiver for Vital Signs and Occupancy Sensing up to 15 m Distance.
[4] Interacting with Soli: Exploring Fine-Grained Dynamic Gesture Recognition in the Radio-Frequency Spectrum.
[5] Real-Time Radar-Based Gesture Detection and Recognition Built in an Edge-Computing Platform.
[6] µBrain: An Event-Driven and Fully Synthesizable Architecture for Spiking Neural Networks.
[7] Radar-Based Hand Gesture Recognition Using Spiking Neural Networks.
[8] Resource Efficient Gesture Sensing Based on FMCW Radar Using Spiking Neural Networks.
[9] Application of Spiking Neural Networks for Action Recognition from Radar Data.
[10] A 0.086-mm² 12.7-pJ/SOP 64k-Synapse 256-Neuron Online-Learning Digital Spiking Neuromorphic Processor in 28-nm CMOS.
[11] A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs).
[12] Loihi: A Neuromorphic Manycore Processor with On-Chip Learning.
[13] Radar Micro-Doppler Signatures: Processing and Applications.
[14] On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform.
[15] Indoor Person Identification Using a Low-Power FMCW Radar.
[16] Learning Simple Thresholded Features With Sparse Support Recovery.
[17] Conversion of Analog to Spiking Neural Networks Using Sparse Temporal Coding.
[18] Automatic Differentiation in PyTorch.
[19] Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks.
[20] Adam: A Method for Stochastic Optimization.

Acknowledgment

The authors thank Dr. Federico Corradi and Dr. Lars Keuninckx for the discussions and guidance, and the Flanders AI research program for partially supporting this work.