key: cord-0058380-6qtbijk7
authors: Rubinpur, Yaniv; Toledo, Sivan
title: High-performance GPU and CPU Signal Processing for a Reverse-GPS Wildlife Tracking System
date: 2021-02-15
journal: Euro-Par 2020: Parallel Processing Workshops
DOI: 10.1007/978-3-030-71593-9_8
sha: 37bb93760d72035303b9c77b59994d52031453a1
doc_id: 58380
cord_uid: 6qtbijk7

We present robust high-performance implementations of signal-processing tasks performed by a high-throughput wildlife tracking system called ATLAS. The system tracks radio transmitters attached to wild animals by estimating the time of arrival of radio packets at multiple receivers (base stations). Time-of-arrival estimation of wideband radio signals is computationally expensive, especially in acquisition mode (when the time of transmission is not known, not even approximately). These computations are a bottleneck that limits the throughput of the system. The paper reports on two implementations of ATLAS's main signal-processing algorithms, one for CPUs and the other for GPUs, and carefully evaluates their performance. The evaluation indicates that the GPU implementation dramatically improves performance and power-performance relative to our baseline, a high-end desktop CPU typical of the computers in current base stations. Performance improves by more than 50X on a high-end GPU and by more than 4X on a GPU platform that consumes almost 5 times less power than the CPU platform. Performance-per-Watt ratios also improve (by more than 16X), and so do the price-performance ratios.

ATLAS is a reverse-GPS wildlife tracking system, targeting mostly regional movement patterns (within an area spanning kilometers to tens of kilometers) and small animals, including small birds and bats [18, 21]. ATLAS is a mature collaborative research effort: 6 systems have been set up and are operating in 5 countries on 3 continents.
The first system has been operating almost continuously for about 6 years and has produced ground-breaking research in ecology [4, 20]. ATLAS tracks wild animals using miniature radio-frequency (RF) transmitting tags attached to the animals [17, 19]. The transmissions are received by ATLAS base stations that include a sampling radio receiver and a computer running Linux or Windows. The computer processes RF samples to detect transmissions from tags and to estimate the time of arrival (ToA) of the transmissions. It reports the reception times to a server via an internet connection, usually cellular. The server estimates the location of a tag from ToA reports of the same transmission by different base stations [21]. The signal processing that ATLAS base stations perform is computationally demanding and is one of the main factors limiting the throughput of the system (the number of tags that it can track and the number of localizations per second that it can produce). The signal-processing algorithms were initially optimized for single-threaded execution on CPUs, but no significant effort had been made to exploit multiple cores effectively. This paper presents a new implementation of the ATLAS signal-processing code designed to effectively exploit graphics processing units (GPUs). Our aim in developing this implementation was to significantly improve the throughput of the system and to reduce the power consumption of base stations. Reduced power consumption lowers the cost and complexity of base stations that rely on solar and wind energy harvesting, such as those deployed in the shallow Wadden Sea; it is not particularly important in base stations connected to the power grid. High throughput is useful in most base stations.
As part of this project, we also exposed a little more parallelism in the original CPU implementation, but it was not our intention to make it as parallel as possible, because that would have little value to users (who should use the GPU implementation) and would necessitate replacing our simple single-threaded task scheduler with a complex concurrent one. We also evaluate the performance of both the (slightly improved) CPU code and the new GPU code on real recorded data. The evaluations, performed on two CPU platforms and on three GPU platforms, show dramatic improvements relative to our baseline, a high-end desktop CPU that is typical of the computers in current base stations. The improvements are in terms of absolute performance (more than 50X with a high-end GPU and more than 4X with a GPU platform that consumes almost 5 times less power than the CPU platform), in terms of performance-per-Watt ratios (more than 16X), and in terms of price-performance ratios. However, because we did not attempt to achieve top multicore performance on CPUs, these results should not be taken as fair comparisons of the hardware platforms; they are meant mainly to demonstrate the level of performance that is achievable on such tasks on GPUs using a single-threaded scheduler coupled with data-parallel GPU tasks. ATLAS tags transmit a fixed unique pseudorandom packet every 1 s, 2 s, 4 s, or 8 s. The packets are 8192 bits long and the bit rate is around 1 Mb/s. The data is frequency modulated (FSK); ATLAS can also use phase modulation, but this is beyond the scope of this paper (see [12] for details). The sampling receiver in each base station sends a continuous stream of complex RF samples, usually at 8 or 8.33 Ms/s, to a computer. The samples are placed in a circular buffer. A high-level scheduler repeatedly extracts a block of samples from the buffer and processes it.
The size of the circular buffer allows for processing delays of more than 10 s; this simplifies the scheduler and the signal-processing code considerably relative to in-order stream processing with hard deadlines. The signal processing aims to detect whether packets from specific tags appear in the block, to estimate the precise (sub-sample) time of arrival (ToA) of each packet, to estimate the (relative) power of the packet, and to estimate a signal-to-noise ratio that is correlated with the variance of the ToA estimate. This data is sent to a server that estimates the locations of the tags [21]. The scheduler creates two kinds of tasks for the signal-processing code. Searching-mode (acquisition-mode) tasks process blocks of 100 ms and try to detect packets from all the tags that have not been detected in the past few minutes. The set of tags to search for can consist of over 100 tags. Since all tags transmit on one or two frequencies, the FSK demodulation step is performed only once or twice per block of samples, but the number of pseudorandom codes that must be correlated with the demodulated signals can be over 100. Tracking-mode tasks aim to detect an 8 ms packet from one particular tag in a block of about 12 ms of samples. These tasks perform demodulation and correlate the demodulated signal with one pseudorandom code. Normally, the scheduler allocates 50% of the processor's time to searching and 50% to tracking, in an amortized sense, simply to avoid starvation of either kind of task. If one of the queues is empty, all the processing resources are devoted to the other queue. The scheduler is sequential; it generates one task at a time and runs it to completion, devoting to it all the cores except for one that handles incoming samples. This simplifies its algorithms but places all the responsibility for efficiently utilizing multiple cores on the signal-processing code.
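The amortized 50/50 split between the two queues can be expressed with a simple credit counter. The sketch below is illustrative only (the function name, queue contents, and abstract task costs are our own); the actual ATLAS scheduler additionally generates tasks and manages cores:

```python
from collections import deque

def schedule(search_q, track_q):
    """Run tasks from two queues with an amortized 50/50 time split.
    'credit' records how much more time searching has consumed than
    tracking; whichever side is behind runs next, and an empty queue
    yields all remaining time to the other queue."""
    credit = 0.0
    log = []
    while search_q or track_q:
        if search_q and (credit <= 0 or not track_q):
            cost = search_q.popleft()   # run a searching-mode task
            credit += cost
            log.append(("search", cost))
        else:
            cost = track_q.popleft()    # run a tracking-mode task
            credit -= cost
            log.append(("track", cost))
    return log
```

With equal unit costs the two queues alternate; if one queue empties, the other receives all the processing time, as described above.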
Graphics cards (GPUs) that can run general-purpose code, sometimes called GPGPUs, have emerged as effective accelerators of computationally-intensive tasks [9]. This paper focuses on GPUs produced by the market leader, NVIDIA. NVIDIA GPUs contain a large number of simple cores (execution units) under the control of a smaller number of instruction schedulers. In the Jetson TX2 GPU, for example, 256 cores are organized into warps of 32 cores that are controlled by a single instruction scheduler. The warps are organized into streaming multiprocessors (SMs; two in the TX2). All the cores in a warp perform the same operation at the same time, so the code must exhibit a high degree of data parallelism. Larger NVIDIA GPUs use the same basic structure, but with different numbers of cores and SMs. Many NVIDIA GPUs can operate directly only on data stored in the GPU's memory, not in the computer's main memory. NVIDIA GPUs have a memory hierarchy that includes a small block of so-called shared memory that is private to an SM; on the TX2, its size is 64 KB. Data-movement engines called copy engines in the GPU move data between main memory and GPU memories and within GPU memories. NVIDIA GPUs run programs written in CUDA, an extension of the C language. CUDA programs express GPU computations using an abstraction called a kernel. A kernel operates on a small piece of data (say one element of an input array and one element of an output array). CUDA programs invoke kernels on entire arrays concurrently. The signal-processing building blocks that are performed on each task are as follows:
1. Conversion of the complex RF samples, represented by pairs of 16-bit integers, to a single-precision (float) complex vector x. The complex samples are usually multiplied element-wise by a complex input vector l representing a local-oscillator signal, to shift the center frequency so that transmissions are centered at zero.
2. Next, a bandpass FIR (finite impulse response) filter, represented here by a circulant matrix H_BP, is applied, to produce y ← H_BP x. We use filters with 200 coefficients.
3. Two short (8-sample) matched filters are applied to y, one that represents a single-bit (chip) period at the frequency representing a 1 symbol and one that represents a single-bit period at the frequency representing a 0 symbol. We denote their outputs by f_1 = H_1 y and f_0 = H_0 y.
4. The vectors f_1 and f_0 are used to demodulate the transmission in two different ways, with and without normalization (elementwise absolute value, elementwise subtraction and addition, and elementwise division); we denote the resulting demodulated vectors by d and u. These signals are real.
5. The algorithm applies exactly the same steps to a replica of the transmission we are trying to detect, a synthetic noise-free zero-padded signal r^(c) that represents an FSK packet with the same modulation parameters and a pseudorandom bit sequence c. The resulting demodulated vector is denoted d^(c); it is computed once and stored. The lengths of d, u, and d^(c) are identical.
6. We cross-correlate d with d^(c). The cross-correlation vector is also real.
7. We compute the value and location j of the maximum of the absolute value of the cross-correlation vector, j = arg max_i |xcorr(d, d^(c))_i|. The elements of xcorr(d, d^(c)) around j are subsequently interpolated to estimate the arrival time of the incoming signal.
We also compute quantities that are used to estimate the signal-to-noise ratio (SNR) and the power of the signal, assuming that the nonzero part of d^(c) spans its first n elements. For details on how power and SNR are estimated and how they are used, see [12, 15]. We use algorithms that minimize the operation counts that the signal-processing building blocks perform.
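The first building block above, conversion of 16-bit integer samples to complex floats and mixing with a local oscillator, can be sketched as follows. The function name and the interleaved-I/Q input format are our own assumptions for the example; the production code does this in C/CUDA on large blocks:

```python
import cmath

def mix_to_baseband(iq, f_shift_hz, f_sample_hz):
    """Convert interleaved 16-bit I/Q integers to complex floats and
    multiply element-wise by a local-oscillator vector
    l[k] = exp(-2*pi*1j*f_shift*k/f_sample), so that the band of
    interest is shifted to a center frequency of zero."""
    x = [complex(iq[2 * k], iq[2 * k + 1]) for k in range(len(iq) // 2)]
    return [xk * cmath.exp(-2j * cmath.pi * f_shift_hz * k / f_sample_hz)
            for k, xk in enumerate(x)]
```

Mixing a pure tone at f_shift_hz down to baseband yields a (nearly) constant complex signal, which is a convenient sanity check.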
In particular, we use the fast Fourier transform (FFT) to compute cross-correlations and to apply FIR filters with many coefficients, so xcorr(d, d^(c)) = ifft(fft(d) · conj(fft(d^(c)))). We compose FIR filters that are applied in sequence (H_BP H_1 and H_BP H_0), and we use FFTs to apply long FIR filters (filters with many coefficients). We use the overlap-add method to apply medium-length filters and cross-correlations. This reduces the operation count from Θ(m log m) to Θ(m log n) when applying a filter of length n to a block of m RF samples. We pad inputs to lengths that are a product of small integers, usually 2, 3, and 5; this ensures that applying the FFT is as inexpensive as possible. We also use high-performance implementation principles in both the CPU implementation in C and the GPU implementation in CUDA. In most cases the principles are applicable to both implementations; we highlight the differences when this is not the case. We use comprehensive high-performance FFT libraries to compute FFTs. On CPUs, we use FFTW [7]; on GPUs, we use NVIDIA's cuFFT. The implementations allocate arrays when they are needed and reuse them aggressively. In general, arrays are never released. For example, demodulation of a block of RF samples of a certain size is always done using the same temporary arrays; for other sizes, we use other arrays. This reduces memory-allocation overheads and allows us to preplan all the FFT calls (both FFTW and cuFFT require calls to be planned in order to achieve high performance). Allocated arrays are aligned on cache-line boundaries in the CPU implementation and are allocated in GPU memory in the GPU implementation. Auxiliary vectors, like fft(d^(c)), are computed when needed but stored indefinitely, to avoid recomputation. Loops are aggressively fused in the CPU implementation and kernels are aggressively fused in the GPU implementation. This reduces data movement (e.g., cache misses) and allows elimination of some temporary arrays.
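The FFT-based cross-correlation (with its peak search) and the padding to lengths with only small prime factors can be sketched as follows. The sketch uses a naive O(n²) DFT so that it is self-contained; it illustrates the mathematics, not the production code, which uses FFTW and cuFFT, and all function names are ours:

```python
import cmath

def dft(x, sign=-1.0):
    # Naive DFT; sign=-1 gives the forward transform.
    n = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

def idft(X):
    return [v / len(X) for v in dft(X, sign=+1.0)]

def xcorr(d, dc):
    # Circular cross-correlation via the correlation theorem:
    # xcorr(d, dc) = ifft(fft(d) * conj(fft(dc))).
    D, C = dft(d), dft(dc)
    return [v.real for v in idft([a * b.conjugate() for a, b in zip(D, C)])]

def peak(r):
    # Index and value of the maximum of |r|; the sub-sample arrival
    # time is then interpolated around this index.
    j = max(range(len(r)), key=lambda i: abs(r[i]))
    return j, r[j]

def fft_friendly_length(n):
    # Smallest m >= n whose only prime factors are 2, 3, and 5,
    # so that an FFT of length m is cheap.
    m = n
    while True:
        k = m
        for p in (2, 3, 5):
            while k % p == 0:
                k //= p
        if k == 1:
            return m
        m += 1
```

Correlating a demodulated window against a delayed copy of a zero-padded reference code peaks at the delay, which is the coarse arrival time.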
On the GPU, a library called CUB [13] enables fusion of kernels with reductions (e.g., sums), which are otherwise challenging to implement efficiently in CUDA. In the CPU implementation, we batch cross correlation operations: a single call to FFTW computes many cross-correlation vectors. This exposes "embarrassing" parallelism (completely independent operations) that FFTW should be able to easily exploit, at least in principle. CUB uses shared memory to achieve high performance; it requires the caller to allocate this memory, which our code does. In kernels that do not use CUB we do not use shared memory because they implement low data-reuse data-parallel operations over large vectors. cuFFT might also use shared memory, but if it does, it allocates it internally. This section presents our experimental evaluation of the effectiveness of GPUs for our task, in terms of both performance and energy efficiency. To test the codes, we modified the CPU-based DSP C code so that it stores all its inputs and outputs in files. We then ran the ATLAS base station code in an ad-hoc mode (that is, not as part of a localization system) on a computer connected to a USRP B210 sampling radio and configured the base station to detect a tag that was present in the room. This produced files that contained the RF samples that were processed in both searching and tracking mode, inputs that represent filter coefficients and the signal to correlate with, and the outputs of the signal-processing algorithms. Next, we wrote a C program that reads these files, calls the signal processing routines on the recorded data, measures their running time and optionally the power consumption of the computer and its components, and stores the results in files. The program can use the recordings in both single-code single-RF-window mode and in batch mode that processes many codes in one call. The former is typical of tracking mode and the latter of searching mode. 
The program checks that the returned results are identical, up to numerical rounding errors, to those returned by the full base-station run that detected the tag correctly. This ensures that all the results that we report represent correct executions of the algorithms. The code then stores the running times and the power measurements, if made, to log files. We also tested that the new CUDA-based code works correctly when called from Java through the JNI interface and detects transmissions from tags and their arrival times. This test was performed on the Jetson TX2 computer described below and the same USRP B210 radio. We evaluated the code on several platforms using both the CPU code and the GPU code. Our baseline is a small form-factor desktop computer, representative of those currently used in ATLAS base stations, with an Intel i7-8700T CPU. This CPU has 6 physical cores running at clock frequencies between 2.4 and 4 GHz and a thermal design power (TDP) of 35 W. This CPU was launched in Q2 2018 and is fabricated in a 14 nm process. The computer ran Linux kernel version 5.3. We compiled the code using GCC version 7.5. Both our code and FFTW version 3.3.8 were compiled using the optimization options that are built into FFTW. The code that was produced ran slightly faster than code compiled with only -O3 -mtune=native. Our main target is a low-power Jetson TX2 computer [2, 6], which has a 256-core NVIDIA Pascal GPU, four ARM Cortex-A57 cores, and two ARM Denver2 cores; it was launched in Q2 2017 using a 16 nm process. The Cortex-A57 cores were designed by ARM and the Denver2 cores were designed by NVIDIA for higher single-threaded performance; both use the same 64-bit ARMv8 instruction set. It also has 8 GB of memory that both the CPU and GPU can access, with 59.7 GB/s memory bandwidth. The TX2 ran Linux kernel 4.9.140-tegra. We used nvcc version 10.0.326, CUDA library 10.0.130, gcc 7.4.0, and FFTW 3.5.7. CUB version 1.8.0 was used on all platforms.
We measured power consumption on the TX2 using two ina3221 current sensors built into the TX2 module and a third built into the motherboard [5]. Each sensor senses current on three different rails, and all the measurements are available by reading special files exposed by the driver under /sys/bus/i2c/drivers/ina3221x. The values that we report are the maximum values observed during the computation. The power-vs-performance profile of the TX2 can be adjusted by turning cores on or off and by changing their clock frequency. NVIDIA defined several standard modes, which we use below in our tests. Table 1 describes these modes. The nominal TDP of the TX2 ranges from 7.5 W in the highest power-efficiency mode to 15 W in the highest-performance modes. Both the TX2 module and the motherboard include power sensors that we use to measure the power consumption directly in our tests. We also ran the GPU code on two additional platforms. One is an NVIDIA GeForce GTX 1050 GPU. This GPU uses the Pascal architecture and has 640 cores running at 1.455 GHz and 2 GB of RAM. The TDP is 75 W. It was plugged into a desktop running Windows 10 with a quad-core Intel i5-6500 CPU; we used CUDA 10.1, nvcc version 10.1.168, and Microsoft's C++ compiler (cl) version 19.00.24210 for x64. The last GPU platform that we used is an NVIDIA Titan Xp GPU. This GPU also uses the Pascal architecture and has 3840 cores running at 1.582 GHz. It has 12 GB of memory and a high-bandwidth memory interface. The thermal design power is 250 W. It was plugged into a server with a 10-core Intel Xeon Silver 4114 CPU running Linux. We used CUDA and nvcc 10.0.130 and gcc 4.9.2. Figure 1 shows the performance of our C implementation on the baseline platform, which has an Intel i7-8700T CPU. We present the performance in terms of the ratio of the processing time per pattern to the length of the RF window.
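Reading the ina3221 rails amounts to polling sysfs files and keeping the per-rail maximum. A minimal sketch follows; note that the glob pattern below is an assumption (the exact layout under the driver directory varies across L4T releases, and the text only gives the directory itself), and the function name is ours:

```python
import glob

def read_rail_power_mw(pattern="/sys/bus/i2c/drivers/ina3221x/*/iio:device*/in_power*_input"):
    """Read one power sample (in mW) from every ina3221 rail file
    matched by the (assumed) sysfs pattern.  To reproduce the reported
    measurements, call this repeatedly during a computation and keep
    the maximum per rail.  Returns an empty dict on machines without
    the sensors."""
    readings = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                readings[path] = int(f.read().strip())
        except (OSError, ValueError):
            pass  # rail not readable; skip it
    return readings
```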
That is, if the code takes 1 s to process one 100 ms window of RF samples and to correlate the demodulated signal with 16 different code patterns, then we report the performance as (1/16)/0.1 = 0.625. A ratio of 1 implies that the base station can search for one tag continuously, that searching for 2 tags would drop 50% of the RF samples, and so on. A ratio of 0.1 implies that the station can search continuously for 10 tags without dropping any RF samples. Lower is better. The results on one core (Fig. 1 left) show that the performance per pattern improves significantly when we process multiple patterns in one window of RF samples (which is how the experiment was structured, since this is typical of how ATLAS systems are usually configured). This is mostly due to the amortization of the cost of demodulation over many patterns. The graph on the right in Fig. 1 shows that using 2 or 3 cores improves performance relative to using only one core, but the improvement is far from dramatic or linear. Using 4 or more cores actually slows the code down relative to 2 or 3 cores. The parallelization in the CPU code is only within FFTW, and it does not appear to be particularly effective in this code, perhaps due to the length of the FFTs. Performance on the TX2 is excellent on the GPU but poor on the CPU, as shown in Fig. 2. Our CUDA code running on the TX2 is about 4.3 times faster than the single-core i7 code and about 3 times faster than the i7 multicore runs. However, even in the highest-performance mode, the TX2's CPU cores perform about 4 times worse than the i7. We also measured the power consumption of the TX2 while it was running our code. The results, shown in Fig. 3, indicate that when running the GPU code, the GPU is the largest power consumer, but the memory and other parts of the system-on-chip (most probably the memory interface) consume a lot of power, about 50% of the total. The CPU and IO interfaces also consume power, but not much.
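The metric can be written down directly; the function name below is ours, and the first example reproduces the 0.625 figure worked out above:

```python
def per_pattern_load(window_time_s, n_patterns, window_len_s):
    """Processing time per pattern divided by the RF-window length
    (the metric plotted in Fig. 1).  A value of 1.0 means searching
    for a single tag consumes the whole machine; 0.1 means ten tags
    can be searched for continuously without dropping samples."""
    return (window_time_s / n_patterns) / window_len_s
```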
In the C-code runs, the GPU is essentially off; the CPU, memory, and system-on-chip are the largest power consumers. The graph on the right in Fig. 3 shows that the CUDA code is about 10 times more energy efficient than the C code running on the CPU, for the same task. Figure 4 shows that our CUDA code is also very effective on desktop and server GPUs. A low-end GPU is 12.8 times faster than a single x86-64 desktop core that is 2 years newer. A server GPU is 51.4 times faster than the desktop CPU. Alawieh et al. [1] and Hendricks et al. [10] compare the performance of several types of compute nodes, including GPUs, CPUs, and FPGAs, in the context of RF ToA estimation, with application to a location estimation system called RedFIR. Their requirements are more demanding, in the sense that RedFIR requires real-time processing of a stream of samples, whereas we buffer samples for a few seconds and use a priority scheduler to simplify the signal-processing code. It is well known that real-time scheduling on GPUs is challenging [22]; our scheduler allows ATLAS to avoid the difficulty. Also, RedFIR does not rely on periodic transmit schedules, whereas ATLAS reduces the computational load by tracking tags rather than just searching for them. Finally, signal processing in RedFIR is a bit simpler than in ATLAS because it uses PSK transmitters, not FSK transmitters. Belloch et al. [3] and Kim et al. [11] present acoustic localization systems that exploit GPUs for ToA estimation. Our use of cuFFT follows the advice of Střelák and Filipovič [16, Sect. 2.5]. Other CUDA FFT libraries [8, 14] appear to be no longer maintained. We have shown that by implementing the DSP functionality of an RF time-of-arrival transmitter localization system in CUDA, we can improve the acquisition (searching) throughput of the system by a factor of 4 while reducing power consumption by a factor of about 5 relative to a baseline single-core C code, even though the C code has been carefully optimized.
Table 2, which summarizes the characteristics of our test platforms (as well as of a few newer platforms), shows that higher-end GPUs can improve throughput dramatically further, at the cost of higher power consumption and sometimes also a higher price. The throughput of tracking mode also improves on GPU platforms.
Table 2. A comparison of GPU platforms. Column 3 shows the number of CPU and GPU cores. The 5th column shows the TDP of the platform, either the overall power consumption or, if marked by a +, that of only the device itself. The cost in USD is only indicative, and again shows either the total system cost or, when marked by a +, the cost of the device. The rightmost column shows the throughput, defined as the number of codes (tags) that can be searched for without dropping any RF samples, assuming batches of 128 codes and windows of 100 ms each; this also assumes that only 50% of the time is devoted to searching, the rest to tracking. The performance of the i7 processor assumes that only one of the six cores is used.
Our baseline code does not effectively exploit multicore CPU platforms, even though it relies heavily on a (high-quality) parallel multicore FFT library; this alone does not deliver good parallel speedups, perhaps due to the modest size of the tasks. It is likely that a careful parallel multicore implementation, perhaps in OpenMP, could improve the performance of the C code on multicore CPUs. However, this would entail programming that is at least as complex as our CUDA implementation, and it would still not attain the maximum performance of the GPU code or its power-performance ratio.
References
1. A high performance FPGA-GPU-CPU platform for a real-time locating system. In: European Signal Processing Conference (EUSIPCO)
2. In: Proceedings of the IEEE Real-Time Systems Symposium (RTSS)
3. On the performance of multi-GPU-based expert systems for acoustic localization involving massive microphone arrays
4. Movement ecology and sex are linked to barn owl microbial community composition
5. NVIDIA Jetson Linux Developer Guide
6. NVIDIA Jetson TX2 delivers twice the intelligence to the edge. NVIDIA Developer Blog
7. The design and implementation of FFTW3
8. High performance discrete Fourier transforms on graphics processors
9. GPUs reshape computing
10. Evaluating performance and energy-efficiency of a parallel signal correlation algorithm on current multi and manycore architectures
11. Moving-target position estimation using GPU-based particle filter for IoT sensing applications
12. Modulation and signal-processing tradeoffs for reverse-GPS wildlife localization systems
13. CUB (CUDA UnBound) library version 1.8.0, a library of CUDA collective primitives
14. Small discrete Fourier transforms on GPUs
15. High-performance GPU and CPU signal processing for a reverse-GPS wildlife tracking system
16. Performance analysis and autotuning setup of the cuFFT library
17. Lightweight low-cost wildlife tracking tags using integrated transceivers
18. Lessons and experiences from the design, implementation, and deployment of a wildlife tracking system
19. Physical-layer protocols for lightweight wildlife tags with Internet-of-things transceivers
20. Cognitive map-based navigation in wild bats revealed by a new high-throughput tracking system
21. Characterizing the accuracy of a self-synchronized reverse-GPS wildlife localization system
22. Avoiding pitfalls when using NVIDIA GPUs for real-time tasks in autonomous systems

Acknowledgements. Thanks to NVIDIA Corporation for the donation of the Jetson TX2. This study was also supported by grants 965/15, 863/15, and 1919/19 from the Israel Science Foundation.