key: cord-0047304-ib2m3q80 authors: Burger, Alwyn; Urban, Patrick; Boubin, Jayson; Schiele, Gregor title: An Architecture for Solving the Eigenvalue Problem on Embedded FPGAs date: 2020-06-12 journal: Architecture of Computing Systems - ARCS 2020 DOI: 10.1007/978-3-030-52794-5_3 sha: a5d80e4d1417590c390af1d331aa4fb32ef2d78f doc_id: 47304 cord_uid: ib2m3q80 Resource-limited embedded devices like Unmanned Aerial Vehicles (UAVs) often rely on offloading or simplified algorithms. Feature extraction such as Principle Component Analysis (PCA) can reduce transmission data without compromising accuracy, or even be used for applications like facial detection. This involves solving eigenvectors and values which is impractical on conventional embedded MCUs. We present a novel hardware architecture for embedded FPGAs that performs eigendecomposition using previously unused techniques like squared Givens rotations. That leads to a 3x performance improvement for 16 [Formula: see text] 16 covariance matrices over similar approaches that use much larger FPGAs. Offering higher than 30 fps at only 68.61 [Formula: see text]J per frame, our architecture creates exciting new possibilities for intelligent mobile devices. Eigendecomposition and feature extraction have been the focus of continued research for many years [19, 25, 26] . Algorithms like principle component analysis (PCA) allow us to simplify a dataset to only its important features by identifying its distinguishing eigenvectors. By projecting data into a reduced eigenspace (the space described by the eigenvectors), we can simplify problems like facial detection and recognition to a comparison of a few eigenvalues, i.e. the relative weight of each eigenvector. More applications of these techniques are being developed, e.g. in the field of convolutional neural networks (CNNs) where PCA can find dominant features and compress network structures [12] . However, PCA's batched nature and computational complexity makes it infeasible for resource-limited devices. In power-limited applications such as Unmanned Aerial Vehicles (UAVs) that rely on camera feeds, feature extraction could offer data size reduction through local preprocessing. Adding the online learning capabilities of incremental PCA (IPCA) further allows the devices to incorporate incoming images into the training set -thereby continuously improving its performance. To enable this, an accelerator architecture is required that efficiently performs eigendecomposition on an embedded FPGA. This offers improved energy efficiency for small devices over GPUs, and additionally provides flexibility over ASICs as it can be reconfigured to deploy another accelerator at runtime. Delegating this complex computational task to a local FPGA promises considerably improved processing power over doing everything on a MCU. However, most techniques for doing eigendecomposition such as the QR algorithm [11] strongly depend on trigonometric functions or square roots to compute a Givens rotation matrix [14] which are resource inefficient on such devices. Although alternatives like Squared Givens Rotations (SGR) [9] would be considerably more efficient, they introduce scaling issues and have to the authors' knowledge not been successfully used in the QR algorithm. In this paper we present a revolutionary hardware architecture design for performing eigenvalue decomposition (EVD) on an embedded FPGA. By using a number of state-of-the-art optimization techniques in a novel way, our system is capable of increasing processing speed by 3-4x over current literature without compromising accuracy. Our main contributions are 1. a highly resource-optimized computing architecture for solving eigenvalue problems, 2. that is scalable from tiny embedded FPGAs to standard desktop models through a fully homogeneous network of processing elements, 3. and offers pipelined single clock processing elements for maximum processing speed. We present our solution by looking at related work in Sect. 2, followed by an overview of our solution in Sect. 3 . The details of the technical contributions follows in Sect. 4, after which we evaluate our solution in Sect. 5. Finally, we study the application case of UAVs in Sect. 6 and conclude with some final thoughts in Sect. 7. Incremental PCA [1, 6] is a relatively recent development. It offers us the crucial benefit of online training and avoids the expansion of the covariance matrix as the training dataset is expanded. Conventional eigensolver algorithms have been found to be ill-suited to GPU architectures [18] even though they can achieve nearly 5x speedup over CPUs. QR decomposition (which computes a single iteration of the QR algorithm) specifically has been implemented using different GPU-based accelerator architectures [17, 18] . Similar to our approach, Guerrero-Ramírez et. al. [15] presented the first eigensolver based on systolic arrays that implements the QR algorithm using FPGAs. These arrays describe a network of processing elements, where each partially computes a function and passes to their neighbors. In this case, they iteratively calculate trigonometric functions. Their implementation improved processing time by a factor of 1.17x-1.37x compared to CPU architectures. A slower solution that includes a full PCA solver was shown by Korat [19] . It uses significantly more FPGA resources than the previously mentioned work, and showed that some of the components such as mean calculation and data normalization are very inefficient on FPGAs. Ultimately, these authors were limited by having to iteratively approximate trigonometric functions using the COordinate Rotation DIgital Computer (CORDIC) algorithm [21] -causing severe slowdown for more processed bits [23] . Additionally, their resource consumption is impractically high for an embedded FPGA. Other projects that use systolic arrays for QR decompositions on FPGAs [8, 27] have similar limitations. To the best of the authors' knowledge our work represents the first FPGA implementation of the QR algorithm using systolic arrays based on an algorithm that does not rely on trigonometric functions. At the core of our EVD (see Fig. 1 ) is the triangular systolic array (a) to perform QR decomposition. It is composed of two types of nodes: boundary (b) on the diagonal of the triangular matrix and internal (c) off the diagonal. This iteratively computes the eigenvalues and eigenvectors of a provided covariance matrix, entering in a skewed order (d). The QR-array results can be fed back into the system using the buffer (e) until the result converges, at which point the deskewed output (f) is presented. The scaled output of each step of the QR array is down-scaled (g). The starting point for our solution is the QR decomposition. We first consider a real symmetric matrix A 0 of dimensions n × n, which is the covariance matrix for the PCA to be applied. The rank of this matrix corresponds with the number of eigenvectors being computed, effectively controlling the number of features being extracted. The approximate determination of the eigenvalues and eigenvectors is done with the QR algorithm. It is an iterative application of the QR decomposition, which factorizes a matrix by means of plane rotations, e.g. Givens rotations. Each QR iteration is given as: and is performed n times over the matrix A until its diagonal elements converge to the eigenvalues. The collection of eigenvectors Q themselves could be determined by calculating the product of all these orthogonal matrices Q i : Using SGR allows us to first use A i to compute R i , and even to solve Eq. 1 by processing the identity matrix I to compute Q i . The orthogonal similarity transformation A i+1 follows by processing R i . Furthermore, the eigenvectors Q in Eq. 3 can be determined efficiently by processing each computed Q i . As all of these are processed in the same way, we can reuse the processing elements for improved efficiency. Each iteration in the QR algorithm thus consists of the input sequence S = {A, Q, R}. The problem remains that all R i and Q i are scaled by the SGR, meaning it cannot be directly used for further iterations. Our primary contribution addresses the internal structure of the processing elements in the QR array. We improve upon the latency of current state-of-the-art algorithms by using the square-root-free algorithm proposed by Döhler [9] to avoid the associated latency. It allows our processing elements to have a latency of only one clock cycle. Although SGR has been used for QR decomposition, it has not been applied to the QR algorithm due to scaling problems. Since results should be fed through multiple iterations, this would cause overflow errors. To the authors' best knowledge SGR has therefore not been used for EVD using the QR algorithm. The SGR algorithm scales each calculated QR decomposition [9] , which means that it cannot be used for the QR algorithm directly. Especially when using fixed-point representation, this will quickly cause overflow. We found the result to be as shown in Eq. 4, which shows that the eigenvalues λ i found on the diagonal of R * are squared. Additionally, other values are linearly scaled with the value of λ 2 i . Similarly, each column in Q * is scaled. A well-known approach for determining reciprocal square roots [10] is given by iteratively solving the Newton Method until it converges to y = 1 √ xin . However, this can be slow under a bad initial guess y 0 very different from the actual result. An interesting approach to this was coined for the video game Doom 1 , where the initial guess is varied depending on the input value. Extending on this concept, we have developed a novel way to use lookuptables (LUTs) for using this with fixed-point numbers. By choosing from a precomputed set of appropriate y 0 based on the input x in , we can reduce the number of iterations required for convergence. Given a sufficiently large LUT with 128 24-bit entries to create a very accurate initial guess, we can directly solve Eq. 5 in a single iteration. Solving EVD using SGR requires two divisions [7, 9, 20] , which for a matrix width of n would result in 1 2 (n 2 −n) dividers. Since they are non-trivial to implement in hardware (particularly the reciprocal of the divisor), this would be very resourceintensive. Therefore, we studied the schedule of active nodes in the array as shown in Fig. 2 . As division is only required in diagonal mode, this shows that only one division occurs per row. This allows us to share the dividers more efficiently, and to reduce the required number to n. For a 16 × 16 covariance matrix, this leads to a reduction of 104 divider circuits. Similarly, large binary multipliers occupy substantial logic resources in FPGAs. One can build a sequential circuit using multiplexers on the inputs that cycles a single multiplier for multiple usages. The basic idea is to first get the result of A * B in a register, then to multiply that by C. Additionally, the DSPs are optimized using a technique called retiming, which involves moving registers across combinatorial logic to improve the design performance without affecting the input or output behavior of the circuit [22] . Despite the optimized interconnection in dedicated logic, adder chains used to implement binary multipliers in DPS slices cause delays. Based on anecdotal evidence, this technique improved our maximum frequency possible from 247.64 MHz to 373.13 MHz. This increase of 50.67% greatly boosts performance, as the worst case slack is greatly improved. Before our approach can be applied to a practical system, we must first evaluate how well it performs. It is aimed at embedded FPGAs that have been shown to be very capable in applications such as small neural networks [5, 24] . Not only must we ensure that our design is efficient enough to fit this resource-constrained class of FPGAs, but also that the resulting performance is adequate to offer realworld usability. Firstly, we consider the resource consumption on the FPGA. As detailed in Sect. 4, the greatest impact on this is through the size of the processed matrix. Larger matrix sizes enable the computation of more eigenvectors at increased complexity, thereby extracting more identifiable features. Therefore, we varied this size in Table 1 and captured the number of resources consumed by each solution. Note that these results are an absolute number and is valid for the entire 7 series devices from Xilinx, as they are all based on the same architecture. This provides a convenient way to choose the correct FPGA to use for a specific application, based on the limiting hardware resource. For example, the Spartan 7 range varies in available DSP slices from 10 on the S6 to 160 on the S100. It also shows that the implemented homogeneous architecture is easily adaptable to larger-scale deployment, as a larger FPGA could simply support a larger matrix and thereby enable larger inputs and more complex applications. To put these numbers in context, we compare them to the most recently published CORDIC-based eigensolvers [15, 19] in Fig. 3 . We consider specifically the logic cells and DSP slices, as these are commonly the limiting factors. Omitting the additional logic required by a CORDIC-based approach significantly improves our resource consumption, as almost half of the logic cells are saved. More importantly, the number of DSP blocks are reduced by almost 85%. This allows us to use FPGAs with significantly fewer resources, or to support a larger covariance matrix. Before the system's throughput rate can be calculated, the maximum operating frequency f max must be determined using a static time analysis. Table 2 lists the maximum possible clock rates for all targets as the matrix size is varied. As the other solution is not open source, only the clock frequencies achievable in [15] are provided for comparison. Unsurprisingly, the maximum clock rate at which the implemented design can be operated decreases with increasing logic density. To determine the throughput rate, the combined latency of the processing elements must be considered. Each has a latency of p = 6 clock cycles. The number of iterations to be performed is set to k = 30 for a direct comparison with related work. Firstly, the latency of initially filling the FIFO buffers is L F IF O = 3n − 1 cycles, where n is again the matrix width. Each of the QR iterations requires L QR = 24n − 6 while the inverse square root consumes a constant L Sqrt = 12 clock cycles. This leads to a model of the overall latency L and throughput T of where each solution refers to a complete calculation of all eigenvalues andvectors [15] . The maximum operating frequency f max results from the static timing analysis results shown in Table 2 . Figure 4 compares the throughput of our approach to a CORDIC-based approach [15] and a desktop CPU. The SGR-QR was implemented on a Xilinx Spartan-7 XC7S100, and a fixed point representation of 24 bits was chosen to match the input signals in each DSP48 block. Note that the frequency of the memory is assumed to be at least as fast as the main clock f max . The SGR-QR is faster than the CORDIC-based approach implemented on the considerably larger Virtex-7 (3.81x for 4 × 4 to 4.26x for 16 × 16 matrices). The benefits of our highly parallel architecture over higher clocked CPUs become particularly evident for larger matrices. This is due to our approach's linear runtime, while CPU implementations are commonly O(N 3 ) and single-threaded. Using the maximum clock frequency from Table 2 , the implementation results for a number of embedded FPGAs from the Xilinx-7 Family are shown in Table 3 . Apart from the proportional resource consumption for a number of devices, the estimated power usage is also provided by the Vivado software of Xilinx. This is the active consumption of the device, highlighting the importance of processing speed to offset the cost of keeping the FPGA powered. Our system is designed with high energy and resource efficiency in mind in order to support the small, battery-powered devices used in many pervasive or organic computing applications. One example is a fully autonomous aerial system (FAAS) that combines unmanned aerial vehicles (UAV), edge computers, and data centers to create intelligent systems. They should autonomously explore their environment and accomplish high level goals without human intervention [3] , which requires expensive techniques such as facial detection. UAVs typically only carry small batteries with flight times between 15 and 25 min and therefore rely on offloading tasks to edge and cloud systems [4] . Transferring images between edge and UAV is costly, taking on the order of seconds in prior work [4] . Prior work on micro aerial vehicles with in-situ vision systems performed detections locally on UAV. Increased frame rates and decreased power-consumption were achieved by downsampling (5-12 fps) and compressing incredibly small images (17 fps) to be used as input to neural networks [2, 13] . In aerial applications this can lead to loss of critical information contained in small regions. Instead, our system can be used as a local facial detection algorithm or as preprocessing to reduce offloaded data to only the important features. Therefore, we evaluated our architecture design using the well-known FDDB dataset [16] . A sliding window of 250 × 250 pixels is moved over an input image of resolution 640×480. The covariance matrix varies with the number of training images from 4 to 16 faces. For this dataset, 95% of the variance could be described with 62.5% of vectors -offering substantial data reductions. Processing speed of an EVD on the Spartan 7 S100 for different size covariance matrices are presented in Fig. 5 . Using a naive classifier, increasing the matrix size from 4 × 4 to 16 × 16 increased the accuracy from 44.6% to 55.5% (in line with similar approaches [16] ). The speed is reduced for larger matrices, but even at 16×16 the performance remains above 30fps. This shows the trade-off between speed and complexity, which can be combined with Table 3 to tailor the hardware choice. Each device's power usage allows us to estimate the energy usage per frame to between 3.14 µJ for n = 4 and 68.61 µJ for n = 16. Although related work does not provide this information, we are confident that our system is more energy efficient, as transmitting even an image preview (720 × 900) can take a UAV 1.4 s [4] . We presented our approach for EVD on an embedded FPGA. Through optimizations like systolic arrays and dynamically scaling SGR results, we achieved an improvement of 3x performance over other approaches. Additionally, the architecture is resource optimized enough to be used even on small embedded FPGAs like a Xilinx Spartan 7. In future work, we hope to implement this onto a set of drones augmented with FPGAs for real-world experiments. We also plan to investigate using this feature extraction method as a preprocessor for CNNs. By using the reconfigurability of the FPGA, we can switch between EVD to perform a learning feature extraction on incoming data followed by a neural network. This provides processing complexity heretofore impractical on embedded devices used in organic computing applications. Incremental PCA for on-line visual learning and recognition MAVBench: micro aerial vehicle benchmarking Autonomic computing challenges in fully autonomous precision agriculture Managing edge resources for fully autonomous aerial systems An embedded CNN implementation for on-device ECG analysis Online principal component analysis in high dimension: which algorithm to choose? Enabling VLSI processing blocks for MIMO-OFDM communications Fixed-point CORDIC-based QR decomposition by Givens rotations on FPGA Squared givens rotation Reciprocation, square root, inverse square root, and some elementary functions using small multipliers A low effort approach to structured CNN design using PCA Flying IoT: toward lowpower vision in the sky Matrix Computations, 4th edn Hardware design of an eigensolver based on the QR method FDDB: a benchmark for face detection in unconstrained settings On the improvement and acceleration of eigenvalue decomposition in spectral methods using GPUs QR decomposition on GPUs A reconfigurable hardware architecture for principal component analysis MSGR-based low latency complex matrix inversion architecture 50 years of cordic: algorithms, architectures, and applications A new retiming-based technology mapping algorithm for LUTbased FPGAs Cordic-based Givens QR decomposition for MIMO detectors The elastic node: an experimentation platform for hardware accelerator research in the internet of things A survey of dimensionality reduction techniques Face recognition using eigenfaces FPGA-based implementation of QR decomposition The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany in the KI-Sprung LUTNet project (project number 16ES1125).