SPA-GCN: Efficient and Flexible GCN Accelerator with an Application for Graph Similarity Computation
Atefeh Sohrabizadeh, Yuze Chi, Jason Cong
2021-11-10

While there have been many studies on hardware acceleration for deep learning on images, there has been a rather limited focus on accelerating deep learning applications involving graphs. The unique characteristics of graphs, such as irregular memory access and dynamic parallelism, impose several challenges when the algorithm is mapped to a CPU or GPU. To address these challenges while exploiting all the available sparsity, we propose a flexible architecture called SPA-GCN for accelerating Graph Convolutional Networks (GCN), the core computation unit in deep learning algorithms on graphs. The architecture is specialized for dealing with many small graphs since the graph size has a significant impact on design considerations. In this context, we use SimGNN, a neural-network-based graph matching algorithm, as a case study to demonstrate the effectiveness of our architecture. The experimental results demonstrate that SPA-GCN can deliver a high speedup compared to a multi-core CPU implementation and a GPU implementation, showing the efficiency of our design.

Graphs are the core data structure used in datacenters and have wide application in different domains such as recommender systems, social networks, and the World Wide Web. Although they are widely used, they are mainly unstructured and have a high dimensionality, making them computationally expensive to process. In fact, many graph algorithms, such as Graph Edit Distance [46] and Maximum Common Subgraph [9], are known to be NP-complete [29, 75]. This problem has motivated researchers to apply deep learning on graphs with the goal of extracting structured, low-dimensional features from them (e.g., [31, 72]). More specifically, the purpose of this line of research is to assign a feature vector to each node of the graph that captures the role of that node in the graph. Such feature vectors are called node embeddings. In this context, Graph Convolutional Networks (GCN) [31] are widely used to extract the node embeddings of a graph. GCNs follow the same behavior as Convolutional Neural Networks (CNN) in learning. They consist of multiple layers through which the features of the nodes are propagated until rich information about the input graph is derived. As in a CNN, in each layer the GCN updates the node features by gathering the neighbors' features and passing their summation through a filter. GCNs have been shown to be successful in many domains including traffic prediction [80], facilitating web-scale recommender systems [72], molecular footprint calculation [19], logic optimization for EDA tools [22], etc. While some graph data tend to scale rapidly, there are also many graph data that are naturally limited in size: for example, chemical compounds and molecules [6, 7, 10, 40, 54], which have wide application in domains including drug development, quantum mechanics, physical chemistry, and biophysics [10, 64]; the LINUX dataset containing operating system program dependence graphs [57]; the GREC database consisting of graphs representing symbols from architectural and electronic drawings [18]; the Fingerprint database [61]; etc. [44, 64].
The average number of nodes in the graphs of these databases ranges from 5 to 50. Because of the vast application of small graphs, numerous algorithms have been proposed to obtain their information by extracting their features or learning how similar two graphs are [3, 11, 26, 30, 33, 36, 43], especially using GCNs [3, 26, 33, 36, 43]. In particular, SimGNN [3] proposed a GCN-based approach to learn a similarity score for such graphs. It demonstrates that a GCN-based approach is able to approximate the GED with high accuracy; hence, it expedites the graph similarity computation significantly for many applications. SimGNN targets searching/matching graphs from real-world graph databases, such as AIDS [40], LINUX [57], and IMDB [70], which consist of antivirus chemical compounds, program dependency graphs, and actor/actress ego-networks, respectively. The target graphs are relatively small, with 10 nodes on average, but the database contains millions of graph pairs, creating a large number of graph matching queries. Although the CPU implementation can finish each SimGNN query for such graphs in milliseconds (5.8 ms to be exact), processing millions of queries can take several hours, which is not acceptable and requires customized acceleration. Such workloads of graph searching/mining are increasing in importance; for example, searching for antivirus chemical compounds is an important step in drug repurposing. Despite the popularity and effectiveness of graph neural network (GNN) approaches, there has been limited research on developing an accelerator for them. Besides, the differences between an image and a graph structure, as explained below, make the countless CNN accelerators proposed in the literature (e.g., [12, 28, 37, 39, 47-51, 62, 63, 73, 77, 79]) incompatible. Furthermore, the characteristics of graphs impose several challenges when the algorithms are executed on CPUs or GPUs. More specifically, the unique features of GNNs impose the following challenges in designing an accelerator: • Irregular memory access and low data reuse: As opposed to images, in which the neighbors of a pixel are stored either in contiguous locations or at a fixed distance from one another, the neighbors of a node in a graph may be stored in any location in memory. This results in many irregular memory accesses to all levels of the memory hierarchy. To make matters worse, GNNs have much lower data reuse compared to CNNs. These characteristics make the application memory-bounded. Compared to traditional graph algorithms such as breadth-first search (BFS), single-source shortest path (SSSP), and PageRank (PR), the nodes have long feature vectors instead of a single scalar value. As a result, although both have irregular memory access, the access pattern is different. Also, this characteristic introduces new kinds of parallelism and data reuse. Now that each node deals with a long vector (often of length 256 or higher) rather than a scalar, we can exploit intra-node parallelism. Furthermore, all the vectors associated with different nodes share a weight matrix, which brings in data reuse opportunities. These differences make most graph-based accelerators (e.g., [2, 5, 13-16, 23, 34, 38, 58, 59, 71, 81, 82]) ill-suited for GNNs. • Computation pattern disparity: Different steps of the GCN algorithm deal with different sparsity rates, as will be discussed in Section 2.1.
In addition, a GNN may utilize other types of computation patterns, such as the neural tensor network computation in SimGNN (see Section 4.1 for details), to form an end-to-end application. Such computation pattern variations call for a customized processing unit for each of these steps. • Dynamic workload and parallelism: Apart from the random memory locations of the neighbors of a node, the number of neighbors also varies across different nodes. This results in load imbalance between the graph's nodes, which can degrade the performance significantly. In addition to the challenges mentioned above, dealing with small graphs (which is particularly challenging for GPUs, as shown in Section 5.4.2) requires special design considerations, as we will explain in Section 3. To solve these challenges, in this paper, we present SPA-GCN as an efficient and flexible GCN accelerator for small graphs that exploits all the available sparsity. Then, we apply it to accelerate the entire processing pipeline of SimGNN as an end-to-end application. Since we are facing a memory-bounded application, our goal is to have the least number of global memory transactions and to read each element only once. Because of the computation pattern disparity, we analyze the requirements for each step of the computation pipeline and, accordingly, develop a dedicated architecture for each of them. To deal with the irregular memory access, we utilize a scratchpad memory to store the matrices that need random access. This is doable as we are targeting small graphs. We further propose an efficient workload distribution mechanism to alleviate the load imbalance problem. Concisely, we fuse all the stages together, employ a very deep pipeline with four different levels of nested parallelization, and customize the computation units for each of the stages based on its workload, as listed in Table 1. These design decisions distinguish SPA-GCN from recently proposed GCN accelerators [21, 69, 74, 76], as summarized in Table 1. Section 6 compares our approach to prior works in more detail. While we use SimGNN for illustrating our approach, the same optimizations can be applied to other GCN-based networks dealing with small graphs such as [4, 26, 43]. Inspired by the success of Convolutional Neural Networks (CNN) on images, GCN [31] was developed to apply a series of convolutional layers on graphs. Layer l of a GCN takes an undirected graph G = (V, E, H^l) as the input, where V and E denote the nodes and edges of the graph, respectively. H^l ∈ R^{|V|×F_l} is the matrix of the input node embeddings for this layer, with each row containing the embedding (also called node features in the literature) of one of the nodes, where F_l indicates the number of features of each node at layer l. The core computation of a GCN layer to produce the output node embeddings, H^{l+1} ∈ R^{|V|×F_{l+1}}, is as follows: H^{l+1} = σ(A′ · H^l · W^l) (1) where σ(·) is an activation function, which typically is a ReLU, and W^l is a layer-specific trainable weight matrix. A′ is the normalized adjacency matrix with added self-connections and can be calculated as follows: A′ = D̃^{-1/2} (A + I) D̃^{-1/2} (2) where A and I are the adjacency and the identity matrix, respectively, and D̃ is a diagonal matrix in which D̃_ii is the degree of node i after the self-connection edges are added. Fig. 1 depicts the computation in Eq. 1 on a simple toy graph, assuming that A′ is given, the activation function is ReLU, and each node embedding is a vector of size 4. The first step in the computation gathers the information from the neighbors of each of the nodes, which is denoted in Eq. 1 as A′ · H^l.
Since A′ is a normalized matrix, the computation here is a weighted aggregation. After the Aggregation step, the node embeddings are transformed by applying a pre-trained set of weights. Note that the output embeddings may have a different vector size. In the end, the embeddings are passed through a ReLU layer to introduce non-linearity to the model. The time complexity of each GCN layer is O(|E| F), where |E| denotes the number of edges including the self-connection ones and F the feature length, since A′ · H can be implemented as a sparse-dense matrix multiplication [31]. As we will explain in Section 3.4, because of the ReLU unit at the end of each GCN layer, the node embeddings will also be a sparse matrix. However, the rate of sparsity is different, as the adjacency matrix is, in fact, an ultra-sparse matrix [21]. The first step in designing an architecture for GCN is to determine the order of the matrix multiplications. More specifically, we can compute Eq. 1 either as (A′ × H) × W or as A′ × (H × W). We have chosen the latter since it results in a smaller number of operations. Intuitively, this is because both matrices A′ and H are sparse, but their multiplication creates a dense matrix. As a result, in the former, we end up doing a dense-dense multiplication for the second multiplication. If we go with the latter, however, both multiplications are sparse-dense, which, as shown in AWB-GCN [21], reduces the number of operations. Figure 4 illustrates the high-level view of the GCN architecture in SPA-GCN. In this section, we employ a bottom-up approach to highlight the optimization opportunities when GCN is applied to small graphs and how we used them to build the GCN accelerator demonstrated in Fig. 4. As our application is memory-bounded and we are targeting small graphs, SPA-GCN is designed: • To reduce the number of times we access the global memory to the least amount possible. In our final architecture, each input element is read only once and there is no need to store any of the intermediate results in the global memory. • To exploit all the available sparsity. • To employ a deep pipeline with varying levels and degrees of parallelization for matching the workload of different stages and maximizing the overall performance. • To customize the architecture to efficiently handle small graphs. In this section, we describe the basic optimizations that can be applied for processing GCNs. We explain the optimized way of scheduling the operations of a GCN when targeting small graphs and justify the use of an outer-product-based multiplication [8, 55] for it. Furthermore, we demonstrate how we support sparse computation in the Aggregation step. Although these optimizations are necessary, they are not enough when dealing with many small graphs. We cover the rest of the optimizations in the subsequent sections. 3.2.1 Feature Transformation: In this step, one has to multiply matrices H ∈ R^{|V|×F_in} and W ∈ R^{F_in×F_out}, where F_in and F_out denote the number of input and output features, respectively. Since the matrix multiplication has |V| × F_in × F_out operations, the minimum latency for doing this computation is: Latency_min = (|V| × F_in × F_out) / #MAC where #MAC is the number of multiply-and-accumulate (MAC) units used. To achieve this latency, we should be able to schedule a new set of operations in each clock cycle. However, adopting an inner-product-based matrix multiplication results in updating the same output feature in consecutive iterations, which introduces a read-after-write (RAW) dependency between them.
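To make the chosen multiplication order and the dependency concrete, the sketch below renders one GCN layer in plain C++ with the A′ × (H × W) ordering. This is our own naive software reference under the notation above, not SPA-GCN's HLS code; all names are ours. The inner accumulation into HW[v][j] (and out[v][j]) is exactly the inner-product pattern whose read-after-write dependency is discussed next.

#include <vector>
using Mat = std::vector<std::vector<float>>;

// Naive rendering of one GCN layer with the A' x (H x W) ordering (Eq. 1).
// The "+=" accumulations update the same output element in consecutive
// iterations of the innermost loop, which is the RAW dependency in the text.
Mat gcn_layer_reference(const Mat& A_norm,  // |V| x |V| normalized adjacency A'
                        const Mat& H,       // |V| x F_in input node embeddings
                        const Mat& W) {     // F_in x F_out layer weights
  const size_t V = H.size(), F_in = W.size(), F_out = W[0].size();
  Mat HW(V, std::vector<float>(F_out, 0.f));
  for (size_t v = 0; v < V; ++v)            // Feature Transformation: HW = H x W
    for (size_t j = 0; j < F_out; ++j)
      for (size_t k = 0; k < F_in; ++k)
        HW[v][j] += H[v][k] * W[k][j];      // same output updated every iteration of k
  Mat out(V, std::vector<float>(F_out, 0.f));
  for (size_t v = 0; v < V; ++v)            // Aggregation: out = A' x HW
    for (size_t u = 0; u < V; ++u)
      for (size_t j = 0; j < F_out; ++j)
        out[v][j] += A_norm[v][u] * HW[u][j];
  for (auto& row : out)                     // ReLU activation
    for (auto& x : row) x = x > 0.f ? x : 0.f;
  return out;
}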
As a result, our pipeline cannot achieve an initiation interval (II) of one, which degrades the efficiency of our accelerator. Optimized Scheduling: Read the Weight Matrix Row-wise; Stream the Embeddings Matrix Column-wise. We can solve the aforementioned problem by changing the order of computations. To break the dependency, we should update different output locations by taking an element from one of the input matrices (read as a stream) and broadcasting it to parallel MAC units, while each MAC unit reads a different element from the other matrix. We choose to read H as a stream and to prefetch and cache W, since the latter is reused by all the rows of H. With this scheduling, we can support sparsity for this step more efficiently, as discussed in Section 3.4. Fig. 3(a) depicts the modified scheduling for cycle l. As the figure suggests, we divide the workload within a PE by parallelizing SIMD operations across the output features. To read each element only once, for each fetched element of H (e.g., h_00), we schedule all the operations it is involved in before its eviction. In addition to reusing h_00, this scheduling increases the number of cycles before a RAW dependency happens. This is because, by reading each element of the input node embedding, we can ensure that we will be updating different output locations in the next cycles. The X_{i-L} elements in the figure represent the results of the MAC operations that were scheduled L cycles (L being the latency of a MAC unit) before the current inputs, the X_i elements. As long as X_{i-j} ≠ X_i for all j ∈ [1 .. L-1], there is no dependency; hence, we can schedule a new operation every cycle and achieve II=1. To make sure that II=1 is possible, we read matrix H in column order. Note that if we rather read it in row order, we update the same output locations every F_out/SIMD iterations instead of every |V| × F_out/SIMD iterations. We add a second level of parallelization by duplicating the SIMD PE by a duplication factor (DF), which parallelizes the node dimension. As a result, we can make use of memory coalescing by packing and reading DF elements in each cycle. Fig. 3(c) illustrates the final execution order of this step. The arrows denote the high-level ordering of traversing different dimensions and the numbers show the elements that are accessed at their respective cycles. It is important to traverse the input feature dimension (F_in) last since it is the dimension causing the dependencies. In the baseline architecture, to ensure that the dependency distance is larger than L (to get II=1), we pad the matrix H with zeros by adding rows until (|V| + n_pad) × F_out/SIMD ≥ L, where n_pad is the number of padded rows. In Section 3.4, however, we insert bubbles in case of a dependency since the location of non-zero elements is dynamic. In this step, we must multiply matrices A′ ∈ R^{|V|×|V|} and HW ∈ R^{|V|×F_out}, where HW is the result of the Feature Transformation step and A′ denotes the normalized adjacency matrix with added self-connections. Due to the highly irregular accesses to the HW matrix when aggregating the features of the neighbors, we allocate a scratchpad memory to it. Matrix A′ is often ultra-sparse [21]. To reduce the number of both transferred elements and operations, we prune this matrix and only pass its non-zero elements, which represent edges, to the FPGA. For most graph databases, like the ones we target, this step is not needed since the graphs are already stored as a data structure that contains a list of the vertices and edges.
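As an illustration of that data structure, the following host-side sketch (ours; the function and type names are hypothetical, not from the SPA-GCN code) turns a dense adjacency matrix into the pruned edge list with the Eq. 2 normalization folded into each edge weight, so that A′ never has to be materialized on chip. The offline edge re-ordering discussed next would be applied to this list before it is sent to the FPGA.

#include <cmath>
#include <cstdint>
#include <vector>

struct Edge { uint16_t src, dst; float w; };   // one non-zero of A'

// Build the pruned, normalized edge list: self-connections are added and each
// edge carries its weight 1 / sqrt(d_src * d_dst) from Eq. 2.
std::vector<Edge> build_edge_list(const std::vector<std::vector<uint8_t>>& adj) {
  const size_t n = adj.size();
  std::vector<float> deg(n, 0.f);
  for (size_t i = 0; i < n; ++i) {
    deg[i] = 1.f;                                  // self-connection
    for (size_t j = 0; j < n; ++j) deg[i] += adj[i][j];
  }
  std::vector<Edge> edges;
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      if (i == j || adj[i][j])                     // keep self-loops and non-zeros only
        edges.push_back({static_cast<uint16_t>(j), static_cast<uint16_t>(i),
                         1.f / std::sqrt(deg[i] * deg[j])});
  return edges;
}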
Instead of dedicating an on-chip memory to storing the edges, we read them as a stream and update all the features of the destination node before retiring the edge. We further pre-process the adjacency matrix to not only precompute A′, but also re-organize the edges to prevent RAW dependencies. More specifically, before the edges are sent to the FPGA, they are re-arranged offline so that the ones with the same destination node are at least L locations apart, to make sure there is not more than one update to the same node within a window of L cycles. Without this re-ordering, we would either have to increase the II for the whole operation or add a control unit that inserts bubbles into the pipeline of this computation engine whenever there is a RAW dependency, to ensure correctness. We chose to pre-process the edges since, as we are dealing with small graphs, the time for this pre-processing is negligible. We only make use of feature-level parallelism to distribute the workload in this step. Edge-level parallelism can result in bank conflicts since edges update random nodes. In the last two sections, we explained the basic optimizations that should be applied to the major computation steps of GCN, Feature Transformation and Aggregation. The last step (ReLU) has a very low area and latency overhead. We only utilize a max(0, ·) unit at the end of the aggregation module for it. To further boost the performance, we add intra-layer pipelining by exploiting a dataflow architecture for connecting the modules (see Fig. 2). As a result, the overall latency will be close to the latency of the slowest module (here, the ACG module). In addition, we can avoid off-chip memory accesses in between these modules. The computation in the Feature Transformation step can be divided into two separate tasks. Task 1, the MULT unit, is responsible for doing all the multiplications, and Task 2, the ACC unit, does the additions. We pipeline these tasks by implementing them as separate modules that are connected by FIFOs. To avoid deadlocks, the throughput (i.e., II) of these units should match. The MULT module, as depicted in Fig. 2, has a local buffer to store the weights and streams the elements of H from the input FIFO. Each entry of this FIFO is a concatenation of DF elements (see Section 3.2.1). Both the SIMD factor and DF can be customized based on the target network configuration. Once the multiplication results are ready, they are packed and sent to the ACG module. The ACC unit of the Feature Transformation step and the Aggregation step share the HW matrix; thus, to save memory resources, we merge their computation into one module, shown as the ACG module in Fig. 2. After fetching the output of the MULT module, the ACG module unpacks the data based on the same DF and dispatches SIMD elements to each SIMD Acc Unit with the same SIMD factor. Once the additions are done, it stores the partial results in the local features buffer. After all updates are committed to the features buffer, the HW matrix is complete and the Aggregation step can start. The SIMD factor of this step is higher than the one in the Feature Transformation step since we only exploit feature-level parallelization here. After this step is finished, the elements of the out features buffer are added to a bias, passed through a ReLU unit, and stored into off-chip memory. Note that in the baseline architecture, we reuse the same modules for all the GCN layers.
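To summarize the Aggregation behavior just described, a software-level sketch of the edge-streamed update loop is given below. It is our illustration, not the actual ACG code; the Edge layout matches the host-side sketch given earlier, and in hardware the feature loop is unrolled by the SIMD factor while the offline re-ordering guarantees II=1.

#include <array>
#include <cstdint>
#include <vector>

struct Edge { uint16_t src, dst; float w; };   // same layout as the host-side sketch

// Edges arrive as a stream; each edge updates all F_OUT features of its
// destination node before it is retired. Only feature-level parallelism is
// used because edges update random rows of the scratchpad. The offline
// re-ordering keeps same-dst edges at least L positions apart, so a new edge
// can be issued every cycle without a RAW hazard on 'features'.
template <int F_OUT>
void aggregate(const std::vector<Edge>& edges,
               const std::vector<std::array<float, F_OUT>>& hw,    // result of H x W
               std::vector<std::array<float, F_OUT>>& features) {  // on-chip scratchpad
  for (const Edge& e : edges) {
    for (int f = 0; f < F_OUT; ++f) {   // SIMD (feature-level) parallelism in hardware
      features[e.dst][f] += e.w * hw[e.src][f];
    }
  }
}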
As is commonly practiced ([35, 69, 74, 76]), in the baseline architecture, we only exploit intra-layer pipelining and reuse the modules for all the GCN layers. However, this is not sufficient when we are dealing with small graphs. The off-chip communication is a serious burden for this application since it deals with small-sized inputs. To alleviate this problem, we intend to reduce the number of accesses to the off-chip memory as much as possible. The baseline architecture is inefficient in this regard since, at the end of each layer, the output has to be stored to the off-chip memory and read back again for the next layer. To avoid these redundant accesses, we propose to extend the dataflow architecture described in Section 3.2.3 to all layers of GCN. To realize this, we instantiate new modules for each layer and connect them with FIFOs, as depicted in Fig. 4. The various and large on-chip memory resources of the FPGA (LUT, BRAM, and URAM) and the rather small matrices used in our target application enable us to dedicate separate modules to each of the layers. Fusing the computation for all the layers by enabling a dataflow architecture has several benefits: 1) We can avoid writing the intermediate results to the global memory by forwarding them to the next layer through FIFOs. 2) The operations will be dynamically scheduled since each module can perform its operation whenever it has data available. 3) Since we are instantiating different modules for each layer, we can customize the parallelization factors of each module based on the available workload of its respective GCN layer. 4) As the adjacency matrix of a graph does not change across different layers, we can read the edges from the global memory only once for the first layer and reuse them for the subsequent ones by transferring them through the on-chip FIFOs. The input node embedding to the first layer of GCN usually contains many zero elements since it often uses one-hot encoding for assigning initial vectors to nodes. Furthermore, as shown in Fig. 1, at the end of each GCN layer there is a ReLU unit. As a result, the matrix generated by each layer, which is the input to the next layer, is sparse. In fact, we saw 52% and 47% sparsity on average for the input to the second and the third layers of GCN in SimGNN for randomly drawn graphs from our target dataset. As a result, the Feature Transformation step also needs to support sparse computation. To reduce the number of operations, we prune the zero elements and only pass the non-zero ones to the next layer. As the updates to the output buffer may come in random cycles, it is necessary to store the buffer containing the partial results on-chip to enable random access. For the same reason, we pack the node features with their address, which includes their row and column ID. Packing the elements with their address makes the dispatch unit simpler since each SIMD PE is free to work with any data and knows which partial result should be updated; hence, there is no need to take special measures to navigate the data to the correct PE. We only need to make sure that, at all times, each SIMD PE is working with a different memory bank. We employ an arbiter for this matter, as explained below. As mentioned in Section 3.2, to reduce the number of RAW dependencies, we chose to stream the node embedding matrix and broadcast it to different computation units (CUs), which read the weight matrix as a batch.
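To picture the packed representation, a hypothetical sketch of the stream element and the producer-side zero pruning is shown below. The names are ours; in the HLS design the FIFOs would be hls::stream objects with non-blocking accesses, for which a plain std::queue stands in here.

#include <cstdint>
#include <queue>

// One packed stream element: the non-zero embedding value travels together
// with its (row, column) address, so any SIMD PE can consume it and knows
// which partial result to update without a fixed data-to-PE mapping.
struct PackedFeat {
  float    val;
  uint16_t row;   // node (row) ID
  uint16_t col;   // input-feature (column) ID
};

// Producer-side pruning: only non-zero features enter one of the P FIFOs.
inline void push_if_nonzero(std::queue<PackedFeat>& fifo,
                            float val, uint16_t row, uint16_t col) {
  if (val != 0.f) fifo.push(PackedFeat{val, row, col});
}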
Since the node embedding is a sparse matrix, reading it as a stream facilitates the pruning mechanism we employ and enables us to distribute the workload more efficiently. Fig. 5 demonstrates a toy example illustrating the benefit of this scheduling. The colored squares show the non-zero elements of the node embedding matrix. By mapping the weights, which are non-zero, to the SIMD dimension, all the CUs in the PE execute useful operations and we can skip all the operations involving a zero node embedding. Figure 5: The benefit of streaming the node embeddings and mapping the weights to the SIMD dimension for supporting sparse computation. When skipping the zero node embeddings, the dependency distance for output elements may change dynamically since the number of non-zero inputs between the updates to the same location can be different. Even though the scheduling discussed in Section 3.2 increases the dependency distance as much as possible by performing all the MAC operations when a non-zero input is encountered (each non-zero element fills F_out/SIMD cycles of the dependency window), there may still be cases where the dependency distance is less than L (the latency of our function unit (FU) causing the dependency) after this optimization. Instead of setting the II to L to ensure correctness, we first insert L registers to store the partial results of the FU at the end of each of its pipeline stages; hence, we can schedule a new set of operations at each clock cycle (II=1). There may be cases where the newly scheduled operations want to update a location whose old value is still in the registers and has not yet been written back to the buffer. To ensure correctness, we add a control unit which keeps track of the last cycle in which each of the output locations was updated. If the number of cycles between two updates to the same location is less than L, the control unit inserts bubbles into the pipeline until the previous update is committed. We insert a unit for pruning zeros at the end of the Aggregation step in the ACG module. As Fig. 6 shows, we evaluate P elements of the node embeddings and pass each to a FIFO if it is not zero. To prevent deadlocks in the system, we use non-blocking reads and writes for these FIFOs and only write (read) elements when the FIFO is not full (empty). The MULT module of the Feature Transformation step of the next layer takes the P FIFOs as the input and uses an arbiter to fetch, at most, DF of them (DF <= P) for passing to the DF SIMD PEs. The arbiter keeps track of the FIFO whose turn it is to be read first in the next cycle. It then uses a round-robin ordering for dispatching the elements from the non-empty FIFOs. After dispatching the inputs, it checks for RAW dependencies by scanning the prev iter buffer, which contains the last cycle in which each element was seen as the input. If the distance is less than L, it inserts bubbles into the pipeline until the previous input has committed its update, to avoid the conflict. If there is no dependency, for each memory bank no more than one element from the dispatched inputs is issued to a SIMD PE, and the current cycle number is stored in the prev iter buffer for that input. To recap, the SPA-GCN architecture provides flexibility in choosing the parallelization rates. Table 2 lists the parameters that can be tuned for each GCN layer based on its workload. The SIMD factors correspond to feature-level parallelization, while DF and P map to the node dimension. In Section 3, we proposed an architecture for GCN specialized for small graphs.
In this section, we extend our architecture to accelerate an end-to-end application, SimGNN. An end-to-end application introduces new computation patterns beyond GCN, which require customized processing units. We introduce a new level of parallelization in Section 5.4.3. Bai et al. [3] proposed a neural-network-based approach to assign a similarity score to two graphs. Its computation pipeline consists of four major stages, as depicted in Fig. 7. The first stage is made up of three layers of GCN to extract the node embeddings H ∈ R^{|V|×F}, where F is the number of features of the last layer. In the second stage, it uses a Global Context-Aware Attention layer (Att) to combine the node embeddings and generate a single embedding per graph, h ∈ R^F. For this matter, at first, the Att stage computes the global context c by taking an average of the node embeddings followed by a non-linear transformation: c = tanh( (1/|V|) W_2 Σ_{n=1}^{|V|} h_n ) where W_2 ∈ R^{F×F} is a learnable weight matrix and h_n denotes the embedding of node n. Then, it calculates the importance of each node by constructing an attention weight for it, denoting its similarity score to the global context, by computing the inner product of its embedding with c. Finally, the graph embedding can be calculated by taking a weighted sum of the node embeddings using the attention weights. The computation in this stage can be summarized as follows: h = Σ_{n=1}^{|V|} σ(h_n^T c) h_n (3) where σ(·) denotes the sigmoid function that produces the attention weight a_n. The time complexity of this stage can be seen to be O(|V| F). The third stage is a Neural Tensor Network (NTN) that calculates a vector of similarity scores between the two graphs: s = f( h_1^T W_3^{[1:K]} h_2 + V_w [h_1; h_2] + b_3 ) where W_3^{[1:K]} ∈ R^{F×F×K}, V_w ∈ R^{K×2F}, and b_3 ∈ R^K are a learnable weight tensor, weight matrix, and bias vector, respectively, h_1 and h_2 are the embeddings of the two graphs, and f(·) is an activation function. K is a hyper-parameter that controls the number of similarity scores. The time complexity of this stage is O(K F^2), where F denotes the dimension of the graph-level embedding. In the last stage, a standard fully connected network is used to gradually reduce the similarity vector to only one score. Figure 7: The SimGNN pipeline as depicted in [3]. Since the core computation stage of SimGNN consists of GCN layers, the challenges mentioned in the previous sections still exist here. Furthermore, the rest of the stages make use of exponential and hyperbolic tangent functions, which are expensive to implement on an FPGA and can limit their rate of parallelism. By comparing the computation complexity of these stages with that of GCN (Section 2.1), we can see that the GCN step is the most computation-intensive step; hence, when pipelining all the stages together, the accelerator will be bottlenecked by the GCN step. Therefore, we do not aggressively parallelize the rest of the steps. We rather focus on reducing their resource utilization. The SimGNN pipeline applies the GCN stage to two graphs for each comparison query. Instead of duplicating the architecture in Fig. 4 twice, we process the graphs serially and reuse the GCN module for the two input graphs in the query. Reusing the GCN module enables us to map the design to smaller FPGAs as well. Note that the GCN stage is the most resource-hungry one of them all. We improve the performance for processing one query by overlapping the GCN computation of one graph with the Att computation of the other one. Thus, the total performance will be bottlenecked by the performance of GCN. As a result, we focus on reducing the area and reusing the resources for Att. Let U ∈ R^{F×|V|} be the transposed result of the GCN stage of SimGNN; then, its n-th column (u_n) is the vector h_n in Eq. 3.
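Before describing how the adders are reused, a plain software-level rendering of the Att computation of Eq. 3, written directly over the columns of U, might look like the sketch below. It is our illustration (the function name is hypothetical), not the SimGNN or SPA-GCN code, and it ignores the hardware pipelining described afterwards.

#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;

// Plain rendering of the Att stage: global context c, per-node attention
// weight a_n = sigmoid(u_n . c), and the graph embedding h = sum_n a_n * u_n.
// U holds one node embedding per column (F x |V|); W2 is the learnable F x F matrix.
Vec attention_pool(const Mat& U, const Mat& W2) {
  const size_t F = U.size(), V = U[0].size();
  Vec mean(F, 0.f), c(F, 0.f), h(F, 0.f);
  for (size_t f = 0; f < F; ++f) {                  // average of the node embeddings
    for (size_t n = 0; n < V; ++n) mean[f] += U[f][n];
    mean[f] /= static_cast<float>(V);
  }
  for (size_t i = 0; i < F; ++i) {                  // c = tanh(W2 * mean)
    for (size_t j = 0; j < F; ++j) c[i] += W2[i][j] * mean[j];
    c[i] = std::tanh(c[i]);
  }
  for (size_t n = 0; n < V; ++n) {                  // weighted sum of node embeddings
    float dot = 0.f;
    for (size_t f = 0; f < F; ++f) dot += U[f][n] * c[f];
    const float a_n = 1.f / (1.f + std::exp(-dot)); // sigmoid attention weight
    for (size_t f = 0; f < F; ++f) h[f] += a_n * U[f][n];
  }
  return h;
}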
In computing c = tanh( (1/|V|) W_2 Σ_{n=1}^{|V|} h_n ), we first have to add the h_n vectors and then do a matrix-vector multiplication (MVM). Instead of instantiating separate adders for the first additions and the ones in the MVM, we rewrite the equation as follows to reuse the adders: c = tanh( (1/|V|) reduce(W_2 · U, 2) ) where reduce(·, 2) denotes the reduction of the result matrix across its second dimension (columns), meaning that all the multiplications associated with a column of U should be added together. Fig. 8 demonstrates an overview of the Att module. As in the GCN stage, we divide the matrix multiplication into two different modules, one responsible for the multiplications and the other for the additions. Again, we use SIMD PEs to implement these modules. However, the SIMD factor here can be set to a different value compared to the GCN stage since they have different computation complexities. The Repack module is responsible for adjusting the output of GCN to the SIMD factor of this stage. We use the implementations of the tanh and exp functions available in the Xilinx HLS Math library since they are already optimized. Note that the last summation in Eq. 3 can be seen as U × a, where a ∈ R^{|V|} contains the sigmoid results. Hence, we use a matrix-vector multiply (MVM) unit at the end. The computation in the NTN stage is rather simple since it is a series of fixed-size MVMs followed by a bias addition and an activation function. Furthermore, the layers of the fully-connected network (FCN) in the last stage need either an MVM unit or a reduction tree to lower a vector to a scalar. Like the previous stages, we implement all the sub-modules of these two stages in a dataflow manner. Fig. 9 depicts the architecture of these two steps. Finally, we develop the SimGNN architecture by connecting the modules described in the previous sections. The whole computation pipeline is implemented as a three-level dataflow architecture. The first two levels resemble inter-stage pipelining, while the last one implements intra-stage pipelining. The first level enables task-level parallelization by grouping the graph-related steps, the GCN (Section 3) and Att (Section 4.2) modules, and overlapping them with the rest, the NTN_FCN module (Section 4.3). The second level of the dataflow architecture overlaps the GCN stage with the Att stage. Finally, the last level applies a dataflow architecture to each of the GCN, Att, and NTN_FCN modules, as shown in Figs. 4, 8, and 9, respectively. The Pre-fetcher of the GCN module also reads the weights and biases of the rest of the stages and distributes them to the right places. We apply three optimizations for reducing the off-chip communication latency: 1) each input buffer can be mapped to a different DRAM bank or HBM channel to enable parallel accesses to them; 2) the available global memory bandwidth is fully utilized by applying memory coalescing, and memory bursts are also applied to amortize the initialization overhead; 3) the modules that access the global memory are overlapped with the computation modules by implementing the accelerator as a dataflow architecture. In this section, we first review our target dataset in Section 5.1 and present the experimental setup in Section 5.2. We then evaluate the effects of our optimizations on the GCN architecture (Section 3) step-by-step in Section 5.3. The whole pipeline of SPA-GCN is then evaluated in Section 5.4 on three different FPGAs that differ in the amount of resources available and the type of global memory they employ.
Finally, we compare the performance of SPA-GCN to the CPU and GPU implementations and demonstrate the superiority of our design. Related work on general GCN acceleration, sparse CNN, and sparse matrix multiplication is discussed in Section 6. We consider a publicly available dataset, AIDS [40], for benchmarking our design. AIDS is a real-life graph dataset containing 42,687 antivirus chemical compounds gathered by the Developmental Therapeutics Program at NCI/NIH. The graphs in AIDS have 25.6 nodes and 27.6 edges on average. We randomly select 10,000 pairs to form 10,000 queries. The kernel time and end-to-end (E2E) time reported in this section are the average over all queries. The SPA-GCN architecture is described using Vivado HLS C++ [68]. The design is synthesized and implemented using Xilinx Vitis 2019.2 [67] on three different target platforms: Xilinx Alveo U50, Xilinx Alveo U280, and Xilinx Kintex UltraScale+ KU15P. The first two are equipped with HBM2 [27] and, ideally, can achieve a bandwidth of 316 GB/s (460 GB/s) with a TDP of 75W (225W), respectively [56, 66]; the last one utilizes DDR4 as the global memory. Table 3 compares the hardware resources of these boards. For comparison to CPU and GPU, the PyTorch-based implementation of SimGNN from [45] is used, which is built using the state-of-the-art PyTorch Geometric (PyG) library [20] that is commonly used as a baseline by previous works [21, 35, 69]. For the Aggregation step, PyG exploits sparsity and edge-level parallelism by adopting the PyTorch Scatter library [1]. For the Feature Transformation step, it uses Intel MKL [25] and the NVIDIA cuBLAS library [41] for CPU and GPU, respectively, making it a reasonable and optimized baseline. The target CPU in our experiments is an Intel(R) Xeon(R) CPU E5-2699 v4 running at 2.2 GHz. For testing on GPU, we use an AWS p3.2xlarge instance, which has an NVIDIA V100 GPU. Table 4 shows the resource usage and performance of the SPA-GCN architecture when accelerating the three GCN layers of SimGNN on the U280 FPGA. The baseline uses the same hardware for all GCN layers. With inter-layer pipelining added, all three GCN layers run in parallel as a coarse-grained pipeline. Since each layer utilizes different pieces of hardware, we can customize the design parameters to match the throughput of each layer. As a result, the three layers require 2.4× more DSPs compared with the baseline. Although the BRAM usage appears to be smaller, more URAM is used with inter-layer pipelining since we distribute the required resources to obtain a better frequency. The total storage usage is increased in order to store all the buffers for the different GCN layers and the FIFOs for the intermediate results between them. The GCN kernel time is reduced by 36% with inter-layer pipelining added to the baseline. However, if we look at the latency-area product metric, i.e., Kernel×DSP, we can see that the performance improvement does not keep up with the increase in computation units (DSPs), suggesting potential for further optimizations. Sparsity in Feature Transformation: Indeed, we can extend sparsity support to the Feature Transformation step in GCN to improve both the kernel time and the latency-area product metric, as explained in Section 3.4. Although using P queues helps the arbiter fetch non-zero elements more frequently, it may still not be enough to dispatch data to all the DF PEs.
Furthermore, by increasing the DF, we may need to insert more bubbles into the pipeline to avoid RAW dependencies, since it reduces the number of cycles between the updates to the same location. As a result, there is a trade-off in choosing the right DF for each layer. Since this is a highly workload-dependent decision, we employ profiling results for setting each of the parallelization factors. The best parallelization factors are summarized in Table 4. When DF is set to 1, we no longer need separate banks in the row dimension of the buffers, which lessens the number of memory blocks needed. This makes it more efficient to use dense memory blocks (BRAM and URAM) as opposed to LUTs for the buffers; thus, the BRAM and URAM usage increases. As Table 4 shows, extending sparsity to the Feature Transformation step on top of the inter-layer pipeline further reduces the kernel time by 31%, effectively achieving a 2.27× speedup over the baseline, while decreasing the DSP usage by 4.09×. The results clearly suggest that, since this is a memory-bounded application, throwing more resources at the architecture is not helpful. Instead, the memory access latency should be reduced and the computation units should be used more efficiently. Since a large number of zero elements are excluded and far fewer DSPs are required, the latency-area metric Kernel×DSP is greatly improved, by 3.88× over the baseline. As the figure shows, we allocate most of the resources to the GCN stage as it is the computation-intensive part of the network. Table 5 shows the resource usage and performance for the three FPGA platforms. We can see that the kernel runs faster on the HBM FPGAs compared to the KU15P. This is mainly due to the fact that the HBM FPGAs can achieve a better frequency, as they have more resources and the Vitis tool has more freedom in PnR to optimize the timing. The fact that the multiplication and addition units have different latencies on these boards (5 and 8 cycles for KU15P; 4 and 7 cycles for U280, respectively) further increases the difference in the runtimes. In fact, the cycle count of the same kernel when it uses different types and numbers of banks for the global memory is almost the same. This suggests that, after our optimizations, the bottleneck is no longer at the memory level. Besides, we can see that the end-to-end time is much larger than the kernel time (runtimes are measured on the host using the CPU clock). This is mainly because of the overhead of launching the kernel and the PCIe transfers, since all the pre-processing steps take less than 5% of the end-to-end time. This suggests the potential for batching queries to further improve the end-to-end throughput (Section 5.4.3). We test the performance of the whole pipeline of SimGNN on the CPU and GPU described in Section 5.2. In this section, we assume that the inputs are already stored in the host memory, and we want to offload the graph comparison queries to either of the target platforms. The goal is to compare the performance of these platforms for processing a graph matching query. Table 6 summarizes the results. The results are averaged over 10,000 queries on each of the platforms. The queries are started sequentially, and the end-to-end time on all the platforms is the time interval between the starts of two consecutive queries. This includes the runtime of any pre-processing steps as well.
For the FPGA and GPU, it also includes the host-kernel communication via the PCIe link, writing data to the FPGA/GPU's global memory, the kernel computation, reading the results back, and the overheads of using the APIs (OpenCL for FPGA and PyTorch for CPU/GPU). We use the end-to-end time for comparison since these overheads are inevitable and should be accounted for. The kernel time on CPU/GPU is measured with the PyTorch profiler. The results demonstrate that our FPGA solution can outperform both the CPU and the GPU significantly. As discussed in Section 1, this is partly because of the dynamic load imbalance and the irregular memory accesses of the graph structure. Furthermore, since we target small graphs, the GPU is extremely under-utilized. In fact, the profiling results indicate that the GPU utilization does not go higher than 6% and, for the most part, PyG-GPU only uses one streaming multiprocessor (SM) since the matrices are small. Because of this, and the fact that the GPU runs at a lower frequency (1.3 GHz) compared to the CPU (2.2 GHz), the GPU version of this application is even slower than the CPU one. The nvprof profiling results show that PyG-GPU runs 225 kernels for accelerating this application, which on average run for only 4.6 µs each. With this low computation intensity, the overhead of launching a kernel (such as cudaLaunchKernel) is larger than the actual kernel runtime, which greatly impacts the GPU performance. Designing the GPU kernels manually could alleviate some of these shortcomings, but the underlying problem still exists due to the coarse-grained execution model of the GPU. In contrast, our FPGA solution suffers from the kernel initialization overheads only once, since we develop a deep pipeline across all stages of the computation by fusing them into one kernel. This pipelining has several other benefits, as explained in Section 3.3. The difference between the end-to-end time and the kernel time on the FPGA is mainly due to both the DMA transfers and the overhead of the OpenCL APIs. In fact, our experiments using the Profile Summary provided by the Vitis Runtime library [65] show that the OpenCL APIs can take 10-100 µs, which is comparable to the kernel execution for one query. Therefore, SPA-GCN implements batching of graph queries to amortize these overheads. We can send the data for multiple queries to the FPGA, process each query serially, and send the results back together. Fig. 11 demonstrates the results of this experiment on the U280 board. The results suggest that by transferring the data for ~300 queries, we can amortize the setup time and achieve a 2.8× speedup. Batching queries makes it possible to further improve the throughput by adding another level of parallelization. When targeting the HBM FPGAs, we use 4 of the HBM pseudo-channels (PCs) for processing one graph matching query. As each HBM FPGA has 32 PCs that can be accessed independently, we can further improve the performance by replicating the logic of SPA-GCN for one query to process up to 8 different queries in parallel, as long as we have enough resources available. Table 3 illustrates that the available resources allow us to instantiate 6 SPA-GCN pipelines on the U280 before hitting the 80% resource usage upper-bound. This does not change the latency of each graph query, but it further increases the throughput by 6×, resulting in a throughput of 33,522 queries/s on the U280.
6 Related Work. GCN Accelerators: Because of the popularity of GCN, there is a growing interest in developing accelerators for it [21, 35, 69, 74, 76].
As summarized in Table 1, HyGCN [69], GraphACT [74], and Prasanna et al. [76] develop a fixed hardware design for all the layers of GCN and process them sequentially. This is an undesirable feature, particularly when we target small graphs. In fact, our baseline architecture, in which we reuse the same architecture for all the GCN layers, exploit only the sparsity of the Aggregation step, treat the Feature Transformation step as a regular matrix multiplication, and employ a 2D computation unit for it, has the same design principles as these works. However, as the experimental results in Table 4 show, not only should we execute the GCN layers in a pipelined fashion, but we should also exploit the sparsity of the node embeddings to improve both area and performance. In addition, GraphACT and Prasanna et al. [76] mainly rely on redundancy reduction in the Aggregation step to decrease the number of operations by pre-computing the repeated aggregations. This is possible only when the adjacency matrix is binary. However, not all GCNs can benefit from this feature since they typically work with the normalized adjacency matrix, meaning that they need weighted additions in this step. AWB-GCN [21] proposes an architecture that supports inter-layer pipelining and accounts for the sparsity of the node embeddings when accelerating GCN. However, partitioning the computation by nodes in their approach complicates the design of the task distributor, since the node embeddings are sparse and special consideration is needed to prevent PEs from doing unnecessary operations on the zero elements. Feature-level parallelization, on the other hand, deals better with the workload imbalance, as we discussed in Section 3. In addition, AWB-GCN is developed for large graphs and is based on inner-product matrix multiplication; whereas, as we discuss in Section 3.2, outer-product multiplication is preferred to reduce the RAW dependency, especially when we are targeting small graphs and do not have enough (non-zero) nodes to fill the dependency window. Moreover, when targeting small graphs, we need more optimizations. For example, it is important to reduce the number of accesses to the global memory as much as possible. This is not the focus of the previous works. Besides, in addition to fitting the design for the whole network into an FPGA, we have the option to process parallel queries of the network. Table 1 summarizes the differences of our design compared to the aforementioned works on GCN acceleration. SpMM and SCNN Accelerators: Apart from the works focusing on GCN, there has been a lot of research on sparse matrix multiplication (MM), either for pruned CNNs or for general MM [17, 24, 32, 42, 52, 53, 60, 78, 83]. They all rely on the fact that the sparse matrix is known offline and can be pre-processed. For example, EIE [24] proposes a sparse matrix-vector multiplier for fully-connected layers. It reorganizes the sparse matrix in the compressed sparse column (CSC) format and pre-loads it into on-chip memory. As another example, Kung et al. [32] pre-process the data by merging multiple sparse columns of the weight matrix into one and pruning all the weights except for the most significant ones, resulting in some accuracy loss. These approaches are not feasible for GCN, in which the sparse matrix (i.e., the node embeddings) is generated while running the algorithm; in contrast, we proposed a technique to prune the zeros on the fly.
In this paper, we analyzed and examined the optimization opportunities when GCN is applied to small graphs. We presented an efficient architecture, SPA-GCN, and developed an accelerator for SimGNN based on it as an end-to-end application. SimGNN is a neural-network-based graph matching algorithm for calculating an approximation of the GED between two graphs. The computation disparity existing in the network calls for a customized accelerator. Besides, since the GPU has a coarse-grained execution model, we cannot improve beyond the optimizations applied within each phase, since different phases are executed separately. On the FPGA, however, we can exploit a deep pipeline across the phases by enabling a dataflow architecture. Not only does this help us reduce the global memory transactions, it also eliminates the overhead of running different kernels. Furthermore, we showed that since this is a memory-bounded application, instantiating many computation units (as in a GPU) is not beneficial. The experimental results demonstrate that SPA-GCN can outperform the CPU and GPU by 18.2× and 26.9×, respectively, because of the use of a very deep pipeline with different levels of parallelization.

ACKNOWLEDGMENTS
We would like to thank Marci Baun for editing the paper. A significant part of this work was conducted while the first author was interning at Samsung Semiconductor Inc. It is also supported by the CAPA award jointly funded by NSF (CCF-1723773) and Intel (36888881), the RTML award funded by NSF (CCF-1937599), and CDSC industrial partners.

REFERENCES
Lazy release consistency for GPUs
SimGNN: A neural network approach to fast graph similarity computation
Learning-based efficient graph similarity computation via multi-scale convolutional set matching
Analysis and optimization of the memory hierarchy for graph processing workloads
The protein data bank
PubChem: integrated platform of small molecules and biological activities
On the representation and multiplication of hypersparse matrices
A graph distance metric based on the maximal common subgraph
Alchemy: A quantum chemistry dataset for benchmarking AI models
An efficient algorithm for graph edit distance computation. Knowledge-Based Systems
Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks
NXgraph: An efficient graph processing system on a single machine
Towards General Purpose Acceleration by Exploiting Common Data-Dependence Forms
FPGP: Graph processing framework on FPGA, a case study of breadth-first search
ForeGraph: Exploring large-scale graph processing on multi-FPGA architecture
CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices
Report on the second symbol recognition contest
Convolutional networks on graphs for learning molecular fingerprints
Fast graph representation learning with PyTorch Geometric
AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing
Deep learning for logic optimization algorithms
Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. ACM International Symposium on Microarchitecture (MICRO). 1-13
EIE: Efficient inference engine on compressed deep neural network
Graph Neural Networks with Multiple Feature Extraction Paths for Chemical Property Estimation
JEDEC. 2020. High Bandwidth Memory (HBM)
FPGA Stripes: Bit-serial deep neural network computing
On the approximability of the maximum common subgraph problem
Inves: Incremental Partitioning-Based Verification for Graph Similarity Search
Semi-supervised classification with graph convolutional networks
Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization
Graph matching networks for learning the similarity of graph structured objects
Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware
EnGN: A High-Throughput and Energy-Efficient Accelerator for Large Graph Neural Networks
Multi-view graph neural networks for molecular property prediction
TFE: Energy-efficient transferred filter-based engine to compress and accelerate convolutional neural networks
Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling
High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression
SCNN: An accelerator for compressed-sparse convolutional neural networks
GHashing: Semantic Graph Hashing for Approximate Similarity Search in Graph Databases
IAM graph database repository for graph based pattern recognition and machine learning
A distance measure between attributed relational graphs for pattern recognition
Simba: Scaling deep-learning inference with multi-chip-module-based architecture
Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network
Escher: A CNN accelerator with flexible buffering to minimize off-chip transfer
End-to-End Optimization of Deep Learning Applications
PipeLayer: A pipelined ReRAM-based accelerator for deep learning
MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product
Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations
ZINC 15 - ligand discovery for everyone
SUMMA: Scalable universal matrix multiplication algorithm
Alveo U280 Data Center Accelerator Card Data Sheet
An efficient graph indexing method
Gunrock: A high-performance graph processing library on the GPU
Processor assisted worklist scheduling for FPGA accelerated graph processing on a shared-memory platform
Dual-side Sparse Tensor Core
NIST special database 4. Fingerprint Database, National Institute of Standards and Technology
TGPA: Tile-grained pipeline architecture for low latency CNN inference
Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs
MoleculeNet: a benchmark for molecular machine learning
Vitis Manual - UG1393
Alveo U50 Data Center Accelerator Card Data Sheet
Vitis Unified Software Platform
Vivado Design Suite User Guide - High-Level Synthesis (UG902)
HyGCN: A GCN accelerator with hybrid architecture
Deep graph kernels
GraphABCD: Scaling Out Graph Analytics with Asynchronous Block Coordinate Descent
Graph convolutional neural networks for web-scale recommender systems
A framework for generating high throughput CNN implementations on FPGAs
GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms
Comparing stars: On approximating graph edit distance
Accelerating large scale GCN inference on FPGA
Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks
Cambricon-X: An accelerator for sparse neural networks
DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs
T-GCN: A temporal graph convolutional network for traffic prediction
Accelerating large-scale single-source shortest path on FPGA
High-throughput and energy-efficient graph processing on FPGA
Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs