title: X-CEL: A Method to Estimate Near-Memory Acceleration Potential in Tile-Based MPSoCs
authors: Rheindt, Sven; Fried, Andreas; Lenke, Oliver; Nolte, Lars; Sabirov, Temur; Twardzik, Tim; Wild, Thomas; Herkersdorf, Andreas
date: 2020-06-12
journal: Architecture of Computing Systems - ARCS 2020
DOI: 10.1007/978-3-030-52794-5_9

abstract: Near-memory acceleration strives to tackle the data-to-task locality issue in MPSoCs in order to obtain higher performance and lower power consumption. However, it is not easy to determine whether the advantages arise from the near-memory integration or from the hardware acceleration (versus software execution). We propose X-CEL, a method to accurately estimate the potential of near-memory acceleration using an easy-to-integrate near-memory core. We showcase X-CEL's benefits with three variants of graph copy mechanisms in a tile-based MPSoC. Evaluations reveal that the estimated speedup is in good accordance with the actual speedup achieved by the near-memory accelerator.

The performance and power consumption of today's MPSoCs depend on data-to-task locality more than ever. A significant amount of energy and time is spent on data transfers between processor cores and main memory, especially for memory-intensive applications, which are dominated by data access and movement [3, 10]. Conventionally, sophisticated cache hierarchies are used to improve data-to-task locality by bringing data closer to the processor cores, thus lowering memory access latencies and the energy footprint. However, their benefit is decreasing due to the emergence of large, irregular, and cache-unfriendly datasets in today's and future applications [10]. The locality challenge becomes worse when shifting towards tile-based manycore architectures, as the distance between physically distributed cores and memory grows.

Many recent approaches therefore leverage in- or near-memory computing to bridge the widening gap between processors and memory [1, 16, 22, 25, 27]. The majority of them use near-memory accelerators (NMAs), which perform their task both close to memory and in a dedicated hardware module (either near the memory or in a specific accelerator layer of a 3D-stacked circuit). NMAs usually achieve a higher computational density and performance than a software solution while saving energy and resources at the same time. On the other hand, they sacrifice the flexibility of general-purpose computing, and every new accelerator requires a significant hardware development effort.

However, it is not always clear which portion of the performance advantage of an NMA originates from the location (i.e., near-memory integration) and which from the type of function implementation (i.e., software execution versus hardware acceleration). The impact of either effect is highly dependent on the application and the underlying system architecture. To determine the optimal design, it is therefore essential to analyze the influence of both effects on multiple important user- and case-specific decision criteria, such as performance, power consumption, resource usage, design effort, flexibility (general- vs. fixed-purpose), and system or accelerator utilization. The analysis of whether

1. a near-memory integration (near-memory core or near-memory accelerator) is beneficial at all,
2. a dedicated hardware accelerator can outperform a software-programmable core for the given task, or
3. only the combination of both achieves a speedup,

is crucial to avoid unnecessary and costly development effort. However, it is not trivial to quantitatively predict the effect of the individual optimizations before implementing and measuring them. Further, it has to be determined whether a near-memory core or accelerator can handle the workload that is outsourced to it by many cores. A method for speedup estimation that helps the developer make early yet robust design choices would therefore be of much benefit.

Conventionally, a design space exploration (DSE) is mostly performed on a virtual prototype (i.e., simulation-based) or an FPGA-based prototype [8, 19]. Both need at least an accurate model or an implementation of the NMA, which already requires a substantial development effort if the DSE is expected to yield conclusive results. To avoid this effort, we envision an orthogonal approach that can be applied to both virtual and FPGA-based prototyping. We therefore

- propose X-CEL, an agile, measurement-based method to estimate the speedup potential of near-memory accelerators in a tile-based MPSoC,
- showcase X-CEL with a case study of three graph copy mechanisms (two of them near-memory),
- and provide an in-depth evaluation of this case study.

This agile development method builds on actual measurements of an intermediate, easy-to-integrate near-memory core implementation. With the intermediate stage, we achieve a better estimation of the target design (the near-memory accelerator), because the near-memory dimension (i.e., location) has been decoupled from the accelerator dimension (i.e., type of function implementation).

The rest of the paper is organized as follows: Sect. 2 describes the related work. In Sect. 3, we present X-CEL, followed by the case study in Sect. 4. Section 4 is divided into a description of the showcase scenario (Sect. 4.1 and 4.2) and how we apply X-CEL to it (Sect. 4.3). We further perform an in-depth analysis of the evaluation results in Sect. 5, before concluding in Sect. 6.

Our work is closely related to design space exploration (DSE) of heterogeneous systems. As Sangiovanni-Vincentelli et al. advocate in their platform-based design method [23], we place our approach early in the design phase. During a DSE run, the performance of each considered design point needs to be evaluated. Conventionally, this follows either a simulation-based or an analytical method, as defined by Pimentel [19]. When a custom hardware unit is part of the system, both of these methods need a model of that unit to be developed beforehand. Reagen et al. [21] and Altaf et al. [2] demonstrate this approach for the simulation-based and analytical methods, respectively. There is also the measurement-based evaluation method, but Pimentel associates it with a prohibitively high development overhead, because instead of a (simplified) model of the custom hardware, the evaluation now needs a full prototype. Our approach, however, is orthogonal to conventional DSE and allows us to bypass the need to develop a model or prototype beforehand. As we target near-memory acceleration (NMA), we extrapolate its performance by leveraging measurements of an easy-to-integrate, software-programmable near-memory core, without the need for the actual accelerator.
Recently, there has been much interest in NMAs for numerical applications [16, 25], graph processing [11], and system software [27]. For dealing with object graphs, Maas et al. presented an accelerator (albeit not an NMA) to speed up tracing garbage collection [13]. Rheindt et al. specifically targeted the problem of copying object graphs with an NMA [22], which is also the focus of our paper. There are also sophisticated software-only approaches that efficiently copy object graphs without costly (de)serialization: Mohr et al. [14] presented Pegasus, which targets embedded MPSoCs, while Skyway by Nguyen et al. [17] optimizes object graph transfers over networks.

X-CEL is a measurement-based method to estimate and analyze the speedup potential of near-memory accelerators in tile-based manycore architectures. To conquer the complexity of this endeavor, we propose an agile two-stage approach, which separates the near-memory dimension from the hardware accelerator dimension. This decoupling of the two effects allows us to make a better estimation.

Our method categorizes the activity of a parallel application scenario running on an MPSoC into three parts:

1. the task of interest (TOI), which would benefit from near-memory computing and which is often memory-intensive,
2. all remaining tasks of the application, and
3. idle time of the cores.

Figure 2 illustrates this for a parallel application running on N_cpu cores. As depicted and defined in Fig. 2, t_toi and t_other are the times accumulated over all application cores executing the task of interest and the remaining tasks, respectively. The TOI can either be given as a design choice to be explored/analyzed, or it can be determined through application profiling, e.g., last-level cache misses indicate which task(s) cause the most DRAM accesses. The idle times arise from sequential parts of the application, limited parallelism, data dependencies, as well as inter-thread communication and synchronization overhead. If there are several different tasks of interest, X-CEL can also be applied to them individually to analyze the speedup potential of each. In the following, we assume one task of interest which is executed multiple times throughout the application.

The tile-based manycore architecture we consider (an example is depicted in Fig. 4) contains a main memory, a two-level cache hierarchy, many cores, and potentially a software-programmable near-memory core (NMCore) or a dedicated near-memory hardware accelerator (NMA). Thus, we can differentiate between three implementation variants:

1. baseline (far from memory, without accelerator): the task of interest (TOI) and all other tasks are executed in parallel on the far-from-memory cores,
2. NMCore (near-memory, but without accelerator): the task of interest is executed near memory on the near-memory core, while all other tasks remain on the distributed cores, and
3. NMA (near-memory and accelerated): similar to NMCore, but the task of interest is offloaded to the near-memory hardware accelerator.

Beginning with the existing baseline variant, X-CEL introduces and leverages an agile development step via the NMCore variant. The near-memory core serves well as an intermediate step in the two-stage estimation since it requires negligible development effort compared to the near-memory accelerator: the existing software algorithm of the TOI just needs to be executed on an additionally instantiated core. This offloading needs to be properly synchronized with the rest of the system.
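To make the offloading step concrete, the following is a minimal, hypothetical sketch of such a scheme: application cores enqueue TOI jobs for a single near-memory worker and wait on completion events. The names (NearMemoryOffload, graph_copy, etc.) are illustrative only; on the actual prototype, offloading and synchronization are handled by the runtime system and hardware support, not by Python threads.

```python
import threading
import queue

class NearMemoryOffload:
    """Toy model of offloading a task of interest (TOI) to one near-memory core."""

    def __init__(self, toi_func):
        self._toi_func = toi_func                 # existing software algorithm of the TOI
        self._jobs = queue.Queue()                # requests coming from the application cores
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()                      # the single "near-memory core"

    def _run(self):
        while True:
            args, done = self._jobs.get()
            self._toi_func(*args)                 # execute the TOI near memory
            done.set()                            # synchronize with the requesting core

    def submit(self, *args):
        """Called by an application core; returns an event to wait on."""
        done = threading.Event()
        self._jobs.put((args, done))
        return done

# Usage sketch: an application core offloads one graph-copy job and blocks until it finishes.
# offload = NearMemoryOffload(graph_copy)        # graph_copy stands for the existing TOI code
# offload.submit(src_graph, dst_partition).wait()
```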
As depicted in Fig. 1, X-CEL decouples the near-memory dimension from the hardware acceleration dimension. In contrast, an estimation of the NMA based directly on the baseline measurements would incorporate a change of both dimensions at the same time. This would be a difficult endeavor in such a complex system with many superposed effects of MPSoCs and parallel programming. A refined estimation based on the NMCore variant is therefore more promising, because the near-memory dimension is fixed: the NMCore and the NMA occupy the same location in the architecture.

Our proposed X-CEL method thus follows the steps depicted in Fig. 3:

Step 1. Identify the task of interest (TOI) of the application scenario that could benefit from near-memory acceleration. In case of more than one TOI, apply X-CEL to them either individually or in a combined manner.

Step 2.1. Execute the baseline variant and measure the accumulated CPU time of all application cores spent in the task of interest, t_toi^base, and in all other parts of the program, t_other^base, as well as the overall runtime t_app^base.

Step 2.2. Determine a first speedup estimate S_est^1 of the NMA variant using the baseline measurements. Given that only the TOI is accelerated, while the rest of the application remains untouched, this yields an upper bound estimate.

Step 3. If S_est^1 ≈ 1, the TOI accounts for a negligible fraction of the total execution time. There is thus no speedup potential through near-memory computing, and the baseline variant can be used. If, however, S_est^1 > 1 + ε_sat, where ε_sat expresses a user-defined satisfying margin, we consider it worthwhile to speed up the TOI with near-memory computing. However, the confidence in this first-stage estimate S_est^1 is not very high, as the estimation for the near-memory accelerator is based on the baseline variant, which is neither near-memory integrated nor accelerated. Therefore, we refine the estimation in the next steps.

Step 4.1. Integrate the near-memory core (NMCore) variant.

Step 4.2. Execute this variant and measure the respective times of the different tasks, t_toi^nmc and t_other^nmc, as well as the overall runtime t_app^nmc.

Step 4.3. Determine the actual speedup S_act^nmc of the NMCore variant over the baseline from these measurements.

Step 4.4. Analyze whether the near-memory core becomes a bottleneck by monitoring its utilization. If t_toi^nmc ≈ t_app^nmc, meaning the NMCore is busy during almost the whole execution time of the application, the use of a second near-memory core might be an option. However, as is commonly known, interleaved accesses of several cores to the same DRAM memory bank can even degrade performance due to row conflicts. We experienced this behavior and hence employ only one near-memory core.

Step 4.5. Based on the NMCore measurements, refine the speedup estimate S_est^2 for the NMA relative to the baseline variant. As the NMCore and the NMA are located in the same position in the architecture, this second-stage estimate is invariant to the near-memory dimension. It therefore promises a higher confidence.

Step 5. Compare the actual speedup achieved by the NMCore variant, S_act^nmc (Step 4.3), with S_est^2 (Step 4.5), the refined estimate of the NMA speedup potential. Both are relative to the baseline variant and thus directly comparable. If S_est^2 ≈ S_act^nmc, there is no remaining speedup potential for the hardware accelerator and the near-memory core is sufficient. If S_est^2 > S_act^nmc + ε_rem, where ε_rem expresses a big enough remaining speedup margin, the near-memory accelerator should be considered. However, the development effort and the required hardware resources of the NMA should not be neglected in this decision.
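The decision logic of Steps 3 through 5 can be condensed into a short sketch. This is an illustrative summary only: the concrete formulas for S_est^1 and S_est^2 are not reproduced here, so the estimates and the measured NMCore speedup are assumed to be supplied as inputs, and the helper name, the margins eps_sat and eps_rem, and the 0.9 utilization cutoff are hypothetical.

```python
def xcel_decision(s1_est, s2_est=None, s_nmc_act=None,
                  t_toi_nmc=None, t_app_nmc=None,
                  eps_sat=0.1, eps_rem=0.1):
    """Summarize the X-CEL decision flow (Steps 3 to 5) for one application scenario.

    s1_est:    first-stage estimate from the baseline measurements (Step 2.2)
    s2_est:    refined estimate from the NMCore measurements (Step 4.5)
    s_nmc_act: measured speedup of the NMCore variant over the baseline (Step 4.3)
    """
    # Step 3: is near-memory computing worthwhile at all?
    if s1_est <= 1.0 + eps_sat:
        return "keep baseline: TOI takes a negligible fraction of the runtime"

    if s2_est is None or s_nmc_act is None:
        return "integrate and measure the NMCore variant (Steps 4.1-4.3)"

    # Step 4.4: check whether the single near-memory core saturates.
    if t_toi_nmc is not None and t_app_nmc is not None and t_toi_nmc / t_app_nmc > 0.9:
        print("warning: NMCore is busy for most of the runtime and may be a bottleneck")

    # Step 5: is a dedicated hardware accelerator still worth the effort?
    if s2_est > s_nmc_act + eps_rem:
        return "consider developing the NMA (weigh effort and hardware resources)"
    return "NMCore is sufficient: no significant remaining speedup for an NMA"


# Example with the MST-15 numbers from the case study (Sect. 4.3):
# print(xcel_decision(s1_est=2.20, s2_est=1.82, s_nmc_act=1.60,
#                     t_toi_nmc=3.09, t_app_nmc=8.94))
```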
Step 6. Develop and implement the near-memory accelerator.

Step 7. Finally, measure the NMA variant and analyze how closely both estimates S_est^1 and S_est^2 approach the measured NMA speedup.

This section presents a case study of X-CEL applied to near-memory graph copy. We first motivate the choice of near-memory graph copy as a showcase scenario (Sect. 4.1) and describe the prototype and benchmark setup of our case study (Sect. 4.2), before applying X-CEL to it (Sect. 4.3).

As mentioned in Sect. 1, data-to-task locality and the reduction of data movement are especially challenging on tile-based manycore architectures. Although parallel applications and operating systems help to exploit the increased scalability, they often impose significant overhead for inter-tile communication, data transport, and thread synchronization. Common communication patterns of parallel applications, libraries, and operating systems require the transfer of arbitrary data to remote tiles and its subsequent processing there. As tile-based architectures often omit hardware support for inter-tile cache coherence and consistency [4, 5, 12], inter-tile communication (data transfer and thread synchronization) has to be handled explicitly via, e.g., message passing (e.g., MPI [15]) or partitioned global address space (PGAS) programming (e.g., X10 [24] or Chapel [7]). These models have in common that they require data transfers between the memory partitions associated with each processor. These architectures therefore normally provide direct memory access (DMA) engines to support efficient data transfers. However, if object-oriented programming (e.g., Java, X10, Chapel) is used, the data to be copied will be object graphs consisting of objects pointing to each other. These pointered data structures cannot be copied directly by a DMA engine, since all copied pointers would become invalid. Since efficient object graph transfer is crucial for the performance of object-oriented applications on such architectures, many approaches optimize the transfer or handling of object graphs [13, 14, 17, 22]. As one of them (Pegasus [14]) uses neither near-memory integration nor hardware acceleration, it serves well as the baseline implementation in the case study. Another state-of-the-art implementation of the same mechanism (NEMESYS [22]), on the other hand, leverages full near-memory acceleration. Both approaches target an MPSoC architecture as well.

We use a tile-based manycore architecture synthesized on a multi-FPGA system consisting of four Xilinx Virtex-7 2000T FPGAs [20]. The 4 × 4 tile MPSoC prototype design consists of up to 15 compute tiles and one memory tile, which is located at grid position (1,1). Figure 4 depicts the top-left-most 2 × 2 part of the whole design. Each compute tile contains 4 cores (Gaisler SPARC V8 LEON 3 [6, 26] processors) with private L1 caches. They are configured in write-through mode and kept intra-tile coherent by a classical bus-snooping coherence scheme. The LEON 3 cores further use branch prediction and a floating-point unit. Each compute tile is additionally equipped with an L2 cache, which caches accesses to the remote main memory, and a tile-local memory (TLM), which holds the program text, OS data, and temporary user data. The memory tile is connected to the off-chip DDR-3 main memory and also contains the near-memory core (a LEON 3 core with L1 cache) or the accelerator (NMA), if present.
A network adapter (NA) connects the tiles to the NoC routers and carries out the remote load-store operations received from the L2 cache back-end. Besides that, the NA can forward remote task invocations and trigger commands to the NMA. Table 1 gives an overview of the core, cache, accelerator, and memory configuration parameters. A distributed operating system [18], which is able to exploit the described hardware features, runs on the FPGA prototype. We use the X10 IMSuite benchmarks [9], a collection of distributed parallel kernels using the PGAS model, in the same configuration as [22].

To demonstrate X-CEL, we now apply it to the above-mentioned graph copy problem on this tile-based manycore architecture. In this section, we pick one (MinimumSpanningTree, MST) of the twelve IMSuite benchmarks and run it on 15 compute tiles (MST-15) to showcase the different steps of the framework. For a complete study of all twelve benchmarks and different numbers of compute tiles, refer to the evaluation in Sect. 5.

In Step 1 of the framework, we identify the memory-intensive graph copy operation as the task of interest (TOI). This task is part of the inter-tile communication routine of the runtime system and therefore occurs during the execution of any kind of parallel application on our system. As outlined in Sect. 3, our goal is to decide whether to perform this graph copy operation on a core in the receiving compute tile, on the near-memory core, or on a near-memory accelerator (see Fig. 4).

In every variant, the sending processor first needs to ensure that the latest version of the object graph G is in main memory. Since our architecture does not provide inter-tile cache coherence, the processor traverses G and explicitly writes back all necessary cache lines. After the write-back on the sender side and the invalidation on the destination side, both the receiving processor and any near-memory processing elements have a consistent view of G, and the copying operation can begin.

In the baseline variant, the receiving processor itself does the graph copying [14]. Here, the complete object graph needs to be cloned remotely via the cache hierarchy and the NoC from the source memory partition S to the processor and back to the destination memory partition D. The operation is indicated in Fig. 4 with the beige arrow. This limits performance and pollutes the receiver's caches with the source graph. On the other hand, this approach requires no additional hardware, and the newly copied data is available in the receiver's cache right away.

Steps 2.1-2.2. We execute the baseline variant and measure t_toi^base, t_other^base, and the overall runtime t_app^base. Note that t_toi and t_other are accumulated times over all cores, as defined in Fig. 2, while t_app is not. From these measurements, we determine the first-stage estimate S_est^1.

Step 3. As the speedup potential S_est^1 = 2.20× > 1 + ε_sat is satisfyingly large, we go on to analyze the near-memory core variant.

Step 4.1. We implement the NMCore variant, where the memory-intensive graph copy is outsourced to the near-memory core. The near-memory core performs the same software graph copy algorithm as the baseline variant. A negligible effort is required to integrate the near-memory core into the system, schedule the existing graph copy software algorithm on it, and maintain consistency with it. The near-memory core is assisted by a state-of-the-art DMA engine for copying larger amounts of consecutive, non-pointered data, if present. Figure 4 shows the system architecture including the near-memory core in green.
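To illustrate the kind of work the graph copy TOI performs, the following is a minimal sketch of a generic object graph clone with pointer fixing, which is the reason a plain DMA transfer is insufficient. It is not the Pegasus or NEMESYS algorithm; the object layout, the reference detection, and the allocation into the destination partition are simplified assumptions.

```python
def copy_object_graph(root, allocate_in_destination):
    """Clone a pointered object graph, remapping all references to the new copies.

    `root` is the entry object; `allocate_in_destination` stands in for allocating
    a shallow copy of an object in the destination memory partition.
    """
    mapping = {}                                   # id(original object) -> its copy
    worklist = [root]

    # First pass: visit every reachable object and allocate its copy.
    while worklist:
        obj = worklist.pop()
        if id(obj) in mapping:
            continue
        mapping[id(obj)] = allocate_in_destination(obj)
        for value in vars(obj).values():
            if hasattr(value, "__dict__"):         # treat attribute objects as references
                worklist.append(value)

    # Second pass: rewrite the reference fields of every copy ("pointer fixing").
    for clone in mapping.values():
        for name, value in vars(clone).items():
            if hasattr(value, "__dict__") and id(value) in mapping:
                setattr(clone, name, mapping[id(value)])

    return mapping[id(root)]

# Usage sketch: on the prototype the copies would be placed in the destination
# partition; here a plain shallow copy stands in for that allocation.
# import copy
# cloned_root = copy_object_graph(src_root, copy.copy)
```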
Steps 4.2-4.4. The actual speedup of this variant was measured as S_act^nmc = 1.60×. However, since the NMCore is only utilized during roughly one third (t_toi^nmc = 3.09 s) of the total application runtime (t_app^nmc = 8.94 s), it is far from becoming the bottleneck.

Step 4.5. Based on the measurement results of Step 4.2, we can now make a better estimation of the NMA variant. According to the numbers above, the speedup estimate for the NMA variant compared to the baseline is calculated as S_est^2 = 1.82×.

Step 5. As S_est^2 = 1.82× > 1.60× = S_act^nmc, we still see potential to achieve a higher speedup by using the near-memory accelerator. However, before this decision is made, all benchmarks and application scenarios should be evaluated, which is done in Sect. 5. The development effort and the required hardware resources (compared to the NMCore) should also be considered in this decision.

Step 6. We develop a graph copy NMA as proposed by Rheindt et al. [22]. This implementation uses a near-memory accelerator that provides the same graph copy functionality as the processor core, using a slightly different algorithm suited to a hardware module [22]. The NMA is indicated in purple in Fig. 4. It speeds up the copy operation itself and leaves the processors free for other tasks. However, it requires a tremendous development effort, as well as additional hardware resources of approximately the size of one core. Furthermore, the functionality of the NMA is limited to the graph copy task.

Step 7. The execution and measurement of the NMA variant yielded the final results: the actual measured speedup of the NMA, S_act^NMA = 1.86×, is very close to and even slightly larger than the estimate S_est^2 = 1.82×. Under the assumption that t_other is not affected by the NMA implementation, S_est^2 was defined as an upper bound. However, t_other decreased to t_other^NMA = 21.38 s compared to the baseline implementation's t_other^base = 27.16 s, which helps to explain the additional improvement over the estimate.

This section presents the full case study and in-depth analysis for all twelve IMSuite benchmarks and a varying number of compute tiles between one and 15. We examine the performance predictions of X-CEL in more detail. To this end, we use all benchmarks from the X10 IMSuite and run each of them on differently sized systems (1, 2, 3, 4, 8, 12, and 15 compute tiles). We then compare the two stages of performance predictions made by X-CEL with the actual performance achieved by the NMA.

Figure 5 shows the speedups achieved by the NMA in each benchmark with varying system size, relative to the baseline variant of the same system size. The solid line shows the actual speedups, whereas the two dashed lines depict S_est^1 and S_est^2, respectively. For the systems with 3 and 15 compute tiles, we also show the run-times of each variant (Baseline, NMCore, and NMA) in Fig. 6. The dashed lines in these charts represent the run-times predicted by S_est^1 and S_est^2.

The validity of X-CEL rests on two conditions: first, that S_est^1 gives an indication of whether near-memory computing could accelerate the given program at all, and second, that S_est^2 gives an accurate prediction of the run-time achieved by an NMA. We will now examine these two conditions in turn.
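Before discussing the results, the deviation statistics used below (the root mean square error of an estimate and its percentage deviation from the actual speedup) can be computed as in the following sketch. The data layout, parallel lists of estimated and measured speedups per configuration, is a hypothetical illustration.

```python
import math

def estimate_quality(estimated, actual):
    """Return the RMSE and the per-configuration percentage deviations
    of a speedup estimate with respect to the measured speedups."""
    assert len(estimated) == len(actual)
    rmse = math.sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(actual))
    # Positive deviation = over-estimate, negative = under-estimate.
    deviations_pct = [100.0 * (e - a) / a for e, a in zip(estimated, actual)]
    return rmse, deviations_pct

# Example with the MST-15 data point from Sect. 4.3:
# rmse, devs = estimate_quality([1.82], [1.86])   # S2_est vs. measured NMA speedup
# devs[0] is about -2.2 %, i.e. a slight under-estimate.
```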
We first observe that S_est^1 usually gives an upper bound on the achievable speedup. That is to say, if S_est^1 is close to 1, the application will certainly not benefit from near-memory computing. S_est^1 under-estimates the speedup only in the DR and DS benchmarks. A closer analysis of the graph copy tasks performed reveals a difference to the other benchmarks: DR and DS have many very small graph copy tasks to perform (e.g., DS transfers a single object of 24 bytes 17 856 times [22]). Thus, the offloading and synchronization overheads play a larger role, which our model does not handle as well. Still, we see that S_est^1 fulfills its function well in most cases.

When examining S_est^2, we observe that it approximates the actual speedup well, with a root mean square error of 0.23. Out of the 84 configurations we evaluated (12 benchmarks × 7 system sizes), S_est^2 deviated by less than 5 % from the actual speedup in 60 configurations. The other 24 configurations warrant a closer analysis, because estimates that are too low have a different impact than estimates that are too high: X-CEL uses S_est^2 as an indication of whether to develop a dedicated hardware accelerator (see Step 5 in Sect. 3). If S_est^2 turns out to under-estimate the NMA's speedup, this is hardly a problem, because the NMA performs better than expected. On the other hand, if S_est^2 over-estimates the speedup, the effort spent developing the NMA may have been wasted. Out of the 24 configurations where S_est^2 deviates by more than 5 %, it under-estimates the speedup in 14 cases and over-estimates it in 10. The under-estimates are relatively large in places (up to 32.9 % for DR on 15 compute tiles), but as we have explained, this is not problematic. The over-estimates, on the other hand, are at most 10.2 % (HS on 8 compute tiles), and only 5 of the 10 over-estimates are larger than 6 %. Considering that X-CEL does not need any information about the actual algorithm, the estimates it provides are quite accurate in most cases. Moreover, if they deviate from the speedup achievable by the NMA, they usually err on the safe side from the developer's point of view.

We presented X-CEL, a measurement-based method to estimate the potential of near-memory acceleration. It helps to perform an early yet robust estimation of whether the development effort of a near-memory accelerator is worthwhile. The two-stage method is based on measurements of an easy-to-integrate near-memory core variant (near-memory, but no accelerator), which is closer to the target design than the existing baseline implementation (neither near-memory nor hardware-accelerated). We showcased X-CEL with a (near-memory) graph copy problem in a tile-based MPSoC using a set of distributed graph algorithm kernels. An in-depth analysis revealed that the second-stage estimate is within 5 % of the actual speedup in 70 % of the configurations. Moreover, it has 36 % higher accuracy than the original estimate. Future work could refine the estimation model, as well as extend the framework to more case studies. All in all, we envision X-CEL becoming an x-cel-lent tool in the hands of developers to make sophisticated predictions of the near-memory acceleration potential and thereby avoid unnecessary development effort.
References (titles as extracted; entry 6 was lost):
1. A scalable processing-in-memory accelerator for parallel graph processing
2. LogCA: a high-level performance model for hardware accelerators
3. Power aware heterogeneous MPSoC with dynamic task scheduling and increased data locality for multiple applications
4. Runnemede: an architecture for ubiquitous high-performance computing
5. DeNovo: rethinking the memory hierarchy for disciplined parallelism
7. Chapel language specification
8. Methods for evaluating and covering the design space during early design development. Integr.
9. IMSuite: a benchmark suite for simulating distributed algorithms
10. Memory intensive computing, the 3rd wall, and the need for innovation in architecture
11. GraphIA: an in-situ accelerator for large-scale graph processing
12. Formic: cost-efficient and scalable prototyping of manycore architectures
13. A hardware accelerator for tracing garbage collection
14. Pegasus: efficient data transfers for PGAS languages on non-cache-coherent many-cores
15. MPI Forum: MPI: a message passing interface standard
16. Rapid in-memory matrix multiplication using associative processor
17. Skyway: connecting managed heaps in distributed big data systems
18. Proceedings of the International Workshop on Systems for Future Multi-Core Architectures (SFMA)
19. Exploring exploration: a tutorial introduction to embedded systems design space exploration
20. FPGA module xc7v2000t
21. Quantifying acceleration: power/performance trade-offs of application kernels in hardware
22. NEMESYS: near-memory graph copy enhanced system-software
23. Platform-based design and software design methodology for embedded systems
24. X10 language specification
25. A scalable near-memory architecture for training deep neural networks on large in-memory datasets
26. The SPARC Architecture Manual, Version 8, sav080si9308 edn.
27. Exploring specialized near-memory processing for data intensive operations