key: cord-0058696-btscbauw authors: Bougherara, Maamar; Nedjah, Nadia; Bennouar, Djamel; Kemcha, Rebiha; de Macedo Mourelle, Luiza title: Application Mapping onto 3D NoCs Using Differential Evolution date: 2020-08-20 journal: Computational Science and Its Applications - ICCSA 2020 DOI: 10.1007/978-3-030-58808-3_8 sha: 3b7b16f7264d2416e305cfaabf2b996c4fdb2121 doc_id: 58696 cord_uid: btscbauw Three-dimensional networks-on-chip appear as a new on-chip communication solution in many-core based systems. An application is implemented by a set of collaborative intellectual property blocks. The mapping of the pre-selected sets of these blocks on three-dimensional networks-on-chip is a NP-complete problem. In this work, we use Differential Evolution to deal with the blocks mapping problem in order to implement efficiently a given application on a three-dimensional network-on-chip. In this sense, Differential Evolution is extended to multi-objective optimization in order to minimize hardware area, execution time and power consumption of the final implementation . A critical problem in the design of Multi-Processor Systems-on-Chip (MPSoCs) is the on-chip communication, where a Network-on-Chip (NoC) [15] can offer a scalable infrastructure to accelerate the design process. A MPSoC is designed to run a specific application, based on Intellectual Property (IP) blocks. A NoC consists of a set of resources (R) and switches (S), forming a tile [1] . Each resource of the NoC is an IP block, such as processor, memory, Digital Signal Processor (DSP), connected to one switch. Each switch of the NoC implements routing and arbitration, connected by links. The way switches are connected defines the topology, such as the two-dimensional (2D) mesh topology [16] . However, the 2D NoC fails to meet the requirements of SoCs design in performance and area. Three-dimensional (3D) NoCs [2] have proved to be an effective solution to the problem of interconnection complexity in large scale SoCs by using integrated circuit stacking technology. A 3D mesh is implemented by stacking several layers of 2D mesh on top of each other and providing vertical links for interlayer communication, called Through Silicon Vias (TSV) [2] . Each switch is connected to up to six other neighbouring switches through channels, in the same way as 2D mesh does. Figure 1 shows the architecture of a 3D mesh-based NoC. Different optimization criteria can be pursued depending on the detailed information of the application and IP blocks. The application is viewed as a graph of tasks called Task Graph (TG) [4] . The features of an IP block can be determined from its companion library [3] . The objectives involved in IP task assignment and IP mapping are multiple and have to be optimized simultaneously. Some of these objectives are conflicting because of their nature. So, IP assignment and IP mapping are classified as NP-hard problems [4] . Therefore, it is mandatory to use a multi-objetive optimization strategy, such as Multi-Objective Evolutionary Algorithms (MOEAs), with specific objective functions. We use Differential Evolution (DE) as the MOEA [9] , modified to suit the specificities of the assignment and mapping problems in a NoC with mesh topology, and also to guarantee the NoC designer's constraints. In previous work [12] , we applied this strategy to the assignment problem in NoCs. In this paper, we describe the use of DE to the subsequent problem of mapping onto a 3D mesh-based NoC. In Sect. 2, we present some related works, where a multi-objective strategy is applied in order to optimize some aspects of the design. In Sect. 3, we introduce the problems of IP assignment and mapping, concerning a SoC design over a NoC platform. In Sect. 4, we concentrate our attention on IPs mapping using DE for multi-objective optimization. In Sect. 5, we describe the objective functions for area, power consumption and execution time. In Sect. 6, we show some performance results, based on the E3S benchmarks suite. In Sect. 7, we draw some conclusions and future work, based on our experiments. In some works, the assignment and mapping steps are treated as a single NPhard problem. The multi-objective nature of these steps is also not taken into account, addressing the problem as a single objective. Since we choose to treat the problem using a multi-objective optimization process, we present here some works that followed the same strategy when dealing with the mapping step. In [5] , the mapping step is treated as a two conflicting objective optimization problem, attempting to minimize the average number of hops and achieve a thermal balance. Every time data cross a switch, before reaching its target, the number of hops is incremented. To deal with this process, they used the multiobjective evolutionary algorithm NSGA. The problem of mapping IPs/cores onto a mesh-based NoC is addressed by [6] in two systematic steps using NSGA-II. The key problem was to obtain a solution that minimizes energy consumption, considering both computation and communication activities, and also minimizes the link bandwidth requirements. SPEA-II and NSGA-II are used in [8] for mapping, with some changes in crossover and mutation operators. Energy consumption and thermal balance were the main optimization objectives. In [17] , task mapping on a 3D mesh-based NoC is implemented using fuzzy logic in order to minimize thermal and power consumption. In [18] , the authors propose a multi-objective rank-based genetic algorithm for 3D mesh-based NoCs. Two different models are used for packet latency: under no congestion and with congestion situations. In [19] , a multi-objective immune algorithm is used, where latency and power consumption are considered as the objective functions, constrained by the heating function. In [20] , a centralized 3D mapping (C3Map) is proposed using a new octahedral traversal technology. Combining the C3Map and attractive/repulsive particle swarm optimization, they attempted to reduce energy and latency. The NoC design methodology for SoCs encourages the reuse of components to reduce costs and to reduce the time-to-market of new designs. The designer faces two main problems: selecting the adequate set of IPs (assignment step) and finding the best physical mapping of these IPs (mapping step) onto the NoC infrastructure. The objective of IP assignment [4, 10] is to select, from an IP library (IP repository), a set of IPs, exploiting re-usability and optimizing the implementation of a given application in terms of time, power and area requirements. During this step, no information about physical location of IPs onto the NoC is given. The optimization process must be done based on the Task Graph (TG) and IP features only. Each one of the nodes in the TG is associated with a task type, which corresponds to a processor instruction or a set of instructions. If a given processor is able to execute a given type of instruction, that processor is a candidate to be mapped onto a resource in the NoC structure and will be responsible for the execution of one or more tasks of the TG. The result of this step is a set of IPs that should maximize the NoC performance, i.e. minimize power consumption, hardware resources as well as the total execution time of the application. An Application Characterization Graph (ACG) is generated, based on the application's task graph, wherein each task has an IP associated with it. We structured the used application repository, based on the E3S benchmark suite [21] , using XML, both for the TG and the IP repository. Figure 2 (a) shows the XML representation of a simple TG of ES3 and Fig. 2 (b) shows a simplified XML representation of an IP repository. In previous work [12] , we used DE during the assignment step in order to optimize area required, execution time and power consumption. Given an application, the problem that we are concerned with here is to determine how to topologically map the selected IPs onto the network structure, such that the objectives of interest are optimized [4] . At this stage, a more accurate evaluation can be done taking into account the distance between resources and the number of switches and channels crossed by a data package during a communication session. The result of this process should be an optimal allocation of the prescribed IP assignment to execute the application on the NoC. Figure 3 shows the assignment and the mapping steps. for Multi-objective Optimization DE [11] is a simple and efficient Evolutionary Algorithm (EA). It was, initially, used to solve single-objective optimization problems [9] . DE is a populationbased global optimization algorithm, starting with a population of NP individuals, of dimension D. Each individual encodes a candidate solution, i.e where G denotes the generation to which the population belongs [12] . The initial population is generated randomly from the entire search space. The main operations of the DE algorithm are: mutation, crossover and selection. This operation changes the population with the mutant vector V i,G for each individual X i,G in the population at generation G. The mutation operation can be generated using a specific strategy. In this work, three strategies are used: Rand (Eq. 1); Best (Eq. 2); Current-to-Best (Eq. 3): where V i,G is the mutant vector to be produced; r 1 , r 2 , r 3 are integer constants generated randomly in the range of [1, NP ] , at each iteration; X best,G is the best individual at generation G; F is a scaling factor, which is a real constant usually chosen in the range of [0, 1], controlling the amplification of the difference variation. This operation improves the diversity of the population, being applied after the mutation operation. The crossover operation uses the mutation of the mutant vector V i,G to exchange its components with the target vector X i,G , in order to form the trial vector U i,G . The crossover operation is defined by Eq. 4: where j = 1, 2, ..., D; rand j is the jth evaluation of an uniform random number generator within [0, 1] [11] ; the crossover rate CR is an user-specified constant within the range [0, 1]; j rand is a randomly chosen integer within the range [1, D] [9] . In order to keep the population size constant over subsequent generations, the selection operation is performed. The trial vector is evaluated according to the objective function and compared with its corresponding target vector in the current generation. If the trial vector is better than the target one, the trial vector will replace the target one, otherwise the target vector will remain in the population. The selection operation is represented by Eq. 5: The mutation, crossover and selection operations are applied for each generation until a termination criterion. In order to extend the DE algorithm to solve multi-objective optimization problems, we should use the Pareto concept to deal with multiple objectives in order to select the best solution. If the newly generated trial vector dominates the parent vector, then the trial vector will replace the parent one. If the parent dominates the trial, the trial vector will be discarded. Otherwise, when the trial and the parent vectors are not related to each other, the trial vector will be added to the current population for later sorting. Algorithm 1 shows the main steps of the modified DE Multi-Objective (DEMO) algorithm. In this work, we adopted a multi-objective optimization strategy in order to minimize three parameters: area, power consumption and execution time. Here, we describe how to compute each of these parameters in terms of the characteristics of the application and those of the NoC. In order to compute the area required by a given mapping, it is necessary to know the area needed for the selected IPs and that required by the used links and switches. The total number of links and switches can be obtained by taking into account all the communication paths between the exploited tiles. Each communication path between the tiles is stored in the routing table. We adopted an XYZ fixed routing strategy, in which data coming from tile i are sent first to the West or East of the current switch side depending on the target tile position, say j, with respect to i in the 3D mesh NoC, until it reaches the column of tile j. Then, it is sent to the South or North side, depending on the position of tile j with respect to tile i. Finally, it is sent to the Top or Bottom side until it reaches the target tile. The number of links in the described route represents the distance between tiles i and j, corresponding to the Manhattan distance [13] as defined by Eq. 6: wherein (x i , y i , z i ) and (x j , y j , z j ) are the coordinates of tiles i and j, respectively. In order to compute efficiently the area required by all used links and switches, the ACG is associated to a routing table, in which the links and switches necessary interconnect tiles are described. The number of hops between tiles, along a given path, leads to the number of traversal switches. The area is, then, computed summing up the areas required by processors, switches and links involved. Equation 7 describes the computation involved to obtain the total area for the implementation of a given IP mapping M : (7) wherein function P roc(.) provides the set of distinct processors used in ACG M and area p is the required area for processor p; function Links(.) gives the number of distinct links used in ACG M ; A l is the area of any given link; and A s is the area of any given switch. The total power consumption of an application NoC-based implementation consists of the power consumption of the processors, while processing the computation performed by each IP, and that due to the data transportation between the tiles, as presented in Eq. 8: wherein P ower p (M ) and P ower c (M ) are the power consumption of processing and communication, respectively, as detailed in [13] . The power consumption due to processing is simply obtained summing up attribute taskPower of all nodes in the ACG and is as described in Eq. 9: The total power consumption of sending one bit of data from tile i to tile j can be calculated considering the number of switches and links each bit passes through on its way along the path. It can be estimated as shown in Eq. 10: wherein E S bit and E L bit represent the energy consumed by the switch and link tying the two neighboring tiles, respectively [22] . Function nLinks(.), defined by Eq. 6, provides the number of traversed links (and switches too) considering the routing strategy used in this work and described earlier in this section. The communication volume (V (d t,t )) is provided by the TG in terms of number of bits sent from the task t to task t passing through a direct arc d t,t . Let us assume that the tasks t and t have been mapped onto tiles i and j respectively. Equation 11 defines the total network communication power consumption for a given mapping M : wherin T argets t provides all tasks that have a direct dependency on data resulted from task t and T ile t yields the tile number into which task t is mapped. The execution time for a given mapping takes into account the execution time of each task, its schedule and the additional time due to data transportation through links and switches along the communication path. taskTime attribute in TG provides the execution time of each task. Each task of the TG needs to be scheduled into its own cycle. Therefore, we used the As-Soon-As-Possible (ASAP) scheduling strategy [10] , scheduling tasks in the earliest possible control step. The routing table allows us to count the number of links and switches required. Two scenarios can lead to the increase in the execution time of the application: (1) when a task in a tile needs to send data to parallel tasks in different tiles through the same initial link, data cannot be sent to both tiles at the same time; (2) when several tasks need to send data to a shared target task, one or more shared links would be needed simultaneously along the partially shared path, which means that the data from both tasks must be pipelined and will not arrive to the target task at the same time. The overall application execution time takes into account the execution time regarding the underlying task computation and communication for the applied mapping M . It is also necessary to take into account the delay concerning the two aforementioned situations, regarding task scheduling. Therefore, the overall application execution time can be modelled according to Eq. 12: wherein T ime p returns the time necessary to execute the tasks of the TG; T ime c the time spent due to communication among tasks; function F 1 computes the delay caused by the first scenario; function F 2 computes the delay caused the second scenario. Function F 1 computes the additional time due to parallel tasks that have data dependencies on tasks mapped in the same source tile and yet these share a common initial link in the communication path. Function F 2 computes the additional time due to the fact that parallel tasks producing data for the same target task need to use simultaneously at least a common link along the communication path. Equation 13 defines the time spent with communication between tasks t and t , based on [23] : wherein V (d t,t ) is the volume of bits transmitted from task t to task t . Equation 14 defines the time spent transferring a phit: wherein T l phit is the link transmission time and T p phit is the switch processing time used to transfer one phit between two neighboring tiles. A phit represents the physical unit given by the channel width and a flit represents the flow unit, which is a multiple of the phit. In order to evaluate the performance of the DEMO algorithm for the mapping step and compare it to that obtained using MOPSO (Multi-Objective Particle Swarm Optimization) algorithm [13] , we used the same benchmarks. These are provided by the E3S benchmarks suite, constituted of the characteristics of 17 embedded processors. These characteristics are based on execution times of 16 different tasks, power consumption based on data-sheets, area required based on die size, price and clock frequency. We use 5 random task graphs used in [13] , generated by Task Graph For Free (TGFF) [14] to perform experiments and evaluate the performance. We exploit a population size of 100, with F set to 0.5 and CR set to 0.9. These parameters were set based on simulations. The algorithm was run for 500 iterations. It is noteworthy to emphasize that the same objective functions are used in both works. Besides comparing the algorithms, two topologies are used in the test: 2D mesh with 5 × 5 and 3D mesh with 3 × 3 × 3. Also, we used As-Soon-As-Possible (ASAP) as the schedulling strategy. Figure 4 shows the power consumption for the mapping yield by the compared strategies, regarding best results. We can see that the three mutation variants adopted for DEMO offer better results than MOPSO. Among these three, Best shows the best performance . TG0 TG1 TG2 TG3 TG4 TG0 TG1 TG2 TG3 TG4 10 2.4 Also, Fig. 5 provides a comparison of the results regarding execution time, considering the best quality mapping. As can be seen, the performance of DEMO, based on Best mutation strategy, is better than those obtained by MOPSO and the other mutation strategies for DEMO. Finally, Fig. 6 shows the comparison of the required hardware area related to the mapping obtained by the compared strategies. Here, also, the performance TG0 TG1 TG2 TG3 TG4 TG0 TG1 TG2 TG3 TG4 of DEMO is better than that of MOPSO. Among the mutation strategies for DEMO, we can see that Best proves to be the best option . TG0 TG1 TG2 TG3 TG4 TG0 TG1 TG2 TG3 TG4 The problem of assigning and mapping IPs into a NoC platform is NP-hard. There are three objectives to be optimized: area required, execution time and power consumption. In this paper, we propose a Multi-Objective algorithm based on Differential Evolution (DEMO) to help NoC designers during the mapping step onto a 3D mesh-based NoC platform, based on pre-selected set of IPs. We use the same objective functions and the same TGs used in [13] , where we applied a Multi-Objective algorithm based on Particle Swarm Optimization (MOPSO), to evaluate the performance of the proposed algorithm. Besides this, the DEMO algorithm offers three strategies, related to variations of the mutation operation: Rand, Best, Current-to-Best. The strategy based on Best presents better performance than the other strategies. We also compare the results obtained by the DEMO algorithm to those obtained by MOPSO algorithm, where DEMO proves to be better than MOPSO. It is interesting to highlight that DEMO requires only two parameters to be set, while MOPSO requires three parameters. For future work, we plan to experiment with other strategies in the DEMO algorithm and also to use other scheduling algorithms, such as List Schedulling and As Late as Possible. Energy-aware mapping for tile-based NoC architectures under performance constraints Demystifying 3D ICs: The pros and cons of going vertical Key research problems in NoC design: a holistic perspective Evolutionary IP assignment for efficient NoC-based system design using multi-objective optimization Pareto based multiobjective mapping IP cores onto NoC architectures A multi-objective evolutionary algorithm based optimization model for network-on-chip synthesis Optimized algorithms for network-on-chip application mapping ANoC Developing domain-knowledge evolutionary algorithms for network-on-chip application mapping DEMO: differential evolution for multiobjective optimization IP assignment for efficient NoC-based system design using multi-objective particle swarm optimisation Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces IP assignment optimization for an efficient noc-based system using multi-objective differential evolution Efficient application mapping onto three-dimensional network-on-chips using multi-objective particle swarm optimization TGFF: task graphs for free Interconnection Networks -An Engineering Approach Thermal and power aware task mapping on 3d network on chip Latency-aware mapping for 3D NoC using rank-based multi-objective genetic algorithm An evolutive approach for designing thermal and performance-aware heterogeneous 3D-NoCs C3Map and ARPSO based mapping algorithms for energy-efficient regular 3-d noc architectures Embedded system synthesis benchmarks suite (E3S) Energy-aware mapping for tile-based NoC architectures under performance constraints Design space exploration comparing homogeneous and heterogeneous network-on-chip architectures