key: cord-0047298-ii3shws6 authors: Wang, Bo; Imtiaz, Aneek; Falk, Joachim; Glaß, Michael; Teich, Jürgen title: Exploration of Power Domain Partitioning with Concurrent Task Mapping and Scheduling for Application-Specific Multi-core SoCs date: 2020-06-12 journal: Architecture of Computing Systems - ARCS 2020 DOI: 10.1007/978-3-030-52794-5_12 sha: 3bedaa9b7d5ef72c3dadc6ec8053cde87fd81b28 doc_id: 47298 cord_uid: ii3shws6 This paper proposes a novel approach to explore the design space of Power Domain (PD) partitioning in the architecture definition phase of heterogeneous SoCs. By formulating an Integer Linear Program (ILP), task mapping and scheduling is determined concurrently while considering power-off dependencies among cores in the same PD and the power-gating break-even time. Compared to state-of-the-art approaches aiming at design phases where task mapping and scheduling has been frozen, our proposed approach shifts joint exploration into earlier design phases, creates more power-gating opportunities for PD partitions, and thus identifies better trade-offs in terms of energy consumption and design costs. Power gating is an effective technique to reduce static power consumption of System-on-Chips (SoCs), like 5G New Radio modems in which dozens of heterogeneous cores are often adopted to achieve Gbits/s uplink and downlink speed. An SoC is divided into multiple Power Domains (PDs), which can be switched off individually when all cores and Hardware (HW) IPs in the same PD are idle, a so-called common idle interval. Power-gating control is more flexible when finer-grained power domains are partitioned. However, this would indeed result in a huge design, verification, and layout effort, even increase area and degrade power consumption and timing closure [13] . On the other hand, due to This research work was funded by Intel Deutschland GmbH. parallelism among tasks, merging HW resources which are active simultaneously into the same power domain may reduce design complexity without sacrificing power efficiency. Some researchers have started investigating methodologies for exploration of PD partitioning to trade off energy consumption and the number of PDs. In [13] , PD partitioning is explored by using a Multi-Objective Evolutionary Algorithm (MOEA), but it aims at the design phases in which task mapping and scheduling has been accomplished already, and determines the idle intervals of each HW resource rather than optimizing them. During subsequent PD partitioning, common idle intervals are post-processed for each PD, as well as power-gating break-even times. Power gating is exploited only for common idle intervals longer than a break-even time. After that, energy consumption is evaluated for partition candidates. Finally, trade-off fronts are obtained by the MOEA in terms of energy consumption and the number of used power domains. However, this approach does not explore the influence of task mapping and scheduling. We illustrate the lost potential through a motivating example in the following. Figure 1 shows two periodic applications with the same period generated by TGFF [5] , as well as a HW architecture consisting of three fully connected heterogeneous processors. Power consumption of each processor is modeled by three power states [14] , i.e., P run , P idle , and P of f , where the state RU N denotes the resource actively executing a program task, IDLE denoting being powered on with no task in execution, and OF F denoting the power-gated mode. First, task mapping and scheduling is performed to minimize energy consumption, where only two states -RU N and IDLE -and transition energy between them are assumed for processors. After that, PD partitions are explored using the approach in [13] . The found trade-off fronts are presented in Fig. 2 . Take the trade-off front with 2 PDs as an example shown in Fig. 3(a) . Although r b is idle from 0 ms to 23 ms, P D 1 cannot be powered off because r a is still executing. If the task mapping and scheduling would consider the power-off dependency between r a and r b , it may re-allocate the tasks and align the execution in the same PD. Unfortunately, scheduling before PD partitioning does not have such knowledge and, thus, misses optimization potential. Based on this observation, we propose a methodology to explore power domain partitioning with concurrent task mapping and scheduling. For each candidate explored during PD partitioning, task mapping and scheduling is performed with additional constraints for the power domain dependency and powergating break-even time. As a result, more and longer common idle intervals in each PD may be created by properly mapping and aligning task execution on processors, as shown for v 03 in Fig. 3 . Power consumption is thus reduced due to longer power-gated state as shown in Fig. 3(b) and Fig. 2 . More important, system architects may even prefer the 2-PD option identified by our approach to reduce design cost if it already meets the power target. The proposed approach actually expands the exploration space of PD partitioning. State-of-the-art approaches for PD partitioning exploration consider power vs. design cost for heterogeneous multi-core SoCs where task mapping and scheduling has been already frozen at design time, e.g., assuming multiple subsystems are re-used and integrated in an SoC. This paper discusses the further optimizations applicable to SoCs when task mapping and scheduling can be combined with PD partitioning and jointly optimized. Our major contributions are summarized as follows: -Tasks are mapped and scheduled specifically for each PD partition candidate, concurrently with PD partitioning exploration by a Multi-Objective Evolutionary Algorithm (MOEA). This aligns task execution and creates more common idle intervals for power gating. -Task mapping and scheduling is formulated as an Integer Linear Programming (ILP), in particular integrating: 1) power-on/off dependencies introduced by PD partitioning among HW resources in the same PD; 2) constraints of powergating break-even time due to transition energy and latency overhead. -Experimental results show that our proposed joint exploration can identify much better trade-off fronts with significantly reduced design costs but without scarifying the power target. E.g., one experiment shows that the same power target can be achieved by 2 PDs, instead of 8 PDs when applying the approach in [13] . The aimed application domains of this work are time-critical or safety-critical [3] application specific embedded systems, such as wireless communications and electric vehicles [10] . There, most application tasks and use cases are known at design time, and static scheduling is also more favorable due to its determinism. Several research works exist on how to partition power domains at circuit level. In [2] , Finite State Machine with Datapath (FSMD) circuits are decomposed into loosely coupled domains which may be power or clock gated. But, the workload characteristics are not considered. In [1] , an approach leveraging rule-based design is proposed to automatically partition combinational logic into multiple PDs while considering usage characteristics. However, all of these studies [1, 2] focus on micro-architecture level and RTL design phases. In [13] , PD partitioning is explored at the Electronic System Level (ESL), thus for SoC architecture definition phases, but after task mapping and scheduling has been accomplished. As motivated earlier, this may hinder the maximization of idle intervals to reduce power or to allow to lower the number of power domains. In [7] , a relevant task mapping and scheduling problem is discussed to Maximize Common Idle Interval (MCIT) among all cores, though the objective is to reduce active time and power consumption of a shared memory. The ILP formulation of common idle intervals is based on a discrete time axis. In [6] , the idle interval of each core is modeled at a continuous time axis. But the approach does not formulate common idle intervals. In both [7] and [6] , homogeneous multi-core systems are considered. However, these are different from the power optimization of a heterogeneous architecture. Allocating tasks to more energy efficient cores may lead to lower power than merely pursuing MCIT. Moreover, both works do not investigate PD partitioning problem, but assume each core in an individual power domain. Some other works address voltage-frequency islands partitioning at system level [9, 12] , to reduce dynamic power. But the problem formulation is different. PD partitioning has to model power-off states, on/off dependencies within power domains, and power-off break-even times. This is difficult to model together with the problem of task mapping and scheduling. And, [12] does not consider task mapping and scheduling while [9] considers scheduling but not mapping. An overview of our methodology is presented in Fig. 4 . -Periodic applications, each of which can be modeled as a directed acyclic task graph G(V, E, T P , T D ), in which a task v belongs to the set of tasks V , E denotes data dependencies among tasks, an arbitrary period T P and deadline T D . -An SoC architecture consisting of a set of HW resources denoted as R, power model of any resource r ∈ R in different power states, e.g., P run,r , P of f,r , P idle,r , power-gating transition latency T tr of f (r) and energy E tr of f,r , as well as wake-up transition latency T tr on,r and energy E tr on,r from power-off state. -Mapping constraints that represent which task can be realized on which resource and the execution time of each task D v,r for a given resource. The objective is to explore trade-off fronts in terms of energy consumption and the number of power domains (representing a measure of design complexity) for the problem of power domain partitioning including task mapping and scheduling. An MOEA in [11] is used to explore the space of PD partitionings. Physical design or floorplan constraints can be added to prune the exploration space, if they can be forecast from previous products. For example, two resources far away in floorplan make less sense to be placed into the same PD. For each PD partition, an ILP is generated and solved to determine a mapping and schedule for each task and a suitable schedule of power mode transitions for each PD, with the objective to minimize the energy consumption. Poweron/off dependencies of HW resources in the same PD, power-gating transition energy and break-even time are all considered here. The energy consumption value derived by the ILP solver is fed back to the MOEA as one evaluated objective of each PD partition. The state-based power modeling approach as in [14] is chosen because it achieves sufficient accuracy at system level and early design phases. The power models can be refined along the design cycle, e.g., consider different active power P run,r for different types of tasks running on a resource. This work considers only static scheduling at design time. In principle, it may inspire the solution that considers the impact of run-time task migration. For example, add online scheduling algorithms after the ILP solver, and then evaluate the power consumption of each PD partition. However, it would take significantly longer exploration time, because the simulation is required to evaluate the power consumption. This is not the target application domain of this work. The time is assumed to be discrete, divided into unit time intervals [t, t + 1), for t = 0, 1, . . ., which we call time slots [7] . We refer to [t, t + 1) as time slot t, or even as time t. Tasks are assigned to time slots and D v,r is an integer. The continuous-time version of the same problem can be approximated as a discrete-time version. In this work, multiple independent periodic applications, e.g., [V 0 , . . . , V L ], with arbitrary deadlines and periods, can be considered together in a single ILP. This applies to the architecture which supports multiple applications simultaneously. A hyper-period of all applications, denoted as M , is chosen to map and schedule tasks from all applications within this hyper-period. Moreover, when the deadline of an application is longer than the period, a pipelined schedule is performed, i.e., a task graph is divided into several pipeline stages so the current iteration of the task can overlap in execution with previous iterations [15] . However, our methodology is not limited to any specific pipeline approach, which is also not the focus of this paper. The following formulations are elaborated by using only one application with multiple periods for ease of explanation. But experiments in this work were done for problems containing multiple applications. Table 1 defines ILP constants which are determined for each PD partitioning candidate by the MOEA. Table 2 explains the introduced ILP binary variables prior to introducing the ILP mapping and scheduling model. Transition energy of a power domain is calculated by Eqs. E tr on,r + T tr on,pd − T tr on,r * P idle,r , ∀pd ∈ P D (6) E tr,pd = E tr of f,pd + E tr on,pd , ∀pd ∈ P D E tot tr,pd = 1 2 * J pd * E tr,pd , ∀pd ∈ P D Here, we focus on explanation of ILP formulation related to power gating and PD partitioning. Other ILP constraints for basic task mapping and scheduling are not elaborated, since they are very well-known and not novel, e.g. task mapping constraints, task dependency constraints, deadline constraints, and so on [8] . Unique Start Time Constraint: Each task must start exactly once, thus in one time slot. Resource Busy Time Constraint: The number of busy slots of a resource should be equal to the total execution time of all tasks mapped on it. Moreover, from the start time slot of a task, it should be consecutive 1's assigned to the busy vector of a resource on which the task is mapped. Common Idle Time Constraint: A power domain is idle only when all resources in that domain are idle. This can be modeled by performing a logical NOR operation among the busy vectors of all resources in that domain: The NOR operation is nonlinear, but the Boolean logic operation can be transformed to linear constraints. Let N R (pd) denote the number of resources in power domain pd. Equation (12) is transformed as below. Off State Time Constraint: A power domain should be switched off only when its common idle interval is longer than its power-gating break-even time which can be modeled as Eqs. (15) To derive off-state slot vectors, an auxiliary variable I pd,t = 1 is introduced to represent T be,pd adjacent slots of a pd from slot t to slot t + T be,pd − 1 are all idle. This can be done by a logical AND operation: And, T be,pd − 1 zeros have to be padded at the beginning and the end of vector I pd,t using Eq. (19) . The proposed approach has been experimented on different benchmarks. The first set of benchmarks is from a public benchmark suite E3S [4] , while the second one consists of synthetic benchmarks generated using the tool TGFF [5] . The main program of the flow was implemented using Python, but the MOEA was implemented using Java [11] . All of programs have been executed on a laptop with an i5-5300U CPU @ 2.3 GHz (2 cores, 4 threads) and 12 GB DDR memory. For comparison, the same experiments were performed by applying the approach [13] performing PD partitioning after task mapping and scheduling. We called it as the reference approach in the following. Here, various task mapping and scheduling algorithms can be applied before PD partitioning with desired optimization objectives, like execution time or power. They lead to different energy consumption after PD partitioning and power gating. Since our work focuses on energy optimization, as a fair comparison, we performed an energyaware task mapping and scheduling also using the approach of ILP. But in this ILP formulation, processors are assumed to be only in RU N or IDLE states without OF F states. PD partitioning and power-gating related constraints are not applied during this step. Therefore, in the objective function, E of f,r and E tot tr,pd in Eq. (1) become zero, and t∈T O pd(r),t in Eq. (3) are zero too. Three benchmarks are selected from E3S [4] , i.e., Networking, Telecom and Consumer. They are scheduled onto a heterogeneous architecture consisting of a 2-D 3 × 3 mesh of processors whose power consumption is also specified in E3S. The transition latency in Table 1 varied in the range of 10-50 us, and the task execution times ranged in the interval of 0.5-1 ms. The transition energies E tr of f,r and E tr on,r in Eqs. (5)- (6) were assumed zero in the following. Therefore, the power-gating break-even time T be,pd was determined by the transition latency T tr,pd according to Eqs. (16)-(17). The MOEA has been configured to use 20 generations with 10 individuals per generation. For each number of power domains, the solutions with the lowest normalized energy according to Eqs. (1)- (8) are shown in Fig. 5 . It can be noticed that for each number of power domains, the trade-off point using our approach has a lower energy. This is because our concurrent mapping and scheduling of tasks with PD partitioning is able to create more common idle intervals specific for each PD partition to allow more power-off opportunities. Therefore, better power savings can be achieved even with fewer power domains. For example, in the benchmark of Telecom, a lower power consumption can be achieved even with 2-PD partition, in comparison to a trade-off point for an 8-PD partition in the reference approach [13] . Much better PD partitioning trade-off points can be identified to meet the power target at significantly reduced design cost. The exploration took 8-9 h in which we set the timelimit of the ILP solver to 3 min. Notably, this is longer than the approach [13] which took about 1-2 h This is expected, because our approach has to perform task mapping and scheduling in addition. Still, limiting the ILP solver to 3 min has two impacts: 1) the currently best found solution by the ILP may not be the optimal one in terms of energy consumption, but has a relative optimality gap of 10-20%, reported by the solver; 2) the ILP solver even may not find any feasible solution as the problem size increases, though it never happened in our benchmarks. Nevertheless, our approach was always able to find lower energy consumption points for each number of PD partitions. If more exploration time is acceptable, our approach would be able to probably find even better results. This is a tradeoff that system architects can decide during system-level exploration. Three benchmarks have been generated by TGFF [5] , with different tightness of deadline, i.e., the deadline is equal to, shorter, or longer than the period, as shown in Table 3 . Each use case has multiple applications to be scheduled over their hyper-period. When the deadline is longer than the period, e.g., in use case 3, or a multimedia streaming application, different iterations of applications can overlap. Therefore, we partitioned the task set and performed scheduling for steady state in one hyper-period [15] . All three use cases have size in the range of 40-50 tasks whereas the heterogeneous architecture consists of a 2D mesh with 3 × 3 processors. The power data, transition time, and energy for these processors were obtained from an in-house design. The EA parameters for the exploration are the same as for the E3S benchmarks. As shown in Fig. 6 , our proposed approach also identifies better trade-off fronts than the reference approach [13] . Trade-off fronts for TGFF benchmarks with normalized energy (to energy of 1-PD partition obtained by the reference approach [13] , i.e., PD partitioning performed after mapping and scheduling) vs. hardware complexity (number of power domains). The total exploration time depends on two parts: 1) EA parameters, mainly the number of generations and individuals per generation (PD partition options), which typically increases with the higher complexity of the hardware architecture; 2) execution time of ILP solver for each PD partition option, which scales non-linearly with the size of hardware architecture, the size of task graph, and most importantly, with the time scale of the schedule. Therefore, our approach is not easily scalable for bigger problems. We experimented the execution time of the ILP solver, given an architecture of a mesh network with 6 processors. When increasing the number of tasks to 80 and set the relative optimality gap of the ILP solver to 20%, a feasible schedule cannot be found within 2 h though it is preferred in the range of minutes as a part of whole flow. Alternative models for scheduling might be the key to reduce the number of binary variables and thus search space of the ILP formulation to improve scalability. Although our performed experiments were solvable for real-world benchmarks in still an acceptable amount of time, we envision to investigate scalability in future work. In this paper, an exploration approach is proposed to systematically explore PD partitioning for heterogeneous multi-core SoCs, jointly with task mapping and scheduling. An ILP-based task mapping and scheduling is performed for each PD partition candidates while partitioning PDs by a Multi-Objective Evolutionary Algorithm. The ILP formulation considers the constraints of power-off dependencies among hardware resources belonging to the same PD and the power-gating break-even time. For a given PD partition, it creates more and longer common idle intervals of PDs which can be switched off more often to save power. Compared to state-of-the-art approaches performed after task mapping and scheduling frozen, our approach offers significantly larger optimization opportunities for system architects. It has been shown that better trade-off fronts in terms of energy consumption and number of PDs and thus hardware costs may be found by shifting exploration to earlier design phases. Leveraging rule-based designs for automatic power domain partitioning FSMD partitioning for low power using simulated annealing Certification-cognizant time-triggered scheduling of mixedcriticality systems Embedded system synthesis benchmarks suites TGFF: task graphs for free Modeling processor idle times in MPSoC platforms to enable integrated DPM, DVFS, and task scheduling subject to a hard deadline Maximizing common idle time on multicore processors with shared memory Hybrid optimization techniques for system-level design space exploration Clustering-based simultaneous task and voltage scheduling for NoC systems Cyber-physical systems design for electric vehicles Opt4J-a modular framework for meta-heuristic optimization Voltage-frequency island partitioning for GALS-based networks-on-chip Exploration of power domain partitioning for application-specific SoCs in system-level design A very fast and quasi-accurate power-state-based system-level power modeling methodology Pipelined data parallel task mapping/scheduling technique for MPSoC Acknowledgments. This research work was funded by Intel Deutschland GmbH, and finished before Bo Wang and Aneek Imtiaz left Intel. We would like to acknowledge Dr. Yang Xu and Dr. Ralph Hasholzner at Intel, and also Dr. Thomas Wild and Prof. Andreas Herkersdorf at Technische Universität München, for valuable discussions.