FARSI: Facebook AR System Investigator for Agile Domain-Specific System-on-Chip Exploration
Behzad Boroujerdian, Ying Jing, Amit Kumar, Lavanya Subramanian, Luke Yen, Vincent Lee, Vivek Venkatesan, Amit Jindal, Robert Shearer, Vijay Janapa Reddi
Date: 2022-01-13

Domain-specific SoCs (DSSoCs) are attractive solutions for domains with stringent power/performance/area constraints; however, they suffer from two fundamental complexities. On the one hand, their many specialized hardware blocks result in complex systems and thus high development effort. On the other, their many system knobs expand the complexity of the design space, making the search for the optimal design difficult. Thus, to reach prevalence, taming such complexities is necessary. This work identifies the necessary features of an early-stage design space exploration (DSE) framework that targets the complex design space of DSSoCs and further provides an instance of one called FARSI, the (F)acebook (AR) (S)ystem (I)nvestigator. Concretely, FARSI provides an agile system-level simulator with a speedup of 8,400X and an accuracy of 98.5% compared to Synopsys Platform Architect. FARSI also provides an efficient exploration heuristic and achieves up to a 16X improvement in convergence time compared to naive simulated annealing (SA). This is done by augmenting SA with architectural reasoning such as locality exploitation and bottleneck relaxation. Furthermore, we embed various co-design capabilities and show that on average, they have a 32% impact on the convergence rate.
Finally, we demonstrate that using simple development-cost-aware policies can lower the system complexity, both in terms of the component count and variation, by as much as 150% and 118% (e.g., for the Network-on-a-Chip subsystem).

With the end of Moore's law, simple systems with a few general-purpose processors do not provide a path forward for domains with stringent power/performance/area constraints. This is due to the performance and energy inefficiencies of these systems/processors, such as the high instruction fetch and decode overhead. Domain-specific SoCs (DSSoCs) are an attractive and adaptive alternative as they provide the required performance and energy scalability. This is due to their domain awareness and specialized hardware blocks that keep the computation/communication fast and efficient. Although efficient, DSSoCs are complex systems and demand high development effort to design. This complexity is driven by the number and variation (heterogeneity) of the processing elements, and the intricate topological structure connecting them. Concretely, from the compute perspective, increasingly customized accelerators accompany general-purpose processors; from the communication perspective, both the Network-on-a-Chip (NoC) and memory subsystems see a complexity rise to keep the data local and the movement low energy [53]. In addition to the system complexity, the sheer number of design knobs per component dramatically expands the design space, making the search for the optimal design difficult. For example, a simple system with five simple workloads and six total knobs (e.g., frequency, bus width, accelerator hardening knobs) totals over a million design points. Naive brute-force sweeps are infeasible for the design space exploration of DSSoCs.

Fig. 1. DSE components to the left need the characteristics on the right to tame the DSSoC design space complexities. We introduce FARSI, an early-stage DSE framework that has those characteristics.
Hence, it enables an agile DSE for DSSoCs. To manage all said complexities, we need a design space exploration (DSE) technique that is agile/efficient enough to navigate the space effectively, and further is development-cost-aware to dampen the design effort. DSEs typically consist of a simulator (Figure 1, top left) to capture the design behavior and an exploration heuristic (Figure 1, bottom left) to navigate the design space. Both of these components need to be efficient. To that end, in this work, we provide an efficient early-stage (pre-synthesis) DSE, called FARSI, that concretely targets DSSoCs and their complexity. Our work is open-sourced and free for public use: https://github.com/facebookresearch/Project_FARSI. FARSI meets the requirement that DSSoC simulators be agile, accurate, and capable of system-level estimation (Figure 1, top right); DSSoC simulators must be capable of estimating system-level dynamics, as profiling components in isolation cannot accurately measure their system-level impact. Furthermore, they must be agile to enable sufficient coverage of the vast design space. FARSI provides a system-level simulator to capture the complexity of accelerator-rich systems (e.g., with more than 50 hardware blocks). It also achieves high agility/accuracy by combining analytical models' speed and event-driven simulation's accuracy. Our analytical models (for the first time) build on top of and further expand Gables [21], a roofline-based bottleneck analysis for SoCs. We augment these analytical models with a lightweight phase-driven simulator to estimate system dynamics. FARSI splits the program progression into phases (the longest unit of time within which the system bottleneck stays constant), calculates their bottlenecks, and advances the simulation accordingly. Such an approach enables fast simulation as the phases are not fixed and adaptively expand or shrink based on the program characteristics.
We quantify our simulator's fidelity by comparing it against Synopsys' Platform Architect [45], an industry-strength early-stage analysis and optimization tool for SoC architectures. In all of our studies, we target the augmented reality (AR) domain, without loss of generality, as its rigid constraints demand accelerator-rich and complex DSSoCs. Specifically, we compare against over 200 designs of different complexities and show an average speedup of 8,400X (.032 seconds on average for FARSI and 235 seconds for PA) and an accuracy of 98.5%. We further quantify the scalability of our simulator's agility with respect to the system complexity, showing a slight slowdown of 3X when the number of blocks is increased by 20X. Lastly, we show that using our simulator in the DSE can improve the exploration convergence time from (estimated) 3 years to 3 hours. In addition, DSSoC exploration heuristics need to be domain-, architecture-, and development-aware and also co-design capable (Figure 1, bottom right). Embedding these characteristics into a search heuristic (e.g., simulated annealing, although we can expand to other heuristics) can improve the convergence time and system complexity. DSSoC exploration heuristics need to be "domain-aware", meaning that they must extract and further exploit the opportunities provided by their workload set (e.g., loop/task-level parallelism). We equip FARSI with this feature and show how parallelism and computation/communication boundedness can effectively guide the search. DSSoC exploration heuristics also need to be "architecture-aware", i.e., deploy architectural reasoning during the search to efficiently prune the space. For example, they should exploit spatial locality to bring the data closer to the accelerator or use bottleneck analysis to relax the congestion for an overloaded NoC.
We equip FARSI with such reasoning and show that when embedded into naive search algorithms (e.g., simulated annealing), architectural insights can improve the convergence time by 16X. We further quantify how different levels of reasoning can impact this convergence time and hypothesize that the said convergence gap will only grow with more complex workload sets. DSSoC exploration heuristics also need to exploit co-design, e.g., simultaneous communication and computation optimization, as this approach provides additional optimization opportunities to accelerate the convergence. FARSI exploits co-design heavily across four main vectors. It co-designs (1) across workloads, looking for holistic improvements across workload boundaries. This approach is also applied (2) across metrics, (3) across computation-communication, and (4) across optimizations (mapping, allocation, customization). We show that this feature can impact the convergence rate on average by 32% and further detail the impact of each co-design vector. We also show that embedding the same co-design capabilities in regular simulated annealing (without architectural considerations) does not necessarily translate to design improvements. Finally, DSSoC exploration heuristics need to be development-cost-aware to keep system complexity and development effort low. We equip FARSI with this feature by prioritizing low-development-cost optimizations (e.g., prioritizing software mapping over hardware customization) and provide a case study quantifying the impact. We show that when system specifications (e.g., the latency budget of the SoC) are relaxed, FARSI decreases the output design complexity by lowering the number (e.g., a 150% reduction in the number of NoCs) and variation of components (a 118% reduction in NoC frequency variation). We compare our methodology with the divide-and-conquer methodology commonly used to manage complexity. Divide and conquer simply splits the design into sub-components and targets each locally.
We shed light on significant problems associated with this methodology and show that they can lead to systems with up to 56% and 52% degradation in performance and power compared to FARSI's designs. In short, FARSI reveals a methodology for efficiently managing the system and search space complexity of DSSoCs. Our contributions include the following:
• We provide the first Gables-based [21] characterization of a host of AR workloads from three different AR application primitives: audio decoding [24], CAVA [59], and edge detection [32].
• We show how a hybrid estimation methodology that combines analytical models and phase-driven simulation can provide both the agility (8,400X speedup) and the accuracy (98.5%) much needed for DSSoCs. Our simulator lowers the convergence time from (estimated) 3 years to 3 hours compared to Synopsys Platform Architect.
• We show how architectural reasoning (e.g., locality exploitation or bottleneck relaxation) and co-design can improve the convergence time of simple search heuristics such as simulated annealing by more than an order of magnitude.
• We show how development-aware policies (e.g., prioritizing incremental improvements) can exploit relaxed budgets and lower the final system complexity, both in terms of component counts and variations, by more than 100%.
This section presents an overview of the AR workloads that we test the FARSI DSE with. First, we motivate the suitability of the augmented reality (AR) domain for DSSoCs and then detail our workloads. Though we focus on the AR domain, the FARSI simulator is applicable to other domains. Augmented reality is emerging as one of the main drivers of technology. With use cases in medical training [28], repair and maintenance [3], education [2], tourism [4], and others, this domain is on the cusp of showing its full potential.
Furthermore, the current COVID-19 pandemic has accelerated this emergence as various use cases such as training and education continue migrating to the online domain [36]. AR's promise to pervade various aspects of our lives has caught the attention of technology giants such as Facebook [5], Google [7], Apple [1], and others, with efforts to embed this technology in various form factors ranging from phones to glasses [6]. Despite said promises, this domain faces various challenges from the compute perspective. First of all, due to the domain's tight constraints, there are currently several orders of magnitude of gap between the needed and achievable performance, power, and usability [24]. Closing this gap is imperative as the performance constraints ensure timely interaction with the world while the power constraints ensure operational battery life. Secondly, this domain demands SoCs with a diverse set of sub-domains, including video, graphics, computer vision, optics, and audio [24]. The combination of these sub-systems forms a rich set of pipelines that continuously interact to deliver the real-time functionality needed. This diversity increases the development effort to design AR systems. These two challenges make AR an ideal candidate for domain-specific SoCs. On the one hand, the tight budgets of the domain require a host of specialized hardware units with high power and performance efficiency. On the other, the diversity of the sub-domains asks for holistic designs that go beyond each sub-domain and instead exploit cross-sub-domain co-design opportunities. Augmented reality spans many sub-domains and workloads. In this work, we target the primitive workloads, i.e., audio and image processing, as they are present in almost any AR application [18, 19, 38]. Note that this work leaves out Facebook's proprietary internal workloads and only resorts to open-sourced libraries. Here we detail our workloads' functionality and characteristics.
Audio Decoder (Audio) is used to play back the audio based on the user's pose. Concretely, said pose, obtained from the IMU integrator, is used to rotate and zoom the soundfield, which is then mapped to the available speakers. This workload is from a new AR/VR testbed [24]. CAVA simulates a configurable camera vision pipeline and is used to process raw images (pixels) from sensors to feed the vision backend (e.g., a CNN). The default ISP kernel is modeled after the Nikon D7000 camera [8] and is developed by Harvard [59]. Edge Detection (ED) processes the image and finds sharp changes in brightness to capture significant events and changes in properties of the world. It is one of the key steps used in image analysis and computer vision. Our workload appears in [33]. To characterize each workload's execution flow, we split it into a set of tasks (e.g., the Tone Map task in the CAVA workload), where a task is the smallest unit of simulation and is selected from the workload's functions. The execution dependencies between the tasks are captured in the task dependency graph (TDG), where nodes specify tasks and edges specify their dependencies. Figure 2 provides each workload's TDG. Audio has the most tasks (15), while Edge Detection has the least (6). In general, this distribution of tasks is in line with several internal proprietary workloads. To characterize the tasks, we use Gables [21]. For the first time, we provide a Gables profile of a set of AR workloads. Gables is a set of abstract analytical models that captures high-level software computational and communicational characteristics. Concretely, it models each task's computation with its work (W = instruction count) and its communication with operational intensity (I = operation count per memory access, which is split into I_r and I_w, corresponding to reads and writes respectively). Table 1 quantifies each workload's Gables-relevant variables, with each value averaged over all the tasks of a workload.
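To make the roofline-style reasoning behind Gables concrete, the sketch below computes the classic attainable-performance bound from a task's operational intensity. The peak numbers are hypothetical illustrative values, not measurements from our hardware database:

```python
def attainable_perf(op_intensity, peak_perf, peak_bw):
    """Classic roofline bound: a task with operational intensity I
    (ops per byte moved) on hardware with peak compute P (ops/s) and
    peak memory bandwidth B (bytes/s) sustains at most min(P, I * B)."""
    return min(peak_perf, op_intensity * peak_bw)

# Hypothetical hardware: 100 Gops/s peak compute, 25 GB/s memory bandwidth.
P, B = 100e9, 25e9
low_intensity = attainable_perf(1.0, P, B)    # memory bound: 25 Gops/s
high_intensity = attainable_perf(16.0, P, B)  # compute bound: 100 Gops/s
```

A task whose intensity sits left of the "ridge point" (P/B = 4 ops/byte here) is memory bound; to its right, compute bound. This is the bottleneck reasoning that the rest of our models build on.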
TaLP and LLP quantify (Ta)sk-(L)evel and (L)oop-(L)evel (P)arallelism, respectively. The former quantifies the number of task combinations that can run in parallel, and the latter quantifies the average number of independent loop iterations. As seen, Audio uses both loop- and task-level parallelism (the highest TaLP among all), CAVA only uses loop-level parallelism, and Edge Detection has task-level parallelism and the highest LLP. Edge Detection is also the most communication-intensive benchmark. Note that task-by-task details for all the workloads are presented in the appendix. In this section, we detail FARSI DSE's methodology. FARSI comprises four stages (Figure 3): (1) database generation, (2) system simulation, (3) system generation, and (4) system selection. The first stage characterizes our workloads, collects (in isolation) IP performance/power/area (PPA) estimates, and populates our software/hardware database. This database is then continuously queried by our system simulator, system generator, and system selector as they work together to generate new systems and improve them. This process iteratively continues until the design's budgets (i.e., performance, power, and area) are met. FARSI's main contribution is in the last three stages as prior work such as [32, 52, 60] sufficiently addresses the first. This stage populates a database with the characteristics of the domain's workloads and the PPA estimates of individual IPs. Both of these components are then used to estimate the system behavior while running various workloads. Workload Analysis (Software Database): This involves generating a task dependency graph (TDG), shown in Figure 2. The TDG's nodes are further augmented with computational characteristics (e.g., loop iteration and instruction counts), while its edges are augmented with communication (data movement) characteristics.
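The task-level parallelism available in a TDG follows directly from its dependency structure. As a rough sketch (our own simplification, not FARSI's implementation), one can layer the graph topologically and inspect how many tasks share a layer; tasks in the same layer have no dependency path between them and may run concurrently:

```python
from collections import defaultdict

def parallel_layers(tdg):
    """tdg: dict mapping each task to the list of tasks it depends on.
    Returns the topological layers of the graph; the width of a layer
    is a rough proxy for the task-level parallelism at that depth."""
    depth = {}
    def layer(task):
        if task not in depth:
            deps = tdg.get(task, [])
            depth[task] = 0 if not deps else 1 + max(layer(d) for d in deps)
        return depth[task]
    for t in tdg:
        layer(t)
    layers = defaultdict(list)
    for t, d in depth.items():
        layers[d].append(t)
    return [sorted(layers[d]) for d in sorted(layers)]

# Toy TDG: T1 feeds T2 and T3, which both feed T4.
tdg = {"T1": [], "T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}
print(parallel_layers(tdg))  # [['T1'], ['T2', 'T3'], ['T4']]
```

Here the middle layer shows that T2 and T3 can run in parallel, which is the kind of opportunity TaLP captures.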
Tasks are typically chosen from the functions with the highest computational/communication intensity, as their optimization can significantly impact system behavior. Optimal task selection is a difficult problem and is outside this work's scope; in our work, selection relies on application developers' insights. For TDG generation and for profiling tasks' computational and communication characteristics, we use existing tools such as perf [39], AccelSeeker [60], and HPVM [32]. IP Analysis (Hardware Database): This involves PPA estimation of each task for different hardware mappings (e.g., to general-purpose processors or specialized accelerators). We build on top of an existing toolset called AccelSeeker [60] and collect PPA estimates for each mapping. We augment our database with many memory and NoC widths and frequencies. This stage raises the view to the system level and enables holistic analysis. The system view is necessary since profiling components in isolation cannot accurately measure their overall impact within complex and accelerator-rich DSSoCs. This stage considers computational (general-purpose processors and accelerators) and communication IPs (NoCs and memory) as the lowest abstraction units. We deploy a hybrid estimation methodology combining analytical models and a lightweight phase-driven simulator. The former enables agile traversal and thus extensive design space coverage. The latter improves the simulation fidelity by capturing the system dynamics. Here we discuss each. Software Analytical Models: We augment the Gables SoC-level roofline model as it provides a simple yet holistic/system-level set of analytical models [21]. Gables combines bottleneck analysis and high-level computational/communication models. Our improvements include:
• Finer Computation/Communication Granularity: We lower the smallest execution unit from workload to task.
This captures the workload's lower-level compute, memory, and communication characteristics and further provides extra within-workload optimization opportunities. We use the following notation.
Hardware: B_peak, the peak bandwidth (bytes/sec); B_mem and B_noc, the actual memory and NoC bandwidth (bytes/sec); P_peak, the peak performance of the CPU (ops/sec); P_cpu and P_ip, the actual CPU and IP performance (ops/sec); A_peak, the peak acceleration (unitless); L, the number of links for a NoC (unitless).
Software: T, a task; W_ip, a task's work for an IP (ops); W_cpu, a task's work for the CPU (ops); D, the total task data transferred (bytes); b, the burst size for a task (bytes); t_T, the completion time of task T (sec).
Timing: Φ, the duration of a phase (sec).
Also, we introduce a communication burst size parameter that captures NoC congestion behavior more accurately.
• Task Dependencies: The Gables model assumes parallel execution of all workloads/tasks for simplicity. We use TDGs to model task-level parallelism (TaLP) accurately.
• Computation/Communication Breakdown: We augment Gables with loop iteration counts and thus loop-level parallelism (LLP). We also break operational intensity down into read and write operational intensity (I_r, I_w) as modern routers/memories support separate channels for each.
Hardware Analytical Models: Gables models hardware with peak performance (P_peak) and bandwidth (B_peak). It assumes a single NoC with a single channel. Our improvements include:
• Computation/Communication Resource Sharing: We integrate CPU multi-tasking models (pre-emptive scheduling) and multi-channel routers for master-slave combinations.
• Topology Improvement: We introduce "many-NoC" topology systems to improve congestion/locality.
Phase-driven Simulation: Gables is a static model that cannot accurately capture the dynamic flow of complicated task graphs. So we wrap our analytical models with a lightweight phase-driven simulation (Figure 4a). A phase is a flexible time quantum with which we advance the simulation.
It specifies the longest interval during which the system behavior stays constant. Since our models use bottleneck analysis to estimate the system behavior, a new phase emerges when a task is completed and scheduled out (shifting the bottleneck by relaxing NoC/memory/PE pressure) or a new task is scheduled in (Figure 4b). At a high level, our simulator estimates a phase's duration based on the fastest task's completion time, advances all tasks according to this interval, schedules out the completed task, and then schedules in tasks whose dependencies are satisfied. This process continues iteratively until all the tasks/workload(s) are completed (Figure 4a). Hybrid Estimation: Our hybrid approach combines the said analytical models and phase-driven simulation to estimate the SoC performance. At the beginning of each phase, our phase-driven simulator schedules all the tasks whose dependencies are satisfied. Then, our analytical models determine each task's completion time by estimating its hardware blocks' processing rates (Equations 1 through 4). T and b denote a task and its burst size, P_peak and P_cpu denote a general-purpose processor's peak and actual performance, A_peak and P_ip denote the peak accelerator speedup and the accelerator's actual performance, B_peak and B_mem/B_noc denote memory/NoC peak and actual bandwidth, L denotes the number of links in a NoC, and |·| denotes a variable's cardinality (e.g., |T_cpu| denotes the number of tasks running on the CPU). Figure 4b shows the phases for an example workload, where Φ and T denote a phase and a task, respectively. To calculate the processing rate of the CPU (Eq. 1), an IP (Eq. 2), and a NoC (Eq. 3), their peak rate is divided by the number of tasks running on them. This is because pre-emptive scheduling of the CPU/IP and equal arbitration of a NoC allocate the rate equally among the tasks. However, for memory (Eq. 4) and a NoC's links (Eq. 3), this division is determined by the ratio of the task's burst size over the total bursts of all running tasks.
Once each block's processing rate for a task is determined, the completion time of the task is determined by its slowest block (Equation 5). T, t_T, and W_ip denote a task, its completion time, and its work for an IP. Each component in the tuple calculates the execution time of a different block, where each block's work (e.g., data D for memory) is divided by its execution rate (e.g., bandwidth for memory). The maximum function finds the slowest of the blocks, and the number and type of its inputs are determined by the blocks hosting the task. The end of a phase (the phase duration) is determined by the fastest task and thus estimated by the minimum of all tasks' completion times (Equation 6), i.e., Φ = min_i t_{T_i}, where Φ and t_{T_i} denote the phase duration and task i's completion time. Once said duration is calculated, the phase-driven simulator moves the simulation tick to the end of the phase, updates each task's progress according to the duration, schedules out the completed task, and schedules in ready tasks. This process continues until no more tasks are left to be scheduled. Power/Area Modeling: For the power and area of IPs, we use the verified AccelSeeker [60] estimations, and for NoCs and memory, we have fully integrated CACTI [40] into FARSI. This stage exploits architectural insights to explore the design space. Without loss of generality, we use simulated annealing (SA) as the search heuristic base and then augment it with various architectural insights. To explore the design space, we greedily generate a number of a design's neighbors (candidate designs in SA), simulate them, and select the best one until the constraints are met (Figure 3). A design's neighbor is a design whose mapping, allocation, or topology has been incrementally modified (Figure 5). For example, designs B and C are design A's neighbors as the former has an allocation change (an extra IP) and the latter a mapping change (Task 2).
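The phase-driven loop described above can be sketched in a few lines of Python. This is a simplified sketch under our own assumptions (a single shared processing rate per task, no distinction between block types); names such as `rate_of` and `remaining` are illustrative, not FARSI's actual API:

```python
def phase_driven_sim(tdg, work, rate_of):
    """tdg: task -> list of dependency tasks; work: task -> total work (ops);
    rate_of(running): returns {task: processing rate} for the running set,
    e.g., an equal share of a block's peak rate (Eqs. 1-4, simplified).
    Returns the total simulated time."""
    remaining = dict(work)          # work left per task
    done, now = set(), 0.0
    while remaining:
        # Schedule in every task whose dependencies are satisfied.
        running = [t for t in remaining if all(d in done for d in tdg[t])]
        rates = rate_of(running)
        # Phase duration = fastest task's completion time (Eqs. 5-6).
        phase = min(remaining[t] / rates[t] for t in running)
        now += phase
        for t in running:           # advance all running tasks by the phase
            remaining[t] -= phase * rates[t]
            if remaining[t] <= 1e-12:
                del remaining[t]    # schedule out completed tasks
                done.add(t)
    return now

# Toy example: a CPU with 10 ops/s peak, shared equally among running tasks.
tdg = {"T1": [], "T2": ["T1"], "T3": ["T1"]}
work = {"T1": 10.0, "T2": 10.0, "T3": 10.0}
share = lambda running: {t: 10.0 / len(running) for t in running}
print(phase_driven_sim(tdg, work, share))  # T1 alone: 1 s; T2+T3 share: 2 s -> 3.0
```

Note how the phase length adapts: the first phase lasts 1 s (only T1 runs at full rate), the second 2 s (T2 and T3 split the rate), mirroring the expand-or-shrink behavior described above.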
We generate neighbors by selecting a five-tuple, i.e., a Metric, a Direction, a Task, a Block, and an Optimization Move, to improve the said components. Metric/Direction: To generate a neighbor, we target one metric per iteration (e.g., performance). We typically pick the metric farthest from its budget as it contributes the most to the distance (to goal). Note that this metric can vary from iteration to iteration as we improve said metrics. Overall, FARSI currently targets performance, power, and area. In addition to the metric, we also target a direction in which to improve the metric (e.g., decrease, or increase if we have overshot). Task/Hardware Block: To generate a neighbor, we target a task/block to improve, typically the one causing the highest distance to budget. For example, if targeting latency, the highest-latency task and its bottleneck block are selected. Optimization Move: A move is an optimization on a task/block to improve a metric. FARSI supports high-level optimizations such as hardware customization, hardware allocation, and software-to-hardware mapping, and provides a move primitive for each: concretely swap, fork/join, and migrate. Swap: Enables customization by replacing a hardware block with a more specialized one (Figure 6a). For example, we replace an ARM core with an accelerator or a narrow bus with a wider one. Swaps are applied incrementally (e.g., 100MHz→200MHz instead of 100MHz→800MHz) to minimize the impact on competing metrics (e.g., performance vs. power). Fork/Join: Impacts allocation by duplicating an existing block and migrating some of its tasks over (Figure 6b). Duplication relaxes the block's pressure, e.g., cutting down a NoC's traffic and relaxing congestion. Join is the opposite of fork. Migrate: Modifies the software-to-hardware mapping by migrating a task from one block to another (Figure 6c).
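To make the three move primitives concrete, the following sketch applies them to a minimal design representation: a dictionary of blocks plus a task-to-block mapping. All names and the representation itself are our own illustration, not FARSI's internal data structures:

```python
import copy

def swap(design, block, new_block):
    """Customization: replace a block with a more specialized variant."""
    d = copy.deepcopy(design)
    d["blocks"][new_block] = d["blocks"].pop(block)
    d["mapping"] = {t: (new_block if b == block else b)
                    for t, b in d["mapping"].items()}
    return d

def fork(design, block, tasks, clone):
    """Allocation: duplicate a block and migrate some of its tasks over."""
    d = copy.deepcopy(design)
    d["blocks"][clone] = dict(d["blocks"][block])
    for t in tasks:
        d["mapping"][t] = clone
    return d

def migrate(design, task, dest):
    """Mapping: move one task to another existing block."""
    d = copy.deepcopy(design)
    d["mapping"][task] = dest
    return d

# Design A: one CPU running two tasks (analogous to Figure 5's example).
A = {"blocks": {"cpu0": {"freq": 100}}, "mapping": {"T1": "cpu0", "T2": "cpu0"}}
B = fork(A, "cpu0", ["T2"], "cpu1")   # neighbor via allocation change
C = migrate(B, "T1", "cpu1")          # neighbor via mapping change
```

Each primitive returns a fresh design object (hence the deep copies), which matches the neighbor-generation flow: the original design is duplicated and then incrementally modified by exactly one move.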
Migration improves data locality (moving data closer to where it is used), improves load balancing (mapping to an idling processor), and relaxes bus/memory pressure (mapping to the bus/memory with lower congestion). Currently, mapping is done statically. Using Architectural Reasoning for Move Selection: FARSI's exploration heuristic is equipped with architectural reasoning, concretely the capability to apply parallelism, locality, and customization according to the system's needs to improve the convergence rate. For example, to improve power/performance, it automatically detects if a task's data is multiple hops away and chooses "migrate" to bring the data closer (spatial locality reasoning). Furthermore, to improve performance, it automatically recognizes if a block runs concurrent, parallel tasks and uses "fork" or "migrate" to relax the pressure/congestion (parallelism reasoning). Alternatively, if the metric to optimize is power, it uses "join" to serialize execution and improve power or reduce area. This automatic capability to reason about the system and the targeted metric and to apply appropriate optimizations improves the convergence rate and is necessary for the complex design spaces of DSSoCs. At a high level, to select optimization moves (Algorithm 1), FARSI first applies architectural reasoning to find moves that can improve the design (step I), prioritizes them according to their development cost (detailed in the next paragraph) (step II), and probabilistically samples and applies a move (step III). This approach improves the random neighbor generation (or move selection) of SA and increases its convergence rate by more than an order of magnitude (Section 5.2). Note that said reasoning is not tied to a specific search heuristic, and although this work applies it to simulated annealing, it can improve other heuristics' convergence rate as well.
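The three steps above (reason, prioritize by development cost, sample) can be sketched as follows. The candidate moves, precedence values, and weighting scheme are illustrative assumptions of ours, not a transcription of FARSI's Algorithm 1:

```python
import random

# Lower rank = cheaper development cost (software moves before hardware moves).
MOVE_PRECEDENCE = {"join": 0, "migrate": 1, "fork": 2, "swap": 3}

def select_move(candidates, temperature=1.0, rng=random):
    """candidates: moves deemed beneficial by architectural reasoning
    (step I). Sort them by development cost (step II), then sample with a
    bias toward cheap moves; a higher temperature flattens the bias and
    allows more expensive moves through (step III)."""
    ordered = sorted(candidates, key=lambda m: MOVE_PRECEDENCE[m])
    weights = [(1.0 / (1.0 + MOVE_PRECEDENCE[m])) ** (1.0 / temperature)
               for m in ordered]
    return rng.choices(ordered, weights=weights, k=1)[0]

rng = random.Random(0)
picks = [select_move(["swap", "migrate", "fork"], rng=rng) for _ in range(1000)]
# "migrate", the cheapest candidate, should dominate the samples.
```

The key property is that expensive moves are de-prioritized but never excluded, so the search can still escape situations where cheap moves no longer help.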
Also note that FARSI's users, i.e., system designers and architects, can easily extend said reasoning with their own architectural insights to improve FARSI's intelligence. Move Symmetry: Our optimization moves are symmetrical to enable backtracking and prevent the navigation from getting stuck. For example, we can swap up followed by a swap down (e.g., first widen and then narrow the bus), migrate back and forth, and join in after forking out. Development-cost Awareness: Since highly customized SoCs are complex and thus financially costly to develop, we embed various policies into our heuristic to keep the development cost low:
• We introduce move precedence to select low-effort moves over expensive ones. For example, we prioritize join over other moves as it reduces the hardware complexity by eliminating a hardware block. Furthermore, we prioritize software moves (migrate) over hardware ones (fork, swap) as software manipulation is cheaper. Finally, within hardware moves, we prioritize fork over swap for processing elements as the former only requires duplication whereas the latter involves porting (or hardening), which is more expensive. For memory and NoCs, swap is prioritized if it does not increase the system heterogeneity, for example, by creating a system with NoCs of different frequencies. A snippet of our move precedence is shown in move_precedence of Algorithm 1.
• FARSI starts with a very simple base design (one general-purpose processor, one NoC, and one DRAM bank) and advances its complexity incrementally, only if needed. Since every new allocation or customization is costly (the former introduces new hardware and thus increases the development effort, while the latter contributes to the system heterogeneity), we only apply one move at a time and thus incrementally modify the design. Furthermore, our moves are designed to modify only one knob at a time, e.g., migrate only one task.
Overall, this approach allows us to increase the complexity in small steps and only if the design has not met the budget, thus keeping the development effort low. We detail the impact of our development-aware policies in a case study (Section 6.1). Once we generate and simulate the neighbors (candidate generation in SA), we select the best. A neighbor's fitness is quantified using its design's normalized distance to budget, with a dampening factor applied to the metrics already meeting their budget (Equation 7), i.e., distance = Σ_m c_m × (v_m − b_m) / b_m, where m, v_m, and b_m denote a metric, the design's value for that metric, and the budget value for that metric, and c_m is 1 when the budget is missed and equals the dampening factor when the budget is met. If no improved designs are found, we inform the system generation stage for the next iteration to target the task/block with the next-highest distance (compared to the last). Note that, similar to SA, we also use a temperature value to choose sub-optimal designs infrequently. This section sheds light on the performance breakdown of the FARSI components and further quantifies our simulation fidelity in terms of modeling accuracy. In this study, we exploit our augmented simulated annealer to generate over 250 SoCs for our three workloads, namely, Audio Decoder, CAVA, and Edge Detection. These SoCs have different topology complexities, ranging from 1 to 13 processing elements, 1 to 8 memory blocks, and 1 to 3 NoCs. All data are collected on a 2.00 GHz Xeon Skylake machine, and PA's time interval parameter is set to 10 µs (corresponding to 1000-10000 cycles of our hardware blocks) to enable high accuracy. Simulation Validation: We validate our simulation framework against Synopsys Platform Architect (PA) [45], a widely used industry-grade performance simulator. Similar to FARSI, PA's approximately timed mode targets agile/early-stage system-level estimation. PA deploys event triggers (i.e., transactions and fixed time intervals) to advance the simulation. We achieve an accuracy of 98.5% (Table 4b).
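The distance-to-budget fitness can be sketched as follows. This is a simplified reading of the normalized-distance idea behind Equation 7; the 0.1 dampening factor and the metric values are arbitrary illustrative numbers:

```python
def distance_to_budget(values, budgets, dampen=0.1):
    """Sum of normalized budget overshoots across metrics. A metric that
    already meets its budget contributes only a dampened (negative) term,
    so it barely offsets metrics that are still over budget."""
    total = 0.0
    for m, v in values.items():
        norm = (v - budgets[m]) / budgets[m]
        total += norm if norm > 0 else dampen * norm
    return total

# Power is 50% over budget; area is 20% under budget (dampened).
d = distance_to_budget({"power": 0.15, "area": 80.0},
                       {"power": 0.10, "area": 100.0})
print(round(d, 3))  # 0.5 - 0.1*0.2 = 0.48
```

The dampening keeps a large surplus on one metric from masking a violation on another, which is why the DSE keeps pushing on the metric farthest from its budget.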
We measure our simulator's fidelity sensitivity with respect to the types and quantity of hardware blocks within said designs (Figure 7a). Processing elements (PEs) show the lowest sensitivity while buses show the highest. This is because we do not model the intermittent congestion that can occur within a sequence of (NoCs') pipes and instead assume constant congestion for a phase. Simulation Speedup: We achieve an average speedup of 8,400X (compared to Synopsys PA), owing to our hybrid estimation methodology that combines analytical models and flexible phase-based simulation. We also measure our simulator's slowdown with respect to the number of system blocks (Figure 7b). We incur a small 3X slowdown for a 20X increase in the number of blocks. For Synopsys PA, we do not see a meaningful relationship, as the simulation time sometimes lowers and then plateaus with a higher number of blocks. However, PA shows a high sensitivity to the workload's execution latency (Figure 7c), where an increase from 0.005 s to 50 s in this latency raises the simulation time from 101 s to 814 s. FARSI also experiences a slowdown, from 0.018 s to 0.21 s, for the same increase, still maintaining high agility. Performance Breakdown: We profile FARSI's last three iterative stages to show what needs to be sped up. Note that the first stage (Database Generation) occurs only once and, further, does not change across different runs. Figure 8a shows that most of the time (79.9%) is spent on the system generation stage. Further breakdown of this stage (Figure 8b) reveals that sub-stages such as task, block, or metric selection, which perform the actual system modifications, only consume a small percentage (<2%). Instead, design duplication, which copies the original design object to be modified/improved, consumes the most time. This step can be sped up by lowering the memory footprint for a faster memcpy.
As FARSI manages the design space complexity by using (1) agile system-level simulation, (2) architecture awareness, (3) heavy use of co-design, and (4) domain awareness, we quantify the impact and efficacy of each. For all four studies, we set the target SoC power and area budgets to 0.1 W and 100 mm² [56] and distribute them according to [24]. The Audio power budget was estimated based on the system power breakdown of [24], and the CAVA and Edge Detection power budgets were estimated according to their power consumption relative to Audio. System power is then calculated as the sum of all workload components. For the area budget, we use power as a proxy and similarly use the breakdown in [24] to guide the budgeting. These configurations are shown in Table 4a. For the first three studies, we set the individual workloads' latencies to the values provided in [24] and [55]. For the last study, these latencies are intentionally lowered to the workloads' limits. This allows us to stress test FARSI toward exploring all possible optimization opportunities (e.g., TaLP and LLP). Here we quantify the impact of FARSI's agile simulator on the convergence time and fast coverage of the design space. Note that convergence time is the time that it takes for the DSE to meet the design budgets. We envision an SoC that runs all three workloads together (as they almost always are in the AR use case). We compare two DSEs that use the exact same heuristic, but one uses FARSI's simulator and the other uses PA. Note that for the latter, we estimate the convergence time by increasing the time of each iteration by the average slowdown mentioned in Table 4b. This is because, as we will show, actually running PA takes too long and thus is infeasible. Figure 9a shows the difference in the convergence time, with the y-axis and x-axis showing the distance to the goal or budget (Equation 7) and the time to convergence, respectively.
As shown, FARSI takes 3 hours to converge by exploring more than 750 designs, while PA takes more than 3 years. Note that by the time FARSI has converged, PA has only looked at 4 designs. FARSI's simulation agility makes it an ideal candidate for a DSSoC DSE as it enables high coverage of complex/big design spaces. Given the number of knobs across allocation, mapping, and tuning, the design space is too large to be navigated blindly. For example, a system with 6 tasks and 2 knobs per processor/memory/NoC results in more than a million design points, and our AR complex with more than 28 tasks results in a space greater than the number of stars in the universe. Thus, optimal navigation demands architectural insights for guidance. To highlight the importance of architecture awareness, we compare FARSI against a simple simulated annealer (SA) that generates neighbors at random. We further vary the awareness level and show its impact on convergence. From least to most aware, SA uses random neighbor generation, Task-aware uses bottleneck analysis to select the task for such generation, Task&Block-aware uses architectural reasoning to select the task and hardware block, and finally, FARSI uses architectural insights to select a task, a block, and an optimization move. Figure 9b shows the convergence rate of each DSE, with the x-axis and y-axis denoting the iteration count and the city block distance to the budget or goal (Equation 7). As seen, our two least aware DSEs, i.e., SA and Task-aware, do not fully converge, as they get stuck without finding an improvement. These DSEs are, on average, about 16X and 6.5X slower than FARSI to reach the same distance. Task&Block-aware adds block selection awareness and thus lowers the convergence difference to 2X. We believe these differences only worsen as the workload set and complexity grow, and thus the architecture community's insights must be embedded into the design automation community's heuristics.
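The combinatorial blow-up can be illustrated with back-of-the-envelope arithmetic. The knob domains below are made up for illustration (4 candidate PEs per task, 2 knobs with 4 settings per block kind), not FARSI's exact configuration:

```python
# Illustrative design-space count for a toy system (assumed knob domains):
# 6 tasks, each mappable to one of 4 processing elements, plus 2 tunable
# knobs with 4 settings each on the processor, memory, and NoC.
mapping_choices = 4 ** 6            # 4096 task-to-PE mapping combinations
tuning_choices = (4 ** 2) ** 3      # (2 knobs x 4 values) for PE/memory/NoC
total = mapping_choices * tuning_choices
# total = 16,777,216 design points -- well past a million for a toy system,
# before allocation choices (block counts) are even considered
```

Even this understated count ignores allocation knobs, which multiply the space further; exhaustive sweeps are clearly infeasible.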
A DSE without co-design cannot exploit the cross-boundary optimization opportunities necessary for convergence. For example, without cross-workload co-design, a DSE misses out on the area reduction enabled by inter-workload memory sharing. FARSI uses co-design by not fixating on one optimization for too long and instead continuously moving from one optimization to another. Our co-design occurs within each of the following vectors: (1) across metrics (i.e., among power, performance, area), (2) across high-level optimizations (i.e., among mapping/allocation/customization) and across low-level optimizations (frequency modulation, bandwidth modulation, hardening, etc.), (3) across computation/communication, and (4) across workloads. Here we quantify their impact. As an example, Figure 10a shows FARSI's co-design progression across two of these vectors, i.e., metrics and workloads. The x-axis and y-axis show the iteration count and each metric's distance to the goal (log-scaled), respectively. To emphasize when FARSI targets a metric/workload, we make the color of its curve bold; otherwise, we make it see-through. When the budget is met (distance=0), in the plot we set it to 10^-2 (as 0 cannot be shown on a log scale). As shown in the figure, the exploration goes through various zones (A through H), each with a different exploration focus. For example, zone A mainly targets latency and exploits cross-workload opportunities between Edge Detection and CAVA. Zone C shifts its focus toward system power and area co-design. Note that Audio latency is also targeted (although already met) as it can be increased to help with power/area. Finally, FARSI splits its attention between latency (of Audio and CAVA) and system area in zone E.
Fig. 10. (a) Co-design over time across metrics and workloads. (b) Co-design deployment rate. (c) Convergence rate resulting from co-design.
Note that different
metrics meet their budget (reach 10^-2) at different times throughout the exploration, but a continuous trade-off between them is exploited until all distances are zero. Such co-design exploitation is a significant contributor to the fast convergence previously shown in Figure 9b. Note that Figure 11 shows the normalized iteration breakdown of our other co-design vector, namely high- and low-level optimizations. However, their co-design progression plots are left out due to space limitations. Figure 10b shows how often (the deployment rate) co-design occurs within each of our vectors. For example, a co-design rate of 0.2 for workloads means that FARSI changed its focus from one workload to another 2 times every 10 iterations on average. Note that these rates are not tuned by the tool user but determined dynamically by FARSI as it explores the space. For example, if FARSI sees that one workload demands more attention (i.e., its distance to the PPA budget is high), it will select and focus on it for long before switching to others. FARSI applies co-design across high-level (mapping, allocation, customization) and low-level (frequency modulation, memory allocation, etc.) optimization options the most. Concretely, it changes its focus between these options around every 4-5 iterations. The workload co-design rate is lower, indicating a difference in each workload's convergence difficulty. Finally, FARSI applies an even longer attention span (>10 iterations) to computation and communication, indicating their higher imbalance. Note that SA deploys a higher co-design rate for all vectors as it randomly selects neighbors; however, we show next that this strategy is not optimal. To quantify the impact of the said co-design approaches, we measure the average convergence rate (i.e., the average improvement in distance per iteration) resulting from each.
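A deployment rate like the one reported in Figure 10b can be computed from a trace of the explorer's per-iteration focus. This is a minimal sketch with a made-up focus trace; how FARSI actually records its focus is not specified here:

```python
def codesign_rate(focus_trace):
    """Fraction of iteration-to-iteration transitions where the exploration
    focus switches (e.g., from one workload, metric, or vector to another)."""
    switches = sum(1 for prev, cur in zip(focus_trace, focus_trace[1:])
                   if prev != cur)
    return switches / max(len(focus_trace) - 1, 1)

# Illustrative per-iteration workload focus over 11 iterations:
trace = ["audio", "audio", "cava", "cava", "cava", "edge",
         "audio", "audio", "audio", "audio", "cava"]
rate = codesign_rate(trace)   # 4 switches over 10 transitions -> 0.4
```

A rate of 0.2 under this definition corresponds to the text's "2 focus changes every 10 iterations on average."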
Figure 10c reveals that computation/communication co-design leads to the highest improvement, i.e., an average convergence rate of 35%. Overall, on average, co-design improves the distance by 32%. Note that SA's convergence rate due to co-design is very small, in fact negative, as random switching between, for example, workloads can hurt the distance. This shows that a principled co-design is necessary. DSEs that target domain-specific SoCs need to be domain aware, i.e., (1) extract workloads' characteristics and then (2) direct their optimizations to exploit the workloads' inherent opportunities (e.g., TaLP). Here, we first detail our workloads' characteristics and then showcase FARSI's awareness of them. We run each workload individually and provide FARSI's response to it. Figure 12 (table to the right) sheds light on each workload's computation and communication characteristics, concretely, parallelism (loop- and task-level) and data movement. Edge Detection (ED) has the highest LLP and data movement, and Audio has the highest TaLP. Figure 12 (left) further details the computation/communication boundedness of each workload. The y-axis shows the (normalized) breakdown of the hardware bottlenecks FARSI encounters during exploration. This information should guide the DSE in selecting its target stage (i.e., communication or computation). CAVA shows the strongest tendency toward computation boundedness, while the other two show more balanced boundedness. Figure 13 illustrates FARSI's response to the above characteristics. Concretely, Figure 13a shows FARSI's use of parallelism (in the number of iterations). FARSI applies task-level parallelism to Audio the most and to CAVA the least, as Audio's TDG provides the highest TaLP opportunities and CAVA provides none. FARSI targets loop-level parallelism for Edge Detection more than for Audio, in correlation with their LLP. Note that FARSI's excessive use of LLP for CAVA is inevitable as it has no other parallelism option.
Moreover, Figure 13b shows FARSI's relative focus on computation and communication as a response to these bottlenecks. CAVA has a high computation-bound tendency, and thus FARSI targets its computation the most, while a more balanced approach is used for the other two according to their needs. FARSI's focus on communication is correlated with each workload's data movement, with Edge Detection having the highest (7 MB on average) and Audio the least (0.18 MB on average). To show FARSI's capability in addressing the design challenges of complex SoCs, we provide two case studies. First, we show how FARSI's cost-aware methodology balances "system complexity" (and thus the development effort) with the product quality. Then, we shed light on the inefficiencies of the "divide and conquer" approach often used to tame the complexity and show that FARSI can provide more optimal designs. For both case studies, we assume a system that runs all three workloads simultaneously. Since a DSSoC requires a high development effort, an optimal DSE framework must use various trade-offs to lower this design effort. An example of such a trade-off is between product quality and development effort. Concretely, an optimal DSE must lower the development effort whenever the product quality requirement is relaxed. This study showcases FARSI's capability in this balancing act. To this end, we proxy product quality with the performance/power/area (PPA) budgets. This is because, for example in AR, the performance budget, e.g., frame rate, directly impacts user experience as it determines how smooth user-world interactions are; the power budget impacts battery life and thus determines whether AR glasses can be reasonably operational; and finally, area impacts the die cost and thus the overall product cost. We also proxy the development effort by the number of hardware blocks in the system and their variations (heterogeneity), as an increase in either magnifies the development effort.
We relax the Section 5 PPA budgets (shown in Table 4a) by 1X, 2X, and 4X, run FARSI for each budget, and illustrate how FARSI lowers hardware block counts and variations when faced with larger budgets (e.g., the 4X budget). Component Implications: Figures 14a and 14b show the impact of different budgets on the IP and Network-on-a-Chip (NoC) subsystems. As shown, increasing the latency budget from 1X to 4X reduces the IP and NoC counts by 17% and 150%, respectively. This shows that FARSI lowers the block count when system budgets are relaxed. In addition, this indicates that in delivering performance, NoC design/integration should be prioritized, as performance has a higher sensitivity to the NoC count than the IP count. We also observe that a power budget increase from 1X to 4X results in a 36% and 59% reduction in IP and NoC counts. This means that the IP's impact on power reduction is higher than its impact on performance; however, the reverse is true for NoCs. We also do not see a meaningful relationship with the area. Figure 14c shows memory size sensitivity to the budgets. Increasing the latency and power budgets from 1X to 4X results in a total on-chip scratchpad memory size reduction of 28% and 21%, respectively. This is because the relaxed budget can tolerate the higher global data movement latency and energy of DRAM. However, for the same budget increase in area, the (low-density) SRAM area can be freely increased by 25% to move the data back locally. System designers can use FARSI to conduct such studies and find optimal memory allocations across SRAM and DRAM. System designers can also use FARSI to investigate the impact of the budget on NoC frequency. Figure 14d details this sensitivity. Concretely, we observe that as the latency budget is relaxed to 4X, the NoC frequency is scaled down by 360%. This is because lower performance budgets do not require NoCs operating at high frequency.
In contrast, relaxing the power budget by 4X instead scales up the frequency by 70%. This is because, for higher power budgets, the system can tolerate higher-frequency NoCs. Overall, we show that FARSI can exploit system-budget trade-offs and tune the design according to the budget needs. System Heterogeneity: System heterogeneity (block variation) is also a determinant of the design complexity/development effort. To quantify a system variable's heterogeneity (e.g., scratchpad size variation), we use the coefficient of variation (CV = σ/μ) [41], which captures the variable's deviation from its mean. A higher CV equates to a higher variation; as an example, in Figure 15b, the height of the leftmost bar, i.e., 1.1, indicates a 110% variation relative to the mean for the scratchpad memory sizes within the system. As shown in Figure 15a, increasing the performance or power budgets by 4X allows for a 450% and 150% reduction in the link (channel) count heterogeneity among NoCs. Simply put, this means a lower variation in the number of links across NoCs. This lowers the design effort, as variation in NoC design increases the design complexity. We also observe that higher NoC customization is necessary for power budget variations compared to performance budget variations. This is because the lowest and the highest power budgets (1X and 4X) tend to cause more heterogeneity than their corresponding latency scaling values. A similar trend can be observed in both memory size (Figure 15b) and NoC frequency heterogeneity (Figure 15c). The former experiences a 118% and 47% variation reduction as a result of relaxing the system performance and power budgets. The latter experiences a 17% and 267% reduction for relaxation of the same metrics. This renders memory variation more critical for latency delivery, while bus frequency variation is more critical for power.
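The heterogeneity metric above (CV = σ/μ) is straightforward to compute; the scratchpad sizes below are made up to illustrate the contrast between a homogeneous and a heterogeneous system:

```python
from statistics import mean, pstdev

def coefficient_of_variation(xs):
    """CV = sigma / mu: dimensionless spread relative to the mean, used here
    as a proxy for system heterogeneity (e.g., scratchpad size variation)."""
    return pstdev(xs) / mean(xs)

# Illustrative scratchpad sizes (KB) for two generated systems:
homogeneous = [64, 64, 64, 64]       # identical blocks -> CV of 0
heterogeneous = [16, 64, 256, 1024]  # widely varying blocks -> CV above 1
```

Because CV is normalized by the mean, it can compare heterogeneity across variables with different units and scales (link counts, memory sizes, frequencies).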
Note that memory size and NoC frequency variations impact complexity, as the former increases the optimization or customization effort, and the latter increases the clock tree and PLL complexity. Overall, we see that FARSI lowers system heterogeneity in response to budget relaxation. In addition, system designers can use FARSI to investigate the impact of product quality on system heterogeneity for different components. This can then guide various product decisions while keeping the development effort in mind. System Dynamics: We provide designers with system dynamics analyses to guide design planning. Here we detail some of these dynamics. Figure 16a quantifies the accelerator-level parallelism, i.e., the average number of accelerators that run in parallel [22]. As shown, increases of 8% and 13% are needed to meet the tighter 1X performance and power budgets. Such parallelism increases the local traffic by 23% and 31% (Figure 16b) and demands a 107% increase in reuse (how many times a memory is re-accessed, measured in bytes) to keep the memory size small, specifically when power efficiency is required (Figure 16c).
Fig. 16. System dynamics ((c) scratchpad memory reuse). FARSI's capability to capture various system dynamics can help system designers make design decisions.
However, to deliver performance, FARSI requires a 65% lower memory reuse to prevent memory contention. Note that the opposite reuse trends between performance and power indicate the importance of memory mapping and its delicate balancing act. Such optimizations can only be achieved using an agile (rather than manual) methodology with a holistic system-level lens to explore sufficient scenarios. Lack of access to holistic design methodologies forces designers to tame the complexity by taking a divide-and-conquer approach.
This means splitting the system into subsystems (one for each workload), imposing ad hoc (estimated) power and area budgets on each, and finally using best efforts to reach them in isolation. This approach is especially common in domains that consist of a diverse set of sub-domains, such as AR (video, audio, graphics, ...); however, as we will show, it can lead to sub-optimal designs due to myopic budget estimations (Problem 1) and myopic optimizations (Problem 2). Problem 1, Myopic Budget Estimation: In the absence of automated DSEs, each workload's power/area budget is decided manually using architects' insights and back-of-the-envelope estimates. If the estimates for a workload are too tight, the entire chip budget needs to be expanded to accommodate the workload's best design. However, if a workload's budgets are too loose, optimization opportunities targeting a tighter budget are left unexploited. (Note that the extra budget could instead be redistributed to the other workloads in need.) We emulate this problem by setting the power budgets according to the isolated power estimates provided in [24], while for the area, we use power as a proxy and budget each workload according to its power ratio relative to the entire system [29]. Note that data from [24] are scaled to 5 nm according to [57], [51], and [49]. Latency values are set similarly to previous sections to ensure a high-quality user experience. We use FARSI to find a design that meets this pre-determined budget for each workload. Problem 2, Myopic Optimizations in Isolation: Focusing on each workload in isolation misses out on cross-workload optimization opportunities such as memory sharing. To isolate this issue and relieve the said budgeting problem (Problem 1), for each workload, we sweep all budget values from 0 to the SoC budget in increments of 5%. Concretely, for each workload in isolation, we run FARSI with the mentioned budget sweep and generate a power/area Pareto front.
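Extracting a power/area Pareto front from the swept designs can be sketched as below. The (power, area) points are random stand-ins for FARSI's outputs, and this brute-force dominance check is an illustration, not FARSI's implementation:

```python
def pareto_front(points):
    """Keep the (power, area) points not dominated by any other point.
    Point q dominates p if q is <= p in both metrics and differs from p.
    Assumes distinct points; O(n^2) brute force for clarity."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# Illustrative (power W, area mm^2) results of a per-workload budget sweep:
designs = [(0.10, 95.0), (0.08, 110.0), (0.12, 90.0), (0.11, 96.0)]
front = pareto_front(designs)
# (0.11, 96.0) is dominated by (0.10, 95.0); the other three survive
```

Combining the per-workload fronts by permutation, as described next, then yields candidate full-SoC designs, but only from points that were optimal in isolation.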
Then, the final SoCs (that run all the workloads) are put together by combining all the permutations of the designs on the workloads' Pareto fronts. System Degradation Associated with Problems 1 and 2: Here we quantify the degradation by comparing the methodology used in Problems 1 and 2 with full-fledged FARSI. Note that, at a high level, full-fledged FARSI automatically solves Problem 1 as it does not require individual workload budgets. Instead, it finds the optimal sub-budgeting as it explores the space. Furthermore, FARSI circumvents Problem 2 by conducting cross-workload optimizations. Figure 17 illustrates the suboptimality of both approaches, with the x-axis and y-axis denoting the power and area associated with the designs generated by each methodology. Myopic Budgeting predictably does the worst, with its point (far top right) experiencing a 56% and 52% power and area degradation compared to FARSI's Pareto-front points (on average). The table shown sheds light on the cause behind this issue by presenting the distance between Myopic Budgeting's best design and its pre-determined budget (normalized and multiplied by 100). Negative values mean the design never met the budget, and thus the budget was too tight, and vice versa. As shown, the power/area budgets for Edge Detection and the area budget for Audio were too tight. However, CAVA's power budget was set too loose and should have been distributed across the other two workloads. The area budget for CAVA was optimal. Myopic Optimization (Myopic Opt) provides suboptimal solutions as well. As shown in Figure 17, both its Pareto fronts and almost all of its generated designs are less optimal compared to full-fledged FARSI. Concretely, points on the Pareto front of this methodology, on average, experience a 27% and 21% power and area degradation compared to FARSI.
Myopic optimization cannot use cross-workload mapping solutions to share memory and save area, or to keep congestion low and improve performance without increasing frequency. Simulators: Many pre-RTL simulators, ranging from accurate yet slow cycle-accurate simulators [11, 12] to fast yet low-fidelity analytical models [9, 13, 21, 54, 58], have been proposed. The former's low agility and the latter's inaccuracy render them suboptimal for DSSoC DSEs. Trace-driven frameworks address some of the said problems, although they focus on individual components, e.g., [30, 50, 52] target accelerators and [10, 46] target memory. In contrast, our work is agile, accurate, and captures system-level dynamics. Although works such as [50, 52] were later augmented to enable system-level modeling, they revert to cycle-accurate, non-agile simulation. Design Space Exploration (DSE) Frameworks: Works such as [14, 17, 25, 34] target DSE for individual components such as the processor, accelerator, or memory. Our work improves on them by providing a holistic system view. Others target only one design stage. For example, [27, 31, 47, 48] target scheduling/runtime resource management, [23, 35, 42] target mapping, and [20, 26, 37] combine allocation and mapping. Our work improves on them by providing multi-stage optimization and three other co-design vectors. SoC DSE has used various heuristics, such as simulated annealing [16], integer linear programming [20], particle swarm [43], genetic algorithms [15, 44], and reinforcement learning [31]; our work improves on them by incorporating architectural reasoning. Note that we are not limited to simulated annealing, as our architectural reasoning can be integrated into other heuristics. This work presents FARSI, a DSSoC DSE equipped with an agile system simulator and an automated heuristic with built-in and expandable architectural reasoning.
We identify critical ingredients of an optimal DSSoC DSE, quantify their impact on the convergence time, and further show FARSI's heavy use of them. We achieve a simulation speedup of 8,400X and an accuracy of 98.5% compared to Platform Architect. We also achieve a 16X convergence speedup by exploiting architectural reasoning and co-design, compared to simulated annealing. We further present two case studies showcasing FARSI's importance in addressing the design challenges of future complex systems. FARSI is an open-source project, and it has been evaluated at a major industry organization for industry-relevant use cases that require DSSoC solutions in highly resource-constrained environments such as AR/VR/MR glasses.
References:
• Ar in glasses
• Logca: A high-level performance model for hardware accelerators
• Synfull: Synthetic traffic models capturing cache coherent behaviour
• Analyzing cuda workloads using a detailed gpu simulator
• The gem5 simulator
• Execution time prediction for energy-efficient hardware accelerators
• Traffic management: A holistic approach to memory placement on numa systems
• Mogac: A multiobjective genetic algorithm for hardware-software cosynthesis of distributed embedded systems
• System level hardware/software partitioning based on simulated annealing and tabu search (Design Automation for Embedded Systems)
• Efficient design space exploration of high performance embedded out-of-order processors
• Multi-variant-based design space exploration for automotive embedded systems
• Gables: A roofline model for mobile socs
• Accelerator-level parallelism (CoRR)
• Optimizing stream program performance on cgra-based systems
• Exploring Extended Reality with ILLIXR: A New Playground for Architecture Research
• Rpstacks-mt: A high-throughput design evaluation methodology for multi-core processors
• Nasa: A generic infrastructure for system-level mp-soc design space exploration
• Schedtask: A hardware-assisted task scheduler
• Augmented reality in medical education?
• Computer architecture techniques for power-efficiency
• Accel-sim: An extensible simulation framework for validated gpu modeling
• Autoscale: Energy efficiency optimization for stochastic edge inference using reinforcement learning
• HPVM: Heterogeneous parallel virtual machine
• Rpstacks: Fast and accurate processor design space exploration using representative stall-event stacks
• Communication-driven task binding for multiprocessor with latency insensitive network-on-chip
• Online education in the post-covid era
• Combined system synthesis and communication architecture exploration for mpsocs
• Sponza scene in Godot with OpenXR addon
• Hp labs: Cacti
• Variation coefficient (Wikipedia)
• Automated memory-aware application distribution for multi-processor system-on-chips
• Discrete particle swarm optimization for multi-objective design space exploration
• Multi-objective design space exploration using genetic algorithms
• Worst case delay analysis for memory interference in multicore systems
• Tangram: Integrated control of heterogeneous computers (Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '52)
• Gloss: Seamless live reconfiguration and reoptimization of stream programs
• Cheetah: Optimizing and accelerating homomorphic encryption for private inference
• gem5-salam: A system architecture for llvm-based accelerator modeling
• TSMC Starts 5-Nanometer Risk Production
• Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures
• Die photo analysis
• A performance analysis framework for identifying potential benefits in gpgpu applications
• Mini - Mixed-Reality Camera
• Oculus chief scientist mike abrash still sees the rosy future through ar/vr glasses
• TSMC 40 nm Technology
• Roofline: An insightful visual performance model for multicore architectures
• Compiler-assisted selection of hardware acceleration candidates from application source code