key: cord-0811175-x2me5npa
authors: Faisal, Faiz Al; Rahman, M. M. Hafizur; Inoguchi, Yasushi
title: 3D-TTN: a power efficient cost effective high performance hierarchical interconnection network for next generation green supercomputer
date: 2021-05-19
journal: Cluster Comput
DOI: 10.1007/s10586-021-03297-1
sha: d04260a0a830615d48014a8a6464896f2e229d8f
doc_id: 811175
cord_uid: x2me5npa

Green computing is an important factor in ensuring the eco-friendly use of computers and their resources. Electric power used in a computer is converted into heat; thus, a system that draws fewer watts also requires less cooling. This lower energy consumption makes the system less costly to run and reduces the environmental impact of powering the computer. One of the most challenging problems for modern green supercomputers is the reduction of their power consumption. Moreover, regular conventional interconnection networks show poor cost performance. Hierarchical interconnection networks (such as 3D-TTN) can be a possible solution to these issues. The main focus of this paper is the estimation of on-chip power usage for 3D-TTN in comparison with various other networks, along with an analysis of static network performance. In our analysis, 3D-TTN requires about 32.48% less router power at the on-chip level and achieves about 21% better diameter and 12% better average distance than the 5D-Torus network. It requires only about 14.43% more router power than the recent hierarchical interconnection network 3D-TESH, yet achieves 23.21% better diameter and 26.3% better average distance. The most distinctive feature of this paper is the static hop-distance and performance-per-watt (power-performance) analysis. According to our power-performance results, 3D-TTN shows better results than the 3D-Mesh, 2D-Mesh, 2D-Torus and 3D-TESH networks even at the lowest network level. Moreover, this paper also provides a static effectiveness analysis, which demonstrates the cost and time efficiency of 3D-TTN.

Green computing covers the design, manufacture, use and disposal of computing equipment in ways that reduce its environmental impact, and reducing electric power consumption is the key to achieving it. In 2016, the supercomputer PEZY-SCnp at RIKEN (Japan) achieved 6673.8 MFLOPS/watt and ranked top of the Green500 list, beating the Sunway TaihuLight at 6051.3 MFLOPS/watt. On the other hand, the demand for exa-scale performance is enormous. Today's molecular research in health (especially the analysis of COVID-19; for example, the world's fastest supercomputer, Fugaku, is being used to screen dozens of possible COVID-19 remedies by analysing more than 2000 drugs [1]), nuclear analysis and organic simulation depend heavily on parallel computers. The Fugaku supercomputer uses the Tofu interconnect with 7,299,072 cores and achieves about 415 PFLOPS while requiring about 28,335 kW of power [2]. Another very interesting topic is the use of supercomputers in smart cities: they could be very useful for food safety inspections, optimizing energy generation, responding to emergency situations and even managing traffic congestion [3, 4]. In addition, modern MPC systems such as the K computer have already achieved 10.51 petaflops with more than 80,000 computing nodes, requiring about 12.6 MW of electrical power with the Tofu interconnect.
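As a quick check on the energy-efficiency theme raised above, the short sketch below converts the quoted peak performance and power figures for Fugaku and the K computer into GFLOPS per watt. Only the input figures come from the text; the helper function and its name are ours, introduced purely for illustration.

```python
# Convert the quoted performance/power figures into GFLOPS per watt.
# Input figures are those cited above; the helper itself is only illustrative.

def gflops_per_watt(pflops: float, power_kw: float) -> float:
    """Energy efficiency in GFLOPS/W from peak performance (PFLOPS) and power (kW)."""
    return (pflops * 1e6) / (power_kw * 1e3)   # PFLOPS -> GFLOPS, kW -> W

if __name__ == "__main__":
    print(f"Fugaku    : {gflops_per_watt(415, 28_335):.2f} GFLOPS/W")    # ~14.6
    print(f"K computer: {gflops_per_watt(10.51, 12_600):.2f} GFLOPS/W")  # ~0.83
```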
Therefore, low energy consumption is the most desired property for next generation supercomputers, while still satisfying the other constraints such as network performance, scalability, throughput and latency. The overall performance as well as the power consumption of MPC systems is heavily affected by the interconnection network and its processing nodes. The interconnection network acts as the communication path between processing nodes as well as memory units [5]; consequently, every MPC system requires an interconnection network. A widely used interconnect for MPC systems is the fat-tree network [6], which raises concerns regarding network performance. In MPC systems, the total number of outgoing links, both on-chip and off-chip, is a major concern because of power usage and latency [7]. Hence, the interconnection pattern of the network topology is a vital issue. Moreover, the network topology at the off-chip level should keep the number of physical outgoing links small to reduce power usage. For example, in modern supercomputers an intra-rack off-chip link requires about 0.0101 W and an inter-rack link about 0.035 W [8].

The remainder of this paper reviews related research, describes the architectural structure of 3D-TTN, reviews the routing algorithm in Sect. ''Routing algorithm for 3D-TTN'', estimates the on-chip power consumption of 3D-TTN, presents the analysis of performance versus power in Sect. ''Power-performance analysis'', shows the cost and time-cost effectiveness of various networks in Sect. ''Cost effectiveness analysis'', and concludes in Sect. ''Conclusion''.

Exa-scale performance is the prime goal for next generation supercomputers, and next generation high performance computing will most likely depend on massively parallel computers. In contrast, sequential computers cannot meet the increased computational demand because of their limited processing capability; for example, according to Geekbench, an Intel Core i9-9980XE achieves about 1360.0 GFLOPS with its 18 cores [9]. Flat networks such as torus networks show better performance than mesh networks [10]. However, a torus network consumes more static electric power than a mesh network because of the extra wrap-around connections. One of the probable solutions for reducing power consumption while maintaining stable network performance is to use Hierarchical Interconnection Networks (HIN) [11], undirected interconnection networks [12] or multistage interconnection networks [13, 14]. Many HINs have already been introduced, such as TESH [15], 3D-TESH [5] and TTN [16], but they are unable to match the performance of the 5D-Torus network; this paper focuses only on hierarchical networks. Even networks such as 3D-TESH have proven power efficient, but fall behind the 5D-Torus in performance because of their mesh connections at the on-chip level. Furthermore, torus networks are more performance efficient than mesh networks, which is our key motivation for a new network. Hence, in this paper we present a detailed analysis of a hierarchical interconnection network, 3D-TTN (Three-Dimensional Tori-connected Torus Network), first introduced in 2016 [17]; the earlier work focused on power usage and topological analysis rather than network performance and power-performance comparisons.
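As a rough illustration of why the off-chip link count matters, the sketch below estimates total off-chip link power from the per-link figures quoted above [8]. The link counts used in the example are hypothetical placeholders, not values taken from the paper.

```python
# Rough estimate of off-chip link power from the per-link figures quoted above [8].
# The link counts in the example are hypothetical placeholders for illustration only.
INTRA_RACK_W = 0.0101  # watts per intra-rack off-chip link
INTER_RACK_W = 0.035   # watts per inter-rack off-chip link

def offchip_link_power(n_intra: int, n_inter: int) -> float:
    """Total off-chip link power in watts for the given link counts."""
    return n_intra * INTRA_RACK_W + n_inter * INTER_RACK_W

if __name__ == "__main__":
    # e.g. 10,000 intra-rack and 2,000 inter-rack links (assumed numbers)
    print(f"{offchip_link_power(10_000, 2_000):.1f} W")
```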
Hierarchical interconnection networks are one of the probable solutions for obtaining a low-power network while maintaining suitable network performance and a high degree of scalability. 3D-TTN is a HIN that contains multiple basic modules (BMs), which are hierarchically interconnected to form the higher levels [17].

Definition: A BM of the 3D-TTN(m, L, q) network is similar to a 3D-torus network and consists of $2^{3m}$ connected processing elements (PEs), with $2^m$ rows and $2^m \times 2^m$ columns, where m is a positive integer, L denotes the level of hierarchy and q the inter-level connectivity. The lowest level network of 3D-TTN is called the Basic Module (BM). A $(2^m \times 2^m \times 2^m)$ BM of 3D-TTN has $2^{2m+2}$ free ports for the higher levels of the interconnection hierarchy. Each BM uses $2^m \times 4 \times 2^q = 2^{m+q+2}$ of its free links for the upper level networks, where $2(2^{m+q})$ free links are used for the vertical connections and $2(2^{m+q})$ for the horizontal connections. Here, q is the inter-level connectivity (q = 0, 1, ..., m). Consequently, according to Fig. 1, a $(4 \times 4 \times 4)$ BM has $2^{2\times2+2} = 64$ free ports. Moreover, in Fig. 1, node (0,0,0) has two off-chip connections: the level-2 vertical-in and vertical-out connections. Similarly, nodes (0,0,1), (0,0,2) and (0,0,3) have both the level-2 vertical-in and vertical-out connections, and nodes (1,0,1), (1,0,2) and (1,0,3) have the same off-chip connection as node (1,0,0), the level-4 vertical-in connection. Some of those links have been omitted from Fig. 1 to reduce figure complexity. In this paper, we particularly analyze the network class 3D-TTN(2, L, 0).

The higher levels of 3D-TTN follow the recursive structural pattern of the immediately lower level of sub-networks (with the 3D-torus at the on-chip level). Therefore, level-1 networks are used to construct a level-2 network. Figure 2 illustrates the higher-level interconnection of 3D-TTN. For example, a level-2 network can be built from $2^{2\times2} = 16$ BMs (16 level-1 3D-TTNs). Based on Table 1, the total number of nodes in a level-2 3D-TTN(2, 2, 0) network is $N = 2^8 \times 2^2 = 1024$. The highest level of a 3D-TTN built on a $(2^m \times 2^m \times 2^m)$ BM is $L_{max} = 2^{m-q} + 1$ (with m = 2 and q = 0, $L_{max} = 5$, and in this case the total number of nodes is $N = 2^{2\times2\times5} \times 2^2 = 4{,}194{,}304$). Table 1 generalizes the various parameters of 3D-TTN.

Node addressing for 3D-TTN requires a three-digit combination at the BM level and a two-digit combination at the higher levels. At the BM level, the first digit is the Y-index, the second the X-index and the third the Z-index; at the higher levels, the first digit represents the Y-index and the second the X-index. In general, the node address of a level-L 3D-TTN(m, L, q) is represented by $A = (a_{2L}, a_{2L-1})(a_{2L-2}, a_{2L-3}) \cdots (a_4, a_3)(a_2, a_1, a_0)$. Here, the triple $(a_2, a_1, a_0)$ addresses the lowest level 3D-TTN, where $a_0$ is the z-axis index, $a_1$ the x-axis index and $a_2$ the y-axis index of the level-1 network. The upper levels, from level-2 to level-5, are two-dimensional structures.
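To make the size formulas above concrete, the following sketch computes the basic parameters of 3D-TTN(m, L, q) directly from the expressions given in this section (nodes per BM, free ports, links consumed per upper level, maximum level and total node count). The function names are ours, introduced only for illustration; the formulas themselves come from the text.

```python
# Illustrative sketch of the 3D-TTN(m, L, q) size formulas quoted in this section.
# Function names are hypothetical; only the formulas come from the text.

def bm_nodes(m: int) -> int:
    """Nodes in one (2^m x 2^m x 2^m) basic module: 2^(3m)."""
    return 2 ** (3 * m)

def bm_free_ports(m: int) -> int:
    """Free ports of a BM available for the higher-level hierarchy: 2^(2m+2)."""
    return 2 ** (2 * m + 2)

def bm_used_links(m: int, q: int) -> int:
    """Free links a BM spends on an upper level: 2^m * 4 * 2^q = 2^(m+q+2)."""
    return 2 ** (m + q + 2)

def max_level(m: int, q: int) -> int:
    """Highest constructible level: L_max = 2^(m-q) + 1."""
    return 2 ** (m - q) + 1

def total_nodes(m: int, level: int) -> int:
    """Total nodes of a level-L 3D-TTN: each level multiplies the count by 2^(2m)."""
    return bm_nodes(m) * (2 ** (2 * m)) ** (level - 1)

if __name__ == "__main__":
    m, q = 2, 0
    print(bm_nodes(m))        # 64 nodes per BM
    print(bm_free_ports(m))   # 64 free ports, as in Fig. 1
    print(bm_used_links(m, q))# 16 links used per upper level
    print(max_level(m, q))    # 5
    print(total_nodes(m, 2))  # 1024, as in the level-2 example
    print(total_nodes(m, 5))  # 4,194,304 at the highest level
```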
Now, consider a connection path from a source node $n_1 = [(s_{2L}, s_{2L-1}) \cdots (s_4, s_3)(s_2, s_1, s_0)]$ to a destination node $n_2$ in another BM; the two nodes are connected if the corresponding level connection conditions are satisfied for $n_2$ (when m = 2, q = 0). Here, the addressing for 3D-TTN has been defined for level-1 to level-3 networks; the upper-level interconnections, as well as increased inter-level connectivity with q = 1 or q = 2, can be defined similarly.

A simple deterministic dimension-order routing (DOR) algorithm has been adopted for 3D-TTN [17]. In dimension-order routing, a packet first checks whether the destination lies within the same BM. If the packet is destined for another BM, the source node forwards it to the outlet_node that connects to the outer BM, where routing continues. The function SP_routing finds the shortest route at the higher levels. For a source node s, the routing tag is defined as $t = [(t_{2L}, t_{2L-1}) \cdots (t_4, t_3)(t_2, t_1, t_0)]$. Algorithm 4.1 shows the routing algorithm for 3D-TTN, and the packet-routing flowchart in Fig. 3 shows the step-by-step routing at each level. The routing algorithm takes the source and destination node addresses, and each tag value is computed from $s_i$ and $d_i$. If the tag value $t_i$ is non-zero, upper-level routing is required and the packet is moved to the next BM according to the value of routedir. Whenever a packet reaches a new node, the tag values and the current source address are updated. This process continues over all values of the level index i, and finally the routing is completed within the destination BM.

Reducing power usage is the most desirable target for supercomputers. The Sunway TaihuLight system achieves about 93 PFLOPS (requiring about 15.3 MW of electrical power) with about 10.65 M cores, using a 2D mesh network to interconnect its cores [18]. This section considers only the on-chip electrical power analysis. The power requirement of the on-chip network can be up to 50% of the total chip power usage [19]; hence, this paper focuses only on on-chip power estimation for 3D-TTN. The on-chip network power consumption has been estimated from leakage and dynamic power models for both links and routers using an on-chip power simulator. At the on-chip level, we have therefore considered all interior links of the 3D-TTN BM. One of the interesting features of this paper is the power estimation of the 5D-Torus network (used in the Blue Gene/Q supercomputer) [6]. Another attractive feature is the performance per watt of 3D-TTN, which has also been evaluated from the on-chip power usage. The on-chip power model in this paper uses the Orion energy model [20] with a 65 nm fabrication process for 3D-TTN and the other networks. To simulate the on-chip model, we used the GARNET network simulator [21] together with the Orion energy model. To integrate the network simulation with the Orion energy evaluation, we used gem5 [22], a full-system simulator designed for computer architecture research; gem5 requires a Unix platform to build and run simulations [22]. Dynamic power and leakage power are the main sources of power consumption.
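Returning to the dimension-order routing described above, the listing below is a minimal illustrative sketch of the DOR idea inside a single $(4 \times 4 \times 4)$ torus BM. It is not the paper's Algorithm 4.1: the hierarchical steps (outlet_node selection and SP_routing at the higher levels) are omitted, and the per-dimension order used here is an assumption for illustration only.

```python
# Minimal sketch of dimension-order routing (DOR) inside one 2^m x 2^m x 2^m
# torus BM. Illustration only; the paper's Algorithm 4.1 additionally handles
# the higher-level (inter-BM) routing via outlet nodes and SP_routing.

def torus_dor_path(src, dst, size=4):
    """Return the list of nodes visited from src to dst, resolving one
    dimension at a time and using the shorter wrap-around direction."""
    path = [tuple(src)]
    cur = list(src)
    for dim in (2, 1, 0):  # resolve z, then x, then y (illustrative order)
        diff = (dst[dim] - cur[dim]) % size
        step = 1 if diff <= size // 2 else -1   # shorter direction on the ring
        while cur[dim] != dst[dim]:
            cur[dim] = (cur[dim] + step) % size
            path.append(tuple(cur))
    return path

if __name__ == "__main__":
    route = torus_dor_path((0, 0, 0), (3, 1, 2))
    print(route)             # node sequence inside the BM
    print(len(route) - 1)    # hop count (at most 2 hops per dimension when size = 4)
```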
Hence, the router and link power dissipation together account for the total power consumption. The total router energy depends on the read and write operations in the buffers, the activity of the local and global arbiters, and the number of crossbar traversals; Eq. 2 gives the total energy consumption inside the router [21]. The dynamic energy is defined by $E = 0.5\,\alpha C V^2$, where $\alpha$ is the switching activity, C the capacitance and V the supply voltage [20]. The dynamic power of the physical links is evaluated through the charging and discharging of capacitive loads. In CMOS circuits, link power is formulated as $P = E f_{clk}$, where $f_{clk}$ is the clock frequency; hence the link dynamic power is $P_{link} = \alpha C_l V_{dd}^2 f_{clk}$, where $C_l$ is the load capacitance. The static power of the physical links is due to the inserted repeaters. Considering a clock frequency of 1 GHz, a supply voltage of 1.0 V, a 128-bit message size and a uniform traffic pattern with a 2 mm link length, Table 2 summarizes these simulation conditions. Figure 6 shows the per-router power consumption, compared in terms of static and dynamic power. Moreover, the link power is also expected to have a large impact compared with 3D-TTN (router radix 8), since the router radix of the 5D-Torus is 10. Figure 6 shows that, per router, the 4D-Torus requires about 24.21% and the 5D-Torus about 32.48% more electric power than 3D-TTN.

Performance per watt can be one of the most attractive metrics for supercomputers. Since modern MPCs are strongly constrained by electrical power consumption, performance per watt traces the system performance with respect to power, which is a relatively new metric in the field of interconnection networks. The choice of this parameter comes from the observation that networks with slightly lower performance but much better power efficiency have commonly been rejected. Another motivation comes from the Green500 list, where supercomputers are ranked by their energy efficiency in gigaflops per watt [23]. By definition, performance per watt is the ratio of the network performance to the total power consumption. The performance can be the static or dynamic network performance of the corresponding network; in this paper we consider only the static network performance when analyzing the performance per watt.

$\text{Performance per watt} = \dfrac{\text{Achieved Performance}}{\text{Total Power Usage}}$ (3)

In this section, the diameter performance is evaluated against the total router power consumption. In message passing, a source node must communicate over a certain route to transmit data to a destination node, which may not be directly connected. The shortest routed path is desirable in interconnection networks, since a longer routing path increases the communication delay. The diameter is the maximum inter-node distance over all distinct pairs of nodes along their shortest paths, and an interconnection network with a smaller diameter is preferable [24].
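The diameter (and, later in this section, the average distance) are static hop-distance metrics. The sketch below shows how both can be computed by breadth-first search for a plain $(4 \times 4 \times 4)$ torus, the shape of a single 3D-TTN BM. It is a generic illustration and does not reproduce the paper's Eq. 4, which additionally accounts for the hierarchical inter-BM hops of the full 3D-TTN.

```python
# Static hop-distance metrics (diameter and average distance) via breadth-first
# search on a plain 4 x 4 x 4 torus, i.e. the shape of a single 3D-TTN BM.
# Generic illustration only; Eq. 4 of the paper adds the hierarchical inter-BM hops.
from collections import deque
from itertools import product

def torus_neighbors(node, size):
    """The six torus neighbours of a node, with wrap-around in each dimension."""
    for dim in range(3):
        for step in (-1, 1):
            nxt = list(node)
            nxt[dim] = (nxt[dim] + step) % size
            yield tuple(nxt)

def hop_distances(size=4):
    """Return (diameter, average distance) over all distinct node pairs."""
    nodes = list(product(range(size), repeat=3))
    diameter, total, pairs = 0, 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # BFS from src
            cur = queue.popleft()
            for nb in torus_neighbors(cur, size):
                if nb not in dist:
                    dist[nb] = dist[cur] + 1
                    queue.append(nb)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
        pairs += len(nodes) - 1
    return diameter, total / pairs

if __name__ == "__main__":
    d, avg = hop_distances(size=4)
    print(d, round(avg, 3))   # 6 and about 3.048 for a 4 x 4 x 4 torus
```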
Equation 4 can be used to evaluate the diameter performance of 3D-TTN(2, L, 0). Here, $D_z$ is the total path required in the Z-direction, $D_s$ is the distance to reach the outgoing node of the highest level, $D_{si}$ is the distance to move to the next level of routing, $D_i$ is the distance of the corresponding level routing, and $D_d$ is the distance from the level-2 receiving node to the destination node. Table 3 shows the calculated values of Eq. 4. Figure 7 shows the diameter of 3D-TTN, which is much better than that of the 2D or 3D mesh and torus networks. It outperforms 3D-TESH [5] and, according to our analysis, achieves about 21% better diameter at the maximum level than the notable recent interconnection network, the 5D-Torus [25]. Equation 5 defines the diameter performance per watt with respect to router power; the total router power usage consists of the router leakage power, dynamic power and clock power obtained from Fig. 5. Equation 6 defines the diameter performance per watt with respect to link power (under the same simulation conditions as Table 2). We show the same analysis for the 4D-Torus with 256 nodes and the 5D-Torus with 512 nodes (their lowest network levels), and with such high node counts the 4DT and 5DT obviously outperform the others. Since a small diameter is preferable, a network with a lower diameter-per-watt value is more desirable. Moreover, the 2D-Mesh network (used in the Sunway TaihuLight system, which has 10,649,600 cores and achieves about 6 GFLOPS/W with an 8 × 8 mesh network [26]) shows worse diameter performance per watt than 3D-TTN.

For interconnection networks, a low average distance is even more important than the diameter because of communication patterns in which every node needs to communicate with every other node [27]. The average distance is the mean distance over all distinct pairs of nodes in a network; networks are therefore expected to have a small average distance. The average distance of 3D-TTN(2, L, 0) is shown in Fig. 10, which confirms that the average distance of our network is far better than that of conventional 2D or 3D networks; compared with the 5D-Torus, it is about 12% better at the maximum level. The average distance performance per watt for MPC systems is defined as the achieved average distance over the total power usage. Equation 7 gives the average distance performance per watt with respect to the total router power, and Eq. 8 with respect to the total link power (under the same simulation conditions as Table 2).

Speedup and efficiency are the usual parameters for the performance evaluation of MPC systems. However, the number of communication links is a major concern for MPC systems because of its impact on total system cost: the system cost depends not only on the number of processors but also on the communication links [28]. Hence, the cost-effectiveness factor, which accounts for the system cost through the communication links, is useful for MPCs. The cost-effectiveness factor (CEF) of 3D-TTN is defined by Eq. 9. Let the cost of a single processor, including its processing unit, control unit and memory unit, be $C_p$, and let $C_L$ be the cost of a single communication link; then $\rho$ is defined as the ratio of $C_L$ to $C_p$.
Figure 13 shows the CEF for various networks (for $\rho$ = 0.1), which indicates that the cost-effectiveness factor of 3D-TTN is better than that of the 2D and 3D mesh and torus networks and slightly worse than that of the 4DT and 5DT networks, which have obviously higher wiring complexity. It also outperforms the 3D-TESH network by a big margin. Figure 14 shows the CEF for various networks with variable $\rho$; again, 3D-TTN is better than the 2D and 3D mesh and torus networks, but slightly worse than the 4DT and 5DT networks because of the lower wiring complexity of 3D-TTN.

The time efficiency of an MPC system for any kind of program can be assessed by the time-cost-effectiveness factor (TCEF) [28]. For MPC systems, a faster solution can be more desirable than low cost alone; hence, TCEF is a useful parameter for characterizing an interconnection network. The TCEF of 3D-TTN is given in Eq. 10, where $\rho$ is the ratio of $C_L$ to $C_p$, $T_1$ is the time for a single processor to solve a problem, $T_p$ is the time required by p processing nodes to solve the same problem, and a linear time penalty term is included in $T_p$. Figure 15 shows the TCEF for 3D-TTN, which is better than that of any 2D or 3D mesh and torus network and slightly worse than that of the 4D and 5D networks. It even outperforms other hierarchical interconnection networks such as 3D-TESH(2, L, 0). Since TCEF accounts for the time to solve a problem, 3D-TTN can produce a faster solution together with increased benefit. Similar to Fig. 14, Fig. 16 shows the TCEF analysis with variable $\rho$ values, which depicts 3D-TTN as an obvious choice over the other 2D and 3D networks and as highly comparable to the 4D torus network despite its much lower wiring complexity.

In conclusion, this paper makes three contributions: power analysis, static hop-distance and performance-per-watt analysis, and static cost-effectiveness analysis. Our main objective was to find an interconnection network that achieves high performance while reducing the power usage of MPC systems, and to introduce a parameter (performance per watt) into the field of interconnection networks that relates system performance to power usage. The power efficiency of 3D-TTN has been compared with various networks, showing that 3D-TTN requires 24.21% less router power than the 4D-Torus network and 32.48% less than the 5D-Torus network (Sect. ''Estimation of power consumption''), while requiring only about 14.43% more router power than the 3D-TESH network. Considering the power-performance analysis for diameter and average distance, the 2D-Mesh (69.74% worse router diameter power-performance), the 2D-Torus (43.72% worse) and even the 3D-TESH (34.46% worse) all show worse results than 3D-TTN (Sect. ''Power-performance analysis''). On the other hand, the 4D-Torus and 5D-Torus networks, with 256 and 512 nodes at their lowest network level compared with only 64 nodes for 3D-TTN, will naturally show better power-performance at the cost of their higher power usage. Moreover, this research considers only on-chip power usage.
As 3D-TTN shows better performance at the higher levels, it is expected to show much better performance per watt for the upper-level networks as well. Concerning the diameter and average distance analysis, 3D-TTN outperforms the recent HIN network 3D-TESH by a big margin in diameter (23.21%) and average distance (26.3%). It achieves about 53% better diameter and 47% better average distance than the 4D-Torus network at over 4 million nodes (Sect. ''Power-performance analysis''), and outperforms the 5D-Torus by about 21% in diameter and 12% in average distance at over 4 million nodes. Considering the cost-effectiveness (CEF) and time-cost-effectiveness (TCEF) parameters, 3D-TTN is clearly a better choice than the 2D and 3D mesh and torus networks, as well as the 3D-TESH network (Sect. ''Cost effectiveness analysis''). In addition, despite having more communication links than the 3D-TESH network, 3D-TTN also shows better CEF and TCEF results than the 2D and 3D mesh and torus networks and 3D-TESH for variable $\rho$ values. On the other hand, with 16,384 nodes 3D-TTN has a wiring complexity of 53,248 links, whereas the 4D-Torus network requires 58,564 links for only 14,641 nodes, i.e., about 9.08% more links with 1,743 fewer nodes. This explains the slightly better CEF and TCEF of the 4DT and 5DT networks over 3D-TTN. Issues for future work include: (1) evaluation of dynamic network performance, (2) fault-tolerance analysis, and (3) assessment of the performance improvement of 3D-TTN with an adaptive routing algorithm.

References (titles as recovered; reference numbers correspond to the inline citations)
DREAM: online control mechanisms for data aggregation error minimization in privacy-preserving crowdsensing
An incentive mechanism for privacy-preserving crowdsensing via deep reinforcement learning
A new power efficient high performance interconnection network for many-core processors
IBM Research: 3-D topologies for networks-on-chip
Silicon photonics for extreme scale computing
An adaptive routing of the 2-D torus network based on turn model
HTN: a new hierarchical interconnection network for massively parallel computers
A generalization of block structures in undirected interconnection networks
Network reliability evaluation and analysis of multistage interconnection networks
Multistage Interconnection Networks (MINs)
TESH: a new hierarchical interconnection network for massively parallel computing
High and stable performance under adverse traffic patterns of tori-connected torus network
Power analysis with variable traffic loads for next generation interconnection networks
Report on the Sunway TaihuLight System
Unique chips and systems
ORION 2.0: a power-area simulator for interconnection networks
GARNET: a detailed on-chip network model inside a full-system simulator
The gem5 simulator: computer-system architecture platform
Symmetric and folded tori connected torus network
The IBM Blue Gene/Q interconnection network and message unit
The Sunway TaihuLight supercomputer: system and applications
Topological properties of hierarchical interconnection networks: a review and comparison
Cost and time-cost effectiveness of multiprocessing

Acknowledgements: This research is partly supported by JSPS KAKENHI Grant Number 20K11731. The authors are grateful to the anonymous reviewers for their constructive comments.
Conflict of interest: The authors declare that they have no conflict of interest.