A low power NoC router using the marching memory through type

Ryota Yasudo, Takahiro Kagami, Hideharu Amano, Yasunobu Nakase, Masashi Watanebe, Tsukasa Oishi, Toru Shimizu, Tadao Nakamura

†Keio Univ., JAPAN ‡Renesas Electronics
Tel./Fax.: +81-045-566-1748
E-mail: marching@am.ics.keio.ac.jp

1 Introduction
NoC (Network-on-Chip) is a key component of recent multi-core systems. Unlike traditional bus-connected systems, many cores can be connected through routers that transfer packets. Since the operating frequency of NoCs reaches a few GHz to achieve low-latency data transfer, the NoC sometimes accounts for a considerable part of the system's power consumption. For example, the NoCs of the MIT RAW CMP[1] and the Intel Tera FLOPS processor[2] account for 36% and 28% of the total power, respectively. Since the FIFOs in the router dominate its power consumption, introducing a low-power FIFO is an effective way to reduce the total power of the router. Here, a novel power-efficient “through type” of the marching memory[3] is proposed and used in the router of an NoC.

2 Marching Memory Through Type
Marching Memory[3] (MM) was invented as a novel memory device that avoids the memory bottleneck by being accessed at the same clock cycle as the CPU. MM uses DRAM-based memory cell technology but reorganizes the column-and-row structure of the DRAM. Data are shifted toward the CPU in synchronization with the clock, so the data come to the processor, rather than the CPU accessing the bit lines of the DRAM through precharge and sensing.

Marching Memory Through Type (MMTH) is a novel buffer memory based on the idea of MM, and it is mainly designed for communication buffers, including those in an NoC. Although the structure and target of MMTH are completely different from those of the original MM, it is an application of the MM concept. The characteristic of MMTH is that it uses a kind of asynchronous circuit. Basically, this circuit consists of transparent latches connected in tandem. A set of latches whose size corresponds to the size of a flit composes a column. Figure 1 illustrates the operation of MMTH. Each rectangle represents a column, and the leftmost and rightmost columns are the input port and the output port, respectively. The W-pointer and the R-pointer control the movement of the data.

When MMTH is reset, both pointers indicate the rightmost position. For a write operation, the written data are transferred directly to the position indicated by the W-pointer within one clock cycle, and the W-pointer moves one position to the left at the next clock cycle. For a read operation, the data pointed to by the R-pointer are transferred to the output port, and then the R-pointer moves one position to the left. If the W-pointer moves to the leftmost position, the memory becomes full and no more data can be written. Only the data pointed to by the R-pointer can be read out. Conversely, when the R-pointer indicates the same position as the W-pointer, the memory is empty. When both the W-pointer and the R-pointer reach the leftmost position, the MMTH is reset and both pointers return to the rightmost position.
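The pointer behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the authors' circuit: the class name `MMTHModel`, the convention that index 0 is the rightmost column, and the representation of "leftmost" as a pointer value equal to `depth` are all hypothetical choices made for this sketch. Timing (the one-cycle pointer update and the read delay) is not modeled.

```python
class MMTHModel:
    """Behavioral sketch of MMTH pointer movement (not a circuit model)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.reset()

    def reset(self):
        # Both pointers return to the rightmost position (index 0 here).
        self.columns = [None] * self.depth
        self.w_ptr = 0
        self.r_ptr = 0

    def is_full(self):
        # Assumption: the W-pointer reaching the leftmost end means full.
        return self.w_ptr == self.depth

    def is_empty(self):
        # Empty when the R-pointer has caught up with the W-pointer.
        return self.r_ptr == self.w_ptr

    def write(self, flit):
        """Store a flit at the W-pointer; return False if the memory is full."""
        if self.is_full():
            return False
        self.columns[self.w_ptr] = flit
        self.w_ptr += 1  # move one position to the left
        return True

    def read(self):
        """Return the flit at the R-pointer, or None if the memory is empty."""
        if self.is_empty():
            return None
        flit = self.columns[self.r_ptr]
        self.r_ptr += 1  # move one position to the left
        if self.r_ptr == self.depth:
            # r_ptr never passes w_ptr, so both are at the leftmost end: reset.
            self.reset()
        return flit
```

Note that, unlike a circular FIFO, a column is not reused after it has been read; in this sketch space becomes available again only after a reset.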

Both write and read operations can be performed in the same clock cycle, but the read operation requires some signal delay to transfer the data to the output port. Although the write operation also requires some delay, it does not affect the performance. The delay depends on the operating clock frequency and the size of the memory. For example, when an eight-depth MMTH operates at 2GHz, one clock cycle is needed to read the data.

The memory cell of MMTH is composed of a transparent latch with a transmission gate, as shown in Figure 2, where T and TB are a control signal and its inverse, respectively. If T is asserted, the data pass through the memory cell; otherwise, the data are stored in the cell. This structure reduces power and area, since a simple transparent latch is used and a local clock signal is unnecessary.
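The pass-or-hold behavior of a single cell, and of several cells connected in tandem, can be illustrated as follows. This is a logic-level sketch only: the function names `cell_output` and `propagate` are hypothetical, and how the MMTH controller actually drives each column's T signal is not specified here, so the per-cell T values are simply supplied by the caller.

```python
def cell_output(stored, data_in, t):
    """Transparent latch: pass the input when T is asserted, else hold."""
    return data_in if t else stored


def propagate(stored, data_in, t_signals):
    """Evaluate a chain of latches from the input side to the output side.

    stored    -- current value held by each cell, input side first
    data_in   -- value presented at the input port
    t_signals -- T control signal of each cell (True = transparent)
    Returns the new value of each cell.
    """
    new_values = []
    signal = data_in
    for value, t in zip(stored, t_signals):
        value = cell_output(value, signal, t)
        new_values.append(value)
        signal = value  # a cell's output drives the next cell in the chain
    return new_values
```

For instance, with the first two cells transparent and the third holding, the input reaches the first two cells and the third keeps its stored value.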

The power consumption of the MMTH depends on the difference in bit pattern between the current and the preceding written data. If the same data are written continuously, almost no power is required except for that of the controller. Here, the probability that a bit changes is called the BCR (Bit Change Rate). The power consumption of MMTH changes linearly with the BCR.
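One way to estimate the BCR of a data stream is to count the bits that toggle between consecutive words. The sketch below assumes this definition (toggled bits divided by the total number of bit positions compared); the function name `bit_change_rate` and the default 16-bit width are illustrative choices, not taken from the design.

```python
def bit_change_rate(words, width=16):
    """Fraction of bit positions that toggle between consecutive words."""
    if len(words) < 2:
        return 0.0  # no preceding word to compare against
    mask = (1 << width) - 1
    toggles = sum(
        bin((prev ^ cur) & mask).count("1")  # XOR marks the changed bits
        for prev, cur in zip(words, words[1:])
    )
    return toggles / (width * (len(words) - 1))
```

Under this definition, writing the same word repeatedly gives a BCR of 0, and alternating between a word and its bitwise complement gives a BCR of 1.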

3 Router Architecture using MMTH
A standard router for a mesh network, shown in Figure 3, is assumed. It provides five inputs and five outputs (four for neighboring routers and one for the local node) with two virtual channels. FIFOs using MMTH are placed at the input ports of the router, and virtual cut-through routing is adopted.

Generally, five steps are required to transfer a packet through a router: Routing Computation (RC), Virtual channel Allocation (VA), Switch Allocation (SA), Switch Traversal (ST) and Link Traversal (LT). By using the speculative and look-ahead techniques, Next Routing Computation (NRC), VA and SA can be done in the same clock cycle. Thus, if a standard FIFO is used for the input buffer, a low-latency router with the 3-stage pipeline shown in Figure 4(a), which is illustrated from a router’s perspective, can be designed. However, since an extra clock cycle is needed to read MMTH, a Buffer Read (BR) stage is necessary for the router with MMTH, as shown in Figure 4(b), and the latency of the packet transfer increases. In order to avoid this extra clock delay, we propose a novel router design for MMTH.

Figure 5 illustrates the stage reduction obtained by applying the look-ahead technique, from a header flit’s perspective. The routing information for the next router, computed in the NRC stage, is placed in an additional temporary flit that carries only the routing information. This flit is called the pre-header flit. The pre-header flit bypasses the input buffers and is forwarded directly to the next router, as shown in Figure 5. The next router can start VA and SA during the LT of the header flit of the packet. With this design, the overhead is only one clock cycle, incurred in the destination router.

A reset signal is needed to use MMTH, since the buffer must be reset when a packet is stored in it. Since all flits of a packet are transmitted consecutively in a virtual cut-through router, the reset signal only needs to be asserted after the transmission of a packet finishes. This does not affect performance because the reset takes only one clock cycle.
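The effect of the pipeline depths discussed in this section can be compared with a rough zero-load estimate of header-flit latency. This is a simplified sketch under stated assumptions: it counts only router pipeline cycles (3 per router for the traditional design, 4 per router with the extra BR stage, and 3 per router plus one cycle at the destination router for the look-ahead design) and ignores contention, serialization of body flits, and injection or ejection. The function name and design labels are hypothetical.

```python
# Pipeline depth per router for the designs without the pre-header flit.
PIPELINE_DEPTH = {"traditional": 3, "mmth_naive": 4}


def head_latency_cycles(routers, design):
    """Zero-load header-flit latency in cycles across `routers` routers."""
    if design == "mmth_lookahead":
        # Assumption: BR is hidden at every router except the destination.
        return 3 * routers + 1
    return PIPELINE_DEPTH[design] * routers
```

For a path through four routers, this simple model gives 12, 16 and 13 cycles for the traditional, naive MMTH and look-ahead MMTH designs, respectively.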

4 Evaluation
The router with MMTH is designed with Renesas 40nm CMOS design technology. The structure is the same as that shown in Figure 3. The width of a link is set to 64 bits, and four 16-bit-wide, 8-depth MMTHs are used for each virtual channel. A packet consists of a header flit and six body flits. The commonly used Dimension Order Routing (DOR) is utilized.
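Dimension Order Routing on a 2D mesh can be sketched as follows. The sketch assumes the common XY variant (route along X first, then along Y), coordinates given as `(x, y)` with x increasing to the east and y increasing to the north, and hypothetical port names; none of these conventions are specified by the design above.

```python
# Coordinate change for each output port (assumed orientation).
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def dor_next_port(cur, dst):
    """Choose the output port at router `cur` for a packet going to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:  # finish the X dimension first
        return "EAST" if dx > cx else "WEST"
    if dy != cy:  # then route along the Y dimension
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"  # arrived: deliver to the attached node


def dor_path(src, dst):
    """List the routers visited from `src` to `dst`, inclusive."""
    path = [src]
    cur = src
    while (port := dor_next_port(cur, dst)) != "LOCAL":
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
        path.append(cur)
    return path
```

Because the next port depends only on the current and destination coordinates, the same function can serve as a look-ahead computation by evaluating it for the next router's coordinates.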

4.1 Performance Overhead
First, we investigate the performance degradation caused by using MMTH in the router. Figure 7 shows the average latency versus injected traffic under uniform traffic (a) and bit-complement traffic (b). Although the naive router design with the extra BR stage, shown as “MM(naive)” in the figure, increases the latency by about 10%, the latency overhead of the improved design is only 2%. The saturation traffic, which indicates the bandwidth of the network, is the same in all three designs. Figure 8 shows the results of full-system simulation. We assume a many-core processor with 16 cores connected in a $4 \times 4$ mesh by the router described above. Nine benchmark programs from the NAS Parallel Benchmarks[4] are simulated with the GEM5 full-system simulator[5]. This graph also shows that the performance degradation of the improved router is only 1% to 2%.
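The two synthetic traffic patterns can be illustrated by how each source node picks a destination. This is a sketch using the usual textbook definitions, with assumptions labeled: nodes are numbered 0 to 15, bit-complement sends to the node whose ID is the bitwise complement of the source ID, and uniform traffic picks any other node with equal probability (excluding the source itself is an assumption here). The function names are hypothetical.

```python
import random


def bit_complement_dest(src, num_nodes=16):
    """Destination whose node ID is the bitwise complement of `src`."""
    if num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of two")
    return ~src & (num_nodes - 1)


def uniform_dest(src, num_nodes=16, rng=random):
    """Destination drawn uniformly from all nodes other than `src`."""
    dst = rng.randrange(num_nodes - 1)  # one fewer choice: skip the source
    return dst if dst < src else dst + 1
```

In a 16-node network, node 0 sends to node 15 and node 5 sends to node 10 under bit-complement traffic.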

4.2 Power Consumption
The power consumption of the router with MMTH is evaluated. Apache’s PowerArtist is used for the analysis. The second bar in Figure 9 shows the maximum power consumption (100% BCR) of the router using MMTH when it operates at 2GHz. For comparison, we also evaluate a router with a traditional register-based FIFO, shown as the first bar in Figure 9. Since the standard cells are designed for low energy consumption rather than high-speed operation, the traditional router can only operate at 800MHz. Its dynamic power is therefore scaled to the value it would have at 2GHz.

The figure shows that the router with MMTH reduces the power consumption by 28.8%. Although the power of everything except the input buffers, shown as “The others” in the figure, increases by 13.3% owing to additional control signals and logic, the power of the input buffers, shown as “Input buffers”, decreases by 46.5%. This indicates that the proposed router reduces power even when the BCR is 100%. Since the power consumed by MMTH depends on the BCR of the data, the actual power is computed as follows:

$$(\text{Max. Power} - \text{Min. Power}) \cdot \text{BCR} + \text{Min. Power}$$

Here, Min. Power represents the power consumption of the router excluding the input buffers. The BCR depends on the data exchanged by the application programs. The other bars in Figure 9 show the power consumption taking into account the BCR of the NAS benchmark programs. As shown in the figure, the power reduction increases further to 42.4% on average.
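The linear interpolation above can be written as a small helper. The function name `router_power` is hypothetical, the BCR is expressed as a fraction between 0 and 1, and the numbers used in the usage note are placeholders rather than measured values.

```python
def router_power(max_power, min_power, bcr):
    """Estimated router power for a given bit change rate.

    max_power -- power at 100% BCR
    min_power -- power of the router excluding the input buffers
    bcr       -- bit change rate as a fraction in [0, 1]
    """
    if not 0.0 <= bcr <= 1.0:
        raise ValueError("bcr must be between 0 and 1")
    return (max_power - min_power) * bcr + min_power
```

With placeholder values of 10.0 for Max. Power and 4.0 for Min. Power (arbitrary units), a BCR of 0.5 gives 7.0, while BCRs of 0 and 1 return Min. Power and Max. Power, respectively.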

5 Conclusions
We have shown that the power consumption of a router using MMTH depends on the bit change rate of the data, and that when the NAS parallel benchmarks run on the NoC, it is reduced by 42.4% on average at 2GHz compared with a traditional FIFO implementation. The performance degradation caused by the read delay can be mostly hidden by the look-ahead technique in the router.

Acknowledgement
This work is supported by NEDO under the “Energy-saving innovation Project of Leading Research of Marching Memory realizing the High Speed and the Low Power Dissipation Processing of the Streaming Data”.

References
Figure 1: Structure of Marching Memory Through Type (MMTH)

Figure 2: Memory cell of MMTH

Figure 3: Router Architecture

(a) A traditional router  (b) A router using the MMTH

Figure 4: Pipeline structures

Figure 5: The latency reduction applying the look-ahead technique

Figure 6: The latency reduction scheme

(a) Uniform traffic  (b) Bit-complement traffic

Figure 7: The average latency vs. injected traffic

Figure 8: Execution Time of Full System Simulation

Figure 9: Power Consumption