# POWER OPTIMIZATION CONSIDERING THE CHIP TEMPERATURE OF LOW POWER RECONFIGURABLE ACCELERATOR CMA-SOTB

Yu Fujita, Hayate Okuhara, Koichiro Masuyama, Hideharu Amano

Dept. of ICS, Keio University, Yokohama Japan email: wasmii@am.ics.keio.ac.jp

# ABSTRACT

For low-power yet high-performance processing in battery-driven devices, a coarse-grained reconfigurable accelerator called Cool Mega Array (CMA)-SOTB has been implemented using Silicon on Thin BOX (SOTB), a new process technology developed by the Low-power Electronics Association & Project (LEAP). This chip has three voltages for controlling power and performance: the supply voltage, the PE-array body bias voltage, and the micro-controller body bias voltage. Finding the optimal operating point for a given requirement normally takes a large measurement and adjustment effort. This paper proposes a power model that finds the optimal operating point from a small number of measurement results. With the proposed model, the power can be estimated with an average difference of 4.4% from the measured value. Using the model, the optimal supply voltage and the body bias voltages for the PE array and the micro-controller can be obtained for a given operational frequency. Compared with the result of an exhaustive search, 37.4% of the energy is saved with much less measurement effort.

#### I. INTRODUCTION

Many-core accelerators have become key devices in recent battery-driven mobile devices because of their high energy efficiency. Coarse-grained reconfigurable arrays (CGRAs), which use a large array of processing elements (PEs), can reduce the operational frequency while keeping the total performance, and have therefore received attention as extremely low-power accelerators. However, such a large PE array often introduces a large leakage power.

Silicon-on-Thin BOX (SOTB) CMOS technology developed by LEAP [1] allows transistors to operate at a much lower supply voltage than conventional bulk CMOS transistors by reducing the variation of the threshold voltage. By using body biasing, the leakage current and the operational delay can be controlled over a wide range.

We implemented a coarse-grained reconfigurable accelerator called Cool Mega Array (CMA) [2] by using this SOTB technology; the resulting chip is CMA-SOTB. Like the original CMA, CMA-SOTB has a large PE array and a micro-controller. The PE array consists of combinatorial logic, and the data flow graph of the target application is mapped onto it directly. The micro-controller manages data reads and writes between the inputs/outputs of the PE array and the data memory modules. CMA-SOTB has independent body bias supplies for the PE array and the micro-controller so that performance and leakage power can be balanced in each module according to the arithmetic intensity of the target application. For a computation-intensive application, zero bias or forward bias is given to the PE array to enhance its performance, while reverse bias is given to the micro-controller and data memory. If the target application is bottlenecked by the data transfer between the memory and the PE array, zero or forward bias is given to the micro-controller and memory, while the PE array receives reverse bias to suppress its leakage power without degrading the performance [3].

The problem is how to find the optimal body bias for each component when the target application and the required performance are given. Although an exhaustive search over body bias voltages can find it, such a search is impractical for the many combinations of application programs and performance requirements.

This paper proposes formulas that find the optimal bias voltages for a target application and a required performance from a limited number of measurements. Since temperature is a major parameter, it is also included.

The rest of this paper is organized as follows. SOTB and CMA-SOTB are introduced in Section II, and related work is reviewed in Section III. Section IV describes the fundamental formulas used in the power model. In Sections V and VI, the power model for finding the optimal supply voltage and body bias voltages is proposed, and its parameters are fixed based on a real chip evaluation. Section VII presents case studies in which the optimal voltages and the resulting power consumption are obtained from the power model. They are compared with the results of exhaustive measurements, and the usefulness of the model is shown. The key points are summarized and future work is mentioned in Section VIII.

#### **II. CMA-SOTB**

#### **II-A. SOTB CMOSFET**

The SOTB device used in this study (Fig. 1) is classified as FD-SOI, but its transistors are formed on a thin BOX (Buried Oxide) layer.



Fig. 1. Cross-sectional view of the SOTB Device

Unlike in conventional bulk CMOS, transistors in SOI are formed on top of an insulator (typically $SiO_2$). Because the transistor is surrounded by insulating material, electrical interference does not need to be considered, and the electrical characteristics therefore become sharp [4]. By using both an extremely thin FD-SOI layer and a thin BOX layer, the short channel effect (SCE) can be suppressed. Since no impurity doping of the channel is required, the variation of the threshold voltage caused by random dopant fluctuation (RDF) is suppressed. By doping impurities into the substrate under the thin BOX layer, multi-threshold designs can be realized easily.

The delay and leakage power consumption can be optimized by controlling the bias voltage applied to the body (back gate). Here, we refer to the body bias voltages of the NMOS and PMOS transistors as VBN and VBP, respectively. VBN is applied to the p-well of the NMOS transistor; if VBN = 0, the transistor works with its normal threshold voltage. If reverse bias (VBN < 0) is given, the threshold is raised; thus the leakage current is reduced while the delay is stretched. In contrast, forward bias (VBN > 0) lowers the threshold, which enhances the operational speed at the cost of increased leakage current. For the PMOS transistor, VBP is applied to the n-well; thus, zero bias means $VBP = V_{DD}$. VBP higher than $V_{DD}$ works as reverse bias, and VBP lower than $V_{DD}$ corresponds to forward bias.
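The sign conventions above can be captured in a few lines. The following is a minimal sketch, not part of the chip's tooling; the function name `classify_body_bias` and the tolerance parameter `tol` are assumptions introduced only for illustration.

```python
def classify_body_bias(vbn: float, vbp: float, vdd: float, tol: float = 1e-9) -> dict:
    """Label the NMOS and PMOS body bias as 'zero', 'forward' or 'reverse'.

    Conventions taken from the text:
      NMOS: VBN = 0 is zero bias, VBN > 0 forward, VBN < 0 reverse.
      PMOS: VBP = VDD is zero bias, VBP < VDD forward, VBP > VDD reverse.
    """
    def label(forward_amount: float) -> str:
        # forward_amount > 0 means the threshold is lowered (forward bias).
        if abs(forward_amount) <= tol:
            return "zero"
        return "forward" if forward_amount > 0 else "reverse"

    return {"nmos": label(vbn), "pmos": label(vdd - vbp)}


# Example settings (hypothetical voltages, in volts)
zero = classify_body_bias(vbn=0.0, vbp=0.4, vdd=0.4)
reverse = classify_body_bias(vbn=-0.2, vbp=0.6, vdd=0.4)
forward = classify_body_bias(vbn=0.2, vbp=0.2, vdd=0.4)
```

Expressing the PMOS case as `vdd - vbp` lets both transistor types share one "forward amount" with the same sign convention.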

The characteristics of SOTB are summarized as follows: (1) The junction capacitance of SOI is about 1/10 that of bulk, which makes high-speed operation possible, especially at lower supply voltages. (2) Latch-up, a problem of bulk CMOS, is caused by a parasitic thyristor formed by adjacent transistors; no such thyristor is formed in SOI. (3) Radiation tolerance is high: charge generated in the substrate by incident radiation is blocked by the insulation layer and does not affect the operation of the circuit. (4) Noise propagation (cross-talk) is small because of the insulation.

#### **II-B. The CMA architecture**

A key concept of the CMA architecture is reducing any energy usage other than that required for computation. The PE array is built with combinatorial circuits to eliminate the power needed to store the intermediate results in registers and to distribute a clock to each PE. The dataflow graph of the target application is directly mapped on the PE array. Registers are only provided at the inputs/outputs of the PE array. Computation starts when all data are set up in the input "launch register," and the outputs of the PE array are stored into the "result register" with a certain delay time. The energy overhead caused by glitches in the large combinatorial circuits can be reduced by carefully setting the configuration data of switching element so as not to propagate glitches [2].

A micro-controller flexibly manages the data transfer between the data memory (DMEM) and registers by using mapping registers and vector operations. The aforementioned structure enables the implementation of various application programs without a power hungry dynamic reconfiguration in the PE array.

Another key concept of the CMA architecture is optimizing the energy of each target application by balancing the performance of the PE array and the micro-controller. For an application with a high degree of arithmetic intensity, the power budget is spent on enhancing the performance of the PE array, while the power of the micro-controller is lowered. However, when the application requires many data sets for a computation, the power budget is spent on the micro-controller, which manages the data transfer between the data memory and the launch/result registers. In the first prototype, CMA-1 [2][5], the supply voltage is changed independently for the PE array and the micro-controller.

#### **II-C. CMA-SOTB**

Fig. 2 shows the block diagram of the CMA-SOTB, a prototype CMA architecture using SOTB technology [3]. A PE consists of a simple 24-bit ALU that executes multiply, add, subtract, shift, and logic operations, and a switching element (SE). It has an  $8 \times 8$  PE array connected with a network using a two-channel island-style interconnection and direct links that connect to the north-east and east of the PE. The SEs transfer the input data from the PE in the south, west, and east of the PE and the output data of the ALU to the PE in the appropriate direction according to the configuration data.

The micro-controller is a tiny microprocessor that executes a 14-bit micro-code stored in 128-entry micromemory. It has 16 general purpose registers and 8 special purpose registers storing pointers of DMEM, bit-map vectors, and stride lengths for a stride data transfer. It reads eight data from the DMEM and sets the launch register with a single instruction. A dedicated memory controller triggered with the instruction executes the data transfer with eight clock cycles. Also, the data in the result register can be written back to the DMEM with a single instruction



Fig. 2. Block diagram of CMA-SOTB

handled by another controller. Because the DMEM is a single-read/single-write dual-port memory, reading and writing can be done in parallel. Two banks of 256-entry 24-bit dual-port memory are provided to overlap streaming data input/output with computation. Since an SRAM macro of appropriate size is not available in SOTB, the memory is implemented with a set of registers and its size is limited.

In CMA-SOTB, instead of independent power supplies, independent body biases are given to the PE array and the micro-controller/data memory. Here, the bias voltages for the PE array are referred to as VBNM and VBPM, and those for the micro-controller as VBN and VBP. Note that all PEs receive the same VBNM and VBPM. By controlling the body biases separately, we can optimize the energy consumption while keeping the required performance. For a target application with strong arithmetic intensity, the PE array is given a forward bias (VBNM > 0, $VBPM < V_{DD}$) while the micro-controller/data memory is given a reverse bias (VBN < 0, $VBP > V_{DD}$). In contrast, if the data transfer is the bottleneck, the forward bias is given to the micro-controller/data memory, and the reverse bias is given to the PE array.

The process technology and CAD tools used for CMA-SOTB are shown in Table I. As shown in the layout in Fig. 3, CMA-SOTB has two macros: the left one is the micro-controller, and the right one is the PE array. Although the area of the PE array is larger than that of the micro-controller, the difference is not large. This is because the micro-controller macro includes the DMEMs implemented with registers, the configuration registers, the constant registers, and the launch/result registers.

Table I. Specification of CMA-SOTB

| Category | Item      | Specification                        |
|----------|-----------|--------------------------------------|
| Chip     | Process   | LEAP 65nm SOTB 7-metal               |
|          | Size      | 5 mm $\times$ 5 mm                   |
|          | I/O       | 208 pins                             |
| Tools    | Design    | Verilog HDL                          |
|          | Synthesis | Synopsys Design Compiler 2011.09-SP2 |
|          | P&R       | Synopsys IC Compiler 2010.12-SP5     |



Fig. 3. The layout of the CMA-SOTB

Here, PE array and micro-controller share the same supply voltage, but they have their own body bias voltages to control the performance and leakage current independently.

#### **III. RELATED WORK**

Finding the minimum energy point by controlling the supply voltage and the body bias voltage has been widely researched [6][7][8]. However, the minimum energy operating point tends to give low performance at a low supply voltage. From a practical viewpoint, an operating point that cannot satisfy the required performance is useless. Kao et al. [9] investigated optimization techniques from the practical viewpoint, but their study targeted only functional units and used a conventional bulk process. Although optimization for SOTB was investigated in [10] and [11], the target was a CPU, whose performance and power consumption do not depend on the application. FLEX power FPGA [12] selects a body bias voltage for each configurable logic block with its configuration data, and a similar concept has been applied using SOTB technology [13]. Although the leakage power can be largely suppressed in combination with sophisticated CAD techniques, the overhead of body bias control for a small logic block tends to be large.

Our previous paper [3] demonstrated the effect of body bias control using real chip measurement results. However, it was not based on a concrete theory of optimization. The aim of this paper is to establish a theory for finding the optimal operating point of CMA-SOTB. Although the result itself is specific to this accelerator, a similar theory can be applied to any type of accelerator that optimizes the balance of computation and data transfer by using body biasing.

# IV. FORMULA OF GENERAL LSI

Here, the general formulas used as the basis of the power model are introduced.

In general, the consumed power is represented by the following expression.

$$P_{all} = I_{leak} V_{DD} + \alpha_{at} f C V_{DD}^2 \tag{1}$$

$P_{all}$ is the total power, $I_{leak}$ is the total leakage current, f is the operational frequency, $\alpha_{at}$ is the activation coefficient and C is the capacitance. Here, $\alpha_{at}$ and C are combined and treated as a dynamic current coefficient $I_{dynamic}$.

$$P_{all} = I_{leak} V_{DD} + I_{dynamic} f V_{DD}^2 \tag{2}$$
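Expression (2) splits the total power into a leakage term that is linear in $V_{DD}$ and a dynamic term that is quadratic in $V_{DD}$. The sketch below evaluates both terms separately; the function name `total_power` and the numeric inputs are hypothetical and chosen only to illustrate the scaling, not taken from the chip.

```python
def total_power(i_leak: float, i_dynamic: float, f: float, vdd: float):
    """Evaluate Expression (2) and return (leak, dynamic, total) power.

    i_leak    : total leakage current I_leak
    i_dynamic : dynamic current coefficient I_dynamic (alpha_at * C)
    f         : operational frequency
    vdd       : supply voltage V_DD
    """
    p_leak = i_leak * vdd
    p_dyn = i_dynamic * f * vdd ** 2
    return p_leak, p_dyn, p_leak + p_dyn


# Hypothetical inputs: halving V_DD halves the leakage term (for a fixed
# I_leak) and quarters the dynamic term.
leak_hi, dyn_hi, _ = total_power(1e-6, 1e-10, 30e6, 0.8)
leak_lo, dyn_lo, _ = total_power(1e-6, 1e-10, 30e6, 0.4)
```

In practice $I_{leak}$ itself also depends on $V_{DD}$, so the real leakage reduction is larger than this fixed-current example suggests.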

 $I_{leak}$  and f are further decomposed.

Leakage current of general bulk CMOS is classified into four categories: subthreshold leak, gate leak, junction leak and GIDL. Since SOTB is an FD-SOI CMOS, the junction leak can be ignored [14]. Since GIDL is caused by the high  $V_{DD}$ , it is not considered in this paper which treats low or middle range of  $V_{DD}$ .

First, subthreshold leak is represented by the following expression.

$$I_{sub} = I_{sub0} e^{\frac{V_{gs} - V_{th0} + \eta V_{ds} - K_{\gamma} V_{sb}}{n \nu_T}} \left(1 - e^{\frac{-V_{ds}}{\nu_T}}\right) \tag{3}$$

$I_{sub0}$ is the leakage current at the threshold voltage, $V_{gs}$ is the gate-source voltage, $V_{ds}$ is the drain-source voltage, $V_{th0}$ is the threshold voltage with zero body bias, $\eta$ is the coefficient of the drain-source voltage, $K_{\gamma}$ is the body effect coefficient, $V_{sb}$ is the body bias voltage, n is a value representing the characteristics of the depletion region, and $\nu_T$ is the thermal voltage. The subthreshold leakage increases exponentially with temperature [14].

Next, gate leak is shown as follows.

$$I_{gate} = W P_A \left(\frac{V_{DD}}{t_{ox}}\right)^2 e^{-P_B \frac{t_{ox}}{V_{DD}}} \tag{4}$$

W is the gate width, $P_A$ and $P_B$ are coefficients depending on the process technology, and $t_{ox}$ is the thickness of the oxide film.

Both leakage currents grow exponentially with both $V_{DD}$ and the body bias voltage [11].

The operational frequency is related to the delay time. The delay of general LSI transistors is represented by the following $\alpha$-power law.

$$\tau = k \frac{C V_{DD}}{(V_{DD} - V_{th})^{\alpha}} \tag{5}$$

 $\tau$  is the delay, k is the process parameter, and  $V_{th}$  is the threshold level. Here, k and C are combined and represented with a constant F, and the operational frequency can be given by inverting the total delay time.

$$f = F \frac{(V_{DD} - V_{th})^{\alpha}}{V_{DD}} \tag{6}$$

Here, F is referred to as the operational frequency coefficient. $V_{th}$ depends on the body bias voltage as shown in the following expression.

$$V_{th} = V_{th0} + \gamma (\sqrt{\phi_s + V_{sb}} - \sqrt{\phi_s}) \tag{7}$$

$\gamma$ is the body effect coefficient, and $\phi_s$ is a value corresponding to the surface potential at the threshold voltage, which includes the temperature dependence. Since the SOTB process is operated only at low voltages, Expression (7) can be approximated as follows.

$$V_{th} = V_{th0} + K_{\gamma} V_{sb} \tag{8}$$
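One way to see where the linear form of Expression (8) comes from is a first-order expansion of Expression (7) around $V_{sb} = 0$. The source only states that low-voltage operation justifies the approximation; the expansion below, and the resulting identification of $K_{\gamma}$, is a sketch under the assumption $|V_{sb}| \ll \phi_s$.

$$\sqrt{\phi_s + V_{sb}} - \sqrt{\phi_s} = \sqrt{\phi_s}\left(\sqrt{1 + \frac{V_{sb}}{\phi_s}} - 1\right) \approx \frac{V_{sb}}{2\sqrt{\phi_s}}$$

$$V_{th} \approx V_{th0} + \frac{\gamma}{2\sqrt{\phi_s}} V_{sb}, \qquad \text{so that} \quad K_{\gamma} \approx \frac{\gamma}{2\sqrt{\phi_s}}$$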

# V. POWER OPTIMIZATION MODEL

Here, a model for power optimization is derived from general formulas shown in the previous section.

#### V-A. Assumption

From a practical viewpoint, the power of an accelerator should be minimized under the performance required by the application. The purpose of the power model is to find the optimal supply voltage and body bias voltages for a required operational frequency. The operational frequency and the leakage current of the PE array and the micro-controller can each be measured and computed independently. The dynamic current coefficient $I_{dynamic}$ must be shared by both components, since they use the same power supply.

#### V-B. Power model

First, the leakage current is expressed. The leakage current increases exponentially with $V_{DD}$ and the body bias voltage $V_b$ [11]. Adding the chip temperature T, the leakage current can be represented as follows:

$$I_{leak} = I_{leak0} e^{AV_{DD} + BV_b + CT}, \tag{9}$$

where A is the coefficient of the supply voltage, B is that of the body bias voltage, and C is that of the chip temperature. $I_{leak0}$ represents the characteristics of each chip.

Since CMA-SOTB consists of two components, the micro-controller (MC) and the PE array (PA), their leakage currents must be represented independently. Thus, the total leakage current is represented as follows.

$$I_{leakall} = I_{leak0_{MC}} e^{A_{MC}V_{DD} + B_{MC}V_b + C_{MC}T} + I_{leak0_{PA}} e^{A_{PA}V_{DD} + B_{PA}V_b + C_{PA}T} \tag{10}$$
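The two-module leakage model of Expressions (9) and (10) can be sketched as follows, using the leakage coefficients listed in Table II. The source does not state the units of the coefficients, so the code assumes volts for $V_{DD}$ and $V_b$ and degrees Celsius for T, and it makes no claim about the unit of the returned current. The names `LeakCoeffs`, `leak_current` and `total_leak` are hypothetical, and each module is given its own body bias argument, as in the final model of Expression (16).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakCoeffs:
    i_leak0: float  # I_leak0: chip-dependent scale factor
    a: float        # A: coefficient of the supply voltage
    b: float        # B: coefficient of the body bias voltage
    c: float        # C: coefficient of the chip temperature


# Leakage coefficients from Table II (units not given in the source).
MC = LeakCoeffs(i_leak0=1.70e-8, a=1.21, b=4.25, c=3.66e-2)  # micro-controller
PA = LeakCoeffs(i_leak0=2.47e-7, a=1.51, b=4.20, c=3.01e-2)  # PE array


def leak_current(k: LeakCoeffs, vdd: float, vb: float, temp: float) -> float:
    """Expression (9) for one module."""
    return k.i_leak0 * math.exp(k.a * vdd + k.b * vb + k.c * temp)


def total_leak(vdd: float, vb_mc: float, vb_pa: float, temp: float) -> float:
    """Expression (10): sum of the micro-controller and PE-array leakage."""
    return leak_current(MC, vdd, vb_mc, temp) + leak_current(PA, vdd, vb_pa, temp)
```

Because the exponent is linear in each variable, a fixed step in body bias always multiplies the leakage by the same factor, for example by `exp(4.25 * 0.1)` for a 0.1 V step on the micro-controller.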

Next, the operational frequency can be simplified by combining Expressions (6), (7) and (8) and adding a temperature term.

$$f = F \frac{(V_{DD} - V_{th0} + K_{\gamma}V_b + K_T T)^{\alpha}}{V_{DD}} \tag{11}$$

$K_{\gamma}$ is the coefficient of the body bias voltage, and $K_T$ is the coefficient of the chip temperature. The $\alpha$-power law takes into account the velocity saturation of the carriers in the drain current. Thus, $\alpha$ is called the velocity saturation coefficient and defined



Fig. 4. Measurement environment

by the approximation from I-V characteristics. When the transistor works in the low $V_{DD}$ region, the I-V characteristics can be approximated by a square-law I-V curve, and $\alpha$ can be treated as 2. In this paper, we use $\alpha = 2$. The operational frequency is then given by the following expression.

$$f = F \frac{(V_{DD} - V_{th0} + K_{\gamma}V_b + K_T T)^2}{V_{DD}} \tag{12}$$

In addition, the maximum operational frequency of CMA-SOTB is limited by the slower of the two modules, the micro-controller or the PE array. Thus, it is expressed as follows.

$$f_{max} = MIN(f_{MC}, f_{PA}) \tag{13}$$
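Expressions (12) and (13) can be sketched in code as below. F, $K_{\gamma}$ and $K_T$ are taken from Table II, but $V_{th0}$ is not listed there, so the value `VTH0 = 0.2` is a purely hypothetical placeholder. The function names, and the choice to return 0 when the overdrive term is not positive, are also assumptions of this sketch.

```python
def max_frequency(f_coef: float, vth0: float, k_gamma: float, k_t: float,
                  vdd: float, vb: float, temp: float) -> float:
    """Expression (12) with alpha = 2 for one module."""
    overdrive = vdd - vth0 + k_gamma * vb + k_t * temp
    if overdrive <= 0.0:
        # Assumption of this sketch: outside the model's valid range,
        # report that the module cannot operate.
        return 0.0
    return f_coef * overdrive ** 2 / vdd


def chip_fmax(f_mc: float, f_pa: float) -> float:
    """Expression (13): the slower module limits the whole chip."""
    return min(f_mc, f_pa)


VTH0 = 0.2  # hypothetical; V_th0 is not given in Table II

# F, K_gamma and K_T from Table II; V_DD = 0.4 V, zero bias, 30 degrees C.
f_mc = max_frequency(5.26e8, VTH0, 4.36e-2, 7.31e-5, vdd=0.4, vb=0.0, temp=30.0)
f_pa = max_frequency(6.61e8, VTH0, 6.85e-2, 7.31e-5, vdd=0.4, vb=0.0, temp=30.0)
f_chip = chip_fmax(f_mc, f_pa)
```

With equal bias and the same placeholder threshold, the module with the smaller F is the bottleneck; forward bias on that module raises its frequency until the other module becomes the limit.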

From these expressions, Expression (2) is represented as follows.

$$P_{all} = I_{leak0_{MC}} e^{A_{MC}V_{DD} + B_{MC}V_b + C_{MC}T} V_{DD} + I_{leak0_{PA}} e^{A_{PA}V_{DD} + B_{PA}V_b + C_{PA}T} V_{DD} + I_{dynamic} f_{max} V_{DD}^2 \tag{14}$$

From Expression (11),  $V_b$  can be represented as follows.

$$V_{b_{calc}} = \frac{\left(\frac{V_{DD}f}{F}\right)^{\frac{1}{\alpha}} - \left(V_{DD} - V_{th0} + K_TT\right)}{K_{\gamma}} \tag{15}$$
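For reference, the intermediate steps from Expression (11) to Expression (15) are spelled out here. They assume that the term in parentheses in Expression (11) is positive, so that the $\alpha$-th root is well defined.

$$\frac{V_{DD} f}{F} = \left(V_{DD} - V_{th0} + K_{\gamma}V_b + K_T T\right)^{\alpha}$$

$$\left(\frac{V_{DD} f}{F}\right)^{\frac{1}{\alpha}} = V_{DD} - V_{th0} + K_{\gamma}V_b + K_T T$$

$$K_{\gamma}V_b = \left(\frac{V_{DD} f}{F}\right)^{\frac{1}{\alpha}} - \left(V_{DD} - V_{th0} + K_T T\right)$$

Dividing both sides by $K_{\gamma}$ gives Expression (15).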

The final power model is obtained by substituting Expression (15), evaluated separately for the micro-controller and the PE array, for the body bias terms in Expression (14).

$$P_{all} = I_{leak0_{MC}} e^{A_{MC}V_{DD} + B_{MC}V_{bMCcalc} + C_{MC}T} V_{DD} + I_{leak0_{PA}} e^{A_{PA}V_{DD} + B_{PA}V_{bPAcalc} + C_{PA}T} V_{DD} + I_{dynamic} f_{given} V_{DD}^2 \tag{16}$$

By assigning the required operational frequency to $f_{given}$ in Expression (16), the power consumption for each $V_{DD}$ is obtained. By finding the $V_{DD}$ that gives the lowest power consumption, we can fix the optimal $V_{DD}$. Then, the optimal body bias voltages are obtained from Expression (15) and the optimal $V_{DD}$.
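The procedure described above amounts to a one-dimensional sweep over $V_{DD}$. The following sketch shows one way to carry it out. All coefficient values in it are hypothetical toy numbers chosen only so that the sweep has an interior minimum; they are not the values of Table II, and $V_{th0}$ in particular is not given in the source. The names `Module`, `vb_calc`, `power` and `optimize` are likewise assumptions of this sketch.

```python
import math
from dataclasses import dataclass

ALPHA = 2.0  # alpha = 2, as used in the paper for the low-V_DD region


@dataclass(frozen=True)
class Module:
    i_leak0: float  # I_leak0
    a: float        # A (supply voltage coefficient)
    b: float        # B (body bias coefficient)
    c: float        # C (temperature coefficient)
    f_coef: float   # F
    k_gamma: float  # K_gamma
    k_t: float      # K_T
    vth0: float     # V_th0


def vb_calc(m: Module, vdd: float, f: float, temp: float) -> float:
    """Expression (15): body bias at which module m just reaches frequency f."""
    return ((vdd * f / m.f_coef) ** (1.0 / ALPHA)
            - (vdd - m.vth0 + m.k_t * temp)) / m.k_gamma


def freq(m: Module, vdd: float, vb: float, temp: float) -> float:
    """Expression (12); used here to check the round trip through vb_calc."""
    return m.f_coef * (vdd - m.vth0 + m.k_gamma * vb + m.k_t * temp) ** ALPHA / vdd


def leak(m: Module, vdd: float, vb: float, temp: float) -> float:
    """Expression (9) for one module."""
    return m.i_leak0 * math.exp(m.a * vdd + m.b * vb + m.c * temp)


def power(mc: Module, pa: Module, i_dyn: float, vdd: float, f: float, temp: float):
    """Expression (16): total power and the two body biases at a given V_DD."""
    vb_mc = vb_calc(mc, vdd, f, temp)
    vb_pa = vb_calc(pa, vdd, f, temp)
    p_leak = (leak(mc, vdd, vb_mc, temp) + leak(pa, vdd, vb_pa, temp)) * vdd
    return p_leak + i_dyn * f * vdd ** 2, vb_mc, vb_pa


def optimize(mc: Module, pa: Module, i_dyn: float, f: float, temp: float, vdd_grid):
    """Sweep V_DD and return (power, vdd, vb_mc, vb_pa) at the minimum power."""
    best = None
    for vdd in vdd_grid:
        p, vb_mc, vb_pa = power(mc, pa, i_dyn, vdd, f, temp)
        if best is None or p < best[0]:
            best = (p, vdd, vb_mc, vb_pa)
    return best


# Hypothetical toy coefficients (NOT the values of Table II).
TOY_MC = Module(1.0e-6, 1.5, 4.0, 0.03, 5.0e8, 0.1, 7.0e-5, 0.2)
TOY_PA = Module(2.0e-6, 1.5, 4.0, 0.03, 6.0e8, 0.1, 7.0e-5, 0.2)
TOY_I_DYN = 1.0e-10
F_GIVEN = 40e6   # required frequency
TEMP = 30.0      # chip temperature

VDD_GRID = [0.15 + 0.01 * i for i in range(106)]  # 0.15 V to 1.20 V
best_p, best_vdd, best_vb_mc, best_vb_pa = optimize(
    TOY_MC, TOY_PA, TOY_I_DYN, F_GIVEN, TEMP, VDD_GRID)
```

At low $V_{DD}$ the required forward bias makes the leakage term grow exponentially, while at high $V_{DD}$ the quadratic dynamic term dominates, which is why the sweep has a minimum in between. The sketch places no bound on the computed body bias and includes no margin for stable operation, so its output would still need the kind of manual adjustment described for the real chip.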

Table II. Coefficients of the model

|               | micro-controller      | PE array              |
|---------------|-----------------------|-----------------------|
| $I_{leak0}$   | $1.70 \times 10^{-8}$ | $2.47 \times 10^{-7}$ |
| A             | 1.21                  | 1.51                  |
| B             | 4.25                  | 4.20                  |
| C             | $3.66 \times 10^{-2}$ | $3.01 \times 10^{-2}$ |
| F             | $5.26 \times 10^{8}$  | $6.61 \times 10^{8}$  |
| $K_{\gamma}$  | $4.36 \times 10^{-2}$ | $6.85 \times 10^{-2}$ |
| $K_T$         | $7.31 \times 10^{-5}$ | $7.31 \times 10^{-5}$ |
| $I_{dynamic}$ | $9.41 \times 10^{-5}$ | (same value)          |

# VI. FINDING PARAMETERS FOR CMA-SOTB

# VI-A. Conditions of the measurement

In order to fix the parameters of the model described in the previous section, real chip measurements were carried out.

Here, a simple image filter application "af" with packed RGB data is used. In this application, two images whose RGB data are packed into 24 bits are separated, alpha-blended, and then combined into 24-bit width again. The influence of the application program is discussed later. As defined in Section II-C, the body bias voltages for the micro-controller are referred to as *VBN* and *VBP*, and those for the PE array as *VBNM* and *VBPM*. The same degree of body bias is given to the NMOS and PMOS transistors so that the following condition is satisfied.

$$V_{DD} = VBN + VBP \tag{17}$$

Since VBP is automatically defined with a fixed VBN, we only show VBN hereafter without showing VBP. Similarly, for the PE array, only VBNM is shown.
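As a short check of why Expression (17) gives the same degree of bias to both transistor types, note that the forward-bias amount of the PMOS transistor is $V_{DD} - VBP$ (zero bias being $VBP = V_{DD}$). The numeric example below uses hypothetical voltages.

$$V_{DD} - VBP = V_{DD} - (V_{DD} - VBN) = VBN$$

For example, with $V_{DD} = 0.4$ V and $VBN = -0.2$ V (reverse bias), Expression (17) gives $VBP = 0.4 - (-0.2) = 0.6$ V, which is 0.2 V above $V_{DD}$ and therefore a reverse bias of the same magnitude on the PMOS side.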

As shown in Fig. 4, the chip temperature is controlled by a thermal control unit with a Peltier device. The chip is surrounded by heat insulators so that its temperature can be changed quickly. The measurement board carries a small FPGA daughter board for supplying test data and receiving the results. Using the FPGA and on-board D/A converters, the body bias voltages can be controlled. A temperature sensor attached directly to the chip in the package is used. The evaluated temperature range is from 30 to 60 °C, considering the common usage of commercial products [14].

Coefficients obtained from a real CMA-SOTB chip evaluation are shown in Table II. Values in the table were obtained by computing the average of measurement results. Note that  $I_{dynamic}$  cannot be measured independently for each module, and so the same value is used.

#### VI-B. The accuracy of the model

This section compares values obtained from the model with those from the real chip measurement results, and discusses the accuracy of the model.

Fig. 5 to Fig. 10 show the leakage current obtained from the model and from the measurements. Note that a logarithmic scale



Fig. 5. Leakage Current vs. Supply Voltage



Fig. 6. Leakage Current vs. Chip Temperature

is used for the vertical axis of these graphs. In Fig. 7, there is a gap at a body bias voltage of -0.2 V, which was caused by changing the range of the ammeter used in the measurement. Fig. 5 to Fig. 7 show that the leakage current increases exponentially with three factors: supply voltage, chip temperature and body bias voltage. The PE array, which has a larger area, suffers a larger leakage current than the micro-controller. When the temperature is fixed, the difference between the leakage current from the model and that from the measurement is quite small. Fig. 9 and Fig. 10 show the cases in which the temperature is changed. In the case shown in Fig. 9, the model matches the measured results well. However, Fig. 10 shows that the slope of the increasing line is different. The model proposed in the previous section therefore appears to have room for improvement with respect to temperature.

Next, the operational clock frequency is examined. Fig. 11 shows the measured maximum operational frequency versus the body bias of the PE array and the body bias of the micro-controller. Fig. 12 shows the difference between the values from the model and the measurements. The



Fig. 7. Leakage Current vs. Body Bias Voltage



Fig. 8. Comparison of calculated and measured values

average difference was just 5.2%, which shows that the model approximates the real chip well. The difference is large when VBNM is small, because a low operational frequency makes the relative difference large. The influence of temperature on the operational frequency is not large in the SOTB process, and [8] reported that, unlike in bulk CMOS, the frequency increases slightly at high temperature. The measurement results here also showed the frequency increasing with temperature, but by only 1 MHz from 30 to 60 °C. When the supply voltage is changed, the difference between the values from the model and the measurement results is small, as shown in Fig. 13.

Finally, the difference in the overall power consumption is shown in Fig. 14. The ratio of the difference at $V_{DD} = 0.4$ V under 30 °C is shown. The ratio of the difference is 10% at maximum and 4.4% on average, which shows that the model is useful. At different temperatures, the ratio of the difference was almost the same.
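The maximum and average ratios of difference quoted above can be computed from paired model and measurement values as sketched below. The source does not state which value is used as the reference, so normalizing by the measured value is an assumption of this sketch; the function names and the sample numbers are hypothetical.

```python
def error_rates(model, measured):
    """Relative difference |model - measured| / |measured| for each pair."""
    if len(model) != len(measured):
        raise ValueError("model and measured must have the same length")
    return [abs(m - x) / abs(x) for m, x in zip(model, measured)]


def summarize(rates):
    """Return (maximum, average) of a non-empty list of error rates."""
    return max(rates), sum(rates) / len(rates)


# Hypothetical power values in mW, for illustration only.
model_mw = [1.10, 1.90, 3.15]
measured_mw = [1.00, 2.00, 3.00]
worst, average = summarize(error_rates(model_mw, measured_mw))
```

Reporting both numbers is useful because a small average can hide a few operating points where the model is noticeably off.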



Fig. 9. Comparison of calculated and measured values versus chip temperature (VBN = 0 V, VBNM = 0 V)



Fig. 10. Comparison of calculated and measured values versus chip temperature (VBN = 0.4 V, VBNM = 0.4 V)

# VI-C. Influence of the application

The operational frequency of CMA-SOTB depends on the application. It is determined either by the critical path on the PE array or by the time needed to transfer input/output data between the launch/result registers and the DMEM. Table III shows the number of operations on the critical path of the dataflow graph in the PE array. For each application, it lists the number of PEs used for each operation (Shift to Mult), the number of PEs that are only passed through (Through), the number of PEs on the path (Path), and the maximum operational frequency (MHz) at $V_{DD} = 0.4$ V and VBNM = 0 V under 30 °C. Here, all application programs are built based on "af", introduced in the previous subsection. "af-long" executes the same operation as "af" with a longer critical path. "Mult", "Multshort", "shift-test", "Logic-test" and "Add-test" use the same critical path as "af" but with different operations. For all these application programs, the same micro-controller program is used. Examples of the critical path in the programs are shown in Fig. 15. The results show that the operational frequency bottlenecked by the PE array is influenced by three factors: the length of the critical path,



Fig. 11. Maximum Operational Frequency vs. Body Bias Voltage



Fig. 12. Error Rate of Maximum Operational Frequency

the number of PEs, and the number of "Mult" operations, which have the largest delay time. The maximum operational frequency bottlenecked by the PE array can be estimated from these factors in the application program. By also considering the number of inputs and outputs, the arithmetic intensity of the target application can be estimated; thus, we can know which module will limit the operational frequency.

# VII. POWER OPTIMIZATION

By using the model, the optimized bias voltages of CMA-SOTB are obtained and compared with the real optimal points found by exhaustive measurements.

For a given operational frequency, the power consumption can be obtained from the model as shown in Fig. 16. When the supply voltage is decreased beyond a certain point, the power consumption increases rapidly because strong forward biasing is needed. The bottom point in the graph corresponds to the maximum energy efficient



Fig. 13. Comparison of calculated and measured values versus supply voltage at zero bias



Fig. 14. Error Rate of Power Consumption

point. Table IV shows the power consumption obtained from Fig. 16 and the measurements from the real chip. The supply voltage and bias voltages were slightly adjusted so that the chip operates stably. The largest adjustment was 50 mV. At most, there was a difference of about 20% between the voltages from the power model and those from the real measurement.

Fig. 17 shows the power at the optimal point from the model and that from the exhaustive search. The conditions are shown in Table V. The exhaustive search was done in steps of 0.1 V from 0.3 to 0.5 V for $V_{DD}$, from -0.1 to 0.4 V for VBN and from -0.4 to 0.4 V for VBNM. Among all measurement results, the smallest power consumption was recorded. The power from the model is much smaller than the result from the exhaustive search when the optimal point is outside the range of the search. For a 45 MHz target frequency, the energy consumption is reduced by 37.4% compared with the result from the exhaustive search. The exhaustive search requires a lot of time and manual effort, while the proposed model can obtain comparable results by a simple computation.
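To give a sense of the measurement effort, the search grid described above can be enumerated as in the sketch below. Voltages are generated from integer millivolt steps to avoid floating-point drift; the helper name `voltage_steps` is hypothetical.

```python
from itertools import product


def voltage_steps(start_mv: int, stop_mv: int, step_mv: int = 100):
    """Voltages in volts from start to stop (inclusive), built from integer mV."""
    return [mv / 1000 for mv in range(start_mv, stop_mv + 1, step_mv)]


VDD_POINTS = voltage_steps(300, 500)     # 0.3 V to 0.5 V
VBN_POINTS = voltage_steps(-100, 400)    # -0.1 V to 0.4 V
VBNM_POINTS = voltage_steps(-400, 400)   # -0.4 V to 0.4 V

# Every (V_DD, VBN, VBNM) combination is one operating point to measure.
search_grid = list(product(VDD_POINTS, VBN_POINTS, VBNM_POINTS))
num_points = len(search_grid)  # 3 * 6 * 9 combinations
```

Each of these points has to be measured for every target frequency and application, and any optimum lying outside the grid, such as a VBN below -0.1 V, cannot be found by the search at all.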

Table III. Operations on the critical path



Fig. 16. Power Consumption by power model (horizontal axis: VDD (V))

# VIII. CONCLUSION

This paper proposed a power model of the low-power coarse-grained reconfigurable accelerator CMA-SOTB, fixed its parameters from measurement results of a real chip, and showed examples of optimization for a given operational clock frequency.

With the proposed model, the power can be estimated with an average difference of 4.4% from the measured value. Using the model, the optimal supply voltage and the body bias voltages for the PE array and the micro-controller can be obtained for a given operational frequency. Compared with the result of an exhaustive search, 37.4% of the energy is saved with much less measurement effort.

Table IV. Error rate between calculated and measured values

|              | 30 MHz |         | 40 MHz |         | 45 MHz |         |
|--------------|--------|---------|--------|---------|--------|---------|
|              | calc.  | measure | calc.  | measure | calc.  | measure |
| $V_{DD}$ (V) | 0.42   | 0.467   | 0.45   | 0.502   | 0.46   | 0.528   |
| VBN (V)      | -0.859 | -0.854  | -0.854 | -0.855  | -0.776 | -0.784  |
| VBNM (V)     | -0.790 | -0.791  | -0.834 | -0.831  | -0.806 | -0.802  |
| Power (mW)   | 1.328  | 1.587   | 1.986  | 2.382   | 2.320  | 2.960   |
| Error Rate   | 19.5 % |         | 19.9 % |         | 25.6 % |         |



Fig. 17. Power Optimization Results

At present, the supply voltage and bias voltages obtained from the model are manually adjusted for stable operation of the chip. The model should include a margin for stable operation. Also, the temperature has not been modeled well, and future improvement is required.

# ACKNOWLEDGMENT

This work was performed as part of the "Ultra-Low Voltage Device Project" funded and supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO). This work was also partially supported by JSPS KAKENHI S Grant Number 25220002. The authors thank the VLSI Design and Education Center (VDEC) and Synopsys for EDA tools.

#### **IX. REFERENCES**

- [1] Low-power Electronics Association & Project, http://www.leap.or.jp/.
- [2] Nobuyuki Ozaki, et al., "Cool Mega-Arrays: Ultralow-Power Reconfigurable Accelerator Chips," *IEEE Micro*, Vol. 31, pp. 6–18, 2011.
- [3] Hongliang Su, Yu Fujita, Hideharu Amano, "Body Bias Control for a Coarse Grained Reconfigurable Accelerator Implemented with Silicon on Thin BOX Technology," in *Proc. of Field Programmable Logic and Applications (FPL)*, September 2014.
- [4] Takashi Ishigaki, et al., "Ultralow-power LSI Technology with Silicon on Thin Buried Oxide (SOTB) CMOSFET," Solid State Circuits Technologies, Jacobus W. Swart (Ed.), ISBN: 978-953-307-045-2, InTech, pp. 146–156, 2010.

Table V. Body bias voltages of the power optimization measurement results

|              | 30 MHz   |        | 40 MHz   |        | 45 MHz   |        |
|--------------|----------|--------|----------|--------|----------|--------|
|              | measured | model  | measured | model  | measured | model  |
| $V_{DD}$ (V) | 0.4      | 0.467  | 0.4      | 0.502  | 0.4      | 0.528  |
| VBN (V)      | -0.3     | -0.854 | 0        | -0.855 | 0.2      | -0.784 |
| VBNM (V)     | -0.4     | -0.791 | 0        | -0.831 | 0        | -0.802 |

- [5] Nobuyuki Ozaki, et al., "Cool Mega-Array: A highly energy efficient reconfigurable accelerator," in *Proc. of Field-Programmable Technology (FPT)*, December 2011.
- [6] Bo Zhai, et al., "Energy Efficient Near-threshold Chip Multi-processing," in *Proceedings of International Symposium on Low Power Electronics and Design*, Aug. 2007, pp. 32–37.
- [7] David Fick, et al., "Centip3De: A 3930DMIPS/W Configurable Near-Threshold 3D Stacked System with 64 ARM Cortex-M3 Cores," in *Proceedings of International Solid-State Circuits Conference*, Aug. 2012, pp. 190–192.
- [8] Shohei Nakamura, Jun Kawasaki, Yuichi Kumagai, Kimiyoshi Usami, "Measurement of the Minimum Energy Point in Silicon on Thin-BOX (SOTB) and Bulk MOSFET," in *Proc. of EUROSOI-ULIS*, January 2015.
- [9] James T. Kao, et al., "A 175mV Multiply-Accumulate Unit Using an Adaptive Supply Voltage and Body Bias Architecture," in *IEEE Journal of Solid-State Circuits*, Nov. 2002, pp. 1545–1554.
- [10] Koichiro Ishibashi, et al., "A Perpetuum Mobile 32bit CPU with 13.4pj/cycle, 0.14μA sleep current using Reverse Body Bias Assisted 65nm SOTB CMOS technology," in *Proceedings of COOL Chips XVII*, April 2014, pp. 1–3.
- [11] Hayate Okuhara, Kuniaki Kitamori, Yu Fujita, Kimiyoshi Usami, Hideharu Amano, "An Optimal Power Supply And Body Bias Voltage for Ultra Low Power Micro-Controller with Silicon on Thin BOX MOSFET," in *Proc. of International Symposium on Low Power Electronics and Design (ISLPED)*, July 2015.
- [12] Masakazu Hioki, et al., *Fully-Functional FPGA Prototype with Fine-Grain Programmable Body Biasing*, Feb. 2013.
- [13] Masakazu Hioki, et al., "SOTB Implementation of a Field Programmable Gate Array with Fine-Grained Vt Programmability," in *J. Low Power Electron. Appl.*, April 2014, pp. 329–332.
- [14] N. H. E. Weste, D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Addison Wesley, 4th edition, 2010.