

DEPARTMENT OF ELECTRICAL AND INFORMATION TECHNOLOGY

MASTER OF SCIENCE THESIS

# Energy Characterization of RISC Processors in the Sub-$V_T$ Domain

Author:

Muhammad Umair Siddiqui

Supervisors: Joachim N. Rodrigues, Chenxin Zhang, S. M. Yasser Sherazi

Lund 2011

The Department of Electrical and Information Technology Lund University Box 118, S-221 00 LUND SWEDEN

This thesis is set in Computer Modern 10pt, with the LaTeX Documentation System

©Muhammad Umair Siddiqui 2011

Printed in E-huset Lund, Sweden. October 2011


#### Abstract

Devices such as medical implants and remote sensors must operate with very low energy dissipation to extend battery life. For such ultra-low-energy devices, sub-threshold design is an essential technique for reducing the energy dissipation of a circuit. An important aspect of this technique is to model the energy dissipation of each design component (and, if possible, of the whole design) in the sub-threshold domain. This thesis presents the energy characterization of two 32-bit microprocessors, LEON-3 and Cortex-M0, in the sub-threshold domain. For this study, a high-level energy characterization model was used to analyze the energy dissipation and operating-frequency trends of the two microprocessors. Sub-threshold design can be combined with other energy-saving techniques, such as clock-gating, multi-$V_{DD}$ and power gating, to further improve the energy efficiency of a design. In this thesis, the sub-threshold analysis is performed with and without clock-gating. The results from the energy model show that, by using a sub-threshold supply voltage and clock-gating, the energy dissipation of both microprocessors can be reduced to the order of picojoules (pJ). Sub-threshold operation reduces their clock frequency to almost 50 kHz, but most medical implants and remote sensors have relaxed throughput constraints.

### Acknowledgement

I would like to thank my thesis supervisors, Dr. Joachim Rodrigues, Chenxin Zhang, and Yasser Sherazi, for their unwavering support during this thesis. Special thanks to Oskar Andersson, Isael Diaz, Deepak Dasalukunte and Stefan Molund for sharing their technical expertise and advice throughout the thesis.

Finally, I want to thank my parents, for their encouragement and support during my MS studies.


Muhammad Umair Siddiqui Lund, October, 2011

## Contents

| Section | Title |
|---------|-------|
|  | Abstract |
|  | Acknowledgements |
|  | List of Tables |
|  | List of Figures |
|  | Acronyms |
| 1 | Introduction |
| 1.1 | Overview |
| 1.2 | Thesis Outline |
| 2 | Sub-threshold CMOS Design |
| 2.1 | Overview |
| 2.2 | MOS Transistor in Sub-$V_T$ region |
| 2.3 | CMOS inverter in Sub-$V_T$ region |
| 2.4 | Energy Estimation in Sub-$V_T$ Domain |
| 3 | Thesis Methodology |
| 3.1 | Overview |
| 3.2 | Extraction of Model Parameters |
| 3.2.1 | SPICE simulation |
| 3.2.2 | Logic Synthesis |
| 3.2.3 | Place & Route |
| 3.2.4 | Netlist analysis and Estimation of k-parameters |
| 3.2.5 | Netlist simulation and Estimation of Switching Activity |
| 3.3 | Netlist Simulation of Microprocessor |
| 4 | Energy Estimation of LEON-3 |
| 4.1 | Overview |
| 4.2 | LEON-3 Synthesis |
| 4.3 | LEON-3 Place & Route |
| 4.4 | Benchmark compilation for LEON-3 |
| 4.5 | LEON-3 Netlist Simulation and Power Estimation |
| 4.6 | Sub-$V_T$ Analysis and Results |
| 5 | Energy Estimation of Cortex-M0 |
| 5.1 | Overview |
| 5.2 | Cortex-M0 Synthesis and PAR |
| 5.3 | Benchmark compilation for Cortex-M0 |
| 5.4 | Cortex-M0 Netlist Simulation and Power Estimation |
| 5.5 | Sub-$V_T$ Analysis and Results |
| 6 | Conclusion |
| 6.1 | Comparison of LEON-3 with Cortex-M0 |
| 6.2 | Conclusion |
|  | Bibliography |

## List of Tables

| Table | Caption |
|-------|---------|
| 3.1 | Sub-$V_T$ model parameters calculated from SPICE |
| 4.1 | LEON-3: Synthesis Results |
| 4.2 | LEON-3: Post-PAR Results |
| 4.3 | LEON-3: Benchmark execution summary |
| 4.4 | LEON-3: Energy Dissipation per clock cycle (50 MHz) at nominal voltage |
| 4.5 | LEON-3: Sub-$V_T$ k-parameters |
| 4.6 | LEON-3: Average circuit switching activity ($\mu_e$) |
| 4.7 | LEON-3: Summary of sub-$V_T$ analysis |
| 5.1 | Cortex-M0: Synthesis Results |
| 5.2 | Cortex-M0: Post-PAR Results |
| 5.3 | Cortex-M0: Benchmark execution summary |
| 5.4 | Cortex-M0: Energy Dissipation per clock cycle (50 MHz) at nominal voltage |
| 5.5 | Cortex-M0: Sub-$V_T$ k-parameters |
| 5.6 | Cortex-M0: Average circuit switching activity ($\mu_e$) |
| 5.7 | Cortex-M0: Summary of sub-$V_T$ analysis |

## List of Figures

| Figure | Caption |
|--------|---------|
| 2.1 | Static CMOS inverter |
| 2.2 | VTC of static CMOS inverter in Sub-$V_T$ regime |
| 3.1 | Sub-$V_T$ Energy characterization flow |
| 4.1 | LEON-3 customization wizard screen |
| 4.2 | LEON-3 Testbench schematic |
| 4.3 | LEON-3 sub-$V_T$ energy curve |
| 4.4 | LEON-3 sub-$V_T$ maximum operating frequency graph |
| 5.1 | Cortex-M0 Testbench schematic |
| 5.2 | Cortex-M0 sub-$V_T$ energy curve |
| 5.3 | Cortex-M0 sub-$V_T$ maximum operating frequency graph |

## List of Acronyms

| Acronym | Meaning                                             |
|---------|-----------------------------------------------------|
| ASIC    | Application-Specific Integrated Circuit             |
| EEMBC   | EDN Embedded Microprocessor Benchmarking Consortium |
| ELF     | Executable and Linkable Format                      |
| EMV     | Energy Minimum operating Voltage                    |
| FFT     | Fast Fourier Transform                              |
| HDL     | Hardware Description Language                       |
| IFFT    | Inverse Fast Fourier Transform                      |
| IP      | Intellectual Property                               |
| PAR     | Place and Route                                     |
| RISC    | Reduced Instruction Set Computer                    |
| RAM     | Random-Access Memory                                |
| RTL     | Register Transfer Level                             |
| SAIF    | Switching Activity Interchange Format               |
| SDF     | Standard Delay Format                               |
| SRAM    | Static Random-Access Memory                         |
| VCD     | Value Change Dump                                   |

Chapter 1

## Introduction

#### 1.1 Overview

Digital design involves a trade-off between three factors: area, energy and performance. Every application domain dictates its own set of requirements for these three factors. There are a number of applications where minimizing the energy dissipation is the single most important design goal. These so-called "ultra-low energy" applications include biomedical implants and remote sensors for supply chain management and environment monitoring.

For very low energy dissipation, the most effective technique for CMOS designs is to operate them in the "sub-threshold" (sub-$V_T$) domain, where the supply voltage is less than the threshold voltage of the MOS transistor. However, sub-$V_T$ operation reduces the operating frequency of the circuit (typically to the order of kHz). As a result, this technique is currently used only in applications with relaxed throughput requirements, such as biomedical implants and remote sensors [1].

Recently, medical implants and remote sensors have become increasingly complex due to their industry and application requirements. As a result, most of these devices are now implemented as a System-on-Chip (SoC) and contain at least one microprocessor for supervisory tasks and/or to execute the main processing algorithm [2, 3]. However, because of this increased complexity, these circuits now dissipate more energy. Therefore, for architectural exploration, it is necessary to accurately model the energy dissipation of the device and its individual components.

In this thesis, two general-purpose microprocessors, LEON-3 [4] and Cortex-M0 [5], were analyzed in the sub-V<sub>T</sub> regime. The Cortex-M0 is currently the smallest and most energy-efficient processor available from ARM Ltd., which makes it a suitable candidate for this study. It is a 32-bit, 3-stage pipelined RISC processor implementing the ARMv6-M architecture [6]. The LEON-3 is a high-performance processor available from Aeroflex Gaisler AB. It is a 32-bit, 7-stage pipelined RISC processor implementing the IEEE-1754 (SPARC V8) architecture [7].

Typically, sub-V<sub>T</sub> modeling methodologies rely extensively on SPICE simulations, as in [8, 9]. However, SPICE simulations are not feasible for complex digital designs such as microprocessors. Therefore, in this thesis the energy characterization is performed using the high-level energy estimation methodology presented in [10], which uses high-level synthesis engines and power simulation tools to estimate the energy dissipation.

#### 1.2 Thesis Outline

The remaining chapters of this thesis are organized as follows. Chapter 2 discusses sub-$V_T$ design issues and the energy estimation model used in this thesis. Chapter 3 describes the methodology used to find the sub-$V_T$ characteristics of Cortex-M0 and LEON-3. In Chapters 4 and 5, this methodology is applied to LEON-3 and Cortex-M0, respectively, to find their sub-$V_T$ characteristics. The conclusion is given in Chapter 6.

Chapter 2

## Sub-threshold CMOS Design

#### 2.1 Overview

When the gate voltage of a MOS transistor is lowered below the threshold voltage $V_T$, the transistor does not turn off abruptly but enters another region of operation called the "sub-threshold" or "weak inversion" region. In this region, a small leakage current still flows between the drain and source terminals of the transistor. Generally, the presence of sub-threshold (sub-$V_T$) current is undesirable: not only does it deviate from ideal switch-like behavior, it also causes leakage energy dissipation in MOS circuits. However, with the advent of ultra-low-energy applications such as medical implants and remote sensors, there is great interest in sub-$V_T$ design [11]. The following sections briefly discuss sub-$V_T$ design issues and the energy estimation model used in this thesis.

#### 2.2 MOS Transistor in Sub-$V_T$ region

In the sub-V<sub>T</sub> region, the MOS transistor behaves as a (poor) bipolar device (npn for an NMOS) with its base coupled to the gate through a capacitive divider. The drain current of an NMOS transistor operating in the sub-V<sub>T</sub> region is given by (2.1) [12]:

$$I_{DS} = I_S \exp\left(\frac{V_{GS} - V_T}{nU_T}\right) \left( 1 - \exp\left(\frac{-V_{DS}}{U_T}\right) \right) \tag{2.1}$$

where  $U_T$  is the thermal voltage whose value is 26 mV at 300 K, n is a process dependent parameter called slope factor and is typically in the range of 1.3 - 1.5 for modern CMOS processes. The  $V_{GS}$  and  $V_{DS}$  are gate-to-source and drain-to-source voltages, respectively. The  $I_S$  is also a process parameter called specific current and is given by:

$$I_S = 2n\mu C_{ox} U_T^2 \frac{W}{L} \tag{2.2}$$

where  $\mu$  is mobility constant,  $C_{ox}$  is the capacitance of gate oxide per unit area, and  $\frac{W}{L}$  is the aspect ratio of transistor. The same equations are valid for PMOS if the sign of current and voltages is inverted.
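Equations (2.1) and (2.2) can be evaluated numerically with the minimal Python sketch below. The function names and the value of `i_s` are illustrative assumptions and are not taken from the thesis; only $U_T$ = 26 mV, n = 1.5 and $V_T$ = 450 mV are values quoted in the text.

```python
import math

U_T = 0.026  # thermal voltage at 300 K, in volts (value quoted in the text)


def specific_current(n, mu, c_ox, w_over_l, u_t=U_T):
    """Specific current I_S from (2.2)."""
    return 2.0 * n * mu * c_ox * u_t ** 2 * w_over_l


def subthreshold_current(v_gs, v_ds, v_t, i_s, n, u_t=U_T):
    """Drain current I_DS of an NMOS in weak inversion, from (2.1)."""
    gate_term = math.exp((v_gs - v_t) / (n * u_t))
    drain_term = 1.0 - math.exp(-v_ds / u_t)
    return i_s * gate_term * drain_term


# Hypothetical operating point: i_s is an assumed value, not a measured one.
i_s = 1.0e-6  # amperes (assumption)
n = 1.5
v_t = 0.45    # volts

# Lowering V_GS by n * U_T * ln(10) should reduce I_DS by one decade.
swing = n * U_T * math.log(10.0)
i_hi = subthreshold_current(0.30, 0.30, v_t, i_s, n)
i_lo = subthreshold_current(0.30 - swing, 0.30, v_t, i_s, n)
decade_ratio = i_hi / i_lo

print(f"sub-threshold swing = {swing * 1e3:.1f} mV/decade")
print(f"current ratio over one swing = {decade_ratio:.3f}")
```

The ratio check illustrates the exponential dependence of $I_{DS}$ on $V_{GS}$: with n = 1.5, each reduction of roughly 90 mV in gate voltage reduces the current by a factor of ten.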

#### 2.3 CMOS inverter in Sub-$V_T$ region

Consider a static CMOS inverter shown in figure 2.1, with input voltage  $V_i$  and output voltage  $V_o$ .



Figure 2.1: Static CMOS inverter

The Voltage Transfer Characteristic (VTC) of this inverter is derived by equating the current (2.1) through NMOS and PMOS transistors. If both transistors have similar strength then VTC of inverter is given by (2.3) [12]:

$$x_o = x_D + \ln\left(\frac{1 - G + \sqrt{(G - 1)^2 + 4Ge^{-x_D}}}{2}\right) \tag{2.3}$$

where

$$G = exp\left(\frac{2x_i - x_D}{n}\right) \tag{2.4a}$$

$$x_o = \frac{V_o}{U_T} \tag{2.4b}$$

$$x_i = \frac{V_i}{U_T} \tag{2.4c}$$

$$x_D = \frac{V_{DD}}{U_T} \tag{2.4d}$$

Figure 2.2 shows the VTC plots of inverter for different supply voltages. It is evident that when the normalized supply voltage approaches its minimum value, the VTC degenerates, and the static noise margins are reduced to zero [12]. For reasonable noise margin in sub-V<sub>T</sub> regime, the  $V_{DD}$  should be at least 4 times the  $U_T$  (assuming n = 1.5) [12].
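The shape of these curves can be reproduced with a short Python sketch of (2.3) and (2.4). The function name `inverter_vtc` and the chosen supply voltages are illustrative assumptions. For $G > 1$ the sketch uses an algebraically equivalent form of the logarithm's argument to avoid subtracting two nearly equal numbers.

```python
import math

U_T = 0.026  # thermal voltage at 300 K, in volts


def inverter_vtc(v_i, v_dd, n=1.5, u_t=U_T):
    """Output voltage of a symmetric sub-V_T inverter, from (2.3) and (2.4)."""
    x_i = v_i / u_t
    x_d = v_dd / u_t
    g = math.exp((2.0 * x_i - x_d) / n)
    root = math.sqrt((g - 1.0) ** 2 + 4.0 * g * math.exp(-x_d))
    if g > 1.0:
        # Equal to (1 - g + root) / 2 after multiplying numerator and
        # denominator by (root + g - 1); numerically safer for large g.
        arg = 2.0 * g * math.exp(-x_d) / (root + g - 1.0)
    else:
        arg = (1.0 - g + root) / 2.0
    return u_t * (x_d + math.log(arg))


# Sweep the input for a few illustrative supply voltages.
for v_dd in (0.05, 0.1, 0.3):
    points = [inverter_vtc(v_dd * i / 4, v_dd) for i in range(5)]
    print(f"V_DD = {v_dd:.2f} V:", ", ".join(f"{v:.4f}" for v in points))
```

Because (2.3) assumes transistors of equal strength, the curve passes through $V_o = V_{DD}/2$ at $V_i = V_{DD}/2$, which is a convenient sanity check on any implementation.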



Figure 2.2: VTC of static CMOS inverter in Sub-$V_T$ regime

#### 2.4 Energy Estimation in Sub-V<sub>T</sub> Domain

The energy dissipation of static CMOS circuits is given by (2.5) [13].

$$E_{\rm T} = \underbrace{\alpha C_{\rm tot} V_{\rm DD}^2}_{E_{\rm dyn}} + \underbrace{I_{\rm leak} V_{\rm DD} T_{\rm clk}}_{E_{\rm leak}} + \underbrace{I_{\rm peak} t_{\rm sc} V_{\rm DD}}_{E_{\rm sc}}, \tag{2.5}$$

where  $E_{\rm dyn}$ ,  $E_{\rm leak}$ , and  $E_{\rm sc}$  are the average energy dissipation due to switching activity, the energy dissipation resulting from integrating the leakage power over one clock cycle  $T_{\rm clk}$ , and the energy dissipation due to short circuit currents, respectively.

When the supply voltage enters the sub-V<sub>T</sub> regime, $E_{dyn}$ decreases quadratically while $E_{leak}$ increases with voltage scaling [8] [9]. The reason for the increase in $E_{leak}$ is that, as the voltage is scaled below the threshold voltage, the "on-current" decreases exponentially (and hence the circuit delay increases exponentially) while the off-current is reduced less severely. Leakage is therefore integrated over a much longer clock period, so $E_{leak}$ rises and eventually supersedes $E_{dyn}$. This effect creates a minimum energy point, at the energy minimum voltage (EMV), where CMOS logic reaches maximum energy efficiency per operation. In the sub-V<sub>T</sub> domain, $E_{sc}$ is ignored as it is known to contribute only a small portion of the overall energy [8] [9].

Assuming a standard CMOS process technology where $V_T$ is fixed (i.e., no triple wells for body biasing), the problem becomes finding the optimum $V_{DD}$, or EMV, that minimizes the energy per operation for a given design. There are several energy estimation models for sub-V<sub>T</sub> operation, such as [12], [9] and [8]; however, they are not suitable for high-level design exploration because they require extensive SPICE simulations to extract their parameters. Therefore, this study uses the energy model presented by Akgun et al. [10], which provides an accurate estimation of sub-V<sub>T</sub> parameters without requiring computation- and time-intensive SPICE simulations [14]. In [10], Akgun et al. used high-level synthesis engines and power simulation tools for design characterization<sup>1</sup>, which makes their model suitable for high-level design exploration. Moreover, their model requires only a single synthesis at the nominal library voltage for model parameter extraction.

In Akgun et al.'s model, the total energy dissipation is given by

$$E_{\rm T} = \underbrace{\mu_{\rm e} k_{cap} C_{inv} V_{\rm DD}^2}_{E_{\rm dyn}} + \underbrace{k_{leak} I_0 V_{DD} T_{clk}}_{E_{\rm leak}} \tag{2.6}$$

where $\mu_e$ is the average circuit switching activity, $I_0$ is the average leakage current of a single inverter, and $C_{inv}$ is the equivalent capacitance of a single inverter. $k_{leak}$ is the average leakage scaling factor over all gates with respect to a single inverter, and $k_{cap}$ is the capacitance scaling factor of the circuit with respect to a single inverter.

If the clock period  $(T_{\rm clk})$  is equal to the critical path delay, then  $T_{\rm clk}$  can be written as

$$T_{\rm clk} = k_{\rm crit} T_{\rm sw\_inv}, \qquad (2.7)$$

<sup>&</sup>lt;sup>1</sup>they extracted few technology parameters from SPICE as well.

where $k_{\rm crit}$ is a coefficient that defines the critical path delay of the circuit in terms of the inverter delay $T_{\rm sw\_inv}$. In [12], the delay $T_{\rm sw\_inv}$ of an inverter operating in the sub-V<sub>T</sub> regime is given by

$$T_{\rm sw\_inv} = \frac{C_{\rm inv} V_{\rm DD}}{I_0 e^{V_{\rm DD}/(nU_{\rm t})}}. \tag{2.8}$$

By introducing (2.8) into (2.7), the critical path delay is given as

$$T_{\rm clk} = k_{\rm crit} \frac{C_{\rm inv} V_{\rm DD}}{I_0 e^{V_{\rm DD}/(nU_{\rm t})}}, \tag{2.9}$$

and the reciprocal of (2.9) defines the maximum frequency at which the circuit may be operated for a given supply voltage  $V_{\rm DD}$ . Finally, the total energy dissipation  $E_{\rm T}$  assuming operation at the maximum frequency is found by introducing (2.9) into (2.6), which gives

$$E_{\rm T} = C_{\rm inv} V_{\rm DD}^2 \bigg[ \mu_{\rm e} k_{\rm cap} + k_{\rm crit} k_{\rm leak} e^{-V_{\rm DD}/(nU_{\rm t})} \bigg]. \tag{2.10}$$

The expression for the energy minimum voltage (EMV) is obtained by setting the derivative of (2.10) with respect to V<sub>DD</sub> to zero [10]:

$$V_{\rm opt} = 2nU_{\rm t} - nU_{\rm t}W_{-1} \left[ -\frac{2e^2k_{\rm cap}\mu_{\rm e}}{k_{\rm crit}k_{\rm leak}} \right], \qquad (2.11)$$

where $W_{-1}$ is the $-1$ branch of the Lambert W function.
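As a sketch of how (2.11) follows from (2.10), the steps below use the shorthand $x = V_{\rm DD}/(nU_{\rm t})$, $a = \mu_{\rm e}k_{\rm cap}$ and $b = k_{\rm crit}k_{\rm leak}$. These symbols are introduced here only for compactness.

$$\begin{aligned}
\frac{dE_{\rm T}}{dV_{\rm DD}} &= C_{\rm inv} V_{\rm DD}\left[2a + b\,(2 - x)\,e^{-x}\right] = 0 \\
(x - 2)\,e^{-(x-2)} &= \frac{2e^{2}a}{b} \\
(2 - x)\,e^{(2 - x)} &= -\frac{2e^{2}a}{b} \\
x &= 2 - W\!\left[-\frac{2e^{2}a}{b}\right]
\end{aligned}$$

Multiplying by $nU_{\rm t}$ recovers (2.11). A real solution exists only when $2e^{2}a/b \le 1/e$. In that case the principal branch $W_0$ gives a stationary point with $2 < x \le 3$, which is a local maximum of (2.10), while the $W_{-1}$ branch gives $x \ge 3$, which is the energy minimum.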

Using equations (2.9), (2.10) and (2.11), any synchronous design can be evaluated for a possible sub-V<sub>T</sub> implementation by identifying its key properties (maximum frequency, total energy dissipation and EMV) in the sub-V<sub>T</sub> regime. The next chapter describes the methodology for extracting the sub-V<sub>T</sub> model parameters.
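The three quantities can be computed with the minimal Python sketch below, which uses only the standard library. Instead of calling a Lambert W routine, it solves the equivalent stationarity condition $(x-2)e^{-x} = 2\mu_e k_{cap}/(k_{crit}k_{leak})$, with $x = V_{DD}/(nU_t)$, by bisection on $x \ge 3$, which is the root that the $W_{-1}$ branch of (2.11) selects. The k-parameters, $\mu_e$ and $I_0$ in the example are hypothetical placeholders, not results from this thesis, and $I_0$ is treated as a constant for simplicity.

```python
import math

C_INV = 0.87e-15  # farads; 0.00087 pF from Table 3.1


def clock_period(v_dd, k_crit, i_0, c_inv=C_INV, n=1.5, u_t=0.026):
    """Critical path delay T_clk from (2.9); its reciprocal is f_max."""
    return k_crit * c_inv * v_dd / (i_0 * math.exp(v_dd / (n * u_t)))


def energy_per_cycle(v_dd, mu_e, k_cap, k_crit, k_leak,
                     c_inv=C_INV, n=1.5, u_t=0.026):
    """Total energy per cycle at maximum frequency, from (2.10)."""
    leak = k_crit * k_leak * math.exp(-v_dd / (n * u_t))
    return c_inv * v_dd ** 2 * (mu_e * k_cap + leak)


def emv(mu_e, k_cap, k_crit, k_leak, n=1.5, u_t=0.026):
    """Energy minimum voltage, equivalent to the W_{-1} solution (2.11)."""
    c = 2.0 * mu_e * k_cap / (k_crit * k_leak)
    if not 0.0 < c <= math.exp(-3.0):
        raise ValueError("no interior energy minimum for these parameters")
    lo, hi = 3.0, 4.0
    # (x - 2) * exp(-x) decreases for x >= 3, so grow hi until it brackets.
    while (hi - 2.0) * math.exp(-hi) > c:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (mid - 2.0) * math.exp(-mid) > c:
            lo = mid
        else:
            hi = mid
    return n * u_t * 0.5 * (lo + hi)


# Hypothetical design parameters, for illustration only.
params = dict(mu_e=0.1, k_cap=5.0e4, k_crit=200.0, k_leak=3.0e4)
v_opt = emv(**params)
e_opt = energy_per_cycle(v_opt, **params)
f_max = 1.0 / clock_period(v_opt, params["k_crit"], i_0=1.0e-12)

print(f"EMV   = {v_opt * 1e3:.1f} mV")
print(f"E_T   = {e_opt * 1e12:.3f} pJ per cycle")
print(f"f_max = {f_max / 1e3:.1f} kHz")
```

Bisection is used here only to keep the sketch free of external dependencies; a library Lambert W implementation evaluated on the $-1$ branch would give the same voltage.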

Chapter 3

## Thesis Methodology

#### 3.1 Overview

This chapter discusses the methodology used in this thesis to find the sub-V<sub>T</sub> characteristics of Cortex-M0 and LEON-3. As evident from equations (2.9), (2.10) and (2.11), the sub-V<sub>T</sub> model depends upon the following parameters:

1. Slope factor ($\mathbf{n}$)
2. Equivalent capacitance of an inverter ($\mathbf{C}_{\text{inv}}$)
3. Average leakage power of a single inverter ($\mathbf{P}_{\text{leak\_inv}}$)
4. Intrinsic delay (at nominal voltage) of an inverter ($\mathbf{T}_{\text{inv}}$)
5. Average leakage current of a single inverter ($\mathbf{I}_0$)
6. Capacitance scaling factor ($\mathbf{k}_{\text{cap}}$)
7. Critical path delay scaling factor ($\mathbf{k}_{\text{crit}}$)
8. Average leakage scaling factor ($\mathbf{k}_{\text{leak}}$)
9. Average circuit switching activity ($\mu_e$)

The parameter extraction flow is discussed in the next section.

Reducing the supply voltage is the main technique used in ultra-low-energy designs to reduce energy dissipation. However, several other techniques [15] can be applied to a circuit to further fine-tune its energy dissipation. Remote sensors and medical implants operate in a bursty manner, i.e., short intervals of intense activity interspersed with long intervals of no (or low) activity. During these idle periods, the main source of dynamic energy dissipation is the clock. Keeping the clock connected to the flip-flops during idle periods causes spurious activity in the logic. Using the *clock-gating* technique, the designer can avoid this spurious activity and reduce the overall dynamic energy dissipation of the circuit. With modern HDL synthesis engines, clock-gating has become easier because these engines, based on their input constraints, automatically insert clock-gating circuits into the synthesized (gate-level) netlist. In this thesis, both microprocessors were analyzed with and without clock-gating.

The k-parameters are extracted from the reports generated by the high-level synthesis and place & route (PAR) engines, which saves a lot of time. However, one of the sub-V<sub>T</sub> model parameters, $\mu_e$, is calculated by netlist simulation. The stimuli for the processors are software programs; the programs used in this sub-V<sub>T</sub> analysis are explained in the last section of this chapter.

#### 3.2 Extraction of Model Parameters

Figure 3.1 shows the different steps of parameter extraction flow. This flow is based on the methodology described in [10]. The following sub-sections explain the different steps of this flow.



Figure 3.1: Sub- $V_T$  Energy characterization flow

#### 3.2.1 SPICE simulation

A few technology parameters, namely $\mathbf{n}$, $\mathbf{P}_{\text{leak\_inv}}$, $\mathbf{T}_{\text{inv}}$, $\mathbf{C}_{\text{inv}}$ and $\mathbf{I}_0$, are extracted from SPICE simulations. These SPICE simulations had already been performed by other researchers in our university department, and their results were used in this study. Table 3.1 shows the values of these parameters for the STMicroelectronics 65-nm CMOS low-leakage high-threshold (LL-HVT) cell library. An important model parameter, $\mathbf{I}_0$, is not shown in Table 3.1 because it has to be measured over a range of supply-voltage values (0 to $V_T$)<sup>1</sup>. The LL-HVT cell library has very low leakage and is suitable for implementing ultra-low-power applications. The downside is that its cells are comparatively slow, which results in slower implementations.

Table 3.1: Sub- $V_T$  model parameters calculated from SPICE

| Parameter                         | Value                |
|-----------------------------------|----------------------|
| n                                 | 1.5                  |
| $\mathbf{C}_{\mathrm{inv}}$       | 0.00087 pF           |
| $\mathbf{P}_{\mathrm{leak\_inv}}$ | 3.42066 pW           |
| $\mathbf{T}_{\mathrm{inv}}$       | 22.2815 ps           |

#### 3.2.2 Logic Synthesis

The main flow starts with logic synthesis of the target design, in which the Verilog or VHDL description of the design is synthesized into a gate-level netlist using an ASIC synthesis tool such as *Synopsys Design Compiler* or *Cadence Encounter RTL Compiler*. For the clock-gating case, the settings and constraints for clock-gating are also provided to the synthesis tool, which adds clock-gating circuits before the clock inputs of the registers. Most CMOS design libraries include clock-gating circuits as integrated cells, which the synthesis tool selects as appropriate during netlist generation. After the synthesis step, the synthesis tool generates the following outputs for the sub-V<sub>T</sub> flow:

- Gate-level netlist of the design
- **SDC file**, which contains the design constraints about area and timing, written in Synopsys Design Constraint format

 $<sup>^1 \</sup>mathrm{for}~\mathrm{ST}\text{-}65$  LL-HVT  $\mathrm{V_T}$  is 450 mV

#### 3.2.3 Place & Route

The sub-V<sub>T</sub> model, as shown in (2.10), depends on accurate modeling of the total capacitance present in the design. In larger synchronous designs, such as processors, the clock tree contributes a significant amount of wiring capacitance. The logic synthesis tool does not create the clock tree, because clock-tree design depends on the physical placement of cells. Therefore, *place & route* (PAR) was performed using *Cadence SoC Encounter*. The PAR tool uses the design netlist and constraints (in SDC format) from *DesignCompiler* to physically place the cells, create the clock tree and route the signals. After the PAR step, the following outputs are generated for the sub-V<sub>T</sub> flow:

- Post-PAR Gate-level netlist of the design
- **SDF file,** the timing information after PAR, which is annotated into gate-level netlist simulation

#### 3.2.4 Netlist analysis and Estimation of k-parameters

The k-parameters of the sub-V<sub>T</sub> model are calculated by analyzing the post-PAR netlist in *DesignCompiler*. This analysis generates the following reports:

- Timing report, which contains the information about the critical path delay.
- Cell report, a complete *list of cells* in design.
- Net report, a listing of all the nets in the post-PAR design and their load capacitances

The Critical path delay scaling factor  $(\mathbf{k}_{crit})$  is calculated by extracting the *Critical path delay* from the *timing report* and dividing it by the *inverter delay* (at nominal voltage).
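As a quick numerical check of this definition, the following sketch divides a critical path delay by the inverter delay. The function name `k_crit` is only illustrative; the input values are the inverter delay tabulated above (22.2815 ps) and the post-PAR critical path delays of the two LEON-3 designs reported in Table 4.2.

```python
def k_crit(critical_path_delay_s: float, inverter_delay_s: float) -> float:
    """Critical path delay expressed as a number of inverter delays."""
    return critical_path_delay_s / inverter_delay_s


T_INV_S = 22.2815e-12  # inverter delay at nominal voltage (T_inv, tabulated above)

# Post-PAR critical path delays of LEON-3 (Table 4.2): normal and clock-gating
print(round(k_crit(8.54e-9, T_INV_S), 2))  # 383.28
print(round(k_crit(8.53e-9, T_INV_S), 2))  # 382.83
```

Both results agree with the $k_{crit}$ values listed for LEON-3 in Table 4.5.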

Similarly, the Capacitance scaling factor  $(\mathbf{k}_{cap})$  is obtained by the post-processing of the *net report*, which lists all the nets in the design and their load capacitances. The  $\mathbf{k}_{cap}$  is defined as the ratio of *total capacitance* of post-PAR design and  $\mathbf{C}_{inv}$ . This *Total capacitance* is calculated by adding the *load capacitance* of each net present in the design.

The average leakage scaling factor $(\mathbf{k}_{\text{leak}})$ is calculated from the *total leakage* of the synthesized design. Vendors and foundries provide *.lib* files for their cell libraries; a *.lib* file contains the characterization data of each cell in the library, including its average leakage. By processing the *Cell report* together with the *.lib* file, the average leakage of each cell in the design is obtained, and the *total leakage* is the sum of these individual cell leakage values. The $\mathbf{k}_{\text{leak}}$ is then calculated by dividing the *total leakage* of the synthesized design by $\mathbf{P}_{\mathrm{leak\_inv}}$.
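The post-processing of the net report and cell report can be sketched as follows. This is a minimal illustration and not the scripts used in the study: the net capacitances, cell names, per-cell leakage values and the inverter capacitance `C_INV_F` are all hypothetical placeholders. Only the inverter leakage is taken from the table above.

```python
P_LEAK_INV_W = 3.42066e-12  # inverter leakage power from the table above
C_INV_F = 2.0e-15           # HYPOTHETICAL inverter capacitance, illustration only


def k_cap(net_load_caps_f, c_inv_f):
    """Total load capacitance of all nets, in units of the inverter capacitance."""
    return sum(net_load_caps_f) / c_inv_f


def k_leak(cell_counts, lib_leakage_w, p_leak_inv_w):
    """Total cell leakage, in units of the inverter leakage."""
    total = sum(count * lib_leakage_w[cell] for cell, count in cell_counts.items())
    return total / p_leak_inv_w


# Hypothetical stand-ins for the parsed net report, cell report and .lib data
net_caps_f = [4.0e-15, 6.0e-15, 10.0e-15]
cell_counts = {"INV_X1": 3, "NAND2_X1": 2}
lib_leakage_w = {"INV_X1": 3.42066e-12, "NAND2_X1": 6.84132e-12}

print(k_cap(net_caps_f, C_INV_F))
print(k_leak(cell_counts, lib_leakage_w, P_LEAK_INV_W))
```

In a real flow the three dictionaries and lists would be filled by parsing the tool reports and the vendor *.lib* file; that parsing is omitted here.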

#### 3.2.5 Netlist simulation and Estimation of Switching Activity

The average circuit switching activity ($\mu_e$) is obtained from the dynamic power dissipation (at nominal voltage) of the post-PAR design. Estimating the dynamic power dissipation requires an SDF-annotated gate-level simulation in an HDL simulator such as Synopsys VCS or Mentor Graphics ModelSim. While the post-PAR design is simulated, its toggle information is captured in a "Value Change Dump" (VCD) file or a "Switching Activity Interchange Format" (SAIF) file. VCD file generation is supported by all HDL simulators, but for larger designs or longer simulations a VCD file takes a lot of disk space and its generation slows down the HDL simulation. SAIF is a compact format for storing the toggle information of a design and is natively supported by Synopsys VCS; other HDL simulators usually support SAIF file generation through the Synopsys Verilog PLI.
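The difference between the two formats can be illustrated with a small sketch. A VCD-style trace stores every value change of a signal, whereas a SAIF-style record keeps only a per-signal summary: the time spent at 0, the time spent at 1 and the toggle count. The function and the field names `T0`, `T1` and `TC` below are an illustrative assumption modeled on that summary, not a parser for either file format.

```python
def toggle_summary(changes, end_time):
    """Summarize a single-bit waveform.

    changes:  list of (time, value) pairs, value in {0, 1}, sorted by time
    end_time: time at which the simulation ends
    """
    t0 = t1 = tc = 0
    for i, (time, value) in enumerate(changes):
        # The value holds until the next change (or until the end of simulation)
        next_time = changes[i + 1][0] if i + 1 < len(changes) else end_time
        if value == 1:
            t1 += next_time - time
        else:
            t0 += next_time - time
        # Count a toggle whenever the value differs from the previous one
        if i > 0 and value != changes[i - 1][1]:
            tc += 1
    return {"T0": t0, "T1": t1, "TC": tc}


# Three value changes collapse into one fixed-size summary record
print(toggle_summary([(0, 0), (10, 1), (30, 0)], end_time=40))
# {'T0': 20, 'T1': 20, 'TC': 2}
```

The summary has the same size regardless of simulation length, which is why it is better suited to long benchmark runs.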

After netlist simulation, a power-analysis tool such as Synopsys PrimeTime estimates the dynamic power dissipation (at nominal voltage) of the post-PAR design by reading the design netlist along with its SAIF/VCD file. Using the dynamic power dissipation, $\mu_e$ is calculated as:

$$\mu_e = \frac{P_{dyn} T_{CLK}}{C_{tot} V_{DD}^2} \tag{3.1}$$

where  $P_{dyn}$  is the average dynamic power from PrimeTime report,  $T_{CLK}$  is the time period of (design's) clock in netlist simulation,  $C_{tot}$  is the total capacitance of post-PAR design and  $V_{DD}$  is the value of nominal supply voltage. The value of  $\mu_e$  depends on given input vectors, and is constant for all clock frequencies.
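Equation (3.1) can be evaluated with a few lines of code. The sketch below is only an illustration: the function name is arbitrary and the numerical inputs are hypothetical round values, not results from this study. The second call shows why $\mu_e$ does not depend on the clock frequency, under the assumption that the dynamic power scales linearly with frequency.

```python
def switching_activity(p_dyn_w, t_clk_s, c_tot_f, v_dd_v):
    """Average switching activity according to (3.1)."""
    return (p_dyn_w * t_clk_s) / (c_tot_f * v_dd_v ** 2)


# HYPOTHETICAL inputs: 1 mW at 50 MHz (20 ns period), 100 pF total capacitance, 1.0 V
mu_50mhz = switching_activity(1.0e-3, 20e-9, 100e-12, 1.0)

# Halving the clock frequency halves P_dyn and doubles T_CLK, so mu_e is unchanged
mu_25mhz = switching_activity(0.5e-3, 40e-9, 100e-12, 1.0)

print(round(mu_50mhz, 3), round(mu_25mhz, 3))
```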

#### 3.3 Netlist Simulation of Microprocessor

An important aspect of this sub-V<sub>T</sub> flow is the SDF-annotated netlist simulation of the synthesized design. To accurately determine the dynamic power dissipation of a design, the input vectors provided in the netlist simulation should be comprehensive enough to exercise every aspect of the design. For a microprocessor, the "input vectors" are the compiled software programs. In order to accurately determine the dynamic power dissipation of both Cortex-M0 and LEON-3, a few industry-standard benchmark programs by EEMBC<sup>2</sup> were selected as the "input vectors". This section only gives an overview of these benchmarks; the other details of the netlist simulation are described in the next chapters.

<sup>2</sup>EDN Embedded Microprocessor Benchmarking Consortium

EEMBC has created a number of software benchmark suites for testing the performance of embedded microprocessors. For this study, the EEMBC *Telebench1.1* suite [16] was selected. It consists of DSP kernels and communication algorithms, which can be used in medical implant and remote sensor applications. The EEMBC Telebench1.1 suite is implemented in C and can be easily compiled for different 32-bit microprocessor architectures with minimal changes. The benchmark compilation for LEON-3 and Cortex-M0 is discussed in the next chapters. Telebench1.1 has five (5) different types of algorithms and for each algorithm there are at least three (3) different datasets, giving a total of sixteen (16) different benchmarks in Telebench1.1. These five algorithms are

- Autocorrelation
- Bit Allocation
- Convolutional Encoder
- Fast Fourier Transform (FFT)
- Viterbi Decoder

The *Autocorrelation* benchmark performs a fixed-point autocorrelation calculation on a finite-length input sequence:

$$R_{xx}[k] = \frac{1}{N} \sum_{n} x[n]\,x[n+k], \quad k = 0, 1, \dots, K-1$$

where the input data x[n] is a 16-bit signed integer. This benchmark is provided with three different datasets:

- 1. autcor00data\_1 Sine wave of frequency Fs/32 and 1024 sample length
- 2. autcor00data\_2 a 16 samples symmetric pulse function
- 3. autcor00data\_3 a segment of 500 samples voiced speech signal
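The autocorrelation formula above can be sketched in integer arithmetic as shown below. This is not the EEMBC implementation: it assumes that the sum runs over all n for which both x[n] and x[n+k] lie inside the sequence, and it uses a plain integer division by N where a fixed-point implementation would typically apply its own scaling.

```python
def autocorrelation(x, num_lags):
    """Integer autocorrelation R_xx[k] for k = 0 .. num_lags - 1."""
    n = len(x)
    result = []
    for k in range(num_lags):
        # Assumption: only index pairs that stay inside the sequence contribute
        acc = sum(x[i] * x[i + k] for i in range(n - k))
        # Integer (floor) division stands in for the fixed-point scaling by 1/N
        result.append(acc // n)
    return result


print(autocorrelation([1, 2, 3, 4], 2))  # [7, 5]
```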

The *Bit Allocation* algorithm is mainly used in digital subscriber loop (DSL) modems for discrete multi-tone (DMT) modulation. This benchmark was nevertheless selected because it involves significant 16-bit fixed-point arithmetic and memory accesses. It is also provided with three different datasets to test bit allocation in different signal-to-noise ratio (SNR) scenarios. The names of the Bit Allocation benchmarks used in this study are:

- 1. fbital00data\_2
- 2. fbital00data\_3
- 3. fbital00data\_6

The *Convolutional Encoder* is commonly used in wireless transmission for forward error correction (FEC). This benchmark is a generic implementation of convolutional encoding because the generating polynomials are parametrized. It is provided with three different datasets, each of which uses different generating polynomials:

- 1. conven00data\_1 1/2, K = 5, $G_0 = 1 + x^2 + x^3 + x^4$ and $G_1 = 1 + x + x^4$
- 2. conven00data\_2 1/2, K = 4, $G_0 = 1 + x + x^2 + x^3$ and $G_1 = 1 + x^2 + x^3$
- 3. conven00data\_3 1/2, K = 3, $G_0 = 1 + x + x^2$ and $G_1 = 1 + x^2$
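A rate-1/2 convolutional encoder with parametrized generating polynomials can be sketched as follows. This is an illustrative model and not the EEMBC code: each generator is given as a tuple of exponents, and the term $x^d$ is assumed to denote the input bit delayed by d samples. The example uses the K = 3 polynomials of conven00data\_3.

```python
def conv_encode(bits, generators):
    """Encode a bit sequence; one output bit per generator for each input bit."""
    constraint_length = max(max(g) for g in generators) + 1
    history = [0] * constraint_length  # history[d] holds the input delayed by d
    out = []
    for b in bits:
        history = [b] + history[:-1]   # shift the new bit into the register
        for g in generators:
            # Modulo-2 sum of the taps selected by the generator polynomial
            out.append(sum(history[d] for d in g) % 2)
    return out


G0 = (0, 1, 2)  # 1 + x + x^2
G1 = (0, 2)     # 1 + x^2

# Impulse response: the outputs reproduce the coefficients of G0 and G1
print(conv_encode([1, 0, 0], [G0, G1]))  # [1, 1, 1, 0, 1, 1]
```

Passing a different pair of exponent tuples selects one of the other two datasets' polynomials without changing the encoder itself.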

The FFT is a commonly used DSP algorithm. This benchmark performs a 256-point FFT on 16-bit fixed-point data. The benchmark also performs decimation in time on its input data. Three different datasets are provided with the FFT benchmark:

- 1. fft00data\_1 Sine wave
- 2. fft00data\_2 Square pulse
- 3. fft00data\_3 High frequency test
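The decimation-in-time structure can be illustrated with a short recursive radix-2 FFT. Note that this sketch uses floating-point complex arithmetic for readability, whereas the benchmark operates on 16-bit fixed-point data, so it shows only the algorithmic structure and not the benchmark's numerical behavior.

```python
import cmath


def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # Decimation in time: split the input into even- and odd-indexed samples
    even = fft_dit(x[0::2])
    odd = fft_dit(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# A unit impulse has a flat spectrum
print([round(abs(v), 6) for v in fft_dit([1, 0, 0, 0])])
```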

Viterbi decoding is an "asymptotically optimum" approach to decoding convolutional codes. This benchmark performs 3-bit soft-decision Viterbi decoding on an input stream generated by a rate-1/2 convolutional encoder with $G_0 = 1 + x + x^3 + x^5$ and $G_1 = 1 + x^2 + x^3 + x^4 + x^5$. This benchmark is provided with four different datasets. The names of the Viterbi Decoding benchmarks used in this study are:

- 1. viterb00data\_1
- 2. viterb00data\_2
- 3. viterb00data\_3
- 4. viterb00data\_4

For more details on benchmark algorithms, the interested reader is referred to [16].



### **Energy Estimation of LEON-3**

#### 4.1 Overview

The LEON-3 [4] is a high-performance microprocessor available from Aeroflex Gaisler AB. It is a 32-bit RISC processor with a 7-stage pipeline, based on the IEEE-1754 (SPARC V8) architecture [7]. The LEON-3 is an open-source design and Aeroflex Gaisler provides its VHDL RTL source code in their GRLib IP bundle, which also includes other IP cores such as memory controllers, GPIO, UART and timers. Because the RTL source code is available, it is possible to customize the IP core and remove the extra peripherals. To simplify this process, Aeroflex Gaisler provides a Tcl/Tk-based graphical wizard (shown in figure 4.1) to modify the LEON-3 processor.

In this study, the cache subsystem was removed from LEON-3 using the graphical wizard, because ordinary 6-T cell based SRAMs cannot operate in the sub-V<sub>T</sub> domain [17]. The sub-V<sub>T</sub> domain requires special SRAM blocks, which were not available during the study. Similarly, extra peripherals such as the SDRAM controller, UARTs and extra timers were also removed. Only one 16-bit GPIO and one timer were kept in the LEON-3 core to execute the benchmarks. The timer is required for the time-related functions of the C library (such as clock()) that are used in the benchmarks. The GPIO is used as the output peripheral for printing the benchmark results, because the UART is a very slow device and its simulation is very time-consuming. Using the GPIO as an output device requires minor changes in the benchmarks' source code, which are explained later in this chapter. The analysis of LEON-3 is performed for both the normal and clock-gating cases; since the sub-V<sub>T</sub> flow is the same for both cases, they are explained together. The following sections explain the different steps of the sub-V<sub>T</sub> flow (figure 3.1) for the LEON-3 processor.

Figure 4.1: LEON-3 customization wizard screen

#### 4.2 LEON-3 Synthesis

The LEON-3 was synthesized with *Synopsys DesignCompiler*, using STMicroelectronics 65-nm low leakage high threshold (LL-HVT) CMOS library. Tight synthesis constraints were set to obtain the minimum area and leakage energy.

For the clock-gating case, one also has to specify the clock-gating style (latch-based or latch-free) and/or the type of (integrated) clock-gating cell. Based on the clock-gating and design constraints, *DesignCompiler* divides the design into register groups and then selects a clock-gating cell of appropriate type and strength for each group. For this study, the latch-based integrated clock-gating cells provided in the HVT cell library were selected for synthesis. Table 4.1 shows the synthesis results of the LEON-3 design.

After synthesis, the following files are generated by DesignCompiler, which are used later in the sub-V<sub>T</sub> flow:

- Gate-level netlist in Verilog
- SDC file

|                             | Normal                | Clock-gating          |
|-----------------------------|-----------------------|-----------------------|
| Total Cell Area             | 118623 µm<sup>2</sup> | 104689 µm<sup>2</sup> |
| Total Cell Count            | 21986                 | 21557                 |
| Critical Path Delay         | 9.99 ns               | 9.92 ns               |
| Total Register Count        | 6484                  | 6484                  |
| Gated Register Count        | 0                     | 6189 (95.45%)         |
| Total Clock Gating Elements | 0                     | 195                   |

Table 4.1: LEON-3 — Synthesis Results
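The gating coverage reported in Table 4.1 follows directly from the register counts, as the short check below shows. The function name is illustrative only.

```python
def gated_percentage(gated_registers: int, total_registers: int) -> float:
    """Share of registers whose clock input is driven by a clock-gating cell."""
    return 100.0 * gated_registers / total_registers


# LEON-3 clock-gating design: 6189 of 6484 registers are gated (Table 4.1)
print(round(gated_percentage(6189, 6484), 2))  # 95.45
```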

#### 4.3 LEON-3 Place & Route

The physical layout of LEON-3 was created with *Cadence SoC Encounter*, which takes the gate-level netlist and SDC file from the synthesis step as inputs. Due to clock-tree generation and cell-placement optimizations, both the area and the critical path change. Table 4.2 shows the post-PAR results of the LEON-3 design:

Table 4.2: LEON-3 — Post-PAR Results

|                     | Normal                | Clock-gating          |
|---------------------|-----------------------|-----------------------|
| Total Cell Area     | 120038 µm<sup>2</sup> | 108222 µm<sup>2</sup> |
| Total Cell Count    | 21300                 | 21371                 |
| Critical Path Delay | 8.54 ns               | 8.53 ns               |

After PAR, the following files are generated by SoC Encounter:

- Post-PAR netlist in Verilog
- SDF file

The post-PAR netlist is used for two purposes: first for the netlist simulation, and second for the netlist analysis explained in section 3.2.4.

#### 4.4 Benchmark compilation for LEON-3

All the EEMBC benchmarks were compiled into "Executable and Linkable Format" (ELF) files for LEON-3 using the Bare-C Cross-Compiler system (BCC) [18] provided by Aeroflex Gaisler. The C library from Aeroflex Gaisler only supports the UART as the STDOUT device; however, the UART is a very slow device and is not suitable for netlist simulation. Fortunately, the EEMBC benchmarks do not call the I/O functions of the C library directly; instead they use a single function, al\_write\_con. This function was modified to write the text messages, character by character, to the GPIO. The EEMBC porting guide [19] describes the procedure for retargeting the benchmarks to a new processor or a new software toolchain. Following this guide, the build scripts and Makefiles were modified to build the benchmarks for LEON-3/BCC.

However, LEON-3 in a "standalone" configuration (program execution without any operating system) cannot execute the ELF file directly. For execution, the ELF file has to be converted into a "boot-image". A boot-image consists of a "boot-loader" and the actual software from the ELF file. The boot-loader is responsible for performing all the setup operations before the actual program execution, such as creating and aligning the different program sections (code, data and stack sections) in RAM and setting up the peripherals. After the boot-loader has executed, the processor executes the actual software code. Aeroflex Gaisler provides a boot-image generator called "mkprom2" [20] to create a boot-image from the ELF file generated by BCC.

#### 4.5 LEON-3 Netlist Simulation and Power Estimation

To estimate the power dissipation of LEON-3 at the nominal voltage, post-PAR netlist simulations with SDF annotation were performed using *Synopsys VCS*. During simulation, all the switching (toggling) information of the design is recorded and saved in Switching Activity Interchange Format (SAIF). This SAIF file is later used to estimate the power dissipation.

In the GRLib IP bundle, a VHDL testbench environment (figure 4.2) is provided to simulate the LEON-3 processor netlist. This testbench environment includes SRAM and PROM models to simulate software execution by the LEON-3 processor. All the benchmark simulations were performed at 50 MHz. The SRAM and PROM models read text-based files written in Motorola S-Record (SREC) format. Using the software toolchain provided by Aeroflex Gaisler, the benchmark boot-images are converted into SREC format for netlist simulation. For a better estimate of the switching activity, all 16 benchmarks in the Telebench suite were simulated in the LEON-3 testbench. Overall, for the normal and clock-gating designs together, a total of 32 netlist simulations were performed.

At simulation startup, a power-on reset is performed and the SREC file of the boot-image is loaded into the PROM. After reset, LEON-3 executes the boot-loader from the PROM to initialize the SRAM, which involves setting up the program sections in SRAM. Additionally, the boot-loader initializes the timer registers based on the system clock frequency (50 MHz). When this setup is completed, the processor starts the actual benchmark execution from the SRAM. The benchmarks are executed until completion; the execution time of each benchmark is shown in table 4.3. The execution time in this case is the amount of time simulated by Synopsys VCS<sup>1</sup>.

<sup>1</sup> However, it typically takes 1 week to simulate 5 seconds of the SDF-annotated LEON-3 netlist in VCS.



Figure 4.2: LEON-3 Testbench schematic

| Benchmark      | Iter. | Exec. Time [s] | Dyn. Power (Normal) [mW] | Dyn. Power (Clock-gating) [mW] |
|----------------|-------|----------------|--------------------------|--------------------------------|
| autcor00data_1 | 300   | 0.08           | 4.4             | 0.9             |
| autcor00data_2 | 300   | 6.49           | 5.3             | 1.1             |
| autcor00data_3 | 300   | 5.66           | 5.3             | 1.1             |
| conven00data_1 | 400   | 5.20           | 4.5             | 0.9             |
| conven00data_2 | 400   | 4.69           | 4.4             | 0.9             |
| conven00data_3 | 400   | 3.71           | 4.4             | 0.9             |
| fbital00data_2 | 120   | 4.73           | 5.1             | 1.0             |
| fbital00data_3 | 120   | 0.30           | 5.0             | 1.0             |
| fbital00data_6 | 120   | 2.91           | 5.1             | 1.0             |
| fft00data_1    | 800   | 4.76           | 5.1             | 1.0             |
| fft00data_2    | 800   | 4.67           | 5.2             | 1.1             |
| fft00data_3    | 800   | 4.83           | 5.2             | 1.1             |
| viterb00data_1 | 190   | 6.19           | 3.7             | 0.7             |
| viterb00data_2 | 190   | 6.17           | 4.3             | 0.9             |
| viterb00data_3 | 190   | 6.19           | 4.4             | 0.9             |
| viterb00data_4 | 190   | 6.17           | 4.3             | 0.9             |

Table 4.3: LEON-3 — Benchmark execution summary

Table 4.4: LEON-3 — Energy Dissipation per clock cycle (50 MHz) at nominal voltage

| Benchmark      | Dyn. Energy (Normal) [pJ] | Dyn. Energy (Clock-gating) [pJ] |
|----------------|---------------------------|---------------------------------|
| autcor00data_1 | 87.7        | 17.67          |
| autcor00data_2 | 105.2       | 21.15          |
| autcor00data_3 | 106.1       | 21.38          |
| conven00data_1 | 89.3        | 18.03          |
| conven00data_2 | 88.7        | 17.87          |
| conven00data_3 | 88.9        | 17.93          |
| fbital00data_2 | 102.9       | 20.90          |
| fbital00data_3 | 100.8       | 20.32          |
| fbital00data_6 | 102.4       | 20.78          |
| fft00data_1    | 102.5       | 20.59          |
| fft00data_2    | 104.5       | 21.10          |
| fft00data_3    | 104.2       | 21.01          |
| viterb00data_1 | 74.7        | 14.15          |
| viterb00data_2 | 86.6        | 17.45          |
| viterb00data_3 | 87.1        | 17.55          |
| viterb00data_4 | 86.0        | 17.29          |
| average        | 94.85       | 19.07          |

The dynamic power consumption of LEON-3 during each benchmark was estimated using *Synopsys PrimeTime*. PrimeTime estimates the power consumption (at nominal voltage) for each benchmark by loading its SAIF file and the LEON-3 post-PAR netlist. Table 4.3 shows the dynamic power consumption of LEON-3 at nominal voltage while executing the EEMBC Telebench benchmarks; these power values are averaged over the whole benchmark duration (Exec. Time). From table 4.3, the dynamic energy dissipation per clock cycle is calculated, which is shown in table 4.4.
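The conversion from average dynamic power to energy per clock cycle is a division by the clock frequency, as sketched below. The function name is illustrative. Because Table 4.3 lists the power values rounded to 0.1 mW, the result differs slightly from the corresponding entry in Table 4.4, which was presumably computed from the unrounded power report.

```python
def energy_per_cycle_pj(p_dyn_w: float, f_clk_hz: float) -> float:
    """Dynamic energy per clock cycle in picojoules: E = P_dyn / f_clk."""
    return p_dyn_w / f_clk_hz * 1e12


F_CLK_HZ = 50e6  # simulation clock frequency

# autcor00data_1, normal design: 4.4 mW (rounded) in Table 4.3
print(round(energy_per_cycle_pj(4.4e-3, F_CLK_HZ), 1))  # 88.0, vs. 87.7 pJ in Table 4.4
```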

#### 4.6 Sub-V<sub>T</sub> Analysis and Results

The sub-V<sub>T</sub> model parameters were extracted using the methodology described in chapter 3. All the k-parameters ($k_{crit}$, $k_{leak}$ and $k_{cap}$) are calculated by analyzing the post-PAR netlist as explained in section 3.2.4; these parameters are given in table 4.5. Since the area of the clock-gating design is smaller than that of the normal design, its $k_{leak}$ and $k_{cap}$ parameters are also smaller.

| Parameter         | Value (Normal)       | Value (Clock-gating) |
|-------------------|----------------------|----------------------|
| k<sub>leak</sub>  | $1.15 \times 10^{5}$ | $1.11 \times 10^{5}$ |
| k<sub>cap</sub>   | $1.78 \times 10^{5}$ | $1.65 \times 10^{5}$ |
| k<sub>crit</sub>  | 383.28               | 382.83               |
Table 4.5: LEON-3 — Sub-V<sub>T</sub> k-parameters

Similarly, the average circuit switching activity ($\mu_e$) is calculated from the power report generated by *Synopsys PrimeTime*. The $\mu_e$ factor is calculated for each benchmark, as it depends on the dynamic power dissipation, which is different for each benchmark. Table 4.6 shows the $\mu_e$ values for the different benchmarks. A typical "real-world" application will be a combination of different algorithms; therefore a single mean value of $\mu_e$ is used in the sub-V<sub>T</sub> analysis. It is evident from table 4.6 that the switching activity for the clock-gating case is much lower than for the normal case.

Figure 4.3 shows the sub-V<sub>T</sub> energy curves of both the clock-gated and normal LEON-3 designs. These graphs are generated by substituting the corresponding parameter values from tables 4.5 and 4.6 into the sub-V<sub>T</sub> energy equation (2.10). It can be observed that for both cases there exists an EMV point where the energy dissipation is minimum. If the supply voltage is reduced below the EMV, the energy dissipation increases again. For the normal case, the energy dissipation at the EMV is **22 times** lower than the energy dissipation at nominal voltage. Similarly, for the clock-gating case, the energy dissipation at the EMV is **16 times** lower than the energy dissipation at nominal voltage.
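The reduction factors can be approximately reproduced from the tabulated values. The sketch below is a rough cross-check under one stated assumption: it uses the average dynamic energy per cycle from Table 4.4 as the nominal-voltage energy, so it ignores any leakage contribution at nominal voltage. The energies at the EMV are taken from Table 4.7.

```python
def energy_reduction(e_nominal_pj: float, e_emv_pj: float) -> float:
    """Ratio of nominal-voltage energy to energy at the EMV point."""
    return e_nominal_pj / e_emv_pj


# Average dynamic energy at nominal voltage (Table 4.4) and energy at EMV (Table 4.7)
print(round(energy_reduction(94.85, 4.14), 1))  # normal:       22.9
print(round(energy_reduction(19.07, 1.17), 1))  # clock-gating: 16.3
```

Both ratios are consistent with the factors of 22 and 16 quoted above.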

| Benchmark        | $\mu_e$ (Normal) | $\mu_e$ (Clock-gating) |
|------------------|------------------|------------------------|
| autcor00data_1   | 0.1786           | 0.0388                 |
| autcor00data_2   | 0.2142           | 0.0465                 |
| autcor00data_3   | 0.2160           | 0.0470                 |
| conven00data_1   | 0.1819           | 0.0396                 |
| conven00data_2   | 0.1806           | 0.0393                 |
| conven00data_3   | 0.1809           | 0.0394                 |
| fbital00data_2   | 0.2094       | 0.0459         |
| fbital00data_3   | 0.2052       | 0.0447         |
| fbital00data_6   | 0.2085       | 0.0457         |
| fft00data_1      | 0.2086       | 0.0453         |
| fft00data_2      | 0.2127       | 0.0464         |
| fft00data_3      | 0.2121       | 0.0462         |
| viterb00data_1   | 0.1520       | 0.0311         |
| viterb00data_2   | 0.1763       | 0.0384         |
| viterb00data_3   | 0.1772       | 0.0386         |
| viterb00data_4   | 0.1750       | 0.0380         |
| Average          | 0.19         | 0.04           |

Table 4.6: LEON-3 — Average circuit switching activity ($\mu_e$)
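The mean values in the last row of Table 4.6 can be recomputed from the per-benchmark entries, as in the following sketch. The ratio printed at the end is not stated in the text; it is derived here only to quantify how much clock-gating lowers the switching activity.

```python
from statistics import mean

# Per-benchmark switching activity, copied from Table 4.6
mu_normal = [0.1786, 0.2142, 0.2160, 0.1819, 0.1806, 0.1809, 0.2094, 0.2052,
             0.2085, 0.2086, 0.2127, 0.2121, 0.1520, 0.1763, 0.1772, 0.1750]
mu_gated = [0.0388, 0.0465, 0.0470, 0.0396, 0.0393, 0.0394, 0.0459, 0.0447,
            0.0457, 0.0453, 0.0464, 0.0462, 0.0311, 0.0384, 0.0386, 0.0380]

print(round(mean(mu_normal), 2))  # 0.19
print(round(mean(mu_gated), 2))   # 0.04

# Derived ratio of the two means (not quoted in the text)
print(round(mean(mu_normal) / mean(mu_gated), 1))  # 4.6
```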

Figure 4.3 also shows that the energy dissipation of the clock-gating case is significantly lower than that of the normal case. Since the clock-gating case has lower switching activity (as seen in table 4.6), it has lower dynamic energy dissipation (2.6).

Figure 4.4 shows the maximum operating frequency of both LEON-3 designs, generated by substituting the parameters into (2.9). Since the difference between the $k_{crit}$ values is very small, the maximum-frequency curves of the two designs lie on top of each other. For both designs, the maximum frequency is reduced to the kHz range due to supply voltage scaling. Table 4.7 summarizes the sub-V<sub>T</sub> results for LEON-3, including the frequency at the EMV points; the clock-gating case has a much higher frequency because its EMV is higher than that of the normal case. The next chapter presents the sub-V<sub>T</sub> results for the Cortex-M0 processor.

Table 4.7: LEON-3 — summary of sub-V<sub>T</sub> analysis

|              | Normal  | Clock-gating |
|--------------|---------|--------------|
| EMV          | 320 mV  | 390 mV       |
| Energy @ EMV | 4.14 pJ | 1.17 pJ      |
| Fmax @ EMV   | 36 kHz  | 219 kHz      |



Figure 4.3: LEON-3 sub-V<sub>T</sub> energy curve

Figure 4.4: LEON-3 sub-V<sub>T</sub> maximum operating frequency graph

## **Energy Estimation of Cortex-M0**

#### 5.1 Overview

The Cortex-M0 [5] is the smallest and most energy-efficient processor available from ARM Ltd. It is a 32-bit RISC processor with a 3-stage pipeline, implementing the ARMv6-M architecture [6]. ARM provides the Cortex-M0 IP core in two different configurations: a full implementation and a limited "DesignStart" configuration. The full implementation is provided as (commented, plain-text) Verilog RTL code, which can be customized as required, while the DesignStart IP is a fixed and limited configuration provided as a flattened and obfuscated Verilog netlist. For this study, only the DesignStart IP was available. Due to the limitations of the DesignStart IP [21] and differences in software toolchain requirements, the benchmark compilation and netlist simulations were performed in a slightly different manner; these differences are explained in the following sections. Like LEON-3, the Cortex-M0 was analyzed for both the normal and clock-gating configurations.

#### 5.2 Cortex-M0 Synthesis and PAR

The synthesis and PAR steps for Cortex-M0 are the same as for LEON-3. The Cortex-M0 was also synthesized for both the normal and clock-gating configurations. The synthesis was performed with *Synopsys DesignCompiler*, using the same gate library (STMicroelectronics 65-nm CMOS LL-HVT library) and design constraints (minimum area and leakage). Table 5.1 shows the synthesis results of the Cortex-M0 design.

Similarly, PAR was performed with *Cadence SoC Encounter*, using the gate-level netlist and SDC file from the synthesis step. Due to clock-tree generation and cell-placement optimizations, both the area and the critical path change. Table 5.2 shows the post-PAR results of the Cortex-M0 design.

|                             | Normal               | Clock-gating         |
|-----------------------------|----------------------|----------------------|
| Total Cell Area             | 26662 µm<sup>2</sup> | 23934 µm<sup>2</sup> |
| Total Cell Count            | 6147                 | 5108                 |
| Critical Path Delay         | 12.02 ns             | 11.55 ns             |
| Total Register Count        | 841                  | 841                  |
| Gated Register Count        | 0                    | 802 (95.36%)         |
| Total Clock Gating Elements | 0                    | 44                   |

Table 5.1: Cortex-M0 — Synthesis Results

Table 5.2: Cortex-M0 — Post-PAR Results

|                     | Normal               | Clock-gating         |
|---------------------|----------------------|----------------------|
| Total Cell Area     | 27450 µm<sup>2</sup> | 25091 µm<sup>2</sup> |
| Total Cell Count    | 6267                 | 5261                 |
| Critical Path Delay | 9.71 ns              | 9.24 ns              |

#### 5.3 Benchmark compilation for Cortex-M0

All the EEMBC benchmarks were compiled into ELF files using the ARM RealView Development Suite (RVDS). The toolchain provided in RVDS differs from the BCC toolchain explained in chapter 4; for example, RVDS does not have a "mkprom2"-type tool. The programmer therefore has to write a boot-loader and also needs to explicitly initialize the timer registers in the main software (the benchmarks in our case). For the testbench provided with the DesignStart IP there is no need to write a boot-loader; however, the software still has to initialize the timer registers and set up the timer interrupt.

The Cortex-M0 DesignStart IP is provided as a fixed configuration (it only contains one timer as a peripheral). Therefore, a different technique was used to obtain the benchmark output during netlist simulation. This technique is described in [22] and its implementation is provided with the DesignStart IP as a "hello world" C program. In this technique, the following low-level C functions and structures of the C library are replaced by new implementations provided inside the program:

- **\_\_FILE**: the file structure
- **\_\_stdin**: the standard input object of type \_\_FILE
- **\_\_stdout**: the standard output object of type \_\_FILE
- **fputc()**: outputs a character to a file
- **ferror()**: returns the error status accumulated during file I/O
- **fgetc()**: gets a character from a file
- **\_\_backspace()**: moves the file pointer to the previous character

By overriding these low-level I/O functions, one can use the high-level C functions (such as printf) with any custom I/O peripheral. In this case (as in the "hello world" example), these functions were overridden to write to an arbitrary memory location in the Cortex-M0 address space. Finally, following the EEMBC porting guide [19], the build scripts and Makefiles were modified to build the benchmarks for Cortex-M0/RVDS.

#### 5.4 Cortex-M0 Netlist Simulation and Power Estimation

The power dissipation of Cortex-M0 at nominal voltage was estimated by performing SDF-annotated netlist simulations in *Synopsys VCS*. The DesignStart IP package provides a Verilog testbench environment (figure 5.1) to simulate the Cortex-M0. The environment models a RAM and a console peripheral. The console peripheral is mapped into the processor address space and displays the data written by the processor in the HDL simulator. ARM provides a tool called "fromelf", which can convert an ELF file into a plain binary format that can easily be read into the Verilog testbench. As for LEON-3, the system clock frequency was set to 50 MHz. For a better estimate of the switching activity, all 16 benchmarks in the Telebench suite were simulated in the Cortex-M0 testbench. Overall, for the normal and clock-gating designs together, a total of 32 netlist simulations were performed.

At simulation startup, a power-on reset is performed and the binary file of the benchmark is loaded into the memory. After the reset, the processor executes the benchmark from memory. The benchmarks are executed until completion; the execution time of each benchmark is shown in table 5.3.

The dynamic power dissipation of Cortex-M0 was estimated using *Synopsys PrimeTime*. Table 5.3 shows the (average) dynamic power consumption of Cortex-M0 (at nominal voltage) while executing the benchmarks; these power values are averaged over the whole benchmark duration (Exec. Time). From table 5.3, the dynamic energy dissipation per clock cycle is calculated, which is shown in table 5.4.



Figure 5.1: Cortex-M0 Testbench schematic

| Benchmark      | Iter. | Exec. Time [s] | Dynamic Power (Normal) [W] | Dynamic Power (Clock-gating) [W] |
|----------------|-------|----------------|----------------------------|----------------------------------|
| autcor00data_1 | 300   | 0.03           | $5.35 \times 10^{-4}$ | $3.72 \times 10^{-4}$ |
| autcor00data_2 | 300   | 4.50           | $5.74 \times 10^{-4}$ | $4.05 \times 10^{-4}$ |
| autcor00data_3 | 300   | 4.29           | $5.81 \times 10^{-4}$ | $4.13 \times 10^{-4}$ |
| conven00data_1 | 400   | 1.27           | $7.96 \times 10^{-4}$ | $6.57 \times 10^{-4}$ |
| conven00data_2 | 400   | 1.09           | $7.98 \times 10^{-4}$ | $6.58 \times 10^{-4}$ |
| conven00data_3 | 400   | 0.86           | $7.98 \times 10^{-4}$ | $6.57 \times 10^{-4}$ |
| fbital00data_2 | 120   | 1.47           | $7.48 \times 10^{-4}$ | $6.01 \times 10^{-4}$ |
| fbital00data_3 | 120   | 0.12           | $7.45 \times 10^{-4}$ | $5.97 \times 10^{-4}$ |
| fbital00data_6 | 120   | 0.98           | $7.53 \times 10^{-4}$ | $6.05 \times 10^{-4}$ |
| fft00data_1    | 800   | 3.47           | $6.01 \times 10^{-4}$ | $4.30 \times 10^{-4}$ |
| fft00data_2    | 800   | 3.48           | $6.13 \times 10^{-4}$ | $4.40 \times 10^{-4}$ |
| fft00data_3    | 800   | 3.48           | $6.12 \times 10^{-4}$ | $4.39 \times 10^{-4}$ |
| viterb00data_1 | 190   | 1.35           | $7.77 \times 10^{-4}$ | $6.44 \times 10^{-4}$ |
| viterb00data_2 | 190   | 1.35           | $7.76 \times 10^{-4}$ | $6.43 \times 10^{-4}$ |
| viterb00data_3 | 190   | 1.35           | $7.81 \times 10^{-4}$ | $6.47 \times 10^{-4}$ |
| viterb00data_4 | 190   | 1.36           | $7.73 \times 10^{-4}$ | $6.40 \times 10^{-4}$ |

Table 5.3: Cortex-M0 — Benchmark execution summary

| Benchmark      | Dynamic Energy (Normal) [pJ] | Dynamic Energy (Clock-gating) [pJ] |
|----------------|------------------------------|------------------------------------|
| autcor00data_1 | 10.70          | 7.43           |
| autcor00data_2 | 11.47          | 8.11           |
| autcor00data_3 | 11.62          | 8.25           |
| conven00data_1 | 15.92          | 13.13          |
| conven00data_2 | 15.96          | 13.16          |
| conven00data_3 | 15.93          | 13.14          |
| fbital00data_2 | 14.96          | 12.01          |
| fbital00data_3 | 14.89          | 11.94          |
| fbital00data_6 | 15.05          | 12.09          |
| fft00data_1    | 12.02          | 8.59           |
| fft00data_2    | 12.25          | 8.80           |
| fft00data_3    | 12.24          | 8.79           |
| viterb00data_1 | 15.54          | 12.88          |
| viterb00data_2 | 15.52          | 12.86          |
| viterb00data_3 | 15.61          | 12.95          |
| viterb00data_4 | 15.45          | 12.80          |
| average        | 14.07          | 11.06          |

Table 5.4: Cortex-M0 — Energy Dissipation per clock cycle (50 MHz) at nominal voltage
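The conversion from table 5.3 to table 5.4 can be sketched as follows. This is a minimal illustration rather than the script used in the thesis: the function name `energy_per_cycle_pj` and the dictionary layout are hypothetical, and only three rows of table 5.3 are copied in. Because the power values in table 5.3 are rounded to three significant digits, the recomputed energies agree with table 5.4 only to within about 0.01 pJ.

```python
# Clock frequency used for the nominal-voltage simulations (from the text).
F_CLK_HZ = 50e6

# Average dynamic power in watts, copied from three rows of table 5.3:
# benchmark -> (normal, clock-gating)
POWER_W = {
    "autcor00data_1": (5.35e-4, 3.72e-4),
    "conven00data_1": (7.96e-4, 6.57e-4),
    "fft00data_1": (6.01e-4, 4.30e-4),
}


def energy_per_cycle_pj(power_w, f_clk_hz=F_CLK_HZ):
    """Energy per clock cycle in pJ: average power times one clock period."""
    return power_w / f_clk_hz * 1e12


for name, (p_normal, p_gated) in POWER_W.items():
    e_normal = energy_per_cycle_pj(p_normal)
    e_gated = energy_per_cycle_pj(p_gated)
    # Relative saving of the clock-gating design at nominal voltage.
    saving = 100.0 * (1.0 - e_gated / e_normal)
    print(f"{name}: {e_normal:.2f} pJ -> {e_gated:.2f} pJ ({saving:.1f}% lower)")
```

The same division applied to the remaining rows reproduces the rest of table 5.4, again up to rounding of the tabulated power values.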

#### 5.5 Sub-V<sub>T</sub> Analysis and Results

The sub-V<sub>T</sub> model parameters were extracted using the methodology described in chapter 3. All the k-parameters ( $k_{crit}$ ,  $k_{leak}$  and  $k_{cap}$ ) are calculated by analyzing the post-PAR netlist as explained in section 3.2.4; these parameters are given in table 5.5. Since the area of the clock-gating design is smaller than that of the normal design, its  $k_{leak}$  and  $k_{cap}$  parameters are also smaller.

| Parameter         | Value (Normal)       | Value (Clock-gating) |
|-------------------|----------------------|----------------------|
| k<sub>leak</sub>  | $2.38 \times 10^{4}$ | $2.30 \times 10^{4}$ |
| k<sub>cap</sub>   | $4.67 \times 10^{4}$ | $4.05 \times 10^{4}$ |
| k<sub>crit</sub>  | 435.79               | 414.69               |

Table 5.5: Cortex-M0 — Sub-V<sub>T</sub> k-parameters

Similarly, the average circuit switching activity  $(\mu_e)$  is calculated from the power report generated by *Synopsys PrimeTime*. The  $\mu_e$  factor is calculated for each benchmark; table 5.6 shows the  $\mu_e$  values for the different benchmarks. As for LEON-3, a single mean value of  $\mu_e$  is used in the sub-V<sub>T</sub> analysis. Table 5.6 shows that the switching activity of the clock-gating design is lower than that of the normal design for every benchmark.

Figure 5.2 shows the sub-V<sub>T</sub> energy curves of the clock-gated and normal Cortex-M0 designs. These graphs are generated by substituting the corresponding parameter values from tables 5.5 and 5.6 into the sub-V<sub>T</sub> energy equation (2.10). In both cases there exists an EMV point at which the energy dissipation is minimal. For both designs, the energy dissipation at the EMV point is of similar magnitude, roughly **20 times** lower than the average energy dissipation at nominal voltage (table 5.4).

Figure 5.2 also shows that the energy dissipation of the clock-gating design is lower than that of the normal design. Because the clock-gating design has lower switching activity (table 5.6), it has lower dynamic energy dissipation (2.6).
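As a rough check of this argument, assume that the dynamic term (2.6) has the commonly used form $E_{dyn} = \mu_e \, k_{cap} \, C_{inv} \, V_{DD}^2$. This form and the symbol $C_{inv}$ (the switched capacitance of a reference inverter) are assumptions here, since the equation is defined in chapter 2 and not repeated in this section. At equal supply voltage and in the same technology, $C_{inv}$ and $V_{DD}$ cancel, so the ratio of the dynamic energies depends only on the mean $\mu_e$ (table 5.6) and on $k_{cap}$ (table 5.5):

$$
\frac{E_{dyn}^{cg}}{E_{dyn}^{normal}} = \frac{\mu_e^{cg} \, k_{cap}^{cg}}{\mu_e^{normal} \, k_{cap}^{normal}} = \frac{0.10 \times 4.05 \times 10^{4}}{0.12 \times 4.67 \times 10^{4}} \approx 0.72
$$

Under this assumption, the model predicts that the dynamic energy of the clock-gating design is roughly 28% lower than that of the normal design at the same supply voltage.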

Figure 5.3 shows the maximum operating frequency of both Cortex-M0 designs, generated by substituting the parameters into (2.9). Since the difference between the  $k_{crit}$  values is very small, the maximum frequency curves of the two designs lie on top of each other. For both designs, the maximum frequency drops to the kHz range due to supply voltage scaling. Table 5.7 summarizes the sub-V<sub>T</sub> results for Cortex-M0, including the maximum frequency at the EMV points. The clock-gating design reaches a higher frequency there because its EMV is higher than that of the normal design.
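The trade-off summarized in table 5.7 can be made concrete with a short calculation. The sketch below is illustrative only: the dictionary layout and the function name `reduction_factors` are hypothetical, and the numbers are copied from tables 5.4 and 5.7. It compares the same quantities as the text, i.e. the average energy per cycle at nominal voltage against the model energy at the EMV point, and the 50 MHz nominal clock against the maximum frequency at the EMV point.

```python
# Nominal clock frequency used in the netlist simulations (from the text).
F_NOMINAL_HZ = 50e6

# Values copied from table 5.4 (average row) and table 5.7.
RESULTS = {
    "normal": {"e_nominal_pj": 14.07, "e_emv_pj": 0.68, "f_emv_hz": 54e3},
    "clock-gating": {"e_nominal_pj": 11.06, "e_emv_pj": 0.56, "f_emv_hz": 68e3},
}


def reduction_factors(row, f_nominal_hz=F_NOMINAL_HZ):
    """Return (energy reduction, frequency reduction) from nominal to EMV."""
    energy_factor = row["e_nominal_pj"] / row["e_emv_pj"]
    frequency_factor = f_nominal_hz / row["f_emv_hz"]
    return energy_factor, frequency_factor


for design, row in RESULTS.items():
    energy_factor, frequency_factor = reduction_factors(row)
    print(f"{design}: energy / {energy_factor:.1f}, frequency / {frequency_factor:.0f}")
```

For both designs the energy per cycle falls by a factor of about 20, while the clock frequency falls by roughly three orders of magnitude, which is why sub-V<sub>T</sub> operation suits applications with relaxed throughput constraints.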



Figure 5.2: Cortex-M0 sub-V<sub>T</sub> energy curve



Figure 5.3: Cortex-M0 sub-V<sub>T</sub> maximum operating frequency graph

| Benchmark      | $\mu_e$ (normal) | $\mu_e$ (clock-gating) |
|----------------|------------------|------------------------|
| autcor00data_1 | 0.0831           | 0.0666                 |
| autcor00data_2 | 0.0891           | 0.0727                 |
| autcor00data_3 | 0.0903           | 0.0740                 |
| conven00data_1 | 0.1237           | 0.1177                 |
| conven00data_2 | 0.1239           | 0.1180                 |
| conven00data_3 | 0.1237           | 0.1178                 |
| fbital00data_2 | 0.1162           | 0.1077                 |
| fbital00data_3 | 0.1157           | 0.1070                 |
| fbital00data_6 | 0.1169           | 0.1084                 |
| fft00data_1    | 0.0933           | 0.0770                 |
| fft00data_2    | 0.0951           | 0.0788                 |
| fft00data_3    | 0.0951           | 0.0788                 |
| viterb00data_1 | 0.1207           | 0.1154                 |
| viterb00data_2 | 0.1205           | 0.1153                 |
| viterb00data_3 | 0.1213           | 0.1161                 |
| viterb00data_4 | 0.1200           | 0.1147                 |
| Average        | 0.12             | 0.10                   |

Table 5.6: Cortex-M0 — Average circuit switching activity  $(\mu_e)$ 

Table 5.7: Cortex-M0 — Summary of sub-V<sub>T</sub> analysis

|              | Normal  | Clock-gating |
|--------------|---------|--------------|
| EMV          | 345 mV  | 352 mV       |
| Energy @ EMV | 0.68 pJ | 0.56 pJ      |
| Fmax @ EMV   | 54 kHz  | 68 kHz       |

# Chapter 6

## Conclusion

#### 6.1 Comparison of LEON-3 with Cortex-M0

Tables 4.2 and 5.2 show the area of LEON-3 and Cortex-M0, respectively. LEON-3 occupies almost 4 times the area of Cortex-M0. An exact feature comparison of the two processors is not possible because the source code of Cortex-M0 is not available in the DesignStart version. However, according to ARM literature, Cortex-M0 has 15 general-purpose registers and a 32-cycle multiplier (DesignStart version), and supports a small set of instructions [5, 21]. LEON-3, on the other hand, implements a larger instruction set (the complete SPARC-V8 instruction set) and has a fast 32-bit multiplier. Additionally, LEON-3 contains by default 8 register windows [4] for efficient function calling, where each window has 24 general-purpose registers of 32 bits each.

Figures 4.3 and 5.2 show the sub-V<sub>T</sub> energy curves of LEON-3 and Cortex-M0, respectively. Clock-gating saves relatively less energy in Cortex-M0 than in LEON-3 because Cortex-M0 has just 841 registers (table 5.1), while LEON-3 has 6484 registers (table 4.1). Because of this large number of registers, the difference in energy dissipation between the normal and clock-gating versions is much bigger for LEON-3.

Tables 4.7 and 5.7 show the sub-V<sub>T</sub> model results for LEON-3 and Cortex-M0, respectively. These results show that Cortex-M0 has a more energy-efficient architecture than LEON-3. The main (and obvious) reason for the higher energy dissipation of LEON-3 is its larger area, because a larger area increases both the  $k_{cap}$  and  $k_{leak}$  factors.

Moreover, Cortex-M0 is specifically designed to interface directly with low-latency on-chip memories [5, 21], whereas LEON-3 (like any typical processor) assumes a memory hierarchy and differentiates between cache and system-memory accesses. Consequently, the performance of LEON-3 was badly affected by the removal of its caches, which is evident from table 4.3.

In this study, both processors were analyzed in their main execution mode for their worst-case energy dissipation. However, these processors also support different low-power execution modes, which can be activated in idle states to reduce leakage energy. The energy dissipation in these low-power execution modes can be analyzed in future work.

#### 6.2 Conclusion

In this thesis, the energy dissipation of the Cortex-M0 and LEON-3 processors was analyzed using a high-level energy estimation model. Using this model, it was found that by applying clock-gating and reducing the supply voltage to about 0.35 V, the energy dissipation per clock cycle of both processors can be reduced to the order of picojoules. Sub-threshold operation reduces their clock frequency to roughly 50 kHz, but most medical implants and remote sensors do not require a higher clock frequency.

## **Bibliography**

- [1] H. Soeleman and K. Roy, "Ultra-low power digital subthreshold logic circuits," Low Power Electronics and Design, 1999. Proceedings. 1999 International Symposium on, pp. 94–96, 1999.
- [2] C. Strydis and D. Dave, "Identifying optimal generic processors for biomedical implants," 2010 IEEE International Conference on Computer Design, pp. 494–501, 2010.
- [3] J.-P. Vasseur and A. Dunkels, "What Are Smart Objects?," in Interconnecting Smart Objects with IP, pp. 3–20, Morgan Kaufmann, 2010.
- [4] Aeroflex Gaisler AB, GRLIB IP Core User's Manual, 2010.
- [5] ARM Ltd., Cortex<sup>TM</sup>-M0 Devices Generic User Guide, 2009.
- [6] ARM Ltd., ARMv6-M Architecture Reference Manual, 2010.
- [7] SPARC International, Inc., The SPARC Architecture Manual Version 8, 1992.
- [8] D. Blaauw and B. Zhai, "Energy efficient design for subthreshold supply voltage operation," 2006 IEEE International Symposium on Circuits and Systems, pp. 4pp.-32, 2006.
- [9] B. Calhoun, A. Wang, and A. Chandrakasan, "Modeling and sizing for minimum energy operation in subthreshold circuits," *IEEE Journal of Solid-State Circuits*, vol. 40, no. 9, pp. 1778–1786, 2005.
- [10] O. Akgun, J. Rodrigues, Y. Leblebici, and V. Öwall, "High-level energy estimation in the sub-VT domain: simulation and measurement of a cardiac event detector," *Transactions on Biomedical Circuits and Systems*, 2011.
- [11] H. Soeleman, K. Roy, and B. Paul, "Robust subthreshold logic for ultra-low power operation," *IEEE T-VLSI Systems*, vol. 9, pp. 90–99, Feb 2001.

- [12] E. Vittoz, Low-Power Electronics Design, ch. 16. CRC Press, 2004.
- [13] J. M. Rabaey et al., Digital Integrated Circuits, ch. 5. Prentice Hall, 2003.
- [14] O. Akgun and Y. Leblebici, "Energy efficiency comparison of asynchronous and synchronous circuits operating in sub-threshold regime," *Low Power Electronics*, vol. 3, no. 3, pp. 320–336, 2008.
- [15] J. M. Rabaey, Low Power Design Essentials, ch. 8. Springer, 2009.
- [16] Embedded Microprocessor Benchmark Consortium, TeleBench<sup>TM</sup>1.1 software benchmark databook.
- [17] N. Verma and A. Chandrakasan, "A 256 kb 65 nm 8T subthreshold SRAM employing sense-amplifier redundancy," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 1, pp. 141–149, 2008.
- [18] Aeroflex Gaisler AB, BCC Bare-C Cross-Compiler User's Manual, 2010.
- [19] Embedded Microprocessor Benchmark Consortium, EEMBC Benchmark Software Porting Guide, 2008.
- [20] Aeroflex Gaisler AB, MKPROM2 Overview, 2010.
- [21] ARM Ltd., Cortex<sup>TM</sup>-M0 r0p0-00rel0 Release Note, 2010.
- [22] ARM Ltd., ARM Compiler toolchain Using ARM C and C++ Libraries and Floating-Point Support, 2010.