## Low Power Decoding Circuits for Ultra Portable Devices

Reza Meraji



#### LUND UNIVERSITY

Doctoral Dissertation Circuits and Systems Lund, October 2014

Reza Meraji Department of Electrical and Information Technology Circuits and Systems Lund University P.O. Box 118, 221 00 Lund, Sweden

Series of licentiate and doctoral dissertations ISSN 1654-790X; No. 64 ISBN 978-91-7623-057-2 (pdf)

© 2014 Reza Meraji. Typeset in Palatino and Helvetica using LaTeX 2ε. Printed in Sweden by Tryckeriet i E-huset, Lund University, Lund.

No part of this dissertation may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without written permission from the author.

#### Popular Science Summary

Over the years, considerable advances have been made in wireless communication. Driven by the mobile phone industry, most of the previous efforts have been concentrated on improving data rates, to provide a wide range of services to the users. However, plenty of applications exist and more are emerging that can benefit from low rate and short range wireless communication devices. Such devices can be used in any portable electronics product with applications in the field of healthcare, sports and fitness, wearable gadgets, PC peripherals, industrial monitoring and automation, gaming devices, and plenty of other consumer electronics. More specifically, due to the aging population of the world and increased attention to health awareness in recent years, the healthcare and fitness industry have a huge application potential.

General-purpose, short-range wireless standards such as Bluetooth and ZigBee have been around for some time. These standards have made users' lives more comfortable by providing a wide range of services, such as wireless connection to smartphones, tablets, TVs and other electronic devices. However, this has been achieved at the cost of increased power consumption, which brings the inconvenient obligation of regular battery recharging or replacement. In these standards, the circuit modules dedicated to handling radio connectivity drain a considerable amount of energy from the battery. Consequently, reducing power consumption has become a very important design aspect in recent years. Specifically, with the increasing miniaturization of devices, power has become a top-priority design consideration, which has motivated researchers to find techniques for reducing power consumption in all components of a system during the design period. Battery technology is not improving at the same pace as the demand for processing power. Furthermore, in many applications there is an upper limit to the physical size of the battery and thus to the total amount of energy available. There are many cases, such as medical implantable electronics or *install and forget* remote sensor networks, where the battery is desired to last the lifetime of the device, as recharging or battery replacement is either difficult or not feasible. What can be achieved with the available power budget depends on how energy-efficiently the corresponding integrated circuits for radio connectivity operate.

Another relevant circuit design aspect is the physical dimension. A smaller integrated circuit is always desired, as it requires fewer resources and reduces the cost in mass production. Furthermore, a smaller circuit fits more easily in a miniature product. All of this eventually results in a more affordable product for the potential users.

An important hardware component that can help to reduce the total power consumption in a wireless communication link is the *error control* circuitry. Error control, or *error detection and correction*, refers to techniques that enable reliable delivery of digital data over a poor communication channel. In many cases, error control techniques enable reconstruction of the original transmitted data from received data that has been corrupted by passing through an unreliable channel. Also, for a similar quality of service, error control components help to reduce the transmission power and hence save energy. To benefit the most from error control circuits and reduce the total power consumption of the device, it is critical to design such circuits to operate as power efficiently as possible.

In this dissertation, in the framework of a low rate, short range and low power wireless system, low power implementation methods for error control circuits are investigated. Channel decoding circuits that are implemented according to either low power digital or analog circuit design techniques are fundamentally different. Each of these approaches introduces different sets of design challenges. Accordingly, simulations and low power design techniques are followed. Furthermore, attempts to deal with various challenges in the design period are presented. Consequently, alternative low power circuit architectures both in analog and digital domains are proposed, fabricated at an industrial facility and evaluated through laboratory measurements. The proposed decoder integrated circuits are analyzed in terms of critical aspects such as coding gain, required silicon area, speed of operation, energy efficiency and minimum power needed for successful operation.

The research work presented in this dissertation is fulfilled as part of the project *Wireless Communication for Ultra Portable Devices*. The project is funded by a grant from the Swedish Foundation for Strategic Research (*Stiftelsen för Strategisk Forskning - SSF*). The chip fabrications have been carried out by STMicroelectronics.

to my mother and in memory of my father

»The illiterate of the 21<sup>st</sup> century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.«

Alvin Toffler, writer and futurist

#### Abstract

A wide range of existing and emerging battery-driven wireless devices do not necessarily demand high data rates. Rather, ultra low power, portability and low cost are the most desired characteristics. Examples of such applications are wireless sensor networks (WSN), body area networks (BAN), and a variety of medical implants and health-care aids. Being small, cheap and low power allows the individual transceiver nodes to be used in abundance in remote places, where access for maintenance or recharging the battery is limited. In such scenarios, the lifetime of the battery, in most cases, determines the lifetime of the individual nodes. Therefore, energy consumption has to be so low that the nodes remain operational for an extended period of time, even up to a few years. It is known that using error correcting codes (ECC) in a wireless link can potentially help to reduce the transmit power considerably. However, the power consumption of the coding-decoding hardware itself is critical in an ultra low power transceiver node. The power and silicon area overhead of the coding-decoding circuitry needs to be kept at a minimum in the total energy and cost budget of the transceiver node. In this thesis, low power approaches to decoding circuits in the framework of the mentioned applications and use cases are investigated. The presented work is based on 65 nm CMOS technology and is structured in four parts as follows:

In the first part, the goals and objectives, background theory and fundamentals of the presented work are introduced. Also, the ECC block is presented in coordination with its surrounding environment, a low power receiver chain. Designing and implementing an ultra low power and low cost wireless transceiver node introduces challenges that require special considerations at various levels of abstraction. Similarly, a competitive solution often emerges only after a conclusive design space exploration. The proposed decoder circuits in the following parts are designed to be embedded in the low power receiver chain that is introduced in the first part.

The second part explores the analog decoding method and its suitability for embedding in a compact and low power transceiver node. Analog decoding was introduced theoretically over a decade ago, followed by early proof-of-concept circuits that promised it to be a feasible low power solution. Still, with the increased popularity of low power sensor networks, it has not been clear how an analog decoding approach performs in terms of power, silicon area, data rate and integrity of calculations in recent technologies and for low data rates. An ultra low power budget, a small size requirement and more relaxed demands on data rates suggest a decoding circuit with limited complexity. Therefore, the four-state (7,5) codes are considered for hardware implementation. Simulations to choose the critical design factors are presented. Consequently, to evaluate critical specifications of the decoding circuit, three versions of the analog decoding circuit with different transistor dimensions were fabricated. The measurement results reveal different trade-off possibilities as well as the potentials and limitations of the analog decoding approach for the target applications. Measurements seem to be crucial, since the available computer-aided design (CAD) tools provide limited assistance and precision, given the amount of calculations and parameters that have to be included in the simulations. The largest analog decoding core (AD1) takes 0.104 mm<sup>2</sup> on silicon and the other two (AD2 and AD3) take 0.035 mm<sup>2</sup> and 0.015 mm<sup>2</sup>, respectively. Consequently, coding gain in trade-off with silicon area and throughput is presented. The analog decoders operate with a 0.8 V supply. The achieved coding gain is 2.3 dB at a bit error rate (BER) of 0.001, and an energy efficiency of 10 picojoules per bit (pJ/b) is reached at 2 Mb/s.
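To make the four-state (7,5) code named above concrete, the following is a minimal sketch of a rate-1/2 encoder with octal generators 7 and 5. It assumes a feedforward (non-recursive) encoder with tail-biting initialization, in which the two memory cells are preloaded with the last two information bits so that the encoder starts and ends in the same state. The function name `encode_75_tailbiting` and the ordering of the two output bits are illustrative assumptions, not taken from the thesis.

```python
from typing import List, Tuple


def encode_75_tailbiting(bits: List[int]) -> Tuple[List[int], Tuple[int, int]]:
    """Encode one block with a feedforward (7,5) tail-biting encoder.

    Illustrative sketch only: generator 7 (binary 111) taps the input and
    both memory cells, generator 5 (binary 101) taps the input and the
    oldest memory cell. Returns the coded bits and the final state.
    """
    if len(bits) < 2:
        raise ValueError("block must contain at least two information bits")

    # Tail-biting: preload memory with the last two information bits.
    # s1 holds the most recent bit, s2 the one before it.
    s1, s2 = bits[-1], bits[-2]

    coded: List[int] = []
    for u in bits:
        coded.append(u ^ s1 ^ s2)  # output of generator 7 (111)
        coded.append(u ^ s2)       # output of generator 5 (101)
        s1, s2 = u, s1             # shift the register
    return coded, (s1, s2)
```

With two memory cells the encoder has 2² = 4 states, which is what keeps the trellis, and hence the decoding hardware, small.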

The third part of this thesis proposes an alternative low power digital decoding approach for the same codes. The compact and low power goal has been pursued by designing an equivalent digital decoding circuit that is fabricated in 65 nm CMOS technology and operates in the low voltage (near-threshold) region. The architecture of the design is optimized at the system and circuit levels to provide a competitive digital alternative. Similarly, critical specifications of the decoder in terms of power, area, data rate (speed) and integrity are reported according to the measurements. The digital implementation, with an area of 0.11 mm<sup>2</sup>, consumes minimum energy at a 0.32 V supply, which gives 9 pJ/b energy efficiency at 125 kb/s and 2.9 dB coding gain at BER = 0.001.
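The energy-efficiency figures quoted in this abstract can be related to average power under the assumption that energy per bit equals average decoder power divided by throughput (whether interface circuitry is included in the quoted figures is not stated here). A small sketch of that arithmetic, using the hypothetical helper `average_power_watts`:

```python
def average_power_watts(energy_per_bit_joules: float, bit_rate_bps: float) -> float:
    """Average power implied by an energy-per-bit figure at a given throughput.

    Assumes energy per bit = average power / bit rate (an assumption, not a
    definition taken from the thesis).
    """
    return energy_per_bit_joules * bit_rate_bps


# Figures quoted in the abstract.
analog_power = average_power_watts(10e-12, 2e6)    # 10 pJ/b at 2 Mb/s
digital_power = average_power_watts(9e-12, 125e3)  # 9 pJ/b at 125 kb/s

print(f"analog:  {analog_power * 1e6:.3f} uW")
print(f"digital: {digital_power * 1e6:.3f} uW")
```

Under this assumption the analog figure corresponds to about 20 µW and the digital figure to about 1.1 µW. Because the two operating points differ widely in throughput, energy per bit is the more informative basis for comparison than power alone.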

The fourth and last part compares the proposed design alternatives, based on the fabricated chips and the measurement results, to conclude which is the most suitable solution for the considered target applications. Advantages and disadvantages of both approaches are discussed. Possible extensions of this work are introduced as future work.

## Preface

#### **Journal Articles**

• R. MERAJI, Y. SHERAZI, J. B. ANDERSON, H. SJÖLAND, V. ÖWALL, »A Comparison of Low Power Analog and Digital (7,5) Convolutional Decoders in 65 nm CMOS,« *submitted to IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS I).*

**contribution:** This article concludes the research work on low power decoders and shows trade-offs among critical specifications of the sub-threshold digital and analog decoders. The effect of transistor dimensions on the performance of the analog decoding circuit is investigated through measurements. The paper concludes with the implementation approach (digital or analog) that is most suitable for the targeted compact and ultra low power, low rate radio receiver. The entire work has been carried out by the first author, with some assistance from the second author in digital implementation, and under supervision of the remaining authors.

• H. SJÖLAND, J. B. ANDERSON, C. BRYANT, R. CHANDRA, O. EDFORS, A. JOHANSSON, N. SEYED MAZLOUM, R. MERAJI, P. NILSSON, D. RADJEN, J. RODRIGUES, Y. SHERAZI, V. ÖWALL, »A receiver architecture for devices in wireless body area networks,« *IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS)*, vol., no., Month 2012.

**contribution:** The research work has been performed jointly with the other authors, under supervision of the first author. This article is partly included in the thesis, mainly the part contributed by the author of this thesis. This part covers the simulated performance of the low power decoder in the proposed ultra low power radio receiver, obtained through system simulations.

#### Peer Reviewed Conference Papers

In the following papers, the major research work has been performed by the first author, with contributions from the remaining authors.

• H. SJÖLAND, J. B. ANDERSON, C. BRYANT, R. CHANDRA, O. EDFORS, A. JOHANSSON, N. SEYED MAZLOUM, R. MERAJI, P. NILSSON, D. RADJEN, J. RODRIGUES, Y. SHERAZI, V. ÖWALL, »Ultra low power transceivers for wireless sensors and body area networks,« 8th International Symposium on Medical Information and Communication Technology (ISMICT), Firenze, Italy, Apr. 2014.

**contribution:** The research work has been performed jointly with the other authors, under supervision of the first author.

• R. MERAJI, Y. SHERAZI, J. B. ANDERSON, H. SJÖLAND, V. ÖWALL, »Analog and Digital Approaches for an Energy Efficient Low Complexity Channel Decoder,« *International Symposium on Circuits and Systems (ISCAS)*, Beijing, May 2013.

**contribution:** This paper is based on simulation and synthesis results and presents a study of analog and digital versions of a low complexity channel decoder, investigating the overall performance of both circuits in 65 nm CMOS for moderate bit rate applications.

• R. MERAJI, J. B. ANDERSON, H. SJÖLAND, V. ÖWALL, »A 3 µW 500 kb/s Ultra Low Power Analog Decoder with Digital I/O in 65 nm CMOS,« *International Conference on Electronics and Communication Systems (ICECS)*, Abu Dhabi, Dec. 2013.

**contribution:** This paper presents the measurement results for the analog decoding chip with digital interface in 65 nm CMOS.

• R. MERAJI, J. B. ANDERSON, H. SJÖLAND, V. ÖWALL, »An Analog (7,5) Convolutional Decoder in 65 nm CMOS for Low Power Wireless Applications,« *International Symposium on Circuits and Systems (ISCAS)*, Rio de Janeiro, May 2011.

**contribution:** Mainly, the architecture of an analog decoder for embedding in a conventional digital receiver is presented. Simulations of the performance and of the expected power consumption at the simulated throughput are carried out.

• R. MERAJI, J. B. ANDERSON, H. SJÖLAND, V. ÖWALL, »Transistor sizing for a 4-state current mode analog channel decoder in 65-nm CMOS,« *NORCHIP*, Lund, Sweden, Nov. 2011.

**contribution:** A simulation technique combining Monte-Carlo analysis in Spectre with Matlab processing has been used to investigate suitable transistor sizing for an analog (7,5) convolutional decoder.

• R. MERAJI, J. B. ANDERSON, H. SJÖLAND, V. ÖWALL, »A low power analog channel decoder for ultra portable devices in 65 nm technology,« *NORCHIP*, Tampere, Finland, Nov. 2010.

**contribution:** This work investigates the performance of an analog Hamming decoder with peripheral data converters and digital interface circuitry in 65 nm CMOS. The architecture of the decoding core is mainly based on previously published works in older technologies.

The research work included in this dissertation is supported by the Swedish Foundation for Strategic Research. Chip fabrications are supported by STMicroelectronics.

## Acknowledgments

Apart from my own efforts, the completion of this dissertation would not have been possible without the support of many others. I take this opportunity to express my appreciation to the people who have encouraged me in pursuing my doctoral degree.

First and foremost, I would like to express my sincere gratitude to my main supervisor Professor Viktor Öwall. I cannot thank him enough for his tremendous support and help. He has not only guided me well toward my academic achievements, but has also inspired me to experience more, read better, try new things and broaden my horizons. I could not have imagined having a better supervisor for my PhD study.

I would like to extend my special appreciation to my co-supervisor Professor John B. Anderson for giving me valuable advice and for all the enjoyable discussions that we had.

I would like to thank Professor Henrik Sjöland for his excellent management of the UPD project, which I was part of. I am indebted to him for his many useful technical comments and feedback, which helped me to improve my work. I am also thankful to Dr. Carl Bryant, Dr. Yasser Sherazi, Dr. Dejan Radjen, Nafiseh Seyed Mazloum and Dr. Rohit Chandra for being great project teammates.

Being part of the Digital ASIC group has been a wonderful experience for me. Apart from all that I have learned from our numerous technical and non-technical discussions, it has left me with unforgettable, pleasant memories of all the fun times that we had together during workshops, group activities, conference trips, after-works and barbecue evenings. For all these, I am truly grateful to my former and present colleagues and friends: Dr. Deepak Dasalakunte, Dr. Johan Löfgren, Dr. Isael Diaz, Dr. Yasser Sherazi, Dr. Chenxin Zhang, Dr. Liang Liu, Associate Professor Joachim Rodrigues, Rakesh Gangarajaiah, Hemanth Prabhu, Oskar Andersson, Babak Mohammadi, Michal Stala, Yangxurui Liu and Professor Peter Nilsson. Indeed, I have had the most enjoyable time working with a group of amazing more recent colleagues in the Digital ASIC group: Steffen Malkowsky, Christoph Müller, Farrokh Ghani Zadegan, Dimitar Nikolov, Breeta SenGupta and Professor Erik Larsson.

Special thanks to Pia Bruhn, Doris Glöck, Anne Andersson, Erik Jonsson, Robert Johnsson, Bertil Lindvall, Stefan Molund, Martin Nilsson, Lars Hedenstjerna, and Josef Wajnblom for taking care of all technical and administrative issues.

There are more friends and colleagues who deserve my appreciation. The list would be long, so I thank you all and wish you all the best!

Finally, I would like to show my deepest gratitude to my mother for her patience and unconditional support. And my father, though I lost him more than ten years ago, he is still inspiring me to reach ever further.

Reza Meraji

Lund, October 2014

## Contents

| Chapter | Section | Title |
|---------|---------|-------|
|    |      | Preface |
|    |      | Acknowledgments |
|    |      | Contents |
| 1  |      |  |
|    | 1.1  | Motivation |
|    | 1.2  | Error Control Methods in Wireless Networks |
|    | 1.3  | Energy Considerations in a coded System |
|    | 1.4  | Target Applications |
|    | 1.5  | Background Survey |
|    | 1.6  | Challenges |
|    | 1.7  | Research Objectives and Contributions |
|    | 1.8  | Outline of the thesis |
| 2  |      | Ultra Low Power Radio |
|    | 2.1  | Introduction |
|    | 2.2  | The Vision |
|    | 2.3  | The UPD system overview |
|    | 2.4  | UPD Modulation Scheme |
|    | 2.5  | Low power radio standards |
|    | 2.6  | Relevance and challenges of sub-threshold design |
| 3  |      | Fundamentals of MOSFET Sub-threshold Operation |
|    | 3.1  | MOS Transistor Basic Regions of Operation |
|    | 3.2  | MOS Transistor Regions of Operation at Low Drain Currents |
|    | 3.3  | MOS model in weak inversion region |
|    | 3.4  | Sub-threshold MOS Ultra Low Power Circuits |
|    | 3.5  | Circuit Design Considerations in Sub-threshold |
| 4  |      | Selection of the Coding Scheme |
|    | 4.1  | Benefits of error control codes |
|    | 4.2  | System and hardware considerations for a coded transmission |
|    | 4.3  | Coding Scheme |
|    | 4.4  | Soft Decision Decoding |
|    | 4.5  | Trellis Representation of Convolutional Codes |
|    | 4.6  | Tail-biting codes |
|    | 4.7  | The Generic Sum-Product Decoding Algorithm |
|    | 4.8  | The BCJR Decoding Algorithm |
|    | 4.9  | System Level UPD Baseband simulation |
| 5  |      | Low Power Analog Decoding |
|    | 5.1  | Decoding in Analog |
|    | 5.2  | Analog Decoding Basic Calculations: A Simplified Model |
|    | 5.3  | Forward-Backward Computations |
|    | 5.4  | Operation of an Analog Decoding Core |
| 6  |      | Hardware Mapping of the Analog Decoding Circuit |
|    | 6.1  | System Perspective |
|    | 6.2  | Top-Level Architecture of the Decoder |
|    | 6.3  | Extended (8,4) Hamming Decoder: A Brief Investigation |
|    | 6.4  | (7,5) Analog Decoding Circuit |
|    | 6.5  | Circuit Details |
|    | 6.6  | Fabricated Analog Decoders |
| 7  |      | Performance of Analog Decoding Circuits |
|    | 7.1  | Objectives |
|    | 7.2  | Measurement Setup |
|    | 7.3  | Analog Decoder Measurements |
|    | 7.4  | Observations from the Measurements |
| 8  |      | Low Power Digital Design Techniques |
|    | 8.1  | Power consumption in a digital CMOS circuit |
|    | 8.2  | Dynamic power reduction |
|    | 8.3  | Short circuit power reduction |
|    | 8.4  | Leakage power reduction |
| 9  |      | Designing of the Digital Decoding Circuit |
|    | 9.1  | Basics of Low Power Digital Decoder |
|    | 9.2  | The max-log MAP BER performance |
|    | 9.3  | Architecture |
|    | 9.4  | Synthesis at nominal voltage |
| 10 |      | Performance of Digital Decoding Circuit |
|    | 10.1 | Objectives |
|    | 10.2 | Measurement Setup |
|    | 10.3 | Reduced Voltage Measurement Results |
| 11 |      | Analog versus Digital: Analysis of the Results |
|    | 11.1 | Analog and Digital: an analysis |
| 12 |      | Summary |
|    |      | References |

## **List of Figures**

| Figure | Caption |
|--------|---------|
| 1.1  | Sample UPD target applications. |
| 2.1  | UPD receiver architecture. |
| 2.2  | … information bit. |
| 2.3  | The ideal BFSK modulation spectrum used in UPD with $\pm 250$ kHz in-band transmission, corresponding to 250 kb/s. |
| 3.1  | Operation regions for an NMOS transistor. |
| 4.1  | Structure of encoder and memory initialization for the (7,5) tail-biting convolutional codes. |
| 4.2  | Tail-biting trellis structure for the 4-state (7,5) convolutional codes with block length (BL) = 6. |
| 4.3  | UPD BER in thermal noise. |
| 5.1  | Stack of NMOS transistors. |
| 5.2  | A pair of diode connected PMOS transistors. |
| 5.3  | Gilbert vector multiplier with differential inputs. |
| 5.4  | Generalized Gilbert multiplier network for implementing the … representation. |
| 5.5  | Transient plots of the decoded output bits in an analog decoder represented by low level currents. |
| 6.1  | Architecture of the analog decoding circuit. |
| 6.2  | Timing diagram of the decoder. 2 BL clock cycles is the dedicated … |
| 6.3  | Bit error rate performance, 2.5 Mb/s. |
| 6.4  | Sample soft bit output for Matlab model in comparison with circuit simulations. |
| 6.5  | BER performance of the (7,5) decoder for different BL. |
| 6.7  | Sample 2-dimensional matching of transistors used in the input cells. |
| 6.8  | Block diagram of the decoder core. |
| 6.9  | Current steering DACs with binary weighting. |
| 6.10 | Implemented current sources. |
| 6.11 | Input cell. |
| 6.12 | Computing blocks. |
| 7.1  | Measurement setup. |
| 7.3  | BER performance for room temperature (27 °C) and body temperature (37 °C) when power of decoding core is limited to 3 µW. |
| 7.5  | Measured coding gains for AD1,2,3 at 2 Mb/s. |
| 7.6  | Measured coding gains for AD1,2,3 at 500 kb/s. |
| 9.1  | BER performance of the max-log-MAP algorithm applied on tail-biting (7,5) codes with BL = 14. |
| 9.2  | Typical energy dissipation behavior in a digital circuit for 65 nm CMOS. |
| 9.3  | Timing diagram to perform iterations and recursive calculations of A and B metrics in the digital decoder. |
| 9.4  | Complete architecture of the implemented digital max-log-MAP decoder. |
| 9.5  | Die photo of the fabricated digital decoder. |
| 9.6  | Layout of the fabricated digital decoder. |
| 11.2 | Measured normalized energy per decoded bit evolution over technology generations for analog [6] [25] [21] [75] [70] [82] [85] and digital [19] [38] [44] [60] [69] [88] [92] decoders. |
| .1   | PCB and setup for measuring the analog decoding chip, AD1. |
| .2   | … and AD3. |
| .3   | PCB and setup for measuring the digital decoding chips. |
| .4   | Measurement instruments and the environment chamber. |

## **List of Tables**

| Table | Caption |
|-------|---------|
| 2.1  | UPD project target specifications |
| 2.2  | UPD initial power budget allocation |
| 6.1  | Power consumption of different sections of the decoder |
| 6.2  | Analog decoder characteristics |
| 6.3  | Energy comparison for Analog Hamming decoders |
| 6.4  | Transistor dimensions of fabricated analog decoding cores |
| 7.1  | Digital interface characteristics |
| 7.2  | AD1's fixed and random errors at 500 kb/s, test experiment 1 |
| 7.3  | AD1's fixed and random errors at 1.5 Mb/s, test experiment 1 |
| 7.4  | AD3's fixed and random errors at 500 kb/s, test experiment 1 |
| 7.5  | AD3's fixed and random errors at 1.5 Mb/s, test experiment 1 |
| 7.6  | AD1's fixed and random errors at 1.5 Mb/s, test experiment 2 |
| 7.7  | AD3's fixed and random errors at 500 kb/s, test experiment 2 |
| 7.8  | AD3's fixed and random errors at 1.5 Mb/s, test experiment 2 |
| 9.1  | Synthesized digital decoder characteristics |
| 11.1 | Decoder Comparison |

#### Acronyms

- ADC Analog-to-Digital Converter. 19, 55
- ADi Analog Decoding Circuit i. 69
- **ARQ** Automatic Repeat Request. 4
- AWGN Additive White Gaussian Noise. 60
- BAN Body Area Networks. 15
- BER Bit Error Rate. 3
- **BFSK** Binary Frequency Shift Keying. 21
- **BL** Block Length. 35
- BPSK Binary Phase Shift Keying. 23
- **BTLE** Bluetooth Low Energy. 22
- CAD Computer Aided Design. 9
- **CMOS** Complementary Metal-Oxide-Semiconductor. 6
- CS-DAC Current Steering Digital-to-Analog Converter. 56
- DAC Digital-to-Analog Converter. 55

- **DDC** Digital Decoding Core. 102
- DFF D Flip-Flop. 56
- **DRAM** Dynamic Random Access Memories. 93
- ECC Error Control Code. 3
- **EMV** Energy Minimum Voltage. 98
- **GFSK** Gaussian Frequency Shift Keying. 22
- HVT High Threshold Voltage. 95
- **I and Q** In phase and Quadrature phase. 20
- **ISM** Industrial, Scientific and Medical. 15
- LDPC Low-Density Parity-Check. 9
- LLR Log-Likelihood Ratio. 39
- LNA Low Noise Amplifier. 18
- LO Local Oscillator. 18
- LP-HVT Low Power-High Threshold Voltage. 60
- **LP-SVT** Low Power Standard Threshold Voltage. 101
- LVT Low Threshold Voltage. 95
- MAC Media Access Control. 19
- MAP Maximum a Posteriori. 39
- MC Monte-Carlo. 64
- MOS Metal Oxide Semiconductor. 25
- OOK On-Off Keying. 21

- **OQPSK** Offset Quadrature Phase Shift Keying. 23

- PCC Peripheral Communication Core. 102
- RF Radio-Frequency. 17
- **SNR** Signal to Noise Ratio. 3, 5
- SP Sum-Product algorithm. 37
- SRAM Static Random Access Memories. 93
- SVT Standard Threshold Voltage. 95
- TB Tail-Biting. 36
- TD-AMS Time-Domain Analog and digital Mixed-Signal processing. 116
- **UPD** Ultra Portable Devices. 6
- **VHDL** V(ery high speed integrated circuit) Hardware Description Language. 101
- WBAN Wireless Body Area Networks. 16
- WSN Wireless Sensor Networks. 15

## Glossaries

- $A_{V_T}$  Technology dependent proportionality constant. 30
- $A_{\beta}$  Technology dependent proportionality constant. 30
- BL Block length of the codes. 56
- $C_L$  Load capacitance. 91
- $C_{ox}$  Oxide capacitance per unit area. 26
- $E_b$  Energy per bit. 34
- $I_{Ref}$  Reference current. 50
- $I_S$  Specific current. 28
- *L* Gate length of a transistor. 30
- $N_0$  Noise power spectral density. 34
- $U_T$  Thermal voltage also known as the Boltzmann voltage. 28
- V<sub>CC</sub> Supply voltage in analog domain. 67
- $V_{DS}$  Drain to source voltage of a transistor. 26
- $V_{GS}$  Gate to source voltage of a transistor. 26

- $V_{SAT}$  Saturation velocity. 31
- $V_{SB}$  Source to substrate voltage of a transistor. 31
- V<sub>TH</sub> Threshold voltage. 26
- W Gate width of a transistor. 30
- $\Delta\Sigma$  Delta-Sigma data converter. 19
- $\alpha_{SW}$  Switching activity factor. 91
- $\alpha_{V_{SAT}}$  An empirical parameter. 32
- $\alpha_{V_{TH}}$  An empirical parameter. 32
- $\alpha_{\mu}$  An empirical parameter. 32
- $\frac{W}{L}$  Width-to-length ratio of a transistor. 28
- $\gamma_{SP}$  The scaling factor in the Sum-Product algorithm. 38
- $\mu$  Mobility of the electric carrier. 26
- f Frequency. 31
- $g_m$  Transistor's trans-conductance. 28
- $i_{n,d}$  Drain current thermal noise. 31
- $k_f$  An empirical coefficient in the gate voltage flicker noise model. 31
- *n* Subthreshold slope factor. 28
- $v_{n,f}$  Gate voltage flicker noise. 31

- sub-$V_{\rm T}$  Sub-threshold. 7

# Part I

# 1

## Introduction

#### 1.1. MOTIVATION

The presence of wireless links in our daily life is constantly increasing. Many exciting new applications, together with increased attention to health care, have stimulated interest in battery-supplied, ultra low power wireless devices. Such devices can be worn, placed in locations where access is difficult, or implanted in the human body. Therefore, these devices need to be small and inexpensive and to have a reasonably long battery lifetime, i.e. to operate on an extremely limited power budget. Such demands introduce great challenges throughout the design process. While high data rates are not needed in most scenarios, maintaining communication reliability under ultra low power operation is critical.

In order to minimize the errors that occur during transmission over a lossy channel, Error Control Codes (ECCs) can be employed. In a coded transmission, ECCs either improve the overall Signal to Noise Ratio (SNR) in the system, or help to reduce the transmission power needed to reach the same Bit Error Rate (BER) as an uncoded transmission.

With the ever increasing popularity of such low power wireless devices, small scale decoding circuits are likely to attract more attention, especially when power consumption and silicon area are among the critical design factors. It is worth noting that, due to the relatively large amount of computation required by decoding algorithms, decoding circuits are particularly power demanding and are major sinks of energy in a receiver chain. Therefore, using ECCs together with energy efficient implementations of the corresponding decoding circuits can greatly prolong the battery lifetime of wireless transceiver nodes.

#### **1.2. ERROR CONTROL METHODS IN WIRELESS NETWORKS**

The commonly practiced methods for error control in wireless links or networks fall into two main approaches.

The first method is referred to as Automatic Repeat Request (ARQ). When the receiver detects an error in a received data packet, it automatically asks the transmitter to re-transmit it. This process is repeated until the packet is either received error-free or the number of repetitions exceeds a predetermined limit. However, this method is inefficient in bad channel conditions, especially in terms of energy and latency.
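The cost of ARQ on a bad channel can be illustrated with a minimal sketch. It assumes that every transmission attempt fails independently with a fixed packet error probability; the names `p_err` and `max_tx` and the independence assumption are illustrative and are not taken from this work.

```python
def expected_transmissions(p_err: float, max_tx: int) -> float:
    """Mean number of transmissions per packet under truncated ARQ.

    Assumes each attempt fails independently with probability p_err and
    that the sender gives up after max_tx attempts (illustrative model).
    """
    if not 0.0 <= p_err <= 1.0:
        raise ValueError("p_err must be in [0, 1]")
    if max_tx < 1:
        raise ValueError("max_tx must be >= 1")
    # Attempt k (k = 1..max_tx) takes place only if the previous k-1 failed.
    return sum(p_err ** (k - 1) for k in range(1, max_tx + 1))


def residual_failure(p_err: float, max_tx: int) -> float:
    """Probability that the packet is still in error after max_tx attempts."""
    return p_err ** max_tx
```

Under these assumptions, a packet error probability of 0.5 with at most three attempts costs 1.75 transmissions per packet on average and still leaves one packet in eight undelivered, which shows how quickly energy and latency grow as the channel degrades.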

The second method is to use an ECC scheme. ECCs are used in various communication systems to provide more reliable transmission of data by correcting some of the errors that occur during transmission through a lossy channel. ECCs require more processing power at the transceiver nodes, mainly to execute the corresponding decoding algorithms. The error correction capability of an ECC is directly related to the complexity of the code. Since powerful codes have higher decoding complexity, they demand more energy for the decoding process at the receiver. In this work, rather small scale channel decoders have been chosen; firstly to fit the target application described in Section 1.4, and secondly to be able to fulfill all the research objectives described in Section 1.7.

#### **1.3. ENERGY CONSIDERATIONS IN A CODED SYSTEM**

Although employing ECCs can be beneficial in a communication system, either to reduce power consumption or to improve the quality of data transmission, one should also consider the underlying trade-offs in the design process. Decoding algorithms are in general computationally complex and, when implemented in hardware, require a noticeable amount of power to decode the message.

In a coded communication system the decoder corrects some of the errors that occur over the channel. Therefore, when the bit error rate requirement is kept the same, the transmit power can be reduced with respect to the uncoded system, since the system can operate at a lower SNR for the same BER. That helps to save power, and hence to increase the battery lifetime of the transceiver. The price of this reduction in power is the hardware overhead of the encoder and decoder as well as the power consumption of these blocks. In fact, complexity and power are much more significant in the decoder hardware than in the encoder. For a conventional long range communication link, the transmit power is normally much higher than the power consumption of the receiver chain. However, for short range transmission links, the transmit power is reduced in line with the shorter distance between the transmit and receive nodes. In these scenarios, the total energy dissipation is no longer dominated by the radio transmission energy alone. Instead, the radio energy and the computation energy in the transmitting and receiving nodes can take comparable shares of the total energy requirements of a communication link [30].

Encoding a message adds redundancy to the data. As a result, the coded data, often called *codewords* or simply *codes*, contain more bits than the message itself. Codes must be powerful enough not only to compensate for the correspondingly reduced SNR levels at the receiver, but also to result in a reduced number of errors after the demodulation and decoding processes.

Circuits that implement a decoding algorithm perform substantial computations and are therefore among the most power hungry blocks in a receiver chain. At the same time, for a given BER requirement, the energy saved by reducing the transmit power in a coded transmission must be reasonably greater than the energy dissipated during the encoding-decoding process. Also, because of the added redundancy in a coded transmission, extra processing is needed in other components of the system to deal with the increased data rates. Consequently, if the codes are not chosen suitably or the decoder hardware is not designed properly, the total power consumption of the system may not be lower than that of an uncoded system. In other words, the power consumption of the decoder circuitry can outweigh the saving in transmit power. Using an ECC block is not efficient if the power saved in transmission is instead dissipated in the ECC circuit itself [71].
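The break-even condition described above can be written as a short sketch. It is a simplified model under stated assumptions: the coding gain is expressed as a reduction in required energy per information bit at the target BER, and all encoder, decoder and other overhead energy is lumped into a single per-bit figure. The function and parameter names are hypothetical and the numbers in the test are not measurements from this work.

```python
def net_energy_saving(e_tx_uncoded: float, coding_gain_db: float,
                      e_codec: float) -> float:
    """Energy saved per information bit by coding (same unit as the inputs).

    e_tx_uncoded   : transmit energy per information bit without coding
    coding_gain_db : reduction in required Eb/N0 at the target BER, in dB
    e_codec        : encoder + decoder + overhead energy per information bit
    A positive result means coding pays off; a negative result means the
    codec spends more energy than the reduced transmit power saves.
    """
    e_tx_coded = e_tx_uncoded * 10.0 ** (-coding_gain_db / 10.0)
    return (e_tx_uncoded - e_tx_coded) - e_codec
```

In this model a short range link with a small `e_tx_uncoded` leaves little room for the decoder: the smaller the transmit energy, the lower the decoder energy must be before coding gives a net benefit.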

Another matter to note is the complexity of the codes. Stronger codes provide better BER performance with lower transmit power requirements but, compared to simpler codes, demand decoders with higher complexity. Hence, more power budget is required. Depending on the target application, a trade-off is typically made between the error correction capability of the codes and the decoder complexity.

Figure 1.1.: Sample UPD target applications.

# **1.4. TARGET APPLICATIONS**

Wireless standardization is a costly and time-consuming process. In order to deploy the available resources efficiently, complex, general purpose and mass-produced radio transmitters and receivers are generally expected to conform to industry standards. If we consider a bottom-up strategy, a suitable candidate sub-block in a wireless system is not necessarily the one that offers the largest reduction in power consumption, but rather the one that fits well into the system without demanding drastic changes to the currently agreed and well developed standard. Despite this, there exist plenty of present and emerging applications that allow for a fully custom design, prioritizing superior performance over a more affordable design cost. Consequently, depending on the target application requirements, decoding circuits can be implemented using many different methods and corresponding algorithms.

The main target of this work is to propose low power decoding circuits for *Ultra Portable Devices (UPD)*, i.e. custom, small scale, short range and low power radio devices that may include wireless sensor nodes, near body communications, medical implants or modern hearing aid devices. Such applications do not usually demand very high data rates. In most cases the required throughput ranges from a few kb/s up to only a few Mb/s. Instead, the overall physical dimension and power consumption are the critical aspects. Small physical dimension is critical for implanted devices; it also reduces implementation costs and improves portability in on-body or near-body communications. Low power operation is crucial to prolong the battery-driven lifetime of the device, especially in cases where access is difficult, such as medical implants or remotely placed sensor nodes.

# 1.5. BACKGROUND SURVEY

To find a suitable solution for extremely small and power constrained applications, two different low power approaches in 65 nm Complementary Metal-Oxide-Semiconductor (CMOS) technology are investigated: one approach is based on current mode analog decoding operating in weak inversion, and the other is digital decoding in the sub-$V_{\rm T}$ region.

## 1.5.1. ANALOG DECODING

Analog decoding was initially proposed by Hagenauer [33] [32] and Loeliger [45] [47]. The idea was then further developed by other researchers to emphasize its efficiency and benefits in various applications [1], [87], [57], [7], [73]. In this method, simple blocks of analog circuitry are used as basic processing units. These blocks are then arranged in a complex, fully connected network, whose architecture and connections are determined by the corresponding trellis or graph of the selected codes. It has been observed that analog computations may achieve the robustness of digital systems while consuming several orders of magnitude less power. Similarly, the motivation to use analog circuits for decoding has been based on faster analog continuous time processing compared to digital designs, while consuming less power from the supply. A considerably smaller number of transistors also promised area efficient analog decoding circuits.

The idea has been pursued by other researchers to demonstrate the advantage of implementing soft iterative decoding algorithms in analog circuitry over digital implementations in terms of silicon area, speed and power consumption [46], [51], [22]. Initially, simple analog decoders were realized in hardware using bipolar transistors. Since CMOS devices biased in weak inversion, referred to as the sub-threshold (sub-V<sub>T</sub>) region, show an exponential I-V behavior similar to that of bipolar transistors, they have in recent years been used to successfully implement iterative decoding algorithms in analog circuitry. As an example, an analog Hamming decoder operating correctly in the sub-V<sub>T</sub> region was fabricated in 0.18 µm CMOS, and the measurement results are reported in [53] and [82]. Early analog decoders were claimed to provide improvements in consumed power ranging from several times to more than two orders of magnitude compared to their digital counterparts [46] [51].

Likewise, over the last decade, several analog decoding chips have been fabricated and the results have been presented in the literature [75], [25], [79], [82], [85], [31]. In most cases, the published works were merely proof of concept circuits. An attempt in [79] to employ the analog decoding concept in a realistic error correcting turbo code for mobile phone standards was not followed up in later technologies. One major drawback of designing analog decoders for complex applications such as high speed cellular phone data transfer is that, due to fully parallel computing, the silicon area requirement grows in proportion to the complexity of the codes. Since mismatch, process and threshold voltage variations affect the overall bit error rate, the performance of analog decoders tends to degrade with technology scaling. These effects impose a lower bound on the physical size of the decoder. While digital designs benefit the most from technology shrinking, scaling is usually not welcomed in analog designs. Being no exception, analog decoders suffer from increased mismatch and other imperfection errors. As shown in [91], the number of errors is more significant when the complexity of the decoder is increased; small scale decoders, however, were found to be more resilient to mismatch errors.

As mentioned earlier, for high speed applications, the early power and area advantages of analog decoders over digital designs have deteriorated. In addition, as shown in [91], complex analog decoding circuits are prone to severe BER performance degradation. The errors due to transistor mismatch may completely ruin the error correction property of the implemented decoding algorithm. In the past few years, such unresolved issues have dampened the initial enthusiasm for designing complex analog decoding circuits for high speed applications.

### 1.5.2. DIGITAL DECODING

For decoders implemented in the digital domain, technology scaling has constantly improved both area and power consumption. Analog decoders were initially found to be smaller and much more power efficient than their digital equivalents. However, as mentioned earlier, the constant technology shrinking trend of recent years has narrowed the gap between the two approaches to power efficient design of decoding circuits.

Furthermore, low voltage operation in today's technologies offers significant savings in power consumption. In [86] the scaling trend of the energy efficiency of analog and digital decoders has been investigated based on the works published over the last decade. There, an efficient Gallager-style [24] digital Low-Density Parity-Check (LDPC) decoder is also presented for sub-$V_T$ operation and is evaluated via simulations based on the models described in [66] and [5]. It is shown that the power consumption is reduced considerably for sub-$V_T$ digital implementations compared to a standard super-threshold (super-$V_T$) implementation. For digital designs, however, while the dynamic power decreases quadratically with voltage scaling, the leakage power does not scale down at the same rate, which suggests the existence of a minimum energy point. Moreover, processing becomes significantly slower in digital circuits operating in the sub-$V_T$ region.
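The origin of the minimum energy point can be illustrated with a simplified, normalized model. It is a sketch and not the model used in [86], [66] or [5]. It assumes that the dynamic energy per operation scales as $\alpha_{SW} C_L V^2$, and that the leakage energy per operation is the leakage power multiplied by a cycle time that grows exponentially as the supply is lowered in the sub-threshold region. The lumped constant `leak_factor`, the lower sweep limit `v_min` (a hypothetical minimum functional voltage) and all default values are illustrative.

```python
import math


def energy_per_operation(v_dd, alpha_sw=0.1, c_l=1.0, leak_factor=100.0,
                         n=1.5, u_t=0.026):
    """Normalized energy per operation at supply voltage v_dd (in volts).

    alpha_sw    : switching activity factor
    c_l         : load capacitance (normalized)
    leak_factor : lumped constant for logic depth and idle gates (assumed)
    n, u_t      : subthreshold slope factor and thermal voltage (volts)
    """
    e_dynamic = alpha_sw * c_l * v_dd ** 2
    # Leakage energy = leakage power x cycle time. The on/off current ratio
    # falls as exp(-v_dd / (n * u_t)), so this term grows at low voltage.
    e_leakage = leak_factor * c_l * v_dd ** 2 * math.exp(-v_dd / (n * u_t))
    return e_dynamic + e_leakage


def energy_minimum_voltage(v_min=0.15, v_max=1.2, steps=1051):
    """Grid search for the supply voltage that minimizes energy per operation."""
    step = (v_max - v_min) / (steps - 1)
    grid = [v_min + i * step for i in range(steps)]
    return min(grid, key=energy_per_operation)
```

With these illustrative defaults the search settles at a supply voltage of roughly 0.3 V: above it the quadratic dynamic term dominates, and below it the leakage term does. The actual location of the minimum depends on the technology, the switching activity and the circuit, and has to be determined by measurement.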

Digital decoders benefit from technology scaling and can achieve increased power efficiency at high throughput. However, limited complexity analog decoders may still remain desirable in terms of area and energy at low to moderate throughput [91]. When digital designs are operated at lower speeds, the dynamic power consumption is reduced, while the static leakage power remains. This degrades the energy efficiency of digital decoders in low throughput applications.
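This degradation can be stated as a simple relation. As a sketch, assume a fixed supply voltage, a dynamic energy per decoded bit $E_{\text{dyn}}$ that does not depend on speed, and a constant leakage power $P_{\text{leak}}$; these symbols are introduced here for illustration only. The energy per decoded bit at throughput $R$ is then

$$E_{\text{bit}}(R) = E_{\text{dyn}} + \frac{P_{\text{leak}}}{R}.$$

Halving the throughput doubles the leakage contribution per bit while the dynamic contribution stays the same, so at sufficiently low $R$ the leakage term dominates.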

#### 1.6. CHALLENGES

Pursuing low power and hardware efficient analog and low voltage digital decoding approaches introduces various challenges. One particular issue is the limited assistance provided by Computer Aided Design (CAD) tools compared to their role in conventional industry standard design flows. Due to the large number of transistors needed in an analog decoding circuit, in addition to the nature of testing and verifying decoding circuits for functional validity, transient simulations become extremely time consuming. This is a major obstacle to the repeated simulations that are usually needed for optimizing the design. Furthermore, the lack of a reliable methodology to verify the trustworthiness of analog decoder designs before sign-off for fabrication is a major inconvenience.

Likewise, for digital designs intended to operate at voltages lower than the nominal voltage of a particular technology, there exist several constraints. For one thing, deriving accurate and reliable estimates of power consumption and processing speed in near-threshold voltage operation is problematic. This is because the standard cell library models provided by vendors are not normally calibrated for extremely low voltage operation. Another challenging issue is finding the minimum energy point, which occurs at lower voltages due to the increased share of the leakage currents.

# **1.7. RESEARCH OBJECTIVES AND CONTRIBUTIONS**

Targeting an ultra low power radio transceiver in 65 nm CMOS technology, as mentioned in this chapter and described in more detail in Chapter 2, this dissertation pursues the following objectives. The contributions of this work consist of the attempts, suggestions, answers and results obtained from the underlying research.

- While several low power decoder implementations have been presented [92] [88] [6] [25] [56] [75] [79] [82] [85] [31] [26] [38] [44] [60] [69] [19], especially in recent years, there has been little in-depth investigation based on silicon measurements to evaluate the relative performance and efficiency of alternative analog and digital implementations. While there have been successful designs using both analog and digital circuits, it is not clear which approach is more efficient in the most recent technologies, especially for low rate decoders. Except in the early proof of concept analog decoding circuits, there has been little side-by-side investigation of analog and digital decoding circuits. Besides, decoding circuits are designed for a wide range of applications and fabricated using various technologies, which makes it hard to conclusively evaluate the two alternative approaches.

- ECCs and their corresponding decoding algorithms are designed for many diverse applications. The primary focus of an application could be either the highest throughput, the best error correcting properties, the lowest power consumption or the lowest hardware cost. Naturally, there is no ideal solution that satisfies all of these critical factors at the same time. Thus, a trade-off is made such that the primary criterion of interest is satisfied. As an example, wireless medical implants or remote sensor nodes usually do not require high throughput. Instead, low power consumption and minimum hardware usage are the main targets. Considering an ultra low power receiver with predetermined specifications makes it possible to conclude which approach is most suitable.
- While a handful of successfully operating analog decoders have been implemented and reported in the academic literature, so far these circuits have not found their way into real world applications. Therefore, a key question in analog decoding is whether it can be applied to real world applications and what gains can be expected in speed, area and power consumption compared to a digital decoder implementation. The aim of this work is to suggest decoder architectures, either in the analog or digital domain, that can easily be embedded in a low power digital receiver architecture. Therefore, system simulations and the performance of the digital interface circuitry for analog designs are also included in the investigations.
- Digital circuits operate much slower at reduced supply voltages. In this dissertation, a near-threshold digital decoder design is investigated through measurements to observe the minimum operating voltage, minimum energy point and the related data rates at these operating points.
- Since analog decoding circuits, even for the simplest codes, can easily include several thousand transistors, the Monte-Carlo simulation required for accurate BER evaluation is not practical. Therefore, simulation of analog decoders has usually been performed through oversimplified methods such as statistical analysis [91]. Such simulations can only offer a rough estimate of the circuit performance at best, and perhaps provide some help in adjusting the critical design factors. Solid evaluation of these circuits is best performed empirically, through chip measurements.

- Analog decoding circuits are inherently low power, because sub-V<sub>T</sub> operation of the CMOS transistors is essentially required. However, the performance of analog decoding circuits, like that of other analog circuits, is sensitive to non-idealities such as noise, mismatch and process variation, which degrade the error correction ability of the circuit. There is a need to investigate how analog decoders perform in recent technologies, as a trade-off between power and BER performance while taking into account a target throughput. Therefore, three versions of the analog decoding circuit are fabricated with different transistor sizes to observe how severe the effects of mismatch errors and noise are, and to determine the silicon area required for successful operation in comparison with an equivalent digital design.

# **1.8. OUTLINE OF THE THESIS**

This thesis is structured as follows.

# Part I:

- Chapter 1
- **Chapter 2:** Introduces UPD, a low power radio receiver architecture, as the main target application of this dissertation.
- Chapter 3: Fundamentals and challenges of low power design in the sub-threshold region of a CMOS transistor are discussed in this chapter.
- **Chapter 4:** Introduces the studied coding scheme as well as the corresponding decoding algorithm.

# Part II:

- **Chapter 5:** Presents basics of the analog decoding concept as one of the approaches that are followed in this work.
- **Chapter 6:** Provides an overview of the proposed architecture and design aspects of the implemented (7,5) low power analog decoding circuits and reasons behind choosing the critical design factors.
- **Chapter 7:** Summarizes chip measurement results for the implemented (7,5) analog decoders.

# Part III:

- **Chapter 8:** Introduces an overview of the methods and critical considerations for a low power digital design approach.
- **Chapter 9:** Provides an overview of the proposed architecture and design aspects of the implemented (7,5) low power equivalent digital decoding circuits and reasons behind choosing the critical design factors.
- **Chapter 10:** Summarizes chip measurement results for the implemented equivalent (7,5) digital decoder.

# Part IV:

- **Chapter 11:** Presents an analysis of the results and comparison between the sub-threshold digital and analog design approaches.
- **Chapter 12:** Provides concluding remarks and introduces potential extensions for future work.

# Appendices:

- **Appendix A:** Layout screenshot of the whole UPD radio receiver chain, as a result of collaboration between multiple researchers, that has been sent for fabrication.
- Appendix B: Includes some photos of the measurement setup.

# 2

# **Ultra Low Power Radio**

# 2.1. INTRODUCTION

Nowadays, wireless connectivity is more present in our everyday life than ever. In addition, there is a range of emerging wireless applications such as Body Area Networks (BANs), Wireless Sensor Networks (WSNs), and a wide variety of healthcare related devices and medical implants that have attracted considerable attention in recent years. These new fields offer many exciting opportunities and, at the same time, present many challenges. Despite the diversity of the applications, these devices or sensors share some common aspects. In addition to being small and highly portable, running on batteries for a long period of time without a need to charge or replace the battery, in some cases for the whole lifetime of the device, is a critical demand. In this regard, one particularly challenging aspect of designing these tiny wireless devices is the implementation of the radio link. The challenge is to provide basic radio connectivity at an extremely low power consumption.

The work presented in this thesis has been carried out as part of a larger project titled *Wireless Communication for Ultra Portable Devices*, or UPD for short. The UPD project has been funded by the *Swedish Foundation for Strategic Research (SSF)* and pursues the following goal: a fully integrated and compact short range radio receiver that can be used in ultra low power and portable devices or sensor nodes, such as medical implants, hearing aids, wireless headphones, temperature sensors, etc. Key target specifications of the UPD project are shown in Table 2.1. The operation frequency is chosen within the Industrial, Scientific and Medical (ISM) band, which provides 83 MHz of unlicensed bandwidth (2400-2483 MHz).

| Parameter                  | Target            |
|----------------------------|-------------------|
| Total power in active mode | 1 mW              |
| Total power in standby     | 1 µW              |
| Data rate (uncoded)        | 250 kbps          |
| Data rate (coded)          | 125 kbps          |
| Operating frequency        | 2.45 GHz          |
| Total chip area            | 1 mm<sup>2</sup>  |

Table 2.1.: UPD project target specifications.
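The targets in Table 2.1 imply an energy budget per bit for the complete receiver, which the following sketch works out. Reading the coded data rate as the information throughput when coding is enabled is an assumption made here, and the constant and function names are illustrative.

```python
# Values taken from Table 2.1; the names are illustrative.
ACTIVE_POWER_W = 1e-3        # total power in active mode
UNCODED_RATE_BPS = 250e3     # data rate without coding
CODED_RATE_BPS = 125e3       # data rate with coding


def energy_per_bit_nj(power_w: float, rate_bps: float) -> float:
    """Energy spent per delivered bit, in nanojoules."""
    return power_w / rate_bps * 1e9


# Ratio of coded to uncoded data rate (0.5 for the values above).
code_rate = CODED_RATE_BPS / UNCODED_RATE_BPS

uncoded_budget_nj = energy_per_bit_nj(ACTIVE_POWER_W, UNCODED_RATE_BPS)
coded_budget_nj = energy_per_bit_nj(ACTIVE_POWER_W, CODED_RATE_BPS)
```

Under this reading the whole receiver may spend about 4 nJ per bit in uncoded operation and about 8 nJ per information bit in coded operation. These figures cover the entire chain from the RF front-end to the decoder, so the decoder can claim only a fraction of them.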

# 2.2. THE VISION

The increased popularity of mobile phones directed a lot of past research towards satisfying the demand for high rate communications. In recent years, however, interest in wireless networks consisting of several low power and low rate communicating nodes has increased. This is due to the countless possibilities that these types of networks introduce. Examples of these networks are WSN and Wireless Body Area Networks (WBAN). A WSN can be used for monitoring or sensing the environment and sending the sensed information to other nodes or to a central receiving hub. Similarly, a WBAN is a network of devices that are intended to operate inside, on, or around the human body. Each network consists of several or even hundreds of nodes; therefore, it is critical for each node to be small and cheap. The need for ultra low power operation is easily understood, since these types of networks are desired to operate for an extended period of time without a need to charge or replace the batteries. In most cases, communication takes the form of transmission at random times, namely whenever there is updated data to report. Also, in most applications, each node transmits a limited amount of data in each transmission period. Thus, the envisioned device provides basic transmission at ultra low power and cost, which introduces design challenges. If the consumed power can be reduced to sufficiently low levels, the required energy can even be harvested from the surrounding environment in the form of heat, motion, etc.



Figure 2.1.: UPD receiver architecture.

# 2.3. THE UPD SYSTEM OVERVIEW

To minimize the size and cost when fabricated in large volumes, the radio transceiver will be realized as a single chip in 65 nanometer CMOS technology. Nanometer CMOS technology is a suitable choice as it offers low cost for implementing digital circuits, and devices fast enough to allow Radio-Frequency (RF) circuits to operate in weak inversion with extremely low power consumption.

The UPD project has been divided into six research areas: antenna design, RF front-end, analog-to-digital converter, digital baseband, channel decoding and system control <sup>1</sup>.

The aggressive power and area requirements demand both system level and circuit level design efficiency. The proposed low power UPD receiver architecture is shown in Fig. 2.1. As can be seen, a direct conversion receiver architecture is used. A direct conversion architecture avoids the image frequency problem, since the signal after conversion is at low frequency, centered at DC. This helps to simplify the filtering and analog-to-digital conversion circuitry. The accompanying issues of direct conversion, such as the presence of DC offsets, are taken care of by the selection of a suitable modulation scheme. The modulation scheme used carries no information at the center of the channel, so the demodulator becomes insensitive to DC signals. While different parts of the system have been worked on separately, there has been collaboration between the individuals to make the receiver operational as a whole. Therefore, the proposed receiver chain, with the accompanying system level simulations carried out as a collaboration between individuals in the UPD project, has been presented in a joint publication (see the preface). To understand the system environment and requirements for operation of the decoder, a summary of the different parts of the UPD receiver is presented in the following sub-sections.<sup>2</sup>

<sup>1</sup>Each sub-project has been carried out by one PhD student. This thesis covers the work which has been done in low power channel decoding.

# 2.3.1. ANTENNA DESIGN AND PROPAGATION CHANNEL MEASUREMENTS

Since a significant share of the potential applications for an ultra low power radio lies in on-body communication, medical aids, or bio-implants, the chosen frequency band plays an important role. Apart from the fact that the frequency range 2400-2483 MHz is license-free, operating in this range provides a trade-off between the size of the antenna and the propagation loss around the human body. Operating at lower frequencies requires a larger antenna, while there is a limit on the antenna size in medical applications. On the other hand, the link loss increases at higher frequencies, since tissue absorption of electromagnetic waves increases. Therefore, small antenna models have been designed and various propagation channel measurements have been performed. Most of the investigations include near-body simulations and measurements, such as the case of ear-to-ear communication (as needed in modern hearing aids). See [17] for more information.

## 2.3.2. RF FRONT-END

The main function of an RF front-end is to receive high frequency signals from the antenna, amplify them in the presence of noise and interference, and convert them to a much lower frequency range that is easier to process. The front-end circuitry includes the Low Noise Amplifier (LNA), the mixers, and the associated Local Oscillator (LO). Because these circuits operate at high frequencies, they require the largest share of the power budget. Their efficiency and accuracy are very important, because the signals must be of high quality to be useful for the rest of the receiver chain. In addition, the overall sensitivity of the receiver to noise and interference is largely determined by the performance of the RF front-end circuitry. This aspect is even more important in low power receiver designs. To reduce area, inductor-less solutions are proposed for the LNA and the direct conversion mixers, while the LC oscillator uses an on-chip inductor. The quadrature LO signal needed for the direct conversion architecture is generated by a frequency divider, which can be designed with relatively low power consumption. See [11] for more information.

<sup>2</sup>Based on the research work on individual blocks, a fully integrated design, from the RF front-end circuitry down to the decoder, has been fabricated in 65 nm CMOS. Measurements are pending at the time of writing of this dissertation.

## 2.3.3. ANALOG TO DIGITAL CONVERTERS

The main requirements for data conversion circuitry are also small area and low power consumption. Still, the dynamic range should be high enough to eliminate the need for any automatic gain control block in the receiver. Thus, to convert the received baseband analog signals to digital domain, continuous time  $\Delta\Sigma$  data converters are used. These types of converters not only provide high resolution at low power consumption, but also have the property of inherent anti-aliasing filtering. The filtering is done by the implemented loop filter in analog domain which relaxes the filtering requirements prior to the Analog-to-Digital Converter (ADC). See [61] for more information.

# 2.3.4. DIGITAL BASEBAND AND SYNCHRONIZATION

The main functions of the digital baseband are digital filtering and demodulation. Data detection and synchronization are also required to locate the start of a data sequence in the background noise. To keep power and area low, digital filtering is done in a chain of four half-band filter stages, each followed by a decimation stage. This structure is chosen because extremely simple half-band filters are available that require no multiplication, only a few additions and simple shift operations. Demodulation is performed by matched filters applied to the In-phase and Quadrature-phase (I and Q) received signals. In uncoded transmission, the output information bit is decided to be 0 or 1 depending on whether the received signal more closely resembles the sequence (1, j, -1, -j) or (1, -j, -1, j), respectively. In coded mode, the degree of resemblance to these sequences, as defined by the modulation scheme, provides the soft information required by the channel decoder. Four copies of the matched filters with shifted sequences are sufficient for symbol synchronization. In addition, a Barker code preamble and a cross-correlation function are used for data detection. See [72] for more information.
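The matched-filter decision described above can be illustrated with a minimal sketch. It is not the implemented baseband: it assumes one complex sample per symbol, ideal symbol timing, and a non-coherent decision based on correlation magnitudes. The function names and the soft-value convention (positive favours bit 1) are hypothetical choices made for this illustration.

```python
import cmath

# Reference sequences as given in the text: resemblance to (1, j, -1, -j)
# decodes to bit 0, resemblance to (1, -j, -1, j) decodes to bit 1.
SEQ_BIT0 = (1, 1j, -1, -1j)
SEQ_BIT1 = (1, -1j, -1, 1j)


def correlation_magnitude(symbols, reference):
    """Magnitude of sum(r_k * conj(c_k)) over one four-symbol block."""
    acc = sum(r * complex(c).conjugate() for r, c in zip(symbols, reference))
    return abs(acc)


def demodulate_block(symbols):
    """Return (hard_bit, soft_value) for four complex symbols.

    soft_value is the difference of the two correlation magnitudes;
    this particular soft metric is an assumption of the sketch.
    """
    m0 = correlation_magnitude(symbols, SEQ_BIT0)
    m1 = correlation_magnitude(symbols, SEQ_BIT1)
    soft = m1 - m0
    return (1 if soft > 0 else 0), soft


if __name__ == "__main__":
    # A bit-1 block with an unknown carrier phase and a DC offset added.
    phase = cmath.exp(1j * 0.7)
    dc_offset = 0.3 + 0.2j
    received = [s * phase + dc_offset for s in SEQ_BIT1]
    print(demodulate_block(received))
```

Because both reference sequences sum to zero, a constant (DC) term added to all four symbols contributes nothing to either correlation, and taking the magnitude removes the dependence on the unknown carrier phase.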

#### 2.3.5. CHANNEL DECODER

This sub-project is the subject of this dissertation. Error correction codes improve the integrity of data reception by adding algorithmic redundancy to the transmitted data and then using that redundancy in the receiver to correct some of the errors. The decoding process is rather complex, and decoding circuits need careful design to be useful in a small, low power receiver such as the one in the UPD project. Designing a solution that fits the power and area budget of an ultra low power receiver is challenging because of the diversity of error correction codes, the variety of available decoding algorithms, and the various hardware implementation methods. The proposed solution must also handle the required data rate. The subsequent chapters present the detailed work on this subject.

# 2.3.6. SYSTEM CONTROL AND WAKE-UP RECEIVER

For a wireless sensor network to operate efficiently and consume as little power as possible while maintaining the required functionality, a Media Access Control (MAC) protocol is essential. A proper MAC protocol not only helps to minimize the power consumption of individual sensor nodes, but also reduces the total power consumption of the network for the target application. In many applications continuous transmission of data is not required. Instead, data is transmitted in infrequent bursts and at random times. Therefore, a wake-up receiver is devised to handle the standby and active modes of the sensor nodes. The implemented low power wake-up receiver makes it possible to reduce the power consumption of the sensor nodes significantly, by powering down the nodes when they are not in use and activating them again when there is data to be received. Care is taken that the design of the wake-up receiver and the protocol used do not lead to unacceptable latency in the wireless network. See [49] for more information.

# 2.4. UPD MODULATION SCHEME

The choice of modulation plays a critical role in designing a low power radio transmission link, as modulation to a great extent affects the architecture of both transmitter and receiver [20]. One simple option for modulation is On-Off Keying (OOK) [77]. However, this choice can easily lead to out-of-band spurious emissions that disturb other communication. These undesired emissions must be filtered efficiently, which complicates the design of the transmitter. A better choice is constant envelope phase or frequency modulation. For UPD, Binary Frequency Shift Keying (BFSK) modulation is considered. Each bit of information is modulated to a sequence of four symbols. Figure 2.2 shows the chosen BFSK modulation on the I/Q plane. An information bit "1" is encoded as a sequence of 90 degree clockwise phase shifts, so that the modulated signal covers a full 360 degree rotation around the constant amplitude circle on the I/Q plane. Likewise, an information bit "0" is encoded by counter-clockwise rotation. This modulation scheme significantly simplifies detection of the transmitted information bits in the receiver. Data detection is achieved by multiplying the received complex vectors with a pair of matched filters. One filter is matched to the sequence "1, j, -1, -j", while the other one is matched to "1, -j, -1, j", corresponding to clockwise and counter-clockwise rotations on the I/Q complex plane. This simple detection is an important advantage, since a simpler receiver architecture leads to less power consuming receiver circuitry.

| Block            | Power budget |
|------------------|--------------|
| RF front-end     | 650 µW       |
| ADCs             | 200 µW       |
| digital baseband | 120 µW       |
| decoding         | 30 µW        |

**Table 2.2.:** UPD initial power budget allocation.

By choosing BFSK with a  $\pm$  250 kHz frequency deviation for a 250 kb/s data rate, a notch occurs in the spectrum at the center of the band, as shown in Figure 2.3. Since no information is transmitted at the center frequency, demodulation becomes insensitive to DC offsets and low frequency noise. FSK modulation allows the signal to be demodulated without detecting the absolute phase, so non-coherent demodulation is applied. The sidelobes in the spectrum are considerably weaker than the main lobes (by 18.3 dB) and can be further suppressed by filtering.



**Figure 2.2.:** UPD modulation. The direction of rotation on the I/Q diagram determines the transmitted coded symbol for each information bit.



**Figure 2.3.:** The ideal BFSK modulation spectrum used in UPD with  $\pm 250$  kHz in-band transmission, corresponding to 250 kb/s.

## 2.5. LOW POWER RADIO STANDARDS

WSN and WBAN are of interest to both companies and academia, and new standards are currently being proposed as extensions to the IEEE 802.15 series of standards [35]. Among the low power radio standards are Bluetooth Low Energy (BTLE) [9], ZigBee [93], and ANT [3]. BTLE, sometimes marketed as Bluetooth Smart, is currently embedded in some smart phones and is increasingly used in novel technologies in the healthcare, fitness and entertainment industries. Like classic Bluetooth, Bluetooth Smart operates in the ISM spectrum range (2.400 GHz-2.4835 GHz), but it uses a different set of channels. Within each of the 40 channels of 2 MHz in BTLE, data is transmitted using Gaussian Frequency Shift Keying (GFSK) modulation, similar to classic Bluetooth. The over-the-air data rate is 1 Mbit/s, while the real throughput is 270 kb/s.

The ZigBee protocol is based on IEEE 802.15 standards, and the nodes can operate without a need for a central transceiver or control unit. Data rate defined in ZigBee protocol is 250 kb/s which is suitable for consumer or industrial applications that require short-range wireless transfer of data at low power and relatively low rates. ZigBee can operate at 868 MHz in Europe, 915 MHz in North America and Australia, and 2.4 GHz worldwide. However, data transmission rates vary from 20 kb/s in the 868 MHz frequency band to 250 kb/s in the 2.4 GHz frequency band. For modulation, Binary Phase Shift Keying (BPSK) is used in the 868 and 915 MHz bands, and Offset Quadrature Phase Shift Keying (OQPSK) is applied in the 2.4 GHz band.

Another notable standard to mention is ANT, which is a proprietary open access sensor network protocol, intended to operate in ISM 2.4 GHz band. ANT uses short duty cycle transmissions and deep sleep modes to ensure very low power consumption. In the ANT protocol, nodes can be configured to operate as master or slave and data rate can be up to 1 Mb/s with GFSK modulation.

These general purpose standardized radio solutions generally consume 50-100 nJ/b [10]. For UPD radio communication, the target is to reach a much lower energy consumption per bit.
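For a rough sense of scale, the initial power budget of Table 2.2 can be converted into energy per bit. The sketch below assumes that the receiver runs continuously at the 250 kb/s UPD data rate and that the budget is met exactly. Both are simplifications, so the comparison with the 50-100 nJ/b figure is only indicative. The helper name `energy_per_bit_nj` is hypothetical.

```python
# Power budget from Table 2.2, in microwatts.
POWER_BUDGET_UW = {
    "RF front-end": 650,
    "ADCs": 200,
    "digital baseband": 120,
    "decoding": 30,
}
DATA_RATE_BPS = 250_000  # 250 kb/s, the UPD data rate


def energy_per_bit_nj(power_uw, rate_bps):
    """Energy per bit in nJ for a block drawing power_uw continuously."""
    return power_uw * 1e-6 / rate_bps * 1e9


total_uw = sum(POWER_BUDGET_UW.values())

if __name__ == "__main__":
    print("total budget [uW]:", total_uw)
    print("receiver energy per bit [nJ]:", energy_per_bit_nj(total_uw, DATA_RATE_BPS))
    for block, power_uw in POWER_BUDGET_UW.items():
        print(block, energy_per_bit_nj(power_uw, DATA_RATE_BPS))
```

Under these assumptions the complete receiver budget of 1 mW corresponds to about 4 nJ/b, of which the decoder is allotted roughly 0.12 nJ/b.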

#### 2.6. RELEVANCE AND CHALLENGES OF SUB-THRESHOLD DESIGN

For application specific use cases, such as those intended for the described low power radio, a non-rechargeable battery is a logical choice. At the same time, the envisioned small size of the transceiver nodes implies limited physical space for batteries. These facts emphasize the value of the available energy.

One major advantage of a circuit operating in the sub-threshold mode is the ability to generate very low levels of current between the source and drain of the transistors. Analog processing in the sub-threshold region is inherently low power, since weak inversion operation keeps the current levels, and hence the overall power consumption, low. Low power decoding in the analog domain exploits the exponential dependency of the drain current in the sub-threshold or weak inversion region. Low power operation is therefore not only achieved, but is also necessary to execute the decoding algorithm. However, because of the exponential relation, parameter variations caused by mismatch and process variations have large effects on the current levels, which in turn tends to reduce the accuracy of processing. Other phenomena such as noise and temperature variations may also degrade the performance of analog computing circuitry. How severely these effects degrade the performance of an analog decoding circuit is a question that is answered most accurately by implementing prototypes and evaluating them empirically, i.e. by chip fabrication and measurements.

Digital circuits operating in the sub-threshold region consume much less dynamic energy than when operating in strong inversion (above threshold). The price to be paid is processing speed, which does not seem critical here, as the transceiver nodes are intended for low rate communication. Therefore, sub-threshold design for a digital decoding circuit might provide enough computation speed at the power levels required for extended battery lifetimes. One of the major design challenges is that digital modeling and synthesis tools are not developed primarily for sub-threshold design and hence may not provide reliable power or speed estimates for extremely low voltages at design time. In addition, a digital design is more sensitive to parametric variation in near-threshold operation [13], [29], [40]. Again, the final assessment of circuit performance should be supported by chip measurements.

# 3

# Fundamentals of MOSFET Sub-threshold Operation

For digital circuits, mainly driven by demands for high processing speed applications, Metal Oxide Semiconductor (MOS) transistors generally are used as switches with high speed activities. For a MOS transistor to function as a high speed switch, it should have large turn on currents. For such circuits, the steady state current (leakage) of logic gates, in comparison to the current in active mode, is rather small. For analog circuits, to be responsive at high frequencies, as well as keeping the noise level low, MOS transistors are mainly used as active devices that consume large amount of current. This is only possible when transistors are biased in strong inversion. Since the circuit topologies that are presented in this work are based on sub-threshold or near threshold MOS devices, a brief review on the characteristics of MOS devices in these regions of operation is provided in this chapter. Additionally, some important design considerations for sub-threshold circuits are provided.

# 3.1. MOS TRANSISTOR BASIC REGIONS OF OPERATION

Depending on the voltages applied to the gate (G), drain (D) and source (S) terminals, the transistor can be in different regions of operation. These regions are distinguished by how the drain current depends on the voltages across the terminals of the MOS transistor [36]. The equations in this section are presented for an NMOS transistor. Similar equations apply to PMOS transistors, with the signs of currents and voltages inverted.

## 3.1.1. LINEAR OR TRIODE REGION

When  $V_{DS} < V_{GS} - V_{TH}$ , the drain current increases with  $V_{GS}$  because it increases the channel conductivity and also with  $V_{DS}$  because it is the voltage across the channel.  $V_{GS}$  and  $V_{DS}$  are the gate-to-source and drain-to-source voltages, and  $V_{TH}$  is the threshold voltage. The relation is defined as

$$I_{DS} = \mu C_{ox} \frac{W}{L} [(V_{GS} - V_{TH}) V_{DS} - \frac{1}{2} V_{DS}^2], \qquad (3.1)$$

where  $\mu$  is the mobility of the charge carriers (electrons for NMOS,  $\mu_n$ , and holes for PMOS,  $\mu_p$ ),  $C_{ox}$  is the oxide capacitance per unit area, and W and L are the transistor's width and length, respectively. For a fixed  $V_{DS}$ , the drain current increases linearly with  $V_{GS}$ .

# **3.1.2. QUADRATIC OR SATURATION REGION**

When  $V_{DS} = V_{GS} - V_{TH}$ , the transistor reaches the edge of the saturation region. Substituting this condition into Eq. 3.1 gives

$$I_{DS} = \frac{1}{2} \mu C_{ox} \frac{W}{L} (V_{GS} - V_{TH})^2. \tag{3.2}$$

This region is also referred to as quadratic or square-law, since the drain current is proportional to the square of the effective gate voltage  $V_{GS} - V_{TH}$ . In the saturation region, where  $V_{DS} > V_{GS} - V_{TH}$ , the drain current is mostly controlled by  $V_{GS}$  and depends only weakly on changes of  $V_{DS}$ .
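The two strong-inversion regions can be combined into one piecewise function, as in the sketch below. It follows Eq. 3.1 in the triode region and evaluates the same expression at $V_{DS} = V_{GS} - V_{TH}$ for saturation, so the current is continuous at the boundary. Channel-length modulation, velocity saturation and sub-threshold conduction are deliberately ignored. The function name and the numerical values in the usage example are hypothetical and not taken from the 65 nm process used in this work.

```python
def drain_current_strong_inversion(v_gs, v_ds, v_th, beta):
    """Basic long-channel NMOS drain current in amperes.

    beta = mu * C_ox * W / L in A/V^2. Voltages are in volts.
    """
    v_ov = v_gs - v_th  # effective gate voltage
    if v_ov <= 0:
        return 0.0  # basic model: no channel current below threshold
    if v_ds < v_ov:
        # Linear or triode region, Eq. 3.1
        return beta * (v_ov * v_ds - 0.5 * v_ds ** 2)
    # Saturation: Eq. 3.1 evaluated at V_DS = V_GS - V_TH
    return 0.5 * beta * v_ov ** 2


if __name__ == "__main__":
    beta_example = 200e-6  # hypothetical value, A/V^2
    v_th_example = 0.4     # hypothetical value, V
    for v_ds in (0.1, 0.3, 0.6, 1.0):
        i_ds = drain_current_strong_inversion(1.0, v_ds, v_th_example, beta_example)
        print(v_ds, i_ds)
```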

# 3.1.3. CUT-OFF REGION

In a basic MOS model, it is assumed that current flows through the channel only while  $V_{GS} > V_{TH}$ . In reality, current flows from drain to source even when  $V_{GS}$  is lower than the threshold voltage  $V_{TH}$ . This current is, however, orders of magnitude smaller than the drain current when  $V_{GS} \gg V_{TH}$ . More details are provided in the next section.

# 3.2. MOS TRANSISTOR REGIONS OF OPERATION AT LOW DRAIN CURRENTS

A more elaborate MOS transistor model has two distinct regions of operation, known as weak and strong inversion. In sub-threshold or weak inversion,  $V_{GS} < V_{TH}$  and the drain current is dominated by diffusion current, which depends exponentially on the effective gate-source voltage ( $V_{GS} - V_{TH}$ ). In strong inversion,  $V_{GS} \gg V_{TH}$  and drift current dominates, so the drain current becomes proportional to the square of the effective gate-source voltage (square law region). The regions of operation for an NMOS transistor with  $\frac{W}{L} = \frac{1.0 \mu m}{0.5 \mu m}$  in 65 nm CMOS are illustrated in Fig. 3.1. Velocity saturation effects reduce the drain current below the strong inversion (square law) value. Between weak and strong inversion there is a transition region, known as moderate inversion, where the contributions from both drift and diffusion currents are significant.

**Figure 3.1.:** Operation regions for an NMOS transistor.

A brief note on terminology is helpful here. In the literature, the terms weak inversion and sub-threshold are often used interchangeably. Nonetheless, there is a subtle difference between the regions. The exponential weak inversion region starts when the gate-source voltage is slightly below the threshold voltage. Therefore, when  $V_{GS}$  is equal to the threshold voltage, the transistor in fact operates in moderate inversion. Throughout this thesis, the term weak inversion is mostly used when the drain current is of critical importance, as in the design of the analog decoding circuit. Similarly, the terms sub-threshold and near-threshold are used when the focus is on the voltage, as in the chapters presenting the low power digital decoding circuit.

# 3.3. MOS MODEL IN WEAK INVERSION REGION

In weak inversion, the dependency of the drain-to-source current  $I_{DS}$  on the gate-source voltage  $V_{GS}$  becomes exponential. For an NMOS transistor,  $I_{DS}$  in weak inversion can be expressed as [80]

$$I_{DS} = I_S \exp\left(\frac{V_{GS} - V_{TH}}{nU_T}\right) \left[1 - \exp\left(\frac{-V_{DS}}{U_T}\right)\right], \tag{3.3}$$

in which *n* is the subthreshold slope factor (practically 1 < n < 1.6), and  $I_S$  is the so-called specific current, which characterizes the current that leaks through the transistor.  $I_S$  is expressed by

$$I_S = 2n\mu C_{ox} U_T^2 \frac{W}{L}. \tag{3.4}$$

In Eq. 3.4,  $\mu$  is the carrier mobility,  $C_{ox}$  is the gate oxide capacitance per unit area,  $U_T = \frac{kT}{q}$  is the thermal voltage (also known as the Boltzmann voltage), and  $\frac{W}{L}$  is the width-to-length ratio of the channel.

By introducing

$$I_0 = I_S \exp\left(\frac{-V_{TH}}{nU_T}\right), \tag{3.5}$$

Equation 3.3 can be written as a function of the gate-source voltage and the drain-source voltage as

$$I_{DS} = I_0 \exp\left(\frac{V_{GS}}{nU_T}\right) \left[1 - \exp\left(\frac{-V_{DS}}{U_T}\right)\right]. \tag{3.6}$$

A simpler approximation for the drain current in weak inversion is obtained by neglecting the last (bracketed) term in Eq. 3.6, which is close to one when  $V_{DS}$  exceeds a few  $U_T$ :

$$I_{DS} = I_0 \exp\left(\frac{V_{GS}}{n U_T}\right), \tag{3.7}$$

by which  $I_{DS} = I_0$  at  $V_{GS} = 0$ .
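The weak inversion model can be evaluated numerically as in the following sketch, which implements Eq. 3.6 and derives the gate voltage step needed for a tenfold change in current, $nU_T\ln 10$, from the exponential in Eq. 3.7. The function names are hypothetical, and the values of $I_0$ and $n$ in the usage example are illustrative placeholders rather than parameters of the process used in this work.

```python
import math

K_BOLTZMANN = 1.380649e-23    # Boltzmann constant, J/K
Q_ELECTRON = 1.602176634e-19  # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """U_T = kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def weak_inversion_current(v_gs, v_ds, i_0, n, temp_k=300.0):
    """Drain current according to Eq. 3.6, in the unit of i_0."""
    u_t = thermal_voltage(temp_k)
    return i_0 * math.exp(v_gs / (n * u_t)) * (1.0 - math.exp(-v_ds / u_t))


def subthreshold_swing_mv_per_decade(n, temp_k=300.0):
    """Gate voltage change in mV for a tenfold change of I_DS (from Eq. 3.7)."""
    return n * thermal_voltage(temp_k) * math.log(10.0) * 1e3


if __name__ == "__main__":
    i_0_example = 1e-12  # hypothetical I_0 in amperes
    n_example = 1.3      # hypothetical slope factor within 1 < n < 1.6
    print("U_T [V]:", thermal_voltage())
    print("swing [mV/decade]:", subthreshold_swing_mv_per_decade(n_example))
    for v_gs in (0.0, 0.1, 0.2, 0.3):
        print(v_gs, weak_inversion_current(v_gs, 0.3, i_0_example, n_example))
```

At 300 K the thermal voltage is close to 26 mV, so the bracketed term in Eq. 3.6 is already within a few percent of one for $V_{DS}$ around 100 mV.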

# 3.4. SUB-THRESHOLD MOS ULTRA LOW POWER CIRCUITS

The fact that sub-threshold MOS devices operate with very small current levels makes them very convenient for ultra low power (ULP) applications. Furthermore, trans-conductance ( $g_m$ ) to bias current ( $I_{DS}$ ) ratio, i.e.,  $\frac{g_m}{I_{DS}}$  is maximum in sub-threshold region. This means that power efficiency of the MOS circuit can be maximized in sub-threshold operation [21].
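The statement about $\frac{g_m}{I_{DS}}$ can be made concrete with a short derivation. It is a sketch based only on the simplified weak inversion model of Eq. 3.7 and on the square-law dependence of the saturation current on $V_{GS} - V_{TH}$. Moderate inversion and short-channel effects are not covered. Differentiating Eq. 3.7 with respect to $V_{GS}$ gives

$$g_m = \frac{\partial I_{DS}}{\partial V_{GS}} = \frac{I_{DS}}{nU_T} \quad \Rightarrow \quad \frac{g_m}{I_{DS}} = \frac{1}{nU_T},$$

which is independent of the bias current. For a square-law device, $I_{DS} \propto (V_{GS} - V_{TH})^2$ , so that

$$\frac{g_m}{I_{DS}} = \frac{2}{V_{GS} - V_{TH}},$$

which decreases as the transistor is driven further into strong inversion. As an illustration with an assumed slope factor $n = 1.3$ at 300 K, $\frac{1}{nU_T}$ is about 30 V$^{-1}$, whereas an effective gate voltage of 0.2 V in strong inversion gives 10 V$^{-1}$.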

Meanwhile, due to exponential IV characteristics of sub-threshold MOS devices, the circuit can be operational within a very wide span of bias currents with very small variation on the bias voltage levels. This property of sub-threshold MOS devices is suitable for implementation of current-mode circuits.

## 3.5. CIRCUIT DESIGN CONSIDERATIONS IN SUB-THRESHOLD

Some of the main design considerations associated with MOS devices operating in weak inversion, or sub-threshold regime, are briefly addressed in this section. Considerations such as matching, noise, process and temperature variations are generally important in designing a circuit, but deserve more attention in sub-threshold circuit design. As will be seen later, these nonideality effects contribute to a significant share in the design cost in terms of area, energy consumption, and reliability.

#### 3.5.1. MATCHING

Device mismatch is probably one of the most important design issues, especially for analog and digital circuits in modern sub-hundred nanometer technologies [81]. That is because device mismatch limits the accuracy of signal processing considerably by reducing the dynamic range of operation.

Process and device parameter variations result from fabrication and can be categorized as systematic or random. Wafer-to-wafer variations are common to all devices in the circuit and introduce the same offset or shift in the characteristics of all the devices on a wafer. Biasing techniques, or the use of a differential topology in an analog circuit, can make the circuit less sensitive to systematic variations. Systematic device variations do not depend on the device size. The other type of variation has random characteristics and is referred to as device mismatch. Device mismatch refers to the fact that identically sized devices on the same silicon die will not have exactly the same parameters, due to random phenomena such as fluctuations in the number of dopant atoms. Device matching is a key principle in analog circuit design, especially where current mirrors or differential pairs are used. Available models for the matching of MOS devices suggest that matching improves as the device area increases.

Analysis of experimental data shows that the mismatch in drain-source current or gate-source voltage between two identical, adjacent MOS transistors depends mainly on two sources: threshold voltage variations ( $\Delta V_{TH}$ ) and current factor differences ( $\Delta\beta$ ), where  $\beta = \mu C_{ox} \frac{W}{L}$  [39]. These studies also show that the random variations of these parameters can be modeled as normal distributions with means  $V_{TH0}$  and  $\beta_0$ , and with variances that depend on the transistor area  $WL$ , where  $W$  is the gate width and  $L$  is the gate length of the transistor:

$$\sigma^2(\Delta V_{TH}) = \frac{A_{V_{TH}}^2}{WL} \tag{3.8}$$

$$\left(\frac{\sigma(\Delta\beta)}{\beta}\right)^2 = \frac{A_{\beta}^2}{WL}. \tag{3.9}$$

 $A_{V_{TH}}$  and  $A_\beta$  are technology dependent proportionality constants.  $\Delta V_{TH}$  and  $\Delta\beta$  are normally treated as independent random variables. A deviation in either the transistor geometry ( $\Delta\beta$ ) or the threshold voltage ( $\Delta V_{TH}$ ) shifts the operating point of the transistor, which leads to a deviation in the drain-source current. Mismatch in the transistor geometry is caused by fabrication drifts during the lithography process and can be mitigated by increasing the device size. Threshold voltage mismatch, on the other hand, is caused by process gradients.

For simple current mirrors and differential pair configurations, it can be shown that the mismatch between current values and gate-source voltage offset, respectively are [76]:

$$\left(\frac{\sigma(\Delta I_{DS})}{I_{DS}}\right)^2 = \left(\frac{\sigma(\Delta\beta)}{\beta}\right)^2 + \left(\frac{g_m}{I_{DS}}\right)^2 \sigma^2(\Delta V_{TH}) \tag{3.10}$$

$$\sigma^2(\Delta V_{GS}) = \sigma^2(\Delta V_{TH}) + \left(\frac{I_{DS}}{g_m}\right)^2 \left(\frac{\sigma(\Delta\beta)}{\beta}\right)^2. \tag{3.11}$$

The ratio  $\frac{g_m}{I_{DS}}$  reaches its maximum value in the weak inversion region. Given the role this term plays in the above equations, voltage matching is expected to improve slightly when moving towards weak inversion, while current matching degrades. This implies that implementing current mirrors with a desired matching level is much more difficult in weak inversion than in strong inversion.
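The effect of device area and bias region on current matching can be explored numerically with the sketch below. It applies the area scaling of Eq. 3.8, reads Eq. 3.9 in the squared form that Eq. 3.10 uses, and combines both with Eq. 3.10. All numerical values in the usage example (matching constants, device size and the two $\frac{g_m}{I_{DS}}$ values) are hypothetical placeholders chosen only to show the trend. They are not data for the 65 nm process used in this work.

```python
import math


def sigma_delta_vth(a_vth, width_um, length_um):
    """Standard deviation of Delta V_TH in volts (Eq. 3.8).

    a_vth is in V*um, width and length are in um.
    """
    return a_vth / math.sqrt(width_um * length_um)


def sigma_delta_beta_rel(a_beta, width_um, length_um):
    """Relative standard deviation of beta (squared form of Eq. 3.9).

    a_beta is a fraction times um, for example 0.01 for 1 %*um.
    """
    return a_beta / math.sqrt(width_um * length_um)


def sigma_current_rel(a_vth, a_beta, width_um, length_um, gm_over_id):
    """Relative current mismatch of a simple current mirror (Eq. 3.10)."""
    s_vth = sigma_delta_vth(a_vth, width_um, length_um)
    s_beta = sigma_delta_beta_rel(a_beta, width_um, length_um)
    return math.sqrt(s_beta ** 2 + (gm_over_id * s_vth) ** 2)


if __name__ == "__main__":
    A_VTH = 4e-3   # hypothetical, V*um
    A_BETA = 0.01  # hypothetical, 1 %*um
    for label, gm_over_id in (("weak inversion", 30.0), ("strong inversion", 5.0)):
        sigma = sigma_current_rel(A_VTH, A_BETA, 1.0, 0.5, gm_over_id)
        print(label, sigma)
```

With any positive constants, the larger $\frac{g_m}{I_{DS}}$ of weak inversion increases the threshold voltage contribution to the current mismatch, and quadrupling the gate area halves each standard deviation.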

# 3.5.2. NOISE

The noise level limits the minimum signal level that can be correctly processed in a circuit. The model that is generally used to estimate the noise in a MOS device considers mainly the drain current thermal noise,  $i_{n,d}$ , and the gate voltage flicker noise,  $v_{n,f}$  [76]. The drain current thermal noise is expressed as

$$i_{n,d}^2 = 4kT\gamma_n g_m, \tag{3.12}$$

where  $\gamma_n$  represents the excess noise factor, which has a value of about  $\frac{n}{2}$  in the weak inversion region and increases to about  $\frac{2n}{3}$  in the strong inversion region.

The gate voltage flicker noise is described as

$$v_{n,f}^2 = \frac{k_f}{WLC_{ox}} \cdot \frac{1}{f^{\alpha}}. \tag{3.13}$$

Eq. 3.13 shows that flicker noise decreases with the frequency  $f$  as  $\frac{1}{f^{\alpha}}$ , where the exponent  $\alpha$  is typically close to one. The term  $k_f$  represents an empirical coefficient. The equation also shows that flicker noise is inversely proportional to the device area. Therefore, an effective method to reduce the effect of flicker noise is to increase the device dimensions.

#### 3.5.3. THRESHOLD VOLTAGE VARIATIONS

Since the I-V characteristic is exponential in the weak inversion region, any small variation in the device threshold voltage translates into an exponential variation in the drain current.

The threshold voltage  $V_{TH}$  depends on the source-to-substrate voltage,  $V_{SB}$ , as [58]

$$V_{TH} = V_{TH0} + (n-1)V_{SB}, \tag{3.14}$$

where  $V_{TH0}$  refers to the threshold voltage when the substrate voltage  $V_{SB} = 0$ . Considering threshold voltage variations, Eq. 3.5 extends to

$$I_0 = I_S \exp\left(\frac{-V_{TH}}{nU_T}\right) = I_S \exp\left(\frac{-(V_{TH0} + (n-1)V_{SB})}{nU_T}\right). \tag{3.15}$$

#### 3.5.4. TEMPERATURE VARIATIONS

It is known that variations in temperature alter the threshold voltage  $V_{TH}$ , the mobility  $\mu$ , and the saturation velocity  $V_{SAT}$  of a MOS transistor. Variations in these parameters affect speed, power and timing in a circuit. The temperature dependencies of  $V_{TH}$ ,  $\mu$  and  $V_{SAT}$  for MOSFET devices are expressed by the following empirical equations [48]:

$$V_{TH}(T) = V_{TH}(T_0) + \alpha_{V_{TH}}(T - T_0), \tag{3.16}$$

$$\mu(T) = \mu(T_0) \left(\frac{T}{T_0}\right)^{\alpha_{\mu}}, \tag{3.17}$$

$$V_{SAT}(T) = V_{SAT}(T_0) + \alpha_{V_{SAT}}(T - T_0), \tag{3.18}$$

where  $T$  is the temperature, and  $\mu(T_0)$ ,  $V_{TH}(T_0)$  and  $V_{SAT}(T_0)$  are the mobility, threshold voltage and saturation velocity at the nominal temperature  $T_0$  (300 K), respectively. The empirical parameters  $\alpha_{\mu}$ ,  $\alpha_{V_{TH}}$  and  $\alpha_{V_{SAT}}$  are called the mobility temperature exponent, the threshold voltage temperature coefficient, and the saturation velocity temperature coefficient, respectively.

As studied in [54], the temperature dependency of MOSFET transistors varies with the operating voltage. When the temperature increases, both the threshold voltage and the carrier mobility decrease. A lower threshold voltage increases the drain current, whereas a lower mobility decreases it. Whether the drain current increases or decreases therefore depends on which effect dominates at the given supply voltage and temperature. For voltages far above the threshold level, the drain current is mostly affected by carrier mobility, so an increase in temperature normally leads to a lower drain current. The situation is reversed at lower supply voltages: in near-threshold operation the threshold voltage reduction has the stronger effect on the drain current, and a higher temperature results in a higher drain current.

# 4

# Selection of the Coding Scheme

# 4.1. BENEFITS OF ERROR CONTROL CODES

Reliable communication through a noisy channel may be achieved by simply increasing the transmission power. Yet, this would be a very costly approach in an energy constrained application. The main purpose of designing a coded system is to relax the demands on the other parts of the system [59]. Therefore, incorporating a successful coding scheme should lead to a reduced overall system cost while delivering similar or better performance. Error control coding deals with methods that introduce controlled redundancy into the source information data, so that the information can be delivered to the destination with a minimum number of errors. Here, introducing controlled redundancy means that the receiver gets sequences of data that contain more symbols than are needed to transmit the original data. However, due to the controlled nature of the redundancy, the receiver only accepts certain received patterns of data, known as code-words or simply codes, as valid transmissions. The redundancy in a successfully implemented coding scheme helps the decoding algorithm to recognize, locate and fix errors. How many errors can be corrected depends on the chosen coding scheme. The general expectation is that the error correcting capability improves as the complexity of the codes increases. Using a coding scheme in a digital transmission may be beneficial in any of the following ways, or in a mixture of them:

- Keep transmit power and range fixed, but reduce the average errors
- Keep average errors and transmit power fixed, but increase the range
- Keep average errors and range fixed, but reduce transmit power

# 4.2. SYSTEM AND HARDWARE CONSIDERATIONS FOR A CODED TRANSMISSION

Using coded transmission introduces several matters to consider during design. Due to the introduced redundancy, more bits are transmitted. The ratio of the number of original uncoded message bits to the number of coded bits is usually referred to as the *code rate*. Thus, to send a particular message in coded form, either the duration or the speed of transmission must be increased compared to an uncoded system. The former increases latency in the system and the latter demands extended bandwidth. Both cases result in increased noise in the received sequence: in the first case because the data is subjected to noise from the channel for a longer time, and in the second case because more noise is captured within the extended bandwidth.

Another important consideration is the additional cost and power overhead. The coding and decoding circuits require additional power and hardware. These costs should be justified by savings in total power consumption through a reduced transmit power. It is worth mentioning that in the following chapters, wherever the word "power" is mentioned, it refers by default to circuit power and not to transmission power.

# 4.2.1. CODE PERFORMANCE AND CODING GAIN

Error performance of a coded system compared to an uncoded one should be evaluated assuming similar transmit power per information bit. Let us express the ratio of energy per bit,  $E_b$ , to noise power spectral density,  $N_0$ , in an uncoded system by  $Eb_{info}/N_0$ . In a coded transmission, more bits are sent to convey the same information. Assuming a similar power budget for both systems, the ratio of energy per coded bit to noise power spectral density  $Eb_{coded}/N_0$  is reduced. In case of using a half-rate code, the energy per bit in the received sequence is halved, but the actual information remains the same. This results in a 3 dB penalty in the signal power at the receiver end. Therefore, the code must not only compensate for this loss in signal power and additional noise in the system, but also provide a considerably better error performance with respect to an uncoded transmission.
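The energy penalty of a rate $R$ code follows directly from the argument above: with the same energy spent per information bit, each coded bit carries a fraction $R$ of that energy, which corresponds to $-10\log_{10}R$ dB. The sketch below evaluates this for a few rates. The function name is hypothetical, and the rates other than one half are included only as illustrative values.

```python
import math


def rate_penalty_db(code_rate):
    """Reduction in energy per coded bit relative to energy per
    information bit, in dB, for a code of the given rate (0 < rate <= 1)."""
    if not 0.0 < code_rate <= 1.0:
        raise ValueError("code rate must be in (0, 1]")
    return -10.0 * math.log10(code_rate)


if __name__ == "__main__":
    for rate in (1.0, 4.0 / 7.0, 0.5):
        print(rate, rate_penalty_db(rate))
```

For a half-rate code the function returns approximately 3.01 dB, matching the 3 dB penalty mentioned in the text. A code is only worthwhile if its error correction more than recovers this loss.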

If the main reason for using ECC is to reduce transmit power, the performance of the coded system with respect to the uncoded one can be compared at the same error rate. For a coded transmission to provide the same error rate as the uncoded one,  $Eb_{info}/N_0$  can be reduced by an amount that is referred to as the coding gain at that error rate. Since this term is used frequently in the later chapters where the results are provided, the definition of coding gain is given below [28], [15]:

Error correction capability of the ECCs over a noisy environment provides a better BER performance compared to an uncoded system for the same SNR levels. The difference in required SNR to reach to a certain BER for a specific code, with respect to uncoded scheme, is mentioned as the "coding gain" for the considered codes.
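As a rough illustration of this definition, the sketch below finds the $E_b/N_0$ an uncoded link needs for a target BER and subtracts the value needed by a coded link. It assumes coherent BPSK over an AWGN channel as the uncoded reference, which is a textbook simplification and not necessarily the modulation used in this work. The coded requirement of 3.8 dB is a made-up placeholder, not a result from this work.

```python
import math

def ber_uncoded_bpsk(ebn0_db):
    """BER of uncoded coherent BPSK over AWGN: 0.5 * erfc(sqrt(Eb/N0))."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(ebn0))

def required_ebn0_db(target_ber, ber_func, lo=-5.0, hi=20.0):
    """Bisection search for the Eb/N0 (dB) at which ber_func reaches target_ber.
    Assumes ber_func decreases monotonically with Eb/N0."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ber_func(mid) > target_ber:
            lo = mid   # still too many errors: more signal energy is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)

def coding_gain_db(uncoded_ebn0_db, coded_ebn0_db):
    """Difference in required Eb/N0 at the same BER."""
    return uncoded_ebn0_db - coded_ebn0_db

uncoded_req = required_ebn0_db(1e-3, ber_uncoded_bpsk)
# Hypothetical coded requirement, used only to show the subtraction.
example_gain = coding_gain_db(uncoded_req, 3.8)
```

In practice the coded requirement comes from simulation or measurement of the specific code and decoder.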

# 4.3. CODING SCHEME

There exist a large variety of codes and associated decoding algorithms that can be considered for different applications. In this work, the coding scheme and its associated hardware are considered for a small and ultra low power transceiver. As mentioned in the previous chapters, this application demands a limited complexity, yet effective coding scheme that can be implemented within an extremely tight power budget. Therefore, in the following chapters, first an analog Hamming decoder is simulated to estimate a tradeoff between the power budget, rate, coding gain and code complexity. In the majority of the remaining chapters, the familiar (7,5) codes are considered to benefit from a better coding gain at the cost of a reasonable increase in circuit power [43].

Hamming codes belong to the category of *linear block codes*, while (7,5) codes are a sub-set of *convolutional codes* [74], [37], [4]. If the encoder uses only the current block of information data to produce its coded output, then the code is called block code, i.e. the encoder does not have memory. If the encoder remembers a number of previous information bits and uses them in its coding algorithm, then the code is called a convolutional code. Unlike block codes, theoretically, there is no limit to the length of coded sequences in a convolutional coding scheme. However, due to limitations in hardware implementation, information bits are presented in blocks with certain length for encoding. This is referred to as the *Block Length (BL)* for the codes.

As presented in the following chapters, the selection of (7,5) convolutional codes allows for a short BL. Therefore, a small decoding circuit, as dictated by the mentioned target applications, can be realized.

# 4.4. SOFT DECISION DECODING

In a conventional receiver chain, decoding takes place after demodulation. That means the demodulator itself, in the final phase of demodulation processing, does not have to decide about the value of each bit. Instead, if potential errors are to be corrected, it is better for the demodulator to pass on the information about the certainty of its decisions on each bit to the decoder. If utilized efficiently by the decoder, this extra information can help the decoding algorithm to detect and fix potential errors. In this method, the input information to the decoder is called soft bits or soft data and the process is called *soft-decision decoding*. One parameter that needs to be decided during the design period for hardware implementation is the number of quantization levels of the soft bits. There is a theoretical limit to the error correcting ability of a given code. That means that using more quantization levels than necessary will only lead to increased silicon area and processing power. On the other hand, too few quantization levels might degrade the coding gain. Thus, a compromise must be found between the number of quantization levels, the coding performance, and the hardware area/power.
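To make the quantization trade-off concrete, the following sketch maps a real-valued soft bit onto $2^n$ uniformly spaced levels. The uniform spacing, the clipping range and the function names are assumptions made for illustration; the quantizer actually used in the receiver may differ.

```python
def quantize_soft_value(value, n_bits, clip=1.0):
    """Map a real soft value to an integer index in 0 .. 2**n_bits - 1."""
    levels = 2 ** n_bits
    clipped = max(-clip, min(clip, value))
    # Scale [-clip, +clip] onto [0, levels - 1] and round to the nearest level.
    return int(round((clipped + clip) / (2.0 * clip) * (levels - 1)))

def dequantize(index, n_bits, clip=1.0):
    """Return the real value represented by a quantization index."""
    levels = 2 ** n_bits
    return -clip + 2.0 * clip * index / (levels - 1)

# With 3 bits the worst-case rounding error inside the range is half a step, 1/7.
reconstructed = dequantize(quantize_soft_value(0.3, 3), 3)
```

Each extra bit halves the step size but also adds a register bit and a DAC bit per soft value, which is the area and power cost mentioned above.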

# 4.5. TRELLIS REPRESENTATION OF CONVOLUTIONAL CODES

As mentioned previously, the proposed decoders in the coming chapters are designed for the familiar memory-2 (7,5) convolutional codes. These codes are defined by the generator polynomials  $G(D) = [1 + D + D^2,\ 1 + D^2]$ , i.e. 7 and 5 in octal notation. The structure of the encoder is shown in Fig. 4.1.

The relations between the inputs, states and outputs of an encoder can be illustrated graphically by a state diagram referred to as the *trellis*. For an encoder with memory *m*, the trellis representation shows all  $2^m$  states and all possible transitions between them. Every path in a trellis represents a codeword, and the number of stages represents the BL of the code.
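One trellis section of the 4-state code can be enumerated in a few lines. The sketch below assumes the usual octal reading of (7,5), i.e. generators $1 + D + D^2$ and $1 + D^2$, and an output ordering with the generator-7 bit first; the state labelling and function names are illustrative choices.

```python
def encode_step(state, bit):
    """One step of the rate-1/2 (7,5) encoder.

    state = (s1, s2), where s1 is the most recent past input bit and s2 the
    one before it. Returns (next_state, (out_a, out_b)).
    """
    s1, s2 = state
    out_a = bit ^ s1 ^ s2   # generator 7 (octal) = 1 + D + D^2
    out_b = bit ^ s2        # generator 5 (octal) = 1 + D^2
    return (bit, s1), (out_a, out_b)

def build_trellis_section():
    """List all edges (state, input_bit, output_pair, next_state) of one stage."""
    edges = []
    for s1 in (0, 1):
        for s2 in (0, 1):
            for bit in (0, 1):
                next_state, outputs = encode_step((s1, s2), bit)
                edges.append(((s1, s2), bit, outputs, next_state))
    return edges

trellis_edges = build_trellis_section()
```

With memory 2 there are $2^2 = 4$ states and two edges leaving each state, giving 8 edges per stage.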

# 4.6. TAIL-BITING CODES

When a trellis is forced to start and end at the same state by proper initialization of the encoder memory, as presented in Fig. 4.1, a circular trellis is formed. This structure, as presented in Fig. 4.2 for the codes used in this work, is known as a *Tail-Biting* (*TB*) trellis [68], [16], [41].



**Figure 4.1.:** Structure of encoder and memory initialization for the (7,5) tail-biting convolutional codes.

Tail-biting is a commonly used method for terminating the trellis of convolutional codes to avoid rate loss. TB forces the trellis to begin and end at the same state [67]. In a TB circular trellis, every path revisits itself after a specific number of transitions, which is set by the BL of the code. Every path in a trellis represents a codeword, and the number of stages represents the BL of the code.
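A minimal sketch of tail-biting encoding is given below. It assumes the common initialization for feedforward encoders, in which the memory is preloaded with the last two information bits of the block, and the generator pair $1 + D + D^2$, $1 + D^2$; the exact initialization of Fig. 4.1 may be drawn differently.

```python
def tail_biting_encode(info_bits):
    """Encode a block with the (7,5) code so that start and end states match.

    Returns (coded_bits, start_state, end_state). Needs at least 2 info bits.
    """
    # Preload the memory with the values it will hold after the whole block.
    s1, s2 = info_bits[-1], info_bits[-2]
    start_state = (s1, s2)
    coded = []
    for u in info_bits:
        coded.append(u ^ s1 ^ s2)   # generator 7 (octal)
        coded.append(u ^ s2)        # generator 5 (octal)
        s1, s2 = u, s1              # shift the register
    return coded, start_state, (s1, s2)

coded, start, end = tail_biting_encode([1, 1, 0, 1, 0, 1])
```

No extra termination bits are appended, so a block of BL information bits yields exactly 2BL coded bits and the rate stays at 1/2.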

# 4.7. THE GENERIC SUM-PRODUCT DECODING ALGORITHM

Unlike the encoding procedure, decoding is generally considered a complicated process. An algorithm that is used in the decoding process must deal with a complex non-linear global function of many variables. One practical method to deal with such problems is to break down the global function and express it as a product of simpler local functions. Each of the local functions then depends on a subset of the variables. The solution for the global function can be obtained by iterative interactions among the local functions. The interactions are executed by *passing messages*, that is, by passing the outputs of each local function as inputs to the other local functions.

A commonly used iterative message passing algorithm for decoding is the *Sum-Product algorithm (SP)* [84]. As the name suggests, the decoding algorithm computation is mainly composed of summation and multiplication operations. When a trellis representation is considered, the basic computations of the algorithm underlying iterative decoding can be expressed as [46]

$$p_{z}(z) = \gamma_{SP} \sum_{y} \sum_{x} p_{x}(x) p_{y}(y) f(x, y, z),$$
(4.1)

where  $p_x(x)$  and  $p_y(y)$  are probability distributions such that  $\sum_x p_x(x) = 1$ ,  $\sum_{y} p_y(y) = 1$  and  $\gamma_{SP}$  is a scaling factor to ensure  $\sum_{z} p_z(z) = 1$ . Also, *f* is a function that takes either 0 or 1 values. In the SP decoding algorithm, *f* can conveniently be determined by the trellis diagram. In a trellis representation,  $f(x, y, z) = 1$  if and only if an edge labeled *y* between the left-hand node *x* and the right-hand node *z* exists.

**Figure 4.2.:** Tail-biting trellis structure for the 4-state (7,5) convolutional codes with block length (BL) = 6.
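The update of Eq. 4.1 can be written out directly once the edges for which $f(x, y, z) = 1$ are listed. The sketch below uses a toy section in which $z = x \oplus y$; this edge list and the function name are assumptions chosen only to keep the example small, not the trellis of the (7,5) code.

```python
def sum_product_update(p_x, p_y, edges):
    """Compute p_z(z) = gamma_SP * sum_x sum_y p_x(x) p_y(y) f(x, y, z).

    edges lists the triples (x, y, z) for which f(x, y, z) = 1.
    """
    p_z = {}
    for x, y, z in edges:
        p_z[z] = p_z.get(z, 0.0) + p_x[x] * p_y[y]
    total = sum(p_z.values())          # 1 / gamma_SP
    return {z: value / total for z, value in p_z.items()}

# Toy section: left node x, edge label y, right node z = x XOR y.
XOR_EDGES = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
p_out = sum_product_update({0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, XOR_EDGES)
```

Only multiplications, additions and one normalization are needed, which is what makes the algorithm attractive for a current-mode implementation.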

# 4.8. THE BCJR DECODING ALGORITHM

There exist many variants of the sum-product algorithm. One specific instance of the algorithm is referred to as the *forward-backward* or the *BCJR* algorithm. The BCJR algorithm, published in [8] and named after its authors, has since been an efficient procedure based on the trellis representation to perform Maximum a Posteriori (MAP) estimation. The BCJR algorithm is a soft-in-soft-out algorithm.

The BCJR algorithm, which can be applied to convolutional codes, is rather complex in theory, but has gained practical popularity since the introduction of Turbo Codes [14]. TB convolutional codes can be decoded using the BCJR algorithm. In this algorithm, two recursive calculations, one clockwise and one counter-clockwise, are performed along the trellis to calculate the forward and backward metrics, which are referred to as  $\alpha$  and  $\beta$ , respectively. The BCJR decoding algorithm estimates the original bit sequence **u** by computing the *a posteriori* Log-Likelihood Ratio (LLR),  $L(u_k | \mathbf{y})$ , for each single bit; that is a real number defined by the ratio

$$L(u_k|\mathbf{y}) = ln \frac{p(u_k = +1|\mathbf{y})}{p(u_k = -1|\mathbf{y})},$$
(4.2)

where **y** is a sequence of *n* real values at the input of the decoder. The numerator and denominator of Eq. 4.2 contain *a posteriori* conditional probabilities, that is, probabilities computed after the whole sequence **y** is received. The positive or negative sign of Eq. 4.2 indicates which bit, +1 or -1, was coded at time instance *k*. Its magnitude can be considered a reliability measure on the decided bit: the further the magnitude is from the decision threshold, the more confidence is placed in the estimated bit. This sign and magnitude information provided by  $L(u_k|\mathbf{y})$  expresses a *soft information* definition for each bit that can be applied to the next decoding block, or converted to the corresponding information bit as a hard decision, e.g. if  $L(u_k|\mathbf{y})$  is negative, the decoder will estimate bit  $u_k = -1$ , and  $u_k = +1$  if  $L(u_k|\mathbf{y})$  is positive.
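The relation between an LLR, the underlying probability and the hard decision can be sketched as follows. The function names are illustrative, and mapping an LLR of exactly zero to +1 is an arbitrary tie-breaking assumption.

```python
import math

def llr_from_prob(p_plus):
    """Eq. 4.2 for a single bit: ln(p(u = +1) / p(u = -1)), with 0 < p_plus < 1."""
    return math.log(p_plus / (1.0 - p_plus))

def prob_from_llr(llr):
    """Inverse mapping: probability that the bit is +1."""
    return 1.0 / (1.0 + math.exp(-llr))

def hard_decision(llr):
    """The sign gives the bit; the magnitude (discarded here) gives the reliability."""
    return +1 if llr >= 0 else -1

confident = llr_from_prob(0.99)   # large positive value, reliable +1
uncertain = llr_from_prob(0.55)   # close to zero, unreliable +1
```

Both example values give the same hard decision, but only the soft value tells a following decoding block how much to trust it.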

According to the BCJR algorithm, the *a posteriori* LLR,  $L(u_k|\mathbf{y})$ , can be written as

$$L(u_{k}|\mathbf{y}) = ln \frac{\sum_{TR_{1}} \alpha_{k-1}(s') \gamma_{k}(s',s) \beta_{k}(s)}{\sum_{TR_{0}} \alpha_{k-1}(s') \gamma_{k}(s',s) \beta_{k}(s)},$$
(4.3)

In the above equation, s' and s refer to the trellis (encoder) previous state and current state, respectively. The channel metric  $\gamma_k(s', s)$  is a conditional probability that is defined by the received signals from the channel. The  $\alpha$  and  $\beta$  metrics are computed recursively around the trellis, as

$$\alpha_{k}(s) = \sum_{s'} \alpha_{k-1}(s')\gamma_{k}(s',s) \qquad k = 1, ..., BL - 1$$

$$\beta_{k-1}(s') = \sum_{s} \beta_{k}(s)\gamma_{k}(s',s) \qquad k = BL, ..., 2$$
(4.4)



Figure 4.3.: UPD BER in thermal noise.

For  $\alpha_k(s)$ , the summation is over all converging branches from previous states  $s_{k-1} = s'$  linked to current state s, while for  $\beta_{k-1}(s')$  the summation is over all states  $s_k = s$  that have links to state s'.

Figure 4.2 shows the TB trellis of the (7,5) codes and how the  $\alpha$  and  $\beta$  metrics are calculated along the trellis.
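The recursions of Eq. 4.4 and the LLR of Eq. 4.3 can be sketched in software as below, with $\beta$ summed over the successor states as described in the text. Several points are assumptions made for this illustration: the metrics are normalized at every step, the circular trellis is handled by running a fixed number of laps from a uniform start, boundaries are indexed from 0, and $TR_1$ and $TR_0$ are taken to be the transitions caused by the two possible bit values.

```python
import math

def normalize(values):
    """Scale a list of non-negative metrics so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]

def forward_backward(gamma, n_states, laps=3):
    """Circular alpha/beta recursions.

    gamma[k][sp][s] is the branch metric of section k from state sp to
    state s (0.0 where the trellis has no edge). alpha[k] and beta[k] are
    the state metrics at the boundary before section k.
    """
    bl = len(gamma)
    uniform = [1.0 / n_states] * n_states
    alpha = [list(uniform) for _ in range(bl + 1)]
    beta = [list(uniform) for _ in range(bl + 1)]
    for _ in range(laps):
        alpha[0] = alpha[bl]      # tail-biting: the last boundary is the first
        for k in range(bl):
            alpha[k + 1] = normalize([
                sum(alpha[k][sp] * gamma[k][sp][s] for sp in range(n_states))
                for s in range(n_states)])
        beta[bl] = beta[0]        # same wrap-around in the other direction
        for k in range(bl - 1, -1, -1):
            beta[k] = normalize([
                sum(beta[k + 1][s] * gamma[k][sp][s] for s in range(n_states))
                for sp in range(n_states)])
    return alpha, beta

def bit_llr(alpha, beta, gamma, k, edges_one, edges_zero):
    """Eq. 4.3 for section k; edges_* are lists of (sp, s) state pairs."""
    num = sum(alpha[k][sp] * gamma[k][sp][s] * beta[k + 1][s]
              for sp, s in edges_one)
    den = sum(alpha[k][sp] * gamma[k][sp][s] * beta[k + 1][s]
              for sp, s in edges_zero)
    return math.log(num / den)

# Toy 2-state example in which every section favours ending in state 0.
toy_gamma = [[[0.9, 0.1], [0.9, 0.1]] for _ in range(3)]
toy_alpha, toy_beta = forward_backward(toy_gamma, 2)
toy_llr = bit_llr(toy_alpha, toy_beta, toy_gamma, 0,
                  [(0, 1), (1, 1)], [(0, 0), (1, 0)])
```

In the analog decoder no laps are counted: the same equations are solved by the continuous-time settling of the circuit.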

# 4.9. SYSTEM LEVEL UPD BASEBAND SIMULATION

To investigate the behavior of the chosen baseband architecture, a signal chain consisting of  $\Delta\Sigma$  ADCs [63], digital decimation and channel select filters [70], digital matched filters, and (7,5) decoder has been simulated in Matlab. First the performance in presence of just thermal noise was investigated. In Fig. 4.3 the BER versus energy per information bit (Eb) divided by thermal noise spectral density (No) is shown. Two simulated curves can be seen, one including the decoder, and one for the uncoded mode. In the figure, also the theoretical curve for non-coherent detection is indicated as a reference.
## Part II

## 5

### Low Power Analog Decoding

#### 5.1. DECODING IN ANALOG

This chapter presents some basics of the analog decoding concept that was initiated by Hagenauer and Loeliger [34], [32], [42], [47]. Computing in the analog domain is fundamentally different from that in the digital one. In digital implementations values are represented by binary numbers with limited word-lengths, while for analog computing, continuous-time currents and/or voltages are used to represent real values [50]. This is intrinsically helpful in soft decision decoding algorithms in which the strengths of the received signals in the coded block play a role in decoding the transmitted message. Decoders that use analog circuitry to implement the decoding algorithm operate by passing either currents or voltages among the primitive processing blocks [34], [42]. If primitive calculations are performed on currents, the circuit is called a current mode decoder. Otherwise, it is called a voltage mode decoder.

Most commonly used soft decoding algorithms require a significant number of additions and multiplications. To implement such algorithms, these tasks have to be realized in analog circuitry. In this work, currents are used for basic calculations, because implementing adders in a current mode circuit is straightforward and is done by shorting wires together. Thus, addition does not require any power or dedicated area on silicon. Also, there is no need for additional circuitry like voltage level shifters that might be needed in voltage mode computations.

The operation of a current mode analog decoder is based on MOS transistors which realize an exponential relation between the drain current and gate-source voltage when operating in weak inversion. This is suitable to convert the received LLR values to the corresponding probabilities, represented by currents throughout the network. Despite the slow operation of the transistors at low current levels, high throughput in analog decoders can be achieved by a highly parallel network of transistors operating in continuous time. Consequently, the convergence in the iterative decoding algorithm is achieved by settlement of transient voltage and current values after presentation of each new set of received coded data.

Analog decoding can be used both to implement high performance LDPC decoders, as reported in [25], and to decode TB convolutional codes, as studied in this work. In the following sections, a simplified model for analog decoding in CMOS weak inversion is presented to illustrate the mapping from the decoding algorithm to transistor level circuits.

#### 5.2. ANALOG DECODING BASIC CALCULATIONS: A SIMPLIFIED MODEL

According to the analog decoding concept, the exponential I-V characteristic of MOS transistors in the weak inversion region can be used to implement the BCJR decoding algorithm. To understand how, let us consider the simplified version of drain current equation as introduced before in the eq. 3.7. That equation is rewritten below for convenience:

$$I_{DS} = I_0 e^{\frac{V_{GS}}{V_T}},$$
(5.1)

where, for the sake of brevity, the constant argument  $nU_T$  has been replaced by  $V_T$ .

By considering a transistor configuration as shown in fig. 5.1 and assuming that all transistors operate in weak inversion, the drain current of  $Q_1$  can be written as

$$I_{1} = I_{0}e^{\frac{V_{G1S}}{V_{T}}} = I_{0} \cdot \frac{\left(e^{\frac{V_{G1S}}{V_{T}}} + e^{\frac{V_{G2S}}{V_{T}}}\right)}{\left(e^{\frac{V_{G1S}}{V_{T}}} + e^{\frac{V_{G2S}}{V_{T}}}\right)} \cdot e^{\frac{V_{G1S}}{V_{T}}}.$$
(5.2)

Since  $I_b = I_0 e^{\frac{V_{G1S}}{V_T}} + I_0 e^{\frac{V_{G2S}}{V_T}}$  the above equation can be rewritten as

$$I_{1} = I_{b} \frac{e^{\frac{V_{G1S}}{V_{T}}}}{e^{\frac{V_{G1S}}{V_{T}}} + e^{\frac{V_{G2S}}{V_{T}}}}.$$
(5.3)



Figure 5.1.: Stack of NMOS transistors.

Likewise for  $Q_2$ 

$$I_{2} = I_{b} \frac{e^{\frac{V_{G2S}}{V_{T}}}}{e^{\frac{V_{G1S}}{V_{T}}} + e^{\frac{V_{G2S}}{V_{T}}}}.$$
(5.4)

Rearranging the two previous equations for  $I_1$  and  $I_2$  leads to

$$\frac{I_1}{I_b} = \frac{1}{1 + e^{-\frac{V_{G1} - V_{G2}}{V_T}}},$$
(5.5)

and

$$\frac{I_2}{I_b} = \frac{e^{-\frac{V_{G1} - V_{G2}}{V_T}}}{1 + e^{-\frac{V_{G1} - V_{G2}}{V_T}}}.$$
(5.6)

Since  $I_1 + I_2 = I_b$ , it can be assumed that fractions  $\frac{I_1}{I_b}$  and  $\frac{I_2}{I_b}$  can represent probabilities of an arbitrary random variable X to be zero or one; i.e.:

$$p(X=0) = \frac{I_1}{I_b}, \quad p(X=1) = \frac{I_2}{I_b}.$$
 (5.7)

Now let us define:

$$L(X) = \frac{V_{G1} - V_{G2}}{V_T} = \frac{\Delta V}{V_T}.$$
(5.8)

Then by substituting the above equation and Eq. 5.7 in Eq. 5.5 and 5.6 we will end up with

$$p(X=0) = \frac{1}{1+e^{-L(X)}}, \quad p(X=1) = \frac{e^{-L(X)}}{1+e^{-L(X)}}.$$
 (5.9)

If a transistor is working in the weak inversion region, by rearranging eq. 5.1, gate-source voltage can be expressed as a function of the drain current as

$$V_{GS} = V_T ln \frac{I_{DS}}{I_0}.$$
(5.10)

Now in Fig. 5.1, since source voltages are the same the differential gate voltage between  $Q_1$  and  $Q_2$  can be written as

$$\Delta V = V_{G1} - V_{G2} = V_T ln(\frac{I_1}{I_0}) - V_T ln(\frac{I_2}{I_0}).$$
(5.11)

or,

$$\Delta V = V_T ln(\frac{I_1}{I_2}) \tag{5.12}$$

By dividing both the numerator and the denominator of the argument of the logarithm by  $I_b$ , the following equation is obtained

$$\frac{\Delta V}{V_T} = ln(\frac{\frac{I_1}{I_b}}{\frac{I_2}{I_b}})$$
(5.13)

The above equation shows the relations between the differential gate voltage and the drain currents of  $Q_1$  and  $Q_2$ . Then, by referring to the assumptions in eq. 5.7 and eq. 5.8, the connection between circuit voltages and currents and the LLR definition in chapter 4 can be seen

$$L(X) = \frac{\Delta V}{V_T} = ln(\frac{p(X=0)}{p(X=1)}).$$
(5.14)

Likewise, the relation between differential currents and the *LLR* value can be derived as follows

$$\Delta I = I_1 - I_2 = I_b \frac{1}{1 + e^{(-\frac{\Delta V}{V_T})}} - I_b \frac{e^{(-\frac{\Delta V}{V_T})}}{1 + e^{(-\frac{\Delta V}{V_T})}}$$
(5.15)

therefore

$$\Delta I = I_b tanh(\frac{\Delta V}{2V_T}) \tag{5.16}$$

or equivalently

$$\Delta I = I_b tanh(\frac{L(X)}{2}) \tag{5.17}$$

by defining

$$\lambda(X) = \frac{\Delta I}{I_b} \tag{5.18}$$

then we have

$$\lambda(X) = tanh(\frac{L(X)}{2})$$
(5.19)
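The chain from Eq. 5.5 to Eq. 5.19 can be checked numerically. In the sketch below the tail current, the value of $V_T$ and the differential voltage are illustrative numbers, not design values from this work.

```python
import math

def diff_pair_currents(delta_v, i_b, v_t):
    """Eq. 5.5 and 5.6: branch currents of the weak-inversion pair."""
    e = math.exp(-delta_v / v_t)
    i1 = i_b / (1.0 + e)
    i2 = i_b * e / (1.0 + e)
    return i1, i2

I_B = 100e-9      # tail current in amperes (illustrative)
V_T = 0.035       # n * U_T in volts (illustrative)
DELTA_V = 0.05    # differential gate voltage in volts (illustrative)

i1, i2 = diff_pair_currents(DELTA_V, I_B, V_T)
llr = math.log(i1 / i2)        # Eq. 5.14: equals DELTA_V / V_T
lam = (i1 - i2) / I_B          # Eq. 5.18: equals tanh(llr / 2)
```

The two branch currents always add up to the tail current, which is why they can be read as the probabilities of a binary variable.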



Figure 5.2.: A pair of diode connected PMOS transistors.

#### 5.2.1. LOGARITHMIC CONVERSIONS

The inverse transformation from probabilities (currents in electronic circuits terms) into log-likelihood ratio (differential voltage) can be performed by using a pair of diode connected transistors. The transistors need to operate in the weak inversion region. Such a pair is shown in fig. 5.2 for PMOS transistors. For this configuration and under weak inversion conditions, the following equations apply

$$V_{CC} - V_{O1} = V_T ln(\frac{I_1}{I_0})$$
(5.20)

and

$$V_{CC} - V_{O2} = V_T ln(\frac{I_2}{I_0}).$$
(5.21)

Now by subtracting eq. 5.21 from eq. 5.20, the following expression is obtained:

$$V_{O2} - V_{O1} = V_T ln(\frac{I_1}{I_0}) - V_T ln(\frac{I_2}{I_0}).$$
(5.22)

The above equation is rearranged to the following

$$\frac{V_{O2} - V_{O1}}{V_T} = ln(\frac{\frac{I_1}{I_b}}{\frac{I_2}{I_b}}),$$
(5.23)

which expresses how the currents in the diode connected pair are related to the output differential voltage. Again, if the currents normalized to the total current  $I_b$  are considered as probabilities, the above equation tells how the probabilities are related to the output *LLR* value; i.e.

$$\frac{\Delta V_O}{V_T} = ln(\frac{p(X=0)}{p(X=1)}).$$
(5.24)

Figure 5.3.: Gilbert vector multiplier with differential inputs.

The same procedure applies for a pair of NMOS transistors to convert input *LLR* values to the corresponding probabilities represented by currents.
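A quick numerical check of Eq. 5.20 to 5.24 is sketched below. The supply voltage, $V_T$, $I_0$ and the two currents are illustrative values only.

```python
import math

def diode_output_voltages(i1, i2, v_cc, v_t, i_0):
    """Eq. 5.20 and 5.21: output node voltages of the diode connected pair."""
    v_o1 = v_cc - v_t * math.log(i1 / i_0)
    v_o2 = v_cc - v_t * math.log(i2 / i_0)
    return v_o1, v_o2

def diode_pair_llr(i1, i2):
    """Eq. 5.23 and 5.24: I_0 and I_b cancel, leaving ln(I1 / I2)."""
    return math.log(i1 / i2)

V_CC, V_T, I_0 = 1.2, 0.035, 1e-12      # illustrative values
v_o1, v_o2 = diode_output_voltages(80e-9, 20e-9, V_CC, V_T, I_0)
llr_from_voltage = (v_o2 - v_o1) / V_T
llr_from_currents = diode_pair_llr(80e-9, 20e-9)
```

The result does not depend on $I_0$ or on the supply, only on the ratio of the two currents.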

#### 5.2.2. GILBERT VECTOR MULTIPLIER

Now, let us consider the circuit in fig. 5.3 which shows a Gilbert multiplier configuration [27]. The configuration is an extension of the transistor configuration previously presented in fig. 5.1. Therefore, equations 5.7, 5.8, and 5.9 apply here too, i.e. we can write

$$p(X=0) = \frac{I_5}{I_b} = \frac{1}{1 + e^{(-\frac{\Delta V_1}{V_T})}} = \frac{1}{1 + e^{-L(X)}}.$$
(5.25)

and

$$p(X=1) = \frac{I_6}{I_b} = \frac{e^{-\frac{\Delta V_1}{V_T}}}{1 + e^{-\frac{\Delta V_1}{V_T}}} = \frac{e^{-L(X)}}{1 + e^{-L(X)}}.$$
(5.26)

So, the current  $I_1$  can be expressed as

$$\frac{I_1}{I_b} = \frac{I_1}{I_5} \cdot \frac{I_5}{I_b} = \frac{1}{1 + e^{-\frac{\Delta V_2}{V_T}}} \cdot \frac{1}{1 + e^{-\frac{\Delta V_1}{V_T}}}.$$
(5.27)

Similar to *X*, the following assumption can be considered for another arbitrary random variable *Y* 

$$p(Y=0) = \frac{1}{1+e^{-L(Y)}}, \quad p(Y=1) = \frac{e^{-L(Y)}}{1+e^{-L(Y)}},$$
(5.28)

where L(Y) in the above assumptions is equal to  $\frac{\Delta V_2}{V_T}$ . Now, eq. 5.27 can be rewritten as

$$\frac{I_1}{I_b} = \frac{1}{1 + e^{-L(Y)}} \cdot \frac{1}{1 + e^{-L(X)}},$$
(5.29)

or in simpler terms

$$\frac{I_1}{I_b} = p(Y=0).p(X=0).$$
(5.30)

The above equation shows how the output currents of a Gilbert multiplier relate to the product of probabilities for two random variables *X* and *Y*. Following similar procedure as above, other possible probability multiplications relate to other currents in the circuit topology, i.e.

$$\frac{I_2}{I_b} = \frac{I_2}{I_5} \cdot \frac{I_5}{I_b} = \frac{e^{-\frac{\Delta V_2}{V_T}}}{1 + e^{-\frac{\Delta V_2}{V_T}}} \cdot \frac{1}{1 + e^{-\frac{\Delta V_1}{V_T}}} = \frac{e^{-L(Y)}}{1 + e^{-L(Y)}} \cdot \frac{1}{1 + e^{-L(X)}}$$
(5.31)

$$\frac{I_2}{I_b} = p(Y=1).p(X=0)$$
(5.32)

$$\frac{I_3}{I_b} = \frac{I_3}{I_6} \cdot \frac{I_6}{I_b} = \frac{e^{-\frac{\Delta V_2}{V_T}}}{1 + e^{-\frac{\Delta V_2}{V_T}}} \cdot \frac{e^{-\frac{\Delta V_1}{V_T}}}{1 + e^{-\frac{\Delta V_1}{V_T}}} = \frac{e^{-L(Y)}}{1 + e^{-L(Y)}} \cdot \frac{e^{-L(X)}}{1 + e^{-L(X)}}$$
(5.33)

$$\frac{I_3}{I_b} = p(Y=1).p(X=1)$$
(5.34)

$$\frac{I_4}{I_b} = \frac{I_4}{I_6} \cdot \frac{I_6}{I_b} = \frac{1}{1 + e^{-\frac{\Delta V_2}{V_T}}} \cdot \frac{e^{-\frac{\Delta V_1}{V_T}}}{1 + e^{-\frac{\Delta V_1}{V_T}}} = \frac{1}{1 + e^{-L(Y)}} \cdot \frac{e^{-L(X)}}{1 + e^{-L(X)}}$$
(5.35)

$$\frac{I_4}{I_b} = p(Y=0).p(X=1)$$
(5.36)
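Equations 5.29 to 5.36 can be verified with a few lines of code. The sketch below computes the four output currents from two LLR values and a tail current; the numerical inputs and function names are illustrative.

```python
import math

def prob_pair(llr):
    """Return (p(.=0), p(.=1)) for a given LLR, as in Eq. 5.9 and 5.28."""
    p0 = 1.0 / (1.0 + math.exp(-llr))
    return p0, 1.0 - p0

def gilbert_currents(l_x, l_y, i_b):
    """Output currents I1..I4 of the Gilbert multiplier (Eq. 5.30 to 5.36)."""
    px0, px1 = prob_pair(l_x)
    py0, py1 = prob_pair(l_y)
    return {
        "I1": i_b * py0 * px0,
        "I2": i_b * py1 * px0,
        "I3": i_b * py1 * px1,
        "I4": i_b * py0 * px1,
    }

currents = gilbert_currents(l_x=1.0, l_y=-0.5, i_b=100e-9)
```

The four currents sum to $I_b$, so the outputs again form a valid probability distribution over the pair (X, Y).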

#### 5.3. FORWARD-BACKWARD COMPUTATIONS

An example of a generic sum-product module based on the Gilbert vector multiplier at transistor-level is shown in fig. 5.4 together with the corresponding trellis representation. The inputs of the block are the BCJR algorithm forward and backward metrics  $\alpha$  and  $\beta$ , as well as channel metrics  $\gamma$  that are represented by the current vectors  $Ix_i$ ;  $i \in 0, N$  and  $Iy_j$ ;  $j \in 0, M$  respectively. The vector multiplier generates all possible probability products of the two input variables, labeled by currents  $Iz_{ij}$ . In fact, the currents are normalized by the total currents flowing into the first set of inputs; i.e.

$$I_{z_{ij}} = \frac{I_{x_i} I_{y_j}}{\sum_{k=0}^n I_{y_k}} \qquad i = 0, ..., m; j = 0, ..., n.$$
(5.37)

Thus, the output currents represent the output probabilities, which result from the products of the input probabilities. Low current levels (i.e. weak inversion) are not only necessary for proper operation, but also maintain a low power consumption. The transistors in the multipliers should have an exponential relation between gate-source voltage and drain current for proper multiplication, which occurs at low current levels in the sub-V<sub>T</sub> region. The required outputs are summed in the summation/connectivity network by shorting the corresponding wires, and the rest are discarded by connecting their wires to  $V_{CC}$ , since they are not needed. After normalization, the output currents are applied to the next cell by a set of PMOS current mirrors. The normalized output currents represent the output probabilities, which are the input for the next block in the network.

In fig. 5.4, a key parameter is the *reference current*,  $I_{Ref}$, of primitive blocks. Since the probabilities are represented by currents, there has to be a unique reference current in the circuit corresponding to a probability of 1. Then all the real valued probabilities can be defined by a fraction of this current. While the input probabilities in the circuit correspond to  $p(x_i) = I_{x_i}/I_{Ref}$  and  $p(y_j) = I_{y_j}/I_{Ref}$ , the input probability vectors must satisfy  $\sum_i I_{x_i} = I_{Ref}$  and  $\sum_j I_{y_j} = I_{Ref}$ . Similarly, the same requirements are valid for the output current vector  $I_{z_{ij}}$ . If some of the partial products are discarded in the multiplier, then all the currents to the next block must be re-normalized in order to satisfy  $\sum I_{z_{ij}} = I_{Ref}$ .

**Figure 5.4.:** Generalized Gilbert multiplier network for implementing the sum-product algorithm shown with corresponding trellis representation.

Figure 5.5.: Transient plots of the decoded output bits in an analog decoder represented by low level currents.
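The product of Eq. 5.37, followed by summation of the wanted terms and re-normalization to $I_{Ref}$, can be modelled as below. The current values and the choice of which products are kept are arbitrary examples; in the decoder they are fixed by the trellis.

```python
def vector_multiplier(i_x, i_y):
    """Eq. 5.37: all pairwise products, normalized by the sum of the y currents."""
    total_y = sum(i_y)
    return [[ix * iy / total_y for iy in i_y] for ix in i_x]

def sum_and_renormalize(i_z, kept, i_ref):
    """Short the kept products together per output and rescale to i_ref.

    kept maps an output index to the list of (i, j) products wired to it;
    products not listed are discarded.
    """
    sums = {out: sum(i_z[i][j] for i, j in pairs) for out, pairs in kept.items()}
    total = sum(sums.values())
    return {out: value * i_ref / total for out, value in sums.items()}

I_REF = 100e-9                                   # illustrative reference current
products = vector_multiplier([60e-9, 40e-9], [70e-9, 30e-9])
outputs = sum_and_renormalize(products, {0: [(0, 0), (1, 1)], 1: [(0, 1)]}, I_REF)
```

Here the product at index (1, 0) is discarded, so the kept currents add up to only 72 nA and have to be scaled back to 100 nA before they are passed on.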

#### 5.4. OPERATION OF AN ANALOG DECODING CORE

The computations needed for the equations in the BCJR algorithm described in chapter 4 are carried out by a circuit topology that is determined by the tail-biting trellis of the code. The complete forward-backward computation is performed by a network of fully connected individual vector multipliers. The forward and backward wiring of the individual blocks in the network is based on the TB circular trellis formation, in counter-clockwise and clockwise directions. The decoding process of a coded block starts by loading soft values from the channel in parallel to the network. The soft data then stand as voltages or currents in the highly connected network of analog multipliers. Assuming the whole circuit successfully realizes the tail-biting trellis of the code, feedback loops in the network make the voltages or currents converge to steady state levels which correspond to the decoded data. A typical behavior of an analog decoding core is shown in Fig. 5.5. The time between two pulses in Fig. 5.5 shows the time allocated for the circuit to reach a stable state, while the transient waveforms show the value of each output bit, which eventually ends up above or below a certain decision threshold. Since there is no need for any kind of memory in this scheme, the settling time required for convergence is limited only by the intrinsic speed of the transistors and by the parasitic capacitance of the routing.

# 6

### Hardware Mapping of the Analog Decoding Circuit

#### 6.1. SYSTEM PERSPECTIVE

In order to incorporate an analog decoding circuit into a functional receiver chain, different alternatives should be investigated. Basically, there are two main scenarios: a) to apply analog decoding directly on the received analog signals and b) to use it after digital base-band processing and demodulation. In a), synchronization and symbol detection are a challenge in the analog domain. These tasks are still best done in the digital domain. In b), a Digital-to-Analog Converter (DAC) is required before the analog decoder, which introduces hardware overhead.

In this work, scenario b) has been followed. The best solution for wireless communication is a set of processing blocks that provide the robustness and programmability of digital designs together with the power and speed performance of analog computing circuits. These points motivate the investigation of low power DACs and ADCs combined with an analog decoding core. These circuits are necessary for the analog decoders to interface with the surrounding digital circuitry. In addition, they can eliminate the costly and inefficient storage capacitors which are normally required in fully analog interfaces. Consequently, it is important to investigate whether, given the additional complexity and power consumption of the data conversion circuits, analog decoding remains a feasible alternative.

#### 6.2. TOP-LEVEL ARCHITECTURE OF THE DECODER

The top-level architecture of the analog decoding circuit is provided in Fig. 6.1. A digital interface, as well as data converting circuits, are considered to facilitate using the decoder in a digital receiver. The design consists of the analog decoding core for a tail-biting trellis with block length *BL*, a simple digital interface, an array of  $2 \times BL$  low resolution Current Steering Digital-to-Analog Converters (CS-DACs), and an array of *BL* current comparators. The following subsections provide descriptions of the different components of the design.

#### 6.2.1. INPUT INTERFACE

An input interface is needed to take the serially incoming quantized digital soft information, and temporarily store it in a memory. As soon as all the soft information for a block of BL coded data has been received, it needs to be translated into differential electrical currents and applied in parallel to the analog decoding core. In order to do so, a separate *current steering* DAC is required for each quantized data value in the received block of coded data. Compared with architectures with fully analog interfaces, where sample-and-hold blocks are used to store the received values, in this scheme data storage is robust and there is no need for any capacitor. An array of  $2BL \times n$-bit registers is required to store a block of BL soft information data, each quantized by *n* bits. In addition, an array of D Flip-Flops (DFFs) is placed between the registers and the DACs. The data is first clocked into the registers. Then the DFFs are simultaneously clocked to transfer and hold the data for the DAC inputs. New data (i.e. the next block of coded data) can now be clocked into the registers. The DFFs will hold the DAC input words so that the decoding core can work while the new data is clocked in. Each pair of differential inputs required for the decoding core can be generated simultaneously in a current-steering DAC with differential output. The DACs are built from arrays of current sources directly injecting differential current into the decoder inputs. The sum of electrical currents from each DAC should match that of the decoding core; i.e. should match the reference current  $I_{Ref}$. Essentially, each bit in the DAC consists of a number of PMOS current source transistors and a PMOS differential pair. The outputs of the differential pair are connected to the two differential inputs of the core. In this way, the current is always on and steered to the core inputs.
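A behavioural model of one differential current-steering DAC is sketched below. It assumes binary weighting, with bit *b* steering $2^b$ unit currents to one of the two outputs, and a unit current chosen so that the two outputs always add up to $I_{Ref}$. The code word, resolution and current value are illustrative.

```python
def cs_dac(code, n_bits, i_ref):
    """Return the differential output currents (i_pos, i_neg) for an n-bit code.

    Every bit owns a binary weighted current source; its differential pair
    steers that current to i_pos when the bit is 1 and to i_neg when it is 0.
    """
    unit = i_ref / (2 ** n_bits - 1)
    i_pos = 0.0
    i_neg = 0.0
    for b in range(n_bits):
        weight = unit * (1 << b)
        if (code >> b) & 1:
            i_pos += weight
        else:
            i_neg += weight
    return i_pos, i_neg

# 3-bit example with a 70 nA reference, i.e. a 10 nA unit current.
i_pos, i_neg = cs_dac(0b101, 3, 70e-9)
```

Because the current is only steered and never switched off, the total drawn from the supply is constant and the pair of outputs can be read as $(p(x=1), p(x=0))$ scaled by $I_{Ref}$ or the reverse, depending on the chosen polarity.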

#### 6.2.2. ANALOG DECODING CORE

The decoding core works on the data, represented by currents, and generates BL differential decoded soft output bits. The comparator array translates the soft decoded bits into hard decided bits. The level of the currents in the decoding core can be adjusted by an off-chip variable resistor. Low current levels force the transistors in the decoding core to operate in the sub-V<sub>T</sub> region, and maintain an exponential characteristic at low power consumption.

#### 6.2.3. OUTPUT INTERFACE

Since the output of the decoder core is an analog vector showing the probabilities of the decoded bits to be 0 or 1 by means of electrical currents, there is a need for an output interface to decide on the value of each bit. For this purpose, an array of latched current-mode comparators is used. The comparators are based on a design using a pair of cross-coupled inverters with a flip-flop latch, [90] [83]. Every comparator takes a pair of electrical currents representing the probabilities of the output bit to be 0 or 1. If the value of the current representing the probability of 1 is greater than the other one then the comparator output voltage reaches the digital supply,  $V_{DD}$ ; or zero whenever the condition is reversed. Thus, the output interface translates the analog probability currents into the digital decided bits.

#### 6.2.4. DIGITAL CONTROLLER

The digital circuitry buffers the  $2 \times BL$  received soft information symbols for each coded block. Hence, a total of  $2BL \times n$-bit registers are dedicated for buffering a complete block. When a complete block has been buffered, it is applied in parallel to the decoding core via an array of n-bit resolution binary weighted differential CS-DACs. The digital interface also takes the outputs from the comparators, coordinates the serial streaming of the decoded bits, and handles all the required timing signals, including the time period for the analog core to converge. Except for input buffering, which only takes  $2BL \times n$-bit registers, no other storage is required; i.e. no analog memory is involved. Finally, a parallel-to-serial shift register feeds out the decoded bits in serial. Digital I/O interfaces facilitate using the decoder the same way as an ordinary digital decoder without a need for changes in the receiver architecture.



Figure 6.1.: Architecture of the analog decoding circuit.

#### 6.2.5. TIMING

The timing of the decoder operations is shown in Fig. 6.2. The time allotted for the currents in the analog decoding core to settle to their final values, which represent the decoded data, is set to 28 clock periods. Consequently, the time allocated for decoding can easily be adjusted through the clock frequency. During this time, the decoder processes the current block of coded data, while the input interface buffers the next block of 28 newly received samples, one per clock cycle. The decision time is the instant at which the hard decision is taken on the level of the output currents. Right after that, the decoded data is fed out as serial bits.
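As a brief worked illustration of this timing (a sketch only: the symbols  $f_{clk}$,  $T_{dec}$  and  $R$  are introduced here for convenience, and the 1 MHz clock is merely an example value), a half rate code with a decoding window of  $2BL$  clock periods delivers  $BL$  decoded bits per window:

$$T_{dec} = \frac{2BL}{f_{clk}}, \qquad R = \frac{BL}{T_{dec}} = \frac{f_{clk}}{2}.$$

For  $BL = 14$  and  $f_{clk} = 1$  MHz this gives  $T_{dec} = 28\ \mu$s and  $R = 500$  kb/s.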

#### 6.3. EXTENDED (8,4) HAMMING DECODER: A BRIEF INVESTIGATION

Hamming codes have previously been implemented in the analog domain in older technologies, as a proof of the analog decoding concept. In [85], a current-mode method based on the tail-biting representation of the codes was pursued, while in [22], Forney-style factor graphs were used for the hardware implementation. The current-mode Hamming decoder is briefly re-evaluated here, mainly to obtain an initial benchmark for the following design considerations. Firstly, given the technology gap between the mentioned older designs and the present 65 nm CMOS technology, what current levels are operational in the decoding core, and what speed and computational accuracy are achievable? Secondly, with the intention of using the decoder at the digital back-end of the UPD receiver chain, what would be the power and energy overhead of the peripheral interface circuitry? Finally, following the UPD power and rate specifications mentioned in Chapter 2, and taking the Hamming decoder as a benchmark design, would selecting a more complex decoder be an option or not?

**Figure 6.2.:** Timing diagram of the decoder. 2 BL clock cycles is the dedicated time for decoding.

#### 6.3.1. ANALOG DECODING CORE

The decoder core demonstrates a current-mode implementation of the BCJR forward-backward decoding algorithm on the tail-biting trellis of an extended Hamming code [85]. The decoder for the (8,4) Hamming code receives  $2BL = 8$  parallel input samples from the channel and produces the  $BL = 4$  information bit estimates in parallel. Every analog input sample represents a *soft bit*, i.e. a differential pair of currents in which one current represents the probability that the received bit is 1 and the other the probability that it is 0. The probability of a variable *x* is thus carried on a pair of wires as the current vector  $(I_{x0}, I_{x1})$, corresponding to  $(p(x = 0), p(x = 1))$. A probability of 1 is represented by the full reference current  $I_{Ref}$; thus,  $I_{x0} + I_{x1} = I_{Ref}$.
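The soft-bit representation can be sketched numerically as follows. This is only an idealized illustration, not the circuit itself: the function names are hypothetical, the default reference current of 100 nA is the value used later in the simulations, and the comparator is modeled as a flawless threshold on the two currents.

```python
def soft_bit_currents(p1, i_ref=100e-9):
    """Map p(x=1) to the differential current pair (I_x0, I_x1) in amperes."""
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("p1 must be a probability in [0, 1]")
    i_x1 = p1 * i_ref
    i_x0 = i_ref - i_x1  # enforces I_x0 + I_x1 = I_Ref
    return i_x0, i_x1


def hard_decision(i_x0, i_x1):
    """Idealized comparator: 1 if the 'probability of 1' current is larger."""
    return 1 if i_x1 > i_x0 else 0


i_x0, i_x1 = soft_bit_currents(0.8)
print(i_x0, i_x1, hard_decision(i_x0, i_x1))
```

Because only the ratio of the two currents carries information, the same probability can in principle be represented at any reference current level.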

The chosen amount for the unit current must ensure that all transistors are biased and stay in the sub- $V_T$  region. As one might notice, the integrity of the decoding process is highly dependent on the accuracy of the unit currents used throughout the network.

#### 6.3.2. SIMULATION RESULTS

ST's 65 nm Low Power-High Threshold Voltage (LP-HVT) CMOS transistor library was used to simulate the analog Hamming decoder architecture. A wireless link with BPSK modulation and an Additive White Gaussian Noise (AWGN) channel is considered in order to evaluate the performance of the decoder.

In the simulations, a reference current of  $I_{Ref} = 100$  nA was chosen, which ensures that all transistors in the analog core as well as in the current-steering DACs operate in the sub-V<sub>T</sub> region. The BER performance of the decoder obtained from transistor level simulations is shown in Fig. 6.3. The curve closely follows the ideal performance expected from the extended Hamming decoder. The BER performance of an uncoded system with a signal corrupted in an AWGN channel is provided for comparison.

Power consumption estimates and characteristics of the decoder are summarized in Tables 6.1 and 6.2, respectively. The analog circuits and the input DACs use a 1.2 V supply, whereas the digital circuitry operates at 0.8 V. The decoder converges to a 4-bit codeword in less than 2  $\mu$s, which translates into a decoding speed of 2.5 Mb/s. At this rate, there is no significant loss in BER performance. The complete decoder consumes only 40  $\mu$W at a throughput of 2.5 Mb/s. The required power drops to a total of 16  $\mu$W at a lower throughput of 250 kb/s, mostly thanks to power savings in the digital circuitry at lower clock frequencies. The power consumption of the digital circuitry at 0.8 V is also derived from transient simulations in the Cadence Spectre environment.
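The energy per decoded bit follows directly from these figures. As a short worked check (the symbols  $P$  and  $R$  for total power and throughput are introduced here only for illustration):

$$\frac{E}{b} = \frac{P}{R}: \qquad \frac{40\ \mu\text{W}}{2.5\ \text{Mb/s}} = 16\ \text{pJ/b}, \qquad \frac{16\ \mu\text{W}}{250\ \text{kb/s}} = 64\ \text{pJ/b}.$$

Although the total power is lower at 250 kb/s, the energy per decoded bit is higher, since the total power does not scale down in proportion to the rate.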

The power consumption for the reported analog Hamming decoders is provided in Table 6.3. Studying the table should be done with caution, since the power consumption heavily depends on different factors such as chosen technology and decoder type. Required energy per decoded bit (E/b) is also included as an indicator for a comparison.

Studying the specification of the Hamming decoder in 65 nm CMOS reveals the possibility of choosing a more complex design for the UPD receiver. That is because the power budget dedicated to the decoder in the UPD specification target allows power to be traded for a better coding gain. Since the Hamming decoder is not intended for hardware implementation, further simulations are not carried out. Consequently, the rest of this chapter provides details of the steps that have been followed to decode the well-known (7,5) codes in the analog domain [2].

Figure 6.3.: Bit error rate performance, 2.5 Mb/s.

Table 6.1.: Power consumption of different sections of the decoder

| Sub-Circuit          | Power Consumption [µW] at 2.5 Mb/s | Power Consumption [µW] at 250 kb/s |
|----------------------|------------------------------------|------------------------------------|
| DACs                 | 5                                  | < 2                                |
| Analog decoding core | 6                                  | 6 (rate independent)               |
| Digital circuitry    | 28                                 | 8                                  |
| Output comparators   | 1                                  | < 1                                |
| Total                | 40                                 | 16                                 |

Table 6.2.: Analog decoder characteristics

| Parameter                    | Value                            |
|------------------------------|----------------------------------|
| Technology                   | 65 nm CMOS, LP-HVT               |
| Analog supply voltage        | 1.2 V                            |
| Digital supply voltage       | 0.8 V                            |
| Clock frequency              | up to 5 MHz, max. coding gain    |
| Decoder throughput           | up to 2.5 Mb/s, max. coding gain |
| Total energy per decoded bit | 16 pJ/b @ 2.5 Mb/s               |
| Coding gain @ BER= $10^{-3}$ | 1.5 dB                           |

Table 6.3.: Energy comparison for analog Hamming decoders

| Reference              | CMOS technology [µm] | I<sub>Ref</sub> | E/b    | Core power [µW] | Total power [µW] |
|------------------------|----------------------|-----------------|--------|-----------------|------------------|
| [53] (simulation)      | 0.18                 | 1 µA            | 640 pJ | N/A             | 283              |
| [22] (measured)        | 0.25                 | 100 nA          | 140 nJ | < 5             | 55               |
| [85] (measured)        | 0.18                 | 10 µA           | 102 pJ | 150             | 229              |
| this work (simulation) | 0.065                | 100 nA          | 16 pJ  | 6               | 40               |

#### 6.4. (7,5) ANALOG DECODING CIRCUIT

In the following sections, until the end of this chapter, the design flow used to arrive at a promising candidate for the (7,5) analog decoding circuit is described.

#### 6.4.1. MISMATCH CONSIDERATIONS FOR GILBERT VECTOR MULTIPLIERS

One of the major limiting factors of analog decoders is the device matching, and it has been shown in [91] and [12] that local threshold voltage variation is the dominant issue in the sub- $V_T$  region. The threshold voltage variation of a transistor causes variations in the drain current which can be expressed as

$$I_{D,\text{Circuit}} = I_D\,\epsilon \tag{6.1}$$

where  $\epsilon = \exp\left(\frac{\Delta V_{TH}}{nU_T}\right)$,  $n$  is the sub-threshold slope factor, and  $U_T$  is the thermal voltage.  $\Delta V_{TH}$  is a Gaussian random variable with zero mean and variance  $\sigma_{V_{TH}}^2$, where

$$\sigma_{V_{TH}} = \frac{A_{\Delta}}{\sqrt{WL}}. \tag{6.2}$$

 $A_{\Delta}$  is a parameter related to the fabrication process, and *W* and *L* are the dimensions of the transistor. If the mismatch effects are now applied to Eq. 5.37, a more realistic model of the multiplier can be derived:

$$I_{z_{ij},\text{Circuit}} = I_{z_{ij}}\left(1 + \Delta_{I_{z_{ij}}}\right) = \frac{I_{x_i}\epsilon_i\, I_{y_j}\epsilon_j}{\sum_{k=0}^{n} I_{y_k}\epsilon_k}. \tag{6.3}$$

In the above equation,  $\Delta_{I_{z_{ij}}}$  is a parameter representing the total deviation from the ideal output current, i.e. the total effect of the mismatch. Using Eq. 5.37 to rearrange Eq. 6.3 leads to

$$\Delta_{I_{z_{ij}}} = \frac{\epsilon_i \epsilon_j}{\sum_{k=0}^{n} \frac{I_{y_k}}{I_{\text{Ref}}} \epsilon_k} - 1. \tag{6.4}$$

Since the fraction  $I_{z_{ij}}/I_{Ref}$  defines the output probabilities, the above equation shows that these probabilities deviate from their ideal values essentially because of the local threshold voltage variations of the transistors. As long as other effects are ignored and the cell acts as a multiplier, the scale of the currents used is irrelevant to the variations in the probabilities.
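The mismatch model of Eqs. 6.1, 6.2 and 6.4 can be explored with a small Monte-Carlo sketch such as the one below. It is an illustration under stated assumptions rather than the simulation setup of this work: the value of  $A_{\Delta}$, the slope factor n = 1.3 and the thermal voltage of 25.9 mV are assumed example numbers, the function names are hypothetical, and the x-side mismatch factor is taken to be independent of the y-side factors, which is one reading of the indices in Eq. 6.3.

```python
import math
import random


def sigma_vth(a_delta, w, l):
    """Eq. 6.2: standard deviation of the threshold voltage (V)."""
    return a_delta / math.sqrt(w * l)


def mismatch_factor(sigma, rng, n=1.3, u_t=0.0259):
    """Eq. 6.1: epsilon = exp(dVth / (n * U_T)) with dVth ~ N(0, sigma^2)."""
    return math.exp(rng.gauss(0.0, sigma) / (n * u_t))


def multiplier_deviation(p_y, j, sigma, rng):
    """Eq. 6.4 for one random mismatch realization.

    p_y lists the normalized inputs I_yk / I_Ref, which sum to 1.
    """
    eps_x = mismatch_factor(sigma, rng)                 # x-side device
    eps_y = [mismatch_factor(sigma, rng) for _ in p_y]  # y-side devices
    denom = sum(p * e for p, e in zip(p_y, eps_y))
    return eps_x * eps_y[j] / denom - 1.0


rng = random.Random(1)
# Assumed A_delta = 3.5 mV*um, with W and L given in micrometers.
sigma = sigma_vth(a_delta=3.5e-3, w=1.0, l=0.5)
runs = [multiplier_deviation([0.5, 0.5], 0, sigma, rng) for _ in range(10000)]
mean = sum(runs) / len(runs)
std = math.sqrt(sum((r - mean) ** 2 for r in runs) / len(runs))
print(f"sigma_Vth = {sigma * 1e3:.2f} mV, mean = {mean:.4f}, std = {std:.4f}")
```

Note that only the normalized inputs enter `multiplier_deviation`, which mirrors the observation that the absolute current scale does not affect the deviation in this idealized model.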



**Figure 6.4.:** Sample soft bit output for Matlab model in comparison with circuit simulations.

#### 6.4.2. JOINT SPECTRE-MATLAB SIMULATION AND ANALYSIS MODEL

It has been observed that analog decoding circuits are sensitive to mismatch variations. Therefore, to evaluate the expected performance loss due to mismatch between transistors, Monte-Carlo (MC) simulations need to be carried out. That means that MC mismatch simulations must be performed for every block of received coded data. However, the performance of decoders is usually assessed by evaluating thousands of blocks of data. When dealing with the physical implementation of decoders, it would clearly be extremely time-consuming to employ circuit simulation tools for the comprehensive analysis that is needed. As a result, transistor level MC simulations of analog decoding circuits are not practical, even for small decoders, due to the processing power required. An alternative approach with a reasonable processing time is needed that still provides a close estimate of the results of transistor level simulations. In this thesis, a combined approach has been followed. First, the Gilbert cell building blocks are accurately simulated at transistor level and carefully evaluated using Monte-Carlo simulations. This helps to derive the statistics of the deviations from the ideal performance of a Gilbert multiplier in the presence of process and mismatch variations. Then, the derived statistics are employed in a Matlab model that is developed precisely according to the topology of the corresponding analog decoding circuit. In the Matlab model, the multiplications needed in the BCJR decoding algorithm are replaced by non-ideal multiplications with the error statistics that correspond to the circuit model of a non-ideal Gilbert multiplier.

**Figure 6.5.:** BER performance of the (7,5) decoder for different BL.
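The idea of replacing ideal multiplications by non-ideal ones can be sketched as follows. This is a minimal Python illustration, not the Matlab model used in this work: the function names are hypothetical, and the error is assumed here to be a zero-mean Gaussian relative deviation with standard deviation `sigma_delta`, whereas the actual statistics are those extracted from the transistor level MC simulations.

```python
import random


def ideal_multiplier(p_x, p_y):
    """Normalized vector multiplier: z_ij = x_i * y_j / sum(y)."""
    total_y = sum(p_y)
    return [[x * y / total_y for y in p_y] for x in p_x]


def nonideal_multiplier(p_x, p_y, sigma_delta, rng=random):
    """Ideal product with every output scaled by (1 + Delta)."""
    return [[z * (1.0 + rng.gauss(0.0, sigma_delta)) for z in row]
            for row in ideal_multiplier(p_x, p_y)]


rng = random.Random(7)
print(ideal_multiplier([0.25, 0.75], [0.5, 0.5]))
print(nonideal_multiplier([0.25, 0.75], [0.5, 0.5], 0.05, rng))
```

In a decoder model built this way, every call to the ideal multiplier in the forward-backward recursions would be swapped for the non-ideal version, so that the error statistics propagate through the whole trellis.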

#### 6.4.3. ESTIMATING THE REQUIRED SILICON AREA

Here, two issues need to be elaborated on. Firstly, the decoding process in an analog decoding circuit is actually the settling of the currents and voltages throughout the circuit some time after a new input has been applied. That means that the decoding is based on continuous-time values, while in a Matlab simulation model the values are necessarily represented digitally with limited precision. Secondly, the settling of the current and voltage levels in the circuit happens in parallel, whereas in the Matlab model convergence has to be achieved by consecutive, stage-after-stage computation along the implemented circuit model. Therefore, the accuracy of the Matlab model itself should be verified side by side against the accurate transistor level simulation of the analog decoder.

To evaluate the Matlab model, the (7,5) analog decoding circuit and the constructed Matlab model have been compared. The soft outputs of the two models for a sample decoded bit are shown in Fig. 6.4. The convergence timing cannot be derived from the Matlab model; in the figure it has therefore been inserted to match the analog simulation results. The small difference between the levels of the two curves suggests that the accuracy of the Matlab model is acceptable compared with the transistor simulation results. The differences between the soft outputs of the two simulations, normalized to  $I_{Ref}$  and evaluated over 1000 bits, have an average of less than 1 percent, while the variance of this error is 1.6 percent.

To find the minimum silicon area needed for the decoder, two major design parameters have to be determined.

#### **CHOOSING A SUITABLE BLOCK LENGTH**

The required area for an analog decoding circuit designed for a TB trellis primarily depends on the number of states and the BL of the code, which determine the size of the trellis circle. To achieve the full coding ability, BL should not be too short; on the other hand, beyond the length that provides the full coding ability, increasing BL only results in a larger circuit. A TB code allows for a short BL without sacrificing too much of the performance. The proper BL choice can be deduced from high-level simulations.

Such a simulation was performed for the selected codes, and the BER curves are shown in Fig. 6.5. The block length of the code, which determines the size of the tail-biting trellis structure, was varied and the BER performance was observed. A large BL yields the maximum error correcting performance, but approximately the same performance can be achieved with a smaller trellis structure. Since smaller hardware is desired, the selection criterion for the block length was set to be within 5 percent difference, in logarithmic scale, from the maximum achievable coding gain. For the (7,5) codes, BL=14 promises the minimum hardware size without sacrificing more than 5 percent in BER performance. This choice offers near-maximum coding gain and, at the same time, facilitates the minimum area implementation of the corresponding decoder.

#### CHOOSING TRANSISTOR DIMENSIONS

Besides the selection of BL, another important design factor for the chip area is the transistor dimensions. As described in the previous section, the performance of the decoder is limited by mismatch errors that are inversely related to the size of the transistors. Also, if the deviation of the current values from the ideal ones due to mismatch becomes a large percentage of the reference current, the circuit response is more prone to errors. Thus, it is desirable to find the smallest required device size before the combined effects of mismatch, process variations, and flicker noise start to deteriorate the BER performance. A favorable solution for a decoder intended to be used in a compact, low power, and portable device is the one that offers maximum coding gain and minimum area, while consuming the minimum power.

For this purpose, the performance of each decoder is simulated under mismatch non-idealities using the joint Spectre-Matlab method described in Section 6.4.2. To be more realistic, the reference currents and the current mirrors used are assumed to be non-ideal, with a Gaussian variation of zero mean and variance = 10. The performance simulations are averaged over 50 realizations of each circuit with process and mismatch non-idealities. It is also assumed that a flawless array of current comparators decides on the decoded bits. The decision is taken according to the differential output currents that represent the soft output of each decoded bit.

The BER performance curves are presented in Fig. 6.6. These simulations help to design an efficient decoder, since the results provide a rough estimate of the required area and power consumption as well as the offered coding gain. The energy efficiency of the decoder is calculated from the power consumption and the maximum achievable throughput of the circuit. Plain transient circuit simulations revealed that the minimum settling time to decode a block of coded received samples is 30  $\mu$s and 50  $\mu$s when  $I_{Ref}$  is adjusted to 20 nA and 10 nA, respectively. Because BL=14 is chosen, this time period corresponds to the decoding of 14 output bits. This means that the maximum data rates at these current levels are 470 kb/s and 280 kb/s, respectively. In the simulations, the supply voltage is  $V_{CC}$ = 1.2 V.
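The quoted data rates follow from the settling times. As a brief worked check (with  $t_{settle}$  and  $R_{max}$  introduced here only as shorthand), one block of  $BL = 14$  decoded bits is produced per settling period:

$$R_{max} = \frac{BL}{t_{settle}}: \qquad \frac{14}{30\ \mu\text{s}} \approx 467\ \text{kb/s}, \qquad \frac{14}{50\ \mu\text{s}} = 280\ \text{kb/s}.$$

The first value is rounded to 470 kb/s in the text.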

To derive estimates of the total area in each scenario, a completely routed layout is required as a reference. Thus, a complete layout of the analog decoding core with transistor dimensions W/L = 1.0  $\mu$m /0.5  $\mu$m was implemented for the (7,5) codes with a TB trellis size of 14 (BL = 14). The total area of this decoding core turned out to be 0.015 mm<sup>2</sup>. This layout was used as a benchmark to estimate the area required for the circuits in Fig. 6.6 when other transistor dimensions are selected. For each design candidate, the size of the transistors as well as the reference current was varied in order to select the proper device dimensions. It is assumed that the area required for routing increases in proportion to the total number of basic multipliers. With that in mind, together with the performance simulations in the next section, one can gain insight into the performance of the decoder for different transistor sizes. The total area estimates, calculated from each set of transistor sizes and the completely routed reference layout for W/L = 1.0  $\mu$m /0.5  $\mu$m transistors, are included in Fig. 6.6 as the numbers inside the brackets. As the BER curves in Fig. 6.6 suggest, selecting a shorter length (L=0.2  $\mu$m) for the core transistors results in severe estimated BER degradation in the low power profiles.

#### 6.5. CIRCUIT DETAILS

As an outcome of the investigation presented in section 6.4, three versions of (7,5) analog decoders were fabricated. This section presents details of the implemented circuits.

#### DECODING CORE BLOCK DIAGRAM

Block diagram of the decoding core is illustrated in Fig. 6.8. It includes input cells to calculate the  $\gamma$  parameters, the computing cells to calculate  $\alpha$  and  $\beta$  parameters, and the output cells to compute the final probability values.

#### CURRENT STEERING DACS

An array of 4-bit, binary weighted CS-DACs is fabricated in the design. The array consists of 28 individual DACs. Details of the circuit are shown in Fig. 6.9. The level of the currents is adjustable with an off-chip variable resistor.
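The behavior of a differential binary weighted current-steering DAC can be sketched with an idealized model. The function name, the unit current value and the assignment of the steered branches to the two outputs are assumptions made for illustration; the actual circuit is the one shown in Fig. 6.9.

```python
def cs_dac(code, i_unit, n_bits=4):
    """Ideal differential binary-weighted current-steering DAC.

    Returns (i_x0, i_x1): bit k steers a current of 2**k * i_unit to the
    i_x1 output when it is 1 and to the i_x0 output when it is 0.
    """
    if not 0 <= code < 2 ** n_bits:
        raise ValueError("code out of range for the given resolution")
    i_x0 = 0.0
    i_x1 = 0.0
    for k in range(n_bits):
        branch = (1 << k) * i_unit
        if (code >> k) & 1:
            i_x1 += branch
        else:
            i_x0 += branch
    return i_x0, i_x1


i_unit = 1e-9  # hypothetical unit current (A)
for code in (0, 5, 10, 15):
    i_x0, i_x1 = cs_dac(code, i_unit)
    print(code, i_x0, i_x1, i_x0 + i_x1)
```

In this model the two outputs always sum to the same total current, which matches the requirement that the pair of soft-bit currents adds up to a fixed reference.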

|                            | AD1             | AD2         | AD3       |
|----------------------------|-----------------|-------------|-----------|
| PMOS W/L [μm]              | 12.0/0.4        | 2.0/0.4     | 1.0/0.4   |
| NMOS W/L [μm]              | 6,(16,12,9)/0.5 | 2,(4,3)/0.5 | 1,(2)/0.5 |
| Core area [mm<sup>2</sup>] | 0.104           | 0.038       | 0.015     |

Table 6.4.: Transistor dimensions of fabricated analog decoding cores.

#### **CURRENT SOURCES**

To feed the multiplier cells in the decoding core, a total of 128 copies of the  $I_{Ref}$  current are needed. The currents should have identical values; otherwise they may cause inaccuracy in the calculations. Therefore, large transistors are used in the current mirrors to minimize variations at low current levels.

#### MULTIPLIER CELLS

Samples of input cells and computing cells with the chosen dimensions are shown in Fig. 6.11 and Fig. 6.12.

#### 6.6. FABRICATED ANALOG DECODERS

Following the presented simulations, the analog decoding core was fabricated with three selected sets of transistor dimensions. The three decoding core versions are named Analog Decoding Circuit *i* (AD*i*), *i* = 1, 2, 3. The fabricated chips also include the digital interface and mixed signal circuits.

#### 6.6.1. ANALOG DECODING CIRCUIT 1, AD1

For AD1, NMOS transistor dimensions (W/L) were mainly selected as 6.0/0.5  $\mu$m. The placed and routed decoding core takes 0.104 mm<sup>2</sup>. The chip photo and complete layout of the design are provided in Fig. 6.13. For the chip shown in Fig. 6.13(a), the silicon area excluding pads is 0.27 mm<sup>2</sup>, of which the analog decoding circuitry, AD1, occupies 0.104 mm<sup>2</sup>.

#### 6.6.2. ANALOG DECODING CIRCUIT 2, AD2

For AD2, NMOS transistor dimensions (W/L) were mainly selected as  $2.0/0.5 \mu$m. The chip photo and the complete layout of the design are provided in Fig. 6.14. The whole design occupies  $0.30 \text{ mm}^2$  excluding pads, of which AD2 takes  $0.038 \text{ mm}^2$.

#### 6.6.3. ANALOG DECODING CIRCUIT 3, AD3

For AD3, NMOS transistor dimensions (W/L) were mainly selected as 1.0/0.5  $\mu$m. The area of AD3 is 0.015 mm<sup>2</sup>.

The design of the interface and mixed signal circuits was kept identical for all three circuits to support a valid comparison of the decoding cores. The only difference is a few minor changes in the digital interface for the AD2 and AD3 cores, so that it can be shared between them. LP-HVT transistors in 65 nm CMOS technology were used in the implementations.

While designing the layout, it appeared that uniform dimensions for all transistors would result in unused space in the layout. Therefore, as indicated in Table 6.4, bigger transistors were used in a few places for better matching. Thus, performance can be improved without imposing any overhead on the total area of the circuit.

In all three cores, one- and two-dimensional common centroid layout techniques have been used to improve the matching of the current mirrors. A sample of a matching layout technique for the transistors is shown in Fig. 6.7. The layout of all individual computational blocks was done manually, whereas all intra-block and higher level routing was performed from the defined Verilog netlist with the aid of auto-routing tools.



**Figure 6.6.:** BER simulations for the (7,5) codes, while considering degradation because of mismatch errors in analog decoder for different transistor sizes. Values in "[]" declare the estimated area. (a)  $I_{Ref}$  = 20 nA, L=0.5  $\mu$ m, (b)  $I_{Ref}$  = 20 nA, L=0.2  $\mu$ m, (c)  $I_{Ref}$  = 10 nA, L=0.5  $\mu$ m, (d)  $I_{Ref}$  = 10 nA, L=0.2  $\mu$ m.



**Figure 6.7.:** Sample 2-dimensional matching of transistors used in the input cells.



Figure 6.8.: Block diagram of the decoder core.



Figure 6.9.: Current steering DACs with binary weighting.



Figure 6.10.: Implemented current sources.



Figure 6.11.: Input cell.



Figure 6.12.: Computing blocks.



**Figure 6.13.:** Analog decoding circuit, AD1, with the accompanying data converters and digital interface circuitry: (a) die photo and (b) layout of the design.



**Figure 6.14.:** Analog decoding circuits, AD2 and AD3, with the accompanying data converters and digital interface circuitry: (a) die photo and (b) layout of the design.
## 7

## Performance of Analog Decoding Circuits

#### 7.1. OBJECTIVES

Experiments and measurement results for the fabricated analog decoding circuits are presented in this chapter. The goal of the measurements is to determine the following:

- BER performance of the fabricated chips for different power profiles
- Minimum power level
- Energy efficiency at different throughputs
- Maximum offered coding gain
- Minimum operational supply voltage
- Does the BER performance for smaller cores, AD2 and AD3, degrade with respect to that of AD1?
- Temperature dependency (between room and body temperature)
- How much is the power overhead for the interface circuitry?



Figure 7.1.: Measurement setup.

#### 7.2. MEASUREMENT SETUP

To generate the required test data, a communication system with BPSK modulation and AWGN channel was considered. A measurement setup including logic analyzer, digital pattern generator, power supplies and high precision digital multimeters was used. Test files were generated in MATLAB<sup>®</sup> for SNRs from 1 dB to 6 dB in steps of 1 dB. Measurements were performed in a climate chamber at both room and body temperature to consider the operational environment for the target applications. During measurements the clock frequency was varied from 250 kHz to 1 MHz in steps of 250 kHz and from 1 MHz to 4 MHz in steps of 1 MHz, which corresponds to throughputs from 125 kb/s to 2 Mb/s due to the half rate codes. The level of total current for each decoding core is adjustable by an off-chip variable resistor. Figure 7.1 shows a photo of the measurement setup.

| Clk [MHz] | Throughput [Mb/s] | Min. Supply [V] | Current [µA] | Power [µW] |
|-----------|-------------------|-----------------|--------------|------------|
| 0.25      | 0.125             | 0.58            | 5.6          | 3.2        |
| 0.50      | 0.250             | 0.60            | 5.9          | 3.5        |
| 0.75      | 0.375             | 0.62            | 6.5          | 4.0        |
| 1         | 0.5               | 0.65            | 7.3          | 4.7        |
| 2         | 1                 | 0.71            | 8.3          | 5.8        |
| 3         | 1.5               | 0.78            | 9.8          | 7.6        |
| 4         | 2                 | 0.86            | 11.4         | 9.8        |

Table 7.1.: Digital interface characteristics
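The entries of Table 7.1 can be cross-checked with a few lines of Python. The sketch below is only an illustration: the helper names are hypothetical, the throughput is assumed to be half the clock frequency because of the half rate codes, and the energy per bit of the digital interface is a derived quantity that is not listed in the table.

```python
# Rows of Table 7.1: (clock [MHz], min. supply [V], current [uA], power [uW])
TABLE_7_1 = [
    (0.25, 0.58, 5.6, 3.2),
    (0.50, 0.60, 5.9, 3.5),
    (0.75, 0.62, 6.5, 4.0),
    (1.0, 0.65, 7.3, 4.7),
    (2.0, 0.71, 8.3, 5.8),
    (3.0, 0.78, 9.8, 7.6),
    (4.0, 0.86, 11.4, 9.8),
]


def throughput_mbps(clk_mhz, code_rate=0.5):
    """Decoded bit rate for a given clock, assuming a half rate code."""
    return clk_mhz * code_rate


def energy_per_bit_pj(power_uw, rate_mbps):
    """Energy per decoded bit; uW divided by Mb/s gives pJ/b."""
    return power_uw / rate_mbps


for clk, vdd, i_ua, p_uw in TABLE_7_1:
    rate = throughput_mbps(clk)
    print(f"{clk:4.2f} MHz  {rate:5.3f} Mb/s  V*I = {vdd * i_ua:5.2f} uW  "
          f"E/b = {energy_per_bit_pj(p_uw, rate):5.1f} pJ/b")
```

With these numbers, the energy per bit of the digital interface falls from about 25.6 pJ/b at 125 kb/s to 4.9 pJ/b at 2 Mb/s, even though the absolute power increases with the clock frequency.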

#### 7.3. ANALOG DECODER MEASUREMENTS

In this section, the measurement results for the decoder with the largest decoding core, AD1, are presented first, followed by the measurement results for AD2 and AD3.

#### 7.3.1. AD1 MEASUREMENT RESULTS

The BER performance of the decoders was measured at a minimum supply voltage of 0.8 V for the decoding core. Different power profiles were set by adjusting the current. The BER results are plotted in Fig. 7.2(a) for different power levels and in Fig. 7.2(b) for different throughputs. Since the bias current of the core can accurately be controlled by an off-chip variable resistor and BER results are averaged over testing thousands of blocks of data, measuring several chip samples of the same circuit resulted in approximately similar overall BER curves. In addition to the measurement results, theoretical BER limits for the used codes with BL=14, as well as for an uncoded system are also provided. The BER curve for plain transistor level simulation, i.e. without the effect of noise, mismatch or process variation is also included for comparison. There is already a small degradation for circuit simulated curves compared to the ideal theoretical ones due to the inherent inaccuracy of analog multipliers.

The presented results verify the functionality of the decoder circuit and show what throughput and BER can be achieved under very constrained power levels. The plots reveal that the fabricated decoder still operates as an error correcting circuit even at a power consumption of just  $3.0 \,\mu\text{W}$  for a throughput of 500 kb/s. In this case the minimum total power is  $6.5 \,\mu\text{W}$  if the digital power is also included. The power consumption of the digital circuitry can be reduced even more by decreasing the supply further, if clock gating and an array of low power sub-V<sub>T</sub> to above-V<sub>T</sub> level shifters between the digital circuitry and the CS-DACs are used. For 500 kb/s throughput, the minimum power needed to come close to the maximum coding gain of the circuit is  $12.5 \,\mu\text{W}$. At this power level, as can be seen in Fig. 7.2(b), the chip still provides a significant improvement compared to uncoded BPSK communication. The circuit provides this level of performance at throughputs up to 2 Mb/s by increasing the power to  $36.1 \,\mu\text{W}$.

**Figure 7.2.:** BER performance for (a) different analog power profiles at 500 kb/s throughput; (b) different throughputs when the analog power is limited to  $10.6 \,\mu$W.

Behavior of the decoder under temperature variation can be seen in Fig. 7.3.



**Figure 7.3.:** BER performance for room temperature  $(27 \,^{\circ}\text{C})$  and body temperature  $(37 \,^{\circ}\text{C})$  when the power of the decoding core is limited to 3  $\mu$W.

Here, the power for analog circuitry was restricted to  $3 \mu$ W. BER performance for body temperature is slightly improved compared to that of the room temperature conditions. At 125 kb/s the improvement reaches approximately 0.5 dB at BER=0.001. While the supply voltage for analog circuitry is fixed to 0.8 V and its power is controlled by current, for the digital circuitry the power is reduced by scaling down the supply voltage. Table 7.1 demonstrates the minimum power levels of digital circuitry for the measured throughputs.

Fig. 7.4 presents the operation limits of the decoding core in terms of coding gain at BER=0.001 versus energy per decoded bit. Fig. 7.4(a) takes into account the total power efficiency for the whole circuitry. On the other hand, Fig. 7.4(b) shows the power efficiency of just the decoding core for the measured throughputs. Maximum achievable coding gain appears to be 2.3 dB at a minimum core energy as low as 20 pJ/b. At 7 pJ/b energy, though the performance is degraded, the decoder still can provide 1 dB of coding gain.



**Figure 7.4.:** Power efficiency for (a) the whole circuitry including digital; and (b) analog decoding circuitry.

The output interface circuitry (current comparator array) consumes less than  $1\,\mu\text{W}$  at the measured frequencies and therefore accounts for only a negligible share of the total power.

#### 7.3.2. AD2 AND AD3 MEASUREMENT RESULTS

Again, the BER performance of the decoders was measured at a minimum supply voltage of 0.8 V for the decoding core. Figure 7.5 shows the measured energy per decoded bit versus coding gain at BER= $10^{-3}$  for the three analog decoding cores, AD1 to AD3, at 2 Mb/s. This should be compared to the coding gain of 3.1 dB for an ideal implementation with BL=14, simulated in MATLAB<sup>®</sup>. As shown, the largest decoding core, AD1, needs at least 20.6  $\mu$W to reach its maximum gain of 2.3 dB. The coding gain, however, is reduced to 2.0 dB and 1.2 dB for AD2 and AD3, respectively, at the same power level. This power level corresponds to about 10 pJ/b energy dissipation. It can be seen in Fig. 7.5 that more or less the same energy is enough for AD2 to reach its maximum coding gain. Therefore, AD2 with 0.038 mm<sup>2</sup> silicon area might be a better choice if area has to be traded for a reduction in gain from 2.3 dB to 1.9 dB. AD3 is pushed towards an even smaller area, for which the minimum energy required to reach 1.2 dB gain is 22 pJ/b. The decreasing coding gain trend from AD1 to AD3 relates to the increased mismatch errors of smaller transistors. The degraded gains at lower power levels generally relate to the increased effect of noise on the computations.

Figure 7.5.: Measured coding gains for AD1,2,3 at 2 Mb/s

#### 7.4. OBSERVATIONS FROM THE MEASUREMENTS

Figure 7.6.: Measured coding gains for AD1,2,3 at 500 kb/s

As presented in the previous section, the BER performance of the smaller cores, which are built from smaller transistors, suffers from some degree of degradation. In addition, the coding gain tends to get worse at low current levels. To get a clearer picture of the behavior of the fabricated chips, it is useful to observe the response of the circuits to a few particular test cases. For the following test cases, the coded data is free from any added Gaussian noise; i.e. noise free data is considered. To find out about the effect of circuit noise on the accuracy of the results, a data stream of 20 K blocks of coded data was applied to the decoders. To observe the random behavior due to the circuit noise, each measurement was repeated 10 times. These measurements are performed on AD1 and AD3, and three power profile setups are considered. For each setup, the power consumption of the decoding core plus DAC is limited to:

- setup 1: 44.5 µW
- setup 2: 10.5 µW
- setup 3: 3.0 µW

All measurements are carried out for two target throughputs, 500 kb/s and 1.5 Mb/s.

#### 7.4.1. TEST EXPERIMENT 1

**Table 7.2.:** AD1's fixed and random errors at 500 kb/s, test experiment 1

| 500 kb/s | correctly decoded bits [%] | errors that occur at same locations [%] | errors that do not always occur at same locations [%] |
|---------|-------------------|-------------------|---------------------------|
| setup 1 | 99.11             | 0.68              | 0.21                      |
| setup 2 | 98.50             | 1.20              | 0.30                      |
| setup 3 | 97.63             | 1.99              | 0.38                      |

**Table 7.3.:** AD1's fixed and random errors at 1.5 Mb/s, test experiment 1

| 1.5 Mb/s | correctly decoded bits [%] | errors that occur at same locations [%] | errors that do not always occur at same locations [%] |
|---------|-------------------|-------------------|---------------------------|
| setup 1 | 98.98             | 0.79              | 0.23                      |
| setup 2 | 98.45             | 1.23              | 0.32                      |
| setup 3 | 92.02             | 6.94              | 1.04                      |

Since no Gaussian noise is added and each soft data input consists of 4 bits, every 1 can be represented by one fixed value and every 0 by another. With $7_{dec} = (0111)_{bin}$ as the center point, all values from 0 to 7 are considered 0's with different strengths. Likewise, the input values from 8 to 15 are taken as 1's with increasing strength. For the current test experiment, and to provide a sufficient gap between 0's and 1's, the following conditions are applied to the data:

- In the input soft coded data, all received 1s are represented by 13 (out of 15).
- All 0s are represented by 3 (out of 15).

A larger distance between the representations of the input data translates to a larger differential level between the currents that represent probabilities in the DACs. The percentages of errors over the 10 repetitions of the measurements are summarized in Tables 7.2 and 7.3 for AD1 for the two previously mentioned throughputs. The same experiments were performed on the smallest decoding circuit, AD3, and the results are provided in Tables 7.4 and 7.5.

**Table 7.4.:** AD3's fixed and random errors at 500 kb/s, test experiment 1

| 500 kb/s | correctly decoded bits [%] | errors that occur at same locations [%] | errors that do not always occur at same locations [%] |
|---------|-------------------|-------------------|---------------------------|
| setup 1 | 73.98             | 22.19             | 3.82                      |
| setup 2 | 76.28             | 20.40             | 3.32                      |
| setup 3 | 75.39             | 21.42             | 3.19                      |

**Table 7.5.:** AD3's fixed and random errors at 1.5 Mb/s, test experiment 1

| 1.5 Mb/s | correctly decoded bits [%] | errors that occur at same locations [%] | errors that do not always occur at same locations [%] |
|---------|-------------------|-------------------|---------------------------|
| setup 1 | 73.79             | 22.38             | 3.83                      |
| setup 2 | 76.06             | 20.56             | 3.38                      |
| setup 3 | 50.42             | 43.78             | 5.80                      |

As can be seen in the tables, the errors for AD3 are far more numerous than those for AD1. The tables also separate the contribution of noise (random behavior) from that of mismatch (errors at fixed locations).
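The split into fixed and random errors used in these tables can be sketched in a few lines of Python. The sketch assumes that an error counts as "fixed" when the same bit position is wrong in every repetition and as "random" when it is wrong in only some of them; this criterion, the function name `classify_errors`, and the tiny data set are illustrative assumptions, not the measurement scripts or data of this work.

```python
def classify_errors(reference, repeats):
    """Return (correct %, fixed-error %, random-error %) over all bit positions.

    reference: list of transmitted bits.
    repeats:   list of decoded bit lists, one per repeated measurement.
    """
    n = len(reference)
    fixed = random_ = 0
    for i in range(n):
        wrong = sum(run[i] != reference[i] for run in repeats)
        if wrong == len(repeats):
            fixed += 1        # wrong in every repetition: attributed to mismatch
        elif wrong > 0:
            random_ += 1      # wrong only sometimes: attributed to circuit noise
    correct = n - fixed - random_
    return tuple(100.0 * x / n for x in (correct, fixed, random_))


# Synthetic example: bit 2 is always wrong, bit 5 is wrong in one run only.
reference = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
run_a = [0, 1, 0, 0, 1, 0, 0, 1, 1, 0]
run_b = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
run_c = [0, 1, 0, 0, 1, 0, 0, 1, 1, 0]
shares = classify_errors(reference, [run_a, run_b, run_c])
print(shares)
```

With ten repetitions, as in the measurements, the same function applies unchanged; only the number of entries in `repeats` grows.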

**Table 7.6.:** AD1's fixed and random errors at 1.5 Mb/s, test experiment 2

| 1.5 Mb/s | correctly decoded bits [%] | errors that occur at same locations [%] | errors that do not always occur at same locations [%] |
|---------|-------------------|-------------------|---------------------------|
| setup 1 | 100               | 0.00              | 0.00                      |
| setup 2 | 100               | 0.00              | 0.00                      |
| setup 3 | 99.12             | 0.69              | 0.19                      |

#### 7.4.2. TEST EXPERIMENT 2

In this test experiment, the same measurements as in the previous experiment are carried out, but the input gap between the representations of 1's and 0's is increased. Thus, in the input coded data, 1s are represented by 14 (out of 15) and 0s are represented by 2 (out of 15). The percentages of errors are given in Table 7.6 for AD1 at 1.5 Mb/s. At 500 kb/s, no error was detected, which confirms the sufficiency of the 4-bit input range. At 1.5 Mb/s, still no error occurs for the higher currents. For lower currents in the circuit, however, the computations are slower. For some blocks, the allotted time is not enough for convergence, which results in some errors. The same experiments were performed on the smallest decoding circuit, AD3, and the results are provided in Tables 7.7 and 7.8. As can be seen, the integrity of the calculations in this decoding core is worse than that of AD1. Still, the majority of errors occur at fixed locations.

#### 7.4.3. MEASUREMENTS OF DIFFERENT SAMPLES PER CHIP

Measurements on different samples of each chip show that decoding errors happen at different locations. In other words, different samples sometimes decode the same block of coded data into different sequences. This observation reflects the significance of process variation for the integrity of the computations. However, the overall BER performance curves for different samples of each chip show similar results.

**Table 7.7.:** AD3's fixed and random errors at 500 kb/s, test experiment 2

| 500 kb/s | correctly decoded bits [%] | errors that occur at same locations [%] | errors that do not always occur at same locations [%] |
|---------|-------------------|-------------------|---------------------------|
| setup 1 | 93.08             | 5.78              | 1.14                      |
| setup 2 | 94.36             | 4.80              | 0.84                      |
| setup 3 | 94.63             | 4.65              | 0.71                      |

**Table 7.8.:** AD3's fixed and random errors at 1.5 Mb/s, test experiment 2

| 1.5 Mb/s | correctly decoded bits [%] | errors that occur at same locations [%] | errors that do not always occur at same locations [%] |
|---------|-------------------|-------------------|---------------------------|
| setup 1 | 92.74             | 6.06              | 1.20                      |
| setup 2 | 94.27             | 4.86              | 0.87                      |
| setup 3 | 61.56             | 32.39             | 6.05                      |

## Part III

## 8

### Low Power Digital Design Techniques

#### 8.1. POWER CONSUMPTION IN A DIGITAL CMOS CIRCUIT

The total power consumption of a digital circuit consists of three components:

$$P_{total} = P_{dynamic} + P_{short\,circuit} + P_{leakage}. \qquad (8.1)$$

The dynamic power dissipation, $P_{dynamic}$, is caused by the charging and discharging of capacitances in the circuit. The total capacitance is represented by a load capacitance, $C_L$. Therefore, the dynamic power consumption can be expressed as

$$P_{dynamic} = \alpha_{sw} C_L V_{DD}^2 f, \qquad (8.2)$$

where $\alpha_{sw}$ is called the switching activity factor and $f$ is the switching (clock) frequency. The short circuit power arises during the fraction of the switching time of the MOS transistors in which a direct path between $V_{DD}$ and ground exists:

$$P_{short\,circuit} = V_{DD}I_{short\,circuit}. \qquad (8.3)$$

The leakage power term comes from the transistor's various leakage current components [62], and in brief is expressed as

$$P_{leakage} = V_{DD} I_{leakage}. \qquad (8.4)$$
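As a numerical illustration of Eqs. 8.1 to 8.4, the following Python sketch adds up the three power components. The function name `total_power` and every component value are hypothetical placeholders chosen for illustration; they are not figures from this work.

```python
def total_power(alpha_sw, c_load, v_dd, f_clk, i_short, i_leak):
    """Return (P_total, P_dynamic, P_short_circuit, P_leakage) in watts."""
    p_dynamic = alpha_sw * c_load * v_dd ** 2 * f_clk  # Eq. 8.2
    p_short = v_dd * i_short                           # Eq. 8.3
    p_leak = v_dd * i_leak                             # Eq. 8.4
    return p_dynamic + p_short + p_leak, p_dynamic, p_short, p_leak  # Eq. 8.1


# Hypothetical operating point (illustrative values only).
p_total, p_dyn, p_sc, p_lk = total_power(
    alpha_sw=0.1, c_load=10e-12, v_dd=1.2, f_clk=1e6,
    i_short=0.1e-6, i_leak=0.5e-6)
print(f"dynamic {p_dyn * 1e6:.2f} uW, short circuit {p_sc * 1e6:.2f} uW, "
      f"leakage {p_lk * 1e6:.2f} uW, total {p_total * 1e6:.2f} uW")
```

Separating the three terms in this way makes it easy to see which component dominates at a given operating point before choosing a reduction technique.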

#### 8.2. DYNAMIC POWER REDUCTION

Eq. 8.2 shows that reducing one or more of the factors on its right-hand side reduces the dynamic power consumption. The following sub-sections present transistor-level, gate-level, and system-level techniques that reduce the dynamic power dissipation.

#### 8.2.1. LOW VOLTAGE OPERATION

Dynamic power dissipation in CMOS digital integrated circuits is a strong function of the supply voltage. Therefore, lowering the supply voltage is a very effective method to reduce power [52], [23]. The dynamic power decreases quadratically as the supply voltage is lowered. However, this reduction in power consumption comes at the cost of a significantly increased propagation delay in the circuit. To keep the power consumption low and maintain the circuit speed, methods like parallel processing or pipelining can be utilized, which in turn cost silicon area [18].

#### 8.2.2. MULTIPLE SUPPLY VOLTAGES

The idea behind multiple supply voltage operation is to dedicate higher voltages to the high speed parts of the circuit and lower voltages to the low performance parts, where speed is less critical. The multiple supply option therefore reduces the power consumption by exploiting the slack time between the low and high performance parts of the circuit. In dual $V_{DD}$ operation, high $V_{DD}$ is applied to the part of the circuit that includes the critical paths, and low $V_{DD}$ to the parts that include non-critical paths. However, DC-DC level converters are required when the outputs of a low $V_{DD}$ circuit are connected to the inputs of high $V_{DD}$ gates. Otherwise, the output levels might not be high enough to turn off the PMOS transistors in the high $V_{DD}$ gates.

#### 8.2.3. ARCHITECTURE APPROACHES

Pipelining or parallel processing can be used to improve the circuit throughput or, alternatively, to reduce the supply voltage [55]. Parallel processing uses replicas of the same hardware that operate in parallel. The circuit can then handle a higher throughput. If a higher processing speed is not required, $V_{DD}$ can be lowered to the point where the original throughput is just met. Pipelining is another technique, in which additional registers are introduced to break the critical paths. Once a circuit is pipelined, it can operate at higher speeds than the original design. As with parallel processing, the supply voltage can be reduced for the same speed.
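The voltage-for-parallelism trade can be illustrated with Eq. 8.2. In the Python sketch below, `n_copies` replicas each run at `f_clk / n_copies`, which is assumed to allow a lower supply voltage. The reduced voltage of 0.8 V, the 10 % capacitance overhead for multiplexing, and all other values are hypothetical; the real achievable voltage depends on the delay-versus-voltage characteristic of the technology.

```python
def dynamic_power(alpha_sw, c_load, v_dd, f_clk):
    """Dynamic power according to Eq. 8.2."""
    return alpha_sw * c_load * v_dd ** 2 * f_clk


def parallel_dynamic_power(alpha_sw, c_load, v_dd_low, f_clk, n_copies, overhead=0.0):
    """Dynamic power of n_copies replicas, each clocked at f_clk / n_copies.

    overhead is the assumed fractional extra capacitance (routing, multiplexers).
    """
    c_total = n_copies * c_load * (1.0 + overhead)
    return dynamic_power(alpha_sw, c_total, v_dd_low, f_clk / n_copies)


# Hypothetical reference design at 1.2 V versus a two-way parallel version at 0.8 V.
reference = dynamic_power(0.1, 10e-12, 1.2, 1e6)
two_way = parallel_dynamic_power(0.1, 10e-12, 0.8, 1e6, n_copies=2, overhead=0.1)
ratio = two_way / reference
print(f"parallel / reference dynamic power: {ratio:.3f}")
```

For the same throughput, the replicated capacitance and the divided clock rate cancel, so the saving comes almost entirely from the quadratic voltage term, reduced by the overhead factor.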

#### 8.2.4. ALGORITHMIC OPTIMIZATION

Lowering power consumption through algorithmic optimization strongly depends on the application and consequently on the representation of data in that application. Power consumption is reduced if the hardware implementation of the corresponding algorithm can be done with, for example, a shorter data dynamic range, a reduced number of memory accesses, fewer additions or multiplications, algorithmic simplifications, approximations, etc. Strength reduction techniques, such as using shift operations instead of multiplications where applicable, also reduce the power consumption.
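Strength reduction can be illustrated by replacing a multiplication by a constant with shifts and additions. The Python sketch below is only a behavioral illustration with hypothetical function names; in hardware, the shifts amount to wiring and only the adders remain.

```python
def times10_shift_add(x):
    """Compute 10 * x as 8x + 2x using two shifts and one addition."""
    return (x << 3) + (x << 1)


def multiply_by_constant(x, k):
    """Multiply x by a non-negative integer constant k with shifts and additions.

    One shifted copy of x is added for every set bit of k.
    """
    result, shift = 0, 0
    while k:
        if k & 1:
            result += x << shift
        k >>= 1
        shift += 1
    return result


print(times10_shift_add(7), multiply_by_constant(7, 10))
```

The fewer set bits the constant has, the fewer adders are needed, which is why coefficient values are sometimes chosen with this cost in mind.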

#### 8.2.5. WORD-LENGTH OPTIMIZATION

Using a fixed word-length to represent the data throughout the circuit has the drawback of unnecessary power and area overhead. Circuits carefully designed for low power operation use just enough word-length in each part to represent the data.

#### 8.2.6. LOW POWER MEMORIES

Memories are important elements of most digital designs. Almost any substantially complex digital design requires memory blocks for temporary storage of data. Designing low power memory blocks helps to keep the total power consumption of a digital circuit low. However, low power memories should preserve data integrity and provide reliable and fast read/write operations. Static Random Access Memories (SRAMs) are normally the proper choice for low power systems due to their higher speed and lower power consumption compared to other on-chip storage options such as Dynamic Random Access Memories (DRAMs).

#### 8.2.7. TRANSISTOR SIZING

NMOS transistors have higher carrier mobility than PMOS transistors. However, equal drive strengths for both transistor types are desired to balance the rise and fall times and to improve the noise margin. For low voltage operation, comparable or equal drive strengths of the P and N type MOS transistors help to achieve reliable operation at a lower $V_{DD_{min}}$, thanks to the improved noise margin. Hence, proper sizing of transistors can help to reduce the power consumption at similar speeds.

#### 8.2.8. CLOCK GATING

Clock gating is a popular technique in synchronous circuit design that can be used to reduce the power consumed by the clock tree. The contributors to this power reduction are:

1. power consumption in flip-flops.
2. power consumption in clock tree buffers throughout the design.
3. power consumption in combinational logic whose values may change at clock edges.

By disabling the clock at different parts of the clock tree, the clock gating technique reduces unwanted switching.

#### 8.2.9. MULTIPLE CLOCK DOMAINS

Using multiple clock domains offers extra possibilities to save power. Some clock domains can be used as gated clocks. Another possibility is to ensure that parts of the circuit are not clocked unnecessarily faster than required.

#### 8.2.10. AVOIDING GLITCHES

Glitches are unnecessary temporary transitions that can happen in combinational logic circuits before the final value of a gate is evaluated. Skew in the input signals of a gate can potentially result in glitches, which may or may not propagate through the rest of the combinational circuitry [62]. When such unwanted transitions propagate, the power dissipation increases accordingly. Glitches can be reduced by proper gate sizing and path balancing techniques. Propagation of glitches can be reduced by optimizing the logic to use fewer inversions, which tend to boost and propagate glitches.

#### 8.2.11. REDUCING SWITCHING ACTIVITY

Generally, any method that helps to reduce the switching activity in a CMOS digital circuit, while maintaining the expected functionality, reduces the consumed power. Switching activity can be reduced by algorithmic optimization, by architecture optimization, and by circuit-level optimization. At the circuit level, switching activity can be reduced, for example, by avoiding shared data paths. A suitable representation of data is also essential to reduce switching activity. For example, in situations where the changes in data are sequential (like memory address bits), the use of a Gray coded representation rather than binary coding leads to fewer switching transitions. Furthermore, in algorithmic operations where data sign changes are frequent, using sign-magnitude representation instead of the more commonly used two's complement representation reduces the number of transitions in each clock period.
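Both data representation effects can be checked by counting bit toggles. The Python sketch below compares binary and Gray coded sequential addresses, and two's complement with sign-magnitude for a value that alternates in sign. The 4-bit address range, the 8-bit word length, and the sample sequence are arbitrary choices made for illustration.

```python
def toggles(words):
    """Total number of bit transitions between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def twos_complement(v, bits=8):
    return v & ((1 << bits) - 1)


def sign_magnitude(v, bits=8):
    return ((1 << (bits - 1)) if v < 0 else 0) | abs(v)


# Sequential 4-bit addresses 0..15.
addresses = list(range(16))
binary_toggles = toggles(addresses)
gray_toggles = toggles([to_gray(a) for a in addresses])

# Small values with frequent sign changes.
samples = [1, -1, 2, -2, 1, -1]
tc_toggles = toggles([twos_complement(v) for v in samples])
sm_toggles = toggles([sign_magnitude(v) for v in samples])

print(binary_toggles, gray_toggles, tc_toggles, sm_toggles)
```

Gray coding changes exactly one bit per address step, while a sign change in two's complement flips nearly every bit of a small-magnitude word; sign-magnitude confines that change mostly to the sign bit.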

#### 8.3. SHORT CIRCUIT POWER REDUCTION

Short circuit power dissipation is caused by the temporary short circuit path between the supply rails during switching, and is directly related to the fall and rise times of the gate. A short circuit current flows when both the NMOS and PMOS transistors are ON, which creates a direct path between $V_{DD}$ and ground. Short circuit power can be reduced by shortening the input rise and fall times. In the near threshold or sub-threshold operation regions, where generally $V_{DD} < (V_{TH_n} + |V_{TH_p}|)/2$, the NMOS and PMOS transistors are not ON at the same time. Therefore, short circuit power dissipation is eliminated [64].

#### 8.4. LEAKAGE POWER REDUCTION

The last component that contributes to the total power consumption of a CMOS circuit, is the leakage power.

#### 8.4.1. MULTIPLE DEVICE THRESHOLDS

To trade power against speed, transistors with multiple threshold voltages can be used. In 65nm CMOS technology, standard cells are often available with three threshold options. These options are characterized as High Threshold Voltage (HVT), Standard Threshold Voltage (SVT), and Low Threshold Voltage (LVT). The leakage currents of HVT devices are much lower than those of LVT devices. Therefore, it is possible to use LVT cells in the critical paths, while HVT cells are used elsewhere. This technique mainly helps to reduce the leakage power consumption of the design, since the parts of the design that are not critical consist of less leaky devices. However, a reduction in dynamic power can also be achieved. Since LVT cells operate faster than HVT cells at similar supply voltages, the supply voltage of the LVT cells can be reduced when high rates are not required. In addition, swapping HVT and LVT cells is convenient during design, since the multi-$V_{\rm T}$ method does not alter the placement of the cells in the layout.

#### 8.4.2. POWER GATING

Power gating is an effective method in low power design. The idea is to temporarily detach the part of the circuit that is not active from the supply, to eliminate the leakage currents passing through that part. That means the digital design is divided into two parts: one part that is always powered, and another part that shares a virtual power network, which is powered only when that part needs to be active. Either header PMOS or footer NMOS transistors can be used as power switching devices. Sizing of the switching transistors is important, since they need to be large enough to handle the switching current without a significant IR voltage drop. However, the larger the switch transistors are, the slower they become, which delays turning on or shutting down the gated circuitry.

#### 8.4.3. BODY BIASING

To maintain speed while consuming less power, an efficient method is to use low threshold devices and reduce the supply voltage. However, a lower threshold voltage results in increased leakage and consequently increased standby power consumption. The voltage difference between $V_{DD}$ and the n-well of a PMOS transistor, or between the p-well of an NMOS transistor and GND, can be adjusted in different operating modes to reduce the power. By applying a positive or negative bias voltage to the substrate of a transistor, the effective threshold voltage can be adjusted to the current operating mode. In the active mode, the effective threshold can be reduced, whereas in the idle mode a higher effective threshold voltage reduces the leakage current. The body biasing technique therefore offers the flexibility of even higher speed, while reducing the power consumption in the standby mode.

#### 8.4.4. TRANSISTOR STACKING

When more than one transistor in a stacked configuration (connected in series) is turned off, the *stacking effect* occurs, which reduces the sub-threshold current. Transistor stacking increases the source voltage of the upper transistors in the stack and thus lowers the gate-source voltage ($V_{GS}$) of these transistors, even to negative values. This effect contributes to a lower sub-threshold leakage current in the circuit. The sub-threshold leakage passing through a logic gate then depends on the applied input vector. That makes the total leakage current of a circuit dependent on the states of the primary inputs. A proper selection of input vectors during the standby mode of a circuit minimizes the leakage current through the stacking effect. The most straightforward method to find the minimum leakage input vector is to test all combinations of primary inputs.
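The exhaustive search for a minimum leakage input vector can be sketched as follows. The three-gate NAND circuit, the function names, and the per-state leakage numbers are made-up placeholders in arbitrary relative units; they only encode the qualitative assumption that a NAND gate with both series NMOS transistors off leaks least. Real values would come from cell characterization.

```python
from itertools import product

# Hypothetical relative leakage of a 2-input NAND gate per input state (placeholder values).
NAND2_LEAKAGE = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.5, (1, 1): 6.0}


def nand(a, b):
    return 1 - (a & b)


def circuit_leakage(a, b, c, d):
    """Leakage of the toy circuit y = NAND(NAND(a, b), NAND(c, d))."""
    n1, n2 = nand(a, b), nand(c, d)
    return NAND2_LEAKAGE[(a, b)] + NAND2_LEAKAGE[(c, d)] + NAND2_LEAKAGE[(n1, n2)]


# Test all 2^4 combinations of the primary inputs and keep the least leaky one.
best_vector = min(product((0, 1), repeat=4), key=lambda v: circuit_leakage(*v))
best_leakage = circuit_leakage(*best_vector)
print(best_vector, best_leakage)
```

The search grows as 2^n with the number of primary inputs, so it is practical only for small blocks; the toy case also shows that the best vector for one gate can force a leaky state onto the gate it drives.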

## 9

### Designing of the Digital Decoding Circuit

#### 9.1. BASICS OF LOW POWER DIGITAL DECODER

In an equivalent low power digital decoder, quantized data are used and computations are performed in discrete time, where the speed is limited by the critical path. Multiplications in digital implementations are costly in terms of both area and power. The max-log-MAP algorithm is an approximate realization of the decoding algorithm that provides sub-optimum error performance compared to the MAP based BCJR algorithm. As will be seen in section 9.2, this sub-optimum performance is still sufficiently close to that of the BCJR algorithm for most low power applications. In the max-log-MAP algorithm, the multiplications are replaced by additions.

$$A_{k}(s) = \ln[\alpha_{k}(s)] = \max_{s'}[A_{k-1}(s') + \Gamma_{k}(s', s)],$$

$$B_{k-1}(s') = \ln[\beta_{k-1}(s')] = \max_{s}[B_{k}(s) + \Gamma_{k}(s', s)], \qquad (9.1)$$

where the capital letters indicate  $\alpha$ ,  $\beta$  and  $\gamma$  parameters of the BCJR algorithm, in the logarithmic domain. This reduction in complexity reduces the power consumption and chip area significantly in a digital implementation. Memories are normally required to store the temporary data calculations. However, short BL helps to avoid using large memory blocks for temporary storage, which also helps shrink the size of the decoder.
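The recursions of Eq. 9.1 can be tried out in a small floating-point Python model. The sketch below is not the VHDL design of this work: it assumes a non-recursive feedforward (7,5) encoder, a particular state numbering, soft inputs whose positive sign means "bit is 1", all-zero initial metrics, and two passes around the circular trellis, and it ignores the fixed-point word lengths. All function names are hypothetical.

```python
N_STATES = 4  # (7,5) code: two memory elements; state = (u[k-1] << 1) | u[k-2]


def branch(state, u):
    """Return (next_state, (c1, c2)) for input bit u leaving the given state."""
    u1, u2 = state >> 1, state & 1
    c1 = u ^ u1 ^ u2   # generator 7 (octal) = 111
    c2 = u ^ u2        # generator 5 (octal) = 101
    return (u << 1) | u1, (c1, c2)


def encode_tailbiting(bits):
    """Encode so that the start state equals the end state (tail-biting)."""
    state = (bits[-1] << 1) | bits[-2]
    coded = []
    for u in bits:
        state, (c1, c2) = branch(state, u)
        coded += [c1, c2]
    return coded


def max_log_map_decode(llr, iterations=2):
    """llr holds two soft values per trellis stage; positive means 'bit is 1'."""
    n = len(llr) // 2
    neg_inf = float("-inf")

    def gamma(k, c):
        # Branch metric: correlation of the branch output with the soft inputs.
        return (2 * c[0] - 1) * llr[2 * k] + (2 * c[1] - 1) * llr[2 * k + 1]

    # Forward recursion around the circular trellis, starting from all-zero A.
    A = [None] * n                 # A[k][s]: metric of state s entering stage k
    a = [0.0] * N_STATES
    for _ in range(iterations):
        for k in range(n):
            A[k] = a
            nxt = [neg_inf] * N_STATES
            for s in range(N_STATES):
                for u in (0, 1):
                    s2, c = branch(s, u)
                    nxt[s2] = max(nxt[s2], a[s] + gamma(k, c))
            a = nxt

    # Backward recursion, starting from all-zero B.
    B = [None] * n                 # B[k][s]: metric of state s leaving stage k
    b = [0.0] * N_STATES
    for _ in range(iterations):
        for k in reversed(range(n)):
            B[k] = b
            prv = [neg_inf] * N_STATES
            for s in range(N_STATES):
                for u in (0, 1):
                    s2, c = branch(s, u)
                    prv[s] = max(prv[s], b[s2] + gamma(k, c))
            b = prv

    # Hard decision per bit: best path with u=1 against best path with u=0.
    decoded = []
    for k in range(n):
        best = [neg_inf, neg_inf]
        for s in range(N_STATES):
            for u in (0, 1):
                s2, c = branch(s, u)
                best[u] = max(best[u], A[k][s] + gamma(k, c) + B[k][s2])
        decoded.append(1 if best[1] > best[0] else 0)
    return decoded


message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1]   # BL = 14
coded = encode_tailbiting(message)                       # 28 coded bits
soft = [1.0 if c else -1.0 for c in coded]               # noise-free soft inputs
decoded = max_log_map_decode(soft)
print(decoded == message)
```

Only additions and maximum selections appear in the recursions, which is the property that makes the algorithm attractive for a low power digital implementation.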

Aside from simplifications to the algorithm, decreasing the supply voltage to sub-$V_{\rm T}$ is an effective method to lower the power consumption, since the dynamic power decreases quadratically with voltage [78], [89]. However, the circuit will then operate slower, increasing the critical path delay and the leakage energy per operation. In order to analyze the energy dissipation and critical path delay of a given digital design, gate-level sub-$V_{\rm T}$ characterization is required. The sub-$V_{\rm T}$ energy model for standard cell based design presented in [66] has been used for this purpose. A benefit of the analysis is that it locates the energy minimum operating point ($E_{\rm min}$), as shown in Fig. 9.2 for a typical case. Assuming operation at the maximum frequency for a given supply voltage, this figure shows how the dynamic energy ($E_{\rm dyn}$) scales down quadratically with the supply voltage $V_{DD}$, and how the leakage energy per operation increases exponentially. There is a sweet spot for the minimum total energy consumption $E_{\rm T}$, where the sum of dynamic and leakage energy reaches a minimum, which is called the Energy Minimum Voltage (EMV) point. The EMV is the optimum point in terms of energy per operation, and it can be used if the data rate requirements are satisfied.

**Figure 9.1.:** BER performance of the max-log-MAP algorithm applied on tail-biting (7,5) codes with BL=14.
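The existence of an energy minimum can be reproduced with a common first-order model, sketched below in Python. This is not the characterization model of [66]: it simply assumes that the dynamic energy scales with $V_{DD}^2$ and that the leakage energy per operation scales with $V_{DD}^2 \exp(-V_{DD}/(n U_T))$, because the cycle time grows exponentially as the supply drops. The slope factor, the leakage weight `K_LEAK`, and the swept range are hypothetical values that do not describe the fabricated chip.

```python
import math

N_UT = 1.5 * 0.026   # assumed slope factor times thermal voltage [V]
K_LEAK = 1000.0      # assumed leakage weight (logic depth, activity, ...); hypothetical


def energy_per_operation(v_dd, c_eff=1.0):
    """Return (E_total, E_dyn, E_leak) in arbitrary units for supply voltage v_dd."""
    e_dyn = c_eff * v_dd ** 2
    # Leakage power times cycle time; the cycle time grows as exp(-v_dd / N_UT).
    e_leak = c_eff * v_dd ** 2 * K_LEAK * math.exp(-v_dd / N_UT)
    return e_dyn + e_leak, e_dyn, e_leak


# Sweep the supply from 0.15 V to 1.20 V in 1 mV steps and pick the minimum.
voltages = [0.15 + 0.001 * i for i in range(1051)]
v_min = min(voltages, key=lambda v: energy_per_operation(v)[0])
e_min = energy_per_operation(v_min)[0]
print(f"energy minimum near {v_min:.3f} V")
```

Below the minimum the leakage term dominates and above it the dynamic term dominates, which is the qualitative behavior described for Fig. 9.2.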

#### 9.2. THE MAX-LOG MAP BER PERFORMANCE

The performance of the implemented digital max-log-MAP algorithm for the (7,5) tail-biting codes is simulated and included in Fig. 9.1, in comparison to the performance of the theoretical BCJR algorithm. It is observed that only 0.2 dB out of a maximum of 2.9 dB is sacrificed in coding gain at BER=$10^{-3}$ compared to the direct BCJR algorithm. In return, a much simpler and hence less power consuming implementation is obtained. It appears that, for the decoding algorithm to converge, the calculations should be performed for at least two rounds along the implemented trellis. More iterations are not required, since the performance will not improve. On the other hand, a single iteration results in a large performance loss.

**Figure 9.2.:** Typical energy dissipation behavior in a digital circuit for 65 nm CMOS.

#### 9.3. ARCHITECTURE

The architecture of the proposed digital decoder is presented in Fig. 9.4. By using the max-log-MAP algorithm, the multiplications in the BCJR algorithm are replaced by additions in the logarithmic domain. The max-log-MAP decoding requires calculation of the $\Gamma$, A, and B parameters and their storage over the entire data block due to the forward and backward recursions. For large scale decoding circuits, the BL is usually long and correspondingly large memory blocks are needed. However, for this design the target is a small scale decoder with a short BL=14, so register files are used for data storage. The number of iterations around the circular TB trellis and the minimum required word-lengths were determined by high level simulations. It was concluded that, starting from all-zero initial values for the *A* and *B* metrics, at least two iterations along the trellis are needed to successfully decode the received data. As an alternative to the analog decoder, the digital decoder is designed to operate on similar input streams, and the received soft LLR values are thus also expressed with 4 bits. Simulations also showed that, to benefit from the full error correcting capability of the algorithm, $\Gamma$ has to be represented by at least 7 bits. Consequently, at least 11 bits are required to cover the full range of *A* and *B* values after the iterative decoding calculations. The operation of the decoder is described in the following items:

- *Input Section:* Similar to the analog implementation, the digital decoder operates on received blocks of 28 coded soft bits. Hence, the decoding of each block starts with buffering into allocated input registers. After this, all data is moved to another register file for calculation of the Γ metrics. Simultaneously, buffering of the next block of incoming data starts. For the forward and backward metrics (*A* and *B*) to be calculated concurrently, the Γ calculations are performed from both directions by the Γ-Low and Γ-High calculation blocks. The allocated Γ registers are filled gradually as the computations are performed. Since BL is equal to 14 and each trellis stage has 4 states, 14x4=56 registers are dedicated to Γ storage.
- *Iterative Forward-Backward Calculations:* While the $\Gamma$ registers are being filled, the calculation of $\alpha$ and $\beta$ starts in parallel, as illustrated in Fig. 9.3. For these calculations to start, the start and end $\Gamma$ parameters at the dedicated register block addresses 0-4 and 52-55 have to be available. The rest of the calculations continue step by step. After 14 clock periods, the second iteration starts. On clock cycle 17, all $\Gamma$ values are already calculated and updating the values is no longer required. The '*A*/*B* load' signal refers to the loading time for the *A* and *B* metrics to the next stage before decoding of the next block.
- *Decision Section:* The final stage is where the hard decision on the value of each bit is made. Each decision consists of addressing the corresponding register locations, followed by addition, comparison, and selection operations. A flag signal precedes the start of each block of decoded bits at the output.

**Figure 9.3.:** Timing diagram to perform iterations and recursive calculations of A and B metrics in the digital decoder.

The convergence of the decoding algorithm in the digital implementation is achieved by calculating the corresponding metrics consecutively around the circular trellis. Usually, the initial metrics are chosen randomly; therefore, to reach the final results, the calculations have to be performed around the circular trellis as many times as required. When the calculated metrics no longer change, the calculation for the current block of coded data is complete and the computations for the next block of data can be started.

#### 9.4. SYNTHESIS AT NOMINAL VOLTAGE

The digital version was implemented in VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL) and synthesized in standard 65nm CMOS. The design has been synthesized with Low Power Standard Threshold Voltage (LP-SVT) standard cells. LP-SVT proved favorable in a study presented in [66], where the main constraints were maximum throughput, lowest energy dissipation, and a single power domain. Furthermore, tight synthesis constraints were set to achieve minimum area, minimum leakage, and a short critical path at nominal voltage. As provided in Table 9.1, the estimated power consumption of the circuit at 280 kb/s and 470 kb/s is $32\,\mu$W and $51\,\mu$W, respectively. The data rates were chosen in line with the analog decoding circuit simulations presented in Chapter 6.

**Table 9.1.:** Synthesized digital decoder characteristics

|                    | Digital decoder |
|--------------------|-----------------|
| Supply voltage     | 1.2 V           |
| Area (synthesized) | 0.07 mm<sup>2</sup> |
| Maximum frequency  | 100 MHz         |
| Power at 470 kb/s  | 51 µW           |
| Energy at 470 kb/s | 108 pJ/b        |
| Power at 280 kb/s  | 32 µW           |
| Energy at 280 kb/s | 115 pJ/b        |
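The energy per bit entries of Table 9.1 follow from dividing power by throughput. The short Python sketch below repeats that arithmetic with the rounded table values; the function name is hypothetical, and because the inputs are rounded the results agree with the tabulated energies only to within about 1 pJ/b.

```python
def energy_per_bit_pj(power_uw, throughput_kbps):
    """Energy per decoded bit in pJ/b from power in uW and throughput in kb/s."""
    return power_uw * 1e-6 / (throughput_kbps * 1e3) * 1e12


e_470 = energy_per_bit_pj(51, 470)   # about 108.5 pJ/b
e_280 = energy_per_bit_pj(32, 280)   # about 114.3 pJ/b
print(f"{e_470:.1f} pJ/b at 470 kb/s, {e_280:.1f} pJ/b at 280 kb/s")
```

The same conversion relates the measured power levels and energy figures quoted elsewhere in this work, for example 20.6 µW at 2 Mb/s corresponding to about 10 pJ/b.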

#### HARDWARE MAPPING OF DIGITAL DECODER

The digital decoder was also fabricated in 65 nm CMOS and the final placed and routed design takes 0.11 mm<sup>2</sup> silicon area excluding pads. The die photo is shown in Fig. 9.5, while the related layout design is illustrated in Fig. 9.6. During place and route, the Digital Decoding Core (DDC) was placed as a separate block together with a Peripheral Communication Core (PCC). The purpose of the PCC is to provide communication between the DDC and the external test environment. The benefit of using PCC is that the DDC can operate at very low voltages, while the outputs remain strong enough for measurements. The connections between these blocks are realized without using level-shifters; rather, buffers are placed in between the two domains for appropriate translation of signal voltages.






Figure 9.5.: Die photo of the fabricated digital decoder.



Figure 9.6.: Layout of the fabricated digital decoder.

## 10

### Performance of Digital Decoding Circuit

#### 10.1. OBJECTIVES

Experiments and measurement results for the fabricated alternative digital decoding chip are presented in this chapter. The goal of the measurements is to determine the following items:

- minimum operating supply voltage at different throughputs
- minimum energy point
- achievable data rate at the minimum energy point
- response to temperature variations (room and body temperature)

#### 10.2. MEASUREMENT SETUP

A measurement setup including a logic analyzer, a digital pattern generator, power supplies, and high precision digital multimeters was used. At extremely low voltages, the output signals are weak. Therefore, a set of external operational amplifiers was used to boost the output signals for better detection by the logic analyzer. Measurements were performed in a climate chamber at both room and body temperature to reflect the operational environment of the target applications. Measurements were performed using three chip samples.

#### 10.3. REDUCED VOLTAGE MEASUREMENT RESULTS

The fabricated chips were measured at throughputs from 5 kbps up to 2 Mbps. At each throughput, the minimum operational supply voltage was recorded. The measured performance of the digital decoder is presented in the four plots of Fig. 10.1. The minimum energy dissipation is 9 pJ/b at room temperature (23°C) and improves slightly to 8 pJ/b at body temperature. Although circuits operated at a higher temperature have higher leakage currents [65], they are also faster. Therefore, for a given throughput, the supply voltage at body temperature can be reduced by 30 mV. This reduction in supply voltage results in a slight improvement in energy dissipation compared to operation at 23°C.
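The step of recording the minimum operational supply voltage at each throughput can be pictured as a downward voltage sweep. The following Python sketch is illustrative only: the function name, the 10 mV step and the `passes` callback (a stand-in for driving test vectors through the chip and comparing the captured outputs) are assumptions, not a description of the measurement scripts actually used. It also assumes that once the decoder fails at some voltage, it fails at all lower voltages.

```python
from typing import Callable, Optional


def find_min_supply_mv(passes: Callable[[int], bool],
                       v_start_mv: int = 1200,
                       v_stop_mv: int = 200,
                       step_mv: int = 10) -> Optional[int]:
    """Lower the supply in fixed steps and return the lowest passing voltage (mV).

    `passes(v_mv)` is a hypothetical callback: it should run the decoder at the
    throughput under test and report whether all decoded bits were correct.
    Returns None if the decoder does not even pass at the starting voltage.
    """
    lowest_ok = None
    v = v_start_mv
    while v >= v_stop_mv:
        if not passes(v):
            break  # first failure: the previously tested voltage is the minimum
        lowest_ok = v
        v -= step_mv
    return lowest_ok


# Stand-in for the chip (assumption): pretend the decoder works down to 320 mV.
def fake_chip(v_mv: int) -> bool:
    return v_mv >= 320


print(find_min_supply_mv(fake_chip))
```

Voltages are kept as integer millivolts so that repeated subtraction of the step does not accumulate floating-point error. Repeating the sweep for each throughput setting would yield one (throughput, minimum voltage) pair per run.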

The measured throughputs from 5 kbps up to 2 Mbps correspond to supply voltages from 0.25 V to 0.52 V. The corresponding power consumption spans from  $0.10 \,\mu\text{W}$  to  $25 \,\mu\text{W}$ . The minimum energy dissipation at room temperature, 9 pJ/b, is reached at 0.32 V for a throughput of 125 kbps. The maximum measured throughput, however, is 20 Mbps, which is reached at the nominal voltage of 1.2 V.



Figure 10.1.: Measured low voltage operational limits of the digital decoder; (a) total dissipated energy vs. supply voltage, (b) total dissipated energy vs. maximum throughput, (c) operational clock frequency vs. supply voltage, and (d) total consumed power vs. supply voltage.

## Part IV
## 11

### Analog versus Digital: Analysis of the Results

#### 11.1. ANALOG AND DIGITAL: AN ANALYSIS

Figure 11.1 shows the BER performance of AD1, which offers the best coding gain among the three analog decoding cores, together with the performance of the digital decoder for two test-case throughputs, 125 kbps and 2 Mbps. The figure also includes the theoretical performance of the decoder with a long BL=100, the performance of the short BL=14 used for the implementations in this work, and the expected performance of a similar communication system without coding. At the lower throughput of 125 kbps, the digital decoder offers the desired BER performance with only  $1.2 \,\mu\text{W}$  power consumption. As presented earlier, this is achieved at 0.32 V, which corresponds to the minimum energy dissipation point of the digital decoder. At this rate, the performance of the analog implementation is degraded by 0.6 dB compared to that of its digital counterpart, while its power consumption is also higher. At 2 Mbps, AD1 can function as an error-correcting block while consuming only  $15.6 \,\mu\text{W}$ , even though the coding gain is not at its maximum. At 2 Mbps, the minimum power required by the digital circuit is about double that, i.e.  $32.4 \,\mu\text{W}$  at 0.52 V. Below this power level, the digital implementation is nonfunctional at this rate. While the proposed digital decoder offers a superior maximum coding gain of 2.9 dB at BER= $10^{-3}$ , the analog alternative offers full control over power consumption in a trade-off with the coding gain.
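The power and throughput figures quoted above can be related to energy per decoded bit through the usual ratio of power to throughput. The short Python sketch below only reworks the numbers stated in the text; the function name and the dictionary labels are illustrative assumptions.

```python
def energy_per_bit_pj(power_uw: float, throughput_kbps: float) -> float:
    """Energy per decoded bit in pJ/b, from power in microwatts and throughput in kb/s."""
    # 1 uW / (1 kb/s) = 1e-6 W / 1e3 b/s = 1e-9 J/b = 1000 pJ/b
    return power_uw / throughput_kbps * 1000.0


# (power in uW, throughput in kb/s), taken from the text above
cases = {
    "digital decoder, 125 kbps": (1.2, 125.0),
    "digital decoder, 2 Mbps": (32.4, 2000.0),
    "AD1, 2 Mbps": (15.6, 2000.0),
}

for name, (power_uw, rate_kbps) in cases.items():
    print(f"{name}: {energy_per_bit_pj(power_uw, rate_kbps):.1f} pJ/b")
```

The two digital results come out close to the 9 pJ/b and 16 pJ/b reported for the digital decoder; the small difference at 125 kbps presumably reflects rounding of the quoted power value. The AD1 result falls below the digital figure at 2 Mbps, in line with the statement that AD1 needs about half the power at that rate.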

AD1 has an area comparable to that of the DDC and a higher processing speed for the same power budget, but shows degraded BER performance. The degradation comes partly from the non-ideal analog multipliers, which have a limited dynamic range of operation, and partly from the combination of circuit effects such as mismatch errors, noise and process variations. The performance improvement trend from AD3 through AD2 to AD1 suggests that increasing the transistor sizes even further may improve the BER performance. In that case, however, while still providing higher throughput at lower power, the decoding core area would become larger than that of the DDC circuitry.

Table 11.1.: Decoder comparison.

| Ref. | Year | Implementation | Code | CMOS technology | Core area [mm<sup>2</sup>] | Power [mW] | Throughput [Mb/s] | Energy [pJ/b] |
|------|------|----------------|------|-----------------|------------------|------------|-------------------|---------------|
| [6] | 2013 | Analog | (120,75)<br>TS-LDPC | 90 nm<br>1.2, 1.1, 0.85 V | 0.72* | 13 | 750 | 17 |
| [25] | 2011 | Analog | (32,8)<br>LDPC | 0.5 µm<br>3.3 V | 0.091* | 1.2 | 13 | 86 |
| [85] | 2006 | Analog | (8,4) Hamming<br>Trellis Graph<br>Factor Graph | 0.18 µm<br>1.8 V | 0.002<br>0.02<br>(0.00026)* | 0.15<br>0.807 | 3.7<br>3.7 | 40<br>220 |
| [31] | 2006 | Analog | (32,8)<br>LDPC | 0.18 µm<br>1.8 V | 0.57<br>(0.07)* | 5<br>(chip) | 6 | 830 |
| [60] | 2013 | Digital | non-binary<br>LDPC | 65 nm<br>0.675 V | 7.04 | 726 | 656 | 1100 |
| [88] | 2012 | Digital | LDPC | 65 nm<br>1 V | 1.56 | 361<br>450 | 6,620 | 54.6 |
| [44] | 2012 | Digital | non-binary<br>LDPC | 90 nm<br>1.2 V | 1.17* | 211 | 22.8 | 9254 |
| [38] | 2010 | Digital | convolutional<br>Log-MAP Turbo | 90 nm | 2.6* | 265 | 930 | 285 |
| [69] | 2009 | Digital | LDPC | 65 nm | 1.2 | 180 | 415 | 433 |
| This work | | Analog | (7,5) tail-biting<br>BCJR | 65 nm<br>0.8 V | 0.10<br>0.038<br>0.015 | 0.010–0.044<br>(different gains) | 2 | 5–22 |
| This work | | Digital | (7,5) tail-biting<br>max-log-MAP | 65 nm<br>0.25–0.52 V | 0.11 | 0.032<br>0.001 | 2.0<br>0.125 | 16<br>9 |

\* Normalized to 65 nm by  $\text{Area}_{\text{normalized}} = \text{Area} \times \left(\frac{65\,\text{nm}}{\text{tech}}\right)^2$ 

Figure 11.1.: BER performance comparison of the analog and digital decoders at (a) 125 kbps and (b) 2 Mbps.

Table 11.1 summarizes the specifications of the presented decoders together with those of previously published analog and digital decoders. Only results from chip measurements are included in the table. Since decoders are usually designed for different applications, energy efficiency in pJ/b is used in Table 11.1 as a rough indicator for comparing the power efficiency of the decoders. However, it makes a difference whether, for example, the design emphasis is on higher data throughput or on lower power consumption.
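The footnote of Table 11.1 normalizes core areas to 65 nm by scaling with the square of the feature-size ratio. As a minimal sketch of that rule, assuming the function and parameter names below (they are illustrative and not from the original work), the scaling can be written as:

```python
def normalize_area_mm2(area_mm2: float, tech_nm: float, target_nm: float = 65.0) -> float:
    """Scale a core area to the target node using Area * (target / tech)^2."""
    return area_mm2 * (target_nm / tech_nm) ** 2


# Two areas reported for 0.18 um (180 nm) designs in Table 11.1
print(round(normalize_area_mm2(0.57, 180.0), 3))
print(round(normalize_area_mm2(0.002, 180.0), 5))
```

The two results are consistent with the parenthesised normalized values, 0.07 and 0.00026 mm<sup>2</sup>, that accompany the 0.18 µm entries in the table. This quadratic scaling is only a first-order estimate, since analog circuits generally do not shrink as predictably with technology as digital logic.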



Figure 11.2.: Evolution of measured normalized energy per decoded bit over technology generations for analog [6][25][31][75][79][82][85] and digital [19][38][44][60][69][88][92] decoders.

The decoders presented in this work have the lowest energy per decoded bit reported so far. However, the diversity of the measured decoder chips, as can be seen in the table, makes a direct comparative analysis almost impossible. Figure 11.2 illustrates the trend of energy efficiency improvement of the measured decoders reported in the literature. Both analog and digital implementations, with a variety of codes, complexities and decoding algorithms, are included in the figure. While it is hard to draw a firm conclusion given the variety of the decoders and the technologies they are fabricated in, the trend, together with the results provided in this work, suggests an energy efficiency meeting point between analog and digital implementation approaches at about 9 pJ/b.

# 12

#### Summary

Considering wireless bio-implants and wearable devices, an exploration of hardware mapping alternatives for ultra low power convolutional decoding circuits was pursued. Two main approaches were investigated. The first approach used analog circuitry in the weak inversion region to perform the decoding with low power consumption. The second approach used low power digital techniques to implement an equivalent digital decoder. For the sake of a complete investigation, four designs on three separate chips were fabricated: three versions of the analog decoder and a fully digital design. To push the analog decoder towards a smaller silicon area, the three versions were fabricated with different transistor sizes. While the main focus has been on low power and energy efficiency, other important specifications such as silicon area, throughput, BER performance, coding gain and temperature variations were also studied based on the measured chips. The measurement results provide silicon area versus coding gain trade-offs. The analog decoders operate with a 0.8 V supply. Their best achieved coding gain is 2.3 dB at a bit error rate (BER) of 0.001, and an energy efficiency of 10 picojoules per bit (pJ/b) is reached at 2 Mbps. The presented digital decoder chip has a minimum energy dissipation of 9 pJ/b and a coding gain of 2.9 dB at BER=0.001. The analog decoder chips can dissipate even less than 9 pJ/b while processing faster, though at the expense of coding gain. Even though it is hard to define a minimum energy point for the presented analog decoders, the energy curves presented in the previous section suggest that in 65 nm CMOS analog and sub- $V_{\rm T}$  digital decoders are roughly comparable in terms of energy efficiency. It is important to note that, while computations in the analog domain are much faster, the digital alternative decoder performs better at the target throughput of 125 kb/s. The main reason is the flexibility of the digital approach, which allows the use of a simplified decoding algorithm, i.e. the max-log-MAP algorithm instead of the original BCJR algorithm. Implementing the max-log-MAP algorithm with high accuracy computations in analog circuitry, by contrast, might be a challenging task.
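The simplification that max-log-MAP brings over a log-domain BCJR computation can be illustrated with the operation that combines two log-domain metrics. The exact form is the Jacobian logarithm, log(e^a + e^b) = max(a, b) + log(1 + e^(-|a-b|)), and the max-log approximation drops the correction term. The Python sketch below is a generic illustration of this well-known approximation under assumed function names; it is not a model of the datapath of the fabricated decoder.

```python
import math


def max_star(a: float, b: float) -> float:
    """Exact Jacobian logarithm: log(exp(a) + exp(b))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: float, b: float) -> float:
    """max-log approximation: keep the larger metric, drop the correction term."""
    return max(a, b)


for a, b in [(0.0, 0.0), (2.0, 0.0), (5.0, -1.0)]:
    exact = max_star(a, b)
    approx = max_log(a, b)
    print(f"a={a:+.1f}  b={b:+.1f}  exact={exact:.4f}  "
          f"max-log={approx:.4f}  error={exact - approx:.4f}")
```

The approximation error is largest, ln 2, when the two metrics are equal and shrinks quickly as they move apart. In digital hardware the approximate form needs only a comparison and a selection, which is one reason it is attractive for low power implementations.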

Based on the results presented, decoders based on analog processing, while offering higher processing speed at low power profiles, lose performance more significantly than the more robust digital decoder.

Based on the measured specifications of all analog and digital designs at the targeted coded throughput for the UPD, 125 kb/s, the digital design was finally chosen for integration with the digital baseband circuitry. A complete layout of the final UPD implementation on silicon in 65 nm CMOS can be seen in Appendix A.

#### References

- [1] G. AMAT, G. MONTORSI, A. NEVIANI, AND A. XOTTA, »An analog decoder for concatenated magnetic recording schemes,« in *IEEE International Conference on Communications, ICC*, vol. 3, 2002, pp. 1563–1568.
- [2] J. B. ANDERSON, »Best short rate 1/2 tailbiting codes for the bit-error rate criterion,« *IEEE Transactions on Communications*, vol. 48, no. 4, pp. 597–610, Apr 2000.
- [3] ANT. [Online]. Available: http://www.thisisant.com/
- [4] J. B. ANDERSON AND E. OFFER, »Reduced-state sequence detection with convolutional codes,« *IEEE Transactions on Information Theory*, vol. 40, no. 3, pp. 965–972, May 1994.
- [5] O. C. AKGUN, J. N. RODRIGUES, Y. LEBLEBICI, AND V. ÖWALL, »High-level energy estimation in the sub-VT domain: Simulation and measurement of a cardiac event detector,« *IEEE Trans. Biomed. Circuits and Syst.*, vol. 6, no. 1, pp. 15–27, 2012.
- [6] A. ABOLFAZLI, Y. R. SHAYAN, AND G. E. COWAN, »750Mb/s 17pJ/b 90nm CMOS (120,75) TS-LDPC min-sum based analog decoder,« in *IEEE Asian Solid-State Circuits Conference*, Singapore, 2013, pp. 181–184.

- [7] G. AMAT, D. VOGRIG, S. BENEDETTO, G. MONTORSI, A. NEVIANI, AND A. GEROSA, »Cth08-3: Reconfigurable analog decoder for a serially concatenated convolutional code,« in *IEEE Global Telecommunications Conference, GLOBECOM*, Nov 2006, pp. 1–6.
- [8] L. BAHL, J. COCKE, F. JELINEK, AND J. RAVIV, »Optimal decoding of linear codes for minimizing symbol error rate,« *IEEE Trans. Information Theory*, vol. 20, no. 2, pp. 284–287, Mar. 1974.
- [9] Bluetooth smart. [Online]. Available: http://www.bluetooth.com/Pages/Bluetooth-Smart.aspx
- [10] J. D. BOECK, »Game-changing opportunities for wireless personal healthcare and lifestyle,« in *IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, San Francisco, CA*, Feb 2011, pp. 15–21.
- [11] C. BRYANT AND H. SJÖLAND, »A 2.45GHz ultra-low power quadrature front-end in 65nm CMOS,« in IEEE Radio Frequency Integrated Circuits Symposium (RFIC), Montreal, Canada, June 2012, pp. 247–250.
- [12] J. A. CROON et al., »Matching properties of deep sub-micron MOS transistors,« The Springer International Series in Engineering and Computer Science, vol. 851, 2005.
- [13] L. CHANG *et al.*, »Practical strategies for power-efficient computing technologies,« *Proceedings of the IEEE*, vol. 98, no. 2, pp. 215–236, Feb 2010.
- [14] C. BERROU, A. GLAVIEUX, AND P. THITIMAJSHIMA, »Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1,« in *IEEE International Conference on Communications, ICC '93, Geneva*, vol. 2, May 1993, pp. 1064–1070.
- [15] G. CLARK AND J. CAIN, Error-Correction Coding for Digital Communications. Plenum Press, New York, NY, 1981.
- [16] A. CALDERBANK, G. FORNEY, AND A. VARDY, »Minimal tail-biting trellises: the Golay code and more,« in *Proceedings of IEEE International Symposium on Information Theory*, Aug 1998, pp. 255–.

- [17] R. CHANDRA AND A. JOHANSSON, »A link loss model for the on-body propagation channel for binaural hearing aids,« *IEEE Trans. Antennas and Propagation*, vol. 61, no. 12, pp. 6180–6190, Dec 2013.
- [18] A. CHANDRAKASAN, S. SHENG, AND R. BRODERSEN, »Low-power CMOS digital design,« *IEEE Journal of Solid-State Circuits*, vol. 27, no. 4, pp. 473–484, Apr 1992.
- [19] A. DARABIHA, A. CARUSONE, AND F. KSCHISCHANG, »Power reduction techniques for LDPC decoders,« *IEEE J. Solid-State Circuits*, vol. 43, no. 8, pp. 1835–1845, 2008.
- [20] C. ENZ, N. SCOLARI, AND U. YODPRASIT, »Ultra low-power radio design for wireless sensor networks,« in IEEE International Workshop on Radio-Frequency Integration Technology: Integrated Circuits for Wideband Communication and Wireless Sensor Networks, Nov 2005, pp. 1–17.
- [21] C. C. ENZ AND E. A. VITTOZ, Charge-Based MOS Transistor Modeling: The EKV Model for Low-Power and RF IC Design. Wiley, West Sussex, England, 2006.
- [22] M. FREY *et al.*, »Two experimental analog decoders,« *Int. Analog VLSI Workshop, Bordeaux, France*, 2005.
- [23] S. FISHER et al., »Digital subthreshold logic design motivation and challenges,« in IEEE 25th Convention of Electrical and Electronics Engineers, Dec 2008, pp. 702–706.
- [24] R. GALLAGER, Low Density Parity Check Codes. MIT Press, Cambridge, MA, 1963.
- [25] M. GU AND S. CHAKRABARTTY, »A 100 pJ/bit, (32,8) CMOS analog low-density parity-check decoder based on margin propagation,« *IEEE J. Solid-State Circuits*, vol. 46, no. 6, pp. 1433–1442, 2011.
- [26] V. GAUDET AND P. GULAK, »A 13.3-Mb/s 0.35-μm CMOS analog turbo decoder IC with a configurable interleaver,« *IEEE J. Solid-State Circuits*, vol. 38, no. 11, pp. 2010–2015, 2003.
- [27] B. GILBERT, »A precise four-quadrant multiplier with subnanosecond response,« IEEE J. Solid-State Circuits, 1968.

- [28] A. GOLDSMITH, Digital Communications. Cambridge University Press, New York, NY, 2005.
- [29] S. GUPTA, A. RAYCHOWDHURY, AND K. ROY, »Digital computation in subthreshold region for ultralow-power operation: A device-circuit-architecture codesign perspective,« *Proceedings of the IEEE*, vol. 98, no. 2, pp. 160–190, Feb 2010.
- [30] P. GROVER, K. WOYACH, AND A. SAHAI, »Towards a communication-theoretic understanding of system-level power consumption,« *IEEE Journal on Selected Areas in Communications*, vol. 29, no. 8, pp. 1744–1755, Sep 2011.
- [31] S. HEMATI, A. BANIHASHEMI, AND C. PLETT, »A 0.18-um CMOS analog Min-Sum iterative decoder for a (32,8) low-density parity-check (LDPC) code,« *IEEE J. Solid-State Circuits*, vol. 41, no. 11, pp. 2531–2540, 2006.
- [32] J. HAGENAUER, E. OFFER, C. MEASSON, AND M. MÖRZ, »Decoding and equalization with analog non-linear networks,« *European Transactions on Telecommunications*, vol. 10, pp. 659–680, Oct 1999.
- [33] J. HAGENAUER AND M. WINKLHOFER, »The analog decoder,« in *IEEE International Symposium on Information Theory, ISIT*, Cambridge, MA, August 16–21, 1998, pp. 145–.
- [34] J. HAGENAUER AND M. WINKLHOFER, »The analog decoder,« *IEEE Int. Symp. Inf. Theory*, Cambridge, MA., 1998.
- [35] IEEE. IEEE 802.15 working group. [Online]. Available: http://www.ieee802.org/15/
- [36] D. JOHNS AND K. MARTIN, Analog Integrated Circuit Design. Wiley, USA, 1997.
- [37] R. JOHANNESSON AND K. ZIGANGIROV, Fundamentals of convolutional coding. IEEE Press, 1999.
- [38] S. M. KARIM AND I. CHAKRABARTI, »An improved low-power high-throughput log-MAP turbo decoder,« *IEEE Trans. Consum. Electron.*, vol. 56, no. 2, pp. 450–457, 2010.

- [39] P. R. KINGET, »Device mismatch and tradeoffs in the design of analog circuits,« *IEEE J. Solid-State Circuits*, vol. 40, no. 6, pp. 1212–1224, 2005.
- [40] U. KARPUZCU, N. KIM, AND J. TORRELLAS, »Coping with parametric variation at near-threshold voltages,« *IEEE Micro*, vol. 33, no. 4, pp. 6–14, July 2013.
- [41] R. KÖTTER AND A. VARDY, »The structure of tail-biting trellises: minimality and basic principles,« *IEEE Transactions on Information Theory*, vol. 49, no. 9, pp. 2081–2105, Sep 2003.
- [42] H. A. LOELIGER et al., »Iterative sum-product decoding with analog VLSI,« Int. Symp. Inf. Theory, Cambridge, MA., 1998.
- [43] S. LIN AND D. COSTELLO, Error Control Coding. Prentice Hall, Upper Saddle River, NJ, 2004.
- [44] C.-L. LIN, C.-L. CHEN, H.-C. CHANG, AND C.-Y. LEE, »A (50,2,4) nonbinary LDPC convolutional code decoder chip over GF(256) in 90nm CMOS,« in *IEEE Asian Solid State Circuits Conf.*, Kobe, 2012, pp. 201–204.
- [45] H. A. LOELIGER, M. HELFENSTEIN, F. LUSTENBERGER, AND F. TARKOY, »Probability propagation and decoding in analog VLSI,« in *IEEE International Symposium on Information Theory, ISIT*, Cambridge, MA, August 16–21, 1998, pp. 146–.
- [46] F. LUSTENBERGER, M. HELFENSTEIN, G. S. MOSCHYTZ, H.-A. LOELIGER, AND F. TARKOY, »All analog decoder for a binary (18,9,5) tail biting trellis code,« in *European Solid-State Circuits Conference, ESSCIRC*, Duisburg, September 21–23, 1999, pp. 362–365.
- [47] H. A. LOELIGER, »Analog decoding and beyond,« in IEEE Information Theory Workshop, 2001, pp. 126–127.
- [48] T. H. MORSHED et al. (2009) BSIM 4.6.4 MOSFET model. [Online]. Available: http://www-device.eecs.berkeley.edu/bsim/Files/BSIM4/BSIM464/BSIM464\_Manual.pdf

- [49] N. MAZLOUM AND O. EDFORS, »DCW-MAC: An energy efficient medium access scheme using duty-cycled low-power wake-up receivers,« in *IEEE Vehicular Technology Conference (VTC Fall)*, Sept 2011, pp. 1–5.
- [50] C. A. MEAD, Analog VLSI and Neural Systems. Addison-Wesley, 1989.
- [51] M. MOERZ, T. GABARA, R. YAN, AND J. HAGENAUER, »An analog 0.25 μm BiCMOS tailbiting map decoder,« in *IEEE International Solid-State Circuits Conference, ISSCC*, San Francisco, February 9 – 9, 2000, pp. 356 – 357.
- [52] D. MARKOVIC, C. WANG, L. ALARCON, T. LIU, AND J. RABAEY, »Ultralow-power design in near-threshold region,« *Proceedings of the IEEE*, vol. 98, no. 2, pp. 237–252, Feb 2010.
- [53] N. NGUYEN, C. WINSTEAD, V. C. GAUDET, AND C. SCHLEGEL, »A 0.8V CMOS analog decoder for an (8,4,4) extended Hamming code,« in *IEEE International Symposium on Circuits and Systems, ISCAS*, Vancouver, May 23–26, 2004, pp. I–1116 – 19.
- [54] C. PARK et al., »Reversal of temperature dependence of integrated circuits operating at very low voltages,« in *International Electron Devices Meeting*, 1995. *IEDM '95*, Washington, DC, December 23 – 26, 1995, pp. 71 – 74.
- [55] K. PARHI, VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, New York, NY, 1999.
- [56] Y. S. PARK, D. BLAAUW, D. SYLVESTER, AND Z. ZHANG, »A 1.6-mm2 38-mw 1.5-Gb/s LDPC decoder enabled by refresh-free embedded DRAM,« in *Symp. VLSI Circuits*, 2012, pp. 114–115.
- [57] M. PERENZONI, A. GEROSA, AND A. NEVIANI, »Analog CMOS implementation of Gallager's iterative decoding algorithm applied to a block turbo code,« in *International Symposium on Circuits and Systems, ISCAS*, vol. 5, May 2003, pp. 813–816.
- [58] C. PIGUET, *Low-Power Electronics Design*. CRC Press, Boca Raton, FL, 2004.

- [59] J. PROAKIS AND M. SALEHI, Digital Communications, 5th Edition. McGraw-Hill Education, New York, NY, 2007.
- [60] Y. S. PARK, Y. TAO, AND Z. ZHANG, »A 1.15Gb/s fully parallel nonbinary LDPC decoder with fine-grained dynamic clock gating,« in *IEEE Int. Solid-State Circuits Conf., Dig. Tech. Papers*, San Francisco, CA., 2013, pp. 422–423.
- [61] D. RADJEN, P. ANDREANI, M. ANDERSON, AND L. SUNDSTROM, »A continuous time  $\Delta\Sigma$  modulator with reduced clock jitter sensitivity through DSCR feedback,« in *NORCHIP*, 2011, Nov 2011, pp. 1–4.
- [62] J. RABAEY, Low Power Design Essentials. Springer, Berkeley, CA, 2009.
- [63] D. RADJEN, »Continuous-time delta-sigma modulators for ultralow-power radios,« PhD Thesis, Lund University, Lund, Sep 2014.
- [64] J. RABAEY, A. CHANDRAKASAN, AND B. NIKOLIC, Digital integrated circuits: a design perspective. Prentice Hall, Upper Saddle River, NJ, 2003.
- [65] S. SHERAZI et al., »A 100-fJ/cycle sub-VT decimation filter chain in 65 nm CMOS,« in IEEE International Conference on Electronics, Circuits and Systems, ICECS, 2012, pp. 448–451.
- [66] S. SHERAZI et al., »Ultra Low Energy Design Exploration of Digital Decimation Filters in 65 nm Dual-VT CMOS in the Sub-VT Domain,« *Microprocessors and Microsystems*, 2013. [Online]. Available: http://dx.doi.org/10.1016/j.micpro.2012.04.002
- [67] P. STAHL, J. ANDERSON, AND R. JOHANNESSON, »New tailbiting encoders,« in *Information Theory, 1998. Proceedings. 1998 IEEE International Symposium on*, Aug 1998, pp. 389–.
- [68] P. STAHL, J. ANDERSON, AND R. JOHANNESSON, »Optimal and near-optimal encoders for short and moderate-length tail-biting trellises,« *IEEE Transactions on Information Theory*, vol. 45, no. 7, pp. 2562–2571, Nov 1999.

- [69] Y. SUN, J. CAVALLARO, AND T. LY, »Scalable and low power LDPC decoder design using high level algorithmic synthesis,« in *IEEE Int. Syst.-On-Chip Conf.*, Belfast, 2009, pp. 267–270.
- [70] S. Y. SHERAZI, »Design space exploration of digital circuits for ultra-low energy dissipation,« PhD Thesis, Lund University, Lund, Jan 2014.
- [71] N. SADEGHI, S. HOWARD, S. KASNAVI, K. INIEWSKI, V. GAUDET, AND C. SCHLEGEL, »Analysis of error control code use in ultralow-power wireless sensor networks,« in *IEEE International Symposium on Circuits and Systems, ISCAS*, Island of Kos, May 21 – 24, 2006, pp. 3558 – 3561.
- [72] S. SHERAZI, P. NILSSON, O. AKGUN, H. SJÖLAND, AND J. RODRIGUES, »Design exploration of a 65 nm sub-VT CMOS digital decimation filter chain,« in *IEEE International Symposium on Circuits and Systems (ISCAS)*, May 2011, pp. 837–840.
- [73] S. SOLDA, D. VOGRIG, A. BEVILACQUA, A. GEROSA, AND A. NEVIANI, »Analog decoding of trellis coded modulation for multilevel flash memories,« in *IEEE International Symposium on Circuits and Systems, ISCAS*, May 2008, pp. 744–747.
- [74] P. SWEENEY, Error Control Coding: From Theory to Practice. Wiley, West Sussex, England, 2002.
- [75] B. TOMATSOPOULOS AND A. DEMOSTHENOUS, »A CMOS hard-decision analog convolutional decoder employing the MFDA for low-power applications,« *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 55, no. 9, pp. 2912–2923, 2008.
- [76] A. TAJALLI AND Y. LEBLEBICI, Extreme low-power mixed signal IC design: subthreshold source-coupled circuits. Springer, 2010, ch. 2.
- [77] M. VIDOJKOVIC et al., »A 2.4GHz ULP OOK single-chip transceiver for healthcare applications,« in IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb 2011, pp. 458–460.
- [78] P. VAN DER MEER, Low-Power Deep Sub-Micron CMOS Logic. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2006.

- [79] D. VOGRIG, A. GEROSA, A. NEVIANI, A. GRAELL I AMAT, G. MONTORSI, AND S. BENEDETTO, »A 0.35-μm CMOS analog turbo decoder for the 40-bit rate 1/3 UMTS channel code,« *IEEE J. Solid-State Circuits*, vol. 40, no. 3, pp. 753–762, 2005.
- [80] A. WANG, B. CALHOUN, AND A. CHANDRAKASAN, Sub-Threshold Design for Ultra Low-Power Systems. Springer, Cambridge, MA, 2006.
- [81] J. WANG, Y. CAO, M. CHEN, J. SUN, AND A. MITEV, »Capturing device mismatch in analog and mixed-signal designs,« *IEEE Circuits and Systems Magazine*, vol. 8, no. 4, pp. 37–44, Apr 2008.
- [82] C. WINSTEAD, J. DAI, S. YU, C. MYERS, R. HARRISON, AND C. SCHLEGEL, »CMOS analog MAP decoder for (8,4) Hamming code,« *IEEE J. Solid-State Circuits*, vol. 39, no. 1, pp. 122–131, 2004.
- [83] C. WINSTEAD, »Analog iterative error control decoders,« PhD Thesis, University of Alberta, Edmonton, 2005.
- [84] N. WIBERG, H. LOELIGER, AND R. KÖTTER, »Codes and iterative decoding on general graphs,« *European Transactions on Telecommunications*, vol. 6, pp. 513–526, Sep 1995.
- [85] C. WINSTEAD, N. NGUYEN, V. C. GAUDET, AND C. SCHLEGEL, »Low-voltage CMOS circuits for analog iterative decoders,« vol. 53, no. 4, pp. 829–841, April 2006.
- [86] C. WINSTEAD AND J. N. RODRIGUES, »Ultra-low-power error correction circuits: Technology scaling and Sub-VT operation,« *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 59, no. 12, pp. 913–917, 2012.
- [87] A. XOTTA, D. VOGRIG, A. GEROSA, A. NEVIANI, G. AMAT et al., »An all-analog CMOS implementation of a turbo decoder for hard-disk drive read channels,« in *IEEE International Symposium on Circuits and Systems, ISCAS*, vol. 5, 2002, pp. 69–72.
- [88] S.-W. YEN, S.-Y. HUNG, C.-L. CHEN, H.-C. CHANG, S.-J. JOU, AND C.-Y. LEE, »A 5.79-Gb/s energy-efficient multirate LDPC codec chip for IEEE 802.15.3c applications,« IEEE J. Solid-State Circuits, vol. 47, no. 9, pp. 2246–2257, 2012.

- [89] K.-S. YEO AND K. ROY, Low-Voltage, Low-Power VLSI Subsystems. McGraw-Hill, 2005.
- [90] S. Yu, »Design and test of error control decoders in analog cmos,« PhD Thesis, The University of Utah, Utah, Dec 2003.
- [91] M. ZARGHAM *et al.*, »Scaling of analog LDPC decoders in sub-100nm CMOS processes,« *Integration, the VLSI Journal*, vol. 43, no. 4, pp. 365–377, 2010.
- [92] Z. ZHANG, V. ANANTHARAM, M. WAINWRIGHT, AND B. NIKOLIC, »An efficient 10GBASE-T ethernet LDPC decoder design with low error floors,« *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 843–855, 2010.
- [93] ZigBee alliance. [Online]. Available: http://www.zigbee.org/

### Appendix A



Complete layout of the implemented UPD Radio in 65 nm CMOS. The design is pad limited, and the total area including pads is 1 mm × 2 mm.

### Appendix B



Figure .1.: PCB and setup for measuring the analog decoding chip, AD1.



Figure .2.: PCB and setup for measuring the analog decoding chips, AD2 and AD3.



Figure .3.: PCB and setup for measuring the digital decoding chips.



Figure .4.: Measurement instruments and the environment chamber.