Supervisor: Viktor Öwall
Co-supervisors: Joachim Rodrigues, Per Andersson
Financial support: VINNOVA Industrial Excellence Center
Reconfigurable computing is an emerging trend in digital signal processing. To achieve high performance at a feasible hardware cost, a reconfigurable architecture must balance efficiency, flexibility and programmability. A dynamically reconfigurable architecture not only enables hardware reuse across multiple designs, which allows rapid system development, but also lets different function implementations share resources at system run-time by dynamically reconfiguring the hardware platform. Coarse-grained reconfigurable architectures (CGRAs) are arrays constructed from building blocks that range in size from arithmetic logic units to full-scale processors. Conventional fine-grained reconfigurable architectures, such as FPGAs, require bit-level manipulation in system design and configuration. CGRAs instead trade mapping flexibility for shorter configuration time, lower routing area overhead and higher performance through optimized word-level data processing.
Our proposed CGRA is built from heterogeneous resource cells that contain decoupled processing and memory elements. Data communication between network cells is carried out via a combination of local interconnections with dedicated wires and a global hierarchical routing network. All resource cells are parameterizable at system compile-time and can be dynamically reconfigured to support run-time application mappings. To meet different computational demands, different functional blocks, comprising both application-tuned and general-purpose processors, can be placed into the cell array by encapsulating the modules with network adapters at system elaboration time.
Fig. 1. Block diagram of a processor cell (PC) and a memory cell (MC) in the proposed CGRA.
(1) 32~2048-point dynamically reconfigurable radix-2^2 FFT processor
To explore the operating performance of our proposed CGRA, a commonly used benchmark algorithm, the fast Fourier transform (FFT), has been mapped onto the reconfigurable cell array. Typical applications of the FFT are spectrum analysis, linear filtering, and orthogonal frequency division multiplexing (OFDM), which is used in, among others, the fourth-generation radio technology 3GPP long term evolution (LTE) and the wireless network standard IEEE 802.11n.
A basic radix-2^2 FFT building block consists of two radix-2 butterfly units separated by a trivial multiplication and followed by a complex multiplier, as shown in Fig. 2 (b). This can be directly mapped onto the CGRA by using a 2x4 cell array, consisting of four processing cells and four memory cells. The 32-bit DSP cores are used for the butterfly (BF) operations as well as the trivial multiplication, while a CORDIC cell emulates the complex multiplication using vector rotation. The memory cells associated with the processor and CORDIC cells are used as single-path delay feedback (SDF) buffers and twiddle factor ROMs.
Fig. 2. A basic radix-2^2 FFT building block and its mapping on the CGRA.
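The arithmetic of one radix-2^2 building block (butterfly, trivial multiplication by -j, butterfly, twiddle multiplication) can be illustrated in software. The following is a minimal sketch in plain Python, assuming a recursive decimation-in-frequency formulation with floating-point arithmetic; the function name `fft_radix22` is hypothetical, and the code does not model the SDF buffers, the time multiplexing, or the cell mapping of the actual design.

```python
import cmath


def fft_radix22(x):
    """Recursive radix-2^2 decimation-in-frequency FFT (illustrative sketch).

    Accepts any power-of-two length; a plain radix-2 butterfly finishes
    lengths that are not a power of four (e.g. 32, 128, 512, 2048).
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]

    h, q = n // 2, n // 4

    # First butterfly (BF I): sum and difference of samples n/2 apart.
    a = [x[i] + x[i + h] for i in range(h)]
    b = [x[i] - x[i + h] for i in range(h)]

    # Trivial multiplication: rotate the upper half of the difference
    # branch by -j (a swap and a negation in hardware, no multiplier).
    for i in range(q, h):
        b[i] *= -1j

    # Second butterfly (BF II): sum and difference of samples n/4 apart.
    s0 = [a[i] + a[i + q] for i in range(q)]  # feeds outputs X[4r]
    s2 = [a[i] - a[i + q] for i in range(q)]  # feeds outputs X[4r + 2]
    s1 = [b[i] + b[i + q] for i in range(q)]  # feeds outputs X[4r + 1]
    s3 = [b[i] - b[i + q] for i in range(q)]  # feeds outputs X[4r + 3]

    # Complex (twiddle) multiplication, the only non-trivial multiplier.
    w = [cmath.exp(-2j * cmath.pi * i / n) for i in range(q)]

    out = [0j] * n
    out[0::4] = fft_radix22(s0)
    out[1::4] = fft_radix22([s1[i] * w[i] for i in range(q)])
    out[2::4] = fft_radix22([s2[i] * w[i] ** 2 for i in range(q)])
    out[3::4] = fft_radix22([s3[i] * w[i] ** 3 for i in range(q)])
    return out
```

Each recursion level corresponds to one pass through the building block, so comparing the result against a direct DFT for a short input is a quick way to check the stage structure.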
A time-multiplexed radix-2^2 FFT structure using one basic radix-2^2 FFT building block has been implemented on an FPGA platform. The performance evaluation shows that the reconfiguration code size of our cell array architecture is 8 times smaller than that of an ordinary DSP solution, and that the number of execution clock cycles is at least 3 times lower than for an ARM implementation.
Fig. 3. Benchmark comparisons.
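In the FFT mapping described above, a CORDIC cell stands in for the complex multiplier by rotating each sample through the twiddle angle. The sketch below shows the general rotation-mode CORDIC iteration in floating-point Python; the function name `cordic_rotate`, the iteration count, and the quadrant handling are assumptions made for illustration and are not taken from the actual cell design, which would use fixed-point shifts and adds.

```python
import math


def cordic_rotate(re, im, angle, iterations=16):
    """Rotate the vector (re, im) by `angle` radians using CORDIC.

    Equivalent to multiplying (re + j*im) by exp(j*angle), up to the
    residual angle error left after the last micro-rotation.
    """
    # Wrap the angle into [-pi, pi], then fold it into [-pi/2, pi/2]
    # by a 180 degree pre-rotation, which is a simple negation.
    angle = math.remainder(angle, 2.0 * math.pi)
    if angle > math.pi / 2:
        re, im, angle = -re, -im, angle - math.pi
    elif angle < -math.pi / 2:
        re, im, angle = -re, -im, angle + math.pi

    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i                  # a right shift by i bits in hardware
        d = 1.0 if angle >= 0 else -1.0   # rotate towards zero residual angle
        re, im = re - d * im * step, im + d * re * step
        angle -= d * math.atan(step)      # atan values come from a small table
        gain *= math.sqrt(1.0 + step * step)

    # Each micro-rotation stretches the vector; compensate once at the end.
    return re / gain, im / gain
```

For a twiddle factor W = exp(-j*2*pi*k/N), the product x*W corresponds to `cordic_rotate(x.real, x.imag, -2 * math.pi * k / N)`. The accuracy grows by roughly one bit per iteration, which is the usual trade-off between latency and precision in this scheme.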
(2) Coarse synchronization in multi-standard OFDM systems
Coarse time and frequency estimation is performed during the (re)establishment of a data link between transmitter and receiver, which implies that the required computational units are active only for a small fraction of the total time the receiver is on. This motivates the use of reconfigurable computing in two respects. Firstly, in a multi-standard single-stream scenario, the same hardware may be reconfigured after OFDM acquisition to perform other baseband processing in succeeding stages, such as FFT, refined frequency offset estimation and tracking. Secondly, the underlying hardware resources may be shared in a multi-standard multi-stream environment and used for concurrent reception of data from multiple standards.
A coarse-grained dynamically reconfigurable cell array for coarse time synchronization and fractional frequency offset estimation in multiple OFDM standards has been developed. The radio standards under analysis are IEEE 802.11n, LTE, and DVB-H. The reconfigurable cell array, containing 2-by-2 cells, is capable of processing two concurrent data streams from the three standards. Dynamic reconfigurability of the architecture enables run-time switching between the standards. Furthermore, the reconfigurable cell array enables adaptive wordlength scheduling for data computation, which helps in balancing performance loss against resource utilization. The high system flexibility enables mapping of different algorithms and tasks onto the same platform without any additional hardware cost. In the conducted experiment, this flexibility is illustrated by mapping a novel sign-bit OFDM acquisition algorithm onto the presented dynamically reconfigurable architecture (DRA), which thereby supports all three OFDM transmission modes (2k, 4k, and 8k subcarriers) of the DVB-H standard (see Fig. 4).
Fig. 4. Supported multiple radio standards on a 2-by-2 cell array.
Two PCs in the reconfigurable cell array are configured as 16-bit SIMD cores (Fig. 5), suitable for handling correlations with 4-bit complex-valued inputs. The internal memory arrays in both MCs are 32 bits wide. Therefore, a pair of 16-bit in-phase and quadrature values may be stored in the same memory location in an MC. The memory capacity of the correlation and moving-sum FIFOs is configured to suffice for the standard with the largest storage requirement, i.e., DVB-H. Smaller FIFO sizes, when required, are obtained by dynamically reconfiguring the same MCs.
Fig. 5. Data path of a 16-bit customized RISC with support of SIMD like operations.
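The roles of the correlation FIFO and the moving-sum FIFO can be illustrated with a generic delayed-correlation estimator. The sketch below is not the sign-bit algorithm mapped onto the cell array, whose details are not given here; it assumes the common approach of correlating the received signal with a delayed copy of itself (for example, a cyclic prefix against the end of the OFDM symbol), accumulating the products in a sliding window, and reading timing from the magnitude peak and fractional frequency offset from the phase at that peak. The function name `coarse_sync` and its parameters are hypothetical, and the arithmetic is floating point rather than the reduced wordlengths used on the PCs.

```python
import cmath


def coarse_sync(r, delay, window):
    """Delayed-correlation coarse timing and frequency estimate (sketch).

    r      : received complex baseband samples
    delay  : distance between the repeated parts (e.g. the FFT size)
    window : length of the repeated part (e.g. the cyclic prefix length)

    Returns (index, freq): the window start with the largest correlation
    magnitude and the frequency offset in cycles per sample, which is
    unambiguous only for |freq| < 1 / (2 * delay).
    """
    n_out = len(r) - delay - window + 1
    if n_out < 1:
        raise ValueError("input too short for the given delay and window")

    # Correlation products; in hardware the delayed samples would come
    # from a FIFO of depth `delay`.
    c = [r[i].conjugate() * r[i + delay] for i in range(len(r) - delay)]

    # Moving sum over `window` products: add the newest, drop the oldest.
    acc = sum(c[:window])
    best_idx, best_val = 0, acc
    for d in range(1, n_out):
        acc += c[d + window - 1] - c[d - 1]
        if abs(acc) > abs(best_val):
            best_idx, best_val = d, acc

    # A frequency offset rotates the phase by 2*pi*freq*delay between
    # the two copies of the repeated part.
    freq = cmath.phase(best_val) / (2.0 * cmath.pi * delay)
    return best_idx, freq
```

In this formulation the storage need is set by `delay` and `window`, which is consistent with sizing the FIFOs for the standard with the longest symbols and shrinking them for the others.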
In addition to the task-level pipeline, autonomous FIFO operation in the MCs helps to hide memory operations from the PCs. Moreover, the micro-block function of the MCs eliminates data alignment operations in the PCs when handling truncated data pairs in the correlation FIFO (MC0), as illustrated in Fig. 6.
Fig. 6. Interleaved data storage in correlation FIFO (MC0). (a) Received 12-bit data pair in PC0. (b) Exploded view of data storage in MC0. Data pairs are truncated down to 4 bits in PC0. (c) Final data storage at address 'x'.
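The idea of interleaved storage can be made concrete with a small bit-packing sketch. The exact layout used in MC0 is not specified in the text, so the following assumes that each 12-bit in-phase/quadrature value is truncated to its four most significant bits, that one truncated pair occupies an 8-bit micro-block (I in the upper nibble, Q in the lower), and that four micro-blocks share one 32-bit memory word. All function names are hypothetical.

```python
def truncate_12_to_4(value):
    """Keep the four MSBs of a 12-bit two's-complement value."""
    if not -2048 <= value <= 2047:
        raise ValueError("value does not fit in 12 bits")
    return value >> 8  # arithmetic shift, result lies in [-8, 7]


def pack_pair(i4, q4):
    """Pack a 4-bit I/Q pair into one 8-bit micro-block (assumed layout)."""
    return ((i4 & 0xF) << 4) | (q4 & 0xF)


def write_micro_block(word, slot, block):
    """Replace micro-block `slot` (0..3) of a 32-bit word, leaving the rest."""
    shift = 8 * slot
    cleared = word & ~(0xFF << shift) & 0xFFFFFFFF
    return cleared | ((block & 0xFF) << shift)


def read_micro_block(word, slot):
    """Return the signed (I, Q) pair stored in micro-block `slot`."""
    block = (word >> (8 * slot)) & 0xFF
    i4, q4 = block >> 4, block & 0xF
    # Sign-extend each 4-bit nibble back to a Python integer.
    return (i4 ^ 0x8) - 0x8, (q4 ^ 0x8) - 0x8


def store_pairs(pairs):
    """Truncate four 12-bit pairs and interleave them into one 32-bit word."""
    word = 0
    for slot, (i12, q12) in enumerate(pairs[:4]):
        block = pack_pair(truncate_12_to_4(i12), truncate_12_to_4(q12))
        word = write_micro_block(word, slot, block)
    return word
```

Because `write_micro_block` updates only one 8-bit field, the writer never has to shift or merge neighbouring pairs itself, which is the kind of alignment work the micro-block function is described as removing from the PC.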
The 2-by-2 cell array has been synthesized using a 65 nm low-leakage standard cell CMOS library. As shown by the synthesis results in Fig. 7, the entire DRA including the interface controller occupies an area of 0.479 mm^2 and has a maximum clock frequency of 534 MHz. Evaluating only the mapping of the synchronization algorithms does not fully explore the potential of the cell array. Unlike a dedicated accelerator, which implements fixed functionality, the reconfigurable cell array may be dynamically reconfigured to perform different tasks, e.g. subsequent baseband processing such as FFT and refined frequency offset estimation.
Fig. 7. Synthesis results of the 2-by-2 cell array.
This work is carried out in cooperation with the EU FP7 Multibase project, which has Ericsson, Imec and Infineon as partners.