FPGA-based Implementation of Signal Processing Systems

Second Edition

Roger Woods

Queen’s University, Belfast, UK

John McAllister

Queen’s University, Belfast, UK

Gaye Lightbody

University of Ulster, UK

Ying Yi

SN Systems — Sony Interactive Entertainment, UK

This edition first published 2017
© 2017 John Wiley & Sons, Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Roger Woods, John McAllister, Gaye Lightbody and Ying Yi to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons, Ltd., The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Woods, Roger, 1963- author. | McAllister, John, 1979- author. |
   Lightbody, Gaye, author. | Yi, Ying (Electrical engineer), author.
Title: FPGA-based implementation of signal processing systems / Roger Woods,
   John McAllister, Gaye Lightbody, Ying Yi.
Description: Second edition. | Hoboken, NJ : John Wiley & Sons Inc., 2017. |
   Revised edition of: FPGA-based implementation of signal processing systems /
   Roger Woods … [et al.]. 2008. | Includes bibliographical references and index.
Identifiers: LCCN 2016051193 | ISBN 9781119077954 (cloth) | ISBN 9781119077978 (epdf) |
   ISBN 9781119077961 (epub)
Subjects: LCSH: Signal processing--Digital techniques. | Digital integrated
   circuits. | Field programmable gate arrays.
Classification: LCC TK5102.5 .F647 2017 | DDC 621.382/2--dc23 LC record available at
   https://lccn.loc.gov/2016051193

Cover Design: Wiley
Cover Image: © filo/Gettyimages;
(Graph) Courtesy of the authors

The book is dedicated by the main author to his wife, Pauline, for all her support and care, particularly over the past two years.

The support from staff from the Royal Victoria Hospital and Musgrave Park Hospital is greatly appreciated.

Preface
List of Abbreviations
1 Introduction to Field Programmable Gate Arrays
1. 1.1 Introduction
2. 1.2 Field Programmable Gate Arrays
3. 1.3 Influence of Programmability
4. 1.4 Challenges of FPGAs
5. Bibliography
2 DSP Basics
1. 2.1 Introduction
2. 2.2 Definition of DSP Systems
3. 2.3 DSP Transformations
4. 2.4 Filters
5. 2.5 Adaptive Filtering
6. 2.6 Final Comments
7. Bibliography
3 Arithmetic Basics
1. 3.1 Introduction
2. 3.2 Number Representations
3. 3.3 Arithmetic Operations
4. 3.4 Alternative Number Representations
5. 3.5 Division
6. 3.6 Square Root
7. 3.7 Fixed-Point versus Floating-Point
8. 3.8 Conclusions
9. Bibliography
4 Technology Review
1. 4.1 Introduction
2. 4.2 Implications of Technology Scaling
3. 4.3 Architecture and Programmability
4. 4.4 DSP Functionality Characteristics
5. 4.5 Microprocessors
6. 4.6 DSP Processors
7. 4.7 Graphical Processing Units
8. 4.8 System-on-Chip Solutions
9. 4.9 Heterogeneous Computing Platforms
10. 4.10 Conclusions
11. Bibliography
5 Current FPGA Technologies
1. 5.1 Introduction
2. 5.2 Toward FPGAs
3. 5.3 Altera Stratix® V and 10 FPGA Family
4. 5.4 Xilinx UltraScale™/Virtex-7 FPGA Families
5. 5.5 Xilinx Zynq FPGA Family
6. 5.6 Lattice iCE40isp FPGA Family
7. 5.7 MicroSemi RTG4 FPGA Family
8. 5.8 Design Strategies for FPGA-based DSP Systems
9. 5.9 Conclusions
10. Bibliography
6 Detailed FPGA Implementation Techniques
1. 6.1 Introduction
2. 6.2 FPGA Functionality
3. 6.3 Mapping to LUT-Based FPGA Technology
4. 6.4 Fixed-Coefficient DSP
5. 6.5 Distributed Arithmetic
6. 6.6 Reduced-Coefficient Multiplier
7. 6.7 Conclusions
8. Bibliography
7 Synthesis Tools for FPGAs
1. 7.1 Introduction
2. 7.2 High-Level Synthesis
3. 7.3 Xilinx Vivado
4. 7.4 Control Logic Extraction Phase Example
5. 7.5 Altera SDK for OpenCL
6. 7.6 Other HLS Tools
7. 7.7 Conclusions
8. Bibliography
8 Architecture Derivation for FPGA-based DSP Systems
1. 8.1 Introduction
2. 8.2 DSP Algorithm Characteristics
3. 8.3 DSP Algorithm Representations
4. 8.4 Pipelining DSP Systems
5. 8.5 Parallel Operation
6. 8.6 Conclusions
7. Bibliography
9 Complex DSP Core Design for FPGA
1. 9.1 Introduction
2. 9.2 Motivation for Design for Reuse
3. 9.3 Intellectual Property Cores
4. 9.4 Evolution of IP cores
5. 9.5 Parameterizable (Soft) IP Cores
6. 9.6 IP Core Integration
7. 9.7 Current FPGA-based IP cores
8. 9.8 Watermarking IP
9. 9.9 Summary
10. Bibliography
10 Advanced Model-Based FPGA Accelerator Design
1. 10.1 Introduction
2. 10.2 Dataflow Modeling of DSP Systems
3. 10.3 Architectural Synthesis of Custom Circuit Accelerators from DFGs
4. 10.4 Model-Based Development of Multi-Channel Dataflow Accelerators
5. 10.5 Model-Based Development for Memory-Intensive Accelerators
6. 10.6 Summary
7. Notes
8. Bibliography
11 Adaptive Beamformer Example
1. 11.1 Introduction to Adaptive Beamforming
2. 11.2 Generic Design Process
3. 11.3 Algorithm to Architecture
4. 11.4 Efficient Architecture Design
5. 11.5 Generic QR Architecture
6. 11.6 Retiming the Generic Architecture
7. 11.7 Parameterizable QR Architecture
8. 11.8 Generic Control
9. 11.9 Beamformer Design Example
10. 11.10 Summary
11. Bibliography
12 FPGA Solutions for Big Data Applications
1. 12.1 Introduction
2. 12.2 Big Data
3. 12.3 Big Data Analytics
4. 12.4 Acceleration
5. 12.5 k-Means Clustering FPGA Implementation
6. 12.6 FPGA-Based Soft Processors
7. 12.7 System Hardware
8. 12.8 Conclusions
9. Bibliography
13 Low-Power FPGA Implementation
1. 13.1 Introduction
2. 13.2 Sources of Power Consumption
3. 13.3 FPGA Power Consumption
4. 13.4 Power Consumption Reduction Techniques
5. 13.5 Dynamic Voltage Scaling in FPGAs
6. 13.6 Reduction in Switched Capacitance
7. 13.7 Final Comments
8. Bibliography
14 Conclusions
1. 14.1 Introduction
2. 14.2 Evolution in FPGA Design Approaches
3. 14.3 Big Data and the Shift toward Computing
4. 14.4 Programming Flow for FPGAs
5. 14.5 Support for Floating-Point Arithmetic
6. 14.6 Memory Architectures
7. Bibliography
Index
EULA

List of Tables

Chapter 1
1. Table 1.1
Chapter 3
1. Table 3.1
2. Table 3.2
3. Table 3.3
4. Table 3.4
5. Table 3.5
6. Table 3.6
7. Table 3.7
8. Table 3.8
9. Table 3.9
10. Table 3.10
11. Table 3.11
Chapter 4
1. Table 4.1
2. Table 4.2
Chapter 5
1. Table 5.1
2. Table 5.2
3. Table 5.3
4. Table 5.4
Chapter 6
1. Table 6.1
2. Table 6.2
3. Table 6.3
4. Table 6.4
Chapter 8
1. Table 8.1
2. Table 8.2
3. Table 8.3
4. Table 8.4
5. Table 8.5
Chapter 9
1. Table 9.1
Chapter 10
1. Table 10.1
2. Table 10.2
3. Table 10.3
4. Table 10.4
5. Table 10.5
6. Table 10.6
7. Table 10.7
8. Table 10.8
Chapter 11
1. Table 11.1
2. Table 11.2
3. Table 11.3
4. Table 11.4
5. Table 11.5
6. Table 11.6
7. Table 11.7
8. Table 11.8
Chapter 12
1. Table 12.1
2. Table 12.2
3. Table 12.3
4. Table 12.4
5. Table 12.5
Chapter 13
1. Table 13.1
2. Table 13.2
3. Table 13.3
4. Table 13.4
5. Table 13.5

List of Illustrations

Chapter 1
1. Figure 1.1 Moore’s law
2. Figure 1.2 Change in ITRS scaling prediction for clock frequencies
Chapter 2
1. Figure 2.1 Basic DSP system
2. Figure 2.2 Digitization of analogue signals
3. Figure 2.3 Example applications for DSP
4. Figure 2.4 Sampling rates for many DSP systems
5. Figure 2.5 Eight-point radix-2 FFT structure
6. Figure 2.6 Wireless communications transmitter
7. Figure 2.7 Original FIR filter SFG
8. Figure 2.8 FIR filter SFG
9. Figure 2.9 FIR filter configurations
10. Figure 2.10 Filter specification features
11. Figure 2.11 Low-pass filter response
12. Figure 2.12 Direct form IIR filter
13. Figure 2.13 IIR filter distortion
14. Figure 2.14 Frequency impact of warping
15. Figure 2.15 Cascade of second-order IIR filter blocks
16. Figure 2.16 WDF configuration
17. Figure 2.17 WDF building blocks
18. Figure 2.18 Adaptive filter system
19. Figure 2.19 Error surface of a two-tap transversal filter
20. Figure 2.20 Systolic QR array for the RLS algorithm
21. Figure 2.21 BC for QR-RLS algorithm
22. Figure 2.22 IC for QR-RLS algorithm
23. Figure 2.23 Squared Givens Rotations QR-RLS algorithm
Chapter 3
1. Figure 3.1 Number wheel representation of four-bit numbers
2. Figure 3.2 Impact of overflow in two's complement
3. Figure 3.3 Floating-point representations
4. Figure 3.4 One-bit adder structure
5. Figure 3.5 n-bit adder structure
6. Figure 3.6 Alternative CLA structures
7. Figure 3.7 Carry-save adder
8. Figure 3.8 Carry-save array multiplier
9. Figure 3.9 Wallace tree multiplier
10. Figure 3.10 SBNR adder
11. Figure 3.11 Quadratic convergence
12. Figure 3.12 Block diagram for bipartite approximation methods
Chapter 4
1. Figure 4.1 Conclusions from Intel CPU scaling
2. Figure 4.2 Simple parallel implementation of a FIR filter
3. Figure 4.3 Von Neumann processor architecture
4. Figure 4.4 Epiphany architecture
5. Figure 4.5 Processor architectures
6. Figure 4.6 TI TMS32010 DSP processor (Reproduced with permission of Texas Instruments Inc.)
7. Figure 4.7 TMS320C6678 multicore fixed and floating-point DSP microprocessor
8. Figure 4.8 Nvidia GeForce GPU architecture
9. Figure 4.9 Linear systolic array
10. Figure 4.10 Systolic array architecture
11. Figure 4.11 Triangular systolic array architecture
Chapter 5
1. Figure 5.1 PLD architecture
2. Figure 5.2 Storage view of PLD architecture
3. Figure 5.3 Early Manhattan architecture
4. Figure 5.4 Adaptive logic module
5. Figure 5.5 Altera DSP block (simplified view)
6. Figure 5.6 Xilinx UltraScale™ floorplan. Reproduced with permission of Xilinx, Incorp.
7. Figure 5.7 Simplified view of 1/4 Xilinx slice functionality
8. Figure 5.8 Xilinx DSP48E2 DSP block. Reproduced with permission of Xilinx, Incorp.
9. Figure 5.9 Xilinx Zynq architecture
10. Figure 5.10 Zynq application processing unit
11. Figure 5.11 Lattice Semiconductor iCE40isp. Reproduced with permission of Lattice Semiconductor Corp.
12. Figure 5.12 Bandwidth possibilities in FPGAs
Chapter 6
1. Figure 6.1 Mapping logic functionality into LUTs
2. Figure 6.2 Additional usage of CLB LUT resource. Reproduced with permission of Xilinx, Incorp.
3. Figure 6.3 Configuration for Xilinx CLB LUT resource
4. Figure 6.4 Detailed SRL logic structure for Xilinx Spartan-6. Reproduced with permission of Xilinx, Incorp.
5. Figure 6.5 Complex multiplier realization
6. Figure 6.6 Dataflow for motion estimation IP core
7. Figure 6.7 Initial LUT encoding
8. Figure 6.8 Modified LUT encoding
9. Figure 6.9 Low-pass FIR filter
10. Figure 6.10 DCT circuit architecture
11. Figure 6.11 LUT-based 8-bit multiplier
12. Figure 6.12 DA-based multiplier block diagram
13. Figure 6.13 Reduced-complexity multiplier
14. Figure 6.14 DA-based multiplier block diagram. Source: Turner 2004. Reproduced with permission of IEEE.
15. Figure 6.15 Possible implementations using multiplexer-based design technique. Source: Turner 2004. Reproduced with permission of IEEE
16. Figure 6.16 Generalized view of RCM technique
17. Figure 6.17 Multiplication by either 45 or 15
18. Figure 6.18 Multiplication by either 45 or 23
Chapter 7
1. Figure 7.1 High-level synthesis in Gajski and Kuhn’s Y-chart
2. Figure 7.2 Vivado HLS IP creation and integration into a system
3. Figure 7.3 Control logic extraction and I/O port implementation example
4. Figure 7.4 Pipelined processor implementation
5. Figure 7.5 GAUT synthesis flow
Chapter 8
1. Figure 8.1 Latency and throughput rate relationship for system y(n) = ax(n)
2. Figure 8.2 Latency and throughput rate relationship for system y(n) = ay(n − 1)
3. Figure 8.3 Algorithm realizations using three processes P₁, P₂ and P₃
4. Figure 8.4 Interleaving example
5. Figure 8.5 Example of pipelining
6. Figure 8.6 Pipelining of recursive computations y(n) = ay(n − 1)
7. Figure 8.7 Various representations of simple DSP recursion y(n) = ay(n − 1) + x(n)
8. Figure 8.8 SFG representation of three-tap FIR filter
9. Figure 8.9 Simple DFG
10. Figure 8.10 Retiming example
11. Figure 8.11 Retimed FIR filter
12. Figure 8.12 Cut-set theorem application
13. Figure 8.13 Cut-set timing applied to FIR filter
14. Figure 8.14 Cut-set timing applied to FIR filter
15. Figure 8.15 Second-order IIR filter
16. Figure 8.16 Pipelining of a second-order IIR filter. Source: Parhi 1999. Reproduced with permission of John Wiley & Sons.
17. Figure 8.17 Simple DFG example (Parhi 1999)
18. Figure 8.18 Lattice filter
19. Figure 8.19 Manipulation of parallelism
20. Figure 8.20 Block FIR filter
21. Figure 8.21 Reduced block-based FIR filter
22. Figure 8.22 Unfolded first-order recursion
23. Figure 8.23 Unfolded FIR filter-block
24. Figure 8.24 Folded FIR filter section
25. Figure 8.25 Folding transformation
26. Figure 8.26 Folding process
27. Figure 8.27 Alternative folding
Chapter 9
1. Figure 9.1 Benefits of IP types
2. Figure 9.2 Evolution of IP cores
3. Figure 9.3 Fixed- and floating-point operations
4. Figure 9.4 Circuit design flow
5. Figure 9.5 Rapid design flow
6. Figure 9.6 Components suitable for IP
7. Figure 9.7 IP parameters
8. Figure 9.8 Effect of generalization on design reliability
9. Figure 9.9 Wordlength analysis
10. Figure 9.10 Design flow
11. Figure 9.11 FIR filter hierarchy example
Chapter 10
1. Figure 10.1 Simple KPN structure
2. Figure 10.2 Simple SDF graph
3. Figure 10.3 Simple CSDF graph
4. Figure 10.4 Simple MSDF graph
5. Figure 10.5 MR-DFG accelerator architectural synthesis
6. Figure 10.6 Beamformer architecture
7. Figure 10.7 Parallel matrix multiplication
8. Figure 10.8 Matrix multiplication MADF
9. Figure 10.9 Matrix decomposition for fixed token size processing
10. Figure 10.10 Full MADF matrix multiplication
11. Figure 10.11 Block processing matrix multiplication
12. Figure 10.12 Dataflow accelerator architecture
13. Figure 10.13 Two-stage FIR WBC
14. Figure 10.14 Scaled variants of two-stage FIR WBC
15. Figure 10.15 Eight-channel NLF filter bank MADF graph
16. Figure 10.16 NLF SFG, pipelined architecture and WBC
17. Figure 10.17 NLF stage SFG, pipelined architecture and WBC
18. Figure 10.18 Fixed beamformer MADF graph
19. Figure 10.19 Fixed beamformer overview
20. Figure 10.20 Fixed beamformer MADF graph
21. Figure 10.21 Full search motion estimation representations
22. Figure 10.22 FSME modeled using CSDF
23. Figure 10.23 Modified FSME CSDF model
Chapter 11
1. Figure 11.1 Diagram of an adaptive beamformer for interference canceling
2. Figure 11.2 Generic design process
3. Figure 11.3 Multiple beam adaptive beamformer system
4. Figure 11.4 Adaptive filter system
5. Figure 11.5 Triangular systolic array for QRD RLS filtering
6. Figure 11.6 From algorithm to architecture
7. Figure 11.7 Dependence graph for QR decomposition
8. Figure 11.8 From dependence graph to signal flow graph
9. Figure 11.9 Simple linear array mapping
10. Figure 11.10 Rader mapping (Rader 1992, 1996)
11. Figure 11.11 Projecting the QR array onto a linear architecture
12. Figure 11.12 Interleaved processor array
13. Figure 11.13 Linear architecture for a seven-input QR array
14. Figure 11.14 Interleaving successive QR operations. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
15. Figure 11.15 Generic QR array. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
16. Figure 11.16 Repetitive section
17. Figure 11.17 Processor array. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
18. Figure 11.18 Linear array. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
19. Figure 11.19 Sparse linear array
20. Figure 11.20 One QR update scheduled on the sparse linear array. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
21. Figure 11.21 Rectangular array
22. Figure 11.22 Sparse rectangular array. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
23. Figure 11.23 Arithmetic modules. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
24. Figure 11.24 Cell SFGs for the complex arithmetic SGR QR algorithm. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
25. Figure 11.25 Arithmetic modules. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
26. Figure 11.26 Generically retimed BC
27. Figure 11.27 Generically retimed IC
28. Figure 11.28 Schedule for a linear array with an IC latency of 3
29. Figure 11.29 Types of cells in processor array
30. Figure 11.30 QR cells for the linear architecture
31. Figure 11.31 Sparse linear array schedule
32. Figure 11.32 Example partitioning of three columns onto one processor. (Source: Lightbody 2003. Reproduced with permission of IEEE.)
33. Figure 11.33 Generic partitioning of N_IC columns onto one processor
34. Figure 11.34 Sparse linear array cells
35. Figure 11.35 Redistributed delays for sparse linear array cells
36. Figure 11.36 Control sequence for sparse linear array
37. Figure 11.37 Possible effect of latency on sparse linear array schedule
38. Figure 11.38 Merging the delays into the latency of the QR cells
39. Figure 11.39 Effect of latency on schedule for the sparse linear array (N_IC = 3)
40. Figure 11.40 QR cells for the rectangular array
41. Figure 11.41 QR cells for the sparse rectangular array
42. Figure 11.42 LMR control circuitry for sparse arrays
43. Figure 11.43 Control for the external inputs for the linear QR arrays
44. Figure 11.44 External inputs
45. Figure 11.45 Partitioning of control seed
46. Figure 11.46 Example QR architecture derivation, m = 22, p = 4
47. Figure 11.47 Example architecture
Chapter 12
1. Figure 12.1 Inference of data. Source: Cichosz 2015. Reproduced with permission of John Wiley & Sons.
2. Figure 12.2 Simple example of classification
3. Figure 12.3 Scaling computing resources
4. Figure 12.4 Core configuration where Monte Carlo simulations can be programmed
5. Figure 12.5 Flow chart for k-means algorithm
6. Figure 12.6 Distance calculation
7. Figure 12.7 Bits of data in and out of each block
8. Figure 12.8 IPPro architecture
9. Figure 12.9 Proposed system architecture
10. Figure 12.10 Proposed architecture for k-means image clustering algorithm on FPGA
Chapter 13
1. Figure 13.1 Sources of leakage components in CMOS transistor
2. Figure 13.2 Simple CMOS inverter
3. Figure 13.3 Impact of transistor scaling
4. Figure 13.4 Impact of static versus dynamic power consumption with technology evolution (ITRS, 2003; Kim et al. 2003)
5. Figure 13.5 Estimated power consumption for mobile backhaul on Artix-7
6. Figure 13.6 Use of parallelism (and voltage scaling) to lower power consumption
7. Figure 13.7 Generic MAC time domain filter implementation
8. Figure 13.8 Application of pipelining
9. Figure 13.9 Typical FPGA interconnection route
10. Figure 13.10 Pipelined FIR filter implementation for low power
11. Figure 13.11 Tap signal capacitance and toggle activity
12. Figure 13.12 Eight-point radix-2 modified flow graph
13. Figure 13.13 Interconnect capacitance versus point size for FFT designs in Xilinx Virtex-7 FPGA technology
14. Figure 13.14 Interconnect capacitance for a 64-point FFT design implemented using Xilinx Virtex-7 FPGA technology
15. Figure 13.15 Digital receiver architecture for radar system
16. Figure 13.16 Pulse width instances: mixture of nautical and airborne radars
17. Figure 13.17 DR instances: nautical and airborne radars
18. Figure 13.18 Combining data streams

Preface

DSP and FPGAs

Digital signal processing (DSP) is the cornerstone of many products and services in the digital age. It is used in applications such as high-definition TV, mobile telephony, digital audio, multimedia, digital cameras, radar, sonar detectors, biomedical imaging, global positioning, digital radio and speech recognition, to name but a few. The evolution of DSP solutions has been driven by application requirements which, in turn, have only been possible to realize because of developments in silicon chip technology. Currently, a mix of programmable and dedicated system-on-chip (SoC) solutions is required for these applications, and this has been a highly active area of research and development over the past four decades.

The result has been the emergence of numerous technologies for DSP implementation, ranging from simple microcontrollers right through to dedicated SoC solutions which form the basis of high-volume products such as smartphones. With the architectural developments that have occurred in field programmable gate arrays (FPGAs) over the years, it is clear that they should be considered as a viable DSP technology. Indeed, developments made by FPGA vendors support this view of their technology. Strong commercial pressures are driving the adoption of FPGA technology across a range of applications.

The increasing costs of developing silicon technology implementations have put considerable pressure on the ability to create dedicated SoC systems. In the mobile phone market, volumes are such that dedicated SoC systems are required to meet stringent energy requirements, so application-specific solutions have emerged which vary in their degree of programmability, energy requirements and cost. The need to balance these requirements suggests that many of these technologies will coexist in the immediate future, and indeed many hybrid technologies are starting to emerge. This, of course, creates considerable interest in using programmable technology, as this considerably reduces the risks in developing new technologies.

Commonly used DSP technologies encompass software programmable solutions such as microcontrollers and DSP microprocessors. With the inclusion of dedicated DSP processing engines, FPGAs have now emerged as a strong DSP technology. Their key advantage is that they enable users to create system architectures which allow the resources to be best matched to the system processing needs. Whilst memory resources are limited, they have a very high-bandwidth, on-chip capability. Whilst the prefabricated aspect of FPGAs avoids many of the deep problems met when developing SoC implementations, the creation of an efficient implementation from a DSP system description remains a highly convoluted problem, which is a core theme of this book.

Book Coverage

The book looks to address FPGA-based DSP systems, considering implementation at numerous levels.

1. Circuit-level optimization techniques that allow the underlying FPGA fabric to be used more intelligently are reviewed first. By considering the detailed underlying FPGA platform, it is shown how system requirements can be mapped to provide an area-efficient, faster implementation. This is demonstrated for a number of DSP transforms and fixed-coefficient filtering.
2. Architectural solutions can be created from a signal flow graph (SFG) representation. In effect, this requires the user to exploit the highly regular, computationally intensive, data-independent nature of DSP systems to produce highly parallel, pipelined FPGA-based circuit architectures. This is demonstrated for filtering and beamforming applications.
3. System solutions are now a challenge as FPGAs have become a heterogeneous platform involving multiple hardware and software components and interconnection fabrics. There is a need for a higher-level system modeling language, e.g. dataflow, which will facilitate architectural optimizations and also address system-level considerations such as interconnection and memory.

The book covers these areas of FPGA implementation, but its key differentiating factor is that it concentrates on the second and third areas listed above, namely the creation of circuit architectures and system-level modeling; this is because circuit-level optimization techniques have been covered in greater detail elsewhere. The work is backed up with the authors’ experiences in implementing practical, real DSP systems and covers numerous examples including an adaptive beamformer based on a QR-based recursive least squares (RLS) filter, finite impulse response (FIR) and infinite impulse response (IIR) filters, a full search motion estimation system and a fast Fourier transform (FFT) system for electronic support measures. The book also considers the development of intellectual property (IP) cores as this has become a critical aspect in the creation of DSP systems. One chapter is given over to describing the creation of such IP cores and another to the creation of an adaptive filtering core.

Audience

The book is aimed at working engineers who are interested in using FPGA technology efficiently in signal and data processing applications. The earlier chapters will be of interest to graduates and students completing their studies, taking the readers through a number of simple examples that show the trade-offs when mapping DSP systems into FPGA hardware. The middle part of the book contains a number of illustrative, complex DSP system examples that have been implemented using FPGAs and whose performance clearly illustrates the benefit of their use. They provide insights into how best to use complex FPGA technology to produce solutions optimized for speed, area and power, insights which the authors believe are missing from the current literature. The book summarizes over 30 years of experience in implementing complex DSP systems, undertaken in many cases with commercial partners.

Second Edition Updates

The second edition has been updated and improved in a number of ways. It has been updated to reflect the evolution of FPGA technology, to acknowledge developments in programming and synthesis tools, to reflect on algorithms for Big Data applications, and to include improvements to some background chapters. The text has also been updated using relevant examples where appropriate.

Technology update: As FPGAs are linked to silicon technology advances, their architecture continually changes, and this is reflected in Chapter 5. A major change is the inclusion of the ARM® processor core, resulting in a shift for FPGAs to a heterogeneous computing platform. Moreover, the increased use of graphical processing units (GPUs) in DSP systems is reflected in Chapter 4.

Programming tools update: Since the first edition was published, there have been a number of innovations in tool developments, particularly in the creation of commercial C-based high-level synthesis (HLS) and open computing language (OpenCL) tools. The material in Chapter 7 has been updated to reflect these changes, and Chapter 10 has been changed to reflect the changes in model-based synthesis tools.

“Big Data” processing: DSP involves processing of data content such as audio, speech, music and video information, but there is now great interest in collating huge data sets from on-line facilities and processing them quickly. As FPGAs have started to gain some traction in this area, a new chapter, Chapter 12, has been added to reflect this development.

Organization

The FPGA is a heterogeneous platform comprising complex resources such as hard and soft processors, dedicated blocks optimized for processing DSP functions and processing elements connected by both programmable and fast, dedicated interconnections. The book focuses on the challenges of implementing DSP systems on such platforms with a concentration on the high-level mapping of DSP algorithms into suitable circuit architectures.

The material is organized into three main sections.

First Section: Basics of DSP, Arithmetic and Technologies

Chapter 2 starts with a DSP primer, covering both FIR and IIR filtering and transforms including the FFT and discrete cosine transform (DCT), and concludes with adaptive filtering algorithms, covering both the least mean squares (LMS) and RLS algorithms. Chapter 3 is dedicated to computer arithmetic and covers number systems, arithmetic functions and alternative number representations such as logarithmic number systems (LNS) and coordinate rotation digital computer (CORDIC). Chapter 4 covers the technologies available to implement DSP algorithms and includes microprocessors, DSP microprocessors, GPUs and SoC architectures, including systolic arrays. In Chapter 5, a detailed description of commercial FPGAs is given with a concentration on the two main vendors, namely Xilinx and Altera, specifically their UltraScale™/Zynq® and Stratix® 10 FPGA families respectively, but also covering technology offerings from Lattice and MicroSemi.

Second Section: Architectural/System-Level Implementation

This section covers efficient implementation from circuit architecture onto specific FPGA families; creation of circuit architecture from SFG representations; and system-level specification and implementation methodologies from high-level representations. Chapter 6 covers only briefly the efficient implementation of FPGA designs from circuit architecture descriptions, as many of these approaches have been published elsewhere; the text covers distributed arithmetic and reduced-coefficient multiplier approaches and shows how these have been applied to fixed-coefficient filters and DSP transforms. Chapter 7 covers HLS for FPGA design, including new sections to reflect Xilinx’s Vivado HLS tool flow and also Altera’s OpenCL approach. The process of mapping SFG representations of DSP algorithms onto circuit architectures (the starting point in Chapter 6) is then described in Chapter 8. It shows how dataflow graph (DFG) descriptions can be transformed for varying levels of parallelism and pipelining to create circuit architectures which best match the application requirements, backed up with simple FIR and IIR filtering examples.
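
As a minimal sketch of the kind of restructuring summarized above, the Python model below computes a three-tap FIR filter in two equivalent ways: a direct form with a delay line on the input, and a transposed form in which the registers sit between the adders. The transposed form is one common example of moving delays through a graph without changing its function; it is not necessarily the example used in Chapter 8. The function names, coefficients and samples are illustrative assumptions, not taken from the book.

```python
from typing import List, Sequence


def fir_direct(coeffs: Sequence[int], samples: Sequence[int]) -> List[int]:
    """Direct form: y(n) = sum_k a_k * x(n - k), with x(n) = 0 for n < 0."""
    delay_line = [0] * len(coeffs)  # x(n), x(n-1), ..., newest sample first
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift the new sample in
        outputs.append(sum(a * d for a, d in zip(coeffs, delay_line)))
    return outputs


def fir_transposed(coeffs: Sequence[int], samples: Sequence[int]) -> List[int]:
    """Transposed form: the input feeds every multiplier; registers sit in the adder chain."""
    taps = len(coeffs)
    regs = [0] * (taps - 1)  # regs[i] holds the partial sum feeding tap i
    outputs = []
    for x in samples:
        products = [a * x for a in coeffs]
        outputs.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its own product plus the next register's old value.
        regs = [
            products[i + 1] + (regs[i + 1] if i + 1 < taps - 1 else 0)
            for i in range(taps - 1)
        ]
    return outputs


if __name__ == "__main__":
    a = [1, 2, 3]        # hypothetical coefficients
    x = [1, 0, 0, 2]     # hypothetical input samples
    print(fir_direct(a, x))
    print(fir_transposed(a, x))
```

Both functions produce the same output sequence; in hardware terms, the transposed form shortens the path between registers to one multiply and one add, which is the sort of trade-off such transformations are meant to expose.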

One of the ways to perform system design is to create predefined designs termed IP cores, which will typically have been optimized using the techniques outlined in Chapter 8. The creation of such IP cores is outlined in Chapter 9 and addresses a key issue in design productivity by encouraging “design for reuse.” Chapter 10 considers model-based design for heterogeneous FPGAs and focuses on dataflow modeling as a suitable design approach for FPGA-based DSP systems. The chapter outlines how it is possible to include pipelined IP cores via the white box concept using two examples, namely a normalized lattice filter (NLF) and a fixed beamformer.

Third Section: Applications to Big Data, Low Power

The final section of the book, consisting of Chapters 11–13, covers the application of the techniques. Chapter 11 looks at the creation of a soft, highly parameterizable core for RLS filtering, showing how a generic architecture can be created to allow a range of designs to be synthesized with varying performance. Chapter 12 illustrates how FPGAs can be applied to Big Data applications where the challenge is to accelerate some complex processing algorithms. Increasingly FPGAs are seen as a low-power solution, and FPGA power consumption is discussed in Chapter 13. The chapter starts with a discussion on power consumption, highlights the importance of dynamic and static power consumption, and then describes some techniques to reduce power consumption.

Acknowledgments

The authors have been fortunate to receive valuable help, support and suggestions from numerous colleagues, students and friends, including: Michaela Blott, Ivo Bolsens, Gordon Brebner, Bill Carter, Joe Cavallaro, Peter Cheung, John Gray, Wayne Luk, Bob Madahar, Alan Marshall, Paul McCambridge, Satnam Singh, Steve Trimberger and Richard Walke.

The authors’ research has been funded from a number of sources, including the Engineering and Physical Sciences Research Council, Xilinx, Ministry of Defence, Qinetiq, BAE Systems, Selex and Department of Employment and Learning for Northern Ireland.

Several chapters are based on joint work that was carried out with the following colleagues and students: Moslem Amiri, Burak Bardak, Kevin Colgan, Tim Courtney, Scott Fischaber, Jonathan Francey, Tim Harriss, Jean-Paul Heron, Colm Kelly, Bob Madahar, Eoin Malins, Stephen McKeown, Karen Rafferty, Darren Reilly, Lok-Kee Ting, David Trainor, Richard Turner, Fahad M Siddiqui and Richard Walke.

The authors thank Ella Mitchell and Nithya Sechin of John Wiley & Sons and Alex Jackson and Clive Lawson for their personal interest, help and motivation in preparing this work and for assisting in its production.

List of Abbreviations

1D: One-dimensional
2D: Two-dimensional
ABR: Auditory brainstem response
ACC: Accumulator
ADC: Analogue-to-digital converter
AES: Advanced encryption standard
ALM: Adaptive logic module
ALU: Arithmetic logic unit
ALUT: Adaptive lookup table
AMD: Advanced Micro Devices
ANN: Artificial neural network
AoC: Analytics-on-chip
API: Application program interface
APU: Application processing unit
ARM: Advanced RISC machine
ASIC: Application-specific integrated circuit
ASIP: Application-specific instruction processor
AVS: Adaptive voltage scaling
BC: Boundary cell
BCD: Binary coded decimal
BCLA: Block CLA with intra-group, carry ripple
BRAM: Block random access memory
CAPI: Coherent accelerator processor interface
CB: Current block
CCW: Control and communications wrapper
CE: Clock enable
CISC: Complex instruction set computer
CLA: Carry lookahead adder
CLB: Configurable logic block
CMOS: Complementary metal oxide semiconductor
CNN: Convolutional neural network
CORDIC: Coordinate rotation digital computer
CPA: Carry propagation adder
CPU: Central processing unit
CSA: Conditional sum adder
CSDF: Cyclo-static dataflow
CWT: Continuous wavelet transform
DA: Distributed arithmetic
DCT: Discrete cosine transform
DDR: Double data rate
DES: Data Encryption Standard
DFA: Dataflow accelerator
DFG: Dataflow graph
DFT: Discrete Fourier transform
DG: Dependence graph
disRAM: Distributed random access memory
DM: Data memory
DPN: Dataflow process network
DRx: Digital receiver
DSP: Digital signal processing
DST: Discrete sine transform
DTC: Decision tree classification
DVS: Dynamic voltage scaling
DWT: Discrete wavelet transform
E²PROM: Electrically erasable programmable read-only memory
EBR: Embedded Block RAM
ECC: Error correction code
EEG: Electroencephalogram
EPROM: Electrically programmable read-only memory
E-SGR: Enhanced Squared Givens rotation algorithm
EW: Electronic warfare
FBF: Fixed beamformer
FCCM: FPGA-based custom computing machine
FE: Functional engine
FEC: Forward error correction
FFE: Free-form expression
FFT: Fast Fourier transform
FIFO: First-in, first-out
FIR: Finite impulse response
FPGA: Field programmable gate array
FPL: Field programmable logic
FPU: Floating-point unit
FSM: Finite state machine
FSME: Full search motion estimation
GFLOPS: Giga floating-point operations per second
GMAC: Giga multiply-accumulates
GMACS: Giga multiply-accumulate per second
GOPS: Giga operations per second
GPGPU: General-purpose graphical processing unit
GPU: Graphical processing unit
GRNN: General regression neural network
GSPS: Gigasamples per second
HAL: Hardware abstraction layer
HDL: Hardware description language
HKMG: High-K metal gate
HLS: High-level synthesis
I2C: Inter-integrated circuit
I/O: Input/output
IC: Internal cell
ID: Instruction decode
IDE: Integrated design environment
IDFT: Inverse discrete Fourier transform
IEEE: Institute of Electrical and Electronics Engineers
IF: Instruction fetch
IFD: Instruction fetch and decode
IFFT: Inverse fast Fourier transform
IIR: Infinite impulse response
IM: Instruction memory
IoT: Internet of things
IP: Intellectual property
IR: Instruction register
ITRS: International Technology Roadmap for Semiconductors
JPEG: Joint Photographic Experts Group
KCM: Constant-coefficient multiplication
KM: Kernel memory
KPN: Kahn process network
LAB: Logic array block
LDMC: Logic delay measurement circuit
LDPC: Low-density parity-check
LLVM: Low-level virtual machine
LMS: Least mean squares
LNS: Logarithmic number representations
LPDDR: Low-power double data rate
LS: Least squares
lsb: Least significant bit
LTI: Linear time-invariant
LUT: Lookup table
MA: Memory access
MAC: Multiply-accumulate
MAD: Minimum absolute difference
MADF: Multidimensional arrayed dataflow
MD: Multiplicand
ME: Motion estimation
MIL-STD: Military standard
MIMD: Multiple instruction, multiple data
MISD: Multiple instruction, single data
MLAB: Memory LAB
MMU: Memory management unit
MoC: Model of computation
MPE: Media processing engine
MPEG: Moving Picture Experts Group
MPSoC: Multi-processing SoC
MR: Multiplier
MR-DFG: Multi-rate dataflow graph
msb: Most significant bit
msd: Most significant digit
MSDF: Multidimensional synchronous dataflow
MSI: Medium-scale integration
MSPS: Megasamples per second
NaN: Not a Number
NLF: Normalized lattice filter
NRE: Non-recurring engineering
OCM: On-chip memory
OFDM: Orthogonal frequency division multiplexing
OFDMA: Orthogonal frequency division multiple access
OLAP: On-line analytical processing
OpenCL: Open computing language
OpenMP: Open multi-processing
ORCC: Open RVC-CAL Compiler
PAL: Programmable Array Logic
PB: Parameter bank
PC: Program counter
PCB: Printed circuit board
PCI: Peripheral component interconnect
PD: Pattern detect
PE: Processing element
PL: Programmable logic
PLB: Programmable logic block
PLD: Programmable logic device
PLL: Phase locked loop
PPT: Programmable power technology
PS: Processing system
QAM: Quadrature amplitude modulation
QR-RLS: QR recursive least squares
RAM: Random access memory
RAN: Radio access network
RCLA: Block CLA with inter-block ripple
RCM: Reduced coefficient multiplier
RF: Register file
RISC: Reduced instruction set computer
RLS: Recursive least squares
RNS: Residue number representations
ROM: Read-only memory
RT: Radiation tolerant
RTL: Register transfer level
RVC: Reconfigurable video coding
SBNR: Signed binary number representation
SCU: Snoop control unit
SD: Signed digits
SDF: Synchronous dataflow
SDK: Software development kit
SDNR: Signed digit number representation
SDP: Simple dual-port
SERDES: Serializer/deserializer
SEU: Single event upset
SFG: Signal flow graph
SGR: Squared Givens rotation
SIMD: Single instruction, multiple data
SISD: Single instruction, single data
SMP: Shared-memory multi-processors
SNR: Signal-to-noise ratio
SoC: System-on-chip
SOCMINT: Social media intelligence
SoPC: System on programmable chip
SPI: Serial peripheral interface
SQL: Structured query language
SR-DFG: Single-rate dataflow graph
SRAM: Static random access memory
SRL: Shift register lookup table
SSD: Shifted signed digits
SVM: Support vector machine
SW: Search window
TCP: Transmission Control Protocol
TFLOPS: Tera floating-point operations per second
TOA: Time of arrival
TR: Throughput rate
TTL: Transistor-transistor logic
UART: Universal asynchronous receiver/transmitter
ULD: Ultra-low density
UML: Unified modeling language
VHDL: VHSIC hardware description language
VHSIC: Very high-speed integrated circuit
VLIW: Very long instruction word
VLSI: Very large scale integration
WBC: White box component
WDF: Wave digital filter