This document discusses image processing using FPGAs. It begins with an overview of FPGAs and their components. It then discusses using high-level synthesis to convert C++ code to hardware designs for FPGAs. An example of implementing Sobel edge detection on an FPGA is provided. The implementation was optimized from 40 cycles per pixel to 1 cycle per pixel through pipelining, parallelism, and using block RAM for intermediate storage. Challenges discussed include limited debugging tools and steep learning curves for FPGA development.
10. High Level Synthesis
Converts C++ code to hardware design
HLS compiler optimizes your code for the FPGA
Automatically optimizes RTL and timing
Provides #pragmas for fine-tuning
C++ API for arbitrary precision math
C++ API for stream data processing
Supports C++11
13. Things to remember
Instantaneous BRAM access
Register-level bandwidth 0.5M-bits / second
BRAM bandwidth 23T-bits / second
Numbers above for Xilinx Kintex®-7 410T device
16. Things to remember
● No branching penalty
● No cache penalty
● No dynamic memory allocation
● Instantaneous BRAM access
● Single producer - single consumer
● Pipelining
● Task-centric approach
17. HLS Development cycle
1. Get baseline version
2. Write simulation test
3. Run HLS synthesis
4. Simulate
5. Validate
6. Measure
7. Optimize
8. Goto 3
19. Sobel Edge Detection
Baseline implementation
Iterate over image
● Convolve 3x3 window with Gx and Gy kernels
● Compute their absolute sum
● Write to corresponding output pixel
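The three baseline steps above can be sketched in plain C++ as follows. This is a minimal illustration, not the code from the talk: the function name sobel_baseline, the 8-bit grayscale row-major layout, the clamping to 255, and leaving border pixels at 0 are all assumptions.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical baseline: Sobel over an 8-bit grayscale, row-major image.
// Border pixels are left at 0 and the result is clamped to 255 (assumptions).
std::vector<std::uint8_t> sobel_baseline(const std::vector<std::uint8_t>& src,
                                         int width, int height) {
    static const int Gx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int Gy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    std::vector<std::uint8_t> dst(src.size(), 0);

    for (int y = 1; y < height - 1; ++y) {        // iterate over image rows
        for (int x = 1; x < width - 1; ++x) {     // iterate over image columns
            int gx = 0, gy = 0;
            for (int i = 0; i < 3; ++i) {         // convolve the 3x3 window
                for (int j = 0; j < 3; ++j) {
                    int p = src[(y + i - 1) * width + (x + j - 1)];
                    gx += Gx[i][j] * p;
                    gy += Gy[i][j] * p;
                }
            }
            int mag = std::abs(gx) + std::abs(gy);  // absolute sum
            dst[y * width + x] = static_cast<std::uint8_t>(mag > 255 ? 255 : mag);
        }
    }
    return dst;
}
```

Each output pixel reads nine input pixels, which is the access pattern the later tuning steps have to deal with.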
The FPGA frequency in this example is 150 MHz
To meet the 1920x1080@60Hz goal we must process data at a rate of 1 cycle/pixel or faster
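The 1 cycle/pixel figure follows from dividing the clock rate by the pixel rate. A small sketch of that arithmetic is shown below; the helper name cycles_per_pixel is hypothetical, and blanking intervals are ignored, which is a simplifying assumption.

```cpp
#include <cstdint>

// Clock cycles available per pixel for a given clock and video mode.
// Assumption: only active pixels are counted (no blanking overhead).
constexpr double cycles_per_pixel(double clock_hz, std::uint32_t width,
                                  std::uint32_t height, std::uint32_t fps) {
    return clock_hz / (static_cast<double>(width) * height * fps);
}

// 1920 * 1080 * 60 = 124,416,000 pixels per second.
// 150,000,000 / 124,416,000 is roughly 1.2 cycles per pixel.
constexpr double kBudget = cycles_per_pixel(150e6, 1920, 1080, 60);

// A design fits the budget if its cost per pixel does not exceed it.
constexpr bool fits_budget(double design_cycles_per_pixel) {
    return design_cycles_per_pixel <= kBudget;
}
```

With about 1.2 cycles available per pixel, a design that needs 2 cycles per pixel misses the target, so 1 cycle/pixel is the practical requirement.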
22. Sobel Edge Detection
Tuning FPGA implementation
Iterate over image
● Convolve 3x3 window with Gx and Gy kernels
Pipeline: Compute one field in the 3x3 filter window per clock cycle.
● Compute Gx and Gy absolute sum
● Write to corresponding output pixel
25. Sobel Edge Detection
Tuning FPGA implementation
Iterate over image
● Pipeline: Apply pipeline to the inner loop (columns)
● Convolve 3x3 window with Gx and Gy kernels
○ Loop is fully unrolled and computed in 1 cycle
● Compute Gx and Gy absolute sum
○ Also computed in parallel
● Write to corresponding output pixel
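A sketch of how this tuning step might look in source form is shown below. The PIPELINE and UNROLL directives are Vivado HLS pragmas; the function name sobel_pipelined, the pointer interface, and the clamping are assumptions made for illustration. A standard C++ compiler ignores the pragmas, so the same code can run in software simulation.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical pipelined variant; dst border pixels are not written.
void sobel_pipelined(const std::uint8_t* src, std::uint8_t* dst,
                     int width, int height) {
    static const int Gx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int Gy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
#pragma HLS PIPELINE II=1
            // Request a new column iteration every clock cycle.
            int gx = 0, gy = 0;
            for (int i = 0; i < 3; ++i) {
#pragma HLS UNROLL
                for (int j = 0; j < 3; ++j) {
#pragma HLS UNROLL
                    // Fully unrolled: nine multiply-accumulates in parallel.
                    int p = src[(y + i - 1) * width + (x + j - 1)];
                    gx += Gx[i][j] * p;
                    gy += Gy[i][j] * p;
                }
            }
            int mag = std::abs(gx) + std::abs(gy);  // Gx and Gy in parallel
            dst[y * width + x] = static_cast<std::uint8_t>(mag > 255 ? 255 : mag);
        }
    }
}
```

The nine reads of src per iteration are still present here, so the tool can only reach the requested interval if the memory can serve them all in one cycle.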
28. Sobel Edge Detection
Tuning FPGA implementation
Issues
● Nine concurrent memory accesses
● More hardware blocks required
● The HLS module can connect to only a single memory port, which handles one transaction per clock cycle
29. Sobel Edge Detection
Tuning FPGA implementation
● Use BRAM to store intermediate line buffer
● Read data from external memory to line buffer
● Fill memory window (Flip-flop elements)
● Convolve 3x3 window with Gx and Gy kernels
○ Loop is fully unrolled and computed in 1 cycle
● Compute their absolute sum
○ Also computed in parallel
● Write to corresponding output pixel
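The line-buffer scheme above can be sketched as follows. It reads one pixel from external memory per iteration, keeps the two previous rows in an on-chip array, and shifts a 3x3 register window. The names sobel_linebuffer and MAX_WIDTH, the clamping, and the untouched border pixels are assumptions; the pragmas are Vivado HLS directives that a standard compiler ignores.

```cpp
#include <cstdint>
#include <cstdlib>

constexpr int MAX_WIDTH = 1920;  // assumed maximum line length

// Hypothetical line-buffer variant; dst border pixels are not written.
void sobel_linebuffer(const std::uint8_t* src, std::uint8_t* dst,
                      int width, int height) {
    if (width > MAX_WIDTH) return;               // line buffer would overflow
    static std::uint8_t line_buf[2][MAX_WIDTH];  // two previous rows (BRAM)
    std::uint8_t win[3][3] = {};                 // 3x3 window (flip-flops)
#pragma HLS ARRAY_PARTITION variable=win complete dim=0

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
#pragma HLS PIPELINE II=1
            std::uint8_t in = src[y * width + x];  // single external read

            for (int i = 0; i < 3; ++i) {          // shift window left
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
            }
            win[0][2] = line_buf[0][x];            // row y-2
            win[1][2] = line_buf[1][x];            // row y-1
            win[2][2] = in;                        // row y
            line_buf[0][x] = line_buf[1][x];       // age the line buffer
            line_buf[1][x] = in;

            if (y >= 2 && x >= 2) {                // window is fully valid
                int gx = (win[0][2] - win[0][0]) + 2 * (win[1][2] - win[1][0])
                       + (win[2][2] - win[2][0]);
                int gy = (win[2][0] - win[0][0]) + 2 * (win[2][1] - win[0][1])
                       + (win[2][2] - win[0][2]);
                int mag = std::abs(gx) + std::abs(gy);
                // The window is centred on pixel (x-1, y-1).
                dst[(y - 1) * width + (x - 1)] =
                    static_cast<std::uint8_t>(mag > 255 ? 255 : mag);
            }
        }
    }
}
```

Because the window lives in registers and the line buffer is read at a single address per cycle, the nine concurrent external accesses are replaced by one external read per pixel.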