TRACK F: OpenCL for ALTERA FPGAs – Accelerating performance and design productivity / Liad Weinberger
Slide 1 – May 1, 2013
OpenCL for ALTERA FPGAs
Accelerating performance and design
productivity
Liad Weinberger – Appilo
May 1st, 2013
Slide 2
Technology trends
• Over the past years
– Technology scaling favors programmability and parallelism
[Diagram: a spectrum of architectures, from single cores and coarse-grained CPUs and DSPs, through multi-cores and coarse-grained massively parallel processor arrays, to fine-grained massively parallel arrays – CPUs, DSPs, multi-cores, array GPGPUs, FPGAs]
Slide 3
Technology trends
[Chart: process node (nm) by year, 2000–2022, steadily shrinking]
• Moore’s law still in effect
– More FPGA real-estate
• More potential for parallelism – an extremely good thing!
• Designs that utilize this real estate become harder to
manage and maintain – this is not so good...
Slide 4
Technology trends
[Chart: Google Trends – worldwide interest over the years, Verilog + VHDL]
Verilog + VHDL
• Decreased interest
– Number of Google searches for VHDL or
Verilog in decline
Slide 5
Technology trends
[Chart: Google Trends – interest over the years, Verilog + VHDL vs. Python]
• Software development keeps momentum
– Number of Google searches for Python (as a
representative language) keeps growing
Slide 6
FPGA (hardware) development
• Design (programming) is complex
– Define state machine, data-paths, arbitration, IP interfaces, etc.
– Sophisticated iterative compilation process
• Synthesis, technology mapping, clustering, placement and routing, timing closure
• Leads to long compilation times (hours vs. minutes in software)
– Debug process is also very time-consuming
• Code is not portable
– Written in Verilog / VHDL
• Can’t re-target for CPUs, GPUs, DSPs, etc.
• Not scalable
[Diagram: HDL → set constraints → compilation → timing closure, iterated]
Slide 7
Software development
• Programming is straight-forward
– Ideas are expressed in languages such as C/C++/Python/etc.
• Typically, start with simple sequential implementation
• Use parallel APIs / language extensions, in order to exploit multi-core
architectures for additional performance
– Compilation times are usually reasonably short
• Simple straight-forward compilation/linking process
– Immediate feedback when debugging/profiling
• An assortment of tools available for both debugging and profiling
• Portability is still an issue
– Possible, but requires pre-planning
[Diagram: C/C++/Python/etc. sources pass through a compiler & linker in a single straightforward step]
Slide 8
Product development point-of-view
• Product producers want:
– Lower development and maintenance costs
– Competitive edge
• Higher performance
• Short time-in-market, and short time-to-market
– Agile development methods are becoming more and more popular
– Can’t afford long development cycles
– Trained developers with established experience
• Or cost-effective path for training new developers
– Flexibility
• No vendor lock-in is preferred
• Ability to rapidly adapt product to market requirement changes
Slide 9
Our challenge
• How do we bring FPGA design process closer to the
software development model?
– Need to make FPGAs more accessible to the software development
community
• Change in mind-set: look at FPGAs as massively multi-core devices that
could be used in order to accelerate parallel applications
• A programming model that allows that
• Shorter compilation times and faster feedback for debugging and profiling
the design
Slide 10
An ideal programming environment...
• Based on a standard programming model
– Rather than something which is FPGA-specific
• Abstracts away the underlying details of the hardware
– VHDL / Verilog are similar to “assembly language” programming
– Useful in rare circumstances where the highest possible efficiency is needed
• The price of abstraction is not too high
– Still need to efficiently use the FPGA’s resources to achieve high throughput / low
area
• Allows for software-like compilation & debug cycles
– Faster compile times
– Profiling & user feedback
Slide 11
Introducing OpenCL
Parallel heterogeneous computing
Slide 12
A case for OpenCL
• What is OpenCL?
– An open, royalty-free standard for cross-platform parallel software programming of
heterogeneous systems
• CPU + DSPs
• CPU + GPUs
• CPU + FPGAs
• Or any combination of these, all together
– Maintained by KHRONOS group
• An industry consortium creating open, royalty-free standards
• Comprised of hardware and software vendors
– Enables software to leverage silicon acceleration
• Consists of two major parts:
– Application Programming Interface (API) for device management
– Device programming language based on C99 with
some restrictions and extensions to support explicit parallelism
Slide 13
Benefits of OpenCL
• Cross-vendor software portability
– Functional portability – the same code will normally execute on
different hardware from different vendors
– Not performance portable – code still needs to be optimized for a
specific device (or at least a device class)
• Allows for the management of available computational
resources under a single framework
– Views CPUs, GPUs, FPGAs, and other accelerators as devices that
could carry the computational needs of the application
Slide 14
OpenCL program structure
• Separation between managerial and computational code bases
– Managerial code executes on a host CPU
• Any type of conventional micro-processor
• Written in any language that has bindings for the OpenCL API
– The API is in ANSI-C
– There is a formal C++ binding
– Other bindings may exist
– Computational code executes on the compute devices (accelerators)
• Written in a language called OpenCL C
– Based on C99
– Adds restrictions and extensions for explicit parallelism
• Can be compiled either offline, or online, depending on implementation
• Will most likely consist only of those portions of the application we want to accelerate
Slide 16
OpenCL host application
• Communicates with the Accelerator Device via a set of
library routines
– Abstracts away host processor to HW accelerator communication via
a set of API calls
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
[Diagram: clEnqueueWriteBuffer copies data host → FPGA; clEnqueueNDRangeKernel asks the FPGA to run a particular kernel; clEnqueueReadBuffer copies data FPGA → host]
Slide 17
OpenCL kernels
• Data-parallel function
– Executes by many parallel
threads
• Each thread has an identifier
which could be obtained with
a call to the get_global_id()
built-in function
• Uses qualifiers to define
where memory buffers reside
• Executed by a
compute device
– CPU
– GPU
– FPGA
– Other accelerator
[Figure: vectors for the sum kernel]
a = [0 1 2 3 4 5 6 7]
b = [7 6 5 4 3 2 1 0]
y = [7 7 7 7 7 7 7 7]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
__kernel void sum( … );
Slide 18
OpenCL on FPGAs
How does it map?
Slide 19
Compiling OpenCL to FPGAs
[Diagram: the OpenCL source is split in two – kernel programs go through the ACL (Altera OpenCL) compiler to produce an SOF (FPGA bitstream), while the host program goes through a standard C compiler to produce an x86 binary; host and FPGA communicate over PCIe]
Kernel program:
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
Host program:
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
Slide 20
Compiling OpenCL to FPGAs
[Diagram: the sum kernel compiled into replicated Load-Load-Store pipeline units on the FPGA, connected to PCIe and DDRx memory]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
The kernel program is compiled into custom hardware for your kernels.
Slide 21
FPGA architecture for OpenCL
[Diagram: the FPGA kernel system – multiple kernel pipelines, each with local memories on a local memory interconnect; a global memory interconnect connects the pipelines to external memory controllers & PHYs (DDR*) and to PCIe, behind which sits an x86 / external processor and external memory]
Slide 22
Mapping multithreaded kernels to FPGAs
• Simplest way of mapping kernel functions to FPGAs is
to replicate hardware for each thread
– Inefficient and wasteful
• Technique: deep pipeline parallelism
– Attempt to create a deeply pipelined representation of a kernel
– On each clock cycle, we attempt to send in input data for a new
thread
– Method of mapping coarse grained thread parallelism to fine-grained
FPGA parallelism
Slide 23
Example pipeline for vector add
• On each cycle, the portions of
the pipeline are processing
different threads
• While thread 2 is being loaded,
thread 1 is being added, and
thread 0 is being stored
[Figure: a Load / Load → + → Store pipeline; the 8 threads (IDs 0–7) of the vector add example enter one per cycle]
(Slides 24–27 repeat the same bullets while the figure animates: each cycle one more thread ID enters the load stage, and earlier threads advance through + and then Store.)
Slide 28
Some examples
Using ALTERA’s OpenCL solution
Slide 29
AES encryption
• Counter (CTR) based encryption/decryption
– 256-bit key
• Advantage FPGA
– Integer arithmetic
– Coarse grain bit operations
– Complex decision making
• Results:
Platform | Throughput (GB/s)
E5503 Xeon Processor (single core) | 0.01
AMD Radeon HD 7970 | 0.33
PCIe385 A7 Accelerator | 5.20
• Only 42% FPGA utilization (2 kernels):
– Power conservation
– Fill up for even higher performance
Slide 30
Multi-asset barrier option pricing
• Monte Carlo simulation
– Heston model:
dS_t = μ S_t dt + √ν_t S_t dW_t^S
dν_t = κ(θ − ν_t) dt + ξ √ν_t dW_t^ν
– NDRange: assets × paths (64 × 1,000,000)
• Advantage FPGA
– Complex control flow
• Results:
Platform | Power (W) | Performance (Msims/s) | Msims/W
W3690 Xeon Processor | 130 | 32 | 0.25
nVidia Tesla C2075 | 225 | 63 | 0.28
PCIe385 D5 Accelerator | 23 | 170 | 7.40