SDAccel Design Contest: Xilinx SDAccel
1. Courses @ NECST
Lorenzo Di Tucci <lorenzo.ditucci@polimi.it>
Emanuele Del Sozzo <emanuele.delsozzo@polimi.it>
Marco Rabozzi <marco.rabozzi@polimi.it>
Marco D. Santambrogio <marco.santambrogio@polimi.it>
Xilinx SDAccel
15/02/2018
DEIB Seminar Room
2. 2
Agenda
- Recall on Hardware Design Flow
- Introduction to SDAccel Framework
- OpenCL
- computational model
- platform
- memory model
- SDAccel Design Flow
- Kernel Specification
- Examples
3. 3
Did you register?
Use this Google Doc to provide your data
https://goo.gl/FRCG6y
First, install the VPN we have provided you.
(Mac: Tunnelblick - Windows/Linux: OpenVPN)
To SSH to the machine:
ssh <name>.<surname>@nags31.local.necst.it
password: user
4. 4
Installation Party
You can change your password here:
http://changepassword.local.necst.it/
You can also RDP to the instance using
• Microsoft Remote Desktop (Microsoft/Mac OS)
• Remmina (Linux)
To connect to the machine or change your password, you must
have started the VPN.
5. 5
Hardware Design Flow for HPC
• Hardware Design Flow (HDF): the process of realizing a
hardware module
• The HDF for FPGAs can be seen as a 2-step process:
High-Level Synthesis: from high-level code to a
Hardware Description Language (HDL)
System-Level Design: implementation on the board
[Figure: High-Level Code → FPGA]
9. 9
The Hardware Design Flow
System integration, driver generation and runtime management
10. 10
The Hardware Design Flow
• Complete automation of the 2 steps of the hardware
design flow
11. 11
Xilinx SDAccel
- Given high-level code, it completely automates the
steps of the hardware design flow
- Respects the OpenCL memory and computational
models
12. 12
OpenCL (Open Computing Language)
• Open, cross-platform parallel programming
language for heterogeneous architectures
• Standard for the development and
acceleration of data-parallel applications
• Allows writing portable accelerated code
across different devices and architectures
(FPGAs, GPGPUs, DSPs, …)
13. OpenCL Computational Model
• Work-item:
– The basic unit of work within an OpenCL device
• Global size:
– Declares an N-dimensional size of the total number of
work-items
– Size of the computational problem
size_t global[N]
• Local size:
– Declares an N-dimensional work-group size
– The number of work-items that will execute within a
work-group
size_t local[N]
14. N-Dimensional kernel range
• global and local can be 1D, 2D, or 3D, and correspond to the
dimensionality of the data to be processed
[Figure: 1D, 2D, and 3D index spaces]
16. 1-Dimensional kernel range (host code)
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 1;
err = clEnqueueNDRangeKernel(
commands, kernel, 1, NULL,
global, local,
0, NULL, NULL
);
Global and local sizes of dimension 1
1-Dimensional Kernel
→ 10 total work-items
→ work-group size of 1 work-item
17. 1-Dimensional kernel range
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 1;
[Figure: HOST ↔ Communication System ↔ OpenCL device; the device
contains a Compute Unit with one PE (Processing Element)]
Work-item: maps to a PE
Work-group: mapped to a compute unit
22. 1-Dimensional kernel range
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 2;
[Figure: HOST ↔ Communication System ↔ OpenCL device; the Compute
Unit now contains two PEs (Processing Elements)]
Work-items: map to PEs
Work-group: mapped to a compute unit
25. 1-Dimensional kernel range
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 2;
[Figure: OpenCL device with two Compute Units, each containing two
PEs; work-items map to PEs, work-groups to compute units]
Increased parallelism:
2 compute units working
in parallel on different
work items
27. 2-Dimensional kernel range
• global and local can be 1D, 2D, or 3D, and correspond to the
dimensionality of the data to be processed
28. 2-Dimensional kernel range (host code)
size_t global[2];
size_t local[2];
global[0] = 10;
global[1] = 10;
local[0] = 2;
local[1] = 2;
err = clEnqueueNDRangeKernel(
commands, kernel, 2, NULL,
global, local,
0, NULL, NULL
);
Global and local sizes of dimension 2
2-Dimensional Kernel
→ 10x10 total work-items
→ work-group size of 2x2 work-items
29. 29
2-Dimensional kernel range
[Figure: a 10x10 problem grid (Problem Size Dim 1 and Dim 2) divided
into 2x2 work-groups; each work-group maps to a Compute Unit whose
PEs (Processing Elements) execute the individual work-items]
30. 30
OpenCL Platform & Memory Model
The host's responsibilities include:
- managing the operating system and
enabling drivers for all devices
- picking the correct device for
computation
- executing the application host
program
- creating and managing memory
buffers
- launching and managing kernel
execution
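These responsibilities map onto a well-known sequence of OpenCL host API calls. A minimal sketch (error handling mostly omitted; with SDAccel the program would additionally be loaded from a prebuilt .xclbin via clCreateProgramWithBinary, which is only noted here):

```c
#include <CL/cl.h>

int main(void) {
    cl_int err;

    /* 1) pick a platform and an accelerator device */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    /* 2) create a context and a command queue for that device */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    /* 3) create a memory buffer in device global memory and copy
     * host data into it */
    int a[10] = {0};
    cl_mem buf_a = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  sizeof(a), NULL, &err);
    clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, sizeof(a), a,
                         0, NULL, NULL);

    /* 4) load the kernel (with SDAccel: clCreateProgramWithBinary on
     * the .xclbin), set its arguments with clSetKernelArg, and launch
     * it with clEnqueueNDRangeKernel as shown in the earlier slides */

    clReleaseMemObject(buf_a);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
```

Running this requires an OpenCL runtime and an accelerator device, so it is meant as a reading aid rather than something to execute as-is.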
31. 31
OpenCL Platform & Memory Model
The Device:
- communicates with the host through memory-based transfers
- is reconfigured at runtime to execute
our kernel
- is divided into multiple compute units
- each compute unit executes a
work-group
- each work-group contains multiple
work-items
- a compute unit is further divided into
processing elements (PEs)
- a PE is responsible for the execution
of a work-item
33. 33
OpenCL Platform & Memory Model
Three layers of memory:
1) Global: shared between host and device (DRAM; the host accesses it via PCIe)
2) Local: accessible by all the work-items inside a compute unit (BRAM)
3) Private: accessible only to a single processing element / work-item
(registers)
The OpenCL memory abstraction does not allow the host to write directly to the
device's local or private memory; data must pass through Global Memory.
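In OpenCL C device code, the three layers correspond to the __global, __local, and (implicit) private address-space qualifiers. An illustrative kernel sketch, not taken from the slides:

```c
// OpenCL C device code (illustrative sketch).
__kernel void scale(__global const float *in,   // global: DRAM, host-visible
                    __global float *out,
                    __local float *tile) {      // local: BRAM, per work-group
    float factor = 2.0f;                        // private: registers, per work-item
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];              // stage data from global into local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // synchronize the work-group
    out[gid] = tile[lid] * factor;    // write the result back to global memory
}
```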
38. 38
Design Flow: Makefile
- compile the host code
- generate a Xilinx object (.xo) for each kernel
- link the .xo file(s) into the .xclbin to be executed
- emulate or build your application
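The steps above can be sketched as a Makefile around the xocc compiler. File, kernel, and platform names below are placeholders, not the ones used in the contest setup:

```makefile
# Illustrative sketch of the SDAccel build steps; names are placeholders.
TARGET   ?= sw_emu                 # sw_emu | hw_emu | hw
PLATFORM ?= <your_platform>

# compile the host code
host: host.cpp
	$(CXX) -o host host.cpp -lOpenCL

# generate a Xilinx object (.xo) for the kernel
vadd.xo: vadd.cl
	xocc -c -t $(TARGET) --platform $(PLATFORM) -k vadd -o vadd.xo vadd.cl

# link the .xo file(s) into the .xclbin loaded by the host
vadd.xclbin: vadd.xo
	xocc -l -t $(TARGET) --platform $(PLATFORM) -o vadd.xclbin vadd.xo
```

Switching TARGET between sw_emu, hw_emu, and hw selects software emulation, hardware emulation, or the full FPGA build.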
41. 41
OpenCL Kernel
• Simply define the OpenCL kernel and the associated
work-group size (in the following example, 10 work-items
per work-group)
• Must be called from the host as an NDRange kernel
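A hypothetical vector-addition kernel following this recipe might look like the following (an illustrative sketch, not the contest code):

```c
// OpenCL C device code (illustrative sketch).
// reqd_work_group_size fixes the work-group size the tool builds for:
// here 10 work-items per work-group, launched as an NDRange kernel.
__attribute__((reqd_work_group_size(10, 1, 1)))
__kernel void vadd(__global const int *a,
                   __global const int *b,
                   __global int *c) {
    int i = get_global_id(0);   // one element per work-item
    c[i] = a[i] + b[i];
}
```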
42. 42
C/C++ Kernel
• Use the standard AXI Master and AXI Lite interfaces, as in
Vivado HLS
• All memory ports must be mapped to the same bundle
• Include your kernel code within an extern "C" block
• Must be called from the host as a simple task
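A C/C++ kernel following these rules might be sketched as below (illustrative; names and the vector-addition body are assumptions, while the pragma style follows the Vivado HLS conventions the slide refers to). Note that all memory ports share the same m_axi bundle:

```cpp
// C/C++ kernel sketch (illustrative). The pragmas map the pointer
// arguments to one AXI Master bundle and the control signals to an
// AXI Lite interface; the host calls it as a simple task.
extern "C" {
void vadd(const int *a, const int *b, int *c, int n) {
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=c bundle=control
#pragma HLS INTERFACE s_axilite port=n bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
}
```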
52. 52
RTL Kernel
1) write your code using a HDL
(Verilog/VHDL/Chisel HDL,
etc...)
2) Integrate your HDL into
SDAccel and generate a
Xilinx Object (.xo)
3) Perform Hardware Emulation
to check correctness
4) Build for FPGA
53. 53
Examples
- Let's start with the Vector Addition code presented by
Emanuele last time.
- Let's produce a C/C++ version and an OpenCL one.
Example code is available on NAGS31 @
/sdaccel_contest/
108. 108
This is only the beginning!!
For more information, read the SDAccel manual(s):
https://www.xilinx.com/support/documentation-navigation/development-tools/software-development/sdaccel.html
109. 109
Feedback
• We are working on improving this course; would you
share your feedback on this lesson?
https://goo.gl/forms/mcmtcojJEqFTpg8j1
110. Thank You for the
Attention!
110
Lorenzo Di Tucci
lorenzo.ditucci@polimi.it
Emanuele Del Sozzo
emanuele.delsozzo@polimi.it
Marco Rabozzi
marco.rabozzi@polimi.it
Marco D. Santambrogio
marco.santambrogio@polimi.it