SDAccel Design Contest: Xilinx SDAccel
1. Courses @ NECST
Lorenzo Di Tucci <lorenzo.ditucci@polimi.it>
Emanuele Del Sozzo <emanuele.delsozzo@polimi.it>
Marco Rabozzi <marco.rabozzi@polimi.it>
Marco D. Santambrogio <marco.santambrogio@polimi.it>
Xilinx SDAccel
15/02/2018
DEIB Seminar Room
2. 2
Agenda
- Recall on Hardware Design Flow
- Introduction to SDAccel Framework
- OpenCL
- computational model
- platform
- memory model
- SDAccel Design Flow
- Kernel Specification
- Examples
3. 3
Did you register?
Use this Google Doc to provide your data
https://goo.gl/FRCG6y
First, install the VPN we have provided you.
(Mac: Tunnelblick - Windows/Linux: OpenVPN)
To SSH to the machine:
ssh <name>.<surname>@nags31.local.necst.it
password: user
4. 4
Installation Party
You can change your password here:
http://changepassword.local.necst.it/
You can also RDP to the instance using
• Microsoft Remote Desktop (Microsoft/Mac OS)
• Remmina (Linux)
To connect to the machine or change your password, you must
have started the VPN.
5. 5
Hardware Design Flow for HPC
• Hardware Design Flow (HDF): the process of realizing a
hardware module
• The HDF for FPGAs can be seen as a 2-step process:
High-Level Synthesis: from high-level code to a
Hardware Description Language (HDL)
System-Level Design: implementation on the board
[Figure: High-Level Code → FPGA]
9. 9
The Hardware Design Flow
System integration, driver generation and runtime management
10. 10
The Hardware Design Flow
• Complete automation of the 2 steps of the hardware
design flow
11. 11
Xilinx SDAccel
- Given high-level code, it completely automates the
steps of the hardware design flow
- Respects the OpenCL memory and computational
models
12. 12
OpenCL (Open Computing Language)
• Open, cross-platform parallel programming
language for heterogeneous architectures
• Standard for the development and
acceleration of data-parallel applications
• Allows writing portable accelerated code
across different devices and architectures
(FPGAs, GPGPUs, DSPs, …)
13. OpenCL Computational Model
• Work-item:
– The basic unit of work within an OpenCL device
• Global size:
– Declares an N-dimensional size of the total number of
work-items
– Size of the computational problem
size_t global[N]
• Local size:
– Declares an N-dimensional work-group size
– The number of work-items that will execute within a
work-group
size_t local[N]
14. N-Dimensional kernel range
• global and local can be 1D, 2D, or 3D, and correspond to the
dimensionality of the data to be processed
[Figure: 1D, 2D, and 3D index spaces]
16. 1-Dimensional kernel range (host code)
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 1;
err = clEnqueueNDRangeKernel(
commands, kernel, 1, NULL,
global, local,
0, NULL, NULL
);
Global and local sizes of dimension 1
1-Dimensional Kernel
→ 10 total work-items
→ work-group size of 1 work-item
17. 1-Dimensional kernel range
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 1;
[Figure: HOST ↔ Communication System ↔ OpenCL device; the device
contains a Compute Unit with one PE (Processing Element)]
Work-item: maps to a PE
Work-group: mapped to a compute unit
22. 1-Dimensional kernel range
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 2;
[Figure: HOST ↔ Communication System ↔ OpenCL device; the Compute
Unit now contains two PEs (Processing Elements)]
Work-items: map to PEs
Work-group: mapped to a compute unit
25. 1-Dimensional kernel range
size_t global[1];
size_t local[1];
global[0] = 10;
local[0] = 2;
[Figure: OpenCL device with two Compute Units, each containing two
PEs; work-items map to PEs, work-groups to compute units]
Increased parallelism:
2 compute units working
in parallel on different
work items
27. 2-Dimensional kernel range
• global and local can be 1D, 2D, or 3D, and correspond to the
dimensionality of the data to be processed
28. 2-Dimensional kernel range (host code)
size_t global[2];
size_t local[2];
global[0] = 10;
global[1] = 10;
local[0] = 2;
local[1] = 2;
err = clEnqueueNDRangeKernel(
commands, kernel, 2, NULL,
global, local,
0, NULL, NULL
);
Global and local sizes of dimension 2
2-Dimensional Kernel
→ 10x10 total work-items
→ work-group size of 2x2 work-items
29. 29
2-Dimensional kernel range
[Figure: a 10x10 problem grid (Problem Size Dim 1 and Dim 2) divided
into 2x2 work-groups; each work-group maps to a Compute Unit whose
PEs (Processing Elements) execute the individual work-items]
30. 30
OpenCL Platform & Memory Model
The host's responsibilities include:
- managing the operating system and
enabling drivers for all devices
- picking the correct device for
computation
- executing the application host
program
- creating and managing memory
buffers
- launching and managing kernel
execution
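These responsibilities map onto a well-known sequence of OpenCL host API calls. A minimal sketch (error handling mostly omitted; with SDAccel the program would additionally be loaded from a prebuilt .xclbin via clCreateProgramWithBinary, which is only noted here):

```c
#include <CL/cl.h>

int main(void) {
    cl_int err;

    /* 1) pick a platform and an accelerator device */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    /* 2) create a context and a command queue for that device */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    /* 3) create a memory buffer in device global memory and copy
     * host data into it */
    int a[10] = {0};
    cl_mem buf_a = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  sizeof(a), NULL, &err);
    clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, sizeof(a), a,
                         0, NULL, NULL);

    /* 4) load the kernel (with SDAccel: clCreateProgramWithBinary on
     * the .xclbin), set its arguments with clSetKernelArg, and launch
     * it with clEnqueueNDRangeKernel as shown in the earlier slides */

    clReleaseMemObject(buf_a);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
```

Running this requires an OpenCL runtime and an accelerator device, so it is meant as a reading aid rather than something to execute as-is.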
31. 31
OpenCL Platform & Memory Model
The Device:
- communicates with the host through memory-based transfers
- is reconfigured at runtime to execute
our kernel
- is divided into multiple compute units
- each compute unit executes a
work-group
- each work-group contains multiple
work-items
- a compute unit is further divided into
processing elements (PEs)
- a PE is responsible for the execution
of a work-item
33. 33
OpenCL Platform & Memory Model
Three layers of memory:
1) Global: shared between host and device (DRAM; the host accesses it via PCIe)
2) Local: accessible by all the work-items inside a compute unit (BRAM)
3) Private: accessible only to a single processing element / work-item
(registers)
The OpenCL memory abstraction does not allow the host to write directly to the
device's local or private memory; data must pass through Global Memory.
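In OpenCL C device code, the three layers correspond to the __global, __local, and (implicit) private address-space qualifiers. An illustrative kernel sketch, not taken from the slides:

```c
// OpenCL C device code (illustrative sketch).
__kernel void scale(__global const float *in,   // global: DRAM, host-visible
                    __global float *out,
                    __local float *tile) {      // local: BRAM, per work-group
    float factor = 2.0f;                        // private: registers, per work-item
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];              // stage data from global into local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // synchronize the work-group
    out[gid] = tile[lid] * factor;    // write the result back to global memory
}
```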
38. 38
Design Flow: Makefile
- compile the host code
- generate a Xilinx object (.xo) for each kernel
- link the .xo file(s) into the .xclbin to be executed
- emulate or build your application
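The steps above can be sketched as a Makefile around the xocc compiler. File, kernel, and platform names below are placeholders, not the ones used in the contest setup:

```makefile
# Illustrative sketch of the SDAccel build steps; names are placeholders.
TARGET   ?= sw_emu                 # sw_emu | hw_emu | hw
PLATFORM ?= <your_platform>

# compile the host code
host: host.cpp
	$(CXX) -o host host.cpp -lOpenCL

# generate a Xilinx object (.xo) for the kernel
vadd.xo: vadd.cl
	xocc -c -t $(TARGET) --platform $(PLATFORM) -k vadd -o vadd.xo vadd.cl

# link the .xo file(s) into the .xclbin loaded by the host
vadd.xclbin: vadd.xo
	xocc -l -t $(TARGET) --platform $(PLATFORM) -o vadd.xclbin vadd.xo
```

Switching TARGET between sw_emu, hw_emu, and hw selects software emulation, hardware emulation, or the full FPGA build.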
41. 41
OpenCL Kernel
• Simply define the OpenCL kernel and the associated
work-group size (in the following example, 10 work-items
per work-group)
• Must be called from the host as an NDRange kernel
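A hypothetical vector-addition kernel following this recipe might look like the following (an illustrative sketch, not the contest code):

```c
// OpenCL C device code (illustrative sketch).
// reqd_work_group_size fixes the work-group size the tool builds for:
// here 10 work-items per work-group, launched as an NDRange kernel.
__attribute__((reqd_work_group_size(10, 1, 1)))
__kernel void vadd(__global const int *a,
                   __global const int *b,
                   __global int *c) {
    int i = get_global_id(0);   // one element per work-item
    c[i] = a[i] + b[i];
}
```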
42. 42
C/C++ Kernel
• Use the standard AXI Master and AXI Lite interfaces, as in
Vivado HLS
• All memory ports must be mapped to the same bundle
• Include your kernel code within an extern "C" block
• Must be called from the host as a simple task
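A C/C++ kernel following these rules might be sketched as below (illustrative; names and the vector-addition body are assumptions, while the pragma style follows the Vivado HLS conventions the slide refers to). Note that all memory ports share the same m_axi bundle:

```cpp
// C/C++ kernel sketch (illustrative). The pragmas map the pointer
// arguments to one AXI Master bundle and the control signals to an
// AXI Lite interface; the host calls it as a simple task.
extern "C" {
void vadd(const int *a, const int *b, int *c, int n) {
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=c bundle=control
#pragma HLS INTERFACE s_axilite port=n bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
}
```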
52. 52
RTL Kernel
1) write your code using a HDL
(Verilog/VHDL/Chisel HDL,
etc...)
2) Integrate your HDL into
SDAccel and generate a
Xilinx Object (.xo)
3) Perform Hardware Emulation
to check correctness
4) Build for FPGA
53. 53
Examples
- Let's start with the Vector Addition code presented by
Emanuele last time.
- Let's produce a C/C++ version and an OpenCL one.
Example code is available on NAGS31 @
/sdaccel_contest/
108. 108
This is only the beginning!!
For more information, read the SDAccel manual(s):
https://www.xilinx.com/support/documentation-navigation/development-tools/software-development/sdaccel.html
109. 109
Feedback
• We are working on improving this course; would you
share your feedback on this lesson?
https://goo.gl/forms/mcmtcojJEqFTpg8j1
110. Thank You for the
Attention!
110
Lorenzo Di Tucci
lorenzo.ditucci@polimi.it
Emanuele Del Sozzo
emanuele.delsozzo@polimi.it
Marco Rabozzi
marco.rabozzi@polimi.it
Marco D. Santambrogio
marco.santambrogio@polimi.it