May 1, 2013 1
OpenCL for ALTERA FPGAs
Accelerating performance and design
productivity
Liad Weinberger – Appilo
May 1st, 2013
Technology trends
• Over the past years
– Technology scaling favors programmability and parallelism
[Figure: spectrum of device architectures. Categories: Single Cores; Multi-Cores (Coarse-Grained CPUs and DSPs); Coarse-Grained Massively Parallel Processor Arrays; Fine-Grained Massively Parallel Arrays. Device labels: CPUs, DSPs, Multi-Cores, Array, GPGPUs, FPGAs]
Technology trends
[Figure: process node (nm) versus year, 2000 to 2022; vertical axis from 0 to 140 nm]
• Moore’s law still in effect
– More FPGA real-estate
• More potential for parallelism – an extremely good thing!
• Designs that use this real estate become harder to
manage and maintain – this is not so good...
Technology trends
[Figure: Google Trends, worldwide interest over the years 2007 to 2013, for Verilog + VHDL]
• Decreased interest
– Number of Google searches for VHDL or
Verilog in decline
Technology trends
[Figure: Google Trends, interest over the years 2007 to 2013, for Verilog + VHDL compared with Python]
• Software development keeps momentum
– Number of Google searches for Python (as a
representative language)
FPGA (hardware) development
• Design (programming) is complex
– Define state machines, data paths, arbitration, IP interfaces, etc.
– Sophisticated iterative compilation process
• Synthesis, technology mapping, clustering, placement and routing, timing closure
• Leads to long compilation times (hours vs. minutes in software)
– Debug process is also very time-consuming
• Code is not portable
– Written in Verilog / VHDL
• Can’t re-target for CPUs, GPUs, DSPs, etc.
• Not scalable
[Figure: HDL design loop with the stages HDL, Set Constraints, Compilation, and Timing Closure]
Software development
• Programming is straightforward
– Ideas are expressed in languages such as C/C++/Python/etc.
• Typically, start with simple sequential implementation
• Use parallel APIs / language extensions, in order to exploit multi-core
architectures for additional performance
– Compilation times are usually reasonably short
• Simple, straightforward compilation/linking process
– Immediate feedback when debugging/profiling
• An assortment of tools available for both debugging and profiling
• Portability is still an issue
– Possible, but requires pre-planning
[Figure: software build flow; sources in C/C++, Python, etc. pass through the Compiler & Linker]
Product development point-of-view
• Product producers want:
– Lower development and maintenance costs
– Competitive edge
• Higher performance
• Short time-in-market, and short time-to-market
– Agile development methods are becoming more and more popular
– Can’t afford long development cycles
– Trained developers with established experience
• Or cost-effective path for training new developers
– Flexibility
• Avoiding vendor lock-in is preferred
• Ability to rapidly adapt product to market requirement changes
Our challenge
• How do we bring the FPGA design process closer to the
software development model?
– Need to make FPGAs more accessible to the software development
community
• Change in mind-set: look at FPGAs as massively multi-core devices that
could be used in order to accelerate parallel applications
• A programming model that allows that
• Shorter compilation times and faster feedback for debugging and profiling
the design
An ideal programming environment...
• Based on a standard programming model
– Rather than something which is FPGA-specific
• Abstracts away the underlying details of the hardware
– VHDL / Verilog are similar to “assembly language” programming
– Useful in rare circumstances where the highest possible efficiency is needed
• The price of abstraction is not too high
– Still need to efficiently use the FPGA’s resources to achieve high throughput / low
area
• Allows for software-like compilation & debug cycles
– Faster compile times
– Profiling & user feedback
Introducing OpenCL
Parallel heterogeneous computing
A case for OpenCL
• What is OpenCL?
– An open, royalty-free standard for cross-platform parallel software programming of
heterogeneous systems
• CPU + DSPs
• CPU + GPUs
• CPU + FPGAs
• Or maybe all together
– Maintained by the Khronos Group
• An industry consortium creating open, royalty-free standards
• Composed of hardware and software vendors
– Enables software to leverage silicon acceleration
• Consists of two major parts:
– Application Programming Interface (API) for device management
– Device programming language based on C99 with
some restrictions and extensions to support explicit parallelism
Benefits of OpenCL
• Cross-vendor software portability
– Functional portability: the same code will normally execute on
hardware from different vendors
– Not performance portable: code still needs to be optimized for a
specific device (or at least a device class)
• Allows for the management of available computational
resources under a single framework
– Views CPUs, GPUs, FPGAs, and other accelerators as devices that
could carry the computational needs of the application
OpenCL program structure
• Separation between managerial and computational code bases
– Managerial code executes on a host CPU
• Any type of conventional micro-processor
• Written in any language that has bindings for the OpenCL API
– The API is defined in ANSI C
– There is a formal C++ binding
– Other bindings may exist
– Computational code executes on the compute devices (accelerators)
• Written in a language called OpenCL C
– Based on C99
– Adds restrictions and extensions for explicit parallelism
• Can be compiled either offline, or online, depending on implementation
• Will most likely consist only of those portions of the application we want to accelerate
OpenCL program structure
[Figure: OpenCL platform model. Labels: Host, Compute Device, Global Mem, Accelerators (compute units), each with its own Local Mem]

Kernel Program:
__kernel void
sum(__global float *a,
    __global float *b,
    __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

Host Program:
main() {
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );
    clEnqueueNDRangeKernel( …, sum, … );
    clEnqueueReadBuffer( … );
    display_result( … );
}
OpenCL host application
• Communicates with the Accelerator Device via a set of
library routines
– Abstracts away host processor to HW accelerator communication via
a set of API calls
main() {
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );              // copy data, Host → FPGA
    clEnqueueNDRangeKernel( …, sum, … );    // ask the FPGA to run a particular kernel
    clEnqueueReadBuffer( … );               // copy data, FPGA → Host
    display_result( … );
}
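The three enqueue calls follow a copy-in, launch, copy-out pattern. The sketch below imitates that pattern in plain C so that it runs without an OpenCL runtime. `device_buffer`, `device_alloc`, `write_buffer`, `run_sum_kernel`, `read_buffer`, and `host_vector_add` are hypothetical names invented for this illustration, not OpenCL API functions, and the "device memory" is simply a separate heap allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a buffer that lives in accelerator memory. */
typedef struct {
    float *data;
    size_t count;
} device_buffer;

device_buffer device_alloc(size_t count) {
    device_buffer buf = { malloc(count * sizeof(float)), count };
    assert(buf.data != NULL);
    return buf;
}

/* Plays the role of clEnqueueWriteBuffer: host -> device copy. */
void write_buffer(device_buffer dst, const float *src) {
    memcpy(dst.data, src, dst.count * sizeof(float));
}

/* Plays the role of clEnqueueReadBuffer: device -> host copy. */
void read_buffer(float *dst, device_buffer src) {
    memcpy(dst, src.data, src.count * sizeof(float));
}

/* Plays the role of clEnqueueNDRangeKernel for the sum kernel:
   one loop iteration per work-item. */
void run_sum_kernel(device_buffer a, device_buffer b, device_buffer y) {
    for (size_t gid = 0; gid < y.count; ++gid)
        y.data[gid] = a.data[gid] + b.data[gid];
}

/* Host-side flow: copy in, launch, copy out, release. */
void host_vector_add(const float *a, const float *b, float *y, size_t n) {
    device_buffer da = device_alloc(n);
    device_buffer db = device_alloc(n);
    device_buffer dy = device_alloc(n);
    write_buffer(da, a);
    write_buffer(db, b);
    run_sum_kernel(da, db, dy);
    read_buffer(y, dy);
    free(da.data);
    free(db.data);
    free(dy.data);
}
```

The host never touches `dy.data` directly; results become visible to it only after `read_buffer`, which mirrors the explicit transfers in the host program.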
OpenCL kernels
• Data-parallel function
– Executed by many parallel
threads
• Each thread has an identifier
which could be obtained with
a call to the get_global_id()
built-in function
• Uses qualifiers to define
where memory buffers reside
• Executed by a
compute device
– CPU
– GPU
– FPGA
– Other accelerator
float *a = 0 1 2 3 4 5 6 7
float *b = 7 6 5 4 3 2 1 0
float *y = 7 7 7 7 7 7 7 7

__kernel void
sum(__global float *a,
    __global float *b,
    __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

__kernel void sum( … );
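To make the role of the thread identifier concrete, here is a minimal emulation in ordinary C, assuming a one-dimensional range. `emu_get_global_id`, `sum_work_item`, and `launch_sum` are hypothetical names for this sketch; a real OpenCL runtime supplies `get_global_id()` itself and may run the work-items in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Emulation state: the ID of the work-item currently executing.
   A real runtime provides this per thread; here the launcher sets it
   (an assumption of this sketch). */
static size_t current_global_id;

/* Hypothetical stand-in for OpenCL C's get_global_id(0). */
size_t emu_get_global_id(void) { return current_global_id; }

/* The kernel body from the slide, written as ordinary C. */
void sum_work_item(const float *a, const float *b, float *y) {
    size_t gid = emu_get_global_id();
    y[gid] = a[gid] + b[gid];
}

/* Runs the kernel once per work-item in a 1-D range of global_size items.
   Items run one after another here; running them in parallel would be
   safe because each item writes only its own element of y. */
void launch_sum(const float *a, const float *b, float *y, size_t global_size) {
    assert(a != NULL && b != NULL && y != NULL);
    for (size_t id = 0; id < global_size; ++id) {
        current_global_id = id;
        sum_work_item(a, b, y);
    }
}
```

With the slide's inputs (a = 0..7, b = 7..0), every element of y comes out as 7.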
OpenCL on FPGAs
How does it map?
Compiling OpenCL to FPGAs
[Figure: compilation flow. The OpenCL source (host program + kernels) is split in two: kernel programs go to the ACL compiler, which produces an SOF; the host program goes to a standard C compiler, which produces an x86 binary. The x86 host and the FPGA are linked by PCIe]

Kernel Programs:
__kernel void
sum(__global float *a,
    __global float *b,
    __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

Host Program:
main() {
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );
    clEnqueueNDRangeKernel( …, sum, … );
    clEnqueueReadBuffer( … );
    display_result( … );
}
Compiling OpenCL to FPGAs
[Figure: custom hardware for your kernels. Six replicated Load, Load, Store pipelines on the FPGA, with PCIe and DDRx interfaces]

Kernel Programs:
__kernel void
sum(__global float *a,
    __global float *b,
    __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}
FPGA architecture for OpenCL
[Figure: FPGA architecture for OpenCL. Labels: x86 / External Processor, PCIe, Kernel System, three Kernel Pipelines, Local Memory Interconnect with on-chip Memory blocks, Global Memory Interconnect, two External Memory Controller & PHY blocks, DDR*]
Mapping multithreaded kernels to FPGAs
• The simplest way of mapping kernel functions to FPGAs is
to replicate hardware for each thread
– Inefficient and wasteful
• Technique: deep pipeline parallelism
– Attempt to create a deeply pipelined representation of a kernel
– On each clock cycle, we attempt to send in input data for a new
thread
– A method of mapping coarse-grained thread parallelism to fine-grained
FPGA parallelism
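The pipelining idea can be sketched as a small occupancy model. It assumes a three-stage pipeline (load, add, store) in which every stage takes exactly one cycle and a new thread enters on every cycle; real memory loads have variable latency, so this is an idealisation. The function and constant names are hypothetical.

```c
#include <assert.h>

enum { STAGE_LOAD = 0, STAGE_ADD = 1, STAGE_STORE = 2, NUM_STAGES = 3 };

/* Returns the ID of the thread occupying `stage` during `cycle`,
   or -1 if that stage is idle. Thread t enters the Load stage in
   cycle t and advances one stage per cycle. */
int thread_in_stage(int cycle, int stage, int num_threads) {
    assert(stage >= 0 && stage < NUM_STAGES);
    int t = cycle - stage;
    return (t >= 0 && t < num_threads) ? t : -1;
}

/* Cycles needed to drain num_threads through the pipeline:
   NUM_STAGES cycles for the first thread, then one more per extra thread. */
int pipelined_cycles(int num_threads) {
    return num_threads > 0 ? num_threads + NUM_STAGES - 1 : 0;
}

/* Cycles needed if each thread had to finish before the next one starts. */
int unpipelined_cycles(int num_threads) {
    return num_threads * NUM_STAGES;
}
```

Under these assumptions, 8 threads finish in 10 cycles when pipelined, compared with 24 cycles if each thread ran to completion before the next began, and only one copy of the hardware is needed.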
Example pipeline for vector add
• On each cycle, the portions of
the pipeline are processing
different threads
• While thread 2 is being loaded,
thread 1 is being added, and
thread 0 is being stored
[Figure, five animation frames: a pipeline with two Load units feeding an adder (+) that feeds a Store unit; 8 threads for vector add example, thread IDs 0 to 7]

Frame   Waiting            Load   +    Store   Done
1       0 1 2 3 4 5 6 7    -      -    -       -
2       1 2 3 4 5 6 7      0      -    -       -
3       2 3 4 5 6 7        1      0    -       -
4       3 4 5 6 7          2      1    0       -
5       4 5 6 7            3      2    1       0
Some examples
Using ALTERA’s OpenCL solution
AES encryption
• Counter (CTR) based encryption/decryption
– 256-bit key
• Advantage FPGA
– Integer arithmetic
– Coarse-grained bit operations
– Complex decision making
• Results

Platform                  Throughput (GB/s)
E5503 Xeon Processor      0.01 (single core)
AMD Radeon HD 7970        0.33
PCIe385 A7 Accelerator    5.20

• 42% utilization (2 kernels)
– Power conservation
– Fill up for even higher performance
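Counter mode suits a data-parallel device because every block is processed independently of its neighbours. The sketch below shows only that structure. It is not AES: `toy_block_encrypt` is a hypothetical 64-bit mixing function standing in for the block cipher, with no security value, and the 64-bit word size and 64-bit key are simplifications of the 128-bit blocks and 256-bit key a real AES-CTR implementation would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 64-bit mixing function standing in for the block cipher.
   This is NOT AES and offers no security; it only lets the CTR
   structure be shown in a few lines. */
uint64_t toy_block_encrypt(uint64_t key, uint64_t block) {
    uint64_t x = block ^ key;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x ^ key;
}

/* CTR mode over 64-bit words: word i is XORed with E_key(nonce + i).
   Each iteration depends only on its own index, so the loop body maps
   directly onto independent work-items. The same routine decrypts. */
void ctr_xcrypt(uint64_t key, uint64_t nonce,
                const uint64_t *in, uint64_t *out, size_t n_words) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n_words; ++i) {
        uint64_t keystream = toy_block_encrypt(key, nonce + (uint64_t)i);
        out[i] = in[i] ^ keystream;
    }
}
```

Applying `ctr_xcrypt` twice with the same key and nonce returns the original data, which is why a single kernel can serve both encryption and decryption.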
Multi-asset barrier option pricing
• Monte Carlo simulation
– Heston model
dS_t = \mu S_t \, dt + \sqrt{\nu_t} \, S_t \, dW_t^S
d\nu_t = \kappa (\theta - \nu_t) \, dt + \xi \sqrt{\nu_t} \, dW_t^\nu
– ND range
• Assets x paths (64x1000000)
• Advantage FPGA
– Complex control flow
• Results

Platform                  Power (W)   Performance (Msims/s)   Msims/W
W3690 Xeon Processor      130         32                      0.25
nVidia Tesla C2075        225         63                      0.28
PCIe385 D5 Accelerator    23          170                     7.40
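A Monte Carlo engine needs a discrete-time version of the Heston dynamics. The slides do not say which scheme was used, so the following is only one common choice, stated as an assumption: a log-Euler step for the asset price combined with full truncation of the variance. Here S is the asset price, ν its variance, μ the drift, κ the mean-reversion speed, θ the long-run variance, ξ the volatility of variance, ρ the correlation between the two driving Brownian motions, and Δt the time step.

```latex
\begin{aligned}
\nu_t^{+} &= \max(\nu_t, 0) \\
S_{t+\Delta t} &= S_t \exp\!\Big( \big(\mu - \tfrac{1}{2}\nu_t^{+}\big)\,\Delta t
                  + \sqrt{\nu_t^{+}\,\Delta t}\; Z_S \Big) \\
\nu_{t+\Delta t} &= \nu_t + \kappa\,(\theta - \nu_t^{+})\,\Delta t
                  + \xi \sqrt{\nu_t^{+}\,\Delta t}\; Z_\nu \\
Z_\nu &= \rho\, Z_S + \sqrt{1-\rho^{2}}\; Z_\perp,
         \qquad Z_S,\ Z_\perp \sim \mathcal{N}(0,1)\ \text{independent}
\end{aligned}
```

Each path applies this update repeatedly and checks the barrier condition at every step. Paths never exchange data, so each (asset, path) pair can be one work-item of the 64 x 1000000 ND range, and the per-step barrier check is the kind of data-dependent control flow the slide lists as an FPGA advantage.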
Document filtering
• Unstructured data analytics
– Bloom Filter
• Advantage FPGA
– Integer arithmetic
– Flexible memory configuration
• Results

Platform                          Power (W)   Performance (MTs)   MTs/W
W3690 Xeon Processor              130         2070                15.92
nVidia Tesla C2075                215         3240                15.07
DE4 Stratix IV-530 Accelerator    21          1755                83.57
PCIe385 A7 Accelerator            25          3602                144.08
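For readers unfamiliar with the data structure, a Bloom filter can be sketched in a few lines of integer-only C. The filter size, the number of probes, and the seeded FNV-1a hash are arbitrary choices made for this illustration; the slides do not describe the configuration used in the benchmark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS 1024u   /* filter size in bits (arbitrary for the sketch) */
#define BLOOM_HASHES 3u    /* number of hash probes per term */

typedef struct {
    uint8_t bits[BLOOM_BITS / 8];
} bloom_filter;

/* FNV-1a over the term, with a per-probe seed mixed into the start value.
   Only integer XOR, multiply, and modulo are used. */
uint32_t bloom_hash(const char *term, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    for (const char *p = term; *p != '\0'; ++p) {
        h ^= (uint8_t)*p;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

void bloom_init(bloom_filter *f) {
    memset(f->bits, 0, sizeof f->bits);
}

/* Sets one bit per probe for the given term. */
void bloom_add(bloom_filter *f, const char *term) {
    assert(f != NULL && term != NULL);
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = bloom_hash(term, k * 0x9e3779b9u);
        f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* False means "definitely absent"; true means "possibly present". */
bool bloom_maybe_contains(const bloom_filter *f, const char *term) {
    assert(f != NULL && term != NULL);
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = bloom_hash(term, k * 0x9e3779b9u);
        if ((f->bits[bit / 8] & (1u << (bit % 8))) == 0)
            return false;
    }
    return true;
}
```

A filter never reports an inserted term as absent, but it can report an absent term as possibly present. The bit array is a natural fit for on-chip memory, which relates to the "flexible memory configuration" point on the slide.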
Fractal video compression
• Best matching codebook
– Correlation with SAD
• Advantage FPGA
– Integer arithmetic
• Results

Platform                          Power (W)   Performance (FPS)   FPS/W
W3690 Xeon Processor              130         4.6                 0.035
nVidia Tesla C2075                215         53.1                0.247
DE4 Stratix IV-530 Accelerator    21          70.9                3.376
PCIe385 A7 Accelerator            25          74.4                2.976
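The codebook search at the heart of this workload can be sketched as a sum of absolute differences (SAD) over pixel blocks. This is a simplified illustration: it uses 4x4 blocks flattened to 16 bytes and ignores the brightness and contrast adjustment a full fractal coder would apply before comparing. `block_sad` and `best_codebook_match` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_PIXELS 16  /* a 4x4 block flattened to 16 pixels (sketch size) */

/* Sum of absolute differences between two blocks; lower means closer. */
uint32_t block_sad(const uint8_t *a, const uint8_t *b) {
    uint32_t sad = 0;
    for (size_t i = 0; i < BLOCK_PIXELS; ++i)
        sad += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sad;
}

/* Returns the index of the codebook entry with the smallest SAD against
   `target`. The codebook is stored flat: entry e starts at
   codebook[e * BLOCK_PIXELS]. Ties resolve to the lowest index. */
size_t best_codebook_match(const uint8_t *target,
                           const uint8_t *codebook,
                           size_t entries) {
    assert(target != NULL && codebook != NULL && entries > 0);
    size_t best = 0;
    uint32_t best_sad = block_sad(target, codebook);
    for (size_t e = 1; e < entries; ++e) {
        uint32_t sad = block_sad(target, codebook + e * BLOCK_PIXELS);
        if (sad < best_sad) {
            best_sad = sad;
            best = e;
        }
    }
    return best;
}
```

Every comparison is independent integer arithmetic on small operands, so many `block_sad` evaluations can proceed at once in hardware.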

TRACK F: OpenCL for ALTERA FPGAs, Accelerating performance and design productivity/ Liad Weinberger

  • 1. May 1, 2013 1 OpenCL for ALTERA FPGAs Accelerating performance and design productivity Liad Weinberger – Appilo May 1st, 2013
  • 2. May 1, 2013 2 Technology trends • Over the past years – Technology scaling favors programmability and parallelism Fine-Grained Massively Parallel Arrays Single Cores Coarse-Grained Massively Parallel Processor Arrays Multi-Cores Coarse-Grained CPUs and DSPs CPUs DSPs Multi-Cores Array GPGPUs FPGAs
  • 3. May 1, 2013 3 Technology trends 0 20 40 60 80 100 120 140 2000 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020 2022 Process node (nm) • Moore’s law still in effect – More FPGA real-estate • More potential for parallelism – an extremely good thing! • Designs that utilize this real-estate, becomes harder to manage and maintain – this is not so good...
  • 4. May 1, 2013 4 Technology trends 2007 2008 2009 2010 2011 2012 2013 Google trends Worldwide Interest over the years Verilog + VHDL • Decreased interest – Number of Google searches for VHDL or Verilog in decline
  • 5. May 1, 2013 5 Technology trends 2007 2008 2009 2010 2011 2012 2013 Google trends Interest over the years Verilog + VHDL Python • Software development keeps momentum – Number of Google searches for Python (as a representing language)
  • 6. May 1, 2013 6 FPGA (hardware) development • Design (programming) is complex – Define state machine, data-paths, arbitration, IP interfaces, etc. – Sophisticated iterative compilation process • Synthesis, technology mapping, clustering, placement and routing, timing closure • Leads to long compilation times (hours vs. minutes in software) – Debug process is also very time-consuming • Code is not portable – Written in Verilog / VHDL • Can’t re-target for CPUs, GPUs, DSPs, etc. • Not scalable Compilation HDL Timing Closure Set Constraints
  • 7. May 1, 2013 7 Software development • Programming is straight-forward – Ideas are expressed in languages such as C/C++/Python/etc. • Typically, start with simple sequential implementation • Use parallel APIs / language extensions, in order to exploit multi-core architectures for additional performance – Compilation times are usually reasonably short • Simple straight-forward compilation/linking process – Immediate feedback when debugging/profiling • An assortment of tools available for both debugging and profiling • Portability is still an issue – Possible, but require pre-planning Compiler &Linker C/C++ Python etc. C/C++ Python etc. C/C++/ Python/ etc.
  • 8. May 1, 2013 8 Product development point-of-view • Product producers want: – Lower development and maintenance costs – Competitive edge • Higher performance • Short time-in-market, and short time-to-market – Agile development methods are becoming more and more popular – Can’t afford long development cycles – Trained developers with established experience • Or cost-effective path for training new developers – Flexibility • No vendor-locking is preferred • Ability to rapidly adapt product to market requirement changes
  • 9. May 1, 2013 9 Our challenge • How do we bring FPGA design process closer to the software development model? – Need to make FPGAs more accessible to the software development community • Change in mind-set: look at FPGAs as massively multi-core devices that could be used in order to accelerate parallel applications • A programming model that allows that • Shorter compilation times and faster feedback for debugging and profiling the design
  • 10. May 1, 2013 10 An ideal programming environment... • Based on a standard programming model – Rather than something which is FPGA-specific • Abstracts away the underlying details of the hardware – VHDL / Verilog are similar to “assembly language” programming – Useful in rare circumstances where the highest possible efficiency is needed • The price of abstraction is not too high – Still need to efficiently use the FPGA’s resources to achieve high throughput / low area • Allows for software-like compilation & debug cycles – Faster compile times – Profiling & user feedback
  • 11. Introducing OpenCL: parallel heterogeneous computing
  • 12. A case for OpenCL
      • What is OpenCL?
          – An open, royalty-free standard for cross-platform parallel software programming of heterogeneous systems
              • CPU + DSPs
              • CPU + GPUs
              • CPU + FPGAs
              • Or maybe all together
          – Maintained by the Khronos Group
              • An industry consortium creating open, royalty-free standards
              • Comprised of hardware and software vendors
          – Enables software to leverage silicon acceleration
      • Consists of two major parts:
          – Application Programming Interface (API) for device management
          – Device programming language based on C99, with some restrictions and with extensions to support explicit parallelism
  • 13. Benefits of OpenCL
      • Cross-vendor software portability
          – Functional portability: the same code would normally execute on different hardware from different vendors
          – Not performance portable: code still needs to be optimized for a specific device (or at least a device class)
      • Allows for the management of available computational resources under a single framework
          – Views CPUs, GPUs, FPGAs, and other accelerators as devices that can carry the computational needs of the application
  • 14. OpenCL program structure
      • Separation between managerial and computational code bases
          – Managerial code executes on a host CPU
              • Any type of conventional microprocessor
              • Written in any language that has bindings for the OpenCL API
                  – The API is in ANSI C
                  – There is a formal C++ binding
                  – Other bindings may exist
          – Computational code executes on the compute devices (accelerators)
              • Written in a language called OpenCL C
                  – Based on C99
                  – Adds restrictions and extensions for explicit parallelism
              • Can be compiled either offline or online, depending on the implementation
              • Will most likely consist only of those portions of the application we want to accelerate
  • 15. OpenCL program structure
      [Diagram: Host and Compute Device; the compute device contains Global Mem and compute units, each pairing Local Mem with an Accelerator]
      Host program:
          main() {
            read_data( … );
            manipulate( … );
            clEnqueueWriteBuffer( … );
            clEnqueueNDRangeKernel(…,sum,…);
            clEnqueueReadBuffer( … );
            display_result( … );
          }
      Kernel program:
          __kernel void sum(__global float *a,
                            __global float *b,
                            __global float *y)
          {
            int gid = get_global_id(0);
            y[gid] = a[gid] + b[gid];
          }
  • 16. OpenCL host application
      • Communicates with the accelerator device via a set of library routines
          – Abstracts away host-processor-to-hardware-accelerator communication via a set of API calls
      Host program:
          main() {
            read_data( … );
            manipulate( … );
            clEnqueueWriteBuffer( … );         // Copy data Host → FPGA
            clEnqueueNDRangeKernel(…,sum,…);   // Ask the FPGA to run a particular kernel
            clEnqueueReadBuffer( … );          // Copy data FPGA → Host
            display_result( … );
          }
  • 17. OpenCL kernels
      • Data-parallel function
          – Executed by many parallel threads
              • Each thread has an identifier, which can be obtained with a call to the get_global_id() built-in function
          • Uses qualifiers to define where memory buffers reside
      • Executed by a compute device
          – CPU
          – GPU
          – FPGA
          – Other accelerator
      Example data for the kernel __kernel void sum( … ):
          float *a = 0 1 2 3 4 5 6 7
          float *b = 7 6 5 4 3 2 1 0
          float *y = 7 7 7 7 7 7 7 7
      Kernel program:
          __kernel void sum(__global float *a,
                            __global float *b,
                            __global float *y)
          {
            int gid = get_global_id(0);
            y[gid] = a[gid] + b[gid];
          }
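The data-parallel model on this slide can be emulated in plain C without an OpenCL runtime. The sketch below is an illustration only: `sum_work_item` mirrors the body of the slide's `sum` kernel, with the `gid` parameter standing in for `get_global_id(0)`, and `run_sum_ndrange` is a hypothetical helper (not part of the OpenCL API) that plays the role of a 1-D NDRange launch by visiting every work-item in turn.

```c
#include <stddef.h>

/* Body of the slide's `sum` kernel for a single work-item.
 * `gid` stands in for the value get_global_id(0) would return. */
void sum_work_item(size_t gid, const float *a, const float *b, float *y)
{
    y[gid] = a[gid] + b[gid];
}

/* Hypothetical helper: emulates a 1-D NDRange launch of `global_size`
 * work-items by running them one after another. A real device is free
 * to run them in parallel and in any order, which is safe here because
 * each work-item touches only its own element. */
void run_sum_ndrange(size_t global_size,
                     const float *a, const float *b, float *y)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        sum_work_item(gid, a, b, y);
}
```

Calling `run_sum_ndrange(8, a, b, y)` with the slide's inputs (`a` = 0..7, `b` = 7..0) fills `y` with eight 7s, matching the values shown on the slide.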
  • 18. OpenCL on FPGAs: how does it map?
  • 19. Compiling OpenCL to FPGAs
      [Diagram: OpenCL host program + kernels; kernel programs → ACL Compiler → SOF (FPGA, over PCIe); host program → Standard C Compiler → x86 binary (x86 host)]
      Kernel programs:
          __kernel void sum(__global float *a,
                            __global float *b,
                            __global float *y)
          {
            int gid = get_global_id(0);
            y[gid] = a[gid] + b[gid];
          }
      Host program:
          main() {
            read_data( … );
            manipulate( … );
            clEnqueueWriteBuffer( … );
            clEnqueueNDRangeKernel(…,sum,…);
            clEnqueueReadBuffer( … );
            display_result( … );
          }
  • 20. Compiling OpenCL to FPGAs
      [Diagram: custom hardware for your kernels; the kernel becomes replicated Load, Load, Store pipelines connected to PCIe and DDRx]
      Kernel programs:
          __kernel void sum(__global float *a,
                            __global float *b,
                            __global float *y)
          {
            int gid = get_global_id(0);
            y[gid] = a[gid] + b[gid];
          }
  • 21. FPGA architecture for OpenCL
      [Diagram: the FPGA holds a kernel system of several kernel pipelines; a global memory interconnect links them to PCIe (to the x86 / external processor) and to the external memory controllers & PHY (to DDR*); a local memory interconnect links them to on-chip memory blocks]
  • 22. Mapping multithreaded kernels to FPGAs
      • The simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread
          – Inefficient and wasteful
      • Technique: deep pipeline parallelism
          – Attempt to create a deeply pipelined representation of a kernel
          – On each clock cycle, attempt to send in input data for a new thread
          – A method of mapping coarse-grained thread parallelism to fine-grained FPGA parallelism
  • 23. Example pipeline for vector add
      • On each cycle, the portions of the pipeline are processing different threads
      • While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
      [Diagram: 8 threads (IDs 0–7) for the vector add example enter a pipeline of two Load units, an adder (+), and a Store unit; slides 24–27 repeat this slide, advancing the animation by one thread per cycle]
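The pipeline animation can be modelled with a few lines of C. This is a simplified sketch, not a description of the hardware the compiler generates: it assumes exactly three stages (load, add, store), one clock cycle per stage, and one new thread issued per cycle. Real memory loads usually take many cycles, so an actual pipeline is much deeper. The function names are hypothetical.

```c
#define NUM_STAGES 3  /* stage 0 = load, 1 = add, 2 = store (assumed) */

/* Returns the ID of the thread occupying `stage` at clock `cycle`,
 * or -1 if that stage is empty (pipeline still filling or draining).
 * Thread t enters stage 0 at cycle t and advances one stage per cycle. */
int thread_in_stage(int cycle, int stage, int num_threads)
{
    int t = cycle - stage;
    return (t >= 0 && t < num_threads) ? t : -1;
}

/* Cycles needed to push every thread through the pipeline:
 * one cycle per thread, plus the time to drain the remaining stages. */
int pipelined_cycles(int num_threads)
{
    return num_threads > 0 ? num_threads + NUM_STAGES - 1 : 0;
}

/* For comparison: cycles if a single non-pipelined datapath finished
 * each thread completely before starting the next one. */
int sequential_cycles(int num_threads)
{
    return num_threads * NUM_STAGES;
}
```

Under these assumptions, at cycle 2 the load stage holds thread 2, the add stage thread 1, and the store stage thread 0, as the slide states. The 8 threads complete in 10 cycles instead of 24, and the gap widens as the thread count grows.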
  • 28. Some examples: using ALTERA's OpenCL solution
  • 29. AES encryption
      • Counter (CTR) based encryption/decryption
          – 256-bit key
      • Advantage FPGA
          – Integer arithmetic
          – Coarse-grained bit operations
          – Complex decision making
      • Results
          Platform                 Throughput (GB/s)
          E5503 Xeon Processor     0.01 (single core)
          AMD Radeon HD 7970       0.33
          PCIe385 A7 Accelerator   5.20
          – 42% utilization (2 kernels)
              • Power conservation
              • Fill up for even higher performance
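CTR mode suits a data-parallel kernel because every block is processed independently: block i is XORed with the cipher output for counter value nonce + i. The sketch below shows only that structure. `toy_block_encrypt` is a placeholder mixing function, NOT AES and not secure, and it works on 64-bit blocks rather than AES's 128-bit blocks; all names are hypothetical and are not taken from the benchmark on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the block cipher. This is NOT AES and NOT secure:
 * it is a simple 64-bit bit-mixer used only so the sketch is runnable. */
uint64_t toy_block_encrypt(uint64_t key, uint64_t counter_block)
{
    uint64_t z = counter_block ^ key;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* One CTR work-item: XOR block i with the keystream block derived from
 * (nonce + i). No work-item depends on any other, so all of them could
 * run in parallel or stream through a pipeline. */
void ctr_work_item(size_t i, uint64_t key, uint64_t nonce,
                   const uint64_t *in, uint64_t *out)
{
    out[i] = in[i] ^ toy_block_encrypt(key, nonce + (uint64_t)i);
}

/* Sequential stand-in for launching n_blocks work-items.
 * The same routine encrypts and decrypts, since XOR is its own inverse. */
void ctr_crypt(uint64_t key, uint64_t nonce,
               const uint64_t *in, uint64_t *out, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks; ++i)
        ctr_work_item(i, key, nonce, in, out);
}
```

Running `ctr_crypt` twice with the same key and nonce returns the original data, which is why the slide lists encryption and decryption together.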
  • 30. Multi-asset barrier option pricing
      • Monte Carlo simulation
          – Heston model
              dS_t = \mu S_t \, dt + \sqrt{\nu_t} \, S_t \, dW_t^{S}
              d\nu_t = \kappa (\theta - \nu_t) \, dt + \xi \sqrt{\nu_t} \, dW_t^{\nu}
          – ND range
              • Assets × paths (64 × 1,000,000)
      • Advantage FPGA
          – Complex control flow
      • Results
          Platform                  Power (W)   Performance (Msims/s)   Msims/W
          W3690 Xeon Processor      130         32                      0.25
          nVidia Tesla C2075        225         63                      0.28
          PCIe385 D5 Accelerator    23          170                     7.40
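The slides do not say how the Heston model is discretized for simulation. As an illustration only, one common choice is a log-Euler step for the asset price combined with a "full truncation" Euler step for the variance. The sketch assumes the standard Heston form with drift \mu, mean-reversion speed \kappa, long-run variance \theta, and volatility of variance \xi; the time step \Delta t and the correlation \rho between the two normal draws are assumed parameters.

```latex
\begin{aligned}
\nu_t^{+} &= \max(\nu_t, 0) \\
S_{t+\Delta t} &= S_t \exp\!\Big( \big(\mu - \tfrac{1}{2}\nu_t^{+}\big)\,\Delta t
                 + \sqrt{\nu_t^{+}\,\Delta t}\; Z^{S} \Big) \\
\nu_{t+\Delta t} &= \nu_t + \kappa\,(\theta - \nu_t^{+})\,\Delta t
                 + \xi \sqrt{\nu_t^{+}\,\Delta t}\; Z^{\nu} \\
Z^{S}, Z^{\nu} &\sim \mathcal{N}(0, 1), \qquad \operatorname{corr}(Z^{S}, Z^{\nu}) = \rho
\end{aligned}
```

Each (asset, path) pair repeats this update over its time steps and checks the barrier condition after every step. That per-step branching is the kind of control flow the slide lists as an FPGA advantage.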
  • 31. Document filtering
      • Unstructured data analytics
          – Bloom filter
      • Advantage FPGA
          – Integer arithmetic
          – Flexible memory configuration
      • Results
          Platform                         Power (W)   Performance (MTs)   MTs/W
          W3690 Xeon Processor             130         2070                15.92
          nVidia Tesla C2075               215         3240                15.07
          DE4 Stratix IV-530 Accelerator   21          1755                83.57
          PCIe385 A7 Accelerator           25          3602                144.08
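For readers unfamiliar with the data structure, a Bloom filter is a bit array plus a few hash functions: adding a term sets the bits its hashes select, and a lookup reports "possibly present" only if all of those bits are set. False positives can occur, false negatives cannot. The sketch below is a minimal illustration with assumed parameters (1024 bits, two hashes: FNV-1a and djb2); it does not reproduce the filter used in the benchmark.

```c
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS 1024u  /* assumed filter size, chosen for illustration */

typedef struct {
    uint8_t bits[BLOOM_BITS / 8];
} bloom_t;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* djb2 hash (h = h * 33 + c) of a NUL-terminated string. */
static uint32_t djb2(const char *s)
{
    uint32_t h = 5381u;
    while (*s)
        h = h * 33u + (uint8_t)*s++;
    return h;
}

static void set_bit(bloom_t *f, uint32_t i)
{
    f->bits[i / 8] |= (uint8_t)(1u << (i % 8));
}

static int get_bit(const bloom_t *f, uint32_t i)
{
    return (f->bits[i / 8] >> (i % 8)) & 1;
}

/* Clears every bit, producing an empty filter. */
void bloom_init(bloom_t *f)
{
    memset(f->bits, 0, sizeof f->bits);
}

/* Sets the two bits selected by the term's hashes. */
void bloom_add(bloom_t *f, const char *term)
{
    set_bit(f, fnv1a(term) % BLOOM_BITS);
    set_bit(f, djb2(term) % BLOOM_BITS);
}

/* Returns 0 if the term was definitely never added,
 * 1 if it may have been added (false positives are possible). */
int bloom_maybe_contains(const bloom_t *f, const char *term)
{
    return get_bit(f, fnv1a(term) % BLOOM_BITS) &&
           get_bit(f, djb2(term) % BLOOM_BITS);
}
```

The work is integer hashing plus single-bit memory accesses, which lines up with the two FPGA advantages named on the slide.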
  • 32. Fractal video compression
      • Best matching codebook
          – Correlation with SAD
      • Advantage FPGA
          – Integer arithmetic
      • Results
          Platform                         Power (W)   Performance (FPS)   FPS/W
          W3690 Xeon Processor             130         4.6                 0.035
          nVidia Tesla C2075               215         53.1                0.247
          DE4 Stratix IV-530 Accelerator   21          70.9                3.376
          PCIe385 A7 Accelerator           25          74.4                2.976
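SAD (sum of absolute differences) scores how closely two pixel blocks match: the lower the sum, the better the match. The sketch below shows the metric and a brute-force search for the best codebook entry. It is a simplified illustration with hypothetical names and a flat array layout; a full fractal encoder would also apply scale and offset transforms to each candidate, which are omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of n 8-bit pixels. */
uint32_t sad(const uint8_t *x, const uint8_t *y, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += (uint32_t)(x[i] > y[i] ? x[i] - y[i] : y[i] - x[i]);
    return total;
}

/* Returns the index of the codebook entry with the lowest SAD against
 * `block`. The codebook is assumed to be stored as n_entries consecutive
 * blocks of block_len pixels each. Ties go to the earliest entry. */
size_t best_codebook_match(const uint8_t *block, const uint8_t *codebook,
                           size_t n_entries, size_t block_len)
{
    size_t best = 0;
    uint32_t best_sad = UINT32_MAX;
    for (size_t e = 0; e < n_entries; ++e) {
        uint32_t s = sad(block, codebook + e * block_len, block_len);
        if (s < best_sad) {
            best_sad = s;
            best = e;
        }
    }
    return best;
}
```

Every codebook comparison is independent and uses only integer subtract, compare, and accumulate operations, consistent with the "integer arithmetic" advantage on the slide.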