CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)

DSc PhD Krzysztof ROJEK, byteLAKE’s CTO
PPAM 2019, Bialystok, Poland, September 8-11, 2019
CFD code adaptation to the FPGA architecture
Background
• Current trends in the FPGA market
• Common FPGA applications
• FPGA access
• Architecture of the Xilinx Alveo U250 FPGA
• Evaluation metrics
• Algorithm scenario
• Development of FPGA codes
• Algorithm design
• OpenCL kernel processing
• Memory queue
• Limitations of memory access
• Burst memory access
• Vectorization
• Code regionalization
• CPU implementation overview
• Performance and energy results
• Conclusion
Current trends in the FPGA market
Common FPGA applications
• Confirmed effectiveness
– Audio processing
– Image processing
– Cryptography
– Routers/switches/gateways software
– Digital displays
– Scientific instruments (amplifiers, radio astronomy, radars)
• Current challenges
– Machine learning
– Deep learning
– High Performance Computing (HPC)
FPGA access
• Test drive in the cloud
– Nimbix: High Performance Computing & Supercomputing Platform
– Other cloud providers soon…
• Your own cluster
– RAM: 80 GB (16 GB for deployment only)
– Hard disk space: 100 GB
– OS: RedHat, CentOS, Ubuntu
– Xilinx Runtime: the driver for Alveo
– Deployment Shell: the communication layer physically implemented and flashed into the card
– The Xilinx SDAccel IDE: the framework for development
Xilinx Alveo U250 FPGA
• Premiere: October 2, 2018
• Built on the Xilinx 16nm UltraScale™ architecture
• Memory
– Off-chip memory capacity: 64 GB
– Off-chip total bandwidth: 77 GB/s
– Internal SRAM capacity: 54 MB
– Internal SRAM total bandwidth: 38 TB/s
• Power and thermal
– Maximum total power: 225 W
– Thermal cooling: passive
• Clocks
– KERNEL CLK: 500 MHz
– DATA CLK: 300 MHz
Xilinx Alveo U250 FPGA
• The deployment shell that handles device bring-up and configuration over PCIe is contained within the static region of the FPGA
• The resources in the dynamic region are available for creating custom accelerators
[Figure: SLR0 to SLR3, each a dynamic region; the static region; four DDR banks]
• Resources
– Look-up tables (LUTs) (K): 1341
– Registers (K): 2749
– 36 Kb block RAMs: 2000
– 288 Kb UltraRAMs: 1280
Is it good for you?
• Desired features of a data center
– Low price
– Low energy consumption
– High performance
– Technical support
– Reliability and fast service
• Important metrics
– Execution time [s]
– Data throughput of a simulation [MB/s]
– Power dissipation [W]
– Energy consumption [J]
How many cards are required to achieve a desired performance?
How many cards can I handle within a given energy budget?
What performance can be achieved within my energy budget?
How do these results compare with the CPU-based solution?
Real scientific scenario
• Computational Fluid Dynamics (CFD) kernel with support for all industrial parameters and settings
• Advection algorithm: predicts how the transport of a substance (fluid) or quantity by bulk motion changes in time
– An example of advection is the transport of pollutants or silt in a river by bulk water flow downstream
– Another is the transport of energy by water or air
• Based on the upwind scheme
• 3D compute domain
• Dataset (9 arrays + scalar):
– 3 x velocity vectors
– 2 x forces (implosion, explosion)
– 2 x density vectors
– 2 x transported substance (in, out)
– t: time interval
• Configuration:
– Job setting (size, timestep)
– Boundary conditions (periodic, open)
– Data accuracy (double, single, half)
[Figure: a domain periodic in the X dimension and an open domain]
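As a rough illustration of what an upwind advection step does, here is a minimal 1D sketch in plain C. It is not the kernel from the talk, which is 3D and uses nine arrays. The function name, the single Courant-number array `c` (velocity times time step divided by cell size) and the 1D periodic domain are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One explicit first-order upwind step for 1D advection on a periodic
 * domain. c[i] is the local Courant number u*dt/dx (assumed |c[i]| <= 1).
 * Hypothetical helper for illustration; not the kernel from the slides. */
void upwind_step_periodic(const float *q_in, float *q_out,
                          const float *c, size_t n)
{
    assert(n >= 2 && q_in != q_out);
    for (size_t i = 0; i < n; ++i) {
        size_t left  = (i == 0) ? n - 1 : i - 1;   /* periodic wrap-around */
        size_t right = (i + 1 == n) ? 0 : i + 1;
        if (c[i] >= 0.0f)   /* flow to the right: take the left neighbour */
            q_out[i] = q_in[i] - c[i] * (q_in[i] - q_in[left]);
        else                /* flow to the left: take the right neighbour */
            q_out[i] = q_in[i] - c[i] * (q_in[right] - q_in[i]);
    }
}
```

With a Courant number of exactly 1 everywhere, the step simply moves every value one cell to the right, which makes a convenient sanity check. An open boundary would replace the wrap-around indices with prescribed inflow values.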
Development
• Config, makefile, and source
Algorithm design
• The compute domain is divided into 4 sub-domains
• The host sends data to the FPGA global memory
• The host calls the kernel to execute it on the FPGA (the kernel is called many times)
• Each kernel call represents a single time step
• The FPGA sends the output array back to the host
[Figure: CPU and FPGA timelines. Labels: compute domain, four sub-domains, data sending and receiving (migrate memory objects), kernel call, kernel processing, copy buffer, N x call]
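The control flow described above can be sketched in plain C as a sequential stand-in for the host code. This is a simplified model and not the OpenCL host program from the talk: on the card the four sub-domains are processed by kernels running in parallel, while here they run one after another. `subdomain_range`, `run_time_steps`, `increment_kernel` and the 1D split are hypothetical names and choices made for this example.

```c
#include <assert.h>
#include <string.h>

#define SUBDOMAINS 4

typedef void (*kernel_fn)(const float *in, float *out, int begin, int end);

/* Half-open range [begin, end) of sub-domain s when n cells are split into
 * SUBDOMAINS nearly equal contiguous parts. */
void subdomain_range(int n, int s, int *begin, int *end)
{
    assert(n >= SUBDOMAINS && s >= 0 && s < SUBDOMAINS);
    int base = n / SUBDOMAINS, rem = n % SUBDOMAINS;
    *begin = s * base + (s < rem ? s : rem);
    *end   = *begin + base + (s < rem ? 1 : 0);
}

/* Placeholder "kernel" used only to exercise the loop: out = in + 1. */
void increment_kernel(const float *in, float *out, int begin, int end)
{
    for (int i = begin; i < end; ++i)
        out[i] = in[i] + 1.0f;
}

/* Runs `steps` time steps. Each step processes all sub-domains, then copies
 * the output buffer back into the input buffer for the next step. */
void run_time_steps(float *in, float *out, int n, int steps, kernel_fn kernel)
{
    assert(steps >= 0 && in != out);
    for (int t = 0; t < steps; ++t) {
        for (int s = 0; s < SUBDOMAINS; ++s) {
            int b, e;
            subdomain_range(n, s, &b, &e);
            kernel(in, out, b, e);
        }
        memcpy(in, out, (size_t)n * sizeof *in);
    }
}
```

Calling `run_time_steps(in, out, 10, 3, increment_kernel)` on a zeroed array leaves every element at 3, one increment per time step.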
Kernel processing
• The kernel is distributed across 4 SLRs
• Each sub-domain is allocated in a different memory bank
• Data transfer occurs between neighboring memory banks
[Figure: Kernel_A in SLR0, Kernel_B in SLR1, Kernel_C in SLR2 and Kernel_D in SLR3; Bank0 to Bank3 each hold one sub-domain]
Kernel communication with pipes
• A pipe stores data organized as a FIFO
• Pipes can be used to stream data from one kernel to another inside the FPGA device without having to use the external memory
• Pipes must be statically defined outside of all kernel functions
• Pipe names must use lowercase alphanumeric characters
• Xilinx extended OpenCL pipes with a blocking mode that allows users to synchronize kernels
pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
Memory queue
• Each array is transferred from the global memory to the fast BRAM memory
• To minimize the data traffic we use a memory queue across iterations
[Figure: data moving from global memory into BRAM]
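The slide does not show how the queue is laid out, so the following is only one plausible sketch in plain C. A three-slot rotating queue holds the planes needed by a three-point stencil, and each iteration loads just one new plane from "global memory". The plane size `PLANE`, the queue depth of 3 and the function name are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define PLANE 4   /* hypothetical number of elements in one plane */

/* Sums planes k-1, k and k+1 for every interior k. The local `queue` array
 * stands in for BRAM: plane p lives in slot p % 3, so loading plane k+1
 * overwrites plane k-2, which is no longer needed. Returns the number of
 * plane loads from `glob` so the traffic can be checked. */
int stencil_with_queue(const float *glob, float *out, int nplanes)
{
    assert(nplanes >= 3);
    float queue[3][PLANE];
    int loads = 0;

    /* Prime the queue with planes 0 and 1. */
    memcpy(queue[0], glob, sizeof queue[0]);
    memcpy(queue[1], glob + PLANE, sizeof queue[1]);
    loads += 2;

    for (int k = 1; k < nplanes - 1; ++k) {
        /* Only the new plane k+1 is fetched in this iteration. */
        memcpy(queue[(k + 1) % 3], glob + (k + 1) * PLANE, sizeof queue[0]);
        ++loads;

        const float *prev = queue[(k - 1) % 3];
        const float *cur  = queue[k % 3];
        const float *next = queue[(k + 1) % 3];
        for (int j = 0; j < PLANE; ++j)
            out[k * PLANE + j] = prev[j] + cur[j] + next[j];
    }
    return loads;
}
```

Every plane is read from global memory exactly once, so the function returns `nplanes`. Without the queue, a three-point stencil would fetch three planes per iteration.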
Memory access within a kernel
• 31 pins are available on the Alveo U250
– Each pointer to the global memory passed as a kernel argument reserves one memory pin
– Each kernel reserves one memory pin
• Using 4 banks and 4 kernels, we can set up to 6 global pointers to the global memory
• To send all required arrays, we need to pack them into larger buffers (different for input and output data)
• All kernel ports require 512-bit data access to provide the highest memory throughput
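One reading that is consistent with the numbers on the slide, though the slide does not spell it out, is that each of the 4 kernels takes one pin plus one pin per pointer argument. Then 4 x (1 + 6) = 28 pins fit within 31, while 4 x (1 + 7) = 32 do not. The plain-C sketch below encodes that arithmetic and shows one way to compute offsets when several arrays are packed into a single buffer. Both function names and the padding to 16 floats (one 512-bit word) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of global-memory pointers per kernel when every kernel and
 * every pointer argument reserves one of `total_pins` pins (assumed model). */
int max_pointers_per_kernel(int total_pins, int kernels)
{
    assert(kernels > 0 && total_pins >= kernels);
    return (total_pins - kernels) / kernels;
}

/* Element offsets of `count` equally sized float arrays packed back to back
 * in one buffer, each padded to a multiple of 16 floats (512 bits) so that
 * every array starts on a full-width word. */
void packed_offsets(size_t elems_per_array, size_t count, size_t *offsets)
{
    assert(offsets != NULL);
    size_t padded = (elems_per_array + 15) / 16 * 16;
    for (size_t i = 0; i < count; ++i)
        offsets[i] = i * padded;
}
```

For example, `max_pointers_per_kernel(31, 4)` gives 6, and three arrays of 100 floats each would start at offsets 0, 112 and 224 in the packed buffer.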
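Burst memory access/vectorization
• Burst memory access
– Loop pipelining
– Port data width: 512 bits
– Data copying separated from the computation
– Vectorization
// Copies tKM 512-bit words (float16) from global memory into on-chip BRAM.
void copy(__global const float16 * __restrict globMem)
{
    float16 bram[tKM];
    …
    // Pipelined loop over consecutive addresses, so the reads can form a burst.
    write_0: __attribute__((xcl_pipeline_loop))
    for (int kj = 0; kj < tKM; ++kj)
    {
        bram[kj] = globMem[gIdx + kj];
    }
    …
}
[Figure: timeline comparing traditional and pipelined execution]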
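Stencil vectorization
• Shifting elements within a vector (the standard shuffle API is not supported)
// Returns b shifted up by one lane, with a inserted in lane 0.
// realX (not shown on the slide) is presumably the scalar element type, float here;
// VECS is presumably 16, the number of lanes in a float16.
__attribute__((always_inline))
inline float16 getM1(const float a, const float16 b) {
    const realX *ptr2 = (const realX*)&b;   // scalar view of the input vector
    float16 out;
    realX *o = (realX*)&out;                // scalar view of the result
    o[0] = a;
    __attribute__((opencl_unroll_hint(15)))
    for (int i = 1; i < VECS; ++i) {
        o[i] = ptr2[i-1];
    }
    return out;
}
Scalar form: X[i] = Y[i-1]
Vector form: X[i] = getM1(Y[i-1][15], Y[i]);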
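To make the lane shift concrete, here is a plain-C model of the same idea that works on ordinary arrays instead of `float16` vectors. It is a sketch for checking the indexing, not FPGA code. `get_m1`, `shift_by_one` and the `halo` value used for the very first element are names and choices assumed for this example.

```c
#include <assert.h>

#define VECS 16   /* lanes in one float16 vector */

/* Model of getM1: out[0] = a, out[i] = b[i-1] for i = 1..VECS-1. */
void get_m1(float a, const float b[VECS], float out[VECS])
{
    assert(b != out);
    out[0] = a;
    for (int i = 1; i < VECS; ++i)
        out[i] = b[i - 1];
}

/* Computes x[i] = y[i-1] over nvec vectors of VECS lanes each. Lane 0 of
 * every vector is fed from the last lane of the previous vector; the first
 * vector takes `halo` instead (a hypothetical boundary value). */
void shift_by_one(const float *y, float *x, int nvec, float halo)
{
    assert(nvec >= 1 && y != x);
    for (int v = 0; v < nvec; ++v) {
        float carry = (v == 0) ? halo : y[(v - 1) * VECS + (VECS - 1)];
        get_m1(carry, y + v * VECS, x + v * VECS);
    }
}
```

The carried element is what `Y[i-1][15]` denotes in the vector form: the last lane of the previous vector becomes lane 0 of the current result.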
Memory ports
• Memory supports at most two accesses to a single array
Before: three accesses to bramY in one loop
calc_0: __attribute__((xcl_pipeline_loop))
for (int kj = 0; kj < tKM; ++kj)
{
    bramX[kj] = bramY[kj-off] + bramY[kj] + bramY[kj+off];
}
After: two loops, each with at most two accesses per array
calc_0: __attribute__((xcl_pipeline_loop))
for (int kj = 0; kj < tKM; ++kj)
{
    bramX[kj] = bramY[kj-off] + bramY[kj];
}
calc_1: __attribute__((xcl_pipeline_loop))
for (int kj = 0; kj < tKM; ++kj)
{
    bramX[kj] = bramX[kj] + bramY[kj+off];
}
Regionalization
• Independent regions in the code should be explicitly separated
• This helps the compiler distribute the code among the LUTs
• The separation can be done by adding braces around independent code blocks
{
    // the first block of instructions
}
{
    // the second block of instructions
}
CPU implementation
• Our CPU implementation uses two processors:
– Intel® Xeon® CPU E5-2695 v2, 2.40 – 3.2 GHz (2 x 12 cores)
• The code adaptation includes:
– Use of all 24 cores
– Loop transformations
– Memory alignment
– Thread affinity
– Data locality within nested loops
– Compiler optimizations
• The final simulation throughput is 3.7 GB/s
• The power dissipation is 142 W
FPGA optimizations
Results
                     FPGA      2xCPU     Ratio 2xCPU/FPGA
Exec. time [s]       11.4      18.0      1.6
Throughput [MB/s]    5840.8    3699.2    0.6
Power [W]            101.0     142.0     1.4
Energy [J]           1151.4    2556.0    2.2
[Chart: throughput in MB/s, the higher the better: FPGA 5840.8, 2xCPU 3699.2]
[Chart: energy in J, the lower the better: FPGA 1151.4, 2xCPU 2556.0]
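The rows of the results table are linked by simple arithmetic: energy is power multiplied by execution time, and throughput multiplied by execution time gives the data volume processed. The plain-C helpers below are hypothetical and written only to reproduce that check.

```c
#include <assert.h>

/* Energy in joules from average power in watts and run time in seconds. */
double energy_joules(double power_w, double time_s)
{
    assert(power_w >= 0.0 && time_s >= 0.0);
    return power_w * time_s;
}

/* Data volume in MB processed at a given throughput over a given time. */
double data_volume_mb(double throughput_mb_s, double time_s)
{
    assert(throughput_mb_s >= 0.0 && time_s >= 0.0);
    return throughput_mb_s * time_s;
}
```

With the table values, 101.0 W x 11.4 s gives 1151.4 J for the FPGA and 142.0 W x 18.0 s gives 2556.0 J for the two CPUs, so the CPUs use about 2.2 times more energy. Both runs process roughly the same volume, about 66,585 MB, which suggests the two columns describe the same workload.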
byteLAKE’s ecosystem of partners
Complete solutions for the CFD market
➢ HPC system design, build-up and configuration
➢ HPC software applications development and optimization to make the most of the hardware
… and more
More at: byteLAKE.com/en/CFD
Accelerated CFD Kernels
Compatible with geophysical models like EULAG
CFD kernels:
– Advection
– Pseudovelocity
– Divergence
– Thomas algorithm
• Faster time to results and more efficient processing compared to CPU-only nodes
• 4x faster
• 80% lower energy consumption
• 6x better performance per Watt
About byteLAKE
• AI (highly optimized AI engines to analyze text, image, video, time series data)
• HPC (highly optimized apps and kernels for HPC architectures)
Building solutions for real-life business problems
We build AI and HPC solutions, focusing on software.
We use machine/deep learning to bring automation and optimize operations in businesses across various industries.
We create highly optimized software for supercomputers.
Our researchers hold PhD and DSc degrees.
Contact me: krojek@byteLAKE.com
byteLAKE
www.byteLAKE.com
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
1 sur 31

Recommandé

Smart logic par
Smart logicSmart logic
Smart logicP V Krishna Mohan Gupta
503 vues49 diapositives
Maxwell siuc hpc_description_tutorial par
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialmadhuinturi
313 vues59 diapositives
Xilinx fpga cores par
Xilinx fpga coresXilinx fpga cores
Xilinx fpga coressanaz nouri
1.6K vues17 diapositives
Implementation of Soft-core Processor on FPGA par
Implementation of Soft-core Processor on FPGAImplementation of Soft-core Processor on FPGA
Implementation of Soft-core Processor on FPGADeepak Kumar
2.2K vues13 diapositives
Public Seminar_Final 18112014 par
Public Seminar_Final 18112014Public Seminar_Final 18112014
Public Seminar_Final 18112014Hossam Hassan
479 vues35 diapositives
SOC Processors Used in SOC par
SOC Processors Used in SOCSOC Processors Used in SOC
SOC Processors Used in SOCA B Shinde
5.7K vues80 diapositives

Contenu connexe

Tendances

A Dataflow Processing Chip for Training Deep Neural Networks par
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
2.5K vues25 diapositives
Microblaze par
MicroblazeMicroblaze
MicroblazeKrunal Siddhapathak
3.8K vues19 diapositives
OpenPOWER Acceleration of HPCC Systems par
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
721 vues29 diapositives
The Microarchitecure Of FPGA Based Soft Processor par
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorDeepak Tomar
1.6K vues37 diapositives
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain par
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainMDC_UNICA
120 vues48 diapositives
Scalability and Efficiency in Accelerator Sharing on FPGA Devices par
Scalability and Efficiency in Accelerator Sharing on FPGA DevicesScalability and Efficiency in Accelerator Sharing on FPGA Devices
Scalability and Efficiency in Accelerator Sharing on FPGA DevicesNECST Lab @ Politecnico di Milano
284 vues30 diapositives

Tendances(20)

A Dataflow Processing Chip for Training Deep Neural Networks par inside-BigData.com
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
inside-BigData.com2.5K vues
OpenPOWER Acceleration of HPCC Systems par HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
HPCC Systems721 vues
The Microarchitecure Of FPGA Based Soft Processor par Deepak Tomar
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft Processor
Deepak Tomar1.6K vues
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain par MDC_UNICA
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
MDC_UNICA120 vues
SOC Chip Basics par A B Shinde
SOC Chip BasicsSOC Chip Basics
SOC Chip Basics
A B Shinde2.9K vues
Runtime Reconfigurable Network-on-chips for FPGA-based Devices par Mugdha2289
Runtime Reconfigurable Network-on-chips for FPGA-based DevicesRuntime Reconfigurable Network-on-chips for FPGA-based Devices
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
Mugdha2289963 vues
BKK16-303 96Boards - TV Platform par Linaro
BKK16-303 96Boards - TV PlatformBKK16-303 96Boards - TV Platform
BKK16-303 96Boards - TV Platform
Linaro1.1K vues
SOC Peripheral Components & SOC Tools par A B Shinde
SOC Peripheral Components & SOC ToolsSOC Peripheral Components & SOC Tools
SOC Peripheral Components & SOC Tools
A B Shinde1.2K vues
BKK16-312 Integrating and controlling embedded devices in LAVA par Linaro
BKK16-312 Integrating and controlling embedded devices in LAVABKK16-312 Integrating and controlling embedded devices in LAVA
BKK16-312 Integrating and controlling embedded devices in LAVA
Linaro837 vues
SOC System Design Approach par A B Shinde
SOC System Design ApproachSOC System Design Approach
SOC System Design Approach
A B Shinde9.8K vues
SOC Application Studies: Image Compression par A B Shinde
SOC Application Studies: Image CompressionSOC Application Studies: Image Compression
SOC Application Studies: Image Compression
A B Shinde1.8K vues
BKK16-311 EAS Upstream Stategy par Linaro
BKK16-311 EAS Upstream StategyBKK16-311 EAS Upstream Stategy
BKK16-311 EAS Upstream Stategy
Linaro1.2K vues
PFQ@ 9th Italian Networking Workshop (Courmayeur) par Nicola Bonelli
PFQ@ 9th Italian Networking Workshop (Courmayeur)PFQ@ 9th Italian Networking Workshop (Courmayeur)
PFQ@ 9th Italian Networking Workshop (Courmayeur)
Nicola Bonelli412 vues

Similaire à CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)

SoC FPGA Technology par
SoC FPGA TechnologySoC FPGA Technology
SoC FPGA TechnologySiraj Muhammad
4.2K vues34 diapositives
AI Accelerators for Cloud Datacenters par
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
1.4K vues53 diapositives
Microprocessor.ppt par
Microprocessor.pptMicroprocessor.ppt
Microprocessor.pptsafia kalwar
17.3K vues23 diapositives
Oow 2008 yahoo_pie-db par
Oow 2008 yahoo_pie-dbOow 2008 yahoo_pie-db
Oow 2008 yahoo_pie-dbbohanchen
350 vues39 diapositives
00 opencapi acceleration framework yonglu_ver2 par
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
141 vues41 diapositives
6 open capi_meetup_in_japan_final par
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_finalYutaka Kawai
242 vues32 diapositives

Similaire à CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)(20)

AI Accelerators for Cloud Datacenters par CastLabKAIST
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
CastLabKAIST1.4K vues
Oow 2008 yahoo_pie-db par bohanchen
Oow 2008 yahoo_pie-dbOow 2008 yahoo_pie-db
Oow 2008 yahoo_pie-db
bohanchen350 vues
00 opencapi acceleration framework yonglu_ver2 par Yutaka Kawai
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
Yutaka Kawai141 vues
6 open capi_meetup_in_japan_final par Yutaka Kawai
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
Yutaka Kawai242 vues
Using a Field Programmable Gate Array to Accelerate Application Performance par Odinot Stanislas
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
Odinot Stanislas2.1K vues
Cpld and fpga mod vi par Agi George
Cpld and fpga   mod viCpld and fpga   mod vi
Cpld and fpga mod vi
Agi George253 vues
Sony Computer Entertainment Europe Research & Development Division par Slide_N
Sony Computer Entertainment Europe Research & Development DivisionSony Computer Entertainment Europe Research & Development Division
Sony Computer Entertainment Europe Research & Development Division
Slide_N407 vues
Introduction to DPDK par Kernel TLV
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
Kernel TLV5.9K vues
Dsp ajal par AJAL A J
Dsp  ajalDsp  ajal
Dsp ajal
AJAL A J5.2K vues
Sparc t4 2 system technical overview par solarisyougood
Sparc t4 2 system technical overviewSparc t4 2 system technical overview
Sparc t4 2 system technical overview
solarisyougood2.5K vues
Digital Systems Design par Reza Sameni
Digital Systems DesignDigital Systems Design
Digital Systems Design
Reza Sameni403 vues
FPGA Selection Methodology for Real time projects par Krishna Gaihre
FPGA Selection Methodology for Real time projectsFPGA Selection Methodology for Real time projects
FPGA Selection Methodology for Real time projects
Krishna Gaihre2K vues
Sparc t4 1 system technical overview par solarisyougood
Sparc t4 1 system technical overviewSparc t4 1 system technical overview
Sparc t4 1 system technical overview
solarisyougood2.9K vues
Project Slides for Website 2020-22.pptx par AkshitAgiwal1
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
AkshitAgiwal19 vues

Plus de byteLAKE

byteLAKE's expertise across NVIDIA architectures and configurations par
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE
3 vues23 diapositives
CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ... par
CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ...CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ...
CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ...byteLAKE
32 vues23 diapositives
Empowering Industries with byteLAKE's High-Performance AI par
Empowering Industries with byteLAKE's High-Performance AIEmpowering Industries with byteLAKE's High-Performance AI
Empowering Industries with byteLAKE's High-Performance AIbyteLAKE
47 vues33 diapositives
Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE) par
Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE)Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE)
Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE)byteLAKE
25 vues33 diapositives
Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020) par
Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020)Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020)
Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020)byteLAKE
1.4K vues8 diapositives
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning) par
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)byteLAKE
725 vues35 diapositives

Plus de byteLAKE(12)

byteLAKE's expertise across NVIDIA architectures and configurations par byteLAKE
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE3 vues
CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ... par byteLAKE
CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ...CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ...
CFD Suite (AI-accelerated CFD) - Sztuczna Inteligencja Przyspiesza Symulacje ...
byteLAKE32 vues
Empowering Industries with byteLAKE's High-Performance AI par byteLAKE
Empowering Industries with byteLAKE's High-Performance AIEmpowering Industries with byteLAKE's High-Performance AI
Empowering Industries with byteLAKE's High-Performance AI
byteLAKE47 vues
Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE) par byteLAKE
Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE)Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE)
Automatyczny Monitoring Jakości w Fabryce (Sztuczna Inteligencja, byteLAKE)
byteLAKE25 vues
Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020) par byteLAKE
Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020)Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020)
Sztuczna Inteligencja dla Biznesu (Made In Wroclaw 2020)
byteLAKE1.4K vues
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning) par byteLAKE
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)
byteLAKE725 vues
byteLAKE's Alveo FPGA Solutions par byteLAKE
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA Solutions
byteLAKE322 vues
CFD Acceleration with FPGA (byteLAKE's & Xilinx's presentation from H2RC work... par byteLAKE
CFD Acceleration with FPGA (byteLAKE's & Xilinx's presentation from H2RC work...CFD Acceleration with FPGA (byteLAKE's & Xilinx's presentation from H2RC work...
CFD Acceleration with FPGA (byteLAKE's & Xilinx's presentation from H2RC work...
byteLAKE255 vues
byteLAKE and Lenovo presenting Federated Learning at MWC 2019 par byteLAKE
byteLAKE and Lenovo presenting Federated Learning at MWC 2019byteLAKE and Lenovo presenting Federated Learning at MWC 2019
byteLAKE and Lenovo presenting Federated Learning at MWC 2019
byteLAKE484 vues
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius par byteLAKE
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel MovidiusBenchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
byteLAKE19.5K vues
byteLAKE's Edge AI par byteLAKE
byteLAKE's Edge AIbyteLAKE's Edge AI
byteLAKE's Edge AI
byteLAKE107 vues
AI optimizing HPC simulations (presentation from 6th EULAG Workshop) par byteLAKE
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
byteLAKE182 vues

Dernier

MVP and prioritization.pdf par
MVP and prioritization.pdfMVP and prioritization.pdf
MVP and prioritization.pdfrahuldharwal141
39 vues8 diapositives
Digital Personal Data Protection (DPDP) Practical Approach For CISOs par
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOsPriyanka Aash
103 vues59 diapositives
Future of AR - Facebook Presentation par
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook PresentationRob McCarty
54 vues27 diapositives
Ransomware is Knocking your Door_Final.pdf par
Ransomware is Knocking your Door_Final.pdfRansomware is Knocking your Door_Final.pdf
Ransomware is Knocking your Door_Final.pdfSecurity Bootcamp
81 vues46 diapositives
Microsoft Power Platform.pptx par
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptxUni Systems S.M.S.A.
74 vues38 diapositives
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... par
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...ShapeBlue
86 vues25 diapositives

Dernier(20)

Digital Personal Data Protection (DPDP) Practical Approach For CISOs par Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash103 vues
Future of AR - Facebook Presentation par Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty54 vues
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... par ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue86 vues
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... par James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson142 vues
State of the Union - Rohit Yadav - Apache CloudStack par ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue218 vues
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue par ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue191 vues
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue par ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue147 vues
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive par Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue par ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue75 vues
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... par ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue120 vues
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool par ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue56 vues
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ... par ShapeBlue
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
ShapeBlue52 vues
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue par ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue68 vues
Why and How CloudStack at weSystems - Stephan Bienek - weSystems par ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue172 vues
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates par ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue178 vues
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT par ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue138 vues
DRBD Deep Dive - Philipp Reisner - LINBIT par ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue110 vues

CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)

  • 1. DSc PhD Krzysztof ROJEK, byteLAKE’s CTO PPAM 2019, Bialystok, Poland, September 8-11, 2019 CFD code adaptation to the FPGA architecture
  • 2. • Current trends in the FPGA market • Common FPGA applications • FPGA access • Architecture of the Xilinx Alveo U250 FPGA • Evaluation metrics • Algorithm scenario • Development of FPGA codes • Algorithm design 2 Background • OpenCL kernel processing • Memory queue • Limitations of memory access • Burst memory access • Vectorization • Code regionalization • CPU implementation overview • Performance and Energy results • Conclusion
  • 3. 3 Current trends in the FPGA market
  • 4. • Confirmed effectiveness – Audio processing – Image processing – Cryptography – Routers/switches/gateways software – Digital displays – Scientific instruments (amplifiers, radio astronomy, radars) • Current challenges – Machine learning – Deep learning – High Performance Computing (HPC) 4 Common FPGA applications
  • 5. • Test Drive in the Cloud – Nimbix: High Performance Computing & Supercomputing Platform – Other cloud providers, soon… • Your own cluster – RAM memory: 80GB (16GB for deployment only) – Hard disk space: 100GB – OS: RedHat, CentOS, Ubuntu – Xilinx Runtime – driver for Alveo – Deployment Shell – the communication layer physically implemented and flashed into the card – The Xilinx SDAccel IDE – framework for development 5 FPGA access More cloud providers soon…
  • 6. • Premiere: October 02, 2018 • Built on the Xilinx 16nm UltraScale™ architecture 6 Xilinx Alveo U250 FPGA Memory Off-chip Memory Capacity 64 GB Off-chip Total Bandwidth 77 GB/s Internal SRAM Capacity 54 MB Internal SRAM Total Bandwidth 38 TB/s Power and Thermal Maximum Total Power 225W Thermal Cooling Passive Clocks KERNEL CLK 500 MHz DATA CLK 300 MHz
  • 7. Xilinx Alveo U250 FPGA
      • The deployment shell that handles device bring-up and configuration over PCIe is contained within the static region of the FPGA
      • The resources in the dynamic region are available for creating custom accelerators
      • Floorplan labels: SLR0, SLR1, SLR2, SLR3 dynamic regions; static region; 4 x DDR
      • Resources
          – Look-Up Tables (LUTs) (K): 1341
          – Registers (K): 2749
          – 36 Kb Block RAMs: 2000
          – 288 Kb UltraRAMs: 1280
  • 8. Is it good for you?
      • Desired features of a data center
          – Low price
          – Low energy consumption
          – High performance
          – Technical support
          – Reliability and fast service
      • Important metrics
          – Execution time [s]
          – Data throughput of a simulation [MB/s]
          – Power dissipation [W]
          – Energy consumption [J]
      • How many cards are required to achieve a desired performance?
      • How many cards can I handle within a given energy budget?
      • What performance can be achieved within my energy budget?
      • How do these results compare to the CPU-based solution?
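The sizing questions on this slide reduce to simple arithmetic once per-card throughput and power are known. The sketch below is only an illustration: the function names are hypothetical, throughput is assumed to scale linearly with the number of cards (an optimistic assumption the slides do not make), and the "energy budget" is read here as a power budget in watts.

```c
#include <assert.h>

/* Smallest number of cards whose combined throughput reaches the target.
   ASSUMPTION: throughput scales linearly with the number of cards. */
int cards_for_throughput(double target_mbs, double per_card_mbs) {
    assert(per_card_mbs > 0.0 && target_mbs >= 0.0);
    int cards = (int)(target_mbs / per_card_mbs);
    if ((double)cards * per_card_mbs < target_mbs) {
        ++cards;                      /* round up to a whole card */
    }
    return cards;
}

/* Largest number of cards that fits into a power budget [W].
   ASSUMPTION: the budget is expressed as power, not energy. */
int cards_within_power(double budget_w, double per_card_w) {
    assert(per_card_w > 0.0 && budget_w >= 0.0);
    return (int)(budget_w / per_card_w);
}
```

For a hypothetical card that delivers 2500 MB/s at 100 W, `cards_for_throughput(10000.0, 2500.0)` asks how many cards reach 10 GB/s, and `cards_within_power(450.0, 100.0)` asks how many fit under 450 W.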
  • 9. Real scientific scenario
      • Computational Fluid Dynamics (CFD) kernel with support for all industrial parameters and settings
      • Advection algorithm: a method to predict changes over time in the transport of a substance (fluid) or quantity by bulk motion
          – An example of advection is the transport of pollutants or silt in a river by bulk water flow downstream
          – It is also the transport of energy by water or air
      • Based on the upwind scheme
      • 3D compute domain
      • Dataset (9 arrays + scalar):
          – 3 x velocity vectors
          – 2 x forces (implosion, explosion)
          – 2 x density vectors
          – 2 x transported substance (in, out)
          – t – time interval
      • Configuration:
          – Job setting (size, timestep)
          – Border conditions (periodic, open)
          – Data accuracy (double, single, half)
      • Figure labels: periodic domain in X dimension; open domain
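To make the upwind idea concrete, here is a minimal sketch of one first-order upwind time step for 1D advection on a periodic grid. It is not the 3D kernel from the presentation: the function name, the 1D simplification, and the single constant Courant number `courant = c * dt / dx` are all assumptions made for illustration.

```c
#include <assert.h>

/* One explicit first-order upwind step for du/dt + c * du/dx = 0 on a
   periodic 1D grid. courant = c * dt / dx; stable for |courant| <= 1.
   ASSUMPTION: constant velocity, 1D, periodic border in x. */
void upwind_step(const double *in, double *out, int n, double courant) {
    assert(n > 0 && courant >= -1.0 && courant <= 1.0);
    for (int i = 0; i < n; ++i) {
        int left  = (i + n - 1) % n;   /* periodic neighbours */
        int right = (i + 1) % n;
        if (courant >= 0.0) {
            /* flow to the right: information comes from the left cell */
            out[i] = in[i] - courant * (in[i] - in[left]);
        } else {
            /* flow to the left: information comes from the right cell */
            out[i] = in[i] - courant * (in[right] - in[i]);
        }
    }
}
```

With `courant` equal to 1 the profile moves exactly one cell per step; smaller values smear it while the total amount of the transported quantity stays the same.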
  • 10. Development
      • Config, makefile, and source
  • 13. Algorithm design
      • The compute domain is divided into 4 sub-domains
      • The host sends data to the FPGA global memory
      • The host calls the kernel to execute it on the FPGA (the kernel is called many times)
      • Each kernel call represents a single time step
      • The FPGA sends the output array back to the host
      • Diagram labels: CPU, FPGA, compute domain, 4 x sub-domain, data sending, data receiving, kernel call, kernel processing, migrate memory objects, N x call, copy buffer
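The split of the compute domain into 4 sub-domains can be sketched as an index computation. The slides do not say along which dimension the domain is cut or how a remainder is handled, so the helper below (a hypothetical name) simply splits `n` planes as evenly as possible and gives the first sub-domains one extra plane.

```c
#include <assert.h>

typedef struct { int begin; int end; } range_t;   /* half-open [begin, end) */

/* Range of planes owned by sub-domain `idx` when `n` planes are split into
   `parts` contiguous pieces. ASSUMPTION: even split, remainder goes to the
   lowest-numbered sub-domains. */
range_t subdomain(int n, int parts, int idx) {
    assert(n >= 0 && parts > 0 && idx >= 0 && idx < parts);
    int base = n / parts;
    int extra = n % parts;
    range_t r;
    r.begin = idx * base + (idx < extra ? idx : extra);
    r.end   = r.begin + base + (idx < extra ? 1 : 0);
    return r;
}
```

A host program would call `subdomain(n, 4, k)` for k = 0..3 to decide which slice of each array to send to each part of the device before starting the time-step loop.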
  • 14. Kernel processing
      • The kernel is distributed across 4 SLRs
      • Each sub-domain is allocated in a different memory bank
      • Data transfer occurs between neighboring memory banks
      • Diagram labels: SLR0 Kernel_A, SLR1 Kernel_B, SLR2 Kernel_C, SLR3 Kernel_D; Bank0, Bank1, Bank2, Bank3, one sub-domain per bank
  • 15. Kernels communication with pipes
      • A pipe stores data organized as a FIFO
      • Pipes can be used to stream data from one kernel to another inside the FPGA device without having to use the external memory
      • Pipes must be statically defined outside of all kernel functions
      • Pipes must be declared with lowercase alphanumeric names
      • Xilinx extended OpenCL pipes by adding a blocking mode that allows users to synchronize kernels
      • Example declaration:
          pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
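The FIFO behaviour of a pipe can be modelled in plain C with a bounded ring buffer. This is a software illustration only, not the Xilinx pipe API: the type and function names are hypothetical, and where a blocking pipe would stall the kernel, the model returns `false` instead.

```c
#include <assert.h>
#include <stdbool.h>

#define PIPE_DEPTH 512   /* same depth as in the declaration on the slide */

typedef struct {
    int data[PIPE_DEPTH];
    int head;    /* index of the oldest element */
    int count;   /* number of stored elements */
} fifo_t;

void fifo_init(fifo_t *f) {
    assert(f != 0);
    f->head = 0;
    f->count = 0;
}

/* Returns false when the FIFO is full; a blocking pipe write would wait here. */
bool fifo_write(fifo_t *f, int value) {
    if (f->count == PIPE_DEPTH) return false;
    f->data[(f->head + f->count) % PIPE_DEPTH] = value;
    ++f->count;
    return true;
}

/* Returns false when the FIFO is empty; a blocking pipe read would wait here. */
bool fifo_read(fifo_t *f, int *value) {
    if (f->count == 0) return false;
    *value = f->data[f->head];
    f->head = (f->head + 1) % PIPE_DEPTH;
    --f->count;
    return true;
}
```

Values come out in the order they were written, which is what lets a producer kernel hand boundary data to a consumer kernel without going through external memory.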
  • 16. Memory queue
      • Each array is transferred from the global memory to the fast BRAM memory
      • To minimize the data traffic we use a memory queue across iterations
      • Diagram labels: global memory, BRAM
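One way to read the memory queue is as a sliding window of planes kept on chip: each iteration fetches only one new plane from global memory and reuses the planes already loaded. The slides do not show the actual implementation, so the sketch below is an assumption throughout: a 3-plane window, a tiny plane size, a simple three-point sum as the stencil, and a local array standing in for BRAM.

```c
#include <assert.h>

#define PLANE 4          /* elements per plane (tiny, for illustration) */
#define WINDOW 3         /* planes kept in the on-chip queue */

/* Computes out[k] = in[k-1] + in[k] + in[k+1] plane by plane for the interior
   planes 1..nplanes-2. Returns how many planes were fetched from the input
   ("global memory"). ASSUMPTION: three-point stencil along the plane index. */
int stencil_with_queue(const float *in, float *out, int nplanes) {
    assert(nplanes >= WINDOW);
    float queue[WINDOW][PLANE];      /* stands in for BRAM */
    int loads = 0;

    for (int k = 0; k < nplanes; ++k) {
        /* fetch exactly one new plane; it overwrites the oldest slot */
        for (int j = 0; j < PLANE; ++j) {
            queue[k % WINDOW][j] = in[k * PLANE + j];
        }
        ++loads;

        if (k >= WINDOW - 1) {
            int c = k - 1;           /* centre plane of the current window */
            for (int j = 0; j < PLANE; ++j) {
                out[c * PLANE + j] = queue[(c - 1) % WINDOW][j]
                                   + queue[c % WINDOW][j]
                                   + queue[(c + 1) % WINDOW][j];
            }
        }
    }
    return loads;
}
```

Each input plane is read from global memory once, whereas reloading all three neighbours for every output plane would read most planes three times.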
  • 19. Memory access within a kernel
      • 31 pins are available in the Alveo U250
          – Each pointer to the global memory set as a kernel argument reserves one memory pin
          – Each kernel reserves one memory pin
      • Using 4 banks and 4 kernels we can set up to 6 global pointers to the global memory
      • To send all required arrays we need to pack them into larger buffers (different for input and output data)
      • All kernel ports require 512-bit data access to provide the highest memory throughput
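The limit of 6 pointers is consistent with a simple pin budget if it is read as a per-kernel limit: 31 pins minus one pin per kernel leaves 27, and 27 divided among 4 kernels gives 6 each. That reading is an assumption, as are the function names and the equal-size layout used below to locate an array inside a packed buffer.

```c
#include <assert.h>

/* Global-memory pointers available to each kernel when every kernel reserves
   one pin for itself and one pin per pointer argument.
   ASSUMPTION: the pins are shared evenly among the kernels. */
int pointers_per_kernel(int total_pins, int kernels) {
    assert(kernels > 0 && total_pins >= kernels);
    return (total_pins - kernels) / kernels;
}

/* Offset (in elements) of array `idx` inside a packed buffer that stores
   `arrays` equally sized arrays back to back.
   ASSUMPTION: all packed arrays have the same length. */
long packed_offset(int idx, int arrays, long elems_per_array) {
    assert(idx >= 0 && idx < arrays && elems_per_array >= 0);
    return (long)idx * elems_per_array;
}
```

With the numbers from the slide, `pointers_per_kernel(31, 4)` evaluates to 6, fewer than the 9 arrays of the dataset, which is why several arrays have to share one buffer.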
  • 20. Burst memory access/vectorization
      • Burst memory access
          – Loop pipelining
          – Port data width: 512 bits
          – Data copying separated from the computation
          – Vectorization
      • Code:
          void copy(__global const float16 * __restrict globMem) {
              float16 bram[tKM];
              …
              // pipelined copy loop: one 512-bit float16 element per iteration
              write_0: __attribute__((xcl_pipeline_loop))
              for(int kj=0; kj<tKM; ++kj) {
                  bram[kj] = globMem[gIdx+kj];
              }
              …
          }
      • Figure labels: time, traditional, pipelining
  • 21. Stencil vectorization
      • Shifting elements within a vector (standard shuffle API is not supported)
      • Code:
          __attribute__((always_inline))
          inline float16 getM1(const float a, const float16 b) {
              // scalar view of the 16 lanes of b
              const float *ptr2=(const realX*)&b;
              float16 out;
              // scalar view of the 16 lanes of the result
              float *o=(realX*)&out;
              // lane 0 takes the last lane of the previous vector
              o[0] = a;
              __attribute__((opencl_unroll_hint(15)))
              for(int i=1; i<VECS; ++i) {
                  o[i] = ptr2[i-1];
              }
              return out;
          }
      • Scalar form: X[i] = Y[i-1]
      • Vectorized form: X[i]=getM1(Y[i-1][15], Y[i]);
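The lane shift performed by `getM1` can be checked on a CPU with a plain-C model. The struct `vec16` below is a hypothetical stand-in for OpenCL's `float16`, and `get_m1` mirrors the slide's function under the assumption that `VECS` is 16.

```c
#include <assert.h>

#define VECS 16                            /* lanes in a float16 (assumed) */

typedef struct { float s[VECS]; } vec16;   /* plain-C stand-in for float16 */

/* Returns b shifted up by one lane, with `a` entering at lane 0:
   out = { a, b[0], b[1], ..., b[14] }. */
vec16 get_m1(float a, vec16 b) {
    vec16 out;
    out.s[0] = a;
    for (int i = 1; i < VECS; ++i) {
        out.s[i] = b.s[i - 1];
    }
    return out;
}
```

Passing the last lane of the previous vector as `a` makes the result equal to the scalar array read one element to the left, which is exactly what `X[i] = Y[i-1]` needs.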
  • 22. Memory ports
      • Memory supports two accesses to a single array
      • One loop with three accesses to bramY:
          calc_0: __attribute__((xcl_pipeline_loop))
          for(int kj=0; kj<tKM; ++kj) {
              bramX[kj] = bramY[kj-off]+bramY[kj]+bramY[kj+off];
          }
      • The same computation split into two loops, each with at most two accesses per array:
          calc_0: __attribute__((xcl_pipeline_loop))
          for(int kj=0; kj<tKM; ++kj) {
              bramX[kj] = bramY[kj-off]+bramY[kj];
          }
          calc_1: __attribute__((xcl_pipeline_loop))
          for(int kj=0; kj<tKM; ++kj) {
              bramX[kj] = bramX[kj]+bramY[kj+off];
          }
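That the split loops compute the same values as the single loop can be verified with a small CPU-side model. The function names are hypothetical, and unlike the slide's loops the model only visits interior indices so that `k - off` and `k + off` stay inside the arrays.

```c
#include <assert.h>

/* Single loop: three reads of y per iteration. */
void stencil_fused(float *x, const float *y, int n, int off) {
    assert(n >= 0 && off >= 0);
    for (int k = off; k < n - off; ++k) {
        x[k] = y[k - off] + y[k] + y[k + off];
    }
}

/* Split version: no array is accessed more than twice per iteration. */
void stencil_split(float *x, const float *y, int n, int off) {
    assert(n >= 0 && off >= 0);
    for (int k = off; k < n - off; ++k) {
        x[k] = y[k - off] + y[k];          /* two reads of y */
    }
    for (int k = off; k < n - off; ++k) {
        x[k] = x[k] + y[k + off];          /* one read of y, read and write of x */
    }
}
```

The order of the additions is the same in both versions, so the results match bit for bit; the cost is a second pass over `x`.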
  • 23. Regionalization
      • Independent regions in the code should be explicitly separated
      • This helps the compiler distribute the code among the LUTs
      • The separation can be done by adding braces around independent code blocks:
          {
              //the first block of instructions
          }
          {
              //the second block of instructions
          }
  • 24. CPU implementation
      • Our CPU implementation utilizes two processors:
          – Intel® Xeon® CPU E5-2695 v2 2.40 – 3.2 GHz (2x12 cores)
      • The code adaptation includes:
          – Utilization of 24 cores
          – Loop transformations
          – Memory alignment
          – Thread affinity
          – Data locality within nested loops
          – Compiler optimizations
      • The final simulation throughput is: 3.7 GB/s
      • The power dissipation is: 142 Watts
  • 26. Results
                             FPGA      2xCPU     Ratio 2xCPU/FPGA
      Exec. time [s]         11.4      18.0      1.6
      Throughput [MB/s]      5840.8    3699.2    0.6
      Power [W]              101.0     142.0     1.4
      Energy [J]             1151.4    2556.0    2.2
      • Throughput [MB/s] chart (the higher the better): FPGA 5840.8, 2xCPU 3699.2
      • Energy [J] chart (the lower the better): FPGA 1151.4, 2xCPU 2556.0
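The energy row follows from the two rows above it, since energy is power multiplied by execution time: 101.0 W x 11.4 s = 1151.4 J for the FPGA and 142.0 W x 18.0 s = 2556.0 J for the two CPUs. Each value in the ratio column matches the 2xCPU figure divided by the FPGA figure. The helper names in this small check are hypothetical.

```c
#include <assert.h>

/* Energy [J] = power [W] * execution time [s]. */
double energy_j(double power_w, double time_s) {
    assert(power_w >= 0.0 && time_s >= 0.0);
    return power_w * time_s;
}

/* How many times larger the CPU value is than the FPGA value. */
double cpu_over_fpga(double cpu, double fpga) {
    assert(fpga > 0.0);
    return cpu / fpga;
}
```

For example, `cpu_over_fpga(energy_j(142.0, 18.0), energy_j(101.0, 11.4))` gives roughly 2.2, the energy ratio reported in the table.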
  • 27. byteLAKE’s ecosystem of partners
      • Complete solutions for CFD market
          ➢ HPC system design, build-up and configuration
          ➢ HPC software applications development and optimization to make the most of the hardware
      • … and more
  • 28. Accelerated CFD Kernels
      • More at: byteLAKE.com/en/CFD
      • Compatible with geophysical models like EULAG
      • CFD Kernels
          – Advection
          – Pseudovelocity
          – Divergence
          – Thomas algorithm
      • Faster time to results and more efficient processing compared to CPU-only nodes
      • 4x faster
      • 80% lower energy consumption
      • 6x better performance per Watt
      • About byteLAKE
          – AI (highly optimized AI engines to analyze text, image, video, time series data)
          – HPC (highly optimized apps and kernels for HPC architectures)
  • 30. byteLAKE (www.byteLAKE.com)
      • We build AI and HPC solutions, focusing on software.
      • We use machine/deep learning to bring automation and optimize operations in businesses across various industries.
      • We create highly optimized software for supercomputers.
      • Our researchers hold PhD and DSc degrees.
      • AI (highly optimized AI engines to analyze text, image, video, time series data)
      • HPC (highly optimized apps and kernels for HPC architectures)
      • Building solutions for real-life business problems