Multiscale Dataflow Computing
Competitive Advantage at the Exascale Frontier
2
What Makes Computers Inefficient?
A metaphor
[Diagram labels: ALU, DATA (x4)]
3
What Makes Computers Inefficient?
A metaphor
4
The End of Free Performance
Frequency levels off, cores fill in the gap
5
The Control Flow Model
General but suboptimal
⬥Data is static, must be loaded/stored
⬥Instructions are data too – compute in time
⬥Inefficient way to solve any problem
⬥Most silicon is used to move data, decode instructions, etc.
⬥Software development is fast and easy
⬥Hardware development is difficult and specialized
6
The Dataflow Model
Build the computer around the problem
⬥Data moves continuously
⬥Compute in space – arrange operations in 2D
⬥Optimal solution for a specific problem
⬥No wasted silicon – maximum performance density
⬥No wasted clock cycles – predictable speed
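As a rough software analogy for "data moves continuously", the sketch below chains Python generators so that each value streams through a fixed sequence of operations instead of being loaded and stored between steps. This is only an illustration on a CPU: the stage names (`source`, `scale`, `moving_sum3`) are hypothetical, and a real dataflow engine lays these operations out in hardware rather than running them one after another.

```python
def source(values):
    """Emit input values one at a time, like a stream entering the graph."""
    for v in values:
        yield v


def scale(stream, factor):
    """A node that multiplies every passing value by a constant."""
    for v in stream:
        yield v * factor


def moving_sum3(stream):
    """A node holding a 3-value window; emits one sum per input once full."""
    window = []
    for v in stream:
        window.append(v)
        if len(window) > 3:
            window.pop(0)
        if len(window) == 3:
            yield sum(window)


# Wire the nodes together once, then let the data flow through them.
pipeline = moving_sum3(scale(source(range(6)), 2))
result = list(pipeline)
print(result)
```

The graph is built once (the `pipeline` line) and the data then passes through it, which is the shape of the "build the computer around the problem" idea, not its performance.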
7
The Story of Maxeler Dataflow Computing
Research to real world
⬥ Researched at Stanford before 2000
⬥ Mencer, O. (2000). Rational Arithmetic in Computer Systems (Ph.D. thesis). Stanford University, California, USA.
⬥ Refined at Bell Labs from 2000 to 2003
⬥ Computing Sciences Center, Unit 1127
⬥ Birthplace of the transistor, Unix, C, C++ ...
⬥ Realized via Maxeler, founded in 2003
⬥ Oil and Gas with Chevron, ENI, Schlumberger
⬥ Finance with J.P. Morgan, CME, Citi
⬥ Defense and Cyber Security
⬥ Strategic Technology Partnerships
⬥ Juniper, Hitachi, AWS
8
Maxeler Success Stories
Dataflow computing provides a competitive advantage in multiple industries
⬥Chevron
⬥ Seismic shoot data must be processed for imaging
⬥ Maxeler developed dataflow computing to address performance density
⬥JP Morgan
⬥ Complex credit derivatives
⬥ Unable to run risk calculations during the 2008 crisis
⬥ Maxeler DFEs reduced run time from 8 hours to 2 minutes
⬥Juniper Networks
⬥ Added dataflow acceleration to top-of-rack QFX5100 switch
⬥ Maxeler delivers in-line processing of network data
9
Building a Dataflow Computer
First, convert the problem to MaxJ
Algorithm analysis: convert loops to dataflow
MaxJ: Java-based language
MaxJ Simulator: debugging and JUnit tests
Dataflow graph: assembled by MaxCompiler
Hardware build
10
MaxJ
Dataflow computing in a language you know
11
MaxJ
Complex graphs from simple code
3D finite difference time step
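The kind of computation named on this slide can be sketched in plain Python. This is not MaxJ and not the code shown on the slide. It assumes a simple diffusion-style update with a 7-point stencil, chosen for brevity because the slide does not say which equation is solved; the function name `laplacian_step` and the parameter `alpha` are hypothetical.

```python
def laplacian_step(u, alpha):
    """One explicit step: u_new = u + alpha * laplacian(u) on interior points.

    u is a nested list indexed u[z][y][x]; boundary values are left unchanged.
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    new = [[row[:] for row in plane] for plane in u]  # copy of the grid
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                # 7-point stencil: six neighbours minus six times the centre.
                lap = (u[z - 1][y][x] + u[z + 1][y][x]
                       + u[z][y - 1][x] + u[z][y + 1][x]
                       + u[z][y][x - 1] + u[z][y][x + 1]
                       - 6.0 * u[z][y][x])
                new[z][y][x] = u[z][y][x] + alpha * lap
    return new


# 3x3x3 grid with a single non-zero value in the centre.
grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][1] = 1.0
stepped = laplacian_step(grid, 0.125)
print(stepped[1][1][1])
```

The three nested loops are what a dataflow implementation would replace with a single stream of points passing through a fixed stencil graph.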
12
Building a Dataflow Computer
Then build a physical machine
13
The Dataflow Engine
The dataflow graph as hardware
14
The Dataflow Engine
Communicate with a CPU through PCIe and the MaxelerOS API
15
The Dataflow Engine
High-bandwidth connections to large on-card memory
16
The Dataflow Engine
Two high-speed duplex interconnects to other DFEs through MaxRing
17
The Dataflow Engine
Optional networking hardware using MaxCompilerNet for frame decoding
18
The Maxeler DFE
Dataflow appliance
MPC-X1000
• 8 Dataflow Engines in 1U
• Up to 1 TB of DFE RAM
• Dynamic allocation of DFEs to conventional CPU servers through InfiniBand
• Equivalent performance to 20-50 x86 servers
19
Dataflow Case Study
Quantum ESPRESSO
⬥FORTRAN software package for
⬥ Ab initio quantum chemistry
⬥ Materials modeling
⬥Iterative solve with FFTs and linear algebra (BLAS etc.)
⬥Reference system – Ta2O5
⬥ Two racks of BlueGene/Q
⬥ 6.7 m³ of space
⬥ 32,768 cores
⬥ 53 min wall time
⬥ 384 kW (25% cooling)
20
Loopflow Graph
Focus profiling on loop structure, not function calls
⬥Function calls are a control-flow concept
⬥ Jump to another point in instruction data
⬥ Reusable logic, independent of calling order
⬥ Most profiling tools focus on function calls
⬥For dataflow, map out major loops
⬥ Dataflow engines have an implicit outer loop
⬥ Measure rates of data flowing in and out
⬥ Compare to volume of transient data generated internally
⬥QE (Quantum ESPRESSO) case study
⬥ Typical FFT loops over 5 GB of psi input data
⬥ Input vrs is 128 MB, changes rarely
⬥ Equivalent internal memory is 250 GB
⬥ Control flow – break into small batches
⬥ Dataflow – run single streaming action
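The comparison of external data volume to internally generated data can be written out with the QE figures above. This is a minimal sketch that counts only the two inputs the slide names, since no output size is given, and assumes binary units (1 GB = 1024 MB).

```python
# Sizes in MB, taken from the QE case study bullets.
inputs_mb = {
    "psi": 5 * 1024,   # streamed FFT input
    "vrs": 128,        # changes rarely
}
internal_mb = 250 * 1024  # "equivalent internal memory"

io_mb = sum(inputs_mb.values())
internal_per_io = internal_mb / io_mb

print(f"input volume:    {io_mb} MB")
print(f"internal volume: {internal_mb} MB")
print(f"internal data per unit of input: {internal_per_io:.1f}x")
```

On these numbers the loop generates roughly 49 times more transient data than it reads, which is the kind of ratio that favours keeping intermediate data inside one streaming action.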
21
Optimize Memory
Identify data sizes to lay out the dataflow architecture
⬥Two types of memory:
⬥ FMem is fast and local to the chip – up to 40 MB accessed every clock cycle
⬥ LMem is large on-board memory, up to 96 GB
⬥QE case study
⬥ Use FMem for 2D transposes (one plane is 0.5 MB)
⬥ Use LMem for 3D transposes (one cube is 128 MB)
⬥ Need to move 10x more data over LMem than over PCIe
[Diagram labels: PCIe, LMem, FMem; legend: <6.5%, <19.6%, <50%, 100%]
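The sizing rule above can be sketched as a small helper that picks the smallest memory that holds a buffer. The capacities come from the slide. The grid is an assumption: 256 points per side at 8 bytes per point happens to reproduce the 0.5 MB plane and 128 MB cube, but the source does not state the grid dimensions. `choose_memory` is a hypothetical helper, not part of any Maxeler API.

```python
MB = 2**20
GB = 2**30

FMEM_CAPACITY = 40 * MB  # on-chip memory, "up to 40MB" per the slide
LMEM_CAPACITY = 96 * GB  # on-board memory, "up to 96GB" per the slide


def choose_memory(buffer_bytes):
    """Return the smallest memory tier that can hold the buffer."""
    if buffer_bytes <= FMEM_CAPACITY:
        return "FMem"
    if buffer_bytes <= LMEM_CAPACITY:
        return "LMem"
    return "does not fit on one DFE"


# Assumed grid: 256 points per side, 8 bytes per point.
n, bytes_per_point = 256, 8
plane_bytes = n * n * bytes_per_point
cube_bytes = n * n * n * bytes_per_point

print(plane_bytes / MB, "MB plane ->", choose_memory(plane_bytes))
print(cube_bytes / MB, "MB cube  ->", choose_memory(cube_bytes))
```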
22
Dataflow Architecture
Match dataflows to available capacities and bandwidths
23
Computing in Space
Fill up the chip for maximum performance
LMem
PCIe
24
Performance Modeling
Simple arithmetic without the guesswork of cache, OS, etc.
PCIe: 7.1 MB/cube at 3 GB/s = 433 cubes/s
Compute: 4M cycles/cube at 150 MHz clock x 6 pipes = 215 cubes/s (BOTTLENECK)
LMem: 205 MB/cube at 50 GB/s = 250 cubes/s
Single DFE: 215 cubes/s
One rack of BlueGene/Q: 337 cubes/s
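The arithmetic behind these figures can be reproduced in a few lines. Each resource gives a rate limit (capacity per second divided by cost per cube), and the smallest limit is the bottleneck. One assumption is needed to match the slide's rounded results: MB, GB and "M cycles" are treated as binary multiples (2^20 and 2^30). The function name `cubes_per_second` is illustrative.

```python
MEBI = 2**20
GIBI = 2**30


def cubes_per_second(capacity_per_s, cost_per_cube):
    """Rate limit of one resource: capacity per second / cost per cube."""
    return capacity_per_s / cost_per_cube


limits = {
    # bytes per second / bytes per cube
    "PCIe": cubes_per_second(3 * GIBI, 7.1 * MEBI),
    # (clock cycles per second * parallel pipes) / cycles per cube
    "Compute": cubes_per_second(150e6 * 6, 4 * MEBI),
    # bytes per second / bytes per cube
    "LMem": cubes_per_second(50 * GIBI, 205 * MEBI),
}

bottleneck = min(limits, key=limits.get)

for name, rate in limits.items():
    print(f"{name:8s}{rate:7.1f} cubes/s")
print("bottleneck:", bottleneck)
```

The three rates round to 433, 215 and 250 cubes/s, with compute as the limiting stage, so the single-DFE estimate is the compute figure.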
25
Performance Modeling
Comparison to reference system
⬥BlueGene/Q contains significant water cooling and communication – FFT divided across 256 nodes
⬥Maxeler MPC-X is air-cooled, optically connected internally – FFT in a single node
⬥Overall about 700x improvement in compute/space and about 1000x improvement in compute/power
⬥ These are for the FFT task only – but a proper phase 2 architecture should scale them up to the full model
System | 1 rack of BlueGene/Q | Maxeler MPC-X 1U with 8 MAX5 DFEs | Comparison
Space | 3.374 m³ | 0.025 m³ | 135x
Power | 192 kW | 1 kW | 192x
Performance | 338 cubes/s | 1716 cubes/s | 5.1x
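The ratios in the comparison can be checked directly from the table values. The sketch below also multiplies the performance ratio by the space and power ratios. Reading the 700x and 1000x headline figures as those products is an interpretation, not something the slide states.

```python
# Values from the comparison table.
bgq = {"space_m3": 3.374, "power_kw": 192.0, "cubes_per_s": 338.0}
mpcx = {"space_m3": 0.025, "power_kw": 1.0, "cubes_per_s": 1716.0}

space_ratio = bgq["space_m3"] / mpcx["space_m3"]
power_ratio = bgq["power_kw"] / mpcx["power_kw"]
perf_ratio = mpcx["cubes_per_s"] / bgq["cubes_per_s"]

# Throughput per unit of space and per unit of power, relative to BlueGene/Q.
compute_per_space = perf_ratio * space_ratio
compute_per_power = perf_ratio * power_ratio

print(f"space {space_ratio:.0f}x, power {power_ratio:.0f}x, "
      f"performance {perf_ratio:.1f}x")
print(f"compute/space {compute_per_space:.0f}x, "
      f"compute/power {compute_per_power:.0f}x")
```

The products come out at about 685x and 975x, which is consistent with the rounded 700x and 1000x figures.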
26
Code Integration
APIs at multiple levels
⬥SAPI – Single DFE
⬥ Simple Live CPU (SLiC) interface
⬥ Non-blocking actions
⬥ Portable shared-object file
⬥MAPI – Multiple DFEs
⬥ Partition problem space
⬥ Allocate engines dynamically
⬥DAPI – Device API
⬥ Interact with pre-built MaxJ logic
⬥ Reconfigure an existing dataflow solution for a new problem
27
AppGallery
Largest collection of dataflow applications
http://appgallery.maxeler.com/#/
28
MaxGenFD
Purpose-built finite difference suite for dataflow computing
⬥Developed to serve energy industry
⬥ Finite-difference in 3D
⬥ Seismic study modeling
⬥Layer over MaxJ/MaxCompiler
⬥ Science user codes FD equations in Java
⬥ Domain decomposition
⬥ Sharing of halo through MaxRing
⬥ Minimal dataflow knowledge required
29
Proven Performance
An order of magnitude improvement over a leading supercomputer
⬥Gan, L., Fu, H., Luk, W., Yang, C., Xue, W., Huang, X., et al. (2015, April). Solving the Global Atmospheric Equations through Heterogeneous Reconfigurable Platforms. ACM Transactions on Reconfigurable Technology and Systems, 8(2).
⬥Joint research with Imperial College and Tsinghua University
⬥Simulating the atmosphere using the shallow water equation
Platform | Processor | Points/s | Speedup | Power (W) | Efficiency
CPU Rack | 2xCPU | 82K | 1x | 377 | 1x
Tianhe-1A Node | 2xCPU + Fermi GPU | 110.4K | 1.4x | 360 | 1.4x
Kepler K20x | 2xCPU + Kepler GPU | 468.1K | 2.6x | 365 | 2.6x
Maxeler MPC-X | 4xDFE | 1.54M | 19.4x | 514 | 14.2x
30
MaxML for Machine Learning
Order of magnitude improvements in training and inference
⬥ Machine learning on DFEs uses large-capacity memory and in-line training updates
⬥ Support for convolutional and fully connected layers
⬥ Choose the exact precision you need for maximum performance
31
Questions?
What can dataflow programming accelerate for you?
