Perlmutter - A 2020 Pre-Exascale
GPU-accelerated System for NERSC -
Architecture and Application
Performance Optimization
Nicholas J. Wright
Perlmutter Chief Architect
GPU Technology Conference
San Jose
March 21 2019
NERSC is the mission High Performance
Computing facility for the DOE SC
7,000 Users
800 Projects
700 Codes
2000 NERSC citations per year
Simulations at scale
Data analysis support for
DOE’s experimental and
observational facilities
Photo Credit: CAMERA
NERSC has a dual mission to advance science
and the state-of-the-art in supercomputing
• We collaborate with computer companies years
before a system’s delivery to deploy advanced
systems with new capabilities at large scale
• We provide a highly customized software and
programming environment for science
applications
• We are tightly coupled with the workflows of
DOE’s experimental and observational facilities –
ingesting tens of terabytes of data each day
• Our staff provide advanced application and
system performance expertise to users
Perlmutter is a Pre-Exascale System
Pre-Exascale Systems
• 2013: Mira (Argonne, IBM BG/Q), Titan (ORNL, Cray/NVIDIA K20), Sequoia (LLNL, IBM BG/Q)
• 2016: Cori (LBNL, Cray/Intel Xeon/KNL), Trinity (LANL/SNL, Cray/Intel Xeon/KNL), Theta (Argonne, Intel/Cray KNL)
• 2018: Summit (ORNL, IBM/NVIDIA P9/Volta), Sierra (LLNL, IBM/NVIDIA P9/Volta)
• 2020: Perlmutter (LBNL, Cray/NVIDIA/AMD)
• 2021: Crossroads (LANL/SNL, TBD)
Exascale Systems (2021-2023)
• A21 (Argonne, Intel/Cray)
• Frontier (ORNL, TBD)
• LLNL exascale system (TBD)
NERSC Systems Roadmap
• NERSC-7: Edison, multicore CPU (2013)
• NERSC-8: Cori, manycore CPU (2016); NESAP launched to transition applications to advanced architectures
• NERSC-9: CPU and GPU nodes (2020); continued transition of applications and support for complex workflows
• NERSC-10: Exa system (2024)
• NERSC-11: Beyond Moore (2028)
Increasing need for energy-efficient architectures
Cori: A pre-exascale supercomputer for the
Office of Science workload
Cray XC40 system with 9,600+ Intel
Knights Landing compute nodes
68 cores / 96 GB DRAM / 16 GB HBM
Support the entire Office of Science
research community
Begin to transition workload to energy
efficient architectures
1,600 Haswell processor nodes
NVRAM Burst Buffer 1.5 PB, 1.5 TB/sec
30 PB of disk, >700 GB/sec I/O bandwidth
Integrated with Cori Haswell nodes on
Aries network for data / simulation /
analysis on one system
Perlmutter: A System Optimized for Science
● GPU-accelerated and CPU-only nodes
meet the needs of large scale
simulation and data analysis from
experimental facilities
● Cray “Slingshot” - High-performance, scalable, low-latency, Ethernet-compatible network
● Single-tier All-Flash Lustre-based HPC file system, 6x Cori’s bandwidth
● Dedicated login and high memory
nodes to support complex workflows
GPU nodes: 2-3x Cori (“Volta” specs)
● 4x NVIDIA “Volta-next” GPU
  ○ > 7 TF
  ○ > 32 GiB, HBM-2
  ○ NVLINK
● 1x AMD CPU
● 4 Slingshot connections (4x25 GB/s)
● GPUDirect, Unified Virtual Memory (UVM)
CPU nodes: ~1x Cori (“Rome” specs)
● AMD “Milan” CPU
  ○ ~64 cores
  ○ “ZEN 3” cores (7nm+)
  ○ AVX2 SIMD (256 bit)
● 8 channels DDR memory, >= 256 GiB total per node
● 1 Slingshot connection (1x25 GB/s)
Perlmutter: A System Optimized for Science
● GPU-accelerated and CPU-only nodes
meet the needs of large scale
simulation and data analysis from
experimental facilities
● Cray “Slingshot” - High-performance, scalable, low-latency, Ethernet-compatible network
● Single-tier All-Flash Lustre-based HPC file system, 6x Cori’s bandwidth
● Dedicated login and high memory
nodes to support complex workflows
How do we optimize
the size of each partition?
NERSC System Utilization (Aug’17 - Jul’18)
● 3 codes > 25% of the
workload
● 10 codes > 50% of the
workload
● 30 codes > 75% of the
workload
● Over 600 codes
comprise the remaining
25% of the workload.
GPU Readiness Among NERSC Codes (Aug’17 - Jul’18)
GPU Status & Description | Fraction
Enabled: Most features are ported and performant | 32%
Kernels: Ports of some kernels have been documented | 10%
Proxy: Kernels in related codes have been ported | 19%
Unlikely: A GPU port would require major effort | 14%
Unknown: GPU readiness cannot be assessed at this time | 25%
Breakdown of Hours at NERSC
A number of applications in the NERSC
workload are already GPU enabled.
We will leverage existing GPU codes
from CAAR + Community
How many GPU nodes to buy - Benchmark Suite
Construction & Scalable System Improvement
Select codes to represent the anticipated workload
• Include key applications from the current workload.
• Add apps that are expected to contribute significantly
to the future workload.
Scalable System Improvement
Measures aggregate performance of HPC machine
• How many more copies of the benchmark can be run
relative to the reference machine
• Performance relative to reference machine
Application | Description
Quantum Espresso | Materials code using DFT
MILC | QCD code using staggered quarks
StarLord | Compressible radiation hydrodynamics
DeepCAM | Weather / Community Atmospheric Model 5
GTC | Fusion PIC code
“CPU Only” (3 total) | Representative of applications that cannot be ported to GPUs
SSI = (#Nodes × Jobsize × Perf_per_Node) / (#Nodes_ref × Jobsize_ref × Perf_per_Node_ref)
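As an illustrative reading of the metric (hypothetical numbers, not from the slides): if a benchmark can be run as 4x as many simultaneous copies on the proposed system as on the reference machine, and each copy also runs 2.5x faster, that benchmark contributes a capability ratio of 4 × 2.5 = 10; SSI aggregates these per-benchmark ratios across the suite.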
Hetero system design & price sensitivity:
Budget for GPUs increases as GPU price drops
Chart explores an isocost design space
• Vary the budget allocated to GPUs
• Assume GPU-enabled applications have a
performance advantage of 10x per node; 3
of 8 apps remain CPU-only.
• Examine GPU/CPU node cost ratio
Circles: 50% CPU nodes + 50% GPU nodes
Stars: Optimal system configuration.
GPU/CPU $ per node | SSI increase vs. CPU-only (@ budget %) | Notes
8:1 | None | No justification for GPUs
6:1 | 1.05x @ 25% | Slight justification for up to 50% of budget on GPUs
4:1 | 1.23x @ 50% | GPUs cost effective up to full system budget, but optimum at 50%
B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, N. J.
Wright, “A Metric for Evaluating Supercomputer Performance in the Era of Extreme
Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking
and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018.
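The sketch below is a minimal, hypothetical version of this kind of isocost sweep, written to show the mechanics rather than to reproduce the table above: the cost ratio, the 10x speedup, equal benchmark weights, and the assumption that CPU-only codes can still use the GPU nodes' host CPUs are all illustrative choices, not the model from the PMBS18 paper.

```cpp
// Minimal isocost design-space sweep (illustrative model, not the PMBS18 method):
// fix the hardware budget, vary the fraction f spent on GPU nodes, and estimate
// SSI relative to an all-CPU system of the same cost.
#include <cstdio>

int main() {
    const double cost_ratio  = 4.0;   // assumed GPU-node : CPU-node cost ratio
    const double gpu_speedup = 10.0;  // assumed per-node speedup for GPU-ready codes
    const int    n_apps = 8, n_gpu_ready = 5;

    for (double f = 0.0; f <= 1.0001; f += 0.05) {
        // Normalize so the all-CPU reference machine corresponds to 1.0 node-units.
        const double cpu_nodes = 1.0 - f;         // budget share spent on CPU nodes
        const double gpu_nodes = f / cost_ratio;  // GPU nodes cost more per node

        double ssi = 0.0;
        for (int a = 0; a < n_apps; ++a) {
            const bool gpu_ready = (a < n_gpu_ready);
            // Capability ratio vs. the reference: nodes the code can use, times its
            // per-node performance on them (CPU-only codes are assumed to run at 1x
            // on the GPU nodes' host CPUs).
            const double capability = gpu_ready
                ? cpu_nodes + gpu_nodes * gpu_speedup
                : cpu_nodes + gpu_nodes;
            ssi += capability;  // equal weights across the 8 benchmarks
        }
        ssi /= n_apps;
        std::printf("GPU budget %3.0f%%  ->  SSI %.2f\n", 100.0 * f, ssi);
    }
    return 0;
}
```

Changing cost_ratio and gpu_speedup covers both this study and the application-readiness study that follows, although this simplified model will not reproduce the exact SSI values or interior optima reported in the paper.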
Application readiness efforts justify larger GPU
partitions.
Explore an isocost design space
• Assume 8:1 GPU/CPU node cost ratio.
• Vary the budget allocated to GPUs
• Examine GPU / CPU performance gains
such as those obtained by software
optimization & tuning. 5 of 8 codes
have 10x, 20x, 30x speedup.
Circles: 50% CPU nodes + 50% GPU nodes
Stars: Optimal system configuration
GPU/CPU perf. per node | SSI increase vs. CPU-only (@ budget %) | Notes
10x | None | No justification for GPUs
20x | 1.15x @ 45% | Compare to 1.23x for 10x at a 4:1 GPU/CPU cost ratio
30x | 1.40x @ 60% | Compare to 3x from NESAP for KNL
B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, N. J.
Wright, “A Metric for Evaluating Supercomputer Performance in the Era of Extreme
Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking
and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018.
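Reading the middle row as a worked example: at an assumed 8:1 node cost ratio, raising the per-node GPU advantage of the five GPU-ready codes from 10x to 20x makes a machine with about 45% of its budget in GPU nodes roughly 1.15x better, by SSI, than spending the entire budget on CPU nodes.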
How to Enable NERSC’s diverse community of 7,000
users, 750 projects, and 700 codes to run on advanced
architectures like Perlmutter and beyond?
• NERSC Exascale Science Application Program (NESAP)
• Engage ~25 Applications
• up to 17 postdoctoral fellows
• Deep partnerships with every SC Office area
• Leverage vendor expertise and hack-a-thons
• Knowledge transfer through documentation and training for all users
• Optimize codes with improvements relevant to multiple architectures
Application Readiness Strategy for Perlmutter
NESAP for Perlmutter will extend activities from NESAP for Cori:
1. Identifying and exploiting on-node parallelism
2. Understanding and improving data-locality within the
memory hierarchy
What’s New for NERSC Users?
1. Heterogeneous compute elements
2. Identification and exploitation of even more parallelism
3. Emphasis on performance-portable programming
approach:
– Continuity from Cori through future NERSC systems
GPU Transition Path for Apps
OpenMP is the most popular non-MPI
parallel programming technique
● Results from ERCAP
2017 user survey
○ Question answered
by 328 of 658
survey respondents
● Total exceeds 100%
because some
applications use
multiple techniques
OpenMP meets the needs of the
NERSC workload
● Supports C, C++ and Fortran
○ The NERSC workload consists of ~700 applications with a
roughly equal mix of C, C++ and Fortran
● Provides portability to different architectures at other DOE
labs
● Works well with MPI: hybrid MPI+OpenMP approach
successfully used in many NERSC apps
● Recent release of OpenMP 5.0 specification – the third version
providing features for accelerators
○ Many refinements over this five year period
Ensuring OpenMP is ready for
Perlmutter CPU+GPU nodes
● NERSC will collaborate with NVIDIA to enable OpenMP
GPU acceleration with PGI compilers
○ NERSC application requirements will help prioritize
OpenMP and base language features on the GPU
○ Co-design of NESAP-2 applications to enable effective use
of OpenMP on GPUs and guide PGI optimization effort
● We want to hear from the larger community
○ Tell us your experience, including what OpenMP
techniques worked / failed on the GPU
○ Share your OpenMP applications targeting GPUs
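As a concrete sketch of the kind of OpenMP GPU offload this collaboration is targeting, the toy kernel below (a hypothetical example, not a NESAP application) uses the standard OpenMP target teams distribute parallel for construct, so the same loop can run threaded on CPU nodes or be offloaded to GPU nodes:

```cpp
// Toy OpenMP target-offload example: an AXPY-style loop that can run on the
// host CPUs or be offloaded to a GPU, depending on the compiler and runtime.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 3.0;
    double *px = x.data(), *py = y.data();

    // Map the arrays to the device and distribute the loop across teams of
    // device threads; if no device is available, OpenMP runs the region on the host.
    #pragma omp target teams distribute parallel for \
            map(to: px[0:n]) map(tofrom: py[0:n])
    for (int i = 0; i < n; ++i)
        py[i] = a * px[i] + py[i];

    std::printf("y[0] = %.1f\n", py[0]);  // expect 5.0
    return 0;
}
```

Exact compiler invocations for Perlmutter are not specified here; as an assumption, something like clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda (or a future PGI/NVIDIA equivalent) would build it for GPU offload, while a plain -fopenmp build keeps it on the CPU.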
Breaking News !
Engaging around Performance Portability
● NERSC is working with PGI/NVIDIA to enable OpenMP GPU acceleration
● NERSC is a Member
● Doug Doerfler led the Performance Portability Workshop at SC18 and the 2019 DOE COE Performance Portability Meeting
● NERSC is leading development of performanceportability.org
● NERSC hosted past C++ Summit and ISO C++ meetings on HPC
Slingshot Network
• High Performance scalable interconnect
– Low latency, high-bandwidth, MPI
performance enhancements
– At most 3 hops between any pair of nodes
– Sophisticated congestion control and
adaptive routing to minimize tail latency
• Ethernet compatible
– Blurs the line between the inside and the
outside of the machine
– Allows for seamless external communication
– Direct interface to storage
Perlmutter has an All-Flash Filesystem
4.0 TB/s to Lustre
>10 TB/s overall
Logins, DTNs, Workflows
All-Flash Lustre Storage
CPU + GPU Nodes
Community FS
~ 200 PB, ~500 GB/s
Terabits/sec to
ESnet, ALS, ...
• Fast across many dimensions
– 4 TB/s sustained bandwidth
– 7,000,000 IOPS
– 3,200,000 file creates/sec
• Usable for NERSC users
– 30 PB usable capacity
– Familiar Lustre interfaces
– New data movement capabilities
• Optimized for NERSC data workloads
– NEW small-file I/O improvements
– NEW features for high IOPS, non-sequential I/O
• Winner of 2011 Nobel Prize in
Physics for discovery of the
accelerating expansion of the
universe.
• Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large-scale simulations with experimental data analysis
• Login “saul.nersc.gov”
NERSC-9 will be named after Saul Perlmutter
Perlmutter: A System Optimized for Science
● Cray Shasta System providing 3-4x capability of Cori system
● First NERSC system designed to meet needs of both large scale simulation
and data analysis from experimental facilities
○ Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
○ Cray Slingshot high-performance network will support Terabit-rate connections to the system
○ Optimized data software stack enabling analytics and ML at scale
○ All-Flash filesystem for I/O acceleration
● Robust readiness program for simulation, data and learning applications
and complex workflows
● Delivery in late 2020
Thank you !
We are hiring - https://jobs.lbl.gov/
