ORNL is managed by UT-Battelle
for the US Department of Energy
Leveraging Leadership Computing Facilities:
Assisting Users' Transition to Titan's Accelerated Architecture
Fernanda Foertter
HPC User Assistance Team
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
Workshop on “Directives and Tools for Accelerators:
A Seismic Programming Shift”
Center for Advanced Computing and Data Systems,
University of Houston
20 October 2014
2
Outline
•  OLCF Center Overview
•  Manycore is here to stay
•  The Titan Project: Lessons Learned
•  Coding for future architectures
3
OLCF Services
Oak Ridge Leadership Computing Facility
[Diagram labels: Liaisons, User Assistance, Viz, Tech, Ops, Outreach, Everest, Future, Tours, Internships, Tools, Collaboration, Scaling, Performance, Advocacy, Training, Software, Communications]
4
Increased our system capability by 10,000X
5
No more free lunch:
Moore’s Law continues, Dennard scaling is over
Herb Sutter: Dr. Dobb’s Journal:
http://www.gotw.ca/publications/concurrency-ddj.htm
6
Per core performance down, cores up
7
Kogge and Shalf, IEEE CISE
Watts per Sq Cm
8
Manycore Accelerators
9
ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors
Footprint: 4,352 ft² (404 m²)
SYSTEM SPECIFICATIONS:
•  Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
•  18,688 compute nodes, each with:
–  16-core AMD Opteron CPU (32 GB)
–  NVIDIA Tesla “K20x” GPU (6 GB)
•  512 service and I/O nodes
•  200 cabinets
•  710 TB total system memory
•  Cray Gemini 3D torus interconnect
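The headline figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes "(24.5 & 2.6)" means the GPU and CPU shares of the 27.1 PF peak, and that "710 TB" counts CPU and GPU memory together in decimal units. The function names are made up for this example.

```python
# Figures quoted on the slide.
NODES = 18_688
CPU_MEM_GB = 32      # Opteron memory per node
GPU_MEM_GB = 6       # K20x memory per node
PEAK_GPU_PF = 24.5   # assumption: GPU share of the 27.1 PF peak
PEAK_CPU_PF = 2.6    # assumption: CPU share of the 27.1 PF peak


def total_memory_tb(nodes=NODES, cpu_gb=CPU_MEM_GB, gpu_gb=GPU_MEM_GB):
    """Aggregate CPU + GPU memory in decimal terabytes."""
    return nodes * (cpu_gb + gpu_gb) / 1000


def per_node_peak_tf(total_pf, nodes=NODES):
    """Peak per node in teraflops, given a system-wide peak in petaflops."""
    return total_pf * 1000 / nodes


print(f"total memory : {total_memory_tb():.0f} TB")
print(f"GPU peak/node: {per_node_peak_tf(PEAK_GPU_PF):.2f} TF")
print(f"CPU peak/node: {per_node_peak_tf(PEAK_CPU_PF):.3f} TF")
print(f"GPU share    : {PEAK_GPU_PF / (PEAK_GPU_PF + PEAK_CPU_PF):.0%}")
```

Under these assumptions roughly nine tenths of the machine's peak sits on the accelerators, which is why the rest of the talk concentrates on getting codes onto the GPU.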
10
Titan Compute Nodes (Cray XK7)
•  Node: AMD Opteron 6200 “Interlagos” (16 cores), 2.2 GHz, 32 GB (DDR3)
•  Accelerator: Tesla K20x (2688 CUDA cores), 732 MHz, 6 GB (GDDR5)
•  Links shown in the diagram: HT3 (two), PCIe Gen2
11
Shift into Hierarchical Parallelism
•  Expose more parallelism through code
refactoring and source code directives
–  Doubles CPU performance of many codes
•  Use right type of processor for each task
•  Data locality: Keep data near processing
–  GPU has high bandwidth to local memory
for rapid access
–  GPU has large internal cache
•  Explicit data management: Explicitly
manage data movement between CPU
and GPU memories
CPU
•  Optimized for sequential multitasking
GPU Accelerator
•  Optimized for many simultaneous tasks
•  10× performance per socket
•  5× more energy-efficient systems
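A back-of-the-envelope model shows why keeping data near the GPU matters. This is a sketch under stated assumptions, not a measurement. The two bandwidths are nominal peak values assumed for a PCIe Gen2 x16 link and for K20x device memory. The kernel is assumed to be purely bandwidth bound. The function name is hypothetical.

```python
PCIE_GEN2_GBPS = 8.0   # assumption: nominal PCIe Gen2 x16 bandwidth, GB/s
GPU_MEM_GBPS = 250.0   # assumption: nominal K20x device-memory bandwidth, GB/s


def sweep_time_s(data_gb, sweeps, copy_every_sweep):
    """Estimated time to stream `data_gb` through the GPU `sweeps` times.

    Each sweep reads the data once from device memory. The host-to-device
    copy is paid either on every sweep or only once up front.
    """
    copies = sweeps if copy_every_sweep else 1
    transfer = copies * data_gb / PCIE_GEN2_GBPS
    compute = sweeps * data_gb / GPU_MEM_GBPS
    return transfer + compute


# Hypothetical workload: 4 GB of data, 100 sweeps over it.
naive = sweep_time_s(4.0, 100, copy_every_sweep=True)
resident = sweep_time_s(4.0, 100, copy_every_sweep=False)
print(f"copy every sweep: {naive:.1f} s")
print(f"copy once       : {resident:.1f} s")
print(f"ratio           : {naive / resident:.1f}x")
```

With these numbers the run is dominated by the PCIe link unless the data stays resident on the device, which is the point of the "explicit data management" bullet.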
12
Old Programming Models
[Diagram: three nodes, one core each, communicating over MPI]
13
Old Programming Models
[Diagram: three nodes, each running several MPI ranks, connected by MPI collectives]
14
Directive Programming Models
[Diagram: three nodes, OpenMP inside each node, MPI between nodes]
15
Hybrid Programming Models
[Diagram: three nodes, each using directives to drive an accelerator, MPI between nodes]
16
Hybrid Programming Models
[Diagram: three nodes on the torus; each node combines MPI, OpenMP, OpenACC and intrinsics and is shown with two accelerator boxes]
17
Let’s not forget I/O
[Diagram: Node 1 through Node 18,688 all sharing one file system]
18
Path to Exascale
•  Hierarchical parallelism
–  Improve scalability of applications
•  Expose more parallelism
–  Code refactoring and source code directives can double performance
•  Explicit data management
–  Between CPU and GPU memories
•  Data locality: Keep data near processing
–  GPU has high bandwidth to local memory and large internal cache
•  Heterogeneous multicore processor architecture
–  Using right type of processor for each task
19
Programming Hybrid Architectures
Three routes to accelerating applications:
•  Libraries: “drop-in” acceleration
•  OpenACC / OpenMP directives: incremental, enhanced portability
•  Programming languages (CUDA, OpenCL): maximum performance
20
All Codes Will Need Refactoring To Scale!
•  Up to 1-2 person-years required to port each code from
Jaguar to Titan
•  We estimate possibly 70-80% of developer time was spent
in code restructuring, regardless of whether using
OpenMP / CUDA / OpenCL / OpenACC / …
–  Experience shows this is a one-time investment
•  Each code team must make its own choice between
OpenMP, CUDA, OpenCL and OpenACC based on its
specific case; the conclusion may differ for each code
•  Our users and their sponsors must plan for this expense.
21
Center for Accelerated Application
Readiness (CAAR)
•  Prepare applications for accelerated architectures
•  Goals:
–  Create application teams to develop and implement
strategies for exposing hierarchical parallelism in our
users' applications
–  Maintain code portability across modern architectures
–  Learn from and share our results
•  We selected six applications from across different
science domains and algorithmic motifs
22
CAAR: Selected Lessons Learned
•  Repeated themes in the code porting work
–  Finding more threadable work for the GPU
–  Improving memory access patterns
–  Making GPU work (kernel calls) more coarse-grained if possible
–  Making data on the GPU more persistent
–  Overlapping data transfers with other work (leverage Hyper-Q)
–  Using as much asynchronicity as possible (CPU, GPU, MPI, PCIe-2)
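The payoff from overlapping transfers with other work can be estimated with a simple two-stage pipeline model. This is a rough sketch rather than a description of any CAAR code. It assumes the data is split into equal chunks, that one copy and one kernel can run at the same time, and that per-chunk times are constant. The function names and the timings are hypothetical.

```python
def serial_time(n_chunks, t_copy, t_kernel):
    """Every chunk is copied, then computed, with no overlap."""
    return n_chunks * (t_copy + t_kernel)


def overlapped_time(n_chunks, t_copy, t_kernel):
    """Two-stage pipeline: while chunk i is computed, chunk i+1 is copied.

    The first copy and the last kernel cannot be hidden; in between, the
    slower of the two stages sets the pace.
    """
    if n_chunks == 0:
        return 0.0
    return t_copy + (n_chunks - 1) * max(t_copy, t_kernel) + t_kernel


# Hypothetical workload: 8 chunks, 1 ms copy and 3 ms kernel per chunk.
print("serial    :", serial_time(8, 1.0, 3.0), "ms")
print("overlapped:", overlapped_time(8, 1.0, 3.0), "ms")
```

In this model the saving is largest when copy and kernel times are similar, and a single large chunk gains nothing, so overlap only pays off once the work is split finely enough to pipeline.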
23
CAAR: Selected Lessons Learned
•  The difficulty level of the GPU port was in part determined by:
–  Structure of the algorithms, e.g., available parallelism, high computational intensity
–  Code execution profile: flat or hot spots
–  The code size (LOC)
24
CAAR: Selected Lessons Learned
•  More available flops on the node should lead us to think
of new science opportunities enabled
•  We may need to look in unconventional places to get
another ~30X thread parallelism that may be needed
for exascale, e.g., parallelism in time
25
Co-designing Future Programming Models
•  Evolutionary vs. Revolutionary approaches:
–  Message Passing and PGAS
•  MPI, UPC, OpenSHMEM, Fortran 2008 Coarrays, Chapel
–  Shared Memory Models
•  OpenMP, Pthreads
–  Accelerator-based models
•  OpenACC, OpenMP 4.0, OpenCL, CUDA
–  Hybrid Models
•  MPI + OpenACC, MPI + OpenMP 4.0, OpenSHMEM + OpenACC, etc.
•  New runtime models: Legion, OCR, Express, PaRSEC
–  Asynchronous task-based models
•  How to efficiently map the model to the hardware
while meeting application requirements?
26
Directives collaboration
•  Serve on standards committees
•  Gather requirements from users
•  Translate users’ needs and use cases
27
Requirements Gathering Example
App | Language | Data structure issues
LSMS 3 | C++ | Templated Matrix class with bare pointer to data. Either owns the data or is an alias to another Matrix object. std::vector and std::complex needed on device.
CAM-SE | F90 | Array of structs. A struct member of the struct has a multidimensional array member of which sections must be transferred at different times.
Mini-FE | C | Vector of pointers transferred to the device. Pointers are to the same data structure.
LAMMPS | C / C++ | Flat C arrays requiring transfer.
ICON (CSCS) | F95 | Array of structs of allocatable arrays. Need selective deep copy of derived type members.
UPACS | F90 | Structs of allocatable arrays.
GENESIS | F90 | Structs of allocatable arrays; these arrays are accessed by pointers that are set before entering the parallel region.
HFODD | F90 | Require better support for Fortran derived types.
Delta5D | F77 / F90 | Vectors, indexing arrays; no derived types.
XGC1 | F90 | Array of derived types with pointers to other nested derived types, e.g. block(b)%grp(g)%p. Need deep copy.
DFTB | F77 / F90 | Dense linear algebra.
NIM/FIM | F90 | Multidimensional arrays, no structs.
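Several rows above ask for a selective deep copy: moving one member array out of every struct without dragging the rest along. One common workaround is to pack that member into a single contiguous buffer plus an offset table before transfer. The host-side sketch below only illustrates the idea and performs no device transfer. `pack_members`, `member_view` and the `blocks` data are hypothetical names invented for this example, not taken from any of the listed applications.

```python
from array import array


def pack_members(structs, member):
    """Pack one named member array from every struct into a single
    contiguous buffer. Returns (buffer, offsets), where struct i's data
    occupies buffer[offsets[i]:offsets[i + 1]]."""
    buffer = array("d")
    offsets = [0]
    for s in structs:
        buffer.extend(s[member])
        offsets.append(len(buffer))
    return buffer, offsets


def member_view(buffer, offsets, i):
    """Recover struct i's member data from the packed buffer."""
    return buffer[offsets[i]:offsets[i + 1]]


# Hypothetical array of structs: only "p" is needed on the device,
# the large "scratch" member should stay on the host.
blocks = [
    {"p": [1.0, 2.0, 3.0], "scratch": [0.0] * 1000},
    {"p": [4.0], "scratch": [0.0] * 1000},
    {"p": [5.0, 6.0], "scratch": [0.0] * 1000},
]

buf, offs = pack_members(blocks, "p")
print(len(buf), "values packed, offsets", offs)
```

A single flat buffer is something any directive model can transfer today. The cost is the extra packing step and index arithmetic, which is the burden the deep-copy requests in the table are trying to push into the programming model.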
28
Challenges with Directive-based
programming models
•  How to specify the in-node parallelism in the application
–  Loop based parallelism is not enough for future systems
•  How to efficiently map the parallelism of the application to
the hardware
–  How to schedule work to multiple accelerators within the node?
–  How to schedule work within accelerators while being portable?
•  How to transfer data across different types of memory
–  Problem may go away but is important for data locality
•  How to specify different memory hierarchies in the
programming model
–  Shared memory within GPU, etc
29
Future is Descriptive Programming
AMD Discrete GPU
•  Large number of small cores
•  Data parallelism is key
•  PCIe to CPU connection
AMD APU
•  Integrated CPU+GPU cores
•  Target power efficient devices at this stage
•  Shared memory system with partitions
Intel Many Integrated Core (MIC)
•  50+ x86 cores
•  Support conventional programming
•  Vectorization is key
•  Run as an accelerator or standalone
NVIDIA GPU
•  Large number of small cores
•  Data parallelism is key
•  Support nested and dynamic parallelism
•  PCIe to host CPU or low power ARM CPU (CARMA)
Directives help describe data layout, parallelism
30
OpenACC influence → OpenMP
•  Compare OpenMP 4.0 accelerator extension with OpenACC
–  Understand mapping
–  Understand impact of newer OpenACC features
•  OpenACC is evolving with new features which may impact OpenMP 4.1 or 5.
•  OpenACC interoperability with OpenMP is important for the transition

OpenACC 2.0 | OpenMP 4.0
parallel | target
parallel/gang/workers/vector | target teams/parallel/simd
data | target data
parallel loop | teams/distribute/parallel for
update | target update
cache | (none listed)
wait | OpenMP 4.1 proposal
declare | declare target
data enter/exit | OpenMP 4.1 proposal
routine | declare target
async wait | OpenMP 4.1 proposal
device type | (none listed)
tile | (none listed)
host data | (none listed)
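When sizing a port, the comparison above can be used as a lookup to flag which OpenACC constructs in a code have a direct OpenMP 4.0 counterpart. The sketch below transcribes the slide's rows as of OpenACC 2.0 and OpenMP 4.0 and is not a statement about later versions of either standard. The dictionary, the function and the sample inventory are hypothetical names for this example.

```python
# Rows transcribed from the slide: OpenACC 2.0 construct -> OpenMP 4.0 construct.
ACC_TO_OMP40 = {
    "parallel": "target",
    "parallel/gang/workers/vector": "target teams/parallel/simd",
    "data": "target data",
    "parallel loop": "teams/distribute/parallel for",
    "update": "target update",
    "declare": "declare target",
    "routine": "declare target",
}
# Rows the slide marks as an OpenMP 4.1 proposal.
PROPOSED_FOR_OMP41 = {"wait", "data enter/exit", "async wait"}
# Rows the slide leaves blank on the OpenMP side.
NOT_LISTED = {"cache", "device type", "tile", "host data"}


def classify(acc_construct):
    """Return the slide's OpenMP status for one OpenACC construct."""
    if acc_construct in ACC_TO_OMP40:
        return "OpenMP 4.0: " + ACC_TO_OMP40[acc_construct]
    if acc_construct in PROPOSED_FOR_OMP41:
        return "OpenMP 4.1 proposal"
    if acc_construct in NOT_LISTED:
        return "no equivalent listed"
    raise KeyError(acc_construct)


# Hypothetical inventory of constructs used by one application.
used = ["parallel loop", "update", "cache", "async wait"]
for name in used:
    print(f"{name:>14} -> {classify(name)}")
```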
31
Training at OLCF
•  Webinars/Remote
•  Hands on
•  Lectures
•  Open to public!!
32
Training at OLCF
33
Conclusions
•  There’s no avoiding manycore
•  Rethink algorithms to expose more parallelism
•  Directives are morphing into Descriptive Programming
•  Memory placement is important
•  Flops are free, avoid reads/writes
•  Standards built from application requirements
•  Training events are open to the public
•  Looking for domain specific communities
34
Acknowledgements
OpenACC and OpenMP Standards Committees
OLCF-3 CAAR Team:
•  Bronson Messer, Wayne Joubert, Mike Brown, Matt
Norman, Markus Eisenbach, Ramanan Sankaran
OLCF-3 Vendor Partners: Cray, AMD, NVIDIA, CAPS, Allinea
This research used resources of the Oak Ridge Leadership
Computing Facility at the Oak Ridge National Laboratory,
which is supported by the Office of Science of the U.S.
Department of Energy under Contract No. DE-AC05-00OR22725.
35
Questions?
FoertterFS@ornl.gov
Contact us at
http://olcf.ornl.gov
http://jobs.ornl.gov
@hpcprogrammer

Contenu connexe

Tendances

Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 
Functional approach to packet processing
Functional approach to packet processingFunctional approach to packet processing
Functional approach to packet processingNicola Bonelli
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsWim Vanderbauwhede
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacGanesan Narayanasamy
 
Scalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC SystemsScalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC Systemsinside-BigData.com
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights LandingAndrey Vladimirov
 
Intel's Nehalem Microarchitecture by Glenn Hinton
Intel's Nehalem Microarchitecture by Glenn HintonIntel's Nehalem Microarchitecture by Glenn Hinton
Intel's Nehalem Microarchitecture by Glenn Hintonparallellabs
 
Some experiences for porting application to Intel Xeon Phi
Some experiences for porting application to Intel Xeon PhiSome experiences for porting application to Intel Xeon Phi
Some experiences for porting application to Intel Xeon PhiMaho Nakata
 
AI is Impacting HPC Everywhere
AI is Impacting HPC EverywhereAI is Impacting HPC Everywhere
AI is Impacting HPC Everywhereinside-BigData.com
 
The Education of Computational Scientists
The Education of Computational ScientistsThe Education of Computational Scientists
The Education of Computational Scientistsinside-BigData.com
 
Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418inside-BigData.com
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCGanesan Narayanasamy
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Anne Nicolas
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 

Tendances (20)

Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
Functional approach to packet processing
Functional approach to packet processingFunctional approach to packet processing
Functional approach to packet processing
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdac
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Scalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC SystemsScalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC Systems
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights Landing
 
Intel's Nehalem Microarchitecture by Glenn Hinton
Intel's Nehalem Microarchitecture by Glenn HintonIntel's Nehalem Microarchitecture by Glenn Hinton
Intel's Nehalem Microarchitecture by Glenn Hinton
 
Some experiences for porting application to Intel Xeon Phi
Some experiences for porting application to Intel Xeon PhiSome experiences for porting application to Intel Xeon Phi
Some experiences for porting application to Intel Xeon Phi
 
AI is Impacting HPC Everywhere
AI is Impacting HPC EverywhereAI is Impacting HPC Everywhere
AI is Impacting HPC Everywhere
 
The Education of Computational Scientists
The Education of Computational ScientistsThe Education of Computational Scientists
The Education of Computational Scientists
 
L05 parallel
L05 parallelL05 parallel
L05 parallel
 
Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
Japan's post K Computer
Japan's post K ComputerJapan's post K Computer
Japan's post K Computer
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 
Phytium 64 core cpu preview
Phytium 64 core cpu previewPhytium 64 core cpu preview
Phytium 64 core cpu preview
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 

Similaire à Assisting User’s Transition to Titan’s Accelerated Architecture

OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V International
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsHPCC Systems
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialmadhuinturi
 
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) Wim Vanderbauwhede
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systemsinside-BigData.com
 
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...Yuichiro Yasui
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systemsinside-BigData.com
 
Exploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design spaceExploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design spacejsvetter
 
Parallel Distributed Deep Learning on HPCC Systems
Parallel Distributed Deep Learning on HPCC SystemsParallel Distributed Deep Learning on HPCC Systems
Parallel Distributed Deep Learning on HPCC SystemsHPCC Systems
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_finalYutaka Kawai
 
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersRyousei Takano
 
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...waqarnabi
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0Sahil Kaw
 
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...Ritu Arora
 

Similaire à Assisting User’s Transition to Titan’s Accelerated Architecture (20)

OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorial
 
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
 
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
 
Ef35745749
Ef35745749Ef35745749
Ef35745749
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 
Exploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design spaceExploring emerging technologies in the HPC co-design space
Exploring emerging technologies in the HPC co-design space
 
Parallel Distributed Deep Learning on HPCC Systems
Parallel Distributed Deep Learning on HPCC SystemsParallel Distributed Deep Learning on HPCC Systems
Parallel Distributed Deep Learning on HPCC Systems
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
 
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
 
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
 
Demystify OpenPOWER
Demystify OpenPOWERDemystify OpenPOWER
Demystify OpenPOWER
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
 
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
 

Plus de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 

Assisting User’s Transition to Titan’s Accelerated Architecture

  • 1. Leveraging Leadership Computing Facilities: Assisting User's Transition to Titan's Accelerated Architecture. Fernanda Foertter, HPC User Assistance Team, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory. Workshop on “Directives and Tools for Accelerators: A Seismic Programming Shift”, Center for Advanced Computing and Data Systems, University of Houston, 20 October 2014. ORNL is managed by UT-Battelle for the US Department of Energy.
  • 2. Outline
      • OLCF Center Overview
      • Manycore is here to stay
      • The Titan Project: Lessons Learned
      • Coding for future architectures
  • 3. OLCF Services (Oak Ridge Leadership Computing Facility). Diagram labels: Liaisons, User Assistance, Viz, Tech Ops, Outreach, Everest, Future, Tours, Internships, Tools, Collaboration, Scaling, Performance, Advocacy, Training, Software, Communications
  • 4. Increased our system capability by 10,000×
  • 5. No more free lunch: Moore’s Law continues, Dennard scaling is over. Herb Sutter, Dr. Dobb’s Journal: http://www.gotw.ca/publications/concurrency-ddj.htm
  • 6. Per-core performance down, cores up
  • 7. Watts per square centimeter (figure; source: Kogge and Shalf, IEEE CiSE)
  • 9. ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors
      Footprint: 4,352 ft² (404 m²)
      System specifications:
      • Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
      • 18,688 compute nodes, each with:
          – 16-core AMD Opteron CPU (32 GB)
          – NVIDIA Tesla “K20x” GPU (6 GB)
      • 512 service and I/O nodes
      • 200 cabinets
      • 710 TB total system memory
      • Cray Gemini 3D torus interconnect
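The 710 TB figure can be cross-checked from the per-node numbers on the same slide. The sketch below is only that arithmetic; the function name `total_memory_gb` is a hypothetical helper, and the conversion assumes 1 TB = 1000 GB.

```c
#include <assert.h>
#include <stdio.h>

/* Total memory in GB across all compute nodes: every node contributes
   its CPU memory plus its GPU memory. (Hypothetical helper name.) */
long total_memory_gb(long nodes, long cpu_gb, long gpu_gb)
{
    assert(nodes >= 0 && cpu_gb >= 0 && gpu_gb >= 0);
    return nodes * (cpu_gb + gpu_gb);
}

/* Print the check using the figures quoted on the slide:
   18,688 nodes, 32 GB per CPU, 6 GB per GPU. */
void print_titan_memory(void)
{
    long gb = total_memory_gb(18688, 32, 6);
    printf("%ld GB, about %ld TB (assuming 1000 GB per TB)\n", gb, gb / 1000);
}
```

With the slide's values this gives 18,688 × 38 GB = 710,144 GB, which rounds to the quoted 710 TB. The two peak-performance components also add up: 24.5 PF + 2.6 PF = 27.1 PF.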
  • 10. Titan Compute Nodes (Cray XK7)
      • Node: AMD Opteron 6200 “Interlagos” (16 cores), 2.2 GHz, 32 GB (DDR3)
      • Accelerator: Tesla K20x (2688 CUDA cores), 732 MHz, 6 GB (GDDR5)
      • Links shown in the diagram: HT3, HT3, PCIe Gen2
  • 11. Shift into Hierarchical Parallelism
      • Expose more parallelism through code refactoring and source code directives
          – Doubles CPU performance of many codes
      • Use right type of processor for each task
      • Data locality: Keep data near processing
          – GPU has high bandwidth to local memory for rapid access
          – GPU has large internal cache
      • Explicit data management: Explicitly manage data movement between CPU and GPU memories
      • CPU: optimized for sequential multitasking
      • GPU accelerator: optimized for many simultaneous tasks; 10× performance per socket; 5× more energy-efficient systems
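As an illustration of "source code directives" combined with "explicit data management", here is a minimal sketch in C with OpenACC. It is not taken from the talk: the function name `saxpy` and the choice of loop are assumptions made for the example. A compiler without OpenACC support ignores the pragmas and runs the loop serially on the CPU with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
   Illustrative kernel only; not from the original slides. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* Explicit data management: x is only copied to the device,
       y is copied in and copied back when the region ends. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Expose the loop's parallelism through a directive. */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

The data clauses make the CPU-to-GPU traffic visible in the source: read-only input moves one way, and only the result is brought back.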
  • 12. Old Programming Models. Diagram labels: Node, Core, MPI
  • 13. Old Programming Models. Diagram labels: Node, MPI, MPI Collectives
  • 18. Path to Exascale
      • Hierarchical parallelism: Improve scalability of applications
      • Expose more parallelism: Code refactoring and source code directives can double performance
      • Explicit data management: Between CPU and GPU memories
      • Data locality: Keep data near processing. GPU has high bandwidth to local memory and large internal cache
      • Heterogeneous multicore processor architecture: Using right type of processor for each task
  • 20. All Codes Will Need Refactoring To Scale!
      • Up to 1-2 person-years required to port each code from Jaguar to Titan
      • We estimate possibly 70-80% of developer time was spent in code restructuring, regardless of whether using OpenMP / CUDA / OpenCL / OpenACC / …
          – Experience shows this is a one-time investment
      • Each code team must make its own choice of using OpenMP vs. CUDA vs. OpenCL vs. OpenACC, based on the specific case; the conclusion may be different for each code
      • Our users and their sponsors must plan for this expense.
  • 21. Center for Accelerated Application Readiness (CAAR)
      • Prepare applications for accelerated architectures
      • Goals:
          – Create application teams to develop and implement strategies for exposing hierarchical parallelism for our users' applications
          – Maintain code portability across modern architectures
          – Learn from and share our results
      • We selected six applications from across different science domains and algorithmic motifs
  • 22. CAAR: Selected Lessons Learned
      • Repeated themes in the code porting work:
          – Finding more threadable work for the GPU
          – Improving memory access patterns
          – Making GPU work (kernel calls) more coarse-grained if possible
          – Making data on the GPU more persistent
          – Overlapping data transfers with other work (leverage HyperQ)
          – Using as much asynchronicity as possible (CPU, GPU, MPI, PCIe-2)
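Two of these themes, persistent GPU data and asynchronicity, can be sketched with OpenACC in C. This is an illustrative example under stated assumptions rather than code from any CAAR application: the function `relax`, its 1D averaging stencil, and the queue number 1 are all invented for the sketch. Without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Repeatedly replaces each interior point of u with the average of its
   two neighbours. tmp is scratch space of the same length.
   Hypothetical example kernel. */
void relax(size_t n, int steps, float *u, float *tmp)
{
    assert(n >= 2 && u != NULL && tmp != NULL);

    /* Persistent data: u is transferred once in each direction for the
       whole time loop; tmp lives only on the device. */
    #pragma acc data copy(u[0:n]) create(tmp[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            /* Kernels queued on the same async queue run in order,
               while the host thread continues without blocking. */
            #pragma acc parallel loop async(1)
            for (size_t i = 1; i + 1 < n; ++i) {
                tmp[i] = 0.5f * (u[i - 1] + u[i + 1]);
            }
            #pragma acc parallel loop async(1)
            for (size_t i = 1; i + 1 < n; ++i) {
                u[i] = tmp[i];
            }
        }
        /* Host-side work (for example MPI communication) could be
           placed here before synchronizing with the device. */
        #pragma acc wait(1)
    }
}
```

The point of the design is that nothing crosses PCIe inside the time loop, and the host is free between kernel launches and the final `wait`.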
  • 23. CAAR: Selected Lessons Learned
      • The difficulty level of the GPU port was in part determined by:
          – Structure of the algorithms (e.g., available parallelism, high computational intensity)
          – Code execution profile: flat or hot spots
          – The code size (LOC)
  • 24. CAAR: Selected Lessons Learned
      • More available flops on the node should lead us to think of new science opportunities enabled
      • We may need to look in unconventional places to get another ~30× thread parallelism that may be needed for exascale (e.g., parallelism in time)
  • 25. Co-designing Future Programming Models
      • Evolutionary vs. Revolutionary approaches:
          – Message Passing and PGAS: MPI, UPC, OpenSHMEM, Fortran 2008 Coarrays, Chapel
          – Shared Memory Models: OpenMP, Pthreads
          – Accelerator-based models: OpenACC, OpenMP 4.0, OpenCL, CUDA
          – Hybrid Models: MPI + OpenACC, MPI + OpenMP 4.0, OpenSHMEM + OpenACC, etc.
      • New runtime models: Legion, OCR, Express, PaRSEC
          – Asynchronous task-based models
      • How to efficiently map the model to the hardware while meeting application requirements?
  • 26. Directives collaboration
      • Serve on standards committees
      • Gather requirements from users
      • Translate users’ needs and use cases
  • 27. Requirements Gathering Example
      App | Language | Data structure issues
      LSMS 3 | C++ | Templated Matrix class with bare pointer to data. Either owns the data or is an alias to another Matrix object. STL::vector and STL::complex needed on device
      CAM-SE | F90 | Array of structs. A struct member of the struct has a multidimensional array member of which sections must be transferred at different times.
      Mini-FE | C | Vector of pointers transferred to the device. Pointers are to the same data structure.
      LAMMPS | C / C++ | Flat C arrays requiring transfer
      ICON (CSCS) | F95 | Array of structs of allocatable arrays. Need selective deep copy of derived type members.
      UPACS | F90 | Structs of allocatable arrays.
      GENESIS | F90 | Structs of allocatable arrays; these arrays are accessed by pointers that are set before entering the parallel region
      HFODD | F90 | Require better support for Fortran derived types
      Delta5D | F77 / F90 | Vectors, indexing arrays; no derived types
      XGC1 | F90 | Array of derived types with pointers to other nested derived types: block(b)%grp(g)%p. Need deep copy.
      DFTB | F77 / F90 | Dense linear algebra
      NIM/FIM | F90 | Multidimensional arrays, no structs
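Several rows ask for a "deep copy": moving a struct and the dynamically allocated arrays its members point to. A C analogue of that pattern, written as a manual deep copy with OpenACC unstructured data directives, might look like the sketch below. The type `field_t` and the three helper functions are hypothetical, and the sketch assumes a compiler that attaches the device copy of `values` to the device copy of the struct, a behaviour that was standardized in OpenACC only after this talk. Without OpenACC the pragmas are ignored and the code runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* A struct with a dynamically sized member, the C counterpart of a
   Fortran derived type holding an allocatable array. (Hypothetical.) */
typedef struct {
    size_t n;
    double *values;
} field_t;

/* Manual deep copy to the device: first the struct itself (a shallow
   copy), then the array its member points to. */
void field_to_device(field_t *f)
{
    assert(f != NULL && f->values != NULL);
    #pragma acc enter data copyin(f[0:1])
    #pragma acc enter data copyin(f->values[0:f->n])
}

/* Reverse order on the way back: member data first, then the struct. */
void field_from_device(field_t *f)
{
    assert(f != NULL && f->values != NULL);
    #pragma acc exit data copyout(f->values[0:f->n])
    #pragma acc exit data delete(f[0:1])
}

/* Scales every element in place, using the data already on the device. */
void field_scale(field_t *f, double a)
{
    assert(f != NULL && f->values != NULL);
    #pragma acc parallel loop present(f[0:1])
    for (size_t i = 0; i < f->n; ++i) {
        f->values[i] *= a;
    }
}
```

Every nested pointer needs its own pair of directives in this style, which is the kind of bookkeeping the requirements in the table ask the programming model to take over.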
  • 28. Challenges with Directive-based programming models
      • How to specify the in-node parallelism in the application
          – Loop-based parallelism is not enough for future systems
      • How to efficiently map the parallelism of the application to the hardware
          – How to schedule work to multiple accelerators within the node?
          – How to schedule work within accelerators while being portable?
      • How to transfer data across different types of memory
          – Problem may go away but is important for data locality
      • How to specify different memory hierarchies in the programming model
          – Shared memory within GPU, etc.
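For context on the mapping question, this is roughly how OpenACC lets a programmer hint at a two-level mapping of nested loops today: the outer loop across gangs, the inner loop across vector lanes. The sketch is illustrative only; the function `matvec` and the row-major layout are assumptions. A compiler without OpenACC ignores the pragmas.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a rows-by-cols matrix stored in row-major order.
   Hypothetical example of hierarchical (gang/vector) mapping. */
void matvec(size_t rows, size_t cols,
            const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);

    /* Coarse-grained level: one row per gang. */
    #pragma acc parallel loop gang \
        copyin(a[0:rows * cols], x[0:cols]) copyout(y[0:rows])
    for (size_t r = 0; r < rows; ++r) {
        double sum = 0.0;
        /* Fine-grained level: the dot product is spread over vector
           lanes and combined with a reduction. */
        #pragma acc loop vector reduction(+:sum)
        for (size_t c = 0; c < cols; ++c) {
            sum += a[r * cols + c] * x[c];
        }
        y[r] = sum;
    }
}
```

The directives describe which loop is coarse and which is fine, but leave the gang count and vector length to the compiler, which is one way of staying portable across accelerators.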
  • 29. Future is Descriptive Programming
      • AMD Discrete GPU
          – Large number of small cores
          – Data parallelism is key
          – PCIe to CPU connection
      • AMD APU
          – Integrated CPU+GPU cores
          – Target power-efficient devices at this stage
          – Shared memory system with partitions
      • Intel Many Integrated Cores
          – 50+ x86 cores
          – Support conventional programming
          – Vectorization is key
          – Run as an accelerator or standalone
      • NVIDIA GPU
          – Large number of small cores
          – Data parallelism is key
          – Support nested and dynamic parallelism
          – PCIe to host CPU or low-power ARM CPU (CARMA)
      Directives help describe data layout, parallelism
  • 30. OpenACC influence → OpenMP
      • Compare OpenMP 4.0 accelerator extension with OpenACC
          – Understand mapping
          – Understand impact of newer OpenACC features
      • OpenACC is evolving with new features which may impact OpenMP 4.1 or 5.
      • OpenACC interoperability with OpenMP is important for the transition
      OpenACC 2.0 | OpenMP 4.0
      parallel | target
      parallel/gang/workers/vector | target teams/parallel/simd
      data | target data
      parallel loop | teams/distribute/parallel for
      update | target update
      cache | (none listed)
      wait | OpenMP 4.1 proposal
      declare | declare target
      data enter/exit | OpenMP 4.1 proposal
      routine | declare target
      async wait | OpenMP 4.1 proposal
      device type | (none listed)
      tile | (none listed)
      host data | (none listed)
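To make the "parallel loop" and "data" rows of this mapping concrete, the same loop is written below once with OpenACC and once with the OpenMP 4.0 accelerator directives. This is a sketch for comparison, not an example from the talk; the function names `scale_add_acc` and `scale_add_omp` are hypothetical. A compiler that supports neither model ignores the pragmas and runs both loops serially.

```c
#include <assert.h>
#include <stddef.h>

/* OpenACC version: y[i] += a * x[i]. */
void scale_add_acc(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}

/* OpenMP 4.0 version of the same loop: "parallel loop" becomes
   "target teams distribute parallel for", and the OpenACC data
   clauses become map clauses. */
void scale_add_omp(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

The loop bodies are identical; only the directive line changes, which is why a one-time restructuring of the code can serve either model.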
  • 31. Training at OLCF
      • Webinars/Remote
      • Hands-on
      • Lectures
      • Open to the public!
  • 33. Conclusions
      • There’s no avoiding manycore
      • Rethink algorithms to expose more parallelism
      • Directives are morphing into Descriptive Programming
      • Memory placement is important
      • Flops are free, avoid reads/writes
      • Standards built from application requirements
      • Training events are open to the public
      • Looking for domain-specific communities
  • 34. Acknowledgements
      • OpenACC and OpenMP Standards Committees
      • OLCF-3 CAAR Team: Bronson Messer, Wayne Joubert, Mike Brown, Matt Norman, Markus Eisenbach, Ramanan Sankaran
      • OLCF-3 Vendor Partners: Cray, AMD, NVIDIA, CAPS, Allinea
      This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.