This is a presentation by Prof. Anne Elster at the International Workshop on Open Source Supercomputing held in conjunction with the 2017 ISC High Performance Computing Conference.
1. Prof. Anne C. Elster, PhD
Dept. of Computer Science, HPC-Lab
Norwegian University of Science & Technology
Trondheim, Norway
CloudLightning
and the OPM-based Use Case
(HW → Middleware)
2. Thanks to: My Post Docs and
graduate students!
06/07: Spring 2007
08/09: Spring 2009
10/11: @ SC 10
09/10: Spring 2010
07/08: @ SC 07
11/12: Spring 2012
Spring 2014
3. Challenge re: Open Source – SW stack!
• Leverage GPUs and libraries inside Linux containers
e.g. using NVIDIA-docker
4. OUTLINE
Overview of
EU H2020 project CloudLightning (CL)
Use case Oil&Gas -- OPM and Upscaling
Containerization
Concluding remarks / look forward
7. Prof John Morrison
UCC
Dr Gabriel Gonzalez Castane
UCC
Dr Huanhuan Xiong
UCC
Mr David Kenny
UCC
Dr Georgi Gaydadjiev
MAX
Dr Marian Neagul
IeAT
Prof Dana Petcu
IeAT
Tobias Becker
MAX
Prof Anne Elster
NTNU
Prof George Gravvanis
DUTH
Dr Suryanarayanan Natarjan
INTEL
Mr Perumal Kuppuudaiyar
INTEL
Prof Theo Lynn
DCU
Ms Anna Gourinovitch
DCU
Dr Konstantinos Giannoutakis
CERTH
Dr Hristina Palikareva
MAX
Dr Malik M. Khan
NTNU
Dr Muhammad Qasim
NTNU
Mr Geir Amund Hasle
NTNU
Mr Dapeng Dong
UCC
9. CLOUDLIGHTNING
Funded under Call H2020-ICT-2014-1
Advanced Cloud Infrastructures and Services.
(Feb 2015 thru Jan 2018)
Aim: To develop infrastructures, methods
and tools for high-performance, adaptive cloud
applications and services that go beyond
current capabilities.
10. Specific
Challenge
• CloudLightning was
funded under Call
H2020-ICT-2014-1
Advanced Cloud
Infrastructures and
Services.
Cloud computing is being transformed by new requirements such as
- heterogeneity of resources and devices
- software-defined data centres
- cloud networking, security, and
- the rising demands for better quality of user experience.
Cloud computing research will be oriented towards
• new computational and data management
models (at both infrastructure and services
levels) that respond to the advent of faster and
more efficient machines,
- rising heterogeneity of access modes and devices,
- demand for low energy solutions,
- widespread use of big data,
- federated clouds and
- secure multi-actor environments including public
administrations.
11. CloudLightning Overview – more detailed
Self-Organization and Self Management (SOSM) of HPC resources
Resource Allocation (based on service requirements), which can be:
• Bare metal
• VMs (Virtual Machines)
• Containers – the most realistic option for HPC workloads
• Resources are divided into Cells, based on region or location
• Each Cell may have different hardware types, including servers with GPUs, MICs, DFEs
• Each Cell is partitioned into vRacks, which are sets of servers of the same type
Resource Utilization (self-optimized):
• Alt. 1: The Cell manager gets a request for resources, creates them and allocates them directly.
• Alt. 2: First discover the available resources; if there is more than one option, the Cell manager selects the most appropriate one.
• Alt. 3: Same as Alt. 2, except that a solution is returned rather than an option.
• A vRack manager may create/aggregate the resources in its vRack and is the basic component of self-organization in the CL system. Note that this feature only makes sense for larger deployments.
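The allocation alternatives above can be sketched in a few lines of Python. This is an illustrative sketch only: the names (`Cell`, `VRack`, `allocate_direct`, `discover_then_select`) and the "most free slots" selection policy are hypothetical, not the actual CloudLightning API.

```python
class VRack:
    """A set of servers of the same hardware type (hypothetical sketch)."""
    def __init__(self, hw_type, free_slots):
        self.hw_type = hw_type
        self.free_slots = free_slots

    def can_serve(self, request):
        # A vRack can serve a request only for its own hardware type.
        return (self.hw_type == request["hw_type"]
                and self.free_slots >= request["count"])


class Cell:
    """A region of resources, partitioned into vRacks (hypothetical sketch)."""
    def __init__(self, vracks):
        self.vracks = vracks

    def allocate_direct(self, request):
        # Alt. 1: create and allocate resources directly, first fit wins.
        for vr in self.vracks:
            if vr.can_serve(request):
                vr.free_slots -= request["count"]
                return vr
        return None

    def discover_then_select(self, request):
        # Alt. 2: first discover all viable options, then select one.
        options = [vr for vr in self.vracks if vr.can_serve(request)]
        if not options:
            return None
        # Illustrative policy: prefer the vRack with the most free slots.
        best = max(options, key=lambda vr: vr.free_slots)
        best.free_slots -= request["count"]
        return best
```

Alt. 3 differs from Alt. 2 only in what is returned to the caller (a concrete solution rather than a set of options), so it is not repeated here.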
12.
13. BENEFICIARIES
Primary beneficiary:
Infrastructure-as-a-Service provider.
Benefits from activating the HPC-in-the-cloud
market and from reduced costs via better
performance per cost and performance per watt.
Increased energy efficiency can
result in lower costs throughout
the cloud ecosystem and can
increase the accessibility and
performance in a wide range of
use cases including:
• Oil and Gas simulations,
• Genomics and
• Ray Tracing
(e.g. 3D Image Rendering)
CLOUDLIGHTNING USE CASES
Oil and Gas (NTNU):
• Improved performance/cost and performance/Watt for cloud-based datacenter(s)
• Energy- and cost-efficient scalable solution for the OPM-based reservoir simulation application (Upscaling)
• Ability to run better models for more efficient reservoir utilization
Genomics (Maxeler):
• Improved performance/cost and performance/Watt
• Faster genome sequence computation
• Reduced development times
• Increased volume and quality of related research
Ray Tracing / 3D Image Rendering (Intel):
• Reduced CAPEX and associated IT costs
• Extra capacity for overflow (“surge”) workloads
• Faster workload processing to meet project timelines
14.
EU Use Case
Motivations
CloudLightning’s use cases
support the European Union
HPC strategy and specific
industries identified by IDC
in their recent report on the
progress of the EU HPC
Strategy (IDC, 2015).
1. The health sector represents 10% of EU GDP and 8% of the EU workforce (EC, 2014). HPC is increasingly central to genome processing and thus advanced medicine and bioscience research.
2. The oil and gas industry is responsible for 170,000 European jobs and €440 billion of Europe's GDP (IDC, 2015). HPC improves discovery performance and exploitation.
3. Ray tracing is a fundamental technology in many industries, specifically in CAD/CAE, digital content and mechanical design – sectors dominated by SMEs.
4. European ROI in HPC is very attractive: each euro invested in HPC on average returned €867 in increased revenue/income (IDC, 2015).
15. The OPM-based Upscaling Use Case
http://opm-project.org/
Project goal: Provide HPC-application as a service
What it does: Calculates upscaled permeability
Why chosen:
• Builds on the Open Source SW provided by the Open Porous Media (OPM) project
• Already familiar with the Upscaling application through collaborations with Statoil
17. HPC Applications as a Service – OPM Upscaling
http://opm-project.org/
Calculates upscaled permeability
18. HPC Applications as a Service – OPM Upscaling
The OPM use-case application is the calculation of upscaled permeability
• uses PDE solver libraries → MPI, GPU …
• the Upscaling application is ported to work with PETSc, which provides:
  • CPU-only execution
  • CPU-GPU execution
• both versions of the application are provided as a containerized solution
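As a minimal illustration of what "upscaled permeability" means (this is not the actual OPM solver, which computes pressure solutions on real reservoir grids): for a 1-D layered medium, the effective permeability is the harmonic mean of the layer permeabilities for flow across the layers and the thickness-weighted arithmetic mean for flow along them.

```python
import numpy as np

def upscale_series(perm, thickness):
    """Effective permeability for flow perpendicular to the layers
    (harmonic mean, thickness-weighted)."""
    perm = np.asarray(perm, dtype=float)
    thickness = np.asarray(thickness, dtype=float)
    return thickness.sum() / (thickness / perm).sum()

def upscale_parallel(perm, thickness):
    """Effective permeability for flow parallel to the layers
    (arithmetic mean, thickness-weighted)."""
    perm = np.asarray(perm, dtype=float)
    thickness = np.asarray(thickness, dtype=float)
    return (thickness * perm).sum() / thickness.sum()
```

For two equally thick layers with permeabilities 1 and 4, the upscaled values are 1.6 (across) and 2.5 (along) – the general 3-D upscaling in OPM solves local flow problems to get the same kind of effective tensor.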
19. Self-Optimized Libraries as a Service
• On-going work to provide self-optimized libraries as containerized solutions
• Libraries under review include ATLAS, MKL, cuBLAS, FFTW, cuFFT
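The self-optimization idea can be sketched as runtime selection among candidate back-ends. Below, NumPy's `matmul` versus a naive Python loop merely stand in for, say, ATLAS versus cuBLAS; the selection logic is a hypothetical sketch, not the project's actual mechanism.

```python
import time
import numpy as np

def naive_matmul(a, b):
    # Deliberately slow reference implementation, standing in for an
    # unoptimized back-end.
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            out[i, j] = sum(a[i, p] * b[p, j] for p in range(k))
    return out

def pick_fastest(candidates, *args, repeats=3):
    """Benchmark each candidate on the given inputs and return the
    one with the lowest measured runtime."""
    best, best_t = None, float("inf")
    for f in candidates:
        t0 = time.perf_counter()
        for _ in range(repeats):
            f(*args)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = f, elapsed
    return best

a = np.random.rand(40, 40)
b = np.random.rand(40, 40)
fastest = pick_fastest([np.matmul, naive_matmul], a, b)
```

In the containerized-library setting, the same measure-then-select step would run once at deployment time, and the container would then serve the winning back-end.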
20. CLOUDLIGHTNING HETEROGENEOUS TESTBED @ NTNU
GPU System
• Dell server blade with NVIDIA Tesla P100
SMP Cluster
• Numascale 5-node SMP
MIC System
• PC with Xeon Phi card
DFE Cluster
• Maxeler MPC-C node with 4x Vectis MAX3 DFEs
23. Self-Organization-Self-Management (SOSM) – Resource Registration Plugin
• SOSM resource registration plugins
• Python-based plugins
• Use Python bindings (pyNVML)
• Use the NVIDIA Management Library (NVML)
(Diagram: SOSM → pyNVML → NVIDIA Management Library → Tesla P100 / GTX 980 / Tesla K20)
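A minimal sketch of the kind of query such a registration plugin performs via pyNVML (the real CloudLightning plugin is not reproduced here; this assumes the standard `pynvml` package and degrades to an empty list when no NVIDIA driver is present).

```python
def discover_gpus():
    """Return a list of {index, name, memory_total} dicts for the
    NVIDIA GPUs visible via NVML, or [] if NVML is unavailable."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        # pyNVML not installed or no NVIDIA driver present.
        return []
    gpus = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):   # older pyNVML returns bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            gpus.append({"index": i, "name": name,
                         "memory_total": mem.total})
    except Exception:
        pass
    finally:
        pynvml.nvmlShutdown()
    return gpus
```

A registration plugin would forward such records to the SOSM layer so that vRacks of GPU servers can be formed from them.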
26. Telemetry System for CloudLightning – Resource Registration Plugin
SNAP-based telemetry plugins
• Python-based plugins
• Use Python bindings (pyNVML)
• Use the NVIDIA Management Library (NVML)
(Diagram: SNAP → pyNVML → NVIDIA Management Library → Tesla P100 / GTX 980 / Tesla K20)
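The collector role can be sketched generically; the class below is a hypothetical stand-in, not the actual Snap plugin API, showing only the sample-and-timestamp pattern a Snap collector follows before a framework forwards the readings.

```python
import time

class Collector:
    """Hypothetical telemetry collector sketch: samples named metric
    sources and emits timestamped readings."""
    def __init__(self, sources):
        # sources: metric name -> zero-argument callable returning a number
        self.sources = sources

    def collect(self):
        ts = time.time()
        return [{"metric": name, "value": fn(), "timestamp": ts}
                for name, fn in self.sources.items()]
```

In the CL setup, the metric sources would be pyNVML queries (utilization, memory, power) and Snap would schedule `collect()` at a configured interval.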
27. CL Telemetry System Plug-in
Python-based SNAP Collector (*debugging stage)
• SNAP parameter catalog setup
• Telemetry parameter output for CL
31.
CURRENT STATUS OF THE OVERALL
CLOUDLIGHTNING PROJECT:
• SOSM Architecture defined
• Plugins for resource registration developed
• Tested use cases on individual platforms
• Working on integration, testbed and simulation
that includes the OpenStack-based SOSM
system
32. Challenge re: Open Source
SW Engineering Principles
versus
Skills of application programmers
33. Challenge re: Open Source
Cost of maintaining the code:
- Person hrs
- Uphold motivation
SW stack – can leverage GPUs & libraries inside
Linux containers, e.g. using NVIDIA-docker
34. HPC-Lab Beyond CloudLightning:
HPC-Lab welcomes
H2020 MSCA Post Doc
Weifeng Liu
(2017-2019)
Prof. Gavin Taylor, USNA, to join HPC-Lab for 14 months starting June 2017
Further collaborations with Schlumberger (Bjørn Nordmoen on left in photo)
39. Motivation – GPU Computing:
Many advances in processor designs
are driven by Billion $$ gaming market!
Modern GPUs (Graphics Processing Units) offer
lots of FLOPS per watt!
.. and lots of parallelism!
NVIDIA GTX 1080
(Pascal): 2560 CUDA cores!
Kepler: GTX 690 and Tesla K10 cards
have 3072 (2x1536) cores!
40. NVIDIA DGX-1 Server -- Details
CPUs : 2 x Intel Xeon E5-2698 v3 (16-core Haswell)
GPUs: 8 x NVIDIA Tesla P100 (3584 CUDA cores)
System Memory: 512 GB DDR4-2133
GPU Memory 128GB (8 x 16GB)
Storage: 4 x Samsung PM 863 1.9 TB SSD
Network: 4 x Infiniband EDR, 2x 10 GigE
Power : 3200W
Size 3U Blade
GPU Throughput: FP16: 170 TFLOPS,
FP32: 85 TFLOPS, FP64: 42.5 TFLOPS
41.
Who am I? (Parallel Computing perspective)
• 1980s: Concurrent and Parallel Pascal
• 1986: Intel iPSC Hypercube
– CMI (Bergen) and Cornell
(Cray arrived at NTNU)
• 1987: Cluster of 4 IBM 3090s (@ IBM Yorktown)
• 1988-91: Intel hypercubes (CMI Norway & Cornell)
• Some on BBN (Cornell)
• 1991-94: KSR
• 1993-98: MPI 1 & 2 (represented Cornell & Schlumberger)
Kendall Square Research (KSR)
KSR-1 at Cornell University:
- 128 processors – Total RAM: 1GB!!
- Scalable shared memory multiprocessors (SSMMs)
- Proprietary 64-bit processors
Notable Attributes:
Network latency across the bridge prevented viable scalability
beyond 128 processors.
Intel iPSC
42. Especially want to thank
my first 2 GPU master's students:
Fall 2006:
Christian Larsen (MS Fall Project, December 2006):
“Utilizing GPUs on Cluster Computers”
(joint with Schlumberger)
Erik Axel Nielsen asks for an FX 4800 card for a project
with GE Healthcare
Elster, as head of the Computational Science & Visualization program, helped
NTNU acquire a new IBM supercomputer
(Njord, 7+ TFLOPS, proprietary switch)
Now: Tesla P100 rated 5-10 TF!
43. GPUs & HPC-Lab at NTNU
2006: GPU Programming (Cg)
• IBM Supercomputer @ NTNU
(Njord, 7+ TFLOPS, proprietary switch)
2007:
• MS thesis on Wavelet on GPU
• CUDA Tutorial @SC
• Tesla C870 and S870 Announced
2008:
• CUDA Programming in Parallel Comp. course
• Quadcore Supercomputer at UiTø (Stallo) ca. 70 TF (*)
• HPC-LAB at IDI/NTNU opens with
• several NVIDIA donations
• several quad-core machines (1-2 donated by Schlumberger)
• HPC-Lab CUDA-based snow sim. & seismic & SC
(*) 1 Tesla P100 rated 5-10 TF for HPC loads
44. NTNU GPU Activities
Elster's HPC-Lab has graduated
35+ Master students (diplom) in
GPU computing (2007-2017)
Currently supervising:
3 Post Docs,
3+ PhD students
5 master's students
NTNU NVIDIA CUDA Teaching Center (summer 2011)
• PhD seminar course (Spring 2013: 7 students)
• Master’s level course (Fall 2012: 14 students)
• Senior Parallel Computing class (>1/3 in CUDA)
• Fall 2010: 43 taking exam
• Fall 2012: 57 students
• Fall 2016: >80 initial enrollment
Elster was also PI for the Teaching Center at the Univ. of Texas at Austin &
the NTNU NVIDIA CUDA Research Center (2012)
45. GPUs & HPC-Lab at NTNU
2009:
• NVIDIA Tesla S1070 (4 GPUs, 960 cores, 4 TF)
• two NVIDIA Quadro FX 5800 cards (Jan ´09),
• NVIDIA Ion (Jun´09)
• AMD/ATI Radeon 5850 (2 TF)
2010-11: NVIDIA Fermi-based cards (470, c2050, c2070(2011))
2012-13: NVIDIA Kepler … NOK 1 mill. equipment grant from IME
2014-15: 20 GTX 980 cards, 85” 4K screen, Numascale SMP, H2020 CL grant
2016-17: Tesla P100 server and GTX 1080 and 1080Tis
+ 60 Jetson TX1s for teaching, MSCA Post Doc