HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability
1. HPE demystifies deep learning for faster intelligence across all organizations
Edmondo Orlotti
HPC & AI Business Development Manager
October 2017
2. Data analytics and insights are fueling the digital transformation
• Enhanced customer experiences: personalized, real-time mobile insights for retail
• Improved products and services: genomics sequencing analytics for life sciences
• Optimized business processes: predictive maintenance insights for manufacturing
3. AI propels analytics and insights to a new dimension
Unleash automated intelligence from massive data volumes.
• Data protection and archival to mitigate risk: HPE fraud detection using deep learning
• Infrastructure modernization for new data types and scale: user behavioral analytics for the data center using machine learning
• Next-generation analytics for real-time business: HPE Intelligent Edge real-time analytics with SAP Leonardo
• Insights from modeling and simulation: deep learning in HPC using GPU-accelerated computing
4. What’s all the “buzz” around AI?
Gain competitive advantage in the vibrant new market of AI.
1 Source: McKinsey AI report, 2017
6. HPE has a comprehensive, purpose-built portfolio for deep learning
Compute ideal for training models in the data center:
• HPE Apollo 6500: the enterprise bridge to accelerated computing
• HPE SGI 8600: petaflop scale for deep learning and HPC
• HPE Apollo sx40: maximize GPU capacity and performance with lower TCO
Compute for both training models and inference at the edge:
• HPE Apollo 2000: the bridge to enterprise scale-out architecture
Edge analytics and inference engine:
• HPE Edgeline EL4000: unprecedented deep edge compute and high-capacity storage; open standards
HPC storage:
• HPE Apollo 4520: large-scale storage virtualization and tiered data management platform
• HPC Data Management Framework software
Choice of fabrics: Intel® Omni-Path Architecture, Mellanox InfiniBand, Arista Networking, HPE FlexFabric Network
AI software framework: easy setup and flexible OS, using Bright Computing’s distribution of deep learning software development components and workload management tool integration
Target segments: government, academia and industries; financial services; life sciences and health; autonomous vehicles and manufacturing
Advisory, professional and operational services, HPE Flexible Capacity, HPE Datacenter Care for Hyperscale
8. TESLA V100
THE MOST ADVANCED DATA CENTER GPU EVER BUILT
5,120 CUDA cores
640 new Tensor Cores
7.5 FP64 TFLOPS | 15 FP32 TFLOPS
120 Tensor TFLOPS
20 MB SM register file | 16 MB cache | 16 GB HBM2 @ 900 GB/s
300 GB/s NVLink
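As a back-of-the-envelope check, the headline TFLOPS figures follow from the core counts above and the boost clock quoted at launch (the ~1455 MHz clock is an assumption here; exact clocks vary by SKU):

```python
# Hedged arithmetic: core counts from the slide; ~1455 MHz boost clock is an
# assumption (the figure quoted at the V100 launch; exact clocks vary by SKU).
boost_ghz = 1.455
cuda_cores, tensor_cores = 5120, 640
fp32_tflops = cuda_cores * 2 * boost_ghz / 1000          # FMA = 2 flops/clock
# each Tensor Core does a 4x4x4 matrix multiply-accumulate per clock:
# 64 FMAs = 128 floating-point ops
tensor_tflops = tensor_cores * 128 * boost_ghz / 1000
print(f"FP32:   {fp32_tflops:.1f} TFLOPS")    # ~14.9, the slide rounds to 15
print(f"Tensor: {tensor_tflops:.1f} TFLOPS")  # ~119, the slide rounds to 120
```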
10. INTRODUCING TESLA V100
The fastest and most productive GPU for deep learning and HPC.
• Volta architecture: most productive GPU
• Tensor Core: 120 programmable TFLOPS of deep learning performance
• Improved SIMT model: new algorithms
• Volta MPS: inference utilization
• Improved NVLink and HBM2: efficient bandwidth
14. VOLTA MULTI-PROCESS SERVICE
CPU processes A, B and C submit work through the CUDA Multi-Process Service control to the Volta GV100 for GPU execution, with hardware-accelerated work submission and hardware isolation.
Volta MPS enhancements:
• Reduced launch latency
• Improved launch throughput
• Improved quality of service with scheduler partitioning
• More reliable performance
• 3x more clients than Pascal
15. VOLTA: INDEPENDENT THREAD SCHEDULING
Pascal: lock-free algorithms only; threads cannot wait for messages.
Volta: starvation-free algorithms; threads may wait for messages.
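The distinction can be sketched with a toy scheduler: under Pascal-style lockstep SIMT, one side of a divergent branch runs to completion before the other, so a thread that spin-waits on a message from another lane in its warp never makes progress. Volta's independent thread scheduling interleaves the lanes, so the wait can complete. This is a simplified model, not real warp scheduling, and all names are illustrative:

```python
# Toy model of SIMT divergence: two "lanes" of a warp communicate via a flag.
# Lane 0 spin-waits for the flag; lane 1 sets it. Serializing the divergent
# branch sides (Pascal-style lockstep) deadlocks; interleaving the lanes
# (Volta-style independent scheduling) completes. Illustrative sketch only.

def lane_waiter(state):
    while not state["flag"]:      # spin until the message arrives
        yield                     # one "instruction" step
    state["done"] = True

def lane_setter(state):
    state["flag"] = True          # send the message
    yield

def run_lockstep(lanes, state, max_steps=100):
    # Pascal-style: run each branch side to completion before the next
    for lane in lanes:
        gen = lane(state)
        for _ in range(max_steps):
            try:
                next(gen)
            except StopIteration:
                break
        else:
            return "deadlock"     # the waiter spun forever
    return "done" if state.get("done") else "deadlock"

def run_independent(lanes, state, max_steps=100):
    # Volta-style: interleave the lanes round-robin
    gens = [lane(state) for lane in lanes]
    live = set(range(len(gens)))
    for _ in range(max_steps):
        for i in list(live):
            try:
                next(gens[i])
            except StopIteration:
                live.discard(i)
        if not live:
            return "done" if state.get("done") else "deadlock"
    return "deadlock"

print(run_lockstep([lane_waiter, lane_setter], {"flag": False}))     # deadlock
print(run_independent([lane_waiter, lane_setter], {"flag": False}))  # done
```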
16. VOLTA TENSOR CORE: NEW TENSOR CORE BUILT FOR AI
Delivering 120 TFLOPS of deep learning performance, optimized for deep learning.
• 4x4 matrix processing array: D[FP32] = A[FP16] * B[FP16] + C[FP32]
• Matrix data optimization: dense matrix of tensor compute
• Tensor-op conversion: FP32 to tensor-op data for frameworks
• Volta-optimized cuDNN for all major frameworks
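The tensor-core operation D[FP32] = A[FP16] * B[FP16] + C[FP32] can be emulated in a few lines: the multiply inputs are rounded to half precision, while accumulation stays in higher precision. A sketch only (using Python floats for the accumulator); a real Tensor Core performs the whole 4x4x4 product in a single clock:

```python
# Toy emulation of a Tensor Core multiply-accumulate: FP16 inputs, wider
# accumulation. Uses struct's IEEE half-precision format to model FP16
# rounding; illustrative, not how the hardware is implemented.
import struct

def to_fp16(x):
    # round a Python float to the nearest IEEE half-precision value
    return struct.unpack("e", struct.pack("e", x))[0]

def tensor_core_mma(A, B, C):
    # 4x4 matrix multiply-accumulate: D = A*B + C
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]                       # wide accumulator
            for k in range(n):
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D
```

For example, multiplying the 4x4 identity by itself with a zero accumulator returns the identity, while an input like 0.1 is visibly rounded by the FP16 conversion before the multiply.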
17. AI PERFORMANCE: 3X FASTER DL TRAINING
Over 80x DL training performance in 3 years (GoogLeNet training speedup vs. K80): 1x K80 with cuDNN2 (Q1 2015), 4x M40 with cuDNN3 (Q3 2015), 8x P100 with cuDNN6 (Q2 2016) and 8x V100 with cuDNN7 (Q2 2017).
3x reduction in time to train over P100 (LSTM training for neural machine translation): 15 days on CPU, 18 hours on 1x P100, 6 hours on 1x V100.
Neural machine translation training for 13 epochs, German to English, WMT15 subset | CPU = 2x Xeon E5-2699 v4 | V100 performance measured on pre-production hardware.
85% scale-out efficiency, scaling to 64 GPUs with Microsoft Cognitive Toolkit (multi-node training with NCCL 2.0, ResNet-50): 18 hours on 8x P100, 7.4 hours on 8x V100, 1 hour on 64x V100.
ResNet-50 training for 90 epochs with a 1.28M-image dataset, using Caffe2 | V100 performance measured on pre-production hardware.
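A quick sanity check of the time-to-train figures (numbers are read off the slide, not re-measured):

```python
# Hedged arithmetic on the slide's NMT training times.
cpu_hours  = 15 * 24   # 2x Xeon E5-2699 v4: 15 days
p100_hours = 18        # 1x P100
v100_hours = 6         # 1x V100
print(f"V100 vs P100: {p100_hours / v100_hours:.0f}x")  # 3x, as claimed
print(f"V100 vs CPU:  {cpu_hours / v100_hours:.0f}x")   # 60x
```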
18. VOLTA DELIVERS 3X MORE INFERENCE THROUGHPUT
Low-latency performance with V100 and TensorRT.
TensorRT compiles a trained neural network into a compact, real-time network: it fuses layers and optimizes precision (FP32, FP16, INT8).
3x more throughput at 7 ms latency with V100 (ResNet-50), measured as images/sec at target latency: CPU at 33 ms, Tesla P100 with TensorFlow at 10 ms, Tesla P100 with TensorRT at 7 ms, and Tesla V100 with TensorRT at 7 ms with roughly 3x the P100 throughput.
CPU server: 2x Xeon E5-2660 v4; GPU: w/ P100, w/ V100 (@150W) | V100 performance measured on pre-production hardware.
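What "fuse layers" means can be illustrated with a pure-Python toy: instead of materializing intermediate tensors between a linear layer, a bias add and a ReLU, a fused kernel computes each output element in one pass, cutting memory traffic. Function names here are illustrative, not the TensorRT API:

```python
# Toy illustration of layer fusion. Unfused: three passes, two temporaries.
# Fused: one pass per output element, no intermediates written out.
def linear(x, w): return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
def bias(x, b):   return [xi + bi for xi, bi in zip(x, b)]
def relu(x):      return [max(0.0, xi) for xi in x]

def unfused(x, w, b):
    return relu(bias(linear(x, w), b))   # linear -> bias -> relu, separately

def fused(x, w, b):
    # same math, computed in a single traversal
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]
```

Both paths produce identical results; the fused form simply avoids the intermediate buffers, which is where much of the inference speedup comes from.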
19. V100 UNIVERSAL GPU: BOOSTS ALL ACCELERATED WORKLOADS
A single universal GPU for all accelerated workloads:
• HPC: 1.5x vs. P100
• AI training: 3x vs. P100
• AI inference: 3x vs. P100
• Virtual desktop: 2x vs. M60
20. OPTIMIZED FOR DATACENTER EFFICIENCY
80% of the performance at half the power; 40% more performance in a rack.
• V100 at max performance: 13 kW rack, 4 nodes of 8x V100, 13 ResNet-50 networks trained per day
• V100 at max efficiency: 13 kW rack, 7 nodes of 8x V100, 18 ResNet-50 networks trained per day
ResNet-50 training; max-efficiency run with V100 @ 160W | V100 performance measured on pre-production hardware.
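The two rack configurations imply the headline percentages; hedged arithmetic on the slide's numbers:

```python
# Sanity check of the rack-efficiency figures (read off the slide).
perf_nodes, perf_networks = 4, 13   # max performance: V100 at full power
eff_nodes,  eff_networks  = 7, 18   # max efficiency:  V100 capped at 160 W
per_node_perf = perf_networks / perf_nodes   # ~3.25 networks/day/node
per_node_eff  = eff_networks / eff_nodes     # ~2.57 networks/day/node
# ~79% per-node performance retained, ~38% more rack throughput --
# consistent with the slide's rounded "80%" and "40%" claims
print(f"per-node performance retained: {per_node_eff / per_node_perf:.0%}")
print(f"extra rack throughput: {eff_networks / perf_networks - 1:.0%}")
```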
21. TESLA V100 SPECIFICATIONS
              For NVLink servers                              For PCIe servers
Compute       7.5 TF DP ∙ 15 TF SP ∙ 120 TF DL               7 TF DP ∙ 14 TF SP ∙ 112 TF DL
Memory        HBM2: 900 GB/s ∙ 16 GB                         HBM2: 900 GB/s ∙ 16 GB
Interconnect  NVLink (up to 300 GB/s) + PCIe Gen3 (up to 32 GB/s)   PCIe Gen3 (up to 32 GB/s)
Power         300 W                                          250 W
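One way to read the interconnect row: moving a full 16 GB of data (the card's memory capacity) takes roughly an order of magnitude longer over PCIe Gen3 than over NVLink, assuming the peak bandwidths above and ignoring protocol overhead:

```python
# Hedged back-of-the-envelope transfer times at the table's peak bandwidths.
gb = 16                      # one V100 memory's worth of data
nvlink_gbps, pcie_gbps = 300, 32
print(f"NVLink: {gb / nvlink_gbps * 1000:.0f} ms")  # ~53 ms
print(f"PCIe:   {gb / pcie_gbps * 1000:.0f} ms")    # 500 ms
```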
22. HPE enables an optimized deep learning experience
The stack spans hardware infrastructure, data infrastructure, deep learning frameworks, deep learning services, and applications such as fraud detection, predictive maintenance and patient diagnostics.
HPE Confidential
External announcement at NVIDIA GTC on May 10th, 2017