INSIDE THE VOLTA GPU
ARCHITECTURE AND CUDA 9
Axel Koehler, Principal Solution Architect
GPU Technology Conference Europe, October 2017
2
CONTINUED DEMAND FOR COMPUTE POWER
Ever-increasing compute power demand in HPC:
• Comprehensive Earth System Model
• Coupled simulation of entire cells
• Simulation of combustion for new high-efficiency, low-emission engines
• Predictive calculations for supernovae
Neural network complexity is exploding:
• 2015: Microsoft ResNet, superhuman image recognition (7 ExaFLOPS, 60 million parameters)
• 2016: Baidu Deep Speech 2, superhuman voice recognition (20 ExaFLOPS, 300 million parameters)
• 2017: Google Neural Machine Translation, near-human language translation (100 ExaFLOPS, 8700 million parameters)
3
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC
• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS Deep Learning
• Improved SIMT Model: New Algorithms
• Volta MPS: Inference Utilization
• Improved NVLink & HBM2: Efficient Bandwidth
4
NVIDIA Tesla V100 SXM2 Module with Volta GV100 GPU
5
TESLA V100
21B transistors
815 mm²
80 SM
5120 CUDA Cores
640 Tensor Cores
16 GB HBM2
900 GB/s HBM2
300 GB/s NVLink
*full GV100 chip contains 84 SMs
6
NEW SM MICROARCHITECTURE
7
VOLTA GV100 SM
                          GP100                       GV100
FP32 units                64                          64
FP64 units                32                          32
INT32 units               NA                          64
Tensor Cores              NA                          8
Register File             256 KB                      256 KB
Unified L1/Shared memory  L1: 24 KB, Shared: 64 KB    128 KB
Active Threads            2048                        2048
Redesigned for Productivity
• Completely new ISA
• Twice the schedulers
• Simplified Issue Logic
• Large, fast L1 cache
• Improved SIMT model
• Tensor acceleration
8
UNIFYING KEY TECHNOLOGIES
Pascal SM: Load/Store Units; Shared Memory 64 KB; L1$ 24 KB; L2$ 4 MB
Volta SM: Load/Store Units; L1$ and Shared Memory 128 KB (Low Latency, Streaming); L2$ 6 MB
9
VOLTA L1 AND SHARED MEMORY
[Diagram: SM with Load/Store Units, L1$ and Shared Memory (128 KB), L2$ (6 MB)]
Volta Streaming L1$:
• Unlimited cache misses in flight
• Low cache hit latency
• 4x more bandwidth
• 5x more capacity
Volta Shared Memory:
• Unified storage with L1
• Configurable up to 96 KB
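A minimal sketch of how a kernel might request the larger shared-memory carve-out, assuming CUDA 9 or later and a compute capability 7.0 device. The kernel name fill_shared and the launch sizes are illustrative and do not come from the slides. cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize is the runtime call for opting a kernel in to more than the default 48 KB of dynamic shared memory per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fills a dynamically sized shared-memory tile and
// writes the last element back so the host can check it.
__global__ void fill_shared(float *out, int n) {
    extern __shared__ float tile[];
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        tile[i] = static_cast<float>(i);
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        *out = tile[n - 1];
    }
}

int main() {
    const int bytes = 96 * 1024;  // 96 KB, the limit quoted on the slide
    const int n = bytes / static_cast<int>(sizeof(float));

    // Dynamic shared memory above 48 KB has to be opted into per kernel.
    cudaError_t err = cudaFuncSetAttribute(
        fill_shared, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);
    if (err != cudaSuccess) {
        printf("opt-in failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float *out = nullptr;
    cudaMallocManaged(&out, sizeof(float));

    fill_shared<<<1, 256, bytes>>>(out, n);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("last element: %.0f\n", *out);
    cudaFree(out);
    return 0;
}
```

On devices that cannot provide 96 KB per block, the attribute call is expected to return an error, which the sketch reports instead of launching.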
10
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache
Cache vs shared:
• Easier to use
• 90%+ as good
Shared vs cache:
• Faster atomics
• More banks
• More predictable
[Chart: Average Shared Memory Benefit. Pascal: 70%, Volta: 93%. Directed testing: shared in global]
11
INDEPENDENT THREAD SCHEDULING
12
PRE-VOLTA WARP EXECUTION MODEL
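32 thread warp sharing one Program Counter (PC) and Stack (S)
    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
[Diagram: pre-Volta execution over time. The warp diverges, the two branches (A; B; and X; Y;) run one after the other, then the warp reconverges]
No Synchronization Permitted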
13
VOLTA WARP EXECUTION MODEL
32 thread warp with independent scheduling: a Program Counter (PC) and Stack (S) for each of the 32 threads
Convergence Optimizer
    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }
    __syncwarp();
[Diagram: Volta execution over time. The warp diverges, statements from A; B; and X; Y; are scheduled independently, and the threads synchronize at __syncwarp()]
Synchronization may lead to interleaved scheduling!
14
VOLTA: INDEPENDENT THREAD SCHEDULING
Extended SIMT model enables thread-parallel programs to execute with vector efficiency
Volta Independent Thread Scheduling:
• Enables interleaved execution of statements from divergent branches
• Enables execution of fine-grain parallel algorithms where threads within a warp may synchronize and communicate
• At any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before
• Execution is still SIMT which retains the high throughput
• Use explicit synchronization, don't rely on implicit convergence
• CUDA 9 provides a fully explicit synchronization model
Volta: Threads may wait for messages
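The advice to use explicit synchronization can be illustrated with a small warp-level reduction. This is a sketch, not code from the talk: the kernel name warp_sum and the test data are made up for illustration. Older warp-synchronous code often omitted the barrier between steps and relied on the warp running in lockstep; here each step ends with __syncwarp(), so the shared-memory writes of one round are ordered before the reads of the next.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: sums 32 integers with a single warp through shared
// memory, with an explicit warp barrier after every reduction step.
__global__ void warp_sum(const int *in, int *out) {
    __shared__ int buf[32];
    const unsigned lane = threadIdx.x;  // launched with exactly 32 threads

    buf[lane] = in[lane];
    __syncwarp();  // all loads into shared memory are complete

    for (unsigned offset = 16; offset > 0; offset /= 2) {
        if (lane < offset) {
            // Reads a slot in [offset, 2*offset); those slots are only
            // written in earlier rounds, never in this one.
            buf[lane] += buf[lane + offset];
        }
        __syncwarp();  // this round's writes are visible before the next round
    }

    if (lane == 0) {
        *out = buf[0];
    }
}

int main() {
    int *in = nullptr;
    int *out = nullptr;
    cudaMallocManaged(&in, 32 * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < 32; ++i) {
        in[i] = i + 1;  // 1 + 2 + ... + 32 = 528
    }

    warp_sum<<<1, 32>>>(in, out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("sum = %d\n", *out);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The __syncwarp() calls sit outside the if, so all 32 lanes reach each barrier with the default full mask.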
15
VOLTA TENSOR CORE
16
TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
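As a rough check, the deck's headline Tensor Core figure can be reproduced from the 4x4x4 array size. The host-only sketch below assumes one 4x4x4 multiply-accumulate (64 fused multiply-adds) per Tensor Core per clock and a boost clock of about 1530 MHz. The clock value comes from published V100 specifications, not from these slides, and is labelled as an assumption in the code.

```cuda
#include <cstdio>

int main() {
    const double tensor_cores  = 640.0;        // from the Tesla V100 spec slide
    const double fma_per_clock = 4.0 * 4 * 4;  // one 4x4x4 matrix multiply-accumulate
    const double flops_per_fma = 2.0;          // one multiply plus one add
    const double clock_hz      = 1.53e9;       // ASSUMPTION: ~1530 MHz boost clock, not stated in the deck

    const double tflops =
        tensor_cores * fma_per_clock * flops_per_fma * clock_hz / 1e12;

    // 640 * 64 * 2 * 1.53e9 = 1.253e14, i.e. roughly 125 TFLOPS
    printf("peak Tensor Core throughput: about %.0f TFLOPS\n", tflops);
    return 0;
}
```

Under these assumptions the result lands near the 125 TOPS entry in the performance comparison table; the 120 TFLOPS quoted on the introduction slide would correspond to a slightly lower clock.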
18
USING TENSOR CORES
Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++ Warp-Level Matrix Operations:
    __device__ void tensor_op_16_16_16(
        float *d, half *a, half *b, float *c)
    {
        wmma::fragment<matrix_a, …> Amat;
        wmma::fragment<matrix_b, …> Bmat;
        wmma::fragment<matrix_c, …> Cmat;
        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);
        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
        wmma::store_matrix_sync(d, Cmat, 16,
                                wmma::row_major);
    }
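For readers who want something that compiles, here is a sketch of the same 16x16x16 operation against the WMMA API as released in the nvcuda::wmma namespace. It is not the slide's code: the template arguments the slide elides are filled in with one valid choice (row-major A, column-major B, FP32 accumulator). The released API spells the accumulator fragment wmma::accumulator and the store layout wmma::mem_row_major. The kernel name wmma_tile and the all-ones test data are illustrative. It targets compute capability 7.0 or newer (for example nvcc -arch=sm_70).

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes D = A * B for a single 16x16x16 tile
// (FP16 inputs, FP32 accumulation).
__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc = A * B + acc
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

int main() {
    const int elems = 16 * 16;
    half *a = nullptr;
    half *b = nullptr;
    float *d = nullptr;
    cudaMallocManaged(&a, elems * sizeof(half));
    cudaMallocManaged(&b, elems * sizeof(half));
    cudaMallocManaged(&d, elems * sizeof(float));

    // All-ones inputs: every output element is a dot product of 16 ones.
    for (int i = 0; i < elems; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
    }

    wmma_tile<<<1, 32>>>(a, b, d);  // WMMA calls are collective over one full warp
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("d[0] = %.1f, d[255] = %.1f\n", d[0], d[elems - 1]);
    cudaFree(a);
    cudaFree(b);
    cudaFree(d);
    return 0;
}
```

The block is launched with exactly 32 threads because every wmma call must be executed by all threads of the warp.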
19
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Chart: cuBLAS Mixed Precision (FP16 input, FP32 compute). Relative performance vs matrix size (M=N=K = 512, 1024, 2048, 4096); P100 (CUDA 8) vs V100 Tensor Cores (CUDA 9); callout: 9.3x]
[Chart: cuBLAS Single Precision (FP32). Relative performance vs matrix size (M=N=K = 512, 1024, 2048, 4096); P100 (CUDA 8) vs V100 (CUDA 9); callout: 1.8x]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
20
NEW HBM2 MEMORY ARCHITECTURE
[Chart: STREAM Triad, delivered GB/s. P100: 76% DRAM utilization; V100: 95% DRAM utilization; 1.5x delivered bandwidth]
• Unifying Compute & Memory in Single Package
• More bandwidth and more energy efficient
• ECC can be active without a bandwidth or capacity penalty
21
VOLTA NVLINK
• 6 NVLINKS @ 50 GB/s bidirectional
• Reduce number of lanes for lightly loaded link (Power savings)
• Coherence features for NVLINK enabled CPUs
[Diagrams: POWER9 based node; Hybrid cube mesh (e.g. DGX1V)]
22
STATE OF UNIFIED MEMORY
High performance, low effort
• Allocate Beyond GPU Memory Size
  [Diagram: Unified Memory spanning GPU and CPU]
• Automatic data movement for allocatables
[Chart: PGI OpenACC on Pascal P100, Unified Memory performance vs no Unified Memory (explicit data movement): 86% on PCI-E, 91% on NVLink. Geometric mean across all 15 SPEC ACCEL™ benchmarks]
PGI 17.1 Compilers OpenACC SPEC ACCEL™ 1.1 performance measured March, 2017. SPEC® and the benchmark name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.
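The benchmark above uses OpenACC, but the same "low effort" idea can be sketched in CUDA C++ with cudaMallocManaged. A single pointer is valid on both CPU and GPU, and no explicit copies appear. The kernel name scale and the array size are illustrative assumptions, not taken from the talk.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by a constant factor.
__global__ void scale(float *x, size_t n, float factor) {
    const size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= factor;
    }
}

int main() {
    const size_t n = 1 << 20;
    float *x = nullptr;

    // One allocation, one pointer, usable from host and device code.
    cudaError_t err = cudaMallocManaged(&x, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("allocation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (size_t i = 0; i < n; ++i) {
        x[i] = 1.0f;  // first touched on the CPU
    }

    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);
    scale<<<blocks, threads>>>(x, n, 2.0f);  // data reaches the GPU without a cudaMemcpy

    err = cudaDeviceSynchronize();  // required before the CPU reads x again
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("x[0] = %.1f, x[n-1] = %.1f\n", x[0], x[n - 1]);
    cudaFree(x);
    return 0;
}
```

On Pascal and later GPUs the same allocation call may also exceed the GPU's physical memory, which is the "Allocate Beyond GPU Memory Size" point on the slide; the sketch keeps the array small so it runs anywhere managed memory is supported.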
23
VOLTA + UNIFIED MEMORY
VOLTA + NVLINK CPU
VOLTA + PCIE CPU
24
VOLTA MULTI-PROCESS SERVICE
• Hardware Accelerated Work Submission
• Hardware Isolation
[Diagram: CPU processes A, B, C submit through CUDA MULTI-PROCESS SERVICE CONTROL; GPU execution of A, B, C on Volta GV100]
Volta MPS Enhancements:
• MPS clients submit work directly to the work queues within the GPU
• Reduced launch latency
• Improved launch throughput
• Improved isolation amongst MPS clients
• Address isolation with independent address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal
25
VOLTA MPS FOR INFERENCE
Efficient inference deployment without batching system
[Chart: ResNet-50 images/sec at 7 ms latency]
• Single Volta Client, No Batching, No MPS
• Multiple Volta Clients, No Batching, Using MPS: 7x faster, 60% of perf with batching
• Volta with Batching System
V100 measured on pre-production hardware.
26
GPU PERFORMANCE COMPARISON
                         P100            V100              Ratio
Training acceleration    10 TOPS         125 TOPS          12.5x
Inference acceleration   21 TFLOPS       125 TOPS          6x
FP64/FP32                5/10 TFLOPS     7.8/15.7 TFLOPS   1.5x
HBM2 Bandwidth           720 GB/s        900 GB/s          1.2x
NVLink Bandwidth         160 GB/s        300 GB/s          1.9x
L2 Cache                 4 MB            6 MB              1.5x
L1 Caches                1.3 MB          10 MB             7.7x
27
REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance
Over 80x DL Training Performance in 3 Years
[Chart: GoogLeNet training performance, speedup vs K80 (axis 0x to 100x), Q1 15 to Q2 17. Bars: 1x K80 (cuDNN2), 4x M40 (cuDNN3), 8x P100 (cuDNN6), 8x V100 (cuDNN7)]
85% Scale-Out Efficiency: Scales to 64 GPUs with Microsoft Cognitive Toolkit
[Chart: Multi-Node Training with NCCL 2.0 (ResNet-50). 64X V100: 1 Hour; 8X V100: 7.4 Hours; 8X P100: 18 Hours]
ResNet50 Training for 90 Epochs with 1.28M images dataset | Cognitive Toolkit with NCCL 2.0 | V100 performance measured on pre-production hardware.
3X Reduction in Time to Train Over P100
[Chart: LSTM Training (Neural Machine Translation). 2X CPU: 15 Days; 1X P100: 18 Hours; 1X V100: 6 Hours]
Neural Machine Translation Training for 13 Epochs | German -> English, WMT15 subset | CPU = 2x Xeon E5 2699 V4 | V100 performance measured on pre-production hardware.
28
VOLTA HPC PERFORMANCE
[Chart: HPC application performance relative to Tesla P100]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
29
INTRODUCING CUDA 9
BUILT FOR VOLTA
• Tesla V100
• New GPU Architecture
• Tensor Cores
• NVLink
• Independent Thread Scheduling
COOPERATIVE THREAD GROUPS
• Flexible Thread Groups
• Efficient Parallel Algorithms
• Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs
FASTER LIBRARIES
• cuBLAS for Deep Learning
• NPP for Image Processing
• cuFFT for Signal Processing
DEVELOPER TOOLS & PLATFORM UPDATES
• Faster Compile Times
• Unified Memory Profiling
• NVLink Visualization
• New OS and Compiler Support
30
CUDA 9: WHAT’S NEW IN LIBRARIES
VOLTA PLATFORM SUPPORT
• Utilize Volta Tensor Cores
• Volta optimized GEMMs (cuBLAS)
• Out-of-box performance on Volta (all libraries)
PERFORMANCE
• GEMM optimizations for RNNs (cuBLAS)
• Faster image processing (NPP)
• FFT optimizations across various sizes (cuFFT)
NEW ALGORITHMS
• Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
• Breadth first search, clustering, triangle counting, extraction & contraction (nvGRAPH)
IMPROVED USER EXPERIENCE
• New install package for CUDA Libraries (library-only meta package)
• Modular NPP with small footprint, support for image batching
[Diagram labels: Deep Learning, Scientific Computing]
31
CUDA 9: UP TO 5X FASTER LIBRARIES
• cuBLAS: 5x – 9x faster GEMM operations speed up deep learning and HPC apps
  [Chart: speedup vs. CUDA 8* by matrix size (512, 1024, 2048, 2816); series: FP32; FP16 I/O, FP32 Compute]
• cuFFT: 2x faster library speeds up image, video and signal processing operations
  [Chart: speedup vs. CUDA 8* by data size (1, 64, 16384, 4194304); series: 1D, 2D, 3D]
• NPP: Up to 100x faster than IPP for image processing and computer vision operations
  [Chart: speedup vs. IPP** (0x to 100x) for Color Proc., Filters, Geometry Transforms, JPEG, Morphological Ops.]
* V100 and CUDA 9 (r384); Intel Xeon Broadwell, dual socket, E5-2698 v4@ 2.6GHz, 3.5GHz Turbo with Ubuntu 14.04.5 x86_64 with 128GB System Memory
* P100 and CUDA 8 (r361); For cublas CUDA 8 (r361): Intel Xeon Haswell, single-socket, 16-core E5-2698 v3@ 2.3GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 with 128GB System Memory
** CPU system running IPP: Intel Xeon Haswell single-socket 16-core E5-2698 v3@ 2.3GHz, 3.6GHz Turbo Ubuntu 14.04.5 x86_64 with 128GB System Memory
32
COOPERATIVE GROUPS
33
COOPERATIVE GROUPS
A flexible model for synchronisation and communication within groups of threads
[Diagrams: Levels of cooperation: TODAY; Levels of cooperation: CUDA 9]
34
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization
Thread groups are explicit objects in your program
    thread_group block = this_thread_block();
You can synchronize threads in a group
    block.sync();
Create new groups by partitioning existing groups
    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4 = tiled_partition(tile32, 4);
Partitioned groups can also synchronize
    tile4.sync();
Note: thread_group, this_thread_block() and tiled_partition() are part of the cooperative_groups:: namespace
[Diagram: Thread Block Group; Partitioned Thread Groups]
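The basics above can be put together in a small, complete program. This is a sketch under stated assumptions, not code from the talk: the kernel name sum_by_tiles and the input data are invented. It uses the statically sized cg::tiled_partition<32>() form rather than the dynamic tiled_partition(block, 32) shown on the slide, because the static thread_block_tile type provides the shfl_down() member used for the reduction.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Sums a value across one 32-thread tile using tile-level shuffles.
// After the loop, the thread with thread_rank() == 0 holds the tile's total.
__device__ int tile_sum(cg::thread_block_tile<32> tile, int val) {
    for (unsigned offset = tile.size() / 2; offset > 0; offset /= 2) {
        val += tile.shfl_down(val, offset);
    }
    return val;
}

// Hypothetical kernel: each tile reduces its 32 inputs, then the tile leader
// adds the partial sum to a global total.
__global__ void sum_by_tiles(const int *in, int *out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(block);

    int v = in[block.thread_rank()];
    v = tile_sum(tile32, v);

    if (tile32.thread_rank() == 0) {
        atomicAdd(out, v);
    }
}

int main() {
    const int threads = 128;  // four tiles of 32 threads
    int *in = nullptr;
    int *out = nullptr;
    cudaMallocManaged(&in, threads * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < threads; ++i) {
        in[i] = 1;
    }
    *out = 0;

    sum_by_tiles<<<1, threads>>>(in, out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("total = %d (expected %d)\n", *out, threads);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Passing the group object into tile_sum makes the set of cooperating threads explicit in the function signature, which is the design point the slide makes about groups being first-class objects.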
35
COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication
Define, synchronize, and partition groups of cooperating threads
• Flexible: High-performance API for clean and robust management of thread groups
• Scalable: Create and manage groups within warps, across thread blocks, and even across GPUs
• Deploy Everywhere (*): Kepler and Newer GPUs
• Supported by CUDA developer tools
* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs
[Diagram: Thread Block Group; Partitioned Thread Groups]
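Synchronizing across thread blocks can be sketched with a grid-wide group. The example below is illustrative only: the kernel name two_phase and the data layout are assumptions. It relies on cg::this_grid(), grid.sync() and cudaLaunchCooperativeKernel, and sizes the grid from an occupancy query because all blocks of a cooperative launch must be resident on the GPU at the same time. Depending on the toolkit version, grid synchronization may also require compiling with relocatable device code (-rdc=true).

```cuda
#include <cstdio>
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Phase 1 writes a[i]; phase 2 reads an element written by another block,
// so a grid-wide barrier is needed between the two phases.
__global__ void two_phase(int *a, int *b, int n) {
    cg::grid_group grid = cg::this_grid();
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) a[i] = i;
    grid.sync();  // every block has finished phase 1
    if (i < n) b[i] = a[(i + blockDim.x) % n];
}

int main() {
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    if (!prop.cooperativeLaunch) {
        printf("cooperative launch is not supported on this device\n");
        return 0;
    }

    // Size the grid so that all blocks can be co-resident.
    const int threads = 128;
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, two_phase,
                                                  threads, 0);
    const int blocks = blocks_per_sm * prop.multiProcessorCount;
    int n = blocks * threads;

    int *a = nullptr;
    int *b = nullptr;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));

    void *args[] = {&a, &b, &n};
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)two_phase, dim3(blocks), dim3(threads), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("cooperative launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Verify on the host that every thread saw its neighbour block's value.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (b[i] != (i + threads) % n) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

A regular <<<...>>> launch would not be valid here, since grid.sync() is only defined for kernels started through the cooperative launch API.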
36
DEVELOPER TOOLS
37
UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source
Page Fault Correlation
38
NEW UNIFIED MEMORY EVENTS
Visualize Virtual Memory Activity
• Page Throttling
• Memory Thrashing
• Remote Map
39
FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc
Removes CUDA-specific allocator restrictions
Data movement is transparently handled
Requires operating system support: HMM Linux Kernel Module
CUDA 8 Code with System Allocator:
    void sortfile(FILE *fp, int N) {
        char *data;
        // Allocate memory using any standard allocator
        data = (char *) malloc(N * sizeof(char));
        fread(data, 1, N, fp);
        sort<<<...>>>(data, N, 1, compare);
        use_data(data);
        // Free the allocated memory
        free(data);
    }
40
ADDITIONAL RESOURCES
• Volta
• Whitepaper http://www.nvidia.com/object/volta-architecture-whitepaper.html
• Blog https://devblogs.nvidia.com/parallelforall/inside-volta
• CUDA 9
• Blog https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed
• Download https://developer.nvidia.com/cuda-downloads
Axel Koehler, Principal Solution Architect
akoehler@nvidia.com

Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Dernier

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Dernier (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Inside the Volta GPU Architecture and CUDA 9

  • 1. INSIDE THE VOLTA GPU ARCHITECTURE AND CUDA 9 Axel Koehler, Principal Solution Architect GPU Technology Conference Europe, October 2017
  • 2. 2 CONTINUED DEMAND FOR COMPUTE POWER Ever-increasing compute power demand in HPC: comprehensive Earth system model; coupled simulation of entire cells; simulation of combustion for new high-efficiency, low-emission engines; predictive calculations for supernovae. Neural network complexity is exploding: 2015 Microsoft ResNet, superhuman image recognition (7 ExaFLOPS, 60 million parameters); 2016 Baidu Deep Speech 2, superhuman voice recognition (20 ExaFLOPS, 300 million parameters); 2017 Google Neural Machine Translation, near-human language translation (100 ExaFLOPS, 8700 million parameters)
  • 3. 3 INTRODUCING TESLA V100 The Fastest and Most Productive GPU for Deep Learning and HPC. Volta Architecture: Most Productive GPU. Tensor Core: 120 Programmable TFLOPS Deep Learning. Improved SIMT Model: New Algorithms. Volta MPS: Inference Utilization. Improved NVLink & HBM2: Efficient Bandwidth
  • 4. 4 NVIDIA Tesla V100 SXM2 Module with Volta GV100 GPU
  • 5. 5 TESLA V100: 21B transistors; 815 mm²; 80 SMs; 5120 CUDA Cores; 640 Tensor Cores; 16 GB HBM2; 900 GB/s HBM2; 300 GB/s NVLink (*full GV100 chip contains 84 SMs)
  • 7. 7 VOLTA GV100 SM Redesigned for Productivity: completely new ISA; twice the schedulers; simplified issue logic; large, fast L1 cache; improved SIMT model; tensor acceleration
        Per SM                     GP100                        GV100
        FP32 units                 64                           64
        FP64 units                 32                           32
        INT32 units                NA                           64
        Tensor Cores               NA                           8
        Register File              256 KB                       256 KB
        Unified L1/Shared memory   L1: 24 KB, Shared: 64 KB     128 KB
        Active Threads             2048                         2048
  • 8. 8 UNIFYING KEY TECHNOLOGIES Pascal SM: Load/Store Units; Shared Memory 64 KB (low latency); L1$ 24 KB (streaming); L2$ 4 MB. Volta SM: Load/Store Units; L1$ and Shared Memory 128 KB (low latency and streaming); L2$ 6 MB
  • 9. 9 VOLTA L1 AND SHARED MEMORY (diagram: SM with Load/Store Units, L1$ and Shared Memory 128 KB, L2$ 6 MB) Volta Streaming L1$: unlimited cache misses in flight; low cache hit latency; 4x more bandwidth; 5x more capacity. Volta Shared Memory: unified storage with L1; configurable up to 96 KB
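The "configurable up to 96 KB" point on slide 9 can be tried from CUDA 9 onward with `cudaFuncSetAttribute`. The following is a minimal sketch, not code from the deck: the kernel `reverse_in_shared` and the host helper `run_reverse_96kb` are hypothetical names, and it assumes a GPU that allows a 96 KB opt-in per block (such as V100). On other devices the attribute call is expected to fail and the helper returns false.

```cuda
#include <cuda_runtime.h>

// Reverse an array in place, staging the whole array in dynamic shared memory.
__global__ void reverse_in_shared(int *data, int n) {
    extern __shared__ int tile[];
    for (int i = threadIdx.x; i < n; i += blockDim.x) tile[i] = data[i];
    __syncthreads();  // every element is staged before anyone reads it back
    for (int i = threadIdx.x; i < n; i += blockDim.x) data[i] = tile[n - 1 - i];
}

// Hypothetical host helper: true if a single block could use 96 KB of shared memory.
bool run_reverse_96kb() {
    const int n = 96 * 1024 / sizeof(int);   // 24576 ints = 96 KB
    const size_t bytes = n * sizeof(int);
    int *h = new int[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Requests above 48 KB of dynamic shared memory need an explicit opt-in.
    cudaError_t err = cudaFuncSetAttribute(
        reverse_in_shared, cudaFuncAttributeMaxDynamicSharedMemorySize,
        static_cast<int>(bytes));
    if (err == cudaSuccess) {
        reverse_in_shared<<<1, 1024, bytes>>>(d, n);
        err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    }
    // If the launch failed, the data is not reversed and this check fails.
    bool ok = (err == cudaSuccess) && h[0] == n - 1 && h[n - 1] == 0;
    cudaFree(d);
    delete[] h;
    return ok;
}
```

Calling `run_reverse_96kb()` from `main` and printing the result is enough to see whether the device accepts the larger carve-out. Kernels that stay at or below 48 KB need no opt-in.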
  • 10. 10 NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache. Cache vs. shared: • Easier to use • 90%+ as good. Shared vs. cache: • Faster atomics • More banks • More predictable. [Chart: Average Shared Memory Benefit (directed testing: shared in global): Pascal 70%, Volta 93%]
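  • 12. 12 PRE-VOLTA WARP EXECUTION MODEL 32-thread warp with a single Program Counter (PC) and Stack (S). Code: if (threadIdx.x < 4) { A; B; } else { X; Y; } [Timeline, Pre-Volta: diverge; the A; B; path and the X; Y; path run one after the other; reconverge] No synchronization permitted inside the divergent branches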
  • 13. 13 VOLTA WARP EXECUTION MODEL 32-thread warp with independent scheduling: each of the 32 threads has its own PC and S, plus a Convergence Optimizer. Code: if (threadIdx.x < 4) { A; __syncwarp(); B; } else { X; __syncwarp(); Y; } __syncwarp(); [Timeline, Volta: diverge; A; and X; run, then B; and Y;, then synchronize] Synchronization may lead to interleaved scheduling!
  • 14. 14 VOLTA: INDEPENDENT THREAD SCHEDULING Extended SIMT model enables thread-parallel programs to execute with vector efficiency. Volta Independent Thread Scheduling: • Enables interleaved execution of statements from divergent branches • Enables execution of fine-grain parallel algorithms where threads within a warp may synchronize and communicate • At any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before • Execution is still SIMT which retains the high throughput • Use explicit synchronization, don’t rely on implicit convergence • CUDA 9 provides a fully explicit synchronization model. Volta: Threads may wait for messages
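The advice "use explicit synchronization, don’t rely on implicit convergence" can be illustrated with a warp-level sum that exchanges data through shared memory. This is a minimal sketch under stated assumptions: the kernel `warp_sum` and helper `run_warp_sum` are hypothetical names, and the launch uses exactly one warp of 32 threads. Every read/write phase is separated by `__syncwarp()`, so the code does not depend on the threads of the warp executing in lockstep.

```cuda
#include <cuda_runtime.h>

// Sum 32 values within one warp, communicating through shared memory.
__global__ void warp_sum(const int *in, int *out) {
    __shared__ int buf[32];
    unsigned lane = threadIdx.x;          // assumes a launch with 32 threads
    buf[lane] = in[lane];
    __syncwarp();                         // all values stored before any read
    for (int offset = 16; offset > 0; offset /= 2) {
        int v = 0;
        if (lane < offset) v = buf[lane] + buf[lane + offset];
        __syncwarp();                     // all reads done before any write
        if (lane < offset) buf[lane] = v;
        __syncwarp();                     // writes visible before the next round
    }
    if (lane == 0) *out = buf[0];
}

// Hypothetical host helper: sums 1..32 on the GPU and returns the result.
int run_warp_sum() {
    int h_in[32], h_out = -1;
    for (int i = 0; i < 32; ++i) h_in[i] = i + 1;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Note that `__syncwarp()` is called outside the `if (lane < offset)` branches, so all 32 threads named in the default full mask reach each call.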
  • 16. 16 TENSOR CORE Mixed Precision Matrix Math - 4x4 matrices New CUDA TensorOp instructions & data formats 4x4x4 matrix processing array D[FP32] = A[FP16] * B[FP16] + C[FP32]
  • 17. 18 USING TENSOR CORES Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT. CUDA C++ Warp-Level Matrix Operations:
        __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
            wmma::fragment<matrix_a, …> Amat;
            wmma::fragment<matrix_b, …> Bmat;
            wmma::fragment<matrix_c, …> Cmat;
            wmma::load_matrix_sync(Amat, a, 16);
            wmma::load_matrix_sync(Bmat, b, 16);
            wmma::fill_fragment(Cmat, 0.0f);
            wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
            wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
        }
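The slide's snippet elides the fragment template arguments. As a hedged illustration, the sketch below fills them in with the `nvcuda::wmma` API as it shipped in CUDA 9 (an `accumulator` fragment and the `mem_row_major` store layout). Those choices, the 16x16x16 FP16-in/FP32-out shape, the layouts, and the names `wmma_16x16x16`, `fill_half` and `run_wmma_check` are assumptions for illustration, not what the deck specifies. It needs to be compiled for a Tensor Core GPU (for example `nvcc -arch=sm_70`).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B for a single 16x16x16 tile (accumulator starts at 0).
__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Fill a half-precision buffer on the device.
__global__ void fill_half(half *p, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half(value);
}

// Hypothetical host helper: with A and B all ones, every element of D is 16.
bool run_wmma_check() {
    const int n = 16 * 16;
    half *a = nullptr, *b = nullptr;
    float *d = nullptr;
    cudaMalloc(&a, n * sizeof(half));
    cudaMalloc(&b, n * sizeof(half));
    cudaMalloc(&d, n * sizeof(float));
    fill_half<<<1, n>>>(a, n, 1.0f);
    fill_half<<<1, n>>>(b, n, 1.0f);
    wmma_16x16x16<<<1, 32>>>(a, b, d);       // the WMMA calls are warp-wide
    float h[n] = {0};
    bool ok = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < n; ++i) ok = ok && (h[i] == 16.0f);
    cudaFree(a);
    cudaFree(b);
    cudaFree(d);
    return ok;
}
```

The kernel is launched with a full warp of 32 threads because all `*_sync` WMMA operations are collective across the warp.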
  • 18. 19 cuBLAS GEMMS FOR DEEP LEARNING V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply. [Chart: cuBLAS Single Precision (FP32), relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096), P100 (CUDA 8) vs. V100 (CUDA 9), annotated 1.8x] [Chart: cuBLAS Mixed Precision (FP16 input, FP32 compute), relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096), P100 (CUDA 8) vs. V100 Tensor Cores (CUDA 9), annotated 9.3x] Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
  • 19. 20 NEW HBM2 MEMORY ARCHITECTURE [Chart: STREAM Triad, delivered GB/s: P100 76% DRAM utilization, V100 95% DRAM utilization, 1.5x delivered bandwidth] • Unifying compute & memory in a single package • More bandwidth and more energy efficient • ECC can be active without a bandwidth or capacity penalty
  • 20. 21 VOLTA NVLINK • 6 NVLINKS @ 50 GB/s bidirectional • Reduced number of lanes for lightly loaded links (power savings) • Coherence features for NVLINK-enabled CPUs. [Diagrams: POWER9-based node; hybrid cube mesh (e.g. DGX1V)]
  • 21. 22 STATE OF UNIFIED MEMORY High performance, low effort. Allocate Beyond GPU Memory Size (diagram: Unified Memory spanning GPU and CPU). Automatic data movement for allocatables. PGI OpenACC on Pascal P100: performance with Unified Memory vs. explicit data movement (no Unified Memory), geometric mean across all 15 SPEC ACCEL™ benchmarks: 86% on PCI-E, 91% on NVLink. PGI 17.1 Compilers OpenACC SPEC ACCEL™ 1.1 performance measured March, 2017. SPEC® and the benchmark name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.
  • 22. 23 VOLTA + UNIFIED MEMORY VOLTA + NVLINK CPU VOLTA + PCIE CPU
  • 23. 24 VOLTA MULTI-PROCESS SERVICE Hardware Accelerated Work Submission; Hardware Isolation. (Diagram: CPU processes A, B, C submit work through the CUDA Multi-Process Service control to GPU execution on Volta GV100.) Volta MPS Enhancements: • MPS clients submit work directly to the work queues within the GPU: reduced launch latency, improved launch throughput • Improved isolation amongst MPS clients: address isolation with independent address spaces, improved quality of service (QoS) • 3x more clients than Pascal
  • 24. 25 VOLTA MPS FOR INFERENCE Efficient inference deployment without batching system. [Chart: ResNet-50 images/sec at 7 ms latency: single Volta client, no batching, no MPS; multiple Volta clients, no batching, using MPS (7x faster, 60% of perf with batching); Volta with batching system] V100 measured on pre-production hardware.
  • 25. 26 GPU PERFORMANCE COMPARISON
                                 P100           V100              Ratio
        Training acceleration    10 TOPS        125 TOPS          12.5x
        Inference acceleration   21 TFLOPS      125 TOPS          6x
        FP64/FP32                5/10 TFLOPS    7.8/15.7 TFLOPS   1.5x
        HBM2 Bandwidth           720 GB/s       900 GB/s          1.2x
        NVLink Bandwidth         160 GB/s       300 GB/s          1.9x
        L2 Cache                 4 MB           6 MB              1.5x
        L1 Caches                1.3 MB         10 MB             7.7x
  • 26. 27 REVOLUTIONARY AI PERFORMANCE 3X Faster DL Training Performance. Over 80x DL Training Performance in 3 Years [Chart: GoogLeNet training performance, speedup vs. K80 (0x to 100x): 1x K80 cuDNN2 (Q1 15), 4x M40 cuDNN3 (Q3 15), 8x P100 cuDNN6 (Q2 16), 8x V100 cuDNN7 (Q2 17)]. 85% Scale-Out Efficiency: scales to 64 GPUs with Microsoft Cognitive Toolkit [Chart: Multi-node training with NCCL 2.0 (ResNet-50): 8X P100 18 hours, 8X V100 7.4 hours, 64X V100 1 hour]. ResNet50 training for 90 epochs with 1.28M images dataset | Cognitive Toolkit with NCCL 2.0 | V100 performance measured on pre-production hardware. 3X Reduction in Time to Train Over P100 [Chart: LSTM training (Neural Machine Translation): 2X CPU 15 days, 1X P100 18 hours, 1X V100 6 hours]. Neural Machine Translation training for 13 epochs | German -> English, WMT15 subset | CPU = 2x Xeon E5 2699 V4 | V100 performance measured on pre-production hardware.
  • 27. 28 VOLTA HPC PERFORMANCE [Chart: performance relative to Tesla P100] System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
  • 28. 29 INTRODUCING CUDA 9 BUILT FOR VOLTA: Tesla V100; New GPU Architecture; Tensor Cores; NVLink; Independent Thread Scheduling. COOPERATIVE THREAD GROUPS: Flexible Thread Groups; Efficient Parallel Algorithms; Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs. FASTER LIBRARIES: cuBLAS for Deep Learning; NPP for Image Processing; cuFFT for Signal Processing. DEVELOPER TOOLS & PLATFORM UPDATES: Faster Compile Times; Unified Memory Profiling; NVLink Visualization; New OS and Compiler Support
  • 29. 30 CUDA 9: WHAT’S NEW IN LIBRARIES (Deep Learning and Scientific Computing) VOLTA PLATFORM SUPPORT: Utilize Volta Tensor Cores; Volta optimized GEMMs (cuBLAS); Out-of-box performance on Volta (all libraries). PERFORMANCE: GEMM optimizations for RNNs (cuBLAS); Faster image processing (NPP); FFT optimizations across various sizes (cuFFT). NEW ALGORITHMS: Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER); Breadth first search, clustering, triangle counting, extraction & contraction (nvGRAPH). IMPROVED USER EXPERIENCE: New install package for CUDA Libraries (library-only meta package); Modular NPP with small footprint, support for image batching
  • 30. 31 CUDA 9: UP TO 5X FASTER LIBRARIES cuBLAS: 5x to 9x faster GEMM operations speed up deep learning and HPC apps [Chart: speedup vs. CUDA 8* by matrix size (512, 1024, 2048, 2816), FP32 and FP16 I/O with FP32 compute]. cuFFT: 2x faster library speeds up image, video and signal processing operations [Chart: speedup vs. CUDA 8* by data size (1 to 4194304), 1D, 2D, 3D]. NPP: Up to 100x faster than IPP for image processing and computer vision operations [Chart: speedup vs. IPP**: Color Proc., Filters, Geometry Transforms, JPEG, Morphological Ops.]. * V100 and CUDA 9 (r384); Intel Xeon Broadwell, dual socket, E5-2698 v4@ 2.6GHz, 3.5GHz Turbo with Ubuntu 14.04.5 x86_64 with 128GB System Memory * P100 and CUDA 8 (r361); For cublas CUDA 8 (r361): Intel Xeon Haswell, single-socket, 16-core E5-2698 v3@ 2.3GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 with 128GB System Memory ** CPU system running IPP: Intel Xeon Haswell single-socket 16-core E5-2698 v3@ 2.3GHz, 3.6GHz Turbo Ubuntu 14.04.5 x86_64 with 128GB System Memory
  • 32. 33 COOPERATIVE GROUPS A flexible model for synchronisation and communication within groups of threads. [Diagrams: Levels of cooperation: TODAY; Levels of cooperation: CUDA 9]
  • 33. 34 COOPERATIVE GROUPS BASICS Flexible, Explicit Synchronization. Thread groups are explicit objects in your program: thread_group block = this_thread_block(); You can synchronize threads in a group: block.sync(); Create new groups by partitioning existing groups: thread_group tile32 = tiled_partition(block, 32); thread_group tile4 = tiled_partition(tile32, 4); Partitioned groups can also synchronize: tile4.sync(); Note: calls in green are part of the cooperative_groups:: namespace. [Diagram: Thread Block Group; Partitioned Thread Groups]
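The partitioning shown on slide 33 can be put to work in a small kernel. This is a minimal sketch, assuming one block of 32 threads. The kernel `tile_sums` and helper `run_tile_sums` are hypothetical names. It uses the compile-time `tiled_partition<4>` form rather than the slide's run-time `tiled_partition(tile32, 4)`, because the statically sized tile also offers shuffle operations such as `shfl_down`.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// Each group of 4 consecutive threads sums its 4 inputs.
__global__ void tile_sums(const int *in, int *out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<4> tile4 = cg::tiled_partition<4>(block);

    int v = in[block.thread_rank()];
    // Tree reduction inside the 4-thread tile: offsets 2, then 1.
    for (int offset = 2; offset > 0; offset /= 2)
        v += tile4.shfl_down(v, offset);

    if (tile4.thread_rank() == 0)            // lane 0 holds the tile's sum
        out[block.thread_rank() / 4] = v;
}

// Hypothetical host helper: checks all 8 tile sums for inputs 0..31.
bool run_tile_sums() {
    int h_in[32], h_out[8] = {0};
    for (int i = 0; i < 32; ++i) h_in[i] = i;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    tile_sums<<<1, 32>>>(d_in, d_out);
    bool ok = cudaMemcpy(h_out, d_out, sizeof(h_out),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    // Tile k covers inputs 4k..4k+3, so its sum is 16k + 6.
    for (int k = 0; k < 8; ++k) ok = ok && (h_out[k] == 16 * k + 6);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

The group objects make the scope of each collective explicit: the shuffle only involves the four threads of `tile4`, regardless of what the rest of the warp is doing.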
  • 34. 35 COOPERATIVE GROUPS Flexible and Scalable Thread Synchronization and Communication. Define, synchronize, and partition groups of cooperating threads. Flexible: High-performance API for clean and robust management of thread groups. Scalable: Create and manage groups within warps, across thread blocks, and even across GPUs. Deploy Everywhere (*): Kepler and newer GPUs; supported by CUDA developer tools. * Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs. [Diagram: Thread Block Group; Partitioned Thread Groups]
  • 36. 37 UNIFIED MEMORY PROFILING Correlate CPU Page Faults with Source Page Fault Correlation
  • 37. 38 NEW UNIFIED MEMORY EVENTS Visualize Virtual Memory Activity: Page Throttling, Memory Thrashing, Remote Map
  • 38. 39 FUTURE: UNIFIED SYSTEM ALLOCATOR Allocate unified memory using standard malloc. Removes CUDA-specific allocator restrictions. Data movement is transparently handled. Requires operating system support: HMM Linux Kernel Module. CUDA 8 Code with System Allocator:
        void sortfile(FILE *fp, int N) {
            char *data;
            // Allocate memory using any standard allocator
            data = (char *) malloc(N * sizeof(char));
            fread(data, 1, N, fp);
            sort<<<...>>>(data,N,1,compare);
            use_data(data);
            // Free the allocated memory
            free(data);
        }
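The slide describes a future in which plain `malloc` memory is usable from kernels. For comparison, the same single-pointer pattern is already available through the CUDA-specific allocator `cudaMallocManaged`. The following minimal sketch shows that variant; the kernel `scale` and helper `run_managed_scale` are hypothetical names, and it is not the system-allocator path the slide refers to.

```cuda
#include <cuda_runtime.h>

// Multiply every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical host helper: one pointer is used on both CPU and GPU.
float run_managed_scale() {
    const int n = 1 << 20;
    float *data = nullptr;
    // Managed allocation: pages migrate between CPU and GPU on demand.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // written on the CPU

    scale<<<(n + 255) / 256, 256>>>(data, n, 3.0f);  // updated on the GPU
    cudaDeviceSynchronize();                         // finish before the CPU reads

    float last = data[n - 1];                        // read back on the CPU
    cudaFree(data);
    return last;
}
```

With the unified system allocator described on the slide, the `cudaMallocManaged`/`cudaFree` pair would be replaced by `malloc`/`free`, provided the operating system support the slide mentions is present.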
  • 39. 40 ADDITIONAL RESOURCES • Volta • Whitepaper http://www.nvidia.com/object/volta-architecture-whitepaper.html • Blog https://devblogs.nvidia.com/parallelforall/inside-volta • CUDA 9 • Blog https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed • Download https://developer.nvidia.com/cuda-downloads
  • 40. Axel Koehler, Principal Solution Architect akoehler@nvidia.com