In this deck from the NVIDIA GPU Technology Conference, Axel Koehler presents: Inside the Volta GPU Architecture and CUDA 9.
"The presentation will give an overview about the new NVIDIA Volta GPU architecture and the latest CUDA 9 release. The NVIDIA Volta architecture powers the worlds most advanced data center GPU for AI, HPC, and Graphics. Volta features a new Streaming Multiprocessor (SM) architecture and includes enhanced features like NVLINK2 and the Multi-Process Service (MPS) that delivers major improvements in performance, energy efficiency, and ease of programmability. New features like Independent Thread Scheduling and the Tensor Cores enable Volta to simultaneously deliver the fastest and most accessible performance. CUDA is NVIDIA''s parallel computing platform and programming model. You''ll learn about new programming model enhancements and performance improvements in the latest CUDA9 release."
Watch the video: https://wp.me/p3RLHQ-iB7
Learn more: https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
1. INSIDE THE VOLTA GPU
ARCHITECTURE AND CUDA 9
Axel Koehler, Principal Solution Architect
GPU Technology Conference Europe, October 2017
2. 2
CONTINUED DEMAND FOR COMPUTE POWER
Examples of demand in HPC: comprehensive Earth system models; coupled simulation of entire cells; simulation of combustion for new high-efficiency, low-emission engines; predictive calculations for supernovae.
Milestones in AI:
2015: Microsoft ResNet, superhuman image recognition (7 ExaFLOPS, 60 million parameters)
2016: Baidu Deep Speech 2, superhuman voice recognition (20 ExaFLOPS, 300 million parameters)
2017: Google Neural Machine Translation, near-human language translation (100 ExaFLOPS, 8700 million parameters)
Neural network complexity is exploding; compute power demand in HPC is ever-increasing.
3. 3
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC
Volta Architecture: Most Productive GPU
Tensor Core: 120 Programmable TFLOPS for Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth
7. 7
VOLTA GV100 SM
                      GP100           GV100
FP32 units            64              64
FP64 units            32              32
INT32 units           NA              64
Tensor Cores          NA              8
Register File         256 KB          256 KB
Unified L1/Shared     L1: 24 KB,      128 KB
memory                Shared: 64 KB
Active Threads        2048            2048
Redesigned for Productivity:
Completely new ISA
Twice the schedulers
Simplified issue logic
Large, fast L1 cache
Improved SIMT model
Tensor acceleration
8. 8
UNIFYING KEY TECHNOLOGIES
[Diagram: the Pascal SM pairs 64 KB of shared memory with a separate 24 KB L1$ behind the load/store units, backed by a 4 MB L2$; the Volta SM unifies L1$ and shared memory into a single 128 KB low-latency, streaming store, backed by a 6 MB L2$.]
9. 9
[Diagram: Volta SM with a unified 128 KB L1$ and shared memory behind the load/store units, backed by a 6 MB L2$.]
VOLTA L1 AND SHARED MEMORY
Volta Streaming L1$:
Unlimited cache misses in flight
Low cache hit latency
4x more bandwidth
5x more capacity
Volta Shared Memory:
Unified storage with L1
Configurable up to 96 KB (see the sketch below)
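Not from the slides: a minimal sketch of how a kernel opts in to the larger shared memory capacity with the CUDA 9 runtime API. The kernel name, signature, and launch configuration are hypothetical.

#include <cuda_runtime.h>

__global__ void my_kernel(float *data)          // hypothetical kernel
{
    extern __shared__ float smem[];             // dynamic shared memory
    // ... use up to 96 KB of smem ...
}

void launch(float *d_data, int numBlocks)
{
    // Sizes above the default 48 KB per block must be requested
    // explicitly; Volta allows up to 96 KB.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         96 * 1024);
    my_kernel<<<numBlocks, 128, 96 * 1024>>>(d_data);
}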
10. 10
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache
Cache vs. shared:
• Easier to use
• 90%+ as good
Shared vs. cache:
• Faster atomics
• More banks
• More predictable
[Chart: average shared memory benefit in directed testing (shared vs. global): Pascal 70%, Volta 93%.]
12. 12
PRE-VOLTA WARP EXECUTION MODEL
[Diagram: pre-Volta, a single program counter (PC) and stack (S) serve the whole 32-thread warp; after the branch diverges, the A; B; and X; Y; paths execute serially in time until they reconverge.]
if (threadIdx.x < 4) {
A;
B;
} else {
X;
Y;
}
No Synchronization Permitted
13. 13
VOLTA WARP EXECUTION MODEL
[Diagram: on Volta, each of the 32 threads in a warp carries its own program counter (PC) and stack (S), and a convergence optimizer regroups threads; statements from the divergent paths A; B; and X; Y; may interleave in time until the threads synchronize.]
Synchronization may lead to interleaved scheduling!
if (threadIdx.x < 4) {
A;
__syncwarp();
B;
} else {
X;
__syncwarp();
Y;
}
__syncwarp();
14. 14
Volta Independent Thread Scheduling:
• Enables interleaved execution of statements from divergent
branches
• Enables execution of fine-grain parallel algorithms where threads
within a warp may synchronize and communicate
• At any given clock cycle, CUDA cores execute the same instruction
for all active threads in a warp just as before
• Execution is still SIMT, which retains the high throughput
• Use explicit synchronization, don’t rely on implicit convergence
• CUDA 9 provides a fully explicit synchronization model
VOLTA: INDEPENDENT THREAD SCHEDULING
Extended SIMT model enables thread-parallel programs to execute with vector efficiency
Volta: threads may wait for messages
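A minimal sketch of the fine-grained communication this enables, assuming a global lock word (0 = free, 1 = held); the function and variable names are illustrative, not from the deck. On pre-Volta GPUs a per-thread spin lock like this can deadlock inside a warp, because the lock holder and the spinning threads share one program counter; with independent thread scheduling the holder keeps making forward progress and releases the lock.

__device__ void locked_increment(int *lock, int *counter)
{
    // Spin until the lock is acquired: atomicCAS returns the old
    // value, and 0 means the lock was free and is now ours.
    while (atomicCAS(lock, 0, 1) != 0)
        ;
    *counter += 1;           // critical section: protected update
    __threadfence();         // make the update visible before release
    atomicExch(lock, 0);     // release the lock
}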
16. 16
TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
New CUDA TensorOp instructions
& data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
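A scalar sketch of the arithmetic one Tensor Core operation performs, as reference semantics only (the hardware executes this as a single matrix operation): FP16 inputs are multiplied and accumulated into an FP32 result.

#include <cuda_fp16.h>

// Reference semantics of D = A * B + C on 4x4 tiles (sketch).
__device__ void tensor_op_reference(float D[4][4], const half A[4][4],
                                    const half B[4][4], const float C[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                 // FP32 accumulator
            for (int k = 0; k < 4; ++k)          // FP16 inputs,
                acc += __half2float(A[i][k]) *   // FP32 accumulate
                       __half2float(B[k][j]);
            D[i][j] = acc;
        }
}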
17. 18
USING TENSOR CORES
Volta Optimized
Frameworks and Libraries
// Warp-level 16x16x16 mixed-precision multiply-accumulate using the
// CUDA 9 WMMA API (fragment template parameters elided on the slide).
__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<accumulator, …> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);     // load the A tile (leading dimension 16)
    wmma::load_matrix_sync(Bmat, b, 16);     // load the B tile
    wmma::fill_fragment(Cmat, 0.0f);         // zero the accumulator

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);  // Cmat = Amat * Bmat + Cmat

    wmma::store_matrix_sync(d, Cmat, 16,
                            wmma::mem_row_major);  // write the result tile
}
CUDA C++
Warp-Level Matrix Operations
NVIDIA cuDNN, cuBLAS, TensorRT
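For the library route, a sketch of requesting Tensor Core math for a mixed-precision GEMM through the CUDA 9 cuBLAS API; the wrapper function, leading dimensions, and operand setup are assumptions.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// d_A and d_B are FP16 device matrices, d_C is FP32 (column-major).
void gemm_tensor_op(cublasHandle_t handle, int m, int n, int k,
                    const half *d_A, const half *d_B, float *d_C)
{
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // allow Tensor Cores
    float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,     // A: FP16 input
                 d_B, CUDA_R_16F, k,     // B: FP16 input
                 &beta,
                 d_C, CUDA_R_32F, m,     // C: FP32 output
                 CUDA_R_32F,             // compute type: FP32
                 CUBLAS_GEMM_DFALT_TENSOR_OP);
}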
18. 19
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Charts: relative performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096). Mixed precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x over P100 (CUDA 8). Single precision (FP32): V100 (CUDA 9) up to 1.8x over P100 (CUDA 8).]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
19. 20
NEW HBM2 MEMORY ARCHITECTURE
[Chart: STREAM Triad delivered GB/s, P100 vs. V100. P100: 76% DRAM utilization; V100: 95% DRAM utilization, 1.5x delivered bandwidth.]
• Unifying compute & memory in a single package
• More bandwidth and more energy efficient
• ECC can be active without a bandwidth or capacity penalty
20. 21
VOLTA NVLINK
• 6 NVLinks @ 50 GB/s bidirectional
• Reduce number of lanes for lightly loaded links (power savings)
• Coherence features for NVLink-enabled CPUs
[Diagrams: POWER9-based node; hybrid cube mesh (e.g. DGX-1V).]
21. 22
STATE OF UNIFIED MEMORY
High performance, low effort
[Diagram: Unified Memory spans GPU and CPU; allocations can exceed GPU memory size.]
[Chart: PGI OpenACC on Pascal P100, geometric mean across all 15 SPEC ACCEL™ benchmarks. Unified Memory (automatic data movement for allocatables) reaches 86% (PCI-E) and 91% (NVLink) of the performance of explicit data movement.]
PGI 17.1 Compilers OpenACC SPEC ACCEL™ 1.1 performance measured March, 2017. SPEC® and the benchmark
name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.
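A minimal sketch of the "allocate beyond GPU memory size" point above, assuming a Pascal or Volta GPU; the kernel and the 32 GB size are hypothetical. Managed allocations may exceed physical GPU memory, and touched pages migrate on demand.

#include <cuda_runtime.h>

__global__ void touch(float *buf, size_t n)     // hypothetical kernel
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;                  // faults pages onto the GPU
}

int main()
{
    size_t n = 8ULL << 30;                      // 8G floats = 32 GB
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float)); // may exceed GPU memory
    touch<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}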
23. 24
VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B, and C submit work through the CUDA Multi-Process Service; on Volta GV100, work submission is hardware accelerated and clients are hardware isolated during GPU execution.]
Volta MPS Enhancements:
• MPS clients submit work directly to
the work queues within the GPU
• Reduced launch latency
• Improved launch throughput
• Improved isolation amongst MPS clients
• Address isolation with independent
address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal
24. 25
VOLTA MPS FOR INFERENCE
Efficient inference deployment without a batching system
[Chart: ResNet-50 images/sec at 7 ms latency. Multiple Volta clients using MPS without batching are 7x faster than a single Volta client without batching or MPS, and reach 60% of the performance of Volta with a batching system.]
V100 measured on pre-production hardware.
28. 29
INTRODUCING CUDA 9
BUILT FOR VOLTA
Tesla V100: new GPU architecture, Tensor Cores, NVLink, Independent Thread Scheduling
COOPERATIVE THREAD GROUPS
Flexible thread groups
Efficient parallel algorithms
Synchronize across thread blocks in a single GPU or across multiple GPUs
FASTER LIBRARIES
cuBLAS for Deep Learning
NPP for Image Processing
cuFFT for Signal Processing
DEVELOPER TOOLS & PLATFORM UPDATES
Faster compile times
Unified Memory profiling
NVLink visualization
New OS and compiler support
29. 30
CUDA 9: WHAT’S NEW IN LIBRARIES
VOLTA PLATFORM SUPPORT
Utilize Volta Tensor Cores
Volta optimized GEMMs (cuBLAS)
Out-of-box performance on Volta (all libraries)
PERFORMANCE
GEMM optimizations for RNNs (cuBLAS)
Faster image processing (NPP)
FFT optimizations across various sizes (cuFFT)
NEW ALGORITHMS
Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
Breadth-first search, clustering, triangle counting, extraction & contraction (nvGRAPH)
IMPROVED USER EXPERIENCE
New install package for CUDA libraries (library-only meta package)
Modular NPP with small footprint, support for image batching
30. 31
CUDA 9: UP TO 5X FASTER LIBRARIES
cuBLAS: 5x – 9x faster GEMM operations speed up deep learning and HPC apps
cuFFT: 2x faster library speeds up image, video and signal processing operations
NPP: up to 100x faster than IPP for image processing and computer vision operations
[Charts: cuFFT speedup vs. CUDA 8 by data size for 1D, 2D and 3D transforms; NPP speedup vs. IPP for color processing, filters, geometry transforms, JPEG, and morphological ops.]
* V100 and CUDA 9 (r384); Intel Xeon Broadwell, dual socket, E5-2698 v4@ 2.6GHz, 3.5GHz Turbo with Ubuntu 14.04.5 x86_64 with 128GB System Memory
* P100 and CUDA 8 (r361); for cuBLAS CUDA 8 (r361): Intel Xeon Haswell, single-socket, 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 with 128GB System Memory
** CPU system running IPP: Intel Xeon Haswell single-socket 16-core E5-2698 v3@ 2.3GHz, 3.6GHz Turbo Ubuntu 14.04.5 x86_64 with 128GB System Memory
[Chart: cuBLAS GEMM speedup vs. CUDA 8 by matrix size (512, 1024, 2048, 2816), for FP32 and for FP16 I/O with FP32 compute.]
32. 33
COOPERATIVE GROUPS
A flexible model for synchronization and communication within groups of threads
[Diagram: levels of cooperation available today vs. with CUDA 9.]
33. 34
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization
Thread groups are explicit objects in your program
You can synchronize threads in a group
Create new groups by partitioning existing groups
Partitioned groups can also synchronize
thread_group block = this_thread_block();
block.sync();
thread_group tile32 = tiled_partition(block, 32);
thread_group tile4 = tiled_partition(tile32, 4);
tile4.sync();
Note: this_thread_block and tiled_partition are part of the cooperative_groups:: namespace
Thread Block Group
Partitioned Thread Groups
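Building on the partitioning above, a sketch of a tile-level reduction; the reduction itself is an illustration, not from the deck. The statically sized tiled_partition<32> returns a thread_block_tile<32>, which adds warp shuffle methods to the generic group interface.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sum one int per thread across a 32-thread tile; lane 0 gets the total.
__device__ int tile_sum(int val)
{
    cg::thread_block_tile<32> tile =
        cg::tiled_partition<32>(cg::this_thread_block());
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        val += tile.shfl_down(val, offset);     // pairwise partial sums
    return val;
}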
34. 35
COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication
Define, synchronize, and partition groups of
cooperating threads
Flexible: High-performance API for clean and
robust management of thread groups
Scalable: Create and manage groups within warps,
across thread blocks, and even across GPUs
Deploy Everywhere (*): Kepler and Newer GPUs
Supported by CUDA developer tools
* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs
Thread Block Group
Partitioned Thread Groups
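A minimal sketch of block-scope and grid-scope synchronization; the kernel body is illustrative. The grid-wide sync requires a cooperative launch via cudaLaunchCooperativeKernel (Pascal or newer, per the note above).

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void two_phase(float *data)
{
    cg::thread_block block = cg::this_thread_block();
    // ... phase 1: each block processes its own portion of data ...
    block.sync();                 // equivalent to __syncthreads()

    cg::grid_group grid = cg::this_grid();
    grid.sync();                  // every block in the grid reaches this point
    // ... phase 2: safely read results written by other blocks ...
}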
38. 39
FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc
Removes CUDA-specific allocator
restrictions
Data movement is transparently
handled
Requires operating system support:
HMM Linux Kernel Module
void sortfile(FILE *fp, int N) {
    char *data;
    // Allocate memory using any standard allocator
    data = (char *) malloc(N * sizeof(char));

    fread(data, 1, N, fp);

    sort<<<...>>>(data, N, 1, compare);

    use_data(data);

    // Free the allocated memory
    free(data);
}
With the system allocator (contrast with the CUDA 8 version below)
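For contrast, a sketch of how the same routine must be written up through CUDA 8, using the CUDA-specific managed allocator (reconstructed; sort, compare, and use_data as above):

void sortfile(FILE *fp, int N) {
    char *data;
    // CUDA-specific allocation of managed memory
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);

    sort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();    // wait for the GPU before touching data on the CPU

    use_data(data);

    cudaFree(data);
}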
39. 40
ADDITIONAL RESOURCES
• Volta
• Whitepaper http://www.nvidia.com/object/volta-architecture-whitepaper.html
• Blog https://devblogs.nvidia.com/parallelforall/inside-volta
• CUDA 9
• Blog https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed
• Download https://developer.nvidia.com/cuda-downloads