8. Extensions to the CUDA Programming Model
Thread Block Clusters
Grid > Cluster > Block > Thread
Introduces a new level in the hierarchy: the cluster
Cooperation between thread blocks (within a cluster)
Scheduled simultaneously
Fast barriers (HW support)
Access to each other's shared memory (Distributed SMEM)
CUDA cooperative groups API (see the sketch after this list)
CUDA: 4 levels (Grid > Cluster > Block > Thread)
HW: 4 levels (GPU > GPC > SM > CUDA core)
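As a minimal sketch of how the cluster level is exposed through the CUDA cooperative groups API (assuming CUDA 12.x and an sm_90 target; the kernel and buffer names are illustrative), the following kernel launches 2-block clusters, uses the hardware-backed cluster barrier, and reads the peer block's shared memory via Distributed SMEM:

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Illustrative kernel: each block in a 2-block cluster publishes a value in its
// own shared memory, then reads the value published by the other block.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int *out)
{
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();          // this block's rank within the cluster

    if (threadIdx.x == 0) smem[0] = 100 + (int)rank;   // publish into our own SMEM
    cluster.sync();                                    // HW-supported cluster-wide barrier

    // Distributed SMEM: map the peer block's shared memory into our address space.
    int *peer = cluster.map_shared_rank(smem, rank ^ 1);
    if (threadIdx.x == 0) out[blockIdx.x] = peer[0];

    cluster.sync();                                    // keep SMEM alive until all peers have read it
}

int main()
{
    int *out;
    cudaMallocManaged(&out, 4 * sizeof(int));
    exchange_kernel<<<4, 32>>>(out);                   // 4 blocks -> 2 clusters of 2 blocks
    cudaDeviceSynchronize();
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    cudaFree(out);
    return 0;
}
```

Built with something like nvcc -arch=sm_90, this should print 101 100 101 100, each block having read the value written by its cluster peer.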
9. Strengthening Asynchronous Execution
End-to-end fully asynchronous execution
Larger scale and more hierarchy levels
Latency increases
Latency hiding becomes harder
More asynchronous execution
Overlap data transfers with compute (see the sketch after the figure below)
Efficient asynchronous-execution features
[Figure: Load/Compute/Store pipeline for chunks A, B, C, shown serialized and then overlapped by running async memory copies alongside compute]
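A minimal sketch of the overlapped pipeline in the figure above, assuming pinned host memory and one CUDA stream per chunk so the async copy of one chunk can proceed while another chunk is being computed (the chunk count, sizes, and the scale kernel are illustrative):

```cpp
#include <cuda_runtime.h>

// Illustrative compute stage: scale each element of a chunk in place.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int nChunks = 3;                  // chunks A, B, C from the figure
    const int chunk = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, nChunks * chunk * sizeof(float));   // pinned memory enables true async copies
    cudaMalloc(&d, nChunks * chunk * sizeof(float));

    cudaStream_t s[nChunks];
    for (int c = 0; c < nChunks; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < nChunks; ++c) {
        size_t off = (size_t)c * chunk;
        // Load (H2D), Compute, Store (D2H) for chunk c are ordered within stream c,
        // while different streams overlap with each other.
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < nChunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```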
13. THREAD BLOCK CLUSTER PERFORMANCE IMPACT
Support Vector Machine (SVM)
[Chart: Thread Block Cluster impact on SVM performance. SVM model training speedup: A100 1.0x, H100 1.3x, H100 with Thread Block Cluster 5.3x]
Locality with Distributed Shared Memory helps SVM performance
For details and more on H100 features: see GTC Sept 2022 sessions
• CUDA: New Features and Beyond [A41100]
• CUDA Programming Model for Hopper Architecture [A41095]
19. H100 Cluster Performance (Estimated)
[Chart: projected speedups of A100, H100, and H100 + NVLink Network clusters on Climate Modeling, Genomics, LQCD, and 3D-FFT]
All performance numbers are preliminary, based on current expectations, and subject to change in shipping products. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink Switch System where indicated. # GPUs: Climate Modeling 1K, Genomics 8, LQCD 1K, 3D-FFT 256.
21. HOPPER GPU + X86 CPU
• PCIe (128 GB/s): connects to the x86 processor; PCIe Gen 5; not cache coherent
• NVLink (900 GB/s): connects to other NVIDIA GPUs; programmed with NCCL, etc. (see the sketch after this list)
• HBM3: 3.35 TB/s
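As a minimal sketch of the "program with NCCL" path noted above, here is one common single-process pattern: an all-reduce across two GPUs, which NCCL routes over NVLink where it is available. The device count and buffer size are illustrative and error checking is omitted.

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    const int nDev = 2;                 // two local GPUs, e.g. connected by NVLink
    const int count = 1 << 20;
    int devs[nDev] = {0, 1};
    ncclComm_t comms[nDev];
    float *buf[nDev];
    cudaStream_t streams[nDev];

    ncclCommInitAll(comms, nDev, devs); // one communicator per local GPU
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // In-place sum across GPUs; the calls are grouped so NCCL can launch them together.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

Link with -lnccl; in multi-node jobs the same ncclAllReduce call is typically driven by one MPI rank per GPU rather than ncclCommInitAll.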
22. HOPPER GPU + GRACE (ARM CPU)
• NVLink-C2C (900 GB/s): connects to the Grace processor; can sustain full NVLink BW into large host memory; cache coherent
• PCIe (128 GB/s): connects to the x86 processor; PCIe Gen 5; not cache coherent
• NVLink (900 GB/s): connects to other NVIDIA GPUs; programmed with NCCL, etc.
• HBM3: 3.35 TB/s
23. GRACE HOPPER SUPERCHIP
GRACE HOPPER SUPERCHIP SKU
GPU: Hopper, 80 GB HBM3, 3.35 TB/s
CPU: ARMv9, 72 cores
CPU Mem: LPDDR5 512 GB, 546 GB/s (4x lower power than DDR5)
CPU to GPU: NVLink-C2C, 900 GB/s & cache coherent; 4x PCIe Gen5 x16
TDP: Max 1000 W
Schedule: 1H 2023
24. CPU Performance Comparison
x86 + Hopper vs. Grace Hopper
(x86 CPU + Hopper SXM vs. Grace Hopper Superchip)
• Host cores: 64 cores vs. 72 cores (1.1x)
• Host memory BW: 205 GB/s vs. 546 GB/s (2.7x)
• Host processor connectivity: 64 GB/s per direction (PCIe Gen5 x16) vs. 450 GB/s per direction (NVLink-C2C) (7x)
• Maximum GPU-to-network bandwidth: 50 GB/s per direction (1x PCIe Gen5 x16) vs. 200 GB/s per direction (4x PCIe Gen5 x16) (4x)
• CPU/GPU coherence: No vs. Yes
• Fully compatible with A100: Yes vs. Mostly
25. GRACE CPU Performance (Estimated) vs. x86 CPU
[Chart: projected Grace vs. x86 CPU performance on CFD, Ocean Modeling, and Homology Search]
26. GRACE HOPPER SUPERCHIP
GRACE HOPPER SUPERCHIP SKU
GPU: Hopper, 80 GB HBM3, 3.35 TB/s
CPU: ARMv9, 72 cores
CPU Mem: LPDDR5 512 GB, 546 GB/s (4x lower power than DDR5)
CPU to GPU: NVLink-C2C, 900 GB/s & cache coherent; 4x PCIe Gen5 x16
TDP: Max 1000 W
Schedule: 1H 2023
27. CPU/GPU MEMORY COHERENCY
X86 Hopper:
• CPU and GPU have separate page tables
• The CPU cannot access GPU memory; the GPU cannot access CPU memory
• Host memory: DDR4
Grace Hopper:
• NVLink-C2C + Address Translation Services (ATS)
• CPU and GPU share page tables
• The CPU and GPU can each access the other's memory (see the sketch after this list)
• A NUMA machine made up of the CPU and GPU (first touch)
• HW access counters, with migration between CPU and GPU
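A minimal sketch of what the shared page tables enable, assuming the device reports pageable memory access (as Grace Hopper does via ATS): the kernel dereferences ordinary malloc()'d host memory directly, with a cudaMallocManaged fallback for an x86 + Hopper system. Sizes and names are illustrative.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    int pageable = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);

    const int n = 1 << 20;
    int *data;
    if (pageable) {
        data = (int *)malloc(n * sizeof(int));       // plain system allocation, GPU-visible via ATS
    } else {
        cudaMallocManaged(&data, n * sizeof(int));   // fallback: managed (migratable) memory
    }
    for (int i = 0; i < n; ++i) data[i] = i;         // first touch on the CPU

    increment<<<(n + 255) / 256, 256>>>(data, n);    // GPU reads and writes the same pointer
    cudaDeviceSynchronize();

    printf("data[42] = %d\n", data[42]);             // CPU sees the GPU's update
    if (pageable) free(data); else cudaFree(data);
    return 0;
}
```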
28. GRACE HOPPER PROGRAMMING MODEL
• ISO C++
• ISO Fortran
• Python
• OpenACC
• OpenMP
• CUDA C++
• CUDA Fortran
Same as x86 + Hopper (a standard-parallelism sketch follows below)
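A minimal sketch of the ISO C++ entry in the list above: standard parallel algorithms with no CUDA in the source, which nvc++ -stdpar=gpu can offload to the GPU on either platform, while any conforming compiler builds the same code for the CPU. The computation itself is illustrative.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);

    // y = 3*x + y, expressed as a standard parallel transform
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [](double xi, double yi) { return 3.0 * xi + yi; });

    // Parallel reduction over the result
    double sum = std::reduce(std::execution::par_unseq, y.begin(), y.end());
    std::printf("sum = %f\n", sum);
    return 0;
}
```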
31. When Is GRACE HOPPER the Better Choice?
Partially Ported Apps
• OpenFOAM – solver only (the bar is lower thanks to better price/perf)
Apps that bottleneck on PCI connectivity
• ABINIT example with pencil-shaped ZGEMM
• Large AI Training
Apps that can leverage tight cache coherence
• Data Assimilation step in weather models can stay on Grace
New-to-GPU Apps
• Can more effectively leverage standard language acceleration