8. Extensions to the CUDA Programming Model
Thread Block Clusters
Grid > Cluster > Block > Thread
Introduces a new level in the hierarchy: the cluster
Cooperation between thread blocks (within a cluster)
Scheduled simultaneously
Fast barriers (HW support)
Access to each other's shared memory (Distributed SMEM)
CUDA cooperative groups API (see the sketch after this list)
CUDA: 4 levels (Grid > Cluster > Block > Thread)
HW: 4 levels (GPU > GPC > SM > CUDA core)
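As a minimal sketch of how the cluster level is exposed through the CUDA cooperative groups API (assuming CUDA 12.x and an sm_90 target; the kernel and buffer names are illustrative), the following kernel launches 2-block clusters, uses the hardware-backed cluster barrier, and reads the peer block's shared memory via Distributed SMEM:

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Illustrative kernel: each block in a 2-block cluster publishes a value in its
// own shared memory, then reads the value published by the other block.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int *out)
{
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();          // this block's rank within the cluster

    if (threadIdx.x == 0) smem[0] = 100 + (int)rank;   // publish into our own SMEM
    cluster.sync();                                    // HW-supported cluster-wide barrier

    // Distributed SMEM: map the peer block's shared memory into our address space.
    int *peer = cluster.map_shared_rank(smem, rank ^ 1);
    if (threadIdx.x == 0) out[blockIdx.x] = peer[0];

    cluster.sync();                                    // keep SMEM alive until all peers have read it
}

int main()
{
    int *out;
    cudaMallocManaged(&out, 4 * sizeof(int));
    exchange_kernel<<<4, 32>>>(out);                   // 4 blocks -> 2 clusters of 2 blocks
    cudaDeviceSynchronize();
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    cudaFree(out);
    return 0;
}
```

Built with something like nvcc -arch=sm_90, this should print 101 100 101 100, each block having read the value written by its cluster peer.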
9. Strengthening Asynchronous Execution
End-to-end fully asynchronous execution
Larger scale and more hierarchy levels
Latency increases
Latency hiding becomes harder
More asynchronous execution
Overlap data transfers with compute (see the sketch after the figure below)
Efficient asynchronous-execution features
[Figure: Load/Compute/Store pipeline for chunks A, B, C, shown serialized and then overlapped by running async memory copies alongside compute]
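A minimal sketch of the overlapped pipeline in the figure above, assuming pinned host memory and one CUDA stream per chunk so the async copy of one chunk can proceed while another chunk is being computed (the chunk count, sizes, and the scale kernel are illustrative):

```cpp
#include <cuda_runtime.h>

// Illustrative compute stage: scale each element of a chunk in place.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int nChunks = 3;                  // chunks A, B, C from the figure
    const int chunk = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, nChunks * chunk * sizeof(float));   // pinned memory enables true async copies
    cudaMalloc(&d, nChunks * chunk * sizeof(float));

    cudaStream_t s[nChunks];
    for (int c = 0; c < nChunks; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < nChunks; ++c) {
        size_t off = (size_t)c * chunk;
        // Load (H2D), Compute, Store (D2H) for chunk c are ordered within stream c,
        // while different streams overlap with each other.
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < nChunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```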
13. THREAD BLOCK CLUSTER PERFORMANCE IMPACT
Support Vector Machine (SVM)
[Chart: Thread Block Cluster impact on SVM performance. SVM model training speedup: A100 1.0x, H100 1.3x, H100 with Thread Block Cluster 5.3x]
Locality with Distributed Shared Memory helps SVM performance
For details and more on H100 features: see GTC Sept 2022 sessions
• CUDA: New Features and Beyond [A41100]
• CUDA Programming Model for Hopper Architecture [A41095]
19. H100 Cluster Performance (Estimated)
[Chart: projected speedups of A100, H100, and H100 + NVLink Network clusters on Climate Modeling, Genomics, LQCD, and 3D-FFT]
All performance numbers are preliminary, based on current expectations, and subject to change in shipping products. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink Switch System where indicated. # GPUs: Climate Modeling 1K, Genomics 8, LQCD 1K, 3D-FFT 256.
21. HOPPER GPU + X86 CPU
• PCIe (128 GB/s): connects to the x86 processor; PCIe Gen 5; not cache coherent
• NVLink (900 GB/s): connects to other NVIDIA GPUs; programmed with NCCL, etc. (see the sketch after this list)
• HBM3: 3.35 TB/s
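As a minimal sketch of the "program with NCCL" path noted above, here is one common single-process pattern: an all-reduce across two GPUs, which NCCL routes over NVLink where it is available. The device count and buffer size are illustrative and error checking is omitted.

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    const int nDev = 2;                 // two local GPUs, e.g. connected by NVLink
    const int count = 1 << 20;
    int devs[nDev] = {0, 1};
    ncclComm_t comms[nDev];
    float *buf[nDev];
    cudaStream_t streams[nDev];

    ncclCommInitAll(comms, nDev, devs); // one communicator per local GPU
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // In-place sum across GPUs; the calls are grouped so NCCL can launch them together.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

Link with -lnccl; in multi-node jobs the same ncclAllReduce call is typically driven by one MPI rank per GPU rather than ncclCommInitAll.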
22. HOPPER GPU + GRACE (ARM CPU)
• NVLink-C2C (900 GB/s): connects to the Grace processor; can sustain full NVLink BW into large host memory; cache coherent
• PCIe (128 GB/s): connects to the x86 processor; PCIe Gen 5; not cache coherent
• NVLink (900 GB/s): connects to other NVIDIA GPUs; programmed with NCCL, etc.
• HBM3: 3.35 TB/s
23. GRACE HOPPER SUPERCHIP
GRACE HOPPER SUPERCHIP SKU
GPU: Hopper, 80 GB HBM3, 3.35 TB/s
CPU: ARMv9, 72 cores
CPU Mem: LPDDR5 512 GB, 546 GB/s (4x lower power than DDR5)
CPU to GPU: NVLink-C2C, 900 GB/s & cache coherent; 4x PCIe Gen5 x16
TDP: Max 1000 W
Schedule: 1H 2023
24. CPU Performance Comparison
x86 + Hopper vs. Grace Hopper
(x86 CPU + Hopper SXM vs. Grace Hopper Superchip)
• Host cores: 64 cores vs. 72 cores (1.1x)
• Host memory BW: 205 GB/s vs. 546 GB/s (2.7x)
• Host processor connectivity: 64 GB/s per direction (PCIe Gen5 x16) vs. 450 GB/s per direction (NVLink-C2C) (7x)
• Maximum GPU-to-network bandwidth: 50 GB/s per direction (1x PCIe Gen5 x16) vs. 200 GB/s per direction (4x PCIe Gen5 x16) (4x)
• CPU/GPU coherence: No vs. Yes
• Fully compatible with A100: Yes vs. Mostly
25. GRACE CPU Performance (Estimated) vs. x86 CPU
[Chart: projected Grace vs. x86 CPU performance on CFD, Ocean Modeling, and Homology Search]
26. GRACE HOPPER SUPERCHIP
GRACE HOPPER SUPERCHIP SKU
GPU: Hopper, 80 GB HBM3, 3.35 TB/s
CPU: ARMv9, 72 cores
CPU Mem: LPDDR5 512 GB, 546 GB/s (4x lower power than DDR5)
CPU to GPU: NVLink-C2C, 900 GB/s & cache coherent; 4x PCIe Gen5 x16
TDP: Max 1000 W
Schedule: 1H 2023
27. CPU/GPU MEMORY COHERENCY
X86 Hopper:
• CPU and GPU have separate page tables
• The CPU cannot access GPU memory; the GPU cannot access CPU memory
• Host memory: DDR4
Grace Hopper:
• NVLink-C2C + Address Translation Services (ATS)
• CPU and GPU share page tables
• The CPU and GPU can each access the other's memory (see the sketch after this list)
• A NUMA machine made up of the CPU and GPU (first touch)
• HW access counters, with migration between CPU and GPU
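A minimal sketch of what the shared page tables enable, assuming the device reports pageable memory access (as Grace Hopper does via ATS): the kernel dereferences ordinary malloc()'d host memory directly, with a cudaMallocManaged fallback for an x86 + Hopper system. Sizes and names are illustrative.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    int pageable = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);

    const int n = 1 << 20;
    int *data;
    if (pageable) {
        data = (int *)malloc(n * sizeof(int));       // plain system allocation, GPU-visible via ATS
    } else {
        cudaMallocManaged(&data, n * sizeof(int));   // fallback: managed (migratable) memory
    }
    for (int i = 0; i < n; ++i) data[i] = i;         // first touch on the CPU

    increment<<<(n + 255) / 256, 256>>>(data, n);    // GPU reads and writes the same pointer
    cudaDeviceSynchronize();

    printf("data[42] = %d\n", data[42]);             // CPU sees the GPU's update
    if (pageable) free(data); else cudaFree(data);
    return 0;
}
```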
28. GRACE HOPPER PROGRAMMING MODEL
• ISO C++
• ISO Fortran
• Python
• OpenACC
• OpenMP
• CUDA C++
• CUDA Fortran
Same as x86 + Hopper (a standard-parallelism sketch follows below)
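A minimal sketch of the ISO C++ entry in the list above: standard parallel algorithms with no CUDA in the source, which nvc++ -stdpar=gpu can offload to the GPU on either platform, while any conforming compiler builds the same code for the CPU. The computation itself is illustrative.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);

    // y = 3*x + y, expressed as a standard parallel transform
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [](double xi, double yi) { return 3.0 * xi + yi; });

    // Parallel reduction over the result
    double sum = std::reduce(std::execution::par_unseq, y.begin(), y.end());
    std::printf("sum = %f\n", sum);
    return 0;
}
```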
31. When Is GRACE HOPPER the Better Choice?
Partially Ported Apps
• OpenFOAM – solver only (the bar is lower thanks to better price/perf)
Apps that bottleneck on PCI connectivity
• ABINIT example with pencil-shaped ZGEMM
• Large AI Training
Apps that can leverage tight cache coherence
• Data Assimilation step in weather models can stay on Grace
New-to-GPU Apps
• Can more effectively leverage standard language acceleration