Magnum IO GPUDirect Storage: Latest Updates
2022/04/26
Kuninobu Sasaki (佐々木邦暢, @_ksasaki)
NVIDIA G.K. (エヌビディア合同会社)

Massive amounts of data
Source for Big Data Growth chart: IDC, The Digitization of the World (May 2020); storagenewsletter.com: 64.2ZB of Data Created or Replicated in 2020
Growth of scientific data
Fueled by accurate sensors and simulations
• COVID-19 Graph Analytics: 393TB
• ECMWF: 287TB/day
• SKA: 16TB/sec
• NASA Mars Landing Simulation: 550TB
Continued growth of big data
90% of the world’s data in the last 2 years
[Chart: big data growth, 2010 to 2025; labeled values: 58 zettabytes, 64 zettabytes, 175 zettabytes]

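(A slightly older story) SC18 Gordon Bell award winner
Applying deep learning to climate modeling
• Successfully modeled the relationship between a vast body of past observational data and extreme weather events with semantic segmentation networks (modified versions of Tiramisu and DeepLabv3+)
• Scaled up to 27,360 x V100 on ORNL Summit and 5,300 x P100 on CSCS Piz Daint
T. Kurth, et al., “Exascale Deep Learning for Climate Analytics”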

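GPFS bandwidth is not enough
The Tiramisu-based network used here reads 189MB/s of data on a single GPU.
Summit has 6 GPUs per node, so each node needs 1.14GB/s of throughput.
Running on 1024 nodes requires a sustained 1.16TB/s, and the full Summit system would need 5.23TB/s, which is more than twice the GPFS target performance. A naive way of distributing the data could not keep up, so the authors built a distributed data staging system.
Data is brought from GPFS to each node's local disk. On average, the same file is needed by 23 nodes. In this implementation, the dataset is partitioned in advance and read in parallel with 8 threads. When the same file is needed on multiple nodes, it is distributed between nodes with MPI.
Fast parallel file staging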
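The bandwidth figures above follow from simple multiplication, which the sketch below reproduces. The per-GPU rate and GPUs per node come from the slide; the full-system node count of 4,608 is an assumption taken from Summit's published configuration, not from the slide. The per-node product comes out at roughly 1.13GB/s, slightly below the 1.14GB/s quoted above.

```python
# Figures quoted on the slide
MB_PER_S_PER_GPU = 189
GPUS_PER_NODE = 6

# Assumption: Summit's published node count (not stated on the slide)
SUMMIT_NODES = 4608


def required_gb_per_s_per_node() -> float:
    """Read bandwidth one node needs, in GB/s (decimal units)."""
    return GPUS_PER_NODE * MB_PER_S_PER_GPU / 1_000


def required_tb_per_s(nodes: int) -> float:
    """Aggregate read bandwidth for `nodes` nodes, in TB/s (decimal units)."""
    return nodes * GPUS_PER_NODE * MB_PER_S_PER_GPU / 1_000_000


if __name__ == "__main__":
    print(f"per node:    {required_gb_per_s_per_node():.3f} GB/s")
    print(f"1024 nodes:  {required_tb_per_s(1024):.2f} TB/s")
    print(f"full system: {required_tb_per_s(SUMMIT_NODES):.2f} TB/s")
```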

GPUDIRECT STORAGE GA
AVAILABLE WITH CUDA 11.4
6.6x speedup for deep learning inference
Low latency, high bandwidth
Library for accelerating multi-node storage
Strong ecosystem support
Supported on DGX and other servers
OS: DGX BASEOS 5.0, Ubuntu 18.04, 20.04, RHEL 8.3
DL frameworks: PyTorch, MXNet
Data analytics tools: DALI, RAPIDS cuDF
Integrated into CUDA 11.4

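GPUDirect Storage (GDS)
Accelerates data transfer between GPUs and storage devices
▪ Skips the CPU-side bounce buffer (a needless copy)
▪ Works with local and remote storage (with or without a PCIe switch)
▪ Used through the CUDA cuFile API
▪ No special hardware required
▪ Benefits of GDS
  ▪ Higher bandwidth, especially when the storage and the GPU are attached to the same PCIe switch
  ▪ Lower latency, from avoiding extra copies and from dynamic routing that optimizes the path
  ▪ Especially effective when there are many small I/Os
  ▪ Less jitter than approaches triggered by page faults (mmap(2))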

GPUDirect Storage
Software architecture
▪ cuFile library
  Provides APIs for applications and frameworks
▪ nvidia-fs kernel module
  Kernel module that controls I/O between DMA/RDMA-capable storage devices and GPUs
  https://github.com/NVIDIA/gds-nvidia-fs
▪ NVIDIA is working with the community to enable Linux to handle GPU virtual addresses for DMA.
[Diagram: GDS software stack. Labels: Application on CPU; cuFile API, libcufile.so; CUDA; Virtual File System; Filesystem Driver; Proprietary Distributed File System; Block IO Driver; Storage Driver; nvidia-fs.ko; Storage DMA Engine; GPU Memory. Caption: Storage DMA Programmed with GPU BAR1 Address. Legend: application; OS kernel or third-party drivers.]

Benefits of GPUDirect Storage
High bandwidth, low latency, reduced CPU utilization

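cuFile library
Features
▪ User-space library that handles I/O to GDS-enabled DMA/RDMA file systems
▪ Also supports file operations whose offset or I/O size is not aligned to the page boundary (4KB)
  ▪ Controllable with “posix_unaligned_writes”.
▪ Batch API supported as of 11.6 (on file systems that support asynchronous operations (aio))
▪ Avoids cudaMemcpy calls and skips the intermediate copy to the page cache
▪ RDMA connection registration and management in user space
▪ Dynamic routing based on the hardware topology and GPU resources
▪ Compatibility mode for development use
Requirements
▪ Files must be opened with the O_DIRECT flag so that the page cache and buffering are bypassed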
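To make the alignment point concrete, the sketch below shows what "aligned to the 4KB page boundary" means for an offset and I/O size, and how an unaligned request can be thought of as an aligned body plus an unaligned head and tail. This is only an illustration of the concept; it does not describe how libcufile handles unaligned I/O internally, and the function names are hypothetical.

```python
PAGE_SIZE = 4096  # the 4KB page boundary quoted on the slide


def is_aligned(offset: int, size: int, page: int = PAGE_SIZE) -> bool:
    """True when both the file offset and the I/O size are multiples of the page size."""
    return offset % page == 0 and size % page == 0


def split_unaligned(offset: int, size: int, page: int = PAGE_SIZE):
    """Split [offset, offset + size) into (head, body, tail) byte ranges.

    Only `body` starts and ends on page boundaries; `head` and `tail`
    hold the unaligned remainders and may be empty.
    """
    end = offset + size
    body_start = min(end, -(-offset // page) * page)  # round offset up to a page
    body_end = max(body_start, (end // page) * page)  # round end down to a page
    return (offset, body_start), (body_start, body_end), (body_end, end)


if __name__ == "__main__":
    print(is_aligned(8192, 4096))
    print(split_unaligned(100, 10000))
```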

cuFile API usage example
CUfileDescr_t cf_descr;
CUfileHandle_t cf_handle;
memset(&cf_descr, 0, sizeof(cf_descr));
status = cuFileDriverOpen(); // initialize
fd = open(TESTFILE, O_RDONLY | O_DIRECT); // interop with normal file IO
cf_descr.handle.fd = fd;
cf_descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD; // the handle is a POSIX file descriptor
status = cuFileHandleRegister(&cf_handle, &cf_descr); // check support for file at this mount
cuda_result = cudaMalloc(&devPtr, size); // user allocates memory
status = cuFileBufRegister(devPtr, size, 0); // initialize and register the buffer (flags = 0)
ret = cuFileRead(cf_handle, devPtr, size, 0, 0); // ~pread: file handle, GPU base address, size,
// offset in file, offset in GPU buffer
status = cuFileBufDeregister(devPtr); // cleanup
cuFileHandleDeregister(cf_handle);
close(fd);
cuFileDriverClose();
/* Launch CUDA kernels */

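Smart path selection
• Smart route selection
  GPU A and GPU B are connected in two ways: via PCIe and via NVLink.
  GDS automatically selects the faster path.
• Staging through an intermediate buffer
  In the NIC -> GPU A -> GPU B example in the figure, an intermediate buffer is allocated on GPU A, and the DMA engine of either GPU A or GPU B handles the data transfer from GPU A to GPU B.
[Diagram labels: CPU A, CPU B; two PCIe switches; GPU A, GPU B; NIC A, NIC B; Buf (intermediate buffer); NVLink]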

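Compatibility mode
No degradation relative to non-GDS at block sizes of 128KB and above
▪ The cuFile APIs can also be used in cases such as:
  ▪ The nvidia-fs.ko kernel module is not installed
  ▪ The file system does not support GDS
  ▪ File-system-specific conditions apply, such as O_DIRECT not being usable
  ▪ Development work in an environment without root privileges
▪ Configurable by an administrator or user in config.json
▪ Benefits:
  ▪ GDS applications can be developed even in environments that do not support GDS
  ▪ GDS can be used in containers, which may run in a wide variety of environments
▪ Comparison of cuFile in compatibility mode with the POSIX API (the familiar read/write)
  ▪ Performance is nearly identical. Compatibility mode has a slight overhead for I/O with small block sizes (measured on local NVMe in DGX-2 and DGX A100)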

Hardware and software requirements
Supported HW
● GPU
○ Data Center and Quadro (desktop) cards with compute capability >= 6.0
● NIC
○ ConnectX-5 and ConnectX-6
● CPU
○ x86-64
● NVMe
○ Version 1.1 or greater
● Systems
○ DGX, EGX
● Computational Storage
○ ScaleFlux CSD
SW Requirement
● GPU driver 418.x onwards
● CUDA version 10.1 or above
● MOFED
○ Preferred 5.4 or 5.3 (DKMS support)
● DGX BaseOS 5.0.2 or later
● Linux Kernel
○ 4.15 or later
○ NO kernel > 5.4
○ Note: No support for 5.9 or later
● Linux Distros
○ GDS mode
■ Ubuntu 18.04 (4.15.x), 20.04 (5.4.x)
■ RHEL (>=8.x)
○ compat mode only
■ RHEL 7.9, Debian 10, CentOS 8.3, SLES 15.2, OpenSUSE (15.2)

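Current limitations
▪ GDS mode is supported only on data center and workstation (Quadro/RTX) GPUs with Compute Capability 6.0 (Pascal) or later. Other GPUs run in compatibility mode.
▪ Runs on Linux only. Windows is not supported.
▪ Support in virtual machines is limited.
  ▪ Only time-sliced vGPU on the NVIDIA Ampere architecture is supported (MIG-backed vGPU is not supported)
  ▪ A40, A16, A10, A2, RTX A6000, RTX A5000
  ▪ See the linked documentation for details
▪ POWER and Arm architectures are not supported.
▪ Works only on file systems that strictly support O_DIRECT. File systems that cannot perform I/O with strict O_DIRECT semantics (for example because of compression, checksums, inline I/O, or buffering to the page cache) are not supported.
▪ Network adapters other than NVIDIA (Mellanox) are not supported.
▪ IOMMU is only partially supported, and only on DGX.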

Benchmark results

GDS THROUGHPUT WITH IBM Spectrum Scale
1X ESS 3200 AND 2X NVIDIA DGX A100
▪ GPUDirect Storage removes the system bottlenecks to deliver almost full wire speed
▪ Typical improvements are 2x when the storage and network can support the throughput
▪ A single ESS 3200 delivering 77GB/s (71 GiB/s) over the storage fabric to two DGX A100s, 16 GPUs (4x HDR)
IBM Lab Testing: GDS_DirectIO read measurements 1M I/O: One ESS 3200 running IBM Spectrum Scale 5.1.1, with four InfiniBand HDR to two NVIDIA DGX A100 servers using storage NICS
© 2021 IBM Corporation

GDS read latency
1 ESS 3200: 4 x HDR links
2 DGX A100: 4 x HDR links (storage links)
Benchmark: NVIDIA ‘gdsio’ benchmark with 1M I/O size and 2 threads per GPU
GDS cuts latency roughly in half

BeeGFS and GDS
Test config: NetApp EF600, 2 DGX A100 + 200Gb IB

Tornado visualization
Fast data streaming with GPUDirect Storage
• 10 fps achieved on DGX A100 with 8 Samsung P1733 NVMe drives
• 47 GB/s IO bandwidth
• See also: SC demo on Tornado visualization
Accelerating IO to GPUs, SC’21 BoF139, 11/16/21

DALI: accelerating data loading and preprocessing
https://developer.nvidia.com/dali

DeepCAM inference
DALI 1.3: DeepCAM is a component of MLPerf HPC
[Chart: effective bandwidth [GB/s] (axis 0 to 40) versus batch size (8, 16, 32); series: NumPy (baseline), DALI + GDS (compat), DALI + GDS]
Performance benchmarking done with PyTorch DeepCAM using standard GDS configuration in DGX A100, Ubuntu 20.04, MLNX_OFED 5.3, GDS 1.0, DALI 1.3, EXAScaler 5.1.1, 2.12.3_ddn29.
Application for batch size >= 32 limited by GPU compute throughput

Training with PyTorch + DALI + GDS
~1.17x gain for a single node

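DALI and GDS
▪ DALI's NumPy reader supports GDS
from nvidia.dali import pipeline_def, fn

# batch_size, data_dir, and files are assumed to be defined by the caller
@pipeline_def(batch_size=batch_size, num_threads=3, device_id=0)
def pipe_gds():
    # device='gpu' selects the GPU reader, which is the one that uses GDS
    data = fn.readers.numpy(device='gpu', file_root=data_dir, files=files)
    return data

p = pipe_gds()
p.build()
pipe_out = p.run()
data_gds = pipe_out[0].as_cpu().as_array() # as_cpu() to copy the data back to CPU memory
print(data_gds.shape)
plot_batch(data_gds) # plotting helper assumed to be defined elsewhere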

What is RAPIDS
Accelerating data science on GPUs
Data preparation / ETL
cuDF
➢ GPU-accelerated ETL functions
➢ Tracks Pandas and other common PyData APIs
➢ Dask + UCX integration for scaling
Analytics / machine learning
RAPIDS ML
➢ GPU-native cuML library, plus XGBoost, FIL, HPO, and more
cuGraph
➢ GPU graph analytics, including TSP, PageRank, and more
Visualization
cuxfilter
➢ GPU-accelerated cross-filtering
pyViz integration
➢ Plotly Dash, Bokeh, Datashader, HoloViews, hvPlot
Domain-Specific Libraries
CLX + Morpheus: Cyber log processing + anomaly detection
cuSignal: Signals processing
cuSpatial: Spatial analytics
cuStreamz: Streaming analytics
cuCIM: Computer vision & image processing primitives
node-RAPIDS: Bindings for node.js
...and more!

cuDF
Apache Arrow data format
Pandas-like API
• Unary and Binary Operations
• Joins, Merges
• GroupBys
• Filters
• File readers
• User Defined Functions
• Etc.
GPU DataFrame library

RAPIDS cuDF and GDS
GDS is also used by cuCIM as part of the Clara Health pipeline
3TB scale
3x DGX A100s, 24 workers
dataset stored in parquet format
1 DDN Lustre filer
https://github.com/rapidsai/gpu-bdb

cuDF and GDS
▪ cuDF uses GPUDirect Storage by default.
▪ Controllable through the LIBCUDF_CUFILE_POLICY environment variable.
  ▪ “GDS”: Enable GDS use; GDS compatibility mode is off.
  ▪ “ALWAYS”: Enable GDS use; GDS compatibility mode is on.
  ▪ “OFF”: Completely disable GDS use.
▪ Operations for which GPUDirect Storage is enabled
  ▪ read_avro
  ▪ read_parquet
  ▪ read_orc
  ▪ to_csv
  ▪ to_parquet
  ▪ to_orc
▪ Tunable parameters
  ▪ LIBCUDF_CUFILE_THREAD_COUNT: maximum number of parallel read/write operations per file (default: 16)
  ▪ LIBCUDF_CUFILE_SLICE_SIZE: maximum size of a single GDS read/write operation (in bytes; default: 4MB). I/Os larger than this are split into multiple operations.
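As a rough illustration of how these settings might be applied from Python, the sketch below builds the three environment variables named above and estimates how many slices a large I/O is split into. The helper names are hypothetical, the 4MB default is assumed to mean 4 MiB, and setting the variables before cuDF is imported is a cautious assumption about when libcudf reads them.

```python
import os

VALID_POLICIES = ("GDS", "ALWAYS", "OFF")  # values listed on the slide
DEFAULT_SLICE_SIZE = 4 * 1024 * 1024       # assumption: "4MB" means 4 MiB


def cufile_env(policy: str = "GDS", thread_count: int = 16,
               slice_size: int = DEFAULT_SLICE_SIZE) -> dict:
    """Build the environment variables that control cuDF's use of GDS."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"policy must be one of {VALID_POLICIES}")
    return {
        "LIBCUDF_CUFILE_POLICY": policy,
        "LIBCUDF_CUFILE_THREAD_COUNT": str(thread_count),
        "LIBCUDF_CUFILE_SLICE_SIZE": str(slice_size),
    }


def slice_count(io_size: int, slice_size: int = DEFAULT_SLICE_SIZE) -> int:
    """Number of GDS operations an I/O of io_size bytes is split into."""
    return -(-io_size // slice_size)  # ceiling division


if __name__ == "__main__":
    os.environ.update(cufile_env("ALWAYS", thread_count=8))
    # import cudf                                  # import after the variables are set
    # df = cudf.read_parquet("example.parquet")    # hypothetical file name
    print(slice_count(10 * 1024 * 1024))
```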

SPARK 3.0
▪ RAPIDS Accelerator for Apache Spark is a plugin for accelerating Apache Spark workloads on NVIDIA GPUs
▪ 80% of Fortune 500 use Apache Spark in production
▪ Growing adoption of Spark 3.0
▪ Transparently accelerate Spark dataframe and SQL operations
▪ 3X performance speedup*
3X Faster at 50% the Cost with No Code Changes
▪ Powering Adobe’s Intelligent Services platform optimizing cost savings to 66% of current CPU systems.
▪ Reduce runtime of 55TB advertising pipeline with over 2.8T records to 5 hours
▪ Accelerating production fraud detection pipeline by 20x at 50% the cost.
* based on the NDS 3TB 105 query benchmark.

RAPIDS ACCELERATOR FOR Apache SPARK (PLUGIN)
[Diagram labels: distributed Spark application; Spark SQL API, DataFrame API, Spark Shuffle; Apache Spark core; RAPIDS Accelerator for Spark; JNI bindings (mapping from Java/Scala to C++); RAPIDS libcudf (C++ libraries); UCX libraries; CUDA]
Spark SQL API / DataFrame API:
if gpu_enabled(operation, data_type)
    call-out to RAPIDS
else
    execute standard Spark operation
Spark Shuffle:
● Custom shuffle implementation
● Optimized for RDMA and GPUDirect
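The if/else in the diagram describes a per-operation fallback: use the GPU path when the operation and data type are supported, otherwise run the standard Spark operation. A minimal Python sketch of that dispatch pattern follows. The support table and function names are hypothetical and only illustrate the idea; they are not the plugin's actual API.

```python
# Hypothetical support table; the real plugin decides per operator and data type.
GPU_SUPPORTED = {
    ("filter", "int64"),
    ("join", "int64"),
    ("groupby", "float64"),
}


def gpu_enabled(operation: str, data_type: str) -> bool:
    """True when this (operation, data type) pair has a GPU implementation."""
    return (operation, data_type) in GPU_SUPPORTED


def execute(operation, data_type, gpu_impl, cpu_impl, *args):
    """Run the GPU implementation when supported, else fall back to the CPU one."""
    if gpu_enabled(operation, data_type):
        return gpu_impl(*args)
    return cpu_impl(*args)


if __name__ == "__main__":
    on_gpu = lambda rows: ("gpu", len(rows))
    on_cpu = lambda rows: ("cpu", len(rows))
    print(execute("filter", "int64", on_gpu, on_cpu, [1, 2, 3]))
    print(execute("filter", "string", on_gpu, on_cpu, ["a", "b"]))
```

Because the fallback is decided per operation, unsupported operations keep working without code changes, which is the "transparent" behavior the slides describe.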

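PG-Strom's GPU-Direct SQL: also for high-volume IoT/M2M data processing
Giving shape to interesting technology.
▌“GPUs are compute accelerators, so they are no use for I/O-centric workloads, right?”
  I used to think so too...
▌Take IoT/M2M log data processing, for example
  Large volumes of data pile up moment by moment. They easily exceed terabytes, and both moving them to the cloud and importing them into a database are painful.
  What lets you run familiar SQL in place, at tens of GB/s, on the data generated by the log collection servers is...
➔ PG-Strom's GPU-Direct SQL, implemented with GPUDirect Storage
[Diagram labels: IoT/M2M devices (Manufacturing, Home electronics, Mobile); log collection servers; RoCE-capable high-speed network; DB/GPU server with GPU-Direct SQL (WHERE, JOIN, GROUP BY, PostGIS, JSONB); DB administrators/users; BI tools (visualization); AI/ML (anomaly detection, etc.)]
* For small systems (up to tens of TB), a single server with local PCIe-attached NVMe SSDs can handle both log collection and data analysis.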

[Same diagram as the previous slide, annotated with the data flow and the SQL processing (WHERE, JOIN, GROUP BY)]
The conventional assumption behind SQL processing
Compared with CPU and RAM, storage is orders of magnitude slower, which makes it the bottleneck.
Wider adoption and falling prices of NVMe SSDs
Data can now be read from storage at tens of GB/s.

The Intelligent Storage Concept:
From the CPU's point of view, the storage itself behaves as if it interprets SQL, preprocesses the data, and returns only what is needed.
Pushing the hardware to its limits
By running SQL preprocessing (WHERE, JOIN, GROUP BY) with the parallel processing that GPUs excel at, a processing throughput of about 20GB/s per GPU is achieved.
GPU-Direct SQL
Midway along the data transfer path, the GPU discards unneeded data, dramatically reducing the amount of data the CPU has to process.
Storage performance during an aggregation query (with and without GPU-Direct SQL)
[Chart: total storage read throughput [MB/s] (axis 0 to 20,000) versus elapsed time [sec] (0 to 400), broken down by nvme0 to nvme3; runs: query execution with GPU-Direct SQL, and query execution with filesystem on PostgreSQL heap tables]
CPU: AMD EPYC 7402P, RAM: 128GB, GPU: NVIDIA A100 [40GB; PCI-E] x1, SSD: Intel D7-P5510 (3.84TB; U.2) x4
With GPU-Direct SQL: data is read at close to the hardware specification while the query executes at the same time.
With the filesystem on PostgreSQL heap tables: less data is processed per unit time, and as a result the response time of the whole query is worse.

Frameworks and development tools
Frameworks, apps
● Visualization, e.g. IndeX
● Health, e.g. cuCIM in Clara
● Data analytics, e.g. RAPIDS, cuCIM, Spark, NVTabular
● HPC, e.g. molecular modeling, genomics
● DL, e.g. PyTorch, TensorFlow, MXNet (via DALI)
● Databases, e.g. HeteroDB
Readers
● RAPIDS cuDF, DALI
● HDF5 Serial, LBNL (repo); passed HDF5 regression suite, IOR layered on HDF5
● OMPIO, U Houston; early PHDF5 functionality
● MPI-IO, engaging with ANL, others
● ADIOS, ORNL
● Zarr, prototyped with Pangeo community
Key (color legend on the slide): NVIDIA functional; NVIDIA WIP/planned; Functional; WIP/planned

Storage partners

GDS file system partners
Partner Company | Partner Product | GDS version | Date
NetApp | ONTAP 9.10.1 | 1.0 and higher | TBA
NetApp, ThinkParQ, System Fabric Works | BeeGFS Tech Preview | 1.1 and higher | TBA
IBM | Spectrum Scale 5.1.2 | 1.1 and higher | Nov 2021
DDN | EXAScaler 6.0* | 1.1 and higher | Nov 2021
VAST | Universal Storage 4.1 | 1.1 and higher | Nov 2021
WekaIO | WekaFS 3.13 | 1.0 | June 2021
DellEMC | PowerScale 9.2.0.0 | 1.0 | Oct 2021
Hitachi Vantara | HCSF | 1.0 | Oct 2021
● Open source Lustre 2.15 can be used to deploy GDS acceleration*

Getting started with GPUDirect Storage

NVIDIA Magnum IO repository
https://github.com/NVIDIA/MagnumIO

A development container is available on NGC
nvcr.io/nvidia/magnum-io/magnum-io:21.07
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/magnum-io/containers/magnum-io

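gdscheck tool
/usr/local/cuda-x.y/gds/tools/gdscheck
A command that checks and reports the information needed to use GDS
▪ Versions of cuFile and nvidia_fs
▪ GDS support status of storage devices
▪ cuFile configuration
▪ GPU information
  ▪ “supports GDS” is shown for supported GPUs
▪ PCIe switch information
  ▪ Disabling ACS is recommended
▪ IOMMU information
  ▪ Must be disabled on systems other than DGX
▪ “Platform verification succeeded” does not necessarily mean that GDS runs natively in the environment. It is also shown for environments that run in compatibility mode.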
$ /usr/local/cuda/gds/tools/gdscheck -p
GDS release version: 1.0.2.10
nvidia_fs version: 2.3 libcufile version: 2.4
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA GeForce RTX 3080 bar:1 bar size (MiB):256
GPU index 1 NVIDIA GeForce RTX 3070 bar:1 bar size (MiB):256
==============
PLATFORM INFO:
==============
Found ACS enabled for switch 0000:00:03.1
Found ACS enabled for switch 0000:00:03.2
IOMMU: disabled
Platform verification succeeded
=========
GPU INFO:
=========
GPU index 0 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS

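Using GPUDirect Storage on DGX
Reconfiguring RAID volumes to match the PCIe topology
• To use local NVMe drives with GDS, each GPU needs to access drives attached to the same PCIe switch
• In the default system configuration, the local NVMe drives form a single RAID 0 volume that spans all four switches
• To optimize performance, inspect how the PCI devices are connected and identify the GPUs and NVMe drives attached to the same switch
• Create four RAID 0 volumes with the mdadm command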

Using GPUDirect Storage on DGX
File system settings required to enable GDS
▪ GDS works with NVMe drives formatted as ext4 or XFS; with RAID, only RAID 0 is supported
▪ To use direct I/O with the O_DIRECT flag, the data=ordered mount option is required
▪ If these conditions are not met, GDS runs in compatibility mode (that is, calling the cuFile API actually performs ordinary read/write)
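One way to sanity-check the file system conditions above is to look at a mount entry in /proc/mounts format. The sketch below does only that: it checks the file system type and, for ext4, the data= option. It is a rough illustration with a hypothetical function name; it assumes an ext4 mount with no explicit data= option uses the default ordered mode, and it does not check the RAID level or whether the device is NVMe.

```python
SUPPORTED_FS = {"ext4", "xfs"}  # file systems named on the slide


def mount_meets_gds_conditions(mounts_line: str) -> bool:
    """Check one /proc/mounts-style line: device mountpoint fstype options dump pass."""
    fields = mounts_line.split()
    if len(fields) < 4:
        return False
    fstype, options = fields[2], fields[3].split(",")
    if fstype not in SUPPORTED_FS:
        return False
    if fstype == "ext4":
        data_modes = [opt for opt in options if opt.startswith("data=")]
        # Assumption: no explicit data= option means the ext4 default, ordered.
        return all(opt == "data=ordered" for opt in data_modes)
    return True


if __name__ == "__main__":
    # The device and mount point names below are made up for illustration.
    print(mount_meets_gds_conditions("/dev/md0 /raid ext4 rw,relatime,data=ordered 0 0"))
    print(mount_meets_gds_conditions("/dev/md1 /data ext4 rw,data=writeback 0 0"))
```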

DGX A100 networking and GDS
Lower Latencies and Increased CPU Efficiency
Standard GDS configuration on DGX A100
North-south storage configuration: adapters 4 and 5 used for GDS (2x 200Gb/s)
High-throughput GDS configuration
Converged east-west configuration: adapters 0-3, 6-9 used for GDS (8x 200Gb/s)
* Please contact us for details
[Diagrams, one per configuration. Labels: System Memory; CPU; PCIe Switch; GPU; NICs; Storage]

GPUDirect Storage in container environments
▪ Using GDS in a container requires additional setup
  ▪ The nvidia-fs.ko kernel module
  ▪ The /dev/nvidia-fs* device files
  ▪ The --volume /run/udev option, for block device detection
  ▪ In addition, --ipc=host is required
▪ The gds-run-container script in the MagnumIO repository makes this easy
https://github.com/NVIDIA/MagnumIO/blob/main/gds/docker/README.md
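To show how the options above fit together, the sketch below assembles a docker run command line that passes the /dev/nvidia-fs* device files, mounts /run/udev, and sets --ipc=host. It only illustrates the flags listed on the slide and is not a reproduction of the gds-run-container script. The function name, the read-only udev mount, and the use of --gpus all are assumptions.

```python
import glob
import shlex


def gds_docker_command(image, command=("bash",), device_glob="/dev/nvidia-fs*"):
    """Build a `docker run` argument list with the GDS-related options."""
    argv = [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                       # assumption: NVIDIA Container Toolkit is installed
        "--ipc=host",
        "--volume", "/run/udev:/run/udev:ro",  # assumption: read-only is sufficient
    ]
    for dev in sorted(glob.glob(device_glob)):
        argv += ["--device", dev]              # pass each nvidia-fs device file through
    argv.append(image)
    argv += list(command)
    return argv


if __name__ == "__main__":
    cmd = gds_docker_command("nvcr.io/nvidia/magnum-io/magnum-io:21.07")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The nvidia-fs.ko module itself has to be loaded on the host; a container cannot supply it.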

Related information
▪ GTC22.03: Accelerating Storage IO to GPUs with Magnum IO
https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41347/
▪ NVIDIA GPUDirect Storage
https://docs.nvidia.com/gpudirect-storage/index.html
▪ NVIDIA Magnum IO Developer Environment
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/magnum-io/containers/magnum-io
▪ NVIDIA Magnum IO GPUDirect Storage Overview Guide
https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html
▪ NVIDIA GPUDirect Storage Installation and Troubleshooting Guide
https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html
▪ GDS cuFile API Reference
https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html
▪ GPUDirect Storage: A Direct Path Between Storage and GPU Memory
https://developer.nvidia.com/blog/gpudirect-storage/
▪ Accelerating IO in the Modern Data Center: Magnum IO Storage Partnerships
https://developer.nvidia.com/blog/accelerating-io-in-the-modern-data-center-magnum-io-storage-partnerships/
