17. Multi-scale approach: Macroscopic Maxwell + Microscopic TDDFT
At each macroscopic grid point, we solve real-time electron dynamics in parallel.
[Figure: a macroscopic grid (mm scale) for the Maxwell equation, coupled at each grid point to a microscopic grid (sub-nm scale) for RT-TDDFT electron dynamics]
K. Yabana et al., Phys. Rev. B 85, 045134 (2012).
20. Stencil computation
• Hamiltonian applied to the wave functions
• 4th-order orthogonal finite difference (25-point stencil)
• Periodic boundary conditions + double-precision complex numbers
• Most of the arithmetic reduces to double-precision real operations
• Parallelized over wave-number (k) space
• Each wave-function update applies the stencil four times
#pragma omp parallel for collapse(2)
for ik = [1,Nk] ; for ib = [1,Nb]
  l_domain[:,:,:,0] = g_domain[1:Nz,1:Ny,1:Nx,ib,ik]
  /* single-thread computation */
  for s = [1,4]
    /* 25-point stencil */
    for ix = [1,Nx] ; for iy = [1,Ny] ; for iz = [1,Nz]
      l_domain[iz,iy,ix,s] = stencil(l_domain[...,s-1])
  /* pseudo-potential (omitted) */
  /* update */
  g_domain[:,:,:,ib,ik] += l_domain[:,:,:,1:4] * c
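To make the stencil structure concrete, here is a minimal C sketch of one axis of the 25-point stencil (4 neighbor points on each side, with periodic boundaries). The names and sizes (stencil_z, Cz, NZ) are assumptions for illustration, not the application's actual kernel; the real code sums the contributions of all three axes.

#include <complex.h>

enum { NZ = 64 };  /* assumed grid extent along the unit-stride dimension */

/* One axis of the 25-point stencil: 4 neighbor points on each side of the
   center, with periodic boundary conditions. Cz[1..4] holds the
   finite-difference coefficients. */
void stencil_z(const double complex u[NZ], double complex du[NZ],
               const double Cz[5])
{
    for (int iz = 0; iz < NZ; iz++) {
        double complex v = 0.0;
        for (int dt = 1; dt <= 4; dt++) {
            int zp = (iz + dt) % NZ;       /* periodic wrap toward +z */
            int zm = (iz - dt + NZ) % NZ;  /* periodic wrap toward -z */
            v += Cz[dt] * (u[zp] + u[zm]); /* symmetric (even) part */
        }
        du[iz] = v;
    }
}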
21. Optimization for wide SIMD
Single processor performance: KNC vs KNL
real(8),   intent(in)  :: B(0:NLz-1,0:NLy-1,0:NLx-1)
complex(8),intent(in)  :: E(0:NLz-1,0:NLy-1,0:NLx-1)
complex(8),intent(out) :: F(0:NLz-1,0:NLy-1,0:NLx-1)
#define IDX(dt) iz,iy,modx(ix+(dt)+NLx)
#define IDY(dt) iz,mody(iy+(dt)+NLy),ix
#define IDZ(dt) modz(iz+(dt)+NLz),iy,ix
do ix=0,NLx-1
do iy=0,NLy-1
!dir$ vector nontemporal(F)
do iz=0,NLz-1
  v=0; w=0
  ! z-computation
  v=v+Cz(1)*(E(IDZ(1))+E(IDZ(-1))) ...
  w=w+Dz(1)*(E(IDZ(1))-E(IDZ(-1))) ...
  ! y-computation
  ! x-computation
  F(iz,iy,ix) = B(iz,iy,ix)*E(iz,iy,ix) &
  &           + A*E(iz,iy,ix)           &
  &           - 0.5d0*v - zI*w
end do
end do
end do
• Indices are computed on every access through a precomputed remainder table, because the remainder operation itself is very slow (see the sketch below)
• Non-temporal stores are used (Intel compiler only)
• The calculation order is chosen to walk contiguous addresses in memory
• [C language] Explicit (hand-coded) vectorization with 512-bit SIMD on KNC
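The remainder table mentioned above can be sketched in a few lines of C (a hypothetical illustration; modx and NLX follow the names on this slide). The slow remainder instruction is paid once at setup, and the hot loop performs a table lookup instead:

enum { NLX = 64 };        /* assumed grid extent along x */
static int modx[3 * NLX]; /* covers ix + dt + NLX for any |dt| <= NLX */

/* Build once: modx[i] == i % NLX. */
void build_remainder_table(void)
{
    for (int i = 0; i < 3 * NLX; i++)
        modx[i] = i % NLX;
}
/* Hot-loop usage: modx[ix + dt + NLX] replaces (ix + dt) % NLX. */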
22. Explicit vectorization with SIMD instructions
• The SIMD stencil code is written in C
• The application is mainly written in Fortran90
• Our implementation updates four grid points at a time
• The size of the contiguous dimension is set to a multiple of four
• We apply three optimizations (optimization 1 is sketched after this list)
1. Complex-value multiplication
• One of the operands is a constant value during the computation
2. Vectorized index calculation
• Each dimension has eight neighbor points
• The non-unit-stride (Y and X) dimensions require 16 indices
⇒ 512-bit integer SIMD fits this size exactly (512 bit / 32 bit = 16)
3. Contiguous memory access (Z dimension)
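Optimization 1 can be illustrated with a short AVX-512 sketch (assumed names, not the application's kernel): because the multiplier c + d*i is constant, its real and imaginary parts are broadcast once, and each packed complex multiply then needs only a permute, a multiply, and one fmaddsub.

#include <immintrin.h>

/* Multiply four packed complex doubles (re/im interleaved in z)
   by the constant complex scalar c + d*i. */
static inline __m512d cmul_const(__m512d z, double c, double d)
{
    __m512d cr = _mm512_set1_pd(c);          /* broadcast real part */
    __m512d ci = _mm512_set1_pd(d);          /* broadcast imag part */
    __m512d zs = _mm512_permute_pd(z, 0x55); /* swap re/im in each pair */
    __m512d t  = _mm512_mul_pd(zs, ci);      /* [im*d, re*d, ...] */
    /* even lanes: re*c - im*d (real part); odd lanes: im*c + re*d (imag) */
    return _mm512_fmaddsub_pd(z, cr, t);
}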
23. For AVX-512 processors
• We need to convert the KNC SIMD code to AVX-512
• Preprocessor directives can absorb the minor differences between the instruction sets
• Our code requires only the common AVX-512F subset (F: Foundation)
• Therefore our implementation applies to all AVX-512 processor families
• Performance evaluation
• Stencil computation performance on a Xeon Gold 6148 (Skylake-SP)
• The implementation is the same on both KNL and Skylake-SP
#ifdef __AVX512F__
/* for AVX-512 processors */
#define _mm512_loadu_epi32    _mm512_loadu_si512
#define _mm512_storenrngo_pd  _mm512_stream_pd
#elif __MIC__
/* for KNC: emulate an unaligned load with the unpack-lo/hi pair */
inline __m512i _mm512_loadu_epi32(int const* v)
{
    __m512i w = _mm512_setzero_epi32(); /* avoid reading w uninitialized */
    w = _mm512_loadunpacklo_epi32(w, v + 0);
    return _mm512_loadunpackhi_epi32(w, v + 16);
}
#endif
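A hedged usage sketch of the portability macro above (modx and NLX carry over from the earlier slides; the surrounding kernel is omitted). The same call site compiles to an unaligned 512-bit integer load on both KNC and any AVX-512F processor:

/* Load 16 neighbor indices from the remainder table in one register. */
__m512i idx = _mm512_loadu_epi32(&modx[ix - 4 + NLX]);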
24. Optimization for many threads: vector summation
• When the TDKS equations are distributed across MPI processes
• We need to sum, over all orbitals, the density contribution at every real-space grid point (treated as vector components):
$\rho_{il} = \sum_{ik=1}^{N_K} \sum_{ib=1}^{N_B} c_{ib,ik} \, |Zu_{il,ib,ik}|^2$
complex(8), intent(in)  :: Zu(NL,NB,NK)
real(8),    intent(in)  :: c(NB,NK)
real(8),    intent(out) :: rho(NL)
rho = 0.d0  ! intent(out): must be cleared before accumulating
do ik=1,NK; do ib=1,NB; do il=1,NL
  rho(il)=rho(il)+c(ib,ik)*abs(Zu(il,ib,ik))**2
end do; end do; end do
! combine the partial densities of all processes in the TDKS communicator
call MPI_ALLREDUCE(MPI_IN_PLACE,rho,NL &
  ,MPI_REAL8,MPI_SUM,comm_TDKS,ierr)
26. Implementation methods
First approach:
do ik=1,NK
  do ib=1,NB
!$omp parallel do
    do il=1,NL
      rho(il)=rho(il)+c(ib,ik)*abs(Zu(il,ib,ik))**2
    end do
!$omp end parallel do
  end do
end do

OpenMP-suitable (collapse + array reduction):
!$omp parallel do collapse(2) reduction(+:rho)
do ik=1,NK
  do ib=1,NB
    do il=1,NL
      rho(il)=rho(il)+c(ib,ik)*abs(Zu(il,ib,ik))**2
    end do
  end do
end do
!$omp end parallel do

Manual reduction (thread-local summation, then thread-global summation):
!$omp parallel
tid=omp_get_thread_num(); tmp(:,tid)=0.d0
! tmp must be sized (NL, 0:ceiling_pow2(num_threads)-1), with the padding
! columns also zeroed, so the tree below never reads uninitialized data
!$omp do collapse(2)
do ik=1,NK
  do ib=1,NB
    do il=1,NL
      tmp(il,tid)=tmp(il,tid)+c(ib,ik)*abs(Zu(il,ib,ik))**2
    end do
  end do
end do
!$omp end do
! pairwise tree: halve the number of active threads each step (log2 steps)
i=ceiling_pow2(omp_get_num_threads())/2
do while(i > 0)
  if (tid < i) tmp(:,tid)=tmp(:,tid)+tmp(:,tid+i)
  i=i/2
!$omp barrier
end do
!$omp do
do il=1,NL
  rho(il)=tmp(il,0)
end do
!$omp end do
!$omp end parallel
1. Blelloch [1990]
2. Martin et al. [2012]
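For comparison, the manual-reduction variant can be written compactly in C with OpenMP (a minimal sketch with assumed sizes and array names, in the spirit of the tree reduction of Blelloch [1990]):

#include <complex.h>
#include <omp.h>
#include <string.h>

enum { NL = 1024, NB = 8, NK = 8, MAXT = 256 }; /* assumed sizes */

static double tmp[MAXT][NL]; /* one partial-sum buffer per (padded) thread */

static int ceiling_pow2(int n) /* smallest power of two >= n */
{
    int p = 1;
    while (p < n) p *= 2;
    return p;
}

void density_sum(const double complex Zu[NK][NB][NL],
                 const double c[NK][NB], double rho[NL])
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int pad = ceiling_pow2(omp_get_num_threads()); /* assume pad <= MAXT */

        /* zero the padded buffers so unused slots never pollute the tree */
        #pragma omp for
        for (int t = 0; t < pad; t++)
            memset(tmp[t], 0, sizeof tmp[t]);

        /* thread-local summation */
        #pragma omp for collapse(2)
        for (int ik = 0; ik < NK; ik++)
            for (int ib = 0; ib < NB; ib++)
                for (int il = 0; il < NL; il++) {
                    double a = cabs(Zu[ik][ib][il]);
                    tmp[tid][il] += c[ik][ib] * a * a;
                }

        /* thread-global summation: pairwise tree, log2(pad) steps */
        for (int i = pad / 2; i > 0; i /= 2) {
            if (tid < i)
                for (int il = 0; il < NL; il++)
                    tmp[tid][il] += tmp[tid + i][il];
            #pragma omp barrier
        }
        #pragma omp for
        for (int il = 0; il < NL; il++)
            rho[il] = tmp[0][il];
    }
}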
30. Evaluation environment

COMA @ U. Tsukuba, JP
• # of nodes: 393 (use up to 128)
• Processor: Intel E5-2670v2 x2 (Ivy-Bridge) + Intel Xeon Phi 7110P x2 (KNC)
• # of cores / node: 20 (10 cores x2, IVB) + 120 (60 cores x2, KNC)
• Memory / node: 64 GB (IVB, DDR3) + 8 GB x2 (KNC, GDDR5)
• Interconnect: Mellanox InfiniBand FDR Connect-X3
• Compiler and MPI: Intel 16.0.2 and Intel MPI 5.1.3
• Peak perf. / node: 2.548 TFLOPS (0.400 + 2.148 TFLOPS)

Oakforest-PACS (OFP) @ JCAHPC, JP
• # of nodes: 8208 (use up to 8192)
• Processor: Intel Xeon Phi 7250 (KNL)
• # of cores / node: 68 (Quadrant mode); 64 cores are assigned to the application, 4 cores are reserved for the OS
• Memory / node: 16 GB (MCDRAM) + 96 GB (DDR4), flat mode
• Interconnect: Intel Omni-Path Architecture
• Compiler and MPI: Intel 17.0.1 and Intel MPI 2017 u1
• Peak perf. / node: 3.046 TFLOPS

One MPI process is attached to each processor; using 4 threads per core is fast on the Intel Xeon Phi.
31. COMA (U. Tsukuba, JP)
[Figure: COMA node diagram: CPU 0 and CPU 1 connected by QPI, each hosting a Xeon Phi (KNC) over PCIe Gen3 x16, with InfiniBand FDR attached to the InfiniBand network. Image: from the center's website.]
• # of compute nodes: 393 (use up to 256 nodes)
• Theoretical peak: 1.001 PFLOPS (CPU: 157.2 TFLOPS, KNC: 843.8 TFLOPS)
• HPL result (June 2014 list): 746.0 TFLOPS (74.7% of peak)
COMA was shut down in March 2019.
33. Symmetric execution with static load-balancing
[Chart: MPI_Allreduce latency (CPU+KNC), max latency in us (log scale, 100 to 100000) vs. number of compute nodes (1 to 256); the fast algorithm (4-d tree) is far below the default algorithm.]
[Chart: time-development part computation, time/iteration in ms (0 to 200) vs. number of compute nodes (16 to 128); Symmetric (Balanced) is faster than Symmetric (Load even), with an annotated speedup of x31.]
Y. Hirokawa et al.: "Electron Dynamics Simulation with Time-Dependent Density Functional Theory on Large Scale Symmetric Mode Xeon Phi Cluster", PDSEC16.
34. Oakforest-PACS (OFP) at JCAHPC (U. Tsukuba and U. Tokyo, JP)
Japan's 3rd fastest supercomputer (as of the latest TOP500 list)
[Figure: Intel Knights Landing processor diagram: a mesh of tiles, each tile holding two cores (2 VPUs per core), a CHA, and a shared 1 MB L2 cache; two memory controllers with 3 DDR4 channels each (6 x 16 GB DDR4); 8 MCDRAM devices; PCIe Gen3 and DMI; Intel Omni-Path Architecture attached to an OPA fat-tree network.]
• # of compute nodes: 8208 (use up to 8192 nodes)
• Theoretical peak: 24.91 PFLOPS (8192 nodes)
• HPL result (Nov. 2017 list): 13.55 PFLOPS (54.4% of peak)
• HPCG result (Nov. 2017 list): 0.3855 PFLOPS (1.54% of peak)
35. Combining MCDRAM and DDR4
• DDR4 as the main memory, MCDRAM as a "scratch-pad cache"
• Advantage: high aggregate memory bandwidth
• Disadvantage: complex data handling
• How our code handles the dominant computation (sketched below):
1. The computation domain is allocated in main memory (DDR4 or MCDRAM)
2. Each thread copies its working set to thread-private memory in MCDRAM
3. The computation is closed within MCDRAM
4. After the computation completes, the result is manually written back to main memory
[Diagram: computation domain @ DDR4 or MCDRAM --1. copy--> thread-private memory @ MCDRAM --2. compute iteratively--> --3. write back--> main memory]
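A minimal sketch of this copy / compute / write-back pattern using the memkind library's hbwmalloc interface, one plausible way to place a thread-private working set in MCDRAM under flat mode (buffer names, sizes, and the slice layout are assumptions, not the application's code):

#include <hbwmalloc.h> /* memkind's high-bandwidth-memory allocator */
#include <omp.h>
#include <string.h>

enum { WS = 1 << 16 }; /* assumed per-thread working-set size (doubles) */

void compute_on_mcdram(double *domain) /* domain lives in main memory */
{
    #pragma omp parallel
    {
        long off = (long)omp_get_thread_num() * WS; /* this thread's slice */

        /* 1. copy: thread-private buffer allocated from MCDRAM */
        double *buf = (double *)hbw_malloc(WS * sizeof *buf);
        memcpy(buf, domain + off, WS * sizeof *buf);

        /* 2. compute iteratively: reads and writes stay inside MCDRAM */
        for (int it = 0; it < 4; it++)
            for (long i = 0; i < WS; i++)
                buf[i] = 0.5 * buf[i]; /* stand-in for the real kernel */

        /* 3. write back to main memory once, after the computation */
        memcpy(domain + off, buf, WS * sizeof *buf);
        hbw_free(buf);
    }
}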
36. DDR4+MCDRAM performance
[Chart: elapsed time/iteration in seconds (0 to 10) vs. wave-function size in GiB (0 to 24), single-processor evaluation; series: cache mode, MCDRAM+DDR4, MCDRAM-only, DDR4-only. While all data fits inside MCDRAM, the modes show comparable performance; once the data exceeds MCDRAM, our scheme gives slightly better performance.]
Our code achieves good performance regardless of the data size the application requests.
37. Stencil computation
[Bar chart: performance in GFLOPS (0 to 800), large parallel case (Silicon), compiler vectorization vs. explicit vectorization on IVB 10 cores, KNC 60 cores, KNL 64 cores, and SKL 20 cores; annotated speedups of 1.4x, 2.5x, 1.6x, and 1.8x, with the KNL result reaching 25% of peak performance and 46% of HPL performance.]
With the same implementation, the KNL processor is over 2.5x faster than the KNC co-processor and over 1.8x faster than a single-socket Skylake.
38. Entire computation
[Charts: execution time/iteration in ms vs. number of compute nodes, for 2 IVB, 2 KNC, 2 IVB + 2 KNC, and KNL configurations. Left: poor parallel case (SiO2), 1 to 16 nodes (10 to 1000 ms). Right: large parallel case (Silicon), 2 to 128 nodes (10 to 10000 ms).]
The Silicon case shows very good strong scaling; the SiO2 case does not have enough thread and MPI parallelism.

Peak perf. [GFLOPS] / actual memory bandwidth [GB/s]:
• OFP: 3046 GFLOPS, 486.99 GB/s
• COMA: 400 + 2148 = 2548 GFLOPS, 46.55 x2 + 171.73 x2 = 436.56 GB/s
44. Why the performance degrades
[Stacked bar chart: elapsed time/iteration for the Best and Worst cases, normalized by the Best case; components: Hamiltonian (0.764 vs 0.925), Current (0.095 vs 0.109), Misc. computation (0.127 vs 0.126), and Communication.]
• The blue box shows the computation-only part
1. Communication is not included
2. The problem size per node is even
• The graphite case uses two-stage parallelization
• It requires two-phase synchronization, both within the sub MPI communicator and across all MPI processes
• This makes it sensitive to load imbalance
• The load imbalance is non-algorithmic:
• Intel Turbo Boost mechanism
• AVX-512 base clock (?)