17. Multi-scale approach: Macroscopic Maxwell + Microscopic TDDFT
At each macroscopic grid point, we solve real-time electron dynamics in parallel.
[Figure: a macroscopic grid (mm scale) for the Maxwell equation, coupled at each grid point to a microscopic grid (sub-nm scale) for RT-TDDFT electron dynamics]
K. Yabana et al., Phys. Rev. B 85, 045134 (2012).
20. Stencil computation
• Hamiltonian applied to the wave functions
• 4th-order orthogonal finite difference (25-point stencil)
• Periodic boundary conditions + double-precision complex numbers
• Most of the arithmetic reduces to double-precision real operations
• Parallelized over wave-number (k) space
• Each wave-function update applies the stencil four times
#pragma omp parallel for collapse(2)
for ik = [1,Nk] ; for ib = [1,Nb]
  l_domain[:,:,:,0] = g_domain[1:Nz,1:Ny,1:Nx,ib,ik]
  /* single-thread computation */
  for s = [1,4]
    /* 25-point stencil */
    for ix = [1,Nx] ; for iy = [1,Ny] ; for iz = [1,Nz]
      l_domain[iz,iy,ix,s] = stencil(l_domain[...,s-1])
  /* pseudo-potential (omitted) */
  /* update */
  g_domain[:,:,:,ib,ik] += l_domain[:,:,:,1:4] * c
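To make the stencil structure concrete, here is a minimal C sketch of one axis of the 25-point stencil (4 neighbor points on each side, with periodic boundaries). The names and sizes (stencil_z, Cz, NZ) are assumptions for illustration, not the application's actual kernel; the real code sums the contributions of all three axes.

#include <complex.h>

enum { NZ = 64 };  /* assumed grid extent along the unit-stride dimension */

/* One axis of the 25-point stencil: 4 neighbor points on each side of the
   center, with periodic boundary conditions. Cz[1..4] holds the
   finite-difference coefficients. */
void stencil_z(const double complex u[NZ], double complex du[NZ],
               const double Cz[5])
{
    for (int iz = 0; iz < NZ; iz++) {
        double complex v = 0.0;
        for (int dt = 1; dt <= 4; dt++) {
            int zp = (iz + dt) % NZ;       /* periodic wrap toward +z */
            int zm = (iz - dt + NZ) % NZ;  /* periodic wrap toward -z */
            v += Cz[dt] * (u[zp] + u[zm]); /* symmetric (even) part */
        }
        du[iz] = v;
    }
}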
21. Optimization for wide SIMD
Single processor performance: KNC vs KNL
real(8),   intent(in)  :: B(0:NLz-1,0:NLy-1,0:NLx-1)
complex(8),intent(in)  :: E(0:NLz-1,0:NLy-1,0:NLx-1)
complex(8),intent(out) :: F(0:NLz-1,0:NLy-1,0:NLx-1)
#define IDX(dt) iz,iy,modx(ix+(dt)+NLx)
#define IDY(dt) iz,mody(iy+(dt)+NLy),ix
#define IDZ(dt) modz(iz+(dt)+NLz),iy,ix
do ix=0,NLx-1
do iy=0,NLy-1
!dir$ vector nontemporal(F)
do iz=0,NLz-1
  v=0; w=0
  ! z-computation
  v=v+Cz(1)*(E(IDZ(1))+E(IDZ(-1))) ...
  w=w+Dz(1)*(E(IDZ(1))-E(IDZ(-1))) ...
  ! y-computation
  ! x-computation
  F(iz,iy,ix) = B(iz,iy,ix)*E(iz,iy,ix) &
  &           + A*E(iz,iy,ix)           &
  &           - 0.5d0*v - zI*w
end do
end do
end do
• Indices are computed on every access through a precomputed remainder table, because the remainder operation itself is very slow (see the sketch below)
• Non-temporal stores are used (Intel compiler only)
• The calculation order is chosen to walk contiguous addresses in memory
• [C language] Explicit (hand-coded) vectorization with 512-bit SIMD on KNC
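The remainder table mentioned above can be sketched in a few lines of C (a hypothetical illustration; modx and NLX follow the names on this slide). The slow remainder instruction is paid once at setup, and the hot loop performs a table lookup instead:

enum { NLX = 64 };        /* assumed grid extent along x */
static int modx[3 * NLX]; /* covers ix + dt + NLX for any |dt| <= NLX */

/* Build once: modx[i] == i % NLX. */
void build_remainder_table(void)
{
    for (int i = 0; i < 3 * NLX; i++)
        modx[i] = i % NLX;
}
/* Hot-loop usage: modx[ix + dt + NLX] replaces (ix + dt) % NLX. */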
22. Explicit vectorization with SIMD instructions
• The SIMD stencil code is written in C
• The application is mainly written in Fortran90
• Our implementation updates four grid points at a time
• The size of the contiguous dimension is set to a multiple of four
• We apply three optimizations (optimization 1 is sketched after this list)
1. Complex-value multiplication
• One of the operands is a constant value during the computation
2. Vectorized index calculation
• Each dimension has eight neighbor points
• The non-unit-stride (Y and X) dimensions require 16 indices
⇒ 512-bit integer SIMD fits this size exactly (512 bit / 32 bit = 16)
3. Contiguous memory access (Z dimension)
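Optimization 1 can be illustrated with a short AVX-512 sketch (assumed names, not the application's kernel): because the multiplier c + d*i is constant, its real and imaginary parts are broadcast once, and each packed complex multiply then needs only a permute, a multiply, and one fmaddsub.

#include <immintrin.h>

/* Multiply four packed complex doubles (re/im interleaved in z)
   by the constant complex scalar c + d*i. */
static inline __m512d cmul_const(__m512d z, double c, double d)
{
    __m512d cr = _mm512_set1_pd(c);          /* broadcast real part */
    __m512d ci = _mm512_set1_pd(d);          /* broadcast imag part */
    __m512d zs = _mm512_permute_pd(z, 0x55); /* swap re/im in each pair */
    __m512d t  = _mm512_mul_pd(zs, ci);      /* [im*d, re*d, ...] */
    /* even lanes: re*c - im*d (real part); odd lanes: im*c + re*d (imag) */
    return _mm512_fmaddsub_pd(z, cr, t);
}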
23. For AVX-512 processors
• We need to convert the KNC SIMD code to AVX-512
• Preprocessor directives can absorb the minor differences between the instruction sets
• Our code requires only the common AVX-512F subset (F: Foundation)
• Therefore our implementation applies to all AVX-512 processor families
• Performance evaluation
• Stencil computation performance on a Xeon Gold 6148 (Skylake-SP)
• The implementation is the same on both KNL and Skylake-SP
#ifdef __AVX512F__
/* for AVX-512 processors */
#define _mm512_loadu_epi32    _mm512_loadu_si512
#define _mm512_storenrngo_pd  _mm512_stream_pd
#elif __MIC__
/* for KNC: emulate an unaligned load with the unpack-lo/hi pair */
inline __m512i _mm512_loadu_epi32(int const* v)
{
    __m512i w = _mm512_setzero_epi32(); /* avoid reading w uninitialized */
    w = _mm512_loadunpacklo_epi32(w, v + 0);
    return _mm512_loadunpackhi_epi32(w, v + 16);
}
#endif
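A hedged usage sketch of the portability macro above (modx and NLX carry over from the earlier slides; the surrounding kernel is omitted). The same call site compiles to an unaligned 512-bit integer load on both KNC and any AVX-512F processor:

/* Load 16 neighbor indices from the remainder table in one register. */
__m512i idx = _mm512_loadu_epi32(&modx[ix - 4 + NLX]);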
24. Optimization for many threads: vector summation
• When the TDKS equations are distributed across MPI processes
• We need to sum, over all orbitals, the density contribution at every real-space grid point (treated as vector components):
$\rho_{il} = \sum_{ik=1}^{N_K} \sum_{ib=1}^{N_B} c_{ib,ik} \, |Zu_{il,ib,ik}|^2$
complex(8), intent(in)  :: Zu(NL,NB,NK)
real(8),    intent(in)  :: c(NB,NK)
real(8),    intent(out) :: rho(NL)
rho = 0.d0  ! intent(out): must be cleared before accumulating
do ik=1,NK; do ib=1,NB; do il=1,NL
  rho(il)=rho(il)+c(ib,ik)*abs(Zu(il,ib,ik))**2
end do; end do; end do
! combine the partial densities of all processes in the TDKS communicator
call MPI_ALLREDUCE(MPI_IN_PLACE,rho,NL &
  ,MPI_REAL8,MPI_SUM,comm_TDKS,ierr)
26. Implementation methods
First approach:
do ik=1,NK
  do ib=1,NB
!$omp parallel do
    do il=1,NL
      rho(il)=rho(il)+c(ib,ik)*abs(Zu(il,ib,ik))**2
    end do
!$omp end parallel do
  end do
end do

OpenMP-suitable (collapse + array reduction):
!$omp parallel do collapse(2) reduction(+:rho)
do ik=1,NK
  do ib=1,NB
    do il=1,NL
      rho(il)=rho(il)+c(ib,ik)*abs(Zu(il,ib,ik))**2
    end do
  end do
end do
!$omp end parallel do

Manual reduction (thread-local summation, then thread-global summation):
!$omp parallel
tid=omp_get_thread_num(); tmp(:,tid)=0.d0
! tmp must be sized (NL, 0:ceiling_pow2(num_threads)-1), with the padding
! columns also zeroed, so the tree below never reads uninitialized data
!$omp do collapse(2)
do ik=1,NK
  do ib=1,NB
    do il=1,NL
      tmp(il,tid)=tmp(il,tid)+c(ib,ik)*abs(Zu(il,ib,ik))**2
    end do
  end do
end do
!$omp end do
! pairwise tree: halve the number of active threads each step (log2 steps)
i=ceiling_pow2(omp_get_num_threads())/2
do while(i > 0)
  if (tid < i) tmp(:,tid)=tmp(:,tid)+tmp(:,tid+i)
  i=i/2
!$omp barrier
end do
!$omp do
do il=1,NL
  rho(il)=tmp(il,0)
end do
!$omp end do
!$omp end parallel
1. Blelloch [1990]
2. Martin et al. [2012]
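For comparison, the manual-reduction variant can be written compactly in C with OpenMP (a minimal sketch with assumed sizes and array names, in the spirit of the tree reduction of Blelloch [1990]):

#include <complex.h>
#include <omp.h>
#include <string.h>

enum { NL = 1024, NB = 8, NK = 8, MAXT = 256 }; /* assumed sizes */

static double tmp[MAXT][NL]; /* one partial-sum buffer per (padded) thread */

static int ceiling_pow2(int n) /* smallest power of two >= n */
{
    int p = 1;
    while (p < n) p *= 2;
    return p;
}

void density_sum(const double complex Zu[NK][NB][NL],
                 const double c[NK][NB], double rho[NL])
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int pad = ceiling_pow2(omp_get_num_threads()); /* assume pad <= MAXT */

        /* zero the padded buffers so unused slots never pollute the tree */
        #pragma omp for
        for (int t = 0; t < pad; t++)
            memset(tmp[t], 0, sizeof tmp[t]);

        /* thread-local summation */
        #pragma omp for collapse(2)
        for (int ik = 0; ik < NK; ik++)
            for (int ib = 0; ib < NB; ib++)
                for (int il = 0; il < NL; il++) {
                    double a = cabs(Zu[ik][ib][il]);
                    tmp[tid][il] += c[ik][ib] * a * a;
                }

        /* thread-global summation: pairwise tree, log2(pad) steps */
        for (int i = pad / 2; i > 0; i /= 2) {
            if (tid < i)
                for (int il = 0; il < NL; il++)
                    tmp[tid][il] += tmp[tid + i][il];
            #pragma omp barrier
        }
        #pragma omp for
        for (int il = 0; il < NL; il++)
            rho[il] = tmp[0][il];
    }
}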
30. Evaluation environment

COMA @ U. Tsukuba, JP
• # of nodes: 393 (use up to 128)
• Processor: Intel E5-2670v2 x2 (Ivy-Bridge) + Intel Xeon Phi 7110P x2 (KNC)
• # of cores / node: 20 (10 cores x2, IVB) + 120 (60 cores x2, KNC)
• Memory / node: 64 GB (IVB, DDR3) + 8 GB x2 (KNC, GDDR5)
• Interconnect: Mellanox InfiniBand FDR Connect-X3
• Compiler and MPI: Intel 16.0.2 and Intel MPI 5.1.3
• Peak perf. / node: 2.548 TFLOPS (0.400 + 2.148 TFLOPS)

Oakforest-PACS (OFP) @ JCAHPC, JP
• # of nodes: 8208 (use up to 8192)
• Processor: Intel Xeon Phi 7250 (KNL)
• # of cores / node: 68 (Quadrant mode); 64 cores are assigned to the application, 4 cores are reserved for the OS
• Memory / node: 16 GB (MCDRAM) + 96 GB (DDR4), flat mode
• Interconnect: Intel Omni-Path Architecture
• Compiler and MPI: Intel 17.0.1 and Intel MPI 2017 u1
• Peak perf. / node: 3.046 TFLOPS

One MPI process is attached to each processor; using 4 threads per core is fast on the Intel Xeon Phi.
31. COMA (U. Tsukuba, JP)
[Figure: COMA node diagram: CPU 0 and CPU 1 connected by QPI, each hosting a Xeon Phi (KNC) over PCIe Gen3 x16, with InfiniBand FDR attached to the InfiniBand network. Image: from the center's website.]
• # of compute nodes: 393 (use up to 256 nodes)
• Theoretical peak: 1.001 PFLOPS (CPU: 157.2 TFLOPS, KNC: 843.8 TFLOPS)
• HPL result (June 2014 list): 746.0 TFLOPS (74.7% of peak)
COMA was shut down in March 2019.
33. Symmetric execution with static load-balancing
[Chart: MPI_Allreduce latency (CPU+KNC), max latency in us (log scale, 100 to 100000) vs. number of compute nodes (1 to 256); the fast algorithm (4-d tree) is far below the default algorithm.]
[Chart: time-development part computation, time/iteration in ms (0 to 200) vs. number of compute nodes (16 to 128); Symmetric (Balanced) is faster than Symmetric (Load even), with an annotated speedup of x31.]
Y. Hirokawa et al.: "Electron Dynamics Simulation with Time-Dependent Density Functional Theory on Large Scale Symmetric Mode Xeon Phi Cluster", PDSEC16.
34. Oakforest-PACS (OFP) at JCAHPC (U. Tsukuba and U. Tokyo, JP)
Japan's 3rd fastest supercomputer (as of the latest TOP500 list)
[Figure: Intel Knights Landing processor diagram: a mesh of tiles, each tile holding two cores (2 VPUs per core), a CHA, and a shared 1 MB L2 cache; two memory controllers with 3 DDR4 channels each (6 x 16 GB DDR4); 8 MCDRAM devices; PCIe Gen3 and DMI; Intel Omni-Path Architecture attached to an OPA fat-tree network.]
• # of compute nodes: 8208 (use up to 8192 nodes)
• Theoretical peak: 24.91 PFLOPS (8192 nodes)
• HPL result (Nov. 2017 list): 13.55 PFLOPS (54.4% of peak)
• HPCG result (Nov. 2017 list): 0.3855 PFLOPS (1.54% of peak)
35. Combining MCDRAM and DDR4
• DDR4 as the main memory, MCDRAM as a "scratch-pad cache"
• Advantage: high aggregate memory bandwidth
• Disadvantage: complex data handling
• How our code handles the dominant computation (sketched below):
1. The computation domain is allocated in main memory (DDR4 or MCDRAM)
2. Each thread copies its working set to thread-private memory in MCDRAM
3. The computation is closed within MCDRAM
4. After the computation completes, the result is manually written back to main memory
[Diagram: computation domain @ DDR4 or MCDRAM --1. copy--> thread-private memory @ MCDRAM --2. compute iteratively--> --3. write back--> main memory]
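A minimal sketch of this copy / compute / write-back pattern using the memkind library's hbwmalloc interface, one plausible way to place a thread-private working set in MCDRAM under flat mode (buffer names, sizes, and the slice layout are assumptions, not the application's code):

#include <hbwmalloc.h> /* memkind's high-bandwidth-memory allocator */
#include <omp.h>
#include <string.h>

enum { WS = 1 << 16 }; /* assumed per-thread working-set size (doubles) */

void compute_on_mcdram(double *domain) /* domain lives in main memory */
{
    #pragma omp parallel
    {
        long off = (long)omp_get_thread_num() * WS; /* this thread's slice */

        /* 1. copy: thread-private buffer allocated from MCDRAM */
        double *buf = (double *)hbw_malloc(WS * sizeof *buf);
        memcpy(buf, domain + off, WS * sizeof *buf);

        /* 2. compute iteratively: reads and writes stay inside MCDRAM */
        for (int it = 0; it < 4; it++)
            for (long i = 0; i < WS; i++)
                buf[i] = 0.5 * buf[i]; /* stand-in for the real kernel */

        /* 3. write back to main memory once, after the computation */
        memcpy(domain + off, buf, WS * sizeof *buf);
        hbw_free(buf);
    }
}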
36. DDR4+MCDRAM performance
[Chart: elapsed time/iteration in seconds (0 to 10) vs. wave-function size in GiB (0 to 24), single-processor evaluation; series: cache mode, MCDRAM+DDR4, MCDRAM-only, DDR4-only. While all data fits inside MCDRAM, the modes show comparable performance; once the data exceeds MCDRAM, our scheme gives slightly better performance.]
Our code achieves good performance regardless of the data size the application requests.
37. Stencil computation
[Bar chart: performance in GFLOPS (0 to 800), large parallel case (Silicon), compiler vectorization vs. explicit vectorization on IVB 10 cores, KNC 60 cores, KNL 64 cores, and SKL 20 cores; annotated speedups of 1.4x, 2.5x, 1.6x, and 1.8x, with the KNL result reaching 25% of peak performance and 46% of HPL performance.]
With the same implementation, the KNL processor is over 2.5x faster than the KNC co-processor and over 1.8x faster than a single-socket Skylake.
38. Entire computation
[Charts: execution time/iteration in ms vs. number of compute nodes, for 2 IVB, 2 KNC, 2 IVB + 2 KNC, and KNL configurations. Left: poor parallel case (SiO2), 1 to 16 nodes (10 to 1000 ms). Right: large parallel case (Silicon), 2 to 128 nodes (10 to 10000 ms).]
The Silicon case shows very good strong scaling; the SiO2 case does not have enough thread and MPI parallelism.

Peak perf. [GFLOPS] / actual memory bandwidth [GB/s]:
• OFP: 3046 GFLOPS, 486.99 GB/s
• COMA: 400 + 2148 = 2548 GFLOPS, 46.55 x2 + 171.73 x2 = 436.56 GB/s
44. Why the performance degrades
[Stacked bar chart: elapsed time/iteration for the Best and Worst cases, normalized by the Best case; components: Hamiltonian (0.764 vs 0.925), Current (0.095 vs 0.109), Misc. computation (0.127 vs 0.126), and Communication.]
• The blue box shows the computation-only part
1. Communication is not included
2. The problem size per node is even
• The graphite case uses two-stage parallelization
• It requires two-phase synchronization, both within the sub MPI communicator and across all MPI processes
• This makes it sensitive to load imbalance
• The load imbalance is non-algorithmic:
• Intel Turbo Boost mechanism
• AVX-512 base clock (?)