generalized_nbody_acs_2015_challacombe

Matt Challacombe, Nicolas Bock & Terry Haut
Los Alamos National Laboratory
matt.challacombe@freeon.org
Thanks be to LANL, operated by Los Alamos National Security, LLC for the NNSA of the
US DOE under Contract No. DE-AC52-06NA25396; released under LA-UR-15-26489
Electronic Structure Theory as
Generic N-Body Problem
Electronic Structure Methods for Large Systems
August ACS, Boston 2015

Barriers for Large Systems
What's Holding O(N) Methods Back?
• Sparse approximation based on matrix decay
• SpGEMM kernels cannot access strong parallel scaling
• Optimization inhibits evolution:
 Entrenched data structures dictate research (row-col)
 Funding cycles limit data innovation
Our Vision:
• Fast, generic and data local
• New math, one programming model

Conventional O(N) Quantum Chemistry
• Pretend interesting quantum mechanics is highly local
• Conventional SpGEMM kernels (row-col)
• Pray for stability, error control …
[Figure: matrix decay and “sparsification” by radial cutoff or numerical threshold]
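To make the numerical-threshold idea concrete, here is a minimal sketch, assuming a model matrix with exponential off-diagonal decay (the `rate` values and the threshold are hypothetical, not taken from the slides). It shows that thresholding only yields sparsity when decay is fast; with slow decay the matrix stays dense.

```python
import math

def decaying_matrix(n, rate):
    """Model matrix with exponential off-diagonal decay: a[i][j] = exp(-rate*|i-j|)."""
    return [[math.exp(-rate * abs(i - j)) for j in range(n)] for i in range(n)]

def sparsify(a, tau):
    """Zero every element with magnitude below tau; return (thresholded copy, kept count)."""
    kept = 0
    out = []
    for row in a:
        new_row = []
        for v in row:
            if abs(v) >= tau:
                new_row.append(v)
                kept += 1
            else:
                new_row.append(0.0)
        out.append(new_row)
    return out, kept

n = 64
tau = 1e-6                                   # hypothetical numerical threshold
_, kept_fast = sparsify(decaying_matrix(n, 2.0), tau)    # fast decay: banded result
_, kept_slow = sparsify(decaying_matrix(n, 0.05), tau)   # slow decay: nothing is dropped
fill_fast = kept_fast / n ** 2
fill_slow = kept_slow / n ** 2
```

In this toy model the fast-decay matrix keeps only a narrow band, while the slow-decay matrix keeps every element, which is the situation the later slides address.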

The Parallel SpGEMM in Quantum Chemistry (I)
Also:
• Solvers entangled with
DBCSR require locality
• Massive redistribution on
each SCF cycle 
[Figure labels: Randomize; Cannon, SUMMA; p = 3]
Best parallel SpGEMM is Aydın Buluç’s method for work homogenization.
• Cannon’s algorithm for redistribution throttled as O( 𝒑 )
• Cannot access strong scaling regime, 𝒑 >> 𝑵
Fock Build
$F \leftarrow J[P] + a\,K_F[P] + b\,K_{xc}[P]$
Spectral Projection
$P \leftarrow \theta[F - \mu I]$
[Figure labels: Randomize; Localize; p = 2; p = 1; p = 0]

Bowler et al., arXiv:1402.6828
• So far, no QC code has
demonstrated a parallel
SpGEMM beyond
~ 4 atoms/core
• Enables bigger systems, but
not higher throughput
The Parallel SpGEMM in Quantum Chemistry (II)
Problem:
• row-col data structures do not respect locality
• row-col data structures do not query
• row-col decompositions lack flexibility

Also… High Value Correlations are Extended
• Delocalized physics and local support lead to matrix
ill-conditioning and dense matrices
• Ill-conditioning is a feature, not a bug
• Allows control of the interpolation range:
[Figure panels: Quantum Transport (LCAOs); Radial Basis Function Networks; High Resolution PDE solvers (RBFs)]

• A hierarchical solver with generic database operations,
data locality and local approximation
• Genericity: Treat metric queries same as range queries,
same as higher dimensional queries, same as …
• Kernel independent skeletonizations
Datacentric, Generic N-Body Solvers

More Locality in Higher Dimensions (I)
• For ill-conditioned systems, matrix decay can be very, very
slow, with matrices that remain dense
• Instead of looking for data compression or sparsity, look
for locality in product tensor volume
• First, a generic quadtree matrix:
• block-by-magnitude ordering
• metric locality resolved by quadtree

More Locality in Higher Dimensions (II)
[Figure labels: occlusion; cull]
• SpAMM is a recursive occlusion-cull of product intermediates
• Double metric query of modified Cauchy-Schwarz criteria
• More task locality in higher dimensional product volume
• Relative error in the product is bounded
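The recursive occlusion-cull can be sketched in a few lines. This is a minimal illustration, assuming the culling test is simply the product of Frobenius norms, $\|a\|\,\|b\| < \tau$, applied at every level of a quadtree with scalar leaves; the names `Node`, `build` and `spamm` are hypothetical, and the optimized kernel cited later in the slides uses blocked leaves instead.

```python
import math

class Node:
    """Quadtree node: a scalar leaf or four children, tagged with its Frobenius norm."""
    def __init__(self, value=None, kids=None):
        self.value = value          # float for a 1x1 leaf, else None
        self.kids = kids            # [[q00, q01], [q10, q11]] or None
        if kids is None:
            self.norm = abs(value)
        else:
            self.norm = math.sqrt(sum(k.norm ** 2 for row in kids for k in row))

def build(a, r0, c0, n):
    """Build the quadtree for the n x n sub-block of a that starts at (r0, c0)."""
    if n == 1:
        return Node(value=a[r0][c0])
    h = n // 2
    return Node(kids=[[build(a, r0 + i * h, c0 + j * h, h) for j in range(2)]
                      for i in range(2)])

def spamm(a, b, c, r0, c0, n, tau, stats):
    """Accumulate a*b into dense c, culling sub-products with ||a||*||b|| < tau."""
    if a.norm * b.norm < tau:       # double metric query: one norm from each operand
        stats["culled"] += 1
        return
    if n == 1:
        c[r0][c0] += a.value * b.value
        stats["leaf_products"] += 1
        return
    h = n // 2
    for i in range(2):
        for j in range(2):
            for k in range(2):      # the third index spans the product volume
                spamm(a.kids[i][k], b.kids[k][j], c, r0 + i * h, c0 + j * h, h, tau, stats)

n, tau = 16, 1e-4                   # hypothetical size and threshold
dense = [[math.exp(-abs(i - j)) for j in range(n)] for i in range(n)]
root = build(dense, 0, 0, n)
approx = [[0.0] * n for _ in range(n)]
stats = {"culled": 0, "leaf_products": 0}
spamm(root, root, approx, 0, 0, n, tau, stats)
exact = [[sum(dense[i][k] * dense[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
err = math.sqrt(sum((approx[i][j] - exact[i][j]) ** 2
                    for i in range(n) for j in range(n)))
```

Each culled sub-product has Frobenius norm below `tau` by submultiplicativity, so the total error in this sketch is at most `stats["culled"] * tau`. That is a crude bound, not the one stated on the slide.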

More Locality in Higher Dimensions (III)
[Figure labels: physics; locality principle; metric locality; space filling curve; slow decay; blocking; block-by-magnitude; occlusion-cull ~ $a_i b_i$; more local in product volume]
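The slides do not say which space filling curve is used. As one common choice, a Z-order (Morton) key can be sketched as follows; the function name `morton_index` is hypothetical. Sorting blocks by this key keeps each quadtree quadrant contiguous in memory.

```python
def morton_index(i, j, bits=16):
    """Interleave the bits of row i and column j into a Z-order (Morton) key."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b + 1)   # row bit goes to the odd position
        key |= ((j >> b) & 1) << (2 * b)       # column bit goes to the even position
    return key

# Blocks of a 4x4 grid listed in Z-order: each 2x2 quadrant is contiguous.
order = sorted(((i, j) for i in range(4) for j in range(4)),
               key=lambda ij: morton_index(*ij))
```

The first four entries of `order` are the upper-left 2x2 quadrant, the next four the upper-right, and so on, which mirrors the quadtree recursion.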

In addition to metric locality, exploit algebraic locality in resolution of the identity:
$I(s) = s^{1/2} \cdot s^{-1/2}$
A N-Body Solver for Square Root Iteration
Square root iteration equivalent to matrix sign problem under Higham’s identity:
$\mathrm{sign}\begin{pmatrix} 0 & s \\ I & 0 \end{pmatrix} = \begin{pmatrix} 0 & s^{1/2} \\ s^{-1/2} & 0 \end{pmatrix}$
Challacombe, Haut & Bock, arXiv (2015)
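The identity can be checked numerically in its smallest instance, where $s$ is a positive scalar and the block matrix is 2x2. The sketch below assumes the plain Newton sign iteration $X \leftarrow (X + X^{-1})/2$, which is not the iteration used in the talk; the helper names are hypothetical.

```python
def inv2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matrix_sign2(m, iters=60):
    """Newton iteration X <- (X + X^-1)/2, converging to sign(m) for a 2x2 m."""
    x = [row[:] for row in m]
    for _ in range(iters):
        xi = inv2(x)
        x = [[0.5 * (x[i][j] + xi[i][j]) for j in range(2)] for i in range(2)]
    return x

s = 9.0                                        # scalar stand-in for the matrix s
sign = matrix_sign2([[0.0, s], [1.0, 0.0]])    # expect [[0, s^(1/2)], [s^(-1/2), 0]]
```

For `s = 9.0` the off-diagonal entries converge to 3 and 1/3, the square root and inverse square root.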

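• Square root iteration (sqi) with map $h_\alpha[\cdot]$ and $\otimes_\tau$ algebra
• $z_k \to s^{-1/2}$, $y_k \to s^{1/2}$, $x_k \to I(s)$ with $\tau \to 0$
• Two instances we care about, single and dual channel:
$\mathrm{sqi}_{\mathrm{dual}}(s, \tau) :=$
  $x_0 = s / s_0$, $y_0 = x_0$, $z_0 = I$, $\tau_s \sim .01 \times \tau$
  while $|\mathrm{tr}\, x_k - n| / n > \tau$
    $z_k \leftarrow z_{k-1} \otimes_\tau h_\alpha[x_{k-1}]$
    $y_k \leftarrow h_\alpha[x_{k-1}] \otimes_{\tau_s} y_{k-1}$
    $x_k \leftarrow y_k \otimes_\tau z_k$
  return $\{ z_\tau \leftarrow z_k \}$
$\mathrm{sqi}_{\mathrm{single}}(s, \tau) :=$
  $x_0 = s / s_0$, $z_0 = I$, $\tau_s \sim .01 \times \tau$
  while $|\mathrm{tr}\, x_k - n| / n > \tau$
    $z_k \leftarrow z_{k-1} \otimes_\tau h_\alpha[x_{k-1}]$
    $x_k \leftarrow z_k^T \otimes_\tau (s \otimes_{\tau_s} z_k)$
  return $\{ z_\tau \leftarrow z_k \}$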
Instances of Square Root Iteration
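A dense, exact-arithmetic sketch of the dual channel instance may help fix the data flow. It rests on several assumptions not stated on the slide: the map is taken to be the unscaled Newton-Schulz map $h[x] = (3I - x)/2$, the scalar $s_0$ is taken to be a Gershgorin bound on the largest eigenvalue, and every $\otimes_\tau$ product is replaced by an ordinary matrix product, so $\tau$ only controls convergence here.

```python
def matmul(a, b):
    """Plain dense product, standing in for the SpAMM product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def h(x):
    """Assumed map: unscaled Newton-Schulz, h[x] = (3I - x)/2."""
    n = len(x)
    return [[(3.0 * (i == j) - x[i][j]) / 2.0 for j in range(n)] for i in range(n)]

def sqi_dual(s, tau, max_iter=100):
    """Return (approx s^(-1/2), approx s^(1/2)) for a symmetric positive definite s."""
    n = len(s)
    s0 = max(sum(abs(v) for v in row) for row in s)   # assumed: Gershgorin bound
    x = [[v / s0 for v in row] for row in s]          # x_0 = s / s_0
    y = [row[:] for row in x]                         # y_0 = x_0
    z = [[float(i == j) for j in range(n)] for i in range(n)]   # z_0 = I
    for _ in range(max_iter):
        if abs(sum(x[i][i] for i in range(n)) - n) / n <= tau:
            break
        hx = h(x)
        z = matmul(z, hx)      # z_k <- z_{k-1} h[x_{k-1}]
        y = matmul(hx, y)      # y_k <- h[x_{k-1}] y_{k-1}
        x = matmul(y, z)       # x_k <- y_k z_k
    scale = s0 ** 0.5          # undo the initial scaling by s_0
    return ([[v / scale for v in row] for row in z],
            [[v * scale for v in row] for row in y])

s = [[4.0, 1.0], [1.0, 3.0]]   # hypothetical test metric
inv_sqrt, sqrt_s = sqi_dual(s, 1e-12)
```

Multiplying `sqrt_s` by itself should reproduce `s`, and `inv_sqrt` applied on both sides of `s` should give the identity, up to the convergence tolerance.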

• Super-linear convergence contracts error about identity:
• stability is guaranteed for < 1
• First order variation along unit errors:
• Derivatives are strongly contractive towards identity iteration
(orientational convergence kills error accumulation):
Contractive Identity Iteration (Stability)

Contractive Identity Iteration (Lensing)
• Matrix Market bcsstk14 (Roof of Omni Coliseum)
• Condition number is $10^{10}$, $\tau = 10^{-5}$
• Contraction in the product
Metric locality:
• Locality principle + Cartesian or non-Euclidean separations
• Ordering: space filling curve, graph theory, etc. (random ordering destroys locality)
Algebraic Locality:
• Iteration collapses volumes to one plane (lensing)
• SpAMM bound strengthens

Bifurcations for Ill-Conditioned LCAOs
• (3,3) nanotube metric, U.C. × 36 @ $\kappa(s) = 10^{10}$
• Sensitivity due to full inverse in $\delta z_{k-1}$:
• Use $\tau_s \ll \tau_0$
• Calculations dense through U.C. × 128!

• Most approximate $\tau_0$ controlled by condition $\kappa(s)$
• Control ill-conditioning incrementally with level shifts:
• Product representation: a sandwich of thin, generic SpAMM products that are highly lensed:
• Most approximate $\tau_0\, \mu_0;\ s^{-1/2}$ sets cost
• How low in $\tau_0$ can we go?
Precision Scoping and Iterative Regularization

Most Approximate but Effective by 10
• Thin generic slice: improve $\kappa$ by 10, with 1 digit precision
• (3,3) U.C. × 36 → × 128, w/ $\kappa(s) = 10^{10}$ metric
• Looking at % volume for the SpAMM products $y_k$ and $z_k$:
o dual: spectral resolution tends to quadtree copy in place
o single: spectral resolution becomes increasingly broad
[Figure panels: $\mu_0 = .1$, $\tau_0 = .1$, $\tau_{s_0} = .001$; $\mu_0 = .1$, $\tau_0 = 10^{-2}$, $\tau_{s_0} = 10^{-4}$]

[Figures: × 8 at $k = 0$, $k = 14$ and $k = 37$, each with $\mu_0 = .1$, $\tau_0 = .1$, $\tau_{s_0} = .001$]

Compression for Thin Slice: $.1\ .1;\ s^{-1/2}$
• Nanotubes, $\kappa(s) = 10^{10}$ & × 36 → × 128
• Volume of terminal product relative to $n^3$
$\tau_{s_0} \sim .01 \times \tau_0$

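Ill-Conditioning: $\kappa(s) = 10^{11}$, (3,3)×8 nanotube
$\mathrm{sqi}(s,\ \tau_0 = 10^{-3})$
$\mathrm{sqi}(z_{\tau_0}^T \otimes_{\tau_1} s \otimes_{\tau_1} z_{\tau_0},\ \tau_1 = 10^{-7})$
$\mathrm{sqi}(z_{\tau_1}^T \otimes_{\tau_2} z_{\tau_0}^T \otimes_{\tau_1} s \otimes_{\tau_1} z_{\tau_0} \otimes_{\tau_2} z_{\tau_1},\ \tau_2 = 10^{-11})$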
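The nesting above can be illustrated with a two-level sketch of the single channel instance. The assumptions are the same as elsewhere and are not taken from the slides: $h[x] = (3I - x)/2$, a Gershgorin bound for $s_0$, dense products in place of $\otimes_\tau$, and a pre-scaled metric inside the loop so that $x_0 = s/s_0$ is consistent with $z_0 = I$. A loose first solve produces $z_{\tau_0}$, the congruence $z_{\tau_0}^T s\, z_{\tau_0}$ is then close to the identity, and a tight second solve finishes the job.

```python
def matmul(a, b):
    """Plain dense product, standing in for the SpAMM product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def h(x):
    """Assumed map: unscaled Newton-Schulz, h[x] = (3I - x)/2."""
    n = len(x)
    return [[(3.0 * (i == j) - x[i][j]) / 2.0 for j in range(n)] for i in range(n)]

def sqi_single(s, tau, max_iter=100):
    """Single channel iteration; returns z with z^T s z close to I."""
    n = len(s)
    s0 = max(sum(abs(v) for v in row) for row in s)   # assumed: Gershgorin bound
    ss = [[v / s0 for v in row] for row in s]         # scaled metric, x_0 = s / s_0
    x = [row[:] for row in ss]
    z = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        if abs(sum(x[i][i] for i in range(n)) - n) / n <= tau:
            break
        z = matmul(z, h(x))                       # z_k <- z_{k-1} h[x_{k-1}]
        x = matmul(transpose(z), matmul(ss, z))   # x_k <- z_k^T (s z_k)
    return [[v / s0 ** 0.5 for v in row] for row in z]

s = [[4.0, 1.0], [1.0, 3.0]]                      # hypothetical test metric
z0 = sqi_single(s, 0.1)                           # loose first slice
s1 = matmul(transpose(z0), matmul(s, z0))         # preconditioned metric
z1 = sqi_single(s1, 1e-12)                        # tight second slice
z = matmul(z0, z1)                                # product representation of s^(-1/2)
residual = matmul(transpose(z), matmul(s, z))
```

The inverse square root is never formed in one shot: it is carried as the product `z0 z1`, which is the deferred-product idea the summary slides refer to.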

An Optimized SpAMM Kernel (I)
Bock & Challacombe, SIAM J. Sci. Comput., 35(1), C72
• Assembly coded SpAMM in single precision:
• 50% of peak with 4x4 blocking
• Crossover w/ SGEMM at n = 2000, same error
[Figure label: 6-31G**, Matrix Sign Function]

• In single, SpAMM can beat MKL SGEMM in error also
• Recursion w/locality more accurate than row-col:
An Optimized SpAMM Kernel (II)
Bock & Challacombe, SIAM J. Sci. Comput., 35(1), C72

• N-body potentially communication optimal:
• N-body model supported by common runtimes (sort of):
o OpenMP: attention to memory (blocks & chunks)
o Charm++: does not support recursion, roll your own
• Quantum locality → metric locality
• Temporal locality → persistence load balancing
• Decomposition in 3-D task space, not row-col data space
SpAMM in the Strong Scaling Limit (I)
Bock, Challacombe & Kale, arXiv:1403.7458

SpAMM in the Strong Scaling Limit (II)
• OpenMP: Memory access impacts recursive task parallelism
• Contiguous chunks of memory ($N_c$) and blocksize ($N_b$) vs wall time, 48-core Opteron, 6-31G** spectral projection, (H2O)90:
[Figure annotations: 1 thread, L1 cache exceeded; complexity reduction vs prefetch; 48 threads, NUMA effects; ~80% parallel efficiency]
[Figure panels: first step, $\tau = 10^{-10}$; final step, $\tau = 10^{-10}$]
SpAMM in the Strong Scaling Limit (III)
• Charm++ is a modern runtime with persistence based load
balancing, but does not currently support recursion
• Build unrolled octree mesh of chares
• Dynamic load balancing for 𝑝~30 × 𝑁 (first iteration)
• GreedyCommLB persistence load balancing for 𝑝~500 × 𝑁

A N-Body Solver for Fock Exchange (I)
Challacombe & Bock, J. Chem. Phys. 140 (2014) p. 111101
Recursive occlusion-cull with triple metric query on the Cauchy-Schwarz (direct SCF) criteria:
Quadtree of shell pairs:
Fock exchange as hextree recursion:
[Figure label: 4 of these]
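The triple metric query can be illustrated with a flat-loop sketch of Cauchy-Schwarz screening for the exchange contraction $K_{ac} = \sum_{bd} (ab|cd)\, P_{bd}$, using the bound $|(ab|cd)\, P_{bd}| \le Q_{ab}\, |P_{bd}|\, Q_{cd}$ with $Q_{ab} = \sqrt{(ab|ab)}$. The slides do this recursively on a hextree; the version below is not recursive, and the Schwarz factors and density are hypothetical decay models on a 1-D chain rather than real integrals.

```python
import math

def schwarz_screen(n, tau, gamma=0.5, beta=0.3):
    """Count quartets (a,b,c,d) with Q_ab * |P_bd| * Q_cd >= tau on a model 1-D chain."""
    # Hypothetical model data: Gaussian-like Schwarz factors, exponential density decay.
    q = [[math.exp(-gamma * (a - b) ** 2) for b in range(n)] for a in range(n)]
    p = [[math.exp(-beta * abs(b - d)) for d in range(n)] for b in range(n)]
    kept = 0
    for a in range(n):
        for b in range(n):
            if q[a][b] < tau:                  # all factors are <= 1, so the pair alone fails
                continue
            for d in range(n):
                if q[a][b] * p[b][d] < tau:    # second metric query: density element
                    continue
                for c in range(n):
                    if q[a][b] * p[b][d] * q[c][d] >= tau:   # third query: ket pair
                        kept += 1
    return kept, n ** 4

kept, total = schwarz_screen(12, 1e-6)
surviving_fraction = kept / total
```

The early `continue` statements play the role of culling whole sub-trees: once a partial product of bounds drops below `tau`, nothing beneath it is visited.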

• With permutational symmetry, expect a 4x speedup. Get less
with occlusion & culling:
• Data problem w/ 4 × sub-blocks of exchange and density
tracked, resulting in 7 unique contraction blocks:
A N-Body Solver for Fock Exchange (II)

• Generalized N-body methods resolve new, low
complexity structures that row-col cannot
• 𝒔−1/2 as deferred product of generic solves:
o Apply to target without forming bad inverse
o Compression by orders of magnitude (lensing)
• Generic programming:
o Generic matrix algebra, Fock exchange,
Hartree & exchange-correlation, all N-body
o Easy access to higher dimensional problems
(tensor multiplication, derivatives, … )
o Greatly reduce code base, lower barriers to
entry, minimize bugs, hidden approximations
Fast, Generic and Data Local

• Generic N-body programming model allows focus
on new math (row-col free):
o Rigorous bounds
o Mixed metrics on same footing
o Precision and regularization scoping
o Algebraic locality + other fast methods
(sketching, probing, joining, etc.)
• N-Body problem is communication optimal:
o Strong parallel scaling for fast matrix
multiplication in electronic structure
o Works for runtimes you know
o Kernel independent skeletonization
New Math, One Programming Model
