generalized_nbody_acs_2015_challacombe
1. Electronic Structure Theory as Generic N-Body Problem
Matt Challacombe, Nicolas Bock & Terry Haut
Los Alamos National Laboratory
matt.challacombe@freeon.org
Electronic Structure Methods for Large Systems
August ACS, Boston 2015
Thanks be to LANL, operated by Los Alamos National Security, LLC for the NNSA of the USDoE under Contract No. DE-AC52-06NA25396; released under LA-UR-15-26489
2. Barriers for Large Systems
What's Holding O(N) Methods Back?
• Sparse approximation based on matrix decay
• SpGEMM kernels cannot access strong parallel scaling
• Optimization inhibits evolution:
Entrenched data structures dictate research (row-col)
Funding cycles limit data innovation
Our Vision:
• Fast, generic and data local
• New math, one programming model
3. Conventional O(N) Quantum Chemistry
• Pretend interesting quantum mechanics is highly local
• Conventional SpGEMM kernels (row-col)
• Pray for stability, error control …
[Figure: matrix decay and “sparsification” by radial cutoff and numerical threshold]
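The conventional recipe above can be illustrated with a toy model. This is a minimal sketch, not the method of any particular code: the exponentially decaying test matrix, the function names `decaying_matrix`, `sparsify` and `kept_fraction`, and the parameter values are all assumptions chosen for illustration. It shows that a numerical threshold only produces sparsity when the decay is fast.

```python
import numpy as np

def decaying_matrix(n, rate):
    """Hypothetical model matrix with exponential off-diagonal decay."""
    i = np.arange(n)
    return np.exp(-rate * np.abs(i[:, None] - i[None, :]))

def sparsify(a, tau):
    """Zero every element whose magnitude is below the numerical threshold tau."""
    return np.where(np.abs(a) >= tau, a, 0.0)

def kept_fraction(a, tau):
    """Fraction of elements that survive the threshold."""
    return np.count_nonzero(sparsify(a, tau)) / a.size

n, tau = 200, 1e-6
fast = kept_fraction(decaying_matrix(n, 2.0), tau)   # rapid decay
slow = kept_fraction(decaying_matrix(n, 0.05), tau)  # slow decay
print(f"fast decay: {fast:.1%} kept, slow decay: {slow:.1%} kept")
```

With slow decay nothing falls below the threshold at this size, so the "sparse" matrix stays dense.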
4. The Parallel SpGEMM in Quantum Chemistry (I)
• Cannon’s algorithm for redistribution throttled as O(√p)
• Cannot access strong scaling regime, p >> N
• Best parallel SpGEMM is Aydın Buluç’s method for work homogenization.
Also:
• Solvers entangled with DBCSR require locality
• Massive redistribution on each SCF cycle
[Figure: SCF cycle. Fock build, F = J[P] + a K_F[P] + b K_xc[P]; spectral projection, P = θ[F − μI]; randomize (Cannon, SUMMA); localize. Panels labeled p = 0, 1, 2, 3.]
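Cannon's algorithm on a √p × √p process grid needs √p shift-and-multiply rounds, which is the source of the throttling mentioned on this slide. The following is a serial simulation of that schedule for dense blocks, a sketch only: the function name `cannon` and the list-of-blocks representation of the process grid are assumptions, and no real message passing is performed.

```python
import numpy as np

def cannon(a, b, q):
    """Simulate Cannon's algorithm on a q x q grid (p = q*q ranks); q must divide n."""
    n = a.shape[0]
    s = n // q  # block edge owned by each simulated rank
    A = [[a[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
    B = [[b[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
    rounds = 0
    for _ in range(q):
        # Every rank multiplies the two blocks it currently holds.
        for i in range(q):
            for j in range(q):
                C[i][j] += A[i][j] @ B[i][j]
        # Communication round: A blocks move one step left, B blocks one step up.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
        rounds += 1
    return np.block(C), rounds

rng = np.random.default_rng(0)
a, b = rng.standard_normal((12, 12)), rng.standard_normal((12, 12))
c, rounds = cannon(a, b, q=3)  # p = 9 simulated ranks
```

The number of rounds equals q = √p regardless of how sparse the blocks are, so adding ranks adds communication steps.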
5. The Parallel SpGEMM in Quantum Chemistry (II)
Bowler et al., arXiv:1402.6828
• So far, no QC code has demonstrated a parallel SpGEMM beyond ~ 4 atoms/core
• Enables bigger systems, but not higher throughput
Problem:
• row-col data structures do not respect locality
• row-col data structures do not query
• row-col decompositions lack flexibility
6. Also… High Value Correlations are Extended
• Delocalized physics and local support lead to matrix
ill-conditioning and dense matrices
• Ill-conditioning is a feature, not a bug
• Allows control of the interpolation range:
Quantum Transport (LCAOs)
Radial Basis Function Networks
High Resolution PDE solvers (RBFs)
7. Datacentric, Generic N-Body Solvers
• A hierarchical solver with generic database operations, data locality and local approximation
• Genericity: Treat metric queries same as range queries, same as higher dimensional queries, same as …
• Kernel independent skeletonizations
8. More Locality in Higher Dimensions (I)
• For ill-conditioned systems, matrix decay can be very, very
slow, with matrices that remain dense
• Instead of looking for data compression or sparsity, look
for locality in product tensor volume
• First, a generic quadtree matrix:
• block-by-magnitude ordering
• metric locality resolved by quadtree
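A quadtree matrix of this kind can be sketched as a tree whose nodes carry a norm, so that a metric query can prune whole subtrees. This is an illustrative sketch under assumptions: the class name `QuadNode`, the helper `count_leaves`, the Frobenius norm as the stored magnitude, and the leaf size of 4 are not taken from the slides, and the matrix dimension is assumed to be the leaf size times a power of two.

```python
import numpy as np

class QuadNode:
    """Quadtree node holding its Frobenius norm and either 4 children or a dense leaf."""
    def __init__(self, a, leaf=4):
        self.norm = np.linalg.norm(a)  # Frobenius norm of this sub-block
        n = a.shape[0]
        if n <= leaf:
            self.block, self.children = a.copy(), None
        else:
            h = n // 2
            self.block = None
            self.children = [QuadNode(a[:h, :h], leaf), QuadNode(a[:h, h:], leaf),
                             QuadNode(a[h:, :h], leaf), QuadNode(a[h:, h:], leaf)]

def count_leaves(node, tau):
    """Metric query: count leaf blocks reachable without crossing a node of norm < tau."""
    if node.norm < tau:
        return 0  # the whole subtree is pruned by one comparison
    if node.children is None:
        return 1
    return sum(count_leaves(child, tau) for child in node.children)

i = np.arange(32)
a = np.exp(-np.abs(i[:, None] - i[None, :]))  # hypothetical decaying test matrix
root = QuadNode(a)
all_leaves = count_leaves(root, 0.0)
big_leaves = count_leaves(root, 1e-3)
```

Because the squared Frobenius norm of a node is the sum of the squared norms of its children, a small parent norm guarantees that every descendant is at least as small.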
9. More Locality in Higher Dimensions (II)
• SpAMM is a recursive occlusion-cull of product intermediates
• Double metric query of modified Cauchy-Schwarz criteria
• More task locality in higher dimensional product volume
• Relative error in the product is bounded:
[Figure: occlusion and cull in the product volume]
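The recursive occlusion-cull can be sketched in a few lines. This is a minimal illustration, assuming the simple criterion that a sub-product is skipped when the product of the Frobenius norms of its two operands falls below τ; the slide refers to a modified Cauchy-Schwarz criterion whose exact form is not given here. The function name `spamm`, the leaf size, and the on-the-fly norm evaluation (a real implementation would store norms in the quadtree) are assumptions, and the matrices are assumed square with dimension equal to the leaf size times a power of two.

```python
import numpy as np

def spamm(a, b, tau, leaf=4):
    """Approximate a @ b, culling sub-products with ||a||_F * ||b||_F < tau."""
    n = a.shape[0]
    # Double metric query: one norm from each operand decides the cull.
    if np.linalg.norm(a) * np.linalg.norm(b) < tau:
        return np.zeros((n, n))
    if n <= leaf:
        return a @ b  # dense leaf product
    h = n // 2
    c = np.zeros((n, n))
    # Recurse over the 2 x 2 x 2 octants of the (i, j, k) product volume.
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                c[i*h:(i+1)*h, j*h:(j+1)*h] += spamm(
                    a[i*h:(i+1)*h, k*h:(k+1)*h],
                    b[k*h:(k+1)*h, j*h:(j+1)*h], tau, leaf)
    return c

idx = np.arange(64)
a = np.exp(-0.5 * np.abs(idx[:, None] - idx[None, :]))  # hypothetical decaying matrix
exact = a @ a
approx = spamm(a, a, tau=1e-8)
error = np.linalg.norm(approx - exact)
```

Each culled octant contributes at most τ to the error by submultiplicativity of the norm, so the total error is bounded by τ times the number of culled octants.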
13. Contractive Identity Iteration (Stability)
• Super-linear convergence contracts error about identity:
• stability is guaranteed for < 1
• First order variation along unit errors:
• Derivatives are strongly contractive towards identity iteration (orientational convergence kills error accumulation):
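One iteration with this character is the coupled Newton-Schulz iteration for the inverse square root, in which the product x_k = z_k y_k is driven to the identity. Whether this is exactly the iteration on the slide is an assumption; the sketch below uses dense NumPy products in place of SpAMM, and the function name and test matrix are hypothetical. For a scalar error e = x − 1, one step maps e to −(3/4)e² + (1/4)e³, which is the super-linear contraction about identity.

```python
import numpy as np

def inverse_sqrt_newton_schulz(s, iters=40):
    """Coupled Newton-Schulz: y -> s^(1/2), z -> s^(-1/2), x = z @ y -> identity."""
    n = s.shape[0]
    eye = np.eye(n)
    scale = np.linalg.norm(s, 2)  # largest eigenvalue of SPD s; puts spectrum in (0, 1]
    y, z = s / scale, eye.copy()
    residuals = []
    for _ in range(iters):
        x = z @ y
        residuals.append(np.linalg.norm(x - eye))  # distance from identity
        t = 0.5 * (3.0 * eye - x)
        y, z = y @ t, t @ z
    # z approximates (s / scale)^(-1/2), so undo the scaling.
    return z / np.sqrt(scale), residuals

rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.standard_normal((20, 20)))
s = (q * np.linspace(0.01, 1.0, 20)) @ q.T  # hypothetical SPD matrix, condition number 100
s = 0.5 * (s + s.T)
z, residuals = inverse_sqrt_newton_schulz(s)
```

The recorded residuals shrink slowly at first and then roughly square at each step once they drop below one.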
14. Contractive Identity Iteration (Lensing)
• Matrix Market bcsstk14 (Roof of Omni Coliseum)
• Condition number is 10^10, τ = 10^−5
• Contraction in the product
Metric locality:
• Locality principle + Cartesian or non-Euclidean separations
• Ordering: space filling curve, graph theory, etc. (random ordering destroys locality)
Algebraic Locality:
• Iteration collapses volumes to one plane (lensing)
• SpAMM bound strengthens
15. Bifurcations for Ill-Conditioned LCAOs
• (3,3) nanotube metric, U.C. × 36 @ κ(𝒔) = 10^10
• Sensitivity due to full inverse in δ𝒛_(k−1):
• Use τ_s ≪ τ_0
• Calculations dense through U.C. × 128!
16. Precision Scoping and Iterative Regularization
• Most approximate τ_0 controlled by condition κ(𝒔)
• Control ill-conditioning incrementally with level shifts:
• Product representation: a sandwich of thin, generic SpAMM products that are highly lensed:
• Most approximate τ_0, μ_0; 𝒔^(−1/2) sets cost
• How low in τ_0 can we go?
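The effect of a level shift on conditioning can be checked numerically. This sketch is one reading of the slide, not a statement of the authors' scheme: it assumes the regularized factor is z_0 = (s + μ_0 I)^(−1/2), applies it as a sandwich z_0 s z_0, and uses exact eigendecompositions in place of SpAMM iterations. The helper `inv_sqrt`, the test matrix, and μ_0 = 0.1 are illustrative choices.

```python
import numpy as np

def inv_sqrt(m):
    """Exact inverse square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v / np.sqrt(w)) @ v.T

rng = np.random.default_rng(0)
n, mu0 = 50, 0.1
q, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = (q * np.logspace(-6, 0, n)) @ q.T  # hypothetical SPD matrix, condition number 1e6
s = 0.5 * (s + s.T)

z0 = inv_sqrt(s + mu0 * np.eye(n))  # well-conditioned, level-shifted factor
s1 = z0 @ s @ z0                    # sandwich: preconditioned matrix
s1 = 0.5 * (s1 + s1.T)
z = z0 @ inv_sqrt(s1)               # product representation of s^(-1/2)

gain = np.linalg.cond(s) / np.linalg.cond(s1)
print(f"condition number improved by a factor of {gain:.1f}")
```

Since the factors are all functions of s they commute, so the product of the shifted factor and the inverse square root of the sandwiched matrix reproduces s^(−1/2); for a spectrum scaled to at most one, a shift of μ_0 improves the condition number by roughly (1 + μ_0)/μ_0.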
17. Most Approximate but Effective by 10
• Thin generic slice: improve 𝜅 by 10, with 1 digit precision
• (3,3) U.C. × 36 → × 128 metric, w/ κ(𝒔) = 10^10
• Looking at % volume for the SpAMM products 𝒚_k and 𝒛_k:
dual: spectral resolution tends to quadtree copy in place:
single: spectral resolution becomes increasingly broad:
[Figure panels: μ_0 = 0.1, τ_0 = 0.1, τ_s0 = 0.001; μ_0 = 0.1, τ_0 = 10^−2, τ_s0 = 10^−4]
23. An Optimized SpAMM Kernel (I)
Bock & Challacombe, SIAM J. Sci. Comput., 35(1), C72
• Assembly coded SpAMM in single precision:
• 50% of peak with 4x4 blocking
• Crossover w/ SGEMM at n = 2000, same error
[Figure: 6-31G**, matrix sign function]
24. An Optimized SpAMM Kernel (II)
Bock & Challacombe, SIAM J. Sci. Comput., 35(1), C72
• In single, SpAMM can beat MKL SGEMM in error also
• Recursion w/locality more accurate than row-col:
25. SpAMM in the Strong Scaling Limit (I)
Bock, Challacombe & Kale arXiv:1403.7458
• N-body potentially communication optimal:
• N-body model supported by common runtimes (sort of):
o OpenMP: attention to memory (blocks & chunks)
o Charm++: does not support recursion, roll your own
• Quantum locality → metric locality
• Temporal locality → persistence load balancing
• Decomposition in 3-D task space, not row-col data space
26. SpAMM in the Strong Scaling Limit (II)
• OpenMP: Memory access impacts recursive task parallelism
• Contiguous chunks of memory (N_c) and blocksize (N_b) vs wall time, 48 core Opteron, 6-31G** spectral projection, (H2O)90:
[Figure annotations: 1 thread, L1 cache exceeded, complexity reduction vs prefetch; 48 threads, NUMA effects, ~80% parallel efficiency]
27. SpAMM in the Strong Scaling Limit (III)
[Figure: first step and final step, both at τ = 10^−10]
• Charm++ is a modern runtime with persistence based load
balancing, but does not currently support recursion
• Build unrolled octree mesh of chares
• Dynamic load balancing for 𝑝~30 × 𝑁 (first iteration)
• GreedyCommLB persistence load balancing for 𝑝~500 × 𝑁
28. An N-Body Solver for Fock Exchange (I)
Challacombe & Bock, J. Chem. Phys. 140 (2014) p. 111101
Recursive occlusion-cull with triple metric query on the Cauchy-Schwarz (direct SCF) criteria:
Quadtree of shell pairs:
Fock exchange as hextree recursion:
[Figure annotation: 4 of these]
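For reference, the standard direct SCF form of the Cauchy-Schwarz criterion for exchange can be written out as follows. The element-wise bound is the textbook inequality; the block-level culling test in the last line, with norms taken over quadtree blocks and the symbols Q and τ, is an assumed form of the triple metric query and is not copied from the slide.

```latex
\begin{align}
K_{ab} &= \sum_{cd} (ac|bd)\, P_{cd}, \\
|(ac|bd)| &\le Q_{ac}\, Q_{bd}, \qquad Q_{ac} = \sqrt{(ac|ac)}, \\
\left| (ac|bd)\, P_{cd} \right| &\le Q_{ac}\, |P_{cd}|\, Q_{bd}, \\
\text{cull a block contraction when} \quad
&\| Q_{[ac]} \| \, \| P_{[cd]} \| \, \| Q_{[bd]} \| < \tau .
\end{align}
```

The three norms in the last line correspond to three quadtree lookups, one for each shell-pair block and one for the density block.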
29. An N-Body Solver for Fock Exchange (II)
• With permutational symmetry, expect a 4x speedup. Get less with occlusion & culling:
• Data problem w/ 4 × sub-blocks of exchange and density tracked, resulting in 7 unique contraction blocks:
30. Fast, Generic and Data Local
• Generalized N-body methods resolve new, low complexity structures that row-col cannot
• 𝒔^(−1/2) as deferred product of generic solves:
o Apply to target without forming bad inverse
o Compression by orders of magnitude (lensing)
• Generic programming:
o Generic matrix algebra, Fock exchange, Hartree & exchange-correlation, all N-body
o Easy access to higher dimensional problems (tensor multiplication, derivatives, … )
o Greatly reduce code base, lower barriers to entry, minimize bugs, hidden approximations
31. New Math, One Programming Model
• Generic N-body programming model allows focus on new math (row-col free):
o Rigorous bounds
o Mixed metrics on same footing
o Precision and regularization scoping
o Algebraic locality + other fast methods (sketching, probing, joining, etc.)
• N-Body problem is communication optimal:
o Strong parallel scaling for fast matrix multiplication in electronic structure
o Works for runtimes you know
o Kernel independent skeletonization