Single-node hardware design is shifting toward heterogeneity, and many of today’s largest HPC systems are clusters that combine accelerators in heterogeneous compute device architectures. The need for new programming abstractions on the path to the Exascale era has been widely recognized, and variants of the Partitioned Global Address Space (PGAS) programming model are discussed as a promising approach in this respect. DASH is a C++ template library that provides distributed data structures with support for hierarchical locality in a PGAS programming model. Portable efficiency, an essential goal in the design of DASH, can only be achieved with programming abstractions of hardware locality that allow data structures and algorithms to be optimized for the underlying system at compile time and run time. Established tools like LIKWID and hwloc provide reliable interfaces to query the hardware topology at node level, but fail to construct a global representation of distributed locality domains and do not support accelerator architectures like Intel MIC. We present Locality Hierarchies, an abstraction of distributed, hierarchical locality represented as a modifiable data structure. The underlying model supports heterogeneous systems as a first-class use case and introduces a well-defined concept of distance for arbitrary distributed hardware hierarchies. Using common range-based algorithms as motivating examples, we explain how our approach facilitates locality-aware load balancing and process mapping on SuperMIC compute nodes.
2. Motivation
- Portable efficiency of ported applications like LULESH and graph
  applications for heterogeneous systems
- hwloc: static hierarchical hardware topology
- DASH: locality as a view of a variable run-time configuration
  … extended to the full distributed topology
  … supporting heterogeneous systems, esp. Intel MIC
  … designed for use cases like hierarchical graph partitioning
DASH Locality Hierarchies
3. Trees vs. Reality
- We do not model hardware locality as trees
  … because many represented systems are anything but trees
Image source: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
4. Related Concepts
Combining existing abstractions
- Hierarchical Place Trees
abstraction of locality scopes and process mapping
- hwloc
node-level machine topology and hardware capacity
focuses on topology discovery (once, at startup) and
provides queries on topology data
Locality Hierarchies
- are specifically designed for user-specified views and modifications of the
topology representation
- are specifically designed for hierarchical process structures (teams)
5. Examples: Xeon Phi
Compute Node with Xeon Phi Accelerators as seen by hwloc
9. Key Functionalities
Why we need locality hierarchies in DART/DASH:
- Locality-optimized grouping of processes into teams
- Load-balancing requires:
  - topology information
    … obviously, to find suitable processes for balancing
  - hardware capacities
    … like the number of cores/threads and (shared) memory capacities
      available to processes
- Dynamic distance measures: variable at run time instead of a static
  distance matrix
13. Fundamental Operations on Locality Hierarchies
Basic principle of usage:
( (filter/select) (group/split) )*
C API; Python bindings in development; Fortran bindings feasible
14. Usage of C++ API
// split into num_groups teams at NUMA locality scope
auto & numa_team = dash::Team::All().locality_split(
  dash::util::Locality::Scope::NUMA,
  num_groups); // optional, defaults to one team per locality scope

// split into two teams: leader team and workers
auto & leader_team = dash::Team::All().leader_split(
  dash::util::Locality::Scope::NODE);

// split into teams by predicate
auto & custom_team = dash::Team::All().specific_split(
  [](const dash::util::LocalityDomain & ld) {
    // ...
    return new_team_id_for_ld;
  });