Graph500
1. Graph 500 Benchmark and Reference
Implementations
David Bader, Jason Riedy
Georgia Institute of Technology
(booth 1561)
2. Benchmark Problem
Initial benchmark problem:
Graph Search (BFS)
● Convert an input edge list to some internal format once (timed).
● Randomly select multiple search roots.
● Separately compute breadth-first search trees starting from
each search root (timed).
● Return the array of parent nodes; parent[i] = j means j is the
parent of i in the tree.
● Validate the output.
Other problems under consideration for the future (e.g.
independent set, ...)
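The search and validation steps above can be sketched in a few lines of Python. This is only an illustration, not one of the reference codes: the function names `bfs_parents` and `validate`, the convention that the root is its own parent, and the use of -1 for unreached vertices are all assumptions made for this example, and the checks shown are a subset of what a full validator would do.

```python
from collections import deque

def bfs_parents(num_vertices, edges, root):
    """Return parent[] for a BFS tree from root; -1 marks unreached vertices."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:  # self-loops never become tree edges
            adj[u].append(v)
            adj[v].append(u)
    parent = [-1] * num_vertices
    parent[root] = root  # assumed convention: the root is its own parent
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    return parent

def validate(edges, parent, root):
    """Check a few properties that any correct BFS parent array satisfies."""
    n = len(parent)
    if parent[root] != root:
        return False
    # Derive each reached vertex's depth by walking up toward the root.
    level = [-1] * n
    level[root] = 0
    for v in range(n):
        if parent[v] == -1 or level[v] != -1:
            continue
        path, u = [], v
        while level[u] == -1:
            if parent[u] == -1 or len(path) > n:  # broken chain or cycle
                return False
            path.append(u)
            u = parent[u]
        for w in reversed(path):
            level[w] = level[parent[w]] + 1
    edge_set = set()
    for u, v in edges:
        edge_set.add((u, v))
        edge_set.add((v, u))
        if (level[u] == -1) != (level[v] == -1):
            return False  # an edge cannot leave the reached component
        if level[u] != -1 and abs(level[u] - level[v]) > 1:
            return False  # BFS levels across any edge differ by at most one
    # Every tree edge must be an edge of the input graph.
    return all((v, parent[v]) in edge_set
               for v in range(n) if parent[v] != -1 and v != root)

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 5)]
parent = bfs_parents(6, edges, root=0)
print(parent)                    # [0, 0, 0, 1, -1, -1]
print(validate(edges, parent, 0))
```

Vertices 4 and 5 sit in a different component from the root, so they stay at -1; the validator accepts that because the edge between them has both endpoints unreached.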
3. Benchmark & Reference Impl. Structure
1. Generate the edge list.
2. Construct a graph from the edge list (timed).
3. Randomly sample 64 unique search keys with
degree at least one, not counting self-loops.
4. For each search key:
   1. Compute the BFS parent array (timed).
   2. Validate that the parent array is a correct BFS
   search tree for the given search key.
5. Compute and output performance information.
● (Take care to report correct quartiles, means, and
deviations, e.g. the harmonic mean for rates.)
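To see why the choice of mean matters for rates, here is a small Python sketch using only the standard `statistics` module. The function name `rate_summary` and the sample numbers are made up for illustration; they are not benchmark results.

```python
import statistics

def rate_summary(edges_traversed, seconds):
    """Summarize per-search traversal rates (edges per second)."""
    rates = [m / t for m, t in zip(edges_traversed, seconds)]
    q1, median, q3 = statistics.quantiles(rates, n=4)
    return {
        "min": min(rates),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(rates),
        "harmonic_mean": statistics.harmonic_mean(rates),
        "arithmetic_mean": statistics.fmean(rates),
    }

# Four hypothetical searches, each traversing 100 edges.
summary = rate_summary([100, 100, 100, 100], [1.0, 2.0, 4.0, 4.0])
print(summary["harmonic_mean"])    # about 36.4 = 400 edges / 11 seconds
print(summary["arithmetic_mean"])  # 50.0, which overstates the throughput
```

When every search traverses the same number of edges, the harmonic mean of the rates equals total edges divided by total time, which is why it is the natural mean for a rate such as TEPS.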
4. Problem Classes
● Sizes chosen to range from currently accessible to
optimistically ahead.
● Chosen as powers of two close to powers of 10.
● Toy: 10^10 → 2^26 = 17 GiB
● Huge: 10^15 → 2^42 = 1.1 PiB!
● Submissions ranged up to the Medium class.
● Next year, will someone tackle Large? Huge?
Problem Class    Size
Toy (10)         17 GiB
Mini (11)        140 GiB
Small (12)       1.1 TiB
Medium (13)      18 TiB
Large (14)       140 TiB
Huge (15)        1.1 PiB
5. Reference Implementations
Multiple reference implementations:
● High-level but non-definitive code in GNU Octave.
● Single shared-memory driver for:
● two sequential examples,
● one OpenMP code, and
● two Cray XMT codes.
● Separate, fully distributed MPI code from Jeremiah Willcock of
Indiana (who also wrote the reproducible, parallel generator).
(This space intentionally left unoptimized.)
6. Reference Implementations
Multiple reference implementations:
● High-level sketch in GNU Octave. (24 lines in the timed kernels
as counted by cloc)
● Not intended to be definitive.
● Used for executable examples in specification.
● Two sequential codes to demonstrate that the driver handles
different kernels.
● The first forms a linked list on the unaltered, uncopied input.
(103 lines)
● The second copies into a CSR graph representation. (171
lines)
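For readers unfamiliar with CSR (compressed sparse row), the following Python sketch shows one common way to copy an edge list into that layout. It is not the reference code: the name `build_csr`, the decision to drop self-loops, and storing each undirected edge in both directions are assumptions made for this example.

```python
def build_csr(num_vertices, edges):
    """Copy an undirected edge list into CSR form: (offsets, neighbors)."""
    # First pass: count each vertex's degree, skipping self-loops.
    degree = [0] * num_vertices
    for u, v in edges:
        if u != v:
            degree[u] += 1
            degree[v] += 1
    # A prefix sum of the degrees gives where each adjacency list starts.
    offsets = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        offsets[i + 1] = offsets[i] + degree[i]
    # Second pass: scatter neighbors into their slots.
    neighbors = [0] * offsets[-1]
    cursor = offsets[:-1]  # slicing copies, so offsets stays untouched
    for u, v in edges:
        if u != v:
            neighbors[cursor[u]] = v
            cursor[u] += 1
            neighbors[cursor[v]] = u
            cursor[v] += 1
    return offsets, neighbors

offsets, neighbors = build_csr(3, [(0, 1), (0, 2), (1, 2), (2, 2)])
print(offsets)    # [0, 2, 4, 6]
print(neighbors)  # [1, 2, 0, 2, 0, 1]
```

The neighbors of vertex `v` are `neighbors[offsets[v]:offsets[v + 1]]`, so a BFS can scan each adjacency list as one contiguous slice.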
7. Reference Implementations
Multiple reference implementations:
● One OpenMP code for wide portability. (342 lines)
● Uses mmap for pseudo-out-of-core operation, can tackle
anything that fits on a disk if you have the time...
● A Cray XMT code and a slight variation. (186 lines, 210 lines)
● Slight variation reduces hot-spotting in the BFS queue.
● An MPI code by Jeremiah Willcock from Indiana. (1107 lines)
● Fully distributed, runtime on SMP roughly comparable to
OpenMP.
(This space intentionally left unoptimized.)
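The mmap idea mentioned for the OpenMP code can be illustrated with Python's standard `mmap` module. This is a sketch of the general technique only, not the reference implementation: the on-disk format (pairs of native 64-bit integers), the file name, and the function names are assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

def write_edge_file(path, edges):
    """Store each edge as two native 64-bit signed integers (assumed format)."""
    with open(path, "wb") as f:
        for u, v in edges:
            f.write(struct.pack("qq", u, v))

def count_degrees_mapped(path, num_vertices):
    """Scan a memory-mapped edge file; the OS pages data in on demand."""
    degree = [0] * num_vertices
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = memoryview(mm).cast("q")  # view the bytes as int64 values
            try:
                for i in range(0, len(view), 2):
                    u, v = view[i], view[i + 1]
                    if u != v:  # ignore self-loops
                        degree[u] += 1
                        degree[v] += 1
            finally:
                view.release()  # must be released before the map closes
    return degree

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges.bin")
    write_edge_file(path, [(0, 1), (0, 2), (1, 2), (2, 2)])
    degrees = count_degrees_mapped(path, 3)
print(degrees)  # [2, 2, 2]
```

Because the file is mapped rather than read into a buffer, the same loop works on an edge list larger than physical memory, at the cost of disk-speed access.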
8. Untuned Performance for Comparison
Untuned OpenMP on scale-24 (smaller than Toy) using dual
quad-core Intel Xeon X5570 processors (2.93GHz, 8MiB cache)
with 48 GiB physical memory. The 16-thread results use
HyperThreading. The toy class ran too long...
Threads      Mean time (s)   Mean rate (TEPS)
4            9.2             1.0 x 10^7
8            6.9             1.1 x 10^7
16           4.9             0.91 x 10^7
Untuned Cray XMT implementation performance against the toy
class on PNNL's 128-processor Cray XMT
Processors   Mean time (s)   Mean rate (TEPS)
32           23.7            4.5 x 10^7
64           24.3            4.4 x 10^7
128          28.2            3.8 x 10^7
9. [ EXPLORATION OF SHARED MEMORY GRAPH BENCHMARKS:
THE GRAPH500 ]
David A. Bader (PI), Jason Riedy
[ OBJECTIVE ]
Explore benchmarks for high-performance
data-intensive computations on parallel,
shared-memory platforms.
[ DESCRIPTION ]
Current high-performance architectures are
built to run linear algebra operations
effectively. These architectures seem a poor
fit for the massive growth of irregular data
coming from biological, social, regulatory,
and other sources. There are no widely
supported benchmarks to guide
architectural decisions for these
applications.
Georgia Tech worked within the Graph500
steering committee to draft a new breadth-first
search benchmark acceptable for wide
participation. Georgia Tech also provided
and supports the OpenMP and Cray XMT
shared-memory reference codes.
For more: Visit the Graph500 BoF!
[ FUNDING ]
Sandia National Labs
[Figure: example graph on vertices 0-9 with the source vertex marked]
Image Source: Nexus (Facebook application)
Image Source: Giot et al., “A Protein Interaction Map of Drosophila
melanogaster”, Science 302, 1722-1736, 2003
Problem Class    Size
Toy (10)         17 GiB
Mini (11)        140 GiB
Small (12)       1.1 TiB
Medium (13)      18 TiB
Large (14)       140 TiB
Huge (15)        1.1 PiB