Fast & Energy-Efficient Breadth-First Search on a Single NUMA System
1. Fast & Energy-Efficient Breadth-First Search on a Single NUMA System
Yuichiro Yasui & Katsuki Fujisawa
Kyushu University & JST CREST
Yukinori Sato
JAIST & JST CREST
ISC14 (International Supercomputing Conference 2014)
Research Papers 08 ‒ Energy Efficiency, June 26, 2014
2. Outline
1. Background
2. Fast computation of graph processing
– Related work and our previous contributions
3. Bottleneck analysis of our previous NUMA-optimized BFS
4. Our proposal: Degree-aware BFS
5. Performance evaluation of the proposed BFS
– Fast for Graph500 benchmark
– Energy-efficient for Green graph500 benchmark
3. Background
• Large-scale graphs in various fields
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship network (social network): 61.6 million vertices & 1.47 billion edges
– Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
– Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC
4. Graph analysis and important kernel BFS
• The cycle of graph analysis for understanding real networks
– Application fields: transportation, social network, cyber-security, bioinformatics
– Kernels of graph processing: concurrent search (breadth-first search), optimization (single source shortest path), edge-oriented (maximal independent set)
[Figure: three-step cycle (Step 1, Step 2, Step 3) linking application field, relationships, graph, graph processing, results, and understanding]
Breadth-first search (BFS)
• One of the most important and fundamental graph processing kernels
• Many algorithms and applications are based on it (max-flow and centrality)
• Low arithmetic intensity & irregular memory accesses
• Inputs: graph and source vertex
• Outputs: distance (level) and predecessor of each vertex from the source
[Figure: BFS from the source vertex, with levels Lv. 1 to Lv. 3]
5. Target: NUMA arch. system
• 4-way Intel Xeon E5-4640 (Sandybridge-EP)
– 4 (# of CPU sockets)
– 8 (# of physical cores per socket)
– 2 (# of threads per core)
– Max. 4 x 8 x 2 = 64 threads
• NUMA node = CPU socket (16 logical cores) + local RAM
– Each 8-core Xeon E5-4640 has processor cores with L2 caches and a shared L3 cache
– Memory access to local RAM: fast
– Memory access to remote RAM: slow
• NUMA-aware computation
– Reduces and avoids memory accesses to remote RAM
6. Graph500 Benchmark
• Fast computation of graph processing is a significant topic in HPC
• The Graph500 benchmark measures computer performance using the TEPS ratio (# of traversed edges per second) in graph processing such as BFS (breadth-first search)
• Benchmark flow
– Input parameters: SCALE and edgefactor (= 16)
– 1. Generation: graph generation
– 2. Construction: graph construction
– 3. BFS x 64: BFS and validation, 64 iterations
– Results: SCALE, edgefactor, BFS time, traversed edges, TEPS
– Output: median TEPS ratio
• Kronecker graph
– Synthetic scale-free network generated by using a recursive Kronecker product
– 2^SCALE vertices and 2^SCALE × edgefactor edges
– e.g.) SCALE 30 and edgefactor 16: 1 billion vertices and 17.2 billion edges
www.graph500.org
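The problem size and the TEPS ratio on this slide reduce to simple arithmetic. A minimal Python sketch follows; the function names and the sample run timings are illustrative assumptions, not part of the Graph500 reference code.

```python
import statistics

def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) = (2^SCALE, 2^SCALE * edgefactor)."""
    n = 2 ** scale
    return n, n * edgefactor

def teps(traversed_edges, bfs_seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / bfs_seconds

n, m = kronecker_size(30)
print(n, m)  # 1073741824 17179869184

# Hypothetical (traversed edges, seconds) pairs standing in for the timed BFS runs.
runs = [(1000, 0.5), (1000, 0.25), (1000, 1.0)]
median_teps = statistics.median(teps(e, t) for e, t in runs)
print(median_teps)  # 2000.0
```

With SCALE 30 and edgefactor 16 this gives the roughly 1 billion vertices and 17.2 billion edges quoted on the slide.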
7. Level-synchronized parallel BFS (Top-down)
• Starts from the source vertex and executes the following two phases (Traversal and Swap) for each level
• Notation: the input of a BFS is a graph G = (V, E) with a set of vertices V and a set of edges E, stored as adjacency lists A(v) that contain the edges (v, w) ∈ E for each vertex v ∈ V; the output is a predecessor map π for the BFS tree rooted at the source vertex s ∈ V
• Graph500 timing: the timed BFS phase and the untimed verify phase are iterated 64 times, and a submission must report five TEPS ratios: the minimum, first quartile, median, third quartile, and maximum
Algorithm 1: Level-synchronized Parallel BFS.
Input: G = (V, A): unweighted directed graph; s: source vertex.
Variables: QF: frontier queue; QN: neighbor queue; visited: vertices already visited.
Output: π(v): predecessor map of BFS tree.
π(v) ← −1, ∀v ∈ V
π(s) ← s
visited ← {s}
QF ← {s}
QN ← ∅
while QF ≠ ∅ do
    for v ∈ QF in parallel do
        for w ∈ A(v) do
            if w ∉ visited atomic then
                π(w) ← v
                visited ← visited ∪ {w}
                QN ← QN ∪ {w}
    QF ← QN
    QN ← ∅
• Traversal … finds unvisited adjacent vertices of the current frontier QF and appends them to the neighbor queue QN
• Swap … swaps the frontier QF and the neighbor queue QN for the next level
[Figure: frontier QF at level k and neighbors QN at level k+1]
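The Traversal and Swap phases can be sketched sequentially in Python. This is a minimal illustration, assuming the graph is given as a list of adjacency lists; the "in parallel" loop and the atomic visited check of the slide's algorithm are replaced here by a plain loop, and the visited set is folded into the predecessor list (−1 means unvisited).

```python
def top_down_bfs(adj, source):
    """Level-synchronized BFS; returns the predecessor list pi (-1 = unvisited)."""
    pi = [-1] * len(adj)
    pi[source] = source
    frontier = [source]                 # QF
    while frontier:
        neighbors = []                  # QN
        # Traversal: scan the adjacency lists of the current frontier.
        for v in frontier:
            for w in adj[v]:
                if pi[w] == -1:         # w not yet visited
                    pi[w] = v
                    neighbors.append(w)
        # Swap: the neighbor queue becomes the next frontier.
        frontier = neighbors
    return pi

adj = [[1, 2], [3], [3], [], [0]]
print(top_down_bfs(adj, 0))  # [0, 0, 0, 1, -1]
```

Vertex 4 stays at −1 because it is not reachable from the source.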
8. Hybrid-BFS (Direction-optimizing BFS)
Chooses either Top-down or Bottom-up according to the frontier size at each level [Beamer2012]
Observation of data accesses in forward (Top-down) and backward (Bottom-up) search
Top-down algorithm
• Efficient for a small frontier
• Uses outgoing edges
• Data writes in the forward search: v → w (from the current frontier to its unvisited neighbors)
Input: directed graph G = (V, AF), queue QF
Data: queue QN, visited, tree π(v)
QN ← ∅
for v ∈ QF in parallel do
    for w ∈ AF(v) do
        if w ∉ visited atomic then
            π(w) ← v
            visited ← visited ∪ {w}
            QN ← QN ∪ {w}
QF ← QN
Bottom-up algorithm
• Efficient for a large frontier
• Uses incoming edges
• Data writes in the backward search: w → v (from the candidates of neighbors to the current frontier)
• Skips unnecessary edge traversal
Input: directed graph G = (V, AB), queue QF
Data: queue QN, visited, tree π(v)
QN ← ∅
for w ∈ V \ visited in parallel do
    for v ∈ AB(w) do
        if v ∈ QF then
            π(w) ← v
            visited ← visited ∪ {w}
            QN ← QN ∪ {w}
            break
QF ← QN
[Figure: Top-down expands the frontier at level k to its neighbors at level k+1; Bottom-up checks the candidates of neighbors at level k+1 against the current frontier at level k]
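The two directions and the per-level choice between them can be sketched as follows. This is a sequential Python illustration under stated assumptions: `adj_out` and `adj_in` are lists of outgoing and incoming adjacency lists, and the switch rule (bottom-up when the frontier holds more than a `threshold` fraction of the vertices) is a simplified stand-in, not the heuristic of Beamer2012 or of the authors.

```python
def top_down_step(adj_out, frontier, pi):
    """Scan outgoing edges of frontier vertices and claim unvisited neighbors."""
    next_frontier = []
    for v in frontier:
        for w in adj_out[v]:
            if pi[w] == -1:
                pi[w] = v
                next_frontier.append(w)
    return next_frontier

def bottom_up_step(adj_in, frontier, pi):
    """Each unvisited vertex scans its incoming edges for a frontier vertex."""
    in_frontier = set(frontier)
    next_frontier = []
    for w in range(len(adj_in)):
        if pi[w] != -1:
            continue
        for v in adj_in[w]:
            if v in in_frontier:
                pi[w] = v
                next_frontier.append(w)
                break               # skip the remaining incoming edges
    return next_frontier

def hybrid_bfs(adj_out, adj_in, source, threshold=0.05):
    n = len(adj_out)
    pi = [-1] * n
    pi[source] = source
    frontier = [source]
    while frontier:
        if len(frontier) > threshold * n:   # hypothetical switch rule
            frontier = bottom_up_step(adj_in, frontier, pi)
        else:
            frontier = top_down_step(adj_out, frontier, pi)
    return pi

path = [[1], [0, 2], [1, 3], [2]]    # undirected path 0-1-2-3
print(hybrid_bfs(path, path, 0, threshold=1.0))  # always top-down: [0, 0, 1, 2]
print(hybrid_bfs(path, path, 0, threshold=0.0))  # always bottom-up: [0, 0, 1, 2]
```

Both directions produce a valid BFS tree; they differ only in how many edges they inspect at each level, which is what the `break` in the bottom-up step exploits.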
9. Hybrid-BFS (Direction-optimizing BFS)
Chooses either Top-down or Bottom-up according to the number of traversed edges at each level [Beamer2012]
• Hybrid-BFS reduces unnecessary edge traversals
Number of traversed edges of Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30), for Top-down (forward search) and Bottom-up (backward search)
Level  Top-down       Bottom-up      Hybrid
0      2              2,103,840,895  2
1      66,206         1,766,587,029  66,206
2      346,918,235    52,677,691     52,677,691
3      1,727,195,615  12,820,854     12,820,854
4      29,557,400     103,184        103,184
5      82,357         21,467         21,467
6      221            21,240         227
Total  2,103,820,036  3,936,072,360  65,689,631
Ratio  100.00%        187.09%        3.12%
[Chart: number of traversed edges by distance from source, with Top-down, Bottom-up, and Top-down regions; annotation "= |E|"]
10. NUMA-optimized BFS
• Clearly separates accesses to local and remote memory
– Edge traversal on local RAM
– All-gather of local queues and bitmaps from remote RAM
• At each level:
– Traversal on local RAM: searches local neighbors from the locally copied frontier, using NUMA-optimized Bottom-up if the frontier is large and NUMA-optimized Top-down otherwise
– Swap on remote RAM: aggregates the local frontier queues QN0, QN1, QN2, QN3 into QF
• NUMA-opt. requires two CSR graphs
– Outgoing edges for Top-down
– Incoming edges for Bottom-up
※ Not the same for an undirected graph
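The split between local traversal and the all-gather swap can be simulated in a few lines of Python. This is only a sketch under assumptions that are not taken from the slides: vertices are split into contiguous ranges (one per simulated NUMA node), a plain loop stands in for the nodes running concurrently, only the bottom-up direction is shown, and all function names are hypothetical.

```python
def partition(n, num_nodes):
    """Split vertex ids 0..n-1 into contiguous ranges, one per simulated NUMA node."""
    size = (n + num_nodes - 1) // num_nodes
    return [range(k * size, min(n, (k + 1) * size)) for k in range(num_nodes)]

def local_bottom_up(adj_in, part, frontier, pi):
    """Traversal: a node only writes pi for the vertices it owns."""
    local_queue = []
    for w in part:
        if pi[w] != -1:
            continue
        for v in adj_in[w]:
            if v in frontier:
                pi[w] = v
                local_queue.append(w)
                break
    return local_queue

def bfs_numa_sim(adj_in, source, num_nodes=2):
    n = len(adj_in)
    parts = partition(n, num_nodes)
    pi = [-1] * n
    pi[source] = source
    frontier = {source}
    while frontier:
        # Traversal on "local RAM": one local queue per node.
        local_queues = [local_bottom_up(adj_in, p, frontier, pi) for p in parts]
        # Swap: all-gather the local queues into the next shared frontier.
        frontier = {w for q in local_queues for w in q}
    return pi

path = [[1], [0, 2], [1, 3], [2]]
print(bfs_numa_sim(path, 0, num_nodes=2))  # [0, 0, 1, 2]
```

In this layout every write during traversal lands in the owning node's range, and the only cross-node step is the gather at the end of each level.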
12. CPU Affinity and local memory binding
• ULIBC: Ubiquity Library for Intelligently Binding Cores
– provides some routines for CPU affinity + local memory binding
– manages each processor core (processor ID) by topology information as a tuple of (SMT ID, core ID, package ID)
– Processor ID: index of logical processor core
– Package ID: index of CPU socket
– Core ID: index of physical core in each CPU socket
– SMT ID: index of thread in each physical core
• CPU affinity
– 1. Detects online processors (those allocated to the current process, out of all processors) using the sched_getaffinity system call
– 2. Binds each thread to a logical core using the sched_setaffinity system call or the Intel compiler Thread Affinity Interface
[Figure: NUMA node 0 and NUMA node 1, each with cores and local RAM; some cores are in use by other processes]
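The two affinity steps map onto the `sched_getaffinity` and `sched_setaffinity` wrappers in Python's `os` module (available on Linux). The sketch below is not the ULIBC API; the round-robin plan and the function names are illustrative assumptions.

```python
import os

def online_processors():
    """Logical CPUs the current process may run on (Linux), else all CPUs."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))

def affinity_plan(num_threads, cpus):
    """Assign thread i to a logical CPU in round-robin order (an assumption)."""
    return {i: cpus[i % len(cpus)] for i in range(num_threads)}

def bind_caller(cpu):
    """Pin the caller (pid 0) to one logical CPU; returns False if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {cpu})
    return True

cpus = online_processors()          # step 1: detect online processors
plan = affinity_plan(4, cpus)       # step 2 would call bind_caller(plan[i]) in thread i
print(plan)
```

Detecting the allowed set first matters on shared machines, where other processes may already occupy part of the node.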
13. Related work: TEPS ratios on a single node
• Our BFS achieves 31.7 GTEPS for Kronecker graph (SCALE 27)
Implementation          Algorithm                   System                    GTEPS
Reference code          –                           –                         0.1
Agarwal2010             NUMA-aware Top-down BFS     4-way Intel Xeon 7560     0.8 (m/n = 64: 1.1)
Beamer2012              Hybrid-BFS                  4-way Intel Xeon E5-8870  5.1
Yasui2013               NUMA-opt. Hybrid-BFS        4-way Intel Xeon E5-4640  11.1
Yasui2014 (this paper)  Degree-aware NUMA-opt. BFS  4-way Intel Xeon E5-4650  31.7
[Figure: GTEPS (0 to 35) versus SCALE (20 to 29) for the reference code, Agarwal2010, Beamer2012, Yasui2013, and Yasui2014; speedups annotated on the chart: x 2.2, x 2.6, x 5.9 faster]
14. Breakdown of hybrid-BFS
for Kronecker graph with SCALE 27 (2^27 vertices and 2^31 edges)
#Traversed edges
Level  Step       Hybrid-BFS
0      Top-down   22
1      Top-down   239,930
2      Bottom-up  150,006,673
3      Bottom-up  19,742,764
4      Bottom-up  139,817
5      Bottom-up  41,846
6      Top-down   260
Total  –          170,171,312
%                 4.0 %
• 2^31 = 2,147,483,648 edges (100 %); Hybrid-BFS traverses 170,171,312 of them (8 %)
• Most of the CPU time is spent in the Bottom-up steps of Hybrid-BFS, which perform 99.9 % of the edge traversals
• In particular, the Bottom-up step at Level 2 performs most of the edge traversals (88.1 %)
Breakdown of vertex traversal
• Total vertices: 134,217,728 (= 2^27, 100.0 %)
– Visited by Top-down: 283 (0.0 %)
– Visited by Bottom-up: 63,035,833 (47.0 %)
– Unvisited, zero-degree: 71,140,085 (53.0 %)
– Unvisited, isolated: 41,527 (0.0 %)
• Most vertices are found in the Bottom-up steps of Hybrid-BFS
• About half of the vertices are unvisited
15. Influence of ordering for adjacency vertices
• Computational complexity of the Bottom-up step depends on the ordering of adjacency vertices for each vertex
• The number of traversed edges is strongly affected by the ordering at Lv. 2
• Descending order: adjacency list A(v) sorted from high-degree to low-degree, using the out-degree of each adjacency vertex w
Number of traversed edges for each ordering
Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
                                             Hybrid algorithm
Level  Top-down       Bottom-up      Step  Ascending    Randomized   Descending
0      22             4,223,250,243  T     22           22           22
1      239,930        3,258,645,723  T     239,930      239,930      239,930
2      1,040,268,126  83,878,899     B     848,743,124  150,006,673  83,878,899
3      3,145,608,885  19,616,130     B     19,935,737   19,742,764   19,616,130
4      37,007,608     139,606        B     139,868      139,817      139,606
5      98,339         41,846         B     41,846       41,846       41,846
6      260            41,586         T     260          260          260
Total  4,223,223,170  7,585,614,033  –     869,100,787  170,171,312  103,916,693
%      100 %          179.6 %              20.6 %       4.0 %        2.5 %
[Figure: number of traversed edges per level (scale up to 10^8) for Ascending, Randomized, and Descending ordering; Descending is better]
[Figure: Bottom-up scan of adjacency lists A(va) and A(vb) with loop count τ: the loop traverses adjacency vertices until it finds a frontier vertex, then breaks; the remaining adjacency vertices are skipped]
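The effect of the ordering on the loop count τ can be reproduced on a toy graph. A minimal Python sketch, assuming an undirected graph stored as adjacency lists; the function names and the example graph are illustrative.

```python
def sort_adjacency(adj, descending=True):
    """Sort every adjacency list by the degree of the listed vertex."""
    degree = [len(a) for a in adj]
    return [sorted(a, key=lambda w: degree[w], reverse=descending) for a in adj]

def loop_count(neighbors, frontier):
    """tau: adjacency entries scanned by bottom-up before it can break."""
    for tau, v in enumerate(neighbors, start=1):
        if v in frontier:
            return tau              # frontier vertex found, loop breaks here
    return len(neighbors)           # no frontier vertex: whole list scanned

# Vertex 0 is a hub (degree 4); vertex 4 is adjacent to the hub and to leaf 5.
adj = [[1, 2, 3, 4], [0], [0], [0], [5, 0], [4]]
frontier = {0}
print(loop_count(sort_adjacency(adj, descending=False)[4], frontier))  # 2
print(loop_count(sort_adjacency(adj, descending=True)[4], frontier))   # 1
```

High-degree vertices enter the frontier early in a scale-free graph, so placing them first in each adjacency list lets the bottom-up loop break sooner.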
16. Analysis of loop count for each vertex
• Bottom-up found 46.8 % of the vertices
• Descending order finds most vertices at the first loop (τ = 1), i.e. at the first vertex of the adjacency list
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFS
[Figure: three panels (Ascending, Randomized, Descending); x-axis: loop count τ; y-axis: number of fixed vertices (10^0 to 10^8); series: Lv. 2 to Lv. 5]
• Maximum loop count: 5,873 (Ascending), 58 (Randomized), 28 (Descending); smaller is better
• Shares of vertices found at τ = 1 and at τ ≥ 2, as annotated on the panels: 19.0 % + 27.8 %, 28.1 % + 18.7 %, and 45.0 % + 1.8 %
17. 3 features and Degree-aware BFS
1. Half of the vertices have no adjacency vertices
– Suppression of zero-degree vertices by renumbering the non-zero-degree vertices (zero-degree opt.)
2. Computational complexity of Bottom-up depends on the ordering of adjacency vertices for each vertex
– Adjacency lists sorted by out-degree in descending order
3. Most vertices were found at the first loop of Bottom-up
– Separated graph representation: highest-degree adjacency vertex list A+ and remaining CSR graph A- (high-degree opt.)
[Figure: standard CSR graph (n vertices, m edges, each adjacency list ordered from high degree to low degree) = A+ (the n highest-degree adjacency vertices) + A- (the remaining m - n edges)]
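The three techniques combine into one preprocessing pass, sketched below in Python. This is an illustration rather than the authors' data structure: it assumes an undirected graph given as adjacency lists (so every neighbor of a non-zero-degree vertex also has non-zero degree), uses nested lists instead of CSR arrays, and all names are hypothetical.

```python
def degree_aware_layout(adj):
    """Return (old_ids, a_plus, a_minus) for the non-zero-degree vertices."""
    degree = [len(a) for a in adj]
    # 1) Zero-degree suppression: renumber only the non-zero-degree vertices.
    old_ids = [v for v in range(len(adj)) if degree[v] > 0]
    new_id = {old: new for new, old in enumerate(old_ids)}
    a_plus, a_minus = [], []
    for old in old_ids:
        # 2) Sort the adjacency list by degree in descending order.
        nbrs = sorted(adj[old], key=lambda w: degree[w], reverse=True)
        nbrs = [new_id[w] for w in nbrs]
        # 3) Separate the highest-degree neighbor (A+) from the rest (A-).
        a_plus.append(nbrs[0])
        a_minus.append(nbrs[1:])
    return old_ids, a_plus, a_minus

def find_parent(w, a_plus, a_minus, frontier):
    """Bottom-up check for vertex w: try A+ first, then fall back to A-."""
    if a_plus[w] in frontier:
        return a_plus[w]
    for v in a_minus[w]:
        if v in frontier:
            return v
    return None

adj = [[2, 3, 5], [], [0], [0, 5], [], [0, 3]]   # vertices 1 and 4 have degree 0
old_ids, a_plus, a_minus = degree_aware_layout(adj)
print(old_ids)   # [0, 2, 3, 5]
print(a_plus)    # [2, 0, 0, 0]
print(a_minus)   # [[3, 1], [], [3], [2]]
print(find_parent(1, a_plus, a_minus, {0}))  # 0
```

Because most vertices are resolved at τ = 1, the common case touches only the dense A+ array and never reads A-.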
21. Strong scaling on SGI Altix UV1000
for Kronecker graph with SCALE 30
– 1.07 billion vertices, 17.18 billion edges
[Figure: CPU time (ms) per level (Init, Lv. 0 to Lv. 7) for 128, 256, and 512 threads, split into Traversal (local) and Swap (all-gather); Bottom-up levels are marked]
Threads         GE/s   Local   Remote  L : R
128             18.76  733 ms  182 ms  80 % : 20 %
256             26.17  438 ms  218 ms  67 % : 33 %
512 (one rack)  37.70  258 ms  197 ms  57 % : 43 %
• As the number of threads increases,
– the CPU time for local memory access improves
– the CPU time for remote memory access stays roughly constant
• Rank 50 and fastest single-node entry on the Nov. 2013 list
22. BFS Performances for Real networks
• Suitable for small-world networks
– efficient for a low diameter and a large edgefactor
• Small-world: Twitter follow-ship network in 2009
– 61.6 million vertices & 1.47 billion edges
– 10.90 GTEPS (max. 24.09 GTEPS)
• Non small-world: US road network
– 24 million vertices & 58 million edges
– 0.09 GTEPS (max. 0.11 GTEPS)
faster than the former owing to its edgefactor being 1.68 times larger relatively. In addition, twitter and friendster show similar BFS performances of approximately 10 GTEPS because they have similar edgefactor and similar diameters. Therefore, we verify whether our BFS is affected by using both the edgefactor and diameter of the network. From these numerical results, we could achieve high performance for large-scale small-world networks with a large edgefactor.
Table 9. BFS performance of real-world networks on the Sandybridge-EP system.
                      Graph size           edgefactor  Diameter  GTEPS
Instance              n        m           m/n         diam'G    min   1/4    median  3/4    max
wiki-Talk [23, 24]    2.39 M   5.02 M      2.1         8         0.29  0.61   0.75    0.87   1.26
USA-road-d [25]       23.95 M  58.33 M     2.4         8,098     0.07  0.08   0.09    0.09   0.11
LiveJournal [26, 27]  4.85 M   68.99 M     14.2        16        2.76  3.76   4.07    4.32   4.94
twitter [28]          61.58 M  1,468.37 M  23.8        16        7.58  10.02  10.90   12.68  24.09
friendster [29]       65.61 M  1,806.07 M  27.5        25        4.89  9.61   10.74   11.29  11.81
23. The Green Graph500 list in Nov. 2013
• Measures power efficiency using the TEPS/W ratio
• Our results on various systems such as Xeon servers and Android devices
http://green.graph500.org
• Graph500: 1. Generation, 2. Construction, 3. BFS phase (x 64), reporting the median TEPS ratio
– Input parameters: SCALE, edgefactor
– Results: SCALE, edgefactor, BFS time, traversed edges, TEPS
• Green Graph500: measures power consumption (watt) during the BFS phase and reports TEPS/W
24. TEPS and TEPS/W on 4-way Xeon
[Figure: GTEPS (0 to 30) of Degree-aware, NUMA-opt., and Reference BFS for each ℓ × t CPU affinity (number of threads): 1×1 (1), 4×1 (4), 4×2 (8), 1×16 (16), 2×8 (16), 4×4 (16), 2×16 (32), 4×8 (32), 4×16 (64), w/o (64)]
#threads = 16
Affinity  GTEPS  Power    MTEPS/W
1×16      7.92   364.9 W  21.71
2×8       11.83  452.6 W  26.13
4×4       13.96  517.8 W  26.96
#NUMA nodes = 4
Affinity  GTEPS  Power    MTEPS/W
4×8       22.03  586.7 W  37.55
4×16      29.03  639.1 W  45.43
• The 4×16 affinity is both the fastest and the most energy-efficient
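The TEPS/W figures follow directly from the measured GTEPS and watts on this slide. A small Python check; the function name is an illustrative assumption, and small differences from the slide's values come from rounding of the published inputs.

```python
def mteps_per_watt(gteps, watts):
    """Energy efficiency: (GTEPS * 1000) MTEPS divided by the measured power."""
    return gteps * 1000.0 / watts

# (GTEPS, watts) pairs as listed on the slide.
measurements = [(7.92, 364.9), (11.83, 452.6), (13.96, 517.8),
                (22.03, 586.7), (29.03, 639.1)]
for gteps, watts in measurements:
    print(f"{gteps:6.2f} GTEPS / {watts:5.1f} W = "
          f"{mteps_per_watt(gteps, watts):.2f} MTEPS/W")
```

Power grows more slowly than throughput as sockets and threads are added, which is why the fastest configuration is also the most efficient one here.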
25. Green Graph500 on Xperia-A-SO-04E
• Manages to be both fast and energy-efficient
• Smartphone: SONY Xperia-A-SO-04E (CPU: 4-core Snapdragon, RAM: 2 GB)
• #1 in Nov. 2013 list
Fig. 7. MTEPS of reference BFS and Degree-aware BFS on Xperia A SO-04E.
[Figure: x-axis: SCALE (10 to 20); series: Reference (p = 4), Degree-aware BFS (p = 4)]
…on, suggesting that the effective power is not strongly affected by the number of threads and the algorithm used. With regard to energy-efficient computation, our BFS is around 100 times faster than the reference code for roughly the same effective power of 3.0 W; specifically, our BFS shows an energy-efficient performance of 153.17 MTEPS/W.
Table 10. Energy efficiency of BFS for Kronecker graph on Xperia A SO-04E.
Implementation      SCALE  MTEPS   watt  MTEPS/W
Reference (p = 1)   20     3.25    3.15  1.03
Reference (p = 4)   20     4.58    3.22  1.42
This study (p = 1)  20     136.29  3.23  42.25
This study (p = 2)  20     248.08  2.99  82.92
This study (p = 4)  20     477.63  3.12  153.17
• 153.17 MTEPS/W (477.64 MTEPS) at roughly the same power consumption as the reference code
• x150 faster and more energy-efficient than the reference code (about 1 MTEPS/W)
26. Conclusion
• Degree-aware BFS
– Speedup techniques considering the vertex degree
– 1) Zero-degree vertex suppression
– 2) Separated graph representation
– 2.68 times faster than our previous algorithm
• Our BFS is the fastest single-node implementation
– 37.7 GTEPS for SCALE 30 on SGI Altix UV1000 (one rack)
• Investigated CPU affinity and power consumption
– The 4 sockets x 16 threads affinity gives the highest MTEPS and MTEPS/W on the 4-way Intel Xeon server
– First position in the small data category of the 2nd Green Graph500 list, on an Android device