Fast & Energy-Efficient
Breadth-First Search
on a Single NUMA System
Yuichiro Yasui & Katsuki Fujisawa
Kyushu University & JST CREST
Yukinori Sato
JAIST & JST CREST
ISC14 (International Supercomputing Conference 2014)
Research Papers 08 ‒ Energy Efficiency, June 26, 2014
Outline
1.  Background
2.  Fast computation of graph processing
–  Related work and our previous contributions
3.  Bottleneck analysis of our previous NUMA-optimized BFS
4.  Our proposal: Degree-aware BFS
5.  Performance evaluation of the proposed BFS
–  Fast for Graph500 benchmark
–  Energy-efficient for Green graph500 benchmark
Background
•  Large-scale graphs in various fields
–  US road network: 24 million vertices & 58 million edges
–  Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
–  Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
–  Cyber-security: 15 billion log entries / day
•  Fast and scalable graph processing by using HPC
•  Application fields
–  Transportation
–  Social network
–  Cyber-security
–  Bioinformatics
Graph analysis and important kernel BFS
•  The cycle of graph analysis for understanding real-networks
•  concurrent search (breadth-first search)
•  optimization (single source shortest path)
•  edge-oriented (maximal independent set)
[Figure: the cycle of graph analysis in three steps (Step 1 to Step 3); labels: application field, relationships, graph, graph processing, results, understanding]
•  BFS is one of the most important and fundamental graph-processing kernels
•  Many algorithms and applications are based on it (max-flow and centrality)
•  Low arithmetic intensity & irregular memory accesses
Breadth-first search (BFS)
•  Inputs: graph and source vertex
•  Outputs: distance (Lv.) and predecessor for each vertex from the source
[Figure: BFS from the source vertex, reaching Lv. 1, Lv. 2, and Lv. 3]
NUMA-aware computation
Target: NUMA arch. system
•  NUMA node = CPU socket (16 logical cores) + local RAM
–  Memory access for local RAM (fast)
–  Memory access for remote RAM (slow)
•  Reduces and avoids memory accesses for remote RAM
•  4-way Intel Xeon E5-4640 (Sandybridge-EP)
–  4 (# of CPU sockets)
–  8 (# of physical cores per socket)
–  2 (# of threads per core)
–  Max. 4 x 8 x 2 = 64 threads
[Figure: four NUMA nodes, each an 8-core Xeon E5-4640 (processor cores with L2 caches and a shared L3 cache) with its own RAM]
Graph500 Benchmark
•  Fast computation of graph processing is a significant topic in HPC
•  The Graph500 benchmark measures computer performance using the TEPS ratio (# of traversed edges per second) in graph processing such as BFS (breadth-first search)
•  Input parameters: SCALE & edgefactor (= 16)
1.  Generation: graph generation from SCALE and edgefactor
2.  Construction: graph construction
3.  BFS x 64: BFS and validation, 64 iterations
•  Results: BFS time, traversed edges, and TEPS for each iteration; median TEPS ratio over the 64 iterations
•  Kronecker graph
–  synthetic scale-free network generated by using the recursive Kronecker product
–  2^SCALE vertices and 2^SCALE x edgefactor edges
–  e.g., SCALE 30 and edgefactor 16 → 1 billion vertices and 17.2 billion edges
www.graph500.org
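The size formulas and the TEPS ratio above can be checked with a few lines of Python. This is a minimal sketch, not the benchmark's reference code; the function names `kronecker_size`, `teps`, and `median_teps` are hypothetical and chosen only for illustration.

```python
from statistics import median


def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) for a Kronecker graph: 2^SCALE and 2^SCALE * edgefactor."""
    n = 2 ** scale
    m = n * edgefactor
    return n, m


def teps(traversed_edges, bfs_seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / bfs_seconds


def median_teps(runs):
    """Median TEPS over a list of (traversed_edges, bfs_seconds) pairs, e.g. 64 BFS runs."""
    return median(teps(edges, seconds) for edges, seconds in runs)


n, m = kronecker_size(30, 16)
print(n, m)  # 1073741824 17179869184
```

SCALE 30 with edgefactor 16 gives 2^30 vertices and 2^34 edges, which matches the rounded figures on the slide.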
Level-synchronized parallel BFS (Top-down)
•  Starts from the source vertex and executes the following two phases at each level
Algorithm 1: Level-synchronized Parallel BFS.
Input: G = (V, A): unweighted directed graph.
       s: source vertex.
Variables: QF: frontier queue.
           QN: neighbor queue.
           visited: vertices already visited.
Output: π(v): predecessor map of BFS tree.
  π(v) ← −1, ∀v ∈ V
  π(s) ← s
  visited ← {s}
  QF ← {s}
  QN ← ∅
  while QF ≠ ∅ do
    for v ∈ QF in parallel do
      for w ∈ A(v) do
        if w ∉ visited atomic then
          π(w) ← v
          visited ← visited ∪ {w}
          QN ← QN ∪ {w}
    QF ← QN
    QN ← ∅
–  Traversal: finds unvisited adjacency vertices from the current frontier QF and appends them to the neighbor queue QN
–  Swap: swaps the frontier QF and the neighbor queue QN for the next level
[Figure: frontier QF at level k and candidates of neighbors QN at level k+1]
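The traversal and swap phases of Algorithm 1 can be sketched serially in Python. This is an illustrative sketch only: the parallel loop and the atomic visited check are replaced by a plain sequential loop, the graph is assumed to be a list of adjacency lists, and the `level` array is an addition for readability.

```python
def level_synchronized_bfs(adj, source):
    """Serial sketch of level-synchronized BFS.

    adj[v] is the adjacency list A(v). Returns (parent, level), where
    parent[v] = -1 and level[v] = -1 for vertices not reached from source.
    """
    n = len(adj)
    parent = [-1] * n
    level = [-1] * n
    parent[source] = source
    level[source] = 0
    frontier = [source]          # QF
    depth = 0
    while frontier:
        neighbors = []           # QN
        # Traversal: scan every edge leaving the current frontier.
        for v in frontier:       # "in parallel" in Algorithm 1
            for w in adj[v]:
                if parent[w] == -1:          # w not visited yet
                    parent[w] = v
                    level[w] = depth + 1
                    neighbors.append(w)
        # Swap: the neighbor queue becomes the next frontier.
        frontier = neighbors
        depth += 1
    return parent, level


adj = [[1, 2], [0, 3], [0, 3], [1, 2], []]
print(level_synchronized_bfs(adj, 0))
```

Using `parent[w] == -1` as the visited test works here because the source is its own parent; a parallel version would need an atomic test-and-set, as the pseudocode indicates.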
Observation of data accesses in top-down and bottom-up search
•  Data writes in top-down search: v → w
Input: directed graph G = (V, AF), queue QF
Data: queue QN, visited, tree π(v)
  QN ← ∅
  for v ∈ QF in parallel do
    for w ∈ AF(v) do
      if w ∉ visited atomic then
        π(w) ← v
        visited ← visited ∪ {w}
        QN ← QN ∪ {w}
  QF ← QN
•  Data writes in bottom-up search: w → v
Input: directed graph G = (V, AB), queue QF
Data: queue QN, visited, tree π(v)
  QN ← ∅
  for w ∈ V \ visited in parallel do
    for v ∈ AB(w) do
      if v ∈ QF then
        π(w) ← v
        visited ← visited ∪ {w}
        QN ← QN ∪ {w}
        break
  QF ← QN
Hybrid-BFS (Direction-optimizing BFS)
•  Chooses either Top-down or Bottom-up according to the frontier size at each level
•  Top-down algorithm
–  Efficient for a small frontier
–  Uses outgoing edges
•  Bottom-up algorithm
–  Efficient for a large frontier
–  Uses incoming edges
[Figure (Beamer2012): Top-down expands the current frontier to its unvisited neighbors; Bottom-up checks the candidates of neighbors against the current frontier and skips unnecessary edge traversal]
Hybrid-BFS (Direction-optimizing BFS)
•  Chooses either Top-down or Bottom-up according to the number of traversed edges at each level
•  Hybrid-BFS reduces unnecessary edge traversals (Beamer2012)
Number of traversed edges for a Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30)
Level   Top-down         Bottom-up        Hybrid
0       2                2,103,840,895    2
1       66,206           1,766,587,029    66,206
2       346,918,235      52,677,691       52,677,691
3       1,727,195,615    12,820,854       12,820,854
4       29,557,400       103,184          103,184
5       82,357           21,467           21,467
6       221              21,240           227
Total   2,103,820,036    3,936,072,360    65,689,631
Ratio   100.00%          187.09%          3.12%
[Figure: number of traversed edges versus distance from source for Top-down and Bottom-up]
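The effect of switching direction can be reproduced on a toy graph. The sketch below is serial and counts scanned edges per level. Its switching rule (a fixed frontier-size `threshold`) is an assumption made for illustration, not the heuristic of Beamer2012 or of the authors, and all function names are hypothetical.

```python
def top_down_step(out_adj, frontier, parent):
    """Scan outgoing edges of the frontier; return (next frontier, edges scanned)."""
    next_frontier, scanned = [], 0
    for v in frontier:
        for w in out_adj[v]:
            scanned += 1
            if parent[w] == -1:
                parent[w] = v
                next_frontier.append(w)
    return next_frontier, scanned


def bottom_up_step(in_adj, frontier, parent):
    """Each unvisited vertex scans its incoming edges and stops at the first frontier vertex."""
    in_frontier = set(frontier)
    next_frontier, scanned = [], 0
    for w in range(len(in_adj)):
        if parent[w] != -1:
            continue
        for v in in_adj[w]:
            scanned += 1
            if v in in_frontier:
                parent[w] = v
                next_frontier.append(w)
                break            # remaining edges of w are skipped
    return next_frontier, scanned


def hybrid_bfs(out_adj, in_adj, source, threshold):
    """Use bottom-up when the frontier is larger than threshold (illustrative rule)."""
    parent = [-1] * len(out_adj)
    parent[source] = source
    frontier, steps = [source], []
    while frontier:
        if len(frontier) > threshold:
            frontier, scanned = bottom_up_step(in_adj, frontier, parent)
            steps.append(("bottom-up", scanned))
        else:
            frontier, scanned = top_down_step(out_adj, frontier, parent)
            steps.append(("top-down", scanned))
    return parent, steps


# Undirected toy graph: vertex 0 is a hub, vertex 5 hangs off vertex 1.
adj = [[1, 2, 3, 4], [0, 5], [0], [0], [0], [1]]
print(hybrid_bfs(adj, adj, 0, threshold=2))
```

On this graph the bottom-up step at the second level scans one edge, where a top-down step over the same frontier would scan five.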
NUMA-optimized BFS
•  Clearly separates accesses to local and remote memory
–  Edge traversal on local RAM
–  All-gather of local queues and bitmaps for remote RAM
•  At each level: traversal on local RAM, then swap on remote RAM
–  Large frontier? Yes → NUMA-optimized Bottom-up; No → NUMA-optimized Top-down
–  Swap aggregates the local frontier queues
•  Searches local neighbors from the local copied frontier
–  Outgoing edges for Top-down
–  Incoming edges for Bottom-up
•  NUMA-opt. requires two CSR graphs (※ not the same for an undirected graph)
[Figure: local neighbor queues QN0 to QN3 are aggregated into the frontier QF]
Local Edge Traversal
•  Graph partitioning for a computer with ℓ CPU sockets (ℓ RAMs)
–  Vertices and adjacency lists are placed as partial vertex sets Vk and partial adjacency lists Ak:
   V = V0 | V1 | · · · | Vℓ−1 , A = A0 | A1 | · · · | Aℓ−1
–  Vk is a simple 1-D partition (n: number of vertices, ℓ: number of CPU sockets):
   $V_k = \{ v_j \in V \mid j \in [\, kn/\ell,\ (k+1)n/\ell \,) \}$
–  Two kinds of adjacency lists are defined on the multi-socket computer
•  The k-th NUMA node holds
–  partial vertices Vk
–  partial edges (vi, vj), vi ∈ V and vj ∈ Vk
–  local copied frontier QF
[Figure: 0th to 3rd NUMA nodes, each an 8-core Xeon E5-4640 with its RAM, holding Vk, visitedk, QNk, Ak, and a copy of QF]
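The 1-D partitioning can be illustrated with integer arithmetic. This is a sketch under assumptions: the interval bounds are rounded down with integer division when n is not a multiple of ℓ (the slide does not say how that case is handled), and `partition_bounds`, `owner`, and `split_edges` are hypothetical helper names.

```python
def partition_bounds(k, n, num_nodes):
    """Half-open index range [lo, hi) of the vertices owned by NUMA node k."""
    lo = k * n // num_nodes
    hi = (k + 1) * n // num_nodes
    return lo, hi


def owner(j, n, num_nodes):
    """NUMA node whose partial vertex set Vk contains vertex index j."""
    for k in range(num_nodes):
        lo, hi = partition_bounds(k, n, num_nodes)
        if lo <= j < hi:
            return k
    raise ValueError("vertex index out of range")


def split_edges(edges, n, num_nodes):
    """Group edges (vi, vj) by the node that owns vj, i.e. vi in V and vj in Vk."""
    parts = [[] for _ in range(num_nodes)]
    for vi, vj in edges:
        parts[owner(vj, n, num_nodes)].append((vi, vj))
    return parts


print([partition_bounds(k, 10, 4) for k in range(4)])
print(split_edges([(0, 1), (3, 9), (9, 0)], 10, 4))
```

Grouping edges by the owner of vj means every write to the data of vj during traversal stays on the node that stores vj.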
CPU Affinity and local memory binding
•  ULIBC: Ubiquity Library for Intelligently Binding Cores
–  provides some routines for CPU affinity + local memory binding
–  manages each processor core (processor ID) by topology information as a tuple of (SMT ID, core ID, package ID)
•  Topology information
–  Processor ID: index of logical processor core
–  Package ID: index of CPU socket
–  Core ID: index of physical core in each CPU socket
–  SMT ID: index of thread in each physical core
•  CPU affinity
1.  Detects online processors (allocated to the current process) using the sched_getaffinity system call
2.  Binds each thread to a logical core using the sched_setaffinity system call or the Intel compiler Thread Affinity Interface
[Figure: all processors versus online processors; cores 0 to 3 on NUMA nodes 0 and 1 with their local RAM, some of them used by other processes]
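The two affinity steps can be tried from Python, whose `os` module wraps the same Linux system calls. This sketch is not the ULIBC API; `online_processors` and `bind_current_to` are hypothetical names, and the `hasattr` checks are there because `os.sched_getaffinity` and `os.sched_setaffinity` are only available on some platforms (notably Linux).

```python
import os


def online_processors():
    """Step 1: logical processors the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))   # pid 0 = the caller
    # Fallback where the call is unavailable: assume all CPUs are usable.
    return list(range(os.cpu_count() or 1))


def bind_current_to(cpu):
    """Step 2: restrict the caller to a single logical processor.

    Returns False when the platform does not expose sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {cpu})
    return True


cpus = online_processors()
print("online processors:", cpus)
```

Local memory binding, the other half of what the slide describes, is not covered by these calls and is left out of the sketch.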
Related work: TEPS ratios on a single node
•  Our BFS achieves 31.7 GTEPS for a Kronecker graph (SCALE27)
Work                      Algorithm                     System                      GTEPS
Reference code            –                             –                           0.1
Agarwal2010               NUMA-aware Top-down BFS       4-way Intel Xeon 7560       0.8 (m/n=64, 1.1 GTEPS)
Beamer2012                Hybrid-BFS                    4-way Intel Xeon E5-8870    5.1
Yasui2013                 NUMA-opt. Hybrid-BFS          4-way Intel Xeon E5-4640    11.1
Yasui2014 (this paper)    Degree-aware NUMA-opt. BFS    4-way Intel Xeon E5-4650    31.7
[Figure: GTEPS versus SCALE (20 to 29) for the reference code, Agarwal2010, Beamer2012, Yasui2013, and Yasui2014; annotated speedups: x 2.2, x 2.6, x 5.9 faster]
Breakdown of hybrid-BFS
for Kronecker graph with SCALE27 (2^27 vertices and 2^31 edges)
•  Most of the CPU time is taken by the Bottom-up steps in Hybrid-BFS; they account for 99.9 % of the traversed edges.
•  In particular, the Bottom-up step at Level 2 accounts for most of the edge traversals (88.1 %).
#Traversed edges
Level   Step        Hybrid-BFS
0       Top-down    22
1       Top-down    239,930
2       Bottom-up   150,006,673
3       Bottom-up   19,742,764
4       Bottom-up   139,817
5       Bottom-up   41,846
6       Top-down    260
Total   –           170,171,312
%                   4.0 %
2^31 = 2,147,483,648 edges (100 %); traversed by Hybrid-BFS: 170,171,312 (8 %)

Breakdown of vertex traversal
•  Most of the vertex traversal is done by the Bottom-up steps in Hybrid-BFS.
•  About half of the vertices are unvisited.
Total vertices: 134,217,728 (= 2^27, 100.0 %)
=  Visited vertices: Top-down 283 (0.0 %) + Bottom-up 63,035,833 (47.0 %)
+  Unvisited vertices: Zero-degree 71,140,085 (53.0 %) + Isolated 41,527 (0.0 %)
Influence of ordering for adjacency vertices
•  The computational complexity of the Bottom-up step depends on the ordering of adjacency vertices for each vertex
•  The adjacency list A(v) is sorted using the out-degree of w (descending order: high-degree first, low-degree last)
•  The # of traversed edges is strongly affected by the ordering at Lv. 2; descending order is better
Number of traversed edges for each ordering
Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
                                                Hybrid algorithm
Level   Top-down         Bottom-up        Step   Ascending      Randomized     Descending
0       22               4,223,250,243    T      22             22             22
1       239,930          3,258,645,723    T      239,930        239,930        239,930
2       1,040,268,126    83,878,899       B      848,743,124    150,006,673    83,878,899
3       3,145,608,885    19,616,130       B      19,935,737     19,742,764     19,616,130
4       37,007,608       139,606          B      139,868        139,817        139,606
5       98,339           41,846           B      41,846         41,846         41,846
6       260              41,586           T      260            260            260
Total   4,223,223,170    7,585,614,033    –      869,100,787    170,171,312    103,916,693
%       100 %            179.6 %          –      20.6 %         4.0 %          2.5 %
Analysis of loop count for each vertex
•  Loop count τ: the number of adjacency vertices in A(v) that the Bottom-up step traverses before it finds a frontier vertex and breaks the loop; the remaining adjacency vertices are skipped
–  τ = 1: the first vertex of the adjacency list is a frontier vertex
•  Bottom-up found 46.8 % of the vertices
•  Descending order finds most vertices at the first loop
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFS
[Figure: number of fixed vertices (log scale, 10^0 to 10^8) versus loop count τ at Lv. 2 to Lv. 5, one panel each for Ascending, Randomized, and Descending order]
Ordering      Vertices found at τ = 1   Vertices found at τ ≥ 2   Max. τ
Ascending     19.0 %                    27.8 %                    5,873
Randomized    28.1 %                    18.7 %                    58
Descending    45.0 %                    1.8 %                     28
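How the ordering of one adjacency list changes the loop count τ can be shown on a single unvisited vertex. The degrees and frontier below are made-up illustration data, not measurements from the slides, and `loop_count` and `order_by_degree` are hypothetical helper names.

```python
def loop_count(adjacency, frontier):
    """Return (tau, found): entries scanned before the bottom-up loop breaks.

    If no adjacency vertex is in the frontier, the whole list is scanned
    and found is False.
    """
    for tau, v in enumerate(adjacency, start=1):
        if v in frontier:
            return tau, True
    return len(adjacency), False


def order_by_degree(adjacency, degree, descending=True):
    """Sort one adjacency list by the degree of its entries."""
    return sorted(adjacency, key=lambda w: degree[w], reverse=descending)


# Illustration data: vertex 2 is a high-degree hub that is already in the frontier.
degree = {5: 1, 7: 3, 2: 40, 9: 12}
adjacency = [5, 7, 2, 9]
frontier = {2}

print(loop_count(order_by_degree(adjacency, degree, descending=False), frontier))  # (4, True)
print(loop_count(order_by_degree(adjacency, degree, descending=True), frontier))   # (1, True)
```

In this example the hub is scanned last in ascending order (τ = 4) and first in descending order (τ = 1), which is the kind of gap the table above reports at scale.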
3 features and Degree-aware BFS
1. Half of the vertices have no adjacency vertices
   → Zero-degree opt.: suppression of zero-degree vertices by renumbering the non-zero-degree vertices
2. The computational complexity of Bottom-up depends on the ordering of adjacency vertices for each vertex
   → Adjacency lists sorted by out-degree in descending order
3. Most vertices are found at the first loop of Bottom-up
   → High-degree opt.: separated graph representation; highest-degree adjacency vertex list A+ and remaining CSR graph A-
[Figure: standard CSR graph = A+ (the highest-degree adjacency vertex of each vertex, n entries) + A- (the remaining adjacency vertices, ordered from high-degree to low-degree, m-n entries)]
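The zero-degree renumbering and the A+ / A- split can be sketched on a small undirected graph. This is an illustration under assumptions, not the authors' data layout: adjacency lists are plain Python lists rather than CSR arrays, the graph is assumed symmetric (so every neighbor has non-zero degree), and `renumber_nonzero` and `split_highest_degree` are hypothetical names.

```python
def renumber_nonzero(adj):
    """Drop zero-degree vertices and renumber the rest as 0..n'-1.

    Assumes a symmetric (undirected) adjacency structure.
    """
    new_id = {}
    for v, neighbors in enumerate(adj):
        if neighbors:
            new_id[v] = len(new_id)
    compact = [None] * len(new_id)
    for v, neighbors in enumerate(adj):
        if neighbors:
            compact[new_id[v]] = [new_id[w] for w in neighbors]
    return compact, new_id


def split_highest_degree(adj):
    """Sort each list by degree (descending) and split it into A+ and A-.

    a_plus[v] is the highest-degree neighbor of v (-1 if v has none);
    a_minus[v] holds the remaining neighbors, still in descending order.
    """
    degree = [len(a) for a in adj]
    a_plus, a_minus = [], []
    for neighbors in adj:
        ordered = sorted(neighbors, key=lambda w: degree[w], reverse=True)
        a_plus.append(ordered[0] if ordered else -1)
        a_minus.append(ordered[1:])
    return a_plus, a_minus


adj = [[2, 3], [], [0, 3, 4], [0, 2], [2]]      # vertex 1 has degree zero
compact, new_id = renumber_nonzero(adj)
a_plus, a_minus = split_highest_degree(compact)
print(compact, a_plus, a_minus)
```

With n = 4 remaining vertices and m = 8 adjacency entries, A+ holds n entries and A- holds m - n = 4, as in the figure. A bottom-up step would check `a_plus[w]` first and fall back to `a_minus[w]` only when that check misses.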
Performance improvements
•  Degree-aware BFS is 2.68 times faster than NUMA-opt.
[Figure: GTEPS (0 to 30) for NUMA-opt. (x 1.00, previous paper), + High-deg (x 1.34), + zero-deg (x 2.03), and Degree-aware (x 2.68, this paper); annotation: 1.34 x 2.03]
Intel Xeon (64) v.s. SGI Altix UV1000 (512)
for Kronecker graph with SCALE29
–  536.9 million vertices, 8.59 billion edges
•  Intel Xeon (Sandybridge-EP arch.)
–  Intel Xeon E5-4650 @ 2.70GHz, 512 GB RAM
–  64 threads = 4 NUMA x 8 cores x 2 threads
–  64 threads, 400 GB
•  SGI Altix UV1000 (Westmere-EX arch.)
–  Intel Xeon E7-8837 @ 2.67GHz, 4.0 TB RAM
–  512 threads = 64 NUMA x 8 cores
–  512 threads, 1.0 TBytes
[Figure: CPU time (ms, 0 to 300) per level (Init, Lv.0 to Lv.7) for each system, split into Traversal (local, for local RAM) and Swap (all-gather, for remote RAM), with the Bottom-up levels marked; the two panels report 31.81 GE/s and 21.81 GE/s]
Intel Xeon (64) v.s. SGI Altix UV1000 (512)
for Kronecker graph with SCALE30
–  1.07 billion vertices, 17.18 billion edges
•  Intel Xeon (Sandybridge-EP arch.)
–  Intel Xeon E5-4650 @ 2.70GHz, 512 GB RAM
–  64 threads = 4 NUMA x 8 cores x 2 threads
–  Out of memory
•  SGI Altix UV1000 (Westmere-EX arch.)
–  Intel Xeon E7-8837 @ 2.67GHz, 4.0 TB RAM
–  512 threads = 64 NUMA x 8 cores
–  512 threads, 2.0 TBytes: 37.70 GE/s
–  Rank 50 on the Nov. 2013 list; fastest single-node entry
[Figure: CPU time (ms, 0 to 300) per level (Init, Lv.0 to Lv.7) on SGI Altix UV1000, split into Traversal (local, for local RAM) and Swap (all-gather, for remote RAM), with the Bottom-up levels marked]
Strong scaling on SGI Altix UV1000
for Kronecker graph with SCALE30
–  1.07 billion vertices, 17.18 billion edges
Threads            GE/s     Local (ms)   Remote (ms)   L : R
128                18.76    733          182           80% : 20%
256                26.17    438          218           67% : 33%
512 (one rack)     37.70    258          197           57% : 43%
•  As the number of threads increases,
–  the CPU time for local memory access improves
–  the CPU time for remote memory access stays about the same
•  Rank 50 on the Nov. 2013 list; fastest single-node entry
[Figure: CPU time (ms, 0 to 600) per level (Init, Lv.0 to Lv.7) for 128, 256, and 512 threads, split into Traversal (local) and Swap (all-gather), with the Bottom-up levels marked]
BFS Performances for Real networks
•  Suitable for small-world networks
–  efficient for a low diameter and a large edgefactor
•  Small-world: Twitter follow-ship network in 2009
–  61.6 million vertices & 1.47 billion edges
–  10.90 GTEPS (max. 24.09 GTEPS)
•  Non-small-world: US road network
–  24 million vertices & 58 million edges
–  0.09 GTEPS (max. 0.11 GTEPS)
… faster than the former owing to its edgefactor being 1.68 times larger. In addition, twitter and friendster show similar BFS performances of approximately 10 GTEPS because they have similar edgefactors and similar diameters. Therefore, we confirm that the performance of our BFS is affected by both the edgefactor and the diameter of the network. From these numerical results, we could achieve high performance for large-scale small-world networks with a large edgefactor.
Table 9. BFS performance of real-world network on Sandybridge-EP system.
                        Graph size                edgefactor   Diameter   GTEPS
Instance                n          m              m/n          diam′G     min     1/4     median   3/4     max
wiki-Talk [23, 24]      2.39 M     5.02 M         2.1          8          0.29    0.61    0.75     0.87    1.26
USA-road-d [25]         23.95 M    58.33 M        2.4          8,098      0.07    0.08    0.09     0.09    0.11
LiveJournal [26, 27]    4.85 M     68.99 M        14.2         16         2.76    3.76    4.07     4.32    4.94
twitter [28]            61.58 M    1,468.37 M     23.8         16         7.58    10.02   10.90    12.68   24.09
friendster [29]         65.61 M    1,806.07 M     27.5         25         4.89    9.61    10.74    11.29   11.81
5 Energy Efficiency of Our BFS
The Green Graph500 list in Nov. 2013
•  Measures power efficiency using the TEPS/W ratio
•  Our results on various systems such as Xeon servers and Android devices
http://green.graph500.org
•  Graph500: 1. Generation, 2. Construction, 3. BFS phase (x 64) → median TEPS ratio
•  Green Graph500: power measurement (watt) during the BFS phase → TEPS/W
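The TEPS/W ratio is simply the TEPS score divided by the average power drawn during the BFS phase. A minimal sketch follows, with hypothetical function names; the sample values are the 4 x 16 thread result reported for the 4-way Xeon in this deck (29.03 GTEPS at 639.1 W).

```python
def teps_per_watt(teps, watts):
    """Energy efficiency: traversed edges per second per watt."""
    return teps / watts


def mteps_per_watt(gteps, watts):
    """Same ratio, taking GTEPS as input and returning MTEPS/W."""
    return gteps * 1e3 / watts


# 29.03 GTEPS at 639.1 W, reported on the slide as 45.43 MTEPS/W.
print(round(mteps_per_watt(29.03, 639.1), 2))
```

Small differences in the last digit against the published figures are expected, since the slide values are themselves rounded.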
TEPS and TEPS/W on 4-way Xeon
CPU affinity ℓ x t (threads)   GTEPS    Power (W)   MTEPS/W
#threads = 16
1 x 16 (16)                    7.92     364.9       21.71
2 x 8 (16)                     11.83    452.6       26.13
4 x 4 (16)                     13.96    517.8       26.96
#NUMA nodes = 4
4 x 8 (32)                     22.03    586.7       37.55
4 x 16 (64)                    29.03    639.1       45.43
•  4 x 16 is both the fastest and the most energy-efficient setting
[Figure: GTEPS (0 to 30) of Degree-aware, NUMA-opt., and Reference for each ℓ x t CPU affinity (number of threads): 1x1 (1), 4x1 (4), 4x2 (8), 1x16 (16), 2x8 (16), 4x4 (16), 2x16 (32), 4x8 (32), 4x16 (64), and w/o affinity (64)]
Green Graph500 on Xperia-A-SO-04E
•  Manages to be both fast and energy-efficient
•  Smartphone: SONY Xperia-A-SO-04E
–  CPU: 4-core Snapdragon
–  RAM: 2 GB
•  153.17 MTEPS/W (477.64 MTEPS): #1 in the Nov. 2013 list
•  Roughly the same power consumption as the reference code (about 1 MTEPS/W): x150 more energy efficient
[Figure: MTEPS of reference BFS (p = 4) and Degree-aware BFS (p = 4) versus SCALE (10 to 20) on XperiaA SO-04E]
… suggesting that the effective power is not strongly affected by the number of threads and the algorithm used. With regard to energy-efficient computation, our BFS is around 100 times faster than the reference code for roughly the same effective power of 3.0 W; specifically, our BFS shows an energy-efficient performance of 153.17 MTEPS/W.
Table 10. Energy efficiency of BFS for Kronecker graph on XperiaA SO-04E.
Implementation        SCALE   MTEPS     watt   MTEPS/W
Reference (p = 1)     20      3.25      3.15   1.03
Reference (p = 4)     20      4.58      3.22   1.42
This study (p = 1)    20      136.29    3.23   42.25
This study (p = 2)    20      248.08    2.99   82.92
This study (p = 4)    20      477.63    3.12   153.17
Conclusion
•  Degree-aware BFS
–  Speedup techniques considering the vertex degree
–  1) Zero-degree vertex suppression
–  2) Separated graph representation
–  2.68 times faster than our previous algorithm
•  Our BFS is the fastest on a single node
–  37.7 GTEPS for SCALE30 on SGI Altix UV1000 (one rack)
•  Investigated affinity and power consumption
–  The 4 sockets x 16 threads affinity gives the highest MTEPS and MTEPS/W on a 4-way Intel Xeon server
–  First position in the small data category of the 2nd Green Graph500 list on an Android device

Contenu connexe

Tendances

Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMLinaro
 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentShubham Joshi
 
Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...Usatyuk Vasiliy
 
Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.
Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.
Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.Robo India
 
Area-Delay Efficient Binary Adders in QCA
Area-Delay Efficient Binary Adders in QCAArea-Delay Efficient Binary Adders in QCA
Area-Delay Efficient Binary Adders in QCAIJERA Editor
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...BigDataEverywhere
 
Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)Sean Moran
 
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ AdderAn Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ AdderIJERA Editor
 
Development of Routing for Car Navigation Systems
Development of Routing for Car Navigation SystemsDevelopment of Routing for Car Navigation Systems
Development of Routing for Car Navigation SystemsAtsushi Koike
 
ktruss-short
ktruss-shortktruss-short
ktruss-shortJia Wang
 
Nmea Introduction
Nmea IntroductionNmea Introduction
Nmea IntroductionTom Chen
 
Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)IISRT
 
Orthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
Orthogonal Faster than Nyquist Transmission for SIMO Wireless SystemsOrthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
Orthogonal Faster than Nyquist Transmission for SIMO Wireless SystemsT. E. BOGALE
 
Axes Tech
Axes TechAxes Tech
Axes Techncct
 
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...Andrea Tassi
 

Tendances (20)

distance_matrix_ch
distance_matrix_chdistance_matrix_ch
distance_matrix_ch
 
Cnq1
Cnq1Cnq1
Cnq1
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVM
 
An Overview of HDF-EOS (Part 1)
An Overview of HDF-EOS (Part 1)An Overview of HDF-EOS (Part 1)
An Overview of HDF-EOS (Part 1)
 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_present
 
Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...
 
Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.
Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.
Understanding GPS & NMEA Messages and Algo to extract Information from NMEA.
 
Area-Delay Efficient Binary Adders in QCA
Area-Delay Efficient Binary Adders in QCAArea-Delay Efficient Binary Adders in QCA
Area-Delay Efficient Binary Adders in QCA
 
GoogLeNet Insights
GoogLeNet InsightsGoogLeNet Insights
GoogLeNet Insights
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
 
Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)
 
Chenchu
ChenchuChenchu
Chenchu
 
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ AdderAn Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder
 
Development of Routing for Car Navigation Systems
Development of Routing for Car Navigation SystemsDevelopment of Routing for Car Navigation Systems
Development of Routing for Car Navigation Systems
 
ktruss-short
ktruss-shortktruss-short
ktruss-short
 
Nmea Introduction
Nmea IntroductionNmea Introduction
Nmea Introduction
 
Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)
 
Orthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
Orthogonal Faster than Nyquist Transmission for SIMO Wireless SystemsOrthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
Orthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
 
Axes Tech
Axes TechAxes Tech
Axes Tech
 
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...
 





Fast & Energy-Efficient Breadth-First Search on a Single NUMA System

  • 1. Fast & Energy-Efficient Breadth-First Search on a Single NUMA System. Yuichiro Yasui & Katsuki Fujisawa (Kyushu University & JST CREST); Yukinori Sato (JAIST & JST CREST). ISC14 (International Supercomputing Conference 2014), Research Papers 08: Energy Efficiency, June 26, 2014.
  • 2. Outline. 1. Background. 2. Fast computation of graph processing (related work and our previous contributions). 3. Bottleneck analysis of our previous NUMA-optimized BFS. 4. Our proposal: degree-aware BFS. 5. Performance evaluation of the proposed BFS (fast on the Graph500 benchmark; energy-efficient on the Green Graph500 benchmark).
  • 3. Background. Large-scale graphs appear in various fields: the US road network (24 million vertices & 58 million edges), the Twitter follow-ship social network (61.6 million vertices & 1.47 billion edges), the neuronal network of the Human Brain Project (89 billion vertices & 100 trillion edges), and cyber-security (15 billion log entries / day). Goal: fast and scalable processing of large graphs by using HPC.
  • 4. Graph analysis and the important kernel BFS. Application fields: transportation, social network, cyber-security, bioinformatics. The cycle of graph analysis for understanding real networks (diagram labels: application field, relationships, graph, graph processing, results, understanding; Step 1 to Step 3). Kinds of graph processing: concurrent search (breadth-first search), optimization (single-source shortest path), edge-oriented (maximal independent set). Breadth-first search (BFS) is one of the most important and fundamental kinds of graph processing; many algorithms and applications are based on it (max-flow and centrality); it has low arithmetic intensity and irregular memory accesses. Inputs: a graph and a source vertex. Outputs: the distance (level) and the predecessor of each vertex from the source.
  • 5. Target: NUMA architecture system. A NUMA node is a CPU socket (16 logical cores) plus its local RAM; each socket is an 8-core Xeon E5-4640 with a per-core L2 cache and a shared L3 cache. Memory access to local RAM is fast; memory access to remote RAM is slow. NUMA-aware computation reduces and avoids memory accesses to remote RAM. System: 4-way Intel Xeon E5-4640 (Sandy Bridge-EP): 4 CPU sockets x 8 physical cores per socket x 2 threads per core = 64 threads at most.
  • 6. Graph500 Benchmark (www.graph500.org). Fast computation of graph processing is a significant topic in HPC. The Graph500 benchmark measures computer performance by the TEPS ratio (number of traversed edges per second) in graph processing such as BFS (breadth-first search). Benchmark flow: input parameters SCALE and edgefactor (= 16); 1. graph generation; 2. graph construction; 3. BFS and validation, 64 iterations, each producing a TEPS ratio. Results: BFS time, traversed edges, TEPS; the score is the median TEPS. Kronecker graph: a synthetic scale-free network generated by a recursive Kronecker product, with 2^SCALE vertices and 2^SCALE x edgefactor edges; e.g., SCALE 30 and edgefactor 16 give 1 billion vertices and 17.2 billion edges.
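The size and scoring rules on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch, not the benchmark's reference code; the function names `kronecker_size`, `teps` and `median_teps` are hypothetical and chosen only for illustration.

```python
from statistics import median


def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) = (2^SCALE, 2^SCALE * edgefactor)."""
    n = 1 << scale
    return n, n * edgefactor


def teps(traversed_edges, bfs_seconds):
    """TEPS ratio of one BFS run: traversed edges per second."""
    return traversed_edges / bfs_seconds


def median_teps(runs):
    """Median TEPS over a list of (traversed_edges, bfs_seconds) pairs."""
    return median(teps(e, t) for e, t in runs)


n, m = kronecker_size(30, 16)
print(n, m)  # 1073741824 17179869184
```

The printed values correspond to the "1 billion vertices and 17.2 billion edges" quoted for SCALE 30.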
  • 7. Level-synchronized parallel BFS (top-down). Starts from the source vertex and executes the following two phases for each level. Traversal: finds unvisited adjacent vertices from the current frontier QF and appends them to the neighbor queue QN. Swap: swaps the frontier QF and the neighbor queue QN for the next level (level k to level k+1). The input of a BFS is a graph G = (V, E) with a set of vertices V and a set of edges E; an adjacency list A(v) contains the edges (v, w) ∈ E for each vertex v ∈ V; the output is a predecessor map π. In the benchmark, the timed BFS phase and the untimed verify phase are iterated 64 times, and the score is based on the TEPS ratio.
      Algorithm 1: Level-synchronized parallel BFS.
      Input: G = (V, A): unweighted directed graph; s: source vertex.
      Variables: QF: frontier queue; QN: neighbor queue; visited: vertices already visited.
      Output: π(v): predecessor map of the BFS tree.
        π(v) ← −1, ∀v ∈ V
        π(s) ← s
        visited ← {s}
        QF ← {s}
        QN ← ∅
        while QF ≠ ∅ do
          for v ∈ QF in parallel do          (Traversal)
            for w ∈ A(v) do
              if w ∉ visited atomic then
                π(w) ← v
                visited ← visited ∪ {w}
                QN ← QN ∪ {w}
          QF ← QN                            (Swap)
          QN ← ∅
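Algorithm 1 can be tried out directly. The sketch below is a sequential Python version, assuming the graph is given as a list of adjacency lists; the parallel loop and the atomic visited check of the slide are replaced by a plain loop, so it illustrates the level structure only, not the authors' parallel implementation.

```python
def bfs_top_down(adj, source):
    """Level-synchronized top-down BFS; returns the predecessor map as a list."""
    n = len(adj)
    pred = [-1] * n            # pi(v) <- -1 for all v
    pred[source] = source      # pi(s) <- s
    visited = [False] * n
    visited[source] = True
    frontier = [source]        # QF
    while frontier:
        neighbors = []         # QN
        # Traversal: scan the out-going edges of every frontier vertex.
        for v in frontier:
            for w in adj[v]:
                if not visited[w]:
                    visited[w] = True
                    pred[w] = v
                    neighbors.append(w)
        # Swap: the neighbor queue becomes the next frontier.
        frontier = neighbors
    return pred


adj = [[1, 2], [3], [3], [], [0]]
print(bfs_top_down(adj, 0))  # [0, 0, 0, 1, -1]
```

Vertex 4 keeps the predecessor -1 because it is not reachable from the source.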
  • 8. Hybrid-BFS (direction-optimizing BFS) [Beamer2012]. Chooses top-down or bottom-up according to the frontier size at each level. Top-down algorithm: efficient for a small frontier; uses out-going edges; expands from the current frontier (level k) to its unvisited neighbors (level k+1). Bottom-up algorithm: efficient for a large frontier; uses in-coming edges; checks the candidates of neighbors against the current frontier and skips unnecessary edge traversal. Observation of data accesses in forward search and backward search:
      Forward search (top-down), data writes v → w.
      Input: directed graph G = (V, AF), queue QF. Data: queue QN, visited, tree π(v).
        QN ← ∅
        for v ∈ QF in parallel do
          for w ∈ AF(v) do
            if w ∉ visited atomic then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
        QF ← QN
      Backward search (bottom-up), data writes w → v.
      Input: directed graph G = (V, AB), queue QF. Data: queue QN, visited, tree π(v).
        QN ← ∅
        for w ∈ V \ visited in parallel do
          for v ∈ AB(w) do
            if v ∈ QF then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
              break
        QF ← QN
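The two steps and the switch between them can be sketched as follows. This is a sequential illustration under stated assumptions: `adj_out` and `adj_in` are lists of out-going and in-coming adjacency lists, and the switching rule (bottom-up when the frontier holds more than a fraction `threshold` of all vertices) is a hypothetical stand-in for the tuned heuristic of Beamer2012, which is not reproduced here.

```python
def top_down_step(adj_out, frontier, visited, pred):
    """Expand every frontier vertex along its out-going edges."""
    next_frontier = []
    for v in frontier:
        for w in adj_out[v]:
            if not visited[w]:
                visited[w] = True
                pred[w] = v
                next_frontier.append(w)
    return next_frontier


def bottom_up_step(adj_in, in_frontier, visited, pred):
    """Let every unvisited vertex look for a parent in the frontier."""
    next_frontier = []
    for w in range(len(adj_in)):
        if visited[w]:
            continue
        for v in adj_in[w]:
            if in_frontier[v]:
                visited[w] = True
                pred[w] = v
                next_frontier.append(w)
                break  # the remaining in-coming edges of w are skipped
    return next_frontier


def hybrid_bfs(adj_out, adj_in, source, threshold=0.05):
    """Direction-switching BFS; `threshold` is an illustrative parameter."""
    n = len(adj_out)
    pred = [-1] * n
    pred[source] = source
    visited = [False] * n
    visited[source] = True
    frontier = [source]
    while frontier:
        if len(frontier) > threshold * n:
            in_frontier = [False] * n
            for v in frontier:
                in_frontier[v] = True
            frontier = bottom_up_step(adj_in, in_frontier, visited, pred)
        else:
            frontier = top_down_step(adj_out, frontier, visited, pred)
    return pred
```

For an undirected graph the same adjacency lists can be passed as both `adj_out` and `adj_in`. The `break` in `bottom_up_step` is what makes the bottom-up direction cheap when the frontier is large.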
  • 9. Hybrid-BFS (direction-optimizing BFS) [Beamer2012]. Chooses top-down or bottom-up according to the number of traversed edges at each level; Hybrid-BFS reduces unnecessary edge traversals. Number of traversed edges for forward search (top-down) and backward search (bottom-up) on a Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30); Level is the distance from the source:
      Level   Top-down        Bottom-up       Hybrid
      0       2               2,103,840,895   2
      1       66,206          1,766,587,029   66,206
      2       346,918,235     52,677,691      52,677,691
      3       1,727,195,615   12,820,854      12,820,854
      4       29,557,400      103,184         103,184
      5       82,357          21,467          21,467
      6       221             21,240          227
      Total   2,103,820,036   3,936,072,360   65,689,631
      Ratio   100.00%         187.09%         3.12%
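The totals and ratios in this table follow from the per-level counts, which can be verified with a short script. The numbers below are copied from the slide; the helper name `ratio_percent` is hypothetical.

```python
# Per-level traversed-edge counts from the slide (levels 0 to 6).
top_down = [2, 66_206, 346_918_235, 1_727_195_615, 29_557_400, 82_357, 221]
bottom_up = [2_103_840_895, 1_766_587_029, 52_677_691, 12_820_854,
             103_184, 21_467, 21_240]
hybrid = [2, 66_206, 52_677_691, 12_820_854, 103_184, 21_467, 227]


def ratio_percent(total, baseline):
    """Total as a percentage of the baseline, rounded to two decimals."""
    return round(100 * total / baseline, 2)


baseline = sum(top_down)
for name, counts in (("Top-down", top_down),
                     ("Bottom-up", bottom_up),
                     ("Hybrid", hybrid)):
    print(name, sum(counts), ratio_percent(sum(counts), baseline))
```

The hybrid column takes the top-down count at levels 0 and 1 and the bottom-up count at levels 2 to 5, which is where nearly all of the saving comes from.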
  • 10. NUMA-optimized BFS. Accesses to local and remote memory are clearly separated: edge traversal runs on local RAM, and the all-gather of local queues and bitmaps goes to remote RAM. At each level: traversal on local RAM, then swap on remote RAM, which aggregates the local neighbor queues QN0, QN1, QN2, QN3 into the frontier QF. Direction choice: if the frontier is large, NUMA-optimized bottom-up; otherwise, NUMA-optimized top-down. Each NUMA node searches its local neighbors from a locally copied frontier, using out-going edges for top-down and in-coming edges for bottom-up. NUMA-opt. therefore requires two CSR graphs (note: they are not the same for an undirected graph).
  • 11. Local edge traversal. NUMA-aware partitioning of the graph: assume a computer with ℓ CPU sockets (ℓ RAMs); the partial vertex set Vk and the partial adjacency list Ak are placed on the k-th socket:
      V = V0 | V1 | ··· | Vℓ−1,  A = A0 | A1 | ··· | Aℓ−1.
      Vk is a simple one-dimensional partition (n: number of vertices, ℓ: number of CPU sockets):
      Vk = { vj ∈ V | j ∈ [kn/ℓ, (k+1)n/ℓ) }.
      Two kinds of adjacency lists are defined on the multi-socket machine.
      The k-th NUMA node holds: the partial vertices Vk; the partial edges (vi, vj) with vi ∈ V and vj ∈ Vk, stored in Ak; a local copy of the frontier QF; and its own visitedk and QNk. Figure: four NUMA nodes (0th, 1st, 2nd, 3rd), each an 8-core Xeon E5-4640 socket (per-core L2 cache, shared L3 cache) with its local RAM, holding (Vk, QF, visitedk, QNk, Ak) for k = 0, 1, 2, 3.
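The one-dimensional partition Vk can be written down in a few lines. This is a minimal sketch assuming integer vertex indices 0..n-1 and floor division at the interval ends; the function names are hypothetical, and the edge split shown follows the "vj ∈ Vk" rule read from the slide.

```python
def partition_range(k, n, num_nodes):
    """Index range of V_k: j in [k*n/l, (k+1)*n/l), using floor division."""
    return range(k * n // num_nodes, (k + 1) * n // num_nodes)


def owner(j, n, num_nodes):
    """NUMA node whose partition V_k contains vertex index j."""
    for k in range(num_nodes):
        if j in partition_range(k, n, num_nodes):
            return k
    raise ValueError("vertex index out of range")


def local_edges(edges, k, n, num_nodes):
    """Edges (vi, vj) kept on node k: those whose endpoint vj lies in V_k."""
    part = partition_range(k, n, num_nodes)
    return [(vi, vj) for vi, vj in edges if vj in part]


print([list(partition_range(k, 10, 4)) for k in range(4)])
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```

Because the intervals are half-open, every vertex belongs to exactly one node even when n is not a multiple of the number of sockets.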
  • 12. CPU affinity and local memory binding. ULIBC (Ubiquity Library for Intelligently Binding Cores) provides routines for CPU affinity plus local memory binding. It manages each processor core (processor ID) by topology information as a tuple (SMT ID, core ID, package ID). Package ID: index of the CPU socket. Core ID: index of the physical core in each CPU socket. SMT ID: index of the thread in each physical core. Processor ID: index of the logical processor core. CPU affinity procedure: 1. Detect the online processors (those allocated to the current process, out of all processors; the others may be used by other processes) using the sched_getaffinity system call. 2. Bind each thread to a logical core using the sched_setaffinity system call or the Intel compiler Thread Affinity Interface. Figure: NUMA node 0 and NUMA node 1 with cores 0 to 3, each node using its local RAM.
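The two system calls named on this slide are exposed on Linux through Python's `os` module, which allows a small illustration of the detect-then-bind procedure. This is not ULIBC's API; the function names below are hypothetical, the calls are Linux-only, and local memory binding is not covered.

```python
import os


def online_processors():
    """Logical CPUs the current process may run on (step 1)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    # Fallback for platforms without the call: assume every CPU is usable.
    return list(range(os.cpu_count() or 1))


def round_robin_assignment(num_threads):
    """Map thread index -> logical CPU, cycling over the online processors."""
    cpus = online_processors()
    return [cpus[i % len(cpus)] for i in range(num_threads)]


def bind_calling_thread(cpu):
    """Pin the calling thread to one logical CPU (step 2, Linux only)."""
    os.sched_setaffinity(0, {cpu})


print(round_robin_assignment(4))
```

In a threaded program each worker would call `bind_calling_thread` with its entry of the assignment before touching its share of the data, so that memory it allocates tends to land on the local NUMA node.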
  • 13. Related work: TEPS ratios on a single node. Our BFS achieves 31.7 GTEPS for a Kronecker graph (SCALE 27). Figure: GTEPS (0 to 35) versus SCALE (20 to 29) for the reference code, Agarwal2010, Beamer2012, Yasui2013 and Yasui2014; speed-up annotations on the chart: x 2.2, x 2.6, x 5.9 faster.
      Work                        Method                        System                         GTEPS
      Reference code              -                             -                              0.1
      Agarwal2010                 NUMA-aware top-down BFS       4-way Intel Xeon 7560          0.8 (m/n=64, 1.1 GTEPS)
      Beamer2012                  Hybrid-BFS                    4-way Intel Xeon E5-8870       5.1
      Yasui2013                   NUMA-opt. Hybrid-BFS          4-way Intel Xeon E5-4640       11.1
      Yasui2014 (this paper)      Degree-aware NUMA-opt. BFS    4-way Intel Xeon E5-4650       31.7
  • 14. Breakdown of Hybrid-BFS for a Kronecker graph with SCALE 27 (2^27 vertices and 2^31 edges).
      Number of traversed edges:
      Level   Step        Hybrid-BFS
      0       Top-down    22
      1       Top-down    239,930
      2       Bottom-up   150,006,673
      3       Bottom-up   19,742,764
      4       Bottom-up   139,817
      5       Bottom-up   41,846
      6       Top-down    260
      Total   -           170,171,312
      %       -           4.0 %
      Most of the CPU time is spent in the bottom-up steps of Hybrid-BFS, which account for 99.9 % of the traversed edges. In particular, the bottom-up step at level 2 accounts for almost all edge traversals (88.1 %). The traversed edges are about 8 % of the 2^31 = 2,147,483,648 edges (100 %).
      Breakdown of vertex traversal:
      Total vertices                       134,217,728   100.0 %   (= 2^27)
      Unvisited vertices: zero-degree      71,140,085    53.0 %
      Unvisited vertices: isolated         41,527        0.0 %
      Visited vertices: by top-down        283           0.0 %
      Visited vertices: by bottom-up       63,035,833    47.0 %
      Most of the vertex traversal is done by the bottom-up steps of Hybrid-BFS. About half of the vertices are never visited.
  • 15. Influence of the ordering of adjacency vertices. The computational complexity of the bottom-up step depends on the ordering of the adjacency vertices of each vertex. The adjacency list A(v) is sorted using the out-degree of each adjacent vertex w; in descending order, high-degree vertices come first and low-degree vertices last. Loop count τ: the bottom-up step scans A(v) until it finds a frontier vertex and then breaks the loop, so the scanned entries are the traversed adjacency vertices and the rest are skipped. The number of traversed edges is strongly affected by the ordering at level 2; descending order is better.
      Table 3. Number of traversed edges in a BFS for a Kronecker graph with SCALE 27 (Step: T = top-down, B = bottom-up; the last three columns are the hybrid algorithm under each ordering).
      Level   Top-down        Bottom-up       Step   Ascending     Randomized    Descending
      0       22              4,223,250,243   T      22            22            22
      1       239,930         3,258,645,723   T      239,930       239,930       239,930
      2       1,040,268,126   83,878,899      B      848,743,124   150,006,673   83,878,899
      3       3,145,608,885   19,616,130      B      19,935,737    19,742,764    19,616,130
      4       37,007,608      139,606         B      139,868       139,817       139,606
      5       98,339          41,846          B      41,846        41,846        41,846
      6       260             41,586          T      260           260           260
      Total   4,223,223,170   7,585,614,033   -      869,100,787   170,171,312   103,916,693
      %       100 %           179.6 %         -      20.6 %        4.0 %         2.5 %
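The effect of the ordering on the loop count τ can be reproduced on a toy graph. The sketch below assumes an undirected graph stored as a list of adjacency lists and uses the list length as the degree; the function names are hypothetical and the example graph is not taken from the slides.

```python
def sort_adjacency_by_degree(adj):
    """Return adjacency lists sorted by neighbor degree, highest first."""
    deg = [len(a) for a in adj]
    return [sorted(a, key=lambda w: -deg[w]) for a in adj]


def loop_count(neighbors, in_frontier):
    """Number of neighbors scanned until the first frontier vertex is found.

    If no neighbor is in the frontier, the whole list has been scanned.
    """
    tau = 0
    for v in neighbors:
        tau += 1
        if in_frontier[v]:
            break
    return tau


adj = [[1, 2, 3], [0], [0, 3], [0, 2]]
in_frontier = [True, False, False, False]    # only the hub, vertex 0
print(loop_count([3, 0], in_frontier))       # ascending order of degree: 2
print(loop_count(sort_adjacency_by_degree(adj)[2], in_frontier))  # 1
```

High-degree vertices are the most likely to be in a large frontier, so placing them first tends to end the scan at τ = 1.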
  • 16. Analysis of the loop count τ for each vertex. Table 3 (number of traversed edges in a BFS for a Kronecker graph with SCALE 27) is repeated from slide 15. Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFS; three panels (Ascending, Randomized, Descending), x-axis: loop count τ, y-axis: number of fixed vertices (10^0 to 10^8), series: Lv.2, Lv.3, Lv.4, Lv.5. Maximum τ: 5,873 (ascending), 58 (randomized), 28 (descending). The bottom-up steps found 46.8 % of the vertices. With descending order, most vertices are found at the first loop, i.e. at the first vertex of the adjacency list: 45.0 % at τ = 1 and 1.8 % at τ ≥ 2. The other two orderings show splits of 19.0 % + 27.8 % and 28.1 % + 18.7 % between τ = 1 and τ ≥ 2. Descending order is better.
  • 17. Three features and degree-aware BFS. 1. Half of the vertices have no adjacency vertices: suppress zero-degree vertices using a renumbering technique for the non-zero-degree vertices (zero-degree opt.). 2. The computational complexity of bottom-up depends on the ordering of the adjacency vertices of each vertex: sort each adjacency list by out-degree in descending order. 3. Most vertices were found at the first loop of bottom-up: use a separated graph representation, with the highest-degree adjacency vertex list A+ and the remaining CSR graph A- (high-degree opt.). Diagram: standard CSR graph = A+ (the highest-degree adjacency vertex of each vertex i) + A- (the remaining adjacency vertices, from high degree to low degree); the sizes are labelled n and m-n.
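The three features can be combined in a small construction routine. The following is a sketch under stated assumptions, not the authors' data structure: the graph is undirected with symmetric adjacency lists (so every neighbor has non-zero degree), the list length is used as the degree, and the names `build_degree_aware`, `a_plus` and `a_minus` are hypothetical. How the original implementation orders the renumbered vertices is not stated on the slide; here they simply keep their relative order.

```python
def build_degree_aware(adj):
    """Split a graph into A+ (best neighbor per vertex) and A- (the rest).

    Returns (old_ids, a_plus, a_minus), all indexed by the new vertex ids.
    """
    deg = [len(a) for a in adj]
    # Feature 1: renumber so that only non-zero-degree vertices remain.
    old_ids = [v for v in range(len(adj)) if deg[v] > 0]
    new_id = {old: new for new, old in enumerate(old_ids)}
    a_plus, a_minus = [], []
    for old in old_ids:
        # Feature 2: neighbors in descending order of degree.
        nbrs = sorted(adj[old], key=lambda w: -deg[w])
        # Feature 3: the highest-degree neighbor goes to A+, the rest to A-.
        a_plus.append(new_id[nbrs[0]])
        a_minus.append([new_id[w] for w in nbrs[1:]])
    return old_ids, a_plus, a_minus


adj = [[2, 3], [], [0, 3, 4], [0, 2], [2], []]
old_ids, a_plus, a_minus = build_degree_aware(adj)
print(old_ids, a_plus, a_minus)
```

A bottom-up step over this layout would test `a_plus[w]` first and fall back to `a_minus[w]` only on a miss, so the common τ = 1 case touches one compact array of n entries.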
  • 18. Performance improvements
    Degree-aware BFS is 2.68 times faster than NUMA-opt. BFS
    [Bar chart, GTEPS (axis 0 to 30): NUMA-opt. (previous paper) x1.00; + High-deg x1.34; + zero-deg x2.03; Degree-aware (this paper) x2.68]
  • 19. Intel Xeon (64) vs. SGI Altix UV1000 (512) for Kronecker graph with SCALE29
    –  536.9 million vertices, 8.59 billion edges
    Intel Xeon (Sandybridge-EP arch.): Intel Xeon E5-4650 @ 2.70GHz, 64 threads = 4 NUMA nodes x 8 cores x 2 threads, 512 GB RAM; run: 64 threads, 400 GB
    SGI Altix UV1000 (Westmere-EX arch.): Intel Xeon E7-8837 @ 2.67GHz, 512 threads = 64 NUMA nodes x 8 cores, 4.0 TB RAM; run: 512 threads, 1.0 TBytes
    Throughput labels (in slide order): 31.81 GE/s, 21.81 GE/s
    [Two charts, one per system: CPU time (ms, 0 to 300) per level (Init, Lv.0 to Lv.7), split into Traversal (local, for local RAM) and Swap (all-gather, for remote RAM); four levels marked BottomUp]
  • 20. Intel Xeon (64) vs. SGI Altix UV1000 (512) for Kronecker graph with SCALE30
    –  1.07 billion vertices, 17.18 billion edges
    Intel Xeon (Sandybridge-EP arch.): Intel Xeon E5-4650 @ 2.70GHz, 64 threads = 4 NUMA nodes x 8 cores x 2 threads, 512 GB RAM; out of memory
    SGI Altix UV1000 (Westmere-EX arch.): Intel Xeon E7-8837 @ 2.67GHz, 512 threads = 64 NUMA nodes x 8 cores, 4.0 TB RAM; run: 512 threads, 2.0 TBytes; 37.70 GE/s
    Rank 50; fastest single-node result on the Nov. 2013 list
    [Chart for the UV1000: CPU time (ms, 0 to 300) per level (Init, Lv.0 to Lv.7), split into Traversal (local, for local RAM) and Swap (all-gather, for remote RAM); four levels marked BottomUp]
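    The vertex and edge counts quoted for SCALE29 and SCALE30 follow from the benchmark parameters. The sketch below reproduces them. It assumes an edgefactor of 16, which is not stated on these slides but matches the quoted edge-to-vertex ratio; the function name `kronecker_size` is made up for the example.

```python
def kronecker_size(scale: int, edgefactor: int = 16):
    """Return (vertices, edges) for a Kronecker graph.

    Assumption: n = 2**scale and m = edgefactor * n, with edgefactor = 16.
    """
    n = 2 ** scale
    m = edgefactor * n
    return n, m


for scale in (29, 30):
    n, m = kronecker_size(scale)
    print(f"SCALE{scale}: {n / 1e6:.1f} million vertices, "
          f"{m / 1e9:.2f} billion edges")
```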
  • 21. Strong scaling on SGI Altix UV1000
    for Kronecker graph with SCALE30
    –  1.07 billion vertices, 17.18 billion edges

    Threads           GE/s     Local (ms)   Remote (ms)   L : R
    128               18.76    733          182           80% : 20%
    256               26.17    438          218           67% : 33%
    512 (one rack)    37.70    258          197           57% : 43%

    •  As the number of threads increases,
      –  the CPU time for local memory access improves
      –  the CPU time for remote memory access stays roughly the same
    Rank 50; fastest single-node result on the Nov. 2013 list
    [Three charts, one per thread count: CPU time (ms, 0 to 600) per level (Init, Lv.0 to Lv.7), split into Traversal (local) and Swap (all-gather), with the bottom-up levels marked BottomUp]
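    The L : R ratios on this slide can be recomputed from the local and remote CPU times. In the sketch below, the pairing of times with thread counts is an assumption: it is the pairing for which the quoted ratios come out exactly. The helper name `local_remote_share` is made up for the example.

```python
# threads -> (local ms, remote ms, GE/s); pairing inferred from the ratios
runs = {
    128: (733, 182, 18.76),
    256: (438, 218, 26.17),
    512: (258, 197, 37.70),
}


def local_remote_share(local_ms: float, remote_ms: float):
    """Return the local and remote shares of CPU time as whole percents."""
    total = local_ms + remote_ms
    return round(100 * local_ms / total), round(100 * remote_ms / total)


for threads, (local_ms, remote_ms, ges) in runs.items():
    l_share, r_share = local_remote_share(local_ms, remote_ms)
    print(f"{threads} threads: L : R = {l_share}% : {r_share}%, {ges} GE/s")

# Throughput gain when going from 128 to 512 threads (4x the threads)
speedup = runs[512][2] / runs[128][2]
print(f"speedup 128 -> 512 threads: x{speedup:.2f}")
```

    Under that pairing, four times as many threads gives roughly twice the throughput, because the remote (all-gather) time stays near 200 ms while only the local time shrinks.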
  • 22. BFS performance for real networks
    •  Suitable for small-world networks
      –  efficient for a low diameter and a large edgefactor
    Small-world: Twitter follow-ship network in 2009, 61.6 million vertices & 1.47 billion edges, 10.90 GTEPS (max. 24.09 GTEPS)
    Non-small-world: US road network, 24 million vertices & 58 million edges, 0.09 GTEPS (max. 0.11 GTEPS)

    … faster than the former owing to its edgefactor being 1.68 times larger. In addition, twitter and friendster show similar BFS performance of approximately 10 GTEPS because they have similar edgefactors and similar diameters. These results indicate that the performance of our BFS is affected by both the edgefactor and the diameter of the network. From these numerical results, we could achieve high performance for large-scale small-world networks with a large edgefactor.

    Table 9. BFS performance of real-world networks on the Sandybridge-EP system.

                           Graph size             edgefactor  Diameter  GTEPS
    Instance               n         m            m/n         diam′_G   min    1/4     median  3/4     max
    wiki-Talk [23, 24]     2.39 M    5.02 M       2.1         8         0.29   0.61    0.75    0.87    1.26
    USA-road-d [25]        23.95 M   58.33 M      2.4         8,098     0.07   0.08    0.09    0.09    0.11
    LiveJournal [26, 27]   4.85 M    68.99 M      14.2        16        2.76   3.76    4.07    4.32    4.94
    twitter [28]           61.58 M   1,468.37 M   23.8        16        7.58   10.02   10.90   12.68   24.09
    friendster [29]        65.61 M   1,806.07 M   27.5        25        4.89   9.61    10.74   11.29   11.81
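    The edgefactor column of Table 9 is the ratio m/n. The sketch below recomputes it from the graph sizes in the table. The dictionary layout is an assumption made for the example. The last lines show that the twitter-to-LiveJournal edgefactor ratio comes out at 1.68, which may be the comparison the truncated sentence refers to; the excerpt does not confirm this.

```python
# instance -> (n, m) in millions, taken from Table 9
graphs = {
    "wiki-Talk": (2.39, 5.02),
    "USA-road-d": (23.95, 58.33),
    "LiveJournal": (4.85, 68.99),
    "twitter": (61.58, 1468.37),
    "friendster": (65.61, 1806.07),
}

# edgefactor = m / n, rounded to one decimal as in the table
edgefactor = {name: round(m / n, 1) for name, (n, m) in graphs.items()}
for name, ef in edgefactor.items():
    print(f"{name}: edgefactor {ef}")

# Ratio of the tabulated edgefactors of twitter and LiveJournal
ratio = edgefactor["twitter"] / edgefactor["LiveJournal"]
print(f"twitter / LiveJournal: x{ratio:.2f}")
```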
  • 23. The Green Graph500 list in Nov. 2013
    •  Measures power efficiency using the TEPS/W ratio
    •  Our results on various systems such as Xeon servers and Android devices
    http://green.graph500.org
    [Diagram: Graph500 benchmark flow. Input parameters (SCALE, edgefactor) → 1. Generation → 2. Construction → 3. BFS phase (BFS and validation, x 64 iterations) → results (BFS time, traversed edges, TEPS) → median TEPS. Green Graph500 measures power consumption (watt) during the BFS phase and reports the TEPS/W ratio.]
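    The TEPS/W metric on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the official Green Graph500 measurement procedure: it takes the median TEPS over the BFS iterations and divides by one average power figure for the BFS phase. The function name `teps_per_watt` and the sample numbers are made up.

```python
from statistics import median
from typing import Sequence


def teps_per_watt(traversed_edges: Sequence[float],
                  bfs_seconds: Sequence[float],
                  average_watts: float) -> float:
    """Median TEPS over all BFS iterations, divided by average power.

    Assumption: one power figure (watts) covers the whole BFS phase.
    """
    teps = [edges / secs for edges, secs in zip(traversed_edges, bfs_seconds)]
    return median(teps) / average_watts


# Three made-up iterations instead of the benchmark's 64
edges = [100.0, 100.0, 100.0]
seconds = [1.0, 2.0, 4.0]
print(teps_per_watt(edges, seconds, average_watts=5.0))
```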
  • 24. TEPS and TEPS/W on 4-way Xeon

    ℓ x t (threads)   GTEPS    Power (W)   MTEPS/W
    1 x 16 (16)       7.92     364.9       21.71
    2 x 8 (16)        11.83    452.6       26.13
    4 x 4 (16)        13.96    517.8       26.96
    4 x 8 (32)        22.03    586.7       37.55
    4 x 16 (64)       29.03    639.1       45.43

    •  #threads = 16: 1 x 16, 2 x 8, 4 x 4
    •  #NUMA nodes = 4: 4 x 8, 4 x 16
    •  4 x 16 is both the fastest and the most energy-efficient setting
    [Bar chart: GTEPS (0 to 30) by CPU affinity ℓ x t (number of threads): 1 x 1 (1), 4 x 1 (4), 4 x 2 (8), 1 x 16 (16), 2 x 8 (16), 4 x 4 (16), 2 x 16 (32), 4 x 8 (32), 4 x 16 (64), w/o (64); series: Degree-aware, Reference, NUMA-opt.]
  • 25. Green Graph500 on Xperia-A-SO-04E
    Smartphone: SONY Xperia-A-SO-04E (CPU: 4-core Snapdragon, RAM: 2 GB), #1 in the Nov. 2013 list
    •  153.17 MTEPS/W (477.64 MTEPS)
    •  Roughly the same power consumption
    •  Manages to be both fast and energy-efficient (x150 in MTEPS/W)

    Fig. 7. MTEPS of reference BFS and Degree-aware BFS on XperiaA SO-04E.
    [Plot: x-axis SCALE 10 to 20; y-axis 0 to 500; series: Reference (p = 4), Degree-aware BFS (p = 4)]

    … suggesting that the effective power is not strongly affected by the number of threads and the algorithm used. With regard to energy-efficient computation, our BFS is around 100 times faster than the reference code for roughly the same effective power of 3.0 W; specifically, our BFS shows an energy-efficient performance of 153.17 MTEPS/W.

    Table 10. Energy efficiency of BFS for Kronecker graph on XperiaA SO-04E.

    Implementation       SCALE   MTEPS    watt   MTEPS/W
    Reference (p = 1)    20      3.25     3.15   1.03
    Reference (p = 4)    20      4.58     3.22   1.42
    This study (p = 1)   20      136.29   3.23   42.25
    This study (p = 2)   20      248.08   2.99   82.92
    This study (p = 4)   20      477.63   3.12   153.17
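    The "around 100 times faster" and "x150" figures on this slide can be checked against Table 10. The sketch below is only that arithmetic. Which baseline each figure uses is an assumption: the numbers fit the reference code with p = 4 for speed and with p = 1 for energy efficiency.

```python
# implementation -> (MTEPS, watt, MTEPS/W), copied from Table 10
table10 = {
    "Reference (p = 1)": (3.25, 3.15, 1.03),
    "Reference (p = 4)": (4.58, 3.22, 1.42),
    "This study (p = 4)": (477.63, 3.12, 153.17),
}

ours = table10["This study (p = 4)"]

# Speed ratio against the 4-thread reference code
speedup = ours[0] / table10["Reference (p = 4)"][0]

# Energy-efficiency ratio against the 1-thread reference code
efficiency_gain = ours[2] / table10["Reference (p = 1)"][2]

print(f"speed: x{speedup:.0f}, energy efficiency: x{efficiency_gain:.0f}")
```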
  • 26. Conclusion
    •  Degree-aware BFS
      –  Speedup techniques that consider the vertex degree
      –  1) Zero-degree vertex suppression
      –  2) Separated graph representation
      –  2.68 times faster than our previous algorithm
    •  Our BFS achieves the fastest single-node performance
      –  37.7 GTEPS for SCALE30 on SGI Altix UV1000 (one rack)
    •  Investigation of affinity and power consumption
      –  The 4 sockets x 16 threads affinity gives the highest TEPS and TEPS/W on the 4-way Intel Xeon server
      –  First position in the small data category of the 2nd Green Graph500 list, on an Android device