Contenu connexe Similaire à Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL (20) Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL2. http://bigdata.com
http://mapgraph.io
Graphs
• This talk is about recent advances in large scale graph processing on
GPUs.
– The motivation is extreme performance.
– Everything we do (as a company) is focused on graphs.
• Graph Database
• Graph processing
• Common characteristics:
– irregular data shape, irregular access patterns, and irregular parallelism.
• A lot of data can be mapped onto graphs
– Sparse matrices and graphs are very close data structures
– Graphs, as we deal with them, have attributes on vertices and edges.
• A lot of algorithms can be mapped onto graphs
– Including many machine learning algorithms.
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
2
http://www.bigdata.com/blog
3. Small Business, Founded 2006 100% Employee Owned
http://bigdata.com
http://mapgraph.io
SYSTAP, LLC
Graph Database
• High performance, Scalable
– 50B edges/node
– High level query language
– Efficient Graph Traversal
– High 9s solution
• Open Source
– Subscriptions
GPU Analytics
• Extreme Performance
– 5-100x faster than graphlab
– 10,000x faster than graphdbs
• DARPA funding
• Disruptive technology
– Early adopters
– Huge ROIs
• Open Source
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
3
9/19/2014
4. Related “Graph” Technologies
Redpoint repositions existing technology,
adding interoperability for blueprints and
gremlin.
http://bigdata.com
http://mapgraph.io
STTR
Pair up bigdata
and MapGraph
MapGraph compares favorably with high end
hardware solutions from YARC, Oracle, and
SAP, but is open source and uses commodity
hardware.
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
4
9/19/2014
5. Embedded, Single Server, HA, Scale-out
• RDF/SPARQL
• Property graphs
– Blueprints, gremlin, rexter
• REST API (NSS)
• Extension points
– Stored queries for custom
application logic on the
server.
– Custom services & indices
– Custom functions
– Vertex-centric programs
http://bigdata.com
http://mapgraph.io
• Embedded Server
Journal
JVM
• Standalone Server
Journal
WAR
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
5
9/19/2014
6. http://bigdata.com
http://mapgraph.io
High Availability
• Shared nothing architecture
– Same data on each node
– Coordinate only at commit
– Transparent load balancing
• Scaling
– 50 billion triples or quads
– Query throughput scales linearly
• Self healing
– Automatic failover
– Automatic resync after disconnect
– Online single node disaster recovery
• Online Backup
– Online snapshots (full backups)
– HA Logs (incremental backups)
• Point in time recovery (offline)
HAService
Quorum
k=3
size=3
leader
follower
HAService
HAService
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
6
9/19/2014
7. Embedded, Single Server, HA, Scale-out
http://bigdata.com
http://mapgraph.io
RDF Data and SPARQL Query
Client Service
Distributed Index Management and Query
Management Functions
Client Service
Registrar
Data Service
Client Service
Data Service Data Service Data Service
Data Service Data Service Data Service
Zookeeper
Shard Locator
Transaction Mgr
Load Balancer
Unified API
Application
Client
Application
Client
Application
Client
Application
Client
Application
Client
Client Service
SPARQL XML
SPARQL JSON
RDF/XML
N-Triples
N-Quads
Turtle
TriG
RDF/JSON
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
7
9/19/2014
9. Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or limited solution
– Key-value stores (bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into “big bucket”
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
9
9/19/2014
10. Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or limited solution
– Key-value stores (bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into “big bucket”
Storage and computation patterns must be
correctly matched for high performance.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
10
9/19/2014
11. Optimize for the right problem
• Graph analytics
– Parallelism – work must be distributed and balanced.
– Memory bandwidth – memory, not disk, is the bottleneck
– 2D partitioning – O(log(N)) communications pattern (versus O(N*N))
• 1D design looses locality when updating link weights for reverse indices.
BFS PR
• Storage and computation patterns must be correctly matched
for high performance.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
11
9/19/2014
12. GPUs – A Game Changer for Graph Analytics
• Graphs are a hard problem
• Non-locality
• Data dependent parallelism
• Memory, PCIe bus and network are
bottlenecks
• Recent performance gains driven by
innovations in bottom-up search,
data layout, and partitioning.
• GPUs deliver effective parallelism
• 10x CPU FLOPS
• 10x CPU/RAM bandwidth
• Significant speeds up over CPU
• 3 GTEPS on one GPU
• 32 GTEPS on 64 GPU cluster
http://bigdata.com
http://mapgraph.io
3500
3000
2500
2000
1500
1000
500
0
Breadth-First Search on Graphs
NVIDIA Tesla C2050
Multicore per socket
Sequential
0
2
1 10 100 1000 10000 100000
Million Traversed Edges per
Second
Average Traversal Depth
1
1
2
1
1
2
2
2
1
3
2
3
2
1 2
2
10x Speedup on GPUs
13. http://bigdata.com
http://mapgraph.io
GPU Hardware Trends
• K40 GPU (today)
• 12G RAM/GPU
• 288 GB/s bandwidth
• PCIe Gen 3
• Pascal GPU (Q1 2016)
• 24G RAM/GPU
• 1 TB/s bandwidth
• Unified memory
across CPU, GPUs
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
13
9/19/2014
14. Full Bandwidth Access to CPU RAM
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
14
9/19/2014
15. Architecture shapes performance
• The data was a scale-free graph with 2.7M vertices and 5.6M
– MapGraph used a larger version of the graph (24M vertices, 25M edges)
• The query was a 5-degree subgraph (depth-limited BFS)
• Two main takeaways
– Horizontal scaling for titan is very expensive – wrong abstraction.
– GPUs are ridiculously fast.
platform load (s)
http://bigdata.com
http://mapgraph.io
query
(ms) comments
titan 497.00 935 4 node cluster using Cassandra
neo4j 608.00 668 single node community edition
bigdata 396.00 281 single node (open source)
MapGraph 0.08 27 NVIDIA K20 GPU
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
15
9/19/2014
17. http://bigdata.com
http://mapgraph.io
Think Like a Vertex
• Simple APIs
pageRank(Message m) {
total = m.value();
vertex.val = .15 * .85 + total;
for(nbr : out_neighbors) {
SendMsg(nbr, vertex.val/num_out_nbrs);
}
}
• Lots of algorithms
– BFS, SSSP, Page Rank, Connected Components, Louvain Modularity,
Jaccard Distance, k-means clustering, Betweenness-Centrality,
Personalized Page Rank, Loopy Belief Propagation, Graph search (crisp
and approximate), etc.
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
17
9/19/2014
18. GAS – a Graph-Parallel Abstraction
• Graph-Parallel Vertex-Centric API ala GraphLab
• “Think like a vertex”
• Gather: collect information
about my neighborhood
• Apply: update my value
• Scatter: signal adjacent vertices
• Can write all sorts of graph algorithms this way
– BFS, PageRank, Connected Component, Triangle Counting, Max Flow,
etc.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
18
9/19/2014
19. http://bigdata.com
http://mapgraph.io
MapGraph
• High-level graph processing framework
• High programmability
GPU architecture
Optimization techniques
CUDA
• High performance
Comparable to low-level approach
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
19
9/19/2014
20. http://bigdata.com
http://mapgraph.io
MapGraph
• High-level graph processing framework
• High programmability
GPU architecture
Optimization techniques
CUDA
• High performance
Comparable to low-level approach
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
20
9/19/2014
21. http://bigdata.com
http://mapgraph.io
Single GPU MapGraph (BFS)
Dataset #vertices #edges Max Degree Milliseconds
Webbase 1,000,005 3,105,536 23 1.2
Delaunay 2,097,152 6,291,408 4,700 24.5
Bitcoin 6,297,539 28,143,065 4,075,472 345.3
Wiki 3,566,907 45,030,389 7,061 51.0
Kron 1,048,576 89,239,674 131,505 47.7
154.0
513.6
74.8
821.3
1870.9
2,000
1,800
1,600
1,400
1,200
1,000
800
600
400
200
0
Webbase Delaunay Bitcoin Wiki Kron
MTEPS
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
21
9/19/2014
22. BFS Results : MapGraph vs GraphLab
1,000.00
100.00
10.00
http://bigdata.com
http://mapgraph.io
1.00
0.10
Webbase Delaunay Bitcoin Wiki Kron
Speedup
MapGraph Speedup vs GraphLab (BFS)
GL-2
GL-4
GL-8
GL-12
MPG
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
22
9/19/2014
23. PageRank : MapGraph vs GraphLab
100.00
10.00
http://bigdata.com
http://mapgraph.io
1.00
0.10
Webbase Delaunay Bitcoin Wiki Kron
Speedup
MapGraph Speedup vs GraphLab (Page Rank)
GL-2
GL-4
GL-8
GL-12
MPG
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
23
9/19/2014
24. Graph Mining on GPU Clusters
• 2D partitioning (aka vertex cuts)
• Minimizes the communication volume.
• Batch parallel Gather in row, Scatter in
column.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
24
9/19/2014
26. • Work spans multiple orders of magnitude.
10,000,000
1,000,000
100,000
10,000
1,000
100
http://bigdata.com
http://mapgraph.io
10
1
Scale 25 Traversal
0 1 2 3 4 5 6
Frontier Size
Iteration
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
26
9/19/2014
27. http://bigdata.com
http://mapgraph.io
Strong Scaling
• Speedup on a constant problem size with more GPUs
• Problem scale 25
– 2^25 vertices (33,554,432)
– 2^26 directed edges (1,073,741,824)
23
22
21
20
19
18
17
16
15
14
13
10 20 30 40 50 60 70
GTEPS
#GPUs
Strong scaling
GPUs GTEPS Time (s)
16 14.3 0.075
25 16.4 0.066
36 18.1 0.059
64 22.7 0.047
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
27
9/19/2014
28. http://bigdata.com
http://mapgraph.io
Weak Scaling
• Scaling the problem size with more GPUs
35
30
25
20
15
10
5
0
1 4 16 64
GTEPS
#GPUs
Weak scaling
GPUs Scale Vertices Edges Time (s) GTEPS
1 21 2,097,152 67,108,864 0.0254 3
4 23 8,388,608 268,435,456 0.0429 6
16 25 33,554,432 1,073,741,824 0.0715 15
64 27 134,217,728 4,294,967,296 0.1478 29
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
28
9/19/2014
29. http://bigdata.com
http://mapgraph.io
Highlights
• For algorithms on large graphs
– Memory is the bottleneck
• CPUs quickly saturate the memory bus.
• CPU cache thrashing limits scaling for graph traversal.
• Continued performance gains for CPUs focus on reducing the #of visited edges to reduce
bandwidth.
• Hybrid CPU/GPU architectures offload either small degree vertices (reduce cache
thrashing) or high degree vertices (if the algorithm is FLOPS bound on the CPU, e.g., BC)
– Many core is the future.
– GPUs are primarily known for their FLOPS, but they have high memory bandwidth
and can deliver effective parallelism on parallel graph problems (with sophisticated
kernels).
• Scaling to very large graphs on large compute clusters
– Communications bound.
• Communication must be constant for perfect scaling
– Hybrid partitioning seeks to reduce #of messages, size of messages, and optimize
for asynchronous communications and degree-aware layouts for bottom-up search
to reduce memory bandwidth.
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
29
http://www.bigdata.com/blog