Web-Scale Graph Analytics
with Apache Spark
Joseph K Bradley
Bay Area Apache Spark Meetup
September 7, 2017
2
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. Carnegie Mellon in Machine Learning
3
About Databricks
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
4
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
DATABRICKS RUNTIME 3.2
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com
5
Apache Spark Engine
Spark Core, plus standard libraries: Spark Streaming, Spark SQL, MLlib, GraphX, …
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, & R APIs
6
7
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames: DataFrame-based graphs
• Bisecting K-Means: now part of MLlib
• Stanford CoreNLP integration: UDFs for NLP
spark-packages.org
8
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
9
Graphs
Example: airports & flights between them
[Figure: graph with airports JFK, IAD, LAX, SFO, SEA, DFW as vertices and flights as edges]
vertex:
id     City        State
"JFK"  "New York"  NY
edge:
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923
10
Apache Spark’s GraphX library
Overview
•  General-purpose graph processing library
•  Optimized for fast distributed computing
•  Library of algorithms: PageRank, Connected Components, etc.
Challenges
•  No Java, Python APIs
•  Lower-level RDD-based API (vs. DataFrames)
•  Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
11
The GraphFrames Spark Package
Goal: DataFrame-based graphs on Apache Spark
•  Simplify interactive queries
•  Support motif-finding for structural pattern search
•  Benefit from DataFrame optimizations
Collaboration between Databricks, UC Berkeley & MIT
+ Now with community contributors & committers!
13
GraphFrames
vertices DataFrame
id     City        State
"JFK"  "New York"  NY
"SEA"  "Seattle"   WA
edges DataFrame
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923
"DFW"  "SFO"  -7     4100224
14
Graph analysis with GraphFrames
Simple queries
Motif finding
Graph algorithms
15
Simple queries
SQL queries on vertices & edges
Simple graph queries (e.g., vertex degrees)
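A minimal sketch of the kind of simple graph query meant here, written in plain Python rather than Spark so that it runs anywhere. The edge list and the `vertex_degrees` helper are illustrative assumptions; only the JFK → SEA and DFW → SFO flights appear on the slides.

```python
from collections import Counter

# Hypothetical edge list (src, dst); only the first two flights come from the slides.
edges = [("JFK", "SEA"), ("DFW", "SFO"), ("SEA", "SFO"), ("SFO", "JFK")]

def vertex_degrees(edge_list):
    """Count in-, out-, and total degree for every vertex that touches an edge."""
    out_deg = Counter(src for src, _ in edge_list)
    in_deg = Counter(dst for _, dst in edge_list)
    vertices = set(out_deg) | set(in_deg)
    return {
        v: {"in": in_deg[v], "out": out_deg[v], "total": in_deg[v] + out_deg[v]}
        for v in vertices
    }

degrees = vertex_degrees(edges)
for airport in sorted(degrees):
    print(airport, degrees[airport])
```

On a DataFrame-based graph the same query is a group-by and count over the edges table.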
16
Motif finding
[Figure: graph of airports JFK, IAD, LAX, SFO, SEA, DFW]
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
20
Motif finding
[Figure: graph of airports JFK, IAD, LAX, SFO, SEA, DFW, with vertices (a), (b), (c) of one match highlighted]
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
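To make the motif concrete, here is a plain-Python sketch of what the pattern asks for: every pair of edges a → b → c for which no edge c → a exists, followed by the delay filter. It is not how GraphFrames evaluates motifs (that is done with DataFrame joins); the helper name and most of the flights are made up for illustration.

```python
# Hypothetical flights; only JFK -> SEA (45) and DFW -> SFO (-7) come from the slides.
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "SEA", "dst": "SFO", "delay": 10},
    {"src": "SFO", "dst": "JFK", "delay": 5},
    {"src": "DFW", "dst": "SFO", "delay": -7},
    {"src": "LAX", "dst": "JFK", "delay": 30},
]

def find_two_hops_without_return(edge_list):
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a) by brute force."""
    existing = {(e["src"], e["dst"]) for e in edge_list}
    matches = []
    for e1 in edge_list:
        for e2 in edge_list:
            chained = e1["dst"] == e2["src"]
            no_return_edge = (e2["dst"], e1["src"]) not in existing
            if chained and no_return_edge:
                matches.append(
                    {"a": e1["src"], "b": e1["dst"], "c": e2["dst"], "e1": e1, "e2": e2}
                )
    return matches

paths = find_two_hops_without_return(edges)
# Equivalent of paths.filter("e1.delay > 20")
delayed = [p for p in paths if p["e1"]["delay"] > 20]
for p in delayed:
    print(p["a"], "->", p["b"], "->", p["c"])
```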
21
Graph algorithms
Find important vertices
•  PageRank
Find paths between sets of vertices
•  Breadth-first search (BFS)
•  Shortest paths
Find groups of vertices (components, communities)
•  Connected components
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
Other
•  Triangle counting
•  SVDPlusPlus
22
Saving & loading graphs
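Save & load the DataFrames.
from graphframes import GraphFrame
# Load: read the vertex and edge DataFrames, then rebuild the graph.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
# Save: write each DataFrame separately.
g.vertices.write.parquet(...)
g.edges.write.parquet(...)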
23
GraphFrames vs. GraphX
                        GraphFrames                      GraphX
Built on                DataFrames                       RDDs
Languages               Scala, Java, Python              Scala
Use cases               Queries & algorithms             Algorithms
Vertex IDs              Any type (in Catalyst)           Long
Vertex/edge attributes  Any number of DataFrame columns  Any type (VD, ED)
24
2 types of graph libraries
Graph algorithms
•  Standard & custom algorithms
•  Optimized for batch processing
Graph queries
•  Motif finding
•  Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)
25
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
26
Algorithm implementations
Mostly wrappers for GraphX
•  PageRank
•  Shortest paths
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
•  SVDPlusPlus
Some algorithms implemented using DataFrames
•  Breadth-first search
•  Connected components
•  Triangle counting
•  Motif finding
27
Moving implementations to DataFrames
DataFrames are optimized for a huge number of small records.
•  columnar storage
•  code generation (“Project Tungsten”)
•  query optimization (“Project Catalyst”)
28
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
29
Pros of integer vertex IDs
GraphFrames take arbitrary vertex IDs.
→ convenient for users
Algorithms prefer integer vertex IDs.
→ optimize in-memory storage
→ reduce communication
Our task: Map unique vertex IDs to unique (long) integers.
30
The hashing trick?
• Possible solution: hash vertex ID to long integer
• What is the chance of collision?
•  1 - (1 - 1/N) * (1 - 2/N) * … * (1 - (k-1)/N), for k vertex IDs hashed into a range of size N
•  seems unlikely with long range N = 2^64
•  with 1 billion nodes, the chance is ~5.4%
• Problem: collisions change graph topology.
Name Hash
Tim 84088
Joseph -2070372689
Xiangrui 264245405
Felix 67762524
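A quick way to check the order of magnitude is the standard birthday approximation, P(collision) ≈ 1 - exp(-k(k-1) / (2N)). The sketch below is an illustration, not the calculation behind the slide: the function name is made up, and the result depends on which hash range N is assumed, so it prints the estimate for both a full 64-bit range and a 63-bit (non-negative long) range.

```python
import math

def collision_probability(num_ids, hash_range):
    """Birthday approximation for at least one collision among num_ids hashed values."""
    k, n = num_ids, hash_range
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * n))

one_billion = 10**9
p_64_bits = collision_probability(one_billion, 2**64)
p_63_bits = collision_probability(one_billion, 2**63)
print(f"N = 2^64: {p_64_bits:.2%}")
print(f"N = 2^63: {p_63_bits:.2%}")
```

Under either assumption the estimate is a few percent at one billion vertices, which is too high when a single collision silently merges two vertices.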
31
Generating unique IDs
Spark has built-in methods to generate unique IDs.
•  RDD: zipWithUniqueId(), zipWithIndex()
•  DataFrame: monotonically_increasing_id()
Possible solution: just use these methods
32
How it works
Partition 1
Vertex    ID
Tim       0
Joseph    1
Partition 2
Vertex    ID
Xiangrui  100 + 0
Felix     100 + 1
Partition 3
Vertex    ID
…         200 + 0
…         200 + 1
33
… but not always
• DataFrames/RDDs are immutable and reproducible by design.
• However, records do not always have stable orderings.
•  distinct
•  repartition
• cache() does not help.
Partition 1
Vertex  ID
Tim     0
Joseph  1
re-compute
Partition 1
Vertex  ID
Joseph  0
Tim     1
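The failure mode can be reproduced without Spark. The sketch below assigns IDs as partition offset plus position within the partition, using the stride of 100 from the illustration; the `assign_ids` helper is hypothetical. Re-computing a partition with its records in a different order changes which vertex receives which ID.

```python
def assign_ids(partitions, stride=100):
    """Give each record the ID partition_index * stride + position within its partition."""
    ids = {}
    for partition_index, records in enumerate(partitions):
        for position, vertex in enumerate(records):
            ids[vertex] = partition_index * stride + position
    return ids

first_run = assign_ids([["Tim", "Joseph"], ["Xiangrui", "Felix"]])
# Same data, but the first partition is re-computed and its records arrive in another order.
second_run = assign_ids([["Joseph", "Tim"], ["Xiangrui", "Felix"]])

print(first_run)
print(second_run)
```

Any edge that was already translated with the first mapping now points at the wrong vertex, which is why position-based IDs are only safe when record order is made deterministic.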
34
Our implementation
We implemented (v0.5.0) an expensive but correct version:
1.  (hash) re-partition + distinct vertex IDs
2.  sort vertex IDs within each partition
3.  generate unique integer IDs
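The three steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the GraphFrames source: `index_vertices`, the CRC32 partitioner, and the partition count are made up, and the `partition << 33` offset simply mirrors the documented bit layout of Spark's `monotonically_increasing_id()`.

```python
import zlib

def index_vertices(vertex_ids, num_partitions=4):
    """Map arbitrary vertex IDs to unique integers, independent of input order."""
    # Step 1: hash-partition and de-duplicate. A deterministic hash is used because
    # Python's built-in hash() for strings changes between processes.
    partitions = [set() for _ in range(num_partitions)]
    for vertex_id in vertex_ids:
        p = zlib.crc32(str(vertex_id).encode("utf-8")) % num_partitions
        partitions[p].add(vertex_id)

    mapping = {}
    for p, members in enumerate(partitions):
        # Step 2: sort within the partition so positions are reproducible.
        for position, vertex_id in enumerate(sorted(members)):
            # Step 3: partition offset in the high bits, position in the low bits.
            mapping[vertex_id] = (p << 33) + position
    return mapping

names = ["Tim", "Joseph", "Xiangrui", "Felix"]
ids = index_vertices(names)
ids_again = index_vertices(list(reversed(names)) + ["Tim"])
print(ids == ids_again)
```

The sort is what makes this expensive, and it is also what makes a re-computed partition produce the same IDs.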
35
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
36
Connected Components
Assign each vertex a component ID such that vertices receive the
same component ID iff they are connected.
Applications:
•  fraud detection
• Spark Summit 2016 keynote from Capital One
•  clustering
•  entity resolution
37
Naive implementation (GraphX)
1.  Assign each vertex a unique component ID.
2.  Iterate until convergence:
•  For each vertex v, update:
component ID of v ← smallest component ID in neighborhood of v
Pro: easy to implement
Con: slow convergence on large-diameter graphs
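A minimal single-machine sketch of the naive scheme, to show where the slow convergence comes from. It is not the GraphX code; the function name and the test graph are illustrative. On a path of n vertices the smallest ID moves one hop per round, so n - 1 rounds are needed.

```python
from collections import defaultdict

def naive_connected_components(vertices, edges):
    """Repeatedly replace each vertex's component ID with the smallest ID in its neighborhood."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    component = {v: v for v in vertices}  # step 1: unique starting ID per vertex
    rounds = 0
    while True:
        # step 2: every vertex looks at itself and its neighbors (synchronous update)
        updated = {
            v: min([component[v]] + [component[n] for n in neighbors[v]])
            for v in vertices
        }
        if updated == component:
            return component, rounds
        component = updated
        rounds += 1

path_edges = [(i, i + 1) for i in range(5)]  # 0 - 1 - 2 - 3 - 4 - 5
labels, rounds = naive_connected_components(range(6), path_edges)
print(labels, rounds)
```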
38
Small-/large-star algorithm
Kiveris et al. "Connected Components in MapReduce and Beyond."
1.  Assign each vertex a unique ID.
2.  Iterate until convergence:
• (small-star) for each vertex, connect smaller neighbors to smallest neighbor
• (big-star, also called large-star) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
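The two operations can be sketched on an in-memory edge set as below. This follows my reading of the operations as described by Kiveris et al. (in particular, small-star also re-attaches the vertex itself, not only its smaller neighbors) and is not the GraphFrames implementation; all function names are illustrative. Vertices without any edge are not labeled by this sketch.

```python
from collections import defaultdict

def _neighbors(edges):
    nbrs = defaultdict(set)
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs

def large_star(edges):
    """For each vertex u, connect its larger neighbors to the smallest of u and its neighbors."""
    out = set()
    for u, ns in _neighbors(edges).items():
        m = min(ns | {u})
        out.update((v, m) for v in ns if v > u)
    return out

def small_star(edges):
    """For each vertex u, connect u and its smaller neighbors to the smallest of them."""
    out = set()
    for u, ns in _neighbors(edges).items():
        group = {v for v in ns if v < u} | {u}
        m = min(group)
        out.update((v, m) for v in group if v != m)
    return out

def star_connected_components(edges):
    """Alternate large-star and small-star until the edge set stops changing."""
    current = {(max(a, b), min(a, b)) for a, b in edges if a != b}
    while True:
        updated = small_star(large_star(current))
        if updated == current:
            break
        current = updated
    labels = {}
    for v, center in current:  # at convergence every edge is (vertex, star center)
        labels[v] = center
        labels.setdefault(center, center)
    return labels

labels = star_connected_components([(1, 5), (5, 7), (7, 8), (8, 9), (2, 3), (3, 4)])
print(labels)
```

Each operation only rewires edges inside a component, so connectivity is preserved while the component collapses toward a star centered on its smallest vertex.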
39
Small-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
40
Big-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
41
Another interpretation
[Figure: adjacency matrix over vertices 1, 5, 7, 8, 9, with one edge marked in each of rows 1, 5, 7, and 8]
42
Small-star operation
[Figure: two adjacency matrices over vertices 1, 5, 7, 8, 9 illustrating the small-star operation, labeled "rotate & lift"]
43
Big-star operation
[Figure: two adjacency matrices over vertices 1, 5, 7, 8, 9 illustrating the big-star operation, labeled "lift"]
44
Convergence
[Figure: adjacency matrix over vertices 1, 5, 7, 8, 9 at convergence; all marks are in row 1]
45
Properties of the algorithm
• Small-/big-star operations do not change graph connectivity.
• Extra edges are pruned during iterations.
• Each connected component converges to a star graph.
• Converges in log2(#nodes) iterations
46
Implementation
Iterate:
• filter
• self-join
Challenge: handle these operations at scale.
47
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
48
Skewed joins
Real-world graphs contain big components.
→ data skew during connected components iterations
src  dst
0    1
0    2
0    3
0    4
…    …
0    2,000,000
1    3
2    5
join
src  Component id  neighbors
0    0             2,000,000
1    0             10
2    3             5
49
Skewed joins
src  dst
0    1
0    2
0    3
0    4
…    …
0    2,000,000
1    3
2    5
src  Component id  neighbors
0    0             2,000,000
1    0             10
2    3             5
Join strategy: broadcast join for sources with #nbrs > 1,000,000, hash join for the rest, then union the results.
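The split can be sketched in plain Python as follows. It only imitates the shape of the two branches (a small lookup table for the heavy keys, grouping by key for everything else); the function name is hypothetical, and the threshold of 10 stands in for the 1,000,000 used on the slide.

```python
from collections import Counter, defaultdict

def skew_aware_join(edges, component_of, threshold):
    """Join edges (src, dst) with component IDs keyed by src, treating heavy keys separately."""
    fanout = Counter(src for src, _ in edges)
    heavy = {src for src, n in fanout.items() if n > threshold}

    # "Broadcast" branch: the few heavy keys fit in a small lookup table that every
    # worker could hold, so their many edges never need to be shuffled to one place.
    heavy_lookup = {src: component_of[src] for src in heavy}
    broadcast_part = [(s, d, heavy_lookup[s]) for s, d in edges if s in heavy_lookup]

    # "Hash join" branch: the remaining edges are grouped by key, as a shuffle would do.
    buckets = defaultdict(list)
    for s, d in edges:
        if s not in heavy:
            buckets[s].append(d)
    hash_part = [(s, d, component_of[s]) for s, dsts in buckets.items() for d in dsts]

    return broadcast_part + hash_part  # union of the two branches

edges = [(0, d) for d in range(1, 21)] + [(1, 3), (2, 5)]
component_of = {0: 0, 1: 0, 2: 3}
joined = skew_aware_join(edges, component_of, threshold=10)
print(len(joined))
```

The result is the same as a single join; only the physical strategy differs per key.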
50
Checkpointing
We checkpoint every 2 iterations to avoid:
•  query plan explosion (exponential growth)
•  optimizer slowdown
•  disk out of shuffle space
•  unexpected node failures
51
Experiments
twitter-2010 from WebGraph datasets (small diameter)
•  42 million vertices, 1.5 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 4 minutes
•  GraphFrames: 6 minutes
–  algorithm difference, checkpointing, checking skewness
52
Experiments
uk-2007-05 from WebGraph datasets
•  105 million vertices, 3.7 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 25 minutes
–  slow convergence
•  GraphFrames: 4.5 minutes
53
Experiments
regular grid 32,000 x 32,000 (large diameter)
•  1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1 hour
54
Experiments
regular grid 50,000 x 50,000 (large diameter)
•  2.5 billion nodes, 10 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1.6 hours
55
Future improvements
GraphFrames
•  update inefficient code (due to Spark 1.6 compatibility)
•  better graph partitioning
•  letting Spark SQL handle skewed joins and iterations
•  graph compression
Connected Components
•  local iterations
•  node pruning and better stopping criteria
56
https://spark-summit.org/eu-2017/
15% Discount code: Databricks
http://dbricks.co/2sK35XT
https://databricks.com/company/careers
Thank you!
Get started with GraphFrames
Docs, downloads & tutorials
http://graphframes.github.io
https://docs.databricks.com
Dev community
Github issues & PRs
Twitter: @jkbatcmu → I’ll share my slides.

Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 




Web-Scale Graph Analytics with Apache® Spark™

  • 1. Web-Scale Graph Analytics with Apache Spark. Joseph K Bradley, Bay Area Apache Spark Meetup, September 7, 2017
  • 2. About me: Software engineer at Databricks • Apache Spark committer & PMC member • Ph.D. Carnegie Mellon in Machine Learning
  • 3. About Databricks. TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009 • PRODUCT: Unified Analytics Platform • MISSION: Making Big Data Simple
  • 4. Try Apache Spark in Databricks! UNIFIED ANALYTICS PLATFORM: • Collaborative cloud environment • Free version (community edition). DATABRICKS RUNTIME 3.2: • Apache Spark - optimized for the cloud • Caching and optimization layer - DBIO • Enterprise security - DBES. Try for free today. databricks.com
  • 5. Apache Spark Engine: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX, … • Unified engine across diverse workloads & environments • Scale out, fault tolerant • Python, Java, Scala, & R APIs • Standard libraries
  • 6. (no text)
  • 7. Spark Packages: 340+ packages written for Spark • 80+ packages for ML and Graphs. E.g.: • GraphFrames: DataFrame-based graphs • Bisecting K-Means: now part of MLlib • Stanford CoreNLP integration: UDFs for NLP. spark-packages.org
  • 8. Outline: Intro to GraphFrames. Moving implementations to DataFrames: • Vertex indexing • Scaling Connected Components • Other challenges: skewed joins and checkpoints. Future of GraphFrames
  • 9. Graphs (vertex, edge). Example: airports & flights between them (JFK, IAD, LAX, SFO, SEA, DFW). Vertex columns (id, City, State): ("JFK", "New York", NY). Edge columns (src, dst, delay, tripID): ("JFK", "SEA", 45, 1058923)
  • 10. Apache Spark's GraphX library. Overview: • General-purpose graph processing library • Optimized for fast distributed computing • Library of algorithms: PageRank, Connected Components, etc. Challenges: • No Java, Python APIs • Lower-level RDD-based API (vs. DataFrames) • Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
  • 11. The GraphFrames Spark Package. Goal: DataFrame-based graphs on Apache Spark • Simplify interactive queries • Support motif-finding for structural pattern search • Benefit from DataFrame optimizations. Collaboration between Databricks, UC Berkeley & MIT + Now with community contributors & committers!
  • 12. Graphs (vertex, edge) over airports JFK, IAD, LAX, SFO, SEA, DFW. Vertex columns (id, City, State): ("JFK", "New York", NY). Edge columns (src, dst, delay, tripID): ("JFK", "SEA", 45, 1058923)
  • 13. GraphFrames. vertices DataFrame (id, City, State): ("JFK", "New York", NY), ("SEA", "Seattle", WA). edges DataFrame (src, dst, delay, tripID): ("JFK", "SEA", 45, 1058923), ("DFW", "SFO", -7, 4100224)
  • 14. Graph analysis with GraphFrames: Simple queries • Motif finding • Graph algorithms
  • 15. Simple queries: SQL queries on vertices & edges • Simple graph queries (e.g., vertex degrees)
  • 16. Motif finding: Search for structural patterns within a graph (example graph: JFK, IAD, LAX, SFO, SEA, DFW). val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 17. Motif finding (vertices (a) and (b) marked on the example graph): Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 18. Motif finding (vertices (a), (b) and (c) marked on the example graph): Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 19. Motif finding (vertices (a), (b) and (c) marked on the example graph): Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 20. Motif finding (vertices (a), (b) and (c) marked on the example graph): Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)") Then filter using vertex & edge data. paths.filter("e1.delay > 20")
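The motif on slides 16 to 20 can be read as a self-join of the edges table on the shared vertex (b), followed by an anti-join that drops matches with a return edge from (c) to (a). The following is a minimal pure-Python sketch of that reading, not GraphFrames code: the flight list and the helper name `find_two_hop_no_return` are invented for illustration.

```python
# Illustrative only: emulates the motif
#   "(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)"
# with plain Python instead of GraphFrames. The flights are invented sample data.
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "SEA", "dst": "SFO", "delay": 25},
    {"src": "SFO", "dst": "JFK", "delay": 30},
    {"src": "DFW", "dst": "SFO", "delay": -7},
    {"src": "SFO", "dst": "LAX", "delay": 5},
]


def find_two_hop_no_return(edges):
    """Return all (a)->(b)->(c) paths for which no edge (c)->(a) exists."""
    has_edge = {(e["src"], e["dst"]) for e in edges}
    paths = []
    for e1 in edges:
        for e2 in edges:
            if e1["dst"] != e2["src"]:  # e1 and e2 must meet at vertex (b)
                continue
            a, c = e1["src"], e2["dst"]
            if (c, a) in has_edge:  # negated term: skip if (c)->(a) exists
                continue
            paths.append({"a": a, "b": e1["dst"], "c": c, "e1": e1, "e2": e2})
    return paths


paths = find_two_hop_no_return(edges)
# Same idea as paths.filter("e1.delay > 20") on the slide.
delayed = [p for p in paths if p["e1"]["delay"] > 20]
for p in delayed:
    print(p["a"], "->", p["b"], "->", p["c"])
```

The nested loop is quadratic in the number of edges; it only illustrates the semantics of the pattern, whereas a DataFrame engine would evaluate it as joins.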
  • 21. Graph algorithms. Find important vertices: • PageRank. Find paths between sets of vertices: • Breadth-first search (BFS) • Shortest paths. Find groups of vertices (components, communities): • Connected components • Strongly connected components • Label Propagation Algorithm (LPA). Other: • Triangle counting • SVDPlusPlus
  • 22. Saving & loading graphs: Save & load the DataFrames. vertices = sqlContext.read.parquet(...); edges = sqlContext.read.parquet(...); g = GraphFrame(vertices, edges); g.vertices.write.parquet(...); g.edges.write.parquet(...)
  • 23. GraphFrames vs. GraphX. Built on: DataFrames vs. RDDs • Languages: Scala, Java, Python vs. Scala • Use cases: Queries & algorithms vs. Algorithms • Vertex IDs: Any type (in Catalyst) vs. Long • Vertex/edge attributes: Any number of DataFrame columns vs. Any type (VD, ED)
  • 24. 2 types of graph libraries. Graph algorithms: Standard & custom algorithms • Optimized for batch processing. Graph queries: Motif finding • Point queries & updates. GraphFrames: Both algorithms & queries (but not point updates)
  • 25. Outline: Intro to GraphFrames. Moving implementations to DataFrames: • Vertex indexing • Scaling Connected Components • Other challenges: skewed joins and checkpoints. Future of GraphFrames
  • 26. Algorithm implementations. Mostly wrappers for GraphX: • PageRank • Shortest paths • Strongly connected components • Label Propagation Algorithm (LPA) • SVDPlusPlus. Some algorithms implemented using DataFrames: • Breadth-first search • Connected components • Triangle counting • Motif finding
  • 27. Moving implementations to DataFrames: DataFrames are optimized for a huge number of small records. • columnar storage • code generation ("Project Tungsten") • query optimization ("Project Catalyst")
  • 28. Outline: Intro to GraphFrames. Moving implementations to DataFrames: • Vertex indexing • Scaling Connected Components • Other challenges: skewed joins and checkpoints. Future of GraphFrames
  • 29. Pros of integer vertex IDs. GraphFrames take arbitrary vertex IDs. → convenient for users. Algorithms prefer integer vertex IDs. → optimize in-memory storage → reduce communication. Our task: Map unique vertex IDs to unique (long) integers.
  • 30. The hashing trick? • Possible solution: hash vertex ID to long integer • What is the chance of collision among k vertex IDs hashed into N values? • 1 - (N-1)/N * (N-2)/N * … * (N-k+1)/N • seems unlikely with long range N=2^64 • with 1 billion nodes, the chance is ~5.4% • Problem: collisions change graph topology. Example (Name, Hash): (Tim, 84088), (Joseph, -2070372689), (Xiangrui, 264245405), (Felix, 67762524)
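The collision chance on slide 30 can be estimated numerically with the standard birthday approximation 1 - exp(-k(k-1)/(2N)). The sketch below assumes uniformly distributed hashes; the function name `collision_probability` is made up for illustration. For one billion keys the estimate comes out at a few percent, the same order as the figure on the slide, and the exact value depends on the effective hash range assumed.

```python
import math


def collision_probability(k, n):
    """Approximate P(at least one collision) for k keys hashed uniformly into n values."""
    # Birthday approximation: 1 - exp(-k(k-1) / (2n)).
    # expm1 keeps precision when the exponent is close to zero.
    return -math.expm1(-k * (k - 1) / (2 * n))


# One billion vertex IDs, with a 64-bit and a 63-bit hash range for comparison.
p_64 = collision_probability(10**9, 2**64)
p_63 = collision_probability(10**9, 2**63)
print(f"64-bit range: {p_64:.2%}, 63-bit range: {p_63:.2%}")
```

Even a small collision probability matters here because a single collision merges two vertices and silently changes the graph.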
  • 31. Generating unique IDs: Spark has built-in methods to generate unique IDs. • RDD: zipWithUniqueId(), zipWithIndex() • DataFrame: monotonically_increasing_id(). Possible solution: just use these methods
  • 32. How it works. Partition 1 (Vertex, ID): (Tim, 0), (Joseph, 1). Partition 2 (Vertex, ID): (Xiangrui, 100 + 0), (Felix, 100 + 1). Partition 3 (Vertex, ID): (…, 200 + 0), (…, 200 + 1)
  • 33. … but not always • DataFrames/RDDs are immutable and reproducible by design. • However, records do not always have stable orderings, e.g. after distinct or repartition. • cache() does not help. Example: Partition 1 holds (Tim, 0), (Joseph, 1); after a re-compute it may hold (Joseph, 0), (Tim, 1).
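The partition-offset scheme from slide 32, and the instability described on slide 33, can be modelled in a few lines of plain Python. This is a sketch rather than Spark code: partitions are ordinary lists, and the bit layout (partition index in the upper bits, row position in the lower 33 bits) follows the documented behaviour of `monotonically_increasing_id()` instead of the simplified offsets of 100 and 200 used on the slide.

```python
# Pure-Python model of partition-offset IDs; not Spark code.
ROW_BITS = 33  # lower bits hold the row position within a partition


def assign_ids(partitions):
    """Give every record the ID (partition index << ROW_BITS) + row position."""
    ids = {}
    for p_idx, rows in enumerate(partitions):
        for r_idx, name in enumerate(rows):
            ids[name] = (p_idx << ROW_BITS) + r_idx
    return ids


ids = assign_ids([["Tim", "Joseph"], ["Xiangrui", "Felix"]])

# If a re-computation yields the same records in a different order,
# the same vertex silently receives a different ID.
recomputed = assign_ids([["Joseph", "Tim"], ["Xiangrui", "Felix"]])
print(ids["Tim"], recomputed["Tim"])
```

IDs are unique within one evaluation, but they are only reproducible if the row order inside each partition is reproducible.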
  • 34. Our implementation: We implemented (v0.5.0) an expensive but correct version: 1. (hash) re-partition + distinct vertex IDs 2. sort vertex IDs within each partition 3. generate unique integer IDs
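The three steps on slide 34 can be sketched in plain Python to show why the result is deterministic. This is a model under stated assumptions, not the GraphFrames implementation: `index_vertices`, the partition count and the use of `zlib.crc32` as the partitioning hash are all choices made for this illustration.

```python
import zlib


def index_vertices(vertex_ids, num_partitions=4, row_bits=33):
    """Map arbitrary string vertex IDs to integers, independent of input order."""
    # Step 1: hash-partition and de-duplicate. crc32 is stable across runs,
    # unlike Python's built-in hash() for strings, which is salted per process.
    partitions = [set() for _ in range(num_partitions)]
    for vid in vertex_ids:
        partitions[zlib.crc32(vid.encode("utf-8")) % num_partitions].add(vid)
    index = {}
    for p_idx, part in enumerate(partitions):
        # Step 2: sort within the partition so that row positions are stable.
        for r_idx, vid in enumerate(sorted(part)):
            # Step 3: partition offset plus row position gives a unique integer.
            index[vid] = (p_idx << row_bits) + r_idx
    return index


first = index_vertices(["Tim", "Joseph", "Xiangrui", "Felix", "Tim"])
second = index_vertices(["Felix", "Tim", "Xiangrui", "Joseph"])
print(first == second)
```

The sort is what makes the scheme expensive, and it is also what removes the dependence on record order.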
  • 35. Outline: Intro to GraphFrames. Moving implementations to DataFrames: • Vertex indexing • Scaling Connected Components • Other challenges: skewed joins and checkpoints. Future of GraphFrames
  • 36. Connected Components: Assign each vertex a component ID such that vertices receive the same component ID iff they are connected. Applications: • fraud detection (Spark Summit 2016 keynote from Capital One) • clustering • entity resolution
  • 37. Naive implementation (GraphX): 1. Assign each vertex a unique component ID. 2. Iterate until convergence: • For each vertex v, update: component ID of v ← smallest component ID in neighborhood of v. Pro: easy to implement. Con: slow convergence on large-diameter graphs
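The naive scheme on slide 37 is short enough to write out, which also makes the slow convergence visible. The following is a single-machine sketch in plain Python, not the GraphX code; `naive_components` is a name chosen for this example, and the graph is treated as undirected.

```python
def naive_components(vertices, edges):
    """Min-label propagation: returns (component ID per vertex, iterations run)."""
    comp = {v: v for v in vertices}  # 1. every vertex starts as its own component
    nbrs = {v: set() for v in vertices}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    iterations = 0
    while True:  # 2. iterate until no label changes
        new = {v: min([comp[v]] + [comp[n] for n in nbrs[v]]) for v in vertices}
        iterations += 1
        if new == comp:
            return comp, iterations
        comp = new


# A path 0-1-2-...-7 has a large diameter relative to its size.
path_vertices = list(range(8))
path_edges = [(i, i + 1) for i in range(7)]
components, rounds = naive_components(path_vertices, path_edges)
print(components, rounds)
```

On a path the smallest label moves one hop per iteration, so the number of iterations grows linearly with the diameter.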
  • 38. Small-/large-star algorithm (Kiveris et al., "Connected Components in MapReduce and Beyond"): 1. Assign each vertex a unique ID. 2. Iterate until convergence: • (small-star) for each vertex, connect smaller neighbors to smallest neighbor • (big-star) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
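The two operations on slide 38 can be sketched on a plain edge set. This is a single-machine illustration of the alternating variant (one large-star followed by one small-star per round) as described by Kiveris et al., not the DataFrame implementation from the talk; the function names and the sample graph are chosen for this example, and vertex IDs are assumed to be comparable integers.

```python
from collections import defaultdict


def _neighbors(edges):
    """Undirected adjacency sets built from (larger, smaller) edge pairs."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def large_star(edges):
    """For each vertex u, link every larger neighbor to the minimum of u and its neighbors."""
    out = set()
    for u, nb in _neighbors(edges).items():
        m = min(nb | {u})
        out.update((v, m) for v in nb if v > u)
    return out


def small_star(edges):
    """For each vertex u, link u and its smaller neighbors to the smallest of them."""
    out = set()
    for u, nb in _neighbors(edges).items():
        smaller = {v for v in nb if v < u}
        if smaller:
            m = min(smaller)
            out.update((v, m) for v in smaller | {u} if v != m)
    return out


def connected_components(vertices, edges):
    """Alternate both operations until the edge set stops changing."""
    cur = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    while True:
        new = small_star(large_star(cur))
        if new == cur:
            break
        cur = new
    comp = {v: v for v in vertices}  # isolated vertices keep their own ID
    for v, m in cur:  # at convergence each component is a star around its minimum
        comp[v] = min(comp[v], m)
    return comp


cc = connected_components(
    [1, 2, 3, 4, 7, 8, 9, 10],
    [(1, 2), (2, 3), (3, 4), (8, 7), (9, 8)],
)
print(cc)
```

Both operations only rewire edges inside a component, so connectivity is preserved while each component collapses toward a star centred on its smallest vertex.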
  • 39. Small-star operation (figure from Kiveris et al., Connected Components in MapReduce and Beyond)
  • 40. Big-star operation (figure from Kiveris et al., Connected Components in MapReduce and Beyond)
  • 41. Another interpretation: adjacency matrix (figure: matrix over vertices 1, 5, 7, 8, 9)
  • 42. Small-star operation: rotate & lift (figure: adjacency matrix over vertices 1, 5, 7, 8, 9, before and after)
  • 43. Big-star operation: lift (figure: adjacency matrix over vertices 1, 5, 7, 8, 9, before and after)
  • 44. Convergence (figure: adjacency matrix over vertices 1, 5, 7, 8, 9)
  • 45. Properties of the algorithm • Small-/big-star operations do not change graph connectivity. • Extra edges are pruned during iterations. • Each connected component converges to a star graph. • Converges in log2(#nodes) iterations
  • 47. Outline: Intro to GraphFrames. Moving implementations to DataFrames: • Vertex indexing • Scaling Connected Components • Other challenges: skewed joins and checkpoints. Future of GraphFrames
  • 48. Skewed joins: Real-world graphs contain big components. → data skew during connected components iterations. Example join: edges (src, dst): (0, 1), (0, 2), (0, 3), (0, 4), …, (0, 2,000,000), (1, 3), (2, 5), joined with vertices (src, component id, neighbors): (0, 0, 2,000,000), (1, 0, 10), (2, 3, 5)
  • 49. Skewed joins: same example tables as the previous slide. Split the join: broadcast join for vertices with #nbrs > 1,000,000, hash join for the rest, then union the results.
  • 50. Checkpointing: We checkpoint every 2 iterations to avoid: • query plan explosion (exponential growth) • optimizer slowdown • disk out of shuffle space • unexpected node failures
  • 51. Experiments: twitter-2010 from WebGraph datasets (small diameter) • 42 million vertices, 1.5 billion edges. 16 r3.4xlarge workers on Databricks • GraphX: 4 minutes • GraphFrames: 6 minutes (algorithm difference, checkpointing, checking skewness)
  • 52. Experiments: uk-2007-05 from WebGraph datasets • 105 million vertices, 3.7 billion edges. 16 r3.4xlarge workers on Databricks • GraphX: 25 minutes (slow convergence) • GraphFrames: 4.5 minutes
  • 53. Experiments: regular grid 32,000 x 32,000 (large diameter) • 1 billion nodes, 4 billion edges. 32 r3.8xlarge workers on Databricks • GraphX: failed • GraphFrames: 1 hour
  • 54. Experiments: regular grid 50,000 x 50,000 (large diameter) • 2.5 billion nodes, 10 billion edges. 32 r3.8xlarge workers on Databricks • GraphX: failed • GraphFrames: 1.6 hours
  • 55. Future improvements. GraphFrames: • update inefficient code (due to Spark 1.6 compatibility) • better graph partitioning • letting Spark SQL handle skewed joins and iterations • graph compression. Connected Components: • local iterations • node pruning and better stopping criteria
  • 59. Thank you! Get started with GraphFrames. Docs, downloads & tutorials: http://graphframes.github.io, https://docs.databricks.com. Dev community: Github issues & PRs. Twitter: @jkbatcmu → I'll share my slides.