Graph Processing at Scale using
Spark & GraphFrames
Ron Barabash, Yotpo
Yotpo helps 70K+ online e-commerce brands
collect and leverage User Generated Content (UGC)
User Generated Content: Collect & Leverage
[Diagram labels: Reviews, Photos, Q&A; On-site conversion, Search & social, Consumer insights; a "You are here" marker]
Reviews are essential for social proof.
■ According to studies, more than 88% of shoppers
incorporate reviews in their purchasing decision
But also, reviews are a valuable source of feedback:
“I manually read through as many as 5,000 reviews
each month to extract customer insights, run different
analyses, and sending reports to the relevant internal
stakeholders.”
– Sandra Negrea, Customer Engagement Analyst
Yotpo Insights
Analyze topics
■ Overall sentiment score
■ Breakdown by products
■ Top mentioned opinions
Explore related reviews
■ See what customers
actually say on each topic
The Technology
■ Natural Language Processing (NLP): automatically analyzes the grammatical structure of the reviews to identify all topics and opinions mentioned.
■ Sentiment Analysis: calculates and assigns a sentiment score per opinion.
■ Semantic Grouping: groups related topics into one to improve data significance and ease of use.
Yotpo Insights
Top success stories
■ A production team searched the reviews for quality issues and discovered, through the feedback, a malfunction in one of their products.
■ A company looked at opinions on shipment, broken down by country, and discovered a problem with a shipping warehouse that serves certain countries.
■ A fashion company found that the topic “husband” came up very positively for certain products. They changed their promotions for these products based on the “couple experience” and saw great results.
How?
Algorithm Overview
Step 1: extract topics & opinions
■ Each opinion is a substring of the review of
uniform sentiment.
STEP 1
Example review:
First jeans I bought from your site
Love these jeans! Excellent quality and very comfortable material. Only gave 4 stars because the shipping took so long...
Opinion → Topic
Love these jeans! → jeans
Excellent quality → quality
very comfortable material → material
shipping took so long... → shipping
Step 2: opinion sentiment analysis
■ Classify opinions as Positive, Negative or Neutral.
Step 3: group topics & opinions
■ Group similar topics & similar opinions
■ Determine representatives for each group
STEP 3
Topic Grouping: shipment, shipping, delivery, material, jeans
Opinion Grouping: great, excellent, amazing, bad, not good, terrible, horrible, stinky, smelly
Topic Grouping
Group words with similar contextual meaning
Step 1: Semantic Grouping
■ Use NLP to group words with similar
semantic meaning
■ Build edges and vertices - Create a graph!
■ Compute connected components - Graph Algorithms!
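The semantic grouping step can be sketched in plain Python. This is only an illustration, not the pipeline from the talk: the word pairs are invented, and the NLP step that decides which words are semantically similar is assumed to have produced them already. Each word is a vertex, each similar pair is an edge, and every connected component becomes one semantic cluster. A small union-find stands in for a distributed connected-components algorithm.

```python
from collections import defaultdict


def semantic_clusters(similar_pairs):
    """Group words into clusters: words are vertices, similar pairs are edges."""
    parent = {}

    def find(word):
        # Follow parent links to the root, shortening the path on the way.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for left, right in similar_pairs:
        # Merging the two roots joins the components of both words.
        parent[find(left)] = find(right)

    clusters = defaultdict(list)
    for word in parent:
        clusters[find(word)].append(word)
    return sorted(sorted(words) for words in clusters.values())


# Hypothetical output of the NLP similarity step.
pairs = [
    ("deliver", "delivery"),
    ("ship", "shipping"),
    ("shipping", "shipment"),
    ("cost", "costly"),
]
clusters = semantic_clusters(pairs)
print(clusters)
```

Note that "ship" and "shipment" end up in the same cluster even though no pair links them directly; that is the transitivity connected components gives for free.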
Step 2: Contextual Grouping
■ Group word clusters into groups with
contextual meaning - Word2Vec
■ Create a graph
■ Finding paths - Graph Queries!
■ Avoid transitivity by relying on path length
[Diagram: example word clusters: {shipping cost}, {delivery, deliver}, {cost, costly}, {shipment, shipping, ship}]
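Why path length matters in the contextual step can be shown with a short, hedged sketch. The nodes here are cluster representatives and the edges are hypothetical "contextually similar" links (in the talk these would come from Word2Vec; the actual similarity rule and hop limit are not given in the slides). A breadth-first search that stops after a fixed number of hops keeps a group from absorbing everything that is merely reachable.

```python
from collections import deque


def within_hops(edges, start, max_hops):
    """Return the nodes reachable from `start` in at most `max_hops` edges."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the allowed path length
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance) - {start}


# Hypothetical "contextually similar" edges between cluster representatives.
similar = [
    ("cost", "shipping cost"),
    ("shipping cost", "shipping"),
    ("shipping", "delivery"),
]
one_hop = within_hops(similar, "shipping", max_hops=1)
unbounded = within_hops(similar, "shipping", max_hops=len(similar))
print(sorted(one_hop))    # ['delivery', 'shipping cost']
print(sorted(unbounded))  # ['cost', 'delivery', 'shipping cost']
```

With a one-hop limit, "shipping" is grouped with "delivery" and "shipping cost" but not with "cost", which is only reachable transitively.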
Graph Processing in Spark
● GraphX
Graph Algorithms VS. Graph Queries
GraphFrames
● Graph Query Translation
● GraphFrames API
● Connected Components
○ GraphX Implementation
○ GraphFrames Implementation
○ Performance
Takeaways
Let's Talk Business
Graph Processing in Spark
GraphX - What?
● General-purpose graph processing library
● Built into Spark
● Optimized for fast distributed computing
● Library of algorithms: PageRank, Connected Components, etc.
The Bad
● Scala only: no Java or Python APIs, and no graph queries
● Lower-level RDD-based API (vs. DataFrames)
● Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten
memory management
Separated Systems
Graph algorithms vs. Graph Queries
GraphFrames
● Spark evolved from RDDs to DataFrames: GraphFrames enjoys the benefits and optimizations of the DataFrames API
● Provides powerful tools for running queries and standard graph algorithms, using the native GraphX implementation if needed
● Unifies the graph algorithm and graph query APIs; available in Scala, Java and Python
GraphFrames
Unified API
GraphFrames API
Spark SQL
● PageRank
● Connected Components
● BFS
● Wikipedia Collaborators
● Counting mutual friends
● Finding path existence and patterns
Pattern Query Optimizer
GraphFrames
Under The Hood
[Pipeline diagram: Query String → Parsed Pattern → Logical Plan → Optimized Logical Plan → DataFrame Result. Other labels in the diagram: Graph Algorithms, Materialized Views, Relational plan translations, View Selection, Join Elimination and Reordering.]
graph.find("(root)-[]->(layer1)").filter("root.is_root = true")
graph.find("(root)-[]->(layer1); (layer1)-[]->(layer2)").filter("root.is_root = true")
Relational plan translations
● Edges and vertices are represented as
DataFrames
● Starts building the result DataFrame
● For each new vertex in the query we
generate a join
○ With the edges table - to get the src and
dst of the edge
○ With the vertices table - to get the
property of the vertex
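graph.find("(v0)-[]->(v1); (v1)-[]->(v2)").filter("v2.attr = true")
Example graph: a → b → c
Edges table:
src  dst
a    b
b    c
Vertices table:
id  attr
a   1
b   2
c   3
The partial result (v0, v1) = (a, b) is joined with the edges table on src = b and with the vertices table on id = c, which extends it to (v0, v1, v2) = (a, b, c).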
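The join-based translation described above can be imitated without Spark. The following is a minimal sketch using plain Python lists in place of DataFrames; the function name `find_two_hops` and the predicate on `attr` are assumptions made for illustration, and GraphFrames itself builds and optimizes a relational plan rather than looping like this.

```python
from collections import defaultdict

# The example graph from the slide: a -> b -> c.
edges = [{"src": "a", "dst": "b"}, {"src": "b", "dst": "c"}]
vertices = [
    {"id": "a", "attr": 1},
    {"id": "b", "attr": 2},
    {"id": "c", "attr": 3},
]


def find_two_hops(edges, vertices, v2_predicate):
    """Evaluate "(v0)-[]->(v1); (v1)-[]->(v2)" with hash joins."""
    vertex_by_id = {v["id"]: v for v in vertices}
    edges_by_src = defaultdict(list)
    for edge in edges:
        edges_by_src[edge["src"]].append(edge)

    rows = []
    for first in edges:  # (v0)-[]->(v1)
        # Join with the edges table on first.dst = second.src to reach v2.
        for second in edges_by_src.get(first["dst"], []):
            # Join with the vertices table to fetch each vertex's properties.
            row = {
                "v0": vertex_by_id[first["src"]],
                "v1": vertex_by_id[first["dst"]],
                "v2": vertex_by_id[second["dst"]],
            }
            if v2_predicate(row["v2"]):
                rows.append(row)
    return rows


# Assumed predicate standing in for the slide's filter on v2.attr.
matches = find_two_hops(edges, vertices, lambda v2: v2["attr"] == 3)
paths = [(r["v0"]["id"], r["v1"]["id"], r["v2"]["id"]) for r in matches]
print(paths)  # [('a', 'b', 'c')]
```

Each extra vertex in the pattern adds one more join against the edges table and one more lookup in the vertices table, which is why join elimination and reordering matter for longer patterns.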
The GraphFrames API
class GraphFrame {
  def vertices: DataFrame
  def edges: DataFrame
  def find(pattern: String): DataFrame
  def registerView(pattern: String, df: DataFrame): Unit
  def degrees(): DataFrame
  def pageRank(): GraphFrame
  def connectedComponents(): GraphFrame
  ...
}
Connected Components
Goal:
Assign each vertex a component ID such that vertices receive the same component ID iff they are
connected.
Problem:
What about really large graphs?
In Distributed Systems we really care about
communication and data skew (partitions)
Naive Implementation in GraphX
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v, update:
i. Component ID of v ← smallest component ID in the neighborhood of v
Pro: Easy to implement
Con: Slow convergence on large-diameter graphs
*diameter is the greatest distance between any pair of vertices
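The naive scheme is easy to try on a single machine. The sketch below is plain Python, not the GraphX code; the function name and the round counter are added for illustration. It shows why the diameter drives the number of iterations: the smallest ID can only travel one hop per round.

```python
def naive_connected_components(vertices, edges):
    """Min-label propagation; returns (component ids, rounds that changed a label)."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    component = {v: v for v in vertices}  # step 1: unique id per vertex
    rounds = 0
    while True:
        # Step 2a: every vertex takes the smallest id in its neighbourhood.
        updated = {
            v: min([component[v]] + [component[n] for n in neighbours[v]])
            for v in vertices
        }
        if updated == component:  # convergence: nothing changed
            return component, rounds
        component = updated
        rounds += 1


nodes = [0, 1, 2, 3, 4]
# A path 0-1-2-3-4 has diameter 4, so id 0 needs 4 rounds to reach vertex 4.
path_component, path_rounds = naive_connected_components(
    nodes, [(0, 1), (1, 2), (2, 3), (3, 4)]
)
# A star centred on vertex 0 spreads id 0 to every vertex in a single round.
star_component, star_rounds = naive_connected_components(
    nodes, [(0, 1), (0, 2), (0, 3), (0, 4)]
)
print(path_rounds, star_rounds)  # 4 1
```

Both graphs have five vertices and four edges, yet the path needs four label-changing rounds and the star only one.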
Small/Big star algorithm - In GraphFrames
Kiveris et al. "Connected Components in MapReduce and Beyond."
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v:
i. Connect smaller neighbors to smallest neighbor - Small Star
b. For each vertex v:
i. Connect bigger neighbors to smallest neighbor (or itself) - Big Star
*Motivation: we mutate the graph into a union of star graphs without damaging connectivity
Small-Star Operations
smallStar(v) - Connect all smaller neighbours and self to the min neighbour.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on nodes 1, 5, 7, 8, 9, shown twice for the small-star operation]
Big-Star Operations
bigStar(v) - Connect all strictly larger neighbours to the min neighbour including self.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on nodes 1, 5, 7, 8, 9, shown twice for the big-star operation]
Small/Big star algorithm
[Figure: example graph on nodes 1, 5, 7, 8, 9]
Small/big star operations maintain graph connectivity.
Extra edges are pruned during iterations, which reduces message passing.
Each connected component converges to a star graph.
Converges in log²(#nodes) iterations.
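The two operations can be sketched sequentially in plain Python. This is a single-machine illustration of the small-star and big-star rules as worded on the slides, not the GraphFrames implementation: the function names are made up, each operation is applied to all vertices at once to build a new edge set, and the loop stops when a full small-star plus big-star round leaves the edges unchanged.

```python
from collections import defaultdict


def _neighbours(edges):
    nbrs = defaultdict(set)
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def small_star(edges):
    """Connect v and its smaller neighbours to the smallest of those neighbours."""
    out = set()
    for v, nbrs in _neighbours(edges).items():
        smaller = {n for n in nbrs if n < v}
        if smaller:
            smallest = min(smaller)
            out.update((smallest, n) for n in smaller | {v} if n != smallest)
    return out


def big_star(edges):
    """Connect every strictly larger neighbour of v to min(neighbours + v)."""
    out = set()
    for v, nbrs in _neighbours(edges).items():
        smallest = min(nbrs | {v})
        out.update((smallest, n) for n in nbrs if n > v)
    return out


def star_connected_components(vertices, edges):
    # Store each undirected edge once as (low, high); drop self-loops.
    current = {(min(a, b), max(a, b)) for a, b in edges if a != b}
    while True:
        updated = big_star(small_star(current))
        if updated == current:  # a full round changed nothing
            break
        current = updated
    nbrs = _neighbours(current)
    # In a star the centre is the smallest id, which serves as the component id.
    return {v: min(nbrs[v] | {v}) for v in vertices}


# Hypothetical input: a path 1-2-3-4, a separate edge 7-9 and an isolated vertex 8.
components = star_connected_components(
    [1, 2, 3, 4, 7, 8, 9], [(1, 2), (2, 3), (3, 4), (9, 7)]
)
print(components)
```

On this input the path collapses into a star centred on vertex 1, the edge 7-9 keeps 7 as its centre, and the isolated vertex 8 keeps its own ID.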
Let’s Talk about Performance
● All datasets are taken from WebGraph Datasets
Twitter
42 million vertices, 1.5 billion edges (small diameter)
running on 16 r3.4xlarge workers on Databricks
● GraphX: 4 minutes
● GraphFrames: 6 minutes
UK Web Graph
105 million vertices, 3.7 billion edges
running on 16 r3.4xlarge workers on Databricks
● GraphX: 25 minutes (slow convergence)
● GraphFrames: 4.5 minutes
Grid
grid 32,000 x 32,000 (large diameter)
1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
● GraphX: failed
● GraphFrames: 1 hour
How about some numbers?
● # of Reviews: ~31M
● # of Opinions: ~124M
● # of Topics: ~7.5M
● # of Semantic Clusters: ~11M
● # of Machines: 50 r3 xLarge
● Pipeline time: ~2 Hours
Key Takeaways
● Graph Queries + Graph Algorithms = GraphFrames ❤️
● Simple
○ Easy and convenient API in the language of your choosing
○ Lives alongside other Spark components
● Flexible - choose between the GraphX and GraphFrames implementations
● Watch out for Performance!
○ The GraphFrames implementation of CC is actually worse than GraphX in some cases
■ No silver bullet - it depends on the actual graph (size, diameter, sparseness)
○ Most distributed graph algorithms use iterative message passing between nodes - Shuffle hell.
Key Takeaways
● Monitoring - Hard to understand the execution plan
● Checkpointing is Important! - by default happens every 2 iterations
○ Handle unexpected node failures
○ Query plan explosion
○ Optimizer slowdown
○ Disk out of shuffle space
Future work
● Performance Optimizations
○ Using different checkpointing parameters
○ Test GraphFrames native Connected Components
● Algorithm Evaluation and AI based Clustering
○ Measure the correctness of current algorithm
○ Research the use of Unsupervised Clustering
● Support additional languages
○ Insights currently supports English.
Thank you!

Graph processing at scale using spark & graph frames

  • 1.
  • 2. Graph Processing at Scale using Spark & GraphFrames Ron Barabash, Yotpo
  • 3. helps 70K+ online e-commerce brands collect and leverage User Generated Content (UGC)
  • 5. Reviews are essential for social proof. ■ According to studies, more than 88% of shoppers incorporate reviews in their purchasing decision But also, reviews are a valuable source of feedback: “I manually read through as many as 5,000 reviews each month to extract customer insights, run different analyses, and sending reports to the relevant internal stakeholders.” – Sandra Negrea, Customer Engagement Analyst
  • 7. Analyze topics ■ Overall sentiment score ■ Breakdown by products ■ Top mentioned opinions Explore related reviews ■ See what customers actually say on each topic
  • 8. The Technology Automatically analyzes the grammatical structure of the reviews to identify all topics and opinions mentioned. Natural Language Processing (NLP) Calculates and assigns a sentiment score per opinion. Sentiment Analysis Groups related topics into one to improve data significance and ease of use. Semantic Grouping
  • 9. Yotpo Insights Top success stories Production team searched for quality issues in the reviews. Discovered through feedback a malfunction in one of their products A company noticed opinions on shipment, broken down by country. Discovered a problem with a shipping warehouse that serves certain countries. A fashion company found that husband came up very positively for certain products. Changed their promotions for these products based on the “couple experience”, saw great results.
  • 10. How?
  • 11. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. quality shipping material jeans Topic shipping took so long... very comfortable material Excellent quality Love these jeans! Opinion STEP 1 First jeans I bought from your site Love these jeans! Excellent quality and very comfortable material. Only gave 4 stars because the shipping took so long...
  • 12. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. Step 2: opinion sentiment analysis ■ Classify opinions as Positive, Negative or Neutral. STEP 2 quality shipping material jeans Topic shipping took so long... very comfortable material Excellent quality Love these jeans! Opinion First jeans I bought from your site Love these jeans! Excellent quality and very comfortable material. Only gave 4 stars because the shipping took so long...
  • 13. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. Step 2: opinion sentiment analysis ■ Classify opinions as Positive, Negative or Neutral. Step 3: group topics & opinions ■ Group similar topics & similar opinions ■ Determine representatives for each group STEP 3 shipment shipping delivery material jeans Topic Grouping Opinion Grouping great excellent amazing bad not good terrible horrible stinky smelly
  • 14. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. Step 2: opinion sentiment analysis ■ Classify opinions as Positive, Negative or Neutral. Step 3: group topics & opinions ■ Group similar topics & similar opinions ■ Determine representatives for each group STEP 3 shipment shipping delivery material jeans Topic Grouping Opinion Grouping great excellent amazing bad not good terrible horrible stinky smelly YOU ARE HERE
  • 15. Topic Grouping Group words with similar contextual meaning Step 1: Semantic Grouping ■ Use NLP to group words with similar semantic meaning ■ Build edges and vertices - Create a graph! ■ Calc. connected components - Graph Algorithms! shipping cost delivery deliver Step 2: Contextual Grouping ■ Group word clusters to groups with contextuel meaning - Word2Vec ■ Create a graph ■ Finding paths - Graph Queries! ■ Avoid transitivity by relying on path length cost costly shipment shipping ship shipping cost delivery deliver cost costly shipment shipping ship
  • 16. Graph Processing in Spark ● GraphX Graph Algorithms VS. Graph Queries GraphFrames ● Graph Query Translation ● GraphFrames API ● Connected Components ○ GraphX Implementation ○ GraphFrames Implementation ○ Performance Takeaways Let's Talk Business
  • 18. What? ● General-purpose graph processing library ● Built into Spark ● Optimized for fast distributed computing ● Library of algorithms: PageRank, Connected Components, etc. The Bad ● Why just Scala? No Java, Python APIs. No Graph Queries ● Lower-level RDD-based API (vs. DataFrames) ● Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
  • 20. GraphFrames ● Spark evolves from RDDs to DataFrames - enjoy the benefits and optimizations of the DataFrames API ● Provides powerful tools for running queries and standard graph algorithms - using the GraphX native implementation (if needed) ● Unifies the graph algorithm and graph query APIs - available in Scala, Java and Python
  • 21. GraphFrames Unified API GraphFrames API Spark SQL ● Page Rank ● Connected Components ● BFS ● Wikipedia Collaborators ● Counting mutual friends ● Finding paths existence and patterns Pattern Query Optimizer
  • 22. GraphFrames Under The Hood: Query String, Parsed Pattern, Logical Plan, Optimized LP, DataFrame Result ● Graph Algorithms ● Materialized Views ● Relational plan translations ● View Selection ● Join Elimination and Reordering ● Example queries: graph.find("(root)-[]->(layer1)").filter("root.is_root = true") and graph.find("(root)-[]->(layer1); (layer1)-[]->(layer2)").filter("root.is_root = true")
  • 23. Relational plan translations ● Edges and vertices are represented as DataFrames ● The translation starts by building the result DataFrame ● For each new vertex in the query we generate a join ○ With the edges table - to get the src and dst of the edge ○ With the vertices table - to get the property of the vertex ● Example: graph.find("(v0)-[]->(v1); (v1)-[]->(v2)").filter("v2.attr = true") on the path a → b → c. Edges table (src, dst): (a, b), (b, c). Vertices table (id, attr): (a, 1), (b, 2), (c, 3). With v0 = a and v1 = b already bound, the edges join on src = b binds v2 = c, and the vertices join on id = c fetches the attr of v2.
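To make the join-per-vertex idea concrete, here is a minimal sketch in plain Python that evaluates the two-hop pattern over the slide's tiny edge and vertex tables. It is not how GraphFrames is implemented; the function name `two_hop` and the output column names are assumptions chosen for illustration.

```python
# Tables from the slide: edges (src, dst) and vertices (id -> attr).
EDGES = [("a", "b"), ("b", "c")]
VERTICES = {"a": 1, "b": 2, "c": 3}


def two_hop(edges, vertices):
    """Evaluate (v0)-[]->(v1); (v1)-[]->(v2) as nested-loop joins."""
    rows = []
    for src1, dst1 in edges:          # first edge binds v0 and v1
        for src2, dst2 in edges:      # join with edges on src = v1 to bind v2
            if src2 == dst1:
                rows.append({
                    "v0": src1,
                    "v1": dst1,
                    "v2": dst2,
                    "v2_attr": vertices[dst2],  # join with vertices on id = v2
                })
    return rows


result = two_hop(EDGES, VERTICES)
```

A filter such as the one on the slide would then be an ordinary predicate over the `v2_attr` column of the joined rows.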
  • 24. The GraphFrames API class GraphFrame { def vertices: DataFrame def edges: DataFrame def find(pattern: String): DataFrame def registerView(pattern: String, df: DataFrame): Unit def degrees(): DataFrame def pageRank(): GraphFrame def connectedComponents(): GraphFrame ... }
  • 25. Connected Components Goal: Assign each vertex a component ID such that vertices receive the same component ID iff they are connected. Problem: What about really large graphs? In Distributed Systems we really care about communication and data skew (partitions)
  • 26. Naive Implementation in GraphX 1. Assign each vertex a unique component ID. 2. Iterate until convergence: a. For each vertex v, update: i. Component ID of v ← smallest component ID in the neighborhood of v. Pro: Easy to implement. Con: Slow convergence on large-diameter graphs. *The diameter is the greatest distance between any pair of vertices.
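The naive scheme above can be sketched in a few lines of single-machine Python. This is an illustration of the update rule only, not the GraphX code; the function name and the round counter are assumptions added to show why diameter matters.

```python
def min_label_propagation(vertices, edges):
    """Repeatedly give every vertex the smallest label in its neighbourhood.

    Returns (labels, rounds), where rounds counts the iterations in which
    at least one label changed.
    """
    nbrs = {v: set() for v in vertices}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)

    labels = {v: v for v in vertices}  # step 1: unique component ID per vertex
    rounds = 0
    while True:
        # step 2: synchronous update from the previous round's labels
        new = {v: min([labels[v]] + [labels[u] for u in nbrs[v]]) for v in vertices}
        if new == labels:
            return labels, rounds
        labels = new
        rounds += 1


# A path 0-1-...-9 (large diameter) versus a star centred on 0 (small diameter).
path_labels, path_rounds = min_label_propagation(range(10), [(i, i + 1) for i in range(9)])
star_labels, star_rounds = min_label_propagation(range(10), [(0, i) for i in range(1, 10)])
```

On the path the smallest label moves one hop per round, so the number of rounds grows with the diameter, while the star with the same number of vertices settles after a single round.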
  • 27. Small/Big star algorithm - In GraphFrames Kiveris et al. "Connected Components in MapReduce and Beyond." 1. Assign each vertex a unique component ID. 2. Iterate until convergence: a. For each vertex v: i. Connect smaller neighbors to smallest neighbor - Small Star b. For each vertex v: i. Connect bigger neighbors to smallest neighbor (or itself) - Big Star *Motivation - We are mutating the graph without damaging connectivity into a union of Star Graphs
  • 28. Small-Star Operations 1 5 7 9 8 smallStar(v) - Connect all smaller neighbours and self to the min neighbour. *Happens in parallel on every single node to build a new graph 1 5 7 9 8
  • 29. Big-Star Operations bigStar(v) - Connect all strictly larger neighbours to the min neighbour including self. *Happens in parallel on every single node to build a new graph 1 5 7 9 8 1 5 7 9 8
  • 30. Small/Big star algorithm 1 5 7 9 8 Small/big star operations maintain graph connectivity. Extra edges are pruned during the iterations, which reduces message passing. Each connected component converges to a star graph. Converges in log²(#nodes) iterations.
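The two operations can be tried out on the slide's five vertices with a single-machine sketch. This follows one reading of the definitions on the slides, is not the GraphFrames source, and runs sequentially rather than in parallel; the example edges (a path 1-5-7-9-8 plus a separate pair 20-21) and all function names are assumptions for illustration.

```python
def adjacency(edges):
    """Build an undirected adjacency map from a set of (u, v) pairs."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return nbrs


def small_star(edges):
    """For each v: connect v and its smaller neighbours to the smallest of them."""
    out = set()
    for v, ns in adjacency(edges).items():
        group = {u for u in ns if u < v} | {v}
        m = min(group)
        out.update((m, u) for u in group if u != m)
    return out


def big_star(edges):
    """For each v: connect strictly larger neighbours to min(neighbours and v)."""
    out = set()
    for v, ns in adjacency(edges).items():
        m = min(ns | {v})
        out.update((m, u) for u in ns if u > v)
    return out


def connected_components(vertices, edges):
    """Alternate small-star and big-star until the edge set stops changing."""
    edges = {(min(a, b), max(a, b)) for a, b in edges if a != b}
    while True:
        new = big_star(small_star(edges))
        if new == edges:
            break
        edges = new
    nbrs = adjacency(edges)
    # In a star, every leaf's only neighbour is the minimum vertex of its component.
    return {v: min(nbrs.get(v, set()) | {v}) for v in vertices}


components = connected_components(
    [1, 5, 7, 8, 9, 20, 21, 30],
    [(1, 5), (5, 7), (7, 9), (9, 8), (20, 21)],
)
```

Each component ends up as a star centred on its smallest vertex, so that vertex's ID serves as the component ID; the isolated vertex 30 keeps its own ID.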
  • 31. Let’s Talk about Performance ● All datasets are taken from WebGraph Datasets ● Twitter: 42 million vertices, 1.5 billion edges (small diameter), running on 16 r3.4xlarge workers on Databricks ○ GraphX: 4 minutes ○ GraphFrames: 6 minutes ● UK Web Graph: 105 million vertices, 3.7 billion edges, running on 16 r3.4xlarge workers on Databricks ○ GraphX: 25 minutes (slow convergence) ○ GraphFrames: 4.5 minutes ● Grid: 32,000 x 32,000 grid (large diameter), 1 billion nodes, 4 billion edges, running on 32 r3.8xlarge workers on Databricks ○ GraphX: failed ○ GraphFrames: 1 hour
  • 32. How about some numbers? ● # of Reviews: ~31M ● # of Opinions: ~124M ● # of Topics: ~7.5M ● # of Semantic Clusters: ~11M ● # of Machines: 50 r3 xLarge ● Pipeline time: ~2 hours
  • 33. Key Takeaways ● Graph Queries + Graph Algorithms = GraphFrames ❤️ ● Simple ○ Easy and convenient API in the language of your choosing ○ Lives alongside other Spark components ● Flexible - you can choose between the GraphX and GraphFrames implementations ● Watch out for performance! ○ The GraphFrames implementation of Connected Components is actually worse than GraphX's in some cases ■ No silver bullet - it depends on the actual graph (size, diameter, sparseness) ○ Most distributed graph algorithms use iterative message passing between nodes - shuffle hell.
  • 34. Key Takeaways ● Monitoring - Hard to understand the execution plan ● Checkpointing is Important! - by default happens every 2 iterations ○ Handle unexpected node failures ○ Query plan explosion ○ Optimizer slowdown ○ Disk out of shuffle space
  • 35. Future work ● Performance Optimizations ○ Using different checkpointing parameters ○ Test GraphFrames native Connected Components ● Algorithm Evaluation and AI based Clustering ○ Measure the correctness of current algorithm ○ Research the use of Unsupervised Clustering ● Support additional languages ○ Insights currently supports English.