Spark War Stories
Who are we?
Tal Sliwowicz
Director, R&D
tal@taboola.com
Ruthy Goldberg
Sr. Software Engineer
ruthy@taboola.com
Our War Story
“A good plan violently executed now is better than a
perfect plan executed next week”
George S. Patton
Our Data Requirements
• Lots of incoming traffic (100K requests/sec)
• Data:
– Personalized served recommendations – per user, per page
view
– Events - what the user actually read and what they did
• The data needs to be joined and processed in real time
– Campaigns Management
– Recommendations
– Billing
– Reports
– Etc.
• The data needs to be available for offline research
Challenges
• We care about sessions - chain of page views and
events for a specific user
– Length can be hours or even days
• We care about users – chain of sessions across sites
– Length can be days or even months
• Stateless Application – a single user's data is sent from
multiple data centers and multiple servers
– No deterministic affinity to a server or DC
– Order isn’t guaranteed
– Must be robust and automatically deal with late arrivals
– “Exactly once” semantics
Challenges Cont.
• Many streams of data that need to be joined (user,
session, page view, widgets, recommendations,
events, actions)
• 5+TB of daily data
• Data analysis requires pre-joining the streams and
looking at the data across time
Naïve / Brute Force Solution
• Join some streams in the FE Server
– De-normalization is done as early as possible
– Everything that isn’t event or action is joined
– However, we cannot assume a single page view (PV) is
handled by a single server
• Join the above with events and actions in Spark
memory
– Minutes of data - ok
– 2+ Hours of data - slow (30+ minutes of processing)
– Days of data - #Fail
Why Did it Fail?
• Incoming data is received by data class (e.g. Request,
Event, etc.) and by incoming timestamp
– Separate RDD per class
– The RDDs contain randomly (hash) partitioned incoming
data
• Join key is by session and page view ids
Why Did it Fail? Cont.
• To join the data:
– First, remap the incoming data to a PairRDD and add the join
key (needs to be done individually, per RDD class)
– Second, cogroup the PairRDDs → shuffle must be performed
on all participating RDDs
• The initial data is distributed randomly across many
nodes and multiple RDDs
– Small data sets → small shuffles
– Huge data sets → unmanageable shuffles
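
The two steps above can be modelled in plain Python to see where the cost comes from. This is only a sketch, not Spark code: the partition count, the record layout, and the names `target_partition` and `cogroup` are assumptions made for illustration. Each record starts on an arbitrary partition, and grouping by the join key (session id, page view id) forces most records to move.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical cluster size


def target_partition(join_key):
    # Deterministic stand-in for a hash partitioner on the join key.
    return zlib.crc32(repr(join_key).encode()) % NUM_PARTITIONS


def cogroup(datasets):
    """Group records from several datasets by join key.

    datasets maps a class name to a list of
    (current_partition, join_key, payload) tuples.
    Returns the grouped records and how many records were not already
    on the partition that owns their join key (the "shuffle" volume).
    """
    grouped = defaultdict(lambda: {name: [] for name in datasets})
    moved = 0
    for name, records in datasets.items():
        for current_partition, join_key, payload in records:
            if current_partition != target_partition(join_key):
                moved += 1
            grouped[join_key][name].append(payload)
    return dict(grouped), moved


# Step 1: key every record by (session_id, view_id). The starting
# partition depends on arrival order, not on the join key.
requests = [(i % NUM_PARTITIONS, (f"s{i // 2}", f"v{i}"), f"request-{i}")
            for i in range(8)]
events = [((i + 1) % NUM_PARTITIONS, (f"s{i // 2}", f"v{i}"), f"event-{i}")
          for i in range(8)]

# Step 2: cogroup, counting the records that have to change partition.
grouped, moved = cogroup({"requests": requests, "events": events})
print(len(grouped), "join keys;", moved, "of 16 records had to move")
```

In this model the moved count grows with the input size, which is the point of the slide: the same join that is cheap on minutes of data becomes dominated by data movement on days of data.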
See the Shuffle
The Solution: Avoid Shuffles
The Solution
• Designed to avoid the initial / heaviest shuffle
• Go through an intermediary phase before reading the
data for analysis
• As streamed data is being received, save each
message to Cassandra
– All classes saved together to a single table
– The table is partitioned by the read key
Table Model in C*
• Partition key – session start hour + user bucket (0-9,999)
• Clustering key - publisher_id, user_id, session_id, view_id,
data_type, data_hash
• Data Type - MULTI_REQUEST, USER_EVENT,
ACTION_CONVERSION, …
• Data – Protobuf blobs
• Results:
– All the data of a single session is in one place, regardless of
time of arrival
– Idempotent process – if the same message is received twice, it
overwrites the previous arrival because it has the same hash ID
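
A minimal sketch of how such a key scheme could behave, using a nested Python dict as a stand-in for the Cassandra table. The function names, the CRC32 bucketing, the SHA-1 data hash, and the assumption that every message carries its session start time are all illustrative choices, not details taken from the slides.

```python
import hashlib
import zlib
from datetime import datetime, timezone

NUM_BUCKETS = 10_000  # user buckets 0-9,999, as on the slide


def partition_key(session_start, user_id):
    # Session start hour + user bucket. Assumes each message carries the
    # session start time, so late arrivals land in the same partition.
    hour = session_start.replace(minute=0, second=0, microsecond=0)
    bucket = zlib.crc32(user_id.encode()) % NUM_BUCKETS
    return (hour.isoformat(), bucket)


def clustering_key(publisher_id, user_id, session_id, view_id, data_type, payload):
    # The payload hash makes a re-delivered message map to the same row.
    data_hash = hashlib.sha1(payload).hexdigest()
    return (publisher_id, user_id, session_id, view_id, data_type, data_hash)


def save(table, session_start, publisher_id, user_id, session_id, view_id,
         data_type, payload):
    pk = partition_key(session_start, user_id)
    ck = clustering_key(publisher_id, user_id, session_id, view_id,
                        data_type, payload)
    table.setdefault(pk, {})[ck] = payload  # upsert: same key overwrites


table = {}
start = datetime(2015, 1, 1, 10, 17, tzinfo=timezone.utc)
save(table, start, "pub1", "u42", "s1", "v1", "MULTI_REQUEST", b"request-bytes")
save(table, start, "pub1", "u42", "s1", "v1", "USER_EVENT", b"click-bytes")
save(table, start, "pub1", "u42", "s1", "v1", "USER_EVENT", b"click-bytes")  # duplicate

rows = next(iter(table.values()))
print(len(table), "partition,", len(rows), "rows")
```

All three writes share one partition, and the duplicate delivery leaves the row count at two, which is the behaviour the "Results" bullets describe.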
Result - No Shuffle
Result
• Week of data (~35TB) - 2 hours to analyze and report
• Analyzing 1% sample of the users reduces this
linearly (partition key)
• Analyzing a single publisher which is 1% of the data
reduces this almost linearly (clustering key)
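
The 1% user sample can be illustrated with a short sketch, assuming the sample is taken by reading only the first 1% of the user buckets for each hour. The function name and the choice of the lowest-numbered buckets are hypothetical.

```python
from datetime import datetime, timedelta

NUM_BUCKETS = 10_000  # user buckets 0-9,999, as on the table model slide


def sample_partition_keys(start_hour, hours, sample_percent):
    """Yield the (hour, bucket) partition keys covering a user sample."""
    buckets = range(NUM_BUCKETS * sample_percent // 100)
    for h in range(hours):
        hour = start_hour + timedelta(hours=h)
        for bucket in buckets:
            yield (hour.isoformat(), bucket)


week_hours = 7 * 24
sampled = list(sample_partition_keys(datetime(2015, 1, 1), week_hours, 1))
total = week_hours * NUM_BUCKETS
print(len(sampled), "of", total, "partitions read")
```

Because whole partitions are skipped, the work shrinks in proportion to the sample. A single-publisher query still touches every partition and only narrows the rows read inside each one, which is consistent with the "almost linearly" wording on the slide.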
Good, but not good enough
• We used Cassandra because we had it as an
available resource
• However, Cassandra:
– Isn’t columnar - cannot read partial rows (specific columns)
– Eventually consistent – not accurate enough
– Suffers from memory issues under heavy load
– Cross DC replication isn’t reliable under heavy load
• Now working on the next gen solution
– See you in a future meetup…
Some More Tips
• Avoid cogroup and use broadcasts when one of the
RDDs is small enough
• Whenever possible use map() instead of
mapPartitions()
– Memory and processing efficiency gained
– Unless setup is expensive, in which case mapPartitions()
lets you pay for it once per partition
• G1GC – we have had a very good experience with it
in tight memory situations
– Does not work well out of the box, requires some tweaking
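
The first tip can be sketched in plain Python as a map-side join. This is a model of the idea rather than Spark code (in Spark the small table would typically be shipped with `SparkContext.broadcast`); the function name and the sample data are hypothetical.

```python
def broadcast_join(large_partitions, small_table):
    """Join each partition of a large dataset against a small lookup table.

    Every partition is processed where it already lives; only the small
    table is copied around, so no record of the large dataset moves.
    """
    lookup = dict(small_table)  # the copy every worker would receive
    joined = []
    for partition in large_partitions:
        joined.append([(key, value, lookup[key])
                       for key, value in partition
                       if key in lookup])
    return joined


# Large dataset: events keyed by publisher, already split into partitions.
event_partitions = [
    [("pub1", "click"), ("pub2", "view")],
    [("pub1", "view"), ("pub3", "click")],
]
# Small dataset: publisher names.
publishers = [("pub1", "News Site"), ("pub2", "Sports Site")]

result = broadcast_join(event_partitions, publishers)
print(result)
```

The output keeps the original partitioning of the large dataset, which is why this approach avoids the shuffle that cogroup would trigger.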
Thank You!
ruthy@taboola.com
tal@taboola.com


Editor's notes

  1. We assume you know what Taboola is; recent press