Viadeo Segmentation Platform with Spark on Mesos
Paris Mesos User Group - 2014/09/10 
@EugenCepoi - Viadeo
Two words on Spark 
- A general purpose distributed computing framework 
- Fast, testable, easy to code and deploy 
- A strong ecosystem is being built around it
Two words on Mesos 
- A cluster manager responsible for sharing resources (CPU & RAM) among applications
- You can write your own framework for Mesos to run new kinds of applications
Spark and Mesos at Viadeo 
- We started using them together in mid-2013
- We adopted Spark mainly for its usability and because it is a good match for our data set sizes, which lets us take full advantage of its speed
- Mesos was the logical way to run Spark and Hadoop jobs (and other kinds of software) on the same nodes with dynamic resource sharing
Deploying Spark jobs on Mesos
- Pipeline: test, build, package, deploy and run
- The job code is shipped as a Debian package
- Driver nodes hold the Mesos lib, the Spark shell, the Spark driver and the Debian-packaged job code
- A driver node is where we deploy the code and from where we launch the jobs on the cluster
Customer Segmentation 
- Divide a population into subsets of customers that share common characteristics
- Examples of segments: gender, industry, working in a big company…
- Used to achieve fine-grained targeting (ads, new products, understanding the customer)
- Example: send IBM ads to all male customers older than 40 and working in IT
The problem 
- The business always needs new segments or wants to combine existing ones
- The segmentation was computed through SQL queries on demand
- The raw data needs to be preprocessed (cleaned and computed)
- Many segments involve computations on the attributes they use, which is too expensive on MySQL
- Non-IT employees had no way to access the segmentation
The goal 
- Provide a solution that can compute segments, even on complex data, in reasonable time
- Make them available to software components and humans (e.g. Ad Targeting, AB Testing, BI team)
- Expose a service that can answer segmentation queries in real time
* Get live counters and all the members belonging to a combination of segments
* A front-end allowing non-IT people to combine segments and query them live
The pragmatic solution 
- We don’t have the time to build a real data standardization layer 
- The business team doing ad-hoc segmentation has “cleaning rules” and 
knows the MySQL tables 
- We can’t spend time changing segment definitions
- We have conventions
The idea: Delegate
- Implement the segment definitions as a SQL-like DSL that does joins implicitly and takes advantage of the conventions
- Let the business team create the segment definitions and define the cleaning rules inside the segment definition
The big picture
- Daily MySQL exports (Member, Skills, Company, ...) are stored with conventions:
    /sqoop/Member/20140101/*
    /sqoop/Skills/201401/02/*
    /sqoop/Company/20140101/*
    ...
- The Segmentation Job takes the segment definition and the exports as input and produces the members and their segments
- Real Time service:
    - Inverted index
    - Query app
The segment definition 
- A DSL focused on expressing constraints on the data 
- Doesn’t require the user to write JOINs; they are inferred from the primary keys and the column names (remember, conventions)
- A segment example:
● Define some variables (e.g. patterns you want to see in the data)
executiveKeywords: ["Dir.", "Resp.", Directeur, Directrice, Director, Dirigeant, Manager, Responsable, Chef, Chief, Head of] 
● The segment itself (and reuse available variables) 
Member.Headline = {executiveKeywords} or Position.StillInPosition=1 and Position.PositionTitle = {executiveKeywords} ...
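The slides do not spell out the DSL's semantics, so the following Python sketch only illustrates one plausible reading of the example above. It assumes that `=` against a variable means "contains any of the patterns" (case-insensitively) and that `and` binds tighter than `or`; the helper names `matches_any` and `is_executive` are hypothetical, and the elided rest of the expression is left out.

```python
# Hypothetical reading of the "executive" segment; not the talk's Scala code.
EXECUTIVE_KEYWORDS = [
    "Dir.", "Resp.", "Directeur", "Directrice", "Director", "Dirigeant",
    "Manager", "Responsable", "Chef", "Chief", "Head of",
]


def matches_any(text, patterns):
    """Assumed meaning of `field = {variable}`: text contains any pattern."""
    lowered = text.lower()
    return any(p.lower() in lowered for p in patterns)


def is_executive(member, positions):
    """Member.Headline = {kw} or (Position.StillInPosition=1 and Position.PositionTitle = {kw})."""
    headline_hit = matches_any(member["Headline"], EXECUTIVE_KEYWORDS)
    position_hit = any(
        p["StillInPosition"] == 1
        and matches_any(p["PositionTitle"], EXECUTIVE_KEYWORDS)
        for p in positions
    )
    return headline_hit or position_hit
```

Here the positions of a member are passed in already grouped, which stands in for the implicit join on the member's primary key.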
Segmentation Job
- Inputs:
    - Segments definition, e.g. now()-Member.BirthDate > 30y
    - HDFS sqoops: /sqoop/Member/20140101/*, /sqoop/Skills/201401/02/*, ...
- Steps:
    - Parse the segment definition using Scala combinators, validate it & broadcast it to all Spark workers
    - Infer the input sources and keep only the attributes we need, e.g. (Id: 1, Name: Lucas, BirthDate: 1986), (Id: 2, Name: Joe, BirthDate: 1970) become (Id: 1, BirthDate: 1986), (Id: 2, BirthDate: 1970)
    - Prune the data (rows) that won’t change the result of the expressions (reducing the shuffled data size)
    - Join and evaluate each expression (segment)
- Output: each member's segmentation
- Scale: ~30M members, +80 segments, +10 data sources, 10g ~ 1g/source
- ~2 min to complete
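The source inference, attribute pruning and evaluation steps of the segmentation job can be sketched in plain Python (no Spark, no parser combinators). This is a minimal illustration under stated assumptions: the regex-based `infer_sources`, the `Id` primary key name, and the fixed reference year 2014 standing in for `now()` are all hypothetical; only the path convention, the example expression and the example rows come from the slides.

```python
import re
from collections import defaultdict

# Assumed convention: references look like Table.Column, both capitalized.
REF = re.compile(r"\b([A-Z]\w*)\.([A-Z]\w*)\b")


def infer_sources(expression):
    """Return {table: [columns]} referenced by a segment expression."""
    needed = defaultdict(set)
    for table, column in REF.findall(expression):
        needed[table].add(column)
    return {table: sorted(cols) for table, cols in needed.items()}


def project(rows, columns):
    """Keep only the primary key and the attributes the expressions need."""
    keep = ["Id"] + columns
    return [{k: row[k] for k in keep} for row in rows]


expression = "now()-Member.BirthDate > 30y"
sources = infer_sources(expression)

# Input paths follow the storage convention shown in the slides.
paths = [f"/sqoop/{table}/20140101/*" for table in sources]

members = [
    {"Id": 1, "Name": "Lucas", "BirthDate": 1986},
    {"Id": 2, "Name": "Joe", "BirthDate": 1970},
]
slim = project(members, sources["Member"])

# Evaluate the segment, with 2014 standing in for now().
segmentation = {row["Id"]: 2014 - row["BirthDate"] > 30 for row in slim}
```

Dropping unused attributes (and, in the real job, rows that cannot affect any expression) before the join is what keeps the shuffled data small.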
Querying & in memory index 
Option 1: Index in Elasticsearch and build an app to query it
- The most natural option, but at the time we wanted to test hypotheses and didn’t want to pollute our production Elasticsearch
Querying & in memory index 
Option 2: Use a long-running Spark job as a service (popular in the Spark community)
● In a Spray app, launch a Spark job and load the data in memory
● Submit HTTP requests to the Spray app, which queries the in-memory RDD
- Increases the risk of problems/failures, as the service would run 24/7 with its data spread across N nodes
- We experienced resources offered by Mesos being blocked when running 2+ passive Spark shells
Querying & in memory index 
Option 3: A service using an in-memory inverted index + a short Spark job
● At startup, launch a job that builds the index
● Collect it on the driver node & stop the job
● Submit HTTP requests to the Spray app, which queries the in-memory index
- The quickest solution (for us) to get something running and collect feedback
Inverted in-memory index
Start from each member's segments:

Member   Segments
1        s1, s3
2        s1, s3
3        s1, s2
4        s2

Group by existing segmentation patterns:

Pattern  Members
s1,s3    1, 2
s1,s2    3
s2       4

The number of segments is fixed at runtime, so each segment is mapped to a position in a bitset to compute fast set intersections:

Bitset   Members
101      1, 2
011      3
010      4

To answer a query, compute the intersections of the query bitset and the index bitsets.
Raw query: how many in s1 and s2? Query bitset: 011. Result: 1
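The bitset index can be reproduced in a few lines of Python using plain integers as bitsets. This is a sketch of the idea rather than the production code: the function names are hypothetical, and the bit order (s1 as the lowest bit) is chosen to match the 101 / 011 / 010 values shown in the slide.

```python
from collections import defaultdict

# Fixed list of segments; segment i occupies bit i (s1 is the lowest bit).
SEGMENTS = ["s1", "s2", "s3"]
BIT = {name: 1 << i for i, name in enumerate(SEGMENTS)}


def to_bitset(segments):
    """Encode a collection of segment names as an integer bitset."""
    bits = 0
    for name in segments:
        bits |= BIT[name]
    return bits


def build_index(member_segments):
    """Group members by their exact segmentation pattern (bitset -> members)."""
    index = defaultdict(list)
    for member, segments in member_segments.items():
        index[to_bitset(segments)].append(member)
    return dict(index)


def query(index, wanted):
    """Members whose pattern contains every wanted segment."""
    q = to_bitset(wanted)
    return sorted(
        member
        for bits, members in index.items()
        if bits & q == q  # the intersection must cover the whole query
        for member in members
    )


member_segments = {1: ["s1", "s3"], 2: ["s1", "s3"], 3: ["s1", "s2"], 4: ["s2"]}
index = build_index(member_segments)
in_s1_and_s2 = query(index, ["s1", "s2"])
```

Because members are grouped by pattern, a query scans one entry per distinct pattern rather than one per member, and each check is a single bitwise AND.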
Segmentation App
- The index construction job reads the segmentation from HDFS (/segmentation/20140101/...)
- In-memory index: ~200 MB
- Spray service: < 200ms
Today... 
- The segmentation job and the App have been in production for 6 and 3 months respectively, running every day without any trouble or manual intervention
- The computed segmentation is used to display targeted ads in email campaigns
- The Segmentation App runs 24/7 and is mainly used by the sales team
Questions?
Thanks :)
