Apache Spark at Viadeo
Paris Hadoop User Group - 2014/04/09
Viadeo
@EugenCepoi
Open source author: Genson
Specialized in web scale systems and algorithmics
Apache Spark
- http://spark.incubator.apache.org/
- http://vision.cloudera.com/mapreduce-spark/
- A fast, easy-to-use engine for distributed
computing, written in Scala (with Java & Python support)
Spark ecosystem
[Diagram: the Spark ecosystem stack, including Tachyon.]
Why we like Spark
- Scala and functional languages are a great match for the typical operations in
distributed systems
- Easy to test: most of your code is standard Scala code
- Already has a stack for streaming, ML and graph computations
- Lets you store data in RAM, and is fast even without it!
- Provides a collect/broadcast mechanism to fetch all the data on the
“Driver” node or to send local data to every worker
- Has an interactive shell (à la Scala), useful for analysing your data and for debugging
And much more...
At the core of Spark...
The Resilient Distributed Dataset - RDD
- Abstraction over partitioned data
- RDDs are read-only and can be built only from stable storage or by transformations on
other RDDs
[Diagram: RDD lineage. A HadoopRDD built from HDFS is the parent of a MappedRDD (map(closure)), which in turn is the parent of a FilteredRDD (filter(closure)).]
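The chaining in the lineage diagram maps directly onto ordinary Scala collection code, which is part of why Spark feels natural to Scala developers. A minimal plain-Scala sketch of the same map/filter chaining (in Spark each step would derive a new RDD in the lineage rather than a new collection):

```scala
object LineageSketch {
  def main(args: Array[String]): Unit = {
    // In Spark, sc.textFile(...) would yield a HadoopRDD; map and filter
    // would then derive a MappedRDD and a FilteredRDD from it.
    val lines = Seq("spark", "hadoop", "mesos", "hbase")
    val upper = lines.map(_.toUpperCase)     // analogous to MappedRDD
    val short = upper.filter(_.length <= 5)  // analogous to FilteredRDD
    println(short.mkString(","))             // SPARK,MESOS,HBASE
  }
}
```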
At the core of Spark...
[Diagram: the Spark Driver (holding the Spark Context, the Scheduler and the lineage of transformations) exchanges tasks and results with Worker nodes, each holding data in RAM.]
Spark Driver
- Defines & launches operations on RDDs
- Keeps track of RDD lineage, allowing
lost partitions to be recovered
- Uses the RDD DAG in the scheduler to
define the stages and tasks to be
executed
Workers
- Read/transform/write RDD partitions
- Communicate with other workers to
share their cache and shuffle data
Beginning with Spark
- Started POCing with Spark 0.8 in summer 2013
- Standalone cluster on AWS in our VPC
=> hostname resolution problems
=> additional cost to get the data to S3 and push it back
=> network performance issues
- In the end we chose to reuse our on-premise Hadoop cluster
Running Spark on Mesos
- A cluster manager that isolates resources (CPU & RAM) and
shares them between applications
- Strong isolation with Linux control groups (cgroups)
- Flexible resource request model:
=> Mesos makes resource offers
=> the framework decides to accept or reject them
=> the framework tells Mesos which tasks to run and how much of the resources they
will use
- We can run other frameworks on the same cluster (e.g. our services platform)
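The offer model can be sketched abstractly; the names below are purely illustrative and not the Mesos API:

```scala
object OfferModel {
  case class Offer(cpus: Double, memGb: Double)

  // A framework inspects each resource offer and either declares the share
  // of the offer its tasks will use, or rejects the offer entirely.
  def decide(offer: Offer, needCpus: Double, needMem: Double): Option[Offer] =
    if (offer.cpus >= needCpus && offer.memGb >= needMem)
      Some(Offer(needCpus, needMem)) // accept: use only what the tasks need
    else
      None                           // reject: wait for a better offer

  def main(args: Array[String]): Unit = {
    println(decide(Offer(4, 16), needCpus = 2, needMem = 8)) // accepted
    println(decide(Offer(1, 16), needCpus = 2, needMem = 8)) // rejected
  }
}
```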
Deploy & run jobs from a
dedicated server.
Hadoop & HDFS run on the same nodes
as Mesos/Spark.
The Mesos slave fetches the Spark
framework and starts the executor.
Some repetitive tasks
- Repeating the creation of a new project, with packaging, scripts, config,
etc., is boring
- Most jobs take Avro files from HDFS as input (Sqoop exports, event
logs) and output Avro
- They often share a common structure and testing utilities
- We want to fetch the latest data, or all the data in a time range
Creating Sparky
- To improve our productivity we created Sparky, a thin
library around Spark
=> read/write Avro from HDFS, work with data partitioned by time, testing,
benchmarking and ML utilities (ROC & AUC)
=> a job structure that standardizes job design and adds some abstractions
around SparkContext and job configuration (could allow us to have shared RDDs, like the Job
Server from Ooyala?)
=> a Maven archetype to generate new Spark projects ready for production (Debian
packaging, scripts, configuration and a job template)
Problems we had
- “No more space left…” => spark.local.dir was pointing to /tmp … oops
- NotSerializableException
=> closures referencing variables outside the function scope
=> classes not implementing Serializable; enable Kryo serialization
- Kryo throwing ArrayIndexOutOfBoundsException for large objects: increase
spark.kryoserializer.buffer.mb
- Losing still-in-use broadcast data due to spark.cleaner.ttl
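The settings above live in Spark's configuration; an illustrative fragment (the property names are the actual Spark 0.x/1.x keys, the paths and values are made up):

```
# Point shuffle/spill files at a volume with space, not /tmp
spark.local.dir                  /mnt/spark-local

# Enable Kryo and give it a larger buffer for big objects
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.mb   64

# Careful with spark.cleaner.ttl: a TTL that is too low can drop
# broadcast data that is still in use
```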
Problems we had
- Reading Avro GenericRecord instead of SpecificRecord
=> Avro does not find the generated classes in your application
=> the Spark assembly with Hadoop includes Avro, which is loaded by a different
classloader than the one used by the application
=> tweak the Avro API to use the task classloader
=> alternative solution: repackage Spark without Avro and include Avro in your
application's fat jar
Things we must do
- Use a scheduler such as Chronos to handle inter-job
dependencies
- Migrate Hadoop jobs to Mesos or YARN (and use the
same cluster manager for Spark jobs)
- Add automatic job output validation features
e.g. check that the output size is close to the expected size
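A hypothetical sketch of such a validation check, comparing the actual output size against the expected one within a relative tolerance:

```scala
object OutputCheck {
  // Flag job outputs whose size deviates from the expectation by more
  // than the given relative tolerance (e.g. 0.1 for +/-10%).
  def withinTolerance(actual: Long, expected: Long, tolerance: Double): Boolean =
    math.abs(actual - expected) <= expected * tolerance

  def main(args: Array[String]): Unit = {
    println(withinTolerance(95, 100, 0.1)) // true
    println(withinTolerance(40, 100, 0.1)) // false
  }
}
```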
Systems we built with Spark
Job offer click prediction (in emails)
Members segmentation & targeting
Job offer click prediction
- Increase click rates on job offers in our career mail
- The current solution uses Solr and searches for the best matches
in textual content between members and job offers
- POC of a click prediction engine using ML algorithms and
Spark
- Achieving a ~5-7% click increase in our career mail
Algorithm overview
Job Titles clustering
[Diagram: similar job titles are grouped into clusters, e.g. “Software engineer” and “Software developer” end up in one cluster, “Project manager” in another.]
Job Position Transition Graph
[Diagram: a graph of transitions between positions, e.g. Software engineer -> Software architect -> Project Manager.]
Compute input variables:
- distance between member attributes and job offer attributes
- job offer click count, etc.
Linear regression to score offers:
x = input variables
y = boolean, has clicked
model:
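Generically, a linear model scores an offer as a weighted sum of the input variables; the exact model formula from the slide is not reproduced here, and the weights below are made up:

```scala
object LinearScore {
  // score = intercept + w . x, where x are the input variables
  // (attribute distances, click counts, ...) and w the learned weights.
  def score(w: Array[Double], x: Array[Double], intercept: Double): Double =
    intercept + w.zip(x).map { case (wi, xi) => wi * xi }.sum

  def main(args: Array[String]): Unit =
    // 0.1 + 0.5 * 1.0 + 2.0 * 0.25 = 1.1
    println(score(Array(0.5, 2.0), Array(1.0, 0.25), 0.1))
}
```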
Spark/HDFS/HBase at the core of our system
- The prediction engine is made of several Spark jobs working together
- The job computing the predictions does not contain any algorithmic logic:
=> it fetches past predictions/clicks from HBase
=> uses an algorithm implementation and trains it with the data
=> stores the predictions back to HBase
- Our algorithms implement a contract (train, predict) and contain all the logic
=> this allows us to plug new algorithms into the prediction job
=> and to have an automatic benchmark system to measure algorithm performance (really nice!
:))
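A hypothetical rendering of that (train, predict) contract; the trait and the trivial baseline implementation below are illustrative, not Viadeo's actual code:

```scala
// Every algorithm implements the same contract, so the prediction job can
// train and run any of them, and the benchmark can compare them.
trait ClickAlgorithm {
  def train(examples: Seq[(Array[Double], Boolean)]): Unit
  def predict(x: Array[Double]): Double // estimated click probability
}

// Trivial baseline: predict the global click rate seen during training.
class GlobalRate extends ClickAlgorithm {
  private var rate = 0.0
  def train(examples: Seq[(Array[Double], Boolean)]): Unit =
    rate = examples.count(_._2).toDouble / examples.size
  def predict(x: Array[Double]): Double = rate
}

object ContractDemo {
  def main(args: Array[String]): Unit = {
    val algo = new GlobalRate
    algo.train(Seq((Array(1.0), true), (Array(2.0), false),
                   (Array(3.0), true), (Array(4.0), true)))
    println(algo.predict(Array(5.0))) // 0.75
  }
}
```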
Spark/HDFS/HBase at the core of our system
The broadcast feature of Spark helped us a lot to share small-to-medium-sized
data by broadcasting it to all workers.
[Diagram: the Job Titles clustering and the Job Position Transition Graph are broadcast to every Worker.]
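The pattern, sketched with plain Scala: in Spark the shared map would be wrapped with sc.broadcast(...) and read through .value inside the tasks, so each worker receives a single copy instead of one per task (the data below is made up):

```scala
object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    // Small-to-medium reference data, e.g. the job title clusters.
    val titleClusters = Map("Software developer" -> "Software engineer")
    val offers = Seq("Software developer", "Project manager")
    // Every "task" resolves titles against the shared read-only map.
    val normalized = offers.map(t => titleClusters.getOrElse(t, t))
    println(normalized.mkString(", ")) // Software engineer, Project manager
  }
}
```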
Spark/HDFS/HBase at the core of our system
- We stored only Avro on HDFS, partitioned by date
- We stored only primitive types (int, long, byte) in HBase and designed row keys allowing
efficient partial range scans
=> decile#prediction_date#memberId, c:click_target#click_time, targetId
=> the decile was used as the sharding key
=> prediction_date and click_target were used for partial range scans, to construct
benchmark datasets and to filter the specific kind of events we wanted to predict (click on the job offer,
click on the company name, etc.)
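A hypothetical helper mirroring that key layout (the field values are made up):

```scala
object RowKeys {
  // decile#prediction_date#memberId: the decile prefix shards writes across
  // region servers, while date and member id keep partial range scans cheap.
  def predictionKey(decile: Int, predictionDate: String, memberId: Long): String =
    s"$decile#$predictionDate#$memberId"

  // Scan prefix for "all predictions of a decile on a given date".
  def scanPrefix(decile: Int, predictionDate: String): String =
    s"$decile#$predictionDate#"

  def main(args: Array[String]): Unit = {
    val key = predictionKey(3, "2014-04-09", 42L)
    println(key)                                         // 3#2014-04-09#42
    println(key.startsWith(scanPrefix(3, "2014-04-09"))) // true
  }
}
```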
Job offer click prediction platform
[Diagram: the click prediction platform.
- Inputs: event logs via Flume and MySQL exports via Sqoop, landing on HDFS; Job Title clusters on HDFS; R scripts for analysis.
- Click Tracking: prepares click data for efficient retrieval in the benchmark, analytics and prediction jobs; clicks and predictions are stored in HBase.
- Benchmark: automatic benchmark (ROC & AUC) of algorithms, based on predicted job offers and the corresponding clicks.
- Click Prediction: learns from the past and predicts clicks using the selected algorithm (Algo v1, Algo v2).
- An API exposes the recommended job offers.]
Systems we built with Spark
Job offer click prediction (in emails)
Members segmentation & targeting
Member Segmentation
- What is a segment? e.g. all male customers older than 30 working in IT
- Our traffic team, responsible for campaign targeting, was using MySQL
- Their queries were taking hours to complete
- The process was not industrialized, and thus not usable outside the ad-targeting scope
- Our goal: build a system that allows us to
=> easily express segments
=> compute the segmentation for all our members quickly
=> use it for targeting in emails, selecting A/B testing populations, etc.
[Diagram: the Segmentation Job.
1. Parse the segment definitions (e.g. now() - Member.BirthDate > 30y and Position.DepartmentId = [....]) using Scala combinators, validate them & broadcast them to all Spark workers.
2. Infer the input sources and keep only the attributes we need, reducing the data size (HDFS paths such as /sqoop/path/Member/importTime/... and /sqoop/path/Position/importTime/...).
3. Prune the data (rows) that won't change the result of the expressions, e.g. Members {Id: 1, Name: Lucas, BirthDate: 1986} and {Id: 2, Name: Joe, BirthDate: 1970} are reduced to {Id: 1, BirthDate: 1986} and {Id: 2, BirthDate: 1970}.
4. Evaluate each expression (segment), producing the segmentation for each member.
~4 min for 60+ segments and all our members!]
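A much-simplified sketch of such a segment expression: the real system parses the textual definition with Scala parser combinators, while here the AST is built by hand and the member is a plain map (all names and values are illustrative):

```scala
object SegmentSketch {
  // Tiny AST for expressions like:
  //   now() - Member.BirthDate > 30y and Position.DepartmentId = [...]
  sealed trait Expr { def eval(member: Map[String, Int]): Boolean }
  case class OlderThan(years: Int, currentYear: Int) extends Expr {
    def eval(m: Map[String, Int]) = currentYear - m("BirthDate") > years
  }
  case class DeptIn(ids: Set[Int]) extends Expr {
    def eval(m: Map[String, Int]) = ids.contains(m("DepartmentId"))
  }
  case class And(l: Expr, r: Expr) extends Expr {
    def eval(m: Map[String, Int]) = l.eval(m) && r.eval(m)
  }

  def main(args: Array[String]): Unit = {
    val segment = And(OlderThan(30, currentYear = 2014), DeptIn(Set(7)))
    println(segment.eval(Map("BirthDate" -> 1970, "DepartmentId" -> 7))) // true
    println(segment.eval(Map("BirthDate" -> 1990, "DepartmentId" -> 7))) // false
  }
}
```

Once validated, such an expression tree is cheap to broadcast to every worker, which then evaluates it against each member row.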
[Diagram: the Campaign Targeting Job.
- Inputs: the segmentation for each member, plus the Campaign Definitions (ad, sold volume, price, combination of segments).
- Compute the candidate campaigns for each member based on their segmentation.
- Select the best campaign for each member based on the volume sold, the campaign duration, etc.
- Output: the targeted campaigns per member, consumed by Mailing Jobs written with Scalding.]
Next steps
- Build inverted indexes over the computed segmentation
(Elasticsearch?)
- Expose the indexes to BI for analytics purposes
- Expose the indexes to Viadeo customers, allowing them to
create targeted campaigns online
Questions?
Or send them to cepoi.eugen@gmail.com
Thanks :)
Viadeo Data mining & machine learning team
@EugenCepoi
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

Apache Spark at Viadeo: Fast Distributed Computations and ML Systems

  • 1. Apache Spark at Viadeo Paris Hadoop User Group - 2014/04/09 Viadeo
  • 2. @EugenCepoi Open source author: Genson Specialized in web scale systems and algorithmics
  • 3. Apache Spark - http://spark.incubator.apache.org/ - http://vision.cloudera.com/mapreduce-spark/ - A fast & easy to use engine for distributed computing in Scala (Java & Python support)
  • 5. Why we like Spark - Scala and functional languages are a great match for typical operations in distributed systems - Easy to test, most of your code is standard Scala code - Already has a stack for doing streaming, ML and graph computations - Allows you to store data in RAM, and even without it, it is fast! - Provides a collect/broadcast mechanism allowing to fetch all the data on the “Driver” node or to send local data to every worker - Has an interactive shell (a la Scala), useful to analyse your data and to debug And much more...
  • 6. At the core of Spark... The Resilient Distributed Dataset - RDD - Abstraction over partitioned data - RDDs are read-only and can be built only from stable storage or by transformations on other RDDs. Lineage example: HadoopRDD (from HDFS) -> MappedRDD (map(closure)) -> FilteredRDD (filter(closure)), each child keeping a reference to its parent
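The lineage above (HadoopRDD -> MappedRDD -> FilteredRDD) can be sketched in code. Since this page cannot assume a running SparkContext, the sketch below uses plain Scala collections, which expose the same map/filter shape; the Spark equivalents are noted in comments, and all data is made up.

```scala
// Illustrative stand-in for an RDD pipeline: in Spark, `lines` would be a
// HadoopRDD built from HDFS (e.g. sc.textFile(...)), and each transformation
// would lazily create a child RDD pointing at its parent.
val lines: Seq[String] = Seq("a,1", "b,2", "c,30")

// map(closure) -> would create a MappedRDD in Spark
val parsed: Seq[(String, Int)] =
  lines.map { l => val parts = l.split(","); (parts(0), parts(1).toInt) }

// filter(closure) -> would create a FilteredRDD in Spark
val big: Seq[(String, Int)] = parsed.filter { case (_, v) => v >= 2 }
```

In Spark the chain is lazy: nothing runs until an action (e.g. `collect`) is called, and lost partitions are recomputed by replaying these transformations from the parent.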
  • 7. At the core of Spark... (diagram: the Spark Driver, which holds the SparkContext and the Scheduler, exchanges tasks and results with Workers that keep Data in RAM) Spark Driver - Defines & launches operations on RDDs - Keeps track of the RDD lineage, allowing to recover lost partitions - Uses the RDD DAG in the Scheduler to define the stages and tasks to be executed Workers - Read/Transform/Write RDD partitions - Need to communicate with other workers to share their cache and shuffle data
  • 8. Beginning with Spark - Started to POC with Spark 0.8 in summer 2013 - Standalone cluster on AWS in our VPC => hostname resolution problems => additional cost to get the data onto S3 and push it back => network performance issues - Finally we chose to reuse our on-premise Hadoop cluster
  • 9. Running Spark on Mesos - Cluster manager providing resource (CPU & RAM) isolation and sharing them between applications - Strong isolation with Linux control groups - Flexible resource request model: => Mesos makes resource offers => the framework decides to accept/reject => the framework tells Mesos which tasks to run and what amount of the resources they will use - We can run other frameworks (e.g. our services platform)
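The offer model can be sketched as a pure decision function. This is a hypothetical simplification (a real framework implements the Mesos Scheduler API); the `Offer` and `TaskSpec` types below are illustrative, not Mesos classes.

```scala
// Hypothetical sketch of the Mesos resource-offer handshake.
// Real frameworks implement org.apache.mesos.Scheduler; this only models the decision.
case class Offer(cpus: Double, memMb: Int)
case class TaskSpec(cpus: Double, memMb: Int)

// Accept the offer only if the task fits; the accepted slice tells Mesos
// what amount of the offered resources will actually be used.
def decide(offer: Offer, task: TaskSpec): Option[TaskSpec] =
  if (offer.cpus >= task.cpus && offer.memMb >= task.memMb) Some(task)
  else None // reject: Mesos re-offers the resources to other frameworks
```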
  • 10. Deploy & run jobs from a dedicated server Hadoop & HDFS on the same nodes as Mesos/Spark. The Mesos slave fetches the Spark framework and starts the executor
  • 11. Some repetitive tasks - Repeating the creation of a new project with packaging, scripts, config, etc. is boring - Most jobs take as input Avro files from HDFS (Sqoop export, Event logs) and output Avro - They often have common structure and testing utilities - Want to get latest data or all data in a time range
  • 12. Creating Sparky - To improve our productivity we created Sparky, a thin library around Spark => read/write Avro from HDFS, work with data partitioned by time, testing, benchmarking, ML utilities (ROC & AUC) => a job structure allowing to standardize job design and add some abstractions around SparkContext and job configuration - could allow us to have shared RDDs (Job Server from Ooyala?) => a maven archetype to generate new Spark projects ready for production (debian packaging, scripts, configuration and job template)
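One of the utilities mentioned, getting the latest data or all data in a time range, boils down to generating time-partitioned HDFS paths. A minimal sketch; the path layout and function name are assumptions, not Sparky's actual API.

```scala
import java.time.LocalDate

// Hypothetical partition layout: one directory per day under a dataset root.
// A job would then pass the resulting paths to its input RDD.
def partitionPaths(base: String, from: LocalDate, to: LocalDate): Seq[String] =
  Iterator.iterate(from)(_.plusDays(1))
    .takeWhile(!_.isAfter(to))
    .map(d => s"$base/$d")   // LocalDate.toString is ISO yyyy-MM-dd
    .toSeq
```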
  • 13. Problems we had - “No more space left…” => spark.local.dir pointing to /tmp … oops - NotSerializableException => closures referencing variables outside the function scope => classes not implementing Serializable; enable Kryo serialization - Kryo throwing ArrayIndexOutOfBoundsException for large objects: increase spark.kryoserializer.buffer.mb - Losing still-in-use broadcasted data due to spark.cleaner.ttl
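The fixes above map to a handful of settings. A hedged example of what a spark-defaults style configuration might look like; the values are illustrative, and spark.kryoserializer.buffer.mb / spark.cleaner.ttl are the 0.8/0.9-era property names.

```properties
# Point shuffle/spill space at a large disk instead of /tmp
spark.local.dir=/data/spark/tmp
# Use Kryo instead of Java serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
# Grow the Kryo buffer for large objects (MB, illustrative value)
spark.kryoserializer.buffer.mb=64
# Metadata cleaner TTL (seconds); set too low it can evict still-used broadcasts
spark.cleaner.ttl=3600
```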
  • 14. Problems we had - Reading Avro GenericRecord instead of SpecificRecord => Avro does not find the generated classes in your application => the Spark assembly with Hadoop includes Avro, and it is loaded by a different classloader than the one used for the application => tweak the Avro API to use the task classloader => alternative solution: repackage Spark without Avro and include Avro in your application fat jar
  • 15. Things we must do - Use a scheduler such as Chronos, allowing to handle inter-job dependencies - Migrate Hadoop jobs to use Mesos or YARN (and use the same cluster manager for Spark jobs) - Automatic job output validation features, e.g. check that the output size is close to the expected one
  • 16. Systems we built with Spark Job offer click prediction (in emails) Members segmentation & targeting
  • 17. Job offer click prediction - Increase click rates on job offers in our career mail - The current solution uses Solr and searches for the best matches in textual content between members and job offers - POC of a click prediction engine using ML algorithms and Spark - Achieving ~5-7% click increase in our career mail
  • 18. Algorithm overview (diagram): job titles clustering (e.g. “Software engineer” and “Software developer” grouped into one cluster) and a job position transition graph (e.g. Software engineer -> Software architect, Project Manager); compute input variables: distance between member attributes and job offer attributes, job offer click count, etc.; linear regression to score offers, with x = input variables and y = boolean, has clicked
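The scoring step can be illustrated as a weighted sum over the input variables; a minimal sketch with made-up weights and features, not the actual Viadeo model.

```scala
// Score = w . x : a higher score means a higher predicted click likelihood.
// Features could be e.g. member/offer attribute distance and past click count
// (both made up here).
def score(weights: Seq[Double], x: Seq[Double]): Double =
  weights.zip(x).map { case (w, v) => w * v }.sum

val weights = Seq(0.8, 0.1)  // hypothetical learned coefficients
val offerA  = Seq(0.9, 3.0)  // close textual match, 3 past clicks
val offerB  = Seq(0.2, 1.0)

// Rank offers for a member by descending score.
val ranked = Seq("A" -> score(weights, offerA), "B" -> score(weights, offerB))
  .sortBy(-_._2)
  .map(_._1)
```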
  • 19. Spark/HDFS/HBase at the core of our system - The prediction engine is made of several Spark jobs working together - The job computing the predictions does not have algorithmic logic: => fetches past predictions/clicks from HBase => uses an algorithm implementation and trains it with the data => stores the predictions back to HBase - Our algorithms implement a contract (train, predict) and contain all the logic => allows us to plug new algorithms into the prediction job => and to have an automatic benchmark system to measure algorithm performance (really nice! :))
  • 20. Spark/HDFS/HBase at the core of our system - The broadcast feature of Spark helped us a lot to share small-to-medium-size data, by broadcasting it to all workers (diagram: the job titles clustering and the job position transition graph broadcast to every worker)
  • 21. Spark/HDFS/HBase at the core of our system - We stored only Avro on HDFS, partitioned by date - We stored only primitive types (int, long, byte) in HBase and designed row keys allowing us to do efficient partial range scans => decile#prediction_date#memberId, c:click_target#click_time, targetId => the decile was used as the sharding key => prediction_date and click_target were used to do partial range scans to construct benchmark datasets and to filter the specific kind of events we wanted to predict (click on the job offer, click on the company name, etc.)
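The row key scheme can be sketched directly. This is illustrative only: the real keys stored primitive types as bytes rather than a '#'-joined string, and the decile computation here is a made-up sharding function.

```scala
// decile#prediction_date#memberId : putting the decile first spreads writes
// across regions, and the date second still allows partial range scans
// per decile and date.
def predictionRowKey(memberId: Long, predictionDate: String, deciles: Int = 10): String = {
  val decile = (memberId % deciles).toInt  // hypothetical sharding function
  s"$decile#$predictionDate#$memberId"
}
```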
  • 22. Job offer click prediction platform (architecture diagram): Event Logs arrive via Flume and MySQL exports via Sqoop into HDFS, alongside the Job Title clusters and R scripts; Click Tracking prepares click data for efficient retrieval in the benchmark, analytics and prediction jobs; the Click Prediction job learns from the past and predicts clicks using the selected algorithm (Algo v1, Algo v2); Clicks and Predictions are stored in HBase; a Benchmark job automatically benchmarks (ROC & AUC) the algorithms based on predicted job offers and the corresponding clicks, feeding Analytics; an API exposes the recommended job offers
  • 23. Systems we built with Spark Job offer click prediction (in emails) Members segmentation & targeting
  • 24. Member Segmentation - What is a segment? ex: All male customers older than 30 working in IT - Our traffic team, responsible for campaign targeting, was using MySQL - Their queries were taking hours to complete - It was not industrialized, thus not usable outside the ad targeting scope - Our goal: build a system that allows us to => easily express segments => compute the segmentation for all our members quickly => use it to do targeting in emails, select AB testing targets, etc.
  • 25. Segmentation Job (diagram): parse each segment definition (e.g. now()-Member.BirthDate > 30y and Position.DepartmentId = [....]) using Scala parser combinators, validate it & broadcast it to all Spark workers; infer the input sources from the expressions and keep only the attributes we need, reducing data size (e.g. Members with Id, Name, BirthDate pruned to Id, BirthDate); read them from HDFS (/sqoop/path/Member/importTime/..., /sqoop/path/Position/importTime/...); prune the data (rows) that won't change the result of the expressions; then evaluate each expression (segment) per member. Output: the segmentation for each member, ~4 min for 60+ segments and all our members!
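The "parse, validate & evaluate each segment per member" step can be sketched with a tiny expression AST. The parser-combinator front end is omitted; the AST is built by hand here, and all names are illustrative, not the actual segment language.

```scala
// Tiny illustrative segment model: the real system parsed a definition
// language with Scala parser combinators; here the AST is built by hand.
sealed trait Expr
case class AgeOver(years: Int)         extends Expr
case class DepartmentIn(ids: Set[Int]) extends Expr
case class And(l: Expr, r: Expr)       extends Expr

case class Member(age: Int, departmentId: Int)

// Evaluate one segment expression against one member.
def eval(e: Expr, m: Member): Boolean = e match {
  case AgeOver(y)        => m.age > y
  case DepartmentIn(ids) => ids.contains(m.departmentId)
  case And(l, r)         => eval(l, m) && eval(r, m)
}

// "older than 30 and working in IT-like departments" (department ids made up)
val segment = And(AgeOver(30), DepartmentIn(Set(7, 12)))
```

In the real job, the validated expressions are broadcast once and each worker evaluates them over its partition of members.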
  • 26. Campaign Targeting Job (diagram): inputs are the campaign definitions (ad, sold volume, price, combination of segments) and the segmentation for each member; compute the candidate campaigns for each member based on their segmentation, then select the best campaign for the member based on the volume sold, the campaign duration, etc. Output: the targeted campaigns per member, consumed by the mailing jobs with Scalding
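The "candidate campaigns, then best campaign" selection can be sketched as a filter plus a ranking; the `Campaign` fields and the ranking criterion (remaining sold volume) are assumptions for illustration, not the actual business rules.

```scala
// Hypothetical campaign selection: among campaigns whose required segments
// the member belongs to, pick the one with the largest remaining sold volume.
case class Campaign(id: String, segments: Set[String], remainingVolume: Int)

def bestCampaign(memberSegments: Set[String], campaigns: Seq[Campaign]): Option[Campaign] =
  campaigns
    .filter(_.segments.subsetOf(memberSegments)) // candidate campaigns
    .sortBy(-_.remainingVolume)                  // prioritise volume still to deliver
    .headOption
```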
  • 27. Next steps - Build inverted indexes for computed segmentation (ElasticSearch?) - Expose the indexes to BI for analytics purpose - Expose the indexes to Viadeo customers allowing them to create targeted campaigns online
  • 28. Questions? Or send them at cepoi.eugen@gmail.com
  • 29. Thanks :) _ Viadeo Data mining & machine learning team @EugenCepoi