From 6 hours to 1 minute... in 2 days!
How we managed to stream our (long) Hadoop batches
Sofian DJAMAA - Software engineer
@sdjamaa
Phase 1 - Buy displays
Phase 2 - Sell clicks
Phase 3 - ???
Phase 4 - Profit
Criteo
How does it work?
Advertiser side (eCommerce website)
- User: "Let’s go on Amazon!!"
- The website contains a « pixel » used to put information on the user cookie
- We gather the information for the retargeting process
Publisher side (publisher website)
- User: "What’s on Bild website today?"
- Using the information we have on the user, tagged by a cookie, we display the right ad
HOW DO THEY
KNOOOOOOOOWW????
Our constraints
- 6 datacenters
- 3 billion events a day
- 50+ PB of data in our Hadoop cluster
- 800K HTTP requests/second
JIRA ticket generation
How do we use the data?
"Where’s my money?"
"WHERE’S MY F@*# MONEY?!?!"
"Where’s my client’s money?"
finance, sales, business dev
internal reports, client reports, billing
(heavy) data transformation
clicks, displays, purchases…
client
"YAY!! MAKIN’ MONEY!!"
client dashboard (web)
But this takes time…
Where does the data go?
"There’s an issue… let’s investigate"
"WTF?!?!?"
business escalation
release management
production alert
clicks, displays, purchases…
IIS web servers
SQL Server DBs (multiple instances per DC)
Graphite monitoring
- Data not aggregated cross-DC
- Granularity limited due to the volume of data
- Load time can be huge even if we bulk insert (or move files)
- Scaling is limited, therefore only the most aggregated level is kept in Graphite
- Up to 6 hours for some metrics
- Volume being huge, processing and storage take time (SQL Server replication hell…)
- Multiple datacenters containing data with latency issues
6 hours to get business data
+ 1 hour to check data/raise alerts
+ 1 hour to find root cause
- Some big money (up to a million €)
finance, sales, release management, client
(heavy) data transformation
PBs of data to replay
A lot of people need the same aggregated data but all with their own constraints…
- Consistency required
- Quick feedback
- Batching on Hadoop or SQL Server doesn’t meet the requirement
- We need to have our checks as soon as something goes wrong
- We need to handle both real-time and batch mode
But who can do it?
Internal hackathon
Some amazing projects such as:
- Ads in GIF format
- Embedded ad banners in movies
- Streaming service stopping a movie several times to display an ad
- Ponies
- And something about Chuck Norris (‘cause he’s awesome)
Turbo
No team wanted to take responsibility for the project, so we built our own:
- 3 1/2 developers
- 1 business release manager (as a product owner)
- 2 business intelligence engineers (the guys that write SQL queries all day long)
- 1 business developer (doesn’t code a business layer obviously)
- 1 creative
- 1 release manager
- 1 technical solution guy (the guy who helps in putting the pixels)
What people think a hackathon is…
What a hackathon really is…
SummingBird computations
Metric aggregations at banner/zone level
- Clicks, displays, sales, revenue…
Real-time part
- Aggregates are updated and sent after each event
- 30K messages/second
Batch part
- Computes the expected trend of data for each period (reference data)
- Using lasso: sum of squared errors, with a bound on the sum of the absolute values of the coefficients
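The lasso description above matches the standard constrained least-squares form. The symbols are generic notation, not taken from the slides: $y_i$ is the observed metric for period $i$, $x_{ij}$ are the predictors, $\beta_j$ the coefficients, and $t$ the bound.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^2
  \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
```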
data processing is done in batches on each side
when an offline batch is ready, it becomes the truth for the whole system
online batches are computed in streaming
batch #1 (e.g. 1H), batch #2, batch #3
online: insert AND update aggregated results for each event
offline: insert aggregated results for 1 hour of events
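The rule that an offline batch replaces the online result for the same period can be sketched in plain Scala. This is a minimal illustration, not the code from the hackathon: `BatchId`, `Metrics` and `merge` are hypothetical names, and the only behaviour taken from the slide is that offline results win once they exist.

```scala
object LambdaMerge {
  // Hypothetical types: an hourly batch index and the metrics aggregated for it.
  type BatchId = Int
  final case class Metrics(clicks: Long, displays: Long)

  // Offline results win wherever they exist; online results fill the
  // batches that the offline side has not produced yet.
  def merge(offline: Map[BatchId, Metrics],
            online: Map[BatchId, Metrics]): Map[BatchId, Metrics] =
    online ++ offline // on a key collision, the right-hand (offline) value is kept
}
```

With an offline result for batch 1 and online results for batches 1 and 2, `merge` keeps the offline value for batch 1 and the online value for batch 2.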
λ =
http://github.com/twitter/summingbird
Platform[T]
def job = source.map {
  /* your job here */
}.store()
P#Source P#Store
job is executed on
sends job results to store
redirects input to job
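The idea behind the slide (map each event to a keyed partial aggregate, then sum it into a store) can be sketched without any library. Nothing below is Summingbird's API: `Event`, `Agg`, `toPair` and `sumByKey` are hypothetical names for this snippet, and an in-memory `Map` stands in for the store.

```scala
object KeyedAggregation {
  // Hypothetical event and aggregate shapes, keyed by (banner, zone).
  final case class Event(banner: String, zone: String, clicks: Long, displays: Long)
  final case class Agg(clicks: Long, displays: Long) {
    // Associative merge: how events are grouped does not change the result.
    def +(other: Agg): Agg = Agg(clicks + other.clicks, displays + other.displays)
  }

  // "Map" step: turn one event into a keyed partial aggregate.
  def toPair(e: Event): ((String, String), Agg) =
    ((e.banner, e.zone), Agg(e.clicks, e.displays))

  // "Store" step: merge partial aggregates into an in-memory store.
  def sumByKey(store: Map[(String, String), Agg],
               events: Seq[Event]): Map[(String, String), Agg] =
    events.map(toPair).foldLeft(store) { case (acc, (key, agg)) =>
      acc.updated(key, acc.get(key).map(_ + agg).getOrElse(agg))
    }
}
```

Because the merge is associative, the same function gives the same totals whether it is fed one event at a time (streaming) or a full hour of events at once (batch).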
Why not stream everything, then?
- Streaming costs a lot of infrastructure
- Sometimes we need to replay (backfill) events from the past to correct a bug or adjust a formula
- With PBs of data generated per day, a streaming architecture needs to be massively parallelized (much more than a batching architecture) for replays
- Lambda Architecture is a good way to move towards a full streaming architecture
Rule engine
- Developed in Prolog
- 10K decisions/minute
- Linked to the real-time flow to compute the discrepancies with expected values and tag abnormal data
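The slides do not show the Prolog rules. As a rough illustration only, written in Scala rather than Prolog, a discrepancy check could compare each real-time value with the expected value from the batch side and tag it when the relative deviation exceeds a threshold. The 20% default threshold, the tag names and both function names are assumptions made up for this sketch.

```scala
object DiscrepancyRule {
  // Relative deviation of the observed value from the expected one.
  def deviation(observed: Double, expected: Double): Double =
    if (expected == 0.0) {
      // No expected traffic: any observed value is an unbounded deviation.
      if (observed == 0.0) 0.0 else Double.PositiveInfinity
    } else math.abs(observed - expected) / math.abs(expected)

  // Hypothetical rule: tag as "abnormal" above the threshold, "ok" otherwise.
  def tag(observed: Double, expected: Double, threshold: Double = 0.2): String =
    if (deviation(observed, expected) > threshold) "abnormal" else "ok"
}
```

A monitoring system could then route only the values tagged "abnormal" to alerting.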
Vizatra
- In-house analytic visualization stack: world map, graphs, real-time curves…
- AngularJS, Bootstrap, Scala, Finagle, any DB supporting SQL
- Web-component oriented: easy customization
- Query deconstruction and NOT query building :-)
- Open-source release coming soon
Riemann
- Monitoring system with a « powerful stream processing language » (nah, just Clojure configuration files)
- Sends alerts based on tags sent by the rule engine
- Scoped alerts
- Alerts are emails and SMS to on-duty people
- JIRA ticket generation
Awesomeness
✗ Data granularity
  - Checks only at publisher website level (only on RTB)
  - No checks on the client side
✓ Data granularity
  - Checks at banner/zone levels for better investigations
✗ Latency up to 6 hours
✓ Latency of 1 minute
✗ Data aggregated hourly
✓ Data aggregated in 5-min periods
✗ Money in the bank: $
✓ Money in the bank: $$$$$
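Aggregating in 5-minute periods amounts to flooring each event timestamp to the start of its window and grouping on that value. The sketch below assumes timestamps in epoch milliseconds; `windowStart` and `countPerWindow` are hypothetical helpers, not names from the talk.

```scala
object FiveMinuteBuckets {
  val WindowMillis: Long = 5L * 60L * 1000L

  // Start of the 5-minute window containing the timestamp (epoch millis).
  // floorDiv keeps the result correct for negative timestamps too.
  def windowStart(timestampMillis: Long): Long =
    Math.floorDiv(timestampMillis, WindowMillis) * WindowMillis

  // Number of events that fall into each window.
  def countPerWindow(timestamps: Seq[Long]): Map[Long, Int] =
    timestamps.groupBy(windowStart).map { case (w, ts) => w -> ts.size }
}
```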
Even more…
Thanks to the hackathon, we are now able to provide real-time feedback to sales, business developers, MDs and VPs, which led us to:
- Getting more clients, as they love having quick feedback on their campaigns
- Adjusting CPC in real time
  - For special occasions like sales
  - To aggressively test our prediction models in an A/B test
- And more…
Some feedback
✗ Exponential learning curve with all frameworks
✗ A lot of features are missing (e.g. stores)
✗ Very limited documentation and tutorials
✗ Testing the error rate between Hadoop and Storm takes far too long for a 2-day development period
✗ Cassandra was a bad choice because of the data model needed for the visualization part (lots of joins)
#Paris, #BigData, #MachineLearning, #NerfGuns, #Hadoop, #Storm, #Spark,
#Cassandra, #MongoDB, #Riemann, #Scala, #C# (?!), #Java
Sofian DJAMAA - Software engineer
@sdjamaa
WE RECRUIT!!!!
We have many open positions in R&D:
- Data Processing Systems Manager
- Senior Software Engineer (Grenoble)
- Software Development Lead/Manager
- SRE OPS Manager
- Senior Software Engineer - Palo Alto, CA
- Python Software Lead Engineer - Paris
- Software Development Engineer - Paris
- Machine Learning Scientist

Big Data Berlin - Criteo
