SlideShare une entreprise Scribd logo
1  sur  53
Télécharger pour lire hors ligne
Building a Real-Time Fraud
Prevention Engine Using Open
Source (Big Data) Software
Kees Jan de Vries
Booking.com
Who am I?
About me
• Physics PhD Imperial College
• Data Scientist at Booking.com for 1.5 year
▸ Security Department
• linkedin.com/in/kees-jan-de-vries-93767240
About Booking.com
• World leader in connecting travellers with the widest
variety of great places to stay
• Part of The Priceline Group, the world’s 3rd largest
e-commerce company (by market capitalisation)
• Employing 14,000 people in 180 countries
• Each day, over 1,200,000 room nights are reserved on
Booking.com
Contents.
• Motivation
• Running Example
▸ Probability to Book
• Real Time Prediction Engine: Lessons Learnt
▸ Aggregate Features
▸ Models Training and Deployment
▸ Interpretation of Individual Scores
Motivation
Motivation.
booking.com
Motivation.
booking.com
Booking.com
• Empower people to experience the
world
Security
• Provide secure reservation system
(facilitating hundreds of thousands of
reservations on a daily basis)
Fraud Prevention
• Protect customers and property
partners, ensuring the best possible
experience
Motivation for this Talk.
Train awesome Model
Serve in Real Time
at Low Latency using
aggregate features
Running Example
Probability to Book
Disclaimer: although the running example
presented in these slides may seem realistic, it is
only intended to highlight some lessons learnt
about building a real time prediction engine.
Probability to Book.
Let’s book a hotel for
the Spark Summit
East 2017 in Boston
Probability to Book.
OK. Let’s click on the
first hit.
Probability to Book.
Nice. Here’s all the
information I need.
But maybe I’ll browse
a few more, to make
the best choice.
Probability to Book.
Nice. Here’s all the
information I need.
But maybe I’ll browse
a few more, to make
the best choice.
Let’s help the customer
make the best choice
for them!
Probability to Book.
We’ll calculate the
probability of booking
when the Customer
clicks on an
accommodation
Nice. Here’s all the
information I need.
But maybe I’ll browse
a few more, to make
the best choice.
Behind the Scenes.
User id
Page id
...
Action
PredictionEngine
Click!
Behind the Scenes.
Labels
• Booked? Yes/No
Features
• Simple
▸ Time of day
▸ ...
• Profile
▸ User
▸ Circumstantial
Click!
Behind the Scenes.
Labels
• Booked? Yes/No
Features
• Simple
▸ Time of day
▸ ...
• Profile
▸ User
▸ Circumstantial
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Click!
Behind the Scenes.
Features
Model
Action
User id
...
Click!
Action
Behind the Scenes.
Features
Model
Action
User id
...
Click!
Aggregate
features
“ ”
Action
Aggregate Features
Aggregate Features.
Time
Requests Data Streams
Click Booking
Aggregate Features.
Time
Requests Data Streams
Click Booking
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features.
Users
Time
...
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Requests
Clicks (Requests)
Aggregate Features.
Users
Time
...
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Requests
Clicks (Requests)
Clicks coincide
with requests, so
aggregation is
preferably instant
Aggregate Features.
Users
Time
...
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Requests
Clicks (Requests)
Clicks coincide
with requests, so
aggregation is
preferably instant
Small time
window, so limited
amount of data
Aggregate Features.
SELECT
COUNT(DISTINCT hotel_id)
FROM cliks.win:time(30 min)
GROUP BY user_id
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
In memory Complex Event Processing
• No lag: instant aggregation
• Scalability: see Esper website
• Not persistent
Aggregate Features.
SELECT
COUNT(DISTINCT hotel_id)
FROM cliks.win:time(30 min)
GROUP BY user_id
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...From http://espertech.com/esper/faq_esper.php#streamprocessing
“Complex Event Processing and Esper are standing queries and latency to the answer is
usually below 10us with more than 99% predictability.”
“The first component of scaling is the throughput that can be achieved running
single-threaded. For Esper we think this number is very high and likely between 10k to
200k events per second.”
In memory Complex Event Processing
Aggregate Features.
Users
Time
...
Requests
Bookings
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features.
Users
Time
...
Requests
Bookings
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Very High Cardinality:
features for every user who
made at least one booking
Aggregate Features
High Cardinality Features with Cassandra
• Very scalable: reads & writes
• None-instant aggregations
▸ Consistency fundamentally bound
by gossip protocol
• Persistent
( )
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features
High Cardinality Features with Cassandra
( )
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
From http://cassandra.apache.org/
● Proven
● Fault Tolerant
● Performant
● Scalable
● Elastic
● ...
“Some of the largest production
deployments include Apple's, with over
75,000 nodes storing over 10 PB of
data, Netflix (2,500 nodes, 420 TB, over
1 trillion requests per day), …”
Aggregate Features.
HotelPages
Time
...
Requests
Bookings/Clicks
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features.
HotelPages
Time
...
Requests
Bookings/Clicks
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Medium Cardinality: limited
by number of hotels
Aggregate Features.
HotelPages
Time
...
Requests
Bookings/Clicks
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Medium Cardinality: limited
by number of hotels
Larger lag is ok with time
window of 3 months and
large number of events per
page.
Aggregate Features
SELECT
page_id
, COUNT(*) AS res_count
FROM reservations
GROUP BY page_id
Low to Medium Cardinality
• No need to go “fancy”
• Batch processing
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Models
Training and deployment
Aggregate Features for Training.
Time
Requests Data Streams
Click Reservation
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features for Training.
Time
Requests Data Streams
Click Reservation
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features for Training.
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
SELECT
request.request_id
, COUNT DISTINCT event.page_id
FROM request
JOIN event ON
request.user_id = event.user_id
WHERE event.epoch BETWEEN
request.epoch
AND request.epoch + 30*60
GROUP BY request.request_id
Aggregate Features for Training.
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
SELECT
request.request_id
, COUNT DISTINCT event.page_id
FROM request
JOIN event ON
request.user_id = event.user_id
WHERE event.epoch BETWEEN
request.epoch
AND request.epoch + 30*60
GROUP BY request.request_id
Warning: stragglers
ahead! Distribute wisely!
Aggregate Features for Training.
SELECT
request.request_id
, COUNT DISTINCT event.page_id
FROM request
JOIN event ON
request.user_id = event.user_id
WHERE event.epoch BETWEEN
request.epoch
AND request.epoch + 30*60
GROUP BY request.request_id
Scalable Technologies
• Simple SQL
▸ Hive
• Facing stragglers
▸ Spark
Model Training & Iteration
y X
Train Model
Predict
YES
Predict
NO
Actual YES TP FN
Actual NO FP TN
Investigate
Evaluate
• Features
▸ Engineering
▸ Selection
• Hyper parameters
Improve Model
Sorry: H2O?
The benchmark shown on this slide can be found in:
http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/resources/h2odatasheet.html
More benchmarks:
https://github.com/szilard/benchm-ml
Time to Deploy
model = H2ORandomForestEstimator(ntrees=50, max_depth=10)
model.train(x=inputCols, y="label",
training_frame=trainingDf,
validation_frame=validationDf)
model.download_pojo("/my/model/path/")
Time to Deploy.
model = H2ORandomForestEstimator(ntrees=50, max_depth=10)
model.train(x=inputCols, y="label",
training_frame=trainingDf,
validation_frame=validationDf)
model.download_pojo("/my/model/path/")
AWESOME!!!
POJO: Plain Old Java Object. Model export in plain Java code for real- time scoring
on the JVM. Very fast!
Model Interpretation
Score Interpretation.
feature 15 > 9
feature 2 > 0.56 feature 4 > 200
yes no
yes no yes no
feature 11 > 6
yes no
Logistic Regression
Decision Trees
feature 1 feature 2+ 13.9 x + …x = -5.12 x
Decision tree interpretation based on:
http://blog.datadive.net/interpreting-random-forests
Contribution Feature i:
i
xi
p0 Contribution Feature i:
pnode
- pparent node
p2
p1
p3
p4
p6
p7
p5
p8
Example:
p3
= p0
+ (p1
-p0
) + (p3
-p1
)
p0
: bias
(p1
-p0
): feature 15 contr.
(p3
-p1
): feature 2 contr.
Score Interpretation.
feature 15 > 9
feature 2 > 0.56 feature 4 > 200
yes no
yes no yes no
feature 11 > 6
yes no
Logistic Regression
Decision Trees
feature 1 feature 2+ 13.9 x + …x = -5.12 x
Decision tree interpretation based on:
http://blog.datadive.net/interpreting-random-forests
Contribution Feature i:
i
xi
p0 Contribution Feature i:
pnode
- pparent node
p2
p1
p3
p4
p6
p7
p5
p8
We hacked this
interpretation into Spark
ML during a Hackathon
at Booking:) H2O does
not offer it (yet?).
Probability to Book.
The probability of
booking is high; largest
contribution from
#distinct property
pages. Let’s show a
summary!
Probability to Book.
Awesome! This app
really helps me book!
The probability of
booking is high; largest
contribution from
#distinct property
pages. Let’s show a
summary!
Summary
Summary.
• How to make Aggregate Features available?
▸ Long/short time windows
▸ High and Low cardinality dimensions
• Model Training and Deployment
▸ H2O POJO for fast Real-Time scoring
• Interpretation
▸ Contributions per Feature per Score
Thank You.
Questions?
We’re hiring!
Kees Jan de Vries from Booking.com
linkedin.com/in/kees-jan-de-vries-93767240

Contenu connexe

Tendances

Knowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sureKnowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sureSteffen Staab
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Fairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML SystemsFairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML SystemsKrishnaram Kenthapadi
 
Intro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdfIntro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdfKorkrid Akepanidtaworn
 
LLM App Hacking (AVTOKYO2023)
LLM App Hacking (AVTOKYO2023)LLM App Hacking (AVTOKYO2023)
LLM App Hacking (AVTOKYO2023)Shota Shinogi
 
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...Amazon Web Services
 
Regulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with TransparencyRegulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with TransparencyDebmalya Biswas
 
Building and deploying microservices with event sourcing, CQRS and Docker (QC...
Building and deploying microservices with event sourcing, CQRS and Docker (QC...Building and deploying microservices with event sourcing, CQRS and Docker (QC...
Building and deploying microservices with event sourcing, CQRS and Docker (QC...Chris Richardson
 
Migrating Monitoring to Observability – How to Transform DevOps from being Re...
Migrating Monitoring to Observability – How to Transform DevOps from being Re...Migrating Monitoring to Observability – How to Transform DevOps from being Re...
Migrating Monitoring to Observability – How to Transform DevOps from being Re...Liz Masters Lovelace
 
Cloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxCloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxSahithiGurlinka
 
Nitin sharma - Deep Learning Applications to Online Payment Fraud Detection
Nitin sharma - Deep Learning Applications to Online Payment Fraud DetectionNitin sharma - Deep Learning Applications to Online Payment Fraud Detection
Nitin sharma - Deep Learning Applications to Online Payment Fraud DetectionMLconf
 
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusRobust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusManasi Vartak
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleDatabricks
 
Future of AI - 2023 07 25.pptx
Future of AI - 2023 07 25.pptxFuture of AI - 2023 07 25.pptx
Future of AI - 2023 07 25.pptxGreg Makowski
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos EngineeringGremlin
 
AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...
AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...
AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...Skyl.ai
 
The current state of generative AI
The current state of generative AIThe current state of generative AI
The current state of generative AIBenjaminlapid1
 
Responsible Generative AI
Responsible Generative AIResponsible Generative AI
Responsible Generative AICMassociates
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Databricks
 

Tendances (20)

Knowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sureKnowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sure
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Fairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML SystemsFairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML Systems
 
Intro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdfIntro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdf
 
LLM App Hacking (AVTOKYO2023)
LLM App Hacking (AVTOKYO2023)LLM App Hacking (AVTOKYO2023)
LLM App Hacking (AVTOKYO2023)
 
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
 
Regulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with TransparencyRegulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with Transparency
 
Building and deploying microservices with event sourcing, CQRS and Docker (QC...
Building and deploying microservices with event sourcing, CQRS and Docker (QC...Building and deploying microservices with event sourcing, CQRS and Docker (QC...
Building and deploying microservices with event sourcing, CQRS and Docker (QC...
 
Migrating Monitoring to Observability – How to Transform DevOps from being Re...
Migrating Monitoring to Observability – How to Transform DevOps from being Re...Migrating Monitoring to Observability – How to Transform DevOps from being Re...
Migrating Monitoring to Observability – How to Transform DevOps from being Re...
 
Cloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxCloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptx
 
Introduction to AI Governance
Introduction to AI GovernanceIntroduction to AI Governance
Introduction to AI Governance
 
Nitin sharma - Deep Learning Applications to Online Payment Fraud Detection
Nitin sharma - Deep Learning Applications to Online Payment Fraud DetectionNitin sharma - Deep Learning Applications to Online Payment Fraud Detection
Nitin sharma - Deep Learning Applications to Online Payment Fraud Detection
 
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusRobust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
 
Future of AI - 2023 07 25.pptx
Future of AI - 2023 07 25.pptxFuture of AI - 2023 07 25.pptx
Future of AI - 2023 07 25.pptx
 
An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos Engineering
 
AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...
AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...
AI in Insurance: How to Automate Insurance Claims Processing with Machine Lea...
 
The current state of generative AI
The current state of generative AIThe current state of generative AI
The current state of generative AI
 
Responsible Generative AI
Responsible Generative AIResponsible Generative AI
Responsible Generative AI
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
 

En vedette

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Spark Summit
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...Spark Summit
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Spark Summit
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Spark Summit
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
 
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Spark Summit
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Spark Summit
 
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Spark Summit
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Spark Summit
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
 
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Spark Summit
 

En vedette (20)

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
 

Similaire à Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries

Reliably Measuring Responsiveness
Reliably Measuring ResponsivenessReliably Measuring Responsiveness
Reliably Measuring ResponsivenessNicholas Jansma
 
AI techniques for tourism-oriented applications
AI techniques for tourism-oriented applicationsAI techniques for tourism-oriented applications
AI techniques for tourism-oriented applicationsAmine Bendahmane
 
Your hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the futureYour hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the futureSiteMinder
 
Genn.ai introduction for Buzzwords
Genn.ai introduction for BuzzwordsGenn.ai introduction for Buzzwords
Genn.ai introduction for BuzzwordsTakeshi Nakano
 
Big Data and the Next Best Offer
Big Data and the Next Best OfferBig Data and the Next Best Offer
Big Data and the Next Best OfferMichel Bruley
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews, Inc.
 
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...Srinath Perera
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream Michel Bruley
 
From model to experiment Tel Aviv
From model to experiment Tel AvivFrom model to experiment Tel Aviv
From model to experiment Tel AvivElena Sokolova, PhD
 
SiteMinder. Attract, reach and convert guests globally.
SiteMinder.  Attract, reach and convert guests globally.SiteMinder.  Attract, reach and convert guests globally.
SiteMinder. Attract, reach and convert guests globally.Mark Abramowitz
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEuropean Data Forum
 
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic FoxMicroservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic FoxOpenCredo
 
Feature surfacing - meetup
Feature surfacing  - meetupFeature surfacing  - meetup
Feature surfacing - meetupPredicSis
 
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...Amazon Web Services
 

Similaire à Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries (15)

Reliably Measuring Responsiveness
Reliably Measuring ResponsivenessReliably Measuring Responsiveness
Reliably Measuring Responsiveness
 
AI techniques for tourism-oriented applications
AI techniques for tourism-oriented applicationsAI techniques for tourism-oriented applications
AI techniques for tourism-oriented applications
 
Your hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the futureYour hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the future
 
Genn.ai introduction for Buzzwords
Genn.ai introduction for BuzzwordsGenn.ai introduction for Buzzwords
Genn.ai introduction for Buzzwords
 
Big Data and the Next Best Offer
Big Data and the Next Best OfferBig Data and the Next Best Offer
Big Data and the Next Best Offer
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
 
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream
 
From model to experiment Tel Aviv
From model to experiment Tel AvivFrom model to experiment Tel Aviv
From model to experiment Tel Aviv
 
SiteMinder. Attract, reach and convert guests globally.
SiteMinder.  Attract, reach and convert guests globally.SiteMinder.  Attract, reach and convert guests globally.
SiteMinder. Attract, reach and convert guests globally.
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic FoxMicroservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
 
Feature surfacing - meetup
Feature surfacing  - meetupFeature surfacing  - meetup
Feature surfacing - meetup
 
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
 
Big Data Tutorial V4
Big Data Tutorial V4Big Data Tutorial V4
Big Data Tutorial V4
 

Plus de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Dernier

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfSayantanBiswas37
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...gragchanchal546
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 

Dernier (20)

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 

Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries

  • 1. Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software Kees Jan de Vries Booking.com
  • 2. Who am I? About me • Physics PhD Imperial College • Data Scientist at Booking.com for 1.5 year ▸ Security Department • linkedin.com/in/kees-jan-de-vries-93767240 About Booking.com • World leader in connecting travellers with the widest variety of great places to stay • Part of The Priceline Group, the world’s 3rd largest e-commerce company (by market capitalisation) • Employing 14,000 people in 180 countries • Each day, over 1,200,000 room nights are reserved on Booking.com
  • 3. Contents. • Motivation • Running Example ▸ Probability to Book • Real Time Prediction Engine: Lessons Learnt ▸ Aggregate Features ▸ Models Training and Deployment ▸ Interpretation of Individual Scores
  • 6. Motivation. booking.com Booking.com • Empower people to experience the world Security • Provide secure reservation system (facilitating hundreds of thousands of reservations on a daily basis) Fraud Prevention • Protect customers and property partners, ensuring the best possible experience
  • 7. Motivation for this Talk. Train awesome Model Serve in Real Time at Low Latency using aggregate features
  • 9. Disclaimer: although the running example presented in these slides may seem realistic, it is only intended to highlight some lessons learnt about building a real time prediction engine.
  • 10. Probability to Book. Let’s book a hotel for the Spark Summit East 2017 in Boston
  • 11. Probability to Book. OK. Let’s click on the first hit.
  • 12. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  • 13. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice. Let’s help the customer make the best choice for them!
  • 14. Probability to Book. We’ll calculate the probability of booking when the Customer clicks on an accommodation Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  • 15. Behind the Scenes. User id Page id ... Action PredictionEngine Click!
  • 16. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Click!
  • 17. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Click!
  • 19. Behind the Scenes. Features Model Action User id ... Click! Aggregate features “ ” Action
  • 21. Aggregate Features. Time Requests Data Streams Click Booking
  • 22. Aggregate Features. Time Requests Data Streams Click Booking Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 23. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests)
  • 24. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant
  • 25. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant Small time window, so limited amount of data
  • 26. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM cliks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... In memory Complex Event Processing • No lag: instant aggregation • Scalability: see Esper website • Not persistent
  • 27. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM cliks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...From http://espertech.com/esper/faq_esper.php#streamprocessing “Complex Event Processing and Esper are standing queries and latency to the answer is usually below 10us with more than 99% predictability.” “The first component of scaling is the throughput that can be achieved running single-threaded. For Esper we think this number is very high and likely between 10k to 200k events per second.” In memory Complex Event Processing
  • 28. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 29. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Very High Cardinality: features for every user who made at least one booking
  • 30. Aggregate Features High Cardinality Features with Cassandra • Very scalable: reads & writes • None-instant aggregations ▸ Consistency fundamentally bound by gossip protocol • Persistent ( ) Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 31. Aggregate Features High Cardinality Features with Cassandra ( ) Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... From http://cassandra.apache.org/ ● Proven ● Fault Tolerant ● Performant ● Scalable ● Elastic ● ... “Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), …”
  • 32. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 33. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Medium Cardinality: limited by number of hotels
  • 34. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Medium Cardinality: limited by number of hotels Larger lag is ok with time window of 3 months and large number of events per page.
  • 35. Aggregate Features SELECT page_id , COUNT(*) AS res_count FROM reservations GROUP BY page_id Low to Medium Cardinality • No need to go “fancy” • Batch processing Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 37. Aggregate Features for Training. Time Requests Data Streams Click Reservation Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 38. Aggregate Features for Training. Time Requests Data Streams Click Reservation Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 39. Aggregate Features for Training. Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id
  • 40. Aggregate Features for Training. Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id Warning: stragglers ahead! Distribute wisely!
  • 41. Aggregate Features for Training. SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id Scalable Technologies • Simple SQL ▸ Hive • Facing stragglers ▸ Spark
  • 42. Model Training & Iteration y X Train Model Predict YES Predict NO Actual YES TP FN Actual NO FP TN Investigate Evaluate • Features ▸ Engineering ▸ Selection • Hyper parameters Improve Model
  • 43. Sorry: H2O? The benchmark shown on this slide can be found in: http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/resources/h2odatasheet.html More benchmarks: https://github.com/szilard/benchm-ml
  • 44. Time to Deploy model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/")
  • 45. Time to Deploy. model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/") AWESOME!!! POJO: Plain Old Java Object. Model export in plain Java code for real- time scoring on the JVM. Very fast!
  • 47. Score Interpretation. feature 15 > 9 feature 2 > 0.56 feature 4 > 200 yes no yes no yes no feature 11 > 6 yes no Logistic Regression Decision Trees feature 1 feature 2+ 13.9 x + …x = -5.12 x Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests Contribution Feature i: i xi p0 Contribution Feature i: pnode - pparent node p2 p1 p3 p4 p6 p7 p5 p8 Example: p3 = p0 + (p1 -p0 ) + (p3 -p1 ) p0 : bias (p1 -p0 ): feature 15 contr. (p3 -p1 ): feature 2 contr.
  • 48. Score Interpretation. feature 15 > 9 feature 2 > 0.56 feature 4 > 200 yes no yes no yes no feature 11 > 6 yes no Logistic Regression Decision Trees feature 1 feature 2+ 13.9 x + …x = -5.12 x Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests Contribution Feature i: i xi p0 Contribution Feature i: pnode - pparent node p2 p1 p3 p4 p6 p7 p5 p8 We hacked this interpretation into Spark ML during a Hackathon at Booking:) H2O does not offer it (yet?).
  • 49. Probability to Book. The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  • 50. Probability to Book. Awesome! This app really helps me book! The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  • 52. Summary. • How to make Aggregate Features available? ▸ Long/short time windows ▸ High and Low cardinality dimensions • Model Training and Deployment ▸ H2O POJO for fast Real-Time scoring • Interpretation ▸ Contributions per Feature per Score
  • 53. Thank You. Questions? We’re hiring! Kees Jan de Vries from Booking.com linkedin.com/in/kees-jan-de-vries-93767240