Distributed Online Learning Techniques
Kanak Biscuitwala
kanak@siftscience.com
Outline
- Background
- Machine Learning at Sift Science
- Online Learning Infrastructure
- Experience
About Sift Science
- Fraud detection using supervised machine learning
- Real-time
- Billions of purchases scored
- Hundreds of millions of users
background | ml at sift | infrastructure | experience
Start with your data…
- You have examples of GOOD and BAD users.
- You have a set of signals that you think are predictive of fraud.
Train: Build a model from existing data
- Train a statistical model with examples of GOOD and BAD users.
- Model will learn signal values common to each user type.
Predict: Find patterns in new data
- Apply the model to current active customers.
- Predict which are fraud, and which aren’t.
Act: Turn insights into action
- Intelligently segment your customers with a probability of risk
Machine Learning at Sift
Data at Sift
- Customers stream events to us
  - Page Views (JavaScript)
  - Purchases (API)
  - Labels (API or Console)
- Time series view of the user
Time Series of Events and Features
- Events: Signup, Add CC, Add item(s) to Cart, Purchase 1, Change CC, Change Billing, Purchase 2
- [Diagram labels: Scan, Features, Add item(s) to Cart]
Time Series of Events → Data Transformation → > 1K features
- { Device ID features }
- { Number of emails }
- { NLP features }
- { Address features }
- { Custom fields }
- …
Sparse Feature: Email
  Val a@ (num_fraud=1)
  Val b@ (num_fraud=3)
  Val c@ (num_fraud=3)
  Val d@ (num_fraud=1)
  …
“densification” (these mappings constantly change)
Dense Feature: Email
  Val 1
  Val 3
  …
Dense features → classify → Prediction
Label + Dense features → learn → Updated Classifier
(feature importance constantly changes)
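The classify/learn loop on this slide can be sketched as a model that is updated one labeled example at a time. This is only an illustration under stated assumptions: the talk does not say which model Sift uses, so logistic regression with a stochastic gradient step is a stand-in, and the names `OnlineClassifier`, `classify`, `learn`, and the learning rate are hypothetical.

```python
import math


class OnlineClassifier:
    """Hypothetical online logistic regression over dense features."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def classify(self, features):
        """Return the predicted probability that the example is fraud."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, features, label):
        """Take one gradient step on a labeled example (label is 0 or 1)."""
        error = self.classify(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


model = OnlineClassifier(num_features=2)
before = model.classify([3.0, 1.0])
for _ in range(20):
    model.learn([3.0, 1.0], 1)  # labeled BAD
    model.learn([0.0, 1.0], 0)  # labeled GOOD
after = model.classify([3.0, 1.0])
print(before, after)  # the score for the BAD pattern rises as labels arrive
```

Each label shifts the weights immediately, which is one way to read "feature importance constantly changes": no retraining pass is needed before the next prediction sees the update.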
Adapting to Change
- Regular batch training vs online learning
  - Sift does both
- Batch and online code paths match where possible
Adapting to Scale
- Distributed system to handle requests and data size
- Updates made in one place need to be visible everywhere
- Performance still matters
Option: Checkpoints + Pub-Sub
- [Diagram labels: label, queue, Classifier (×4)]
- solves: propagation, performance
- does not solve: data scale, write amplification, complexity
Option: Distributed DB (HBase)
- [Diagram labels: label, Classifier]
- solves: propagation, complexity, data scale, single source of truth
- does not solve: performance
Why HBase? (see our talk at HBaseCon!)
- Scan and HFile access for batch operations, row operations online
- Higher-level atomic operations and batching
- Block caching (and other forms of caching)
- Snapshots
- Driving console and front end
Online Learning Infrastructure
Online Learning
- Time Series → Features → Score (Update)
- Updates to sparse feature state
- Update model parameters
Sparse Feature Densification
- Sparse fields: device ids, cookies, custom fields, etc.
- Mapping to dense space based on set cardinality
- Two-table implementation (“ItemSetCounter”)
  - Slower set table (up to 8K items per set; > 100M sets)
  - Faster counts table (batching, coalescing)
- Global and customer states
- Real-time introduction of features and feature values
ItemSetCounter
Counts table:
  set1: 1
  set2: 3
  set3: 1
  set4: 2
  …
Set table:
  set1: { a }
  set2: { b, c, d }
  set3: { e }
  set4: { f, g }
  …
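The two tables above can be sketched with in-memory dictionaries: one holds set members, the other holds only cardinalities, and the cardinality is what a sparse key is mapped to in dense space. This is a minimal sketch, not the HBase implementation; the method names `add` and `count` are hypothetical, and treating "up to 8K items per set" as a hard cap of 8000 is an assumption.

```python
class ItemSetCounter:
    """In-memory stand-in for the two-table design (not the HBase implementation)."""

    MAX_ITEMS_PER_SET = 8000  # assumption: "up to 8K items per set" read as a hard cap

    def __init__(self):
        self.sets = {}    # "slower" set table: set key -> members
        self.counts = {}  # "faster" counts table: set key -> cardinality

    def add(self, set_key, item):
        """Add an item to a set; return True if the cardinality changed."""
        members = self.sets.setdefault(set_key, set())
        if item in members or len(members) >= self.MAX_ITEMS_PER_SET:
            return False
        members.add(item)
        self.counts[set_key] = len(members)
        return True

    def count(self, set_key):
        """Dense value for a sparse key: read only the counts table."""
        return self.counts.get(set_key, 0)


isc = ItemSetCounter()
for user in ["b", "c", "d", "c"]:  # "c" is seen twice but counted once
    isc.add("set2", user)
isc.add("set1", "a")
print(isc.count("set2"), isc.count("set1"), isc.count("unseen"))  # 3 1 0
```

Keeping the counts separate means scoring reads a small number per key, while the full member sets stay available for display and analysis.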
Model Parameters
- Feature weights updated in increments
- Counting for learning and display
  - Number of unique features and feature values (set union)
  - Count labels on various dimensions (increment)
- Thousands of accesses per classification
Model Parameters (Implementation)
- Three-table design
  - ItemSetCounter (set table and counts table) for set membership and cardinality
  - NumericParameterTable for incrementing numeric values
- Enables:
  - Fast batch access of numeric parameters and set sizes
  - Availability of items in set for display and analysis
  - Real-time introduction of features and feature values
NumericParameterTable
  param1: 20.0
  param2: 1.2345
  param3: 24.356
  param4: 0.0001
  …
Performance: Batching and Coalescing
- Code written and rewritten to read data in batches
- Updates are coalesced in memory for up to 1 second
- “Approximately consistent”
  - Throughput/latency vs consistency tradeoff
  - Higher noise tolerance in ML feature space
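Coalescing as described on this slide can be sketched as a small buffer that sums increments per key and writes them out as one batch once the oldest buffered update is a second old. The class `CoalescingIncrementer`, the `flush_batch` callback, and the dictionary standing in for NumericParameterTable are all assumptions for illustration; a production version would presumably also flush on a timer rather than only when a new update arrives.

```python
import time
from collections import defaultdict


class CoalescingIncrementer:
    """Buffers increments in memory and flushes them as one batch."""

    def __init__(self, flush_batch, max_delay_seconds=1.0, clock=time.monotonic):
        self.flush_batch = flush_batch  # callable taking {key: delta}
        self.max_delay_seconds = max_delay_seconds
        self.clock = clock
        self.pending = defaultdict(float)
        self.oldest = None  # time of the oldest unflushed update

    def increment(self, key, delta):
        if self.oldest is None:
            self.oldest = self.clock()
        self.pending[key] += delta  # coalesce repeated keys into one delta
        if self.clock() - self.oldest >= self.max_delay_seconds:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_batch(dict(self.pending))
        self.pending.clear()
        self.oldest = None


# Usage with a fake clock and a dict standing in for NumericParameterTable.
now = [0.0]
table = defaultdict(float)
batches = []


def write_batch(batch):
    batches.append(batch)
    for key, delta in batch.items():
        table[key] += delta


buf = CoalescingIncrementer(write_batch, clock=lambda: now[0])
buf.increment("param1", 0.5)
buf.increment("param1", 0.25)
buf.increment("param2", 1.0)
now[0] = 1.5                   # more than one second later
buf.increment("param1", 0.25)  # triggers a flush of everything buffered
print(batches)  # [{'param1': 1.0, 'param2': 1.0}]
```

Four updates become one write with two rows. Until the flush, other readers do not see the buffered deltas, which is the "approximately consistent" tradeoff the slide names.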
Performance: Caching
- Multi-level caching scheme
  - L1 (optional): Local cache with TTL of 1 minute
  - L2: Memcached with batching and distributed invalidation support, 1 day TTL
- Longer TTLs for non-updatable (for now) parameters
- [Diagram labels: Local, Memcached, HBase]
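The read path through the two cache levels can be sketched as a read-through lookup: try the local cache, then the shared cache, then the backing store, filling the caches on the way back. This is a simplified sketch under assumptions: `TtlCache` and `read_parameter` are hypothetical names, an in-process TTL cache stands in for Memcached, a dictionary stands in for HBase, and batching and distributed invalidation are not modeled.

```python
import time


class TtlCache:
    """Tiny TTL cache; used here for both levels (L2 stands in for Memcached)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or entry[1] <= self.clock():
            self.entries.pop(key, None)
            return None
        return entry[0]

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl_seconds)


def read_parameter(key, l1, l2, store):
    """Read through L1, then L2, then the store, filling caches on the way back."""
    value = l1.get(key)
    if value is not None:
        return value, "L1"
    value = l2.get(key)
    source = "L2"
    if value is None:
        value = store[key]  # assumption: a dict stands in for an HBase row read
        l2.put(key, value)
        source = "store"
    l1.put(key, value)
    return value, source


now = [0.0]
clock = lambda: now[0]
l1 = TtlCache(ttl_seconds=60, clock=clock)            # 1 minute, as on the slide
l2 = TtlCache(ttl_seconds=24 * 60 * 60, clock=clock)  # 1 day, as on the slide
store = {"param2": 1.2345}

first = read_parameter("param2", l1, l2, store)   # miss in both caches
second = read_parameter("param2", l1, l2, store)  # L1 hit
now[0] = 61.0                                     # the L1 entry has expired
third = read_parameter("param2", l1, l2, store)   # L2 hit
print(first[1], second[1], third[1])  # store L1 L2
```

The short L1 TTL bounds how stale a local copy can get without any invalidation traffic, while the longer-lived shared level absorbs most of the reads that would otherwise reach the store.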
In all, we manage about 200 million sets and numeric parameters.
Experience
Densification (L2-only)
- 95% L2 cache hit rate
- The remaining 5%:
  - 50-100 batches/sec
  - 50-200 rows/batch
  - 75th percentile: 5 ms
  - 99th percentile: 100 ms
Model Parameters (L1 and L2)
- 90-95% L1 hit rate
- 99+% L2 hit rate
- When we miss (NPT = NumericParameterTable, ISC = ItemSetCounter):
  - 75th percentile: 2 ms (NPT), 20 ms (ISC counts)
  - 99th percentile: 30 ms (NPT), 300 ms (ISC counts)
  - Hundreds of batches per second
  - 10-3000 rows per batch
Application: Network View
Application: Cascading Updates
[Diagram: users a, b, c, d linked by email address, device fingerprint, and shipping address]
An event for user a updates b, c, and d
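One way to picture a cascading update is an index from shared attribute values to the users who have used them: when an event arrives for one user, the users who share a value are the ones to update. The sketch below assumes that reading of the diagram; `NetworkIndex`, `observe`, `neighbors`, and the sample values are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


class NetworkIndex:
    """Maps each (attribute, value) pair to the set of users that have used it."""

    def __init__(self):
        self.users_by_attribute = defaultdict(set)
        self.attributes_by_user = defaultdict(set)

    def observe(self, user, attribute, value):
        self.users_by_attribute[(attribute, value)].add(user)
        self.attributes_by_user[user].add((attribute, value))

    def neighbors(self, user):
        """Users that share at least one attribute value with `user`."""
        linked = set()
        for pair in self.attributes_by_user[user]:
            linked |= self.users_by_attribute[pair]
        linked.discard(user)
        return linked


index = NetworkIndex()
index.observe("a", "email address", "x@example.com")
index.observe("b", "email address", "x@example.com")
index.observe("a", "device fingerprint", "device-1")
index.observe("c", "device fingerprint", "device-1")
index.observe("a", "shipping address", "1 Main St")
index.observe("d", "shipping address", "1 Main St")
index.observe("e", "email address", "y@example.com")

to_update = sorted(index.neighbors("a"))
print(to_update)  # ['b', 'c', 'd']
```

The value-to-users mapping has the same shape as a set table keyed by attribute value, which may be why the summary says the table design powers additional use cases.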
Summary
- Online learning helps to keep pace with fraudsters
- Decomposed online updatable data into incremental numeric values and sets
- Leveraged HBase and distributed cache for consistency and performance
- Traded consistency for performance with coalescing and L1 cache
- Table design powers multiple additional use cases
Questions?
kanak@siftscience.com
(we’re hiring!)

Contenu connexe

Tendances

Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitDatabricks
 
Accelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaAccelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaDatabricks
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman FarahatSpark Summit
 
Overkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupOverkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupClaudiu Barbura
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleDatabricks
 
AWS Machine Learning & Google Cloud Machine Learning
AWS Machine Learning & Google Cloud Machine LearningAWS Machine Learning & Google Cloud Machine Learning
AWS Machine Learning & Google Cloud Machine LearningSC5.io
 
Case studies session 2
Case studies   session 2Case studies   session 2
Case studies session 2HBaseCon
 
Scaling Production Machine Learning Pipelines with Databricks
Scaling Production Machine Learning Pipelines with DatabricksScaling Production Machine Learning Pipelines with Databricks
Scaling Production Machine Learning Pipelines with DatabricksDatabricks
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Spark Summit
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
 
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Databricks
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
AWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner Vogels
AWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner VogelsAWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner Vogels
AWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner VogelsAmazon Web Services
 
Rest microservice ml_deployment_ntalagala_ai_conf_2019
Rest microservice ml_deployment_ntalagala_ai_conf_2019Rest microservice ml_deployment_ntalagala_ai_conf_2019
Rest microservice ml_deployment_ntalagala_ai_conf_2019Nisha Talagala
 

Tendances (20)

Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
Accelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaAccelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei Zaharia
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
 
Overkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupOverkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark Meetup
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
AWS Machine Learning & Google Cloud Machine Learning
AWS Machine Learning & Google Cloud Machine LearningAWS Machine Learning & Google Cloud Machine Learning
AWS Machine Learning & Google Cloud Machine Learning
 
Case studies session 2
Case studies   session 2Case studies   session 2
Case studies session 2
 
Scaling Production Machine Learning Pipelines with Databricks
Scaling Production Machine Learning Pipelines with DatabricksScaling Production Machine Learning Pipelines with Databricks
Scaling Production Machine Learning Pipelines with Databricks
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
AWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner Vogels
AWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner VogelsAWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner Vogels
AWS Enterprise Day | Closing Keynote - Data Without Limits, Dr Werner Vogels
 
Rest microservice ml_deployment_ntalagala_ai_conf_2019
Rest microservice ml_deployment_ntalagala_ai_conf_2019Rest microservice ml_deployment_ntalagala_ai_conf_2019
Rest microservice ml_deployment_ntalagala_ai_conf_2019
 

En vedette

Machine Learning Experimentation at Sift Science
Machine Learning Experimentation at Sift ScienceMachine Learning Experimentation at Sift Science
Machine Learning Experimentation at Sift ScienceSift Science
 
Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...
Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...
Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...Codemotion
 
Braintree and our new v.zero SDK for iOS
Braintree and our new v.zero SDK for iOSBraintree and our new v.zero SDK for iOS
Braintree and our new v.zero SDK for iOSAlberto López Martín
 
The Evolution of Hadoop at Stripe
The Evolution of Hadoop at StripeThe Evolution of Hadoop at Stripe
The Evolution of Hadoop at StripeColin Marc
 
Django Zebra Lightning Talk
Django Zebra Lightning TalkDjango Zebra Lightning Talk
Django Zebra Lightning TalkLee Trout
 
Paymill vs Stripe
Paymill vs StripePaymill vs Stripe
Paymill vs Stripebetabeers
 
Omise fintech研究会
Omise fintech研究会Omise fintech研究会
Omise fintech研究会Jun Hasegawa
 
Pay and Get Paid: How To Integrate Stripe Into Your App
Pay and Get Paid: How To Integrate Stripe Into Your AppPay and Get Paid: How To Integrate Stripe Into Your App
Pay and Get Paid: How To Integrate Stripe Into Your AppFlatiron School
 
[daddly] Stripe勉強会 運用編 2016/11/30
[daddly] Stripe勉強会 運用編 2016/11/30[daddly] Stripe勉強会 運用編 2016/11/30
[daddly] Stripe勉強会 運用編 2016/11/30Naoshi ONO
 
Entrepreneur + Developer Gangbang: Co-working
Entrepreneur + Developer Gangbang: Co-workingEntrepreneur + Developer Gangbang: Co-working
Entrepreneur + Developer Gangbang: Co-workingkamal.fariz
 
Payments using Stripe.com
Payments using Stripe.comPayments using Stripe.com
Payments using Stripe.comBilly Cravens
 
Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...
Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...
Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...GreenhouseSoftware
 
Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...
Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...
Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...Alberto López Martín
 
Payments integration: Stripe & Taxamo
Payments integration: Stripe & TaxamoPayments integration: Stripe & Taxamo
Payments integration: Stripe & TaxamoNetguru
 
Payments Made Easy with Stripe
Payments Made Easy with StripePayments Made Easy with Stripe
Payments Made Easy with StripeShawn Hooper
 

En vedette (16)

Machine Learning Experimentation at Sift Science
Machine Learning Experimentation at Sift ScienceMachine Learning Experimentation at Sift Science
Machine Learning Experimentation at Sift Science
 
Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...
Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...
Braintree v.zero: a modern foundation for accepting payments - Alberto Lopez ...
 
Braintree and our new v.zero SDK for iOS
Braintree and our new v.zero SDK for iOSBraintree and our new v.zero SDK for iOS
Braintree and our new v.zero SDK for iOS
 
The Evolution of Hadoop at Stripe
The Evolution of Hadoop at StripeThe Evolution of Hadoop at Stripe
The Evolution of Hadoop at Stripe
 
Django Zebra Lightning Talk
Django Zebra Lightning TalkDjango Zebra Lightning Talk
Django Zebra Lightning Talk
 
Paymill vs Stripe
Paymill vs StripePaymill vs Stripe
Paymill vs Stripe
 
Omise fintech研究会
Omise fintech研究会Omise fintech研究会
Omise fintech研究会
 
Pay and Get Paid: How To Integrate Stripe Into Your App
Pay and Get Paid: How To Integrate Stripe Into Your AppPay and Get Paid: How To Integrate Stripe Into Your App
Pay and Get Paid: How To Integrate Stripe Into Your App
 
[daddly] Stripe勉強会 運用編 2016/11/30
[daddly] Stripe勉強会 運用編 2016/11/30[daddly] Stripe勉強会 運用編 2016/11/30
[daddly] Stripe勉強会 運用編 2016/11/30
 
Entrepreneur + Developer Gangbang: Co-working
Entrepreneur + Developer Gangbang: Co-workingEntrepreneur + Developer Gangbang: Co-working
Entrepreneur + Developer Gangbang: Co-working
 
Payments using Stripe.com
Payments using Stripe.comPayments using Stripe.com
Payments using Stripe.com
 
Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...
Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...
Hiring Hacks: How Stripe Creatively Finds Candidates and Builds a Recruiting ...
 
Bitcoin ,
Bitcoin ,Bitcoin ,
Bitcoin ,
 
Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...
Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...
Braintree SDK v.zero or "A payment gateway walks into a bar..." - Devfest Nan...
 
Payments integration: Stripe & Taxamo
Payments integration: Stripe & TaxamoPayments integration: Stripe & Taxamo
Payments integration: Stripe & Taxamo
 
Payments Made Easy with Stripe
Payments Made Easy with StripePayments Made Easy with Stripe
Payments Made Easy with Stripe
 

Similaire à Online learning talk

Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...HostedbyConfluent
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent MonitoringIntelie
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
KFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreKFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreDatabricks
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureRobert Metzger
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesSigmoid
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basFlorent Ramiere
 
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017Amazon Web Services
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of dataconfluent
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2Bill Liu
 
Data Streaming in Kafka
Data Streaming in KafkaData Streaming in Kafka
Data Streaming in KafkaSilviuMarcu1
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...StreamNative
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxData
 

Similaire à Online learning talk (20)

Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
KFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreKFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature Store
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en bas
 
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017

Online learning talk

  • 1. Distributed Online Learning Techniques Kanak Biscuitwala kanak@siftscience.com
  • 2. Outline: Background; Machine Learning at Sift Science; Online Learning Infrastructure; Experience
  • 3. About Sift Science: Fraud detection using supervised machine learning. Real-time. Billions of purchases scored. Hundreds of millions of users. background | ml at sift | infrastructure | experience
  • 5. Start with your data… You have examples of GOOD and BAD users. You have a set of signals that you think are predictive of fraud.
  • 6. Train: Build a model from existing data. Train a statistical model with examples of GOOD and BAD users. Model will learn signal values common to each user type.
  • 7. Predict: Find patterns in new data. Apply the model to current active customers. Predict which are fraud, and which aren’t.
  • 8. Act: Turn insights into action. Intelligently segment your customers with a probability of risk.
  • 10. Data at Sift: Customers stream events to us: Page Views (JavaScript), Purchases (API), Labels (API or Console). Time series view of the user.
  • 11. Time Series of Events: Signup, Add CC, Add item(s) to Cart, Purchase 1, Change CC, Change Billing, Purchase 2. A scan over the events produces Features.
  • 12. Data Transformation: Time Series of Events are transformed into > 1K features: { Device ID features }, { Number of emails }, { NLP features }, { Address features }, { Custom fields }, …
  • 13-15. Sparse Feature: Email, with values a@ (num_fraud=1), b@ (num_fraud=3), c@ (num_fraud=3), d@ (num_fraud=1), … “Densification” maps these to Dense Feature: Email, with values 1, 3, … These mappings constantly change.
  • 16-18. Dense features are classified to produce a Prediction. A Label plus Dense features are used to learn an Updated Classifier. Feature importance constantly changes.
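
The classify/learn loop on this slide can be sketched as follows. The slides do not say which model Sift uses, so this is only an illustration that assumes a logistic regression updated by stochastic gradient descent; the class name `OnlineLogisticClassifier` and the `learning_rate` parameter are hypothetical.

```python
import math


class OnlineLogisticClassifier:
    """Hypothetical online classifier: logistic regression trained one label at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def classify(self, dense_features):
        # Probability that the example is fraud, given the current weights.
        z = self.bias + sum(w * x for w, x in zip(self.weights, dense_features))
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, dense_features, label):
        # One SGD step on the log loss; label is 1 for BAD, 0 for GOOD.
        error = self.classify(dense_features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, dense_features)
        ]
        self.bias -= self.learning_rate * error
```

Each incoming label nudges the weights immediately, so the next call to `classify` already reflects it. That is the property the slide contrasts with waiting for the next batch training run.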
  • 19. Adapting to Change: Regular batch training vs online learning - Sift does both. Batch and online code paths match where possible.
  • 20. Adapting to Scale: Distributed system to handle requests and data size. Updates made in one place need to be visible everywhere. Performance still matters.
  • 21-22. Option: Checkpoints + Pub-Sub. (Diagram: a label, a Classifier, a queue, and three more Classifiers.) Solves: propagation, performance. Does not solve: data scale, write amplification, complexity.
  • 23-24. Option: Distributed DB (HBase). (Diagram: a label and a Classifier.) Solves: propagation, complexity, data scale, single source of truth. Does not solve: performance.
  • 25-26. Why HBase? Scan and HFile access for batch operations, row operations online. Higher-level atomic operations and batching. Block caching (and other forms of caching). Snapshots. Driving console and front end. See our talk at HBaseCon!
  • 28. Online Learning: Time Series, Features, Score (Update). Updates to sparse feature state. Update model parameters.
  • 29. Sparse Feature Densification: Sparse fields - device ids, cookies, custom fields, etc. Mapping to dense space based on set cardinality. Two-table implementation (“ItemSetCounter”): a slower set table (up to 8K items per set; > 100M sets) and a faster counts table (batching, coalescing). Global and customer states. Real-time introduction of features and feature values.
  • 30. ItemSetCounter: counts table: set1: 1, set2: 3, set3: 1, set4: 2, … Set table: set1: { a }, set2: { b, c, d }, set3: { e }, set4: { f, g }, …
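
A minimal in-memory sketch of the two-table idea, using the set names from the slide. Plain dicts stand in for the HBase tables, and the method names (`add`, `count`, `items`) and the exact cap of 8000 are assumptions made for illustration; the slides only say "up to 8K items per set".

```python
class ItemSetCounter:
    """Hypothetical sketch: one table holds set members, another holds cardinalities."""

    MAX_ITEMS_PER_SET = 8000  # assumed value for the "up to 8K items per set" limit

    def __init__(self):
        self._sets = {}    # stand-in for the slower set table
        self._counts = {}  # stand-in for the faster counts table

    def add(self, set_key, item):
        # Set union: adding an existing item leaves the cardinality unchanged.
        items = self._sets.setdefault(set_key, set())
        if item not in items and len(items) < self.MAX_ITEMS_PER_SET:
            items.add(item)
            self._counts[set_key] = len(items)
        return self._counts.get(set_key, 0)

    def count(self, set_key):
        # Dense value for a sparse key: read only the counts table.
        return self._counts.get(set_key, 0)

    def items(self, set_key):
        # Members are kept so they can be shown for display and analysis.
        return frozenset(self._sets.get(set_key, ()))
```

With this layout, scoring only needs `count("set2")`, which returns 3 once b, c, and d have been added, while the console can still list the members through `items`.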
  • 31. Model Parameters: Feature weights updated in increments. Counting for learning and display: number of unique features and feature values (set union); count labels on various dimensions (increment). Thousands of accesses per classification.
  • 32. Model Parameters (Implementation): Three-table design: ItemSetCounter (its set and counts tables) for set membership and cardinality; NumericParameterTable for incrementing numeric values. Enables: fast batch access of numeric parameters and set sizes; availability of items in set for display and analysis; real-time introduction of features and feature values.
  • 33. NumericParameterTable: param1: 20.0, param2: 1.2345, param3: 24.356, param4: 0.0001, …
  • 34. Performance: Batching and Coalescing. Code written and rewritten to read data in batches. Updates are coalesced in memory for up to 1 second. “Approximately consistent”: throughput/latency vs consistency tradeoff; higher noise tolerance in ML feature space.
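
Coalescing can be sketched as buffering increments per key and writing one summed increment per key when the buffer is flushed. The sketch below assumes a dict in place of the parameter table and an injectable clock for testing; `CoalescingIncrementer` and its members are hypothetical names, and the real system's flush trigger may differ.

```python
import time
from collections import defaultdict


class CoalescingIncrementer:
    """Hypothetical sketch: sum increments in memory, flush at most once per interval."""

    def __init__(self, store, flush_interval=1.0, clock=time.monotonic):
        self._store = store                # dict standing in for the parameter table
        self._pending = defaultdict(float)
        self._flush_interval = flush_interval
        self._clock = clock
        self._last_flush = clock()
        self.writes = 0                    # number of writes issued to the store

    def increment(self, key, delta):
        self._pending[key] += delta
        # Flush lazily: only checked when a new update arrives.
        if self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self):
        # One write per key, however many increments were buffered.
        for key, delta in self._pending.items():
            self._store[key] = self._store.get(key, 0.0) + delta
            self.writes += 1
        self._pending.clear()
        self._last_flush = self._clock()
```

Readers of the store can lag behind by up to one interval, which is the "approximately consistent" tradeoff the slide describes.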
  • 35. Performance: Caching (HBase, Memcached, Local). Multi-level caching scheme: L1 (optional): local cache with TTL of 1 minute; L2: Memcached with batching and distributed invalidation support, 1 day TTL. Longer TTLs for non-updatable (for now) parameters.
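
The read path through the two cache levels can be sketched as a read-through lookup. In-memory caches stand in for both the local cache and Memcached, a dict stands in for HBase, and `TtlCache` and `read_through` are hypothetical names; batching and distributed invalidation are left out.

```python
import time


class TtlCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)


def read_through(key, l1, l2, backing_store):
    """Return (value, source). A cached value of None is treated as a miss."""
    value = l1.get(key)
    if value is not None:
        return value, "L1"
    value = l2.get(key)
    if value is not None:
        l1.put(key, value)
        return value, "L2"
    value = backing_store[key]  # stand-in for an HBase row read
    l2.put(key, value)
    l1.put(key, value)
    return value, "store"
```

A caller would build `l1 = TtlCache(60)` and `l2 = TtlCache(86400)` to mirror the 1 minute and 1 day TTLs on the slide. The short L1 TTL bounds how stale a local value can get without any invalidation traffic.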
  • 36. In all, we manage about 200 million sets and numeric parameters.
  • 38. Densification (L2-only): 95% L2 cache hit rate. The remaining 5%: 50-100 batches/sec; 50-200 rows/batch; 75th percentile: 5 ms; 99th percentile: 100 ms.
  • 39. Model Parameters (L1 and L2): 90-95% L1 hit rate. 99+% L2 hit rate. When we miss: 75th percentile: 2 ms (NPT), 20 ms (ISC counts); 99th percentile: 30 ms (NPT), 300 ms (ISC counts). Hundreds of batches per second. 10-3000 rows per batch.
  • 40. Application: Network View
  • 41. Application: Cascading Updates. (Diagram: users a, b, c, d linked by email address, device fingerprint, and shipping address.) A user a event updates b, c, d.
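
One way to read the cascading-updates diagram is that users who share an attribute value with user a are the ones to update when a has an event. The sketch below assumes that reading; the attribute kinds come from the slide, while `build_index`, `users_to_update`, and the sample values are hypothetical.

```python
from collections import defaultdict


def build_index(user_attributes):
    """Map each (attribute kind, value) pair to the set of users that have it."""
    index = defaultdict(set)
    for user, attrs in user_attributes.items():
        for kind, value in attrs.items():
            index[(kind, value)].add(user)
    return index


def users_to_update(user, user_attributes, index):
    """Users sharing at least one attribute value with the given user."""
    linked = set()
    for kind, value in user_attributes[user].items():
        linked |= index[(kind, value)]
    linked.discard(user)
    return linked


# Made-up sample data: a shares one attribute with each of b, c, and d.
USERS = {
    "a": {"email": "x@example.com", "device": "fp1", "shipping": "1 Main St"},
    "b": {"email": "x@example.com", "device": "fp2", "shipping": "2 Oak Ave"},
    "c": {"email": "y@example.com", "device": "fp1", "shipping": "3 Elm Rd"},
    "d": {"email": "z@example.com", "device": "fp3", "shipping": "1 Main St"},
}
```

Calling `users_to_update("a", USERS, build_index(USERS))` yields b, c, and d for this sample data, matching the "update b, c, d" label on the slide.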
  • 42. Summary: Online learning helps to keep pace with fraudsters. Decomposed online updatable data into incremental numeric values and sets. Leveraged HBase and distributed cache for consistency and performance. Traded consistency for performance with coalescing and L1 cache. Table design powers multiple additional use cases.