3. About Sift Science
Fraud detection using supervised machine learning, in real time
Billions of purchases scored
Hundreds of millions of users
background | ml at sift | infrastructure | experience
5. Start with your data…
You have examples of GOOD and BAD users.
You have a set of signals that you think are predictive of fraud.
6. Train: Build a model from existing data
Train a statistical model with examples of GOOD and BAD users.
The model will learn signal values common to each user type.
7. Predict: Find patterns in new data
Apply the model to currently active customers.
Predict which are fraud and which aren't.
8. Act: Turn insights into action
Intelligently segment your customers with a probability of risk.
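The train/predict/act loop above can be sketched with a tiny logistic-regression classifier. Everything here is illustrative: the two signals (`num_cc_changes`, `orders_per_hour`), the toy labels, and the learning rate are invented for the example, not Sift's actual features or model.

```python
import math

# Toy training data: each user is a vector of fraud signals.
# Signal names are hypothetical: (num_cc_changes, orders_per_hour).
train = [
    ([0.0, 0.1], 0),  # label 0 = GOOD user
    ([0.1, 0.2], 0),
    ([0.9, 0.8], 1),  # label 1 = BAD user
    ([1.0, 0.9], 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train: fit logistic regression by simple gradient descent.
w = [0.0, 0.0]
b = 0.0
for _ in range(1000):
    for x, y in train:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y  # gradient of the log-loss w.r.t. the logit
        w = [wi - 0.5 * g * xi for wi, xi in zip(w, x)]
        b -= 0.5 * g

# Predict: score new, unseen users with a probability of risk.
risky = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.95, 0.85])) + b)
safe = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.05, 0.10])) + b)
print(round(risky, 2), round(safe, 2))
```

Act: the output probability is what lets a customer segment users by risk, e.g. auto-block above one threshold and manually review above another.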
10. Data at Sift
Customers stream events to us:
- Page Views (JavaScript)
- Purchases (API)
- Labels (API or Console)
This gives a time series view of the user.
11. Time Series of Events
Signup → Add CC → Add item(s) to Cart → Purchase 1 → Change CC → Change Billing → Purchase 2
Features are computed by scanning this event time series.
12. Data Transformation
Time Series of Events →
{ Device ID features }
{ Number of emails }
{ NLP features }
{ Address features }
{ Custom fields }
…
> 1K features in total
13. Sparse Feature → Dense Feature ("densification")
Sparse Feature: Email          Dense Feature
Val a@ (num_fraud=1)     →     Val 1
Val b@ (num_fraud=3)     →     Val 3
Val c@ (num_fraud=3)     →     Val 3
Val d@ (num_fraud=1)     →     Val 1
…
These mappings constantly change.
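The densification idea on this slide can be sketched in a few lines: a sparse, ever-growing space of raw values (here, email addresses) is mapped to a dense numeric feature (the fraud count observed for that value). The function and variable names are illustrative, not Sift's API.

```python
from collections import defaultdict

# Sparse state: raw email value -> number of fraud labels seen.
fraud_counts = defaultdict(int)

def observe(email, is_fraud):
    """Update the sparse state as labeled events arrive."""
    if is_fraud:
        fraud_counts[email] += 1

def densify(email):
    """Dense numeric feature the model actually consumes."""
    return fraud_counts[email]

observe("a@example.com", True)
observe("b@example.com", True)
observe("b@example.com", True)
observe("b@example.com", True)

print(densify("a@example.com"))   # 1
print(densify("b@example.com"))   # 3
print(densify("zz@example.com"))  # 0: unseen values densify immediately
```

Because labels keep arriving, these mappings change constantly, which is exactly the consistency problem the rest of the talk addresses.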
19. Adapting to Change
Regular batch training vs. online learning: Sift does both.
The batch and online code paths match where possible.
20. Adapting to Scale
A distributed system to handle request volume and data size
Updates made in one place need to be visible everywhere
Performance still matters
22. Option: Checkpoints + Pub-Sub
Each classifier replica subscribes to a label queue and applies updates on top of checkpoints.
Solves: propagation, performance
Does not solve: data scale, write amplification, complexity
23. Option: Distributed DB (HBase)
Classifiers read and write label-derived state through a shared HBase cluster.
Solves: propagation, complexity, data scale, single source of truth
Does not solve: performance
25. Why HBase? (see our talk at HBaseCon!)
Scan and HFile access for batch operations, row operations online
Higher-level atomic operations and batching
Block caching (and other forms of caching)
Snapshots
Drives the console and front end
28. Online Learning
Time Series → Features → Score → Update:
- Updates to sparse feature state
- Updates to model parameters
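The score-then-update flow can be sketched as follows. This is a toy: the single parameter `w`, the perceptron-style nudge, and the learning rate are stand-ins for a real online learner, and the names are invented for illustration.

```python
state = {}   # sparse feature state: email -> fraud count
w = 0.1      # a single model parameter (illustrative)

def score(email):
    """Score an event using current sparse state and parameters."""
    return w * state.get(email, 0)

def learn(email, is_fraud):
    """Online update: adjust sparse state, then model parameters."""
    global w
    if is_fraud:
        state[email] = state.get(email, 0) + 1
    # Toy perceptron-style nudge toward the observed label.
    target = 1.0 if is_fraud else 0.0
    w += 0.05 * (target - score(email)) * state.get(email, 0)

s0 = score("x@example.com")   # 0.0 before any labels arrive
learn("x@example.com", True)
s1 = score("x@example.com")   # higher after a fraud label
print(s0, s1 > s0)            # 0.0 True
```

The point of the sketch: every labeled event mutates both the sparse state and the parameters, so subsequent scores shift immediately, without waiting for a batch retrain.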
29. Sparse Feature Densification
Sparse fields: device IDs, cookies, custom fields, etc.
Mapping to a dense space based on set cardinality
Two-table implementation (“ItemSetCounter”):
- Slower set table (up to 8K items per set; > 100M sets)
- Faster counts table (batching, coalescing)
Global and per-customer states
Real-time introduction of features and feature values
30. ItemSetCounter
Counts table          Sets table
set1: 1               set1: { a }
set2: 3               set2: { b, c, d }
set3: 1               set3: { e }
set4: 2               set4: { f, g }
…                     …
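A conceptual sketch of the two-table ItemSetCounter above, with plain dicts standing in for the two HBase tables: a slower set table holding members and a faster counts table holding cardinalities, kept in sync on insert. This is not Sift's implementation, just the shape of the design.

```python
class ItemSetCounter:
    def __init__(self):
        self.sets = {}    # set table: set_id -> members (slower, larger)
        self.counts = {}  # counts table: set_id -> cardinality (fast reads)

    def add(self, set_id, item):
        """Add an item; bump the count only on first membership."""
        members = self.sets.setdefault(set_id, set())
        if item not in members:
            members.add(item)
            self.counts[set_id] = self.counts.get(set_id, 0) + 1

    def cardinality(self, set_id):
        # Classification reads hit only the fast counts table.
        return self.counts.get(set_id, 0)

isc = ItemSetCounter()
for item in ["b", "c", "d", "c"]:  # duplicate "c" is a no-op
    isc.add("set2", item)
print(isc.cardinality("set2"))  # 3
```

Splitting members from counts lets the hot path (cardinality lookups during scoring) avoid touching the large set table at all.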
31. Model Parameters
Feature weights are updated in increments.
Counting for learning and display:
- Number of unique features and feature values (set union)
- Label counts along various dimensions (increment)
Thousands of accesses per classification
32. Model Parameters (Implementation)
Three-table design:
- ItemSetCounter (two tables) for set membership and cardinality
- NumericParameterTable for incrementing numeric values
Enables:
- Fast batch access of numeric parameters and set sizes
- Availability of items in a set for display and analysis
- Real-time introduction of features and feature values
34. Performance: Batching and Coalescing
Code written (and rewritten) to read data in batches
Updates are coalesced in memory for up to 1 second
“Approximately consistent”:
- Throughput/latency vs. consistency tradeoff
- The ML feature space has a higher tolerance for noise
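In-memory coalescing can be sketched like this: increments to the same key are merged for up to about a second and then flushed as one batched write, trading strict consistency for throughput. The class and the `flushed` list (a stand-in for HBase writes) are invented for illustration.

```python
import time

class CoalescingBuffer:
    def __init__(self, flush_interval=1.0):
        self.pending = {}                  # key -> summed increment
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()
        self.flushed = []                  # stand-in for batched HBase writes

    def increment(self, key, delta=1):
        # Merge with any pending increment for the same key.
        self.pending[key] = self.pending.get(key, 0) + delta
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One batched write instead of one write per increment.
        self.flushed.append(dict(self.pending))
        self.pending.clear()
        self.last_flush = time.monotonic()

buf = CoalescingBuffer(flush_interval=1.0)
for _ in range(1000):
    buf.increment("set2")  # 1000 calls coalesce into a single pending +1000
buf.flush()
print(buf.flushed[0]["set2"])  # 1000
```

Readers see state that is up to one flush interval stale, which is the "approximately consistent" tradeoff the slide describes.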
35. Performance: Caching
Multi-level caching scheme in front of HBase:
- L1 (optional): local cache with a TTL of 1 minute
- L2: Memcached with batching and distributed invalidation support, 1 day TTL
Longer TTLs for (currently) non-updatable parameters
Local → Memcached → HBase
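A minimal sketch of the L1/L2 read path above, with plain dicts standing in for Memcached and HBase, and a simple TTL cache for both levels. Keys, values, and TTLs are illustrative.

```python
import time

class TTLCache:
    """Tiny TTL cache standing in for a local cache or Memcached."""
    def __init__(self, ttl):
        self.ttl, self.store = ttl, {}

    def get(self, key):
        hit = self.store.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]
        return None

    def put(self, key, value):
        self.store[key] = (value, time.monotonic())

hbase = {"param:w1": 0.42}  # stand-in for the HBase table
l1 = TTLCache(ttl=60)       # local, 1 minute TTL
l2 = TTLCache(ttl=86400)    # shared (Memcached in the talk), 1 day TTL

def read(key):
    for cache in (l1, l2):
        value = cache.get(key)
        if value is not None:
            return value
    value = hbase[key]      # missed both levels: go to HBase
    l2.put(key, value)      # populate on the way back up
    l1.put(key, value)
    return value

print(read("param:w1"), read("param:w1"))  # second read is an L1 hit
```

The write side (not shown) is what makes this hard in a distributed system: updates must invalidate L2 across machines, which is why the slide calls out distributed invalidation support.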
36. In all, we manage about 200 million sets and numeric parameters.
38. Densification (L2-only)
95% L2 cache hit rate
The remaining 5%:
- 50-100 batches/sec, 50-200 rows/batch
- 75th percentile: 5 ms; 99th percentile: 100 ms
39. Model Parameters (L1 and L2)
90-95% L1 hit rate, 99+% L2 hit rate
When we miss:
- Hundreds of batches per second, 10-3000 rows per batch
- 75th percentile: 2 ms (NPT), 20 ms (ISC counts)
- 99th percentile: 30 ms (NPT), 300 ms (ISC counts)
41. Application: Cascading Updates
A single event for user a cascades updates to linked users b, c, and d through shared email address, device fingerprint, and shipping address.
42. Summary
Online learning helps keep pace with fraudsters.
Decomposed online-updatable data into incremental numeric values and sets.
Leveraged HBase and a distributed cache for consistency and performance.
Traded consistency for performance with coalescing and the L1 cache.
The table design powers multiple additional use cases.