3. About Sift Science
Fraud detection using supervised machine learning, in real time
Billions of purchases scored
Hundreds of millions of users
background | ml at sift | infrastructure | experience
5. Start with your data…
You have examples of GOOD and BAD users.
You have a set of signals that you think are predictive of fraud.
6. Train: Build a model from existing data
Train a statistical model with examples of GOOD and BAD users.
The model will learn signal values common to each user type.
7. Predict: Find patterns in new data
Apply the model to currently active customers.
Predict which are fraud and which aren't.
8. Act: Turn insights into action
Intelligently segment your customers with a probability of risk.
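The train/predict/act loop above can be sketched with a tiny logistic-regression classifier. Everything here is illustrative: the two signals (`num_cc_changes`, `orders_per_hour`), the toy labels, and the learning rate are invented for the example, not Sift's actual features or model.

```python
import math

# Toy training data: each user is a vector of fraud signals.
# Signal names are hypothetical: (num_cc_changes, orders_per_hour).
train = [
    ([0.0, 0.1], 0),  # label 0 = GOOD user
    ([0.1, 0.2], 0),
    ([0.9, 0.8], 1),  # label 1 = BAD user
    ([1.0, 0.9], 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train: fit logistic regression by simple gradient descent.
w = [0.0, 0.0]
b = 0.0
for _ in range(1000):
    for x, y in train:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y  # gradient of the log-loss w.r.t. the logit
        w = [wi - 0.5 * g * xi for wi, xi in zip(w, x)]
        b -= 0.5 * g

# Predict: score new, unseen users with a probability of risk.
risky = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.95, 0.85])) + b)
safe = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.05, 0.10])) + b)
print(round(risky, 2), round(safe, 2))
```

Act: the output probability is what lets a customer segment users by risk, e.g. auto-block above one threshold and manually review above another.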
10. Data at Sift
Customers stream events to us:
- Page Views (JavaScript)
- Purchases (API)
- Labels (API or Console)
This gives a time series view of the user.
11. Time Series of Events
Signup → Add CC → Add item(s) to Cart → Purchase 1 → Change CC → Change Billing → Purchase 2
Features are computed by scanning this event time series.
12. Data Transformation
Time Series of Events →
{ Device ID features }
{ Number of emails }
{ NLP features }
{ Address features }
{ Custom fields }
…
> 1K features in total
13. Sparse Feature → Dense Feature ("densification")
Sparse Feature: Email          Dense Feature
Val a@ (num_fraud=1)     →     Val 1
Val b@ (num_fraud=3)     →     Val 3
Val c@ (num_fraud=3)     →     Val 3
Val d@ (num_fraud=1)     →     Val 1
…
These mappings constantly change.
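The densification idea on this slide can be sketched in a few lines: a sparse, ever-growing space of raw values (here, email addresses) is mapped to a dense numeric feature (the fraud count observed for that value). The function and variable names are illustrative, not Sift's API.

```python
from collections import defaultdict

# Sparse state: raw email value -> number of fraud labels seen.
fraud_counts = defaultdict(int)

def observe(email, is_fraud):
    """Update the sparse state as labeled events arrive."""
    if is_fraud:
        fraud_counts[email] += 1

def densify(email):
    """Dense numeric feature the model actually consumes."""
    return fraud_counts[email]

observe("a@example.com", True)
observe("b@example.com", True)
observe("b@example.com", True)
observe("b@example.com", True)

print(densify("a@example.com"))   # 1
print(densify("b@example.com"))   # 3
print(densify("zz@example.com"))  # 0: unseen values densify immediately
```

Because labels keep arriving, these mappings change constantly, which is exactly the consistency problem the rest of the talk addresses.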
19. Adapting to Change
Regular batch training vs. online learning: Sift does both.
The batch and online code paths match where possible.
20. Adapting to Scale
A distributed system to handle request volume and data size
Updates made in one place need to be visible everywhere
Performance still matters
22. Option: Checkpoints + Pub-Sub
Each classifier replica subscribes to a label queue and applies updates on top of checkpoints.
Solves: propagation, performance
Does not solve: data scale, write amplification, complexity
23. Option: Distributed DB (HBase)
Classifiers read and write label-derived state through a shared HBase cluster.
Solves: propagation, complexity, data scale, single source of truth
Does not solve: performance
25. Why HBase? (see our talk at HBaseCon!)
Scan and HFile access for batch operations, row operations online
Higher-level atomic operations and batching
Block caching (and other forms of caching)
Snapshots
Drives the console and front end
28. Online Learning
Time Series → Features → Score → Update:
- Updates to sparse feature state
- Updates to model parameters
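The score-then-update flow can be sketched as follows. This is a toy: the single parameter `w`, the perceptron-style nudge, and the learning rate are stand-ins for a real online learner, and the names are invented for illustration.

```python
state = {}   # sparse feature state: email -> fraud count
w = 0.1      # a single model parameter (illustrative)

def score(email):
    """Score an event using current sparse state and parameters."""
    return w * state.get(email, 0)

def learn(email, is_fraud):
    """Online update: adjust sparse state, then model parameters."""
    global w
    if is_fraud:
        state[email] = state.get(email, 0) + 1
    # Toy perceptron-style nudge toward the observed label.
    target = 1.0 if is_fraud else 0.0
    w += 0.05 * (target - score(email)) * state.get(email, 0)

s0 = score("x@example.com")   # 0.0 before any labels arrive
learn("x@example.com", True)
s1 = score("x@example.com")   # higher after a fraud label
print(s0, s1 > s0)            # 0.0 True
```

The point of the sketch: every labeled event mutates both the sparse state and the parameters, so subsequent scores shift immediately, without waiting for a batch retrain.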
29. Sparse Feature Densification
Sparse fields: device IDs, cookies, custom fields, etc.
Mapping to a dense space based on set cardinality
Two-table implementation (“ItemSetCounter”):
- Slower set table (up to 8K items per set; > 100M sets)
- Faster counts table (batching, coalescing)
Global and per-customer states
Real-time introduction of features and feature values
30. ItemSetCounter
Counts table          Sets table
set1: 1               set1: { a }
set2: 3               set2: { b, c, d }
set3: 1               set3: { e }
set4: 2               set4: { f, g }
…                     …
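A conceptual sketch of the two-table ItemSetCounter above, with plain dicts standing in for the two HBase tables: a slower set table holding members and a faster counts table holding cardinalities, kept in sync on insert. This is not Sift's implementation, just the shape of the design.

```python
class ItemSetCounter:
    def __init__(self):
        self.sets = {}    # set table: set_id -> members (slower, larger)
        self.counts = {}  # counts table: set_id -> cardinality (fast reads)

    def add(self, set_id, item):
        """Add an item; bump the count only on first membership."""
        members = self.sets.setdefault(set_id, set())
        if item not in members:
            members.add(item)
            self.counts[set_id] = self.counts.get(set_id, 0) + 1

    def cardinality(self, set_id):
        # Classification reads hit only the fast counts table.
        return self.counts.get(set_id, 0)

isc = ItemSetCounter()
for item in ["b", "c", "d", "c"]:  # duplicate "c" is a no-op
    isc.add("set2", item)
print(isc.cardinality("set2"))  # 3
```

Splitting members from counts lets the hot path (cardinality lookups during scoring) avoid touching the large set table at all.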
31. Model Parameters
Feature weights are updated in increments.
Counting for learning and display:
- Number of unique features and feature values (set union)
- Label counts along various dimensions (increment)
Thousands of accesses per classification
32. Model Parameters (Implementation)
Three-table design:
- ItemSetCounter (two tables) for set membership and cardinality
- NumericParameterTable for incrementing numeric values
Enables:
- Fast batch access of numeric parameters and set sizes
- Availability of items in a set for display and analysis
- Real-time introduction of features and feature values
34. Performance: Batching and Coalescing
Code written (and rewritten) to read data in batches
Updates are coalesced in memory for up to 1 second
“Approximately consistent”:
- Throughput/latency vs. consistency tradeoff
- The ML feature space has a higher tolerance for noise
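In-memory coalescing can be sketched like this: increments to the same key are merged for up to about a second and then flushed as one batched write, trading strict consistency for throughput. The class and the `flushed` list (a stand-in for HBase writes) are invented for illustration.

```python
import time

class CoalescingBuffer:
    def __init__(self, flush_interval=1.0):
        self.pending = {}                  # key -> summed increment
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()
        self.flushed = []                  # stand-in for batched HBase writes

    def increment(self, key, delta=1):
        # Merge with any pending increment for the same key.
        self.pending[key] = self.pending.get(key, 0) + delta
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One batched write instead of one write per increment.
        self.flushed.append(dict(self.pending))
        self.pending.clear()
        self.last_flush = time.monotonic()

buf = CoalescingBuffer(flush_interval=1.0)
for _ in range(1000):
    buf.increment("set2")  # 1000 calls coalesce into a single pending +1000
buf.flush()
print(buf.flushed[0]["set2"])  # 1000
```

Readers see state that is up to one flush interval stale, which is the "approximately consistent" tradeoff the slide describes.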
35. Performance: Caching
Multi-level caching scheme in front of HBase:
- L1 (optional): local cache with a TTL of 1 minute
- L2: Memcached with batching and distributed invalidation support, 1 day TTL
Longer TTLs for (currently) non-updatable parameters
Local → Memcached → HBase
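A minimal sketch of the L1/L2 read path above, with plain dicts standing in for Memcached and HBase, and a simple TTL cache for both levels. Keys, values, and TTLs are illustrative.

```python
import time

class TTLCache:
    """Tiny TTL cache standing in for a local cache or Memcached."""
    def __init__(self, ttl):
        self.ttl, self.store = ttl, {}

    def get(self, key):
        hit = self.store.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]
        return None

    def put(self, key, value):
        self.store[key] = (value, time.monotonic())

hbase = {"param:w1": 0.42}  # stand-in for the HBase table
l1 = TTLCache(ttl=60)       # local, 1 minute TTL
l2 = TTLCache(ttl=86400)    # shared (Memcached in the talk), 1 day TTL

def read(key):
    for cache in (l1, l2):
        value = cache.get(key)
        if value is not None:
            return value
    value = hbase[key]      # missed both levels: go to HBase
    l2.put(key, value)      # populate on the way back up
    l1.put(key, value)
    return value

print(read("param:w1"), read("param:w1"))  # second read is an L1 hit
```

The write side (not shown) is what makes this hard in a distributed system: updates must invalidate L2 across machines, which is why the slide calls out distributed invalidation support.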
36. In all, we manage about 200 million sets and numeric parameters.
38. Densification (L2-only)
95% L2 cache hit rate
The remaining 5%:
- 50-100 batches/sec, 50-200 rows/batch
- 75th percentile: 5 ms; 99th percentile: 100 ms
39. Model Parameters (L1 and L2)
90-95% L1 hit rate, 99+% L2 hit rate
When we miss:
- Hundreds of batches per second, 10-3000 rows per batch
- 75th percentile: 2 ms (NPT), 20 ms (ISC counts)
- 99th percentile: 30 ms (NPT), 300 ms (ISC counts)
41. Application: Cascading Updates
A single event for user a cascades updates to linked users b, c, and d through shared email address, device fingerprint, and shipping address.
42. Summary
Online learning helps keep pace with fraudsters.
Decomposed online-updatable data into incremental numeric values and sets.
Leveraged HBase and a distributed cache for consistency and performance.
Traded consistency for performance with coalescing and the L1 cache.
The table design powers multiple additional use cases.