Data Science, Machine Learning, and H2O

Scalable Machine Learning
For Smarter Applications

Agenda
Data Science
Machine Learning
Trees and Power of Algorithmic Methods
Examples using H2O Scalable Machine
Learning Engine

Who am I?
Hank Roark
Data Scientist & Hacker @ H2O.ai
Lecturer in Systems Thinking, UIUC
13 years at John Deere: Research, New Product Development, New High Tech Ventures
Previously at startups and consulting
Physics, Georgia Tech
Systems Design & Management, MIT

Data Science
Interdisciplinary
Data is an electronic commodity, so you must speak ‘hacker’
Extract insights from data
Discovery and building knowledge
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Data Science
Jeff Hammerbacher (Facebook, Cloudera)
• Identify problem
• Instrument data sources
• Collect data
• Prepare data (integrate, transform, clean, impute, filter, aggregate)
• Build model
• Evaluate model
• Communicate results

Data Science
Ben Fry (data visualization expert)
• Acquire
• Parse
• Filter
• Mine
• Represent
• Refine
• Interact

Agenda
✓ Data Science
Machine Learning
Trees and Power of Algorithmic Methods
Examples using H2O Scalable Machine
Learning Engine

Field of study that gives computers the ability to learn
without being explicitly programmed.
Arthur Samuel, 1959

A computer program is said to learn from experience E
with respect to some task T
and some performance measure P,
if its performance on T,
as measured by P,
improves with experience E.
Tom Mitchell, 1998
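To make the E, T, P wording concrete, here is a minimal sketch, not taken from the talk. The task T is labeling 1-D points as class 0 or class 1, the performance measure P is accuracy on held-out points, and the experience E is the number of labeled training examples. The data generator, the nearest-centroid learner, and all function names are assumptions chosen for illustration.

```python
import random


def make_examples(n_per_class, rng):
    """Draw labeled 1-D points: class 0 centered at 0.0, class 1 centered at 2.0."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return data


def learn_centroids(examples):
    """Experience E: summarize the labeled examples by one mean per class."""
    class0 = [x for x, label in examples if label == 0]
    class1 = [x for x, label in examples if label == 1]
    return sum(class0) / len(class0), sum(class1) / len(class1)


def predict(model, x):
    """Task T: assign x to the class whose mean is closer."""
    mean0, mean1 = model
    return 1 if abs(x - mean1) < abs(x - mean0) else 0


def accuracy(model, examples):
    """Performance measure P: fraction of examples labeled correctly."""
    hits = sum(1 for x, label in examples if predict(model, x) == label)
    return hits / len(examples)


rng = random.Random(0)
held_out = make_examples(1000, rng)
for n in (1, 5, 50, 500):
    model = learn_centroids(make_examples(n, rng))
    print(n, "examples per class -> accuracy", round(accuracy(model, held_out), 3))
```

With very few examples the class means are noisy, so accuracy can swing from run to run; with more examples it tends to settle near the best this simple rule can achieve on such overlapping classes.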

Types of Learning
• Supervised Learning
− Inferring a function from labeled data
− Classification
− Regression
• Unsupervised Learning
− Finding hidden structure in unlabeled data
− Clustering
− Anomaly detection
• Reinforcement Learning
− Learning from delayed feedback

Isn’t this just statistics repackaged?
x → nature → y
Shared goals of data analysis:
Prediction
Information extraction
L. Breiman

Statistical Analysis
x → [linear regression, logistic regression, Cox models] → y
Assume some process that creates observed data
Model validation:
Yes–no using goodness-of-fit tests
Residual examination
L. Breiman

Algorithmic Analysis (aka ML)
x → unknown → y
Methods: decision trees, neural networks
Process that creates observed data is unknowable
Model validation:
Measured by predictive accuracy
L. Breiman
Why Big Data + Machine Learning

Agenda
✓ Data Science
✓ Machine Learning
Trees and Power of Algorithmic Methods
Examples using H2O Scalable Machine
Learning Engine

Trees
Short exploration of one algorithmic method
Can be used for regression and classification
Segments the predictor space into a number
of simple regions
Often referred to as decision trees
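As a rough illustration of segmenting into simple regions, the sketch below finds the single best split of one predictor for a regression tree, choosing the threshold that minimizes the residual sum of squares (RSS) when each region predicts its own mean. The function names and the years/salary values are invented for this example; they are not the baseball data shown in the talk.

```python
def region_rss(values):
    """Residual sum of squares when a region predicts its own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Try a threshold between each pair of neighboring x values.

    Returns (rss, threshold, left_mean, right_mean) for the best split,
    or None if all x values are identical and no split exists.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        rss = region_rss(left) + region_rss(right)
        if best is None or rss < best[0]:
            best = (rss, threshold, sum(left) / len(left), sum(right) / len(right))
    return best


# Invented toy values: years in the league vs. salary (arbitrary units).
years = [1, 2, 3, 10, 11, 12]
salary = [5, 5, 5, 20, 20, 20]
print(best_split(years, salary))
```

A full tree repeats this search inside each resulting region, and over every predictor, until a stopping rule is met.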

Baseball Salary
Salary is color coded from low
(blue) to high (red)
Tibshirani and Hastie

Pros and Cons
Simple, thought to mirror human decision
making
Not competitive with the best supervised
learning approaches in terms of predictive
accuracy
Combining large number of trees results in
dramatic improvements, with some loss of
interpretability

Methods to Improve Predictive Performance of Trees
Bagging
• Bagging is short for bootstrap aggregation.
• Averaging a set of observations reduces variance.
• Individual trees are built on samples, with replacement, of the data. (Bootstrap)
• Many trees are built and the results ‘averaged’. (Aggregation)
Random Forest
• Random forest builds on bagging by considering a random subset of the predictors at each tree split.
• This further decorrelates the trees, resulting in improved predictive performance.
• Implemented in H2O as Random Forest.
Boosting
• Builds multiple models sequentially, using information from prior trees.
• Slowly fits the residuals of prior models.
• Is a general method, not limited to trees.
• Implemented in H2O as GBM (Gradient Boosted Models); first ever parallel, distributed GBM.
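The bagging and boosting ideas above can be sketched with one-split regression trees (stumps) on a single predictor. This is a toy illustration under stated assumptions, not H2O's implementation: the function names, the made-up quadratic data, the tree counts, and the learning rate are all chosen for the example. Random forest would additionally sample a subset of predictors at each split, which has no effect with only one predictor and is therefore not shown.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    # Fallback when no split is possible (all x equal): predict the overall mean.
    best = (float("inf"), pairs[0][0], overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def bagging(xs, ys, n_trees, rng):
    """Bootstrap: fit each stump on a sample drawn with replacement."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Aggregation: average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


def boosting(xs, ys, n_trees, learning_rate):
    """Fit stumps sequentially, each to the residuals left by the earlier ones."""
    residuals = list(ys)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution so the residuals are fit slowly.
        residuals = [r - learning_rate * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return stumps


def predict_boosted(stumps, x, learning_rate):
    return learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: a smooth curve that a single stump cannot follow.
xs = [float(i) for i in range(20)]
ys = [x ** 2 for x in xs]
single = fit_stump(xs, ys)
bagged = bagging(xs, ys, n_trees=25, rng=random.Random(0))
boosted = boosting(xs, ys, n_trees=200, learning_rate=0.1)
print("single stump MSE: ", mse(lambda x: predict_stump(single, x), xs, ys))
print("bagged stumps MSE:", mse(lambda x: predict_bagged(bagged, x), xs, ys))
print("boosted stumps MSE:", mse(lambda x: predict_boosted(boosted, x, 0.1), xs, ys))
```

The errors printed here are training errors on toy data, so they show the mechanics rather than out-of-sample performance.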

Which Algorithm Is Best?
[Figure panels: Linear models, Decision tree]
Tibshirani and Hastie

Which Algorithm Is Best?
We have dubbed the associated results No Free Lunch theorems
because they demonstrate that if an algorithm performs well on a certain
class of problems then it necessarily pays for that with degraded
performance on the set of all remaining problems. (Wolpert and Macready)

Agenda
✓ Data Science
✓ Machine Learning
✓ Trees and Power of Algorithmic Methods
Examples using H2O Scalable Machine
Learning Engine

H2O.ai Overview
• Founded: 2011, venture-backed; debuted in 2012
• Product: H2O open source in-memory prediction engine
• Team: 37, distributed systems engineers doing ML
• HQ: Mountain View, CA
H2O.ai
Machine Intelligence

Team Work @ H2O.ai
25,000 commits / 3yrs
H2O World Conference 2014
Join H2O World Nov 9-11 2015!

What is H2O?
Math Platform: Open source in-memory prediction engine
• Parallelized and distributed algorithms making the most use out of multithreaded systems
• GLM, Random Forest, GBM, Deep Learning, etc.
API: Easy to use and adopt
• Written in Java – perfect for Java Programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau
Big Data: More data? Or better models? BOTH
• Use all of your data – model without down sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
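Running a simple GLM next to a more complex GBM might look like the sketch below, which uses the H2O Python package. It is a hedged illustration rather than the talk's own code: the CSV path, the binary target column, the hyperparameters, and the helper names `predictor_columns` and `compare_glm_and_gbm` are assumptions, and running it needs the `h2o` package plus a Java runtime.

```python
def predictor_columns(columns, target):
    """Use every column except the target as a predictor."""
    return [name for name in columns if name != target]


def compare_glm_and_gbm(csv_path, target):
    """Train a GLM and a GBM on the same split; return validation AUC for each.

    Assumes csv_path points to a CSV file whose target column is binary.
    """
    # Imported here so this sketch can be loaded without H2O installed.
    import h2o
    from h2o.estimators import (H2OGeneralizedLinearEstimator,
                                H2OGradientBoostingEstimator)

    h2o.init()  # start or connect to a local H2O cluster
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # categorical target => classification
    train, valid = frame.split_frame(ratios=[0.8], seed=42)
    predictors = predictor_columns(frame.columns, target)

    models = {
        "glm": H2OGeneralizedLinearEstimator(family="binomial"),
        # Illustrative settings only, not tuned values.
        "gbm": H2OGradientBoostingEstimator(ntrees=50, max_depth=5,
                                            learn_rate=0.1, seed=42),
    }
    scores = {}
    for name, model in models.items():
        model.train(x=predictors, y=target,
                    training_frame=train, validation_frame=valid)
        scores[name] = model.auc(valid=True)
    return scores
```

Calling `compare_glm_and_gbm("data.csv", "label")`, with a file and column of your own, would return a dictionary of validation AUC values to compare.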

Customer Use Cases
• Ad Optimization (200% CPA Lift with H2O)
• P2B Model Factory (60k models, 15x faster with H2O than before)
• Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
• Real-time marketing (H2O is 10x faster than anything else)
…and many large insurance, financial services, and manufacturing companies!

Customer Stories
• Propensity to Buy model
• AdTech
• Fraud prevention

Propensity to Buy modeling factory

Cisco Predictive Modeling Factories
Problem
• Need to predict whether a company will buy a certain product at a given time
• Spend a lot of time preparing models
• Less time for scoring and less time left for using the scores in the sales activities
Why H2O?
• P2B factory is 15x faster with H2O
• Newer buying patterns incorporated immediately into models
• Scores are published sooner
• More time for planning and executing activities
• R + H2O is a robust and powerful combination
Who uses it?
• Lou Carvalheira, advanced analytics manager
• Customer Intelligence data scientists

P2B factory is 15x faster with H2O
[Figure: two timelines spanning Q1 and Q2. Before, without H2O: data refresh (Q1, Q2), P2B training, scoring models, then prepare and execute Mktg & Sales activities. Now, with H2O: in each quarter, data refresh, train & score, then prepare and execute Mktg & Sales activities.]

Modeling conversion rate on multiple campaigns

ShareThis AdTech Optimization
Problem
• ShareThis ONLY targets users within 24 hours to ensure ads reach them at the most relevant moment for maximum ROI
• Maximized ROI by optimizing campaign performance and budget allocation
Why H2O?
• Increased accuracy and better anomaly removal
• Reduced R&D time significantly
• Used all data and built models faster, & faster scoring
• Smooth model building pipeline with R and Spark API
Who uses it?
• Prasanta Behera, VP of Engineering
• Ad Products team
Real Time Messaging Reaches Users During Peak Interest
[Figure: interest over time, with labels: threshold, trigger, excitement, peak readiness for engagement, fading interest, ShareThis messaging trigger. Standard targeting profile “Dan”: male 25-45, tech enthusiast, HHI $75K+.]
ShareThis ONLY targets users within 24 hours to ensure ads reach them at the most relevant moment

Live Tests on Different Campaigns
observed CPA lift using H2O

Fraud prevention using Deep Learning

PayPal Fraud Prevention
Problem
• Flag fraudulent behavior upfront
• Monitor account activity and account-to-account transactions for suspicious behavior and changes
• Need to model new and complex attack patterns quickly
Why H2O?
• Fast, scalable, and accurate
• Flexible deployment
• Works seamlessly with Hadoop
• Simple interface
• 11% improvement in accuracy w/ Deep Learning
Who uses it?
• Fraud Prevention data science team
Fraud Prevention at PayPal
Experiment
• Dataset
− 160 million records
− 1500 features (150 categorical)
− 0.6TB compressed in HDFS
• Infrastructure
− 800 node Hadoop (CDH3) cluster
• Decision
− Fraud/not-fraud
Results
• Network architecture: 6 layers with 600 neurons each performed the best
• Activation function
− RectifierWithDropout performed the best
• 11% accuracy improvement with limited feature set & a deep network
− With a third of the original feature set, 6 hidden layers, 600 neurons each
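The network shape reported above (6 hidden layers of 600 neurons with the RectifierWithDropout activation) could be expressed with the H2O Python package roughly as follows. This is a sketch under assumptions, not PayPal's configuration: the slide gives no epochs, dropout ratios, or other settings, so the `epochs` value and the names `PAYPAL_STYLE_PARAMS` and `build_deep_learning_model` are placeholders invented here.

```python
# Layer sizes and activation come from the slide; everything else is assumed.
PAYPAL_STYLE_PARAMS = {
    "hidden": [600] * 6,                   # 6 hidden layers, 600 neurons each
    "activation": "RectifierWithDropout",  # reported as the best performer
    "epochs": 10,                          # assumption: not stated in the talk
}


def build_deep_learning_model(params=PAYPAL_STYLE_PARAMS):
    """Create an untrained H2O deep learning estimator from the settings above."""
    # Imported here so this sketch can be loaded without H2O installed.
    from h2o.estimators import H2ODeepLearningEstimator

    return H2ODeepLearningEstimator(**params)
```

Training would then follow the usual H2O pattern: call `h2o.init()`, load a frame, and pass predictor names, a categorical fraud/not-fraud target, and the training frame to the estimator's `train` method.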

Fraud Prevention with Random Forest
Customer selects song to purchase
→ Payment information entered ($)
→ Data collected
→ Comparison with past consumer behavior
→ Random Forest
→ Determine fraud/not fraud
→ Take steps to stop fraud or prevent future fraud

Agenda
✓ Data Science
✓ Machine Learning
✓ Trees and Power of Algorithmic Methods
✓ Examples using H2O Scalable Machine
Learning Engine
