Hank Roark of H2O gives an overview of data science, machine learning, and H2O.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
3. Who am I?
Hank Roark
Data Scientist & Hacker @ H2O.ai
Lecturer in Systems Thinking, UIUC
13 years at John Deere: Research, New Product Development, New High Tech Ventures
Previously at startups and consulting
Physics, Georgia Tech
Systems Design & Management, MIT
6. Data Science
Jeff Hammerbacher (Facebook, Cloudera)
• Identify problem
• Instrument data sources
• Collect data
• Prepare data (integrate, transform, clean, impute, filter, aggregate)
• Build model
• Evaluate model
• Communicate results
10. “Field of study that gives computers the ability to learn without being explicitly programmed.”
Arthur Samuel, 1959
11. “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
Tom Mitchell, 1998
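Mitchell's definition can be made concrete with a toy experiment. The sketch below is only an illustration, not something from the talk: the task T is classifying a number as above or below a hypothetical threshold (0.3), the experience E is a set of labeled examples, and the performance measure P is accuracy on a fixed test grid. The learner is a plain nearest-neighbor rule, and all names are made up for this example.

```python
def true_label(x):
    # Hypothetical ground truth, used only for this illustration.
    return 1 if x >= 0.3 else 0

def make_training_set(n):
    # Experience E: n labeled examples spread evenly over [0, 1).
    return [((i + 0.5) / n, true_label((i + 0.5) / n)) for i in range(n)]

def nearest_neighbor_predict(train, x):
    # Task T: label x with the label of the closest training example.
    _, label = min(train, key=lambda example: abs(example[0] - x))
    return label

def accuracy(train, test_points):
    # Performance measure P: fraction of test points labeled correctly.
    correct = sum(nearest_neighbor_predict(train, x) == true_label(x)
                  for x in test_points)
    return correct / len(test_points)

test_points = [j / 1000 for j in range(1000)]
scores = [accuracy(make_training_set(n), test_points) for n in (2, 8, 32)]
print(scores)  # accuracy for 2, 8, and 32 training examples
```

On this toy problem the accuracy rises as the training set grows, which is the sense in which the program "learns" under the definition.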
12. Types of Learning
• Supervised Learning
− Inferring a function from labeled data
− Classification
− Regression
• Unsupervised Learning
− Finding hidden structure in unlabeled data
− Clustering
− Anomaly detection
• Reinforcement Learning
− Learning from delayed feedback
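As a small illustration of "finding hidden structure in unlabeled data", here is a minimal one-dimensional k-means sketch. It is not from the talk: the data, the starting centers, and the function name are assumptions chosen for the example. No labels are supplied, yet the two groups in the data are recovered.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster 1-D values around the given initial centers (Lloyd's algorithm)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # unlabeled observations
centers, clusters = kmeans_1d(values, centers=[0.0, 6.0])
print(centers, clusters)
```

A supervised method would need a label for each of the six values; here the grouping comes from the values alone.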
13. Isn’t this just statistics repackaged?
[Diagram: x → nature → y]
Shared goals of data analysis:
Prediction
Information extraction
L. Breiman
14. Statistical Analysis
[Diagram: x → linear regression, logistic regression, Cox models → y]
Assume some process that creates observed data
Model validation:
Yes–no using goodness-of-fit tests
Residual examination
L. Breiman
15. Algorithmic Analysis (aka ML)
[Diagram: x → unknown → y; decision trees, neural networks]
Process that creates observed data is unknowable
Model validation:
Measured by predictive accuracy
L. Breiman
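Validation by predictive accuracy usually means scoring a model on rows it did not see during fitting. The sketch below shows one common way to do that, k-fold cross-validation. It is a minimal illustration under stated assumptions: the `fit` callable, the majority-class baseline, and the toy rows are hypothetical and not part of the talk.

```python
def k_fold_accuracy(rows, fit, k=5):
    """Estimate predictive accuracy by k-fold cross-validation.

    rows: list of (features, label) pairs.
    fit:  function that takes training rows and returns predict(features).
    """
    fold_scores = []
    for fold in range(k):
        # Every k-th row (offset by the fold index) is held out for testing.
        test = rows[fold::k]
        train = [row for i, row in enumerate(rows) if i % k != fold]
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in test)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / k

def fit_majority_class(train):
    # Hypothetical baseline: always predict the most common training label.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

# Toy data: one row in four carries label 1, the rest carry label 0.
rows = [((i,), 1 if i % 4 == 0 else 0) for i in range(40)]
score = k_fold_accuracy(rows, fit_majority_class, k=5)
print(score)
```

Any model can be passed as `fit`. The baseline is useful as a floor: a tree or neural network should beat it on held-out rows before its predictions are trusted.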
18. Agenda
Data Science
Machine Learning
Trees and Power of Algorithmic Methods
Examples using H2O Scalable Machine Learning Engine
19. Trees
Short exploration of one algorithmic method
Can be used for regression and classification
Segments the predictor space into a number of simple regions
Often referred to as decision trees
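To show what "segmenting into simple regions" means, here is a minimal sketch of the smallest possible regression tree: a single split on one predictor, with each region predicting the mean of its training responses. The function name and the toy data are assumptions for illustration. A real tree repeats this search recursively inside each region and over many predictors.

```python
def best_split(xs, ys):
    """Find the threshold on one predictor that minimizes squared error."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    points = sorted(set(xs))
    # Candidate thresholds lie midway between consecutive distinct x values,
    # so both sides of every candidate split are non-empty.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold,
                    sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best[1:]  # (threshold, left prediction, right prediction)

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, left_mean, right_mean = best_split(xs, ys)
print(threshold, left_mean, right_mean)
```

The result is two regions, x below the threshold and x at or above it, each with one constant prediction. For classification the same idea applies with a class-purity measure in place of squared error.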
22. Pros and Cons
Simple, thought to mirror human decision making
Not competitive with the best supervised learning approaches in terms of predictive accuracy
Combining a large number of trees results in dramatic improvements, with some loss of interpretability
23. Methods to Improve Predictive Performance of Trees
Bagging
• Bagging is short for bootstrap aggregation.
• Averaging a set of observations reduces variance.
• Individual trees are built on samples, with replacement, of the data. (Bootstrap)
• Many trees are built and the results ‘averaged’. (Aggregation)
Random Forest
• Random forest builds on bagging by considering a random subset of the predictors at each tree split.
• This further decorrelates the trees, resulting in improved predictive performance.
• Implemented in H2O as Random Forest.
Boosting
• Builds multiple models sequentially, using information from prior trees.
• Slowly fit the residuals of prior models.
• Is a general method, not limited to trees.
• Implemented in H2O as GBM (Gradient Boosted Models); first ever parallel, distributed GBM.
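The bagging and boosting ideas above can be sketched in a few lines using one-split trees (stumps) as the base model. This is an illustration under stated assumptions, not H2O's implementation: the helper names, the toy data, the number of trees, and the learning rate are all hypothetical. Random forest is not shown because the toy data has a single predictor, so there is no subset of predictors to sample at each split.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree; return a predict(x) function."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x values identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def bagging(xs, ys, n_trees=25, seed=0):
    """Bootstrap: resample rows with replacement. Aggregate: average the trees."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

def boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit each new tree to the residuals of the ensemble built so far."""
    base = sum(ys) / len(ys)
    trees = []

    def predict(x):
        # A small learning rate makes each tree correct only part of the error.
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(20))
ys = [float(x * x) for x in xs]
single, bagged, boosted = fit_stump(xs, ys), bagging(xs, ys), boosting(xs, ys)
print(mse(single, xs, ys), mse(bagged, xs, ys), mse(boosted, xs, ys))
```

The printed values are training errors on the toy data. Comparing methods properly calls for held-out data, in line with validation by predictive accuracy.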
24. Which Algorithm Is Best?
[Figure: linear models compared with a decision tree. Source: Tibshirani and Hastie]
25. Which Algorithm Is Best?
“We have dubbed the associated results No Free Lunch theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.” (Wolpert and Macready)
26. Agenda
Data Science
Machine Learning
Trees and Power of Algorithmic Methods
Examples using H2O Scalable Machine Learning Engine
27. H2O.ai Overview
• Founded: 2011, venture-backed; debuted in 2012
• Product: H2O open source in-memory prediction engine
• Team: 37, distributed systems engineers doing ML
• HQ: Mountain View, CA
H2O.ai
Machine Intelligence
28. Team Work @ H2O.ai
25,000 commits / 3yrs
H2O World Conference 2014
Join H2O World Nov 9-11 2015!
29. What is H2O?
Math Platform: Open source in-memory prediction engine
• Parallelized and distributed algorithms making the most of multithreaded systems
• GLM, Random Forest, GBM, Deep Learning, etc.
API: Easy to use and adopt
• Written in Java – perfect for Java Programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau
Big Data: More data? Or better models? BOTH
• Use all of your data – model without down sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
31. Customer Use Cases
Ad Optimization (200% CPA Lift with H2O)
P2B Model Factory (60k models, 15x faster with H2O than before)
Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
Real-time marketing (H2O is 10x faster than anything else)
…and many large insurance, financial services, and manufacturing companies!
34. Cisco Predictive Modeling Factories
Problem
• Need to predict whether a company will buy a certain product at a given time
• Spend a lot of time preparing models
• Less time for scoring and less time left for using the scores in the sales activities
Why H2O?
• P2B factory is 15x faster with H2O
• Newer buying patterns incorporated immediately into models
• Scores are published sooner
• More time for planning and executing activities
• R + H2O is a robust and powerful combination
Who uses it?
• Lou Carvalheira, advanced analytics manager
• Customer Intelligence data scientists
35. P2B factory is 15x faster with H2O
[Figure: two quarterly timelines (Q1, Q2). Before, without H2O: data refresh, P2B training, scoring models, then prepare and execute Mktg & Sales activities. Now, with H2O: data refresh, train & score, then prepare and execute Mktg & Sales activities.]
37. ShareThis AdTech Optimization
Problem
• ShareThis ONLY targets users within 24 hours to ensure ads reach them at the most relevant moment for maximum ROI
Why H2O?
• Maximized ROI by optimizing campaign performance and budget allocation
• Increased accuracy and better anomaly removal
• Reduced R&D time significantly
• Used all data and built models faster, & faster scoring
• Smooth model building pipeline with R and Spark API
Who uses it?
• Prasanta Behera, VP of Engineering
• Ad Products team
41. PayPal Fraud Prevention
Problem
• Flag fraudulent behavior upfront
• Monitor account activity and account-to-account transactions for suspicious behavior and changes
• Need to model new and complex attack patterns quickly
Why H2O?
• Fast, scalable, and accurate
• Flexible deployment
• Works seamlessly with Hadoop
• Simple interface
• 11% improvement in accuracy w/ Deep Learning
Who uses it?
• Fraud Prevention data science team
42. Fraud Prevention at PayPal
Experiment
• Dataset
− 160 million records
− 1500 features (150 categorical)
− 0.6TB compressed in HDFS
• Infrastructure
− 800 node Hadoop (CDH3) cluster
• Decision
− Fraud/not-fraud
Results
• Network architecture: 6 layers with 600 neurons each performed the best
• Activation function
− RectifierWithDropout performed the best
• 11% accuracy improvement with limited feature set & a deep network
− With a third of the original feature set, 6 hidden layers, 600 neurons each
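For readers unfamiliar with the activation named in the results, the sketch below shows what a rectifier with dropout does in one fully connected hidden layer. It is a plain-Python illustration, not H2O's code: the function name, the 0.5 dropout ratio, and the "inverted dropout" rescaling are assumptions made for the example. The reported network would stack six such layers with 600 units each.

```python
import random

def rectifier_with_dropout(inputs, weights, biases,
                           dropout=0.5, training=True, rng=random):
    """One fully connected hidden layer: rectifier activation, then dropout."""
    outputs = []
    for w_row, b in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(w_row, inputs)) + b
        activation = max(0.0, pre_activation)  # rectifier: negatives become 0
        if training:
            # Inverted dropout (an assumed convention): zero a unit with
            # probability `dropout` and rescale the survivors so the
            # expected activation is unchanged.
            keep = rng.random() >= dropout
            activation = activation / (1.0 - dropout) if keep else 0.0
        outputs.append(activation)
    return outputs

inputs = [1.0, -2.0]
weights = [[1.0, 1.0], [0.5, -0.5]]  # two hidden units, two inputs each
biases = [0.0, 0.0]

scoring = rectifier_with_dropout(inputs, weights, biases, training=False)
training = rectifier_with_dropout(inputs, weights, biases,
                                  training=True, rng=random.Random(0))
print(scoring, training)
```

At scoring time no units are dropped, so the output is deterministic. During training each unit is either zeroed or scaled up, which discourages units from relying on one another.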
MOVING AWAY FROM OUTDATED AUDIENCE TARGETING BUCKETS TO UTILIZING “FRESHER” REAL-TIME DATA.
Other companies use standard audience targeting and bucket Dan as a “tech enthusiast”; we message him at the moments when it’s most relevant.