Introduction & Hands-on with H2O Driverless AI

Welcome to #DiveIntoH2O London
4:00 PM – Tea & biscuits & networking
4:30 PM - Welcome by H2O.ai Team
4:32 PM - Session by Vladimir Brenner, Partner Account Manager, Disruptors & AI, Intel AI
4:50 PM - Introduction to H2O Driverless AI with Marios Michailidis, Kaggle Grandmaster, H2O.ai
5:30 PM - Pizza break!
5:45 PM - Hands-on with H2O Driverless AI with John Spooner, H2O.ai
7:30 PM - Q&A and networking
8:00 PM - Session ends
Presented By

Introduction to H2O
Driverless AI
Marios Michailidis

Confidential4
Background
• Competitive data scientist at H2O.ai
• PhD in ensemble methods at UCL
• Former kaggle #1 – over 140+ competitions

Confidential5
H2O is the open source leader in AI.
Democratize AI for Everyone.
AI for good.
5 Confidential

Confidential6
Growing Worldwide Open Source Community
14,000 Companies Using H2O
155,000 Data Scientists 120K Meetup Members
H2O World – NYC, London, SF
Thousands attending live and online

Confidential7
H2O.ai Product Suite
Automatic feature engineering,
machine learning and interpretability
• 100% open source – Apache V2 licensed
• Built for data scientists – interface using R, Python on H2O Flow
(interactive notebook interface)
• Enterprise support subscriptions
• Enterprise software
• Built for domain users, analysts and
data scientists – GUI-based interface
for end-to-end data science
• Fully automated machine learning
from ingest to deployment
• User licenses on a per seat basis
(annual subscription)
H2O AI open source engine
integration with Spark
Lightning fast machine
learning on GPUs
In-memory, distributed
machine learning algorithms
with H2O Flow GUI
Open Source

Confidential8
Driverless AI Advantage
• Simple interface
• Automatic feature
engineering to
accuracy
• Automatic recipes for
solving wide variety
use-cases
• Automatic tuning to
and tune the right
ensemble of models

Confidential9 Confidential9
Driverless AI
Features
Target
Data Quality and
Transformation Modeling
Table
Model
Building
Model
Data Integration
+
Typical Enterprise ML Workflow

Confidential10
AUC, Accuracy, Precision…
Time, hardware
Insight-visualization
Feature Engineering
Predictions
Model Interpretability
• Some input data
• A target variable
• An objective (or a success metric)
• Some allocated resources
H2O Driverless AI process
x1 x2 x3 x4 y
0.14 0.69 0.01 0.71 1
0.22 0.44 0.45 0.69 1
0.12 0.35 0.51 0.23 0
0.22 0.42 0.79 0.60 1
0.93 0.82 0.72 0.50 1
0.32 0.58 0.28 0.22 0
0.95 0.59 0.68 0.09 1
0.34 0.58 0.35 0.81 0
0.05 0.80 0.28 0.86 1
0.23 0.49 0.63 0.03 0
0.05 0.34 0.53 0.73 1
0.74 0.02 0.33 0.56 0
Regression:
How much will a customers spend?
Classification:
Will a customer make a purchase? Yes or No
X
y
xi
xj
yes
no

Confidential11
Automatic Visualization (AutoViz)

Confidential12
Visualizations
outlier detection Correlation graph Heat map

Confidential13
Animal
Dog
Dog
Cat
Fish
Cat
Dog
Fish
Cat
Cat
Fish
L.Encoding
2
2
1
3
1
2
3
1
1
3
Frequency
3
3
4
3
4
3
3
4
4
3
Is_Dog? Is_Cat? Is_Fish?
1 0 0
1 0 0
0 1 0
0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
0 1 0
0 0 1
Cost
100
150
300
50
250
75
25
400
225
40
Animal Avg.cost
Cat 293.75
Dog 108.333
Fish 38.3333
108.33
108.33
293.75
38.33
293.75
108.33
38.33
293.75
293.75
38.33
Feature engineering: Categorical features

Confidential14
Age
40
45
45
45
49
51
56
61
62
71
73
74
77
78
78
Age
40-51
52-56
57-77
78+
Age
40
45
45
45
missing
51
56
61
62
71
73
74
77
78
78
Age
40
45
45
45
61.14
51
56
61
62
71
73
74
77
78
78
Age
40-51
52-56
57-77
78+
missing
Age
40-51
40-51
40-51
40-51
40-51
40-51
52-56
57-77
57-77
57-77
57-77
57-77
57-77
78+
78+
Age
40-51
40-51
40-51
40-51
missing
40-51
52-56
57-77
57-77
57-77
57-77
57-77
57-77
78+
78+
Mean=
61.14
Feature engineering: Numerical features
Age
Income

Confidential15
Age income
40 40
45 14
45 100
45 33
49 20
51 44
56 65
61 80
62 55
71 15
73 18
74 33
77 32
78 48
78 41
* +
1600 80
630 59
4500 145
1485 78
980 69
2244 95
3640 121
4880 141
3410 117
1065 86
1314 91
2442 107
2464 109
3744 126
3198 119
Animal color
Dog brown
Dog white
Cat black
Fish gold
Cat mixed
Dog black
Fish white
Cat white
Cat black
Fish brown
Animal_color
Dog_brown
Dog_white
Cat_black
Fish_gold
Cat_mixed
Dog_black
Fish_white
Cat_white
Cat_black
Fish_brown
Age Animal
13 Dog
12 Dog
15 Cat
3 Fish
16 Cat
10 Dog
6 Fish
20 Cat
13 Cat
2 Fish
derived
11.66
11.66
16
3.66
16
11.66
3.66
16
16
3.66
Animal Age
Dog 11.66
Cat 16
Fish 3.66
Feature engineering: Interactions

Confidential16
This dog is cute! I love dogs
dog this is jedi love cute emperor bingo ! I
2 1 1 0 1 1 0 0 1 1
• TF(IDF)
• Stemming, Spellchecking , n-grams, stop words
• SVD
• Word2vec
• Other pre-trained embeddings (Glove, FasText)
word f1 f2 f3 f4 f5
dog 0.8 0.2 0.1 -0.1 -0.4
this -0.2 0 0.1 0.8 0.3
love 0.1 0.2 0.7 -0.5 0.1
cute 0.2 0.2 0.8 -0.4 0
King – Man =
Queen
Feature engineering:Text

Confidential17
Date
1/1/2018
2/1/2018
3/1/2018
4/1/2018
5/1/2018
6/1/2018
7/1/2018
8/1/2018
9/1/2018
10/1/2018
Day Month Year weekday weeknum
1 1 2018 2 1
2 1 2018 3 1
3 1 2018 4 1
4 1 2018 5 1
5 1 2018 6 1
6 1 2018 7 1
7 1 2018 1 2
8 1 2018 2 2
9 1 2018 3 2
10 1 2018 4 2
Date Sales
1/1/2018 100
2/1/2018 150
3/1/2018 160
4/1/2018 200
5/1/2018 210
6/1/2018 150
7/1/2018 160
8/1/2018 120
9/1/2018 80
10/1/2018 70
Lag1 Lag2
- -
100 -
150 100
160 150
200 160
210 200
150 210
160 150
120 160
80 120
MA2
-
100
125
155
180
205
180
155
140
100
Feature engineering: Time series

Confidential18
Common open source packages used in ML

Confidential19
• Optimizing maximum depth
• Tree expansion (loss vs depth)
• Learning rate
• Number of trees
1
2
3 3
2
Hyper parameter tuning

Validation approach
Train
data
Past Future
x0 x1 x2 x3 y
0.94 0.27 0.80 0.34 1
0.02 0.22 0.17 0.84 0
0.83 0.11 0.23 0.42 1
0.74 0.26 0.03 0.41 0
0.08 0.29 0.76 0.37 0
0.71 0.76 0.43 0.95 1
0.08 0.72 0.97 0.04 0
0.84 0.79 0.89 0.05 1
0.94 0.27 0.80 0.34 1
0.02 0.22 0.17 0.84 0
0.83 0.11 0.23 0.42 1
0.74 0.26 0.03 0.41 0
0.08 0.29 0.76 0.37 0
0.71 0.76 0.43 0.95 1
0.08 0.72 0.97 0.04 0
0.84 0.79 0.89 0.05 1
K=4
pred
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
Fold : 1
pred
0.96
0.03
0.00
0.00
0.00
0.00
0.00
0.00
Train
Predict pred
0.96
0.03
0.90
0.12
0.00
0.00
0.00
0.00
Fold : 2Fold : 3
pred
0.96
0.03
0.90
0.12
0.03
0.77
0.00
0.00
Fold : 4
pred
0.96
0.03
0.90
0.12
0.03
0.77
0.18
0.91

Confidential21
Genetic algorithm approach
x1 x2 x3 x4 y
0.14 0.69 0.01 0.71 1
0.22 0.44 0.45 0.69 1
0.12 0.35 0.51 0.23 0
0.22 0.42 0.79 0.60 1
0.93 0.82 0.72 0.50 1
0.32 0.58 0.28 0.22 0
0.95 0.59 0.68 0.09 1
0.34 0.58 0.35 0.81 0
0.05 0.80 0.28 0.86 1
0.23 0.49 0.63 0.03 0
0.05 0.34 0.53 0.73 1
0.74 0.02 0.33 0.56 0
Iteration1/10
X% accuracy
Feature Importance
x4 1
x2 0.5
x3 0.2
x1 0.01
Tuned
Iteration2/10
+ Random
Transformation
sqrt(x2)
Z% accuracy
Feature Importance
x2+x4 1
x4 0.4
sqrt(x2) 0.3
x3 0.2
x2 0.05

Confidential22
Feature ranking based on permutations
0.5 0.52 0.69 0.2 1
0.18 0.94 0.57 0.68 0
0.67 0.6 0.76 0.91 1
0.25 0.68 0.57 0.07 1
0.83 0.07 0.76 0.56 0
0.49 0.92 0.64 0.02 0
0.17 0.7 0.57 1
0.96 0.75 0.59 1
0.96 0.14 0.09 0
0.61 0.38 0.07 0
0.07 0.13 0.87 1
0.66 0.27 0.48 1
0.41
0.56
0.21
0.61
0.26
0.02
0.02
0.56
0.21
0.41
0.61
0.26
Train
Validation
fit
score
80% accuracy 70% accuracy
(drop)
Shuffle The 10% difference
is how important that
feature is
Bring back to
normal
Repeat for
all features

Confidential23
Stacking
X0 x1 x2 xn y
0.17 0.25 0.93 0.79 1
0.35 0.61 0.93 0.57 0
0.44 0.59 0.56 0.46 0
0.37 0.43 0.74 0.28 1
0.96 0.07 0.57 0.01 1
A
X0 x1 x2 xn y
0.89 0.72 0.50 0.66 0
0.58 0.71 0.92 0.27 1
0.10 0.35 0.27 0.37 0
0.47 0.68 0.30 0.98 0
0.39 0.53 0.59 0.18 1
B
X0 x1 x2 xn y
0.29 0.77 0.05 0.09 ?
0.38 0.66 0.42 0.91 ?
0.72 0.66 0.92 0.11 ?
0.70 0.37 0.91 0.17 ?
0.59 0.98 0.93 0.65 ?
C
pred0 pred1 pred2 y
0.24 0.72 0.70 0
0.95 0.25 0.22 1
0.64 0.80 0.96 0
0.89 0.58 0.52 0
0.11 0.20 0.93 1
B1
pred0 pred1 pred2 y
0.50 0.50 0.39 ?
0.62 0.59 0.46 ?
0.22 0.31 0.54 ?
0.90 0.47 0.09 ?
0.20 0.09 0.61 ?
C1
Train algorithm 0 on A and make predictions for B and C and save to B1, C1
Train algorithm 3 on B1 and make predictions for C1
Preds3
0.45
0.23
0.99
0.34
0.05
Consider datasets A,B,C.Target variable (y) is known for A,B

Confidential24
“The ability to explain or to present in understandable terms to a human.”
– Finale Doshi-Velez and Been Kim. “Towards a rigorous science of interpretable
machine learning.” arXiv preprint. 2017. https://arxiv.org/pdf/1702.08608.pdf
FAT*: https://www.fatml.org/resources/principles-for-accountable-algorithms
XAI: https://www.darpa.mil/program/explainable-artificial-intelligence
Machine Learning Interpretability?

Confidential25
Interpretability and accuracy trade-off
Linear Models
Exact explanations for
approximate models.
Machine Learning
Approximate explanations
for exact models.

Confidential26
Decision Tree Surrogate Models
Partial Dependence and Individual Conditional Expectation
https://github.com/jphall663/interpretable_machine_learning_with_python
Machine Learning Interpretability examples

Confidential27
Scoring Pipeline
• Python
• MOJO (Java)

Confidential28
Thank you
marios@h2o.ai
in/mariosmichailidis/
@StackNet_
https://www.facebook.com/StackNet/

Confidential29
Introduction to H2O
Driverless AI
John Spooner
Director of Solution Engineering - EMEA

DRIVERLESS AI
HANDS ON WORKSHOP

CONFIDENTIAL
CONFIDENTIAL
The Game Plan
Set up Driverless AI Environment
Introduction into the data sets
Showtime – Time to get hands on

CONFIDENTIAL
CONFIDENTIAL
Please Create an account on Aquarium
Navigate to aquarium.h2o.ai
Create a new account

CONFIDENTIAL
CONFIDENTIAL
Enter your details
Create a new account

CONFIDENTIAL
CONFIDENTIAL
Sign-in to aquarium
Go to Browse Labs
Choose David Whiting’s Lab

CONFIDENTIAL
CONFIDENTIAL
Sign-in to aquarium
Go to Browse Labs
Choose David Whiting’s Lab
And click start lab

CONFIDENTIAL
CONFIDENTIAL
Prediction problem

CONFIDENTIAL
CONFIDENTIAL
Time Series Forecasting Problem

Confidential38
Experiment Settings
ACCURACY
• Relative accuracy – higher values
should lead to higher confidence
in model performance (accuracy)
• Impacts things such as level of
data sampling, how many models
are used in the final ensemble,
parameter tuning level, among
others
Accuracy Time Interpretability
• Relative time for completing
the experiment
• Higher settings mean:
– More iterations are performed
to find the best set of features
– Longer “early stopping”
threshold
• Relative interpretability – higher
values favor more interpretable
models
• The higher the interpretability
setting, the lower the complexity
of the engineered features and
of the final model(s)

CONFIDENTIAL
CONFIDENTIAL
Accuracy
Accuracy
Max Rows x
Cols
Ensemble
Level
Target
Transformation
Parameter
Tuning Level
Num Folds
Only First
Fold Model
Distribution
Check
1 100K 0 False 0 3 True No
2 1M 0 False 0 3 True No
3 50M 0 True 1 3 True No
4 100M 0 True 1 3-4 True No
5 200M 1 True 1 3-4 True Yes
6 500M 2 True 1 3-5 True Yes
7 750M <=3 True 2 3-10 Auto Yes
8 1B <=3 True 2 4-10 Auto Yes
9 2B <=3 True 3 4-10 Auto Yes
10 10B <=4 True 3 4-10 Auto Yes

CONFIDENTIAL
CONFIDENTIAL
Time
Time Iterations
Early Stopping
Rounds
1 1-5 None
2 10 5
3 30 5
4 40 5
5 50 10
6 100 10
7 150 15
8 200 20
9 300 30
10 500 50

CONFIDENTIAL
CONFIDENTIAL
Interpretability
Interpretability
Ensemble
Level
Target
Transformation
Feature Engineering
Feature Pre-
Pruning
Monotonicity
Constraints
1 - 3 <= 3 None Disabled
4 <= 3 Inverse None Disabled
5 <= 3 Anscombe
Clustering (ID, distance)
Truncated SVD
None Disabled
6 <= 2
Logit
Sigmoid
Feature selection Disabled
7 <= 2 Frequency Encoding Feature selection Enabled
8 <= 1 4th
Root Feature selection Enabled
9 <= 1
Square
Square Root
Bulk Interactions (add,
subtract, multiply, divide)
Weight of Evidence
Feature selection Enabled
10 0
Identity
Unit Box
Log
Date Decompositions
Number Encoding
Target Encoding
Text (TF-IDF, Frequency)
Feature selection Enabled
Good
start

CONFIDENTIAL
CONFIDENTIAL
Showtime
Data Set Review
Visualisation
Automated Machine Learning
Machine Learning Interpribility
Free Style

CONFIDENTIAL
CONFIDENTIAL
Showtime
Customer Churn Prediction Sales Forecasting
Train Dataset – card_train
Test Dataset – card_test
Train Dataset – Cannabis_train
Test Dataset – Cannabis_test
Target: target (Default)
Features: Credit Limit, Sex, Education, Marriage, Age, Status 1-6,
Bill Amt 1-6, PayAmt1-6
Target – Weekly_sales
Features – Is holiday, type, temperature, fuel_price,
markdown,CPI, unemployment,sample_weight
Groups: Sales_Date, Organisation
Tasks
- Review data
- Visualistion of data
- Build a predictive model.
- Correctly specify training and test dataset
- Apply the following knob settings: 1,1,10
- Once experiment has finished, explore results. Download the
experiment summary
- Interpret the model
Tasks
- Review data
- Visualistion of data
- Build a predictive model. Remember to:
- Correctly specify training and test dataset
- Set the time column to sales_date
- Set group columns to Sales_Date, Organisation
- Apply the following settings: 4,4,1
- Once experiment has finished, explore results. Download the
experiment summary
- Interpret the model
Additional tasks
- Modify the experiment settings to improve the model accuracy
- Modify the expert settings to understand advanced capability
Additional tasks
- Modify the experiment settings to improve the model accuracy
- Modify the expert settings to understand advanced capability

The Data
Column Description
ID ID of each client
CreditLimit Amount of credit in NT dollars
Sex Gender (M, F)
Education (1: graduate school, 2: university, 3: high school, 4: others, 5-6: unknown)
Marriage Marital status (M, S, D, O)
Age Age in years
Status1 … Status6 Repayment status in September, 2005 – April, 2005
BillAmt1 … BillAmt6 Amount of bill statement in September, 2005 – April, 2005 (NT dollar)
PayAmt1 … PayAmt6 Amount of previous payment in September, 2005 – April, 2005 (NT dollar)
Default Default payment (1=yes, 0=no)

Payment History Data
1 Month
Ago
Status1:
≤0, 1
BillAmt1
PayAmt1
2 Months
Ago
Status2:
≤0, 1, 2
BillAmt2
PayAmt2
3 Months
Ago
Status3:
≤0, 1, 2, 3
BillAmt3
PayAmt3
...
6 Months
Ago
Status6:
≤0, 1, ..., 6
BillAmt6
PayAmt6
Status:
-2: No balance
-1: Paid in full
0: Minimum balance paid
1: One month late
2: Two months late
etc.

Introduction & Hands-on with H2O Driverless AI

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Introduction & Hands-on with H2O Driverless AI

Similaire à Introduction & Hands-on with H2O Driverless AI (20)

Plus de Sri Ambati

Plus de Sri Ambati (17)

Dernier

Dernier (20)

Introduction & Hands-on with H2O Driverless AI