2. Presentation Outline
• Algorithm Background
  o Decision Trees
  o Random Forest
  o Gradient Boosted Machines (GBM)
• H2O Implementations
  o Code examples
  o Description of parameters and general usage
3. Decision Trees: Concept
• Separate the data according to a series of questions
  o Age > 9.5?
• The questions are found automatically to optimize separation of the data points by the “target”
Example decision tree: predicting survival of Titanic passengers
(Source: Wikimedia, CART tree of Titanic survivors)
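The "questions are found automatically" step can be sketched in plain Python. This is an illustrative sketch, not the H2O implementation: the data below is made up (not the actual Titanic set), and a real tree applies this search recursively and across every feature.

```python
# Toy data (made up): (age, survived) pairs
data = [(2, 1), (6, 1), (9, 1), (10, 0), (25, 0), (30, 0), (47, 1), (52, 0)]

def gini(rows):
    """Gini impurity of a list of (feature, label) rows."""
    if not rows:
        return 0.0
    p = sum(label for _, label in rows) / len(rows)
    return 2 * p * (1 - p)

def best_split(rows):
    """Try every 'age > t?' question and keep the one that minimizes
    the weighted impurity of the two resulting groups."""
    best_t, best_score = None, float("inf")
    ages = sorted({age for age, _ in rows})
    # candidate thresholds: midpoints between consecutive distinct ages
    for lo, hi in zip(ages, ages[1:]):
        t = (lo + hi) / 2
        left = [r for r in rows if r[0] <= t]
        right = [r for r in rows if r[0] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

t, score = best_split(data)
print(f"Best question: age > {t}?  (weighted Gini = {score:.3f})")
```

On this toy data the search recovers "Age > 9.5?" on its own, because that threshold produces the purest split of the target.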
4. Decision Trees: Practical Use
Strengths
• Non-linear
• Robust to correlated features
• Robust to feature distributions
• Robust to missing values
• Simple to comprehend
• Fast to train
• Fast to score
Weaknesses
• Poor accuracy
• Cannot project (extrapolate) beyond the training data
• Inefficiently fits linear relationships
5. Improved Decision Trees: Ensembles
• Bootstrap aggregation (bagging): Random Forest
  o Fit many trees against different samples of the data and average together
• Boosting: GBM
  o Fits consecutive trees where each solves for the net error of the prior trees
6. Random Forest
Conceptual
• Combine multiple decision trees, each fit to a random sample of the original data
• Randomly samples
  o Rows
  o Columns
• Reduce variance, with minimal increase in bias
Practical
• Strengths
  o Easy to use
    • Few parameters
    • Well-established default values for parameters
  o Robust
  o Competitive accuracy on most data sets
• Weaknesses
  o Slow to score
  o Lack of transparency
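The row-and-column sampling above can be sketched with depth-1 "trees" in plain Python. This is a toy sketch under simplifying assumptions (made-up data, median-split stumps instead of full trees, one random column per tree); H2O's actual Random Forest fits full trees and parallelizes the work.

```python
import random
import statistics

random.seed(0)

# Toy regression data (made up): rows of (x1, x2), target y
X = [(1, 5), (2, 3), (3, 8), (4, 2), (5, 9), (6, 1), (7, 7), (8, 4)]
y = [1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]

def fit_stump(rows, targets, col):
    """Depth-1 'tree': split one column at its median, predict group means."""
    t = statistics.median(r[col] for r in rows)
    left = [yy for r, yy in zip(rows, targets) if r[col] <= t]
    right = [yy for r, yy in zip(rows, targets) if r[col] > t]
    lmean = statistics.mean(left) if left else statistics.mean(targets)
    rmean = statistics.mean(right) if right else statistics.mean(targets)
    return lambda r: lmean if r[col] <= t else rmean

def fit_forest(rows, targets, n_trees=25):
    """Bagging: each stump sees a bootstrap sample of the rows
    and a randomly chosen column."""
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(rows)) for _ in rows]   # sample rows
        col = random.randrange(len(rows[0]))                # sample a column
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], col))
    return forest

def predict(forest, r):
    """Average the individual trees' predictions together."""
    return statistics.mean(tree(r) for tree in forest)

forest = fit_forest(X, y)
print(predict(forest, (1, 5)), predict(forest, (8, 4)))
```

Averaging many high-variance trees, each fit to a different bootstrap sample, is what reduces variance with minimal increase in bias.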
7. Gradient Boosted Machines (GBM)
Conceptual
• Boosting: ensemble of weak learners*
• Fits consecutive trees where each solves for the net loss of the prior trees
• Results of new trees are applied partially to the entire solution
Practical
• Strengths
  o Often the best possible model
  o Robust
  o Directly optimizes cost function
• Weaknesses
  o Overfits
    • Need to find proper stopping point
  o Sensitive to noise and extreme values
  o Several hyperparameters
  o Lack of transparency
* the notion of “weak” is being challenged in practice
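The "fit to the net error, apply partially" loop can be sketched in plain Python. This is a toy sketch (made-up 1-D data, stumps as the weak learners), not the H2O implementation; the `learn_rate` factor plays the role of the partial application of each new tree.

```python
import statistics

# Toy 1-D regression data (made up)
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]

def fit_stump(xs, residuals):
    """Weak learner: the single split minimizing squared error."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = statistics.mean(left), statistics.mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_gbm(xs, ys, n_trees=50, learn_rate=0.1):
    """Each new stump fits the residual (net error) of the ensemble so far;
    its output is applied only partially, scaled by learn_rate."""
    base = statistics.mean(ys)
    trees = []
    pred = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [yy - p for yy, p in zip(ys, pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + learn_rate * tree(x) for p, x in zip(pred, xs)]
    return lambda x: base + learn_rate * sum(t(x) for t in trees)

model = fit_gbm(X, y)
print([round(model(x), 2) for x in X])
```

Because each round fits the training residuals directly, the training error keeps shrinking as trees are added, which is exactly why a proper stopping point is needed to avoid overfitting.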
8. Trees in H2O
• Individual tree fitting is performed in parallel
• Shared histograms calculate cut-points
• Greedy search of histogram bins, optimizing squared error
9. Explore Further through Examples
• I have H2O installed
• I have R installed
• I have the H2O World data sets