As part of the 2018 HPCC Systems Summit Community Day event:
Decision Tree based Machine Learning algorithms are among the most powerful and easiest to use. The new Learning Trees bundle from HPCC Systems provides a robust library of tree-based methods including Random Forests, Gradient Boosted Trees, and Boosted Forests. How do these algorithms work, and which are likely to provide the best results? This talk provides details of various Tree-Based learning methods and insight into the data science involved.
Roger is a Senior Architect working with John Holt on the Machine Learning Team. He recently joined HPCC Systems from CA Technologies. Roger has been involved in the implementation and utilization of machine learning and AI techniques for many years, and has over 20 patents in diverse areas of software technology.
2. Major Classes of Supervised Machine Learning
• Linear Models
• Neural Network Models
• Decision Tree Models
Learning Trees
3. Goals
• Overview of Learning Tree algorithms
• Science and intuitions behind Learning Trees
• HPCC Systems LearningTrees Bundle
6. Basic Decision Tree Example

XOR Truth Table:
Feature 1   Feature 2   Result
    0           0         0
    0           1         1
    1           0         1
    1           1         0

[Diagram: a decision tree for XOR. Start: Feature 1 > .5? Each branch then asks: Feature 2 > .5? The four leaves return 0, 1, 1, 0, matching the truth table.]
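The XOR tree above can be written directly as nested conditionals. A minimal illustrative sketch (the function name `xor_tree` is ours, not part of the bundle):

```python
def xor_tree(feature1, feature2):
    # Root split on Feature 1; each branch then splits on Feature 2.
    # The four leaves reproduce the XOR truth table.
    if feature1 > 0.5:
        return 0 if feature2 > 0.5 else 1
    else:
        return 1 if feature2 > 0.5 else 0
```

Note that no single split separates the classes; it takes two levels of splits, which is exactly why XOR defeats linear models but not trees.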
7. What is happening Geometrically?

[Diagram: the same XOR tree viewed in the Feature 1 / Feature 2 plane. The splits at .5 on each axis divide the plane into four quadrants, and each quadrant maps to one leaf value (0 or 1).]
8. How do we learn a Decision Tree?

[Diagram: each successive split moves the data from High Entropy / Low Order, to Less Entropy / More Order, and finally to Zero Entropy / Pure Order.]
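The entropy reduction pictured above is what drives tree growth: at each branch the learner greedily picks the split with the largest information gain (drop in entropy). A minimal sketch, with helper names of our own choosing:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits: 0 for a pure set, 1 for a 50/50 binary mix.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    # Entropy of the parent minus the size-weighted entropy of the children.
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

mixed = ['a', 'a', 'b', 'b']                            # high entropy / low order
gain = information_gain(mixed, ['a', 'a'], ['b', 'b'])  # pure children: maximal gain
```

A split that produces pure children (zero entropy) earns the full gain of 1 bit here; a split that leaves both sides mixed earns nothing.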
9. Learning Tree Major Strengths and Weaknesses
Strengths (less data preparation and analysis needed):
• No Data Assumptions
• Non-Linear
• Discontinuous
Weaknesses (more data needed):
• No extrapolation / interpolation
• Fairly large training set
• Marginally descriptive
10. Limitations of a Decision Tree
• Deterministic Phenomena Only
• Do not generalize well for stochastic problems
How can that be?
11. Generalization and Population
• Target = Population
• Sample << Population
• Overfitting = Fitting to the noise in the sample
• Specifically – Spurious correlation
[Diagram: a small sample drawn from a much larger population.]
13. “Bagging” Theory -- Training

[Diagram: the Training Data is drawn into multiple “bootstrap” samples; a Learner is trained on each sample, producing one Model per sample; the Models together form a Composite Model.]
14. “Bagging” Theory -- Prediction

[Diagram: the Test Data is fed to every Model in the Composite Model; each produces its own Predictions, which are aggregated into the Final Predictions.]
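The two bagging diagrams can be sketched end-to-end: bootstrap sampling, one model per sample, and a majority vote to aggregate. This is a toy illustration with a stand-in learner, not the bundle's implementation; in a real ensemble `train_stub` would be a decision tree:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # Sample with replacement: each model sees duplicates and misses
    # roughly a third of the original rows.
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    return Counter(predictions).most_common(1)[0][0]

def train_stub(sample):
    # Stand-in "Learner": memorize the majority label per feature value,
    # falling back to the sample's overall majority for unseen values.
    by_x = {}
    for x, y in sample:
        by_x.setdefault(x, []).append(y)
    table = {x: majority_vote(ys) for x, ys in by_x.items()}
    default = majority_vote([y for _, y in sample])
    return lambda x: table.get(x, default)

def bag(data, n_models, seed=0):
    # Training: one model per bootstrap sample -> composite model.
    rng = random.Random(seed)
    models = [train_stub(bootstrap_sample(data, rng)) for _ in range(n_models)]
    # Prediction: every model votes; the aggregate wins.
    return lambda x: majority_vote([m(x) for m in models])

data = [(0, 'a')] * 4 + [(1, 'b')] * 2
composite = bag(data, n_models=25)
```

Any single model may have memorized a skewed bootstrap sample; the vote across 25 of them washes that noise out.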
15. Random Forest
• Build a forest of diverse decision trees
• Vote / average the results from all trees
• A Random Forest is:
  • Worse than the best possible tree
  • Better than the worst tree
  • About as correct as you can reliably get given the training set and the population
• “Eliminates” the overfitting problem
16. Building a Diverse Forest
• Subsampling
• Start each tree with its own “bootstrap” sample
• Sample from the training set with replacement
• Each tree gets some duplicates and sees about two thirds of the samples
• Feature Restriction
• At each branch, choose a random subset of features
• Choose the best split from that set of features
• Forces trees to take different growth paths
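The feature-restriction step above can be sketched as follows: at each branch only a random subset of features is even considered, which is what forces the trees down different growth paths. A sketch using Gini impurity as the split criterion (the names and the choice of impurity measure are ours; the bundle's internals may differ):

```python
import random

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels, n_candidate_features, rng):
    # Feature restriction: only a random subset of features is eligible
    # at this branch; pick the lowest-impurity split among them.
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), n_candidate_features)
    best, best_score = None, float('inf')
    for f in candidates:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_score, best = score, (f, t)
    return best, best_score

rows = [(0, 5), (1, 5), (0, 9), (1, 9)]
labels = ['lo', 'lo', 'hi', 'hi']
split, score = best_split(rows, labels, n_candidate_features=2,
                          rng=random.Random(0))
```

With `n_candidate_features` below the full feature count, two trees facing the same node will often be offered different candidate features and therefore branch differently, even on identical data.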
17. Effect of forest size

[Chart: Accuracy as a function of the Number of trees, from 1 to 1000.]
18. Random Forest Summary
• Regression and Classification
• All the benefits and limitations of Decision Trees
• Very accurate, given sufficient data
• Generalizes well
• Easy to use
• No data assumptions
• Few parameters – little effect on accuracy
• Almost always works well with default parameters
• Parallelizes well
20. “Boosting” Theory -- Training

[Diagram: a “Weak Learner” is trained on the Training Data to produce a Model; each subsequent Weak Learner is trained on the residuals (remaining errors) of the models so far; the chain of Models forms a Composite Model.]
21. “Boosting” Theory -- Predictions

[Diagram: the Test Data is fed to every Model in the Composite; the individual Predictions are summed (+) to produce the Final Prediction.]
22. Gradient Boosted Trees (GBT)
• Use truncated Decision Trees as the Weak Learner
• Train each tree to correct the errors from the previous tree
• Add predictions together to form final prediction
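The three bullets map directly onto a few lines of code: a truncated tree (here a one-split "stump") as the weak learner, each round fitting the previous rounds' residuals, and predictions added together. A toy regression sketch under our own naming; the 0.5 learning rate is an illustrative choice, not the bundle's default:

```python
def fit_stump(xs, ys):
    # Depth-1 regression tree: one threshold, mean prediction on each side.
    best, best_err = None, float('inf')
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if err < best_err:
            best_err, best = err, (t, lm, rm)
    t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds, learning_rate=0.5):
    # Each round fits a stump to the residuals left by the model so far,
    # then the residuals shrink by what the new stump explains.
    stumps, residuals = [], list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    # Final prediction: the sum of all the scaled stump predictions.
    return lambda x: sum(learning_rate * s(x) for s in stumps)

xs = [0, 1, 2, 3]
ys = [0.0, 0.0, 1.0, 1.0]
model = gradient_boost(xs, ys, n_rounds=50)
```

Because each stump corrects what the previous ones missed, training is inherently sequential, which is the parallelization weakness noted on the next slide.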
23. GBT Strengths and Weaknesses
Strengths
• High Accuracy -- Sometimes better than Random Forest
• Tuneable
• Good generalization
Weaknesses
• Only supports Regression (natively)
• More difficult to use
• Training is sequential – Cannot be parallelized
24. GBT – Under the hood
• Generalization
  • Multiple diverse trees
  • Aggregated Results
• Boosting
  • Using residuals focuses on the more difficult items (i.e. larger errors)
25. Can we separate Generalization and Boosting?
• Generalization can be parallelized (à la Random Forest)
• Boosting is necessarily sequential
• What if we generalized and then boosted?
• Would it require fewer sequential iterations to achieve the same results?
26. Boosted Forests
• Use a (truncated) Random Forest as the weak learner
• Boost between forests à la GBT
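The idea can be sketched by combining the two previous sketches: each boosting round trains a small bagged forest on the current residuals, and the forests' predictions are summed. This is our own toy illustration of the concept, not the bundle's implementation; stumps stand in for full trees:

```python
import random

def fit_stump(xs, ys):
    # One-split regression tree: the base learner inside each forest.
    best, best_err = None, float('inf')
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if err < best_err:
            best_err, best = err, (t, lm, rm)
    t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_forest(xs, ys, n_trees, rng):
    # Small bagged forest: each stump sees its own bootstrap sample;
    # the forest's prediction is the average over its stumps.
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        sx, sy = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(sx)) > 1:        # need at least two x values to split on
            trees.append(fit_stump(sx, sy))
    return lambda x: sum(t(x) for t in trees) / len(trees)

def boosted_forest(xs, ys, n_rounds, n_trees=10, seed=0):
    # Boost between forests: each round fits a whole forest to the
    # residuals of the rounds so far, GBT-style.
    rng = random.Random(seed)
    forests, residuals = [], list(ys)
    for _ in range(n_rounds):
        f = fit_forest(xs, residuals, n_trees, rng)
        forests.append(f)
        residuals = [r - f(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(f(x) for f in forests)

xs = [0, 1, 2, 3]
ys = [0.0, 0.0, 1.0, 1.0]
model = boosted_forest(xs, ys, n_rounds=5)
```

Within each round the forest's trees are independent and can be trained in parallel; only the handful of boosting rounds is sequential, which is the split between generalization and boosting asked about on the previous slide.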
27. Boosted Forest Findings
• No need to truncate the forest. Works well with fully developed trees.
• Requires far fewer iterations (e.g. 5 versus 100)
• Regression significantly more accurate than Random Forest.
• Generally more accurate than Gradient Boosted Trees
• Insensitive to training parameters = Easy to use – Works with defaults (like Random Forest).
• Few iterations needed to achieve maximal boosting = HPCC Systems efficient
31. LearningTrees Bundle

[Diagram: the Learning Trees bundle comprises Decision Tree, Random Forest, Gradient Boosted Trees, and Boosted Forest.]
32. LearningTrees Bundle additional capabilities
• Features can be any type of numeric data:
  • Real values
  • Integers
  • Binary
  • Categorical
• Output can be categorical (Classification Forest) or real-valued (Regression Forest).
• Multinomial classification is supported directly.
• Myriad Interface -- Multiple separate forests can be grown at once, producing a composite model in parallel. This can further improve performance on an HPCC Systems Cluster.
• Accuracy Assessment -- Produces a range of statistics regarding the accuracy of the model given a set of test data.
• Feature Importance -- Analyses the importance of each feature in the decision process.
• Decision Distance -- Provides insight into the similarity of different data points in a multi-dimensional decision space.
• Uniqueness Factor -- Indicates how isolated a given data point is relative to other points in decision space.
33. Choosing an Algorithm

[Flowchart: Start → Is the problem deterministic? Yes → Use a Single Tree. No → Regression or Classification? Classification → Use Random Forest (Classification Forest). Regression → choose among Gradient Boosted Trees, Random Forest (Regression Forest), and Boosted Forest, depending on whether a standardized method is needed and whether the user is an experienced ML user.]