Learning Trees – Decision Tree Learning Methods
Roger Dev
October 9, 2018
Major Classes of Supervised Machine Learning
• Linear Models
• Neural Network Models
• Decision Tree Models
Presentation Title Here (Insert Menu > Header & Footer > Apply) 2
Learning Trees
=
Goals
• Overview of Learning Tree algorithms
• Science and intuitions behind Learning Trees
• HPCC Systems LearningTrees Bundle
The Animal Game
Decision Tree Basics
Basic Decision Tree Example

XOR Truth Table
Feature 1   Feature 2   Result
    0           0          0
    0           1          1
    1           0          1
    1           1          0

Start
Feature 1 > .5?
    Yes: Feature 2 > .5?
        Yes: 0
        No:  1
    No: Feature 2 > .5?
        Yes: 1
        No:  0
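The XOR tree in this example can be written as two nested threshold tests. The following is a minimal sketch; the function name `xor_tree` is a hypothetical label, not something from the slides.

```python
def xor_tree(feature1: float, feature2: float) -> int:
    """Walk the two-level tree; every test is 'feature > 0.5?'."""
    if feature1 > 0.5:
        # Feature 1 is "on": the result is 1 only if Feature 2 is "off".
        return 0 if feature2 > 0.5 else 1
    # Feature 1 is "off": the result is 1 only if Feature 2 is "on".
    return 1 if feature2 > 0.5 else 0


# Rows of the XOR truth table: (Feature 1, Feature 2, Result).
truth_table = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
predictions = [xor_tree(f1, f2) for f1, f2, _ in truth_table]
print(predictions)  # [0, 1, 1, 0]
```

Because each test is a threshold rather than an equality check, the same tree also classifies non-binary inputs such as (0.9, 0.1).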
What is happening Geometrically?
[Figure: plot with axes Feature 1 and Feature 2, each marked at .5, shown beside the decision tree from the Basic Decision Tree Example]
How do we learn a Decision Tree?
High Entropy / Low Order
Less Entropy / More Order
Zero Entropy / Pure Order
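The three entropy levels named here can be made concrete with a small calculation. The sketch below assumes the usual Shannon entropy in bits and the usual "information gain" criterion for scoring a split; the slides do not state which measure the author uses, and the function names and sample lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(
        (n / total) * math.log2(total / n) for n in Counter(labels).values()
    )


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    total = len(parent)
    children = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - children


mixed = [0, 1, 0, 1]    # high entropy / low order
mostly = [0, 0, 0, 1]   # less entropy / more order
pure = [1, 1, 1, 1]     # zero entropy / pure order

print(entropy(mixed), round(entropy(mostly), 3), entropy(pure))  # 1.0 0.811 0.0
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))            # 1.0
```

A learner built on this idea would try candidate splits and keep the one with the largest gain, repeating until the leaves are pure or some stopping rule applies.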
Learning Tree Major Strengths and Weaknesses
Strengths
• No Data Assumptions
• Non-Linear
• Discontinuous
Weaknesses
• No extrapolation / interpolation
• Fairly large training set
• Marginally descriptive
Less Data Preparation and Analysis needed
More Data needed
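The "No extrapolation / interpolation" weakness can be illustrated with a tiny one-dimensional regression tree. This is a sketch under stated assumptions: the training data (y = 2x on x = 0..10), the squared-error split rule, and all function names are hypothetical choices for illustration, and the input points are assumed to be sorted with distinct x values.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(points):
    """Fit a fully grown 1-D regression tree to (x, y) pairs sorted by x."""
    ys = [y for _, y in points]
    if len(points) == 1 or sse(ys) == 0:
        return sum(ys) / len(ys)  # leaf: mean target value
    # Try every boundary between neighbouring points; keep the lowest-error one.
    best_i = min(range(1, len(points)), key=lambda i: sse(ys[:i]) + sse(ys[i:]))
    threshold = (points[best_i - 1][0] + points[best_i][0]) / 2
    return (threshold, fit_tree(points[:best_i]), fit_tree(points[best_i:]))


def predict(tree, x):
    """Follow thresholds down to a leaf value."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = right if x > threshold else left
    return tree


train = [(float(x), 2.0 * x) for x in range(11)]
tree = fit_tree(train)

print(predict(tree, 5.0))    # 10.0, a value seen in training
print(predict(tree, 5.4))    # 10.0, no interpolation between 10.0 and 12.0
print(predict(tree, 100.0))  # 20.0, no extrapolation beyond the last leaf
```

The tree can only return values stored in its leaves, so it answers 20.0 for x = 100 even though the underlying relationship would give 200.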
Limitations of a Decision Tree
• Deterministic Phenomena Only
• Does not generalize well for stochastic problems
How can that be?
Generalization and Population
• Target = Population
• Sample << Population
• Overfitting = Fitting to the noise in the sample
• Specifically – Spurious correlation
[Figure labels: Population, Sample 1]
Random Forest
“Bagging” Theory -- Training
Training Data
    → “Bootstrap” Sample → Learner → Model
    → “Bootstrap” Sample → Learner → Model
    → “Bootstrap” Sample → Learner → Model
    . . .
Models → Composite Model
“Bagging” Theory -- Prediction
Test Data → Composite Model (Model, Model, Model, . . .)
    → Predictions, Predictions, Predictions, . . .
    → Aggregate → Final Predictions
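The "Aggregate" step can be sketched as a majority vote for classification and a mean for regression. The function names and the sample model outputs below are hypothetical; the slides do not specify how ties or weights are handled, so this sketch ignores both.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(predictions_per_model):
    """Majority vote across models for each test row."""
    return [Counter(row).most_common(1)[0][0] for row in zip(*predictions_per_model)]


def aggregate_regression(predictions_per_model):
    """Average across models for each test row."""
    return [mean(row) for row in zip(*predictions_per_model)]


# Hypothetical outputs: one inner list per model, one entry per test row.
class_preds = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]]
value_preds = [[2.0, 4.0], [3.0, 5.0], [4.0, 6.0]]

print(aggregate_classification(class_preds))  # [1, 0, 1, 1]
print(aggregate_regression(value_preds))      # [3.0, 5.0]
```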
Random Forest
• Build a forest of diverse decision trees
• Vote / average the results from all trees
• A Random Forest is:
    • Worse than the best possible tree
    • Better than the worst tree
    • About as correct as you can reliably get given the training set and the population
• “Eliminates” the overfitting problem
Building a Diverse Forest
• Subsampling
    • Start each tree with its own “bootstrap” sample
    • Sample from the training set with replacement
    • Each tree gets some duplicates and sees about two thirds of the samples
• Feature Restriction
    • At each branch, choose a random subset of features
    • Choose the best split from that set of features
    • Forces trees to take different growth paths
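Both sources of diversity can be sketched with the standard library. The "about two thirds" figure corresponds to the expected fraction of distinct rows in a bootstrap sample, which approaches 1 - 1/e (roughly 0.632) as the training set grows. The function names, the seed, and the subset size of 4 out of 16 features are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def bootstrap_indices(n_rows):
    """Draw n_rows row indices from the training set, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, subset_size):
    """Pick the candidate features for one branch, without replacement."""
    return rng.sample(range(n_features), subset_size)


n_rows = 10_000
sample = bootstrap_indices(n_rows)
unique_fraction = len(set(sample)) / n_rows
print(round(unique_fraction, 3))  # close to 1 - 1/e, about 0.632

# Candidate features for a single branch (hypothetical sizes).
subset = random_feature_subset(n_features=16, subset_size=4)
print(subset)
```

Each tree would call `bootstrap_indices` once, and `random_feature_subset` once per branch, so no two trees are likely to grow along the same path.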
Effect of forest size
[Figure: Accuracy (y-axis) versus Number of trees (x-axis, marked at 1, 100, 1000)]
Random Forest Summary
• Regression and Classification
• All the benefits and limitations of Decision Trees
• Very accurate, given sufficient data
• Generalizes well
• Easy to use
• No data assumptions
• Few parameters – little effect on accuracy
• Almost always works well with default parameters
• Parallelizes well
Boosted Trees
“Boosting” Theory -- Training
Training Data → “Weak Learner” → Model
    - Residuals → “Weak Learner” → Model
    - Residuals → “Weak Learner” → Model
    . . .
Models → Composite Model
“Boosting” Theory -- Predictions
Test Data → Composite Model (Model, Model, Model, . . .)
Prediction + Prediction + Prediction + . . . = Final Prediction
Gradient Boosted Trees (GBT)
• Use truncated Decision Trees as the Weak Learner
• Train each tree to correct the errors from the previous tree
• Add predictions together to form final prediction
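The three bullets above can be sketched for one-dimensional regression. This is a minimal illustration under stated assumptions, not the LearningTrees implementation: the weak learner is a single-split stump (the most truncated tree possible), the loss is squared error so "errors" are plain residuals, and the data, learn rate, and function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Best single-split regressor on 1-D data: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learn_rate):
    """Fit each stump to the residuals left by the stumps before it."""
    stumps, residuals = [], list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learn_rate * stump_predict(stump, x)
                     for x, r in zip(xs, residuals)]
    return stumps


def boosted_predict(stumps, x, learn_rate):
    """Final prediction: the scaled sum of every stump's prediction."""
    return sum(learn_rate * stump_predict(s, x) for s in stumps)


def mse(stumps, learn_rate):
    return sum((boosted_predict(stumps, x, learn_rate) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]  # hypothetical non-linear target

one_round = boost(xs, ys, n_rounds=1, learn_rate=0.5)
many_rounds = boost(xs, ys, n_rounds=50, learn_rate=0.5)
print(mse(one_round, 0.5), mse(many_rounds, 0.5))  # training error shrinks with more rounds
```

Note that the loop in `boost` cannot start round k+1 until round k has produced its residuals, which is the sequential dependency discussed in the slides that follow.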
GBT Strengths and Weaknesses
Strengths
• High Accuracy -- Sometimes better than Random Forest
• Tuneable
• Good generalization
Weaknesses
• Only supports Regression (natively)
• More difficult to use
• Training is sequential – Cannot be parallelized
GBT – Under the hood
• Generalization
    • Multiple diverse trees
    • Aggregated Results
• Boosting
    • Using residuals focuses on the more difficult items (i.e. larger errors)
Can we separate Generalization and Boosting?
• Generalization can be parallelized (à la Random Forest)
• Boosting is necessarily sequential
• What if we generalized and then boosted?
• Would it require fewer sequential iterations to achieve the same results?
Boosted Forests
• Use a (truncated) Random Forest as the weak learner
• Boost between forests à la GBT
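The idea of boosting between forests can be sketched by swapping the weak learner in a boosting loop for a small bagged forest. This is an illustration under loud assumptions, not the LearningTrees bundle's algorithm: the trees here are single-split stumps on one feature, there is no learn rate, and the data, seed, sizes, and function names are all hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed for a repeatable sketch


def fit_stump(xs, targets):
    """Best single-split regressor on 1-D data: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, targets, n_trees):
    """Bagging step: every stump is trained on its own bootstrap sample."""
    forest = []
    while len(forest) < n_trees:
        rows = [rng.randrange(len(xs)) for _ in xs]
        sample_x = [xs[i] for i in rows]
        if len(set(sample_x)) < 2:
            continue  # nothing to split on in this sample; draw again
        forest.append(fit_stump(sample_x, [targets[i] for i in rows]))
    return forest


def forest_predict(forest, x):
    """Average the stumps, as a regression forest would."""
    return sum(stump_predict(s, x) for s in forest) / len(forest)


def boost_forests(xs, ys, n_levels, n_trees):
    """Boosting step: each forest is fit to the residuals of the forests before it."""
    levels, residuals = [], list(ys)
    for _ in range(n_levels):
        forest = fit_forest(xs, residuals, n_trees)
        levels.append(forest)
        residuals = [r - forest_predict(forest, x) for x, r in zip(xs, residuals)]
    return levels


def boosted_forest_predict(levels, x):
    """Final prediction: sum of the per-level forest predictions."""
    return sum(forest_predict(forest, x) for forest in levels)


xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]  # hypothetical non-linear target
levels = boost_forests(xs, ys, n_levels=5, n_trees=20)
mse = sum((boosted_forest_predict(levels, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(len(levels), len(levels[0]), mse)
```

In this arrangement the trees inside one level are independent of each other and could be grown in parallel; only the handful of levels has to run in sequence.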
Boosted Forest Findings
• No need to truncate the forest. Works well with fully developed trees.
• Requires far fewer iterations (e.g. 5 versus 100)
• Regression significantly more accurate than Random Forest.
• Generally more accurate than Gradient Boosted Trees
• Insensitive to training parameters = Easy to use – Works with defaults (like Random Forest).
• Few iterations needed to achieve maximal boosting = HPCC Systems efficient
Accuracy Comparison of Random Forest, Gradient Boosted Trees and Boosted Forest

Method   Tree Depth   Trees / level   Boost Levels   Total Trees   R**2
RF       -            20              -              20            0.734
RF       -            100             -              100           0.74
RF       -            140             -              140           0.741
RF       -            300             -              300           0.745
GBT      7            1               20             20            0.651
GBT      7            1               35             35            0.671
GBT      7            1               50             50            0.711
GBT      7            1               75             75            0.716
GBT      7            1               100            100           0.719
GBT      7            1               120            120           0.717
GBT      7            1               140            140           0.718
GBT      5            1               140            140           0.75
BF       20           20              5              100           0.77
BF       15           20              7              140           0.776
BF       10           20              15             300           0.775
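The R**2 column is a goodness-of-fit score. The slides do not say how it was computed; the sketch below assumes the standard coefficient of determination, and the function name and sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_residual / ss_total


actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))                # 1.0 for a perfect fit
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))  # 0.0 when always predicting the mean
```

Under this definition a score of 0.77 means the model accounts for about 77% of the variance in the test targets.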
Gradient Boosted Trees versus Boosted Forest – Sensitivity to training parameters

R2 and (#iterations) for GBT with various Reg Params
Depth / Learn Rate    0.1           0.25          0.5           0.75          1
5                     .714 (772)    .761 (296)    .720 (145)    .652 (100)    .5 (84)
7                     .686 (281)    .684 (100)    .597 (48)     .694 (32)     .521 (24)
12                    .586 (61)     .595 (21)     .662 (13)     .528 (9)      .552 (6)
20                    .556 (25)     .491 (6)      .521 (5)      .560 (2)      .409 (2)

R2 and (#iterations) for BF(20) with various Reg Params
Depth / Learn Rate    0.1           0.25          0.5           0.75          1
5                     -             .778 (517)    .797 (264)    .786 (174)    .775 (135)
7                     .790 (417)    .773 (166)    .810 (82)     .790 (55)     .790 (42)
12                    .791 (111)    .770 (42)     .801 (22)     .783 (15)     .762 (11)
20                    .758 (56)     .738 (23)     .770 (11)     .754 (8)      0.777 (6)
LearningTrees Bundle
LearningTrees Bundle
Learning Trees
    Decision Tree
    Random Forest
    Gradient Boosted Trees
    Boosted Forest
LearningTrees Bundle additional capabilities
• Features can be any type of numeric data:
    • Real values
    • Integers
    • Binary
    • Categorical
• Output can be categorical (Classification Forest) or real-valued (Regression Forest).
• Multinomial classification is supported directly.
• Myriad Interface -- Multiple separate forests can be grown at once, and produce a composite model in parallel. This can further improve the performance on an HPCC Systems Cluster.
• Accuracy Assessment -- Produces a range of statistics regarding the accuracy of the model given a set of test data.
• Feature Importance -- Analyses the importance of each feature in the decision process.
• Decision Distance -- Provides insight into the similarity of different data points in a multi-dimensional decision space.
• Uniqueness Factor -- Indicates how isolated a given data point is relative to other points in decision space.
Choosing an Algorithm
Start
Problem Deterministic?
    Yes: Use Single Tree
    No: Regression or Classification?
        Classification: Use Random Forest (Classification Forest)
        Regression: Need Standardized Method? / Experienced ML User? (each Yes / No), leading to one of:
            Use Gradient Boosted Trees
            Use Random Forest (Regression Forest)
            Use Boosted Forest
Closing
• Contact:
• Roger.Dev@LexisNexisRisk.com
• Blogs:
• https://hpccsystems.com/LearningTrees

Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 

Learning Trees - Decision Tree Learning Methods

  • 1. Learning Trees – Decision Tree Learning Methods (Roger Dev, October 9, 2018)
  • 2. Major Classes of Supervised Machine Learning
      • Linear Models
      • Neural Network Models
      • Decision Tree Models (= Learning Trees)
  • 3. Goals
      • Overview of Learning Tree algorithms
      • Science and intuitions behind Learning Trees
      • HPCC Systems LearningTrees Bundle
  • 6. Basic Decision Tree Example
      XOR Truth Table:
        Feature 1   Feature 2   Result
            0           0          0
            0           1          1
            1           0          1
            1           1          0
      Decision tree:
        Start: Feature 1 > .5?
          Yes: Feature 2 > .5?
            Yes: 0
            No: 1
          No: Feature 2 > .5?
            Yes: 1
            No: 0
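The tree on slide 6 can be written out as nested comparisons. This is only a sketch to make the diagram concrete; the function name `xor_tree` is an illustrative choice, not something from the slides or the LearningTrees bundle.

```python
def xor_tree(feature1: float, feature2: float) -> int:
    """Walk the two-level tree: test Feature 1 first, then Feature 2."""
    if feature1 > 0.5:
        # Feature 1 is "on": the result is 1 only when Feature 2 is "off".
        return 0 if feature2 > 0.5 else 1
    # Feature 1 is "off": the result simply follows Feature 2.
    return 1 if feature2 > 0.5 else 0


# Rows of the XOR truth table: (Feature 1, Feature 2, Result).
TRUTH_TABLE = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

for f1, f2, expected in TRUTH_TABLE:
    assert xor_tree(f1, f2) == expected
```

Because the comparisons use thresholds rather than equality, the same function also classifies nearby points such as (0.1, 0.9), which is the geometric picture on the next slide.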
  • 7. What is happening Geometrically? (Figure: Feature 1 and Feature 2 axes, each split at .5, together with the decision tree from slide 6)
  • 8. How do we learn a Decision Tree? (Figure panels: High Entropy / Low Order; Less Entropy / More Order; Zero Entropy / Pure Order)
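The three panels on slide 8 can be put into numbers. The sketch below assumes Shannon entropy as the measure of disorder and information gain as the split criterion; the slide does not say which measure the LearningTrees bundle actually uses, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def information_gain(parent, left, right):
    """Entropy removed by splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(parent) - weighted


mixed = [0, 0, 1, 1]        # high entropy / low order
mostly_one = [1, 1, 1, 0]   # less entropy / more order
pure = [1, 1, 1, 1]         # zero entropy / pure order

print(entropy(mixed), entropy(mostly_one), entropy(pure))
print(information_gain(mixed, [0, 0], [1, 1]))
```

Under this assumption, learning a tree amounts to repeatedly choosing the split with the largest gain until the leaves are pure or some stopping rule applies.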
  • 9. Learning Tree Major Strengths and Weaknesses
      Strengths
        • No Data Assumptions
        • Non-Linear
        • Discontinuous
      Weaknesses
        • No extrapolation / interpolation
        • Fairly large training set
        • Marginally descriptive
      Callouts: Less Data Preparation and Analysis needed; More Data needed
  • 10. Limitations of a Decision Tree
      • Deterministic Phenomena Only
      • Do not generalize well for stochastic problems
      Callout: How can that be?
  • 11. Generalization and Population
      • Target = Population
      • Sample << Population
      • Overfitting = Fitting to the noise in the sample
      • Specifically – Spurious correlation
      Figure labels: Population, Sample 1
  • 12. Random Forest
  • 13. “Bagging” Theory -- Training (Diagram: Training Data → “Bootstrap” Samples → one Learner per sample → one Model per Learner → Composite Model)
  • 14. “Bagging” Theory -- Prediction (Diagram: Test Data → each Model in the Composite Model → Predictions → Aggregate → Final Predictions)
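The bagging training and prediction flows on slides 13 and 14 can be sketched in a few lines. Everything here is an illustrative assumption: the toy nearest-neighbor learner stands in for a real tree learner, the data are synthetic, and none of the names come from the LearningTrees bundle.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) rows from data with replacement."""
    return [rng.choice(data) for _ in data]


def nearest_neighbor_learner(sample):
    """Toy learner: predict the target of the closest training point."""
    def model(x):
        return min(sample, key=lambda row: abs(row[0] - x))[1]
    return model


def train_bagged(data, learner, n_models, seed=0):
    """Training: one model per bootstrap sample; the list is the composite model."""
    rng = random.Random(seed)
    return [learner(bootstrap_sample(data, rng)) for _ in range(n_models)]


def predict_bagged(models, x):
    """Prediction: every model predicts, then the predictions are averaged."""
    return mean(model(x) for model in models)


# Hypothetical toy data: y = 2x plus Gaussian noise.
data_rng = random.Random(1)
data = [(x, 2 * x + data_rng.gauss(0, 1)) for x in range(20)]

composite = train_bagged(data, nearest_neighbor_learner, n_models=25)
print(predict_bagged(composite, 7.5))
```

Averaging is the aggregation step for regression; a classifier would take a vote instead.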
  • 15. Random Forest
      • Build a forest of diverse decision trees
      • Vote / average the results from all trees
      • A Random Forest is:
        • Worse than the best possible tree
        • Better than the worst tree
        • About as correct as you can reliably get given the training set and the population
      • “Eliminates” the overfitting problem
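The "vote / average" step can be made concrete with two small helpers. This is a sketch with illustrative names; in particular, the tie-breaking rule (the first label encountered wins) is just what `Counter.most_common` happens to do, not a documented choice of any forest implementation.

```python
from collections import Counter
from statistics import mean


def vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(predictions)


print(vote(["cat", "dog", "cat"]))
print(average([1.0, 2.0, 3.0]))
```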
  • 16. Building a Diverse Forest
      • Subsampling
        • Start each tree with its own “bootstrap” sample
        • Sample from the training set with replacement
        • Each tree gets some duplicates and sees about two thirds of the samples
      • Feature Restriction
        • At each branch, choose a random subset of features
        • Choose the best split from that set of features
        • Forces trees to take different growth paths
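The "about two thirds" figure follows from sampling with replacement: the chance that a given row is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e, so a tree sees roughly 1 - 1/e ≈ 63% of the distinct rows. The simulation below checks this and also shows one feature-restriction draw; the sample count, feature count and subset size are arbitrary illustrative values.

```python
import random

rng = random.Random(42)
n_samples = 10_000

# Subsampling: draw n_samples row indices with replacement.
bootstrap = [rng.randrange(n_samples) for _ in range(n_samples)]
unique_fraction = len(set(bootstrap)) / n_samples
print(f"fraction of distinct rows seen by one tree: {unique_fraction:.3f}")

# Feature restriction: at a branch, only a random subset of features is considered.
n_features = 16
subset_size = 4  # hypothetical choice (square root of n_features)
candidates = rng.sample(range(n_features), subset_size)
print("features considered at this branch:", sorted(candidates))
```

Both sources of randomness push the trees apart, which is what makes their averaged errors partly cancel.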
  • 17. Effect of forest size (Plot: Accuracy versus Number of trees; axis ticks at 1, 100, 1000)
  • 18. Random Forest Summary
      • Regression and Classification
      • All the benefits and limitations of Decision Trees
      • Very accurate, given sufficient data
      • Generalizes well
      • Easy to use
      • No data assumptions
      • Few parameters – little effect on accuracy
      • Almost always works well with default parameters
      • Parallelizes well
  • 19. Boosted Trees
  • 20. “Boosting” Theory -- Training (Diagram: Training Data → “Weak Learner” → Model; Residuals → “Weak Learner” → Model; Residuals → “Weak Learner” → Model; . . . ; the Models together form the Composite Model)
  • 21. “Boosting” Theory -- Predictions (Diagram: Test Data → each Model in the Composite Model → Prediction; Prediction + Prediction + Prediction + . . . = Final Prediction)
  • 22. Gradient Boosted Trees (GBT)
      • Use truncated Decision Trees as the Weak Learner
      • Train each tree to correct the errors from the previous tree
      • Add predictions together to form final prediction
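The three bullets on slide 22 map onto a short loop. The sketch below assumes squared-error loss (so the "errors" are plain residuals), a single numeric feature, and depth-1 trees (stumps) as an extreme case of truncation; the learning rate, tree count and all function names are illustrative and do not describe the LearningTrees implementation.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the threshold that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Train each stump on the residuals left by the stumps before it."""
    trees = []
    residuals = list(ys)
    for _ in range(n_trees):
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        residuals = [r - learning_rate * tree(x) for x, r in zip(xs, residuals)]
    # Final prediction: the (scaled) tree predictions added together.
    return lambda x: sum(learning_rate * tree(x) for tree in trees)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))


xs = [x / 10 for x in range(40)]
ys = [x * x for x in xs]  # hypothetical non-linear target
one_tree = fit_boosted(xs, ys, n_trees=1, learning_rate=1.0)
boosted = fit_boosted(xs, ys, n_trees=20, learning_rate=0.5)
print(mse(one_tree, xs, ys), mse(boosted, xs, ys))
```

Each tree depends on the residuals of all earlier trees, which is why the loop cannot be run in parallel.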
  • 23. GBT Strengths and Weaknesses
      Strengths
        • High Accuracy -- Sometimes better than Random Forest
        • Tuneable
        • Good generalization
      Weaknesses
        • Only supports Regression (natively)
        • More difficult to use
        • Training is sequential – Cannot be parallelized
  • 24. GBT – Under the hood
      • Generalization
        • Multiple diverse trees
        • Aggregated Results
      • Boosting
        • Using residuals focuses on the more difficult items (i.e. larger errors)
  • 25. Can we separate Generalization and Boosting?
      • Generalization can be parallelized (à la Random Forest)
      • Boosting is necessarily sequential
      • What if we generalized and then boosted?
      • Would it require fewer sequential iterations to achieve the same results?
  • 26. Boosted Forests
      • Use a (truncated) Random Forest as the weak learner
      • Boost between forests à la GBT
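One way to read "boost between forests" is to replace the single weak tree in the boosting loop with a small bagged forest. The sketch below is a guess at that structure under several assumptions: squared-error residuals, one numeric feature, stumps as the trees (chosen only to keep the example short), and no learning rate. The names and the default of 5 levels with 20 trees per level are illustrative, and this is not the HPCC Systems implementation.

```python
import random
from statistics import mean


def fit_stump(rows):
    """Depth-1 regression tree on (x, y) rows; a stand-in for a real tree learner."""
    xs = sorted({x for x, _ in rows})
    if len(xs) < 2:  # nothing to split on: predict the mean
        constant = mean(y for _, y in rows)
        return lambda x: constant
    best = None
    for threshold in xs[:-1]:
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        lm, rm = mean(left), mean(right)
        error = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


def fit_forest(rows, n_trees, rng):
    """Bagged forest: one bootstrap sample per tree, predictions averaged."""
    trees = [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]
    return lambda x: mean(tree(x) for tree in trees)


def fit_boosted_forest(rows, n_levels=5, trees_per_level=20, seed=0):
    """Boost between forests: each forest fits the residuals of the levels before it."""
    rng = random.Random(seed)
    forests = []
    residual_rows = list(rows)
    for _ in range(n_levels):
        forest = fit_forest(residual_rows, trees_per_level, rng)
        forests.append(forest)
        residual_rows = [(x, r - forest(x)) for x, r in residual_rows]
    return lambda x: sum(forest(x) for forest in forests)


def mse(model, rows):
    """Mean squared error of a model on (x, y) rows."""
    return mean((y - model(x)) ** 2 for x, y in rows)


rows = [(x / 10, (x / 10) ** 2) for x in range(40)]  # hypothetical target
single_forest = fit_boosted_forest(rows, n_levels=1)
boosted_forest = fit_boosted_forest(rows, n_levels=5)
print(mse(single_forest, rows), mse(boosted_forest, rows))
```

The trees inside each level are independent of one another, so only the loop over levels is sequential.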
  • 27. Boosted Forest Findings
      • No need to truncate the forest. Works well with fully developed trees.
      • Requires far fewer iterations (e.g. 5 versus 100)
      • Regression significantly more accurate than Random Forest.
      • Generally more accurate than Gradient Boosted Trees
      • Insensitive to training parameters = Easy to use – Works with defaults (like Random Forest).
      • Few iterations needed to achieve maximal boosting = HPCC Systems efficient
  • 28. Accuracy Comparison of Random Forest, Gradient Boosted Trees and Boosted Forest
      Method   Tree Depth   Trees / level   Boost Levels   Total Trees   R**2
      RF       -            20              -              20            0.734
      RF       -            100             -              100           0.74
      RF       -            140             -              140           0.741
      RF       -            300             -              300           0.745
      GBT      7            1               20             20            0.651
      GBT      7            1               35             35            0.671
      GBT      7            1               50             50            0.711
      GBT      7            1               75             75            0.716
      GBT      7            1               100            100           0.719
      GBT      7            1               120            120           0.717
      GBT      7            1               140            140           0.718
      GBT      5            1               140            140           0.75
      BF       - 20         20              5              100           0.77
      BF       15           20              7              140           0.776
      BF       10           20              15             300           0.775
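The R**2 column is presumably the usual coefficient of determination, although the slides do not define it. Under that assumption it can be computed as follows; the function name is illustrative.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))       # perfect predictions
print(r_squared(actual, [2.5] * 4))    # always predicting the mean
```

A score of 1 means the predictions match exactly, and a score of 0 means the model does no better than always predicting the mean of the targets.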
  • 29. Gradient Boosted Trees versus Boosted Forest – Sensitivity to training parameters
      R2 and (#iterations) for GBT with various Reg Params
      Depth / Learn Rate   0.1          0.25         0.5          0.75         1
      5                    .714 (772)   .761 (296)   .720 (145)   .652 (100)   .5 (84)
      7                    .686 (281)   .684 (100)   .597 (48)    .694 (32)    .521 (24)
      12                   .586 (61)    .595 (21)    .662 (13)    .528 (9)     .552 (6)
      20                   .556 (25)    .491 (6)     .521 (5)     .560 (2)     .409 (2)
      R2 and (#iterations) for BF(20) with various Reg Params
      Depth / Learn Rate   0.1          0.25         0.5          0.75         1
      5                    -            .778 (517)   .797 (264)   .786 (174)   .775 (135)
      7                    .790 (417)   .773 (166)   .810 (82)    .790 (55)    .790 (42)
      12                   .791 (111)   .770 (42)    .801 (22)    .783 (15)    .762 (11)
      20                   .758 (56)    .738 (23)    .770 (11)    .754 (8)     .777 (6)
  • 30. LearningTrees Bundle
  • 31. LearningTrees Bundle (Diagram labels: Learning Trees; Decision Tree; Random Forest; Gradient Boosted Trees; Boosted Forest)
  • 32. LearningTrees Bundle additional capabilities
      • Features can be any type of numeric data:
        • Real values
        • Integers
        • Binary
        • Categorical
      • Output can be categorical (Classification Forest) or real-valued (Regression Forest).
      • Multinomial classification is supported directly.
      • Myriad Interface -- Multiple separate forests can be grown at once, and produce a composite model in parallel. This can further improve the performance on an HPCC Systems Cluster.
      • Accuracy Assessment -- Produces a range of statistics regarding the accuracy of the model given a set of test data.
      • Feature Importance -- Analyses the importance of each feature in the decision process.
      • Decision Distance -- Provides insight into the similarity of different data points in a multi-dimensional decision space.
      • Uniqueness Factor -- Indicates how isolated a given data point is relative to other points in decision space.
  • 33. Choosing an Algorithm (flowchart)
      Decision nodes, starting from Start:
        • Problem Deterministic? (Yes / No)
        • Regression or Classification?
        • Need Standardized Method? (Yes / No)
        • Experienced ML User? (Yes / No)
      Outcomes:
        • Use Single Tree
        • Use Random Forest (Classification Forest)
        • Use Random Forest (Regression Forest)
        • Use Gradient Boosted Trees
        • Use Boosted Forest
  • 34. Closing
      • Contact:
        • Roger.Dev@LexisNexisRisk.com
      • Blogs:
        • https://hpccsystems.com/LearningTrees