Hands on Classification: Decision Trees and Random Forests
Predictive Analytics Meetup Group
Machine Learning Workshop
December 2, 2012




Daniel Gerlanc, Managing Director
Enplus Advisors, Inc.
www.enplusadvisors.com
dgerlanc@enplusadvisors.com
© Daniel Gerlanc, 2012.
All rights reserved.


If you'd like to use this material for any purpose, please contact
dgerlanc@enplusadvisors.com
What You'll Learn

• Intuition behind decision trees and
  random forests
• Implementation in R
• Assessing the results
Dataset

• Chemical Analysis of Italian Wines
• http://www.parvus.unige.it/
• 178 records, 14 attributes
Follow along
> library(mlclass)
> data(wine)
> str(wine)
'data.frame':        178 obs. of 14 variables:
 $ Type         : Factor w/ 2 levels "Grig","No": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alcohol       : num 14.2 13.2 13.2 14.4 13.2 ...
 $ Malic       : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash         : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
What are Decision Trees?


• Model for partitioning an input space
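What's partitioning?

See rf-1.R
Create the 1st split.

[Figure: plot regions labeled Not G and G]

See rf-1.R
Create the 2nd split.

[Figure: plot regions labeled Not G, G, and G]

See rf-1.R
Create more splits…

[Figure: plot regions labeled Not G, G, Not G, and G]

I drew this one in.
Another view of partitioning

See rf-2.R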
Use R to do the partitioning.

 library(rpart)       # provides rpart()
 library(rpart.plot)  # provides prp()
 tree.1 <- rpart(Type ~ ., data=wine)
 prp(tree.1, type=4, extra=2)

• See the 'rpart' and 'rpart.plot' R packages.
• Many parameters available to control the fit.

 See rf-2.R
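As a rough illustration of the parameters that control the fit, the sketch below passes an rpart.control object to rpart(). It uses R's built-in iris data as a stand-in, since the wine data comes from the mlclass package. The values chosen for minsplit, cp, and maxdepth are arbitrary assumptions, not the settings used in rf-2.R.

```r
library(rpart)

# Stand-in data: built-in iris (the slides use `wine` from the mlclass package).
ctrl <- rpart.control(
  minsplit = 10,    # minimum records in a node before a split is attempted
  cp       = 0.01,  # minimum relative improvement in fit required to attempt a split
  maxdepth = 3      # maximum depth of any node (the root counts as depth 0)
)

tree.ctrl <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)

# Table of cross-validated error by complexity parameter, useful when pruning.
printcp(tree.ctrl)
```

Larger cp or smaller maxdepth values give smaller trees, which is one way to limit overfitting in a single tree.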
Make predictions on a test dataset

 predict(tree.1, newdata=wine, type="vector")
How'd it do?
Guessing: 60.11%
CART: 94.38% Accuracy
 •     Precision: 92.96% (66 / 71)
 •     Sensitivity/Recall: 92.96% (66 / 71)

                           Actual
Predicted                  Grig                  No
Grig                       (1) 66                (3) 5
No                         (2) 5                 (4) 102
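The figures on this slide can be recomputed from the confusion matrix. The sketch below is a minimal base-R version; the variable names (cm, baseline, and so on) are illustrative, and "Guessing" is assumed to mean always predicting the majority class.

```r
# Confusion matrix from the slide: rows = predicted, columns = actual.
cm <- matrix(c(66,   5,
                5, 102),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("Grig", "No"),
                             Actual    = c("Grig", "No")))

accuracy  <- sum(diag(cm)) / sum(cm)                  # (66 + 102) / 178
precision <- cm["Grig", "Grig"] / sum(cm["Grig", ])   # correct Grig / predicted Grig
recall    <- cm["Grig", "Grig"] / sum(cm[, "Grig"])   # correct Grig / actual Grig
baseline  <- max(colSums(cm)) / sum(cm)               # always guess the majority class

# Report all four as percentages, rounded to two decimals.
round(100 * c(baseline = baseline, accuracy = accuracy,
              precision = precision, recall = recall), 2)
```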
Decision Tree Problems

• Overfitting the data
• May not use all relevant features
• Perpendicular decision boundaries
Random Forests


One Decision Tree

Many Decision Trees (Ensemble)
Random Forest Fixes

• Overfitting the data
• May not use all relevant features
• Perpendicular decision boundaries
Building RF

For each tree:
  Sample from the data
  At each split, sample from the available
  variables
Bootstrap Sampling
Sample Attributes at each split
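The two sampling steps can be sketched in base R. This is only an illustration of the idea, not the internals of any particular package. The sizes (178 records, 13 predictors) follow the wine data described earlier, and the seed and variable names are assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 178  # records in the wine data
p <- 13   # predictors (14 attributes minus the Type label)

# Bootstrap sample: draw n row indices with replacement.
boot.idx <- sample(n, size = n, replace = TRUE)

# "Out-of-bag" rows: records that were never drawn for this tree.
oob.idx <- setdiff(seq_len(n), boot.idx)

# Fraction of distinct records in the sample; tends toward 1 - 1/e (about 63.2%).
length(unique(boot.idx)) / n

# At each split, only a random subset of the predictors is considered.
mtry       <- floor(sqrt(p))           # square root of the number of predictors
split.vars <- sample(p, size = mtry)   # column indices eligible at this split
```

Each tree repeats the first step once and the second step at every split, so no two trees see quite the same data or the same candidate variables.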
Motivations for RF

• Create uncorrelated trees
• Variance reduction
• Subspace exploration
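One standard way to see why uncorrelated trees help: for B identically distributed trees, each with variance sigma2 and pairwise correlation rho, the variance of their average is rho * sigma2 + (1 - rho) * sigma2 / B. The sketch below evaluates that expression. The function name ensemble_var and the input values are illustrative assumptions.

```r
# Variance of the average of B equally correlated, identically distributed trees.
ensemble_var <- function(rho, sigma2, B) {
  rho * sigma2 + (1 - rho) * sigma2 / B
}

# With many trees the second term vanishes, so the correlation sets the floor.
ensemble_var(rho = c(0.9, 0.5, 0.1), sigma2 = 1, B = 500)
```

Adding trees shrinks only the second term, so lowering the correlation between trees is what drives the remaining variance down.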
Random Forests
library(randomForest)  # provides randomForest()
rffit.1 <- randomForest(Type ~ ., data=wine)

See rf-3.R
RF Parameters in R
Most important parameters are:

Variable   Description                        Default

ntree      Number of trees                    500

mtry       Number of variables to randomly    • square root of # predictors for classification
           select at each node                • # predictors / 3 for regression

nodesize   Minimum number of records in a     • 1 for classification
           terminal node                      • 5 for regression

sampsize   Number of records to select in     • all N records when sampling with replacement
           each bootstrap sample                (about 63.2% of them distinct)
                                              • 63.2% of N when sampling without replacement
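A hedged sketch of setting these parameters explicitly rather than relying on the defaults. It uses the built-in iris data as a stand-in for wine, and the call runs only if the randomForest package is installed. sampsize is set to the full row count, which is what randomForest uses when it samples with replacement. The names `defaults` and `rffit.explicit` are illustrative.

```r
# Stand-in data: built-in iris (4 predictors); the slides use `wine` from mlclass.
p <- ncol(iris) - 1

defaults <- list(
  ntree    = 500,
  mtry     = floor(sqrt(p)),  # classification default
  nodesize = 1,               # classification default
  sampsize = nrow(iris)       # default when sampling with replacement
)

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)  # arbitrary seed
  rffit.explicit <- randomForest::randomForest(
    Species ~ ., data = iris,
    ntree    = defaults$ntree,
    mtry     = defaults$mtry,
    nodesize = defaults$nodesize,
    sampsize = defaults$sampsize
  )
  print(rffit.explicit)  # shows the out-of-bag error estimate and confusion matrix
}
```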
How'd it do?
Guessing Accuracy: 60.11%
Random Forest: 98.31% Accuracy
 •     Precision: 95.77% (68 / 71)
 •     Sensitivity/Recall: 100% (68 / 68)

                           Actual
Predicted                  Grig                No
Grig                       (1) 68              (3) 3
No                         (2) 0               (4) 107
Tuning RF: Grid Search
This is the default.

See rf-4.R
Tuning is Expensive
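A minimal sketch of a grid search, assuming the out-of-bag (OOB) error is the tuning criterion; rf-4.R may do this differently. It again uses iris as a stand-in for wine, and the grid values are arbitrary. The fitting loop runs only if the randomForest package is installed.

```r
# Every combination of the candidate values: 4 x 2 = 8 forests to fit.
grid <- expand.grid(mtry = 1:4, ntree = c(100, 500))
grid$oob.error <- NA_real_

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)  # arbitrary seed
  for (i in seq_len(nrow(grid))) {
    fit <- randomForest::randomForest(Species ~ ., data = iris,
                                      mtry = grid$mtry[i],
                                      ntree = grid$ntree[i])
    # err.rate has one row per tree; the last row is the error of the full forest.
    grid$oob.error[i] <- fit$err.rate[grid$ntree[i], "OOB"]
  }
  print(grid[order(grid$oob.error), ])  # best settings first
}
```

The cost grows with the product of the grid dimensions, since each added parameter value multiplies the number of forests that must be fit.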
Benefits of RF

• Good performance with default settings
• Relatively easy to make parallel
• Many implementations
 • R, Weka, RapidMiner, Mahout
References

•   A. Liaw and M. Wiener (2002). Classification and Regression by
    randomForest. R News 2(3), 18--22.

•   Breiman, Leo. Classification and Regression Trees. Belmont, Calif:
    Wadsworth International Group, 1984. Print.

•   Breiman, Leo and Adele Cutler. Random forests.
    http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm

Editor's notes

  1. John, Dave, and I have spoken a bit about the motivations for using Machine Learning techniques.