This document provides an overview of the decision tree and random forest machine learning algorithms. It discusses how decision trees partition the feature space to make predictions and how random forests address overfitting by building an ensemble of decorrelated trees, each grown on a bootstrap sample with a random subset of attributes considered at each split. The document demonstrates fitting decision trees and random forests in R on a wine dataset and compares their performance, with the random forest achieving higher accuracy. It also discusses tuning random forests via grid search.
Machine Learning Workshop
1. Hands-on Classification: Decision Trees and Random Forests
Predictive Analytics Meetup Group
Machine Learning Workshop
December 2, 2012
Daniel Gerlanc, Managing Director
Enplus Advisors, Inc.
www.enplusadvisors.com
dgerlanc@enplusadvisors.com
12. Use R to do the partitioning.
library(rpart)       # recursive partitioning trees
library(rpart.plot)  # prp() for plotting rpart trees
tree.1 <- rpart(Type ~ ., data=wine)
prp(tree.1, type=4, extra=2)
• See the 'rpart' and 'rpart.plot' R packages.
• Many parameters available to control the fit.
See rf-2.R
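As an illustrative sketch of those fit controls (the specific values here are arbitrary placeholders, not settings from the workshop), a control object can be passed to rpart:

library(rpart)
# Illustrative values only:
#   cp       - complexity parameter: minimum improvement a split must add
#   minsplit - minimum number of records in a node before a split is attempted
#   maxdepth - maximum depth of any node in the tree
ctrl <- rpart.control(cp = 0.01, minsplit = 20, maxdepth = 10)
tree.ctrl <- rpart(Type ~ ., data = wine, control = ctrl)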
13. Make predictions on a test dataset
predict(tree.1, newdata=wine, type="vector")
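As a minimal sketch (assuming wine$Type holds the true labels), the predictions can be cross-tabulated against the actual classes; type="class" returns factor labels, which are easier to tabulate than the numeric codes from type="vector":

pred <- predict(tree.1, newdata = wine, type = "class")
# Rows are predicted classes, columns are actual classes.
table(Predicted = pred, Actual = wine$Type)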
14. How'd it do?
Guessing (majority class): 60.11% accuracy
CART: 94.38% Accuracy
• Precision: 92.96% (66 / 71)
• Sensitivity/Recall: 92.96% (66 / 71)
           Actual
Predicted  Grig   No
Grig         66    5
No            5  102
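The metrics above follow directly from this matrix. A sketch of the arithmetic, assuming the positive class appears as the level "Grig" in both factors:

cm <- table(Predicted = pred, Actual = wine$Type)
accuracy  <- sum(diag(cm)) / sum(cm)                 # (66 + 102) / 178 = 94.38%
precision <- cm["Grig", "Grig"] / sum(cm["Grig", ])  # 66 / 71 = 92.96%
recall    <- cm["Grig", "Grig"] / sum(cm[, "Grig"])  # 66 / 71 = 92.96%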
15. Decision Tree Problems
• Overfitting the data
• May not use all relevant features
• Decision boundaries are always perpendicular to the feature axes (axis-aligned splits)
23. RF Parameters in R
Most important parameters are:
• ntree: number of trees. Default: 500.
• mtry: number of variables to randomly select at each node. Default: square root of # predictors for classification; # predictors / 3 for regression.
• nodesize: minimum number of records in a terminal node. Default: 1 for classification; 5 for regression.
• sampsize: number of records to select in each bootstrap sample. Default: 63.2% of records.
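Written out as an explicit call, those defaults look roughly like this sketch (one caveat: in the randomForest package, the default sampsize when sampling with replacement is the full n, of which roughly 63.2% of records are unique within each bootstrap sample):

library(randomForest)
set.seed(1)                  # illustrative seed for reproducibility
p <- ncol(wine) - 1          # number of predictor columns
rf.1 <- randomForest(Type ~ ., data = wine,
                     ntree = 500,              # number of trees
                     mtry = floor(sqrt(p)),    # classification default
                     nodesize = 1,             # classification default
                     sampsize = nrow(wine))    # records per bootstrap sample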
24. How'd it do?
Guessing Accuracy: 60.11%
Random Forest: 98.31% Accuracy
• Precision: 95.77% (68 / 71)
• Sensitivity/Recall: 100% (68 / 68)
           Actual
Predicted  Grig   No
Grig         68    3
No            0  107
27. Benefits of RF
• Good performance with default settings
• Relatively easy to parallelize (see the sketch after this list)
• Many implementations
• R, Weka, RapidMiner, Mahout
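As a hedged sketch of the parallelism point (the foreach/doParallel setup is an assumption, not workshop code), sub-forests can be grown on separate workers and merged with randomForest::combine():

library(randomForest)
library(doParallel)   # also attaches foreach and parallel

cl <- makeCluster(4)  # 4 workers; adjust to your machine
registerDoParallel(cl)

# Grow 125 trees on each of 4 workers, then merge into one 500-tree forest.
rf.par <- foreach(nt = rep(125, 4), .combine = randomForest::combine,
                  .packages = "randomForest") %dopar%
  randomForest(Type ~ ., data = wine, ntree = nt)

stopCluster(cl)

Note that combine() drops the out-of-bag error components, so a merged forest should be evaluated on held-out data.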
28. References
• Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
• Breiman, Leo. Classification and Regression Trees. Belmont, Calif: Wadsworth International Group, 1984. Print.
• Breiman, Leo and Adele Cutler. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm
Editor's notes
John, Dave, and I have spoken a bit about the motivations for using Machine Learning techniques.