Understanding Random Forests
Marc Garcia
PyData Madrid - April 9th, 2016
1 / 44
Understanding Random Forests - Marc Garcia
Introduction
About me: @datapythonista
Python programmer since 2006
Master in AI from UPC
Working for Bank of America
http://datapythonista.github.io
Overview: Understanding Random Forests
How decision trees work
Training Random Forests
Practical example
Analyzing model parameters
How decision trees work
Using a decision tree
Simple example:
Event attendance: binary classification
2 explanatory variables: age and distance
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(df[['age', 'distance']], df['attended'])
cart_plot(dtree)
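The snippet above is not self-contained: df and cart_plot are the talk's own objects. A minimal runnable sketch of the same idea, with a made-up toy dataset (the ages match the later slide; the distances and labels are invented for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the talk's df: two explanatory variables, binary target
df = pd.DataFrame({
    'age':      [18, 19, 21, 27, 29, 34, 38, 42, 49, 54, 62, 64],
    'distance': [100, 250, 300, 5000, 120, 80, 7000, 90, 8500, 60, 40, 30],
    'attended': [False, False, True, False, True, True,
                 False, True, False, True, True, True],
})

dtree = DecisionTreeClassifier()
dtree.fit(df[['age', 'distance']], df['attended'])

# An unpruned tree grows until its leaves are pure, so it fits the
# training data perfectly
print(dtree.score(df[['age', 'distance']], df['attended']))  # 1.0
```

cart_plot is the speaker's own plotting helper; the standard alternative is sklearn.tree.plot_tree (or export_graphviz in older versions).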
Data visualization
Decision boundary visualization
What does the model look like?
def decision_tree_model(age, distance):
    if distance >= 2283.11:
        if age >= 40.00:
            if distance >= 6868.86:
                if distance >= 8278.82:
                    return True
                else:
                    return False
            else:
                return True
        else:
            return False
    else:
        if age >= 54.50:
            if age >= 57.00:
                return True
            else:
                return False
        else:
            return True
Training: Basic algorithm
def train_decision_tree(x, y):
    feature, value = get_best_split(x, y)
    x_left, y_left = x[x[feature] < value], y[x[feature] < value]
    if len(y_left.unique()) > 1:
        left_node = train_decision_tree(x_left, y_left)
    else:
        left_node = None
    x_right, y_right = x[x[feature] >= value], y[x[feature] >= value]
    if len(y_right.unique()) > 1:
        right_node = train_decision_tree(x_right, y_right)
    else:
        right_node = None
    return Node(feature, value, left_node, right_node)
Best split
Candidate split 1:

    age       18 19 21 27 29 34 38 42 49 54 62 64
    attended   F  F  T  F  T  T  F  T  F  T  T  T

    Split   True  False
    Left       0      1
    Right      7      4

Candidate split 2:

    age       18 19 21 27 29 34 38 42 49 54 62 64
    attended   F  F  T  F  T  T  F  T  F  T  T  T

    Split   True  False
    Left       0      2
    Right     7      3
Best split algorithm
def get_best_split(x, y):
    best_split = None
    best_entropy = 1.
    for feature in x.columns.values:
        column = x[feature]
        for value in column.unique():  # iterate candidate threshold values
            # count items of each class on the left side of the split
            a = (y[column < value] == class_a_value).sum()
            b = (y[column < value] == class_b_value).sum()
            left_weight = (a + b) / len(y.index)
            left_entropy = entropy(a, b)
            # and on the right side
            a = (y[column >= value] == class_a_value).sum()
            b = (y[column >= value] == class_b_value).sum()
            right_weight = (a + b) / len(y.index)
            right_entropy = entropy(a, b)
            split_entropy = left_weight * left_entropy + right_weight * right_entropy
            if split_entropy < best_entropy:
                best_split = (feature, value)
                best_entropy = split_entropy
    return best_split
Entropy
For a given subset:

    entropy = −Pr_attending · log2(Pr_attending) − Pr_¬attending · log2(Pr_¬attending)    (1)

Note that for pure splits it’s assumed that 0 · log2 0 = 0
Entropy algorithm
import math

def entropy(a, b):
    # a and b are the item counts of the two classes in the subset;
    # assumes both are non-zero (pure splits use the convention 0 · log2 0 = 0)
    total = a + b
    prob_a = a / total
    prob_b = b / total
    return (- prob_a * math.log(prob_a, 2)
            - prob_b * math.log(prob_b, 2))
Information gain
For a given split:

    information_gain = entropy_parent − (items_left / items_total · entropy_left + items_right / items_total · entropy_right)    (2)
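Putting entropy and information gain together, we can check numerically which of the two candidate splits from the earlier table is better. The counts are taken from those tables; the helper names below are mine, not the talk's:

```python
import math

def entropy(a, b):
    """Entropy of a node with `a` items of one class and `b` of the other."""
    total = a + b
    result = 0.0
    for count in (a, b):
        if count:  # pure nodes: 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(parent, left, right):
    """Each argument is a (class_a_count, class_b_count) pair."""
    total = sum(parent)
    weighted = (sum(left) / total * entropy(*left)
                + sum(right) / total * entropy(*right))
    return entropy(*parent) - weighted

# The whole node holds 7 attendees and 5 no-shows
gain1 = information_gain((7, 5), (0, 1), (7, 4))  # candidate split 1
gain2 = information_gain((7, 5), (0, 2), (7, 3))  # candidate split 2
print(gain1, gain2)  # candidate 2 separates the classes better
```

Candidate 2 wins because its pure left node covers two samples instead of one, leaving a cleaner right node.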
Practical example
Etsy dataset
Features:
Feedback (number of reviews)
Age (days since shop creation)
Admirers (likes)
Items (number of products)
Response variable:
Sales (number)
Samples:
58,092
Source:
www.bigml.com
Feature distributions
Sales visualization
Feature correlation
sales admirers age feedback items
sales 1.000000 0.458261 0.181045 0.955949 0.502423
admirers 0.458261 1.000000 0.340939 0.401995 0.268985
age 0.181045 0.340939 1.000000 0.184238 0.167535
feedback 0.955949 0.401995 0.184238 1.000000 0.458955
items 0.502423 0.268985 0.167535 0.458955 1.000000
Correlation to sales
Decision tree visualization
Decision tree visualization (top)
Decision tree visualization (left)
Decision tree visualization (right)
What is wrong with the tree?
Only one variable used to predict in some cases
Lack of stability:
If feedback = 207 we predict 109 sales
If feedback = 208 we predict 977 sales
Our model can change dramatically if we add a few more samples to the dataset
Overfitting: We make predictions based on a single sample
Training Random Forests
Ensembles: The board of directors
Why do companies have a board of directors?
What is best?
The best point of view
A mix of good points of view
How good was our best tree?
What about a mix of not-so-good trees?
Ensemble of trees
How do we build optimized trees that are still different from each other?
We need to randomize them
Samples: bagging
Features
Splits
Bagging: Uniform sampling with replacement
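A sketch of what bagging does for one tree, using NumPy (variable names and the seed are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10

# Draw n_samples row indices uniformly, with replacement: some rows appear
# several times, others not at all (those are the "out-of-bag" rows)
indices = rng.integers(0, n_samples, size=n_samples)

in_bag = set(indices.tolist())
out_of_bag = set(range(n_samples)) - in_bag
print(sorted(indices.tolist()), sorted(out_of_bag))
```

On average about 63% of distinct rows land in each bootstrap sample, so roughly a third of the data is left out of any given tree.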
Randomizing features
Even with bagging, trees will usually make their top splits on the same feature
The max_features parameter controls how many features each split considers:
Regression: use all features
Classification: use the square root of the number of features
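In scikit-learn this is the max_features argument of the forest classes; its default has changed across versions, so it is safest to set it explicitly. A sketch on synthetic data (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# 'sqrt' -> each split considers sqrt(n_features) randomly chosen features,
# the rule of thumb for classification mentioned above
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt',
                             random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

For regression the analogous choice would be max_features=1.0 (all features).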
Randomizing splits
The original decision tree considers one candidate split per sample value
We can consider only a random subset of those splits instead
We obtain randomized trees
Training is faster
sklearn implements ExtraTreeRegressor and ExtraTreeClassifier
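The ensemble counterparts are ExtraTreesRegressor and ExtraTreesClassifier (note the plural) in sklearn.ensemble. A quick side-by-side on synthetic data, illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Extra-trees draw split thresholds at random instead of searching for the
# best one, trading a little bias for lower variance and faster training
extra = ExtraTreesClassifier(n_estimators=100, random_state=1).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
print(extra.score(X, y), forest.score(X, y))
```

Note that ExtraTreesClassifier does not bootstrap by default (bootstrap=False): the randomness comes from the split thresholds rather than from the samples.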
What happened to our example problems?
Only one variable used to predict in some cases:
We can avoid it by not using all features in every tree
Lack of stability:
Our model is smoother
Overfitting: we made predictions based on a single sample:
Overfitting is mostly avoided
Random forests come with built-in cross validation (the out-of-bag estimate)
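The "built-in cross validation" is the out-of-bag estimate: each tree never sees the rows left out of its bootstrap sample, so those rows act as a free validation set. In scikit-learn it is one flag away (the dataset below is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each tree skips roughly a third of the rows (not drawn in its bootstrap);
# scoring every row only with the trees that never saw it gives an honest
# generalization estimate at no extra training cost
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)
```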
Smoothing
Parameter optimization (Etsy)
Parameter optimization (Bank marketing)
Data source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita.
A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
When should we use Random Forests?
Using a CART is usually good as part of exploratory analysis
Do not use CART or RF when the problem is linear
They are especially good for ordinal (categorical but sorted) variables
Performance
CART is slower than linear models (e.g. logit)
RF trains N trees, so it takes N times as long, but it parallelizes across multiple cores
But RF is usually much faster than ANN or SVM, and results can be similar
Summary
CART is a simple but still powerful model
Visualizing trees helps us better understand our data
Ensembles usually improve the predictive power of models
Random Forests fix CART problems
Better use of features
More stability
Better generalization (RFs avoid overfitting)
Random Forests' main parameters:
min_samples_leaf (inherited from CART)
n_estimators (the number of trees)
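Both parameters can be tuned with an ordinary grid search; scikit-learn spells the second one n_estimators. A sketch on synthetic regression data (the grid values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=5, noise=10.0,
                       random_state=0)

# Cross-validate every combination of the two main parameters
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [10, 50], 'min_samples_leaf': [1, 5]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Larger n_estimators rarely hurts accuracy (only training time), while min_samples_leaf directly trades smoothness against fit, as the earlier smoothing slide shows.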
Thank you
QUESTIONS?

More Related Content

What's hot

Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest Rupak Roy
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision treesKnoldus Inc.
 
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...Simplilearn
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmPalin analytics
 
Decision trees for machine learning
Decision trees for machine learningDecision trees for machine learning
Decision trees for machine learningAmr BARAKAT
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learningSANTHOSH RAJA M G
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: ClusteringDeepak George
 
Random Forest and KNN is fun
Random Forest and KNN is funRandom Forest and KNN is fun
Random Forest and KNN is funZhen Li
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 
Decision Trees
Decision TreesDecision Trees
Decision TreesStudent
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression TreesHemant Chetwani
 

What's hot (20)

Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision trees for machine learning
Decision trees for machine learningDecision trees for machine learning
Decision trees for machine learning
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learning
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Decision tree
Decision treeDecision tree
Decision tree
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: Clustering
 
Random Forest and KNN is fun
Random Forest and KNN is funRandom Forest and KNN is fun
Random Forest and KNN is fun
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression Trees
 

Similar to Understanding random forests

CART: Not only Classification and Regression Trees
CART: Not only Classification and Regression TreesCART: Not only Classification and Regression Trees
CART: Not only Classification and Regression TreesMarc Garcia
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionSeonho Park
 
5.Module_AIML Random Forest.pptx
5.Module_AIML Random Forest.pptx5.Module_AIML Random Forest.pptx
5.Module_AIML Random Forest.pptxPRIYACHAURASIYA25
 
Decision Trees
Decision TreesDecision Trees
Decision TreesCloudxLab
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2Nandhini S
 
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfAdityaSoraut
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 r-kor
 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learningRajasekhar364622
 
02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by OptimizationAndres Mendez-Vazquez
 
Random Forests: The Vanilla of Machine Learning - Anna Quach
Random Forests: The Vanilla of Machine Learning - Anna QuachRandom Forests: The Vanilla of Machine Learning - Anna Quach
Random Forests: The Vanilla of Machine Learning - Anna QuachWithTheBest
 
13 random forest
13 random forest13 random forest
13 random forestVishal Dutt
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxShivakrishnan18
 
2023 Supervised Learning for Orange3 from scratch
2023 Supervised Learning for Orange3 from scratch2023 Supervised Learning for Orange3 from scratch
2023 Supervised Learning for Orange3 from scratchFEG
 
Classification (ML).ppt
Classification (ML).pptClassification (ML).ppt
Classification (ML).pptrajasamal1999
 
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesStrata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesIntuit Inc.
 

Similar to Understanding random forests (20)

CART: Not only Classification and Regression Trees
CART: Not only Classification and Regression TreesCART: Not only Classification and Regression Trees
CART: Not only Classification and Regression Trees
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
Decision tree
Decision tree Decision tree
Decision tree
 
5.Module_AIML Random Forest.pptx
5.Module_AIML Random Forest.pptx5.Module_AIML Random Forest.pptx
5.Module_AIML Random Forest.pptx
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2
 
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
 
Lecture4.pptx
Lecture4.pptxLecture4.pptx
Lecture4.pptx
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learning
 
02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization
 
Random Forests: The Vanilla of Machine Learning - Anna Quach
Random Forests: The Vanilla of Machine Learning - Anna QuachRandom Forests: The Vanilla of Machine Learning - Anna Quach
Random Forests: The Vanilla of Machine Learning - Anna Quach
 
13 random forest
13 random forest13 random forest
13 random forest
 
What Is Random Forest_ analytics_ IBM.pdf
What Is Random Forest_ analytics_ IBM.pdfWhat Is Random Forest_ analytics_ IBM.pdf
What Is Random Forest_ analytics_ IBM.pdf
 
Decision Trees.ppt
Decision Trees.pptDecision Trees.ppt
Decision Trees.ppt
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptx
 
ML unit-1.pptx
ML unit-1.pptxML unit-1.pptx
ML unit-1.pptx
 
2023 Supervised Learning for Orange3 from scratch
2023 Supervised Learning for Orange3 from scratch2023 Supervised Learning for Orange3 from scratch
2023 Supervised Learning for Orange3 from scratch
 
Classification (ML).ppt
Classification (ML).pptClassification (ML).ppt
Classification (ML).ppt
 
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesStrata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
 

Recently uploaded

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 

Recently uploaded (20)

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 

Understanding random forests

  • 1. Understanding Random Forests Marc Garcia PyData Madrid - April 9th, 2016 1 / 44 Understanding Random Forests - Marc Garcia
  • 2. Introduction 2 / 44 Understanding Random Forests - Marc Garcia
  • 3. Introduction About me: @datapythonista Python programmer since 2006 Master in AI from UPC Working for Bank of America http://datapythonista.github.io 3 / 44 Understanding Random Forests - Marc Garcia
  • 4. Introduction Overview: Understanding Random Forests How decision trees work Training Random Forests Practical example Analyzing model parameters 4 / 44 Understanding Random Forests - Marc Garcia
  • 5. How decision trees work 5 / 44 Understanding Random Forests - Marc Garcia
  • 6. How decision trees work Using a decision tree Simple example: Event attendance: binary classification 2 explanatory variables: age and distance from sklearn.tree import DecisionTreeClassifier dtree = DecisionTreeClassifier() dtree.fit(df[[’age’, ’distance’]], df[’attended’]) cart_plot(dtree) 6 / 44 Understanding Random Forests - Marc Garcia
  • 7. How decision trees work Data visualization 7 / 44 Understanding Random Forests - Marc Garcia
  • 8. How decision trees work Decision boundary visualization 8 / 44 Understanding Random Forests - Marc Garcia
  • 9. How decision trees work Decision boundary visualization 9 / 44 Understanding Random Forests - Marc Garcia
  • 10. How decision trees work Decision boundary visualization 10 / 44 Understanding Random Forests - Marc Garcia
  • 11. How decision trees work Decision boundary visualization 11 / 44 Understanding Random Forests - Marc Garcia
  • 12. How decision trees work Decision boundary visualization 12 / 44 Understanding Random Forests - Marc Garcia
  • 13. How decision trees work Decision boundary visualization 13 / 44 Understanding Random Forests - Marc Garcia
  • 14. How decision trees work How is the model? def decision_tree_model(age, distance): if distance >= 2283.11: if age >= 40.00: if distance >= 6868.86: if distance >= 8278.82: return True else: return False else: return True else: return False else: if age >= 54.50: if age >= 57.00: return True else: return False else: return True 14 / 44 Understanding Random Forests - Marc Garcia
  • 15. How decision trees work Training: Basic algorithm def train_decision_tree(x, y): feature, value = get_best_split(x, y) x_left, y_left = x[x[feature] < value], y[x[feature] < value] if len(y_left.unique()) > 1: left_node = train_decision_tree(x_left, y_left) else: left_node = None x_right, y_right = x[x[feature] >= value], y[x[feature] >= value] if len(y_right.unique()) > 1: right_node = train_decision_tree(x_right, y_right) else: right_node = None return Node(feature, value, left_node, right_node) 15 / 44 Understanding Random Forests - Marc Garcia
• 16. How decision trees work: Best split

    Candidate split 1 (left: first sample, right: the rest):
      age       18  19  21  27  29  34  38  42  49  54  62  64
      attended   F   F   T   F   T   T   F   T   F   T   T   T

      Split   True  False
      Left       0      1
      Right      7      4

    Candidate split 2 (left: first two samples, right: the rest):
      age       18  19  21  27  29  34  38  42  49  54  62  64
      attended   F   F   T   F   T   T   F   T   F   T   T   T

      Split   True  False
      Left       0      2
      Right      7      3
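The two candidate splits above can be compared numerically. This is a minimal sketch (helper names are illustrative, not the deck's code) that computes the weighted entropy of each split from the counts in the tables; the split with the lower weighted entropy is the better one.

```python
import math

def entropy(a, b):
    """Entropy of a subset with a samples of one class and b of the other."""
    total = a + b
    result = 0.
    for count in (a, b):
        if count:  # by convention 0 * log2(0) = 0
            p = count / total
            result -= p * math.log(p, 2)
    return result

def weighted_entropy(left, right):
    """Weighted average entropy of a split; left/right are (True, False) count pairs."""
    total = sum(left) + sum(right)
    return (sum(left) / total * entropy(*left)
            + sum(right) / total * entropy(*right))

# Counts taken from the two candidate splits in the tables above
split_1 = weighted_entropy(left=(0, 1), right=(7, 4))  # ~0.867
split_2 = weighted_entropy(left=(0, 2), right=(7, 3))  # ~0.734
```

Split 2 has the lower weighted entropy, so it is the better of the two candidates.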
• 17. How decision trees work: Best split algorithm

    def get_best_split(x, y):
        best_split = None
        best_entropy = 1.
        for feature in x.columns.values:
            column = x[feature]
            for value in column.unique():
                a = (y[column < value] == class_a_value).sum()
                b = (y[column < value] == class_b_value).sum()
                left_weight = (a + b) / len(y.index)
                left_entropy = entropy(a, b)

                a = (y[column >= value] == class_a_value).sum()
                b = (y[column >= value] == class_b_value).sum()
                right_weight = (a + b) / len(y.index)
                right_entropy = entropy(a, b)

                split_entropy = left_weight * left_entropy + right_weight * right_entropy
                if split_entropy < best_entropy:
                    best_split = (feature, value)
                    best_entropy = split_entropy
        return best_split
• 18. How decision trees work: Entropy

    For a given subset¹:

      entropy = −Pr(attending) · log₂ Pr(attending) − Pr(¬attending) · log₂ Pr(¬attending)   (1)

    ¹ Note that for pure splits it's assumed that 0 · log₂ 0 = 0
• 19. How decision trees work: Entropy algorithm

    import math

    def entropy(a, b):
        total = a + b
        prob_a = a / total
        prob_b = b / total
        # Pure subsets: by convention 0 * log2(0) = 0, so the entropy is 0
        if prob_a == 0 or prob_b == 0:
            return 0.
        return - prob_a * math.log(prob_a, 2) - prob_b * math.log(prob_b, 2)
• 20. How decision trees work: Information gain

    For a given split:

      information_gain = entropy_parent − (items_left / items_total · entropy_left
                                           + items_right / items_total · entropy_right)   (2)
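Equation (2) can be sketched directly in Python. The function below (names are illustrative) takes (class_a, class_b) count pairs for the parent and the two children, and returns the entropy reduction; the counts used in the example are the ones from the attendance slides (7 True / 5 False overall, candidate split 2).

```python
import math

def entropy(a, b):
    """Entropy of a subset from its two class counts."""
    total = a + b
    return -sum(c / total * math.log(c / total, 2) for c in (a, b) if c)

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right.
    Each argument is a (class_a_count, class_b_count) pair."""
    total = sum(parent)
    children = (sum(left) / total * entropy(*left)
                + sum(right) / total * entropy(*right))
    return entropy(*parent) - children

# Candidate split 2 of the attendance example: 7 True / 5 False overall
gain = information_gain(parent=(7, 5), left=(0, 2), right=(7, 3))  # ~0.245
```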
• 21. Practical example
• 22. Practical example: Etsy dataset
  Features:
    Feedback (number of reviews)
    Age (days since shop creation)
    Admirers (likes)
    Items (number of products)
  Response variable:
    Sales (number)
  Samples: 58,092
  Source: www.bigml.com
• 23. Practical example: Features distribution
• 24. Practical example: Sales visualization
• 25. Practical example: Feature correlation

                 sales  admirers       age  feedback     items
    sales     1.000000  0.458261  0.181045  0.955949  0.502423
    admirers  0.458261  1.000000  0.340939  0.401995  0.268985
    age       0.181045  0.340939  1.000000  0.184238  0.167535
    feedback  0.955949  0.401995  0.184238  1.000000  0.458955
    items     0.502423  0.268985  0.167535  0.458955  1.000000
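A matrix like the one above is produced by pandas' `DataFrame.corr()`. A minimal sketch on hypothetical stand-in data (the column names match the deck; the values here are made up, the real dataset has 58,092 samples):

```python
import pandas as pd

# Hypothetical mini-dataset with a subset of the Etsy columns
df = pd.DataFrame({
    'feedback': [10, 25, 3, 80, 41, 7],
    'items':    [5, 12, 2, 40, 18, 4],
    'sales':    [12, 30, 2, 95, 50, 6],
})

# Pearson correlation matrix, symmetric with 1.0 on the diagonal
corr = df.corr()
```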
• 26. Practical example: Correlation to sales
• 27. Practical example: Decision tree visualization
• 28. Practical example: Decision tree visualization (top)
• 29. Practical example: Decision tree visualization (left)
• 30. Practical example: Decision tree visualization (right)
• 31. Practical example: What is wrong with the tree?
  Only one variable is used for the prediction in some cases
  Lack of stability:
    If feedback = 207 we predict 109 sales
    If feedback = 208 we predict 977 sales
    The model can change dramatically if we add a few more samples to the dataset
  Overfitting: we make predictions based on a single sample
• 32. Training Random Forests
• 33. Training Random Forests: Ensembles, the board of directors
  Why do companies have a board of directors?
  What is best?
    The single best point of view
    A mix of good points of view
  How good was our best tree? What about a mix of not-so-good trees?
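The board-of-directors intuition can be illustrated with a toy simulation: three independent voters that are each right only 70% of the time, combined by majority vote, are right noticeably more often than any one of them (about 78.4% in theory). This is a hedged sketch with simulated voters, not the deck's code.

```python
import random

random.seed(0)
truth = [random.random() < 0.5 for _ in range(1000)]

def noisy_predictions(truth, error_rate, seed):
    """A hypothetical imperfect voter: right (1 - error_rate) of the time."""
    rng = random.Random(seed)
    return [label if rng.random() > error_rate else not label for label in truth]

# Three independent board members, each wrong 30% of the time
voters = [noisy_predictions(truth, 0.3, seed) for seed in (1, 2, 3)]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Majority vote: at least 2 of the 3 voters say True
majority = [sum(votes) >= 2 for votes in zip(*voters)]
individual = [accuracy(v, truth) for v in voters]  # each close to 0.70
ensemble = accuracy(majority, truth)               # close to 0.784 in theory
```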
• 34. Training Random Forests: Ensemble of trees
  How do we build optimized trees that are different?
  We need to randomize them:
    Samples: bagging
    Features
    Splits
• 35. Training Random Forests: Bagging, uniform sampling with replacement
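A bootstrap (bagging) sample can be drawn with the standard library alone. Each tree gets a sample of the same size as the dataset, drawn uniformly with replacement, so on average it sees about 63.2% of the distinct rows (1 − 1/e); the rest are its out-of-bag samples. A minimal sketch:

```python
import random

random.seed(42)
dataset = list(range(1000))  # stand-in for row indices of the training set

# Bootstrap sample: draw n items uniformly, with replacement
bootstrap = random.choices(dataset, k=len(dataset))

# Fraction of distinct rows the tree actually sees: ~0.632 on average
unique_fraction = len(set(bootstrap)) / len(dataset)
```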
• 36. Training Random Forests: Randomizing features
  Even with bagging, trees will usually make the top splits on the same feature
  max_features parameter:
    Regression: use all features
    Classification: use the square root of the number of features
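The rule of thumb above can be sketched as a small helper that picks the feature subset considered at one split. The function name and signature are illustrative, not sklearn's API; sklearn exposes the same idea through the `max_features` parameter.

```python
import math
import random

def candidate_features(features, task):
    """Feature subset considered at one split: all features for
    regression, a random sqrt(n)-sized subset for classification."""
    if task == 'regression':
        return list(features)
    k = max(1, int(math.sqrt(len(features))))
    return random.sample(list(features), k)

random.seed(0)
all_features = ['age', 'feedback', 'admirers', 'items']
reg = candidate_features(all_features, 'regression')      # all 4 features
clf = candidate_features(all_features, 'classification')  # 2 of the 4
```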
• 37. Training Random Forests: Randomizing splits
  The original decision tree considers one candidate split per sample value
  We can consider just a random subset of candidate splits instead
  We obtain randomized trees
  Performance is improved
  sklearn implements this as ExtraTreeRegressor and ExtraTreeClassifier
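A simplified sketch of the extra-trees idea: instead of evaluating a threshold at every sample value of a feature, draw a single candidate threshold uniformly within the feature's range. The helper name is illustrative; the ages are the ones from the attendance example.

```python
import random

def random_threshold(values, rng):
    """Extra-trees style candidate split: one threshold drawn uniformly
    within the feature's observed range, instead of testing every value."""
    return rng.uniform(min(values), max(values))

rng = random.Random(0)
ages = [18, 19, 21, 27, 29, 34, 38, 42, 49, 54, 62, 64]
threshold = random_threshold(ages, rng)  # somewhere between 18 and 64
```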
• 38. Training Random Forests: What happened to our example problems?
  Only one variable used to predict in some cases:
    We can avoid it by not giving every tree all the features
  Lack of stability:
    Our model is smoother
  Overfitting:
    Predictions are no longer based on a single sample
    Overfitting is mostly avoided
    Random forests come with built-in validation: the out-of-bag samples
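The built-in validation is exposed in sklearn through `oob_score=True`: each tree is scored on the samples left out of its bootstrap, giving a validation estimate without a separate hold-out set. A sketch on synthetic stand-in data (the real deck uses the Etsy dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: a response driven mainly by the first feature
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(500, 2))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 500)

# oob_score=True scores each sample with the trees that did not see it
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
model.fit(X, y)
oob_r2 = model.oob_score_  # out-of-bag R^2, no hold-out set needed
```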
• 39. Training Random Forests: Smoothing
• 40. Training Random Forests: Parameter optimization (Etsy)
• 41. Training Random Forests: Parameter optimization (Bank marketing)
  Data source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
• 42. Training Random Forests: When should we use Random Forests?
  Using a CART is usually good as part of exploratory analysis
  Do not use CART or RF when the problem is linear
  They are especially good for ordinal (categorical but ordered) variables
  Performance:
    CART is slower than linear models (e.g. logit)
    RF trains N trees and takes N times longer, but it works on multiple cores, seriously
    But RF is usually much faster than ANN or SVM, and results can be similar
• 43. Training Random Forests: Summary
  CART is a simple but still powerful model
  By visualizing trees we can better understand our data
  Ensembles usually improve the predictive power of models
  Random Forests fix CART problems:
    Better use of features
    More stability
    Better generalization (RFs avoid overfitting)
  Random Forests main parameters:
    min_samples_leaf (inherited from CART)
    n_estimators (the number of trees)
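A minimal sketch of tuning one of these parameters, `min_samples_leaf`, by scoring on a held-out set (synthetic stand-in data; the deck's own optimization plots use the Etsy and bank marketing datasets):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(400, 2))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 400)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score each candidate value of min_samples_leaf on the held-out set
scores = {}
for leaf in (1, 5, 25):
    model = RandomForestRegressor(n_estimators=50, min_samples_leaf=leaf,
                                  random_state=0)
    model.fit(X_train, y_train)
    scores[leaf] = model.score(X_test, y_test)  # R^2 on the test split
```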
• 44. Thank you QUESTIONS?