Unsupervised Learning Techniques for Diversifying and Pruning Random Forest
Dr Mohamed Medhat Gaber
School of Computing Science and Digital Media
Robert Gordon University
27 January 2015
Acknowledgement
Work done in collaboration with PhD student Khaled Fawagreh
and co-supervisor Dr Eyad Elyan
Outline
1. Background
   Data Classification
   Ensemble Classification
   Ensemble Diversity
   Random Forests
2. Clustering and Ensemble Diversity
   CLUB-DRF
   Experimental Study
3. Outlier Scoring and Ensemble Diversity
   LOFB-DRF
   Experimental Study
4. Summary and Future Work
   Summary
   Future Work
What is Data Classification?
Data classification is the process of assigning a class (label) to a data instance, based on the values of a set of predictive attributes (features).
The process has two stages:
1. Model construction: a potentially large number of “labelled” instances are fed to a classification technique to build a model (classifier).
2. Model usage: once the model is constructed, it can be deployed and used to classify “unlabelled” instances.
A large number of techniques have been proposed to address the data classification process (e.g., decision trees, artificial neural networks, and support vector machines).
Predictive accuracy has been the major concern when designing a new classification technique, followed by the time needed for model construction and usage.
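To make the two stages concrete, here is a minimal sketch in Python using scikit-learn; the library, dataset and classifier are illustrative choices, not tools referenced in the talk.

```python
# Minimal sketch of the two stages of data classification (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1 -- model construction: labelled instances are fed to a classification technique.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Stage 2 -- model usage: the deployed model labels previously unseen ("unlabelled") instances.
predicted_labels = model.predict(X_test)
print("Held-out accuracy:", (predicted_labels == y_test).mean())
```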
Decision Tree Classification Techniques
Almost all decision trees are constructed using a similar procedure:
Attributes (features) are represented in internal nodes, with their values given on the links (branches) used for tree traversal (a variation of this exists for binary decision trees)
Leaf nodes are class labels
Decision trees mainly vary in the goodness measure used to find the best attribute to split on (e.g., information gain, gain ratio, Gini index, and Chi-square)
The first attribute, called the root, is the best attribute (according to the chosen goodness measure) to split on
An iterative process is then followed to build subtrees, finding the best attribute (or attribute = value) to split on at each iteration
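As a concrete illustration of one goodness measure mentioned above, the sketch below computes the information gain of a candidate split; the function names and toy data are made up for this example.

```python
# Information gain: entropy of the parent node minus the weighted entropy of its children.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute_values):
    parent = entropy(labels)
    children = 0.0
    for value in np.unique(attribute_values):
        mask = attribute_values == value
        children += mask.mean() * entropy(labels[mask])   # weight by the branch's share of instances
    return parent - children

# Toy example: splitting on this attribute separates the two classes perfectly, so the gain is 1 bit.
y = np.array(["yes", "yes", "no", "no", "yes", "no"])
a = np.array(["sunny", "sunny", "rain", "rain", "sunny", "rain"])
print(information_gain(y, a))   # 1.0
```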
Ensemble Classification
Combining a number of classifiers that vote towards the winning class has been thoroughly investigated by the machine learning and data mining communities.
Bagging, boosting and stacking are among the major approaches to building ensembles of classifiers.
Bagging uses bootstrap sampling to generate a diverse set of samples of the dataset.
Boosting builds classifiers in a sequence, encouraging later classifiers to become experts at classifying instances that earlier classifiers in the sequence misclassified.
Stacking uses a hierarchy of classifiers, in which the base classifiers generate a new dataset on which a single (meta-)classifier is built.
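A compact sketch of the bagging mechanism described above, using bootstrap samples and majority voting; the base learner and parameter values are illustrative choices, not part of the talk.

```python
# Bagging, sketched: train each base classifier on a bootstrap sample, then majority-vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    # Each member sees a bootstrap sample: n instances drawn with replacement.
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_estimators))]

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])   # one row of predictions per member
    winners = []
    for column in votes.T:                              # all members' votes for one instance
        classes, counts = np.unique(column, return_counts=True)
        winners.append(classes[np.argmax(counts)])      # the winning class
    return np.array(winners)
```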
Diversity and Predictive Accuracy
Diversity among the members of an ensemble is key to its predictive accuracy
There are many ways to measure such diversity; it is not a straightforward process
Regardless of the measure used, diversity has been the target of a number of ‘diversity creation’ methods
Bagging and boosting enforce diversity by input manipulation
Stacking typically imposes diversity by using a number of different classifiers
Error-correcting output codes manipulate the output to create diversity
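One simple way to quantify diversity (there are many, as noted above) is the average pairwise disagreement between members' predictions; this particular measure is an illustrative choice, not one prescribed by the talk.

```python
# Average pairwise disagreement: the fraction of instances on which two members predict differently,
# averaged over all pairs of ensemble members.  0 = identical members; higher = more diverse.
import numpy as np
from itertools import combinations

def pairwise_disagreement(predictions):
    """predictions: (n_members, n_instances) array of each member's predicted labels."""
    predictions = np.asarray(predictions)
    pairs = combinations(range(len(predictions)), 2)
    return float(np.mean([(predictions[i] != predictions[j]).mean() for i, j in pairs]))

# Example: three members disagreeing on some of five instances.
preds = [[0, 1, 1, 0, 1],
         [0, 1, 0, 0, 1],
         [1, 1, 1, 0, 0]]
print(pairwise_disagreement(preds))   # 0.4
```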
Random Forests: An Overview
An ensemble classification and regression technique introduced by Leo Breiman
It generates a diversified ensemble of decision trees by adopting two mechanisms:
A bootstrap sample is used for the construction of each tree (bagging), resulting in approximately 63.2% unique instances, with the rest being repeats
At each node split, only a randomly drawn subset of the features is assessed for goodness (typically √F or log2(F) features, where F is the total number of features)
Trees are allowed to grow fully, without pruning
Typically 100 to 500 trees are used to form the ensemble
It is now considered among the best-performing classifiers
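The 63.2% figure follows from the chance that a given training instance appears at least once in a bootstrap sample of size n drawn with replacement; a short derivation:

```latex
P(\text{instance appears in the bootstrap sample})
  = 1 - \left(1 - \tfrac{1}{n}\right)^{n}
  \;\longrightarrow\; 1 - e^{-1} \approx 0.632
  \quad \text{as } n \to \infty
```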
Random Forest Tops State-of-the-art Classifiers
179 classifiers
121 datasets (the whole UCI repository at the time of the experiment)
Random Forest ranked first, followed by SVM with a Gaussian kernel
Reference:
Fernandez-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1), 3133-3181.
Improving Random Forests
Source: Fawagreh, K., Gaber, M. M., & Elyan, E. (2014). Random forests: from early developments to recent advancements. Systems Science & Control Engineering: An Open Access Journal, 2(1), 602-609.
How is Diversity Related to Clustering?
The aim of any clustering algorithm is to produce cohesive clusters that are well separated
A good clustering model therefore maximises the diversity among members of different clusters
Inspired by this observation, we hypothesised that if the trees in a Random Forest are clustered, a small subset (typically one tree) from each cluster can be used to produce a diversified Random Forest
The benefits are twofold:
Increased diversification
A smaller ensemble, leading to faster classification of unlabelled instances
CLUB-DRF
We termed the method CLUster Based Diversified Random Forests (CLUB-DRF)
Three stages are followed (a code sketch appears after the figure below):
A Random Forest is induced using the traditional method
Trees are clustered according to their classification pattern
One or more representatives are chosen from each cluster to form the pruned Random Forest
[Figure: the CLUB-DRF pipeline — a training set is fed to the Random Forest algorithm to induce the parent RF (trees t1 … tn, with classification patterns C(t1, T) … C(tn, T)); a clustering algorithm groups the trees into clusters 1 … k; representative selection produces the pruned CLUB-DRF (trees t1 … tk), which is evaluated on the testing set]
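A minimal sketch of the three-stage pipeline (not the authors' implementation): it clusters trees by the Hamming distance between their classification patterns, with SciPy's hierarchical clustering standing in for the k-modes step used in the reported experiments, and keeps the most accurate tree of each cluster; scikit-learn, the validation split and all parameter values are assumptions.

```python
# CLUB-DRF, sketched: induce a Random Forest, cluster trees by classification pattern,
# keep one representative per cluster.  Illustrative only; hierarchical clustering
# stands in for the k-modes algorithm used in the reported experiments.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def club_drf(X_train, y_train, X_val, y_val, n_trees=500, k=10):
    # Stage 1: induce the parent Random Forest in the traditional way.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_train, y_train)

    # Classification pattern of each tree on a validation set.  Individual trees of a
    # scikit-learn forest predict indices into rf.classes_, so encode y_val the same way.
    patterns = np.array([tree.predict(X_val).astype(int) for tree in rf.estimators_])
    y_enc = np.searchsorted(rf.classes_, y_val)

    # Stage 2: cluster trees whose patterns agree (Hamming distance between label vectors).
    labels = fcluster(linkage(pdist(patterns, metric="hamming"), method="average"),
                      t=k, criterion="maxclust")

    # Stage 3: choose the best-performing tree of each cluster as its representative.
    representatives = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        accuracies = [(patterns[i] == y_enc).mean() for i in members]
        representatives.append(rf.estimators_[members[int(np.argmax(accuracies))]])
    return representatives  # the pruned, diversified forest (majority-vote at prediction time)
```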
CLUB-DRF Settings
A number of settings are needed as follows:
The clustering algorithm used
The number of clusters of trees
The number of trees representing each cluster
The criteria for choosing the representatives
Random
Best performing
Experimental Setup
We tested the technique over 15 datasets from the UCI repository
We generated 500 trees for the main Random Forest
We used k-modes to cluster the trees (a usage sketch follows this list)
We used the following values for k: 5, 10, 15, 20, 25, 30, 35, and 40
We used one representative tree per cluster, chosen by Out-Of-Bag (OOB) performance
A repeated hold-out method was used to estimate the performance
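For the clustering step, a minimal usage sketch with the third-party kmodes Python package; the package choice, the dummy pattern matrix and all parameter values other than k are assumptions, since the talk does not name an implementation.

```python
# Clustering the trees' classification patterns with k-modes for each value of k in the study.
# Assumes the third-party "kmodes" package (pip install kmodes).
import numpy as np
from kmodes.kmodes import KModes

# Stand-in for the (n_trees x n_validation_instances) matrix of per-tree predicted labels;
# in CLUB-DRF this would be built from the parent forest as in the earlier sketch.
patterns = np.random.default_rng(0).integers(0, 3, size=(500, 100))

for k in (5, 10, 15, 20, 25, 30, 35, 40):
    cluster_labels = KModes(n_clusters=k, init="Huang", n_init=5).fit_predict(patterns)
    # cluster_labels[i] is the cluster assigned to tree i; representatives are then picked per cluster.
```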
Summarised Results
[Figure: Number of Datasets (0–9) versus Size (Number of Trees: 10, 20, 30, 40), comparing the methods CLUB-DRF and RF]
Pruning Results
Sample of Detailed Results
How is Diversity Related to Outlier Detection?
Outliers are out-of-the-norm instances that are thought to be generated by a different mechanism
By analogy, trees that are significantly different (diverse) from the rest of the trees in the Random Forest can be seen as outliers
The Local Outlier Factor (LOF) assigns a real number to each instance to represent its peculiarity
Inspired by this analogy, we hypothesised that a diverse ensemble of trees can be formed using an outlier detection method
LOFB-DRF
We termed the method Local Outlier Factor Based Diversified Random Forests (LOFB-DRF)
It follows similar steps to CLUB-DRF
Each tree is assigned an LOF value
Trees are then chosen according to two criteria:
Predictive accuracy
LOF value
LOFB-DRF Settings
A number of settings are needed, as follows:
The LOF setting for the number of nearest neighbours
The options for combining LOF with predictive accuracy:
Using LOF only, ruling out predictive accuracy
Using a combination strategy
Experimental Setup
We tested the technique over 10 datasets from the UCI repository
We generated 500 trees for the main Random Forest
We used LOF with 40 nearest neighbours
We ranked each tree using rank = normal(LOF) × accuracy, where normal(LOF), accuracy ∈ [0, 1]
Trees with the highest rank are chosen as representatives (a sketch of this ranking follows this list)
We used the following numbers of representative trees: 5, 10, 15, 20, 25, 30, 35, and 40
A repeated hold-out method was used to estimate the performance
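A minimal sketch of the ranking step under the settings above (not the authors' code): LOF is computed over the trees' classification patterns with scikit-learn's LocalOutlierFactor, min-max normalised, multiplied by each tree's hold-out accuracy, and the top-ranked trees are kept; the Hamming metric, the helper name and the reuse of rf, patterns and y_enc from the CLUB-DRF sketch are assumptions.

```python
# LOFB-DRF, sketched: score each tree by rank = normal(LOF) x accuracy and keep the top-ranked trees.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lofb_drf(rf, patterns, y_enc, n_keep=40, n_neighbors=40):
    # LOF over the trees' classification patterns; larger score = more outlying (more "diverse") tree.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, metric="hamming").fit(patterns)
    lof_score = -lof.negative_outlier_factor_

    # Min-max normalise LOF into [0, 1]; per-tree accuracy is already in [0, 1].
    normal_lof = (lof_score - lof_score.min()) / (lof_score.max() - lof_score.min() + 1e-12)
    accuracy = (patterns == y_enc).mean(axis=1)

    rank = normal_lof * accuracy                    # rank = normal(LOF) x accuracy
    keep = np.argsort(rank)[::-1][:n_keep]          # trees with the highest rank
    return [rf.estimators_[i] for i in keep]
```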
Summarised Results
[Figure: Number of Datasets (0–6) versus Size (Number of Trees: 10, 20, 30, 40), comparing the methods LOFB-DRF (labelled LOF-DRF in the legend) and RF]
Pruning Results
Sample of Detailed Results
Summary
Random Forest has proved its superiority over the last few years
Two methods were presented in this talk, aimed at diversifying and pruning Random Forests
Results showed the potential of these two methods to further enhance the predictive accuracy of the method
The high level of pruning makes these techniques candidates for real-time applications, as the number of trees to be traversed is significantly reduced
Future Work
In CLUB-DRF:
Exploring other methods for choosing tree representatives from
each cluster (e.g., varying the number of representatives per
cluster)
Using other clustering techniques
In LOFB-DRF:
Exploring other options for combining LOF value and
predictive accuracy
Using LOF and predictive accuracy for the choice of tree
representatives in each cluster
Applying both methods to other ensemble classification
techniques
Q & A
Thanks for listening!
Contact Details
Dr Mohamed Medhat Gaber
E-mail: m.gaber1@rgu.ac.uk
Webpage: http://mohamedmgaber.weebly.com/
LinkedIn: https://www.linkedin.com/profile/view?id=21808352
Twitter: https://twitter.com/mmmgaber
ResearchGate:
https://www.researchgate.net/profile/Mohamed_Gaber16?ev=prf_highl