Unsupervised Learning Techniques for Diversifying and Pruning Random Forest
Dr Mohamed Medhat Gaber
School of Computing Science and Digital Media
Robert Gordon University
27 January 2015
Acknowledgement
Work done in collaboration with PhD student Khaled Fawagreh
and co-supervisor Dr Eyad Elyan
Outline
1. Background
   Data Classification
   Ensemble Classification
   Ensemble Diversity
   Random Forests
2. Clustering and Ensemble Diversity
   CLUB-DRF
   Experimental Study
3. Outlier Scoring and Ensemble Diversity
   LOFB-DRF
   Experimental Study
4. Summary and Future Work
   Summary
   Future Work
What is Data Classification?
Data classification is the process of assigning a class (label) to a data instance, based on the values of a set of predictive attributes (features).
The process has two stages:
1. Model construction: a potentially large number of “labelled” instances are fed to a classification technique to build a model (classifier).
2. Model usage: once the model is constructed, it can be deployed and used to classify “unlabelled” instances.
A large number of techniques have been proposed to address the data classification process (e.g., decision trees, artificial neural networks, and support vector machines).
Predictive accuracy has been the major concern when designing a new classification technique, followed by the time needed for model construction and usage.
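To make the two stages concrete, here is a minimal sketch in Python using scikit-learn; the library, dataset and classifier are illustrative choices, not tools referenced in the talk.

```python
# Minimal sketch of the two stages of data classification (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1 -- model construction: labelled instances are fed to a classification technique.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Stage 2 -- model usage: the deployed model labels previously unseen ("unlabelled") instances.
predicted_labels = model.predict(X_test)
print("Held-out accuracy:", (predicted_labels == y_test).mean())
```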
Decision Tree Classification Techniques
Almost all decision trees are constructed using a similar procedure:
Attributes (features) are represented in internal nodes, with their values given on the links (branches) used for tree traversal (a variation of this exists for binary decision trees)
Leaf nodes are class labels
Decision trees mainly vary in the goodness measure used to find the best attribute to split on (e.g., information gain, gain ratio, Gini index, and Chi-square)
The first attribute, called the root, is the best attribute (according to the chosen goodness measure) to split on
An iterative process is then followed to build subtrees, finding the best attribute (or attribute = value) to split on at each iteration
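As a concrete illustration of one goodness measure mentioned above, the sketch below computes the information gain of a candidate split; the function names and toy data are made up for this example.

```python
# Information gain: entropy of the parent node minus the weighted entropy of its children.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute_values):
    parent = entropy(labels)
    children = 0.0
    for value in np.unique(attribute_values):
        mask = attribute_values == value
        children += mask.mean() * entropy(labels[mask])   # weight by the branch's share of instances
    return parent - children

# Toy example: splitting on this attribute separates the two classes perfectly, so the gain is 1 bit.
y = np.array(["yes", "yes", "no", "no", "yes", "no"])
a = np.array(["sunny", "sunny", "rain", "rain", "sunny", "rain"])
print(information_gain(y, a))   # 1.0
```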
Ensemble Classification
Combining a number of classifiers that vote towards the winning class has been thoroughly investigated by the machine learning and data mining communities.
Bagging, boosting and stacking are among the major approaches to building ensembles of classifiers.
Bagging uses bootstrap sampling to generate a diverse set of samples of the dataset.
Boosting builds classifiers in a sequence, encouraging later classifiers to become experts at classifying instances that earlier classifiers in the sequence misclassified.
Stacking uses a hierarchy of classifiers, in which the base classifiers generate a new dataset on which a single (meta-)classifier is built.
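A compact sketch of the bagging mechanism described above, using bootstrap samples and majority voting; the base learner and parameter values are illustrative choices, not part of the talk.

```python
# Bagging, sketched: train each base classifier on a bootstrap sample, then majority-vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    # Each member sees a bootstrap sample: n instances drawn with replacement.
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_estimators))]

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])   # one row of predictions per member
    winners = []
    for column in votes.T:                              # all members' votes for one instance
        classes, counts = np.unique(column, return_counts=True)
        winners.append(classes[np.argmax(counts)])      # the winning class
    return np.array(winners)
```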
Diversity and Predictive Accuracy
Diversity among the members of an ensemble is key to its predictive accuracy
There are many ways to measure such diversity; it is not a straightforward process
Regardless of the measure used, diversity has been the target of a number of ‘diversity creation’ methods
Bagging and boosting enforce diversity by input manipulation
Stacking typically imposes diversity by using a number of different classifiers
Error-correcting output codes manipulate the output to create diversity
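One simple way to quantify diversity (there are many, as noted above) is the average pairwise disagreement between members' predictions; this particular measure is an illustrative choice, not one prescribed by the talk.

```python
# Average pairwise disagreement: the fraction of instances on which two members predict differently,
# averaged over all pairs of ensemble members.  0 = identical members; higher = more diverse.
import numpy as np
from itertools import combinations

def pairwise_disagreement(predictions):
    """predictions: (n_members, n_instances) array of each member's predicted labels."""
    predictions = np.asarray(predictions)
    pairs = combinations(range(len(predictions)), 2)
    return float(np.mean([(predictions[i] != predictions[j]).mean() for i, j in pairs]))

# Example: three members disagreeing on some of five instances.
preds = [[0, 1, 1, 0, 1],
         [0, 1, 0, 0, 1],
         [1, 1, 1, 0, 0]]
print(pairwise_disagreement(preds))   # 0.4
```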
Random Forests: An Overview
An ensemble classification and regression technique introduced by Leo Breiman
It generates a diversified ensemble of decision trees by adopting two mechanisms:
A bootstrap sample is used for the construction of each tree (bagging), resulting in approximately 63.2% unique instances, with the rest being repeats
At each node split, only a randomly drawn subset of the features is assessed for goodness (typically √F or log2(F) features, where F is the total number of features)
Trees are allowed to grow fully, without pruning
Typically 100 to 500 trees are used to form the ensemble
It is now considered among the best-performing classifiers
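The 63.2% figure follows from the chance that a given training instance appears at least once in a bootstrap sample of size n drawn with replacement; a short derivation:

```latex
P(\text{instance appears in the bootstrap sample})
  = 1 - \left(1 - \tfrac{1}{n}\right)^{n}
  \;\longrightarrow\; 1 - e^{-1} \approx 0.632
  \quad \text{as } n \to \infty
```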
Random Forest Tops State-of-the-art Classifiers
179 classifiers
121 datasets (the whole UCI repository at the time of the experiment)
Random Forest ranked first, followed by SVM with a Gaussian kernel
Reference:
Fernandez-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1), 3133-3181.
Improving Random Forests
Source: Fawagreh, K., Gaber, M. M., & Elyan, E. (2014). Random forests: from early developments to recent advancements. Systems Science & Control Engineering: An Open Access Journal, 2(1), 602-609.
How is Diversity Related to Clustering?
The aim of any clustering algorithm is to produce cohesive clusters that are well separated
A good clustering model therefore maximises the diversity among members of different clusters
Inspired by this observation, we hypothesised that if the trees in a Random Forest are clustered, a small subset (typically one tree) from each cluster can be used to produce a diversified Random Forest
The benefits are twofold:
Increased diversification
A smaller ensemble, leading to faster classification of unlabelled instances
CLUB-DRF
We termed the method CLUster Based Diversified Random Forests (CLUB-DRF)
Three stages are followed (a code sketch appears after the figure below):
A Random Forest is induced using the traditional method
Trees are clustered according to their classification pattern
One or more representatives are chosen from each cluster to form the pruned Random Forest
[Figure: the CLUB-DRF pipeline — a training set is fed to the Random Forest algorithm to induce the parent RF (trees t1 … tn, with classification patterns C(t1, T) … C(tn, T)); a clustering algorithm groups the trees into clusters 1 … k; representative selection produces the pruned CLUB-DRF (trees t1 … tk), which is evaluated on the testing set]
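A minimal sketch of the three-stage pipeline (not the authors' implementation): it clusters trees by the Hamming distance between their classification patterns, with SciPy's hierarchical clustering standing in for the k-modes step used in the reported experiments, and keeps the most accurate tree of each cluster; scikit-learn, the validation split and all parameter values are assumptions.

```python
# CLUB-DRF, sketched: induce a Random Forest, cluster trees by classification pattern,
# keep one representative per cluster.  Illustrative only; hierarchical clustering
# stands in for the k-modes algorithm used in the reported experiments.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def club_drf(X_train, y_train, X_val, y_val, n_trees=500, k=10):
    # Stage 1: induce the parent Random Forest in the traditional way.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_train, y_train)

    # Classification pattern of each tree on a validation set.  Individual trees of a
    # scikit-learn forest predict indices into rf.classes_, so encode y_val the same way.
    patterns = np.array([tree.predict(X_val).astype(int) for tree in rf.estimators_])
    y_enc = np.searchsorted(rf.classes_, y_val)

    # Stage 2: cluster trees whose patterns agree (Hamming distance between label vectors).
    labels = fcluster(linkage(pdist(patterns, metric="hamming"), method="average"),
                      t=k, criterion="maxclust")

    # Stage 3: choose the best-performing tree of each cluster as its representative.
    representatives = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        accuracies = [(patterns[i] == y_enc).mean() for i in members]
        representatives.append(rf.estimators_[members[int(np.argmax(accuracies))]])
    return representatives  # the pruned, diversified forest (majority-vote at prediction time)
```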
CLUB-DRF Settings
A number of settings are needed as follows:
The clustering algorithm used
The number of clusters of trees
The number of trees representing each cluster
The criteria for choosing the representatives
Random
Best performing
Experimental Setup
We tested the technique over 15 datasets from the UCI repository
We generated 500 trees for the main Random Forest
We used k-modes to cluster the trees (a usage sketch follows this list)
We used the following values for k: 5, 10, 15, 20, 25, 30, 35, and 40
We used one representative tree per cluster, chosen by Out-Of-Bag (OOB) performance
A repeated hold-out method was used to estimate the performance
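For the clustering step, a minimal usage sketch with the third-party kmodes Python package; the package choice, the dummy pattern matrix and all parameter values other than k are assumptions, since the talk does not name an implementation.

```python
# Clustering the trees' classification patterns with k-modes for each value of k in the study.
# Assumes the third-party "kmodes" package (pip install kmodes).
import numpy as np
from kmodes.kmodes import KModes

# Stand-in for the (n_trees x n_validation_instances) matrix of per-tree predicted labels;
# in CLUB-DRF this would be built from the parent forest as in the earlier sketch.
patterns = np.random.default_rng(0).integers(0, 3, size=(500, 100))

for k in (5, 10, 15, 20, 25, 30, 35, 40):
    cluster_labels = KModes(n_clusters=k, init="Huang", n_init=5).fit_predict(patterns)
    # cluster_labels[i] is the cluster assigned to tree i; representatives are then picked per cluster.
```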
Summarised Results
[Figure: Number of Datasets (0–9) versus Size (Number of Trees: 10, 20, 30, 40), comparing the methods CLUB-DRF and RF]
Pruning Results
Sample of Detailed Results
How is Diversity Related to Outlier Detection?
Outliers are out-of-the-norm instances that are thought to be generated by a different mechanism
By analogy, trees that are significantly different (diverse) from the rest of the trees in the Random Forest can be seen as outliers
The Local Outlier Factor (LOF) assigns a real number to each instance to represent its peculiarity
Inspired by this analogy, we hypothesised that a diverse ensemble of trees can be formed using an outlier detection method
LOFB-DRF
We termed the method Local Outlier Factor Based Diversified Random Forests (LOFB-DRF)
It follows similar steps to CLUB-DRF
Each tree is assigned an LOF value
Trees are then chosen according to two criteria:
Predictive accuracy
LOF value
LOFB-DRF Settings
A number of settings are needed, as follows:
The LOF setting for the number of nearest neighbours
The options for combining LOF with predictive accuracy:
Using LOF only, ruling out predictive accuracy
Using a combination strategy
Experimental Setup
We tested the technique over 10 datasets from the UCI repository
We generated 500 trees for the main Random Forest
We used LOF with 40 nearest neighbours
We ranked each tree using rank = normal(LOF) × accuracy, where normal(LOF), accuracy ∈ [0, 1]
Trees with the highest rank are chosen as representatives (a sketch of this ranking follows this list)
We used the following numbers of representative trees: 5, 10, 15, 20, 25, 30, 35, and 40
A repeated hold-out method was used to estimate the performance
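A minimal sketch of the ranking step under the settings above (not the authors' code): LOF is computed over the trees' classification patterns with scikit-learn's LocalOutlierFactor, min-max normalised, multiplied by each tree's hold-out accuracy, and the top-ranked trees are kept; the Hamming metric, the helper name and the reuse of rf, patterns and y_enc from the CLUB-DRF sketch are assumptions.

```python
# LOFB-DRF, sketched: score each tree by rank = normal(LOF) x accuracy and keep the top-ranked trees.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lofb_drf(rf, patterns, y_enc, n_keep=40, n_neighbors=40):
    # LOF over the trees' classification patterns; larger score = more outlying (more "diverse") tree.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, metric="hamming").fit(patterns)
    lof_score = -lof.negative_outlier_factor_

    # Min-max normalise LOF into [0, 1]; per-tree accuracy is already in [0, 1].
    normal_lof = (lof_score - lof_score.min()) / (lof_score.max() - lof_score.min() + 1e-12)
    accuracy = (patterns == y_enc).mean(axis=1)

    rank = normal_lof * accuracy                    # rank = normal(LOF) x accuracy
    keep = np.argsort(rank)[::-1][:n_keep]          # trees with the highest rank
    return [rf.estimators_[i] for i in keep]
```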
Summarised Results
[Figure: Number of Datasets (0–6) versus Size (Number of Trees: 10, 20, 30, 40), comparing the methods LOFB-DRF (labelled LOF-DRF in the legend) and RF]
Pruning Results
Sample of Detailed Results
Summary
Random Forest has proved its superiority over the last few years
Two methods were presented in this talk, aimed at diversifying and pruning Random Forests
Results showed the potential of these two methods to further enhance the predictive accuracy of the method
The high level of pruning makes these techniques candidates for real-time applications, as the number of trees to be traversed is significantly reduced
Future Work
In CLUB-DRF:
Exploring other methods for choosing tree representatives from
each cluster (e.g., varying the number of representatives per
cluster)
Using other clustering techniques
In LOFB-DRF:
Exploring other options for combining LOF value and
predictive accuracy
Using LOF and predictive accuracy for the choice of tree
representatives in each cluster
Applying both methods to other ensemble classification
techniques
Q & A
Thanks for listening!
Contact Details
Dr Mohamed Medhat Gaber
E-mail: m.gaber1@rgu.ac.uk
Webpage: http://mohamedmgaber.weebly.com/
LinkedIn: https://www.linkedin.com/profile/view?id=21808352
Twitter: https://twitter.com/mmmgaber
ResearchGate:
https://www.researchgate.net/profile/Mohamed_Gaber16?ev=prf_highl