Leveraging Bagging for Evolving Data Streams
Albert Bifet, Geoff Holmes, and Bernhard Pfahringer
University of Waikato
Hamilton, New Zealand
Barcelona, 21 September 2010
ECML PKDD 2010
Mining Data Streams with Concept Drift
Extract information from
potentially infinite sequence of data
possibly varying over time
using few resources
Adaptively:
no prior knowledge of type or rate of change
2 / 32
Mining Data Streams with Concept Drift
Extract information from
potentially infinite sequence of data
possibly varying over time
using few resources
Leveraging Bagging
New improvements for adaptive bagging methods using
input randomization
output randomization
2 / 32
Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
3 / 32
Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
4 / 32
Mining Massive Data
Eric Schmidt, August 2010
Every two days now we create as much information as we did
from the dawn of civilization up until 2003.
5 exabytes of data
5 / 32
Data stream classification cycle
1 Process an example at a
time, and inspect it only
once (at most)
2 Use a limited amount of
memory
3 Work in a limited amount
of time
4 Be ready to predict at
any point
6 / 32
Mining Massive Data
Koichi Kawana
Simplicity means the achievement of maximum effect with
minimum means.
Figure: the three data stream resources: time, accuracy, and memory.
7 / 32
Evaluation Example
Accuracy Time Memory
Classifier A 70% 100 20
Classifier B 80% 20 40
Which classifier is performing better?
8 / 32
RAM-Hours
RAM-Hour
Every GB of RAM deployed for 1 hour
Cloud Computing Rental Cost Options
9 / 32
Evaluation Example
Accuracy Time Memory RAM-Hours
Classifier A 70% 100 20 2,000
Classifier B 80% 20 40 800
Which classifier is performing better?
10 / 32
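The RAM-Hours column is consistent with multiplying the two resource columns (memory in GB, time in hours): 20 × 100 = 2,000 and 40 × 20 = 800. A minimal sketch of that reading, with the unit interpretation being an assumption on my part:

```python
def ram_hours(memory_gb: float, hours: float) -> float:
    """RAM-Hours cost: one GB of RAM deployed for one hour counts as one RAM-Hour."""
    return memory_gb * hours

print(ram_hours(memory_gb=20, hours=100))  # Classifier A -> 2000.0
print(ram_hours(memory_gb=40, hours=20))   # Classifier B -> 800.0
```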
Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
11 / 32
Hoeffding Trees
Hoeffding Tree : VFDT
Pedro Domingos and Geoff Hulten.
Mining high-speed data streams. 2000
With high probability, it constructs a model identical to the one a traditional (greedy) batch method would learn
With theoretical guarantees on the error rate
Figure: example Hoeffding tree with tests on "Contains 'Money'" (Yes/No) and "Time" (Day/Night) leading to YES/NO leaves.
12 / 32
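As a reminder (not stated on the slide), the Hoeffding bound that VFDT uses to decide when enough examples have been seen to commit to a split is, for a split criterion with range R, confidence 1 − δ, and n observed examples:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$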
Hoeffding Naive Bayes Tree
Hoeffding Tree
Majority Class learner at leaves
Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer.
Stress-testing Hoeffding trees, 2005.
monitors accuracy of a Majority Class learner
monitors accuracy of a Naive Bayes learner
predicts using the most accurate method
13 / 32
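A minimal sketch of the leaf rule described above, assuming sub-learners that expose `predict(x)` and `update(x, y)` (an illustrative interface, not MOA's API): each leaf scores both predictors on every example before learning from it, and answers with whichever has been more accurate so far.

```python
class AdaptiveLeaf:
    """Leaf that switches between a Majority Class and a Naive Bayes predictor."""

    def __init__(self, majority_class_model, naive_bayes_model):
        self.mc = majority_class_model      # assumed: .predict(x) / .update(x, y)
        self.nb = naive_bayes_model         # assumed: .predict(x) / .update(x, y)
        self.mc_correct = 0
        self.nb_correct = 0

    def update(self, x, y):
        # Score both predictors on the example before training on it.
        self.mc_correct += int(self.mc.predict(x) == y)
        self.nb_correct += int(self.nb.predict(x) == y)
        self.mc.update(x, y)
        self.nb.update(x, y)

    def predict(self, x):
        # Predict with whichever learner has been more accurate at this leaf.
        return self.nb.predict(x) if self.nb_correct >= self.mc_correct else self.mc.predict(x)
```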
Bagging
Figure: Poisson(1) Distribution.
Bagging builds a set of M base models, each trained on a bootstrap sample created by drawing random samples with replacement from the original training set.
14 / 32
Bagging
Figure: Poisson(1) Distribution.
Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution; for large training sets this is well approximated by the Poisson(1) distribution shown in the figure.
14 / 32
Oza and Russell’s Online Bagging for M models
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: for all training examples do
3: for m = 1,2,...,M do
4: Set w = Poisson(1)
5: Update hm with the current example with weight w
6: anytime output:
7: return hypothesis: $h_{\text{fin}}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I(h_t(x) = y)$
15 / 32
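A minimal Python sketch of the procedure above; the base learner interface (`update(x, y, weight)`, `predict(x)`) is assumed for illustration and is not MOA's API.

```python
from collections import Counter
import numpy as np


class OnlineBagging:
    """Oza & Russell online bagging: each model weights each example by Poisson(1)."""

    def __init__(self, base_learner_factory, n_models=10):
        self.models = [base_learner_factory() for _ in range(n_models)]

    def update(self, x, y):
        for model in self.models:
            w = np.random.poisson(1.0)       # number of "copies" of the example, 1 on average
            if w > 0:
                model.update(x, y, weight=w)

    def predict(self, x):
        votes = Counter(model.predict(x) for model in self.models)
        return votes.most_common(1)[0][0]    # majority vote over the ensemble
```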
ADWIN Bagging (KDD’09)
ADWIN
An adaptive sliding window whose size is recomputed online
according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
On ratio of false positives and negatives
On the relation of the size of the current window and
change rates
ADWIN Bagging
When a change is detected, the worst classifier is removed and
a new classifier is added.
16 / 32
ADWIN Bagging for M models
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: for all training examples do
3: for m = 1,2,...,M do
4: Set w = Poisson(1)
5: Update hm with the current example with weight w
6: if ADWIN detects a change in the error of one of the classifiers then
7: Replace the classifier with the highest error with a new one
8: anytime output:
9: return hypothesis: $h_{\text{fin}}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I(h_t(x) = y)$
17 / 32
Leveraging Bagging for Evolving
Data Streams
Randomization as a powerful tool to increase accuracy and
diversity
There are three ways of using randomization:
Manipulating the input data
Manipulating the classifier algorithms
Manipulating the output targets
18 / 32
Input Randomization
Figure: Poisson distribution P(X = k) for λ = 1, 6, and 10.
19 / 32
ECOC Output Randomization
Table: Example matrix of random output codes for 3 classes and 6
classifiers
Class 1 Class 2 Class 3
Classifier 1 0 0 1
Classifier 2 0 1 1
Classifier 3 1 0 0
Classifier 4 1 1 0
Classifier 5 1 0 1
Classifier 6 0 1 0
20 / 32
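A minimal sketch of how such a random output-code matrix can be generated and used: classifier m is trained on the binary label µ_m(y), and prediction picks the class whose code column best matches the classifiers' binary outputs. This is illustrative only; the paper's exact code construction may differ (for example, it may enforce balanced codes).

```python
import numpy as np


def random_output_codes(n_classifiers, n_classes, rng=np.random.default_rng(0)):
    """Random binary code matrix: codes[m, c] is the binary label class c gets for classifier m."""
    return rng.integers(0, 2, size=(n_classifiers, n_classes))


def decode(codes, binary_predictions):
    """Pick the class whose code column agrees with the most binary predictions."""
    agreement = (codes == np.asarray(binary_predictions)[:, None]).sum(axis=0)
    return int(np.argmax(agreement))


codes = random_output_codes(n_classifiers=6, n_classes=3)
print(codes)                       # a 6 x 3 matrix of 0/1, as in the table above
print(decode(codes, codes[:, 1]))  # feeding class 1's own column typically decodes back to 1
```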
Leveraging Bagging for Evolving Data Streams
Leveraging Bagging
Using Poisson(λ)
Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes
Fast Leveraging Bagging ME
if an instance is misclassified: weight = 1
if not: weight = $e_T/(1-e_T)$
21 / 32
Input Randomization
Bagging
resampling with replacement using Poisson(1)
Other Strategies
subagging: resampling without replacement
half subagging: resampling without replacement using half of the instances
bagging without taking out any instance (WT): using 1 + Poisson(1)
22 / 32
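A minimal sketch of the two weight schemes that are fully specified on this slide: Poisson(1) for standard online bagging and 1 + Poisson(1) for bagging without taking out any instance (WT). The subagging variants instead use each instance at most once per model (a 0/1 weight), so they are not drawn here.

```python
import numpy as np

rng = np.random.default_rng(42)

def bagging_weight() -> int:
    """Standard online bagging: an example may be skipped (w = 0) or used several times."""
    return int(rng.poisson(1.0))

def bagging_wt_weight() -> int:
    """Bagging without taking out any instance: every example is used at least once."""
    return 1 + int(rng.poisson(1.0))

# About 37% of Poisson(1) weights are zero, so WT removes that source of discarded examples.
draws = [bagging_weight() for _ in range(10_000)]
print(sum(w == 0 for w in draws) / len(draws))   # close to exp(-1) = 0.368
```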
Leveraging Bagging for Evolving Data Streams
1: Initialize base models hm for all m ∈ {1,2,...,M}
2:
3: for all training examples (x,y) do
4: for m = 1,2,...,M do
5: Set w = Poisson(λ)
6: Update hm with the current example with weight w
7: if ADWIN detects a change in the error of one of the classifiers then
8: Replace the classifier with the highest error with a new one
9: anytime output:
10: return $h_{\text{fin}}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I(h_t(x) = y)$
23 / 32
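A minimal Python sketch of the procedure above, with λ as a parameter (λ = 1 recovers ADWIN Bagging). The change detector is assumed to expose `add_element(error)`, `detected_change()`, and an `estimation` of the current error, and the base learner `predict(x)` / `update(x, y, weight)`; these interfaces are illustrative, not MOA's exact API.

```python
from collections import Counter
import numpy as np


class LeveragingBagging:
    """Leveraging bagging sketch: Poisson(lambda) input weights plus per-model change detection."""

    def __init__(self, base_learner_factory, detector_factory, n_models=10, lam=6.0):
        self.new_model = base_learner_factory
        self.new_detector = detector_factory
        self.models = [base_learner_factory() for _ in range(n_models)]
        self.detectors = [detector_factory() for _ in range(n_models)]
        self.lam = lam                      # lam = 1 recovers ADWIN Bagging

    def update(self, x, y):
        change = False
        for model, detector in zip(self.models, self.detectors):
            # Prequential error: evaluate on the example before learning from it.
            detector.add_element(int(model.predict(x) != y))
            change = change or detector.detected_change()
            w = np.random.poisson(self.lam)
            if w > 0:
                model.update(x, y, weight=w)
        if change:
            # Replace the model with the highest estimated error by a fresh one.
            worst = max(range(len(self.models)),
                        key=lambda i: self.detectors[i].estimation)
            self.models[worst] = self.new_model()
            self.detectors[worst] = self.new_detector()

    def predict(self, x):
        votes = Counter(model.predict(x) for model in self.models)
        return votes.most_common(1)[0][0]   # majority vote
```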
Leveraging Bagging for Evolving Data Streams MC
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: Compute coloring µm(y)
3: for all training examples (x,y) do
4: for m = 1,2,...,M do
5: Set w = Poisson(λ)
6: Update hm with the current example with weight w and
class µm(y)
7: if ADWIN detects a change in the error of one of the classifiers then
8: Replace the classifier with the highest error with a new one
9: anytime output:
10: return $h_{\text{fin}}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I(h_t(x) = \mu_t(y))$
23 / 32
Leveraging Bagging for Evolving Data Streams ME
1: Initialize base models hm for all m ∈ {1,2,...,M}
2:
3: for all training examples (x,y) do
4: for m = 1,2,...,M do
5: Set w = 1 if misclassified, otherwise $e_T/(1-e_T)$
6: Update hm with the current example with weight w
7: if ADWIN detects a change in the error of one of the classifiers then
8: Replace the classifier with the highest error with a new one
9: anytime output:
10: return $h_{\text{fin}}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I(h_t(x) = y)$
23 / 32
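A minimal sketch of the ME weighting rule above, where $e_T$ is read here as the current error estimate of the classifier being trained (my interpretation of the slide):

```python
def me_weight(misclassified: bool, error_rate: float) -> float:
    """Leveraging Bagging ME weight: emphasise mistakes, down-weight correct examples."""
    if misclassified:
        return 1.0
    return error_rate / (1.0 - error_rate)   # e_T / (1 - e_T), small when the error is low

print(me_weight(True, 0.2))    # 1.0
print(me_weight(False, 0.2))   # 0.25
```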
Outline
1 Data stream constraints
2 Leveraging Bagging for Evolving Data Streams
3 Empirical evaluation
24 / 32
What is MOA?
{M}assive {O}nline {A}nalysis is a framework for mining data
streams.
Based on experience with Weka and VFML
Focussed on classification trees, but lots of active
development: clustering, item set and sequence mining,
regression
Easy to extend
Easy to design and run experiments
25 / 32
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the
Weka, but also extinct.
26 / 32
Leveraging Bagging Empirical evaluation
Figure: Accuracy (%) versus number of instances on the SEA dataset with three concept drifts, for Leveraging Bagging, Leveraging Bagging MC, ADWIN Bagging, and Online Bagging.
27 / 32
Empirical evaluation
Accuracy RAM-Hours
Hoeffding Tree 74.03% 0.01
Online Bagging 77.15% 2.98
ADWIN Bagging 79.24% 1.48
ADWIN Half Subagging 78.36% 1.04
ADWIN Subagging 78.68% 1.13
ADWIN Bagging WT 81.49% 2.74
ADWIN Bagging Strategies
half subagging: resampling without replacement using half of the instances
subagging: resampling without replacement
WT: bagging without taking out any instance, using 1 + Poisson(1)
28 / 32
Empirical evaluation
Accuracy RAM-Hours
Hoeffding Tree 74.03% 0.01
Online Bagging 77.15% 2.98
ADWIN Bagging 79.24% 1.48
Leveraging Bagging 85.54% 20.17
Leveraging Bagging MC 85.37% 22.04
Leveraging Bagging ME 80.77% 0.87
Leveraging Bagging
Leveraging Bagging
Using Poisson(λ)
Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes
Leveraging Bagging ME
Using weight 1 if misclassified, otherwise $e_T/(1-e_T)$
29 / 32
Empirical evaluation
Accuracy RAM-Hours
Hoeffding Tree 74.03% 0.01
Online Bagging 77.15% 2.98
ADWIN Bagging 79.24% 1.48
Random Forest Leveraging Bagging 80.69% 5.51
Random Forest Online Bagging 72.91% 1.30
Random Forest ADWIN Bagging 74.24% 0.89
Random Forests
the input training set is obtained by sampling with replacement
the nodes of the tree use only n random attributes to split
we only keep statistics of these attributes
30 / 32
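A minimal sketch of the per-node attribute restriction described above: each new node draws a random subset of n attributes, and only those are evaluated (and have statistics kept) as split candidates. The subset size and sampling scheme here are illustrative.

```python
import random


def candidate_attributes(all_attributes, n_random):
    """Attributes a new tree node keeps statistics for and considers when splitting."""
    return random.sample(list(all_attributes), k=min(n_random, len(all_attributes)))

# Example: a node in a stream with 10 attributes considers only 3 of them.
print(candidate_attributes(range(10), n_random=3))
```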
Leveraging Bagging Diversity
Figure: Kappa-Error diagram (error versus kappa statistic) for Leveraging Bagging and Online Bagging on the SEA data with three concept drifts, plotting 576 pairs of classifiers.
31 / 32
Summary
http://moa.cs.waikato.ac.nz/
Conclusions
New improvements for bagging methods using input
randomization
Improving Accuracy: Using Poisson(λ)
Improving RAM-Hours: Using weight 1 if misclassified, otherwise $e_T/(1-e_T)$
New improvements for bagging methods using output
randomization
No need for multi-class classifiers
32 / 32