New ensemble methods for evolving data streams


 A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà

       Laboratory for Relational Algorithmics, Complexity and Learning LARCA
                            UPC-Barcelona Tech, Catalonia


                               University of Waikato
                              Hamilton, New Zealand




                   Paris, 29 June 2009
       15th ACM SIGKDD International Conference on
        Knowledge Discovery and Data Mining 2009




  Outline
      a new experimental data stream framework for studying
      concept drift
      two new variants of Bagging:
            ADWIN Bagging
            Adaptive-Size Hoeffding Tree (ASHT) Bagging.
      an evaluation study on synthetic and real-world datasets



Outline


   1   MOA: Massive Online Analysis


   2   Concept Drift Framework


   3   New Ensemble Methods


   4   Empirical evaluation




What is MOA?

  Massive Online Analysis (MOA) is a framework for online learning
  from data streams.




      It is closely related to WEKA
      It includes a collection of offline and online methods as well as tools
      for evaluation:
           boosting and bagging
           Hoeffding Trees with and without Naïve Bayes classifiers at the
           leaves.



WEKA

   Waikato Environment for Knowledge Analysis
   Collection of state-of-the-art machine learning algorithms
   and data processing tools implemented in Java
       Released under the GPL
   Support for the whole process of experimental data mining
       Preparation of input data
       Statistical evaluation of learning schemes
       Visualization of input data and the result of learning




   Used for education, research and applications
   Complements “Data Mining” by Witten & Frank



WEKA: the bird




MOA: the bird

   The Moa (another native NZ bird) is not only flightless, like the
                     Weka, but also extinct.




Data stream classification cycle

  1   Process an example at a time, and inspect it only once (at most)
  2   Use a limited amount of memory
  3   Work in a limited amount of time
  4   Be ready to predict at any point




Experimental setting

 Evaluation procedures for Data
 Streams
    Holdout
    Interleaved Test-Then-Train or
    Prequential
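
 A minimal sketch of the Interleaved Test-Then-Train (prequential) loop named above: each example is used to test the model first and to train it afterwards. The `OnlineLearner` interface and `MajorityClassLearner` are hypothetical stand-ins written for this illustration and are not MOA classes.

```java
// Hypothetical minimal learner interface for this sketch (not MOA's API).
interface OnlineLearner {
    int predict(double[] x);
    void train(double[] x, int y);
}

// Toy learner: always predicts the most frequent class seen so far.
class MajorityClassLearner implements OnlineLearner {
    private final int[] counts;

    MajorityClassLearner(int numClasses) { counts = new int[numClasses]; }

    public int predict(double[] x) {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) best = c;
        }
        return best;
    }

    public void train(double[] x, int y) { counts[y]++; }
}

class PrequentialSketch {
    // Every example is first used for testing, then for training.
    static double accuracy(OnlineLearner learner, double[][] xs, int[] ys) {
        int correct = 0;
        for (int i = 0; i < xs.length; i++) {
            if (learner.predict(xs[i]) == ys[i]) correct++; // test first
            learner.train(xs[i], ys[i]);                    // then train
        }
        return xs.length == 0 ? 0.0 : (double) correct / xs.length;
    }

    public static void main(String[] args) {
        double[][] xs = new double[5][1];   // attribute values are ignored by the toy learner
        int[] ys = {1, 1, 0, 1, 1};
        System.out.println(accuracy(new MajorityClassLearner(2), xs, ys));
    }
}
```

 Because the prediction is made before the label is revealed to the learner, no separate holdout set is needed and every example contributes to the accuracy estimate.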

 Environments
    Sensor Network: 100 KB
    Handheld Computer: 32 MB
    Server: 400 MB




 Data Sources
    Random Tree Generator
    Random RBF Generator
    LED Generator
    Waveform Generator
    Function Generator




 Classifiers
     Naive Bayes
     Decision stumps
     Hoeffding Tree
     Hoeffding Option Tree
     Bagging and Boosting

 Prediction strategies
     Majority class
     Naive Bayes Leaves
     Adaptive Hybrid



Easy Design of a MOA classifier




      void resetLearningImpl()
      void trainOnInstanceImpl(Instance inst)
      double[] getVotesForInstance(Instance i)
      void getModelDescription(StringBuilder out, int indent)
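
      To make the four methods concrete, here is a self-contained sketch of a majority-class learner that uses the same method names. It does not extend any MOA class: `ToyInstance` is a hypothetical stand-in for the `Instance` type on the slide, and `MajorityClassSketch` is an illustrative name, so a real MOA classifier would look different in its base class and option handling.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Instance type named on the slide.
class ToyInstance {
    final double[] attributes;
    final int classValue;

    ToyInstance(double[] attributes, int classValue) {
        this.attributes = attributes;
        this.classValue = classValue;
    }
}

// Illustrative learner with the four method names from the slide.
class MajorityClassSketch {
    private final int numClasses;
    private double[] classCounts;

    MajorityClassSketch(int numClasses) {
        this.numClasses = numClasses;
        resetLearningImpl();
    }

    // Forget everything learned so far.
    void resetLearningImpl() { classCounts = new double[numClasses]; }

    // Update the model with one labelled example.
    void trainOnInstanceImpl(ToyInstance inst) { classCounts[inst.classValue] += 1.0; }

    // One vote per class; here simply the class counts seen so far.
    double[] getVotesForInstance(ToyInstance i) { return classCounts.clone(); }

    // Append a human-readable description, indented by the given number of spaces.
    void getModelDescription(StringBuilder out, int indent) {
        for (int k = 0; k < indent; k++) out.append(' ');
        out.append("Majority class counts: ").append(Arrays.toString(classCounts));
    }

    public static void main(String[] args) {
        MajorityClassSketch model = new MajorityClassSketch(2);
        model.trainOnInstanceImpl(new ToyInstance(new double[] {0.5}, 1));
        StringBuilder description = new StringBuilder();
        model.getModelDescription(description, 2);
        System.out.println(description);
    }
}
```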




Extension to Evolving Data Streams




  New Evolving Data Stream Extensions

      New Stream Generators
      New UNION of Streams
      New Classifiers




  New Evolving Data Stream Generators

     Random RBF with Drift        Hyperplane
     LED with Drift               SEA Generator
     Waveform with Drift          STAGGER Generator




Concept Drift Framework
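  Figure: sigmoid function f(t) plotted against t, with the levels 0.5
  and 1 on the vertical axis, t0 and W marked on the horizontal axis,
  and the angle α.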
  Definition
  Given two data streams a, b, we define $c = a \oplus^{W}_{t_0} b$ as the
  data stream built by joining the two data streams a and b, where
      $\Pr[c(t) = b(t)] = 1 / (1 + e^{-4(t - t_0)/W})$,
      $\Pr[c(t) = a(t)] = 1 - \Pr[c(t) = b(t)]$.
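
  The definition can be checked numerically with a short sketch. The class and method names (`DriftJoinSketch`, `probFromB`, `pickSource`) are illustrative assumptions, not part of MOA. Two consequences of the formula are visible directly: at t = t0 both streams are equally likely, and the slope of the curve at t0 is 1/W.

```java
import java.util.Random;

class DriftJoinSketch {
    // Probability that position t of the joined stream c is taken from b.
    static double probFromB(double t, double t0, double w) {
        return 1.0 / (1.0 + Math.exp(-4.0 * (t - t0) / w));
    }

    // Draws the source of position t: 'b' with probability probFromB, else 'a'.
    static char pickSource(double t, double t0, double w, Random rng) {
        return rng.nextDouble() < probFromB(t, t0, w) ? 'b' : 'a';
    }

    public static void main(String[] args) {
        double t0 = 1000.0;
        double w = 200.0;
        Random rng = new Random(42);
        // Print the probability and one sampled source around the change point.
        for (double t = 600.0; t <= 1400.0; t += 200.0) {
            System.out.println(t + " " + probFromB(t, t0, w) + " " + pickSource(t, t0, w, rng));
        }
    }
}
```

  Far before t0 the joined stream behaves like a, far after t0 like b, and W controls how quickly the hand-over happens.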

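  Example
      $(((a \oplus^{W_0}_{t_0} b) \oplus^{W_1}_{t_1} c) \oplus^{W_2}_{t_2} d) \ldots$
      $(((\mathrm{SEA}_9 \oplus^{W}_{t_0} \mathrm{SEA}_8) \oplus^{W}_{2t_0} \mathrm{SEA}_7) \oplus^{W}_{3t_0} \mathrm{SEA}_{9.5})$

      $\mathrm{CovPokElec} = (\mathrm{CoverType} \oplus^{5{,}000}_{581{,}012} \mathrm{Poker}) \oplus^{5{,}000}_{1{,}000{,}000} \mathrm{ELEC2}$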


Extension to Evolving Data Streams




  New Evolving Data Stream Classifiers

     Adaptive Hoeffding Option Tree     OCBoost
     DDM Hoeffding Tree                 FLBoost
     EDDM Hoeffding Tree




Ensemble Methods




  New ensemble methods:
     Adaptive-Size Hoeffding Tree bagging:
         each tree has a maximum size
         after one node splits, it deletes some nodes to reduce its
         size if the size of the tree is higher than the maximum value
     ADWIN bagging:
         When a change is detected, the worst classifier is removed
         and a new classifier is added.



Adaptive-Size Hoeffding Tree
   Figure: four trees T1, T2, T3 and T4 of different sizes.

   Ensemble of trees of different size
       smaller trees adapt more quickly to changes,
       larger trees do better during periods with little change
       diversity
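
   The size bookkeeping can be sketched as follows. Two points are assumptions made only for this illustration and are not stated on the slide: the caps double from one ensemble member to the next, and a tree that exceeds its cap is discarded entirely, which is the crudest way to "delete some nodes". `SizeCappedTreeSketch` is a hypothetical name and models only the node count, not a real Hoeffding tree.

```java
class SizeCappedTreeSketch {
    final int maxSize;   // cap on the number of split nodes
    int size = 0;        // current number of split nodes
    int resets = 0;      // how many times the cap forced a restart

    SizeCappedTreeSketch(int maxSize) { this.maxSize = maxSize; }

    // Called after a node splits. Assumption: when the cap is exceeded
    // the whole tree is discarded and learning restarts from an empty tree.
    void onSplit() {
        size++;
        if (size > maxSize) {
            size = 0;
            resets++;
        }
    }

    // Assumption: caps double from one ensemble member to the next.
    static int[] doublingCaps(int firstCap, int members) {
        int[] caps = new int[members];
        for (int i = 0; i < members; i++) caps[i] = firstCap << i;
        return caps;
    }

    public static void main(String[] args) {
        // Feed the same number of splits to trees with different caps.
        for (int cap : doublingCaps(2, 4)) {
            SizeCappedTreeSketch tree = new SizeCappedTreeSketch(cap);
            for (int s = 0; s < 20; s++) tree.onSplit();
            System.out.println("cap=" + cap + " resets=" + tree.resets);
        }
    }
}
```

   Under these assumptions the small trees restart often and so forget old concepts quickly, while the large trees keep their structure for longer, which matches the trade-off listed above.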


   Figure: Kappa-Error diagrams for ASHT bagging (left) and bagging
   (right) on dataset RandomRBF with drift, plotting 90 pairs of
   classifiers. Both panels plot Error (vertical axis) against Kappa
   (horizontal axis).
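
   For reference, the kappa statistic for a pair of classifiers is usually defined as below; the slide does not spell out the formula, so this is the standard definition rather than a quotation. Here $C_{ij}$ counts the examples that the first classifier assigns to class $i$ and the second to class $j$, $m$ is the number of examples, and $L$ the number of classes (these symbols are introduced for this note).

```latex
\Theta_1 = \frac{1}{m} \sum_{i=1}^{L} C_{ii}, \qquad
\Theta_2 = \sum_{i=1}^{L} \left( \sum_{j=1}^{L} \frac{C_{ij}}{m} \right)
                          \left( \sum_{j=1}^{L} \frac{C_{ji}}{m} \right), \qquad
\kappa = \frac{\Theta_1 - \Theta_2}{1 - \Theta_2}
```

   $\kappa = 1$ means the two classifiers always agree and $\kappa = 0$ means they agree no more often than expected by chance. In a Kappa-Error diagram each point is one pair of classifiers, with $\kappa$ on the horizontal axis and the pair's average error on the vertical axis, so points further to the left indicate a more diverse ensemble.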




ADWIN Bagging

  ADWIN
  An adaptive sliding window whose size is recomputed online
  according to the rate of change observed.

  ADWIN has rigorous guarantees (theorems)
      On ratio of false positives and negatives
      On the relation of the size of the current window and
      change rates

  ADWIN Bagging
  When a change is detected, the worst classifier is removed and
  a new classifier is added.
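
  The replacement step can be sketched in a few lines. The change detector is not implemented here: a boolean flag stands in for ADWIN's decision, and the per-member error estimates are assumed to be maintained elsewhere. `ReplaceWorstSketch` and its method names are hypothetical, and resetting the newcomer's error estimate to zero is an assumption of this sketch.

```java
import java.util.function.Supplier;

class ReplaceWorstSketch {
    // Index of the member with the highest estimated error (first one on ties).
    static int worstIndex(double[] errors) {
        int worst = 0;
        for (int i = 1; i < errors.length; i++) {
            if (errors[i] > errors[worst]) worst = i;
        }
        return worst;
    }

    // If a change was detected, swap the worst member for a fresh one and
    // return its index; otherwise leave the ensemble untouched and return -1.
    static <T> int onExample(T[] members, double[] errors,
                             boolean changeDetected, Supplier<T> fresh) {
        if (!changeDetected) return -1;
        int worst = worstIndex(errors);
        members[worst] = fresh.get();
        errors[worst] = 0.0; // assumption: the newcomer starts with a clean estimate
        return worst;
    }

    public static void main(String[] args) {
        String[] members = {"h0", "h1", "h2"};   // strings stand in for classifiers
        double[] errors = {0.10, 0.40, 0.20};
        int replaced = onExample(members, errors, true, () -> "fresh");
        System.out.println(replaced + " " + members[replaced]);
    }
}
```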



Empirical evaluation

     Dataset                            Most Accurate Method
     Hyperplane Drift 0.0001            Bag10 ASHT W+R
     Hyperplane Drift 0.001             Bag10 ASHT W+R
     SEA W = 50                         Bag10 ASHT W+R
     SEA W = 50000                      BagADWIN 10 HT
     RandomRBF No Drift 50 centers      Bag10 HT
     RandomRBF Drift .0001 50 centers   BagADWIN 10 HT
     RandomRBF Drift .001 50 centers    Bag10 ASHT W+R
     RandomRBF Drift .001 10 centers    BagADWIN 10 HT
     Cover Type                         Bag10 ASHT W+R
     Poker                              OzaBoost
     Electricity                        OCBoost
     CovPokElec                         BagADWIN 10 HT




      Figure: Accuracy on dataset LED with three concept drifts.



  SEA, W = 50
           Method            Time    Acc.  Mem.
           Bag10 ASHT W+R   33.20   88.89  0.84
           BagADWIN 10 HT   54.51   88.58  1.90
           Bag5 ASHT W+R    19.78   88.55  0.01
           HT DDM            8.30   88.27  0.17
           HT EDDM           8.56   87.97  0.18
           OCBoost          59.12   87.21  2.41
           OzaBoost         39.40   86.28  4.03
           Bag10 HT         31.06   85.45  3.38
           AdaHOT50         22.70   85.35  0.86
           HOT50            22.54   85.20  0.84
           AdaHOT5          11.46   84.94  0.38
           HOT5             11.46   84.92  0.38
           HT                6.96   84.89  0.34
           NaiveBayes        5.32   83.87  0.00



  SEA, W = 50000
           Method            Time  Acc.     Mem.
           BagADWIN 10 HT   53.15 88.53     0.88
           Bag10 ASHT W+R   33.56 88.30     0.84
           HT DDM            7.88 88.07     0.16
           Bag5 ASHT W+R    20.00 87.99     0.05
           HT EDDM           8.52 87.64     0.06
           OCBoost          60.33 86.97     2.44
           OzaBoost         39.97 86.17     4.00
           Bag10 HT         30.88 85.34     3.36
           AdaHOT50         22.80 85.30     0.84
           HOT50            22.78 85.18     0.83
           AdaHOT5          12.48 84.94     0.38
           HOT5             12.46 84.91     0.37
           HT                7.20 84.87     0.33
           NaiveBayes        5.52 83.87     0.00


Summary




      http://www.cs.waikato.ac.nz/~abifet/MOA/

  Conclusions
     Extension of MOA to evolving data streams
     MOA is easy to use and extend
     New ensemble bagging methods:
          Adaptive-Size Hoeffding Tree bagging
          ADWIN bagging

  Future Work
     Extend MOA to more data mining and learning methods.

IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

New ensemble methods for evolving data streams

  • 1. New ensemble methods for evolving data streams. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà. Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), UPC-Barcelona Tech, Catalonia; University of Waikato, Hamilton, New Zealand. Paris, 29 June 2009, 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009.
  • 2. New Ensemble Methods For Evolving Data Streams. Outline: a new experimental data stream framework for studying concept drift; two new variants of Bagging: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging; an evaluation study on synthetic and real-world datasets.
  • 3. Outline: (1) MOA: Massive Online Analysis; (2) Concept Drift Framework; (3) New Ensemble Methods; (4) Empirical evaluation.
  • 4. What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. It is closely related to WEKA. It includes a collection of offline and online methods as well as tools for evaluation: boosting and bagging; Hoeffding Trees, with and without Naïve Bayes classifiers at the leaves.
  • 5. WEKA: Waikato Environment for Knowledge Analysis. Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java, released under the GPL. Support for the whole process of experimental data mining: preparation of input data, statistical evaluation of learning schemes, visualization of input data and the result of learning. Used for education, research and applications. Complements “Data Mining” by Witten & Frank.
  • 6. WEKA: the bird.
  • 7. MOA: the bird. The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
  • 10. Data stream classification cycle: (1) process an example at a time, and inspect it only once (at most); (2) use a limited amount of memory; (3) work in a limited amount of time; (4) be ready to predict at any point.
  • 11. Experimental setting. Evaluation procedures for data streams: Holdout; Interleaved Test-Then-Train, or Prequential. Environments: Sensor Network (100 KB), Handheld Computer (32 MB), Server (400 MB).
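The Interleaved Test-Then-Train (prequential) procedure named on slide 11 can be sketched as follows. This is a minimal illustration, not MOA's evaluator: the `MajorityClass` learner, the `predict`/`learn` method names, and the toy stream are all assumptions made for the example.

```python
from collections import Counter


class MajorityClass:
    """Toy incremental learner (hypothetical): predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training there is nothing to vote for.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each example first, then train on it; return overall accuracy."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:   # test on the unseen example ...
            correct += 1
        learner.learn(x, y)           # ... then use it once for training
        total += 1
    return correct / total if total else 0.0


# Hypothetical stream: label "a" on three examples out of four.
stream = [((i,), "a" if i % 4 else "b") for i in range(1000)]
accuracy = prequential_accuracy(stream, MajorityClass())
```

Each example is inspected exactly once and the learner can be queried at any point, which matches the classification cycle on slide 10. No separate holdout set is needed.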
  • 12. Experimental setting. Data sources: Random Tree Generator, Random RBF Generator, LED Generator, Waveform Generator, Function Generator.
  • 13. Experimental setting. Classifiers: Naive Bayes, Decision stumps, Hoeffding Tree, Hoeffding Option Tree, Bagging and Boosting. Prediction strategies: Majority class, Naive Bayes Leaves, Adaptive Hybrid.
  • 14. Easy design of a MOA classifier: void resetLearningImpl(); void trainOnInstanceImpl(Instance inst); double[] getVotesForInstance(Instance i); void getModelDescription(StringBuilder out, int indent).
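To illustrate the four-method contract on slide 14, here is a Python analogue with a majority-class learner. It is not MOA code: the snake_case names only mirror the Java signatures, and passing the label separately from the features (rather than inside an `Instance`) is an assumption made for the sketch.

```python
class MajorityClassLearner:
    """Python analogue of the four-method contract; names mirror the slide."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.reset_learning()

    def reset_learning(self):
        # Counterpart of resetLearningImpl(): forget everything learned so far.
        self.class_counts = [0.0] * self.num_classes

    def train_on_instance(self, features, label, weight=1.0):
        # Counterpart of trainOnInstanceImpl(Instance inst).
        self.class_counts[label] += weight

    def get_votes_for_instance(self, features):
        # Counterpart of getVotesForInstance(Instance i): one vote per class.
        return list(self.class_counts)

    def get_model_description(self, indent=0):
        # Counterpart of getModelDescription(StringBuilder out, int indent).
        return " " * indent + "class counts: " + str(self.class_counts)


learner = MajorityClassLearner(num_classes=2)
for label in (1, 0, 1):
    learner.train_on_instance(features=(0.0,), label=label)
votes = learner.get_votes_for_instance((0.0,))
```

The predicted class is the index of the largest vote, so the learner is ready to predict after any number of training calls.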
  • 15. Outline: (1) MOA: Massive Online Analysis; (2) Concept Drift Framework; (3) New Ensemble Methods; (4) Empirical evaluation.
  • 16. Extension to Evolving Data Streams. New evolving data stream extensions: new stream generators, new UNION of streams, new classifiers.
  • 17. Extension to Evolving Data Streams. New evolving data stream generators: Random RBF with Drift, LED with Drift, Waveform with Drift, Hyperplane, SEA Generator, STAGGER Generator.
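  • 18. Concept Drift Framework. [Figure: plot of f(t) against t, with 0.5 and 1 marked on the vertical axis, t_0 and the window length W marked on the horizontal axis, and the angle α.] Definition: given two data streams a, b, we define $c = a \oplus^{W}_{t_0} b$ as the data stream built by joining the two data streams a and b, where $\Pr[c(t) = b(t)] = 1/(1 + e^{-4(t - t_0)/W})$ and $\Pr[c(t) = a(t)] = 1 - \Pr[c(t) = b(t)]$.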
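The join operator defined on slide 18 can be sketched directly from its probability formula. The function names, the seeded random generator, and the overflow guard are assumptions added for the example; only the sigmoid itself comes from the slide.

```python
import math
import random


def drift_probability(t, t0, w):
    """Probability that the joined stream emits b's item at time t."""
    z = -4.0 * (t - t0) / w
    if z > 700.0:          # guard: math.exp would overflow far before t0
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def join_streams(a, b, t0, w, seed=0):
    """Yield the joined stream: at step t take b's item with the sigmoid probability."""
    rng = random.Random(seed)
    for t, (item_a, item_b) in enumerate(zip(a, b)):
        yield item_b if rng.random() < drift_probability(t, t0, w) else item_a


# Two hypothetical constant streams; the change is centred at t0 = 500 with W = 50.
joined = list(join_streams(["a"] * 1000, ["b"] * 1000, t0=500, w=50))
```

At t = t_0 the two streams are equally likely, and the slope of the sigmoid there is 1/W, so a smaller W gives a more abrupt change. Nesting calls to `join_streams` gives streams with several successive drifts.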
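  • 19. Concept Drift Framework. Example: $(((a \oplus^{W_0}_{t_0} b) \oplus^{W_1}_{t_1} c) \oplus^{W_2}_{t_2} d) \ldots$; $(((\mathrm{SEA}_9 \oplus^{W}_{t_0} \mathrm{SEA}_8) \oplus^{W}_{2t_0} \mathrm{SEA}_7) \oplus^{W}_{3t_0} \mathrm{SEA}_{9.5})$; $\mathrm{CovPokElec} = (\mathrm{CoverType} \oplus^{5,000}_{581,012} \mathrm{Poker}) \oplus^{5,000}_{1,000,000} \mathrm{ELEC2}$.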
  • 20. Extension to Evolving Data Streams. New evolving data stream classifiers: Adaptive Hoeffding Option Tree, DDM Hoeffding Tree, EDDM Hoeffding Tree, OCBoost, FLBoost.
  • 21. Outline: (1) MOA: Massive Online Analysis; (2) Concept Drift Framework; (3) New Ensemble Methods; (4) Empirical evaluation.
  • 22. Ensemble Methods. New ensemble methods. Adaptive-Size Hoeffding Tree bagging: each tree has a maximum size; after a node splits, if the size of the tree is higher than the maximum value, the tree deletes some nodes to reduce its size. ADWIN bagging: when a change is detected, the worst classifier is removed and a new classifier is added.
  • 23. Adaptive-Size Hoeffding Tree. [Figure: trees T1, T2, T3, T4.] Ensemble of trees of different size: smaller trees adapt more quickly to changes, larger trees do better during periods with little change; diversity.
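One way to combine trees of different size is sketched below. The slide does not give the sizing or voting rule, so everything specific here is an assumption for illustration: size limits that double from tree to tree starting at 2, an exponentially weighted moving average of each tree's error, and votes weighted by the inverse of the squared error.

```python
def asht_max_sizes(n_trees, first_size=2):
    """Assumed sizing rule: each tree may grow twice as large as the previous one."""
    return [first_size * 2 ** i for i in range(n_trees)]


class EwmaError:
    """Exponentially weighted moving average of a 0/1 error signal (assumed monitor)."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.value = 0.0

    def update(self, was_error):
        self.value = self.alpha * float(was_error) + (1.0 - self.alpha) * self.value
        return self.value


def weighted_vote(predictions, errors, eps=1e-6):
    """Combine member predictions, weighting each by 1 / error^2 (assumed rule)."""
    scores = {}
    for label, err in zip(predictions, errors):
        scores[label] = scores.get(label, 0.0) + 1.0 / (err + eps) ** 2
    return max(scores, key=scores.get)


sizes = asht_max_sizes(4)
winner = weighted_vote(["x", "y", "y"], [0.1, 0.3, 0.3])
```

With this weighting a single accurate small tree can outvote several larger trees that have not yet recovered from a change, which is the behaviour the slide attributes to mixing sizes.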
  • 24. Adaptive-Size Hoeffding Tree. Figure: Kappa-Error diagrams (Error plotted against Kappa) for ASHT bagging (left) and bagging (right) on dataset RandomRBF with drift, plotting 90 pairs of classifiers.
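Each point in a Kappa-Error diagram corresponds to one pair of classifiers. A minimal sketch of the two coordinates follows, using the standard kappa statistic (observed agreement corrected for chance agreement). The function names and the convention of returning 1.0 when chance agreement is already 1 are assumptions.

```python
from collections import Counter


def kappa(preds_a, preds_b):
    """Kappa agreement between two classifiers' predictions on the same examples."""
    n = len(preds_a)
    if n == 0 or n != len(preds_b):
        raise ValueError("need two non-empty prediction lists of equal length")
    # theta1: observed fraction of examples on which the two classifiers agree.
    theta1 = sum(pa == pb for pa, pb in zip(preds_a, preds_b)) / n
    # theta2: agreement expected by chance from each classifier's label frequencies.
    count_a, count_b = Counter(preds_a), Counter(preds_b)
    theta2 = sum((count_a[c] / n) * (count_b[c] / n) for c in count_a)
    if theta2 == 1.0:
        return 1.0  # both always predict the same single class (assumed convention)
    return (theta1 - theta2) / (1.0 - theta2)


def pair_error(preds_a, preds_b, truth):
    """Average error rate of the two classifiers, the vertical axis of the diagram."""
    n = len(truth)
    err_a = sum(p != y for p, y in zip(preds_a, truth)) / n
    err_b = sum(p != y for p, y in zip(preds_b, truth)) / n
    return (err_a + err_b) / 2.0
```

Kappa near 1 means the two classifiers nearly always agree, and kappa near 0 means they agree no more than chance. Points further to the left of the diagram therefore indicate a more diverse ensemble.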
  • 25. ADWIN Bagging. ADWIN: an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) on the ratio of false positives and negatives, and on the relation of the size of the current window and change rates. ADWIN Bagging: when a change is detected, the worst classifier is removed and a new classifier is added.
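The idea of slide 25 can be sketched with a naive adaptive window. This is a simplified stand-in, not the bucket-compressed ADWIN in MOA: it stores every value, tests every split of the window against a Hoeffding-style cut threshold, and drops the oldest values while any split shows a significant difference in means. The class name, the default `delta`, and measuring "worst" as the highest mean error in a member's window are assumptions.

```python
import math


class NaiveAdaptiveWindow:
    """Naive ADWIN-style window over values in [0, 1] (no bucket compression)."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = []

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def add(self, value):
        """Append a value; shrink the window and return True if a change is detected."""
        self.window.append(float(value))
        changed = False
        while self._some_split_differs():
            del self.window[0]      # drop the oldest value and re-test
            changed = True
        return changed

    def _some_split_differs(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        head_sum = 0.0
        for split in range(1, n):
            head_sum += self.window[split - 1]
            n0, n1 = split, n - split
            mean0 = head_sum / n0
            mean1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic-style combined size
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps_cut:
                return True
        return False


def replace_worst(ensemble, windows, make_learner):
    """Replace the member whose error window has the highest mean; return its index."""
    worst = max(range(len(ensemble)), key=lambda i: windows[i].mean())
    ensemble[worst] = make_learner()
    windows[worst] = NaiveAdaptiveWindow(windows[worst].delta)
    return worst
```

In use, each ensemble member would feed its 0/1 prediction errors into its own window, and `replace_worst` would be called whenever any window's `add` returns True.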
  • 26. Outline: (1) MOA: Massive Online Analysis; (2) Concept Drift Framework; (3) New Ensemble Methods; (4) Empirical evaluation.
  • 27. Empirical evaluation.
        Dataset                              Most Accurate Method
        Hyperplane Drift 0.0001              Bag10 ASHT W+R
        Hyperplane Drift 0.001               Bag10 ASHT W+R
        SEA W = 50                           Bag10 ASHT W+R
        SEA W = 50000                        BagADWIN 10 HT
        RandomRBF No Drift 50 centers        Bag10 HT
        RandomRBF Drift .0001 50 centers     BagADWIN 10 HT
        RandomRBF Drift .001 50 centers      Bag10 ASHT W+R
        RandomRBF Drift .001 10 centers      BagADWIN 10 HT
        Cover Type                           Bag10 ASHT W+R
        Poker                                OzaBoost
        Electricity                          OCBoost
        CovPokElec                           BagADWIN 10 HT
  • 28. Empirical evaluation. Figure: Accuracy on dataset LED with three concept drifts.
  • 29. Empirical evaluation. SEA W = 50:
        Method              Time    Acc.   Mem.
        Bag10 ASHT W+R     33.20   88.89   0.84
        BagADWIN 10 HT     54.51   88.58   1.90
        Bag5 ASHT W+R      19.78   88.55   0.01
        HT DDM              8.30   88.27   0.17
        HT EDDM             8.56   87.97   0.18
        OCBoost            59.12   87.21   2.41
        OzaBoost           39.40   86.28   4.03
        Bag10 HT           31.06   85.45   3.38
        AdaHOT50           22.70   85.35   0.86
        HOT50              22.54   85.20   0.84
        AdaHOT5            11.46   84.94   0.38
        HOT5               11.46   84.92   0.38
        HT                  6.96   84.89   0.34
        NaiveBayes          5.32   83.87   0.00
  • 30. Empirical evaluation. SEA W = 50000:
        Method              Time    Acc.   Mem.
        BagADWIN 10 HT     53.15   88.53   0.88
        Bag10 ASHT W+R     33.56   88.30   0.84
        HT DDM              7.88   88.07   0.16
        Bag5 ASHT W+R      20.00   87.99   0.05
        HT EDDM             8.52   87.64   0.06
        OCBoost            60.33   86.97   2.44
        OzaBoost           39.97   86.17   4.00
        Bag10 HT           30.88   85.34   3.36
        AdaHOT50           22.80   85.30   0.84
        HOT50              22.78   85.18   0.83
        AdaHOT5            12.48   84.94   0.38
        HOT5               12.46   84.91   0.37
        HT                  7.20   84.87   0.33
        NaiveBayes          5.52   83.87   0.00
  • 31. Summary. http://www.cs.waikato.ac.nz/~abifet/MOA/ Conclusions: extension of MOA to evolving data streams; MOA is easy to use and extend; new ensemble bagging methods: Adaptive-Size Hoeffding Tree bagging and ADWIN bagging. Future work: extend MOA to more data mining and learning methods.