
Concept Drift: Monitoring Model Quality In Streaming ML Applications

Most machine learning algorithms are designed to work with stationary data. Yet, real-life streaming data is rarely stationary. Machine learned models built on data observed within a fixed time period usually suffer loss of prediction quality due to what is known as concept drift.

The most common way to deal with concept drift is to periodically retrain the models with new data. The length of the retraining period is usually determined by the cost of retraining; the changes in the input data and the quality of predictions are not monitored, and the cost of inaccurate predictions is not included in these calculations.

A better alternative is to monitor model quality by testing the inputs and predictions for changes over time, and to use the detected change points in retraining decisions. There has been significant development in this area within the last two decades.

In this webinar, Emre Velipasaoglu, Principal Data Scientist at Lightbend, Inc., will review successful methods for monitoring the quality of machine-learned models.



1. model quality
2. core problem: training set and operation are assumed to come from the same data-generating distribution (some algorithms tolerate violation of this to a certain degree)
3. core problem: stream
4. core problem: stream, population change
5. core problem: stream, sensor failure
6. core problem: stream, concept drift
7. core problem: stream, emerging concept
8. common solution: periodic batch retraining of the model (quality degrades between retrains)
9.-14. can active learning help? (classifier margin; feature axes: color, size)
15. a better solution: data → feature extraction → classifier → predictions
16.-19. a better solution: data → feature extraction → classifier → predictions, plus monitoring
20. a better solution: data → feature extraction → classifier → predictions, plus labeling and monitoring
21. a better solution: data → feature extraction → classifier → predictions, plus labeling, change detection, and adaptation
22. monitor how? supervised: statistical process control, sequential analysis, error distribution monitoring; unsupervised: clustering / novelty detection, feature distribution monitoring, model-dependent monitoring
23. adapt how? explicit mechanisms: windowing, weighting, sampling; implicit mechanisms: pure methods, ensemble methods
24. which method?
25. monitor how? (taxonomy repeated; next: statistical process control)
26. ML theory: samples vs. errors
27. statistical process control • Drift Detection Method [DDM] • the number of errors is Binomial: error rate p_t with standard deviation s_t = sqrt(p_t (1 - p_t) / t) • alert when p_t + s_t >= p_min + 3 s_min (see the sketch after slide 28)
28. statistical process control • Early Drift Detection Method [EDDM] • monitors the distance between consecutive errors, which works better for gradual drift • warn and start caching when the distance statistic falls below a warning fraction of its running maximum • alert and reset the maximum when it falls below the drift fraction • Drift Detection Method [DDM] recap: number of errors is Binomial; alert when p_t + s_t >= p_min + 3 s_min
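A minimal Python sketch of the DDM rule described on slides 27-28, assuming 0/1 error inputs. The class name, thresholds, and warm-up length are illustrative choices for this sketch, not the speaker's code or a library API.

    import math

    class DDM:
        """Sketch of the Drift Detection Method: track the streaming error rate p
        and its std s = sqrt(p * (1 - p) / t); warn / alert when p + s exceeds the
        best (minimum) p_min + s_min seen so far by 2 / 3 times s_min."""

        def __init__(self, warn_level=2.0, drift_level=3.0, min_samples=30):
            self.warn_level = warn_level
            self.drift_level = drift_level
            self.min_samples = min_samples
            self.reset()

        def reset(self):
            self.t = 0
            self.p = 1.0                      # running error rate
            self.s = 0.0
            self.p_min = float("inf")
            self.s_min = float("inf")

        def update(self, error):
            """error: 1 if the model misclassified this sample, else 0."""
            self.t += 1
            self.p += (error - self.p) / self.t
            self.s = math.sqrt(self.p * (1.0 - self.p) / self.t)
            if self.t < self.min_samples:
                return "stable"
            if self.p + self.s < self.p_min + self.s_min:
                self.p_min, self.s_min = self.p, self.s
            if self.p + self.s >= self.p_min + self.drift_level * self.s_min:
                self.reset()                  # drift confirmed: retrain the model
                return "drift"
            if self.p + self.s >= self.p_min + self.warn_level * self.s_min:
                return "warning"              # start caching samples for retraining
            return "stable"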
29. monitor how? (taxonomy repeated; next: sequential analysis)
30.-32. sequential analysis • Linear Four Rates [LFR] • stationary data => constant contingency table (true class vs. predicted class: TN, FN, FP, TP) • calculate four rates (TPR, TNR, PPV, NPV) • incremental updates
33. sequential analysis • Linear Four Rates [LFR] • test for change • Monte Carlo sampling for the significance level • Bonferroni correction for correlated tests • O(1) • better than (E)DDM under class imbalance (see the sketch below)
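A toy Python sketch in the spirit of LFR (slides 30-33), assuming binary 0/1 labels and predictions. The decayed-rate update mirrors the slide's "incremental updates" idea, but the simple bound check below stands in for the published method's Monte Carlo significance levels and Bonferroni correction, so treat every name and parameter as illustrative.

    import math

    class LinearFourRatesSketch:
        """Exponentially decayed estimates of TPR, TNR, PPV, NPV from a streaming
        confusion matrix; alert when a rate drifts too far from its slowly
        tracked baseline."""

        def __init__(self, decay=0.99, z=3.0):
            self.decay = decay        # weight on the previous rate estimate
            self.z = z                # alert threshold in baseline std deviations
            self.rates = {}           # decayed rate estimates
            self.baselines = {}       # long-run baselines per rate
            self.counts = {}          # observations per rate denominator

        def update(self, y_true, y_pred):
            correct = int(y_true == y_pred)
            # each observation touches only the rates whose denominator it hits
            touched = ["tpr" if y_true == 1 else "tnr",
                       "ppv" if y_pred == 1 else "npv"]
            alerts = []
            for name in touched:
                r = self.rates.get(name, float(correct))
                r = self.decay * r + (1.0 - self.decay) * correct
                self.rates[name] = r
                n = self.counts.get(name, 0) + 1
                self.counts[name] = n
                base = self.baselines.setdefault(name, r)
                # effective sample size of an exponentially decayed average
                n_eff = min(n, 1.0 / (1.0 - self.decay))
                std = math.sqrt(max(base * (1.0 - base), 1e-6) / n_eff)
                if abs(r - base) > self.z * std:
                    alerts.append(name)
                else:
                    # track the baseline slowly while nothing is detected
                    self.baselines[name] = 0.999 * base + 0.001 * r
            return alerts             # e.g. ["tpr"] when recall degrades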
34. monitor how? (taxonomy repeated; next: error distribution monitoring)
35. error distribution monitoring • ADaptive WINdowing [ADWIN] • consider all partitions of a window of prediction errors into sub-windows w0 and w1 • drop the last element if any partition shows a significant difference between the sub-window means • efficient version is O(log W): the window is stored as exponential histograms, and the last sub-window is dropped rather than the last element (see the sketch below)
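A brute-force Python sketch of the ADWIN cut test from slide 35, over 0/1 prediction errors. The published algorithm stores the window as exponential histograms and runs in O(log W); this O(W) loop only illustrates the "compare every w0 | w1 split" idea, and the delta value and class name are illustrative.

    import math
    from collections import deque

    class AdwinSketch:
        """Keep a window of recent errors; after each arrival, check every split
        into an older part w0 and a newer part w1, and drop from the old end of
        the window (the slide's 'last element') while any split shows a mean
        difference larger than a Hoeffding-style bound."""

        def __init__(self, delta=0.002, max_len=10_000):
            self.delta = delta
            self.window = deque(maxlen=max_len)

        def update(self, error):
            """error: 1 if the model was wrong on this sample, else 0.
            Returns True if a change was detected (older data dropped)."""
            self.window.append(float(error))
            changed = False
            while self._cut():
                self.window.popleft()     # shrink from the old side
                changed = True
            return changed

        def _cut(self):
            w = list(self.window)
            n = len(w)
            total = sum(w)
            s0 = 0.0
            for i in range(1, n):         # i = size of the older sub-window w0
                s0 += w[i - 1]
                n0, n1 = i, n - i
                mu0, mu1 = s0 / n0, (total - s0) / n1
                m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic mean of sub-window sizes
                eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * n / self.delta))
                if abs(mu0 - mu1) > eps:
                    return True
            return False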
36. resampling • compare prediction loss over random permutations vs. the time-ordered training data • a parallel permutation-test version is available • still expensive • the only method directly applicable to the regression setting • side note: even with a finite training set, drift can be problematic if the model is developed naively (see the sketch below)
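A Python sketch of the resampling idea from slide 36, assuming numpy arrays and user-supplied `fit(X, y) -> model with .predict` and `loss(y, y_hat) -> float` callables; these names and the holdout split are assumptions of this sketch, not part of any published procedure.

    import numpy as np

    def permutation_drift_test(X, y, fit, loss, n_perm=50, holdout=0.2, seed=0):
        """Train on the first part of the time-ordered data, score on the most
        recent tail, and compare that loss with losses obtained when the row
        order is randomly permuted (i.e. when there is no temporal structure).
        An ordered loss far out in the permuted-loss distribution suggests the
        data-generating process changed over the training period."""
        rng = np.random.default_rng(seed)
        n = len(y)
        k = int(n * (1.0 - holdout))

        def tail_loss(order):
            Xo, yo = X[order], y[order]
            model = fit(Xo[:k], yo[:k])
            return loss(yo[k:], model.predict(Xo[k:]))

        ordered = tail_loss(np.arange(n))
        permuted = np.array([tail_loss(rng.permutation(n)) for _ in range(n_perm)])
        # one-sided p-value: how often does a shuffled split look at least as bad?
        p_value = float((permuted >= ordered).mean())
        return ordered, permuted.mean(), p_value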
37. monitor how? (taxonomy repeated; next: clustering / novelty detection)
38. clustering / novelty detection • OLINDDA: k-means; periodically merge unknown clusters into known ones, or flag them • MINAS: micro-clusters, incremental stream clustering • DETECTNOD: Discrete Cosine Transform to estimate distances efficiently • Woo-ensemble: treat outliers as potential emerging class centroids • ECSMiner: store and use cluster summaries efficiently • GC3: grid-based clustering (feature axes: size, color)
39. clustering / novelty detection: curse of dimensionality (same methods as slide 38; see the sketch below)
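A minimal centroid-based novelty check in the spirit of the clustering approaches on slides 38-39. It is not any of the listed algorithms; the radius heuristic and parameter names are assumptions of this sketch.

    import numpy as np

    def novelty_flags(X_stream, centroids, radius_factor=2.0):
        """Flag points whose distance to the nearest known cluster center exceeds
        radius_factor times a typical radius; such points are candidates for an
        emerging concept.  Radii are crudely approximated by the median
        nearest-centroid distance, purely for illustration."""
        centroids = np.asarray(centroids, dtype=float)
        X_stream = np.asarray(X_stream, dtype=float)
        # distance of every streamed point to every known centroid
        d = np.linalg.norm(X_stream[:, None, :] - centroids[None, :, :], axis=2)
        nearest = d.min(axis=1)
        typical = np.median(nearest)          # crude per-stream radius estimate
        return nearest > radius_factor * typical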
40. monitor how? (taxonomy repeated; next: feature distribution monitoring)
41. feature distribution monitoring • monitor individual features across windows w0 and w1 (e.g. color, size) • many ways to compare: • Pearson correlation [Change of Concept, CoC] • Hellinger distance [HDDDM] ~ O(DB) • PCA to reduce the number of features to track (top [PCA-1] or bottom [PCA-2] n%) (see the sketch below)
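A Python sketch of HDDDM-style feature monitoring from slide 41, computing the Hellinger distance between histograms of one feature in a reference window w0 and a current window w1. The bin count and the shared-edges choice are illustrative; HDDDM's adaptive alert threshold is left out.

    import numpy as np

    def hellinger_drift(feature_w0, feature_w1, n_bins=20):
        """Return the Hellinger distance in [0, 1] between the feature's
        distribution in the two windows; larger values mean a bigger shift."""
        lo = float(min(np.min(feature_w0), np.min(feature_w1)))
        hi = float(max(np.max(feature_w0), np.max(feature_w1)))
        hi = hi if hi > lo else lo + 1e-9           # guard against constant features
        edges = np.linspace(lo, hi, n_bins + 1)
        p, _ = np.histogram(feature_w0, bins=edges)
        q, _ = np.histogram(feature_w1, bins=edges)
        p = p / max(p.sum(), 1)
        q = q / max(q.sum(), 1)
        return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))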
42. monitor how? (taxonomy repeated; next: model-dependent monitoring)
43. model-dependent monitoring • not all changes matter • monitor the posterior probability estimate • use the [A-distance], a generalized Kolmogorov-Smirnov distance designed to be less sensitive to irrelevant changes (compared on the slide with the L1 distance and the KS distance; see the sketch below)
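A Python sketch of model-dependent monitoring on posterior estimates, as on slide 43. A plain two-sample Kolmogorov-Smirnov test on the model's scores stands in for the A-distance (which generalizes KS over a chosen family of sets to ignore irrelevant changes); scipy is assumed to be available, and the alpha value is illustrative.

    import numpy as np
    from scipy.stats import ks_2samp

    def score_distribution_shift(scores_ref, scores_cur, alpha=0.01):
        """Compare the model's predicted scores (posterior probability estimates)
        on a reference window vs. the current window; a small p-value suggests
        the score distribution, and likely the input distribution, has shifted."""
        stat, p_value = ks_2samp(np.asarray(scores_ref), np.asarray(scores_cur))
        return {"statistic": float(stat), "p_value": float(p_value),
                "drift": bool(p_value < alpha)}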
44. model-dependent monitoring • [Margin] distribution: • rank statistic on density estimates for a binary representation of the data • compare average margins of a linear classifier induced by the 1-norm SVM • based on the average zero-one or sigmoid error rate of an SVM classifier • generalized margin [MD3]: • embed the base classifier in a random-feature-bagged ensemble • margin == the high-disagreement region of the ensemble (see the sketch below)
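A Python sketch of MD3-style margin-density monitoring from slide 44, assuming a feature-bagged ensemble of fitted classifiers with .predict and binary 0/1 predictions. The 0.8 agreement cutoff is an illustrative stand-in for a real margin definition, and MD3's drift thresholding on the density is omitted.

    import numpy as np

    def margin_density(ensemble, X_window, agreement=0.8):
        """Return the fraction of recent, unlabeled samples falling in the
        ensemble's disagreement region (the 'margin'); MD3 signals drift when
        this density moves far from its reference value."""
        votes = np.stack([clf.predict(X_window) for clf in ensemble])  # (members, n)
        # majority label per sample for binary 0/1 predictions
        majority = np.round(votes.mean(axis=0))
        majority_share = np.mean(votes == majority, axis=0)
        in_margin = majority_share < agreement        # members disagree noticeably
        return float(in_margin.mean())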
45. adapt how? (taxonomy repeated; next: explicit mechanisms, windowing)
46. explicit mechanisms for adaptation • windowing: window W under stationary conditions vs. W under drift • ADWIN: drop the last sub-window if the threshold is exceeded, i.e. adaptively shrink the window during drift (see the sketch below)
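An illustrative use of the AdwinSketch detector from the earlier sketch as an adaptation mechanism, as suggested by slide 46: keep a training buffer aligned with the detector's window and retrain whenever the window shrinks. The `train_model` callable, the predict interface, and the buffer handling are assumptions of this sketch, not part of ADWIN.

    from collections import deque

    def run_with_adaptive_window(stream, train_model, initial_model, max_len=10_000):
        """stream yields (x, y) pairs; train_model(list_of_xy) returns a fitted
        model with a scikit-learn-style .predict."""
        detector = AdwinSketch(max_len=max_len)
        buffer = deque(maxlen=max_len)
        model = initial_model
        for x, y in stream:
            error = int(model.predict([x])[0] != y)
            buffer.append((x, y))
            if detector.update(error):
                # drift: keep only as much history as the detector kept, retrain
                recent = list(buffer)[-len(detector.window):]
                buffer = deque(recent, maxlen=max_len)
                model = train_model(recent)
        return model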
47. explicit mechanisms for adaptation • JIT: adaptation goes through a similar refinement process (diagram: windows w0..w4, models m0..m4, intervals I0..I4, change detected)
48. adapt how? (taxonomy repeated; next: explicit mechanisms, sampling)
49. explicit mechanisms for adaptation • biased reservoir sampling • parameters: bias, capacity • on each arrival, overwrite / exchange a random reservoir slot with Prob{ fraction full }, or append (see the sketch below)
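A Python sketch of the biased reservoir sampling rule on slide 49. The capacity-as-1/bias convention and the names are illustrative; the key point is that older points decay, keeping the retraining sample biased toward recent data.

    import random

    class BiasedReservoir:
        """Every arriving point is inserted: with probability equal to the current
        fill fraction it overwrites a random existing slot, otherwise it is
        appended, matching the slide's 'overwrite / exchange randomly with
        Prob{ fraction full } or append' rule."""

        def __init__(self, bias=0.001, rng=None):
            self.capacity = int(round(1.0 / bias))
            self.reservoir = []
            self.rng = rng or random.Random(0)

        def add(self, item):
            fill = len(self.reservoir) / self.capacity
            if self.reservoir and self.rng.random() < fill:
                # overwrite / exchange a random slot with Prob{ fraction full }
                self.reservoir[self.rng.randrange(len(self.reservoir))] = item
            else:
                self.reservoir.append(item)   # otherwise append (still room)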
50. adapt how? (taxonomy repeated; next: implicit mechanisms, ensemble methods)
51.-53. implicit mechanisms for adaptation • ensemble-based adaptation: ensemble 1 … ensemble (N-1), ensemble N • train a new member • retire / decay old members • handle recurring concepts
54. implicit mechanisms for adaptation • Online NonStationary Boosting [ONSBoost] • NonStationary Random Forests [NSRF] • Dynamic Weighted Majority [DWM] • Learn++ for NonStationary Environments [Learn++.NSE] (see the sketch below)
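A toy Python sketch of ensemble-based adaptation in the spirit of the methods on slides 51-54 (train a new member, decay and retire old ones). It follows none of the listed algorithms exactly; `base_model` is any fit/predict classifier (e.g. scikit-learn), binary 0/1 labels are assumed, and all parameters are illustrative.

    import numpy as np
    from copy import deepcopy

    class DriftingEnsemble:
        """Members vote with weights; a member's weight decays when it performs
        poorly on the newest batch, weak members are retired, and a new member
        is trained on the most recent data."""

        def __init__(self, base_model, beta=0.7, weight_floor=0.05, max_members=10):
            self.base_model = base_model
            self.beta = beta
            self.weight_floor = weight_floor
            self.max_members = max_members
            self.members, self.weights = [], []

        def predict(self, X):
            if not self.members:
                raise RuntimeError("ensemble is empty; call update_on_batch first")
            votes = np.stack([m.predict(X) for m in self.members])    # (M, n)
            w = np.asarray(self.weights)[:, None]
            # weighted majority vote for binary 0/1 labels
            return (np.sum(votes * w, axis=0) / w.sum() >= 0.5).astype(int)

        def update_on_batch(self, X, y):
            # decay members that misclassify the new batch, retire weak ones
            for i, m in enumerate(self.members):
                acc = float(np.mean(m.predict(X) == y))
                self.weights[i] *= self.beta ** (1.0 - acc)
            keep = [i for i, w in enumerate(self.weights) if w >= self.weight_floor]
            self.members = [self.members[i] for i in keep]
            self.weights = [self.weights[i] for i in keep]
            # train a new member on the most recent data
            self.members.append(deepcopy(self.base_model).fit(X, y))
            self.weights.append(1.0)
            if len(self.members) > self.max_members:
                worst = int(np.argmin(self.weights))
                del self.members[worst], self.weights[worst]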
55. which method?
Method | Efficiency | Pros | Cons | Notes
DDM / EDDM | O(1) | no data stored | label cost; false alarms | sampling necessary in case of fast data; ideal in a microservices architecture
LFR | O(1) | class imbalance OK | label cost |
ADWIN | O(log W) | better change localization | label cost |
JIT | O(log W) | no labels required | only for abrupt changes | best localization
56. which method?
Method | Efficiency | Pros | Cons | Notes
ECSMiner / GC3 | O(W^2 / k), O(G log C) | emerging concepts | clusterable drift only | use if emerging concepts are expected
HDDDM | O(DB) | no labels | not for population drift or class imbalance | better when combined with PCA
A-distance | O(log W) | no labels; fewer false positives than HDDDM | | good choice for unsupervised monitoring
Margin / MD3 | | learning, detection, and adaptation bundled; reduced false alarms | must use feature-bagged ensembles | best choice, but must commit to the specific machine learning algorithms
Ensemble methods | | recurring concepts | large batches |
57. references: https://gist.github.com/emrev12/0d75dc2d6c3e80012d10a82712b8ced0
58. check out these resources: Dean's book, webinars, etc. • Fast Data Architectures for Streaming Applications: Getting Answers Now from Data Sets that Never End, by Dean Wampler, Ph.D., VP of Fast Data Engineering • LIGHTBEND.COM/LEARN
59. Serving Machine Learning Models: A Guide to Architecture, Stream Processing Engines, and Frameworks, by Boris Lublinsky, Fast Data Platform Architect • LIGHTBEND.COM/LEARN
60. lightbend.com/fast-data-platform
61. thank you! emre.velipasaoglu@ .com
