
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez - Metafor Software - LISA 2014

This is my presentation from LISA 2014 in Seattle on November 14, 2014.

Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.

In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian-based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.

Published in: Software

  1. Five Things I Learned While Building Anomaly Detection Tools (Or: 5 things that bit me in the …)
     Toufic Boubez, Ph.D., Founder, CTO, Metafor Software, toufic@metaforsoftware.com
  2. Preamble
     • IANA Data Scientist! I’m just an engineer that needed to get stuff done!
     • I learned (!) many more things, but they cannot be mentioned!
       – Because lawyers
       – But ask me later
     • I usually beat up on parametric, Gaussian, supervised techniques
       – This talk is not an exception,
       – But more of a “lessons learned” message
     • Note: all data real
     • Note: no y-axis labels on charts – on purpose!!
     • Note to self: remember to SLOW DOWN!
     • Note to self: mention the cats!! Everybody loves cats!!
  3. Toufic intro – who I am
     • Co-Founder/CTO Metafor Software
     • Co-Founder/CTO Layer 7 Technologies
       – Acquired by Computer Associates in 2013
       – I escaped
     • CTO Saffron Technology
     • IBM Chief Architect for SOA
     • Co-Author, Co-Editor: WS-Trust, WS-SecureConversation, WS-Federation, WS-Policy
     • Building large scale software systems for >20 years (I’m older than I look, I know!)
  4. Why Anomaly Detection?
     • Watching screens on the “Wall of Charts” cannot scale!
       – Leads to alert fatigue
     • Need to automate detection of anomalous behaviors
     • Anomaly detection is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1]
  5. Thing 1: Your data is NOT Gaussian
  6. Gaussian or Normal distribution
     • Bell-shaped distribution
       – Has a mean and a standard deviation
  7. This is Normally distributed data
  8. Quick check: Histogram
  9. Normal distributions are really useful
     • I can make powerful predictions because of the statistical properties of the data
     • I can easily compare different metrics since they have similar statistical properties
     • There is a HUGE body of statistical work on parametric techniques for normally distributed data
  10. Normally distributed vs Not
     Normal distributions:
     • Most naturally occurring processes
     • Population height, IQ distributions (present company excepted of course)
     • Widget sizes, weights in manufacturing
     • …
     Not:
     • Your metrics!
  11. Why is that important?
     • Most analytics tools are based on two assumptions:
       1. Parametric techniques: Data is normally distributed with a useful and usable mean and standard deviation
       2. Supervised Learning techniques: Data is probabilistically “stationary”
  12. Example: Three-Sigma Rule
     • Three-sigma rule
       – ~68% of the values lie within 1 std deviation of the mean
       – ~95% of the values lie within 2 std deviations
       – 99.73% of the values lie within 3 std deviations: anything else is considered an outlier
  13. Aaahhhh
     • The mysterious red lines explained (chart labels: 3σ, mean, 3σ)
  14. Doesn’t work because THIS
  15. Histogram – probability distribution
  16. 3-sigma rule alerts
  17. Holt-Winters predictions
  18. Or worse, THIS!
  19. Histogram – probability distribution
  20. 3-sigma rule alerts
  21. Thing 2: Yesterday’s anomaly is today’s normal
  22. Why is that important?
     • Most analytics tools are based on two assumptions:
       1. Parametric techniques: Data is normally distributed with a useful and usable mean and standard deviation
       2. Supervised Learning techniques: Data is probabilistically “stationary”
  23. Remember this data?
  24. No matter where you look
  25. Its characteristics are stationary
  26. Meanwhile, in our real world
     • Stationarity is not a realistic assumption in the large complex systems with which we’re dealing
     • “Concept Drift” is very common
       – http://en.wikipedia.org/wiki/Concept_drift
         “… the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.”
  27. Meanwhile, in our real world
     • Stationarity is not a realistic assumption in the large complex systems with which we’re dealing
     • “Concept Drift” is very common
       – http://en.wikipedia.org/wiki/Concept_drift
         “… the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.”
  28. Supervised learning
     • In ML, Supervised Learning is the general set of techniques for inferring a model from a set of observations:
       – Observations in a Training Set are labelled with the desired outcomes (e.g. “normal vs. anomalous”, “normal vs. fraudulent”, “red/green/yellow”, etc.)
       – As observations are fed into the learning system, it learns to differentiate by inferring a model based on these labels
       – Once sufficiently “trained”, the system is used in production on “real” unlabelled data and can label the new data based on the inferred model
  29. What happens when something changes in your fundamentals?
  30. This is your new normal: all red all the time
  31. Mean Shift and Breakout Detection
     • https://blog.twitter.com/2014/breakout-detection-in-the-wild
  32. Thing 3: Saying Kolmogorov-Smirnov is a great way to impress everyone
  33. Why is that important?
     • Seriously!?
     • Ok, actually non-parametric techniques that make no assumptions about normality or any other probability distribution are crucial in your effort to understand what’s going on in your systems
  34. The Kolmogorov-Smirnov test
     • Non-parametric test
       – Compare two probability distributions
       – Makes no assumptions (e.g. Gaussian) about the distributions of the samples
       – Measures maximum distance between cumulative distributions
       – Can be used to compare periodic/seasonal metric periods (e.g. day-to-day or week-to-week)
     http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
  35. KS with windowing
  36. Data from similar windows
  37. Cumulative distribution for those windows
  38. Data from dissimilar windows
  39. Cumulative distribution for those windows
  40. Sliding window of KS scores
  41. KS anomaly results
  42. Thing 4: Take Scope and Context into account!
  43. Some data – is that normal?
  44. Wider scope
  45. Is this an anomaly?
  46. Even wider scope
  47. Is every weekend an anomaly?
  48. Would this be more accurate?
  49. Use domain knowledge!
     • Domain knowledge is NOT a bad thing!
       – There is no algorithm that will work on everything
       – Know your data and its general patterns
         • Periodicity/Seasonality
         • Known events (maintenance, backups, etc.)
       – Apply the appropriate algorithms, taking into account enough scope for any inherent periodicity to appear
       – Customize your alerts to take into account known events
  50. Thing 5: No data != No information
  51. Why is that important?
     • Some data channels are inherently non-chatty:
       – We don’t have the luxury of always generating non-zero values
       – There is a lot of useful information in the fact that nothing is happening on a particular channel
     • A lot of time series analytics techniques fail on time series with too few values (e.g. RF, adjusted box plot, etc.)
  52. Communication channel
  53. Box plot results
  54. (no text)
  55. Simple lookup table with priors
  56. Don’t be an analytics snob
     • Sparse data is VERY hard to analyze using typical analytics techniques
     • Sparse data conveys VERY important information
     • Sometimes the simplest rules, thresholds, lookup tables will work
  57. Recap
     1. Your data is NOT Gaussian
     2. Yesterday’s anomaly is today’s normal
     3. Kolmogorov-Smirnov is really cool
     4. Scope and Context are important
     5. No data != No information
  58. Questions?
     • Shout out to the Metafor Data Science team!
       – Fred Zhang
       – Iman Makaremi
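
The three-sigma rule on slide 12 can be sketched in a few lines. This is only an illustration, assuming the metric is a plain list of floats; the function name `three_sigma_outliers`, the parameter `k`, and the sample data are hypothetical and do not come from the talk.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Return (index, value) pairs more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A perfectly flat series has no spread, so nothing can be flagged.
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > k * std]


# Hypothetical metric: 30 steady samples followed by one spike.
samples = [10.0] * 15 + [11.0] * 15 + [100.0]
outliers = three_sigma_outliers(samples)
print(outliers)
```

Note that the spike itself pulls the mean up and inflates the standard deviation, which is one reason the rule can misfire on skewed or spiky data such as the series on slides 14 to 20.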
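
Slides 34 to 41 describe comparing windows of a metric with the Kolmogorov-Smirnov statistic. The sketch below is one possible reading of that idea, not the speaker's implementation: it computes the two-sample KS statistic (the maximum distance between the two empirical cumulative distributions) in pure Python and scores each window against the one before it. The names `ks_statistic`, `sliding_ks`, and `THRESHOLD`, the window size, and the data are all assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max vertical distance between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # points, so checking every sample point is enough to find the maximum.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def sliding_ks(series, window):
    """Score each window against the window immediately before it."""
    scores = []
    for start in range(window, len(series) - window + 1, window):
        prev = series[start - window:start]
        curr = series[start:start + window]
        scores.append((start, ks_statistic(prev, curr)))
    return scores


# Hypothetical series: two similar periods, then a shifted one.
series = [1, 2, 3, 4, 2, 1, 4, 3, 11, 12, 13, 14]
scores = sliding_ks(series, window=4)

THRESHOLD = 0.5  # assumed cut-off; tune it for your own data
anomalous_starts = [start for start, d in scores if d > THRESHOLD]
print(scores, anomalous_starts)
```

In practice the window would be one period of the metric (for example a day or a week, as slide 34 suggests), and a library routine such as SciPy's `scipy.stats.ks_2samp` could replace the hand-rolled statistic.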
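
Slides 47 to 49 ask whether every weekend is an anomaly and recommend using domain knowledge about periodicity. One simple way to act on that, sketched here under stated assumptions, is to compare each point with the baseline of its own context instead of a global baseline. The weekend/weekday split in `context_key`, the median baseline, the relative `tolerance`, and the data are all hypothetical choices for illustration, not the method shown in the talk.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta


def context_key(ts):
    """Assumed context: weekend vs weekday. Hour-of-week would also work."""
    return "weekend" if ts.weekday() >= 5 else "weekday"


def contextual_anomalies(points, tolerance=0.5):
    """Flag points whose relative deviation from their context median exceeds tolerance."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[context_key(ts)].append(value)
    medians = {key: statistics.median(vals) for key, vals in buckets.items()}

    flagged = []
    for ts, value in points:
        baseline = medians[context_key(ts)]
        if baseline and abs(value - baseline) / abs(baseline) > tolerance:
            flagged.append((ts, value))
    return flagged


# Hypothetical two weeks of daily traffic: high on weekdays, low on weekends.
start = datetime(2014, 11, 3)  # a Monday
days = [start + timedelta(days=i) for i in range(14)]
points = [(d, 20.0 if d.weekday() >= 5 else 100.0) for d in days]
points[1] = (points[1][0], 20.0)  # a Tuesday that drops to weekend levels

flagged = contextual_anomalies(points)
print(flagged)
```

With this split, the quiet weekends are treated as normal for their context, while the weekday that drops to weekend levels is flagged.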
