SlideShare une entreprise Scribd logo
1  sur  49
1
Some Simple Math for
Anomaly Detection
#Monitorama PDX
2014.05.05
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software
toufic@metaforsoftware.com
@tboubez
3
Preamble
• I lied!
– There are no “simple” tricks
– If it’s too good to be true, it probably is
• I usually beat up on parametric, Gaussian, supervised techniques
– This talk is to show some alternatives
– Only enough time to cover a couple of relatively simple but very useful
techniques
– Oh, and I will still beat up on the usual suspects
• Adrian and James are right! Listen to them! 
– What’s the point of collecting all that data if you can’t get useful information
out of it!?
• Note: real data
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!
4
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped 
• Co-Founder/CTO Saffron Technology
• IBM Chief Architect for SOA
• Co-Author, Co-Editor: WS-Trust, WS-
SecureConversation, WS-Federation, WS-Policy
• Building large scale software systems for >20
years (I’m older than I look, I know!)
Toufic intro – who I am
5
Wall of Charts™
6
The WoC side-effects: alert fatigue
“Alert fatigue is the single
biggest problem we have
right now … We need to be
more intelligent about our
alerts or we’ll all go insane.”
- John Vincent (@lusis)
(#monitoringsucks)
7
Watching screens cannot scale + it’s useless
8
Gotta turn things over to the machines
9
TO THE RESCUE: Anomaly Detection!!
• Anomaly detection (also known as outlier
detection) is the search for items or events
which do not conform to an expected pattern.
[Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A
survey". ACM Computing Surveys 41 (3): 1]
• For devops: Need to know when one or more
of our metrics is going wonky
10
Attempt #1: thresholds …
• Roots in manufacturing process QC
11
… are based on Gaussian distributions
• Make assumptions about probability
distributions and process behaviour
– Data is normally distributed with a useful and
usable mean and standard deviation
– Data is probabilistically “stationary”
12
Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations: anything
else is an outlier
13
Aaahhhh
• The mysterious red lines explained
14
Stationary Gaussian distributions are powerful
• Because far far in the future, in a galaxy far far
away:
– I can make the same predictions because the
statistical properties of the data haven’t changed
– I can easily compare different metrics since they
have similar statistical properties
• Let’s do this!!
• BUT…
• Cue in DRAMATIC MUSIC
15
Then THIS happens
16
3-sigma rule alerts
17
Or worse, THIS happens!
18
3-sigma rule alerts
19
WTF!? So what gives!?
• Remember this?
20
Histogram – probability distribution
21
Histogram – probability distribution
22
Attempts #2, #3, etc: mo’ better thresholds
• Static thresholds ineffective on dynamic data
– Thresholds use the mean as predictor and alert if
data falls more than 3 sigma outside the mean
• Need “moving” or “adaptive” thresholds:
– Value of mean changes with time to
accommodate new data values/trends
23
Moving Averages “big idea”
• At any point in time in a well-behaved time
series, your next value should not significantly
deviate from the general trend of your data
• Mean as a predictor is too static, relies on too
much past data (ALL of the data!)
• Instead of overall mean use a finite window of
past values, predict most likely next value
• Alert if actual value “significantly” (3 sigmas?)
deviates from predicted value
24
Moving Averages typical method
• Generate a “smoothed” version of the time series
– Average over a sliding (moving) window
• Compute the squared error between raw series
and its smoothed version
• Compute a new effective standard deviation by
smoothing the squared error
• Generate a moving threshold:
– Outliers are 3-sigma outside the new, smoothed data!
• Ta-da!
25
Simple and Weighted Moving Averages
• Simple Moving Average
– Average of last N values in your time series
• S[t] <- sum(X[t-(N-1):t])/N
– Each value in the window contributes equally to
prediction
– …INCLUDING spikes and outliers
• Weigthed Moving Average
– Similar to SMA but assigns linearly (arithmetically)
decreasing weights to every value in the window
– Older values contribute less to the prediction
26
Exponential Smoothing techniques
• Exponential Smoothing
– Similar to weighted average, but with weights decay
exponentially over the whole set of historic samples
• S[t]=αX[t-1] + (1-α)S[t-1]
– Does not deal with trends in data
• DES
– In addition to data smoothing factor (α), introduces a trend
smoothing factor (β)
– Better at dealing with trending
– Does not deal with seasonality in data
• TES, Holt-Winters
– Introduces additional seasonality factor
– … and so on
27
Let’s look at an example
28
Holt-Winters predictions
29
A harder example
30
Exponential smoothing predictions
31
Hmmmm, so are we doomed?
• No!
• ALL smoothing predictive methods work best
with normally distributed data!
• But there are lots of other non-Gaussian
based techniques
– We can only scratch the surface in this talk
32
Trick #1: Histogram!
33
THIS is normal
34
This isn’t
35
Neither is this
36
Trick #2: Kolmogorov-Smirnov test
• Non-parametric test
– Compare two probability
distributions
– Makes no assumptions (e.g.
Gaussian) about the
distributions of the samples
– Measures maximum
distance between
cumulative distributions
– Can be used to compare
periodic/seasonal metric
periods (e.g. day-to-day or
week-to-week)
http://en.wikipedia.org/wiki/Kolmogorov%E2%
80%93Smirnov_test
37
KS with windowing
38
39
40
41
42
43
KS Test on difficult data
44
Trick #3: Diffing/Derivatives
• Often, even when the data itself is not
stationary, its derivatives tends to be!
• Most frequently, first difference is sufficient:
dS(t) <- S(t+1) – S(t)
• Can then perform some analytics on first
difference
45
CPU time series
46
Its first difference – possible random walk?
47
We’re not doomed, but: Know your data!!
• You need to understand the statistical properties
of your data, and where it comes from, in order
to determine what kind of analytics to use.
– Your data is very important!
– You spend time collecting it so spend time analyzing
it!
• A large amount of data center data is non-
Gaussian
– Guassian statistics won’t work
– Use appropriate techniques
48
More?
• Only scratched the surface
• I want to talk more about algorithms, analytics,
current issues, etc, in more depth, but time’s up!!
– Come talk to me or email me if interested.
• Thank you!
toufic@metaforsoftware.com
@tboubez
49
Oh yeah, and we’re hiring!
In Vancouver, BC

Contenu connexe

Tendances

Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Scaling to Millions of Simultaneous Connections by Rick Reed from WhatsApp
Scaling to Millions of Simultaneous Connections by Rick Reed from WhatsAppScaling to Millions of Simultaneous Connections by Rick Reed from WhatsApp
Scaling to Millions of Simultaneous Connections by Rick Reed from WhatsAppmustafa sarac
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaJiangjie Qin
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018Holden Karau
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Apache Hive Hook
Apache Hive HookApache Hive Hook
Apache Hive HookMinwoo Kim
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteAmr Awadallah
 
Intro to Kanban - AgileDayChile2011 Keynote
Intro to Kanban - AgileDayChile2011 KeynoteIntro to Kanban - AgileDayChile2011 Keynote
Intro to Kanban - AgileDayChile2011 KeynoteChileAgil
 
Team Topologies - how and why to design your teams - AllDayDevOps 2017
Team Topologies - how and why to design your teams - AllDayDevOps 2017Team Topologies - how and why to design your teams - AllDayDevOps 2017
Team Topologies - how and why to design your teams - AllDayDevOps 2017Matthew Skelton
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkDatabricks
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들
AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들
AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들Woong Seok Kang
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
 
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...HostedbyConfluent
 
Heart of agile by Pierre Hervouet
Heart of agile by Pierre HervouetHeart of agile by Pierre Hervouet
Heart of agile by Pierre HervouetAgile ME
 

Tendances (20)

Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Scaling to Millions of Simultaneous Connections by Rick Reed from WhatsApp
Scaling to Millions of Simultaneous Connections by Rick Reed from WhatsAppScaling to Millions of Simultaneous Connections by Rick Reed from WhatsApp
Scaling to Millions of Simultaneous Connections by Rick Reed from WhatsApp
 
Kanban VS Scrum
Kanban VS ScrumKanban VS Scrum
Kanban VS Scrum
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Apache Hive Hook
Apache Hive HookApache Hive Hook
Apache Hive Hook
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Intro to Kanban - AgileDayChile2011 Keynote
Intro to Kanban - AgileDayChile2011 KeynoteIntro to Kanban - AgileDayChile2011 Keynote
Intro to Kanban - AgileDayChile2011 Keynote
 
Team Topologies - how and why to design your teams - AllDayDevOps 2017
Team Topologies - how and why to design your teams - AllDayDevOps 2017Team Topologies - how and why to design your teams - AllDayDevOps 2017
Team Topologies - how and why to design your teams - AllDayDevOps 2017
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들
AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들
AWSKRUG DS - 데이터 엔지니어가 실무에서 맞닥뜨리는 문제들
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
Advanced Change Data Streaming Patterns in Distributed Systems | Gunnar Morli...
 
Heart of agile by Pierre Hervouet
Heart of agile by Pierre HervouetHeart of agile by Pierre Hervouet
Heart of agile by Pierre Hervouet
 

En vedette

Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...tboubez
 
devops - what's missing? what's next?
devops - what's missing? what's next?devops - what's missing? what's next?
devops - what's missing? what's next?Andrew Shafer
 
Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...
Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...
Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...tboubez
 
Go or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.comGo or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.comJohn Allspaw
 
Calculus - St. Petersburg Electrotechnical University "LETI"
Calculus - St. Petersburg Electrotechnical University "LETI"Calculus - St. Petersburg Electrotechnical University "LETI"
Calculus - St. Petersburg Electrotechnical University "LETI"metamath
 
Hs Industrial Insights Manufacturing
Hs Industrial Insights   ManufacturingHs Industrial Insights   Manufacturing
Hs Industrial Insights ManufacturingTKarlsson
 
Immutable infrastructure with Docker and containers (GlueCon 2015)
Immutable infrastructure with Docker and containers (GlueCon 2015)Immutable infrastructure with Docker and containers (GlueCon 2015)
Immutable infrastructure with Docker and containers (GlueCon 2015)Jérôme Petazzoni
 
Exploratory Statistics with R
Exploratory Statistics with RExploratory Statistics with R
Exploratory Statistics with RChristian Robert
 
disenos experimentales
disenos experimentalesdisenos experimentales
disenos experimentalesAngel Velazco
 
Mineral processing-design-and-operation, gupta
Mineral processing-design-and-operation, guptaMineral processing-design-and-operation, gupta
Mineral processing-design-and-operation, guptaCarlos Barreto Gamarra
 
JKSimMet Course - Part 1
JKSimMet Course - Part 1JKSimMet Course - Part 1
JKSimMet Course - Part 1James Didovich
 
The Binomial, Poisson, and Normal Distributions
The Binomial, Poisson, and Normal DistributionsThe Binomial, Poisson, and Normal Distributions
The Binomial, Poisson, and Normal DistributionsSCE.Surat
 

En vedette (20)

Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 
devops - what's missing? what's next?
devops - what's missing? what's next?devops - what's missing? what's next?
devops - what's missing? what's next?
 
Adaptive availability
Adaptive availabilityAdaptive availability
Adaptive availability
 
Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...
Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...
Beyond pretty charts, Analytics for the rest of us. Toufic Boubez DevOps Days...
 
Go or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.comGo or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.com
 
Calculus - St. Petersburg Electrotechnical University "LETI"
Calculus - St. Petersburg Electrotechnical University "LETI"Calculus - St. Petersburg Electrotechnical University "LETI"
Calculus - St. Petersburg Electrotechnical University "LETI"
 
9주차
9주차9주차
9주차
 
Pearson y sperman
Pearson y spermanPearson y sperman
Pearson y sperman
 
Hs Industrial Insights Manufacturing
Hs Industrial Insights   ManufacturingHs Industrial Insights   Manufacturing
Hs Industrial Insights Manufacturing
 
Estadística: Cálculos SPSS
Estadística: Cálculos SPSSEstadística: Cálculos SPSS
Estadística: Cálculos SPSS
 
Immutable infrastructure with Docker and containers (GlueCon 2015)
Immutable infrastructure with Docker and containers (GlueCon 2015)Immutable infrastructure with Docker and containers (GlueCon 2015)
Immutable infrastructure with Docker and containers (GlueCon 2015)
 
Exploratory Statistics with R
Exploratory Statistics with RExploratory Statistics with R
Exploratory Statistics with R
 
disenos experimentales
disenos experimentalesdisenos experimentales
disenos experimentales
 
Mineral processing-design-and-operation, gupta
Mineral processing-design-and-operation, guptaMineral processing-design-and-operation, gupta
Mineral processing-design-and-operation, gupta
 
Jk (1)
Jk (1)Jk (1)
Jk (1)
 
JKSimMet Course - Part 1
JKSimMet Course - Part 1JKSimMet Course - Part 1
JKSimMet Course - Part 1
 
The Binomial, Poisson, and Normal Distributions
The Binomial, Poisson, and Normal DistributionsThe Binomial, Poisson, and Normal Distributions
The Binomial, Poisson, and Normal Distributions
 
Presentación del curso,agosto,18,2014
Presentación del curso,agosto,18,2014Presentación del curso,agosto,18,2014
Presentación del curso,agosto,18,2014
 
Estadística: Revisión Estadística
Estadística: Revisión EstadísticaEstadística: Revisión Estadística
Estadística: Revisión Estadística
 
Clase 4 diseños de bloques - final
Clase 4   diseños de bloques - finalClase 4   diseños de bloques - final
Clase 4 diseños de bloques - final
 

Similaire à Simple math for anomaly detection toufic boubez - metafor software - monitorama pdx 2014-05-05

Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25
Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25
Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25tboubez
 
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...tboubez
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detectionShantanuDeosthale
 
forecasting model
forecasting modelforecasting model
forecasting modelFEG
 
Chapters 14 and 15 presentation
Chapters 14 and 15 presentationChapters 14 and 15 presentation
Chapters 14 and 15 presentationWilliam Perkins
 
R - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsR - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsJen Stirrup
 
Outlier analysis for Temporal Datasets
Outlier analysis for Temporal DatasetsOutlier analysis for Temporal Datasets
Outlier analysis for Temporal DatasetsQuantUniversity
 
Data Wrangling_1.pptx
Data Wrangling_1.pptxData Wrangling_1.pptx
Data Wrangling_1.pptxPallabiSahoo5
 
Anomaly detection Meetup Slides
Anomaly detection Meetup SlidesAnomaly detection Meetup Slides
Anomaly detection Meetup SlidesQuantUniversity
 
Statistics for analytics
Statistics for analyticsStatistics for analytics
Statistics for analyticsdeepika4721
 
2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higgins2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higginsrgveroniki
 
4 26 2013 1 IME 674 Quality Assurance Reliability EXAM TERM PROJECT INFO...
4 26 2013 1 IME 674  Quality Assurance   Reliability EXAM   TERM PROJECT INFO...4 26 2013 1 IME 674  Quality Assurance   Reliability EXAM   TERM PROJECT INFO...
4 26 2013 1 IME 674 Quality Assurance Reliability EXAM TERM PROJECT INFO...Robin Beregovska
 
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningAnomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningQuantUniversity
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modellingQuinton Anderson
 

Similaire à Simple math for anomaly detection toufic boubez - metafor software - monitorama pdx 2014-05-05 (20)

Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25
Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25
Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25
 
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
forecasting model
forecasting modelforecasting model
forecasting model
 
Chapters 14 and 15 presentation
Chapters 14 and 15 presentationChapters 14 and 15 presentation
Chapters 14 and 15 presentation
 
R - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsR - what do the numbers mean? #RStats
R - what do the numbers mean? #RStats
 
Multivariate Analysis
Multivariate AnalysisMultivariate Analysis
Multivariate Analysis
 
Multivariate Analysis.ppt
Multivariate Analysis.pptMultivariate Analysis.ppt
Multivariate Analysis.ppt
 
Multivariate analysis
Multivariate analysisMultivariate analysis
Multivariate analysis
 
Outlier analysis for Temporal Datasets
Outlier analysis for Temporal DatasetsOutlier analysis for Temporal Datasets
Outlier analysis for Temporal Datasets
 
Data Wrangling_1.pptx
Data Wrangling_1.pptxData Wrangling_1.pptx
Data Wrangling_1.pptx
 
Anomaly detection Meetup Slides
Anomaly detection Meetup SlidesAnomaly detection Meetup Slides
Anomaly detection Meetup Slides
 
Statistics for analytics
Statistics for analyticsStatistics for analytics
Statistics for analytics
 
lec21.VAE_1.pdf
lec21.VAE_1.pdflec21.VAE_1.pdf
lec21.VAE_1.pdf
 
2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higgins2010 smg training_cardiff_day1_session3_higgins
2010 smg training_cardiff_day1_session3_higgins
 
4 26 2013 1 IME 674 Quality Assurance Reliability EXAM TERM PROJECT INFO...
4 26 2013 1 IME 674  Quality Assurance   Reliability EXAM   TERM PROJECT INFO...4 26 2013 1 IME 674  Quality Assurance   Reliability EXAM   TERM PROJECT INFO...
4 26 2013 1 IME 674 Quality Assurance Reliability EXAM TERM PROJECT INFO...
 
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningAnomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
Spc training
Spc training Spc training
Spc training
 

Dernier

BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 

Dernier (20)

BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

Simple math for anomaly detection toufic boubez - metafor software - monitorama pdx 2014-05-05

  • 1. 1
  • 2. Some Simple Math for Anomaly Detection #Monitorama PDX 2014.05.05 Toufic Boubez, Ph.D. Co-Founder, CTO Metafor Software toufic@metaforsoftware.com @tboubez
  • 3. 3 Preamble • I lied! – There are no “simple” tricks – If it’s too good to be true, it probably is • I usually beat up on parametric, Gaussian, supervised techniques – This talk is to show some alternatives – Only enough time to cover a couple of relatively simple but very useful techniques – Oh, and I will still beat up on the usual suspects • Adrian and James are right! Listen to them!  – What’s the point of collecting all that data if you can’t get useful information out of it!? • Note: real data • Note: no y-axis labels on charts – on purpose!! • Note to self: remember to SLOW DOWN! • Note to self: mention the cats!! Everybody loves cats!!
  • 4. 4 • Co-Founder/CTO Metafor Software • Co-Founder/CTO Layer 7 Technologies – Acquired by Computer Associates in 2013 – I escaped  • Co-Founder/CTO Saffron Technology • IBM Chief Architect for SOA • Co-Author, Co-Editor: WS-Trust, WS- SecureConversation, WS-Federation, WS-Policy • Building large scale software systems for >20 years (I’m older than I look, I know!) Toufic intro – who I am
  • 6. 6 The WoC side-effects: alert fatigue “Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.” - John Vincent (@lusis) (#monitoringsucks)
  • 7. 7 Watching screens cannot scale + it’s useless
  • 8. 8 Gotta turn things over to the machines
  • 9. 9 TO THE RESCUE: Anomaly Detection!! • Anomaly detection (also known as outlier detection) is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1] • For devops: Need to know when one or more of our metrics is going wonky
  • 10. 10 Attempt #1: thresholds … • Roots in manufacturing process QC
  • 11. 11 … are based on Gaussian distributions • Make assumptions about probability distributions and process behaviour – Data is normally distributed with a useful and usable mean and standard deviation – Data is probabilistically “stationary”
  • 12. 12 Three-Sigma Rule • Three-sigma rule – ~68% of the values lie within 1 std deviation of the mean – ~95% of the values lie within 2 std deviations – 99.73% of the values lie within 3 std deviations: anything else is an outlier
  • 13. 13 Aaahhhh • The mysterious red lines explained
  • 14. 14 Stationary Gaussian distributions are powerful • Because far far in the future, in a galaxy far far away: – I can make the same predictions because the statistical properties of the data haven’t changed – I can easily compare different metrics since they have similar statistical properties • Let’s do this!! • BUT… • Cue in DRAMATIC MUSIC
  • 17. 17 Or worse, THIS happens!
  • 19. 19 WTF!? So what gives!? • Remember this?
  • 22. 22 Attempts #2, #3, etc: mo’ better thresholds • Static thresholds ineffective on dynamic data – Thresholds use the mean as predictor and alert if data falls more than 3 sigma outside the mean • Need “moving” or “adaptive” thresholds: – Value of mean changes with time to accommodate new data values/trends
  • 23. 23 Moving Averages “big idea” • At any point in time in a well-behaved time series, your next value should not significantly deviate from the general trend of your data • Mean as a predictor is too static, relies on too much past data (ALL of the data!) • Instead of overall mean use a finite window of past values, predict most likely next value • Alert if actual value “significantly” (3 sigmas?) deviates from predicted value
  • 24. 24 Moving Averages typical method • Generate a “smoothed” version of the time series – Average over a sliding (moving) window • Compute the squared error between raw series and its smoothed version • Compute a new effective standard deviation by smoothing the squared error • Generate a moving threshold: – Outliers are 3-sigma outside the new, smoothed data! • Ta-da!
  • 25. 25 Simple and Weighted Moving Averages • Simple Moving Average – Average of last N values in your time series • S[t] <- sum(X[t-(N-1):t])/N – Each value in the window contributes equally to prediction – …INCLUDING spikes and outliers • Weigthed Moving Average – Similar to SMA but assigns linearly (arithmetically) decreasing weights to every value in the window – Older values contribute less to the prediction
  • 26. 26 Exponential Smoothing techniques • Exponential Smoothing – Similar to weighted average, but with weights decay exponentially over the whole set of historic samples • S[t]=αX[t-1] + (1-α)S[t-1] – Does not deal with trends in data • DES – In addition to data smoothing factor (α), introduces a trend smoothing factor (β) – Better at dealing with trending – Does not deal with seasonality in data • TES, Holt-Winters – Introduces additional seasonality factor – … and so on
  • 27. 27 Let’s look at an example
  • 31. 31 Hmmmm, so are we doomed? • No! • ALL smoothing predictive methods work best with normally distributed data! • But there are lots of other non-Gaussian based techniques – We can only scratch the surface in this talk
  • 36. 36 Trick #2: Kolmogorov-Smirnov test • Non-parametric test – Compare two probability distributions – Makes no assumptions (e.g. Gaussian) about the distributions of the samples – Measures maximum distance between cumulative distributions – Can be used to compare periodic/seasonal metric periods (e.g. day-to-day or week-to-week) http://en.wikipedia.org/wiki/Kolmogorov%E2% 80%93Smirnov_test
  • 38. 38
  • 39. 39
  • 40. 40
  • 41. 41
  • 42. 42
  • 43. 43 KS Test on difficult data
  • 44. 44 Trick #3: Diffing/Derivatives • Often, even when the data itself is not stationary, its derivatives tends to be! • Most frequently, first difference is sufficient: dS(t) <- S(t+1) – S(t) • Can then perform some analytics on first difference
  • 46. 46 Its first difference – possible random walk?
  • 47. 47 We’re not doomed, but: Know your data!! • You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use. – Your data is very important! – You spend time collecting it so spend time analyzing it! • A large amount of data center data is non- Gaussian – Guassian statistics won’t work – Use appropriate techniques
  • 48. 48 More? • Only scratched the surface • I want to talk more about algorithms, analytics, current issues, etc, in more depth, but time’s up!! – Come talk to me or email me if interested. • Thank you! toufic@metaforsoftware.com @tboubez
  • 49. 49 Oh yeah, and we’re hiring! In Vancouver, BC