Statistical Learning Based Anomaly Detection @ Twitter
Arun Kejariwal 
(@arun_kejariwal) 
Joint work with Jordan Hochenbaum and Owen Vallis 
November 2014

Internet trends
• Real-time [1]
[1] http://techcrunch.com/2014/05/05/amazon-extends-its-shopping-cart-to-twitter/

Twitter: Global Town Square

Data Fidelity
• Data-driven decision making
  ❑ Evolving product landscape
• Data partners
  ❑ Nielsen
  ❑ Dataminr
• Operational
  ❑ Performance and Availability

Data Fidelity: Challenges
• Anomalies
  ❑ Exogenic factors
    ▪ User behavior
    ▪ Events
    ▪ Data center
  ❑ Endogenic factors
    ▪ Agile development
      o Fail fast
    ▪ Data collection
• Millions of time series [1,2]
  ❑ Scalability
[1] http://strata.oreilly.com/2013/09/how-twitter-monitors-millions-of-time-series.html
[2] http://strataconf.com/strata2014/public/schedule/detail/32431

Anomaly Detection: Why Bother?
• Analyze User Engagement
  ❑ Events
    ▪ Super Bowl, Japanese New Year
  ❑ Year over year analysis (input to forecasting)
• Identify Attacks
  ❑ DoS
  ❑ Malware attacks
• Identify Bots
  ❑ Separating actual users from spam

Anomaly Detection
• Visual
  ❑ Prone to errors
  ❑ Not scalable
    ▪ Machine generated data: 11% of the digital universe in 2005 to > 40% by 2020 [1]
    ▪ Cloud Infrastructure 2013-2017 CAGR ~50% [2]
• Algorithmic approach
  ❑ Automate!
[1] http://www.emc.com/about/news/press/2012/20121211-01.htm
[2] http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/

Anomaly Detection: Background
• Over 50 years of research [1]
  ❑ Statistics
    ▪ Extreme Value Theory
    ▪ Robust Statistics, Grubbs' Test, ESD
  ❑ Econometrics
  ❑ Finance
    ▪ Value at Risk (VaR)
  ❑ Signal Processing
  ❑ Music Information Retrieval
  ❑ Networking
  ❑ E-Commerce
  ❑ Performance Regression
Jon from Etsy
Toufic from Metafor
[1] “Anomaly Detection” by Chandola et al. ACM Computing Surveys, 2009.

Anomaly Detection: Overview
• Definition
  ❑ “An anomaly is an observation that deviates so much from other observations so as to arouse suspicions that it was generated by a different mechanism” [1,2]
[1] “Identification of outliers” by Hawkins, Douglas M. London: Chapman and Hall, 1980.
[2] “Outlier Analysis” by Charu C. Aggarwal. Springer, 2013.

Anomaly Detection
• Characterization
  ❑ Magnitude
  ❑ Width
  ❑ Frequency
  ❑ Direction

Anomaly Detection (contd.)
• Two flavors
  ❑ Global
    ▪ Max Value
  ❑ Local
    ▪ Intra-day
[Figure panels: Global, Local]

Anomaly Detection (contd.)
• Traditional Approaches
  ❑ Metrics
    ▪ Mean μ
    ▪ Standard deviation σ
  ❑ Rule of thumb
    ▪ μ + 3*σ
  ❑ Which time series?
    ▪ Raw
    ▪ Moving Averages
      o SMA, EWMA, PEWMA
[Figure annotation: 3 * σ]
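
The rule of thumb on this slide can be sketched in a few lines. This is a minimal illustration, not the code used at Twitter: the function name `three_sigma_anomalies`, the two-sided check (the slide shows only the upper bound μ + 3*σ), and the sample data are all assumptions made for the example.

```python
import statistics

def three_sigma_anomalies(values, k=3.0):
    """Return indices of points farther than k standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # a constant series has no spread to compare against
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]

# Hypothetical data: a flat series with one large spike at index 10.
series = [10.0] * 20
series[10] = 100.0
print(three_sigma_anomalies(series))
```

Note that the spike itself pulls μ up and inflates σ, which is the weakness the following slides address.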

Anomaly Detection (contd.)
• Impact of multi-modal distribution
  ❑ Shifts μ by ~0.2%
  ❑ Inflates σ by 4.5%
    ▪ Miss quite a few anomalies
  ❑ What do multiple modes correspond to?
    ▪ Seasonality
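
A small sketch of why multiple modes hide anomalies. The data is hypothetical (a "night" mode near 100 and a "day" mode near 200) and exaggerates the effect far beyond the percentages quoted on the slide; the helper `z_score` is an assumption introduced for the example.

```python
import statistics

def z_score(x, data):
    """Distance of x from the mean of data, in population standard deviations."""
    return abs(x - statistics.fmean(data)) / statistics.pstdev(data)

# Hypothetical bimodal series: two seasonal modes pooled into one sample.
night = [98.0, 100.0, 102.0] * 4
day = [198.0, 200.0, 202.0] * 4
raw = night + day

# A night-time reading of 130 is far outside the night mode,
# but sits comfortably between the two modes of the pooled data.
candidate = 130.0
z_pooled = z_score(candidate, raw)
z_night = z_score(candidate, night)
print(f"pooled sigma = {statistics.pstdev(raw):.1f}, z = {z_pooled:.2f}")
print(f"night  sigma = {statistics.pstdev(night):.1f}, z = {z_night:.2f}")
```

Against the pooled data the candidate scores well under 3 and is missed; against its own mode it scores far above 3.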

Anomaly Detection (contd.)
• Robust Statistics
  ❑ MAD (median absolute deviation)
    ▪ Robust breakdown point
      o Median 50% vs. Mean 0%
  ❑ σ_MAD = K * MAD
    ▪ K = 1.4826 for normally distributed data
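
A minimal sketch of the MAD-based scale estimate, using the constant K = 1.4826 from the slide. The function name `sigma_mad` and the sample data are assumptions for illustration.

```python
import statistics

K = 1.4826  # consistency constant for normally distributed data (from the slide)

def sigma_mad(values):
    """Robust scale estimate: K times the median absolute deviation from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return K * mad

# Hypothetical data: the same sample with and without one gross outlier.
clean = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0]
contaminated = clean[:-1] + [1000.0]

print(sigma_mad(clean), sigma_mad(contaminated))                    # robust estimate
print(statistics.pstdev(clean), statistics.pstdev(contaminated))    # classical estimate
```

The single outlier leaves the MAD-based estimate unchanged while the classical standard deviation grows by orders of magnitude, which is what the breakdown-point comparison on the slide expresses.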

Anomaly Detection (contd.)
• Limitations of using MAD

Anomaly Detection (contd.)
• Grubbs' Test
  ❑ Critical value is derived from the data at a chosen significance level (α)
• Limitations
  ❑ Assumes data distribution is normal
  ❑ Good for detecting ONLY 1 outlier
  ❑ Seasonality unaware
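
For reference, the standard two-sided form of Grubbs' test, as it is usually stated in statistics handbooks, is written out here. The slide does not show the formula, so the notation (N observations, sample mean \bar{x}, sample standard deviation s) is an assumption.

```latex
% Test statistic: largest absolute deviation from the sample mean, in units of s.
G = \frac{\max_{i=1,\dots,N} \lvert x_i - \bar{x} \rvert}{s}

% Reject the hypothesis of "no outlier" at significance level \alpha when
G > \frac{N-1}{\sqrt{N}}
    \sqrt{\frac{t^{2}_{\alpha/(2N),\,N-2}}{N - 2 + t^{2}_{\alpha/(2N),\,N-2}}}

% where t_{\alpha/(2N),\,N-2} is the upper critical value of the
% t-distribution with N-2 degrees of freedom at level \alpha/(2N).
```

Because both \bar{x} and s are computed from all N points, the test targets a single outlier, which matches the limitation listed on the slide.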

Anomaly Detection (contd.)
• ESD (Generalized Extreme Studentized Deviate) [1]
  ❑ Critical value (λ_i) re-calculated every iteration
  ❑ Largest i such that R_i > λ_i determines # of anomalies
  ❑ An upper-bound on the number of anomalies is an input parameter
• Limitations
  ❑ Generalized ESD assumes a “normal” distribution
  ❑ Seasonality unaware
[1] Rosner, Bernard. “Percentage Points for a Generalized ESD Many-outlier Procedure.” Technometrics 25, no. 2 (1983): 165–172.
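
The quantities R_i and λ_i named on this slide are usually defined as follows in the generalized ESD procedure. The slide gives only the symbols, so the notation (n observations, upper bound k on the number of anomalies, significance level α) is an assumption.

```latex
% For i = 1, \dots, k: compute the statistic on the n - i + 1 observations
% that remain, then remove the observation that attains the maximum.
R_i = \frac{\max_{j} \lvert x_j - \bar{x} \rvert}{s}

% Critical value for iteration i:
\lambda_i = \frac{(n - i)\, t_{p,\,n-i-1}}
                 {\sqrt{\left(n - i - 1 + t^{2}_{p,\,n-i-1}\right)\left(n - i + 1\right)}},
\qquad
p = 1 - \frac{\alpha}{2\,(n - i + 1)}

% t_{p,\,\nu} is the 100p percentage point of the t-distribution with \nu
% degrees of freedom. The number of anomalies is the largest i with R_i > \lambda_i.
```

Here \bar{x} and s are recomputed on the remaining observations at each iteration, which is why λ_i must be recalculated every time.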

Our Approach

Anomaly Detection (contd.)
• Addressing Seasonality
  ❑ Key Idea
    ▪ Time Series Decomposition

Anomaly Detection (contd.)
• Determining seasonal component
  ❑ Regression on sub-cycle plots [1]
[1] “STL: A seasonal-trend decomposition procedure based on loess” by Cleveland, et al. Journal of Official Statistics, Vol. 6, Issue 1, 1990.

Anomaly Detection (contd.)
• Impact of removing the seasonal and trend components
  ❑ Transforms our multi-modal data into unimodal data.
    ▪ Amenable to ESD/MAD!
The decomposed residual becomes "uni-modal". This significantly shrinks the value of sigma.
The original "multi-modal" raw data has a much larger sigma, leading ESD to miss a lot of the outliers.

Anomaly Detection (contd.)
• Challenges remain!
Trend Smoothing Distortion Creates “Phantom” Anomalies

Anomaly Detection (contd.)
• Marrying Robust Statistics with Seasonal Decomposition
Median is Free from Distortion

Anomaly Detection (contd.)
• Applying ESD on the Residual
Decomposition Exposes Anomalies

Anomaly Detection (contd.)
• Recap
  ❑ Extract the seasonal component using STL
    ▪ Filters out periodic spikes
  ❑ Residual = Raw - Seasonal_raw - Median_raw
  ❑ Run ESD on residual (using median and MAD)
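
The recap can be sketched end to end with the standard library only. This is a simplified illustration under loud assumptions, not the production method: a per-phase median stands in for STL, a fixed cutoff of 3.0 on the MAD-scaled residual stands in for the iteratively recomputed ESD critical values, and all function names and data are hypothetical.

```python
import statistics

K = 1.4826  # MAD consistency constant for normally distributed data

def seasonal_by_phase(values, period):
    """Crude seasonal estimate (NOT STL): per-phase median, centered on the overall median."""
    overall = statistics.median(values)
    phase_medians = [statistics.median(values[p::period]) for p in range(period)]
    return [phase_medians[i % period] - overall for i in range(len(values))]

def residuals(values, period):
    """Residual = Raw - Seasonal - Median, as in the recap."""
    seasonal = seasonal_by_phase(values, period)
    med = statistics.median(values)
    return [v - s - med for v, s in zip(values, seasonal)]

def robust_anomalies(values, period, cutoff=3.0):
    """Flag residuals whose median/MAD-based score exceeds a fixed cutoff (ESD stand-in)."""
    res = residuals(values, period)
    center = statistics.median(res)
    scale = K * statistics.median([abs(r - center) for r in res])
    if scale == 0:
        return []
    return [i for i, r in enumerate(res) if abs(r - center) / scale > cutoff]

# Hypothetical series: a period-4 pattern plus small noise, with a local spike
# at index 13 that stays below the global maximum of the series.
pattern = [10.0, 20.0, 30.0, 20.0]
noise = [0.0, 1.0, -1.0]
series = [pattern[i % 4] + noise[i % 3] for i in range(24)]
series[13] += 8.0

print(robust_anomalies(series, period=4))
```

In this example the spike is a local anomaly: it is smaller than the seasonal peaks, so a mean and standard deviation threshold on the raw series does not flag it, while the residual-based check does.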

Anomaly Detection (contd.)
• Illustrative example

Anomaly Detection (contd.)
• Applications
  ❑ Three perspectives
    ▪ Capacity
      o CPU utilization
      o Garbage collection
      o Network activity
    ▪ User behavior
      o Events
        • Impressions
        • Link clicks
      o Spam
    ▪ Forecasting

Anomaly Detection (contd.)
• Deployed in production
  ❑ Used by a large number of services at Twitter
  ❑ Automatic e-mail notification
    ▪ Only sent if anomalies are present
    ▪ Anomalies annotated
    ▪ CSV with anomaly locations attached

Open Sourcing
• Skyline from Etsy
  ❑ https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py
• Coming soon!
  ❑ R package

Join the Flock
Like problem solving?
Like challenges?
Be at the cutting edge
Make an impact
• We are hiring!!
  ❑ https://twitter.com/JoinTheFlock
  ❑ https://twitter.com/jobs
  ❑ Contact us: @arun_kejariwal
