Антон Лебедевич

Ontico

HighLoad++ 2013

Practical Statistics for Finding Anomalies in Load Testing and Production
Anton Lebedevich (Антон Лебедевич)

Contents
● Real Data and Ideal Models
● Load Testing (Tuning)
● Production Monitoring
● Correlation
● Tools

Real Data vs. Ideal Models
● noise (human actions)
● outliers
● missing data
● different resolutions
● counter update frequencies
● quantization
● not Gaussian and not random walk
● what is normal for the system?

Resolution
● >=5min
● 1min
● 10s
● <=1s

Load Testing (Tuning)
● goal
● beware of transient response
● find failure
● filter data
● find bottleneck and fix
● rinse and repeat

Filtration
● constants
● coefficient of variation (sd/mean)
● apply system knowledge
  – tasks migrated by scheduler
  – dependent (disk used/free)
  – interface traffic < 10 packets/s
  – load average < 0.5
  – …
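
The first two filters above can be sketched in base R. This is a minimal illustration, not the author's code: the function name `keep_informative`, the threshold `min_cv`, and the sample data are assumptions made for the example. It drops columns that are entirely missing, constant, or whose sd/mean ratio falls below the threshold.

```r
# Keep only columns that carry some signal.
# m: numeric matrix, one metric per column (hypothetical input layout).
# min_cv: minimum sd/mean ratio to keep a metric (assumed threshold).
keep_informative <- function(m, min_cv = 0.01) {
  keep <- apply(m, 2, function(x) {
    x <- x[!is.na(x)]
    if (length(x) < 2) return(FALSE)   # missing data
    s <- sd(x)
    if (s == 0) return(FALSE)          # constant
    mu <- mean(x)
    if (mu == 0) return(TRUE)          # ratio undefined, keep the series
    s / abs(mu) >= min_cv              # relative variation large enough
  })
  m[, keep, drop = FALSE]
}

metrics <- cbind(
  constant = rep(5, 100),
  noisy    = as.numeric(1:100),
  missing  = rep(NA_real_, 100),
  flat     = 1000 + 0.1 * sin(1:100)
)
filtered <- keep_informative(metrics)
print(colnames(filtered))
```

The system-knowledge rules from the slide (for example, ignoring interfaces below 10 packets/s) would be additional per-metric conditions and are not covered here.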

Nonlinear

ndiffs: difference the series until the KPSS test says it is stationary

Production Classics
● Control charts
  – fixed window moving average (MA)
  – exponentially weighted moving average (EWMA)
● Holt-Winters

Control Charts
● stationary
● Gaussian/Poisson
● outliers
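
An EWMA control chart can be sketched in base R as below. This is an illustration under stated assumptions rather than the author's implementation: the function names, the smoothing weight `lambda = 0.2`, the limit width `L = 3`, and the simulated data are all made up for the example. The limits use the standard asymptotic EWMA variance, sigma^2 * lambda / (2 - lambda), and assume a stationary, roughly Gaussian baseline.

```r
# Exponentially weighted moving average, started at a given value.
ewma <- function(x, lambda = 0.2, start = x[1]) {
  z <- numeric(length(x))
  prev <- start
  for (i in seq_along(x)) {
    prev <- lambda * x[i] + (1 - lambda) * prev
    z[i] <- prev
  }
  z
}

# Indices where the EWMA leaves the control limits estimated from a baseline.
ewma_alarms <- function(x, baseline, lambda = 0.2, L = 3) {
  mu <- mean(baseline)
  sigma <- sd(baseline)
  half_width <- L * sigma * sqrt(lambda / (2 - lambda))  # asymptotic limit
  z <- ewma(x, lambda, start = mu)
  which(abs(z - mu) > half_width)
}

set.seed(1)
baseline <- rnorm(200, mean = 100, sd = 5)
# Mean shifts from 100 to 120 after the 50th point.
shifted <- c(rnorm(50, mean = 100, sd = 5), rnorm(50, mean = 120, sd = 5))
alarms <- ewma_alarms(shifted, baseline)
print(head(alarms))
```

Because the baseline mean and standard deviation are plain estimates, outliers in the baseline widen the limits, which is one reason the slide lists outliers as a concern.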

Holt-Winters
triple exponential smoothing
● needs a lot of data
● sensitive to outliers
● can't handle 3 seasons + holidays
● overfitting
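
A minimal sketch of Holt-Winters-based checking with base R's `HoltWinters()` follows. The data are simulated, and the smoothing parameters (`alpha`, `beta`, `gamma`) are fixed to assumed values instead of being fitted, which keeps the example deterministic; none of these numbers come from the talk. New observations that fall outside the prediction interval are flagged.

```r
set.seed(11)
period <- 24                       # hourly data with a daily season (assumed)
n <- period * 14                   # two weeks of history
hours <- seq_len(n)
load <- 100 + 20 * sin(2 * pi * hours / period) + rnorm(n, sd = 3)
history <- ts(load, frequency = period)

# Additive triple exponential smoothing with fixed (assumed) parameters.
fit <- HoltWinters(history, alpha = 0.2, beta = 0.05, gamma = 0.2)

# Forecast one more day with a 99% prediction interval.
pred <- predict(fit, n.ahead = period, prediction.interval = TRUE, level = 0.99)

# Simulated "next day" with one injected spike at hour 12.
observed <- 100 + 20 * sin(2 * pi * (n + seq_len(period)) / period)
observed[12] <- 200

outside <- which(observed > as.numeric(pred[, "upr"]) |
                 observed < as.numeric(pred[, "lwr"]))
print(outside)
```

With a single daily season this works on two weeks of data; the slide's caveats (several seasons plus holidays, outliers in the history) are exactly the cases this sketch does not handle.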

Production Experimental
● autocorrelation
● non-parametric 2-sample tests

Autocorrelation
Ljung-Box Test
● non-stationary
● mean shift
● trends
● seasonal
● periodic (cron jobs, sampling)
● aggregated (MA, EWMA)
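
The Ljung-Box test is available in base R as `Box.test(..., type = "Ljung-Box")`. The sketch below uses simulated data and an assumed lag of 10 to compare plain white noise with a series containing a mean shift, one of the cases listed above.

```r
set.seed(42)
noise   <- rnorm(300)                               # no structure
shifted <- c(rnorm(150), rnorm(150, mean = 5))      # mean shift halfway through

# Small p-values indicate autocorrelation at lags 1..10.
p_noise   <- Box.test(noise,   lag = 10, type = "Ljung-Box")$p.value
p_shifted <- Box.test(shifted, lag = 10, type = "Ljung-Box")$p.value

print(c(noise = p_noise, shifted = p_shifted))
```

The test only says that the series is autocorrelated; it does not say which of the listed causes (shift, trend, season, cron job, aggregation) is responsible.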

2-Sample Tests: Good
Kolmogorov–Smirnov, Cramér–von Mises
● good for request size and latency (unaggregated)
● work on periodic data
● outlier resistant
● good for data exploration

2-Sample Tests: Bad
Kolmogorov–Smirnov, Cramér–von Mises
● false positives on trends and seasonal changes
● need many unique values
● computational complexity
● bad for alerting
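
A two-sample Kolmogorov–Smirnov comparison of two time windows can be sketched with base R's `ks.test()`. The latency samples are simulated from log-normal distributions with assumed parameters; in practice the two vectors would be unaggregated measurements from a reference window and a current window.

```r
set.seed(7)
# Hypothetical request latencies (ms) before and after a change.
before <- rlnorm(500, meanlog = 3.0, sdlog = 0.5)
after  <- rlnorm(500, meanlog = 3.5, sdlog = 0.5)

res <- ks.test(before, after)
print(res$statistic)   # largest gap between the two empirical CDFs
print(res$p.value)     # small value: the distributions differ
```

The test compares whole distributions, so a few extreme values barely move the statistic. Heavily quantized metrics produce many ties, which is the situation the "need many unique values" bullet warns about.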

Finding Similar Graphs
● correlation (Pearson, Spearman)
● Euclidean distance
● dynamic time warping (DTW)
● discrete Fourier transform (DFT)
● discrete wavelet transform (DWT)
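
The first two similarity measures can be compared in base R. The series names and simulated data below are assumptions for illustration: `cpu` and `reqs` share a shape but differ in scale, and `unrelated` is pure noise. Euclidean distance is shown on raw and on z-scored values, since raw distance is dominated by scale.

```r
set.seed(3)
tt <- seq(0, 4 * pi, length.out = 200)
cpu       <- sin(tt) + rnorm(200, sd = 0.1)
reqs      <- 50 + 10 * sin(tt) + rnorm(200, sd = 1)  # same shape, other scale
unrelated <- rnorm(200)

pearson  <- cor(cpu, reqs)
spearman <- cor(cpu, reqs, method = "spearman")
noise_cor <- cor(cpu, unrelated)

zscore <- function(x) as.vector(scale(x))            # centre and scale to sd 1
euclid_raw    <- sqrt(sum((cpu - reqs)^2))
euclid_scaled <- sqrt(sum((zscore(cpu) - zscore(reqs))^2))

print(c(pearson = pearson, spearman = spearman, noise = noise_cor))
print(c(raw = euclid_raw, scaled = euclid_scaled))
```

DTW and the DFT/DWT-based comparisons from the list are not covered by this sketch.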

Clustering
● non-euclidean (ultrametric) space
● many small clusters
● local clustering around events
● false positives
  – cron jobs (log rotation)
  – human actions (restarts, reconfigurations)
  – cache expirations
  – …
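
One way to group similar metrics is hierarchical clustering on a correlation-based dissimilarity, sketched here with base R's `hclust()`. The talk does not say which clustering method was used, so the choice of `1 - |correlation|`, average linkage, and two clusters are all assumptions, as are the metric names and simulated data.

```r
set.seed(5)
tt <- seq(0, 6 * pi, length.out = 300)
m <- cbind(
  cpu_a = sin(tt) + rnorm(300, sd = 0.2),
  cpu_b = 2 * sin(tt) + rnorm(300, sd = 0.2),
  net_a = cos(3 * tt) + rnorm(300, sd = 0.2),
  net_b = 5 * cos(3 * tt) + rnorm(300, sd = 0.2)
)

# Dissimilarity: 0 for perfectly (anti)correlated series, 1 for uncorrelated.
d <- as.dist(1 - abs(cor(m)))
tree <- hclust(d, method = "average")
groups <- cutree(tree, k = 2)      # named vector: metric -> cluster id
print(groups)
```

On real data, metrics that move together only because of a shared cron job or a restart end up in the same cluster, which is the false-positive case the slide lists.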

Tools
● collectd
● statsd
● graphite
● whisper-fetch
● R

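R
library(zoo)  # provides coredata() and index(); m is assumed to be a zoo matrix

# Reshape a multi-column series into a long data frame (one row per
# timestamp and series) and add a block-averaged "Smooth" column.
add.smooth <- function(m) {
  r <- nrow(m)
  # Average each column over consecutive blocks of roughly
  # max(3, r %/% 150) points, ignoring missing values.
  ms <- sapply(m, function(y) {
    ave(coredata(y),
        seq.int(r) %/% max(3, r %/% 150),
        FUN = function(x) mean(x, na.rm = TRUE))
  })
  df <- data.frame(index(m)[rep.int(1:r, ncol(m))],
                   factor(rep(1:ncol(m), each = r), levels = 1:ncol(m)),
                   as.vector(coredata(m)),
                   as.vector(coredata(ms)))
  names(df) <- c("Index", "Series", "Value", "Smooth")
  df
}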

Kale Stack
● github.com/etsy/skyline
● github.com/etsy/oculus

Skyline
(image from github.com/etsy/skyline)

Skyline Internals
● Horizon agent
● Redis
● Analyzer agent
● Flask (Python) Web App

Skyline Algorithms
● median absolute deviation
● mean subtraction cumulation
● grubbs
● least squares
● first hour average
● histogram bins
● stddev from average
● ks test
● stddev from moving average
● second order anomalies
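
To illustrate the idea behind the first algorithm in the list, here is a median-absolute-deviation check in base R. It is not Skyline's code: the function name `mad_anomalous` and the threshold of 6 are assumptions for the sketch. Note that R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default.

```r
# TRUE if the latest point is far from the median, measured in MAD units.
mad_anomalous <- function(x, threshold = 6) {
  x <- x[!is.na(x)]
  spread <- mad(x)                 # scaled median absolute deviation
  if (spread == 0) return(FALSE)   # constant series: nothing to compare with
  abs(tail(x, 1) - median(x)) / spread > threshold
}

steady <- rep(c(10, 11, 12), 20)
print(mad_anomalous(c(steady, 50)))   # last point is a spike
print(mad_anomalous(c(steady, 11)))   # last point is typical
```

Median-based statistics are not pulled around by a handful of extreme values in the history, unlike the mean and standard deviation behind the "stddev from average" check.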

Oculus
(image from github.com/etsy/oculus)

Oculus Internals
● Skyline Import Script and Cronjob
● Resque workers
● ElasticSearch
● Sinatra (Ruby) Web App

Q&A
Anton Lebedevich
mabrek@gmail.com
twitter.com/widdoc
github.com/mabrek

Slides containing only graphs (slide number and title)
● 2. A Lot of Graphs
● 5-7. Outlier vs. Changepoint
● 10. Transient Response
● 11-13. Failure on Target Metric
● 15. Missing or Constant
● 16. Changed Mean
● 23. Exponentially-Weighted Moving Average
● 25. Two Weeks
● 27. Time Shifting
● 29. Autocorrelation
● 31-34. Distribution Change
● 38. Cluster Centers
● 39-40. Cluster Members
