I’M KEVIN
I WORK AT NEW RELIC
I LIKE MATH
I’m Kevin Scaldeferri. I work for New Relic as a Principal Engineer and distributed systems architect and I’m sort of a math geek, which leads to writing talk titles like
YOU CAN’T SPELL
MONITORING
WITHOUT
MONOID
Kevin Scaldeferri
New Relic
Before jumping into the math, first some motivation.
I DON’T REALLY LIKE
“METRIC TIME SERIES”
I have a confession to make: I don’t really like metric time series. “But they’re so simple”
EASY IS NOT THE SAME AS SIMPLE
Rich Hickey
No, they are easy, not simple. Doing the easy thing often intertwines multiple concepts in ways that complicate thinking about them.
STORY TIME
Let me tell you a story. We’re having an incident, and people are looking for the root cause.
“Aha! CPU on this DB is going up and to the right. Page the database team!”

DB team: “Nope, that’s normal. That metric’s not a gauge, it’s an accumulative counter; it’s always up and to the right.”
More unhelpful charts of accumulative counters. Why are all these instances different? Is there a real difference or were they just restarted at different times?
WHAT ABOUT GAUGES?
PDX -> SEA
10 ✈ @ 40 min
1000 🚙 @ 3.5 hr
How long does it take to get from Portland to Seattle on average?
PDX -> SEA
10 ✈ @ 40 min
1000 🚙 @ 3.5 hr
Avg = (10*40 + 1000*210) / 1010
= 208 min
Is this right? Not really.
PDX -> SEA
10 ✈ @ 40 min <— 120 people
1000 🚙 @ 3.5 hr <— 1 person
Avg = (1200*40 + 1000*210) / 2200
= 117 min
You need to weight the average correctly.
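The two averages on these slides differ only in what gets used as the weight. A minimal sketch of that arithmetic follows; the `TripGroup` shape and the `weightedMean` helper are names made up for this illustration, and the numbers are the ones from the slide.

```typescript
// Each entry describes one kind of trip from PDX to SEA (numbers from the slide).
interface TripGroup {
  vehicles: number;         // how many vehicles made the trip
  peoplePerVehicle: number; // passengers carried by each vehicle
  minutes: number;          // travel time of one trip
}

const trips: TripGroup[] = [
  { vehicles: 10, peoplePerVehicle: 120, minutes: 40 },  // planes
  { vehicles: 1000, peoplePerVehicle: 1, minutes: 210 }, // cars
];

// Weighted mean: sum(weight * value) / sum(weight).
function weightedMean(groups: TripGroup[], weight: (g: TripGroup) => number): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const g of groups) {
    weightedSum += weight(g) * g.minutes;
    totalWeight += weight(g);
  }
  return weightedSum / totalWeight;
}

// Weighting by vehicle answers "how long does the average vehicle take?"
const perVehicle = weightedMean(trips, (g) => g.vehicles);
// Weighting by traveler answers "how long does the average person take?"
const perPerson = weightedMean(trips, (g) => g.vehicles * g.peoplePerVehicle);

console.log(Math.round(perVehicle), Math.round(perPerson));
```

Same data, same formula, very different answers: the weight has to match the question being asked.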

But what’s that got to do with metrics?
AVERAGE RESPONSE TIME
Host A: 10ms
Host B: 12ms
Host C: 80ms
What is the average response time of this app, where one host is slow for some reason?
AVERAGE RESPONSE TIME
Host A: 10ms
Host B: 12ms
Host C: 80ms
Average: 34ms?
AVERAGE RESPONSE TIME
Host A: 10ms
Host B: 12ms
Host C: 80ms
Average: 34ms? NO!
AVERAGE RESPONSE TIME
Host A: 10ms
Host B: 12ms
Host C: 80ms <— LB sends less
Average: 34ms? NO!
Don’t average averages.
PERCENTILES
Everyone knows percentiles are better than averages anyway.
"MyResource.post-requests": {
  "p50": 0.001,
  "p75": 0.002,
  "p95": 0.006,
  "p98": 0.007,
  "p99": 0.008,
  "p999": 0.018
}
p99 per host is easy to come by, but my SL[IOA] is the p99 for the app overall.
Gil Tene
shameless appeal to authority
UNIQUE COUNTS
Businesses really care about unique counts. How many unique users are coming to the site? How many unique users have tried a new feature?
UNIQUE USERS
10 18 20 19 17 15 12
Weekly Unique Users?
But we’re in trouble if we have daily unique counts and try to get a weekly value.
UNIQUE USERS
10 18 20 19 17 15 12
20 ≤ Weekly Unique Users ≤ 111
Could be anywhere from 20 to 111, which isn’t very satisfying to your business owner.
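Where the 20 and the 111 come from can be spelled out in a few lines. This is only a sketch of the bounds; the variable names are illustrative, and the daily counts are the ones on the slide.

```typescript
// Daily unique user counts from the slide.
const dailyUniques = [10, 18, 20, 19, 17, 15, 12];

// Lower bound: every user seen during the week also showed up on the busiest day.
const lowerBound = Math.max(...dailyUniques);

// Upper bound: no user ever came back on a second day.
const upperBound = dailyUniques.reduce((total, n) => total + n, 0);

console.log(`${lowerBound} <= weekly unique users <= ${upperBound}`);
```

A daily count alone forgets which users it counted, so nothing tighter than these bounds can be recovered from it.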
WELL THIS IS SORT OF
DEPRESSING
MATH TO THE RESCUE!
A MONOID IS AN ALGEBRAIC STRUCTURE
WITH A SINGLE ASSOCIATIVE BINARY
OPERATION AND AN IDENTITY ELEMENT.
Wikipedia
What is a monoid? … what’s that mean?
“ALGEBRAIC STRUCTURE”
=
DATA TYPE
“ASSOCIATIVE BINARY
OPERATION”
=
SOMETHING LIKE ADDITION
“IDENTITY ELEMENT”
=
SOMETHING LIKE ZERO
interface Monoid<T> {
  // (x + y) + z = x + (y + z)
  add(x: T, y: T): T
  // 0 + x = x = x + 0
  zero(): T
}
Here it is as an interface definition. But it’s not just addition: multiplication and string concatenation, for example, also satisfy these rules.
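To make that concrete, here is a small sketch with three instances of the interface. The instance names and the `combineAll` helper are invented for this example and are not from any particular library.

```typescript
interface Monoid<T> {
  add(x: T, y: T): T; // associative: (x + y) + z = x + (y + z)
  zero(): T;          // identity: 0 + x = x = x + 0
}

// Three different monoids; note that two of them live on the same type (number).
const sum: Monoid<number> = { add: (x, y) => x + y, zero: () => 0 };
const product: Monoid<number> = { add: (x, y) => x * y, zero: () => 1 };
const concat: Monoid<string> = { add: (x, y) => x + y, zero: () => "" };

// Combine any number of values; an empty list yields the identity element.
function combineAll<T>(m: Monoid<T>, values: T[]): T {
  return values.reduce((acc, v) => m.add(acc, v), m.zero());
}

console.log(combineAll(sum, [1, 2, 3, 4]));
console.log(combineAll(product, [1, 2, 3, 4]));
console.log(combineAll(concat, ["mon", "oid"]));
```

Because the operation is associative, `combineAll` could just as well split the list into chunks, combine each chunk separately, and combine the partial results.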
HOW DOES THIS HELP?
How does this simple concept help fix the problems with our easy approach?
TEMPORAL AND
DIMENSIONAL AGGREGATION
AGGREGATION
          1   2   3   4   5   6   7   8   9  10  11  12
host A   10  14  15  19  17  15  12  11  12  15  14  17
host B    9  13  12  15  16  17  16  14   9  11  12  15
host C   10  15  13  16  13  19  15  16  13  13  12  14
host D   10  13  13  17  14  20  13  15  12  12  13  15
10 second resolution is great for tactical debugging.
AGGREGATION
         1-4  5-8  9-12
host A    58   55   58
host B    49   63   47
host C    54   63   52
host D    53   62   52
But for long-term analysis it’s too expensive, and we want time roll-ups.
AGGREGATION
          1   2   3   4   5   6   7   8   9  10  11  12
host A   10  14  15  19  17  15  12  11  12  15  14  17
host B    9  13  12  15  16  17  16  14   9  11  12  15
host C   10  15  13  16  13  19  15  16  13  13  12  14
host D   10  13  13  17  14  20  13  15  12  12  13  15
Similarly, we want all those high-cardinality dimensions to track down problems and answer ad-hoc questions.
AGGREGATION
             1   2   3   4   5   6   7   8   9  10  11  12
all hosts   39  55  53  67  60  71  56  56  46  51  51  62
But you also need to measure SLIs. And a year from now you won’t care about that container ID.
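Both kinds of roll-up on these slides are the same sum monoid applied along different axes. A minimal sketch using the per-host numbers from the slide; `rollUp` and `acrossHosts` are names chosen for this illustration.

```typescript
// Request counts per 10-second interval, copied from the slide.
const perHost: Record<string, number[]> = {
  A: [10, 14, 15, 19, 17, 15, 12, 11, 12, 15, 14, 17],
  B: [9, 13, 12, 15, 16, 17, 16, 14, 9, 11, 12, 15],
  C: [10, 15, 13, 16, 13, 19, 15, 16, 13, 13, 12, 14],
  D: [10, 13, 13, 17, 14, 20, 13, 15, 12, 12, 13, 15],
};

// Temporal aggregation: sum each run of `width` consecutive intervals.
function rollUp(series: number[], width: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < series.length; i += width) {
    out.push(series.slice(i, i + width).reduce((s, n) => s + n, 0));
  }
  return out;
}

// Dimensional aggregation: sum across hosts, interval by interval.
function acrossHosts(hosts: Record<string, number[]>): number[] {
  const all = Object.keys(hosts).map((h) => hosts[h]);
  return all[0].map((_, i) => all.reduce((s, series) => s + series[i], 0));
}

const hostA = rollUp(perHost.A, 4);
const allHosts = acrossHosts(perHost);
// Rolling up time first or hosts first gives the same totals for sums.
const timeThenHosts = acrossHosts({
  A: rollUp(perHost.A, 4),
  B: rollUp(perHost.B, 4),
  C: rollUp(perHost.C, 4),
  D: rollUp(perHost.D, 4),
});
const hostsThenTime = rollUp(allHosts, 4);

console.log(hostA, allHosts, hostsThenTime);
```

The order of aggregation does not matter here because addition is associative and commutative, which is what lets you keep the raw data and decide on the roll-up later.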
ACCUMULATIVE
COUNTERS
replace accumulative counters with
DELTA
COUNTERS
delta counters
12 REQUESTS TO THIS ENDPOINT WERE
RECEIVED BY THIS HOST DURING THIS
TIME INTERVAL
Useful Monitoring
Some of our sources of telemetry insist on giving us accumulators, but as quickly as possible we need to convert them to something like this.
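A rough sketch of that conversion: turn successive readings of an accumulator into per-interval deltas. The `toDeltas` name is illustrative, and the reset handling is an assumption (a drop in the value is treated as a restart from zero), which will not be right for every source.

```typescript
// Convert successive readings of an accumulative counter into per-interval deltas.
function toDeltas(readings: number[]): number[] {
  const deltas: number[] = [];
  for (let i = 1; i < readings.length; i++) {
    const diff = readings[i] - readings[i - 1];
    // Assumption: a drop means the process restarted and the counter began again at zero,
    // so the new reading itself is the best available delta for that interval.
    deltas.push(diff >= 0 ? diff : readings[i]);
  }
  return deltas;
}

// The counter climbs, resets after a restart, then climbs again.
const deltas = toDeltas([100, 112, 130, 5, 17]);
console.log(deltas);
```

Unlike the raw readings, the deltas can be summed across time or across hosts without anyone needing to know when each host last restarted.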
(AND YOU SHOULD SUM THEM)
Useful Monitoring
And that measurement needs to tell us how to combine multiple data points.
A MONOID IS
BOTH THE DATA
AND THE OPERATION
There’s more than one monoid on longs and doubles, and we need to be clear about what’s sensible to do with a particular metric.
MIN / MAX
GAUGES
Don’t sum or average a max or min. Take the max of all your maxes and the min of all your mins.
THE MAX MEMORY USED BY THIS HOST
DURING THIS TIME INTERVAL WAS
1.2GB; AND AGGREGATE USING MAX
Useful Monitoring
This should be explicit, not something you have to extract from the metric name.
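One way to make the aggregation explicit is to carry it in the data point itself. The sketch below is a hypothetical shape, not the format of any existing telemetry system; `Measurement`, `Aggregation`, and `merge` are names invented for this example.

```typescript
// The aggregation travels with the data instead of hiding in the metric name.
type Aggregation = "sum" | "min" | "max";

interface Measurement {
  metric: string;
  value: number;
  aggregation: Aggregation;
}

// Each aggregation is a monoid on numbers.
// Identity elements: 0 for sum, +Infinity for min, -Infinity for max.
const combine: Record<Aggregation, (x: number, y: number) => number> = {
  sum: (x, y) => x + y,
  min: (x, y) => Math.min(x, y),
  max: (x, y) => Math.max(x, y),
};

function merge(a: Measurement, b: Measurement): Measurement {
  if (a.metric !== b.metric || a.aggregation !== b.aggregation) {
    throw new Error("cannot merge measurements of different metrics");
  }
  return { ...a, value: combine[a.aggregation](a.value, b.value) };
}

const maxMemory = merge(
  { metric: "memory.max_gb", value: 1.2, aggregation: "max" },
  { metric: "memory.max_gb", value: 0.9, aggregation: "max" },
);
const requests = merge(
  { metric: "requests", value: 12, aggregation: "sum" },
  { metric: "requests", value: 7, aggregation: "sum" },
);
console.log(maxMemory.value, requests.value);
```

With the operation attached, a generic roll-up job never has to guess whether a number should be summed or maxed.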
MONOIDS
COMPOSE
AVERAGE RESPONSE TIME
Host A: 10ms
Host B: 12ms
Host C: 80ms
Average: ???
How do we do this right?
AVERAGE RESPONSE TIME
Host A: 10s / 1000 reqs = 10ms avg
Host B: 10.8s / 900 reqs = 12ms avg
Host C: 9.6s / 120 reqs = 80ms avg
Average: ???
Break it into two sum monoids for the total time of requests and the total number of requests.
AVERAGE RESPONSE TIME
Host A: 10s / 1000 reqs = 10ms avg
Host B: 10.8s / 900 reqs = 12ms avg
Host C: 9.6s / 120 reqs = 80ms avg
Avg: 30.4s / 2020 reqs = 15ms avg
Now we can aggregate correctly and get the right answer.
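As a sketch of how the two sum monoids compose into one: keep total time and request count together, add them component-wise, and only divide at the very end. The `LatencyTotals` type and helper names are made up for this example; the numbers are the ones from the slide.

```typescript
// A pair of sums is itself a monoid: add component-wise, identity is (0, 0).
interface LatencyTotals {
  totalSeconds: number;
  requests: number;
}

const emptyTotals: LatencyTotals = { totalSeconds: 0, requests: 0 };

const addTotals = (x: LatencyTotals, y: LatencyTotals): LatencyTotals => ({
  totalSeconds: x.totalSeconds + y.totalSeconds,
  requests: x.requests + y.requests,
});

// The average is derived at read time; it is never stored or aggregated itself.
const averageMs = (t: LatencyTotals): number => (t.totalSeconds / t.requests) * 1000;

const hostTotals: LatencyTotals[] = [
  { totalSeconds: 10, requests: 1000 },  // Host A: 10ms avg
  { totalSeconds: 10.8, requests: 900 }, // Host B: 12ms avg
  { totalSeconds: 9.6, requests: 120 },  // Host C: 80ms avg
];

const overall = hostTotals.reduce(addTotals, emptyTotals);
const naive = hostTotals.map(averageMs).reduce((s, a) => s + a, 0) / hostTotals.length;

console.log(Math.round(averageMs(overall)), Math.round(naive));
```

The naive average of averages comes out around 34ms, while the composed monoid gives roughly 15ms, because host C handled far fewer requests.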

The Prometheus histogram Bryan showed yesterday is a more complicated example where you have to know exactly how to combine all those individual lines together.
We can do better. Structured logs, why not structured metrics?
APPROXIMATION
WITH RIGOR
Monoids tell us how to design approximate algorithms which are still mathematically sound.
UNIQUE COUNTS
Let’s revisit our unique count example.
UNIQUE USERS
10 18 20 19 17 15 12
Weekly Unique Users?
We know that the unique counts for each day aren’t sufficient to let us calculate the unique users for the week, so what should we do? This is not at all obvious.
HYPERLOGLOG: THE ANALYSIS OF A
NEAR-OPTIMAL CARDINALITY
ESTIMATION ALGORITHM
Flajolet, et al
Lots of research, but at this point everyone pretty much agrees HyperLogLog is the way to go.
UNIQUE USERS - HYPERLOGLOG
1000110 0111101 01…
1010011 0010110 001…
1110010 0101101 00…
0001110 0001101 01…
1010010 0100101 00…
1110110 0001101 11…
1100100 0111010 111…
Weekly Unique Users = 25
A HyperLogLog takes about 700 bytes, so you don’t want to track a ton of these, but that’s reasonable for high-value business metrics.
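What makes HyperLogLog fit this talk is that its sketches form a monoid: merging is a register-wise max, and the empty sketch is the identity. The toy version below is a sketch for illustration only. The register count (`P = 10`), the hash function, and all function names are assumptions made here, and it leaves out the refinements a production implementation would have.

```typescript
const P = 10;     // assumption: 2^10 registers, chosen arbitrarily for this sketch
const M = 1 << P;

type Hll = Uint8Array; // one small register per bucket

// 32-bit FNV-1a followed by a murmur-style finalizer; any well-mixed hash would do.
function hash32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Identity element: a sketch that has seen nothing.
const hllZero = (): Hll => new Uint8Array(M);

function hllOf(items: string[]): Hll {
  const sketch = hllZero();
  for (const item of items) {
    const h = hash32(item);
    const idx = h >>> (32 - P);  // top P bits pick a register
    const rest = (h << P) >>> 0; // remaining bits, left-aligned
    const rank = Math.min(Math.clz32(rest), 32 - P) + 1;
    sketch[idx] = Math.max(sketch[idx], rank);
  }
  return sketch;
}

// Monoid operation: register-wise max. Associative, commutative, and idempotent.
function hllMerge(a: Hll, b: Hll): Hll {
  const out = new Uint8Array(M);
  for (let i = 0; i < M; i++) out[i] = Math.max(a[i], b[i]);
  return out;
}

// Standard HyperLogLog estimate with the small-range (linear counting) correction.
function hllEstimate(sketch: Hll): number {
  const alpha = 0.7213 / (1 + 1.079 / M);
  let inverseSum = 0;
  let zeros = 0;
  for (let i = 0; i < M; i++) {
    inverseSum += Math.pow(2, -sketch[i]);
    if (sketch[i] === 0) zeros++;
  }
  const raw = (alpha * M * M) / inverseSum;
  return raw <= 2.5 * M && zeros > 0 ? M * Math.log(M / zeros) : raw;
}

// Hypothetical daily user IDs with overlap between days.
const days = [["ann", "bob", "cat"], ["bob", "dan"], ["ann", "eve", "fay"]];
const weekly = days.map(hllOf).reduce(hllMerge, hllZero());
const direct = hllOf(([] as string[]).concat(...days));

console.log(hllEstimate(weekly));
```

Merging the daily sketches yields exactly the same registers as sketching the whole week at once, so the weekly estimate loses nothing by being computed from daily roll-ups.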
PERCENTILES
What about percentiles? The good news is that there are lots of ways to approximate percentiles monoidally. But the bad news is also that there are lots of ways to approximate percentiles monoidally.
RE-AGGREGATABLE PERCENTILES
▸ MomentSketch
▸ Q-Digest
▸ T-Digest
▸ GK-Array
▸ HDRHistogram
▸ Spectator histogram
▸ CLWY “Random” Algorithm
▸ DDSketch
This is not a complete list, just some of the most well-known approaches. These all make tradeoffs in a multi-dimensional space of speed, size, and accuracy, and this is still an active area of research. Unlike unique counts, we don’t have a consensus about what approach all our monitoring tools should use, which makes it hard to compare data across multiple systems.
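To show what “re-aggregatable” means without picking a side, here is a deliberately crude sketch: a histogram with fixed buckets, merged by adding counts. It is not an implementation of any algorithm on the list above; the bucket bounds and function names are assumptions for illustration, and the answer is only as precise as the bucket widths.

```typescript
// Hypothetical bucket upper bounds in milliseconds; the last bucket catches everything larger.
const BOUNDS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, Infinity];

type Histogram = number[]; // counts per bucket, same length as BOUNDS

// Identity element: all counts zero.
const histZero = (): Histogram => BOUNDS.map(() => 0);

// Assumes values are non-negative, finite numbers.
function histOf(values: number[]): Histogram {
  const h = histZero();
  for (const v of values) {
    h[BOUNDS.findIndex((b) => v <= b)] += 1;
  }
  return h;
}

// Monoid operation: add counts bucket by bucket.
const histAdd = (a: Histogram, b: Histogram): Histogram => a.map((n, i) => n + b[i]);

// Upper bound of the bucket that contains the q-th quantile (NaN for an empty histogram).
function quantileUpperBound(h: Histogram, q: number): number {
  const total = h.reduce((s, n) => s + n, 0);
  const rank = Math.max(1, Math.ceil(q * total));
  let seen = 0;
  for (let i = 0; i < h.length; i++) {
    seen += h[i];
    if (seen >= rank) return BOUNDS[i];
  }
  return NaN;
}

// Hypothetical response times in ms from a fast host and a slow host.
const fastHost = [8, 9, 12, 11];
const slowHost = [70, 90, 85];
const appWide = histAdd(histOf(fastHost), histOf(slowHost));

console.log(quantileUpperBound(appWide, 0.5), quantileUpperBound(appWide, 0.99));
```

The per-host p99 values cannot be combined into an app-wide p99, but the per-host histograms can, and the percentile is then read off the merged result.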
Gratuitous dog photo in case you were getting overwhelmed by math about now.
WRAPPING UP
METRIC TIME SERIES
▸ Can be misleading / surprising
▸ Accumulative Counters: please stop!
▸ Easy to do mathematical nonsense
▸ Accurate aggregation often impossible
Metric time series have been the easy and dominant paradigm for monitoring data over the last decade or so, but they present challenges in today’s environment.
MONOIDS
▸ Data that tells us what math makes sense
▸ Collect high-resolution, high-cardinality data
▸ Aggregate after the fact as needed
▸ Composable
▸ Guides the design of approximate algorithms
Monoids provide a simple framework which allows us to build mathematically sound monitoring systems.
CHALLENGES
▸ Self-describing data that includes how to aggregate
▸ Composite data types
▸ Universal support for HyperLogLogs
▸ Consensus on quantile estimation
If we’re adding units and descriptions for humans to our metrics (à la OpenCensus), why not richer type annotations?

Quantiles are hard; maybe OpenTelemetry should tackle this.
THANK YOU
KEVIN SCALDEFERRI
@KSCALDEF
