Intro to open source observability with grafana, prometheus, loki, and tempo(2).pptx

Richard “RichiH” Hartmann
Intro to open source observability
with Grafana, Prometheus,
Loki, and Tempo
I come from the trenches of tech
Bad tools are worse than no tools
| 2
Today’s reality: Disparate systems. Disparate data.
Back to the basics
Let’s rethink this
| 4
Or: Buzzwords, and their useful parts
Observability & SRE
| 5
| 6
Dirty secret
● What is “cloud native scale” is “Internet scale” of networks two
decades ago
● The combination of metrics and events has been standard in
power measurement for half a century
● Modern tools with good engineering practices map very well onto
brownfield technology
| 7
Buzzword alert!
● Cool new term, almost meaningless by now, what does it mean?
○ Pitfall alert: Cargo culting
○ It’s about changing the behaviour, not about changing the name
● “Monitoring” has taken on a meaning of collecting, not using data
○ One extreme: Full text indexing
○ Other extreme: Data lake
● “Observability” is about enabling humans to understand complex
systems
○ Ask why it’s not working instead of just knowing that it’s not
| 8
If you can’t ask new questions on the fly, it’s not
observability
| 9
| 10
Complexity
● Fake complexity, a.k.a. bad design
○ Can be reduced
● Real, system-inherent complexity
○ Can be moved (monolith vs client-server vs microservices)
○ Must be compartmentalized (service boundaries)
○ Should be distilled meaningfully
| 11
SRE, an instantiation of DevOps
● At its core: Align incentives across the org
○ Error budgets allow devs, ops, PMs, etc. to optimize for shared benefits
● Measure it!
○ SLI: Service Level Indicator: What you measure
○ SLO: Service Level Objective: What you need to hit
○ SLA: Service Level Agreement: When you need to pay
| 12
Shared understanding
● Everyone uses the same tools & dashboards
○ Shared incentive to invest into tooling
○ Pooling of institutional system knowledge
○ Shared language & understanding of services
| 13
Services
● Service?
○ Compartmentalized complexity, with an interface
○ Different owners/teams
○ Contracts define interfaces
● Why “contract”: Shared agreement which MUST NOT be broken
○ Internal and external customers rely on what you build and maintain
● Other common term: layer
○ The Internet would not exist without network layering
○ Enables innovation, parallelizes human engineering
● Other examples: CPUs, harddisk, compute nodes, your lunch
| 14
Alerting
● Customers care about services being up, not about individual components
● Discern between different SLIs
○ Primary: service-relevant, for alerting
○ Secondary: informational, debugging, might be underlying’s primary
Anything currently or imminently impacting customer service must be
alerted upon
But nothing(!) else
Prometheus
| 15
| 16
Prometheus 101
● Inspired by Google's Borgmon
● Time series database
● unit64 millisecond timestamp, float64 value
● Instrumentation & exporters
● Not for event logging
● Dashboarding via Grafana
| 17
Main selling points
● Highly dynamic, built-in service discovery
● No hierarchical model, n-dimensional label set
● PromQL: for processing, graphing, alerting, and export
● Simple operation
● Highly efficient
| 18
Main selling points
● Prometheus is a pull-based system
● Black-box monitoring: Looking at a service from the outside (Does the server
answer to HTTP requests?)
● White-box monitoring: Instrumenting code from the inside (How much time
does this subroutine take?)
● Every service should have its own metrics endpoint
● Hard API commitments within major versions
| 19
Time series
● Time series are recorded values which change over time
● Individual events are usually merged into counters and/or histograms
● Changing values are recorded as gauges
● Typical examples
○ Requests to a webserver (counter)
○ Temperatures in a datacenter (gauge)
○ Service latency (histograms)
| 20
Super easy to emit, parse & read
http_requests_total{env="prod",method="post",code="200"} 1027
http_requests_total{env="prod",method="post",code="400"} 3
http_requests_total{env="prod",method="post",code="500"} 12
http_requests_total{env="prod",method="get",code="200"} 20
http_requests_total{env="test",method="post",code="200"} 372
http_requests_total{env="test",method="post",code="400"} 75
| 21
Scale
● Kubernetes is Borg
● Prometheus is Borgmon
● Google couldn't have run Borg without Borgmon (plus Omega and Monarch)
● Kubernetes & Prometheus are designed and written with each other in mind
| 22
Prometheus scale
● 1,000,000+ samples/second no problem on current hardware
● ~200,000 samples/second/core
● 16 bytes/sample compressed to 1.36 bytes/sample
The highest we saw in production on a single Prometheus instance were 15
million active times series at once!
| 23
Long term storage
● Two long term storage solutions have Prometheus-team members
working on them
○ Thanos
■ Historically easier to run, but slower
■ Scales storage horizontally
○ Cortex
■ Easy to run these days
■ Scales both storage, ingester, and querier horizontally
| 24
Cortex @ Grafana (largest cluster, 2021-09)
● ~65 million active series (just the one cluster)
● 668 CPU cores
● 3,349 GiB RAM
One customer running at 3 billion active series
Loki
| 25
| 26
Loki 101
● Following the same label-based system like Prometheus
● No full text index needed, incredible speed
● Work with logs at scale, without the massive cost
● Access logs with the same label sets as metrics
● Turn logs into metrics, to make it easier to work with them
● Make direct use of syslog data, via promtail
2019-12-11T10:01:02.123456789Z {env="prod",instance=”1.1.1.1”} GET /about
Timestamp
with nanosecond precision
Content
log line
Prometheus-style Labels
key-value pairs
indexed unindexed
| 28
Loki @ Grafana Labs
● Queries regularly see 40 GiB/s
● Query terabytes of data in under a minute
○ Including complex processing of result sets
Tempo
| 29
| 30
Tempo
● Exemplars: Jump from relevant logs & metrics
○ Native to Prometheus, Cortex, Thanos, and Loki
○ Exemplars work at Google scale, with the ease of Grafana
● Index and search by labelsets available for those who need it
● Object store only: No Cassandra, Elastic, etc.
● 100% compatible with OpenTelemetry Tracing, Zipkin, Jaeger
● 100% of your traces, no sampling
| 31
Tempo @ Grafana Labs (2021-07)
● 2,200,000 samples per second @ 350 MiB/s
● 14-day retention @ 3 copies stored
● ~240 CPU cores (includes compression cost)
● ~450 GiB RAM
● 132 TiB object storage
● Latencies:
○ p99 - 2.5s
○ p90 - 2.3s
○ p50 - 1.6s
Bringing it together
| 32
From logs to traces
| 33
From metrics to traces
| 34
From metrics to traces
| 35
...and from traces to logs
| 36
All of this is Open Source and you can run it yourself
| 37
Intro to open source observability with grafana, prometheus, loki, and tempo(2).pptx
Intro to open source observability with grafana, prometheus, loki, and tempo(2).pptx
grafana.com/go/obscon2021
| 40
For a deeper dive into
Open Source Observability, go to:
Thank you!
richih@richih.org
twitter.com/TwitchiH
github.com/RichiH/talks
1 sur 41

Recommandé

Monitoring using Prometheus and Grafana par
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaArvind Kumar G.S
3.6K vues25 diapositives
Introduction to Prometheus par
Introduction to PrometheusIntroduction to Prometheus
Introduction to PrometheusJulien Pivotto
6.7K vues55 diapositives
Grafana Loki: like Prometheus, but for Logs par
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsMarco Pracucci
2.8K vues38 diapositives
Apache Airflow par
Apache AirflowApache Airflow
Apache AirflowKnoldus Inc.
634 vues17 diapositives
Loki - like prometheus, but for logs par
Loki - like prometheus, but for logsLoki - like prometheus, but for logs
Loki - like prometheus, but for logsJuraj Hantak
533 vues38 diapositives
DevOps Monitoring and Alerting par
DevOps Monitoring and AlertingDevOps Monitoring and Alerting
DevOps Monitoring and AlertingKhairul Zebua
375 vues43 diapositives

Contenu connexe

Tendances

SRE & Kubernetes par
SRE & KubernetesSRE & Kubernetes
SRE & KubernetesAfkham Azeez
553 vues33 diapositives
Monitoring Kubernetes with Prometheus par
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
4.2K vues35 diapositives
Monitoring with prometheus par
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheusKasper Nissen
6.8K vues53 diapositives
MeetUp Monitoring with Prometheus and Grafana (September 2018) par
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)Lucas Jellema
6.2K vues37 diapositives
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management par
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementBurasakorn Sabyeying
7.5K vues36 diapositives
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana par
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
235 vues31 diapositives

Tendances(20)

Monitoring Kubernetes with Prometheus par Grafana Labs
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
Grafana Labs4.2K vues
Monitoring with prometheus par Kasper Nissen
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
Kasper Nissen6.8K vues
MeetUp Monitoring with Prometheus and Grafana (September 2018) par Lucas Jellema
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
Lucas Jellema6.2K vues
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management par Burasakorn Sabyeying
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana par Sridhar Kumar N
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Sridhar Kumar N235 vues
Server monitoring using grafana and prometheus par Celine George
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
Celine George2.1K vues
Prometheus design and philosophy par Docker, Inc.
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
Docker, Inc.2.7K vues
Introduction to Apache Airflow par mutt_data
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
mutt_data900 vues
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ... par HostedbyConfluent
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Monitoring With Prometheus par Knoldus Inc.
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
Knoldus Inc.424 vues
Monitoring_with_Prometheus_Grafana_Tutorial par Tim Vaillancourt
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
Tim Vaillancourt9.1K vues

Similaire à Intro to open source observability with grafana, prometheus, loki, and tempo(2).pptx

Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La... par
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
348 vues35 diapositives
Labeling all the Things with the WDI Skill Labeler par
Labeling all the Things with the WDI Skill Labeler Labeling all the Things with the WDI Skill Labeler
Labeling all the Things with the WDI Skill Labeler Kwame Porter Robinson
92 vues43 diapositives
Elasticsearch Performance Testing and Scaling @ Signal par
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalJoachim Draeger
431 vues25 diapositives
WTF is a Microservice - Rafael Schloming, Datawire par
WTF is a Microservice - Rafael Schloming, DatawireWTF is a Microservice - Rafael Schloming, Datawire
WTF is a Microservice - Rafael Schloming, DatawireAmbassador Labs
2K vues41 diapositives
Prometheus: What is is, what is new, what is coming par
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingJulien Pivotto
43 vues27 diapositives
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016 par
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
457 vues36 diapositives

Similaire à Intro to open source observability with grafana, prometheus, loki, and tempo(2).pptx(20)

Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La... par InfluxData
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
InfluxData348 vues
Elasticsearch Performance Testing and Scaling @ Signal par Joachim Draeger
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
Joachim Draeger431 vues
WTF is a Microservice - Rafael Schloming, Datawire par Ambassador Labs
WTF is a Microservice - Rafael Schloming, DatawireWTF is a Microservice - Rafael Schloming, Datawire
WTF is a Microservice - Rafael Schloming, Datawire
Ambassador Labs2K vues
Prometheus: What is is, what is new, what is coming par Julien Pivotto
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is coming
Julien Pivotto43 vues
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016 par Dan Lynn
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dan Lynn457 vues
Extracting Insights from Data at Twitter par Prasad Wagle
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle2K vues
Building a data pipeline to ingest data into Hadoop in minutes using Streamse... par Guglielmo Iozzia
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Guglielmo Iozzia1.2K vues
Go Observability (in practice) par Eran Levy
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
Eran Levy1.1K vues
Data platform architecture principles - ieee infrastructure 2020 par Julien Le Dem
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Julien Le Dem824 vues
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry par Marcus Hanwell
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryOpen Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Marcus Hanwell535 vues
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H... par Data Con LA
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA291 vues
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N... par Amazon Web Services
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
Distributed real time stream processing- why and how par Petr Zapletal
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
Petr Zapletal20.2K vues
Bitkom Cray presentation - on HPC affecting big data analytics in FS par Philip Filleul
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Philip Filleul1.3K vues
High-Speed Reactive Microservices par Rick Hightower
High-Speed Reactive MicroservicesHigh-Speed Reactive Microservices
High-Speed Reactive Microservices
Rick Hightower1.5K vues
BISSA: Empowering Web gadget Communication with Tuple Spaces par Srinath Perera
BISSA: Empowering Web gadget Communication with Tuple SpacesBISSA: Empowering Web gadget Communication with Tuple Spaces
BISSA: Empowering Web gadget Communication with Tuple Spaces
Srinath Perera1.2K vues
Digital Document Preservation Simulation - Boston Python User's Group par Micah Altman
Digital Document  Preservation Simulation - Boston Python User's GroupDigital Document  Preservation Simulation - Boston Python User's Group
Digital Document Preservation Simulation - Boston Python User's Group
Micah Altman850 vues
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin... par Anna Ossowski
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
Anna Ossowski115 vues

Plus de LibbySchulze

Running distributed tests with k6.pdf par
Running distributed tests with k6.pdfRunning distributed tests with k6.pdf
Running distributed tests with k6.pdfLibbySchulze
529 vues25 diapositives
Extending Kubectl.pptx par
Extending Kubectl.pptxExtending Kubectl.pptx
Extending Kubectl.pptxLibbySchulze
204 vues21 diapositives
Enhancing Data Protection Workflows with Kanister And Argo Workflows par
Enhancing Data Protection Workflows with Kanister And Argo WorkflowsEnhancing Data Protection Workflows with Kanister And Argo Workflows
Enhancing Data Protection Workflows with Kanister And Argo WorkflowsLibbySchulze
275 vues26 diapositives
Fallacies in Platform Engineering.pdf par
Fallacies in Platform Engineering.pdfFallacies in Platform Engineering.pdf
Fallacies in Platform Engineering.pdfLibbySchulze
68 vues19 diapositives
Intro to Fluvio.pptx.pdf par
Intro to Fluvio.pptx.pdfIntro to Fluvio.pptx.pdf
Intro to Fluvio.pptx.pdfLibbySchulze
245 vues10 diapositives
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf par
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdfCNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdfLibbySchulze
74 vues26 diapositives

Plus de LibbySchulze(20)

Running distributed tests with k6.pdf par LibbySchulze
Running distributed tests with k6.pdfRunning distributed tests with k6.pdf
Running distributed tests with k6.pdf
LibbySchulze529 vues
Enhancing Data Protection Workflows with Kanister And Argo Workflows par LibbySchulze
Enhancing Data Protection Workflows with Kanister And Argo WorkflowsEnhancing Data Protection Workflows with Kanister And Argo Workflows
Enhancing Data Protection Workflows with Kanister And Argo Workflows
LibbySchulze275 vues
Fallacies in Platform Engineering.pdf par LibbySchulze
Fallacies in Platform Engineering.pdfFallacies in Platform Engineering.pdf
Fallacies in Platform Engineering.pdf
LibbySchulze68 vues
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf par LibbySchulze
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdfCNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
LibbySchulze74 vues
Oh The Places You'll Sign.pdf par LibbySchulze
Oh The Places You'll Sign.pdfOh The Places You'll Sign.pdf
Oh The Places You'll Sign.pdf
LibbySchulze113 vues
Rancher MasterClass - Avoiding-configuration-drift.pptx par LibbySchulze
Rancher  MasterClass - Avoiding-configuration-drift.pptxRancher  MasterClass - Avoiding-configuration-drift.pptx
Rancher MasterClass - Avoiding-configuration-drift.pptx
LibbySchulze51 vues
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx par LibbySchulze
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptxvFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx
LibbySchulze47 vues
CNCF Live Webinar: Low Footprint Java Containers with GraalVM par LibbySchulze
CNCF Live Webinar: Low Footprint Java Containers with GraalVMCNCF Live Webinar: Low Footprint Java Containers with GraalVM
CNCF Live Webinar: Low Footprint Java Containers with GraalVM
LibbySchulze155 vues
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin... par LibbySchulze
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
LibbySchulze288 vues
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr... par LibbySchulze
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...
LibbySchulze25 vues
CNCF_ A step to step guide to platforming your delivery setup.pdf par LibbySchulze
CNCF_ A step to step guide to platforming your delivery setup.pdfCNCF_ A step to step guide to platforming your delivery setup.pdf
CNCF_ A step to step guide to platforming your delivery setup.pdf
LibbySchulze250 vues
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf par LibbySchulze
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdfCNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
LibbySchulze348 vues
Securing Windows workloads.pdf par LibbySchulze
Securing Windows workloads.pdfSecuring Windows workloads.pdf
Securing Windows workloads.pdf
LibbySchulze191 vues
Securing Windows workloads.pdf par LibbySchulze
Securing Windows workloads.pdfSecuring Windows workloads.pdf
Securing Windows workloads.pdf
LibbySchulze99 vues
Advancements in Kubernetes Workload Identity for Azure par LibbySchulze
Advancements in Kubernetes Workload Identity for AzureAdvancements in Kubernetes Workload Identity for Azure
Advancements in Kubernetes Workload Identity for Azure
LibbySchulze182 vues

Dernier

Affiliate Marketing par
Affiliate MarketingAffiliate Marketing
Affiliate MarketingNavin Dhanuka
21 vues30 diapositives
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download par
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink DownloadAPNIC
112 vues30 diapositives
Penetration Testing for Cybersecurity Professionals par
Penetration Testing for Cybersecurity ProfessionalsPenetration Testing for Cybersecurity Professionals
Penetration Testing for Cybersecurity Professionals211 Check
49 vues17 diapositives
Amine el bouzalimi par
Amine el bouzalimiAmine el bouzalimi
Amine el bouzalimiAmine EL BOUZALIMI
6 vues38 diapositives
ATPMOUSE_융합2조.pptx par
ATPMOUSE_융합2조.pptxATPMOUSE_융합2조.pptx
ATPMOUSE_융합2조.pptxkts120898
35 vues70 diapositives
40th TWNIC Open Policy Meeting: APNIC PDP update par
40th TWNIC Open Policy Meeting: APNIC PDP update40th TWNIC Open Policy Meeting: APNIC PDP update
40th TWNIC Open Policy Meeting: APNIC PDP updateAPNIC
106 vues20 diapositives

Dernier(13)

40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download par APNIC
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download
40th TWNIC OPM: On LEOs (Low Earth Orbits) and Starlink Download
APNIC112 vues
Penetration Testing for Cybersecurity Professionals par 211 Check
Penetration Testing for Cybersecurity ProfessionalsPenetration Testing for Cybersecurity Professionals
Penetration Testing for Cybersecurity Professionals
211 Check49 vues
ATPMOUSE_융합2조.pptx par kts120898
ATPMOUSE_융합2조.pptxATPMOUSE_융합2조.pptx
ATPMOUSE_융합2조.pptx
kts12089835 vues
40th TWNIC Open Policy Meeting: APNIC PDP update par APNIC
40th TWNIC Open Policy Meeting: APNIC PDP update40th TWNIC Open Policy Meeting: APNIC PDP update
40th TWNIC Open Policy Meeting: APNIC PDP update
APNIC106 vues
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx par LeasedLinesQuote
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptxCracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx
Cracking the Code Decoding Leased Line Quotes for Connectivity Excellence.pptx
40th TWNIC Open Policy Meeting: A quick look at QUIC par APNIC
40th TWNIC Open Policy Meeting: A quick look at QUIC40th TWNIC Open Policy Meeting: A quick look at QUIC
40th TWNIC Open Policy Meeting: A quick look at QUIC
APNIC109 vues
The Dark Web : Hidden Services par Anshu Singh
The Dark Web : Hidden ServicesThe Dark Web : Hidden Services
The Dark Web : Hidden Services
Anshu Singh22 vues

Intro to open source observability with grafana, prometheus, loki, and tempo(2).pptx

  • 1. Richard “RichiH” Hartmann Intro to open source observability with Grafana, Prometheus, Loki, and Tempo
  • 2. I come from the trenches of tech Bad tools are worse than no tools | 2
  • 3. Today’s reality: Disparate systems. Disparate data.
  • 4. Back to the basics Let’s rethink this | 4
  • 5. Or: Buzzwords, and their useful parts Observability & SRE | 5
  • 6. | 6 Dirty secret ● What is “cloud native scale” is “Internet scale” of networks two decades ago ● The combination of metrics and events has been standard in power measurement for half a century ● Modern tools with good engineering practices map very well onto brownfield technology
  • 7. | 7 Buzzword alert! ● Cool new term, almost meaningless by now, what does it mean? ○ Pitfall alert: Cargo culting ○ It’s about changing the behaviour, not about changing the name ● “Monitoring” has taken on a meaning of collecting, not using data ○ One extreme: Full text indexing ○ Other extreme: Data lake ● “Observability” is about enabling humans to understand complex systems ○ Ask why it’s not working instead of just knowing that it’s not
  • 8. | 8
  • 9. If you can’t ask new questions on the fly, it’s not observability | 9
  • 10. | 10 Complexity ● Fake complexity, a.k.a. bad design ○ Can be reduced ● Real, system-inherent complexity ○ Can be moved (monolith vs client-server vs microservices) ○ Must be compartmentalized (service boundaries) ○ Should be distilled meaningfully
  • 11. | 11 SRE, an instantiation of DevOps ● At its core: Align incentives across the org ○ Error budgets allow devs, ops, PMs, etc. to optimize for shared benefits ● Measure it! ○ SLI: Service Level Indicator: What you measure ○ SLO: Service Level Objective: What you need to hit ○ SLA: Service Level Agreement: When you need to pay
  • 12. | 12 Shared understanding ● Everyone uses the same tools & dashboards ○ Shared incentive to invest into tooling ○ Pooling of institutional system knowledge ○ Shared language & understanding of services
  • 13. | 13 Services ● Service? ○ Compartmentalized complexity, with an interface ○ Different owners/teams ○ Contracts define interfaces ● Why “contract”: Shared agreement which MUST NOT be broken ○ Internal and external customers rely on what you build and maintain ● Other common term: layer ○ The Internet would not exist without network layering ○ Enables innovation, parallelizes human engineering ● Other examples: CPUs, harddisk, compute nodes, your lunch
  • 14. | 14 Alerting ● Customers care about services being up, not about individual components ● Discern between different SLIs ○ Primary: service-relevant, for alerting ○ Secondary: informational, debugging, might be underlying’s primary Anything currently or imminently impacting customer service must be alerted upon But nothing(!) else
  • 16. | 16 Prometheus 101 ● Inspired by Google's Borgmon ● Time series database ● unit64 millisecond timestamp, float64 value ● Instrumentation & exporters ● Not for event logging ● Dashboarding via Grafana
  • 17. | 17 Main selling points ● Highly dynamic, built-in service discovery ● No hierarchical model, n-dimensional label set ● PromQL: for processing, graphing, alerting, and export ● Simple operation ● Highly efficient
  • 18. | 18 Main selling points ● Prometheus is a pull-based system ● Black-box monitoring: Looking at a service from the outside (Does the server answer to HTTP requests?) ● White-box monitoring: Instrumenting code from the inside (How much time does this subroutine take?) ● Every service should have its own metrics endpoint ● Hard API commitments within major versions
  • 19. | 19 Time series ● Time series are recorded values which change over time ● Individual events are usually merged into counters and/or histograms ● Changing values are recorded as gauges ● Typical examples ○ Requests to a webserver (counter) ○ Temperatures in a datacenter (gauge) ○ Service latency (histograms)
  • 20. | 20 Super easy to emit, parse & read http_requests_total{env="prod",method="post",code="200"} 1027 http_requests_total{env="prod",method="post",code="400"} 3 http_requests_total{env="prod",method="post",code="500"} 12 http_requests_total{env="prod",method="get",code="200"} 20 http_requests_total{env="test",method="post",code="200"} 372 http_requests_total{env="test",method="post",code="400"} 75
  • 21. | 21 Scale ● Kubernetes is Borg ● Prometheus is Borgmon ● Google couldn't have run Borg without Borgmon (plus Omega and Monarch) ● Kubernetes & Prometheus are designed and written with each other in mind
  • 22. | 22 Prometheus scale ● 1,000,000+ samples/second no problem on current hardware ● ~200,000 samples/second/core ● 16 bytes/sample compressed to 1.36 bytes/sample The highest we saw in production on a single Prometheus instance were 15 million active times series at once!
  • 23. | 23 Long term storage ● Two long term storage solutions have Prometheus-team members working on them ○ Thanos ■ Historically easier to run, but slower ■ Scales storage horizontally ○ Cortex ■ Easy to run these days ■ Scales both storage, ingester, and querier horizontally
  • 24. | 24 Cortex @ Grafana (largest cluster, 2021-09) ● ~65 million active series (just the one cluster) ● 668 CPU cores ● 3,349 GiB RAM One customer running at 3 billion active series
  • 26. | 26 Loki 101 ● Following the same label-based system like Prometheus ● No full text index needed, incredible speed ● Work with logs at scale, without the massive cost ● Access logs with the same label sets as metrics ● Turn logs into metrics, to make it easier to work with them ● Make direct use of syslog data, via promtail
  • 27. 2019-12-11T10:01:02.123456789Z {env="prod",instance=”1.1.1.1”} GET /about Timestamp with nanosecond precision Content log line Prometheus-style Labels key-value pairs indexed unindexed
  • 28. | 28 Loki @ Grafana Labs ● Queries regularly see 40 GiB/s ● Query terabytes of data in under a minute ○ Including complex processing of result sets
  • 30. | 30 Tempo ● Exemplars: Jump from relevant logs & metrics ○ Native to Prometheus, Cortex, Thanos, and Loki ○ Exemplars work at Google scale, with the ease of Grafana ● Index and search by labelsets available for those who need it ● Object store only: No Cassandra, Elastic, etc. ● 100% compatible with OpenTelemetry Tracing, Zipkin, Jaeger ● 100% of your traces, no sampling
  • 31. | 31 Tempo @ Grafana Labs (2021-07) ● 2,200,000 samples per second @ 350 MiB/s ● 14-day retention @ 3 copies stored ● ~240 CPU cores (includes compression cost) ● ~450 GiB RAM ● 132 TiB object storage ● Latencies: ○ p99 - 2.5s ○ p90 - 2.3s ○ p50 - 1.6s
  • 33. From logs to traces | 33
  • 34. From metrics to traces | 34
  • 35. From metrics to traces | 35
  • 36. ...and from traces to logs | 36
  • 37. All of this is Open Source and you can run it yourself | 37
  • 40. grafana.com/go/obscon2021 | 40 For a deeper dive into Open Source Observability, go to: