This document provides an overview of OpenTelemetry for operators. It discusses some of the limitations of current observability platforms and how OpenTelemetry addresses these issues. It introduces the OpenTelemetry project which combines distributed tracing, metrics, and logging APIs. It describes the OpenTelemetry Collector which receives, processes, and exports telemetry data. It provides examples of Collector configuration and running it in production. It also discusses some innovations in the observability space from vendors like Dynatrace, New Relic, Splunk SignalFX, and others.
2. Our Agenda
● Why are current observability platforms falling short?
● What OpenTelemetry features address these issues?
● How do I run OpenTelemetry components in production?
● Who are the innovators in the observability space?
3. Level Setting
● Have you used the ELK stack or another log aggregator?
● Have you used an APM system?
● Have you used distributed tracing before?
4. Who am I?
● Kevin Brockhoff - Senior Consultant, Daugherty Business Solutions
○ Solving difficult cloud adoption challenges for Daugherty's Fortune 500 clients
○ OpenTelemetry committer since the early stages of the project
○ GitHub: https://github.com/kbrockhoff
○ LinkedIn: https://www.linkedin.com/in/kevin-brockhoff-a557877/
6. Enterprise Applications
● Instrumented only with logging during initial development.
○ Logging is oriented toward development, not operations
● Metrics and tracing are added later, if at all, as a separate project.
○ Each team creates its own system using familiar tools
○ Or the enterprise commits to a specific APM vendor
● Logs, metrics and traces are never connected.
7. First Generation Observability Platforms
● Search logs in ELK, lack context
● Homegrown tracing per app, mainly accessible by developers
● Customer experience metrics
● Low-level metrics and alerts
9. OpenCensus + OpenTracing = OpenTelemetry
● OpenTracing:
○ Provides APIs and instrumentation for distributed tracing
● OpenCensus:
○ Provides APIs and instrumentation that allow you to collect application metrics and distributed traces.
● OpenTelemetry:
○ An effort to combine distributed tracing, metrics and logging into a single set of system components and language-specific libraries.
10. OpenTelemetry Project
● Specification
○ API (for application developers)
○ SDK Implementations
○ Transport Protocol (Protobuf)
● Collector (middleware)
● SDKs (various stages of maturity)
○ C++
○ C# (Auto-instrument/Manual)
○ Erlang
○ Go
○ JavaScript (Browser/Node)
○ Java (Auto-instrument/Manual)
■ Android compatibility
○ PHP
○ Python (Auto-instrument/Manual)
○ Ruby
○ Rust
○ Swift
13. OpenTelemetry Collector
● Offers a vendor-agnostic implementation of how to receive, process and export telemetry data.
● Removes the need to run, operate and maintain multiple agents/collectors.
● Supports open-source telemetry data formats (e.g. OTLP, Jaeger, Prometheus, etc.) and sends to multiple open-source or commercial back ends.
14. Collector Concepts
● Telemetry data processing pipelines
○ Per pipeline: Receiver(s) -> Processors -> Exporter(s)
○ Currently, each pipeline supports only a single telemetry type
● Extensions
○ Supporting functionality
○ Core collector extensions
■ health_check - HTTP endpoint for a load balancer or k8s controller to probe
■ zpages - Internal processing metrics and traces accessible via HTTP
■ pprof - Performance profiler; enables the Go net/http/pprof endpoint
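The pipeline and extension concepts above can be sketched as a small Python model. This is only an illustration under stated assumptions: the dictionary mirrors the general shape of a Collector configuration file (top-level component sections plus a service section that wires them into pipelines), component settings are omitted, and validate_pipelines is a hypothetical helper, not part of the Collector.

```python
# Hypothetical in-memory mirror of a Collector configuration.
# Component settings are left empty on purpose; only the wiring is shown.
config = {
    "receivers": {"otlp": {}},
    "processors": {"memory_limiter": {}, "batch": {}},
    "exporters": {"jaeger": {}, "logging": {}},
    "extensions": {"health_check": {}, "zpages": {}, "pprof": {}},
    "service": {
        "extensions": ["health_check", "zpages", "pprof"],
        "pipelines": {
            # One pipeline per telemetry type: Receiver(s) -> Processors -> Exporter(s)
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["jaeger", "logging"],
            }
        },
    },
}


def validate_pipelines(cfg, telemetry_types=("traces", "metrics")):
    """Return a list of problems found in the pipeline wiring (empty if none)."""
    errors = []
    for name, pipeline in cfg["service"]["pipelines"].items():
        # A pipeline carries exactly one telemetry type, taken from its name.
        if name.split("/")[0] not in telemetry_types:
            errors.append(f"pipeline {name}: unknown telemetry type")
        # Every pipeline needs at least one receiver and one exporter.
        for kind in ("receivers", "exporters"):
            if not pipeline.get(kind):
                errors.append(f"pipeline {name}: needs at least one of {kind}")
        # Every referenced component must be defined in its top-level section.
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in cfg.get(kind, {}):
                    errors.append(f"pipeline {name}: {kind[:-1]} {component!r} is not defined")
    return errors
```

Calling validate_pipelines(config) on the sample above returns an empty list; referencing an exporter that has no entry under "exporters" would add one error for that pipeline.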
17. Collector Bundled Processors
● Attributes
○ Modifies span attributes
● Batch
○ Groups data into batches
● Filter
○ Include/exclude metrics by name
● Group by Trace
○ Holds all spans for a trace for a set time and then sends them to the next processor
● Memory Limiter
○ Prevents out-of-memory issues by triggering GC
○ Configuration must match the ballast setting the collector is launched with
● Queued Retry
○ Deprecated; each exporter now implements its own queued retry
● Resource
○ Applies changes to Resource attributes
● Probabilistic Sampling
○ Makes TraceID hash-based sampling decisions, adjusted by the sampling.priority attribute value
● Tail Sampling
○ Sampling decisions based on configured attribute values and rate limits
● Span
○ Modifies span name or attributes based on span name
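The idea behind TraceID hash-based sampling with a sampling.priority override can be sketched as follows. This is a minimal illustration, not the Collector's implementation: the hash function (SHA-256), the 10,000-bucket resolution, and the should_sample name are assumptions made for the example, and the priority semantics (0 drops, any positive value keeps) follow the OpenTracing convention.

```python
import hashlib


def should_sample(trace_id: bytes, sampling_percentage: float, priority=None) -> bool:
    """Decide whether to keep a span, deterministically per trace ID."""
    # An explicit sampling.priority attribute overrides the hash-based decision.
    if priority is not None:
        return priority > 0
    # Hash the trace ID into one of 10,000 buckets so that every span of the
    # same trace gets the same decision on every collector instance.
    digest = hashlib.sha256(trace_id).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sampling_percentage * 100
```

Because the decision depends only on the trace ID, all spans of a trace are kept or dropped together without the processor holding any state.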
18. Recommended Processor Configuration
● Traces: memory_limiter -> any sampling processors -> batch -> any other processors
● Metrics: memory_limiter -> any filtering processors -> batch -> any other processors
The memory limiter's ballast_size_mib must match the --mem-ballast-size-mib command line parameter. Trigger GC with either limit_mib / spike_limit_mib or limit_percentage / spike_limit_percentage.
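The ordering and ballast recommendations can be checked mechanically before deploying a configuration. The sketch below is illustrative only: check_processor_order and ballast_from_args are hypothetical helpers, and the processor names in SAMPLING_OR_FILTERING are assumed identifiers for the sampling and filtering processors.

```python
# Assumed processor identifiers for sampling and filtering processors.
SAMPLING_OR_FILTERING = {"probabilistic_sampler", "tail_sampling", "filter"}


def check_processor_order(processors):
    """Return a list of deviations from the recommended processor order."""
    problems = []
    # memory_limiter should come first so it can push back before other work.
    if not processors or processors[0].split("/")[0] != "memory_limiter":
        problems.append("memory_limiter should be the first processor")
    # Sampling and filtering should run before batch so dropped data is never batched.
    if "batch" in processors:
        batch_at = processors.index("batch")
        for name in processors[batch_at + 1:]:
            if name.split("/")[0] in SAMPLING_OR_FILTERING:
                problems.append(f"{name} should run before batch")
    return problems


def ballast_from_args(argv):
    """Extract the --mem-ballast-size-mib value from a command line, if present."""
    prefix = "--mem-ballast-size-mib="
    for arg in argv:
        if arg.startswith(prefix):
            return int(arg[len(prefix):])
    return None
```

A deployment check could compare ballast_from_args(argv) against the ballast_size_mib value in the memory_limiter settings and refuse to start when they differ.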
19. Collector Contrib Processors
● Kubernetes
○ Adds metadata from pod
● Metrics Transform
○ Renames and aggregates within individual metrics
● Resource Detection
○ OTEL_RESOURCE environment variable
○ GCE metadata server
○ EC2 instance metadata server
● Routing
○ Route to a particular exporter based on incoming header value
TODO
● Span data sharding by TraceID
20. Collector Bundled Exporters
Traces
● File
○ JSON format
● Jaeger
○ v2 gRPC
● Kafka
○ OTLP, Jaeger, Zipkin
● Logging
○ Debugging
● OpenCensus
● OTLP (OpenTelemetry Protocol)
● Zipkin
○ v2 JSON or Protobuf
Metrics
● File
○ JSON format
● Logging
○ Debugging
● OpenCensus
● OTLP (OpenTelemetry Protocol)
● Prometheus
○ Metrics endpoint for Prometheus to pull from
● Prometheus Remote Write
○ Pushes metrics in Prometheus TimeSeries format (Cortex, etc.)
24. Collector Command Line Example
/usr/local/bin/otelcol \
  --config=/usr/local/etc/otel-collector-config.yaml \
  --mem-ballast-size-mib=192 \
  --log-level=DEBUG
25. Collector Docker Images
● otel/opentelemetry-collector
○ Core receivers, processors, and exporters bundled in
● otel/opentelemetry-collector-contrib
○ All core and contrib receivers, processors, and exporters bundled in
● OpenTelemetry Collector builder
○ https://github.com/observatorium/opentelemetry-collector-builder
26. Other Collector Installs
● RPM
○ Produced by opentelemetry-collector build
● Debian
○ Produced by opentelemetry-collector build
27. Observing the Collector
● health_check
○ http://<hostname>:13133/ returns basic pipeline availability
● zpages
○ RPC metric aggregations at http://<hostname>:55679/debug/rpcz
○ Trace summaries at http://<hostname>:55679/debug/tracez
● prometheus
○ Pipeline metrics scrape endpoint at http://<hostname>:8888/metrics
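A quick way to confirm these endpoints respond is a small probe script. The sketch below uses only the Python standard library and the ports and paths listed above; collector_urls and probe are hypothetical helper names, and treating only HTTP 200 as healthy is an assumption made for the example.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Ports and paths as listed for health_check, zpages and the metrics endpoint.
ENDPOINTS = {
    "health_check": (13133, "/"),
    "rpcz": (55679, "/debug/rpcz"),
    "tracez": (55679, "/debug/tracez"),
    "metrics": (8888, "/metrics"),
}


def collector_urls(hostname):
    """Build the observation URLs for a collector running on the given host."""
    return {
        name: f"http://{hostname}:{port}{path}"
        for name, (port, path) in ENDPOINTS.items()
    }


def probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure or a non-2xx status.
        return False
```

Looping over collector_urls("localhost").items() and printing probe(url) for each entry gives a simple smoke test after a deployment.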
28. Current Gotchas
● Errors are propagated back through pipelines and instances in the chain
○ Errors reported by SDK exporters in the applications may be coming from two hops downstream
● TraceID sharding is not working correctly
○ Tail-based sampling is only possible when running a single instance of the collector
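Tail-based sampling needs every span of a trace to reach the same collector instance, which is what TraceID sharding is meant to provide. The sketch below illustrates the idea only; it is not how the Collector does or will do it. The pick_collector name, the SHA-256 hash, and the simple modulo scheme are all assumptions for the example.

```python
import hashlib
from typing import Sequence


def pick_collector(trace_id_hex: str, collectors: Sequence[str]) -> str:
    """Map a hex-encoded trace ID to one collector instance, deterministically."""
    if not collectors:
        raise ValueError("at least one collector instance is required")
    # Hash the trace ID so spans of the same trace always map to the same
    # instance, regardless of which front-line collector received them.
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]
```

With plain modulo hashing, changing the number of instances remaps most traces, so spans in flight during a scale-up or scale-down can land on different instances; a consistent-hashing ring would reduce that churn.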
30. Latest Innovations
● Dynatrace automates manual quality validation processes using AI-assisted SLI/SLO-based quality gates.
● New Relic Incident Intelligence continuously analyzes alerts and incident data to find patterns in event sequences and offers suggested correlation decisions that merge incidents to reduce alert noise further.
● Splunk SignalFX provides high cardinality exploration of traces across different regions, hosts, versions or users.
● Lightstep provides rapid root cause analysis using unlimited cardinality and a high-fidelity dataset uncompromised by head or tail sampling.
31. Latest Innovations
● Datadog provides automated tagging and correlation of logs so you can jump from any log entry to related metrics.
● Honeycomb lets you break down on every dimension in your data, both the obvious fields and the surprising ones.
● Grafana Loki datasource provides switching from metrics to logs with preserved label filters.
● Elastic Observability brings your logs, metrics, and APM traces together at scale in a single stack.
Copyright 2020, The OpenTelemetry Authors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.