18. Metrics & Logs
• Metrics: SignalFx metrics SaaS (could have been Datadog, Wavefront, Prometheus, etc.)
• We still get ~1-2 useful metrics-based alerts per week
• Logs: just regular log files written to disk on EC2 (no log aggregation SaaS)
• Logs are a final debugging step, not a starting one
19. “SSH’ing into any server is an antipattern, a failure”
- Charity Majors
20. Metrics & Logs
• How: the lowest-effort way you won’t regret later
• Why: don’t fly blind
• What you get: host-based & aggregated alerts; logs of wtf happened
• When you’re ready for the next step: high-cardinality questions your
metrics & logs can’t quickly answer
23. 1a. Structured Logs
• Key-value pairs (JSON)
• Self-describing
• Human-readable, just like traditional logs
https://www.honeycomb.io/blog/you-could-have-invented-structured-logging/
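At its simplest, a structured logger is just one self-describing JSON object per line. A minimal hand-rolled sketch in Go (not any particular logging library's API; the field names are illustrative):

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// logEvent writes one self-describing JSON object per line: the
// structured-log equivalent of a printf-style log message.
func logEvent(fields map[string]any) error {
	fields["timestamp"] = time.Now().UTC().Format(time.RFC3339)
	enc := json.NewEncoder(os.Stdout) // Encode appends a newline per object
	return enc.Encode(fields)
}

func main() {
	logEvent(map[string]any{
		"msg":         "signup failed",
		"user.email":  "alice@example.com", // hypothetical field names
		"http.status": 422,
	})
}
```

Because every line is valid JSON, the same output stays human-readable with `grep` yet machine-parseable for aggregation later.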
25. 1a. Structured Logs
• Key-value pairs (JSON)
• Self-describing
• Human-readable, just like traditional logs
• A good transitional step to using Events. Send to your Events backend.
29. 2. Events
• Can look like structured logs; might even be structured logs!
• Key-value pairs (JSON)
• Represent a complete unit of work for your system
https://www.honeycomb.io/blog/how-are-structured-logs-different-from-events/
30. What is an event?
1 event == 1 unit of work ~= 1 request
32. High cardinality
Fields that may have many unique values
Common examples:
• email address
• username / user id / team id
• server hostname
• IP address
• user agent string
• build id
• request url
• feature flags / flag combinations
33. 	if ok {
		ev.AddField("user.id", userID)
	}
	return ev
}

func signupHandlerGet(w http.ResponseWriter, r *http.Request) {
	var err error
	ev := eventFromRequest(r)
	defer addFinalErr(&err, ev)
	// filepath.Join returns a string; parse it into a template before executing
	tmpl := template.Must(template.ParseFiles(filepath.Join(templatesDir, "signup.html")))
	tmplData := struct {
		ErrorMessage string
	}{}
	if err = tmpl.Execute(w, tmplData); err != nil {
		log.Print(err)
	}
}
See https://github.com/honeycombio/examples for more
See https://github.com/honeycombio/examples for more
46. Tool Choice
• Best: dedicated Event tool — Honeycomb, New Relic Insights, etc.
• Good: tracing tool (just send root spans) — Honeycomb, Jaeger, Lightstep…
• OK: log aggregation & analytics tool (if it’s blazing fast)
• ¯\_(ツ)_/¯: metrics tool (if it has high-cardinality support for many tags)
47. Events
• How: bit by bit, starting from the edge
• Why: understand faster
• What you get: context, high-cardinality support
• When you’re ready for the next step: you’re spending lots of time on
backend perf issues, you have complex multi-service bugs, or your team no
longer has a complete mental model of how your services talk to each other
50. Events vs. Traces
• Traces require fields like parent_id, span_id, trace_id, duration
• Traces are composed of many Events — usually called Spans when they
are part of a larger Trace
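Those extra ID fields are what turn a flat bag of events into a tree. A sketch of the shape (field names follow the slide, not any specific tracing library's wire format):

```go
package main

import "fmt"

// Span is one unit of work; trace_id/span_id/parent_id are what let a
// backend reassemble many spans into one tree-shaped trace.
type Span struct {
	TraceID    string  `json:"trace_id"`  // shared by every span in the trace
	SpanID     string  `json:"span_id"`   // unique to this span
	ParentID   string  `json:"parent_id"` // empty for the root span
	Name       string  `json:"name"`
	DurationMs float64 `json:"duration_ms"`
}

// IsRoot reports whether this span is the entry point of the trace.
func (s Span) IsRoot() bool { return s.ParentID == "" }

func main() {
	root := Span{TraceID: "t1", SpanID: "s1", Name: "GET /signup", DurationMs: 12.5}
	child := Span{TraceID: "t1", SpanID: "s2", ParentID: "s1", Name: "db.query", DurationMs: 3.1}
	fmt.Println(root.IsRoot(), child.IsRoot()) // true false
}
```

Note the root span alone is already a complete Event — which is why "just send root spans" to a tracing tool works as an Events backend.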
55. Switching our web app to traces
• 1 week hackathon for ~2 people to get the big plumbing in
• 3 months of 2-3 engineers adding context in our spare time (yikes)
• Ongoing instrumentation improvements as we need them, an hour or two
per month
62. Tracing
• How: start at your hot spots
• Why: dive deeper
• What you get: faster perf & distributed systems debugging
• When you’re ready for the next step: lol no, you’ll be working on tracing
instrumentation forever
64. Sampling
• Tracing sends much more data — don’t surprise your CFO with a double-digit percent increase in your AWS bill like I did!
• Select & send representative events, not all events.
• Use the dynsampler library or write your own logic
• Be sure to set the `sample_rate` field on your events
Canopy paper: https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
Ben’s Sampling talk: https://www.usenix.org/conference/lisa17/conference-program/presentation/hartshorne
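The simplest version of "select representative events" is a fixed-rate head sample, keyed on trace ID so every span in a trace makes the same keep/drop decision. A hand-rolled sketch (the dynsampler library instead adjusts rates per key dynamically; this fixed-rate version is just for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample keeps roughly 1-in-rate events. Hashing the trace ID
// (rather than rolling a random number) makes the decision
// deterministic, so all spans of one trace are kept or dropped
// together. rate must be >= 1.
func shouldSample(traceID string, rate uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%rate == 0
}

func main() {
	rate := uint32(10)
	if shouldSample("trace-abc123", rate) {
		// Record the rate on the event so the backend can weight it
		// back up: one kept event stands in for `rate` real ones.
		event := map[string]any{"trace_id": "trace-abc123", "sample_rate": rate}
		fmt.Println("send", event)
	}
}
```

Forgetting to attach `sample_rate` is the classic bug here: your counts silently shrink by 10x and every graph lies to you.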
65. Schema churn
The best schema:
• Is consistent between services
• Plans ahead for field names (follow OpenCensus field conventions?)
• Uses namespacing!
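In practice, namespacing just means consistent dotted prefixes on field names, so `duration` fields from different layers never collide and fields are greppable by prefix. A tiny illustration (the prefixes here loosely echo OpenCensus-style conventions, not a required standard):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// namespacedEvent builds an example event whose keys share consistent
// prefixes: http.* for request metadata, db.* for dependency info,
// and app.* for your own business fields.
func namespacedEvent() map[string]any {
	return map[string]any{
		"http.method":      "GET",
		"http.status_code": 200,
		"db.query_count":   3,
		"app.team_id":      "hypothetical-team-42", // illustrative value
	}
}

func main() {
	b, _ := json.Marshal(namespacedEvent())
	fmt.Println(string(b))
}
```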
67. Start in !prod
Use events or traces to debug something low-risk and high-value, like:
• build pipeline — find bottlenecks
• test suite — quantify your test flakiness