18. Metrics & Logs
• Metrics: SignalFx metrics SaaS (could have been Datadog, Wavefront, Prometheus, etc.)
• We still get ~1-2 useful metrics-based alerts per week
• Logs: just regular log files written to disk on EC2 (no log aggregation SaaS)
• Logs are a final debugging step, not a starting one
19. “SSH’ing into any server is an antipattern, a failure”
- Charity Majors
20. Metrics & Logs
• How: the lowest-effort way you won’t regret later
• Why: don’t fly blind
• What you get: host-based & aggregated alerts; logs of wtf happened
• When you’re ready for the next step: high-cardinality questions your
metrics & logs can’t quickly answer
23. 1a. Structured Logs
• Key-value pairs (JSON)
• Self-describing
• Human-readable, just like traditional logs
https://www.honeycomb.io/blog/you-could-have-invented-structured-logging/
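At its simplest, a structured logger is just one self-describing JSON object per line. A minimal hand-rolled sketch in Go (not any particular logging library's API; the field names are illustrative):

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// logEvent writes one self-describing JSON object per line: the
// structured-log equivalent of a printf-style log message.
func logEvent(fields map[string]any) error {
	fields["timestamp"] = time.Now().UTC().Format(time.RFC3339)
	enc := json.NewEncoder(os.Stdout) // Encode appends a newline per object
	return enc.Encode(fields)
}

func main() {
	logEvent(map[string]any{
		"msg":         "signup failed",
		"user.email":  "alice@example.com", // hypothetical field names
		"http.status": 422,
	})
}
```

Because every line is valid JSON, the same output stays human-readable with `grep` yet machine-parseable for aggregation later.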
25. 1a. Structured Logs
• Key-value pairs (JSON)
• Self-describing
• Human-readable, just like traditional logs
• A good transitional step to using Events. Send to your Events backend.
29. 2. Events
• Can look like structured logs; might even be structured logs!
• Key-value pairs (JSON)
• Represent a complete unit of work for your system
https://www.honeycomb.io/blog/how-are-structured-logs-different-from-events/
30. What is an event?
1 event == 1 unit of work ~= 1 request
32. High cardinality
Fields that may have many unique values
Common examples:
• email address
• username / user id / team id
• server hostname
• IP address
• user agent string
• build id
• request url
• feature flags / flag combinations
33. 	if ok {
		ev.AddField("user.id", userID)
	}
	return ev
}

func signupHandlerGet(w http.ResponseWriter, r *http.Request) {
	var err error
	ev := eventFromRequest(r)
	defer addFinalErr(&err, ev)
	// filepath.Join returns a string; parse it into a template before executing
	tmpl := template.Must(template.ParseFiles(filepath.Join(templatesDir, "signup.html")))
	tmplData := struct {
		ErrorMessage string
	}{}
	if err = tmpl.Execute(w, tmplData); err != nil {
		log.Print(err)
	}
}
See https://github.com/honeycombio/examples for more
See https://github.com/honeycombio/examples for more
46. Tool Choice
• Best: dedicated Event tool — Honeycomb, New Relic Insights, etc.
• Good: tracing tool (just send root spans) — Honeycomb, Jaeger, Lightstep…
• OK: log aggregation & analytics tool (if it’s blazing fast)
• ¯\_(ツ)_/¯: metrics tool (if it has high-cardinality support for many tags)
47. Events
• How: bit by bit, starting from the edge
• Why: understand faster
• What you get: context, high-cardinality support
• When you’re ready for the next step: you’re spending lots of time on
backend perf issues, you have complex multi-service bugs, or your team no
longer has a complete mental model of how your services talk to each other
50. Events vs. Traces
• Traces require fields like parent_id, span_id, trace_id, duration
• Traces are composed of many Events — usually called Spans when they
are part of a larger Trace
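Those extra ID fields are what turn a flat bag of events into a tree. A sketch of the shape (field names follow the slide, not any specific tracing library's wire format):

```go
package main

import "fmt"

// Span is one unit of work; trace_id/span_id/parent_id are what let a
// backend reassemble many spans into one tree-shaped trace.
type Span struct {
	TraceID    string  `json:"trace_id"`  // shared by every span in the trace
	SpanID     string  `json:"span_id"`   // unique to this span
	ParentID   string  `json:"parent_id"` // empty for the root span
	Name       string  `json:"name"`
	DurationMs float64 `json:"duration_ms"`
}

// IsRoot reports whether this span is the entry point of the trace.
func (s Span) IsRoot() bool { return s.ParentID == "" }

func main() {
	root := Span{TraceID: "t1", SpanID: "s1", Name: "GET /signup", DurationMs: 12.5}
	child := Span{TraceID: "t1", SpanID: "s2", ParentID: "s1", Name: "db.query", DurationMs: 3.1}
	fmt.Println(root.IsRoot(), child.IsRoot()) // true false
}
```

Note the root span alone is already a complete Event — which is why "just send root spans" to a tracing tool works as an Events backend.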
55. Switching our web app to traces
• 1 week hackathon for ~2 people to get the big plumbing in
• 3 months of 2-3 engineers adding context in our spare time (yikes)
• Ongoing instrumentation improvements as we need them, an hour or two
per month
62. Tracing
• How: start at your hot spots
• Why: dive deeper
• What you get: faster perf & distributed systems debugging
• When you’re ready for the next step: lol no, you’ll be working on tracing
instrumentation forever
64. Sampling
• Tracing sends much more data — don’t surprise your CFO with a double-digit percent increase in your AWS bill like I did!
• Select & send representative events, not all events.
• Use the dynsampler library or write your own logic
• Be sure to set the `sample_rate` field on your events
Canopy paper: https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
Ben’s Sampling talk: https://www.usenix.org/conference/lisa17/conference-program/presentation/hartshorne
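The simplest version of "select representative events" is a fixed-rate head sample, keyed on trace ID so every span in a trace makes the same keep/drop decision. A hand-rolled sketch (the dynsampler library instead adjusts rates per key dynamically; this fixed-rate version is just for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample keeps roughly 1-in-rate events. Hashing the trace ID
// (rather than rolling a random number) makes the decision
// deterministic, so all spans of one trace are kept or dropped
// together. rate must be >= 1.
func shouldSample(traceID string, rate uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%rate == 0
}

func main() {
	rate := uint32(10)
	if shouldSample("trace-abc123", rate) {
		// Record the rate on the event so the backend can weight it
		// back up: one kept event stands in for `rate` real ones.
		event := map[string]any{"trace_id": "trace-abc123", "sample_rate": rate}
		fmt.Println("send", event)
	}
}
```

Forgetting to attach `sample_rate` is the classic bug here: your counts silently shrink by 10x and every graph lies to you.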
65. Schema churn
The best schema:
• Is consistent between services
• Plans ahead for field names (follow OpenCensus field conventions?)
• Uses namespacing!
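In practice, namespacing just means consistent dotted prefixes on field names, so `duration` fields from different layers never collide and fields are greppable by prefix. A tiny illustration (the prefixes here loosely echo OpenCensus-style conventions, not a required standard):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// namespacedEvent builds an example event whose keys share consistent
// prefixes: http.* for request metadata, db.* for dependency info,
// and app.* for your own business fields.
func namespacedEvent() map[string]any {
	return map[string]any{
		"http.method":      "GET",
		"http.status_code": 200,
		"db.query_count":   3,
		"app.team_id":      "hypothetical-team-42", // illustrative value
	}
}

func main() {
	b, _ := json.Marshal(namespacedEvent())
	fmt.Println(string(b))
}
```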
67. Start in !prod
Use events or traces to debug something low-risk and high-value, like:
• build pipeline — find bottlenecks
• test suite — quantify your test flakiness