Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2HaAz9v.
Pierre Vincent covers key techniques to build a clearer picture of distributed applications in production, including details on useful health checks, best practices for instrumentation with metrics, logging and tracing. Filmed at qconlondon.com.
Pierre Vincent is SRE manager at Poppulo, where he helps teams embrace DevOps practices, focusing on building maintainable applications and continuously improving their processes. He strongly believes that DevOps is a key turning point for our industry, finally bringing together Continuous Delivery practices and Lean philosophy, all supported by a safe culture of learning and innovation.
1. How to build observable distributed systems
March 8th, 2018 – QCon London
@PierreVincent pvincent.io
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/observable-distributed-ststems
3. Presented at QCon London
www.qconlondon.com
16. Overzealous Healthchecks can be counter-productive
Source: HTTP Healthchecks for a Resilient Platform - Chris O’Dell
skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform
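One way to read this slide: a health check that fails whenever any downstream dependency is unavailable can pull otherwise healthy instances out of rotation. The sketch below is an illustration under that assumption, not the approach from the talk or the cited article. The names `Check` and `evaluate_health`, the `critical` flag and the check names are all hypothetical. Only checks marked critical decide the HTTP status; the others are reported for information.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # returns True when the check passes
    critical: bool             # True: a failure marks this instance unhealthy


def evaluate_health(checks: List[Check]) -> Tuple[int, Dict[str, str]]:
    """Return an HTTP status code and a per-check report."""
    report: Dict[str, str] = {}
    healthy = True
    for check in checks:
        try:
            ok = check.probe()
        except Exception:
            # A probe that raises is treated as a failed check.
            ok = False
        report[check.name] = "ok" if ok else "failing"
        if not ok and check.critical:
            healthy = False
    return (200 if healthy else 503), report


def _shipping_probe() -> bool:
    # Hypothetical downstream dependency that is currently timing out.
    raise TimeoutError("shipping service timed out")


checks = [
    Check("thread_pool", lambda: True, critical=True),
    Check("shipping_service", _shipping_probe, critical=False),
]
status, report = evaluate_health(checks)
print(status, report)
```

Here the instance still answers 200 while the report shows the failing dependency, so the failure stays visible without removing the instance from the load balancer.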
21. Watch out for over-reliance on metrics
- Limitations at high cardinality: poor fine-grained debugging (e.g. CustomerId)
- Not every metric deserves an alert: limit alerting to user-impacting symptoms
- Real-time querying means some trade-offs on retention: not suitable for long-term trend analysis
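The high-cardinality limitation can be made concrete with some arithmetic. In a label-based metrics system, each distinct combination of label values is typically stored as its own time series, so the worst-case series count is the product of the label cardinalities. The sketch below uses made-up label names and counts, and `series_count` is a hypothetical helper, not part of any metrics library.

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series: one per combination of label values."""
    return prod(label_cardinalities.values())


# Illustrative numbers only (assumptions, not from the talk).
base_labels = {"service": 10, "endpoint": 20, "status_code": 5}
with_customer = {**base_labels, "customer_id": 50_000}

print(series_count(base_labels))    # 1000
print(series_count(with_customer))  # 50000000
```

Adding a single per-customer label multiplies the series count by the number of customers, which is why a field like CustomerId tends to fit logs or traces better than metric labels.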
24. Log Correlation
[Diagram: services A, B, C, D, E, F, G, H and J; a request tagged with trace ID a1b2c3 passes through several of them]
ERROR [svc=H][trace=a1b2c3] Failed to save order
  Cause: Cassandra timeout exception
ERROR [svc=F][trace=a1b2c3] Failed to complete order
  Cause: Shipping service responded with 500
ERROR [svc=A][trace=a1b2c3] Failed to process order
  Cause: Order process manager responded with 500
INFO [svc=G][trace=a1b2c3] Items verified in stock
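A minimal sketch of how a service might stamp every log line with a shared trace ID, using only the Python standard library. The header name `X-Trace-Id`, the helpers `build_logger` and `handle_request`, and the use of `contextvars` are assumptions for illustration; the talk does not prescribe an implementation.

```python
import contextvars
import io
import logging
import uuid
from typing import Dict

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the service name and current trace ID to each log record."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def filter(self, record: logging.LogRecord) -> bool:
        record.svc = self.service
        record.trace = trace_id_var.get()
        return True


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(f"svc.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [svc=%(svc)s][trace=%(trace)s] %(message)s")
    )
    handler.addFilter(TraceIdFilter(service))
    logger.addHandler(handler)
    return logger


def handle_request(headers: Dict[str, str], logger: logging.Logger) -> Dict[str, str]:
    # Reuse the caller's trace ID, or start a new trace at the edge.
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex[:6]
    token = trace_id_var.set(trace_id)
    try:
        logger.error("Failed to save order")
        # Headers to send on any downstream call, so the ID keeps flowing.
        return {"X-Trace-Id": trace_id}
    finally:
        trace_id_var.reset(token)


output = io.StringIO()
logger_h = build_logger("H", output)
outgoing = handle_request({"X-Trace-Id": "a1b2c3"}, logger_h)
print(output.getvalue(), end="")
```

Because every service logs the same ID, searching the aggregated logs for `trace=a1b2c3` returns the full chain of errors across services.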
25. JSON
Plain-text log line:
  2018-02-20T16:38:23+00:00 ERROR Read timed out
  Hmmm thanks… ?
Structured log entry, grouped by the question each field answers:
When did it happen?
  timestamp 2018-02-20T16:38:23+00:00
Can I trace it?
  requestID ec667cb45
What is it?
  team events
  service registration-service
What is it running?
  commit 542a8b8e
  build 542a8b8e.7
  runtime java-1.8.0_161
Where is it?
  node node_e79f3e52
  region europe-west2
What is the message?
  level ERROR
  log Read timed out
Who caused it?
  customerID 55123
  userID 458
Any other info?
  ...
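A minimal sketch of emitting such an entry as JSON with Python's standard `logging` module. The `JsonFormatter` class and the convention of passing per-request fields through `extra={"context": ...}` are assumptions for illustration; the field names and values are borrowed from the slide.

```python
import io
import json
import logging
from datetime import datetime, timezone
from typing import Dict


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, static_fields: Dict[str, str]):
        super().__init__()
        # Fields that are the same for every line from this process.
        self.static_fields = static_fields

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="seconds"),
            "level": record.levelname,
            "log": record.getMessage(),
        }
        entry.update(self.static_fields)
        # Per-request fields supplied by the caller through `extra`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter({
    "team": "events",
    "service": "registration-service",
    "commit": "542a8b8e",
    "build": "542a8b8e.7",
    "node": "node_e79f3e52",
    "region": "europe-west2",
    "runtime": "java-1.8.0_161",
}))
logger = logging.getLogger("registration-service")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.error(
    "Read timed out",
    extra={"context": {"requestID": "ec667cb45", "customerID": 55123, "userID": 458}},
)
entry = json.loads(stream.getvalue())
print(entry["log"], entry["service"], entry["customerID"])
```

Static fields such as build and region are set once at startup, so individual log calls only need to supply the request-specific context.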
26. Structured logs unleash high-cardinality exploration
- Error rate spike isolated by build version
- Activity spike isolated for a single customer
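The kind of exploration described here amounts to grouping structured entries by an arbitrary field. The sketch below does this in memory over a handful of made-up entries; in practice a log store would run the equivalent query. The helper `count_by` and all sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def count_by(entries: Iterable[Dict], field: str, level: Optional[str] = None) -> Counter:
    """Count entries per value of `field`, optionally keeping one log level."""
    counts: Counter = Counter()
    for entry in entries:
        if level is not None and entry.get("level") != level:
            continue
        counts[entry.get(field, "unknown")] += 1
    return counts


# Made-up sample entries for illustration.
entries = [
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "INFO", "build": "542a8b8e.7", "customerID": 77},
    {"level": "ERROR", "build": "3f9c1d2a.6", "customerID": 55123},
    {"level": "INFO", "build": "3f9c1d2a.6", "customerID": 55123},
]

errors_by_build = count_by(entries, "build", level="ERROR")
activity_by_customer = count_by(entries, "customerID")
print(errors_by_build.most_common(1))       # [('542a8b8e.7', 2)]
print(activity_by_customer.most_common(1))  # [(55123, 4)]
```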
35. Test (a little bit*) less
Don’t spend all your time testing
...keep some for instrumenting
Thank you!
@PierreVincent
pvincent.io