SlideShare une entreprise Scribd logo
1  sur  37
@PierreVincent
How to build observable
distributed systems
March 8th, 2018 – QCon London
@PierreVincent pvincent.io
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
observable-distributed-ststems
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
@PierreVincent
@PierreVincent
Pierre Vincent
SRE Manager at Poppulo
@PierreVincent
pvincent.io
@PierreVincent
Reaching production is
only the beginning
@PierreVincent
No system is immune to failure
Be ready to recover
@PierreVincent
When distributing a system,
we’re also distributing the places
where things might go wrong
@PierreVincent
Monitoring only applies to
known failure modes.
What about everything else?
@PierreVincent
Monitoring tells you whether the
system works. Observability lets
you ask why it's not working.
“
”– Baron Schwarz
@PierreVincent
Healthchecks Metrics
Logging Tracing
@PierreVincent
Healthchecks
@PierreVincent
Is it running?
Can it perform
its task?
Can it accept
more work?
Healthchecks
@PierreVincent
Broadcast Register Expose
Healthchecks
@PierreVincent
Healthchecks
/health
{
"service": "registration-service",
"healthy": true,
"workload": { "healthy": true },
"dependencies": [
{ "name": "cassandra", "healthy": true },
{ "name": "billing-svc", "healthy": true
},
]
}
GET http://1.2.3.4:8080/health
200 OK
@PierreVincent
Source: HTTP Healthchecks for a Resilient Platform - Chris O’Dell
skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform
Overzealous Healthchecks can be
counter-productive
@PierreVincent
Metrics
@PierreVincent
System
metrics
Application
metrics
Business
metrics
CPU usage Error rates Customer conversions
Metrics
@PierreVincent
Servers / VMs
Appliances/Infra
Services
Metrics
collector
Metrics
query
engine
Dashboards
Alerts
Metrics
@PierreVincent
Servers / VMs
Appliances/Infra
Services
/metrics
/metrics
/metrics
Prometheus
Metrics
@PierreVincent
Watch out for over-reliance on metrics
Limitations at high-cardinality
Not every metric
deserves an alert
Real-time querying means
some trade-offs on retention
Limit alerting to
user-impacting symptoms
Poor fine-grained debugging
e.g. CustomerId
Not suitable for long-term
trend analysis
@PierreVincent
Logging
@PierreVincent
Searchable Correlated
Logging
Making sense of (a lot of) logs
Centralised
@PierreVincent
A
F
H
D
J
B
E
C
G
a1b2c3
a1b2c3
a1b2c3
ERROR [svc=H][trace=a1b2c3] Failed to save order
Cause: Cassandra timeout exception
ERROR [svc=F][trace=a1b2c3] Failed to complete order
Cause: Shipping service responded with 500
ERROR [svc=A][trace=a1b2c3] Failed to process order
Cause: Order process manager responded with 500
a1b2c3
INFO [svc=G][trace=a1b2c3] Items verified in stock
Log Correlation
@PierreVincent
JSON
2018-02-20T16:38:23+00:00 ERROR Read timed out
timestamp 2018-02-20T16:38:23+00:00
requestID ec667cb45
level ERROR
team eventsservice registration-service
commit 542a8b8e build 542a8b8e.7
node node_e79f3e52
log Read timed out
region europe-west2
runtime java-1.8.0_161
When did it happen?
Can I trace it?
What is it?
What is it running?
Where is it?
What is the message?
customerID 55123Who caused it? userID 458
... ...Any other info?
Hmmm thanks… ?
@PierreVincent
Structured logs unleash high-cardinality exploration
Error rate spike isolated
by build version
Activity spike isolated
for single customer
@PierreVincent
Tracing
@PierreVincent
Tracing
get_confirmed_attendees
get_attendees
get_confirm_status
cassandra/select
mysql/select
event-mgt-api
attendees-service
registration-service
Trace
Spans
0ms 50ms 100ms 150ms 172ms
172ms
73ms
54ms
78ms
41ms
@PierreVincent
@PierreVincent
Unknown
Unknowns
Known
Unknowns
Healthchecks
Metrics
(Alerting)
Logs
Metrics
(Queries)
Tracing
Events
Monitoring & Resiliency Debugging & Exploration
@PierreVincent
Cheap to
instrument
Easy to
explore
Reliable &
Trustworthy
Usability of tooling is key to adoption
@PierreVincent
Visibility builds trust
but requires safety
@PierreVincent
Visibility helps justify decisions
@PierreVincent
Visibility enables operability
@PierreVincent
Test (a little bit*) lessTest (a little bit*) lessDon’t spend all your time testing
...keep some for instrumenting
Thank you!
@PierreVincent
pvincent.io
@PierreVincent
Thank you!
Pierre Vincent
SRE Manager at Poppulo
@PierreVincent
pvincent.io
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
observable-distributed-ststems

Contenu connexe

Plus de C4Media

Plus de C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

How to Build Observable Distributed Systems