Microservices are a great way to design your system so that it can scale. But once those pieces are in production, how do you know whether they are all working properly? Are some metrics more important than others, and what story can each metric tell you? This talk shows you some tools and techniques for monitoring distributed systems.
3. An average production system
Database
• Is the web server up?
• Is the database up?
• Can the web server talk to the db?
4. What are you actually monitoring?
Business Capability
Application
Infrastructure
Are my servers running?
Is my application process running?
Can users place an order?
Monitoring Area
5. Monitoring Concerns
Capacity
Performance
Health
Is my application generating exceptions?
How quickly is my system processing messages?
Can I handle month end batch jobs?
Is the server up?
Is there high CPU?
Do I have enough disk space?
Application
Infrastructure
Can users access the checkout cart?
Are we meeting our SLAs?
What is the impact of adding another customer?
Business Capability
8. Recap: What are we monitoring?
Database
• Is the web server up?
• Is the database up?
• Can the web server talk to the db?
Infrastructure, Passive, Health
19. Queue Length
• Queue length is an indicator of work still outstanding
• High queue length doesn’t necessarily indicate a problem though
Stable or decreasing is good
Increasing is bad
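The rule of thumb above looks at the trend rather than the absolute queue length. A minimal sketch of that idea, assuming queue-length samples taken at regular intervals; the function name `queue_trend` and the `tolerance` parameter are hypothetical and not part of any monitoring tool:

```python
from typing import Sequence


def queue_trend(samples: Sequence[int], tolerance: float = 0.0) -> str:
    """Classify queue-length samples as increasing, decreasing or stable.

    Compares the average of the newer half of the window with the average
    of the older half. `tolerance` is the change treated as noise.
    """
    if len(samples) < 2:
        return "unknown"  # a single sample has no trend
    mid = len(samples) // 2
    older = sum(samples[:mid]) / mid
    newer = sum(samples[mid:]) / (len(samples) - mid)
    delta = newer - older
    if delta > tolerance:
        return "increasing"
    if delta < -tolerance:
        return "decreasing"
    return "stable"
```

Note that a constant queue of 500 messages is reported as stable here, which matches the point that a high queue length is not necessarily a problem.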
22. Processing Time
• Processing Time is the time taken to successfully process a message
• Processing Time does not include error handling time
• It is independent of queue wait time
Stable or decreasing could be good
Increasing is bad
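One way to measure Processing Time as defined above is to time only the handler call and record the result only when it succeeds. This is a sketch under that assumption; `timed_handle`, `recorded` and the injectable `clock` are illustrative names, not an API from any messaging framework:

```python
import time
from typing import Callable, List


def timed_handle(handler: Callable[[dict], None],
                 message: dict,
                 recorded: List[float],
                 clock: Callable[[], float] = time.perf_counter) -> bool:
    """Run `handler` and record its duration only if it succeeds.

    The timer starts when handling begins, so time spent waiting in the
    queue is not included. Failed attempts are not recorded.
    """
    start = clock()
    try:
        handler(message)
    except Exception:
        return False  # failure: no processing time is reported
    recorded.append(clock() - start)
    return True
```

Passing the clock in as a parameter keeps the sketch easy to test with fixed values.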
27. Critical Time
• Critical Time is the total duration between when a message is created and when it is processed
Critical Time = Time in Queue + Processing Time + Retry Time + Network Latency Time
Stable or decreasing could be good
Increasing is bad
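The Critical Time formula can be sketched in two equivalent ways: as the sum of its four components, or as the difference between the creation and processing timestamps. The class and function names below are hypothetical, and the timestamp version assumes the sender's and receiver's clocks are synchronized:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MessageTimings:
    """The four components of Critical Time, all in seconds."""
    time_in_queue: float
    processing_time: float
    retry_time: float
    network_latency: float

    def critical_time(self) -> float:
        # Critical Time = Time in Queue + Processing Time
        #               + Retry Time + Network Latency Time
        return (self.time_in_queue + self.processing_time
                + self.retry_time + self.network_latency)


def critical_time_from_timestamps(created_at: datetime,
                                  processed_at: datetime) -> float:
    """End-to-end duration in seconds, assuming synchronized clocks."""
    return (processed_at - created_at).total_seconds()
```

In practice the timestamp form is usually easier to collect, because the individual components are rarely all known for a single message.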
28. Putting these together
• Each of these metrics presents a piece of the puzzle
• Look at them from an endpoint’s perspective, not per message
• Looking at them together gives great insight into your system
[Charts: Critical Time, Processing Time, Queue Length]
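Looking at the metrics per endpoint rather than per message implies some form of aggregation. The following is only a sketch of one possibility, using a plain average; the sample tuple layout and the name `summarize_by_endpoint` are assumptions made for illustration:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (endpoint name, metric name, value) -- a hypothetical sample shape
Sample = Tuple[str, str, float]


def summarize_by_endpoint(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Average every metric per endpoint so the metrics can be read together."""
    grouped = defaultdict(lambda: defaultdict(list))
    for endpoint, metric, value in samples:
        grouped[endpoint][metric].append(value)
    return {
        endpoint: {metric: mean(values) for metric, values in metrics.items()}
        for endpoint, metrics in grouped.items()
    }
```

A real dashboard would more likely use a sliding time window and percentiles; the average is used here only to keep the example short.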
30. Detecting Connectivity
• Distributed systems typically work when other parts aren’t available
• How do you know the endpoint you’re sending messages to is actually processing messages?
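The slides do not say how connectivity is detected. One common approach, shown here purely as an assumption, is for every endpoint to report in periodically and to flag any endpoint that has been silent for too long. The name `stale_endpoints` and the 30-second default are hypothetical:

```python
from typing import Dict, List


def stale_endpoints(last_seen: Dict[str, float],
                    now: float,
                    timeout: float = 30.0) -> List[str]:
    """Return endpoints whose last report is older than `timeout` seconds.

    `last_seen` maps an endpoint name to the time (in seconds) at which
    it last reported that it was processing messages.
    """
    return sorted(name for name, seen in last_seen.items()
                  if now - seen > timeout)
```

For example, with `last_seen = {"Sales": 100.0, "Billing": 60.0}` and `now=105.0`, only `Billing` is reported, because it has been silent for 45 seconds.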
34. How do we collect all this info?
• Processing Time
• Critical Time
• Queue Length
• Connectivity
• Reporting Metric (N bytes)
• Message Type (N bytes)
• Timestamp (8 bytes)
• Value (8 bytes)
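The byte counts above can be checked with a small sketch that writes one measurement the naive way: both names in full, followed by an 8-byte timestamp and an 8-byte value. The function name, the UTF-8 encoding and the little-endian layout are assumptions; the result is not decodable as written (there are no length prefixes) and serves only to show the size:

```python
import struct


def encode_naive(metric: str, message_type: str,
                 timestamp: int, value: float) -> bytes:
    """Encode one measurement with both names written out in full.

    Size = N bytes (metric) + N bytes (message type)
         + 8 bytes (timestamp) + 8 bytes (value).
    """
    return (metric.encode("utf-8")
            + message_type.encode("utf-8")
            + struct.pack("<qd", timestamp, value))  # 8-byte int + 8-byte double
```

With a 14-character metric name and a 10-character message type, every single measurement costs 40 bytes, most of which is repeated text.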
35. How do we collect all this info?
• Epoch time (8 bytes)
• Dictionary of Metric Types (n * (N + 4) bytes)
• Dictionary of Message Types (n * (N + 4) bytes)
• An array of:
• Reporting Metric index (4 bytes)
• Message Type index (4 bytes)
• Epoch offset (4 bytes)
• Value (8 bytes)
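The batched layout above can be sketched as follows. Names are written once into dictionaries, and each measurement then refers to them by a 4-byte index and stores its timestamp as a 4-byte offset from a shared epoch. This is an illustration of the sizes on the slide, not the actual wire format: the function name, the little-endian layout and the reading of "N + 4" as a 4-byte index plus the N-byte name are assumptions, and a decodable format would also need counts or length prefixes.

```python
import struct
from typing import Dict, List, Tuple

# (metric, message type, timestamp, value) -- a hypothetical input shape
Measurement = Tuple[str, str, int, float]


def _index(table: Dict[str, int], name: str) -> int:
    """Return the index of `name`, assigning the next free one if it is new."""
    return table.setdefault(name, len(table))


def encode_batch(measurements: List[Measurement]) -> bytes:
    if not measurements:
        raise ValueError("nothing to encode")
    epoch = min(ts for _, _, ts, _ in measurements)
    metrics: Dict[str, int] = {}
    types: Dict[str, int] = {}
    rows = bytearray()
    for metric, message_type, ts, value in measurements:
        # 4-byte metric index, 4-byte type index, 4-byte offset, 8-byte value
        rows += struct.pack("<iiid", _index(metrics, metric),
                            _index(types, message_type), ts - epoch, value)
    out = bytearray(struct.pack("<q", epoch))  # 8-byte epoch
    for table in (metrics, types):
        for name, idx in table.items():
            out += struct.pack("<i", idx) + name.encode("utf-8")  # N + 4 bytes
    return bytes(out + rows)
```

Each additional measurement now costs a fixed 20 bytes regardless of how long the names are, so the saving grows with the number of measurements per batch.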