Systems Monitoring with Prometheus
Devops Ireland, April 2015
Brian Brazil
Senior Software Engineer
Boxever

Systems Monitoring with Prometheus (Devops Ireland April 2015)


Monitoring means many things to many people. This talk looks at Systems Monitoring: how to keep an eye on a given system and use this as part of the overall management of that system. It covers why one monitors, what to monitor, how to monitor, the general design of a monitoring system, and how Prometheus is a good fit in terms of instrumentation, consoles, alerts, and general system health and sanity.

Prometheus is a next-generation monitoring system publicly announced earlier this year, developed by companies including SoundCloud, Docker and local company Boxever. Since launch there has been widespread interest and many community contributions.

For more information see http://prometheus.io or http://www.boxever.com/tag/monitoring


  1. Systems Monitoring with Prometheus. Devops Ireland, April 2015. Brian Brazil, Senior Software Engineer, Boxever.
  2. What is monitoring?
  3. What is monitoring? Host-based checks?
      • Typically shell scripts with success/fail
      • Failure causes alerts
      • More blackbox than whitebox
      • About machines, not services
  4. Brian’s Pet Peeve #1: Thinking in terms of machines rather than services. It’s the future: it’s not the “Webserver machine”, it’s one machine that happens to run a Webserver as part of the Webserver service.
  5. What is monitoring? Highly granular information about a subsystem?
      • Tends to be focused on one subsystem, such as incoming HTTP requests
      • No visibility into the rest of the system
  6. What is monitoring? High frequency, high granularity profiling?
      • Great for debugging once you’ve narrowed down the problem
      • Not so useful for general monitoring
  7. What is monitoring? Tailing logs?
      • Easy to miss something
      • Tend to get very noisy
      • We have computers, why are humans doing repetitive tasks?
  8. Step Back: Why do we want monitoring?
  9. Why: Alerting. We want to know when things go wrong. We want to know when things aren’t quite right. We want to know in advance of problems.
  10. Why: Debugging. When something is up, you need to debug. You want to go from a high-level problem and drill down to what’s causing it. You need to be able to reason about things. Sometimes you want to go from code back to metrics.
  11. Brian’s Pet Peeve #2: Instrumentation that you need to read the code to understand. Make the names such that a random person not intimately familiar with the system would have a good chance of guessing what they mean. Specify your units.
  12. Why: Trending. How the various bits of a system are being used. For example, how many static requests per dynamic request? How many sessions are active at once? How many hit a certain corner case? For some stats, you also want to know how they change over time, for capacity planning and design discussions.
  13. A different approach. What if we instrumented everything?
      • RPCs
      • Interfaces between subsystems
      • Business logic
      • Every time you’d log something
      What if we monitored systems and subsystems to know how everything is generally doing?
  14. That’s a lot of metrics. That could be tens of thousands of codepoints across an entire system. You’d need some way to make it easy to instrument all code, not just the externally facing parts of applications. You’d need something able to handle a million time series.
  15. Presenting Prometheus. An open-source service monitoring system and time series database. Started in 2012, primarily developed at SoundCloud with committers also at Boxever and Docker. Publicly announced in January 2015, with many contributions and users since then.
  16. Architecture
  17. The Server
      • Can handle over a million time series in one instance
      • No dependencies such as HBase or Cassandra
      • Stores data on local disk
      • Written in Go
      • Easy to run
  18. Data Model. Tired of aggregating and alerting off metrics like http.responses.500.myserver.mydc.production? Time series have structured key-value pairs, e.g.
      http_responses_total{response_code="500",instance="myserver",dc="mydc",env="production"}
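
As a rough illustration of why the labelled form is easier to work with, the sketch below stores each sample as a metric name plus a label dictionary, then selects and sums by label instead of parsing a dotted string. The helper names (`format_sample`, `sum_where`) and the sample values are made up for this example; this is not Prometheus code.

```python
def format_sample(name, labels, value):
    """Render one sample as name{key="value",...} value, labels sorted by key."""
    pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


def sum_where(samples, **wanted):
    """Sum the values of all samples whose labels match every wanted pair."""
    return sum(
        value
        for labels, value in samples
        if all(labels.get(key) == val for key, val in wanted.items())
    )


# Hypothetical samples for the http_responses_total metric from the slide.
SAMPLES = [
    ({"response_code": "500", "instance": "myserver", "dc": "mydc", "env": "production"}, 3.0),
    ({"response_code": "200", "instance": "myserver", "dc": "mydc", "env": "production"}, 120.0),
    ({"response_code": "500", "instance": "otherserver", "dc": "mydc", "env": "production"}, 1.0),
]

line = format_sample("http_responses_total", *SAMPLES[0])
errors_in_dc = sum_where(SAMPLES, response_code="500", dc="mydc")
print(line)
print(errors_in_dc)
```

Because each dimension is a named label, adding a new one (say, a hypothetical `method` label) does not change how existing selections are written.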
  19. Brian’s Pet Peeve #3: Munging structured data in a way that loses the structure. Is it so much to ask for some escaping, or at least sanitizing any separators in the data?
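
A small sketch of the problem, under the assumption that the separator is a dot and a hostname containing dots is used as one component. The function names are hypothetical. The escaping rules shown for quoted values (backslash, double quote, newline) are the ones the Prometheus text format uses for label values.

```python
def dotted(parts):
    """Naive munging: join components with dots, with no escaping."""
    return ".".join(parts)


def sanitize(part, separator=".", replacement="_"):
    """Minimal fix: make sure a component cannot contain the separator."""
    return part.replace(separator, replacement)


def escape_label_value(value):
    """Escape a value so it can sit safely inside double quotes."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


parts = ["http", "responses", "500", "web.example.com"]

# Four components go in, six come out: the structure is lost.
munged = dotted(parts)
lossy_count = len(munged.split("."))

# Sanitizing each component keeps the component count stable.
safe = dotted(sanitize(p) for p in parts)
safe_count = len(safe.split("."))

print(munged, lossy_count)
print(safe, safe_count)
print(escape_label_value('say "hi"'))
```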
  20. Query Language. Aggregation based on the key-value labels. Arbitrarily complex math. And all of this can be used in pre-computed rules and alerts.
  21. Query Language: Example. Column families with the 10 highest read rates per second:
      topk(10,
        sum by(job, keyspace, columnfamily) (
          rate(cassandra_columnfamily_readlatency[5m])
        )
      )
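
To make the three steps of that query concrete (per-series rate, sum by labels, top k), here is a simplified Python model of what it computes. It is only a sketch: `simple_rate` takes the first and last point in the window and ignores counter resets and any extrapolation a real server may do, and the series data is invented.

```python
from collections import defaultdict


def simple_rate(points):
    """Per-second increase between the first and last (timestamp, value) point."""
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    return (v_last - v_first) / (t_last - t_first)


def topk_sum_by(series, by, k):
    """Sum per-series rates grouped by the `by` labels, return the k largest."""
    totals = defaultdict(float)
    for labels, points in series:
        group = tuple(labels[name] for name in by)
        totals[group] += simple_rate(points)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical counter samples over a 300 second window, one series per instance.
SERIES = [
    ({"job": "cassandra", "keyspace": "ks", "columnfamily": "users", "instance": "a"},
     [(0, 0), (300, 600)]),
    ({"job": "cassandra", "keyspace": "ks", "columnfamily": "users", "instance": "b"},
     [(0, 100), (300, 400)]),
    ({"job": "cassandra", "keyspace": "ks", "columnfamily": "events", "instance": "a"},
     [(0, 0), (300, 150)]),
]

top = topk_sum_by(SERIES, by=("job", "keyspace", "columnfamily"), k=1)
print(top)
```

Grouping by `job`, `keyspace` and `columnfamily` drops the `instance` label, so the two `users` series are added together before ranking.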
  22. Client Libraries. How you instrument your code.
      • Official: Python, Java, Go, Ruby
      • Unofficial: Bash, Node.js, .NET
      Text-based format, easy to write new client libraries.
  23. Client Libraries: Example
      from prometheus_client import Summary

      # Track how long each call takes, in seconds.
      REQUEST_LATENCY = Summary("request_latency_seconds", "Request latency")

      @REQUEST_LATENCY.time()
      def process_request():
          pass
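
For readers who want to see what a timing decorator like this does underneath, the following standard-library sketch tracks a count and a sum of observed durations. `ToySummary` is a hypothetical stand-in written for this illustration, not the client library's implementation; the `_count` and `_sum` suffixes follow the naming Prometheus uses for summaries.

```python
import functools
import time


class ToySummary:
    """Toy summary: counts observations and sums their values."""

    def __init__(self, name, documentation):
        self.name = name
        self.documentation = documentation
        self.count = 0
        self.sum = 0.0

    def observe(self, amount):
        self.count += 1
        self.sum += amount

    def time(self):
        """Return a decorator that observes each call's duration in seconds."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Record the duration even if the call raised.
                    self.observe(time.perf_counter() - start)
            return wrapper
        return decorator

    def samples(self):
        return [(self.name + "_count", self.count), (self.name + "_sum", self.sum)]


REQUEST_LATENCY = ToySummary("request_latency_seconds", "Request latency")


@REQUEST_LATENCY.time()
def process_request():
    pass


for _ in range(3):
    process_request()

print(REQUEST_LATENCY.samples())
```

Exposing only the running count and sum leaves the averaging over any time window to the server.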
  24. Client Libraries: In and Out. Client libraries don’t tie you to Prometheus instrumentation. Custom collectors allow pulling data from other instrumentation systems into the Prometheus client library. Similarly, you can pull data out of the client library and expose it as you wish.
  25. Client Libraries: What to Instrument
  26. Client Libraries: What to Instrument. Everything.
  27. Things to have
      • Client and server qps/errors/latency
      • Every log message should be a metric
      • Every failure should be a metric
      • Threadpool/queue size, in progress, latency
      • Business logic inputs and outputs
      • Data sizes in/out
      • Process CPU/RAM/language internals (e.g. GC)
      • Blackbox and end-to-end monitoring
  28. Batch/Offline Processing Metrics
      • Last time it succeeded
      • Records processed/throughput
      • Duration of major batch stages
      • Heartbeats for end-to-end testing
      Use the Pushgateway for ephemeral jobs.
  29. Brian’s Pet Peeve #4: Wrapping instrumentation libraries to make them “simpler”. Wrappers tend to confuse abstractions, encourage bad practices and make it difficult to write correct and usable instrumentation. For example, Prometheus values are doubles; if you only allow ints, the end user has to do math to convert back to seconds.
  30. Speaking of Correct Instrumentation. It’s better to have math done in the server, not the client. Many instrumentation systems are exponentially decaying. Do you really want to do calculus during an outage? Prometheus has monotonic counters. Races and missed scrapes don’t lose data.
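
The point about monotonic counters can be shown with a few numbers. In this sketch (invented values, counter resets ignored), the same workload is exposed two ways: as a cumulative counter, and as per-interval deltas computed on the client. Two scrapes are then assumed to fail.

```python
def increase(counter_scrapes):
    """Total increase between the first and last successful scrape."""
    return counter_scrapes[-1] - counter_scrapes[0]


# Cumulative counter value at scrapes 0..4 (hypothetical).
cumulative = [0, 10, 25, 25, 40]

# What a client that resets after each scrape would report at scrapes 1..4.
deltas = {i: cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))}

# Assume scrapes 2 and 3 were missed.
missed = {2, 3}

counter_seen = [value for i, value in enumerate(cumulative) if i not in missed]
counter_total = increase(counter_seen)

delta_total = sum(delta for i, delta in deltas.items() if i not in missed)

print(counter_total)  # the counter still accounts for everything that happened
print(delta_total)    # the deltas from the missed scrapes are gone
```

With the counter, a missed scrape only costs resolution in time; the total is recovered at the next successful scrape.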
  31. Integrations. A more powerful data model needs integrations and instrumentation written to take advantage of it.
      • Exporters: Machine (Node Exporter), HAProxy, CloudWatch, StatsD, collectd, JMX, Mesos, Consul, MySQL
      • Direct instrumentation: cAdvisor, etcd
  32. Dashboards
      • Promdash: Ruby on Rails web app
      • Console templates: More power for those who like checking things in
      • Expression browser: Ad-hoc queries
      • JSON interface: Roll your own
  33. Dashboards
  34. Dashboards: What to put on dashboards?
  35. Dashboards. Goal: Make it easy to logically drill down into a problem. Most services are a graph. Make it easy to go from a console about one service, see which of its backends is the problem, and repeat.
  36. Brian’s Pet Peeve #5: Dashboard anti-patterns
      • Graph of a hundred plots
      • Page of a thousand graphs
      • Consoles that their creator can barely understand
      • “Put it on a console somewhere”
  37. Dashboard Guidelines
      • Don’t put every possible metric on the dashboard
      • Focus on the top few metrics, based on the most likely failure modes and things you’ll want to know
      • No more than 5 graphs per console, 5 plots per graph
      • Have units, y-labels, legends and descriptions
      • Split out or trim if dashboards are getting too complex
      • It’s difficult for a dashboard to serve two masters
      • Dashboards are not for alerting
  38. Alerting. Alertmanager aggregates alerts from Prometheus servers. Supports notifications to PagerDuty, Email, Pushover. Best practices:
      • Alert on symptoms, not causes
      • Have a way to deal with non-critical alerts
  39. What to Alert On
      • Online Serving: Overall latency, errors
      • Offline Processing: Propagation/processing delay
      • Batch Jobs: When it last succeeded
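
As a sketch of how two of these conditions might look as plain checks, the functions below test an overall error ratio for online serving and the age of the last success for a batch job. The function names and the thresholds (1% errors, two hours) are assumptions chosen for illustration; in practice such conditions would be written as alerting rules in the monitoring system.

```python
def error_ratio(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if requests == 0:
        return 0.0
    return errors / requests


def serving_alert(errors, requests, max_error_ratio=0.01):
    """Symptom-based check: too large a share of requests is failing."""
    return error_ratio(errors, requests) > max_error_ratio


def batch_alert(last_success_timestamp, now, max_age_seconds=2 * 3600):
    """Fire when the job has not succeeded recently enough."""
    return now - last_success_timestamp > max_age_seconds


print(serving_alert(errors=5, requests=100))
print(batch_alert(last_success_timestamp=1000.0, now=1000.0 + 3 * 3600))
```

Checking the time since the last success, rather than whether the last run failed, also catches the case where the job silently stopped running.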
  40. The Live Demo: please work please work please work ...
  41. The Future. Many features on the roadmap:
      • Service discovery
      • Federation
      • Long term storage
      • HA Alertmanager
      • More exporters, client libraries and integrations
  42. Final Word. Systems Monitoring is your first port of call in an emergency, so keep it working without needing lots of effort. Prometheus is awesome, which can lead to non-critical data taking lots of management and crowding out critical metrics. At some point, you have to move non-critical things to a generic data processing system.
  43. Finaler Word. How do you know your monitoring system is good?
      • When you have superb monitoring for everything?
      • When it causes an HDD to fail?
      • When it finds two bugs in Go’s DNS library?
  44. Finaler Word. How do you know your monitoring system is good?
      • When you have superb monitoring for everything?
      • When it causes an HDD to fail?
      • When it finds two bugs in Go’s DNS library?
      No, when the unit tests find a bug in your filesystem!
  45. Try it out! http://www.boxever.com/tag/monitoring has step-by-step instructions on monitoring:
      • Machines
      • Cassandra
      • HAProxy
      • Python Batch host-based jobs
      • Java applications
      Problems? We’re on #prometheus on Freenode.
  46. More Information
      • http://prometheus.io
      • http://www.boxever.com/tag/monitoring
      • Python Ireland, April 8th
      • SREcon15 Europe, May 14-15th
