London devops logging

  1. Practical logstash - beyond the basics. Tomas Doran (t0m) <bobtfish@bobtfish.net>
  2. Who are you • Sysadmin at TIM Group • t0m on irc.freenode.net • twitter.com/bobtfish • github.com/bobtfish • slideshare.net/bobtfish
  3. Logstash • I hope you already know what logstash is? • I’m going to talk about our implementation. • Elasticsearch • Metrics • Nagios • Riemann
  4. > 55 million messages a day • Now ~30Gb of indexed data per day • All our applications • All of syslog • Used by developers and product managers • 2 x DL360s with 8x600Gb discs, also graphite install
  5. About 4 months old • Almost all apps onboard to various levels • All of syslog was easy • Still haven’t done apache logs • Haven’t comprehensively done routers/switches • Lots of apps still emit directly to graphite
  6. Java • All our apps are Java / Scala / Clojure • https://github.com/tlrx/slf4j-logback-zeromq • Our own layer (x2: 1 Java, 1 Scala) for sending structured events as JSON • Java developers hate native code
  7. On-host log collector • Need a lightweight log shipper • VMs with 1Gb of RAM • Message::Passing - a Perl library I wrote • Small, light, pluggable
  8. On-host log collector • Application to logcollector is ZMQ • Small amount of buffering (1000 messages) • logcollector to logstash is ZMQ • Large amount of buffering (disc offload, 100s of thousands of messages)
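
A minimal sketch of the collector’s two hops, assuming pyzmq; addresses and buffer sizes are illustrative rather than the exact production values. (The disc offload mentioned above used ZeroMQ 2.x’s swap option, which later ZeroMQ versions removed.)

  import zmq

  context = zmq.Context()

  # Hop 1: applications publish to the local collector. Small buffer
  # (1000 messages), as described above.
  from_apps = context.socket(zmq.SUB)
  from_apps.set_hwm(1000)
  from_apps.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
  from_apps.bind("tcp://127.0.0.1:5558")

  # Hop 2: collector forwards to the central logstash. Much larger buffer,
  # so messages queue up locally while logstash is down or being restarted.
  to_logstash = context.socket(zmq.PUB)
  to_logstash.set_hwm(200000)
  to_logstash.connect("tcp://logstash.example.com:5558")

  while True:
      to_logstash.send(from_apps.recv())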
  9. ZeroMQ has the correct semantics • Pub/Sub sockets • Never, ever blocking • Lossy! (If needed) • Buffer sizes / locations configurable • Arbitrary message size • IO done in a background thread (nice in interpreted languages - ruby/perl/python)
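
A minimal sketch of the ‘never blocking, lossy if needed’ semantics, again assuming pyzmq:

  import zmq

  context = zmq.Context()
  pub = context.socket(zmq.PUB)
  pub.set_hwm(1000)         # buffer size is configurable per socket
  pub.bind("tcp://*:5558")  # address is illustrative

  # send_string() only queues the message; a background I/O thread owned by
  # the context does the actual network write, so a slow or dead consumer
  # never stalls the caller. Once the high water mark is hit, a PUB socket
  # silently drops new messages instead of blocking - lossy, if needed.
  pub.send_string("an example log line")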
  10. What, no AMQP? • Could go logcollector => AMQP => logstash for extra durability • ZMQ buffering ‘good enough’ • logstash uses a pure ruby AMQP decoder • Slooooowwwwww
  11. Reliability • Multiple Elasticsearch servers (obvious!) • Due to ZMQ buffering, you can: • restart logstash - messages just buffer on hosts whilst it’s unavailable • restart logcollector - messages from apps buffer (lose some syslog)
  12. Reliability: TODO • Elasticsearch cluster getting sick happens • In-flight messages in logstash get lost :( • Solution - the elasticsearch_river output • logstash => durable RabbitMQ queue • ES reads from the queue • Also faster - uses the bulk API
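
A sketch of that planned pipeline in logstash configuration syntax: logstash writes to a durable RabbitMQ queue, and Elasticsearch’s river plugin consumes it via the bulk API. The elasticsearch_river output existed in the logstash 1.1.x era, but the option names below are assumptions from memory and the hostnames are illustrative; check the documentation for your version:

  output {
    elasticsearch_river {
      es_host       => "es.example.com"      # Elasticsearch node (illustrative)
      rabbitmq_host => "rabbit.example.com"  # the durable queue lives here (illustrative)
    }
  }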
  13. Redundancy • Add a UUID to each message at the emission point • Index in Elasticsearch by UUID • Emit to two backend logstash instances (TODO) • Index everything twice! (TODO)
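
A minimal sketch of why the UUID makes double-indexing safe, assuming direct HTTP access to Elasticsearch (index and type names are illustrative): if the document id is the event’s UUID, two redundant pipelines delivering the same message just write the same document twice.

  import json
  import urllib.request
  import uuid

  event = {"message": "user logged in", "app": "example"}
  event["uuid"] = str(uuid.uuid4())  # assigned once, at the emission point

  # Index by the event's UUID: a second delivery of the same message
  # overwrites the first document instead of creating a duplicate.
  url = "http://localhost:9200/logstash-2013.01.01/event/%s" % event["uuid"]
  req = urllib.request.Request(url, data=json.dumps(event).encode(),
                               method="PUT")
  urllib.request.urlopen(req)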
  14. Elasticsearch optimisation • You need a template • compress source • disable _all • discard unwanted fields from source / indexing • tweak shards and replicas • compact yesterday’s index at the end of the day!
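
A minimal sketch of such a template, to be PUT to /_template/logstash. The syntax is from the Elasticsearch 0.20 era this deck describes (_source compression was removed in later releases), and the shard/replica counts are illustrative; compacting yesterday’s index was done with the old _optimize API (e.g. POST /logstash-2013.01.01/_optimize?max_num_segments=1):

  {
    "template": "logstash-*",
    "settings": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    },
    "mappings": {
      "_default_": {
        "_source": { "compress": true },
        "_all": { "enabled": false }
      }
    }
  }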
  15. Elasticsearch size • 87 daily indexes • 800Gb of data (per instance) • Just bumped the ES heap to 22G • Just writing data - 2Gb • Query over all indexes - 17Gb! • Hang on - 800/87 does not = 33Gb/day!
  16. Rate has increased! (800Gb / 87 days ≈ 9Gb/day on average, versus ~30Gb/day now.) We may have problems fitting onto 5 x 600Gb discs!
  17. Standard log message
  18. Standard event message
  19. TimedWebRequest • Most obvious example of a standard event • App name • Environment • HTTP status • Page generation time • Request / Response size • Can derive loads of metrics from this!
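
What one such event might look like on the wire - field names here are illustrative, not the exact production schema:

  {
    "uuid":           "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "type":           "TimedWebRequest",
    "app":            "example-app",
    "environment":    "production",
    "http_status":    200,
    "page_time_ms":   142,
    "request_bytes":  512,
    "response_bytes": 20480
  }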
  20. statsd • Rolls up counters and timers into metrics • One bucket per stat, emits values every 10 seconds • Counters: request rate, HTTP status rate • Timers: total page time, mean page time, min/max page times
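
A minimal sketch of emitting those counters and timers, assuming a statsd daemon on localhost:8125 (metric names are illustrative). The statsd wire protocol is plain text over UDP: "name:value|c" for counters, "name:value|ms" for timers:

  import socket

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  def emit(metric):
      sock.sendto(metric.encode(), ("localhost", 8125))

  # Derived from a single TimedWebRequest event:
  emit("example-app.production.requests:1|c")      # request rate
  emit("example-app.production.http.200:1|c")      # HTTP status rate
  emit("example-app.production.page_time:142|ms")  # page generation time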
  21. JSON everywhere • Legacy shell FTP mirror scripts • gitolite hooks for deployments • keepalived health checks
  22. JSON everywhere echo "JSON:{\"nagios_service\":\"${SERVICE}\", \"nagios_status\":\"${STATUS_CODE}\", \"message\":\"${STATUS_TEXT}\"}" | logger -t nagios
  23. Alerting use cases: • Replaced the nsca client with a standardised log pipeline • Developers log an event and get (one!) email warning of client-side exceptions • Passive health monitoring - ‘did we log something recently?’
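
A minimal sketch of the ‘did we log something recently?’ check, assuming HTTP access to Elasticsearch; the index pattern, field, and five-minute window are illustrative, and the range syntax shown is for modern Elasticsearch (older releases spelled it differently):

  import json
  import urllib.request

  query = {"size": 0, "query": {"range": {"@timestamp": {"gte": "now-5m"}}}}
  req = urllib.request.Request(
      "http://localhost:9200/logstash-*/_search",
      data=json.dumps(query).encode(),
      headers={"Content-Type": "application/json"},
  )
  total = json.load(urllib.request.urlopen(req))["hits"]["total"]
  if isinstance(total, dict):  # Elasticsearch 7+ wraps the count in an object
      total = total["value"]
  if total == 0:
      print("CRITICAL: no log events in the last 5 minutes")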
  24. Riemann • Using it for some simple health checking • logcollector health • Load balancer instance health
  25. Riemann • Ambitious plans to do more • Web pool health (>= n nodes) • Replace statsd • Transit collectd data via logstash and use it to emit to graphite • Disc usage trending / prediction
  26. Metadata • It’s all about the metadata • Structured events are describable • Common patterns give standard metrics / alerting for free • Dashboards!
  27. Dashboard love/hate • Riemann x 2 • Graphite dashboards x 2 • Nagios x 3 • CI radiator • Information overload!
  28. Thanks! • Questions? • Slides with more detail about my log collector code: http://slideshare.net/bobtfish/

Editor’s notes

  On the log collector / ZeroMQ slides: the last point is the most important - ZMQ networking works entirely in a background thread that Perl knows nothing about, which means that you can asynchronously ship messages with no changes to your existing codebase.