Ce diaporama a bien été signalé.
Le téléchargement de votre SlideShare est en cours. ×

Cortex: Prometheus as a Service, One Year On

Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité

Consultez-les par la suite

1 sur 35 Publicité

Cortex: Prometheus as a Service, One Year On

Télécharger pour lire hors ligne

Presented by Tom Wilkie at PromCon 2017

With Speaker Notes: https://goo.gl/V9qGva

At PromCon 2016, I presented "Project Frankenstein: A multitenant, horizontally scalable Prometheus as a service". It's now one year later, and lots has changed - not least the name! This talk will discuss what we've learnt running a Prometheus service for the past year, the architectural changes we made from the original design, and the improvements we've made to the Cortex user experience.

Presented by Tom Wilkie at PromCon 2017

With Speaker Notes: https://goo.gl/V9qGva

At PromCon 2016, I presented "Project Frankenstein: A multitenant, horizontally scalable Prometheus as a service". It's now one year later, and lots has changed - not least the name! This talk will discuss what we've learnt running a Prometheus service for the past year, the architectural changes we made from the original design, and the improvements we've made to the Cortex user experience.

Publicité
Publicité

Plus De Contenu Connexe

Diaporamas pour vous (20)

Similaire à Cortex: Prometheus as a Service, One Year On (20)

Publicité

Plus récents (20)

Publicité

Cortex: Prometheus as a Service, One Year On

  1. 1. Cortex: Prometheus as a Service, One Year On Tom Wilkie, PromCon 2017 tom.wilkie@gmail.com https://github.com/weaveworks/cortex
  2. 2. https://www.youtube.com/watch?v=3Tb4Wc0kfCM
  3. 3. Cortex: Prometheus as a Service • Natively multi tenant; isolate different customers in the same services. • Different story around scaling & HA • “Virtually infinite” retention and durability • Opportunities for performance enhancements Cortex YourYourYourYour Your Jobs Alertmanager Grafana Prometheus HA
  4. 4. Frontend Ditributor DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3 Cortex Architecture
  5. 5. A Year’s Evolution
  6. 6. Problem #1: DynamoDB Write Throughput
  7. 7. https://github.com/weaveworks/cortex/issues/254
  8. 8. Frontend Ditributor DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3Table Manager Cortex Architecture
  9. 9. Problem #2: DynamoDB Write Throughput, again
  10. 10. Original schema: • Hash Key: <user ID>:<hour>:<metric name> • Range Key: <label name>:<label value>:<chunk ID> New schema: • Hash Key: <user ID>:<day>:<metric name>:<label name> • Range Key: <chunk ID>:<chunk end time> https://github.com/weaveworks/cortex/pull/262
  11. 11. Problem #3: Queries of Death
  12. 12. Frontend Ditributor Querier Table Manager DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3 Cortex Architecture
  13. 13. Problem #3: Recording rules and alerts
  14. 14. Frontend Ditributor Querier Table Manager DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3 Ruler Cortex Architecture
  15. 15. Problem #4: Long tail
  16. 16. https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/
  17. 17. Problem #5: Cost
  18. 18. S3 DynamoDB IOP Cost ($/IOP) 5x10-6 2x10-7 Storage Cost ($/GB/Month) 0.023 0.250 https://github.com/weaveworks/cortex/issues/141
  19. 19. 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.06 0.005 0.01 0.015 0.02 0.025 Object size (GB) Cost ($) S3 DynamoDB
  20. 20. Frontend Ditributor Querier Table Manager DynamoDB Memcache Consul RulerIngester Write requests Read requests Control requests Prometheus Your Jobs Cortex Architecture
  21. 21. Problem #6: DynamoDB, again
  22. 22. Frontend Ditributor Querier Table Manager BigTable Memcache Consul RulerIngester Write requests Read requests Control requests Prometheus Your Jobs Cortex Architecture
  23. 23. DynamoDB BigTable 99th Percentile Write Latency (ms) 70-100 50-150 99th Percentile Read Latency (ms) 100-2500 ~250 LOC ~2000 ~400 DynamoDB numbers courtesy of Weaveworks
  24. 24. Closing thoughts
  25. 25. 1. DynamoDB Write Throughput 2. DynamoDB Write Throughput, again 3. Recording rules and alerts 4. Long tail 5. Cost 6. DynamoDB, again
  26. 26. Running for >12months • Availability: querier unavailable for <12hrs ~99.9% • Durability: lost <2 days of data >99.5% • 99th percentile write performance ~60ms • 99th percentile query performance <200ms
  27. 27. Future • Direct chunk writes from Prometheus to Cortex Chunk Store • Separate ingester index for better load balancing • Use prometheus/tsdb for the ingesters • Etcd & gossip for ring storage • Chunks in Google Cloud Storage
  28. 28. One more thing…
  29. 29. I left Weaveworks at the begging of June to focus on Prometheus & Cortex development. Since then I’ve teamed up with David to develop some ideas around Prometheus, logging, and tracing. We’re available for Prometheus hosting, consulting, training and support. email: hello@kausal.co
  30. 30. Metrics
  31. 31. Logs
  32. 32. Traces
  33. 33. Thank you! Questions?

×