Cortex: Prometheus as a Service, One Year On


Presented by Tom Wilkie at PromCon 2017 With Speaker Notes: https://goo.gl/V9qGva At PromCon 2016, I presented "Project Frankenstein: A multitenant, horizontally scalable Prometheus as a service". It's now one year later, and lots has changed - not least the name! This talk will discuss what we've learnt running a Prometheus service for the past year, the architectural changes we made from the original design, and the improvements we've made to the Cortex user experience.


Cortex: Prometheus as a
Service, One Year On
Tom Wilkie, PromCon 2017
tom.wilkie@gmail.com
https://github.com/weaveworks/cortex

https://www.youtube.com/watch?v=3Tb4Wc0kfCM

Cortex: Prometheus as a Service
• Natively multi-tenant; isolate different customers in the same services.
• Different story around scaling & HA
• “Virtually infinite” retention and durability
• Opportunities for performance enhancements
[Diagram: Your Jobs, Prometheus HA, Alertmanager, Grafana, Cortex.]

Cortex Architecture
[Diagram. Components: Frontend, Distributor, Ingester, DynamoDB, Memcache, Consul, S3, Prometheus, Your Jobs. Legend: write requests, read requests, control requests.]

https://github.com/weaveworks/cortex/issues/254

Cortex Architecture
[Diagram. Components: Frontend, Distributor, Ingester, Table Manager, DynamoDB, Memcache, Consul, S3, Prometheus, Your Jobs. Legend: write requests, read requests, control requests.]

Problem #2: DynamoDB Write Throughput, again

Original schema:
• Hash Key: <user ID>:<hour>:<metric name>
• Range Key: <label name>:<label value>:<chunk ID>
New schema:
• Hash Key: <user ID>:<day>:<metric name>:<label name>
• Range Key: <chunk ID>:<chunk end time>
https://github.com/weaveworks/cortex/pull/262
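
The two key layouts above can be illustrated with a minimal Python sketch. This is not the Cortex implementation: the function names, the choice of the chunk start time for bucketing, the use of Unix-epoch hour and day numbers, and the sample values are all assumptions made for illustration. Only the field order inside each key comes from the slide.

```python
from typing import Dict, List, Tuple

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

Entry = Tuple[str, str]  # (hash key, range key)


def original_entries(user_id: str, metric: str, labels: Dict[str, str],
                     chunk_id: str, start_ts: int) -> List[Entry]:
    """Original layout: one hash key per user, hour and metric name."""
    hour = start_ts // SECONDS_PER_HOUR  # assumption: bucket by chunk start time
    hash_key = f"{user_id}:{hour}:{metric}"
    return [(hash_key, f"{name}:{value}:{chunk_id}")
            for name, value in sorted(labels.items())]


def new_entries(user_id: str, metric: str, labels: Dict[str, str],
                chunk_id: str, start_ts: int, end_ts: int) -> List[Entry]:
    """New layout: the label name moves into the hash key, buckets are daily."""
    day = start_ts // SECONDS_PER_DAY  # assumption: bucket by chunk start time
    return [(f"{user_id}:{day}:{metric}:{name}", f"{chunk_id}:{end_ts}")
            for name in sorted(labels)]


if __name__ == "__main__":
    # Hypothetical sample chunk, for illustration only.
    labels = {"job": "api", "instance": "a1"}
    for entry in original_entries("u1", "http_requests_total", labels,
                                  "c42", 1502928000):
        print("original:", entry)
    for entry in new_entries("u1", "http_requests_total", labels,
                             "c42", 1502928000, 1502931600):
        print("new:     ", entry)
```

In this sketch, all entries for one chunk share a single hash key under the original layout, while the new layout produces one hash key per label name.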

Cortex Architecture
[Diagram. Components: Frontend, Distributor, Querier, Ingester, Table Manager, DynamoDB, Memcache, Consul, S3, Prometheus, Your Jobs. Legend: write requests, read requests, control requests.]

https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/

                            S3         DynamoDB
IOP Cost ($/IOP)            5x10^-6    2x10^-7
Storage Cost ($/GB/Month)   0.023      0.250
https://github.com/weaveworks/cortex/issues/141

[Chart: cost ($, 0 to 0.025) against object size (GB, 0 to 0.06) for S3 and DynamoDB.]
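
The trade-off in the cost table can be worked through with a small Python sketch. The cost model here is an assumption, not something stated on the slides: each object is charged for one I/O operation plus one month of storage. The four prices are the ones quoted in the table; the function names are hypothetical.

```python
# Prices quoted on the slide.
S3_IOP_COST = 5e-6          # $/IOP
S3_STORAGE_COST = 0.023     # $/GB/month
DYNAMO_IOP_COST = 2e-7      # $/IOP
DYNAMO_STORAGE_COST = 0.250  # $/GB/month


def object_cost(iop_cost: float, storage_cost: float, size_gb: float,
                months: float = 1.0, iops: int = 1) -> float:
    """Assumed model: a fixed per-operation cost plus size-proportional storage."""
    return iops * iop_cost + size_gb * storage_cost * months


def break_even_size_gb() -> float:
    """Object size at which both stores cost the same under the assumed model."""
    return (S3_IOP_COST - DYNAMO_IOP_COST) / (DYNAMO_STORAGE_COST - S3_STORAGE_COST)


if __name__ == "__main__":
    size = break_even_size_gb()
    print(f"break-even object size: {size:.3e} GB")
    for size_gb in (1e-6, 0.06):
        s3 = object_cost(S3_IOP_COST, S3_STORAGE_COST, size_gb)
        dynamo = object_cost(DYNAMO_IOP_COST, DYNAMO_STORAGE_COST, size_gb)
        print(f"{size_gb} GB: S3 ${s3:.7f}, DynamoDB ${dynamo:.7f}")
```

Under these assumptions the per-operation cost dominates for very small objects, which favours DynamoDB, and the storage cost dominates for larger ones, which favours S3.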

Cortex Architecture
[Diagram. Components: Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, DynamoDB, Memcache, Consul, Prometheus, Your Jobs. Legend: write requests, read requests, control requests.]

Cortex Architecture
[Diagram. Components: Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, BigTable, Memcache, Consul, Prometheus, Your Jobs. Legend: write requests, read requests, control requests.]

                                     DynamoDB    BigTable
99th Percentile Write Latency (ms)   70-100      50-150
99th Percentile Read Latency (ms)    100-2500    ~250
LOC                                  ~2000       ~400
DynamoDB numbers courtesy of Weaveworks

1. DynamoDB Write Throughput
2. DynamoDB Write Throughput, again
3. Recording rules and alerts
4. Long tail
5. Cost
6. DynamoDB, again

Running for >12 months
• Availability: querier unavailable for <12 hrs (~99.9%)
• Durability: lost <2 days of data (>99.5%)
• 99th percentile write performance: ~60 ms
• 99th percentile query performance: <200 ms
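
The availability and durability percentages above follow from simple arithmetic, sketched here in Python. The 365-day window and the helper name are assumptions. The slide gives only bounds (less than 12 hours, less than 2 days, more than 12 months), so the values computed below are floors rather than measured figures.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: exactly one 365-day year


def fraction_ok(bad_hours: float, window_hours: float = HOURS_PER_YEAR) -> float:
    """Share of the window that was not affected by the outage or loss."""
    return 1.0 - bad_hours / window_hours


availability_floor = fraction_ok(12)      # querier unavailable for 12 hours
durability_floor = fraction_ok(2 * 24)    # 2 days of data lost

if __name__ == "__main__":
    print(f"availability floor: {availability_floor:.2%}")
    print(f"durability floor:   {durability_floor:.2%}")
```

With the upper bounds taken at face value over exactly one year, this gives roughly 99.86% and 99.45%. Less downtime or a longer running period pushes both figures higher.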

Future
• Direct chunk writes from Prometheus to Cortex Chunk Store
• Separate ingester index for better load balancing
• Use prometheus/tsdb for the ingesters
• Etcd & gossip for ring storage
• Chunks in Google Cloud Storage

I left Weaveworks at the beginning of June to
focus on Prometheus & Cortex development.
Since then I’ve teamed up with David to
develop some ideas around Prometheus,
logging, and tracing.
We’re available for Prometheus hosting,
consulting, training and support.
email: hello@kausal.co

Cortex: Prometheus as a Service, One Year On

1. Cortex: Prometheus as a Service, One Year On Tom Wilkie, PromCon 2017 tom.wilkie@gmail.com https://github.com/weaveworks/cortex

3. https://www.youtube.com/watch?v=3Tb4Wc0kfCM

4. Cortex: Prometheus as a Service • Natively multi-tenant; isolate different customers in the same services. • Different story around scaling & HA • “Virtually infinite” retention and durability • Opportunities for performance enhancements. Diagram: Your Jobs, Prometheus HA, Alertmanager, Grafana, Cortex

5. Cortex Architecture (diagram): Frontend, Distributor, Ingester, DynamoDB, Memcache, Consul, S3, Prometheus, Your Jobs; write, read, and control requests

6. A Year’s Evolution

7. Problem #1: DynamoDB Write Throughput

8. https://github.com/weaveworks/cortex/issues/254

9. Cortex Architecture (diagram): Frontend, Distributor, Ingester, Table Manager, DynamoDB, Memcache, Consul, S3, Prometheus, Your Jobs; write, read, and control requests

10. Problem #2: DynamoDB Write Throughput, again

11. Original schema: • Hash Key: <user ID>:<hour>:<metric name> • Range Key: <label name>:<label value>:<chunk ID> New schema: • Hash Key: <user ID>:<day>:<metric name>:<label name> • Range Key: <chunk ID>:<chunk end time> https://github.com/weaveworks/cortex/pull/262

12. Problem #3: Queries of Death

13. Cortex Architecture (diagram): Frontend, Distributor, Querier, Ingester, Table Manager, DynamoDB, Memcache, Consul, S3, Prometheus, Your Jobs; write, read, and control requests

14. Problem #3: Recording rules and alerts

15. Cortex Architecture (diagram): Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, DynamoDB, Memcache, Consul, S3, Prometheus, Your Jobs; write, read, and control requests

16. Problem #4: Long tail

17.

18. https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/

19. Problem #5: Cost

20. S3: IOP cost 5x10^-6 $/IOP, storage cost 0.023 $/GB/month. DynamoDB: IOP cost 2x10^-7 $/IOP, storage cost 0.250 $/GB/month. https://github.com/weaveworks/cortex/issues/141

21. Chart: cost ($, 0 to 0.025) against object size (GB, 0 to 0.06) for S3 and DynamoDB

22. Cortex Architecture (diagram): Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, DynamoDB, Memcache, Consul, Prometheus, Your Jobs; write, read, and control requests

23. Problem #6: DynamoDB, again

24. Cortex Architecture (diagram): Frontend, Distributor, Querier, Ruler, Ingester, Table Manager, BigTable, Memcache, Consul, Prometheus, Your Jobs; write, read, and control requests

25. DynamoDB: 99th percentile write latency 70-100 ms, 99th percentile read latency 100-2500 ms, ~2000 LOC. BigTable: 99th percentile write latency 50-150 ms, 99th percentile read latency ~250 ms, ~400 LOC. DynamoDB numbers courtesy of Weaveworks

26. Closing thoughts

27. 1. DynamoDB Write Throughput 2. DynamoDB Write Throughput, again 3. Recording rules and alerts 4. Long tail 5. Cost 6. DynamoDB, again

28. Running for >12 months • Availability: querier unavailable for <12 hrs (~99.9%) • Durability: lost <2 days of data (>99.5%) • 99th percentile write performance: ~60 ms • 99th percentile query performance: <200 ms

29. Future • Direct chunk writes from Prometheus to Cortex Chunk Store • Separate ingester index for better load balancing • Use prometheus/tsdb for the ingesters • Etcd & gossip for ring storage • Chunks in Google Cloud Storage

30. One more thing…

31. I left Weaveworks at the beginning of June to focus on Prometheus & Cortex development. Since then I’ve teamed up with David to develop some ideas around Prometheus, logging, and tracing. We’re available for Prometheus hosting, consulting, training and support. email: hello@kausal.co

32. Metrics

33. Logs

34. Traces

35. Thank you! Questions?