Donny Nadolny
6 Nines:
How Stripe keeps Kafka highly-available
across the globe
Kafka at Stripe
● Used by vast majority of services at Stripe
● Stripe:
○ Payments & more for internet businesses
○ Our reliability affects all businesses that use us
● “Charge path” - critical services needed to process payments
○ Failure here means a customer can’t buy something and
businesses lose money
Typical Cluster Topology
● Brokers (and ZooKeeper) spread across 3 AWS availability zones
● Topic replication factor: 3 with rack awareness
● Minimum live replicas to publish: 2
● Fault tolerance: can lose 1 machine or 1 AZ and everything works
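The fault-tolerance claim above can be checked with a small model. This is a minimal sketch, not Stripe's tooling: the `Replica` type, the broker and AZ names, and the `can_publish` helper are all hypothetical, and the "minimum live replicas" rule is assumed to behave like Kafka's `min.insync.replicas` setting combined with `acks=all`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    broker: str  # hypothetical broker name
    az: str      # hypothetical availability zone name


def can_publish(replicas, failed_brokers=frozenset(), failed_azs=frozenset(),
                min_live_replicas=2):
    """Return True if enough replicas of a partition are still alive to accept a publish."""
    live = [r for r in replicas
            if r.broker not in failed_brokers and r.az not in failed_azs]
    return len(live) >= min_live_replicas


# Replication factor 3 with rack awareness: one replica per availability zone.
PARTITION = [Replica("b1", "az-a"), Replica("b2", "az-b"), Replica("b3", "az-c")]

print(can_publish(PARTITION, failed_azs={"az-a"}))                         # one AZ lost
print(can_publish(PARTITION, failed_azs={"az-a"}, failed_brokers={"b2"}))  # AZ + machine lost
```

Losing one machine or one AZ leaves two live replicas, so publishing continues; losing one AZ plus one more machine leaves a single replica, which is below the minimum.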
Kafka is Highly Available
Kafka is highly available, but…
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
○ OS update, Kafka version upgrade
● Misbehaving client flooding requests
○ Client sets their ID based on machine name, or a UUID
● Bugs in Kafka
● ZooKeeper dependency
○ Multitenant ZooKeeper
● Operator error
High(er) availability Kafka - requirements
● Producers always need to be able to produce
● Consumers can fall behind with less impact
Higher availability Kafka - classic setup
Higher availability Kafka - only for producers!
HA-write tradeoffs
● Pros:
○ Easy to set up
○ Much better produce availability
● Cons:
○ Worse produce latency due to mirroring
○ Weaker ordering guarantees (when switching between clusters
& failed publishes)
○ No improvement to consumer availability
Changing requirements - consumers
● Over time, consumers became more important
Multicluster Kafka
Kproxy
● Observability
● ACLs
○ “Dark read” config
● Single place for health checking
Kproxy publish availability
● Extremely high publish availability
● Available if even 1 cluster is healthy
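The "available if even 1 cluster is healthy" property amounts to failing over between clusters on publish. The sketch below illustrates the idea only; Kproxy's real API is not shown in the slides, so `publish_to_any`, `ClusterUnavailable`, and the `send` callables are hypothetical stand-ins.

```python
class ClusterUnavailable(Exception):
    """Raised by a (hypothetical) per-cluster send function when the publish fails."""


def publish_to_any(clusters, message):
    """Try each (name, send) pair in order; return the name of the first cluster that accepts."""
    failed = []
    for name, send in clusters:
        try:
            send(message)
            return name
        except ClusterUnavailable:
            failed.append(name)  # remember the failure and move on to the next cluster
    raise ClusterUnavailable(f"all clusters failed: {failed}")


# Stand-in send functions for the example.
def healthy(message):
    pass


def down(message):
    raise ClusterUnavailable("cluster is down")


print(publish_to_any([("cluster-a", down), ("cluster-b", healthy)], b"charge.created"))
```

The publish only fails when every candidate cluster rejects it, which is why produce availability ends up far higher than that of any single cluster.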
Kproxy consume availability
● If a cluster goes down, move publishes to another cluster
● Messages unavailable until cluster is back up
● Fewer unconsumed messages than you might think…
Kproxy consume availability - unconsumed messages
● (latest offset - committed offset) is guaranteed upper bound
● Consumer gets a stream of messages, processes them as they
come, and commits once per commit interval (Kafka's
auto.commit.interval.ms default: 5 seconds, our default: 1 second)
● Number of unconsumed messages is related to latency from Kafka
-> consumer, not commit interval
○ Commit interval only controls duplicate consume
● Applies to single Kafka cluster too
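The upper bound in the first bullet is a per-partition subtraction. A minimal sketch, assuming offsets are given as plain dictionaries keyed by (topic, partition); the function name and the sample numbers are made up for illustration.

```python
def unconsumed_upper_bound(latest_offsets, committed_offsets):
    """Sum (latest offset - committed offset) over all partitions.

    Both arguments map (topic, partition) -> offset. A partition with no
    committed offset is treated as committed at 0 (an assumption of this sketch).
    """
    return sum(
        max(latest - committed_offsets.get(tp, 0), 0)
        for tp, latest in latest_offsets.items()
    )


latest = {("payments", 0): 1200, ("payments", 1): 950}
committed = {("payments", 0): 1195, ("payments", 1): 950}
print(unconsumed_upper_bound(latest, committed))  # 5
```

Because the consumer keeps processing between commits, the true number of unconsumed messages is usually well below this bound; the gap is messages that were processed but not yet committed, which is where duplicate consumption comes from.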
High availability consume
● Publish to all clusters instead of any cluster
● Deduplicate on the consumer
● Consume availability becomes as ultra-high as produce availability
● Nobody has asked for this!
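Consumer-side deduplication could look roughly like the following. This is a sketch of one common technique (a bounded, least-recently-seen set of message IDs), not something the talk describes; it assumes every message carries a unique ID assigned by the producer, and the `Deduplicator` class is hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recently seen message IDs, up to a fixed capacity."""

    def __init__(self, capacity=10_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, message_id):
        """Return True the first time an ID is seen, False for a duplicate."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # refresh recency
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True


dedup = Deduplicator(capacity=2)
# The same message arrives once from each of two clusters.
print(dedup.first_time("msg-1"), dedup.first_time("msg-1"))
```

The capacity bounds memory at the cost of forgetting old IDs, so it has to cover the longest delay expected between copies of the same message arriving from different clusters.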
More benefits
● Zero-impact: turn off cluster for maintenance
○ Route produces to other clusters, wait for consumers to catch
up and then:
○ Unthrottled rebalance!
○ Safer version upgrades!
○ Any unknown problem
○ We do this frequently!
● Reduce number of connections to Kafka
Single point of failure: kproxy
Yes, but:
● Stateless
○ Health checking of clusters is in-memory
● Zero coordination between kproxy instances
● Safer deployment strategy
Kproxy host sets
● Isolate consumers from producers
● Isolate low-latency / high availability producers from others
● Isolate specific use-cases
Kproxy host sets - rollout
1. Single host of a low-priority host set
2. A few hosts in that host set
3. Many hosts in that host set
4. All hosts in that host set
5. Repeat for other host sets in increasing order of importance
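The staged rollout above can be expressed as a generator of deployment batches. A minimal sketch under assumptions: the slides give no numbers, so "a few" is taken as 3 hosts and "many" as half the host set, and the host set names are invented.

```python
def rollout_stages(host_sets, few=3, many_fraction=0.5):
    """Yield (host_set_name, hosts_to_deploy_next) batches.

    host_sets is a list of (name, hosts) ordered from least to most important.
    Within each host set the cumulative targets are: 1 host, `few` hosts,
    `many_fraction` of the hosts, then all hosts.
    """
    for name, hosts in host_sets:
        total = len(hosts)
        done = 0
        for target in (1, few, int(total * many_fraction), total):
            target = min(max(target, done), total)  # never shrink, never exceed the set
            if target > done:
                yield name, hosts[done:target]
                done = target


host_sets = [
    ("low-priority", [f"h{i}" for i in range(1, 11)]),
    ("charge-path", ["c1", "c2"]),
]
for name, batch in rollout_stages(host_sets):
    print(name, batch)
```

Each batch would be followed by a health check before the next one starts; that gate is left out here to keep the sketch short.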
Data locality
● Producers and consumers can request that they publish to / consume from
a set of regions
● Common producer selections:
○ Any cluster in any region (default: prefer same region, but use far
regions if all local ones are down)
○ Local region only
○ Specific far region only (if a consumer only runs there)
○ Specific locality zone (combination of regions)
● Common consumer selections
○ Local region only (by far most common)
○ All regions
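The selections above boil down to filtering and ordering clusters by region. The sketch below is illustrative only: the selection names (`"any"`, `"local"`, `"regions"`), the cluster names, and the region names are assumptions, not Kproxy's actual options.

```python
def candidate_clusters(clusters, local_region, selection="any", regions=None):
    """Return cluster names in the order they should be tried.

    clusters is a list of (cluster_name, region) pairs.
    selection: "any"     -> local clusters first, then far ones as fallback
               "local"   -> clusters in the local region only
               "regions" -> clusters in the given set of regions only
                            (a specific far region, or a locality zone)
    """
    if selection == "any":
        local = [name for name, region in clusters if region == local_region]
        far = [name for name, region in clusters if region != local_region]
        return local + far
    if selection == "local":
        return [name for name, region in clusters if region == local_region]
    if selection == "regions":
        allowed = set(regions or ())
        return [name for name, region in clusters if region in allowed]
    raise ValueError(f"unknown selection: {selection!r}")


CLUSTERS = [
    ("kafka-east-1", "us-east"),
    ("kafka-east-2", "us-east"),
    ("kafka-west-1", "us-west"),
    ("kafka-eu-1", "eu-west"),
]
print(candidate_clusters(CLUSTERS, "us-east"))
```

With the default selection a producer in `us-east` only reaches the far clusters after both local ones have been ruled out, which matches the "prefer same region" behaviour described above.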
Failure detection
● Simple or complicated?
○ By topic-partition-cluster or by cluster?
○ Use topology (broker overlap, availability zones)?
○ Share failure state?
○ Use cluster metadata (ISR)?
Failure detection - our approach
● Use actual publishes as health check
● Initially by topic-cluster, simple circuit breaker
○ Mark “healthy” or “unhealthy”, only use healthy ones
○ … but try unhealthy ones every x minutes
● New approach:
○ Variant of ϵ-greedy, aggregate by cluster
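The slides do not say which ϵ-greedy variant is used, so the following is the textbook form applied to cluster selection, as a sketch: most publishes go to the cluster with the best observed success rate, and a small fraction ϵ go to a random cluster so that an unhealthy cluster keeps getting probed and can recover. The class name and default ϵ are assumptions.

```python
import random


class EpsilonGreedySelector:
    """Pick a cluster per publish, using real publish outcomes as the health signal."""

    def __init__(self, clusters, epsilon=0.05, rng=None):
        self._stats = {c: [0, 0] for c in clusters}  # cluster -> [successes, attempts]
        self._epsilon = epsilon
        self._rng = rng or random.Random()

    def _success_rate(self, cluster):
        successes, attempts = self._stats[cluster]
        return successes / attempts if attempts else 1.0  # untried clusters look healthy

    def choose(self):
        clusters = list(self._stats)
        if self._rng.random() < self._epsilon:
            return self._rng.choice(clusters)          # explore: probe any cluster
        return max(clusters, key=self._success_rate)   # exploit: best cluster so far

    def record(self, cluster, succeeded):
        stats = self._stats[cluster]
        stats[0] += int(succeeded)
        stats[1] += 1


selector = EpsilonGreedySelector(["cluster-a", "cluster-b"], epsilon=0.05)
cluster = selector.choose()
selector.record(cluster, succeeded=True)
```

A production version would likely weight recent outcomes more heavily (for example with a sliding window), since lifetime averages react slowly; that refinement is omitted here.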
Multicluster Kafka Limitations
Limitations
● Our implementation:
○ Custom client & API
○ Small number of unconsumed messages if cluster goes down
○ Keyed publishes
● The approach:
○ Ordering
○ Transactions
Custom client & API
● Big tradeoff: other tools (Apache Flink, Kafka Connect, Spark
Streaming, etc.) don’t work off the shelf
● Kafka client protocol is very large & changes with versions
● How to express partitions across multiple clusters with a protocol
proxy?
● Kproxy is limited (ordering, transactions) - it’s obvious what
features aren’t supported because they’re not in our API
Gotchas
● Higher effective number of partitions, but can’t use it for more
consumer parallelism unless you do extra work in partition
assignment, and even then…
● Cluster capacity: need to over-provision or be able to quickly scale
to [num_cluster_replicas / (num_cluster_replicas - 1)] of normal
per-cluster load, i.e. 50% extra if using 3 clusters
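The capacity figure follows from spreading one failed cluster's traffic over the survivors. A small worked example, assuming traffic is split evenly across clusters; `num_clusters` plays the role of `num_cluster_replicas` on the slide, and the function name is made up.

```python
def surviving_cluster_load_factor(num_clusters):
    """Load on each remaining cluster, relative to normal, after one cluster is lost."""
    if num_clusters < 2:
        raise ValueError("need at least 2 clusters to survive losing one")
    return num_clusters / (num_clusters - 1)


for n in (2, 3, 4):
    factor = surviving_cluster_load_factor(n)
    print(f"{n} clusters: {factor:.2f}x normal load, {(factor - 1) * 100:.0f}% extra")
```

With 3 clusters each survivor carries 1.5x its normal load, which is the 50% headroom mentioned above; adding clusters shrinks the required headroom.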
Results
Kafka is highly available, but…
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
● Misbehaving client flooding requests
● Bugs in Kafka
● ZooKeeper dependency
● Operator error
Mitigated by multicluster / Kproxy
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
● Misbehaving client flooding requests
● Bugs in Kafka
● ZooKeeper dependency
● Operator error
Mostly Mitigated by multicluster / Kproxy
● Misbehaving client flooding requests
○ Mostly mitigated - can isolate clients with host sets
● Bugs in Kafka
○ Mostly mitigated - few bugs would affect multiple clusters
● [New] Bugs in Kproxy
○ Mostly mitigated - safe automated rollout with host sets
● Operator error
○ Always a risk
Publish availability
● SLO for other teams at Stripe:
○ 99.999% of publishes succeed within 5 seconds for high-priority
publishers (5 mins downtime/year)
○ 99.995% of publishes succeed within 5 seconds for all other
publishers (26 mins downtime/year)
● Internal team target:
○ 99.9999% (31.56 seconds downtime/year)
● Actual availability:
○ 99.99997% (9.46 seconds downtime/year)
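The downtime figures are the unavailable fraction multiplied by the length of a year. A short sketch of that arithmetic, assuming a 365.25-day year (which reproduces the 31.56-second figure); the function name is made up.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def downtime_seconds_per_year(availability_percent):
    """Convert an availability percentage into equivalent seconds of downtime per year."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for availability in (99.999, 99.995, 99.9999, 99.99997):
    seconds = downtime_seconds_per_year(availability)
    print(f"{availability}%: {seconds:.2f} s/year ({seconds / 60:.2f} min/year)")
```

This gives about 5.26 and 26.3 minutes for the two SLOs, 31.56 seconds for the internal target, and roughly 9.47 seconds for the measured availability (the slide quotes 9.46).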
Publish latency
● SLO: p50/p90/p99 of 10ms/15ms/35ms
● Actual: 3.5ms/5ms/9.5ms
We’re hiring! stripe.com/jobs
Thanks!
Questions?

6 Nines: How Stripe keeps Kafka highly-available across the globe with Donny Nadolny | Kafka Summit London 2022
