Availability is a key metric for any Kafka deployment, but when every event is critical the system must be centered around keeping publishers and consumers highly available, even when a Kafka cluster goes down. At Stripe our core business relies on Kafka, and as we outgrew a single Kafka cluster we had to build a multi-cluster system which would fit our needs while supporting a target of 99.9999% availability for our most critical use cases.
In this talk we’ll discuss our solution to this problem: an in-house proxy layer and multi-cluster topology which we’ve built and operated over the past 3 years. Our proxy layer enables multiple Kafka clusters to work in coordination across the globe, while hitting our ambitious availability targets and providing clean client abstractions.
In this talk we’ll discuss how our Kafka deployment provides: availability for both publishers and consumers in the face of cluster outages, increased security and observability, simplified cluster maintenance, and global routing for constraints such as data locality. We’ll highlight the benefits & tradeoffs of our approach, the design of our proxy layer, Kafka configuration decisions, and where we’re planning to go from here.
2. Kafka at Stripe
● Used by the vast majority of services at Stripe
● Stripe:
○ Payments & more for internet businesses
○ Our reliability affects all businesses that use us
● “Charge path” - critical services needed to process payments
○ Failure here means a customer can’t buy something and businesses lose money
3. Typical Cluster Topology
● Brokers (and ZooKeeper) spread across 3 AWS availability zones
● Topic replication factor: 3 with rack awareness
● Minimum live replicas to publish: 2
● Fault tolerance: can lose 1 machine or 1 AZ and everything works
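As a rough, illustrative sketch of what this topology implies for topic settings, using the standard Java AdminClient (the topic name, partition count, and bootstrap address are made up, not Stripe’s actual values; rack awareness itself comes from setting broker.rack on each broker, not from topic config):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Replication factor 3 lets the 3 replicas land in 3 different AZs
                // (requires broker.rack=<az> on each broker for rack-aware placement).
                NewTopic topic = new NewTopic("example-topic", 32, (short) 3)
                    // min.insync.replicas=2: with acks=all, publishes keep succeeding
                    // as long as 2 of the 3 replicas are alive.
                    .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }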
8. Kafka is highly available, but…
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
○ OS update, Kafka version upgrade
● Misbehaving client flooding requests
○ Client sets its ID based on machine name, or a UUID
● Bugs in Kafka
● ZooKeeper dependency
○ Multitenant ZooKeeper
● Operator error
9. High(er) availability Kafka - requirements
● Producers always need to be able to produce
● Consumers can fall behind with less impact
15. HA-write tradeoffs
● Pros:
○ Easy to set up
○ Much better produce availability
● Cons:
○ Worse produce latency due to mirroring
○ Weaker ordering guarantees (when switching between clusters & failed publishes)
○ No improvement to consumer availability
24. Kproxy consume availability
● If a cluster goes down, move publishes to another cluster
● Messages already on that cluster are unavailable until it’s back up
● Fewer unconsumed messages than you might think…
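A toy sketch of the “move publishes to another cluster” idea, assuming a hypothetical list with one Java producer per cluster in preference order; this is not Kproxy’s actual implementation (Kproxy also tracks cluster health, covered later under failure detection):

    import java.util.List;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class FailoverPublishSketch {
        // One producer per Kafka cluster; list order is the preference order.
        private final List<Producer<String, byte[]>> clusterProducers;

        public FailoverPublishSketch(List<Producer<String, byte[]>> clusterProducers) {
            this.clusterProducers = clusterProducers;
        }

        // Try each cluster in turn; the first successful publish wins.
        public RecordMetadata publish(String topic, String key, byte[] value) throws Exception {
            Exception lastFailure = null;
            for (Producer<String, byte[]> producer : clusterProducers) {
                try {
                    return producer.send(new ProducerRecord<>(topic, key, value)).get();
                } catch (Exception e) {
                    lastFailure = e; // this cluster looks unhealthy; fall through to the next one
                }
            }
            throw new Exception("publish failed on every cluster", lastFailure);
        }
    }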
25. Kproxy consume availability - unconsumed messages
● (latest offset - committed offset) is a guaranteed upper bound
● Consumer gets a stream of messages, processes them as they come, and commits every commit.interval.ms (Kafka default: 5 seconds, our default: 1 second)
● Number of unconsumed messages is related to latency from Kafka -> consumer, not the commit interval
○ Commit interval only controls duplicate consumption
● Applies to single Kafka cluster too
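A minimal consumer-side sketch of that upper bound, using the standard Java client over the partitions this consumer owns (treating a missing committed offset as 0 is a simplification):

    import java.util.Map;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class UnconsumedUpperBoundSketch {
        // Sum of (latest offset - committed offset) over our partitions: an upper
        // bound on messages stranded if this cluster suddenly becomes unreachable.
        static long unconsumedUpperBound(KafkaConsumer<?, ?> consumer, Set<TopicPartition> partitions) {
            Map<TopicPartition, Long> latest = consumer.endOffsets(partitions);
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);

            long total = 0;
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata c = committed.get(tp);
                long committedOffset = (c == null) ? 0 : c.offset(); // no commit yet: simplification
                total += Math.max(0, latest.getOrDefault(tp, 0L) - committedOffset);
            }
            return total;
        }
    }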
26. High availability consume
● Publish to all clusters instead of any cluster
● Deduplicate on the consumer
● Consume availability becomes as ultra-high as produce availability
27. High availability consume
● Publish to all clusters instead of any cluster
● Deduplicate on the consumer
● Consume availability becomes as ultra-high as produce availability
● Nobody has asked for this!
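If it were ever needed, consumer-side deduplication could look roughly like this toy sketch; the “message-id” header and the unbounded in-memory seen-set are assumptions for illustration, not something we run:

    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.header.Header;

    public class DedupSketch {
        // IDs already processed; a real system would bound and persist this state.
        private final Set<String> seen = new HashSet<>();

        // True if the record should be processed, false if it's a duplicate copy
        // of a message that already arrived via another cluster.
        public boolean shouldProcess(ConsumerRecord<String, byte[]> record) {
            // Assumes publishers attach a unique "message-id" header to every message.
            Header header = record.headers().lastHeader("message-id");
            if (header == null) {
                return true; // nothing to dedup on
            }
            String id = new String(header.value(), StandardCharsets.UTF_8);
            return seen.add(id); // add() returns false if this ID was already seen
        }
    }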
28. More benefits
● Zero-impact: turn off a cluster for maintenance
○ Route produces to other clusters, wait for consumers to catch up, and then:
○ Unthrottled rebalance!
○ Safer version upgrades!
○ Any unknown problem
○ We do this frequently!
● Reduce number of connections to Kafka
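A hedged sketch of the “wait for consumers to catch up” check before taking a cluster down, comparing a consumer group’s committed offsets against the latest offsets via the Java AdminClient (the group id is a placeholder; real tooling would check every group on the cluster):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class DrainCheckSketch {
        // True once the group has consumed everything on this cluster, i.e. it is
        // safe to take the cluster offline for maintenance.
        static boolean caughtUp(AdminClient admin, String groupId) throws Exception {
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Caught up when, on every partition, the committed offset has reached the end offset.
            return committed.entrySet().stream().allMatch(e ->
                latest.get(e.getKey()).offset() <= e.getValue().offset());
        }
    }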
29. Single point of failure: kproxy
Yes, but:
● Stateless
○ Health checking of clusters is in-memory
● Zero coordination between kproxy instances
● Safer deployment strategy
30. Kproxy host sets
● Isolate consumers from producers
● Isolate low-latency / high availability producers from others
● Isolate specific use-cases
32. Kproxy host sets - rollout
1. Single host of a low-priority host set
2. A few hosts in that host set
3. Many hosts in that host set
4. All hosts in that host set
5. Repeat for other host sets in increasing order of importance
34. Data locality
● Producers and consumers can request that they publish to / consume from a set of regions
● Common producer selections:
○ Any cluster in any region (default: prefer same region, but use far regions if all local ones are down)
○ Local region only
○ Specific far region only (if a consumer only runs there)
○ Specific locality zone (combination of regions)
● Common consumer selections:
○ Local region only (by far the most common)
○ All regions
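A toy sketch of the default producer selection (prefer the local region, use a far region only when no local cluster is healthy); the Cluster type and the health predicate are made up for illustration:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;
    import java.util.function.Predicate;

    public class LocalitySelectionSketch {
        record Cluster(String name, String region) {}

        // Pick a healthy cluster, preferring the local region; only fall back to a
        // far-region cluster when every local cluster is unhealthy.
        static Optional<Cluster> pick(List<Cluster> clusters, String localRegion,
                                      Predicate<Cluster> isHealthy) {
            return clusters.stream()
                .filter(isHealthy)
                .min(Comparator.comparingInt((Cluster c) -> c.region().equals(localRegion) ? 0 : 1));
        }
    }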
35. Failure detection
● Simple or complicated?
○ By topic-partition-cluster or by cluster?
○ Use topology (broker overlap, availability zones)?
○ Share failure state?
○ Use cluster metadata (ISR)?
36. Failure detection - our approach
● Use actual publishes as health check
● Initially by topic-cluster, simple circuit breaker
○ Mark “healthy” or “unhealthy”, only use healthy ones
○ … but try unhealthy ones every x minutes
● New approach:
○ Variant of ϵ-greedy, aggregate by cluster
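A toy sketch of the initial per-topic-cluster circuit breaker; the key format and the 2-minute retry interval are illustrative, and the newer ϵ-greedy variant is not shown:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CircuitBreakerSketch {
        // How long to wait before retrying a (topic, cluster) pair marked unhealthy.
        private static final Duration RETRY_AFTER = Duration.ofMinutes(2);

        // Keyed by (topic, cluster); value is when it was last marked unhealthy.
        private final Map<String, Instant> unhealthySince = new ConcurrentHashMap<>();

        // Fed with the outcome of real publishes: the publish itself is the health check.
        public void recordResult(String topic, String cluster, boolean success) {
            String key = topic + "@" + cluster;
            if (success) {
                unhealthySince.remove(key);
            } else {
                unhealthySince.put(key, Instant.now());
            }
        }

        // Healthy pairs are always usable; unhealthy ones are retried every RETRY_AFTER.
        public boolean isUsable(String topic, String cluster) {
            Instant since = unhealthySince.get(topic + "@" + cluster);
            return since == null || Instant.now().isAfter(since.plus(RETRY_AFTER));
        }
    }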
38. Limitations
● Our implementation:
○ Custom client & API
○ Small number of unconsumed messages if a cluster goes down
○ Keyed publishes
● The approach:
○ Ordering
○ Transactions
39. Custom client & API
● Big tradeoff: other tools (Apache Flink, Kafka Connect, Spark Streaming, etc.) don’t work off the shelf
40. Custom client & API
● Kafka client protocol is very large & changes with versions
● How to express partitions across multiple clusters with a protocol proxy?
● Kproxy is limited (ordering, transactions) - it’s obvious what features aren’t supported because they’re not in our API
41. Gotchas
● Higher effective number of partitions, but can’t use it for more consumer parallelism unless you do extra work in partition assignment, and even then…
● Cluster capacity: need to over-provision or be able to quickly scale each remaining cluster to num_cluster_replicas / (num_cluster_replicas - 1) of its normal load, i.e. a 50% increase if using 3 clusters
43. Kafka is highly available, but…
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
● Misbehaving client flooding requests
● Bugs in Kafka
● ZooKeeper dependency
● Operator error
44. Mitigated by multicluster / Kproxy
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
● Misbehaving client flooding requests
● Bugs in Kafka
● ZooKeeper dependency
● Operator error
45. Mostly Mitigated by multicluster / Kproxy
● Misbehaving client flooding requests
○ Mostly mitigated - can isolate clients with host sets
● Bugs in Kafka
○ Mostly mitigated - few bugs would affect multiple clusters
● [New] Bugs in Kproxy
○ Mostly mitigated - safe automated rollout with host sets
● Operator error
○ Always a risk
46. Publish availability
● SLO for other teams at Stripe:
○ 99.999% of publishes succeed within 5 seconds for high-priority publishers (5 mins downtime/year)
○ 99.995% of publishes succeed within 5 seconds for all other publishers (26 mins downtime/year)
● Internal team target:
○ 99.9999% (31.56 seconds downtime/year)
47. Publish availability
● SLO for other teams at Stripe:
○ 99.999% of publishes succeed within 5 seconds for high-priority publishers (5 mins downtime/year)
○ 99.995% of publishes succeed within 5 seconds for all other publishers (26 mins downtime/year)
● Internal team target:
○ 99.9999% (31.56 seconds downtime/year)
● Actual availability:
○ 99.99997% (9.46 seconds downtime/year)