Availability is a key metric for any Kafka deployment, but when every event is critical the system must be centered around keeping publishers and consumers highly available, even when a Kafka cluster goes down. At Stripe our core business relies on Kafka, and as we outgrew a single Kafka cluster we had to build a multi-cluster system which would fit our needs while supporting a target of 99.9999% availability for our most critical use cases.
In this talk we’ll discuss our solution to this problem: an in-house proxy layer and multi-cluster topology which we’ve built and operated over the past 3 years. Our proxy layer enables multiple Kafka clusters to work in coordination across the globe, while hitting our ambitious availability targets and providing clean client abstractions.
In this talk we’ll discuss how our Kafka deployment provides: availability for both publishers and consumers in the face of cluster outages, increased security and observability, simplified cluster maintenance, and global routing for constraints such as data locality. We’ll highlight the benefits & tradeoffs of our approach, the design of our proxy layer, Kafka configuration decisions, and where we’re planning to go from here.
2. Kafka at Stripe
● Used by the vast majority of services at Stripe
● Stripe:
○ Payments & more for internet businesses
○ Our reliability affects all businesses that use us
● “Charge path” - critical services needed to process payments
○ Failure here means a customer can’t buy something and businesses lose money
3. Typical Cluster Topology
● Brokers (and ZooKeeper) spread across 3 AWS availability zones
● Topic replication factor: 3 with rack awareness
● Minimum live replicas to publish: 2
● Fault tolerance: can lose 1 machine or 1 AZ and everything works
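As a rough, illustrative sketch of what this topology implies for topic settings, using the standard Java AdminClient (the topic name, partition count, and bootstrap address are made up, not Stripe’s actual values; rack awareness itself comes from setting broker.rack on each broker, not from topic config):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Replication factor 3 lets the 3 replicas land in 3 different AZs
                // (requires broker.rack=<az> on each broker for rack-aware placement).
                NewTopic topic = new NewTopic("example-topic", 32, (short) 3)
                    // min.insync.replicas=2: with acks=all, publishes keep succeeding
                    // as long as 2 of the 3 replicas are alive.
                    .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }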
8. Kafka is highly available, but…
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
○ OS update, Kafka version upgrade
● Misbehaving client flooding requests
○ Client sets its ID based on machine name, or a UUID
● Bugs in Kafka
● ZooKeeper dependency
○ Multitenant ZooKeeper
● Operator error
9. High(er) availability Kafka - requirements
● Producers always need to be able to produce
● Consumers can fall behind with less impact
15. HA-write tradeoffs
● Pros:
○ Easy to set up
○ Much better produce availability
● Cons:
○ Worse produce latency due to mirroring
○ Weaker ordering guarantees (when switching between clusters & failed publishes)
○ No improvement to consumer availability
24. Kproxy consume availability
● If a cluster goes down, move publishes to another cluster
● Messages already on that cluster are unavailable until it’s back up
● Fewer unconsumed messages than you might think…
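A toy sketch of the “move publishes to another cluster” idea, assuming a hypothetical list with one Java producer per cluster in preference order; this is not Kproxy’s actual implementation (Kproxy also tracks cluster health, covered later under failure detection):

    import java.util.List;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class FailoverPublishSketch {
        // One producer per Kafka cluster; list order is the preference order.
        private final List<Producer<String, byte[]>> clusterProducers;

        public FailoverPublishSketch(List<Producer<String, byte[]>> clusterProducers) {
            this.clusterProducers = clusterProducers;
        }

        // Try each cluster in turn; the first successful publish wins.
        public RecordMetadata publish(String topic, String key, byte[] value) throws Exception {
            Exception lastFailure = null;
            for (Producer<String, byte[]> producer : clusterProducers) {
                try {
                    return producer.send(new ProducerRecord<>(topic, key, value)).get();
                } catch (Exception e) {
                    lastFailure = e; // this cluster looks unhealthy; fall through to the next one
                }
            }
            throw new Exception("publish failed on every cluster", lastFailure);
        }
    }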
25. Kproxy consume availability - unconsumed messages
● (latest offset - committed offset) is a guaranteed upper bound
● Consumer gets a stream of messages, processes them as they come, and commits every commit.interval.ms (Kafka default: 5 seconds, our default: 1 second)
● Number of unconsumed messages is related to latency from Kafka -> consumer, not the commit interval
○ Commit interval only controls duplicate consumption
● Applies to single Kafka cluster too
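A minimal consumer-side sketch of that upper bound, using the standard Java client over the partitions this consumer owns (treating a missing committed offset as 0 is a simplification):

    import java.util.Map;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class UnconsumedUpperBoundSketch {
        // Sum of (latest offset - committed offset) over our partitions: an upper
        // bound on messages stranded if this cluster suddenly becomes unreachable.
        static long unconsumedUpperBound(KafkaConsumer<?, ?> consumer, Set<TopicPartition> partitions) {
            Map<TopicPartition, Long> latest = consumer.endOffsets(partitions);
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);

            long total = 0;
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata c = committed.get(tp);
                long committedOffset = (c == null) ? 0 : c.offset(); // no commit yet: simplification
                total += Math.max(0, latest.getOrDefault(tp, 0L) - committedOffset);
            }
            return total;
        }
    }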
26. High availability consume
● Publish to all clusters instead of any cluster
● Deduplicate on the consumer
● Consume availability becomes as ultra-high as produce availability
27. High availability consume
● Publish to all clusters instead of any cluster
● Deduplicate on the consumer
● Consume availability becomes as ultra-high as produce availability
● Nobody has asked for this!
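If it were ever needed, consumer-side deduplication could look roughly like this toy sketch; the “message-id” header and the unbounded in-memory seen-set are assumptions for illustration, not something we run:

    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.header.Header;

    public class DedupSketch {
        // IDs already processed; a real system would bound and persist this state.
        private final Set<String> seen = new HashSet<>();

        // True if the record should be processed, false if it's a duplicate copy
        // of a message that already arrived via another cluster.
        public boolean shouldProcess(ConsumerRecord<String, byte[]> record) {
            // Assumes publishers attach a unique "message-id" header to every message.
            Header header = record.headers().lastHeader("message-id");
            if (header == null) {
                return true; // nothing to dedup on
            }
            String id = new String(header.value(), StandardCharsets.UTF_8);
            return seen.add(id); // add() returns false if this ID was already seen
        }
    }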
28. More benefits
● Zero-impact: turn off a cluster for maintenance
○ Route produces to other clusters, wait for consumers to catch up, and then:
○ Unthrottled rebalance!
○ Safer version upgrades!
○ Any unknown problem
○ We do this frequently!
● Reduce number of connections to Kafka
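A hedged sketch of the “wait for consumers to catch up” check before taking a cluster down, comparing a consumer group’s committed offsets against the latest offsets via the Java AdminClient (the group id is a placeholder; real tooling would check every group on the cluster):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class DrainCheckSketch {
        // True once the group has consumed everything on this cluster, i.e. it is
        // safe to take the cluster offline for maintenance.
        static boolean caughtUp(AdminClient admin, String groupId) throws Exception {
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Caught up when, on every partition, the committed offset has reached the end offset.
            return committed.entrySet().stream().allMatch(e ->
                latest.get(e.getKey()).offset() <= e.getValue().offset());
        }
    }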
29. Single point of failure: kproxy
Yes, but:
● Stateless
○ Health checking of clusters is in-memory
● Zero coordination between kproxy instances
● Safer deployment strategy
30. Kproxy host sets
● Isolate consumers from producers
● Isolate low-latency / high availability producers from others
● Isolate specific use-cases
32. Kproxy host sets - rollout
1. Single host of a low-priority host set
2. A few hosts in that host set
3. Many hosts in that host set
4. All hosts in that host set
5. Repeat for other host sets in increasing order of importance
34. Data locality
● Producers and consumers can request that they publish to / consume from a set of regions
● Common producer selections:
○ Any cluster in any region (default: prefer same region, but use far regions if all local ones are down)
○ Local region only
○ Specific far region only (if a consumer only runs there)
○ Specific locality zone (combination of regions)
● Common consumer selections:
○ Local region only (by far the most common)
○ All regions
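A toy sketch of the default producer selection (prefer the local region, use a far region only when no local cluster is healthy); the Cluster type and the health predicate are made up for illustration:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;
    import java.util.function.Predicate;

    public class LocalitySelectionSketch {
        record Cluster(String name, String region) {}

        // Pick a healthy cluster, preferring the local region; only fall back to a
        // far-region cluster when every local cluster is unhealthy.
        static Optional<Cluster> pick(List<Cluster> clusters, String localRegion,
                                      Predicate<Cluster> isHealthy) {
            return clusters.stream()
                .filter(isHealthy)
                .min(Comparator.comparingInt((Cluster c) -> c.region().equals(localRegion) ? 0 : 1));
        }
    }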
35. Failure detection
● Simple or complicated?
○ By topic-partition-cluster or by cluster?
○ Use topology (broker overlap, availability zones)?
○ Share failure state?
○ Use cluster metadata (ISR)?
36. Failure detection - our approach
● Use actual publishes as health check
● Initially by topic-cluster, simple circuit breaker
○ Mark “healthy” or “unhealthy”, only use healthy ones
○ … but try unhealthy ones every x minutes
● New approach:
○ Variant of ϵ-greedy, aggregate by cluster
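A toy sketch of the initial per-topic-cluster circuit breaker; the key format and the 2-minute retry interval are illustrative, and the newer ϵ-greedy variant is not shown:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CircuitBreakerSketch {
        // How long to wait before retrying a (topic, cluster) pair marked unhealthy.
        private static final Duration RETRY_AFTER = Duration.ofMinutes(2);

        // Keyed by (topic, cluster); value is when it was last marked unhealthy.
        private final Map<String, Instant> unhealthySince = new ConcurrentHashMap<>();

        // Fed with the outcome of real publishes: the publish itself is the health check.
        public void recordResult(String topic, String cluster, boolean success) {
            String key = topic + "@" + cluster;
            if (success) {
                unhealthySince.remove(key);
            } else {
                unhealthySince.put(key, Instant.now());
            }
        }

        // Healthy pairs are always usable; unhealthy ones are retried every RETRY_AFTER.
        public boolean isUsable(String topic, String cluster) {
            Instant since = unhealthySince.get(topic + "@" + cluster);
            return since == null || Instant.now().isAfter(since.plus(RETRY_AFTER));
        }
    }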
38. Limitations
● Our implementation:
○ Custom client & API
○ Small number of unconsumed messages if a cluster goes down
○ Keyed publishes
● The approach:
○ Ordering
○ Transactions
39. Custom client & API
● Big tradeoff: other tools (Apache Flink, Kafka Connect, Spark Streaming, etc.) don’t work off the shelf
40. Custom client & API
● Kafka client protocol is very large & changes with versions
● How to express partitions across multiple clusters with a protocol proxy?
● Kproxy is limited (ordering, transactions) - it’s obvious what features aren’t supported because they’re not in our API
41. Gotchas
● Higher effective number of partitions, but can’t use it for more consumer parallelism unless you do extra work in partition assignment, and even then…
● Cluster capacity: need to over-provision or be able to quickly scale each remaining cluster to num_cluster_replicas / (num_cluster_replicas - 1) of its normal load, i.e. a 50% increase if using 3 clusters
43. Kafka is highly available, but…
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
● Misbehaving client flooding requests
● Bugs in Kafka
● ZooKeeper dependency
● Operator error
44. Mitigated by multicluster / Kproxy
● Region failure
● 1 AZ failure + 1 machine failure
● 1 machine failure during rolling restart for maintenance
● Misbehaving client flooding requests
● Bugs in Kafka
● ZooKeeper dependency
● Operator error
45. Mostly Mitigated by multicluster / Kproxy
● Misbehaving client flooding requests
○ Mostly mitigated - can isolate clients with host sets
● Bugs in Kafka
○ Mostly mitigated - few bugs would affect multiple clusters
● [New] Bugs in Kproxy
○ Mostly mitigated - safe automated rollout with host sets
● Operator error
○ Always a risk
46. Publish availability
● SLO for other teams at Stripe:
○ 99.999% of publishes succeed within 5 seconds for high-priority publishers (5 mins downtime/year)
○ 99.995% of publishes succeed within 5 seconds for all other publishers (26 mins downtime/year)
● Internal team target:
○ 99.9999% (31.56 seconds downtime/year)
47. Publish availability
● SLO for other teams at Stripe:
○ 99.999% of publishes succeed within 5 seconds for high-priority publishers (5 mins downtime/year)
○ 99.995% of publishes succeed within 5 seconds for all other publishers (26 mins downtime/year)
● Internal team target:
○ 99.9999% (31.56 seconds downtime/year)
● Actual availability:
○ 99.99997% (9.46 seconds downtime/year)