
Reliability Guarantees for Apache Kafka

Presentation by Gwen Shapira as part 2 of the Best Practices for Apache Kafka™ in Production, Confluent Online Talk Series


  1. Reliability Guarantees in Apache Kafka. Gwen Shapira, Product Manager, @gwenshap
  2. Streaming Platform (diagram): producers, consumers, streaming applications, and connectors, all built around Apache Kafka.
  3. Versions of Apache Kafka
     • 0.7.0 <- Please don't
     • 0.8.0 <- Replication exists, and it will continue evolving with every release
     • 0.8.2 <- New producer, offset commits to Kafka
     • 0.9.0 <- New consumer, Connect APIs
     • 0.10.0 <- New consumer improvements, Streams APIs
     • 0.11.0 <- Idempotent producer, transactional semantics, exactly once
     • Future <- Out-of-the-box reliable configuration: https://issues.apache.org/jira/browse/KAFKA-5795
  4. If Kafka is a critical piece of our pipeline:
     • Can we be 100% sure that our data will get there?
     • Can we lose messages?
     • How do we verify?
     • Whose fault is it?
  5. Distributed Systems
     • Things fail
     • Systems are designed to tolerate failure
     • We must expect failures and design our code and configure our systems to handle them
  6. Data Flow (diagram): on the client machine, data travels from the application thread through the Kafka client, the O/S socket buffer, and the NIC; on the broker machine, it passes through the NIC and O/S socket buffer into the page cache and onto disk, then out to replication. An ack or exception comes back to the client's callback. Failures (✗) can happen at every hop along the way.
  7. Kafka is super reliable… if you know how to configure it that way.
  8. Replication is your friend
     • Kafka protects against failures by replicating data
     • The unit of replication is the partition
     • One replica is designated as the Leader
     • Follower replicas fetch data from the leader
     • The leader holds the list of "in-sync" replicas
  9. Replication and ISRs (diagram): topic my_topic has 3 partitions with 3 replicas each, spread across brokers 100, 101, and 102. Partition 0: leader 100, ISR 101,102. Partition 1: leader 101, ISR 100,102. Partition 2: leader 102, ISR 101,100. (A sketch for inspecting this programmatically follows below.)
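To make the leader/ISR picture concrete, here is a minimal Java sketch (not from the deck) that prints each partition's leader and ISR using the AdminClient introduced in 0.11.0; the broker address localhost:9092 is an assumption:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class ShowIsr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singleton("my_topic"))
                                             .all().get().get("my_topic");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // each partition reports its leader broker and current in-sync replica set
                    System.out.printf("partition=%d leader=%d isr=%s%n",
                            p.partition(), p.leader().id(), p.isr());
                }
            }
        }
    }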
  10. ISR: two things make a replica in-sync:
     • replica.lag.time.max.ms – a replica that didn't fetch recently, or is too far behind, drops out
     • A live connection to ZooKeeper
  11. Terminology
     • Acked – Producers will not retry sending. What counts as acked depends on the producer setting.
     • Committed – Only when the message got to all ISRs (so future leaders have it). Consumers can read it. replica.lag.time.max.ms controls how long a dead replica can prevent consumers from reading.
     • Committed Offsets – The consumer told Kafka the latest offsets it read. By default the consumer will not see these events again.
  12. Replication with acks = all (diagram): the producer waits for all in-sync replicas to reply; replicas 1, 2, and 3 each have message 100.
  13. (diagram) Replica 3 stopped replicating for some reason: replicas 1 and 2 have messages 100 and 101, replica 3 only has 100. Message 100 is acked under acks = all and "committed"; message 101 is acked under acks = 1 but not "committed".
  14. (diagram) One replica drops out of the ISR, or goes offline: with replica 3 out of the ISR, all messages (100 and 101) are now acked and committed.
  15. (diagram) The 2nd replica drops out, or is offline: the remaining replica 1 accepts messages 102, 103, and 104 alone.
  16. (diagram) Now we're in trouble: replica 1, the only copy of messages 102-104, fails (✗).
  17. (diagram) If replica 2 or 3 comes back online before the leader, you will lose data: messages 102-104 were all "acked" and "committed", yet only the failed leader had them.
  18. So what to do:
     • Disable unclean leader election: unclean.leader.election.enable = false (the default from 0.11.0)
     • Set the replication factor: default.replication.factor = 3
     • Set minimum ISRs: min.insync.replicas = 2
     (A sketch of these settings follows below.)
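A minimal sketch of these settings (not from the deck), applying unclean.leader.election.enable and min.insync.replicas as topic-level overrides while creating a topic with replication factor 3 via the 0.11 AdminClient; the same two settings can instead go cluster-wide in server.properties. The broker address is an assumption:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class SafeTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            try (AdminClient admin = AdminClient.create(props)) {
                Map<String, String> configs = new HashMap<>();
                // per-topic overrides; these can also be set broker-wide in server.properties
                configs.put("min.insync.replicas", "2");
                configs.put("unclean.leader.election.enable", "false");
                NewTopic topic = new NewTopic("my_topic", 3, (short) 3) // 3 partitions, RF 3
                        .configs(configs);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }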
  19. Replication with replication factor = 3 and min ISR = 2 (diagram): replicas 1, 2, and 3 each have message 100.
  20. (diagram) One replica drops out of the ISR, or goes offline: replicas 1 and 2 carry on with messages 100 and 101.
  21. (diagram) The 2nd replica fails, or is out of sync: with fewer than min.insync.replicas in sync, the broker stops accepting writes, and messages 102, 103, and 104 buffer in the producer.
  23. Producer Internals (diagram): the producer batches messages in a buffer. Application threads call send(); messages (M0-M3) drain into batches; a failed response triggers a retry, and the result, metadata or exception, is delivered by updating the Future and invoking the callback.
  24. Basics
     • Durability: acks (request.required.acks in the old producer)
       • 0 – The message is written to the network (buffer)
       • 1 – The message is written to the leader
       • all – The producer gets an ack after all ISRs receive the data; the message is committed
     • Retries:
       • Default is 0
       • The downtime you need to survive, divided by retry.backoff.ms, gives the retry count you need
       • KIP-91 may improve things
     • Memory for retries:
       • Have plenty of buffer.memory
       • max.block.ms = Long.MAX_VALUE
       • Or handle the BufferExhaustedException / TimeoutException yourself
     • Retries with multiple in-flight requests could lead to message re-ordering
     • Don't forget to close the producer: producer.close() will block until in-flight requests complete
     (See the configuration sketch below.)
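Putting slide 24 into code: a hedged sketch of a durability-first producer configuration. The bootstrap address, topic, record contents, and serializers are assumptions; drop the snippet into a method of your own:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for all in-sync replicas
    props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);  // ride out broker downtime
    props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "100");
    props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864");   // 64 MB of room for retries
    props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, String.valueOf(Long.MAX_VALUE)); // block, don't drop
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1"); // no re-ordering on retry

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    try {
        producer.send(new ProducerRecord<>("my_topic", "key", "value"));
    } finally {
        producer.close(); // blocks until in-flight sends complete
    }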
  25. "New" Producer: all calls are non-blocking and async. Three options for checking for failures (sketched below):
     • Don't. Just call send() and YOLO!
     • Immediately block for the response: send().get()
     • Do follow-up work in a callback (but not retries)
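A minimal sketch of the last two options, reusing the producer from the previous sketch; the record contents are assumptions:

    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    ProducerRecord<String, String> record = new ProducerRecord<>("my_topic", "key", "value");

    // option 2: block for the response; an exception here means retries are exhausted
    try {
        RecordMetadata metadata = producer.send(record).get();
    } catch (ExecutionException | InterruptedException e) {
        // the send definitively failed (or we were interrupted): log, alert, or re-raise
    }

    // option 3: do follow-up work in a callback (but not retries; the producer handles those)
    producer.send(record, new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                // the producer's own retries are exhausted; record the failure somewhere durable
            }
        }
    });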
  27. Consumer: three choices, one good choice for the Consumer API:
     • Simple Consumer
     • High-Level Consumer (ZookeeperConsumer)
     • New KafkaConsumer <- the good choice
  28. New Consumer – auto commit

      props.put("enable.auto.commit", "true");
      props.put("auto.commit.interval.ms", "10000"); // commit automatically every 10 seconds
      KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
      consumer.subscribe(Arrays.asList("foo", "bar"));
      while (true) {
          ConsumerRecords<String, String> records = consumer.poll(100);
          for (ConsumerRecord<String, String> record : records) {
              processAndUpdateDB(record);
          }
      }

      What if we crash after 8 seconds? Those 8 seconds of already-processed records were never committed, so they will be re-delivered and processed again after the restart.
  29. New Consumer – manual commit

      props.put("enable.auto.commit", "false");
      KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
      consumer.subscribe(Arrays.asList("foo", "bar"));
      while (true) {
          ConsumerRecords<String, String> records = consumer.poll(100);
          for (ConsumerRecord<String, String> record : records)
              processAndUpdateDB(record);
          consumer.commitSync(); // commit the entire batch, outside the processing loop!
      }
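A common variation (not shown in the deck): commit asynchronously inside the loop for throughput, then commit synchronously once on shutdown so the final offsets stick. The running flag and processAndUpdateDB are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;

    try {
        while (running) { // running: a shutdown flag you flip elsewhere (assumption)
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records)
                processAndUpdateDB(record);
            consumer.commitAsync(); // non-blocking; a failed commit is covered by the next one
        }
    } finally {
        try {
            consumer.commitSync(); // one blocking commit on the way out
        } finally {
            consumer.close();
        }
    }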
  30. Rebalances Happen
  31. Handling Rebalances

      private class HandleRebalance implements ConsumerRebalanceListener {
          public void onPartitionsAssigned(Collection<TopicPartition> tp) {
              // nothing to do on assignment
          }

          public void onPartitionsRevoked(Collection<TopicPartition> tp) {
              System.out.println("Lost partitions in rebalance. Committing current offsets: "
                      + currentOffsets);
              // currentOffsets: a Map<TopicPartition, OffsetAndMetadata> that the poll loop
              // keeps up to date as records finish processing (see the sketch after slide 32)
              consumer.commitSync(currentOffsets);
          }
      }
  32. Minimize Duplicates for At-Least-Once Consuming:
     1. Commit your own offsets: enable.auto.commit = false
     2. Use a rebalance listener
     3. Commit only what you are done processing
     (A sketch combining all three follows below.)
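A sketch tying the three steps together, assuming the HandleRebalance listener from slide 31 commits the same currentOffsets map; processAndUpdateDB is the deck's placeholder:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();
    consumer.subscribe(Arrays.asList("foo", "bar"), new HandleRebalance());
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            processAndUpdateDB(record);
            // record the *next* offset to read, only after processing succeeded
            currentOffsets.put(new TopicPartition(record.topic(), record.partition()),
                               new OffsetAndMetadata(record.offset() + 1));
        }
        consumer.commitSync(currentOffsets); // commit only what we finished processing
    }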
  33. Exactly Once Semantics
     • At most once is easy
     • At least once is not bad either – commit after you are 100% sure the data is safe
     • Exactly once is tricky:
       • Commit data and offsets in one transaction
       • Idempotent producer
     • Kafka Connect: many connectors (especially Confluent's) are exactly once, by using an external database to write events and store offsets in one transaction
     • Kafka Streams: starting at 0.11.0, has easy-to-configure exactly-once (processing.guarantee = exactly_once)
     • Other stream processing systems have their own thing
     (A sketch of the transactional producer API follows below.)
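For the "commit data and offsets in one transaction" bullet, here is a hedged sketch of the 0.11 transactional producer API; the topic names, partition, offset, transactional.id, and consumer group are all assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.errors.ProducerFencedException;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");       // assumption
    props.put("enable.idempotence", "true");                // idempotent producer
    props.put("transactional.id", "my-transactional-app");  // assumption: stable per instance
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    producer.initTransactions();
    try {
        producer.beginTransaction();
        producer.send(new ProducerRecord<>("output-topic", "key", "value"));
        // commit the consumed offsets (last processed offset + 1) in the same transaction
        producer.sendOffsetsToTransaction(
                Collections.singletonMap(new TopicPartition("input-topic", 0),
                                         new OffsetAndMetadata(43L)),
                "my-consumer-group"); // assumption
        producer.commitTransaction();
    } catch (ProducerFencedException e) {
        producer.close(); // fatal: another instance with the same transactional.id took over
    } catch (KafkaException e) {
        producer.abortTransaction(); // abort, then retry the whole batch
    }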
  34. How do we test Kafka?

      """Replication tests.
      These tests verify that replication provides simple durability guarantees by
      checking that data acked by brokers is still available for consumption in the
      face of various failure scenarios.

      Setup: 1 zk, 3 kafka nodes, 1 topic with partitions=3, replication-factor=3,
      and min.insync.replicas=2

      - Produce messages in the background
      - Consume messages in the background
      - Drive broker failures (shutdown, or bounce repeatedly with kill -15 or kill -9)
      - When done driving failures, stop producing, and finish consuming
      - Validate that every acked message was consumed
      """
  35. Monitoring for Data Loss
  36. And catching duplicates too
  37. Monitoring for Data Loss
     • Monitor for producer errors – watch the retry numbers
     • Monitor consumer lag – MaxLag or via offsets
     • Each message contains a CreateTime timestamp
     • Each producer can report message counts and offsets to a special topic
     • Each consumer reports message counts to another special topic
     • Reconcile the results
     (A small lag-monitoring sketch follows below.)
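A small sketch (not from the deck) of two of these signals: the consumer's records-lag-max metric and per-record CreateTime age. It assumes a running KafkaConsumer named consumer, a current batch named records, and a threshold MAX_LAG_MS of your choosing:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.common.Metric;

    // 1) max lag in messages, from the consumer's own metrics
    for (Metric metric : consumer.metrics().values()) {
        if (metric.metricName().name().equals("records-lag-max")) {
            System.out.println("records-lag-max: " + metric.value());
        }
    }

    // 2) end-to-end age in ms, from each record's CreateTime timestamp
    for (ConsumerRecord<String, String> record : records) {
        long ageMs = System.currentTimeMillis() - record.timestamp();
        if (ageMs > MAX_LAG_MS) {
            System.err.println("Record is " + ageMs + " ms old: the consumer is falling behind");
        }
    }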
  38. Be Safe, Not Sorry
     • acks = all
     • max.block.ms = Long.MAX_VALUE
     • retries = Integer.MAX_VALUE
     • (max.in.flight.requests.per.connection = 1)
     • producer.close()
     • replication.factor >= 3
     • min.insync.replicas = 2
     • unclean.leader.election.enable = false
     • enable.auto.commit = false
     • Commit after processing
     • Monitor!
  39. Thank You!
