
Meet Apache Kafka: data streaming in your hands

Apache Kafka has emerged as a leading platform for building real-time data pipelines. It provides high-throughput, low-latency messaging, as well as sophisticated development options covering all the stages of a distributed data streaming pipeline, from ingestion to processing. In this session we'll introduce Apache Kafka, its architecture and how it works, and finally the wider ecosystem, including Kafka Connect and Kafka Streams, which make this platform useful in a lot of different scenarios.


  1. Meet Apache Kafka: data streaming in your hands (Paolo Patierno, Principal Software Engineer @ Red Hat, 21/11/2018)
  2. Who am I? @ppatierno ● Principal Software Engineer @ Red Hat ○ Messaging & IoT team ● Lead/Committer @ Eclipse Foundation ○ Hono, Paho and Vert.x projects ● Microsoft MVP Azure/IoT ● Hacking low-constrained devices in spare time (from a previous life) ● Dad of two ● VR46 fan (I don't like MM93) ● Trying to be a runner ...
  3. Kafka? “Did you join the wrong meetup?”
  4. What is Kafka? At the beginning: “… a publish/subscribe messaging system …”
  5. What is Kafka? … today: “… a streaming data platform …”
  6. What is Kafka? … but at the core: “… a distributed, horizontally-scalable, fault-tolerant commit log …”
  7. What is Kafka? ● Developed at LinkedIn back in 2010, open sourced in 2011 ● Designed to be fast, scalable, durable and available ● Distributed by nature ● Data partitioning (sharding) ● High throughput / low latency ● Ability to handle a huge number of consumers
  8. Applications data flow, without using Kafka [diagram: every application integrates point-to-point with web apps, microservices, logging, monitoring and analytics systems]
  9. Applications data flow, Kafka as a central hub [diagram: applications, web apps, microservices, monitoring and analytics all exchange data through Apache Kafka]
  10. Common use cases ● Messaging ○ As in “traditional” messaging systems ● Website activity tracking ○ Events like page views, searches, … ● Operational metrics ○ Alerting and reporting on operational metrics ● Log aggregation ○ Collecting logs from multiple services ● Stream processing ○ Reading, processing and writing streams for real-time analysis
  11. Key use cases [diagram: three scenarios, each centered on Apache Kafka: events buffer for data analytics/ML, event-driven microservices, and a bridge from on-prem to cloud native]
  12. Microservices architectures: orchestration vs choreography
  13. Apache Kafka ecosystem ● Apache Kafka has an ecosystem consisting of many components/tools ○ Kafka Core ■ Broker ■ Client libraries (producer, consumer, admin) ○ Kafka Connect ○ Kafka Streams ○ Mirror Maker
  14. Apache Kafka ecosystem: Apache Kafka components ● Kafka Broker ○ Central component responsible for hosting topics and delivering messages ○ One or more brokers run in a cluster alongside a ZooKeeper ensemble ● Kafka Producers and Consumers ○ Java-based clients for sending and receiving messages ● Kafka Admin tools ○ Java- and Scala-based tools for managing Kafka brokers ○ Managing topics, ACLs, monitoring, etc.
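The admin functionality is also available programmatically. As a minimal sketch (not from the slides; the broker address, topic name and settings are placeholders), creating a topic with three partitions and two replicas through the Java AdminClient could look like this:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed local broker address; adjust for your cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // "my-topic" is a hypothetical name: 3 partitions, replication factor 2
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```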
  15. Kafka & ZooKeeper [diagram: applications and admin tools talk to the Kafka brokers, which coordinate through a ZooKeeper ensemble]
  16. Kafka ecosystem: Apache Kafka components ● Kafka Connect ○ Framework for transferring data between Kafka and other data systems ○ Facilitates data conversion, scaling, load balancing, fault tolerance, … ○ Connector plugins are deployed into a Kafka Connect cluster ○ Well-defined API for creating new connectors (sink/source) ○ Apache Kafka itself includes only file sink and file source plugins (reading records from a file and posting them as Kafka messages / writing Kafka messages to files) ○ Many additional plugins are available outside of Apache Kafka ● Main use case: CDC (Change Data Capture)
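As a hedged sketch of how a connector is configured (the connector name, input file and topic below are invented for illustration), the file source plugin shipped with Apache Kafka takes a properties file along these lines:

```properties
# Illustrative standalone configuration for the built-in file source connector
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# Hypothetical input file; each new line becomes a record on the topic below
file=/tmp/input.txt
topic=connect-test
```

In standalone mode this would be launched with bin/connect-standalone.sh together with a worker configuration file.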
  17. Kafka Connect [diagram: external source and sink systems attached to Kafka through the Connect API]
  18. Kafka ecosystem: Apache Kafka components ● Kafka Streams ○ Stream processing framework ○ Streams are Kafka topics (as input and output) ○ It's really just a Java library to include in your application ○ Scales the stream application horizontally ○ Creates a topology of processing nodes (filter, map, join, etc.) acting on a stream ○ Low-level Processor API ○ High-level DSL ○ Uses “internal” topics (when re-partitioning is needed or for “stateful” transformations)
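A minimal sketch of the high-level DSL (topic names and the filtering predicate are invented for illustration): read a stream from one topic, keep some records, and write the result to another topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Application id: used as consumer group id and prefix for internal topics
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // Hypothetical predicate: forward only non-empty values
        input.filter((key, value) -> value != null && !value.isEmpty())
             .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```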
  19. Kafka Streams API [diagram: a streams application consuming from input topics, running processing steps, and producing to output topics]
  20. Kafka ecosystem: Extract … Transform … Load [diagram: source system → Connect API → Streams API → Connect API → sink system]
  21. Kafka ecosystem: Apache Kafka components ● Mirror Maker ○ Kafka clusters do not work well when split across multiple datacenters ○ Low bandwidth, high latency ○ When working across multiple datacenters it is recommended to set up an independent cluster in each datacenter and mirror the data ○ Tool for replicating topics between different clusters
  22. Topic & Partitions: topic anatomy [diagram: a topic split into partitions 0, 1 and 2; producers append writes at the new end of each partition, consumers read from the old end]
  23. Replication: Leaders & Followers ● Replicas are “backups” for a partition ○ They provide redundancy ● It's the way Kafka guarantees availability and durability in case of node failures ○ Kafka tolerates “replicas - 1” dead brokers before losing data ● Two roles: ○ Leader: the replica used by producers/consumers for exchanging messages ○ Followers: all the other replicas ■ They don't serve client requests ■ They replicate messages from the leader to stay “in-sync” (ISR) ○ A replica changes its role as brokers come and go ● Message ordering is guaranteed only “per partition”
  24. Partitions distribution: Leaders & Followers [diagram: partitions of topics T1 and T2, each with a leader and followers spread across brokers 1, 2 and 3] ● Leaders and followers are spread across the cluster ○ producers/consumers connect to leaders ○ multiple connections are needed for reading different partitions
  25. Partitions distribution: leader election [diagram: same layout as the previous slide, with one broker down] ● A broker hosting a leader partition goes down ● A new leader partition is elected on a different node
  26. Clients ● They are really “smart” (unlike in “traditional” messaging) ● Configured with a bootstrap servers list for fetching initial metadata ○ Where are the topics of interest? Connect to the brokers which hold the partition leaders ○ The producer specifies the destination partition ○ The consumer handles the message offsets to read ○ If an error happens, refresh the metadata (something has changed in the cluster) ● Batching on producing/consuming
  27. Producers & Consumers: writing/reading to/from leaders [diagram: producers and consumers connecting to the leader of each partition across brokers 1, 2 and 3]
  28. Producers ● Destination partition computed on the client ○ Round robin ○ Determined by hashing the “key” in the message ○ Custom partitioning ● Writes messages to the “leader” of a partition ● Acknowledgement: ○ No ack ○ Ack when the message is written to the “leader” ○ Ack when the message is also replicated to the “in-sync” replicas
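A minimal producer sketch (the broker address, topic and record contents are placeholders): the key drives the partition choice through the default hashing partitioner, and acks=all requests the strongest acknowledgement level listed above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all" = ack only after the in-sync replicas have the message;
        // "1" = ack on leader write; "0" = no ack
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition (the default partitioner hashes the key)
            producer.send(new ProducerRecord<>("my-topic", "sensor-1", "21.5"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        }
                    });
        }
    }
}
```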
  29. Producers: partitioners [diagram: three strategies side by side: the default partitioner hashes the “key”, so messages with the same key land in the same partition; with a null “key” messages are spread round robin; a custom partitioner applies its own mapping]
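The custom partitioning option plugs in through the Partitioner interface. A hedged sketch, where the routing logic (hash of the key's string form) is invented purely for illustration:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Illustrative custom partitioner: route by the hash of the key's string form
public class MyPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key == null) {
            return 0; // arbitrary choice for key-less messages
        }
        // Mask the sign bit so the result is always a valid partition index
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}
```

It would be enabled by setting the producer property partitioner.class to the fully qualified class name.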
  30. Producers: no ack [diagram: the producer sends messages to the leaders and the followers copy them, but no acknowledgement flows back]
  31. Producers: ack on message written to the “leader” [diagram: each broker acknowledges as soon as its leader partition has written the message]
  32. Producers: ack on message also replicated to the “in-sync” replicas [diagram: the acknowledgement is sent only after the followers have copied the message]
  33. Consumers ● Read from one (or more) partition(s) ● Track (commit) the offset for a given partition ○ A partitioned topic “__consumer_offsets” is used for that ○ Key → [group, topic, partition], Value → [offset] ○ The offset is shared inside the consumer group ● QoS ○ At most once: read message, commit offset, process message ○ At least once: read message, process message, commit offset ○ Exactly once: read message, commit message output and offset to a transactional system ● Gets only “committed” messages (depends on the producer “ack” level)
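A minimal at-least-once consumer sketch (the group id, topic and broker address are placeholders): records are processed first, then the offset is committed, matching the QoS ordering described above.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets manually to control the QoS ordering
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // At least once: process first, then commit
                consumer.commitSync();
            }
        }
    }
}
```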
  34. Consumers: partitions assignment, available approaches ● The consumer asks for a specific partition (assign) ○ An application using one or more consumers has to handle such assignment on its own, and the scaling as well ● The consumer is part of a “consumer group” ○ Consumer groups are an easier way to scale up consumption ○ One of the consumers, as “group lead”, applies a strategy to assign partitions to the consumers in the group ○ When consumers join or leave, a rebalancing happens to reassign partitions ○ This allows pluggable strategies for partition assignment (e.g. stickiness)
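For contrast with the group-based subscribe() used in the previous sketch, the manual approach pins a consumer to specific partitions; a small hedged sketch (the topic and partition number are placeholders):

```java
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ManualAssignment {
    // Alternative to subscribe(): the application pins the consumer to a partition,
    // so no consumer-group rebalancing takes place
    static void assignAndReplay(KafkaConsumer<String, String> consumer) {
        TopicPartition p0 = new TopicPartition("my-topic", 0);
        consumer.assign(Collections.singletonList(p0));
        // Optional: re-read the partition from the start (stream replay)
        consumer.seekToBeginning(Collections.singletonList(p0));
    }
}
```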
  35. Consumer Groups ● Consumer group ○ Groups multiple consumers ○ Each consumer reads from a “unique” subset of partitions → max consumers = number of partitions ○ They are “competing” consumers on the topic, each message is delivered to one consumer ○ Messages with the same “key” are delivered to the same consumer ● More consumer groups ○ Allow publish/subscribe ○ The same messages are delivered to consumers in different consumer groups
  36. Consumer Groups: partitions assignment [diagram: a four-partition topic consumed by group 1 (two consumers) and group 2 (three consumers); each partition is assigned to exactly one consumer per group]
  37. Consumer Groups: rebalancing [diagram: a consumer joins or leaves and the four partitions are reassigned among the remaining consumers of each group]
  38. Consumer Groups: max parallelism & idle consumer [diagram: five consumers in one group on a four-partition topic; four consumers get one partition each and the fifth stays idle]
  39. Security ● Encryption between clients and brokers, and between brokers ○ Using SSL ● Authentication of clients (and brokers) connecting to brokers ○ Using SSL (mutual authentication) ○ Using SASL (with PLAIN, Kerberos or SCRAM-SHA as mechanisms) ● Authorization of read/write operations by clients ○ ACLs on resources such as topics ○ The authenticated “principal” is used for configuring ACLs ○ Pluggable ● It's possible to mix encryption/no-encryption and authentication/no-authentication
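On the client side this is purely configuration. A hedged sketch of SCRAM authentication over TLS (the hostname, credentials and file paths are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.config.SslConfigs;

public class SecureClientConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker:9093");
        // Encrypt the connection and authenticate with SASL
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"alice\" password=\"alice-secret\";");
        // Trust store holding the broker's CA certificate
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/path/to/truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "truststore-password");
        return props;
    }
}
```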
  40. “… and what about the traditional messaging brokers?”
  41. Messaging: “traditional” brokers ● Have to track consumer status while dispatching messages ● Delete messages; no long-term persistence ○ Support for TTL ● No feature for re-reading messages ● Consumers are fairly simple ● Open and standard protocols supported (AMQP, MQTT, JMS, STOMP, …) ● Selectors ● Best fit for the request/reply pattern
  42. Messaging with “traditional” brokers: key differences

      Feature               ActiveMQ Artemis                     Kafka
      Model                 “Smart broker, dumb clients”         “Dumb broker, smart clients”
      Durability            Volatile or durable storage          Durable storage
      Storage duration      Temporary storage of messages        Potential long-term storage of messages
      Message retention     Retained until consumed              Retained until expired or compacted
      Consumer state        Broker managed                       Client managed (can be stored in the broker)
      Selectors             Yes, per consumer                    No
      Stream replay         No                                   Yes
      High availability     Replication                          Replication
      Protocols             AMQP, MQTT, OpenWire, Core, STOMP    Kafka protocol
      Delivery guarantees   Best-effort or guaranteed            Best-effort or guaranteed
  43. Demo: an IoT use case [diagram: device app(s) produce to the “iot-temperature” topic; a Kafka Streams app computes the maximum temperature and writes it to “iot-temperature-max”; a consumer app reads that topic; Kafka runs alongside ZooKeeper]
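The actual demo code lives in the repository linked below; as a hedged approximation of its core (the serdes, window size and value type are assumptions), the Streams topology boils down to a windowed maximum per device:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TemperatureMaxTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        // Device id as key, temperature reading as value
        KStream<String, Integer> temperatures = builder.stream("iot-temperature",
                Consumed.with(Serdes.String(), Serdes.Integer()));
        temperatures
                .groupByKey()
                // Assumed window size: max temperature per device over 5 seconds
                .windowedBy(TimeWindows.of(Duration.ofSeconds(5).toMillis()))
                .reduce(Math::max)
                // Drop the window from the key before writing the result out
                .toStream()
                .map((windowedKey, max) -> KeyValue.pair(windowedKey.key(), max))
                .to("iot-temperature-max", Produced.with(Serdes.String(), Serdes.Integer()));
        return builder;
    }
}
```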
  44. Resources ● Apache Kafka: https://kafka.apache.org/ ● Strimzi: http://strimzi.io/ ● Kubernetes: https://kubernetes.io/ ● OpenShift: https://www.openshift.org/ ● Demo: https://github.com/strimzi/strimzi-lab/tree/master/iot-demo ● Paolo's blog: https://paolopatierno.wordpress.com/
  45. Thank you! Questions? @ppatierno
