Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

IoT data streaming with Apache Kafka

1 283 vues

Publié le

Tons of devices to supervise the environment connected to the Cloud, tons of data to ingest as a river in flood, tons of messages to process for getting information for them. This is where IoT data streaming comes in place with the need to analyze this unbounded dataset in real time and react to events. But how to handle them for reaching the higher point of wisdom? Apache Kafka and the Streams API are together a dream team for making this reality. During this session, we'll see how to use them in order to develop a complete pipeline for IoT data processing.

Publié dans : Logiciels
  • Soyez le premier à commenter

IoT data streaming with Apache Kafka

  1. 1. IOT DATA STREAMING WITH APACHE KAFKA From the data to the wisdom Paolo Patierno Principal Software Engineer @ Red Hat 15/12/2018
  2. 2. INSERT DESIGNATOR, IF NEEDED ● Principal Software Engineer @ Red Hat ○ Messaging & IoT team ● Lead/Committer @ Eclipse Foundation ○ Hono, Paho and Vert.x projects ● Microsoft MVP Azure/IoT ● Dad of two ● VR46’s fan (I don’t like MM93) ● Try to be a runner ... WHO AM I ? @ppatierno
  3. 3. INSERT DESIGNATOR, IF NEEDED ● Smart city ● Healthcare ● Connected cars ● Logistics ● Home automation ● Airlines ● Farmers ● You! IOT USE CASES
  4. 4. WHAT THEY HAVE IN COMMON?
  5. 5. INSERT DESIGNATOR, IF NEEDED ● Huge amount of data to ingest ● Unbounded dataset ● Need for real time analytics ● Need for real time reactions ● Sometimes the need to join : ○ with static data ○ with historical data IOT WORKLOAD
  6. 6. INSERT DESIGNATOR, IF NEEDED IOT WORKLOAD Ingestion Process & Analysis From raw data to the intelligence Sensors Intelligence
  7. 7. THE NEED FOR A PARADIGM DIFFERENT FROM BATCH PROCESSING
  8. 8. INSERT DESIGNATOR, IF NEEDED ● A type of data processing engine that is designed with infinite datasets in mind ● Working with data as they arrive ● Working with an infinite stream of data ○ Moving boundaries with data in “windows” ● Tradeoffs between latency/cost/correctness STREAM PROCESSING
  9. 9. INSERT DESIGNATOR, IF NEEDED Compared to batch based flows, streaming flows … ● … are often more time-sensitive ● … could be larger in volume ● … may or may not be of long-term value after processing So that you can … ● … capture and aggregate millions of events per second ● … instantly take action ● … respond to changing conditions STREAM PROCESSING Versus batch processing
  10. 10. INSERT DESIGNATOR, IF NEEDED Time ● Event time, which is the time at which events actually occurred ● Processing time, which is the time at which events are observed in the system STREAM PROCESSING Because time matters
  11. 11. INSERT DESIGNATOR, IF NEEDED ● Ordering ○ Handling devices data in the order they are sent ○ Handling “out of order” data when the device is not able to guarantee order ● Timing ○ Allowing “event time” which makes much sense in IoT scenarios ○ Supporting “windowing” for getting analytics in a specific time window ● State management ○ Be fast on joining real time data with metadata to enrich information content ○ Be fast on data aggregation ○ Be able to join different streams of data, from different devices STREAM PROCESSING & IOT What are the good parts for IoT?
  12. 12. INSERT DESIGNATOR, IF NEEDED ● Re-processing ○ Supporting the possibility to re-read the stream of data ● Fault tolerance ○ Allowing to get data from devices continuously even in case of consumers crash ● Scalability/Partitioning ○ Being able to handle bursts of traffic ○ Being able to handle more and more devices data STREAM PROCESSING & IOT What are the good parts for IoT?
  13. 13. INSERT DESIGNATOR, IF NEEDED STREAM PROCESSING & IOT Edge Gateways Devices Ingestion platform Applications Streaming data platform Streams processing framework
  14. 14. INSERT DESIGNATOR, IF NEEDED ● Streaming data platform ○ Apache Kafka ● Streams processing framework ○ Apache Spark (streaming) ○ Apache Samza ○ Apache Flink ○ … STREAM PROCESSING & IOT It’s a wild west out there
  15. 15. DO WE REALLY NEED MORE THAN ONE PLATFORMS IN PLACE?
  16. 16. INSERT DESIGNATOR, IF NEEDED STREAM PROCESSING & IOT Edge Gateways Devices Apache Kafka + Kafka Streams API streams API
  17. 17. INSERT DESIGNATOR, IF NEEDED IOT WORKLOAD streams API From raw data to the intelligence Sensors Intelligence
  18. 18. APACHE KAFKA
  19. 19. INSERT DESIGNATOR, IF NEEDED WHAT IS APACHE KAFKA A publish/subscribe messaging system? A streaming data platform? A distributed, horizontally-scalable, fault-tolerant, commit log?
  20. 20. INSERT DESIGNATOR, IF NEEDED APACHE KAFKA CONCEPTS ● Messages are sent to and received from a topic ○ Topics are split into one or more partitions (aka shards) ○ All actual work is done on partition level, topic is just a virtual object ● Each message is written only into a one selected partition ○ Partitioning is usually done based on the message key ○ Message ordering within the partition is fixed ● Retention ○ Based on size / message age ○ Compacted based on message key ● Replication ○ Each partition can exist in one or more copies to achieve high availability
  21. 21. TOPIC & PARTITIONS old new 0 1 2 3 4 5 6 7 8 9 1 0 1 1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 1 0 Producer Partition 0 Partition 1 Partition 2
  22. 22. TOPICS & PARTITIONS old new 0 1 2 3 4 5 6 7 8 9 1 0 1 1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 1 0 Consumer Partition 0 Partition 1 Partition 2
  23. 23. HIGH AVAILABILITY Broker 1 T1 - P1 T1 - P2 T1 - P3 Broker 2 T1 - P1 T1 - P2 T1 - P3 Broker 3 T1 - P1 T1 - P2 T1 - P3 Leaders and followers spread across the cluster.
  24. 24. HIGH AVAILABILITY Broker 1 T1 - P1 T1 - P2 T1 - P3 Broker 2 T1 - P1 T1 - P2 T1 - P3 Broker 3 T1 - P1 T1 - P2 T1 - P3 Clients always connect only to the leaders. Producer Consumer
  25. 25. HIGH AVAILABILITY Broker 1 T1 - P1 T1 - P2 T1 - P3 Broker 2 T1 - P1 T1 - P2 T1 - P3 Broker 3 T1 - P1 T1 - P2 T1 - P3 If a broker with leader partition goes down, a new leader partition is elected on different node
  26. 26. CONSUMER GROUPS ● Consumer Group ○ Grouping multiple consumers ○ Each consumer reads from a “unique” subset of partition → max consumers = num partitions ○ They are “competing” consumers on the topic, each message delivered to one consumer ○ Messages with same “key” delivered to same consumer ● More consumer groups ○ Allows publish/subscribe ○ Same messages delivered to different consumers in different consumer groups
  27. 27. Topic CONSUMER GROUPS Partition 0 Partition 1 Partition 2 Partition 3 Group 1 Consumer Consumer Group 2 Consumer Consumer Consumer
  28. 28. Topic CONSUMER GROUPS Partition 0 Partition 1 Partition 2 Partition 3 Group 1 Consumer Consumer Group 2 Consumer Consumer Consumer
  29. 29. Topic CONSUMER GROUPS Partition 0 Partition 1 Partition 2 Partition 3 Group 1 Consumer Consumer Consumer Consumer Consumer
  30. 30. INSERT DESIGNATOR, IF NEEDED APACHE KAFKA FOR IOT Pros ● “Big” buffer for handling back pressure ● High throughput ● High availability ● Long term storage ● Reprocessing devices data Cons ● Cannot handle a huge number of device connections ● No standard IoT devices oriented protocol ● Network needs to be stable Pros & Cons
  31. 31. KAFKA STREAMS
  32. 32. INSERT DESIGNATOR, IF NEEDED KAFKA STREAMS API ● Stream processing framework ● Streams are Kafka topics (as input and output) ● It’s really just a Java library to include in your application ● Scaling the stream application horizontally ● Creates a topology of processing nodes (filter, map, join etc) acting on a stream ○ Low level processor API ○ High level DSL ○ Using “internal” topics (when re-partitioning is needed or for “stateful” transformations)
  33. 33. INSERT DESIGNATOR, IF NEEDED KAFKA STREAMS API topic topic Processing topic Processing topic topic topic StreamsAPI application topic Processing
  34. 34. PROCESSORS TOPOLOGY Source processor Sink processor Stream processor Kafka
  35. 35. INSERT DESIGNATOR, IF NEEDED ● Stateless (filter, map) vs Stateful (join, aggregate) ● State handled locally ○ Faster ○ Better isolation ○ Flexible ● It’s based on RocksDB and backed by a Kafka topic as well ○ For rebuilding state on a different Streams node STATE MANAGEMENT Streams app Streams app Remote state Local state Stateless vs Stateful processing
  36. 36. INSERT DESIGNATOR, IF NEEDED Time ● Event time, which is the time at which events actually occurred ○ As part of the message payload ○ Timestamp on the message (set by the producer) ● Ingestion time, which is the time when the data enters the processing pipeline ○ The broker sets the timestamp on the message ● Processing time, which is the time at which events are observed in the system ○ It’s also called “wall-clock” time EVENTS TIMESTAMP Always because time matters
  37. 37. INSERT DESIGNATOR, IF NEEDED EVENTS TIMESTAMP value timestamp Kafka producer IoT device application or value timestamp Kafka broker value timestamp Kafka streams timestamp Event time Ingestion time Processing time
  38. 38. INSERT DESIGNATOR, IF NEEDED WINDOWING Tumbling Hopping Session time
  39. 39. DEMO
  40. 40. INSERT DESIGNATOR, IF NEEDED DEMO ARCHITECTURE iot-temperature-max Zookeeper Kafka Kafka Streams App Device App(s) iot-temperature iot-temperature Consumer App iot-temperature-max
  41. 41. THANK YOU plus.google.com/+RedHat linkedin.com/company/red-hat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHat

×