Stream Processing – Concepts and Frameworks

More and more data sources today provide a constant stream of data, from IoT devices to social media streams. It is one thing to collect these events at the velocity at which they arrive, without losing a single message. An Event Hub and a data flow engine can help here. It is another thing to do some (complex) analytics on the data. There is always the option to first store the data in a data sink of choice and analyze it later. Storing even a high-volume event stream is feasible and no longer a challenge. But this adds to the end-to-end latency, and it takes minutes if not hours to present results. If you need to react fast, you simply can’t afford to store the data first. You need to process it directly on the data stream. This is called Stream Processing or Stream Analytics. In this talk I will present the important concepts a Stream Processing solution should support, then dive into some of the most popular frameworks available on the market and how they compare.

Published in: Data & Analytics

Stream Processing – Concepts and Frameworks

  1. Stream Processing – Concepts and Frameworks, JEE Conf 2019, Guido Schmutz (guido.schmutz@trivadis.com), gschmutz, http://guidoschmutz.wordpress.com
  2. Agenda
     1. Motivation for Stream Processing?
     2. Capabilities for Stream Processing
     3. Implementing Stream Processing Solutions
     4. Demo
     5. Summary
  3. Guido Schmutz
     • Working at Trivadis for more than 22 years
     • Oracle Groundbreaker Ambassador & Oracle ACE Director
     • Consultant, Trainer, Software Architect for Java, AWS, Azure, Oracle Cloud, SOA and Big Data / Fast Data
     • Platform Architect & Head of Trivadis Architecture Board
     • More than 30 years of software development experience
     • Contact: guido.schmutz@trivadis.com
     • Blog: http://guidoschmutz.wordpress.com
     • Slideshare: http://www.slideshare.net/gschmutz
     • Twitter: gschmutz
     • 155th edition
  4. Motivation for Stream Processing?
  5. Big Data solves Volume and Variety – not Velocity [architecture diagram; labels: Bulk Source (File, DB, DB Extract), File Import / SQL Import, Big Data Platform (Hadoop Cluster, Parallel Processing, Storage: Raw and Refined), Results, SQL, Search Service, SQL Export, BI Tools, Enterprise Data Warehouse, Search / Explore, Enterprise Apps (Logic, API); annotation: high latency]
  6. Big Data solves Volume and Variety – not Velocity [same diagram as slide 5; added labels: Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social), Event Stream]
  7. Big Data solves Volume and Variety – not Velocity [same diagram as slide 6; added labels: Event Hub; Machine Learning, Graph Algorithms, Natural Language Processing]
  8. "Data at Rest" vs. "Data in Motion" [diagram contrasting the two modes; both are built from the steps Store, Analyze, Visualize and Act / (Re)Act]
  9. Stream Processing Architecture solves Velocity [architecture diagram; labels: Bulk Source (File, DB, DB Extract), Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social), Event Stream, Event Hub, Stream Analytics Platform (Stream Analytics, Reference / Models, Dashboard), Results, SQL, Search Service, BI Tools, Enterprise Data Warehouse, Search / Explore, Enterprise Apps (Logic, API); annotation: low(est) latency, no history]
  10. Big Data for all historical data analysis [same diagram as slide 9; added labels: Data Flow, File Import / SQL Import, Big Data Platform (Hadoop Cluster, Parallel Processing, Storage: Raw and Refined)]
  11. Integrate existing systems with lower latency through CDC [same diagram as slide 10; added label: Change Data Capture]
  12. New systems participate in event-oriented fashion [same diagram as slide 11; added labels: Microservice Platform (Microservice, State, API), Stream Processor (State, API), SQL Export]
  13. Edge computing allows processing close to data sources [same diagram as slide 12; added labels: Edge Node (Rules, Event Hub, Storage)]
  14. Unified Architecture for Modern Data Analytics Solutions [same components as slide 13, grouped into Big Data, Stream Analytics, Microservices and Edge Node]
  15. Two Types of Stream Processing (by Gartner)
      Stream Data Integration
      • focuses on the ingestion and processing of data sources, targeting real-time extract-transform-load (ETL) and data integration use cases
      • filters and enriches the data
      Stream Analytics
      • targets analytics use cases
      • calculates aggregates and detects patterns to generate higher-level, more relevant summary information (complex events)
      • complex events may signify threats or opportunities that require a response from the business
      Source: Gartner, Market Guide for Event Stream Processing, Nick Heudecker, W. Roy Schulte
  16. Stream Processing & Analytics Ecosystem [diagram of products grouped into Stream Analytics, Event Hub and Stream Data Integration, split into Open Source and Closed Source; Source: adapted from Tibco Edge]
  17. Important Capabilities for Stream Processing
  18. Capabilities: Stream Data Integration vs. Stream Analytics
      Capability | Stream Data Integration | Stream Analytics
      Support for Various Data Sources | yes | -
      Streaming ETL (Transformation/Format Translation, Routing, Validation) | yes | partial
      Execution Mode: Native Streaming | yes | yes
      Execution Mode: Non-Native Streaming - Micro-Batching | yes | partial
      Delivery Guarantees | yes | yes
      API: GUI-Based API / Declarative API / Programmatic | yes | yes
      API: Streaming SQL | - | yes
      Event Time vs. Ingestion / Processing Time | - | yes
      Windowing | - | yes
      Stream-to-Static Joins (Lookup/Enrichment) | partial | yes
      Stream-to-Stream Joins | - | yes
      State Management | - | yes
      Queryable State (aka Interactive Queries) | - | yes
      Event Pattern Detection | - | yes
  19. Integrating Data Sources [diagram; labels: Sensor Stream, SQL Polling, Change Data Capture (CDC), File Polling, File Stream (File Tailing), File Stream (Appender)]
  20. Streaming ETL
      • Streaming Extract – Transform – Load
      • Flow-based "programming"
      • High-throughput, straight-through data flows
      • Visual coding with flow editor
      • Stream Data Integration but not Stream Analytics
  21. Execution Mode: Native Streaming [diagram: event sources, ingestion, events processed individually]
      • Events processed as they arrive
      • Low(est) latency
      • Fault tolerance expensive
  22. Execution Mode: Non-Native Streaming - Micro-Batching [diagram: event sources, ingestion, events processed in batches]
      • Splits incoming stream into small batches
      • Fault tolerance easier to achieve
      • Higher latency
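The difference between the two execution modes can be illustrated with a minimal sketch in plain Python. It is not taken from the talk: the function names are hypothetical, and batches are cut by event count for simplicity, whereas micro-batching engines typically cut them by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def process_native(events: Iterable[int], handle: Callable[[int], None]) -> None:
    """Native streaming: every event is handled individually as it arrives."""
    for event in events:
        handle(event)


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batching: split the incoming stream into small batches.

    Assumption: batches are cut by count here, not by time interval.
    """
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the incomplete last batch
        yield batch


handled: List[int] = []
process_native(range(1, 4), handled.append)

batches = list(micro_batches(range(1, 8), batch_size=3))
```

In the batched variant, the first event of a batch waits until the batch is complete before anything downstream sees it, which is where the higher latency comes from.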
  23. Delivery Guarantees
      • At most once (fire-and-forget) [0 | 1]: the message is sent, but the sender doesn’t care whether it is received or lost
      • At least once [1+]: retransmission of messages can cause a message to be sent one or more times
      • Exactly once [1]: ensures that a message is received once and only once (never lost and never repeated)
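One common way to get an exactly-once effect on top of at-least-once delivery is to make the receiver idempotent. The sketch below simulates that in plain Python. It assumes every message carries a unique id, and it does not describe how any particular framework implements its guarantees; all names are hypothetical.

```python
from typing import List, Set, Tuple

Message = Tuple[int, str]  # (unique message id, payload)


def deliver_at_least_once(messages: List[Message], resend_ids: Set[int]) -> List[Message]:
    """Simulate retransmission: messages whose id is in resend_ids arrive twice."""
    received: List[Message] = []
    for msg in messages:
        received.append(msg)
        if msg[0] in resend_ids:
            received.append(msg)  # duplicate caused by a retry
    return received


def deduplicate(received: List[Message]) -> List[Message]:
    """Idempotent receiver: drop any message whose id was already seen."""
    seen: Set[int] = set()
    result: List[Message] = []
    for msg in received:
        if msg[0] not in seen:
            seen.add(msg[0])
            result.append(msg)
    return result


sent = [(1, "a"), (2, "b"), (3, "c")]
received = deliver_at_least_once(sent, resend_ids={2})
processed = deduplicate(received)
```

The set of seen ids is itself state that has to survive failures, which is one reason exactly-once semantics are harder to provide than at-least-once.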
  24. API
      GUI-based / Drag-and-Drop
      • A graphical way of designing a pipeline
      • Often web-based
      Declarative
      • A streaming engine configured declaratively
      • JSON, YAML
          "config": {
            "connector.class": "..MqttSourceConnector",
            "tasks.max": "1",
            "mqtt.server.uri": "tcp://mosquitto-1:1883",
            "mqtt.topics": "truck/+/position",
            "kafka.topic": "truck_position",
            ...
  25. API (II)
      Programmatic
      • Low-level (class) or high-level fluent API
      • Higher-order functions as operators (filter, mapWithState, …)
          val filteredDf = truckPosDf.where("eventType != 'Normal'")
      Streaming SQL
      • Use a stream in the FROM clause
      • Extensions support windowing, pattern matching, spatial, …
          SELECT * FROM truck_position_s WHERE eventType != 'Normal'
  26. Event Time vs. Ingestion / Processing Time
      • Event time: time at which events actually occurred
      • Ingestion time / processing time: time at which events are ingested into / processed by the system
      Not all use cases care about event times, but lots do!
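A small sketch of why the distinction matters: the same four events, counted per minute, give different results depending on which timestamp is used. The timestamps are made-up sample values (in seconds), not data from the talk.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (event_time_s, processing_time_s): hypothetical sample data.
# The second event occurred in minute 0 but was only processed in minute 1.
events: List[Tuple[int, int]] = [(10, 12), (55, 65), (70, 71), (118, 125)]


def count_per_minute(timestamps: List[int]) -> Dict[int, int]:
    """Count timestamps per one-minute bucket (bucket index = seconds // 60)."""
    return dict(Counter(ts // 60 for ts in timestamps))


by_event_time = count_per_minute([event_ts for event_ts, _ in events])
by_processing_time = count_per_minute([proc_ts for _, proc_ts in events])
```

Grouping by event time yields two events in each of the first two minutes, while grouping by processing time spreads the same events over three minutes.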
  27. Windowing [diagram: a window of data over a stream of data along a time axis]
      • Computations over events are done using windows of data
      • It is not feasible to keep the entire stream of data in memory
      • A window represents a certain amount of data to perform computations on
  28. Windowing [diagrams of the three window types along a time axis]
      • Sliding / Hopping Window: eviction and trigger based on window length and sliding interval length
      • Fixed / Tumbling Window: eviction based on the window being full; trigger based on either count of items or time
      • Session Window: sequences of temporally related events, terminated by a gap of inactivity longer than some timeout
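The three window types can be sketched as plain functions that assign timestamps to windows. This is an illustration under simplifying assumptions, not a framework API: timestamps are non-negative integers, windows are aligned to multiples of the window size or hop, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def tumbling_windows(timestamps: List[int], size: int) -> Dict[int, List[int]]:
    """Each timestamp falls into exactly one window [start, start + size)."""
    windows: Dict[int, List[int]] = defaultdict(list)
    for ts in timestamps:
        windows[ts - ts % size].append(ts)
    return dict(windows)


def hopping_windows(timestamps: List[int], size: int, hop: int) -> Dict[int, List[int]]:
    """Each timestamp falls into every window [start, start + size) covering it.

    Window starts are multiples of hop, so windows overlap when hop < size.
    """
    windows: Dict[int, List[int]] = defaultdict(list)
    for ts in timestamps:
        start = ts - ts % hop  # latest window start that is <= ts
        while start > ts - size and start >= 0:
            windows[start].append(ts)
            start -= hop
    return dict(windows)


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """A session ends when the next event is more than `gap` after the previous one."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


stamps = [1, 4, 7, 12, 13, 25]
tumbling = tumbling_windows(stamps, size=10)
hopping = hopping_windows(stamps, size=10, hop=5)
sessions = session_windows(stamps, gap=4)
```

With a hop smaller than the window size, an event such as 7 appears in two windows (those starting at 0 and at 5), whereas a tumbling window assigns it to exactly one.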
  29. Joining – Stream-to-Static [diagram: Stream-to-Static (Table) Join along a time axis]
      Challenges of joining streams
      • Data streams need to be aligned because of their different timestamps
      • Joins must be limited; otherwise they will never end
      • Joins need to produce results continuously
      • There is no end to the data
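A stream-to-static join is essentially a lookup per event. The sketch below enriches position events with a driver name from an in-memory table. The field names truckId, driverId and eventType follow the sample message shown later in the deck; the driver table, the driverName field and the function name are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Static (slowly changing) reference data: driverId -> driver name (made-up values)
drivers: Dict[int, str] = {13: "Driver A", 14: "Driver B"}


def enrich(positions: Iterable[dict], driver_table: Dict[int, str]) -> Iterator[dict]:
    """Left join: every event is emitted; driverName is None if the lookup misses."""
    for pos in positions:
        yield {**pos, "driverName": driver_table.get(pos["driverId"])}


positions = [
    {"truckId": 87, "driverId": 13, "eventType": "Normal"},
    {"truckId": 90, "driverId": 99, "eventType": "Normal"},
]
enriched = list(enrich(positions, drivers))
```

Because `enrich` is a generator, it produces one result per incoming event and never needs to see the end of the stream.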
  30. Joining – Stream-to-Stream [diagrams along a time axis: Stream-to-Stream Join (one window join) and Stream-to-Stream Join (two window join)]
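A time-bounded stream-to-stream join can be sketched with one buffer per side. The code below is an illustration only and is not meant to reproduce either of the two variants on the slide. It assumes events arrive ordered by timestamp and matches events with the same key whose timestamps are at most `window` apart; all names and sample values are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, str, str]  # (timestamp, side "L" or "R", key, value)


def windowed_join(events: Iterable[Event], window: int) -> Iterator[Tuple[str, str, str]]:
    """Emit (key, left_value, right_value) for same-key events within `window`."""
    buffers: Dict[str, Dict[str, Deque[Tuple[int, str]]]] = {
        "L": defaultdict(deque),
        "R": defaultdict(deque),
    }
    for ts, side, key, value in events:  # assumed ordered by timestamp
        other = "R" if side == "L" else "L"
        candidates = buffers[other][key]
        # Evict entries that are too old to match this or any later event.
        while candidates and ts - candidates[0][0] > window:
            candidates.popleft()
        for _, other_value in candidates:
            if side == "L":
                yield (key, value, other_value)
            else:
                yield (key, other_value, value)
        buffers[side][key].append((ts, value))


events = [
    (1, "L", "k1", "order-1"),
    (3, "R", "k1", "pay-1"),
    (20, "R", "k1", "pay-2"),
    (21, "L", "k1", "order-2"),
    (22, "L", "k2", "order-3"),
]
joined = list(windowed_join(events, window=5))
```

The eviction step is what keeps the join bounded: without a window, both buffers would grow for as long as the streams run.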
  31. State Management
      • Needed if the use case depends on previously seen data
      • Windowing, Joining and Pattern Detection use State Management behind the scenes
      • State needs to be as close to the stream processor as possible
      • How does it handle failures?
      • Options for State Management: In-Memory; Replicated, Distributed Store; Local, Embedded Store [diagram axis: Operational Complexity and Features, low to high]
  32. Queryable State (aka Interactive Queries) [diagram; labels: Stream Processing Infrastructure, Stream Analytics, Stream Processor, State, Query API, Reference Data, Model, Dashboard, Search / Explore, Online & Mobile Apps]
      • Exposes state managed by the Stream Analytics solution
      • Allows applications to query the managed state, e.g. to visualize it
      • Can eliminate the need for an external database to keep results
  33. Event Pattern Detection
      • Streaming data often contains interesting patterns
      • Special operators allow finding complex relationships between events
      • Absence Pattern: event A not followed by event B within a time window
      • Sequence Pattern: event A followed by event B followed by event C
      • Increasing Pattern: upward trend of the value of a certain attribute
      • Decreasing Pattern: downward trend of the value of a certain attribute
      • …
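As an illustration of the absence pattern, the sketch below reports every key for which a first event was not followed by a second event within a timeout. It assumes events arrive ordered by timestamp and that time only advances when a new event arrives; the event types "start" and "stop" and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (timestamp, key, event type)


def detect_absence(events: Iterable[Event], first: str, second: str,
                   timeout: int) -> List[Tuple[str, int]]:
    """Return (key, timestamp of `first`) for each `first` event that was
    not followed by `second` for the same key within `timeout`."""
    pending: Dict[str, int] = {}  # key -> timestamp of the unmatched `first` event
    alerts: List[Tuple[str, int]] = []
    for ts, key, kind in events:  # assumed ordered by timestamp
        # Any pending entry whose deadline has passed becomes an alert.
        for pending_key, started in list(pending.items()):
            if ts - started > timeout:
                alerts.append((pending_key, started))
                del pending[pending_key]
        if kind == first:
            pending[key] = ts
        elif kind == second:
            pending.pop(key, None)  # matched in time, no alert
    return alerts


events = [
    (0, "truck-1", "start"),
    (2, "truck-2", "start"),
    (5, "truck-1", "stop"),
    (20, "truck-3", "start"),
]
alerts = detect_absence(events, first="start", second="stop", timeout=10)
```

Entries still pending when the input ends are not reported, since their deadline has not been observed to pass. The `pending` dictionary is an example of the state that pattern detection keeps behind the scenes.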
  34. Capabilities: Stream Data Integration vs. Stream Analytics (repeats the capability comparison from slide 18)
  35. Implementing Stream Processing Solutions
  36. Stream Processing & Analytics Ecosystem (repeats the ecosystem diagram from slide 16)
  37. Event Hub: Apache Kafka [diagram; labels: Kafka Cluster (Broker 1, Broker 2, Broker 3), Zookeeper Ensemble (ZK 1, ZK 2, ZK 3), Producers, Consumers, Schema Registry, Service 1, Management (Control Center, Kafka Manager, KAdmin), kafkacat]
      Data Retention
      • Never
      • Time (TTL) or size-based
      • Log-compacted
  38. Stream Data Integration: Kafka Connect
          curl -X "POST" "http://192.168.69.138:8083/connectors" \
               -H "Content-Type: application/json" \
               -d $'{
            "name": "mqtt-source",
            "config": {
              "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
              "tasks.max": "1",
              "mqtt.server.uri": "tcp://mosquitto:1883",
              "mqtt.topics": "truck/+/position",
              "kafka.topic": "truck_position"
            }
          }'
      • Declarative-style data flows
      • Framework is part of Kafka
      • Many connectors available
      • Single Message Transforms (SMT)
  39. Stream Data Integration: StreamSets
      • GUI-based, drag-and-drop Data Flow Pipelines
      • Both stream and batch processing
      • Special option for Edge computing
      • Custom sources, sinks, processors
      • Monitoring and Error Detection
  40. Stream Analytics: Kafka Streams [diagram; labels: trucking_driver, Kafka Broker, Java Application, Kafka Streams]
      • Programmatic API, "just" a Java library
      • Native streaming
      • Fault-tolerant local state
      • Fixed, Sliding and Session Windowing
      • Stream-Stream / Stream-Table Joins
      • At-least-once and exactly-once
          // builder is a StreamsBuilder; a KTable is read with table(), not stream()
          KTable<Integer, Customer> customers = builder.table("customer");
          KStream<Integer, Order> orders = builder.stream("order");
          // the join arguments are elided on the slide
          KStream<Integer, String> joined = orders.leftJoin(customers, …);
          joined.to("orderEnriched");
  41. Stream Analytics: KSQL [diagram; labels: trucking_driver, Kafka Broker, KSQL Engine, Kafka Streams, KSQL CLI, Commands]
      • Stream processing with zero coding, using an SQL-like language
      • Part of Confluent Platform (community edition)
      • Built on top of Kafka Streams
      • Interactive (CLI) and headless (command file)
          CREATE STREAM customer_s WITH (kafka_topic='customer', value_format='AVRO');
          SELECT * FROM customer_s WHERE address->country = 'Switzerland';
          ...
  42. Stream Analytics: Spark Structured Streaming
      • 2nd-generation Spark Streaming, using DataFrame instead of RDD
      • Programmatic API
      • Code reuse between batch and streaming
      • Supports Java, Scala, Python, R and SQL
          val orderDf = spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker-1:9092")
            .option("subscribe", "order")
            .load()
          // assumes the Kafka value has already been parsed into columns such as address.country
          val orderFilteredDf = orderDf.where("address.country = 'Switzerland'")
  43. Demo
  44. Sample Use Case [pipeline diagram: truck/nn/position → mqtt-to-kafka → truck_position (Stream) → detect_dangerous_driving → dangerous_driving (Stream) → count_by_eventType → dangerous_driving_count (Table)]
      Position & Driving Info:
          {"timestamp":1537343400827,"truckId":87,"driverId":13,"routeId":987179512,"eventType":"Normal","latitude":38.65,"longitude":-90.21,"correlationId":"-3208700263746910537"}
      Source: https://github.com/gschmutz/iot-truck-demo
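The two processing steps of the sample use case can be approximated in a few lines of plain Python: keep only events whose eventType is not 'Normal' (the predicate used in the filter examples earlier in the deck), then count them per eventType. This is a batch-style sketch of the logic, not the demo's actual implementation, and the event type values other than "Normal" are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def detect_dangerous_driving(positions: Iterable[dict]) -> Iterator[dict]:
    """Stream step: keep only events that are not 'Normal'."""
    return (p for p in positions if p["eventType"] != "Normal")


def count_by_event_type(events: Iterable[dict]) -> Dict[str, int]:
    """Table step: number of events seen so far per eventType."""
    return dict(Counter(e["eventType"] for e in events))


# Hypothetical input; only "Normal" appears in the sample message on the slide.
truck_position = [
    {"truckId": 87, "driverId": 13, "eventType": "Normal"},
    {"truckId": 87, "driverId": 13, "eventType": "Overspeed"},
    {"truckId": 42, "driverId": 7, "eventType": "Lane Departure"},
    {"truckId": 42, "driverId": 7, "eventType": "Overspeed"},
]
dangerous_driving_count = count_by_event_type(detect_dangerous_driving(truck_position))
```

In a streaming engine the count would be updated continuously as events arrive, which is why the diagram labels the result as a table rather than a stream.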
  45. Summary
  46. Summary
      • Stream Processing is the solution for low latency
      • Event Hub, Stream Data Integration and Stream Analytics are the main building blocks in your architecture
      • Kafka is currently the de facto standard for the Event Hub
      • Various options exist for Stream Data Integration and Stream Analytics
      • SQL is becoming a valid option for implementing Stream Analytics
  47. Technology on its own won't help you. You need to know how to use it properly.
