Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka

Spoilt for Choice – Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka:

Apache Kafka is the de facto standard for processing streaming data and is widely deployed as an event streaming platform. Part of Kafka is its stream processing API, “Kafka Streams”. In addition, the Kafka ecosystem now offers KSQL, a declarative, SQL-like stream processing language that lets you define powerful stream-processing applications easily. What once took some moderately sophisticated Java code can now be done at the command line with a familiar and eminently approachable syntax.
This session discusses and demos the pros and cons of Kafka Streams and KSQL to understand when to use which stream processing alternative for continuous stream processing natively on Apache Kafka infrastructures. The end of the session compares the trade-offs of Kafka Streams and KSQL to separate stream processing frameworks such as Apache Flink or Spark Streaming.

Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka

  1. CONFIDENTIAL. Stream Processing with Confluent Kafka Streams and KSQL. Kai Waehner, Technology Evangelist. kontakt@kai-waehner.de, LinkedIn, @KaiWaehner, www.confluent.io, www.kai-waehner.de
  4. A Streaming Platform is the Underpinning of an Event-driven Architecture
     ● Ubiquitous connectivity: globally scalable platform for all event producers and consumers
     ● Immediate data access: data accessible to all consumers in real time
     ● Single system of record: persistent storage to enable reprocessing of past events
     ● Continuous queries: stream processing capabilities for in-line data transformation
     [Diagram: producers (microservices, DBs, SaaS apps, mobile) send database changes, microservices events, SaaS data, and customer experiences as streams of real-time events to stream processing apps and consumers (Customer 360, real-time fraud detection, data warehouse).]
  6. The beginning of a new era: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying The first use case. This is why Kafka was created!
  7. Apache Kafka: The De-facto Standard for Real-Time Event Streaming
     ● Global-scale
     ● Real-time
     ● Persistent storage
     ● Stream processing
     [Diagram: Apache Kafka connecting edge, cloud, data lake, databases, datacenter, IoT, SaaS apps, mobile, microservices, and machine learning.]
  8. Apache Kafka at Scale at Tech Giants
     ● > 4.5 trillion messages / day
     ● > 6 petabytes / day
     ● “You name it”
     * Kafka is not just used by tech giants
     ** Kafka is not just used for big data
  9. Confluent’s Business Value per Use Case
     ● Business value: Increase Revenue (make money); Decrease Costs (save money); Mitigate Risk (protect money)
     ● Key drivers and strategic objectives (sample): Improve Customer Experience (CX); Core Business Platform; Increase Operational Efficiency; Migrate to Cloud
     ● Example use cases: Fraud Detection; IoT sensor ingestion; Digital replatforming / Mainframe Offload; Customer 360; Faster transactional processing / analysis incl. Machine Learning / AI; Microservices Architecture; Online Fraud Detection; Online Security (syslog, log aggregation, Splunk replacement); Middleware replacement; Regulatory; Digital Transformation; Website / Core Operations (Central Nervous System); Real-time app updates
     ● Example case studies (of many):
       ● Connected Car: Navigation & improved in-car experience: Audi
       ● Simplifying Omni-channel Retail at Scale: Target
       ● Mainframe Offload: RBC
       ● Application Modernization: Multiple Examples
       ● The [Silicon Valley] Digital Natives; LinkedIn, Netflix, Uber, Yelp...
       ● Predictive Maintenance: Audi
       ● Streaming Platform in a regulated environment (e.g. Electronic Medical Records): Celmatix
       ● Real Time Streaming Platform for Communications and Beyond: Capital One
       ● Developer Velocity - Building Stateful Financial Applications with Kafka Streams: Funding Circle
       ● Detect Fraud & Prevent Fraud in Real Time: PayPal
       ● Kafka as a Service - A Tale of Security and Multi-Tenancy: Apple
  10. A Modern, Distributed Platform for Data Streams. Messaging + Storage + Processing!
  11. Stream Processing: processing-time, event-time, windowing. [Diagram: events for alice, bob, dave.]
  12. Confluent Delivers a Mission-Critical Event Streaming Platform
      ● Apache Kafka®: Core | Connect API | Streams API
      ● Data Compatibility: Schema Registry
      ● Enterprise Operations: Replicator | Auto Data Balancer | Connectors | MQTT Proxy | Kubernetes Operator
      ● Management & Monitoring: Control Center | Security
      ● Development & Connectivity: Clients | Connectors | REST Proxy | KSQL
      ● Inputs: Database Changes, Log Events, IoT Data, Web Events, other events
      ● Data integration: Hadoop, Database, Data Warehouse, CRM, other
      ● Real-time applications: Transformations, Custom Apps, Analytics, Monitoring, other
      ● Community features and commercial features
      ● Deployment: Datacenter, Public Cloud; Confluent Platform (customer self-managed), Confluent Cloud (Confluent fully managed)
  13. What We Cover Today:
      ● KSQL: the streaming SQL engine for Apache Kafka® to write real-time applications in SQL
      ● Kafka Streams: Apache Kafka® library to write real-time applications and microservices in Java, Scala
  14. Lower the bar to enter the world of streaming
      [Chart: user population vs. coding sophistication. Groups shown: core developers who use Java/Scala; core developers who don’t use Java/Scala; data engineers, architects, DevOps/SRE; BI analysts.]
  15. Lower the bar to enter the world of streaming: Kafka Streams vs. KSQL. KSQL example: CREATE STREAM fraudulent_payments AS SELECT * FROM payments WHERE fraudProbability > 0.8;
  16. Confluent
      ● Kafka Streams (Java/Scala) and KSQL (streaming SQL) for stream processing
      ● Lower-level Kafka Producer and Kafka Consumer clients for multiple languages: Java, C/C++, Go, Python, .NET, JMS
  17. Example use cases: microservices; data enrichment; streaming ETL; filter, cleanse, mask; real-time monitoring; anomaly detection
  18. Similarities and Differences of KSQL and Kafka Streams
  19. Similarities (Kafka Streams and KSQL)
      ● Enterprise support
      ● All you need is Kafka
      ● Run everywhere
      ● Elastic, scalable, fault-tolerant
      ● Kafka security integration
      ● Powerful processing
      ● Supports streams & tables
      ● Exactly-once processing
      ● Event-time processing
      ● And more!
  20. Similarities
      ● KSQL (processing): does not run on Kafka brokers!
      ● JVM application with Kafka Streams (processing): does not run on Kafka brokers!
      ● Kafka (data)
  21. Differences: flexibility vs. ease of use
      ● KSQL: CREATE STREAM ..., CREATE TABLE ..., SELECT, JOIN, COUNT, …
      ● Kafka Streams: KStream, KTable, filter(), map(), flatMap(), join(), aggregate(), …
      ● Kafka Clients (Consumer, Producer): subscribe(), poll(), send(), flush(), beginTransaction(), …
  22. Differences (KSQL vs. Kafka Streams)
      • You write…: KSQL statements vs. JVM applications
      • UI included for human interaction: Yes (Enterprise) vs. No
      • CLI included for human interaction: Yes vs. No
      • Data formats: Avro, JSON, CSV (today) vs. any data format, including Avro, JSON, CSV, Protobuf, XML
      • Interactive queries: Not yet vs. Yes
      • Flexibility, use case coverage: limited to KSQL syntax, UDFs vs. full power of Java, Scala
      • REST API included: Yes vs. No, but you can DIY
      • Runtime included: Yes, the KSQL server vs. not needed, applications run as standard JVM processes
  23. Guidance
      KSQL:
      • New to streaming and Kafka
      • Prefer SQL to writing code in Java, Scala
      • Prefer interactive experience with UI, CLI
      • Use cases include: filtering, transforming, masking data; enriching data, joining data sources
      • Use case is naturally expressible through SQL, with optional help from User Defined Functions as “get out of jail free” card
      • Provides KSQL REST API for use from Python, Go, JavaScript, shell, etc.
      Kafka Streams:
      • At least basic Kafka experience
      • Prefer writing and deploying JVM apps
      • Writing microservices
      • Use cases cover KSQL’s and more
      • To integrate with external services or 3rd party libraries (but see KSQL UDFs)
      • To customize or fine-tune a use case, e.g. custom joins, probabilistic counting
      • Need for queryable state, which is not yet supported by KSQL
  24. KSQL and Kafka Streams: A closer look
  25. Next: KSQL in more detail
      ● KSQL: the streaming SQL engine for Apache Kafka® to write real-time applications in SQL
      ● Kafka Streams: Apache Kafka® library to write real-time applications and microservices in Java, Scala
  26. KSQL
      ● You write only SQL. No Java, Python, or other boilerplate to wrap around it!
      ● Create KSQL user defined functions in Java when needed.
      CREATE STREAM fraudulent_payments AS SELECT * FROM payments WHERE fraudProbability > 0.8;
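
As a rough illustration of what the CREATE STREAM statement on this slide does conceptually, the following plain-Python sketch applies the same predicate to an iterator of events, one record at a time. It does not use Kafka or KSQL; the record layout and the sample values are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Iterator

def fraudulent_payments(payments: Iterable[Dict]) -> Iterator[Dict]:
    """Lazily yield every payment whose fraudProbability exceeds 0.8."""
    for payment in payments:
        if payment["fraudProbability"] > 0.8:
            yield payment

# Hypothetical sample events standing in for the `payments` stream.
sample = [
    {"id": 1, "fraudProbability": 0.10},
    {"id": 2, "fraudProbability": 0.93},
    {"id": 3, "fraudProbability": 0.80},
]
flagged = list(fraudulent_payments(sample))
```

Because the function is a generator, it never needs the whole input at once, which loosely mirrors how a continuous query keeps running as new events arrive.
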
  27. New user experience: interactive stream processing
  28. KSQL can be used interactively + programmatically: (1) UI, (2) CLI (ksql>), (3) REST (POST /query), (4) headless
  29. KSQL REST API: work with KSQL programmatically from other languages or the terminal. Here: run a continuous query and stream back the results. POST /query HTTP/1.1 { "ksql": "SELECT * FROM users WHERE name LIKE 'a%';", "streamsProperties": { "your.custom.setting": "value" } }
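
A minimal sketch of how the request on this slide might be assembled from Python using only the standard library. The server address `http://localhost:8088` and the `Content-Type` value are assumptions, not taken from the slides; the helper name `build_query_request` is hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

def build_query_request(base_url, ksql, streams_properties=None):
    """Build (but do not send) a POST /query request for a KSQL server."""
    body = {"ksql": ksql, "streamsProperties": streams_properties or {}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query",
        data=json.dumps(body).encode("utf-8"),
        # Assumed content type; check the KSQL REST API docs for your version.
        headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
        method="POST",
    )

request = build_query_request(
    "http://localhost:8088",  # assumed server address
    "SELECT * FROM users WHERE name LIKE 'a%';",
    {"your.custom.setting": "value"},
)
```

Passing `request` to `urllib.request.urlopen` and reading the response line by line would then stream the results back, assuming a server is reachable at that address.
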
  30. All you need is Kafka & KSQL
      ● Without KSQL: (1) build & package, (2) submit job; processing cluster and storage system required for fault-tolerant processing
      ● With KSQL: ksql> SELECT * FROM myStream
  31. KSQL is a stream processing technology. As such it is not yet a great fit for:
      ● Ad-hoc queries
        ● No indexes yet in KSQL
        ● Kafka often configured to retain data for only a limited span of time
      ● BI reports (Tableau etc.)
        ● No indexes yet in KSQL
        ● No official JDBC
        ● Most BI tools don’t understand continuous, streaming results
  32. KSQL example use cases: data exploration; data enrichment; streaming ETL; filter, cleanse, mask; real-time monitoring; anomaly detection
  33. Example: CDC from DB via Kafka to Elastic. Kafka Connect streams data in; KSQL processes table changes in real time to continuously maintain aggregates of metrics and KPIs; Kafka Connect streams data out.
  34. Example: Retail. KSQL joins the two streams in real time: a stream of shipments that arrive and a stream of purchases from online and physical stores.
  35. Example: IoT, Automotive, Connected Cars. Cars send telemetry data via the Kafka API; Kafka Connect streams data in; KSQL joins the stream and table in real time and spots vehicle failures; a Kafka Streams application notifies customers.
  36. KSQL for Streaming ETL
      ● Joining, filtering, and aggregating streams of event data
      CREATE STREAM vip_actions AS SELECT user_id, page, action FROM clickstream c LEFT JOIN users u ON c.user_id = u.user_id WHERE u.level = 'Platinum';
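
The stream-table join on this slide can be pictured as a per-event lookup into a keyed table. The sketch below is plain Python, not KSQL or Kafka Streams; the field names follow the slide, while the sample data is hypothetical. Note that filtering on `u.level` drops clicks without a matching user, since a missing level never equals 'Platinum'.

```python
def vip_actions(clickstream, users):
    """Join each click to the users table and keep Platinum users only."""
    for click in clickstream:
        user = users.get(click["user_id"])  # table lookup by key; None if no match
        if user is not None and user["level"] == "Platinum":
            yield {
                "user_id": click["user_id"],
                "page": click["page"],
                "action": click["action"],
            }

# Hypothetical table (keyed by user_id) and stream of click events.
users = {"u1": {"level": "Platinum"}, "u2": {"level": "Gold"}}
clicks = [
    {"user_id": "u1", "page": "/cart", "action": "view"},
    {"user_id": "u2", "page": "/home", "action": "view"},
    {"user_id": "u3", "page": "/home", "action": "click"},
]
vip = list(vip_actions(clicks, users))
```
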
  37. KSQL for Data Transformation
      ● Easily make derivations of existing topics
      ● Change data format
      ● Change number of partitions or partitioning scheme
      CREATE STREAM pageviews_avro WITH (PARTITIONS=6, VALUE_FORMAT='AVRO') AS SELECT * FROM pageviews_json PARTITION BY user_id;
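
To make "change the partitioning scheme" concrete, here is a small plain-Python sketch of key-based partitioning: every record with the same key lands in the same partition. It is an illustration only. The CRC32 hash is a stand-in chosen because it is in the standard library (Kafka's default Java partitioner uses a murmur2 hash), and the sample records are hypothetical.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; CRC32 is only a stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def repartition(records, key_field, num_partitions):
    """Distribute records into partitions by the value of key_field."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = partition_for(str(record[key_field]), num_partitions)
        partitions[index].append(record)
    return partitions

# Hypothetical page views, repartitioned by user_id into 6 partitions.
pageviews = [
    {"user_id": "alice", "page": "/a"},
    {"user_id": "bob", "page": "/b"},
    {"user_id": "alice", "page": "/c"},
    {"user_id": "dave", "page": "/d"},
]
partitions = repartition(pageviews, "user_id", 6)
```
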
  38. KSQL for Real-Time Monitoring
      ● Filtering, tracking, and alerting
      ● Log data monitoring
      ● Syslog data
      ● Sensor / IoT data
      ● Application metrics
      CREATE STREAM syslog_invalid_users AS SELECT host, message FROM syslog WHERE message LIKE '%Invalid user%';
      http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
  39. KSQL for Anomaly Detection
      ● Identify patterns or anomalies in real-time data, surfaced in milliseconds
      CREATE TABLE possible_fraud AS SELECT card_number, COUNT(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING COUNT(*) > 3;
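
A sketch of the tumbling-window count behind the statement on this slide, in plain Python. It assumes each event carries an event-time timestamp in milliseconds (the field name `timestamp_ms` and the sample data are hypothetical) and processes a finite batch, whereas KSQL would update the counts continuously.

```python
from collections import Counter

def possible_fraud(attempts, window_size_ms=5000, threshold=3):
    """Count attempts per (card, 5-second tumbling window); keep counts > threshold."""
    counts = Counter()
    for attempt in attempts:
        ts = attempt["timestamp_ms"]
        window_start = ts - ts % window_size_ms  # align to the window boundary
        counts[(attempt["card_number"], window_start)] += 1
    return {key: n for key, n in counts.items() if n > threshold}

# Hypothetical authorization attempts.
attempts = (
    [{"card_number": "A", "timestamp_ms": t} for t in (0, 1000, 2000, 3000)]
    + [{"card_number": "B", "timestamp_ms": t} for t in (4000, 4500, 5000, 5500)]
)
flagged_cards = possible_fraud(attempts)
```

Card B also has four attempts, but they straddle two windows (two in each), so only card A exceeds the threshold within a single window.
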
  40. Streams and Tables: important because most use cases need both
  41. Do you think that’s a table you are querying?
  42. The Stream-Table Duality: aggregating a stream (like SUM, COUNT) yields a table, a “materialized view” of the stream; the table’s changelog (CDC) is again a stream.
  43. The Stream-Table Duality: CREATE TABLE num_visited_locations_per_user AS SELECT username, COUNT(*) FROM location_updates GROUP BY username;
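
The duality can be sketched in a few lines of plain Python: aggregating a stream produces a table, every table update can be emitted as a changelog stream, and replaying that changelog rebuilds the table. This is an illustration under assumed data, not KSQL or Kafka Streams code.

```python
def aggregate_counts(location_updates):
    """Stream -> table: count updates per user, recording each change."""
    table = {}
    changelog = []
    for update in location_updates:
        user = update["username"]
        table[user] = table.get(user, 0) + 1
        changelog.append((user, table[user]))  # one changelog record per table update
    return table, changelog

def replay(changelog):
    """Changelog stream -> table: the latest value per key wins."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

# Hypothetical stream of location updates.
updates = [{"username": "alice"}, {"username": "bob"}, {"username": "alice"}]
table, changelog = aggregate_counts(updates)
rebuilt = replay(changelog)
```
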
  44. Scalability, Elasticity, Fault-Tolerance
  45. Fault-tolerance, powered by Kafka. A key challenge of distributed stream processing is fault-tolerant state.
      ● Server A: “I do stateful stream processing, like tables, joins, aggregations.”
      ● Changelog topic (Kafka): “streaming backup” of A’s local state
      ● State is automatically migrated in case of server failure: “streaming restore” of A’s local state to B
      ● Server B: “I restore the state and continue processing where server A stopped.”
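
The backup-and-restore idea on this slide can be sketched as follows. A Python list stands in for the changelog topic, and the `Server` class is hypothetical; the real mechanism is handled by Kafka Streams and KSQL internally.

```python
class Server:
    """Toy stateful processor that backs up its local state to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for the Kafka changelog topic
        self.state = {}             # local state store

    def restore(self):
        """'Streaming restore': replay the changelog into local state."""
        for key, value in self.changelog:
            self.state[key] = value

    def process(self, key):
        """Count an event and write the update to the changelog ('streaming backup')."""
        self.state[key] = self.state.get(key, 0) + 1
        self.changelog.append((key, self.state[key]))

changelog = []
server_a = Server(changelog)
for event in ("alice", "bob", "alice"):
    server_a.process(event)

# Server A fails; server B restores the state and continues where A stopped.
server_b = Server(changelog)
server_b.restore()
server_b.process("alice")
```
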
  46. Fault-tolerance, powered by Kafka. Processing fails over automatically, without data loss or miscomputation.
      ● #3 died, so #1 and #2 take over: (1) Kafka consumer group rebalance is triggered; (2) processing and state of #3 is migrated via Kafka to remaining servers #1 + #2
      ● #3 is back, so work is split again: (1) Kafka consumer group rebalance is triggered; (2) part of processing incl. state is migrated via Kafka from #1 + #2 to server #3
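
A simplified picture of what a rebalance achieves: the partitions are redistributed over whichever servers are currently alive. The round-robin rule below is an assumption chosen for brevity and is not Kafka's actual assignment algorithm; the server names follow the slide.

```python
def assign(partitions, servers):
    """Spread partitions over the live servers in round-robin order."""
    ordered = sorted(servers)
    assignment = {server: [] for server in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment

partitions = range(6)
before = assign(partitions, ["#1", "#2", "#3"])  # all three servers alive
after = assign(partitions, ["#1", "#2"])         # #3 died, #1 and #2 take over
```

Re-running `assign` with `#3` back in the list splits the work across three servers again.
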
  47. Elasticity and Scalability, powered by Kafka. You can add, remove, restart servers during live operations. “We need more processing power!” “Ok, we can scale down again.”
  48. Deploying KSQL
  49. Deploying KSQL: KSQL Server (JVM process)
      ● DEB, RPM, ZIP, TAR downloads: http://confluent.io/ksql
      ● Docker images: confluentinc/cp-ksql-server, confluentinc/cp-ksql-cli
      ● …and many more…
  50. Deploying KSQL #1: interactive KSQL, for development & testing (ksql>, POST /query, ...). KSQL (processing), Kafka (data).
  51. Deploying KSQL #2: headless KSQL, for production. Servers are started with the same .sql file; interaction for UI, CLI, REST is disabled. KSQL (processing), Kafka (data).
  52. Deploying KSQL: KSQL (processing) for BookingsTeam, FraudTeam, MobileTeam, … (more KSQL) reads from and writes to Kafka (data).
  53. Monitoring KSQL: JMX, Confluent Control Center. https://www.confluent.io/blog/troubleshooting-ksql-part-2
  54. Next: Kafka Streams in more detail
      ● KSQL: the streaming SQL engine for Apache Kafka® to write real-time applications in SQL
      ● Kafka Streams: Apache Kafka® library to write real-time applications and microservices in Java, Scala
  55. Kafka Streams
      ● You write standard Java or Scala applications to process your data
      ● The Kafka Streams library makes these applications: elastic, scalable, fault-tolerant, and more
      ● All you need is
  56. All you need is Kafka & Kafka Streams
      ● Without Kafka Streams: (1) build & package, (2) submit job; processing cluster and storage system required for fault-tolerant processing
      ● With Kafka Streams: JVM application
  57. DB (key-value store) included, and queryable
      ● Location-tracking application: “I continuously track the latest geolocation of every customer vehicle in a table.”
      ● Your app has its own local DB. You can also expose it to other apps, e.g. via REST.
      ● Other applications (Java, Go, Python, etc.) can directly query this table.
      [Diagram labels: query; alternative: write results to Kafka, read results from Kafka.]
  58. Kafka Streams example use cases: microservices; data enrichment; streaming ETL; filter, cleanse, mask; real-time monitoring; anomaly detection
  59. Writing Kafka Streams applications: add as dependency to your Java/Scala application
      <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>2.1.0</version>
      </dependency>
  60. Writing Kafka Streams applications (DSL vs. Processor API)
      • API style: functional programming vs. imperative programming
      • Typically used when: starting point for most developers and most use cases vs. to customize, tune, or to add functionality beyond what’s in the DSL today
      • You work with: KStream and KTable vs. Processors and state stores
      • Example operations: KStream#map(), KTable#filter(), KStream#join(), aggregate() vs. Processor#init(), Processor#close(), Processor#process(msg)
      • Suitable for use cases: S / M / L / XL vs. S / M / L / XL
      And of course, you can combine the DSL and the Processor API!
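
The contrast between the two API styles can be illustrated with a loose Python analogy; this is not the Kafka Streams API. The same filter-and-transform logic is written once as a functional pipeline and once as an imperative processor with `init`, `process`, and `close` steps. All names and the sample input are hypothetical.

```python
def dsl_style(words):
    """Functional style: declare the pipeline as filter followed by map."""
    return list(map(str.upper, filter(lambda word: len(word) > 3, words)))

class UppercaseLongWords:
    """Imperative style: handle one record at a time with explicit state."""

    def init(self):
        self.output = []     # explicit, processor-owned state
        self.closed = False

    def process(self, word):
        if len(word) > 3:
            self.output.append(word.upper())

    def close(self):
        self.closed = True

words = ["kafka", "sql", "streams", "api"]
functional_result = dsl_style(words)

processor = UppercaseLongWords()
processor.init()
for word in words:
    processor.process(word)
processor.close()
imperative_result = processor.output
```

The imperative version is longer, but it gives direct control over state and per-record handling, which is the kind of flexibility the slide attributes to the Processor API.
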
  61. Deploying Kafka Streams applications: JVM application with Kafka Streams (processing)
      ● Develop your application
      ● Build and package (jar, container, ...)
      ● Deploy and run one or multiple app instances
      ● …and many more…
  62. Elasticity and Scalability, powered by Kafka. You can add, remove, restart instances of your application during live operations. “We need more processing power!” “Ok, we can scale down again.”
  63. Deploying Kafka Streams applications: apps (processing) for BookingsTeam, FraudTeam, MobileTeam, … (more apps) read from and write to Kafka (data).
  64. Confluent Platform as Central Nervous System
  65. Confluent’s Streaming Maturity Model - where are you?
      [Chart: value vs. maturity (investment & time); pub + sub, store, process; projects, platform.]
      ● 1 Developer Interest, Pre-Streaming
      ● 2 Enterprise Streaming Pilot / Early Production
      ● 3 SLA Ready, Integrated Streaming
      ● 4 Global Streaming
      ● 5 Central Nervous System
  66. Domain-Driven Design for your Event Streaming Platform
      [Diagram: Kafka Cluster with Kafka Connect and Schema Registry as the event streaming platform; CRM domain, legacy domain, and payment domain; CRM integration, legacy integration, ESB connector, custom application (Java / KSQL / Kafka Streams).]
      → Independent and loosely coupled, but scalable, highly available and reliable!
  67. Confluent Schema Registry for Message Validation
      • “Kafka Benefits Under the Hood”
      • Schema definition + evolution
      • Forward and backward compatibility
      • Multi data center deployment
      [Diagram: input data, Schema Registry, App 1, App X.]
  68. Resources and Next Steps: confluentinc/kafka-streams-examples, https://docs.confluent.io/current/streams/, http://cnfl.io/slack
  69. Questions? Feedback? Please contact me! Kai Waehner, Technology Evangelist. kontakt@kai-waehner.de, @KaiWaehner, LinkedIn, www.confluent.io, www.kai-waehner.de