This document provides an overview and agenda for a presentation on Confluent, streaming, and KSQL. The presentation includes: an introduction to Confluent and Apache Kafka; an explanation of why streaming platforms are useful; an overview of the Confluent Platform and its components; key concepts in streaming and Kafka; a demonstration of Kafka Streams, Kafka Connect, and KSQL; and resources for further information. The presentation aims to explain streaming concepts, demonstrate Confluent tools, and allow for a question and answer session.
4. About Confluent and Apache Kafka™
• Founded September 2014 by the creators of Apache Kafka
• Technology developed while at LinkedIn
• 70% of active Kafka committers
• Leadership: Jay Kreps (CEO), Neha Narkhede (CTO, VP Engineering), Cheryl Dalrymple (CFO), Luanne Dauber (CMO), Simon Hayes (Head of Corporate & Business Development), Todd Barnett (VP WW Sales), Sarah Sproehnle (VP Customer Success)
5. Why a Streaming Platform?
All your data
Real-time
Fault tolerant
Secure
6. Confluent Platform: Enterprise Streaming based on Apache Kafka
Sources: Database Changes | Log Events | IoT Data | Web Events | …
Destinations and applications: CRM | Data Warehouse | Database | Hadoop | Data Integration | Monitoring | Analytics | Custom Apps | Transformations | Real-time Applications | …
Confluent Platform components (licensed as Apache Open Source, Confluent Open Source, or Confluent Enterprise):
• Apache Kafka®: Core | Connect API | Streams API
• Data Compatibility: Schema Registry
• SQL Stream Processing: KSQL (Streams API)
• Monitoring & Administration: Confluent Control Center | Security
• Operations: Replicator | Auto Data Balancing | JMS Client | JMS Connectors
• Development and Connectivity: Clients | Connectors | REST Proxy | CLI
20. Apache Kafka™ Connect API – Streaming Data Capture
Connectors stream data through the Kafka pipeline between Kafka and external systems such as JDBC, Mongo, MySQL, Elastic, Cassandra, and HDFS, on both the source and sink side.
• Fault tolerant
• Manage hundreds of data sources and sinks
• Preserves data schema
• Part of the Apache Kafka project
• Integrated within Confluent Platform’s Control Center
Flexible | Integrated | Reliable | Compatible
Connect any source to any target system
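As an illustration, a source connector is driven purely by configuration. This is a minimal sketch following the Confluent JDBC source connector's property names; the connection URL, credentials, and topic prefix are made-up example values:

```properties
name=mysql-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:mysql://localhost:3306/demo?user=connect&password=connect
# Poll tables incrementally using a monotonically increasing column
mode=incrementing
incrementing.column.name=id
# Each table is streamed into a topic named <prefix><table>
topic.prefix=mysql-
```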
21. Single Message Transforms
Modify events before storing in Kafka:
• Mask sensitive information
• Add identifiers
• Tag events
• Lineage/provenance
• Remove unnecessary columns
Modify events going out of Kafka:
• Route high-priority events to faster data stores
• Direct events to different Elasticsearch indexes
• Cast data types to match destination
• Remove unnecessary columns
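For example, the "mask sensitive information" case can be handled declaratively with the built-in MaskField transform in a connector's configuration; the transform alias `maskSsn` and field name `ssn` are made-up examples:

```properties
transforms=maskSsn
transforms.maskSsn.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskSsn.fields=ssn
```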
22. But…Easy to Implement
/**
 * Single message transformation for Kafka Connect record types.
 *
 * Connectors can be configured with transformations to make lightweight
 * message-at-a-time modifications.
 */
public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {

    /**
     * Apply the transformation to the {@code record} and return another record object.
     *
     * The implementation must be thread-safe.
     */
    R apply(R record);

    /** Configuration specification for this transformation. */
    ConfigDef config();

    /** Signal that this transformation instance will no longer be used. */
    @Override
    void close();
}
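To make the contract above concrete, here is a pure-JDK toy that mirrors the message-at-a-time, stateless style of a masking transform. It does not use the real Connect classes (`ConnectRecord`, `ConfigDef` are omitted); a real implementation would implement `org.apache.kafka.connect.transforms.Transformation`, and the record shape here is just a `Map` for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class MaskFieldDemo {

    // Mirrors Transformation#apply: takes one record, returns a new record.
    // Stateless, so it is trivially thread-safe, as the contract requires.
    static Map<String, Object> mask(Map<String, Object> record, String field) {
        Map<String, Object> out = new HashMap<>(record);
        if (out.containsKey(field)) {
            out.put(field, "****"); // replace the sensitive value
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("userid", "u42");
        event.put("ssn", "123-45-6789");

        Map<String, Object> masked = mask(event, "ssn");
        System.out.println(masked); // ssn is masked, userid untouched
    }
}
```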
25. KSQL for Data Exploration
SELECT status, bytes
FROM clickstream
WHERE user_agent =
'Mozilla/5.0 (compatible; MSIE 6.0)';
An easy way to inspect data in a running cluster
26. KSQL for Streaming ETL
• Kafka is popular for data pipelines.
• KSQL enables easy transformations of data within the pipe.
• Transforming data while moving from Kafka to another system.
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
27. KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data,
surfaced in milliseconds
28. Once again
KSQL implements for you the Kafka Streams
application you would have implemented yourself, if you had
• ... the time
• ... the experience
• ... KSQL as a spec
• ... the willingness to write the boring code
30. Where is KSQL not such a great fit?
BI reports (Tableau etc.)
• No indexes
• No JDBC (most BI tools are not good with continuous results!)
Ad-hoc queries
• Limited span of time usually retained in Kafka
• No indexes
35. Demo ... less fun
https://github.com/framiere/a-kafka-story/tree/master/step19
Change Data Capture
Docker
Producer
Consumer
Kafka Streams
KSQL
Event sourcing
InfluxDB
Grafana
S3
docker run --rm -it --name dcv -v $(pwd):/input pmsipilot/docker-compose-viz \
  render --horizontal --output-format image --force docker-compose.yml
36. Demo ... less fun
... without Confluent Control Center links
37. Confluent 4.2 - Nested Types
SELECT userid, address.city
FROM users
WHERE address.state = 'CA'
https://github.com/confluentinc/ksql/pull/1114
38. Confluent 4.2 - Remaining joins
SELECT orderid, shipmentid
FROM orders o INNER JOIN shipments s
ON o.orderid = s.shipmentid;
39. Where to go from here
● KSQL project page
○ https://www.confluent.io/product/ksql
● Confluent blog
○ http://blog.confluent.io/
● Formula 1 game blog post
○ https://www.confluent.io/blog/taking-ksql-spin-using-real-time-device-data/
● KSQL github repo
○ https://github.com/confluentinc/ksql
● CP-Demo
○ https://github.com/confluentinc/cp-demo
● A-Kafka-Story
○ https://github.com/framiere/a-kafka-story
● A tour of the Kafka environment (talk in French)
○ https://www.youtube.com/watch?v=BBo-rqmhpDM
● KSQL Recipes
○ https://github.com/bluemonk3y/ksql-recipe-fraudulent-txns/