Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Apache Kafka is a distributed, scalable, fault-tolerant streaming platform that provides low-latency pub/sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with Kafka Connect, which is part of Apache Kafka. ksqlDB is the source-available SQL streaming engine for Apache Kafka; it makes it possible to build stream processing applications at scale, written in a familiar SQL dialect.
In this talk, we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, Kafka Connect, and ksqlDB.
Gasp as we filter events in real time! Be amazed as we enrich streams of data with data from an RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!
35. @rmoff | #ConfluentVUG | @confluentinc
Let’s Build It!
[Diagram: an app publishes rating events into Kafka via the Producer API; user data is ingested from an RDBMS via Kafka Connect]
Let’s Build It!
[Diagram: the app's rating events and the RDBMS user data again, plus ksqlDB joins the rating events to the user data and filters them; the result flows via Kafka Connect into Elasticsearch to drive an operational dashboard]
Let’s Build It!
[Diagram: the pipeline grows again: Kafka Connect also sinks the data to a data lake (SnowflakeDB/S3/HDFS/etc), and an app consumes the filtered events via the Consumer API to send push notifications]
Demo Time!
[Diagram: the full pipeline, with ksqlDB filtering the ratings stream into a poor_ratings stream feeding the downstream sinks via Kafka Connect: push notifications to Slack, the operational dashboard (Elasticsearch), and the data lake (S3/HDFS/SnowflakeDB etc)]
Producer API
{
"rating_id": 5313,
"user_id": 3,
"stars": 4,
"route_id": 6975,
"rating_time": 1519304105213,
"channel": "web",
"message": "worst. flight. ever. #neveragain"
}
POOR_RATINGS
Filter all ratings where STARS < 3
CREATE STREAM POOR_RATINGS AS
  SELECT * FROM RATINGS WHERE STARS < 3;
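What this filter does can be sketched in a few lines of Python, with plain dicts standing in for Kafka records (an illustration only, not ksqlDB's implementation; the extra rating_ids are made up):

```python
# Rating events as they would arrive on the "ratings" topic.
ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4, "channel": "web"},
    {"rating_id": 5314, "user_id": 7, "stars": 1, "channel": "ios"},
    {"rating_id": 5315, "user_id": 2, "stars": 2, "channel": "web"},
]

# Equivalent of: CREATE STREAM POOR_RATINGS AS
#                SELECT * FROM RATINGS WHERE STARS < 3;
poor_ratings = [r for r in ratings if r["stars"] < 3]

print([r["rating_id"] for r in poor_ratings])
```

In ksqlDB the same predicate runs continuously against the stream, so POOR_RATINGS keeps growing as new low-star events arrive.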
[Diagram: the same pipeline, highlighting its three Kafka Connect stages: sourcing user data from the RDBMS, sinking enriched events to Elasticsearch for the operational dashboard, and sinking to the data lake (SnowflakeDB/S3/HDFS/etc)]
Kafka Connect
The Stream/Table Duality

Stream (the log of changes over time):
  Account ID   Amount
  12345        + £50
  12345        + £25
  12345        - £60

Table (the state, materialised after each event):
  Account ID   Balance
  12345        £50 → £75 → £15
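The duality can be sketched in plain Python: replaying the stream of amounts materialises the table of balances (field names here are assumptions for illustration):

```python
# A table is what you get by replaying the stream.
stream = [
    {"account_id": "12345", "amount": +50},
    {"account_id": "12345", "amount": +25},
    {"account_id": "12345", "amount": -60},
]

table = {}      # account_id -> current balance (the "table")
snapshots = []  # balance after each event, as in the slide
for event in stream:
    acct = event["account_id"]
    table[acct] = table.get(acct, 0) + event["amount"]
    snapshots.append(table[acct])

print(snapshots)  # the three table states: 50, 75, 15
```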
"The truth is the log. The database is a cache of a subset of the log."
Pat Helland, Immutability Changes Everything
http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
Rating events arrive via the Producer API; customer data arrives via Kafka Connect:
{
"rating_id": 5313,
"user_id": 3,
"stars": 4,
"route_id": 6975,
"rating_time": 1519304105213,
"channel": "web",
"message": "worst. flight. ever. #neveragain"
}
{
"id": 3,
"first_name": "Merilyn",
"last_name": "Doughartie",
"email": "mdoughartie1@dedecms.com",
"gender": "Female",
"club_status": "platinum",
"comments": "none"
}
RATINGS_WITH_CUSTOMER_DATA
Join each rating to customer data
CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS
  SELECT * FROM RATINGS R
    LEFT JOIN CUSTOMERS C
    ON R.USER_ID = C.ID;
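The stream-table join can be sketched in Python: the CUSTOMERS table becomes a dict keyed by id, and each rating is enriched by lookup. A LEFT JOIN keeps ratings with no matching customer, filling in None (illustrative only; the second rating is invented to show that case):

```python
# CUSTOMERS as a lookup table keyed by id.
customers = {
    3: {"id": 3, "first_name": "Merilyn", "club_status": "platinum"},
}

# Rating events; user_id 9 has no matching customer row.
ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4},
    {"rating_id": 5316, "user_id": 9, "stars": 5},
]

# Equivalent of the LEFT JOIN: enrich each rating with its customer,
# or None when there is no match.
ratings_with_customer_data = [
    {**r, "customer": customers.get(r["user_id"])} for r in ratings
]
```

In ksqlDB the table side is continuously updated from the RDBMS via Kafka Connect, so the lookup always reflects the latest customer state.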
UNHAPPY_PLATINUM_CUSTOMERS
Filter for just unhappy PLATINUM customers
CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS
  SELECT * FROM RATINGS_WITH_CUSTOMER_DATA
  WHERE STARS < 3
    AND CLUB_STATUS = 'platinum';
RATINGS_BY_CLUB_STATUS_1MIN
Aggregate per-minute by CLUB_STATUS
CREATE TABLE RATINGS_BY_CLUB_STATUS_1MIN AS
  SELECT CLUB_STATUS, COUNT(*) AS RATING_COUNT
  FROM RATINGS_WITH_CUSTOMER_DATA
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY CLUB_STATUS;
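The tumbling-window aggregate can be sketched in Python: each event's epoch-millisecond timestamp is bucketed into its one-minute window start, and counts are kept per (window, club_status) pair (an illustration of the windowing idea, not ksqlDB's implementation; the extra events are invented):

```python
# Enriched rating events with epoch-millisecond timestamps.
events = [
    {"rating_time": 1519304105213, "club_status": "platinum"},
    {"rating_time": 1519304111000, "club_status": "platinum"},
    {"rating_time": 1519304171000, "club_status": "bronze"},
]

WINDOW_MS = 60_000  # SIZE 1 MINUTE

# Equivalent of WINDOW TUMBLING + GROUP BY: bucket each event into
# the window containing it, then count per (window_start, club_status).
counts = {}
for e in events:
    window_start = (e["rating_time"] // WINDOW_MS) * WINDOW_MS
    key = (window_start, e["club_status"])
    counts[key] = counts.get(key, 0) + 1
```

The first two events fall into the same minute, the third into the next; a dashboard reading this table sees a per-minute rating count for each club status, which is the basis of the anomaly detection mentioned in the abstract.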