Select Star: Unified Batch & Streaming with Flink SQL & Pulsar

5. © 2020 Ververica
● Caito Scherr
● Developer Advocate
● Ververica, GmbH
● Portland, OR, USA
Introduction
11.
>> What is Flink?
Pulsar + Flink
● Stateful
● Stream processing engine
● Unified batch & streaming
14.
>> Why Pulsar + Flink?
Pulsar + Flink
“Batch as a special case
of streaming”
“Stream as a unified
view on data”
15.
>> Pulsar: Unified Storage
● Pub/Sub messaging layer
(Streaming)
● Durable storage layer
(Batch)
Pulsar + Flink
16.
>> Flink: Unified Processing
[Diagram: a bounded query covers a fixed interval of the past; an unbounded query runs from the start of the stream into the future]
● Reuse code and logic
● Consistent semantics
● Simplify operations
● Mix historic and real-time
Pulsar + Flink
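Flink SQL makes this unification concrete: the same query text can run as a bounded (batch) or unbounded (streaming) job by flipping one setting. A minimal sketch, assuming a `clicks` table is already registered (the `execution.runtime-mode` option landed in Flink 1.12; the quoting style follows the SQL client):

```sql
-- Run once over the bounded, historic data...
SET 'execution.runtime-mode' = 'batch';
SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id;

-- ...or continuously over the unbounded stream, with the same query text.
SET 'execution.runtime-mode' = 'streaming';
SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id;
```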
17.
Unified Processing Engine
(Batch / Streaming)
Unified Storage
(Segments / Pub/Sub)
>> A Unified Data Stack
Pulsar + Flink
18.
>> Pulsar + Flink History
Flink 1.6+ (2018)
● Streaming source/sink connectors
● Table sink connector
Flink 1.9+
● Pulsar schema + Flink catalog
● Table API/SQL as first-class citizens
● Exactly-once source
● At-least-once sink
Flink 1.12
● Upserts
● DDL computed columns, watermarks, metadata
● End-to-end exactly-once
● Key-shared subscription model
Pulsar + Flink
21.
>> Why Flink SQL?
[Diagram: the Flink Runtime (stateful computations over data streams) supports three API layers]
● Stateful Stream Processing: streams, state, time
● Event-Driven Applications: Stateful Functions
● Streaming Analytics & ML: SQL, PyFlink, Tables
Pulsar + Flink
22.
>> Why Flink SQL?
● Focus on business logic, not implementation
● Mixed workloads (batch + streaming)
● Maximize developer speed and autonomy
ML Feature Generation
Unified Online/Offline Model
Training
E2E Streaming Analytics
Pipelines
Pulsar + Flink
23.
Where SQL Fits In >> A Regular SQL Engine

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user;

A snapshot is taken when the query starts:
user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…   (added after the query started; not considered)

A final result is produced when the query terminates:
user | cnt
Mary | 2
Bob  | 1
24.
Where SQL Fits In >> A Streaming SQL Engine

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user;

All changes are ingested as they happen:
user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

The result is continuously updated:
user | cnt
Mary | 1 → 2
Bob  | 1
Liz  | 1

The result is identical to the one-time query (at this point).
25.
Where SQL Fits In >> Flink SQL In A Nutshell
● Standard SQL syntax and semantics (i.e. not a “SQL flavor”)
● Unified APIs for batch and streaming
● Support for advanced time handling and operations (e.g. CDC, pattern matching)
● Execution: batch and streaming, with TPC-DS coverage
● UDF support: Python, Java, Scala
● Native connectors: Apache Kafka, Elasticsearch, FileSystems, JDBC, HBase, Kinesis, and more
● Data catalogs: Metastore, Postgres (JDBC)
● Formats: Debezium, and more
28.
Demo >> 2. SQL Client + Pulsar
CREATE CATALOG pulsar WITH (
'type' = 'pulsar',
'service-url' = 'pulsar://pulsar:6650',
'admin-url' = 'http://pulsar:8080',
'format' = 'json'
);
Catalog DDL
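Once the catalog is registered, existing Pulsar topics show up as tables without any per-table DDL. A minimal sketch of using it from the SQL client (the `tweets` topic name is an example):

```sql
USE CATALOG pulsar;
SHOW TABLES;            -- each Pulsar topic is exposed as a table
SELECT * FROM tweets;   -- query a topic directly
```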
30.
Demo
CREATE TABLE pulsar_tweets (
  publishTime TIMESTAMP(3) METADATA,
  WATERMARK FOR publishTime AS publishTime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080',
  'scan.startup.mode' = 'earliest-offset'
)
LIKE tweets;
Derive schema from the original topic
Define the source connector (Pulsar)
Read and use Pulsar message metadata
>> 3. Get relevant timestamp
31.
Demo >> 4. Windowed aggregation
CREATE TABLE pulsar_tweets_agg (
  tmstmp TIMESTAMP(3),
  tweet_cnt BIGINT
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets_agg',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080'
);
Sink Table DDL
INSERT INTO pulsar_tweets_agg
SELECT TUMBLE_START(publishTime, INTERVAL '10' SECOND) AS tmstmp,
       COUNT(id) AS tweet_cnt
FROM pulsar_tweets
GROUP BY TUMBLE(publishTime, INTERVAL '10' SECOND);
Continuous SQL Query
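Because the sink is itself a Pulsar topic, the aggregate can be consumed downstream like any other table. A minimal sketch:

```sql
-- Read the 10-second tweet counts back as a stream
SELECT tmstmp, tweet_cnt
FROM pulsar_tweets_agg;
```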
34.
Resources
● Flink Ahead: What Comes After Batch & Streaming: https://youtu.be/h5OYmy9Yx7Y
● Apache Pulsar as one Storage System for Real Time & Historical Data Analysis:
https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
● Flink Table API & SQL: https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#operations
● Flink SQL Cookbook: https://github.com/ververica/flink-sql-cookbook
● When Flink & Pulsar Come Together: https://flink.apache.org/2019/05/03/pulsar-flink.html
● How to Query Pulsar Streams in Flink:
https://flink.apache.org/news/2019/11/25/query-pulsar-streams-using-apache-flink.html
● What’s New in the Flink/Pulsar Connector: https://flink.apache.org/2021/01/07/pulsar-flink-connector-270.html
● Marta’s Demo: https://github.com/morsapaes/flink-sql-pulsar
@Caito_200_OK
35.
● Pulsar Conference staff!!
● Marta Paes
Thank You!
@Caito_200_OK
Scan here for links
& resources