Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021
© 2020 Ververica
Introduction
● Caito Scherr
● Developer Advocate
● Ververica, GmbH
● Portland, OR, USA
Flink + Pulsar Today
Pulsar: Unified Storage
Unified Storage
(Segments / Pub/Sub)
“Stream as a unified view on data”
● Pub/Sub messaging layer (Streaming)
● Durable storage layer (Batch)
Why Flink + Pulsar?
Flink: Unified Processing Engine
Unified Processing Engine
(Batch / Streaming)
“Batch as a special case of streaming”
[Figure: timeline (past, now, future) showing bounded queries and unbounded queries, including ones that begin at the start of the stream.]
● Reuse code and logic across batch and stream processing
● Ensure consistent semantics between processing modes
● Simplify operations
● Power applications mixing historic and real-time data
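To make "batch as a special case of streaming" concrete, here is a minimal Python sketch. It is not Flink code, and every name in it (`running_counts`, `bounded_query`, `endless_clicks`) is made up for illustration. The same counting logic consumes either a finite input or an endless one; only the input differs.

```python
from collections import Counter
from itertools import count, islice
from typing import Dict, Iterable, Iterator


def running_counts(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Shared logic: yield the per-key counts after each incoming event."""
    counts: Counter = Counter()
    for key in events:
        counts[key] += 1
        yield dict(counts)


def bounded_query(events: Iterable[str]) -> Dict[str, int]:
    """Batch: run the same logic to the end of a finite input, keep the last state."""
    result: Dict[str, int] = {}
    for result in running_counts(events):
        pass
    return result


def endless_clicks() -> Iterator[str]:
    """Hypothetical unbounded source that never terminates."""
    users = ["Mary", "Bob", "Liz"]
    for i in count():
        yield users[i % len(users)]


# Bounded query: the input ends, so a final result exists.
batch_result = bounded_query(["Mary", "Bob", "Mary"])

# Unbounded query: the input never ends, so only intermediate results can be observed.
first_four_updates = list(islice(running_counts(endless_clicks()), 4))
```

The bounded case is simply the unbounded logic run until the input is exhausted, which is why code and semantics can be shared across both modes.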
A Unified Data Stack
● Flink: Unified Processing Engine (Batch / Streaming)
● Pulsar: Unified Storage (Segments / Pub/Sub)
Flink is broad
Flink Runtime: Stateful Computations over Data Streams
● Stateful Stream Processing: Streams, State, Time
● Event-Driven Applications: Stateful Functions
● Streaming Analytics & ML: SQL, PyFlink, Tables
Use Cases
● Focus on business logic, not implementation
● Mixed workloads (batch and streaming)
● Maximize developer speed and autonomy
More high-level or domain-specific use cases can be modeled with SQL/Python and dynamic tables.
Examples
● ML Feature Generation
● Unified Online/Offline Model Training
● E2E Streaming Analytics Pipelines
Where SQL Fits
Flink SQL
SELECT user_id, COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;
This is standard SQL (ANSI SQL)
“Everyone knows SQL, right?”
also Flink SQL
A Regular SQL Engine
Input table (clicks):
user cTime url
Mary 12:00:00 https://…
Bob 12:00:00 https://…
Mary 12:00:02 https://…
Liz 12:00:03 https://…
Query:
SELECT user_id,
       COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;
Result:
user cnt
Mary 2
Bob 1
● Take a snapshot when the query starts
● A final result is produced
● The query terminates
● A row that was added after the query was started is not considered
A Streaming SQL Engine
Input stream (clicks):
user cTime url
Mary 12:00:00 https://…
Bob 12:00:00 https://…
Mary 12:00:02 https://…
Liz 12:00:03 https://…
Query:
SELECT user_id,
       COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;
Result (continuously updated):
user cnt
Mary 1, then 2
Bob 1
Liz 1
● Ingest all changes as they happen
● Continuously update the result
● The result is identical to the one-time query (at this point)
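The difference between the two engines can be sketched in a few lines of Python. This is only an illustration of the semantics, not of how Flink encodes its changelog internally; the function names are hypothetical, and the url column is left out for brevity.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Click = Tuple[str, str]  # (user, cTime); the url column is omitted for brevity

CLICKS: List[Click] = [
    ("Mary", "12:00:00"),
    ("Bob", "12:00:00"),
    ("Mary", "12:00:02"),
    ("Liz", "12:00:03"),
]


def one_time_query(snapshot: Iterable[Click]) -> Dict[str, int]:
    """Regular engine: count over a fixed snapshot, then terminate."""
    return dict(Counter(user for user, _ in snapshot))


def continuous_query(stream: Iterable[Click]) -> Iterator[Tuple[str, int]]:
    """Streaming engine: emit an updated (user, cnt) row for every incoming click."""
    counts: Counter = Counter()
    for user, _ in stream:
        counts[user] += 1
        yield user, counts[user]


def materialize(updates: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Apply each update as an upsert keyed by user to get the current result."""
    result: Dict[str, int] = {}
    for user, cnt in updates:
        result[user] = cnt
    return result


updates = list(continuous_query(CLICKS))
streaming_result = materialize(updates)
batch_result = one_time_query(CLICKS)
```

Once the same four rows have been consumed, the materialized streaming result matches the one-time result; the streaming query simply keeps going when more rows arrive.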
Flink SQL in a Nutshell
● Standard SQL syntax and semantics (i.e. not a “SQL-flavor”)
● Unified APIs for batch and streaming
● Support for advanced time handling and operations (e.g. CDC, pattern matching)
UDF Support: Python, Java, Scala
Execution: Batch, Streaming; TPC-DS Coverage
Native Connectors: Apache Kafka, Elasticsearch, FileSystems, JDBC, HBase, Kinesis, +
Data Catalogs: Metastore, Postgres (JDBC)
Formats: Debezium, +
Flink + Pulsar Today
Pulsar Integration Over Time
Flink 1.6+ (2018)
● Streaming Source/Sink Connectors
● Table Sink Connector
Flink 1.9+
● Pulsar Schema + Flink Catalog Integration
● Table API/SQL as first-class citizens
● Exactly-once Source
● At-least-once Sink
Flink 1.12
● Upserts (upsert-pulsar)
● DDL Computed Columns, Watermarks and Metadata
● End-to-end Exactly-once
● Key-shared Subscription Model
Flink 1.13 * (Apr/May’21)
● Contribution to the Apache Flink repository [FLINK-20726]
● Port connector to the new, unified Source API (FLIP-27)
Where SQL Fits
DEMO
1. Use the Twitter Firehose built-in connector to consume tweets about gardening 🌿 into a Pulsar topic (tweets).
2. Start the Flink SQL Client and use a Pulsar catalog to access the topic directly as a table in Flink.
Catalog DDL (SQL Client)
CREATE CATALOG pulsar WITH (
  'type' = 'pulsar',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080',
  'format' = 'json'
);
2.1. You can query the tweets topic right off the bat using a simple SELECT statement: it’s treated as a Flink table!
SQL Client
2.2. But then you find out that most Firehose events have a null createdTime. What now?
Not cool. 👹
SQL Client
3. One way to get a relevant timestamp is to use Pulsar metadata to get the publishTime (i.e. ingestion time).
Source Table DDL
CREATE TABLE pulsar_tweets (
  publishTime TIMESTAMP(3) METADATA,
  WATERMARK FOR publishTime AS publishTime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080',
  'scan.startup.mode' = 'earliest-offset'
)
LIKE tweets;
● Read and use Pulsar message metadata (publishTime METADATA)
● Define the source connector (Pulsar) in the WITH clause
● Derive schema from the original topic (LIKE tweets)
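The WATERMARK clause in the source table DDL can be read as "tolerate events that arrive up to 5 seconds out of order". The Python sketch below shows that idea in simplified form. It is not Flink's implementation, and the timestamps and the `with_watermarks` helper are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

MAX_DELAY_S = 5  # mirrors INTERVAL '5' SECOND in the source table DDL


def with_watermarks(publish_times: Iterable[int]) -> Iterator[Tuple[int, float, bool]]:
    """Yield (publish_time, watermark, is_late) for each event.

    The watermark trails the largest timestamp seen so far by MAX_DELAY_S and
    never moves backwards. In this sketch an event counts as late when its
    timestamp is not ahead of the watermark that was current before it arrived.
    """
    watermark = float("-inf")
    for ts in publish_times:
        is_late = ts <= watermark
        watermark = max(watermark, ts - MAX_DELAY_S)
        yield ts, watermark, is_late


# Hypothetical publish times in seconds; 103 and 99 arrive out of order.
events = [100, 105, 103, 111, 99]
trace = list(with_watermarks(events))
```

In this trace, 103 arrives out of order but still within the 5-second bound, while 99 arrives after the watermark has already passed it and is flagged as late.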
4. Perform a simple windowed aggregation (count) and insert the results into a new Pulsar topic (tweets_agg).
Sink Table DDL
CREATE TABLE pulsar_tweets_agg (
  tmstmp TIMESTAMP(3),
  tweet_cnt BIGINT
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets_agg',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080'
);
Continuous SQL Query
INSERT INTO pulsar_tweets_agg
SELECT TUMBLE_START(publishTime, INTERVAL '10' SECOND) AS wStart,
       COUNT(id) AS tweet_cnt
FROM pulsar_tweets
GROUP BY TUMBLE(publishTime, INTERVAL '10' SECOND);
5. We’ll get the number of tweets in 10-second windows (based on event time!).
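How a 10-second tumbling window groups events can be sketched in Python. This is a batch-style approximation with hypothetical timestamps and helper names; a streaming engine would emit each window's count once the watermark passes the end of that window.

```python
from collections import Counter
from typing import Dict, Iterable

WINDOW_S = 10  # mirrors INTERVAL '10' SECOND in the query


def window_start(ts: int, size: int = WINDOW_S) -> int:
    """Start of the tumbling window that contains timestamp ts (in seconds)."""
    return ts - (ts % size)


def tumbling_counts(publish_times: Iterable[int]) -> Dict[int, int]:
    """Count events per window start, similar to COUNT(id) grouped by TUMBLE(...)."""
    return dict(Counter(window_start(ts) for ts in publish_times))


# Hypothetical publish times in seconds since some epoch.
counts = tumbling_counts([0, 3, 9, 10, 14, 27])
```

Windows do not overlap: each one covers its start time up to, but not including, the next start, so the event at second 10 falls into the second window rather than the first.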
There’s a lot more to it!
Check out the Flink SQL Cookbook, where we share hands-on examples, patterns, and use cases for Flink SQL.
Thank You!
@Caito_200_OK
● YOW! Conference staff!!
● “Shared Language” photo models:
○ Zack Hobson, CTO
○ Cory Johannsen, Senior Software Engineer
Quoted Contributors & Reviewers:
● Mandy Riso - DevOps Engineer
● Eric Shamow - Engineering Manager
● Mike Hix - Senior Software Engineer
● Ben Ford - Product Manager
● Noçnica Fee - Developer Advocate
● Lucy Wyman - Senior Software Engineer
● Logan Ballard - Site Reliability Engineer
Resources
● Flink Ahead: What Comes After Batch & Streaming: https://youtu.be/h5OYmy9Yx7Y
● Apache Pulsar as one Storage System for Real Time & Historical Data Analysis: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
● Flink Table API & SQL: https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#operations
● When Flink & Pulsar Come Together: https://flink.apache.org/2019/05/03/pulsar-flink.html
● How to Query Pulsar Streams in Flink: https://flink.apache.org/news/2019/11/25/query-pulsar-streams-using-apache-flink.html
● What’s New in the Flink/Pulsar Connector: https://flink.apache.org/2021/01/07/pulsar-flink-connector-270.html
● Marta’s Demo: https://github.com/morsapaes/flink-sql-pulsar
Metrics-Driven Metrics
“Any situation where people create their own dashboards without structure quickly starts to look like the cockpit of a 747”
Shared Language
“Even the best metrics-driven development fails easily when it’s only built for yourself”
Implementation:
• Identify your impacted groups
• Identify the most effective tools for each group
• Enable automation
Conclusion
For systems with complex integration points:
• Metrics-Driven Metrics: streamline technical complexity
• Metrics as a Shared Language: streamline interpersonal complications
Credits - Images
● Flink logos: https://wints.github.io/flink-web//community.html
● Data-Driven/Aware/Informed Design: https://uxdesign.cc/becoming-a-data-aware-designer-1d7614ebc3ed
● Stream processing diagram: https://www.ververica.com/what-is-stream-processing
● 747 Airplane cockpit: https://www.reddit.com/r/pics/comments/5vv8qt/the_pilots_seat_and_cockpit_of_a_boeing_747/
● DataDog Flink dashboard: https://www.datadoghq.com/blog/monitor-apache-flink-with-datadog/
● All other photos & Images: Caito Scherr
Credits
● Basic Data-Driven Development principles: https://www.portable.com.au/reports/principles-of-data-driven-design
● Data-driven design: https://www.springboard.com/blog/data-driven-design/
● Becoming A Data-Aware Designer, + definition image (illustration from “Designing with Data” by King, Churchill, & Tan): https://uxdesign.cc/becoming-a-data-aware-designer-1d7614ebc3ed
● https://digitalprinciples.org/wp-content/uploads/PDD_Principle-BeDataDriven_v2.pdf
● Harvard Business Review - Data Driven Culture: https://hbr.org/2020/02/10-steps-to-creating-a-data-driven-culture
CONCLUSION
Audience: You/Your Team
Tactic:
• Dashboard should represent your priorities & values
• Bring Development + Operations together
Tools: Analytics platforms
• Flink UI, Ververica Platform
• Datadog
• New Relic
• Grafana/Prometheus...

Audience: Upstream + Downstream Teams
Tactic:
• Understand each other’s “normal”
• Clear ownership for integration points
Tools: Analytics + chat integrations
• Slack + PagerDuty
• Slack + Jenkins...

Audience: Leadership + Stakeholders
Tactic:
• Business metrics
• Pivot between summary & depth
• Use tools they’re familiar with
Tools: Automated, non-engineering spaces
• URL endpoints
• Internal wiki/blog (with embedded, automated output)

Audience: Customers
Tactic:
• More manual approach
Tools: Manual approach
• Your company’s customer-facing teams
BASIC PRINCIPLES
[Figure: a spectrum from Quantitative to Qualitative, with Data-Driven Design, Data-Aware Design, and Data-Informed Design]
A SHARED LANGUAGE
Even the best data-driven development fails easily when it’s only built for yourself.
Metrics as a Shared Language
Audiences, from most technical to most abstract:
● Engineering Team
● Up + Downstream Teams
● Leadership/Stakeholders
● Customers
Summary
Concept: Metrics-Driven-Metrics cycle
Principles: Meaningful, iterative, accessible
Implementation: High risk, most uncertain, current roadmap priority
Benefit: Incident response,

Concept: Metrics as a shared language
Principles: Use tools familiar to your audience
Implementation: Small modular units
Benefit: Automate people & process challenges
Before We Start...
Flink’s REST API integration
http://hostname:8081/jobmanager/metrics
/taskmanagers/<taskmanagerid>/metrics
/taskmanagers/metrics
/jobs/metrics?jobs=D,E,F
...
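A small Python sketch of how these endpoints could be queried, using only the standard library. The base URL and endpoint paths come from the slide; the helper names, the `tm-1` TaskManager ID, and the assumption that each endpoint answers with JSON are illustrative, not taken from the talk.

```python
import json
from typing import Any, Iterable, Optional
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE_URL = "http://hostname:8081"  # placeholder host from the slide


def metrics_url(base: str, path: str, jobs: Optional[Iterable[str]] = None) -> str:
    """Build a metrics URL, optionally restricted to a comma-separated list of jobs."""
    url = urljoin(base, path)
    if jobs:
        # Keep commas unescaped so the query reads ?jobs=D,E,F as on the slide.
        url += "?" + urlencode({"jobs": ",".join(jobs)}, safe=",")
    return url


def taskmanager_metrics_path(taskmanager_id: str) -> str:
    """Path for a single TaskManager's metrics."""
    return f"/taskmanagers/{taskmanager_id}/metrics"


def fetch_json(url: str, timeout_s: float = 5.0) -> Any:
    """Fetch and decode a response, assuming the endpoint returns JSON."""
    with urlopen(url, timeout=timeout_s) as response:
        return json.load(response)


jobmanager_url = metrics_url(BASE_URL, "/jobmanager/metrics")
one_taskmanager_url = metrics_url(BASE_URL, taskmanager_metrics_path("tm-1"))
jobs_url = metrics_url(BASE_URL, "/jobs/metrics", jobs=["D", "E", "F"])
```

Calling `fetch_json(jobmanager_url)` against a running cluster would return the decoded response; the sketch only builds the URLs and does not contact any server on its own.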