SlideShare une entreprise Scribd logo
1  sur  36
1
Fabian Hueske
@fhueske
Seattle Apache Flink Meetup
March, 5th 2018
Why and how to leverage
the power and simplicity of
SQL on Apache Flink®
About me
 Apache Flink PMC member
• Contributing since day 1 at TU Berlin
• Focusing on Flink’s relational APIs since ~2 years
 Co-author of “Stream Processing with Apache Flink”
• Work in progress…
 Co-founder & Software Engineer at data Artisans
2
3
Original creators of
Apache Flink®
dA Platform 2
Open Source Apache Flink
+ dA Application Manager
4
Productionizing and operating
stream processing made easy
The dA Platform 2
dA Platform 2
Apache Flink
Stateful stream processing
Kubernetes
Container platform
Logging
Streams from
Kafka, Kinesis,
S3, HDFS,
Databases, ...
dA
Application
Manager
Application lifecycle
management
Metrics
CI/CD
Real-time
Analytics
Anomaly- &
Fraud Detection
Real-time
Data Integration
Reactive
Microservices
(and more)
What is Apache Flink?
6
Batch Processing
process static and
historic data
Data Stream
Processing
realtime results
from data streams
Event-driven
Applications
data-driven actions
and services
Stateful Computations Over Data Streams
What is Apache Flink?
7
Queries
Applications
Devices
etc.
Database
Stream
File / Object
Storage
Stateful computations over streams
real-time and historic
fast, scalable, fault tolerant, in-memory,
event time, large state, exactly-once
Historic
Data
Streams
Application
Hardened at scale
8
Streaming Platform Service
billions messages per day
A lot of Stream SQL
Streaming Platform as a Service
3700+ container running Flink,
1400+ nodes, 22k+ cores, 100s of jobs
Fraud detection
Streaming Analytics Platform
100s jobs, 1000s nodes, TBs state,
metrics, analytics, real time ML,
Streaming SQL as a platform
Powerful Abstractions
9
Process Function (events, state, time)
DataStream API (streams, windows)
SQL / Table API (dynamic tables)
Stream- & Batch
Data Processing
High-level
Analytics API
Stateful Event-
Driven Applications
val stats = stream
.keyBy("sensor")
.timeWindow(Time.seconds(5))
.sum((a, b) -> a.add(b))
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
// work with event and state
(event, state.value) match { … }
out.collect(…) // emit events
state.update(…) // modify state
// schedule a timer callback
ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
Layered abstractions to
navigate simple to complex use cases
Apache Flink’s Relational APIs
Unified APIs for batch & streaming data
A query specifies exactly the same result
regardless whether its input is
static batch data or streaming data.
10
tableEnvironment
.scan("clicks")
.groupBy('user)
.select('user, 'url.count as 'cnt)
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
LINQ-style Table APIANSI SQL
Query Translation
11
tableEnvironment
.scan("clicks")
.groupBy('user)
.select('user, 'url.count as 'cnt)
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
Input data is
bounded
(batch)
Input data is
unbounded
(streaming)
What if “clicks” is a file?
12
Clicks
user cTime url
Mary 12:00:00 https://…
Bob 12:00:00 https://…
Mary 12:00:02 https://…
Liz 12:00:03 https://…
user cnt
Mary 2
Bob 1
Liz 1
SELECT
user,
COUNT(url) as cnt
FROM clicks
GROUP BY user
Input data is
read at once
Result is
produced at once
What if “clicks” is a stream?
13
user cTime url
user cnt
SELECT
user,
COUNT(url) as cnt
FROM clicks
GROUP BY user
Clicks
Mary 12:00:00 https://…
Bob 12:00:00 https://…
Mary 12:00:02 https://…
Liz 12:00:03 https://…
Bob 1
Liz 1
Mary 1Mary 2
Input data is
continuously read
Result is
continuously updated
The result is the same!
Why is stream-batch unification important?
 Usability
• ANSI SQL syntax: No custom “StreamSQL” syntax.
• ANSI SQL semantics: No stream-specific results.
 Portability
• Run the same query on bounded and unbounded data
• Run the same query on recorded and real-time data
 How can we achieve SQL semantics on streams? 14
now
bounded query
unbounded query
past future
bounded query
start of the stream
unbounded query
DBMSs Run Queries on Streams
 Materialized views (MV) are similar to regular views,
but persisted to disk or memory
• Used to speed-up analytical queries
• MVs need to be updated when the base tables change
 MV maintenance is very similar to SQL on streams
• Base table updates are a stream of DML statements
• MV definition query is evaluated on that stream
• MV is query result and continuously updated
15
Continuous Queries in Flink
 Core concept is a “Dynamic Table”
• Dynamic tables are changing over time
 Queries on dynamic tables
• produce new dynamic tables (which are updated based on input)
• do not terminate
 Stream ↔ Dynamic table conversions
16
Stream ↔ Dynamic Table Conversions
 Append Conversions
• Records are only inserted/appended
 Upsert Conversions
• Records are inserted/updated/deleted
• Records have a (composite) unique key
 Changelog Conversions
• Records are inserted/updated/deleted
17
SQL Feature Set in Flink 1.5.0
 SELECT FROM WHERE
 GROUP BY / HAVING
• Non-windowed, TUMBLE, HOP, SESSION windows
 JOIN
• Windowed INNER, LEFT / RIGHT / FULL OUTER JOIN
• Non-windowed INNER JOIN
 Scalar, aggregation, table-valued UDFs
 SQL CLI Client (beta)
 [streaming only] OVER / WINDOW
• UNBOUNDED / BOUNDED PRECEDING
 [batch only] UNION / INTERSECT / EXCEPT / IN / ORDER BY
18
What can I build with this?
 Data Pipelines
• Transform, aggregate, and move events in real-time
 Low-latency ETL
• Convert and write streams to file systems, DBMS, K-V stores, indexes, …
• Ingest appearing files to produce streams
 Stream & Batch Analytics
• Run analytical queries over bounded and unbounded data
• Query and compare historic and real-time data
 Power Live Dashboards
• Compute and update data to visualize in real-time 19
The New York Taxi Rides Data Set
 The New York City Taxi & Limousine Commission provides a public data
set about past taxi rides in New York City
 We can derive a streaming table from the data
 Table: TaxiRides
rideId: BIGINT // ID of the taxi ride
isStart: BOOLEAN // flag for pick-up (true) or drop-off (false) event
lon: DOUBLE // longitude of pick-up or drop-off location
lat: DOUBLE // latitude of pick-up or drop-off location
rowtime: TIMESTAMP // time of pick-up or drop-off event
20
Identify popular pick-up / drop-off locations
SELECT cell,
isStart,
HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
COUNT(*) AS cnt
FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
FROM TaxiRides)
GROUP BY cell,
isStart,
HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)
21
 Compute every 5 minutes for each location the
number of departing and arriving taxis
of the last 15 minutes.
Average ride duration per pick-up location
SELECT pickUpCell,
AVG(TIMESTAMPDIFF(MINUTE, e.rowtime, s.rowtime) AS avgDuration
FROM (SELECT rideId, rowtime, toCellId(lon, lat) AS pickUpCell
FROM TaxiRides
WHERE isStart) s
JOIN
(SELECT rideId, rowtime
FROM TaxiRides
WHERE NOT isStart) e
ON s.rideId = e.rideId AND
e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR
GROUP BY pickUpCell
22
 Join start ride and end ride events on rideId and
compute average ride duration per pick-up location.
Building a Dashboard
23
Elastic
Search
Kafka
SELECT cell,
isStart,
HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
COUNT(*) AS cnt
FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
FROM TaxiRides)
GROUP BY cell,
isStart,
HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)
Sounds great! How can I use it?
 ATM, SQL queries must be embedded in Java/Scala code 
• Tight integration with DataStream and DataSet APIs
 Community focused on internals (until Flink 1.4.0)
• Operators, types, built-in functions, extensibility (UDFs, extern. catalog)
• Proven at scale by Alibaba, Huawei, and Uber
• All built their own submission system & connectors library
 Community neglected user interfaces
• No query submission client, no CLI
• No catalog integration
• Limited set of TableSources and TableSinks
24
Coming in Flink 1.5.0 - SQL CLI
Demo Time!
That’s a nice toy, but …
... can I use it for anything serious?
25
FLIP-24 – A SQL Query Service
 REST service to submit & manage SQL queries
• SELECT …
• INSERT INTO SELECT …
• CREATE MATERIALIZE VIEW …
 Serve results of “SELECT …” queries
 Provide a table catalog (integrated with external catalogs)
 Use cases
• Data exploration with notebooks like Apache Zeppelin
• Access to real-time data from applications
• Easy data routing / ETL from management consoles
26
Challenge: Serve Dynamic Tables
Unbounded input yields unbounded results
27
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
SELECT user, url
FROM clicks
WHERE url LIKE '%xyz.com'
Append-only Table
• Result rows are never changed
• Consume, buffer, or drop rows
Continuously updating Table
• Result rows can be updated or
deleted
• Consume changelog or
periodically query result table
• Result table must be maintained
somewhere
(Serving bounded results is easy)
Application
FLIP-24 – A SQL Query Service
28
Query Service
Catalog
Optimizer
Database /
HDFS
Event Log
External Catalog
(Schema Registry,
HCatalog, …)
Query
Results
Submit Query Job
State
REST
Result Server
Submit Query
REST
Database /
HDFS
Event Log
SELECT
user,
COUNT(url) AS cnt
FROM clicks
GROUP BY user
Results are served by Query Service via REST
+ Application does not need a special client
+ Works well in many network configurations
− Query service can become bottleneck
Application
FLIP-24 – A SQL Query Service
29
Query Service
SELECT
user,
COUNT(url) AS cnt
FROM clicks
GROUP BY user Catalog
Optimizer
Database /
HDFS
Event Log
External Catalog
(Schema Registry,
HCatalog, …)
Query
Submit Query Job
State
REST
Result Server
Submit Query
RES
T
Database /
HDFS
Event Log
Serving
Library
Result Handle
We want your feedback!
 The design of SQL Query Service is not final yet.
 Check out FLIP-24 and FLINK-7594
 Share your ideas and feedback and discuss on
JIRA or dev@flink.apache.org.
30
Summary
 Unification of stream and batch is important.
 Flink’s SQL solves many streaming and batch use cases.
 Runs in production at Alibaba, Uber, and others.
 The community is working on improving user interfaces.
 Get involved, discuss, and contribute!
31
15% discount code: FlinkSeattle
Flink Forward SF 2018 Presenters
33
Thank you!
@fhueske
@ApacheFlink
@dataArtisans
Available on O’Reilly Early Release!
We are hiring!
data-artisans.com/careers
Why and how to leverage the power and simplicity of SQL on Apache Flink

Contenu connexe

Tendances

Unlocking the world of stream processing with KSQL, the streaming SQL engine ...
Unlocking the world of stream processing with KSQL, the streaming SQL engine ...Unlocking the world of stream processing with KSQL, the streaming SQL engine ...
Unlocking the world of stream processing with KSQL, the streaming SQL engine ...
Michael Noll
 
Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...
Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...
Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...
Michael Noll
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Ververica
 

Tendances (20)

Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
 
data Artisans Product Announcement
data Artisans Product Announcementdata Artisans Product Announcement
data Artisans Product Announcement
 
Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016
 
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
 
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
 
From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4
 
Towards sql for streams
Towards sql for streamsTowards sql for streams
Towards sql for streams
 
Unlocking the world of stream processing with KSQL, the streaming SQL engine ...
Unlocking the world of stream processing with KSQL, the streaming SQL engine ...Unlocking the world of stream processing with KSQL, the streaming SQL engine ...
Unlocking the world of stream processing with KSQL, the streaming SQL engine ...
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
Flink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in ReviewFlink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in Review
 
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
 
Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...
Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...
Big, Fast, Easy Data: Distributed Stream Processing for Everyone with KSQL, t...
 
Aljoscha Krettek - The Future of Apache Flink
Aljoscha Krettek - The Future of Apache FlinkAljoscha Krettek - The Future of Apache Flink
Aljoscha Krettek - The Future of Apache Flink
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 

Similaire à Why and how to leverage the power and simplicity of SQL on Apache Flink

Why and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on FlinkWhy and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on Flink
DataWorks Summit
 
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
Stephan Ewen
 

Similaire à Why and how to leverage the power and simplicity of SQL on Apache Flink (20)

Why and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on FlinkWhy and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on Flink
 
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
 
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
Flink SQL in Action
Flink SQL in ActionFlink SQL in Action
Flink SQL in Action
 
Data Analysis With Apache Flink
Data Analysis With Apache FlinkData Analysis With Apache Flink
Data Analysis With Apache Flink
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)
 
Apache Flink - a Gentle Start
Apache Flink - a Gentle StartApache Flink - a Gentle Start
Apache Flink - a Gentle Start
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
Creating the PromQL Transpiler for Flux by Julius Volz, Co-Founder | Prometheus
Creating the PromQL Transpiler for Flux by Julius Volz, Co-Founder | PrometheusCreating the PromQL Transpiler for Flux by Julius Volz, Co-Founder | Prometheus
Creating the PromQL Transpiler for Flux by Julius Volz, Co-Founder | Prometheus
 
Streaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesStreaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+Tables
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Streaming SQL w/ Apache Calcite
Streaming SQL w/ Apache Calcite Streaming SQL w/ Apache Calcite
Streaming SQL w/ Apache Calcite
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 Keynote
Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 KeynoteAdvanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 Keynote
Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 Keynote
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 

Plus de Fabian Hueske

Plus de Fabian Hueske (9)

Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkStream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache Flink
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
Apache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated QueriesApache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated Queries
 
Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!
 
Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015
 

Dernier

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Dernier (20)

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

Why and how to leverage the power and simplicity of SQL on Apache Flink

  • 1. 1 Fabian Hueske @fhueske Seattle Apache Flink Meetup March, 5th 2018 Why and how to leverage the power and simplicity of SQL on Apache Flink®
  • 2. About me  Apache Flink PMC member • Contributing since day 1 at TU Berlin • Focusing on Flink’s relational APIs since ~2 years  Co-author of “Stream Processing with Apache Flink” • Work in progress…  Co-founder & Software Engineer at data Artisans 2
  • 3. 3 Original creators of Apache Flink® dA Platform 2 Open Source Apache Flink + dA Application Manager
  • 5. The dA Platform 2 dA Platform 2 Apache Flink Stateful stream processing Kubernetes Container platform Logging Streams from Kafka, Kinesis, S3, HDFS, Databases, ... dA Application Manager Application lifecycle management Metrics CI/CD Real-time Analytics Anomaly- & Fraud Detection Real-time Data Integration Reactive Microservices (and more)
  • 6. What is Apache Flink? 6 Batch Processing process static and historic data Data Stream Processing realtime results from data streams Event-driven Applications data-driven actions and services Stateful Computations Over Data Streams
  • 7. What is Apache Flink? 7 Queries Applications Devices etc. Database Stream File / Object Storage Stateful computations over streams real-time and historic fast, scalable, fault tolerant, in-memory, event time, large state, exactly-once Historic Data Streams Application
  • 8. Hardened at scale 8 Streaming Platform Service billions messages per day A lot of Stream SQL Streaming Platform as a Service 3700+ container running Flink, 1400+ nodes, 22k+ cores, 100s of jobs Fraud detection Streaming Analytics Platform 100s jobs, 1000s nodes, TBs state, metrics, analytics, real time ML, Streaming SQL as a platform
  • 9. Powerful Abstractions 9 Process Function (events, state, time) DataStream API (streams, windows) SQL / Table API (dynamic tables) Stream- & Batch Data Processing High-level Analytics API Stateful Event- Driven Applications val stats = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) .sum((a, b) -> a.add(b)) def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = { // work with event and state (event, state.value) match { … } out.collect(…) // emit events state.update(…) // modify state // schedule a timer callback ctx.timerService.registerEventTimeTimer(event.timestamp + 500) } Layered abstractions to navigate simple to complex use cases
  • 10. Apache Flink’s Relational APIs Unified APIs for batch & streaming data A query specifies exactly the same result regardless whether its input is static batch data or streaming data. 10 tableEnvironment .scan("clicks") .groupBy('user) .select('user, 'url.count as 'cnt) SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user LINQ-style Table APIANSI SQL
  • 11. Query Translation 11 tableEnvironment .scan("clicks") .groupBy('user) .select('user, 'url.count as 'cnt) SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user Input data is bounded (batch) Input data is unbounded (streaming)
  • 12. What if “clicks” is a file? 12 Clicks user cTime url Mary 12:00:00 https://… Bob 12:00:00 https://… Mary 12:00:02 https://… Liz 12:00:03 https://… user cnt Mary 2 Bob 1 Liz 1 SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user Input data is read at once Result is produced at once
  • 13. What if “clicks” is a stream? 13 user cTime url user cnt SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user Clicks Mary 12:00:00 https://… Bob 12:00:00 https://… Mary 12:00:02 https://… Liz 12:00:03 https://… Bob 1 Liz 1 Mary 1Mary 2 Input data is continuously read Result is continuously updated The result is the same!
  • 14. Why is stream-batch unification important?  Usability • ANSI SQL syntax: No custom “StreamSQL” syntax. • ANSI SQL semantics: No stream-specific results.  Portability • Run the same query on bounded and unbounded data • Run the same query on recorded and real-time data  How can we achieve SQL semantics on streams? 14 now bounded query unbounded query past future bounded query start of the stream unbounded query
  • 15. DBMSs Run Queries on Streams  Materialized views (MV) are similar to regular views, but persisted to disk or memory • Used to speed-up analytical queries • MVs need to be updated when the base tables change  MV maintenance is very similar to SQL on streams • Base table updates are a stream of DML statements • MV definition query is evaluated on that stream • MV is query result and continuously updated 15
  • 16. Continuous Queries in Flink  Core concept is a “Dynamic Table” • Dynamic tables are changing over time  Queries on dynamic tables • produce new dynamic tables (which are updated based on input) • do not terminate  Stream ↔ Dynamic table conversions 16
  • 17. Stream ↔ Dynamic Table Conversions  Append Conversions • Records are only inserted/appended  Upsert Conversions • Records are inserted/updated/deleted • Records have a (composite) unique key  Changelog Conversions • Records are inserted/updated/deleted 17
  • 18. SQL Feature Set in Flink 1.5.0  SELECT FROM WHERE  GROUP BY / HAVING • Non-windowed, TUMBLE, HOP, SESSION windows  JOIN • Windowed INNER, LEFT / RIGHT / FULL OUTER JOIN • Non-windowed INNER JOIN  Scalar, aggregation, table-valued UDFs  SQL CLI Client (beta)  [streaming only] OVER / WINDOW • UNBOUNDED / BOUNDED PRECEDING  [batch only] UNION / INTERSECT / EXCEPT / IN / ORDER BY 18
  • 19. What can I build with this?  Data Pipelines • Transform, aggregate, and move events in real-time  Low-latency ETL • Convert and write streams to file systems, DBMS, K-V stores, indexes, … • Ingest appearing files to produce streams  Stream & Batch Analytics • Run analytical queries over bounded and unbounded data • Query and compare historic and real-time data  Power Live Dashboards • Compute and update data to visualize in real-time 19
  • 20. The New York Taxi Rides Data Set  The New York City Taxi & Limousine Commission provides a public data set about past taxi rides in New York City  We can derive a streaming table from the data  Table: TaxiRides rideId: BIGINT // ID of the taxi ride isStart: BOOLEAN // flag for pick-up (true) or drop-off (false) event lon: DOUBLE // longitude of pick-up or drop-off location lat: DOUBLE // latitude of pick-up or drop-off location rowtime: TIMESTAMP // time of pick-up or drop-off event 20
  • 21. Identify popular pick-up / drop-off locations SELECT cell, isStart, HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd, COUNT(*) AS cnt FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell FROM TaxiRides) GROUP BY cell, isStart, HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) 21  Compute every 5 minutes for each location the number of departing and arriving taxis of the last 15 minutes.
  • 22. Average ride duration per pick-up location SELECT pickUpCell, AVG(TIMESTAMPDIFF(MINUTE, e.rowtime, s.rowtime) AS avgDuration FROM (SELECT rideId, rowtime, toCellId(lon, lat) AS pickUpCell FROM TaxiRides WHERE isStart) s JOIN (SELECT rideId, rowtime FROM TaxiRides WHERE NOT isStart) e ON s.rideId = e.rideId AND e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR GROUP BY pickUpCell 22  Join start ride and end ride events on rideId and compute average ride duration per pick-up location.
  • 23. Building a Dashboard 23 Elastic Search Kafka SELECT cell, isStart, HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd, COUNT(*) AS cnt FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell FROM TaxiRides) GROUP BY cell, isStart, HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)
  • 24. Sounds great! How can I use it?  ATM, SQL queries must be embedded in Java/Scala code  • Tight integration with DataStream and DataSet APIs  Community focused on internals (until Flink 1.4.0) • Operators, types, built-in functions, extensibility (UDFs, extern. catalog) • Proven at scale by Alibaba, Huawei, and Uber • All built their own submission system & connectors library  Community neglected user interfaces • No query submission client, no CLI • No catalog integration • Limited set of TableSources and TableSinks 24
  • 25. Coming in Flink 1.5.0 - SQL CLI Demo Time! That’s a nice toy, but … ... can I use it for anything serious? 25
  • 26. FLIP-24 – A SQL Query Service  REST service to submit & manage SQL queries • SELECT … • INSERT INTO SELECT … • CREATE MATERIALIZE VIEW …  Serve results of “SELECT …” queries  Provide a table catalog (integrated with external catalogs)  Use cases • Data exploration with notebooks like Apache Zeppelin • Access to real-time data from applications • Easy data routing / ETL from management consoles 26
  • 27. Challenge: Serve Dynamic Tables Unbounded input yields unbounded results 27 SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user SELECT user, url FROM clicks WHERE url LIKE '%xyz.com' Append-only Table • Result rows are never changed • Consume, buffer, or drop rows Continuously updating Table • Result rows can be updated or deleted • Consume changelog or periodically query result table • Result table must be maintained somewhere (Serving bounded results is easy)
  • 28. Application FLIP-24 – A SQL Query Service 28 Query Service Catalog Optimizer Database / HDFS Event Log External Catalog (Schema Registry, HCatalog, …) Query Results Submit Query Job State REST Result Server Submit Query REST Database / HDFS Event Log SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user Results are served by Query Service via REST + Application does not need a special client + Works well in many network configurations − Query service can become bottleneck
  • 29. Application FLIP-24 – A SQL Query Service 29 Query Service SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user Catalog Optimizer Database / HDFS Event Log External Catalog (Schema Registry, HCatalog, …) Query Submit Query Job State REST Result Server Submit Query RES T Database / HDFS Event Log Serving Library Result Handle
  • 30. We want your feedback!  The design of SQL Query Service is not final yet.  Check out FLIP-24 and FLINK-7594  Share your ideas and feedback and discuss on JIRA or dev@flink.apache.org. 30
  • 31. Summary  Unification of stream and batch is important.  Flink’s SQL solves many streaming and batch use cases.  Runs in production at Alibaba, Uber, and others.  The community is working on improving user interfaces.  Get involved, discuss, and contribute! 31
  • 32. 15% discount code: FlinkSeattle
  • 33. Flink Forward SF 2018 Presenters 33