A Data Streaming Architecture with Apache Flink (Berlin Buzzwords 2016)
1. A Data Streaming Architecture
with Apache Flink
Robert Metzger
@rmetzger_
rmetzger@apache.org
Berlin Buzzwords,
June 7, 2016
2. Talk overview
My take on the stream processing space, and how it changes the way we think about data
Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)
Demo
3. Apache Flink
Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed
Developed at the Apache Software Foundation; 1.0.0 released in March 2016; used in production
6.
1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch
7.
Real-world data is produced in a continuous fashion.
New systems like Flink and Kafka embrace the streaming nature of data.
[Diagram: Web server → Kafka topic → Stream processor]
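The pipeline on this slide can be sketched in plain Python (stdlib only — the queue stands in for the Kafka topic, and all names here are illustrative, not Flink or Kafka APIs):

```python
from queue import Queue

# Stand-in for a Kafka topic: the web server appends events as they
# happen; the stream processor consumes them continuously instead of
# waiting for a finished batch file.
topic = Queue()

def web_server(requests):
    """Produce one event per request, as it occurs."""
    for r in requests:
        topic.put({"path": r})

def stream_processor():
    """Consume events one by one until the topic is drained."""
    seen = []
    while not topic.empty():
        event = topic.get()
        seen.append(event["path"])
    return seen

web_server(["/home", "/cart", "/checkout"])
print(stream_processor())  # ['/home', '/cart', '/checkout']
```

The point is the shape of the dataflow: events are handed downstream as they are produced, never collected into a file first.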
9. What makes Flink flink?
• Low latency, high throughput, and well-behaved flow control (back pressure)
• Make more sense of data: true streaming, event time, APIs, and libraries that work on real-time and historic data
• Stateful streaming: globally consistent savepoints and exactly-once semantics for fault tolerance
• Windows & user-defined state: flexible windows (time, count, session, roll-your-own) and complex event processing
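One of the window types listed above, the tumbling count window, can be sketched in a few lines of stdlib-only Python (Flink's real windowing is distributed, stateful, and fault tolerant; this only shows the grouping logic):

```python
def tumbling_count_windows(events, size):
    """Group a stream into fixed-size, non-overlapping count windows.

    A sketch of one 'flexible window' type from the slide; time and
    session windows differ only in the condition that closes a window.
    """
    window, out = [], []
    for e in events:
        window.append(e)
        if len(window) == size:   # window is full: emit it and start fresh
            out.append(window)
            window = []
    if window:                    # leftover events form a partial window
        out.append(window)
    return out

print(tumbling_count_windows([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```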
11. Extract, Transform, Load (ETL)
ETL: Move data from A to B and transform it on the way
Old approach:
[Diagram: server logs, mobile data, and IoT data as raw sources]
12. [Diagram continued: the sources land in an HDFS / S3 “Data Lake” — Tier 0: raw data]
13. [Diagram continued: periodic jobs normalize and cleanse the raw data into Parquet / ORC in HDFS — Tier 1 — which users query]
14. [Diagram continued: further periodic jobs aggregate Tier 1 data into a “Data Warehouse” — Tier 2 — for users]
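One run of the periodic Tier 0 → Tier 1 job in the old approach can be sketched as follows (stdlib-only and illustrative: in reality the records would be files in HDFS / S3 and the job would run on a schedule):

```python
import json

# Illustrative raw records; in the "old approach" these would sit in
# the data lake until the next scheduled job picks them up.
raw_tier0 = [
    '{"user": "a", "ms": 120}',
    'not json at all',            # dirty record to be cleansed
    '{"user": "b", "ms": 340}',
]

def periodic_etl_job(raw_records):
    """One batch run: parse, cleanse, and normalize (Tier 0 -> Tier 1)."""
    tier1 = []
    for line in raw_records:
        try:
            rec = json.loads(line)                 # extract
        except json.JSONDecodeError:
            continue                               # cleanse: drop malformed records
        rec["seconds"] = rec.pop("ms") / 1000.0    # transform: normalize units
        tier1.append(rec)
    return tier1

print(periodic_etl_job(raw_tier0))
# [{'user': 'a', 'seconds': 0.12}, {'user': 'b', 'seconds': 0.34}]
```

The defining property of the old approach is in the name: nothing in Tier 1 is fresher than the last run of this job.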
15. Extract, Transform, Load (Streaming ETL)
ETL: Move data from A to B and transform it on the way
Streaming approach:
[Diagram: server logs, mobile data, and IoT data land in the “Data Lake” — Tier 0: raw data]
16. [Diagram continued: a Kafka connector feeds the events into a stream processor performing cleansing, transformation, time-windows, and alerts]
17. [Diagram continued: a rolling file sink writes normalized, cleansed data — Tier 1 — as Parquet / ORC in HDFS for users and batch processing; an ES connector feeds Elasticsearch]
18. [Diagram continued: JDBC and Cassandra sinks serve aggregated data — Tier 2 — directly to users]
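The streaming variant applies the same cleansing and transformation as the batch job, but per event as it arrives, and it can feed several sinks from the same stream. A stdlib-only sketch (the sink lists stand in for the rolling-file, ES, JDBC, and Cassandra sinks in the diagram; all names are illustrative):

```python
import json

# Stand-ins for the sinks in the diagram.
tier1_sink, alert_sink = [], []

def on_event(line):
    """Handle one event as it arrives -- no waiting for a batch window."""
    try:
        rec = json.loads(line)                  # extract
    except json.JSONDecodeError:
        return                                  # cleanse: drop malformed events
    rec["seconds"] = rec.pop("ms") / 1000.0     # transform: normalize units
    tier1_sink.append(rec)                      # load: normalized data (Tier 1)
    if rec["seconds"] > 1.0:                    # alert path on the same stream
        alert_sink.append(rec["user"])

for line in ['{"user": "a", "ms": 120}', 'garbage', '{"user": "b", "ms": 2500}']:
    on_event(line)

print(tier1_sink)  # [{'user': 'a', 'seconds': 0.12}, {'user': 'b', 'seconds': 2.5}]
print(alert_sink)  # ['b']
```

Note that the alert fires while the event is in flight, not hours later when a load job runs — the low-latency point the next slide makes.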
19. Streaming ETL: Low Latency
Events are processed immediately
No need to wait until the next “load” batch job is running
Approach → Latency*
• Periodic batch job → hours
• Batch processor with micro-batches → minutes to seconds
• Stream processor → milliseconds
* Your mileage may vary. These are rule-of-thumb estimates.
20. Streaming ETL: Event-time aware
20
Events derived from the same real-world activity might arrive out of order in the system
Flink is event-time aware
[Diagram: events from the same real-world activity carry nearby timestamps (11:28, 11:29) but arrive out of order, due to out-of-sync clocks, network delays, and machine failures]
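The event-time idea can be sketched in stdlib-only Python: events are grouped by the timestamp they carry, not by the order in which they arrive (Flink additionally uses watermarks to decide when a window is complete; this sketch only shows the grouping):

```python
from collections import defaultdict

# Events arrive out of order (network delays, skewed clocks), but each
# carries the timestamp of the real-world activity that produced it.
events = [
    {"minute": "11:29", "value": 3},   # arrives first
    {"minute": "11:28", "value": 1},   # late: belongs to the earlier window
    {"minute": "11:28", "value": 2},
]

def event_time_windows(stream):
    """Sum values per one-minute window, keyed by each event's own
    timestamp rather than its arrival position."""
    windows = defaultdict(int)
    for e in stream:
        windows[e["minute"]] += e["value"]
    return dict(sorted(windows.items()))

print(event_time_windows(events))  # {'11:28': 3, '11:29': 3}
```

An arrival-order (processing-time) window would have split the two 11:28 events across different windows; event time puts them back together.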
Because it's enabling the obvious: processing continuous data in a continuous fashion.
But what is the importance of streaming, and what can you do with it?
First, streaming radically simplifies the data infrastructure by serving many use cases out of the stream processor in real time. This is connected to broader trends like the move to more microservice-based organizations.
Second, streaming is the style of processing needed by new applications. These include the Internet of Things and demand-driven services like Uber.
Third, streaming is just a better way to do many of the traditional use cases because it subsumes batch.