Kostas Tzoumas
@kostas_tzoumas
Big Data Ldn
November 4, 2016
Stream Processing with Apache Flink®
Debunking Some Common Myths in Stream Processing
data Artisans
▪ Original creators of Apache Flink®
▪ Providers of the dA Platform, a supported Flink distribution
Outline
▪ What is data streaming
▪ Myth 1: The throughput/latency tradeoff
▪ Myth 2: Exactly once not possible
▪ Myth 3: Streaming is for (near) real-time
▪ Myth 4: Streaming is hard
The streaming architecture
Reconsideration of data architecture
▪ Better app isolation
▪ More real-time reaction to events
▪ Robust continuous applications
▪ Process both real-time and historical data
[Figure: three applications, each with its own app state, together with an event log and a query service]
What is (distributed) streaming
▪ Computations on never-ending “streams” of data records (“events”)
▪ Stream processor distributes the computation in a cluster
[Figure: four parallel instances of "Your code"]
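One common way to distribute a computation over a cluster is to route each event to a worker by hashing its key. The sketch below illustrates that idea in plain Python; it is not Flink's actual routing code, and `partition_for`, `route`, and the sample events are hypothetical names chosen for illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, parallelism: int) -> int:
    """Map a key to a worker index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def route(events, parallelism):
    """Group (key, value) events by the worker index that would process them."""
    workers = defaultdict(list)
    for key, value in events:
        workers[partition_for(key, parallelism)].append((key, value))
    return dict(workers)


# Hypothetical click events keyed by user.
events = [("user-a", 1), ("user-b", 1), ("user-a", 1), ("user-c", 1)]
assignment = route(events, parallelism=4)
```

Because the hash is stable, every event for a given key lands on the same worker, which is what lets each worker keep per-key state locally.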
What is stateful streaming
▪ Computation and state
  • E.g., counters, windows of past events, state machines, trained ML models
▪ Result depends on history of stream
▪ Stateful stream processor gives the tools to manage state
  • Recover, roll back, version, upgrade, etc.
[Figure: "Your code" with attached state]
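To make "computation and state" concrete, here is a minimal sketch of a stateful operator: a per-key counter whose output depends on the history of the stream, with snapshot and restore hooks standing in for the state-management tools a stream processor provides. The class and method names are hypothetical and do not correspond to Flink's state API.

```python
import copy


class KeyedCounter:
    """Counts events per key; the result for an event depends on all earlier events."""

    def __init__(self):
        self.counts = {}

    def process(self, key):
        """Update state for one event and emit (key, running count)."""
        self.counts[key] = self.counts.get(key, 0) + 1
        return key, self.counts[key]

    def snapshot(self):
        """Return an independent copy of the state (a stand-in for a checkpoint)."""
        return copy.deepcopy(self.counts)

    def restore(self, snap):
        """Roll the state back to a previously taken snapshot."""
        self.counts = copy.deepcopy(snap)


op = KeyedCounter()
outputs = [op.process(k) for k in ["a", "b", "a"]]
saved = op.snapshot()   # state after three events
op.process("a")         # state moves on ...
op.restore(saved)       # ... and is rolled back to the snapshot
```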
What is event-time streaming
▪ Data records associated with timestamps (time series data)
▪ Processing depends on timestamps
▪ Event-time stream processor gives you the tools to reason about time
  • E.g., handle streams that are out of order
  • Core feature is watermarks: a clock to measure event time
[Figure: events arrive out of order (t3, t1, t2, t4) at "Your code" with state; results are produced for the ranges t1-t2 and t3-t4]
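The following sketch shows one way watermarks can act as an event-time clock for out-of-order data. It assumes a simple heuristic (watermark = largest timestamp seen minus a fixed allowed lateness) and fixed-size tumbling windows; Flink's own watermark mechanism is more general. The function name, parameters, and sample events are illustrative assumptions.

```python
from collections import defaultdict


def tumbling_windows(events, window_size, max_out_of_orderness):
    """Sum values per event-time window.

    events: (timestamp, value) pairs in arrival order, possibly out of order.
    A window [start, start + window_size) is emitted once the watermark
    reaches its end. Events for already-emitted windows are reported as late.
    """
    open_windows = defaultdict(int)
    results, late = [], []
    watermark = float("-inf")
    for ts, value in events:
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            late.append((ts, value))  # its window was already emitted
            continue
        open_windows[start] += value
        # Assumed heuristic: event time has advanced to (max timestamp - lateness).
        watermark = max(watermark, ts - max_out_of_orderness)
        for s in sorted(w for w in open_windows if w + window_size <= watermark):
            results.append((s, open_windows.pop(s)))
    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        results.append((s, open_windows.pop(s)))
    return results, late


# Arrival order differs from timestamp order; (7, 1) arrives after its window closed.
events = [(3, 1), (1, 1), (2, 1), (4, 1), (12, 1), (5, 1), (25, 1), (7, 1)]
results, late = tumbling_windows(events, window_size=10, max_out_of_orderness=5)
```

Note that the event with timestamp 5 arrives after the event with timestamp 12 and is still counted in the first window, because the watermark had not yet passed that window's end.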
What is streaming
▪ Continuous processing on data that is continuously generated
▪ I.e., pretty much all “big” data
▪ It’s all about state and time
Debunking some common stream processing myths
Myth 1: Throughput/latency tradeoff
▪ Myth 1: you need to choose between high throughput and low latency
▪ Physical limits
  • In reality, the network determines both the achievable throughput and latency
  • A well-engineered system achieves these limits
Flink performance
▪ Tens of millions of events per second on tens of nodes
▪ Scaled to thousands of nodes
▪ With latency in single-digit milliseconds
Myth 2: Exactly once not possible
▪ Exactly once: under failures, the system computes the result as if there had been no failure
▪ In contrast to:
  • At most once: no guarantees (records may be lost)
  • At least once: duplicates possible
▪ Exactly-once state versus exactly-once delivery
▪ Myth 2: exactly-once state is not possible or too costly
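The difference between at-least-once and exactly-once state can be shown with a small simulation. This is a sketch under stated assumptions: a single counter, a replayable source addressed by offset, and one simulated crash. With `restore_state=False` the source rewinds but the count is not rolled back (as if the state lived outside the checkpoint), so replayed events are counted twice. With `restore_state=True` the offset and the count are restored together. All names are hypothetical.

```python
def run_with_failure(events, checkpoint_every, fail_at, restore_state):
    """Count events, crash once at offset fail_at, then resume from the last checkpoint."""
    count = 0
    offset = 0
    checkpoint = (0, 0)  # (source offset, count) captured together
    failed = False
    while offset < len(events):
        if not failed and offset == fail_at:
            failed = True
            ckpt_offset, ckpt_count = checkpoint
            offset = ckpt_offset        # the source always rewinds
            if restore_state:
                count = ckpt_count      # exactly-once state: roll the count back too
            continue
        count += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, count)
    return count


events = list(range(10))
at_least_once = run_with_failure(events, checkpoint_every=4, fail_at=6, restore_state=False)
exactly_once = run_with_failure(events, checkpoint_every=4, fail_at=6, restore_state=True)
```

In this run the crash happens after six events with the last checkpoint at offset 4, so the at-least-once variant counts two events twice while the exactly-once variant ends with the same count as a failure-free run.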
Transactions
▪ “Exactly once” is transactions: either all actions succeed or none succeed
▪ Transactions are possible
▪ Transactions are useful
▪ Let’s not start eventual consistency all over again…
Flink checkpoints
▪ Periodic asynchronous consistent snapshots of application state
▪ Provide exactly-once state guarantees under failures
[Figure (stream_barriers.svg): a data stream of records (events), from newer records to older records, with checkpoint barriers n and n-1 injected into it. Records ahead of barrier n-1 are part of checkpoint n-1, records between the two barriers are part of checkpoint n, and records behind barrier n are part of checkpoint n+1.]
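The barrier idea in the figure can be sketched for a single operator with a single input: barriers travel in the stream alongside records, and when the operator sees barrier n it snapshots its state, which then reflects exactly the records that preceded that barrier. This simplified sketch takes the snapshot synchronously and ignores multiple inputs, both of which the real mechanism has to handle. The tuple encoding of the stream is an assumption made for illustration.

```python
BARRIER = "barrier"
RECORD = "record"


def run_operator(stream):
    """Sum record values; on each barrier, store a snapshot of the running sum.

    stream: list of (RECORD, value) or (BARRIER, checkpoint_id) tuples.
    Returns (final_sum, {checkpoint_id: state_at_barrier}).
    """
    total = 0
    snapshots = {}
    for kind, payload in stream:
        if kind == BARRIER:
            snapshots[payload] = total  # state covers all records before this barrier
        else:
            total += payload
    return total, snapshots


stream = [
    (RECORD, 1), (RECORD, 2),
    (BARRIER, 1),            # records 1 and 2 belong to checkpoint 1
    (RECORD, 3),
    (BARRIER, 2),            # record 3 belongs to checkpoint 2
    (RECORD, 4),             # belongs to the next checkpoint
]
total, snapshots = run_operator(stream)
```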
End-to-end exactly once
▪ Checkpoints double as transaction coordination mechanism
▪ Source and sink operators can take part in checkpoints
▪ Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates
[Figure label: transactional sinks]
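Why idempotent updates give "effectively once" results can be seen by replaying the same output twice, as would happen after a recovery. In this sketch a plain list stands in for an append-only sink and a plain dict stands in for a key-value store with upsert semantics; no real Cassandra API is used, and the function names are hypothetical.

```python
def write_append(store, updates):
    """Non-idempotent sink: every write adds a new row."""
    for key, value in updates:
        store.append((key, value))


def write_upsert(store, updates):
    """Idempotent sink: writing the same (key, value) again changes nothing."""
    for key, value in updates:
        store[key] = value


updates = [("user-a", 3), ("user-b", 1)]
log, table = [], {}
for _ in range(2):  # the second pass simulates a replay after recovery
    write_append(log, updates)
    write_upsert(table, updates)
```

After the replay the append-only sink holds duplicates, while the keyed table is identical to what a single delivery would have produced.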
State management
▪ Checkpoints triple as state versioning mechanism (savepoints)
▪ Go back and forth in time while maintaining state consistency
▪ Ease code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests
Myth 3: Streaming and real time
▪ Myth 3: streaming and real-time are synonymous
▪ Streaming is a new model
  • Essentially, state and time
  • Low latency/real time is the icing on the cake
Low latency and high latency streams
[Figure: a log split into hourly partitions on a timeline (2016-3-1 12:00 am, 1:00 am, 2:00 am, …, 2016-3-11 10:00 pm, 11:00 pm, 2016-3-12 12:00 am, 1:00 am, 2:00 am, 3:00 am…), with three consumption modes marked on it: Stream (low latency), Batch (bounded stream), and Stream (high latency)]
Robust continuous applications
Accurate computation
▪ Batch processing is not an accurate computation model for continuous data
  • Misses the right concepts and primitives
  • Time handling, state across batch boundaries
▪ Stateful stream processing is a better model
  • Real-time/low-latency is the icing on the cake
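A small worked example of the "state across batch boundaries" problem: counting user sessions, where a session ends after a gap of inactivity. Processing hourly batches independently splits any session that straddles the hour, whereas a continuous computation carries the state across. The timestamps, the 300-second gap, and the function names are assumptions chosen for illustration.

```python
def count_sessions(timestamps, gap):
    """Count sessions in sorted timestamps; a gap larger than `gap` starts a new session."""
    sessions = 0
    previous = None
    for ts in timestamps:
        if previous is None or ts - previous > gap:
            sessions += 1
        previous = ts
    return sessions


def count_sessions_in_batches(timestamps, gap, batch_size):
    """Count sessions per fixed-size batch, with no state carried between batches."""
    batches = {}
    for ts in timestamps:
        batches.setdefault(ts // batch_size, []).append(ts)
    return sum(count_sessions(batch, gap) for batch in batches.values())


# One user's clicks in seconds; activity crosses the one-hour mark at 3600.
clicks = [3500, 3550, 3590, 3610, 3650]
continuous = count_sessions(clicks, gap=300)
hourly = count_sessions_in_batches(clicks, gap=300, batch_size=3600)
```

The continuous count sees one session; the hourly batches report two, because the batch boundary cuts the session in half.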
Myth 4: How hard is streaming?
▪ Myth 4: streaming is too hard to learn
▪ You are already doing streaming, just in an ad hoc way
▪ Most data is unbounded and the code changes more slowly than the data
  • This is a streaming problem
It's about your data and code
▪ What's the form of your data?
  • Unbounded (e.g., clicks, sensors, logs), or
  • Bounded (e.g., ???*)
▪ What changes more often?
  • My code changes faster than my data
  • My data changes faster than my code
* Please help me find a great example of naturally bounded data
It's about your data and code
▪ If your data changes faster than your code you have a streaming problem
  • You may be solving it with hourly batch jobs depending on someone else to create the hourly batches
  • You are probably living with inaccurate results without knowing it
It's about your data and code
▪ If your code changes faster than your data you have an exploration problem
  • Using notebooks or other tools for quick data exploration is a good idea
  • Once your code stabilizes you will have a streaming problem, so you might as well think of it as such from the beginning
Flink in the real world
Flink community
▪ > 240 contributors, 95 contributors in Flink 1.1
▪ 42 meetups around the world with > 15,000 members
▪ 2x-3x growth in 2015, similar in 2016
Powered by Flink
Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.
King, the creator of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.
Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.
Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.
See more at flink.apache.org/poweredby.html
30 Flink applications in production for more than one year. 10 billion events (2 TB) processed daily.
Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees.
Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second.
Flink Forward 2016
Current work in Flink
Ongoing Flink development
[Figure: roadmap diagram with the area labels Operations, Ecosystem, Application Features, and Broader Audience. Items shown: Connectors, Session Windows, (Stream) SQL, Library enhancements, Metric System, Metrics & Visualization, Dynamic Scaling, Savepoint compatibility, Checkpoints to savepoints, More connectors, Stream SQL Windows, Large state Maintenance, Fine grained recovery, Side in-/outputs, Window DSL, Security, Mesos & others, Dynamic Resource Management, Authentication, Queryable State]
A longer-term vision for Flink
Streaming use cases
Application                    | Technology
(Near) real-time apps          | Low-latency streaming
Continuous apps                | High-latency streaming
Analytics on historical data   | Batch as special case of streaming
Request/response apps          | Large queryable state
Request/response applications
▪ Queryable state: query Flink state directly instead of pushing results into a database
▪ Large state support and query API coming in Flink
[Figure label: queries]
In summary
▪ The need for streaming comes from a rethinking of data infra architecture
  • Stream processing then just becomes natural
▪ Debunking 4 common myths
  • Myth 1: The throughput/latency tradeoff
  • Myth 2: Exactly once not possible
  • Myth 3: Streaming is for (near) real-time
  • Myth 4: Streaming is hard
Thank you!
@kostas_tzoumas
@ApacheFlink
@dataArtisans
We are hiring!
data-artisans.com/careers
