SlideShare une entreprise Scribd logo
1  sur  44
Experiences in running
Apache Flink®

at large scale
Stephan Ewen (@StephanEwen)
2
Lessons learned from running Flink at large scale
Including various things we never expected to
become a problem and evidently still did…
Also a preview to various fixes coming in Flink…
What is large scale?
3
Large Data Volume
( events / sec )
Large Application State
(GBs / TBs)
High Parallelism

(1000s of subtasks)
Complex Dataflow
Graphs
(many operators)
Distributed Coordination
4
Deploying Tasks
5
Happens during initial deployment and recovery
JobManager TaskManager
Akka / RPC Akka / RPC
Blob Server Blob Server
Deployment RPC Call
Contains
- Job Configuration
- Task Code and Objects

- Recover State Handle

- Correlation IDs
Deploying Tasks
6
Happens during initial deployment and recovery
JobManager TaskManager
Akka / RPC Akka / RPC
Blob Server Blob Server
Deployment RPC Call
Contains
- Job Configuration
- Task Code and Objects

- Recover State Handle

- Correlation IDs
KBs
up to MBs
KBs
few bytes
RPC volume during deployment
7
(back of the napkin calculation)
number of
tasks
2 MB
parallelism
size of task

objects
100010 x x
x x
=
= RPC volume
20 GB
~20 seconds on full 10 GBits/s net
> 1 min with avg. of 3 GBits/s net
> 3 min with avg. of 1GBs net
Timeouts and Failure detection
8
~20 seconds on full 10 GBits/s net
> 1 min with avg. of 3 GBits/s net
> 3 min with avg. of 1GBs net
Default RPC timeout: 10 secs
default settings lead to failed

deployments with RPC timeouts
Solution: Increase RPC timeout
Caveat: Increasing the timeout makes failure detection
slower

Future: Reduce RPC load (next slides)
Dissecting the RPC messages
9
Message part Size
Variance across
subtasks

and redeploys
Job Configuration KBs constant
Task Code and
Objects
up to MBs constant
Recover State Handle KBs variable
Correlation IDs few bytes variable
Upcoming: Deploying Tasks
10
Out-of-band transfer and caching of

large and constant message parts
JobManager TaskManager
Akka / RPC Akka / RPC
Blob Server Blob Cache
(1) Deployment RPC Call
(Recover State Handle,
Correlation IDs, BLOB pointers)
(2) Download and cache BLOBs


(Job Config, Task Objects) MBs
KBs
Checkpoints at scale
11
12
…is the most important part of

running a large Flink program
Robustly checkpointing…
Review: Checkpoints
13
Trigger checkpoint Inject checkpoint barrier
stateful

operation
source /
transform
Review: Checkpoints
14
Take state snapshot Trigger state

snapshot
stateful

operation
source /
transform
Review: Checkpoint Alignment
15
1
2
3
4
5
6
a
b
c
d
e
f
begin aligning
checkpoint

barrier n
xy
operator
7
8
9
c
d
e
f
g
h
aligning
ab
operator
23 1
4
5
6
input buffer
y
Review: Checkpoint Alignment
16
7
8
9
d
e
f
g
h
bc
operator
23 1
5
6
emit barrier n
7
8
9
d
e
f
g
h
c
operator
23 1
5
6
input buffer
continuecheckpoint
4 4
i
a
Understanding Checkpoints
17
Understanding Checkpoints
18
How well behaves
the alignment?
(lower is better)
How long do

snapshots take?
delay =

end_to_end – sync – async
Understanding Checkpoints
19
How well behaves
the alignment?
(lower is better)
How long do

snapshots take?
delay =

end_to_end – sync – async
long delay = under backpressure
under constant backpressure
means the application is
under provisioned
too long means
! too much state

per node

! snapshot store cannot

keep up with load

(low bandwidth)
changes with incremental

checkpoints
most important

metric
Alignments: Limit in-flight data
▪ In-flight data is data "between" operators
• On the wire or in the network buffers
• Amount depends mainly on network buffer memory
▪ Need some to buffer out network

fluctuations / transient backpressure
▪ Max amount of in-flight data is max

amount buffered during alignment
20
1
2
3
4
5
6
a
b
c
d
e
f
checkpoint

barrier
xyoperator
Alignments: Limit in-flight data
▪ Flink 1.2: Global pool that distributes across all tasks
• Rule-of-thumb: set to 4 * num_shuffles * parallelism * num_slots
▪ Flink 1.3: Limits the max in-flight

data automatically
• Heuristic based on of channels

and connections involved in a

transfer step
21
1
2
3
4
5
6
a
b
c
d
e
f
checkpoint

barrier
xyoperator
Heavy alignments
▪ A heavy alignment typically happens at some point

! Different load on different paths
▪ Big window emission concurrent

to a checkpoint
▪ Stall of one operator on the path
22
Heavy alignments
▪ A heavy alignment typically happens at some point

! Different load on different paths
▪ Big window emission concurrent

to a checkpoint
▪ Stall of one operator on the path
23
Heavy alignments
▪ A heavy alignment typically happens at some point

! Different load on different paths
▪ Big window emission concurrent

to a checkpoint
▪ Stall of one operator on the path
24
GC stall
Catching up from heavy alignments
▪ Operators that did heavy alignment need to catch up
again
▪ Otherwise, next checkpoint will have a

heavy alignment as well
25
7
8
9
d
e
f
g
h
operator
5
6
7
8
9
d
e
f
g
h
bc
operator
23 1
5
6
4
a
consumed first after

checkpoint completed
bc a
Catching up from heavy alignments
▪ Giving the computation time to catch up before starting
the next checkpoint
• Useful: Set the min-time-between-checkpoints
▪ Asynchronous checkpoints help a lot!
• Shorter stalls in the pipelines means less build-up of in-flight
data
• Catch up already happens concurrently to state materialization
26
Asynchronous Checkpoints
27
Durably persist

snapshots

asynchronously
Processing pipeline continues
stateful

operation
source /
transform
Asynchrony of different state types
28
State Flink 1.2 Flink 1.3 Flink 1.3 +
Keyed state

RocksDB ✔ ✔ ✔
Keyed State

on heap
✘ (✔)

(hidden in 1.2.1) ✔ ✔
Timers ✘ ✔/✘ ✔
Operator State ✘ ✔ ✔
When to use which state backend?
29
Async. Heap RocksDB
State ≥ Memory ?
Complex Objects?

(expensive serialization)
high data rate?
no yes
yes no
yes
no
a bit

simplified
File Systems, Object Stores,

and Checkpointed State
30
Exceeding FS request capacity
▪ Job size: 4 operators
▪ Parallelism: 100s to 1000
▪ State Backend: FsStateBackend
▪ State size: few KBs per operator, 100s to 1000 of files
▪ Checkpoint interval: few secs
▪ Symptom: S3 blocked off connections after exceeding
1000s HEAD requests / sec
31
Exceeding FS request capacity
What happened?
▪ Operators prepare state writes,

ensure parent directory exists
▪ Via the S3 FS (from Hadoop), each mkdirs causes

2 HEAD requests
▪ Flink 1.2: Lazily initialize checkpoint preconditions (dirs.)
▪ Flink 1.3: Core state backends reduce assumption of directories
(PUT/GET/DEL), rich file systems support them as fast paths
32
Reducing FS stress for small state
33
JobManager TaskManager
Checkpoint

Coordinator
Task
TaskManager
Task
TaskTask
Root Checkpoint File

(metadata) checkpoint data

files
Fs/RocksDB state backend

for most states
Reducing FS stress for small state
34
JobManager TaskManager
Checkpoint

Coordinator
Task
TaskManager
Task
TaskTask
checkpoint data

directly in metadata file
Fs/RocksDB state backend

for small states
ack+data
Increasing small state

threshold reduces number

of files (default: 1KB)
Lagging state cleanup
35
JobManager
TM
TMTM
TMTM
TM
TMTMTM
Symptom: Checkpoints get cleaned up too slow

State accumulates over time
many TaskManagers

create files
One JobManager

deleting files
Lagging state cleanup
▪ Problem: FileSystems and Object Stores offer only
synchronous requests to delete state object

Time to delete a checkpoint may accumulates to minutes.
▪ Flink 1.2: Concurrent checkpoint deletes on the JobManager
▪ Flink 1.3: For FileSystems with actual directory structure, use
recursive directory deletes (one request per directory)
36
Orphaned Checkpoint State
37
JobManager
Checkpoint

Coordinator
TaskManager
TaskTask
(1) TaskManager

writes state
(2) Ack and transfer

ownership of state
(3) JobManager

records state reference
Who owns state objects at what time?
Orphaned Checkpoint State
38
fs:///checkpoints/job-61776516/
chk-113
/
chk-129
/
chk-221
/
chk-271
/
chk-272
/
It gets more complicated with incremental checkpoints…
Upcoming: Searching for orphaned state
latest
retained
periodically sweep checkpoint

directory for leftover dirs
Conclusion &

General Recommendations
39
40
The closer you application is to saturating either

network, CPU, memory, FS throughput, etc.

the sooner an extraordinary situation causes a regression
Enough headroom in provisioned capacity means
fast catchup after temporary regressions
Be aware that certain operations are spiky
(like aligned windows)
Production test always with checkpoints ;-)
Recommendations (part 1)
Be aware of the inherent scalability of primitives
▪ Broadcasting state is useful, for example for updating rules / configs,
dynamic code loading, etc.
▪ Broadcasting does not scale, i.e., adding more nodes does not.

Don't use it for high volume joins
▪ Putting very large objects into a ValueState may mean big
serialization effort on access / checkpoint
▪ If the state can be mappified, use MapState – it performs much better
41
Recommendations (part 2)
If you are about recovery time
▪ Having spare TaskManagers helps bridge the time until backup
TaskManagers come online
▪ Having a spare JobManager can be useful
• Future: JobManager failures are non disruptive
42
Recommendations (part 3)
If you care about CPU efficiency, watch your serializers
▪ JSON is a flexible, but awfully inefficient data type
▪ Kryo does okay - make sure you register the types
▪ Flink's directly supported types have good performance

basic types, arrays, tuples, …
▪ Nothing ever beats a custom serializer ;-)
43
44
Thank you!
Questions?

Contenu connexe

Tendances

Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkVerverica
 
Kostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsKostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsVerverica
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...ucelebi
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupStephan Ewen
 
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...Flink Forward
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Stephan Ewen
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward
 
Pulsar connector on flink 1.14
Pulsar connector on flink 1.14Pulsar connector on flink 1.14
Pulsar connector on flink 1.14宇帆 盛
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward
 
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...Flink Forward
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Ververica
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetupKostas Tzoumas
 
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...Paris Carbone
 
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...Flink Forward
 
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
 
Flink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache Flink
Flink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache FlinkFlink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache Flink
Flink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache FlinkFlink Forward
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamVerverica
 
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...Flink Forward
 

Tendances (20)

Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
 
Kostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsKostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIs
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
 
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
 
Pulsar connector on flink 1.14
Pulsar connector on flink 1.14Pulsar connector on flink 1.14
Pulsar connector on flink 1.14
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
 
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
 
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
 
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...
 
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
 
Flink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache Flink
Flink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache FlinkFlink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache Flink
Flink Forward Berlin 2017: Tzu-Li (Gordon) Tai - Managing State in Apache Flink
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
 
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
 

Similaire à Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large Scale

Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Linux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance ShowdownLinux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance ShowdownScyllaDB
 
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...Ververica
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkRobert Metzger
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkRobert Metzger
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging振东 刘
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Toronto-Oracle-Users-Group
 
Container orchestration from theory to practice
Container orchestration from theory to practiceContainer orchestration from theory to practice
Container orchestration from theory to practiceDocker, Inc.
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Apache Flink Taiwan User Group
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Container Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeContainer Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeDocker, Inc.
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connectconfluent
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker seriesMonal Daxini
 
QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...QNIB Solutions
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasMonal Daxini
 

Similaire à Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large Scale (20)

Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Linux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance ShowdownLinux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance Showdown
 
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
 
Container orchestration from theory to practice
Container orchestration from theory to practiceContainer orchestration from theory to practice
Container orchestration from theory to practice
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Container Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeContainer Orchestration from Theory to Practice
Container Orchestration from Theory to Practice
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connect
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker series
 
QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Postgres clusters
Postgres clustersPostgres clusters
Postgres clusters
 

Plus de Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!Flink Forward
 

Plus de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 

Dernier

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 

Dernier (20)

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 

Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large Scale

  • 1. Experiences in running Apache Flink®
 at large scale Stephan Ewen (@StephanEwen)
  • 2. 2 Lessons learned from running Flink at large scale Including various things we never expected to become a problem and evidently still did… Also a preview to various fixes coming in Flink…
  • 3. What is large scale? 3 Large Data Volume ( events / sec ) Large Application State (GBs / TBs) High Parallelism
 (1000s of subtasks) Complex Dataflow Graphs (many operators)
  • 5. Deploying Tasks 5 Happens during initial deployment and recovery JobManager TaskManager Akka / RPC Akka / RPC Blob Server Blob Server Deployment RPC Call Contains - Job Configuration - Task Code and Objects
 - Recover State Handle
 - Correlation IDs
  • 6. Deploying Tasks 6 Happens during initial deployment and recovery JobManager TaskManager Akka / RPC Akka / RPC Blob Server Blob Server Deployment RPC Call Contains - Job Configuration - Task Code and Objects
 - Recover State Handle
 - Correlation IDs KBs up to MBs KBs few bytes
  • 7. RPC volume during deployment 7 (back of the napkin calculation) number of tasks 2 MB parallelism size of task
 objects 100010 x x x x = = RPC volume 20 GB ~20 seconds on full 10 GBits/s net > 1 min with avg. of 3 GBits/s net > 3 min with avg. of 1GBs net
  • 8. Timeouts and Failure detection 8 ~20 seconds on full 10 GBits/s net > 1 min with avg. of 3 GBits/s net > 3 min with avg. of 1GBs net Default RPC timeout: 10 secs default settings lead to failed
 deployments with RPC timeouts Solution: Increase RPC timeout Caveat: Increasing the timeout makes failure detection slower
 Future: Reduce RPC load (next slides)
  • 9. Dissecting the RPC messages 9 Message part Size Variance across subtasks
 and redeploys Job Configuration KBs constant Task Code and Objects up to MBs constant Recover State Handle KBs variable Correlation IDs few bytes variable
  • 10. Upcoming: Deploying Tasks 10 Out-of-band transfer and caching of
 large and constant message parts JobManager TaskManager Akka / RPC Akka / RPC Blob Server Blob Cache (1) Deployment RPC Call (Recover State Handle, Correlation IDs, BLOB pointers) (2) Download and cache BLOBs 
 (Job Config, Task Objects) MBs KBs
  • 12. 12 …is the most important part of
 running a large Flink program Robustly checkpointing…
  • 13. Review: Checkpoints 13 Trigger checkpoint Inject checkpoint barrier stateful
 operation source / transform
  • 14. Review: Checkpoints 14 Take state snapshot Trigger state
 snapshot stateful
 operation source / transform
  • 15. Review: Checkpoint Alignment 15 1 2 3 4 5 6 a b c d e f begin aligning checkpoint
 barrier n xy operator 7 8 9 c d e f g h aligning ab operator 23 1 4 5 6 input buffer y
  • 16. Review: Checkpoint Alignment 16 7 8 9 d e f g h bc operator 23 1 5 6 emit barrier n 7 8 9 d e f g h c operator 23 1 5 6 input buffer continuecheckpoint 4 4 i a
  • 18. Understanding Checkpoints 18 How well behaves the alignment? (lower is better) How long do
 snapshots take? delay =
 end_to_end – sync – async
  • 19. Understanding Checkpoints 19 How well behaves the alignment? (lower is better) How long do
 snapshots take? delay =
 end_to_end – sync – async long delay = under backpressure under constant backpressure means the application is under provisioned too long means ! too much state
 per node
 ! snapshot store cannot
 keep up with load
 (low bandwidth) changes with incremental
 checkpoints most important
 metric
  • 20. Alignments: Limit in-flight data ▪ In-flight data is data "between" operators • On the wire or in the network buffers • Amount depends mainly on network buffer memory ▪ Need some to buffer out network
 fluctuations / transient backpressure ▪ Max amount of in-flight data is max
 amount buffered during alignment 20 1 2 3 4 5 6 a b c d e f checkpoint
 barrier xyoperator
  • 21. Alignments: Limit in-flight data ▪ Flink 1.2: Global pool that distributes across all tasks • Rule-of-thumb: set to 4 * num_shuffles * parallelism * num_slots ▪ Flink 1.3: Limits the max in-flight
 data automatically • Heuristic based on of channels
 and connections involved in a
 transfer step 21 1 2 3 4 5 6 a b c d e f checkpoint
 barrier xyoperator
  • 22. Heavy alignments ▪ A heavy alignment typically happens at some point
 ! Different load on different paths ▪ Big window emission concurrent
 to a checkpoint ▪ Stall of one operator on the path 22
  • 23. Heavy alignments ▪ A heavy alignment typically happens at some point
 ! Different load on different paths ▪ Big window emission concurrent
 to a checkpoint ▪ Stall of one operator on the path 23
  • 24. Heavy alignments ▪ A heavy alignment typically happens at some point
 ! Different load on different paths ▪ Big window emission concurrent
 to a checkpoint ▪ Stall of one operator on the path 24 GC stall
  • 25. Catching up from heavy alignments ▪ Operators that did heavy alignment need to catch up again ▪ Otherwise, next checkpoint will have a
 heavy alignment as well 25 7 8 9 d e f g h operator 5 6 7 8 9 d e f g h bc operator 23 1 5 6 4 a consumed first after
 checkpoint completed bc a
  • 26. Catching up from heavy alignments ▪ Giving the computation time to catch up before starting the next checkpoint • Useful: Set the min-time-between-checkpoints ▪ Asynchronous checkpoints help a lot! • Shorter stalls in the pipelines means less build-up of in-flight data • Catch up already happens concurrently to state materialization 26
  • 27. Asynchronous Checkpoints 27 Durably persist
 snapshots
 asynchronously Processing pipeline continues stateful
 operation source / transform
  • 28. Asynchrony of different state types 28 State Flink 1.2 Flink 1.3 Flink 1.3 + Keyed state
 RocksDB ✔ ✔ ✔ Keyed State
 on heap ✘ (✔)
 (hidden in 1.2.1) ✔ ✔ Timers ✘ ✔/✘ ✔ Operator State ✘ ✔ ✔
  • 29. When to use which state backend? 29 Async. Heap RocksDB State ≥ Memory ? Complex Objects?
 (expensive serialization) high data rate? no yes yes no yes no a bit
 simplified
  • 30. File Systems, Object Stores,
 and Checkpointed State 30
  • 31. Exceeding FS request capacity ▪ Job size: 4 operators ▪ Parallelism: 100s to 1000 ▪ State Backend: FsStateBackend ▪ State size: few KBs per operator, 100s to 1000 of files ▪ Checkpoint interval: few secs ▪ Symptom: S3 blocked off connections after exceeding 1000s HEAD requests / sec 31
  • 32. Exceeding FS request capacity What happened? ▪ Operators prepare state writes,
 ensure parent directory exists ▪ Via the S3 FS (from Hadoop), each mkdirs causes
 2 HEAD requests ▪ Flink 1.2: Lazily initialize checkpoint preconditions (dirs.) ▪ Flink 1.3: Core state backends reduce assumption of directories (PUT/GET/DEL), rich file systems support them as fast paths 32
  • 33. Reducing FS stress for small state 33 JobManager TaskManager Checkpoint
 Coordinator Task TaskManager Task TaskTask Root Checkpoint File
 (metadata) checkpoint data
 files Fs/RocksDB state backend
 for most states
  • 34. Reducing FS stress for small state 34 JobManager TaskManager Checkpoint
 Coordinator Task TaskManager Task TaskTask checkpoint data
 directly in metadata file Fs/RocksDB state backend
 for small states ack+data Increasing small state
 threshold reduces number
 of files (default: 1KB)
  • 35. Lagging state cleanup 35 JobManager TM TMTM TMTM TM TMTMTM Symptom: Checkpoints get cleaned up too slow
 State accumulates over time many TaskManagers
 create files One JobManager
 deleting files
  • 36. Lagging state cleanup ▪ Problem: FileSystems and Object Stores offer only synchronous requests to delete state object
 Time to delete a checkpoint may accumulates to minutes. ▪ Flink 1.2: Concurrent checkpoint deletes on the JobManager ▪ Flink 1.3: For FileSystems with actual directory structure, use recursive directory deletes (one request per directory) 36
  • 37. Orphaned Checkpoint State 37 JobManager Checkpoint
 Coordinator TaskManager TaskTask (1) TaskManager
 writes state (2) Ack and transfer
 ownership of state (3) JobManager
 records state reference Who owns state objects at what time?
  • 38. Orphaned Checkpoint State 38 fs:///checkpoints/job-61776516/ chk-113 / chk-129 / chk-221 / chk-271 / chk-272 / It gets more complicated with incremental checkpoints… Upcoming: Searching for orphaned state latest retained periodically sweep checkpoint
 directory for leftover dirs
  • 40. 40 The closer you application is to saturating either
 network, CPU, memory, FS throughput, etc.
 the sooner an extraordinary situation causes a regression Enough headroom in provisioned capacity means fast catchup after temporary regressions Be aware that certain operations are spiky (like aligned windows) Production test always with checkpoints ;-)
  • 41. Recommendations (part 1) Be aware of the inherent scalability of primitives ▪ Broadcasting state is useful, for example for updating rules / configs, dynamic code loading, etc. ▪ Broadcasting does not scale, i.e., adding more nodes does not.
 Don't use it for high volume joins ▪ Putting very large objects into a ValueState may mean big serialization effort on access / checkpoint ▪ If the state can be mappified, use MapState – it performs much better 41
  • 42. Recommendations (part 2) If you are about recovery time ▪ Having spare TaskManagers helps bridge the time until backup TaskManagers come online ▪ Having a spare JobManager can be useful • Future: JobManager failures are non disruptive 42
  • 43. Recommendations (part 3) If you care about CPU efficiency, watch your serializers ▪ JSON is a flexible, but awfully inefficient data type ▪ Kryo does okay - make sure you register the types ▪ Flink's directly supported types have good performance
 basic types, arrays, tuples, … ▪ Nothing ever beats a custom serializer ;-) 43