SlideShare une entreprise Scribd logo
1  sur  54
Patrick Lucas
patrick@data-artisans.com
@theplucas
Flink in Containerland
Deploying Apache Flink on Kubernetes
What’s this talk about?
Lessons learned based on…
▪ getting a Flink cluster working in Docker
▪ creating the Flink image for Docker Hub
▪ deploying Flink on Kubernetes (K8s)
▪ tying it all together for dA Platform 2
2
Why containerized Flink?
▪ Fits our ideal deployment model (more
about this later)
▪ Efficient dynamic resource allocation and
scaling
▪ Application-oriented instead of machine-
oriented philosophy
3
Why Kubernetes?
▪ Principled API design
▪ Resource-oriented
▪ Disciplined about the layer of infrastructure
it occupies
▪ Provides all the necessary parts to effectively
run containerized applications at scale
4
Agenda
▪ Part 1: Setting the Stage
▪ Background of Flink and Kubernetes
▪ Part 2: Challenges and Solutions
▪ Discussion of and solutions for various challenges
with containerized Flink
▪ Part 3: Conclusion
▪ Part 4: Q&A
5
Part 1 — Setting the Stage
6
Agenda (Part 1)
▪ Introduction
▪ Recap of Flink architecture
▪ One job per cluster
▪ Kubernetes intro and terminology
▪ Flink-on-K8s example
7
Recap of Flink architecture
▪ 1 active JobManager (JM)
▪ central coordinator for cluster/jobs
▪ orchestrates task distribution and
checkpointing
▪ serves UI and HTTP API
8
Recap of Flink architecture
▪ 1 or more TaskManagers (TM)
▪ execute user code
▪ configured with address of JM
9
Recap of Flink architecture
10
JM
JM (spare)
ZooKeeper
TM
TM
TM
TM
TM
TM
TM
TM
TM
Coordination and
task distribution
leader election/
metadata
data flows
within job
Sources
and sinks
State
storage for
checkpoints and
savepoints
Agenda (Part 1)
▪ Introduction
▪ Recap of Flink architecture
▪ One job per cluster
▪ Kubernetes intro and terminology
▪ Flink-on-K8s example
11
One job per cluster
▪ Evolving job deployment philosophy
▪ Tie cluster lifecycle to job lifecycle
▪ Ideally, the job is first-class
▪ think about running a job on a platform
like Kubernetes, not a running Flink
cluster first then submitting the job
12
One job per cluster
▪ Early versions of Flink heavily inspired by
Hadoop
▪ have a cluster comprised of nodes
having different roles (JM, TMs)
▪ FLIP-6, ongoing project to improve and
modernize this aspect of Flink
13
Agenda (Part 1)
▪ Introduction
▪ Recap of Flink architecture
▪ One job per cluster
▪ Kubernetes intro and terminology
▪ Flink-on-K8s example
14
Kubernetes intro and terminology
▪ Resource-oriented with declarative configuration
▪ Tell K8s about what should exist, and a background
process asynchronously makes it happen
▪ “3 replicas of this container should be kept running,
only on nodes with this label”
▪ “a load balancer should exist, listening on port 443,
backed by containers with this label”
15
Kubernetes intro and terminology
▪ Various kinds of resources, very extensible (add your own
resources and “controllers” that handle them)
▪ Some core types:
▪ Pod (one or more containers running on a node)
▪ Deployment (keep 0 or more pods running, specified by a
template)
▪ Service (a hostname-accessible network service that listens on
one or more ports and proxies to pods that match given labels)
▪ ConfigMap (data that can be rendered as files or env vars)
16
Agenda (Part 1)
▪ Introduction
▪ Recap of Flink architecture
▪ One job per cluster
▪ Kubernetes intro and terminology
▪ Flink-on-K8s example
17
Flink-on-K8s example
▪ JobManager Deployment
▪ maintain 1 replica of Flink container running as a
JobManager
▪ apply a label like “flink-jobmanager”
▪ JobManager Service
▪ make JobManager accessible by a hostname & port
▪ select pods with label “flink-jobmanager”
18
Flink-on-K8s example
▪ TaskManager Deployment
▪ maintain n replica of Flink container
running as a TaskManager
▪ Flink config ConfigMap
▪ Materialize necessary config into files (i.e.
flink-conf.yaml; perhaps core-site.xml)
19
Simple Flink-on-Kubernetes
20
Unfortunately, no one can
be told what Kubernetes is.
You have to use it for
yourself.
“
Morpheus, The Matrix
TaskManager DeploymentJobManager
Deployment
Flink-on-K8s example
21
Kubernetes Nodes
Flink Pod (JM) Flink Pod (TM) Flink Pod (TM) Flink Pod (TM)
JobManager
Service
ConfigMap including
contents of flink-conf.yaml
mounted as volume in each
pod at /etc/flink
Service exposes predict-
able hostname and load
balances connections
across matching pods
Deployment ensures
desired # of pods remain
running and applies
upgrade strategy
Part 2 — Challenges and Solutions
22
Agenda (Part 2)
▪ JobManager RPC
▪ Port configuration
▪ “HA mode”
▪ Extra libraries, logging, other config
▪ Queryable state
▪ TLS/on-the-wire encryption
▪ Resource allocation
23
JobManager RPC
▪ Networking is the source of most challenges with
Flink-on-Kubernetes
▪ 5 of 7 sections in Part 2 are networking-related
▪ If you run a simple deployment on Kubernetes as
I described above… it just doesn’t work (or rather
it doesn’t “just work”)
24
JobManager RPC
▪ Most visible issue: use of Akka for Flink CLI RPC (including
job submission)
▪ Akka requires that the “name” (hostname or IP) used to
address the JM matches the name the JM itself is
configured with
▪ If connecting from outside the K8s environment, such
via port-forwarding or over a load balancer, messages
are silently dropped
25
JobManager RPC
▪ JM running in pod my-flink-cluster-jobmanager-3063995c2t0
▪ JM Service my-flink-cluster-jobmanager, listening on Flink RPC,
blob server, and UI ports
▪ JM/TMs configured with jobmanager.rpc.address: my-flink-
cluster-jobmanager
▪ TMs work fine, as they talk to the JM with the same name as it sees
itself (i.e. the name of the Service)
▪ But if you forward the JM RPC port locally or access it via a proxy…
26
JobManager RPC
From my laptop, having forwarded port 6123 to a running cluster on K8s:
$ flink run -m localhost:6123 examples/batch/WordCount.jar
Cluster configuration: Standalone cluster with JobManager at localhost/127.0.0.1:6123
Using address localhost:6123 to connect to JobManager.
. . .
Submitting job with JobID: ff86d06802fc69554ee0a5f726204db5. Waiting for job
completion.
… which hangs until a timeout. In the JobManager’s log:
2017-09-08 09:42:26,082 ERROR akka.remote.EndpointWriter
[] - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://flink@localhost:6123/]] arriving at [akka.tcp://flink@localhost:
6123] inbound addresses are [akka.tcp://flink@my-flink-cluster-jobmanager:6123]
27
JobManager RPC
▪ One solution is to submit the job from within the K8s
environment
▪ e.g. run a pod that contains your job jar, use the
Flink CLI to submit
▪ Another solution is to use the HTTP API (or the UI)
▪ Not subject to “addressability” problems
▪ Cons: hard to secure, not well specified
28
Agenda (Part 2)
▪ JobManager RPC
▪ Port configuration
▪ “HA mode”
▪ Extra libraries, logging, other config
▪ Queryable state
▪ TLS/on-the-wire encryption
▪ Resource allocation
29
Port configuration
▪ Flink chooses arbitrary ports for some
services (i.e. blob server, queryable state)
▪ Must specify these in Flink config and
JobManager service
▪ blob.server.port: 6124
▪ query.server.port: 6125
30
Agenda (Part 2)
▪ JobManager RPC
▪ Port configuration
▪ “HA mode”
▪ Extra libraries, logging, other config
▪ Queryable state
▪ TLS/on-the-wire encryption
▪ Resource allocation
31
“HA mode”
▪ A misnomer—always use HA mode in production,
even with 1 JobManager
▪ JM stores critical state in ZooKeeper that allows
it to recover if the JM dies
▪ On K8s and other resource managers, should only
need to run a single JobManager as the manager
will re-launch it upon failure
32
“HA mode”
▪ If you do run >1 JM, accessing the UI becomes difficult
▪ If you expose both JMs in a Service, but your request
routes to the non-leader, will be redirected to internal
hostname of leader
▪ One solution seen on the mailing list, have non-leader
fail “readiness probe”, so doesn’t receive traffic
▪ Possible improvement to Flink: proxy non-leader requests
to leader
33
Agenda (Part 2)
▪ JobManager RPC
▪ Port configuration
▪ “HA mode”
▪ Extra libraries, logging, other config
▪ Queryable state
▪ TLS/on-the-wire encryption
▪ Resource allocation
34
Extra libraries, logging, other config
▪ For code dependencies, easiest to build a “fat jar”
that includes them
▪ Notable exceptions are state backend, logging, and
metrics dependencies, for example…
▪ S3 state storage requires extra AWS, Hadoop jars
▪ SLF4J and Graylog’s GELF appender require
additional jars
35
Extra libraries, logging, other config
▪ For these, easiest to build your own Flink image
▪ See https://github.com/docker-flink/docker-flink
to see how the public image is built
▪ Alternatively, an advanced solution is to build an
image containing your jars and use Kubernetes’
“init-containers” feature to load them in before Flink
starts
36
Agenda (Part 2)
▪ JobManager RPC
▪ Port configuration
▪ “HA mode”
▪ Extra libraries, logging, other config
▪ Queryable state
▪ TLS/on-the-wire encryption
▪ Resource allocation
37
Queryable state
▪ Currently somewhat tricky
▪ Only works out-of-the-box if all TMs are addressable and routable
▪ For most setups, this means you must be in the same K8s
environment
▪ Also suffer from Akka addressability issue
▪ Possible improvement to Flink: enable TMs to proxy queries to the
correct TM, à la Cassandra
▪ Then, create another Service in front of all TMs listening on the
query server port
38
Queryable state
▪ A more cumbersome solution that would work today is to
create a service yourself that runs in the same K8s
environment and proxies requests
▪ This has some added advantages, such as…
▪ horizontal scalability
▪ an opportunity for caching
▪ a way to maintain partial availability while the
underlying Flink job is down, e.g. during upgrades
39
Agenda (Part 2)
▪ JobManager RPC
▪ Port configuration
▪ “HA mode”
▪ Extra libraries, logging, other config
▪ Queryable state
▪ TLS/on-the-wire encryption
▪ Resource allocation
40
TLS/on-the-wire encryption
▪ Also a bit tricky, as Pods inherently don’t have predictable
hostnames or IPs
▪ One solution: generate a trustStore/keyStore when launching
cluster
▪ Which contain a single key pair and certificate that all JMs/TMs
in the cluster use and verify against
▪ Essentially symmetric encryption in the “convenient” wrapper
of TLS
▪ I haven’t actually tried this
41
TLS/on-the-wire encryption
▪ Another possibility: use an overlay network in
Kubernetes that magically does on-the-wire encryption
▪ Encryption supported in some fashion by some
overlay networks
▪ Whether or not this is acceptable to you depends
on your requirements
▪ I haven’t actually tried this
42
Agenda (Part 2)
▪ JobManager RPC
▪ Port configuration
▪ “HA mode”
▪ Extra libraries, logging, other config
▪ Queryable state
▪ TLS/on-the-wire encryption
▪ Resource allocation
43
Resource allocation
▪ How many slots should each TM have?
▪ >1 slot per TM can save some overhead
▪ But fewer slots/TM could theoretically reduce
time to recovery (restoring from checkpoint)
▪ But more machines means a higher chance
of some failure
44
Resource allocation
▪ How much CPU/RAM should each TM get?
▪ Ideally, each slot should have one CPU
▪ Within a slot, processing is serial, so >1
CPU/slot shouldn’t matter much
▪ RAM/slot ratio is application-dependent;
1GB is a decent minimum to start with
45
Resource allocation
▪ Check out Robert Metzger’s talk about Flink
operations to some deeper discussion and insight
about this
▪ Keep It Going — How to Reliably and Efficiently
Operate Apache Flink
▪ Available on the FlinkForward YouTube channel
before the end of the day
46
Part 3 — Conclusion
47
Conclusion
▪ Containerization/resource management
platforms have a bright future
▪ Flink well-positioned, but has plenty of
room for improvement
48
Conclusion
▪ My most-wanted Flink improvement vis-à-vis Flink-on-K8s
▪ Establish a specified HTTP API for all management
and monitoring functions
▪ Rewrite the CLI to use this API exclusively
▪ Provide health-check endpoints for JobManagers and
TaskManagers that can be used for Liveness/
ReadinessChecks
49
See Also
▪ FLINK-6369 - Better support for overlay networks
▪ for discussion about the networking-/overlay network-
related issues discussed in this presentation
▪ The Flink User mailing list
▪ plenty of posts about the topics I covered today
▪ FLIP-6 - Flink Deployment and Process Model
▪ ongoing work to improve how Flink interacts with
resource managers, notably Mesos and YARN
50
51
52
Thank you!
@theplucas
@ApacheFlink
@dataArtisans
We are hiring!
data-artisans.com/careers
Part 3 — Q & A
54

Contenu connexe

Tendances

Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
Flink Forward
 

Tendances (20)

Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst Practices
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain
 

Similaire à Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland

Similaire à Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland (20)

Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
 
Docker Swarm and Traefik 2.0
Docker Swarm and Traefik 2.0Docker Swarm and Traefik 2.0
Docker Swarm and Traefik 2.0
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
19. Cloud Native Computing - Kubernetes - Bratislava - Databases in K8s world
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
 
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructureDevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
 
Kubernetes: The Next Research Platform
Kubernetes: The Next Research PlatformKubernetes: The Next Research Platform
Kubernetes: The Next Research Platform
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Hands on with CoAP and Californium
Hands on with CoAP and CaliforniumHands on with CoAP and Californium
Hands on with CoAP and Californium
 
ROS Course Slides Course 2.pdf
ROS Course Slides Course 2.pdfROS Course Slides Course 2.pdf
ROS Course Slides Course 2.pdf
 
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OSPutting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
 
A day in the life of a log message
A day in the life of a log messageA day in the life of a log message
A day in the life of a log message
 
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander KukushkinPGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC session
 
Virtual Flink Forward 2020: Integrate Flink with Kubernetes natively - Yang Wang
Virtual Flink Forward 2020: Integrate Flink with Kubernetes natively - Yang WangVirtual Flink Forward 2020: Integrate Flink with Kubernetes natively - Yang Wang
Virtual Flink Forward 2020: Integrate Flink with Kubernetes natively - Yang Wang
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Rook - cloud-native storage
Rook - cloud-native storageRook - cloud-native storage
Rook - cloud-native storage
 
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 

Plus de Flink Forward

Plus de Flink Forward (18)

“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior Detection
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 

Dernier

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 

Dernier (20)

Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 

Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland

  • 1. Patrick Lucas patrick@data-artisans.com @theplucas Flink in Containerland Deploying Apache Flink on Kubernetes
  • 2. What’s this talk about? Lessons learned based on… ▪ getting a Flink cluster working in Docker ▪ creating the Flink image for Docker Hub ▪ deploying Flink on Kubernetes (K8s) ▪ tying it all together for dA Platform 2 2
  • 3. Why containerized Flink? ▪ Fits our ideal deployment model (more about this later) ▪ Efficient dynamic resource allocation and scaling ▪ Application-oriented instead of machine- oriented philosophy 3
  • 4. Why Kubernetes? ▪ Principled API design ▪ Resource-oriented ▪ Disciplined about the layer of infrastructure it occupies ▪ Provides all the necessary parts to effectively run containerized applications at scale 4
  • 5. Agenda ▪ Part 1: Setting the Stage ▪ Background of Flink and Kubernetes ▪ Part 2: Challenges and Solutions ▪ Discussion of and solutions for various challenges with containerized Flink ▪ Part 3: Conclusion ▪ Part 4: Q&A 5
  • 6. Part 1 — Setting the Stage 6
  • 7. Agenda (Part 1) ▪ Introduction ▪ Recap of Flink architecture ▪ One job per cluster ▪ Kubernetes intro and terminology ▪ Flink-on-K8s example 7
  • 8. Recap of Flink architecture ▪ 1 active JobManager (JM) ▪ central coordinator for cluster/jobs ▪ orchestrates task distribution and checkpointing ▪ serves UI and HTTP API 8
  • 9. Recap of Flink architecture ▪ 1 or more TaskManagers (TM) ▪ execute user code ▪ configured with address of JM 9
  • 10. Recap of Flink architecture 10 JM JM (spare) ZooKeeper TM TM TM TM TM TM TM TM TM Coordination and task distribution leader election/ metadata data flows within job Sources and sinks State storage for checkpoints and savepoints
  • 11. Agenda (Part 1) ▪ Introduction ▪ Recap of Flink architecture ▪ One job per cluster ▪ Kubernetes intro and terminology ▪ Flink-on-K8s example 11
  • 12. One job per cluster ▪ Evolving job deployment philosophy ▪ Tie cluster lifecycle to job lifecycle ▪ Ideally, the job is first-class ▪ think about running a job on a platform like Kubernetes, not a running Flink cluster first then submitting the job 12
  • 13. One job per cluster ▪ Early versions of Flink heavily inspired by Hadoop ▪ have a cluster comprised of nodes having different roles (JM, TMs) ▪ FLIP-6, ongoing project to improve and modernize this aspect of Flink 13
  • 14. Agenda (Part 1) ▪ Introduction ▪ Recap of Flink architecture ▪ One job per cluster ▪ Kubernetes intro and terminology ▪ Flink-on-K8s example 14
  • 15. Kubernetes intro and terminology ▪ Resource-oriented with declarative configuration ▪ Tell K8s about what should exist, and a background process asynchronously makes it happen ▪ “3 replicas of this container should be kept running, only on nodes with this label” ▪ “a load balancer should exist, listening on port 443, backed by containers with this label” 15
  • 16. Kubernetes intro and terminology ▪ Various kinds of resources, very extensible (add your own resources and “controllers” that handle them) ▪ Some core types: ▪ Pod (one or more containers running on a node) ▪ Deployment (keep 0 or more pods running, specified by a template) ▪ Service (a hostname-accessible network service that listens on one or more ports and proxies to pods that match given labels) ▪ ConfigMap (data that can be rendered as files or env vars) 16
  • 17. Agenda (Part 1) ▪ Introduction ▪ Recap of Flink architecture ▪ One job per cluster ▪ Kubernetes intro and terminology ▪ Flink-on-K8s example 17
  • 18. Flink-on-K8s example ▪ JobManager Deployment ▪ maintain 1 replica of Flink container running as a JobManager ▪ apply a label like “flink-jobmanager” ▪ JobManager Service ▪ make JobManager accessible by a hostname & port ▪ select pods with label “flink-jobmanager” 18
  • 19. Flink-on-K8s example ▪ TaskManager Deployment ▪ maintain n replica of Flink container running as a TaskManager ▪ Flink config ConfigMap ▪ Materialize necessary config into files (i.e. flink-conf.yaml; perhaps core-site.xml) 19
  • 20. Simple Flink-on-Kubernetes 20 Unfortunately, no one can be told what Kubernetes is. You have to use it for yourself. “ Morpheus, The Matrix
  • 21. TaskManager DeploymentJobManager Deployment Flink-on-K8s example 21 Kubernetes Nodes Flink Pod (JM) Flink Pod (TM) Flink Pod (TM) Flink Pod (TM) JobManager Service ConfigMap including contents of flink-conf.yaml mounted as volume in each pod at /etc/flink Service exposes predict- able hostname and load balances connections across matching pods Deployment ensures desired # of pods remain running and applies upgrade strategy
  • 22. Part 2 — Challenges and Solutions 22
  • 23. Agenda (Part 2) ▪ JobManager RPC ▪ Port configuration ▪ “HA mode” ▪ Extra libraries, logging, other config ▪ Queryable state ▪ TLS/on-the-wire encryption ▪ Resource allocation 23
  • 24. JobManager RPC ▪ Networking is the source of most challenges with Flink-on-Kubernetes ▪ 5 of 7 sections in Part 2 are networking-related ▪ If you run a simple deployment on Kubernetes as I described above… it just doesn’t work (or rather it doesn’t “just work”) 24
  • 25. JobManager RPC ▪ Most visible issue: use of Akka for Flink CLI RPC (including job submission) ▪ Akka requires that the “name” (hostname or IP) used to address the JM matches the name the JM itself is configured with ▪ If connecting from outside the K8s environment, such via port-forwarding or over a load balancer, messages are silently dropped 25
  • 26. JobManager RPC ▪ JM running in pod my-flink-cluster-jobmanager-3063995c2t0 ▪ JM Service my-flink-cluster-jobmanager, listening on Flink RPC, blob server, and UI ports ▪ JM/TMs configured with jobmanager.rpc.address: my-flink- cluster-jobmanager ▪ TMs work fine, as they talk to the JM with the same name as it sees itself (i.e. the name of the Service) ▪ But if you forward the JM RPC port locally or access it via a proxy… 26
  • 27. JobManager RPC From my laptop, having forwarded port 6123 to a running cluster on K8s: $ flink run -m localhost:6123 examples/batch/WordCount.jar Cluster configuration: Standalone cluster with JobManager at localhost/127.0.0.1:6123 Using address localhost:6123 to connect to JobManager. . . . Submitting job with JobID: ff86d06802fc69554ee0a5f726204db5. Waiting for job completion. … which hangs until a timeout. In the JobManager’s log: 2017-09-08 09:42:26,082 ERROR akka.remote.EndpointWriter [] - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@localhost:6123/]] arriving at [akka.tcp://flink@localhost: 6123] inbound addresses are [akka.tcp://flink@my-flink-cluster-jobmanager:6123] 27
  • 28. JobManager RPC ▪ One solution is to submit the job from within the K8s environment ▪ e.g. run a pod that contains your job jar, use the Flink CLI to submit ▪ Another solution is to use the HTTP API (or the UI) ▪ Not subject to “addressability” problems ▪ Cons: hard to secure, not well specified 28
  • 29. Agenda (Part 2) ▪ JobManager RPC ▪ Port configuration ▪ “HA mode” ▪ Extra libraries, logging, other config ▪ Queryable state ▪ TLS/on-the-wire encryption ▪ Resource allocation 29
  • 30. Port configuration ▪ Flink chooses arbitrary ports for some services (i.e. blob server, queryable state) ▪ Must specify these in Flink config and JobManager service ▪ blob.server.port: 6124 ▪ query.server.port: 6125 30
  • 31. Agenda (Part 2) ▪ JobManager RPC ▪ Port configuration ▪ “HA mode” ▪ Extra libraries, logging, other config ▪ Queryable state ▪ TLS/on-the-wire encryption ▪ Resource allocation 31
  • 32. “HA mode” ▪ A misnomer—always use HA mode in production, even with 1 JobManager ▪ JM stores critical state in ZooKeeper that allows it to recover if the JM dies ▪ On K8s and other resource managers, should only need to run a single JobManager as the manager will re-launch it upon failure 32
  • 33. “HA mode” ▪ If you do run >1 JM, accessing the UI becomes difficult ▪ If you expose both JMs in a Service, but your request routes to the non-leader, will be redirected to internal hostname of leader ▪ One solution seen on the mailing list, have non-leader fail “readiness probe”, so doesn’t receive traffic ▪ Possible improvement to Flink: proxy non-leader requests to leader 33
  • 34. Agenda (Part 2) ▪ JobManager RPC ▪ Port configuration ▪ “HA mode” ▪ Extra libraries, logging, other config ▪ Queryable state ▪ TLS/on-the-wire encryption ▪ Resource allocation 34
  • 35. Extra libraries, logging, other config ▪ For code dependencies, easiest to build a “fat jar” that includes them ▪ Notable exceptions are state backend, logging, and metrics dependencies, for example… ▪ S3 state storage requires extra AWS, Hadoop jars ▪ SLF4J and Graylog’s GELF appender require additional jars 35
  • 36. Extra libraries, logging, other config ▪ For these, easiest to build your own Flink image ▪ See https://github.com/docker-flink/docker-flink to see how the public image is built ▪ Alternatively, an advanced solution is to build an image containing your jars and use Kubernetes’ “init-containers” feature to load them in before Flink starts 36
  • 37. Agenda (Part 2) ▪ JobManager RPC ▪ Port configuration ▪ “HA mode” ▪ Extra libraries, logging, other config ▪ Queryable state ▪ TLS/on-the-wire encryption ▪ Resource allocation 37
  • 38. Queryable state ▪ Currently somewhat tricky ▪ Only works out-of-the-box if all TMs are addressable and routable ▪ For most setups, this means you must be in the same K8s environment ▪ Also suffer from Akka addressability issue ▪ Possible improvement to Flink: enable TMs to proxy queries to the correct TM, à la Cassandra ▪ Then, create another Service in front of all TMs listening on the query server port 38
  • 39. Queryable state ▪ A more cumbersome solution that would work today is to create a service yourself that runs in the same K8s environment and proxies requests ▪ This has some added advantages, such as… ▪ horizontal scalability ▪ an opportunity for caching ▪ a way to maintain partial availability while the underlying Flink job is down, e.g. during upgrades 39
  • 40. Agenda (Part 2) ▪ JobManager RPC ▪ Port configuration ▪ “HA mode” ▪ Extra libraries, logging, other config ▪ Queryable state ▪ TLS/on-the-wire encryption ▪ Resource allocation 40
  • 41. TLS/on-the-wire encryption ▪ Also a bit tricky, as Pods inherently don’t have predictable hostnames or IPs ▪ One solution: generate a trustStore/keyStore when launching cluster ▪ Which contain a single key pair and certificate that all JMs/TMs in the cluster use and verify against ▪ Essentially symmetric encryption in the “convenient” wrapper of TLS ▪ I haven’t actually tried this 41
  • 42. TLS/on-the-wire encryption ▪ Another possibility: use an overlay network in Kubernetes that magically does on-the-wire encryption ▪ Encryption supported in some fashion by some overlay networks ▪ Whether or not this is acceptable to you depends on your requirements ▪ I haven’t actually tried this 42
  • 43. Agenda (Part 2) ▪ JobManager RPC ▪ Port configuration ▪ “HA mode” ▪ Extra libraries, logging, other config ▪ Queryable state ▪ TLS/on-the-wire encryption ▪ Resource allocation 43
  • 44. Resource allocation ▪ How many slots should each TM have? ▪ >1 slot per TM can save some overhead ▪ But fewer slots/TM could theoretically reduce time to recovery (restoring from checkpoint) ▪ But more machines means a higher chance of some failure 44
  • 45. Resource allocation ▪ How much CPU/RAM should each TM get? ▪ Ideally, each slot should have one CPU ▪ Within a slot, processing is serial, so >1 CPU/slot shouldn’t matter much ▪ RAM/slot ratio is application-dependent; 1GB is a decent minimum to start with 45
  • 46. Resource allocation ▪ Check out Robert Metzger’s talk about Flink operations to some deeper discussion and insight about this ▪ Keep It Going — How to Reliably and Efficiently Operate Apache Flink ▪ Available on the FlinkForward YouTube channel before the end of the day 46
  • 47. Part 3 — Conclusion 47
  • 48. Conclusion ▪ Containerization/resource management platforms have a bright future ▪ Flink well-positioned, but has plenty of room for improvement 48
  • 49. Conclusion ▪ My most-wanted Flink improvement vis-à-vis Flink-on-K8s ▪ Establish a specified HTTP API for all management and monitoring functions ▪ Rewrite the CLI to use this API exclusively ▪ Provide health-check endpoints for JobManagers and TaskManagers that can be used for Liveness/ ReadinessChecks 49
  • 50. See Also ▪ FLINK-6369 - Better support for overlay networks ▪ for discussion about the networking-/overlay network- related issues discussed in this presentation ▪ The Flink User mailing list ▪ plenty of posts about the topics I covered today ▪ FLIP-6 - Flink Deployment and Process Model ▪ ongoing work to improve how Flink interacts with resource managers, notably Mesos and YARN 50
  • 51. 51
  • 54. Part 3 — Q & A 54