https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/227622666/
Title: Spark on Kubernetes
Abstract: Engineers across several organizations are working on support for Kubernetes as a cluster scheduler backend within Spark. While designing this, we have encountered several challenges in translating Spark to use idiomatic Kubernetes constructs natively. This talk is about our high level design decisions and the current state of our work.
Speaker:
Anirudh Ramanathan is a software engineer on the Kubernetes team at Google. His focus is on running stateful and batch workloads. Previously, he worked on GGC (Google Global Cache) and prior to that, on the infrastructure team at NVIDIA.
Spark on Kubernetes - Advanced Spark and TensorFlow Meetup - Jan 19, 2017 - Anirudh Ramanathan from the Google Kubernetes Team
1. Spark on Kubernetes
Advanced Spark and TensorFlow Meetup (19 Jan 2017)
Anirudh Ramanathan (Google)
GitHub: foxish
2. What is Kubernetes
● Open source cluster manager originally developed by Google
● Based on a decade and a half of experience in running containers at scale
● Has over 1000 contributors and 30,000+ commits on GitHub
● Container-centric infrastructure
● Deploy and manage applications declaratively
4. Concepts
0. Container: A sealed application package (Docker)
1. Pod: A small group of tightly coupled Containers
example: content syncer & web server
2. Controller: A loop that drives current state towards desired state
example: replication controller
3. Service: A set of running pods that work together
example: load-balanced backends
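As a rough illustration of the Pod concept above: a minimal Scala sketch using the Fabric8 Java client (introduced later in this deck) to define a Pod holding two tightly coupled containers, a web server and a content syncer. Container and image names are placeholders, not from the slides.

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

object PodConceptSketch {
  // A Pod groups tightly coupled containers that share network and storage.
  // Here: a web server plus a content-syncer sidecar, as in the slide's example.
  val webPod: Pod = new PodBuilder()
    .withNewMetadata()
      .withName("web-with-syncer")        // placeholder pod name
      .addToLabels("app", "web")
    .endMetadata()
    .withNewSpec()
      .addNewContainer()
        .withName("web-server")
        .withImage("nginx")               // placeholder image
      .endContainer()
      .addNewContainer()
        .withName("content-syncer")
        .withImage("example/content-syncer")  // placeholder image
      .endContainer()
    .endSpec()
    .build()
}
```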
6. Why Spark?
● Spark is used for processing many kinds of workloads
○ Batch
○ Interactive
○ Streaming
● Lots of organizations already run their serving workloads in Kubernetes
● Better resource sharing and management when all workloads run on a single cluster manager
7. Spark Standalone on Kubernetes
Set up one master controller and worker pods in a standalone cluster on top of Kubernetes:
https://github.com/kubernetes/kubernetes/tree/master/examples/spark
● Resource negotiation tied to Spark standalone and Kubernetes configuration
● No easy way to dynamically scale number of workers when there are idle resources
● Lacks a robust authentication and authorization mechanism
● FIFO scheduling only
10. Kubernetes as a Cluster Scheduler Backend
● Cluster mode support
● The driver shall run within the cluster
● Coarse-grained mode
● Spark talks to Kubernetes clusters directly
spark-submit --master=k8s://<IP>
[Diagram: spark-submit launches the driver and executor pods inside the Kubernetes cluster]
13. Communication
● Kubernetes provides a REST API
● Fabric8's Kubernetes Java client is used to make REST calls
● Allows us to create, watch, and delete Pods and higher-level controllers from Scala/Java code
[Diagram: REST API calls to the Kubernetes apiserver and scheduler]
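A minimal sketch of the pattern described above, assuming the Fabric8 client: creating, listing, and deleting executor pods against the apiserver from Scala. The pod name, image, label, and namespace are illustrative, and exact client method names vary across Fabric8 versions.

```scala
import io.fabric8.kubernetes.api.model.PodBuilder
import io.fabric8.kubernetes.client.DefaultKubernetesClient

object Fabric8Sketch extends App {
  // The client wraps the Kubernetes REST API exposed by the apiserver.
  val client = new DefaultKubernetesClient()

  // Create an executor pod (name, image, and label are illustrative).
  val executor = new PodBuilder()
    .withNewMetadata()
      .withName("spark-executor-1")
      .addToLabels("spark-app", "demo")
    .endMetadata()
    .withNewSpec()
      .addNewContainer()
        .withName("executor")
        .withImage("example/spark-executor")
      .endContainer()
    .endSpec()
    .build()
  client.pods().inNamespace("default").create(executor)

  // List the pods belonging to this application...
  val pods = client.pods().inNamespace("default")
    .withLabel("spark-app", "demo").list().getItems

  // ...and delete one by name when it is no longer needed.
  client.pods().inNamespace("default").withName("spark-executor-1").delete()
  client.close()
}
```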
14. Spark Configuration
● Spark configuration options provided to spark-submit at the time of invocation
● https://github.com/apache-spark-on-k8s/spark/blob/k8s-support-alternate-incremental/docs/running-on-kubernetes.md
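For illustration only, the kind of options involved, written as a SparkConf rather than --conf flags on the spark-submit command line. The spark.kubernetes.* option names here are assumptions; the linked document is the authoritative reference.

```scala
import org.apache.spark.SparkConf

object KubernetesConfSketch {
  // Normally passed to spark-submit as --conf key=value pairs;
  // shown as a SparkConf here for readability.
  // spark.kubernetes.* names are illustrative; see the linked docs.
  val conf: SparkConf = new SparkConf()
    .setAppName("spark-on-k8s-demo")
    .setMaster("k8s://https://<apiserver>:<port>")   // Kubernetes cluster scheduler backend
    .set("spark.executor.instances", "4")            // number of executors to request
    .set("spark.kubernetes.namespace", "default")    // namespace for driver/executor pods
}
```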
15. Dynamic Executor Scaling
Hypothesis 1
● The set of executors can be adequately represented by a ReplicaSet
[Diagram: a ReplicaSet is created to run 3 executor pods]
16. Dynamic Executor Scaling
Hypothesis 1
● The set of executors can be adequately represented by a ReplicaSet
● Which one do we kill?
● Spark knows how to scale down intelligently, but the ReplicaSet does not
[Diagram: scaling the ReplicaSet down to 2 executors; which pod does it kill?]
17. Solution: Driver pod as controller
● Let the Spark driver pod launch executor pods
● Scale up/down can be such that we lose the least amount of cached data
[Diagram: spark-submit contacts the apiserver in the Kubernetes cluster]
18. Solution: Driver pod as controller
[Diagram: the scheduler places the Spark driver pod in the Kubernetes cluster]
19. Solution: Driver pod as controller
[Diagram: the driver pod creates executor pods through the apiserver]
20. Solution: Driver pod as controller
[Diagram: the scheduler places the executor pods in the cluster]
21. Solution: Driver pod as controller
[Diagram: driver and executor pods running in the Kubernetes cluster]
22. Solution: Driver pod as controller
[Diagram: the Spark job completes; output and logs are retrieved via spark-submit]
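Because the driver itself acts as the controller, it can choose which executor pod to remove when scaling down, for example the one holding the least cached data. A rough Scala sketch of that idea using the Fabric8 client; the pod names, the cached-data bookkeeping, and the heuristic are assumptions, not the actual implementation.

```scala
import io.fabric8.kubernetes.client.DefaultKubernetesClient

object DriverAsControllerSketch {
  val client = new DefaultKubernetesClient()

  // Illustrative bookkeeping: cached bytes per executor pod, as the driver
  // might track them (not the real data structures).
  def cachedBytesByExecutor(): Map[String, Long] =
    Map("spark-exec-1" -> 512L * 1024 * 1024, "spark-exec-2" -> 0L)

  // Scale down by one executor. Unlike a ReplicaSet, the driver can pick the
  // victim itself, here the pod holding the least cached data.
  def scaleDownByOne(namespace: String): Unit = {
    val victim = cachedBytesByExecutor().minBy(_._2)._1
    client.pods().inNamespace(namespace).withName(victim).delete()
  }
}
```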
24. Shuffle Service
● The shuffle service is a component that persists files written by executors beyond the lifetime of the executors
● Important (and required) for dynamic allocation of executors
● Typically one per node or instance, shared by different executors
● Can kill executors without fear of losing data and triggering recomputation
● Considering two possible designs of the Shuffle Service
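On the Spark side, dynamic allocation requires the external shuffle service to be enabled; a minimal configuration sketch with standard Spark options (how the service itself is deployed on Kubernetes is the design question on the following slides):

```scala
import org.apache.spark.SparkConf

object DynamicAllocationConf {
  // Standard Spark options: dynamic allocation relies on the external shuffle
  // service so that shuffle files outlive the executors that wrote them.
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "10")
}
```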
25. Shuffle Service: DaemonSet
● One shuffle service per node
● Idiomatic and similar to other cluster schedulers
● Requires disk sharing between a DaemonSet pod and each executor pod
● Difficult to enforce ACLs
[Diagram: two nodes, each running one shuffle service shared by executors foo-1/bar-1 and foo-2/bar-2; drivers foo and bar]
26. Shuffle Service: Per Executor
● Strong isolation possible between shuffle files
● Resource wastage in having multiple shuffle services per node
● Disk sharing between containers in a Pod is trivial
● Can expose shuffle service on Pod IP
[Diagram: each executor pod (foo-1, bar-1, foo-2, bar-2) runs its own shuffle service container; drivers foo and bar]
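A rough sketch of the per-executor design, assuming the shuffle service runs as a second container inside each executor pod and shares shuffle files through an emptyDir volume; container names, images, and mount paths are placeholders.

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

object ExecutorWithShuffleServiceSketch {
  // Executor and shuffle-service containers in one Pod, sharing an emptyDir
  // volume so the shuffle service can serve the files the executor writes.
  val pod: Pod = new PodBuilder()
    .withNewMetadata().withName("spark-exec-1").endMetadata()
    .withNewSpec()
      .addNewVolume()
        .withName("shuffle-dir")
        .withNewEmptyDir().endEmptyDir()
      .endVolume()
      .addNewContainer()
        .withName("executor")
        .withImage("example/spark-executor")          // placeholder image
        .addNewVolumeMount()
          .withName("shuffle-dir").withMountPath("/data/shuffle")
        .endVolumeMount()
      .endContainer()
      .addNewContainer()
        .withName("shuffle-service")
        .withImage("example/spark-shuffle-service")   // placeholder image
        .addNewVolumeMount()
          .withName("shuffle-dir").withMountPath("/data/shuffle")
        .endVolumeMount()
      .endContainer()
    .endSpec()
    .build()
}
```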
27. Resource Allocation
● Kubernetes lets us specify soft and hard limits on resources (CPU, memory, etc.)
● Pods may be in one of 3 QoS levels
○ Guaranteed
○ Burstable
○ Best Effort
● Scheduling and pre-emption are based on QoS
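For example, a pod whose containers set resource requests equal to their limits falls into the Guaranteed QoS class (requests below limits give Burstable; none give Best Effort). A hedged Fabric8 sketch with placeholder values:

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder, Quantity}

object GuaranteedQosPodSketch {
  // Requests equal to limits on every container puts the pod in the
  // Guaranteed QoS class; names and values here are placeholders.
  val pod: Pod = new PodBuilder()
    .withNewMetadata().withName("spark-exec-guaranteed").endMetadata()
    .withNewSpec()
      .addNewContainer()
        .withName("executor")
        .withImage("example/spark-executor")
        .withNewResources()
          .addToRequests("cpu", new Quantity("1"))
          .addToRequests("memory", new Quantity("2Gi"))
          .addToLimits("cpu", new Quantity("1"))
          .addToLimits("memory", new Quantity("2Gi"))
        .endResources()
      .endContainer()
    .endSpec()
    .build()
}
```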
28. Resource Allocation
● Today, we launch Drivers and Executors with guaranteed resources.
● In the near future:
○ QoS level of executors should be decided based on a notion of priority
○ Must be able to overcommit cluster resources for Spark batch jobs and pre-empt/scale down when higher-priority jobs come in
● Schedule and execute Spark jobs launched by the same and different tenants fairly
29. Extending the Kubernetes API
● Use ThirdPartyResources to extend the API dynamically
● A SparkJob resource can be added to the API
● The SparkJob object can be written by the Spark driver to record job parameters
● Enables better cluster-level aggregation and decisions
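Purely to illustrate what such a SparkJob object might record, a hypothetical Scala data model; the field names are illustrative and not the actual schema, which is still being designed:

```scala
// Hypothetical shape of a SparkJob third-party resource object; field names
// are illustrative, not the actual schema.
case class SparkJobStatus(
  state: String,            // e.g. SUBMITTED, RUNNING, COMPLETED, FAILED
  desiredExecutors: Int,
  runningExecutors: Int
)

case class SparkJob(
  name: String,
  namespace: String,
  driverPod: String,        // the driver pod acting as controller
  image: String,
  status: SparkJobStatus
)
```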