SlideShare une entreprise Scribd logo
1  sur  47
Télécharger pour lire hors ligne
Spark on YARN
● Shashank L
● Big data consultant and trainer at
datamantra.io
● shashankgowda.me@gmail.com
Agenda
● YARN - Introduction
● Need for YARN
● OS Analogy
● Why run Spark on YARN
● YARN Architecture
● Modes of Spark on YARN
● Internals of Spark on YARN
● Recent developments
● Road ahead
● Hands-on
YARN
● Yet another resource negotiator.
● a general-purpose, distributed, application management
framework.
Need for YARN
Hadoop 1.0
● Single use system
● Capable of running only MR
Need for YARN
● Scalability
○ 2009 – 8 cores, 16GB of RAM, 4x1TB disk
○ 2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk.
● Cluster utilization
○ distinct map slots and reduce slots
● Supporting workloads other than MapReduce
○ MapReduce is great for many applications, but not everything.
Need for YARN
Hadoop 2.0
● Multi purpose platform
● Capable of running apps
other than MR
OS analogy
Traditional operating system
Storage:
File system
Execution/Scheduling:
Processes/ Kernel scheduler
OS analogy
Hadoop
Storage:
HDFS
Execution/Scheduling:
YARN
Why run Spark on YARN
● Leverage existing clusters
● Data locality
● Dynamically sharing the cluster resources between different frameworks.
● YARN schedulers can be used for categorizing, isolating, and prioritizing
workloads.
● Only cluster manager for Spark that supports security
YARN architecture
Resource manager
Node manager Node manager
YARN architecture
Resource manager
Node manager Node manager
Container Container Container Container
YARN architecture
Resource manager
Node manager Node manager
Client
YARN architecture
Resource manager
Node manager Node manager
Container
Application
Master
Client
YARN architecture
Resource manager
Node manager Node manager
Container
Application
Master
Client
Container
Processing
Container
Processing
Running Spark on YARN
Spark architecture
Driver program
Spark Context
Cluster manager
Worker node
Executor Cache
Task Task
Worker node
Executor Cache
Task Task
Spark architecture
● Driver Program is responsible for managing the job flow and scheduling
tasks that will run on the executors.
● Executors are processes that run computation and store data for a Spark
application.
● Cluster Manager is responsible for starting executor processes and where
and when they will be run. Spark supports pluggable cluster manager, it
supports
Example: YARN, Mesos and “standalone” cluster manager
Modes on Spark on YARN
● YARN-Client Mode
● YARN-Cluster Mode
YARN client mode
Resource manager
Node manager(s)
Container
Spark
Application Master/
Executor launcher
Container
Executor
Client
Spark driver/
Spark Context
YARN client mode
● Driver runs in the client process, and the application master is only used for
requesting resources from YARN.
● Used for interactive and debugging uses where you want to see your
application’s output immediately (on the client process side).
YARN cluster mode
Resource manager
Node manager(s)
Container
Spark
Application Master/
Spark driver
Container
Executor
Client
Spark submit
YARN cluster mode
● In yarn-cluster mode, the Spark driver runs inside an application master
process which is managed by YARN on the cluster, and the client can go
away after initiating the application.
● Yarn-cluster mode makes sense for production jobs.
Concurrency vs Parallelism
● Concurrency is about dealing with lots of things at once.
● Parallelism is about doing lots of things at once.
Akka
● Follows Actor model
○ Keep mutable state internally and communicate through async messages
● Actors
Treat Actors like People. People who don't talk to each other in person. They
just talk through mails.
○ Object
○ Runs in its own thread
○ Messages are kept in queue and processed in order
Internals of Spark
YARN Cluster
Internals of Spark on YARN
Container
Spark AM
Spark driver
(Spark Context)
ContainerContainer
Container
Executor
DAG
Scheduler
Task
Scheduler
Scheduler
backend
1
2
3
9
Client
5
8
4
6 7
1010
Internals of Spark on YARN
1. Requests container for the AM and launches AM in the container
2. Creates SparkContext (inside AM / inside Client).
This internally creates a DAG Scheduler, Task scheduler and Scheduler
backend.
Creates an Akka actor system.
3. Application master based on the required resources will request for the
containers. Once it get the containers it runs executor process in the
container.
4. The executor process when it comes up registers with the Schedulerbackend
through Akka.
5. When few lines of code has to be run on the cluster. RDD runJob method
calls the DAG scheduler to create a DAG of tasks.
Internals of Spark on YARN
6. Set of tasks which is capable of running in parallel is sent to the Task
Scheduler in the form of TaskSet.
7. Task scheduler in turn will contact the Schedulerbackend to run the tasks on
the executor.
8. Scheduler backend which keeps track of running executors and its statuses,
will schedule tasks on executors
9. Task output if any are sent through heartbeats to Schedulerbackend/
10. SchedulerBackend passes the task output onto the Task and DAG scheduler
which could make use of that output.
Recent developments
● Dynamic resource allocation
○ No need to specify number of executors
○ Application grows and shrinks based on outstanding task count
○ Need to specify other things
● Data locality
○ Allocate executors close to data
○ SPARK-4352
● Cached RDDs
○ Keep executors around
Road ahead
● Making dynamic allocation better
○ Reduce allocation latency
○ Handle cached RDDs
● Simplified allocation
● Encrypt shuffle files
● File distribution
○ Replace HTTP with RPC
Yarn Hands On
● Shashidhar E S
● Big data consultant and trainer at
datamantra.io
● shashidhar.e@gmail.com
Yarn components in different phases
Resource Manager
Yarn Client Application Master
Node Manager
ApplicationClientProtocol ApplicationMasterProtocol
ContainerManagementProtocol
Protocols
Protocols
Application Client Protocol (Client<-->RM)(org.apache.hadoop.yarn.api.
ApplicationClientProtocol)
Application Master Protocol (AM<-->RM)(org.apache.hadoop.yarn.api.
ApplicationMasterProtocol)
Container Management Protocol (AM<-->NM)(org.apache.hadoop.yarn.api.
ContainerManagementProtocol)
Application Client Protocol
● Protocol between client and resource manager
● Allows clients to submit and abort the jobs
● Enables clients to get information about applications
○ Cluster metrics - Active node managers
○ Nodes - Node details
○ Queues - Queue details
○ Delegation Tokens - Token for containers to interact with the services.
○ ApplicationAttemptReport - Application Details (host,port,diagnostics)
○ etc
Application Master Protocol
● Protocol between AM and RM
● Key functionalities
○ RegisterAM
○ FinishAM - Notify RM about completion
○ Allocate - RM responds with available/unused containers
Container Management Protocol
● Protocol between AM and NM
● Key functionalities
○ Start Containers
○ Stop Containers
○ Status of running containers - (NEW,RUNNING,COMPLETE)
Building Blocks of Communication
Records
Each component in YARN architecture communicates between each other by
forming records.Each request sent is a record
Ex: localResource requests,applicationContext, containerLaunchContext etc
Each response obtained is a record
Ex: applicationResponse, applicationMasterResponse, allocateResponse etc
Custom Yarn Application
Components
● Yarn Client
● Yarn Application Master
● Application
Yarn Hello world
Resource ManagerClient
Application
Application Master
Node manager
Package : com.madhukaraphatak.yarnexamples.helloworld
Yarn Hello world
Steps:
● Create application client
○ Communicate with RM to launch AM
○ Specify AM resource requirements
● Application master
○ Communicate with RM to get containers
○ Communicate with NM’s to launch containers
● Application
○ Specify the application logic
AKKA remote example
Remote actor
system
(Remote actor)
Local actor system
(Local actor)
Communicate through
messages
Package : com.madhukaraphatak.yarnexamples.akka
AKKA remote example
Steps:
● Create Remote client with following properties
○ Actor ref provider - References should be remote aware
○ Transports used - Tcp is the transport layer protocol
○ hostname - 127.0.0.1
○ port - 5150
● Create Local Client with following properties
○ Actor ref provider - We are specifying the references should be remote aware
○ Transports used - tcp is the transport layer protocol
○ hostname - 127.0.0.1
○ port - 0
1. Akka actors behave like peers rather than client-server.
2. They talk in similar transport.
3. Only difference is port : 0 -> any free port.
AKKA application on Yarn
Resource Manager
Application
Local actor system
(TaskResultSenderActor)
Application Master
Remote actor system
(Remote actor)
Client
Node manager
AKKA application on Yarn
● Client
○ Defines the tasks to be performed
○ Submits tasks as separate set
● Scheduler
○ Create receiver actor (Remote Actor) for orchestration
○ Set up resources for AM
○ Launch AM for set of tasks
● Application Master
○ Create Executor for each single task
○ Set up resources for Containers
AKKA application on Yarn
● Executor
○ Create local Actor
○ Run task
○ Send response to remote actor
○ Kill local actor

Contenu connexe

Tendances

A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Alex Levenson
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingDataWorks Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...Altinity Ltd
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 

Tendances (20)

A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 

En vedette

Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageSandeep Patil
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...gethue
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDataWorks Summit
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 

En vedette (14)

Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better Storage
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
SocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean ManualSocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean Manual
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Proxy Servers
Proxy ServersProxy Servers
Proxy Servers
 
Proxy Server
Proxy ServerProxy Server
Proxy Server
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 

Similaire à Spark on yarn

ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...Zhijie Shen
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Apache Apex
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to YarnOmid Vahdaty
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersAnant Corporation
 
YARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPYARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPOmkar Joshi
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Jianfeng Zhang
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 

Similaire à Spark on yarn (20)

Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource Managers
 
YARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPYARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOP
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Spark on Yarn @ Netflix
Spark on Yarn @ NetflixSpark on Yarn @ Netflix
Spark on Yarn @ Netflix
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Introduction to yarn
Introduction to yarnIntroduction to yarn
Introduction to yarn
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Apache spark
Apache sparkApache spark
Apache spark
 

Plus de datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streamingdatamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 APIdatamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle managementdatamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scaladatamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetesdatamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsdatamantra
 

Plus de datamantra (20)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 

Dernier

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

Dernier (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

Spark on yarn

  • 2. ● Shashank L ● Big data consultant and trainer at datamantra.io ● shashankgowda.me@gmail.com
  • 3. Agenda ● YARN - Introduction ● Need for YARN ● OS Analogy ● Why run Spark on YARN ● YARN Architecture ● Modes of Spark on YARN ● Internals of Spark on YARN ● Recent developments ● Road ahead ● Hands-on
  • 4. YARN ● Yet another resource negotiator. ● a general-purpose, distributed, application management framework.
  • 5. Need for YARN Hadoop 1.0 ● Single use system ● Capable of running only MR
  • 6. Need for YARN ● Scalability ○ 2009 – 8 cores, 16GB of RAM, 4x1TB disk ○ 2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk. ● Cluster utilization ○ distinct map slots and reduce slots ● Supporting workloads other than MapReduce ○ MapReduce is great for many applications, but not everything.
  • 7. Need for YARN Hadoop 2.0 ● Multi purpose platform ● Capable of running apps other than MR
  • 8. OS analogy Traditional operating system Storage: File system Execution/Scheduling: Processes/ Kernel scheduler
  • 10. Why run Spark on YARN ● Leverage existing clusters ● Data locality ● Dynamically sharing the cluster resources between different frameworks. ● YARN schedulers can be used for categorizing, isolating, and prioritizing workloads. ● Only cluster manager for Spark that supports security
  • 12. YARN architecture Resource manager Node manager Node manager Container Container Container Container
  • 13. YARN architecture Resource manager Node manager Node manager Client
  • 14. YARN architecture Resource manager Node manager Node manager Container Application Master Client
  • 15. YARN architecture Resource manager Node manager Node manager Container Application Master Client Container Processing Container Processing
  • 17. Spark architecture Driver program Spark Context Cluster manager Worker node Executor Cache Task Task Worker node Executor Cache Task Task
  • 18. Spark architecture ● Driver Program is responsible for managing the job flow and scheduling tasks that will run on the executors. ● Executors are processes that run computation and store data for a Spark application. ● Cluster Manager is responsible for starting executor processes and where and when they will be run. Spark supports pluggable cluster manager, it supports Example: YARN, Mesos and “standalone” cluster manager
  • 19. Modes on Spark on YARN ● YARN-Client Mode ● YARN-Cluster Mode
  • 20. YARN client mode Resource manager Node manager(s) Container Spark Application Master/ Executor launcher Container Executor Client Spark driver/ Spark Context
  • 21. YARN client mode ● Driver runs in the client process, and the application master is only used for requesting resources from YARN. ● Used for interactive and debugging uses where you want to see your application’s output immediately (on the client process side).
  • 22. YARN cluster mode Resource manager Node manager(s) Container Spark Application Master/ Spark driver Container Executor Client Spark submit
  • 23. YARN cluster mode ● In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. ● Yarn-cluster mode makes sense for production jobs.
  • 24. Concurrency vs Parallelism ● Concurrency is about dealing with lots of things at once. ● Parallelism is about doing lots of things at once.
  • 25. Akka ● Follows Actor model ○ Keep mutable state internally and communicate through async messages ● Actors Treat Actors like People. People who don't talk to each other in person. They just talk through mails. ○ Object ○ Runs in its own thread ○ Messages are kept in queue and processed in order
  • 27. YARN Cluster Internals of Spark on YARN Container Spark AM Spark driver (Spark Context) ContainerContainer Container Executor DAG Scheduler Task Scheduler Scheduler backend 1 2 3 9 Client 5 8 4 6 7 1010
  • 28. Internals of Spark on YARN 1. Requests container for the AM and launches AM in the container 2. Creates SparkContext (inside AM / inside Client). This internally creates a DAG Scheduler, Task scheduler and Scheduler backend. Creates an Akka actor system. 3. Application master based on the required resources will request for the containers. Once it get the containers it runs executor process in the container. 4. The executor process when it comes up registers with the Schedulerbackend through Akka. 5. When few lines of code has to be run on the cluster. RDD runJob method calls the DAG scheduler to create a DAG of tasks.
  • 29. Internals of Spark on YARN 6. Set of tasks which is capable of running in parallel is sent to the Task Scheduler in the form of TaskSet. 7. Task scheduler in turn will contact the Schedulerbackend to run the tasks on the executor. 8. Scheduler backend which keeps track of running executors and its statuses, will schedule tasks on executors 9. Task output if any are sent through heartbeats to Schedulerbackend/ 10. SchedulerBackend passes the task output onto the Task and DAG scheduler which could make use of that output.
  • 30. Recent developments ● Dynamic resource allocation ○ No need to specify number of executors ○ Application grows and shrinks based on outstanding task count ○ Need to specify other things ● Data locality ○ Allocate executors close to data ○ SPARK-4352 ● Cached RDDs ○ Keep executors around
  • 31. Road ahead ● Making dynamic allocation better ○ Reduce allocation latency ○ Handle cached RDDs ● Simplified allocation ● Encrypt shuffle files ● File distribution ○ Replace HTTP with RPC
  • 33. ● Shashidhar E S ● Big data consultant and trainer at datamantra.io ● shashidhar.e@gmail.com
  • 34. Yarn components in different phases Resource Manager Yarn Client Application Master Node Manager ApplicationClientProtocol ApplicationMasterProtocol ContainerManagementProtocol Protocols
  • 35. Protocols Application Client Protocol (Client<-->RM)(org.apache.hadoop.yarn.api. ApplicationClientProtocol) Application Master Protocol (AM<-->RM)(org.apache.hadoop.yarn.api. ApplicationMasterProtocol) Container Management Protocol (AM<-->NM)(org.apache.hadoop.yarn.api. ContainerManagementProtocol)
  • 36. Application Client Protocol ● Protocol between client and resource manager ● Allows clients to submit and abort the jobs ● Enables clients to get information about applications ○ Cluster metrics - Active node managers ○ Nodes - Node details ○ Queues - Queue details ○ Delegation Tokens - Token for containers to interact with the services. ○ ApplicationAttemptReport - Application Details (host,port,diagnostics) ○ etc
  • 37. Application Master Protocol ● Protocol between AM and RM ● Key functionalities ○ RegisterAM ○ FinishAM - Notify RM about completion ○ Allocate - RM responds with available/unused containers
  • 38. Container Management Protocol ● Protocol between AM and NM ● Key functionalities ○ Start Containers ○ Stop Containers ○ Status of running containers - (NEW,RUNNING,COMPLETE)
  • 39. Building Blocks of Communication Records Each component in YARN architecture communicates between each other by forming records.Each request sent is a record Ex: localResource requests,applicationContext, containerLaunchContext etc Each response obtained is a record Ex: applicationResponse, applicationMasterResponse, allocateResponse etc
  • 40. Custom Yarn Application Components ● Yarn Client ● Yarn Application Master ● Application
  • 41. Yarn Hello world Resource ManagerClient Application Application Master Node manager Package : com.madhukaraphatak.yarnexamples.helloworld
  • 42. Yarn Hello world Steps: ● Create application client ○ Communicate with RM to launch AM ○ Specify AM resource requirements ● Application master ○ Communicate with RM to get containers ○ Communicate with NM’s to launch containers ● Application ○ Specify the application logic
  • 43. AKKA remote example Remote actor system (Remote actor) Local actor system (Local actor) Communicate through messages Package : com.madhukaraphatak.yarnexamples.akka
  • 44. AKKA remote example Steps: ● Create Remote client with following properties ○ Actor ref provider - References should be remote aware ○ Transports used - Tcp is the transport layer protocol ○ hostname - 127.0.0.1 ○ port - 5150 ● Create Local Client with following properties ○ Actor ref provider - We are specifying the references should be remote aware ○ Transports used - tcp is the transport layer protocol ○ hostname - 127.0.0.1 ○ port - 0 1. Akka actors behave like peers rather than client-server. 2. They talk in similar transport. 3. Only difference is port : 0 -> any free port.
  • 45. AKKA application on Yarn Resource Manager Application Local actor system (TaskResultSenderActor) Application Master Remote actor system (Remote actor) Client Node manager
  • 46. AKKA application on Yarn ● Client ○ Defines the tasks to be performed ○ Submits tasks as separate set ● Scheduler ○ Create receiver actor (Remote Actor) for orchestration ○ Set up resources for AM ○ Launch AM for set of tasks ● Application Master ○ Create Executor for each single task ○ Set up resources for Containers
  • 47. AKKA application on Yarn ● Executor ○ Create local Actor ○ Run task ○ Send response to remote actor ○ Kill local actor