SlideShare a Scribd company logo
1 of 22
YARN
Apache Hadoop Next Generation
Compute Platform
Bikas Saha
@bikassaha

© Hortonworks Inc. 2013

Page 1
Apache Hadoop & YARN
• Apache Hadoop
– De facto Big Data open source platform
– Running for about 5 years in production at hundreds of companies
like Yahoo, Ebay and Facebook

• Hadoop 2
– Significant improvements in HDFS distributed storage layer. High
Availability, NFS, Snapshots
– YARN – next generation compute framework for Hadoop designed
from the ground up based on experience gained from Hadoop 1
– YARN running in production at Yahoo for about a year
– YARN awarded Best Paper at SOCC 2013

© Hortonworks Inc. 2013 - Confidential

Page 2
1st Generation Hadoop: Batch Focus
HADOOP 1.0
Built for Web-Scale Batch Apps

Single App

Single App

INTERACTIVE

ONLINE

Single App

Single App

Single App

BATCH

BATCH

BATCH

HDFS

HDFS

All other usage patterns
MUST leverage same
infrastructure

HDFS

© Hortonworks Inc. 2013 - Confidential

Forces Creation of Silos to
Manage Mixed Workloads

Page 3
Hadoop 1 Architecture
JobTracker
Manage Cluster Resources & Job Scheduling

TaskTracker
Per-node agent
Manage Tasks

© Hortonworks Inc. 2013 - Confidential

Page 4
Hadoop 1 Limitations
Lacks Support for Alternate Paradigms and Services
Force everything needs to look like Map Reduce
Iterative applications in MapReduce are 10x slower

Scalability
Max Cluster size ~5,000 nodes
Max concurrent tasks ~40,000

Availability
Failure Kills Queued & Running Jobs

Hard partition of resources into map and reduce slots
Non-optimal Resource Utilization
© Hortonworks Inc. 2013 - Confidential

Page 5
Our Vision: Hadoop as Next-Gen Platform

Single Use System

Multi Purpose Platform

Batch Apps

Batch, Interactive, Online, Streaming, …

HADOOP 1.0

HADOOP 2.0
MapReduce

Others

(data processing)

MapReduce

YARN

(cluster resource management
& data processing)

(cluster resource management)

HDFS

HDFS2

(redundant, reliable storage)

(redundant, highly-available & reliable storage)

© Hortonworks Inc. 2013 - Confidential

Page 6
Hadoop 2 - YARN Architecture
ResourceManager (RM)
Central agent - Manages and allocates
cluster resources

Node
Manager

NodeManager (NM)
Per-Node agent - Manages and

App Mstr

enforces node resource allocations

ApplicationMaster (AM)
Per-Application –

Resource
Manager

Node
Manager

Client
Container

Manages application
lifecycle and task

scheduling

MapReduce Status
Job Submission

Node
Manager

Node Status
Resource Request

© Hortonworks Inc. 2013 - Confidential

Page 7
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Applications Run Natively in Hadoop
BATCH
INTERACTIVE
(MapReduce)
(Tez)

ONLINE
(HBase)

STREAMING
(Storm, S4,…)

GRAPH
(Giraph)

IN-MEMORY
(Spark)

HPC MPI
(OpenMPI)

OTHER
(Search)
(Weave…)

YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)

© Hortonworks Inc. 2013 - Confidential

Page 8
5 Key Benefits of YARN
1.

New Applications & Services

2.

Improved cluster utilization

3.

Scale

4.

Experimental Agility

5.

Shared Services

© Hortonworks Inc. 2013 - Confidential

Page 9
Key Improvements in YARN
Framework supporting multiple applications
– Separate generic resource brokering from application logic
– Define protocols/libraries and provide a framework for custom
application development
– Share same Hadoop Cluster across applications

Cluster Utilization
– Generic resource container model replaces fixed Map/Reduce
slots. Container allocations based on locality, memory (CPU
coming soon)
– Sharing cluster among multiple application

© Hortonworks Inc. 2013 - Confidential

Page 10
Key Improvements in YARN
Scalability
– Removed complex app logic from RM, scale further
– State machine, message passing based loosely coupled design
– Compact scheduling protocol

Application Agility and Innovation
– Use Protocol Buffers for RPC gives wire compatibility
– Map Reduce becomes an application in user space unlocking
safe innovation
– Multiple versions of an app can co-exist leading to
experimentation
– Easier upgrade of framework and application

© Hortonworks Inc. 2013 - Confidential

Page 11
Key Improvements in YARN
Shared Services
– Common services needed to build distributed application are
included in a pluggable framework
– Distributed file sharing service
– Remote data read service
– Log Aggregation Service

© Hortonworks Inc. 2013 - Confidential

Page 12
YARN: Efficiency with Shared Services

Yahoo! leverages YARN
40,000+ nodes running YARN across over 365PB of data
~400,000 jobs per day for about 10 million hours of compute
time
Estimated a 60% – 150% improvement on node usage per
day using YARN
Eliminated Colo (~10K nodes) due to increased utilization
For more details check out the YARN SOCC 2013 paper
© Hortonworks Inc. 2013 - Confidential

Page 13
YARN as Cluster Operating System
ResourceManager

Scheduler

NodeManager

NodeManager

NodeManager

NodeManager

map 1.1
nimbus0

vertex1.1.1

vertex1.2.2

NodeManager

NodeManager

NodeManager

NodeManager

map1.2
Batch

Interactive SQL

vertex1.1.2

nimbus2

NodeManager

NodeManager

NodeManager

NodeManager

nimbus1
Real-Time

vertex1.2.1

reduce1.1

© Hortonworks Inc. 2013 - Confidential

Page 14
Multi-Tenancy is Built-in
• Queues
• Economics as queue-capacity
– Hierarchical Queues

• SLAs

ResourceManager

– Cooperative Preemption

Scheduler

• Resource Isolation
– Linux: cgroups
– Roadmap: Virtualization (Xen, KVM)

• Administration
– Queue ACLs
– Run-time re-configuration for queues
Default Capacity Scheduler supports
all features
© Hortonworks Inc. 2013 - Confidential

Hierarchical
Queues

root

Mrkting
20%

Dev
20%

Adhoc
10%

Prod
80%

DW
70%

Dev Reserved Prod
10%
20%
70%

P0
70%

P1
30%

Capacity Scheduler
Page 15
YARN Eco-system
Applications Powered by YARN
Apache Giraph – Graph Processing
Apache Hama - BSP
Apache Hadoop MapReduce – Batch
Apache Tez – Batch/Interactive
Apache S4 – Stream Processing
Apache Samza – Stream Processing
Apache Storm – Stream Processing
Apache Spark – Iterative applications
Elastic Search – Scalable Search
Cloudera Llama – Impala on YARN
DataTorrent – Data Analysis
HOYA – HBase on YARN

© Hortonworks Inc. 2013 - Confidential

There's an app for that...
YARN App Marketplace!

Frameworks Powered By YARN
Apache Twill
REEF by Microsoft
Spring support for Hadoop 2

Page 16
YARN Application Lifecycle
Application Client
Protocol

Application Client

YarnClient
App
Specific API

Resource
Manager
NodeManager
Application Master
Protocol

App
Container

Application Master

AMRMClient

Container
Management
Protocol

NMClient

© Hortonworks Inc. 2013 - Confidential

Page 17
BYOA – Bring Your Own App
Application Client Protocol: Client to RM interaction
– Library: YarnClient
– Application Lifecycle control
– Access Cluster Information

Application Master Protocol: AM – RM interaction
– Library: AMRMClient / AMRMClientAsync
– Resource negotiation
– Heartbeat to the RM

Container Management Protocol: AM to NM interaction
– Library: NMClient/NMClientAsync
– Launching allocated containers
– Stop Running containers

Use external frameworks like Twill/REEF/Spring
© Hortonworks Inc. 2013 - Confidential

Page 18
YARN Future Work
• ResourceManager High Availability
– Automatic failover
– Work preserving failover

• Scheduler Enhancements
– SLA Driven Scheduling, Low latency allocations
– Multiple resource types – disk/network/GPUs/affinity

• Rolling upgrades
• Generic History Service
• Long running services
– Better support to running services like HBase
– Service Discovery

• More utilities/libraries for Application Developers
– Failover/Checkpointing

© Hortonworks Inc. 2013 - Confidential

Page 19
Key Take-Aways
• YARN is a platform to build/run Multiple Distributed Applications
in Hadoop
• YARN is completely Backwards Compatible for existing
MapReduce apps
• YARN enables Fine Grained Resource Management via Generic
Resource Containers.
• YARN has built-in support for multi-tenancy to share cluster
resources and increase cost efficiency
• YARN provides a cluster operating system like abstraction for a
modern data architecture

© Hortonworks Inc. 2013 - Confidential

Page 20
Apache YARN
The Data Operating System for Hadoop 2.0
Flexible

Efficient

Shared

Enables other purpose-built data
processing models beyond
MapReduce (batch), such as
interactive and streaming

Increase processing IN Hadoop
on the same hardware while
providing predictable
performance & quality of service

Provides a stable, reliable,
secure foundation and
shared operational services
across multiple workloads

Data Processing Engines Run Natively IN Hadoop
BATCH
MapReduce

INTERACTIVE
Tez

ONLINE
HBase

STREAMING
Storm, S4, …

GRAPH
Giraph

MICROSOFT
REEF

SAS
LASR, HPA

OTHERS

YARN: Cluster Resource Management
HDFS2: Redundant, Reliable Storage

© Hortonworks Inc. 2013 - Confidential

Page 21
Thank you!

http://hortonworks.com/products/hortonworks-sandbox/

Download Sandbox: Experience Apache Hadoop
Both 2.0 and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/

Questions?
© Hortonworks Inc. 2013 - Confidential

Page 22

More Related Content

What's hot

Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 

What's hot (20)

Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnit
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 

Similar to Apache Hadoop YARN: Understanding the Data Operating System of Hadoop

Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014
Hortonworks
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARN
Hortonworks
 

Similar to Apache Hadoop YARN: Understanding the Data Operating System of Hadoop (20)

Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
YARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo HadoopYARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo Hadoop
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
YARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User GroupYARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User Group
 
Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014
 
Yarnthug2014
Yarnthug2014Yarnthug2014
Yarnthug2014
 
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
 
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The UnionDataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoop
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARN
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0
 

More from Hortonworks

More from Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Apache Hadoop YARN: Understanding the Data Operating System of Hadoop

  • 1. YARN Apache Hadoop Next Generation Compute Platform Bikas Saha @bikassaha © Hortonworks Inc. 2013 Page 1
  • 2. Apache Hadoop & YARN • Apache Hadoop – De facto Big Data open source platform – Running for about 5 years in production at hundreds of companies like Yahoo, Ebay and Facebook • Hadoop 2 – Significant improvements in HDFS distributed storage layer. High Availability, NFS, Snapshots – YARN – next generation compute framework for Hadoop designed from the ground up based on experience gained from Hadoop 1 – YARN running in production at Yahoo for about a year – YARN awarded Best Paper at SOCC 2013 © Hortonworks Inc. 2013 - Confidential Page 2
  • 3. 1st Generation Hadoop: Batch Focus HADOOP 1.0 Built for Web-Scale Batch Apps Single App Single App INTERACTIVE ONLINE Single App Single App Single App BATCH BATCH BATCH HDFS HDFS All other usage patterns MUST leverage same infrastructure HDFS © Hortonworks Inc. 2013 - Confidential Forces Creation of Silos to Manage Mixed Workloads Page 3
  • 4. Hadoop 1 Architecture JobTracker Manage Cluster Resources & Job Scheduling TaskTracker Per-node agent Manage Tasks © Hortonworks Inc. 2013 - Confidential Page 4
  • 5. Hadoop 1 Limitations Lacks Support for Alternate Paradigms and Services Force everything needs to look like Map Reduce Iterative applications in MapReduce are 10x slower Scalability Max Cluster size ~5,000 nodes Max concurrent tasks ~40,000 Availability Failure Kills Queued & Running Jobs Hard partition of resources into map and reduce slots Non-optimal Resource Utilization © Hortonworks Inc. 2013 - Confidential Page 5
  • 6. Our Vision: Hadoop as Next-Gen Platform Single Use System Multi Purpose Platform Batch Apps Batch, Interactive, Online, Streaming, … HADOOP 1.0 HADOOP 2.0 MapReduce Others (data processing) MapReduce YARN (cluster resource management & data processing) (cluster resource management) HDFS HDFS2 (redundant, reliable storage) (redundant, highly-available & reliable storage) © Hortonworks Inc. 2013 - Confidential Page 6
  • 7. Hadoop 2 - YARN Architecture ResourceManager (RM) Central agent - Manages and allocates cluster resources Node Manager NodeManager (NM) Per-Node agent - Manages and App Mstr enforces node resource allocations ApplicationMaster (AM) Per-Application – Resource Manager Node Manager Client Container Manages application lifecycle and task scheduling MapReduce Status Job Submission Node Manager Node Status Resource Request © Hortonworks Inc. 2013 - Confidential Page 7
  • 8. YARN: Taking Hadoop Beyond Batch Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service Applications Run Natively in Hadoop BATCH INTERACTIVE (MapReduce) (Tez) ONLINE (HBase) STREAMING (Storm, S4,…) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) OTHER (Search) (Weave…) YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) © Hortonworks Inc. 2013 - Confidential Page 8
  • 9. 5 Key Benefits of YARN 1. New Applications & Services 2. Improved cluster utilization 3. Scale 4. Experimental Agility 5. Shared Services © Hortonworks Inc. 2013 - Confidential Page 9
  • 10. Key Improvements in YARN Framework supporting multiple applications – Separate generic resource brokering from application logic – Define protocols/libraries and provide a framework for custom application development – Share same Hadoop Cluster across applications Cluster Utilization – Generic resource container model replaces fixed Map/Reduce slots. Container allocations based on locality, memory (CPU coming soon) – Sharing cluster among multiple application © Hortonworks Inc. 2013 - Confidential Page 10
  • 11. Key Improvements in YARN Scalability – Removed complex app logic from RM, scale further – State machine, message passing based loosely coupled design – Compact scheduling protocol Application Agility and Innovation – Use Protocol Buffers for RPC gives wire compatibility – Map Reduce becomes an application in user space unlocking safe innovation – Multiple versions of an app can co-exist leading to experimentation – Easier upgrade of framework and application © Hortonworks Inc. 2013 - Confidential Page 11
  • 12. Key Improvements in YARN Shared Services – Common services needed to build distributed application are included in a pluggable framework – Distributed file sharing service – Remote data read service – Log Aggregation Service © Hortonworks Inc. 2013 - Confidential Page 12
  • 13. YARN: Efficiency with Shared Services Yahoo! leverages YARN 40,000+ nodes running YARN across over 365PB of data ~400,000 jobs per day for about 10 million hours of compute time Estimated a 60% – 150% improvement on node usage per day using YARN Eliminated Colo (~10K nodes) due to increased utilization For more details check out the YARN SOCC 2013 paper © Hortonworks Inc. 2013 - Confidential Page 13
  • 14. YARN as Cluster Operating System ResourceManager Scheduler NodeManager NodeManager NodeManager NodeManager map 1.1 nimbus0 vertex1.1.1 vertex1.2.2 NodeManager NodeManager NodeManager NodeManager map1.2 Batch Interactive SQL vertex1.1.2 nimbus2 NodeManager NodeManager NodeManager NodeManager nimbus1 Real-Time vertex1.2.1 reduce1.1 © Hortonworks Inc. 2013 - Confidential Page 14
  • 15. Multi-Tenancy is Built-in • Queues • Economics as queue-capacity – Hierarchical Queues • SLAs ResourceManager – Cooperative Preemption Scheduler • Resource Isolation – Linux: cgroups – Roadmap: Virtualization (Xen, KVM) • Administration – Queue ACLs – Run-time re-configuration for queues Default Capacity Scheduler supports all features © Hortonworks Inc. 2013 - Confidential Hierarchical Queues root Mrkting 20% Dev 20% Adhoc 10% Prod 80% DW 70% Dev Reserved Prod 10% 20% 70% P0 70% P1 30% Capacity Scheduler Page 15
  • 16. YARN Eco-system Applications Powered by YARN Apache Giraph – Graph Processing Apache Hama - BSP Apache Hadoop MapReduce – Batch Apache Tez – Batch/Interactive Apache S4 – Stream Processing Apache Samza – Stream Processing Apache Storm – Stream Processing Apache Spark – Iterative applications Elastic Search – Scalable Search Cloudera Llama – Impala on YARN DataTorrent – Data Analysis HOYA – HBase on YARN © Hortonworks Inc. 2013 - Confidential There's an app for that... YARN App Marketplace! Frameworks Powered By YARN Apache Twill REEF by Microsoft Spring support for Hadoop 2 Page 16
  • 17. YARN Application Lifecycle Application Client Protocol Application Client YarnClient App Specific API Resource Manager NodeManager Application Master Protocol App Container Application Master AMRMClient Container Management Protocol NMClient © Hortonworks Inc. 2013 - Confidential Page 17
  • 18. BYOA – Bring Your Own App Application Client Protocol: Client to RM interaction – Library: YarnClient – Application Lifecycle control – Access Cluster Information Application Master Protocol: AM – RM interaction – Library: AMRMClient / AMRMClientAsync – Resource negotiation – Heartbeat to the RM Container Management Protocol: AM to NM interaction – Library: NMClient/NMClientAsync – Launching allocated containers – Stop Running containers Use external frameworks like Twill/REEF/Spring © Hortonworks Inc. 2013 - Confidential Page 18
  • 19. YARN Future Work • ResourceManager High Availability – Automatic failover – Work preserving failover • Scheduler Enhancements – SLA Driven Scheduling, Low latency allocations – Multiple resource types – disk/network/GPUs/affinity • Rolling upgrades • Generic History Service • Long running services – Better support to running services like HBase – Service Discovery • More utilities/libraries for Application Developers – Failover/Checkpointing © Hortonworks Inc. 2013 - Confidential Page 19
  • 20. Key Take-Aways • YARN is a platform to build/run Multiple Distributed Applications in Hadoop • YARN is completely Backwards Compatible for existing MapReduce apps • YARN enables Fine Grained Resource Management via Generic Resource Containers. • YARN has built-in support for multi-tenancy to share cluster resources and increase cost efficiency • YARN provides a cluster operating system like abstraction for a modern data architecture © Hortonworks Inc. 2013 - Confidential Page 20
  • 21. Apache YARN The Data Operating System for Hadoop 2.0 Flexible Efficient Shared Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming Increase processing IN Hadoop on the same hardware while providing predictable performance & quality of service Provides a stable, reliable, secure foundation and shared operational services across multiple workloads Data Processing Engines Run Natively IN Hadoop BATCH MapReduce INTERACTIVE Tez ONLINE HBase STREAMING Storm, S4, … GRAPH Giraph MICROSOFT REEF SAS LASR, HPA OTHERS YARN: Cluster Resource Management HDFS2: Redundant, Reliable Storage © Hortonworks Inc. 2013 - Confidential Page 21
  • 22. Thank you! http://hortonworks.com/products/hortonworks-sandbox/ Download Sandbox: Experience Apache Hadoop Both 2.0 and 1.x Versions Available! http://hortonworks.com/products/hortonworks-sandbox/ Questions? © Hortonworks Inc. 2013 - Confidential Page 22

Editor's Notes

  1. As Arun mentioned there are less JVMs to spin up per job management (1 instead of 3) as well as the RM and NM provisioning being fasterOriginally conceived & architected by the team at Yahoo!Arun Murthy created the original JIRA in 2008 and led the PMCThe team at Hortonworks has been working on YARN for 4 years: 90% of code from Hortonworks & Yahoo!YARN based architecture running at scale at Yahoo!Deployed on 35,000 nodes for 6+ monthsMultitude of YARN applications*********************On great public example of in production use of YARN, is at Yahoo!. They outlined some performance gains in a keynote address at Hadoop Summit this year. Yahoo uses YARN for three use cases, stream processing, iterative processing and shared storage. With Storm on YARN they stream data into a cluster and execute 5 second analytics windows. This cluster is only 320 nodes, but is processing 133,000 events per second and is executing 12000 threads. Their shared data cluster uses 1900 nodes to store 2PB of data.In all, Yahoo has over 30000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time. They also have estimated a 60% – 150% improvement on node usage per day.ANDAt this point, over 50,000 Hadoop nodes have been upgraded at Yahoo from Hadoop 1.0 to Hadoop 2, yielding 50% improvement in cluster utilization & efficiency.This should be a big deal in terms of potential ROI.
  2. Reconfig already works (HDP-1.3 and HDP-2.0).Chargeback - coming in HDP-2.1https://issues.apache.org/jira/browse/YARN-415----- Meeting Notes (11/14/13 11:07) -----POINT AND CLICK TIME BASED RESOURCE ALLOCATION
  3. Graph processing – Giraph, HamaStream proessing – Smaza, Storm, Spark, DataTorrentMapReduceTez – fast query executionWeave/REEF – frameworks to help with writing applicationsList of some of the applications which already support YARN, in some form.Smaza, Storm, S4 and DataTorrent are streaming frameworksVarious types of graph processing frameworks – Giraph and Hama are graph processing systemsThere’s some github projects – caching systems, on-demand web-server spin up Wave and REEF are frameworks on top of YARN to make writing applications easier
  4. YarnClientApplication Lifecycle Control – submit, query state, kill - .e.g Submit, followed by poll state, and optionally connect to AM for further stateCluster Information – Query node status, Scheduler queues etcAMRMCLientRequest Resources from the RM – specifies a priority (within app only), location information, and whether strict locality is required, also number of such containers.Hides the details of the protocol – which is absolute (i.e. each call to the RM must provide complete information about a priority level)Once a container is allocated, provides a utility method to link it back to the entity which generated the request (entity optionally passed in while asking for containers)Allocate call doubles as a heartbeat – must be sent out ever so often.Also provides some node level information, available resources, etcNMClientOnce resources are obtained from the RM, an App uses this to communicate with the NM to start containers. Setup the ContainerLuanchContext – cmdLine, environment, resources.Authorization information is part of the contianers allocated by the RM – to prevent unauthorized usageOnly a single process can be launched on an allocated container, and one must be within a certain period otherwise the allocation times outContainer must launch a process with a certain time after being allocatedOnly a single process can be launched on an allocated container
  5. HA and work preserving – being actively worked upoin by the communitiy.Scheduler – Additional resources – specifically disk / network. Gang schedulingRolling upgrades – upgrading a cluster typically involves downtime. NM forgets containers across restartsLong Running – Enhandcement to log handling, security, multiple tasks per container, container resizingHA and work preserving restart are still being worked on in the community – YARN-128 and YARN-149.On scheduling – there’ve been requests for gang scheduling, meeting SLAs. Also TBD is support for scheduling additional resource types – disk/ network.Rolling Upgrades – some work pending. Big piece here, which ties in with work preserving restart – restarting a NodeManager should not cause processes started by the previous NM to be killedLong Running Services support – handling logs, security – specifically token expiryAdditional utility libraries to help AppWriters – primarily geared towards checkpointing in the AM, app history handling
  6. The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it.The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. With predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future.For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.