YARN - Presented At Dallas Hadoop User Group
1. Hadoop 2.0 – YARN
Yet Another Resource Negotiator
Rommel Garcia
Solutions Engineer
© Hortonworks Inc. 2013
Page 1
2. Agenda
• Hadoop 1.X & 2.X – Concepts Recap
• YARN Architecture – How does this affect MRv1?
• Slots be gone – What does this mean for MapReduce?
• Building YARN Applications
• Q & A
© Hortonworks Inc. 2013
3. Hadoop 1.X vs. 2.X
Recap over the differences
© Hortonworks Inc. 2013
4. The 1st Generation of Hadoop: Batch
HADOOP 1.0
Built for Web-Scale Batch Apps
[Diagram: separate silos, each a single app on its own HDFS cluster – one Single App for INTERACTIVE, one for ONLINE, and several for BATCH]
© Hortonworks Inc. 2013
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
5. Hadoop MapReduce Classic
• JobTracker
–Manages cluster resources and job scheduling
• TaskTracker
–Per-node agent
–Manage tasks
© Hortonworks Inc. 2013
Page 5
6. Hadoop 1
• Limited to about 4,000 nodes per cluster
• O(# of tasks in a cluster)
• JobTracker bottleneck - resource management, job
scheduling and monitoring
• Only has one namespace for managing HDFS
• Map and Reduce slots are static
• MapReduce is the only type of job that can run
© Hortonworks Inc. 2013
7. Hadoop 1.X Stack
[Stack diagram: HORTONWORKS DATA PLATFORM (HDP) –
OPERATIONAL SERVICES: AMBARI, OOZIE
DATA SERVICES: FLUME, SQOOP (LOAD & EXTRACT), HIVE & PIG, HCATALOG, HBASE
HADOOP CORE: MAP REDUCE, HDFS (NFS, WebHDFS)
PLATFORM SERVICES: Enterprise Readiness – High Availability, Disaster Recovery, Security and Snapshots
Deployable on OS, VM, Cloud, or Appliance]
© Hortonworks Inc. 2013
Page 7
8. Our Vision: Hadoop as Next-Gen Platform
[Diagram:
HADOOP 1.0 – Single Use System, Batch Apps: MapReduce (cluster resource management & data processing) on HDFS (redundant, reliable storage)
HADOOP 2.0 – Multi Purpose Platform, Batch, Interactive, Online, Streaming, …: MapReduce and others (data processing) on YARN (cluster resource management) on HDFS2 (redundant, reliable storage)]
© Hortonworks Inc. 2013
Page 8
9. YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
[Diagram: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI) and OTHER (Search, Weave, …) applications all running natively on YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage)]
© Hortonworks Inc. 2013
Page 9
10. Hadoop 2
• Potentially up to 10,000 nodes per cluster
• O(cluster size)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• MRv1 backward and forward compatible
• Any application can integrate with Hadoop
• Beyond Java
© Hortonworks Inc. 2013
13. A Brief History of YARN
• Originally conceived & architected by the team at Yahoo!
– Arun Murthy created the original JIRA in 2008, led the PMC
– Currently Arun is the lead for MapReduce/YARN/Tez at Hortonworks and was formerly the architect of Hadoop MapReduce at Yahoo!
• The team at Hortonworks has been working on YARN for 4 years
• YARN based architecture running at scale at Yahoo!
– Deployed on 35,000 nodes for about a year
– Implemented Storm-on-Yarn that processes 133,000 events per second.
© Hortonworks Inc. 2013
Page 13
14. Concepts
• Application
–Application is a job submitted to the framework
–Example – Map Reduce Job
• Container
–Basic unit of allocation
–Fine-grained resource allocation across multiple resource
types (memory, cpu, disk, network, gpu etc.)
– container_0 = 2GB, 1CPU
– container_1 = 1GB, 6 CPU
–Replaces the fixed map/reduce slots
© Hortonworks Inc. 2013
14
15. Architecture
• Resource Manager
–Global resource scheduler
–Hierarchical queues
–Application management
• Node Manager
–Per-machine agent
–Manages the life-cycle of containers
–Container resource monitoring
• Application Master
–Per-application
–Manages application scheduling and task execution
–E.g. MapReduce Application Master
© Hortonworks Inc. 2013
15
16. YARN – Running Apps
[Diagram: two Hadoop clients each create and submit an application (app1, app2) to the ResourceManager. Inside the ResourceManager, the ApplicationsManager (ASM) tracks the applications and the Scheduler partitions resources into queues. Each application's ApplicationMaster (AM1 on Rack2, AM2 on Rack1) negotiates containers from the ResourceManager and reports back to the ASM; the containers (C1.1–C1.4, C2.1–C2.3) run on NodeManagers spread across Rack1, Rack2, … RackN, which send status reports back.]
© Hortonworks Inc. 2012
18. Apache Hadoop MapReduce on YARN
• Original use-case
• Most complex application to build
– Data-locality
– Fault tolerance
– ApplicationMaster recovery: Check point to HDFS
– Intra-application Priorities: Maps vs. Reduces
– Needed complex feedback mechanism from ResourceManager
– Security
– Isolation
• Binary compatible with Apache Hadoop 1.x
© Hortonworks Inc. 2013
Page 18
19. Apache Hadoop MapReduce on YARN
[Diagram: the ResourceManager (with its Scheduler) oversees a cluster of NodeManagers. Two MapReduce ApplicationMasters (MR AM 1 and MR AM 2) each run in their own container and manage their own tasks on other NodeManagers: map 1.1, map 1.2 and reduce 1.1 for job 1; map 2.1, map 2.2, reduce 2.1 and reduce 2.2 for job 2.]
© Hortonworks Inc. 2012
20. Efficiency Gains of YARN
• Key Optimizations
– No hard segmentation of resource into map and reduce slots
– The YARN scheduler is more efficient
– All resources are fungible
• Yahoo has over 30,000 nodes running YARN across over
365 PB of data.
• They calculate running about 400,000 jobs per day for
about 10 million hours of compute time.
• They also have estimated a 60% – 150% improvement on
node usage per day.
• Yahoo got rid of a whole colo (10,000 node datacenter)
because of their increased utilization.
© Hortonworks Inc. 2013
21. An Example Calculating Node Capacity
• Important Parameters
– mapreduce.[map|reduce].memory.mb
– The physical RAM hard-limit enforced by Hadoop on the task's container
– mapreduce.[map|reduce].java.opts
– The heap size of the JVM (-Xmx)
– yarn.scheduler.minimum-allocation-mb
– The smallest container YARN will allocate
– yarn.nodemanager.resource.memory-mb
– The amount of physical RAM on the node available for containers
– yarn.nodemanager.vmem-pmem-ratio
– The amount of virtual memory each container is allowed
– Calculated as containerMemoryRequest * vmem-pmem-ratio
© Hortonworks Inc. 2013
22. Calculating Node Capacity Continued
• Let's say we need a 1 GB map heap and a 2 GB reduce heap
• mapreduce.[map|reduce].java.opts = [-Xmx1g | -Xmx2g]
• Remember a container has more overhead than just your heap! Add
512 MB to the container limit for overhead
• mapreduce.[map|reduce].memory.mb = [1536 | 2560]
• We have 36 GB per node and a minimum allocation of 512 MB
• yarn.nodemanager.resource.memory-mb=36864
• yarn.scheduler.minimum-allocation-mb=512
• Virtual memory for each container is
• Map: 1536 MB * vmem-pmem-ratio (default is 2.1) = 3225.6 MB
• Reduce: 2560 MB * vmem-pmem-ratio = 5376 MB
• Our 36 GB node can support
• 24 Maps OR 14 Reducers OR any combination allowed by the
resources on the node (see the configuration sketch below)
© Hortonworks Inc. 2013
24. YARN – Implementing Applications
• What APIs do I need to use?
–Only three protocols
– Client to ResourceManager
– Application submission
– ApplicationMaster to ResourceManager
– Container allocation
– ApplicationMaster to NodeManager
– Container launch
–Use client libraries for all 3 actions
–Module yarn-client
–Provides both synchronous and asynchronous libraries
–Or use a 3rd-party framework like Weave
– http://continuuity.github.io/weave/
© Hortonworks Inc. 2013
24
25. YARN – Implementing Applications
• What do I need to do?
–Write a submission Client
–Write an ApplicationMaster (well, copy-paste one)
–DistributedShell is the new WordCount
–Get containers, run whatever you want!
© Hortonworks Inc. 2013
25
26. YARN – Implementing Applications
• What else do I need to know?
–Resource Allocation & Usage
–ResourceRequest
–Container
–ContainerLaunchContext
–LocalResource
–ApplicationMaster
–ApplicationId
–ApplicationAttemptId
–ApplicationSubmissionContext
© Hortonworks Inc. 2013
26
27. YARN – Resource Allocation & Usage
• ResourceRequest
– Fine-grained resource ask to the ResourceManager
– Ask for a specific amount of resources (memory, cpu etc.) on a
specific machine or rack
– Use the special resource name * to ask for any machine
ResourceRequest
priority
resourceName
capability
numContainers
© Hortonworks Inc. 2013
Page 27
28. YARN – Resource Allocation & Usage
• ResourceRequest
priority | resourceName | capability    | numContainers
---------+--------------+---------------+--------------
0        | host01       | <2gb, 1 core> | 1
0        | rack0        | <2gb, 1 core> | 1
0        | *            | <2gb, 1 core> | 1
1        | *            | <4gb, 1 core> | 1
© Hortonworks Inc. 2013
Page 28
29. YARN – Resource Allocation & Usage
• Container
– The basic unit of allocation in YARN
– The result of the ResourceRequest provided by ResourceManager
to the ApplicationMaster
– A specific amount of resources (cpu, memory etc.) on a specific
machine
Container
containerId
resourceName
capability
tokens
© Hortonworks Inc. 2013
Page 29
30. YARN – Resource Allocation & Usage
• ContainerLaunchContext
– The context provided by ApplicationMaster to NodeManager to
launch the Container
– Complete specification for a process
– LocalResource used to specify container binary and
dependencies
– NodeManager responsible for downloading from shared namespace
(typically HDFS)
ContainerLaunchContext
container
commands
environment
localResources
LocalResource
uri
type
© Hortonworks Inc. 2013
Page 30
31. YARN - ApplicationMaster
• ApplicationMaster
– Per-application controller aka container_0
– Parent for all containers of the application
– ApplicationMaster negotiates all its containers from
ResourceManager
– ApplicationMaster container is child of ResourceManager
– Think init process in Unix
– RM restarts the ApplicationMaster attempt if required (unique
ApplicationAttemptId)
– Code for application is submitted along with Application itself
© Hortonworks Inc. 2013
Page 31
32. YARN - ApplicationMaster
• ApplicationMaster
– ApplicationSubmissionContext is the complete specification of the
ApplicationMaster, provided by Client
– ResourceManager responsible for allocating and launching
ApplicationMaster container
ApplicationSubmissionContext
resourceRequest
containerLaunchContext
appName
queue
© Hortonworks Inc. 2013
Page 32
33. YARN Application API - Overview
• YarnClient is the submission client API
• Both synchronous & asynchronous APIs for resource
allocation and container start/stop
• Synchronous API
– AMRMClient
– AMNMClient
• Asynchronous API
– AMRMClientAsync
– AMNMClientAsync
© Hortonworks Inc. 2013
Page 33
34. YARN Application API – The Client
[Diagram: (1) a client calls YarnClient.createApplication to send a New Application Request to the ResourceManager, then (2) calls YarnClient.submitApplication. The ResourceManager's Scheduler launches an ApplicationMaster (AM 1, AM 2) for each submitted application on a NodeManager, and each AM's containers (Container 1.1–1.3, Container 2.1–2.4) run on NodeManagers across the cluster.]
© Hortonworks Inc. 2012
35. YARN Application API – The Client
• YarnClient
– createApplication to create application
– submitApplication to start application
– Application developer needs to provide ApplicationSubmissionContext
– APIs to get other information from ResourceManager
– getAllQueues
– getApplications
– getNodeReports
– APIs to manipulate submitted application e.g. killApplication
© Hortonworks Inc. 2013
Page 35
36. YARN Application API – Resource Allocation
[Diagram: (1) the ApplicationMaster calls registerApplicationMaster against the ResourceManager, (2) asks for resources with AMRMClient.allocate, (3) the Scheduler grants Containers on NodeManagers, and (4) the AM calls unregisterApplicationMaster when it is done.]
© Hortonworks Inc. 2012
37. YARN Application API – Resource Allocation
• AMRMClient - Synchronous API for ApplicationMaster
to interact with ResourceManager
– Prologue / epilogue – registerApplicationMaster /
unregisterApplicationMaster
– Resource negotiation with ResourceManager
– Internal book-keeping - addContainerRequest / removeContainerRequest /
releaseAssignedContainer
– Main API – allocate
– Helper APIs for cluster information
– getAvailableResources
– getClusterNodeCount
© Hortonworks Inc. 2013
Page 37
38. YARN Application API – Resource Allocation
• AMRMClientAsync - Asynchronous API for
ApplicationMaster
– Extension of AMRMClient to provide asynchronous
CallbackHandler
– Callbacks make it easier for the application developer to build a
mental model of the interaction with the ResourceManager
– onContainersAllocated
– onContainersCompleted
– onNodesUpdated
– onError
– onShutdownRequest
© Hortonworks Inc. 2013
Page 38
39. YARN Application API – Using Resources
[Diagram: once the ResourceManager's Scheduler has granted Container 1.1 on a NodeManager, the ApplicationMaster (AM 1) launches it with AMNMClient.startContainer and monitors it with AMNMClient.getContainerStatus, talking directly to that NodeManager.]
© Hortonworks Inc. 2012
40. YARN Application API – Using Resources
• AMNMClient - Synchronous API for ApplicationMaster
to launch / stop containers at NodeManager
– Simple (trivial) APIs
– startContainer
– stopContainer
– getContainerStatus
© Hortonworks Inc. 2013
Page 40
41. YARN Application API – Using Resources
• AMNMClientAsync - Asynchronous API for ApplicationMaster to launch / stop containers at NodeManager
– Simple (trivial) APIs
– startContainerAsync
– stopContainerAsync
– getContainerStatusAsync
– CallbackHandler makes it easier for the application developer to build a
mental model of the interaction with the NodeManager
– onContainerStarted
– onContainerStopped
– onStartContainerError
– onContainerStatusReceived
© Hortonworks Inc. 2013
Page 41
Speaker notes
• Traditional batch vs. next-gen: the first Hadoop use case was to map the whole internet, a graph with millions of nodes and trillions of edges. We are still back to the same problem of silos when manipulating data for interactive or online applications. Long story short, there was no support for alternative processing models; iterative tasks can take 10x longer because of I/O barriers.
• In classic Hadoop, MapReduce was part of the JobTracker and TaskTracker, so everything had to be built on MapReduce first. It has scalability limits (around 4k nodes), iterative processes can take forever on MapReduce, and JobTracker failures kill jobs and everything in the queue.
• Typical open-source stack: as we can see, everything sits on top of MapReduce; applications like Pig, Hive and HBase are also on top of MapReduce.
• While Hadoop 1.x had its uses, this is really about turning Hadoop into the next-generation platform. A platform should be able to do multiple things, i.e. more than just batch processing. We need batch, interactive, online and streaming capabilities to really turn Hadoop into a next-gen platform. It scales: Yahoo plans to move to a 10k-node cluster.
• So what does this really do? It provides a distributed application framework. Hadoop now provides a platform where we can store all data in a reliable way and, on the same platform, process that data without having to move it (data locality). New additions to the family include Falcon for data lifecycle management, Tez, a new way of processing that avoids some of the I/O barriers MapReduce experienced, and Knox for security and other enterprise features. Most importantly YARN, which, as you can see, everything now sits on top of.
• A container spawned by an AM can itself be a client and ask for another application to start, which in turn can do the same thing. We now have the concept of deploying applications into the Hadoop cluster; these applications run in containers of set resources. The ResourceManager takes the place of the JobTracker and still has scheduling queues such as the fair, capacity and hierarchical queues.
• Data locality: attempts to find a local host; if that fails, moves to the nearest rack. Fault tolerance: robust in terms of managing containers. Recovery: if the AM dies, the MapReduce ApplicationMaster has written a checkpoint to HDFS, so the restarted attempt reads the checkpoint and continues. Intra-application priorities: maps have to be completed before reduces, so there is a complex process in the ApplicationMaster to balance mappers and reducers. Complex feedback from the RM: the ApplicationMaster can now look ahead and find out how many resources it can get in the next 20 minutes. Migrate directly to YARN without changing a single line of code; just recompile.
• Haven't used Weave yet, but it's on the to-do list. Sadly my tasks keep growing and I can't do them in parallel.
• ApplicationAttemptId (combination of the application id and the attempt/fail count). ApplicationSubmissionContext: submitted by the client. getAllQueues: metrics related to the queue such as max capacity, current capacity, application count. getApplications: list of applications. getNodeReports: id, rack, host, number of containers. The ApplicationSubmissionContext needs a ContainerLaunchContext as well, plus resources, priority, queue, etc.