2. Main features of Hadoop 2.0
• High availability for HDFS
• Federation for HDFS
• Generalized Resource Management (YARN)
• Plus: performance improvements, security improvements, compatibility improvements…
VertiCloud 2
4. HDFS 1.0 (and earlier)
• Name node (gets to be huge!)
• Data nodes (lots of them!)
5. Problems having a single NN
• Scalability – NN limits horizontal scaling
• Performance – NN is performance bottleneck
• Isolation – all tenants share same NN
– One misbehaving tenant brings everyone down
– Can’t provide higher QoS to mission-critical apps
– This is a problem even for small clusters!
6. HDFS Federation
• ViewFS on top of multiple name nodes: NN1, NN2, NN3, NN4
• Data nodes (even more of them!)
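One way to picture ViewFS is as a client-side mount table that maps path prefixes to name nodes. The sketch below is only an illustration of that idea, not the Hadoop implementation: the mount points, the `MOUNT_TABLE` dictionary, and the `resolve` helper are all hypothetical names chosen for the example.

```python
# Hypothetical mount table: path prefix -> name node responsible for it.
MOUNT_TABLE = {
    "/user": "NN1",
    "/data": "NN2",
    "/data/logs": "NN3",
    "/tmp": "NN4",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return the name node whose mount point is the longest prefix of path."""
    best = None
    for mount in mount_table:
        # Match whole path components only, so "/userdata" does not hit "/user".
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return mount_table[best]


if __name__ == "__main__":
    for p in ["/user/alice/part-0", "/data/logs/2013", "/data/raw"]:
        print(p, "->", resolve(p))
```

Because each name node only sees the paths under its own mount points, tenants can be placed on different name nodes without changing the paths their jobs use.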
7. Future possibilities for HDFS
• Snapshots (!)
• Partial name spaces
• Alternative namespace managers
• Global replication management
• Disaster recovery
9. MapReduce 1.0 (and earlier)
• JobTracker
– Queue of jobs
– Queue of tasks
– Job and task scheduling and monitoring
• Slave nodes (lots of them!)
10. Problems with JT
• Scalability – JT limits horizontal scaling
• Availability – when JT dies, jobs must restart
• Upgradability – must stop jobs to upgrade JT
• Hardwired – JT only supports MapReduce
• Increasingly hard to improve
– Performance, scheduling, or utilization
11. Observation
• Move intra-job management out of the central node!
• JobTracker
– Queue of jobs
– Queue of tasks
– Job and task scheduling and monitoring
• Slave nodes (lots of them!)
• Why are we doing all of this on a single node, when we have all these nodes?
13. YARN Components
• Resource Manager (per cluster)
– Manages job scheduling and execution
– Global resource allocation
• Application Master (per job)
– Manages task scheduling and execution
– Local resource allocation
• Node Manager (per-machine agent)
– Manages the lifecycle of task containers
– Reports to RM on health and resource usage
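The division of labor between the three components can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the YARN API: the class and method names (`register`, `allocate`, `report`, `run`) are hypothetical, memory is the only resource tracked, and the first-fit allocation policy is chosen only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks container memory on one node."""
    name: str
    total_mb: int
    used_mb: int = 0

    def free_mb(self):
        return self.total_mb - self.used_mb

    def report(self):
        # Stand-in for the health/usage report sent to the Resource Manager.
        return {"node": self.name, "free_mb": self.free_mb()}


@dataclass
class ResourceManager:
    """Per-cluster: owns global resource allocation across all nodes."""
    nodes: dict = field(default_factory=dict)

    def register(self, nm):
        self.nodes[nm.name] = nm

    def allocate(self, count, mb_each):
        """Grant up to `count` containers (first fit); return their node names."""
        granted = []
        for nm in self.nodes.values():
            while len(granted) < count and nm.free_mb() >= mb_each:
                nm.used_mb += mb_each
                granted.append(nm.name)
        return granted


class ApplicationMaster:
    """Per-job: asks the RM for containers and maps its own tasks onto them."""

    def __init__(self, rm, tasks, mb_per_task):
        self.rm = rm
        self.tasks = list(tasks)
        self.mb_per_task = mb_per_task

    def run(self):
        containers = self.rm.allocate(len(self.tasks), self.mb_per_task)
        # Tasks beyond the granted containers stay unscheduled in this sketch.
        return dict(zip(self.tasks, containers))


if __name__ == "__main__":
    rm = ResourceManager()
    rm.register(NodeManager("node1", 2048))
    rm.register(NodeManager("node2", 2048))
    am = ApplicationMaster(rm, ["t1", "t2", "t3"], mb_per_task=1024)
    print(am.run())
    print([nm.report() for nm in rm.nodes.values()])
```

The point of the split is visible in the code: the Resource Manager never sees individual tasks, only container requests, so task-level scheduling logic lives entirely in the per-job Application Master.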
14. Lifecycle of a job
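• Participants: Client, Resource Manager, App Master, Node Managers, Containers
• Client to Resource Manager: "Submit"; reply: "OK"
• Resource Manager to App Master: "Go"
• App Master to Resource Manager: "I need resources!"; reply: "Here you are"
• App Master to Node Managers: "Start containers"; reply: "Here you are"
• App Master to Containers: "Do work!"
• Client to Resource Manager, repeatedly: "Done?"; reply: "No"
• Containers to App Master: "Done"; App Master to Resource Manager: "Done"
• Client to Resource Manager: "Done?"; reply: "Yes"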
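The message exchange can also be written down as an ordered trace. The sketch below is one reading of the sequence diagram, with sender and receiver for each message inferred from the slide layout; the function name `job_lifecycle` and the `polls_before_done` parameter are hypothetical, and real YARN involves more steps than shown here.

```python
def job_lifecycle(polls_before_done=2):
    """Return the ordered (sender, receiver, message) trace for one job."""
    trace = []

    def send(sender, receiver, message):
        trace.append((sender, receiver, message))

    # Submission and start of the per-job App Master.
    send("Client", "ResourceManager", "Submit")
    send("ResourceManager", "Client", "OK")
    send("ResourceManager", "AppMaster", "Go")

    # The App Master negotiates resources, then starts containers.
    send("AppMaster", "ResourceManager", "I need resources!")
    send("ResourceManager", "AppMaster", "Here you are")
    send("AppMaster", "NodeManagers", "Start containers")
    send("NodeManagers", "AppMaster", "Here you are")
    send("AppMaster", "Containers", "Do work!")

    # The client polls while the containers are still working.
    for _ in range(polls_before_done):
        send("Client", "ResourceManager", "Done?")
        send("ResourceManager", "Client", "No")

    # Completion propagates back up, and the next poll succeeds.
    send("Containers", "AppMaster", "Done")
    send("AppMaster", "ResourceManager", "Done")
    send("Client", "ResourceManager", "Done?")
    send("ResourceManager", "Client", "Yes")
    return trace


if __name__ == "__main__":
    for sender, receiver, message in job_lifecycle():
        print(f"{sender:>15} -> {receiver:<15} {message}")
```

Note that the Resource Manager appears only in the resource negotiation and status messages; starting containers and driving the work happen between the App Master, the Node Managers, and the containers.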
15. Why YARN is important
• Fixes scalability and availability problems
• Supports experimentation
– At both YARN and MapReduce levels
• Supports alternatives to MapReduce!!
– OpenMPI
– Interactive SQL (Impala)
– Streaming
• Storm, Apache S4, others…
– HBase integration
– Graph processing (Apache Giraph)
16. Futures of YARN and MR
• YARN
– Models beyond MapReduce
– Scheduling improvements (including preemption)
– Container isolation
• MapReduce
– Decompose into reusable pieces
– Push as well as pull in shuffle
– Simple hash (no sort) in shuffle
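To make the "simple hash (no sort)" idea concrete, here is a minimal sketch of grouping map output by key with hash tables instead of sorting it. It illustrates the general technique only and is not taken from any Hadoop code; the names `partition` and `hash_shuffle` are hypothetical, and `zlib.crc32` is used simply because it gives the same partition for a key on every run.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def hash_shuffle(pairs, num_reducers):
    """Group (key, value) pairs per reducer using hash tables, with no sort."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


if __name__ == "__main__":
    map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]
    for reducer_id, groups in enumerate(hash_shuffle(map_output, 2)):
        print(reducer_id, dict(groups))
```

Each reducer still receives all values for a given key together, but keys arrive in no particular order, which is enough for aggregations such as counts and sums that do not rely on sorted input.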