1. © Hortonworks Inc. 2014
Apache Hadoop YARN
Present and Future
Vinod Kumar Vavilapalli
vinodkv [at] apache.org
@tshooter
Jian He
jianhe [at] apache.org
2. © Hortonworks Inc. 2014
Who are we?
• Vinod Kumar Vavilapalli
– 7 Hadoop-years old
– Previously @Yahoo!, now @Hortonworks
– Hadoop MapReduce and YARN Development lead & Architect at Hortonworks
– Apache Hadoop YARN project lead
– Apache Hadoop PMC, Apache Member
– 99% + code in Apache, Hadoop
• Jian He
– Software Engineer @ Hortonworks
– Apache Hadoop Committer
– Master's degree from Brown University
– Focus on YARN/MapReduce
Architecting the Future of Big Data
3. © Hortonworks Inc. 2014
A quick show of hands..
• Hadoop 1
• Hadoop 2 & YARN
• YARN for MapReduce2
• YARN for beyond MR2
4. © Hortonworks Inc. 2014
Agenda
• Apache Hadoop 2 : Overview
• Community
• Present
• Future
5. © Hortonworks Inc. 2014
Apache Hadoop 2
Next Generation Architecture
6. © Hortonworks Inc. 2014
YARN: the Data Operating System
• Resource Management Platform
• MapReduce v2
• Beyond MapReduce with Tez, Storm, Spark; in Hadoop!
• Did I mention Services like HBase, Accumulo on YARN with Apache Slider?
7. © Hortonworks Inc. 2014
Why?
• 2.0 >= 2 * 1.0
– YARN: Next generation architecture
• Scale
• Agility
• Return on Investment: 2x throughput on same hardware!
• Ready for improvements in hardware
• Not convinced? Let’s see what others are saying!
8. © Hortonworks Inc. 2014
Yahoo!
• Leader/Visionary on all things Hadoop!
• On YARN (0.23.x)
• Moving fast to 2.x
http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
9. © Hortonworks Inc. 2014
Twitter
Talk: “Hadoop 2 @Twitter, Elephant Scale”
By: Lohit Vijayarenu & Gera Shegalov
10. © Hortonworks Inc. 2014
eBay
• Has one of the largest Hadoop clusters in the industry, with tens to hundreds of petabytes of data
• Migrated production clusters to Hadoop 2
11. © Hortonworks Inc. 2014
YARN Community
At Apache Software Foundation
12. © Hortonworks Inc. 2014
YARN contributions
[Chart: YARN contributions per release line (2.0.x, 2.1.x, 2.2.x, 2.3.x, 2.4.x, 2.x trunk), y-axis 0 to 400; YARN Releases - 06/02/14]
13. © Hortonworks Inc. 2014
Contributors
• 104 and counting
• Few ‘big’ contributors
• And a long tail
[Chart: distribution of contributions across contributors, y-axis 0 to 100]
15. © Hortonworks Inc. 2014
Apache Hadoop releases: Apache Hadoop 2.2
• 15 October, 2013
• The first GA release of Apache Hadoop 2.x
• YARN
– First stable and supported release of YARN
– YARN-level APIs solidified for the future
– Binary compatibility for MapReduce applications built on hadoop-1.x
– Performance
– Scale!
• Support for running Hadoop on Microsoft Windows
• Substantial integration testing with the rest of the projects in the ecosystem
– Pig, Hive, Oozie, HBase..
16. © Hortonworks Inc. 2014
Apache Hadoop releases (contd): Apache Hadoop 2.3
• 24 February, 2014
• First post-GA release, and the first release of 2014
• Alpha features in YARN
– ResourceManager High Availability
– Application History Server
– Will be covered in detail in the 2.4 section
• A number of bug fixes and enhancements
17. © Hortonworks Inc. 2014
Apache Hadoop releases (contd): Apache Hadoop 2.4
• 7 April, 2014
• Most recent release
• Stabilizing features in YARN (details follow)
– ResourceManager HA
– YARN Timeline Server (beyond history server)
– Preemption in YARN CapacityScheduler
– Container-preserving AM recovery
18. © Hortonworks Inc. 2014
ResourceManager High Availability
• RM – single point of failure
• Goal : Downtime invisible to end-users
– Apps not required to be re-submitted
– NMs to rebind with newly started RM
• Two stories:
– Recovery of state
– Failover
19. © Hortonworks Inc. 2014
ResourceManager High Availability
• Active/Standby
o Leader election (ZooKeeper)
• On transition to Active, the Standby loads all the state from the state store
• NMs, AMs and clients redirect to the new RM
o RMProxy library
Talk: Highly Available Resource Management for YARN
By: Karthik Kambatla, Xuan Gong
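The failover story above can be sketched as a toy model. This is only an illustration under stated assumptions: `ResourceManager`, `FailoverProxy` and the dict-based state store are hypothetical names invented here, not the RMProxy library or any YARN API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResourceManager:
    """Toy RM (hypothetical): serves requests only while it is active."""
    name: str
    active: bool = False
    apps: Dict[str, str] = field(default_factory=dict)

    def transition_to_active(self, state_store: Dict[str, str]) -> None:
        # On becoming active, load all application state from the shared store.
        self.apps = dict(state_store)
        self.active = True

    def get_app_state(self, app_id: str) -> str:
        if not self.active:
            raise ConnectionError(f"{self.name} is in standby")
        return self.apps[app_id]


class FailoverProxy:
    """Hypothetical client-side proxy: try each configured RM until one answers."""

    def __init__(self, rms: List[ResourceManager]) -> None:
        self.rms = rms
        self.current = 0

    def get_app_state(self, app_id: str) -> str:
        for _ in range(len(self.rms)):
            try:
                return self.rms[self.current].get_app_state(app_id)
            except ConnectionError:
                # Move on to the next configured RM and retry.
                self.current = (self.current + 1) % len(self.rms)
        raise ConnectionError("no active ResourceManager found")


def demo() -> List[str]:
    state_store = {"app_1": "RUNNING"}      # stands in for the RM state store
    rm1, rm2 = ResourceManager("rm1"), ResourceManager("rm2")
    rm1.transition_to_active(state_store)
    proxy = FailoverProxy([rm1, rm2])

    seen = [proxy.get_app_state("app_1")]
    rm1.active = False                      # rm1 goes down
    rm2.transition_to_active(state_store)   # the election picks rm2
    seen.append(proxy.get_app_state("app_1"))
    return seen
```

The client never resubmits the application; it only retries against the next RM, which has reloaded the same state.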
20. © Hortonworks Inc. 2014
YARN Timeline Server
• Few MR specific implementations: History and web-UI
• YARN: Not just MR anymore!
• Previous state
– MapReduce specific Job History Server
– YARN level ‘History’ lost beyond ResourceManager Restart
21. © Hortonworks Inc. 2014
YARN Timeline Server (contd)
• Entity and event collection: the RM and applications periodically send events to the Timeline Server
• Pluggable store: chosen depending on site requirements
• REST APIs or RPC: applications and user interfaces can access information via REST/RPC
• Visualizations: users can build tools and visualizations using the APIs
• Apps and system: covers application as well as system entities/events
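As a rough illustration of the entity/event model, the sketch below builds the kind of JSON body an application might send. The field names (`entitytype`, `entity`, `starttime`, `events`, `eventtype`, `timestamp`, `eventinfo`) are an assumption about the REST payload and should be checked against the Timeline Server documentation; the helper functions are hypothetical and nothing is sent over the network.

```python
import json
from typing import Any, Dict, List, Optional


def make_event(event_type: str, timestamp_ms: int,
               info: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """One event: something that happened to an entity at a point in time."""
    return {"eventtype": event_type, "timestamp": timestamp_ms,
            "eventinfo": info or {}}


def make_entity(entity_type: str, entity_id: str, start_ms: int,
                events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """One entity (for example an application or a task) with its events."""
    return {"entitytype": entity_type, "entity": entity_id,
            "starttime": start_ms, "events": events}


def to_request_body(entities: List[Dict[str, Any]]) -> str:
    """Serialize a batch of entities; key layout is assumed, see the lead-in."""
    return json.dumps({"entities": entities}, sort_keys=True)


def events_of_type(entity: Dict[str, Any], event_type: str) -> List[Dict[str, Any]]:
    """What a monitoring client or visualization might do after reading an entity."""
    return [e for e in entity["events"] if e["eventtype"] == event_type]
```

A custom application would define its own entity and event types, which is what lets the server go beyond the MapReduce-specific history server.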
22. © Hortonworks Inc. 2014
YARN Timeline Server (contd)
[Diagram: App1, App2 and the RM send events to the YARN Timeline Server; Ambari and a custom app-monitoring client read from it via the REST API or RPC]
Talk: “Analyzing Historical Data of Applications on Hadoop
YARN: for Fun and Profit”
By: Zhijie Shen, Mayank Bansal
23. © Hortonworks Inc. 2014
Capacity Scheduler Preemption
• Enforce SLAs
• Preempt across queues
STEP 1: Gather queue state
• Current capacity
• Guaranteed capacity
STEP 2: Identify preemptions
• Select applications to preempt from queues that are over capacity
STEP 3: Issue preemptions
• Issue preemption requests for containers to the application
STEP 4: Kill containers
• Track containers for which preemption has been issued but not yet executed
• Forcibly kill these containers after a timeout
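The four steps can be sketched as a small simulation. This is a toy model, not the CapacityScheduler's algorithm: capacity is counted in whole containers, the newest containers are chosen as victims (an arbitrary policy picked for this sketch), and `Queue`, `PreemptionMonitor` and `pending_demand` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Queue:
    name: str
    guaranteed: int                 # guaranteed capacity, counted in containers
    pending_demand: int = 0         # containers requested but not yet allocated
    containers: List[str] = field(default_factory=list)  # running, oldest first

    @property
    def current(self) -> int:
        return len(self.containers)


class PreemptionMonitor:
    def __init__(self, queues: List[Queue], timeout: int) -> None:
        self.queues = queues
        self.timeout = timeout
        self.issued: Dict[str, int] = {}   # container id -> time preemption was issued

    def identify(self) -> List[str]:
        """Steps 1 and 2: gather queue state, pick victims from over-capacity queues."""
        need = sum(min(q.pending_demand, max(0, q.guaranteed - q.current))
                   for q in self.queues)
        need -= len(self.issued)           # preemptions already in flight count too
        victims: List[str] = []
        for q in self.queues:
            in_flight = sum(1 for c in q.containers if c in self.issued)
            excess = q.current - q.guaranteed - in_flight
            for cid in reversed(q.containers):      # newest first (assumed policy)
                if excess <= 0 or need <= 0:
                    break
                if cid in self.issued:
                    continue
                victims.append(cid)
                excess -= 1
                need -= 1
        return victims

    def tick(self, now: int) -> List[str]:
        """Run one pass; return the containers forcibly killed in this pass."""
        for cid in self.identify():        # Step 3: issue preemptions
            self.issued[cid] = now
        killed = [c for c, t in self.issued.items() if now - t >= self.timeout]
        for cid in killed:                 # Step 4: kill after the timeout
            self.release(cid)
        return killed

    def release(self, cid: str) -> None:
        """An application gave the container back, or it was killed."""
        self.issued.pop(cid, None)
        for q in self.queues:
            if cid in q.containers:
                q.containers.remove(cid)
```

An application that calls `release` before the timeout chooses which work to give up; only the containers it holds on to are killed.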
24. © Hortonworks Inc. 2014
Capacity Scheduler Preemption (Contd)
[Diagram: the scheduler sends preemption requests to the application, which releases resources; containers not released are killed forcibly after the timeout]
25. © Hortonworks Inc. 2014
Container-preserving AM restart
• Problem
– Containers are killed when AM goes down.
– New AM needs to know where the previous containers are running
– Previous containers need to know about the new AM. (WIP)
[Diagram: Container1, Container2 and Container3 keep running while AM1 is restarted as AM2]
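A minimal sketch of the difference, assuming a toy RM that tracks running containers. `ToyResourceManager`, `keep_containers` and `register_am` are hypothetical names used only for illustration, not YARN APIs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyResourceManager:
    keep_containers: bool                       # container-preserving restart on/off
    running: Dict[str, str] = field(default_factory=dict)  # container id -> node
    attempt: int = 1

    def am_failed(self) -> None:
        if not self.keep_containers:
            # Earlier behaviour: containers are killed when the AM goes down.
            self.running.clear()
        self.attempt += 1

    def register_am(self) -> Dict[str, str]:
        """The new attempt registers and learns where earlier containers still run."""
        return dict(self.running)


def restart(keep: bool) -> Dict[str, str]:
    rm = ToyResourceManager(keep_containers=keep)
    rm.running = {"Container1": "node-a", "Container2": "node-b", "Container3": "node-c"}
    rm.am_failed()          # AM1 goes down
    return rm.register_am() # AM2 comes up
```

With `keep=True`, AM2 gets back all three containers and can carry on; with `keep=False`, it starts from nothing. The reverse direction, containers finding the new AM, is the part the slide marks as work in progress.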
26. © Hortonworks Inc. 2014
Apache Hadoop releases (contd): Apache Hadoop 2.5.x
• Next releases
– 2.4.1
– 2.5.x
• YARN
– Details follow in the Future section
– ResourceManager work-preserving restart for High Availability
– YARN Timeline Server security & enhancements
– Lots more
28. © Hortonworks Inc. 2014
Future: Operational enhancements
• Rolling upgrades
– No/minimal impact to users
– Ideal: Always rolling!
• HDFS: rolling-upgrade effort is already in
• YARN
– RM restart
– NM restart
– Upgrades
Talk: “Hadoop Rolling Upgrades – Taking Availability to the Next Level”
By: Suresh Srinivas, Hortonworks & Jason Lowe, Yahoo!
29. © Hortonworks Inc. 2014
Future: Enabling apps
• Beyond MapReduce
– Apache Tez, Apache Slider, Apache Storm.
• Discussing next
– Long running services
– Multi-dimensional resource scheduling
– Isolation
– Web services
30. © Hortonworks Inc. 2014
Future: Long running services
• You can run them already!
• Few enhancements needed
– Logs
– Security
– Management/monitoring
• Resource sharing across workload types
Talk: “Bring your Service to YARN”
By: Sumit Mohanty
31. © Hortonworks Inc. 2014
Multi-resource scheduling
• Today – memory & cpu
– Physical memory / virtual memory
– CPU Cores – Virtual cores
• CPU scheduling: needs more bake-in time
• Disks
– Space
– IOPS
• Network
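To make the idea concrete, here is a sketch of a resource vector in which a request fits only if every dimension fits. Memory and virtual cores are what the slide says exist today; `disk_iops` and `network_mbps` are hypothetical stand-ins for the future disk and network dimensions, and the `Resource` class is illustrative rather than YARN's own.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int
    disk_iops: int = 0       # hypothetical future dimension
    network_mbps: int = 0    # hypothetical future dimension

    def fits_in(self, available: "Resource") -> bool:
        """A request fits only if it fits in every dimension."""
        return all(getattr(self, f.name) <= getattr(available, f.name)
                   for f in fields(self))

    def minus(self, other: "Resource") -> "Resource":
        """Capacity left after allocating `other`."""
        return Resource(**{f.name: getattr(self, f.name) - getattr(other, f.name)
                           for f in fields(self)})


node = Resource(memory_mb=8192, vcores=8, disk_iops=1000, network_mbps=1000)
small = Resource(memory_mb=2048, vcores=1, disk_iops=500)
io_heavy = Resource(memory_mb=2048, vcores=1, disk_iops=1500)
```

Here `small` fits on `node`, while `io_heavy` does not, even though its memory and CPU would fit: that is the case a scheduler looking only at memory and CPU cannot see.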
32. © Hortonworks Inc. 2014
Fine-grain isolation for multi-tenancy
• Custom memory-monitoring
• Cgroups
• Linux Containers
• VMs
33. © Hortonworks Inc. 2014
Other features
• Application SLAs
– Run my application at 6:00 AM tomorrow and guarantee capacity for me!
• Node labels
– Some of the nodes in my cluster have specialized hardware, give them to me!
• Node affinity/anti-affinity
– Get me on to the nodes where my data is
– Get me off of this node
• Better online queue-management
– Centralized
– Quality feedback
• Web-services
– RESTful APIs for submitting, monitoring and killing apps
– Beyond java-only clients
34. © Hortonworks Inc. 2014
YARN Ecosystem
Beyond the core YARN project: Briefly
35. © Hortonworks Inc. 2014
Eco-system
Applications Powered by YARN
• Classic Apache Hadoop MapReduce – Batch
• Batch & Interactive: Apache Tez
• Stream Processing: Apache Storm, Apache Samza
• Apache Spark – Iterative applications
• YARN Frameworks: Apache Twill, Microsoft REEF
• Existing apps: Apache Slider
• Graph Processing: Apache Giraph
There's an app for that... YARN App Marketplace!
Talk: “Apache Tez - A New Chapter in Hadoop Data Processing”
By: Bikas Saha, Hitesh Shah
37. © Hortonworks Inc. 2014
Recap
• YARN helps Apache Hadoop 2 to be twice as good!
• Exciting journey with Hadoop for this decade…
– Hadoop is no longer a one-trick pony, err elephant
– Beyond just MapReduce
• Hadoop 2: Architecture for the future
– Centralized data, multiple apps
• Lots of exciting new features
– Exciting spectrum of application types, workloads and use-cases
38. © Hortonworks Inc. 2014
Couple more things..
41. © Hortonworks Inc. 2014
Thank you!
Download Sandbox: Experience Apache Hadoop
Both 2.x and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/
Questions Time!
Editor's notes
Graph processing – Giraph, Hama
Stream processing – Samza, Storm, Spark, DataTorrent
MapReduce
Tez – fast query execution
Weave/REEF – frameworks to help with writing applications
List of some of the applications which already support YARN, in some form.
Samza, Storm, S4 and DataTorrent are streaming frameworks
Giraph and Hama are graph processing systems
There are some GitHub projects – caching systems, on-demand web-server spin-up
Weave and REEF are frameworks on top of YARN to make writing applications easier