3. Say What?
• VMs will just add overhead, due to I/O virt
• VMs run on SAN, we’re all about local disks
• Hadoop does its own cluster management
• It’ll do resource management in 2.0
• And even HA is coming to Hadoop
• And… what is the point, anyway?
4. But you’ve been asking…
• Can I virtualize my Hadoop to make it easier
and quicker to get a cluster up and
running?
• Is it possible to run Hadoop on those spare
machine cycles I have on hundreds/thousands
of nodes?
• Can I make my system more available by using
some of the standard HA features?
5. And the savvy are asking…
• Can I avoid having to install special hardware
for the master services, like name-node and
job-tracker?
• Can I dynamically change the size of the
cluster to use more resources?
• Can I use VM isolation to increase security or
guard against resource-intensive neighbors?
• Is it feasible to provision virtual-clusters, giving
out one each to a business unit?
6. Ok, so first what about the concerns?
• Use your SAN? … if you want to.
              SAN Storage         NAS Filers          Local Storage
Cost          $2 - $10/Gigabyte   $1 - $5/Gigabyte    $0.05/Gigabyte
$1M gets:
  Capacity    0.5 Petabytes       1 Petabyte          20 Petabytes
  IOPS        1,000,000           400,000             10,000,000
  Throughput  1 Gbyte/sec         2 Gbytes/sec        800 Gbytes/sec
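The capacity row of the table follows from dividing the budget by the price per gigabyte. A minimal sketch of that arithmetic, assuming the low end of each quoted price range and decimal petabytes (1 PB = 1,000,000 GB); the function and variable names are illustrative only:

```python
# Price per gigabyte in USD, taken from the table above.
# Assumption: the low end of each quoted range is used.
PRICE_PER_GB = {
    "SAN Storage": 2.00,
    "NAS Filers": 1.00,
    "Local Storage": 0.05,
}

GB_PER_PB = 1_000_000  # assumption: decimal petabytes


def petabytes_for_budget(budget_usd, price_per_gb):
    """Return how many petabytes a budget buys at a given price per gigabyte."""
    return budget_usd / price_per_gb / GB_PER_PB


# Capacity that $1M buys for each storage type.
capacities = {
    name: petabytes_for_budget(1_000_000, price)
    for name, price in PRICE_PER_GB.items()
}

for name, pb in capacities.items():
    print(f"{name}: {pb:g} PB")
```

Under these assumptions the result matches the capacities on the slide: local disks buy roughly 40 times the capacity of a SAN for the same money.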
7. Hadoop Using Local Disks
[Diagram: on a virtualization host, a Hadoop virtual machine (Task Tracker, Datanode) runs beside another workload. Labels: OS Image - VMDK, Shared Storage, Ext4 file systems on VMDKs.]
8. Hadoop Perf in a VM
(Ratio is elapsed time to physical, Lower Is Better)
[Chart: y-axis "Ratio to Native", 0 to 1.2; series: 1 VM, 2 VMs]
9. Evolution of Hadoop on VMs
[Diagram: three stages, from a combined storage/compute VM, to compute VMs over a separate storage VM, to per-tenant compute VMs (T1, T2) over shared storage VMs.]
Current Hadoop: Combined Storage/Compute
Hadoop in VM
- VM lifecycle determined by Datanode
- NOT Elastic
- Limited to Hadoop Multi-Tenancy
Separate Storage
- Separate compute from data
- Elastic compute
- Enable shared workloads
- Raise utilization
Separate Compute Clusters
- Separate virtual clusters per tenant
- Stronger VM-grade security and resource isolation
- Enable deployment of multiple Hadoop runtime versions
10. 1. Hadoop Task Tracker and Data Node in a VM
[Diagram: on a virtualization host, a virtual Hadoop node (Task Tracker with slots, Datanode on a VMDK) runs beside another workload. Callouts: "Add/Remove Slots?" and "Grow/Shrink by tens of GB?"]
Grow/Shrink of a VM is one approach
12. But State makes it hard to power-off a node
[Diagram: on a virtualization host, a virtual Hadoop node (Task Tracker with slots, Datanode on a VMDK) runs beside another workload.]
Powering off the Hadoop VM would in effect fail the datanode
13. Adding a node needs data…
[Diagram: on a virtualization host, two virtual Hadoop nodes (each with a Task Tracker with slots and a Datanode on its own VMDK) run beside another workload.]
Adding a node would require TBs of data replication
20. Demo: Shrink/Expand Cluster
Set up 1 Datanode, 2 NodeManagers, and 2 web servers on
each physical host
Web Server Web Server Web Server Web Server
Web Server Web Server Web Server Web Server
NodeManager NodeManager NodeManager NodeManager
NodeManager NodeManager NodeManager NodeManager
Datanode Datanode Datanode Datanode
21. Demo: Shrink/Expand Cluster
When web load is high in the daytime, we can suspend some NodeManagers and
power on more web servers.
Web Server Web Server Web Server Web Server
Web Server Web Server Web Server Web Server
NodeManager NodeManager NodeManager NodeManager
NodeManager NodeManager NodeManager NodeManager
Datanode Datanode Datanode Datanode
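The shrink/expand behavior in the demo can be sketched as a per-host policy. This is only an illustration under stated assumptions, not the demo's actual tooling: `plan_host`, the `web_load` fraction, and the `running_budget` cap on concurrently running compute VMs are all hypothetical. The Datanode always stays on, because powering it off would in effect fail the datanode.

```python
import math


def plan_host(web_load, web_vms=2, nodemanager_vms=2, running_budget=2):
    """Decide which VMs to run on one physical host.

    web_load is a hypothetical fraction (0.0 to 1.0) of peak web traffic.
    running_budget is an assumed cap on how many web-server plus
    NodeManager VMs the host runs at the same time.
    """
    # Power on enough web servers to cover the current load.
    web_on = max(0, min(web_vms, math.ceil(web_load * web_vms)))
    # Give whatever budget is left to NodeManagers; the rest stay suspended.
    nm_on = max(0, min(nodemanager_vms, running_budget - web_on))
    return {
        "datanode": 1,  # never suspended: it holds HDFS state
        "web_servers": web_on,
        "nodemanagers": nm_on,
    }


daytime = plan_host(web_load=1.0)  # high web load
night = plan_host(web_load=0.0)    # idle web tier
print(daytime)
print(night)
```

Because only the stateless NodeManager and web-server VMs are suspended or resumed, the cluster can change compute capacity without triggering any data replication.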
24. Expand Hadoop Ecosystem
• Hortonworks goal
– Expand Hadoop ecosystem
– Provide first class support of various platforms
• Hadoop should run well on VMs
• VMs offer several advantages as presented earlier
• Take advantage of vSphere for HA
25. VMware-Hortonworks Joint Engineering
• First class support for VMs
– Topology plugins (HADOOP-8468)
• 2 VMs can be on same host
– Pick closer data
– Schedule tasks closer
– Don’t put two replicas on same host
– MR-tmp on HDFS using block pools
• Elastic Compute-VMs will not need local disk
– Fast communications within VMs
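The "don't put two replicas on same host" rule can be illustrated with a small sketch. This is not the HADOOP-8468 implementation or any Hadoop API; the function name and the VM-to-host mapping are hypothetical, and the sketch only shows why placement needs to know which physical host each VM runs on.

```python
def choose_replica_vms(vm_to_host, candidates, replicas=3):
    """Pick VMs for block replicas so that no two share a physical host.

    vm_to_host maps a VM name to its physical host (hypothetical input).
    candidates is an ordered list of VMs, most preferred first.
    """
    chosen = []
    used_hosts = set()
    for vm in candidates:
        host = vm_to_host[vm]
        if host in used_hosts:
            # Another replica already lives on this physical host, so a
            # single host failure would take out both copies. Skip it.
            continue
        chosen.append(vm)
        used_hosts.add(host)
        if len(chosen) == replicas:
            break
    return chosen


# Two VMs (vm1, vm2) share hostA, so only one of them may hold a replica.
vm_to_host = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB", "vm4": "hostC"}
placement = choose_replica_vms(vm_to_host, ["vm1", "vm2", "vm3", "vm4"])
print(placement)
```

Without the host layer in the topology, vm1 and vm2 would look like independent nodes and could both receive a replica of the same block.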
26. Hadoop Total System Availability Architecture
[Diagram: slave nodes of the Hadoop cluster run jobs, with apps running outside the cluster. The master daemons (NN, JT) run on servers in an HA cluster with N+K failover. Labels: "Failover", "JT into Safemode".]
32. Summary
• Advantages of Hadoop on VMs
– Cluster Management
– Cluster consolidation
– Greater Elasticity in mixed environment
– Alternative multi-tenancy to the capacity scheduler’s
offerings
• HA for Hadoop Master Daemons
– vSphere-based HA for NN, JT, … in Hadoop 1
– Total System Availability Architecture
Editor's notes
Hybrid Storage
Local Disks, retains fault domains of individual disks
Data – can I read what I wrote, is the service available
When I asked one of the original authors of GFS if there were any decisions they would revisit – random writers
Simplicity is key
Raw disk – fs take time to stabilize – we can take advantage of ext4, xfs or zfs