2. Cloud: Big Shifts in Simplification and Optimization
1. Reduce the Complexity: simplify operations and maintenance
2. Dramatically Lower Costs: redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery: meet and anticipate the needs of the business
3. Infrastructure, Apps and now Data…
[Diagram: build, run, and manage across private and public clouds]
Simplify Infrastructure with Cloud → Simplify App Platform through PaaS → Next Trend: Simplify Data
4. Trend 1/3: New Data Growing at 60% Y/Y
Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030
Yes, you are part of the yotta generation…
[Chart: successive waves of new data sources: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies]
Source: The Information Explosion, 2009
6. Trend 3/3: Value from Data Exceeds Hardware Cost
! Value from the intelligence of data analytics now outstrips the cost of hardware
• Hadoop enables the use of lower-cost hardware
• Hardware cost halves every 18 months
[Chart: value vs. cost, contrasting big iron at ~$40k/CPU with a commodity cluster at ~$1k/CPU]
7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
! Trend is “not just Hadoop” for big data
• Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
• Unify the infrastructure platform for all of them
! Common Hardware Base
• Eliminate the hardware/driver/testing phase
• Use the existing team for ordering, diagnosis, and capacity management of the hardware farm
[Diagram: separate Big SQL, NoSQL, Hadoop, and DSS clusters consolidated onto a unified big-data infrastructure spanning private and public clouds]
8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
I WANT MY HADOOP CLUSTER NOW!
! Instant Cluster Provisioning
• Provision Hadoop clusters instantly
• Automatable using provisioning engines/scripts, e.g. Whirr
9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
! Increase Utilization
• Hadoop cluster only uses resources it needs
• Extra resources can be used by other applications when not in use
! Eliminate single points of failure
• Use vSphere HA for Namenode and Jobtracker
! Use VM Isolation
• Create separate clusters with defensible security
• Enables multiple versions of Hadoop on the same infrastructure
• Extends to Hadoop and Linux Environments
! Leverage Resource Management
• Control/assign resources through resource pools
• E.g. Use spare cycles for Hadoop Processing through priority control
10. What? Hadoop in a VM? Really?
Actually, Hadoop performs well in a virtual machine
13. Hadoop Configuration
! Distribution
• Cloudera CDH3u0
• Based on Apache open-source 0.20.2
! Parameters
• dfs.datanode.max.xcievers=4096
• dfs.replication=2
• dfs.block.size=134217728
• io.file.buffer.size=131072
• mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
• mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
! Network topology
• Hadoop uses info for reliability and performance
• Multiple VMs/host: Each host is a “rack”
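In CDH3 / Hadoop 0.20.x these parameters live in the site XML files; as a sketch, the dfs.* settings above go in hdfs-site.xml (values copied from this slide, surrounding file contents omitted):

```xml
<!-- hdfs-site.xml fragment (CDH3 / Hadoop 0.20.x) -->
<configuration>
  <property>
    <!-- "xcievers" is the property's real (misspelled) name in 0.20.x -->
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB -->
  </property>
</configuration>
```

The remaining settings go in their own files: io.file.buffer.size in core-site.xml and mapred.child.java.opts in mapred-site.xml.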
14. Benchmarks
! Derived from test apps included in distro
! Pi
• Direct-exec Monte Carlo estimation of pi: π ≈ 4*R/(R+G) (≈ 22/7)
• # map tasks = # logical processors
• 1.68 T samples
! TestDFSIO
• Streaming write and read
• 1 TB
• More tasks than processors
! Terasort
• 3 phases: teragen, terasort, teravalidate
• 10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
• More tasks than processors
• Exercises CPU, networking, and storage I/O
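The Pi benchmark's estimator can be sketched outside Hadoop; a minimal single-process version (the real benchmark distributes the sampling across map tasks):

```python
import random

def estimate_pi(samples, seed=1):
    """Monte Carlo estimate: pi ~ 4*R/(R+G), where R counts samples
    landing inside the quarter circle and G those outside."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```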
15. Performance of Hadoop for Several Workloads
[Chart: ratio of elapsed time to native (lower is better) for several workloads, comparing 1 VM and 2 VMs per host; y-axis spans 0 to 1.2]
16. Architecting Hadoop as a Service using Virtualization
! Goals
• Make it fast and easy to provision new Hadoop Clusters on Demand
• Leverage virtual machines to provide isolation (esp. for Multi-tenant)
• Optimize Hadoop’s performance based on virtual topologies
• Make the system reliable based on virtual topologies
! Leveraging Virtualization
• Elastic scale in/out
• Use high availability to protect the NameNode/JobTracker
• Resource controls and sharing: re-use underutilized memory and CPU
• Prioritize workloads: limit or guarantee resource usage in a mixed environment
17. Provisioning
! Leverage the vSphere APIs to auto-deploy a cluster
• Whirr, HOD, or custom tooling using Ruby, Chef, etc.
! Use linked clones to rapidly fork many nodes
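As a sketch of scripted provisioning, a Whirr-style properties file can be generated per cluster request; the template names follow Whirr's Hadoop recipe, but treat the exact values as illustrative:

```python
def whirr_hadoop_properties(cluster_name, num_workers):
    # Renders a minimal Apache Whirr properties file for a Hadoop cluster:
    # one master (namenode + jobtracker) plus N workers (datanode + tasktracker).
    # Provider/credential settings are deployment-specific and omitted here.
    lines = [
        f"whirr.cluster-name={cluster_name}",
        "whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,"
        f"{num_workers} hadoop-datanode+hadoop-tasktracker",
    ]
    return "\n".join(lines)

print(whirr_hadoop_properties("demo-cluster", 4))
```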
18. Fast Provisioning
! From a “seed” node to a cluster
• Thin provisioning: 60 GB => 3.5 GB
• Linked clone: ~6 seconds
19. SAN, NAS or Local Disk?
! Shared Storage: SAN or NAS
• Easy to provision
• Automated cluster rebalancing
! Hybrid Storage
• SAN for boot images, VMs, and other workloads
• Local disk for HDFS
• Scalable bandwidth, lower cost/GB
[Diagram: Hadoop VMs and other VMs spread across six hosts, combining shared and local storage]
20. Enable Automatic Rack Awareness through vSphere
! Important for a robust Hadoop cluster
! Automatic network-topology detection is an important vSphere feature
! The rack script is generated automatically
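Hadoop resolves nodes to racks by invoking a topology script (configured via topology.script.file.name) that receives node addresses as arguments and prints one rack path per line. A hand-written sketch of the kind of script vSphere can generate, with an illustrative static host map:

```python
#!/usr/bin/env python
import sys

# Illustrative map of VM address -> "rack"; in the virtualized setup above,
# each physical host is treated as a rack, so co-resident VMs share a rack id.
NODE_TO_RACK = {
    "10.0.0.11": "/host1",
    "10.0.0.12": "/host1",
    "10.0.0.21": "/host2",
}

def resolve(node, default="/default-rack"):
    # Unknown nodes fall back to Hadoop's conventional default rack.
    return NODE_TO_RACK.get(node, default)

if __name__ == "__main__":
    # Hadoop passes one or more node addresses; emit one rack per line.
    print("\n".join(resolve(n) for n in sys.argv[1:]))
```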
21. Multi-tenant: share cluster or not
! Shared big cluster vs. isolated small clusters
• Shared big cluster: high performance, large scale, pre-job provisioning
• Isolated small clusters: secure, flexible, post-job provisioning
• In practice, a combination, since customers’ requirements differ
22. Elastic Hadoop Cluster
! Traditional Hadoop cluster
• Easy to scale out
• Fast-provision new Hadoop nodes and join them into the existing cluster
• Hard to scale in
While (ClusterIsTooLarge) {
  choose node k;
  kill(node k);
  wait until k’s data blocks are recovered;
  if necessary, hadoop.rebalance();
}
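The scale-in loop above can be simulated in a few lines; the decommission callback here stands in for "kill node k and wait for its blocks to be recovered", which a real cluster would do via HDFS decommissioning:

```python
def scale_in(nodes, target_size, decommission):
    # Schematic version of the loop above: remove nodes one at a time,
    # letting the callback finish re-replication before the next pick.
    nodes = list(nodes)
    while len(nodes) > target_size:
        victim = nodes.pop()      # choose node k (here: last node)
        decommission(victim)      # kill + wait for block recovery
    return nodes

removed = []
remaining = scale_in(["n1", "n2", "n3", "n4"], 2, removed.append)
print(remaining, removed)
```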
! Elastic Hadoop cluster
[Diagram: the NameNode and JobTracker run on normal nodes; elastic nodes run only a TaskTracker, while normal nodes run both a TaskTracker and a DataNode]
23. Replica Placement
! Second Replica
• Different rack
• Rack awareness required
! Third Replica
• Same rack, different physical host
• Nodes can share a host in a virtualized environment
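The placement rules above can be expressed as a small selection function; the node -> (rack, physical host) map is illustrative, not a real Hadoop API:

```python
def place_replicas(writer, topology):
    # First replica on the writing node; second on a different rack;
    # third on the second replica's rack but a different physical host
    # (so two VMs sharing a host never hold both of those copies).
    w_rack, _ = topology[writer]
    second = next(n for n, (rack, _) in topology.items() if rack != w_rack)
    s_rack, s_host = topology[second]
    third = next(n for n, (rack, host) in topology.items()
                 if rack == s_rack and host != s_host)
    return [writer, second, third]

topo = {
    "vm1": ("rack1", "hostA"),
    "vm2": ("rack2", "hostB"),
    "vm3": ("rack2", "hostB"),   # shares hostB with vm2: skipped for the third copy
    "vm4": ("rack2", "hostC"),
}
print(place_replicas("vm1", topo))
```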
25. Performance
! Create more, smaller VMs
• Makes Hadoop scale better
• Allows easier/faster adjustment of VM packing across hosts by vSphere (including through DRS)
! Sizing/configuration of storage is critical
• Plan on ~50 MB/s of bandwidth per core
• SANs are typically configured by default for IOPS, not bandwidth
• Ensure SAN ports/switch topology allow the required aggregate bandwidth
• Performance of the backend storage should be tested/sized
• Local disks give ~100-140 MB/s per disk: pick the correct controller
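The rules of thumb above reduce to simple sizing arithmetic; 120 MB/s per disk is assumed here as a midpoint of the quoted 100-140 MB/s range:

```python
def storage_bandwidth_plan(cores, per_core_mb_s=50, per_disk_mb_s=120):
    # Aggregate bandwidth to plan for, and the local-disk count that covers it.
    needed_mb_s = cores * per_core_mb_s
    disks = -(-needed_mb_s // per_disk_mb_s)  # ceiling division
    return needed_mb_s, disks

print(storage_bandwidth_plan(16))  # a 16-core host: ~800 MB/s, ~7 disks
```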
26. Summary
! Hadoop does work well in a virtual environment
! Plan a virtual cluster and enable other big-data solutions on the same infrastructure
! Leverage the recipes to automate your configuration and deployment