2. 2 Confidential
Agenda
VMware Data Portfolio
Big Data and Virtualization Trends
Enterprise Hadoop Needs
Virtualized Hadoop for the Enterprise
Summary
3. 3 Confidential
Trends Driving Change in Enterprise IT
Cloud
• Offered “as-a-Service”
• Virtualization
New Application Types
• Mobile, SaaS, social
• Apps released early and often
Frameworks
• New application frameworks driving
• Increase in application development
Data Disruption
• Web orientation drives exponential data volumes
• Reduced latency and new types of data
4. 4 Confidential
The Database is Being Stretched
Big Data
Cloud Delivery
Flexible Data
Virtualized
Offered “as-a-Service”
Petabytes vs. Gigabytes
Democratize BI
Multi-structured data
Developer productivity
Fast Data
Global access patterns
Mobile app proliferation
5. 5 Confidential
Big, Fast and Flexible Data
Flexible
Big
Big Data
Processing
Big Data
Analytics
Serengeti
Fast
OLTP
workloads
Analytic
workloads
Cloud Delivery Model
Data as a service for private and public clouds
OSS Relational
Document
Object
Key / Value
GemFire
vPostgres
GemFire
GemFire
7. 7 Confidential
Data is exploding & Hadoop is driving growth
Unstructured data driving growth
Chart: structured vs. unstructured data volume, 2011 to 2020
Complex unstructured data forecasted to outpace structured relational data by 10x by 2020
Hadoop adoption is ramping
• Evaluating: 53%
• In production: 23%
• Piloting: 18%
• Testing: 2%
• Don't know: 2%
• Other: 2%
Source: Forrester Survey of 60 CIOs, September 2011
• Unstructured data explosion and Hadoop capabilities causing CIOs to
reconsider Enterprise data strategy
• Gartner predicts +800% data growth over next 5 years
• Hadoop’s ability to process raw data at low cost presents an intriguing value proposition for CIOs
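As a rough check on the Gartner figure above: +800% growth over five years means the data volume ends at nine times its starting size. The sketch below converts that into an implied compound annual growth rate. The function name is illustrative, and the assumption of smooth year-over-year compounding is ours, not Gartner's.

```python
def implied_annual_growth(total_growth_pct: float, years: int) -> float:
    """Return the compound annual growth rate (as a fraction) implied by
    a total percentage increase spread evenly over a number of years."""
    end_multiple = 1.0 + total_growth_pct / 100.0  # +800% means 9x the start
    return end_multiple ** (1.0 / years) - 1.0


rate = implied_annual_growth(800, 5)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption, the headline number works out to roughly 55% growth per year.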
8. 8 Confidential
Broad Application of Hadoop technology
Horizontal Use Cases
• Log Processing / Click Stream Analytics
• Machine Learning / sophisticated data mining
• Web crawling / text processing
• Extract Transform Load (ETL) replacement
• Image / XML message processing
• General archiving / compliance
Vertical Use Cases
• Financial Services
• Mobile / Telecom
• Internet Retailer
• Scientific Research
• Pharmaceutical / Drug Discovery
• Social Media
Hadoop’s ability to handle large unstructured data affordably and efficiently makes
it a valuable tool kit for enterprises across a number of applications and fields.
9. 9 Confidential
The Future of Virtualization
VDC
Software-defined Datacenter Services
                                   2008     2012          FUTURE
Time to Provision New Services     Weeks    Days/Hours    Minutes/Seconds
Workloads Virtualized              25%      60%+          >90%
10. 10 Confidential
Virtualization enables a Common Infrastructure for Big Data
Single purpose clusters for various
business applications lead to cluster
sprawl.
Virtualization Platform
Simplify
• Single Hardware Infrastructure
• Unified operations
Optimize
• Shared Resources = higher utilization
• Elastic resources = faster on-demand access
Diagram: Cluster Sprawling (separate MPP DB, Hadoop, and HBase clusters) vs. Cluster Consolidation (MPP DB, Hadoop, and HBase on one Virtualization Platform)
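The "Shared Resources = higher utilization" point can be illustrated with a small calculation. This is a sketch with made-up demand profiles, not measured data: each list gives the number of hosts' worth of work a workload needs in four consecutive time slots, and the sizing rules (size each silo for its own peak, size the shared pool for the combined peak) are simplifying assumptions.

```python
# Hypothetical demand profiles: hosts' worth of work needed per time slot.
demand = {
    "MPP DB": [8, 2, 2, 2],
    "Hadoop": [2, 8, 2, 2],
    "HBase":  [2, 2, 8, 2],
}


def average(values):
    return sum(values) / len(values)


def siloed_hosts(profiles):
    """Each cluster gets its own hardware, sized for its own peak."""
    return sum(max(profile) for profile in profiles.values())


def pooled_hosts(profiles):
    """One shared pool, sized for the peak of the combined demand."""
    combined = [sum(slot) for slot in zip(*profiles.values())]
    return max(combined)


def utilization(profiles, hosts):
    """Average demand across all workloads divided by provisioned hosts."""
    return sum(average(profile) for profile in profiles.values()) / hosts


silo = siloed_hosts(demand)
pool = pooled_hosts(demand)
print(f"siloed: {silo} hosts, utilization {utilization(demand, silo):.0%}")
print(f"pooled: {pool} hosts, utilization {utilization(demand, pool):.0%}")
```

Because the three workloads peak at different times in this example, the shared pool needs half the hosts for the same work. Workloads that peak together would see a smaller benefit.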
12. 12 Confidential
Hadoop Users
Data scientists, analysts, developers
• Line of business users
• Intimate with data and analysis, not IT
• Tasked with providing actionable intelligence that impacts the business
Concerns
• Obtain a Hadoop cluster on demand
• Minimize time to insight
• Require reasonable performance from Hadoop cluster
13. 13 Confidential
The IT Guy
Admins, architects, CIO
• Responsible for technology infrastructure, compliance, budget management
• Evaluates new technologies and recommends best practices
Concerns
• Keeping up with demands of the business
• Cost savings and consolidation
• Reliability
• Complexity of running and tuning Hadoop clusters
• Shortage of skills to do the above
16. 16 Confidential
Why Virtualize Hadoop?
Elasticity & Multi-tenancy
• Shrink and expand cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy
High Availability
• High availability for entire Hadoop stack
• One click to set up
• Battle-tested
Operational Simplicity
• Rapid deployment
• One-stop command center
• Easy to configure/reconfigure
17. 17 Confidential
Project Serengeti
Open source project launched in June, 2012
Toolkit that leverages virtualization to simplify Hadoop deployment
and operations
To learn more, visit projectserengeti.org
Deploy a Hadoop cluster in 10 Minutes
Customize Hadoop cluster
Use Your Favorite Hadoop Distribution
One stop command center
Serengeti
18. 18 Confidential
Rapid Deployment of a Hadoop Cluster with Serengeti
Step 1: Deploy the Serengeti virtual appliance on vSphere.
Step 2: Run a few simple commands to stand up a Hadoop cluster.
Done
21. 21 Confidential
A Walk Through Serengeti
Scaling out a cluster
Advanced cluster creation
22. 22 Confidential
Customizing Your Hadoop Cluster
Choice of distros
Storage configuration
• Choice of shared storage or local disk
Resource configuration
High availability option
# of nodes
Also used to tune Hadoop config
…
"distro": "apache",
"groups": [
  { "name": "master",
    "roles": [
      "hadoop_namenode",
      "hadoop_jobtracker"],
    "storage": {
      "type": "SHARED",
      "sizeGB": 20},
    "instanceType": "MEDIUM",
    "instanceNum": 1,
    "haFlag": "on"},
  { "name": "worker",
    "roles": [
      "hadoop_datanode",
      "hadoop_tasktracker"
    ],
    "instanceType": "SMALL",
    "instanceNum": 5,
    "haFlag": "off"
…
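To show how a spec like the one above could be generated rather than hand-edited, here is a minimal Python sketch. It uses only the field names and values visible on the slide; the slide elides part of the file, so this is not a complete Serengeti spec. The helper functions (`node_group`, `cluster_spec`, `total_nodes`) and their sanity checks are hypothetical and not part of Serengeti.

```python
import json


def node_group(name, roles, instance_type, instance_num, ha_flag, storage=None):
    """Build one entry of the "groups" list, using field names from the slide."""
    if instance_num < 1:
        raise ValueError(f"group {name!r} needs at least one node")
    if not roles:
        raise ValueError(f"group {name!r} needs at least one role")
    group = {"name": name, "roles": list(roles)}
    if storage is not None:
        group["storage"] = storage
    group.update({
        "instanceType": instance_type,
        "instanceNum": instance_num,
        "haFlag": ha_flag,
    })
    return group


def cluster_spec(worker_count):
    """Mirror the slide: one HA-protected master group and N workers without HA."""
    return {
        "distro": "apache",
        "groups": [
            node_group("master", ["hadoop_namenode", "hadoop_jobtracker"],
                       "MEDIUM", 1, "on",
                       storage={"type": "SHARED", "sizeGB": 20}),
            node_group("worker", ["hadoop_datanode", "hadoop_tasktracker"],
                       "SMALL", worker_count, "off"),
        ],
    }


def total_nodes(spec):
    """Count the VMs the spec would ask for (illustrative helper)."""
    return sum(group["instanceNum"] for group in spec["groups"])


spec = cluster_spec(worker_count=5)
print(json.dumps(spec, indent=2))
print("total nodes:", total_nodes(spec))
```

Generating the spec through `json.dumps` also guarantees valid quoting, which is easy to get wrong when the file is edited by hand.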
23. 23 Confidential
Freedom of Choice and Open Source
Community
Projects
Distributions
• Flexibility to choose from major distributions
• Support for multiple projects (work in progress)
• Open architecture to welcome industry participation
• Contributing Hadoop Virtualization Extensions (HVE) to open
source community
24. 24 Confidential
Use Local Disk where it’s Needed
Storage type     Cost               $1M gets
SAN Storage      $2 - $10/Gigabyte  0.5 Petabytes, 200,000 IOPS, 8 Gbytes/sec
NAS Filers       $1 - $5/Gigabyte   1 Petabyte, 200,000 IOPS, 10 Gbytes/sec
Local Storage    $0.05/Gigabyte     10 Petabytes, 400,000 IOPS, 250 Gbytes/sec
25. 25 Confidential
Virtual Storage Architecture Includes Local Disk
Shared Storage: SAN or NAS
• Easy to provision
• Automated cluster rebalancing
• Leverage high availability protection
Local Storage: Local Disks
• Local disk for Hadoop
• Scalable bandwidth, lower cost/GB
Diagram: hosts running a mix of Hadoop VMs and other VMs, backed by shared storage and by local storage
26. 26 Confidential
Hadoop Runs Well on Virtualization
Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate, comparing Native, 1 VM, 2 VMs, and 4 VMs
Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf
28. 28 Confidential
High Availability for the Hadoop Stack
HDFS
(Hadoop Distributed File System)
HBase (Key-Value store)
MapReduce (Job Scheduling/Execution System)
Pig (Data Flow) Hive (SQL)
BI Reporting
ETL Tools
Management
Server
ZooKeeper
(Coordination)
HCatalog
RDBMS
Namenode
Jobtracker
Hive
MetaDB
HCatalog MDB
Server
HA for the Hadoop stack is more than NameNode HA
29. 29 Confidential
vMotion Reduces Planned Downtime
Description:
Enables the live migration of virtual
machines from one host to another
with continuous service availability.
Benefits:
• Revolutionary technology that is the
basis for automated virtual machine
movement
• Meets service level and performance
goals
30. 30 Confidential
Hadoop Aware HA - Protection Against Unplanned Downtime
• Protection against host and VM failures
• Added application-aware HA for Hadoop NameNode (NN) and JobTracker (JT),
protecting against NN and JT failures
• Automatic failure detection and virtual machine restart within minutes, on any
available host in the cluster
• In-progress Hadoop jobs pause and resume once the NameNode is back up
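The detect-and-restart behaviour described in the bullets above can be illustrated with a toy simulation. This is not VMware's HA implementation: the class names, the heartbeat threshold, and the host-selection rule are all assumptions made for the illustration.

```python
from dataclasses import dataclass, field

MISSED_HEARTBEATS_BEFORE_RESTART = 3  # assumed threshold, not a vSphere setting


@dataclass
class VM:
    name: str
    host: str
    missed: int = 0  # consecutive missed heartbeats


@dataclass
class Monitor:
    healthy_hosts: set
    restarts: list = field(default_factory=list)

    def observe(self, vm: VM, heartbeat_ok: bool) -> None:
        """Record one heartbeat interval; restart the VM if it looks dead."""
        if heartbeat_ok:
            vm.missed = 0
            return
        vm.missed += 1
        if vm.missed >= MISSED_HEARTBEATS_BEFORE_RESTART:
            self.restart(vm)

    def restart(self, vm: VM) -> None:
        """Restart on any other healthy host (lowest name first, for determinism)."""
        candidates = sorted(self.healthy_hosts - {vm.host})
        if not candidates:
            raise RuntimeError(f"no healthy host available for {vm.name}")
        self.restarts.append((vm.name, vm.host, candidates[0]))
        vm.host = candidates[0]
        vm.missed = 0


monitor = Monitor(healthy_hosts={"host-a", "host-b", "host-c"})
namenode = VM(name="namenode", host="host-a")

# host-a fails: it leaves the healthy set and the NameNode stops answering.
monitor.healthy_hosts.discard("host-a")
for _ in range(MISSED_HEARTBEATS_BEFORE_RESTART):
    monitor.observe(namenode, heartbeat_ok=False)

print(namenode.host)
print(monitor.restarts)
```

The same loop would apply to a JobTracker VM. Pausing and resuming in-progress jobs is a separate concern handled inside Hadoop and is not modelled here.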
31. 31 Confidential
vSphere Fault Tolerance Provides Continuous Protection
Diagram: App/OS virtual machines running on two VMware ESX hosts, with failures marked X
• Single identical VMs running in
lockstep on separate hosts
• Zero downtime, zero data loss
failover for all virtual machines in
case of hardware failures
• Integrated with VMware HA/DRS
• No complex clustering or
specialized hardware required
• Single common mechanism for all
applications and operating
systems
Zero downtime for NameNode, JobTracker, and other components in Hadoop clusters
32. 32 Confidential
Achieve HA for the Entire Hadoop Stack
HDFS
(Hadoop Distributed File System)
HBase (Key-Value store)
MapReduce (Job Scheduling/Execution System)
Pig (Data Flow) Hive (SQL)
BI Reporting
ETL Tools
Management
Server
ZooKeeper
(Coordination)
HCatalog
RDBMS
Namenode
Jobtracker
Hive MetaDB HCatalog MDB
Server
• Battle-tested high availability technology
• Single mechanism to achieve HA for the entire Hadoop stack
• One click to enable HA and/or FT
34. 34 Confidential
Evolution of Hadoop on VMs
Current Hadoop: Combined Storage/Compute
Hadoop in VM
- VM lifecycle determined by Datanode
- Limited elasticity
- Limited to Hadoop Multi-Tenancy
Separate Storage
- Separate compute from data
- Elastic compute
- Enable shared workloads
- Raise utilization
Separate Compute Clusters
- Separate virtual clusters per tenant
- Stronger VM-grade security and resource isolation
- Enable deployment of multiple Hadoop runtime versions
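One way to picture "elastic compute" once compute is separated from data: the data nodes stay fixed while the number of compute VMs follows the job backlog. The sketch below is illustrative only. The sizing rule, the function name, and the limits are hypothetical and do not describe Serengeti's actual behaviour.

```python
import math


def desired_compute_vms(pending_tasks: int, slots_per_vm: int,
                        min_vms: int = 1, max_vms: int = 20) -> int:
    """Return how many compute VMs to run for the current backlog.

    Data nodes are deliberately not an input: with storage separated,
    only the compute layer grows and shrinks.
    """
    if slots_per_vm < 1:
        raise ValueError("slots_per_vm must be at least 1")
    needed = math.ceil(pending_tasks / slots_per_vm)
    return max(min_vms, min(max_vms, needed))


DATA_NODES = 6  # fixed storage layer in this example

for backlog in (0, 10, 45, 500):
    vms = desired_compute_vms(backlog, slots_per_vm=4)
    print(f"backlog={backlog:>3}  compute VMs={vms:>2}  data nodes={DATA_NODES}")
```

Because the data layer never changes size in this model, shrinking the compute layer does not force HDFS blocks to be re-replicated, which is the elasticity limit noted for the combined "Hadoop in VM" stage.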
35. 35 Confidential
In-house Hadoop as a Service “Enterprise EMR” – (Hadoop + Hadoop)
Workloads:
• Ad hoc data mining
• Production recommendation engine
• Production ETL of log files
Layers: compute layer over a data layer (HDFS)
Infrastructure: six hosts on a virtualization platform
36. 36 Confidential
Integrated Big Data Production – (Hadoop + other big data)
Workloads:
• Hadoop batch analysis
• HBase real-time queries
• NoSQL – Cassandra key-value store
• MPP DBMS – Analysis of structured data
Layers: compute layer over a data layer (HDFS)
Infrastructure: six hosts on a virtualization platform
37. 37 Confidential
Integrated Hadoop and Webapps – (Hadoop + Other Workloads)
Workloads:
• Short-lived Hadoop compute cluster
• Hadoop compute cluster
• Web servers for ecommerce site
Layers: compute layer over a data layer (HDFS)
Infrastructure: six hosts on a virtualization platform
39. 39 Confidential
Simple, Reliable, Elastic Hadoop on Demand
Elasticity & Multi-tenancy
• Shrink and expand cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy
High Availability
• High availability for entire Hadoop stack
• One click to set up
• Battle-tested
Operational Simplicity
• Rapid deployment
• One-stop command center
• Easy to configure/reconfigure
Hadoop-as-a-Service (Enterprise Grade EMR)