2. Not Just for the Web Giants – The Intelligent Enterprise
2
3. Real-time analysis allows
instant understanding of
market dynamics.
Retailers can have intimate
understanding of their
customers needs and use
direct targeted marketing.
Market Segment Analysis ! Personalized Customer Targeting`
3
4. The Emerging Pattern of Big Data Systems: Retail Example
Real-Time
Processing
Data Science
Real-Time
Streams
Machine
Learning
Parallel Data
Processing
Exa-scale Data Store
Cloud Infrastructure
4
5. Storage: Plan for Peta-scale Data Storage and Processing
Analytics Rapidly Outgrows Traditional Data Size
by 100x
1000
100
10
Online Apps
PB of
Data
Analytics
1
0.1
0.01
2000
5
2003
2006
2009
2012
2015
6. Unprecedented Scale
We are creating an Exabyte
of data every minute in 2013
Yottabyte by 2030
“Data transparency,
amplified by Social Networks
generates data at a
scale never seen before”
- The Human Face of Big Data
6
7. A single GE Jet Engine produces
10 Terabytes of data in one hour
– 90 Petabytes per year.
Enabling early detection of
faults, common mode failures,
product engineering feedback.
Post Mortem ! Proactively Maintained Connected Product
7
8. The Emerging Pattern of Big Data Systems: Manufacturing
Real-Time
Sensor
Analytics
Product
Engineering
Support
Data Science
Real-Time
Processing
Machine
Learning
Parallel Data
Processing
Exa-scale Data Store
Cloud Infrastructure
8
12. What if you can…
Recommendation engine
Production
Test
Ad targeting
Production
One physical platform to support multiple virtual big
data clusters
Test
Experimentation
Production
recommendation engine
Experimentation
12
Experimentation
Test/Dev
Production
Ad Targeting
13. Values of a Cloud Platform for Big Data
1
Agility / Rapid deployment
2
Isolation for resource control
and security
3
Lower Capex
4
Operational efficiency
Compute
Management
Storage/Availability
Network/Security
13
14. Hadoop as a Service
Operational Simplicity
High Availability
Elasticity & Multi-tenancy
! Rapid deployment
! High availability for
entire Hadoop stack
! Shrink and expand
cluster on demand
! One click to setup
! Independent scaling of
Compute and data
! One stop command
center
! Easy to configure/
reconfigure
14
! Battle-tested
! Strong multi-tenancy
15. Self Service Access to Big Data Environments
Developer
Data Scientist
High priority
…
• 3 Hadoop nodes
• Cloudera, Pivotal
MapR
• Small VM
• Local storage
• No HA
• …
•
•
•
•
•
•
•
•
•
•
•
•
• …
• …
5 Hadoop nodes
Cloudera, Pivotal
Hive, Pig
Medium VM
HA
…
50 Hadoop nodes
Cloudera
Hive, Pig
Large VM
HA
…
Templates for Different Cloud Users
15
16. Big Data needs a Mix of Workloads
Other
Hadoop
batch analysis
Compute
layer
NoSQL
Big SQL
HBase
real-time queries
Impala,
Pivotal HawQ
Cassandra,
Mongo, etc
Spark,
Shark,
Solr,
Platfora,
Etc,…
File System/Data Store
Platform Virtualization Technology
Host
16
Host
Host
Host
Host
Host
Host
17. Strong Isolation between Workloads is Key
Hungry
Workload 1
Reckless
Workload 2
Virtualization Platform
17
Nosy
Workload 3
18. Community activity in Isolation and Resource Management
! YARN
• Goal: Support workloads other than M-R on Hadoop
• Initial need is for MPI/M-R from Yahoo
• Non-posix File system self selects workload types
! Mesos
• Distributed Resource Broker
• Mixed Workloads with some RM
• Active project, in use at Twitter
• Leverages OS Virtualization – e.g. cgroups
! Virtualization
• Virtual machine as the primary isolation, resource management and
versioned deployment container
• Basis for Project Serengeti
18
19. Use case: Elastic Hadoop with Tiered SLA
• Production workloads has high priority
• Experimentation workloads has lower priority
Experimentation
Mapreduce
Compute layer
Dynamic resourcepool
Compute
VM
Compute
VM
Experimentation
Experimentation
Production
Mapreduce
Compute
VM
Compute
VM
Compute
VM
19
Compute
VM
Compute
VM
Production
recommendation engine
Production
vSphere
Data layer
Compute
VM
20. Cloud Enabled Auto-elastic Hadoop
J
T
N
N
T
T
VHM
T
T
T
T
J
T
T
T
T
T
DATA VM
DATA VM
DATA VM
ESX
ESX
ESX
Local Disks
JT: JobTracker
TT: TaskTracker
NN: NameNode
VHM: Virtual Hadoop Manager
20
T
T
Hadoop HDFS VMs
Hadoop Compute VMs
SAN/NAS
Non-Hadoop VMs
21. Hadoop Performance with Virtualization
32 hosts/3.6GHz 8 cores/15K RPM 146GB SAS disks/10GbE/72-96GB RAM
(lower is better)
[http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013]
21
24. Traditional Networks: Core Switch is the Choke Point
Core
Aggregation
Network
Topology
Rack
Hosts
Hosts
Non Uniform Bandwidth
10s of Gbits
100s of Gbits
24
Modeled
Bandwidth
25. Modern Networks: Great for Big Data
Spine
Network
Topology
Leaf
Hosts
Uniform Bandwidth
25
Modeled
Bandwidth
28. Use Local Disk where it’s Needed
SAN Storage
Local Storage
$2 - $10/Gigabyte
$1 - $5/Gigabyte
$0.05/Gigabyte
$1M gets:
0.5Petabytes
200,000 IOPS
8Gbyte/sec
28
NAS Filers
$1M gets:
1 Petabyte
200,000 IOPS
10Gbyte/sec
$1M gets:
10 Petabytes
400,000 IOPS
250 Gbytes/sec
29. Storage Economics: Traditional vs. Scale-out
$5.50
Traditional
SAN/NAS
$5.00
$4.50
$4.00
$3.50
$3.00
Cost per GB
$2.50
$2.00
$1.50
Distributed
Object
Storage
$1.00
$0.50
$0.5
1
2
4
8
16
Petabytes Deployed
Scale-out NAS
Isilon
29
32
64
128
HDFS
MAPR
CEPH
Gluster
Scality
30. Big Data Storage Architectures
External SAN
With HDFS
Local Disks
With HDFS or
Other
HDFS,
CEPH,
MAPR,
Gluster,
Scality,
…
30
External
Scale-out NAS
31. Features from New Storage Solutions
Snapshots
Universal File Store
Clones
Geo Replication
Erasure Coding
NFS Access
Posix Support
31
QoS Controls
SSD Capability
33. Customers Winning from Consolidated Big Data Platforms
“Dedicated hardware makes no
sense”
“Software-defined Datacenter
enables rapid deployment
multiple tenants and labs”
“Our mixed workloads include
Hadoop, Database, ETL and
App-servers”
Compute
Management
Storage/Availability
Network/Security
33
“Any performance penalties are
minor”