6. Exploring new opportunities in Big Data-as-a-Service (BDaaS)
o Researching possible BDaaS solutions
o Making BDaaS a better fit for IT infrastructure
o Moving the future of BDaaS forward
Focusing on Sahara in OpenStack
o Bringing CDH into Sahara
o Creating more features in Sahara
o Ranked #1 in LOC, #3 in commits for Sahara contributions
ABOUT OUR TEAM
8. o You or someone at your company is using a public Big Data application service
such as AWS EMR.
You need Sahara to migrate the Big Data application to your private cloud.
o You have multiple Hadoop clusters in your environment and you would like to
integrate them for better infrastructure utilization.
You need Sahara to virtualize Hadoop on top of cloud infrastructure.
o You have been using OpenStack as your IT cloud infrastructure for years, and a
Hadoop cluster is also running in your IT environment.
You need Sahara to bring them together into a unified IT environment for
easier maintenance.
FROM THE CUSTOMER NEEDS
Source: OpenStack Vancouver Design Summit, "Benchmarking a Sahara-based as-a-Service solution" by Red Hat & Intel
9. Data Scientists/Analysts
o An elastic way to run big data applications
Developers
o Custom big data infrastructure tailored to different needs
Administrators/Operators
o A better way to maintain not only the hardware platform but also the software packages
Company
o Cost, cost, cost
BETTER USER EXPERIENCE MEANS…
10. A COMPLEX BIG DATA SOLUTION
[Diagram: different types of structured and unstructured data sources feed a Big Data solution; organizing the data (ETL) is complex, and the results drive diverse BI reports. Components shown include Pig and ZooKeeper.]
13. SAHARA DATA PROCESSING PATTERN
Pattern 1: Internal HDFS
[Diagram: a single OpenStack instance runs both the Node Manager and the Data Node; a collect application feeds data into the cluster.]
OpenStack supports creating HDFS on Cinder
volumes or on ephemeral disk. This method can
deliver better data processing performance via
ephemeral disk, or persist the data via Cinder
at lower performance.
Pros:
Performance can be extremely fast (depending on the
storage backend).
Cons:
Data persistence may be a problem, since the data
follows the life of the virtual cluster.
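As a sketch of how Pattern 1 can be expressed in Sahara, a node group template may request Cinder-backed storage for the worker nodes via `volumes_per_node`/`volumes_size`, or omit those fields to fall back to ephemeral disk. The flavor ID, plugin version, and sizes below are illustrative placeholders, not values from this deck:

```json
{
  "name": "worker-with-cinder",
  "plugin_name": "vanilla",
  "hadoop_version": "2.6.0",
  "flavor_id": "2",
  "node_processes": ["nodemanager", "datanode"],
  "volumes_per_node": 2,
  "volumes_size": 100
}
```

Dropping `volumes_per_node` and `volumes_size` gives the faster but non-persistent ephemeral-disk variant of the same pattern.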
14. SAHARA DATA PROCESSING PATTERN
Pattern 2: External HDFS
[Diagram: Instance 1 runs the Node Manager and the collect application; Instance 2 runs the Data Node.]
You can also choose to deploy HDFS on a
separate instance. This gives you more
elasticity in managing your instances: you can
save compute power by turning off the node
manager instance while the data node keeps
running.
Pros:
Performance may be the same as Pattern 1, but it gives
you more flexibility to control your instances, save
power, and persist your data on the data node.
Cons:
A long-running cluster may still need to consider another
way of persisting data.
15. SAHARA DATA PROCESSING PATTERN
Pattern 3: Swift
[Diagram: an OpenStack instance running the Node Manager streams data directly from Swift; a collect application writes the collected data into Swift.]
Using Swift, Hadoop can stream data from the
object store directly. It provides a way to store
your data externally and solves the data
persistence problem. Swift currently also
supports a data locality feature.
Pros:
Streams data directly and integrates with your
existing Swift infrastructure.
Cons:
Performance can be an issue compared with the
HDFS-based patterns.
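A minimal sketch of the Hadoop-side configuration for Pattern 3, using the hadoop-openstack (swiftfs) properties that Sahara relies on; the Keystone endpoint and credentials are placeholders, and the `sahara` service name is the one Sahara conventionally registers:

```xml
<configuration>
  <!-- Keystone endpoint for the "sahara" Swift service (placeholder URL) -->
  <property>
    <name>fs.swift.service.sahara.auth.url</name>
    <value>http://keystone.example.com:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.sahara.tenant</name>
    <value>demo</value>
  </property>
  <property>
    <name>fs.swift.service.sahara.username</name>
    <value>demo</value>
  </property>
  <property>
    <name>fs.swift.service.sahara.password</name>
    <value>secret</value>
  </property>
</configuration>
```

Jobs then reference their data with URLs of the form `swift://mycontainer.sahara/input`, so the data never has to be copied into HDFS first.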
16. Cluster Deployment
o Service deployment
Compute Engine Choice
o Bare metal, KVM, Docker, Hyper-V, vSphere, Xen
Storage Architecture
o Ephemeral disk
o Persistent volume
o Performance
o Cost
o Current IT infrastructure
Deployment Consideration
[Diagram: the three layers of a deployment — cluster deployment (Node Manager/Data Node instances on a host), compute engine (bare metal, KVM, container), and storage infrastructure (ephemeral disk, block storage, object storage).]
18. Issue 1 - Provisioning a Cluster Takes a Long Time
Problem Description:
o 10,000+ jobs per day across several different workloads (some jobs run in SECONDS, some
run in HOURS)
o Hard to sort out whether a job is small or large; it is not only about data size but also about the job's logic
o Provisioning a cluster takes longer than running a small job, for example: launching a
4-node cluster takes 10+ minutes
Customer's Feedback:
o Finish jobs on time, with no need to worry about provisioning a cluster
Possible Solutions/Alternatives:
o Run jobs in an existing cluster (depends on the case)
o Run jobs in a public cluster using Resource ACLs (will be supported in Liberty)
o Reduce the time to provision a cluster -> plugin specific
o Using Docker can save time launching an instance, but launching the services still takes time
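The "run jobs in an existing cluster" alternative above can be sketched as a dispatch heuristic: send short jobs to an already-running shared cluster and only provision a dedicated cluster when the job is long enough to amortize the provisioning delay. The 600-second figure matches the "4-node cluster in 10+ mins" observation; `estimate_runtime` and the cluster names are hypothetical.

```python
# Dispatch heuristic sketch for Issue 1: jobs shorter than the
# provisioning delay go to a long-lived shared cluster; longer jobs
# justify provisioning their own cluster.

PROVISION_SECONDS = 600  # observed time to launch a 4-node cluster


def choose_target(estimated_runtime_s, provision_s=PROVISION_SECONDS):
    """Return 'shared-cluster' for jobs dominated by provisioning
    time, 'dedicated-cluster' otherwise."""
    if estimated_runtime_s < provision_s:
        return "shared-cluster"
    return "dedicated-cluster"


print(choose_target(30))    # seconds-scale job -> shared-cluster
print(choose_target(7200))  # hours-scale job -> dedicated-cluster
```

The threshold is the crude part; in practice the runtime estimate would come from job history, which is exactly the "hard to sort out a job" problem the slide describes.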
20. Docker also has the advantage when instances are idle
[Chart: Docker: Compute Node CPU (full test duration) — averages: usr 0.54%, sys 0.17%]
[Chart: KVM: Compute Node CPU (full test duration) — averages: usr 7.64%, sys 1.4%]
Source: IBM, Boden Russell, "Performance Characteristics of Traditional VMs vs Docker Containers"
21. Issue 2 - Complex Data Processing
Problem Description:
o A job usually runs multiple sub-jobs in a row, e.g. Job A -> Job B -> Job C; scheduling a
job also needs to be supported
Customer's Feedback:
o Run a complex job to fulfill their use case
o Schedule a job using Sahara EDP
o Run a recurring job
Possible Solutions/Alternatives:
o Currently Sahara EDP only supports running a simple job
o Scheduling a job -> BP: https://review.openstack.org/#/c/175719/
o Running a complex job -> under discussion
o Running a recurring job -> under discussion
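The Job A -> Job B -> Job C chaining that Sahara EDP does not yet provide can be sketched as a simple sequential runner that stops the chain on the first failure. The job callables here are stand-ins for EDP job executions, not real Sahara API calls.

```python
# Minimal job-chaining sketch for Issue 2: run sub-jobs in order,
# stop on the first failure, and report what completed.

def run_chain(jobs):
    """Run (name, callable) pairs in order; a job returns True on
    success. Return the list of job names that completed."""
    completed = []
    for name, job in jobs:
        if not job():
            break  # abort the rest of the chain on failure
        completed.append(name)
    return completed


chain = [
    ("job-a", lambda: True),
    ("job-b", lambda: True),
    ("job-c", lambda: False),  # simulate a failure in the last step
]
print(run_chain(chain))  # ['job-a', 'job-b']
```

A real implementation would also need the scheduling and recurrence pieces listed above, which is why the blueprint discussion goes beyond simple sequencing.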
22. Issue 3 - Storage Architecture
Problem Description:
o Currently our customers use a separate Compute Cluster (using Nova) and Storage
Cluster (using Swift as the object store for data). But there is a performance issue when
compute and data sit on different nodes: transferring the data must pass through the network.
Customer's Expectation:
o Find a better solution that fulfills their requirements and integrates with their current storage
architecture
Possible Solutions/Alternatives:
o Use internal HDFS -> needs a way to copy data from Swift into internal HDFS
o Use the Swift data locality feature -> they must change their storage architecture
23. Two phases of disk writes during a Sort run
o Shuffle Map-Reduce data -> stored as intermediate data in the
temp folder (40% of total throughput)
o Write output -> HDFS write (60% of total throughput)
Sort Workload Profile
[Chart: disk I/O peaks once while shuffling data through the temp folder, and again while writing output to HDFS/external storage.]
24. 1. Hadoop temp folder location
2. HDFS location
3. Data persistence
4. Integration with the current storage architecture (clouds usually use shared
storage)
5. Storage optimized for your workload
Storage Consideration
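Considerations 1 and 2 map directly onto two standard Hadoop properties, so one way to split the Sort workload's two write phases across storage tiers is a fragment like the following; the mount paths are illustrative, not values from this deck:

```xml
<!-- core-site.xml: put the temp folder (shuffle/intermediate data,
     ~40% of throughput) on fast local ephemeral disk -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/ephemeral/hadoop-tmp</value>
</property>

<!-- hdfs-site.xml: put HDFS block storage (output writes, ~60% of
     throughput) on a persistent Cinder-backed volume -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/cinder-volume/hdfs-data</value>
</property>
```

This keeps the throwaway shuffle data on the fastest disk while the job output survives on the persistent volume, matching considerations 3 and 5.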
25. Redundancy Issue when Running HDFS over Ceph/GlusterFS
[Diagram: a compute cluster of instances, each running HDFS, backed by Cinder volumes on a Ceph cluster; every block is replicated three times by HDFS across the instances, and each of those copies is replicated three more times inside Ceph.]
3 (in HDFS) x 3 (in Ceph)
= 9 replicas in the Ceph cluster
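The replica multiplication above can be written as a one-line function: HDFS replicates each block itself, and the Ceph backend replicates each of those copies again, so the factors multiply. Lowering `dfs.replication` when the backend already provides redundancy is a common mitigation, shown here as an assumption rather than a recommendation from the deck.

```python
# Effective copy count when a replicating filesystem (HDFS) sits on
# top of a replicating backend (Ceph/GlusterFS): the factors multiply.

def effective_replicas(hdfs_replication, backend_replication):
    return hdfs_replication * backend_replication


print(effective_replicas(3, 3))  # 9 copies of every block, as on the slide
# With dfs.replication lowered to 1, Ceph alone provides the redundancy:
print(effective_replicas(1, 3))  # 3 copies
```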
26. Cinder Volume Instance Locality Support in Sahara
[Diagram: two Nova compute hosts, each running its own cinder-volume service; the HDFS instances on each host attach volumes served from that same host (Volume1-3 on Compute1, Volume4-6 on Compute2), keeping data access local.]
27. Swift Performance Issue
Performance impact:
o Swift overhead comes from the "Rename" method in Hadoop
o The "List Endpoints" feature has a huge impact
o Larger data sizes may widen the performance gap
[Diagram: HDFS on the Nova instance store vs. data served from a separate Swift cluster, with measured overheads of 1.25x and 1.67x against a 1x baseline.]
28. The output of the reduce function is written to a temporary location in HDFS.
After completion, the output is automatically renamed from its temporary
location to its final location.
Rename in Reduce Task
ANALYSIS
• Object storage cannot support rename,
so swiftfs uses "copy and delete" to
implement the rename function.
• HDFS rename -> changes METADATA
on the Name Node
• Swift rename -> copies a new object and
deletes the old one in Swift
[Chart: write-path comparison (local to swift, swift to swift, local to hdfs) showing roughly 1.5x overhead for the Swift path.]
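The analysis above can be illustrated with a toy cost model: an HDFS rename only touches Name Node metadata and moves no data, while the swiftfs "copy and delete" must read and rewrite the whole object, so its cost grows with object size. The byte counts here model I/O volume only; this is not a real object-store client.

```python
# Toy cost model for the rename at the end of a reduce task.

def hdfs_rename_cost(size_bytes):
    """HDFS rename is a metadata-only operation: no data is moved."""
    return 0


def swift_rename_cost(size_bytes):
    """swiftfs emulates rename as copy-then-delete: the object is
    read once and written once, so I/O scales with object size."""
    return 2 * size_bytes


one_gib = 1 << 30
print(hdfs_rename_cost(one_gib))   # 0 bytes moved
print(swift_rename_cost(one_gib))  # 2147483648 bytes moved
```

This is why the overhead in the chart grows with output size: every reduce task pays the copy-and-delete price on its full output.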
29. Issue 4 - Scaling a Cluster
Problem Description:
o Customers have found several issues when scaling a cluster, and they would like the
community to improve the experience
Customer's Expectation:
o Rebalance HDFS after scaling
o Auto-scale a cluster on demand (e.g. by job size, etc.)
Possible Solutions/Alternatives:
o Rebalance HDFS -> BP: https://blueprints.launchpad.net/sahara/+spec/hdfs-rebalance
o Auto-scaling -> needs to be discussed
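The requested "auto-scale by job size" behaviour can be sketched as a sizing rule that picks a worker count from the pending job size, clamped to the cluster's limits. The one-worker-per-100-GB rule and the bounds are purely illustrative assumptions.

```python
# Auto-scaling sketch for Issue 4: size the cluster to the job,
# clamped between a floor (keep the cluster alive) and a ceiling
# (quota / budget limit).

def workers_for_job(job_size_gb, min_workers=2, max_workers=20):
    wanted = -(-job_size_gb // 100)  # ceil(job_size_gb / 100)
    return max(min_workers, min(max_workers, wanted))


print(workers_for_job(50))    # 2  (small job, floor applies)
print(workers_for_job(750))   # 8
print(workers_for_job(5000))  # 20 (capped at the ceiling)
```

Any real implementation would also have to trigger the HDFS rebalance mentioned above after each scale operation, since new data nodes start empty.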
30. Issue 5 - OpenStack Version Support
Problem Description:
o New features usually land only in new releases, but customers would like to use new features in old
environments
o Some new features cannot be accepted as backports to an older release
Customer's Expectation:
o Use new features on Kilo or later versions of OpenStack
Possible Solutions/Alternatives:
o Rolling upgrade from Juno to Kilo
o Use only Sahara and Horizon from Kilo with the other OpenStack projects on Juno -> we haven't tried
this
o In the future, plugins will support backward compatibility, so a plugin can be released separately from Sahara
32. o Vanilla plugin supports Hadoop 1.2.1 and Hadoop 2.6
o Spark plugin
o Cloudera CDH plugin
o MapR plugin
o Storm plugin
o New Horizon UI with a guide panel
o Default template support
What's New in Kilo
33. o Sahara EDP is the focus for processing data flows
o Support more data sources and storage architectures
o Support more Big Data projects
o Integrate with other OpenStack projects
o Bare metal -> Ironic
o Docker -> Magnum
o Application catalog -> Murano
The Future of Sahara