MapReduce on ZeroVM for Lightweight Big Data Processing
1. MapReduce on ZeroVM
A Lightweight Virtualization for Big Data Processing
Joy Rahman
Research Assistant
Cloud and Big Data Lab, UTSA
2. MapReduce and Big Data
● Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using traditional data processing
applications.
● MapReduce is a distributed processing framework that supports Big Data
Processing.
● A MapReduce program is composed of a Map() procedure that performs filtering
and sorting and a Reduce() procedure that performs a summary operation.
● MapReduce libraries have been written in many programming languages. A
popular open-source implementation is Apache Hadoop (http://hadoop.apache.org/).
3. Let's start with an example
Challenge: Count all the words in a file
Lorem Ipsum is simply dummy text of the printing and
typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and
scrambled it to make a type specimen book. It has
survived not only five centuries, but also the leap into
electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of
Letraset sheets containing Lorem Ipsum passages, and
more recently with desktop publishing software like Aldus
PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply
random text. It has roots in a piece of classical Latin
literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-
Sydney College in Virginia, looked up one of the more
obscure Latin words, consectetur, from a Lorem Ipsum
passage, and going through the cites of the word in
classical literature, discovered the undoubtable source.
Word Count
-------- --------
Lorem 5
.... 1
.... 1
.... 1
dummy 1
Any problem with this
approach?
- Yes, the file may be too big
4. Let's see an example (cont.)
A better approach: divide and conquer
Lorem Ipsum is simply dummy text of the printing and
typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and
scrambled it to make a type specimen book. It has
survived not only five centuries, but also the leap into
electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum
passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of
Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply
random text. It has roots in a piece of classical Latin
literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at
Hampden-Sydney College in Virginia, looked up one of
the more obscure Latin words, consectetur, from a
Lorem Ipsum passage, and going through the cites of the
word in classical literature, discovered the undoubtable
source.
Program 1     Program 2     Program 3
(key, value)  (key, value)  (key, value)
Lorem, 2      Lorem, 1      Lorem, 3
simply, 1     was, 2        from, 2
has, 1        has, 5        has, 1
Do you see any problem with this approach?
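A minimal Python sketch of the divide-and-conquer step, assuming the input is split by lines into a fixed number of pieces and each piece is counted on its own. The function names (`split_into_pieces`, `count_words`), the sample text, and the plain whitespace tokenization are illustrative assumptions, not part of any MapReduce library.

```python
from collections import Counter


def split_into_pieces(text, n_pieces):
    """Split text into at most n_pieces chunks made of whole lines."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_pieces))  # ceiling division
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def count_words(piece):
    """Count the words of one piece; this is the work of one 'program'."""
    return Counter(piece.split())


text = (
    "Lorem Ipsum is simply dummy text\n"
    "Lorem Ipsum has been the standard\n"
    "It has roots in Latin"
)

# Each piece is counted independently, so the calls could run in parallel.
partial_counts = [count_words(p) for p in split_into_pieces(text, 3)]
for i, counts in enumerate(partial_counts, start=1):
    print(f"Program {i}: {dict(counts)}")
```

Each program only sees its own piece, so the same word can show up in several partial results with different counts.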
5. We need to combine the results
- We have divided the big input file into multiple pieces so that parallel
processes can attack the file simultaneously, lowering the total
processing time.
- But the results from the individual processes need to be combined.
Partial results (one column per program):
Lorem, 2      Lorem, 1      Lorem, 3
simply, 1     was, 2        from, 2
has, 1        has, 5        has, 1
Combined result:
Lorem, 6
simply, 1
has, 7
from, 2
....
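The combining step can be sketched in Python as follows. This is a minimal illustration that uses the partial (key, value) results shown on the slide; the function name `combine` is an assumption made for this example.

```python
from collections import Counter

# Partial (key, value) results produced by the three programs.
partial_results = [
    [("Lorem", 2), ("simply", 1), ("has", 1)],
    [("Lorem", 1), ("was", 2), ("has", 5)],
    [("Lorem", 3), ("from", 2), ("has", 1)],
]


def combine(partials):
    """Sum the values of all pairs that share the same key."""
    totals = Counter()
    for pairs in partials:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


combined = combine(partial_results)
print(combined)
```

The totals for "Lorem" (6), "has" (7), "simply" (1) and "from" (2) match the combined result on the slide.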
6. MapReduce
● The example we have just seen is a typical MapReduce program for big data
processing.
● The first phase (splitting up and processing the input) is called Map.
● The final phase (combining the results) is called Reduce.
8. Formal Definitions
❏ The Map and Reduce functions of MapReduce are both defined with respect to
data structured in (key, value) pairs.
❏ Map takes one pair of data with a type in one data domain, and returns a list of
pairs in a different domain:
Map(k1,v1) → list(k2,v2)
The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the
MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each
key.
❏ The Reduce function is then applied in parallel to each group, which in turn
produces a collection of values in the same domain:
Reduce(k2, list (v2)) → list(v3)
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
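The signatures above can be made concrete with a small single-process Python sketch of word count. It is only an illustration under stated assumptions: the names `map_fn`, `group_by_key`, `reduce_fn` and `map_reduce` are hypothetical, and a real framework would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(doc_name: str, text: str) -> List[Tuple[str, int]]:
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Framework step: collect all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word: str, counts: List[int]) -> List[int]:
    """Reduce(k2, list(v2)) -> list(v3): sum the counts of one word."""
    return [sum(counts)]


def map_reduce(inputs: Dict[str, str]) -> Dict[str, List[int]]:
    """Apply map to every input pair, group by key, then reduce each group."""
    mapped = [pair for k1, v1 in inputs.items() for pair in map_fn(k1, v1)]
    return {k2: reduce_fn(k2, v2s) for k2, v2s in group_by_key(mapped).items()}


result = map_reduce({"a.txt": "has Lorem has", "b.txt": "Lorem from"})
print(result)  # {'has': [2], 'Lorem': [2], 'from': [1]}
```

Here k1 is a document name, v1 its text, k2 a word, v2 a partial count, and v3 the total count for that word.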
10. Existing Limitations of Big Data Processing on the Cloud
● Current cloud deployments have two distinct clusters:
○ 1) Computation cluster (e.g., Amazon EC2)
○ 2) Storage cluster (e.g., Amazon S3)
● The computation cluster is used for CPU-intensive processing, whereas the
storage cluster is used to store the persistent data.
● Running MapReduce on the cloud is costly because of the considerable
overhead of fetching the data from the storage cluster to the computation
cluster and writing the results back after processing.
11. Example: Amazon EMR
[Figure: Amazon EMR architecture, annotated "Costly data transfer"]
Image source & Ref: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
12. Challenges
● How to avoid the data transfer overhead for big data processing?
○ Answer: Take the computation to the storage cluster
[Figure: apps running inside the storage cluster]
But traditional OS-level virtualization is
● bulky and CPU-intensive to run inside a cluster that is optimized
for storage I/O only
● slow to spin up
● expensive to scale horizontally
13. ZeroVM to the rescue
● ZeroVM is an open-source lightweight virtualization platform
based on the Chromium Native Client project (NaCl provides the
essential isolation through the software fault isolation technique).
● ZeroVM allows arbitrary code (C/C++, Python) from untrusted users
to be executed safely in multi-tenant environments.
● The ZeroVM core is only 75 KB in size and can spin up in 5 ms.
● Thus it is an ideal candidate to run on top of storage clusters
like OpenStack SWIFT.
● ZeroVM takes the computation to the storage, enabling cost-effective
MapReduce on the cloud.
14. ZeroVM Properties
1. ZeroVM is small, light, fast, secure, and hyper-scalable.
2. ZeroVM virtualizes the application, not the operating system.
3. Single-threaded (thus deterministic) execution: the same executable
produces the same results each time it is run.
4. Predefined resource constraints before execution
● Channel-based I/O
● Predefined socket port / network
● Restricted memory access
● Limited reads/writes (in bytes)
● Short-lived sessions / predefined session_timeout
15. Credit: Ryan McKinney, Senior Software Engineer, Rackspace
16. ZeroCloud
● ZeroCloud is the cloud module that runs on top of SWIFT and provides the ability
to run ZeroVM sessions on different servers of the cluster.
● ZeroCloud makes it easy to create large clusters of instances, aggregating the
compute power of many individual physical servers into a single execution
environment.
● Users can leverage the power of hundreds of physical servers for a few seconds or
even milliseconds at a time.
● Horizontal scalability is a key design goal for ZeroVM.
17. ZeroCloud (on SWIFT)
[Figure: ZeroCloud on an OpenStack SWIFT cluster. The user supplies the job
description with the executables (apps) in a GET/POST request to the SWIFT proxy
with ZeroCloud. The proxy passes the request to the object servers; if the
request is an exec request, an object server spawns a ZeroVM session that runs
the apps, and the result is returned in the response.]
18. MapReduce on ZeroVM
● ZeroVM running on ZeroCloud is inherently targeted at big
data processing, particularly in the MapReduce style.
● Users can define multi-stage jobs, and any stage can
connect to another stage.
● Users only need to provide the executables.
● Since the data is already inside the SWIFT cluster, a job
execution request through GET/POST is enough to start the big
data processing instantly and obtain the result.
● This ensures data locality and eliminates the costly data transfer.
20. Our Research on ZeroVM
● There is much ongoing research on ZeroVM.
● The UTSA Big Data and Cloud Lab has some ongoing research
projects.
● Currently I am working under the supervision of Dr. Lama to
improve MapReduce on ZeroVM.
● Our project involves developing a scheduler for ZeroCloud
that is data-locality-aware, interference- and
heterogeneity-aware, and skew-aware.
21. Our Research on ZeroVM (contd.)
● Data locality is of great importance for big data processing.
● The current implementation ensures data locality for the map phase,
since the executables run where the input data is stored.
● We would like to optimize and ensure data locality for
the reduce phase.
● We would like to design a scheduler that intelligently mitigates the
data/computational skew problem (which is inherent in
every MapReduce environment) and which is
currently handled manually by the end user.
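To make the skew problem concrete, here is a small Python sketch. It assumes that intermediate keys are assigned to reducers by hash partitioning, a common default in MapReduce systems; it does not describe how ZeroCloud or the planned scheduler assigns work. The names `assign_reducer` and `reducer_loads` and the sample key counts are hypothetical.

```python
import zlib
from typing import Dict, List


def assign_reducer(key: str, n_reducers: int) -> int:
    """Stable hash partitioning: a key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers


def reducer_loads(key_counts: Dict[str, int], n_reducers: int) -> List[int]:
    """Number of intermediate values each reducer would have to process."""
    loads = [0] * n_reducers
    for key, count in key_counts.items():
        loads[assign_reducer(key, n_reducers)] += count
    return loads


# Hypothetical number of intermediate values per key.
key_counts = {"Lorem": 6, "simply": 1, "has": 7, "from": 2, "was": 2}

loads = reducer_loads(key_counts, 3)
# Skew factor: heaviest reducer relative to a perfectly even split.
skew = max(loads) / (sum(loads) / len(loads))
print(loads, round(skew, 2))
```

A skew factor well above 1 means one reducer holds much more than its share, so the whole job waits for it. A skew-aware scheduler would try to detect and rebalance such cases automatically.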
22. Thanks
Get this ppt from: http://goo.gl/6fJpbn
Credits:
[1] Prosunjit Biswas, UTSA
[2] Carina C. Zona, Rackspace
[3] Ryan McKinney, Rackspace
References:
[1] ZeroVM: http://www.zerovm.org
[2] Apache Hadoop: http://hadoop.apache.org
[3] Amazon EMR: http://aws.amazon.com/elasticmapreduce
[4] MapReduce: http://en.wikipedia.org/wiki/MapReduce
[5] Native Client: A Sandbox for Portable, Untrusted x86 Native Code: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/34913.pdf
More about ZeroVM
Website: www.zerovm.org
Github: https://github.com/zerovm/
User Mailing List: zerovm@googlegroups.com
IRC: #zerovm on Freenode