MapReduce on ZeroVM 
A Lightweight virtualization for Big Data Processing 
Joy Rahman 
Research Assistant 
Cloud and Big Data Lab, UTSA
MapReduce and Big Data 
● Big data is an all-encompassing term for any collection of data sets so large and 
complex that it becomes difficult to process using traditional data processing 
applications. 
● MapReduce is a distributed processing framework that supports Big Data 
Processing. 
● A MapReduce program is composed of a Map() procedure that performs filtering 
and sorting and a Reduce() procedure that performs a summary operation. 
● MapReduce libraries have been written in many programming languages. A 
popular open-source implementation is Apache Hadoop (http://hadoop.apache.org/).
Let's start with an example 
Challenge: Count all the words in a file 
Lorem Ipsum is simply dummy text of the printing and 
typesetting industry. Lorem Ipsum has been the 
industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has 
survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of 
Letraset sheets containing Lorem Ipsum passages, and 
more recently with desktop publishing software like Aldus 
PageMaker including versions of Lorem Ipsum. 
Contrary to popular belief, Lorem Ipsum is not simply 
random text. It has roots in a piece of classical Latin 
literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden- 
Sydney College in Virginia, looked up one of the more 
obscure Latin words, consectetur, from a Lorem Ipsum 
passage, and going through the cites of the word in 
classical literature, discovered the undoubtable source. 
Word     Count 
-------- -------- 
Lorem    6 
....     1 
....     1 
....     1 
dummy    2 
Any problem with this approach? 
- Yes, the file may be too big for one program to process quickly.
Let's see an example (cont.) 
A better approach: Divide and Conquer 
Lorem Ipsum is simply dummy text of the printing and 
typesetting industry. Lorem Ipsum has been the 
industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has 
survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the 
release of Letraset sheets containing Lorem Ipsum 
passages, and more recently with desktop publishing 
software like Aldus PageMaker including versions of 
Lorem Ipsum. 
Contrary to popular belief, Lorem Ipsum is not simply 
random text. It has roots in a piece of classical Latin 
literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at 
Hampden-Sydney College in Virginia, looked up one of 
the more obscure Latin words, consectetur, from a 
Lorem Ipsum passage, and going through the cites of the 
word in classical literature, discovered the undoubtable 
source. 
Program 1    Program 2    Program 3 
Lorem, 2     Lorem, 1     Lorem, 3 
simply, 1    was, 2       from, 2 
has, 1       has, 5       has, 1 
(each entry is a key, value pair) 
Do you see any problem with this approach?
We need to combine the results 
- We have divided the big input file into multiple pieces so that parallel processes can attack the file simultaneously, lowering the total processing time. 
- But the results from the individual processes need to be combined. 
Partial results: 
Program 1    Program 2    Program 3 
Lorem, 2     Lorem, 1     Lorem, 3 
simply, 1    was, 2       from, 2 
has, 1       has, 5       has, 1 
Combined result: 
Lorem, 6 
simply, 1 
has, 7 
from, 2 
....
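The combining step can be sketched in Python. This is a minimal illustration rather than part of the original slides: the function name combine_counts is hypothetical, and the partial counts are the ones shown for Programs 1 to 3 above.

```python
from collections import Counter

def combine_counts(partials):
    """Merge per-chunk word counts into one total per word."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts key by key
    return dict(total)

# Partial results from the three programs in the example above.
program_1 = {"Lorem": 2, "simply": 1, "has": 1}
program_2 = {"Lorem": 1, "was": 2, "has": 5}
program_3 = {"Lorem": 3, "from": 2, "has": 1}

combined = combine_counts([program_1, program_2, program_3])
print(combined["Lorem"], combined["has"])  # prints: 6 7
```

Because addition is associative and commutative, the partial results can be merged in any order, which is what allows the pieces to be processed independently.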
MapReduce 
● The example we have just seen is a typical MapReduce program for big data processing. 
● The first phase (splitting up and processing the input) is called Map. 
● The final phase (combining the results) is called Reduce.
Formal Definitions 
❏ The Map and Reduce functions of MapReduce are both defined with respect to 
data structured in (key, value) pairs. 
❏ Map takes one pair of data with a type in one data domain, and returns a list of 
pairs in a different domain: 
Map(k1,v1) → list(k2,v2) 
The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the 
MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each 
key. 
❏ The Reduce function is then applied in parallel to each group, which in turn 
produces a collection of values in the same domain: 
Reduce(k2, list(v2)) → list(v3) 
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
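The definitions above can be illustrated with a small single-process sketch in Python. The names map_fn, reduce_fn and run_mapreduce are hypothetical, and the phases run serially here, whereas a real framework such as Hadoop would run the Map and Reduce calls in parallel across machines.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a chunk."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (k1, v1) input pair.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Grouping step: collect all values that share the same k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: apply reduce_fn to each group.
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}

# Here k1 is a chunk number and v1 is the text of that chunk.
chunks = [(0, "Lorem Ipsum is simply dummy text"),
          (1, "Lorem Ipsum has been the standard dummy text")]
result = run_mapreduce(chunks, map_fn, reduce_fn)
print(result["Lorem"], result["dummy"], result["simply"])  # prints: [2] [2] [1]
```

Each Reduce call returns a list, matching the list(v3) in the signature; for word count that list holds a single total.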
[Figure: the input is split into [k1, v1] pairs, sorted by k1, and merged into [k1, [v1, v2, v3, ...]] groups.]
Existing Limitations of Big Data 
Processing on the Cloud 
● Current cloud deployments have two distinct clusters: 
○ 1) Computation cluster (e.g., Amazon EC2) 
○ 2) Storage cluster (e.g., Amazon S3) 
● The computation cluster is used for CPU-intensive processing, whereas the storage cluster is used to store persistent data. 
● Running MapReduce on the cloud is costly because of the considerable overhead incurred in fetching the data from the storage cluster to the computation cluster and writing the results back after processing.
Example: Amazon EMR 
[Figure: Amazon EMR, annotated "Costly Data Transfer"] 
Image source & Ref: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
Challenges 
● How can we avoid the data transfer overhead in big data processing? 
○ Answer: Take the computation to the storage cluster 
[Figure: apps running inside the storage cluster] 
But traditional OS-level virtualization is 
● bulky and CPU-intensive to run inside a cluster that is optimized for storage I/O only 
● slow to spin up 
● expensive to scale horizontally
ZeroVM to the rescue 
● ZeroVM is an open-source lightweight virtualization platform based on the Chromium Native Client (NaCl) project; NaCl provides the essential isolation through software fault isolation. 
● ZeroVM makes it possible to safely execute arbitrary code (C/C++, Python) from untrusted users in multi-tenant environments. 
● The ZeroVM core is only 75 KB in size and can spin up in 5 ms. 
● It is therefore an ideal candidate to run on top of storage clusters such as OpenStack Swift. 
● ZeroVM takes the computation to the storage, enabling cost-effective MapReduce on the cloud.
ZeroVM Properties 
1. ZeroVM is small, light, fast, secure, and hyper-scalable. 
2. ZeroVM virtualizes the application, not the operating system. 
3. Single-threaded (thus deterministic) execution: the same executable produces the same results each time it is run on the same input. 
4. Resource constraints are predefined before execution: 
● Channel-based I/O 
● Predefined socket port / network 
● Restricted memory access 
● Limited reads/writes (in bytes) 
● Short-lived sessions / predefined session_timeout
credit : Ryan McKinney, Senior Software Engineer, Rackspace
ZeroCloud 
● ZeroCloud is the cloud module that runs on top of OpenStack Swift and makes it possible to run ZeroVM sessions on different servers of the cluster. 
● ZeroCloud makes it easy to create large clusters of instances, aggregating the compute power of many individual physical servers into a single execution environment. 
● Users can leverage the power of hundreds of physical servers for a few seconds or even milliseconds at a time. 
● Horizontal scalability is a key design goal for ZeroVM.
ZeroCloud (on Swift) 
[Figure: ZeroCloud on an OpenStack Swift cluster. The user supplies the job description with the executables (apps) in a GET/POST request to the Swift proxy running ZeroCloud. The proxy forwards the request to the object servers; if the request is an exec, an object server spawns a ZeroVM session that runs the apps, and the result is returned in the response.]
MapReduce on ZeroVM 
● ZeroVM running on ZeroCloud is inherently targeted at big data processing, particularly in the MapReduce style. 
● Users can define multi-stage jobs, and any stage can connect to another stage. 
● Users need to provide only the executables. 
● Since the data is already inside the Swift cluster, a job execution request through GET/POST is enough to start the big data processing instantly and obtain the result. 
● This ensures data locality and eliminates the costly data transfer.
Demonstration 
Would you like to give ZeroVM a try? http://zebra.zerovm.org/
Our Research on ZeroVM 
● There is a lot of ongoing research on ZeroVM. 
● The UTSA Cloud and Big Data Lab has several ongoing research projects. 
● Currently I am working under the supervision of Dr. Lama to improve MapReduce on ZeroVM. 
● Our project involves developing a scheduler for ZeroCloud that is optimized for data locality and is aware of interference, heterogeneity, and skew.
Our Research on ZeroVM (contd) 
● Data locality is of great importance for big data processing. 
● The current implementation ensures data locality for the Map phase, since the executables run on the servers that hold the input data. 
● We would like to optimize and ensure data locality for the Reduce phase as well. 
● We would like to design a scheduler that intelligently mitigates the data/computational skew problem, which is inherent in every MapReduce environment and is currently handled manually by the end user.
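To make the skew problem concrete, here is a small Python sketch. It is not the scheduler described above; it only assumes, for illustration, that intermediate keys are assigned to reducers by hashing (a common partitioning strategy), and the functions partition and reducer_loads and the sample data are hypothetical.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Assign a key to a reducer with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(pairs, num_reducers):
    """Count how many (key, value) pairs each reducer would receive."""
    loads = Counter()
    for key, _ in pairs:
        loads[partition(key, num_reducers)] += 1
    return [loads[r] for r in range(num_reducers)]

# Hypothetical intermediate data: one very frequent key and two rare ones.
pairs = [("the", 1)] * 1000 + [("lorem", 1)] * 10 + [("ipsum", 1)] * 10
loads = reducer_loads(pairs, num_reducers=4)
mean_load = sum(loads) / len(loads)
print(loads, "max/mean =", max(loads) / mean_load)
```

All pairs with the same key go to the same reducer, so the reducer that receives the frequent key does almost all of the work while the others sit nearly idle. A skew-aware scheduler would aim to detect and rebalance this kind of imbalance.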
Thanks 
Get this ppt from: http://goo.gl/6fJpbn 
Credits: 
[1] Prosunjit Biswas, UTSA 
[2] Carina C. Zona, Rackspace 
[3] Ryan McKinney, Rackspace 
References: 
[1] ZeroVM: http://www.zerovm.org 
[2] Apache Hadoop: http://hadoop.apache.org 
[3] Amazon EMR: http://aws.amazon.com/elasticmapreduce 
[4] MapReduce: http://en.wikipedia.org/wiki/MapReduce 
[5] Native Client: A Sandbox for Portable, Untrusted x86 Native Code: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/34913.pdf 
More about ZeroVM 
Website: www.zerovm.org 
GitHub: https://github.com/zerovm/ 
User mailing list: zerovm@googlegroups.com 
IRC: #zerovm on Freenode

MapReduce on ZeroVM for Lightweight Big Data Processing

  • 1. MapReduce on ZeroVM A Lightweight virtualization for Big Data Processing Joy Rahman Research Assistant Cloud and Big Data Lab, UTSA
  • 2. MapReduce and Big Data ● Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. ● MapReduce is a distributed processing framework that supports Big Data Processing. ● A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation. ● MapReduce libraries have been written in many programming languages. A popular open-source implementation is Apache Hadoop (http://hadoop.apache.org/).
  • 3. Let's start with an example Challenge: Count all the words in a file Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Word Count -------- -------- Lorem 5 .... 1 .... 1 .... 1 dummy 1 Any problem with this approach? - Yes, the file may be too big
  • 4. Let's see an example (cont.) A better approach: Divide and Conquer Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Output as (key, value) pairs: Program 1: Lorem, 2; simply, 1; has, 1. Program 2: Lorem, 1; was, 2; has, 5. Program 3: Lorem, 3; from, 2; has, 1. Do you see any problem with this approach?
  • 5. We need to combine the results - We have divided the big input file into multiple pieces so that parallel processes can attack the file simultaneously, lowering the total processing time. - But the result from each process needs to be combined. Partial results: Lorem, 2; simply, 1; has, 1; Lorem, 1; was, 2; has, 5; Lorem, 3; from, 2; has, 1. Combined result: Lorem, 6; simply, 1; has, 7; from, 2; ....
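The divide-and-conquer word count from slides 3 to 5 can be sketched in Python as follows. This is a minimal single-machine illustration, not ZeroVM or Hadoop code: the helper names `split_into_chunks`, `count_words`, and `combine` are hypothetical, the sample sentence is a shortened stand-in for the input file, and the "programs" run one after another here rather than in parallel.

```python
from collections import Counter


def split_into_chunks(words, num_chunks):
    """Divide a list of words into at most num_chunks roughly equal pieces."""
    size = max(1, -(-len(words) // num_chunks))  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]


def count_words(chunk):
    """One 'program': count the words in a single chunk."""
    return Counter(chunk)


def combine(partial_counts):
    """Merge the per-chunk counts into a single result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Shortened, hypothetical input standing in for the big file.
text = "Lorem Ipsum is simply dummy text Lorem Ipsum has been the standard dummy text"
words = text.split()

partials = [count_words(chunk) for chunk in split_into_chunks(words, 3)]
totals = combine(partials)
print(totals["Lorem"], totals["dummy"])
```

Each partial `Counter` plays the role of one program's output, and `combine` is the step that the next slide names Reduce.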
  • 6. MapReduce ● The example we have just seen is a typical MapReduce program for big data processing, ● where the first phase (split-up and processing of the input) is called Map ● and the final phase (the combining of the results) is called Reduce.
  • 7.
  • 8. Formal Definitions ❏ The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. ❏ Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain: Map(k1,v1) → list(k2,v2) The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key. ❏ The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list (v2)) → list(v3) Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
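The signatures on slide 8 can be made concrete with a small Python sketch. It assumes an in-memory, sequential "framework" (the hypothetical `map_reduce` function below), whereas a real MapReduce system applies the Map and Reduce functions in parallel across machines; `word_map` and `word_reduce` are illustrative names, not part of any library.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each (k1, v1), group values by k2, then apply reduce_fn."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # collect all pairs that share a key
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def word_map(line_number, line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def word_reduce(word, counts):
    # Reduce(k2, list(v2)) -> list(v3): a one-element list holding the sum.
    return [sum(counts)]


lines = [
    (0, "Lorem Ipsum is simply dummy text"),
    (1, "Lorem Ipsum has been the standard"),
]
result = map_reduce(lines, word_map, word_reduce)
print(result["Lorem"])
```

Here k1 is a line number, v1 a line of text, k2 a word, v2 the count 1, and v3 the total for that word.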
  • 9. Split [k1, v1] sort by k1 Merge [k1, [v1,v2,v3,...]]
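The sort-then-merge step on slide 9 can be illustrated with the standard library alone. This is a sketch under the assumption that all intermediate pairs fit in memory; the function name `shuffle` is hypothetical, and the input is a subset of the per-program outputs from slide 4.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Sort (key, value) pairs by key and merge the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort by key
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


mapped = [("Lorem", 2), ("simply", 1), ("has", 1),
          ("Lorem", 1), ("has", 5), ("Lorem", 3)]
merged = shuffle(mapped)
print(merged)
```

Sorting first means equal keys sit next to each other, so one pass over the sorted list is enough to build each `[k1, [v1, v2, v3, ...]]` group.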
  • 10. Existing Limitations of Big Data Processing on the Cloud ● Current cloud deployments have two distinct clusters: ○ 1) Computation cluster (e.g., Amazon EC2) ○ 2) Storage cluster (e.g., Amazon S3) ● The computation cluster is used for CPU-intensive processing, whereas the storage cluster is used to store persistent data. ● Running MapReduce on the cloud is costly because considerable overhead is incurred in fetching the data from storage to the computation cluster and putting it back after processing.
  • 11. ex: Amazon EMR Image source & Ref: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html Costly Data Transfer
  • 12. Challenges ● How to avoid the data transfer overhead for big data processing? ○ Answer: Take the computation to the storage cluster. But traditional OS-level virtualization is ● bulky and CPU-intensive to run inside a cluster that is optimized for storage I/O only ● slow to spin up ● expensive to scale horizontally
  • 13. ZeroVM to the rescue ● ZeroVM is an open-source lightweight virtualization platform based on the Chromium Native Client (NaCl) project; NaCl provides the essential isolation through software fault isolation. ● ZeroVM makes it possible to safely execute arbitrary code (C/C++, Python) from untrusted users in multi-tenant environments. ● The ZeroVM core is only 75 KB in size and can spin up in 5 ms. ● Thus it is an ideal candidate to run on top of storage clusters such as OpenStack SWIFT. ● ZeroVM takes computation to the storage, enabling cost-effective MapReduce on the cloud.
  • 14. ZeroVM Properties 1. ZeroVM is small, light, fast, secure, and hyper-scalable. 2. ZeroVM virtualizes the application, not the operating system. 3. Single-threaded (thus deterministic) execution: the same executable produces the same results each time it is run. 4. Predefined resource constraints before execution ● Channel-based I/O ● Predefined socket port / network ● Restricted memory access ● Limited read/write (in bytes) ● Short-lived sessions / predefined session_timeout
  • 15. credit : Ryan McKinney, Senior Software Engineer, Rackspace
  • 16. ZeroCloud ● ZeroCloud is the cloud module that runs on top of SWIFT and provides the facility to run ZeroVM sessions on different servers of the cluster. ● ZeroCloud makes it easy to create large clusters of instances, aggregating the compute power of many individual physical servers into a single execution environment. ● Users can leverage the power of hundreds of physical servers for a few seconds or even milliseconds at a time. ● Horizontal scalability is a key design goal for ZeroVM.
  • 17. ZeroCloud (on SWIFT) swift proxy with zerocloud Object Server REQ Resp GET/POST Object Server Object Server Object Server apps zerovm session apps zerovm session if (exec) spawn if (exec) spawn user supplies the job description with the executables (apps) result result job desc Openstack SWIFT Cluster
  • 18. MapReduce on ZeroVM ● ZeroVM running on ZeroCloud is inherently targeted at big data processing, particularly in the MapReduce style. ● Users can have multi-stage jobs, and any stage can connect with another stage. ● Users need to provide only the executables. ● Since the data is already inside the SWIFT cluster, an execution job request through GET/POST is enough to start the big data processing instantly and obtain the result. ● This ensures data locality and eliminates the costly data transfer.
  • 19. Demonstration Would you like to give ZeroVM a try? http://zebra.zerovm.org/
  • 20. Our Research on ZeroVM ● There is a good deal of ongoing research on ZeroVM. ● The UTSA Big Data and Cloud Lab has some ongoing research projects. ● Currently I am working under the supervision of Dr. Lama to improve MapReduce on ZeroVM. ● Our project involves developing a scheduler for ZeroCloud that is optimized for data locality and is interference-, heterogeneity-, and skew-aware.
  • 21. Our Research on ZeroVM (contd.) ● Data locality is of great importance for big data processing. ● The current implementation ensures data locality for the Map phase, since the executables are run on the input data. ● We would like to optimize and ensure data locality for the Reduce phase. ● We would like to design a scheduler that intelligently mitigates the data/computational skew problem (which is inherent in every MapReduce environment) and is currently handled manually by the end user.
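To show what data skew means for the Reduce phase, here is a small Python sketch. It is not the scheduler described on the slides: it uses a generic greedy heuristic (largest group first, always assigned to the least-loaded reducer), and the key sizes, the function names `greedy_assign` and `reducer_loads`, and the choice of two reducers are all hypothetical.

```python
import heapq


def greedy_assign(key_sizes, num_reducers):
    """Assign keys to reducers, largest group first, always to the lightest reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + size, reducer))
    return assignment


def reducer_loads(key_sizes, assignment, num_reducers):
    """Total number of values each reducer has to process."""
    loads = [0] * num_reducers
    for key, size in key_sizes.items():
        loads[assignment[key]] += size
    return loads


# Hypothetical number of values per key after the Map phase.
key_sizes = {"the": 90, "of": 40, "and": 35, "Lorem": 10, "Ipsum": 10, "dummy": 5}
assignment = greedy_assign(key_sizes, 2)
loads = reducer_loads(key_sizes, assignment, 2)
print(loads)
```

One heavy key ("the") dominates the workload here, so an assignment that ignores group sizes can leave one reducer far busier than the other; balancing by size is one simple way to reduce that gap.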
  • 22. Thanks Get this ppt from: http://goo.gl/6fJpbn Credits: [1] Prosunjit Biswas, UTSA [2] Carina C. Zona, Rackspace [3] Ryan McKinney, Rackspace References: [1] ZeroVM: http://www.zerovm.org [2] Apache Hadoop: http://hadoop.apache.org [3] Amazon EMR: http://aws.amazon.com/elasticmapreduce [4] MapReduce: http://en.wikipedia.org/wiki/MapReduce [5] Native Client: A Sandbox for Portable, Untrusted x86 Native Code: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/34913.pdf More about ZeroVM Website: www.zerovm.org GitHub: https://github.com/zerovm/ User mailing list: zerovm@googlegroups.com IRC: #zerovm on Freenode