MAP REDUCE FEST
TACKLING THE MULTI-HEADED DRAGON
GENOVEVA VARGAS-SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
Genoveva.Vargas@imag.fr
http://www.vargas-solar.com/teaching/map-reduce-fest/
http://www.vargas-solar.com
Use of the memory and computing capacity of computers and servers distributed across the world and connected by a
network (e.g. the Internet)
[Figure: applications, hardware and data volumes of ubiquitous computing. Data volumes range to peta- (10^15), exa- (10^18), zetta- (10^21) and yotta-scale (10^24); storage spans magnetic tape and RAID; applications must be reactive, real-time and adaptable.]
2
Modern information societies are defined by vast repositories of data, both public and private.
Any practical application must be able to scale up to datasets of interest.
+
EXECUTION MODEL: MAP-REDUCE
¡  Programming model for expressing distributed computations on massive amounts of data, and an execution
framework for large-scale data processing on clusters of commodity servers
¡  Open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project)
¡  Divide-and-conquer principle: partition a large problem into smaller sub-problems
¡  To the extent that the sub-problems are independent, they can be tackled in parallel by different workers
¡  Intermediate results from each individual worker are then combined to yield the final output
¡  Large-data processing requires bringing data and code together for computation to occur:
→ no small feat for datasets that are terabytes, and perhaps petabytes, in size!
3
+
MAP-REDUCE ELEMENTS
¡  Stage 1: apply a user-specified computation over all input records in a dataset
¡  These operations occur in parallel and yield intermediate output (key-value pairs)
¡  Stage 2: aggregate the intermediate output with another user-specified computation
¡  Recursively applies a function over the list of values associated with each key
¡  The execution framework coordinates the actual processing
¡  "MapReduce" refers both to the programming model and to the implementation of the execution framework
4
+
MAP REDUCE EXAMPLE
Count the number of occurrences of every word in a text collection
5
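The two stages above can be sketched in plain Python. This is an illustrative in-memory toy, not the Hadoop API: `mapper`, `shuffle` and `reducer` are names chosen here to mirror the roles the framework plays.

```python
from collections import defaultdict
from itertools import chain

def mapper(record):
    # Stage 1: emit one (word, 1) pair per word in the input record.
    for word in record.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Stage 2: aggregate all values observed for one key.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = chain.from_iterable(mapper(d) for d in documents)
counts = dict(reducer(k, vs) for k, vs in shuffle(intermediate).items())
print(counts["the"])  # → 3
```

In a real cluster the same three roles run on many machines, with the shuffle moving data over the network between the map and reduce stages.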
+
EXECUTION FRAMEWORK
¡  The important idea behind MapReduce is separating the what of distributed processing from the how
¡  A MapReduce program (job) consists of
¡  code for mappers and reducers, packaged together with
¡  configuration parameters (such as where the input lies and where the output should be stored)
¡  Execution framework responsibility: scheduling
¡  Each MapReduce job is divided into smaller units called tasks
¡  In large jobs, the total number of tasks may exceed the number that can run on the cluster concurrently →
the framework must manage task queues
¡  Coordination among tasks belonging to different jobs
6
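The queueing behaviour described above can be mimicked with a fixed-size worker pool: submit more tasks than there are slots, and the excess simply waits. A minimal sketch (the slot count and task body are arbitrary choices for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def task(i):
    return i * i  # stand-in for a map or reduce task

# A "cluster" with 4 concurrent task slots running a 12-task job:
# submissions beyond the 4 workers wait in the executor's internal queue,
# which is the situation the MapReduce scheduler must manage.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(12)))

print(len(results))  # → 12: all tasks complete despite only 4 slots
```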
+
MAP-REDUCE ADDITIONAL ELEMENTS
¡  Partitioners are responsible for dividing up the intermediate key space and assigning intermediate
key-value pairs to reducers
¡  the partitioner specifies the reduce task to which an intermediate key-value pair must be copied
¡  Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle-and-
sort phase
7
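A minimal sketch of both elements (not Hadoop's actual interfaces): a hash partitioner routes every occurrence of a key to the same reducer, and a combiner collapses one mapper's `(word, 1)` pairs into partial counts before they are shipped.

```python
from collections import Counter

NUM_REDUCERS = 3

def partition(key, num_reducers=NUM_REDUCERS):
    # Every occurrence of a key maps to the same reducer task.
    return hash(key) % num_reducers

def combine(mapper_output):
    # Local aggregation: collapse (word, 1) pairs emitted by one mapper
    # into (word, partial_count) pairs, shrinking shuffle traffic.
    return list(Counter(k for k, _ in mapper_output).items())

pairs = [("fox", 1), ("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
combined = combine(pairs)                       # 5 pairs become 3
shuffled = {r: [] for r in range(NUM_REDUCERS)}
for key, count in combined:
    shuffled[partition(key)].append((key, count))
print(sum(len(v) for v in shuffled.values()))   # → 3 pairs shipped, not 5
```

Note that a combiner is only safe when the reduce function is associative and commutative, as summation is here.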
+
CONTENT
¡  Introduction
¡  Principle and objective
¡  Map reduce in the context of different
architectures
¡  Dealing with huge data collections
¡  Practical exercise 1: thinking map-reduce
¡  Counting words on an Azure Hadoop
component
¡  Counting words with Pig
¡  Comparing Hadoop and Pig
¡  Programming summarization
patterns
¡  Numerical summarization
¡  Inverted Index summarization
¡  Counting with counters
8
Part I (1 week, 3 hours/day,
01.04.2013 – 05.04.2013)
Part II (1 week, 3 hours/day,
22.04.2013 – 26.04.2013)
+
CONTENT
¡  Data organization patterns
¡  Structured to hierarchical
¡  Partitioning
¡  Binning
¡  Total order sorting
¡  Shuffling
¡  Join patterns
¡  Do you remember how to program
joins?
¡  Reduce side join
¡  Replicated join
¡  Composite join
¡  Cartesian product
¡  Wrapping up: analysis, limitations,
perspectives (3 hours)
9
Part III (1 week, 3 hours/day,
20.05.2013 – 24.05.2013)
Part IV (1 week, 3 hours/day,
10.06.2013 – 14.06.2013)
+
METHODOLOGY: HACKATHON STYLE
¡  Collaborative knowledge construction
¡  Control weeks: get together
¡  Free weeks: team programming work
10
+
AGENDA: DAY 1
¡  10:30 - 11:00 Introduction to the MapReduce Fest (G. Vargas-Solar)
¡  11:00 - 12:30 Group preparation of a presentation on what MapReduce is (in teams)
¡  12:30 - 13:00 Presentation of results
¡  13:00 - 13:30 Discussion (moderated by G. Vargas-Solar)
11
12
DIGITAL INFORMATION SCALE
Unit | Size | Meaning
Bit (b) | 1 or 0 | Short for binary digit, after the binary code (1 or 0) computers use to store and process data
Byte (B) | 8 bits | Enough information to create an English letter or number in computer code; the basic unit of computing
Kilobyte (KB) | 2^10 bytes | From "thousand" in Greek. One page of typed text is 2 KB
Megabyte (MB) | 2^20 bytes | From "large" in Greek. The complete works of Shakespeare total 5 MB; a typical pop song is 4 MB
Gigabyte (GB) | 2^30 bytes | From "giant" in Greek. A two-hour film can be compressed into 1-2 GB
Terabyte (TB) | 2^40 bytes | From "monster" in Greek. All the catalogued books in America's Library of Congress total 15 TB
Petabyte (PB) | 2^50 bytes | All letters delivered by America's postal service this year will amount to ca. 5 PB; Google processes 1 PB per hour
Exabyte (EB) | 2^60 bytes | Equivalent to 10 billion copies of The Economist
Zettabyte (ZB) | 2^70 bytes | The total amount of information in existence this year is forecast to be around 1.27 ZB
Yottabyte (YB) | 2^80 bytes | Currently too big to imagine
In decimal terms: megabyte (10^6), gigabyte (10^9), terabyte (10^12), petabyte (10^15), exabyte (10^18), zettabyte (10^21)
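The table counts in binary steps (2^10 per unit) while the decimal scale uses 10^3 per unit; a few lines of arithmetic show how the two diverge as the units grow:

```python
# Binary (2^10-per-step) vs decimal (10^3-per-step) size units.
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for i, unit in enumerate(units, start=1):
    binary = 2 ** (10 * i)
    decimal = 10 ** (3 * i)
    print(f"{unit}: 2^{10 * i} = {binary:.3e} bytes, "
          f"{binary / decimal:.3f}x its decimal counterpart")

# A binary petabyte is about 12.6% larger than 10^15 bytes.
print(2 ** 50 / 10 ** 15)
```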
13
MASSIVE DATA
Data sources
¡  Information-sensing mobile devices, aerial sensory
technologies (remote sensing)
¡  Software logs, posts to social media sites
¡  Telescopes, cameras, microphones, digital pictures
and videos posted online
¡  Transaction records of online purchases
¡  RFID readers, wireless sensor networks (number
of sensors increasing by 30% a year)
¡  Cell phone GPS signals (increasing 20% a year)
Massive data
¡  Sloan Digital Sky Survey (2000-): its archive contains 140
terabytes. Its successor, the Large Synoptic Survey Telescope
(2016-), will acquire that quantity of data every five days!
¡  Facebook hosts 140 billion photos and will add 70 billion
this year (ca. 1 petabyte). Every two minutes today we snap as
many photos as the whole of humanity took in the 1800s!
¡  Wal-Mart handles more than 1 million customer
transactions every hour, feeding databases estimated at
more than 2.5 petabytes (10^15 bytes)
¡  The Large Hadron Collider (LHC): nearly 15 million billion
bytes per year, i.e. 15 petabytes (10^15). These data require
70,000 processors to be processed!
http://blog.websourcing.fr/infographie-la-vrai-taille-dinternet/
14
WHO PRODUCES DATA ? DIGITAL SHADOW
¡  3,146 billion email addresses, of which 360 million are Hotmail
¡  95.5 million .com domains
¡  2.1 billion Internet users, including 922 million in Asia and 485 million in China
¡  2.4 billion accounts on social networks
¡  1,000 billion videos viewed on YouTube, about 140 per capita on the planet!
¡  60 images posted on Instagram every second
¡  250 million tweets per day in October
75% of this information is generated by individuals (writing documents, taking pictures,
downloading music, etc.), yet it is far less than the amount of information being created about
them in the digital universe
15
BIG DATA
¡  Collection of data sets so large and complex that it becomes difficult to process using on-hand database management
tools or traditional data processing applications.
¡  Challenges include capture, curation, storage, search, sharing, analysis, and visualization within a tolerable elapsed
time
¡  Data growth challenges and opportunities are three-dimensional (3Vs model)
¡  increasing volume (amount of data)
¡  velocity (speed of data in and out)
¡  variety (range of data types and sources)
"Big data are high volume, high velocity, and/or high variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and process optimization.”
16
+
DATA MANAGEMENT WITH RESOURCE CONSTRAINTS
[Figure: architecture- and resource-aware systems and algorithms, built over a storage support and RAM.]
Efficiently manage and exploit data sets according to given storage, memory and computation resources
17
+
DATA MANAGEMENT WITHOUT RESOURCES CONSTRAINTS
Cost-aware management and exploitation of data sets given unlimited storage, memory and computation resources
[Figure: cost-aware, elastic systems and algorithms.]
18
+
CLOUD IN A NUTSHELL
DELIVERY MODELS
1. Cloud Software as a Service (SaaS)
2. Cloud Platform as a Service (PaaS)
3. Cloud Infrastructure as a Service (IaaS)
DEPLOYMENT MODELS
1. Private cloud
2. Public cloud
3. Hybrid cloud
4. Community cloud
PIVOT ISSUES
1. Virtualization
2. Autonomics (automation)
3. Grid computing (job scheduling)
19
20
CLOUD COMPUTING
[Figure: the cloud stack. On-premises packaged software: you manage everything (networking, storage, servers, virtualization, O/S, middleware, runtime, data, applications). Infrastructure as a Service: the vendor manages networking, storage, servers and virtualization; you manage the rest. Platform as a Service: the vendor also manages O/S, middleware and runtime; you manage only data and applications. Software as a Service: the vendor manages the entire stack.]
CLOUD COMPUTING CHARACTERISTICS
¡  On-demand self-service
¡  Available computing capabilities without human interaction
¡  Broad network access
¡  Accessibility from heterogeneous clients (mobile phones, laptops, etc.)
¡  Rapid elasticity
¡  Possibility to automatically and rapidly increase or decrease capabilities
¡  Measured service
¡  Automatic control and optimization of resources
¡  Resource pooling (virtualization, multi-location)
¡  Multi-tenant resources (CPU, disk, memory, network) with a sense of location independence
21
+
STORING DATA
¡  Storage services: MobileMe, Live, Amazon
¡  Virtualization of disk space ← transparency (e.g. 20 GB)
¡  Explicit configuration of the space assigned for files and mail
¡  Distribution, fragmentation and replication of data sets
¡  Decision making for assigning storage space, migrating data, and maintaining stored data
22
Data centre
Persistency support
Enabling virtualisation
platform
+
… WITH RESOURCES CONSTRAINTS
Q1: Which are the most popular products at Starbucks?
Q2: Which are the consumption rules of Starbucks clients?
Distribution and organization of data on disk
Query and data processing on the server
Swap between memory and disk
Data transfer
•  Efficiency => time cost
•  Optimizing memory and
computing cost
23
Efficiently manage and exploit data sets according to given specific storage, memory and computation
resources
+
WITHOUT RESOURCES CONSTRAINTS …
¡  Query evaluation → how, and within which limits?
¡  is no longer completely constrained by resource availability: computing, RAM, storage, network services
¡  Decision-making process determined by resource consumption and consumer requirements
¡  Data involved in the query, particularly in the result, can have different costs: the top 5 free, and the
rest available in return for a credit card number
¡  Results storage and exploitation demand more resources
24
Costly => minimizing cost and energy
consumption
Cost-aware management and exploitation of data sets given unlimited storage, memory and computation resources
BIGVALUE FROM BIG DATA
¡  Big data is not only about the original content stored or being consumed, but also about the
information around its consumption
¡  Gigabytes of stored content can generate a petabyte or more of transient data that we typically don't
store: digital TV signals we watch but don't record, or voice calls that are digitized in the network
backbone for the duration of a call
¡  Big data technologies describe a new generation of technologies and architectures, designed to
economically extract value from very large volumes of a wide variety of data, by enabling
high-velocity capture, discovery, and/or analysis
http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
Big data is not a "thing" but rather a dynamic/activity that crosses many IT borders
30
WHAT ARE THE FORCES BEHIND THE EXPLOSIVE GROWTH OF
THE DIGITAL UNIVERSE?
¡  Technology has helped by driving down the cost of creating, capturing, managing, and storing
information
¡  … but the prime mover is financial
¡  Since 2005, the investment by enterprises in the digital universe has increased 50%, to $4
trillion
¡  spent on hardware, software, services, and staff to create, manage, store, and derive revenues from
the digital universe
¡  the trick is to generate value by extracting the right information from the digital universe
31
BIG DATA TECH ECONOMY
¡  A data-driven world guided by a rapid ROD (Return on Data)
→ reducing cost, complexity and risk, and increasing the value of your holdings, thanks to a mastery of
technologies
¡  The key is how quickly data can be turned into currency by:
¡  Analysing patterns and spotting relationships/trends that enable decisions to be made faster, with more
precision and confidence
¡  Identifying actions and bits of information that are out of compliance with company policies, which can avoid
millions in fines
¡  Proactively reducing the amount of data you pay to review ($18,750/gigabyte in eDiscovery) by identifying
only the relevant pieces of information
¡  Optimizing storage by deleting or offloading non-critical assets to cheaper cloud storage, thus saving
millions in archive solutions
32
Contact: Genoveva Vargas-Solar, CNRS, LIG-LAFMIA
Genoveva.Vargas@imag.fr
http://www.vargas-solar.com
http://code.google.com/p/exschema/
Want to put cloud data management into practice?
cf. http://www.vargas-solar.com/teaching
36
http://code.google.com/p/model2roo/
Open source polyglot persistence tools
37
+
PRINCIPLE
38
+
PRINCIPLE
39
+
PRINCIPLE
40
+
DEFINITION
TECHNOLOGICAL VISION
¡  Cloud computing is a model for enabling
on-demand network access to a shared
pool of virtualized computing resources
¡  e.g., networks, servers, storage,
applications, mobile devices and
services
¡  Rapidly provisioned and released with
minimal management effort or service
provider interaction (self-service model
through APIs or web portals)
MARKET VISION
¡  Pay-per-use (or pay-as-you-go) billing
models
¡  An application may exist to
¡  run a job for a few minutes or hours
¡  provide services to customers on a
long-term basis
¡  Billing is based on resource consumption:
CPU hours, volumes of data moved, or
gigabytes of data stored
41
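Metered billing of this kind is just a sum of usage times unit rate per resource. A sketch with made-up rates (the dollar figures below are invented for illustration, not any provider's real pricing):

```python
# Hypothetical pay-as-you-go rates (assumed values, not real pricing).
RATES = {
    "cpu_hours": 0.10,       # $ per CPU-hour
    "gb_transferred": 0.12,  # $ per GB of data moved
    "gb_stored": 0.023,      # $ per GB-month stored
}

def monthly_bill(usage):
    # Metered billing: charge = sum over resources of usage * unit rate.
    return sum(RATES[resource] * amount for resource, amount in usage.items())

# One small VM running all month, plus some transfer and storage.
usage = {"cpu_hours": 720, "gb_transferred": 50, "gb_stored": 200}
print(round(monthly_bill(usage), 2))  # → 82.6
```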
+
AMBITION
¡  Provides a platform for new software applications that run across a large collection of physically separate
computers, freeing computation from the user's own machine
¡  Heralds a revolution comparable to the PC [J. Larus, MS]
¡  Supplies on-demand Internet computing resources on a vast scale and at low prices
¡  Extends beyond the services offered by Amazon's AWS, Microsoft Azure or Google AppEngine
42
+
DELIVERY MODELS
¡  SOFTWARE AS A SERVICE: applications accessible through
the network (Web services, REST/SOAP)
¡  Salesforce.com (CRM) and Google (Gmail, Google
Apps)
¡  PLATFORM AS A SERVICE: provides services for transparently
managing hardware resources
¡  Salesforce.com (Force.com), Google (Google App
Engine), Microsoft (Windows Azure), Facebook
(Facebook Platform)
¡  INFRASTRUCTURE AS A SERVICE: provides data centre
resources such as CPU, storage and memory
¡  Amazon (EC2/S3) and IBM (Bluehouse)
43
+
DEPLOYMENT MODELS
PUBLIC CLOUD
¡  Run by third parties; applications from
different clients are mixed together on the
cloud's servers, storage systems and
networks
PRIVATE CLOUD
¡  Used by an exclusive client
providing utmost control over
data, security and QoS
¡  The company owns and controls
the infrastructure
44
+
DEPLOYMENT MODELS
HYBRID CLOUD COMMUNITY CLOUD
¡  My data and services anywhere
(online, on home devices, on
mobile devices) from anywhere
¡  Extension of Personal Cloud with
sharing
¡  Support for the Internet of Things,
M2M platforms
45
+
VIRTUALIZATION
¡  Virtual machines and virtual appliances become the standard deployment objects
¡  Abstract the hardware to the point where software stacks can be deployed and redeployed without being
tied to a specific physical server
¡  Servers provide a pool of resources that are harnessed as needed
¡  The relationship of applications to compute, storage, and network resources changes dynamically to meet workload and
business demands
¡  Applications can be deployed and scaled rapidly without having to produce physical servers
46
+
VIRTUALIZATION
¡  Full virtualization is a technique in which a complete installation of one machine is run on another
¡  A system where all software running on the server is within a virtual machine
¡  Applications and operating systems
¡  Means of accessing services on the cloud
¡  A compute cloud is a self-service proposition where a credit card can purchase compute cycles, and a Web
interface or API is used to create virtual machines and establish network relationships between them
47
+
PROGRAMMABLE INFRASTRUCTURE
¡  Cloud provider API
¡  to create an application's initial composition onto virtual machines
¡  to define how it should scale and evolve to accommodate workload changes
¡  Self-monitoring and self-expanding applications
¡  Applications are built by assembling and configuring appliances and software
¡  Cloud services must be composable so they can be consumed easily
48
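A self-expanding application of the kind described above can be sketched with a made-up in-memory "provider API" (the `FakeCloud` class and its methods are invented for illustration; a real cloud SDK looks different):

```python
class FakeCloud:
    """Toy stand-in for a cloud provider API that creates VMs on demand."""
    def __init__(self):
        self.vms = []

    def create_vm(self, role):
        # A real API call would provision a machine; here we just record it.
        self.vms.append(role)
        return len(self.vms) - 1  # VM id

    def count(self, role):
        return sum(1 for r in self.vms if r == role)

def autoscale(cloud, requests_per_second, per_server_capacity=100):
    # Self-monitoring policy: keep enough web servers for the current load.
    needed = -(-requests_per_second // per_server_capacity)  # ceil division
    while cloud.count("web") < needed:
        cloud.create_vm("web")

cloud = FakeCloud()
cloud.create_vm("load_balancer")
cloud.create_vm("dbms")
autoscale(cloud, requests_per_second=350)
print(cloud.count("web"))  # → 4 servers for 350 req/s at 100 req/s each
```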
+
PROGRAMMABLE INFRASTRUCTURE
49
[Figure: LIBRARY → CONFIGURE PATTERN → DEPLOYMENT. A pattern composed of a load balancer, a web server and a DBMS is taken from a library, configured, and deployed as a load balancer in front of several web servers and replicated DBMS instances.]
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYGenoveva Vargas-Solar
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYGenoveva Vargas-Solar
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPCGenoveva Vargas-Solar
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtGenoveva Vargas-Solar
 

Plus de Genoveva Vargas-Solar (7)

Aiccsa 2021-w-stem
Aiccsa 2021-w-stemAiccsa 2021-w-stem
Aiccsa 2021-w-stem
 
Talk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial IntelligenceTalk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial Intelligence
 
Data w-steamm
Data w-steammData w-steamm
Data w-steamm
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPC
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbt
 

Dernier

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Dernier (20)

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

MAP REDUCE FEST TACKLING MASSIVE DATA

  • 1. MAP REDUCE FEST TACKLING THE MULTI-HEADED DRAGON GENOVEVA VARGAS SOLAR FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE Genoveva.Vargas@imag.fr http://www.vargas-solar.com/teaching/map-reduce-fest/ http://www.vargas-solar.com
  • 2. Use of the memory and computing capacities of all the computers and servers distributed in the world, connected by a network (e.g. the Internet). [Slide diagram: Applications (ubiquitous computing, reactive, real-time, adaptable), Hardware (RAID, magnetic tape), Data Volume (Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24)] Modern information societies are defined by vast repositories of data, both public and private. Any practical application must be able to scale up to datasets of interest. 2
  • 3. + EXECUTION MODEL: MAP-REDUCE ¡  Programming model for expressing distributed computations on massive amounts of data, and an execution framework for large-scale data processing on clusters of commodity servers ¡  Open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project) ¡  Divide-and-conquer principle: partition a large problem into smaller subproblems ¡  To the extent that the subproblems are independent, they can be tackled in parallel by different workers ¡  Intermediate results from each individual worker are then combined to yield the final output ¡  Large-data processing requires bringing data and code together for computation to occur: → no small feat for datasets that are terabytes and perhaps petabytes in size! 3
  • 4. + MAP-REDUCE ELEMENTS ¡  Stage 1: Apply a user-specified computation over all input records in a dataset ¡  These operations occur in parallel and yield intermediate output (key-value pairs) ¡  Stage 2: Aggregate the intermediate output by another user-specified computation ¡  Recursively applies a function on every pair of the list ¡  The execution framework coordinates the actual processing ¡  Implementation of the programming model and the execution framework 4
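The two stages above can be sketched in plain Python. This is an in-memory toy model, not Hadoop's distributed implementation; the function names (`map_reduce`, `mapper`, `reducer`) are ours, chosen for illustration:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory model of the two MapReduce stages.

    Stage 1: apply `mapper` to every input record, yielding
    intermediate (key, value) pairs (parallelizable per record).
    Stage 2: group the intermediate values by key and apply
    `reducer` to each key's list of values.
    """
    intermediate = defaultdict(list)
    for record in records:                      # Stage 1
        for key, value in mapper(record):
            intermediate[key].append(value)
    return {key: reducer(key, values)           # Stage 2
            for key, values in intermediate.items()}

# Example: sum the values associated with each key.
pairs = [("a", 1), ("b", 2), ("a", 3)]
result = map_reduce(pairs,
                    mapper=lambda kv: [kv],
                    reducer=lambda k, vs: sum(vs))
# result == {"a": 4, "b": 2}
```

In the real framework Stage 1 runs on many mappers at once and the grouping happens in the shuffle; here both are simulated with a single dictionary.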
  • 5. + MAP REDUCE EXAMPLE Count the number of occurrences of every word in a text collection 5
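The word-count example on this slide can be written as a pair of map and reduce functions. A minimal Python sketch, assuming simple lowercase word tokenization (the driver `word_count` stands in for the framework's shuffle):

```python
import re
from collections import defaultdict

def wc_mapper(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in re.findall(r"\w+", document.lower()):
        yield word, 1

def wc_reducer(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)

def word_count(documents):
    """Driver standing in for the framework's group-by-key shuffle."""
    grouped = defaultdict(list)
    for doc in documents:
        for word, one in wc_mapper(doc):
            grouped[word].append(one)
    return dict(wc_reducer(w, c) for w, c in grouped.items())

counts = word_count(["to be or not to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

On Hadoop the same mapper/reducer logic would be expressed through the framework's Mapper and Reducer interfaces (or via Hadoop Streaming over stdin/stdout); only the driver differs.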
  • 6. + EXECUTION FRAMEWORK ¡  An important idea behind MapReduce is separating the what of distributed processing from the how ¡  A MapReduce program (job) consists of ¡  code for mappers and reducers, packaged together with ¡  configuration parameters (such as where the input lies and where the output should be stored) ¡  Execution framework responsibilities: scheduling ¡  Each MapReduce job is divided into smaller units called tasks ¡  In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently → manage task queues ¡  Coordination among tasks belonging to different jobs 6
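The queueing behaviour described above (more tasks than concurrent slots) can be modelled with a toy scheduler. This is our own illustrative sketch, not Hadoop's scheduler; `run_job` and `slots` are hypothetical names:

```python
from collections import deque

def run_job(tasks, slots):
    """Toy model: a job split into tasks that exceed the cluster's
    concurrent capacity runs in waves, the rest waiting in a queue."""
    queue = deque(tasks)
    waves = []
    while queue:
        # Take at most `slots` tasks per wave; the remainder stays queued.
        wave = [queue.popleft() for _ in range(min(slots, len(queue)))]
        waves.append([task() for task in wave])
    return waves

# A job of 5 map tasks on a cluster with 2 concurrent slots: 3 waves.
tasks = [(lambda i=i: f"task-{i} done") for i in range(5)]
waves = run_job(tasks, slots=2)
# waves[0] == ["task-0 done", "task-1 done"]; len(waves) == 3
```

Real schedulers additionally handle data locality, failures, and fairness across jobs; the point here is only the task-queue mechanics.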
  • 7. + MAP-REDUCE ADDITIONAL ELEMENTS ¡  Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers ¡  the partitioner specifies the task to which an intermediate key-value pair must be copied ¡  Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase 7
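Both elements can be sketched in a few lines of Python. The hash partitioner mirrors the common default behaviour; the combiner pre-aggregates (word, 1) pairs on the map side. `NUM_REDUCERS` and the function names are illustrative assumptions:

```python
from collections import Counter, defaultdict

NUM_REDUCERS = 3  # assumed cluster setting for this sketch

def partition(key, num_reducers=NUM_REDUCERS):
    """Hash partitioner: decides which reducer task an
    intermediate key-value pair is copied to."""
    return hash(key) % num_reducers

def combine(pairs):
    """Combiner: local aggregation on the map side, summing partial
    counts before the shuffle-and-sort phase moves them over the network."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(mapper_output)        # 4 pairs shrink to 2
shards = defaultdict(list)
for key, value in combined:
    shards[partition(key)].append((key, value))
```

The design point: the combiner changes how much data crosses the network, never the final result, so it must be an associative, commutative reduction (sums and counts qualify; averages do not, unless carried as sum/count pairs).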
  • 8. + CONTENT ¡  Introduction ¡  Principle and objective ¡  Map reduce in the context of different architectures ¡  Dealing with huge data collections ¡  Practical exercise 1: thinking map-reduce ¡  Counting words on an Azure Hadoop component ¡  Counting words with Pig ¡  Comparing Hadoop and Pig ¡  Programming summarization patterns ¡  Numerical summarization ¡  Inverted Index summarization ¡  Counting with counters 8 Part I (1 week 3 hours/day 01.04.2013 – 05.04.2013) Part II (1 week 3 hours/day 22.04.2013 – 26.04.2013)
  • 9. + CONTENT ¡  Data organization patterns ¡  Structured to hierarchical ¡  Partitioning ¡  Binning ¡  Total order sorting ¡  Shuffling ¡  Join patterns ¡  Do you remember how to program joins? ¡  Reduce-side join ¡  Replicated join ¡  Composite join ¡  Cartesian product ¡  Wrapping up: analysis, limitations, perspectives (3 hours) 9 Part III (1 week, 3 hours/day, 20.05.2013 – 24.05.2013) Part IV (1 week, 3 hours/day, 10.06.2013 – 14.06.2013)
  • 10. + METHODOLOGY: HACKATHON STYLE ¡  Collaborative knowledge construction ¡  Control weeks: get together ¡  Free weeks: team programming work 10
  • 11. + AGENDA: DAY 1 ¡  10:30 - 11:00 Introduction to the MapReduce Fest (G. Vargas-Solar) ¡  11:00 - 12:30 Group preparation of a presentation on what MapReduce is (by teams) ¡  12:30 - 13:00 Presentation of results ¡  13:00 - 13:30 Discussion (moderated by G. Vargas-Solar) 11
  • 12. 12
  • 13. DIGITAL INFORMATION SCALE
Bit (b): 1 or 0. Short for binary digit, after the binary code (1 or 0) computers use to store and process data
Byte (B): 8 bits. Enough information to encode an English letter or number in computer code. It is the basic unit of computing
Kilobyte (KB): 2^10 bytes. From "thousand" in Greek. One page of typed text is 2 KB
Megabyte (MB): 2^20 bytes. From "large" in Greek. The complete works of Shakespeare total 5 MB. A typical pop song is 4 MB
Gigabyte (GB): 2^30 bytes. From "giant" in Greek. A two-hour film can be compressed into 1-2 GB
Terabyte (TB): 2^40 bytes. From "monster" in Greek. All the catalogued books in America's Library of Congress total 15 TB
Petabyte (PB): 2^50 bytes. All letters delivered by America's postal service this year will amount to ca. 5 PB. Google processes 1 PB per hour
Exabyte (EB): 2^60 bytes. Equivalent to 10 billion copies of The Economist
Zettabyte (ZB): 2^70 bytes. The total amount of information in existence this year is forecast to be around 1.27 ZB
Yottabyte (YB): 2^80 bytes. Currently too big to imagine
Megabytes (10^6), gigabytes (10^9), terabytes (10^12), petabytes (10^15), exabytes (10^18) and zettabytes (10^21) 13
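The binary prefixes in the table (each step a factor of 2^10) can be checked with a one-line conversion. A small sketch; `UNITS` and `to_bytes` are our own names:

```python
# Binary storage units from the table: each step up is a factor of 2**10.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a quantity expressed in one of the table's units to bytes."""
    return value * (2 ** (10 * UNITS.index(unit)))

kb = to_bytes(1, "KB")        # 1024 bytes
library = to_bytes(15, "TB")  # the Library of Congress figure, in bytes
```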
  • 14. MASSIVE DATA Data sources ¡  Information-sensing mobile devices, aerial sensory technologies (remote sensing) ¡  Software logs, posts to social media sites ¡  Telescopes, cameras, microphones, digital pictures and videos posted online ¡  Transaction records of online purchases ¡  RFID readers, wireless sensor networks (number of sensors increasing by 30% a year) ¡  Cell phone GPS signals (increasing 20% a year) Massive data ¡  Sloan Digital Sky Survey (2000-): its archive contains 140 terabytes. Its successor, the Large Synoptic Survey Telescope (2016-), will acquire that quantity of data every five days! ¡  Facebook hosts 140 billion photos, and will add 70 billion this year (ca. 1 petabyte). Every 2 minutes today we snap as many photos as the whole of humanity took in the 1800s! ¡  Wal-Mart handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes (10^15) ¡  The Large Hadron Collider (LHC): nearly 15 million billion bytes per year, i.e. 15 petabytes (10^15). These data require 70,000 processors to be processed! http://blog.websourcing.fr/infographie-la-vrai-taille-dinternet/ 14
  • 15. WHO PRODUCES DATA? DIGITAL SHADOW ¡  3,146 billion mail addresses, of which 360 million are Hotmail ¡  95.5 million .com domains ¡  2.1 billion Internet users, including 922 million in Asia and 485 million in China ¡  2.4 billion accounts on social networks ¡  1,000 billion videos viewed on YouTube, about 140 on average per capita on the planet! ¡  60 images posted on Instagram every second ¡  250 million tweets per day in October 75% of the information is generated by individuals (writing documents, taking pictures, downloading music, etc.), but this is far less than the amount of information being created about them in the digital universe 15
  • 16. BIG DATA ¡  Collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. ¡  Challenges include capture, curation, storage, search, sharing, analysis, and visualization within a tolerable elapsed time ¡  Data growth challenges and opportunities are three-dimensional (3Vs model) ¡  increasing volume (amount of data) ¡  velocity (speed of data in and out) ¡  variety (range of data types and sources) "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” 16
  • 17. + DATA MANAGEMENT WITH RESOURCE CONSTRAINTS [Slide diagram: Systems (storage support), Algorithms (RAM), architecture- and resources-aware] Efficiently manage and exploit data sets according to given specific storage, memory and computation resources 17
  • 18. + DATA MANAGEMENT WITHOUT RESOURCE CONSTRAINTS Costly to manage and exploit data sets according to unlimited storage, memory and computation resources [Slide diagram: Systems, Algorithms; cost-aware, elastic] 18
  • 19. + CLOUD IN A NUTSHELL DEPLOYMENT MODELS, PIVOT ISSUES, DELIVERY MODELS 19 Delivery models: 1. Cloud Software as a Service (SaaS) 2. Cloud Platform as a Service (PaaS) 3. Cloud Infrastructure as a Service (IaaS). Deployment models: 1. Private cloud 2. Public cloud 3. Hybrid cloud 4. Community cloud. Pivot issues: 1. Virtualization 2. Autonomics (automation) 3. Grid computing (job scheduling)
  • 20. 20 CLOUD COMPUTING [Slide diagram: the stack (applications, data, runtime, middleware, O/S, virtualization, servers, storage, networking) shown four times, split between what you manage and what the vendor manages, for Packaged Software, Infrastructure as a Service, Platform as a Service, and Software as a Service]
  • 21. CLOUD COMPUTING CHARACTERISTICS ¡  On-demand self-service ¡  Available computing capabilities without human interaction ¡  Broad network access ¡  Accessibility from heterogeneous clients (mobile phones, laptops, etc.) ¡  Rapid elasticity ¡  Possibility to automatically and rapidly increase or decrease capabilities ¡  Measured service ¡  Automatic control and optimization of resources ¡  Resource pooling (virtualization, multi-location) ¡  Multi-tenant resources (CPU, disk, memory, network) with a sense of location independence 21
  • 22. + STORING DATA ¡  Storage services: MobileMe, Live, Amazon ¡  Virtualization of disk space ← transparency (20 GB) ¡  Explicit configuration of the space assigned for files and mail ¡  Distribution, fragmentation and replication of data sets ¡  Decision making for assigning storage space, migrating data, and maintaining stored data 22 [Slide diagram: data centre, persistency support, enabling virtualisation platform]
  • 23. + … WITH RESOURCE CONSTRAINTS Q1: Which are the most popular products at Starbucks? Q2: Which are the consumption rules of Starbucks clients? Distribution and organization of data on disk. Query and data processing on the server. Swapping between memory and disk. Data transfer. •  Efficiency => time cost •  Optimizing memory and computing cost 23 Efficiently manage and exploit data sets according to given specific storage, memory and computation resources
  • 24. + WITHOUT RESOURCE CONSTRAINTS … ¡  Query evaluation → How and under which limits? ¡  It is no longer completely constrained by resource availability: computing, RAM, storage, network services ¡  The decision-making process is determined by resource consumption and consumer requirements ¡  Data involved in the query, particularly in the result, can have different costs: the top 5 free and the rest available in return for a credit card number ¡  Result storage and exploitation demand more resources 24 Costly => minimizing cost and energy consumption Costly to manage and exploit data sets according to unlimited storage, memory and computation resources
  • 25. + EXECUTION MODEL: MAP-REDUCE ¡  Programming model for expressing distributed computations on massive amounts of data, and an execution framework for large-scale data processing on clusters of commodity servers ¡  Open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project) ¡  Divide-and-conquer principle: partition a large problem into smaller subproblems ¡  To the extent that the subproblems are independent, they can be tackled in parallel by different workers ¡  Intermediate results from each individual worker are then combined to yield the final output ¡  Large-data processing requires bringing data and code together for computation to occur: → no small feat for datasets that are terabytes and perhaps petabytes in size! 25
  • 26. + MAP-REDUCE ELEMENTS ¡  Stage 1: Apply a user-specified computation over all input records in a dataset ¡  These operations occur in parallel and yield intermediate output (key-value pairs) ¡  Stage 2: Aggregate the intermediate output by another user-specified computation ¡  Recursively applies a function on every pair of the list ¡  The execution framework coordinates the actual processing ¡  Implementation of the programming model and the execution framework 26
  • 27. + MAP REDUCE EXAMPLE Count the number of occurrences of every word in a text collection 27
  • 28. + EXECUTION FRAMEWORK ¡  An important idea behind MapReduce is separating the what of distributed processing from the how ¡  A MapReduce program (job) consists of ¡  code for mappers and reducers, packaged together with ¡  configuration parameters (such as where the input lies and where the output should be stored) ¡  Execution framework responsibilities: scheduling ¡  Each MapReduce job is divided into smaller units called tasks ¡  In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently → manage task queues ¡  Coordination among tasks belonging to different jobs 28
  • 29. + MAP-REDUCE ADDITIONAL ELEMENTS ¡  Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers ¡  the partitioner specifies the task to which an intermediate key-value pair must be copied ¡  Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase 29
  • 30. BIG VALUE FROM BIG DATA ¡  Big data is not only about the original content stored or being consumed, but also about the information around its consumption ¡  Gigabytes of stored content can generate a petabyte or more of transient data that we typically don't store: digital TV signals we watch but don't record, voice calls that are made digital in the network backbone for the duration of a call ¡  Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf Big data is not a "thing" but instead a dynamic/activity that crosses many IT borders 30
  • 31. WHAT ARE THE FORCES BEHIND THE EXPLOSIVE GROWTH OF THE DIGITAL UNIVERSE? ¡  Technology has helped by driving down the cost of creating, capturing, managing, and storing information ¡  … but the prime mover is financial ¡  Since 2005, the investment by enterprises in the digital universe has increased by 50%, to $4 trillion ¡  spent on hardware, software, services, and staff to create, manage, store, and derive revenues from the digital universe ¡  the trick is to generate value by extracting the right information from the digital universe 31
  • 32. BIG DATA TECH ECONOMY ¡  A data-driven world guided by a rapid ROD (Return on Data) → reducing cost, complexity and risk, and increasing the value of your holdings thanks to a mastery of technologies ¡  The key is how quickly data can be turned into currency by: ¡  Analysing patterns and spotting relationships/trends that enable decisions to be made faster, with more precision and confidence ¡  Identifying actions and bits of information that are out of compliance with company policies, which can avoid millions in fines ¡  Proactively reducing the amount of data you pay for ($18,750/gigabyte to review in eDiscovery) by identifying only the relevant pieces of information ¡  Optimizing storage by deleting or offloading non-critical assets to cheaper cloud storage, thus saving millions in archive solutions 32
  • 33. + CONTENT ¡  Introduction ¡  Principle and objective ¡  Map reduce in the context of different architectures ¡  Dealing with huge data collections ¡  Practical exercise 1: thinking map-reduce ¡  Counting words on an Azure Hadoop component ¡  Counting words with Pig ¡  Comparing Hadoop and Pig ¡  Programming summarization patterns ¡  Numerical summarization ¡  Inverted index summarization ¡  Counting with counters 33 Part I: First and common steps Part II (1 week, 3 hours/day, 22.04.2013 – 26.04.2013)
  • 34. + CONTENT ¡  Data organization patterns ¡  Structured to hierarchical ¡  Partitioning ¡  Binning ¡  Total order sorting ¡  Shuffling ¡  Wrapping up: analysis, limitations, perspectives (3 hours) 34 Part III Part IV
  • 35. + METHODOLOGY: HACKATHON STYLE ¡  Collaborative knowledge construction ¡  Control weeks: get together ¡  Free weeks: team programming work 35
  • 36. Contact: Genoveva Vargas-Solar, CNRS, LIG-LAFMIA Genoveva.Vargas@imag.fr http://www.vargas-solar.com http://code.google.com/p/exschema/ Want to put cloud data management into practice? cf. http://www.vargas-solar.com/teaching 36 http://code.google.com/p/model2roo/ Open source polyglot persistence tools
  • 37. 37
  • 41. + DEFINITION TECHNOLOGICAL VISION ¡  Cloud computing is a model for enabling on-demand network access to a shared pool of virtualized computing resources ¡  e.g., networks, servers, storage, applications, devices/mobiles and services ¡  Rapidly provisioned and released with minimal management effort or service provider interaction (self-service model through APIs or web portals) MARKET VISION ¡  Pay-per-use (or pay-as-you-go) billing models ¡  An application may exist to ¡  run a job for a few minutes or hours ¡  provide services to customers on a long-term basis ¡  Billing is based on resource consumption: CPU hours, volumes of data moved, or gigabytes of data stored 41
  • 42. + AMBITION ¡  Provides a platform for new software applications that run across a large collection of physically separate computers, freeing computation from the user's own machine ¡  Heralds a revolution comparable to the PC [J. Larus, MS] ¡  Supplies on-demand internet computing resources on a vast scale and at low price ¡  Exists beyond the services offered by Amazon's AWS, Microsoft Azure or Google AppEngine 42
  • 43. + DELIVERY MODELS ¡  SOFTWARE AS A SERVICE: applications accessible through the network (Web services, REST/SOAP) ¡  Salesforce.com (CRM) and Google (Gmail, Google Apps) ¡  PLATFORM AS A SERVICE: services for transparently managing hardware resources ¡  Salesforce.com (Force.com), Google (Google App Engine), Microsoft (Windows Azure), Facebook (Facebook Platform) ¡  INFRASTRUCTURE AS A SERVICE: data center resources such as CPU, storage and memory ¡  Amazon (EC2/S3) and IBM (Bluehouse) 43
  • 44. + DEPLOYMENT MODELS PUBLIC CLOUD ¡  Run by third parties; applications from different clients are mixed together on the cloud's servers, storage systems and networks PRIVATE CLOUD ¡  Used by an exclusive client, providing utmost control over data, security and QoS ¡  The company owns and controls the infrastructure 44
  • 45. + DEPLOYMENT MODELS HYBRID CLOUD COMMUNITY CLOUD ¡  My data and services anywhere (online, on home devices, on mobile devices), accessible from anywhere ¡  Extension of the personal cloud with sharing ¡  Support for the Internet of Things, M2M platforms 45
  • 46. + VIRTUALIZATION ¡  Virtual machines and virtual appliances become standard deployment objects ¡  Abstract the hardware to the point where software stacks can be deployed and redeployed without being tied to a specific physical server ¡  Servers provide a pool of resources that are harnessed as needed ¡  The relationship of applications to compute, storage, and network resources changes dynamically to meet workload and business demands ¡  Applications can be deployed and scaled rapidly without having to procure physical servers 46
  • 47. + VIRTUALIZATION ¡  Full virtualization is a technique in which a complete installation of one machine is run on another ¡  A system where all software running on the server is within a virtual machine ¡  Applications and operating systems ¡  Means of accessing services on the cloud ¡  A compute cloud is a self-service proposition where a credit card can purchase compute cycles, and a Web interface or API is used to create virtual machines and establish network relationships between them 47
  • 48. + PROGRAMMABLE INFRASTRUCTURE ¡  Cloud provider API ¡  to create an application's initial composition onto virtual machines ¡  to define how it should scale and evolve to accommodate workload changes ¡  Self-monitoring and self-expanding applications ¡  Applications are built by assembling and configuring appliances and software ¡  Cloud services must be composable so they can be consumed easily 48
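A self-expanding application of the kind described above can be sketched as a sizing loop driven through a provider API. `CloudAPI`, `create_vm` and `autoscale` below are hypothetical names for illustration, not any real vendor's SDK:

```python
import math

class CloudAPI:
    """Stand-in for a provider's programmable-infrastructure API (hypothetical)."""
    def __init__(self):
        self.vms = []

    def create_vm(self, image):
        self.vms.append(image)   # would issue a provisioning call via API/portal

    def destroy_vm(self):
        self.vms.pop()           # would release the virtual machine

def autoscale(api, total_load, capacity_per_vm, image="app-appliance"):
    # Self-expanding application: size the VM pool to the current workload
    needed = max(1, math.ceil(total_load / capacity_per_vm))
    while len(api.vms) < needed:
        api.create_vm(image)
    while len(api.vms) > needed:
        api.destroy_vm()
    return len(api.vms)

api = CloudAPI()
autoscale(api, total_load=450, capacity_per_vm=100)  # pool grows to 5 VMs
autoscale(api, total_load=120, capacity_per_vm=100)  # pool shrinks to 2 VMs
```

The loop illustrates the self-monitoring idea: the application itself decides, via the API, how its composition of appliances should grow or shrink with the workload.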