Introduction to Hadoop and MapReduce
Colin Su, Tagtoo
Advertisement System Architecture (now)
Advertisement System Architecture (future)
• Grid
• Ad Server
• Data Highway
• Stream Computing
Grid
• Core:
• Data mining
• Machine Learning
• Collects data from users and logs, and calculates the strategy
• Organizes the data into a suitable form so that it can be used at any time

Data -> Information
Ad Server
• Ranking
• Decides which ad should be served, according to the “information” in Grid
• Shows suitable ads to website visitors
Data Highway
• Transfer your data to the proper place
Stream Computing
• Core:
• logging
• feedback
• anti-cheating
• pricing
• post-processes everything emitted by the Ad Server and feeds useful information back to Grid
• serves as the entrance to the advertisement system
Hadoop
• an open-source software framework for storing and processing large data sets across clusters of machines
• derives from Google’s MapReduce and Google File System (GFS) papers
• written in Java
• can be divided into 2 components:
• MapReduce
• HDFS (Hadoop Distributed File System)
• a yellow elephant
Why Hadoop?
• moving computation is much cheaper and easier than moving data
• “Big Data”: the amount of data has become too large, so an effective way to manage it is needed
• the same goes for computation
• high fault tolerance
• created by Doug Cutting and Mike Cafarella; much of its early development took place at Yahoo!
MapReduce
• a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster
• different from the map/reduce concept in functional programming, but both share the same idea: “divide and conquer”
• proposed by Google
Functional “map/reduce”
• map()/reduce() in Python
• map(function(elem), list) -> list
• reduce(function(elem1, elem2), list) -> single result
• e.g.
• map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8]
• reduce(lambda x,y: x+y, [1,2,3,4]) => 10
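The calls above are written in Python 2 style. In Python 3, map() returns a lazy iterator and reduce() lives in the functools module, so a runnable version of the same example might look like this (a minimal sketch; the variable names are only illustrative):

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map() applies the function to each element; list() materializes the lazy iterator.
doubled = list(map(lambda x: x * 2, numbers))

# reduce() folds the list pairwise from the left: ((1 + 2) + 3) + 4.
total = reduce(lambda x, y: x + y, numbers)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```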
Parallel “MapReduce” 5 Steps
• prepare the map() input for mappers
• mappers run the map() code -> generated intermediate pairs
• dispatch intermediate pairs to reducers
• reducers run the reduce() code, aggregate the results
• prepare output from the result of reduce()
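The five steps can be sketched in a single process as follows. This is only a toy model, not how Hadoop implements them: run_mapreduce, map_fn, reduce_fn and num_mappers are hypothetical names, the "mappers" run one after another rather than in parallel, and a dictionary stands in for the network shuffle.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_mappers=2):
    # Step 1: prepare the map() input by splitting the records among mappers.
    splits = [records[i::num_mappers] for i in range(num_mappers)]

    # Step 2: each mapper runs map_fn and emits intermediate (key, value) pairs.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # Step 3: dispatch the pairs so that all values of a key end up together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Step 4: reducers run reduce_fn once per key to aggregate its values.
    reduced = {key: reduce_fn(key, values) for key, values in groups.items()}

    # Step 5: prepare the output, here as a list of pairs sorted by key.
    return sorted(reduced.items())


# Usage: count how many words there are of each length.
words = ["a", "bb", "cc", "ddd"]
result = run_mapreduce(words, lambda w: [(len(w), 1)], lambda k, vs: sum(vs))
print(result)  # [(1, 1), (2, 2), (3, 1)]
```

Only map_fn and reduce_fn are specific to the problem; the splitting, dispatching and output steps stay the same from job to job, which is the part a framework such as Hadoop takes over.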
Example of “MapReduce” Word Count

• map()
• reduce()
Example of “MapReduce” Word Count
• Original Input

Apple Orange Mongo
Orange Grapes Plum
...
Example of “MapReduce” Word Count
• Prepare data for mappers
Split 1: Apple Orange Mongo
Split 2: Orange Grapes Plum
...
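One way to picture this preparation step is to turn the raw text into one record per line, keyed by where the line starts. Hadoop's default text input format hands mappers (byte offset, line) pairs in a similar way; the helper below is only a sketch of that idea, and make_records is a hypothetical name.

```python
def make_records(text):
    """Yield (byte_offset, line) records, one per input line."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        # Advance by the encoded length so the key is a byte offset.
        offset += len(line.encode("utf-8"))


text = "Apple Orange Mongo\nOrange Grapes Plum\n"
records = list(make_records(text))
print(records)  # [(0, 'Apple Orange Mongo'), (19, 'Orange Grapes Plum')]
```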
Example of “MapReduce” Word Count
• map() to useful record
Input: Apple Orange Mongo
Intermediate key/value pairs:
(Apple, 1)
(Orange, 1)
(Mongo, 1)
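A mapper for this step could be as small as the function below. It is a sketch that assumes words are separated by whitespace; map_word_count is a hypothetical name.

```python
def map_word_count(line):
    """Emit one (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


pairs = map_word_count("Apple Orange Mongo")
print(pairs)  # [('Apple', 1), ('Orange', 1), ('Mongo', 1)]
```

The mapper does not add anything up. It emits a 1 for every occurrence and leaves the counting to the reducers.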
Example of “MapReduce” Word Count
• sort and shuffle
Unsorted intermediate pairs are sorted by key, then shuffled to the reducers so that each reducer receives all pairs for a given key:
Reducer: (Apple, 1) (Apple, 1)
Reducer: (Mongo, 1) (Mongo, 1)
Reducer: (Orange, 1) (Orange, 1) (Orange, 1)
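The sort and shuffle step can be approximated as below. The sketch assumes that a reducer is chosen by hashing the key modulo the number of reducers, which is a common choice, and uses zlib.crc32 so that the assignment is stable between runs. The names partition and shuffle_and_sort, and the order of the sample pairs, are illustrative only.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # A stable hash, so a given key always lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    # One bucket per reducer, mapping each key to the list of its values.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order.
    return [sorted(bucket.items()) for bucket in buckets]


unsorted_pairs = [
    ("Apple", 1), ("Mongo", 1), ("Apple", 1), ("Orange", 1),
    ("Orange", 1), ("Mongo", 1), ("Orange", 1),
]
for index, reducer_input in enumerate(shuffle_and_sort(unsorted_pairs, 3)):
    print(index, reducer_input)
```

Which reducer receives which key depends on the hash, but all pairs for one key always arrive at the same reducer.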
Example of “MapReduce” Word Count
• reduce()
Reducer: (Apple, 1) (Apple, 1) -> (Apple, 2)
Reducer: (Orange, 1) (Orange, 1) (Orange, 1) -> (Orange, 3)
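A matching reducer only has to sum the values it receives for a key. This is a minimal sketch, and reduce_word_count is a hypothetical name.

```python
def reduce_word_count(word, counts):
    """Sum all the counts emitted for one word."""
    return word, sum(counts)


print(reduce_word_count("Apple", [1, 1]))      # ('Apple', 2)
print(reduce_word_count("Orange", [1, 1, 1]))  # ('Orange', 3)
```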
Example of “MapReduce” Word Count
• Generate Output
Reducer results: (Apple, 2) (Orange, 3) (Grapes, 1) (Plum, 5)
WordCount.txt:
Apple 2
Orange 3
Grapes 1
Plum 5
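Producing the final file contents from the reducer results is a matter of formatting. The sketch below renders one "word count" line per pair, separated by a space as on the slide; format_output is a hypothetical name, and the text is printed rather than written so the example has no side effects.

```python
def format_output(results):
    """Render (word, count) pairs as one 'word count' line each."""
    return "".join(f"{word} {count}\n" for word, count in results)


results = [("Apple", 2), ("Orange", 3), ("Grapes", 1), ("Plum", 5)]
output = format_output(results)
print(output, end="")
```

The same string could be written to a file such as WordCount.txt with an ordinary open(..., "w") call.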
Hadoop Infrastructure
• Pig: Programming Language for MapReduce
• Thrift: cross-language communication, similar to Google’s Protocol Buffers
• ZooKeeper: cluster management

[Diagram: the Hadoop stack, showing HDFS, MapReduce, Pig, Thrift, ZooKeeper and other services]
