MapReduce and Hadoop
- Salil Navgire
Big Data Explosion
• 90% of today's data was created in the last 2 years
• Data volume roughly doubles every 18 months (a growth rate often likened to Moore's law)
• YouTube: 13 million hours of video uploaded and 700 billion views in 2010
• Facebook: 20 TB/day of new data (compressed)
• CERN/LHC: 40 TB/day (15 PB/year)

• Many more examples
Solution: Scalability
How?
Divide and Conquer
Challenges!
• How to assign units of work to the workers?

• What if there are more units of work than workers?
• What if the workers need to share intermediate
incomplete data?

• How do we aggregate such intermediate data?
• How do we know when all workers have completed
their assignments?

• What if some workers fail?
History
• 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
• 2002: Apache Nutch: distributed, scalable open-source web crawler
• 2004: Google publishes the GFS and MapReduce papers
• 2006: Apache Hadoop: open-source Java implementation of GFS and MapReduce to solve Nutch's scaling problem; later becomes a standalone project
What is MapReduce?
• A programming model for distributing a task across multiple nodes (its two core functions are sketched below)
• Used to process large amounts of data in parallel on clusters of computing nodes

• Original MapReduce paper by Google
• Features of MapReduce:
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
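
The abstraction amounts to two user-supplied functions. A minimal sketch of their shapes against Hadoop's Java mapreduce API (the class names and concrete type parameters here are illustrative, not from the slides):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map:    (K1, V1)           -> intermediate (K2, V2) pairs, one call per input record
// reduce: (K2, iterable<V2>) -> final (K3, V3) pairs, one call per distinct intermediate key
public class MapReduceShape {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx) {
            // emit zero or more intermediate pairs with ctx.write(key, value)
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) {
            // aggregate all values seen for this key and emit the result with ctx.write(...)
        }
    }
}
```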
MapReduce Execution Overview
[Diagram, after the original MapReduce paper: the User Program forks a Master and a pool of Workers. The Master assigns map tasks over the input splits (Split 0, Split 1, Split 2) and assigns reduce tasks. Map workers read their splits and write intermediate results to local disk; reduce workers remote-read and sort that intermediate data, then write the final results to Output File 0 and Output File 1.]
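
In Hadoop terms, the "User Program" in this diagram is a small driver that configures a job and submits it for scheduling. A minimal sketch (it reuses the illustrative MyMapper/MyReducer skeletons from the earlier slide; input and output paths are taken from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my job");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MapReduceShape.MyMapper.class);    // map phase
        job.setReducerClass(MapReduceShape.MyReducer.class);  // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input is divided into splits
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // one output file per reducer
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until all tasks complete
    }
}
```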
Hadoop Components
• Storage: HDFS (self-healing, high-bandwidth clustered storage)
• Processing: MapReduce (fault-tolerant distributed processing)
HDFS Architecture
HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive
amounts of data

• Runs on commodity hardware
HDFS Data
• Data is split into blocks and stored on
multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times, with replicas stored on different DataNodes (see the sketch below)
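
For illustration, a file's block size and replication factor can be read back through the Java HDFS client API. A minimal sketch (it assumes a reachable cluster configured via the usual core-site.xml/hdfs-site.xml, and an existing HDFS file path passed on the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the configured (typically HDFS) filesystem

        Path file = new Path(args[0]);              // an existing file in HDFS
        FileStatus status = fs.getFileStatus(file);

        // Per-file block size and replication factor, as chosen when the file was written
        System.out.println("block size : " + status.getBlockSize() + " bytes");
        System.out.println("replication: " + status.getReplication() + " copies of each block");
    }
}
```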
2 Types of Nodes
• Master Nodes
• Slave Nodes
Master Node
• NameNode
• only 1 per cluster
• metadata server and database
• SecondaryNameNode helps with some housekeeping

• JobTracker
• only 1 per cluster
• job scheduler
Slave Nodes
• DataNodes
• 1-4000 per cluster
• block data storage

• TaskTrackers
• 1-4000 per cluster
• task execution
NameNode
• A single NameNode stores all metadata and manages block replication and read/write access to files
• Filenames, the DataNode locations of each block, owner, group, etc.
• All information maintained in RAM for fast
lookup
Secondary NameNode
• Performs memory-intensive housekeeping for the NameNode (periodically checkpointing its metadata)
• Should run on a separate machine
Data Node
• DataNodes store file contents
• Different blocks of the same file will be
stored on different DataNodes
• Same block is stored on three (or more)
DataNodes for redundancy
Word Count Example
• Input
• Text files

• Output
• Single file containing (Word <TAB> Count)

• Map Phase
• Generates (Word, Count) pairs
• [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]

• Reduce Phase
• For each word, calculates the aggregate count (a concrete mapper/reducer sketch follows)
• [{a,7}, {b,5}, {c,6}]
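
A minimal sketch of this word count using Hadoop's Java API, filling in the illustrative skeletons from earlier (the mapper emits (word, 1) for every whitespace-separated token; the reducer sums the counts per word):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for every word in the input line, emit (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum all counts emitted for one word and write (word, total)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            ctx.write(word, total);
        }
    }
}
```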
Typical Cluster
• 3-4000 commodity servers

• Each server
• 2x quad-core
• 16-24 GB RAM

• 4-12 TB disk space

• 20-30 servers per rack
When Should I Use It?
Good choice for jobs that can be broken into independent, parallelizable tasks:

• Indexing/analysis of log files
• Sorting of large data sets
• Image processing/machine learning

Bad choice for serial or low-latency jobs:
• Real-time processing
• Compute-intensive tasks on small amounts of data
• Replacing MySQL
Who uses Hadoop?
