SlideShare une entreprise Scribd logo
1  sur  14
Ali Bahu
10/17/2012
APACHE HADOOP
INTRODUCTION
 Apache Hadoop is an open-source software
framework that supports data-intensive
distributed applications, licensed under the
Apache v2 license. It enables applications to
work with thousands of computation-
independent computers and petabytes of
data.
 Hadoop was derived from Google's
MapReduce and Google File System (GFS)
papers.
 Hadoop is implemented in Java.
WHY HADOOP
 Need to process Multi Petabyte Datasets
 It is expensive to build reliability in each application
 Nodes failure is expected and Hadoop can help
 Number of nodes is not constant
 Efficient, reliable, Open Source Apache License
 Workloads are IO bound and not CPU bound
WHO USES HADOOP
 Amazon/A9
 Facebook
 Google
 IBM
 Joost
 Last.fm
 New York Times
 PowerSet
 Veoh
 Yahoo!
COMMODITY HARDWARE
 Typically in 2 level architecture
 Nodes are commodity PCs
 30-40 nodes/rack
 Uplink from rack is 3-4 gigabit
 Rack-internal is one gigabit
HDFS ARCHITECTURE
HDFS (HADOOP DISTRIBUTED FILE
SYSTEM) Very Large Distributed File System
 10K nodes, 100 million files, 10 PB
 Assumes Commodity Hardware
 Files are replicated to handle hardware failure
 Detect failures and recovers from them
 Optimized for Batch Processing
 Data locations exposed so that computations can
move to where data resides
 Provides very high aggregate bandwidth
 User Space, runs on heterogeneous OS
 Single Namespace for entire cluster
 Data Coherency
 Write-once-read-many access model
 Client can only append to existing files
 Files are broken up into blocks
 Typically 128 MB block size
 Each block replicated on multiple Data Nodes
 Intelligent Client
 Client can find location of blocks
 Client accesses data directly from Data Node
NAME NODE METADATA
 Meta-data in Memory
 The entire metadata is in main memory
 No demand paging of meta-data
 Types of Metadata
 List of files
 List of Blocks for each file
 List of Data Nodes for each block
 File attributes, e.g. creation time, replication factor, etc.
 Transaction Log
 Records file creations, file deletions, etc. is kept here
DATA NODE
Block Server
 Stores data in the local file system (e.g. ext3)
 Stores meta-data of a block (e.g. CRC)
 Serves data and meta-data to Clients
Block Report
 Periodically sends a report of all existing blocks to
the Name Node
Facilitates Pipelining of Data
 Forwards data to other specified Data Nodes
BLOCK PLACEMENT
 Block Placement Strategy
 One replica on local node
 Second replica on a remote rack
 Third replica on same remote rack
 Additional replicas are randomly placed
 Clients read from nearest replica
DATA CORRECTNESS
 Use Checksums to validate data
 Use CRC32, etc.
 File Creation
 Client computes checksum per 512 byte
 Data Node stores the checksum
 File access
 Client retrieves the data and checksum from Data Node
 If Validation fails, Client tries other replicas
NAMENODE FAILURE
 A single point of failure
 Transaction Log stored in multiple directories
 A directory on the local file system
 A directory on a remote file system (NFS/CIFS)
DATA PIPELINING
 Client retrieves a list of DataNodes on which to place replicas of a
block
 Client writes block to the first DataNode
 The first DataNode forwards the data to the next DataNode in the
Pipeline
 When all replicas are written, the Client moves on to write the next
block in file
 Rebalancer is used to ensure that the % disk full on DataNodes are
similar
 Usually run when new DataNodes are added
 Cluster is online when Rebalancer is active
 Rebalancer is throttled to avoid network congestion
 Command line tool
HADOOP MAP/REDUCE
 The Map-Reduce programming model
 Framework for distributed processing of large data sets
 Pluggable user code runs in generic framework
 Common design pattern in data processing
 cat * | grep | sort | unique -c | cat > file
 input | map | shuffle | reduce | output
 Useful for:
 Log processing
 Web search indexing
 Ad-hoc queries, etc.

Contenu connexe

Tendances

Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 
Directory services by SAJID
Directory services by SAJIDDirectory services by SAJID
Directory services by SAJIDSajid khan
 
EKAW - Linked Data Publishing
EKAW - Linked Data PublishingEKAW - Linked Data Publishing
EKAW - Linked Data PublishingRuben Taelman
 
Big data technologies and databases
Big data technologies and databasesBig data technologies and databases
Big data technologies and databasesHariniA7
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemMilad Sobhkhiz
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemBhavesh Padharia
 
Directory services by SAJID
Directory services by SAJIDDirectory services by SAJID
Directory services by SAJIDSajid khan
 
Multidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with OrderMultidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with OrderRuben Taelman
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File Systemtutchiio
 
3. distributed file system requirements
3. distributed file system requirements3. distributed file system requirements
3. distributed file system requirementsAbDul ThaYyal
 
Describing configurations of software experiments as Linked Data
Describing configurations of software experiments as Linked DataDescribing configurations of software experiments as Linked Data
Describing configurations of software experiments as Linked DataJoachim Van Herwegen
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18karenostil
 
Distributed Filesystems Review
Distributed Filesystems ReviewDistributed Filesystems Review
Distributed Filesystems ReviewSchubert Zhang
 
Poster GraphQL-LD: Linked Data Querying with GraphQL
Poster GraphQL-LD: Linked Data Querying with GraphQLPoster GraphQL-LD: Linked Data Querying with GraphQL
Poster GraphQL-LD: Linked Data Querying with GraphQLRuben Taelman
 

Tendances (20)

Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Directory services by SAJID
Directory services by SAJIDDirectory services by SAJID
Directory services by SAJID
 
EKAW - Linked Data Publishing
EKAW - Linked Data PublishingEKAW - Linked Data Publishing
EKAW - Linked Data Publishing
 
Big data technologies and databases
Big data technologies and databasesBig data technologies and databases
Big data technologies and databases
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
 
Directory services by SAJID
Directory services by SAJIDDirectory services by SAJID
Directory services by SAJID
 
Multidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with OrderMultidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with Order
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
 
3. distributed file system requirements
3. distributed file system requirements3. distributed file system requirements
3. distributed file system requirements
 
Describing configurations of software experiments as Linked Data
Describing configurations of software experiments as Linked DataDescribing configurations of software experiments as Linked Data
Describing configurations of software experiments as Linked Data
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18
 
File organization
File organizationFile organization
File organization
 
Chapter13
Chapter13Chapter13
Chapter13
 
File organization
File organizationFile organization
File organization
 
HadoopDB
HadoopDBHadoopDB
HadoopDB
 
Distributed Filesystems Review
Distributed Filesystems ReviewDistributed Filesystems Review
Distributed Filesystems Review
 
Ch11
Ch11Ch11
Ch11
 
Poster GraphQL-LD: Linked Data Querying with GraphQL
Poster GraphQL-LD: Linked Data Querying with GraphQLPoster GraphQL-LD: Linked Data Querying with GraphQL
Poster GraphQL-LD: Linked Data Querying with GraphQL
 
Distributed file systems dfs
Distributed file systems   dfsDistributed file systems   dfs
Distributed file systems dfs
 

Similaire à Hadoop

Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiUnmesh Baile
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemsrikanthhadoop
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Sector Cloudcom Tutorial
Sector Cloudcom TutorialSector Cloudcom Tutorial
Sector Cloudcom Tutoriallilyco
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 

Similaire à Hadoop (20)

HDFS.ppt
HDFS.pptHDFS.ppt
HDFS.ppt
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbai
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop -HDFS.ppt
Hadoop -HDFS.pptHadoop -HDFS.ppt
Hadoop -HDFS.ppt
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Sector Cloudcom Tutorial
Sector Cloudcom TutorialSector Cloudcom Tutorial
Sector Cloudcom Tutorial
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hdfs
HdfsHdfs
Hdfs
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
HDFS.ppt
HDFS.pptHDFS.ppt
HDFS.ppt
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 

Plus de Ali Bahu

Apache Ant
Apache AntApache Ant
Apache AntAli Bahu
 
Apache Ant
Apache AntApache Ant
Apache AntAli Bahu
 
EclipseMAT
EclipseMATEclipseMAT
EclipseMATAli Bahu
 
Cloud computing
Cloud computingCloud computing
Cloud computingAli Bahu
 
Pervasive computing
Pervasive computingPervasive computing
Pervasive computingAli Bahu
 

Plus de Ali Bahu (6)

Apache Ant
Apache AntApache Ant
Apache Ant
 
Jhiccup
JhiccupJhiccup
Jhiccup
 
Apache Ant
Apache AntApache Ant
Apache Ant
 
EclipseMAT
EclipseMATEclipseMAT
EclipseMAT
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Pervasive computing
Pervasive computingPervasive computing
Pervasive computing
 

Dernier

Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 

Dernier (20)

Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 

Hadoop

  • 2. INTRODUCTION  Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It enables applications to work with thousands of computation- independent computers and petabytes of data.  Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.  Hadoop is implemented in Java.
  • 3. WHY HADOOP  Need to process Multi Petabyte Datasets  It is expensive to build reliability in each application  Nodes failure is expected and Hadoop can help  Number of nodes is not constant  Efficient, reliable, Open Source Apache License  Workloads are IO bound and not CPU bound
  • 4. WHO USES HADOOP  Amazon/A9  Facebook  Google  IBM  Joost  Last.fm  New York Times  PowerSet  Veoh  Yahoo!
  • 5. COMMODITY HARDWARE  Typically in 2 level architecture  Nodes are commodity PCs  30-40 nodes/rack  Uplink from rack is 3-4 gigabit  Rack-internal is one gigabit
  • 7. HDFS (HADOOP DISTRIBUTED FILE SYSTEM) Very Large Distributed File System  10K nodes, 100 million files, 10 PB  Assumes Commodity Hardware  Files are replicated to handle hardware failure  Detect failures and recovers from them  Optimized for Batch Processing  Data locations exposed so that computations can move to where data resides  Provides very high aggregate bandwidth  User Space, runs on heterogeneous OS  Single Namespace for entire cluster  Data Coherency  Write-once-read-many access model  Client can only append to existing files  Files are broken up into blocks  Typically 128 MB block size  Each block replicated on multiple Data Nodes  Intelligent Client  Client can find location of blocks  Client accesses data directly from Data Node
  • 8. NAME NODE METADATA  Meta-data in Memory  The entire metadata is in main memory  No demand paging of meta-data  Types of Metadata  List of files  List of Blocks for each file  List of Data Nodes for each block  File attributes, e.g. creation time, replication factor, etc.  Transaction Log  Records file creations, file deletions, etc. is kept here
  • 9. DATA NODE Block Server  Stores data in the local file system (e.g. ext3)  Stores meta-data of a block (e.g. CRC)  Serves data and meta-data to Clients Block Report  Periodically sends a report of all existing blocks to the Name Node Facilitates Pipelining of Data  Forwards data to other specified Data Nodes
  • 10. BLOCK PLACEMENT  Block Placement Strategy  One replica on local node  Second replica on a remote rack  Third replica on same remote rack  Additional replicas are randomly placed  Clients read from nearest replica
  • 11. DATA CORRECTNESS  Use Checksums to validate data  Use CRC32, etc.  File Creation  Client computes checksum per 512 byte  Data Node stores the checksum  File access  Client retrieves the data and checksum from Data Node  If Validation fails, Client tries other replicas
  • 12. NAMENODE FAILURE  A single point of failure  Transaction Log stored in multiple directories  A directory on the local file system  A directory on a remote file system (NFS/CIFS)
  • 13. DATA PIPELINING  Client retrieves a list of DataNodes on which to place replicas of a block  Client writes block to the first DataNode  The first DataNode forwards the data to the next DataNode in the Pipeline  When all replicas are written, the Client moves on to write the next block in file  Rebalancer is used to ensure that the % disk full on DataNodes are similar  Usually run when new DataNodes are added  Cluster is online when Rebalancer is active  Rebalancer is throttled to avoid network congestion  Command line tool
  • 14. HADOOP MAP/REDUCE  The Map-Reduce programming model  Framework for distributed processing of large data sets  Pluggable user code runs in generic framework  Common design pattern in data processing  cat * | grep | sort | unique -c | cat > file  input | map | shuffle | reduce | output  Useful for:  Log processing  Web search indexing  Ad-hoc queries, etc.