SlideShare a Scribd company logo
1 of 19
Hadoop Distributed File System
(HDFS)
Big Data Concepts
• Volume
– No more GBs of data
– TB,PB,EB,ZB
• Velocity
– High frequency data like
in stocks
• Variety
– Structure and
Unstructured data
Challenges In Big Data
• Complex
– No proper understanding
of the underlying data
• Storage
– How to accommodate
large amount of data in
single physical machine
• Performance
– How to process large
amount of data efficiently
and effectively so as to
increase the performance
Challenges in Traditional Application
• Network
– Limited bandwidth
• Data
– Growth of data can’t be
controlled
• Efficiency & Performance
– How fast data can be read
• Processing capacity of machine
– Processor, RAM is a bottleneck
Statistics
Application Size(MB) Data Size Total Round trip time(sec)
10 10 MB 1+1 = 2
10 100MB 10+10 = 20
10 1000 MB = 1GB 100 + 100 = 200 (~3.3 min)
10 1000 GB= 1TB 100000 + 100000 = ~55 Hour
• Calculation is done under ideal condition
• No processing time is taken into consideration
Assuming N/W bandwidth is 10MBPS
• How data is read ?
• Line by Line reading
• Depends on seek rate and disc latency
Average Data Transfer rate = 75MB/sec
Total Time to read 100GB = 22 min
Total time to read 1TB = 3 hours
How much time you take to sort 1TB of data??
Enough time to
watch a movie, while
data is being read
Statistics(Contd.)
Observation
• Large amount of data takes lot of
time to read
• Data is moved back and forth
over the low latency network
where application is running
– 90% of the time is consumed in
data transfer
• Application size is constant
Conclusion
• Achieving Data Localization
– Move application close to data
Or
– Move data close to application
Summary
• Storage is problem
– Cannot store large amount of data
– Upgrading the hard disk will also not solve the problem
(Hardware limitation)
• Performance degradation
– Upgrading RAM will not solve the problem (Hardware
limitation)
• Reading
– Larger data requires larger time to read
Solution Approach
• Distributed Framework
– Storing the data across several
machine
– Performing computation
parallel across several
machines
• Should Support
– Partial failures
– Recoverability
– Data availability
– Consistency
– Data reliability
– Upgrading
Introducing Hadoop
Distributed framework that provides scaling in :
• Storage
• Performance
• IO Bandwidth
What makes Hadoop special?
• No high end or expensive systems are required
• Can run on Linux, Mac OS/X, Windows, Solaris
• Fault tolerant system
– Execution of the job continues even of nodes are failing
• Highly reliable and efficient storage system
• In built intelligence to speed up the application
– Speculative execution
• Fit for lot of applications:
– Web log processing
– Page Indexing, page ranking
– Complex event processing
Features of Hadoop
• Partition, replicate and distributes the data
– Data availability, consistency
• Performs Computation closer to the data
– Data Localization
• Performs computation across several hosts
– MapReduce framework
Hadoop Components
• Hadoop is bundled with two independent
components
– HDFS (Hadoop Distributed File System)
• Designed for scaling in terms of storage and IO
bandwidth
– MR framework (MapReduce)
• Designed for scaling in terms of performance
Understanding file structure
1 GB file
File is
split into
blocks
Each block is
typically
64MB
Each block is stored as
two files – one holding
data and second for
metadata, checksum
Block
Hadoop Processes
• Processes running on Hadoop
– NameNode
– DataNode
– Secondary NameNode
– Task Tracker
– Job Tracker
NameNode
• Single point of contact
• HDFS master
• Holds meta information
– List of files and directories
– Location of blocks
• Single node per cluster
– Cluster can have thousands of
DataNodes and tens of
thousands of HDFS client.
NameNode
DataNode
• Can execute multiple tasks concurrently
• Holds actual data blocks, checksum and
generation stamp
• If block is half full, needs only half of the space of
full block
• At start-up, connects to NameNode and perform
handshake
• No binding to IP address or port, uses Storage ID
• Sends heartbeat to NameNode
DataNode
Storage ID: XYZ001
Communication
• Total Storage Capacity
• Fraction of storage in
use
• No of data transfer
currently in progress
• Instructs DataNode
• Replicate block to other node
• Remove local block replica
• Send immediate block report
• Shut down the node
Every 3
seconds.
“I AM ALIVE”
NameNode
DataNode
Storage ID: XYZ001 DataNode
Storage ID: XYZ002
DataNode
Storage ID: XYZ003
Reply
No
heartbeat
for 10
minutes
Heartbeat
Overview of HDFS
HDFS Client

More Related Content

What's hot

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapakapa rohit
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of HadoopKnoldus Inc.
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemSteve Loughran
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationAdam Kawa
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User ReferenceBiju Nair
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisSameer Tiwari
 

What's hot (20)

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of Hadoop
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
 

Similar to Hadoop Distributed File System (HDFS) Explained

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxNIKHILGR3
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoopyaevents
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceDerek Chen
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemRajkumar Singh
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 

Similar to Hadoop Distributed File System (HDFS) Explained (20)

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Chapter2.pdf
Chapter2.pdfChapter2.pdf
Chapter2.pdf
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 

Hadoop Distributed File System (HDFS) Explained

  • 1. Hadoop Distributed File System (HDFS)
  • 2. Big Data Concepts • Volume – No more GBs of data – TB,PB,EB,ZB • Velocity – High frequency data like in stocks • Variety – Structure and Unstructured data
  • 3. Challenges In Big Data • Complex – No proper understanding of the underlying data • Storage – How to accommodate large amount of data in single physical machine • Performance – How to process large amount of data efficiently and effectively so as to increase the performance
  • 4. Challenges in Traditional Application • Network – Limited bandwidth • Data – Growth of data can’t be controlled • Efficiency & Performance – How fast data can be read • Processing capacity of machine – Processor, RAM is a bottleneck
  • 5. Statistics Application Size(MB) Data Size Total Round trip time(sec) 10 10 MB 1+1 = 2 10 100MB 10+10 = 20 10 1000 MB = 1GB 100 + 100 = 200 (~3.3 min) 10 1000 GB= 1TB 100000 + 100000 = ~55 Hour • Calculation is done under ideal condition • No processing time is taken into consideration Assuming N/W bandwidth is 10MBPS • How data is read ? • Line by Line reading • Depends on seek rate and disc latency Average Data Transfer rate = 75MB/sec Total Time to read 100GB = 22 min Total time to read 1TB = 3 hours How much time you take to sort 1TB of data?? Enough time to watch a movie, while data is being read
  • 6. Statistics(Contd.) Observation • Large amount of data takes lot of time to read • Data is moved back and forth over the low latency network where application is running – 90% of the time is consumed in data transfer • Application size is constant Conclusion • Achieving Data Localization – Move application close to data Or – Move data close to application
  • 7. Summary • Storage is problem – Cannot store large amount of data – Upgrading the hard disk will also not solve the problem (Hardware limitation) • Performance degradation – Upgrading RAM will not solve the problem (Hardware limitation) • Reading – Larger data requires larger time to read
  • 8. Solution Approach • Distributed Framework – Storing the data across several machine – Performing computation parallel across several machines • Should Support – Partial failures – Recoverability – Data availability – Consistency – Data reliability – Upgrading
  • 9. Introducing Hadoop Distributed framework that provides scaling in : • Storage • Performance • IO Bandwidth
  • 10. What makes Hadoop special? • No high end or expensive systems are required • Can run on Linux, Mac OS/X, Windows, Solaris • Fault tolerant system – Execution of the job continues even of nodes are failing • Highly reliable and efficient storage system • In built intelligence to speed up the application – Speculative execution • Fit for lot of applications: – Web log processing – Page Indexing, page ranking – Complex event processing
  • 11. Features of Hadoop • Partition, replicate and distributes the data – Data availability, consistency • Performs Computation closer to the data – Data Localization • Performs computation across several hosts – MapReduce framework
  • 12. Hadoop Components • Hadoop is bundled with two independent components – HDFS (Hadoop Distributed File System) • Designed for scaling in terms of storage and IO bandwidth – MR framework (MapReduce) • Designed for scaling in terms of performance
  • 13. Understanding file structure 1 GB file File is split into blocks Each block is typically 64MB Each block is stored as two files – one holding data and second for metadata, checksum Block
  • 14. Hadoop Processes • Processes running on Hadoop – NameNode – DataNode – Secondary NameNode – Task Tracker – Job Tracker
  • 15. NameNode • Single point of contact • HDFS master • Holds meta information – List of files and directories – Location of blocks • Single node per cluster – Cluster can have thousands of DataNodes and tens of thousands of HDFS client. NameNode
  • 16. DataNode • Can execute multiple tasks concurrently • Holds actual data blocks, checksum and generation stamp • If block is half full, needs only half of the space of full block • At start-up, connects to NameNode and perform handshake • No binding to IP address or port, uses Storage ID • Sends heartbeat to NameNode DataNode Storage ID: XYZ001
  • 17. Communication • Total Storage Capacity • Fraction of storage in use • No of data transfer currently in progress • Instructs DataNode • Replicate block to other node • Remove local block replica • Send immediate block report • Shut down the node Every 3 seconds. “I AM ALIVE” NameNode DataNode Storage ID: XYZ001 DataNode Storage ID: XYZ002 DataNode Storage ID: XYZ003 Reply No heartbeat for 10 minutes Heartbeat