Hadoop Distributed File System (HDFS) Explained

Big Data Concepts
• Volume
– Data is no longer measured in GBs
– TB, PB, EB, ZB
• Velocity
– High-frequency data, as in stock trading
• Variety
– Structured and unstructured data

Challenges In Big Data
• Complexity
– No proper understanding of the underlying data
• Storage
– How to accommodate a large amount of data on a single physical machine
• Performance
– How to process a large amount of data efficiently and effectively to improve performance

Challenges in Traditional Application
• Network
– Limited bandwidth
• Data
– Growth of data can't be controlled
• Efficiency & Performance
– How fast data can be read
• Processing capacity of the machine
– Processor and RAM are bottlenecks

Statistics
Assuming a network bandwidth of 10 MB/s:

Application size (MB)   Data size          Total round-trip time (sec)
10                      10 MB              1 + 1 = 2
10                      100 MB             10 + 10 = 20
10                      1000 MB = 1 GB     100 + 100 = 200 (~3.3 min)
10                      1000 GB = 1 TB     100000 + 100000 = 200000 (~55 hours)

• Calculation is done under ideal conditions
• No processing time is taken into consideration
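The round-trip figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the stated 10 MB/s bandwidth and that the data crosses the network once in each direction; the function name is illustrative only.

```python
# Assumption from the slide: network bandwidth of 10 MB per second.
BANDWIDTH_MB_PER_SEC = 10

def round_trip_seconds(data_mb, bandwidth=BANDWIDTH_MB_PER_SEC):
    """Time to move the data to the application and back, ignoring processing."""
    one_way = data_mb / bandwidth
    return 2 * one_way

# Same data sizes as the table, expressed in MB (1 GB = 1000 MB here).
for label, size_mb in [("10 MB", 10), ("100 MB", 100),
                       ("1 GB", 1000), ("1 TB", 1_000_000)]:
    secs = round_trip_seconds(size_mb)
    print(f"{label:>6}: {secs:>8.0f} s  ({secs / 3600:.2f} h)")
```

The application size (10 MB) does not appear in the formula because it stays constant; only the data size drives the transfer time.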
• How is data read?
– Line-by-line reading
– Depends on seek rate and disk latency
• Average data transfer rate = 75 MB/sec
– Total time to read 100 GB = ~22 min
– Total time to read 1 TB = ~3.7 hours
• How long would it take to sort 1 TB of data?
– Enough time to watch a movie while the data is being read
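The read-time estimates follow directly from the 75 MB/sec transfer rate. A minimal sketch, assuming a purely sequential read on one disk and 1 GB = 1000 MB; under those assumptions 1 TB comes out nearer 3.7 hours than 3.

```python
# Assumption from the slide: average data transfer rate of 75 MB per second.
TRANSFER_RATE_MB_PER_SEC = 75

def read_seconds(size_gb, rate=TRANSFER_RATE_MB_PER_SEC):
    """Seconds needed to read size_gb gigabytes sequentially from one disk."""
    return size_gb * 1000 / rate

print(f"100 GB: {read_seconds(100) / 60:.1f} min")
print(f"1 TB:   {read_seconds(1000) / 3600:.1f} h")
```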

Statistics(Contd.)
Observation
• A large amount of data takes a lot of time to read
• Data is moved back and forth over a low-bandwidth network to where the application is running
– 90% of the time is consumed in data transfer
• Application size is constant
Conclusion
• Achieve data localization
– Move the application close to the data
or
– Move the data close to the application

Summary
• Storage is a problem
– Cannot store a large amount of data
– Upgrading the hard disk will not solve the problem (hardware limitation)
• Performance degradation
– Upgrading RAM will not solve the problem (hardware limitation)
• Reading
– Larger data takes longer to read

Solution Approach
• Distributed framework
– Storing the data across several machines
– Performing computation in parallel across several machines
• Should Support
– Partial failures
– Recoverability
– Data availability
– Consistency
– Data reliability
– Upgrading

Introducing Hadoop
A distributed framework that provides scaling in:
• Storage
• Performance
• IO Bandwidth

What makes Hadoop special?
• No high-end or expensive systems are required
• Can run on Linux, Mac OS X, Windows, Solaris
• Fault-tolerant system
– Execution of the job continues even if nodes fail
• Highly reliable and efficient storage system
• Built-in intelligence to speed up the application
– Speculative execution
• A good fit for many applications:
– Web log processing
– Page Indexing, page ranking
– Complex event processing

Features of Hadoop
• Partitions, replicates and distributes the data
– Data availability, consistency
• Performs computation closer to the data
– Data Localization
• Performs computation across several hosts
– MapReduce framework
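To make "partition, replicate and distribute" concrete, here is a minimal sketch that assigns each block of a file to several nodes. The round-robin rule, the replication factor of 3 and the node names are assumptions chosen for illustration; they are not taken from the slides and this is not HDFS's actual placement policy.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes, round-robin.

    Hypothetical placement rule for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        # Consecutive offsets modulo len(nodes) are distinct
        # because replication <= len(nodes).
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement

nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical hosts
for block, holders in place_blocks(4, nodes).items():
    print(block, holders)
```

Because every block lives on more than one node, losing a single node leaves each block still readable from another copy.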

Hadoop Components
• Hadoop is bundled with two independent components
– HDFS (Hadoop Distributed File System)
• Designed for scaling in terms of storage and IO bandwidth
– MR framework (MapReduce)
• Designed for scaling in terms of performance
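The MapReduce programming model can be illustrated with a word count that runs in a single Python process. This is only a sketch of the idea (map, group by key, reduce); it does not use the Hadoop API, and the function names and sample data are made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Each inner list stands for an input split that would live on a different host.
splits = [["big data big"], ["data hadoop"]]
mapped = [pair for split in splits for line in split for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls would run on the hosts that hold each split, which is what allows performance to scale with the number of machines.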

Understanding file structure
• A file (for example, 1 GB) is split into blocks
• Each block is typically 64 MB
• Each block is stored as two files: one holding the data and a second holding metadata and checksums
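The block arithmetic can be sketched as follows, assuming the 64 MB block size mentioned above and treating 1 GB as 1024 MB. The function name is illustrative only.

```python
BLOCK_SIZE_MB = 64  # typical block size given on the slide

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block holds only the leftover data.
        sizes.append(remainder)
    return sizes

print(len(split_into_blocks(1024)))  # number of blocks in a 1 GB file
print(split_into_blocks(100))        # a 100 MB file: one full block, one partial
```

A file whose size is not a multiple of 64 MB ends in a partial block that holds only the remaining data.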

Hadoop Processes
• Processes running on Hadoop
– NameNode
– DataNode
– Secondary NameNode
– TaskTracker
– JobTracker

NameNode
• Single point of contact
• HDFS master
• Holds meta information
– List of files and directories
– Location of blocks
• Single node per cluster
– Cluster can have thousands of DataNodes and tens of thousands of HDFS clients
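The two kinds of meta information held by the NameNode can be pictured as two lookup tables. This is a minimal sketch with hypothetical file paths and block IDs; it is not how the NameNode is actually implemented.

```python
# File path -> ordered list of block IDs (the list of files and directories).
namespace = {
    "/logs/web.log": ["blk_1", "blk_2"],
}

# Block ID -> Storage IDs of the DataNodes holding a copy (location of blocks).
block_locations = {
    "blk_1": ["XYZ001", "XYZ002"],
    "blk_2": ["XYZ002", "XYZ003"],
}

def locate(path):
    """Return (block ID, DataNode list) pairs a client needs to read the file."""
    return [(block, block_locations[block]) for block in namespace[path]]

for block, holders in locate("/logs/web.log"):
    print(block, holders)
```

A client asks the NameNode only for this mapping and then reads the blocks from the DataNodes themselves, which is why the NameNode is the single point of contact.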

DataNode
• Can execute multiple tasks concurrently
• Holds actual data blocks, checksums and generation stamps
• A half-full block needs only half the space of a full block
• At start-up, connects to the NameNode and performs a handshake
• No binding to IP address or port; identified by its Storage ID (for example, XYZ001)
• Sends heartbeats to the NameNode

Communication
• Heartbeat: every 3 seconds each DataNode tells the NameNode "I AM ALIVE" and reports
– Total storage capacity
– Fraction of storage in use
– Number of data transfers currently in progress
• Reply: the NameNode instructs the DataNode to
– Replicate a block to another node
– Remove a local block replica
– Send an immediate block report
– Shut down the node
• No heartbeat for 10 minutes: the NameNode considers the DataNode out of service
[Diagram: NameNode exchanging heartbeats and replies with DataNodes (Storage IDs XYZ001, XYZ002, XYZ003)]
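The heartbeat exchange can be sketched as a small bookkeeping class on the NameNode side. The class and method names are hypothetical and time is passed in explicitly as seconds to keep the example deterministic; treating a DataNode as dead after 10 silent minutes is the interpretation assumed here for the "no heartbeat for 10 minutes" case.

```python
DEAD_AFTER_SEC = 10 * 60  # from the slide: no heartbeat for 10 minutes

class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping; not the real Hadoop classes."""

    def __init__(self):
        self.last_seen = {}  # Storage ID -> time of last heartbeat (seconds)
        self.reports = {}    # Storage ID -> latest reported statistics
        self.pending = {}    # Storage ID -> commands for the next reply

    def queue_command(self, storage_id, command):
        """Remember an instruction to deliver with the next heartbeat reply."""
        self.pending.setdefault(storage_id, []).append(command)

    def heartbeat(self, storage_id, now, capacity_gb, used_fraction, transfers):
        """Record one heartbeat and return the reply (a list of commands)."""
        self.last_seen[storage_id] = now
        self.reports[storage_id] = {
            "capacity_gb": capacity_gb,
            "used_fraction": used_fraction,
            "transfers": transfers,
        }
        return self.pending.pop(storage_id, [])

    def dead_nodes(self, now):
        """Storage IDs that have been silent for at least DEAD_AFTER_SEC."""
        return sorted(sid for sid, seen in self.last_seen.items()
                      if now - seen >= DEAD_AFTER_SEC)

monitor = HeartbeatMonitor()
monitor.heartbeat("XYZ001", now=0, capacity_gb=500, used_fraction=0.4, transfers=2)
monitor.heartbeat("XYZ002", now=0, capacity_gb=500, used_fraction=0.7, transfers=0)
monitor.queue_command("XYZ001", "send immediate block report")
reply = monitor.heartbeat("XYZ001", now=597, capacity_gb=500, used_fraction=0.4, transfers=1)
print(reply)
print(monitor.dead_nodes(now=600))
```

Commands ride on the heartbeat reply, so the NameNode never has to open a connection to a DataNode on its own.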
