2. Agenda
• Different Hadoop daemons & its roles
• How does a Hadoop cluster look like
• Under the Hood:- How does it write a file
• Under the Hood:- How does it read a file
• Under the Hood:- How does it replicate the file
• Under the Hood:- How does it run a job
• How to balance an un-balanced hadoop cluster
3. Hadoop – A bit of background
• It’s an open source project
• Based on 2 technical papers published by Google
• A well known platform for distributed applications
• Easy to scale-out
• Works well with commodity hard wares(not entirely true)
• Very good for background applications
4. Hadoop Architecture
• Two Primary components
Distributed File System (HDFS): It deals with file
operations like read, write, delete & etc
Map Reduce Engine: It deals with parallel computation
5. Hadoop Distributed File System
• Runs on top of existing file system
• A file broken into pre-defined equal sized blocks & stored
individually
• Designed to handle very large files
• Not good for huge number of small files
6. Map Reduce Engine
• A Map Reduce Program consists of map and reduce functions
• A Map Reduce job is broken into tasks that run in parallel
• Prefers local processing if possible
7. Hadoop Server Roles
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
Data Node &
Task Tracker
slaves
masters
Clients
Name NodeJob Tracker
Secondary
Name Node
Map Reduce HDFS
Distributed Data Analytics Distributed Data Storage
9. Typical Workflow
Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
How many times did our customers type the word
“Fraud” into emails sent to customer service?
Sample Scenario:
File.txt
Huge file containing all emails sent
to customer service
BRAD HEDLUND .com
10. Writing files to HDFS
• Client consults Name Node
• Client writes block directly to one Data Node
• Data Nodes replicates block
• Cycle repeats for next block
Name Node
Data Node 1 Data Node 5 Data Node 6 Data Node N
Client
I want to write
Blocks A,B,C of
File.txt
OK. Write to
Data Nodes
1,5,6
Blk A Blk B Blk C
File.txt
Blk A Blk B Blk C
11. switch
switchswitch
Preparing HDFS writes
Name Node
Data Node 1 Data Node 5
Data Node 6
Client
I want to write
File.txt
Block A
OK. Write to
Data Nodes
1,5,6
Blk A Blk B Blk C
File.txt
Ready
Data Nodes
5,6
Ready?
Rack 1 Rack 5
Rack 1:
Data Node 1
Rack 5:
Data Node 5
Data Node 6
Ready
Data Node 6
Rack aware
Ready! • Name Node picks
two nodes in the
same rack, one
node in a different
rack
• Data protection
• Locality for M/R
Ready!
BRAD HEDLUND .com
12. switch
switchswitch
Pipelined Write
Name Node
Data Node 1 Data Node 5
Data Node 6
Client
Blk A Blk B Blk C
File.txt
Rack 1 Rack 5
Rack 1:
Data Node 1
Rack 5:
Data Node 5
Data Node 6
A A
A
• Data Nodes 1
& 2 pass data
along as its
received
• TCP 50010
Rack aware
BRAD HEDLUND .com
13. switch
switchswitch
Pipelined Write
Name Node
Data Node 1 Data Node 2
Data Node 3
Client
Blk A Blk B Blk C
File.txt
Rack 1 Rack 5
Rack 1:
Data Node 1
Rack 5:
Data Node 2
Data Node 3
A A
A
Block received
Success
File.txt
Blk A:
DN1, DN2, DN3
BRAD HEDLUND .com
14. Multi-block Replication Pipeline
Data Node 1 Data Node X
Data Node 3
Rack 1
switch
switch
Client
switch
Blk A Blk A
Blk A
Rack 4
Data Node 2
Rack 5
switch
Data Node Y
Data Node Z
Blk B
Blk B
Blk B
Data Node W
Blk C
Blk C
Blk C
Blk A Blk B Blk C
File.txt 1TB File =
3TB storage
3TB network traffic
BRAD HEDLUND .com
15. Client writes Span the HDFS Cluster
Client
Rack N
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
File.txt
Rack 4
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 3
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 2
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 1
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
• Block size
• File Size
Factors:
More blocks = Wider spread
BRAD HEDLUND .com
16. Hadoop Rack Awareness – Why?
Name Node
File.txt=
Blk A:
DN1, DN5, DN6
Blk B:
DN7, DN1, DN2
Blk C:
DN5, DN8,DN9
metadata
Rack 1
Data Node 1
Data Node 2
Data Node 3
Data Node 5
switch
A
Rack 5
Data Node 5
Data Node 6
Data Node 7
Data Node 8
switch
Rack 9
Data Node 9
Data Node 10
Data Node 11
Data Node 12
switch
• Never loose all data if entire rack fails
• Keep bulky flows in-rack when possible
• Assumption that in-rack is higher bandwidth,
lower latency
A
A
BB
B
C C
C
switch
Rack 1:
Data Node 1
Data Node 2
Data Node 3
Rack 5:
Data Node 5
Data Node 6
Data Node 7
Rack aware
BRAD HEDLUND .com
17. Name Node
• Data Node sends Heartbeats
• Every 10th heartbeat is a Block report
• Name Node builds metadata from Block reports
• TCP – every 3 seconds
• If Name Node is down, HDFS is down
Name Node
Data Node 1 Data Node 2 Data Node 3 Data Node N
A AA CC
DN1: A,C
DN2: A,C
DN3: A,C
metadata
File.txt = A,C
C
File system
Awesome!
Thanks.
I’m
alive!
I have
blocks:
A, C
BRAD HEDLUND .com
18. Re-replicating missing replicas
Name Node
Data Node 1 Data Node 2 Data Node 3 Data Node 8
A AA CC
DN1: A,C
DN2: A,C
DN3: A, C
metadata
Rack1: DN1, DN2
Rack5: DN3,
Rack9: DN8
C
Rack Awareness
Uh Oh!
Missing
replicas
Copy
blocks A,C
to Node 8
A C
• Missing Heartbeats signify lost Nodes
• Name Node consults metadata, finds affected data
• Name Node consults Rack Awareness script
• Name Node tells a Data Node to re-replicate
BRAD HEDLUND .com
19. Secondary Name Node
• Not a hot standby for the Name Node
• Connects to Name Node every hour*
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node
Name Node
metadata
File.txt = A,C
File system
Secondary
Name Node
Its been an hour,
give me your
metadata
BRAD HEDLUND .com
20. Client reading files from HDFS
Name Node
Client
Tell me the
block locations
of Results.txt
Blk A = 1,5,6
Blk B = 8,1,2
Blk C = 5,8,9
Results.txt=
Blk A:
DN1, DN5, DN6
Blk B:
DN7, DN1, DN2
Blk C:
DN5, DN8,DN9
metadata
Rack 1
Data Node 1
Data Node 2
Data Node
Data Node
switch
A
Rack 5
Data Node 5
Data Node 6
Data Node
Data Node
switch
Rack 9
Data Node 8
Data Node 9
Data Node
Data Node
switch
• Client receives Data Node list for each block
• Client picks first Data Node for each block
• Client reads blocks sequentially
A
A
BB
B
C C
C
BRAD HEDLUND .com
21. Data Processing: Map
• Map: “Run this computation on your local data”
• Job Tracker delivers Java code to Nodes with local data
Map Task Map Task Map Task
A B C
Client
How many
times does
“Fraud” appear
in File.txt?
Count
“Fraud”
in Block C
File.txt
Fraud = 3 Fraud = 0 Fraud = 11
Job TrackerName Node
Data Node 1 Data Node 5 Data Node 9
BRAD HEDLUND .com
22. switchswitchswitch
What if d ta isn’t local?
• Job Tracker tries to select Node in same rack as data
• Name Node rack awareness
“I need block A”
Map Task Map Task
B C
Client
How many
times does
“Fraud” appear
in File.txt?
Count
“Fraud”
in Block C
Fraud = 0 Fraud = 11
Job TrackerName Node
Data Node 1
Data Node 5 Data Node 9
“no Map tasks left”
A
Data Node 2
Rack 1 Rack 5 Rack 9
BRAD HEDLUND .com
23. Data Node reading files from HDFS
Name Node
Block A = 1,5,6
File.txt=
Blk A:
DN1, DN5, DN6
Blk B:
DN7, DN1, DN2
Blk C:
DN5, DN8,DN9
metadata
Rack 1
Data Node 1
Data Node 2
Data Node 3
Data Node
switch
A
Rack 5
Data Node 5
Data Node 6
Data Node
Data Node
switch
Rack 9
Data Node 8
Data Node 9
Data Node
Data Node
switch
• Name Node provides rack local Nodes first
• Leverage in-rack bandwidth, single hop
A
A
BB
B
C C
C
Tell me the
locations of
Block A of
File.txt switch
Rack 1:
Data Node 1
Data Node 2
Data Node 3
Rack 5:
Data Node 5
Rack aware
BRAD HEDLUND .com
24. Data Processing: Reduce
• Reduce: “Run this computation across Map results”
• Map Tasks deliver output data over the network
• Reduce Task data output written to and read from HDFS
Fraud = 0
Job Tracker
Reduce Task
Sum
“Fraud”
Results.txt
Fraud = 14
Map Task Map Task Map Task
A B C
Client
HDFS
X Y Z
Data Node 1 Data Node 5 Data Node 9
Data Node 3
BRAD HEDLUND .com
25. Unbalanced Cluster
New Rack
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 2
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 1
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch NEW
switch
New Rack
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch NEW
File.txt
• Hadoop prefers local processing if possible
• New servers underutilized for Map Reduce, HDFS*
• Might see more network bandwidth, slower job times**
**I was assigned
a Map Task but
don’t have th e
block. Guess I
need to get it.
*I’m bored!
BRAD HEDLUND .com
Unbalanced Cluster
New Rack
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 2
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 1
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch NEW
switch
New Rack
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch NEW
File.txt
• Hadoop prefers local processing if possible
• New servers underutilized for Map Reduce, HDFS*
• Might see more network bandwidth, slower job times**
**I was assigned
a Map Task but
don’t have th e
block. Guess I
need to get it.
*I’m bored!
BRAD HEDLUND .com
26. Cluster BalancingCluster Balancing
New Rack
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 2
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch
Rack 1
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch NEW
switch
New Rack
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
switch NEW
File.txt
• Balancer utility (if used) runs in the background
• Does not interfere with Map Reduce or HDFS
• Default speed limit 1 MB/s
brad@cloudera-1:~$hadoop balancer
BRAD HEDLUND .com
27. Quiz
• If you had written a file of size 1TB into
HDFS with replication factor 2, What is
the actual size required by the HDFS to
store this file?
• True/False? Even if Name node goes
down, I still will be able to read files from
HDFS.
28. Quiz
• True/False? In Hadoop Cluster, We can
have a secondary Job Tracker to enhance
the fault tolerance.
• True/False? If Job Tracker goes down, You
will not be able to write any file into
HDFS.
29. Quiz
• True/False? Name node stores the actual
data itself.
• True/False? Name node can be re-built using
the secondary name node.
• True/False? If a data node goes down,
Hadoop takes care of re-replicating the
affected data block.
30. Quiz
• In which scenario, one data node tries to
read data from another data node?
• What are the benefits of Name node’s rack-
awareness?
• True/False? HDFS is well suited for
applications which write huge number of
small files.
31. Quiz
• True/False? Hadoop takes care of
balancing the cluster automatically?
• True/False? Output of Map tasks are
written to HDFS file?
• True/False? Output of Reduce tasks are
written to HDFS file?
32. Quiz
• True/False? In production cluster,
commodity hardware can be used to
setup Name node.
• Thank You