Slide 2 (www.edureka.co/hadoop-admin)
Objectives of this Session
At the end of this module, you will be able to:
Understand cluster planning
Set up a fully distributed Hadoop cluster
Add new nodes to a running cluster
Upgrade an existing Hadoop cluster
Understand NameNode High Availability
Slide 4
Hadoop Administrator
With the rise of Hadoop adoption and usage across various industries, the role of the Hadoop Administrator has become very important and is in demand.
Slide 8
Top 5 Hadoop Admin Tasks
Task 1: Cluster Planning
Task 2: Hadoop Cluster Setup
Task 3: Hadoop Version Upgrade
Task 4: Adding or Removing Nodes in the Cluster
Task 5: Providing High Availability to the Cluster
Slide 10
Hadoop Cluster: A Typical Use Case

Active NameNode
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant power supply

Secondary NameNode
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant power supply

Standby NameNode (optional)
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant power supply

DataNodes (the cluster runs several identical DataNodes, each with this spec)
RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Slide 11
Cluster Growth Based on Storage Capacity
Planning cluster growth based on storage capacity is often a good method to use!

Data grows by approximately 5 TB per week
HDFS is set up to replicate each block three times
Thus, 15 TB of extra storage space is required per week
Assume overheads to be 30%
Assuming machines with 5 x 3 TB hard drives, this equates to roughly one new machine required each week
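The sizing arithmetic above can be sketched as a quick back-of-the-envelope calculation. All figures are the slide's assumptions; note that once the 30% overhead is applied to the node's raw capacity, the answer comes out slightly above one machine per week:

```python
# Back-of-the-envelope cluster growth estimate, using the figures
# from the slide above (all values are the slide's assumptions).

raw_growth_tb_per_week = 5      # new data arriving per week
replication_factor = 3          # HDFS replicates each block three times
overhead = 0.30                 # assumed overhead (temp data, OS, etc.)

# Raw HDFS space consumed per week
hdfs_growth = raw_growth_tb_per_week * replication_factor   # 15 TB

# Usable capacity of one slave with 5 x 3 TB drives, after overhead
node_raw_tb = 5 * 3
node_usable_tb = node_raw_tb * (1 - overhead)               # 10.5 TB

machines_per_week = hdfs_growth / node_usable_tb
print(f"HDFS growth: {hdfs_growth} TB/week")
print(f"Usable capacity per node: {node_usable_tb} TB")
print(f"Nodes needed per week: {machines_per_week:.2f}")
```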
Slide 12
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components: save the money, buy more nodes!
"A cluster with more nodes performs better than one with fewer, slightly faster nodes."

General configuration (depends on requirements): a 'base' configuration for a slave node:
» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special configuration: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications.
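The "multiples of (1 hard drive + 2 cores + 6-8 GB RAM)" guideline can be sketched as a small sizing helper. This is a hypothetical function, not from the slides; the 6 GB-per-unit default is the lower end of the slide's range:

```python
# Sketch: express a slave node's capacity in multiples of the 'base
# unit' from the slide (1 hard drive + 2 cores + 6-8 GB RAM).

def base_units(drives, cores, ram_gb, ram_per_unit_gb=6):
    """Return how many balanced base units a node can support;
    the scarcest resource is the limiting factor."""
    return min(drives // 1, cores // 2, ram_gb // ram_per_unit_gb)

# The recommended base config: 4 drives, 2 quad-core CPUs, 24 GB RAM
print(base_units(drives=4, cores=8, ram_gb=24))  # 4 balanced units
```

If one resource lags behind (say only 2 drives), it caps the whole node, which is why the slide recommends growing all three together.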
Slide 13
Slave Nodes: More Details (RAM)
Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM.
Slave nodes should not be using virtual memory (swap).
RULE OF THUMB: total number of tasks = 1.5 x number of processor cores.
Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system.
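The rule of thumb above can be worked through numerically. The 2 GB-per-task figure is the upper end of the slide's range, and the 4 GB allowance for the DataNode, TaskTracker and OS is an illustrative assumption, not a number from the slide:

```python
# Rule-of-thumb sketch from the slide: task slots ≈ 1.5 x cores,
# each Map/Reduce task needs 1-2 GB RAM, plus headroom for the
# DataNode, TaskTracker and OS (assumed 4 GB here, illustrative).

def ram_needed_gb(cores, ram_per_task_gb=2, daemon_os_gb=4):
    tasks = int(1.5 * cores)          # rule of thumb: 1.5 x cores
    return tasks * ram_per_task_gb + daemon_os_gb

# An 8-core slave node at the conservative 2 GB per task:
print(ram_needed_gb(8))  # 12 tasks * 2 GB + 4 GB = 28 GB
```

This is why the 24-32 GB recommendation on the previous slide lines up with a dual quad-core slave node.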
Slide 14
Master Node Hardware Recommendations
A Master Node requires:
Carrier-class hardware (not commodity hardware)
Dual power supplies
Dual Ethernet cards (bonded to provide failover)
RAID hard drives
At least 32 GB of RAM
Slide 16
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
No daemons, everything runs in a single JVM
Has no DFS
Suitable for running MapReduce programs during development

Pseudo-Distributed Mode
Hadoop daemons run on the local machine

Fully-Distributed Mode
Hadoop daemons run on a cluster of machines
Slide 18
Configuration Files

Configuration Filename(s) | Description
hadoop-env.sh, yarn-env.sh | Settings for the Hadoop daemons' process environment.
core-site.xml | Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml | Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml | Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml | Configuration settings for MapReduce applications.
slaves | A list of machines (one per line) that each run a DataNode and NodeManager.
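As an illustration of the shared XML format of the *-site.xml files, a minimal core-site.xml might look like the sketch below. The hostname and port are placeholders; fs.defaultFS is the standard Hadoop 2.x property naming the default file system URI:

```xml
<!-- Minimal core-site.xml sketch; namenode-host:9000 is a placeholder -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```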
Slide 19
Hadoop Daemons
NameNode daemon
» Runs on the master node of the Hadoop Distributed File System (HDFS)
» Directs DataNodes to perform their low-level I/O tasks
DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work
ResourceManager
» Runs on the master node of the data processing system (MapReduce)
» Global resource scheduler
NodeManager
» Runs on each slave node of the data processing system
» Platform for the data processing tasks
JobHistoryServer
» Responsible for servicing all job-history-related requests from clients
Slide 20
Hadoop 1.x and Hadoop 2.x Ecosystem

Hadoop 1.x stack:
HDFS (Hadoop Distributed File System) at the bottom
MapReduce Framework on top of HDFS
Pig Latin (data analysis), Hive (DW system) and HBase on top of MapReduce
Apache Oozie (workflow) alongside the stack
HBase handles structured data; Pig and Hive handle unstructured/semi-structured data

Hadoop 2.x stack:
HDFS (Hadoop Distributed File System) at the bottom
YARN (cluster resource management) on top of HDFS
MapReduce Framework, other YARN frameworks (MPI, Giraph) and HBase on top of YARN
Pig Latin (data analysis) and Hive (DW system) on top of the MapReduce Framework
Apache Oozie (workflow) alongside the stack
Slide 23
Hadoop Version Upgrade
1. Stop the MapReduce cluster and all client applications running on the DFS cluster
2. Take a backup of the file system namespace
3. Install the new version of the Hadoop software
4. Update all the configuration files in the new Hadoop installation
5. Start the NameNode with the upgrade command
6. Compare the new HDFS file system with the previous version's file system namespace
7. Finalize the upgrade
Slide 24
Hadoop Version Upgrade
1) Run reports
• FSCK
• LSR
• DFSADMIN
2) Take backups
• Configuration
• Applications
• Data and metadata
3) Install the new version of Hadoop
4) Upgrade
• hadoop-daemon.sh start namenode -upgrade
5) Run new reports
• FSCK
• LSR
• DFSADMIN
• Compare the old and new reports
• Test the new cluster
6) Finalize the upgrade
• hadoop dfsadmin -finalizeUpgrade
Slide 27
Add (Commission) DataNodes
1. Update the network addresses in the 'include' files (dfs.include, mapred.include)
2. Update the NameNode: hadoop dfsadmin -refreshNodes
3. Update the JobTracker: hadoop mradmin -refreshNodes
4. Update the 'slaves' file
5. Start the DataNode and TaskTracker:
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
6. Cross-check the Web UI to ensure the node was added successfully
7. Run the Balancer to redistribute HDFS blocks to the new DataNodes
Slide 30
High Availability (HA)
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
High Availability can be achieved in two different ways:
HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes
HA using NFS for shared storage instead of the QJM
Slide 31
HA Architecture
The HA architecture consists of:
An Active NameNode and a Standby NameNode, which share edits through a set of JournalNodes: the Active NameNode writes edits to the JournalNodes and the Standby NameNode reads them to stay in sync
A Failover Controller (one Active, one Standby) alongside each NameNode, which monitors the status and health of its NameNode and manages the HA state
A ZooKeeper service, which the Failover Controllers use for coordination
Slave nodes (DataNodes), which send block reports and heartbeats to both NameNodes
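A QJM-based HA setup of this kind is typically expressed in hdfs-site.xml roughly as in the sketch below. The nameservice ID 'mycluster' and all hostnames are placeholders; the property names are the standard Hadoop 2.x HA settings:

```xml
<!-- hdfs-site.xml sketch for QJM-based HA (hostnames are placeholders) -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2:8020</value>
  </property>
  <!-- The JournalNodes holding the shared edit log -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>
  <!-- Let the ZooKeeper-based failover controllers switch NameNodes -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

The ZooKeeper quorum itself (ha.zookeeper.quorum) is configured in core-site.xml.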