Slide 2 (www.edureka.co/hadoop-admin)
Objectives of this Session
At the end of this module, you will be able to:
Understand cluster planning
Set up a fully distributed Hadoop cluster
Add new nodes to a running cluster
Upgrade an existing Hadoop cluster
Understand NameNode High Availability
Slide 4
Hadoop Administrator
With the rise of Hadoop adoption and usage across various industries, the role of the Hadoop Administrator has become very important and is in demand.
Slide 8
Top 5 Hadoop Admin Tasks
Task 1: Cluster Planning
Task 2: Hadoop Cluster Setup
Task 3: Hadoop Version Upgrade
Task 4: Adding or Removing Nodes in the Cluster
Task 5: Providing High Availability to the Cluster
Slide 10
Hadoop Cluster: A Typical Use Case

Active NameNode
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant power supply

Secondary NameNode
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant power supply

Standby NameNode (optional)
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant power supply

DataNodes (the cluster runs several identical DataNodes, each with this spec)
RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Slide 11
Cluster Growth Based on Storage Capacity
Planning cluster growth based on storage capacity is often a good method to use!

Data grows by approximately 5 TB per week
HDFS is set up to replicate each block three times
Thus, 15 TB of extra storage space is required per week
Assume overheads to be 30%
Assuming machines with 5 x 3 TB hard drives, this equates to roughly one new machine required each week
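The sizing arithmetic above can be sketched as a quick back-of-the-envelope calculation. All figures are the slide's assumptions; note that once the 30% overhead is applied to the node's raw capacity, the answer comes out slightly above one machine per week:

```python
# Back-of-the-envelope cluster growth estimate, using the figures
# from the slide above (all values are the slide's assumptions).

raw_growth_tb_per_week = 5      # new data arriving per week
replication_factor = 3          # HDFS replicates each block three times
overhead = 0.30                 # assumed overhead (temp data, OS, etc.)

# Raw HDFS space consumed per week
hdfs_growth = raw_growth_tb_per_week * replication_factor   # 15 TB

# Usable capacity of one slave with 5 x 3 TB drives, after overhead
node_raw_tb = 5 * 3
node_usable_tb = node_raw_tb * (1 - overhead)               # 10.5 TB

machines_per_week = hdfs_growth / node_usable_tb
print(f"HDFS growth: {hdfs_growth} TB/week")
print(f"Usable capacity per node: {node_usable_tb} TB")
print(f"Nodes needed per week: {machines_per_week:.2f}")
```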
Slide 12
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components: save the money, buy more nodes!
"A cluster with more nodes performs better than one with fewer, slightly faster nodes."

General configuration (depends on requirements): a 'base' configuration for a slave node:
» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special configuration: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications.
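The "multiples of (1 hard drive + 2 cores + 6-8 GB RAM)" guideline can be sketched as a small sizing helper. This is a hypothetical function, not from the slides; the 6 GB-per-unit default is the lower end of the slide's range:

```python
# Sketch: express a slave node's capacity in multiples of the 'base
# unit' from the slide (1 hard drive + 2 cores + 6-8 GB RAM).

def base_units(drives, cores, ram_gb, ram_per_unit_gb=6):
    """Return how many balanced base units a node can support;
    the scarcest resource is the limiting factor."""
    return min(drives // 1, cores // 2, ram_gb // ram_per_unit_gb)

# The recommended base config: 4 drives, 2 quad-core CPUs, 24 GB RAM
print(base_units(drives=4, cores=8, ram_gb=24))  # 4 balanced units
```

If one resource lags behind (say only 2 drives), it caps the whole node, which is why the slide recommends growing all three together.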
Slide 13
Slave Nodes: More Details (RAM)
Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM.
Slave nodes should not be using virtual memory (swap).
RULE OF THUMB: total number of tasks = 1.5 x number of processor cores.
Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system.
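The rule of thumb above can be worked through numerically. The 2 GB-per-task figure is the upper end of the slide's range, and the 4 GB allowance for the DataNode, TaskTracker and OS is an illustrative assumption, not a number from the slide:

```python
# Rule-of-thumb sketch from the slide: task slots ≈ 1.5 x cores,
# each Map/Reduce task needs 1-2 GB RAM, plus headroom for the
# DataNode, TaskTracker and OS (assumed 4 GB here, illustrative).

def ram_needed_gb(cores, ram_per_task_gb=2, daemon_os_gb=4):
    tasks = int(1.5 * cores)          # rule of thumb: 1.5 x cores
    return tasks * ram_per_task_gb + daemon_os_gb

# An 8-core slave node at the conservative 2 GB per task:
print(ram_needed_gb(8))  # 12 tasks * 2 GB + 4 GB = 28 GB
```

This is why the 24-32 GB recommendation on the previous slide lines up with a dual quad-core slave node.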
Slide 14
Master Node Hardware Recommendations
A Master Node requires:
Carrier-class hardware (not commodity hardware)
Dual power supplies
Dual Ethernet cards (bonded to provide failover)
RAID hard drives
At least 32 GB of RAM
Slide 16
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
No daemons, everything runs in a single JVM
Has no DFS
Suitable for running MapReduce programs during development

Pseudo-Distributed Mode
Hadoop daemons run on the local machine

Fully-Distributed Mode
Hadoop daemons run on a cluster of machines
Slide 18
Configuration Files

Configuration Filename(s) | Description
hadoop-env.sh, yarn-env.sh | Settings for the Hadoop daemons' process environment.
core-site.xml | Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml | Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml | Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml | Configuration settings for MapReduce applications.
slaves | A list of machines (one per line) that each run a DataNode and NodeManager.
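As an illustration of the shared XML format of the *-site.xml files, a minimal core-site.xml might look like the sketch below. The hostname and port are placeholders; fs.defaultFS is the standard Hadoop 2.x property naming the default file system URI:

```xml
<!-- Minimal core-site.xml sketch; namenode-host:9000 is a placeholder -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```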
Slide 19
Hadoop Daemons
NameNode daemon
» Runs on the master node of the Hadoop Distributed File System (HDFS)
» Directs DataNodes to perform their low-level I/O tasks
DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work
ResourceManager
» Runs on the master node of the data processing system (MapReduce)
» Global resource scheduler
NodeManager
» Runs on each slave node of the data processing system
» Platform for the data processing tasks
JobHistoryServer
» Responsible for servicing all job-history-related requests from clients
Slide 20
Hadoop 1.x and Hadoop 2.x Ecosystem

Hadoop 1.x stack:
HDFS (Hadoop Distributed File System) at the bottom
MapReduce Framework on top of HDFS
Pig Latin (data analysis), Hive (DW system) and HBase on top of MapReduce
Apache Oozie (workflow) alongside the stack
HBase handles structured data; Pig and Hive handle unstructured/semi-structured data

Hadoop 2.x stack:
HDFS (Hadoop Distributed File System) at the bottom
YARN (cluster resource management) on top of HDFS
MapReduce Framework, other YARN frameworks (MPI, Giraph) and HBase on top of YARN
Pig Latin (data analysis) and Hive (DW system) on top of the MapReduce Framework
Apache Oozie (workflow) alongside the stack
Slide 23
Hadoop Version Upgrade
1. Stop the MapReduce cluster and all client applications running on the DFS cluster
2. Take a backup of the file system namespace
3. Install the new version of the Hadoop software
4. Update all the configuration files in the new Hadoop installation
5. Start the NameNode with the upgrade command
6. Compare the new HDFS file system with the previous version's file system namespace
7. Finalize the upgrade
Slide 24
Hadoop Version Upgrade
1) Run reports
• FSCK
• LSR
• DFSADMIN
2) Take backups
• Configuration
• Applications
• Data and metadata
3) Install the new version of Hadoop
4) Upgrade
• hadoop-daemon.sh start namenode -upgrade
5) Run new reports
• FSCK
• LSR
• DFSADMIN
• Compare the old and new reports
• Test the new cluster
6) Finalize the upgrade
• hadoop dfsadmin -finalizeUpgrade
Slide 27
Add (Commission) DataNodes
1. Update the network addresses in the 'include' files (dfs.include, mapred.include)
2. Update the NameNode: hadoop dfsadmin -refreshNodes
3. Update the JobTracker: hadoop mradmin -refreshNodes
4. Update the 'slaves' file
5. Start the DataNode and TaskTracker:
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
6. Cross-check the Web UI to ensure the node was added successfully
7. Run the Balancer to redistribute HDFS blocks to the new DataNodes
Slide 30
High Availability (HA)
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
High Availability can be achieved in two different ways:
HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes
HA using NFS for shared storage instead of the QJM
Slide 31
HA Architecture
The HA architecture consists of:
An Active NameNode and a Standby NameNode, which share edits through a set of JournalNodes: the Active NameNode writes edits to the JournalNodes and the Standby NameNode reads them to stay in sync
A Failover Controller (one Active, one Standby) alongside each NameNode, which monitors the status and health of its NameNode and manages the HA state
A ZooKeeper service, which the Failover Controllers use for coordination
Slave nodes (DataNodes), which send block reports and heartbeats to both NameNodes
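A QJM-based HA setup of this kind is typically expressed in hdfs-site.xml roughly as in the sketch below. The nameservice ID 'mycluster' and all hostnames are placeholders; the property names are the standard Hadoop 2.x HA settings:

```xml
<!-- hdfs-site.xml sketch for QJM-based HA (hostnames are placeholders) -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2:8020</value>
  </property>
  <!-- The JournalNodes holding the shared edit log -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>
  <!-- Let the ZooKeeper-based failover controllers switch NameNodes -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

The ZooKeeper quorum itself (ha.zookeeper.quorum) is configured in core-site.xml.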