2. Objectives
At the end of this module, you will be able to:
Analyze Hadoop 2.x Cluster Architecture – Federation
Analyze Hadoop 2.x Cluster Architecture – High Availability
Run Hadoop in different cluster modes
Implement basic Hadoop commands on the terminal
Prepare Hadoop 2.x configuration files and analyze the parameters in them
Analyze the dump of a MapReduce program
Implement different data loading techniques
3. Hadoop Core Components
HDFS Architecture
What is HDFS?
Hadoop vs. Traditional Systems
NameNode and Secondary NameNode
DataNode
Let's Revise
[Diagram: a Hadoop 2.x cluster – YARN (a ResourceManager with a NodeManager on each node) running alongside HDFS (a NameNode with a DataNode on each node)]
6. Annie’s Answer
Ans. Option d.
If you move a file to HDFS, then by default 3 copies of the file will be stored on different DataNodes.
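You can verify this yourself with fsck, which reports a file's blocks and replication (the path below is illustrative):
hdfs fsck /user/edureka/weather.txt -files -blocks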
7. Annie’s Question
In a multi-node cluster, every slave node has two daemons
running on it: DataNode and NodeManager.
a. TRUE
b. FALSE
15. Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
a. Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
b. Provides cross-data-centre (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.
16. Annie’s Answer
Ans. Option (a)
In order to scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated; that is, they are independent and do not require coordination with each other.
17. Annie’s Question
You have configured two NameNodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?
18. Annie’s Answer
Ans. The put will fail. Neither namespace manages the file, and you will get an IOException with a "No such file or directory" error.
19. Hadoop 2.x – High Availability
[Diagram: HDFS High Availability – a Client talks to the Active NameNode; all namespace edits are logged to Shared Edit Logs on NFS storage with a single writer (fencing); the Standby NameNode reads the edit logs and applies them to its own namespace; DataNodes serve both NameNodes; NodeManagers host Containers and App Masters. *With HA it is not necessary to configure a Secondary NameNode.]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
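For reference, a minimal hdfs-site.xml sketch of the properties behind this picture, following the NFS-based HA guide linked above; the nameservice name (mycluster), the NameNode hostnames, and the NFS mount path are placeholder assumptions:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
<property>
<!-- the shared NFS mount where all namespace edits are logged; fencing enforces a single writer -->
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>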
20. Hadoop 2.x – Resource Management
[Diagram: Hadoop 2.x resource management – a Client submits work to the YARN ResourceManager, which schedules it onto NodeManagers, each hosting Containers and an App Master (Next Generation MapReduce); the underlying storage is the HDFS High Availability layout from the previous slide.]
22. Annie’s Question
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a. Single point of failure of the NameNode
b. Only one version can be run in classic MapReduce
c. Too much burden on the JobTracker
24. Hadoop Cluster: Facebook
Facebook
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
» A 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both the streaming and the Java APIs. We have built a higher-level data warehousing framework using these features called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
25. Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
No daemons; everything runs in a single JVM.
Suitable for running MapReduce programs during development.
Has no DFS.
Pseudo-Distributed Mode
Hadoop daemons run on the local machine.
Fully-Distributed Mode
Hadoop daemons run on a cluster of machines.
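In standalone mode, for example, you can run one of the bundled example jobs directly against the local filesystem; the jar path below assumes a Hadoop 2.2.0 install, and input/output are ordinary local directories:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output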
28. Hadoop FS Shell Commands
HDFS organizes its data in files and directories.
It provides a command-line interface, called the FS shell, that lets the user interact with data in HDFS.
The syntax of the commands is similar to bash.
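A few illustrative FS shell invocations (the paths under /user/edureka are assumptions):
hadoop fs -ls /user/edureka                      # list files in an HDFS directory
hadoop fs -mkdir /user/edureka/input             # create a directory on HDFS
hadoop fs -put weather.txt /user/edureka/input   # copy a local file into HDFS
hadoop fs -cat /user/edureka/input/weather.txt   # print a file's contents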
29. Terminal Commands
[Screenshots: listing of files present in the bin directory; listing of files present on HDFS]
30. Hadoop 2.x Configuration Files
Configuration Filename | Description
hadoop-env.sh | Environment variables that are used in the scripts that run Hadoop.
core-site.xml | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes.
mapred-site.xml | Configuration settings for MapReduce applications.
yarn-site.xml | Configuration settings for the ResourceManager and NodeManager.
masters | A list of machines (one per line) that each run a Secondary NameNode.
slaves | A list of machines (one per line) that each run a DataNode and a NodeManager.
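For reference, a minimal core-site.xml in the same style as the hdfs-site.xml on the next slide; the NameNode URI is an assumption for a single-node setup:
<configuration>
<property>
<!-- the default filesystem; all relative HDFS paths resolve against this NameNode -->
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>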
34. hdfs-site.xml
---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/datanode</value>
</property>
</configuration>
---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------
dfs.replication: Default block replication; the actual number of replicas can be specified when a file is created.
dfs.permissions: If "true", enable permission checking in HDFS. If "false", permission checking is turned off.
dfs.namenode.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage).
dfs.datanode.data.dir: Determines where on the local filesystem a DFS data node should store its blocks.
38. Slaves and Masters
Two files are used by the startup and shutdown commands: slaves and masters.
Slaves
Contains a list of hosts, one per line, that are to host DataNode and NodeManager servers.
Masters
Contains a list of hosts, one per line, that are to host Secondary NameNode servers.
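For illustration, both files are plain host lists, one hostname per line; these hostnames are placeholder assumptions:
slaves:
slave1.example.com
slave2.example.com
masters:
master2.example.com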
39. hadoop-env.sh – Per-Process Runtime Environment
hadoop-env.sh is sourced by all of the Hadoop Core scripts, which live in the etc/hadoop directory of the Hadoop installation (hadoop-2.2.0/etc/hadoop). It sets the runtime environment for each Hadoop JVM, and also offers a way to provide custom parameters for each of the servers.
Examples of environment variables that you can specify:
Set the parameter JAVA_HOME
export HADOOP_HEAPSIZE="512"
export HADOOP_DATANODE_HEAPSIZE="128"
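A sketch of the corresponding lines in hadoop-env.sh (the JDK path is an assumption for your machine):
# point Hadoop at the JDK to use
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# maximum heap for the Hadoop daemons, in MB
export HADOOP_HEAPSIZE="512"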
41. Hadoop Daemons
NameNode daemon
» Runs on the master node of the Hadoop Distributed File System (HDFS)
» Directs DataNodes to perform their low-level I/O tasks
DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work
ResourceManager
» Runs on the master node of the data processing system (MapReduce)
» Global resource scheduler
NodeManager
» Runs on each slave node of the data processing system
» Platform for the data processing tasks
JobHistoryServer
» Responsible for servicing all job-history-related requests from clients
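One quick way to check which of these daemons are up on a node is the JDK's jps tool; the process IDs below are illustrative:
$ jps
2881 NameNode
3002 DataNode
3340 ResourceManager
3455 NodeManager
3560 JobHistoryServer
3700 Jps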
42. Hadoop Web UI Parts
Service | Servers | Default Ports Used | Protocol | Description
NameNode WebUI | Master nodes (NameNode and any back-up NameNodes) | 50070 | http | Web UI to look at the current status of HDFS and explore the file system
DataNode | All slave nodes | 50075 | http | DataNode Web UI to access the status, logs, etc.
ResourceManager Web UI | Cluster-level resource manager | 8088 | http | Web UI for the ResourceManager and for application submissions
NodeManager | Monitors resources on the DataNode | 8042 | TCP | Node information, list of applications and list of containers
MapReduce JobHistory Server | Gets status on finished applications | 19888 | TCP | Provides logs of important events in MapReduce job execution and associated profiling metrics
43. Web UI URLs
NameNode status: http://localhost:50070/dfshealth.jsp
ResourceManager status: http://localhost:8088/cluster
MapReduce JobHistory Server status: http://localhost:19888/jobhistory
44. Annie’s Question
Which of the following files is used to specify the NameNode's heap size?
a. bashrc
b. hadoop-env.sh
c. hdfs-site.sh
d. core-site.xml
45. Annie’s Answer
Ans. hadoop-env.sh.
This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
46. Annie’s Question
It is necessary to define all the properties in core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml.
a. TRUE
b. FALSE
48. Annie’s Question
Standalone mode uses the default configuration.
a. TRUE
b. FALSE
49. Annie’s Answer
Ans. True.
In standalone mode, Hadoop runs with the default configuration (empty configuration files, i.e. no configuration settings in core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml). If properties are not defined in the configuration files, Hadoop runs with the default values for the corresponding properties.
54. Annie’s Question
The output of an MR job will be stored on HDFS:
a. TRUE
b. FALSE
55. Annie’s Answer
Ans. True
It is stored in different part files, e.g. part-m-00000, part-m-00001, and so on.
56. Annie’s Question
To run an MR job, the data should be present on HDFS:
a. TRUE
b. FALSE
57. Annie’s Answer
Ans. True
In order to process data in parallel, it must be present on HDFS so that MR can work on chunks of the data in parallel.
59. Hadoop Copy Commands
distcp: Distributed copy; moves data between clusters and is used for backup and recovery.
hadoop distcp hdfs://<source NN> hdfs://<target NN>
put: Copies file(s) from the local file system to the destination file system. It can also read from stdin and write to the destination file system.
hadoop dfs -put weather.txt hdfs://<target Namenode>
copyFromLocal: Similar to the put command, except that the source is restricted to a local file reference.
hadoop dfs -copyFromLocal weather.txt hdfs://<target Namenode>
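As an aside, put reading from stdin looks like this; "-" stands for standard input, and the destination path is illustrative:
echo "sample record" | hadoop fs -put - /user/edureka/sample.txt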
60. Demo on Copy Commands
61. Data Loading Using Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
[Diagram: Twitter Streaming API → Twitter Source → Memory Channel → HDFS Sink → HDFS]
Demo will be covered in Module 10
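A minimal sketch of a Flume agent configuration for this pipeline; the agent name, the Twitter credentials and the HDFS path are placeholder assumptions:
# flume.conf — Twitter source -> memory channel -> HDFS sink
agent.sources = twitter
agent.channels = mem
agent.sinks = hdfsSink

agent.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
agent.sources.twitter.consumerKey = <consumer-key>
agent.sources.twitter.consumerSecret = <consumer-secret>
agent.sources.twitter.accessToken = <access-token>
agent.sources.twitter.accessTokenSecret = <access-token-secret>
agent.sources.twitter.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://localhost:9000/user/edureka/tweets
agent.sinks.hdfsSink.channel = mem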
62. Data Loading Using Sqoop
Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Imports individual tables or entire databases to HDFS.
Generates Java classes to allow you to interact with your imported data.
Provides the ability to import from SQL databases straight into your Hive data warehouse.
Demo will be covered in Module 10
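For a taste of what the demo will cover, a minimal import might look like this; the MySQL database, table and user are placeholder assumptions:
sqoop import \
  --connect jdbc:mysql://localhost/sales \
  --username edureka -P \
  --table customers \
  --target-dir /user/edureka/customers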
63. Annie’s Question
Your website hosts a group of more than 300 sub-websites. You want analytics on the shopping patterns of different visitors. What is the best way to collect that information from the weblogs?
a. SQOOP
b. FLUME
65. Annie’s Question
You want to join data collected from two sources. One source, collected from a big database of call records, is already available in HDFS. The other source is available in a database table. The best way to move that data into HDFS is:
a. SQOOP import
b. PIG script
c. Hive Query
67. Assignment
Go through the Edureka VM and explore it
Check the working condition of the Hadoop ecosystem in the Edureka VM
Follow this document to install the Edureka VM
69. Further Reading
MapReduce Job execution
http://www.edureka.in/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
Add/Remove Nodes in a Cluster
http://www.edureka.in/blog/commissioning-and-decommissioning-nodes-in-a-hadoop-cluster/
Secondary NameNode
https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode
70. Pre-work for Next Class
Set up the Hadoop development environment using the documents present in the LMS.
Refresh your Java skills using the Java Essential for Hadoop Tutorial.
Review the interview questions on setting up a Hadoop cluster:
http://www.edureka.in/blog/hadoop-interview-questions-hadoop-cluster/
71. Agenda for Next Class
Use Cases of MapReduce
Traditional vs. MapReduce Way
Hadoop 2.x MapReduce Components and Architecture
YARN Execution Flow
MapReduce Concepts
72. Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.