2. Objectives
At the end of this module, you will be able to:
Analyze Hadoop 2.x Cluster Architecture – Federation
Analyze Hadoop 2.x Cluster Architecture – High Availability
Run Hadoop in different cluster modes
Implement basic Hadoop commands on the terminal
Prepare Hadoop 2.x configuration files and analyze the parameters in them
Analyze the dump of a MapReduce program
Implement different data loading techniques
3. Hadoop Core Components
HDFS Architecture
What is HDFS?
Hadoop vs. Traditional Systems
NameNode and Secondary NameNode
DataNode
Let's Revise
[Diagram: a Hadoop 2.x cluster – YARN (a ResourceManager with a NodeManager on each node) running alongside HDFS (a NameNode with a DataNode on each node)]
6. Annie’s Answer
Ans. Option d.
If you move a file to HDFS, then by default 3 copies of the file will be stored on different DataNodes.
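You can verify this yourself with fsck, which reports a file's blocks and replication (the path below is illustrative):
hdfs fsck /user/edureka/weather.txt -files -blocks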
7. Annie’s Question
In a multi-node cluster, every slave node has two daemons
running on it: DataNode and NodeManager.
a. TRUE
b. FALSE
15. Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
a. Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
b. Provides cross-data-centre (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.
16. Annie’s Answer
Ans. Option (a)
In order to scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated; that is, they are independent and do not require coordination with each other.
17. Annie’s Question
You have configured two NameNodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?
18. Annie’s Answer
Ans. The put will fail. Neither namespace manages the file, and you will get an IOException with a "No such file or directory" error.
19. Hadoop 2.x – High Availability
[Diagram: HDFS High Availability – a Client talks to the Active NameNode; all namespace edits are logged to Shared Edit Logs on NFS storage with a single writer (fencing); the Standby NameNode reads the edit logs and applies them to its own namespace; DataNodes serve both NameNodes; NodeManagers host Containers and App Masters. *With HA it is not necessary to configure a Secondary NameNode.]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
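For reference, a minimal hdfs-site.xml sketch of the properties behind this picture, following the NFS-based HA guide linked above; the nameservice name (mycluster), the NameNode hostnames, and the NFS mount path are placeholder assumptions:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
<property>
<!-- the shared NFS mount where all namespace edits are logged; fencing enforces a single writer -->
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>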
20. Hadoop 2.x – Resource Management
[Diagram: Hadoop 2.x resource management – a Client submits work to the YARN ResourceManager, which schedules it onto NodeManagers, each hosting Containers and an App Master (Next Generation MapReduce); the underlying storage is the HDFS High Availability layout from the previous slide.]
22. Annie’s Question
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a. Single point of failure of the NameNode
b. Only one version can be run in classic MapReduce
c. Too much burden on the JobTracker
24. Hadoop Cluster: Facebook
Facebook
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
» A 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both the streaming and the Java APIs. We have built a higher-level data warehousing framework using these features called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
25. Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
No daemons; everything runs in a single JVM.
Suitable for running MapReduce programs during development.
Has no DFS.
Pseudo-Distributed Mode
Hadoop daemons run on the local machine.
Fully-Distributed Mode
Hadoop daemons run on a cluster of machines.
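In standalone mode, for example, you can run one of the bundled example jobs directly against the local filesystem; the jar path below assumes a Hadoop 2.2.0 install, and input/output are ordinary local directories:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output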
28. Hadoop FS Shell Commands
HDFS organizes its data in files and directories.
It provides a command-line interface, called the FS shell, that lets the user interact with data in HDFS.
The syntax of the commands is similar to bash.
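A few illustrative FS shell invocations (the paths under /user/edureka are assumptions):
hadoop fs -ls /user/edureka                      # list files in an HDFS directory
hadoop fs -mkdir /user/edureka/input             # create a directory on HDFS
hadoop fs -put weather.txt /user/edureka/input   # copy a local file into HDFS
hadoop fs -cat /user/edureka/input/weather.txt   # print a file's contents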
29. Terminal Commands
[Screenshots: listing of files present in the bin directory; listing of files present on HDFS]
30. Hadoop 2.x Configuration Files
Configuration Filename | Description
hadoop-env.sh | Environment variables that are used in the scripts that run Hadoop.
core-site.xml | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes.
mapred-site.xml | Configuration settings for MapReduce applications.
yarn-site.xml | Configuration settings for the ResourceManager and NodeManager.
masters | A list of machines (one per line) that each run a Secondary NameNode.
slaves | A list of machines (one per line) that each run a DataNode and a NodeManager.
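For reference, a minimal core-site.xml in the same style as the hdfs-site.xml on the next slide; the NameNode URI is an assumption for a single-node setup:
<configuration>
<property>
<!-- the default filesystem; all relative HDFS paths resolve against this NameNode -->
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>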
34. hdfs-site.xml
---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/edureka/hadoop-2.2.0/hadoop2_data/hdfs/datanode</value>
</property>
</configuration>
---------------------------------------------------------hdfs-site.xml-------------------------------------------------------------
dfs.replication: Default block replication; the actual number of replicas can be specified when a file is created.
dfs.permissions: If "true", enable permission checking in HDFS. If "false", permission checking is turned off.
dfs.namenode.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage).
dfs.datanode.data.dir: Determines where on the local filesystem a DFS data node should store its blocks.
38. Slaves and Masters
Two files are used by the startup and shutdown commands: slaves and masters.
Slaves
Contains a list of hosts, one per line, that are to host DataNode and NodeManager servers.
Masters
Contains a list of hosts, one per line, that are to host Secondary NameNode servers.
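For illustration, both files are plain host lists, one hostname per line; these hostnames are placeholder assumptions:
slaves:
slave1.example.com
slave2.example.com
masters:
master2.example.com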
39. hadoop-env.sh – Per-Process Runtime Environment
hadoop-env.sh is sourced by all of the Hadoop Core scripts, which live in the etc/hadoop directory of the Hadoop installation (hadoop-2.2.0/etc/hadoop). It sets the runtime environment for each Hadoop JVM, and also offers a way to provide custom parameters for each of the servers.
Examples of environment variables that you can specify:
Set the parameter JAVA_HOME
export HADOOP_HEAPSIZE="512"
export HADOOP_DATANODE_HEAPSIZE="128"
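A sketch of the corresponding lines in hadoop-env.sh (the JDK path is an assumption for your machine):
# point Hadoop at the JDK to use
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# maximum heap for the Hadoop daemons, in MB
export HADOOP_HEAPSIZE="512"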
41. Hadoop Daemons
NameNode daemon
» Runs on the master node of the Hadoop Distributed File System (HDFS)
» Directs DataNodes to perform their low-level I/O tasks
DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work
ResourceManager
» Runs on the master node of the data processing system (MapReduce)
» Global resource scheduler
NodeManager
» Runs on each slave node of the data processing system
» Platform for the data processing tasks
JobHistoryServer
» Responsible for servicing all job-history-related requests from clients
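One quick way to check which of these daemons are up on a node is the JDK's jps tool; the process IDs below are illustrative:
$ jps
2881 NameNode
3002 DataNode
3340 ResourceManager
3455 NodeManager
3560 JobHistoryServer
3700 Jps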
42. Hadoop Web UI Parts
Service | Servers | Default Ports Used | Protocol | Description
NameNode WebUI | Master nodes (NameNode and any back-up NameNodes) | 50070 | http | Web UI to look at the current status of HDFS and explore the file system
DataNode | All slave nodes | 50075 | http | DataNode Web UI to access the status, logs, etc.
ResourceManager Web UI | Cluster-level resource manager | 8088 | http | Web UI for the ResourceManager and for application submissions
NodeManager | Monitors resources on the DataNode | 8042 | TCP | Node information, list of applications and list of containers
MapReduce JobHistory Server | Gets status on finished applications | 19888 | TCP | Provides logs of important events in MapReduce job execution and associated profiling metrics
43. Web UI URLs
NameNode status: http://localhost:50070/dfshealth.jsp
ResourceManager status: http://localhost:8088/cluster
MapReduce JobHistory Server status: http://localhost:19888/jobhistory
44. Annie’s Question
Which of the following files is used to specify the NameNode's heap size?
a. bashrc
b. hadoop-env.sh
c. hdfs-site.sh
d. core-site.xml
45. Annie’s Answer
Ans. hadoop-env.sh.
This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
46. Annie’s Question
It is necessary to define all the properties in core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml.
a. TRUE
b. FALSE
48. Annie’s Question
Standalone mode uses the default configuration.
a. TRUE
b. FALSE
49. Annie’s Answer
Ans. True.
In standalone mode, Hadoop runs with the default configuration (empty configuration files, i.e. no configuration settings in core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml). If properties are not defined in the configuration files, Hadoop runs with the default values for the corresponding properties.
54. Annie’s Question
The output of an MR job will be stored on HDFS:
a. TRUE
b. FALSE
55. Annie’s Answer
Ans. True
It is stored in different part files, e.g. part-m-00000, part-m-00001, and so on.
56. Annie’s Question
To run an MR job, the data should be present on HDFS:
a. TRUE
b. FALSE
57. Annie’s Answer
Ans. True
In order to process data in parallel, it must be present on HDFS so that MR can work on chunks of the data in parallel.
59. Hadoop Copy Commands
distcp: Distributed copy; moves data between clusters and is used for backup and recovery.
hadoop distcp hdfs://<source NN> hdfs://<target NN>
put: Copies file(s) from the local file system to the destination file system. It can also read from stdin and write to the destination file system.
hadoop dfs -put weather.txt hdfs://<target Namenode>
copyFromLocal: Similar to the put command, except that the source is restricted to a local file reference.
hadoop dfs -copyFromLocal weather.txt hdfs://<target Namenode>
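As an aside, put reading from stdin looks like this; "-" stands for standard input, and the destination path is illustrative:
echo "sample record" | hadoop fs -put - /user/edureka/sample.txt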
60. Demo on Copy Commands
61. Data Loading Using Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
[Diagram: Twitter Streaming API → Twitter Source → Memory Channel → HDFS Sink → HDFS]
Demo will be covered in Module 10
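A minimal sketch of a Flume agent configuration for this pipeline; the agent name, the Twitter credentials and the HDFS path are placeholder assumptions:
# flume.conf — Twitter source -> memory channel -> HDFS sink
agent.sources = twitter
agent.channels = mem
agent.sinks = hdfsSink

agent.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
agent.sources.twitter.consumerKey = <consumer-key>
agent.sources.twitter.consumerSecret = <consumer-secret>
agent.sources.twitter.accessToken = <access-token>
agent.sources.twitter.accessTokenSecret = <access-token-secret>
agent.sources.twitter.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://localhost:9000/user/edureka/tweets
agent.sinks.hdfsSink.channel = mem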
62. Data Loading Using Sqoop
Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Imports individual tables or entire databases to HDFS.
Generates Java classes to allow you to interact with your imported data.
Provides the ability to import from SQL databases straight into your Hive data warehouse.
Demo will be covered in Module 10
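For a taste of what the demo will cover, a minimal import might look like this; the MySQL database, table and user are placeholder assumptions:
sqoop import \
  --connect jdbc:mysql://localhost/sales \
  --username edureka -P \
  --table customers \
  --target-dir /user/edureka/customers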
63. Annie’s Question
Your website hosts a group of more than 300 sub-websites. You want analytics on the shopping patterns of different visitors. What is the best way to collect that information from the weblogs?
a. SQOOP
b. FLUME
65. Annie’s Question
You want to join data collected from two sources. One source, collected from a big database of call records, is already available in HDFS. The other source is available in a database table. The best way to move that data into HDFS is:
a. SQOOP import
b. PIG script
c. Hive Query
67. Assignment
Go through the Edureka VM and explore it
Check the working condition of the Hadoop ecosystem in the Edureka VM
Follow this document to install the Edureka VM
69. Further Reading
MapReduce Job execution
http://www.edureka.in/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
Add/Remove Nodes in a Cluster
http://www.edureka.in/blog/commissioning-and-decommissioning-nodes-in-a-hadoop-cluster/
Secondary NameNode
https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode
70. Pre-work for Next Class
Set up the Hadoop development environment using the documents present in the LMS.
Refresh your Java skills using the Java Essential for Hadoop Tutorial.
Review the interview questions on setting up a Hadoop cluster:
http://www.edureka.in/blog/hadoop-interview-questions-hadoop-cluster/
71. Agenda for Next Class
Use Cases of MapReduce
Traditional vs. MapReduce Way
Hadoop 2.x MapReduce Components and Architecture
YARN Execution Flow
MapReduce Concepts
72. Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.