This presentation about Hadoop for beginners will help you understand what is Hadoop, why Hadoop, what is Hadoop HDFS, Hadoop MapReduce, Hadoop YARN, a use case of Hadoop and finally a demo on HDFS (Hadoop Distributed File System), MapReduce and YARN. Big Data is a massive amount of data which cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework which stores and handles Big Data in a distributed and parallel fashion. Hadoop overcomes the challenges of Big Data. Hadoop has three components HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is the resource management unit of Hadoop. In this video, we will look into these units individually and also see a demo on each of these units.
Below topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Arvo with Hive, and Sqoop and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
2. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
What’s in it for you?
4. What’s in it for you?
Why Hadoop?
What is Hadoop?
5. What’s in it for you?
Why Hadoop?
What is Hadoop?
Hadoop HDFS
6. What’s in it for you?
Why Hadoop?
What is Hadoop?
Hadoop HDFSHadoop MapReduce
7. What’s in it for you?
Why Hadoop?
What is Hadoop?
Hadoop HDFSHadoop MapReduce
Hadoop YARN
8. What’s in it for you?
Why Hadoop?
What is Hadoop?
Hadoop HDFSHadoop MapReduce
Hadoop YARN
Use case of Hadoop
9. What’s in it for you?
Why Hadoop?
What is Hadoop?
Hadoop HDFSHadoop MapReduce
Hadoop YARN
Use case of Hadoop
Demo on HDFS, MapReduce
and YARN
10. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
Why Hadoop?
14. Tim sensed a good demand for other products, so he
thought of expanding his business
15. He started selling fruits, vegetables, meat, and dairy
products in addition to food grains
16. But it wasn’t as easy as he expected it to be. The
number of customers increased, and he was not
able to cater to their needs on time
17. He had to look into assisting his customers with
each of their orders and billing. It was too difficult for
him to manage alone
18. To start delivering orders on time and to manage the
customers’ demands, Tim hired 3 more people to
work with him
19. Matt took care of the fruits and vegetable section.
Luke handled the dairy and meat section. Ann was
appointed as the cashier
Matt
Luke
Ann
Tim
20. However, this was still not a solution to Tim’s problem
as there was not enough space in the shop for all the
items
Storage area
21. The storage was a bottleneck since storing and accessing
became more and more difficult with increased supply and
demand
Storage area
22. Tim came up with an idea to overcome this issue. He
decided to expand the storage area and distribute each
category of product on different floors
23. Now, customers were happy, and after picking up their
products from the respective sections, it was then billed
24. Now, customers were happy, and after picking up their
products from the respective sections, it was then billed
Now, let us compare this story to big data
25. Earlier, data was generated at a moderate rate, and all the
data was structured in nature. One processor was enough to
process all of it
26. With the increase in data generation, different types of data
were generated at high speed. It became difficult for a single
processor to process different types of data
27. Massive amount of different types of data which cannot be
processed and stored using traditional databases is known as
big data
28. To overcome this issue, multiple processors were used to
process each type of data
29. But now the problem was that one storage system was
accessed by all the processors and the storage became the
bottleneck
30. Just like how Tim adopted the distributed approach, the
storage system was also distributed and by doing so, the data
was stored in individual databases
31. Just like how Tim adopted the distributed approach, the
storage system was also distributed and by doing so, the data
was stored in individual databases
Through this story, we see the two approaches that are
used by Hadoop that is HDFS and MapReduce
32. HDFS refers to the distributed storage space just like how Tim distributed the
storage space amongst the various sections
33. Each person took care of a separate section and at the end the customers
went to the cashier for the final billing, this sorted the process and made it
easier. This is how Hadoop MapReduce works
34. This was a rough story of big data
generation and why Hadoop is required. I
will now explain in detail as to what
Hadoop is
36. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
What is Hadoop?
37. What is Hadoop?
Hadoop is a framework which stores and processes big data in a distributed and parallel fashion
38. What is Hadoop?
Hadoop is a framework which stores and processes big data in a distributed and parallel fashion
BIG DATA
40. Hadoop has individual components, which
are used for storing and processing big
data
41. One day in an office..
HDFS
MapReduce
YARN
Components of Hadoop
The storage unit of Hadoop
42. One day in an office..
HDFS
MapReduce
YARN
Components of Hadoop
The storage unit of Hadoop
The processing unit of Hadoop
43. One day in an office..
HDFS
MapReduce
YARN
Components of Hadoop
The storage unit of Hadoop
The processing unit of Hadoop
The resource management unit of Hadoop
44. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
Hadoop HDFS
45. What is HDFS?
Each block of data is stored on multiple
systems and by default has 128 MB of data
Data
Datanode Datanode Datanode
Hadoop Distributed File System (HDFS) is known for its distributed storage method.
It distributes the data amongst many computers. In addition to this, replication of
data is also done to avoid loss of data
46. What is HDFS?
Let us now see how 500 MB of data is stored in the traditional method
47. Let us now see how 500 MB of data is stored in the
traditional method
500 MB data
What is HDFS?
Let us now see how 500 MB of data is stored in the traditional method
48. Let us now see how 500 MB of data is stored in the
traditional method
Here, the entire set of data is stored in one
database. This overloads the database, and if it
crashes, we lose all our data
500 MB data
What is HDFS?
Let us now see how 500 MB of data is stored in the traditional method
49. What is HDFS?
Using Hadoop HDFS, this problem is taken care of as data is distributed amongst
many systems
50. Using Hadoop HDFS, this problem is taken care of
as data is distributed amongst many databases
By doing so, a single database is not
overloaded
500 MB data
What is HDFS?
.
.
.
Using Hadoop HDFS, this problem is taken care of as data is distributed amongst
many systems
51. Hadoop Distributed File System (HDFS) is specially designed for
storing massive datasets in commodity hardware
What is HDFS?
52. What is HDFS?
HDFS has two main components that help
with its storage
NameNode DataNode
Hadoop Distributed File System (HDFS) is specially designed for
storing massive datasets in commodity hardware
53. What is HDFS?
DataNode DataNode DataNode DataNode
NameNode
• NameNode is the master of the
system
• It stores all the metadata
54. NameNode
What is HDFS?
DataNode DataNode DataNode DataNode
• NameNode is the master of the
system
• It stores all the metadata• DataNode is known as the slave
node. There are multiple
DataNodes
• It performs the read/write
operations and stores the actual
data
55. What is HDFS?
NameNode
DataNode DataNode DataNode DataNode
• NameNode manages all the
DataNodes
• The DataNodes send signals
known as heartbeats to the
NameNode. This signal gives the
status of the DataNode
56. As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
What is HDFS?
57. What is HDFS?
Now, let’s consider storing a file of size 530
MB in HDFS
As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
58. What is HDFS?
Now, let’s consider storing a file of size 530
MB in HDFS
File.txt
530 MB
As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
59. What is HDFS?
Now, let’s consider storing a file of size 530
MB in HDFS
File.txt
530 MB
Block B Block DBlock C
128 MB 128 MB128 MB 128 MB
Block A
As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
60. What is HDFS?
Now, let’s consider storing a file of size 530
MB in HDFS
File.txt
530 MB
Block B Block D Block E
18 MB
Block C
128 MB 128 MB128 MB 128 MB
Block A
As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
61. What is HDFS?
Now, let’s consider storing a file of size 530
MB in HDFS
File.txt
530 MB
Block B Block D Block E
18 MB
Block C
128 MB 128 MB128 MB 128 MB
Block A
The final block uses
only the remaining
space for storage
As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
62. What is HDFS?
Now, let’s consider storing a file of size 530
MB in HDFS
File.txt
530 MB
Block B Block D Block E
18 MB
Block C
128 MB 128 MB128 MB 128 MB
Block A
DataNode 1 DataNode 2 DataNode 3 DataNode 4 DataNode 5
As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
63. What is HDFS?
Now, let’s consider storing a file of size 530
MB in HDFS
File.txt
530 MB
Block B Block D Block E
18 MB
Block C
128 MB 128 MB128 MB 128 MB
Block A
All these data blocks are stored
in DataNodes – computers
DataNode 1 DataNode 2 DataNode 3 DataNode 4 DataNode 5
As mentioned earlier, the actual data is stored in DataNodes. Data is stored in the
form of blocks here. The default size of each block is 128 MB
64. What happens if the computer that
contains block A crashes? Do we lose
the data in block A?
65. No, we don’t. That’s the beauty of Hadoop
HDFS. It uses replication to prevent the
loss of data
66. c
Rack 1
Replication in HDFS
HDFS overcomes the issue of DataNode failure by creating copies of the
data; this is known as the replication method
Block ADN 1
67. c
Rack 1 Rack 2
Replication in HDFS
HDFS overcomes the issue of DataNode failure by creating copies of the
data; this is known as the replication method
Block ADN 1 DN 1
Block ADN 5
Block A is replicated. The
replication factor is 3. The
replicas are stored in
different DataNodes
Block ADN 6
2 replicas cannot be stored on the same datanode
68. c
Rack 1 Rack 2
Replication in HDFS
HDFS overcomes the issue of DataNode failure by creating copies of the
data; this is known as the replication method
Rack 3 Rack 4 Rack 5
Similarly, every other
block is replicated
Block ADN 1
Block DDN 2
DN 1
Block ADN 5
Block DDN 10Block BDN 4 Block CDN 7
Block CDN 11
Block EDN 13
Block DDN 14
DN 12Block ADN 6
Block BDN 8
Block BDN 9 Block CDN 15Block EDN 3 Block EDN 12
74. Stores
DataNodes
Metadata (Name, replicas, ….)
DataNodes
NameNode
Here is the data
that is read
ReplicationRead
data
…..….
Architecture of HDFS
Metadata ops
Client
Read
request?
76. Features of HDFS
HDFS is fault tolerant as
multiple copies of data are
made
Fault tolerant Data security Scalability Flexibility
77. Features of HDFS
Provides end-to-end
encryption that protects
data
Fault tolerant Data security Scalability Flexibility
78. Features of HDFS
Multiple nodes can be
added to the cluster
depending on the
requirement
Fault tolerant Data security Scalability Flexibility
79. Features of HDFS
Hadoop is flexible in storing any type
of data, like structured, semi
structured or unstructured data
Fault tolerant Data security Scalability Flexibility
80. Now that we have stored data in
HDFS, how can we process it?
82. In the traditional approach, big data was processed at the master node
Why MapReduce?
big data
83. In the traditional approach, big data was processed at the master node
Why MapReduce?
Master
Slave Slave
Slave Slave
big data
84. This was a disadvantage as it consumed more time to process various types of
data
Master
Slave Slave
Slave Slave
Why MapReduce?
big data
85. To overcome this issue, data was processed at each slave node. This approach
is known as MapReduce
Master
Slave Slave
Slave Slave
Why MapReduce?
big data
86. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
Hadoop MapReduce
87. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
88. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
MapReduce tasks
Map tasks Reduce tasks
89. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Map and Reduce steps
Input Data Output Data
map()
map()
map()
Shuffle and
Sort
reduce()
reduce()
Input Data is divided to form the input splits
90. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Map and Reduce steps
Input Data Output Data
map()
map()
map()
Shuffle and
Sort
reduce()
reduce()
Map phase is the first phase, here data in each split is passed to produce output
values
91. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Map and Reduce steps
Input Data Output Data
map()
map()
map()
Shuffle and
Sort
reduce()
reduce()
In the shuffle and sort phase, output of mapping phase is taken and similar data
is grouped
92. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Map and Reduce steps
Input Data Output Data
map()
map()
map()
Shuffle and
Sort
reduce()
reduce()
Here, the output values from the shuffling phase are aggregated. It then returns
a single output value
93. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Let us now see how MapReduce works with an example
94. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Let us now see how MapReduce works with an example
Input data
Welcome to Hadoop
Hadoop is interesting
Hadoop is easy
95. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Let us now see how MapReduce works with an example
Input data
Welcome to Hadoop
Hadoop is interesting
Hadoop is easy
Welcome to Hadoop
Hadoop is interesting
Hadoop is easy
Input Splits
96. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Let us now see how MapReduce works with an example
Input data
Welcome to Hadoop
Hadoop is interesting
Hadoop is easy
Welcome to Hadoop
Hadoop is interesting
Hadoop is easy
Input Splits
Hadoop, 1
is, 1
interesting, 1
Welcome, 1
to, 1
Hadoop, 1
Hadoop, 1
is, 1
easy, 1
Map phase
97. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Let us now see how MapReduce works with an example
Map phase Shuffle and Sort phase
Hadoop, 1
is, 1
interesting, 1
Welcome, 1
to, 1
Hadoop, 1
Hadoop, 1
is, 1
easy, 1
to, 1
Hadoop, 1
Hadoop, 1
Hadoop, 1
is, 1
is, 1
interesting, 1
Welcome, 1
easy, 1
98. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Let us now see how MapReduce works with an example
Map phase Shuffle and Sort phase
Hadoop, 1
is, 1
interesting, 1
Welcome, 1
to, 1
Hadoop, 1
Hadoop, 1
is, 1
easy, 1
to, 1
Hadoop, 1
Hadoop, 1
Hadoop, 1
is, 1
is, 1
interesting, 1
Welcome, 1
Reducer phase
easy, 1easy, 1
Hadoop, 3
interesting, 1
is, 2
to, 1
Welcome, 1
99. What is MapReduce?
Programming technique where huge data is processed in a parallel and
distributed fashion is known as Hadoop MapReduce
Let us now see how MapReduce works with an example
Map phase Shuffle and Sort phase Final Output
Hadoop, 1
is, 1
interesting, 1
Welcome, 1
to, 1
Hadoop, 1
Hadoop, 1
is, 1
easy, 1
to, 1
Hadoop, 1
Hadoop, 1
Hadoop, 1
is, 1
is, 1
interesting, 1
Welcome, 1
Reducer phase
easy, 1
easy 1
Hadoop 3
interesting 1
is 2
to 1
Welcome 1
easy, 1
Hadoop, 3
interesting, 1
is, 2
to, 1
Welcome, 1
100. Features of MapReduce
Good load
balancing
Re-execution of
tasks
Simple programming
model
Map task +
Reduce task
Splitting the stages into Map and
Reduce tasks improves the load
balancing
101. Features of MapReduce
Good load
balancing
Re-execution of
tasks
Simple programming
model
There is an automatic re-execution if a
certain task fails
Map task +
Reduce task
102. Features of MapReduce
Good load
balancing
Re-execution of
tasks
Simple programming
model
MapReduce has one of the simplest
programming model which is based on
Java. Java is a very common
programming language
Map task +
Reduce task
105. The disadvantage with this version was
that the Job tracker did both the
processing of data and resource
allocation
106. As a result, Job tracker was overburdened
due to handling job scheduling, and
resource management
107. To overcome this issue, Hadoop 2
introduced YARN as the processing layer
that supported many frameworks
108. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
Hadoop YARN
109. What is YARN?
Yet Another Resource Negotiator (YARN) acts as the resource management
unit of Hadoop
110. What is YARN?
Yet Another Resource Negotiator (YARN) acts as the resource management
unit of Hadoop
Apache YARN consists of
Resource
Manager
It is the master daemon. Manages the assignment of
resources such as CPU, memory
111. What is YARN?
Yet Another Resource Negotiator (YARN) acts as the resource management
unit of Hadoop
Apache YARN consists of
Resource
Manager
Node
Manager
It is the slave daemon. It reports the resource
usage to the Resource Manager
112. What is YARN?
Yet Another Resource Negotiator (YARN) acts as the resource management
unit of Hadoop
Apache YARN consists of
Resource
Manager
Application
Master
Node
Manager
Works with the negotiation of resources from resource
manager and works with node manager
119. What is YARN?
Node
Manager
container
App Master
App Master
container
Node
Manager
Node
Manager
container container
Client
Client
Resource
Manager
Node status
Job request
Node manager updates the status of the nodes to the resource
manager
120. What is YARN?
Node
Manager
container
App Master
App Master
container
Node
Manager
Node
Manager
container container
Client
Client
Resource
Manager
Job request
Resource Manager contacts the Node Manager requesting for
resources(containers). The Node Manager grants the request
121. What is YARN?
Node
Manager
container
App Master
App Master
container
Node
Manager
Node
Manager
container container
Client
Client
Resource
Manager
Job request
App Master contacts the Node Manager to use the container and runs in
one of the container allocated on one of the nodes
122. Features of YARN
Job scheduling Multitenancy
YARN is responsible to
process job requests and
allocate resources
Scalability
123. Features of YARN
Job scheduling Multitenancy
Different versions of MapReduce
can run on YARN. This makes
upgrading of MapReduce
manageable
Scalability
124. Features of YARN
Job scheduling Multitenancy Scalability
Depending on the requirement, the
number of nodes can be increased
125. Many companies use Hadoop for storing
and processing data. Now, let me tell you
about one such company
126. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
Use case - Pinterest
127. You would have probably heard of the
popular image sharing website Pinterest
128.
129. `
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
130. `
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
Pinterest has more than 250 million users and nearly 30 billion pins. All these
account to big data concerning Pinterest
131. `
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
Problem
Pinterest faced a challenge in processing tremendous amount of data
Pinterest has more than 250 million users and nearly 30 billion pins. All these
account to big data concerning Pinterest
132. `
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
Problem
Pinterest faced a challenge in processing tremendous amount of data
There was a difficulty in analyzing which data needs to be displayed in a user’s
personalized discovery engine
Pinterest has more than 250 million users and nearly 30 billion pins. All these
account to big data concerning Pinterest
133. `
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
Solution
Pinterest has more than 250 million users and nearly 30 billion pins. All these
account to big data concerning Pinterest
134. `
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
Solution
Pinterest uses Hadoop to process and analyze big data in a way that it helps
the company to show the most relevant content to its users
Pinterest has more than 250 million users and nearly 30 billion pins. All these
account to big data concerning Pinterest
135. `
Pinterest is a social media platform which allows you to pin any
interesting information you find on its site
Solution
Pinterest uses Hadoop to process and analyze big data in a way that it helps
the company to show the most relevant content to its users
Through continuous analysis of the data, Pinterest can provide its users with
features such as related pins, guided search and so on
Pinterest has more than 250 million users and nearly 30 billion pins. All these
account to big data concerning Pinterest
136. This is how Pinterest benefited from
Hadoop. Let’s also start using Hadoop to
put an end to the big data challenges we
are facing
137. What’s in it for you?
Big Data Challenges
What is HDFS?
HDFS Cluster Architecture
HDFS Data Blocks
Data Node Failure
Rack Awareness
General Architecture of HDFS
Read/Write Mechanism
Demo on HDFS, MapReduce
and YARN