Hadoop Architecture | Features and Objectives
What is Hadoop?
Hadoop is an Apache open-source framework written in Java that allows the
distributed processing of large datasets across clusters of computers using simple
programming models. A Hadoop application works in an environment that provides
distributed storage and computation across clusters of computers. Building on the designs
Google published for its file system and MapReduce, Doug Cutting and his team developed
an open-source project named Hadoop.
Using the MapReduce algorithm, Hadoop runs applications where the data is processed in
parallel. In short, Hadoop is used to develop applications that can perform a
complete statistical analysis of huge amounts of data.
Architecture of Hadoop
Hadoop has two major layers, namely:
• Processing/computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce
MapReduce is a parallel programming model for writing distributed applications.
It was devised at Google for efficient processing of large amounts of data on large
clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce
programs run on the Hadoop framework.
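To make the model concrete, here is a minimal word-count sketch against Hadoop's org.apache.hadoop.mapreduce API. The class names are illustrative, not from the original article, and both classes are shown in one listing for brevity (in practice each would go in its own file):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives all counts for one word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The map tasks run in parallel across the cluster, and the framework groups each word's counts together before the reduce tasks run.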
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System. It is a
distributed file system designed to run on commodity hardware. HDFS has many
similarities with existing distributed file systems, but it is designed to be deployed on
low-cost hardware while remaining highly fault-tolerant, and it provides high-throughput
access to application data.
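As a sketch of how an application reads data from HDFS, the following uses Hadoop's Java FileSystem API. The file path is hypothetical, and the cluster address is assumed to come from the client's configuration files:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // including the fs.defaultFS address of the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with a file that exists on your cluster.
        Path path = new Path("/user/demo/input.txt");

        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```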
The following two modules are also included in the Hadoop framework:
a. Hadoop Common
These are Java libraries and utilities which are required by other Hadoop modules.
b. Hadoop YARN
A framework for job scheduling and cluster resource management.
How Does Hadoop Work?
Building bigger servers with heavy configurations to handle large-scale processing is
quite expensive, and a cluster of low-cost machines is cheaper than one high-end server.
This is the major factor behind using Hadoop: it runs across clusters of low-cost
commodity machines.
Hadoop performs the following core tasks (a minimal job driver tying these stages together is sketched after the list):
1. Data is first organized into directories and files. Files are then divided into
uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
2. These blocks are then distributed across various cluster nodes for further
processing.
3. Being on top of the local file system, HDFS supervises the processing.
4. All the blocks are replicated to handle hardware failure.
5. Checks that the code was executed successfully.
6. Performs the sort that takes place between the map and reduce stages.
7. Sends the sorted data to a certain computer (the node running the reduce task).
8. Writes the debugging logs for each job.
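The driver below is a minimal sketch of how these stages are wired together with the Job API, reusing the word-count mapper and reducer sketched earlier. The input and output paths are assumed to come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Steps 1-2: the input files are split into blocks, and the splits
        // are handed to map tasks distributed across the cluster.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Steps 6-7: between these two phases the framework sorts map output
        // by key and ships each key's values to the responsible reducer.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Steps 5 and 8: waitForCompletion reports success or failure, and
        // passing `true` prints progress and debugging output for the job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```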
The Hadoop File System was developed using a distributed file system design. It runs on
commodity hardware. Compared to other distributed systems, HDFS is highly fault-tolerant
and designed to use low-cost hardware.
HDFS holds very large amounts of data while maintaining easy access. To store such huge
data, the files are spread across multiple machines. HDFS also makes applications
available for parallel processing.
Features of HDFS
1. Hadoop provides a command interface to interact with HDFS (see the sketch after this list).
2. Users can easily check the status of the cluster through the built-in name node and data node servers.
3. Streaming access to file system data is available.
4. HDFS provides file permissions and authentication.
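As an illustration of the command interface in feature 1, the same FsShell tool behind the hdfs dfs command can also be driven from Java. The listing below is a sketch that runs the equivalent of `hdfs dfs -ls /`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class HdfsShellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Equivalent of running `hdfs dfs -ls /` from the command line;
        // the exit code is 0 on success, non-zero on failure.
        int exitCode = ToolRunner.run(new FsShell(conf), new String[] {"-ls", "/"});
        System.exit(exitCode);
    }
}
```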
HDFS Architecture
HDFS mainly follows a master-slave architecture.
Name node
The name node is software that runs on commodity hardware with the GNU/Linux operating
system. It can perform the following tasks:
a. It manages the file system namespace.
b. It regulates the client’s access to files.
c. It executes file system operations such as renaming, closing, and opening files and
directories, as sketched below.
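A minimal sketch of such namespace operations through the Java FileSystem API follows. All paths are hypothetical, and each call is a metadata request that the name node serves:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOpsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical paths. Each call below is a metadata operation that
        // the client sends to the name node.
        Path dir = new Path("/user/demo/reports");
        fs.mkdirs(dir);                                    // create a directory
        fs.rename(new Path("/user/demo/old.txt"),
                  new Path("/user/demo/reports/new.txt")); // rename/move a file
        fs.delete(new Path("/user/demo/tmp"), true);       // recursive delete
    }
}
```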
Data node
The data node is software that runs on commodity hardware with the GNU/Linux operating
system. There is a data node for every node in a cluster, and these nodes manage the
data storage of their system.
a. Data nodes perform read-write operations on the file system as per client request (see the write sketch after this list).
b. They also perform other operations such as block creation, deletion, and replication.
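As a sketch of a client-driven write from operation (a), the snippet below creates a small file. The path is hypothetical; the name node chooses which data nodes hold the blocks, and the client streams the bytes to those data nodes:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical path; the second argument overwrites any existing file.
        Path path = new Path("/user/demo/output.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```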
Block
A file in a file system is divided into one or more segments called blocks. In simple
words, the minimum amount of data that HDFS can read or write is called a block.
Generally, the default block size is 64 MB (128 MB from Hadoop 2.x onward), but it can
be increased as needed by changing the HDFS configuration.
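As a sketch, the block size for newly created files can be raised on the client side, and the block size of an existing file can be inspected. The property name dfs.blocksize applies to Hadoop 2.x (older releases used dfs.block.size), and the file path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use 256 MB blocks for files this client creates from now on.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path: inspect the block size of an existing file.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
    }
}
```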
Objectives of HDFS
1. Fault detection and recovery
As HDFS includes a large number of commodity hardware components, component failures
are likely. To overcome this, HDFS should have mechanisms for quick and automatic fault
detection and recovery.
2. Huge datasets
To manage applications with huge datasets, HDFS should have hundreds of nodes per
cluster.
3. Hardware at data
When the computation takes place near the data, a requested task can be done
efficiently. This reduces network traffic and increases throughput.
Advantages of Hadoop:
1. Varied data sources
2. Availability
3. Scalable
4. Cost effective
5. Low network traffic
6. Ease of use
7. Performance
8. High throughput
9. Compatibility
10. Fault tolerant
11. Open source
12. Multi-Language support
Limitations of Hadoop:
1. Issues with small files
2. Slow processing speed
3. Latency
4. Security
5. No real time data processing
6. Uncertainty
7. Lengthy lines of code
8. Not easy to use
9. No caching
10. Supports only batch processing
Summary:
This brings us to the end of this article on Hadoop. In this article you have learned
what Hadoop is, the architecture of Hadoop, its features, and the HDFS architecture. We
have also come up with a curriculum that covers exactly what you would need to become an
expert in Hadoop development! You can have a look at the course details for Hadoop.