2. INTRODUCTION:
What is Hadoop?
Hadoop is an open-source software framework provided by
Apache to store, process, and analyze big data in a
distributed environment across clusters of computers. It is
designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
3. What is Big Data?
Big data refers to datasets so large that they cannot be
processed by traditional computing techniques.
Big data is not merely data; it has become a complete
subject involving various tools, techniques, and frameworks.
New technologies, devices, and means of communication appear
every day, so the amount of data produced by mankind grows
rapidly every year.
90% of the world's data was generated in the last few years.
4. Data generated per minute on the internet:
2.1 million snaps are shared on Snapchat.
3.8 million search queries are made on Google.
1 million people log on to Facebook.
4.5 million videos are watched on YouTube.
188 million emails are sent.
That's a lot of data!
5. Example of Big Data
Statistics show that 500+ terabytes of new data are ingested
into the databases of the social media site Facebook every
day. This data is mainly generated through photo and video
uploads, message exchanges, comments, and so on.
7. Types Of Big Data
Following are the types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
8. Problems with Big Data:
Big Data Is Too Big for Traditional Storage.
Big Data Is Too Complex for Traditional Storage.
Big Data Is Too Fast for Traditional Storage.
9. Hadoop as a solution
As defined in the introduction, Hadoop is Apache's
open-source framework for storing, processing, and analyzing
big data across clusters of machines.
It was developed by Doug Cutting and Mike Cafarella.
Hadoop is written in Java.
It has two main components, illustrated by the WordCount
sketch below: HDFS and MapReduce.
HDFS: the Hadoop Distributed File System is the storage unit
of Hadoop.
MapReduce: Hadoop MapReduce is the processing unit of Hadoop.
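To make the two components concrete, here is the classic
WordCount example, adapted from the standard Apache Hadoop
MapReduce tutorial. HDFS holds the input and output (args[0]
and args[1] are assumed to be HDFS directories), while
MapReduce does the processing: the mapper emits a (word, 1)
pair for every word, and the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it runs as: hadoop jar wordcount.jar
WordCount /input /output (the jar name and paths here are
placeholders).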
10. Features of Hadoop
Open source.
Highly scalable.
Fault tolerant.
Highly available.
Cost-effective.
Flexible.
12. Hadoop essentially is a DFS, but why?
Let's take an example of reading 1 TB of data on a high-end
machine with 4 I/O channels, each with a bandwidth of
100 MB/s. Together the channels read 400 MB/s, so this
machine reads the data in about 43 minutes.
Now, say I bring 10 similar machines, each having 4 I/O
channels with a bandwidth of 100 MB/s.
Can you guess how long it will take to read the same 1 TB of
data using all 10 machines?
The work gets divided across the 10 machines, so the time
required to read 1 TB of data drops to one tenth, i.e. about
4.3 minutes.
Similarly, when we consider big data, the data gets divided
into multiple chunks, and those chunks are processed
separately. That is why Hadoop chose a distributed file
system over a centralized one.
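A quick back-of-the-envelope check of those numbers, assuming
binary units (1 TB = 1,048,576 MB) and perfectly parallel
reads:

public class ReadTime {
  public static void main(String[] args) {
    double totalMB = 1024 * 1024;        // 1 TB expressed in MB (binary units)
    double machineMBps = 4 * 100;        // 4 I/O channels x 100 MB/s each
    double oneMachineMin = totalMB / machineMBps / 60;  // sequential read time
    double tenMachinesMin = oneMachineMin / 10;         // work split 10 ways
    System.out.printf("1 machine: %.1f min, 10 machines: %.1f min%n",
        oneMachineMin, tenMachinesMin);
  }
}

This prints roughly 43.7 and 4.4 minutes, matching the
figures above.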
14. HDFS:
HDFS stands for Hadoop Distributed File System.
HDFS as a whole has two major components: the NameNode and
the DataNode.
HDFS follows a master-slave architecture: the NameNode is the
master component, running on the master machine (essentially
a high-end machine), and the DataNodes are slave components
running on commodity hardware.
There is always a single NameNode and multiple DataNodes.
Each file is stored in HDFS as blocks; the entire file is not
stored in one place, since HDFS is a distributed file system.
15. Concept of blocks
Since the Hadoop slaves are made of commodity hardware, the
storage on a single machine is at most around 1 TB or 2 TB.
So the entire file needs to be broken into chunks, or
segments, called blocks. Each block is mapped by the NameNode
onto one of the DataNodes.
The file is not divided randomly. It is divided according to
the default block size: 64 MB in Apache Hadoop 1.x and
128 MB in Hadoop 2.x and later (configurable via
dfs.blocksize).
Let's say I have a file example.txt of size 248 MB. With a
128 MB block size, it will be stored in HDFS as two blocks:
Block A: 128 MB
Block B: 120 MB
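The split is plain integer arithmetic: full blocks of the
default size, plus a final block holding the remainder (HDFS
does not pad the last block up to 128 MB). A small sketch:

public class BlockSplit {
  public static void main(String[] args) {
    long fileMB = 248;    // size of example.txt
    long blockMB = 128;   // default block size in Hadoop 2.x
    long fullBlocks = fileMB / blockMB;   // 1 full block (Block A)
    long lastBlockMB = fileMB % blockMB;  // 120 MB remainder (Block B)
    System.out.println(fullBlocks + " full block(s) of " + blockMB
        + " MB, plus a final block of " + lastBlockMB + " MB");
  }
}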
16. Is it safe to have just one copy of each block?
No: there is a chance that a machine fails, since it is
commodity hardware.
17. Block replication:
Hadoop creates replicas of each block stored in HDFS. That is
why Hadoop is a fault-tolerant system: even if a machine
fails or a block is lost, multiple copies exist on other
DataNodes.
Hadoop follows a default replication factor of 3, meaning
there will be 3 copies of each block. This default can be
changed through Hadoop's configuration files, or per file
through the Java API, as sketched below.
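The cluster-wide default lives in hdfs-site.xml as the
dfs.replication property; it can also be set per client or
per file through the Java API. A minimal sketch, assuming a
cluster configuration is on the classpath and using a
hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);  // default for files this client creates
    FileSystem fs = FileSystem.get(conf);
    // Change the replication factor of an existing file (path is hypothetical)
    fs.setReplication(new Path("/data/example.txt"), (short) 2);
    fs.close();
  }
}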
18. How does Hadoop decide where to store the replicas of a
block?
Hadoop follows the concept of Rack Awareness to decide where
to store each replica of a block.
Under Rack Awareness, the replicas of a block are not all
placed in the same rack; at least one copy must go to a
different rack. If every copy lived in the same rack and that
rack failed, we would lose the data anyway.
19. NAME NODE
The NameNode is the master daemon.
The NameNode stores metadata, i.e. all the information about
the files: file name, file size, locations of the stored
blocks, and so on.
It maintains and manages all the DataNodes.
It maps blocks onto DataNodes.
It receives heartbeats and block reports from all the
DataNodes.
It may direct a DataNode to create a replica of a block.
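This metadata is exactly what a client gets when it asks the
NameNode about a file. A minimal sketch using the standard
FileSystem API (the path /data/example.txt is a hypothetical
placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
    System.out.println("size = " + status.getLen()
        + " bytes, replication = " + status.getReplication());
    // Ask the NameNode which DataNodes hold each block of the file
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + b.getOffset() + ": "
          + String.join(", ", b.getHosts()));
    }
  }
}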
20. DATA NODE
As the name suggests, a DataNode stores the actual data, i.e.
the data from the input files.
It is a slave daemon.
DataNodes regularly send heartbeats back to the NameNode.
22. Common utilities:
Common utilities are also called Hadoop Common.
They are the libraries and utilities required by the other
Hadoop modules to work, and they are needed to start Hadoop
and keep it running.
23. YARN framework:
YARN stands for Yet Another Resource Negotiator.
It basically performs two main functions:
1. Job scheduling.
2. Resource management.
24. ADVANTAGES:
1. Scalability:
Hadoop is a highly scalable storage platform because it can
store and distribute very large datasets across hundreds of
inexpensive servers operating in parallel, unlike traditional
relational database systems (RDBMS), which cannot scale to
process large amounts of data.
2. Flexibility:
Hadoop is designed so that it can deal with any kind of
dataset very efficiently: structured (e.g. MySQL data),
semi-structured (e.g. XML), and unstructured (e.g. images and
videos). It can easily process any kind of data regardless of
its structure, which makes it highly flexible. This is very
useful for enterprises: they can process large datasets
easily and use Hadoop to extract valuable insights from
sources like social media, email, etc.
25. ADVANTAGES:
3. Cost-effective:
Hadoop is open source, i.e. its source code is freely
available, and we can modify it to suit our business
requirements. It also runs on cost-effective commodity
hardware, which provides a cost-efficient model, unlike a
traditional RDBMS, which requires expensive, high-end
hardware to deal with big data.
4. Fast:
Hadoop's storage method is based on a distributed file system
that 'maps' data wherever it is located on a cluster. The
tools for data processing are often on the same servers where
the data is located, resulting in much faster processing. If
you are dealing with large volumes of unstructured data,
Hadoop can efficiently process terabytes of data in minutes,
and petabytes in hours.
26. ADVANTAGES:
5. High Throughput and Low Latency:
Throughput is the amount of work done per unit time, and low
latency means processing data with little or no delay.
Because Hadoop is driven by the principle of distributed
storage and parallel processing, each block of data is
processed simultaneously and independently of the others.
Also, instead of moving the data, the code is moved to the
data in the cluster. These two facts contribute to high
throughput and low latency.
6. Minimum Network Traffic:
In Hadoop, each task is divided into small sub-tasks, which
are then assigned to the DataNodes available in the Hadoop
cluster. Each DataNode processes a small amount of data,
which leads to low traffic in the cluster.
27. ADVANTAGES:
7. Fault Tolerance:
Hadoop uses commodity hardware (inexpensive systems), which
can crash at any moment. In Hadoop, data is replicated on
several DataNodes in the cluster, which ensures the data
remains available if any of your systems crashes. By default,
Hadoop makes 3 copies of each file block and stores them on
different nodes.
28. ISSUES:
1. Issue With Small Files:
Hadoop is suitable for a small number of large files, but it
fails for applications that deal with a large number of small
files. A small file is simply a file significantly smaller
than Hadoop's block size (128 MB by default). A large number
of small files overloads the NameNode, which stores the
namespace for the whole system, and makes it difficult for
Hadoop to function; a rough estimate of this overhead appears
after this list.
2. Vulnerable By Nature:
Hadoop is written in Java, a very widely used programming
language, so exploits targeting Java are well known to cyber
criminals, which leaves Hadoop exposed to security breaches.
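To see why small files overload the NameNode: every file,
directory, and block is held as an object in the NameNode's
memory, commonly estimated at roughly 150 bytes each (a rule
of thumb, not an exact figure). A rough sketch of the
arithmetic:

public class SmallFileOverhead {
  public static void main(String[] args) {
    long files = 100_000_000L;  // 100 million small files, one block each
    long bytesPerObject = 150;  // rough rule of thumb for NameNode heap usage
    long objects = files * 2;   // one file object + one block object per file
    System.out.printf("~%d GB of NameNode heap just for metadata%n",
        objects * bytesPerObject / (1024 * 1024 * 1024));
  }
}

That is on the order of 28 GB of heap before a single byte of
actual data is processed.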
29. ISSUES:
3. Low Performance with Small Data:
Hadoop is mainly designed for dealing with large datasets, so
it is best utilized by organizations that generate a massive
volume of data. Its efficiency decreases in small-data
settings.
4. Security Problem:
Hadoop does not implement encryption at the storage or
network levels by default, so it is not very secure. For
authentication, Hadoop adopts Kerberos, which is difficult to
maintain; a minimal login sketch follows this list.
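Kerberos is typically switched on by setting
hadoop.security.authentication to kerberos in core-site.xml;
a client then logs in with a principal and keytab. A minimal
sketch, where the principal name and keytab path are
hypothetical placeholders (a real KDC must be reachable for
the login to succeed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    // Principal and keytab are hypothetical placeholders
    UserGroupInformation.loginUserFromKeytab(
        "hdfs-user@EXAMPLE.COM", "/etc/security/keytabs/hdfs-user.keytab");
    System.out.println("Logged in as: "
        + UserGroupInformation.getLoginUser().getUserName());
  }
}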
30. ISSUES:
5. Processing Overhead:
In Hadoop, data is read from disk and written back to disk,
which makes read/write operations very expensive when we are
dealing with terabytes and petabytes of data. Hadoop cannot
do in-memory computation, so it incurs processing overhead.
6. Lengthy Code:
Apache Hadoop has roughly 120,000 lines of code. More lines
of code mean more scope for bugs, and programs take more time
to execute.
7. Slow Processing Speed:
MapReduce processes huge amounts of data. In Hadoop,
MapReduce works by breaking the processing into phases, Map
and Reduce, and it requires a lot of time to perform these
tasks, which increases latency and reduces processing speed.