OPERATING SYSTEM
BY
DR SHIFA MAAM
TOPIC : HADOOP-DFS DESIGN & ISSUES
NAME : ALTAF HUSSAIN DEADED (48)
INTRODUCTION:
What is Hadoop?
 Hadoop is an open-source software framework, provided by Apache, to store, process and analyze big data in a distributed environment across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
What is big data?
 Big data refers to datasets so large that they cannot be processed by traditional computing techniques.
 Big data is not merely data; it has become a complete subject, involving various tools, techniques and frameworks.
 New technologies, devices and forms of communication are growing day by day, so the amount of data produced by mankind grows rapidly every year.
 90% of the world's data was generated in the last few years.
Data generated per minute on the internet:
 2.1 million snaps are shared on Snapchat.
 3.8 million search queries are made on Google.
 1 million people log on to Facebook.
 4.5 million videos are watched on YouTube.
 188 million emails are sent.
That's a lot of data!
Example of Big Data
 Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
Types Of Big Data
 Following are the types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
Problems with Big data:
 Big Data Is Too Big for Traditional Storage.
 Big Data Is Too Complex for Traditional Storage.
 Big Data Is Too Fast for Traditional Storage.
Hadoop as a solution
 Hadoop is an open-source software framework, provided by Apache, to store, process and analyze big data in a distributed environment across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
 It was developed by Doug Cutting and Mike Cafarella.
 Hadoop is written in Java.
It has 2 main components: HDFS & MapReduce.
HDFS: the Hadoop Distributed File System is the storage unit of Hadoop.
MapReduce: Hadoop MapReduce is the processing unit of Hadoop (see the WordCount sketch below).
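To illustrate the two components working together, here is the classic WordCount job written against Hadoop's org.apache.hadoop.mapreduce Java API: the input and output directories live on HDFS, and the Map and Reduce phases do the processing. This is a minimal sketch; the input and output paths are hypothetical command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir (hypothetical)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir (hypothetical)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```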
Features of Hadoop
 Open source.
 Highly scalable.
 Fault tolerant.
 Highly available.
 Cost-effective.
 Flexible.
HADOOP
ARCHITECTURE
Hadoop is essentially a DFS, but why?
 Let's take the example of reading 1 TB of data on a high-end machine with 4 I/O channels, each with a bandwidth of 100 MB/s.
 Using this machine, the data can be read in about 43 minutes (1,000,000 MB ÷ (4 × 100 MB/s) ≈ 2,500 s).
 Now, bring in 10 similar machines, i.e. each with 4 I/O channels of 100 MB/s.
 Can you guess how long it will take to read the same 1 TB of data using all 10 machines?
 The work gets divided across the 10 machines, so the time required to read 1 TB of data drops to one tenth, i.e. about 4.3 minutes.
 Similarly, when we consider big data, the data gets divided into multiple chunks, and these chunks are processed separately. That is why Hadoop chose a DFS over a centralized file system.
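A quick back-of-the-envelope check of these numbers, as a plain Java sketch (taking 1 TB as 1,000,000 MB for simplicity):

```java
public class ReadTimeEstimate {
  public static void main(String[] args) {
    double fileMB = 1_000_000;       // 1 TB expressed in MB
    double channelMBps = 100;        // bandwidth of one I/O channel
    int channelsPerMachine = 4;

    // Sequential read on one machine: total size over aggregate bandwidth.
    double oneMachine = fileMB / (channelsPerMachine * channelMBps);
    // Ten machines reading disjoint tenths of the file in parallel.
    double tenMachines = oneMachine / 10;

    System.out.printf("1 machine : %.0f s (~%.0f min)%n", oneMachine, oneMachine / 60);
    System.out.printf("10 machines: %.0f s (~%.1f min)%n", tenMachines, tenMachines / 60);
  }
}
```

This prints roughly 2,500 s (~42 min) and 250 s (~4.2 min), matching the slide's 43 and 4.3 minutes.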
HADOOP COMPONENTS
HDFS:
 HDFS stands for Hadoop Distributed File System.
 HDFS as a whole has 2 major components: the NameNode & the DataNodes.
 HDFS follows a master-slave architecture: the NameNode is the master component, running on the master machine (essentially a high-end machine), & the DataNode is a slave component running on commodity hardware.
 There is always a single NameNode & multiple DataNodes.
 Each file is stored on HDFS as blocks. The entire file is not stored on a single node, since HDFS is a distributed file system.
Concept of blocks
 Since the Hadoop slaves are built from commodity hardware, the storage on a single machine is at most around 1 TB or 2 TB. So the entire file needs to be broken into chunks or segments called blocks. Each block is mapped by the NameNode to one of the DataNodes.
 The file is not divided randomly. It is divided according to the default block size, i.e. 64 MB in Apache Hadoop 1.x and 128 MB in Hadoop 2.x and 3.x (configurable via dfs.blocksize).
 Let's say I have a file example.txt of size 248 MB. With a 128 MB block size, it is stored on HDFS as two blocks:
example.txt (248 MB) → Block A: 128 MB, Block B: 120 MB
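A minimal sketch of this fixed-size split arithmetic (not HDFS's actual internals, just the rule the slide describes: every block is full size except possibly the last):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
  // Split a file size into HDFS-style fixed-size blocks; the last block
  // holds whatever remains and is not padded to the full block size.
  static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
    List<Long> blocks = new ArrayList<>();
    for (long remaining = fileSize; remaining > 0; remaining -= blockSize) {
      blocks.add(Math.min(remaining, blockSize));
    }
    return blocks;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // example.txt from the slide: 248 MB with a 128 MB block size
    for (long b : splitIntoBlocks(248 * mb, 128 * mb)) {
      System.out.println(b / mb + " MB");   // prints 128 MB, then 120 MB
    }
  }
}
```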
Is it safe to have just one copy of each block?
 No. Since the machines are commodity hardware, there is always a chance that a machine fails and its blocks are lost.
Block replication:
 Hadoop creates replicas of each block that gets stored in HDFS. That's why Hadoop is a fault-tolerant system: even if a node fails or a block is lost, multiple copies remain on other DataNodes.
 Hadoop uses a default replication factor of 3, which means there will be 3 copies of each block. However, this default can be changed through Hadoop's configuration files, as sketched below.
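For illustration, here is how a client could override the replication factor using the standard HDFS Java API (Configuration and FileSystem). The file path is hypothetical; the same change can be made per file from the shell with hdfs dfs -setrep.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    conf.setInt("dfs.replication", 2);          // override the default factor of 3
                                                // for files created by this client
    FileSystem fs = FileSystem.get(conf);

    // The factor can also be changed on an existing file; the NameNode
    // then schedules extra copies (or deletions) on the DataNodes.
    Path file = new Path("/data/example.txt");  // hypothetical HDFS path
    fs.setReplication(file, (short) 4);
  }
}
```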
How does Hadoop decide where to store the replicas of a block?
 Hadoop follows the concept of Rack Awareness to decide where to store each replica of a block.
 Under Rack Awareness, not all replicas of a block are placed in the rack where it already exists; at least one replica is created in another rack. The reason: if every copy lived in the same rack and that rack failed, we would lose the entire data anyway.
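As a toy illustration of this idea, the sketch below never puts all replicas on a single rack. Note this is a simplification of my own, not HDFS's code: the real default policy is more nuanced (first replica on the writer's node, second on a remote rack, third on a different node of that same remote rack).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RackAwarePlacement {
  // Choose DataNodes for the replicas of one block, spreading them
  // across racks so a single rack failure cannot destroy all copies.
  static List<String> chooseNodes(Map<String, List<String>> nodesByRack,
                                  String firstRack, int replicas) {
    List<String> chosen = new ArrayList<>();
    chosen.add(nodesByRack.get(firstRack).get(0));   // 1st replica: writer's rack
    for (String rack : nodesByRack.keySet()) {
      if (chosen.size() == replicas) break;
      if (!rack.equals(firstRack)) {                 // remaining replicas: other racks
        chosen.add(nodesByRack.get(rack).get(0));
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    Map<String, List<String>> cluster = Map.of(
        "/rack1", List.of("dn1", "dn2"),
        "/rack2", List.of("dn3", "dn4"),
        "/rack3", List.of("dn5", "dn6"));
    System.out.println(chooseNodes(cluster, "/rack1", 3)); // e.g. [dn1, dn3, dn5]
  }
}
```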
NAME NODE
 The NameNode is the master daemon.
 The NameNode stores metadata, i.e. it keeps all the information about the files: file sizes, locations of the stored blocks, file names, etc.
 It maintains & manages all the DataNodes.
 It maps blocks to DataNodes.
 It receives heartbeats & block reports from all the DataNodes.
 It may direct a DataNode to create a replica of a block.
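To see this metadata in action, a client can ask where a file's blocks live; the NameNode answers from its in-memory namespace. A minimal sketch using the standard FileSystem API (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/example.txt");   // hypothetical HDFS path

    // Which blocks make up the file, and which DataNodes hold each replica?
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}
```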
DATA NODE
 As the name suggests, the DataNode stores the actual data, i.e. the contents of the input files.
 It is a slave daemon.
 DataNodes regularly send heartbeats back to the NameNode.
Common utilities:
 The common utilities are also called Hadoop Common.
 They are required by the other modules to work.
 They are needed to start Hadoop & to maintain its performance.
YARN framework:
 YARN stands for Yet Another Resource Negotiator.
 It basically performs two main functions:
1. Job scheduling.
2. Resource management.
ADVANTAGES:
1. Scalability:
Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers operating in parallel, unlike traditional relational database systems (RDBMS), which can't scale to process such large amounts of data.
2. Flexibility:
Hadoop is designed to deal with any kind of dataset very efficiently, whether structured (MySQL data), semi-structured (XML) or unstructured (images and videos). It can process any kind of data regardless of its structure, which makes it highly flexible. This is very useful for enterprises: they can process large datasets easily and use Hadoop to extract valuable insights from sources like social media, email, etc.
ADVANTAGES:
3. Cost effective:
Hadoop is open source, i.e. its source code is freely available, and we can modify it as per our business requirements. It also runs on cost-effective commodity hardware, which yields a cost-efficient model, unlike a traditional RDBMS, which requires expensive, high-end hardware to deal with big data.
4. Fast:
Hadoop's storage method is based on a distributed file system that 'maps' data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. When dealing with large volumes of unstructured data, Hadoop is able to process terabytes of data in minutes, and petabytes in hours.
ADVANTAGES:
5. High Throughput and Low Latency:
Throughput is the amount of work done per unit time, and low latency means processing the data with little or no delay. Since Hadoop is driven by the principle of distributed storage and parallel processing, each block of data is processed simultaneously and independently of the others. Also, instead of moving data, code is moved to the data in the cluster. Together, these two contribute to high throughput and low latency.
6. Minimum Network Traffic:
In Hadoop, each task is divided into small sub-tasks, which are then assigned to the DataNodes available in the Hadoop cluster. Each DataNode processes a small amount of data, which leads to low traffic in the Hadoop cluster.
ADVANTAGES:
7. Fault Tolerance:
Hadoop uses commodity hardware (inexpensive systems) that can crash at any moment. In Hadoop, data is replicated on multiple DataNodes in the cluster, which ensures the availability of data even if some of the machines crash. By default, Hadoop makes 3 copies of each file block and stores them on different nodes.
ISSUES:
1. Issue With Small Files:
Hadoop is suitable for a small number of large files, but it struggles with applications that deal with a large number of small files. A small file is one significantly smaller than Hadoop's block size (128 MB by default). A large number of small files overloads the NameNode, which stores the namespace for the whole system in memory, and makes it difficult for Hadoop to function (see the estimate sketched below).
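A rough illustration of why this hurts. A commonly cited rule of thumb is that each file and each block object costs the NameNode on the order of 150 bytes of heap (the exact figure varies by version and is assumed here, not taken from this deck):

```java
public class NameNodeMemoryEstimate {
  // Assumed rule of thumb: ~150 bytes of NameNode heap per namespace object
  // (one object per file, plus one per block).
  static final long BYTES_PER_OBJECT = 150;

  static long estimateHeapBytes(long numFiles, long blocksPerFile) {
    long objects = numFiles + numFiles * blocksPerFile;
    return objects * BYTES_PER_OBJECT;
  }

  public static void main(String[] args) {
    // 100 million small files, one block each...
    System.out.printf("100M small files: ~%.1f GB of NameNode heap%n",
        estimateHeapBytes(100_000_000L, 1) / 1e9);
    // ...versus 1 million large files of 100 blocks each.
    System.out.printf("1M large files  : ~%.1f GB of NameNode heap%n",
        estimateHeapBytes(1_000_000L, 100) / 1e9);
  }
}
```

Under this assumption, many small files burn tens of gigabytes of NameNode heap on metadata alone, even when they hold far less data than the same number of blocks packed into large files.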
2. Vulnerable By Nature:
Hadoop is written in Java, a widely used programming language, so it is a frequent target for cyber criminals, which makes Hadoop vulnerable to security breaches.
ISSUES:
3. Low Performance on Small Data:
Hadoop is mainly designed for dealing with large datasets, so it is used most efficiently by organizations generating massive volumes of data. Its efficiency decreases in small-data environments.
4. Security Problem:
Hadoop does not enable encryption at the storage or network level by default, so it is not very secure out of the box. For authentication, Hadoop adopts Kerberos, which is difficult to maintain.
ISSUES:
5. Processing Overhead:
In Hadoop, data is read from disk and written back to disk, which makes read/write operations very expensive when dealing with terabytes and petabytes of data. Hadoop cannot do in-memory computation, hence it incurs processing overhead.
6. Lengthy Code:
Apache Hadoop has roughly 120,000 lines of code. More lines of code mean more potential bugs, and programs take more time to execute.
7. Slow Processing Speed:
MapReduce processes huge amounts of data by breaking the processing into phases: Map and Reduce. MapReduce requires a lot of time to perform these tasks, increasing latency and thus reducing processing speed.
Thank you