Big Data Analytics
Dr. BK Verma
Professor in CSE(ET)
In charge: - Data Science Branch
6-AIDS
Syllabus of BDA
• UNIT-I Introduction To Big Data - Distributed file system, Big Data and its
importance, Four Vs, Drivers for Big data, big data analytics, big data
applications. Algorithms using map reduce, Matrix-Vector Multiplication
by Map Reduce.
• UNIT-II Introduction To Hadoop- Big Data – Apache Hadoop & Hadoop Eco
System – Moving Data in and out of Hadoop – Understanding inputs and
outputs of MapReduce - Data Serialization.
• UNIT- III Hadoop Architecture - Hadoop Architecture, Hadoop Storage:
HDFS, Common Hadoop Shell commands, Anatomy of File Write and
Read., NameNode, Secondary NameNode, and DataNode, Hadoop
MapReduce paradigm, Map and Reduce tasks, Job, Task trackers - Cluster
Setup – SSH & Hadoop Configuration – HDFS Administering – Monitoring &
Maintenance.
• UNIT-IV Hadoop Ecosystem And Yarn -Hadoop ecosystem components -
Schedulers - Fair and Capacity, Hadoop 2.0 New Features- NameNode High
Availability, HDFS Federation, MRv2, YARN, Running MRv1 in YARN.
Big Data
• Big data refers to the large and complex data sets that
are generated by various sources such as social media,
online transactions, and machine sensors.
• These data sets are too big, too fast, and too complex
to be processed and analyzed by traditional data
processing tools and methods.
• The growth of big data has been driven by the
explosion of digital devices, cloud computing, and
the internet of things (IoT), which has created a large
amount of data.
• This data contains valuable information that can be
used to improve business processes, gain new insights,
and make better decisions.
Do you know – Every minute we send 204 million emails, generate 1.8
million Facebook likes, send 278 thousand Tweets, and upload 200,000
photos to Facebook.
The largest contributor of data is social media. For instance, Facebook
generates 500 TB of data every day. Twitter generates 8TB of data daily.
3Vs
• Big data can be characterized by its 3Vs: Volume, Velocity,
and Variety.
• Volume refers to the sheer size of the data sets, which can
range from terabytes to petabytes.
• Velocity refers to the speed at which the data is generated and
needs to be processed.
• Variety refers to the range of different types of data, including
structured data (e.g., databases), unstructured data (e.g., text,
images, and videos), and semi-structured data (e.g., social
media posts).
• To deal with big data, organizations are turning to new
technologies and tools such as Hadoop, Spark, and NoSQL
databases. These tools are designed to handle the scale and
complexity of big data and enable organizations to extract
valuable insights from it.
Four Vs
The Four Vs of big data refer to the characteristics of
big data that make it different from traditional data. The
Four Vs are
1. Volume: The sheer amount of data generated and
stored by organizations and individuals.
2. Velocity: The speed at which data is generated and
processed. This refers to real-time streaming data as
well as batch processing of large amounts of data.
3. Variety: The different forms of data, such as
structured, semi-structured, and unstructured data.
4. Veracity: The quality and accuracy of the data,
including data sources, data context, and the
possibility of missing, incomplete, or inconsistent
data
Distributed file system(DFS):
• A distributed file system (DFS) is a type of file system that allows
multiple users to access and manage a shared pool of configurable
computer storage that is spread across several machines.
• The file system is distributed across many interconnected
nodes, each of which may provide one or more parts of the
overall file system.
• This means that users can access the file system from any
node, and the file system will automatically manage the data
distribution and replication.
• The goal of a distributed file system is to provide a single,
unified namespace and file management interface, so that
users can access their data from any location, without having
to worry about the underlying network topology or physical
storage locations.
Big Data and its importance
• Big Data refers to the large volume of data,
both structured and unstructured, that
inundates a business on a day-to-day basis.
• Big Data can be analyzed for insights that lead
to better decisions and strategic business
moves.
• For example: a company might analyze
customer purchase history to improve its sales
and marketing strategies.
• For Example: a healthcare organization might
analyze patient data to improve patient
outcomes.
Cont..
• The importance of Big Data lies in the insights it
provides. With the right tools and techniques,
organizations can unlock valuable information
and use it to improve their operations and gain
a competitive advantage.
Big Data can also help organizations:
• Enhance customer experiences
• Increase efficiency and productivity
• Identify new business opportunities
• Improve risk management and fraud detection
• Optimize supply chains and logistics
Big Data Analytics
No SQL Tool
Example
Drivers for Big data
There are several drivers that have led to the growth of big
data, including:
1. Technological advancements: The exponential growth of digital
data is largely due to advancements in technology such as the
internet, mobile devices, and sensors, which have made it easier to
collect, store, and share data.
2. Increased use of cloud computing: The widespread adoption of
cloud computing has made it possible for organizations to store and
process large amounts of data cost-effectively.
3. Internet of Things (IoT): The increasing number of connected
devices, such as sensors and smart devices, has led to the generation
of large amounts of data in real-time.
4. Social media: Social media platforms generate vast amounts of
user-generated data, including text, images, and videos.
5. Business requirements: Organizations are looking to leverage big
data to gain a competitive advantage and improve decision-making.
6. Government initiatives: Governments around the world are
investing in big data initiatives to improve public services and drive
economic growth
Big Data Analytics
• Big Data analytics is the process of examining, transforming, and
modeling large and complex data sets to uncover hidden
patterns, correlations, and other insights that can inform
decision making and support business objectives.
• The goal of Big Data analytics is to turn data into actionable
information and knowledge that can drive growth, improve
operational efficiency, and enhance competitiveness.
• Big Data analytics requires a combination of technologies, tools, and
techniques, including data storage, processing, and visualization.
• Hadoop, Spark, and NoSQL databases are popular technologies
used to manage and store big data, while machine learning
algorithms, statistical models, and data visualization tools are used
to perform the analytics.
Cont..
• The applications of Big Data analytics are
numerous and span across multiple industries,
including finance, healthcare, retail,
transportation, and many others.
• For example, in healthcare, Big Data analytics
can be used to analyze electronic medical
records to identify high-risk patients and
improve patient outcomes, while in retail, it
can be used to improve customer
engagement and experience through
personalized marketing and product
recommendations.
Big Data Applications
1. Healthcare: Big data is being used in healthcare to improve patient outcomes,
reduce costs, and improve overall efficiency. For example, electronic health records
(EHRs) generate massive amounts of data that can be analyzed to identify trends,
track patient outcomes, and support research.
2. Finance: Financial institutions use big data to analyze customer behavior, detect
fraud, and make investment decisions. For example, credit card companies use
big data to analyze spending patterns to identify and prevent fraudulent
transactions.
3. Retail: Retail companies use big data to gain insights into consumer behavior
and preferences, optimize pricing and promotions, and personalize the shopping
experience. For example, online retailers use big data to recommend products based
on a customer's past purchases and online behavior.
4. Manufacturing: Manufacturing companies use big data to optimize their supply
chain, improve production processes, and reduce costs. For example, data
generated by sensors on manufacturing equipment can be analyzed to identify
inefficiencies and improve production processes.
5. Transportation: The transportation industry uses big data to optimize routes,
reduce fuel consumption, and improve safety. For example, data from GPS
devices, weather sensors, and traffic cameras can be used to optimize routes for
delivery trucks and reduce fuel consumption.
Different Algorithms using MapReduce
MapReduce is a programming model for processing large data sets that can be used to
implement a wide variety of algorithms. Some common algorithms that can be implemented
using MapReduce include:
1. Word count: A classic example of a MapReduce algorithm that counts the number of
occurrences of each word in a large text corpus (a minimal Java sketch follows this list).
2. Inverted Index: An inverted index is a data structure that maps each word in a document to a
list of documents that contain it. In MapReduce, this can be implemented as a two-step
process, with the map function emitting a key-value pair for each word in each document, and
the reduce function aggregating the list of document IDs for each word.
3. PageRank: PageRank is an algorithm used by Google to rank web pages based on their
importance. In MapReduce, this can be implemented as a series of iterative MapReduce jobs
that propagate importance scores from one set of pages to another.
4. K-means clustering: K-means is a popular algorithm for clustering data into a fixed number
of clusters. In MapReduce, this can be implemented as a series of MapReduce jobs that
iteratively update the cluster centroids based on the assignment of data points to clusters.
5. Matrix multiplication: Matrix multiplication is a fundamental operation in linear algebra
that can be used to solve systems of linear equations. In MapReduce, this can be implemented
as a two-step process, with the map function emitting intermediate values that represent the
product of individual entries in the matrices, and the reduce function aggregating these
intermediate values to compute the final product.
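For reference, here is a minimal Java sketch of the word-count mapper and reducer using the org.apache.hadoop.mapreduce API. It is an illustrative sketch, not taken from the slides, and the class names are chosen arbitrarily.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every input line, emit (word, 1) for each token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);            // intermediate pair: (word, 1)
    }
  }
}

// Reducer: sum the counts received for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // final pair: (word, total count)
  }
}

The same mapper/reducer pair can serve as a template for the other algorithms in the list: only the emitted keys and the aggregation logic change.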
Matrix-Vector Multiplication by Map Reduce
• Matrix-vector multiplication is a common operation in
linear algebra, where a matrix and a vector are multiplied to
produce another vector. In the context of MapReduce, this
operation can be performed by dividing the matrix into
multiple chunks and distributing them across multiple
computers or nodes for parallel processing.
• Here's how matrix-vector multiplication could be performed
using the MapReduce framework (a Java sketch follows these steps):
• Map step: Each node takes one chunk of the matrix and
multiplies it with the entire vector. The result is a partial
vector.
• Reduce step: The partial vectors from all nodes are
combined to form the final result vector.
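A minimal Java sketch of this scheme is shown below. It assumes the matrix is stored as text lines of the form "i j m_ij" and that the vector is small enough to be loaded into memory by every mapper from a local side file; the file name vector.txt, the input layout, and the class names are assumptions made for illustration, not part of the slides.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: each matrix entry "i j m_ij" contributes the partial product m_ij * v[j] to row i.
public class MatrixVectorMapper
    extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
  private final Map<Long, Double> vector = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: the vector is available on every node as a local file
    // "vector.txt" with lines "j v_j" (for example via the distributed cache).
    try (BufferedReader in = new BufferedReader(new FileReader("vector.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.trim().split("\\s+");
        vector.put(Long.parseLong(parts[0]), Double.parseDouble(parts[1]));
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().trim().split("\\s+");
    long i = Long.parseLong(parts[0]);
    long j = Long.parseLong(parts[1]);
    double mij = Double.parseDouble(parts[2]);
    // Emit (row index, partial product).
    context.write(new LongWritable(i), new DoubleWritable(mij * vector.get(j)));
  }
}

// Reduce step: sum the partial products of row i to obtain the i-th entry of the result vector.
class MatrixVectorReducer
    extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
  @Override
  protected void reduce(LongWritable row, Iterable<DoubleWritable> partials, Context context)
      throws IOException, InterruptedException {
    double sum = 0.0;
    for (DoubleWritable p : partials) {
      sum += p.get();
    }
    context.write(row, new DoubleWritable(sum));
  }
}

If the vector is too large to fit in memory, the standard refinement is to split both the matrix and the vector into stripes and run the same job per stripe.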
Unit-2
• Introduction To Hadoop
• Big Data
• Apache Hadoop & Hadoop Eco System
• Moving Data in and out of Hadoop
• Understanding inputs and outputs of
MapReduce
• Data Serialization.
What is Hadoop?
• Hadoop is the solution to the above Big Data
problems. It is the technology to store massive
datasets on a cluster of cheap machines in a
distributed manner. Not only this, it also provides
Big Data analytics through a distributed computing
framework.
• It is open-source software developed as a
project by the Apache Software Foundation. Doug
Cutting created Hadoop. In the year 2008,
Yahoo gave Hadoop to the Apache Software
Foundation. Since then, two versions of Hadoop
have come out: version 1.0 in the year 2011 and
version 2.0.6 in the year 2013. Hadoop comes in
various flavors like Cloudera, IBM BigInsight,
MapR and Hortonworks.
Hadoop consists of three core
components
• Hadoop Distributed File System (HDFS)
– It is the storage layer of Hadoop.
• Map-Reduce – It is the data processing
layer of Hadoop.
• YARN – It is the resource management
layer of Hadoop
Apache Hadoop
• Apache Hadoop is an open-source software
framework for distributed storage and
processing of big data sets across a cluster of
computers.
• It was developed to address the limitations of
traditional centralized systems when it comes
to storing and processing large amounts of
data.
Hadoop consists of two main
components
1.Hadoop Distributed File System (HDFS): A
scalable and fault-tolerant distributed file system
that enables the storage of very large files across a
cluster of commodity servers.
2.MapReduce: A programming model for
processing large data sets in parallel across a
cluster of computers. MapReduce consists of
two stages: the map stage, which processes data
in parallel on different nodes, and the reduce
stage, which aggregates the results.
HDFS:
• The Hadoop Distributed File System provides
distributed storage for Hadoop. HDFS has a
master-slave topology.
• The master is a high-end machine whereas the slaves
are inexpensive computers. The Big Data files
get divided into a number of blocks. Hadoop
stores these blocks in a distributed fashion on
the cluster of slave nodes. On the master, we
have metadata stored.
• HDFS has two daemons running for it:
NameNode and DataNode.
NameNode :
• NameNode Daemon runs on the master
machine.
• It is responsible for maintaining, monitoring
and managing DataNodes.
• It records the metadata of the files like the
location of blocks, file size, permission,
hierarchy etc.
• Namenode captures all the changes to the
metadata like deletion, creation and renaming
of the file in edit logs.
• It regularly receives heartbeat and block
reports from the DataNodes.
DataNode:
• DataNode runs on the slave machine.
• It stores the actual business data.
• It serves read-write requests from the
user.
• DataNode does the ground work of creating,
replicating and deleting the blocks on the
command of NameNode.
• Every 3 seconds, by default, it sends a
heartbeat to the NameNode reporting the health
of HDFS.
Cont..
• The Hadoop ecosystem is a collection of open-source projects
that work together with Hadoop to provide a complete big data
solution. Some of the popular projects in the Hadoop
ecosystem include:
1. Hive: A data warehousing and SQL-like query language for
Hadoop.
2. Pig: A high-level platform for creating MapReduce programs.
3. HBase: A NoSQL database that provides random real-time
read/write access to large amounts of structured data.
4. Spark: An open-source, fast, and general-purpose cluster
computing framework for big data processing.
5. YARN (Yet Another Resource Negotiator): A resource
management system that allocates resources in the cluster for
running applications.
What is Data?
• The quantities, characters, or symbols
on which operations are performed by
a computer, which may be stored and
transmitted in the form of electrical
signals and recorded on magnetic,
optical, or mechanical recording
media.
“90% of the world’s data was generated in the last few years”
What is Big Data?
• Big Data is a collection of data that is
huge in volume, yet growing
exponentially with time. It is data of
such large size and complexity that
none of the traditional data management
tools can store or process it
efficiently. In short, Big Data is also
data, but of huge size.
What is an
Example of
Big Data?
• The New York Stock
Exchange is an example of
Big Data that generates
about one terabyte of new
trade data per day.
Social
Media
• The statistic
shows that 500+
terabytes of new
data get ingested
into the databases
of the social media
site Facebook
every day. This
data is mainly
generated in
terms of photo
and video
uploads, message
exchanges,
putting up comments,
etc.
• A single Jet engine can
generate 10+terabytes of data
in 30 minutes of flight time.
With many thousand flights per
day, generation of data reaches
up to many Petabytes.
Types Of Big Data
• Structured data − Relational data.
• Semi Structured data − XML data.
• Unstructured data − Word, PDF, Text,
Media Logs
Structured
• Any data that can be stored, accessed and processed
in the form of fixed format is termed as a ‘structured’
data. Over the period of time, talent in computer
science has achieved greater success in developing
techniques for working with such kind of data (where
the format is well known in advance) and also
deriving value out of it.
• However, nowadays we are foreseeing issues when the
size of such data grows to a huge extent, with typical
sizes being in the range of multiple zettabytes.
Data stored in a relational database management system is one example
of a ‘structured’ data.
Unstructured
• Any data with unknown form or the
structure is classified as unstructured
data.
• In addition to the size being huge, un-
structured data poses multiple
challenges in terms of its processing for
deriving value out of it.
• A typical example of unstructured data is
a heterogeneous data source containing
a combination of simple text files,
images, videos etc.
• Nowadays organizations have a wealth of
data available with them but,
unfortunately, they don't know how to
derive value out of it since this data is in
its raw, unstructured form.
Semi-structured
• Semi-structured data can contain both
forms of data. We can see semi-
structured data as structured in form,
but it is actually not defined with, e.g.,
a table definition in a relational DBMS.
An example of semi-structured data is
data represented in an XML file.
Data Growth over the years
Characteristics Of Big Data
• Volume
• Variety
• Velocity
• Variability
(i) Volume:-
• The name Big Data itself is related to a
size which is enormous. Size of data
plays a very crucial role in determining
value out of data. Also, whether a
particular data can actually be
considered as a Big Data or not, is
dependent upon the volume of data.
Hence, ‘Volume’ is one characteristic
which needs to be considered while
dealing with Big Data solutions.
(ii) Variety:-
• The next aspect of Big Data is its variety.
• Variety refers to heterogeneous sources and
the nature of data, both structured and
unstructured. During earlier days,
spreadsheets and databases were the only
sources of data considered by most of the
applications. Nowadays, data in the form of
emails, photos, videos, monitoring devices,
PDFs, audio, etc. are also being considered in
the analysis applications. This variety of
unstructured data poses certain issues for
storage, mining and analyzing data.
iii) Velocity:-
• The term ‘velocity’ refers to the speed of
generation of data. How fast the data is
generated and processed to meet the
demands, determines real potential in
the data.
• Big Data Velocity deals with the speed at
which data flows in from sources like
business processes, application logs,
networks, and social media sites,
sensors, Mobile devices, etc. The flow
of data is massive and continuous.
(iv) Variability:-
• This refers to the inconsistency which
can be shown by the data at times,
thus hampering the process of being
able to handle and manage the data
effectively.
Advantages Of Big Data
Processing
• The ability to process Big Data in DBMS brings in multiple
benefits, such as:
• Businesses can utilize outside intelligence while taking
decisions
• Access to social data from search engines and sites like
Facebook and Twitter is enabling organizations to fine-tune
their business strategies.
• Improved customer service
• Traditional customer feedback systems are getting
replaced by new systems designed with Big Data
technologies. In these new systems, Big Data and
natural language processing technologies are being
used to read and evaluate consumer responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
Drivers for Big Data
• The main business drivers for such rising demand for
Big Data Analytics are :
• 1. The digitization of society
• 2. The drop in technology costs
• 3. Connectivity through cloud computing
• 4. Increased knowledge about data science
• 5. Social media applications
• 6. The rise of Internet-of-Things(IoT)
• Example: A number of companies that have Big Data
at the core of their strategy like :
• Apple, Amazon, Facebook and Netflix have become
very successful at the beginning of the 21st century.
What is Big Data Analytics?
• Big Data analytics is a process used to extract
meaningful insights, such as hidden patterns,
unknown correlations, market trends, and customer
preferences. Big Data analytics provides various
advantages—it can be used for better decision
making, preventing fraudulent activities, among
other things.
Different Types of Big Data Analytics
1. Descriptive Analytics
This summarizes past data into a form that people can easily
read. This helps in creating reports, like a company’s revenue,
profit, sales, and so on. Also, it helps in the tabulation of social
media metrics.
Use Case: The Dow Chemical Company analyzed its past data
to increase facility utilization across its office and lab space.
Using descriptive analytics, Dow was able to identify
underutilized space. This space consolidation helped the
company save nearly US $4 million annually.
2. Diagnostic Analytics
• This is done to understand what caused a problem in the first
place. Techniques like drill-down, data mining, and data
recovery are all examples. Organizations use diagnostic
analytics because they provide an in-depth insight into a
particular problem.
Use Case: An e-commerce company’s report shows that their
sales have gone down, although customers are adding
products to their carts. This can be due to various reasons
like the form didn’t load correctly, the shipping fee is too high,
or there are not enough payment options available. This is
where you can use diagnostic analytics to find the reason.
3. Predictive Analytics
• This type of analytics looks into the historical and present
data to make predictions of the future. Predictive analytics
uses data mining, AI, and machine learning to analyze
current data and make predictions about the future. It works
on predicting customer trends, market trends, and so on.
Use Case: PayPal determines what kind of precautions they
have to take to protect their clients against fraudulent
transactions. Using predictive analytics, the company uses all
the historical payment data and user behavior data and builds
an algorithm that predicts fraudulent activities.
4. Prescriptive Analytics
• This type of analytics prescribes the solution to a
particular problem. Prescriptive analytics works with
both descriptive and predictive analytics. Most of the
time, it relies on AI and machine learning.
Use Case: Prescriptive analytics can be used to
maximize an airline’s profit. This type of analytics is
used to build an algorithm that will automatically adjust
the flight fares based on numerous factors, including
customer demand, weather, destination, holiday
seasons, and oil prices.
Big Data Analytics Tools
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle chunks of
data
• Spark - used for real-time processing and analyzing large
amounts of data
• STORM - an open-source real-time computational system
• Kafka - a distributed streaming platform that is used for fault-
tolerant storage
Big Data Industry Applications
• Ecommerce - Predicting customer trends and optimizing prices are a few of the
ways e-commerce uses Big Data analytics
• Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which
result in improved sales
• Education - Used to develop new and improve existing courses based on market
requirements
• Healthcare - With the help of a patient’s medical history, Big Data analytics is used
to predict how likely they are to have health issues
• Media and entertainment - Used to understand the demand of shows, movies,
songs, and more to deliver a personalized recommendation list to its users
• Banking - Customer income and spending patterns help to predict the likelihood of
choosing various banking offers, like loans and credit cards
• Telecommunications - Used to forecast network capacity and improve customer
experience
• Government - Big Data analytics helps governments in law enforcement, among
other things
MapReduce
• MapReduce is a programming model for
writing applications that can process Big
Data in parallel on multiple nodes.
• MapReduce provides analytical capabilities
for analyzing huge volumes of complex
data.
Why MapReduce?
• Traditional Enterprise Systems normally have
a centralized server to store and process data.
The following illustration depicts a schematic
view of a traditional enterprise system.
The traditional model is certainly not suitable for
processing huge volumes of scalable data, which
cannot be accommodated by standard
database servers. Moreover, the centralized
system creates too much of a bottleneck
while processing multiple files
simultaneously.
• Google solved this bottleneck issue using an
algorithm called MapReduce. MapReduce
divides a task into small parts and assigns
them to many computers. Later, the results
are collected at one place and integrated to
form the result dataset.
How MapReduce Works?
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
• The Map task takes a set of data and converts
it into another set of data, where individual
elements are broken down into tuples (key-
value pairs).
• The Reduce task takes the output from the
Map as an input and combines those data
tuples (key-value pairs) into a smaller set of
tuples.
The reduce task is always performed after the
map job.
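Conceptually, the two tasks have the following signatures (standard MapReduce notation, not from the slides):

    map(k1, v1)          -> list(k2, v2)
    reduce(k2, list(v2)) -> list(k3, v3)

That is, the map output and the reduce input share the same intermediate key type k2, which is what allows the framework to group the map output by key before reducing.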
Cont..
• Input Phase − Here we have a Record Reader that translates each record in an input
file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as
input and applies a user-defined code to aggregate the values in the small scope of one
mapper. It is not a part of the main MapReduce algorithm; it is optional (the driver
sketch after this list shows how a combiner is plugged in).
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the Reducer is
running. The individual key-value pairs are sorted by key into a larger data list. The
data list groups the equivalent keys together so that their values can be iterated easily
in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered,
and combined in a number of ways, and it requires a wide range of processing. Once
the execution is over, it gives zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.
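The phases above are wired together in a driver program. Below is a minimal Java sketch of such a driver for a word-count job (the class names WordCountMapper/WordCountReducer and the argument-based paths are assumptions for illustration); note how the optional combiner is plugged in between the map and reduce phases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Input phase: TextInputFormat's record reader feeds (offset, line) pairs to the mapper.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Map, optional combine (local aggregation before the shuffle), and reduce phases.
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);

    // Output phase: the record writer emits the final (word, count) pairs as text.
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}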
MapReduce-
Example
• Let us take a real-world example to
comprehend the power of MapReduce.
Twitter receives around 500 million tweets
per day, which is nearly 3000 tweets per
second. The following illustration shows
how Twitter manages its tweets with the
help of MapReduce.
Cont..
• Tokenize − Tokenizes the tweets into maps
of tokens and writes them as key-value pairs.
• Filter − Filters unwanted words from the
maps of tokens and writes the filtered maps
as key-value pairs.
• Count − Generates a token counter per word.
• Aggregate Counters − Prepares an
aggregate of similar counter values into small
manageable units.
MapReduce - Algorithm
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
• The map task is done by means of Mapper
Class
• The reduce task is done by means of
Reducer Class.
Cont..
• Mapper class takes the input, tokenizes it,
maps and sorts it. The output of Mapper
class is used as input by Reducer class,
which in turn searches matching pairs and
reduces them.
Cont..
• MapReduce implements various mathematical
algorithms to divide a task into small parts and
assign them to multiple systems. In technical
terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a
cluster.
• These mathematical algorithms may include the
following: −
• Sorting
• Searching
• Indexing
• TF-IDF (Term Frequency (TF) – Inverse Document
Frequency (IDF)); its standard weighting formula is recalled below.
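As a reminder of what a TF-IDF job computes (standard definition, not taken from the slides):

    tf-idf(t, d) = tf(t, d) × log( N / df(t) )

where tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents containing t, and N is the total number of documents. In MapReduce this is typically computed with one job for the term frequencies and a second job for the document frequencies.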
Unit-2
Introduction of Hadoop
Hadoop is an open-source software framework that is used
for storing and processing large amounts of data in a
distributed computing environment.
It is designed to handle big data and is based on the
MapReduce programming model, which allows for the
parallel processing of large datasets.
History of Hadoop
• The Apache Software Foundation is the developer of Hadoop, and
its co-founders are Doug Cutting and Mike Cafarella. Co-founder
Doug Cutting named it after his son's toy elephant. In
October 2003, the first relevant paper released was on the Google File System.
• In January 2006, MapReduce development started on Apache
Nutch, which consisted of around 6,000 lines of code for MapReduce and
around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0
was released.
• Hadoop is an open-source software framework for storing and
processing big data. It was created by Apache Software Foundation
in 2006, based on a white paper written by Google in 2003 that
described the Google File System (GFS) and the MapReduce
programming model.
• The Hadoop framework allows for the distributed processing of
large data sets across clusters of computers using simple
programming models. It is designed to scale up from single servers
to thousands of machines, each offering local computation and
storage.
• It is used by many organizations, including Yahoo, Facebook, and
IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming is easy.
4. It has huge, flexible storage.
5. It is low cost.
Hadoop has several key features
that make it well-suited for big
data processing:
• Distributed Storage: Hadoop stores large data
sets across multiple machines, allowing for
the storage and processing of extremely large
amounts of data.
• Scalability: Hadoop can scale from a single
server to thousands of machines, making it
easy to add more capacity as needed.
• Fault-Tolerance: Hadoop is designed to be
highly fault-tolerant, meaning it can continue
to operate even in the presence of hardware
failures.
Cont..
• Data locality: Hadoop provides data locality
feature, where the data is stored on the same
node where it will be processed, this feature
helps to reduce the network traffic and
improve the performance
• High Availability: Hadoop provides High
Availability feature, which helps to make sure
that the data is always available and is not
lost.
• Flexible Data Processing: Hadoop’s
MapReduce programming model allows for the
processing of data in a distributed fashion,
making it easy to implement a wide variety of
algorithms.
Cont..
• Data Integrity: Hadoop provides built-in checksum
feature, which helps to ensure that the data stored
is consistent and correct.
• Data Replication: Hadoop provides data replication
feature, which helps to replicate the data across the
cluster for fault tolerance.
• Data Compression: Hadoop provides built-in data
compression feature, which helps to reduce the
storage space and improve the performance.
• YARN: A resource management platform that
allows multiple data processing engines like real-
time streaming, batch processing, and
interactive SQL, to run and process data stored
in HDFS.
What is Hadoop?
• Def. “Hadoop is an open source software
programming framework for storing a large
amount of data and performing the
computation”.
• Its framework is based on Java programming
with some native code in C and shell scripts.
• Hadoop is an open-source software framework
that is used for storing and processing large
amounts of data in a distributed computing
environment.
• It is designed to handle big data and is based
on the MapReduce programming model, which
allows for the parallel processing of large datasets.
Hadoop has Two main
components
• HDFS (Hadoop Distributed File System): This is the storage
component of Hadoop, which allows for the storage of large
amounts of data across multiple machines. It is designed to work
with commodity hardware, which makes it cost-effective.
• YARN (Yet Another Resource Negotiator): This is the resource
management component of Hadoop, which manages the
allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
• Hadoop also includes several additional modules that provide
additional functionality, such as Hive (a SQL-like query
language), Pig (a high-level platform for creating MapReduce
programs), and HBase (a non-relational, distributed database).
• Hadoop is commonly used in big data scenarios such as
data warehousing, business intelligence, and machine
learning. It’s also used for data processing, data analysis,
and data mining.
Hadoop Distributed File
System
• Hadoop has a distributed file system known as
HDFS, which splits files into blocks and sends
them across the various nodes of a large cluster.
In case of a node failure, the system continues
to operate, and the data transfer between the
nodes is facilitated by HDFS.
HDFS
Advantages of HDFS
• Scalability: Hadoop can easily scale to
handle large amounts of data by adding
more nodes to the cluster.
• Cost-effective: Hadoop is designed to work
with commodity hardware, which makes it
a cost-effective option for storing and
processing large amounts of data.
Cont..
• Fault-tolerance: Hadoop’s distributed
architecture provides built-in fault-
tolerance, which means that if one node
in the cluster goes down, the data can
still be processed by the other nodes.
• Flexibility: Hadoop can process
structured, semi-structured, and
unstructured data, which makes it a
versatile option for a wide range of big
data scenarios.
Cont..
• Open-source: Hadoop is open-source software,
which means that it is free to use and modify.
This also allows developers to access the source
code and make improvements or add new
features.
• Large community: Hadoop has a large and
active community of developers and users who
contribute to the development of the software,
provide support, and share best practices.
• Integration: Hadoop is designed to work with
other big data technologies such as Spark,
Storm, and Flink, which allows for integration with
a wide range of data processing and analysis
tools.
Disadvantages of HDFS:
• Not very effective for small data.
• Hard cluster management.
• Has stability issues.
• Security is a major concern.
• Complexity: Hadoop can be complex to set
up and maintain, especially for
organizations without a dedicated team of
experts.
• Latency: Hadoop is not well-suited for
low-latency workloads and may not be the
best choice for real-time data processing.
Cont..
• Limited Support for Real-time Processing:
Hadoop’s batch-oriented nature makes it less suited
for real-time streaming or interactive data processing
use cases.
• Limited Support for Structured Data: Hadoop is
designed to work with unstructured and semi-
structured data; it is not well-suited for structured
data processing.
• Data Security: Hadoop does not provide built-in
security features such as data encryption or user
authentication, which can make it difficult to secure
sensitive data.
• Limited Support for Ad-hoc Queries: Hadoop’s
MapReduce programming model is not well-suited for
ad-hoc, interactive queries.
Cont..
• Limited Support for Graph and Machine
Learning: Hadoop’s core component HDFS and
MapReduce are not well-suited for graph and
machine learning workloads, specialized
components like Apache Giraph and Mahout are
available but have some limitations.
• Cost: Hadoop can be expensive to set up and
maintain, especially for organizations with large
amounts of data.
• Data Loss: In the event of a hardware failure, the
data stored in a single node may be lost
permanently.
• Data Governance: Data governance is a critical
aspect of data management; Hadoop does not
provide a built-in feature to manage data lineage,
data quality, and data cataloging.
Hadoop framework is made up of
the following modules: Hadoop
Ecosystem
1.Hadoop MapReduce:- a MapReduce
programming model for handling and
processing large data.
2.Hadoop Distributed File System:- distributed
files in clusters among nodes.
3.Hadoop YARN:- a platform which manages
computing resources.
4. Hadoop Common- it contains packages and
libraries which are used for other modules.
Apache Hadoop and Hadoop Eco
System
• Apache Hadoop is an open source
software framework used to develop
data processing applications which are
executed in a distributed computing
environment.
• Applications built using HADOOP are
run on large data sets distributed
across clusters of commodity
computers.
• Commodity computers are cheap and
widely available.
• These are mainly useful for achieving greater computational power at a low cost.
Form of Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data
Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data
services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning
algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
Hadoop Eco System
Apache Hadoop consists of two
sub-projects:
1.Hadoop MapReduce: MapReduce is a
computational model and software
framework for writing applications which are
run on Hadoop. These MapReduce programs
are capable of processing enormous data in
parallel on large clusters of computation
nodes.
2.HDFS (Hadoop Distributed File System):
HDFS takes care of the storage part of
Hadoop applications. MapReduce
applications consume data from HDFS. HDFS
creates multiple replicas of data blocks and
distributes them on compute nodes in a
cluster. This distribution enables reliable and extremely rapid computations.
Hadoop Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed
data processing using MapReduce and HDFS methods.
Name Node:
The NameNode represents every file and directory
that is used in the namespace.
Data Node:
A DataNode helps to manage the state of an HDFS
node and allows you to interact with the blocks.
Master Node:
The master node allows you to conduct parallel
processing of the data using Hadoop MapReduce.
Slave node:
The slave nodes are the additional machines in the
Hadoop cluster which allow you to store data and
conduct complex calculations. Moreover, every
slave node comes with a Task Tracker and a
DataNode.
This allows the processes to be synchronized with the
NameNode and the Job Tracker respectively.
Data storage Nodes in HDFS.
•NameNode(Master)
•DataNode(Slave)
NameNode:
• NameNode works as a Master in a Hadoop cluster
that guides the Datanode(Slaves).
• Namenode is mainly used for storing the Metadata
i.e. the data about the data. Meta Data can be the
transaction logs that keep track of the user’s activity
in a Hadoop cluster.
• Meta Data can also be the name of the file, size, and
the information about the location(Block number,
Block ids) of Datanode that Namenode stores to find
the closest DataNode for Faster Communication.
• Namenode instructs the DataNodes with the
operation like delete, create, Replicate, etc.
Cont..
• NameNode is the master node in the Apache Hadoop
HDFS Architecture that maintains and manages the blocks
present on the DataNodes (slave nodes).
• NameNode is a very highly available server that manages
the File System Namespace and controls access to files by
clients.
• The HDFS architecture is built in such a way that the user
data never resides on the NameNode. The data resides on
DataNodes only.
Functions of NameNode
• It is the master daemon that maintains and
manages the DataNodes (slave nodes)
• It records the metadata of all the files stored in
the cluster, e.g. The location of blocks
stored, the size of the files, permissions,
hierarchy, etc.
• There are two files associated with the metadata:
• FsImage: It contains the complete state of the file
system namespace since the start of the
NameNode.
• EditLogs: It contains all the recent modifications
made to the file system with respect to the most
recent FsImage.
DataNode:
• DataNodes work as slaves. DataNodes
are mainly utilized for storing the data in a
Hadoop cluster; the number of DataNodes
can range from 1 to 500 or even more than
that.
• The more DataNodes there are, the more
data the Hadoop cluster will be able to store.
• A DataNode should have a high storage capacity
to store a large number of file blocks.
File Block In HDFS:
• Data in HDFS is always stored in terms of
blocks. So a single file is divided into
multiple blocks of size 128 MB, which is
the default, and you can also change
it manually.
Suppose you have uploaded a file of 400 MB to your HDFS: it is split into
three blocks of 128 MB and one block of 16 MB, i.e. four blocks in total.
Replication In HDFS
• Replication ensures the availability of the data.
Replication is making a copy of something and the
number of times you make a copy of that particular
thing can be expressed as it’s Replication Factor. As
we have seen in File blocks that the HDFS stores the
data in the form of various blocks at the same time
Hadoop is also configured to make a copy of those file
blocks.
• By default, the Replication Factor for Hadoop is set to
3, which is configurable, meaning you can change it
manually as per your requirement. In the above
example we have made 4 file blocks, which means that
3 replicas (copies) of each file block are made, so a
total of 4×3 = 12 blocks are stored for backup
purposes (the configuration sketch below shows where these defaults live).
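Both the block size and the replication factor are set in hdfs-site.xml. A minimal sketch is shown below; the values are the usual defaults, and the property names (dfs.blocksize, dfs.replication) are those used by Hadoop 2.x and later, so they may differ on other releases.

<configuration>
  <!-- HDFS block size in bytes: 134217728 = 128 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Replication factor: each block is stored on 3 DataNodes -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>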
Rack Awareness
• The rack is nothing but just the physical
collection of nodes in our Hadoop cluster
(maybe 30 to 40).
• A large Hadoop cluster consists of many
racks. With the help of this rack
information, the NameNode chooses the
closest DataNode to achieve maximum
performance while performing
read/write operations, which reduces the
network traffic.
• Hadoop cluster
consists of a data
center, the rack
and the node
which actually
executes jobs.
• Here, data center
consists of racks
and rack consists
of nodes.
• Network
bandwidth
available to
processes varies
depending upon
the location of
the processes
Moving Data In and Out of Hadoop
• Moving data in and out of Hadoop, which
we refer to as data ingress and egress, is the
process by which data is transported
from an external system into an internal
system, and vice versa.
• Hadoop supports ingress and egress at a
low level in HDFS and MapReduce.
Understanding Inputs and Outputs of
MapReduce
Done in Class
Understanding Inputs and Outputs of
MapReduce
• Inputs and Outputs
• The MapReduce model operates on <key,
value> pairs.
• It views the input to the jobs as a set of <key,
value> pairs and produces a different set of
<key, value> pairs as the output of the jobs.
• Data input is supported by two classes in this
framework, namely InputFormat and
RecordReader.
Cont..
• The first is consulted to determine how the
input data should be partitioned for the map
tasks, while the latter reads the data from
the inputs.
• For the data output also there are two
classes, OutputFormat and RecordWriter.
• The first class performs a basic validation of
the data sink properties and the second
class is used to write each reducer output to
the data sink.
MapReduce Phase
• Input Splits: An input in the MapReduce model is divided
into small fixed-size parts called input splits. This part of
the input is consumed by a single map. The input data is
generally a file or directory stored in the HDFS.
• Mapping: This is the first phase in the map-reduce
program execution where the data in each split is passed
line by line, to a mapper function to process it and
produce the output values.
• Shuffling: It is a part of the output phase of Mapping
where the relevant records are consolidated from the
output. It consists of merging and sorting: all the key-
value pairs which have the same keys are combined, and
in sorting, the inputs from the merging step are taken and
sorted. It returns sorted key-value pairs as output (a toy
trace of these phases follows this list).
• Reduce: All the values from the shuffling phase are
combined and a single output value is returned. Thus,
summarizing the entire dataset.
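As a toy trace of these phases (the input text and counts are invented purely for illustration), a two-split word-count input flows as follows:

Input splits : "deer bear river"            |  "car car river"
Mapping      : (deer,1) (bear,1) (river,1)  |  (car,1) (car,1) (river,1)
Shuffling    : (bear,[1])  (car,[1,1])  (deer,[1])  (river,[1,1])
Reduce       : (bear,1)  (car,2)  (deer,1)  (river,2)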
Hadoop InputFormat
• Hadoop InputFormat checks the Input-
Specification of the job.
• InputFormat split the Input file into InputSplit and
assign to individual Mapper.
• There are different methods to get the data to the mapper
and different types of InputFormat in Hadoop, such as
FileInputFormat, TextInputFormat,
KeyValueTextInputFormat, etc.
Cont.
• How the input files are split up and read in Hadoop is
defined by the InputFormat.
• A Hadoop InputFormat is the first component in Map-
Reduce; it is responsible for creating the input splits and
dividing them into records.
• Initially, the data for a MapReduce task is stored in input
files, and input files typically reside in HDFS.
• Although the format of these files is arbitrary, line-based log files and
binary formats can be used.
• Using InputFormat we define how these input files are split
and read.
InputFormat Class:
• The files or other objects that should be used for
input are selected by the InputFormat.
• InputFormat defines the Data splits, which
defines both the size of individual Map tasks and
its potential execution server.
• InputFormat defines the RecordReader, which is
responsible for reading actual records from the
input files.
Cont..
TextOutputFormat:
• MapReduce default Hadoop reducer Output Format is TextOutputFormat,
which writes (key, value) pairs on individual lines of text files and its keys and
values.
SequenceFileOutputFormat:
• It is an Output Format which writes sequence files for its output, and it is an
intermediate format used between MapReduce jobs.
MapFileOutputFormat
• It is another form of FileOutputFormat in Hadoop Output Format, which is
used to write output as map files.
DBOutputFormat
• DBOutputFormat in Hadoop is an Output Format for writing to relational
databases and HBase.
MultipleOutputs
• It allows writing data to files whose names are derived from the output keys
and values, or in fact from an arbitrary string.
YARN(Yet Another Resource
Negotiator)
• YARN is a Framework on which
MapReduce works. YARN performs 2
operations that are Job scheduling and
Resource Management.
• The purpose of the Job Scheduler is to divide
a big task into small jobs so that each
job can be assigned to various slaves in
a Hadoop cluster and processing can be
maximized.
• The Job Scheduler also keeps track of which
job is important, which job has more
priority, dependencies between the jobs
and all the other information like job timing,
etc. And the use of the Resource Manager is
to manage all the resources that are made
available for running the Hadoop cluster.
Data Serialization
• Data serialization is the process of converting structured data into a
stream of bytes so that it can be stored or transmitted and later
converted back to its original form.
• We serialize to translate data structures into a stream of data, and this
stream can be transmitted over the network or stored in a database
regardless of the system architecture; deserialization does the reverse
and is likewise not dependent on the architecture.
• Consider a CSV file that contains a comma (,) inside a data value: during
deserialization, wrong outputs may occur. If the metadata is stored in a
self-describing form such as XML, the data can easily be deserialized
(a minimal custom Writable sketch follows).
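In Hadoop, this is what the Writable interface provides: a key or value type implements write() and readFields() so the framework can serialize it for the shuffle or for storage and deserialize it back. A minimal Java sketch of a custom Writable follows; the class and field names are invented for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative custom value type: Hadoop calls write() to serialize it into a byte
// stream and readFields() to rebuild the object on the receiving node.
public class PageViewWritable implements Writable {
  private long timestamp;
  private int viewCount;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);
    out.writeInt(viewCount);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();
    viewCount = in.readInt();
  }
}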
Unit-3
• Hadoop Architecture
• Hadoop Architecture,
• Hadoop Storage: HDFS, Common Hadoop Shell commands,
Anatomy of File Write and Read.,
• NameNode, Secondary NameNode, and DataNode,
• Hadoop MapReduce paradigm, Map and Reduce tasks, Job,
• Task trackers - Cluster Setup – SSH & Hadoop Configuration –
HDFS Administering – Monitoring & Maintenance.
• History of Hadoop in the following steps: -
• In 2002, Doug Cutting and Mike Cafarella started to work on a
project, Apache Nutch. It is an open source web crawler
software project.
• While working on Apache Nutch, they were dealing with big
data. To store that data they had to spend a lot on storage costs, which
became a problem for the project. This problem became
one of the important reasons for the emergence of Hadoop.
• In 2003, Google introduced a file system known as GFS (Google
file system). It is a proprietary distributed file system developed
to provide efficient access to data.
• In 2004, Google released a white paper on Map Reduce. This
technique simplifies the data processing on large clusters.
Cont..
• In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes Map reduce.
• In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch
project, Doug Cutting introduced a new project, Hadoop, with a file system known as
HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this
year.
• Doug Cutting named his project Hadoop after his son's toy elephant.
• In 2007, Yahoo runs two clusters of 1000 machines.
• In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.
• In 2013, Hadoop 2.2 was released.
• In 2017, Hadoop 3.0 was released.
Hadoop Architecture
• Hadoop is a framework written in Java
that utilizes a large cluster of
commodity hardware to maintain and
store big size data.
• Hadoop works on MapReduce
Programming Algorithm that was
introduced by Google.
• Big Brand Companies are using Hadoop in
their Organization to deal with big data,
eg. Facebook, Yahoo, Netflix, eBay, etc.
Hadoop Architecture
• The Hadoop architecture is a package of the
file system, MapReduce engine and the
HDFS (Hadoop Distributed File System).
• The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single master
and multiple slave nodes.
• The master node includes Job Tracker, Task
Tracker, NameNode, and DataNode whereas
the slave node includes DataNode and
TaskTracker.
The Hadoop Architecture Mainly
consists of 4 components
• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common
Hadoop Architecture
1. MapReduce
• MapReduce is nothing but an algorithm or
a data structure that is based on the YARN
framework.
• The major feature of MapReduce is that it performs
the distributed processing in parallel in a Hadoop
cluster, which makes Hadoop work so fast.
• When you are dealing with Big Data, serial
processing is no longer of any use.
• MapReduce has mainly 2 tasks which are divided
phase-wise:
• In the first phase, Map is utilized, and in the next
phase Reduce is utilized.
The input is provided to the Map() function, then
its output is used as input to the Reduce() function,
and after that we receive our final output.
Cont..
• Input is provided to Map(); since we are
dealing with Big Data, the input is a set of data blocks.
• The Map() function here breaks these DataBlocks
into tuples, which are nothing but key-value pairs.
• These key-value pairs are now sent as input to the
Reduce().
• The Reduce() function then combines these broken
tuples or key-value pairs based on their key value,
forms a set of tuples, and performs some
operation like sorting, a summation-type job, etc.,
which is then sent to the final output node.
• Finally, the output is obtained.
Anatomy of File Write and Read
NameNode
• It is a single master server that exists in the
HDFS cluster.
• As it is a single node, it may become the
reason for a single point of failure.
• It manages the file system namespace by
executing operations like opening,
renaming and closing files.
• It simplifies the architecture of the system.
DataNode
• The HDFS cluster contains multiple
DataNodes.
• Each DataNode contains multiple data
blocks.
• These data blocks are used to store
data.
• It is the responsibility of the DataNode to serve
read and write requests from the file
system's clients.
• The DataNode performs block creation,
deletion, and replication upon instruction
from the NameNode.
Job Tracker
• The role of Job Tracker is to accept the
MapReduce jobs from client and
process the data by using NameNode.
• In response, NameNode provides
metadata to Job Tracker.
Task Tracker
• Task Tracker works as a slave node for
Job Tracker.
• It receives task and code from Job
Tracker and applies that code on the
file.
• This process can also be called as a
Mapper.
MapReduce Layer
• The MapReduce comes into existence
when the client application submits the
MapReduce job to Job Tracker.
• In response, the Job Tracker sends the
request to the appropriate Task
Trackers.
• Sometimes, the TaskTracker fails or
time out.
• In such a case, that part of the job is
rescheduled.
Advantages of Hadoop
• Fast: In HDFS the data distributed over the cluster and are
mapped which helps in faster retrieval. Even the tools to
process the data are often on the same servers, thus reducing
the processing time. It is able to process terabytes of data in
minutes and petabytes in hours.
• Scalable: Hadoop cluster can be extended by just adding nodes
in the cluster.
• Cost Effective: Hadoop is open source and uses commodity
hardware to store data, so it is really cost effective as compared to
a traditional relational database management system.
• Resilient to failure: HDFS has the property with which it can
replicate data over the network, so if one node is down or some
other network failure happens, then Hadoop takes the other
copy of data and use it. Normally, data are replicated thrice but
the replication factor is configurable.
HDFS Read Image:
HDFS Write Image:
Cont..
• Since all the metadata is stored in the name node, it is
very important.
• If it fails, the file system cannot be used as
there would be no way of knowing how to
reconstruct the files from the blocks present in the
data nodes. To overcome this, the concept of the
secondary name node arises.
• Secondary Name Node: It is a separate
physical machine which acts as a helper of the
name node.
• It performs periodic checkpoints.
• It communicates with the name node and takes
snapshots of the metadata, which helps minimize
downtime and loss of data.
NameNode
• NameNode is the master node in the
Apache Hadoop HDFS Architecture that
maintains and manages the blocks
present on the DataNodes (slave
nodes).
• NameNode is a very highly available
server that manages the File System
Namespace and controls access to files by
clients.
Main function performed by NameNode:
• 1. Stores metadata of actual data. E.g. Filename, Path,
No. of Data Blocks, Block IDs, Block Location, No.
of Replicas, Slave related configuration
2. Manages File system namespace.
3. Regulates client access request for actual file data
file.
4. Assign work to Slaves(DataNode).
5. Executes file system name space operation like
opening/closing files, renaming files and directories.
6. As the Name node keeps metadata in memory for fast
retrieval, a huge amount of memory is required for
its operation.
• This should be hosted on reliable hardware.
DataNode
• DataNode works as Slave in Hadoop cluster. Main
function performed by DataNode:
• 1. Actually stores the business data.
2. This is the actual worker node where Read/Write/Data
processing is handled.
3. Upon instruction from the Master, it performs
creation/replication/deletion of data blocks.
4. As all the business data is stored on the DataNode, a
huge amount of storage is required for its operation.
• Commodity hardware can be used for hosting
DataNode.
Secondary NameNode
• Secondary Name Node: It is a separate physical machine which acts as
a helper of name node.
• It performs periodic checkpoints. It communicates with the name node
and takes a snapshot of the metadata, which helps minimize downtime and loss
of data.
• Secondary NameNode is not a backup of NameNode. You can call it a
helper of NameNode.
• NameNode is the master daemon which maintains and manages the
DataNodes.
• It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.
Cont..
• In case of a DataNode failure, the NameNode chooses
new DataNodes for new replicas, balances disk usage and
manages the communication traffic to the DataNodes.
• It stores the metadata of all the files stored in HDFS, e.g.
The location of blocks stored, the size of the files,
permissions, hierarchy, etc.
It maintains 2 files:
• FsImage: Contains the complete state of the file system
namespace since the start of the NameNode.
• EditLogs: Contains all the recent modifications made to
the file system with respect to the most recent FsImage.
• Whereas the Secondary NameNode is one which
constantly reads all the file systems and metadata from
the RAM of the NameNode and writes it into the hard
disk or the file system.
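As a rough sketch of what a checkpoint looks like on disk (the name directory path below is an assumed example; in practice it is whatever dfs.namenode.name.dir points to, and the automatic checkpoint interval is governed by dfs.namenode.checkpoint.period):

# Force a manual checkpoint: merge the EditLogs into a new FsImage.
# saveNamespace is only allowed while the NameNode is in safe mode.
$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -saveNamespace
$ hdfs dfsadmin -safemode leave

# The fsimage_* and edits_* files live in the NameNode's name directory
$ ls /hadoop/dfs/name/current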
HDFS
• HDFS is a distributed file system that handles
large data sets running on commodity hardware.
• Hadoop comes with a distributed file system
called HDFS.
• In HDFS, data is distributed over several machines and
replicated to ensure durability against failure and high
availability to parallel applications.
• It is cost effective as it uses commodity hardware.
It involves the concepts of blocks, data nodes and
name nodes.
• It is used to scale a single Apache Hadoop cluster
to hundreds (and even thousands) of nodes.
• HDFS is one of the major components of Apache
Hadoop, the others being MapReduce and YARN.
Where to use HDFS
• Very Large Files: Files should be of hundreds of megabytes,
gigabytes or more.
• Streaming Data Access: The time to read the whole data set is
more important than the latency in reading the first record. HDFS is
built on a write-once, read-many-times pattern.
• Commodity Hardware: It works on low-cost hardware.
Where not to use HDFS
• Low-Latency Data Access: Applications that need the first
record very quickly should not use HDFS, since it is optimized for
the throughput of the whole data set rather than the time to fetch
the first record.
• Lots of Small Files: The name node keeps the metadata of all
files in memory, so a very large number of small files consumes a
disproportionate amount of the name node's memory, which is not
feasible.
• Multiple Writes: HDFS should not be used when a file has to be
written to repeatedly or by multiple writers; files are written
once and then read many times.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default and this is configurable. Files in HDFS
are broken into block-sized chunks, which are stored as independent units.
Unlike an ordinary file system, if a file in HDFS is smaller than the block
size, it does not occupy the full block's size, i.e. a 5 MB file stored in
HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block
size is large simply to minimize the cost of seeks (see the sketch after
this list).
2. Name Node: HDFS works in a master-worker pattern where the name node
acts as the master. The name node is the controller and manager of HDFS, as it
knows the status and the metadata of all the files in HDFS, the metadata
information being file permissions, names and the location of each block. The
metadata is small, so it is stored in the memory of the name node, allowing
faster access to it. Moreover, the HDFS cluster is accessed by multiple clients
concurrently, so all this information is handled by a single machine. File
system operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the
client or the name node. They report back to the name node periodically with
the list of blocks that they are storing. The data node, being commodity
hardware, also does the work of block creation, deletion and replication as
instructed by the name node.
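A small sketch of how the block size shows up in practice: hdfs getconf prints the configured value, and the generic -D option should let a single upload override it (the file name and the 256 MB value below are made-up examples):

# Default block size in bytes; 134217728 = 128 MB
$ hdfs getconf -confKey dfs.blocksize

# Upload one file with a 256 MB block size, just for this command
$ hadoop fs -D dfs.blocksize=268435456 -put bigfile.dat /user/test/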
Common Hadoop Shell commands
• The File System (FS) shell includes
various shell-like commands that directly
interact with the Hadoop Distributed File
System (HDFS) as well as other file
systems that Hadoop supports.
On the original slide, a screenshot shows a new directory named "example" being
created with the mkdir command and then verified with the ls command; the
equivalent commands are sketched below.
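A minimal reproduction of that screenshot from the FS shell (the directory is created at the HDFS root here purely for illustration):

# Create a directory called "example" and list the root to confirm it exists
$ hadoop fs -mkdir /example
$ hadoop fs -ls /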
HDFS Basic File Operations
1. Putting data into HDFS from the local file system
First create a folder in HDFS where the data can be put from the local
file system.
$ hadoop fs -mkdir /user/test
Copy the file "data.txt" kept in the local folder
/usr/home/Desktop to the HDFS folder /user/test:
$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt
/user/test
Display the contents of the HDFS folder:
$ hadoop fs -ls /user/test
Cont..
2. Copying data from HDFS to the local file
system
$ hadoop fs -copyToLocal /user/test/data.txt
/usr/bin/data_copy.txt
3. Compare the files and see that both are the
same
$ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
(use md5sum instead of md5 on most Linux systems)
Recursive deletion (-rmr is the older syntax; newer releases prefer
hadoop fs -rm -r):
$ hadoop fs -rmr <arg>
Example:
$ hadoop fs -rmr /user/sonoo/
MapReduce
• Map Task
• Reduce Task
• Job
• Task Trackers-Cluster setup
• SSH
• Hadoop Configuration
• HDFS Administering
• Monitoring and Maintenance
Map Task
• A map task is a single instance of the mapper
within a MapReduce job; each task determines
which records to process from one data
block (input split).
• The input data is split and analyzed, in
parallel, on the assigned compute
resources in a Hadoop cluster.
• This step of a MapReduce job prepares
the <key, value> pair output for the reduce
step.
Reduce Task
• Map stage − The map or mapper's job is to
process the input data. Generally the input data
is in the form of a file or directory and is stored in
the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The
mapper processes the data and creates several
small chunks of data.
• Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The
Reducer's job is to process the data that comes
from the mapper. After processing, it produces a
new set of output, which is stored in HDFS (a
complete word-count job run is sketched below).
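To see both stages end to end, the example jar that ships with Hadoop includes a word-count job; the jar path and the input/output directories below are illustrative, and the output directory must not already exist:

# Map tasks emit <word, 1> pairs; reduce tasks sum the counts per word
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/test /user/test-out

# Inspect the reducer output
$ hadoop fs -cat /user/test-out/part-r-00000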
SSH
• SSH setup is required to perform different operations on
a cluster, such as starting and stopping the distributed
daemons (shell operations).
• Hadoop core requires a shell (SSH) to
communicate with the slave nodes and to create
processes on the slave nodes. The communication
is frequent when the cluster is live and working
in a fully distributed environment (a typical passwordless
setup is sketched below).
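A commonly used passwordless-SSH setup for a single-node (pseudo-distributed) installation looks roughly like this; on a real cluster the public key is copied to every slave node as well:

# Generate a key pair with an empty passphrase
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorize the key for logins to this machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

# Should now log in without prompting for a password
$ ssh localhost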
  • 1. Big Data Analytics Dr. BK Verma Professor in CSE(ET) In charge: - Data Science Branch 6-AIDS
  • 2. Syllabus of BDA • UNIT-I Introduction To Big Data - Distributed file system, Big Data and its importance, Four Vs, Drivers for Big data, big data analytics, big data applications. Algorithms using map reduce, Matrix-Vector Multiplication by Map Reduce. • UNIT-II Introduction To Hadoop- Big Data – Apache Hadoop & Hadoop Eco System – Moving Data in and out of Hadoop – Understanding inputs and outputs of MapReduce - Data Serialization. • UNIT- III Hadoop Architecture - Hadoop Architecture, Hadoop Storage: HDFS, Common Hadoop Shell commands, Anatomy of File Write and Read., NameNode, Secondary NameNode, and DataNode, Hadoop MapReduce paradigm, Map and Reduce tasks, Job, Task trackers - Cluster Setup – SSH &Hadoop Configuration – HDFS Administering –Monitoring & Maintenance. • UNIT-IV Hadoop Ecosystem And Yarn -Hadoop ecosystem components - Schedulers - Fair and Capacity, Hadoop 2.0 New Features- NameNode High Availability, HDFS Federation, MRv2, YARN, Running MRv1 in YARN.
  • 3.
  • 4. Big Data • Big data refers to the large and complex data sets that are generated by various sources such as social media, online transactions, and machine sensors. • These data sets are too big, too fast, and too complex to be processed and analyzed by traditional data processing tools and methods. • The growth of big data has been driven by the explosion of digital devices, cloud computing, and the internet of things (IoT), which has created a large amount of data. • This data contains valuable information that can be used to improve business processes, gain new insights, and make better decisions.
  • 5.
  • 6.
  • 7. Do you know – Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and up-load 200,000 photos to Facebook. The largest contributor of data is social media. For instance, Facebook generates 500 TB of data every day. Twitter generates 8TB of data daily.
  • 8.
  • 9.
  • 10.
  • 11. 3Vs • Big data can be characterized by its 3Vs: Volume, Velocity, and Variety. • Volume refers to the sheer size of the data sets, which can range from terabytes to petabytes. • Velocity refers to the speed at which the data is generated and needs to be processed. • Variety refers to the range of different types of data, including structured data (e.g., databases), unstructured data (e.g., text, images, and videos), and semi-structured data (e.g., social media posts). • To deal with big data, organizations are turning to new technologies and tools such as Hadoop, Spark, and NoSQL databases. These tools are designed to handle the scale and complexity of big data and enable organizations to extract valuable insights from it.
  • 12. Four Vs The Four Vs of big data refer to the characteristics of big data that make it different from traditional data. The Four Vs are 1. Volume: The sheer amount of data generated and stored by organizations and individuals. 2. Velocity: The speed at which data is generated and processed. This refers to real-time streaming data as well as batch processing of large amounts of data. 3. Variety: The different forms of data, such as structured, semi-structured, and unstructured data. 4. Veracity: The quality and accuracy of the data, including data sources, data context, and the possibility of missing, incomplete, or inconsistent data
  • 13. Distributed file system(DFS): • A distributed file system is a type of file system that allows multiple users to access and manage a shared pool of configurable computer storage, known as a distributed file system(DFS). • The file system is distributed across many interconnected nodes, each of which may provide one or more parts of the overall file system. • This means that users can access the file system from any node, and the file system will automatically manage the data distribution and replication. • The goal of a distributed file system is to provide a single, unified namespace and file management interface, so that users can access their data from any location, without having to worry about the underlying network topology or physical storage locations.
  • 14.
  • 15. Big Data and its importance • Big Data refers to the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. • Big Data can be analyzed for insights that lead to better decisions and strategic business moves. • For example: a company might analyze customer purchase history to improve its sales and marketing strategies. • For Example: a healthcare organization might analyze patient data to improve patient outcomes.
  • 16. Cont.. • The importance of Big Data lies in the insights it provides. With the right tools and techniques, organizations can unlock valuable information and use it to improve their operations and gain a competitive advantage. Big Data can also help organizations: • Enhance customer experiences • Increase efficiency and productivity • Identify new business opportunities • Improve risk management and fraud detection • Optimize supply chains and logistics
  • 18.
  • 20.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26. Drivers for Big data There are several drivers that have led to the growth of big data, including: 1. Technological advancements: The exponential growth of digital data is largely due to advancements in technology such as the internet, mobile devices, and sensors, which have made it easier to collect, store, and share data. 2. Increased use of cloud computing: The widespread adoption of cloud computing has made it possible for organizations to store and process large amounts of data cost-effectively. 3. Internet of Things (IoT): The increasing number of connected devices, such as sensors and smart devices, has led to the generation of large amounts of data in real-time. 4. Social media: Social media platforms generate vast amounts of user-generated data, including text, images, and videos. 5. Business requirements: Organizations are looking to leverage big data to gain a competitive advantage and improve decision-making. 6. Government initiatives: Governments around the world are investing in big data initiatives to improve public services and drive economic growth
  • 27.
  • 28. Big Data Analytics • Big Data analytics is the process of examining, transforming, and modeling large and complex data sets to uncover hidden patterns, correlations, and other insights that can inform decision making and support business objectives. • The goal of Big Data analytics is to turn data into actionable information and knowledge that can drive growth, improve operational efficiency, and enhance competitiveness. • Big Data analytics requires a combination of technologies, tools, and techniques, including data storage, processing, and visualization. • Hadoop, Spark, and NoSQL databases are popular technologies used to manage and store big data, while machine learning algorithms, statistical models, and data visualization tools are used to perform the analytics.
  • 29. Cont.. • The applications of Big Data analytics are numerous and span across multiple industries, including finance, healthcare, retail, transportation, and many others. • For example, in healthcare, Big Data analytics can be used to analyze electronic medical records to identify high-risk patients and improve patient outcomes, while in retail, it can be used to improve customer engagement and experience through personalized marketing and product recommendations.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. Big Data Applications 1. Healthcare: Big data is being used in healthcare to improve patient outcomes, reduce costs, and improve overall efficiency. For example, electronic health records (EHRs) generate massive amounts of data that can be analyzed to identify trends, track patient outcomes, and support research. 2. Finance: Financial institutions use big data to analyze customer behavior, detect fraud, and make investment decisions. For example, credit card companies use big data to analyze spending patterns to identify and prevent fraudulent transactions. 3. Retail: Retail companies use big data to gain insights into consumer behavior and preferences, optimize pricing and promotions, and personalize the shopping experience. For example, online retailers use big data to recommend products based on a customer's past purchases and online behavior. 4. Manufacturing: Manufacturing companies use big data to optimize their supply chain, improve production processes, and reduce costs. For example, data generated by sensors on manufacturing equipment can be analyzed to identify inefficiencies and improve production processes. 5. Transportation: The transportation industry uses big data to optimize routes, reduce fuel consumption, and improve safety. For example, data from GPS devices, weather sensors, and traffic cameras can be used to optimize routes for delivery trucks and reduce fuel consumption.
  • 35.
  • 36.
  • 37.
  • 38. Different Algorithms using MapReduce MapReduce is a programming model for processing large data sets that can be used to implement a wide variety of algorithms. Some common algorithms that can be implemented using MapReduce include: 1. Word count: A classic example of a MapReduce algorithm that counts the number of occurrences of each word in a large text corpus. 2. Inverted Index: An inverted index is a data structure that maps each word in a document to a list of documents that contain it. In MapReduce, this can be implemented as a two-step process, with the map function emitting a key-value pair for each word in each document, and the reduce function aggregating the list of document IDs for each word. 3. PageRank: PageRank is an algorithm used by Google to rank web pages based on their importance. In MapReduce, this can be implemented as a series of iterative MapReduce jobs that propagate importance scores from one set of pages to another. 4. K-means clustering: K-means is a popular algorithm for clustering data into a fixed number of clusters. In MapReduce, this can be implemented as a series of MapReduce jobs that iteratively update the cluster centroids based on the assignment of data points to clusters. 5. Matrix multiplication: Matrix multiplication is a fundamental operation in linear algebra that can be used to solve systems of linear equations. In MapReduce, this can be implemented as a two-step process, with the map function emitting intermediate values that represent the product of individual entries in the matrices, and the reduce function aggregating these intermediate values to compute the final product.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44. Matrix-Vector Multiplication by Map Reduce • Matrix-vector multiplication is a common operation in linear algebra, where a matrix and a vector are multiplied to produce another vector. In the context of map reduce, this operation can be performed by dividing the matrix into multiple chunks and distributing them across multiple computers or nodes for parallel processing. • Here's how matrix-vector multiplication could be performed using the Map Reduce framework: • Map step: Each node takes one chunk of the matrix and multiplies it with the entire vector. The result is a partial vector. • Reduce step: The partial vectors from all nodes are combined to form the final result vector.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51. Unit-2 • Introduction To Hadoop • Big Data • Apache Hadoop & Hadoop Eco System • Moving Data in and out of Hadoop • Understanding inputs and outputs of MapReduce • Data Serialization.
  • 52. What is Hadoop? • Hadoop is the solution to above Big Data problems. It is the technology to store massive datasets on a cluster of cheap machines in a distributed manner. Not only this it provides Big Data analytics through distributed computing framework. • It is an open-source software developed as a project by Apache Software Foundation. Doug Cutting created Hadoop. In the year 2008 Yahoo gave Hadoop to Apache Software Foundation. Since then two versions of Hadoop has come. Version 1.0 in the year 2011 and version 2.0.6 in the year 2013. Hadoop comes in various flavors like Cloudera, IBM BigInsight, MapR and Hortonworks.
  • 53. Hadoop consists of three core components • Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop. • Map-Reduce – It is the data processing layer of Hadoop. • YARN – It is the resource management layer of Hadoop
  • 54. Apache Hadoop • Apache Hadoop is an open-source software framework for distributed storage and processing of big data sets across a cluster of computers. • It was developed to address the limitations of traditional centralized systems when it comes to storing and processing large amounts of data.
  • 55. Hadoop consists of two main components 1.Hadoop Distributed File System (HDFS): A scalable and fault-tolerant distributed file system that enables the storage of very large files across a cluster of commodity servers. 2.MapReduce: A programming model for processing large data sets in parallel across a cluster of computers. MapReduce consists of two stages: the map stage, which processes data in parallel on different nodes, and the reduce stage, which aggregates the results.
  • 56.
  • 57. HDFS: • Hadoop Distributed File System provides for distributed storage for Hadoop. HDFS has a master-slave topology • Master is a high-end machine where as slaves are inexpensive computers. The Big Data files get divided into the number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. On the master, we have metadata stored. • HDFS has two daemons running for it. NameNode and DataNode
  • 58. NameNode : • NameNode Daemon runs on the master machine. • It is responsible for maintaining, monitoring and managing DataNodes. • It records the metadata of the files like the location of blocks, file size, permission, hierarchy etc. • Namenode captures all the changes to the metadata like deletion, creation and renaming of the file in edit logs. • It regularly receives heartbeat and block reports from the DataNodes.
  • 59. DataNode: • DataNode runs on the slave machine. • It stores the actual business data. • It serves the read-write request from the user. • DataNode does the ground work of creating, replicating and deleting the blocks on the command of NameNode. • After every 3 seconds, by default, it sends heartbeat to NameNode reporting the health of HDFS.
  • 60. Cont.. • The Hadoop ecosystem is a collection of open-source projects that work together with Hadoop to provide a complete big data solution. Some of the popular projects in the Hadoop ecosystem include: 1. Hive: A data warehousing and SQL-like query language for Hadoop. 2. Pig: A high-level platform for creating MapReduce programs. 3. HBase: A NoSQL database that provides random real-time read/write access to large amounts of structured data. 4. Spark: An open-source, fast, and general-purpose cluster computing framework for big data processing. 5. YARN (Yet Another Resource Negotiator): A resource management system that allocates resources in the cluster for running applications.
  • 61. What is Data? • The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. “90% of the world’s data was generated in the last few years”
  • 62.
  • 63. What is Big Data? • Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is a data with so large size and complexity that none of traditional data management tools can store it or process it efficiently. Big data is also a data but with huge size.
  • 64. What is an Example of Big Data? • The New York Stock Exchange is an example of Big Data that generates about one terabyte of new trade data per day.
  • 65. Social Media • The statistic shows that 500+terabyte s of new data get ingested into the databases of social media site Facebook, every day. This data is mainly generated in terms of photo and video uploads, message exchanges, putting comments etc.
  • 66. • A single Jet engine can generate 10+terabytes of data in 30 minutes of flight time. With many thousand flights per day, generation of data reaches up to many Petabytes.
  • 67. Types Of Big Data • Structured data − Relational data. • Semi Structured data − XML data. • Unstructured data − Word, PDF, Text, Media Logs
  • 68.
  • 69. Structured • Any data that can be stored, accessed and processed in the form of fixed format is termed as a ‘structured’ data. Over the period of time, talent in computer science has achieved greater success in developing techniques for working with such kind of data (where the format is well known in advance) and also deriving value out of it. • However, nowadays, we are foreseeing issues when a size of such data grows to a huge extent, typical sizes are being in the rage of multiple zettabytes. Data stored in a relational database management system is one example of a ‘structured’ data.
  • 70.
  • 71. Unstructured • Any data with unknown form or the structure is classified as unstructured data. • In addition to the size being huge, un- structured data poses multiple challenges in terms of its processing for deriving value out of it. • A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos etc. • Now day organizations have wealth of data available with them but unfortunately, they don’t know how to derive value out of it since this data is in
  • 72.
  • 73. Semi-structured • Semi-structured data can contain both the forms of data. We can see semi- structured data as a structured in form but it is actually not defined with e.g. a table definition in relational DBMS. Example of semi-structured data is a data represented in an XML file.
  • 74.
  • 75. Data Growth over the years
  • 76. Characteristics Of Big Data • Volume • Variety • Velocity • Variability
  • 77. (i) Volume:- • The name Big Data itself is related to a size which is enormous. Size of data plays a very crucial role in determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.
  • 78. (ii) Variety:- • The next aspect of Big Data is its variety. • Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.
  • 79. iii) Velocity:- • The term ‘velocity’ refers to the speed of generation of data. How fast the data is generated and processed to meet the demands, determines real potential in the data. • Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.
  • 80. (iv) Variability:- • This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively.
  • 81. Advantages Of Big Data Processing • Ability to process Big Data in DBMS brings in multiple benefits, such as- • Businesses can utilize outside intelligence while taking decisions • Access to social data from search engines and sites like facebook, twitter are enabling organizations to fine tune their business strategies. • Improved customer service • Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses. • Early identification of risk to the product/services, if any • Better operational efficiency
  • 82. Drivers for Big Data • The main business drivers for such rising demand for Big Data Analytics are : • 1. The digitization of society • 2. The drop in technology costs • 3. Connectivity through cloud computing • 4. Increased knowledge about data science • 5. Social media applications • 6. The rise of Internet-of-Things(IoT) • Example: A number of companies that have Big Data at the core of their strategy like : • Apple, Amazon, Facebook and Netflix have become very successful at the beginning of the 21st century.
  • 83. What is Big Data Analytics? • Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market trends, and customer preferences. Big Data analytics provides various advantages—it can be used for better decision making, preventing fraudulent activities, among other things.
  • 84. Different Types of Big Data Analytics 1. Descriptive Analytics This summarizes past data into a form that people can easily read. This helps in creating reports, like a company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media metrics. Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization across its office and lab space. Using descriptive analytics, Dow was able to identify underutilized space. This space consolidation helped the company save nearly US $4 million annually.
  • 85. 2. Diagnostic Analytics • This is done to understand what caused a problem in the first place. Techniques like drill-down, data mining, and data recovery are all examples. Organizations use diagnostic analytics because they provide an in-depth insight into a particular problem. Use Case: An e-commerce company’s report shows that their sales have gone down, although customers are adding products to their carts. This can be due to various reasons like the form didn’t load correctly, the shipping fee is too high, or there are not enough payment options available. This is where you can use diagnostic analytics to find the reason.
  • 86. 3. Predictive Analytics • This type of analytics looks into the historical and present data to make predictions of the future. Predictive analytics uses data mining, AI, and machine learning to analyze current data and make predictions about the future. It works on predicting customer trends, market trends, and so on. Use Case: PayPal determines what kind of precautions they have to take to protect their clients against fraudulent transactions. Using predictive analytics, the company uses all the historical payment data and user behavior data and builds an algorithm that predicts fraudulent activities.
  • 87. 4. Prescriptive Analytics • This type of analytics prescribes the solution to a particular problem. Perspective analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning. Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of analytics is used to build an algorithm that will automatically adjust the flight fares based on numerous factors, including customer demand, weather, destination, holiday seasons, and oil prices.
  • 88. Big Data Analytics Tools • Hadoop - helps in storing and analyzing data • MongoDB - used on datasets that change frequently • Talend - used for data integration and management • Cassandra - a distributed database used to handle chunks of data • Spark - used for real-time processing and analyzing large amounts of data • STORM - an open-source real-time computational system • Kafka - a distributed streaming platform that is used for fault- tolerant storage
  • 89. Big Data Industry Applications • Ecommerce - Predicting customer trends and optimizing prices are a few of the ways e-commerce uses Big Data analytics • Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which result in improved sales • Education - Used to develop new and improve existing courses based on market requirements • Healthcare - With the help of a patient’s medical history, Big Data analytics is used to predict how likely they are to have health issues • Media and entertainment - Used to understand the demand of shows, movies, songs, and more to deliver a personalized recommendation list to its users • Banking - Customer income and spending patterns help to predict the likelihood of choosing various banking offers, like loans and credit cards • Telecommunications - Used to forecast network capacity and improve customer experience • Government - Big Data analytics helps governments in law enforcement, among other things
  • 90. MapReduce • MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes. • MapReduce provides analytical capabilities for analyzing huge volumes of complex data.
  • 91. Why MapReduce? • Traditional Enterprise Systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. Traditional model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
  • 92. • Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.
  • 93. How MapReduce Works? • The MapReduce algorithm contains two important tasks, namely Map and Reduce. • The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key- value pairs). • The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples. The reduce task is always performed after the map job.
  • 94.
  • 95. Cont.. • Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs. • Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs. • Intermediate Keys − They key-value pairs generated by the mapper are known as intermediate keys. • Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional. • Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task. • Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step. • Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
  • 96.
  • 97.
  • 98.
  • 99. MapReduce- Example • Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how Tweeter manages its tweets with the help of MapReduce.
  • 100. Cont.. • Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs. • Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs. • Count − Generates a token counter per word. • Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
  • 101. MapReduce - Algorithm • The MapReduce algorithm contains two important tasks, namely Map and Reduce. • The map task is done by means of Mapper Class • The reduce task is done by means of Reducer Class.
  • 102. Cont.. • Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as input by Reducer class, which in turn searches matching pairs and reduces them.
  • 103. Cont.. • MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the Map & Reduce tasks to appropriate servers in a cluster. • These mathematical algorithms may include the following: − • Sorting • Searching • Indexing • TF-IDF(Term Frequency (TF)-Inverse Document Frequency (IDF)
  • 104. Unit-2 Introduction of Hadoop Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
  • 105. History of Hadoop • Apache Software Foundation is the developers of Hadoop, and it’s co-founders are Doug Cutting and Mike Cafarella. It’s co- founder Doug Cutting named it on his son’s toy elephant. In October 2003 the first paper release was Google File System. • In January 2006, MapReduce development started on the Apache Nutch which consisted of around 6000 lines coding for it and around 5000 lines coding for HDFS. In April 2006 Hadoop 0.1.0 was released. • Hadoop is an open-source software framework for storing and processing big data. It was created by Apache Software Foundation in 2006, based on a white paper written by Google in 2003 that described the Google File System (GFS) and the MapReduce programming model. • The Hadoop framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. • It is used by many organizations, including Yahoo, Facebook, and IBM, for a variety of purposes such as data warehousing, log processing, and research. Hadoop has been widely adopted in the
  • 106. Features of hadoop: 1. it is fault tolerance. 2. it is highly available. 3. it’s programming is easy. 4. it have huge flexible storage. 5. it is low cost.
  • 107. Hadoop has several key features that make it well-suited for big data processing: • Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data. • Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more capacity as needed. • Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.
  • 108. Cont.. • Data locality: Hadoop provides data locality feature, where the data is stored on the same node where it will be processed, this feature helps to reduce the network traffic and improve the performance • High Availability: Hadoop provides High Availability feature, which helps to make sure that the data is always available and is not lost. • Flexible Data Processing: Hadoop’s MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of
  • 109. Cont.. • Data Integrity: Hadoop provides built-in checksum feature, which helps to ensure that the data stored is consistent and correct. • Data Replication: Hadoop provides data replication feature, which helps to replicate the data across the cluster for fault tolerance. • Data Compression: Hadoop provides built-in data compression feature, which helps to reduce the storage space and improve the performance. • YARN: A resource management platform that allows multiple data processing engines like real- time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
  • 110. What is Hadoop? • Def. “Hadoop is an open source software programming framework for storing a large amount of data and performing the computation”. • Its framework is based on Java programming with some native code in C and shell scripts. • Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. • It is designed to handle big data and is based on the MapReduce programming model,
  • 111. Hadoop has Two main components • HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective. • YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS. • Hadoop also includes several additional modules that provide additional functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database). • Hadoop is commonly used in big data scenarios such as data warehousing, business intelligence, and machine learning. It’s also used for data processing, data analysis, and data mining.
  • 112. Hadoop Distributed File System • It has distributed file system known as HDFS and this HDFS splits files into blocks and sends them across various nodes in form of large clusters. Also in case of a node failure, the system operates and data transfer takes place between the nodes which are facilitated by HDFS.
  • 113. HDFS
  • 114. Advantages of HDFS • Scalability: Hadoop can easily scale to handle large amounts of data by adding more nodes to the cluster. • Cost-effective: Hadoop is designed to work with commodity hardware, which makes it a cost-effective option for storing and processing large amounts of data.
  • 115. Cont.. • Fault-tolerance: Hadoop’s distributed architecture provides built-in fault- tolerance, which means that if one node in the cluster goes down, the data can still be processed by the other nodes. • Flexibility: Hadoop can process structured, semi-structured, and unstructured data, which makes it a versatile option for a wide range of big data scenarios.
  • 116. Cont.. • Open-source: Hadoop is open-source software, which means that it is free to use and modify. This also allows developers to access the source code and make improvements or add new features. • Large community: Hadoop has a large and active community of developers and users who contribute to the development of the software, provide support, and share best practices. • Integration: Hadoop is designed to work with other big data technologies such as Spark, Storm, and Flink, which allows for integration with a wide range of data processing and analysis tools.
  • 117. Disadvantages of HDFS: • Not very effective for small data. • Hard cluster management. • Has stability issues. • Security major concerns. • Complexity: Hadoop can be complex to set up and maintain, especially for organizations without a dedicated team of experts. • Latency: Hadoop is not well-suited for low-latency workloads and may not be the best choice for real-time data processing.
  • 118. Cont.. • Limited Support for Real-time Processing: Hadoop’s batch-oriented nature makes it less suited for real-time streaming or interactive data processing use cases. • Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi- structured data, it is not well-suited for structured data processing • Data Security: Hadoop does not provide built-in security features such as data encryption or user authentication, which can make it difficult to secure sensitive data. • Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model is not well-suited for
  • 119. Cont.. • Limited Support for Graph and Machine Learning: Hadoop’s core component HDFS and MapReduce are not well-suited for graph and machine learning workloads, specialized components like Apache Giraph and Mahout are available but have some limitations. • Cost: Hadoop can be expensive to set up and maintain, especially for organizations with large amounts of data. • Data Loss: In the event of a hardware failure, the data stored in a single node may be lost permanently. • Data Governance: Data Governance is a critical aspect of data management, Hadoop does not provide a built-in feature to manage data lineage, data quality, data cataloging, data lineage, and
• 120. Hadoop framework is made up of the following modules: Hadoop Ecosystem 1. Hadoop MapReduce: a MapReduce programming model for handling and processing large data. 2. Hadoop Distributed File System: stores data as distributed blocks across the nodes of a cluster. 3. Hadoop YARN: a platform which manages computing resources. 4. Hadoop Common: contains packages and libraries which are used by the other modules.
• 121. Apache Hadoop and Hadoop Eco System • Apache Hadoop is an open-source software framework used to develop data processing applications which are executed in a distributed computing environment. • Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. • Commodity computers are cheap and widely available. • They are mainly useful for achieving greater computational power at low cost.
• 122. Components of the Hadoop ecosystem: • HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator • MapReduce: Programming based Data Processing • Spark: In-Memory data processing • PIG, HIVE: Query based processing of data services • HBase: NoSQL Database • Mahout, Spark MLlib: Machine Learning algorithm libraries • Solr, Lucene: Searching and Indexing • Zookeeper: Managing cluster • Oozie: Job Scheduling
• 125. Apache Hadoop consists of two sub-projects: 1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications which are run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes. 2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
  • 126. Hadoop Architecture Hadoop has a Master-Slave Architecture for data storage and distributed data processing using MapReduce and HDFS methods.
• 130. Name Node: The NameNode represents every file and directory used in the namespace. Data Node: A DataNode manages the state of an HDFS node and allows you to interact with its blocks. Master Node: The master node conducts the parallel processing of data using Hadoop MapReduce. Slave node: The slave nodes are the additional machines in the Hadoop cluster which store data and carry out complex calculations. Moreover, every slave node runs a Task Tracker and a DataNode, which synchronize their processes with the Job Tracker and the NameNode respectively.
• 132. Data storage nodes in HDFS: • NameNode (Master) • DataNode (Slave)
• 133. NameNode: • The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). • The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata can be the transaction logs that keep track of the user’s activity in the Hadoop cluster. • Metadata can also be the name of the file, its size, and the information about the location (block number, block IDs) on the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. • The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
  • 134. Cont.. • NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). • NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. • The HDFS architecture is built in such a way that the user data never resides on the NameNode. The data resides on DataNodes only.
  • 135. Functions of NameNode • It is the master daemon that maintains and manages the DataNodes (slave nodes) • It records the metadata of all the files stored in the cluster, e.g. The location of blocks stored, the size of the files, permissions, hierarchy, etc. • There are two files associated with the metadata: • FsImage: It contains the complete state of the file system namespace since the start of the NameNode. • EditLogs: It contains all the recent modifications made to the file system with respect to the most recent FsImage.
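To make the kind of metadata served by the NameNode concrete, here is a minimal sketch using the HDFS Java client API. The path /user/test, the class name, and the default cluster configuration are assumptions for illustration only; the listStatus and getFileBlockLocations calls are answered by the NameNode, not the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsMetadata {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // loads cluster settings (core-site.xml etc.) from the classpath
            FileSystem fs = FileSystem.get(conf);       // client handle backed by the NameNode
            for (FileStatus status : fs.listStatus(new Path("/user/test"))) {
                // File-level metadata: path, size, replication factor, permissions.
                System.out.println(status.getPath() + " size=" + status.getLen()
                        + " replication=" + status.getReplication()
                        + " permission=" + status.getPermission());
                // Block locations are also metadata kept by the NameNode.
                for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    System.out.println("  block at offset " + loc.getOffset()
                            + " hosts=" + String.join(",", loc.getHosts()));
                }
            }
            fs.close();
        }
    }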
• 136. DataNode: • DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. • The more DataNodes there are, the more data the Hadoop cluster is able to store. • A DataNode should have high storage capacity to hold a large number of file blocks.
• 137. File Block In HDFS: • Data in HDFS is always stored in terms of blocks. A file is divided into blocks of 128 MB, which is the default size and can also be changed manually. Suppose you upload a file of 400 MB to HDFS: it is split into 128 MB + 128 MB + 128 MB + 16 MB, i.e. 4 blocks, and the last block occupies only the space it actually needs.
• 138. Replication In HDFS • Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of that particular thing is its Replication Factor. As we have seen in file blocks, HDFS stores the data in the form of blocks, and at the same time Hadoop is configured to make copies of those file blocks. • By default, the Replication Factor for Hadoop is set to 3, which is configurable, meaning you can change it manually as per your requirement. In the above example we made 4 file blocks, so with 3 copies of each file block a total of 4 × 3 = 12 blocks are stored for backup purposes.
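A small sketch of the arithmetic from the two slides above; the file size, block size and replication factor are taken from the example, and the class name is hypothetical.

    public class BlockMath {
        public static void main(String[] args) {
            long fileSizeMb = 400;
            long blockSizeMb = 128;      // default dfs.blocksize in Hadoop 2.x/3.x
            int replicationFactor = 3;   // default dfs.replication

            long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;  // ceiling division -> 4 blocks
            long storedBlocks = blocks * replicationFactor;              // 4 x 3 = 12 stored block copies

            System.out.println("Blocks: " + blocks + ", block copies stored: " + storedBlocks);
        }
    }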
• 139. Rack Awareness • A rack is nothing but a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). • A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode, achieving maximum performance while performing reads and writes, which reduces network traffic.
• 140. • A Hadoop cluster consists of data centers, racks, and the nodes which actually execute jobs. • Here, a data center consists of racks and a rack consists of nodes. • The network bandwidth available to processes varies depending upon the location of the processes.
• 141. Moving Data In and Out of Hadoop • Moving data in and out of Hadoop, which we refer to as data ingress and egress, is the process by which data is transported from an external system into an internal system, and vice versa. • Hadoop supports ingress and egress at a low level in HDFS and MapReduce.
  • 144. Understanding Inputs and Outputs of MapReduce Done in Class
  • 149. Understanding Inputs and Outputs of MapReduce • Inputs and Outputs • The MapReduce model operates on <key, value> pairs. • It views the input to the jobs as a set of <key, value> pairs and produces a different set of <key, value> pairs as the output of the jobs. • Data input is supported by two classes in this framework, namely InputFormat and RecordReader.
• 150. Cont.. • The former is consulted to determine how the input data should be partitioned for the map tasks, while the latter reads the data from the inputs. • For the data output there are also two classes, OutputFormat and RecordWriter. • The first class performs a basic validation of the data sink properties and the second class is used to write each reducer output to the data sink.
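As a hedged illustration of how these pieces fit together in the Java API: the sketch below reuses FileInputFormat's split logic and Hadoop's built-in LineRecordReader, which is essentially what the standard TextInputFormat already does. The class name SimpleLineInputFormat is illustrative, not a Hadoop class.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class SimpleLineInputFormat extends FileInputFormat<LongWritable, Text> {
        // getSplits() is inherited from FileInputFormat: it partitions the input
        // files into InputSplits (by default roughly one split per block).
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            // The RecordReader turns the bytes of one split into <key, value> records:
            // here the key is the byte offset and the value is the line of text.
            return new LineRecordReader();
        }
    }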
• 152. MapReduce Phase • Input Splits: An input in the MapReduce model is divided into small fixed-size parts called input splits. Each split is consumed by a single map. The input data is generally a file or directory stored in HDFS. • Mapping: This is the first phase of map-reduce program execution, where the data in each split is passed, line by line, to a mapper function which processes it and produces output values. • Shuffling: Part of the output phase of mapping, where the relevant records are consolidated from the map output. It consists of merging and sorting: all the key-value pairs which have the same key are combined, and the merged pairs are sorted by key. • Reduce: All the values from the shuffling phase are combined for each key and a single output value is returned per key, thus summarizing the entire dataset.
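The classic word-count Mapper and Reducer make these phases concrete. This is the standard textbook example, sketched here with the new (org.apache.hadoop.mapreduce) API; the map phase emits <word, 1> pairs, the shuffle groups them by word, and the reduce phase sums the counts.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);   // emit one <word, 1> pair per token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();              // add up the 1s grouped under this word
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }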
• 153. Hadoop InputFormat • The Hadoop InputFormat checks the input specification of the job. • The InputFormat splits the input file into InputSplits and assigns each split to an individual Mapper. • There are different methods to get the data to the mapper and different types of InputFormat in Hadoop, such as FileInputFormat, TextInputFormat, KeyValueTextInputFormat, etc.
• 154. Cont. • How the input files are split up and read in Hadoop is defined by the InputFormat. • A Hadoop InputFormat is the first component in MapReduce; it is responsible for creating the input splits and dividing them into records. • Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. • Although the format of these files is arbitrary, line-based log files and binary formats can be used. • Using the InputFormat we define how these input files are split and read.
• 155. InputFormat Class: • The files or other objects that should be used for input are selected by the InputFormat. • The InputFormat defines the data splits, which determine both the size of individual Map tasks and their potential execution servers. • The InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
• 161. Cont.. TextOutputFormat: • MapReduce’s default reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. SequenceFileOutputFormat: • An output format which writes sequence files for its output; it is an intermediate format used between MapReduce jobs. MapFileOutputFormat: • Another form of FileOutputFormat in Hadoop, which is used to write the output as map files. DBOutputFormat: • An output format for writing to relational databases and HBase. MultipleOutputs: • Allows writing data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.
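A driver sketch showing where the InputFormat and OutputFormat are selected for a job. It assumes the WordCount Mapper and Reducer sketched earlier; the class name WordCountDriver is illustrative, and the input and output paths are taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setInputFormatClass(TextInputFormat.class);     // how the input is split and read
            job.setOutputFormatClass(TextOutputFormat.class);   // how <key, value> pairs are written

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }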
• 162. YARN (Yet Another Resource Negotiator) • YARN is the framework on which MapReduce works. YARN performs 2 operations: job scheduling and resource management. • The purpose of the Job Scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized. • The Job Scheduler also keeps track of which job is important, which job has more priority, dependencies between the jobs, and all the other information like job timing, etc. The Resource Manager is used to manage all the resources that are made available for running the Hadoop cluster.
• 163. Data Serialization • Data serialization is the process of converting structured data into a stream of bytes; deserialization converts that stream back to the original form. • We serialize to translate data structures into a stream of data that can be transmitted over the network or stored in a database regardless of the system architecture. • Consider CSV files: if the data itself contains a comma (,), deserialization may produce wrong outputs. If the metadata is stored in a self-describing form such as XML, the data can be deserialized reliably.
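A small sketch of Hadoop's own Writable serialization, round-tripping a (Text, IntWritable) pair through an in-memory byte stream; this is the same mechanism MapReduce uses for intermediate data and SequenceFiles. The class name is illustrative.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableRoundTrip {
        public static void main(String[] args) throws Exception {
            // Serialize: translate the structures into a stream of bytes.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            new Text("hadoop").write(out);
            new IntWritable(42).write(out);
            out.close();

            // Deserialize: rebuild the original objects from the byte stream,
            // independent of the architecture that produced it.
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            Text word = new Text();
            IntWritable count = new IntWritable();
            word.readFields(in);
            count.readFields(in);
            System.out.println(word + " -> " + count.get());
        }
    }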
  • 164. Unit-3 • Hadoop Architecture • Hadoop Architecture, • Hadoop Storage: HDFS, Common Hadoop Shell commands, Anatomy of File Write and Read., • NameNode, Secondary NameNode, and DataNode, • Hadoop MapReduce paradigm, Map and Reduce tasks, Job, • Task trackers - Cluster Setup – SSH &Hadoop Configuration – HDFS Administering –Monitoring & Maintenance.
• 166. • History of Hadoop in the following steps: • In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project. • While working on Apache Nutch, they were dealing with big data. Storing that data would have been very costly, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop. • In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data. • In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
• 167. Cont.. • In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce. • In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year. • Doug Cutting named his project Hadoop after his son's toy elephant. • In 2007, Yahoo ran two clusters of 1000 machines. • In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster within 209 seconds. • In 2013, Hadoop 2.2 was released. • In 2017, Hadoop 3.0 was released.
• 169. Hadoop Architecture • Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and maintain huge amounts of data. • Hadoop works on the MapReduce programming algorithm that was introduced by Google. • Big-brand companies use Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc.
  • 170. Hadoop Architecture • The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop Distributed File System). • The MapReduce engine can be MapReduce/MR1 or YARN/MR2. • A Hadoop cluster consists of a single master and multiple slave nodes. • The master node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and TaskTracker.
  • 172. The Hadoop Architecture Mainly consists of 4 components • MapReduce • HDFS(Hadoop Distributed File System) • YARN(Yet Another Resource Negotiator) • Common Utilities or Hadoop Common
• 174. 1. MapReduce • MapReduce is essentially an algorithm, or programming model, that runs on top of the YARN framework. • The major feature of MapReduce is that it performs distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop so fast. • When you are dealing with Big Data, serial processing is no longer of any use. • MapReduce mainly has 2 tasks which are divided phase-wise: • In the first phase, Map is utilized, and in the next phase, Reduce is utilized.
• 175. The input is provided to the Map() function, then its output is used as input to the Reduce() function, and after that we receive our final output.
• 176. Cont.. • The input is provided to Map(); since we are dealing with Big Data, the input is a large set of data. • The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. • These key-value pairs are then sent as input to Reduce(). • The Reduce() function combines the tuples (key-value pairs) based on their key, forms a new set of tuples, and performs operations such as sorting or summation, which is then sent to the final output node. For example, in a word count job, Map() emits pairs such as (word, 1) and Reduce() sums the values for each word. • Finally, the output is obtained.
  • 184. Anatomy of File Write and Read
• 185. NameNode • It is the single master server existing in the HDFS cluster. • As it is a single node, it may become a single point of failure. • It manages the file system namespace by executing operations such as opening, renaming and closing files. • It simplifies the architecture of the system.
• 186. DataNode • The HDFS cluster contains multiple DataNodes. • Each DataNode contains multiple data blocks. • These data blocks are used to store data. • The responsibility of the DataNode is to serve read and write requests from the file system's clients. • The DataNode performs block creation, deletion, and replication upon instruction from the NameNode.
  • 187. Job Tracker • The role of Job Tracker is to accept the MapReduce jobs from client and process the data by using NameNode. • In response, NameNode provides metadata to Job Tracker.
• 188. Task Tracker • The Task Tracker works as a slave node for the Job Tracker. • It receives the task and code from the Job Tracker and applies that code to the file. • This process can also be called a Mapper.
  • 189. MapReduce Layer • The MapReduce comes into existence when the client application submits the MapReduce job to Job Tracker. • In response, the Job Tracker sends the request to the appropriate Task Trackers. • Sometimes, the TaskTracker fails or time out. • In such a case, that part of the job is rescheduled.
• 190. Advantages of Hadoop • Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours. • Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster. • Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system. • Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop takes another copy of the data and uses it. Normally, data is replicated thrice, but the replication factor is configurable.
• 194. Cont.. • Since all the metadata is stored in the NameNode, it is very important. • If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the DataNodes. To overcome this, the concept of the Secondary NameNode arises. • Secondary NameNode: It is a separate physical machine which acts as a helper of the NameNode. • It performs periodic checkpoints. • It communicates with the NameNode and takes snapshots of the metadata, which helps minimize downtime and loss of data.
  • 195. NameNode • NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). • NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.
• 196. Main functions performed by the NameNode: • 1. Stores metadata of the actual data, e.g. filename, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration. 2. Manages the file system namespace. 3. Regulates client access requests for the actual file data. 4. Assigns work to the slaves (DataNodes). 5. Executes file system namespace operations like opening/closing files and renaming files and directories. 6. As the NameNode keeps metadata in memory for fast retrieval, a huge amount of memory is required for its operation. • It should be hosted on reliable hardware.
• 197. DataNode • The DataNode works as a slave in the Hadoop cluster. Main functions performed by the DataNode: • 1. Actually stores the business data. 2. This is the actual worker node where read/write/data processing is handled. 3. Upon instruction from the master, it performs creation/replication/deletion of data blocks. 4. As all the business data is stored on DataNodes, a huge amount of storage is required for their operation. • Commodity hardware can be used for hosting DataNodes.
• 198. Secondary NameNode • Secondary NameNode: It is a separate physical machine which acts as a helper of the NameNode. • It performs periodic checkpoints. It communicates with the NameNode and takes snapshots of the metadata, which helps minimize downtime and loss of data. • The Secondary NameNode is not a backup of the NameNode; you can call it a helper of the NameNode. • The NameNode is the master daemon which maintains and manages the DataNodes. • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
• 199. Cont.. • In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes. • It stores the metadata of all the files stored in HDFS, e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc. It maintains 2 files: • FsImage: Contains the complete state of the file system namespace since the start of the NameNode. • EditLogs: Contains all the recent modifications made to the file system with respect to the most recent FsImage. • The Secondary NameNode, by contrast, periodically reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
• 200. HDFS • HDFS is a distributed file system that handles large data sets running on commodity hardware. • Hadoop comes with a distributed file system called HDFS. • In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications. • It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, DataNodes and the NameNode. • It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. • HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.
• 201. Where to use HDFS • Very Large Files: Files should be of hundreds of megabytes, gigabytes or more. • Streaming Data Access: The time to read the whole data set is more important than the latency of reading the first record. HDFS is built on a write-once, read-many-times pattern. • Commodity Hardware: It works on low-cost hardware. Where not to use HDFS • Low-Latency Data Access: Applications that require very fast access to the first record should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the first record. • Lots of Small Files: The NameNode holds the metadata of files in memory, and if the files are small in size they consume a lot of the NameNode's memory, which is not feasible. • Multiple Writes: It should not be used when we have to write multiple times.
• 202. HDFS Concepts 1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a disk file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; e.g. a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks. 2. Name Node: HDFS works in a master-worker pattern where the NameNode acts as the master. The NameNode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the NameNode, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. The file system operations like opening, closing, renaming etc. are executed by it. 3. Data Node: DataNodes store and retrieve blocks when they are told to, by the client or the NameNode. They report back to the NameNode periodically with the list of blocks that they are storing. The DataNode, being commodity hardware, also does the work of block creation, deletion and replication as directed by the NameNode.
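To show how a client touches these concepts, here is a minimal sketch that writes and reads a small file through the FileSystem API. The NameNode resolves the path and block locations, while the bytes themselves stream to and from the DataNodes; the path /user/test/hello.txt and the class name are placeholders.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/test/hello.txt");

            // Write: the client asks the NameNode for target DataNodes, then streams the bytes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches block locations from the NameNode,
            // then reads the blocks from the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }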
  • 203. Common Hadoop Shell commands • The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports.
  • 204.
  • 205.
  • 206.
  • 207.
• 208. For example, a new directory named “example” can be created using the mkdir command, and the result can then be verified using the ls command.
• 212. HDFS Basic File Operations 1. Putting data into HDFS from the local file system First create a folder in HDFS where data can be put from the local file system. $ hadoop fs -mkdir /user/test Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test $ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test Display the content of the HDFS folder $ hadoop fs -ls /user/test
• 213. Cont.. 2. Copying data from HDFS to the local file system $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt 3. Compare the files and see that both are the same $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt Recursive deleting (deprecated in newer releases in favour of hadoop fs -rm -r) hadoop fs -rmr <arg> Example: hadoop fs -rmr /user/sonoo/
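The same put/get operations can also be done programmatically through the FileSystem Java API; a hedged sketch mirroring the shell commands above (the paths are the same placeholders, and the class name is illustrative).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopyExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/user/test"));                       // hadoop fs -mkdir /user/test
            fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"),
                                 new Path("/user/test"));            // hadoop fs -copyFromLocal
            fs.copyToLocalFile(new Path("/user/test/data.txt"),
                               new Path("/usr/bin/data_copy.txt"));  // hadoop fs -copyToLocal

            fs.close();
        }
    }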
  • 214. MapReduce • Map Task • Reduce Task • Job • Task Trackers-Cluster setup • SSH • Hadoop Configuration • HDFS Administering • Monitoring and Maintenance
  • 215. Map Task • A Map Task is a single instance of a MapReduce app. These tasks determine which records to process from a data block. • The input data is split and analyzed, in parallel, on the assigned compute resources in a Hadoop cluster. • This step of a MapReduce job prepares the <key, value> pair output for the reduce step.
• 216. Map and Reduce Stages • Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. • Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS.
• 218. SSH • SSH setup is required to carry out different operations on a cluster, such as starting and stopping the distributed daemons. • Hadoop core requires a shell (SSH) to communicate with the slave nodes and to create processes on the slave nodes. The communication is frequent when the cluster is live and working in a fully distributed environment.