Big Data Analytics
Dr. BK Verma
Professor in CSE(ET)
In charge: - Data Science Branch
6-AIDS
Syllabus of BDA
• UNIT-I Introduction To Big Data - Distributed file system, Big Data and its
importance, Four Vs, Drivers for Big data, big data analytics, big data
applications. Algorithms using map reduce, Matrix-Vector Multiplication
by Map Reduce.
• UNIT-II Introduction To Hadoop- Big Data – Apache Hadoop & Hadoop Eco
System – Moving Data in and out of Hadoop – Understanding inputs and
outputs of MapReduce - Data Serialization.
• UNIT- III Hadoop Architecture - Hadoop Architecture, Hadoop Storage:
HDFS, Common Hadoop Shell commands, Anatomy of File Write and
Read., NameNode, Secondary NameNode, and DataNode, Hadoop
MapReduce paradigm, Map and Reduce tasks, Job, Task trackers - Cluster
Setup – SSH & Hadoop Configuration – HDFS Administering – Monitoring &
Maintenance.
• UNIT-IV Hadoop Ecosystem And Yarn -Hadoop ecosystem components -
Schedulers - Fair and Capacity, Hadoop 2.0 New Features- NameNode High
Availability, HDFS Federation, MRv2, YARN, Running MRv1 in YARN.
Big Data
• Big data refers to the large and complex data sets that
are generated by various sources such as social media,
online transactions, and machine sensors.
• These data sets are too big, too fast, and too complex
to be processed and analyzed by traditional data
processing tools and methods.
• The growth of big data has been driven by the
explosion of digital devices, cloud computing, and
the internet of things (IoT), which has created a large
amount of data.
• This data contains valuable information that can be
used to improve business processes, gain new insights,
and make better decisions.
Do you know – Every minute we send 204 million emails, generate 1.8
million Facebook likes, send 278 thousand Tweets, and upload 200,000
photos to Facebook.
The largest contributor of data is social media. For instance, Facebook
generates 500 TB of data every day. Twitter generates 8TB of data daily.
3Vs
• Big data can be characterized by its 3Vs: Volume, Velocity,
and Variety.
• Volume refers to the sheer size of the data sets, which can
range from terabytes to petabytes.
• Velocity refers to the speed at which the data is generated and
needs to be processed.
• Variety refers to the range of different types of data, including
structured data (e.g., databases), unstructured data (e.g., text,
images, and videos), and semi-structured data (e.g., social
media posts).
• To deal with big data, organizations are turning to new
technologies and tools such as Hadoop, Spark, and NoSQL
databases. These tools are designed to handle the scale and
complexity of big data and enable organizations to extract
valuable insights from it.
Four Vs
The Four Vs of big data refer to the characteristics of
big data that make it different from traditional data. The
Four Vs are
1. Volume: The sheer amount of data generated and
stored by organizations and individuals.
2. Velocity: The speed at which data is generated and
processed. This refers to real-time streaming data as
well as batch processing of large amounts of data.
3. Variety: The different forms of data, such as
structured, semi-structured, and unstructured data.
4. Veracity: The quality and accuracy of the data,
including data sources, data context, and the
possibility of missing, incomplete, or inconsistent
data
Distributed file system(DFS):
• A distributed file system (DFS) is a type of file system that allows
multiple users to access and manage a shared pool of configurable
computer storage that is spread across several machines.
• The file system is distributed across many interconnected
nodes, each of which may provide one or more parts of the
overall file system.
• This means that users can access the file system from any
node, and the file system will automatically manage the data
distribution and replication.
• The goal of a distributed file system is to provide a single,
unified namespace and file management interface, so that
users can access their data from any location, without having
to worry about the underlying network topology or physical
storage locations.
Big Data and its importance
• Big Data refers to the large volume of data,
both structured and unstructured, that
inundates a business on a day-to-day basis.
• Big Data can be analyzed for insights that lead
to better decisions and strategic business
moves.
• For example: a company might analyze
customer purchase history to improve its sales
and marketing strategies.
• For Example: a healthcare organization might
analyze patient data to improve patient
outcomes.
Cont..
• The importance of Big Data lies in the insights it
provides. With the right tools and techniques,
organizations can unlock valuable information
and use it to improve their operations and gain
a competitive advantage.
Big Data can also help organizations:
• Enhance customer experiences
• Increase efficiency and productivity
• Identify new business opportunities
• Improve risk management and fraud detection
• Optimize supply chains and logistics
Big Data Analytics
No SQL Tool
Example
Drivers for Big data
There are several drivers that have led to the growth of big
data, including:
1. Technological advancements: The exponential growth of digital
data is largely due to advancements in technology such as the
internet, mobile devices, and sensors, which have made it easier to
collect, store, and share data.
2. Increased use of cloud computing: The widespread adoption of
cloud computing has made it possible for organizations to store and
process large amounts of data cost-effectively.
3. Internet of Things (IoT): The increasing number of connected
devices, such as sensors and smart devices, has led to the generation
of large amounts of data in real-time.
4. Social media: Social media platforms generate vast amounts of
user-generated data, including text, images, and videos.
5. Business requirements: Organizations are looking to leverage big
data to gain a competitive advantage and improve decision-making.
6. Government initiatives: Governments around the world are
investing in big data initiatives to improve public services and drive
economic growth
Big Data Analytics
• Big Data analytics is the process of examining, transforming, and
modeling large and complex data sets to uncover hidden
patterns, correlations, and other insights that can inform
decision making and support business objectives.
• The goal of Big Data analytics is to turn data into actionable
information and knowledge that can drive growth, improve
operational efficiency, and enhance competitiveness.
• Big Data analytics requires a combination of technologies, tools, and
techniques, including data storage, processing, and visualization.
• Hadoop, Spark, and NoSQL databases are popular technologies
used to manage and store big data, while machine learning
algorithms, statistical models, and data visualization tools are used
to perform the analytics.
Cont..
• The applications of Big Data analytics are
numerous and span across multiple industries,
including finance, healthcare, retail,
transportation, and many others.
• For example, in healthcare, Big Data analytics
can be used to analyze electronic medical
records to identify high-risk patients and
improve patient outcomes, while in retail, it
can be used to improve customer
engagement and experience through
personalized marketing and product
recommendations.
Big Data Applications
1. Healthcare: Big data is being used in healthcare to improve patient outcomes,
reduce costs, and improve overall efficiency. For example, electronic health records
(EHRs) generate massive amounts of data that can be analyzed to identify trends,
track patient outcomes, and support research.
2. Finance: Financial institutions use big data to analyze customer behavior, detect
fraud, and make investment decisions. For example, credit card companies use
big data to analyze spending patterns to identify and prevent fraudulent
transactions.
3. Retail: Retail companies use big data to gain insights into consumer behavior
and preferences, optimize pricing and promotions, and personalize the shopping
experience. For example, online retailers use big data to recommend products based
on a customer's past purchases and online behavior.
4. Manufacturing: Manufacturing companies use big data to optimize their supply
chain, improve production processes, and reduce costs. For example, data
generated by sensors on manufacturing equipment can be analyzed to identify
inefficiencies and improve production processes.
5. Transportation: The transportation industry uses big data to optimize routes,
reduce fuel consumption, and improve safety. For example, data from GPS
devices, weather sensors, and traffic cameras can be used to optimize routes for
delivery trucks and reduce fuel consumption.
Different Algorithms using MapReduce
MapReduce is a programming model for processing large data sets that can be used to
implement a wide variety of algorithms. Some common algorithms that can be implemented
using MapReduce include:
1. Word count: A classic example of a MapReduce algorithm that counts the number of
occurrences of each word in a large text corpus (a minimal Java sketch follows this list).
2. Inverted Index: An inverted index is a data structure that maps each word in a document to a
list of documents that contain it. In MapReduce, this can be implemented as a two-step
process, with the map function emitting a key-value pair for each word in each document, and
the reduce function aggregating the list of document IDs for each word.
3. PageRank: PageRank is an algorithm used by Google to rank web pages based on their
importance. In MapReduce, this can be implemented as a series of iterative MapReduce jobs
that propagate importance scores from one set of pages to another.
4. K-means clustering: K-means is a popular algorithm for clustering data into a fixed number
of clusters. In MapReduce, this can be implemented as a series of MapReduce jobs that
iteratively update the cluster centroids based on the assignment of data points to clusters.
5. Matrix multiplication: Matrix multiplication is a fundamental operation in linear algebra
that can be used to solve systems of linear equations. In MapReduce, this can be implemented
as a two-step process, with the map function emitting intermediate values that represent the
product of individual entries in the matrices, and the reduce function aggregating these
intermediate values to compute the final product.
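For reference, here is a minimal Java sketch of the word-count mapper and reducer using the org.apache.hadoop.mapreduce API. It is an illustrative sketch, not taken from the slides, and the class names are chosen arbitrarily.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every input line, emit (word, 1) for each token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);            // intermediate pair: (word, 1)
    }
  }
}

// Reducer: sum the counts received for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // final pair: (word, total count)
  }
}

The same mapper/reducer pair can serve as a template for the other algorithms in the list: only the emitted keys and the aggregation logic change.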
Matrix-Vector Multiplication by Map Reduce
• Matrix-vector multiplication is a common operation in
linear algebra, where a matrix and a vector are multiplied to
produce another vector. In the context of MapReduce, this
operation can be performed by dividing the matrix into
multiple chunks and distributing them across multiple
computers or nodes for parallel processing.
• Here's how matrix-vector multiplication could be performed
using the MapReduce framework (a Java sketch follows these steps):
• Map step: Each node takes one chunk of the matrix and
multiplies it with the entire vector. The result is a partial
vector.
• Reduce step: The partial vectors from all nodes are
combined to form the final result vector.
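A minimal Java sketch of this scheme is shown below. It assumes the matrix is stored as text lines of the form "i j m_ij" and that the vector is small enough to be loaded into memory by every mapper from a local side file; the file name vector.txt, the input layout, and the class names are assumptions made for illustration, not part of the slides.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: each matrix entry "i j m_ij" contributes the partial product m_ij * v[j] to row i.
public class MatrixVectorMapper
    extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
  private final Map<Long, Double> vector = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: the vector is available on every node as a local file
    // "vector.txt" with lines "j v_j" (for example via the distributed cache).
    try (BufferedReader in = new BufferedReader(new FileReader("vector.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.trim().split("\\s+");
        vector.put(Long.parseLong(parts[0]), Double.parseDouble(parts[1]));
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().trim().split("\\s+");
    long i = Long.parseLong(parts[0]);
    long j = Long.parseLong(parts[1]);
    double mij = Double.parseDouble(parts[2]);
    // Emit (row index, partial product).
    context.write(new LongWritable(i), new DoubleWritable(mij * vector.get(j)));
  }
}

// Reduce step: sum the partial products of row i to obtain the i-th entry of the result vector.
class MatrixVectorReducer
    extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
  @Override
  protected void reduce(LongWritable row, Iterable<DoubleWritable> partials, Context context)
      throws IOException, InterruptedException {
    double sum = 0.0;
    for (DoubleWritable p : partials) {
      sum += p.get();
    }
    context.write(row, new DoubleWritable(sum));
  }
}

If the vector is too large to fit in memory, the standard refinement is to split both the matrix and the vector into stripes and run the same job per stripe.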
Unit-2
• Introduction To Hadoop
• Big Data
• Apache Hadoop & Hadoop Eco System
• Moving Data in and out of Hadoop
• Understanding inputs and outputs of
MapReduce
• Data Serialization.
What is Hadoop?
• Hadoop is the solution to the above Big Data
problems. It is the technology to store massive
datasets on a cluster of cheap machines in a
distributed manner. Not only this, it also provides
Big Data analytics through a distributed computing
framework.
• It is open-source software developed as a
project by the Apache Software Foundation. Doug
Cutting created Hadoop. In the year 2008,
Yahoo gave Hadoop to the Apache Software
Foundation. Since then, two versions of Hadoop
have come out: version 1.0 in the year 2011 and
version 2.0.6 in the year 2013. Hadoop comes in
various flavors like Cloudera, IBM BigInsight,
MapR and Hortonworks.
Hadoop consists of three core
components
• Hadoop Distributed File System (HDFS)
– It is the storage layer of Hadoop.
• Map-Reduce – It is the data processing
layer of Hadoop.
• YARN – It is the resource management
layer of Hadoop
Apache Hadoop
• Apache Hadoop is an open-source software
framework for distributed storage and
processing of big data sets across a cluster of
computers.
• It was developed to address the limitations of
traditional centralized systems when it comes
to storing and processing large amounts of
data.
Hadoop consists of two main
components
1.Hadoop Distributed File System (HDFS): A
scalable and fault-tolerant distributed file system
that enables the storage of very large files across a
cluster of commodity servers.
2.MapReduce: A programming model for
processing large data sets in parallel across a
cluster of computers. MapReduce consists of
two stages: the map stage, which processes data
in parallel on different nodes, and the reduce
stage, which aggregates the results.
HDFS:
• The Hadoop Distributed File System provides
distributed storage for Hadoop. HDFS has a
master-slave topology.
• The master is a high-end machine whereas the slaves
are inexpensive computers. The Big Data files
get divided into a number of blocks. Hadoop
stores these blocks in a distributed fashion on
the cluster of slave nodes. On the master, we
have metadata stored.
• HDFS has two daemons running for it:
NameNode and DataNode.
NameNode :
• NameNode Daemon runs on the master
machine.
• It is responsible for maintaining, monitoring
and managing DataNodes.
• It records the metadata of the files like the
location of blocks, file size, permission,
hierarchy etc.
• Namenode captures all the changes to the
metadata like deletion, creation and renaming
of the file in edit logs.
• It regularly receives heartbeat and block
reports from the DataNodes.
DataNode:
• DataNode runs on the slave machine.
• It stores the actual business data.
• It serves read-write requests from the
user.
• DataNode does the ground work of creating,
replicating and deleting the blocks on the
command of NameNode.
• Every 3 seconds, by default, it sends a
heartbeat to the NameNode reporting the health
of HDFS.
Cont..
• The Hadoop ecosystem is a collection of open-source projects
that work together with Hadoop to provide a complete big data
solution. Some of the popular projects in the Hadoop
ecosystem include:
1. Hive: A data warehousing and SQL-like query language for
Hadoop.
2. Pig: A high-level platform for creating MapReduce programs.
3. HBase: A NoSQL database that provides random real-time
read/write access to large amounts of structured data.
4. Spark: An open-source, fast, and general-purpose cluster
computing framework for big data processing.
5. YARN (Yet Another Resource Negotiator): A resource
management system that allocates resources in the cluster for
running applications.
What is Data?
• The quantities, characters, or symbols
on which operations are performed by
a computer, which may be stored and
transmitted in the form of electrical
signals and recorded on magnetic,
optical, or mechanical recording
media.
“90% of the world’s data was generated in the last few years”
What is Big Data?
• Big Data is a collection of data that is
huge in volume, yet growing
exponentially with time. It is data of
such large size and complexity that
none of the traditional data management
tools can store or process it
efficiently. In short, Big Data is also
data, but of huge size.
What is an
Example of
Big Data?
• The New York Stock
Exchange is an example of
Big Data that generates
about one terabyte of new
trade data per day.
Social
Media
• The statistic
shows that 500+
terabytes of new
data get ingested
into the databases
of the social media
site Facebook
every day. This
data is mainly
generated in
terms of photo
and video
uploads, message
exchanges,
putting up comments,
etc.
• A single Jet engine can
generate 10+terabytes of data
in 30 minutes of flight time.
With many thousand flights per
day, generation of data reaches
up to many Petabytes.
Types Of Big Data
• Structured data − Relational data.
• Semi Structured data − XML data.
• Unstructured data − Word, PDF, Text,
Media Logs
Structured
• Any data that can be stored, accessed and processed
in the form of fixed format is termed as a ‘structured’
data. Over the period of time, talent in computer
science has achieved greater success in developing
techniques for working with such kind of data (where
the format is well known in advance) and also
deriving value out of it.
• However, nowadays we are foreseeing issues when the
size of such data grows to a huge extent, with typical
sizes being in the range of multiple zettabytes.
Data stored in a relational database management system is one example
of a ‘structured’ data.
Unstructured
• Any data with unknown form or the
structure is classified as unstructured
data.
• In addition to the size being huge, un-
structured data poses multiple
challenges in terms of its processing for
deriving value out of it.
• A typical example of unstructured data is
a heterogeneous data source containing
a combination of simple text files,
images, videos etc.
• Nowadays organizations have a wealth of
data available with them but,
unfortunately, they don't know how to
derive value out of it since this data is in
its raw, unstructured form.
Semi-structured
• Semi-structured data can contain both
forms of data. We can see semi-
structured data as structured in form,
but it is actually not defined with, e.g.,
a table definition in a relational DBMS.
An example of semi-structured data is
data represented in an XML file.
Data Growth over the years
Characteristics Of Big Data
• Volume
• Variety
• Velocity
• Variability
(i) Volume:-
• The name Big Data itself is related to a
size which is enormous. Size of data
plays a very crucial role in determining
value out of data. Also, whether a
particular data can actually be
considered as a Big Data or not, is
dependent upon the volume of data.
Hence, ‘Volume’ is one characteristic
which needs to be considered while
dealing with Big Data solutions.
(ii) Variety:-
• The next aspect of Big Data is its variety.
• Variety refers to heterogeneous sources and
the nature of data, both structured and
unstructured. During earlier days,
spreadsheets and databases were the only
sources of data considered by most of the
applications. Nowadays, data in the form of
emails, photos, videos, monitoring devices,
PDFs, audio, etc. are also being considered in
the analysis applications. This variety of
unstructured data poses certain issues for
storage, mining and analyzing data.
iii) Velocity:-
• The term ‘velocity’ refers to the speed of
generation of data. How fast the data is
generated and processed to meet the
demands, determines real potential in
the data.
• Big Data Velocity deals with the speed at
which data flows in from sources like
business processes, application logs,
networks, and social media sites,
sensors, Mobile devices, etc. The flow
of data is massive and continuous.
(iv) Variability:-
• This refers to the inconsistency which
can be shown by the data at times,
thus hampering the process of being
able to handle and manage the data
effectively.
Advantages Of Big Data
Processing
• The ability to process Big Data in DBMS brings in multiple
benefits, such as:
• Businesses can utilize outside intelligence while taking
decisions
• Access to social data from search engines and sites like
Facebook and Twitter is enabling organizations to fine-tune
their business strategies.
• Improved customer service
• Traditional customer feedback systems are getting
replaced by new systems designed with Big Data
technologies. In these new systems, Big Data and
natural language processing technologies are being
used to read and evaluate consumer responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
Drivers for Big Data
• The main business drivers for such rising demand for
Big Data Analytics are :
• 1. The digitization of society
• 2. The drop in technology costs
• 3. Connectivity through cloud computing
• 4. Increased knowledge about data science
• 5. Social media applications
• 6. The rise of Internet-of-Things(IoT)
• Example: A number of companies that have Big Data
at the core of their strategy like :
• Apple, Amazon, Facebook and Netflix have become
very successful at the beginning of the 21st century.
What is Big Data Analytics?
• Big Data analytics is a process used to extract
meaningful insights, such as hidden patterns,
unknown correlations, market trends, and customer
preferences. Big Data analytics provides various
advantages—it can be used for better decision
making, preventing fraudulent activities, among
other things.
Different Types of Big Data Analytics
1. Descriptive Analytics
This summarizes past data into a form that people can easily
read. This helps in creating reports, like a company’s revenue,
profit, sales, and so on. Also, it helps in the tabulation of social
media metrics.
Use Case: The Dow Chemical Company analyzed its past data
to increase facility utilization across its office and lab space.
Using descriptive analytics, Dow was able to identify
underutilized space. This space consolidation helped the
company save nearly US $4 million annually.
2. Diagnostic Analytics
• This is done to understand what caused a problem in the first
place. Techniques like drill-down, data mining, and data
recovery are all examples. Organizations use diagnostic
analytics because they provide an in-depth insight into a
particular problem.
Use Case: An e-commerce company’s report shows that their
sales have gone down, although customers are adding
products to their carts. This can be due to various reasons
like the form didn’t load correctly, the shipping fee is too high,
or there are not enough payment options available. This is
where you can use diagnostic analytics to find the reason.
3. Predictive Analytics
• This type of analytics looks into the historical and present
data to make predictions of the future. Predictive analytics
uses data mining, AI, and machine learning to analyze
current data and make predictions about the future. It works
on predicting customer trends, market trends, and so on.
Use Case: PayPal determines what kind of precautions they
have to take to protect their clients against fraudulent
transactions. Using predictive analytics, the company uses all
the historical payment data and user behavior data and builds
an algorithm that predicts fraudulent activities.
4. Prescriptive Analytics
• This type of analytics prescribes the solution to a
particular problem. Prescriptive analytics works with
both descriptive and predictive analytics. Most of the
time, it relies on AI and machine learning.
Use Case: Prescriptive analytics can be used to
maximize an airline’s profit. This type of analytics is
used to build an algorithm that will automatically adjust
the flight fares based on numerous factors, including
customer demand, weather, destination, holiday
seasons, and oil prices.
Big Data Analytics Tools
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle chunks of
data
• Spark - used for real-time processing and analyzing large
amounts of data
• STORM - an open-source real-time computational system
• Kafka - a distributed streaming platform that is used for fault-
tolerant storage
Big Data Industry Applications
• Ecommerce - Predicting customer trends and optimizing prices are a few of the
ways e-commerce uses Big Data analytics
• Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which
result in improved sales
• Education - Used to develop new and improve existing courses based on market
requirements
• Healthcare - With the help of a patient’s medical history, Big Data analytics is used
to predict how likely they are to have health issues
• Media and entertainment - Used to understand the demand of shows, movies,
songs, and more to deliver a personalized recommendation list to its users
• Banking - Customer income and spending patterns help to predict the likelihood of
choosing various banking offers, like loans and credit cards
• Telecommunications - Used to forecast network capacity and improve customer
experience
• Government - Big Data analytics helps governments in law enforcement, among
other things
MapReduce
• MapReduce is a programming model for
writing applications that can process Big
Data in parallel on multiple nodes.
• MapReduce provides analytical capabilities
for analyzing huge volumes of complex
data.
Why MapReduce?
• Traditional Enterprise Systems normally have
a centralized server to store and process data.
The following illustration depicts a schematic
view of a traditional enterprise system.
The traditional model is certainly not suitable for
processing huge volumes of scalable data, which
cannot be accommodated by standard
database servers. Moreover, the centralized
system creates too much of a bottleneck
while processing multiple files
simultaneously.
• Google solved this bottleneck issue using an
algorithm called MapReduce. MapReduce
divides a task into small parts and assigns
them to many computers. Later, the results
are collected at one place and integrated to
form the result dataset.
How MapReduce Works?
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
• The Map task takes a set of data and converts
it into another set of data, where individual
elements are broken down into tuples (key-
value pairs).
• The Reduce task takes the output from the
Map as an input and combines those data
tuples (key-value pairs) into a smaller set of
tuples.
The reduce task is always performed after the
map job.
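Conceptually, the two tasks have the following signatures (standard MapReduce notation, not from the slides):

    map(k1, v1)          -> list(k2, v2)
    reduce(k2, list(v2)) -> list(k3, v3)

That is, the map output and the reduce input share the same intermediate key type k2, which is what allows the framework to group the map output by key before reducing.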
Cont..
• Input Phase − Here we have a Record Reader that translates each record in an input
file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as
input and applies a user-defined code to aggregate the values in the small scope of one
mapper. It is not a part of the main MapReduce algorithm; it is optional (the driver
sketch after this list shows how a combiner is plugged in).
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the Reducer is
running. The individual key-value pairs are sorted by key into a larger data list. The
data list groups the equivalent keys together so that their values can be iterated easily
in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered,
and combined in a number of ways, and it requires a wide range of processing. Once
the execution is over, it gives zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.
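The phases above are wired together in a driver program. Below is a minimal Java sketch of such a driver for a word-count job (the class names WordCountMapper/WordCountReducer and the argument-based paths are assumptions for illustration); note how the optional combiner is plugged in between the map and reduce phases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Input phase: TextInputFormat's record reader feeds (offset, line) pairs to the mapper.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Map, optional combine (local aggregation before the shuffle), and reduce phases.
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);

    // Output phase: the record writer emits the final (word, count) pairs as text.
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}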
MapReduce-
Example
• Let us take a real-world example to
comprehend the power of MapReduce.
Twitter receives around 500 million tweets
per day, which is nearly 3000 tweets per
second. The following illustration shows
how Twitter manages its tweets with the
help of MapReduce.
Cont..
• Tokenize − Tokenizes the tweets into maps
of tokens and writes them as key-value pairs.
• Filter − Filters unwanted words from the
maps of tokens and writes the filtered maps
as key-value pairs.
• Count − Generates a token counter per word.
• Aggregate Counters − Prepares an
aggregate of similar counter values into small
manageable units.
MapReduce - Algorithm
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
• The map task is done by means of Mapper
Class
• The reduce task is done by means of
Reducer Class.
Cont..
• Mapper class takes the input, tokenizes it,
maps and sorts it. The output of Mapper
class is used as input by Reducer class,
which in turn searches matching pairs and
reduces them.
Cont..
• MapReduce implements various mathematical
algorithms to divide a task into small parts and
assign them to multiple systems. In technical
terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a
cluster.
• These mathematical algorithms may include the
following: −
• Sorting
• Searching
• Indexing
• TF-IDF (Term Frequency (TF) – Inverse Document
Frequency (IDF)); its standard weighting formula is recalled below.
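As a reminder of what a TF-IDF job computes (standard definition, not taken from the slides):

    tf-idf(t, d) = tf(t, d) × log( N / df(t) )

where tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents containing t, and N is the total number of documents. In MapReduce this is typically computed with one job for the term frequencies and a second job for the document frequencies.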
Unit-2
Introduction of Hadoop
Hadoop is an open-source software framework that is used
for storing and processing large amounts of data in a
distributed computing environment.
It is designed to handle big data and is based on the
MapReduce programming model, which allows for the
parallel processing of large datasets.
History of Hadoop
• The Apache Software Foundation is the developer of Hadoop, and
its co-founders are Doug Cutting and Mike Cafarella. Co-founder
Doug Cutting named it after his son's toy elephant. In
October 2003, the first relevant paper released was on the Google File System.
• In January 2006, MapReduce development started on Apache
Nutch, which consisted of around 6,000 lines of code for MapReduce and
around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0
was released.
• Hadoop is an open-source software framework for storing and
processing big data. It was created by Apache Software Foundation
in 2006, based on a white paper written by Google in 2003 that
described the Google File System (GFS) and the MapReduce
programming model.
• The Hadoop framework allows for the distributed processing of
large data sets across clusters of computers using simple
programming models. It is designed to scale up from single servers
to thousands of machines, each offering local computation and
storage.
• It is used by many organizations, including Yahoo, Facebook, and
IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming is easy.
4. It has huge, flexible storage.
5. It is low cost.
Hadoop has several key features
that make it well-suited for big
data processing:
• Distributed Storage: Hadoop stores large data
sets across multiple machines, allowing for
the storage and processing of extremely large
amounts of data.
• Scalability: Hadoop can scale from a single
server to thousands of machines, making it
easy to add more capacity as needed.
• Fault-Tolerance: Hadoop is designed to be
highly fault-tolerant, meaning it can continue
to operate even in the presence of hardware
failures.
Cont..
• Data locality: Hadoop provides data locality
feature, where the data is stored on the same
node where it will be processed, this feature
helps to reduce the network traffic and
improve the performance
• High Availability: Hadoop provides High
Availability feature, which helps to make sure
that the data is always available and is not
lost.
• Flexible Data Processing: Hadoop’s
MapReduce programming model allows for the
processing of data in a distributed fashion,
making it easy to implement a wide variety of
algorithms.
Cont..
• Data Integrity: Hadoop provides built-in checksum
feature, which helps to ensure that the data stored
is consistent and correct.
• Data Replication: Hadoop provides data replication
feature, which helps to replicate the data across the
cluster for fault tolerance.
• Data Compression: Hadoop provides built-in data
compression feature, which helps to reduce the
storage space and improve the performance.
• YARN: A resource management platform that
allows multiple data processing engines like real-
time streaming, batch processing, and
interactive SQL, to run and process data stored
in HDFS.
What is Hadoop?
• Def. “Hadoop is an open source software
programming framework for storing a large
amount of data and performing the
computation”.
• Its framework is based on Java programming
with some native code in C and shell scripts.
• Hadoop is an open-source software framework
that is used for storing and processing large
amounts of data in a distributed computing
environment.
• It is designed to handle big data and is based
on the MapReduce programming model, which
allows for the parallel processing of large datasets.
Hadoop has Two main
components
• HDFS (Hadoop Distributed File System): This is the storage
component of Hadoop, which allows for the storage of large
amounts of data across multiple machines. It is designed to work
with commodity hardware, which makes it cost-effective.
• YARN (Yet Another Resource Negotiator): This is the resource
management component of Hadoop, which manages the
allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
• Hadoop also includes several additional modules that provide
additional functionality, such as Hive (a SQL-like query
language), Pig (a high-level platform for creating MapReduce
programs), and HBase (a non-relational, distributed database).
• Hadoop is commonly used in big data scenarios such as
data warehousing, business intelligence, and machine
learning. It’s also used for data processing, data analysis,
and data mining.
Hadoop Distributed File
System
• Hadoop has a distributed file system known as
HDFS, which splits files into blocks and sends
them across the various nodes of a large cluster.
In case of a node failure, the system continues
to operate, and the data transfer between the
nodes is facilitated by HDFS.
HDFS
Advantages of HDFS
• Scalability: Hadoop can easily scale to
handle large amounts of data by adding
more nodes to the cluster.
• Cost-effective: Hadoop is designed to work
with commodity hardware, which makes it
a cost-effective option for storing and
processing large amounts of data.
Cont..
• Fault-tolerance: Hadoop’s distributed
architecture provides built-in fault-
tolerance, which means that if one node
in the cluster goes down, the data can
still be processed by the other nodes.
• Flexibility: Hadoop can process
structured, semi-structured, and
unstructured data, which makes it a
versatile option for a wide range of big
data scenarios.
Cont..
• Open-source: Hadoop is open-source software,
which means that it is free to use and modify.
This also allows developers to access the source
code and make improvements or add new
features.
• Large community: Hadoop has a large and
active community of developers and users who
contribute to the development of the software,
provide support, and share best practices.
• Integration: Hadoop is designed to work with
other big data technologies such as Spark,
Storm, and Flink, which allows for integration with
a wide range of data processing and analysis
tools.
Disadvantages of HDFS:
• Not very effective for small data.
• Hard cluster management.
• Has stability issues.
• Security is a major concern.
• Complexity: Hadoop can be complex to set
up and maintain, especially for
organizations without a dedicated team of
experts.
• Latency: Hadoop is not well-suited for
low-latency workloads and may not be the
best choice for real-time data processing.
Cont..
• Limited Support for Real-time Processing:
Hadoop’s batch-oriented nature makes it less suited
for real-time streaming or interactive data processing
use cases.
• Limited Support for Structured Data: Hadoop is
designed to work with unstructured and semi-
structured data; it is not well-suited for structured
data processing.
• Data Security: Hadoop does not provide built-in
security features such as data encryption or user
authentication, which can make it difficult to secure
sensitive data.
• Limited Support for Ad-hoc Queries: Hadoop’s
MapReduce programming model is not well-suited for
ad-hoc, interactive queries.
Cont..
• Limited Support for Graph and Machine
Learning: Hadoop’s core component HDFS and
MapReduce are not well-suited for graph and
machine learning workloads, specialized
components like Apache Giraph and Mahout are
available but have some limitations.
• Cost: Hadoop can be expensive to set up and
maintain, especially for organizations with large
amounts of data.
• Data Loss: In the event of a hardware failure, the
data stored in a single node may be lost
permanently.
• Data Governance: Data governance is a critical
aspect of data management; Hadoop does not
provide a built-in feature to manage data lineage,
data quality, and data cataloging.
Hadoop framework is made up of
the following modules: Hadoop
Ecosystem
1.Hadoop MapReduce:- a MapReduce
programming model for handling and
processing large data.
2.Hadoop Distributed File System:- distributed
files in clusters among nodes.
3.Hadoop YARN:- a platform which manages
computing resources.
4. Hadoop Common- it contains packages and
libraries which are used for other modules.
Apache Hadoop and Hadoop Eco
System
• Apache Hadoop is an open source
software framework used to develop
data processing applications which are
executed in a distributed computing
environment.
• Applications built using HADOOP are
run on large data sets distributed
across clusters of commodity
computers.
• Commodity computers are cheap and
widely available.
• These are mainly useful for achieving greater computational power at a low cost.
Form of Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data
Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data
services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning
algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
Hadoop Eco System
Apache Hadoop consists of two
sub-projects:
1.Hadoop MapReduce: MapReduce is a
computational model and software
framework for writing applications which are
run on Hadoop. These MapReduce programs
are capable of processing enormous data in
parallel on large clusters of computation
nodes.
2.HDFS (Hadoop Distributed File System):
HDFS takes care of the storage part of
Hadoop applications. MapReduce
applications consume data from HDFS. HDFS
creates multiple replicas of data blocks and
distributes them on compute nodes in a
cluster. This distribution enables reliable and extremely rapid computations.
Hadoop Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed
data processing using MapReduce and HDFS methods.
Name Node:
The NameNode represents every file and directory
that is used in the namespace.
Data Node:
A DataNode helps to manage the state of an HDFS
node and allows you to interact with the blocks.
Master Node:
The master node allows you to conduct parallel
processing of the data using Hadoop MapReduce.
Slave node:
The slave nodes are the additional machines in the
Hadoop cluster which allow you to store data and
conduct complex calculations. Moreover, every
slave node comes with a Task Tracker and a
DataNode.
This allows the processes to be synchronized with the
NameNode and the Job Tracker respectively.
Data storage Nodes in HDFS.
•NameNode(Master)
•DataNode(Slave)
NameNode:
• NameNode works as a Master in a Hadoop cluster
that guides the Datanode(Slaves).
• Namenode is mainly used for storing the Metadata
i.e. the data about the data. Meta Data can be the
transaction logs that keep track of the user’s activity
in a Hadoop cluster.
• Meta Data can also be the name of the file, size, and
the information about the location(Block number,
Block ids) of Datanode that Namenode stores to find
the closest DataNode for Faster Communication.
• Namenode instructs the DataNodes with the
operation like delete, create, Replicate, etc.
Cont..
• NameNode is the master node in the Apache Hadoop
HDFS Architecture that maintains and manages the blocks
present on the DataNodes (slave nodes).
• NameNode is a very highly available server that manages
the File System Namespace and controls access to files by
clients.
• The HDFS architecture is built in such a way that the user
data never resides on the NameNode. The data resides on
DataNodes only.
Functions of NameNode
• It is the master daemon that maintains and
manages the DataNodes (slave nodes)
• It records the metadata of all the files stored in
the cluster, e.g. The location of blocks
stored, the size of the files, permissions,
hierarchy, etc.
• There are two files associated with the metadata:
• FsImage: It contains the complete state of the file
system namespace since the start of the
NameNode.
• EditLogs: It contains all the recent modifications
made to the file system with respect to the most
recent FsImage.
DataNode:
• DataNodes work as slaves. DataNodes
are mainly utilized for storing the data in a
Hadoop cluster; the number of DataNodes
can range from 1 to 500 or even more than
that.
• The more DataNodes there are, the more
data the Hadoop cluster will be able to store.
• A DataNode should have a high storage capacity
to store a large number of file blocks.
File Block In HDFS:
• Data in HDFS is always stored in terms of
blocks. So a single file is divided into
multiple blocks of size 128 MB, which is
the default, and you can also change
it manually.
Suppose you have uploaded a file of 400 MB to your HDFS: it is split into
three blocks of 128 MB and one block of 16 MB, i.e. four blocks in total.
Replication In HDFS
• Replication ensures the availability of the data.
Replication is making a copy of something and the
number of times you make a copy of that particular
thing can be expressed as it’s Replication Factor. As
we have seen in File blocks that the HDFS stores the
data in the form of various blocks at the same time
Hadoop is also configured to make a copy of those file
blocks.
• By default, the Replication Factor for Hadoop is set to
3, which is configurable, meaning you can change it
manually as per your requirement. In the above
example we have made 4 file blocks, which means that
3 replicas (copies) of each file block are made, so a
total of 4×3 = 12 blocks are stored for backup
purposes (the configuration sketch below shows where these defaults live).
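Both the block size and the replication factor are set in hdfs-site.xml. A minimal sketch is shown below; the values are the usual defaults, and the property names (dfs.blocksize, dfs.replication) are those used by Hadoop 2.x and later, so they may differ on other releases.

<configuration>
  <!-- HDFS block size in bytes: 134217728 = 128 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Replication factor: each block is stored on 3 DataNodes -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>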
Rack Awareness
• The rack is nothing but just the physical
collection of nodes in our Hadoop cluster
(maybe 30 to 40).
• A large Hadoop cluster consists of many
racks. With the help of this rack
information, the NameNode chooses the
closest DataNode to achieve maximum
performance while performing
read/write operations, which reduces the
network traffic.
• Hadoop cluster
consists of a data
center, the rack
and the node
which actually
executes jobs.
• Here, data center
consists of racks
and rack consists
of nodes.
• Network
bandwidth
available to
processes varies
depending upon
the location of
the processes
Moving Data In and Out of Hadoop
• Moving data in and out of Hadoop, which
we refer to as data ingress and egress, is the
process by which data is transported
from an external system into an internal
system, and vice versa.
• Hadoop supports ingress and egress at a
low level in HDFS and MapReduce.
Understanding Inputs and Outputs of
MapReduce
Done in Class
Understanding Inputs and Outputs of
MapReduce
• Inputs and Outputs
• The MapReduce model operates on <key,
value> pairs.
• It views the input to the jobs as a set of <key,
value> pairs and produces a different set of
<key, value> pairs as the output of the jobs.
• Data input is supported by two classes in this
framework, namely InputFormat and
RecordReader.
Cont..
• The first is consulted to determine how the
input data should be partitioned for the map
tasks, while the latter reads the data from
the inputs.
• For the data output also there are two
classes, OutputFormat and RecordWriter.
• The first class performs a basic validation of
the data sink properties and the second
class is used to write each reducer output to
the data sink.
MapReduce Phase
• Input Splits: An input in the MapReduce model is divided
into small fixed-size parts called input splits. This part of
the input is consumed by a single map. The input data is
generally a file or directory stored in the HDFS.
• Mapping: This is the first phase in the map-reduce
program execution where the data in each split is passed
line by line, to a mapper function to process it and
produce the output values.
• Shuffling: It is a part of the output phase of Mapping
where the relevant records are consolidated from the
output. It consists of merging and sorting: all the key-
value pairs which have the same keys are combined, and
in sorting, the inputs from the merging step are taken and
sorted. It returns sorted key-value pairs as output (a toy
trace of these phases follows this list).
• Reduce: All the values from the shuffling phase are
combined and a single output value is returned. Thus,
summarizing the entire dataset.
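As a toy trace of these phases (the input text and counts are invented purely for illustration), a two-split word-count input flows as follows:

Input splits : "deer bear river"            |  "car car river"
Mapping      : (deer,1) (bear,1) (river,1)  |  (car,1) (car,1) (river,1)
Shuffling    : (bear,[1])  (car,[1,1])  (deer,[1])  (river,[1,1])
Reduce       : (bear,1)  (car,2)  (deer,1)  (river,2)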
Hadoop InputFormat
• Hadoop InputFormat checks the Input-
Specification of the job.
• InputFormat split the Input file into InputSplit and
assign to individual Mapper.
• There are different methods to get the data to the mapper
and different types of InputFormat in Hadoop, such as
FileInputFormat, TextInputFormat,
KeyValueTextInputFormat, etc.
Cont.
• How the input files are split up and read in Hadoop is
defined by the InputFormat.
• A Hadoop InputFormat is the first component in Map-
Reduce; it is responsible for creating the input splits and
dividing them into records.
• Initially, the data for a MapReduce task is stored in input
files, and input files typically reside in HDFS.
• Although the format of these files is arbitrary, line-based log files and
binary formats can be used.
• Using InputFormat we define how these input files are split
and read.
InputFormat Class:
• The files or other objects that should be used for
input are selected by the InputFormat.
• InputFormat defines the Data splits, which
defines both the size of individual Map tasks and
its potential execution server.
• InputFormat defines the RecordReader, which is
responsible for reading actual records from the
input files.
Cont..
TextOutputFormat:
• MapReduce default Hadoop reducer Output Format is TextOutputFormat,
which writes (key, value) pairs on individual lines of text files and its keys and
values.
SequenceFileOutputFormat:
• It is an Output Format which writes sequence files for its output, and it is an
intermediate format used between MapReduce jobs.
MapFileOutputFormat
• It is another form of FileOutputFormat in Hadoop Output Format, which is
used to write output as map files.
DBOutputFormat
• DBOutputFormat in Hadoop is an Output Format for writing to relational
databases and HBase.
MultipleOutputs
• It allows writing data to files whose names are derived from the output keys
and values, or in fact from an arbitrary string.
YARN(Yet Another Resource
Negotiator)
• YARN is a Framework on which
MapReduce works. YARN performs 2
operations that are Job scheduling and
Resource Management.
• The purpose of the Job Scheduler is to divide
a big task into small jobs so that each
job can be assigned to various slaves in
a Hadoop cluster and processing can be
maximized.
• The Job Scheduler also keeps track of which
job is important, which job has more
priority, dependencies between the jobs
and all the other information like job timing,
etc. And the use of the Resource Manager is
to manage all the resources that are made
available for running the Hadoop cluster.
Data Serialization
• Data serialization is the process of converting structured data into a
stream of bytes so that it can be stored or transmitted and later
converted back to its original form.
• We serialize to translate data structures into a stream of data, and this
stream can be transmitted over the network or stored in a database
regardless of the system architecture; deserialization does the reverse
and is likewise not dependent on the architecture.
• Consider a CSV file that contains a comma (,) inside a data value: during
deserialization, wrong outputs may occur. If the metadata is stored in a
self-describing form such as XML, the data can easily be deserialized
(a minimal custom Writable sketch follows).
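In Hadoop, this is what the Writable interface provides: a key or value type implements write() and readFields() so the framework can serialize it for the shuffle or for storage and deserialize it back. A minimal Java sketch of a custom Writable follows; the class and field names are invented for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative custom value type: Hadoop calls write() to serialize it into a byte
// stream and readFields() to rebuild the object on the receiving node.
public class PageViewWritable implements Writable {
  private long timestamp;
  private int viewCount;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);
    out.writeInt(viewCount);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();
    viewCount = in.readInt();
  }
}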
Unit-3
• Hadoop Architecture
• Hadoop Architecture,
• Hadoop Storage: HDFS, Common Hadoop Shell commands,
Anatomy of File Write and Read.,
• NameNode, Secondary NameNode, and DataNode,
• Hadoop MapReduce paradigm, Map and Reduce tasks, Job,
• Task trackers - Cluster Setup – SSH & Hadoop Configuration –
HDFS Administering – Monitoring & Maintenance.
• History of Hadoop in the following steps: -
• In 2002, Doug Cutting and Mike Cafarella started to work on a
project, Apache Nutch. It is an open source web crawler
software project.
• While working on Apache Nutch, they were dealing with big
data. To store that data they had to spend a lot on storage costs, which
became a problem for the project. This problem became
one of the important reasons for the emergence of Hadoop.
• In 2003, Google introduced a file system known as GFS (Google
file system). It is a proprietary distributed file system developed
to provide efficient access to data.
• In 2004, Google released a white paper on Map Reduce. This
technique simplifies the data processing on large clusters.
Cont..
• In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes Map reduce.
• In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch
project, Doug Cutting introduced a new project, Hadoop, with a file system known as
HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this
year.
• Doug Cutting named his project Hadoop after his son's toy elephant.
• In 2007, Yahoo runs two clusters of 1000 machines.
• In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.
• In 2013, Hadoop 2.2 was released.
• In 2017, Hadoop 3.0 was released.
Hadoop Architecture
• Hadoop is a framework written in Java
that utilizes a large cluster of
commodity hardware to maintain and
store big size data.
• Hadoop works on MapReduce
Programming Algorithm that was
introduced by Google.
• Big Brand Companies are using Hadoop in
their Organization to deal with big data,
eg. Facebook, Yahoo, Netflix, eBay, etc.
Hadoop Architecture
• The Hadoop architecture is a package of the
file system, MapReduce engine and the
HDFS (Hadoop Distributed File System).
• The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single master
and multiple slave nodes.
• The master node includes Job Tracker, Task
Tracker, NameNode, and DataNode whereas
the slave node includes DataNode and
TaskTracker.
The Hadoop Architecture Mainly
consists of 4 components
• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common
Hadoop Architecture
1. MapReduce
• MapReduce is nothing but an algorithm or
a data structure that is based on the YARN
framework.
• The major feature of MapReduce is that it performs
the distributed processing in parallel in a Hadoop
cluster, which makes Hadoop work so fast.
• When you are dealing with Big Data, serial
processing is no longer of any use.
• MapReduce has mainly 2 tasks which are divided
phase-wise:
• In the first phase, Map is utilized, and in the next
phase Reduce is utilized.
The input is provided to the Map() function, then
its output is used as input to the Reduce() function,
and after that we receive our final output.
Cont..
• Input is provided to Map(); since we are
dealing with Big Data, the input is a set of data blocks.
• The Map() function here breaks these DataBlocks
into tuples, which are nothing but key-value pairs.
• These key-value pairs are now sent as input to the
Reduce().
• The Reduce() function then combines these broken
tuples or key-value pairs based on their key value,
forms a set of tuples, and performs some
operation like sorting, a summation-type job, etc.,
which is then sent to the final output node.
• Finally, the output is obtained.
Anatomy of File Write and Read
NameNode
• It is a single master server that exists in the
HDFS cluster.
• As it is a single node, it may become the
reason for a single point of failure.
• It manages the file system namespace by
executing operations like opening,
renaming and closing files.
• It simplifies the architecture of the system.
DataNode
• The HDFS cluster contains multiple
DataNodes.
• Each DataNode contains multiple data
blocks.
• These data blocks are used to store
data.
• It is the responsibility of the DataNode to serve
read and write requests from the file
system's clients.
• The DataNode performs block creation,
deletion, and replication upon instruction
from the NameNode.
Job Tracker
• The role of Job Tracker is to accept the
MapReduce jobs from client and
process the data by using NameNode.
• In response, NameNode provides
metadata to Job Tracker.
Task Tracker
• Task Tracker works as a slave node for
Job Tracker.
• It receives task and code from Job
Tracker and applies that code on the
file.
• This process can also be called as a
Mapper.
MapReduce Layer
• The MapReduce comes into existence
when the client application submits the
MapReduce job to Job Tracker.
• In response, the Job Tracker sends the
request to the appropriate Task
Trackers.
• Sometimes, the TaskTracker fails or
time out.
• In such a case, that part of the job is
rescheduled.
Advantages of Hadoop
• Fast: In HDFS the data distributed over the cluster and are
mapped which helps in faster retrieval. Even the tools to
process the data are often on the same servers, thus reducing
the processing time. It is able to process terabytes of data in
minutes and petabytes in hours.
• Scalable: Hadoop cluster can be extended by just adding nodes
in the cluster.
• Cost Effective: Hadoop is open source and uses commodity
hardware to store data, so it is really cost effective as compared to
a traditional relational database management system.
• Resilient to failure: HDFS has the property with which it can
replicate data over the network, so if one node is down or some
other network failure happens, then Hadoop takes the other
copy of data and use it. Normally, data are replicated thrice but
the replication factor is configurable.
HDFS Read Image:
HDFS Write Image:
Cont..
• Since all the metadata is stored in the name node, it is
very important.
• If it fails, the file system cannot be used as
there would be no way of knowing how to
reconstruct the files from the blocks present in the
data nodes. To overcome this, the concept of the
secondary name node arises.
• Secondary Name Node: It is a separate
physical machine which acts as a helper of the
name node.
• It performs periodic checkpoints.
• It communicates with the name node and takes
snapshots of the metadata, which helps minimize
downtime and loss of data.
NameNode
• NameNode is the master node in the
Apache Hadoop HDFS Architecture that
maintains and manages the blocks
present on the DataNodes (slave
nodes).
• NameNode is a very highly available
server that manages the File System
Namespace and controls access to files by
clients.
Main function performed by NameNode:
• 1. Stores metadata of actual data. E.g. Filename, Path,
No. of Data Blocks, Block IDs, Block Location, No.
of Replicas, Slave related configuration
2. Manages File system namespace.
3. Regulates client access request for actual file data
file.
4. Assign work to Slaves(DataNode).
5. Executes file system name space operation like
opening/closing files, renaming files and directories.
6. As the Name node keeps metadata in memory for fast
retrieval, a huge amount of memory is required for
its operation.
• This should be hosted on reliable hardware.
DataNode
• DataNode works as Slave in Hadoop cluster. Main
function performed by DataNode:
• 1. Actually stores the business data.
2. This is the actual worker node where Read/Write/Data
processing is handled.
3. Upon instruction from the Master, it performs
creation/replication/deletion of data blocks.
4. As all the business data is stored on the DataNode, a
huge amount of storage is required for its operation.
• Commodity hardware can be used for hosting
DataNode.
Secondary NameNode
• Secondary Name Node: It is a separate physical machine which acts as
a helper of name node.
• It performs periodic checkpoints. It communicates with the name node
and takes a snapshot of the metadata, which helps minimize downtime and loss
of data.
• Secondary NameNode is not a backup of NameNode. You can call it a
helper of NameNode.
• NameNode is the master daemon which maintains and manages the
DataNodes.
• It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.
Cont..
• In case of a DataNode failure, the NameNode chooses
new DataNodes for new replicas, balances disk usage and
manages the communication traffic to the DataNodes.
• It stores the metadata of all the files stored in HDFS, e.g.
The location of blocks stored, the size of the files,
permissions, hierarchy, etc.
It maintains 2 files:
• FsImage: Contains the complete state of the file system
namespace since the start of the NameNode.
• EditLogs: Contains all the recent modifications made to
the file system with respect to the most recent FsImage.
• Whereas the Secondary NameNode is one which
constantly reads all the file systems and metadata from
the RAM of the NameNode and writes it into the hard
disk or the file system.
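As a rough sketch of what a checkpoint looks like on disk (the name directory path below is an assumed example; in practice it is whatever dfs.namenode.name.dir points to, and the automatic checkpoint interval is governed by dfs.namenode.checkpoint.period):

# Force a manual checkpoint: merge the EditLogs into a new FsImage.
# saveNamespace is only allowed while the NameNode is in safe mode.
$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -saveNamespace
$ hdfs dfsadmin -safemode leave

# The fsimage_* and edits_* files live in the NameNode's name directory
$ ls /hadoop/dfs/name/current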
HDFS
• HDFS is a distributed file system that handles
large data sets running on commodity hardware.
• Hadoop comes with a distributed file system
called HDFS.
• In HDFS, data is distributed over several machines and
replicated to ensure durability against failure and high
availability to parallel applications.
• It is cost effective as it uses commodity hardware.
It involves the concepts of blocks, data nodes and
name nodes.
• It is used to scale a single Apache Hadoop cluster
to hundreds (and even thousands) of nodes.
• HDFS is one of the major components of Apache
Hadoop, the others being MapReduce and YARN.
Where to use HDFS
• Very Large Files: Files should be of hundreds of megabytes,
gigabytes or more.
• Streaming Data Access: The time to read the whole data set is
more important than the latency in reading the first record. HDFS is
built on a write-once, read-many-times pattern.
• Commodity Hardware: It works on low-cost hardware.
Where not to use HDFS
• Low-Latency Data Access: Applications that need the first
record very quickly should not use HDFS, since it is optimized for
the throughput of the whole data set rather than the time to fetch
the first record.
• Lots of Small Files: The name node keeps the metadata of all
files in memory, so a very large number of small files consumes a
disproportionate amount of the name node's memory, which is not
feasible.
• Multiple Writes: HDFS should not be used when a file has to be
written to repeatedly or by multiple writers; files are written
once and then read many times.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default and this is configurable. Files in HDFS
are broken into block-sized chunks, which are stored as independent units.
Unlike an ordinary file system, if a file in HDFS is smaller than the block
size, it does not occupy the full block's size, i.e. a 5 MB file stored in
HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block
size is large simply to minimize the cost of seeks (see the sketch after
this list).
2. Name Node: HDFS works in a master-worker pattern where the name node
acts as the master. The name node is the controller and manager of HDFS, as it
knows the status and the metadata of all the files in HDFS, the metadata
information being file permissions, names and the location of each block. The
metadata is small, so it is stored in the memory of the name node, allowing
faster access to it. Moreover, the HDFS cluster is accessed by multiple clients
concurrently, so all this information is handled by a single machine. File
system operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the
client or the name node. They report back to the name node periodically with
the list of blocks that they are storing. The data node, being commodity
hardware, also does the work of block creation, deletion and replication as
instructed by the name node.
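A small sketch of how the block size shows up in practice: hdfs getconf prints the configured value, and the generic -D option should let a single upload override it (the file name and the 256 MB value below are made-up examples):

# Default block size in bytes; 134217728 = 128 MB
$ hdfs getconf -confKey dfs.blocksize

# Upload one file with a 256 MB block size, just for this command
$ hadoop fs -D dfs.blocksize=268435456 -put bigfile.dat /user/test/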
Common Hadoop Shell commands
• The File System (FS) shell includes
various shell-like commands that directly
interact with the Hadoop Distributed File
System (HDFS) as well as other file
systems that Hadoop supports.
On the original slide, a screenshot shows a new directory named "example" being
created with the mkdir command and then verified with the ls command; the
equivalent commands are sketched below.
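A minimal reproduction of that screenshot from the FS shell (the directory is created at the HDFS root here purely for illustration):

# Create a directory called "example" and list the root to confirm it exists
$ hadoop fs -mkdir /example
$ hadoop fs -ls /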
HDFS Basic File Operations
1. Putting data into HDFS from the local file system
First create a folder in HDFS where the data can be put from the local
file system.
$ hadoop fs -mkdir /user/test
Copy the file "data.txt" kept in the local folder
/usr/home/Desktop to the HDFS folder /user/test:
$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt
/user/test
Display the contents of the HDFS folder:
$ hadoop fs -ls /user/test
Cont..
2. Copying data from HDFS to the local file
system
$ hadoop fs -copyToLocal /user/test/data.txt
/usr/bin/data_copy.txt
3. Compare the files and see that both are the
same
$ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
(use md5sum instead of md5 on most Linux systems)
Recursive deletion (-rmr is the older syntax; newer releases prefer
hadoop fs -rm -r):
$ hadoop fs -rmr <arg>
Example:
$ hadoop fs -rmr /user/sonoo/
MapReduce
• Map Task
• Reduce Task
• Job
• Task Trackers-Cluster setup
• SSH
• Hadoop Configuration
• HDFS Administering
• Monitoring and Maintenance
Map Task
• A map task is a single instance of the mapper
within a MapReduce job; each task determines
which records to process from one data
block (input split).
• The input data is split and analyzed, in
parallel, on the assigned compute
resources in a Hadoop cluster.
• This step of a MapReduce job prepares
the <key, value> pair output for the reduce
step.
Reduce Task
• Map stage − The map or mapper's job is to
process the input data. Generally the input data
is in the form of a file or directory and is stored in
the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The
mapper processes the data and creates several
small chunks of data.
• Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The
Reducer's job is to process the data that comes
from the mapper. After processing, it produces a
new set of output, which is stored in HDFS (a
complete word-count job run is sketched below).
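To see both stages end to end, the example jar that ships with Hadoop includes a word-count job; the jar path and the input/output directories below are illustrative, and the output directory must not already exist:

# Map tasks emit <word, 1> pairs; reduce tasks sum the counts per word
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/test /user/test-out

# Inspect the reducer output
$ hadoop fs -cat /user/test-out/part-r-00000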
SSH
• SSH setup is required to perform different operations on
a cluster, such as starting and stopping the distributed
daemons (shell operations).
• Hadoop core requires a shell (SSH) to
communicate with the slave nodes and to create
processes on the slave nodes. The communication
is frequent when the cluster is live and working
in a fully distributed environment (a typical passwordless
setup is sketched below).
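A commonly used passwordless-SSH setup for a single-node (pseudo-distributed) installation looks roughly like this; on a real cluster the public key is copied to every slave node as well:

# Generate a key pair with an empty passphrase
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorize the key for logins to this machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

# Should now log in without prompting for a password
$ ssh localhost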
  • 1. Big Data Analytics Dr. BK Verma Professor in CSE(ET) In charge: - Data Science Branch 6-AIDS
  • 2. Syllabus of BDA • UNIT-I Introduction To Big Data - Distributed file system, Big Data and its importance, Four Vs, Drivers for Big data, big data analytics, big data applications. Algorithms using map reduce, Matrix-Vector Multiplication by Map Reduce. • UNIT-II Introduction To Hadoop- Big Data – Apache Hadoop & Hadoop Eco System – Moving Data in and out of Hadoop – Understanding inputs and outputs of MapReduce - Data Serialization. • UNIT- III Hadoop Architecture - Hadoop Architecture, Hadoop Storage: HDFS, Common Hadoop Shell commands, Anatomy of File Write and Read., NameNode, Secondary NameNode, and DataNode, Hadoop MapReduce paradigm, Map and Reduce tasks, Job, Task trackers - Cluster Setup – SSH &Hadoop Configuration – HDFS Administering –Monitoring & Maintenance. • UNIT-IV Hadoop Ecosystem And Yarn -Hadoop ecosystem components - Schedulers - Fair and Capacity, Hadoop 2.0 New Features- NameNode High Availability, HDFS Federation, MRv2, YARN, Running MRv1 in YARN.
  • 3.
  • 4. Big Data • Big data refers to the large and complex data sets that are generated by various sources such as social media, online transactions, and machine sensors. • These data sets are too big, too fast, and too complex to be processed and analyzed by traditional data processing tools and methods. • The growth of big data has been driven by the explosion of digital devices, cloud computing, and the internet of things (IoT), which has created a large amount of data. • This data contains valuable information that can be used to improve business processes, gain new insights, and make better decisions.
  • 5.
  • 6.
  • 7. Do you know – Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and up-load 200,000 photos to Facebook. The largest contributor of data is social media. For instance, Facebook generates 500 TB of data every day. Twitter generates 8TB of data daily.
  • 8.
  • 9.
  • 10.
  • 11. 3Vs • Big data can be characterized by its 3Vs: Volume, Velocity, and Variety. • Volume refers to the sheer size of the data sets, which can range from terabytes to petabytes. • Velocity refers to the speed at which the data is generated and needs to be processed. • Variety refers to the range of different types of data, including structured data (e.g., databases), unstructured data (e.g., text, images, and videos), and semi-structured data (e.g., social media posts). • To deal with big data, organizations are turning to new technologies and tools such as Hadoop, Spark, and NoSQL databases. These tools are designed to handle the scale and complexity of big data and enable organizations to extract valuable insights from it.
  • 12. Four Vs The Four Vs of big data refer to the characteristics of big data that make it different from traditional data. The Four Vs are 1. Volume: The sheer amount of data generated and stored by organizations and individuals. 2. Velocity: The speed at which data is generated and processed. This refers to real-time streaming data as well as batch processing of large amounts of data. 3. Variety: The different forms of data, such as structured, semi-structured, and unstructured data. 4. Veracity: The quality and accuracy of the data, including data sources, data context, and the possibility of missing, incomplete, or inconsistent data
  • 13. Distributed file system(DFS): • A distributed file system is a type of file system that allows multiple users to access and manage a shared pool of configurable computer storage, known as a distributed file system(DFS). • The file system is distributed across many interconnected nodes, each of which may provide one or more parts of the overall file system. • This means that users can access the file system from any node, and the file system will automatically manage the data distribution and replication. • The goal of a distributed file system is to provide a single, unified namespace and file management interface, so that users can access their data from any location, without having to worry about the underlying network topology or physical storage locations.
  • 14.
  • 15. Big Data and its importance • Big Data refers to the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. • Big Data can be analyzed for insights that lead to better decisions and strategic business moves. • For example: a company might analyze customer purchase history to improve its sales and marketing strategies. • For Example: a healthcare organization might analyze patient data to improve patient outcomes.
  • 16. Cont.. • The importance of Big Data lies in the insights it provides. With the right tools and techniques, organizations can unlock valuable information and use it to improve their operations and gain a competitive advantage. Big Data can also help organizations: • Enhance customer experiences • Increase efficiency and productivity • Identify new business opportunities • Improve risk management and fraud detection • Optimize supply chains and logistics
  • 18.
  • 20.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26. Drivers for Big data There are several drivers that have led to the growth of big data, including: 1. Technological advancements: The exponential growth of digital data is largely due to advancements in technology such as the internet, mobile devices, and sensors, which have made it easier to collect, store, and share data. 2. Increased use of cloud computing: The widespread adoption of cloud computing has made it possible for organizations to store and process large amounts of data cost-effectively. 3. Internet of Things (IoT): The increasing number of connected devices, such as sensors and smart devices, has led to the generation of large amounts of data in real-time. 4. Social media: Social media platforms generate vast amounts of user-generated data, including text, images, and videos. 5. Business requirements: Organizations are looking to leverage big data to gain a competitive advantage and improve decision-making. 6. Government initiatives: Governments around the world are investing in big data initiatives to improve public services and drive economic growth
  • 27.
  • 28. Big Data Analytics • Big Data analytics is the process of examining, transforming, and modeling large and complex data sets to uncover hidden patterns, correlations, and other insights that can inform decision making and support business objectives. • The goal of Big Data analytics is to turn data into actionable information and knowledge that can drive growth, improve operational efficiency, and enhance competitiveness. • Big Data analytics requires a combination of technologies, tools, and techniques, including data storage, processing, and visualization. • Hadoop, Spark, and NoSQL databases are popular technologies used to manage and store big data, while machine learning algorithms, statistical models, and data visualization tools are used to perform the analytics.
  • 29. Cont.. • The applications of Big Data analytics are numerous and span across multiple industries, including finance, healthcare, retail, transportation, and many others. • For example, in healthcare, Big Data analytics can be used to analyze electronic medical records to identify high-risk patients and improve patient outcomes, while in retail, it can be used to improve customer engagement and experience through personalized marketing and product recommendations.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. Big Data Applications 1. Healthcare: Big data is being used in healthcare to improve patient outcomes, reduce costs, and improve overall efficiency. For example, electronic health records (EHRs) generate massive amounts of data that can be analyzed to identify trends, track patient outcomes, and support research. 2. Finance: Financial institutions use big data to analyze customer behavior, detect fraud, and make investment decisions. For example, credit card companies use big data to analyze spending patterns to identify and prevent fraudulent transactions. 3. Retail: Retail companies use big data to gain insights into consumer behavior and preferences, optimize pricing and promotions, and personalize the shopping experience. For example, online retailers use big data to recommend products based on a customer's past purchases and online behavior. 4. Manufacturing: Manufacturing companies use big data to optimize their supply chain, improve production processes, and reduce costs. For example, data generated by sensors on manufacturing equipment can be analyzed to identify inefficiencies and improve production processes. 5. Transportation: The transportation industry uses big data to optimize routes, reduce fuel consumption, and improve safety. For example, data from GPS devices, weather sensors, and traffic cameras can be used to optimize routes for delivery trucks and reduce fuel consumption.
  • 35.
  • 36.
  • 37.
  • 38. Different Algorithms using MapReduce MapReduce is a programming model for processing large data sets that can be used to implement a wide variety of algorithms. Some common algorithms that can be implemented using MapReduce include: 1. Word count: A classic example of a MapReduce algorithm that counts the number of occurrences of each word in a large text corpus. 2. Inverted Index: An inverted index is a data structure that maps each word in a document to a list of documents that contain it. In MapReduce, this can be implemented as a two-step process, with the map function emitting a key-value pair for each word in each document, and the reduce function aggregating the list of document IDs for each word. 3. PageRank: PageRank is an algorithm used by Google to rank web pages based on their importance. In MapReduce, this can be implemented as a series of iterative MapReduce jobs that propagate importance scores from one set of pages to another. 4. K-means clustering: K-means is a popular algorithm for clustering data into a fixed number of clusters. In MapReduce, this can be implemented as a series of MapReduce jobs that iteratively update the cluster centroids based on the assignment of data points to clusters. 5. Matrix multiplication: Matrix multiplication is a fundamental operation in linear algebra that can be used to solve systems of linear equations. In MapReduce, this can be implemented as a two-step process, with the map function emitting intermediate values that represent the product of individual entries in the matrices, and the reduce function aggregating these intermediate values to compute the final product.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44. Matrix-Vector Multiplication by Map Reduce • Matrix-vector multiplication is a common operation in linear algebra, where a matrix and a vector are multiplied to produce another vector. In the context of map reduce, this operation can be performed by dividing the matrix into multiple chunks and distributing them across multiple computers or nodes for parallel processing. • Here's how matrix-vector multiplication could be performed using the Map Reduce framework: • Map step: Each node takes one chunk of the matrix and multiplies it with the entire vector. The result is a partial vector. • Reduce step: The partial vectors from all nodes are combined to form the final result vector.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51. Unit-2 • Introduction To Hadoop • Big Data • Apache Hadoop & Hadoop Eco System • Moving Data in and out of Hadoop • Understanding inputs and outputs of MapReduce • Data Serialization.
  • 52. What is Hadoop? • Hadoop is the solution to above Big Data problems. It is the technology to store massive datasets on a cluster of cheap machines in a distributed manner. Not only this it provides Big Data analytics through distributed computing framework. • It is an open-source software developed as a project by Apache Software Foundation. Doug Cutting created Hadoop. In the year 2008 Yahoo gave Hadoop to Apache Software Foundation. Since then two versions of Hadoop has come. Version 1.0 in the year 2011 and version 2.0.6 in the year 2013. Hadoop comes in various flavors like Cloudera, IBM BigInsight, MapR and Hortonworks.
  • 53. Hadoop consists of three core components • Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop. • Map-Reduce – It is the data processing layer of Hadoop. • YARN – It is the resource management layer of Hadoop
  • 54. Apache Hadoop • Apache Hadoop is an open-source software framework for distributed storage and processing of big data sets across a cluster of computers. • It was developed to address the limitations of traditional centralized systems when it comes to storing and processing large amounts of data.
  • 55. Hadoop consists of two main components 1.Hadoop Distributed File System (HDFS): A scalable and fault-tolerant distributed file system that enables the storage of very large files across a cluster of commodity servers. 2.MapReduce: A programming model for processing large data sets in parallel across a cluster of computers. MapReduce consists of two stages: the map stage, which processes data in parallel on different nodes, and the reduce stage, which aggregates the results.
  • 56.
  • 57. HDFS: • Hadoop Distributed File System provides for distributed storage for Hadoop. HDFS has a master-slave topology • Master is a high-end machine where as slaves are inexpensive computers. The Big Data files get divided into the number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. On the master, we have metadata stored. • HDFS has two daemons running for it. NameNode and DataNode
  • 58. NameNode : • NameNode Daemon runs on the master machine. • It is responsible for maintaining, monitoring and managing DataNodes. • It records the metadata of the files like the location of blocks, file size, permission, hierarchy etc. • Namenode captures all the changes to the metadata like deletion, creation and renaming of the file in edit logs. • It regularly receives heartbeat and block reports from the DataNodes.
  • 59. DataNode: • DataNode runs on the slave machine. • It stores the actual business data. • It serves the read-write request from the user. • DataNode does the ground work of creating, replicating and deleting the blocks on the command of NameNode. • After every 3 seconds, by default, it sends heartbeat to NameNode reporting the health of HDFS.
  • 60. Cont.. • The Hadoop ecosystem is a collection of open-source projects that work together with Hadoop to provide a complete big data solution. Some of the popular projects in the Hadoop ecosystem include: 1. Hive: A data warehousing and SQL-like query language for Hadoop. 2. Pig: A high-level platform for creating MapReduce programs. 3. HBase: A NoSQL database that provides random real-time read/write access to large amounts of structured data. 4. Spark: An open-source, fast, and general-purpose cluster computing framework for big data processing. 5. YARN (Yet Another Resource Negotiator): A resource management system that allocates resources in the cluster for running applications.
  • 61. What is Data? • The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. “90% of the world’s data was generated in the last few years”
  • 62.
  • 63. What is Big Data? • Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is a data with so large size and complexity that none of traditional data management tools can store it or process it efficiently. Big data is also a data but with huge size.
  • 64. What is an Example of Big Data? • The New York Stock Exchange is an example of Big Data that generates about one terabyte of new trade data per day.
  • 65. Social Media • The statistic shows that 500+terabyte s of new data get ingested into the databases of social media site Facebook, every day. This data is mainly generated in terms of photo and video uploads, message exchanges, putting comments etc.
  • 66. • A single Jet engine can generate 10+terabytes of data in 30 minutes of flight time. With many thousand flights per day, generation of data reaches up to many Petabytes.
  • 67. Types Of Big Data • Structured data − Relational data. • Semi Structured data − XML data. • Unstructured data − Word, PDF, Text, Media Logs
  • 68.
  • 69. Structured • Any data that can be stored, accessed and processed in the form of fixed format is termed as a ‘structured’ data. Over the period of time, talent in computer science has achieved greater success in developing techniques for working with such kind of data (where the format is well known in advance) and also deriving value out of it. • However, nowadays, we are foreseeing issues when a size of such data grows to a huge extent, typical sizes are being in the rage of multiple zettabytes. Data stored in a relational database management system is one example of a ‘structured’ data.
  • 70.
  • 71. Unstructured • Any data with unknown form or the structure is classified as unstructured data. • In addition to the size being huge, un- structured data poses multiple challenges in terms of its processing for deriving value out of it. • A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos etc. • Now day organizations have wealth of data available with them but unfortunately, they don’t know how to derive value out of it since this data is in
  • 72.
  • 73. Semi-structured • Semi-structured data can contain both the forms of data. We can see semi- structured data as a structured in form but it is actually not defined with e.g. a table definition in relational DBMS. Example of semi-structured data is a data represented in an XML file.
  • 74.
  • 75. Data Growth over the years
  • 76. Characteristics Of Big Data • Volume • Variety • Velocity • Variability
  • 77. (i) Volume:- • The name Big Data itself is related to a size which is enormous. Size of data plays a very crucial role in determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.
  • 78. (ii) Variety:- • The next aspect of Big Data is its variety. • Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.
  • 79. iii) Velocity:- • The term ‘velocity’ refers to the speed of generation of data. How fast the data is generated and processed to meet the demands, determines real potential in the data. • Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.
  • 80. (iv) Variability:- • This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively.
  • 81. Advantages Of Big Data Processing • Ability to process Big Data in DBMS brings in multiple benefits, such as- • Businesses can utilize outside intelligence while taking decisions • Access to social data from search engines and sites like facebook, twitter are enabling organizations to fine tune their business strategies. • Improved customer service • Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses. • Early identification of risk to the product/services, if any • Better operational efficiency
  • 82. Drivers for Big Data • The main business drivers for such rising demand for Big Data Analytics are : • 1. The digitization of society • 2. The drop in technology costs • 3. Connectivity through cloud computing • 4. Increased knowledge about data science • 5. Social media applications • 6. The rise of Internet-of-Things(IoT) • Example: A number of companies that have Big Data at the core of their strategy like : • Apple, Amazon, Facebook and Netflix have become very successful at the beginning of the 21st century.
  • 83. What is Big Data Analytics? • Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market trends, and customer preferences. Big Data analytics provides various advantages—it can be used for better decision making, preventing fraudulent activities, among other things.
  • 84. Different Types of Big Data Analytics 1. Descriptive Analytics This summarizes past data into a form that people can easily read. This helps in creating reports, like a company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media metrics. Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization across its office and lab space. Using descriptive analytics, Dow was able to identify underutilized space. This space consolidation helped the company save nearly US $4 million annually.
  • 85. 2. Diagnostic Analytics • This is done to understand what caused a problem in the first place. Techniques like drill-down, data mining, and data recovery are all examples. Organizations use diagnostic analytics because they provide an in-depth insight into a particular problem. Use Case: An e-commerce company’s report shows that their sales have gone down, although customers are adding products to their carts. This can be due to various reasons like the form didn’t load correctly, the shipping fee is too high, or there are not enough payment options available. This is where you can use diagnostic analytics to find the reason.
  • 86. 3. Predictive Analytics • This type of analytics looks into the historical and present data to make predictions of the future. Predictive analytics uses data mining, AI, and machine learning to analyze current data and make predictions about the future. It works on predicting customer trends, market trends, and so on. Use Case: PayPal determines what kind of precautions they have to take to protect their clients against fraudulent transactions. Using predictive analytics, the company uses all the historical payment data and user behavior data and builds an algorithm that predicts fraudulent activities.
  • 87. 4. Prescriptive Analytics • This type of analytics prescribes the solution to a particular problem. Perspective analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning. Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of analytics is used to build an algorithm that will automatically adjust the flight fares based on numerous factors, including customer demand, weather, destination, holiday seasons, and oil prices.
  • 88. Big Data Analytics Tools • Hadoop - helps in storing and analyzing data • MongoDB - used on datasets that change frequently • Talend - used for data integration and management • Cassandra - a distributed database used to handle chunks of data • Spark - used for real-time processing and analyzing large amounts of data • STORM - an open-source real-time computational system • Kafka - a distributed streaming platform that is used for fault- tolerant storage
  • 89. Big Data Industry Applications • Ecommerce - Predicting customer trends and optimizing prices are a few of the ways e-commerce uses Big Data analytics • Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which result in improved sales • Education - Used to develop new and improve existing courses based on market requirements • Healthcare - With the help of a patient’s medical history, Big Data analytics is used to predict how likely they are to have health issues • Media and entertainment - Used to understand the demand of shows, movies, songs, and more to deliver a personalized recommendation list to its users • Banking - Customer income and spending patterns help to predict the likelihood of choosing various banking offers, like loans and credit cards • Telecommunications - Used to forecast network capacity and improve customer experience • Government - Big Data analytics helps governments in law enforcement, among other things
  • 90. MapReduce • MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes. • MapReduce provides analytical capabilities for analyzing huge volumes of complex data.
  • 91. Why MapReduce? • Traditional Enterprise Systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. Traditional model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
  • 92. • Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.
  • 93. How MapReduce Works? • The MapReduce algorithm contains two important tasks, namely Map and Reduce. • The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key- value pairs). • The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples. The reduce task is always performed after the map job.
  • 94.
  • 95. Cont.. • Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs. • Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs. • Intermediate Keys − They key-value pairs generated by the mapper are known as intermediate keys. • Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional. • Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task. • Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step. • Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
  • 96.
  • 97.
  • 98.
  • 99. MapReduce- Example • Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how Tweeter manages its tweets with the help of MapReduce.
  • 100. Cont.. • Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs. • Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs. • Count − Generates a token counter per word. • Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
  • 101. MapReduce - Algorithm • The MapReduce algorithm contains two important tasks, namely Map and Reduce. • The map task is done by means of Mapper Class • The reduce task is done by means of Reducer Class.
  • 102. Cont.. • Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as input by Reducer class, which in turn searches matching pairs and reduces them.
  • 103. Cont.. • MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the Map & Reduce tasks to appropriate servers in a cluster. • These mathematical algorithms may include the following: − • Sorting • Searching • Indexing • TF-IDF(Term Frequency (TF)-Inverse Document Frequency (IDF)
  • 104. Unit-2 Introduction of Hadoop Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
  • 105. History of Hadoop • Apache Software Foundation is the developers of Hadoop, and it’s co-founders are Doug Cutting and Mike Cafarella. It’s co- founder Doug Cutting named it on his son’s toy elephant. In October 2003 the first paper release was Google File System. • In January 2006, MapReduce development started on the Apache Nutch which consisted of around 6000 lines coding for it and around 5000 lines coding for HDFS. In April 2006 Hadoop 0.1.0 was released. • Hadoop is an open-source software framework for storing and processing big data. It was created by Apache Software Foundation in 2006, based on a white paper written by Google in 2003 that described the Google File System (GFS) and the MapReduce programming model. • The Hadoop framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. • It is used by many organizations, including Yahoo, Facebook, and IBM, for a variety of purposes such as data warehousing, log processing, and research. Hadoop has been widely adopted in the
  • 106. Features of hadoop: 1. it is fault tolerance. 2. it is highly available. 3. it’s programming is easy. 4. it have huge flexible storage. 5. it is low cost.
  • 107. Hadoop has several key features that make it well-suited for big data processing: • Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data. • Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more capacity as needed. • Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.
  • 108. Cont.. • Data locality: Hadoop provides data locality feature, where the data is stored on the same node where it will be processed, this feature helps to reduce the network traffic and improve the performance • High Availability: Hadoop provides High Availability feature, which helps to make sure that the data is always available and is not lost. • Flexible Data Processing: Hadoop’s MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of
  • 109. Cont.. • Data Integrity: Hadoop provides built-in checksum feature, which helps to ensure that the data stored is consistent and correct. • Data Replication: Hadoop provides data replication feature, which helps to replicate the data across the cluster for fault tolerance. • Data Compression: Hadoop provides built-in data compression feature, which helps to reduce the storage space and improve the performance. • YARN: A resource management platform that allows multiple data processing engines like real- time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
  • 110. What is Hadoop? • Def. “Hadoop is an open source software programming framework for storing a large amount of data and performing the computation”. • Its framework is based on Java programming with some native code in C and shell scripts. • Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. • It is designed to handle big data and is based on the MapReduce programming model,
  • 111. Hadoop has Two main components • HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective. • YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS. • Hadoop also includes several additional modules that provide additional functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database). • Hadoop is commonly used in big data scenarios such as data warehousing, business intelligence, and machine learning. It’s also used for data processing, data analysis, and data mining.
  • 112. Hadoop Distributed File System • It has distributed file system known as HDFS and this HDFS splits files into blocks and sends them across various nodes in form of large clusters. Also in case of a node failure, the system operates and data transfer takes place between the nodes which are facilitated by HDFS.
  • 113. HDFS
  • 114. Advantages of HDFS • Scalability: Hadoop can easily scale to handle large amounts of data by adding more nodes to the cluster. • Cost-effective: Hadoop is designed to work with commodity hardware, which makes it a cost-effective option for storing and processing large amounts of data.
  • 115. Cont.. • Fault-tolerance: Hadoop’s distributed architecture provides built-in fault- tolerance, which means that if one node in the cluster goes down, the data can still be processed by the other nodes. • Flexibility: Hadoop can process structured, semi-structured, and unstructured data, which makes it a versatile option for a wide range of big data scenarios.
  • 116. Cont.. • Open-source: Hadoop is open-source software, which means that it is free to use and modify. This also allows developers to access the source code and make improvements or add new features. • Large community: Hadoop has a large and active community of developers and users who contribute to the development of the software, provide support, and share best practices. • Integration: Hadoop is designed to work with other big data technologies such as Spark, Storm, and Flink, which allows for integration with a wide range of data processing and analysis tools.
  • 117. Disadvantages of HDFS: • Not very effective for small data. • Hard cluster management. • Has stability issues. • Security major concerns. • Complexity: Hadoop can be complex to set up and maintain, especially for organizations without a dedicated team of experts. • Latency: Hadoop is not well-suited for low-latency workloads and may not be the best choice for real-time data processing.
  • 118. Cont.. • Limited Support for Real-time Processing: Hadoop’s batch-oriented nature makes it less suited for real-time streaming or interactive data processing use cases. • Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi- structured data, it is not well-suited for structured data processing • Data Security: Hadoop does not provide built-in security features such as data encryption or user authentication, which can make it difficult to secure sensitive data. • Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model is not well-suited for
  • 119. Cont.. • Limited Support for Graph and Machine Learning: Hadoop’s core component HDFS and MapReduce are not well-suited for graph and machine learning workloads, specialized components like Apache Giraph and Mahout are available but have some limitations. • Cost: Hadoop can be expensive to set up and maintain, especially for organizations with large amounts of data. • Data Loss: In the event of a hardware failure, the data stored in a single node may be lost permanently. • Data Governance: Data Governance is a critical aspect of data management, Hadoop does not provide a built-in feature to manage data lineage, data quality, data cataloging, data lineage, and
• 120. Hadoop framework is made up of the following modules: Hadoop Ecosystem 1. Hadoop MapReduce: a MapReduce programming model for handling and processing large data. 2. Hadoop Distributed File System: stores data as distributed blocks across the nodes of a cluster. 3. Hadoop YARN: a platform which manages computing resources. 4. Hadoop Common: contains packages and libraries which are used by the other modules.
• 121. Apache Hadoop and Hadoop Eco System • Apache Hadoop is an open-source software framework used to develop data processing applications which are executed in a distributed computing environment. • Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. • Commodity computers are cheap and widely available. • They are mainly useful for achieving greater computational power at low cost.
• 122. Components of the Hadoop ecosystem: • HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator • MapReduce: Programming based Data Processing • Spark: In-Memory data processing • PIG, HIVE: Query based processing of data services • HBase: NoSQL Database • Mahout, Spark MLlib: Machine Learning algorithm libraries • Solr, Lucene: Searching and Indexing • Zookeeper: Managing cluster • Oozie: Job Scheduling
• 125. Apache Hadoop consists of two sub-projects: 1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications which are run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes. 2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
  • 126. Hadoop Architecture Hadoop has a Master-Slave Architecture for data storage and distributed data processing using MapReduce and HDFS methods.
• 130. Name Node: The NameNode represents every file and directory used in the namespace. Data Node: A DataNode manages the state of an HDFS node and allows you to interact with its blocks. Master Node: The master node conducts the parallel processing of data using Hadoop MapReduce. Slave node: The slave nodes are the additional machines in the Hadoop cluster which store data and carry out complex calculations. Moreover, every slave node runs a Task Tracker and a DataNode, which synchronize their processes with the Job Tracker and the NameNode respectively.
• 132. Data storage nodes in HDFS: • NameNode (Master) • DataNode (Slave)
• 133. NameNode: • The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). • The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata can be the transaction logs that keep track of the user’s activity in the Hadoop cluster. • Metadata can also be the name of the file, its size, and the information about the location (block number, block IDs) on the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. • The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
  • 134. Cont.. • NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). • NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. • The HDFS architecture is built in such a way that the user data never resides on the NameNode. The data resides on DataNodes only.
  • 135. Functions of NameNode • It is the master daemon that maintains and manages the DataNodes (slave nodes) • It records the metadata of all the files stored in the cluster, e.g. The location of blocks stored, the size of the files, permissions, hierarchy, etc. • There are two files associated with the metadata: • FsImage: It contains the complete state of the file system namespace since the start of the NameNode. • EditLogs: It contains all the recent modifications made to the file system with respect to the most recent FsImage.
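To make the kind of metadata served by the NameNode concrete, here is a minimal sketch using the HDFS Java client API. The path /user/test, the class name, and the default cluster configuration are assumptions for illustration only; the listStatus and getFileBlockLocations calls are answered by the NameNode, not the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsMetadata {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // loads cluster settings (core-site.xml etc.) from the classpath
            FileSystem fs = FileSystem.get(conf);       // client handle backed by the NameNode
            for (FileStatus status : fs.listStatus(new Path("/user/test"))) {
                // File-level metadata: path, size, replication factor, permissions.
                System.out.println(status.getPath() + " size=" + status.getLen()
                        + " replication=" + status.getReplication()
                        + " permission=" + status.getPermission());
                // Block locations are also metadata kept by the NameNode.
                for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    System.out.println("  block at offset " + loc.getOffset()
                            + " hosts=" + String.join(",", loc.getHosts()));
                }
            }
            fs.close();
        }
    }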
• 136. DataNode: • DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. • The more DataNodes there are, the more data the Hadoop cluster is able to store. • A DataNode should have high storage capacity to hold a large number of file blocks.
• 137. File Block In HDFS: • Data in HDFS is always stored in terms of blocks. A file is divided into blocks of 128 MB, which is the default size and can also be changed manually. Suppose you upload a file of 400 MB to HDFS: it is split into 128 MB + 128 MB + 128 MB + 16 MB, i.e. 4 blocks, and the last block occupies only the space it actually needs.
• 138. Replication In HDFS • Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of that particular thing is its Replication Factor. As we have seen in file blocks, HDFS stores the data in the form of blocks, and at the same time Hadoop is configured to make copies of those file blocks. • By default, the Replication Factor for Hadoop is set to 3, which is configurable, meaning you can change it manually as per your requirement. In the above example we made 4 file blocks, so with 3 copies of each file block a total of 4 × 3 = 12 blocks are stored for backup purposes.
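A small sketch of the arithmetic from the two slides above; the file size, block size and replication factor are taken from the example, and the class name is hypothetical.

    public class BlockMath {
        public static void main(String[] args) {
            long fileSizeMb = 400;
            long blockSizeMb = 128;      // default dfs.blocksize in Hadoop 2.x/3.x
            int replicationFactor = 3;   // default dfs.replication

            long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;  // ceiling division -> 4 blocks
            long storedBlocks = blocks * replicationFactor;              // 4 x 3 = 12 stored block copies

            System.out.println("Blocks: " + blocks + ", block copies stored: " + storedBlocks);
        }
    }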
• 139. Rack Awareness • A rack is nothing but a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). • A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode, achieving maximum performance while performing reads and writes, which reduces network traffic.
• 140. • A Hadoop cluster consists of data centers, racks, and the nodes which actually execute jobs. • Here, a data center consists of racks and a rack consists of nodes. • The network bandwidth available to processes varies depending upon the location of the processes.
• 141. Moving Data In and Out of Hadoop • Moving data in and out of Hadoop, which we refer to as data ingress and egress, is the process by which data is transported from an external system into an internal system, and vice versa. • Hadoop supports ingress and egress at a low level in HDFS and MapReduce.
  • 144. Understanding Inputs and Outputs of MapReduce Done in Class
  • 149. Understanding Inputs and Outputs of MapReduce • Inputs and Outputs • The MapReduce model operates on <key, value> pairs. • It views the input to the jobs as a set of <key, value> pairs and produces a different set of <key, value> pairs as the output of the jobs. • Data input is supported by two classes in this framework, namely InputFormat and RecordReader.
• 150. Cont.. • The former is consulted to determine how the input data should be partitioned for the map tasks, while the latter reads the data from the inputs. • For the data output there are also two classes, OutputFormat and RecordWriter. • The first class performs a basic validation of the data sink properties and the second class is used to write each reducer output to the data sink.
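As a hedged illustration of how these pieces fit together in the Java API: the sketch below reuses FileInputFormat's split logic and Hadoop's built-in LineRecordReader, which is essentially what the standard TextInputFormat already does. The class name SimpleLineInputFormat is illustrative, not a Hadoop class.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class SimpleLineInputFormat extends FileInputFormat<LongWritable, Text> {
        // getSplits() is inherited from FileInputFormat: it partitions the input
        // files into InputSplits (by default roughly one split per block).
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            // The RecordReader turns the bytes of one split into <key, value> records:
            // here the key is the byte offset and the value is the line of text.
            return new LineRecordReader();
        }
    }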
• 152. MapReduce Phase • Input Splits: An input in the MapReduce model is divided into small fixed-size parts called input splits. Each split is consumed by a single map. The input data is generally a file or directory stored in HDFS. • Mapping: This is the first phase of map-reduce program execution, where the data in each split is passed, line by line, to a mapper function which processes it and produces output values. • Shuffling: Part of the output phase of mapping, where the relevant records are consolidated from the map output. It consists of merging and sorting: all the key-value pairs which have the same key are combined, and the merged pairs are sorted by key. • Reduce: All the values from the shuffling phase are combined for each key and a single output value is returned per key, thus summarizing the entire dataset.
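The classic word-count Mapper and Reducer make these phases concrete. This is the standard textbook example, sketched here with the new (org.apache.hadoop.mapreduce) API; the map phase emits <word, 1> pairs, the shuffle groups them by word, and the reduce phase sums the counts.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);   // emit one <word, 1> pair per token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();              // add up the 1s grouped under this word
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }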
• 153. Hadoop InputFormat • The Hadoop InputFormat checks the input specification of the job. • The InputFormat splits the input file into InputSplits and assigns each split to an individual Mapper. • There are different methods to get the data to the mapper and different types of InputFormat in Hadoop, such as FileInputFormat, TextInputFormat, KeyValueTextInputFormat, etc.
• 154. Cont. • How the input files are split up and read in Hadoop is defined by the InputFormat. • A Hadoop InputFormat is the first component in MapReduce; it is responsible for creating the input splits and dividing them into records. • Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. • Although the format of these files is arbitrary, line-based log files and binary formats can be used. • Using the InputFormat we define how these input files are split and read.
• 155. InputFormat Class: • The files or other objects that should be used for input are selected by the InputFormat. • The InputFormat defines the data splits, which determine both the size of individual Map tasks and their potential execution servers. • The InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
• 161. Cont.. TextOutputFormat: • MapReduce’s default reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. SequenceFileOutputFormat: • An output format which writes sequence files for its output; it is an intermediate format used between MapReduce jobs. MapFileOutputFormat: • Another form of FileOutputFormat in Hadoop, which is used to write the output as map files. DBOutputFormat: • An output format for writing to relational databases and HBase. MultipleOutputs: • Allows writing data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.
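A driver sketch showing where the InputFormat and OutputFormat are selected for a job. It assumes the WordCount Mapper and Reducer sketched earlier; the class name WordCountDriver is illustrative, and the input and output paths are taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setInputFormatClass(TextInputFormat.class);     // how the input is split and read
            job.setOutputFormatClass(TextOutputFormat.class);   // how <key, value> pairs are written

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }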
• 162. YARN (Yet Another Resource Negotiator) • YARN is the framework on which MapReduce works. YARN performs 2 operations: job scheduling and resource management. • The purpose of the Job Scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized. • The Job Scheduler also keeps track of which job is important, which job has more priority, dependencies between the jobs, and all the other information like job timing, etc. The Resource Manager is used to manage all the resources that are made available for running the Hadoop cluster.
• 163. Data Serialization • Data serialization is the process of converting structured data into a stream of bytes; deserialization converts that stream back to the original form. • We serialize to translate data structures into a stream of data that can be transmitted over the network or stored in a database regardless of the system architecture. • Consider CSV files: if the data itself contains a comma (,), deserialization may produce wrong outputs. If the metadata is stored in a self-describing form such as XML, the data can be deserialized reliably.
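A small sketch of Hadoop's own Writable serialization, round-tripping a (Text, IntWritable) pair through an in-memory byte stream; this is the same mechanism MapReduce uses for intermediate data and SequenceFiles. The class name is illustrative.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableRoundTrip {
        public static void main(String[] args) throws Exception {
            // Serialize: translate the structures into a stream of bytes.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            new Text("hadoop").write(out);
            new IntWritable(42).write(out);
            out.close();

            // Deserialize: rebuild the original objects from the byte stream,
            // independent of the architecture that produced it.
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            Text word = new Text();
            IntWritable count = new IntWritable();
            word.readFields(in);
            count.readFields(in);
            System.out.println(word + " -> " + count.get());
        }
    }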
  • 164. Unit-3 • Hadoop Architecture • Hadoop Architecture, • Hadoop Storage: HDFS, Common Hadoop Shell commands, Anatomy of File Write and Read., • NameNode, Secondary NameNode, and DataNode, • Hadoop MapReduce paradigm, Map and Reduce tasks, Job, • Task trackers - Cluster Setup – SSH &Hadoop Configuration – HDFS Administering –Monitoring & Maintenance.
• 166. • History of Hadoop in the following steps: • In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project. • While working on Apache Nutch, they were dealing with big data. Storing that data would have been very costly, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop. • In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data. • In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
• 167. Cont.. • In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce. • In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year. • Doug Cutting named his project Hadoop after his son's toy elephant. • In 2007, Yahoo ran two clusters of 1000 machines. • In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster within 209 seconds. • In 2013, Hadoop 2.2 was released. • In 2017, Hadoop 3.0 was released.
• 169. Hadoop Architecture • Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and maintain huge amounts of data. • Hadoop works on the MapReduce programming algorithm that was introduced by Google. • Big-brand companies use Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc.
  • 170. Hadoop Architecture • The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop Distributed File System). • The MapReduce engine can be MapReduce/MR1 or YARN/MR2. • A Hadoop cluster consists of a single master and multiple slave nodes. • The master node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and TaskTracker.
  • 172. The Hadoop Architecture Mainly consists of 4 components • MapReduce • HDFS(Hadoop Distributed File System) • YARN(Yet Another Resource Negotiator) • Common Utilities or Hadoop Common
• 174. 1. MapReduce • MapReduce is essentially an algorithm, or programming model, that runs on top of the YARN framework. • The major feature of MapReduce is that it performs distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop so fast. • When you are dealing with Big Data, serial processing is no longer of any use. • MapReduce mainly has 2 tasks which are divided phase-wise: • In the first phase, Map is utilized, and in the next phase, Reduce is utilized.
• 175. The input is provided to the Map() function, then its output is used as input to the Reduce() function, and after that we receive our final output.
• 176. Cont.. • The input is provided to Map(); since we are dealing with Big Data, the input is a large set of data. • The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. • These key-value pairs are then sent as input to Reduce(). • The Reduce() function combines the tuples (key-value pairs) based on their key, forms a new set of tuples, and performs operations such as sorting or summation, which is then sent to the final output node. For example, in a word count job, Map() emits pairs such as (word, 1) and Reduce() sums the values for each word. • Finally, the output is obtained.
  • 184. Anatomy of File Write and Read
• 185. NameNode • It is the single master server existing in the HDFS cluster. • As it is a single node, it may become a single point of failure. • It manages the file system namespace by executing operations such as opening, renaming and closing files. • It simplifies the architecture of the system.
• 186. DataNode • The HDFS cluster contains multiple DataNodes. • Each DataNode contains multiple data blocks. • These data blocks are used to store data. • The responsibility of the DataNode is to serve read and write requests from the file system's clients. • The DataNode performs block creation, deletion, and replication upon instruction from the NameNode.
  • 187. Job Tracker • The role of Job Tracker is to accept the MapReduce jobs from client and process the data by using NameNode. • In response, NameNode provides metadata to Job Tracker.
• 188. Task Tracker • The Task Tracker works as a slave node for the Job Tracker. • It receives the task and code from the Job Tracker and applies that code to the file. • This process can also be called a Mapper.
  • 189. MapReduce Layer • The MapReduce comes into existence when the client application submits the MapReduce job to Job Tracker. • In response, the Job Tracker sends the request to the appropriate Task Trackers. • Sometimes, the TaskTracker fails or time out. • In such a case, that part of the job is rescheduled.
• 190. Advantages of Hadoop • Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours. • Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster. • Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system. • Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop takes another copy of the data and uses it. Normally, data is replicated thrice, but the replication factor is configurable.
• 194. Cont.. • Since all the metadata is stored in the NameNode, it is very important. • If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the DataNodes. To overcome this, the concept of the Secondary NameNode arises. • Secondary NameNode: It is a separate physical machine which acts as a helper of the NameNode. • It performs periodic checkpoints. • It communicates with the NameNode and takes snapshots of the metadata, which helps minimize downtime and loss of data.
  • 195. NameNode • NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). • NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.
• 196. Main functions performed by the NameNode: • 1. Stores metadata of the actual data, e.g. filename, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration. 2. Manages the file system namespace. 3. Regulates client access requests for the actual file data. 4. Assigns work to the slaves (DataNodes). 5. Executes file system namespace operations like opening/closing files and renaming files and directories. 6. As the NameNode keeps metadata in memory for fast retrieval, a huge amount of memory is required for its operation. • It should be hosted on reliable hardware.
• 197. DataNode • The DataNode works as a slave in the Hadoop cluster. Main functions performed by the DataNode: • 1. Actually stores the business data. 2. This is the actual worker node where read/write/data processing is handled. 3. Upon instruction from the master, it performs creation/replication/deletion of data blocks. 4. As all the business data is stored on DataNodes, a huge amount of storage is required for their operation. • Commodity hardware can be used for hosting DataNodes.
• 198. Secondary NameNode • Secondary NameNode: It is a separate physical machine which acts as a helper of the NameNode. • It performs periodic checkpoints. It communicates with the NameNode and takes snapshots of the metadata, which helps minimize downtime and loss of data. • The Secondary NameNode is not a backup of the NameNode; you can call it a helper of the NameNode. • The NameNode is the master daemon which maintains and manages the DataNodes. • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
• 199. Cont.. • In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes. • It stores the metadata of all the files stored in HDFS, e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc. It maintains 2 files: • FsImage: Contains the complete state of the file system namespace since the start of the NameNode. • EditLogs: Contains all the recent modifications made to the file system with respect to the most recent FsImage. • The Secondary NameNode, by contrast, periodically reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
• 200. HDFS • HDFS is a distributed file system that handles large data sets running on commodity hardware. • Hadoop comes with a distributed file system called HDFS. • In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications. • It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, DataNodes and the NameNode. • It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. • HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.
• 201. Where to use HDFS • Very Large Files: Files should be of hundreds of megabytes, gigabytes or more. • Streaming Data Access: The time to read the whole data set is more important than the latency of reading the first record. HDFS is built on a write-once, read-many-times pattern. • Commodity Hardware: It works on low-cost hardware. Where not to use HDFS • Low-Latency Data Access: Applications that require very fast access to the first record should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the first record. • Lots of Small Files: The NameNode holds the metadata of files in memory, and if the files are small in size they consume a lot of the NameNode's memory, which is not feasible. • Multiple Writes: It should not be used when we have to write multiple times.
• 202. HDFS Concepts 1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a disk file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; e.g. a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks. 2. Name Node: HDFS works in a master-worker pattern where the NameNode acts as the master. The NameNode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the NameNode, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. The file system operations like opening, closing, renaming etc. are executed by it. 3. Data Node: DataNodes store and retrieve blocks when they are told to, by the client or the NameNode. They report back to the NameNode periodically with the list of blocks that they are storing. The DataNode, being commodity hardware, also does the work of block creation, deletion and replication as directed by the NameNode.
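To show how a client touches these concepts, here is a minimal sketch that writes and reads a small file through the FileSystem API. The NameNode resolves the path and block locations, while the bytes themselves stream to and from the DataNodes; the path /user/test/hello.txt and the class name are placeholders.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/test/hello.txt");

            // Write: the client asks the NameNode for target DataNodes, then streams the bytes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches block locations from the NameNode,
            // then reads the blocks from the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }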
  • 203. Common Hadoop Shell commands • The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports.
  • 204.
  • 205.
  • 206.
  • 207.
• 208. For example, a new directory named “example” can be created using the mkdir command, and the result can then be verified using the ls command.
• 212. HDFS Basic File Operations 1. Putting data into HDFS from the local file system First create a folder in HDFS where data can be put from the local file system. $ hadoop fs -mkdir /user/test Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test $ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test Display the content of the HDFS folder $ hadoop fs -ls /user/test
• 213. Cont.. 2. Copying data from HDFS to the local file system $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt 3. Compare the files and see that both are the same $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt Recursive deleting (deprecated in newer releases in favour of hadoop fs -rm -r) hadoop fs -rmr <arg> Example: hadoop fs -rmr /user/sonoo/
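The same put/get operations can also be done programmatically through the FileSystem Java API; a hedged sketch mirroring the shell commands above (the paths are the same placeholders, and the class name is illustrative).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopyExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/user/test"));                       // hadoop fs -mkdir /user/test
            fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"),
                                 new Path("/user/test"));            // hadoop fs -copyFromLocal
            fs.copyToLocalFile(new Path("/user/test/data.txt"),
                               new Path("/usr/bin/data_copy.txt"));  // hadoop fs -copyToLocal

            fs.close();
        }
    }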
  • 214. MapReduce • Map Task • Reduce Task • Job • Task Trackers-Cluster setup • SSH • Hadoop Configuration • HDFS Administering • Monitoring and Maintenance
  • 215. Map Task • A Map Task is a single instance of a MapReduce app. These tasks determine which records to process from a data block. • The input data is split and analyzed, in parallel, on the assigned compute resources in a Hadoop cluster. • This step of a MapReduce job prepares the <key, value> pair output for the reduce step.
• 216. Map and Reduce Stages • Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. • Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS.
• 218. SSH • SSH setup is required to carry out different operations on a cluster, such as starting and stopping the distributed daemons. • Hadoop core requires a shell (SSH) to communicate with the slave nodes and to create processes on the slave nodes. The communication is frequent when the cluster is live and working in a fully distributed environment.