Hi all, this is a presentation about big data analysis using a data-mining tool known as HADOOP, which is based on a distributed file system and uses parallel computing.
4. BIG DATA
The term Big data is used to describe a massive volume
of both structured and unstructured data that is so large
that it's difficult to process using traditional database
and software techniques.
5. BIG DATA (contd.)
• Big data consists of a heterogeneous mixture of structured and
unstructured data.
• Big data refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, process
and analyze.
6. Challenges
• These statistical records keep on increasing, and they
increase very fast.
• Unfortunately, as the data grows it becomes a tedious task
to process such a large data set and extract meaningful
information.
• If the data generated is in various formats, its processing
poses new challenges.
7. Challenges (contd.)
• An issue with big data is that it is typically stored in NoSQL
systems, which have no data description language.
• Also, web-scale data is heterogeneous rather than uniform. For big
data analysis, data integration and cleaning are much harder than in
traditional mining approaches.
8. Solution
• Parallel computing programming
• An efficient computing platform does not rely on centralized data
storage; instead, the data is distributed across large-scale
storage.
• Restricting access to the data
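The parallel-computing idea above can be sketched in plain Python (a toy illustration, not Hadoop itself): the dataset is split into independent chunks, each processed by a separate worker process, and the partial results are combined at the end.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker processes its own subset of the data independently.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Split the dataset into independent chunks, one per worker.
    chunks = [data[i::workers] for i in range(workers)]
    with Pool(processes=workers) as pool:
        # The chunks are processed in parallel by the worker pool.
        partial = pool.map(process_chunk, chunks)
    # Combine the partial results into the final answer.
    return sum(partial)

if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))  # 499500
```

The same divide-process-combine pattern is what Hadoop applies at cluster scale, with the chunks living on distributed storage rather than in one process's memory.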
10. HADOOP
Hadoop is basically a tool that operates on a distributed
file system. In this architecture, all the DataNodes
function in parallel, but each individual DataNode still
processes its data sequentially.
11. HADOOP Architecture
•It is an Apache Software Foundation project and an
open-source software platform for scalable, distributed
computing.
•The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across
clusters of computers using simple programming models.
12. HADOOP Architecture (contd.)
•Hadoop provides fast and reliable analysis of both
structured and unstructured data.
•It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
•Hadoop uses the MapReduce programming model to mine
data.
13. • A MapReduce program splits the datasets sent as input into
independent subsets, which are processed in parallel by map tasks.
• Map() procedure that performs filtering and sorting
• Reduce() procedure that performs a summary operation
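The map/reduce flow above can be sketched in plain Python (a minimal word-count illustration, not the Hadoop API): map() emits key-value pairs after filtering, the pairs are sorted and grouped by key, and reduce() performs the summary operation on each group.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Filtering step: emit a (word, 1) pair for each alphabetic token.
    for line in lines:
        for word in line.lower().split():
            if word.isalpha():
                yield (word, 1)

def reduce_phase(pairs):
    # Summary operation: sum the counts for each key.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

lines = ["big data big hadoop", "hadoop big"]
pairs = sorted(map_phase(lines))   # the sort between map and reduce
counts = dict(reduce_phase(pairs))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In real Hadoop the sort-and-group step between the two phases (the "shuffle") is done by the framework across the cluster; here it is simulated with a plain `sorted()` call.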
16. Methodology
Hadoop’s library is designed to deliver a highly available service on
top of a cluster of computers. A Hadoop cluster as a whole can be seen
as consisting of:
1. Core Hadoop
2. Hadoop Ecosystem
17. Relationship between Core Hadoop and the Hadoop
Ecosystem
Core Hadoop consists of :
• HDFS
• MapReduce.
Since the commencement of the project, a lot of other software projects
have grown around it. Together, these are called the Hadoop Ecosystem.
18. HDFS (Hadoop Distributed File System)
• An HDFS instance may consist of a large number of server machines,
each storing a part of the file system data.
• Detection of faults and quick automatic recovery from them is a core
architectural objective of HDFS.
• Applications that run on HDFS need streaming access to their datasets.
19. MapReduce
It is the basic logic flow of task execution, consisting
mainly of Mappers and Reducers.
Mappers:
Mappers do the job of extracting the required raw information from
the whole dataset, e.g., in one case extracting the date of sale,
product name, selling price, and cost price of various products.
20. MapReduce (contd.)
•Reducers:
The Mapper output is then sorted by key and passed to the
Reducers. Reducers do the actual processing on this
intermediate data provided by the Mappers and carry out the
final task, yielding the desired output.
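The sales example above can be sketched in plain Python (a toy illustration with a made-up record format, not Hadoop code): the mapper extracts the product name and per-sale profit from each raw record, the pairs are sorted by key, and the reducer totals the profit per product.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical raw records: "date,product,selling_price,cost_price"
records = [
    "2024-01-05,pen,10,6",
    "2024-01-05,book,50,35",
    "2024-01-06,pen,10,7",
]

def mapper(record):
    # Extract the required fields from the raw record.
    date, product, selling, cost = record.split(",")
    return (product, int(selling) - int(cost))  # (key, profit)

# Map, then sort by key — this mirrors the sort before the reduce phase.
pairs = sorted(map(mapper, records), key=itemgetter(0))

def reducer(pairs):
    # Accumulate the total profit for each product.
    for product, group in groupby(pairs, key=itemgetter(0)):
        yield (product, sum(profit for _, profit in group))

result = dict(reducer(pairs))
print(result)  # {'book': 15, 'pen': 7}
```

The record format and field names here are invented for illustration; in a real job the mapper and reducer would be written against the Hadoop MapReduce API and run over HDFS data.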