3. Abstract
• Data now stream in from daily life:
phones,
credit cards,
televisions, and
computers (especially from the Internet).
• The data flow is so fast that about five exabytes of data are generated
every day.
• This huge collection of data is known as big data: data that is
too diverse, too fast-changing, and too massive.
• It is difficult for the current computing infrastructure to handle big
data.
• To overcome this drawback, Google introduced the MapReduce
framework.
4. Introduction
• Big Data deals with large and complex datasets that can be
structured, semi-structured, or unstructured.
• Big Data is so large that it is difficult to process with traditional
databases and other software techniques.
How can such large datasets be explored and analyzed?
• Analyzing big data is one of the challenges for researchers
and academicians, and it needs special analysis techniques.
• Hadoop MapReduce is a technique for analyzing big data.
Hadoop programs perform two distinct tasks:
MAP and REDUCE.
6. Goals and Challenges
Goals:
The main goals of high-dimensional data analysis are:
To develop effective methods that can accurately
predict future observations.
To explore the hidden structure of each
subpopulation of the data.
To extract important common features across
many subpopulations.
7. Continued….
Challenges:
A. Meeting the need for speed.
B. Understanding the data.
C. Addressing data quality.
D. Displaying meaningful results.
E. Dealing with outliers.
8. Applications
The Aadhaar project by the Govt. of India uses Hadoop.
New applications that are becoming possible in
the Big Data era include:
A. Personalized services.
B. Internet security.
C. Personalized medicine.
9. HDFS(Hadoop Distributed File System)
• Designed to hold very large amounts of data
(petabytes or even zettabytes) and to provide
high-throughput access to this information.
Characteristics:
• Fault tolerant.
• Runs with commodity hardware.
• Able to handle large datasets.
• Master-slave paradigm.
• Write-once file access only.
HDFS components:
• NameNode.
• DataNode.
• Secondary NameNode.
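The block-and-replica idea behind HDFS can be sketched in a few lines of Python. This is purely illustrative, not HDFS itself (which is a distributed Java service): the tiny block size and in-memory "DataNode" names are made up for the demo, while the replication factor of 3 and the idea of fixed-size blocks match HDFS defaults.

```python
# Illustrative sketch of HDFS-style block splitting and replica placement.
# Real HDFS is a distributed service; this only mimics the bookkeeping.

BLOCK_SIZE = 8          # bytes, tiny for demonstration (HDFS default: 128 MB)
REPLICATION = 3         # HDFS default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for idx, _ in enumerate(blocks):
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"hello hadoop distributed file system"
blocks = split_into_blocks(data)
for idx, reps in place_replicas(blocks).items():
    print(f"block {idx}: stored on {reps}")
```

Losing one node leaves two surviving copies of each block, which is why HDFS is fault tolerant on commodity hardware.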
11. BIG DATA ANALYTICS
• “The process of collecting, organizing and
analyzing large sets of data”
-- to discover patterns and
other useful information.
• It also helps to identify what is in the data.
• Big data analysts basically want the knowledge
that comes from analyzing the data.
12. MAP REDUCE
• Invented by Google.
• A programming model for processing large datasets
distributed over a large cluster.
• MapReduce is the heart of Hadoop.
• Uses the divide-and-conquer concept.
• Two methods: map() and reduce().
• map(): sorting and filtering.
• reduce(): counting and producing the result.
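The two phases can be illustrated on a single machine with Python's built-in map() and functools.reduce(). This is only a conceptual sketch of the divide-and-conquer idea, not Hadoop's API: real Hadoop runs the map step on many machines in parallel.

```python
from functools import reduce

# Conceptual illustration of the two MapReduce phases on one machine.
numbers = [3, 1, 4, 1, 5, 9, 2, 6]

# Map phase: transform each element independently (here: square it).
mapped = list(map(lambda x: x * x, numbers))

# Reduce phase: aggregate the mapped values into a single result.
total = reduce(lambda acc, x: acc + x, mapped, 0)

print(mapped)
print(total)
```

Because each map call is independent, the map phase parallelizes trivially; only the reduce phase combines results.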
15. MapReduce algorithms
• MapReduce is a programming model designed for processing
large volumes of data in parallel by dividing the work into a
set of independent tasks.
For example, Twitter data can be processed on
different servers on the basis of months.
Hadoop is the physical implementation of MapReduce.
It is a combination of two Java functions:
Mapper() and Reducer().
Example: checking the popularity of a piece of text.
16. Continued….
The Mapper function maps the split files and provides input to the
Reducer:
Mapper(filename, file-contents):
    for each word in file-contents:
        emit(word, 1)
The Reducer function clubs the input provided by the Mapper and
produces the output:
Reducer(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
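The Mapper/Reducer pseudocode above can be turned into a small runnable sketch in Python. This is a single-machine simulation, not Hadoop's Java API: the shuffle step that groups values by key, which Hadoop performs automatically between the two phases, is written out explicitly here.

```python
from collections import defaultdict

def mapper(filename, file_contents):
    """Emit (word, 1) for every word, as in the slide's pseudocode."""
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    """Sum the counts emitted for one word."""
    return (word, sum(values))

def map_reduce(files):
    # Map phase: run the mapper over every input file.
    intermediate = []
    for name, contents in files.items():
        intermediate.extend(mapper(name, contents))

    # Shuffle phase: group all emitted values by key
    # (done behind the scenes by the Hadoop framework).
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Reduce phase: one reducer call per distinct word.
    return dict(reducer(w, v) for w, v in grouped.items())

files = {"a.txt": "big data big", "b.txt": "data big"}
print(map_reduce(files))
```

In Hadoop, each input split would be handled by a separate Mapper task and each group of keys by a separate Reducer task, giving the same result at cluster scale.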
17. Conclusion
MapReduce is simple, yet it provides good scalability and fault
tolerance for massive data processing.
Analysis tools like MapReduce over Hadoop guarantee:
faster advances in many scientific disciplines, and
improved profitability and success for many enterprises.
MapReduce has received a lot of attention in many fields,
including data mining,
information retrieval,
image retrieval, and
pattern recognition.