2. Index
Introduction and History
Use and Advantages
Issues and Need of Hadoop
Users of Hadoop
Framework and Architecture
HDFS Basic Concept
MapReduce
Summary
3. Introduction and History
• Apache Software Foundation Project
• Open Source - Reliable, Scalable, Distributed
Computing and Data Storage
• Concept: Moving computation is cheaper than moving large data.
History:
• Google File System paper – Oct 2003
• Google MapReduce paper – Dec 2004
• Started by Doug Cutting and Mike Cafarella in 2005
• Named by Doug Cutting at Yahoo – Feb 2006
• The name comes from the toy elephant of Doug Cutting’s son
4. Use & Advantages
• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large scale social network analysis
Advantages:
• Massive Scalability
• Flexible Schema
• Quicker/Cheaper to set up
• Consistency with high performance
Limitations:
• Gaps in analytic functionality
• Multiple copies of already-big data
• Inefficient execution and a challenging framework
5. Issues and Need of Hadoop
• 500 TB per day
• Over 170 PB
• Over 6 PB
• Getting the data to the processors becomes the bottleneck
10. HDFS Basic Concept
• HDFS works best with a smaller number of large files
o Millions, as opposed to billions, of files
o Typically 100 MB or more per file
• Files in HDFS are write-once
• Optimized for streaming reads of large files, not random reads
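To make the "small number of large files" point concrete, the sketch below (plain Python, not the HDFS API; the 128 MB block size is the common HDFS default, an assumption here) shows how HDFS splits a large file into fixed-size blocks that are then distributed across the cluster:

```python
# Illustrative sketch only -- not Hadoop code.
# HDFS stores a file as a sequence of fixed-size blocks;
# 128 MB is a commonly used default block size (assumed here).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

This is why many tiny files are wasteful: each file occupies at least one block entry in the NameNode's metadata, regardless of how small it is.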
11. MapReduce Components
• JobTracker and TaskTracker
• The JobTracker splits a job into smaller tasks (“Map”) and sends them to the TaskTracker process on each node
• Each TaskTracker reports job progress back to the JobTracker, sends data (“Reduce”), or requests new tasks
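The Map/shuffle/Reduce flow described above can be sketched in plain Python (a simulation for illustration, not Hadoop's Java API) using the classic word-count example:

```python
# Illustrative simulation of the MapReduce flow -- not Hadoop code.
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # "Map": emit an intermediate (word, 1) pair for every word in the split
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key
    # (in Hadoop, the framework does this between the two phases)
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # "Reduce": aggregate all values emitted for one key
    return (key, sum(values))

splits = ["Hadoop stores data", "Hadoop processes data"]
intermediate = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# counts == {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster, each input split is mapped on the node that stores it (moving computation to the data), and the shuffle moves only the much smaller intermediate pairs over the network.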
13. Summary
• Open-source data management with scale-out storage
• High performance while handling large and complex data
• Optimized for streaming and distributed processing