
Hadoop Introduction

677 views


The document begins with an introduction to Hadoop and covers the Hadoop 1.x / 2.x services (HDFS, MapReduce, YARN).
It also explains the architecture of Hadoop, how the Hadoop Distributed File System works, and the MapReduce programming model.

Published in: Software


  1. Apache Hadoop is a Java-based software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
  2. Hadoop Distributed File System (HDFS): a distributed, scalable and reliable, fault-tolerant storage system. Map-Reduce programming model: high-performance parallel data processing that employs the divide-and-conquer principle.
  3. A class teacher of class 5 needs to find the name of the student with the highest marks in each subject. Total students: 50. Total subjects: 5. Time to process each subject per student: 1 min. Our goal: to minimize the total time spent. Working alone, the teacher spends 50 × 5 × 1 min = 250 mins. Input: the papers of all 50 students in all 5 subjects. Output: Subject 1: S1-98; Subject 2: S13-95; Subject 3: S1-97; Subject 4: S23-100; Subject 5: S8-99.
  4. HDFS: distribute the data into blocks across multiple nodes. Distribute the papers across 5 peons; each peon gets the papers of 10 students for every subject (50 papers each).
     a) Map phase: apply the business logic to the distributed data in parallel. Each peon reports, for each subject, the student with the highest marks among his 10 students. Total time spent: 50 mins (in parallel).
     b) Reduce phase: iterate over the map-phase output to get the final result. Records left: 5 candidate students for each of the 5 subjects (25 records). Time to pick the student with the highest marks in each subject: 25 mins.
     c) Total time spent: 50 + 25 = 75 mins.
  5. Social media data; analyzing web clickstream data; server log data; machine and sensor data.
  6. HDFS layer: stores files across storage nodes in a Hadoop cluster; consists of the Namenode and Datanodes. Map-Reduce engine: processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner; consists of the Job Tracker and Task Trackers.
  7. Storage & Replication of Blocks in HDFS (diagram): an HDFS client sends a file write request; the file is divided into blocks (Block 1, Block 2, Block 3, Block 4), which are stored on Datanode_1, Datanode_2 and Datanode_3 alongside the Namenode.
  8. Map-Reduce job execution (diagram): the client submits a Map-Reduce job to the Job Tracker. Task Tracker 1, Task Tracker 2 and Task Tracker 3 execute the individual Map-Reduce tasks assigned by the Job Tracker. Each Task Tracker retrieves its data from HDFS, where it is stored on the Datanode of the same system on which the Task Tracker is running; every slave machine hosts both a Task Tracker and a Data Node.
  9. Hadoop daemons: NameNode, DataNode, JobTracker, TaskTracker.
     • NameNode: maps each block to the Datanodes; controls read/write access to files; manages the replication engine for blocks.
     • DataNode: responsible for serving read and write requests (block creation, deletion, and replication).
     • JobTracker: accepts Map-Reduce jobs from the clients; assigns tasks to the Task Trackers and monitors their status.
     • TaskTracker: worker daemon that runs Map-Reduce tasks; sends heartbeats to the Job Tracker; retrieves job resources from HDFS.
  10. Hadoop services: HDFS, MapReduce, YARN. YARN stands for “Yet Another Resource Negotiator”, a framework that provides a generic resource management solution for Hadoop clusters.
  11. Allows easy integration of multiple data processing algorithms with the data stored in HDFS.
  12. Query language; Pig scripting; coordination service; columnar database; log management; data exchange; designing workflows; machine learning; messaging system.
  13. a) Apache Website → http://hadoop.apache.org/ b) Learning YARN → https://www.packtpub.com/big-data-and-business-intelligence/learning-yarn c) Hadoop: The Definitive Guide → http://shop.oreilly.com/product/0636920033448.do
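
The class-teacher example on slides 3 and 4 can be sketched in plain Python to show where the 250, 50 and 25 minute figures come from. This is only an illustration of the map and reduce steps, not Hadoop code: the function names, the randomly generated marks and the way papers are handed to workers are all assumptions made for the sketch.

```python
import random

NUM_STUDENTS = 50
NUM_SUBJECTS = 5
NUM_WORKERS = 5          # the five "peons" from slide 4
MINUTES_PER_PAPER = 1    # time to process one subject for one student


def make_papers(seed=0):
    """Hypothetical input: one (student, subject, marks) record per paper."""
    rng = random.Random(seed)
    return [
        (f"S{student}", f"Subject {subject}", rng.randint(0, 100))
        for student in range(1, NUM_STUDENTS + 1)
        for subject in range(1, NUM_SUBJECTS + 1)
    ]


def split(papers, num_workers):
    """Stand-in for HDFS distribution: equal, contiguous chunks of papers."""
    per_worker = len(papers) // num_workers
    return [papers[i * per_worker:(i + 1) * per_worker]
            for i in range(num_workers)]


def map_phase(chunk):
    """One worker: best (student, marks) per subject within its own chunk."""
    best = {}
    for student, subject, marks in chunk:
        if subject not in best or marks > best[subject][1]:
            best[subject] = (student, marks)
    return best


def reduce_phase(map_outputs):
    """Merge the per-worker results into the overall best per subject."""
    best = {}
    for partial in map_outputs:
        for subject, (student, marks) in partial.items():
            if subject not in best or marks > best[subject][1]:
                best[subject] = (student, marks)
    return best


def run(papers):
    chunks = split(papers, NUM_WORKERS)
    # Executed one after another here; on a cluster these run in parallel.
    map_outputs = [map_phase(chunk) for chunk in chunks]
    result = reduce_phase(map_outputs)
    # Parallel map time is bounded by the largest chunk.
    map_minutes = max(len(chunk) for chunk in chunks) * MINUTES_PER_PAPER
    # The reducer looks at every record the workers reported.
    reduce_minutes = sum(len(p) for p in map_outputs) * MINUTES_PER_PAPER
    return result, map_minutes, reduce_minutes


if __name__ == "__main__":
    papers = make_papers()
    result, map_minutes, reduce_minutes = run(papers)
    print("Sequential minutes:", len(papers) * MINUTES_PER_PAPER)
    print("Map + reduce minutes:", map_minutes + reduce_minutes)
    for subject, (student, marks) in sorted(result.items()):
        print(subject, student, marks)
```

Because the marks are random, the winners differ from those on slide 3; only the time accounting (250 minutes sequentially versus 50 + 25 minutes with five workers) mirrors the slides.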

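
The block storage shown on slide 7, together with the NameNode's job of mapping each block to Datanodes (slide 9), can be pictured with a small sketch. It is a toy model under stated assumptions: the function names are hypothetical, the round-robin placement is a simplification and not HDFS's actual rack-aware placement policy, and the sizes and replication factor in the usage lines are chosen only to reproduce the four blocks and three Datanodes of the diagram.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` is cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block only holds what is left over, so it may be smaller.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication):
    """Toy block map: block number -> list of Datanodes holding a replica.

    Assumption: simple round-robin placement, used here only for
    illustration; it is not the policy a real NameNode applies.
    """
    if not 1 <= replication <= len(datanodes):
        raise ValueError("replication must be between 1 and the node count")
    block_map = {}
    for index in range(num_blocks):
        block_map[index + 1] = [
            datanodes[(index + r) % len(datanodes)]
            for r in range(replication)
        ]
    return block_map


if __name__ == "__main__":
    # Illustrative numbers: a 500 MB file with 128 MB blocks gives 4 blocks.
    blocks = split_into_blocks(500, 128)
    nodes = ["Datanode_1", "Datanode_2", "Datanode_3"]
    for block, holders in place_replicas(len(blocks), nodes, 2).items():
        print(f"Block {block} ({blocks[block - 1]} MB): {holders}")
```

Keeping the block map in one dictionary mirrors the idea that a single daemon knows where every replica lives, while the Datanodes only hold the block data.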