Apache Hadoop 
Design Pathshala 
April 22, 2014 
www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | 
http://designpathshala.com 
Course Details 
 The Motivation for Hadoop 
 Hadoop: Basic Concepts 
 Writing a MapReduce Program 
 Common MapReduce Algorithms 
 PIG Concepts 
 Hive Concepts 
 Working with Sqoop 
 Working with Flume 
 OOZIE Concepts 
 HUE Concepts 
 Reporting Tools 
 Project 
Apache Hadoop 
The Motivation for Hadoop 
Apache Hadoop Bigdata 
Training By Design Pathshala 
Contact us on: admin@designpathshala.com 
Or Call us at: +91 120 260 5512 or +91 98 188 23045 
Visit us at: http://designpathshala.com 
Design Pathshala 
 Every one of our courses is written by experts in their respective fields. 
 We do our best to help you connect real-life examples with real business 
practices. 
 Learn and apply to work or your own business. 
 We provide online classes on different subjects, including Oracle HRMS, 
PeopleSoft HRMS & Java. 
 We offer both weekday and weekend classes. 
Where does data come from? 
Machine generated and historical data 
Three V’s of Bigdata 
Volume 
Velocity 
Variety 
Volume: the amount of data 
• ~3 ZB of data exists in the digital universe today. 
• >300 TB of data in the U.S. Library of Congress. 
• Facebook has 30+ PB. 
• ~2.5 PB of data in DWH. 
• +10 PB DWH size. 
Velocity: how rapidly data is growing 
• 48 hours of new video every minute 
• 571 new websites every minute 
• 500+ TB to Facebook. 
• 175 million tweets every day 
• 1+ million customer transactions every hour 
• Data production will be 44 times greater in 2020 than it was in 2009. 
Variety: the different forms data takes 
Structured 
• Traditional databases 
• Numeric data 
Semi-structured 
• JSON 
• XML 
Unstructured 
• Text documents 
• Email 
• Video 
• Audio 
• Machine generated 
How companies are cashing in on big data 
Predict exactly what customers want before they ask for it 
Marketing Campaign 
Improve customer service 
Fraud Detection 
Get customers excited about their own data 
Identify customer pain points and solve them 
Reduce health care costs and improve treatment 
Social Graph Analysis & Sentiment Analysis 
Research and development 
How data is used by some big Companies for 
different business analysis. 
Big Data Market Forecast 
Career options 
Big data jobs, big pay jobs 
Top Recruiters in India 
Hadoop & Hive History 
 Oct 2003 – Google GFS paper published 
 Dec 2004 – Google MapReduce paper published 
 July 2005 – Nutch uses MapReduce 
 Feb 2006 – Hadoop becomes a Lucene subproject 
 Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster 
 Jan 2008 – Hadoop becomes an Apache Top Level Project 
 Jul 2008 – A 4000-node test cluster 
 Sept 2008 – Hive becomes a Hadoop subproject 
You Say, “tomato…” 
Google calls it:    Hadoop equivalent: 
GFS                 HDFS 
Bigtable            HBase 
Chubby              ZooKeeper 
Problems with current systems 
1 Machine 
• Read 1 TB of data 
• 4 I/O channels 
• 100 MB/s per channel 
• Takes ~45 mins 
Apache Hadoop Wins Terabyte Sort Benchmark (July 2008) 
 Yahoo! sorted 1 TB of data in 209 seconds 
 Beat the previous record of 297 seconds 
 The sort used 1800 map tasks and 1800 reduce tasks 
 Cluster configuration used for the benchmark sort 
 910 nodes 
 2 quad-core Xeons @ 2.0 GHz per node 
 8 GB RAM per node 
Why Hadoop? 
1 Machine 
• Read 1 TB of data 
• 4 I/O channels 
• 100 MB/s per channel 
• Takes ~45 mins 
10 Machines 
• 4 I/O channels each 
• 100 MB/s per channel 
• Takes ~4.5 mins 
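The arithmetic behind the two timings can be checked with a short sketch. It assumes each machine has 4 I/O channels at 100 MB/s each (the "~45 mins" figure only works out with megabytes per second) and that 1 TB means 10^12 bytes; the function name is illustrative.

```python
def read_time_minutes(data_bytes, machines, channels_per_machine, bytes_per_sec_per_channel):
    """Time to scan data_bytes when every channel on every machine reads in parallel."""
    aggregate_bytes_per_sec = machines * channels_per_machine * bytes_per_sec_per_channel
    return data_bytes / aggregate_bytes_per_sec / 60


TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

one_machine = read_time_minutes(TB, machines=1, channels_per_machine=4, bytes_per_sec_per_channel=100 * MB)
ten_machines = read_time_minutes(TB, machines=10, channels_per_machine=4, bytes_per_sec_per_channel=100 * MB)

print(f"1 machine:   {one_machine:.1f} min")   # about 41.7, which the slide rounds to ~45
print(f"10 machines: {ten_machines:.1f} min")  # about 4.2, which the slide rounds to ~4.5
```

The tenfold speed-up holds only if the data is already spread evenly over the ten machines, which is what a distributed file system provides.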
Distributed File System (DFS) 
Individual shares 
• designpathshalaproject 
• designpathshalasoftware 
• designpathshalaimages 
• designpathshalawebsites 
Namespace dp.global.in 
 dp.global.inhomeproject 
 dp.global.inhomeimages 
 dp.global.inhomesoftware 
 dp.global.inhomewebsites 
Who uses Hadoop? 
• 42,000 nodes as of July 2011 
• 4100 nodes 
• 1400 nodes 
What is Hadoop? 
 Hadoop is a framework for distributed processing of large datasets across 
large clusters of commodity computers using a simple programming model. 
 Large datasets: terabytes or petabytes of data 
 Large clusters: hundreds or thousands of nodes 
 Hadoop is an open-source implementation of Google's MapReduce 
 Hadoop is based on a simple programming model called MapReduce 
 Hadoop is based on a simple data model; any data will fit 
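To make the MapReduce programming model concrete, here is a minimal single-process sketch of a word count. It is not Hadoop's API: the function names are illustrative, and the "shuffle" step that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big clusters", "big data"])
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and each reducer receives only the keys assigned to it.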
What makes it especially useful 
 Scalable: It can reliably store and process petabytes. 
 Economical: It distributes the data and processing across clusters of commonly available 
computers (in thousands). 
 Efficient: By distributing the data, it can process it in parallel on the nodes where the 
data is located. 
 Reliable: It automatically maintains multiple copies of data and automatically redeploys 
computing tasks based on failures. 
Hadoop: Assumptions 
 Hardware will fail. 
 Applications need a write-once-read-many access model. 
 Data transfer and I/O are the bottleneck 
 Very large distributed file system 
– 10K nodes, 100 million files, 10 PB 
 Assumes commodity hardware 
– Files are replicated to handle hardware failure 
– Detects failures and recovers from them 
 Move logic rather than data 
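HDFS Architecture 
Components: Client, NameNode (metadata), Secondary NameNode, DataNodes 
NameNode: Holds the metadata, i.e. information about the data 
DataNode: Holds the physical data 
Secondary NameNode: Periodically reads metadata from the NameNode to build checkpoints; it is not a standby NameNode 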
Distributed File System 
 Single Namespace for entire cluster 
 Data Coherency 
– Write-once-read-many access model 
– Client can only append to existing files 
 Files are broken up into blocks 
– Typically 64 MB block size 
– Each block replicated on multiple DataNodes 
 Intelligent Client 
– Client can find location of blocks 
– Client accesses data directly from DataNode 
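How a file is broken into blocks can be sketched as follows. This is an illustration only: the 64 MB block size comes from the slide, the replication factor of 3 is the default mentioned elsewhere in the deck, and `split_into_blocks` is a hypothetical helper, not an HDFS call.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size named on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block; only the last block may be shorter."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


file_size = 200 * 1024 * 1024          # a 200 MB file
blocks = split_into_blocks(file_size)  # three full 64 MB blocks plus one 8 MB block
replication = 3                        # assumed replication factor
print(len(blocks), "blocks,", len(blocks) * replication, "block replicas stored in the cluster")
```

A client that knows the offsets can ask the NameNode where each block lives and then read the byte ranges it needs directly from the DataNodes.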
Hadoop architecture 
Why Hadoop 2.0? 
 Major re-architecture of the distributed file system and processing 
 The YARN architecture allows multiple kinds of workloads to run on Hadoop nodes 
 Interactive SQL Support 
 Integrated streaming support 
 In-memory processing 
 Search 
 Enterprise Security 
 Data Lifecycle Management 
 Readily available tools and libraries 
Hadoop 
Apache Hadoop and the Hadoop Ecosystem 
 MapReduce 
 A distributed data processing model and execution environment that runs on large 
clusters of commodity machines. 
 HDFS 
 A distributed filesystem that runs on large clusters of commodity machines. 
Apache Hadoop and the Hadoop Ecosystem 
 Pig 
 A data flow language and execution environment for exploring very large datasets. 
Pig runs on HDFS and MapReduce clusters. 
 Hive 
 A distributed data warehouse. Hive manages data stored in HDFS and provides a 
query language based on SQL (and which is translated by the runtime engine to 
MapReduce jobs) for querying the data. 
 Sqoop 
 A tool for efficiently moving data between relational databases and HDFS. 
 Oozie 
 Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie 
Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. 
Apache Hadoop and the Hadoop Ecosystem 
 HBase 
 A distributed, column-oriented database. HBase uses HDFS for its underlying 
storage, and supports both batch-style computations using MapReduce and point 
queries (random reads). 
 ZooKeeper 
 A distributed, highly available coordination service. ZooKeeper provides primitives 
such as distributed locks that can be used for building distributed applications. 
 Flume 
 Flume is a distributed, reliable, and available service for efficiently collecting, 
aggregating, and moving large amounts of log data. 
 Storm 
 Apache Storm is a free and open source distributed realtime computation system. 
Storm makes it easy to reliably process unbounded streams of data, doing for 
realtime processing what Hadoop did for batch processing. 
Apache Hadoop and the Hadoop Ecosystem 
 Spark 
 Apache Spark™ is a fast and general engine for large-scale data processing. 
 Drill 
 Apache Drill provides direct queries on self-describing and semi-structured data in 
files (such as JSON, Parquet) and HBase tables without needing to specify metadata 
definitions in a centralized store such as Hive metastore. 
 Avro 
 A serialization system for efficient, cross-language RPC, and persistent data 
storage. 
Oozie Workflows 
[Diagram: Oozie workflows run Pig, Sqoop jobs, and Hive scripts. Components shown: Source Databases (Reporting); HDFS (domain/xbec/dwh) holding Source Table Data, Temporary Table Data, and DW Table Data; Hive Tables; MySQL Data Warehouse; Dashboard Reporting.] 
Big Data Platform 
[Architecture diagram. Labels shown, grouped by layer: 
Base: HDFS / MapReduce / Hive / Pig on Hortonworks HDP 2.0 
Core: Build & Deploy, VM Provisioning & Software Deployment, Cloud Foundry / Open Shift 
Data As A Service: Summary Data Services, Repository API, Command & Job API, Hive Metadata API, Workflow DSL API, Workflow API, Storage API, External MR Submissions 
Security Layer: SSO, User Auth, Authorization, Kerberos Security, Hive Security, Pig/MR Security, HDFS Data Privacy, Domain & User Mgmt 
Analytics: Real Time Analytics, Log Analytics, Data Analytics, Spatial Analytics, Tracking Analytics, Analytics Functions, R Integration, Fast SQL Layer 
Integration: Hadoop Platform Integration, Gateway API, RDBMS Integration, API Queues, Flume Integration, Queuing & Ingestion, Data Export, Existing DW 
Storage Management: Detached Storage, Archiving, External Storage 
Other: Events, Elastic Search & Indexing, Application & Notifications + CEP, Algorithms, Mahout Execution] 
Multi-tenant Big Data Platform w/ Hadoop & PaaS Platforms 
[Architecture diagram. Components shown: 
Data Inputs Gateway (Talend, Fuse): JMS, RDBMS/Sqoop, log files/Flume, REST/WebHDFS, etc. 
Ingestion Engine (Kafka) 
Real-time Stream Processing Engine (Storm) 
Intermediate Data Store & Search (MySQL/HBase, Cassandra, Solr) & Elastic Search 
Hadoop HDFS (MapReduce/Pig/Hive) 
SQL/Hive Engines (Data Warehouse) (Stinger/HAWQ) 
Predictive Analytics & Machine Learning (Mahout/R) 
Analytics Libraries: Batch Metrics & ETL, Real-time Metrics, Predictive Metrics 
Data Integration/ETL/Workflows (Oozie) 
Cloud Orchestration (ZooKeeper, YARN, Ambari) 
Hadoop 2.0 Platform (Hortonworks) 
PaaS Platform; Virtualization (Public/Private/Hybrid) 
Application/Rules/Webservices (REST) Layer (JBOSS & JBRMS) 
Data Access: Reporting/BI Tools (ex: Jaspersoft), Ad-hoc/Interactive Analytics, Analytics & Business Applications, API Queues, Data Export 
Existing Datawarehouse Platforms: Data Warehouse (Domain Specific) (Traditional DW), ETL 
Devices, Visualization, Search, Reporting & Alerts] 
Complex queries: Hadoop compared with traditional DBs 
Which Hadoop Distribution? 
Type: Pureplay (Apache/Open Source) 
• Hortonworks 
  Pros: 100% open source version; integration/services focused; extensive partnership network 
  Cons: slower interactive queries 
• Cloudera 
  Pros: widely used distribution; faster interactive queries; extensive tooling 
  Cons: proprietary extensions like Impala; commercial version only 
• MapR 
  Pros: enterprise and production-ready focus; works with NFS & native Unix commands 
  Cons: less focused on using new Hadoop features such as YARN, etc. 
Type: Proprietary 
• PivotalHD 
  Pros: faster interactive query support with Greenplum; integrates with CloudFoundry PaaS platform 
  Cons: proprietary extensions; not easy to decouple 
• IBM 
  Pros: offers open source without branch version; integrated with PaaS and IBM tools 
  Cons: limited releases; expensive; may not be easy to decouple 
[Diagram: File F is split into blocks 1 to 5 (64 MB each). Replicas of data blocks 1, 2, and 3 are spread over Disks 1 to 12 across Rack 1, Rack 2, and Rack 3.] 
Block Placement 
 Current Strategy 
-- One replica on local node 
-- Second replica on a remote rack 
-- Third replica on same remote rack 
-- Additional replicas are randomly placed 
 Clients read from nearest replica 
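The placement strategy listed above can be sketched in a few lines. This is an illustration, not HDFS code: the topology dictionary, node names, and function names are hypothetical, and the sketch assumes at least two racks with at least two nodes each.

```python
import random


def rack_of(node, topology):
    """Return the rack that contains the given node."""
    return next(rack for rack, nodes in topology.items() if node in nodes)


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block: local node, remote rack, same remote rack, then random."""
    chosen = [writer]  # first replica on the local node
    if replication >= 2:
        remote_rack = rng.choice([r for r in topology if r != rack_of(writer, topology)])
        second = rng.choice(topology[remote_rack])  # second replica on a remote rack
        chosen.append(second)
        if replication >= 3:
            others = [n for n in topology[remote_rack] if n != second]
            chosen.append(rng.choice(others))       # third replica on the same remote rack
    # Any additional replicas are placed randomly on nodes not yet used.
    remaining = [n for nodes in topology.values() for n in nodes if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[: replication - len(chosen)])
    return chosen


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
replicas = place_replicas("dn1", topology, rng=random.Random(0))
print(replicas)
```

Keeping two of the three replicas on one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.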
Main Properties of HDFS 
 Large: An HDFS instance may consist of thousands of server machines, each 
storing part of the file system’s data 
 Replication: Each data block is replicated multiple times (the default replication factor is 3) 
 Failure: Failure is the norm rather than the exception 
 Fault Tolerance: Detection of faults and quick, automatic recovery from 
them is a core architectural goal of HDFS 
 DataNodes send heartbeats to the NameNode 
NameNode Metadata 
 Metadata in memory 
Types of metadata 
– List of files 
– List of blocks for each file 
– List of DataNodes for each block 
– File attributes, e.g. creation time, replication factor 
 A transaction log 
– Records file creations, file deletions, etc. 
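The kinds of metadata listed above map naturally onto a few in-memory dictionaries plus an append-only log. The sketch below is a toy model under that assumption; the class, method, and field names are illustrative and do not mirror the real NameNode implementation.

```python
import time
from dataclasses import dataclass, field


@dataclass
class NameNodeMetadata:
    files: dict = field(default_factory=dict)            # path -> ordered list of block ids
    block_locations: dict = field(default_factory=dict)  # block id -> set of DataNode names
    attributes: dict = field(default_factory=dict)       # path -> creation time, replication factor
    transaction_log: list = field(default_factory=list)  # append-only record of namespace changes

    def create_file(self, path, block_ids, replication=3, created=None):
        self.files[path] = list(block_ids)
        self.attributes[path] = {
            "created": time.time() if created is None else created,
            "replication": replication,
        }
        self.transaction_log.append(("create", path))

    def delete_file(self, path):
        for block_id in self.files.pop(path):
            self.block_locations.pop(block_id, None)
        del self.attributes[path]
        self.transaction_log.append(("delete", path))

    def block_report(self, datanode, block_ids):
        """Record which blocks a DataNode says it holds."""
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return each block of the file with the DataNodes that hold it."""
        return [(b, sorted(self.block_locations.get(b, ()))) for b in self.files[path]]


nn = NameNodeMetadata()
nn.create_file("/logs/a.txt", ["blk_1", "blk_2"], created=0.0)
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
print(nn.locate("/logs/a.txt"))  # [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1'])]
```

Only namespace changes go into the log in this sketch; block locations are rebuilt from what the DataNodes report.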
DataNode 
 A Block Server 
– Stores data in the local file system 
– Stores meta-data of a block 
– Serves data to Clients 
 Block Report 
– Periodically sends a report of all existing blocks to the 
NameNode 
 Facilitates Pipelining of Data 
– Forwards data to other specified DataNodes 
Hadoop Master/Slave Architecture 
 Hadoop is designed as a master-slave shared-nothing architecture 
Master node (single node) 
Many slave nodes 
JobTracker 
 The master node runs the JobTracker instance, which accepts job requests from 
clients 
 There is only one JobTracker daemon running per Hadoop cluster 
 Determines the execution plan by determining which files to process 
 Assigns nodes to different tasks 
 Monitors all tasks as they are running 
TaskTracker 
 Manages execution of individual tasks on each data node 
 One TaskTracker per data node 
 Each TaskTracker can spawn multiple JVMs to handle many map or reduce 
tasks in parallel 
 The TaskTracker constantly communicates with the JobTracker 
 If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of 
time, it assumes the TaskTracker has crashed. In such a scenario, the JobTracker 
will resubmit the task to some other TaskTracker. 
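The heartbeat-timeout behaviour can be sketched as a small monitor that records the last heartbeat per TaskTracker and moves tasks off trackers that have gone silent. The class, function, and tracker names are hypothetical, the 600-second timeout is only an illustrative value, and times are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # tracker name -> time of last heartbeat

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def crashed(self, now):
        """Trackers whose last heartbeat is older than the timeout."""
        return sorted(
            tracker
            for tracker, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        )


def resubmit(assignments, crashed, live):
    """Return a new task -> tracker mapping with tasks moved off crashed trackers."""
    if not live:
        raise ValueError("no live TaskTracker available")
    result = dict(assignments)
    moved = 0
    for task in sorted(assignments):
        if assignments[task] in crashed:
            result[task] = live[moved % len(live)]  # spread moved tasks round-robin
            moved += 1
    return result


monitor = HeartbeatMonitor(timeout_seconds=600)  # illustrative timeout
monitor.heartbeat("tt1", now=0)
monitor.heartbeat("tt2", now=0)
monitor.heartbeat("tt1", now=500)                # tt2 stays silent
dead = monitor.crashed(now=700)
new_assignments = resubmit({"map_0": "tt1", "map_1": "tt2"}, dead, live=["tt1"])
print(dead, new_assignments)  # ['tt2'] {'map_0': 'tt1', 'map_1': 'tt1'}
```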
Job Tracker 
[Diagram: job submission. The User copies input files to DFS and submits the job to the Client. The Client gets input file info, creates splits, uploads the job info (Job.XML, Job.jar) to DFS, and submits the job to the Job Tracker.] 
Job Tracker Cont.. 
[Diagram: job initialization. The Client submits the job to the Job Tracker. The Job Tracker initializes the job in the Job Queue, reads the files (Job.XML, Job.jar, splits S) from DFS, and creates map (M) and reduce (R) tasks. No. of maps = no. of input splits.] 
Job Tracker Cont.. 
[Diagram: task assignment. The Task Tracker sends a heartbeat to the Job Tracker. The Job Tracker picks a task from the Job Queue and assigns it to the Task Tracker.] 
Job Tracker Cont.. 
[Diagram: task execution. The Job Tracker assigns a task to the Task Tracker. The Task Tracker fetches Job.xml and Job.jar from DFS and reads from local disk.] 
Heartbeats 
 DataNodes send heartbeats to the NameNode 
 NameNode uses heartbeats to detect DataNode failure 
Replication Engine 
 NameNode detects DataNode failures 
 Chooses new DataNodes for new replicas 
 Balances disk usage 
 Balances communication traffic to DataNodes 
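A rough sketch of what such a replication engine does after a DataNode failure: find the blocks that lost a replica and pick the least-loaded DataNodes that do not already hold them. The function, its arguments, and the "blocks stored" usage measure are assumptions made for illustration; the real NameNode also weighs rack placement and network traffic.

```python
def re_replicate(block_locations, usage, failed, replication=3):
    """Plan new replicas for blocks that lost a copy on the failed DataNode.

    block_locations: block id -> set of DataNodes holding it
    usage: DataNode -> number of blocks stored (stand-in for disk usage)
    Returns block id -> list of DataNodes that should receive a new replica.
    """
    usage = {node: used for node, used in usage.items() if node != failed}
    plan = {}
    for block, nodes in sorted(block_locations.items()):
        live = nodes - {failed}
        missing = max(replication - len(live), 0)
        # Prefer the emptiest DataNodes that do not already hold this block.
        candidates = sorted((n for n in usage if n not in live), key=lambda n: (usage[n], n))
        chosen = candidates[:missing]
        for node in chosen:
            usage[node] += 1  # account for the new replica when planning later blocks
        if chosen:
            plan[block] = chosen
    return plan


locations = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn2", "dn3", "dn4"}}
usage = {"dn1": 1, "dn2": 2, "dn3": 2, "dn4": 1, "dn5": 0}
plan = re_replicate(locations, usage, failed="dn3")
print(plan)  # {'b1': ['dn5'], 'b2': ['dn1']}
```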
Data Pipeline & Write Anatomy 
[Diagram: The HDFS Client asks the Name Node to add a block, writes the block through a pipeline of three Data Nodes, receives acks back, and then reports complete.] 
Data Pipelining 
 Client retrieves a list of DataNodes on which to place 
replicas of a block 
 Client writes block to the first DataNode 
 The first DataNode forwards the data to the next 
DataNode in the Pipeline 
 When all replicas are written, the Client moves on to 
write the next block in file 
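The steps above can be sketched as a recursive forward-and-acknowledge loop. This is a simplified model: real HDFS streams a block in small packets, while here a whole block is handed down the pipeline at once, and the function names and the `choose_datanodes` callback are hypothetical.

```python
def write_block(block, pipeline, storage):
    """Store the block on the first DataNode, forward it downstream, and return the acks.

    Acks are listed in the order they travel back to the client: last DataNode first.
    """
    if not pipeline:
        return []
    first, rest = pipeline[0], pipeline[1:]
    storage.setdefault(first, []).append(block)    # this DataNode persists the block
    downstream_acks = write_block(block, rest, storage)  # then forwards it to the next one
    return downstream_acks + [first]


def write_file(blocks, choose_datanodes):
    """Write blocks one at a time, moving on only when every replica has been acked."""
    storage = {}
    for block in blocks:
        pipeline = choose_datanodes(block)         # list of DataNodes obtained for this block
        acks = write_block(block, pipeline, storage)
        if set(acks) != set(pipeline):
            raise IOError("not all replicas were acknowledged")
    return storage


stored = write_file([b"block-0", b"block-1"], lambda block: ["dn1", "dn2", "dn3"])
print(sorted(stored))  # ['dn1', 'dn2', 'dn3']
```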
Read Anatomy 
[Diagram: The HDFS Client gets block locations from the Name Node, then reads the blocks directly from the Data Nodes.] 
Data Correctness 
 Use Checksums to validate data 
– Use CRC32 
 File Creation 
– Client computes a checksum per 512 bytes 
– DataNode stores the checksum 
 File access 
– Client retrieves the data and checksum from DataNode 
– If Validation fails, Client tries other replicas 
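The checksum scheme can be illustrated with Python's standard `zlib.crc32`. The sketch computes one CRC32 per 512-byte chunk and, on read, falls back to another replica when validation fails; the helper names and the in-memory "replicas" are assumptions made for the example.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum, as on the slide


def checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC32 value per chunk of the data."""
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def read_verified(replicas, expected):
    """Return the first replica whose checksums match; try the next replica otherwise."""
    for data in replicas:
        if checksums(data) == expected:
            return data
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5                         # 1280 bytes, so 3 chunks
stored_checksums = checksums(original)                   # computed at file creation
corrupted = bytes([original[0] ^ 0xFF]) + original[1:]   # flip the bits of the first byte

data = read_verified([corrupted, original], stored_checksums)
print(len(stored_checksums), data == original)  # 3 True
```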

eMarketer Webinar: Data Management Platforms—Using Big Data to Power Marketin...eMarketer Webinar: Data Management Platforms—Using Big Data to Power Marketin...
eMarketer Webinar: Data Management Platforms—Using Big Data to Power Marketin...
 

Similaire à Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala

There is more to Big Data than data
There is more to Big Data than dataThere is more to Big Data than data
There is more to Big Data than data
Capgemini
 
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabasePowering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
Kinetica
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Pentaho
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
PyData
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Jonathan Seidman
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitz
Raghu Kashyap
 

Similaire à Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala (20)

Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop Solution
 
There is more to Big Data than data
There is more to Big Data than dataThere is more to Big Data than data
There is more to Big Data than data
 
Dr. Bjarne Berg for Knowledge Stream
Dr. Bjarne Berg for Knowledge StreamDr. Bjarne Berg for Knowledge Stream
Dr. Bjarne Berg for Knowledge Stream
 
Bn1028 demo hadoop administration and development
Bn1028 demo  hadoop administration and developmentBn1028 demo  hadoop administration and development
Bn1028 demo hadoop administration and development
 
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabasePowering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by Keylabs
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
QMeeting 2018 - Como integrar qlik e cloudera
QMeeting 2018 - Como integrar qlik e clouderaQMeeting 2018 - Como integrar qlik e cloudera
QMeeting 2018 - Como integrar qlik e cloudera
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big Data
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitz
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWS
 
Big Trends in Big Data
Big Trends in Big DataBig Trends in Big Data
Big Trends in Big Data
 

Dernier

怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 

Dernier (20)

Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 

Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala

  • 1. Apache Hadoop Design Pathshala April 22, 2014 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 1
  • 2. Course Details
      - The Motivation for Hadoop
      - Hadoop: Basic Concepts
      - Writing a MapReduce Program
      - Common MapReduce Algorithms
      - Pig Concepts
      - Hive Concepts
      - Working with Sqoop
      - Working with Flume
      - Oozie Concepts
      - Hue Concepts
      - Reporting Tools
      - Project
  • 3. Apache Hadoop: The Motivation for Hadoop. Design Pathshala, April 22, 2014
  • 4. Apache Hadoop Bigdata Training By Design Pathshala. Contact us on: admin@designpathshala.com or call us at: +91 120 260 5512 or +91 98 188 23045. Visit us at: http://designpathshala.com
  • 5. Design Pathshala
      - Every one of our courses is written by experts in their respective fields.
      - We do our best to help you connect real-life examples with real business practices.
      - Learn and apply what you learn to your work or your own business.
      - We provide online classes on different subjects, including Oracle HRMS, PeopleSoft HRMS and Java.
      - We offer both weekday and weekend classes.
  • 6. How does data arrive?
  • 7. Machine-generated and historical data
  • 8. Three V's of Big Data: Volume, Velocity, Variety
  • 9. Volume: the amount of data
      - ~3 ZB of data exists in the digital universe today.
      - >300 TB of data in the U.S. Library of Congress.
      - Facebook has 30+ PB.
      - ~2.5 PB of data in a data warehouse (DWH).
      - 10+ PB DWH size.
  • 10. Velocity: how rapidly data is growing
      - 48 hours of new video every minute
      - 571 new websites every minute
      - 500+ TB to Facebook
      - 175 million tweets every day
      - 1+ million customer transactions every hour
      - Data production will be 44 times greater in 2020 than it was in 2009.
  • 11. Variety: the different forms data takes
      - Structured: traditional databases, numeric data
      - Semi-structured: JSON, XML
      - Unstructured: text documents, email, video, audio, machine-generated data
  • 13. How companies are making money from big data
      - Predict exactly what customers want before they ask for it
      - Marketing campaigns
      - Improve customer service
      - Fraud detection
      - Get customers excited about their own data
      - Identify customer pain points and solve them
      - Reduce health care costs and improve treatment
      - Social graph analysis and sentiment analysis
      - Research and development
  • 14. How some big companies use data for different kinds of business analysis
  • 15. Big Data Market Forecast
  • 17. Career options
  • 18. Big data jobs, big pay jobs
  • 19. Top Recruiters in India
  • 20. Hadoop & Hive History
      - Dec 2004 – Google MapReduce paper published
      - July 2005 – Nutch uses MapReduce
      - Feb 2006 – Hadoop becomes a Lucene subproject
      - Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster
      - Jan 2008 – Hadoop becomes an Apache Top Level Project
      - Jul 2008 – A 4000-node test cluster
      - Sept 2008 – Hive becomes a Hadoop subproject
  • 21. You Say, “tomato…”
      - Google calls it: GFS. Hadoop equivalent: HDFS
      - Google calls it: Bigtable. Hadoop equivalent: HBase
      - Google calls it: Chubby. Hadoop equivalent: ZooKeeper
  • 22. Problems with current systems
      - 1 machine with 4 I/O channels at 100 MB/s each
      - Reading 1 TB of data takes ~45 mins
  • 24. Apache Hadoop Wins Terabyte Sort Benchmark (July 2008)
      - Yahoo! sorted 1 TB of data in 209 seconds.
      - This beat the previous record of 297 seconds.
      - The sort used 1800 mappers and 1800 reducers.
      - Cluster configuration used for the benchmark sort:
          - 910 nodes
          - 2 quad-core Xeons @ 2.0 GHz per node
          - 8 GB RAM per node
  • 25. Why Hadoop?
      - 1 machine (4 I/O channels at 100 MB/s each): reading 1 TB of data takes ~45 mins
      - 10 machines (4 I/O channels at 100 MB/s each): reading the same 1 TB takes ~4.5 mins
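
The timings on slides 22 and 25 can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the slide's "100" means 100 megabytes per second per I/O channel and that the data is spread evenly over the machines; the function name and the decimal definition of a terabyte are illustrative choices, not taken from the deck.

```python
def read_time_minutes(data_mb: float, machines: int,
                      channels_per_machine: int = 4,
                      mb_per_sec_per_channel: float = 100.0) -> float:
    """Minutes needed to scan data_mb when it is spread evenly over the machines."""
    aggregate_mb_per_sec = machines * channels_per_machine * mb_per_sec_per_channel
    return data_mb / aggregate_mb_per_sec / 60.0


ONE_TB_IN_MB = 1_000_000  # decimal terabyte (assumption)

single = read_time_minutes(ONE_TB_IN_MB, machines=1)
cluster = read_time_minutes(ONE_TB_IN_MB, machines=10)
print(f"1 machine: {single:.1f} min, 10 machines: {cluster:.1f} min")
```

Under these assumptions the single machine needs a little over 40 minutes and the ten-machine cluster a tenth of that, which matches the rounded ~45 and ~4.5 minute figures on the slides.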
  • 26. Distributed File System (DFS)
      - Shares: designpathshala\project, designpathshala\software, designpathshala\images, designpathshala\websites
      - Namespace dp.global.in: dp.global.in\home\project, dp.global.in\home\images, dp.global.in\home\software, dp.global.in\home\websites
  • 27. Who uses Hadoop?
      - 42,000 nodes as of July 2011
      - 4100 nodes
      - 1400 nodes
  • 28. What is Hadoop?
      - Hadoop is a framework for distributed processing of large datasets across large clusters of commodity computers using a simple programming model.
      - Large datasets: terabytes or petabytes of data
      - Large clusters: hundreds or thousands of nodes
      - Hadoop is an open-source implementation of Google's MapReduce.
      - Hadoop is based on a simple programming model called MapReduce.
      - Hadoop is based on a simple data model; any data will fit.
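
To make the MapReduce programming model concrete, here is a minimal single-process sketch of the classic word count. It only imitates the three stages (map, shuffle, reduce) in plain Python; it does not use Hadoop's API, and all function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input, and the framework performs the shuffle across the network; the sketch keeps only the data flow.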
  • 30. What makes it especially useful
      - Scalable: It can reliably store and process petabytes.
      - Economical: It distributes the data and processing across clusters of commonly available computers (numbering in the thousands).
      - Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
      - Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks after failures.
  • 31. Hadoop: Assumptions
      - Hardware will fail.
      - Applications need a write-once-read-many access model.
      - Data transfer and I/O are the bottleneck.
      - Very large distributed file system: 10K nodes, 100 million files, 10 PB
      - Assumes commodity hardware:
          - Files are replicated to handle hardware failure.
          - Failures are detected and recovered from.
      - Move the logic to the data rather than the data to the logic.
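
The "move logic rather than data" assumption can be illustrated with a toy scheduling rule: run a task on a node that already stores the block it needs, and ship the data only when no such node is free. This is a hedged sketch of the idea, not Hadoop's scheduler; `pick_node` and the node names are hypothetical.

```python
def pick_node(block_replicas: list[str], free_nodes: set[str]) -> tuple[str, bool]:
    """Return (node, is_local), preferring a free node that already stores the block."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in block_replicas:
        if node in free_nodes:
            return node, True  # the task runs where the data already is
    # No replica holder is free: fall back to any free node and transfer the block.
    return min(free_nodes), False


print(pick_node(["dn2", "dn3"], {"dn1", "dn3"}))
print(pick_node(["dn2"], {"dn1", "dn4"}))
```

The first call finds a free replica holder and stays local; the second has to fall back to a node that must fetch the block over the network.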
  • 32. HDFS Architecture
      - Components shown: Client, NameNode (metadata), Secondary NameNode, DataNodes
      - NameNode: holds the metadata, that is, information about the data
      - DataNode: holds the physical data
      - Secondary NameNode: periodically reads metadata from the NameNode to build checkpoints
  • 33. Distributed File System
      - Single namespace for the entire cluster
      - Data coherency:
          - Write-once-read-many access model
          - Client can only append to existing files
      - Files are broken up into blocks:
          - Typically 64 MB block size
          - Each block is replicated on multiple DataNodes
      - Intelligent client:
          - Client can find the location of blocks
          - Client accesses data directly from the DataNode
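
The block and replication points above can be sketched as a small planning function. It assumes the 64 MB block size quoted on the slide and a replication factor of 3, which the slide does not state; the round-robin placement is a simplification (real HDFS placement is rack-aware), and `plan_blocks` and the DataNode names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted on the slide
REPLICATION = 3     # assumed replication factor (not stated on the slide)


def plan_blocks(file_size_mb: int, datanodes: list[str]) -> list[dict]:
    """Split a file into fixed-size blocks and place each on REPLICATION distinct nodes."""
    if len(datanodes) < REPLICATION:
        raise ValueError("need at least REPLICATION DataNodes")
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    plan = []
    for i in range(n_blocks):
        # The last block may be smaller than the block size.
        size = min(BLOCK_SIZE_MB, file_size_mb - i * BLOCK_SIZE_MB)
        # Simplified round-robin placement over consecutive nodes.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(REPLICATION)]
        plan.append({"block": i, "size_mb": size, "replicas": replicas})
    return plan


plan = plan_blocks(200, ["dn1", "dn2", "dn3", "dn4"])
for entry in plan:
    print(entry)
```

A 200 MB file becomes three full blocks and one 8 MB block, each stored on three different DataNodes, so the loss of any single node leaves every block readable.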
  • 34. Hadoop architecture
  • 36. Why Hadoop 2.0?
      - Major re-architecture of the distributed file system and processing
      - The YARN architecture allows multiple workloads to run on Hadoop nodes
      - Interactive SQL support
      - Integrated streaming support
      - In-memory processing
      - Search
      - Enterprise security
      - Data lifecycle management
      - Readily available tools and libraries
  • 37. Hadoop
  • 39. Apache Hadoop and the Hadoop Ecosystem
      - MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
      - HDFS: A distributed filesystem that runs on large clusters of commodity machines.
  • 41. Apache Hadoop and the Hadoop Ecosystem
      - Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
      - Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs.
      - Sqoop: A tool for efficiently moving data between relational databases and HDFS.
      - Oozie: A workflow scheduler system for managing Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
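
Because an Oozie workflow is a DAG of actions, a valid execution order is any topological order of that graph. The sketch below shows the idea with Python's standard `graphlib` module; it is not Oozie's XML workflow format, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_load": {"pig_clean"},
    "hive_report": {"hive_load"},
    "sqoop_export": {"hive_load"},
}

# static_order() yields every action after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is the reason a workflow has to be acyclic before it can be scheduled.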
  • 42. Apache Hadoop and the Hadoop Ecosystem
      - HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
      - ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
      - Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
      - Storm: Apache Storm is a free and open-source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
  • 43. Apache Hadoop and the Hadoop Ecosystem
      - Spark: Apache Spark™ is a fast and general engine for large-scale data processing.
      - Drill: Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON and Parquet) and HBase tables, without needing metadata definitions in a centralized store such as the Hive metastore.
      - Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
  • 46. Oozie Workflows (diagram; labels as extracted): Pig; Sqoop Jobs/Hive Scripts; Source Databases (Reporting); HDFS (domain/xbec/dwh); Source Table Data; Temporary Table Data; DW Table Data; Hive Tables; MySQL Data Warehouse; Dashboard Reporting
  • 47. Big Data Platform (architecture diagram; labels as extracted): Data Export Algorithms Integration Mahout Execution Base HDFS / MapReduce / Hive / Pig on Hortonworks HDP 2.0 Data As A Service Summary Data Services Repository API Command & Job API Hive Metadata API Workflow DSL API Workflow API API Storage API External MR Submissions Events Elastic Search & Indexing Application & Notifications + CEP Domain & User Mgmt Security Layer SSO User Auth Authorization Hadoop Platform Integration Gateway API Other Real Time Analytics Kerberos Security Hive Security Pig/MR Security HDFS Data Privacy Log Analytics Data Analytics Spatial Analytics Tracking Analytics RDBMS Integration API Queues Flume Integration Queuing & Ingestion Real-time Analytics R Integration Detached Storage Archiving Storage Management Core Build & Deploy VM Provisioning & Software Deployment Cloud Foundry / OpenShift Existing DW External Storage Analytics Functions Fast SQL Layer
  • 48. Multi-tenant Big Data Platform w/ Hadoop & PaaS Platforms [Architecture diagram labels: Data Warehouse (Domain Specific) (Traditional DW); Ingestion Engine (Kafka); Devices, Visualization, Search, Reporting & Alerts; Application/Rules/Webservices (REST) Layer (JBoss & JBRMS); Intermediate Storage Engine (Cassandra & HBase & Solr); Real-time Processing Engine (Storm); (MapReduce/Pig/Hive/Stinger); Ingestion Engine (Kafka); Real-time Stream Processing Engine (Storm); Intermediate Data Store & Search (MySQL/HBase) & Elastic Search; Hadoop HDFS; Hadoop HDFS (MapReduce/Pig/Hive); Predictive Analytics & Machine Learning (Mahout/R); Data Inputs Gateway (Talend, Fuse) (JMS, RDBMS/Sqoop, Log files/Flume, REST/WebHDFS, etc.); Hadoop 2.0 Platform (Hortonworks); Data Integration/ETL/Workflows (Oozie); Cloud Orchestration (Zookeeper, YARN, Ambari); Reporting/BI Tools (ex: Jaspersoft); Batch Metrics & ETL; Real-time Metrics; Predictive Metrics; Analytics Libraries; SQL/Hive Engines (Data Warehouse) (Stinger/HAWQ); ETL; PaaS Platform; Virtualization (Public/Private/Hybrid); Data Access; Existing Datawarehouse Platforms; Data Export; Ad-hoc/Interactive Analytics; Analytics & Business Applications; API Queues]
  • 50. Complex queries: Hadoop compared with traditional databases
  • 51. Which Hadoop Distribution? (Columns: Type, Distribution, Pros, Cons)  Type: Pureplay (Apache/Open Source)  Hortonworks – Pros: 100% open source version; integration/services focused; extensive partnership network – Cons: slower interactive queries  Cloudera – Pros: widely used distribution; faster interactive queries; extensive tooling – Cons: proprietary extensions like Impala; commercial version only  MapR – Pros: enterprise and production-ready focused; works with NFS & native Unix commands – Cons: less focused on using new Hadoop features such as YARN, etc.  Type: Proprietary  PivotalHD – Pros: faster interactive query support with Greenplum; integrates with CloudFoundry PaaS platform – Cons: proprietary extensions; not easy to decouple  IBM – Pros: offers open source without branch version; integrated with PaaS and IBM tools – Cons: limited releases; expensive; may not be easy to decouple
  • 52. [Diagram: File F is split into blocks 1 to 5 (64 MB each); copies of the data blocks are spread over Disk 1 to Disk 12 across Rack 1, Rack 2, and Rack 3]
  • 53. Block Placement  Current Strategy – One replica on the local node – Second replica on a remote rack – Third replica on the same remote rack – Additional replicas are placed randomly  Clients read from the nearest replica
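The placement strategy on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `place_replicas`, the `topology` dictionary (rack name to node names), and the node names are hypothetical, and the sketch assumes the chosen remote rack has at least two nodes. It is not the NameNode's actual code.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick target nodes for one block, following the slide's strategy.

    topology: hypothetical dict mapping rack name -> list of node names.
    writer:   the node the client is writing from (the "local node").
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # First replica: the local node.
    targets = [writer]

    # Second replica: a node on a remote (different) rack.
    remote_racks = [rack for rack in topology if rack != rack_of[writer]]
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])
    targets.append(second)

    # Third replica: another node on that same remote rack.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    targets.append(third)

    # Additional replicas: random nodes that do not hold a copy yet.
    remaining = [n for n in rack_of if n not in targets]
    rng.shuffle(remaining)
    targets.extend(remaining[: max(0, replication - 3)])

    return targets[:replication]
```

Keeping the second and third replicas on one remote rack means a block write crosses the rack boundary only once, while the block still survives the loss of either rack.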
  • 55. Main Properties of HDFS  Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data  Replication: Each data block is stored as multiple replicas (the default replication factor is 3)  Failure: Failure is the norm rather than the exception  Fault Tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS  DataNodes send heartbeats to the NameNode
  • 57. NameNode Metadata  Metadata in memory  Types of metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor  A transaction log – Records file creations, file deletions, etc.
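As a rough picture of the metadata listed on this slide, the sketch below keeps the file list, the per-file block list, the per-block DataNode list, and an append-only transaction log in plain Python structures. All class, field, and method names (`FileEntry`, `NameNodeMetadata`, `edit_log`, and so on) are hypothetical and chosen only for illustration; the real NameNode's data structures are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    replication: int                            # file attribute: replication factor
    creation_time: float                        # file attribute: creation time
    blocks: list = field(default_factory=list)  # ordered block IDs of the file


@dataclass
class NameNodeMetadata:
    files: dict = field(default_factory=dict)            # path -> FileEntry
    block_locations: dict = field(default_factory=dict)  # block ID -> set of DataNodes
    edit_log: list = field(default_factory=list)         # append-only transaction log

    def create_file(self, path, replication, now):
        self.files[path] = FileEntry(replication, now)
        self.edit_log.append(("create", path))

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)
        self.edit_log.append(("add_block", path, block_id))

    def delete_file(self, path):
        # Dropping a file also drops the location entries of its blocks.
        for block_id in self.files.pop(path).blocks:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("delete", path))
```

Every change to the namespace is appended to the log, so the in-memory state could in principle be rebuilt by replaying it.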
  • 58. DataNode  A Block Server – Stores data in the local file system – Stores metadata of a block – Serves data to clients  Block Report – Periodically sends a report of all existing blocks to the NameNode  Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 60. Hadoop Master/Slave Architecture  Hadoop is designed as a master-slave, shared-nothing architecture [Diagram: master node (single node); many slave nodes]
  • 61. JobTracker  The master node runs the JobTracker instance, which accepts job requests from clients  There is only one JobTracker daemon running per Hadoop cluster  Determines the execution plan by determining which files to process  Assigns nodes to different tasks  Monitors all tasks as they are running
  • 62. TaskTracker  Manages execution of individual tasks on each data node  One TaskTracker per data node  Each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel  The TaskTracker constantly communicates with the JobTracker  If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits its tasks to another TaskTracker
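The heartbeat timeout described on this slide can be illustrated with a small sketch. `JobTrackerSketch` and its methods are hypothetical names, timestamps are passed in explicitly so the example stays deterministic, and the round-robin choice of a replacement TaskTracker is an assumption made here, not something the slides specify.

```python
class JobTrackerSketch:
    """Toy model of heartbeat-based failure detection (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # TaskTracker name -> time of last heartbeat
        self.assignments = {}     # task ID -> TaskTracker name

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def assign(self, task_id, tracker):
        self.assignments[task_id] = tracker

    def reassign_from_dead_trackers(self, now):
        """Move tasks away from trackers whose heartbeat is overdue."""
        dead = {t for t, seen in self.last_heartbeat.items()
                if now - seen > self.timeout}
        alive = sorted(set(self.last_heartbeat) - dead)
        moved = {}
        if not alive:
            return moved  # nowhere to resubmit the tasks
        for i, (task_id, tracker) in enumerate(sorted(self.assignments.items())):
            if tracker in dead:
                moved[task_id] = alive[i % len(alive)]
        self.assignments.update(moved)
        return moved
```

For example, if `tt1` last reported at time 0 and the timeout is 10 seconds, a check at time 25 treats `tt1` as crashed and hands its tasks to a TaskTracker that is still reporting.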
  • 63. Job Tracker [Diagram labels: User; DFS; Copy Input Files; Client; Submit job; Create Splits; Upload Job Info; Job.XML; Job.jar; Job Tracker; Submit Job; Get Input file info]
  • 64. Job Tracker (cont.) [Diagram labels: Job.XML; Job.jar; Client; DFS; Job Tracker; Submit job; Initialize job; Create Map & Reduce; Job Queue; M; R; S S S S S S; No. of maps = input splits; Read Files]
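The rule "number of maps = number of input splits" can be made concrete with a little arithmetic. The sketch below assumes each file is cut into fixed-size splits equal to the 64 MB block size mentioned earlier in the deck; `count_map_tasks` is a hypothetical helper, and real input formats apply additional rules that are ignored here.

```python
MB = 1024 * 1024


def count_map_tasks(file_sizes_bytes, split_size_bytes=64 * MB):
    """Estimate map tasks as one per input split (simplified assumption)."""
    total = 0
    for size in file_sizes_bytes:
        if size > 0:
            # Ceiling division: a partial trailing split still needs a map task.
            total += (size + split_size_bytes - 1) // split_size_bytes
    return total


# A 200 MB file yields 4 splits and a 64 MB file yields 1, so 5 map tasks.
print(count_map_tasks([200 * MB, 64 * MB]))
```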
  • 65. Job Tracker (cont.) [Diagram labels: Job Tracker; Task Tracker; Picks Task; Heart Beat; Job Queue; Assign Task; Job Queue]
  • 67. Job Tracker (cont.) [Diagram labels: Task Tracker; Job Tracker; Read from local Disk; DFS; Assign Task; Job.xml; Job.jar]
  • 68. Heartbeats  DataNodes send heartbeats to the NameNode  The NameNode uses heartbeats to detect DataNode failure
  • 69. Replication Engine  NameNode detects DataNode failures  Chooses new DataNodes for new replicas  Balances disk usage  Balances communication traffic to DataNodes
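A minimal sketch of what the replication engine has to decide: after a DataNode is declared dead, find blocks with too few live replicas and pick new targets, preferring the least-used disks. The function `plan_rereplication` and its parameters are hypothetical, and the "least-used node first" rule is a simplifying assumption standing in for the balancing goals on the slide.

```python
def plan_rereplication(block_locations, live_nodes, used_bytes,
                       replication=3, block_size=1):
    """Return {block_id: [new target nodes]} for under-replicated blocks.

    block_locations: block ID -> set of DataNodes believed to hold it.
    live_nodes:      set of DataNodes still sending heartbeats.
    used_bytes:      DataNode -> bytes currently stored (for balancing).
    """
    usage = dict(used_bytes)  # local copy, updated as targets are chosen
    plan = {}
    for block_id, holders in sorted(block_locations.items()):
        surviving = [n for n in holders if n in live_nodes]
        missing = replication - len(surviving)
        if missing <= 0 or not surviving:
            continue  # fully replicated, or no live copy left to copy from
        # Prefer the emptiest live nodes that do not already hold the block.
        candidates = sorted(
            (n for n in live_nodes if n not in surviving),
            key=lambda n: (usage.get(n, 0), n),
        )
        chosen = candidates[:missing]
        for node in chosen:
            usage[node] = usage.get(node, 0) + block_size
        plan[block_id] = chosen
    return plan
```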
  • 70. Data Pipeline & Write Anatomy [Diagram labels: HDFS Client; Add Block; Name Node; Data Node; Data Node; Data Node; Write; Ack; Complete]
  • 72. Data Pipelining  Client retrieves a list of DataNodes on which to place replicas of a block  Client writes the block to the first DataNode  The first DataNode forwards the data to the next DataNode in the pipeline  When all replicas are written, the client moves on to write the next block in the file
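The four steps above can be mimicked with a toy pipeline in which each node stores the block and forwards it to the next node, and the acknowledgement travels back through the return value. `DataNodeSketch`, `write_file`, and `choose_pipeline` are hypothetical names; forwarding whole blocks in one call is a simplification made for this sketch.

```python
class DataNodeSketch:
    """Toy DataNode that stores a block and forwards it down the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def write_block(self, block_id, data, downstream):
        self.blocks[block_id] = data
        if downstream:
            # Forward to the next DataNode; its ack becomes our ack.
            next_node, rest = downstream[0], downstream[1:]
            return next_node.write_block(block_id, data, rest)
        return True  # last node in the pipeline acknowledges the write


def write_file(blocks, choose_pipeline):
    """Write blocks one at a time, moving on only after the pipeline acks."""
    for block_id, data in enumerate(blocks):
        pipeline = choose_pipeline(block_id)  # list of DataNodes for this block
        first, rest = pipeline[0], pipeline[1:]
        if not first.write_block(block_id, data, rest):
            raise IOError(f"pipeline write failed for block {block_id}")
```

The client only talks to the first node in each pipeline; the remaining copies are made node to node, which keeps the client's outgoing traffic at one copy per block.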
  • 73. Read Anatomy [Diagram labels: HDFS Client; Get Block; Name Node; Data Node; Data Node; Data Node; Read; Read]
  • 74. Data Correctness  Use checksums to validate data – Use CRC32  File creation – Client computes a checksum per 512 bytes – DataNode stores the checksums  File access – Client retrieves the data and checksums from the DataNode – If validation fails, the client tries other replicas
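The checksum scheme on this slide can be sketched with Python's standard `zlib.crc32`: one CRC32 value per 512-byte chunk at write time, and a comparison at read time with fallback to another replica. The helper names (`compute_checksums`, `verify`, `read_with_fallback`) are hypothetical and the replicas are plain byte strings here; this illustrates the idea rather than the HDFS client's actual code.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum, as on the slide


def compute_checksums(data, chunk_size=CHUNK_SIZE):
    """File creation: one CRC32 per chunk of the data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def verify(data, checksums, chunk_size=CHUNK_SIZE):
    """File access: recompute the checksums and compare with the stored ones."""
    return compute_checksums(data, chunk_size) == checksums


def read_with_fallback(replicas, checksums):
    """Return the first replica that validates; try the others on failure."""
    for data in replicas:
        if verify(data, checksums):
            return data
    raise IOError("all replicas failed checksum validation")
```

Checksumming small chunks rather than whole blocks means a reader can validate just the part of a block it actually reads.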