Learn Hadoop and Big Data analytics. Join Design Pathshala's training programs on Big Data and analytics.
This slide deck covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
2. Course Details
The Motivation for Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Common MapReduce Algorithms
PIG Concepts
Hive Concepts
Working with Sqoop
Working with Flume
OOZIE Concepts
HUE Concepts
Reporting Tools
Project
www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com |
http://designpathshala.com
3. Apache Hadoop
The Motivation for Hadoop
Design Pathshala
April 22, 2014
4. Apache Hadoop Bigdata
Training By Design Pathshala
Contact us on: admin@designpathshala.com
Or Call us at: +91 120 260 5512 or +91 98 188 23045
Visit us at: http://designpathshala.com
5. Design Pathshala
Every one of our courses is written by experts in their respective fields.
We try our best to help you connect real-life examples with real business practices.
Learn and apply to work or your own business.
We provide online classes on different subjects, including Oracle HRMS, PeopleSoft HRMS and Java.
We offer both weekday and weekend classes.
6. Where does data come from?
9. Volume: Amount of data
~3 ZB of data exist in the digital universe today.
>300 TB of data in the U.S. Library of Congress.
Facebook has 30+ PB.
~2.5 PB of data in a DWH.
10+ PB DWH size.
10. Velocity: How rapidly data is growing
48 hours of new video every minute
571 new websites every minute
500+ TB to Facebook
175 million tweets every day
1+ million customer transactions every hour
Data production will be 44 times greater in 2020 than it was in 2009.
11. Variety: Different forms of data
Structured
• Traditional databases
• Numeric data
Semi-structured
• JSON
• XML
Unstructured
• Text documents
• Email
• Video
• Audio
• Machine generated
13. How companies are making money from Big Data
Predict exactly what customers want before they ask for it
Marketing Campaign
Improve customer service
Fraud Detection
Get customers excited about their own data
Identify customer pain points and solve them
Reduce health care costs and improve treatment
Social Graph Analysis & Sentiment Analysis
Research and development
14. How some big companies use data for different business analyses
27. Who uses Hadoop?
42,000 nodes as of July 2011
4,100 nodes
1,400 nodes
28. What is Hadoop
Hadoop is a framework for distributed processing of large datasets across
large clusters of commodity computers using a simple programming model.
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model; any data will fit
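The MapReduce programming model named on this slide can be illustrated with a minimal, single-process word count. This is a sketch in plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "big clusters"])
```

In a real cluster the map and reduce calls would run in parallel on different nodes; here they run sequentially so the data flow is easy to follow.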
30. What makes it especially useful
Scalable: It can reliably store and process petabytes.
Economical: It distributes the data and processing across clusters of commonly available
computers (in thousands).
Efficient: By distributing the data, it can process it in parallel on the nodes where the
data is located.
Reliable: It automatically maintains multiple copies of data and automatically redeploys
computing tasks based on failures.
31. Hadoop: Assumptions
Hardware will fail.
Applications need a write-once-read-many access model.
Data transfer and I/O are the bottleneck
Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Move logic rather than data
32. HDFS Architecture
[Diagram: Client, NameNode (metadata), Secondary NameNode, DataNodes]
NameNode: Contains information about the data (metadata)
DataNode: Contains the physical data
Secondary NameNode: Periodically reads metadata from the NameNode to build checkpoints
33. Distributed File System
Single Namespace for entire cluster
Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
Files are broken up into blocks
– Typically 64 MB block size (the Hadoop 1.x default; later versions default to 128 MB)
– Each block replicated on multiple DataNodes
Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
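The block arithmetic behind this slide can be worked through in a few lines. The sketch below assumes the 64 MB block size quoted above and a replication factor of 3; the function names are hypothetical and nothing here calls HDFS.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # assumed replication factor

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed; the last block may be only partly filled."""
    return math.ceil(file_size_mb / block_size_mb)

def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Disk space consumed across the cluster once every block is replicated."""
    return file_size_mb * replication

blocks = block_count(1000)      # a 1000 MB file
storage = raw_storage_mb(1000)  # total MB stored across all replicas
```

For a 1000 MB file this gives 16 blocks (15 full blocks plus one partial block) and 3000 MB of raw storage.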
39. Apache Hadoop and the Hadoop Ecosystem
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
41. Apache Hadoop and the Hadoop Ecosystem
Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a
query language based on SQL, which the runtime engine translates into
MapReduce jobs, for querying the data.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie
workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
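To show what "a DAG of actions" means in practice, here is a small sketch that orders dependent actions with Python's standard `graphlib` module (Python 3.9 or later). It does not use Oozie, whose workflows are defined in XML; the action names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_with_sqoop": set(),
    "clean_with_pig": {"import_with_sqoop"},
    "aggregate_with_hive": {"clean_with_pig"},
    "export_report": {"aggregate_with_hive"},
}

def execution_order(dag):
    """Return one valid order in which the actions can run."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
```

Because the graph is acyclic, every action runs only after the actions it depends on; a cycle would make `static_order` raise an error instead of returning an order.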
42. Apache Hadoop and the Hadoop Ecosystem
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
Flume
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
Storm
Apache Storm is a free and open source distributed realtime computation system.
Storm makes it easy to reliably process unbounded streams of data, doing for
realtime processing what Hadoop did for batch processing.
43. Apache Hadoop and the Hadoop Ecosystem
Spark
Apache Spark™ is a fast and general engine for large-scale data processing.
Drill
Apache Drill provides direct queries on self-describing and semi-structured data in
files (such as JSON, Parquet) and HBase tables without needing to specify metadata
definitions in a centralized store such as Hive metastore.
Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
51. Which Hadoop Distribution?
Type: Pureplay (Apache/Open Source)
Hortonworks
– Pros: 100% open source version; integration/services focused; extensive partnership network
– Cons: Slower interactive queries
Cloudera
– Pros: Widely used distribution; faster interactive queries; extensive tooling
– Cons: Proprietary extensions like Impala; commercial version only
MapR
– Pros: Enterprise and production-ready focused; works with NFS and native Unix commands
– Cons: Less focused on using new Hadoop features such as YARN
Type: Proprietary
PivotalHD
– Pros: Faster interactive query support with Greenplum; integrates with CloudFoundry PaaS platform
– Cons: Proprietary extensions; not easy to decouple
IBM
– Pros: Offers open source without branch version; integrated with PaaS and IBM tools
– Cons: Limited releases; expensive; may not be easy to decouple
52. [Diagram: File F is split into 64 MB blocks 1 to 5; replicas of data blocks 1, 2, and 3 are spread across Disks 1 to 12 in Rack 1, Rack 2, and Rack 3]
53. Block Placement
Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
Clients read from nearest replica
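The placement strategy listed above can be sketched as a small function. This is an illustration only, not HDFS code: the function name, the `topology` dictionary, and the node names are hypothetical, and the sketch assumes at least two racks with at least two nodes in the chosen remote rack.

```python
import random

def place_replicas(local_node, topology, replication=3, rng=random):
    """Pick nodes for one block's replicas.

    topology maps a rack name to the list of node names in that rack.
    Assumes at least two racks and two or more nodes in the remote rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[local_node]
    chosen = [local_node]                          # first replica: local node
    if replication >= 2:
        remote_rack = rng.choice([r for r in topology if r != local_rack])
        remote_nodes = list(topology[remote_rack])
        second = rng.choice(remote_nodes)
        chosen.append(second)                      # second replica: a remote rack
        if replication >= 3:
            others = [n for n in remote_nodes if n != second]
            chosen.append(rng.choice(others))      # third replica: same remote rack
    # Additional replicas: random nodes that do not hold a copy yet.
    remaining = [n for n in rack_of if n not in chosen]
    while len(chosen) < replication and remaining:
        extra = rng.choice(remaining)
        remaining.remove(extra)
        chosen.append(extra)
    return chosen

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas("n1", topology)
```

Keeping the second and third replicas on one remote rack means a block survives the loss of a whole rack while only one copy has to cross between racks during the write.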
55. Main Properties of HDFS
Large: An HDFS instance may consist of thousands of server machines, each
storing part of the file system’s data
Replication: Each data block is replicated many times (default is 3)
Failure: Failure is the norm rather than exception
Fault Tolerance: Detection of faults and quick, automatic recovery from
them is a core architectural goal of HDFS
DataNodes send heartbeats to the NameNode
57. NameNode Metadata
Meta-data in Memory
Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
A Transaction Log
– Records file creations, file deletions, etc.
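The metadata types listed above can be pictured as a few in-memory dictionaries plus an append-only log. The sketch below is an assumption-laden illustration, not the NameNode's real data structures; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FileEntry:
    replication: int                             # file attribute: replication factor
    creation_time: float                         # file attribute: creation time
    blocks: list = field(default_factory=list)   # ordered block IDs of the file

@dataclass
class NameNodeMetadata:
    files: dict = field(default_factory=dict)            # path -> FileEntry
    block_locations: dict = field(default_factory=dict)  # block ID -> set of DataNodes
    transaction_log: list = field(default_factory=list)  # append-only change records

    def create_file(self, path, replication, creation_time):
        self.files[path] = FileEntry(replication, creation_time)
        self.transaction_log.append(("create", path))

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def delete_file(self, path):
        # Drop the file and forget where its blocks were stored.
        for block_id in self.files.pop(path).blocks:
            self.block_locations.pop(block_id, None)
        self.transaction_log.append(("delete", path))

meta = NameNodeMetadata()
meta.create_file("/logs/a.txt", replication=3, creation_time=0.0)
meta.add_block("/logs/a.txt", "blk_1", ["dn1", "dn2", "dn3"])
```

Holding these maps in memory makes lookups fast, while the log gives a record of creations and deletions that can be replayed.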
58. DataNode
A Block Server
– Stores data in the local file system
– Stores meta-data of a block
– Serves data to Clients
Block Report
– Periodically sends a report of all existing blocks to the
NameNode
Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
60. Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave shared-nothing architecture
Master node (single node)
Many slave nodes
61. JobTracker
The master node runs a JobTracker instance, which accepts job requests from
clients
There is only one JobTracker daemon running per Hadoop cluster
Determines the execution plan by deciding which files to process
Assigns nodes to different tasks
Monitors all tasks as they are running
62. TaskTracker
Manages execution of individual tasks on each data node
One TaskTracker per data node
Each TaskTracker can spawn multiple JVMs to handle many map or reduce
tasks in parallel
The TaskTracker constantly communicates with the JobTracker
If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of
time, it assumes the TaskTracker has crashed. In that case, the JobTracker
resubmits the task to another TaskTracker.
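The heartbeat rule described above can be sketched with a small class. This is not the JobTracker implementation: the class name, the timeout value, and the choice of the first live tracker as the new owner are all assumptions made for illustration.

```python
class JobTrackerSketch:
    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}   # TaskTracker name -> time of last heartbeat
        self.assignments = {}      # task ID -> TaskTracker name

    def heartbeat(self, tracker, now):
        """Record that a TaskTracker reported in at time `now`."""
        self.last_heartbeat[tracker] = now

    def assign(self, task_id, tracker):
        self.assignments[task_id] = tracker

    def reassign_failed(self, now):
        """Move tasks away from trackers whose heartbeat is overdue."""
        dead = {t for t, seen in self.last_heartbeat.items()
                if now - seen > self.timeout_seconds}
        alive = sorted(set(self.last_heartbeat) - dead)
        moved = {}
        for task_id, tracker in self.assignments.items():
            if tracker in dead and alive:
                moved[task_id] = alive[0]   # simplest possible choice of new owner
        self.assignments.update(moved)
        return moved

jt = JobTrackerSketch(timeout_seconds=10)
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.assign("task-1", "tt1")
jt.heartbeat("tt2", now=12)          # tt1 stays silent
moved = jt.reassign_failed(now=15)   # task-1 moves from tt1 to tt2
```

Times are passed in explicitly rather than read from a clock so the behavior is easy to reason about and test.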
69. Replication Engine
NameNode detects DataNode failures
Chooses new DataNodes for new replicas
Balances disk usage
Balances communication traffic to DataNodes
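A minimal sketch of the re-replication idea follows. It covers only the first three points (detecting lost replicas, choosing new DataNodes, and balancing disk usage) and does not model network traffic. The function name and the "least-used node first" rule are assumptions for illustration, not the NameNode's actual policy.

```python
def rereplicate(block_locations, disk_usage, failed_node, replication=3):
    """Return {block_id: [new nodes]} for blocks that lost a replica.

    block_locations: block ID -> set of DataNodes holding a replica (updated in place)
    disk_usage: DataNode -> number of blocks stored (updated in place)
    """
    disk_usage.pop(failed_node, None)        # the failed node is no longer a candidate
    plan = {}
    for block_id in sorted(block_locations):
        nodes = block_locations[block_id]
        nodes.discard(failed_node)
        while len(nodes) < replication:
            candidates = [n for n in disk_usage if n not in nodes]
            if not candidates:
                break                        # not enough live nodes left
            # Prefer the least-used node; break ties by name for determinism.
            target = min(candidates, key=lambda n: (disk_usage[n], n))
            nodes.add(target)
            disk_usage[target] += 1
            plan.setdefault(block_id, []).append(target)
    return plan

locations = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn1", "dn3", "dn4"}}
usage = {"dn1": 2, "dn2": 1, "dn3": 2, "dn4": 1, "dn5": 0}
plan = rereplicate(locations, usage, failed_node="dn1")
```

After the call, both blocks are back at three replicas and none of them refers to the failed node.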
70. Data Pipeline & Write Anatomy
[Diagram: HDFS Client, NameNode, and three DataNodes; arrows labeled Add Block, Write, Ack, and Complete]
72. Data Pipelining
Client retrieves a list of DataNodes on which to place
replicas of a block
Client writes block to the first DataNode
The first DataNode forwards the data to the next
DataNode in the Pipeline
When all replicas are written, the Client moves on to
write the next block in file
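The pipelining steps above can be sketched as follows. This is a simplified in-memory model, not the HDFS client: `storage` stands in for the DataNodes' disks, `choose_pipeline` stands in for the NameNode's answer, and writes never fail in this sketch.

```python
def write_block(block, pipeline, storage):
    """Send one block through a chain of DataNodes and return the acks."""
    if not pipeline:
        return []
    first, rest = pipeline[0], pipeline[1:]
    storage.setdefault(first, []).append(block)           # this DataNode stores its replica
    downstream_acks = write_block(block, rest, storage)   # then forwards to the next node
    return [first] + downstream_acks

def write_file(blocks, choose_pipeline, storage):
    """Write blocks one at a time; start the next only after all replicas are acked."""
    for block in blocks:
        pipeline = choose_pipeline(block)   # list of DataNodes for this block
        acks = write_block(block, pipeline, storage)
        assert len(acks) == len(pipeline)   # every replica acknowledged

storage = {}
write_file([b"aa", b"bb"], lambda block: ["dn1", "dn2", "dn3"], storage)
```

The client sends each block only once; the DataNodes pass it along the chain, so the client's outgoing bandwidth does not grow with the replication factor.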
73. Read Anatomy
[Diagram: HDFS Client, NameNode, and three DataNodes; arrows labeled Get Block and Read]
74. Data Correctness
Use Checksums to validate data
– Use CRC32
File Creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
File access
– Client retrieves the data and checksum from DataNode
– If validation fails, the client tries other replicas
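The checksum scheme above can be tried out with Python's standard `zlib.crc32`. This is a sketch of the idea rather than HDFS's on-disk checksum format; the function names are hypothetical, and replicas are modeled as plain byte strings.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum, as on the slide

def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC32 value per chunk of the data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def read_verified(replicas, expected_checksums):
    """Return the first replica whose checksums match; otherwise try the next one."""
    for data in replicas:
        if chunk_checksums(data) == expected_checksums:
            return data
    raise IOError("no replica passed checksum validation")

original = bytes(range(256)) * 5          # 1280 bytes, so three chunks
stored_checksums = chunk_checksums(original)
corrupt = b"\xff" + original[1:]          # first byte flipped
result = read_verified([corrupt, original], stored_checksums)
```

Checking per 512-byte chunk rather than per file means a reader can validate the part it fetched without reading the whole block.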