Hadoop Ecosystem

Lior Sidi
Sep 2016

Big data V’s
Volume
Velocity
Variety

What is Hadoop?
• Hadoop – open source implementation of MapReduce (MR)
• Runs MR jobs quickly and efficiently
Goal
Generate value from large datasets that cannot be analyzed using traditional technologies

Hadoop Concepts
Requirements
• Linear horizontal scalability
• Jobs run in isolation
• Simple programming model
Challenges and solutions
• Ch1: Data access bottleneck
• Sol: Store and process data on the same node
• Ch2: Distributed programming is difficult
• Sol: Use high-level language APIs

Hadoop Timeline
2003 Oct – Google File System paper released
2004 Dec – "MapReduce: Simplified Data Processing on Large Clusters" paper released
2006 Oct – Hadoop released (early 0.x versions; version 1.0 followed in Dec 2011)
2007 Oct – Yahoo Labs creates Pig
2008 Oct – Cloudera, a Hadoop distributor, is founded
2010 Sep – Hive and Pig graduate to top-level Apache projects
2011 Jan – ZooKeeper graduates
2013 Mar – YARN deployed at Yahoo
2014 Feb – Apache Spark becomes a top-level Apache project

Hadoop Ecosystem (layer diagram)
Layers: Storage, Resource Management, Processing, Coordination, Data Formats, Data Ingestion, Data Streaming, Analysis, Workflow, Search, Visualization, Cluster Management

Storage / HDFS
• “Hadoop Distributed File System”
• Design:
  • Write once, read many times pattern
  • Cheap (commodity) hardware
  • High-throughput streaming data access (not designed for low-latency access)
• Concepts:
  • Block – a file is split into 128 MB blocks (default size), each replicated 3 times (default)
  • NameNode (master) – one per cluster; holds the file system namespace and block metadata (single point of failure)
  • DataNode (worker) – one per node; stores and retrieves blocks
• Functions:
  • High availability – run a second (standby) NameNode
  • Block caching – a block is normally cached in only one DataNode's memory
  • Locality – rack-aware placement based on the network topology
  • File permissions – POSIX-like (r, w, x) with owner, group and mode for files and directories
  • Interfaces – HTTP (proxy or direct), Java API
  • Cluster balance – spread blocks evenly across the cluster
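
The block and replication figures above translate into simple arithmetic. The following is a minimal sketch, assuming the 128 MB block size and replication factor 3 from the slide; the function name `hdfs_footprint` is made up for this illustration and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size, as stated on the slide
REPLICATION = 3                 # replication factor, as stated on the slide


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw bytes stored across the cluster).

    The last block only occupies the bytes it actually holds, so raw
    storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, file_size_bytes * replication


one_gb = 1024 * 1024 * 1024
blocks, raw_bytes = hdfs_footprint(one_gb)
```

For a 1 GB file this yields 8 blocks and 3 GB of raw storage, which is why the replication factor matters when sizing a cluster.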

Figure: a file is distributed and accessed on HDFS. A client reaches the cluster through the HDFS proxy and the NameNode; the file's data is split into Block 1, Block 2 and Block 3, and each block is stored as three replicas on DataNodes spread over Rack 1 and Rack 2.

Resource Management / YARN
• “Yet Another Resource Negotiator”
• Manages and schedules cluster resources
• Daemons:
  • Resource Manager – one per cluster; manages resources across the cluster
  • Node Manager – one per node; launches and monitors containers
• Container – executes an application process
• Resource requests for containers:
  • Amount of compute resources (CPU and memory)
  • Locality (node/rack)
• Lifespan: one application per user job, or long-running applications shared by users
• Scheduling:
  • Allocate resources by policy: FIFO, Capacity (per organization), Fair
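
To make the FIFO policy concrete, here is a toy sketch of placing container requests on nodes in arrival order. It is not YARN code: the `Request` and `Node` classes, the `fifo_allocate` function and all resource numbers are hypothetical, chosen only to show how a large request at the head of the queue blocks smaller ones behind it.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app: str
    vcores: int
    memory_mb: int


@dataclass
class Node:
    name: str
    vcores: int
    memory_mb: int


def fifo_allocate(requests, nodes):
    """Place requests strictly in arrival order; stop at the first one that does not fit."""
    placements = []
    for req in requests:
        node = next(
            (n for n in nodes if n.vcores >= req.vcores and n.memory_mb >= req.memory_mb),
            None,
        )
        if node is None:
            break  # strict FIFO: later requests wait behind the blocked one
        node.vcores -= req.vcores
        node.memory_mb -= req.memory_mb
        placements.append((req.app, node.name))
    return placements


nodes = [Node("node-1", 4, 8192), Node("node-2", 2, 4096)]
queue = [Request("app-a", 2, 4096), Request("app-b", 4, 8192), Request("app-c", 1, 1024)]
placements = fifo_allocate(queue, nodes)
```

Here `app-c` would fit on either node but still waits behind `app-b`; Capacity and Fair scheduling exist to avoid that kind of starvation.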

Figure: job scheduling on top of a Hadoop cluster. A client on the client node submits a YARN application to the ResourceManager on the resource manager node; NodeManagers on three nodes launch containers (one master, two workers) and send heartbeats.

Processing / MapReduce
• Simplified, large-scale, automatically parallelized, fault-tolerant data processing
• Origin – Google paper, 2004
• Batch processing
• Hadoop MR (classic MapReduce 1 daemons):
  • JobTracker – one per cluster; master process that schedules tasks on workers and monitors progress
  • TaskTracker – one per worker; executes map/reduce tasks locally and reports progress

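Processing / MapReduce
Example: counting names
Data (one line per split):
  Lior Ron Lior
  Ron Ron Andrey
  Lior Andrey Lior
Map (emit one (Name, Count) pair per name):
  (Lior, 1) (Ron, 1) (Lior, 1)
  (Ron, 1) (Ron, 1) (Andrey, 1)
  (Lior, 1) (Andrey, 1) (Lior, 1)
Shuffle & Sort: group the pairs by name
Reduce (sum the counts per name):
  (Lior, 4) (Ron, 3) (Andrey, 2)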
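
The name-count example can be sketched in a few lines of plain Python. This is a single-process illustration of the three phases, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (name, 1) pair for every name in one input split."""
    return [(name, 1) for name in line.split()]


def shuffle(pairs):
    """Shuffle & sort: group the counts by name, with names in sorted order."""
    groups = defaultdict(list)
    for name, count in pairs:
        groups[name].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: sum the counts collected for each name."""
    return {name: sum(counts) for name, counts in groups.items()}


splits = ["Lior Ron Lior", "Ron Ron Andrey", "Lior Andrey Lior"]
mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network; the logic per phase stays the same.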

Figure: MR job scheduling on top of a Hadoop cluster. A client on the client node submits an MR program to the ResourceManager; NodeManagers on three nodes launch containers, one hosting the JobTracker and two hosting TaskTrackers, and send heartbeats.

Storage / HBase
• Distributed column-oriented database on top of HDFS
• Real-time random read/write access to large datasets
• Region – a table is split by row into regions
• Phoenix – SQL on HBase
HBase data model
Each row is identified by a RowKey. Columns are grouped into column families (Column Family 1 holds Col 1.1, Col 1.2 and Col 1.3; Column Family 2 holds its own columns), and each column stores its data per version.
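
The data model can be pictured as nested maps: row key, then column family, then column, then version. The sketch below is an in-memory illustration only, not the HBase client API; `put`, `get_latest` and the sample values are assumptions made for this example.

```python
# row key -> column family -> column -> {version: data}
table = {}


def put(row_key, family, column, version, data):
    """Store one versioned cell value."""
    cell = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(column, {})
    cell[version] = data


def get_latest(row_key, family, column):
    """Return the data stored under the highest version of a cell."""
    versions = table[row_key][family][column]
    return versions[max(versions)]


put("row-1", "family-1", "col-1.1", 1, "old value")
put("row-1", "family-1", "col-1.1", 2, "new value")
put("row-1", "family-1", "col-1.2", 1, "other value")
latest = get_latest("row-1", "family-1", "col-1.1")
```

Older versions stay in the cell until they are removed, which is what the Version/Data pairs in the data model represent.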

Coordination / ZooKeeper
• Hadoop’s distributed coordination service
• Coordinates read/write actions on shared data
• Highly available, filesystem-like namespace
• Implementation:
  • Data model:
    • Tree built from znodes (up to 1 MB of data each)
    • Znode – holds data, tracks changes, and has an ACL (access control list)
  • Leader – performs writes and broadcasts the update
  • Follower – passes write requests to the leader; updates are atomic
• Lock service
• User groups
• Replicated mode
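
A lock service can be built on the znode tree because creating a path either succeeds or fails as a whole. The sketch below is a single-process toy, not the ZooKeeper client API: the `ZNodeTree` class, its methods and the `/locks/job-1` path are hypothetical, and the 1 MB limit is taken from the slide.

```python
class ZNodeTree:
    """Toy in-memory znode tree: path -> data."""

    MAX_DATA = 1024 * 1024  # 1 MB per znode, as stated on the slide

    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        """Create a znode; return False if it already exists."""
        if len(data) > self.MAX_DATA:
            raise ValueError("znode data exceeds 1 MB")
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent znode missing: {parent}")
        if path in self.nodes:
            return False
        self.nodes[path] = data
        return True

    def delete(self, path):
        self.nodes.pop(path, None)


tree = ZNodeTree()
tree.create("/locks")
first = tree.create("/locks/job-1", b"client-a")   # client-a takes the lock
second = tree.create("/locks/job-1", b"client-b")  # client-b is refused
tree.delete("/locks/job-1")                        # client-a releases the lock
third = tree.create("/locks/job-1", b"client-b")   # client-b now succeeds
```

In the real service the create request goes through the leader, so all replicas agree on which client won.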

Coordination / ZooKeeper
Figure: ZooKeeper coordination example. Inside the Hadoop cluster, the ZooKeeper service runs one leader and two followers; each holds the same znode tree (/, /HBase, /HDFS). HDFS (NameNode, DataNodes), HBase (HMaster, Regions) and other clients take locks through the service.

Row Based Avro
• Language-neutral data serialization system
• Share data across many programming languages
• Splittable and sortable – works well with MapReduce
• Rich schema resolution – flexible schemas
• Other row-based formats:
  • SequenceFile – logfile format
  • MapFile – sorted SequenceFile

Row Based Avro
File structure
  File: Header | Block 1 | Block 2 | ... | Block N
  Header: identifier | metadata (schema and codec) | sync marker
  Block: object count | size of objects | serialized objects | sync marker
Schema
  {
    "type": "record",
    "name": "Person",
    "fields": [
      {"name": "firstName", "type": "string", "order": "descending"},
      {"name": "age", "type": "int"},
      ...
    ]
  }
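
Schema resolution means data written with one schema can be read with another. The sketch below imitates the idea with the standard `json` module only; it is not the Avro library. The `resolve` function and the extra `email` field with its default are hypothetical additions for illustration.

```python
import json

# Reader schema: the Person record plus a hypothetical "email" field with a default.
READER_SCHEMA = json.loads("""
{"type": "record", "name": "Person", "fields": [
  {"name": "firstName", "type": "string"},
  {"name": "age", "type": "int"},
  {"name": "email", "type": "string", "default": ""}
]}
""")


def resolve(record, reader_schema):
    """Project a written record onto the reader schema.

    Fields the reader does not know are dropped; fields missing from the
    record are filled from the reader's default, or rejected if there is none.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field without default: {name}")
    return resolved


written = {"firstName": "Lior", "age": 30, "nickname": "L"}
person = resolve(written, READER_SCHEMA)
```

This is why adding a field with a default value is a safe schema change: older files remain readable.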

Parquet
• Columnar storage format
• Skip unneeded columns
• Fast queries and small file size
• Efficient storage of nested data
File structure
  File: Header | Block 1 | Block 2 | ... | Block N | Footer
  Header: magic number
  Block: column chunks, each made up of pages
  Footer: file metadata
Schema
  message Person {
    required binary name (UTF8);
    required int32 age;
    required group hobbies (LIST) {
      repeated binary array (UTF8);
    }
  }
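
The benefit of a columnar layout is easiest to see by pivoting a few records. The sketch below is plain Python, not the Parquet format or library; the `to_columns` function and the sample ages and hobbies are made up for illustration.

```python
rows = [
    {"name": "Lior", "age": 30, "hobbies": ["chess"]},
    {"name": "Ron", "age": 25, "hobbies": ["running", "music"]},
    {"name": "Andrey", "age": 41, "hobbies": []},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)
ages = columns["age"]  # a query on age touches only this column
average_age = sum(ages) / len(ages)
```

Because values of one column sit together, a query can skip the other columns entirely, and similar values compress well.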

Data Integration / Sqoop
• Imports and exports structured data
• Sqoop connectors – import from and export to relational databases (RDBMSs)
• Sqoop 1 – command-line tool
• Sqoop 2 – service
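Figure: Sqoop import and export. The Sqoop client reads the table metadata from the database and launches a MapReduce job on the Hadoop cluster. In the import job, Map tasks copy rows from the database table into HDFS files; in the export job, Map tasks write HDFS files back into the table.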
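
The import job runs several Map tasks in parallel, so the table has to be divided between them. One way to do that is to split the range of a numeric key into contiguous slices, one per Map task. The sketch below illustrates that idea only; `split_ranges` and the key values are hypothetical and this is not Sqoop's own code.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous slices, one per mapper."""
    total = hi - lo + 1
    size, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        length = size + (1 if i < extra else 0)
        if length == 0:
            continue  # more mappers than keys: leave the spare mappers idle
        ranges.append((start, start + length - 1))
        start += length
    return ranges


slices = split_ranges(1, 10, 4)
```

Each Map task would then read only the rows whose key falls in its slice and write them to its own HDFS file.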

Data Integration / Flume
• Event-based data ingestion into Hadoop
• Flume agent components:
  • Sources – spoolingDir (creates events from files), Avro (RPC), HTTP (requests)
  • Channel – buffers events between source and sink
  • Sinks – Avro, HDFS, HBase, Solr (near real time)
• Reliability – separate transactions for source-to-channel and channel-to-sink delivery
• Fan out – one source, many sinks
• Scale – agent tiers aggregate events from multiple sources
• Sink groups – failover and load balancing
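
The source, channel and sink roles and the fan-out pattern can be sketched with in-memory queues. This is a toy model, not Flume configuration or API: the `Channel` class, the `fan_out` and `drain` functions and the event strings are all made up for illustration.

```python
from collections import deque


class Channel:
    """Buffers events between a source and a sink."""

    def __init__(self):
        self.events = deque()

    def put(self, event):
        self.events.append(event)

    def take(self):
        return self.events.popleft() if self.events else None


def fan_out(events, channels):
    """Source side: copy every event into each channel (one source, many sinks)."""
    for event in events:
        for channel in channels:
            channel.put(event)


def drain(channel, sink):
    """Sink side: move all buffered events from the channel into the sink."""
    while (event := channel.take()) is not None:
        sink.append(event)


hdfs_channel, hbase_channel = Channel(), Channel()
hdfs_sink, hbase_sink = [], []
fan_out(["event-1", "event-2"], [hdfs_channel, hbase_channel])
drain(hdfs_channel, hdfs_sink)
drain(hbase_channel, hbase_sink)
```

Because each sink has its own channel, a slow sink delays only its own queue and not the other deliveries.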

Data Integration / Flume
Figure: Flume topologies. A Flume agent consists of Source, Channel and Sink. Fan out: one agent delivers data from the file system to several destinations (HDFS, HBase). Scale: agents are arranged in tiers (Tier 1, Tier 2, Tier 3) that aggregate data towards Hadoop. Sink grouping: several sinks are grouped behind one agent.

Data Integration / Kafka
• Distributed publish-subscribe messaging system
• Fast, scalable, durable
• Components:
  • Topics – categories (feeds) of messages
  • Producers – processes that publish messages to a topic
  • Consumers – processes that subscribe to topics
  • Brokers – Kafka servers in the cluster
• Distribution (per partition)
  • Leader – handles reads and writes
  • Followers – replicate the leader
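
The publish-subscribe model can be sketched as an append-only log per topic plus a read position per consumer. This is an in-memory toy, not the Kafka client API; the `Broker` class, its methods and the topic and consumer names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy broker: one append-only message log per topic, one offset per consumer."""

    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self.topics[topic].append(message)

    def poll(self, consumer, topic):
        """Consumer side: return the messages this consumer has not seen yet."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = Broker()
broker.publish("clicks", "click-1")
broker.publish("clicks", "click-2")
first_poll = broker.poll("consumer-a", "clicks")
second_poll = broker.poll("consumer-a", "clicks")
late_poll = broker.poll("consumer-b", "clicks")
```

Messages are not removed when read, so a consumer that subscribes later still sees the full log.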

Data Integration / Streaming
• Stream processing
• Kafka Streams – process and analyze data in Kafka
• Storm – real-time computation
• Spark Streaming – process live data; can apply Spark MLlib and GraphX
Figure: streaming pipeline. Data flows through Flume Agent 1 and Flume Agent 2 into Kafka (Topic A and Topic B), is processed by Spark Streaming and Storm, and lands in HDFS.

Computation / Spark
• Cluster computing framework
• In-memory processing
• Languages: Scala, Java and Python
• RDD – Resilient Distributed Dataset
  • Read-only collection spread across the cluster
  • Transformations are lazy: they are computed only when an action is called
• DAG engine – schedules many transformations into one optimized job
• SparkContext
  • Parallel jobs
  • Caching
  • Broadcast variables (data/functions)
• Cluster managers for executors:
  • Local, Standalone, Mesos, YARN
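
The split between transformations and actions can be illustrated without a cluster. The sketch below is a toy class, not the Spark API: `LazyDataset` and its methods only imitate the idea that `map` and `filter` record work and `collect` (the action) triggers it.

```python
class LazyDataset:
    """Toy read-only dataset: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def collect(self):
        """Action: apply all recorded transformations in one pass."""
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


dataset = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)
result = dataset.collect()
```

Nothing is computed while the pipeline is being defined, which is what lets an engine see the whole chain and plan it as one job.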

Computation / Spark
Figure: Spark on Hadoop. In the driver, the Spark program's SparkContext hands each job to the DAG scheduler, which splits it into stages; the task scheduler and scheduler backend then dispatch the tasks to the executors.

Scripting / Pig
• Data flow programming language – an abstraction over MapReduce
• Supports: user-defined functions (UDFs), streaming, nested data
• Does not support: random reads/writes
• Pig Latin – scripting language
  • Load, Store, Filter, Group, Join, Sort, Union, Split, UDF, Cogroup
• Modes
  • Local – small datasets
  • MR mode – runs on the cluster
• Execution – script, Grunt (shell), embedded (Java)
• Parameter substitution – run a script with different parameters
• Similar tools
  • Crunch – MR pipelines in Java (no UDF)

Query / Hive
• Components
  • Metastore – table descriptions
  • HiveQL – SQL dialect (SQL:2003)
• Table management
  • Managed tables in the warehouse directory
  • External tables
• Functionality
  • Bucketing and partitioning by column
  • Supports UDFs and UDAFs (user-defined aggregate functions)
• Insert, Update, Delete:
  • Changes are saved in delta files
  • Background MR jobs merge the delta files
  • Available only in a transaction context
• Table locks (e.g. to avoid dropping a table that is in use)
Query / Comparison
                 Hive         Impala                           Spark SQL (Shark)
Usage            Batch        BI & SQL analytics               Procedural development
Speed            Bad          Best                             OK
Implementation   MapReduce    Dedicated daemons on DataNodes   In memory
Similar tools: Hive on Spark; Presto, Drill (SQL:2011)

Workflow / Oozie
• Schedules Hadoop jobs
• Job types:
  • Workflows – sequences of jobs defined as directed acyclic graphs (DAGs)
  • Coordinators – trigger jobs by time or data availability
Figure: example Oozie workflow. Control-flow nodes: start, Fork, Join, End. Action nodes: Sqoop, Pig, MR, sub-workflow, FS (HDFS), Email.
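
A workflow DAG determines a valid execution order for its jobs. The sketch below uses Python's standard `graphlib` module (Python 3.9+) rather than Oozie itself, and the wiring between the nodes is hypothetical: the node names follow the example workflow, but the exact edges are an assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical wiring: each node maps to the set of nodes it depends on.
workflow = {
    "sqoop": {"start"},
    "fork": {"sqoop"},
    "pig": {"fork"},
    "mr": {"fork"},
    "join": {"pig", "mr"},
    "end": {"join"},
}

# static_order() yields the nodes so that every dependency comes first.
order = list(TopologicalSorter(workflow).static_order())
```

The two branches after the fork (`pig` and `mr`) have no mutual dependency, so a scheduler is free to run them in parallel before the join.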

Search / Solr
• Full-text search over Hadoop
• Near-real-time indexing
• REST API
• Based on the Apache Lucene Java search library

Visualization / Hue
• Open source web interface for analyzing data with Hadoop
• Applications:
  • File browser: HDFS, HBase
  • Scheduling of jobs and workflows: Oozie
  • Job browser: YARN
  • SQL: Hive, Impala
  • Data analysis: Pig, UDF
  • Dynamic search: Solr
  • Notebooks: Spark
  • Data transfer: Sqoop 2

Cluster Management / Cloudera
• 100% open source Hadoop distribution
• Presented by Cloudera as the most complete and tested distribution of Hadoop
• Integrates the Hadoop ecosystem projects
• Express – free, end-to-end administration
• Enterprise – extra features and support

Cluster Management / Comparison
https://talendexpert.com/cloudera-vs-honworks-vs-mapr

Figure: basic cluster configuration. Three master nodes host the NameNode, Secondary NameNode, Resource Manager, standby Resource Manager and ZooKeeper. Other servers host Cloudera Manager, the Hive, Sqoop and Spark gateways (GW), and Impala State. Each worker node runs a NodeManager, a DataNode and an Impala Daemon.
