With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will look at Core Hadoop, HDFS and YARN, and address the emerging question: will Hadoop 3.0 be an architectural revolution, like Hadoop 2 was with YARN & Co., or more of an evolution adapting to new use cases such as IoT, Machine Learning and Deep Learning (TensorFlow)?
3. Some Hadoop history
• 2006 - Hadoop 1: "Let there be batch!" (the era of Silicon Valley Hadoop)
  • HDFS: redundant, reliable storage
  • MapReduce: cluster resource management + data processing
• Oct. 2013 - Hadoop 2: "Let there be YARN apps!" (the era of Enterprise Hadoop)
  • HDFS: redundant, reliable storage
  • YARN: cluster resource management
  • MapReduce: data processing
  • Hive (SQL), Spark (in-memory), …
• Late 2017 - Hadoop 3: "Let there be …?" (the era of ?)
  • Emerging workloads: IoT, Machine Learning, Deep Learning (TensorFlow), Data Science, Artificial Intelligence, Streaming Data (Kafka), Cloud, GPUs, FPGAs
4. Why Hadoop 3.0?
• Deprecated APIs can only be removed in a major release
• Wire-compatibility will be broken
  • Change of default ports
  • A Hadoop 2.x client will not work against a Hadoop 3.x server (and vice versa)
• Hadoop command scripts rewrite
• Big features that need a major release to stabilize
5. What is Hadoop 3.0?
• Release timeline 2009-2017 (source: Akira Ajisaka, with additions by Uwe Printz)
  • branch-1 (from branch-0.20): 0.20.1, 0.20.205 (Security), 1.0.0, 1.1.0, 1.2.1 (stable) - Hadoop 1 is EOL
  • 0.21.0 (new append), 0.22.0, branch-0.23: 0.23.0 … 0.23.11 (final)
  • branch-2 (Hadoop 2): 2.0.0-alpha, 2.1.0-beta, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
    • Features added along the way: NameNode Federation, YARN, NameNode HA, HDFS Snapshots, NFSv3 support, HDFS ACLs, HDFS Rolling Upgrades, RM Automatic Failover, Heterogeneous Storage, HDFS In-Memory Caching, HDFS Extended Attributes, YARN Rolling Upgrades, Transparent Encryption, Archival Storage, Drop of JDK6 Support, File Truncate API, Docker Container in Linux, ATS 1.5
  • trunk (Hadoop 3): 3.0.0-alpha1, 3.0.0-alpha2
• Hadoop 2 and 3 diverged 5+ years ago
6. Hadoop 3.0 in a nutshell
• HDFS
• Erasure codes
• Low-Level Performance enhancements with Intel ISA-L
• 2+ NameNodes
• Intra-DataNode Balancer
• YARN
• Better support for long-running services
• Improved isolation & Docker support
• Scheduler enhancements
• Application Timeline Service v2
• New UI
• MapReduce
• Task-level native optimization
• Derive heap-size automatically
• DevOps
• Drop JDK7 & Move to JDK8
• Change of default ports
• Library & Dependency Upgrade
• Client-side classpath Isolation
• Shell Script Rewrite & ShellDoc
• .hadooprc & .hadoop-env
• Metrics2 Sink plugin for Kafka
8. HDFS - Current implementation
• 3 replicas by default
• Tolerates a maximum of 2 failures
• Simple, scalable & robust
• 200% space overhead
• Write path (replication pipeline, see the sketch below):
  • The HDFS client issues a write request and receives a lease for the file from the NameNode
  • The client splits the file into blocks and requests data nodes; the NameNode returns a list of data nodes
  • The client calculates the checksum and writes block + checksum to DataNode 1, which forwards them along a write pipeline to DataNode 2 and on to DataNode 3
  • ACKs travel back through the pipeline to the client; once all ACKs have arrived, the write is complete
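To make the replication write path concrete, here is a minimal Python sketch (illustrative only, not Hadoop code; the node names and the CRC32 checksum choice are assumptions): the client computes a checksum, the block is forwarded along a pipeline of three nodes, and ACKs travel back.

import zlib

class DataNode:
    """Toy stand-in for a DataNode: stores a block and forwards it downstream."""
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.storage = {}

    def write_block(self, block_id, data, checksum):
        # Verify the checksum before storing, as a real DataNode would.
        assert zlib.crc32(data) == checksum, "checksum mismatch"
        self.storage[block_id] = (data, checksum)
        # Forward along the write pipeline, then contribute to the ACK chain.
        acks = self.downstream.write_block(block_id, data, checksum) if self.downstream else []
        return acks + [f"ACK from {self.name}"]

# Pipeline of three replicas: DN1 -> DN2 -> DN3 (the target list comes from the NameNode).
dn3 = DataNode("DataNode 3")
dn2 = DataNode("DataNode 2", downstream=dn3)
dn1 = DataNode("DataNode 1", downstream=dn2)

block = b"some 128 MB block, abbreviated"
acks = dn1.write_block("blk_0001", block, zlib.crc32(block))
print(acks)   # ACKs arrive in reverse pipeline order: DN3, DN2, DN1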
9. Erasure Coding (EC)
• k data blocks + m parity blocks
• Example: Reed-Solomon (6,3)
  • The raw data is split into stripes of 6 data blocks each
  • Encoding adds 3 parity blocks per stripe
  • Data and parity blocks are stored
• Key Points
  • XOR coding —> saves space, slower recovery (see the sketch below)
  • Missing or corrupt data will be restored from the available data and parity
  • Parity can be smaller than the data
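As a hedged illustration of the parity idea (plain XOR with a single parity block, not the Reed-Solomon math HDFS actually uses), this self-contained Python sketch shows a lost data block being reconstructed from the surviving blocks plus parity:

# Toy XOR-based erasure coding: one parity block protects k data blocks
# against the loss of any single block. (Hadoop's Reed-Solomon codecs
# generalize this idea to m parity blocks tolerating m failures.)

def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE", b"FFFF"]  # k = 6
parity_block = xor_blocks(data_blocks)                                # m = 1 here

# Simulate the loss of block 2 and reconstruct it from the survivors + parity.
surviving = data_blocks[:2] + data_blocks[3:]
reconstructed = xor_blocks(surviving + [parity_block])
assert reconstructed == data_blocks[2]
print("reconstructed:", reconstructed)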
10. EC - Main characteristics

                          Replication    Replication    Reed-Solomon    Reed-Solomon
                          (Factor 1)     (Factor 3)     (6,3)           (10,4)
Maximum fault tolerance   0              2              3               4
Space efficiency          100 %          33 %           67 %            71 %
Data locality             Yes            Yes            No (Phase 1) / Yes (Phase 2)
Write performance         High           High           Low             Low
Read performance          High           High           Medium          Medium
Recovery costs            Low            Low            High            High

• Pluggable implementation; Reed-Solomon (6,3) is the first choice
• Storage tiers (EC serves the coldest tier):

Storage tier     Hot            Warm         Cold           Frozen
Media            Memory/SSD     Disk         Dense Disk     EC
Access pattern   20 x Day       5 x Week     5 x Month      2 x Year
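The fault-tolerance and space-efficiency figures in the table above follow directly from the coding parameters: replication with factor r tolerates r-1 failures at 1/r space efficiency, while RS(k,m) tolerates m failures at k/(k+m) efficiency. A small, purely illustrative Python check:

def replication(r):
    """Replication with factor r: tolerates r-1 failures, stores r copies."""
    return {"max fault tolerance": r - 1, "space efficiency": 1 / r}

def reed_solomon(k, m):
    """RS(k,m): k data + m parity blocks, tolerates m failures."""
    return {"max fault tolerance": m, "space efficiency": k / (k + m)}

for name, stats in [("Replication (1)", replication(1)),
                    ("Replication (3)", replication(3)),
                    ("Reed-Solomon (6,3)", reed_solomon(6, 3)),
                    ("Reed-Solomon (10,4)", reed_solomon(10, 4))]:
    print(f"{name:20s} tolerance={stats['max fault tolerance']} "
          f"efficiency={stats['space efficiency']:.0%}")
# -> 0/100%, 2/33%, 3/67%, 4/71%, matching the table above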
11. EC - Contiguous blocks
• Approach 1: Retain the 128 MB block size and add parity
  • The 128 MB blocks 1-6 of files A, B and C stay on their data nodes; encoding adds 3 parity blocks on further data nodes
• Pro: Better for locality
• Con: Significant overhead for smaller files; always 3 parity blocks are needed
• Con: The client potentially needs to process GBs of data for encoding
12. EC - Striping
• Approach 2: Split blocks into smaller cells (1 MB)
  • The cells of files A, B and C are distributed round-robin across 6 data blocks (stripe 1, stripe 2, …, stripe n); encoding adds parity cells on 3 further data nodes (see the sketch below)
• Pro: Works for small files
• Pro: Allows parallel writes
• Con: No data locality -> increased read latency & a more complicated recovery process
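A minimal Python sketch of the striping idea (illustrative only; the 1 MB cell size and the 6+3 unit counts follow the slides, everything else is an assumption): a byte stream is cut into cells and assigned round-robin to 6 data units, and every complete stripe would additionally produce 3 parity cells (the parity math itself is omitted):

CELL_SIZE = 1024 * 1024   # 1 MB
DATA_UNITS, PARITY_UNITS = 6, 3

def stripe_layout(data: bytes):
    """Cut data into cells and place them round-robin across the data units."""
    cells = [data[i:i + CELL_SIZE] for i in range(0, len(data), CELL_SIZE)]
    layout = {f"data-{n}": [] for n in range(DATA_UNITS)}
    for index, cell in enumerate(cells):
        layout[f"data-{index % DATA_UNITS}"].append(cell)   # round-robin
    stripes = (len(cells) + DATA_UNITS - 1) // DATA_UNITS
    return layout, stripes

layout, stripes = stripe_layout(b"x" * (20 * CELL_SIZE))     # a 20 MB example write
print(stripes, "stripes;", {k: len(v) for k, v in layout.items()})
# Each complete stripe would also get PARITY_UNITS parity cells on 3 further nodes.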
14. EC - Shell Command
• Create an EC zone on an empty directory
• All files under a zone directory are automatically erasure coded
• Renames across zones with different EC schemas are disallowed
Usage: hdfs erasurecode [generic options]
[-getPolicy <path>]
[-help [cmd ...]]
[-listPolicies]
[-setPolicy [-p <policyName>] <path>]
-getPolicy <path> :
Get erasure coding policy information about at specified path
-listPolicies :
Get the list of erasure coding policies supported
-setPolicy [-p <policyName>] <path> :
Set a specified erasure coding policy to a directory
Options :
-p <policyName> erasure coding policy name to encode files. If not passed the
default policy will be used
<path> Path to a directory. Under this directory files will be
encoded using specified erasure coding policy
15. EC - Write Path
• Parallel write (see the sketch after this slide)
  • The client writes to 9 data nodes at the same time: 1 MB data cells to DataNodes 1-6 and parity cells to DataNodes 7-9, each acknowledged with an ACK
  • Parity is calculated at the client, at write time
• Durability
  • Reed-Solomon (6,3) can tolerate a maximum of 3 failures
• Visibility
  • Read is supported for files being written
• Consistency
  • A client can start reading from any 6 of the 9 data nodes
• Appendable
  • Files can be reopened for appending data
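In contrast to the replication pipeline sketched earlier, with EC the client itself writes to all 9 target nodes in parallel and collects their ACKs. A toy Python sketch of that fan-out (no real RPCs; node names and the send function are placeholders):

from concurrent.futures import ThreadPoolExecutor

def send_cell(node, cell):
    # Stand-in for the RPC that ships one 1 MB data or parity cell to a DataNode.
    return f"ACK from {node}"

nodes = [f"DataNode {i}" for i in range(1, 10)]   # 6 data + 3 parity targets
cells = [b"data"] * 6 + [b"parity"] * 3           # one stripe, abbreviated

with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
    acks = list(pool.map(send_cell, nodes, cells))
print(acks)   # one ACK per target node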
16. EC - Write Failure Handling
• Data node failure
  • The client ignores the failed data node and continues writing to the remaining ones
  • The write path is able to tolerate up to 3 data node failures
  • Requires at least 6 data nodes to keep writing
  • Missing blocks will be reconstructed later
17. EC - Read Path
• Read the data cells from the 6 data nodes holding data blocks in parallel
18. EC - Read Failure Handling
• Read data from 6 arbitrary data nodes in parallel
• If a data block is missing, read a parity block from one of the remaining data nodes and reconstruct the missing data block from it
19. EC - Network behavior
• Pro’s
• Low latency because of parallel read & write
• Good for small file sizes
• Con’s
• Requires high network bandwidth between client & server
• Dead data nodes result in high network traffic and reconstruction
time
20. EC - Coder implementation
• Legacy coder
• From Facebook’s HDFS-RAID project
• [Umbrella] HADOOP-11264
• Pure Java coder
• Code improvements over HDFS-RAID
• HADOOP-11542
• Intel ISA-L coder
• Native coder with Intel’s Intelligent Storage Acceleration Library
• Accelerates EC-related linear algebra calculations by exploiting advanced hardware
instruction sets like SSE, AVX, and AVX2
• HADOOP-11540
24. 2+ NameNodes (HDFS-6440)
• Hadoop 1
  • No built-in High Availability
  • Had to be solved yourself, e.g. via VMware
• Hadoop 2
  • High Availability out-of-the-box via an Active/Standby pattern: one active NameNode, one standby NameNode
  • The failed NameNode needs to be recovered immediately to stay highly available
• Hadoop 3
  • 1 active NameNode with N standby NameNodes, e.g. one active and two standbys (configured by listing additional NameNode IDs under dfs.ha.namenodes.<nameservice>)
  • Trade-off between operational costs and hardware costs
25. Intra-DataNode Balancer (HDFS-1312)
• Hadoop already has a Balancer that balances data between DataNodes
  • Needs to be called manually by design
  • Typically used after adding additional worker nodes
• The new Disk Balancer lets administrators rebalance data across the multiple disks of a single DataNode
  • Useful to correct the skewed data distribution often seen after adding or replacing disks
  • Adds an hdfs diskbalancer command (e.g. hdfs diskbalancer -plan <datanode> followed by -execute <planfile>) that submits a plan but does not wait for the plan to finish executing; the DataNode performs the moves itself
27. YARN - Scheduling enhancements
• Application priorities within a queue (YARN-1963)
• For example, within the Marketing queue: Hive jobs > MapReduce jobs
• Inter-Queue priorities (YARN-4945)
• Queue 1 > Queue 2, irrespective of demand & capacity
• Previously based only on unconsumed capacity
• Affinity / Anti-Affinity (YARN-1042)
• More fine-grained constraints on placement, e.g. do not allocate HBase RegionServers and Storm workers on the same host
• Global Scheduling (YARN-5139)
• Currently, YARN scheduling is done one node at a time, on the arrival of node heartbeats, which can lead to suboptimal decisions
• With global scheduling, the YARN scheduler looks at more nodes and selects the best ones based on the application's requirements, leading to globally better placement and higher container scheduling throughput
• Gang Scheduling (YARN-624)
• Allow allocation of sets of containers, e.g. 1 container with 128GB of RAM and 16 cores OR 100 containers with 2GB of
RAM and 1 core
• Can be achieved already by holding on to containers but might lead to deadlocks and decreased cluster utilization
28. YARN - Built-in support for long-running services
• Simplified and first-class support for services (YARN-4692)
  • Abstract common framework to support long-running services (similar to Apache Slider)
  • Simplified API for managing the service lifecycle of YARN apps
• Better support for long-running services
  • Recognition of long-running services (YARN-4725)
  • Auto-restart of containers
  • Containers of long-running services are retried on the same node if they hold local state
• Service/application upgrade support (YARN-4726)
  • Hold on to containers during an upgrade of the YARN app
• Dynamic container resizing (YARN-1197)
  • Ask only for minimum resources at start and adjust them at runtime
  • Currently the only way is releasing containers and allocating new containers with the expected size
29. YARN - Resource Isolation & Docker
• Better Resource Isolation
• Support for disk isolation (YARN-2619)
• Support for network isolation (YARN-2140)
• Uses cgroups to give containers their fair share
• Docker support in LinuxContainerExecutor (YARN-3611)
• The LinuxContainerExecutor already provides functionality around localization,
cgroups based resource management and isolation for CPU, network, disk, etc. as
well as security mechanisms
• Allows Docker containers to be run inside the LinuxContainerExecutor
• Offers packaging and resource isolation
• Complements YARN’s support for long-running services
30. YARN - Service Discovery
• Services can run on any YARN node
• Dynamic IP, can change in case of node failures, etc.
• YARN Service Discovery via DNS (YARN-4757)
• The YARN service registry already provides facilities for applications to register their
endpoints and for clients to discover them, but these are only available via a Java API and REST
• Expose service information via an already available discovery mechanism: DNS
• Current YARN Service Registry records need to be converted into DNS entries
• Discovery of the container IP and service port via standard DNS lookups
• Mapping of Applications, e.g.
zkapp1.griduser.yarncluster.com -> 172.17.0.2
• Mapping of containers, e.g.
container-e3741-1454001598828-0131-01000004.yarncluster.com -> 172.17.0.3
31. YARN - Use the force!
• MapReduce, Tez and Spark run as applications on top of YARN
33. Application Timeline Service v2 (YARN-2928)
Why? (motivated by the limitations of ATS v1)
• Scalability & performance
  • ATS v1 uses a single global writer/reader instance
  • Local-disk-based LevelDB storage
• Reliability
  • Failure handling depends on the local disk
  • Single point of failure
• Usability
  • Add configuration and metrics as first-class members
  • Better support for queries
• Flexibility
  • Make the data model more descriptive
Core concepts
• Distributed write path
  • Logical per-app collector
  • Separate reader instances
• Pluggable backend storage
  • HBase
• Enhanced internal data model
• Metrics aggregation
• Richer REST API for queries
35. Summary
• Major release, incompatible with Hadoop 2
• Main features are Erasure Coding and better support for long-running services & Docker
• Good fit for IoT and Deep Learning use cases
Release timeline
• 3.0.0-alpha1 - Sep/3/2016
• Alpha2 - Jan/25/2017
• Alpha3 - Q2 2017 (estimated)
• Beta/GA - Q3/Q4 2017 (estimated)