3. About Me
• Software Engineer @Hortonworks
– HDFS development lead
• PMC Member at Apache Hadoop
– Started working on Hadoop in 2007
– One of the most active contributors/committers of HDFS
• Used Hadoop to compute π at the two-quadrillionth (2×10^15th) bit
– It was a World Record.
• Received Ph.D. from the University of Maryland, College Park
– π = 3.141592654…
4. Previous Storage Model
• Datanode: a single logical storage consisting of
one or more physical storage media
Supporting Heterogeneous Storage 4
5. New Storage Model
• Datanode: a collection of storages,
each storage corresponding to a
physical storage medium.
6. Advantages of the Multi-Storage Model
• Heterogeneity
– Individual storages may have different characteristics: capacity, speed, cost, etc.
• Flexibility
– Clients can write to a particular storage type
• Scalability
– A datanode can support more storage drives (>100), since storage reports are sent individually
7. Storage Types
• ARCHIVE: archival drive
– slow but cheap
• DISK: hard disk drive
– default type
• SSD: solid state drive
– fast but expensive
• RAM_DISK: memory drive
– capacity is limited
– data can be lazily persisted to DISK
8. Block Storage Policy
• Describes how a data block is stored in HDFS
– When a new file/block is created, how should the data be stored?
– After a file/block is created, should its replicas be moved to other storage types?
– When there is not enough space in a particular storage type,
• should the operation fail?
• or fall back to another storage type?
9. Block Storage Policies
• HOT
– Store all replicas on DISK
– Fall back to ARCHIVE if necessary
– Default policy
• WARM
– Store one replica on DISK and the other replicas on ARCHIVE
– May fall back to DISK or ARCHIVE
– For data that needs to be archived soon
• COLD
– Store all replicas on ARCHIVE
– Do not fall back to any other storage type
– For archival data
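The replica placement implied by the HOT, WARM, and COLD policies can be sketched as a small helper. This is a hypothetical illustration of the mapping, not HDFS source code; the function name and structure are invented for clarity.

```python
# Illustrative sketch (not HDFS code) of how the HOT/WARM/COLD block
# storage policies map replicas to storage types.

def place_replicas(policy, replication=3):
    """Return the intended storage type for each replica under a policy."""
    if policy == "HOT":
        return ["DISK"] * replication             # may fall back to ARCHIVE
    if policy == "WARM":
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if policy == "COLD":
        return ["ARCHIVE"] * replication          # no fallback
    raise ValueError(f"unknown policy: {policy}")

print(place_replicas("WARM"))  # ['DISK', 'ARCHIVE', 'ARCHIVE']
```

The fallback behavior noted in the comments only applies when the preferred storage type has no space, per the policy definitions above.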
11. More Block Storage Policies
• ONE_SSD
– Store one replica on SSD and the other replicas on DISK
– May fall back to SSD or DISK
– For data that needs to be accessed locally after being written
• ALL_SSD
– Store all replicas on SSD
– Fall back to DISK if necessary
– For frequently accessed data
• LAZY_PERSIST
– Store only one replica, in memory
– The in-memory replica is lazily persisted to DISK later on
– For transient data
13. Enforcing Block Storage Requirements
• When a block is created,
– the block placement policy decides how to store its replicas.
• After blocks are stored in HDFS, datanodes or the network may fail, so that
– blocks may become under- or over-replicated
– blocks may fail to satisfy the rack requirement
– blocks may fail to satisfy the storage policy requirement
– the cluster may become imbalanced (some datanodes over-utilized while others are under-utilized)
HDFS needs to fix them!
14. Maintaining Block Storage Requirements
• Replication Monitor
– Copies/deletes replicas to satisfy the replication and rack requirements
– Does not move replicas to fulfill storage policy requirements or to balance the cluster
• Balancer
– Moves replicas around to balance the cluster
– Preserves the replication and rack requirements
– Preserves storage type: always moves a replica to another storage of the same type
• Mover (a new tool)
– Moves replicas around to satisfy storage policy requirements
– Preserves the replication and rack requirements
15. Feature Development
• HDFS-2832: Heterogeneous Storage
– 66 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.3.0
• HDFS-6584: Archival Storage
– 33 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.6.0
• HDFS-7584: Quota Support for Storage Types
– 11 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.7.0
• HDFS-7197: Enhancements to Mover
– 10 subtasks (Mainly contributed by eBay)
21. Data at eBay's Clusters
• What's stored?
– Different types of user behavior data
– Experimentation results
– Buyer and seller data
• How is it used?
HDFS Tiered Storage 21
23. Many options for Archival Storage
• Create another HDFS cluster
– Move data using DistCp
• Non-HDFS cheap storage
• Tiered Storage
– DISK
– ARCHIVE
24. Benefits of using Tiered Storage
• Data resides in one cluster
– No change for applications
– Security of datasets remains unchanged
– Audit log history can be aggregated from one cluster
– Data can be moved back to compute nodes on demand
– Operational simplicity
26. DISK vs. ARCHIVE Node

                        Regular (DISK)    Archive
Storage capacity        40 TB             210 TB
# of drives             12                60
# of processing units   32                4
Main memory             128 GB            64 GB
Cost / GB               X¢ (varies)       X/5¢ (varies)
Runs Node Manager       Yes               No
27. Archival Data Node Configuration
• dfs.datanode.data.dir
– [ARCHIVE]/hadoop/1/data, [ARCHIVE]/hadoop/2/data, …
• dfs.datanode.balance.max.concurrent.moves
– 400 (10 for DISK nodes)
• dfs.datanode.balance.bandwidthPerSec
– 400 MB (10 MB for DISK nodes)
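As a sketch, the settings above correspond to entries like the following in hdfs-site.xml on the ARCHIVE datanodes. The data directory paths are examples, and note that dfs.datanode.balance.bandwidthPerSec takes bytes per second, so 400 MB/s is written as 419430400.

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[ARCHIVE]/hadoop/1/data,[ARCHIVE]/hadoop/2/data</value>
</property>
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>400</value>
</property>
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <!-- 400 MB/s expressed in bytes per second -->
  <value>419430400</value>
</property>
```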
28. Archival Process
• For each dataset
– Scan the HDFS audit logs to identify the data access pattern
– Derive an archival policy. Example:
– Set the storage policy on subdirectories based on the archival policy
– Run the Mover for the dataset
Time            Storage Policy   Block Placement
< 90 days       HOT              3 replicas on DISK
90–270 days     WARM             1 DISK, 2 ARCHIVE
> 270 days      COLD             3 replicas on ARCHIVE
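The age-based schedule in this example archival policy can be sketched as a small helper; this is an illustration of the policy table, not part of HDFS or eBay's tooling.

```python
# Hypothetical sketch: pick a storage policy from a dataset's age,
# following the example archival policy table (90/270-day thresholds).

def storage_policy(age_days):
    """Return the storage policy for data of the given age, in days."""
    if age_days < 90:
        return "HOT"    # 3 replicas on DISK
    if age_days < 270:
        return "WARM"   # 1 replica on DISK, 2 on ARCHIVE
    return "COLD"       # 3 replicas on ARCHIVE

print(storage_policy(100))  # WARM
```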
29. Notes
• Mover speed goes up with more sources and destinations
– With 2000 DISK nodes moving to 48 ARCHIVE nodes, 1 PB gets moved in 12 hours
• FSCK displays information on storage types and storage policies for a directory.
• Detailed storage tier information on the Namenode UI is in progress.
30. Summary
• eBay uses Tiered Storage to store rarely used data
– Reduces storage costs by using big storage with limited computing
• Tiered storage can be operated using storage types and storage policies.
• An archival policy needs to be set up for datasets based on their access pattern.
– Keep Audit Logs to analyze the access pattern.
• Tiered storage information is available via FSCK.
In this presentation, I will first describe the motivation behind archiving some of our datasets and the benefits of using tiered storage to archive the data.
Then we will go over the setup of tiered storage on one of our clusters.
I’ll describe our Archival process to identify the data to archive and their movement to archival storage.
Let us first review eBay’s requirements for Archival Storage by taking a look at our clusters.
The Apollo cluster was set up in 2012 with around 20 petabytes of storage; as we needed more storage and computing, we increased the capacity to 36 petabytes in 2013.
Towards the end of 2013 we were getting close to capacity again, so we added more datanodes and brought the capacity up to 40 petabytes.
Our computing capacity requirements for the Apollo cluster were satisfied at this point, but the data kept growing.
So it was evident that we needed to increase the storage capacity to store the existing data as well as the new data coming in.
So far, to increase capacity, we had added machines that have both storage and computing power. But this time we wanted to take advantage of the fact that we needed more storage but no more compute. We wanted to reduce costs by paying only for storage.
We decided to take a look at the data we store in our cluster to identify the data which are rarely used.
A major class of data that we store in the Apollo cluster is user behavior data, of which there are several types. There is experimentation data, buyer- and seller-related data, etc. This data is used by various teams at eBay. They write their own MapReduce jobs, Hive queries, Spark jobs, etc. to aggregate or summarize the data to derive insights or make decisions. For most of the datasets mentioned here, we expected that the data is heavily used initially and that usage then goes down.
We decided to verify this assumption by scanning our audit logs. The audit logs show who accessed which data and when. At eBay, we store our HDFS audit logs in HDFS itself, and we keep them for a long time.
This chart shows the access pattern of files under a dataset that contains information about users' sessions. The session data was stored in HDFS in early January 2014. The X axis shows the age of the data in months and the Y axis shows the access count. In the first month, there were 35K accesses to files under the directory. This dropped to 30K in February. In March, it was around 7.5K. From the 8th month onwards, there was not much access.
For this data, there is not much computation happening after 8 months, so there is no benefit to keeping it on a node that has computing power. We can move it to a datanode that has little computing power.
There are many options for Archival storage.
We could create an HDFS-only cluster using machines with low computing power. The data can be moved using DistCp.
Another option is to use cheap non-HDFS storage and copy the data from the original cluster to this external storage.
But if you want to keep the data in the same cluster, HDFS tiered storage is the right choice. HDFS tiered storage introduces the concept of storage types. There is a DISK storage type and an ARCHIVE storage type. The default DISK storage type refers to normal storage, where the datanode has both compute and storage so that we can take advantage of data locality for data processing.
The ARCHIVE storage type refers to datanodes that do not have much computing power. Since the computing power is low, we pay mainly for the storage and not for computing. This lowers the cost per GB for an ARCHIVE node compared to a DISK node.
Once you have DISK and ARCHIVE storage types in the cluster, you archive data by moving the blocks from DISK Storage to ARCHIVE Storage.
There are many benefits to archiving with tiered storage. Most of them stem from the fact that the dataset continues to reside in HDFS in the original cluster.
The applications that operate on the data do not need to change at all; the HDFS location remains the same.
The attributes of the dataset, such as ACLs and permissions, remain the same. The audit logs continue to record information as if nothing has changed for these datasets, so the audit log history contains the full history from the time the data was stored until you delete it.
If the data access pattern changes and archived data becomes heavily used, we can move the data back to compute nodes to regain data locality during computation.
There is also operational simplicity, since you do not need to manage a separate form of archival storage.
Setting up Tiered storage on our Apollo Cluster.
As I mentioned, we already had 40 PB of storage with storage type DISK, comprising 2000 datanodes.
We added a new set of datanodes that are heavy in storage but have limited computing power: 48 nodes with a total storage capacity of 10 petabytes.
This formed our tiered storage: 40 petabytes of DISK and 10 petabytes of ARCHIVE storage.
Let us compare a DISK datanode with an ARCHIVE datanode. Our regular DISK nodes have a capacity ranging from 20 TB to 40 TB; we have a lot of 20 TB nodes and a few 40 TB nodes, but the details in the slide refer to the 40 TB datanodes since they are the most recent. The ARCHIVE nodes have 210 TB of storage capacity per datanode.
The DISK nodes have 12 drives, whereas the ARCHIVE nodes have 60. The regular DISK nodes run computations on the data and have a total of 32 processing units. The ARCHIVE nodes only store data; their processing power is mainly needed to run the operating system and the datanode process, so they have only 4 processing units. The regular datanodes have 128 GB of main memory, whereas the ARCHIVE nodes have only 64 GB. The cost per GB of ARCHIVE storage is about one fifth that of DISK storage. No Node Manager runs on an ARCHIVE datanode, since the data is rarely accessed; it runs only the datanode process. The DISK nodes run both the Node Manager, to provision containers, and the datanode process.
The ARCHIVE nodes run only the datanode process. There are a few key differences in their datanode configuration.
The datanode directories should be marked as ARCHIVE.
The major operation that happens on these datanodes is the move operation from DISK storage, so we need to tune the ARCHIVE datanodes to support move operations efficiently.
There are two parameters that influence move operations on a datanode. dfs.datanode.balance.max.concurrent.moves specifies how many moves are concurrently allowed to or from a datanode. The second parameter is the allowed bandwidth for block moves to or from a datanode.
In our cluster, we had 2000 DISK datanodes moving blocks to 48 ARCHIVE datanodes; the ratio of source DISK nodes to destination ARCHIVE nodes is roughly 40:1.
In our clusters, for DISK datanodes we set the concurrent moves to 10. For ARCHIVE nodes, we set this value to 400, following that 40:1 ratio.
For DISK datanodes we set the bandwidth to 10 MB; for ARCHIVE nodes, we set it to 400 MB.
With these values, we moved around 1 petabyte of data from DISK to ARCHIVE storage in 12 hours without impacting normal cluster operations.
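As a rough sanity check on those numbers (back-of-the-envelope arithmetic, assuming the 48 ARCHIVE nodes are the bottleneck and run near their configured bandwidth), 48 nodes at 400 MB/s for 12 hours lands in the neighborhood of the 1 PB that was moved:

```python
# Back-of-the-envelope estimate of data moved in 12 hours, assuming the
# ARCHIVE side is the bottleneck and runs at its configured bandwidth cap.
archive_nodes = 48
bandwidth_mb_per_sec = 400      # per-node dfs.datanode.balance.bandwidthPerSec
seconds = 12 * 3600             # 12 hours

# MB -> PB using decimal units (1 PB = 1e9 MB)
moved_pb = archive_nodes * bandwidth_mb_per_sec * seconds / 1e9
print(f"{moved_pb:.2f} PB")  # ~0.83 PB at the full bandwidth cap
```

The estimate comes out slightly under 1 PB, which is consistent with the reported figure given that moves do not all run at exactly the cap.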
So far we have moved around 8 petabytes of data to archival storage. Going forward, we plan to derive an archival policy for each dataset. To derive an archival policy, we analyze the HDFS audit logs to determine the dataset's access pattern. Consider a dataset that is accessed heavily in the first 3 months, moderately in the next 6 months, and rarely beyond 9 months. For this dataset, we would define an archival policy where the directory's storage policy is set to HOT initially, changed to WARM for the next 6 months, and set to COLD after 9 months so that all replicas are moved to ARCHIVE storage.
Our archival process has the following steps for each dataset:
identify the data access pattern, derive the archival policy based on it, have a process that sets storage policies according to the derived policy, and run the Mover to enforce the storage policies.
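The policy-setting and movement steps map onto the HDFS CLI. A sketch is below; the path is a hypothetical example, and the commands require a running cluster with tiered storage configured.

```shell
# Set the storage policy derived from the access pattern
# (/data/sessions is an example path)
hdfs storagepolicies -setStoragePolicy -path /data/sessions -policy COLD

# Verify the policy took effect
hdfs storagepolicies -getStoragePolicy -path /data/sessions

# Run the Mover to migrate existing replicas to match the policy
hdfs mover -p /data/sessions
```

Setting the policy only affects placement of the data; the Mover is what actually relocates existing replicas to ARCHIVE storage.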
A few additional points to note about the tiered storage.
The speed of data movement from DISK nodes to ARCHIVE nodes is proportional to the number of DISK and ARCHIVE nodes.
With 2000 DISK nodes and 48 ARCHIVE nodes, we could archive 1 PB of data in 12 hours.
FSCK provides vital information for a given directory. We have enhanced FSCK so that it reports the storage tiers of the data blocks belonging to the directory.
Work on displaying storage tier information in the Namenode UI is in progress.
Summarizing my presentation,
eBay uses tiered storage to store both hot and cold data. Storage costs are reduced by lowering the compute capability of the nodes that store cold data.
HDFS tiered storage is operated via storage types and storage policies associated with directories.
An archival policy needs to be set up for each dataset based on its access pattern. Audit logs are useful for identifying access patterns, so collect them and store them for a long time. The archival process should change storage policies on directories to trigger the movement of their data.
FSCK provides information about tiered storage.