To archive and tier data in Hadoop successfully, you must understand data heat, age, size, and usage. FactorData HDFSplus provides this visibility and enables automation and simplicity. The result is reduced infrastructure, better performance, and better planning for existing HDFS Hadoop clusters.
Reasons For Storage Tiering with Hadoop:
• A single tier leads to a large imbalance of compute and storage resources
• More applications create varying workloads
• Large percent of data is cold in most cases
• More recently ingested data can be better balanced
• Fewer nodes per GB with archive nodes
• Lower infrastructure costs
Archive Node Example:
• Existing tier node: medium compute, medium capacity
• Cold tier node: low compute, high-density capacity, 4x less cost per GB
• Over 65% less hardware
• 60% fewer nodes (reduced software licensing costs)
• Significant performance improvement
• Immediate ROI for cloud and private infrastructures
HDFS Storage: Single Tier vs. Tiered (10PB capacity each)
• Single tier: 100% disk data nodes
• Tiered: 20% disk data nodes, 80% archive data nodes (4x fewer nodes)
“The price per GB of the ARCHIVE tier is 4x less”
-eBay Hadoop Engineering Blog
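The 80/20 split and the node-count savings can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming hypothetical per-node capacities (the TB-per-node figures below are illustrative assumptions, not numbers from this deck):

```python
# Back-of-the-envelope node-count arithmetic for a 10 PB cluster.
# Per-node capacities are hypothetical assumptions for the sketch.
total_tb = 10_000                      # 10 PB expressed in TB
disk_node_tb = 48                      # assumed usable TB per standard disk data node
archive_node_tb = 4 * disk_node_tb     # archive nodes assumed 4x denser per node

# Single tier: all 10 PB lands on standard disk data nodes.
single_tier_nodes = total_tb / disk_node_tb

# Tiered: 20% hot data on disk nodes, 80% cold data on archive nodes.
hot_nodes = 0.20 * total_tb / disk_node_tb
cold_nodes = 0.80 * total_tb / archive_node_tb
tiered_nodes = hot_nodes + cold_nodes

node_reduction = 1 - tiered_nodes / single_tier_nodes
print(round(single_tier_nodes), round(tiered_nodes), round(node_reduction, 2))
```

With these assumptions the tiered layout needs roughly 60% fewer nodes, consistent with the "60% fewer nodes" bullet above; the exact figure depends on the real node densities.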
• Access frequency of data is the most important metric for effective tiering.
• Age is the easiest metric to determine. CAUTION: some data is long-term active, so age cannot be the only criterion.
• Zero-byte and small files should be treated differently when tiering Hadoop; large cold files should have priority for archive.
• Knowing how long data remains accessed after it is ingested enables better capacity planning for your tiers.
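A minimal sketch of how these criteria could combine into a tier suggestion. The thresholds, field names, and tier labels are illustrative assumptions, not HDFSplus behavior:

```python
# Illustrative tiering decision combining heat (access), age, and size.
# All thresholds are assumptions; a real deployment would tune them.
from dataclasses import dataclass
import time

DAY = 86_400  # seconds per day

@dataclass
class FileMeta:
    path: str
    size_bytes: int
    mtime: float   # last modification time (proxy for data age)
    atime: float   # last access time (HDFS records this when access-time tracking is enabled)

def suggest_tier(f: FileMeta, now: float) -> str:
    age_days = (now - f.mtime) / DAY
    idle_days = (now - f.atime) / DAY
    if f.size_bytes < 1_000_000:
        return "HOT"        # small files: archiving buys little, handle separately
    if idle_days < 30:
        return "HOT"        # recently accessed data stays on fast storage
    if age_days > 120 and idle_days > 90:
        return "ARCHIVE"    # old *and* idle: prime archive candidate
    return "WARM"           # aging but still occasionally accessed

now = time.time()
old_cold = FileMeta("/data/logs/2019/part-0", 2_000_000_000,
                    now - 400 * DAY, now - 200 * DAY)
print(suggest_tier(old_cold, now))  # ARCHIVE
```

Note the small-file check comes first: per the caution above, neither age nor size alone is a sufficient criterion, so the rule consults access recency before anything is archived.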
Tier Hadoop HDFS By Heat, Age, Size & Activity
In Three Easy Steps

01/ INSTALL WITHOUT CHANGES TO CLUSTER
Installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.

02/ VISUALIZE & REPORT
Report data usage (heat), small files, user activity, replication, and HDFS tier utilization. Customize rules and queries to properly utilize infrastructure and plan better for future scale.

03/ AUTOMATE OPTIMIZATION
Automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules.
HDFSplus
1. Query list based on size, heat, activity, and age
2. Apply storage policy based on custom query
3. Files are optimized during the normal balancing window
Custom Query Example: Automated Tiering
• Move all files 120 days old and not accessed for 90 days to ARCHIVE
• FactorData creates a data list based on the query
• Limit automated runs by maximum files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run by path, size, and application
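A sketch of what such a query-driven run might look like in code. The metadata layout, exclusion rules, and run limit are illustrative assumptions; the deck does not specify HDFSplus internals:

```python
# Sketch: select archive candidates per the example query
# "files 120 days old and not accessed for 90 days", with an
# assumed path-based exclusion rule and a per-run file cap.
import time

DAY = 86_400
now = time.time()

# Illustrative file metadata: (path, size_bytes, mtime, atime)
files = [
    ("/data/logs/2019/part-0", 2_000_000_000, now - 400 * DAY, now - 200 * DAY),
    ("/data/logs/2024/part-0", 2_000_000_000, now - 10 * DAY,  now - 1 * DAY),
    ("/tmp/scratch/a",         5_000_000_000, now - 300 * DAY, now - 150 * DAY),
]

EXCLUDE_PREFIXES = ("/tmp/",)   # assumed exclusion rule by path
MAX_FILES_PER_RUN = 100         # cap each automated run by file count

def query_archive_candidates(files):
    """Files older than 120 days AND not accessed for 90 days."""
    out = []
    for path, size, mtime, atime in files:
        if path.startswith(EXCLUDE_PREFIXES):
            continue  # excluded from the run
        if (now - mtime) > 120 * DAY and (now - atime) > 90 * DAY:
            out.append(path)
    return out[:MAX_FILES_PER_RUN]

candidates = query_archive_candidates(files)
print(candidates)  # only the old, idle 2019 log file qualifies
```

Applying the policy to the resulting list maps onto standard HDFS facilities, e.g. `hdfs storagepolicies -setStoragePolicy` on each path followed by the `hdfs mover` tool during the balancing window.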
Completely out of the data path
FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from the cluster.
No software to install on the existing Hadoop cluster
Because HDFSplus leverages only existing Hadoop APIs and features, there is no software to install on the cluster.
Provides a highly scalable solution in a small footprint
HDFS visibility and automation for thousands of Hadoop nodes on a single node, VM, or server.
HDFSplus communicates with namenodes through the existing Hadoop API.
Requirements (VM or physical machine): 32GB RAM, 4 CPUs or vCPUs, 500GB free disk.
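The "existing Hadoop APIs" approach is plausible because HDFS already exposes the needed metadata over the WebHDFS REST API: a `LISTSTATUS` call returns per-file size, access time, modification time, and replication. A sketch of parsing such a response (the JSON below is a hand-written sample for illustration; a live collector would issue `GET http://<namenode>:9870/webhdfs/v1/<dir>?op=LISTSTATUS`):

```python
# Extract tiering-relevant metadata from a WebHDFS LISTSTATUS response.
# The payload here is a hand-written sample so the sketch is self-contained.
import json

sample_response = json.dumps({
    "FileStatuses": {"FileStatus": [
        {"pathSuffix": "part-0", "type": "FILE", "length": 134217728,
         "accessTime": 1577836800000, "modificationTime": 1546300800000,
         "replication": 3},
        {"pathSuffix": "archive", "type": "DIRECTORY", "length": 0,
         "accessTime": 0, "modificationTime": 1546300800000,
         "replication": 0},
    ]}
})

def extract_file_meta(payload: str):
    """Return (name, size_bytes, access_time_ms, replication) for plain files only."""
    statuses = json.loads(payload)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["length"], s["accessTime"], s["replication"])
            for s in statuses if s["type"] == "FILE"]

print(extract_file_meta(sample_response))
```

Because only metadata crosses the wire, a collector like this stays out of the data path, consistent with the deployment model described above.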
Simplify and Automate Archive and Tiering in Hadoop Today
• Move less-accessed data to storage-dense nodes for better utilization
• Lower software licensing costs
• Free resources on existing namenodes and datanodes
• How can we get more performance out of our existing Hadoop cluster?
• How can we move data not accessed for 90 days to archive nodes?
• How can we better plan for future scale with real Hadoop storage metrics?
Result: Better Performance, Lower Hardware Costs, Lower Software Costs
Plus: Get Necessary Storage Visibility To Answer These Questions & More with FactorData HDFSplus