HBase User Group #9: HBase and HDFS
1. HBase and HDFS
Todd Lipcon
todd@cloudera.com
Twitter: @tlipcon
#hbase IRC: tlipcon
March 10, 2010
2. Outline
HDFS Overview
HDFS meets HBase
Solving the HDFS-HBase problems
Small Random Reads
Single-Client Fault Tolerance
Durable Record Appends
Summary
3. HDFS Overview
What is HDFS?
Hadoop’s Distributed File System
Modeled after Google’s GFS
Scalable, reliable data storage
All persistent HBase storage is on HDFS
HDFS reliability and performance are key to HBase reliability and performance
5. HDFS Design Goals
Store large amounts of data
Data should be reliable
Storage and performance should scale with the number of nodes.
Primary use: bulk processing with MapReduce
6. Requirements for MapReduce
MR Task Outputs
Large streaming writes of entire files
MR Task Inputs
Medium-size partial reads
Each task usually has 1 reader, 1 writer; 8-16 tasks per node.
DataNodes usually serve few concurrent clients.
MapReduce can restart tasks with ease (they are idempotent).
7. Requirements for HBase
All of the requirements of MapReduce, plus:
Constantly append small records to an edit log (the WAL)
Small-size random reads
Many concurrent readers
Clients cannot restart → single-client fault tolerance is necessary.
8. HDFS Requirements Matrix
Requirement                      MR   HBase
Scalable storage                 ✓    ✓
System fault tolerance           ✓    ✓
Large streaming writes           ✓    ✓
Large streaming reads            ✓    ✓
Small random reads               -    ✓
Single client fault tolerance    -    ✓
Durable record appends           -    ✓
10. Solutions
...turn that frown upside-down
Approaches, from easiest to hardest:
Configuration Tuning
HBase-side workarounds
HDFS Development/Patching
11. Small Random Reads
Configuration Tuning
HBase often has more concurrent clients than MapReduce.
Typical problems:
xceiverCount 257 exceeds the limit of concurrent xcievers 256
Increase dfs.datanode.max.xcievers → 1024 (or greater); the misspelling is Hadoop's own
Too many open files
Edit /etc/security/limits.conf to increase nofile → 32768 (both fixes sketched below)
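A minimal sketch of both changes, using the values from the slide; tune them to your workload, and note that the user name below is a placeholder for whichever account your DataNode and RegionServer daemons run as:

```
<!-- hdfs-site.xml on each DataNode: raise the cap on concurrent
     block-transfer (xceiver) threads; the property key really is
     spelled "xcievers" in Hadoop -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
</property>
```

```
# /etc/security/limits.conf on each node: raise the open-file limit
hadoop  soft  nofile  32768
hadoop  hard  nofile  32768
```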
12. Small Random Reads
HBase Features
HBase block cache
Avoids the need to hit HDFS for many reads
Finer-grained synchronization in HFile reads (HBASE-2180)
Allows concurrent clients to read in parallel for higher throughput
Seek-and-read vs pread API (HBASE-1505)
In current HDFS, these have different performance characteristics (contrasted below)
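A minimal sketch of the two read paths on an HDFS FSDataInputStream, assuming a hypothetical file path and offset (API as in 0.20-era Hadoop):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadVsSeek {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/hbase/example/hfile"));
    byte[] buf = new byte[4096];

    // pread: a positioned read that does not move the stream's file
    // pointer, so concurrent readers can share one stream safely
    in.read(12345L, buf, 0, buf.length);

    // seek-and-read: mutates the shared file position, so callers
    // must serialize access; in current HDFS the two paths also
    // perform differently (the point of HBASE-1505)
    in.seek(12345L);
    in.read(buf, 0, buf.length);

    in.close();
  }
}
```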
13. Small Random Reads
HDFS Development in Progress
Client↔DN connection reuse (HDFS-941, HDFS-380)
Eliminates TCP handshake latency
Avoids restarting the TCP slow-start algorithm for each read
Multiplexed BlockSender (HDFS-918)
Reduces the number of threads and open files in the DN
Netty DataNode (hack in progress)
Non-blocking IO may be more efficient for high concurrency
14. Single-Client Fault Tolerance
What exactly do I mean?
If a MapReduce task fails to write, the MR framework will restart the task.
MR relies on idempotence → task failures are not a big deal.
Thus, fault tolerance of a single client is not as important to MR.
If an HBase region server fails to write, it cannot recreate the data easily.
HBase may access a single file for a day at a time → it must ride over transient errors.
15. Single-Client Fault Tolerance
HDFS Patches
HDFS-127 / HDFS-927
Clients used to give up after N read failures on a file, with no regard for time. This patch resets the failure count after successful reads.
HDFS-630
Fixes block allocation to exclude nodes the client knows to be bad
Important for small clusters!
Backported to 0.20 in CDH2
Various other write-pipeline recovery fixes in 0.20.2 (HDFS-101, HDFS-793)
16. Durable Record Appends
What exactly is the infamous sync()/append()?
Well, it’s really hflush()
HBase accepts writes into memory (the MemStore)
It also logs them to disk (the HLog / WAL)
Each write needs to be on disk before claiming durability.
hflush() provides this guarantee (almost; sketched below)
Unfortunately, it doesn’t work in Apache Hadoop 0.20.x
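A minimal sketch of the durability call, assuming a hypothetical WAL path; in Apache Hadoop 0.20.x the method is spelled sync() (renamed hflush() on trunk), and as noted above, stock 0.20.x does not actually deliver the guarantee:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalFlushSketch {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/hbase/.logs/example-wal"));

    // append one edit to the log...
    out.writeBytes("row1/cf:qual/ts=1/value\n");
    // ...and push it to the DataNodes before acking the client;
    // this is the hflush() guarantee HBase needs
    out.sync();

    out.close(); // a closed file is durable even on 0.20.x
  }
}
```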
17. Durable Record Appends
HBase Workarounds
HDFS files are durable once closed
Currently, HBase rolls the edit log periodically
After a roll, previous edits are safe
Not much of a workaround, unfortunately:
A crash will lose any edits since the last roll.
Rolling constantly results in small files
Bad for NameNode metadata efficiency.
Triggers frequent flushes → bad for region server efficiency (roll-period tuning sketched below)
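For context, the roll frequency is tunable; a sketch assuming the hbase.regionserver.logroll.period property (milliseconds), which trades the size of the loss window against the number of small files:

```
<!-- hbase-site.xml: roll the WAL every 15 minutes, shrinking the
     window of edits a crash can lose at the cost of more small
     files (and more frequent flushes) -->
<property>
  <name>hbase.regionserver.logroll.period</name>
  <value>900000</value>
</property>
```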
19. Durable Record Appends
HDFS Development
On Apache trunk: HDFS-265
New append re-implementation for 0.21/0.22
Will work great, but is essentially a very large set of patches
Not released yet; running unreleased Hadoop is “daring”
In 0.20.x distributions: HDFS-200 patch
Fixes bugs in the old hflush() implementation
Not quite as efficient as HDFS-265, but good enough and simpler
Dhruba Borthakur from Facebook is testing and improving it
Cloudera will test and merge this into CDH3
20. Summary
HDFS’s original target workload was MapReduce, and HBase has different (harder) requirements.
Engineers from the HBase team plus Facebook, Cloudera, and Yahoo are working together to improve things.
Cloudera will integrate all necessary HDFS patches in CDH3, available for testing soon.
Contact me if you’d like to help test in April.