In this talk, we share our experience designing and implementing a next-generation, scale-out architecture for HDFS. In particular, we implement the namespace on top of a key-value store. Key-value stores are highly scalable and can be scaled out on demand. Our current prototype shows that under the new architecture the HDFS namespace scales 10x better than the current generation with no performance loss, demonstrating that HDFS is capable of storing billions of files on current hardware.
1. Scaling HDFS to Manage Billions of Files
Haohui Mai, Jing Zhao
Hortonworks, Inc.
2. About the speakers
• Haohui Mai
• Active committer and PMC member in Hadoop
• Ph.D. in Computer Science from UIUC in 2013
• Joined the HDFS team at Hortonworks
• 250+ commits in Hadoop
3. About the speakers
• Jing Zhao
• Active committer and PMC member in Hadoop
• Ph.D. in Computer Science from USC in 2012
• HDFS team member at Hortonworks
• 250+ commits in Hadoop
5. Past: the scale of data
• In 2007 (PC)
• ~500 GB hard drives
• thousands of files
• In 2007 (Hadoop)
• several hundred nodes
• several hundred TBs
• millions of files
• In 2015
• 4,000+ nodes (10x)
• 150+ PBs (1000x)
• 400M+ files (100x)
11. Present: a generic storage system
• SQL-On-Hadoop
• Machine learning
• Real-time analytics
• Data streaming
• File archives, NFS…
• From an MR-centric filesystem to a generic distributed storage system
15. Future: Billions of files in HDFS
• HDFS clusters continue to grow
• New use cases emerge
• IoT, time series data…
• Files are natural abstractions of data
• Few big files → many small files in HDFS
• Billions of files in a few years
18. NameNode limits the scale
• Master / slave architecture
• All metadata in the NN, data across multiple DNs
• Simple and robust
• Does not scale beyond the size of the NN heap
• 400M files ~ 128G heap (≈320 bytes of heap per file)
• GC pauses
[Diagram: a single NN serving metadata for multiple DNs]
22. Next-gen arch: HDFS on top of KV stores
• Namespace (NS) on top of Key-Value (KV) stores
• Storing the NS in LevelDB
• Working set fits in memory, cold metadata on disks
• Matches the usage patterns of HDFS
• Low adoption cost: fully compatible
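To make the layering concrete, here is a minimal sketch of a LevelDB-backed namespace, assuming the leveldbjni bindings (org.fusesource.leveldbjni / org.iq80.leveldb) and a hypothetical key schema of (parent inode id, child name); the actual layout in HDFS-8286 may differ.

import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

/** Sketch: an HDFS-like namespace stored in LevelDB.
 *  Hypothetical key schema: 8-byte parent inode id || child name.
 *  Value: the serialized inode (an opaque byte[] here). */
public class KvNamespace implements AutoCloseable {
  private final DB db;

  public KvNamespace(File dir) throws IOException {
    // LevelDB keeps the hot working set in its memtable and block
    // cache; cold metadata stays in SSTables on disk.
    db = factory.open(dir, new Options().createIfMissing(true));
  }

  private static byte[] key(long parentId, String name) {
    byte[] n = name.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(8 + n.length).putLong(parentId).put(n).array();
  }

  public void putInode(long parentId, String name, byte[] inode) {
    db.put(key(parentId, name), inode);
  }

  public byte[] getInode(long parentId, String name) {
    return db.get(key(parentId, name));  // null if absent
  }

  @Override public void close() throws IOException { db.close(); }
}

Because LevelDB orders keys lexicographically, prefixing each key with the parent inode id keeps a directory's children contiguous, so listing a directory becomes a single range scan.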
43. Integrate with existing HDFS features
• HDFS snapshots
• Metadata-only operations
• Append a version id to each key (see the sketch below)
• Map snapshot ids to version ids
• NameNode High Availability (HA)
• Use edit logs instead of the KV store's WAL to persist operations
• Minimal changes to the current HA mechanisms
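The slides name the snapshot technique without showing it; the sketch below is a hypothetical illustration: every write appends a monotonically increasing version id to the key, a snapshot records a snapshot id → version id mapping, and a read under a snapshot resolves to the newest version at or below that version id.

import java.util.NavigableMap;
import java.util.TreeMap;

/** Sketch: metadata-only snapshots via versioned keys. */
public class VersionedRead {
  public static void main(String[] args) {
    // Stand-in for one key's versions in the sorted KV store
    // (in LevelDB the version id would be appended to the key bytes).
    NavigableMap<Long, String> fooVersions = new TreeMap<>();
    fooVersions.put(3L, "inode-foo@v3");  // write at version 3
    fooVersions.put(9L, "inode-foo@v9");  // overwrite at version 9

    long s1 = 5L;  // snapshot s1 was taken at version id 5
    // Read under snapshot s1: newest version with id <= 5.
    System.out.println(fooVersions.floorEntry(s1).getValue());
    // -> inode-foo@v3; a read of the live namespace would use
    //    fooVersions.lastEntry() and see inode-foo@v9.
  }
}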
44. Current status
• Phase I — NS on top of KV interfaces (HDFS-8286)
• NS on top of an in-memory KV store
• Under active development
• Phase II — Partial NS in memory
• Working set of the NS in memory, cold metadata on disks
• Scaling the NS beyond the size of the heap
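HDFS-8286 is described here only at a high level; as one hypothetical rendering of "NS on top of KV interfaces", the namespace code would be written against a small store abstraction so that an in-memory map (Phase I) and LevelDB (Phase II) are interchangeable backends.

import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;

/** Sketch: a KV interface the namespace could be coded against. */
interface KvStore {
  byte[] get(byte[] key);
  void put(byte[] key, byte[] value);
  void delete(byte[] key);
}

/** Phase I backend: an in-memory sorted map; a LevelDB-backed
 *  implementation (Phase II) would plug in behind the same interface. */
class InMemoryStore implements KvStore {
  // Sorted like LevelDB's default comparator: lexicographic byte order.
  private final ConcurrentSkipListMap<byte[], byte[]> map =
      new ConcurrentSkipListMap<>(Arrays::compareUnsigned);

  @Override public byte[] get(byte[] key) { return map.get(key); }
  @Override public void put(byte[] key, byte[] value) { map.put(key, value); }
  @Override public void delete(byte[] key) { map.remove(key); }
}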
51. NNThroughput (read)
• Read-only operations
• Vanilla LevelDB: ~1/3 the throughput of 2.7.0
• Contention on LevelDB's global lock during get()
• A lock-free fast path for get() recovers the performance (LevelDB-opt)
[Bar chart: Throughput (ops/s), 0 to 200,000, for the open and fileStatus operations; series: 2.7.0, InMem, LevelDB, LevelDB-opt]
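LevelDB itself is C++ and the actual LevelDB-opt patch is not shown here; the Java sketch below only illustrates the general technique of a lock-free read fast path, using copy-on-write as a stand-in: readers dereference an immutable, atomically published view, while writers serialize on a lock and publish a fresh copy.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch: lock-free reads over a copy-on-write map. */
public class FastPathReads {
  // Readers see a consistent, immutable snapshot; no lock on get().
  private final AtomicReference<Map<String, String>> view =
      new AtomicReference<>(Collections.emptyMap());

  public String get(String key) {
    return view.get().get(key);  // the lock-free fast path
  }

  public synchronized void put(String key, String value) {
    // Writers copy, mutate, and atomically publish the new view.
    Map<String, String> next = new HashMap<>(view.get());
    next.put(key, value);
    view.set(Collections.unmodifiableMap(next));
  }
}

Copy-on-write is only one simple way to remove the reader-side lock; whatever the actual LevelDB change does internally, the effect is the same: get() no longer contends on the global mutex.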
52. YCSB: Throughput
• YCSB against HBase 1.0.1.1
• Enabled short-circuit reads
• 100 threads, 10M records
[Bar chart: Throughput (ops/s), 0 to 120,000, for YCSB workloads A, B, C, F, D, E; series: 2.7.0, InMem, LevelDB]
61. Conclusions
• HDFS needs to continue to scale
• Evolve HDFS towards a KV-based architecture
• Scaling beyond the size of the NN heap
• Preliminary evaluation looks promising
62. Acknowledgement
• Xiao Lin, interned with Hortonworks in 2013
• PoC implementation of a LevelDB-backed namespace
• Zhilei Xu, interned with Hortonworks in 2014
• Integration between various HDFS features and LevelDB
• Performance tuning
66. Integrating with LevelDB
Write operations in HDFS
• Updating LevelDB inside the global lock
• New LevelDB::Write() w/o blocking calls
• Write the edit log
• logSync()
Pruning edit logs
• Dump the memtable to disk
• Update MANIFEST
• Prune edit logs
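Putting the bullets above together, here is a minimal sketch of the write path, assuming the leveldbjni bindings; EditLog is a stand-in interface for HDFS's edit log (not the real FSEditLog API), and the slide's non-blocking LevelDB::Write() is approximated by a put() with sync disabled, since durability comes from the edit log rather than LevelDB's WAL.

import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import java.io.IOException;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.WriteOptions;

/** Sketch: an HDFS-style write path layered over LevelDB. */
public class KvWritePath {
  /** Stand-in for HDFS's edit log; hypothetical interface. */
  interface EditLog {
    void logEdit(byte[] op);  // buffer the serialized operation
    void logSync();           // force buffered edits to stable storage
  }

  private final DB db;
  private final EditLog editLog;
  private final Object nsLock = new Object();  // the NN's global lock

  KvWritePath(File dir, EditLog editLog) throws IOException {
    this.db = factory.open(dir, new Options().createIfMissing(true));
    this.editLog = editLog;
  }

  void applyOp(byte[] key, byte[] value, byte[] serializedOp) {
    synchronized (nsLock) {
      // 1. Update LevelDB inside the global lock; sync(false) skips
      //    the fsync of LevelDB's own WAL.
      db.put(key, value, new WriteOptions().sync(false));
      // 2. Record the operation in the edit log.
      editLog.logEdit(serializedOp);
    }
    // 3. Sync the edit log outside the lock, as the NN does today.
    editLog.logSync();
  }
}

Pruning then works like a checkpoint: once the memtable has been dumped to an SSTable and the MANIFEST records it, edit log segments older than that point are no longer needed for recovery and can be discarded.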