The current HDFS Namenode stores all of its metadata in RAM. This has allowed Hadoop clusters to scale to 100K concurrent tasks, but memory limits the total number of files a single Namenode can store. While Federation allows one to create multiple volumes with additional Namenodes, there is still a need to scale a single namespace, and also to store multiple namespaces in a single Namenode.
This talk describes a project that removes the space limit while maintaining similar performance by caching only the working set, i.e. the hot metadata, in Namenode memory. We believe this approach will be very effective because the subset of files that is frequently accessed is much smaller than the full set of files stored in HDFS.
In this talk we describe our overall approach and give details of our implementation, along with some early performance numbers.
Speaker: Lin Xiao, PhD student at Carnegie Mellon University, intern at Hortonworks
2. About Me: Lin Xiao
• PhD student at CMU
• Advisor: Garth Gibson
• Thesis area – scalable distributed file systems
• Intern at Hortonworks
• Intern project: removing the Namenode memory limitation
• Email: lxiao+@cs.cmu.edu
8/22/2013
3. Big Data
• We create 2.5×10^18 bytes of data per day [IBM]
• Sloan Digital Sky Survey: 200GB/night
• Facebook: 240 billion photos as of Jan 2013
• 250 million photos uploaded daily
• Cloud storage
• Amazon: 2 trillion objects, peak of 1.1 million ops/sec
• Need scalable storage systems
• Scalable metadata <- focus of this presentation
• Scalable storage
• Scalable IO
4. Scalable Storage Systems
• Separate data and metadata servers
• More data nodes for higher throughput & capacity
• Bulk of work – the IO path - is done by data servers
• Not much work added to metadata servers?
5. Federated HDFS
• Namenodes (MDS) each see their own namespace (NS)
• Each datanode can serve all namenodes
[Figure: Federated HDFS: multiple Namenodes, each with its own namespace and block pool, over a shared set of Datanodes]
6. Single Namenode
• Stores all metadata in memory
• Design is simple
• Provide low latency and high throughput metadata operations
• Support up to 3K data servers
• Hadoop clusters make it affordable to store old data
• Cold data is stored in the cluster for a long time
• Takes up memory space but is rarely used
• Growth of data size can exceed throughput
• Goal: remove the space limit while maintaining similar performance
7. Metadata in Namenode
• Namespace
• Stored as a linked tree of inodes
• Every operation traverses the tree from the top
• Blocks Map: block_id to location mapping
• Handled separately because of the huge number of blocks
• Datanode status
• IP address, capacity, load, heartbeat status, block report status
• Leases
• The namespace and block map use the majority of the memory
• This talk will focus on the Namespace
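The linked tree of inodes described above can be sketched as follows. This is an illustrative stand-in, not HDFS's actual `INode` classes; the class and field names here are hypothetical. It shows why every operation must walk the tree from the root:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a namespace stored as a linked tree of inodes.
// (Names are illustrative, not HDFS's own INode hierarchy.)
class INode {
    final long id;          // inode number
    final String name;
    final Map<String, INode> children = new HashMap<>(); // empty for files

    INode(long id, String name) { this.id = id; this.name = name; }

    // Every operation starts from the top of the tree and walks
    // one path component at a time.
    static INode resolve(INode root, String path) {
        INode cur = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;   // skip leading/double slashes
            cur = cur.children.get(part);
            if (cur == null) return null;   // a component is missing
        }
        return cur;
    }
}
```

Because lookups always descend from the root, keeping the upper levels of the tree resident is essential for any caching scheme.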
8. Problem and Proposed Solution
• Problem:
• Remove the namespace limit while maintaining similar performance when the working set fits in memory
• Solution
• Retain the same namespace tree structure
• Store the namespace in persistent store using LSM (LevelDB)
• No separate edit logs or checkpoints
• All inodes and their updates are persisted via LevelDB
• Fast startup, at the cost of slow initial operations
• Could prefetch inodes into memory
• Do not expect customers to drastically reduce the actual heap size
• A larger heap benefits transitions between different working sets as applications and workloads change
• A customer may occasionally run queries against cold data
9. New Namenode Architecture
• Namespace
• Same as before, but only part of the tree is in memory
• On a cache miss, read from LevelDB
• Edit logs and checkpoints are replaced by LevelDB
• LevelDB is updated on every inode change
• Key: <parent_inode_number + name>
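With a `<parent_inode_number + name>` key, all children of a directory share a fixed-length prefix, so they occupy a contiguous range in LevelDB's sorted key space and can be listed with a single range scan. A minimal sketch of such an encoding (the exact byte layout is an assumption, not the prototype's on-disk format):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: encode a LevelDB key as <parent inode number><child name>.
// A big-endian long keeps numeric order consistent with byte order,
// so the children of one directory form a contiguous key range.
class InodeKey {
    static byte[] encode(long parentInode, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(8 + nameBytes.length)
                .putLong(parentInode)   // 8-byte big-endian prefix
                .put(nameBytes)
                .array();
    }

    // Unsigned lexicographic comparison, like LevelDB's default comparator.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

Under this ordering, every key for directory 5 sorts before every key for directory 6, which is what makes whole-directory reads and evictions cheap.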
[Figure: old architecture (Namenode holding all inodes in memory, plus on-disk edit logs) vs. new architecture (Namenode caching inodes in memory, backed by LevelDB with its write buffer and WAL)]
8/22/2013
10. Comparison w/Traditional FileSystem
• Traditional File Systems
• VFS layer keeps an inode and directory-entry cache
• Goal is to support the workload of a single machine
• Relatively large number of files
• Supports applications from a single machine, or in the case of NFS from a larger number of client machines
• Much smaller workload and size compared to Hadoop use cases
• LevelDB-based Namenode
• Supports the very large traffic of a Hadoop cluster
• Keeps a much larger number of inodes in memory
• Cache replacement policies to suit the Hadoop workload
• Data is in Datanodes
11. LevelDB
• A fast key-value storage library written at Google
• Basic operations: get, put, delete
• Concurrency: single process w/multiple threads
• By default, writes are asynchronous
• As long as the machine doesn’t crash, it’s safe.
• Support synchronous writes
• No separate sync() operation
• Can be implemented by sync write/delete
• Support batch updates
• Data is automatically compressed using Snappy
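The operation set above (get/put/delete plus atomic batch updates) is what the new Namenode relies on. Real Java bindings such as leveldbjni expose this interface; as a dependency-free sketch, the semantics can be mimicked with a sorted map (this only models the sorted key space, not the WAL, compaction, or compression):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Stand-in for LevelDB's interface: get/put/delete and batch writes.
// A TreeMap mimics the sorted key space of an LSM store, nothing more.
class TinyKV {
    private final TreeMap<String, String> map = new TreeMap<>();

    String get(String k)         { return map.get(k); }
    void put(String k, String v) { map.put(k, v); }
    void delete(String k)        { map.remove(k); }

    // A batch collects puts/deletes and applies them together,
    // analogous to LevelDB's WriteBatch.
    class Batch {
        private final List<String[]> ops = new ArrayList<>();
        Batch put(String k, String v) { ops.add(new String[]{k, v}); return this; }
        Batch delete(String k)        { ops.add(new String[]{k, null}); return this; }
        void write() {
            for (String[] op : ops) {
                if (op[1] == null) map.remove(op[0]);
                else map.put(op[0], op[1]);
            }
        }
    }
    Batch createBatch() { return new Batch(); }
}
```

Batching matters here because a single rename or recursive delete touches many inode keys and should land as one update.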
12. Cache Replacement Policy
• Only whole directories are replaced in or out
• Hot dirs are entirely in cache; others require a LevelDB scan
• Future – don’t cache very large dirs?
• No need to read from disks to check file existence
• LRU replacement policy
• Use CLOCK to approximate to reduce cost
• Separate thread for cache replacement
• Start replacement when threshold is exceeded
• Move eviction out of lock-holding sessions
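CLOCK approximates LRU with one reference bit per entry and a rotating hand, so a cache hit only sets a bit instead of reordering a list. A minimal single-threaded sketch (illustrative only, not the prototype's code; the real policy evicts whole directories, not individual keys):

```java
// Minimal CLOCK cache sketch: one reference bit per slot and a
// rotating hand approximate LRU without list reordering on access.
class ClockCache {
    private final long[] keys;
    private final boolean[] referenced;
    private final boolean[] occupied;
    private int hand = 0;

    ClockCache(int capacity) {
        keys = new long[capacity];
        referenced = new boolean[capacity];
        occupied = new boolean[capacity];
    }

    // Returns true on a hit (and sets the reference bit),
    // false on a miss (and admits the key, evicting if needed).
    boolean access(long key) {
        for (int i = 0; i < keys.length; i++) {
            if (occupied[i] && keys[i] == key) { referenced[i] = true; return true; }
        }
        insert(key);
        return false;
    }

    private void insert(long key) {
        while (true) {
            if (!occupied[hand]) break;        // free slot found
            if (!referenced[hand]) break;      // victim: bit already clear
            referenced[hand] = false;          // give a second chance
            hand = (hand + 1) % keys.length;
        }
        keys[hand] = key; occupied[hand] = true; referenced[hand] = false;
        hand = (hand + 1) % keys.length;
    }

    boolean contains(long key) {
        for (int i = 0; i < keys.length; i++) {
            if (occupied[i] && keys[i] == key) return true;
        }
        return false;
    }
}
```

Because hits are cheap, eviction work can be deferred to the separate replacement thread mentioned above, running once the occupancy threshold is exceeded.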
13. Benchmark description
• NNThroughputBenchmark
• No RPC cost, call FileSystem method directly
• All operations are generated based on BFS order
• Each thread gets one portion of the work
• NN Load generator using YCSB++ framework (in progress)
• Normal HDFS client calls
• Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
• Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
• E.g. GridMix; expect little degradation when most of the work is data transfer
14. Categories of tests
• Everything fits in memory
• Goal: should be almost the same as the current NN
• Working set does not fit in memory or changes over time
• Study various cache replacement policies
• Need to get good traces from real clusters to see patterns of hot, warm, and cold data
15. Experiment Setup
• Hardware description (Susitna)
• CPU: AMD Opteron 6272, 64 bit, 16 MB L2, 16-core 2.1 GHz
• SSD: Crucial M4-CT064M4SSD2 SSD, 64 GB, SATA 6.0Gb/s
• (In progress) Use disks in future experiments
• Heap size is set to 1GB
• NNThroughputBenchmark
• No RPC cost, call FileSystem method directly
• All operations are generated based on BFS order
• Multiple threads, but each thread gets one portion of the work
• Each directory contains 100 subdirs and 100 files
• Named sequentially: ThroughputBenchDir1, ThroughputBench1
• LevelDB NN
• Cache monitor thread starts replacement when 90% full
16. Create & close 2.4M files – all fit in cache
[Chart: throughput (ops/sec, 0-8000) vs. number of threads (2, 4, 8, 16) for the original NN and the NN w/LevelDB]
• Note files are not accessed, but clearly parent dirs are
• Note: Old NN and LevelDB NN peak at different # threads
• Degradation for peak throughput is 13.5%
17. Create 9.6M files: 1% fits in cache
• Old NN with 8 threads and LevelDB NN with 16 threads.
• Performance remains about the same using LevelDB
• The original Namenode's throughput drops to zero when memory is exhausted
[Chart: throughput (ops/sec, 0-7000) over elapsed time (0-1600+ seconds) for the Original NN and the LevelDB NN]
8/22/2013
18. GetFileInfo
• ListStatus of the first 600K of 2.4M files
• Each thread works on a different part of the tree
• Original NN: all fit in memory (of course)
• LevelDB NN, 2 cases: (1) all fit, (2) half fit
• Half fit: 10-20% degradation; the cache is constantly replaced
[Chart: throughput (ops/sec, 0-140000) vs. number of threads (2-32) for Original, FitCache, and HalfInCache]
8/22/2013
19. Benchmarks that remain
• NNThroughputBenchmark
• No RPC cost, call FileSystem method directly
• All operations are generated based on BFS order
• Each thread gets one portion of the work
• NN Load generator using YCSB++ framework (in progress)
• Normal HDFS client calls
• Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
• Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
• E.g. GridMix; expect little degradation when most of the work is data transfer
20. Summary
• Now that the NN is HA, removing the namespace memory limitation is one of the most important problems to solve
• LSM (LevelDB) has worked out quite well
• Initial experiments have shown good results
• Need further benchmarks especially on how effective caching is for
different workloads and patterns
• Other LSM implementations? (e.g. HBase's Java LSM)
• Work is done on branch 0.23
• Graduate-student-quality prototype (by a very good graduate student)
• But done working closely with the HDFS experts at Hortonworks
• Goal of internship was to see how well the idea worked
• Hortonworks plans to take this to the next stage once more experiments are completed
21. Q&A
• Contact: lxiao+@cs.cmu.edu
• We’d love to get trace stats from your cluster
• A simple Java program to run against your audit logs
• Can also run as MapReduce jobs
• Extracts metadata operation stats without exposing sensitive info
• Please contact me if you could help!