Giraffa - November 2014
1. Giraffa
A highly available, scalable, distributed file system
PLAMEN JELIAZKOV & MILAN DESAI
2. Quick Introduction
• Giraffa is a new file system.
• Distributes its namespace by utilizing features of HDFS
and HBase.
• Open source project in experimental stage.
3. Design Principles
• Linear scalability – more nodes can do more work within the same
time. Scale data size and compute resources.
• Reliability and availability – 1/1000 probability that a drive will fail
today; on a large cluster with thousands of drives there can be
several failures.
• Move computation to data – minimize expensive data transfers.
• Sequential data processing – avoid random reads. [Use HBase for
random access].
4. Scalability Limits
• Single-master architecture: a constraining resource
• Single NameNode limits linear performance growth – a few
bad clients / jobs can saturate the NameNode.
• Single point of failure – takes the entire file system out of service.
• NameNode space limit:
-- 100 million files and 200 million blocks with 64GB RAM
-- Restricts storage capacity to about 20 PB
-- Small file problem: block-to-file ratio is shrinking as people
store more small files in HDFS.
These are Konstantin’s own discoveries as published in
“HDFS Scalability: The Limits to Growth”, USENIX ;login:, 2010.
5. The Goals for Giraffa
• Support millions of concurrent clients
- More servers -> more concurrent connections can be accepted.
• Store hundreds of billions of objects
- More servers -> higher total memory.
• Maintain Exabyte total storage capacity
- More servers -> host more slaves -> higher total storage.
Sharding the namespace achieves all three goals.
6. What About Federation?
1. HDFS Federation allows independent NameNodes to share a
common pool of DataNodes.
2. In Federation, a user sees NameNodes as volumes, or as isolated
file systems.
Federation is a static approach to Namespace partitioning.
We call it static because sub-trees are statically assigned to disjoint
volumes.
Relocating sub-trees to a new volume requires copying between file
systems.
A dynamic Namespace partitioning could move sub-trees
automatically based on utilization or load-balancing requirements.
In some cases, sub-trees could be relocated without copying data
blocks.
8. Giraffa Requirements
Availability – the primary goal
- Region splitting leads to load balancing of metadata traffic.
- Same data streaming speed to / from DataNodes.
- No SPOF. Continuous availability.
Scalability
- Each RegionServer stores a part of the namespace.
Cluster operability
- The cost of running a larger cluster is the same as for a smaller one.
- But running multiple clusters is more expensive.
9. The Big Picture
1. Use HBase to store HDFS Namespace metadata.
2. DataNodes continue to store HDFS blocks.
3. Introduce coprocessors to act as a communication layer between
HBase, HDFS, and the file system.
4. Store files and directories as rows in HBase.
A Giraffa “shard” consists of:
HBase RegionServer
HDFS NameNode – to be replaced with Giraffa BlockManager.
HDFS DataNode(s)
*HBase Master
*ZooKeeper(s)
* == Not required per shard, but necessary within the network.
11. Giraffa File System
• fs.defaultFS = grfa:///
• fs.grfa.impl = org.apache.giraffa.GiraffaFileSystem
• Namespace is cached in RegionServer RAM.
• Regions lead to dynamic Namespace partitioning.
• Block management is handled by a specialized RegionObserver
coprocessor that communicates with the DataNodes -> performs
block allocation, replication, deletion, heartbeats, and block
reports.
• Namespace manipulation is handled by a specialized coprocessor ->
performs all NameNode RPC server calls.
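A minimal sketch of what this looks like on the client side, assuming standard Hadoop Configuration/FileSystem usage; the two property values come from the bullets above, and the example path is hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal client-side sketch: register Giraffa as the default file system
// and use it through the ordinary Hadoop FileSystem API.
public class GiraffaConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "grfa:///");
    conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem");

    // Calls below are routed to the GiraffaFileSystem implementation,
    // which talks to the HBase-hosted namespace instead of a NameNode.
    FileSystem fs = FileSystem.get(URI.create("grfa:///"), conf);
    fs.mkdirs(new Path("/user/example"));   // hypothetical path
    fs.close();
  }
}
```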
12. NamespaceAgent
A quick run-through of this class:
1. Implements ClientProtocol. Not a coprocessor.
2. Replaces NameNode RPC channel for GiraffaClient
(which extends DFSClient and is the client used by
GiraffaFileSystem class).
3. Has an HBaseClient member that communicates RPC
requests to the NamespaceProcessor coprocessor of a
RegionServer.
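To make the layering concrete, here is a hedged sketch of the delegation idea; it is not the actual Giraffa source, and every identifier other than the NamespaceAgent role it illustrates is invented for the example.

```java
// Hedged sketch of the idea behind NamespaceAgent: NameNode-style RPCs are
// satisfied by forwarding them to the HBase-hosted NamespaceProcessor rather
// than to a NameNode. All names below are illustrative.
public class NamespaceAgentSketch {

  /** Stand-in for the HBase client channel that reaches the
   *  NamespaceProcessor coprocessor on the owning RegionServer. */
  interface NamespaceChannel {
    boolean mkdirs(String src);        // hypothetical forwarded call
    long getFileLength(String src);    // hypothetical forwarded call
  }

  private final NamespaceChannel channel;

  public NamespaceAgentSketch(NamespaceChannel channel) {
    this.channel = channel;
  }

  // In real Giraffa these methods come from ClientProtocol; only two
  // representative calls are shown here.
  public boolean mkdirs(String src) {
    return channel.mkdirs(src);        // routed to HBase, not to a NameNode
  }

  public long getFileLength(String src) {
    return channel.getFileLength(src);
  }
}
```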
13. Namespace Table
A single HBase table called “Namespace” stores:
1. A RowKey: the bytes that identify the row and therefore
the file / directory.
2. File attributes: name, owner, group, permissions, access-time,
modification-time, block size, replication, length.
3. List of blocks for the file.
4. List of block locations.
5. State of the file: under construction, closed.
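As an illustration of how one file could map onto such a row, a hedged sketch using the HBase 1.x client API; the column family and qualifier names are invented for the example and are not Giraffa's actual schema.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch: one file stored as one row of the "Namespace" table.
// Family and qualifier names are illustrative only.
public class NamespaceRowSketch {
  public static Put exampleFileRow() {
    byte[] rowKey = Bytes.toBytes("/user/example/data.txt"); // e.g. a full-path key
    byte[] cf = Bytes.toBytes("file");                       // hypothetical family
    Put put = new Put(rowKey);
    put.addColumn(cf, Bytes.toBytes("owner"), Bytes.toBytes("exampleuser"));
    put.addColumn(cf, Bytes.toBytes("permissions"), Bytes.toBytes("rw-r--r--"));
    put.addColumn(cf, Bytes.toBytes("blockSize"), Bytes.toBytes(128L * 1024 * 1024));
    put.addColumn(cf, Bytes.toBytes("length"), Bytes.toBytes(0L));
    put.addColumn(cf, Bytes.toBytes("state"), Bytes.toBytes("UNDER_CONSTRUCTION"));
    // The block list and block locations would be serialized into further columns.
    return put;
  }
}
```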
14. Row Keys
• Files and directories are stored as rows in HBase.
• The key bytes of a row determine its sorting in the Namespace
table.
• Different RowKey definitions change locality of files and
directories within the HBase region.
• FullPathRowKey is the default implementation. The key bytes
of the row are the full source path to the file or directory.
-- Problem: a rename may cause the row to move to another Region.
• Another idea is NumberedRowKey. The key bytes are an assigned number.
-- Problem: locality within the HBase Namespace table is lost.
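A hedged sketch of the two key strategies above; the interface is simplified and, apart from the class names FullPathRowKey and NumberedRowKey taken from the slide, everything is illustrative.

```java
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch of the two row-key strategies; simplified interface.
interface RowKeySketch {
  byte[] getKey();
}

// Full-path keys: siblings sort together and usually share a region,
// but a rename changes the key and may move the row.
class FullPathRowKeySketch implements RowKeySketch {
  private final String src;
  FullPathRowKeySketch(String src) { this.src = src; }
  public byte[] getKey() { return Bytes.toBytes(src); }
}

// Numbered keys: a rename keeps the row in place because the number
// does not change, but directory locality in the table is lost.
class NumberedRowKeySketch implements RowKeySketch {
  private final long id;
  NumberedRowKeySketch(long id) { this.id = id; }
  public byte[] getKey() { return Bytes.toBytes(id); }
}
```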
15. Locality of Reference
• The traditional tree-structured namespace is flattened into a
linear array.
• Ordered list of files is self-partitioned into regions.
• RowKey implementations define sorting of files and
directories in the table.
• Files in the same directory will belong to the same region
(most of the time).
-- This leads to an efficient “ls” implementation: simply scan
across a Region.
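To illustrate the point, a hedged sketch of “ls” as a range scan, assuming full-path row keys, the “Namespace” table name from the earlier slide, and the HBase 1.x client API; a real implementation would also filter out entries from nested subdirectories.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch: with full-path row keys, listing a directory becomes a
// range scan over the keys that share the directory prefix. Because files
// in a directory usually sit in one region, the scan stays local.
public class ListDirSketch {
  public static void listDir(Connection conn, String dir) throws Exception {
    byte[] start = Bytes.toBytes(dir + "/");
    byte[] stop  = Bytes.toBytes(dir + "/" + '\uffff');  // crude upper bound for the example
    Scan scan = new Scan(start, stop);
    try (Table table = conn.getTable(TableName.valueOf("Namespace"));
         ResultScanner results = table.getScanner(scan)) {
      for (Result row : results) {
        System.out.println(Bytes.toString(row.getRow()));  // entry's full path
      }
    }
  }
}
```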
16. Giraffa Today
A lot of work has been done by the current team; the newest
additions to date are:
• Introduction of custom Giraffa WebUI.
• Atomic in-place rename, non-atomic moves, and non-atomic
move failure recovery.
• Serializing Exceptions over RPC.
• Support for YARN.
• (Coming soon) Introduction of Lease management.
17. Neat Futures
• Full Hadoop compatibility / HDFS replacement. We are 96%
compliant with the hadoop/hdfs shell today, as shown by passing
the bulk of TestHDFSCLI; dfsadmin commands are still missing.
• Since file system metadata lives in the same pool as regular
data, it is possible to run analytics over it and obtain
detailed analysis of your own file system.
• Snapshot implementation becomes a matter of increasing the
number of versions of a row allowed in HBase.
• Extended attributes implementation just means adding a new
column to the file row.
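A hedged sketch of the snapshot point above: retaining more cell versions in the HBase column family keeps older namespace states readable. The family name and version count are illustrative, not actual Giraffa settings.

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

// Hedged sketch: configure the Namespace table's column family to keep
// multiple versions of each cell, so earlier file states remain readable.
public class SnapshotVersionsSketch {
  public static HTableDescriptor namespaceTableDescriptor() {
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("Namespace"));
    HColumnDescriptor family = new HColumnDescriptor("file");  // hypothetical family
    family.setMaxVersions(10);  // retain up to 10 historical versions per cell
    table.addFamily(family);
    return table;
  }
}
```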
18. History
2009 – Study on scalability limits.
2010 – Konstantin Shvachko works on design with Michael Stack;
presentation at HDFS contributors meeting.
2011 – Plamen Jeliazkov implements first POC.
2012 – Presented at Hadoop Summit. Open sourced as an Apache
Extras project.
2013 – Milan Desai and Konstantin Pelykh added as committers.
Konstantin Boudnik as a contributor.
2014 – Giraffa Scalability tested – ~46,300 mkdirs / second with 64
RegionServer nodes and 64 client nodes.