The document discusses erasure coding as an alternative to replication in distributed storage systems such as HDFS. It notes that while 3-way replication provides high durability, it carries 200% storage overhead; erasure coding can provide comparable or better durability at a fraction of that overhead, at the cost of slower recovery. The document outlines how Facebook's f4, Windows Azure Storage, and the Google File System use erasure coding, then details HDFS-EC: its architecture, its use of hardware acceleration, and a performance evaluation showing its benefits over replication.
2. Replication is Expensive
HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
- 200% storage overhead
- Secondary replicas rarely accessed
3. Erasure Coding Saves Storage
Simplified example: storing 2 bits
- Replication: 1 0 → 1 0, 1 0 (2 extra bits)
- XOR coding: 1 0 → 1 0, 1 ⊕ 0 = 1 (1 extra bit)
Same data durability
- can lose any 1 bit
Half the storage overhead
Slower recovery
4. Erasure Coding Saves Storage
Facebook
- f4 stores 65PB of BLOBs in EC
Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
Google File System
- Large portion of data stored in EC
5. Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC architecture
Hardware-accelerated Codec Framework
Performance Evaluation
6. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
3-way Replication:
- Data Durability = 2
- Storage Efficiency = 1/3 (33%): 1 copy is useful data, 2 copies are redundant data
7. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
XOR:
- Data Durability = 1
- Storage Efficiency = 2/3 (67%): 2 useful data cells, 1 redundant parity cell

X | Y | X ⊕ Y
--+---+------
0 | 0 |   0
0 | 1 |   1
1 | 0 |   1
1 | 1 |   0

Recovery example: if Y is lost, it is recomputed from the survivors as Y = X ⊕ (X ⊕ Y) = 0 ⊕ 1 = 1
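To make the recovery rule concrete, here is a minimal sketch in plain Python (an illustration, not HDFS code) of the XOR scheme above: the parity cell is the bitwise XOR of the two data cells, and any single lost cell is recovered by XORing the two survivors.

def xor_encode(x, y):
    """The parity cell is the bitwise XOR of the two data cells."""
    return x ^ y

def xor_recover(survivor, parity):
    """Recover a lost data cell: X ^ (X ^ Y) == Y, and symmetrically."""
    return survivor ^ parity

x, y = 0, 1
parity = xor_encode(x, y)             # 0 ^ 1 = 1
assert xor_recover(x, parity) == y    # Y lost: 0 ^ 1 = 1
assert xor_recover(y, parity) == x    # X lost: 1 ^ 1 = 0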
8. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
Reed-Solomon (RS), here with 4 data cells and 2 parity cells:
- Data Durability = 2
- Storage Efficiency = 4/6 (67%)
Very flexible!
9. Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?

Scheme                 | Data Durability | Storage Efficiency
Single Replica         | 0               | 100%
3-way Replication      | 2               | 33%
XOR with 6 data cells  | 1               | 86%
RS (6,3)               | 3               | 67%
RS (10,4)              | 4               | 71%
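The whole table follows from one rule of thumb: a scheme with k data cells and m redundant cells tolerates m simultaneous failures and has storage efficiency k / (k + m). A quick Python check of the rows above (treating replication's extra copies as its redundant cells):

# Each scheme as (name, k data cells, m redundant cells).
schemes = [
    ("Single Replica", 1, 0),
    ("3-way Replication", 1, 2),
    ("XOR with 6 data cells", 6, 1),
    ("RS (6,3)", 6, 3),
    ("RS (10,4)", 10, 4),
]
for name, k, m in schemes:
    # Durability = m failures tolerated; efficiency = useful / total.
    print(f"{name}: durability={m}, efficiency={k / (k + m):.0%}")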
14. Choosing Block Layout
Assuming (6,3) coding: small files < 1 block; medium 1~6 blocks; large > 6 blocks (1 group)

Cluster A Profile (top 2% of files occupy ~65% of space):
            small    medium   large
file count  96.29%   1.86%    1.85%
space usage 26.06%   9.33%    64.61%

Cluster B Profile (top 2% of files occupy ~40% of space):
            small    medium   large
file count  86.59%   11.38%   2.03%
space usage 23.89%   36.03%   40.08%

Cluster C Profile (dominated by small files):
            small    medium   large
file count  99.64%   0.36%    0.00%
space usage 76.05%   20.75%   3.20%
19. Reconstruction on DataNode
Important to avoid delay on the critical path
- Especially if original data is lost
Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
New ErasureCodingWorker component on DataNode
20. Data Checksum Support
Supports getFileChecksum for EC striped mode files
- Checksums of striped files with the same content are comparable
- Checksums of a contiguous file and a striped file cannot be compared
- Missing blocks can be reconstructed on the fly while computing the checksum
Planning to introduce a new version of getFileChecksum
- To achieve comparable checksums between contiguous and striped files
32. Conclusion
Erasure coding halves the raw storage needed per byte of data, roughly doubling effective storage capacity!
HDFS-EC phase I implements erasure coding in striped block layout
Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn
Phase II will support contiguous block layout for better locality
33. Acknowledgements
Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
Intel
- Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
Huawei
- Vinayakumar B, Walter Su, Xinwei Qin
Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
35. Come See us at Intel - Booth 305
“Amazing Analytics from Silicon to Software”
• Intel powers analytics solutions that are optimized for
performance and security from silicon to software
• Intel unleashes the potential of Big Data to enable advancements in healthcare/life sciences, retail, manufacturing, telecom and financial services
• Intel accelerates advanced analytics and machine learning
solutions
Twitter #HS16SJ
36. LinkedIn Hadoop
Dali: LinkedIn’s Logical Data Access Layer for Hadoop
- Meetup Thu 6/30, 6~9PM @LinkedIn
- 2nd floor, Unite room, 2025 Stierlin Ct, Mountain View
Dr. Elephant: performance monitoring and tuning
- SFHUG in Aug
Simply put, it doubles the storage capacity of your cluster. This talk explains how it happens. Blog post link.
When the GFS paper was published more than a decade ago, the objective was to store massive amounts of data on a large number of cheap commodity machines. A breakthrough design was to rely on machine-level replication to protect against machine failures, instead of xxx.
A more efficient approach to storing data reliably is erasure coding. Here’s a simplified example.
In this talk I will introduce how we implemented erasure coding in HDFS.
RS uses more sophisticated linear algebra operations to generate multiple parity cells, and thus can tolerate multiple failures per group. It works by multiplying a vector of k data cells with a Generator Matrix (GT) to generate an extended codeword vector with k data cells and m parity cells.
In this particular example, RS combines the strong durability of replication with the high efficiency of simple XOR. More importantly, it is flexible.
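Below is a hedged Python sketch of that generator-matrix multiplication; it is an illustration, not HDFS's actual codec. gf_mul implements multiplication in GF(2^8), and rs_encode multiplies the k data cells by m Vandermonde-style parity rows. Production coders such as Intel ISA-L construct generator matrices that guarantee recovery from any m losses; the simple rows here only demonstrate the mechanics.

def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8): carry-less multiply, then reduce
    modulo the field polynomial."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return product

def gf_pow(a, n):
    """Repeated GF(2^8) multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def rs_encode(data_cells, m):
    """Return m parity cells for k data cells (one byte per cell).
    Parity row i of the generator matrix is [(i+1)^j for j in 0..k-1]."""
    parities = []
    for i in range(m):
        p = 0
        for j, d in enumerate(data_cells):
            p ^= gf_mul(gf_pow(i + 1, j), d)
        parities.append(p)
    return parities

# Example: RS(6,3) -- 6 data cells, 3 parity cells.
data = [0x01, 0x23, 0x45, 0x67, 0x89, 0xAB]
print([hex(p) for p in rs_encode(data, 3)])

Decoding works the same way in reverse: take the rows of the generator matrix corresponding to the surviving cells, invert that k × k submatrix over GF(2^8), and multiply it by the survivors to recover the lost data cells.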
To manage potentially very large files, distributed storage systems usually divide files into fixed-size logical byte ranges called logical blocks. These logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster. The simplest mapping between logical and storage blocks is a contiguous block layout, which maps each logical block one-to-one to a storage block. Reading a file with a contiguous block layout is as easy as reading each storage block linearly in sequence.
Non-trivial trade-offs between x and x, y and y.
In all cases, the savings from EC will be significantly lower if it is applied only to large files; in some cases, there are no savings at all.
The former represents a logical byte range in a file, while the latter is the basic unit of data chunks stored on a DataNode. In the example, the file /tmp/foo is logically divided into 13 striping cells (cell_0 through cell_12). Logical block 0 represents the logical byte range of cells 0~8, and logical block 1 represents cells 9~12. Cells 0, 3, 6 form a storage block, which will be stored as a single chunk of data on a DataNode.
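The round-robin cell-to-block mapping in this example can be sketched in a few lines of Python; the constants come straight from the example (3 data blocks per logical block, 9 cells in logical block 0):

DATA_BLOCKS = 3        # data storage blocks per logical block (from the example)
CELLS_PER_LOGICAL = 9  # cells 0~8 form logical block 0 (from the example)

def locate(cell_index):
    """Map a striping cell to (logical block, storage block, offset)."""
    logical = cell_index // CELLS_PER_LOGICAL
    within = cell_index % CELLS_PER_LOGICAL
    storage = within % DATA_BLOCKS   # round-robin across storage blocks
    offset = within // DATA_BLOCKS   # position inside the storage block
    return logical, storage, offset

for cell in (0, 3, 6):
    print(cell, locate(cell))  # cells 0, 3, 6 all land in storage block 0

Note that striping fans each logical block out into many smaller storage blocks, which increases the number of blocks the NameNode must track; that is the overhead the hierarchical naming protocol described next is designed to reduce.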
To reduce this overhead we have introduced a new hierarchical block naming protocol. Currently HDFS allocates block IDs sequentially based on block creation time. This protocol instead divides each block ID into 2~3 sections, as illustrated in Figure 7. Each block ID starts with a flag indicating its layout (contiguous=0, striped=1). For striped blocks, the rest of the ID consists of two parts: the middle section with ID of the logical block and the tail section representing the index of a storage block in the logical block. This allows the NameNode to manage a logical block as a summary of its storage blocks. Storage block IDs can be mapped to their logical block by masking the index; this is required when the NameNode processes DataNode block reports.
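A minimal Python sketch of such an ID scheme follows; the field widths and flag position are illustrative assumptions, not HDFS's exact bit layout:

INDEX_BITS = 4                  # assumed width of the storage-block index
INDEX_MASK = (1 << INDEX_BITS) - 1
STRIPED_FLAG = 1 << 63          # assumed position of the layout flag (striped=1)

def striped_block_id(logical_id, index):
    """Compose a storage block ID from its logical block ID and its
    index within the logical block."""
    return STRIPED_FLAG | (logical_id << INDEX_BITS) | index

def logical_block_of(block_id):
    """Mask off the index to find the owning logical block, as the
    NameNode does when processing DataNode block reports."""
    return (block_id & ~STRIPED_FLAG) >> INDEX_BITS

bid = striped_block_id(logical_id=42, index=3)
assert logical_block_of(bid) == 42
assert bid & INDEX_MASK == 3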
Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.
We also compared end-to-end HDFS I/O performance with these different coders against HDFS’s default scheme of three-way replication. The tests were performed on a cluster with 11 nodes (1 NameNode, 9 DataNodes, 1 client node) interconnected with 10 GigE network. Figure 9 shows the throughput results of 1) client writing a 12GB file to HDFS; and 2) client reading a 12GB file from HDFS. In the reading tests we manually killed two DataNodes so the results include decoding overhead.
As shown in Figure 9, in both the sequential write and read benchmarks, throughput is greatly constrained by the pure Java coders (HDFS-RAID and our own implementation). The ISA-L implementation is much faster than the pure Java coders because of its excellent CPU efficiency. It also outperforms replication by 2-3x because the striped layout allows the client to perform I/O with multiple DataNodes in parallel, leveraging the aggregate bandwidth of their disk drives. We have also tested read performance without any DataNode failure: HDFS-EC is roughly 5x faster than three-way replication.