Debunking the Myths of HDFS Erasure Coding Performance

  1. Debunking the Myths of HDFS Erasure Coding Performance
  2. Replication is Expensive
     - HDFS inherits 3-way replication from Google File System: simple, scalable and robust
     - 200% storage overhead
     - Secondary replicas rarely accessed
  3. Erasure Coding Saves Storage
     - Simplified example: storing the 2 bits "1 0"
     - Replication stores "1 0" twice (2 extra bits); XOR coding stores "1 0" plus the parity bit 1 ⊕ 0 = 1 (1 extra bit)
     - Same data durability: can lose any 1 bit
     - Half the storage overhead
     - Slower recovery
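A minimal sketch of the XOR idea from the slide above, in plain Java (illustration only, not HDFS code): one parity bit protects the two data bits, and any single lost bit can be rebuilt from the surviving bit and the parity.

```java
// Illustration of the slide's XOR example: store "1 0" with one parity bit.
public class XorParityDemo {
    public static void main(String[] args) {
        int d0 = 1, d1 = 0;            // the two data bits being stored
        int parity = d0 ^ d1;          // 1 ⊕ 0 = 1, the single extra bit

        // Suppose d1 is lost: rebuild it from the surviving bit and the parity.
        int recoveredD1 = d0 ^ parity; // 1 ⊕ 1 = 0
        System.out.println("recovered d1 = " + recoveredD1);
    }
}
```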
  4. Erasure Coding Saves Storage
     - Facebook f4 stores 65PB of BLOBs in EC
     - Windows Azure Storage (WAS): a PB of new data every 1~2 days; all "sealed" data stored in EC
     - Google File System: large portion of data stored in EC
  5. Roadmap
     - Background of EC: redundancy theory; EC in distributed storage systems
     - HDFS-EC architecture
     - Hardware-accelerated Codec Framework
     - Performance Evaluation
  6. Durability and Efficiency
     - Data durability = how many simultaneous failures can be tolerated
     - Storage efficiency = what portion of the storage holds useful data (vs. redundant data)
     - 3-way replication: data durability = 2, storage efficiency = 1/3 (33%)
  7. Durability and Efficiency
     - XOR: data durability = 1, storage efficiency = 2/3 (67%)
     - Truth table: 0 ⊕ 0 = 0, 0 ⊕ 1 = 1, 1 ⊕ 0 = 1, 1 ⊕ 1 = 0
     - A lost cell is rebuilt from the survivors, e.g. Y = 0 ⊕ 1 = 1
  8. Durability and Efficiency
     - Reed-Solomon (RS), here with 4 data + 2 parity cells: data durability = 2, storage efficiency = 4/6 (67%)
     - Very flexible: the number of data and parity cells can be chosen freely
  9. Durability and Efficiency
                               Data Durability   Storage Efficiency
     Single Replica                   0                100%
     3-way Replication                2                 33%
     XOR with 6 data cells            1                 86%
     RS (6,3)                         3                 67%
     RS (10,4)                        4                 71%
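The numbers in this table follow directly from the (data, parity) unit counts: durability is the number of redundant units, efficiency is data units over total units. A small throwaway Java helper (not part of HDFS) reproduces them:

```java
// Reproduces the table above. Replication is the degenerate case with
// 1 data unit and (replicas - 1) redundant copies.
public class RedundancyMath {
    static void show(String name, int data, int redundant) {
        double efficiency = 100.0 * data / (data + redundant);
        System.out.printf("%-22s durability=%d efficiency=%.0f%%%n",
                name, redundant, efficiency);
    }
    public static void main(String[] args) {
        show("Single Replica", 1, 0);        // tolerates 0 failures, 100%
        show("3-way Replication", 1, 2);     // tolerates 2 failures, 33%
        show("XOR with 6 data cells", 6, 1); // tolerates 1 failure, 86%
        show("RS (6,3)", 6, 3);              // tolerates 3 failures, 67%
        show("RS (10,4)", 10, 4);            // tolerates 4 failures, 71%
    }
}
```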
  10. EC in Distributed Storage: Block Layout
      - Contiguous layout: each 128MB range of the file becomes one whole block on one DataNode (0~128M -> block0 on DataNode 0, 128~256M -> block1 on DataNode 1, ..., 640~768M -> block5 on DataNode 5), with parity blocks on further DataNodes
      - Data locality 👍🏻, small files 👎🏻
  11. EC in Distributed Storage: Block Layout
      - Striped layout: the file is split into small cells (e.g. 1MB) striped round-robin across the blocks (0~1M -> block0 on DataNode 0, 1~2M -> block1 on DataNode 1, ..., 5~6M -> block5 on DataNode 5, 6~7M wraps back to block0), with parity blocks on further DataNodes
      - Data locality 👎🏻, small files 👍🏻, parallel I/O 👍🏻
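Under the striped layout, finding where a logical byte lives is simple arithmetic. A hedged sketch, assuming a 1 MB cell size and 6 data blocks per group as in the slide's example (the real HDFS-EC internals use their own cell size and classes):

```java
// Sketch only: map a logical byte offset to (internal block index, offset in
// that block) under round-robin striping. Assumes 1 MB cells and 6 data blocks.
public class StripeMapping {
    static final long CELL_SIZE = 1L << 20;  // 1 MB
    static final int  DATA_BLOCKS = 6;

    static void locate(long fileOffset) {
        long cellIndex   = fileOffset / CELL_SIZE;          // which cell of the file
        int  blockIndex  = (int) (cellIndex % DATA_BLOCKS); // which internal block / DataNode
        long stripeIndex = cellIndex / DATA_BLOCKS;         // which stripe (row) of the group
        long offsetInBlock = stripeIndex * CELL_SIZE + fileOffset % CELL_SIZE;
        System.out.printf("offset %d -> block %d, byte %d in that block%n",
                fileOffset, blockIndex, offsetInBlock);
    }

    public static void main(String[] args) {
        locate(0);          // first cell, block 0
        locate(6L << 20);   // 6 MB: wraps back to block 0, second stripe
        locate(7L << 20);   // 7 MB: block 1, second stripe
    }
}
```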
  12. EC in Distributed Storage: Spectrum
                       Replication                        Erasure Coding
      Contiguous       HDFS                               Facebook f4, Windows Azure
      Striped          Ceph, Quantcast File System        Ceph, Quantcast File System
  13. Roadmap
      - Background of EC (covered)
      - HDFS-EC architecture
      - Hardware-accelerated Codec Framework
      - Performance Evaluation
  14. Choosing Block Layout
      - Assuming (6,3) coding: small files < 1 block, medium 1~6 blocks, large > 6 blocks (one group)
      - Cluster A profile: file count 96.29% small / 1.86% medium / 1.85% large; space usage 26.06% / 9.33% / 64.61%; top 2% of files occupy ~65% of space
      - Cluster B profile: file count 86.59% / 11.38% / 2.03%; space usage 23.89% / 36.03% / 40.08%; top 2% of files occupy ~40% of space
      - Cluster C profile: file count 99.64% / 0.36% / 0.00%; space usage 76.05% / 20.75% / 3.20%; dominated by small files
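The small/medium/large buckets in these profiles follow from the block size and the (6,3) group width; a quick sketch of the classification, assuming the usual 128 MB HDFS block size (the thresholds come from the slide):

```java
// Sketch: classify a file by how many 128 MB blocks it spans, relative to a
// (6,3) group that holds 6 data blocks (768 MB of data).
public class FileSizeBucket {
    static final long BLOCK = 128L << 20;   // 128 MB block (assumed default)
    static final long GROUP = 6 * BLOCK;    // data capacity of one (6,3) group

    static String bucket(long fileBytes) {
        if (fileBytes <= BLOCK) return "small";   // fits in a single block
        if (fileBytes <= GROUP) return "medium";  // 1~6 blocks, partial group
        return "large";                           // more than one full group
    }

    public static void main(String[] args) {
        System.out.println(bucket(10L << 20));    // small
        System.out.println(bucket(500L << 20));   // medium
        System.out.println(bucket(2048L << 20));  // large
    }
}
```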
  15. Choosing Block Layout: Current HDFS
  16. Generalizing Block
      - NameNode: mapping logical and storage blocks
      - Too many storage blocks? Hierarchical naming protocol
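A hedged sketch of the hierarchical naming idea: the NameNode allocates one logical block group ID, and the internal storage blocks are addressed by a small index folded into the low-order bits of that ID, so the NameNode never tracks each storage block individually. The bit layout below is an assumption for illustration, not HDFS's exact scheme.

```java
// Sketch only: derive internal storage-block IDs from one logical group ID by
// reserving the low 4 bits for the block's index within the group (assumed).
public class BlockGroupNaming {
    static final int INDEX_BITS = 4;  // up to 16 internal blocks per group (assumption)

    static long internalBlockId(long groupId, int indexInGroup) {
        return (groupId << INDEX_BITS) | indexInGroup;
    }

    static long groupIdOf(long internalBlockId) {
        return internalBlockId >> INDEX_BITS;   // recover the logical group
    }

    public static void main(String[] args) {
        long group = 42L;
        for (int i = 0; i < 9; i++) {           // 6 data + 3 parity internal blocks
            long id = internalBlockId(group, i);
            System.out.printf("internal block %d -> id %d (group %d)%n",
                    i, id, groupIdOf(id));
        }
    }
}
```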
  17. Client Parallel Writing
      - One streamer (with its own queue) per internal block, managed by a Coordinator
  18. Client Parallel Reading
      - Data blocks are read in parallel; parity blocks are fetched for reconstruction when needed
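A minimal sketch of the parallel-read idea using plain Java concurrency (not the actual DFSStripedInputStream): the client issues reads for all data cells of a stripe concurrently instead of reading one block at a time; in real HDFS-EC it would also fetch parity cells and decode if a DataNode fails.

```java
// Sketch only: read the 6 data cells of one stripe in parallel; the remote
// read is simulated by readCellFromDataNode().
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelStripeRead {
    public static void main(String[] args) throws Exception {
        int dataBlocks = 6;
        ExecutorService pool = Executors.newFixedThreadPool(dataBlocks);
        List<Future<byte[]>> cells = new ArrayList<>();
        for (int i = 0; i < dataBlocks; i++) {
            final int block = i;
            cells.add(pool.submit(() -> readCellFromDataNode(block))); // one fetch per DataNode
        }
        for (Future<byte[]> cell : cells) {
            byte[] data = cell.get();            // assemble the stripe in order
            System.out.println("got " + data.length + " bytes");
        }
        pool.shutdown();
    }

    // Stand-in for a remote read of one 1 MB cell from DataNode `block`.
    static byte[] readCellFromDataNode(int block) {
        return new byte[1 << 20];
    }
}
```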
  19. Reconstruction on DataNode
      - Important to avoid delay on the critical path, especially if original data is lost
      - Integrated with the Replication Monitor: under-protected EC blocks are scheduled together with under-replicated blocks, using new priority algorithms
      - New ErasureCodingWorker component on the DataNode
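The priority idea can be illustrated with a small sketch: blocks are ordered by how much redundancy remains, so an EC group that can afford to lose only one more internal block is treated like a block with a single spare replica. The thresholds below are assumptions for illustration, not the exact HDFS algorithm.

```java
// Sketch only: rank blocks for reconstruction by remaining redundancy
// (failures still tolerated); 0 means one more loss is data loss.
public class ReconstructionPriority {
    static int priority(int liveUnits, int requiredForRead) {
        int remaining = liveUnits - requiredForRead;  // e.g. RS(6,3): requiredForRead = 6
        if (remaining <= 0) return 0;                 // highest priority: no margin left
        if (remaining == 1) return 1;
        return 2;                                     // still comfortably protected
    }
    public static void main(String[] args) {
        System.out.println(priority(2, 1));  // replicated block with 2 of 3 replicas -> 1
        System.out.println(priority(7, 6));  // RS(6,3) group with 7 of 9 blocks live -> 1
        System.out.println(priority(6, 6));  // RS(6,3) group that cannot lose more  -> 0
    }
}
```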
  20. Data Checksum Support
      - Supports getFileChecksum for EC striped-mode files: checksums are comparable between striped files with the same content
      - Checksums of a contiguous file and a striped file cannot be compared
      - Can reconstruct missing blocks on the fly while computing the checksum
      - Planning a new version of getFileChecksum to make checksums comparable between contiguous and striped files
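FileSystem.getFileChecksum is a public Hadoop API; a short usage sketch comparing two striped files (the paths are hypothetical). Per the slide, the comparison is only meaningful between files of the same layout today.

```java
// Usage sketch: compare checksums of two EC (striped) files via the public
// FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareStripedChecksums {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileChecksum a = fs.getFileChecksum(new Path("/ec/data/copy1")); // hypothetical paths
        FileChecksum b = fs.getFileChecksum(new Path("/ec/data/copy2"));
        System.out.println("same content: " + a.equals(b));
    }
}
```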
  21. Roadmap
      - Background of EC (covered)
      - HDFS-EC architecture (covered)
      - Hardware-accelerated Codec Framework
      - Performance Evaluation
  22. Acceleration with Intel ISA-L
      - 1 legacy coder, from Facebook's HDFS-RAID project
      - 2 new coders: a pure-Java coder (code improvement over HDFS-RAID) and a native coder using Intel's Intelligent Storage Acceleration Library (ISA-L)
  23. Why is ISA-L Fast?
      - Pre-computed and reused coding tables
      - Parallel operation (SIMD instructions)
      - Direct ByteBuffer
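The last point, direct ByteBuffers, can be shown in plain Java: handing data to a native (JNI) coder in direct, off-heap buffers lets it work on the memory in place instead of copying heap byte[] arrays across the JNI boundary. This is a generic sketch, not ISA-L's actual interface; encodeNative() is a placeholder.

```java
// Sketch only: allocate direct (off-heap) buffers for a native RS coder.
import java.nio.ByteBuffer;

public class DirectBufferCoding {
    public static void main(String[] args) {
        int cellSize = 1 << 20;                 // 1 MB cells
        ByteBuffer[] data = new ByteBuffer[6];
        ByteBuffer[] parity = new ByteBuffer[3];
        for (int i = 0; i < 6; i++) data[i] = ByteBuffer.allocateDirect(cellSize);
        for (int i = 0; i < 3; i++) parity[i] = ByteBuffer.allocateDirect(cellSize);

        // A native coder would read and fill these buffers in place, e.g.:
        // encodeNative(data, parity);   // hypothetical JNI entry point
        System.out.println("direct buffers ready: " + data[0].isDirect());
    }
}
```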
  24. Microbenchmark: Codec Calculation
  25. Microbenchmark: Codec Calculation
  26. Microbenchmark: HDFS I/O
  27. Microbenchmark: HDFS I/O
  28. Microbenchmark: HDFS I/O
  29. DFSIO / MapReduce
  30. Hive-on-MR (locality sensitive)
  31. Hive-on-Spark (locality sensitive)
  32. Conclusion
      - Erasure coding expands effective storage space by ~50%!
      - HDFS-EC phase I implements erasure coding in the striped block layout
      - Upstream effort (HDFS-7285): design finalized Nov. 2014; development started Jan. 2015; 218 commits, ~25k LoC changed; broad collaboration across Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn
      - Phase II will support the contiguous block layout for better locality
  33. Acknowledgements
      - Cloudera: Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
      - Intel: Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
      - Hortonworks: Jing Zhao, Tsz Wo Nicholas Sze
      - Huawei: Vinayakumar B, Walter Su, Xinwei Qin
      - Yahoo (Japan): Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
  34. Questions?
      - Zhe Zhang, LinkedIn: zhz@apache.org | @oldcap | http://zhe-thoughts.github.io/
      - Uma Gangumalla, Intel: umamahesh@apache.org | @UmaMaheswaraG
      - http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/
  35. Come See Us at Intel: Booth 305, "Amazing Analytics from Silicon to Software"
      - Intel powers analytics solutions that are optimized for performance and security from silicon to software
      - Intel unleashes the potential of Big Data to enable advancement in healthcare/life sciences, retail, manufacturing, telecom and financial services
      - Intel accelerates advanced analytics and machine learning solutions
      - Twitter #HS16SJ
  36. LinkedIn Hadoop
      - Dali: LinkedIn's logical data access layer for Hadoop
      - Meetup: Thu 6/30, 6~9PM @LinkedIn, 2nd floor, Unite room, 2025 Stierlin Ct, Mountain View
      - Dr. Elephant: performance monitoring and tuning; SFHUG in Aug
  37. Backup
