Compaction and Splitting in Apache Accumulo
2. What are compaction and splitting?
•Accumulo tables are divided into non-overlapping key ranges called tablets
•Compaction selects a set of sorted files for a single tablet and rewrites them into one file
•Splitting divides a tablet into two tablets
© Hortonworks Inc. 2012
3. Tablet Overview
•When memory fills, new sorted files are created by flushing
•Sorted files are combined together into fewer sorted files
4. How much data are you writing?
•If you never compact – $O(N)$
…
•If you always compact – $O(N^2)$
…
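The two extremes can be checked with a few lines of arithmetic. The sketch below assumes every flush writes one unit of data and that "always compact" means merging all files into one after every flush. Both function names are illustrative only.

```python
def written_never_compacting(flushes):
    """Each flush writes one unit and nothing is ever rewritten: O(N)."""
    return flushes


def written_always_compacting(flushes):
    """After every flush, merge all files into one: O(N^2) in total."""
    written = 0
    files = []
    for _ in range(flushes):
        files.append(1)          # the flush itself writes one unit
        written += 1
        if len(files) > 1:       # rewrite everything into a single file
            merged = sum(files)
            files = [merged]
            written += merged
    return written


for n in (4, 100):
    print(n, written_never_compacting(n), written_always_compacting(n))
# prints:
# 4 4 13
# 100 100 5149
```

Under these assumptions the always-compact total is N + N(N+1)/2 - 1, which grows quadratically.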
5. Accumulo Compaction Algorithm
•Compact a set of files when:
$\text{size of the largest file} \times \text{compaction ratio} \le \text{sum of the sizes of files}$
•The compaction ratio is set by table.compaction.major.ratio
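The rule can be written as a one-line predicate. This is a minimal sketch, not Accumulo's code: the function name is hypothetical, and the refusal to compact a single file is an added assumption.

```python
def should_compact(file_sizes, ratio):
    """Return True when largest * ratio <= sum of all file sizes in the set."""
    if len(file_sizes) < 2:
        return False  # assumption: a lone file is never compacted by this rule
    return max(file_sizes) * ratio <= sum(file_sizes)


print(should_compact([1, 1, 1], 3))  # True:  1 * 3 <= 3
print(should_compact([3, 1, 1], 3))  # False: 3 * 3 >  5
```

A larger ratio makes the condition harder to satisfy, so files accumulate longer before they are merged.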
6. In Action (r = 3, N = 1, W = 1)
7. In Action (r = 3, N = 2, W = 2)
8. In Action (r = 3, N = 3, W = 3)
9. In Action (r = 3, N = 3, W = 6)
10. In Action (r = 3, N = 4, W = 7)
11. In Action (r = 3, N = 5, W = 8)
12. In Action (r = 3, N = 6, W = 9)
13. In Action (r = 3, N = 6, W = 12)
14. In Action (r = 3, N = 7, W = 13)
15. In Action (r = 3, N = 8, W = 14)
16. In Action (r = 3, N = 9, W = 15)
17. In Action (r = 3, N = 9, W = 24)
18. In Action (r = 3, N = 27, W = 90*)
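The sequence of N and W values on the "In Action" slides can be replayed with a short simulation. It is a sketch under assumptions the slides do not state: every flush writes a file of size 1, and when the full set of files fails the rule, the largest file is dropped and the rule is retried on the rest. N counts flushed units and W counts every unit written by flushes and compactions. The function names are illustrative.

```python
def pick_files_to_compact(sizes, ratio):
    """Drop the largest file until the remaining set satisfies the ratio rule."""
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        if candidates[0] * ratio <= sum(candidates):
            return candidates
        candidates = candidates[1:]  # assumption: retry without the largest file
    return []


def simulate(flushes, ratio):
    """Return W (total units written) after each flush and its compactions."""
    files = []
    written = 0
    history = []
    for _ in range(flushes):
        files.append(1)   # flush: one new file of size 1
        written += 1
        while True:
            chosen = pick_files_to_compact(files, ratio)
            if not chosen:
                break
            for size in chosen:
                files.remove(size)
            merged = sum(chosen)
            files.append(merged)  # compaction rewrites the chosen files
            written += merged
        history.append(written)
    return history


history = simulate(27, 3)
for n in (3, 6, 9, 27):
    print(n, history[n - 1])
# prints:
# 3 6
# 6 12
# 9 24
# 27 90
```

With these assumptions the simulation matches the post-compaction W values shown for N = 3, 6, 9 and 27.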
19. Amount of data written
•$W(r^k) = (k+1)\,r^k - (k-1)\,r^{k-1}$
•Thus, $W(N) = O(N \log N)$
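The step from the closed form to the asymptotic bound can be filled in by substituting $N = r^k$, so that $k = \log_r N$. A sketch of the algebra, assuming $N$ is an exact power of $r$:

```latex
\begin{align*}
W(N) &= (k+1)\,N - (k-1)\,\frac{N}{r} \\
     &= N\left[k\left(1 - \tfrac{1}{r}\right) + 1 + \tfrac{1}{r}\right] \\
     &= N\left[\left(1 - \tfrac{1}{r}\right)\log_r N + 1 + \tfrac{1}{r}\right]
      = O(N \log N)
\end{align*}
```

For $r = 3$ and $N = 27$ this gives $27 \times (2 + \tfrac{4}{3}) = 90$, which agrees with the W value on slide 18.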
20. HBase Compaction Algorithm
•Compact a set of files when:
$\text{size of the largest file} \le \text{sum of the sizes of smaller files} \times \text{compaction ratio}$
•The compaction ratio is set by hbase.hstore.compaction.ratio
21. HBase Compaction Algorithm
•Compact a set of files when:
$\text{size of the largest file} \le \text{sum of the sizes of smaller files} \times \text{compaction ratio}$
•$\text{HBase ratio} = \dfrac{1}{\text{Accumulo ratio} - 1}$
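The conversion follows from splitting the Accumulo sum into the largest file plus the smaller files: largest × r ≤ largest + smaller, so largest × (r - 1) ≤ smaller. The sketch below checks that the two rules agree on a few file sets. It assumes both rules are applied to the same candidate set, and the function names are illustrative, not part of either project's API.

```python
from fractions import Fraction


def hbase_ratio_from_accumulo(accumulo_ratio):
    """Convert an Accumulo-style ratio (must be > 1) to the HBase-style ratio."""
    # Fractions keep the comparison exact at the boundary; a ratio of 1 would
    # divide by zero, because no finite HBase ratio corresponds to it.
    return 1 / (Fraction(accumulo_ratio) - 1)


def accumulo_rule(sizes, ratio):
    """largest * ratio <= sum of all files"""
    return max(sizes) * ratio <= sum(sizes)


def hbase_rule(sizes, ratio):
    """largest <= sum of the smaller files * ratio"""
    largest = max(sizes)
    return largest <= (sum(sizes) - largest) * ratio


accumulo_ratio = 3
hbase_ratio = hbase_ratio_from_accumulo(accumulo_ratio)  # Fraction(1, 2)
for sizes in ([1, 1, 1], [3, 1, 1, 1], [3, 3, 1, 1, 1], [9, 3, 3]):
    assert accumulo_rule(sizes, accumulo_ratio) == hbase_rule(sizes, hbase_ratio)
```

An Accumulo ratio of 3 therefore corresponds to an HBase ratio of 0.5 under this formula.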
22. Other Compaction-related Properties
•Accumulo
–table.file.max
–tserver.compaction.major.thread.files.open.max
–tserver.compaction.major.delay
–table.compaction.major.everything.idle
•HBase
–hbase.hstore.compactionThreshold
–hbase.hstore.blockingStoreFiles
–hbase.hstore.blockingWaitTime
–hbase.hstore.compaction.min
–hbase.hstore.compaction.max
–hbase.hstore.compaction.min.size
–hbase.hstore.compaction.max.size
23. Accumulo Splitting
•Always check to see if a split is needed before compacting
•If it is needed, split first
•File names stored in metadata table
[Figure: split threshold]
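The "split first, then compact" ordering can be sketched as a small decision function. Two points are assumptions rather than statements from the slides: a split is taken to be needed once the tablet's files together exceed the split threshold, and the compaction test is the ratio rule from the earlier slide. The function name and return values are hypothetical.

```python
def next_action(file_sizes, split_threshold, ratio):
    """Decide what a tablet should do next: 'split', 'compact', or 'none'."""
    total = sum(file_sizes)
    # Assumption: a split is needed when the total file size exceeds the threshold.
    if total > split_threshold:
        return "split"  # splitting takes priority over compacting
    if len(file_sizes) > 1 and max(file_sizes) * ratio <= total:
        return "compact"
    return "none"


print(next_action([4, 4, 4], split_threshold=10, ratio=3))  # split
print(next_action([1, 1, 1], split_threshold=10, ratio=3))  # compact
print(next_action([3, 1], split_threshold=10, ratio=3))     # none
```

Checking for a split first avoids rewriting files for a tablet that is about to be divided.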
24. Accumulo Splitting Process
•Tablet closed, no new writes
•Three writes to the metadata table
–tablet made smaller & marked as splitting
–new tablet added
–original tablet's splitting marks removed
•Tablet server swaps new tablets for old tablet in its online tablet list
•Master informed
25. Accumulo Splitting Recovery
•Whenever a tablet is brought online, the tablet server checks to see if it has split marks.
•If so, it assumes the splitting process was interrupted and finishes making changes to the metadata table.
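The three metadata writes and the recovery check can be illustrated with a toy model. Everything here is hypothetical: the metadata table is a plain dictionary, the split mark stores what is needed to finish the split, and the `fail_after` parameter exists only to mimic a crash between writes. It is not Accumulo's actual schema or code.

```python
def split_tablet(metadata, tablet_id, new_id, split_row, fail_after=None):
    """Apply the three metadata writes; stop after `fail_after` writes to mimic a crash."""
    start, end = metadata[tablet_id]["range"]
    # Write 1: shrink the original tablet and mark it as splitting.
    metadata[tablet_id] = {
        "range": (start, split_row),
        "splitting": (new_id, split_row, end),  # enough information to finish later
    }
    if fail_after == 1:
        return
    # Write 2: add the new tablet covering the rest of the range.
    metadata[new_id] = {"range": (split_row, end), "splitting": None}
    if fail_after == 2:
        return
    # Write 3: remove the splitting mark from the original tablet.
    metadata[tablet_id]["splitting"] = None


def recover_on_load(metadata, tablet_id):
    """When a tablet comes online, finish an interrupted split; return True if work was done."""
    mark = metadata[tablet_id]["splitting"]
    if mark is None:
        return False
    new_id, split_row, end = mark
    # Redo write 2 only if it is missing, then perform write 3.
    metadata.setdefault(new_id, {"range": (split_row, end), "splitting": None})
    metadata[tablet_id]["splitting"] = None
    return True


metadata = {"t1": {"range": ("a", "z"), "splitting": None}}
split_tablet(metadata, "t1", "t2", "m", fail_after=1)  # crash after the first write
recover_on_load(metadata, "t1")
print(metadata["t1"]["range"], metadata["t2"]["range"])  # ('a', 'm') ('m', 'z')
```

Because the mark is written first and removed last, a crash at any point leaves enough information for the next load to complete the split.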
26. Hortonworks Data Platform
Reduce risks and cost of adoption
Lower the total cost to administer and provision
Integrate with your existing ecosystem
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services opens the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future proof your cluster growth
27. Hortonworks Training
The expert source for Apache Hadoop training & certification
Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team.
– The “right” course, with the most extensive and realistic hands-on materials
– Provide an immersive experience into real-world Hadoop scenarios
– Public and Private courses available
Comprehensive Apache Hadoop
28. Next Steps?
1 Download Hortonworks Data Platform
hortonworks.com/download
2 Use the getting started guide
hortonworks.com/get-started
3 Learn more… get support
• Expert role based training
• Course for admins, developers and operators
• Certification program
• Custom onsite options
hortonworks.com/training
Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
hortonworks.com/support
Editor's Notes
Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale on commodity hardware and/or in a cloud environment.
As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.
Run through the points on left…