3. Why?
● SSTables immutable
● Get rid of duplicate/overwritten data
● Drop deleted data and tombstones
4. When?
● Manually, nodetool compact / scrub ...
● When we add sstables
○ After flush
○ Once a compaction is done
○ After streaming
● Search for usages of
○ o.a.c.db.compaction.CompactionManager#submitBackground
5. Types of compaction
● Minor - runs automatically in the background
● Major - includes all sstables, only for size tiered compaction
● Single-sstable compactions
○ upgradesstables
○ scrub
○ cleanup
● Anticompaction
○ After incremental repair to split out repaired/unrepaired data
6. Compaction strategies
● Pluggable interface
● Strategies decide
○ what sstables to compact
○ how big they should be
○ what implementation of CompactionTask to use
● Strategies can get notified when adding new sstables
○ Makes it possible to make smarter decisions about which sstables to compact
○ LCS does this to keep track of what sstables are in each level
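To make the pluggable interface concrete, here is a minimal sketch of what a strategy has to decide; SSTable, Task and the method names below are simplified stand-ins, not Cassandra's real signatures:

```java
import java.util.List;
import java.util.Set;

// Simplified stand-ins, not Cassandra's real classes.
class SSTable { long sizeBytes; int level; }
class Task { final List<SSTable> inputs; Task(List<SSTable> inputs) { this.inputs = inputs; } }

interface CompactionStrategy
{
    // Decide which sstables to compact next; an empty list means nothing to do.
    List<SSTable> nextBackgroundCandidates();

    // Decide what implementation of the compaction task should run the merge.
    Task taskFor(List<SSTable> candidates);

    // Notifications when sstables are added/removed (flush, compaction, streaming),
    // so a strategy can keep state -- e.g. LCS tracks which sstables sit in each level.
    void sstableAdded(SSTable added);
    void sstablesRemoved(Set<SSTable> removed);
}
```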
8. LeveledCompactionStrategy
● Keeps levels of non-overlapping sstables
● Each level is 10x the size of the previous one
● All sstables in levels 1+ are about the same size (160MB)
● L0 is the dumping ground: overlapping, often larger sstables
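As a back-of-the-envelope check of the 10x rule, a tiny sketch (the constant matches the 160MB default above; everything else is illustrative):

```java
public class LcsLevelSizes
{
    // 160MB default sstable size from the slide; illustrative only.
    static final long SSTABLE_BYTES = 160L * 1024 * 1024;

    // Level n targets roughly 10^n sstables, so each level is 10x the previous.
    static long levelTargetBytes(int level)
    {
        return SSTABLE_BYTES * (long) Math.pow(10, level);
    }

    public static void main(String[] args)
    {
        for (int level = 1; level <= 3; level++)
            System.out.printf("L%d ~ %,d MB%n", level, levelTargetBytes(level) >> 20);
        // L1 ~ 1,600 MB; L2 ~ 16,000 MB; L3 ~ 160,000 MB
    }
}
```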
9. Tombstones
● Write a tombstone to delete data
● Covers data, but only data that is older than the tombstone
● Drop covered data during compaction
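A minimal sketch of the covering rule, with made-up Cell/Tombstone stand-ins rather than Cassandra's real types:

```java
// Cell and Tombstone are simplified stand-ins, not Cassandra's real types.
class Cell { final long timestamp; Cell(long ts) { timestamp = ts; } }
class Tombstone { final long timestamp; Tombstone(long ts) { timestamp = ts; } }

class Shadowing
{
    // A tombstone covers a cell only if the cell was written at or before the
    // delete; anything written after the delete survives it.
    static boolean covers(Tombstone t, Cell c)
    {
        return c.timestamp <= t.timestamp;
    }

    public static void main(String[] args)
    {
        Tombstone delete = new Tombstone(100);
        System.out.println(covers(delete, new Cell(90)));  // true: older data gets dropped
        System.out.println(covers(delete, new Cell(110))); // false: newer write survives
    }
}
```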
10. When can we drop tombstones?
● Once the tombstone is older than gc_grace_seconds
● When the tombstone is guaranteed to not cover any data on the node
○ All sstables containing the key are included in the compaction
○ The other sstables where the key exists only contain newer data
13. CompactionManager
● submitBackground
○ Trigger minor compaction
○ Fill executor with BackgroundCompactionTasks
● BackgroundCompactionTask
○ Asks the strategy for the next compaction task and runs it
● submitMaximal
○ Major compaction
○ Not blocking, get() the future to block
○ runWithCompactionsDisabled
● OneSSTableOperation
○ Common way to run the single-sstable compactions in parallel
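Roughly how the background submission loop fits together; this is a simplified sketch, not the real CompactionManager code, and nextTask stands in for asking the strategy for work:

```java
import java.util.Optional;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class BackgroundCompactions
{
    final ExecutorService executor = Executors.newFixedThreadPool(2);
    // Stand-in for asking the strategy for its next background task;
    // empty means the strategy sees nothing worth compacting right now.
    final Supplier<Optional<Runnable>> nextTask;

    BackgroundCompactions(Supplier<Optional<Runnable>> nextTask)
    {
        this.nextTask = nextTask;
    }

    // Called whenever sstables are added (after flush, compaction, streaming).
    void submitBackground()
    {
        executor.submit(() -> {
            Optional<Runnable> task = nextTask.get();
            if (!task.isPresent())
                return;             // nothing to do
            task.get().run();       // the actual compaction
            submitBackground();     // re-check: one compaction can enable another
        });
    }
}
```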
14. CompactionTask
● Gets executed in the CompactionExecutor and does the actual compacting
● Eventually calls runWith(..), which is where the magic happens
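The rough shape of that compaction loop, with heavily simplified stand-in types; the real runWith(..) also deals with disk space, early opening and metadata:

```java
import java.util.Iterator;

// Heavily simplified stand-ins, not Cassandra's real classes.
interface Row { boolean fullyPurged(); }
interface Writer { void append(Row merged); void finish(); }

class CompactionPipeline
{
    // 'merged' would be a MergeIterator built from one scanner per input
    // sstable (slides 20-22): each next() is one merged partition, with
    // duplicates collapsed and droppable tombstones already removed.
    static void runWith(Iterator<Row> merged, Writer writer)
    {
        while (merged.hasNext())
        {
            Row row = merged.next();
            if (!row.fullyPurged())   // rows reduced to nothing are skipped
                writer.append(row);
        }
        writer.finish();              // seal data, index and metadata files (slide 18)
    }
}
```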
16. CompactionController
● Keep track of overlapping sstables
○ Is the currently compacting key in any other sstable?
● maxPurgeableTimestamp(DecoratedKey key)
○ How old tombstones do we need to keep?
○ Worst case, currently compacting key is the oldest in that sstable
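Putting slides 10 and 16 together, a sketch of the purge decision; the parameter names are illustrative, following the slide's maxPurgeableTimestamp idea rather than the exact Cassandra code:

```java
class PurgeCheck
{
    // localDeletionTime: seconds timestamp of when the tombstone was created.
    // gcBefore: "now" minus gc_grace_seconds.
    // maxPurgeableTimestamp: per-key answer from the CompactionController.
    static boolean canDrop(long tombstoneTimestamp,
                           int localDeletionTime,
                           int gcBefore,
                           long maxPurgeableTimestamp)
    {
        boolean gcGraceElapsed = localDeletionTime < gcBefore;
        // If any sstable outside this compaction could still hold data older
        // than the tombstone, the tombstone has to survive to keep covering it.
        boolean coversNothingElsewhere = tombstoneTimestamp < maxPurgeableTimestamp;
        return gcGraceElapsed && coversNothingElsewhere;
    }
}
```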
18. SSTableWriter
● Writes sstables…
● Give it rows and it writes the index, data file, sstable metadata files, etc.
● openEarly(..)
○ link index and data files
○ in-memory-fake the rest of the files
● Collect SSTable metadata
19. SSTable metadata
● Collected whenever an sstable is written
● StatsMetadata
○ Kept on-heap
○ min/maxTimestamp
○ min/maxColumnNames
○ sstableLevel
● CompactionMetadata
○ Deserialized when needed
○ ancestors
○ cardinalityEstimator - HyperLogLog signature
● ValidationMetadata
○ Used to validate sstables when opening
20. Iterators all the way down
[Diagram: compacting two sstables]
Before: a: 1 2 3 | a: 2 5 7 | b: 2 3 5 | b: 2 4 5 | d: .. | e: ..
After: a: 1 2 3 5 7 | b: 2 3 4 5 | d: .. | e: ..
● “Partition iterator” for each sstable (SSTableScanner)
● “Cell iterator” for each partition (OnDiskAtomIterator)
● MergeIterator (MI) that takes a number of (sorted) iterators and merges them
● One MI for sstables that merges partitions
● One MI for each partition that merges cells
21. MergeIterator
● Interesting implementation is ManyToOne
● Merges many sorted iterators into one
● Reducer
○ reduce(..) gets called for every version that should be reduced
○ getReduced() gets called when all versions with the same name/priority/value have been reduce():ed
22. MergeIterator
1. Call next()
2. Poll one item out of the PQ
3. Reducer.reduce(..)
4. Goto 2, until we find an item that differs
5. Call next() on the iterators you polled
6. Re-add the iterators to the PQ
7. Return Reducer.getReduced()
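A sketch of that loop as a generic many-to-one merge; Candidate and Reducer below are simplified illustrations, not Cassandra's exact classes:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ManyToOne<In, Out> implements Iterator<Out>
{
    // Each queue entry is one source iterator plus its current head item.
    static class Candidate<In>
    {
        final Iterator<In> source;
        In head;
        Candidate(Iterator<In> source) { this.source = source; }
        boolean advance() { head = source.hasNext() ? source.next() : null; return head != null; }
    }

    interface Reducer<In, Out>
    {
        void reduce(In version);  // called once per equal item
        Out getReduced();         // called when all equal items are consumed
    }

    final PriorityQueue<Candidate<In>> queue;
    final Comparator<In> comparator;
    final Reducer<In, Out> reducer;

    ManyToOne(List<Iterator<In>> sources, Comparator<In> comparator, Reducer<In, Out> reducer)
    {
        this.comparator = comparator;
        this.reducer = reducer;
        this.queue = new PriorityQueue<>((a, b) -> comparator.compare(a.head, b.head));
        for (Iterator<In> source : sources)
        {
            Candidate<In> c = new Candidate<>(source);
            if (c.advance())
                queue.add(c);
        }
    }

    public boolean hasNext() { return !queue.isEmpty(); }

    public Out next()
    {
        // Steps 2-4: poll equal heads and feed every version to the reducer.
        List<Candidate<In>> polled = new ArrayList<>();
        Candidate<In> first = queue.poll();
        polled.add(first);
        reducer.reduce(first.head);
        while (!queue.isEmpty() && comparator.compare(queue.peek().head, first.head) == 0)
        {
            Candidate<In> same = queue.poll();
            polled.add(same);
            reducer.reduce(same.head);
        }
        // Steps 5-6: advance the polled iterators and put them back.
        for (Candidate<In> c : polled)
            if (c.advance())
                queue.add(c);
        // Step 7: hand back the merged result.
        return reducer.getReduced();
    }
}
```

Nesting one such merge per partition (for cells) inside a top-level merge over the SSTableScanners (for partitions) gives the structure from slide 20.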
24. LazilyCompactedRow
● “Lazy” because we don’t deserialize until we need to
● Uses a MergeIterator to merge the rows
● Drops tombstones if possible
○ Uses CompactionController for this