9. TAKEAWAYS
• All disk writes are sequential, append-
only operations
• On-disk tables (SSTables) are written in
sorted order, so compaction is linear
complexity O(N)
• SSTables are completely immutable
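Why sorted, immutable tables make compaction a single linear pass can be sketched in a few lines. This is a minimal illustration, not any real storage engine's code: the `(key, timestamp, value)` tuple layout and the `compact` function are assumptions made up for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def compact(*sstables):
    """Merge sorted SSTables into one sorted SSTable in a single pass.

    Each input is a list of (key, timestamp, value) tuples sorted by key
    (a hypothetical layout). When a key appears in several inputs, the
    version with the newest timestamp wins.
    """
    # Streaming k-way merge: inputs are read front to back, never re-sorted.
    merged = heapq.merge(*sstables, key=itemgetter(0))
    output = []
    for _key, versions in groupby(merged, key=itemgetter(0)):
        output.append(max(versions, key=itemgetter(1)))  # keep newest version
    return output

older = [("a", 1, "apple"), ("c", 1, "cherry")]
newer = [("b", 2, "banana"), ("c", 2, "cranberry")]
compacted = compact(older, newer)
# compacted == [("a", 1, "apple"), ("b", 2, "banana"), ("c", 2, "cranberry")]
```

Every input row is read once and every output row is appended once, so the work grows linearly with the number of rows N (the heap adds a log k factor for k input tables, which is small in practice). The inputs are never modified; the result would be written out as a new table.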
10. TAKEAWAYS
IMPORTANT
11. COMPARED
• Most popular data storage engines
rewrite modified data in-place: MySQL
(InnoDB), PostgreSQL, Oracle,
MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of
writes before flushing to disk
• ... but flushes are RANDOM writes
12. SPINNING DISKS
• Dirt cheap: $0.08/GB
• Random access limited by the time it takes the
platter to rotate: IOPS ≈ RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best:
125MB/sec for modern SATA drives
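The gap between random and sequential access on a spinning disk can be worked out from the figures above. This is a rough sketch using the slide's rule of thumb; the 4KB I/O size is an assumption chosen for illustration, not a figure from the slide.

```python
def hdd_iops(rpm):
    """Rule of thumb from the slide: about one random I/O per rotation."""
    return rpm / 60

def random_throughput_mb_s(iops, io_size_kb=4):
    """Throughput when every I/O is a small random access.

    io_size_kb=4 is an assumed I/O size.
    """
    return iops * io_size_kb / 1024

iops_7200 = hdd_iops(7200)                        # 120.0
random_mb_s = random_throughput_mb_s(iops_7200)   # 0.46875 MB/sec
sequential_mb_s = 125                             # figure quoted on the slide
ratio = sequential_mb_s / random_mb_s             # roughly 267x
```

Under these assumptions a 7,200 RPM drive moves well under 1MB/sec of small random I/O, a few hundred times less than its sequential rate.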
14. 2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read,
23,000/sec write
• Latency of less than 100µs
• Good sequential throughput:
270MB/sec read, 205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
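The cost-per-IOPS comparison can be checked with simple arithmetic. The sketch below uses only numbers quoted on these slides; the helper name `cost_per_iops` is made up for the example, and since the slides do not say which spinning-disk capacity lies behind the $1.25 figure, the code only derives the drive price that figure implies.

```python
def cost_per_iops(price_per_gb, capacity_gb, iops):
    """Drive price divided by the I/O operations per second it can sustain."""
    return price_per_gb * capacity_gb / iops

# 300GB Intel 320 at ~$1.75/GB, using the quoted IOPS figures.
ssd_price = 1.75 * 300                                      # 525.0 dollars
ssd_cost_per_write_iops = cost_per_iops(1.75, 300, 23_000)  # about $0.023
ssd_cost_per_read_iops = cost_per_iops(1.75, 300, 39_500)   # about $0.013

# The slide's $1.25 per IOPS at ~120 IOPS implies a drive price of about $150.
implied_hdd_price = 1.25 * 120                              # 150.0 dollars
```

The $0.02 figure on the slide matches the write-IOPS calculation for the SSD.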
18. ... BUT
• Cannot overwrite directly: must erase
first, then write
• Can write in small increments (4KB),
but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC)
for each erase block
19. WEAR LEVELING is used
to spread erase operations
evenly across all blocks
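A back-of-the-envelope estimate shows what a per-block erase limit means for the drive as a whole. This is an idealized upper bound, assuming wear leveling spreads erases perfectly evenly; the function name and the `write_amplification` parameter (the ratio of data physically written to flash versus data written by the host) are assumptions for the example.

```python
def lifetime_host_writes_tb(capacity_gb, erase_cycles, write_amplification=1.0):
    """Idealized total host writes (in TB) before every block is worn out.

    Assumes perfectly even wear leveling, which real drives only approximate.
    """
    return capacity_gb * erase_cycles / write_amplification / 1000

# 300GB drive, ~5,000 cycles per erase block (figures from the slides).
best_case_tb = lifetime_host_writes_tb(300, 5_000)            # 1500.0 TB
# Same drive if every host write caused 10x as much flash writing
# (10 is an arbitrary illustrative factor).
amplified_tb = lifetime_host_writes_tb(300, 5_000, 10.0)      # 150.0 TB
```

The estimate makes one point: any extra flash writing per host write cuts the usable lifetime by the same factor.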
35. GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag on performance
• Modern SSDs do this in the
background... as much as possible
• If no empty blocks are available, GC
must be done before ANY writes can
complete
36. WRITE AMPLIFICATION
• When only a few kilobytes are written,
but fragmentation causes a whole
block to be rewritten
• The smaller & more random the writes,
the worse this gets
• Modern “mark and sweep” GC reduces
it, but cannot eliminate it
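Write amplification can be put in numbers with a toy model built from the page and erase-block sizes quoted earlier (4KB and ~512KB). This is a simplified sketch, not a model of any particular drive's firmware; the function names are made up for the example.

```python
PAGE_KB = 4
ERASE_BLOCK_KB = 512
PAGES_PER_BLOCK = ERASE_BLOCK_KB // PAGE_KB  # 128 pages per erase block

def flash_kb_written(host_kb, valid_pages_relocated):
    """KB physically written: host data plus valid pages GC had to copy."""
    return host_kb + valid_pages_relocated * PAGE_KB

def write_amplification(host_kb, flash_kb):
    """Ratio of data written to flash to data written by the host."""
    return flash_kb / host_kb

# Worst case: a single 4KB update forces the other 127 still-valid pages
# of the block to be rewritten as well.
worst = write_amplification(4, flash_kb_written(4, PAGES_PER_BLOCK - 1))  # 128.0

# Sequential case: a whole block of new data, nothing valid to relocate.
sequential = write_amplification(512, flash_kb_written(512, 0))           # 1.0
```

In this model the factor ranges from 1 (large sequential writes) up to 128 (small random writes into full blocks), which is why the write pattern matters so much.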
37. Torture test shows massive
write performance drop-off
for a heavily fragmented drive
Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
38. Some poorly designed drives
COMPLETELY fall apart
Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
39. Even a well-behaved drive
suffers significantly from the
torture test
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
40. Post-torture, all disk blocks
were marked empty, and the
“fast” comes back...
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
42. “TRIM”
• Filesystems typically don’t erase data
immediately when files are deleted; they just
mark it as deleted and erase later
• TRIM allows the OS to actively tell the drive
when a region of disk is no longer used
• If an entire erase block is marked as
unused, GC is avoided; otherwise TRIM
just hastens the collection process
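The effect of TRIM on garbage collection can be illustrated with a toy erase block. This is a simplified model under stated assumptions: the `EraseBlock` class, its methods, and the file names are hypothetical and do not correspond to any real drive or OS interface.

```python
from dataclasses import dataclass, field

@dataclass
class EraseBlock:
    # page index -> id of the file the drive believes is stored there
    pages: dict = field(default_factory=dict)

    def trim(self, file_id):
        """The OS tells the drive that pages of file_id are no longer in use."""
        self.pages = {i: f for i, f in self.pages.items() if f != file_id}

    def gc_copy_cost(self):
        """Number of pages GC must copy elsewhere before erasing this block."""
        return len(self.pages)

# A block filled entirely by one file that the filesystem has since deleted.
block = EraseBlock({i: "log.1" for i in range(128)})
cost_without_trim = block.gc_copy_cost()  # 128: drive still treats all pages as live
block.trim("log.1")
cost_with_trim = block.gc_copy_cost()     # 0: block can be erased with no copying

# A block shared with live data: TRIM shrinks the copy work but cannot remove it.
mixed = EraseBlock({i: ("log.1" if i < 96 else "data.db") for i in range(128)})
mixed.trim("log.1")
cost_mixed = mixed.gc_copy_cost()         # 32 live pages still have to be moved
```

Without TRIM the drive has no way to know the deleted file's pages are garbage, so it would copy all of them before erasing.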
43. TRIM only reduces the
write amplification effect,
it can’t eliminate it.
49. TAKEAWAYS
• All disk writes are sequential, append-
only operations
• On-disk tables (SSTables) are written in
sorted order, so compaction is linear
complexity O(N)
• SSTables are completely immutable
51. “For a sequential write workload,
write amplification is equal to 1,
i.e., there is no write
amplification.”
Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write
Performance: Understanding, Analysis, and Performance Modeling”