4. 4
Data vs Disk
Put some data into the database
How much is written to disk?
INSERT INTO tbl1 VALUES ('foo')
amplification = (size of data on disk) / (size of data)
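As a quick illustration of this ratio, here is a minimal sketch that computes an amplification factor from two byte counts. The function name `amplification` and the example numbers are assumptions chosen for illustration, not measurements.

```python
def amplification(logical_bytes: int, physical_bytes: int) -> float:
    """Ratio of bytes that reach the disk to bytes of user data."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return physical_bytes / logical_bytes


# Hypothetical example: a 3-byte value ('foo') whose insert causes
# one 4096-byte write.
factor = amplification(logical_bytes=3, physical_bytes=4096)
print(f"amplification = {factor:.1f}x")
```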
6. 6
Amplification in InnoDB
● B+-tree
● Read amplification
– Assume random data lookup
– Locate the page, read it
– Root page is cached
● ~1 disk read for the leaf page
– Read amplification is ok
7. 7
Write amplification in InnoDB
● Locate the page to update
● Read; modify; write
– Write more if we caused a page split
● One page write per update!
– Amplification ≈ page_size / sizeof(int) when a single int changes
● Write amplification is an issue.
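The page_size / sizeof(int) estimate can be worked through with concrete numbers. This is a rough sketch under stated assumptions: a 4-byte int, exactly one whole-page write per update, and no log or page-split overhead. The helper name `page_write_amplification` is hypothetical.

```python
SIZEOF_INT = 4  # assumption: a 4-byte integer column


def page_write_amplification(page_size: int, changed_bytes: int) -> float:
    """Bytes written per byte changed when a whole page is rewritten."""
    return page_size / changed_bytes


# 4 KiB pages, and 16 KiB (the default InnoDB page size).
for page_size in (4096, 16384):
    factor = page_write_amplification(page_size, SIZEOF_INT)
    print(f"page_size={page_size}: {factor:.0f}x")
```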
8. 8
Space amplification in InnoDB
● Page fill factor <100%
– to allow for updates
● Compression is done per-page
– Compressing bigger portions would be better
– Page alignment
● Compressed to 3K? It still occupies a 4K page.
● => Space amplification is an issue
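The page-alignment effect can be sketched as rounding the compressed size up to a whole number of pages. The function names and the 4096-byte default are assumptions for illustration; real engines have additional per-page overhead that this ignores.

```python
import math


def on_disk_size(compressed_bytes: int, page_size: int = 4096) -> int:
    """Round a compressed size up to a whole number of pages."""
    return math.ceil(compressed_bytes / page_size) * page_size


def space_amplification(compressed_bytes: int, page_size: int = 4096) -> float:
    """Bytes occupied on disk per byte of compressed data."""
    return on_disk_size(compressed_bytes, page_size) / compressed_bytes


# A page compressed to 3K still occupies a full 4K page.
print(on_disk_size(3 * 1024))
print(round(space_amplification(3 * 1024), 2))
```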
9. 9
InnoDB amplification summary
● Read amplification is ok
● Write and space amplification are issues
– Saturated IO
– Faster SSD wear out
– Need more space on SSD
● => Low storage efficiency
11. 11
Log-structured merge tree
● Writes go to
– Log
– MemTable
● MemTable is flushed to a SortedStringTable (SST)
● Writing 2x
– But only useful data
[Diagram: Write → Log and MemTable; MemTable → SST]
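The write path on this slide can be sketched as a toy in-memory model. The class name `TinyLSM`, the `memtable_limit` parameter, and the choice to drop the log after a flush are assumptions for illustration. This is not how RocksDB is implemented.

```python
class TinyLSM:
    """Toy model of the LSM write path: log + MemTable, flushed to SSTs."""

    def __init__(self, memtable_limit: int = 2):
        self.log = []        # append-only log of recent writes
        self.memtable = {}   # in-memory table holding the newest writes
        self.ssts = []       # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.log.append((key, value))  # write 1: the log
        self.memtable[key] = value     # write 2: the MemTable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SST is written once, in key order, and never modified.
        self.ssts.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Assumption: the log is no longer needed once the data is in an SST.
        self.log = []


db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)  # the MemTable reaches its limit and is flushed
db.put("c", 3)
print(db.ssts, db.memtable)
```

Each value is written twice (log and SST), but both writes contain only useful data rather than a whole page.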
12. 12
Log-structured merge tree
● Writing produces more SSTs
● SSTs are immutable
● SSTs may have multiple versions of data
[Diagram: Write → Log and MemTable; MemTable → SST, SST, ...]
13. 13
Reads in LSM tree
● Need to merge the data on read
– Read amplification suffers
● Should not have too many SSTs.
[Diagram: Read ← MemTable and SST, SST, ... SST]
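To see why reads suffer, here is a minimal sketch of a point lookup that checks the MemTable first and then the SSTs from newest to oldest. Modelling SSTs as plain dicts and counting "probes" are simplifying assumptions. Real implementations use indexes and filters to skip files.

```python
def lsm_get(key, memtable, ssts):
    """Return (value, number of SSTs probed); ssts are ordered oldest first."""
    if key in memtable:
        return memtable[key], 0
    probes = 0
    for sst in reversed(ssts):  # newest SST first, so the newest version wins
        probes += 1
        if key in sst:
            return sst[key], probes
    return None, probes


memtable = {"c": 3}
ssts = [{"a": 1, "b": 1}, {"a": 2}]  # "a" exists in two versions

print(lsm_get("c", memtable, ssts))
print(lsm_get("a", memtable, ssts))
print(lsm_get("b", memtable, ssts))
print(lsm_get("z", memtable, ssts))
```

A lookup for a missing or old key has to probe every SST, which is why the number of SSTs must stay small.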
14. 14
Compaction
● Merge multiple SSTs into one
● Removes old data versions
● Reduces the number of files
● Write amplification++ :-(
[Diagram: SST, SST, ... SST → one SST]
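Compaction can be sketched as merging several SSTs into one while keeping only the newest version of each key. SSTs are modelled as dicts ordered oldest first, and the function name `compact` is an assumption for illustration.

```python
def compact(ssts):
    """Merge SSTs (ordered oldest first) into one sorted SST."""
    merged = {}
    for sst in ssts:        # later (newer) SSTs overwrite older versions
        merged.update(sst)
    return dict(sorted(merged.items()))


old_ssts = [{"a": 1, "b": 1}, {"a": 2, "c": 5}]
new_sst = compact(old_ssts)
print(new_sst)

# Every surviving entry is written to disk again: this is the extra
# write amplification that compaction costs.
print(len(new_sst), "entries rewritten")
```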
15. 15
Compaction considerations
● Find the sweet spot
– Reduce # SSTs
– Don’t compact too often
● Be efficient
– Compact files of similar size
– Remove duplicate versions as soon as possible
● Many strategies
– Leveled
– Size-tiered
– ...
[Diagram: SST, SST, ... SST → one SST]
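"Compact files of similar size" can be illustrated with a simplified size-based grouping. This sketch is loosely inspired by size-tiered strategies but is not the algorithm of any particular engine. The `ratio` and `min_files` parameters are hypothetical.

```python
def pick_similar_sized(sizes, ratio=2.0, min_files=3):
    """Group file sizes into buckets whose members are within `ratio`
    of the bucket's smallest file; return buckets worth compacting."""
    buckets = []
    for size in sorted(sizes):
        if buckets and size <= buckets[-1][0] * ratio:
            buckets[-1].append(size)
        else:
            buckets.append([size])
    # Only compact when enough similar files have accumulated, so that
    # compaction does not run too often.
    return [bucket for bucket in buckets if len(bucket) >= min_files]


candidates = pick_similar_sized([100, 12, 900, 10, 110, 15])
print(candidates)
```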
20. 20
LSM Tree summary
● LSM architecture
– Data is stored in log, then SST files
– Writes to SST files are sequential, efficient
● Better compression
– Have to read from multiple SST files
– Compaction process merges SST files
● Efficiency
– Write amplification is reduced
– Space amplification is reduced
– Read amplification increases
22. 22
RocksDB
● “An embeddable key-value store for fast storage environments”
● Uses LSM architecture
– Leveled compaction
– Server-grade
● Initially a fork of LevelDB
● Developed at Facebook
– First release in 2012
– Used at Facebook and many other companies
23. 23
RocksDB properties
● Embedded library
● Stores (key, value) pairs
– No data types
– No secondary indexes
– No SQL-like tables
● Column Families = tablespaces
● No replication support
– There is a third-party add-on
● Efficient, but hard to work with
25. 25
MyRocks
● A MySQL storage engine
● Uses RocksDB for storage
● Implements a MySQL storage engine on top
– Secondary indexes
– Data types
– SQL transactions
– …
● Developed and used by Facebook
– With some MariaDB involvement in development
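To give a feel for what "secondary indexes on top of a key-value store" means, here is a minimal sketch that turns one row into key-value pairs. The key layout below is purely hypothetical and is not MyRocks' actual on-disk format. The function name `row_to_kv` is also an assumption.

```python
def row_to_kv(table, pk, row, indexed_cols):
    """Map one SQL-like row to (key, value) pairs in a flat KV store."""
    kv = {}
    # Primary key entry: the key identifies the row, the value holds it.
    kv[(table, "PRIMARY", pk)] = row
    # One extra entry per secondary index: the indexed value plus the
    # primary key form the key; nothing else needs to be stored.
    for col in indexed_cols:
        kv[(table, col, row[col], pk)] = b""
    return kv


pairs = row_to_kv("tbl1", 1, {"name": "foo", "age": 30}, indexed_cols=["name"])
for key, value in pairs.items():
    print(key, "->", value)
```

A lookup by `name` then becomes a prefix scan over keys starting with `("tbl1", "name", "foo")`, followed by a fetch of the primary entry.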
34. 34
CPU usage is higher with MyRocks
[Chart: CPU usage (0% to 100%) over time; series: InnoDB and 2x RocksDB; labeled levels: 80% and 50-60%]
35. 35
MyRocks limitations
● Transactional storage engine
– REPEATABLE READ, READ COMMITTED
– No SERIALIZABLE
● Must use row-based replication
● No cross-engine transactions
● Transaction must fit in memory
● Online DDL more limited than InnoDB
36. 36
MyRocks availability
● Part of github.com/facebook/mysql-5.6
● No binaries
● No packages
● Facebook’s branch of MySQL
– Special extensions
– Special ways to compile, run tests, etc
● Not easy to use
38. 38
MyRocks in MariaDB
● New technology
● Built and used at Facebook’s scale
● Adoption
● Packaging
● Community
● MariaDB features
● ...
39. 39
Getting MyRocks into MariaDB
● Port MyRocks into MariaDB 10.2
– Decouple it from facebook/mysql-5.6 features
– Make it work with MariaDB’s features
● Set up a merge process
– Need to follow Facebook’s progress
● Set up the process to build packages
● Documentation
● Expertise
● ...
40. 40
Getting MyRocks into MariaDB
● Port MyRocks into MariaDB 10.2
– Decouple it from facebook/mysql-5.6 features
– Make it work with MariaDB’s features
● Set up a merge process
– Need to follow Facebook’s progress
● Set up the process to build packages
● Documentation
● Expertise
● ...
[Status marks from the slide, in extraction order: ✘ ✔ ✘ ✘✔ ✔ ✔ ✔]
41. 41
Current status
● “MariaDB 10.2.5 RC includes an ALPHA version of MyRocks storage engine”
● It’s a loadable plugin (ha_rocksdb.so)
● Packages
– Bintar, deb, rpm, win64 zip + MSI
– Only for recent OS versions, because RocksDB requires a recent compiler
● Not all features work yet
– Optimizer and SQL features work
– Replication/binlog features don’t work yet.
42. 42
“Is it stable?”
● The components are stable
– (MyRocks + RocksDB) are run in
production @ Facebook
– RocksDB is also used elsewhere
– MyRocks: not much, yet.
● Connections with MariaDB
– Some are stable
– Some are [nearly] missing
[Diagram: components MyRocks, MariaDB, RocksDB]
43. 43
Plans
● Finish the missing pieces
– Storage Engine + binlog
● Improve support for multiple SEs
– MDEV-12179
● Increase maturity
– Pass the tests
– More test coverage
– Benchmarks
● Documentation
[Diagram: components MyRocks, MariaDB, RocksDB]