4. 4
Data vs Disk
Put some data into the database
How much is written to disk?
INSERT INTO tbl1 VALUES ('foo')
amplification =
size of data
size of data on disk
6. 6
Amplification in InnoDB
● B*-tree
● Read amplification
– Assume random data lookup
– Locate the page, read it
– Root page is cached
● ~1 disk read for the leaf page
– Read amplification is ok
7. 7
Write amplification in InnoDB
● Locate the page to update
● Read; modify; write
– Write more if we caused a page split
● One page write per update!
– page_size / sizeof(int) = 16K/8
● Write amplification is an issue.
8. 8
Space amplification in InnoDB
● Page fill factor <100%
– to allow for updates
● Compression is done per-page
– Compressing bigger portions would be
better
– Page alignment
● Compressed to 3K ? Still on 4k page.
● => Space amplification is an issue
9. 9
InnoDB amplification summary
● Read amplification is ok
● Write and space amplification is an issue
– Saturated IO
– SSDs wear out faster
– Need more space on SSD
● => Low storage efficiency
11. 11
Log-structured merge tree
● Writes go to
– MemTable
– Log (for durablility)
● When MemTable is full, it is
flushed to disk
● SST= SortedStringTable
– Is written sequentially
– Allows for lookups
MemTableWrite
Log SST
MemTable
12. 12
Log-structured merge tree
● Writing produces
more SSTs
● SSTs are immutable
● SSTs may have
multiple versions of
data
MemTableWrite
Log SST
MemTable
SST ...
13. 13
Reads in LSM tree
● Need to merge the
data on read
– Read amplification
suffers
● Should not have too
many SSTs.
MemTable Read
Log SST
MemTable
SST ...SST
14. 14
Compaction
● Merges multiple SSTs into one
● Removes old data versions
● Reduces the number of files
● Write amplification++
SST SST . . .SST
SST
15. 15
LSM Tree summary
● LSM architecture
– Data is stored in the log, then SST files
– Writes to SST files are sequential
– Have to look in several SST files to read the data
● Efficiency
– Write amplification is reduced
– Space amplification is reduced
– Read amplification increases
17. 17
RocksDB
● “An embeddable key-value store for
fast storage environments”
● LSM architecture
● Developed at Facebook (2012)
● Used at Facebook and many other companies
● It’s a key-value store, not a database
– C++ Library (local only)
– NoSQL-like API (Get, Put, Delete, Scan)
– No datatypes, tables, secondary indexes, ...
18. 18
MyRocks
● A MySQL storage engine
● Uses RocksDB for storage
● Implements a MySQL storage engine
on top
– Secondary indexes
– Data types
– SQL transactions
– …
● Developed* and used by Facebook
– *-- with some MariaDB involvement
MyRocks
RocksDB
MySQL
InnoDB
24. 24
Another write amplification test
● InnoDB
vs
● MyRocks serving 2x
data
InnoDB 2x RocksDB
0
4
8
12
Flash GC
Binlog / Relay log
Storage engine
25. 25
A real efficiency test
For a certain workload we could reduce the number of MySQL
servers by 50%
– Yoshinori Matsunobu
26. - What is MyRocks
- MyRocks’ way into MariaDB
- Using MyRocks
27. 27
The source of MyRocks
● MyRocks is a part of Facebook’s MySQL tree
– github.com/facebook/mysql-5.6
● No binaries or packages
● No releases
● Extra features
● “In-house” experience
28. 28
MyRocks in MariaDB
● MyRocks is a part of MariaDB 10.2 and MariaDB 10.3
– MyRocks itself is the same across 10.2 and 10.3
– New related feature in 10.3: “Per-engine gtid_slave_pos”
● MyRocks is a loadable plugin with its own Maturity level.
● Releases
– April, 2017: MyRocks is Alpha (in MariaDB 10.2.5 RC)
– January, 2018: MyRocks is Beta
– February, 2018: MyRocks is RC
● Release pending
29. - What is MyRocks
- MyRocks’ way into MariaDB
- Using MyRocks
31. 31
MyRocks packages and binaries
Packages are available for modern distributions
● RPM-based (RedHat, CentOS)
● Deb-based (Debian, Ubuntu)
● Binary tarball
– mariadb-10.2.12-linux-systemd-x86_64.tar.gz
● Windows
● ...
32. 32
Installing MyRocks
● Install the plugin
– RPM
– DEB
– Bintar/windows:
apt install mariadb-plugin-rocksdb
MariaDB> install soname 'ha_rocksdb';
MariaDB> create table t1( … ) engine=rocksdb;
yum install MariaDB-rocksdb-engine
● Restart the server
● Check
● Use
MariaDB> SHOW PLUGINS;
34. 34
Data loading – good news
● It’s a write-optimized storage engine
● The same data in MyRocks takes less space on disk
than on InnoDB
– 1.5x, 3x, or more
● Can load the data faster than InnoDB
● But…
35. 35
Data loading – limitations
● Limitation: Transaction must fit in memory
mysql> ALTER TABLE big_table ENGINE=RocksDB;
ERROR 2013 (HY000): Lost connection to MySQL server during query
● Uses a lot of memory, then killed by OOM killer.
● Need to use special settings for loading data
– https://mariadb.com/kb/en/library/loading-data-into-myrocks/
– https://github.com/facebook/mysql-5.6/wiki/data-loading
36. 36
Sorted bulk loading
MariaDB> set rocksdb_bulk_load=1;
... # something that inserts the data in PK order
MariaDB> set rocksdb_bulk_load=0;
Bypasses
● Transactional locking
● Seeing the data you are inserting
● Compactions
● ...
37. 37
Unsorted bulk loading
MariaDB> set rocksdb_bulk_load_allow_unsorted=0;
MariaDB> set rocksdb_bulk_load=1;
... # something that inserts data in any order
MariaDB> set rocksdb_bulk_load=0;
● Same as previous but doesn’t require PK order
38. 38
Other speedups for lading
● rocksdb_commit_in_the_middle
– Auto commit every rocksdb_bulk_load_size rows
● @@unique_checks
– Setting to FALSE will bypass unique checks
39. 39
Creating indexes
● Index creation doesn’t need any special settings
mysql> ALTER TABLE myrocks_table ADD INDEX(...);
● Compression is enabled by default
– Lots of settings to tune it further
40. 40
max_row_locks as a safety setting
● Avoid run-away memory usage and OOM killer:
MariaDB> set rocksdb_max_row_locks=10000;
MariaDB> alter table t1 engine=rocksdb;
ERROR 4067 (HY000): Status error 10 received from RocksDB:
Operation aborted: Failed to acquire lock due to
max_num_locks limit
● This setting is useful after migration, too.
42. 42
Essential MyRocks tuning
● rocksdb_flush_log_at_trx_commit=1
– Same as innodb_flush_log_at_trx_commit=1
– Together with sync_binlog=1 makes the master crash-
safe
● rocksdb_block_cache_size=...
– Similar to innodb_buffer_pool_size
– 500Mb by default (will use up to that amount * 1.5?)
– Set higher if have more RAM
● Safety: rocksdb_max_row_locks=...
43. 43
Indexes
● Primary Key is the clustered index (like in InnoDB)
● Secondary indexes include PK columns (like in InnoDB)
● Non-unique secondary indexes are cheap
– Index maintenance is read-free
● Unique secondary indexes are more expensive
– Maintenance is not read-free
– Reads can suffer from read amplification
– Can use @@unique_check=false, at your own risk
44. 44
Charsets and collations
● Index entries are compared with memcmp(), “mem-comparable”
● “Reversible collations”: binary, latin1_bin, utf8_bin
– Can convert values to/from their mem-comparable form
● “Restorable collations”
– 1-byte characters, one weght per character
– e.g. latin1_general_ci, latin1_swedish_ci, ...
– Index stores mem-comparable form + restore data
● Other collations (e.g. utf8_general_ci)
– Index-only scans are not supported
– Table row stores the original value
‘A’ a 1
‘a’ a 0
45. 45
Charsets and collations (2)
● Using indexes on “Other” (collation) produces a warning
MariaDB> create table t3 (...
a varchar(100) collate utf8_general_ci, key(a)) engine=rocksdb;
Query OK, 0 rows affected, 1 warning (0.20 sec)
MariaDB> show warningsG
*************************** 1. row ***************************
Level: Warning
Code: 1815
Message: Internal error: Indexed column test.t3.a uses a collation that
does not allow index-only access in secondary key and has reduced disk
space efficiency in primary key.
47. 47
MyRocks only supports RBR
● MyRocks’ highest isolation level is “snapshot isolation”,
that is REPEATABLE-READ
● Statement-Based Replication
– Slave runs the statements sequentially (~ serializable)
– InnoDB uses “Gap Locks” to provide isolation req’d by SBR
– MyRocks doesn’t support Gap Locks
● Row-Based Replication must be used
– binlog_format=row
48. 48
Parallel replication
● Facebook/MySQL-5.6 is based on MySQL 5.6
– Parallel replication for data in different databases
● MariaDB has more advanced parallel slave
(controlled by @@slave_parallel_mode)
– Conservative (group-commit based)
– Optimistic (apply transactions in parallel, on conflict roll
back the transaction with greater seq_no)
– Aggressive
49. 49
Parallel replication (2)
● Conservative parallel
replication works with MyRocks
– Different execution paths for
log_slave_updates=ON|OFF
– ON might be faster
● Optimistic and aggressive do
not provide speedups over
conservative.
50. 50
mysql.gtid_slave_pos
●
mysql.gtid_slave_pos stores slave’s position
● Use a transactional storage engine for it
– Database contents will match slave position after crash
recovery
– Makes the slave crash-safe.
●
mysql.gtid_slave_pos uses a different engine?
– Cross-engine transaction (slow).
51. 51
Per-engine mysql.gtid_slave_pos
● The idea:
– Have mysql.gtid_slave_pos_${engine} for each
storage engine
– Slave position is the biggest position in all tables.
– Transaction affecting only MyRocks will only update
mysql.gtid_slave_pos_rocksdb
● Configuration:
--gtid-pos-auto-engines=engine1,engine2,...
52. 52
Per-engine mysql.gtid_slave_pos
● Available in MariaDB 10.3
● Thanks for the patch to
– Kristian Nielsen (implementation)
– Booking.com (request and funding)
● Don’t let your slave be 2-3x slower:
– 10.3: my.cnf: gtid-pos-auto-engines=INNODB,ROCKSDB
– 10.2: MariaDB> ALTER TABLE mysql.gtid_slave_pos
ENGINE=RocksDB;
53. 53
Replication takeaways
● Must use binlog_format=row
● Parallel replication works
(slave_parallel_mode=conservative)
● Set your mysql.gtid_slave_pos correctly
– 10.2: preferred engine
– 10.3: per-engine
55. 55
myrocks_hotbackup tool
● Takes a hot (non-blocking) backup of RocksDB data
mkdir /path/to/checkpoint-dir
myrocks_hotbackup --user=... --port=...
--checkpoint_dir=/path/to/checkpoint-dir | ssh "tar -xi ..."
myrocks_hotbackup --move_back
--datadir=/path/to/restore-dir
--rocksdb_datadir=/path/to/restore-dir/#rocksdb
--rocksdb_waldir= /path/to/restore-dir/#rocksdb
--backup_dir=/where/tarball/was/unpacked
● Making a backup
● Starting from backup
56. 56
myrocks_hotbackup limitations
● Targets MyRocks-only installs
– Copies MyRocks data and supplementary files
– Does *not* work for mixed MyRocks+InnoDB
setups
● Assumes no DDL is running concurrently with the
backup
– May get a wrong frm file
● Not very user-friendly
57. 57
Further backup considerations
● mariabackup does not work with MyRocks
currently
● myrocks_hotbackup is similar (actually simpler) than
xtrabackup
● Will look at making mariabackup work with MyRocks
60. 60
Conclusions
● MyRocks is a write-optimized storage engine based
on LSM-trees
– Developed and used by Facebook
● Available in MariaDB 10.2 and MariaDB 10.3
– Currently RC, Stable very soon
● It has some limitations and requires some tuning
– Easy to tackle if you were here for this talk :-)