Vinyl: why we wrote our own write-optimized storage engine rather than choosing RocksDB
Konstantin Osipov
kostja@tarantool.org
Plan
● A review of the log-structured merge tree data structure
● Vinyl engine in Tarantool
● Configuration parameters
● Key use cases
Why build, rather than buy: RocksDB, ForestDB, ...
[Diagram: server components around the storage engines]
● Transaction control: read your own writes, multi-engine, undo log
● Engine 1: indexing, checkpoint, storage mgmt
● Engine 2: ...
● Write ahead log: redo log, async relay, group replication
● I/O: session, parser
The shape of a classical LSM
DELETE in an LSM tree
● We can’t delete from append-only files
● Tombstones (delete markers) are inserted into L0 instead
[Figure, three steps: (1) a tombstone for key 26 arrives while key 26 is still stored in a run on disk (3 8 15 26 35 40 45 48); (2) the tombstone sits in memory next to newer keys; (3) after compaction the merged data no longer contains key 26.]
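The tombstone mechanism described on these slides can be illustrated with a minimal Python sketch. It is a toy model, not Vinyl code: the class `TinyLSM`, the `TOMBSTONE` sentinel, and all method names are hypothetical, and runs are plain dictionaries rather than files.

```python
TOMBSTONE = object()  # hypothetical delete marker


class TinyLSM:
    def __init__(self):
        self.memory = {}  # L0: newest data, mutable
        self.runs = []    # immutable runs "on disk", newest first

    def put(self, key, value):
        self.memory[key] = value

    def delete(self, key):
        # Runs are append-only, so record a marker instead of erasing.
        self.memory[key] = TOMBSTONE

    def dump(self):
        # Freeze L0 into a new immutable run.
        if self.memory:
            self.runs.insert(0, dict(sorted(self.memory.items())))
            self.memory = {}

    def get(self, key):
        # Check the newest level first; a tombstone hides older versions.
        for level in [self.memory] + self.runs:
            if key in level:
                value = level[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge every run, newest version wins. Because all runs take part,
        # tombstones have nothing left to hide and can be dropped.
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer overwrite
            merged.update(run)
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.runs = [live] if live else []


db = TinyLSM()
for k in (3, 8, 15, 26, 35):
    db.put(k, f"v{k}")
db.dump()
db.delete(26)
hidden = db.get(26)                # the tombstone in L0 hides the key
still_on_disk = 26 in db.runs[0]   # but the old version is physically there
db.dump()
db.compact()
keys_after = sorted(db.runs[0])
```

Until compaction runs, the deleted key still occupies space on disk, which is one source of the space amplification discussed later in the deck.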
SELECT in an LSM tree
● Search in all levels, until the key is found
● Optimistic for point queries
● Merge sort in case of range select
[Figure: GET(16) checks memory (22 37) and then each disk level in turn; key 16 is present only in the largest run, so every level is visited.]
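The "merge sort in case of range select" bullet can be sketched in Python with `heapq.merge` from the standard library. The function name `range_select`, the level layout (lists of `(key, value)` pairs, newest level first), and the sample data are assumptions made for illustration only.

```python
import heapq


def range_select(levels, lo, hi):
    """Return (key, value) pairs with lo <= key < hi.

    `levels` is a list of key-sorted lists of (key, value), newest first.
    When a key appears in several levels, the newest version wins.
    """
    def tagged(age, level):
        for key, value in level:
            if lo <= key < hi:
                yield (key, age, value)

    streams = [tagged(age, level) for age, level in enumerate(levels)]
    result = []
    last_key = object()  # sentinel that equals no real key
    for key, age, value in heapq.merge(*streams):
        # For equal keys the smallest age (newest level) comes first.
        if key != last_key:
            result.append((key, value))
            last_key = key
    return result


levels = [
    [(22, "a"), (37, "new")],                                       # memory
    [(10, "b"), (25, "c"), (37, "old"), (42, "d")],                 # small run
    [(3, "e"), (8, "f"), (15, "g"), (26, "h"), (35, "i"), (40, "j")],  # large run
]
rows = range_select(levels, 20, 40)
```

Every level contributes an iterator to the merge, so a range query pays for all levels even when most of the result comes from one of them.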
LSM: the algorithmic limit
Operation  LSM tree            B-tree
Search     K * O(log_2(N)/B)   O(log_B N)
Delete     O(log_2(N)/B)       O(log_B N)
Insert     O(log_2(N)/B)       O(log_B N)
RUM conjecture
The ubiquitous fight between the Read, the Update, and the Memory overhead of access methods for modern data systems
Key LSM challenges in Web/OLTP
● Slow reads: read amplification
● Potentially write the same data multiple times: write amplification
● Keep garbage around: space amplification
● Response times affected by background activity: latency spikes
Vinyl: memtable, sorted run, dump & compact
● Stores statements, not values: REPLACE, DELETE, UPSERT
● Every statement is marked with an LSN (log sequence number)
● Append-only files, garbage collected after checkpoint
● Transactional log of all filesystem changes: vylog
Statement layout: key | lsn | op_code | value
Memtable, sorted run, dump & compact
[Figure: the memtable in memory (keys 22, 37) is dumped to disk as a new sorted run; compaction merges the runs 3 8 15 26 36 40 45 and 10 25 36 40 into a single run 3 8 10 15 25 26 36 40 45.]
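A small Python sketch of dump and compaction over LSN-marked statements, using the keys from the figure. The `Statement` fields follow the key / lsn / op_code / value layout from the previous slide; the helper names `dump` and `compact`, the in-memory representation, and the LSN values are assumptions, and only REPLACE statements are modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    key: int
    lsn: int
    op_code: str = "REPLACE"
    value: Optional[str] = None


def dump(memtable):
    """Turn an in-memory table {key: Statement} into an immutable sorted run."""
    return tuple(sorted(memtable.values(), key=lambda s: s.key))


def compact(*runs):
    """Merge runs into one, keeping the statement with the highest LSN per key."""
    newest = {}
    for run in runs:
        for stmt in run:
            kept = newest.get(stmt.key)
            if kept is None or stmt.lsn > kept.lsn:
                newest[stmt.key] = stmt
    return tuple(sorted(newest.values(), key=lambda s: s.key))


# Older run on disk: LSNs 1..7 (illustrative numbering).
run_a = tuple(
    Statement(k, lsn, "REPLACE", f"old-{k}")
    for lsn, k in enumerate((3, 8, 15, 26, 36, 40, 45), start=1)
)
# Newer run produced by a dump: LSNs 8..11.
run_b = dump({
    k: Statement(k, lsn, "REPLACE", f"new-{k}")
    for lsn, k in enumerate((10, 25, 36, 40), start=8)
})
merged = compact(run_a, run_b)
merged_keys = [s.key for s in merged]
values = {s.key: s.value for s in merged}
```

Keys 36 and 40 exist in both runs; after compaction only the versions with the higher LSN remain, which is how overwritten data is eventually reclaimed.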
Vinyl: read [25, 30)
[Figure: the read merges four sources (uncommitted changes, tuple cache, L0, and sorted runs) and returns keys 25, 26, 29.]
Keeping the latency within limits
● anticipatory dump
● throttling
[Figure: while a shadow memtable (keys 22, 37) is being dumped to disk, a separate memtable keeps accepting writes.]
Reducing read amplification
● page index
● bloom filters
● tuple range cache
● multi-level compaction
[Figure: a sorted run on disk with its page index and Bloom filter bit array; writes and the tuple cache live in memory.]
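To make the Bloom filter bullet concrete, here is a minimal Python sketch of a per-run filter. It is a generic textbook Bloom filter, not Vinyl's implementation: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


run_keys = [3, 8, 10, 15, 25, 26, 36, 40, 45]
bloom = BloomFilter()
for key in run_keys:
    bloom.add(key)

no_false_negatives = all(bloom.might_contain(k) for k in run_keys)
```

A point lookup can skip reading a run whenever `might_contain` returns False. False positives remain possible, which is the rate the `bloom_fpr` setting bounds.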
Reducing write amplification
● Multi-level compaction can span any number of levels
● A level can contain multiple runs
[Figure: writes arrive in memory; the disk holds several sorted runs of different sizes spread over the levels.]
Reducing space amplification
● Ranges reflect a static layout of sorted runs
● Slices connect a sorted run to a range
[Figure: the key space is split into the ranges [-inf, 23), [23, 45), [45, +inf); each sorted run on disk is cut into slices, one per range it overlaps.]
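One way to read the range and slice bullets is sketched below in Python. The boundaries 23 and 45 come from the figure; the functions `range_index` and `slice_run` are hypothetical names, and the sketch says nothing about how Vinyl stores slices on disk.

```python
from bisect import bisect_right

# Boundaries from the figure: [-inf, 23), [23, 45), [45, +inf)
boundaries = [23, 45]


def range_index(key):
    """Index of the range that owns `key` (lower bounds are inclusive)."""
    return bisect_right(boundaries, key)


def slice_run(run):
    """Cut one sorted run into slices, one list per range."""
    slices = [[] for _ in range(len(boundaries) + 1)]
    for key in run:
        slices[range_index(key)].append(key)
    return slices


run = [14, 15, 35, 89]  # one of the runs shown in the figure
slices = slice_run(run)
```

With this layout a compaction can work on one range at a time and touch only the slices that fall inside it, rather than rewriting whole runs.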
Secondary key support
● L0 and tuple cache use memtx technology
● This makes it possible to include L0 and tuple cache in checkpoint, so no cold restart
● L1+ stores the value of primary key as tuple id
● Unique secondary keys rely on bloom filter heavily for INSERT, REPLACE
● REPLACE in a non-unique secondary is a blind write, garbage is pruned at compaction
The scheduler
● The only active entity inside vinyl engine
● Maintains two queues: compact and dump
● Each range is in each queue, parallel compaction
● Dump always trumps compact
● Chunked garbage collection for quick memory reclamation
● Is a fiber, so needs no locks
Transactions
● MVCC
● the first transaction to commit wins
● no waits, no deadlocks, only aborts
● yields don’t abort transactions
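The "first transaction to commit wins" rule can be illustrated with a simplified optimistic scheme in Python. This is only a sketch of the conflict rule: `Store`, `Txn`, and `Conflict` are hypothetical names, and unlike real MVCC the reads here see the latest committed value rather than a snapshot.

```python
class Conflict(Exception):
    """Raised when a transaction loses to an earlier committer."""


class Store:
    def __init__(self):
        self.data = {}     # key -> committed value
        self.version = {}  # key -> commit number of the last write
        self.commits = 0

    def begin(self):
        return Txn(self)


class Txn:
    def __init__(self, store):
        self.store = store
        self.start = store.commits  # commit counter seen at begin
        self.reads = set()
        self.writes = {}

    def get(self, key):
        if key in self.writes:      # read your own writes
            return self.writes[key]
        self.reads.add(key)
        return self.store.data.get(key)

    def put(self, key, value):
        self.writes[key] = value    # buffered, no locks taken

    def commit(self):
        store = self.store
        # Abort if anything we read or wrote was committed by someone else
        # after we started: the first committer has already won.
        for key in self.reads | set(self.writes):
            if store.version.get(key, 0) > self.start:
                raise Conflict(key)
        store.commits += 1
        for key, value in self.writes.items():
            store.data[key] = value
            store.version[key] = store.commits


store = Store()
t1, t2 = store.begin(), store.begin()
t1.get("x"); t1.put("x", 1)
t2.get("x"); t2.put("x", 2)
t1.commit()
try:
    t2.commit()
    t2_aborted = False
except Conflict:
    t2_aborted = True
```

Neither transaction ever waits for the other, so there is nothing to deadlock on; the cost is that the loser must be retried by the application.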
Replication & backup
Work out of the box
Limitations
● Tuple size is up to 16M
● You need ~5k file descriptors per 1TB of data
● Optimistic transaction control is bad for long read-write transactions
● No cross-engine transactions yet
Configuration
Name                 Description                                          Default
bloom_fpr            Bloom filter max false positive rate                 5%
page_size            Approximate page size                                8192
range_size           Approximate range size                               1G
run_count_per_level  How many sorted runs are allowed on a single level   2
run_size_ratio       Run size ratio between levels                        3.5
memory               L0 total size                                        0.25G
cache                Tuple cache size                                     0.25G
threads              The number of compaction threads                     2
UPSERT
● Non-reading update or insert
● Delayed execution
● Background upsert squashing prevents upserts from piling up
space:upsert(tuple, {{operator, field, value}, ...})
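The delayed execution and squashing of upserts can be sketched in Python. This models only the idea on the slide: `apply_ops` and `squash` are hypothetical helpers, only the "+" and "=" operators are handled, and field numbers are 0-based here for Python's sake.

```python
def apply_ops(tup, ops):
    """Apply (operator, field, value) update operations to a tuple."""
    fields = list(tup)
    for operator, field, value in ops:
        if operator == "+":
            fields[field] += value
        elif operator == "=":
            fields[field] = value
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return tuple(fields)


def squash(base, upserts):
    """Collapse pending upserts (oldest first) into one resulting tuple.

    `base` is the stored tuple, or None if the key does not exist yet.
    Each upsert is (default_tuple, ops): insert the default when the key
    is missing, otherwise apply the operations.
    """
    current = base
    for default_tuple, ops in upserts:
        current = default_tuple if current is None else apply_ops(current, ops)
    return current


# Three blind "count a page view" upserts written without reading first.
pending = [(("page", 1), [("+", 1, 1)])] * 3
fresh = squash(None, pending)             # key did not exist
existing = squash(("page", 10), pending)  # key already stored
```

The write path only appends the upsert statement; the read of the old value is deferred until a reader or a background squash needs the final tuple.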
Thank you! Questions?
kostja@tarantool.org
@kostja_osipov
fb.com/TarantoolDatabase
www.tarantool.org
Links
Leveled compaction in Apache Cassandra
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
http://www.scylladb.com/kb/compaction/
https://github.com/facebook/rocksdb/wiki/Universal-Compaction
https://dom.as/2015/04/09/how-innodb-lost-its-advantage/
