Database storage engine internals.pptx

Demystifying data structures and algorithms adopted by database storage engine
Adewumi Sunkanmi D.
Senior Software Engineer at Acronis working on Advanced Automation, one of the cloud services offered by Acronis Cyber Cloud.
Outline
1. Overview of a three-tier application
2. Criteria for selecting the best database for an application
3. Overview of database architecture
4. Types of database storage engines and their tradeoffs
5. Q/A
[Diagram: three-tier application. The client sends POST and GET requests to the server; the server sends WRITE and READ operations to the Database.]
Which database should we use🤔?
Candidates shown: BigTable, Neo4J
How do we select the best database for an application?
1. Structure of the data we want to store: structured, unstructured, graph?
2. Scalability of the database
- Horizontal or vertical scaling
- Sharding (partition data across nodes)
- Replication (copies of data on multiple nodes)
3. Support and familiarity of developers with the database
4. Rate of writes and reads, and how EXACTLY are these operations handled at the hardware level?
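To make the sharding and replication criteria concrete, here is a minimal sketch, assuming hash-based partitioning and replicas placed on consecutive nodes. The helper names `shard_for` and `replicas_for` are hypothetical, and real databases each use their own placement schemes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key: str, num_nodes: int, copies: int) -> list[int]:
    """Place `copies` replicas on consecutive nodes, starting at the key's shard."""
    first = shard_for(key, num_nodes)
    return [(first + i) % num_nodes for i in range(copies)]


# Sharding: each key lives on exactly one of 4 shards.
print(shard_for("ben", 4))
# Replication: the same key is copied to 3 distinct nodes out of 5.
print(replicas_for("ben", 5, 3))
```

A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings and would send the same key to different shards after a restart.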
[Figure source: https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/ch01.html]
"SELECT firstname, lastname FROM students WHERE score > 70;"
[Diagram: parse tree of the query. SELECT has three branches: COLS → COL_ID → firstname, lastname; FROM → students; WHERE → > → score, 70.]
Disk
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://www.cs.umb.edu/~poneil/lsmtree.pdf
Log Structured Merge Tree Storage Engine
The LSM tree is a disk-resident data structure whose files are immutable once written. It is optimized for sequential writes while maintaining acceptable read performance.
Three components
1. Memtable
2. In-memory index
3. SSTable (Sorted String Table)
Incoming writes, in arrival order (key value):
ben 300
josh 500
bin 220
ben 177
mia 220
eve 173
Memtable (e.g. red-black tree in RAM):
1. write ben: 300
2. write josh: 500 (memtable now holds ben: 300, josh: 500)
3. write bin: 220. Threshold reached!
4. flush: the memtable contents are written, sorted by key, to an SSTable file T1 on SSD/HDD:
ben: 300
bin: 220
josh: 500
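The write path above can be sketched as follows. This is an illustration only: the `Memtable` class is hypothetical, a plain dict sorted at flush time stands in for the red-black tree, and the threshold counts keys rather than bytes.

```python
class Memtable:
    """In-memory write buffer. A dict sorted at flush time stands in for
    the red-black tree mentioned on the slide (an assumption of this sketch)."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # max number of keys before a flush (assumed unit)
        self.entries: dict[str, int] = {}

    def put(self, key: str, value: int) -> bool:
        """Store the write and report whether the threshold is reached."""
        self.entries[key] = value
        return len(self.entries) >= self.threshold

    def flush(self) -> list[tuple[str, int]]:
        """Return the contents in sorted key order (an SSTable) and reset."""
        sstable = sorted(self.entries.items())
        self.entries = {}
        return sstable


memtable = Memtable(threshold=3)
sstables = []  # oldest first; T1 is sstables[0]
for key, value in [("ben", 300), ("josh", 500), ("bin", 220)]:
    if memtable.put(key, value):
        sstables.append(memtable.flush())

print(sstables[0])  # [('ben', 300), ('bin', 220), ('josh', 500)]
```

The flushed list is never modified afterwards, which mirrors the immutability of SSTable files.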
SSTable file T1 on SSD/HDD (segment file): 40MB, laid out as four 10MB blocks of sorted keys:
alexandar: 10, andreas: 50, ...
arik: 500, erling: 200, ...
jan: 11, johan: 300, ...
robert: 499, roy: 200, ...
In-memory index (first key of each block and its byte offset):
Key Byte offset
alexandar 0
arik 303
jan 400
robert 500
Find(apa): apa sorts between alexandar and arik, so only the block starting at offset 0 needs to be scanned.
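A minimal sketch of the sparse in-memory index and of a lookup such as Find(apa). The function names are hypothetical, list positions stand in for byte offsets, and the keys elided on the slide are simply left out.

```python
import bisect


def build_sparse_index(sstable, every):
    """Keep one (key, position) entry per `every` records; positions stand in
    for the byte offsets shown on the slide."""
    return [(sstable[i][0], i) for i in range(0, len(sstable), every)]


def find(sstable, index, key):
    """Locate the block that could hold `key`, then scan only that block."""
    keys = [k for k, _ in index]
    slot = bisect.bisect_right(keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first key in the file
    start = index[slot][1]
    end = index[slot + 1][1] if slot + 1 < len(index) else len(sstable)
    for k, v in sstable[start:end]:
        if k == key:
            return v
    return None


sstable = [("alexandar", 10), ("andreas", 50), ("arik", 500), ("erling", 200),
           ("jan", 11), ("johan", 300), ("robert", 499), ("roy", 200)]
index = build_sparse_index(sstable, every=2)

print(find(sstable, index, "apa"))     # None: only the first block was scanned
print(find(sstable, index, "erling"))  # 200
```

Because the file is sorted, the index only needs one entry per block, which keeps it small enough to hold in memory.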
New writes go to a fresh memtable:
write ben: 177
write mia: 220
write eve: 173
flush: once the threshold is reached again, the memtable is written to a second SSTable file T2 on SSD/HDD:
ben: 177
eve: 173
mia: 220
Read(ben)
[Diagram: the lookup is traced in three numbered steps (Step 1, Step 2, Step 3) across the memtable, the in-memory index and the SSTable files.]
How do we handle update?
Since we return from the most recent memtable or segment file, we just insert the key with the new value. Ben will be returned from T2, not T1.
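The "most recent wins" rule can be sketched as below. Dicts stand in for the memtable and the SSTable files, and the `read` function is a hypothetical helper rather than any engine's API.

```python
def read(key, memtable, sstables):
    """Check the memtable first, then SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):  # sstables is ordered oldest -> newest
        if key in table:
            return table[key]
    return None


t1 = {"ben": 300, "bin": 220, "josh": 500}
t2 = {"ben": 177, "eve": 173, "mia": 220}

print(read("ben", {}, [t1, t2]))   # 177, taken from T2 rather than T1
print(read("josh", {}, [t1, t2]))  # 500, only present in T1
```

An update therefore never touches T1; the stale value stays on disk until it is cleaned up later.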
How do we handle delete?
Insert the key with a delete marker called a tombstone. Since this will be the most recent entry, we can tell the key has been deleted, e.g. ben->null
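A small sketch of tombstone handling, assuming `None` plays the role of the delete marker (so `None` cannot be stored as a real value here). The names `TOMBSTONE`, `delete` and `lookup` are hypothetical.

```python
TOMBSTONE = None  # delete marker, written as ben->null on the slide


def delete(memtable, key):
    """A delete is just another write: record the tombstone for the key."""
    memtable[key] = TOMBSTONE


def lookup(key, memtable, sstables):
    """Newest-first search; a tombstone hides any older value."""
    for table in [memtable] + list(reversed(sstables)):
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None


t1 = {"ben": 300, "bin": 220, "josh": 500}
memtable = {}
delete(memtable, "ben")

print(lookup("ben", memtable, [t1]))  # None: the tombstone shadows ben: 300
print(lookup("bin", memtable, [t1]))  # 220
```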
But now we have duplicates, space wastage :(
Yes, but compaction will help.
Compaction: SSTable files are merged in the background, keeping only the most recent value for each key and dropping older duplicates.
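Compaction of T1 and T2 could look roughly like this. It is a simplified sketch: production engines typically stream a k-way merge over sorted files instead of loading them into a dict, and `compact` is a hypothetical name.

```python
def compact(sstables, drop_tombstones=True):
    """Merge SSTables (ordered oldest -> newest) into one sorted table,
    keeping only the most recent value for each key."""
    merged = {}
    for table in sstables:  # later tables overwrite earlier ones
        merged.update(table)
    if drop_tombstones:
        # None is assumed to be the tombstone marker in this sketch.
        merged = {k: v for k, v in merged.items() if v is not None}
    return sorted(merged.items())


t1 = [("ben", 300), ("bin", 220), ("josh", 500)]
t2 = [("ben", 177), ("eve", 173), ("mia", 220)]

print(compact([t1, t2]))
# [('ben', 177), ('bin', 220), ('eve', 173), ('josh', 500), ('mia', 220)]
```

Dropping tombstones is only safe when every older file that might still hold the key takes part in the merge; otherwise the deleted value would reappear.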
What if we don't find the key? Do we search all the SSTable files?
Optimize reads with Bloom Filters
Key present?
- NO: strictly not present, so the SSTable file can be skipped
- Maybe: the key may or may not be present (99% accurate), so the file is searched
https://brilliant.org/wiki/bloom-filter/
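A toy Bloom filter showing the two possible answers. The sizes (1024 bits, 3 hash functions) and the use of SHA-256 with a seed prefix are arbitrary choices for this sketch, not what any particular engine does.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        """Derive `num_hashes` bit positions for the key."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means maybe present."""
        return all(self.bits[pos] for pos in self._positions(key))


bloom_t1 = BloomFilter()
for key in ["ben", "bin", "josh"]:
    bloom_t1.add(key)

print(bloom_t1.might_contain("josh"))  # True: go and read the SSTable
```

One filter is kept per SSTable file, so a read can skip every file whose filter answers NO.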
What if power failure happens before data is flushed to disk?
1. Persist each write in an append-only log file, the write-ahead log (WAL), before writing to the in-memory table.
2. Recreate the memtable from the last Log Sequence Number.
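The two recovery steps can be sketched as below. To stay short, this version omits Log Sequence Numbers and replays the whole log from the start; the `WriteAheadLog` class and the JSON-lines format are assumptions of the example.

```python
import json
import os
import tempfile


class WriteAheadLog:
    def __init__(self, path):
        self.path = path

    def append(self, key, value):
        """Persist the write before it is applied to the memtable."""
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the record to disk

    def replay(self):
        """Rebuild a memtable by re-applying every logged write in order."""
        memtable = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    memtable[record["key"]] = record["value"]
        return memtable


path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(path)
memtable = {}
for key, value in [("ben", 177), ("mia", 220), ("eve", 173)]:
    wal.append(key, value)  # 1. log first
    memtable[key] = value   # 2. then update the in-memory table

recovered = WriteAheadLog(path).replay()  # simulate a restart after a crash
print(recovered == memtable)
```

Once a memtable has been flushed to an SSTable, the part of the log that covers it is no longer needed and can be discarded.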
Where is the LSM tree storage engine used?
1. Apache Cassandra
2. WiredTiger
3. InfluxDB
4. Yugabyte DB
5. ScyllaDB
6. CockroachDB
7. Google’s BigTable
8. RocksDB
Types of storage engines
- Log Structured Merge (LSM) Tree
- Page Oriented (B-Tree)
https://carlosproal.com/ir/papers/p121-comer.pdf
B-Trees
B-trees are page-oriented indexing structures.
Important notes on B-tree
1. Store key-value pairs (sorted by key)
2. Self-balancing
3. Often used for indexing
4. Mutable data structure (in-place update)
5. Each node is a fixed-size block/page (4KB)
6. Can only read or write one page at a time
Anatomy of B-Tree
Root page: 12 30 50 120
  key < 12
  key [12, 30): 13 23 25 27
  key [30, 50): 31 37 42 49
  key [50, 120): 52 60 69 90
    key [69, 90): 69 70 78 85 (leaf page, each key stored with its val)
  key [120, inf)
A database page
[Figure: a database page. Source: https://sqlbak.com/academy/database-page]
NOTE: A leaf page contains both the key and the value.
Branching factor = 5
Depth = 3
READ(78)
1. Root page 12 30 50 120: 78 falls in key [50, 120), follow that pointer.
2. Page 52 60 69 90: 78 falls in key [69, 90), follow that pointer.
3. Leaf page 69 70 78 85: found!
Searching for a key is faster because we are not scanning all keys, only the keys within the matching range. A lookup takes O(log n), where n is the total number of keys.
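READ(78) can be traced with a short sketch. The `Page` class is hypothetical, and the pages that are not needed for this lookup are stubbed out as empty leaves, so the example tree is not balanced the way a real B-tree would be.

```python
import bisect


class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children  # set on internal pages
        self.values = values      # set on leaf pages


def search(page, key):
    """Follow one child pointer per level until a leaf is reached."""
    while page.children is not None:
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None


def leaf(keys):
    return Page(keys, values=[f"val{k}" for k in keys])


target = leaf([69, 70, 78, 85])
middle = Page([52, 60, 69, 90],
              children=[leaf([]), leaf([]), leaf([]), target, leaf([])])
root = Page([12, 30, 50, 120],
            children=[leaf([]), leaf([]), leaf([]), middle, leaf([])])

print(search(root, 78))  # val78
```

Each page with four keys has five child pointers, and the binary search inside a page picks the one whose key range contains the search key.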
INSERT(87)
Branching factor - 5
Parent page: 60 69 90
Leaf page for key [69, 90): 69 70 78 85 86 (each key stored with its val)
1. Adding 87 to the leaf gives 69 70 78 85 86 87. Branching factor exceeded! > 5
2. Create new page: the leaf is split into 69 70 78 and 85 86 87.
3. Add 85 to parent page: 60 69 85 90
What if the parent page is full? Split it.
How does update work?
1. Find the leaf page with the key
2. Edit the row
3. Overwrite the page
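The split during INSERT(87) can be reproduced with a few lines. Lists of keys stand in for pages, values are omitted, and a full parent page is not handled; `MAX_KEYS` and `insert_into_leaf` are hypothetical names.

```python
import bisect

MAX_KEYS = 5  # the slides' "branching factor - 5", read here as max keys per page


def insert_into_leaf(leaf, parent, key):
    """Insert `key` into the sorted leaf and split the leaf if it overflows.
    Returns the list of leaves that replace the original one."""
    bisect.insort(leaf, key)
    if len(leaf) <= MAX_KEYS:
        return [leaf]
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    bisect.insort(parent, right[0])  # add the first key of the new page to the parent
    return [left, right]


parent = [60, 69, 90]
leaves = insert_into_leaf([69, 70, 78, 85, 86], parent, 87)

print(leaves)  # [[69, 70, 78], [85, 86, 87]]
print(parent)  # [60, 69, 85, 90]
```

If the parent itself overflowed, the same split would be applied one level up, which is how the tree grows in depth.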
LSM trees vs B-Trees storage engine
LSM Tree
- Optimized for write
- Compressed better (no fragmentation)
- There can be duplicates before compaction
- Spikes in write can cause slow compaction due to many SSTable files. Can cause Out of Memory Error (OOM)
B-Tree
- Optimized for read
- Fragmentation wastes space
- Each key exists in exactly one place
- Strong transaction support
Space optimization in B-tree
Primary index (primary key index): leaf page contains both key and value
Secondary index: leaf page contains both key and value
DUPLICATE!
Instead, in both indexes, store a value offset (smaller in size) that points into a Heap File:
val1
val2
val3
val4
val5
…
Cost: Extra Disk I/O
So you can store important columns in the leaf page and less important columns in the heap file.
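A minimal sketch of the heap-file layout, assuming dicts in place of the two B-tree indexes and list positions in place of byte offsets. The `id` column is hypothetical; `firstname`, `lastname` and `score` follow the students query shown earlier.

```python
heap_file = []        # each row is stored here exactly once
primary_index = {}    # id -> offset into heap_file
secondary_index = {}  # lastname -> list of offsets into heap_file


def insert_row(row):
    """Append the row to the heap file and record its offset in both indexes."""
    offset = len(heap_file)
    heap_file.append(row)
    primary_index[row["id"]] = offset
    secondary_index.setdefault(row["lastname"], []).append(offset)


def find_by_id(row_id):
    offset = primary_index.get(row_id)
    return None if offset is None else heap_file[offset]  # extra heap read


def find_by_lastname(lastname):
    # One index lookup, then one extra heap read per matching row.
    return [heap_file[off] for off in secondary_index.get(lastname, [])]


insert_row({"id": 1, "firstname": "Ben", "lastname": "Stone", "score": 77})
insert_row({"id": 2, "firstname": "Mia", "lastname": "Stone", "score": 82})

print(find_by_id(2)["firstname"])     # Mia
print(len(find_by_lastname("Stone")))  # 2
```

Both indexes hold only small offsets, so the row data is no longer duplicated, at the price of the additional read into the heap file.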
@gifted_dl
Adewumi Sunkanmi D.
1 sur 79

Recommandé

IO Dubi Lebel par
IO Dubi LebelIO Dubi Lebel
IO Dubi Lebelsqlserver.co.il
1.2K vues104 diapositives
5 steps to faster web sites & HTML5 games - updated for DDDscot par
5 steps to faster web sites & HTML5 games - updated for DDDscot5 steps to faster web sites & HTML5 games - updated for DDDscot
5 steps to faster web sites & HTML5 games - updated for DDDscotMichael Ewins
1.1K vues73 diapositives
SQL Server On SANs par
SQL Server On SANsSQL Server On SANs
SQL Server On SANsQuest Software
1K vues46 diapositives
Using Oracle Database with Amazon Web Services par
Using Oracle Database with Amazon Web ServicesUsing Oracle Database with Amazon Web Services
Using Oracle Database with Amazon Web Servicesguest484c12
5.7K vues79 diapositives
Redis — memcached on steroids par
Redis — memcached on steroidsRedis — memcached on steroids
Redis — memcached on steroidsRobert Lehmann
1.8K vues91 diapositives
All about Storage - Series 2 Defining Data par
All about Storage - Series 2 Defining DataAll about Storage - Series 2 Defining Data
All about Storage - Series 2 Defining DataDAGEOP LTD
128 vues30 diapositives

Contenu connexe

Similaire à Database storage engine internals.pptx

Data Replication Options in AWS (ARC302) | AWS re:Invent 2013 par
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Amazon Web Services
12.7K vues78 diapositives
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike par
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeHow to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeAerospike, Inc.
1.5K vues16 diapositives
5 Steps to Faster Web Sites and HTML5 Games par
5 Steps to Faster Web Sites and HTML5 Games5 Steps to Faster Web Sites and HTML5 Games
5 Steps to Faster Web Sites and HTML5 GamesMichael Ewins
1.3K vues62 diapositives
Amazed by AWS Series #4 par
Amazed by AWS Series #4Amazed by AWS Series #4
Amazed by AWS Series #4Amazon Web Services Korea
1.7K vues102 diapositives
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America par
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaBuilding the Perfect SharePoint 2010 Farm - Sharing the Point South America
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaMichael Noel
699 vues35 diapositives
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the... par
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Knut Relbe-Moe [MVP, MCT]
390 vues50 diapositives

Similaire à Database storage engine internals.pptx(20)

Data Replication Options in AWS (ARC302) | AWS re:Invent 2013 par Amazon Web Services
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Amazon Web Services12.7K vues
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike par Aerospike, Inc.
How to Get a Game Changing Performance Advantage with Intel SSDs and AerospikeHow to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
How to Get a Game Changing Performance Advantage with Intel SSDs and Aerospike
Aerospike, Inc. 1.5K vues
5 Steps to Faster Web Sites and HTML5 Games par Michael Ewins
5 Steps to Faster Web Sites and HTML5 Games5 Steps to Faster Web Sites and HTML5 Games
5 Steps to Faster Web Sites and HTML5 Games
Michael Ewins1.3K vues
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America par Michael Noel
Building the Perfect SharePoint 2010 Farm - Sharing the Point South AmericaBuilding the Perfect SharePoint 2010 Farm - Sharing the Point South America
Building the Perfect SharePoint 2010 Farm - Sharing the Point South America
Michael Noel699 vues
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the... par Knut Relbe-Moe [MVP, MCT]
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
HPCC Systems vs Hadoop par Fujio Turner
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
Fujio Turner3.4K vues
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features par Amazon Web Services
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
Amazon Web Services30.9K vues
Site Performance - From Pinto to Ferrari par Joseph Scott
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to Ferrari
Joseph Scott3.5K vues
MySpace Data Architecture June 2009 par Mark Ginnebaugh
MySpace Data Architecture June 2009MySpace Data Architecture June 2009
MySpace Data Architecture June 2009
Mark Ginnebaugh2.3K vues
Virtualization and SAN Basics for DBAs par Quest Software
Virtualization and SAN Basics for DBAsVirtualization and SAN Basics for DBAs
Virtualization and SAN Basics for DBAs
Quest Software901 vues
Designing Information Structures For Performance And Reliability par bryanrandol
Designing Information Structures For Performance And ReliabilityDesigning Information Structures For Performance And Reliability
Designing Information Structures For Performance And Reliability
bryanrandol608 vues
Maa wp-10g-racprimaryracphysicalsta-131940 par gopalchsamanta
Maa wp-10g-racprimaryracphysicalsta-131940Maa wp-10g-racprimaryracphysicalsta-131940
Maa wp-10g-racprimaryracphysicalsta-131940
gopalchsamanta177 vues
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver... par Guy Harrison
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Guy Harrison3.3K vues
The care and feeding of a MySQL database par Dave Stokes
The care and feeding of a MySQL databaseThe care and feeding of a MySQL database
The care and feeding of a MySQL database
Dave Stokes879 vues
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services par Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web ServicesAWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
Experiences with Oracle SPARC S7-2 Server par JomaSoft
Experiences with Oracle SPARC S7-2 ServerExperiences with Oracle SPARC S7-2 Server
Experiences with Oracle SPARC S7-2 Server
JomaSoft273 vues
Best practices for using flash in hyperscale software storage architectures par Eric Carter
Best practices for using flash in hyperscale software storage architecturesBest practices for using flash in hyperscale software storage architectures
Best practices for using flash in hyperscale software storage architectures
Eric Carter477 vues

Dernier

MS PowerPoint.pptx par
MS PowerPoint.pptxMS PowerPoint.pptx
MS PowerPoint.pptxLitty Sylus
5 vues14 diapositives
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action par
Gen Apps on Google Cloud PaLM2 and Codey APIs in ActionGen Apps on Google Cloud PaLM2 and Codey APIs in Action
Gen Apps on Google Cloud PaLM2 and Codey APIs in ActionMárton Kodok
6 vues55 diapositives
SUGCON ANZ Presentation V2.1 Final.pptx par
SUGCON ANZ Presentation V2.1 Final.pptxSUGCON ANZ Presentation V2.1 Final.pptx
SUGCON ANZ Presentation V2.1 Final.pptxJack Spektor
23 vues34 diapositives
Ports-and-Adapters Architecture for Embedded HMI par
Ports-and-Adapters Architecture for Embedded HMIPorts-and-Adapters Architecture for Embedded HMI
Ports-and-Adapters Architecture for Embedded HMIBurkhard Stubert
21 vues19 diapositives
Programming Field par
Programming FieldProgramming Field
Programming Fieldthehardtechnology
5 vues9 diapositives
Database storage engine internals.pptx

  • 1. Demystifying data structures and algorithms adopted by database storage engine Adewumi Sunkanmi D.
  • 2. Demystifying data structures and algorithms used by database storage engine
  • 3. Adewumi Sunkanmi D. Senior Software Engineer at Acronis working on Advanced Automation, one of the cloud services offered by Acronis Cyber Cloud.
  • 4. Outline 1. Overview of a three-tier application 2. Criteria for selecting the best database for an application 3. Overview of database architecture 4. Types of database storage engines and their tradeoffs 5. Q/A
  • 15-21. How do we select the best database for an application? 1. Structure of the data we want to store: structured, unstructured, or graph? 2. Scalability of the database - Horizontal or vertical scaling - Sharding (partitioning data across nodes) - Replication (copies of data on multiple nodes) 3. Support for the database and developers' familiarity with it 4. Rate of writes and reads, and how EXACTLY are these operations handled at the hardware level?
  • 23. “SELECT firstname, lastname FROM students WHERE score > 70;” [Diagram: the query broken into its parts: SELECT with COLS firstname, lastname; FROM students; WHERE score > 70]
  • 26. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 28. Log Structured Merge Tree Storage Engine The LSM tree is an immutable, disk-resident data structure. It is optimized for sequential writes while maintaining acceptable read performance.
  • 29. Log Structured Merge Tree Storage Engine Three components 1. Memtable 2. In-memory index 3. SSTable (Sorted String Table)
  • 30. Incoming writes (key, value), in arrival order: ben 300, josh 500, bin 220, ben 177, mia 220, eve 173
  • 31-32. Writes go to the memtable, a sorted in-memory structure (e.g. a red-black tree in RAM): write ben: 300, then josh: 500
  • 33. write bin: 220. The memtable now holds ben: 300, bin: 220, josh: 500. Threshold reached!
  • 34-35. flush: the memtable is written in sorted key order to an SSTable file T1 on SSD/HDD: ben: 300, bin: 220, josh: 500. In the diagram an SSTable file is 40MB.
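The write and flush steps above can be sketched in a few lines. This is a minimal illustration, not how any particular engine does it: a plain dict that is sorted at flush time stands in for the red-black tree, and the class name `Memtable` and the `threshold` parameter are hypothetical.

```python
class Memtable:
    """In-memory write buffer. A dict sorted at flush time stands in for
    the red-black tree mentioned on the slide (an assumption)."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical: number of keys that triggers a flush
        self.data = {}

    def put(self, key, value):
        """Store the write; return True once the threshold is reached."""
        self.data[key] = value
        return len(self.data) >= self.threshold

    def flush(self):
        """Return all entries in sorted key order (the SSTable contents)
        and start again with an empty memtable."""
        entries = sorted(self.data.items())
        self.data = {}
        return entries


memtable = Memtable(threshold=3)
threshold_reached = False
for key, value in [("ben", 300), ("josh", 500), ("bin", 220)]:
    threshold_reached = memtable.put(key, value)

t1 = memtable.flush() if threshold_reached else []
print(t1)
```

A real engine would write `t1` to a file and never modify that file again; only the sorting and the threshold check are shown here.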
  • 36. The 40MB SSTable file is divided into four 10MB blocks of sorted keys: [alexandar: 10, andreas: 50, …] [arik: 500, erling: 200, …] [jan: 11, johan: 300, …] [robert: 499, roy: 200, …]
  • 37. In-memory index: maps the first key of each block to its byte offset in the file. Key / Byte offset: alexandar / 0, arik / 303, jan / 400, robert / 500
  • 38. Find(apa): "apa" sorts between alexandar and arik, so only the block starting at offset 0 has to be scanned.
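The Find(apa) lookup can be illustrated with a binary search over the index keys. This is a sketch under assumptions: the function name `block_range` is hypothetical, and the offset 400 for jan is read off the diagram rather than stated explicitly.

```python
from bisect import bisect_right

# Sparse in-memory index: first key of each block -> byte offset in the file.
# The offset for "jan" (400) is an assumption taken from the diagram.
index_keys = ["alexandar", "arik", "jan", "robert"]
index_offsets = [0, 303, 400, 500]


def block_range(key):
    """Return (start, end) byte offsets of the only block that can hold key.
    end is None for the last block, meaning "read to the end of the file".
    Returns None if key sorts before the first key in the file."""
    i = bisect_right(index_keys, key) - 1
    if i < 0:
        return None
    end = index_offsets[i + 1] if i + 1 < len(index_offsets) else None
    return index_offsets[i], end


print(block_range("apa"))
```

Because the index holds one entry per block rather than one per key, it stays small enough to keep in RAM, at the cost of scanning one block on disk per lookup.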
  • 39. New writes fill the memtable again: write ben: 177, mia: 220, eve: 173
  • 40. flush: the memtable is written in sorted key order to a second SSTable file T2 on SSD/HDD: ben: 177, eve: 173, mia: 220. T1 is not modified.
  • 41-45. Read(ben): the diagram marks three lookup steps (Step 1, Step 2, Step 3). The most recent data is checked first: the memtable, then the SSTable files from newest (T2) to oldest (T1).
  • 46. How do we handle updates? Since a read returns the value from the most recent memtable or segment file, we just insert the key with the new value. ben will be returned from T2, not T1.
  • 47. How do we handle deletes? Insert the key with a delete marker called a tombstone. Since this will be the most recent entry, we can tell the key has been deleted, e.g. ben -> null
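The read, update and delete rules can be shown together in a small sketch. Plain dicts stand in for the memtable and the SSTable files, and the names `read` and `TOMBSTONE` are hypothetical.

```python
TOMBSTONE = object()  # delete marker, the "ben -> null" of the slide


def read(key, memtable, sstables):
    """Look in the memtable first, then in the SSTables from newest to oldest.
    The first hit wins, so newer values shadow older ones."""
    for table in [memtable] + sstables:
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None  # key was never written


t1 = {"ben": 300, "bin": 220, "josh": 500}  # older file
t2 = {"ben": 177, "eve": 173, "mia": 220}   # newer file
memtable = {}
sstables = [t2, t1]  # newest first

before_delete = read("ben", memtable, sstables)  # update: T2 shadows T1
memtable["ben"] = TOMBSTONE                      # delete: write a tombstone
after_delete = read("ben", memtable, sstables)
print(before_delete, after_delete)
```

Neither the update nor the delete touches T1 or T2, which is what keeps the files immutable.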
  • 48-50. But now we have duplicates, which waste space :( Yes, but compaction will help. Compaction merges SSTable files and keeps only the most recent value for each key.
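A compaction pass over the two example files might look like the sketch below. It assumes every SSTable is merged at once, which is the only case where dropping tombstones is safe; real engines stream a k-way merge over sorted files instead of building a dict. The function name `compact` is hypothetical.

```python
TOMBSTONE = object()  # delete marker


def compact(sstables):
    """Merge SSTables (given newest first) into one sorted list of entries,
    keeping only the newest value per key and dropping deleted keys."""
    merged = {}
    for table in reversed(sstables):  # oldest first, so newer values overwrite
        merged.update(table)
    return [(k, v) for k, v in sorted(merged.items()) if v is not TOMBSTONE]


t1 = [("ben", 300), ("bin", 220), ("josh", 500)]
t2 = [("ben", 177), ("eve", 173), ("mia", 220)]
compacted = compact([t2, t1])
print(compacted)
```

After compaction the old files T1 and T2 can be deleted, which reclaims the space taken by the stale `ben: 300` entry.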
  • 51. What if we don’t find the key? Do we search all the SSTable files?
  • 52. Optimize reads with Bloom filters. Key present? A strict NO if the key is not in the file; otherwise "maybe" (99% accurate). https://brilliant.org/wiki/bloom-filter/
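The "strict NO or maybe" behaviour follows from how a Bloom filter sets and tests bits. The sketch below is illustrative only: the sizes, the use of SHA-256 and the class and method names are assumptions, not taken from any specific engine.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means the key is definitely absent; True means "maybe"."""
        return all(self.bits[pos] for pos in self._positions(key))


t1_filter = BloomFilter()
for key in ["ben", "bin", "josh"]:
    t1_filter.add(key)
print(t1_filter.might_contain("ben"))
```

An engine would keep one such filter per SSTable file and skip the file whenever `might_contain` returns False.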
  • 53. What if a power failure happens before data is flushed to disk? 1. Persist each write in an append-only log file, the write-ahead log (WAL), before writing to the in-memory table. 2. Recreate the memtable from the last Log Sequence Number.
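The two recovery steps can be sketched as follows. This is a simplification under stated assumptions: the class name `WalStore` and the JSON-lines log format are hypothetical, and the sketch replays the whole log on startup, whereas the slide describes replaying only from the last Log Sequence Number.

```python
import json
import os
import tempfile


class WalStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()                 # step 2: rebuild the memtable
        self.log = open(log_path, "a")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                key, value = json.loads(line)
                self.memtable[key] = value

    def put(self, key, value):
        # Step 1: append to the log and force it to disk before
        # touching the in-memory table.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.memtable[key] = value

    def close(self):
        self.log.close()


path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = WalStore(path)
store.put("eve", 173)
store.put("ben", 177)
store.close()                # simulate a crash before any flush

recovered = WalStore(path)   # restart: the memtable comes back from the log
print(recovered.memtable)
recovered.close()
```

Once a memtable has been flushed to an SSTable, the corresponding part of the log is no longer needed and can be discarded.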
  • 54. Log Structured Merge Tree Storage Engine Where are LSM tree storage engines used? 1. Apache Cassandra 2. WiredTiger 3. InfluxDB 4. Yugabyte DB 5. ScyllaDB 6. CockroachDB 7. Google’s BigTable 8. RocksDB
  • 55. Types of storage engines - Log Structured Merge (LSM) Tree - Page Oriented (B-Tree)
  • 58. https://carlosproal.com/ir/papers/p121-comer.pdf B-Trees Important notes on B-trees 1. Store key-value pairs (sorted by key) 2. Self-balancing 3. Often used for indexing 4. Mutable data structure (in-place updates) 5. Each node is a fixed-size block/page, e.g. 4KB 6. Reads and writes happen one whole page at a time
  • 59. Anatomy of B-Tree [Diagram: root page with keys 12, 30, 50, 120. Its pointers cover the key ranges Key < 12, [12, 30), [30, 50), [50, 120) and [120, inf). Child pages shown: 13 23 25 27 for [12, 30); 31 37 42 49 for [30, 50); 52 60 69 90 for [50, 120). Below 52 60 69 90, the range [69, 90) points to the leaf page 69 70 78 85, where each key is stored with its value (val).]
  • 61. NOTE: A leaf page contains both the key and the value.
  • 62. Branching factor = 5, Depth = 3
  • 63-66. READ(78): start at the root page, follow the pointer for [50, 120), then the pointer for [69, 90), and look up 78 in the leaf page 69 70 78 85. found!
  • 67. Searching for a key is fast because we do not scan all keys, only the keys on the pages along one path from the root to a leaf. It takes O(log n), where n is the total number of keys.
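The READ(78) walk can be reproduced on a small in-memory model of the tree from the slides. This is a sketch: the class names `Internal` and `Leaf` are hypothetical, only the path to the leaf 69 70 78 85 is filled in, and empty placeholder leaves stand in for the subtrees not needed here.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values


class Internal:
    def __init__(self, keys, children):
        # children[i] covers keys in [keys[i-1], keys[i]);
        # len(children) == len(keys) + 1
        self.keys, self.children = keys, children


def search(node, key):
    """Return (value or None, number of pages visited)."""
    pages = 1
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
        pages += 1
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i], pages
    return None, pages


empty = Leaf([], [])  # placeholder for subtrees omitted from this sketch
leaf = Leaf([69, 70, 78, 85], ["v69", "v70", "v78", "v85"])
middle = Internal([52, 60, 69, 90], [empty, empty, empty, leaf, empty])
root = Internal([12, 30, 50, 120], [empty, empty, empty, middle, empty])

print(search(root, 78))
```

Each step of the loop corresponds to reading one page, so the number of page reads equals the depth of the tree (3 in the slides).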
  • 68. INSERT(87), branching factor 5. Parent page: 60 69 90. The leaf page for [69, 90) holds 69 70 78 85 86, each key with its value.
  • 69. Adding 87 gives 69 70 78 85 86 87. Branching factor exceeded! (> 5) Create a new page.
  • 70. The leaf page is split into two pages: 69 70 78 and 85 86 87.
  • 71. Add 85 to the parent page, which becomes 60 69 85 90.
  • 72. What if the parent page is full? Split it.
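The INSERT(87) steps can be traced with sorted lists standing in for pages. This is a sketch under assumptions: the slide's "branching factor" is treated as the maximum number of keys per page, values are omitted, and the function name `insert_into_leaf` is hypothetical.

```python
from bisect import insort

MAX_KEYS = 5  # the slide's "branching factor", taken here as max keys per page


def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf page (modifies leaf in place).
    Returns (left, right, separator) if the page overflowed and was split,
    otherwise (leaf, None, None)."""
    insort(leaf, key)
    if len(leaf) <= MAX_KEYS:
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]  # first key of the new page goes to the parent


parent = [60, 69, 90]
left, right, separator = insert_into_leaf([69, 70, 78, 85, 86], 87)
if separator is not None:
    insort(parent, separator)  # if the parent overflows too, it is split the same way

print(left, right, parent)
```

The same split can propagate upward page by page, which is how the tree stays balanced.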
  • 73. How does update work? 1. Find the leaf page with the key 2. Edit the row 3. Overwrite the page
  • 74. LSM trees vs B-trees storage engines
      LSM Tree: Optimized for writes. Compresses better (no fragmentation). There can be duplicates before compaction. Spikes in writes can slow compaction because of the many SSTable files, and can cause an Out of Memory (OOM) error.
      B-Tree: Optimized for reads. Fragmentation wastes space. Each key exists in exactly one place. Strong transaction support.
  • 75. Space optimization in B-tree: a table has a primary index (primary key index) and possibly a secondary index.
  • 76. If the leaf pages of both the primary index and the secondary index contain key and value, the value is stored twice. DUPLICATE!
  • 77. Instead, leaf pages in both indexes store a value offset (smaller in size) that points into a heap file: val1 val2 val3 val4 val5 …
  • 78. The cost is extra disk I/O to follow the offset. So you can store important columns in the leaf page and less important columns in the heap file.
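The heap file idea can be sketched with a list and two dicts. Everything here is illustrative: dicts stand in for the two B-tree indexes, the list position plays the role of the byte offset, and the row fields and function names are hypothetical.

```python
heap_file = []         # rows stored once; the list position acts as the offset
primary_index = {}     # primary key -> offset
secondary_index = {}   # lastname -> list of offsets


def insert_row(row):
    offset = len(heap_file)
    heap_file.append(row)
    primary_index[row["id"]] = offset
    secondary_index.setdefault(row["lastname"], []).append(offset)


def find_by_lastname(lastname):
    # One index lookup plus one extra heap access per matching row.
    return [heap_file[off] for off in secondary_index.get(lastname, [])]


insert_row({"id": 1, "firstname": "ben", "lastname": "stone", "score": 77})
insert_row({"id": 2, "firstname": "mia", "lastname": "stone", "score": 91})
print(find_by_lastname("stone"))
```

Both indexes hold only small offsets, so the row itself is never duplicated; the price is the extra heap access on every read.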