[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richard Hipp

tl;dr
Many people believe:
B-tree = slow and bad
LSM = fast and good
But the truth is more complicated...

SQLite History
● SQLite 1 – 2000-08-17
– Hash-based GDBM storage engine. GPL
● SQLite 2 – 2001-08-28
– Custom b-tree storage engine. Text only
● SQLite 3 – 2004-06-18
– New b-tree storage engine supporting binary data

(Aside:) Development
● We use Fossil, not Git or Svn or Hg
– Blog, wiki, tickets built-in
– No dependencies
– Improved situational awareness
– Written specifically to support the development of
SQLite
– https://www.fossil-scm.org/
● The trunk is (almost) always production ready
– Problems discovered on a trunk check-in can be
retroactively shunted onto a branch

(Aside:) Development
● Only developers can write tickets
– Because we found that most “bug reports” are really
support requests
● Merge requests or patches not accepted
– In order to keep SQLite in the public domain, lots of
paperwork must be on file for each contributor
– “Open source” but not “Open development”

SQLite4 History
● Coding starts on 2012-01-20
● Intense work throughout 2012 and 2013
● Development slows and stops in early 2014
● https://sqlite.org/src4
What Happened?

Goals of SQLite4
● Keep the spirit of SQLite intact
– Serverless
– Single-file database
● Faster than SQLite3
– LSM (Log Structured Merge) storage engine
– Compare storage engine keys using memcmp()
● Fix API quirks
● PRIMARY KEY is the storage engine key
● https://sqlite.org/src4
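
The memcmp() goal depends on keys being encoded so that plain bytewise comparison gives the right order. A minimal sketch of the idea, not SQLite4's actual key encoding: the helper name encode_uint_key and the fixed 8-byte big-endian layout are assumptions made for illustration. Python compares bytes objects lexicographically by unsigned byte value, which is the same ordering memcmp() produces for equal-length buffers.

```python
import struct

def encode_uint_key(n: int) -> bytes:
    """Encode an unsigned integer (< 2**64) as 8 big-endian bytes.

    Big-endian puts the most significant byte first, so bytewise
    (memcmp-style) comparison agrees with numeric comparison.
    """
    return struct.pack(">Q", n)

values = [300, 2, 70000, 255, 256]

# Sort the encoded keys using only bytewise comparison.
keys = sorted(encode_uint_key(v) for v in values)

# Decode again to confirm the bytewise order matches numeric order.
decoded = [struct.unpack(">Q", k)[0] for k in keys]
print(decoded)
```

With such an encoding the storage engine never needs to know the key's type; it only compares bytes.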

SQLite3 versus SQLite4
SQLite3:
● B-tree storage
● Separate key/value namespace (separate b-tree) for each table and index
● 100% backwards compatible
SQLite4:
● LSM storage
● Single key/value namespace for all tables and indexes
● Fresh, clean design

“Database” versus “Storage Engine”
● The “database” translates high-level SQL into
low-level key/value operations against the
“storage engine”
● In an SQL “database”, the “storage engine” is
just one of many component parts
● Some products call themselves “databases”
when they are really just a “storage engine”:
BerkeleyDB
GDBM
LevelDB
LMDB
RocksDB
Kyoto Cabinet

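Ins & Outs of SQLite
[Diagram: SQL → Compile SQL into bytecode (the query planner, "AI") → Prep'ed Stmt → Bytecode Interpreter → Result. The Bytecode Interpreter sits on top of the Storage Engine.]
● SQLite4 keeps the SQL compiler, query planner, and bytecode interpreter
● SQLite4 replaces the storage engine with a new LSM storage engine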

High-level Inputs To The Database
SELECT
blob.rid AS blobRid,
uuid AS uuid,
datetime(event.mtime,toLocal()) AS timestamp,
coalesce(ecomment, comment) AS comment,
coalesce(euser, user) AS user,
blob.rid IN leaf AS leaf,
bgcolor AS bgColor,
event.type AS eventType,
(SELECT group_concat(substr(tagname,5), ', ') FROM tag,
tagxref
WHERE tagname GLOB 'sym-*' AND tag.tagid=tagxref.tagid
AND tagxref.rid=blob.rid AND tagxref.tagtype>0) AS tags,
tagid AS tagid,
brief AS brief,
event.mtime AS mtime
FROM event CROSS JOIN blob
WHERE blob.rid=event.objid
AND NOT EXISTS(SELECT 1 FROM tagxref
WHERE tagid=5 AND tagtype>0
AND rid=blob.rid)
AND event.type='ci' ORDER BY event.mtime DESC LIMIT 50

Low-level Byte-code Key/Value Ops
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 84 0 00 Start at 84
1 Noop 6 14 0 00
2 Integer 50 1 0 00 r[1]=50; LIMIT counter
3 OpenRead 0 45 0 11 00 root=45 iDb=0; event
4 OpenRead 7 46 0 k(2,,) 00 root=46 iDb=0; event_i1
5 OpenRead 1 2 0 4 00 root=2 iDb=0; blob
6 Last 7 83 2 0 00
7 DeferredSeek 7 0 0 00 Move 0 to 7.rowid if
8 Column 0 0 2 00 r[2]=event.type
9 Ne 3 82 2 (BINARY) 52 if r[2]!=r[3] goto 82
10 IdxRowid 7 4 0 00 r[4]=rowid
11 SeekRowid 1 82 4 00 intkey=r[4]; pk
12 Integer 0 6 0 00 r[6]=0; Init EXISTS r
13 Integer 1 7 0 00 r[7]=1; LIMIT counter
14 OpenRead 5 56 0 7 00 root=56 iDb=0; tagxre
15 OpenRead 8 57 0 k(3,,,) 02 root=57 iDb=0; sqlite
16 Rowid 1 8 0 00 r[8]=rowid
17 Integer 5 9 0 00 r[9]=5
18 SeekGE 8 25 8 2 00 key=r[8..9]
19 IdxGT 8 25 8 2 00 key=r[8..9]
...
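
A listing like the one above can be produced with SQLite's EXPLAIN keyword. A minimal sketch using Python's built-in sqlite3 module: the one-table schema below is a simplified stand-in assumed for illustration, not the Fossil schema the slide's query runs against, and the exact opcodes printed vary between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified table; the real "event" table has more columns.
conn.execute("CREATE TABLE event(objid INTEGER PRIMARY KEY, type TEXT, mtime REAL)")

# EXPLAIN returns one row per bytecode instruction:
# addr, opcode, p1, p2, p3, p4, p5, comment
rows = conn.execute(
    "EXPLAIN SELECT objid FROM event "
    "WHERE type='ci' ORDER BY mtime DESC LIMIT 50"
).fetchall()

opcodes = [row[1] for row in rows]
for row in rows:
    print(*row)
```

Opcodes such as OpenRead are the points where the bytecode interpreter calls into the storage engine.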

Ins & Outs of SQLite
[Diagram: SQL → Compile SQL into bytecode (Query Planner) → Stmt (bytecode) → Bytecode interpreter → Result. The interpreter issues low-level ops to the Storage Engine.]
SQL in:
SELECT
blob.rid AS blobRid,
uuid AS uuid,
...
FROM ...
ORDER BY ...
LIMIT 50;
Bytecode:
0 Init 0 84 0
1 Noop 0 0 0
2 Integer 50 1 0
3 OpenRead 0 45 0
...
Low-level Ops:
● Find(key)
● Insert(key, value)
● Delete(key)
● Next(key)
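
The four low-level ops amount to an ordered key/value interface. A minimal in-memory sketch, assuming nothing about SQLite's real storage engine API: the class name OrderedKV and its method signatures are invented for illustration, and a sorted list stands in for the on-disk tree.

```python
import bisect

class OrderedKV:
    """Toy ordered key/value store exposing Find, Insert, Delete, Next."""

    def __init__(self):
        self._keys = []     # keys kept in sorted order
        self._values = {}   # key -> value

    def find(self, key):
        """Return the value stored under key, or None if absent."""
        return self._values.get(key)

    def insert(self, key, value):
        """Store value under key, replacing any existing value."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def delete(self, key):
        """Remove key if it is present."""
        if key in self._values:
            del self._values[key]
            del self._keys[bisect.bisect_left(self._keys, key)]

    def next(self, key):
        """Return the smallest key strictly greater than key, or None."""
        i = bisect.bisect_right(self._keys, key)
        return self._keys[i] if i < len(self._keys) else None

kv = OrderedKV()
kv.insert("b", 2)
kv.insert("a", 1)
kv.insert("c", 3)
kv.delete("b")
print(kv.find("a"), kv.next("a"), kv.find("b"))
```

Everything the SQL layer does, including joins and ORDER BY, is eventually expressed through operations of this shape.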

About B-Trees
● Page oriented
– read/write a whole page (4096 bytes) at a time
– ... because that is what disk/SSD provides
● Root page → intermediate pages → leaf pages
● Approximately 100 entries per page on average
● B-tree: Key + value stored on all pages
● B+tree: Only keys stored on non-leaf pages -
values always stored in leaves

B+tree Structure
[Diagram: root page at the top, leaf pages at the bottom. Legend: key, pointer to lower page, value.]

B+tree Structure
● Non-leaf pages hold only keys
– Between 50 and 8000 keys/page depending on page size
● Key + Data in leaves
● As few as one entry on leaf pages
● Some keys appear more than once in the tree

B-tree Structure
● The key is the data.
● Larger entries, hence lower fan-out
● Each key appears in the table only once
● Key + Value: usually about 20 to 40 bytes per entry

Key Properties Of B-Trees
● Quickly find any entry given the (one) root page
● Search is O(log N), where N is the number of entries
– O(log N) page reads
– O(log N) key comparisons
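
To get a feel for what O(log N) page reads means in practice, here is a rough estimate of tree depth. It is a simplification that assumes every page holds exactly `fanout` entries (the "approximately 100 entries per page" figure); the function name btree_levels is made up for this sketch.

```python
def btree_levels(n_entries: int, fanout: int = 100) -> int:
    """Estimate how many pages lie on the path from root to leaf.

    Assumes a perfectly full tree, so a tree with h levels can
    address about fanout**h entries.
    """
    levels, capacity = 1, fanout
    while capacity < n_entries:
        capacity *= fanout
        levels += 1
    return levels

for n in (100, 1_000_000, 10**9):
    print(n, btree_levels(n))
```

Under these assumptions a billion entries are reachable in about five page reads, and fewer if the upper levels are already cached.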

Write Amplification
● New 20-byte entry
● Entire 4096-byte page must be written
● 4096 / 20 = 204.8 write amplification
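
The same arithmetic as a small helper, in case other page or entry sizes are of interest. The function name is an assumption for this sketch, and it models only the single leaf-page write described on the slide, not journaling or parent-page updates.

```python
def write_amplification(page_size: int, entry_size: int) -> float:
    """Bytes physically written per byte of new content, when one
    whole page must be rewritten to store one new entry."""
    return page_size / entry_size

print(write_amplification(4096, 20))  # 204.8
```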

Log Structured Merge (LSM)
● INSERT: accumulate in RAM
● Write small b-tree to disk, all at once
● INSERT: accumulate in RAM, then write 2nd small b-tree to disk
● INSERT: accumulate in RAM, then write third small b-tree to disk
● Merge the small b-trees into a larger one while new INSERTs keep accumulating in RAM
● Repeated merging produces levels: Level 0, Level 1, Level 2
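
The accumulate / flush / merge cycle can be sketched in a few lines. This is a toy model, not the LSM1 design: sorted Python lists stand in for the small on-disk b-trees, all runs are merged into one instead of being organized into levels, and the names ToyLSM, memtable_limit, and max_runs are assumptions made for illustration.

```python
import bisect

class ToyLSM:
    """Toy LSM: buffer writes in RAM, flush sorted runs, merge runs."""

    def __init__(self, memtable_limit=2, max_runs=3):
        self.memtable = {}   # recent writes, held in RAM
        self.runs = []       # sorted (key, value) lists, newest first
        self.memtable_limit = memtable_limit
        self.max_runs = max_runs

    def insert(self, key, value):
        """Blind write: no existing data is read."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Write the RAM buffer out as one new sorted run, all at once."""
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.runs) >= self.max_runs:
            self._merge()

    def _merge(self):
        """Combine all runs into one; newer values win."""
        merged = {}
        for run in reversed(self.runs):   # oldest first
            merged.update(run)            # later (newer) runs overwrite
        self.runs = [sorted(merged.items())]

    def find(self, key):
        """Check RAM first, then every run from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = ToyLSM()
for k, v in [("a", 1), ("b", 2), ("a", 10), ("c", 3)]:
    db.insert(k, v)
print(len(db.runs), db.find("a"))
```

The sketch shows both sides of the trade: insert() never reads, while find() may have to probe several runs before it finds the key or gives up.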

Good:
● Faster writes
● Reduced write amplification
● Linear writes
● Less SSD wear
Bad:
● Slower reads
● Background merge process
● More space on disk
● Greater complexity

The LSM1 Storage Engine
● All content stored in one file on disk
● Transactions
● Incremental merging → All INSERT operations
take about the same amount of time
● Range Delete
● Faster than LevelDB

[Diagram: Compile SQL into bytecode → Stmt (bytecode) → Bytecode interpreter → Result, on top of the Storage Engine.]
● Delete old B-tree storage engine
● Insert new LSM storage engine

Schema:
CREATE TABLE user(
  login TEXT PRIMARY KEY,
  name TEXT UNIQUE,
  officeId TEXT REFERENCES office,
  jobType TEXT REFERENCES roles,
  -- Other fields omitted....
);
INSERT INTO user(login,name,officeId,jobType)
  VALUES('drh', 'Richard', '3D17','BDFL');
Will this be faster using LSM?

CREATE TABLE user(
  ...
  name TEXT UNIQUE,
  ...
);
INSERT INTO user(login,name,officeId,jobType) ...
4 reads, then if everything is ok, 1 write → Slower!

Remove constraints:
CREATE TABLE user(
  ...
  name TEXT UNIQUE,
  ...
);
REPLACE INTO user(login,name,officeId,jobType) ...
0 reads, then if everything is ok, 1 write → Faster
Unified Key Namespace
● All tables are stored in a single namespace
● Every key must begin with a “table-id”
● With 100 tables in the schema, every search
begins with about 7 extra key comparisons
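
One plausible reading of the "about 7" figure, offered as an assumption and not something the slides state: a binary search has to pick one table-id prefix out of 100 before it reaches the keys of the wanted table, which costs roughly log2(100) comparisons.

```python
import math

def extra_comparisons(n_tables: int) -> int:
    """Approximate comparisons spent narrowing a binary search
    down to one table-id among n_tables."""
    return math.ceil(math.log2(n_tables))

print(extra_comparisons(100))  # 7
```

With a private namespace per table, the search starts at that table's own root and skips these comparisons.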

Lessons
● SQLite3 is already very fast and hard to beat
● LSM is great for “blind” writes, but does not
work as well when constraints must be checked
prior to each write
● Many workloads do more reading than writing
● Store each table and index in its own private
key namespace

[Chart: SQLite3 Performance. Y-axis: CPU cycles, 0% to 350%. X-axis: releases 3.6.7 through 3.20.0 over the years 2009 to 2018. The period of work on SQLite4 is marked.]

Back-porting Lessons To SQLite3
● Added WITHOUT ROWID tables
– A backwards-compatible hack that allows any
arbitrary PRIMARY KEY to serve as the key in the
key/value storage
● Faster key comparison routines
● The LSM1 virtual table
– Access an LSM1 database file as a single table
within a larger schema
● Improved LSM techniques in FTS5
