SQLite4 was a project started at the beginning of 2012 and designed to provide a follow-on to SQLite3 without the constraints of backwards compatibility. SQLite4 was built around a Log Structured Merge (LSM) storage engine that is transactional, stores all content in a single file on disk, and is faster than LevelDB. Other innovations include the use of decimal floating-point arithmetic and a single storage engine namespace used for all tables and indexes. Expectations were initially high. However, development stopped about 2.5 years later, after finding that the design of SQLite4 would never be competitive with SQLite3. This talk gives an overview of the technological ideas tried in SQLite4 and discusses why they did not work out for the kinds of workloads typically encountered by an embedded database engine.
3. SQLite History
● SQLite 1 – 2000-08-17
– Hash-based GDBM storage engine. GPL
● SQLite 2 – 2001-08-28
– Custom b-tree storage engine. Text only
● SQLite 3 – 2004-06-18
– New b-tree storage engine supporting binary data
4. (Aside:) Development
● We use Fossil, not Git or Svn or Hg
– Blog, wiki, tickets built-in
– No dependencies
– Improved situational awareness
– Written specifically to support the development of
SQLite
– https://www.fossil-scm.org/
● The trunk is (almost) always production ready
– Problems discovered on a trunk check-in can be
retroactively shunted onto a branch
5. (Aside:) Development
● Only developers can write tickets
– Because we found that most “bug reports” are really
support requests
● Merge requests or patches not accepted
– In order to keep SQLite in the public domain, lots of
paperwork must be on file for each contributor
– “Open source” but not “Open development”
6. SQLite4 History
● Coding starts on 2012-01-20
● Intense work throughout 2012 and 2013
● Development slows and stops in early 2014
● https://sqlite.org/src4
What Happened?
7. Goals of SQLite4
● Keep the spirit of SQLite intact
– Serverless
– Single-file database
● Faster than SQLite3
– LSM (Log Structured Merge) storage engine
– Compare storage engine keys using memcmp()
● Fix API quirks
● PRIMARY KEY is the storage engine key
● https://sqlite.org/src4
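The goal of comparing storage engine keys with memcmp() implies that keys are encoded so that plain bytewise order matches the intended sort order. The following is a minimal sketch of that idea for signed 64-bit integers. The function name encode_int_key and the fixed 8-byte layout are assumptions made for illustration; this is not SQLite4's actual key encoding.

```python
import struct

def encode_int_key(n: int) -> bytes:
    """Encode a signed 64-bit integer so bytewise order equals numeric order.

    Hypothetical encoding for illustration only: adding 2**63 maps the
    signed range onto 0..2**64-1, and big-endian storage makes the most
    significant byte come first, which is what a memcmp()-style
    comparison needs.
    """
    return struct.pack(">Q", n + (1 << 63))

values = [5, -3, 0, 1000, -1000]

# Python compares bytes objects lexicographically, like memcmp().
sorted_by_bytes = sorted(values, key=encode_int_key)
```

With an encoding like this the storage engine never needs to know the type of a key; it only compares bytes.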
8. SQLite3 versus SQLite4
SQLite3:
● B-tree storage
● Separate key/value namespace (separate b-tree) for each table and index
● 100% backwards compatible
SQLite4:
● LSM storage
● Single key/value namespace for all tables and indexes
● Fresh, clean design
9. “Database” versus “Storage Engine”
● The “database” translates high-level SQL into
low-level key/value operations against the
“storage engine”
● In an SQL “database”, the “storage engine” is
just one of many component parts
● Some products call themselves “databases”
when they are really just a “storage engine”:
BerkeleyDB
GDBM
LevelDB
LMDB
RocksDB
Kyoto Cabinet
10. Ins & Outs of
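[Diagram: SQL → "Compile SQL into bytecode" (the query planner AI) → Prep'ed Stmt → Bytecode Interpreter → Result; the Bytecode Interpreter sits on top of the Storage Engine]
● SQLite4 keeps the SQL compiler and bytecode interpreter
● SQLite4 replaces the storage engine with a new LSM storage engine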
11. High-level Inputs To The Database
SELECT
  blob.rid AS blobRid,
  uuid AS uuid,
  datetime(event.mtime,toLocal()) AS timestamp,
  coalesce(ecomment, comment) AS comment,
  coalesce(euser, user) AS user,
  blob.rid IN leaf AS leaf,
  bgcolor AS bgColor,
  event.type AS eventType,
  (SELECT group_concat(substr(tagname,5), ', ') FROM tag, tagxref
    WHERE tagname GLOB 'sym-*' AND tag.tagid=tagxref.tagid
      AND tagxref.rid=blob.rid AND tagxref.tagtype>0) AS tags,
  tagid AS tagid,
  brief AS brief,
  event.mtime AS mtime
FROM event CROSS JOIN blob
WHERE blob.rid=event.objid
  AND NOT EXISTS(SELECT 1 FROM tagxref
                  WHERE tagid=5 AND tagtype>0
                    AND rid=blob.rid)
  AND event.type='ci' ORDER BY event.mtime DESC LIMIT 50
13. Ins & Outs of
[Diagram: SQL → "Compile SQL into bytecode" (Query Planner) → Prep'ed Stmt → Bytecode Interpreter → Result; the Bytecode Interpreter issues low-level operations to the Storage Engine]
Low-level Ops:
● Find(key)
● Insert(key, value)
● Delete(key)
● Next(key)
SELECT
blob.rid AS blobRid,
uuid AS uuid,
...
FROM ...
ORDER BY ...
LIMIT 50;
0 Init 0 84 0
1 Noop 0 0 0
2 Integer 50 1 0
3 OpenRead 0 45 0
...
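The four low-level operations listed on this slide can be sketched as a tiny ordered key/value store. This is an in-memory toy, assuming a sorted list of keys plus a dictionary; the class name ToyStorageEngine and its method names are hypothetical and do not correspond to any SQLite API.

```python
import bisect

class ToyStorageEngine:
    """In-memory ordered key/value store with Find, Insert, Delete, Next."""

    def __init__(self):
        self._keys = []     # keys kept in sorted order
        self._values = {}   # key -> value

    def find(self, key):
        """Return the value stored under key, or None if absent."""
        return self._values.get(key)

    def insert(self, key, value):
        """Store value under key, overwriting any existing entry."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def delete(self, key):
        """Remove key if it exists; do nothing otherwise."""
        if key in self._values:
            del self._values[key]
            del self._keys[bisect.bisect_left(self._keys, key)]

    def next(self, key):
        """Return the smallest stored key strictly greater than key, or None."""
        i = bisect.bisect_right(self._keys, key)
        return self._keys[i] if i < len(self._keys) else None
```

Everything above the storage engine (parser, query planner, bytecode interpreter) ultimately reduces a statement to sequences of calls like these.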
14. About B-Trees
● Page oriented
– read/write a whole page (4096 bytes) at a time
– ... because that is what disk/SSD provides
● Root page → intermediate pages → leaf pages
● Approximately 100 entries per page on average
● B-tree: Key + value stored on all pages
● B+tree: Only keys stored on non-leaf pages -
values always stored in leaves
16. B+tree Structure
● Non-leaf pages hold only keys
● Key + Data in leaves
● As few as one entry on leaf pages
● Between 50 and 8000 keys/page depending on page size
● Some keys appear more than once in the tree
[Diagram legend: Key; Pointer to lower page; Value]
17. B-tree Structure
● The key is the data.
● Larger entries, hence lower fan-out
● Each key appears in the table only once
● Usually about 20 to 40 bytes per entry
[Diagram legend: Key + Value; Pointer to lower page]
18. Key Properties Of B-Trees
● Quickly find any entry given the (one) root page
● Search is O(logN), where N is the number of entries
– O(logN) page reads
– O(logN) key comparisons
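To make the O(logN) claim concrete, the sketch below estimates how many levels (and therefore page reads per lookup) a tree needs for a given number of entries. It assumes a uniform fan-out of 100 entries per page, taken from the "approximately 100 entries per page" figure on the earlier slide; the function name btree_depth is hypothetical and real trees are not perfectly uniform.

```python
def btree_depth(n_entries: int, fanout: int = 100) -> int:
    """Smallest number of levels d such that fanout**d >= n_entries.

    Idealized model: every page holds exactly `fanout` entries.
    """
    depth = 1
    capacity = fanout
    while capacity < n_entries:
        capacity *= fanout
        depth += 1
    return depth

# Under this model, a million entries fit in 3 levels and a billion in 5.
depth_million = btree_depth(1_000_000)
depth_billion = btree_depth(1_000_000_000)
```

Because the logarithm is taken to the base of the fan-out, the number of page reads grows very slowly as the table grows.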
29. Log Structured Merge (LSM)
Good:
● Faster writes
● Reduced write amplification
● Linear writes
● Less SSD wear
Bad:
● Slower reads
● Background merge process
● More space on disk
● Greater complexity
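The trade-offs on this slide can be seen in a toy LSM structure: writes go to a small in-memory table that is flushed as a sorted run, reads may have to probe several runs, and a merge step folds the runs together. This is a generic illustration only, assuming a tiny memtable limit and in-memory runs; the class ToyLSM is hypothetical and is not how the LSM1 engine is implemented.

```python
import bisect

class ToyLSM:
    """Toy log-structured merge store: memtable plus sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []            # newest run first; each run is (keys, values)
        self.memtable_limit = memtable_limit
        self.runs_probed = 0      # counts how many runs reads had to search

    def put(self, key, value):
        """Blind write: no existing data is read."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Write the memtable out as one new sorted run."""
        items = sorted(self.memtable.items())
        self.runs.insert(0, ([k for k, _ in items], [v for _, v in items]))
        self.memtable = {}

    def get(self, key):
        """Check the memtable, then each run from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in self.runs:
            self.runs_probed += 1
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None

    def merge(self):
        """Fold all runs into one; newer values overwrite older ones."""
        merged = {}
        for keys, values in reversed(self.runs):   # oldest first
            merged.update(zip(keys, values))
        items = sorted(merged.items())
        self.runs = [([k for k, _ in items], [v for _, v in items])]
```

The more runs that accumulate between merges, the more places a read must look, which is the "slower reads" and "background merge process" cost listed above.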
30. The LSM1 Storage Engine
● All content stored in one file on disk
● Transactions
● Incremental merging → All INSERT operations
take about the same amount of time
● Range Delete
● Faster than LevelDB
32. Schema:
CREATE TABLE users(
  login TEXT PRIMARY KEY,
  name TEXT UNIQUE,
  officeId TEXT REFERENCES office,
  jobType TEXT REFERENCES roles,
  -- Other fields omitted....
);
INSERT INTO users(login,name,officeId,jobType)
  VALUES('drh', 'Richard', '3D17','BDFL');
Will this be faster using LSM?
33.
CREATE TABLE users(
  login TEXT PRIMARY KEY,
  name TEXT UNIQUE,
  officeId TEXT REFERENCES office,
  jobType TEXT REFERENCES roles,
  -- Other fields omitted....
);
INSERT INTO users(login,name,officeId,jobType)
  VALUES('drh', 'Richard', '3D17','BDFL');
4 reads (one each for the PRIMARY KEY, the UNIQUE constraint, and the two REFERENCES constraints), then if everything is ok, 1 write → Slower!
34.
CREATE TABLE users(
  login TEXT PRIMARY KEY,
  name TEXT UNIQUE,
  officeId TEXT REFERENCES office,
  jobType TEXT REFERENCES roles,
  -- Other fields omitted....
);
REPLACE INTO users(login,name,officeId,jobType)
  VALUES('drh', 'Richard', '3D17','BDFL');
Remove constraints: 0 reads, then if everything is ok, 1 write → Faster
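The difference between a checked INSERT and a blind REPLACE can be sketched by counting storage engine operations. The model below is an assumption for illustration: a dictionary-backed store, one lookup per constraint, and a single write per row as counted on the slides. The names CountingStore, checked_insert, and blind_replace are hypothetical.

```python
class CountingStore:
    """Dictionary-backed key/value store that counts reads and writes."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def find(self, key):
        self.reads += 1
        return self.data.get(key)

    def insert(self, key, value):
        self.writes += 1
        self.data[key] = value

def checked_insert(store, row):
    """INSERT with constraints: four lookups must succeed before the write."""
    if store.find(("users", row["login"])) is not None:
        raise ValueError("PRIMARY KEY violation")
    if store.find(("users_name", row["name"])) is not None:
        raise ValueError("UNIQUE violation")
    if store.find(("office", row["officeId"])) is None:
        raise ValueError("unknown office")
    if store.find(("roles", row["jobType"])) is None:
        raise ValueError("unknown role")
    # Mirrors the slide's count of one write; index upkeep is not modeled.
    store.insert(("users", row["login"]), row)

def blind_replace(store, row):
    """REPLACE with constraints removed: write without reading anything."""
    store.insert(("users", row["login"]), row)

row = {"login": "drh", "name": "Richard", "officeId": "3D17", "jobType": "BDFL"}

checked = CountingStore()
checked.data[("office", "3D17")] = {}   # pre-existing parent rows
checked.data[("roles", "BDFL")] = {}
checked_insert(checked, row)

blind = CountingStore()
blind_replace(blind, row)
```

An LSM engine makes the single write cheaper, but the four reads that precede it in the checked case are exactly the operation LSM is slowest at.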
35. Unified Key Namespace
● All tables are stored in a single namespace
● Every key must begin with a “table-id”
● With 100 tables in the schema, every search
begins with about 7 extra key comparisons
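One way to read the "about 7" figure is as the cost of a binary search having to separate 100 table prefixes before it reaches the keys of the table it wants, since log2(100) is roughly 6.6. The sketch below shows that arithmetic along with a prefixed key layout. The 2-byte big-endian table-id and the function names are assumptions for illustration, not SQLite4's actual key format.

```python
import math

def extra_comparisons(n_tables: int) -> int:
    """Comparisons a binary search needs to pick one of n_tables prefixes."""
    return math.ceil(math.log2(n_tables))

def make_key(table_id: int, key: bytes) -> bytes:
    """Prefix a key with a fixed-width table-id (hypothetical layout)."""
    return table_id.to_bytes(2, "big") + key

overhead = extra_comparisons(100)

# In a single namespace, the prefix keeps each table's rows contiguous.
keys = sorted([make_key(2, b"bob"), make_key(1, b"zed"), make_key(1, b"amy")])
```

With a private namespace per table, as in SQLite3, a search starts at that table's own root page and pays none of this overhead.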
36. Lessons
● SQLite3 is already very fast and hard to beat
● LSM is great for “blind” writes, but does not
work as well when constraints must be checked
prior to each write
● Many workloads do more reading than writing
● Store each table and index in its own private
key namespace
38. Back-porting Lessons To SQLite3
● Added WITHOUT ROWID tables
– A backwards-compatible hack that allows any
arbitrary PRIMARY KEY to serve as the key in the
key/value storage
● Faster key comparison routines
● The LSM1 virtual table
– Access an LSM1 database file as a single table
within a larger schema
● Improved LSM techniques in FTS5
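As an illustration of the first back-ported item, the sketch below creates a WITHOUT ROWID table through Python's standard sqlite3 module. It assumes the underlying SQLite library is version 3.8.2 or later, and it uses a reduced form of the users table from the earlier slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The PRIMARY KEY itself is the storage key; there is no hidden rowid.
conn.execute("""
    CREATE TABLE users(
        login TEXT PRIMARY KEY,
        name  TEXT
    ) WITHOUT ROWID
""")
conn.execute("INSERT INTO users(login, name) VALUES('drh', 'Richard')")

row = conn.execute("SELECT name FROM users WHERE login = 'drh'").fetchone()

# Asking for rowid on such a table is an error, unlike ordinary tables.
try:
    conn.execute("SELECT rowid FROM users")
    has_rowid = True
except sqlite3.OperationalError:
    has_rowid = False
```

A lookup by login here needs a single b-tree search, whereas an ordinary rowid table would search the primary key index first and then the table itself.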