Fractal Tree indexes are compared with the indexing incumbent, B-trees. The talk then shows what they bring to MySQL (in TokuDB) and MongoDB (in TokuMX).
Presented at Percona Live London 2013.
Fractal Tree Indexes: From Theory to Practice
1. Fractal Tree Indexes: Theory to Practice
Percona Live London 2013
Tim Callaghan, Tokutek
tim@tokutek.com
@tmcallaghan
Tuesday, November 12, 2013
2. Ever seen this?
IO Utilization Graph, performance is IO limited
3. Who is Tokutek?
Tokutek builds high-performance database software!
TokuDB - storage engine for MySQL and MariaDB
TokuMX - storage engine for MongoDB
[Diagram: Developer Interface on top of the Storage Engine, on top of HDD & SSD storage]
4. Who am I?
• 17-year database consumer
– schema design, development, deployment
– database administration + infrastructure
– mostly Oracle
• 5-year database producer
– 2 years @ VoltDB
– 2+ years @ Tokutek
5. Housekeeping
• Feedback is important to me
• Ideas for Webinars or Presentations?
• Who’s using MongoDB?
• Anyone using TokuDB or TokuMX?
• Please ask questions
6. Agenda
• Why Fractal Tree indexes are cool
• What they enable in MySQL® (TokuDB)
• What they enable in MongoDB® (TokuMX)
• Q+A
13. B-tree Overview - performance
Performance is IO limited when data > RAM:
one IO is needed for each insert/update
(actually, it's one IO for every index on the table)
[Diagram: B-tree with upper nodes (22; 10, 99) held in RAM and leaf nodes (2, 3, 4), (10, 20), (22, 25), (99) on DISK]
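To put the one-IO-per-index rule in numbers, here is a back-of-the-envelope sketch in Python. The disk figure (200 random IOs per second) and the index counts are illustrative assumptions, not measurements from the talk, and `btree_insert_ceiling` is a hypothetical helper.

```python
def btree_insert_ceiling(disk_iops: float, num_indexes: int) -> float:
    """Upper bound on inserts/sec when every index touched costs one random IO.

    Assumes data is much larger than RAM, so each leaf update misses the cache.
    """
    if num_indexes < 1:
        raise ValueError("a table has at least one index (the primary key)")
    return disk_iops / num_indexes


# Hypothetical hardware: a disk that sustains 200 random IOs per second.
for indexes in (1, 2, 4):
    ceiling = btree_insert_ceiling(200, indexes)
    print(f"{indexes} index(es): at most {ceiling} inserts/sec")
```

Under this simple model, every extra index divides the insert ceiling again, which is why secondary indexes are the expensive part.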
15. Fractal Tree Indexes
[Diagram: tree in which each internal node carries a message buffer]
All internal nodes have message buffers
As buffers overflow, they cascade down the tree
Messages are eventually applied to leaf nodes
similar to B-trees
• store data in leaf nodes
• use index key for ordering
different than B-trees
• message buffers
• big nodes (4MB vs. ~16KB)
16. Fractal Tree Indexes - sample data
[Diagram: sample tree with node keys 25, 10, 99 and leaf nodes (2,3,4), (10,20), (22,25), (99)]
Looks a lot like a b-tree!
17. Fractal Tree Indexes - insert
insert 15;
[Diagram: the sample tree (node keys 25, 10, 99; leaf nodes (2,3,4), (10,20), (22,25), (99)) with the message insert (15) held in a message buffer]
• search operations must consider messages along the way
• messages cascade down the tree as buffers fill up
• they are eventually applied to the leaf nodes, hundreds or thousands of operations for a single IO
• CPU and cache are conserved as important data is not ejected
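The buffering idea can be sketched in a few lines of Python. This is a toy model under loud assumptions, not TokuDB's implementation: a single root with one message buffer over four leaves, a made-up buffer limit of 4, and a `leaf_writes` counter standing in for disk IO. The class and method names are hypothetical.

```python
import bisect


class BufferedTree:
    """Toy two-level tree: one root with a message buffer over sorted leaves.

    Real Fractal Tree indexes have many levels and far larger nodes; the
    buffer limit and pivots here are made-up values for illustration.
    """

    def __init__(self, pivots, buffer_limit=4):
        self.pivots = pivots                  # keys >= pivots[i] go to leaf i + 1
        self.leaves = [[] for _ in range(len(pivots) + 1)]
        self.buffer = []                      # pending ("insert", key) messages
        self.buffer_limit = buffer_limit
        self.leaf_writes = 0                  # stand-in for disk IOs

    def _leaf_for(self, key):
        return bisect.bisect_right(self.pivots, key)

    def insert(self, key):
        # An insert is only a message appended to the in-memory buffer.
        self.buffer.append(("insert", key))
        if len(self.buffer) > self.buffer_limit:
            self.flush()

    def flush(self):
        # Cascade: group buffered messages by target leaf, then write each
        # touched leaf once, however many messages it receives.
        touched = {}
        for _op, key in self.buffer:
            touched.setdefault(self._leaf_for(key), []).append(key)
        for leaf_no, keys in touched.items():
            leaf = self.leaves[leaf_no]
            for key in keys:
                if key not in leaf:
                    bisect.insort(leaf, key)
            self.leaf_writes += 1
        self.buffer.clear()

    def search(self, key):
        # A search must consider buffered messages as well as the leaf.
        if any(k == key for _op, k in self.buffer):
            return True
        return key in self.leaves[self._leaf_for(key)]


tree = BufferedTree(pivots=[10, 22, 99])
for k in (2, 3, 4, 10, 20, 22, 25, 99):
    tree.insert(k)
tree.flush()

tree.insert(15)                               # stays in the buffer for now
found_before_flush = tree.search(15)
leaf_before_flush = list(tree.leaves[1])
tree.flush()
print(found_before_flush, leaf_before_flush, tree.leaves[1], tree.leaf_writes)
```

In this model the eight initial keys reach the leaves with far fewer leaf writes than inserts, and key 15 is visible to searches while it is still only a buffered message.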
18. Fractal Tree Indexes - other operations
[Diagram: sample tree with buffered messages delete(8), delete(2), insert (8), add_column(c4 bigint), delete(99), increment(22,+5), insert (100), ...]
Lots of operations can be messages!
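As a rough illustration of operations as messages, the Python sketch below applies a queue of messages to a leaf in arrival order. The tuple format, the `apply_messages` helper, and the stored values are assumptions made for the example; in particular, `increment` is assumed to add to the value stored under a key.

```python
def apply_messages(leaf, messages):
    """Apply buffered messages, oldest first, to a leaf held as {key: value}."""
    for msg in messages:
        op = msg[0]
        if op == "insert":
            _, key, value = msg
            leaf[key] = value
        elif op == "delete":
            leaf.pop(msg[1], None)            # deleting a missing key is a no-op
        elif op == "increment":
            _, key, delta = msg
            if key in leaf:
                leaf[key] += delta
        else:
            raise ValueError(f"unknown message type: {op}")
    return leaf


# Hypothetical leaf contents and pending messages, oldest first.
leaf = {2: 0, 3: 0, 4: 0, 22: 10}
pending = [("delete", 8), ("delete", 2), ("insert", 8, 0), ("increment", 22, 5)]
apply_messages(leaf, pending)
print(leaf)
```

Order matters: because delete(8) arrives before insert(8), key 8 is present once the messages have been applied.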
20. What is TokuDB?
• Transactional MySQL Storage Engine - think InnoDB
• Available for MySQL 5.5 and MariaDB 5.5
• ACID and MVCC
• Free/OSS Community Edition
– http://github.com/Tokutek/ft-engine
• Enterprise Edition
– Commercial support + hot backup
Performance + Compression + Agility
22. Indexed Insertion Performance
• High-performance insert/update/delete for large
databases (> RAM) while maintaining indexes
* old numbers, now > 25K/sec
24. Performance Advantages
• Efficient index maintenance, especially secondary indexes
• Clustered secondary indexes
– Additional copy of the row is stored in the index
– No additional IO to get row data from primary key
– Think better covering index (all non-indexed columns)
– Compression eliminates size concerns
• Big blocks = sequential IO for range scans
– Basement nodes are always co-located
• Multi-threaded bulk loader
31. The Challenge of MySQL Schema Changes
• Common schema changes can take hours in
MySQL
– Adding, dropping, or expanding a column
– Adding an index
• And the table is unavailable for writes during the
process
• As a workaround, people generally
– Use a replication slave, then swap with master
– Use helper tools: Percona OSC, MySQL 5.6
o These have IO, CPU, RAM consequences
32. Schema Changes Without Downtime
• In TokuDB, column add/drop/expand is
instantaneous
– “it’s just a message”
• Indexes can be created in the background while
table is fully available
– TokuDB just builds the index, it does not
rebuild the table (MySQL getting better)
34. What is TokuMX?
• TokuMX = MongoDB with improved storage (Fractal Tree indexes)
• Drop in replacement for MongoDB v2.2 applications
– Including replication and sharding
– Same data model
– Same query language
– Drivers just work
• Open Source
– http://github.com/Tokutek/mongo
Performance + Compression + Transactions
35. MongoDB Storage
db.test.insert({foo:55})
db.test.ensureIndex({foo:1})
[Diagram: two B-tree indexes whose entries point into a memory mapped heap]
PK index (_id + pointer): internal keys 18, 4, 5555; entries (1,ptr5), (4,ptr1), (12,ptr8), (19,ptr7), (10000,ptr2)
Secondary index (foo + pointer): internal keys 85, 40, 120; entries (2,ptr5), (22,ptr6), (50,ptr4), (100,ptr7), (222,ptr3)
The “pointer” tells MongoDB where to look in the heap for the requested
document (another IO)
36. TokuMX Storage
db.test.insert({foo:55})
db.test.ensureIndex({foo:1})
memory mapped heap
PK index (_id + document): internal keys 18, 4, 5555; entries (1,doc), (4,doc), (12,doc), (19,doc), (10000,doc)
Secondary index (foo + _id): internal keys 85, 40, 120; entries (2,4), (22,12), (50,19), (100,10000), (222,1)
One less IO per _id lookup, document is clustered in the index
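The difference between the two layouts can be modeled in Python as below. This is a sketch only: plain dicts stand in for the index trees, a list stands in for the heap, and the step count returned by `find_by_id` is a simplification of real IO behavior. Both class names are hypothetical.

```python
class HeapLayout:
    """MongoDB-style layout: indexes hold pointers into a document heap."""

    def __init__(self):
        self.heap = []            # documents, addressed by position
        self.pk_index = {}        # _id -> heap position
        self.foo_index = {}       # foo -> heap position

    def insert(self, doc):
        self.heap.append(doc)
        ptr = len(self.heap) - 1
        self.pk_index[doc["_id"]] = ptr
        self.foo_index[doc["foo"]] = ptr

    def find_by_id(self, _id):
        ptr = self.pk_index[_id]          # step 1: index lookup
        return self.heap[ptr], 2          # step 2: follow the pointer into the heap


class ClusteredLayout:
    """TokuMX-style layout: the PK index stores the document itself."""

    def __init__(self):
        self.pk_index = {}        # _id -> document
        self.foo_index = {}       # foo -> _id

    def insert(self, doc):
        self.pk_index[doc["_id"]] = doc
        self.foo_index[doc["foo"]] = doc["_id"]

    def find_by_id(self, _id):
        return self.pk_index[_id], 1      # single step: document is in the index

    def find_by_foo(self, foo):
        # Secondary index yields the _id, which is then resolved in the PK index.
        return self.pk_index[self.foo_index[foo]]


heap_db, clustered_db = HeapLayout(), ClusteredLayout()
for db in (heap_db, clustered_db):
    db.insert({"_id": 1, "foo": 55})

heap_result = heap_db.find_by_id(1)
clustered_result = clustered_db.find_by_id(1)
print(heap_result, clustered_result)
```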
38. Performance - Indexed Insertion
• 100 million inserts into a collection with 3 secondary indexes
39. Performance - Inserts on Indexed Arrays
• Indexed Insertion : Multikey (100 inserts per doc)
40. Performance - Replication
• TokuMX replication allows secondary servers to process
replication without IO
– Simply injecting messages into the Fractal Tree
Indexes on the secondary server
– The “Hard Work” was done on the primary
o Uniqueness checking
o Transactional locking
o Update effort (read-before-write)
– Elimination of replication lag
• Your secondaries are fully available for read scaling!
– Wasn’t that the point?
45. Performance - Clustered Indexes
• Clustered secondary indexes
– Additional copy of the document is stored in the index
– No additional IO to get row data from primary key
– Think better covered index (all non-indexed fields)
– Good for point queries, great for range scans
– Compression eliminates size concerns
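A small Python sketch of why a clustered secondary index helps range scans. Dicts stand in for the index trees, the `foo` values are assumed unique to keep the model short, and the function names and the lookup counter are hypothetical.

```python
def range_scan_plain(secondary, primary, lo, hi):
    """Secondary index maps foo -> _id; each hit needs a primary-key lookup."""
    docs, lookups = [], 0
    for value in sorted(secondary):
        if lo <= value <= hi:
            docs.append(primary[secondary[value]])
            lookups += 1
    return docs, lookups


def range_scan_clustered(clustered, lo, hi):
    """Clustered secondary index maps foo -> a full copy of the document."""
    docs = [clustered[value] for value in sorted(clustered) if lo <= value <= hi]
    return docs, 0


# (foo, _id) pairs borrowed from the TokuMX storage slide.
pairs = [(2, 4), (22, 12), (50, 19), (100, 10000), (222, 1)]
primary = {_id: {"_id": _id, "foo": foo} for foo, _id in pairs}
secondary = {foo: _id for foo, _id in pairs}
clustered = {foo: dict(primary[_id]) for foo, _id in pairs}

plain_docs, plain_lookups = range_scan_plain(secondary, primary, 20, 100)
clustered_docs, clustered_lookups = range_scan_clustered(clustered, 20, 100)
print(plain_lookups, clustered_lookups)
```

Both scans return the same documents; the clustered variant trades extra storage (the document copies) for skipping the per-row primary-key lookups.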
46. Performance - Memory Management
• Two approaches to memory management
– MongoDB = memory-mapped files
o Operating system determines what data is
important
– TokuMX = managed cache
o User defined size
o TokuMX determines what data is important
• Run multiple TokuMX instances on a single server
– Each has its own fixed cache size
48. Compression
• MongoDB does not offer compression
– Compressed file systems?
– Shortened field names?
o Remember: each field name is stored in every single document
• TokuMX easily achieves 5x-10x compression
– Buy less disk or flash
– Compressed reads and writes reduce overall IO
• TokuMX supports 3 compression types
– zlib, quicklz, lzma (size vs. speed)
– all data is compressed
• Use descriptive field names!
– They are easy to compress
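The claim about descriptive field names can be checked with a quick Python experiment. It uses the standard library's `zlib` and `lzma` modules (quicklz is not in the standard library, so it is left out), compresses a batch of JSON documents as one block, and uses invented field names and values. It illustrates the principle only and says nothing about TokuMX's actual on-disk format or ratios.

```python
import json
import lzma
import zlib

# Hypothetical documents: same values, descriptive vs. shortened field names.
long_docs = [{"customer_first_name": f"user{i}", "account_balance_cents": i * 100}
             for i in range(1000)]
short_docs = [{"f": d["customer_first_name"], "b": d["account_balance_cents"]}
              for d in long_docs]


def sizes(docs):
    """Return raw and compressed byte counts for a batch of documents."""
    raw = "\n".join(json.dumps(d) for d in docs).encode("utf-8")
    return {"raw": len(raw),
            "zlib": len(zlib.compress(raw)),
            "lzma": len(lzma.compress(raw))}


long_sizes = sizes(long_docs)
short_sizes = sizes(short_docs)
print("descriptive names:", long_sizes)
print("shortened names:  ", short_sizes)
```

Comparing the two printed rows shows how much of the raw size penalty for long field names survives compression.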
49. Compression
• 31 million documents, bit torrent peer data
– http://cs.brown.edu/~pavlo/torrent/
51. ACID + MVCC
• ACID
– In MongoDB, multi-insertion operations allow for
partial success
o Asked to store 5 documents, 3 succeeded
– We offer “all or nothing” behavior
– Document level locking
• MVCC
– In MongoDB, queries can be interrupted by writers.
o The effects of these writers are visible to the reader
– TokuMX offers MVCC
o Reads are consistent as of the operation start
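The partial-success versus all-or-nothing distinction can be sketched with a toy in-memory collection in Python. A dict keyed by `_id` stands in for the collection, a duplicate `_id` is the assumed failure, and every name here is hypothetical; neither function reflects how either server is implemented.

```python
class DuplicateKey(Exception):
    pass


def insert_partial(collection, docs):
    """Insert one by one; a failure leaves the earlier documents in place."""
    for doc in docs:
        if doc["_id"] in collection:
            raise DuplicateKey(doc["_id"])
        collection[doc["_id"]] = doc


def insert_atomic(collection, docs):
    """All or nothing: validate and stage the batch, then publish it at once."""
    staged = {}
    for doc in docs:
        if doc["_id"] in collection or doc["_id"] in staged:
            raise DuplicateKey(doc["_id"])
        staged[doc["_id"]] = doc
    collection.update(staged)


batch = [{"_id": i} for i in (1, 2, 3, 4, 5)]   # _id 4 already exists below

partial = {4: {"_id": 4}}
atomic = {4: {"_id": 4}}
for target, insert in ((partial, insert_partial), (atomic, insert_atomic)):
    try:
        insert(target, batch)
    except DuplicateKey:
        pass

print(sorted(partial), sorted(atomic))
```

Asked to store five documents, the partial version keeps the three that preceded the failure, while the atomic version leaves the collection as it was.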
52. Multi-statement Transactions
• TokuMX brings the following to MongoDB
– db.runCommand({"beginTransaction": 1, "isolation": "mvcc"})
– ... perform 1 or more operations
– db.runCommand("rollbackTransaction") | db.runCommand("commitTransaction")
• Not allowed in sharded environments
– mongos will reject