Neural networks and other machine learning models show promise as alternatives to traditional index structures in database management systems. The Learning Index Framework (LIF) extracts the weights from a trained TensorFlow model and generates efficient index structures in C++. A recursive model index builds a hierarchy of models, with each model selecting the next one based on the key, so that the key space is split into smaller sub-ranges and the required "last mile" accuracy is easier to reach. Hybrid models use a ReLU neural net at the top layer and thousands of simple linear regression models at the bottom to balance performance and accuracy. While the results are promising, learned indexes may not be the best choice in every use case.
2. Codd, E. F. (1970). A Relational
Model of Data for Large Shared
Data Banks. Commun. ACM, 13,
377-387.
Kraska, T., Beutel, A., Chi, E. H., Dean,
J. and Polyzotis, N. (2017). The Case
for Learned Index Structures. arXiv
preprint arXiv:1712.01208.
3. RELATIONAL MODEL
Can be expressed in first-order
predicate logic
Data is represented as tuples,
grouped into relations
Abstraction from physical storage
model
4. INDEX STRUCTURES
Needed for efficient data access
B-Trees, Hash maps, Bloom filters, ...
Need tuning
General data structures, do not
take advantage of data patterns
5. ENTER MACHINE
LEARNING
Replacing core components of a
data management system through
learned models
Traditional indexes are already
models
For efficiency reasons it is common
not to index every single key of the
sorted records, but only the key of
every n-th record
Using other types of models as
indexes can provide benefits
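The "every n-th record" idea above can be sketched as a sparse index over a sorted list. This is a minimal illustration, not code from the talk: the names `build_sparse_index` and `lookup` and the block size of 16 are assumptions chosen for the example.

```python
from bisect import bisect_right

def build_sparse_index(sorted_keys, n):
    """Keep only the key of every n-th record (positions 0, n, 2n, ...)."""
    return sorted_keys[::n]

def lookup(sorted_keys, sparse, n, key):
    """Return the position of key in sorted_keys, or None if it is absent."""
    # The last indexed key <= key marks the block that may contain the key.
    block = bisect_right(sparse, key) - 1
    if block < 0:
        return None  # key is smaller than every stored key
    start = block * n
    # Scan at most n records inside that block.
    for pos in range(start, min(start + n, len(sorted_keys))):
        if sorted_keys[pos] == key:
            return pos
    return None

keys = list(range(0, 1000, 3))          # hypothetical sorted records
sparse = build_sparse_index(keys, n=16)
```

The sparse index maps a key to a region of n records, which is already a coarse prediction of position with a known maximum error.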
6. INDEXES ARE CDF
MODELS
An index is a model that takes a
key as an input and predicts the
position of the record
A model that predicts the position
given a key inside a sorted array
approximates the cumulative
distribution function
F(Key) is the estimated cumulative
distribution function for the data,
i.e. the estimated likelihood of
observing a key smaller than or
equal to the lookup key
The predicted position is
p = F(Key) * N, where N is the
total number of keys
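As a rough sketch of this view, the snippet below fits a straight line to (key, position) pairs, which amounts to scaling a linear CDF estimate by the number of keys, and then corrects the prediction with a bounded binary search. The class name `LearnedIndex`, the choice of a single linear model, and the sample data are assumptions made for illustration. The key list is assumed to be non-empty, sorted, and free of duplicates.

```python
from bisect import bisect_left

def fit_linear(keys):
    """Least-squares line pos ≈ a * key + b over a sorted, non-empty key list."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2                      # mean of positions 0..n-1
    var_x = sum((x - mean_x) ** 2 for x in keys)
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    a = cov / var_x if var_x else 0.0
    return a, mean_y - a * mean_x

class LearnedIndex:
    def __init__(self, keys):
        self.keys = keys
        self.a, self.b = fit_linear(keys)
        # Record the worst under- and over-prediction seen on the stored keys.
        errs = [i - self._predict(k) for i, k in enumerate(keys)]
        self.min_err, self.max_err = min(errs), max(errs)

    def _predict(self, key):
        return int(self.a * key + self.b)

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        guess = self._predict(key)
        # Every stored key lies within [guess + min_err, guess + max_err].
        lo = max(0, guess + self.min_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        pos = bisect_left(self.keys, key, lo, hi)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None

keys = [i * i for i in range(1000)]           # hypothetical, non-uniform keys
index = LearnedIndex(keys)
```

The stored error bounds are what make the prediction usable as an index: the model only has to be close, and the bounded search covers the rest.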
7. ISSUES...
Decision trees in general are very
good at overfitting the data with a
few operations
A single neural net requires
significantly more space and CPU
time for the “last mile”
B-Trees are extremely cache- and
operation-efficient
8. THE LEARNING INDEX
FRAMEWORK (LIF)
Given a trained TensorFlow model,
LIF automatically extracts all
weights from the model and
generates efficient index structures
in C++
Designed for small models
No unnecessary overhead
9. THE RECURSIVE MODEL
INDEX
Challenge: accuracy for last-mile
search
We build a hierarchy of models
Each model takes the key as an
input and based on it picks
another model
10. THE RECURSIVE MODEL
INDEX, 2
We iteratively train each stage ℓ with
its own loss L_ℓ
We separate model size and
complexity from execution cost
We effectively divide the space
into smaller sub-ranges to make it
easier to achieve the required “last
mile” accuracy
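The hierarchy described on the last two slides can be sketched as a two-stage recursive model index. This is a simplified illustration under stated assumptions: both stages use plain linear regression (the talk's top layer is a neural net), the class name `TwoStageRMI` and the leaf count are invented for the example, and keys are assumed sorted and unique.

```python
from bisect import bisect_left

def fit_linear(xs, ys):
    """Least-squares line y ≈ a * x + b; falls back to a constant if xs has no spread."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = cov / var if var else 0.0
    return a, my - a * mx

class TwoStageRMI:
    def __init__(self, keys, num_leaves=16):
        self.keys, self.n, self.m = keys, len(keys), num_leaves
        # Stage 1: one model over all keys predicts a position.
        self.root = fit_linear(keys, range(self.n))
        # The root's prediction, scaled to [0, m), picks the stage-2 model.
        buckets = [[] for _ in range(self.m)]
        for pos, k in enumerate(keys):
            buckets[self._route(k)].append((k, pos))
        # Stage 2: each leaf is trained only on the keys routed to it.
        self.leaves = []
        for b in buckets:
            if not b:
                self.leaves.append((0.0, 0.0, 0, 0))  # no stored key routes here
                continue
            a, c = fit_linear([k for k, _ in b], [p for _, p in b])
            errs = [p - int(a * k + c) for k, p in b]
            self.leaves.append((a, c, min(errs), max(errs)))

    def _route(self, key):
        a, b = self.root
        leaf = int((a * key + b) * self.m / self.n)
        return min(max(leaf, 0), self.m - 1)

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        a, c, lo_err, hi_err = self.leaves[self._route(key)]
        guess = int(a * key + c)
        lo = max(0, guess + lo_err)
        hi = min(self.n, guess + hi_err + 1)
        pos = bisect_left(self.keys, key, lo, hi)
        return pos if pos < self.n and self.keys[pos] == key else None

keys = [i * i for i in range(2000)]   # hypothetical key set
rmi = TwoStageRMI(keys, num_leaves=32)
```

Each leaf only has to model its own sub-range, so its error window, and with it the final search, can be much narrower than that of a single model over all keys.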
11. HYBRID MODELS
Top-layer: rectified linear unit (ReLU)
neural net
At the bottom: thousands of simple,
inexpensive linear regression models
Traditional B-Trees at the bottom if
the data is particularly hard to learn
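The fallback rule on this slide can be sketched per leaf: fit a linear model, and if its worst error exceeds a threshold, keep a conventional structure for that leaf instead. The function names, the threshold value, and the use of a sorted list with binary search as a stand-in for a B-Tree are all assumptions for illustration; the Python standard library has no B-Tree.

```python
from bisect import bisect_left

def fit_linear(xs, ys):
    """Least-squares line y ≈ a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = cov / var if var else 0.0
    return a, my - a * mx

def build_leaf(keys, max_error):
    """Return a linear leaf if it is accurate enough, else a B-Tree stand-in."""
    a, b = fit_linear(keys, range(len(keys)))
    worst = max(abs(i - int(a * k + b)) for i, k in enumerate(keys))
    if worst > max_error:
        return ("btree", keys)            # stand-in: plain binary search
    return ("linear", a, b, worst)

def leaf_lookup(leaf, keys, key):
    """Return the position of key in keys, or None if it is absent."""
    if leaf[0] == "btree":
        lo, hi = 0, len(keys)             # search the whole leaf
    else:
        _, a, b, worst = leaf
        guess = int(a * key + b)
        lo = max(0, guess - worst)        # search only the error window
        hi = min(len(keys), guess + worst + 1)
    pos = bisect_left(keys, key, lo, hi)
    return pos if pos < len(keys) and keys[pos] == key else None

smooth = list(range(0, 200, 2))           # easy to learn
skewed = list(range(99)) + [10**6]        # one outlier ruins the linear fit
smooth_leaf = build_leaf(smooth, max_error=8)
skewed_leaf = build_leaf(skewed, max_error=8)
```

With this rule the worst-case lookup cost of a leaf is bounded by that of the conventional structure, whatever the key distribution looks like.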
12. DOES THIS STUFF
WORK?
Simple NNs can be efficiently
trained using stochastic gradient
descent
A closed-form solution exists for
linear multivariate models
The results are promising, but
“learned indexes” might not be the
best choice in every use case
A new way to think about indexing
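For reference, the closed-form solution mentioned on this slide is the ordinary least-squares estimate. The notation is an assumption for this note: X is the matrix of key features with one row per key, y the vector of true positions, and X^T X is assumed invertible.

```latex
\hat{w} \;=\; \arg\min_{w} \lVert Xw - y \rVert_2^2 \;=\; (X^\top X)^{-1} X^\top y

% Single-feature case, position \approx a \cdot \text{key} + b:
a = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
```

No iterative training is needed for such models, which is what makes thousands of linear leaves cheap to build.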