An overview and discussion on indexing data in Redis to facilitate fast and efficient data retrieval. Presented on September 22nd, 2014 to the Redis Tel Aviv Meetup.
2. A Little About Myself
A Redis Geek and Chief Developers Advocate
at redislabs.com
I write at http://redislabs.com/blog and edit the
Redis Watch newsletter at
http://redislabs.com/redis-watch-archive
3. Motivation
● Redis is a Key-Value datastore -> fetching
is (always) by (primary) key, which is fast
● Searching for keys is expensive - SCAN (or,
god forbid, the "evil" KEYS command)
● Searching for values in keys requires a full
(hash) table scan & sending the data to the
client for processing
5. antirez is Right
● Redis is a "database SDK"
● Indices imply some kind of schema (and
there's none in Redis)
● Redis wasn't made for indexing
● ...
But despite the Creator's humble opinion,
sometimes you still need a fast way to search :)
6. So What is an Index?
"A database index is a data
structure that improves the speed
of data retrieval operations"
Wikipedia, 2014
Space-Time Tradeoff
7. What Can be Indexed?
Data:  Key -> Value
Index: Value -> Key
• Values can be numbers or strings
• Can be derived from "opaque" values:
JSONs, data structures (e.g. Hash),
functions, …
8. Index Operations Checklist
1. Create index from existing data
2. Update the index on
a. Addition of new values
b. Updates of existing values
c. Deletion of keys (and also RENAME/MIGRATE…)
3. Drop the index
4. If needed, do index housekeeping
5. Access keys using the index
9. A Simple Example: Reverse Lookup
Assume the following database, where every
user has a single unique email address:
HMSET user:1 id "1" email "dfucbitz@terah.net"
How would you go about efficiently fetching the
user's ID given an email address?
10. Reverse Lookup (Pseudo) Recipe
import redis

r = redis.Redis(decode_responses=True)

def idxEmailAdd(email, id):  # 2.a
    # SETNX only creates the key if it doesn't exist yet
    if not r.setnx("_email:" + email, id):
        raise Exception("INDEX_EXISTS")

def idxEmailCreate():  # 1
    for u in r.scan_iter(match="user:*"):
        id, email = r.hmget(u, "id", "email")
        idxEmailAdd(email, id)
11. Reverse Lookup Recipe, more admin
def idxEmailDel(email):  # 2.c
    r.delete("_email:" + email)

def idxEmailUpdate(old, new, id):  # 2.b
    idxEmailDel(old)
    idxEmailAdd(new, id)

def idxEmailDrop(): ...  # similar to Create
14. Reverse Lookup Recipe, Analysis
● Asymptotic computational complexity:
o Creating the index: O(N), N is no. of values
o Adding a new value to the index: O(1)
o Deleting a value from the index: O(1)
o Updating a value: O(1) + O(1) = O(1)
o Deleting the index: O(N), N is no. of values
● What about memory? Every key in Redis
takes up some extra space...
15. Hash Index
_email = { "dfucbitz@terah.net": 1,
"foo@bar.baz": 2 ... }
● Small lookups (e.g. countries) → single key
● Big lookups → partitioned to "buckets" (e.g.
by email address hash value)
More info: http://redis.io/topics/memory-optimization
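A minimal sketch of the bucketing idea, assuming CRC32 of the email as the hash and 1024 buckets. Both choices, and the helper name bucket_key, are illustrative assumptions and not part of the original recipe.

```python
import zlib

NUM_BUCKETS = 1024  # assumption: tune so that each bucket Hash stays small


def bucket_key(email, num_buckets=NUM_BUCKETS):
    """Return the name of the Hash key (bucket) that holds this email."""
    # CRC32 is stable across processes, unlike Python's built-in hash()
    bucket = zlib.crc32(email.encode("utf-8")) % num_buckets
    return "_email:%d" % bucket


# With a redis-py client r (not created here):
#   r.hset(bucket_key(email), email, user_id)   # add to the index
#   r.hget(bucket_key(email), email)            # look up by email
print(bucket_key("dfucbitz@terah.net"))
```

The same email always lands in the same bucket, so a lookup costs one hash computation on the client plus a single HGET.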
17. Uniqueness
The lookup recipe makes the assumption that
every user has a single email address and that
it's unique (i.e. 1:1 relationship).
What happens if several keys (users) have the
same indexed value (email)?
18. Non-Uniqueness with Lists
Use lists instead of using Redis' strings/hashes.
To add:
r.lpush("_email:" + email, id) # 2.a
Simple. What about accessing the list for writes
or reads? Naturally, getting all of the list's
members is O(N) but...
19. What?!? WTF do you mean O(N)?!?
Because a Redis List is essentially a linked list,
traversing it requires up to N operations
(LINDEX, LRANGE…). That
means that updates & deletes
are O(N)
Conclusion: suitable when N (i.e. number of
duplicate index entries) is smallish (e.g. < 10)
20. OT: A Tip for Traversing Lists
Lists don't have LSCAN, but with
RPOPLPUSH you can easily do a
circular list pattern and go over all
the members in O(N) w/o copying
the entire list.
More at: http://redis.io/commands/rpoplpush
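The circular list pattern can be sketched without a server by using a Python deque as a stand-in for the Redis List. The helper names are hypothetical; with Redis, each rotation is one RPOPLPUSH with the same key as source and destination, and the loop count comes from LLEN.

```python
from collections import deque


def rpoplpush(lst):
    """Mimic RPOPLPUSH key key: move the tail element to the head."""
    item = lst.pop()
    lst.appendleft(item)
    return item


def visit_all(lst):
    """Visit every member once by rotating the list len(lst) times."""
    seen = []
    for _ in range(len(lst)):  # with Redis, take the count from LLEN
        seen.append(rpoplpush(lst))
    return seen


ids = deque(["3", "2", "1"])  # the order left by LPUSH of 1, then 2, then 3
print(visit_all(ids))
```

After a full cycle the list is back in its original order. If other clients push or pop while the traversal runs, the count taken up front may no longer match, so treat this as a best-effort walk.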
21. Back to Non-Uniqueness - Hashes
Use Hashes to store multiple index values:
r.hset("_email:" + email, id, "") # 2.a
Great - still O(1). How about deleting?
r.hdel("_email:" + email, id) # 2.c
Another O(1).
(The field's value, "", is unused; only the field name matters.)
22. Non-Uniqueness, Sets Variant
r.sadd("_email:" + email, id) # 2.a
Great - still O(1). How about deleting?
r.srem("_email:" + email, id) # 2.c
Another O(1).
23. List vs. Hash vs. Set for NUIVs*
* Non-Unique Index Value
● Memory: List ~= Set ~= Hash (N < 100)
● Performance: List < Set, Hash
● Unlike a List's elements, Set members and
Hash fields are:
o Unique - meaning you can't index the same key
more than once (makes sense).
o Unordered - a non-issue for this type of index.
o SCANable (with SSCAN and HSCAN).
● Forget Lists, use Sets or Hashes.
24. Forget Hashes, Sets are Better
Because of the Set operations:
SUNION, SDIFF, SINTER
Endless possibilities, including
matchmaking:
SINTER _interest:devops _hair:blond _gender:...
25. [This Slide has No Title]
NULL means no value and Redis is all about
values.
When needed, arbitrarily decide on a value for
NULLs (e.g. "<null>") and handle it
appropriately in code.
26. Index Cardinality (~= unique values)
● High cardinality/no duplicates -> use a Hash
● Some duplicates -> use Hash and "pointers"
to Sets
_email = { "dfucbitz@terah.net": 1,
"foo@bar.baz": "*" ...}
_email:foo@bar.baz = { 2, 3 }
● Low cardinality is, however, another story...
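The Hash-with-pointers layout can be sketched as follows. Plain dicts and sets stand in for the Redis Hash and Sets, and the lookup helper is a hypothetical name; the comments show the Redis command each step corresponds to.

```python
POINTER = "*"  # marker from the slide: "look in the Set instead"

# In-memory stand-ins for the Redis keys shown above
email_hash = {"dfucbitz@terah.net": "1", "foo@bar.baz": POINTER}
email_sets = {"_email:foo@bar.baz": {"2", "3"}}


def lookup(email):
    """Return the set of user ids indexed under this email."""
    value = email_hash.get(email)  # HGET _email <email>
    if value is None:
        return set()  # not indexed
    if value == POINTER:
        # SMEMBERS _email:<email>
        return set(email_sets["_email:" + email])
    return {value}  # unique value, stored inline


print(lookup("dfucbitz@terah.net"))
print(lookup("foo@bar.baz"))
```

Unique values cost one round trip and duplicated values cost two, which is the trade this layout makes to avoid a separate Set key for every value.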
27. Low Cardinality
When an indexed attribute has a small number
of possible values (e.g. Boolean, gender...):
● If distribution of values is 50:50, consider not
indexing it at all
● If distribution is heavily unbalanced (5:95),
index only the smaller subset and full-scan the rest
● Use a bitmap index if possible
28. Bitmap Index
Assumption: key names are ordered
How: a Bitset where a bit's position maps to a
key and the bit's value is the indexed value:
first bit -> dfucbitz is online
_isLoggedIn = /100…/
second bit -> foo isn't logged in
29. Bitmap Index, cont.
More than 2 values? Use n Bitsets, where n is
the number of possible indexed values, e.g.:
_isFromTerah = /100.../
_isFromEarth = /010.../
Bonus: BITOP AND / OR / XOR / NOT
BITOP NOT _ET _isFromEarth
BITOP AND onlineET _isLoggedIn _ET
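A small simulation of the two BITOP calls, using bytearrays in place of Redis strings. The helpers are hypothetical stand-ins that follow the Redis convention of bit 0 being the most significant bit of the first byte; they are not a client API.

```python
def setbit(buf, offset, value):
    """Like SETBIT: bit 0 is the most significant bit of byte 0."""
    byte, bit = divmod(offset, 8)
    if byte >= len(buf):
        buf.extend(b"\x00" * (byte + 1 - len(buf)))
    mask = 0x80 >> bit
    if value:
        buf[byte] |= mask
    else:
        buf[byte] &= ~mask & 0xFF


def getbit(buf, offset):
    """Like GETBIT: bits past the end of the string read as 0."""
    byte, bit = divmod(offset, 8)
    if byte >= len(buf):
        return 0
    return (buf[byte] >> (7 - bit)) & 1


def bitop_not(src):
    """Like BITOP NOT: flips every bit, including unused padding bits."""
    return bytearray(b ^ 0xFF for b in src)


def bitop_and(a, b):
    """Like BITOP AND: the shorter input is treated as zero-padded."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\x00"), b.ljust(n, b"\x00")
    return bytearray(x & y for x, y in zip(a, b))


logged_in, from_earth = bytearray(), bytearray()
setbit(logged_in, 0, 1)   # first key (dfucbitz) is logged in
setbit(from_earth, 1, 1)  # second key (foo) is from Earth

et = bitop_not(from_earth)            # BITOP NOT _ET _isFromEarth
online_et = bitop_and(logged_in, et)  # BITOP AND onlineET _isLoggedIn _ET
print(getbit(online_et, 0), getbit(online_et, 1))
```

One thing the simulation makes visible: NOT also sets the bits beyond the last real key, so a negated bitmap is best combined with another bitmap (as the AND does here) before it is read.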
30. Interlude: Redis Indices Save Space
Consider the following: in a relational database
you need "x2" space: for the indexed data
(stored in a table) and for the index itself.
With most Redis indices, you don't have to
store the indexed data -> space saved :)
31. Numerical Ranges with Sorted Sets
Numerical values, including timestamps
(epoch), are trivially indexed with a Sorted Set:
ZADD _yearOfBirth 1972 "1" 1961 "2"...
ZADD _lastLogin 1411245569 "1"
Use ZRANGEBYSCORE and
ZREVRANGEBYSCORE for range queries
32. Ordered "Composite" Numerical Indices
Use Sorted Set scores that are constructed
according to the sort (range) order. Store two values in one
score using the integer and fractional parts:
user:1 = { "id": "1", "weightKg": "82",
"heightCm": "218", ... }
score = weightKg + ( heightCm / 1000 )
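A short sketch of the fractional-part score, assuming heightCm always fits in three digits (0 to 999); the function name and the extra sample users are made up for illustration.

```python
def composite_score(weight_kg, height_cm):
    """Weight goes in the integer part, height in the fractional part."""
    if not 0 <= height_cm < 1000:
        raise ValueError("height_cm must fit in three decimal digits")
    return weight_kg + height_cm / 1000


# user id -> (weightKg, heightCm); users 2 and 3 are invented sample data
users = {"1": (82, 218), "2": (82, 175), "3": (60, 190)}

# Sorting by score orders by weight first and by height second,
# which is the order ZRANGEBYSCORE would return the members in
by_score = sorted(users, key=lambda uid: composite_score(*users[uid]))
print(by_score)
```

The range check matters: a height of 1000 or more would spill into the integer part and corrupt the weight ordering.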
33. "Composite" Numerical Indices, cont.
For more "complex" sorts (up to 53 bits of
precision), you can construct the score like so:
user:1 = { "id": "1", "weightKg": "82",
"heightCm": "218", "IQ": "100", ... }
score = weightKg * 1000000 +
heightCm * 1000 + IQ
Adapted from:
http://www.dr-josiah.com/2013/10/multi-column-sql-like-sorting-in-redis.html
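The packed score can be built and taken apart with integer arithmetic. This is a sketch that assumes heightCm and IQ each fit in 0 to 999; the function names are hypothetical.

```python
def pack_score(weight_kg, height_cm, iq):
    """Pack three fields into one integer score, most significant first."""
    for value in (height_cm, iq):
        if not 0 <= value < 1000:
            raise ValueError("lower-order fields must be in 0..999")
    score = weight_kg * 1000000 + height_cm * 1000 + iq
    if score >= 2 ** 53:
        # Sorted Set scores are doubles: larger integers lose exactness
        raise ValueError("score does not fit in 53 bits")
    return score


def unpack_score(score):
    """Recover (weight_kg, height_cm, iq) from a packed score."""
    rest, iq = divmod(int(score), 1000)
    weight_kg, height_cm = divmod(rest, 1000)
    return weight_kg, height_cm, iq


score = pack_score(82, 218, 100)
print(score, unpack_score(score))
```

Because every field occupies a fixed block of decimal digits, a range query on the leading field alone is still possible: all users weighing 82 kg have scores from 82000000 to 82999999.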
34. Full Text Search (Almost) (v2.8.9+)
ZRANGEBYLEX on Sorted Set members that
have the same score is handy for prefix
searches (trailing wildcard), e.g. dfuc*, a-la
autocomplete: http://autocomplete.redis.io/
Tip: by storing the reversed string (gnirts) you
can also do suffix searches (leading wildcard),
e.g. *terah.net, just as easily.
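Both kinds of search, trailing wildcard (dfuc*) and leading wildcard (*terah.net), can be simulated with a sorted Python list and bisect, which is roughly what ZRANGEBYLEX does over same-score members. The sketch assumes ASCII members; prefix_search is a hypothetical helper and dfunk@earth.org is invented sample data.

```python
import bisect


def prefix_search(sorted_members, prefix):
    """Return the members that start with prefix.

    Roughly: ZRANGEBYLEX key "[<prefix>" "[<prefix>\\xff"
    """
    start = bisect.bisect_left(sorted_members, prefix)
    matches = []
    for member in sorted_members[start:]:
        if not member.startswith(prefix):
            break  # sorted order: no later member can match
        matches.append(member)
    return matches


emails = ["dfucbitz@terah.net", "foo@bar.baz", "dfunk@earth.org"]
forward = sorted(emails)                    # index for dfuc*
backward = sorted(e[::-1] for e in emails)  # reversed strings, for *terah.net

print(prefix_search(forward, "dfu"))

# Leading wildcard: reverse the pattern, search, then reverse the hits
hits = prefix_search(backward, "terah.net"[::-1])
print([h[::-1] for h in hits])
```

The second index doubles the memory used, which is the usual price for supporting both wildcard positions.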
35. Another Nice Thing With Sorted Sets
By combining the use of two of these, it is
possible to map ranges to keys (or just data).
For example, which range does 5 fall into?
ZADD min 1 "low" 4 "medium" 7 "high"
ZADD max 3 "low" 6 "medium" 9 "high"
ZREVRANGEBYSCORE min 5 -inf LIMIT 0 1
ZRANGEBYSCORE max 5 +inf LIMIT 0 1
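The two-query lookup can be modelled with sorted lists and bisect. The lists stand in for the min and max Sorted Sets, and range_of is a hypothetical helper; when both queries return the same member, the value lies inside that range.

```python
import bisect

# (score, member) pairs kept sorted by score, like the two Sorted Sets
MIN = [(1, "low"), (4, "medium"), (7, "high")]
MAX = [(3, "low"), (6, "medium"), (9, "high")]


def range_of(value):
    """Return the name of the range holding value, or None if it is in a gap."""
    # ZREVRANGEBYSCORE min <value> -inf LIMIT 0 1
    # -> the range with the highest lower bound that is <= value
    i = bisect.bisect_right([s for s, _ in MIN], value) - 1
    # ZRANGEBYSCORE max <value> +inf LIMIT 0 1
    # -> the range with the lowest upper bound that is >= value
    j = bisect.bisect_left([s for s, _ in MAX], value)
    if i < 0 or j >= len(MAX):
        return None  # below the first range or above the last one
    lower, upper = MIN[i][1], MAX[j][1]
    return lower if lower == upper else None


print(range_of(5))
```

Comparing the two answers is what catches values that fall between ranges, such as 3.5 here, where the first query says "low" and the second says "medium".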
36. Binary Trees
Everybody knows that
binary trees are really useful
for searching and other stuff.
You can store a binary tree
as an array in a Sorted Set:
(Happy 80th Birthday!)
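One way to read "a binary tree as an array" is the usual heap-style layout, where the node at index i has its children at 2*i + 1 and 2*i + 2 and the index is used as the Sorted Set score. That layout, the sample values, and the helper name are assumptions; the slide does not spell them out. A dict of score to member stands in for the Sorted Set.

```python
# score (array index) -> member; a small binary search tree:
#         8
#       /   \
#      3     10
#     / \      \
#    1   6      14
tree = {0: 8, 1: 3, 2: 10, 3: 1, 4: 6, 6: 14}


def contains(tree, target):
    """Walk down from the root (index 0) looking for target."""
    i = 0
    while i in tree:
        node = tree[i]  # with Redis: ZRANGEBYSCORE _tree <i> <i>
        if target == node:
            return True
        # Left child at 2*i + 1, right child at 2*i + 2
        i = 2 * i + 1 if target < node else 2 * i + 2
    return False


print(contains(tree, 6), contains(tree, 9))
```

Each step is one score lookup, so a search costs as many round trips as the tree is deep unless the walk is moved server-side.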
37. Why stop at binary trees? BTrees!
@thinkingfish from Twitter explained that they
took the BSD implementation of BTrees and
welded it into Redis (open source rulez!). This
allows them to do efficient (speed-wise, not
memory) key and range lookups.
http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html
38. Index Atomicity & Consistency
In a relational database the index is (hopefully)
always in sync with the data.
You can strive for that in Redis, but:
• Your code will be much more complex
• Performance will suffer
• There will be bugs/edge cases/extreme
uses…
39. The Opposite of Atomicity & Consistency
On the other extreme, you could consider
implementing indexing with a:
• Periodic process (lazy indexing)
• Producer/Consumer pattern (i.e. queue)
• Keyspace notifications
You won't have any guarantees, but you'll be
offloading the index creation from the app.
40. Indices, Lua & Clustering
Server-side scripting is an obvious
consideration for implementing a lot (if
not all) of the indexing logic. But ...
… in a cluster setup, a script runs on
a single shard and can only access the
keys there -> no guarantee that a key
and an index are on the same shard.
41. Don't Think – Copy-Paste!
For even more "inspiration" you can review the
source code of popular ORM libraries for
Redis, for example:
• https://github.com/josiahcarlson/rom
• https://github.com/yohanboniface/redis-limpyd