These are the slides for the talk I presented at the LA Web Speed meetup hosted by Yahoo on May 17, 2013 - http://www.meetup.com/LAWebSpeed/events/115663212/
Josiah Carlson 2013-05-16 - Redis analytics
1. A High-Level Pass
Through Redis Analytics*
by Josiah Carlson www.dr-josiah.com
@dr_josiah bit.ly/redis-in-action
2. Agenda
● Quick overview of Redis
● Monthly unique return/churn
○ too much memory method
○ reasonable memory method
○ very low memory method
● Visitor action sequence analytics
○ sequence method
○ low-memory method
● Geographic notifications with partitioning*
3. Quick Redis overview
● Remote key -> data structure server
○ Strings/integers/bitmaps
○ Lists of strings
○ Sets of unique string members
○ Hashes of key -> value
○ Sorted sets (ZSETs) mapping of member -> score
● Supports
○ Persistence
○ Replication
○ Publish/subscribe
○ Server-side Lua scripting (like a stored procedure)
○ Client-side sharding (server side in-progress)
4. Monthly unique return/churn
Problem:
● Say that you have millions of monthly visitors
● Need to know monthly churn, expected
~50%
● Don't want to waste too much memory
5. Monthly unique return/churn
Too much memory:
● Generate UUIDs for users, store in cookie
● Use a HASH mapping from UUIDs to int ids
● Use a HASH mapping from int ids to UUIDs
● Create a ZSET of short ids to timestamp
● Use per-month bitmaps for churn calculation
● Recycle int ids based on old timestamps,
discarding UUIDs and resetting bits
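The bookkeeping on this slide could be sketched roughly as follows. This is a minimal sketch, not the talk's implementation: `conn` is assumed to be a redis-py style client, the key names (`uuid:to:id`, `id:to:uuid`, `id:last-seen`, `visits:<month>`) and function names are hypothetical, and the id recycling step is left out.

```python
import time


def record_visit(conn, user_uuid, month, now=None):
    """Record one visit and return the short integer id for this UUID.

    conn is assumed to be a redis-py style client; key names are illustrative.
    """
    now = time.time() if now is None else now
    short_id = conn.hget('uuid:to:id', user_uuid)
    if short_id is None:
        # Not atomic: two concurrent first visits could allocate two ids.
        # A Lua script or WATCH/MULTI would be needed to close that gap.
        short_id = conn.incr('id:counter')
        conn.hset('uuid:to:id', user_uuid, short_id)   # HASH: UUID -> int id
        conn.hset('id:to:uuid', short_id, user_uuid)   # HASH: int id -> UUID
    short_id = int(short_id)
    conn.zadd('id:last-seen', {short_id: now})         # ZSET: int id -> timestamp
    conn.setbit('visits:' + month, short_id, 1)        # per-month bitmap
    return short_id


def returning_visitors(conn, this_month, last_month):
    """Count ids whose bit is set in both monthly bitmaps."""
    conn.bitop('AND', 'visits:returning',
               'visits:' + this_month, 'visits:' + last_month)
    return conn.bitcount('visits:returning')
```

Churn then follows from the returning count and last month's `BITCOUNT`.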
6. Monthly unique return/churn
Drawbacks:
● Memory use based on size of HASHes and
ZSET (up to about 400 bytes/unique user)
● Second HASH can be thrown away
● The other HASH, ZSET, and bitmaps can be
thrown away and replaced by a "this month"
and "last month" SET (about 120 bytes/user)
● With 63 bit integer UUID and sharding
techniques, about 16 bytes/user
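The "this month" / "last month" SET variant might look like the sketch below. It assumes a redis-py style client `conn`; the key names and both function names are hypothetical, and churn is taken here to mean the fraction of last month's visitors who did not come back, which is an assumption about the definition.

```python
def remember_visitor(conn, user_id, month):
    """Add the visitor to the SET for the given month (key name is illustrative)."""
    conn.sadd('visitors:' + month, user_id)


def monthly_churn(conn, this_month, last_month):
    """Fraction of last month's visitors not seen this month, or None if no data."""
    last_total = conn.scard('visitors:' + last_month)
    if not last_total:
        return None
    # SINTERSTORE keeps the intersection server-side and returns its size.
    returned = conn.sinterstore('visitors:returned',
                                'visitors:' + this_month,
                                'visitors:' + last_month)
    return 1.0 - returned / last_total
```

With integer ids, Redis can store small SETs in its compact integer encoding, which is presumably where the sharding-based savings come from.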
7. Monthly unique return/churn
Reasonable memory solution:
● Store per-month id in a signed cookie (the lower 32 bits are
the unique id for the month, the next 8 bits are the month)
● One month of bitmap
● If this month cookie, do nothing
● If last month cookie and bit isn't set for that id, mark the
bitmap, generate a new cookie, increment unique and
returning counts
● If last month cookie and bit is set, generate a new
cookie
● If old cookie or no cookie, generate a new cookie,
increment unique count
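The four cases above could be sketched like this. Everything beyond the slide is an assumption: the HMAC signing scheme, the `SECRET` value, the running month index, the key names (`replaced:`, `unique:`, `returning:`, `nextid:`), and the use of a separate counter to hand out per-month ids. `conn` is assumed to be a redis-py style client.

```python
import hashlib
import hmac

SECRET = b'change-me'  # hypothetical signing key


def month_index(year, month):
    """Running month number (an assumption); only its low 8 bits are stored."""
    return year * 12 + (month - 1)


def pack_cookie(month_idx, uid):
    """Lower 32 bits: per-month id; next 8 bits: month. Returns 'value.signature'."""
    value = ((month_idx & 0xFF) << 32) | (uid & 0xFFFFFFFF)
    sig = hmac.new(SECRET, str(value).encode(), hashlib.sha256).hexdigest()
    return '%d.%s' % (value, sig)


def unpack_cookie(cookie):
    """Return (month_low_8_bits, uid), or None if missing or badly signed."""
    if not cookie or '.' not in cookie:
        return None
    raw, sig = cookie.split('.', 1)
    good = hmac.new(SECRET, raw.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), good.encode()):
        return None
    value = int(raw)
    return (value >> 32) & 0xFF, value & 0xFFFFFFFF


def handle_visit(conn, cookie, now_idx):
    """Apply the four cases; returns the cookie the client should hold."""
    parsed = unpack_cookie(cookie)
    age = (now_idx - parsed[0]) & 0xFF if parsed else None
    if age == 0:
        return cookie                      # this month's cookie: do nothing
    if age == 1:
        # SETBIT returns the previous bit, so test-and-mark is one command.
        was_set = conn.setbit('replaced:%d' % (now_idx - 1), parsed[1], 1)
        if not was_set:
            conn.incr('unique:%d' % now_idx)
            conn.incr('returning:%d' % now_idx)
    else:
        conn.incr('unique:%d' % now_idx)   # old, invalid, or missing cookie
    return pack_cookie(now_idx, conn.incr('nextid:%d' % now_idx))
```

Only last month's `replaced:` bitmap is ever written, so one month of bitmap is live at a time.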
8. Monthly unique return/churn
Drawbacks:
● Memory use based on unique monthly
counts, ~1 bit per user (not bad)
● If you push to hundreds of millions/billions of
users, you should shard your bitmaps to
minimize realloc cost on bitmap updates
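One way to shard a bitmap is to split the id into a shard number and an offset within that shard. The sketch below assumes a redis-py style client `conn`; the shard size and the `<base_key>:<shard>` naming are arbitrary choices for illustration.

```python
SHARD_BITS = 2 ** 23  # 8,388,608 bits, i.e. 1 MiB per shard (arbitrary choice)


def shard_location(base_key, uid, shard_bits=SHARD_BITS):
    """Map an id to (shard key, bit offset inside that shard)."""
    shard, offset = divmod(uid, shard_bits)
    return '%s:%d' % (base_key, shard), offset


def mark(conn, base_key, uid):
    """Set the bit for uid in its shard; returns the previous bit value."""
    key, offset = shard_location(base_key, uid)
    return conn.setbit(key, offset, 1)
```

Each shard then grows to at most a fixed size, so a high id only extends one small string instead of reallocating one very large one.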
9. Monthly unique return/churn
Very low memory method:
● Store per-month id in a signed cookie
● If this month cookie, do nothing
● If last month cookie, generate a new cookie
for the client, increment unique and returning
counts
● If old cookie or no cookie, generate a new
cookie, increment unique count
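With no bitmap at all, the server side reduces to a decision plus counter increments. In this sketch the month values are assumed to be running integers already extracted from a verified signed cookie (`None` for a missing or invalid one), `conn` is assumed to be a redis-py style client, and the counter key names are hypothetical.

```python
def counters_to_bump(cookie_month, current_month):
    """Return the counter names to increment (empty tuple: keep the cookie)."""
    if cookie_month == current_month:
        return ()
    if cookie_month is not None and current_month - cookie_month == 1:
        return ('unique', 'returning')
    return ('unique',)


def handle_visit(conn, cookie_month, current_month):
    """Bump the counters; returns True if a fresh cookie should be issued."""
    names = counters_to_bump(cookie_month, current_month)
    for name in names:
        conn.incr('%s:%d' % (name, current_month))
    return bool(names)
```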
10. Monthly unique return/churn
Drawback:
● If someone sends you duplicate cookies,
hard to detect (keep "recently replaced"
cache, 5-10 minutes worth is likely good
enough)
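The "recently replaced" cache could be as small as one expiring key per replaced cookie. This sketch assumes a redis-py style client `conn` and a Redis version whose `SET` accepts the `NX` and `EX` options; the key prefix and function name are hypothetical.

```python
def first_replacement(conn, old_cookie, ttl=600):
    """True only for the first request presenting old_cookie within ttl seconds.

    SET ... NX EX creates the marker only if it does not already exist,
    and lets it expire on its own after ttl seconds.
    """
    return bool(conn.set('replaced:' + old_cookie, 1, nx=True, ex=ttl))
```

Counters would then be incremented only when this returns True.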
11. Tangent on ZSETs
This slide is a filler so that I can talk about one
of my favorite "get rid of ZSETs" tricks, which
results in significant memory savings for a fairly
large subset of problems
13. Visitor action sequences
Sequence method:
● Each user gets a LIST
● All users are recorded in a ZSET with a score based on
time
● Each action/page RPUSHes the action/page to the LIST
● Clean-up/analyze old sequences based on timestamps
in the ZSET
Drawbacks:
● Memory use can be high for active users
● More detailed events can use more memory
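A rough sketch of the sequence method, assuming a redis-py style client `conn`; the key names (`actions:<user>`, `users:last-action`) and function names are hypothetical, and the clean-up loop is not atomic.

```python
import time


def record_action(conn, user_id, action, now=None):
    """Append the action to the user's LIST and refresh the user's ZSET score."""
    now = time.time() if now is None else now
    conn.rpush('actions:%s' % user_id, action)
    conn.zadd('users:last-action', {user_id: now})


def collect_idle(conn, idle_seconds, now=None):
    """Pop the sequences of users idle for at least idle_seconds.

    Returns {user_id: [actions...]} for analysis. A user acting between the
    read and the delete could lose an action; a Lua script would avoid that.
    """
    now = time.time() if now is None else now
    finished = {}
    cutoff = now - idle_seconds
    for user_id in conn.zrangebyscore('users:last-action', '-inf', cutoff):
        if isinstance(user_id, bytes):
            user_id = user_id.decode()
        key = 'actions:%s' % user_id
        finished[user_id] = conn.lrange(key, 0, -1)
        conn.delete(key)
        conn.zrem('users:last-action', user_id)
    return finished
```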
14. Visitor action sequences
Low memory method:
● Each user gets a bitmap (limit your unique events)
● All actions are mapped to an index in the bitmap
● When a user performs the action/visits the page, set the
bit and update the ZSET
● Clean up/analyze old bitmaps based on timestamps in
the ZSET
Drawbacks:
● No more strict sequence analysis possible
● Memory use is dominated by ZSET storage
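The bitmap variant might be sketched as follows, assuming a redis-py style client `conn`. The `ACTION_INDEX` table, key names, and function names are hypothetical.

```python
import time

# Hypothetical, fixed mapping from action/page name to bit index.
ACTION_INDEX = {'home': 0, 'search': 1, 'cart': 2, 'checkout': 3}


def record_action(conn, user_id, action, now=None):
    """Set the action's bit for this user and refresh the user's ZSET score."""
    now = time.time() if now is None else now
    conn.setbit('seen:%s' % user_id, ACTION_INDEX[action], 1)
    conn.zadd('users:last-action', {user_id: now})


def actions_seen(conn, user_id):
    """Return the set of action names whose bit is set (order is not kept)."""
    key = 'seen:%s' % user_id
    return {name for name, idx in ACTION_INDEX.items() if conn.getbit(key, idx)}
```

Reading the bits one `GETBIT` at a time costs a round trip per action; a pipeline or a single `GET` of the whole bitmap would cut that down.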
15. Geo Notifications
Problem:
● Want to send events to nearby users
● Don't want users to be notified too often
● Reduce radius of results as notifications rise
● Increase radius of results as notifications fall
● Allow for history to be received on connect
16. Geo Notifications
● Consider the world as a recursively-divided series of
blocks (highest level as 1x1 degree)
● Clients subscribe to all block levels that their user is in
or is interested in
● When writing an event at point (lat,lon):
○ Add the event id to ZSETs to as deep a partition as you would ever
expect to need
○ Trim the ZSETs along the way based on your desired history
○ Check the resulting size of the ZSETs to determine the highest-level
block that is under your limit
○ Publish the event to a channel based on that level
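The write path above could be sketched as below. Several details are assumptions rather than anything stated on the slide: each level halves the block side (level 0 is the 1x1 degree block), history is trimmed by a time window, the key/channel format is `geo:<level>:<lat cell>:<lon cell>`, and `conn` is a redis-py style client.

```python
import math

MAX_LEVEL = 4  # deepest partition kept; level 0 is the 1x1 degree block


def block_key(lat, lon, level):
    """Key (and channel name) of the block containing (lat, lon) at a level."""
    cells = 2 ** level  # blocks per degree along each axis at this level
    return 'geo:%d:%d:%d' % (level, math.floor(lat * cells), math.floor(lon * cells))


def publish_event(conn, event_id, lat, lon, now, window=3600, limit=20):
    """Record the event at every level, then publish at the largest quiet block.

    Returns the channel used. Falls back to the deepest level if every
    block is over the limit.
    """
    chosen = MAX_LEVEL
    found = False
    for level in range(MAX_LEVEL + 1):
        key = block_key(lat, lon, level)
        conn.zadd(key, {event_id: now})
        conn.zremrangebyscore(key, '-inf', now - window)  # trim old history
        if not found and conn.zcard(key) <= limit:
            chosen, found = level, True
    channel = block_key(lat, lon, chosen)
    conn.publish(channel, event_id)
    return channel
```

A client at a given position would subscribe to `block_key(lat, lon, level)` for every level from 0 to `MAX_LEVEL`, and could read the same ZSETs on connect to fetch history.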
17. Geo Notifications
Drawbacks:
● Event id/timestamp information is duplicated
● Large histories may use significant memory
(ZSETs can be replaced by LISTs with
minimal changes)
● Old data in unvisited blocks isn't cleaned
out (can add expiration)