2. Combining “big” and “real-time” is hard
Analytics means live & historical queries, drill-downs,
trends... aggregates... and roll-ups
3. What is Realtime Analytics?
e.g. “show me the number of mentions of
‘Acunu’ per day, between May and
November 2011, on Twitter”
A batch (Hadoop) approach would
require processing ~30 billion tweets,
or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
4. Okay, so how are we
going to do it?
[diagram: Twitter → tweets → ? → counter updates]
• Push processing into ingest phase
• Make queries fast
5. Okay, so how are we
going to do it?
For each tweet,
increment a bunch of counters,
such that answering a query
is as easy as reading some counters
6. Preparing the data
Step 1: Get a feed of the tweets:
  12:32:15 I like #trafficlights
  12:33:43 Nobody expects...
  12:33:49 I ate a #bee; woe is...
  12:34:04 Man, @acunu rocks!
Step 2: Tokenise the tweet
Step 3: Increment counters in time buckets for each token:
  [1234, man]   +1
  [1234, acunu] +1
  [1234, rock]  +1
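The three steps above can be sketched in a few lines of Python. This is an illustrative stand-in only: a plain dict plays the role of the Cassandra counter columns, and the tokeniser is a guess at what “tokenise” means here (the slide also stems “rocks” to “rock”, which this sketch skips).

```python
import re
from collections import defaultdict

# In-memory stand-in for the counter store; the real pipeline would
# send these increments to Cassandra counter columns instead.
counters = defaultdict(int)

def minute_bucket(timestamp):
    """Truncate an 'HH:MM:SS' timestamp to a minute bucket like '1234'."""
    hh, mm, _ = timestamp.split(":")
    return hh + mm

def tokenise(text):
    # Lower-case, strip @/# markers and punctuation, split on whitespace.
    words = (re.sub(r"\W+", "", w.lstrip("@#")).lower() for w in text.split())
    return [w for w in words if w]

def ingest(timestamp, text):
    # Step 3: one increment per (time bucket, token) pair.
    bucket = minute_bucket(timestamp)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

ingest("12:34:04", "Man, @acunu rocks!")
# counters now holds ('1234', 'man') -> 1, ('1234', 'acunu') -> 1, ...
```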
7. Querying
Step 1: Do a range query
  start: [01/05/11, acunu]
  end:   [30/05/11, acunu]
Step 2: Result table
  Key                      #Mentions
  [01/05/11 00:01, acunu]  3
  [01/05/11 00:02, acunu]  5
  ...                      ...
Step 3: Plot pretty graph
  [chart: mentions per month, May–Nov]
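The query side can be sketched the same way. This toy version scans an in-memory table rather than issuing a real Cassandra range query, and it relies on dd/mm/yy strings comparing correctly within a single month and year (an assumption; real keys would be sortable timestamps).

```python
# Toy table: row key = (day, term), columns = per-minute counts.
rows = {
    ("01/05/11", "acunu"): {"00:01": 3, "00:02": 5},
    ("02/05/11", "acunu"): {"00:01": 12, "00:02": 4},
}

def mentions_per_day(term, start_day, finish_day):
    # Step 1: range over the rows for this term.
    # Step 2: collapse each day's minute counters into one total,
    # ready for Step 3 (plotting).
    result = {}
    for (day, t), cols in sorted(rows.items()):
        if t == term and start_day <= day <= finish_day:
            result[day] = sum(cols.values())
    return result

print(mentions_per_day("acunu", "01/05/11", "30/05/11"))
# {'01/05/11': 8, '02/05/11': 16}
```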
8. Except it’s not that easy...
• Cassandra best practice is to use the RandomPartitioner,
so range queries over rows are not possible
• Could manually work out each row in the range and do lots of
point gets
• This would suck - each query would be hundreds of random
IOs on disk
• Need to use wide rows, so a range query becomes a column slice
and each query costs ~1 IO - denormalisation
9. So instead of this...
  Key                      #Mentions
  [01/05/11 00:01, acunu]  3
  [01/05/11 00:02, acunu]  5
  ...                      ...
We do this
  Key                00:01  00:02  ...
  [01/05/11, acunu]  3      5      ...
  [02/05/11, acunu]  12     4      ...
  ...                ...    ...
Row key is the ‘big’ time bucket (day);
column key is the ‘small’ time bucket (minute)
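The two layouts above can be contrasted in plain Python (dicts standing in for Cassandra rows; the specific keys are the slide’s example data). The point is that in the wide layout, a whole day of a term’s counters lives in one row, so a month-long query is ~30 contiguous column slices instead of thousands of point gets.

```python
# Narrow layout: one row per (minute, term). A month of minutes for a
# single term is tens of thousands of rows, hence many random reads.
narrow = {("01/05/11 00:01", "acunu"): 3,
          ("01/05/11 00:02", "acunu"): 5}

# Wide layout: one row per (day, term), one column per minute.
wide = {("01/05/11", "acunu"): {"00:01": 3, "00:02": 5}}

def slice_columns(store, row_key, start_col, finish_col):
    """One 'IO': read a contiguous column slice from a single wide row."""
    cols = store.get(row_key, {})
    return {c: v for c, v in sorted(cols.items())
            if start_col <= c <= finish_col}

print(slice_columns(wide, ("01/05/11", "acunu"), "00:00", "23:59"))
# {'00:01': 3, '00:02': 5}
```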
12. 1. Get a twitter account - http://twitter.com
2. Get some Cassandra VMs - http://goo.gl/Ruqlt
3. Cluster them up
4. Get the code - http://goo.gl/VxXKB
5. Implement the missing bits!
6. (Prizes for the ones that spot bugs!)
14. Cluster them up
• SSH in, set password (on both!)
• Check you can connect to the UI
• Use UI (click add host)
15. Get the code
SSH into one of the VMs:
# curl https://acunu-oss.s3.amazonaws.com/painbird-2.tar.gz | tar zxf -
# cd release
# ./painbird.py -u tom_wilkie
16. Implement the “core”
• In core.py
• def insert_tweet(cassandra, tweet):
• def do_query(cassandra, term, start, finish):
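These are the bits the workshop asks you to implement, so the following is only one possible shape of an answer, not the official solution. A fake in-memory object stands in for the `cassandra` client, the `keywords` row layout follows the wide-row scheme from slide 9, and the tweet field names (`created_at`, `text`) are assumptions.

```python
from collections import defaultdict

# Stand-in for the 'keywords' column family: row key -> {column: count}.
# The real exercise would increment Cassandra counter columns instead.
class FakeCassandra:
    def __init__(self):
        self.keywords = defaultdict(lambda: defaultdict(int))

def insert_tweet(cassandra, tweet):
    # Wide-row scheme: row key = (day, token), column key = minute.
    day = tweet["created_at"][:10]        # e.g. '2011-05-01'
    minute = tweet["created_at"][11:16]   # e.g. '00:01'
    for token in tweet["text"].lower().split():
        token = token.strip("@#.,!?")
        if token:
            cassandra.keywords[(day, token)][minute] += 1

def do_query(cassandra, term, start, finish):
    # Sum each day's column slice for rows whose day is in [start, finish].
    return {day: sum(cols.values())
            for (day, t), cols in sorted(cassandra.keywords.items())
            if t == term and start <= day <= finish}

c = FakeCassandra()
insert_tweet(c, {"created_at": "2011-05-01 00:01:00",
                 "text": "Man, @acunu rocks!"})
print(do_query(c, "acunu", "2011-05-01", "2011-05-30"))
# {'2011-05-01': 1}
```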
17. Check your data
-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)