2. Combining “big” and “real-time” is hard
Analytics means live & historical queries, drill-downs,
trends... aggregates... and roll-ups
3. What is Realtime Analytics?
e.g. “show me the number of mentions of
‘Acunu’ per day, between May and
November 2011, on Twitter”
A batch (Hadoop) approach would
require processing ~30 billion tweets,
or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
4. Okay, so how are we
going to do it?
[diagram: Twitter → tweets → ? → counter updates]
• Push processing into ingest phase
• Make queries fast
5. Okay, so how are we
going to do it?
For each tweet,
increment a bunch of counters,
such that answering a query
is as easy as reading some counters
6. Preparing the data
Step 1: Get a feed of the tweets:
  12:32:15 I like #trafficlights
  12:33:43 Nobody expects...
  12:33:49 I ate a #bee; woe is...
  12:34:04 Man, @acunu rocks!
Step 2: Tokenise the tweet
Step 3: Increment counters in time buckets for each token:
  [1234, man]   +1
  [1234, acunu] +1
  [1234, rock]  +1
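The three steps above can be sketched in a few lines of Python. This is an illustrative stand-in only: a plain dict plays the role of the Cassandra counter columns, and the tokeniser is a guess at what “tokenise” means here (the slide also stems “rocks” to “rock”, which this sketch skips).

```python
import re
from collections import defaultdict

# In-memory stand-in for the counter store; the real pipeline would
# send these increments to Cassandra counter columns instead.
counters = defaultdict(int)

def minute_bucket(timestamp):
    """Truncate an 'HH:MM:SS' timestamp to a minute bucket like '1234'."""
    hh, mm, _ = timestamp.split(":")
    return hh + mm

def tokenise(text):
    # Lower-case, strip @/# markers and punctuation, split on whitespace.
    words = (re.sub(r"\W+", "", w.lstrip("@#")).lower() for w in text.split())
    return [w for w in words if w]

def ingest(timestamp, text):
    # Step 3: one increment per (time bucket, token) pair.
    bucket = minute_bucket(timestamp)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

ingest("12:34:04", "Man, @acunu rocks!")
# counters now holds ('1234', 'man') -> 1, ('1234', 'acunu') -> 1, ...
```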
7. Querying
Step 1: Do a range query
  start: [01/05/11, acunu]
  end:   [30/05/11, acunu]
Step 2: Result table
  Key                      #Mentions
  [01/05/11 00:01, acunu]  3
  [01/05/11 00:02, acunu]  5
  ...                      ...
Step 3: Plot pretty graph
  [chart: mentions per month, May–Nov]
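The query side can be sketched the same way. This toy version scans an in-memory table rather than issuing a real Cassandra range query, and it relies on dd/mm/yy strings comparing correctly within a single month and year (an assumption; real keys would be sortable timestamps).

```python
# Toy table: row key = (day, term), columns = per-minute counts.
rows = {
    ("01/05/11", "acunu"): {"00:01": 3, "00:02": 5},
    ("02/05/11", "acunu"): {"00:01": 12, "00:02": 4},
}

def mentions_per_day(term, start_day, finish_day):
    # Step 1: range over the rows for this term.
    # Step 2: collapse each day's minute counters into one total,
    # ready for Step 3 (plotting).
    result = {}
    for (day, t), cols in sorted(rows.items()):
        if t == term and start_day <= day <= finish_day:
            result[day] = sum(cols.values())
    return result

print(mentions_per_day("acunu", "01/05/11", "30/05/11"))
# {'01/05/11': 8, '02/05/11': 16}
```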
8. Except it’s not that easy...
• Cassandra best practice is to use the RandomPartitioner,
so range queries over rows are not possible
• Could manually work out each row in the range and do lots of
point gets
• This would suck - each query would be hundreds of random
IOs on disk
• Need to use wide rows, so a range query becomes a column slice
and each query costs ~1 IO - denormalisation
9. So instead of this...
  Key                      #Mentions
  [01/05/11 00:01, acunu]  3
  [01/05/11 00:02, acunu]  5
  ...                      ...
We do this
  Key                00:01  00:02  ...
  [01/05/11, acunu]  3      5      ...
  [02/05/11, acunu]  12     4      ...
  ...                ...    ...
Row key is the ‘big’ time bucket (day);
column key is the ‘small’ time bucket (minute)
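The two layouts above can be contrasted in plain Python (dicts standing in for Cassandra rows; the specific keys are the slide’s example data). The point is that in the wide layout, a whole day of a term’s counters lives in one row, so a month-long query is ~30 contiguous column slices instead of thousands of point gets.

```python
# Narrow layout: one row per (minute, term). A month of minutes for a
# single term is tens of thousands of rows, hence many random reads.
narrow = {("01/05/11 00:01", "acunu"): 3,
          ("01/05/11 00:02", "acunu"): 5}

# Wide layout: one row per (day, term), one column per minute.
wide = {("01/05/11", "acunu"): {"00:01": 3, "00:02": 5}}

def slice_columns(store, row_key, start_col, finish_col):
    """One 'IO': read a contiguous column slice from a single wide row."""
    cols = store.get(row_key, {})
    return {c: v for c, v in sorted(cols.items())
            if start_col <= c <= finish_col}

print(slice_columns(wide, ("01/05/11", "acunu"), "00:00", "23:59"))
# {'00:01': 3, '00:02': 5}
```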
12. 1. Get a twitter account - http://twitter.com
2. Get some Cassandra VMs - http://goo.gl/Ruqlt
3. Cluster them up
4. Get the code - http://goo.gl/VxXKB
5. Implement the missing bits!
6. (Prizes for the ones that spot bugs!)
14. Cluster them up
• SSH in, set password (on both!)
• Check you can connect to the UI
• Use UI (click add host)
15. Get the code
SSH into one of the VMs:
# curl https://acunu-oss.s3.amazonaws.com/painbird-2.tar.gz | tar zxf -
# cd release
# ./painbird.py -u tom_wilkie
16. Implement the “core”
• In core.py
• def insert_tweet(cassandra, tweet):
• def do_query(cassandra, term, start, finish):
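These are the bits the workshop asks you to implement, so the following is only one possible shape of an answer, not the official solution. A fake in-memory object stands in for the `cassandra` client, the `keywords` row layout follows the wide-row scheme from slide 9, and the tweet field names (`created_at`, `text`) are assumptions.

```python
from collections import defaultdict

# Stand-in for the 'keywords' column family: row key -> {column: count}.
# The real exercise would increment Cassandra counter columns instead.
class FakeCassandra:
    def __init__(self):
        self.keywords = defaultdict(lambda: defaultdict(int))

def insert_tweet(cassandra, tweet):
    # Wide-row scheme: row key = (day, token), column key = minute.
    day = tweet["created_at"][:10]        # e.g. '2011-05-01'
    minute = tweet["created_at"][11:16]   # e.g. '00:01'
    for token in tweet["text"].lower().split():
        token = token.strip("@#.,!?")
        if token:
            cassandra.keywords[(day, token)][minute] += 1

def do_query(cassandra, term, start, finish):
    # Sum each day's column slice for rows whose day is in [start, finish].
    return {day: sum(cols.values())
            for (day, t), cols in sorted(cassandra.keywords.items())
            if t == term and start <= day <= finish}

c = FakeCassandra()
insert_tweet(c, {"created_at": "2011-05-01 00:01:00",
                 "text": "Man, @acunu rocks!"})
print(do_query(c, "acunu", "2011-05-01", "2011-05-30"))
# {'2011-05-01': 1}
```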
17. Check your data
-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)