SlideShare a Scribd company logo
1 of 19
Download to read offline
Realtime Analytics
 with Cassandra
   or: How I Learned to
  Stopped Worrying and
       Love Counting
Combining “big” and “real-time” is hard

    Live & historical                    Drill downs
                         Trends...
      aggregates...                      and roll ups




2
                                                        Analytics
What is Realtime Analytics?
    eg “show me the number of mentions of
        ‘Acunu’ per day, between May and
          November 2011, on Twitter”


       Batch (Hadoop) approach would
    require processing ~30 billion tweets,
              or ~4.2 TB of data
                 http://blog.twitter.com/2011/03/numbers.html
Okay, so how are we
            going to do it?
                                counter
                                updates
             tweets
Twitter                   ?

  •   Push processing into ingest phase
  •   Make queries fast
Okay, so how are we
     going to do it?
For each tweet,
increment a bunch of counters,
such that answering a query
is as easy as reading some counters
Preparing the data
                              12:32:15 I like #trafficlights
Step 1: Get a feed of    12:33:43 Nobody expects...
        the tweets     12:33:49 I ate a #bee; woe is...
                      12:34:04 Man, @acunu rocks!

Step 2: Tokenise the
        tweet

Step 3: Increment counters            [1234, man]   +1
        in time buckets for           [1234, acunu] +1
        each token                    [1234, rock] +1
Querying
                            start: [01/05/11, acunu]
Step 1: Do a range query    end:   [30/05/11, acunu]

                                       Key            #Mentions
                              [01/05/11 00:01, acunu]    3
Step 2: Result table          [01/05/11 00:02, acunu]    5
                                        ...              ...


                              90

Step 3: Plot pretty graph     45
                               0
                                   May Jun Jul Aug Sept Oct Nov
Except it’s not that easy...
• Cassandra best practice is to use RandomPartitioner,
  so not possible to range queries on rows
• Could manually work out each row in range, do lots of
  point gets
  • This would suck - each query would be 100’s of random
    IOs on disk
• Need to use wide rows, range query is a column slice,
  each query ~1 IO - Denormalisation
So instead of this...
                              Key            #Mentions
                     [01/05/11 00:01, acunu]    3
                     [01/05/11 00:02, acunu]    5
                               ...              ...




                     We do this
                  Key           00:01       00:02        ...
            [01/05/11, acunu]     3           5          ...
            [02/05/11, acunu]    12           4          ...
                    ...           ...                    ...

Row key is ‘big’                     Column key is ‘small’
 time bucket                             time bucket
Demo
./painbird.py -u tom_wilkie

    http://ec2-176-34-212-226.eu-
 west-1.compute.amazonaws.com:8000
Now its your
  turn.....
1. Get a twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/Ruqlt

3. Cluster them up

4. Get the code - http://goo.gl/VxXKB

5. Implement the missing bits!

6. (Prizes for the ones that spot bugs!)
Get some Cassandra
       VMs


http://goo.gl/O9hkv
Cluster them up
• SSH in, set password (on both!)
• Check you can connect to the UI
• Use UI (click add host)
Get the code
SSH into one of the VMs:
# curl https://acunu-
oss.s3.amazonaws.com/
painbird-2.tar.gz | tar zxf -
# cd release
# ./painbird.py -u tom_wilkie
Implement the “core”

• In core.py
• def insert_tweet(cassandra, tweet):
• def do_query(cassandra, term, start, finish):
Check you data
-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)
Extensions
Extensions
UI
• Pretty graphs
• Automatically periodically update?
• Search multiple terms
Painbird
•   mentions of multiple terms
•   sentiment analysis - http://www.nltk.org/
•   filtering by multiple fields (geo + keyword)

More Related Content

Similar to Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DB200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DB
cookie1969
 
Upgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian AdminsUpgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian Admins
Atlassian
 

Similar to Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012 (15)

More Mining
More MiningMore Mining
More Mining
 
Earthquake shakes twitter users
Earthquake shakes twitter usersEarthquake shakes twitter users
Earthquake shakes twitter users
 
It's all about the timing
It's all about the timingIt's all about the timing
It's all about the timing
 
Reflex - How Does It Work? (extended dance remix)
Reflex - How Does It Work? (extended dance remix)Reflex - How Does It Work? (extended dance remix)
Reflex - How Does It Work? (extended dance remix)
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
 
Pointers lesson 3 (data types and pointer arithmetics)
Pointers lesson 3 (data types and pointer arithmetics)Pointers lesson 3 (data types and pointer arithmetics)
Pointers lesson 3 (data types and pointer arithmetics)
 
200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DB200603ash.pdf Performance Tuning Oracle DB
200603ash.pdf Performance Tuning Oracle DB
 
Analyze database system using a 3 d method
Analyze database system using a 3 d methodAnalyze database system using a 3 d method
Analyze database system using a 3 d method
 
Asynchronous Awesome
Asynchronous AwesomeAsynchronous Awesome
Asynchronous Awesome
 
Upgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian AdminsUpgrade and Self Help Tools for Atlassian Admins
Upgrade and Self Help Tools for Atlassian Admins
 
SKB Kontur: Digging Cassandra cluster
SKB Kontur: Digging Cassandra clusterSKB Kontur: Digging Cassandra cluster
SKB Kontur: Digging Cassandra cluster
 
Digging Cassandra Cluster
Digging Cassandra ClusterDigging Cassandra Cluster
Digging Cassandra Cluster
 
Storm 2012 03-29
Storm 2012 03-29Storm 2012 03-29
Storm 2012 03-29
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 

More from Acunu

Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
Acunu
 
Acunu Analytics
Acunu AnalyticsAcunu Analytics
Acunu Analytics
Acunu
 

More from Acunu (20)

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on Cassandra
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra Apps
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with Cassandra
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra London
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your business
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: Cassandra
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into Cassandra
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation Cassandra
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
 
Cassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowCassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard Low
 
Acunu Analytics
Acunu AnalyticsAcunu Analytics
Acunu Analytics
 
Cassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureCassandra Performance: Past, present & future
Cassandra Performance: Past, present & future
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

  • 1. Realtime Analytics with Cassandra or: How I Learned to Stopped Worrying and Love Counting
  • 2. Combining “big” and “real-time” is hard Live & historical Drill downs Trends... aggregates... and roll ups 2 Analytics
  • 3. What is Realtime Analytics? eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html
  • 4. Okay, so how are we going to do it? counter updates tweets Twitter ? • Push processing into ingest phase • Make queries fast
  • 5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
  • 6. Preparing the data 12:32:15 I like #trafficlights Step 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks! Step 2: Tokenise the tweet Step 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +1
  • 7. Querying start: [01/05/11, acunu] Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3 Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90 Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov
  • 8. Except it’s not that easy... • Cassandra best practice is to use RandomPartitioner, so not possible to range queries on rows • Could manually work out each row in range, do lots of point gets • This would suck - each query would be 100’s of random IOs on disk • Need to use wide rows, range query is a column slice, each query ~1 IO - Denormalisation
  • 9. So instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket
  • 10. Demo ./painbird.py -u tom_wilkie http://ec2-176-34-212-226.eu- west-1.compute.amazonaws.com:8000
  • 11. Now its your turn.....
  • 12. 1. Get a twitter account - http://twitter.com 2. Get some Cassandra VMs - http://goo.gl/Ruqlt 3. Cluster them up 4. Get the code - http://goo.gl/VxXKB 5. Implement the missing bits! 6. (Prizes for the ones that spot bugs!)
  • 13. Get some Cassandra VMs http://goo.gl/O9hkv
  • 14. Cluster them up • SSH in, set password (on both!) • Check you can connect to the UI • Use UI (click add host)
  • 15. Get the code SSH into one of the VMs: # curl https://acunu- oss.s3.amazonaws.com/ painbird-2.tar.gz | tar zxf - # cd release # ./painbird.py -u tom_wilkie
  • 16. Implement the “core” • In core.py • def insert_tweet(cassandra, tweet): • def do_query(cassandra, term, start, finish):
  • 17. Check you data -bash-3.2$ cassandra-cli Connected to: "Test Cluster" on localhost/9160 Welcome to Cassandra CLI version 1.0.8.acunu2 Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit. [default@unknown] use painbird; Authenticated to keyspace: painbird [default@painbird] list keywords; Using default limit of 100 ------------------- RowKey: m-5-"woe => (counter=11, value=1)
  • 19. Extensions UI • Pretty graphs • Automatically periodically update? • Search multiple terms Painbird • mentions of multiple terms • sentiment analysis - http://www.nltk.org/ • filtering by multiple fields (geo + keyword)