Realtime Analytics with Apache Cassandra
Tom Wilkie
Founder & CTO, Acunu Ltd
@tom_wilkie
Combining “big” and “real-time” is hard

    Live & historical aggregates...
    Trends...
    Drill downs and roll ups

                                                        Analytics
Solution              Con

                      Scalability, $$$
                      Not realtime
                      Spartan query semantics => complex, DIY solutions

Example I
    e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

    A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
    http://blog.twitter.com/2011/03/numbers.html
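
A back-of-envelope check of those figures can be sketched as follows. Both inputs are assumptions, not numbers stated on the slide: roughly 140 million tweets per day, and 140 bytes per tweet (text only, no metadata).

```python
from datetime import date

# Assumptions (not stated on the slide): ~140 million tweets per day,
# and 140 bytes per tweet (text only, no metadata).
TWEETS_PER_DAY = 140_000_000
BYTES_PER_TWEET = 140

# 1 May 2011 to 30 November 2011, inclusive.
days = (date(2011, 11, 30) - date(2011, 5, 1)).days + 1
total_tweets = days * TWEETS_PER_DAY
total_bytes = total_tweets * BYTES_PER_TWEET

print(f"{days} days, {total_tweets / 1e9:.1f} billion tweets, {total_bytes / 1e12:.1f} TB")
```

Under these assumptions the totals land close to the ~30 billion tweets and ~4.2 TB quoted above.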


Okay, so how are we going to do it?

    For each tweet,
    increment a bunch of counters,
    such that answering a query
    is as easy as reading some counters


Preparing the data

Step 1: Get a feed of the tweets
            12:32:15 I like #trafficlights
            12:33:43 Nobody expects...
            12:33:49 I ate a #bee; woe is...
            12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token
            [1234, man]   +1
            [1234, acunu] +1
            [1234, rock]  +1
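
The three steps above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the `Counter` is an in-memory stand-in for the counter store, and the names `minute_bucket`, `tokenise` and `ingest`, as well as the crude plural-stripping rule, are assumptions made for the example.

```python
import re
from collections import Counter

# In-memory stand-in for the counter store; keys are (time bucket, token).
counters = Counter()

def minute_bucket(timestamp: str) -> str:
    """Map 'HH:MM:SS' to the minute bucket 'HHMM', e.g. '12:34:04' -> '1234'."""
    hours, minutes, _seconds = timestamp.split(":")
    return hours + minutes

def tokenise(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and crudely strip a plural 's'.

    The stemming rule is an assumption, chosen so that 'rocks' becomes 'rock'.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def ingest(timestamp: str, text: str) -> None:
    """Increment one counter per token in the tweet's time bucket."""
    bucket = minute_bucket(timestamp)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

ingest("12:34:04", "Man, @acunu rocks!")
print(sorted(counters.items()))
```

All the work happens at write time, so a later query only has to read counters.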

Querying

Step 1: Do a range query
            start: [01/05/11, acunu]
            end:   [30/05/11, acunu]

Step 2: Result table
            Key                        #Mentions
            [01/05/11 00:01, acunu]    3
            [01/05/11 00:02, acunu]    5
            ...                        ...

Step 3: Plot pretty graph
            [plot: #mentions, y-axis 0 to 90, x-axis May to Nov]

Instead of this...
            Key                        #Mentions
            [01/05/11 00:01, acunu]    3
            [01/05/11 00:02, acunu]    5
            ...                        ...

We do this
            Key                  00:01    00:02    ...
            [01/05/11, acunu]    3        5        ...
            [02/05/11, acunu]    12       4        ...
            ...                  ...      ...      ...

    Row key is ‘big’ time bucket; column key is ‘small’ time bucket
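
A minimal sketch of this layout, using nested dictionaries as a stand-in for wide rows. The helper names `increment` and `mentions_per_day` are hypothetical, and nothing here talks to Cassandra; it only illustrates that the row key carries the day and the column key carries the minute.

```python
from collections import defaultdict
from datetime import date, timedelta

# rows[(day, token)][minute] -> count
# Row key is the 'big' time bucket (day) plus the token;
# column key is the 'small' time bucket (minute).
rows = defaultdict(lambda: defaultdict(int))

def increment(day: date, minute: str, token: str, by: int = 1) -> None:
    rows[(day, token)][minute] += by

def mentions_per_day(token: str, start: date, end: date) -> dict:
    """Read one row per day in [start, end] and sum its minute columns."""
    result = {}
    day = start
    while day <= end:
        # .get() avoids creating empty rows for days with no mentions.
        result[day] = sum(rows.get((day, token), {}).values())
        day += timedelta(days=1)
    return result

increment(date(2011, 5, 1), "00:01", "acunu", 3)
increment(date(2011, 5, 1), "00:02", "acunu", 5)
increment(date(2011, 5, 2), "00:01", "acunu", 12)
increment(date(2011, 5, 2), "00:02", "acunu", 4)

print(mentions_per_day("acunu", date(2011, 5, 1), date(2011, 5, 3)))
```

A per-day query then touches one row per day rather than one key per minute.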

Towards a more general solution...
(Example II)

count                        grouped by ...
count distinct (session)         day
avg(duration)                    ... geography
                                 ... browser

{
     cust_id: user01,
     session_id: 102,
     geography: UK,
     browser: IE,
     time: 22:02,
}

21:00       all→1345    :00→45     :01→62      :02→87      ...
22:00       all→3221    :00→22     :01→19      :02→104     ...
...                                                        ...
UK          all→228     user01→1   user14→12   user99→7    ...
US          all→354     user01→4   user04→8    user56→17   ...
...
UK, 22:00   all→1904    ...
∅           all→87314   UK→238     US→354      ...




{
     cust_id: user01,
     session_id: 102,
     geography: UK,
     browser: IE,
     time: 22:02,
}

21:00       all→1345    :00→45     :01→62      :02→87      ...
22:00       all→3222    :00→22     :01→19      :02→105     ...
...                                                        ...
UK          all→229     user01→2   user14→12   user99→7    ...
US          all→354     user01→4   user04→8    user56→17   ...
...
UK, 22:00   all→1905    ...
∅           all→87315   UK→239     US→354      ...
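
The update applied by this one event can be sketched as below. The in-memory `rows` structure and the `apply_event` helper are assumptions for illustration; the seeded values are the counts shown before the event arrives, and only the counters the event touches are seeded.

```python
from collections import defaultdict

# rows[row_key][column_key] -> count; an in-memory stand-in for counter rows.
rows = defaultdict(lambda: defaultdict(int))

def apply_event(event: dict) -> None:
    """Increment every counter that this event contributes to."""
    hour, minute = event["time"].split(":")
    hour_key = f"{hour}:00"
    geo = event["geography"]
    user = event["cust_id"]
    updates = [
        (hour_key, "all"),            # events in this hour
        (hour_key, f":{minute}"),     # events in this minute of the hour
        (geo, "all"),                 # events in this geography
        (geo, user),                  # events per user within the geography
        (f"{geo}, {hour_key}", "all"),  # geography and hour combined
        ("∅", "all"),                 # grand total
        ("∅", geo),                   # total per geography
    ]
    for row_key, column_key in updates:
        rows[row_key][column_key] += 1

# Seed with the counts shown before the event.
rows["22:00"].update({"all": 3221, ":02": 104})
rows["UK"].update({"all": 228, "user01": 1})
rows["UK, 22:00"].update({"all": 1904})
rows["∅"].update({"all": 87314, "UK": 238})

apply_event({"cust_id": "user01", "session_id": 102,
             "geography": "UK", "browser": "IE", "time": "22:02"})

print({key: dict(columns) for key, columns in rows.items()})
```

One incoming event fans out into a handful of increments, one per aggregate that it affects.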




Query                                      Read from
where time 21:00-22:00 count(*)            row 21:00, column all
where time 22:00-23:00, group by minute    row 22:00, columns :00, :01, :02, ...
where geography=UK, group all by user      row UK, columns user01, user14, user99, ...
count all                                  row ∅, column all
group all by geo                           row ∅, columns UK, US, ...

21:00       all→1345    :00→45     :01→62      :02→87      ...
22:00       all→3222    :00→22     :01→19      :02→105     ...
...                                                        ...
UK          all→229     user01→2   user14→12   user99→7    ...
US          all→354     user01→4   user04→8    user56→17   ...
...
UK, 22:00   all→1905    ...
∅           all→87315   UK→239     US→354      ...
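
Each of the queries above then reduces to reading one row, as in this sketch. The `rows` dictionary holds only the columns visible on the slide (the elided ones are left out), and the helper names `count` and `group_by` are hypothetical.

```python
# Only the columns visible on the slide; elided columns are omitted.
rows = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    "UK, 22:00": {"all": 1905},
    "∅": {"all": 87315, "UK": 239, "US": 354},
}

def count(row_key: str = "∅") -> int:
    """count(*) for a filter is the 'all' column of the matching row."""
    return rows[row_key]["all"]

def group_by(row_key: str = "∅") -> dict:
    """A group-by is every column of the matching row except 'all'."""
    return {col: n for col, n in rows[row_key].items() if col != "all"}

print(count("21:00"))     # where time 21:00-22:00 count(*)
print(group_by("22:00"))  # where time 22:00-23:00, group by minute
print(group_by("UK"))     # where geography=UK, group all by user
print(count())            # count all
print(group_by())         # group all by geo
```

No aggregation happens at query time; every answer is a direct read of precomputed counters.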

What about more than just aggregates?

Approximate Analytics
                 Exact




     Real-time           Large Scale


Count Distinct

     Plan A: keep a list of all the things you’ve seen
               count them at query time


                Quick to update
                  ... but at scale ...
                Takes lots of space
                Takes a long time to query
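
Plan A can be sketched literally as below. The names `seen`, `record` and `count_distinct` are made up for the example.

```python
seen = []  # every item ever observed, duplicates included

def record(item) -> None:
    seen.append(item)      # quick to update: a constant-time append

def count_distinct() -> int:
    return len(set(seen))  # at query time: scans and de-duplicates everything

for session in ["s1", "s2", "s1", "s3", "s2"]:
    record(session)

print(count_distinct())  # 3
```

Storage grows with every item recorded, and each query has to walk the whole list, which is what hurts at scale.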

Approximate Distinct

     max # leading zeroes seen so far

         item     hash              leading zeroes   max so far
         x        00101001110...    2                2
         y        11010100111...    0                2
         z        00011101011...    3                3
         ...
     ... to see a max of M takes about 2^M items
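
A minimal sketch of this single-counter idea follows. The choice of SHA-1 truncated to 32 bits and the names `hash32`, `leading_zeroes` and `MaxZeroes` are assumptions for illustration. On its own the estimate is very coarse, since it can only ever be a power of two.

```python
import hashlib

HASH_BITS = 32  # assumption: use the first 32 bits of a SHA-1 digest

def hash32(item: str) -> int:
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeroes(value: int, bits: int = HASH_BITS) -> int:
    """Number of zero bits before the first one bit in a `bits`-wide value."""
    return bits - value.bit_length()

class MaxZeroes:
    """Keeps only the maximum number of leading zeroes seen so far."""

    def __init__(self) -> None:
        self.max_zeroes = 0

    def add(self, item: str) -> None:
        self.max_zeroes = max(self.max_zeroes, leading_zeroes(hash32(item)))

    def estimate(self) -> int:
        # Seeing a maximum of M leading zeroes takes about 2^M distinct items.
        return 2 ** self.max_zeroes

# The three hashes from the table, treated as 11-bit values.
for bits in ("00101001110", "11010100111", "00011101011"):
    print(bits, leading_zeroes(int(bits, 2), bits=11))
```

Duplicates hash to the same value, so re-adding an item never changes the stored maximum.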

Approximate Distinct

     to reduce variance, average over m = 2^k sub-streams

         item     hash              index, zeroes   max so far
         x        00101001110...    0, 0            0,0,0,0
         y        11010100111...    3, 1            0,0,0,1
         z        00011101011...    0, 1            1,0,0,1
         ...
     take the harmonic mean
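
The sub-stream version can be sketched as below. The slide does not name an algorithm; this sketch swaps in the HyperLogLog estimator as an assumption, including its bias constant and small-range correction. HyperLogLog stores zeroes + 1 per register, so that an empty register is distinguishable from one that saw zero leading zeroes. The names `split_hash` and `ApproxDistinct` are hypothetical.

```python
import hashlib
import math

def split_hash(value: int, bits: int, k: int) -> tuple[int, int]:
    """Split a `bits`-wide hash into (index from first k bits, leading zeroes of the rest)."""
    rest_bits = bits - k
    index = value >> rest_bits
    rest = value & ((1 << rest_bits) - 1)
    return index, rest_bits - rest.bit_length()

class ApproxDistinct:
    """HyperLogLog-style estimator with m = 2^k registers (assumed, not from the slides)."""

    def __init__(self, k: int = 10) -> None:
        if k < 7:
            raise ValueError("the bias constant used here assumes m >= 128")
        self.k = k
        self.m = 1 << k
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # 64-bit hash
        index, zeroes = split_hash(value, 64, self.k)
        # Store zeroes + 1 so that 0 means "register never touched".
        self.registers[index] = max(self.registers[index], zeroes + 1)

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # alpha * m * (harmonic mean of 2^register over the m sub-streams)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * m and empty:
            return m * math.log(m / empty)  # small-range correction
        return raw

# The three rows of the table, as 11-bit hashes with k = 2.
for bits in ("00101001110", "11010100111", "00011101011"):
    print(bits, split_hash(int(bits, 2), 11, 2))

sketch = ApproxDistinct(k=10)
for i in range(10_000):
    sketch.add(f"item-{i}")
print(round(sketch.estimate()))
```

Memory use is fixed at m small registers regardless of how many items are added.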
Okay... now what?




Analytics

     Click stream, sensor data, etc. → events → Acunu Analytics → counter updates

     •   Aggregate incrementally, on the fly
     •   Store live + historical aggregates
10x vs MySQL...




Dashboard UI




“Up and running in about 4 hours”
“We found out a competitor was scraping our data”
“We keep discovering use cases we hadn’t thought of”




"Quick, efficient and easy to get started"
"We're still finding new and interesting use cases, which just aren't possible with our current datastores."

Thanks!

     Questions?



Realtime Analytics with Apache Cassandra - JAX London
