No BS Data Salon #3:
Probabilistic Sketching
May 2012
                          Analytics + Attribution =
                            Actionable Insights
Outline

     What we do at AK
     What’s sketching?
     Our motivation for sketching
     Why should you sketch?
     Our case: unique counting
       How it works
       How well it works
       How we use them




2
Here’s what we do at AK.


                   Online ad analytics
      Compare the performance of different campaigns, inventory,
                    providers, creatives, etc.




                        Bottom Line:
    Give the advertisers insight into the performance of their ads.



3
Motivation

     High throughput: 10s of K/s => 100s of K/s
     High dimensionality: 100M+ reporting keys
     Easy aggregates: counters, scalars
     Hard aggregates: unique user counting, set operations


     No cheap or effective “online” solutions
       Streaming DBs (Truviso, Coral8, StreamBase) insufficient
       Warehouse appliances (Aster, custom PG) same
       Our data is immutable. Paying for unneeded ACID is silly.

     Offline solutions slow, operationally finicky.
     Not a bank. We don’t need to be perfect, just useful.

4
Why should you bother?




    SELECT COUNT(DISTINCT user_id)
    FROM access_logs
    GROUP BY campaign_id




5
What is probabilistic sketching?




     One-pass
     “Small” memory
     Probabilistic error




6
Our Case Study: unique counting

     Non-unique stream of ints
     Want to keep unique count, up to about a billion
     Want to do set operations (union, intersection, set difference)
     Straw Man #1: “Put them in a HashSet, and go away.”
     (Maybe) Straw Man #2: “Fine, keep a sample.”
     How we did it: HyperLogLog




7
How it works

                                     The Papers:
     LogLog Counting of Large Cardinalities
       Marianne Durand and Philippe Flajolet (RIP 2011), 2003

     HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
       Flajolet, Fusy, Gandouet, Meunier, 2007

               The (rudimentary, unrigorous) Intuition:

        Flip fair coins
        Longest streak of heads is length k, seen once
        Probability of streak ≈ (½)^k
        E[x] = n·p = 1, p = (½)^k => n ≈ 2^k
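
The coin-flip intuition above can be checked with a quick simulation. This is only a sketch of the intuition, not the HyperLogLog estimator itself; the function names `leading_heads` and `longest_streak` are made up for illustration.

```python
import random


def leading_heads(rng):
    """Flip a fair coin until the first tail; return how many heads came first."""
    k = 0
    while rng.random() < 0.5:
        k += 1
    return k


def longest_streak(n, rng):
    """Run n independent trials and return the longest streak of heads seen."""
    return max(leading_heads(rng) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(42)
    for n in (1_000, 100_000):
        k = longest_streak(n, rng)
        # 2**k is a crude, noisy guess at n: right order of magnitude, not more.
        print(f"n={n}  longest streak k={k}  2^k={2 ** k}")
```

A single streak only pins n down to within a factor of a few, which is why the real algorithm averages over many registers.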
8
How it works cont’d

    1.   Stream of int_64 => “good” hash => random {0,1}^64
    2.   Keep track of longest run of leading zeroes
    3.   Longest run of length k => cardinality ≈ 2^k

     Crazy math business
         Correct systematic bias with a derived constant
         Stochastic averaging
         Balls and bins correction
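
The steps above can be sketched as a minimal HyperLogLog following the 2007 paper. This is an illustrative sketch, not AK's production code. The assumptions are mine: BLAKE2b stands in for the "good" hash, registers live in a plain Python list, and the "balls and bins correction" is taken to mean the paper's small-range linear-counting step.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, log2m=13):
        self.p = log2m
        self.m = 1 << log2m                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    @staticmethod
    def _hash64(value):
        # Assumption: any well-mixed 64-bit hash would do here.
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        x = self._hash64(value)
        idx = x >> (64 - self.p)                    # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = number of leading zeroes in `rest`, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def cardinality(self):
        # Stochastic averaging: harmonic mean of 2^register across registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: linear counting on empty registers.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(i)
    print(round(hll.cardinality()))
```

With a 64-bit hash, the paper's large-range correction (designed for 32-bit hashes) is left out.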




9
Here’s what you get




                     Native:
                union, cardinality

                    Implies:
      intersection (!!!), set difference (!!!)
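
A sketch of why union is native and the other operations are implied: two HLLs with the same parameters merge by taking the register-wise maximum, and intersection and set difference then fall out of inclusion-exclusion on the three estimates. The helper names and the BLAKE2b hash are assumptions for illustration, not the deck's implementation.

```python
import hashlib
import math

P = 13                      # log2m, as in the deck's plots
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)


def sketch(values):
    """Build an HLL register array from an iterable of values."""
    regs = [0] * M
    for v in values:
        digest = hashlib.blake2b(str(v).encode("utf-8"), digest_size=8).digest()
        x = int.from_bytes(digest, "big")
        idx = x >> (64 - P)
        rest = x & ((1 << (64 - P)) - 1)
        regs[idx] = max(regs[idx], (64 - P) - rest.bit_length() + 1)
    return regs


def estimate(regs):
    raw = ALPHA * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)   # small-range linear counting
    return raw


def union(a, b):
    """Lossless: identical to sketching the combined stream."""
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a, b):
    # |A ∩ B| = |A| + |B| - |A ∪ B|
    return estimate(a) + estimate(b) - estimate(union(a, b))


def difference_estimate(a, b):
    # |A \ B| = |A ∪ B| - |B|
    return estimate(union(a, b)) - estimate(b)


if __name__ == "__main__":
    a = sketch(range(0, 60_000))
    b = sketch(range(40_000, 100_000))
    print(round(estimate(union(a, b))), round(intersection_estimate(a, b)))
```

The union is exact at the register level, but the intersection and difference inherit the error of every estimate they combine.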




10
Show me the money!

      Used in production at AK for a year
      Accurate: count to a billion with 1-3% error
      Small: a few KB each so we can keep 100s of M in memory
      Fast: benched at 2M inserts/s, used in production at 100s of K/s
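
For a rough sense of where "a few KB" and "1-3%" come from, here is the arithmetic for the log2m=13 configuration used in this deck's plots. The 5 bits per register is my assumption (it is what makes 2^13 registers come to 5 kB); the 1.04/sqrt(m) standard error is the figure from the HyperLogLog paper.

```python
import math


def hll_size_bytes(log2m, bits_per_register=5):
    """Dense, bit-packed size of an HLL with 2**log2m registers."""
    return (1 << log2m) * bits_per_register // 8


def hll_std_error(log2m):
    """Asymptotic relative standard error from the HyperLogLog paper."""
    return 1.04 / math.sqrt(1 << log2m)


if __name__ == "__main__":
    print(hll_size_bytes(13))                 # 8192 registers * 5 bits = 5120 bytes
    print(f"{hll_std_error(13):.2%}")         # about 1.15%
```

At 5120 bytes each, 100M dense sketches would need roughly 512 GB, so keeping hundreds of millions in memory depends on most of them being sparse or compressed.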




11
Lies, damn lies, and boxplots!

     [Figure: boxplot of HLL cardinality relative error vs. true cardinality, log2m=13 (5 kB).
      Y-axis: HLL cardinality relative error, −4% to 4%. X-axis: true cardinality, 10^2 to 10^9.]

12
But wait, there’s more!

     [Figure: HLL intersection error vs. cardinality order-of-magnitude difference, log2m=13 (5 kB).
      Y-axis: HLL intersection error, −40% to 40%. X-axis: cardinality order-of-magnitude difference, 0 to 3.
      Series: overlap fraction, 0.1 to 1.]

13
Implementation caveats

      If you store an HLL for each key, you’ll likely be wasting space when not all
       the registers are set. Use map-based HLL or use compression.
      Pick a good hash function!
      Test on your data!
      Tune parameters to suit your business needs!
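
One way to read "map-based HLL" is to keep only the non-zero registers in a map and switch to a dense array once the map stops being a saving. The sketch below shows that idea only. The class name, the `promote_at` threshold, and its default are hypothetical, not taken from AK's implementation.

```python
class SparseRegisters:
    """HLL register storage that starts as a dict and promotes to a dense array."""

    def __init__(self, log2m=13, promote_at=None):
        self.m = 1 << log2m
        # Hypothetical threshold: promote once a quarter of the registers are set.
        self.promote_at = promote_at if promote_at is not None else self.m // 4
        self.sparse = {}        # register index -> rank, non-zero entries only
        self.dense = None       # bytearray of length m after promotion

    def update(self, idx, rank):
        """Record max(current, rank) for register idx."""
        if self.dense is not None:
            if rank > self.dense[idx]:
                self.dense[idx] = rank
            return
        if rank > self.sparse.get(idx, 0):
            self.sparse[idx] = rank
        if len(self.sparse) > self.promote_at:
            self._promote()

    def _promote(self):
        dense = bytearray(self.m)
        for idx, rank in self.sparse.items():
            dense[idx] = rank
        self.dense = dense
        self.sparse = None

    def get(self, idx):
        if self.dense is not None:
            return self.dense[idx]
        return self.sparse.get(idx, 0)

    def is_sparse(self):
        return self.dense is None
```

A key that has seen a handful of users then costs a few map entries instead of the full register array, which matters when most reporting keys are small.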




14
How we use them, in production

      Original problem: fast, on-the-fly overlaps and unique counts
      Solution:
        streaming, in-memory aggregations shipped to Postgres
        Postgres module to do set operations on binary representations in the DB

      Freebie: PG analytics support like GROUP BY, sliding windows, etc…




15
UI example




              To the browser, Robin!




16
How we use them, Ad Hoc

      Outside of production: amazing ad-hoc analysis tool
      Example: gathering more than a year’s worth of data for an RFP, at 20B
       impressions/month
         painless and quick when we had the data as sketches
        much more effort to put it through Hadoop

      Iterating on product and research is cheaper and faster.
        Waiting minutes instead of seconds between iterations is painful.




17
“Soft” Caveats



      Fixed N% error is deceiving
      Additive error for set operations can balloon
      Unbounded error sneaks in now and again
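
A small worked example of how additive error balloons. An intersection computed by inclusion-exclusion combines three estimates, so its absolute error scales with the large sets while the quantity being measured may be tiny. The bound below assumes, for simplicity, that each estimate is off by at most `rel_err`; the real error is probabilistic and carries no such guarantee. The function name is made up for illustration.

```python
def intersection_error_bound(size_a, size_b, size_overlap, rel_err):
    """Worst-case error of |A| + |B| - |A ∪ B| if each term is off by rel_err.

    Returns (absolute error bound, bound relative to the true overlap).
    """
    size_union = size_a + size_b - size_overlap
    abs_bound = rel_err * (size_a + size_b + size_union)
    return abs_bound, abs_bound / size_overlap


if __name__ == "__main__":
    # Similar-sized sets with a large overlap: the bound stays modest.
    print(intersection_error_bound(1_000_000, 1_000_000, 500_000, 0.01))
    # Sets three orders of magnitude apart with a small overlap: the bound
    # is hundreds of times larger than the overlap itself.
    print(intersection_error_bound(1_000_000, 1_000, 100, 0.01))
```

In the second case a 1% error per sketch allows an absolute error of about 20,000 on a true overlap of 100, which is why a fixed N% figure is deceiving for set operations.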




18
Parting Advice

      Test these on your data rigorously
      Choose good hash functions
      Tuning parameters are particularly sensitive
      You’ll find all kinds of unexpected uses for them, so get building!
      Bibliography blog post will be up in a bit!




19
Questions?


                  @timonk
     timon@aggregateknowledge.com
      blog.aggregateknowledge.com




20
Credits

     All the adorable cartoons you saw in this presentation were taken from
     http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong
     to their creator.




21
