Cardinality Estimation for Very Large Data Sets


Matt Abrams, VP Data and Operations
March 25, 2013
THANKS FOR
COMING!
I build large scale distributed systems and work on
algorithms that make sense of the data stored in
them


Contributor to the open source project Stream-Lib,
a Java library for summarizing data streams
(https://github.com/clearspring/stream-lib)


Ask me questions: @abramsm
HOW CAN WE COUNT
THE NUMBER OF
DISTINCT ELEMENTS
IN LARGE DATA
SETS?
HOW CAN WE COUNT
THE NUMBER OF
DISTINCT ELEMENTS
IN VERY LARGE DATA
SETS?
GOALS FOR
COUNTING SOLUTION
• Support high-throughput data streams (up to
  many hundreds of thousands of events per second)
• Estimate cardinality with known error
  thresholds in sets of up to around 1 billion (or
  even 1 trillion when needed)
• Support set operations (unions and
  intersections)
• Support data streams with a large number of
  dimensions
1 UID = 128 bits
513a71b843e54b73
In one month AddThis
    logs 5B+ UIDs

          2,500,000 * 2000
          = 5,000,000,000
That’s 596GB of just UIDs
NAÏVE SOLUTIONS

• SELECT COUNT(DISTINCT UID) FROM table
  WHERE dimension = 'foo'
• HashSet<K> (see the sketch below)
• Run a batch job for each
  new query request
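For scale, a minimal sketch of the HashSet approach (the uidStream iterable and the wrapper method are illustrative stand-ins, not from the talk):

    import java.util.HashSet;
    import java.util.Set;

    // Exact distinct count: keep every UID seen so far in memory.
    // Fine for small sets, but billions of 128-bit UIDs will not fit in one JVM heap.
    static long exactDistinct(Iterable<String> uidStream) {
        Set<String> seen = new HashSet<>();
        for (String uid : uidStream) {
            seen.add(uid);      // the set silently drops duplicates
        }
        return seen.size();
    }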
WE ARE NOT A BANK




    This means an estimate rather than an
    exact value is acceptable.

                  http://graphics8.nytimes.com/images/2008/01/30/timestopics/feddc.jpg
THREE INTUITIONS
• It is possible to estimate the cardinality of a set
  by understanding the probability of a sequence
  of events occurring in a random variable (e.g.
  how many coins were flipped if I saw n heads in
  a row?)
• Averaging the results of multiple
  observations can reduce the variance
  associated with random variables
• Applying a good hash function effectively
  de-duplicates the input stream
INTUITION




   What is the probability
   that a binary string
   starts with ’01’?
INTUITION




  (1/2)^2 = 25%
INTUITION




(1/2)^3 = 12.5%
INTUITION




Crude analysis: if a stream
has 8 unique values, the hash
of at least one of them should
start with ‘001’
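To make the crude analysis concrete, here is a sketch of a single-observable estimator; hash64 is a stand-in for any good 64-bit hash function (e.g. MurmurHash), and the method is my illustration of the intuition rather than anything from the talk:

    // Track the largest rho (position of the leftmost 1-bit) over all hashed
    // values and guess the cardinality as roughly 2^MAX.
    static long crudeEstimate(Iterable<String> uidStream) {
        int maxRho = 0;
        for (String uid : uidStream) {
            long x = hash64(uid);                        // assumed 64-bit hash
            int rho = Long.numberOfLeadingZeros(x) + 1;  // leading zeros + 1
            maxRho = Math.max(maxRho, rho);
        }
        return 1L << maxRho;   // roughly 2^MAX; biased and very high variance
    }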
INTUITION




Given the variability of a single
random value, we cannot use a
single variable for accurate
cardinality estimation
MULTIPLE OBSERVATIONS HELP
REDUCE VARIANCE

By taking the mean of multiple random
variables we can make the error rate as
small as desired by controlling the size of m
(the number of random variables):



    error = σ / √m
THE PROBLEM WITH
MULTIPLE HASH
FUNCTIONS

• It is too costly from a
  computational perspective to
  apply m hash functions to
  each data point
• It is not clear that it is
  possible to generate m good
  hash functions that are
  independent
STOCHASTIC
AVERAGING
• Emulating the effect of m experiments
  with a single hash function
• Divide input stream h(M) into m sub-
  streams
            é1 2      m -1 ù
            ê , ,...,
            ëm m
                          ,1ú
                       m û
• An average of the observable values for
  each sub-stream will yield a cardinality
  that improves in proportion to 1/ m as
  m increases
HASH FUNCTIONS
32-bit hash     64-bit hash      160-bit hash      Odds of a collision
77,163          5.06 billion     1.42 * 10^24      1 in 2
30,084          1.97 billion     5.55 * 10^23      1 in 10
9,292           609 million      1.71 * 10^23      1 in 100
2,932           192 million      5.41 * 10^22      1 in 1000

         (number of hashed elements at which a collision becomes that likely)
         http://preshing.com/20110504/hash-collision-probabilities
HYPERLOGLOG
      (2007)
Counts up to 1 billion in 1.5KB of space




            Philippe Flajolet (1948-2011)
HYPERLOGLOG (HLL)
• Operates with a single pass
  over the input data set
• Produces a typical error of 1.04 / √m
  (worked numbers below)
• Error decreases as m
  increases. Error is not a
  function of the number of
  elements in the set
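To put numbers on the error formula above (my arithmetic, not from the slides): with m = 2,048 registers the typical error is 1.04/√2048 ≈ 2.3%, and with m = 16,384 it is roughly 0.8%, whether the set holds a million elements or a billion.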
HLL SUBSTREAMS

 HLL uses a single hash
 function and splits the result
 into m buckets
[Diagram: input values pass through a single hash function and are routed
 into Bucket 1, Bucket 2, ..., Bucket m]
HLL ALGORITHM
BASICS
• Each substream maintains an observable
 • The observable is the largest value ρ(x) seen, where ρ(x) is the
   position of the leftmost 1-bit in the binary string x



• 32-bit hash function with 5-bit “short bytes”
• Harmonic mean
 • Increases quality of estimates by reducing variance
WHAT ARE “SHORT BYTES”?
• We know a priori that the value of a given
  substream register of the multiset M is in the
  range

          0..(L + 1 - log2 m)

• Assuming L = 32 we only need 5 bits to
  store the value of the register
• ~85% less memory usage compared to a
  standard Java int
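As a quick check of that range (again my arithmetic): with L = 32 and m = 2^11 = 2,048 substreams, register values fall in 0..22, which fits comfortably in 5 bits, versus the 32 bits of a standard Java int.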
ADDING VALUES TO
HLL



       ρ(x_{b+1} x_{b+2} ···)        index = 1 + ⟨x_1 x_2 ··· x_b⟩_2


• The first b bits of the hashed value define
  the index of the register in the multiset M that
  may be updated when the new value is added
• The remaining bits (from b+1 onward) are used
  to determine ρ, the number of leading zeros + 1
ADDING VALUES TO
HLL
                   Observations




{M[1], M[2],..., M[m]}
The multiset is updated using the equation:

   M[j] := max(M[j], ρ(w))        where ρ(w) = number of leading zeros + 1
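A minimal Java sketch of the two steps above, using a 64-bit hash for simplicity (the slides assume a 32-bit hash with 5-bit registers); the class and names are illustrative, not Stream-Lib's implementation:

    // HLL register update: the top b bits of the hash pick a register,
    // the remaining bits determine rho (number of leading zeros + 1).
    final class HllSketch {
        final int b;        // number of index bits; m = 2^b registers
        final byte[] M;     // the registers M[0..m-1]

        HllSketch(int b) {
            this.b = b;
            this.M = new byte[1 << b];
        }

        void offerHashed(long x) {
            int j = (int) (x >>> (64 - b));              // index from the first b bits
            long w = x << b;                             // the remaining bits
            int rho = (w == 0) ? (64 - b + 1)
                               : Long.numberOfLeadingZeros(w) + 1;
            if (rho > M[j]) {
                M[j] = (byte) rho;                       // M[j] := max(M[j], rho(w))
            }
        }
    }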
INTUITION ON
EXTRACTING
CARDINALITY FROM HLL
• If we add n elements to a stream then each
  substream will contain roughly n/m elements
• The MAX value in each substream should be
  about log2(n / m) (from the earlier intuition about
  random variables)
• The harmonic mean (mZ) of the 2^MAX values is
  on the order of n/m
• So m²Z is on the order of n. That’s the
  cardinality!
HLL CARDINALITY
ESTIMATE
              E := α_m · m² · ( Σ_{j=1..m} 2^(-M[j]) )^(-1)

                        (the inverted sum is a harmonic mean of the 2^M[j] values)


• m²Z has a systematic multiplicative bias that needs to be
  corrected. This is done by multiplying by a constant (α_m above)
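Continuing the sketch above, the raw estimate can be read straight off the registers; the α_m below is the paper's large-m approximation, not Stream-Lib's code:

    // Raw HLL estimate: E = alpha_m * m^2 / sum_j 2^(-M[j]).
    static double rawEstimate(byte[] M) {
        int m = M.length;
        double alphaM = 0.7213 / (1.0 + 1.079 / m);   // approximation, valid for m >= 128
        double sum = 0.0;
        for (byte reg : M) {
            sum += Math.pow(2.0, -reg);               // 2^(-M[j])
        }
        return alphaM * m * (double) m / sum;         // harmonic-mean based estimate
    }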
A NOTE ON LONG
RANGE CORRECTIONS
• The paper says to apply a long range
  correction function when the estimate is
  greater than:

      E > (1/30) · 2^32

• The correction function is:

      E* := -2^32 · log(1 - E / 2^32)

• DON’T DO THIS! It doesn’t work and
  increases error. A better approach is to
  use a bigger/better hash function
DEMO TIME!
Let’s look at HLL in action.


            http://www.aggregateknowledge.com/science/blog/hll.html
HLL UNIONS

• Merging two or more HLL
  data structures is a
  similar process to adding
  a new value to a single
  HLL
• For each register in the
  HLL, take the max value of
  the HLLs you are merging;
  the resulting register set
  can be used to estimate the
  cardinality of the combined
  sets

[Diagram: a Root HLL aggregating daily HLLs for MON, TUE, WED, THU, and FRI]
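In code the merge is just a register-wise max (continuing the earlier sketch, not Stream-Lib's API):

    // Union of two HLLs with the same register count: take the register-wise
    // max. The merged registers estimate the cardinality of the union.
    static byte[] merge(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("register counts must match");
        }
        byte[] union = new byte[a.length];
        for (int j = 0; j < a.length; j++) {
            union[j] = (byte) Math.max(a[j], b[j]);
        }
        return union;
    }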
HLL INTERSECTION
        |A ∩ B| = |A| + |B| - |A ∪ B|


[Diagram: Venn diagram of sets A and B; C is their intersection]
     You must understand the properties
     of your sets to know if you can trust
     the resulting intersection
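A sketch of the inclusion-exclusion estimate built from the two helpers above; as the slide warns, the errors of the three estimates compound, so treat the result with care:

    // |A ∩ B| ≈ |A| + |B| - |A ∪ B|. Only trustworthy when the true
    // intersection is large relative to the error of the individual estimates.
    static double intersectionEstimate(byte[] a, byte[] b) {
        return rawEstimate(a) + rawEstimate(b) - rawEstimate(merge(a, b));
    }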
HYPERLOGLOG++
• Google researchers have recently released an
  update to the HLL algorithm
• Uses clever encoding/decoding techniques to
  create a single data structure that is very
  accurate for small cardinality sets and can
  estimate sets that have over a trillion elements
  in them
• Empirical bias correction. Observations show
  that most of the error in HLL comes from the
  bias function. Using empirically derived values
  significantly reduces error
• Already available in Stream-Lib!
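A hedged usage sketch against Stream-Lib's HyperLogLogPlus class (constructor and method names as I recall them from the library around this time; verify against the current API before relying on them):

    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

    // p = 14 gives 2^14 registers (~0.8% typical error); sp = 25 enables the
    // sparse representation that keeps small cardinalities accurate.
    HyperLogLogPlus hll = new HyperLogLogPlus(14, 25);
    hll.offer("513a71b843e54b73");
    hll.offer("a1b2c3d4e5f60718");     // illustrative UID, not real data
    hll.offer("513a71b843e54b73");     // duplicate, effectively ignored
    long estimate = hll.cardinality(); // ≈ 2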
OTHER PROBABILISTIC
DATA STRUCTURES

• Bloom Filters – set membership
  detection
• CountMinSketch – estimate number
  of occurrences for a given element
• TopK Estimators – estimate the
  frequency and top elements from a
  stream
REFERENCES
• Stream-Lib - https://github.com/clearspring/stream-lib
• HyperLogLog - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.9475
• HyperLogLog In Practice - http://research.google.com/pubs/pub40671.html
• Aggregate Knowledge HLL Blog Posts - http://blog.aggregateknowledge.com/tag/hyperloglog/
THANKS!


     AddThis is hiring!

Speaker notes

  1. 2.5M people
  2. Given that a good hash function produces a uniformly random number of 0s and 1s we can make observations about the probability of certain conditions appearing in the hashed value.