2. THANKS FOR COMING!
I build large-scale distributed systems and work on algorithms that make sense of the data stored in them.
Contributor to the open source project Stream-Lib, a Java library for summarizing data streams (https://github.com/clearspring/stream-lib)
Ask me questions: @abramsm
3. HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN LARGE DATA SETS?
4. HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN VERY LARGE DATA SETS?
5. GOALS FOR COUNTING SOLUTION
• Support high-throughput data streams (up to many hundreds of thousands of events per second)
• Estimate cardinality with known error thresholds in sets up to around 1 billion (or even 1 trillion when needed)
• Support set operations (unions and intersections)
• Support data streams with a large number of dimensions
10. NAÏVE SOLUTIONS
• select count(distinct UID) from table where dimension = foo
• HashSet<K>
• Run a batch job for each new query request
11. WE ARE NOT A BANK
This means an estimate rather than an exact value is acceptable.
http://graphics8.nytimes.com/images/2008/01/30/timestopics/feddc.jpg
13. THREE INTUITIONS
• It is possible to estimate the cardinality of a set by understanding the probability of a sequence of events occurring in a random variable (e.g. how many coins were flipped if I saw n heads in a row?)
• Averaging the results of multiple observations can reduce the variance associated with random variables
• Applying a good hash function effectively de-duplicates the input stream
14. INTUITION
What is the probability that a binary string starts with '01'?
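Since a good hash makes every bit pattern equally likely, the answer can be checked by brute force. A minimal sketch (the class and method names are illustrative, not from the slides) that enumerates all 8-bit values and counts how many have '01' as their two high bits:

```java
// Illustrative sketch: with uniformly random bits, every 2-bit prefix is
// equally likely, so P(prefix == "01") should come out to 1/4.
public class PrefixProbability {
    public static double probabilityOfPrefix01() {
        int matches = 0;
        for (int v = 0; v < 256; v++) {
            // The two high bits of an 8-bit value are '01' when (v >> 6) == 1.
            if ((v >> 6) == 0b01) matches++;
        }
        return matches / 256.0;
    }

    public static void main(String[] args) {
        System.out.println(probabilityOfPrefix01()); // prints 0.25
    }
}
```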
17. INTUITION
Crude analysis: if a stream has 8 unique values, the hash of at least one of them should start with '001'
18. INTUITION
Given the variability of a single random value, we cannot use a single variable for accurate cardinality estimations
19. MULTIPLE OBSERVATIONS HELP REDUCE VARIANCE
By averaging multiple random variables we can make the error rate as small as desired by controlling the size of m (the number of random variables):
error = σ / √m
20. THE PROBLEM WITH MULTIPLE HASH FUNCTIONS
• It is too costly from a computational perspective to apply m hash functions to each data point
• It is not clear that it is possible to generate m good hash functions that are independent
21. STOCHASTIC AVERAGING
• Emulating the effect of m experiments with a single hash function
• Divide the input stream h(M) into m sub-streams:
[0, 1/m), [1/m, 2/m), ..., [(m-1)/m, 1)
• An average of the observable values for each sub-stream will yield a cardinality estimate that improves in proportion to 1/√m as m increases
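In practice the sub-stream split is done on the bits of a single hash value rather than on real-number intervals: the top b bits pick one of m = 2^b sub-streams and the remaining bits supply the observable. A minimal sketch of that split (class and method names are my own, and b = 4 is an arbitrary choice for illustration):

```java
// Sketch of stochastic averaging's substream split, assuming a 32-bit hash:
// the top b bits select one of m = 2^b substreams; the rest are the value
// the substream actually observes.
public class StochasticAveraging {
    static final int B = 4;           // b = 4, so m = 2^4 = 16 substreams
    static final int M = 1 << B;

    static int bucket(int hash) {
        // Top b bits of the hash select the substream (0..m-1).
        return hash >>> (Integer.SIZE - B);
    }

    static int remainder(int hash) {
        // Remaining 28 bits, left-aligned, feed the per-substream observable.
        return hash << B;
    }

    public static void main(String[] args) {
        int hash = 0xCAFEBABE;
        System.out.println(bucket(hash)); // top 4 bits are 0xC, prints 12
    }
}
```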
22. HASH FUNCTIONS
Number of hashed values before the given odds of a collision are reached:

Odds of a Collision   32-Bit Hash   64-Bit Hash    160-Bit Hash
1 in 2                77,163        5.06 billion   1.42 * 10^24
1 in 10               30,084        1.97 billion   5.55 * 10^23
1 in 100              9,292         609 million    1.71 * 10^23
1 in 1000             2,932         192 million    5.41 * 10^22

http://preshing.com/20110504/hash-collision-probabilities
23. HYPERLOGLOG (2007)
Counts up to 1 billion in 1.5 KB of space
Philippe Flajolet (1948-2011)
24. HYPERLOGLOG (HLL)
• Operates with a single pass over the input data set
• Produces a typical error of 1.04 / √m
• Error decreases as m increases. Error is not a function of the number of elements in the set
25. HLL SUBSTREAMS
HLL uses a single hash function and splits the result into m buckets
(Diagram: input values → hash function → bucket 1, bucket 2, ..., bucket m)
26. HLL ALGORITHM BASICS
• Each substream maintains an observable
• The observable is the largest value ρ(x) seen, where ρ(x) is the position of the leftmost 1-bit in the binary string x
• 32-bit hash function with 5-bit "short bytes"
• Harmonic mean
• Increases the quality of estimates by reducing variance
27. WHAT ARE "SHORT BYTES"?
• We know a priori that the value of a given substream of the multiset M is in the range 0..(L + 1 - log2 m)
• Assuming L = 32 we only need 5 bits to store the value of the register
• 85% less memory usage compared to a standard Java int
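The register width follows directly from that range. A small sketch of the arithmetic (class and method names are mine, not Stream-Lib's):

```java
// Sketch: with an L-bit hash and m = 2^b substreams, a register value lies in
// 0..(L + 1 - b), so ceil(log2(max + 1)) bits suffice per register.
public class ShortBytes {
    static int bitsPerRegister(int hashBits, int b) {
        int maxValue = hashBits + 1 - b; // largest possible register value
        // Number of bits needed to represent values 0..maxValue.
        return 32 - Integer.numberOfLeadingZeros(maxValue);
    }

    public static void main(String[] args) {
        // 32-bit hash, m = 2^11 buckets: register values fit in 5 bits,
        // versus the 32 bits of a standard Java int (about 85% less memory).
        System.out.println(bitsPerRegister(32, 11)); // prints 5
    }
}
```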
28. ADDING VALUES TO HLL
index = 1 + (x1 x2 ··· xb)2,  w = xb+1 xb+2 ···
• The first b bits of the new value define the index for the multiset M that may be updated when the new value is added
• The bits from b+1 onward (w) are used to determine the number of leading zeros (ρ)
29. ADDING VALUES TO HLL
Observations: {M[1], M[2], ..., M[m]}
The multiset is updated using the equation:
M[j] := max(M[j], ρ(w))
where ρ(w) is the number of leading zeros + 1
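The update rule above can be sketched in a few lines of Java (a minimal illustration, not Stream-Lib's implementation; class and method names are my own):

```java
// Sketch of the HLL register update: M[j] := max(M[j], rho(w)).
public class HllUpdate {
    // rho(w): position of the leftmost 1-bit (number of leading zeros + 1)
    // in the left-aligned remainder bits w. Assumes w != 0.
    static int rho(int w) {
        return Integer.numberOfLeadingZeros(w) + 1;
    }

    static void add(int[] registers, int bucket, int w) {
        registers[bucket] = Math.max(registers[bucket], rho(w));
    }

    public static void main(String[] args) {
        int[] m = new int[16];
        add(m, 3, 0x10000000); // w = 0001...binary, so rho(w) = 4
        System.out.println(m[3]); // prints 4
        add(m, 3, 0x40000000); // rho(w) = 2, smaller, register unchanged
        System.out.println(m[3]); // prints 4
    }
}
```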
30. INTUITION ON EXTRACTING CARDINALITY FROM HLL
• If we add n elements to a stream then each substream will contain roughly n/m elements
• The MAX value in each substream should be about log2(n/m) (from the earlier intuition re random variables)
• The harmonic mean (mZ) of the 2^MAX values is on the order of n/m
• So m²Z is on the order of n. That's the cardinality!
31. HLL CARDINALITY
ESTIMATE
-1
æ m ö
E := a m m × çå 2
-M [ j ]
2
ç ÷
÷
è j=1 ø
(2 )
p 2
Harmonic Mean
• m2Z has systematic multiplicative bias that needs to be
corrected. This is done by multiplying a constant value
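The raw estimate is a direct translation of that formula. A minimal sketch (class and method names are mine; the α_m approximation 0.7213/(1 + 1.079/m) is the one the HLL paper gives for m ≥ 128; the paper's small-range correction is deliberately omitted here):

```java
// Sketch of the raw HLL estimate: E = alpha_m * m^2 / sum(2^-M[j]).
// Note: no small-range (linear counting) correction is applied, so this
// raw value is biased for nearly-empty register sets.
public class HllEstimate {
    static double estimate(int[] registers) {
        int m = registers.length;
        double alpha = 0.7213 / (1 + 1.079 / m); // bias constant, m >= 128
        double sum = 0;
        for (int r : registers) sum += Math.pow(2, -r); // harmonic-mean term
        return alpha * m * m / sum;
    }

    public static void main(String[] args) {
        int[] regs = new int[1024];        // all registers zero: near-empty set
        System.out.println(estimate(regs)); // small raw estimate, ~alpha * m
    }
}
```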
32. A NOTE ON LONG RANGE CORRECTIONS
• The paper says to apply a long range correction function when the estimate is greater than: E > (1/30) · 2^32
• The correction function is: E* := -2^32 · log(1 - E / 2^32)
• DON'T DO THIS! It doesn't work and increases error. A better approach is to use a bigger/better hash function
33. DEMO TIME!
Let's look at HLL in action.
http://www.aggregateknowledge.com/science/blog/hll.html
34. HLL UNIONS
• Merging two or more HLL data structures is a similar process to adding a new value to a single HLL
• For each register in the HLL take the max value of the HLLs you are merging, and the resulting register set can be used to estimate the cardinality of the combined sets
(Diagram: daily HLLs for MON, TUE, WED, THU, FRI merged into a root HLL)
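The register-wise max described above can be sketched in a few lines (an illustration under the same register model as the earlier sketches, not Stream-Lib's merge code):

```java
// Sketch of an HLL union: element-wise max of two register sets. The result
// equals the register set you'd get from streaming both inputs into one HLL,
// so unions are lossless.
public class HllUnion {
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length]; // assumes both HLLs use the same m
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] mon = {3, 0, 5, 1}; // Monday's registers (toy m = 4)
        int[] tue = {1, 4, 2, 1}; // Tuesday's registers
        System.out.println(java.util.Arrays.toString(merge(mon, tue)));
        // prints [3, 4, 5, 1]
    }
}
```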
35. HLL INTERSECTION
C = |A| + |B| - |A ∪ B|
(Venn diagram: C is the overlap of sets A and B)
You must understand the properties of your sets to know if you can trust the resulting intersection
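Since HLL only supports unions directly, the intersection comes from inclusion-exclusion on three estimates. A trivial sketch (names mine; the numbers are made-up estimates for illustration):

```java
// Sketch of intersection via inclusion-exclusion:
// |A ∩ B| ≈ |A| + |B| - |A ∪ B|.
// Each term carries HLL estimation error, so when the true overlap is small
// relative to the sets, the error can swamp the result - hence the warning.
public class HllIntersection {
    static double intersection(double cardA, double cardB, double cardUnion) {
        return cardA + cardB - cardUnion;
    }

    public static void main(String[] args) {
        // Hypothetical estimates: |A| = 1000, |B| = 800, |A ∪ B| = 1500.
        System.out.println(intersection(1000, 800, 1500)); // prints 300.0
    }
}
```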
36. HYPERLOGLOG++
• Google researchers have recently released an update to the HLL algorithm
• Uses clever encoding/decoding techniques to create a single data structure that is very accurate for small-cardinality sets and can estimate sets that have over a trillion elements in them
• Empirical bias correction. Observations show that most of the error in HLL comes from the bias function. Using empirically derived values significantly reduces error
• Already available in Stream-Lib!
37. OTHER PROBABILISTIC DATA STRUCTURES
• Bloom Filters – set membership detection
• CountMinSketch – estimate the number of occurrences of a given element
• TopK Estimators – estimate the frequency and top elements from a stream
38. REFERENCES
• Stream-Lib - https://github.com/clearspring/stream-lib
• HyperLogLog - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.9475
• HyperLogLog In Practice - http://research.google.com/pubs/pub40671.html
• Aggregate Knowledge HLL Blog Posts - http://blog.aggregateknowledge.com/tag/hyperloglog/
Given that a good hash function produces a uniformly random sequence of 0s and 1s, we can make observations about the probability of certain conditions appearing in the hashed value.