Probabilistic and approximate data structures can provide scalable solutions when exact answers are not required. They trade accuracy for speed and efficiency. Approaches like sampling, hashing, cardinality estimation, and probabilistic databases allow analyzing large datasets while controlling error rates. Example techniques discussed include Bloom filters, locality-sensitive hashing, count-min sketches, HyperLogLog, and feature hashing for machine learning. The talk provided code examples and comparisons of these probabilistic methods.
2. Probabilistic||Approximate: Why?
Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data
Catch:
● despite typically achieving good results, there is a
chance of bad worst-case behaviour
● works best on large datasets (law of large numbers)
3. Code: Approximation
import random

def average(xs):
    return sum(xs) / len(xs)

x = [random.randint(0, 80000) for _ in range(10000)]
y = [i >> 8 for i in x]  # trim the low 8 bits off each integer
z = x[:500]              # 5% sample (x is uniform, so a plain slice is unbiased)

avx = average(x)
avy = average(y) * 2**8  # shift the 8 bits back
avz = average(z)

print(avx)
print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / avx))
print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / avx))
Sample output:
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
5. Probabilistic Data Structures
Generally they:
● use less space than the full dataset
● require a higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate
6. Hash functions
One-way function:
a key of arbitrary length ->
a message of fixed length
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2) for key1 ≠ key2
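A minimal sketch in Python using the standard hashlib module; the digest is truncated to a single byte here purely so that a collision is easy to observe (the helper name is illustrative):

import hashlib

def h8(key):
    # Hash a key of arbitrary length down to a fixed 8-bit message.
    return hashlib.sha1(key.encode()).digest()[0]

print(h8('a'), h8('a much, much longer key'))  # always one byte, 0..255

# With only 256 possible messages, a collision turns up quickly:
seen = {}
for i in range(1000):
    d = h8(str(i))
    if d in seen:
        print('collision: hash(%r) == hash(%r) == %d' % (seen[d], str(i), d))
        break
    seen[d] = str(i)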
11. Comparison: Locality Sensitive Hashing (LSH)
Image hashes
Kernelized locality-sensitive hashing for scalable image search
B. Kulis, K. Grauman; IEEE International Conference on Computer Vision (ICCV), 2009
Abstract (excerpt): Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be …
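The kernelized variant from the paper is out of scope here, but a minimal random-hyperplane LSH sketch (the plain cosine-similarity flavour, not the paper's method) shows the core idea: similar vectors get similar bit signatures, so they land in the same Hamming-space buckets. Names and parameters below are illustrative:

import numpy as np

def lsh_signature(v, planes):
    # One bit per hyperplane: which side of the plane does v fall on?
    return tuple(int(np.dot(v, p) >= 0) for p in planes)

rng = np.random.default_rng(0)
dim, n_bits = 128, 16
planes = rng.standard_normal((n_bits, dim))   # random hyperplanes

a = rng.standard_normal(dim)
b = a + 0.1 * rng.standard_normal(dim)        # a near-duplicate of a
c = rng.standard_normal(dim)                  # an unrelated vector

sig = lambda v: lsh_signature(v, planes)
print(sum(p != q for p, q in zip(sig(a), sig(b))))  # few differing bits
print(sum(p != q for p, q in zip(sig(a), sig(c))))  # ~half the bits differ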
12. Membership test: Bloom filter
A Bloom filter is probabilistic, but it yields only false positives, never false negatives.
Hash each item k times to get k indices into a bit field of bits 1..m, and set those bits.
To test an item w, hash it the same way:
● at least one 0 means w definitely isn't in the set;
● all 1s mean w probably is in the set.
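A minimal Bloom filter sketch in Python, deriving the k hash functions by salting one SHA-1 hash (an assumption of this sketch, not the only construction):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0   # bit field kept in one int

    def _indices(self, item):
        # k indices into the bit field, one per salted hash.
        for i in range(self.k):
            h = hashlib.sha1(b'%d:%s' % (i, item.encode())).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits |= 1 << idx

    def __contains__(self, item):
        # Any 0 bit -> definitely absent; all 1 bits -> probably present.
        return all(self.bits >> idx & 1 for idx in self._indices(item))

bf = BloomFilter()
bf.add('cat')
print('cat' in bf, 'dog' in bf)   # True False (barring a false positive)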
15. Use Bloom filter to store graphs
False positives can only add spurious nodes to the
stored graph; real nodes are never lost.
Pell et al., PNAS 2012
16. Counting Distinct Elements
In:
an infinite stream of data
Question: how many distinct elements are there?
This is similar to:
In:
a series of coin flips
Question: how many times has the coin been flipped?
17. Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you are to see a long one.
● The length of the longest run you have seen is therefore
correlated with how many times you have flipped.
19. Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
○ hash the item into a bit string
○ count the trailing zeroes in that bit string
○ if this count > n, let n = count
● Estimated cardinality (“count distinct”) = 2^n
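A minimal sketch of this basic estimator (Flajolet-Martin style) in Python; a single estimator like this has high variance, which is exactly what HyperLogLog's many registers correct:

import hashlib

def trailing_zeroes(n):
    # Trailing zero bits of n; treated as 0 for n == 0 to keep the sketch short.
    return (n & -n).bit_length() - 1 if n else 0

def estimate_cardinality(stream):
    n = 0
    for item in stream:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], 'big')
        n = max(n, trailing_zeroes(h))
    return 2 ** n

# 10,000 distinct items; expect only the right order of magnitude.
print(estimate_cardinality(str(i) for i in range(10000)))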
20. Cardinality estimation: HyperLogLog
Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5 KB of RAM with 2% relative error.
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”, 2007
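A minimal HyperLogLog sketch following the 2007 paper, with the small- and large-range corrections omitted for brevity (so it is only reasonable well away from the empty and saturated ends):

import hashlib

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b                     # first b hash bits pick a register
        self.m = 1 << b                # number of registers
        self.M = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, m >= 128

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], 'big')
        idx = h >> (64 - self.b)                      # register index
        rest = h & ((1 << (64 - self.b)) - 1)         # remaining bits
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeroes + 1
        self.M[idx] = max(self.M[idx], rank)

    def count(self):
        # Harmonic mean of the register estimates, bias-corrected.
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.M)

hll = HyperLogLog()
for i in range(100000):
    hll.add(str(i))
print(int(hll.count()))   # ~100000; typical error 1.04/sqrt(m), ~3% here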
24. Machine Learning: Feature hashing
High-dimensional machine learning without a feature dictionary.
From Andrew Clegg, “Approximate methods for scalable data mining”.
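A minimal feature-hashing sketch in Python: tokens are mapped straight to vector indices by a hash, so no feature dictionary is ever built. The helper name and sizes are illustrative; scikit-learn's sklearn.feature_extraction.FeatureHasher is a production implementation of the same idea:

import hashlib
import numpy as np

def hash_features(tokens, n_features=2**10):
    # Map a token list to a fixed-size vector with no feature dictionary.
    vec = np.zeros(n_features)
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % n_features             # low bits choose the index
        sign = 1 - 2 * ((h >> 127) & 1)  # a high bit gives a sign, reducing collision bias
        vec[idx] += sign
    return vec

print(hash_features('the quick brown fox'.split()).shape)  # (1024,) for any vocabulary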
29. References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
30. Summary
● know the data structures
● know what you sacrifice
● control errors
http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov