SlideShare a Scribd company logo
1 of 30
Download to read offline
Probabilistic Data Structures
and Approximate Solutions
IPython notebook with code >>

by Oleksandr Pryymak
PyData London 2014
Probabilistic||Approximate: Why?
Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse stream of data
Catch:
● despite typically achieving good result, exists a
chance of the bad worst case behaviour.
● use on large datasets (law of large numbers)
Code: Approximation
import random
x = [random.randint(0,80000) for _ in xrange(10000)]
y = [i>>8 for i in x] # trim 8 bits off of integers
z = x[:500]

# 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8 # add 8 bits
avz = average(z)
print avx
print avy, 'error %.06f%%' % (100*abs(avx-avy)/float(avx))
print avz, 'error %.06f%%' % (100*abs(avx-avz)/float(avx))
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
Code: Sampling Data

Interview question:
Get K samples from an infinite stream
Probabilistic Data Structures
Generally they are:
● Use less space than a full dataset
● Require higher CPU load
● Stream-friendly
● Can be parallelized
● Have controlled error rate
Hash functions
One-way function:
arbitrary length of the key ->
to a fixed length of the message

message = hash(key)
However, collisions are possible:

hash(key1) = hash(key2)
Code: Hashing
Hash collisions and performance
●
●

Cryptographic hashes not ideal for our use (like bcrypt)
Need a fast algorithm with the lowest number of collisions:

Hash
=============
Murmur
FNV-1
DJB2
SDBM
SuperFastHash
CRC32
LoseLose

Lowercase
=============
145 ns
6 collis
184 ns
1 collis
156 ns
7 collis
148 ns
4 collis
164 ns
85 collis
250 ns
2 collis
338 ns
215178 collis

Random UUID
===========
259 ns
5 collis
730 ns
5 collis
437 ns
6 collis
484 ns
6 collis
344 ns
4 collis
946 ns
0 collis
-

Numbers
==============
92 ns
0 collis
92 ns
0 collis
93 ns
0 collis
90 ns
0 collis
118 ns
18742 collis
130 ns
0 collis
-

Murmur2 collisions
●

cataract collides with periti

●

roquette collides with skivie

●

shawl collides with stormbound

●

dowlases collides with tramontane

●

cricketings collides with twanger

●

longans collides with whigs

by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Hash randomness visualised hashmap

Great
murmur2

Not so great

on a sequence of numbers

DJB2
on a sequence of numbers
Comparison: Locality Sensitive Hashing (LSH)
Comparison: Locality Sensitive Hashing (LSH)
Image hashes

Kernelized locality-sensitive hashing for scalable image search
B Kulis, K Grauman - Computer Vision, 2009 IEEE 12th …, 2009 - ieeexplore.ieee.org
Abstract Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed highdimensional features or complex distance functions into a low-dimensional Hamming space where items can be ... Cited by 22
Membership test: Bloom filter
Bloom filter is probabilistic but only yields false positives.
Hash each item k times indices into bit field.
`

At least one 0 means
w definitely isn’t in set.
All 1s would mean w
probably is in set.

1..m
Use Bloom filter to serve requests
Code: bloom filter
Use Bloom filter to store graphs
Graphs only gain nodes because of Bloom
filter false positives.

Pell et al., PNAS 2012
Counting Distinct Elements
In:
infinite stream of data
Question: how many distinct elements are there?
is similar to:
In:
coin flips
Question: how many times it has been flipped?
Coin flips: intuition
● Long runs of HEADs in random series are rare.
● The longer you look, the more likely you see a long one.
● Long runs are very rare and are correlated with how
many coins you’ve flipped.
Code: Cardinality estimation
Cardinality estimation
Basic algorithm:
●
●

n=0
For each input item:
○ Hash item into bit string
○ Count trailing zeroes in bit string
○ If this count > n:
■ Let n = count

●

Estimated cardinality (“count distinct”) = 2^n
Cardinality estimation: HyperLogLog

Demo by: http://www.
aggregateknowledge.
com/science/blog/hll.html
Billions of distinct values in 1.5KB of
RAM with 2% relative error
HyperLogLog: the analysis of a near-optimal
cardinality estimation algorithm
P.Flajolet, É.Fusy, O.Gandouet, F.Meunier;
2007
Code: HyperLogLog
Count-min sketch
Frequency histogram
estimation with chance
of over-counting

count(value) = min{w1[h1(value)], ... wd[hd(value)]}
Code: Frequent Itemsets
Machine Learning: Feature hashing
High-dimensional
machine learning without
feature dictionary

by Andrew Clegg “Approximate methods for
scalable data mining”
Locality-sensitive hashing
To approximate nearest
neighbours

by Andrew Clegg “Approximate methods for
scalable data mining”
Probabilistic Databases
● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)

● BlinkDB v0.1alpha
(UC Berkeley and MIT)
BlinkDB: queries
Queries with Bounded Errors
and Bounded Response Times
on Very Large Data
BlinkDB: architecture
References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
Summary

● know the data structures
● know what you sacrifice
● control errors

http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilisticstructures-web-analytics-data-mining/ by Ilya Katsov

More Related Content

What's hot

Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraOlga Lavrentieva
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
From Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and ScaleFrom Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and ScaleBadrish Chandramouli
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Florian Lautenschlager
 
The Very ^ 2 Basics of R
The Very ^ 2 Basics of RThe Very ^ 2 Basics of R
The Very ^ 2 Basics of RWinston Chen
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo dbMongoDB
 
Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationNorberto Leite
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsAlbert Bifet
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesEugene Dvorkin
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Sidi chang demo
Sidi chang demoSidi chang demo
Sidi chang demoSidi Chang
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and SparkJosef Adersberger
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Paul Brebner
 
Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sri Ambati
 
A survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalAlexander Decker
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQLKohei KaiGai
 

What's hot (19)

Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
From Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and ScaleFrom Trill to Quill: Pushing the Envelope of Functionality and Scale
From Trill to Quill: Pushing the Envelope of Functionality and Scale
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
The Very ^ 2 Basics of R
The Very ^ 2 Basics of RThe Very ^ 2 Basics of R
The Very ^ 2 Basics of R
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo db
 
Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop Presentation
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Sidi chang demo
Sidi chang demoSidi chang demo
Sidi chang demo
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and Spark
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
 
Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013
 
AfterGlow
AfterGlowAfterGlow
AfterGlow
 
A survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incremental
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 

Viewers also liked

Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data StructuresSHAKOOR AB
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesLorenzo Alberton
 
File Organization
File OrganizationFile Organization
File OrganizationManyi Man
 

Viewers also liked (7)

Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data Structures
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
 
File organisation
File organisationFile organisation
File organisation
 
Ch17 Hashing
Ch17 HashingCh17 Hashing
Ch17 Hashing
 
File structures
File structuresFile structures
File structures
 
File Organization
File OrganizationFile Organization
File Organization
 
File organization
File organizationFile organization
File organization
 

Similar to Probabilistic & Approximate: Data Structures for Scalability

Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
Probabilistic Data Structures and Approximate Solutions Oleksandr PryymakProbabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
Probabilistic Data Structures and Approximate Solutions Oleksandr PryymakPyData
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
Tech talk Probabilistic Data Structure
Tech talk  Probabilistic Data StructureTech talk  Probabilistic Data Structure
Tech talk Probabilistic Data StructureRishabh Dugar
 
anti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHanti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHLeo Chu
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsChristopher Conlan
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured predictionzukun
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniquesLars Albertsson
 
Beyond php it's not (just) about the code
Beyond php   it's not (just) about the codeBeyond php   it's not (just) about the code
Beyond php it's not (just) about the codeWim Godden
 
Python Programming - IX. On Randomness
Python Programming - IX. On RandomnessPython Programming - IX. On Randomness
Python Programming - IX. On RandomnessRanel Padon
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopPranab Ghosh
 
Introduction to Julia
Introduction to JuliaIntroduction to Julia
Introduction to Julia岳華 杜
 
How we use functional programming to find the bad guys @ Build Stuff LT and U...
How we use functional programming to find the bad guys @ Build Stuff LT and U...How we use functional programming to find the bad guys @ Build Stuff LT and U...
How we use functional programming to find the bad guys @ Build Stuff LT and U...Richard Minerich
 
Neural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting RecognitionNeural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting RecognitionJohn Liu
 
Spatially resolved pair correlation functions for point cloud data
Spatially resolved pair correlation functions for point cloud dataSpatially resolved pair correlation functions for point cloud data
Spatially resolved pair correlation functions for point cloud dataTony Fast
 
Anil Thomas - Object recognition
Anil Thomas - Object recognitionAnil Thomas - Object recognition
Anil Thomas - Object recognitionIntel Nervana
 

Similar to Probabilistic & Approximate: Data Structures for Scalability (20)

Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
Probabilistic Data Structures and Approximate Solutions Oleksandr PryymakProbabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Tech talk Probabilistic Data Structure
Tech talk  Probabilistic Data StructureTech talk  Probabilistic Data Structure
Tech talk Probabilistic Data Structure
 
anti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHanti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIH
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data Scientists
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured prediction
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
 
Beyond php it's not (just) about the code
Beyond php   it's not (just) about the codeBeyond php   it's not (just) about the code
Beyond php it's not (just) about the code
 
Python Programming - IX. On Randomness
Python Programming - IX. On RandomnessPython Programming - IX. On Randomness
Python Programming - IX. On Randomness
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using Hadoop
 
Introduction to Julia
Introduction to JuliaIntroduction to Julia
Introduction to Julia
 
How we use functional programming to find the bad guys @ Build Stuff LT and U...
How we use functional programming to find the bad guys @ Build Stuff LT and U...How we use functional programming to find the bad guys @ Build Stuff LT and U...
How we use functional programming to find the bad guys @ Build Stuff LT and U...
 
Neural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting RecognitionNeural Networks in the Wild: Handwriting Recognition
Neural Networks in the Wild: Handwriting Recognition
 
Spatially resolved pair correlation functions for point cloud data
Spatially resolved pair correlation functions for point cloud dataSpatially resolved pair correlation functions for point cloud data
Spatially resolved pair correlation functions for point cloud data
 
Anil Thomas - Object recognition
Anil Thomas - Object recognitionAnil Thomas - Object recognition
Anil Thomas - Object recognition
 
Self healing data
Self healing dataSelf healing data
Self healing data
 

More from Oleksandr Pryymak

Information surprise or how to find interesting data
Information surprise or how to find interesting dataInformation surprise or how to find interesting data
Information surprise or how to find interesting dataOleksandr Pryymak
 
Efficient opinion sharing in large decentralised teams
Efficient opinion sharing in large decentralised teamsEfficient opinion sharing in large decentralised teams
Efficient opinion sharing in large decentralised teamsOleksandr Pryymak
 
Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...
Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...
Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...Oleksandr Pryymak
 

More from Oleksandr Pryymak (8)

Information surprise or how to find interesting data
Information surprise or how to find interesting dataInformation surprise or how to find interesting data
Information surprise or how to find interesting data
 
Efficient opinion sharing in large decentralised teams
Efficient opinion sharing in large decentralised teamsEfficient opinion sharing in large decentralised teams
Efficient opinion sharing in large decentralised teams
 
Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...
Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...
Efficient Sharing of Conflicting Opinions with Minimal Communication in Large...
 
Semantic Web - Introduction
Semantic Web - IntroductionSemantic Web - Introduction
Semantic Web - Introduction
 
sumno.com - march 2009
sumno.com - march 2009sumno.com - march 2009
sumno.com - march 2009
 
Sumno.com (eng)
Sumno.com (eng)Sumno.com (eng)
Sumno.com (eng)
 
Sumno.com (ukr)
Sumno.com (ukr)Sumno.com (ukr)
Sumno.com (ukr)
 
Gwt.org.ua (ukr)
Gwt.org.ua (ukr)Gwt.org.ua (ukr)
Gwt.org.ua (ukr)
 

Recently uploaded

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Recently uploaded (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

Probabilistic & Approximate: Data Structures for Scalability

  • 1. Probabilistic Data Structures and Approximate Solutions IPython notebook with code >> by Oleksandr Pryymak PyData London 2014
  • 2. Probabilistic||Approximate: Why? Often: ● an approximate answer is sufficient ● need to trade accuracy for scalability or speed ● need to analyse stream of data Catch: ● despite typically achieving good result, exists a chance of the bad worst case behaviour. ● use on large datasets (law of large numbers)
  • 3. Code: Approximation import random x = [random.randint(0,80000) for _ in xrange(10000)] y = [i>>8 for i in x] # trim 8 bits off of integers z = x[:500] # 5% sample (x is uniform) avx = average(x) avy = average(y) * 2**8 # add 8 bits avz = average(z) print avx print avy, 'error %.06f%%' % (100*abs(avx-avy)/float(avx)) print avz, 'error %.06f%%' % (100*abs(avx-avz)/float(avx)) 39547.8816 39420.7744 error 0.321401% 39591.424 error 0.110100%
  • 4. Code: Sampling Data Interview question: Get K samples from an infinite stream
  • 5. Probabilistic Data Structures Generally they are: ● Use less space than a full dataset ● Require higher CPU load ● Stream-friendly ● Can be parallelized ● Have controlled error rate
  • 6. Hash functions One-way function: arbitrary length of the key -> to a fixed length of the message message = hash(key) However, collisions are possible: hash(key1) = hash(key2)
  • 8. Hash collisions and performance ● ● Cryptographic hashes not ideal for our use (like bcrypt) Need a fast algorithm with the lowest number of collisions: Hash ============= Murmur FNV-1 DJB2 SDBM SuperFastHash CRC32 LoseLose Lowercase ============= 145 ns 6 collis 184 ns 1 collis 156 ns 7 collis 148 ns 4 collis 164 ns 85 collis 250 ns 2 collis 338 ns 215178 collis Random UUID =========== 259 ns 5 collis 730 ns 5 collis 437 ns 6 collis 484 ns 6 collis 344 ns 4 collis 946 ns 0 collis - Numbers ============== 92 ns 0 collis 92 ns 0 collis 93 ns 0 collis 90 ns 0 collis 118 ns 18742 collis 130 ns 0 collis - Murmur2 collisions ● cataract collides with periti ● roquette collides with skivie ● shawl collides with stormbound ● dowlases collides with tramontane ● cricketings collides with twanger ● longans collides with whigs by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
  • 9. Hash randomness visualised hashmap Great murmur2 Not so great on a sequence of numbers DJB2 on a sequence of numbers
  • 11. Comparison: Locality Sensitive Hashing (LSH) Image hashes Kernelized locality-sensitive hashing for scalable image search B Kulis, K Grauman - Computer Vision, 2009 IEEE 12th …, 2009 - ieeexplore.ieee.org Abstract Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed highdimensional features or complex distance functions into a low-dimensional Hamming space where items can be ... Cited by 22
  • 12. Membership test: Bloom filter Bloom filter is probabilistic but only yields false positives. Hash each item k times indices into bit field. ` At least one 0 means w definitely isn’t in set. All 1s would mean w probably is in set. 1..m
  • 13. Use Bloom filter to serve requests
  • 15. Use Bloom filter to store graphs Graphs only gain nodes because of Bloom filter false positives. Pell et al., PNAS 2012
  • 16. Counting Distinct Elements In: infinite stream of data Question: how many distinct elements are there? is similar to: In: coin flips Question: how many times it has been flipped?
  • 17. Coin flips: intuition ● Long runs of HEADs in random series are rare. ● The longer you look, the more likely you see a long one. ● Long runs are very rare and are correlated with how many coins you’ve flipped.
  • 19. Cardinality estimation Basic algorithm: ● ● n=0 For each input item: ○ Hash item into bit string ○ Count trailing zeroes in bit string ○ If this count > n: ■ Let n = count ● Estimated cardinality (“count distinct”) = 2^n
  • 20. Cardinality estimation: HyperLogLog Demo by: http://www. aggregateknowledge. com/science/blog/hll.html Billions of distinct values in 1.5KB of RAM with 2% relative error HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm P.Flajolet, É.Fusy, O.Gandouet, F.Meunier; 2007
  • 22. Count-min sketch Frequency histogram estimation with chance of over-counting count(value) = min{w1[h1(value)], ... wd[hd(value)]}
  • 24. Machine Learning: Feature hashing High-dimensional machine learning without feature dictionary by Andrew Clegg “Approximate methods for scalable data mining”
  • 25. Locality-sensitive hashing To approximate nearest neighbours by Andrew Clegg “Approximate methods for scalable data mining”
  • 26. Probabilistic Databases ● PrDB (University of Maryland) ● Orion (Purdue University) ● MayBMS (Cornell University) ● BlinkDB v0.1alpha (UC Berkeley and MIT)
  • 27. BlinkDB: queries Queries with Bounded Errors and Bounded Response Times on Very Large Data
  • 29. References Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman http://infolab.stanford.edu/~ullman/mmds.html
  • 30. Summary ● know the data structures ● know what you sacrifice ● control errors http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df http://highlyscalable.wordpress.com/2012/05/01/probabilisticstructures-web-analytics-data-mining/ by Ilya Katsov