SlideShare une entreprise Scribd logo
1  sur  124
It Probably Works
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/probabilistic-algorithms
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Tyler McMullen
CTO of Fastly
@tbmcmullen
Fastly
We’re an awesome CDN.
What is a probabilistic
algorithm?
Why bother?
–Hal Abelson and Gerald J. Sussman, SICP
“In testing primality of very large numbers chosen at
random, the chance of stumbling upon a value that
fools the Fermat test is less than the chance that
cosmic radiation will cause the computer to make
an error in carrying out a 'correct' algorithm.
Considering an algorithm to be inadequate for the
first reason but not for the second illustrates the
difference between mathematics and engineering.”
Everything is
probabilistic
Probabilistic algorithms
are not “guessing”
Provably bounded
error rates
Don’t use them if
someone’s life depends
on it.
Don’t use them if
someone’s life depends
on it.
When should I use
these?
Ok, now what?
The Count-distinct
Problem
The Problem
How many unique words are in a large
corpus of text?
The Problem
How many different users visited a
popular website in a day?
The Problem
How many unique IPs have connected to a
server over the last hour?
The Problem
How many unique URLs have been requested
through an HTTP proxy?
The Problem
How many unique URLs have been requested
through an entire network of HTTP proxies?
166.208.249.236	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /product/982"	
  200	
  20246	
  
103.138.203.165	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /article/490"	
  200	
  29870	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /page/1"	
  200	
  20409	
  
150.255.232.154	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /page/1"	
  200	
  42999	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /article/490"	
  200	
  25080	
  
150.255.232.154	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /listing/567"	
  200	
  33617	
  
103.138.203.165	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /listing/567"	
  200	
  29618	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /page/1"	
  200	
  30265	
  
166.208.249.236	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /page/1"	
  200	
  5683	
  
244.210.202.222	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /article/490"	
  200	
  47124	
  
103.138.203.165	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /listing/567"	
  200	
  48734	
  
103.138.203.165	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /listing/567"	
  200	
  27392	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /listing/567"	
  200	
  15705	
  
150.255.232.154	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /page/1"	
  200	
  22587	
  
244.210.202.222	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /product/982"	
  200	
  30063	
  
244.210.202.222	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /page/1"	
  200	
  6041	
  
166.208.249.236	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /product/982"	
  200	
  25783	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /article/490"	
  200	
  1099	
  
244.210.202.222	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /product/982"	
  200	
  31494	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /listing/567"	
  200	
  30389	
  
150.255.232.154	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /article/490"	
  200	
  10251	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /product/982"	
  200	
  19384	
  
150.255.232.154	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /product/982"	
  200	
  24062	
  
244.210.202.222	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /article/490"	
  200	
  19070	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /page/648"	
  200	
  45159	
  
191.141.247.227	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "HEAD	
  /page/648"	
  200	
  5576	
  
166.208.249.236	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /page/648"	
  200	
  41869	
  
166.208.249.236	
  -­‐	
  -­‐	
  [25/Feb/2015:07:20:13	
  +0000]	
  "GET	
  /listing/567"	
  200	
  42414	
  
def	
  count_distinct(stream):	
  
	
  	
  	
  	
  seen	
  =	
  set()	
  
	
  	
  	
  	
  for	
  item	
  in	
  stream:	
  
	
  	
  	
  	
  	
  	
  	
  	
  seen.add(item)	
  
	
  	
  	
  	
  return	
  len(seen)
Scale.
Count-distinct across
thousands of servers.
def	
  combined_cardinality(seen_sets):	
  
	
  	
  	
  	
  combined	
  =	
  set()	
  
	
  	
  	
  	
  for	
  seen	
  in	
  seen_sets:	
  
	
  	
  	
  	
  	
  	
  	
  	
  combined	
  |=	
  seen	
  
	
  	
  	
  	
  return	
  len(combined)
The set grows linearly.
Precision comes at a cost.
“example.com/user/page/1”
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
“example.com/user/page/1”
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
P(bit	
  0	
  is	
  set)	
  =	
  0.5
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
P(bit	
  0	
  is	
  set	
  
&	
  bit	
  1	
  is	
  set)	
  =	
  0.25
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
P(bit	
  0	
  is	
  set	
  	
  
&	
  bit	
  1	
  is	
  set	
  
&	
  bit	
  2	
  is	
  set)	
  =	
  0.125
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
P(bit	
  0	
  is	
  set)	
  =	
  0.5	
  
Expected	
  trials	
  =	
  2
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
P(bit	
  0	
  is	
  set	
  
&	
  bit	
  1	
  is	
  set)	
  =	
  0.25	
  
Expected	
  trials	
  =	
  4
1	
  0	
  1	
  1	
  0	
  1	
  0	
  1
P(bit	
  0	
  is	
  set	
  	
  
&	
  bit	
  1	
  is	
  set	
  
&	
  bit	
  2	
  is	
  set)	
  =	
  0.125	
  
Expected	
  trials	
  =	
  8
We expect the maximum number of leading
zeros we have seen + 1 to approximate
log2(unique items).
Improve the accuracy of
the estimate by
partitioning the input data.
class	
  LogLog(object):	
  
	
  	
  	
  	
  def	
  __init__(self,	
  k):	
  
	
  	
  	
  	
  	
  	
  	
  	
  self.k	
  =	
  k	
  
	
  	
  	
  	
  	
  	
  	
  	
  self.m	
  =	
  2	
  **	
  k	
  
	
  	
  	
  	
  	
  	
  	
  	
  self.M	
  =	
  np.zeros(self.m,	
  dtype=np.int)	
  
	
  	
  	
  	
  	
  	
  	
  	
  self.alpha	
  =	
  Alpha[k]	
  
	
  	
  
	
  	
  	
  	
  def	
  insert(self,	
  token):	
  
	
  	
  	
  	
  	
  	
  	
  	
  y	
  =	
  hash_fn(token)	
  
	
  	
  	
  	
  	
  	
  	
  	
  j	
  =	
  y	
  >>	
  (hash_len	
  -­‐	
  self.k)	
  
	
  	
  	
  	
  	
  	
  	
  	
  remaining	
  =	
  y	
  &	
  ((1	
  <<	
  (hash_len	
  -­‐	
  self.k))	
  -­‐	
  1)	
  
	
  	
  	
  	
  	
  	
  	
  	
  first_set_bit	
  =	
  (64	
  -­‐	
  self.k)	
  -­‐	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  int(math.log(remaining,	
  2))	
  
	
  	
  	
  	
  	
  	
  	
  	
  self.M[j]	
  =	
  max(self.M[j],	
  first_set_bit)	
  
	
  	
  
	
  	
  	
  	
  def	
  cardinality(self):	
  
	
  	
  	
  	
  	
  	
  	
  	
  return	
  self.alpha	
  *	
  2	
  **	
  np.mean(self.M)
Unions of
HyperLogLogs
HyperLogLog
• Adding an item: O(1)
• Retrieving cardinality: O(1)
• Space: O(log log n)
• Error rate: 2%
HyperLogLog
For 100 million unique items,
and an error rate of 2%,
the size of the HyperLogLog is...
1,500 bytes
Reliable Broadcast
The Problem
Reliably broadcast “purge” messages across the world
as quickly as possible.
Single source of
truth
Single source of
failures
Atomic broadcast
Reliable broadcast
Gossip Protocols
“Designed for Scale”
Probabilistic Guarantees
Bimodal Multicast
• Quickly broadcast message to all servers
• Gossip to recover lost messages
One Problem
Computers have limited space
Throw away messages
“with high probability” is fine
Real World
End-to-End Latency
74ms
83ms
133ms
London
San Jose
Tokyo
0.00
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0 50 100 150
Latency (ms)
Density
End-to-End Latency
42ms
74ms
83ms
133ms
New York
London
San Jose
Tokyo
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0 50 100 150
Latency (ms)
Density
Density plot and 95th
percentile of purge latency by server location
Packet Loss
Good systems are boring
What was the point again?
We can build things that
are otherwise
unrealistic
We can build systems
that are more
reliable
You’re already using them.
We’re hiring!
Thanks
@tbmcmullen
What even is this?
Probabilistic Algorithms
Randomized Algorithms
Estimation Algorithms
Probabilistic Algorithms
1. An iota of theory
2. Where are they useful and where are they not?
3. HyperLogLog
4. Locality-sensitive Hashing
5. Bimodal Multicast
“An algorithm that uses randomness to improve
its efficiency”
Las Vegas
Monte Carlo
Las Vegas
def	
  find_las_vegas(haystack,	
  needle):	
  
	
  	
  	
  	
  length	
  =	
  len(haystack)	
  
	
  	
  	
  	
  while	
  True:	
  
	
  	
  	
  	
  	
  	
  	
  	
  index	
  =	
  randrange(length)	
  
	
  	
  	
  	
  	
  	
  	
  	
  if	
  haystack[index]	
  ==	
  needle:	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  return	
  index
Monte Carlo
def	
  find_monte_carlo(haystack,	
  needle,	
  k):	
  
	
  	
  	
  	
  length	
  =	
  len(haystack)	
  
	
  	
  	
  	
  for	
  i	
  in	
  range(k):	
  
	
  	
  	
  	
  	
  	
  	
  	
  index	
  =	
  randrange(length)	
  
	
  	
  	
  	
  	
  	
  	
  	
  if	
  haystack[index]	
  ==	
  needle:	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  return	
  index
– Prabhakar Raghavan (author of Randomized Algorithms)
“For many problems a randomized
algorithm is the simplest the
fastest or both.”
Naive Solution
For 100 million unique IPv4 addresses,
the size of the hash is...
>400mb
Slightly Less Naive
Add each IP to a bloom filter and keep a
counter of the IPs that don’t collide.
Slightly Less Naive
ips_seen	
  =	
  BloomFilter(capacity=expected_size,	
  error_rate=0.03)	
  
counter	
  =	
  0	
  
	
  	
  
for	
  line	
  in	
  log_file:	
  
	
  	
  	
  	
  ip	
  =	
  extract_ip(line)	
  
	
  	
  	
  	
  if	
  items_bloom.add(ip):	
  
	
  	
  	
  	
  	
  	
  	
  	
  counter	
  +=	
  1	
  
	
  	
  
print	
  "Unique	
  IPs:",	
  counter
Slightly Less Naive
• Adding an IP: O(1)
• Retrieving cardinality: O(1)
• Space: O(n)
• Error rate: 3%
kind of
Slightly Less Naive
For 100 million unique IPv4 addresses,
and an error rate of 3%,
the size of the bloom filter is...
87mb
def	
  insert(self,	
  token):	
  
	
  	
  	
  	
  #	
  Get	
  hash	
  of	
  token	
  
	
  	
  	
  	
  y	
  =	
  hash_fn(token)	
  
	
  	
  
	
  	
  	
  	
  #	
  Extract	
  `k`	
  most	
  significant	
  bits	
  of	
  `y`	
  
	
  	
  	
  	
  j	
  =	
  y	
  >>	
  (hash_len	
  -­‐	
  self.k)	
  
	
  	
  
	
  	
  	
  	
  #	
  Extract	
  remaining	
  bits	
  of	
  `y`	
  
	
  	
  	
  	
  remaining	
  =	
  y	
  &	
  ((1	
  <<	
  (hash_len	
  -­‐	
  self.k))	
  -­‐	
  1)	
  
	
  	
  
	
  	
  	
  	
  #	
  Find	
  "first"	
  set	
  bit	
  of	
  `remaining`	
  
	
  	
  	
  	
  first_set_bit	
  =	
  (64	
  -­‐	
  self.k)	
  -­‐	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  int(math.log(remaining,	
  2))	
  
	
  	
  
	
  	
  	
  	
  #	
  Update	
  `M[j]`	
  to	
  max	
  of	
  `first_set_bit`	
  
	
  	
  	
  	
  #	
  and	
  existing	
  value	
  of	
  `M[j]`	
  
	
  	
  	
  	
  self.M[j]	
  =	
  max(self.M[j],	
  first_set_bit)
def	
  cardinality(self):	
  
	
  	
  	
  	
  #	
  The	
  mean	
  of	
  `M`	
  estimates	
  `log2(n)`	
  with	
  
	
  	
  	
  	
  #	
  an	
  additive	
  bias	
  
	
  	
  	
  	
  return	
  self.alpha	
  *	
  2	
  **	
  np.mean(self.M)
The Problem
Find documents that are similar to one
specific document.
The Problem
Find images that are similar to one
specific image.
The Problem
Find graphs that are correlated to one
specific graph.
The Problem
Nearest neighbor search.
The Problem
“Find the n closest points in a d-dimensional space.”
The Problem
You have a bunch of things and you want to
figure out which ones are similar.
“There has been a lot of recent work on streaming algorithms,
i.e. algorithms that produce an output by making
one pass (or a few passes) over the data while using a
limited amount of storage space and time. To cite a few
examples, ...”
{	
  "there":	
  1,	
  "has":	
  1,	
  "been":	
  	
  1,	
  “a":4,	
  
	
  	
  "lot":	
  	
  	
  1,	
  "of":	
  	
  2,	
  "recent":1,	
  ...	
  }
• Cosine similarity
• Jaccard similarity
• Euclidian distance
• etc etc etc
Euclidian Distance
Metric space
kd-trees
Curse of Dimensionality
Locality-sensitive hashing
Locality-Sensitive Hashing for Finding Nearest Neighbors
- Slaney and Casey
Random Hyperplanes
{	
  "there":	
  1,	
  "has":	
  1,	
  "been":	
  	
  1,	
  “a":4,	
  
	
  	
  "lot":	
  	
  	
  1,	
  "of":	
  	
  2,	
  "recent":1,	
  ...	
  }
LSH
0	
  1	
  1	
  1	
  0	
  1	
  0	
  0	
  0	
  0	
  1	
  0	
  1	
  0	
  1	
  0	
  ...
Cosine Similarity
LSH
Firewall Partition
DDoS
• `
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations/
probabilistic-algorithms

Contenu connexe

Plus de C4Media

Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 

Plus de C4Media (20)

Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 

Dernier

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Dernier (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

It Probably Works

  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /probabilistic-algorithms
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. Tyler McMullen CTO of Fastly @tbmcmullen
  • 6. What is a probabilistic algorithm?
  • 8. –Hal Abelson and Gerald J. Sussman, SICP “In testing primality of very large numbers chosen at random, the chance of stumbling upon a value that fools the Fermat test is less than the chance that cosmic radiation will cause the computer to make an error in carrying out a 'correct' algorithm. Considering an algorithm to be inadequate for the first reason but not for the second illustrates the difference between mathematics and engineering.”
  • 12. Don’t use them if someone’s life depends on it.
  • 13. Don’t use them if someone’s life depends on it.
  • 14. When should I use these?
  • 17. The Problem How many unique words are in a large corpus of text?
  • 18. The Problem How many different users visited a popular website in a day?
  • 19. The Problem How many unique IPs have connected to a server over the last hour?
  • 20. The Problem How many unique URLs have been requested through an HTTP proxy?
  • 21. The Problem How many unique URLs have been requested through an entire network of HTTP proxies?
  • 22. 166.208.249.236  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /product/982"  200  20246   103.138.203.165  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /article/490"  200  29870   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /page/1"  200  20409   150.255.232.154  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /page/1"  200  42999   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /article/490"  200  25080   150.255.232.154  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /listing/567"  200  33617   103.138.203.165  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /listing/567"  200  29618   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /page/1"  200  30265   166.208.249.236  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /page/1"  200  5683   244.210.202.222  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /article/490"  200  47124   103.138.203.165  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /listing/567"  200  48734   103.138.203.165  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /listing/567"  200  27392   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /listing/567"  200  15705   150.255.232.154  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /page/1"  200  22587   244.210.202.222  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /product/982"  200  30063   244.210.202.222  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /page/1"  200  6041   166.208.249.236  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /product/982"  200  25783   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /article/490"  200  1099   244.210.202.222  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /product/982"  200  31494   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /listing/567"  200  30389   150.255.232.154  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /article/490"  200  10251   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /product/982"  200  19384   150.255.232.154  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /product/982"  200  24062   244.210.202.222  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /article/490"  200  19070   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /page/648"  200  45159   191.141.247.227  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "HEAD  /page/648"  200  5576   166.208.249.236  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /page/648"  200  41869   166.208.249.236  -­‐  -­‐  [25/Feb/2015:07:20:13  +0000]  "GET  /listing/567"  200  42414  
  • 23. def  count_distinct(stream):          seen  =  set()          for  item  in  stream:                  seen.add(item)          return  len(seen)
  • 26. def  combined_cardinality(seen_sets):          combined  =  set()          for  seen  in  seen_sets:                  combined  |=  seen          return  len(combined)
  • 27. The set grows linearly.
  • 29.
  • 30.
  • 32. 1  0  1  1  0  1  0  1 “example.com/user/page/1”
  • 33. 1  0  1  1  0  1  0  1
  • 34. 1  0  1  1  0  1  0  1 P(bit  0  is  set)  =  0.5
  • 35. 1  0  1  1  0  1  0  1 P(bit  0  is  set   &  bit  1  is  set)  =  0.25
  • 36. 1  0  1  1  0  1  0  1 P(bit  0  is  set     &  bit  1  is  set   &  bit  2  is  set)  =  0.125
  • 37. 1  0  1  1  0  1  0  1 P(bit  0  is  set)  =  0.5   Expected  trials  =  2
  • 38. 1  0  1  1  0  1  0  1 P(bit  0  is  set   &  bit  1  is  set)  =  0.25   Expected  trials  =  4
  • 39. 1  0  1  1  0  1  0  1 P(bit  0  is  set     &  bit  1  is  set   &  bit  2  is  set)  =  0.125   Expected  trials  =  8
  • 40. We expect the maximum number of leading zeros we have seen + 1 to approximate log2(unique items).
  • 41. Improve the accuracy of the estimate by partitioning the input data.
  • 42. class  LogLog(object):          def  __init__(self,  k):                  self.k  =  k                  self.m  =  2  **  k                  self.M  =  np.zeros(self.m,  dtype=np.int)                  self.alpha  =  Alpha[k]              def  insert(self,  token):                  y  =  hash_fn(token)                  j  =  y  >>  (hash_len  -­‐  self.k)                  remaining  =  y  &  ((1  <<  (hash_len  -­‐  self.k))  -­‐  1)                  first_set_bit  =  (64  -­‐  self.k)  -­‐                                                  int(math.log(remaining,  2))                  self.M[j]  =  max(self.M[j],  first_set_bit)              def  cardinality(self):                  return  self.alpha  *  2  **  np.mean(self.M)
  • 44. HyperLogLog • Adding an item: O(1) • Retrieving cardinality: O(1) • Space: O(log log n) • Error rate: 2%
  • 45. HyperLogLog For 100 million unique items, and an error rate of 2%, the size of the HyperLogLog is... 1,500 bytes
  • 47. The Problem Reliably broadcast “purge” messages across the world as quickly as possible.
  • 52.
  • 53.
  • 55.
  • 56.
  • 59.
  • 60. Bimodal Multicast • Quickly broadcast message to all servers • Gossip to recover lost messages
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67. One Problem Computers have limited space
  • 69.
  • 70.
  • 71.
  • 75. End-to-End Latency 42ms 74ms 83ms 133ms New York London San Jose Tokyo 0.00 0.05 0.10 0.00 0.05 0.10 0.00 0.05 0.10 0.00 0.05 0.10 0 50 100 150 Latency (ms) Density Density plot and 95th percentile of purge latency by server location
  • 78. What was the point again?
  • 79. We can build things that are otherwise unrealistic
  • 80. We can build systems that are more reliable
  • 84. What even is this?
  • 88. Probabilistic Algorithms 1. An iota of theory 2. Where are they useful and where are they not? 3. HyperLogLog 4. Locality-sensitive Hashing 5. Bimodal Multicast
  • 89. “An algorithm that uses randomness to improve its efficiency”
  • 92. Las Vegas def  find_las_vegas(haystack,  needle):          length  =  len(haystack)          while  True:                  index  =  randrange(length)                  if  haystack[index]  ==  needle:                          return  index
  • 93. Monte Carlo def  find_monte_carlo(haystack,  needle,  k):          length  =  len(haystack)          for  i  in  range(k):                  index  =  randrange(length)                  if  haystack[index]  ==  needle:                          return  index
  • 94. – Prabhakar Raghavan (author of Randomized Algorithms) “For many problems a randomized algorithm is the simplest the fastest or both.”
  • 95. Naive Solution For 100 million unique IPv4 addresses, the size of the hash is... >400mb
  • 96. Slightly Less Naive Add each IP to a bloom filter and keep a counter of the IPs that don’t collide.
  • 97. Slightly Less Naive ips_seen  =  BloomFilter(capacity=expected_size,  error_rate=0.03)   counter  =  0       for  line  in  log_file:          ip  =  extract_ip(line)          if  items_bloom.add(ip):                  counter  +=  1       print  "Unique  IPs:",  counter
  • 98. Slightly Less Naive • Adding an IP: O(1) • Retrieving cardinality: O(1) • Space: O(n) • Error rate: 3% kind of
  • 99. Slightly Less Naive For 100 million unique IPv4 addresses, and an error rate of 3%, the size of the bloom filter is... 87mb
  • 100. def  insert(self,  token):          #  Get  hash  of  token          y  =  hash_fn(token)              #  Extract  `k`  most  significant  bits  of  `y`          j  =  y  >>  (hash_len  -­‐  self.k)              #  Extract  remaining  bits  of  `y`          remaining  =  y  &  ((1  <<  (hash_len  -­‐  self.k))  -­‐  1)              #  Find  "first"  set  bit  of  `remaining`          first_set_bit  =  (64  -­‐  self.k)  -­‐                                          int(math.log(remaining,  2))              #  Update  `M[j]`  to  max  of  `first_set_bit`          #  and  existing  value  of  `M[j]`          self.M[j]  =  max(self.M[j],  first_set_bit)
  • 101. def  cardinality(self):          #  The  mean  of  `M`  estimates  `log2(n)`  with          #  an  additive  bias          return  self.alpha  *  2  **  np.mean(self.M)
  • 102. The Problem Find documents that are similar to one specific document.
  • 103. The Problem Find images that are similar to one specific image.
  • 104. The Problem Find graphs that are correlated to one specific graph.
  • 106. The Problem “Find the n closest points in a d-dimensional space.”
  • 107. The Problem You have a bunch of things and you want to figure out which ones are similar.
  • 108.
  • 109. “There has been a lot of recent work on streaming algorithms, i.e. algorithms that produce an output by making one pass (or a few passes) over the data while using a limited amount of storage space and time. To cite a few examples, ...” {  "there":  1,  "has":  1,  "been":    1,  “a":4,      "lot":      1,  "of":    2,  "recent":1,  ...  }
  • 110. • Cosine similarity • Jaccard similarity • Euclidian distance • etc etc etc
  • 116.
  • 117.
  • 118. Locality-Sensitive Hashing for Finding Nearest Neighbors - Slaney and Casey
  • 120. {  "there":  1,  "has":  1,  "been":    1,  “a":4,      "lot":      1,  "of":    2,  "recent":1,  ...  } LSH 0  1  1  1  0  1  0  0  0  0  1  0  1  0  1  0  ...
  • 124. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/ probabilistic-algorithms