Set Operations - Union Find and Bloom Filters

CS 6213 – Advanced Data Structures
DISJOINT SET DATA STRUCTURES

 Instructor
 Prof. Amrinder Arora
 amrinder@gwu.edu
 Please copy TA on emails
 Please feel free to call as well
 TA
 Iswarya Parupudi
 iswarya2291@gwmail.gwu.edu
LOGISTICS
Set Data Structures CS213 - Advanced Data Structures - Arora - GWU 2

CS 6213
Basics: Record / Struct; Arrays / Linked Lists / Stacks / Queues; Graphs / Trees / BSTs
Advanced: Trie, B-Tree; Splay Trees; R-Trees; Heaps and PQs; Union Find, Sets
WHERE WE ARE

 Robert Tarjan – Princeton University
 Mikko Malinen – University of Eastern Finland
 Pasi Fränti – University of Eastern Finland
 Henry Kautz – University of Washington
 Michael Mitzenmacher – Harvard University
 Eli Upfal – Brown University
CREDITS

 Kruskal's Algorithm for Minimum Spanning Tree
 Identifying islands in Social Networks
 Maze Design
 Logical/Physical Network Design
 Identifying equivalence classes
APPLICATIONS

1. Connected
2. Just one path between any two rooms
3. Random
WHAT’S A GOOD MAZE?

THE MAZE CONSTRUCTION PROBLEM
 Given:
 collection of rooms: V
 connections between rooms (initially all closed): E
 Construct a maze:
 collection of rooms: V' = V
 designated rooms in, i ∈ V, and out, o ∈ V
 collection of connections to knock down: E' ⊆ E
such that one unique path connects every two rooms

 While edges remain in E
 Remove a random edge e = (u, v) from E
 How can we do this efficiently?
 If u and v have not yet been connected
 add e to E' (the set of connections to knock down)
 mark u and v as connected
 How to check connectedness efficiently?
MAZE CONSTRUCTION ALGORITHM
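One possible answer to both questions, as a minimal sketch rather than a definitive implementation: shuffling the edge list once is equivalent to repeatedly removing a random edge, and a parent-pointer structure (the up-tree idea developed in the following slides) can answer the connectedness check. The function name `build_maze` and the 3x3 grid are hypothetical, chosen only for illustration.

```python
import random

def build_maze(rooms, walls, seed=None):
    """Return the walls to knock down so that exactly one path joins any two rooms."""
    rng = random.Random(seed)
    parent = {r: r for r in rooms}      # each room starts in its own set

    def find(r):
        # Walk up to the representative of r's set.
        while parent[r] != r:
            r = parent[r]
        return r

    remaining = list(walls)
    rng.shuffle(remaining)              # one shuffle = removing random edges one by one
    knocked_down = []
    for u, v in remaining:
        ru, rv = find(u), find(v)
        if ru != rv:                    # u and v have not yet been connected
            knocked_down.append((u, v))
            parent[ru] = rv             # mark u and v as connected
    return knocked_down

# Hypothetical 3x3 grid of rooms with a wall between each pair of neighbours.
rooms = [(r, c) for r in range(3) for c in range(3)]
walls = [((r, c), (r, c + 1)) for r in range(3) for c in range(2)] + \
        [((r, c), (r + 1, c)) for r in range(2) for c in range(3)]
maze = build_maze(rooms, walls, seed=1)
print(len(maze))  # 8: a spanning tree on 9 rooms has 8 edges
```

Because a wall is removed only when it joins two rooms that were not yet connected, the result is a spanning tree of the grid, which is what gives the "just one path" property.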

 Operations to support
 MakeSet(x): Make a new set with a single element x
 Union (S1, S2): Merge the sets S1 and S2
 Find(x): Find the set containing the element x
DISJOINT SET PROBLEM

 If using linked lists
 Find can take O(n) time
 Union can be done in O(1) time
 MakeSet can be done in O(1) time
 If using a hash table (element → set label)
 Find can be done in O(1) time
 Union takes O(n) time
 MakeSet can be done in O(1) time
 Any other “trivial” ideas?
WHY NOT USE LISTS OR HASH TABLES?
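To make the hash-table trade-off concrete, here is a minimal sketch, assuming each element is mapped to a set label in a Python dict. The class name `QuickFind` is hypothetical, and `union` takes two elements rather than two sets, which is a simplification of the Union(S1, S2) operation.

```python
class QuickFind:
    """Hash-table approach: label[x] names the set that contains x."""

    def __init__(self):
        self.label = {}

    def make_set(self, x):
        self.label[x] = x                 # O(1): x is the only member of its set

    def find(self, x):
        return self.label[x]              # O(1) expected: a single lookup

    def union(self, x, y):
        old, new = self.label[x], self.label[y]
        if old == new:
            return
        # O(n): every member of the old set has to be relabelled.
        for item, name in self.label.items():
            if name == old:
                self.label[item] = new

qf = QuickFind()
for x in "abcd":
    qf.make_set(x)
qf.union("a", "b")
qf.union("c", "d")
print(qf.find("a") == qf.find("b"))  # True
print(qf.find("a") == qf.find("c"))  # False
```

The scan inside `union` is the O(n) cost noted above; the up-tree structure on the following slides avoids it.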

UP-TREE UNION-FIND DATA STRUCTURE
 Each subset is an up-tree with its root as its representative member
 All members of a given set are nodes in that set's up-tree
Up-trees are not necessarily binary!
[Figure: a forest of up-trees over the elements a through i, with roots a, c, g and h]

FIND
[Figure: the same forest of up-trees, with find(f) and find(e) each traced up to the root]
Just traverse to the root!
Runtime: O(height)

UNION
[Figure: the same forest of up-trees, with union(a,c) joining the trees rooted at a and c]
Just hang one root from the other!
Runtime: O(1), given the two roots

 A forest of up-trees can easily be stored in an array.
[Figure: a forest of up-trees over the elements a through i]
index:     0 (a)  1 (b)  2 (c)  3 (d)  4 (e)  5 (f)  6 (g)  7 (h)  8 (i)
up-index:   -1      0     -1      0      1      2     -1     -1      7
NIFTY STORAGE TRICK
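The array representation can be sketched in a few lines. This assumes the slide's up-index array reads -1 0 -1 0 1 2 -1 -1 7 for elements a through i (a reconstruction, with -1 marking a root); the function names are illustrative only.

```python
# up[i] is the index of i's parent, or -1 if i is a root.
# Assumed reconstruction of the slide's array for elements a..i (indices 0..8).
up = [-1, 0, -1, 0, 1, 2, -1, -1, 7]
names = "abcdefghi"

def find(up, x):
    """Follow parent pointers until a root is reached: O(height)."""
    while up[x] != -1:
        x = up[x]
    return x

def union(up, root1, root2):
    """Hang root2 under root1: O(1) when both arguments are already roots."""
    up[root2] = root1

print(names[find(up, 4)])  # a   (e -> b -> a)
union(up, 0, 2)            # union(a, c)
print(names[find(up, 5)])  # a   (f -> c -> a)
```

Note that `union` here expects two roots; called on arbitrary elements it would first need a `find` on each.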

WEIGHTED UNION
 Always makes the root of the larger tree the new root
 Often cuts down on height of the new up-tree
[Figure: the same union performed two ways: an unweighted union that yields a taller up-tree ("Could we do a better job on this union?") and the weighted union of the same trees ("Weighted union!")]
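A minimal sketch of weighted union on the array representation. It assumes one common convention that is not stated on the slide: a root stores the negated size of its tree, so no separate size array is needed. The function names are illustrative.

```python
def make_forest(n):
    # Every element starts as a root of a tree of size 1 (stored as -1).
    return [-1] * n

def find(up, x):
    while up[x] >= 0:            # non-negative entries are parent indices
        x = up[x]
    return x

def weighted_union(up, x, y):
    """Merge the sets containing x and y; the larger tree's root stays the root."""
    rx, ry = find(up, x), find(up, y)
    if rx == ry:
        return rx
    if up[rx] > up[ry]:          # sizes are negated, so rx's tree is the smaller one
        rx, ry = ry, rx
    up[rx] += up[ry]             # new (negated) size of the combined tree
    up[ry] = rx                  # hang the smaller root under the larger root
    return rx

up = make_forest(9)
for x, y in [(0, 1), (0, 3), (1, 4), (2, 5)]:
    weighted_union(up, x, y)
root = weighted_union(up, 2, 0)  # a 2-node tree joins a 4-node tree
print(root, -up[root])           # 0 6
```

Even though element 2 is passed first in the last call, the root of the 4-node tree (element 0) remains the root.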

WEIGHTED UNION FIND ANALYSIS
 Finds with weighted union are O(max up-tree height)
 An up-tree of height h with weighted union must have at least $2^h$ nodes (why?)
 ⇒ $2^{\text{max height}} \le n$ and max height $\le \log n$
 So, find takes O(log n)

 Base case: h = 0, tree has $2^0 = 1$ node
 Induction hypothesis: assume true for all h < h' and consider the sequence of unions.
 Case 1: Union does not increase max height. Resulting tree still has $\ge 2^h$ nodes.
 Case 2: Union has height h' = 1 + h, where h = height of each of the input trees. By the induction hypothesis each tree has $\ge 2^{h'-1}$ nodes, so the merged tree has at least $2 \cdot 2^{h'-1} = 2^{h'}$ nodes. QED.
WEIGHTED UNION FIND ANALYSIS (CONT.)
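The bound can also be checked empirically. The sketch below is an illustration, not part of the proof: it performs random weighted unions (by size, without path compression), tracks the height of every root, and asserts after each union that a tree of height h has at least 2^h nodes. The function name and the choice of n = 1024 are arbitrary.

```python
import random

def check_height_bound(n, seed=0):
    """Run random weighted unions and return the largest tree height seen at the end."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n
    height = [0] * n                     # meaningful only for roots

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for _ in range(2 * n):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a == b:
            continue
        if size[a] < size[b]:
            a, b = b, a                  # a is now the root of the larger tree
        parent[b] = a
        size[a] += size[b]
        height[a] = max(height[a], height[b] + 1)
        assert size[a] >= 2 ** height[a]  # the claim proved by induction
    return max(height[find(x)] for x in range(n))

max_height = check_height_bound(1024)
print(max_height <= 10)  # True, since log2(1024) = 10
```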

ALTERNATIVES TO WEIGHTED UNION
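 Union by height: Use the height of the tree instead of its size (hang the shorter tree under the taller one)
 Union by rank: Same thing as the height, except that ranks are not recomputed when path compression shortens a tree, so the rank is only an upper bound on the height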

ROOM FOR IMPROVEMENT:
PATH COMPRESSION
[Figure: an up-tree before path compression]
While we're finding e, could we do anything else?
Path compression!
 Points everything along the path of a find to the root
 Reduces the height of the entire access path to 1
[Figure: the same up-tree after path compression]

PATH COMPRESSION EXAMPLE
[Figure: an up-tree before and after find(e) with path compression]
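Path compression can be sketched as a two-pass find on the array representation: the first pass locates the root, the second repoints every node on the access path at it. The chain used below is a hypothetical example, not the tree from the figure.

```python
def find(up, x):
    """Find the root of x and compress the path; up[i] == -1 marks a root."""
    root = x
    while up[root] != -1:        # first pass: locate the root
        root = up[root]
    while x != root:             # second pass: repoint the whole access path
        parent = up[x]
        up[x] = root
        x = parent
    return root

# Hypothetical chain a <- b <- c <- d <- e (indices 0..4).
up = [-1, 0, 1, 2, 3]
print(find(up, 4))  # 0
print(up)           # [-1, 0, 0, 0, 0]
```

After the call, every node that was on the path from e to the root is a direct child of the root, so later finds on those nodes take a single step.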

 Let $\log^{(k)} n = \log(\log(\log \ldots (\log n)))$, with k logs (base 2)
 Then, let $\log^* n$ = minimum k such that $\log^{(k)} n \le 1$
 How fast does $\log^* n$ grow?
 $\log^*(2) = 1$
 $\log^*(4) = 2$
 $\log^*(16) = 3$
 $\log^*(65536) = 4$
 $\log^*(2^{65536}) = 5$ (a 20,000 digit number!)
 $\log^*(2^{2^{65536}}) = 6$
DIGRESSION: INVERSE ACKERMANN’S
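The definition translates directly into a loop. This is a small sketch assuming base-2 logarithms, which is what the values listed above imply; the function name `log_star` is illustrative.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    k = 0
    while n > 1:
        n = math.log2(n)
        k += 1
    return k

print([log_star(v) for v in (2, 4, 16, 65536)])  # [1, 2, 3, 4]
print(log_star(2 ** 65536))                      # 5
```

The last value listed, log* = 6, belongs to a number far too large to construct in memory, which is the point of the slide: for any input that can actually be stored, log* n is at most 5.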

 Tarjan (1984) proved that m weighted union and find operations with path compression on a set of n elements have worst case complexity O(m log*(n))
 Later results showed that the time complexity is actually O(m α(m, n)), where α is the inverse Ackermann function.
 For all practical purposes this is amortized constant time per operation
COMPLEX COMPLEXITY OF
WEIGHTED UNION + PATH COMPRESSION
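Putting the two ideas together, a compact sketch of weighted union (by size) combined with path compression might look like the following. The class and method names are hypothetical, and roots are marked by pointing at themselves rather than by -1, which is an equivalent convention.

```python
class DisjointSets:
    """Union by size plus path compression over the elements 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))     # a root is its own parent
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while x != root:                 # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                 # already in the same set
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx              # rx is the root of the larger tree
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return True

ds = DisjointSets(6)
ds.union(0, 1)
ds.union(2, 3)
ds.union(1, 3)
print(ds.find(0) == ds.find(2))  # True
print(ds.find(4) == ds.find(0))  # False
```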

 How can we handle the set membership questions if
we are in the context of big data?
 Can we hold 10 billion sets and items in main
memory?
 Is there an alternative to doing a file/database
search?
HOW DO WE HANDLE “BIG DATA”?

 Does x belong to a set S?
 Bloom Filter suggests:
 Yes (possibly: with probability p, x is still not in S)
 No (x is definitely not in S)
 Take as inputs the probability p, the Bloom filter execution time x, and the database search time y
 Derive the scenario (using variables p, x and y) where using a Bloom Filter makes sense
SLIGHTLY DIFFERENT CONSTRUCT

 Suppose we have a set
 S = {s1, s2, ..., sm} ⊆ universe U
 Represent S in such a way that we can quickly answer "Is x an element of S?"
 To take as little space as possible, we allow false positives (i.e. x ∉ S, but we answer yes)
 If x ∈ S, we must answer yes. (That is, there are no false negatives)
APPROXIMATE SET MEMBERSHIP
PROBLEM

BLOOM FILTERS
 Consist of an array A[n] of n bits (space), and k independent random hash functions
h1, …, hk : U → {0, 1, …, n-1}
1. Initially set every bit of the array to 0
2. ∀ s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
(an entry can be set to 1 multiple times; only the first time has an effect)
3. To check if x ∈ S, we check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1
If not, clearly x ∉ S.
If all A[hi(x)] are set to 1, we report x ∈ S (possibly a false positive)
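The three steps above can be sketched as follows. One assumption is made loudly here: the k "independent random" hash functions are simulated by salting SHA-256 with the index i, which is a convenient stand-in and not something the slides prescribe. The class name `BloomFilter` and the sizes used are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)   # step 1: all bits start at 0

    def _positions(self, item):
        # Assumption: h_i(item) is simulated by hashing the index i together with the item.
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for pos in self._positions(item):          # step 2: set A[h_i(s)] = 1
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Step 3: report membership only if all k locations are 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(n_bits=1000, k=5)
for word in ("apple", "banana"):
    bf.add(word)
print("apple" in bf)   # True: there are no false negatives
print("cherry" in bf)  # usually False, but a false positive is possible
```

Bits are only ever set, never cleared, which is why an inserted element can never be reported absent.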

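[Figure: a bit array, initially all 0; elements x1 and x2 are each hashed to k locations, and those bits are set to 1]
Each element of S is hashed k times
Each hash location set to 1
[Figure: the k hash locations of y in the bit array]
To check if y is in S, check the k hash locations. If a 0 appears, y is not in S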

[Figure: the k hash locations of y in the bit array, all set to 1]
If only 1s appear, report that y is in S
This may yield a false positive

 We assume the hash functions are random.
 After all m elements of S are hashed into the bloom filter array of n bits using all k hash functions, the probability that a specific bit is still 0 is
$$p = \left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}$$
 Probability of a false positive is given by
$$f = (1 - p)^k \approx \left(1 - e^{-km/n}\right)^k$$
BLOOM FILTERS: PROBABILITY OF A FALSE POSITIVE

 Using the desired bound on the probability of a false positive and the expected number of items (m), you can design a bloom filter by choosing values for n and k
 For example, if m = 1,000,000,000, n = 10,000,000,000 and k = 10, the false positive probability is approximately 0.01
 Uses 10 billion bits (approx 1.25 GB of RAM)
DESIGNING A BLOOM FILTER
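The numbers in the example can be reproduced with a short calculation, assuming the standard approximation f ≈ (1 - e^(-km/n))^k for the false positive probability. The function name is illustrative.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive probability for m items, n bits, k hash functions."""
    return (1.0 - math.exp(-k * m / n)) ** k

m, n, k = 1_000_000_000, 10_000_000_000, 10
f = false_positive_rate(m, n, k)
print(round(f, 4))   # 0.0102
print(n / 8 / 1e9)   # 1.25 (gigabytes for the bit array)
```

With these values km/n = 1, so f = (1 - 1/e)^10, which is where the figure of roughly 1% comes from.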

CS 6213
Basics: Record / Struct; Arrays / Linked Lists / Stacks / Queues; Graphs / Trees / BSTs
Advanced: Trie, B-Tree; Splay Trees; R-Trees; Heaps and PQs; Union Find, Sets
WHERE WE ARE (PHEW, AT THE END, FINALLY)
