Set operations such as make-set, union, find, and contains are standard operations that appear in many scenarios. Union-Find is an elegant data structure for problems that involve union and find operations.
A different use case arises when we only want to answer queries about whether a set contains an element x, without keeping the entire set in memory. Bloom filters play an interesting role there.
2. Instructor
Prof. Amrinder Arora
amrinder@gwu.edu
Please copy TA on emails
Please feel free to call as well
TA
Iswarya Parupudi
iswarya2291@gwmail.gwu.edu
LOGISTICS
Set Data Structures CS213 - Advanced Data Structures - Arora - GWU 2
3. CS 6213
Basics
Record / Struct
Arrays / Linked Lists / Stacks / Queues
Graphs / Trees / BSTs
Advanced
Trie, B-Tree
Splay Trees
R-Trees
Heaps and PQs
Union Find, Sets
WHERE WE ARE
4. Robert Tarjan – Princeton University
Mikko Malinen – University of Eastern Finland
Pasi Fränti – University of Eastern Finland
Henry Kautz – University of Washington
Michael Mitzenmacher – Harvard University
Eli Upfal – Brown University
CREDITS
5. Kruskal's Algorithm for Minimum Spanning Trees
Identifying islands in Social Networks
Maze Design
Logical/Physical Network Design
Identifying equivalence classes
APPLICATIONS
6. 1. Connected
2. Just one path between any two rooms
3. Random
WHAT’S A GOOD MAZE?
7. THE MAZE CONSTRUCTION PROBLEM
Given:
collection of rooms: V
connections between rooms (initially all closed): E
Construct a maze:
collection of rooms: $V' = V$
designated rooms in, $i \in V$, and out, $o \in V$
collection of connections to knock down: $E' \subseteq E$
such that one unique path connects every two rooms
8. While edges remain in E
Remove a random edge e = (u, v) from E
How can we do this efficiently?
If u and v have not yet been connected
add e to $E'$ (the set of connections to knock down)
mark u and v as connected
How to check connectedness efficiently?
MAZE CONSTRUCTION ALGORITHM
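The loop above can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function name `build_maze`, the grid layout, and the plain (unweighted) up-tree used for the connectedness check are all assumptions made for the example. Shuffling the edge list once plays the role of repeatedly removing a random edge.

```python
import random

def build_maze(rooms, walls, seed=None):
    """Return the walls to knock down so that exactly one path joins
    every pair of rooms. `rooms` is V, `walls` is E (pairs of rooms)."""
    rng = random.Random(seed)
    parent = {r: r for r in rooms}       # every room starts in its own set

    def find(x):
        # Walk up parent pointers until we reach the representative.
        while parent[x] != x:
            x = parent[x]
        return x

    walls = list(walls)
    rng.shuffle(walls)                   # one shuffle = random removal order
    knocked_down = []
    for u, v in walls:
        ru, rv = find(u), find(v)
        if ru != rv:                     # u and v are not yet connected
            knocked_down.append((u, v))  # add e to the knocked-down set
            parent[ru] = rv              # mark u and v as connected
    return knocked_down

# Hypothetical 3x3 grid of rooms with a wall between each adjacent pair.
rooms = [(r, c) for r in range(3) for c in range(3)]
walls = [((r, c), (r, c + 1)) for r in range(3) for c in range(2)]
walls += [((r, c), (r + 1, c)) for r in range(2) for c in range(3)]

maze = build_maze(rooms, walls, seed=1)
print(len(maze))  # 8: a spanning tree on 9 rooms has 8 edges
```

The connectedness check is exactly a pair of find operations, and marking two rooms as connected is a union, which is what motivates the disjoint set problem.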
9. Operations to support
Make Set(x): Make a new set with a single element x
Union (S1, S2): Merge the sets S1 and S2
Find(x): Find the set containing the element x
DISJOINT SET PROBLEM
10. If using linked lists
Find can take O(n) time
Union can be done in O(1) time
Makeset can be done in O(1) time
If using a hash table
Find can be done in O(1) time
Union takes O(n) time
Makeset can be done in O(1) time
Any other “trivial” ideas?
WHY NOT USE LISTS OR HASH TABLES?
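To make the hash table trade-off concrete, here is a small Python sketch in which a dictionary maps each element to a label for its set. The class name `QuickFind` and its method names are hypothetical. Find is a single lookup, but union has to relabel every element of one set, which is the O(n) cost mentioned above.

```python
class QuickFind:
    """Disjoint sets stored as a hash table: element -> set label."""

    def __init__(self):
        self.set_id = {}

    def make_set(self, x):
        self.set_id[x] = x               # O(1): x labels its own set

    def find(self, x):
        return self.set_id[x]            # O(1): one hash lookup

    def union(self, x, y):
        old, new = self.set_id[x], self.set_id[y]
        if old == new:
            return
        # O(n): scan all elements and relabel those in x's set.
        for element, label in self.set_id.items():
            if label == old:
                self.set_id[element] = new

qf = QuickFind()
for ch in "abc":
    qf.make_set(ch)
qf.union("a", "b")
print(qf.find("a") == qf.find("b"))  # True
print(qf.find("a") == qf.find("c"))  # False
```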
11. UP-TREE UNION-FIND
DATA STRUCTURE
Each subset is an up-tree
with its root as its
representative member
All members of a given
set are nodes in that set’s
up-tree
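[Figure: a forest of four up-trees with roots a, c, g, h. d and b point up to a; e points up to b; f points up to c; i points up to h.]
Up-trees are not necessarily binary!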
12. FIND
[Figure: the forest of up-trees with roots a, c, g, h]
find(f)
find(e)
Just traverse to the root!
Runtime: O(height)
13. UNION
[Figure: the forest of up-trees with roots a, c, g, h]
union(a,c)
Just hang one root from the other!
Runtime: O(1), given the two roots
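14. [Figure: the forest of up-trees with roots a, c, g, h]
index:     0 (a)  1 (b)  2 (c)  3 (d)  4 (e)  5 (f)  6 (g)  7 (h)  8 (i)
up-index:   -1      0     -1      0      1      2     -1     -1      7
A forest of up-trees can easily be stored in an array: each entry holds the index of that node's parent, and -1 marks a root.
NIFTY STORAGE TRICK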
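A short Python sketch of find and union on this array representation. It assumes the up-index values -1, 0, -1, 0, 1, 2, -1, -1, 7 for elements a through i, with -1 marking a root; the names `NAMES`, `up`, `find`, and `union` are chosen for the example only.

```python
NAMES = "abcdefghi"
# up[i] is the index of node i's parent; -1 means node i is a root.
up = [-1, 0, -1, 0, 1, 2, -1, -1, 7]

def find(up, x):
    """Follow up-index entries until a root is reached: O(height)."""
    while up[x] != -1:
        x = up[x]
    return x

def union(up, x, y):
    """Hang the root of y's tree from the root of x's tree."""
    rx, ry = find(up, x), find(up, y)
    if rx != ry:
        up[ry] = rx

idx = NAMES.index
print(NAMES[find(up, idx("e"))])  # a  (e -> b -> a)
print(NAMES[find(up, idx("f"))])  # c  (f -> c)
union(up, idx("a"), idx("c"))
print(NAMES[find(up, idx("f"))])  # a  (f -> c -> a)
```

Once the two roots are known, the union itself is a single array write.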
15. WEIGHTED UNION
Always makes the root of the larger tree the new root
Often cuts down on the height of the new up-tree
[Figure: the result of an unweighted union of two up-trees]
Could we do a better job on this union?
Weighted union!
[Figure: the result of the same union done as a weighted union]
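A minimal Python sketch of weighted union on an array of parent indices, assuming -1 marks a root. The separate `size` list, which tracks the number of nodes under each root, and the function names are assumptions made for this example.

```python
def make_forest(n):
    """n singleton sets: every node is a root of a tree of size 1."""
    return [-1] * n, [1] * n

def find(up, x):
    while up[x] != -1:
        x = up[x]
    return x

def weighted_union(up, size, x, y):
    """Merge the trees holding x and y; the larger tree's root wins."""
    rx, ry = find(up, x), find(up, y)
    if rx == ry:
        return rx
    if size[rx] < size[ry]:
        rx, ry = ry, rx              # make rx the root of the larger tree
    up[ry] = rx                      # hang the smaller tree under it
    size[rx] += size[ry]
    return rx

up, size = make_forest(4)
weighted_union(up, size, 0, 1)       # {0, 1} with root 0
weighted_union(up, size, 2, 0)       # singleton 2 joins the larger tree
print(find(up, 2), size[0])          # 0 3
```

Note that the argument order does not decide which root survives; only the sizes do.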
16. WEIGHTED UNION FIND ANALYSIS
Finds with weighted union are O(max up-tree height)
An up-tree of height h built with weighted union must have at least $2^h$ nodes (why?)
So $2^{\text{max height}} \le n$ and max height $\le \log n$
So, find takes O(log n)
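The bound can be checked empirically. The sketch below performs random weighted unions on n singleton sets until one tree remains, then measures the deepest node. The function name, the choice n = 1024, and the random seed are arbitrary assumptions for the experiment; it illustrates the bound rather than proving it.

```python
import math
import random

def max_height_after_random_unions(n, seed=0):
    rng = random.Random(seed)
    up, size = [-1] * n, [1] * n     # -1 marks a root

    def find(x):
        while up[x] != -1:
            x = up[x]
        return x

    roots = n
    while roots > 1:
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a == b:
            continue                 # already in the same set
        if size[a] < size[b]:
            a, b = b, a              # a is the root of the larger tree
        up[b] = a
        size[a] += size[b]
        roots -= 1

    def depth(x):
        d = 0
        while up[x] != -1:
            x = up[x]
            d += 1
        return d

    return max(depth(x) for x in range(n))

n = 1024
h = max_height_after_random_unions(n)
print(h, "<=", math.log2(n))
```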
17. Base case: h = 0, tree has $2^0 = 1$ node
Induction hypothesis: assume true for all $h < h'$ and consider the sequence of unions.
Case 1: Union does not increase max height. Resulting tree still has at least $2^h$ nodes.
Case 2: Union has height $h' = 1 + h$, where h = height of each of the input trees. By induction hypothesis each tree has at least $2^{h'-1}$ nodes, so the merged tree has at least $2^{h'}$ nodes. QED.
WEIGHTED UNION FIND ANALYSIS (CONT.)
18. ALTERNATIVES TO WEIGHTED UNION
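Union by height: Compare tree heights instead of node counts, and hang the shorter tree under the taller one
Union by rank: Rank starts out equal to the height, but it is not updated when path compression shortens a tree, so it is only an upper bound on the height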
19. ROOM FOR IMPROVEMENT:
PATH COMPRESSION
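[Figure: an up-tree in which e sits several levels below the root]
While we’re finding e, could we do anything else?
Path compression!
Points everything along the path of a find to the root
Reduces the height of the entire access path to 1
[Figure: the same up-tree after the find, with the nodes on the path from e pointing directly at the root]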
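A minimal two-pass sketch of find with path compression in Python, assuming the forest is stored as a list of parent indices with -1 marking a root. The function name `find_compress` and the five-node chain are made up for the example.

```python
def find_compress(up, x):
    # Pass 1: locate the root.
    root = x
    while up[root] != -1:
        root = up[root]
    # Pass 2: point every node on the access path directly at the root.
    while up[x] != -1:
        nxt = up[x]
        up[x] = root
        x = nxt
    return root

# A chain 4 -> 3 -> 2 -> 1 -> 0, where 0 is the root.
up = [-1, 0, 1, 2, 3]
print(find_compress(up, 4))  # 0
print(up)                    # [-1, 0, 0, 0, 0]
```

After one find on the deepest node, every node on the path is a direct child of the root, so later finds on any of them take a single step.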
20. PATH COMPRESSION EXAMPLE
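[Figure: an up-tree before find(e)]
find(e)
[Figure: the same up-tree after find(e), with the nodes on the path from e reattached directly to the root]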
21. Let $\log^{(k)} n = \log(\log(\log \cdots (\log n)))$, with k logs
Then, let $\log^* n$ = minimum k such that $\log^{(k)} n \le 1$
How fast does $\log^* n$ grow?
$\log^*(2) = 1$
$\log^*(4) = 2$
$\log^*(16) = 3$
$\log^*(65536) = 4$
$\log^*(2^{65536}) = 5$ (a 20,000 digit number!)
$\log^*(2^{2^{65536}}) = 6$
DIGRESSION: INVERSE ACKERMANN’S
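The definition translates directly into a few lines of Python. This sketch assumes base-2 logarithms, which is what the listed values imply; the function name `log_star` is chosen for the example.

```python
import math

def log_star(n):
    """Number of times log2 must be applied before the value is <= 1."""
    k = 0
    while n > 1:
        n = math.log2(n)
        k += 1
    return k

for n in (2, 4, 16, 65536, 2 ** 65536):
    print(log_star(n))
# prints 1, 2, 3, 4, 5 (one per line)
```

Python integers are arbitrary precision, so `2 ** 65536` can be passed in directly; the next value in the table, 2 to the power 2^65536, is far too large to construct.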
22. Tarjan (1984) proved that m weighted union and find operations with path compression on a set of n elements have worst case complexity $O(m \log^* n)$
Later results showed that the time complexity is actually $O(m\,\alpha(m, n))$, where $\alpha$ is the inverse Ackermann function.
For all practical purposes this is amortized constant time per operation
COMPLEX COMPLEXITY OF
WEIGHTED UNION + PATH COMPRESSION
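Putting the two ideas together, a compact Python sketch of a disjoint set structure with weighted union (by size) and path compression might look like this. The class name `DisjointSet` and its method names are assumptions for the example, and dictionaries are used instead of an array so that elements can be any hashable value.

```python
class DisjointSet:
    def __init__(self):
        self.parent = {}   # element -> parent (a root is its own parent)
        self.size = {}     # root -> number of elements in its tree

    def make_set(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: repoint every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx            # rx is the root of the larger tree
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return rx

ds = DisjointSet()
for v in range(6):
    ds.make_set(v)
ds.union(0, 1)
ds.union(2, 3)
ds.union(1, 3)
print(ds.find(0) == ds.find(2))  # True
print(ds.find(4) == ds.find(5))  # False
```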
23. How can we handle the set membership questions if
we are in the context of big data?
Can we hold 10 billion sets and items in main
memory?
Is there an alternative to doing a file/database
search?
HOW DO WE HANDLE “BIG DATA”?
24. Does x belong to a set S?
Bloom Filter suggests:
Does x belong to a set S?
Yes (possibly; with probability p, x may still not be there)
No (definitely not)
Assume the false-positive probability is p, the Bloom filter execution time is x, and the database search time is y
Derive the scenario (using variables p, x and y) in which using a Bloom Filter makes sense
SLIGHTLY DIFFERENT CONSTRUCT
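One possible way to set up the derivation is sketched below. It rests on assumptions that are not in the slide: p is read as the probability that the filter answers yes for an element that is not in S, every yes is followed by a database search, and r is a hypothetical extra variable for the fraction of queries whose element really is in S.

```latex
\begin{align*}
\text{cost without the filter} &= y \\
\text{expected cost with the filter} &= x + \bigl(r + (1 - r)\,p\bigr)\,y \\
x + \bigl(r + (1 - r)\,p\bigr)\,y < y
  &\iff x < (1 - r)(1 - p)\,y
\end{align*}
```

In the special case r = 0, where every query is for an absent element, the condition reduces to x < (1 - p) y: the filter pays off when it is cheap relative to the database search and its false-positive probability is small.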
25. Suppose we have a set
$S = \{s_1, s_2, \ldots, s_m\} \subseteq$ universe $U$
Represent S in such a way that we can quickly answer “Is x an element of S?”
To take as little space as possible, we allow false positives (i.e. $x \notin S$, but we answer yes)
If $x \in S$, we must answer yes. (That is, there are no false negatives)
APPROXIMATE SET MEMBERSHIP
PROBLEM
26. BLOOM FILTERS
Consists of an array A[n] of n bits (space), and k independent random hash functions
$h_1, \ldots, h_k : U \to \{0, 1, \ldots, n-1\}$
1. Initially set all bits of the array to 0
2. For each $s \in S$, set $A[h_i(s)] = 1$ for $1 \le i \le k$
(an entry can be set to 1 multiple times; only the first time has an effect)
3. To check if $x \in S$, we check whether all locations $A[h_i(x)]$ for $1 \le i \le k$ are set to 1
If not, clearly $x \notin S$.
If all $A[h_i(x)]$ are set to 1, we report $x \in S$ (possibly a false positive)
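The three steps can be sketched in Python as follows. The slides assume k independent random hash functions; this example approximates them by hashing the item together with an index i using SHA-256, which is an assumption of the sketch and not part of the original construction. The class name `BloomFilter` and its methods are likewise hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)   # step 1: all bits 0

    def _positions(self, item):
        # Stand-in for h_1..h_k: salt the item with the index i.
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        # Step 2: set A[h_i(s)] = 1 for every i.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Step 3: report membership only if all k bits are 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(n_bits=1024, k=3)
for word in ("union", "find", "bloom"):
    bf.add(word)
print("find" in bf)    # True: inserted items are never reported missing
print("splay" in bf)   # almost certainly False, but a false positive is possible
```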
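27. [Figure: a bit array, initially all 0, in which the k hash locations of elements x1 and x2 have been set to 1]
Each element of S is hashed k times
Each hash location set to 1
[Figure: the same bit array with the k hash locations of a query element y, one of which holds a 0]
To check if y is in S, check the k hash locations. If a 0 appears, y is not in S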
28. [Figure: the bit array with all k hash locations of a query element y holding 1]
If only 1s appear, report that y is in S
This may yield a false positive
29. We assume the hash functions are random.
After all m elements of S are hashed into the Bloom filter array of n bits using all k hash functions, the probability that a specific bit is still 0 is
$$p = \left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}$$
The probability of a false positive is given by
$$f = (1 - p)^k \approx \left(1 - e^{-km/n}\right)^k$$
BLOOM FILTERS: PROBABILITY OF A FALSE POSITIVE
30. Using the desired bound on the probability of a false positive and the expected number of items (m), you can design a Bloom filter by choosing values for n and k
For example, if m = 1,000,000,000, n = 10,000,000,000 and k = 10, the false positive probability becomes about 0.01
This uses 10 billion bits (approx. 1.2 GB of RAM)
DESIGNING A BLOOM FILTER
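The example numbers can be reproduced with a few lines of Python using the approximation f = (1 - e^(-km/n))^k. The helper `optimal_k` applies the standard result that k = (n/m) ln 2 minimizes f for fixed m and n; that formula is not stated in the slides and is included here only as an aside. Both function names are made up for the sketch.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive probability for m items, n bits, k hashes."""
    return (1 - math.exp(-k * m / n)) ** k

def optimal_k(m, n):
    """Standard result (not from the slides): k = (n / m) * ln 2, rounded."""
    return max(1, round(n / m * math.log(2)))

m, n, k = 10**9, 10**10, 10
print(false_positive_rate(m, n, k))   # roughly 0.01, as on the slide
print(n / 8 / 1e9, "GB")              # 1.25 GB for 10 billion bits
print(optimal_k(m, n))                # 7
```

With 10 bits per item, using 7 hash functions instead of 10 would give a slightly lower false positive rate under this approximation while also doing less hashing per query.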
31. CS 6213
Basics
Record / Struct
Arrays / Linked Lists / Stacks / Queues
Graphs / Trees / BSTs
Advanced
Trie, B-Tree
Splay Trees
R-Trees
Heaps and PQs
Union Find, Sets
WHERE WE ARE (PHEW, AT THE END, FINALLY)