The document discusses techniques for creating small summaries of big data in order to improve computational scalability. It introduces sketch structures as a class of linear summaries that can be merged and updated efficiently. Specific sketch structures discussed include Bloom filters, Count-Min sketches, and Count sketches. It also covers counter-based summaries like the heavy hitters algorithm for finding frequent items in a data stream. The document outlines the structures, analysis, and applications of these various techniques for creating concise summaries of large datasets.
2. The case for “Big Data” in one slide
“Big” data arises in many forms:
– Medical data: genetic sequences, time series
– Activity data: GPS location, social network activity
– Business data: customer behavior tracking at fine detail
– Physical Measurements: from science (physics, astronomy)
The 3 V’s:
– Volume (the focus of this talk)
– Velocity
– Variety
3. Computational scalability
The first (prevailing) approach: scale up the computation
Many great technical ideas:
– Use many cheap commodity devices
– Accept and tolerate failure
– Move code to data
– MapReduce: BSP for programmers
– Break problem into many small pieces
– Decide which constraints to drop: noSQL, ACID, CAP
Scaling up comes with its disadvantages:
– Expensive (hardware, equipment, energy), still not always fast
This talk is not about this approach!
4. Downsizing data
A second approach to computational scalability:
scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Often randomized (small constant probability of error)
– Examples: samples, sketches, histograms, wavelet transforms
Complementary to the first approach: not a case of either-or
Some drawbacks:
– Not a general purpose approach: need to fit the problem
– Some computations don’t allow any useful summary
5. Outline for the talk
Some of the most important data summaries
– Sketches: Bloom filter, Count-Min, Count-Sketch (AMS)
– Counters: Heavy hitters, quantiles
– Sampling: simple samples, distinct sampling
– Summaries for more complex objects: graphs and matrices
Current trends and future challenges for data summarization
Many topics abbreviated or omitted (histograms, wavelets, ...)
A lot of work relevant to compact summaries
– In both the database and the algorithms literature
6. Summary Construction
There are several different models for summary construction
– Offline computation: e.g. sort data, take percentiles
– Streaming: summary merged with one new item each step
– Full mergeability: allow arbitrary merges of partial summaries
The most general and widely applicable category
Key methods for summaries:
– Create an empty summary
– Update with one new tuple (insert or delete): streaming processing
– Merge summaries together: distributed processing (e.g., MapReduce)
– Query: may tolerate some approximation (parameterized by 𝜀)
Several important cost metrics (as function of 𝜀, 𝑛):
– Size of summary, time cost of each operation
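To make these operations concrete, here is a minimal sketch (in Python) of the interface a mergeable summary is assumed to expose; the class and method names are illustrative, not taken from the talk.

```python
from abc import ABC, abstractmethod

class Summary(ABC):
    """Generic interface assumed for a mergeable data summary."""

    @abstractmethod
    def update(self, item):
        """Fold one new tuple into the summary (streaming processing)."""

    @abstractmethod
    def merge(self, other):
        """Combine with another summary of the same type (distributed processing)."""

    @abstractmethod
    def query(self, *args):
        """Answer the supported query, typically with error parameterized by eps."""
```

The concrete summaries in the following slides (Bloom filters, Count-Min, Count Sketch, Misra-Gries, samples) all fit this shape, differing in which queries they support and at what cost.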
8. Bloom Filters [Bloom 1970]
Bloom filters compactly encode set membership
– E.g. store a list of many long URLs compactly
– 𝑘 random hash functions map each item to 𝑘 positions in an 𝑚-bit vector
– Update: Set all 𝑘 entries to 1 to indicate item is present
– Query: Is 𝑥 in the set?
Return “yes” if all 𝑘 bits 𝑥 maps to are 1.
Duplicate insertions do not change Bloom filters
Can be merged by OR-ing vectors (of same size)
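A minimal Bloom filter along the lines described above; the salted blake2b hashing and the class name are illustrative choices, not prescribed by the slides. Merging assumes both filters were built with the same 𝑚, 𝑘, and hash functions.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash functions into an m-bit array."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1              # duplicate insertions change nothing

    def __contains__(self, item):
        # "yes" means possibly present (false positives); "no" is always correct
        return all(self.bits[p] for p in self._positions(item))

    def merge(self, other):
        """OR together two filters built with the same m, k, and hash functions."""
        for p in range(self.m):
            self.bits[p] |= other.bits[p]
```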
9. Bloom Filters Analysis
If item is in the set, then always return “yes”
– No false negative
If the item is not in the set, it may still return “yes” (a false positive).
– Prob. that a given bit is 0: (1 − 1/𝑚)^(𝑘𝑛)
– Prob. that all 𝑘 bits probed are 1: (1 − (1 − 1/𝑚)^(𝑘𝑛))^𝑘 ≈ exp(𝑘 ln(1 − 𝑒^(−𝑘𝑛/𝑚)))
– Setting 𝑘 = (𝑚/𝑛) ln 2 minimizes this probability, to 0.6185^(𝑚/𝑛)
For false positive prob. 𝛿, a Bloom filter needs 𝑂(𝑛 log(1/𝛿)) bits
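A small helper that applies the sizing rule above: choose 𝑚 and 𝑘 from 𝑛 and the target false-positive probability 𝛿. The function name is illustrative.

```python
import math

def bloom_parameters(n, delta):
    """Bits m and hash count k for n items and false-positive probability delta.

    Uses m = n * log2(1/delta) / ln 2  (about 1.44 * n * log2(1/delta) bits)
    and the optimal k = (m/n) * ln 2 = log2(1/delta).
    """
    m = math.ceil(n * math.log2(1 / delta) / math.log(2))
    k = max(1, round(math.log2(1 / delta)))
    return m, k

# Example: ~1% false positives over a million items needs roughly 1.2 MB and 7 hashes.
# print(bloom_parameters(10**6, 0.01))
```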
10. Bloom Filters Applications
Bloom Filters widely used in “big data” applications
– Many problems require storing a large set of items
Can generalize to allow deletions
– Swap bits for counters: increment on insert, decrement on delete
– If no duplicates, small counters suffice: 4 bits per counter
Bloom Filters are still an active research area
– Often appear in networking conferences
– Also known as “signature files” in DB
11. Invertible Bloom Lookup Table [GM11]
A summary that
– Supports insertions and deletions
– When the number of items is ≤ 𝑡, can retrieve all of them
– Also known as sparse recovery
Easy case when 𝑡 = 1
– Assume all items are integers
– Just keep 𝑧 = the bitwise-XOR of all items
To insert 𝑥: 𝑧 ← 𝑧 ⊕ 𝑥
To delete 𝑥: 𝑧 ← 𝑧 ⊕ 𝑥
– Assumption: no duplicates; can’t delete a non-existing item
A problem: How do we know when # items = 1?
– Just keep another counter
12. Invertible Bloom Lookup Table
Structure is the same as a Bloom filter, but each cell
– is the XOR of all items mapped there
– also keeps a counter of such items
How to recover?
– First check whether any cell holds exactly 1 item
– If so, recover that item
– Delete it from all of its cells, and repeat
13. Invertible Bloom Lookup Table
Can show that
– Using 1.5𝑡 cells, can recover ≤ 𝑡 items with high probability
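A simplified Invertible Bloom Lookup Table following the description above (XOR plus a counter per cell, then peel cells holding exactly one item). It assumes integer items, no duplicate insertions, and no deletion of absent items; the hashing and names are illustrative.

```python
import hashlib

class IBLT:
    """A simplified Invertible Bloom Lookup Table for integer items.

    With roughly 1.5*t cells, up to t outstanding items can be recovered whp.
    """
    def __init__(self, cells, k=3):
        self.k = k
        self.sub = max(1, cells // k)      # k sub-tables, one cell per item in each
        size = self.sub * k
        self.count = [0] * size            # number of items currently in each cell
        self.xor = [0] * size              # XOR of the items currently in each cell

    def _cells(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield i * self.sub + int.from_bytes(h, "big") % self.sub

    def _update(self, item, delta):
        for c in self._cells(item):
            self.count[c] += delta
            self.xor[c] ^= item

    def insert(self, item):
        self._update(item, +1)

    def delete(self, item):
        self._update(item, -1)

    def recover(self):
        """Peel: find a cell with exactly one item, read it off, remove it, repeat."""
        items = []
        progress = True
        while progress:
            progress = False
            for c in range(len(self.count)):
                if self.count[c] == 1:
                    item = self.xor[c]     # with one item in the cell, the XOR is that item
                    items.append(item)
                    self.delete(item)
                    progress = True
        return items
```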
14. Count-Min Sketch [CM 04]
Count Min sketch encodes item counts
– Allows estimation of frequencies (e.g. for selectivity estimation)
– Some similarities in appearance to Bloom filters
Model input data as a vector 𝑥 of dimension 𝑈
– E.g., [0. . 𝑈] is the space of all IP addresses, 𝑥[𝑖] is the frequency
of IP address 𝑖 in the data
– Create a small summary as an array of 𝑤 × 𝑑 in size
– Use 𝑑 hash functions to map vector entries to [1..𝑤]
[Figure: the sketch is a 𝑑 × 𝑤 array 𝐶𝑀[𝑖, 𝑗]]
15. Count-Min Sketch Structure
Update: each entry in vector 𝑥 is mapped to one bucket per row.
Merge two sketches by entry-wise summation
Query: estimate 𝑥[𝑖] by taking min_𝑘 𝐶𝑀[𝑘, ℎ_𝑘(𝑖)]
– Never underestimate
– Will bound the error of overestimation
[Figure: an update (𝑖, +𝑐) adds 𝑐 to cell 𝐶𝑀[𝑘, ℎ_𝑘(𝑖)] in each of the 𝑑 rows; each row has 𝑤 = 2/𝜀 cells]
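A minimal Count-Min sketch matching the structure above: one cell updated per row, merge by entry-wise summation, query by taking the minimum. The hashing details are illustrative.

```python
import hashlib

class CountMinSketch:
    """A minimal Count-Min sketch: d rows of w counters (w ~ 2/eps, d ~ log2(1/delta))."""
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, item):
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.w

    def update(self, item, c=1):
        """Add c to item's count; the sketch is linear, so c may also be negative."""
        for r in range(self.d):
            self.table[r][self._bucket(r, item)] += c

    def query(self, item):
        """Never underestimates; per-row overestimate is at most 2*||x||_1 / w whp."""
        return min(self.table[r][self._bucket(r, item)] for r in range(self.d))

    def merge(self, other):
        """Entry-wise sum of two sketches built with the same w, d, and hash functions."""
        for r in range(self.d):
            for b in range(self.w):
                self.table[r][b] += other.table[r][b]
```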
16. Count-Min Sketch Analysis
Focusing on the first row:
– 𝑥[𝑖] is always added to 𝐶𝑀[1, ℎ_1(𝑖)]
– 𝑥[𝑗], 𝑗 ≠ 𝑖, is added to 𝐶𝑀[1, ℎ_1(𝑖)] with prob. 1/𝑤
– The expected error from 𝑥[𝑗] is 𝑥[𝑗]/𝑤
– The total expected error is Σ_(𝑗≠𝑖) 𝑥[𝑗]/𝑤 ≤ ||𝑥||_1/𝑤
– By Markov’s inequality, Pr[error > 2||𝑥||_1/𝑤] < 1/2
By taking the minimum over the 𝑑 rows, this prob. drops to (1/2)^𝑑
To give error 𝜀||𝑥||_1 with prob. 1 − 𝛿, the sketch needs size (2/𝜀) × log(1/𝛿)
17. Generalization: Sketch Structures
A sketch is a summary that is a linear transform of the input
– Sketch(𝑥) = 𝑆𝑥 for some matrix 𝑆
– Hence, Sketch(𝑥 + 𝛽𝑦) = Sketch(𝑥) + 𝛽 Sketch(𝑦)
– Trivial to update and merge
Often describe 𝑆 in terms of hash functions
Analysis relies on properties of the hash functions
– Some analyses assume truly random hash functions (e.g., Bloom filters)
These don’t exist, but ad hoc hash functions work well for much real-world data
– Others need hash functions with limited independence.
18. Count Sketch / AMS Sketch [CCF-C02, AMS96]
Second hash function: 𝑔 𝑘 maps each 𝑖 to +1 or −1
Update: each entry in vector 𝑥 is mapped to one bucket per row.
Merge two sketches by entry-wise summation
Query: estimate 𝑥[𝑖] by taking median_𝑘 𝐶[𝑘, ℎ_𝑘(𝑖)] ⋅ 𝑔_𝑘(𝑖)
– May either overestimate or underestimate
[Figure: an update (𝑖, +𝑐) adds 𝑐 ⋅ 𝑔_𝑘(𝑖) to cell 𝐶[𝑘, ℎ_𝑘(𝑖)] in each of the 𝑑 rows of width 𝑤]
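A minimal Count Sketch as described above, adding the ±1 sign hash 𝑔_𝑘 per row and answering queries with a median. Hashing details and names are illustrative.

```python
import hashlib

class CountSketch:
    """A minimal Count Sketch: like Count-Min, plus a sign hash g_k per row."""
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, item):
        h = hashlib.blake2b(f"b{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.w

    def _sign(self, row, item):
        h = hashlib.blake2b(f"s{row}:{item}".encode(), digest_size=1).digest()
        return 1 if h[0] & 1 else -1

    def update(self, item, c=1):
        for r in range(self.d):
            self.table[r][self._bucket(r, item)] += self._sign(r, item) * c

    def query(self, item):
        """Unbiased per-row estimates C[r, h_r(item)] * g_r(item); return their median."""
        ests = sorted(self.table[r][self._bucket(r, item)] * self._sign(r, item)
                      for r in range(self.d))
        return ests[self.d // 2]
```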
19. Count Sketch Analysis
Focusing on the first row:
– 𝑥[𝑖]·𝑔_1(𝑖) is always added to 𝐶[1, ℎ_1(𝑖)], so the returned value 𝐶[1, ℎ_1(𝑖)]·𝑔_1(𝑖) includes 𝑥[𝑖] exactly
– 𝑥[𝑗]·𝑔_1(𝑗), 𝑗 ≠ 𝑖, is added to 𝐶[1, ℎ_1(𝑖)] with prob. 1/𝑤
– The expected error from 𝑥[𝑗] is 𝑥[𝑗]·E[𝑔_1(𝑖)𝑔_1(𝑗)]/𝑤 = 0
– E[total error] = 0 and E[(total error)²] ≤ ||𝑥||_2²/𝑤
– By Chebyshev’s inequality, Pr[|total error| > 2||𝑥||_2/√𝑤] < 1/4
By taking the median over the 𝑑 rows, this prob. drops to (1/2)^𝑂(𝑑)
To give error 𝜀||𝑥||_2 with prob. 1 − 𝛿, the sketch needs size 𝑂((1/𝜀²) log(1/𝛿))
20. Count-Min Sketch vs Count Sketch
Count-Min:
– Size: 𝑂(𝑤𝑑)
– Error: ||𝑥||_1/𝑤
Count Sketch:
– Size: 𝑂(𝑤𝑑)
– Error: ||𝑥||_2/√𝑤
– Other benefits: unbiased estimator; can also be used to estimate 𝑥 ⋅ 𝑦 (join size)
Count Sketch is better when ||𝑥||_2 < ||𝑥||_1/√𝑤
21. Application to Large Scale Machine Learning
In machine learning, often have very large feature space
– Many objects, each with huge, sparse feature vectors
– Slow and costly to work in the full feature space
“Hash kernels”: work with a sketch of the features
– Effective in practice! [Weinberger, Dasgupta, Langford, Smola, Attenberg ‘09]
Similar analysis explains why:
– Essentially, not too much noise on the important features
23. Heavy hitters [MG82]
Misra-Gries (MG) algorithm finds up to 𝑘 items that occur
more than 1/𝑘 fraction of the time in a stream
– Estimate their frequencies with additive error ≤ 𝑁/𝑘
– Equivalently, achieves 𝜀𝑁 error with 𝑂(1/𝜀) space
Better than Count-Min but doesn’t support deletions
Keep 𝑘 different candidates in hand. To add a new item:
– If item is monitored, increase its counter
– Else, if < 𝑘 items monitored, add new item with count 1
– Else, decrease all counts by 1
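A direct transcription of the Misra-Gries update rule into Python; the class name is illustrative.

```python
class MisraGries:
    """Misra-Gries heavy hitters: at most k counters, additive error <= N/(k+1)."""
    def __init__(self, k):
        self.k = k
        self.counters = {}                # item -> (under)estimated count

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1      # monitored: bump its counter
        elif len(self.counters) < self.k:
            self.counters[item] = 1       # room left: start monitoring it
        else:
            for key in list(self.counters):
                self.counters[key] -= 1   # no room: decrement every counter
                if self.counters[key] == 0:
                    del self.counters[key]

    def estimate(self, item):
        """Lower bound on the true count; the true count exceeds it by at most N/(k+1)."""
        return self.counters.get(item, 0)
```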
26. MG analysis
𝑁 = total input size
𝑀 = sum of counters in data structure
Error in any estimated count at most (𝑁 − 𝑀)/(𝑘 + 1)
– Estimated count a lower bound on true count
– Each decrement is spread over (𝑘 + 1) items: 1 new one and 𝑘 in the MG summary
– Equivalent to deleting (𝑘 + 1) distinct items from stream
– At most (𝑁 − 𝑀)/(𝑘 + 1) decrement operations
– Hence, can have “deleted” (𝑁 − 𝑀)/(𝑘 + 1) copies of any item
– So estimated counts have at most this much error
27. Merging two MG summaries [ACHPZY12]
Merging algorithm:
– Merge two sets of 𝑘 counters in the obvious way
– Take the (𝑘 + 1)-th largest counter, 𝐶_(𝑘+1), and subtract it from all counters
– Delete non-positive counters
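A sketch of this merge step applied to two MG counter sets, assuming both were built with the same 𝑘; with the MisraGries class above, it could be called as merge_mg(a.counters, b.counters, k).

```python
def merge_mg(c1, c2, k):
    """Merge two Misra-Gries counter dicts (each holding at most k counters).

    Following [ACHPZY12]: add counters, subtract the (k+1)-th largest value,
    and drop anything non-positive, leaving at most k counters.
    """
    merged = dict(c1)
    for item, cnt in c2.items():
        merged[item] = merged.get(item, 0) + cnt
    values = sorted(merged.values(), reverse=True)
    ck1 = values[k] if len(values) > k else 0   # the (k+1)-th largest counter
    return {item: cnt - ck1 for item, cnt in merged.items() if cnt - ck1 > 0}
```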
28. Merging two MG summaries
This algorithm gives mergeability:
– Merge subtracts at least (𝑘 + 1)·𝐶_(𝑘+1) from the counter sums
– So (𝑘 + 1)·𝐶_(𝑘+1) ≤ (𝑀_1 + 𝑀_2 − 𝑀_12)
– Sum of the remaining (at most 𝑘) counters is 𝑀_12
– By induction, the error is
((𝑁_1 − 𝑀_1) + (𝑁_2 − 𝑀_2) + (𝑀_1 + 𝑀_2 − 𝑀_12))/(𝑘 + 1)   [first two terms: prior error; last term: from the merge]
= ((𝑁_1 + 𝑁_2) − 𝑀_12)/(𝑘 + 1)
29. Quantiles (order statistics)
Exact quantiles: 𝐹^(−1)(𝜙) for 0 < 𝜙 < 1, where 𝐹 is the CDF
Approximate version: tolerate any answer between 𝐹^(−1)(𝜙 − 𝜀) and 𝐹^(−1)(𝜙 + 𝜀)
Dual problem - rank estimation: Given 𝑥, estimate 𝐹(𝑥)
– Can use binary search to find quantiles
30. Quantiles give an equi-height histogram
Automatically adapts to skewed data distributions
Equi-width histograms (fixed binning) are easy to construct but do not adapt to the data distribution
31. The GK Quantile Summary [GK01]
[Figure: GK summary animation: the first incoming item (value 3) becomes a node storing its value and rank; 𝑁 = 1]
42. The GK Quantile Summary
The practical version
– Space: open problem!
– Small in practice
A more complicated version achieves 𝑂((1/𝜀) log(𝜀𝑁)) space
Doesn’t support deletions
Doesn’t (don’t know how to) support merging
There is a randomized mergeable quantile summary
43. A Quantile Summary Supporting Deletions [WLYC13]
Assumption: all elements are integers from [0..U]
Use a dyadic decomposition
[Figure: dyadic decomposition of [0..U−1]; e.g., one node stores the number of elements between U/2 and U−1]
45. A Quantile Summary Supporting Deletions
Consider each layer as a frequency vector, and build a sketch
– Count Sketch is better than Count-Min due to unbiasedness
Space: 𝑂((1/𝜀) log^1.5 𝑈)
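A simplified illustration of the dyadic decomposition: each level here stores exact interval counts in a dictionary, whereas the summary of [WLYC13] would replace each level's frequency vector with a Count Sketch to get the stated space bound. Names are illustrative.

```python
import math

class DyadicRank:
    """Rank summary over integers [0..U-1] supporting insertions and deletions.

    Simplified: exact dyadic-interval counts per level; the real summary sketches
    each level's frequency vector (Count Sketch preferred for its unbiasedness).
    """
    def __init__(self, U):
        self.levels = int(math.ceil(math.log2(U)))   # level l: intervals of length 2^l
        self.counts = [dict() for _ in range(self.levels + 1)]

    def update(self, x, c=1):
        """Insert (c=+1) or delete (c=-1) item x."""
        for l in range(self.levels + 1):
            key = x >> l                              # dyadic interval containing x at level l
            self.counts[l][key] = self.counts[l].get(key, 0) + c

    def rank(self, x):
        """Number of items < x: decompose [0, x) into O(log U) dyadic intervals."""
        r, l = 0, 0
        while x > 0:
            if x & 1:                                 # interval [(x-1)*2^l, x*2^l) at level l
                r += self.counts[l].get(x - 1, 0)
            x >>= 1
            l += 1
        return r
```

As noted on the earlier quantiles slide, a quantile query can then be answered by binary search over rank().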
47. Random Sample as a Summary
Cost
– A random sample is cheaper to build if you have random access
to data
Error-size tradeoff
– Specialized summaries often have size 𝑂(1/𝜀)
– A random sample needs size 𝑂(1/𝜀²)
Generality
– Specialized summaries can only answer one type of queries
– Random sample can be used to answer different queries
In particular, ad hoc queries that are unknown beforehand
48. Sampling Definitions
Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
Coin-flip sampling
– Sample each element with probability 𝑝
– Sample size changes for dynamic data
The statistical difference is
very small, for 𝑁 ≫ 𝑠 ≫ 1
49. Reservoir Sampling
Given a sample of size 𝑠 drawn from a data set of size 𝑁
(without replacement)
A new element is added; how do we update the sample?
Algorithm
– 𝑁 ← 𝑁 + 1
– With probability 𝑠/𝑁, use it to replace an item in the
current sample chosen uniformly at random
– With probability 1 − 𝑠/𝑁, throw it away
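The reservoir-sampling update above, written out in Python; names are illustrative.

```python
import random

class Reservoir:
    """Reservoir sampling: a uniform without-replacement sample of size s from a stream."""
    def __init__(self, s, rng=None):
        self.s = s
        self.n = 0                        # items seen so far
        self.sample = []
        self.rng = rng or random.Random()

    def add(self, item):
        self.n += 1
        if len(self.sample) < self.s:
            self.sample.append(item)      # the first s items are always kept
        elif self.rng.random() < self.s / self.n:
            # with probability s/n, replace a uniformly random current sample item
            self.sample[self.rng.randrange(self.s)] = item
```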
50. Reservoir Sampling Correctness Proof
Many “proofs” found online are actually wrong
– They only show that each item is sampled with probability 𝑠/𝑁
– Need to show that every subset of size 𝑠 has the same
probability to be the sample
The correct proof relates to the Fisher-Yates shuffle
56. Random Sampling with Deletions
Naïve method:
– Maintain a sample of size 𝑀 > 𝑠
– Handle insertions as in reservoir sampling
– If the deleted item is not in the sample, ignore it
– If the deleted item is in the sample, delete it from the sample
– Problem: the sample size may drop below 𝑠
Some other algorithms in the DB literature but sample size
still drops when there are many deletions
57. Min-wise Hashing / Sampling
For each item 𝑥, compute ℎ(𝑥)
Return the 𝑠 items with the smallest values of ℎ(𝑥) as the
sample
– Assumption: ℎ is a truly random hash function
– A lot of research on how to remove this assumption
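A minimal bottom-𝑠 min-wise sample, assuming the hash behaves like a random function; the use of blake2b is an illustrative stand-in for such a hash.

```python
import hashlib
import heapq

def _h(x):
    """Stand-in for a (truly) random hash function."""
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def minwise_sample(items, s):
    """Return the s distinct items with the smallest hash values (a bottom-s sample)."""
    return heapq.nsmallest(s, set(items), key=_h)
```

Because duplicates map to the same hash value, this is a sample of the distinct items, which is what the later distinct-sampling slides rely on.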
58. L0-Sampling [FIS05, CMR05]
Suppose ℎ(𝑥) ∈ [0..𝑢 − 1] and is always written with log 𝑢 bits (pad with zeroes when necessary)
Map all items to level 0
Map items that have one leading zero in ℎ(𝑥) to level 1
Map items that have two leading zeroes in ℎ(𝑥) to level 2
…
Use a 2𝑠-sparse recovery summary for each level
[Figure: levels with sampling probability from 𝑝 = 1 down to 𝑝 = 1/𝑢; each level has a 2𝑠-sparse recovery summary]
59. Estimating Set Similarity with MinHash
Assumption: The sample from 𝐴 and the sample from 𝐵 are
obtained by the same ℎ
Pr[(an item sampled uniformly at random from 𝐴 ∪ 𝐵) ∈ 𝐴 ∩ 𝐵] = 𝐽(𝐴, 𝐵)
To estimate 𝐽(𝐴, 𝐵):
– Merge Sample(𝐴) and Sample(𝐵) to get Sample(𝐴 ∪ 𝐵) of size 𝑠
– Count how many items in Sample(𝐴 ∪ 𝐵) are in 𝐴 ∩ 𝐵
– Divide by 𝑠
𝐽(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵|
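A small, self-contained sketch of this estimator: take bottom-𝑠 samples of 𝐴 and 𝐵 with the same hash, derive the bottom-𝑠 sample of 𝐴 ∪ 𝐵 from them, and count the fraction lying in 𝐴 ∩ 𝐵. The hash choice and names are illustrative.

```python
import hashlib
import heapq

def _h(x):
    """Shared hash function; both sets must be sampled with the same h."""
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def bottom_s(S, s):
    """Bottom-s MinHash sample of a set: the s distinct items with smallest hash values."""
    return set(heapq.nsmallest(s, set(S), key=_h))

def jaccard_estimate(A, B, s):
    """Estimate J(A, B) = |A ∩ B| / |A ∪ B| from bottom-s samples of A and B."""
    sample_a, sample_b = bottom_s(A, s), bottom_s(B, s)
    # The bottom-s sample of A ∪ B is computable from the two samples alone
    union_sample = heapq.nsmallest(s, sample_a | sample_b, key=_h)
    # An item of the union sample lies in A ∩ B exactly when it appears in both samples
    hits = sum(1 for x in union_sample if x in sample_a and x in sample_b)
    return hits / len(union_sample) if union_sample else 0.0

# Example (true Jaccard is 1/3):
# print(jaccard_estimate(range(0, 800), range(400, 1200), 64))
```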
61. Geometric data: 𝜀-approximations
A sample that preserves the “density” of point sets
– For any range (e.g., a circle),
|fraction of sample points − fraction of all points| ≤ 𝜀
– Can simply draw a random sample of size 𝑂(1/𝜀²)
– More careful constructions yield size 𝑂(1/𝜀^(2𝑑/(𝑑+1)))
62. Geometric data: 𝜀-kernels
𝜀-kernels approximately preserve the convex hull of points
– An 𝜀-kernel has size O(1/𝜀^((𝑑−1)/2))
– A streaming 𝜀-kernel has size O(1/𝜀^((𝑑−1)/2) log(1/𝜀))
63. Graph Sketching [AGM12]
Goal: Build a sketch for each vertex and use it to answer queries
Connectivity: want to test if there is a path between any two nodes
– Trivial if edges can only be inserted; want to support deletions
Basic idea: repeatedly contract edges between components
– Use L0 sampling to get edges from vector of adjacencies
– The L0 sampling sketch supports deletions and merges
Problem: as components grow, sampling an edge from a component is most likely to produce an internal edge
64. Graph Sketching
Idea: use clever encoding of edges
Encode edge (i,j), with i < j, as ((i,j), +1) in node i’s vector and as ((i,j), −1) in node j’s vector
When node i and node j get merged, sum their L0 sketches
– Contribution of edge (i,j) exactly cancels out
– Only non-internal edges remain in the L0 sketches
Use independent sketches for each iteration of the algorithm
– Only need O(log n) rounds with high probability
Result: O(poly-log n) space per node for connectivity
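A toy check of the cancellation argument: build the signed edge vectors of two nodes and add them entry-wise. The L0 sketches of these vectors, being linear, would merge in exactly the same way; the helper names are illustrative.

```python
def edge_vector(node, edges):
    """Signed adjacency vector of `node`: edge (i, j), i < j, gets +1 at node i, -1 at node j."""
    vec = {}
    for (a, b) in edges:
        i, j = min(a, b), max(a, b)
        if node == i:
            vec[(i, j)] = vec.get((i, j), 0) + 1
        elif node == j:
            vec[(i, j)] = vec.get((i, j), 0) - 1
    return vec

def add_vectors(u, v):
    """Entry-wise sum of two signed edge vectors, dropping zero entries."""
    out = dict(u)
    for e, c in v.items():
        out[e] = out.get(e, 0) + c
    return {e: c for e, c in out.items() if c != 0}

# Merging nodes 1 and 2 cancels their shared edge (1, 2); only edges
# leaving the merged component remain.
edges = [(1, 2), (2, 3), (1, 4)]
print(add_vectors(edge_vector(1, edges), edge_vector(2, edges)))  # {(2, 3): 1, (1, 4): 1}
```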
65. Other Graph Results via sketching
Recent flurry of activity in summaries for graph problems
– K-connectivity via connectivity
– Bipartiteness via connectivity
– (Weight of the) minimum spanning tree
– Sparsification: find G’ with few edges so that cut(G,C) ≈ cut(G’,C)
– Matching: find a maximal matching (assuming it is small)
Total cost is typically O(|V|), rather than O(|E|)
– Semi-streaming / semi-external model
66. Matrix Sketching
Given matrices A, B, want to approximate matrix product AB
– Measure the normed error of the approximation C: ||AB − C||
Main results are for the Frobenius (entrywise) norm ||⋅||_F
– ||C||_F = (Σ_(i,j) C_(i,j)²)^(1/2)
– Results rely on sketches, so this entrywise norm is most natural
67. Direct Application of Sketches
Build an AMS sketch of each row A_i of A and each column B_j of B
Estimate C_(i,j) by estimating the inner product of A_i with B_j
– Absolute error of each estimate is 𝜀 ||A_i||_2 ||B_j||_2 (whp)
– Summing the squared error over all entries, ||AB − C||_F ≤ 𝜀 ||A||_F ||B||_F
Outline formalized & improved by Clarkson & Woodruff [09,13]
– Improve running time to linear in number of non-zeros in A,B
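A minimal AMS/Count-sketch inner-product estimator in the spirit of this slide: sketch two sparse vectors with shared hash functions, then estimate ⟨x, y⟩ as the median over rows of the row-wise dot products of the sketches. Hash details and names are illustrative.

```python
import hashlib

def _bucket(row, i, w):
    h = hashlib.blake2b(f"b{row}:{i}".encode(), digest_size=8).digest()
    return int.from_bytes(h, "big") % w

def _sign(row, i):
    h = hashlib.blake2b(f"s{row}:{i}".encode(), digest_size=1).digest()
    return 1 if h[0] & 1 else -1

def ams_sketch(vec, d, w):
    """Sketch a sparse vector given as {index: value} into a d x w array of sums."""
    S = [[0.0] * w for _ in range(d)]
    for i, v in vec.items():
        for r in range(d):
            S[r][_bucket(r, i, w)] += _sign(r, i) * v
    return S

def inner_product_estimate(Sx, Sy):
    """Per-row dot products of the two sketches are unbiased for <x, y>; take their median."""
    ests = sorted(sum(a * b for a, b in zip(rx, ry)) for rx, ry in zip(Sx, Sy))
    return ests[len(ests) // 2]

# E.g., estimate one entry (AB)[i][j] from sketches of row i of A and column j of B.
```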
68. Compressed Matrix Multiplication
What if we are just interested in the large entries of AB?
– Or, the ability to estimate any entry of (AB)
– Arises in recommender systems, other ML applications
If we had a sketch of (AB), could find these approximately
Compressed Matrix Multiplication [Pagh 12]:
– Can we compute sketch(AB) from sketch(A) and sketch(B)?
– To do this, need to dive into structure of the Count (AMS) sketch
Several insights needed to build the method:
– Express matrix product as summation of outer products
– Take convolution of sketches to get a sketch of outer product
– New hash function enables this to proceed
– Use the FFT to speed up from O(w²) to O(w log w)
69. More Linear Algebra
Matrix multiplication improvement: use more powerful hash fns
– Obtain a single accurate estimate with high probability
Linear regression given matrix A and vector b:
find x ∈ ℝ^d to (approximately) solve min_x ||Ax − b||
– Approach: solve the minimization in “sketch space”
– From a summary of size O(d²/𝜀) [independent of the number of rows of A]
Frequent directions: approximate matrix-vector product
[Ghashami, Liberty, Phillips, Woodruff 15]
– Use the SVD to (incrementally) summarize matrices
The relevant sketches can be built quickly: proportional to the
number of nonzeros in the matrices (input sparsity)
– Survey: Sketching as a tool for linear algebra [Woodruff 14]
70. Current Directions in Data Summarization
Sparse representations of high dimensional objects
– Compressed sensing, sparse fast Fourier transform
General purpose numerical linear algebra for (large) matrices
– k-rank approximation, linear regression, PCA, SVD, eigenvalues
Summaries to verify full calculation: a ‘checksum for computation’
Geometric (big) data: coresets, clustering, machine learning
Use of summaries in large-scale, distributed computation
– Build them in MapReduce, Continuous Distributed models
Communication-efficient maintenance of summaries
– As the (distributed) input is modified
71. Small Summary of Small Summaries
Two complementary approaches in response to growing data sizes
– Scale the computation up; scale the data down
The theory and practice of data summarization has many guises
– Sampling theory (since the start of statistics)
– Streaming algorithms in computer science
– Compressive sampling, dimensionality reduction… (maths, stats, CS)
Continuing interest in applying and developing new theory