2. “We promised to count live...
...but since you can’t do that, we used historical
numbers and this cool math to extrapolate.”
?!?
3. Stream counting is simple
You already have the building blocks
Yet many wait for batch execution
Or jump through estimation hoops
4. Accurate counting
[Architecture diagram: Servers → Server Bus → Bucketiser × 3 → Aggregator]
● Straightforward, with some plumbing.
● Heavier than you need.
5. Now or later? Exact or rough?
Approximation now >> accurate later
6. Basic scenarios
● How many distinct items in last x minutes?
● What are the top k items in last x minutes?
● How many Ys in last x minutes?
These basic techniques are sufficient to
implement e.g. personalisation and
recommendation algorithms.
8. Cardinality - distinct stream count
● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter
+ counter.
9. Counting in context
● Look backward, different time windows,
compare.
● Count for a small time quantum, keep
history.
● Aggregate old windows.
● Monoid representations are desirable.
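The monoid requirement above just means that per-quantum counts can be merged associatively, so old windows can be aggregated in any grouping without losing accuracy. A minimal Python sketch (the names and sample data are illustrative, not from the talk):

```python
from collections import Counter

# A per-quantum count is a Counter; Counter addition is the monoid
# operation (associative, with the empty Counter as identity).

def merge(*windows):
    """Combine any number of per-quantum counts into one window."""
    total = Counter()
    for w in windows:
        total += w
    return total

minute1 = Counter({"U2": 3, "Gaga": 1})
minute2 = Counter({"U2": 2, "Avicii": 4})
minute3 = Counter({"Gaga": 5})

# Grouping doesn't matter: (m1 + m2) + m3 == m1 + (m2 + m3),
# so old windows can be rolled up lazily.
hour = merge(minute1, minute2, minute3)
```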
12. Cardinality - distinct stream count
● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter
+ counter.
● Naive 3: Hash to bitmap. Count bits.
● Attempt 4: Hash, bitmap, count + collision
compensation. Linear Probabilistic Counter.
● Read the papers… → HyperLogLog counter
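The "Attempt 4" above, the Linear Probabilistic Counter, can be sketched in a few lines: hash each item into a bitmap, then compensate for collisions with n ≈ −m · ln(empty/m). The bitmap size and hash choice below are illustrative:

```python
import hashlib
import math

class LinearCounter:
    """Linear Probabilistic Counter: hash items into an m-bit bitmap,
    then estimate distinct count from the fraction of still-empty bits."""

    def __init__(self, m=1024):
        self.m = m
        self.bits = [0] * m

    def add(self, item):
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) % self.m
        self.bits[h] = 1

    def estimate(self):
        empty = self.bits.count(0)
        if empty == 0:
            return float("inf")  # bitmap saturated; a bigger m is needed
        # Collision compensation: expected empty fraction is e^(-n/m).
        return -self.m * math.log(empty / self.m)

lc = LinearCounter()
for i in range(500):
    lc.add(f"item-{i}")
# lc.estimate() lands near 500 for m=1024
```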
14. Top K counting
Before          Peps arrives    Dolly returns
U2      65      U2      65      U2      65
Gaga    46      Gaga    46      Gaga    46
Avicii  23      Avicii  23      Avicii  23
Eminem  21      Eminem  21      Eminem  21
Dolly   18      Peps    19      Dolly   20
● Keep only k items; assume an absent item
has the current lowest value.
● Accurate at the top, overcounting at the bottom.
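The evict-and-inherit scheme above (a simplified Space-Saving-style counter; class and method names are illustrative) can be sketched as:

```python
class TopK:
    """Keep only k counters; a newcomer evicts the current minimum and
    inherits its count + 1, so the head of the list stays accurate
    while the tail may overcount."""

    def __init__(self, k=5):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Absentee: assume it had the lowest value, replace that entry.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])

# Replaying the slide: Peps evicts Dolly (18) and inherits 19,
# then Dolly returns, evicts Peps (19) and inherits 20.
tk = TopK(k=5)
tk.counts = {"U2": 65, "Gaga": 46, "Avicii": 23, "Eminem": 21, "Dolly": 18}
tk.add("Peps")
tk.add("Dolly")
```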
15. Approx counting - Count-Min Sketch
● Compute n hashes for the key, one per row.
● Increment one cell per row: column = hash mod width.
● Retrieve by taking min() over the rows.
[4 × 10 CMS table — one cell incremented per row for the same key:]
 3  7 20  3   11  6  3+1  4    1  1
 3  8  6  2+1 17 13  1    0    4  5
12  7  6 14    2  0  2    3  6+1  7
 3  2 12  8+1 10  2  7    2   11  2
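The three steps above, as a minimal sketch (the md5-based row hashes and the 4 × 10 size are illustrative choices):

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch: d rows x w columns; each row has its own hash.
    Increment one cell per row; read back the minimum over rows, which
    bounds the overcount caused by hash collisions."""

    def __init__(self, d=4, w=10):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _cols(self, key):
        # Derive d independent-ish hashes by salting with the row index.
        for row in range(self.d):
            h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.w

    def add(self, key, n=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += n

    def count(self, key):
        # min() over rows: collisions can only inflate, never deflate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(key)))
```

Note that a CMS never undercounts: every collision adds to a cell, so min() over the rows is an upper bound that equals the true count when at least one row is collision-free.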
16. Top K with Count-Min Sketch
Before          Peps arrives    Dolly returns
U2      65      U2      65      U2      65
Gaga    46      Gaga    46      Gaga    46
Avicii  23      Avicii  23      Avicii  23
Eminem  21      Eminem  21      Eminem  21
Dolly   18      Peps     2      Dolly   19
● Keep Heavy Hitters list.
● Lookup absentees in CMS.
● Risk of overcount is smaller and spread out.
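A sketch of the combination: every item goes into the CMS, and only items whose estimate beats the current minimum enter the heavy-hitters list. The CMS is inlined here to keep the example self-contained, and all parameters are illustrative:

```python
import hashlib

class HeavyHitters:
    """Top-k list backed by a Count-Min Sketch: absentees are looked up
    in the CMS instead of inheriting the evicted minimum, so overcount
    comes only from hash collisions and is smaller and spread out."""

    def __init__(self, k=5, d=4, w=64):
        self.k = k
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]
        self.heavy = {}  # item -> estimated count

    def _cells(self, key):
        for row in range(self.d):
            h = int(hashlib.md5(f"{row}:{key}".encode()).hexdigest(), 16)
            yield row, h % self.w

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1
        est = min(self.table[r][c] for r, c in self._cells(item))
        if item in self.heavy or len(self.heavy) < self.k:
            self.heavy[item] = est
        else:
            # Promote only if the CMS estimate beats the current minimum.
            victim = min(self.heavy, key=self.heavy.get)
            if est > self.heavy[victim]:
                del self.heavy[victim]
                self.heavy[item] = est
```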
17. Cubic CMS
● Decorate song with geo, age, etc. Pour into
CMS.
● Keep heavy hitters per geo, age group.
*:*:<U2>        +1
SE:*:<U2>       +1
*:31-40:<U2>    +1
SE:31-40:<U2>   +1
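The decoration step can be sketched as a key expansion: one play becomes one key per dimension subset, with `*` as the wildcard, and each key gets +1 in the CMS. The function name below is illustrative; the key syntax follows the slide:

```python
def decorated_keys(song, geo, age_group):
    """Expand one play of `song` into a decorated key per subset of
    dimensions, so any slice (*:*, SE:*, *:31-40, SE:31-40) can be
    queried from the CMS later."""
    return [
        f"*:*:{song}",
        f"{geo}:*:{song}",
        f"*:{age_group}:{song}",
        f"{geo}:{age_group}:{song}",
    ]

# A Swedish 35-year-old plays U2: four CMS cells each get +1.
keys = decorated_keys("U2", "SE", "31-40")
```

Note the cost: d extra dimensions mean 2^d decorated increments per event, which is why the slide keeps heavy hitters per geo and age group rather than per full combination.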
18. Machinery
O(10^4) messages/s per machine.
You probably only need one. If not, use Storm.
Read and write to pub/sub channel, e.g. Kafka
or ZeroMQ.
19. Brute force alternative
Dump every single message into
ElasticSearch.
Suitable for high dimensionality cubes.
22. Hungry for more?
Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and-recommendation-stream-mining
Ted Dunning on deep learning for real-time anomaly detection: http://www.berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases
Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20
Open source: stream-lib, Algebird