
Doing-the-impossible

Many statistics are impossible to compute precisely on streaming data. There are some very clever algorithms, however, which allow us to compute very good approximations of these values efficiently in terms of CPU and memory.

Published in: Technology

  1. © 2014 MapR Technologies
  2. • "Decoder ring"
     • "the next thing I want to do is this"
     • Flajolet
  3. • What's the problem?
       – speed
       – feasibility
       – communication
       – incremental computation
       – tree-based pre-computation
     • What do we need?
       – on-line version
       – associative version
  4. • Why is that hard (impossible)?
       – pathological inputs
       – median ... any element of the first half of the data could be the median
       – k-th most common ... any element could occur enough in the second half to be biggest
       – unique elements ... hashing loses information, any compact representation must have false positives or negatives
  5. • What can we do?
       – give up ... a slow, but exact answer may not be sooo bad
       – give up ... a fast, but inexact answer may not be sooo bad
     • The good news:
       – approximate can be very, very close to exact
  6. The Classic Problems
     • Most common (top-40)
     • Count distinct
     • Quantiles, with focus on extremes
  7. Classic Solutions
     • Leaky counters
       – Forget values, remember uncertainties
     • Count min sketch
       – Many small hash tables
     • Count distinct with HyperLogLog
       – Many hashes again
     • New Solution - Quantiles by t-digest
       – A new low in clustering
  8. Classic Solutions - Leaky counters
     • Intuition:
       – Common elements are rarely rare, rare elements are always rare
     • Leaky counter:
       – new element inserted with count = 1, error = ceiling((N-1)/w)
       – every w samples, drop all entries with f + error < ceiling(N/w)
     • Adaptation to heavy hitters is trivial
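The leaky counter rules on slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: N is read as the number of items seen so far, including the current one, f as the stored count, and w as the pruning interval. The names `LeakyCounter`, `width` and `heavy_hitters` are invented for this sketch and do not come from the slides.

```python
class LeakyCounter:
    """Toy leaky counter following the two rules on slide 8 (assumed reading)."""

    def __init__(self, width):
        self.width = width   # w: prune every `width` samples
        self.n = 0           # N: items seen so far
        self.entries = {}    # key -> [f, error]

    @staticmethod
    def _ceil_div(a, b):
        # Integer ceiling division, avoids floating point.
        return (a + b - 1) // b

    def add(self, key):
        self.n += 1
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            # New element: count = 1, error = ceiling((N-1)/w).
            self.entries[key] = [1, self._ceil_div(self.n - 1, self.width)]
        if self.n % self.width == 0:
            # Every w samples: drop entries with f + error < ceiling(N/w).
            bound = self._ceil_div(self.n, self.width)
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] >= bound}

    def heavy_hitters(self, min_count):
        # Keys whose stored count f reaches min_count (a simple, assumed query).
        return {k: f for k, (f, _) in self.entries.items() if f >= min_count}


# One frequent key interleaved with 500 keys that each occur once.
counter = LeakyCounter(width=100)
for i in range(1000):
    counter.add("a" if i % 2 == 0 else f"u{i}")
```

In this run the frequent key keeps its full count while most of the one-off keys are pruned, so the table stays far smaller than the number of distinct keys seen.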
  9. Classic Solutions - Count min sketch
     • Intuition:
       – A gazillion hashed counters can't all be wrong
     • Big array of counters, each row has different hash function
     • Increment counter in each row determined by hashing
     • Probe by finding minimum hashed counter for probe key
     • Oops... finding heavy hitters is tricky ... requires keeping log n sketches
  10. Increment Hashed Locations to Insert
      [Figure: key a is hashed to one counter per row, at position $h_i(a)$ in row $i$]
  11. Probe Using min of Counts
      $\min_i k[h_i(a)]$
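Slides 9 to 11 describe insert and probe for a count-min sketch; a minimal Python sketch of those two operations follows. The class name, the default sizes, and the choice of salted BLAKE2b as the per-row hash are assumptions made for illustration. The `merge` method is also an addition: it shows one way the "associative version" asked for on slide 3 could look, since two sketches of the same shape combine by adding counters.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `depth` rows of `width` counters."""

    def __init__(self, depth=4, width=1024):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # A different hash function per row, derived here by salting
        # BLAKE2b with the row number (an assumed choice).
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        # Insert: increment the hashed location in every row.
        for i in range(self.depth):
            self.rows[i][self._index(i, key)] += count

    def estimate(self, key):
        # Probe: minimum over rows. Collisions only add, so this never
        # underestimates the true count.
        return min(self.rows[i][self._index(i, key)]
                   for i in range(self.depth))

    def merge(self, other):
        # Sketches with the same shape and hashes add element-wise.
        merged = CountMinSketch(self.depth, self.width)
        for i in range(self.depth):
            merged.rows[i] = [a + b for a, b in
                              zip(self.rows[i], other.rows[i])]
        return merged


left, right = CountMinSketch(), CountMinSketch()
for word in ["a", "b", "a"]:
    left.add(word)
right.add("a", 5)
combined = left.merge(right)
```

Because every collision inflates a counter, `combined.estimate("a")` is at least the true count of 7; taking the minimum across rows picks the least inflated counter.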
  12. Classic Solutions - Count distinct with HyperLogLog
      • Intuition:
        – The smallest of n uniform samples is expected to be about 1/n (exactly 1/(n+1))
        – Hashing turns anything into uniform distribution
        – Hashing again turns anything into a new uniform distribution
      • Best done with pictures
  13. What does hashing look like?
  14. [Figure: plot on the unit square, both axes from 0.0 to 1.0; label: ix]
  15. [Figure: plot on the unit square, both axes from 0.0 to 1.0; label: hash(ix)]
  16. Hashing fixes all ills
  17. [Figure, two panels. "Original distribution": x ~ G(0.2, 0.2), mean = 1, median = 0.1, 5th percentile = 10^-6. "After hashing": values spread over 0.0 to 1.0]
  18. Now the trick … what is the min?
  19. Repeated Minimum
      • 10 samples
      • Min is ~ 0.1
  20. [Figure: PDF of Min(x) over 0.00 to 0.10. Observed minimum value (100 samples x 10,000 replications)]
  21. [Figure: PDF of Min(x) over 0.00 to 0.10. Theoretical distribution overlaid on the observed minimum value (100 samples x 10,000 replications)]
  22. [Figure: PDF of Min(x) over 0.00 to 0.10, mean = 0.0099. Theoretical distribution overlaid on the observed minimum value (100 samples x 10,000 replications)]
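The experiment behind slides 19 to 22 is easy to repeat. For n independent uniform samples on [0, 1], P(min > x) = (1 - x)^n, so the expected minimum is 1/(n+1); with n = 100 that is 1/101 ≈ 0.0099, matching the mean on slide 22. The sketch below reruns the simulation with Python's standard `random` module; the function name and the seed are arbitrary choices for illustration.

```python
import random


def mean_of_minimum(n_samples, replications, seed=0):
    """Average, over many replications, of the minimum of n uniform samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        total += min(rng.random() for _ in range(n_samples))
    return total / replications


# Same setup as the slides: 100 samples x 10,000 replications.
mean_min = mean_of_minimum(100, 10_000)

# Inverting E[min] = 1/(n+1) turns the observed minimum back into a count.
estimated_n = 1 / mean_min - 1
print(f"mean of min = {mean_min:.4f}, estimated n = {estimated_n:.1f}")
```

The inversion in the last step is the core of the count-distinct idea: after hashing, distinct values behave like uniform samples, so how small the minimum gets says how many there are.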
  23. Counting leading zeros is taking the log (almost)
  24. [Figure: PDF of the observed minimum log10(value), Min(x); mean = −2.3, 10^−2.3 = 0.0056; second axis: Error, log scale from 1e−05 to 0.1]
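Slide 23's remark can be checked directly: if a 64-bit hash h is read as the fraction u = h / 2^64, its number of leading zero bits z satisfies z ≤ -log2(u) ≤ z + 1. The Python sketch below assumes a 64-bit BLAKE2b hash; the helper names are invented here, and this is a single-register illustration rather than a HyperLogLog implementation.

```python
import hashlib
import math

BITS = 64


def hash64(item):
    # 64-bit hash of the item's string form (assumed hash choice).
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def leading_zeros(h):
    # Number of zero bits before the first 1 in a BITS-wide word.
    return BITS - h.bit_length()


def max_leading_zeros(items):
    # The smallest hash has the most leading zeros, so this tracks
    # -log2 of the minimum hashed value to within one.
    return max(leading_zeros(hash64(x)) for x in items)


for i in range(5):
    h = hash64(i)
    if h:  # log2 is undefined for a zero hash
        print(i, leading_zeros(h), round(-math.log2(h / 2**BITS), 3))
```

Over n distinct items the maximum leading-zero count is typically near log2(n) but varies a lot from run to run, which is why HyperLogLog combines many such registers instead of relying on one.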
  25. T-digest for Quantiles
      • Intuition:
        – 1-d k-means with size cap
        – Make size cap depend on distance to nearest end
      • Experimental verification
        – Distribution in cluster very uniform
        – Accuracy far better than alternatives, especially at extremes
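The "1-d k-means with size cap" intuition on slide 25 can be illustrated with a small Python sketch. It is not the t-digest library: the class name `TinyDigest`, the parameter `delta`, and the cap 4·N·delta·q·(1−q) are assumptions chosen so that clusters near either end (q close to 0 or 1) stay small. The quantile lookup skips interpolation and there is no periodic compaction.

```python
import bisect
import random


class TinyDigest:
    """Toy size-capped 1-d clustering for quantile estimates."""

    def __init__(self, delta=0.01):
        self.delta = delta
        self.means = []    # cluster means, kept sorted
        self.counts = []   # cluster sizes, parallel to means
        self.total = 0

    def _cap(self, q):
        # Assumed cap: small near the ends, largest at the median.
        return max(1.0, 4 * self.total * self.delta * q * (1 - q))

    def add(self, x):
        self.total += 1
        i = bisect.bisect_left(self.means, x)
        # The nearest cluster is one of the two neighbours of x.
        best = None
        for j in (i - 1, i):
            if 0 <= j < len(self.means):
                if best is None or abs(self.means[j] - x) < abs(self.means[best] - x):
                    best = j
        if best is not None:
            # Approximate quantile of the cluster centre.
            q = (sum(self.counts[:best]) + self.counts[best] / 2) / self.total
            if self.counts[best] + 1 <= self._cap(q):
                # Room left: absorb x and update the running mean.
                self.counts[best] += 1
                self.means[best] += (x - self.means[best]) / self.counts[best]
                return
        # Nearest cluster is full (or none exists): start a new one at x.
        self.means.insert(i, x)
        self.counts.insert(i, 1)

    def quantile(self, q):
        # Mean of the cluster where the running count crosses q * total.
        target = q * self.total
        seen = 0
        for mean, count in zip(self.means, self.counts):
            seen += count
            if seen >= target:
                return mean
        return self.means[-1]


rng = random.Random(1)
digest = TinyDigest(delta=0.01)
for _ in range(10_000):
    digest.add(rng.random())
print(len(digest.means), digest.quantile(0.5), digest.quantile(0.999))
```

Because the cap shrinks toward the ends, tail clusters hold only a handful of points each, so extreme quantiles are resolved much more finely than the median.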
