
Large-scale real-time analytics for everyone

My slides from Highload Strategy conference in Vilnius.


  1. Large-scale real-time analytics for everyone: fast, cheap and 98% correct
  2. Pavel Kalaidin @facultyofwonder
  3. we have a lot of data; memory is limited; one pass would be great; constant update time
  4. max, min, and mean are trivial
  5. median, anyone?
  6. Sampling?
  7. Probabilistic algorithms
  8. An estimate is OK, but it is nice to know how the error is distributed
  9. def frugal(stream):
         m = 0
         for val in stream:
             if val > m:
                 m += 1
             elif val < m:
                 m -= 1
         return m
  10. Memory used: 1 int! It really works
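The one-integer sketch from slide 9 can be run as-is; here is a quick sanity check on 10^5 uniform draws (the stream, seed, and acceptance bounds are illustrative, not from the slides):

```python
import random

def frugal(stream):
    # one-int median sketch from the slide: step one unit
    # towards each value; the walk settles near the median
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m

random.seed(42)
stream = (random.randint(0, 99) for _ in range(100_000))
m = frugal(stream)  # hovers around the true median (~50)
```

Because the estimate is a random walk with drift towards the median, any single run lands near 50 but not exactly on it, which is the "98% correct" trade-off of the title.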
  11. Percentiles?
  12. Demo: bit.ly/frugalsketch
      def frugal_1u(stream, m=0, q=0.5):
          for val in stream:
              r = np.random.random()
              if val > m and r > 1 - q:
                  m += 1
              elif val < m and r > q:
                  m -= 1
          return m
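The quantile variant on slide 12 can be tried the same way; this self-contained version swaps numpy's RNG for the stdlib random module (stream, seed, and q=0.9 are illustrative):

```python
import random

def frugal_1u(stream, m=0, q=0.5):
    # frugal quantile sketch: moves up with probability q when the
    # value is above m, down with probability 1-q when it is below,
    # so the equilibrium sits at the q-quantile
    for val in stream:
        r = random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m

random.seed(7)
stream = (random.randint(0, 99) for _ in range(200_000))
p90 = frugal_1u(stream, q=0.9)  # should settle near the 90th percentile
```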
  13. Streaming + probabilistic = sketch
  14. What do we want? Get the number of unique users, a.k.a. the cardinality
  15. What do we want? Get the number of unique users grouped by host, date, segment
  16. When do we want it? Well, right now
  17. Data: 10^10 elements (int32, 40 GB), 10^9 unique
  18. Straight-forward approach: hash-table
  19. Hash-table: 4 GB
  20. HyperLogLog: 1.5 KB, 2% error
  21. It all starts with an algorithm called LogLog
  22. Imagine I tell you I spent this morning flipping a coin
  23. and now tell you what was the longest uninterrupted run of heads
  24. 2 times or 100 times
  25. In which case did I flip the coin for longer?
  26. We are interested in patterns in hashes (namely the longest runs of leading zeros = heads)
  27. Hash, don't sample!* (* you need a good hash function)
  28. Expecting: 0xxxxxx hashes ~50%, 1xxxxxx hashes ~50%, 00xxxxx hashes ~25%
  29. estimate ≈ 2^R, where R is the longest run of leading zeros in the hashes
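The coin-flip intuition of slides 22-29 can be sketched directly (the md5-based hash and the helper names are illustrative, not from the slides):

```python
import hashlib

def leading_zeros(h, bits=32):
    # number of leading zero bits in a `bits`-bit hash:
    # each leading zero is one "heads" in the coin-flip analogy
    return bits - h.bit_length()

def crude_estimate(items):
    # single-experiment LogLog intuition: track the longest run R
    # of leading zeros over all hashes and estimate 2^R
    R = 0
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, leading_zeros(h))
    return 2 ** R
```

A single experiment like this has huge variance (R is off by a few either way quite often, and the estimate doubles per unit of R), which is exactly why the next slides average over many experiments.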
  30. I can perform several flipping experiments
  31. and average the number of zeros
  32. This is called stochastic averaging
  33. So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes
  34. We will be using M buckets
  35. estimate = ɑ·M·2^(average R over the M buckets), where ɑ is a normalization constant
  36. LogLog SuperLogLog
  37. LogLog SuperLogLog HyperLogLog: arithmetic mean -> harmonic mean, plus a couple of tweaks
  38. Standard error is 1.04/sqrt(M), where M is the number of buckets
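Slides 26-38 can be put together into a minimal HyperLogLog. This is a sketch under stated assumptions: md5 truncated to 32 bits as the hash, b=10 (M=1024 buckets), the usual ɑ approximation, and no low-cardinality correction (the fix HyperLogLog++ addresses):

```python
import hashlib

def hll_estimate(items, b=10):
    # minimal HyperLogLog: M = 2^b buckets, harmonic mean of 2^R[j]
    M = 1 << b
    alpha = 0.7213 / (1 + 1.079 / M)   # normalization constant
    R = [0] * M
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) & 0xFFFFFFFF
        j = h >> (32 - b)                 # first b bits choose the bucket
        rest = h & ((1 << (32 - b)) - 1)  # remaining 32 - b bits
        # rank = 1-based position of the leftmost 1-bit in `rest`
        R[j] = max(R[j], (32 - b) - rest.bit_length() + 1)
    # harmonic mean of the 2^R[j] values, scaled by alpha * M^2
    return alpha * M * M / sum(2.0 ** -r for r in R)
```

With M=1024 buckets the standard error 1.04/sqrt(M) is about 3%, using only M small counters instead of a hash table.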
  39. LogLog SuperLogLog HyperLogLog HyperLogLog++: Google, 2013; 32 bit -> 64 bit + fixes for low cardinality; bit.ly/HLLGoogle
  40. LogLog SuperLogLog HyperLogLog HyperLogLog++ Discrete Max-Count: Facebook, 2014; bit.ly/DiscreteMaxCount
  41. Large scale?
  42. Suppose we have two HLL sketches; let's take the maximum value from corresponding buckets
  43. The resulting sketch has no loss in accuracy!
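The merge on slide 42 is a one-liner, assuming a sketch is stored as its list of per-bucket ranks R (the representation here is an assumption for illustration):

```python
def hll_merge(R1, R2):
    # bucket-wise maximum: this is exactly the sketch you would have
    # built from the union of the two streams, hence no extra error
    return [max(a, b) for a, b in zip(R1, R2)]
```

This is what makes HLL work at large scale: sketches can be built independently per host, date, or segment and merged afterwards to answer grouped queries.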
  44. What do we want? How many unique users belong to both segments?
  45. HLL intersection
  46. Inclusion-exclusion principle
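The inclusion-exclusion step on slide 46, applied to three cardinality estimates (union via the merge trick):

```python
def hll_intersection(est_a, est_b, est_union):
    # inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    return est_a + est_b - est_union
```

Note that the errors of the three estimates add up, so this gets unreliable (and can even go negative) when the true intersection is small relative to the sets.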
  47. credits: http://research.neustar.biz/2012/12/17/hll-intersections-2/
  48. Python code: bit.ly/hloglog
  49. What do we want? Get the churn rate
  50. Straightforward: feed new data to a new sketch
  51. Sliding-window HyperLogLog
  52. We maintain a list of tuples (timestamp, R), where R is a possible maximum over future time
  53. Values that no longer make sense are automatically discarded from the list
  54. One list per bucket
  55. Take the maximum R over the given timeframe from the past, then estimate as we do in a regular HLL
  56. Extra memory is required
  57. All the details: bit.ly/SlidingHLL
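The per-bucket list of slides 52-55 can be sketched like this (function names are illustrative; the list holds "possible future maxima": an older entry with a rank no larger than a newer one can never again be the window maximum):

```python
def lfpm_add(lfpm, t, r):
    # drop entries that the new (t, r) dominates, then append it;
    # the list stays sorted by time with strictly decreasing ranks
    lfpm = [(ti, ri) for (ti, ri) in lfpm if ri > r]
    lfpm.append((t, r))
    return lfpm

def max_rank_since(lfpm, t0):
    # maximum rank observed at or after timestamp t0; feed this
    # per-bucket value into the regular HLL estimator
    return max((ri for (ti, ri) in lfpm if ti >= t0), default=0)
```

Expired entries can additionally be pruned by timestamp once they fall out of the largest window of interest, bounding the extra memory.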
  58. hash, don't sample; estimate, not precise; save memory; streaming. This slide is the sketch of the talk
  59. Lots of sketches for various purposes: percentiles, heavy hitters, similarity, other stream statistics
  60. Have we seen this user before?
  61. Bloom filter
  62. [diagram: a bit array of m zeros; hash functions h1, h2, …, hk each map item i to one bit position and set it to 1]
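A minimal Bloom filter matching the diagram (the sizes m and k and the salted-md5 hash family are illustrative assumptions, not from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, item):
        # k hash functions derived from md5 with different salts
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # a "no" answer is always correct; a "yes" may be
        # a false positive caused by colliding bits
        return all(self.bits[p] for p in self._positions(item))
```

So "have we seen this user before?" gets a memory-cheap answer with a tunable false-positive rate and no false negatives.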
  63. How many times did we see a user?
  64. Count-Min sketch is the answer: bit.ly/CountMinSketch
  65. [diagram: a d × w table of counters; hash functions h1 … hd each increment one counter (+1) in their row] Estimate: take the minimum of the d values
  66. Percentiles
  67. Frugal sketching is not precise enough
  68. Sorting is a pain
  69. Distribute incoming values to buckets?
  70. Some sort of clustering, maybe
  71. T-Digest
  72. Size is O(log n); the error is proportional to q(1-q)
  73. Code: bit.ly/T-Digest-Java, bit.ly/T-Digest-Python
  74. This is a growing field of computer science: stay tuned!
  75. Thanks and happy sketching!
  76. Reading list: Neustar Research blog: bit.ly/NRsketches; Sketches overview: bit.ly/SketchesOverview; Lecture notes on streaming algorithms: bit.ly/streaming-lectures
  77. Bonus: HyperLogLog in SQL: bit.ly/HLLinSQL
