
Redis for duplicate detection on real time stream

A brief intro to Redis, the in-memory K/V store, and a real use case.


  1. Redis for duplicate detection on real time stream
  2. whoami(1)
     15 years of experience, proud to be a programmer
     Writes software for information extraction, NLP, opinion mining (@scale), and a lot of other buzzwords
     Implements scalable architectures
     Member of the JUG-Torino coordination team
     ro.franchini@gmail.com
     github.com/robfrank
     twitter.com/robfrankie
     linkedin.com/in/robfrank
     http://www.celi.it
     http://www.blogmeter.it
  3. Agenda
     What is it?
     Main features
     Caching
     Counters
     Scripting
     How we use it
  4. From the site
     "Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."
  5. Who uses it
     Twitter
     GitHub
     YouPorn
     Pinterest
     Groupon
     ...
  6. Ecosystem
     Clients in every known language
     Articles, books, presentations
     On High Scalability every other day
  7. Architecture
     Single-threaded server
     Yes: single-threaded server
     Remember that when you need to scale
     A single Linux server can handle 500k req/s
  8. Main features
     In-memory K/V store
     But with durable persistence
     Master-slave asynchronous replication
     Transactions
     Pub/Sub
     Server-side Lua scripting
  9. Main features
     Keys with TTL
     LRU eviction
     Keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs
     Redis Cluster in progress (3.0.0-rc1)
  10. K/V store
      "Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection." (Wikipedia)
  11. K/V store (diagram: a key maps to one of the following value types)
      String/blobs/bitmaps: "plain text"
      HashTable (objects): name rob, surname frank
      Linked lists: A C E D B F
      Sets: A B C D E F
  12. Persistence
      Configurable, two flavors
      RDB: perfect for backup
      AOF: append-only log, replayed at startup
      Use AOF + RDB for rock solid persistence
      Automatic cache warm-up at startup!!
      RAM only: switch off persistence
  13. Common use cases
      Cache
      Queue
      Session replication
      In-memory indexes
      Centralized ID generation
  14. Basics
      SET user:1 frank
      GET user:1 → frank
      EXISTS user:1 → 1
      EXPIRE user:1 3600
      INCR count:1
      GET count:1 → 1
  15. Basics
      KEYS user:* → user:1, user:2
      MSET user:1 frank user:2 coder
      MGET user:1 user:2 → frank, coder
      HMSET userdetail:3 name rob surname frank
      HGETALL userdetail:3 → name::rob, surname::frank
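  16. Transactions
      Queued commands, executed together:
      MULTI
      INCR counter:1
      INCR counter:2
      EXEC
      > 1
      > 1
      Optimistic locking: EXEC is aborted if the watched key changed after WATCH:
      WATCH counter:3
      val = GET counter:3
      val = val + 1
      MULTI
      SET counter:3 $val
      EXEC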
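
The WATCH / MULTI / EXEC pattern on the transactions slide is a check-and-set: the write only goes through if nobody touched the key since it was watched. The sketch below models that behaviour in plain Python so it can run without a server. It is not a Redis client; `WatchedStore`, `exec_set` and `increment` are hypothetical names introduced for illustration, and a per-key write counter stands in for whatever Redis does internally.

```python
class WatchedStore:
    """Toy in-memory model of WATCH/MULTI/EXEC semantics (not a Redis client)."""

    def __init__(self):
        self.data = {}
        self.versions = {}  # key -> number of writes applied so far

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1

    def watch(self, key):
        # Remember the key's version at WATCH time.
        return self.versions.get(key, 0)

    def exec_set(self, key, watched_version, value):
        # Apply the queued SET only if the key is unchanged since WATCH.
        if self.versions.get(key, 0) != watched_version:
            return False  # transaction aborted, the caller should retry
        self.set(key, value)
        return True


def increment(store, key):
    """Retry loop: WATCH, read, compute, then try to commit."""
    while True:
        token = store.watch(key)
        value = int(store.get(key) or 0) + 1
        if store.exec_set(key, token, value):
            return value


store = WatchedStore()
first = increment(store, "counter:3")

token = store.watch("counter:3")
store.set("counter:3", 41)  # another client writes after our WATCH
aborted = not store.exec_set("counter:3", token, 99)

second = increment(store, "counter:3")
```

For a plain increment, INCR already does this atomically; the watch-and-retry loop only pays off when the new value depends on logic that a single command cannot express.
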
  17. Atomic counters
      Operators for key increment
      INCR counter:1
      GET counter:1 → 1
      INCRBY counter:1 9
      GET counter:1 → 10
  18. Lua scripting
      Server-side Lua scripting
      A "sort of" stored procedure
      Scripts are sandboxed
      Atomic execution ← bear in mind
  19. Lua scripting
      SCRIPT LOAD "return {KEYS[1],KEYS[2]}"
      "3905aac1828a8f75707b48e446988eaaeb173f13"
      EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2
      1) "user:1"
      2) "user:2"
  20. Caching: server level
      Configure Redis as a cache
      maxmemory 1024mb
      maxmemory-policy allkeys-lru
      Any key can be evicted; victims are chosen with an approximated LRU algorithm
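
To make the eviction policy on the server-level caching slide concrete, here is a minimal sketch of LRU eviction in plain Python. Two assumptions differ from Redis: the cap is a number of entries rather than a memory size, and the eviction is exact LRU, whereas the slide notes that Redis uses an approximated algorithm. `LruCache` and `max_entries` are hypothetical names.

```python
from collections import OrderedDict


class LruCache:
    """Exact LRU cache capped by entry count (a model, not Redis itself)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.items = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # a read counts as a use
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        while len(self.items) > self.max_entries:
            self.items.popitem(last=False)  # evict the least recently used


cache = LruCache(max_entries=2)
cache.set("doc:1", "a")
cache.set("doc:2", "b")
cache.get("doc:1")       # doc:1 is now the most recently used
cache.set("doc:3", "c")  # over capacity: doc:2 is evicted
remaining = list(cache.items)
```
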
  21. Caching: TTL on key
      Set a timeout on a key
      SET doc:1 "mydoc.txt"
      EXPIRE doc:1 10
      Or
      SETEX doc:1 10 "mydoc.txt"
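
The effect of a per-key timeout can be sketched in a few lines of Python. This is a model of the behaviour, not of how Redis implements it: the expiry is only checked when the key is read, and the clock is injected so the example does not need to sleep. `TtlStore` and its `clock` parameter are hypothetical names.

```python
class TtlStore:
    """Toy key/value store with per-key expiry, checked lazily on read."""

    def __init__(self, clock):
        self.clock = clock   # callable returning the current time in seconds
        self.values = {}
        self.deadlines = {}  # key -> time at which the key expires

    def setex(self, key, seconds, value):
        # Same argument order as SETEX: key, timeout, value.
        self.values[key] = value
        self.deadlines[key] = self.clock() + seconds

    def get(self, key):
        deadline = self.deadlines.get(key)
        if deadline is not None and self.clock() >= deadline:
            # The key has expired: drop it before answering.
            self.values.pop(key, None)
            self.deadlines.pop(key, None)
        return self.values.get(key)


now = [0.0]  # fake clock, advanced by hand
store = TtlStore(clock=lambda: now[0])
store.setex("doc:1", 10, "mydoc.txt")
before = store.get("doc:1")
now[0] = 10.0
after = store.get("doc:1")
```
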
  22. Demo
  23. Caching + Atomic Counters + Atomic Lua scripting
  24. Duplicate detection
      Real time stream of documents from the Internet
      20% to 50% of documents are duplicated
      DUPLICATES ARE EVIL
      And customers don't pay for that :(
  25. Basic Scenario (diagram)
      Components: Producer (x3), Duplicates detector, NLP, Storage
      Flow labels: 5M, 3M, 3M
  26. Avoid duplicated documents
      Acting on the producers was TOO HARD
      Filter duplicates out before heavy document analysis (NLP)
  27. Documents
      "Documents" come from:
      Twitter
      Facebook
      Google+
      Instagram
      forums
      blogs
  28. Documents
      Each kind of document has its own natural id
      twitter: status id
      facebook: post id
      forum: URL
      blog: URL
      We don't want these IDs inside our system
  29. Duplicate and ID generation (diagram)
      Components: Producer (x3), Duplicate detector / ID generation (x2), Analysis (x2), Storage
      Flow labels: 2M, 3M, 3M, 1M, 1M, 5M
  30. Map external keys to internal UID
      Generate an ID for each document
      IDs are generated using daily named counters:
      INCR day:20141028 → 12576
      INCR day:20141010 → 23412576
      Cache the generated ID:
      tw_1234578688 → day:20141028;12576
  31. Map external keys to internal UID
      Documents are stored internally on different storage systems with their generated id
      globalId → 20141028:3456789
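
The daily named counters can be sketched as follows: one counter per day, bumped once per new document, with the day and the counter value joined into the global ID. This is an in-memory model of the INCR calls shown on the slides, not the production code; `DailyIdGenerator` and `next_id` are hypothetical names, and the `YYYYMMDD:counter` layout is taken from the globalId example.

```python
from collections import defaultdict
from datetime import date


class DailyIdGenerator:
    """Model of one counter per day, in the spirit of INCR day:YYYYMMDD."""

    def __init__(self):
        self.counters = defaultdict(int)  # "day:YYYYMMDD" -> last value issued

    def next_id(self, day):
        stamp = day.strftime("%Y%m%d")
        counter_key = "day:" + stamp
        self.counters[counter_key] += 1  # stands in for INCR
        return f"{stamp}:{self.counters[counter_key]}"


generator = DailyIdGenerator()
a = generator.next_id(date(2014, 10, 28))
b = generator.next_id(date(2014, 10, 28))
c = generator.next_id(date(2014, 10, 10))
```

Because each day has its own counter, IDs stay short and the day prefix shows when a document was first seen.
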
  32. Operations
      Natural keys are cached with a TTL
      Documents out of time are parked in a staging area
      Duplicated documents are usually dropped
  33. LRU cache, counters and Lua
      Lua scripts are executed atomically
      We wrote a simple script that either returns the previously mapped id, or generates a new id and stores key and id in the cache
      EVALSHA "sha" 2 20141028 tw_1234566 → 20141028:123
      GET tw_1234566 → 20141028:123
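
The script itself is not shown in the slides. The sketch below is a guess at its logic, written in Python rather than Lua so it runs without a server: look up the natural key, return the mapped ID if there is one, otherwise bump the day counter, store the mapping and return the new ID. A lock stands in for the atomic execution Redis gives to scripts. `DuplicateDetector`, `resolve` and the returned `is_duplicate` flag are assumptions, and the TTL on cached keys is left out for brevity.

```python
import threading


class DuplicateDetector:
    """In-memory model of the get-or-create step (not the original script)."""

    def __init__(self):
        self.lock = threading.Lock()  # emulates atomic script execution
        self.counters = {}            # day -> last counter value
        self.ids = {}                 # natural key -> global id

    def resolve(self, day, natural_key):
        """Return (global_id, is_duplicate) for a document's natural key."""
        with self.lock:
            known = self.ids.get(natural_key)
            if known is not None:
                return known, True    # seen before: reuse the mapped id
            self.counters[day] = self.counters.get(day, 0) + 1
            global_id = f"{day}:{self.counters[day]}"
            self.ids[natural_key] = global_id
            return global_id, False


detector = DuplicateDetector()
first = detector.resolve("20141028", "tw_1234566")
again = detector.resolve("20141029", "tw_1234566")  # same tweet, a day later
other = detector.resolve("20141028", "tw_7654321")
```

Doing the lookup and the insert in one atomic step is what prevents two workers from generating two different IDs for the same document.
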
  34. Demo
  35. Deployment
      Pre-production phase
      Single server
      70M keys in 10GB of RAM
      In production with a simple master/slave (M/S) setup
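
The deployment figures imply a rough memory footprint per cached key. A quick check, assuming the whole 10GB is attributable to the 70M keys and that "GB" may mean either binary or decimal gigabytes (the slide does not say):

```python
KEYS = 70_000_000

# 10GB read as 10 GiB (binary) and as 10 * 10^9 bytes (decimal).
bytes_per_key_binary = 10 * 1024**3 / KEYS
bytes_per_key_decimal = 10 * 1000**3 / KEYS

print(round(bytes_per_key_binary), round(bytes_per_key_decimal))
```

Either way the result is on the order of 150 bytes per entry, a figure that covers the natural key, the mapped ID and the store's own bookkeeping.
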
  36. Alternatives
      PostgreSQL: sequence(s), a table or hstore
      Hazelcast (we are Java based): in memory, write your own persistence
  37. Q/A
  38. References
      http://redis.io/
      http://redis.io/commands
      http://stackoverflow.com/questions/tagged/redis
      http://try.redis.io/