SlideShare a Scribd company logo
1 of 38
Download to read offline
for duplicate detection on real time stream
whoami(1) 
15 years of experience, proud to be a programmer 
Writes software for information extraction, nlp, opinion mining (@scale ), and a 
lot of other buzzwords 
Implements scalable architectures 
Member of the JUG-Torino coordination team 
ro.franchini@gmail.com github.com/robfrank 
twitter.com/robfrankie linkedin.com/in/robfrank 
http://www.celi.it http://www.blogmeter.it
Agenda 
What is it? 
Main features 
Caching 
Counters 
Scripting 
How we use it
From the site 
Redis is an open source, BSD licensed, advanced 
key-value cache and store. It is often referred to 
as a data structure server since keys can contain 
strings, hashes, lists, sets, sorted sets, bitmaps 
and hyperloglogs.
Who use it 
Twitter 
Github 
Youporn 
Pinterest 
Groupon 
...
Ecosystem 
Clients in every known language 
Articles, books, presentations 
On High Scalability every other day
Architecture 
Single-threaded server 
Yes: single threaded server 
Remember that when you need to scale 
Single Linux server can handle 500k req/s
Main features 
In memory K/V store 
But with durable persistence 
Master-slave async replica 
Transactions 
Pub/Sub 
Server side LUA scripting
Main features 
Keys with TTL 
LRU eviction 
Keys can contain strings, hashes, lists, sets, sorted sets, 
bitmaps and hyperloglogs 
REDIS cluster on the go (3.0.0-rc1)
K/V store 
Key-value (KV) stores use the associative array (also 
known as a map or dictionary) as their fundamental data 
model. In this model, data is represented as a collection of 
key-value pairs, such that each possible key appears at 
most once in the collection. (wikipedia)
K/V store 
Key 
“plain text” 
name rob 
surname frank 
A C E D B F 
A B C D E F 
String/blobs/bitmaps 
HashTable: Objects 
Linked lists 
Sets
Persistence 
Configurable, two flavors 
RDB: perfect for backup 
AOF: append only log, replayed at startup 
Use AOF + RDB for rock solid persistence 
Automatic cache warm-up at startup!! 
Only RAM: switch off persistence
Common use cases 
Cache 
Queue 
Session replication 
In memory indexes 
Centralized ID generation
Basics 
SET user:1 frank 
GET user:1 → frank 
EXISTS user:2 → 1 
EXPIRE user:1 3600 
INCR count:1 
GET count:1 → 1
Basics 
KEYS user:* → user:1, user:2 
MSET user:1 frank user:2 coder 
MGET user:1 user:2 → frank, coder 
HMSET userdetail:3 name rob surname frank 
HGETALL userdetail:3 → name::rob, surname:: frank
Transactions 
MULTI 
INCR counter:1 
INCR counter:2 
EXEC 
> 1 
> 1 
WATCH counter:3 
val = GET counter:3 
val = val +1 
MULTI 
SET counter:3 $val 
EXEC
Atomic counters 
Operators for key increment 
INCR counter:1 
GET counter:1 → 1 
INCRBY counter:1 9 
GET counter:1 → 10
LUA scripting 
Server side LUA scripting 
A “sort of” stored procedure 
Scripts are sandboxed 
Atomic execution ← bear in mind
LUA scripting 
SCRIPT LOAD "return {KEYS[1],KEYS[2]}" 
"3905aac1828a8f75707b48e446988eaaeb173f13" 
EVALSHA 
3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 
user:2 
1) "user:1" 
2) "user:2"
Caching: server level 
Configure REDIS as a cache 
maxmemory 1024mb 
maxmemory-policy allkeys-lru 
all the keys will be evicted using an 
approximated LRU algorithm
Caching: TTL on key 
Set a timeout on a key 
SET doc:1 “mydoc.txt” 
EXIPRE doc:1 10 
Or 
SETEX doc:1 10 “mydoc.txt”
Demo
Caching 
+ 
Atomic Counters 
+ 
Atomic LUA scripting
Duplicate detection 
Real time stream of documents from 
the Internet 
20% to 50% of documents are duplicated 
DUPLICATES ARE EVIL 
And customers don’t pay for that :(
Basic Scenario 
5M 3M 3M 
Duplicates 
Producer 
Producer detector NLP Storage 
Producer
Avoid duplicated documents 
Act on producers was 
TOO HARD 
Filter-out them before heavy document analysis (NLP)
Documents 
“Documents” are from: 
twitter 
facebook 
gplus 
instagram 
forums 
blogs
Documents 
Each kind of document has its own natural id 
twitter: status id 
facebook: post id 
forum: URL 
blog: URL 
We don’t want this IDs inside our system
Duplicate and id generation 
Producer 
2M 
Producer 
Producer 
Duplicate 
detector - 
ID 
generatio 
n 
Analysis 
Storage 
3M 
3M 
Duplicate 
detector - 
ID 
generatio 
n 
1M Analysis 1M 
5M
Map external keys to internal UID 
Generate an ID for each document 
IDs are generated using daily named counters: 
INCR day:20141028 → 12576 
INCR day:20141010 → 23412576 
Cache generated ID 
tw_1234578688 → day:20141028;12576
Map external keys to internal UID 
Documents are internally stored on different storage 
systems with their generated id 
globalId→ 20141028:3456789
Operations 
Natural Keys are cached with TTL 
Documents out of time are parked in a staging area 
Duplicated documents are usually dropped
LRU cache, counters and LUA 
LUA scripts are executed atomically 
Wrote a simple script to: 
return previous mapped id 
or generate id and store key and id in cache 
EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 
GET tw_1234566 → 20141028:123
Demo
Deployment 
Pre-production phase 
Single server 
70M keys in 10GB of RAM 
In production with a simple M/S
Alternatives 
PostgreSQL 
sequence(s) 
table OR hstore 
Hazelcast (we are java based) 
in memory 
write your own persistence
Q/A
References 
http://redis.io/ 
http://redis.io/commands 
http://stackoverflow.com/questions/tagged/redis 
http://try.redis.io/

More Related Content

What's hot

From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
GitOps - Operation By Pull Request
GitOps - Operation By Pull RequestGitOps - Operation By Pull Request
GitOps - Operation By Pull RequestKasper Nissen
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2Dvir Volk
 
Designing a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd productsDesigning a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd productsJulian Mazzitelli
 
Intro to Telegraf
Intro to TelegrafIntro to Telegraf
Intro to TelegrafInfluxData
 
Cilium - API-aware Networking and Security for Containers based on BPF
Cilium - API-aware Networking and Security for Containers based on BPFCilium - API-aware Networking and Security for Containers based on BPF
Cilium - API-aware Networking and Security for Containers based on BPFThomas Graf
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsWeaveworks
 
Case Study: Migration to GitLab (from Bitbucket) at AppsFlyer
Case Study: Migration to GitLab (from Bitbucket) at AppsFlyerCase Study: Migration to GitLab (from Bitbucket) at AppsFlyer
Case Study: Migration to GitLab (from Bitbucket) at AppsFlyerNoa Harel
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...InfluxData
 
Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3Markus Klems
 
From DevOps to GitOps with GitLab
From DevOps to GitOps with GitLabFrom DevOps to GitOps with GitLab
From DevOps to GitOps with GitLabChen Cheng-Wei
 
gRPC Design and Implementation
gRPC Design and ImplementationgRPC Design and Implementation
gRPC Design and ImplementationVarun Talwar
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 

What's hot (20)

From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
GitOps - Operation By Pull Request
GitOps - Operation By Pull RequestGitOps - Operation By Pull Request
GitOps - Operation By Pull Request
 
Redis 101
Redis 101Redis 101
Redis 101
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
 
Designing a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd productsDesigning a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd products
 
CNCF and Cloud Native Intro
CNCF and Cloud Native IntroCNCF and Cloud Native Intro
CNCF and Cloud Native Intro
 
Gitlab CI/CD
Gitlab CI/CDGitlab CI/CD
Gitlab CI/CD
 
Intro to Telegraf
Intro to TelegrafIntro to Telegraf
Intro to Telegraf
 
Cilium - API-aware Networking and Security for Containers based on BPF
Cilium - API-aware Networking and Security for Containers based on BPFCilium - API-aware Networking and Security for Containers based on BPF
Cilium - API-aware Networking and Security for Containers based on BPF
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 
Case Study: Migration to GitLab (from Bitbucket) at AppsFlyer
Case Study: Migration to GitLab (from Bitbucket) at AppsFlyerCase Study: Migration to GitLab (from Bitbucket) at AppsFlyer
Case Study: Migration to GitLab (from Bitbucket) at AppsFlyer
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
 
Observability with HAProxy
Observability with HAProxyObservability with HAProxy
Observability with HAProxy
 
Mono Repo
Mono RepoMono Repo
Mono Repo
 
Rate limits and all about
Rate limits and all aboutRate limits and all about
Rate limits and all about
 
Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3
 
From DevOps to GitOps with GitLab
From DevOps to GitOps with GitLabFrom DevOps to GitOps with GitLab
From DevOps to GitOps with GitLab
 
gRPC Design and Implementation
gRPC Design and ImplementationgRPC Design and Implementation
gRPC Design and Implementation
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 

Viewers also liked

Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamCodemotion
 
Push! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsPush! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsDominik Obermaier
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...DECK36
 
Codemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSCodemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSRoberto Franchini
 
What the hell is your software doing at runtime?
What the hell is your software doing at runtime?What the hell is your software doing at runtime?
What the hell is your software doing at runtime?Roberto Franchini
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Roberto Franchini
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataRoberto Franchini
 
Scaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamScaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamDominik Obermaier
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkXiaoqian Liu
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysItamar Haber
 
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Redis Labs
 
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Maarten Balliauw
 
Redis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaRedis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaItamar Haber
 
Getting Started with Redis
Getting Started with RedisGetting Started with Redis
Getting Started with RedisFaisal Akber
 
RespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellRespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellYoshifumi Kawai
 
UV logic using redis bitmap
UV logic using redis bitmapUV logic using redis bitmap
UV logic using redis bitmap주용 오
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProRedis Labs
 
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre... Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...Redis Labs
 
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulRedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulDynomiteDB
 

Viewers also liked (20)

Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time stream
 
Push! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsPush! - MQTT for the Internet of Things
Push! - MQTT for the Internet of Things
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
 
Codemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSCodemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFS
 
What the hell is your software doing at runtime?
What the hell is your software doing at runtime?What the hell is your software doing at runtime?
What the hell is your software doing at runtime?
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big data
 
Scaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamScaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic Beam
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on Spark
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual Ways
 
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
 
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
 
Redis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaRedis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of Lua
 
Getting Started with Redis
Getting Started with RedisGetting Started with Redis
Getting Started with Redis
 
RespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellRespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShell
 
UV logic using redis bitmap
UV logic using redis bitmapUV logic using redis bitmap
UV logic using redis bitmap
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoPro
 
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre... Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulRedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
 

Similar to Redis for duplicate detection on real time stream

Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Joe Arnold
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsCeph Community
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyTim Bunce
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http acceleratorno no
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisRicard Clau
 
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleJérôme Petazzoni
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchVic Hargrave
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton insertsChris Adkin
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xrkr10
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBob Ward
 
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQJérôme Petazzoni
 
Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Jérôme Petazzoni
 
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesOSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesNETWAYS
 
dba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisdba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisLiviu Costea
 
Privilege Escalation with Metasploit
Privilege Escalation with MetasploitPrivilege Escalation with Metasploit
Privilege Escalation with Metasploitegypt
 
FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018Xavier Mertens
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1Hajime Tazaki
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNoSuchCon
 

Similar to Redis for duplicate detection on real time stream (20)

Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.key
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http accelerator
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with Redis
 
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with Elasticsearch
 
Advances in Open Source Password Cracking
Advances in Open Source Password CrackingAdvances in Open Source Password Cracking
Advances in Open Source Password Cracking
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton inserts
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
 
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
 
Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9
 
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesOSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
 
dba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisdba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redis
 
Privilege Escalation with Metasploit
Privilege Escalation with MetasploitPrivilege Escalation with Metasploit
Privilege Escalation with Metasploit
 
FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
 

Recently uploaded

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 

Recently uploaded (20)

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

Redis for duplicate detection on real time stream

  • 1. for duplicate detection on real time stream
  • 2. whoami(1) 15 years of experience, proud to be a programmer Writes software for information extraction, nlp, opinion mining (@scale ), and a lot of other buzzwords Implements scalable architectures Member of the JUG-Torino coordination team ro.franchini@gmail.com github.com/robfrank twitter.com/robfrankie linkedin.com/in/robfrank http://www.celi.it http://www.blogmeter.it
  • 3. Agenda What is it? Main features Caching Counters Scripting How we use it
  • 4. From the site Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs.
  • 5. Who use it Twitter Github Youporn Pinterest Groupon ...
  • 6. Ecosystem Clients in every known language Articles, books, presentations On High Scalability every other day
  • 7. Architecture Single-threaded server Yes: single threaded server Remember that when you need to scale Single Linux server can handle 500k req/s
  • 8. Main features In memory K/V store But with durable persistence Master-slave async replica Transactions Pub/Sub Server side LUA scripting
  • 9. Main features Keys with TTL LRU eviction Keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs REDIS cluster on the go (3.0.0-rc1)
  • 10. K/V store Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. (wikipedia)
  • 11. K/V store Key “plain text” name rob surname frank A C E D B F A B C D E F String/blobs/bitmaps HashTable: Objects Linked lists Sets
  • 12. Persistence Configurable, two flavors RDB: perfect for backup AOF: append only log, replayed at startup Use AOF + RDB for rock solid persistence Automatic cache warm-up at startup!! Only RAM: switch off persistence
  • 13. Common use cases Cache Queue Session replication In memory indexes Centralized ID generation
  • 14. Basics SET user:1 frank GET user:1 → frank EXISTS user:2 → 1 EXPIRE user:1 3600 INCR count:1 GET count:1 → 1
  • 15. Basics KEYS user:* → user:1, user:2 MSET user:1 frank user:2 coder MGET user:1 user:2 → frank, coder HMSET userdetail:3 name rob surname frank HGETALL userdetail:3 → name::rob, surname:: frank
  • 16. Transactions MULTI INCR counter:1 INCR counter:2 EXEC > 1 > 1 WATCH counter:3 val = GET counter:3 val = val +1 MULTI SET counter:3 $val EXEC
  • 17. Atomic counters Operators for key increment INCR counter:1 GET counter:1 → 1 INCRBY counter:1 9 GET counter:1 → 10
  • 18. LUA scripting Server side LUA scripting A “sort of” stored procedure Scripts are sandboxed Atomic execution ← bear in mind
  • 19. LUA scripting SCRIPT LOAD "return {KEYS[1],KEYS[2]}" "3905aac1828a8f75707b48e446988eaaeb173f13" EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2 1) "user:1" 2) "user:2"
  • 20. Caching: server level Configure REDIS as a cache maxmemory 1024mb maxmemory-policy allkeys-lru all the keys will be evicted using an approximated LRU algorithm
  • 21. Caching: TTL on key Set a timeout on a key SET doc:1 “mydoc.txt” EXIPRE doc:1 10 Or SETEX doc:1 10 “mydoc.txt”
  • 22. Demo
  • 23. Caching + Atomic Counters + Atomic LUA scripting
  • 24. Duplicate detection Real time stream of documents from the Internet 20% to 50% of documents are duplicated DUPLICATES ARE EVIL And customers don’t pay for that :(
  • 25. Basic Scenario 5M 3M 3M Duplicates Producer Producer detector NLP Storage Producer
  • 26. Avoid duplicated documents Act on producers was TOO HARD Filter-out them before heavy document analysis (NLP)
  • 27. Documents “Documents” are from: twitter facebook gplus instagram forums blogs
  • 28. Documents Each kind of document has its own natural id twitter: status id facebook: post id forum: URL blog: URL We don’t want this IDs inside our system
  • 29. Duplicate and id generation Producer 2M Producer Producer Duplicate detector - ID generatio n Analysis Storage 3M 3M Duplicate detector - ID generatio n 1M Analysis 1M 5M
  • 30. Map external keys to internal UID Generate an ID for each document IDs are generated using daily named counters: INCR day:20141028 → 12576 INCR day:20141010 → 23412576 Cache generated ID tw_1234578688 → day:20141028;12576
  • 31. Map external keys to internal UID Documents are internally stored on different storage systems with their generated id globalId→ 20141028:3456789
  • 32. Operations Natural Keys are cached with TTL Documents out of time are parked in a staging area Duplicated documents are usually dropped
  • 33. LRU cache, counters and LUA LUA scripts are executed atomically Wrote a simple script to: return previous mapped id or generate id and store key and id in cache EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 GET tw_1234566 → 20141028:123
  • 34. Demo
  • 35. Deployment Pre-production phase Single server 70M keys in 10GB of RAM In production with a simple M/S
  • 36. Alternatives PostgreSQL sequence(s) table OR hstore Hazelcast (we are java based) in memory write your own persistence
  • 37. Q/A
  • 38. References http://redis.io/ http://redis.io/commands http://stackoverflow.com/questions/tagged/redis http://try.redis.io/