Relevance and personalization are crucial to building a personalized local commerce experience at Groupon. This talk gives an overview of the real-time analytics infrastructure that handles over 3 million events/second and stores and scales to billions of data points. It covers how our Kafka -> Storm -> Redis/HBase pipeline is used to generate real-time analytics for hundreds of millions of Groupon users, including various architectural design choices and tradeoffs, as well as some interesting algorithmic choices such as Bloom filters and HyperLogLog. Attendees can take away lessons from our real-life experience that will help them understand various tuning methods and their tradeoffs, and apply them in their own solutions.
5. Scaling: Keeping Up With a Changing Business
(Chart: growing number of deals and growing users, 2011-2014)
• 100 million+ subscribers
• We need to store data such as user click history, email records, and service logs. This is billions of data points and terabytes of data.
6. Changing Business: Shift from Email to Mobile
• Growth in the mobile business
• Reducing dependence on email marketing
• 100 million+ app downloads
7. Deal Personalization Infrastructure Use Cases
• Offline System (Email): deliver personalized emails; personalize billions of emails for hundreds of millions of users
• Online System (Website & Mobile): deliver a personalized website and mobile experience; personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users and page views
8. Deal Personalization Infrastructure Use Cases
• Deliver a personalized website, mobile, and email experience
• Track deal performance
• Understand user behavior
• Deliver a relevant experience with high-quality deals
9. Earlier System
Map/Reduce data pipeline (user logs, email records, user history, etc.) → MySQL store → Offline Personalization (Email) and Online Deal Personalization API
10. Earlier System
Map/Reduce data pipeline → MySQL store → Offline Personalization (Email) and Online Deal Personalization API
• Scaling MySQL for data such as user click history and email records was painful unless we sharded the data
• The data pipeline is not "real time"
11. Ideal System
Real-time data pipeline → ideal data store → Offline Personalization (Map/Reduce, Email) and Online Deal Personalization API
• A common data store that serves data to both online and offline systems
• A data store that scales to hundreds of millions of records
• A data store that works well with our existing Hadoop-based systems
• A real-time pipeline that scales and can process about 100,000 messages/second
12. Final Design
Web site logs and mobile logs → Kafka message broker → Storm → HBase, which serves both Offline Personalization (Map/Reduce, for email) and the Online Deal Personalization API
13. Two Challenges With HBase
• How to scale 100,000 writes/second?
• How to run Map/Reduce programs over HBase without affecting read latency?
• How to batch-load data into HBase without affecting read latencies?
14. Final HBase Design
Two clusters linked by replication: Real-Time HBase and Batch HBase
• Bulk-load data via HFiles
• Run Map/Reduce over HBase on the batch cluster
15. Leveraging System for Real Time Analytics
Various requirements from relevance algorithms to pre-compute real-time analytics for better targeting:
• Category-level multidimensional performance metrics, e.g. how do women in Dublin convert for pizza deals?
• Deal-level performance metrics, e.g. how do women in Dublin convert for a particular pizza deal?
16. Leveraging System for Real Time Analytics
More complex examples:
• Category level: how do women in Dublin from the Dundrum area, aged 30-35, convert for New York-style pizza, when the deal is located within 2 miles and priced between €10-€20?
• Deal level: how do women in Dublin from the Dundrum area, aged 30-35, convert for a particular deal?
17. Leveraging System for Real Time Analytics
Even more complex examples:
• How do women in Dublin from the Dundrum area, aged 30-35, who also like activities such as biking and are active customers on our mobile platform, convert when the deal is located within 2 miles and priced between €10-€20?
• How do women in Dublin from the Dundrum area, aged 30-35, who also like activities such as biking and are active customers of Groupon deals on the mobile platform, convert for this particular deal?
18. Power of Simple Counting
It turns out all the earlier questions can be answered if we can count the appropriate events in the appropriate buckets:

Conversion rate for pizza deals for women in Dublin = (number of purchases by women in Dublin for pizza deals) / (number of deal impressions by women in Dublin for pizza deals)
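The counting scheme above can be sketched in a few lines. This is an in-memory stand-in for the Redis counters; the key schema and dimension names are hypothetical illustrations, not the production layout:

```python
from collections import defaultdict

# In-memory stand-in for the sharded Redis counters.
counters = defaultdict(int)

def record_event(event_type, gender, city, category):
    """Increment the counter for one (event type, dimensions) bucket."""
    bucket = f"{event_type}:{gender}:{city}:{category}"
    counters[bucket] += 1

def conversion_rate(gender, city, category):
    """Conversion rate = purchases / impressions for the same bucket."""
    impressions = counters[f"impression:{gender}:{city}:{category}"]
    purchases = counters[f"purchase:{gender}:{city}:{category}"]
    return purchases / impressions if impressions else 0.0

# Simulate a small stream of events.
for _ in range(200):
    record_event("impression", "F", "dublin", "pizza")
for _ in range(10):
    record_event("purchase", "F", "dublin", "pizza")

print(conversion_rate("F", "dublin", "pizza"))  # 0.05
```

Because each question reduces to a ratio of two counters, answering it at serve time is a pair of O(1) reads.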
19. Real Time Analytics Infrastructure
Kafka topic with real-time user events → Storm running the analytics topology → Redis shards (Redis 1 … Redis N)
• Real-time infrastructure processing 100,000 requests/second
• The Storm topology calculates the various dimensions/buckets and updates the appropriate Redis bucket; Redis is sharded from the client side
• The Redis cluster handles over 3 million events per second and stores over 14 billion unique keys
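Client-side sharding of this kind can be sketched as follows; the shard names and hashing scheme here are illustrative assumptions, not the production setup:

```python
import hashlib

# Hypothetical shard addresses; a real client would hold one Redis
# connection per shard instead of a plain string.
SHARDS = ["redis-1", "redis-2", "redis-3", "redis-4"]

def shard_for(key: str) -> str:
    """Pick a shard by hashing the bucket key on the client side,
    so every writer routes the same key to the same Redis instance."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("impression:F:dublin:pizza"))
```

Since all increments for a given bucket land on one shard, counters stay consistent without any cross-shard coordination.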
20. Real Time Analytics Infrastructure - Explained
Storm reads user event data from the Kafka topic, finds out which buckets the event falls into, and increments the event counter for each appropriate bucket in the Redis shards.
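The "find out which buckets this event falls into" step amounts to expanding one event into every combination of its dimensions. A minimal sketch, assuming three hypothetical dimensions (the real topology uses many more):

```python
from itertools import combinations

def buckets_for(event):
    """Expand one user event into all dimension-combination buckets."""
    dims = [("gender", event["gender"]),
            ("city", event["city"]),
            ("category", event["category"])]
    out = []
    for r in range(1, len(dims) + 1):
        for combo in combinations(dims, r):
            out.append(event["type"] + ":" +
                       ":".join(f"{k}={v}" for k, v in combo))
    return out

event = {"type": "impression", "gender": "F",
         "city": "dublin", "category": "pizza"}
for b in buckets_for(event):
    print(b)  # 7 buckets, one per non-empty dimension subset
```

With d dimensions each event fans out into 2^d - 1 counter increments, which is why the write path must sustain far more operations per second than the raw event rate.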
21. Scaling Challenges - Kafka - Storm
• Storm was hard to scale. We had to try many combinations to finalize how many bolts of each type were required for steady-state operation, and overall how many workers were needed.
• Use the "topology.max.spout.pending" setting in Storm topologies. We found it very useful for shielding topologies from sudden surges in traffic.
• Build your entire infrastructure so that data duplicates are allowed.
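The effect of topology.max.spout.pending can be modeled with a toy throttle: the spout stops emitting once the number of un-acked tuples reaches the cap, so a traffic surge queues up in Kafka instead of overwhelming the topology. This is an illustrative Python model, not Storm code:

```python
class ThrottledSpout:
    """Toy model of Storm's topology.max.spout.pending semantics."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = 0  # tuples emitted but not yet acked

    def try_emit(self):
        if self.pending >= self.max_pending:
            return False      # back-pressure: stop pulling from the source
        self.pending += 1
        return True

    def ack(self):
        self.pending -= 1     # a downstream bolt finished one tuple

spout = ThrottledSpout(max_pending=3)
print([spout.try_emit() for _ in range(5)])  # [True, True, True, False, False]
spout.ack()
print(spout.try_emit())  # True: capacity freed by the ack
```

Tuning the cap trades throughput for protection: too low starves the bolts, too high lets a surge fill worker queues.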
22. Scaling Challenges - Redis
• Reduce the memory footprint: use hashes, which are very memory efficient compared to normal Redis keys
• To support high write rates, we turned off AOF and turned on RDB backups
• Redis was the easiest of all the infrastructure pieces (Kafka, Storm, HBase)
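One common way to exploit Redis's compact encoding of small hashes is to fold many flat counter keys into a bounded set of hashes, using HINCRBY on a (hash, field) pair instead of INCR on a top-level key. The slot scheme below is a hypothetical sketch, not our exact layout:

```python
import hashlib

def hash_slot(key: str, num_hashes: int = 1000):
    """Map a flat counter key to a (hash name, field) pair so that
    roughly num_hashes Redis hashes hold all the counters."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return f"c:{digest % num_hashes}", key

# Instead of:  INCR impression:F:dublin:pizza
# the client runs:  HINCRBY c:<slot> impression:F:dublin:pizza 1
name, field = hash_slot("impression:F:dublin:pizza")
print(name, field)
```

Small hashes stay in Redis's compact internal encoding (subject to the hash-max-ziplist-entries/value limits), which is what makes this much cheaper per counter than one top-level key each.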
23. When Small is Big – Bloom Filters
• Since both Kafka and Storm can send the same data twice, especially at scale, it was important to build downstream infrastructure that can handle duplicate data.
• However, by its very nature, the analytics topology (a counting topology) cannot handle duplicates.
• Storing individual messages for billions of messages is far too expensive and would take a lot more memory.
• So we used Bloom filters. At a very small error rate, we could effectively de-dupe data with a very small memory footprint.
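A minimal Bloom filter for this kind of de-duplication might look like the following. The sizes and hash count are illustrative; a small false-positive rate means a few events are wrongly dropped as duplicates, which is acceptable for approximate counters:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for de-duplicating message IDs."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.size

    def seen_before(self, item: str) -> bool:
        """Return True if item was (probably) seen already; record it."""
        hit = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                hit = False                 # at least one bit was unset
                self.bits[byte] |= 1 << bit
        return hit

bf = BloomFilter()
print(bf.seen_before("msg-123"))  # False: first occurrence, count it
print(bf.seen_before("msg-123"))  # True: duplicate, skip counting
```

The memory cost is fixed up front (here 128 KB for ~1M bits) regardless of how many messages flow through, which is the property that made storing per-message state unnecessary.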
24. Avoiding Errors – Backups/Recovery Strategy
For a high-volume system that also drives so much revenue for the company, a good backup/recovery strategy is necessary.
• Redis: RDB backups every few hours; the RDB backups are stored in HDFS for later use
• HBase: the HBase snapshot functionality is used; snapshots are taken every few hours
• Kafka/Storm: all input into the Kafka topic is stored in HDFS for 30 days, so any hour/day can be replayed from HDFS if necessary
25. Monitoring
Overall end-to-end monitoring tests the complete flow of data through the Kafka -> Storm -> HBase pipeline: a crawler crawls the page, and monitoring looks for the corresponding data in HBase.
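Such an end-to-end probe can be sketched generically: inject a tagged synthetic event at the front of the pipeline and poll the store until the corresponding record appears. The helper names here are hypothetical, not our monitoring API:

```python
import time

def end_to_end_check(emit_event, lookup, timeout_s=60, poll_s=5):
    """Emit a tagged synthetic event, then poll the data store until the
    corresponding record appears or the timeout expires."""
    marker = f"synthetic-{time.time()}"
    emit_event(marker)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if lookup(marker):
            return True   # data flowed through the whole pipeline
        time.sleep(poll_s)
    return False

# Demo with an in-memory "store" standing in for HBase.
store = set()
print(end_to_end_check(store.add, store.__contains__, timeout_s=5, poll_s=0))  # True
```

A probe like this catches failures that per-component health checks miss, such as a topology that is running but silently dropping writes.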