Since its inception, the Snowplow open source event analytics platform (https://github.com/snowplow/snowplow) has been tightly coupled to the batch-based Hadoop ecosystem, and to Elastic MapReduce in particular.
With the release of Amazon Kinesis in late 2013, we set ourselves the challenge of porting Snowplow to Kinesis, to give our users access to their Snowplow event stream in near real time.
With this porting process nearing completion, Alex Dean, Snowplow Analytics co-founder and technical lead, will share Snowplow's experiences in adopting stream processing as a complementary architecture to Hadoop and batch-based processing.
In particular, Alex will explore:
- "Hero" use cases for event streaming which drove our adoption of Kinesis
- Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
- How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose which platform to use (or both)
- Key considerations when moving from a batch mindset to a streaming mindset, including aggregate windows, recomputation and backpressure
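One of those considerations, aggregate windows, marks the clearest mindset shift: a batch job can simply GROUP BY over a complete dataset, while a streaming consumer must bucket an unbounded stream into fixed-size time windows as events arrive. A minimal sketch of tumbling-window counting (function and field names are hypothetical, not from the Snowplow codebase):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=60):
    """Count events per (window, event_type) as they arrive.

    `events` is an iterable of (unix_timestamp, event_type) pairs.
    In batch processing we would aggregate over the full dataset at once;
    here each event is assigned to a fixed-size time window instead.
    """
    counts = defaultdict(int)
    for ts, event_type in events:
        window_start = ts - (ts % window_secs)  # align to window boundary
        counts[(window_start, event_type)] += 1
    return dict(counts)

stream = [(0, "page_view"), (10, "page_view"), (65, "click"), (70, "page_view")]
result = tumbling_window_counts(stream, window_secs=60)
# events at t=0 and t=10 fall in the window starting at 0; t=65 and t=70 at 60
```

A real Kinesis consumer would also need to handle late-arriving events and checkpointing, which this sketch deliberately omits.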
2. Agenda today
1. Introduction to Snowplow
2. Our batch data flow & use cases
3. Why are we excited about Kinesis?
4. Adding Kinesis support to Snowplow
5. Questions
4. Snowplow is an open-source web and event analytics platform, first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
• We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow
• We started working full time on Snowplow in summer 2013
5. We wanted to take a fresh approach to web analytics
• Your own web event data, in your own data warehouse
• Your own event data model
• Slice, dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
[Diagram: data pipeline → data warehouse → analyse your data in any analysis tool]
6. And we saw the potential of new "big data" technologies and services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
[Diagram: CloudFront → Amazon S3 → Amazon EMR → Amazon Redshift]
7. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems
1. Trackers: generate event data from any environment
2. Collectors: log raw events from trackers
3. Enrich: validate and enrich raw events
4. Storage: store enriched events ready for analysis
5. Analytics: analyze enriched events
A, B, C, D = standardised data protocols between the subsystems
These protocols turned out to be critical in allowing us to evolve our technology stack
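The loose coupling comes from agreeing on the data contract between subsystems rather than on implementations: any collector that honours the contract can be swapped in without the enrichment stage noticing. A hypothetical sketch of such a contract (field names are illustrative, not Snowplow's actual protocol):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RawEvent:
    """Contract between collectors (producers) and enrichment (consumer)."""
    collector_timestamp: int
    user_ip: str
    payload: str  # querystring or POST body captured by the collector

def enrich(raw: RawEvent) -> dict:
    # Enrichment depends only on the RawEvent contract, not on which
    # collector (CloudFront-based, Clojure-based, ...) produced it.
    return {"ts": raw.collector_timestamp, "ip": raw.user_ip, "event": raw.payload}

e = enrich(RawEvent(1400000000, "203.0.113.7", "e=pv&page=home"))
```

This is exactly what let Snowplow later replace batch collectors with a streaming one while reusing the downstream stages.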
9. By spring 2013 we had arrived at a relatively stable batch-based processing architecture
[Diagram: website / webapp → Snowplow Hadoop data pipeline: JavaScript event tracker → CloudFront-based or Clojure-based event collector → Amazon S3 → Scalding-based enrichment on Hadoop → Amazon Redshift / PostgreSQL]
10. What did people start using Snowplow for?
Warehousing their web event data, and agile (aka ad hoc) analytics, to enable:
• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn prediction
• RTB fraud detection
• Email product recs
11. These use cases tended to be characterized by a few important traits
1. They use data collected over long time periods
2. They demand ongoing and hands-on involvement from a BA / data scientist
3. They tend not to elicit synchronous/deterministic responses
Example use cases showing these traits: agile (aka ad hoc) analytics, marketing attribution modelling, RTB fraud detection
13. A quick history lesson: the three eras of business data processing
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
14. The classic era, 1996+
[Diagram: own data center. Narrow data silos (CMS, CRM, E-comm, ERP), each with a low-latency local loop. A nightly batch ETL process over point-to-point connections (high latency) feeds a data warehouse with wide data coverage and full data history, used for management reporting]
15. The hybrid era, 2005+
[Diagram: cloud vendor / own data center plus SaaS vendors #1-#3. Narrow data silos (Search, E-comm, CRM, ERP, CMS, email marketing, web analytics), each with a low-latency local loop. APIs and bulk exports feed, at high latency, batch processing into a data warehouse (management reporting) and into Hadoop (ad hoc analytics), and, at low latency, stream processing (product rec's) and micro-batch processing (systems monitoring)]
16. The unified era, 2013+
[Diagram: cloud vendor / own data center plus SaaS vendors #1 and #2. The same narrow data silos (Search, E-comm, CRM, ERP, CMS, email marketing), now with only some low-latency local loops, feed a unified log via streaming APIs / web hooks and APIs: low latency, wide data coverage, a few days' data history. The unified log is archived to Hadoop (wide data coverage, full data history; high latency) for management reporting and ad hoc analytics, while the low-latency event stream drives systems monitoring, product rec's, fraud detection and churn prevention]
17. [Slide 16 diagram repeated] The unified log is Kinesis (or Kafka)
18. [Slide 16 diagram repeated] We asked: can we implement Snowplow on top of Kinesis?
19. What kinds of use cases can we support if we implement Snowplow on top of Kinesis?
Populating a unified log with your company's event streams, to enable:
• In-session product recs
• In-session upselling
• In-game difficulty tuning
• Ad retargeting & RTB
• Holistic systems monitoring
• … anything requiring a low-latency response / a holistic view of our data!
21. Where we are heading with our Kinesis architecture
[Diagram: Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → enriched event stream (plus a bad raw events stream for failures); S3 sink Kinesis app: enriched event stream → S3; Redshift sink Kinesis app: enriched event stream → Redshift]
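The Enrich app in this pipeline reads the raw stream and writes to two output streams: enriched events, and "bad" raw events that fail validation. A simplified simulation of that routing, with in-memory lists standing in for Kinesis streams (validation rule and field names are hypothetical):

```python
def run_enrich(raw_stream, enriched_stream, bad_stream):
    """Route each raw record to the enriched or the bad output stream.

    Stands in for the Enrich Kinesis app: raw_stream is the input,
    enriched_stream and bad_stream are the two output streams.
    """
    for record in raw_stream:
        if "event" in record:                     # minimal validation rule
            record = dict(record, enriched=True)  # stand-in for real enrichment
            enriched_stream.append(record)
        else:
            bad_stream.append({"error": "missing event field", "raw": record})

enriched, bad = [], []
run_enrich([{"event": "pv"}, {"oops": 1}], enriched, bad)
# one record routed to each output stream
```

Keeping failures on their own stream, rather than dropping them, means a downstream app (or a batch reprocessing job) can inspect and replay them later.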
22. This is where we are today
[Diagram: same pipeline as slide 21, Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → bad raw events stream / enriched event stream → S3 sink Kinesis app → S3; Redshift sink Kinesis app → Redshift, showing the current state of the build-out]
23. What have we and the Snowplow community learnt about Kinesis and continuous data processing so far?
1. One stream, many consuming apps is unexpected for many people (a legacy of old MQs?)
2. Think of Kinesis apps as distributed Unix commands, with streams mapping onto stdin, stderr and stdout
3. Build more complex systems by chaining simple Kinesis apps: the Kinesis stream is a really powerful primitive for continuous data flows
4. Scalability and elasticity are going to be much bigger challenges than in our batch flow
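The Unix-pipe analogy in points 2 and 3 can be made concrete: each app is a small function from an input stream to an output stream, and a complex flow is just their composition. A toy sketch using Python generators in place of Kinesis streams (all names are illustrative):

```python
def collect(raw_lines):
    """stdin: raw tracker hits -> stdout: parsed records."""
    for line in raw_lines:
        yield {"payload": line.strip()}

def enrich(records):
    """collect's stdout is our stdin; add a derived field."""
    for r in records:
        yield dict(r, chars=len(r["payload"]))

def sink(records):
    """Terminal app: materialise the stream (as S3/Redshift sinks would)."""
    return list(records)

# Chained like `collect | enrich | sink`: the stream is the primitive
# connecting otherwise independent apps, each of which can be developed,
# deployed and scaled on its own.
out = sink(enrich(collect(["e=pv\n", "e=click\n"])))
```

Unlike a Unix pipe, a Kinesis stream is durable and supports multiple independent consumers, which is what makes point 1 (one stream, many consuming apps) possible.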