Since its inception, the Snowplow open source event analytics platform (https://github.com/snowplow/snowplow) has been tightly coupled to the batch-based Hadoop ecosystem, and to Elastic MapReduce in particular.
With the release of Amazon Kinesis in late 2013, we set ourselves the challenge of porting Snowplow to Kinesis, to give our users access to their Snowplow event stream in near real time.
With this porting process nearing completion, Alex Dean, Snowplow Analytics co-founder and technical lead, will share Snowplow's experiences in adopting stream processing as a complementary architecture to Hadoop and batch-based processing.
In particular, Alex will explore:
- "Hero" use cases for event streaming which drove our adoption of Kinesis
- Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
- How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose which platform to use (or both)
- Key considerations when moving from a batch mindset to a streaming mindset, including aggregate windows, recomputation and backpressure
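One of those considerations, aggregate windows, marks the clearest mindset shift: a batch job can simply GROUP BY over a complete dataset, while a streaming consumer must bucket an unbounded stream into fixed-size time windows as events arrive. A minimal sketch of tumbling-window counting (function and field names are hypothetical, not from the Snowplow codebase):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=60):
    """Count events per (window, event_type) as they arrive.

    `events` is an iterable of (unix_timestamp, event_type) pairs.
    In batch processing we would aggregate over the full dataset at once;
    here each event is assigned to a fixed-size time window instead.
    """
    counts = defaultdict(int)
    for ts, event_type in events:
        window_start = ts - (ts % window_secs)  # align to window boundary
        counts[(window_start, event_type)] += 1
    return dict(counts)

stream = [(0, "page_view"), (10, "page_view"), (65, "click"), (70, "page_view")]
result = tumbling_window_counts(stream, window_secs=60)
# events at t=0 and t=10 fall in the window starting at 0; t=65 and t=70 at 60
```

A real Kinesis consumer would also need to handle late-arriving events and checkpointing, which this sketch deliberately omits.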
2. Agenda today
1. Introduction to Snowplow
2. Our batch data flow & use cases
3. Why are we excited about Kinesis?
4. Adding Kinesis support to Snowplow
5. Questions
4. Snowplow is an open-source web and event analytics platform, first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
• We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow
• We started working full time on Snowplow in summer 2013
5. We wanted to take a fresh approach to web analytics
• Your own web event data, in your own data warehouse
• Your own event data model
• Slice, dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
[Diagram: data pipeline → data warehouse → analyse your data in any analysis tool]
6. And we saw the potential of new "big data" technologies and services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
[Diagram: CloudFront → Amazon S3 → Amazon EMR → Amazon Redshift]
7. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems
1. Trackers: generate event data from any environment
2. Collectors: log raw events from trackers
3. Enrich: validate and enrich raw events
4. Storage: store enriched events ready for analysis
5. Analytics: analyze enriched events
A, B, C, D = standardised data protocols between the subsystems
These protocols turned out to be critical in allowing us to evolve our technology stack
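The loose coupling comes from agreeing on the data contract between subsystems rather than on implementations: any collector that honours the contract can be swapped in without the enrichment stage noticing. A hypothetical sketch of such a contract (field names are illustrative, not Snowplow's actual protocol):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RawEvent:
    """Contract between collectors (producers) and enrichment (consumer)."""
    collector_timestamp: int
    user_ip: str
    payload: str  # querystring or POST body captured by the collector

def enrich(raw: RawEvent) -> dict:
    # Enrichment depends only on the RawEvent contract, not on which
    # collector (CloudFront-based, Clojure-based, ...) produced it.
    return {"ts": raw.collector_timestamp, "ip": raw.user_ip, "event": raw.payload}

e = enrich(RawEvent(1400000000, "203.0.113.7", "e=pv&page=home"))
```

This is exactly what let Snowplow later replace batch collectors with a streaming one while reusing the downstream stages.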
9. By spring 2013 we had arrived at a relatively stable batch-based processing architecture
[Diagram: website / webapp → Snowplow Hadoop data pipeline: JavaScript event tracker → CloudFront-based or Clojure-based event collector → Amazon S3 → Scalding-based enrichment on Hadoop → Amazon Redshift / PostgreSQL]
10. What did people start using Snowplow for?
Warehousing their web event data, and agile (aka ad hoc) analytics, to enable:
• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn prediction
• RTB fraud detection
• Email product recs
11. These use cases tended to be characterized by a few important traits
1. They use data collected over long time periods
2. They demand ongoing and hands-on involvement from a BA / data scientist
3. They tend not to elicit synchronous/deterministic responses
Example use cases showing these traits: agile (aka ad hoc) analytics, marketing attribution modelling, RTB fraud detection
13. A quick history lesson: the three eras of business data processing
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
14. The classic era, 1996+
[Diagram: own data center. Narrow data silos (CMS, CRM, E-comm, ERP), each with a low-latency local loop. A nightly batch ETL process over point-to-point connections (high latency) feeds a data warehouse with wide data coverage and full data history, used for management reporting]
15. The hybrid era, 2005+
[Diagram: cloud vendor / own data center plus SaaS vendors #1-#3. Narrow data silos (Search, E-comm, CRM, ERP, CMS, email marketing, web analytics), each with a low-latency local loop. APIs and bulk exports feed, at high latency, batch processing into a data warehouse (management reporting) and into Hadoop (ad hoc analytics), and, at low latency, stream processing (product rec's) and micro-batch processing (systems monitoring)]
16. The unified era, 2013+
[Diagram: cloud vendor / own data center plus SaaS vendors #1 and #2. The same narrow data silos (Search, E-comm, CRM, ERP, CMS, email marketing), now with only some low-latency local loops, feed a unified log via streaming APIs / web hooks and APIs: low latency, wide data coverage, a few days' data history. The unified log is archived to Hadoop (wide data coverage, full data history; high latency) for management reporting and ad hoc analytics, while the low-latency event stream drives systems monitoring, product rec's, fraud detection and churn prevention]
17. [Slide 16 diagram repeated] The unified log is Kinesis (or Kafka)
18. [Slide 16 diagram repeated] We asked: can we implement Snowplow on top of Kinesis?
19. What kinds of use cases can we support if we implement Snowplow on top of Kinesis?
Populating a unified log with your company's event streams, to enable:
• In-session product recs
• In-session upselling
• In-game difficulty tuning
• Ad retargeting & RTB
• Holistic systems monitoring
• … anything requiring a low-latency response / a holistic view of our data!
21. Where we are heading with our Kinesis architecture
[Diagram: Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → enriched event stream (plus a bad raw events stream for failures); S3 sink Kinesis app: enriched event stream → S3; Redshift sink Kinesis app: enriched event stream → Redshift]
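The Enrich app in this pipeline reads the raw stream and writes to two output streams: enriched events, and "bad" raw events that fail validation. A simplified simulation of that routing, with in-memory lists standing in for Kinesis streams (validation rule and field names are hypothetical):

```python
def run_enrich(raw_stream, enriched_stream, bad_stream):
    """Route each raw record to the enriched or the bad output stream.

    Stands in for the Enrich Kinesis app: raw_stream is the input,
    enriched_stream and bad_stream are the two output streams.
    """
    for record in raw_stream:
        if "event" in record:                     # minimal validation rule
            record = dict(record, enriched=True)  # stand-in for real enrichment
            enriched_stream.append(record)
        else:
            bad_stream.append({"error": "missing event field", "raw": record})

enriched, bad = [], []
run_enrich([{"event": "pv"}, {"oops": 1}], enriched, bad)
# one record routed to each output stream
```

Keeping failures on their own stream, rather than dropping them, means a downstream app (or a batch reprocessing job) can inspect and replay them later.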
22. This is where we are today
[Diagram: same pipeline as slide 21, Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → bad raw events stream / enriched event stream → S3 sink Kinesis app → S3; Redshift sink Kinesis app → Redshift, showing the current state of the build-out]
23. What have we and the Snowplow community learnt about Kinesis and continuous data processing so far?
1. One stream, many consuming apps is unexpected for many people (a legacy of old MQs?)
2. Think of Kinesis apps as distributed Unix commands, with streams mapping onto stdin, stderr and stdout
3. Build more complex systems by chaining simple Kinesis apps: the Kinesis stream is a really powerful primitive for continuous data flows
4. Scalability and elasticity are going to be much bigger challenges than in our batch flow
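The Unix-pipe analogy in points 2 and 3 can be made concrete: each app is a small function from an input stream to an output stream, and a complex flow is just their composition. A toy sketch using Python generators in place of Kinesis streams (all names are illustrative):

```python
def collect(raw_lines):
    """stdin: raw tracker hits -> stdout: parsed records."""
    for line in raw_lines:
        yield {"payload": line.strip()}

def enrich(records):
    """collect's stdout is our stdin; add a derived field."""
    for r in records:
        yield dict(r, chars=len(r["payload"]))

def sink(records):
    """Terminal app: materialise the stream (as S3/Redshift sinks would)."""
    return list(records)

# Chained like `collect | enrich | sink`: the stream is the primitive
# connecting otherwise independent apps, each of which can be developed,
# deployed and scaled on its own.
out = sink(enrich(collect(["e=pv\n", "e=click\n"])))
```

Unlike a Unix pipe, a Kinesis stream is durable and supports multiple independent consumers, which is what makes point 1 (one stream, many consuming apps) possible.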