This document discusses the benefits of implementing a unified log to aggregate event data from different systems and sources. It provides a history of business data processing, from the siloed "classic" and "hybrid" eras to the current "unified" era. A unified log such as Amazon Kinesis or Apache Kafka gives a company a single version of the truth, unravelling the hairball of point-to-point integrations. Snowplow, an open-source analytics platform, is working on integrating with unified logs to enable real-time analytics across all company data.
AWS User Group UK: Why your company needs a unified log
1. Why your company needs a Unified Log
AWS User Group UK, 28th January 2015
2. Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
4. A quick history lesson: the three eras of business data processing [1]
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
5. The classic era of business data processing, 1996+
[Diagram: in the company's own data center, narrow data silos (CMS, CRM, E-comm, ERP) each run their own low-latency local loop. A nightly batch ETL process over point-to-point connections feeds a data warehouse offering wide data coverage and full data history, which serves high-latency management reporting.]
6. The hybrid era, 2005+
[Diagram: narrow data silos are now spread across a cloud vendor / own data center (Search, E-comm, CRM, ERP, CMS) and several SaaS vendors (email marketing, web analytics), each with its own low-latency local loop. Data moves out via APIs and bulk exports into stream processing (product rec's), micro-batch processing (systems monitoring), and batch processing into a data warehouse (management reporting) and Hadoop (ad hoc analytics), over a mix of low- and high-latency paths.]
7. The hybrid era: a surfeit of software vendors
[Diagram repeated from slide 6, emphasising how many separate in-house systems and SaaS vendors each hold their own data silo.]
8. The hybrid era: company-wide reporting and analytics ends up like Rashomon
The bandit’s story
vs.
The wife’s story
vs.
The samurai’s story
vs.
The woodcutter’s story
9. The hybrid era: the number of data integrations is unsustainable
11. The unified era, 2013+
[Diagram: the same narrow data silos (Search, E-comm, CRM, ERP, CMS, SaaS email marketing) now feed a unified log via streaming APIs / web hooks. The unified log offers low latency and wide data coverage but holds only a few days' data history; it is archived to Hadoop, which offers wide data coverage and full data history at high latency. Downstream, the eventstream drives systems monitoring, product rec's, fraud detection, churn prevention, ad hoc analytics, and management reporting, with only some low-latency local loops remaining.]
12. [Diagram repeated from slide 11.]
The unified log is Amazon Kinesis, or Apache Kafka
• Amazon Kinesis: a hosted AWS service with extremely similar semantics to Kafka
• Apache Kafka: an append-only, distributed, ordered commit log, developed at LinkedIn to serve as their organization's unified log
13. “Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]
[1] http://kafka.apache.org/
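The append-only, ordered semantics behind Kafka and Kinesis can be illustrated with a minimal in-memory sketch. The `UnifiedLog` class below is hypothetical, not part of either product: producers append events, each event gets a monotonically increasing offset, and any number of consumers read independently from their own offsets, including replaying from offset 0.

```python
class UnifiedLog:
    """A toy append-only, ordered log: the core abstraction behind
    Kafka topics and Kinesis streams (illustration only)."""

    def __init__(self):
        self._events = []  # append-only; existing entries are never mutated

    def append(self, event):
        """Append an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events=10):
        """Read up to max_events starting at offset; each consumer tracks
        its own offset, so many apps can read the same log."""
        return self._events[offset:offset + max_events]

log = UnifiedLog()
for e in [{"event": "page_view"}, {"event": "add_to_basket"}, {"event": "checkout"}]:
    log.append(e)

# Two independent consumers: a dashboard tailing the head of the log,
# and a model re-run replaying the full history from offset 0.
dashboard_batch = log.read(offset=2)
replay = log.read(offset=0, max_events=100)
```

Because the log is never mutated in place, the "re-running models from Day 0" use case falls out for free: a new consumer simply starts reading at offset 0.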
14. So what does a unified log give us?
1. A single version of the truth
2. Our truth is now upstream from the data warehouse
3. The hairball of point-to-point connections has been unravelled
4. Local loops have been unbundled
15. What does a unified log let us do that we couldn't do before? Populating a unified log with your company's event streams enables:
• Real-time management reporting
• Holistic systems monitoring
• Re-running models from Day 0
• A/B testing end-to-end pipelines
• Shipping offline models to real time
• … anything requiring a low-latency response or a holistic view of our company's data!
16. How are we embracing the unified log at Snowplow?
17. Some background: early on, we decided that Snowplow should be composed of a set of loosely coupled subsystems:
1. Trackers: generate event data from any environment
2. Collectors: log raw events from trackers
3. Enrich: validate and enrich raw events
4. Storage: store enriched events ready for analysis
5. Analytics: analyze enriched events
A–D = standardised data protocols between the subsystems
These turned out to be critical to allowing us to evolve the above stack
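What a standardised data protocol between subsystems buys you can be sketched as follows. The field names here are hypothetical, not Snowplow's actual wire format: if every subsystem agrees on a minimal event envelope, a collector, enricher, or storage loader can be swapped out without breaking its neighbours.

```python
# Hypothetical minimal envelope every downstream subsystem relies on
# (illustrative field names, not Snowplow's real protocol).
REQUIRED_FIELDS = {"event_id", "collector_tstamp", "event_type"}

def validate_envelope(event: dict):
    """Return (ok, problems): check that an event carries the minimal
    envelope, so the Enrich stage can reject malformed raw events."""
    missing = REQUIRED_FIELDS - event.keys()
    return (not missing, sorted(missing))

ok, problems = validate_envelope(
    {"event_id": "e1", "collector_tstamp": "2015-01-28T19:00:00Z",
     "event_type": "page_view"})
bad_ok, bad_problems = validate_envelope({"event_id": "e2"})
```

Validation at the boundary is also what makes a "bad raw event stream" possible later in the deck: events failing the contract are diverted rather than silently dropped.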
18. Today most users are running a batch-based Snowplow configuration:
Snowplow event tracking SDK → HTTP-based event collector → Amazon S3 → Hadoop-based enrichment → Amazon Redshift
• Batch-based
• Normally run overnight; sometimes every 4-6 hours
The Snowplow batch-based flow uses Amazon S3 as a “poor man’s” unified log
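Using S3 as a "poor man's" unified log boils down to writing collected events into time-bucketed files and letting the overnight Hadoop job process each completed bucket in order. A rough sketch (the key layout is illustrative, not Snowplow's actual S3 layout):

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_key(tstamp: datetime) -> str:
    """Map an event timestamp to an hourly 'directory' key, so a batch
    job can pick up each completed hour in order."""
    return tstamp.strftime("raw-events/%Y-%m-%d/%H/")

def batch_events(events):
    """Group (timestamp, payload) pairs into per-hour batches, mimicking
    how a collector might flush files to S3 for the nightly ETL."""
    batches = defaultdict(list)
    for tstamp, payload in events:
        batches[bucket_key(tstamp)].append(payload)
    return dict(batches)

events = [
    (datetime(2015, 1, 28, 18, 5, tzinfo=timezone.utc), "view:home"),
    (datetime(2015, 1, 28, 18, 40, tzinfo=timezone.utc), "view:pricing"),
    (datetime(2015, 1, 28, 19, 2, tzinfo=timezone.utc), "signup"),
]
batches = batch_events(events)
```

The "poor man's" caveat is latency: downstream consumers only see a bucket once the batch window closes, hence the overnight (or 4-6 hourly) cadence.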
19. [Diagram repeated from slide 11.]
Can we implement Snowplow on top of Kinesis/Kafka?
20. We are working on Amazon Kinesis support first; Apache Kafka + Samza will come later this year
[Diagram: Snowplow Trackers feed the scala-stream-collector, which writes a raw event stream and a bad raw event stream to Kinesis. scala-kinesis-enrich (using DynamoDB) produces an enriched event stream, consumed by Kinesis apps: an S3 sink into S3, a Redshift sink into Amazon Redshift, a kinesis-elasticsearch-sink into Elasticsearch, a kinesis-bigquery-sink into Google BigQuery, and an Event aggregator. A legend marks some components as not yet released.]
• Analytics on Read: agile exploration of events, machine learning, auditing, re-processing…
• Analytics on Write: dashboarding, audience segmentation, RTB, etc.
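The split between analytics-on-write and analytics-on-read can be sketched like this (hypothetical code, not the actual Kinesis apps): an aggregator updates pre-computed counters as each enriched event arrives, while the same events are also archived untouched so brand-new questions can be asked of the full history later.

```python
from collections import Counter

class EventAggregator:
    """Analytics-on-write: maintain pre-computed aggregates
    (e.g. for a live dashboard) as events stream in."""
    def __init__(self):
        self.counts = Counter()

    def process(self, event):
        self.counts[event["event_type"]] += 1

archive = []          # analytics-on-read: keep every event for later queries
aggregator = EventAggregator()

for event in [{"event_type": "page_view"}, {"event_type": "page_view"},
              {"event_type": "checkout"}]:
    archive.append(event)       # archive sink (S3/BigQuery in the diagram)
    aggregator.process(event)   # aggregator sink

# A dashboard reads aggregator.counts instantly; an analyst can still
# scan `archive` with a query the aggregator never anticipated.
```

Both sinks consume the same enriched stream independently, which is exactly the fan-out the unified log makes cheap.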
21. Snowplow users can already write stream processing applications which leverage the Snowplow enriched event stream
[Diagram: Snowplow Trackers → scala-stream-collector → raw event stream (plus bad raw event stream) → scala-kinesis-enrich → enriched event stream, which can be consumed with AWS Lambda, Apache Storm, Apache Samza, Apache Spark Streaming, or the Kinesis Client Library.]
1. We have a single version of the truth: together, the unified log plus the Hadoop archive represent our single version of the truth. They contain exactly the same data, our event stream; they just cover different time windows.
2. The single version of the truth is upstream from the data warehouse: in the classic era, the data warehouse provided the single version of the truth, making all reports generated from it consistent. In the unified era, the log provides it, so operational systems (e.g. recommendation and ad targeting systems) compute on the same truth as the analysts producing management reports.
3. Point-to-point connections have largely been unravelled: in their place, applications append to the unified log and other applications read those writes.
4. Local loops have been unbundled: in place of local silos, applications collaborate on near-real-time decision-making via the unified log.