A talk about the journey we have been on at Snowplow in thinking about event data: starting with our focus on web and then mobile analytics, and exploring our current and future technical and analytical approaches
3. 3 years ago, we were pretty frustrated…
Web analytics data was feeding product / platform development decisions. But…
• Hard to identify patterns in user behavior
• Hard to identify good / bad engagements
• We were using tools (GA / Adobe) to answer questions that those tools were not designed to support
4. Cloud services + open source big data technology -> we can
collect and warehouse the event-level data in a linearly scalable
way…
…and perform any analysis we want on it
5. Snowplow v1 was born…
• Every event on your website, represented as a line of data in your own data
warehouse (on EMR)
• Data queried mostly using Apache Hive, but opportunity to use any Hadoop-based
data processing framework
Lots of flexibility to perform more involved / advanced web analytics
6. Is web analytics data a subset of a broader category of digital
event data?
• A stream of events describes “what has happened” over time
• High volume of data: one line of data per event, potentially 1000s of events per
second
• In all cases, having the complete, event-level dataset available for analysis provides
the possibility of using the data to answer a very broad set of questions
Could we extend the pipeline we built for web data to encompass digital
event data more generally?
8. …BUT – what about the structure of the data? Doesn’t this vary
by event type? And aren’t we now looking at many more event
types?
Web events:
• Page view • Order • Add to basket • Page activity
All events:
• Game saved • Machine broke • Car started
• Spellcheck run • Screenshot taken • Fridge empty
• App crashed • Disk full • SMS sent
• Screen viewed • Tweet drafted • Player died
• Taxi arrived • Phonecall ended • Cluster started
• Till opened • Product returned
9. There are two historical approaches to dealing with the explosion of possible event types
• Web analytics vendors: custom variables
• Mobile and app analytics vendors: schema-less JSONs
10. Custom variables are very restrictive
1. Take a standard web event, like a page view: Page View
2. Add custom variables until it becomes something totally different:
Page View + vehicle=taxi23 + status=arrived = a “taxi arrived” event, kind of!
11. Schema-less JSONs are better, but they have a different set of
problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – HBO video played versus Brightcove video played
Lots of unanswered questions about the properties:
• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?
Other issues:
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze video played events?
13. When a developer or analyst defines a new event in JSON, let’s
ask them to create a JSON Schema for that event
[The slide shows an example JSON Schema, annotated:]
• Yes – length should always be a number
• No other fields allowed
• An additional optional field is documented that we might not know about otherwise
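As a rough sketch of the idea (hypothetical vendor, event name and fields – not an actual Snowplow schema), such a JSON Schema might look like this, written here as a Python dict:

```python
# Hypothetical JSON Schema for a "video_played" event, expressed as a
# Python dict. Field names and requirements are illustrative only.
video_played_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Schema for a video played event",
    "type": "object",
    "properties": {
        "id": {"type": "string"},            # which video was played
        "length": {"type": "number"},        # length is always a number
        "quality": {                         # the additional optional field we
            "enum": ["low", "standard", "hd"]  # might not otherwise know about
        },
    },
    "required": ["id", "length"],            # id and length must be present
    "additionalProperties": False,           # no other fields allowed
}
```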
14. But we need to let our event definitions evolve, so let’s
add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc
• Try to stick to backwards-compatible ADDITION upgrades as much
as possible
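A small illustrative sketch (not Snowplow's actual tooling) of how SchemaVer versions could be parsed and compared:

```python
# Illustrative only: parse a SchemaVer string (MODEL-REVISION-ADDITION) and
# check whether moving between two versions is a backwards-compatible,
# ADDITION-only upgrade.
def parse_schemaver(version):
    model, revision, addition = (int(part) for part in version.split("-"))
    return model, revision, addition

def is_addition_upgrade(old, new):
    old_model, old_revision, old_addition = parse_schemaver(old)
    new_model, new_revision, new_addition = parse_schemaver(new)
    return (new_model, new_revision) == (old_model, old_revision) and new_addition > old_addition

print(is_addition_upgrade("1-0-0", "1-0-2"))  # True - only ADDITION changed
print(is_addition_upgrade("1-0-2", "1-1-0"))  # False - REVISION changed
```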
15. We make the event JSONs self-describing, with a schema header
and data body
The schema field determines
where the schema can be found in
Iglu, our schema repository
Schemas are namespaced…
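For example (hypothetical vendor and payload, shown as a Python dict), a self-describing event might look like this:

```python
# A self-describing event: a "schema" header pointing at the schema's
# location in the Iglu repository, plus a "data" body with the event itself.
# The vendor (com.acme) and the payload values are hypothetical.
self_describing_event = {
    "schema": "iglu:com.acme/video_played/jsonschema/1-0-2",
    "data": {
        "id": "hbo-winter-is-coming",
        "length": 213.5,
        "quality": "hd",
    },
}
```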
16. For each event being processed, we can ‘look up’ the schema
from the repo and use it to drive event validation and loading of
event data into structured data stores
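A minimal sketch of that lookup-and-validate step, using the Python jsonschema library; resolve_schema is a hypothetical stand-in for a real Iglu repository lookup:

```python
# Illustrative validation step: resolve the schema named in the event's
# "schema" field, then validate the "data" body against it.
from jsonschema import validate, ValidationError

def resolve_schema(schema_uri):
    # Hypothetical stand-in for an Iglu lookup; returns a fixed schema here.
    return {
        "type": "object",
        "properties": {"id": {"type": "string"}, "length": {"type": "number"}},
        "required": ["id", "length"],
        "additionalProperties": False,
    }

def process(event):
    schema = resolve_schema(event["schema"])
    try:
        validate(instance=event["data"], schema=schema)
        return ("good", event)       # safe to load into structured stores
    except ValidationError as err:
        return ("bad", str(err))     # route elsewhere for inspection

event = {
    "schema": "iglu:com.acme/video_played/jsonschema/1-0-0",
    "data": {"id": "hbo-winter-is-coming", "length": 213.5},
}
print(process(event))
```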
17. Being able to load data into multiple different stores is very
valuable
• Different data stores support different types of analyses:
• SQL DBs -> pivoting / OLAP
• Elasticsearch -> search, simple OLAP
• Graph databases -> pathing analysis
• Many of these data stores (all except Elasticsearch) are ‘structured’
• Having the data pass through the pipeline in a schemaed format means we do not need to manually structure it ourselves, which is expensive and error-prone (see the sketch below)
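As a rough illustration of why this matters (a hypothetical mapping, not Snowplow's actual loaders), a schema can drive table creation automatically:

```python
# Illustrative only: derive a SQL CREATE TABLE statement from a JSON Schema,
# so the table structure does not have to be hand-written per event type.
TYPE_MAP = {"string": "VARCHAR(255)", "number": "DOUBLE PRECISION",
            "integer": "BIGINT", "boolean": "BOOLEAN"}

def schema_to_ddl(table_name, schema):
    columns = []
    for name, spec in schema["properties"].items():
        sql_type = TYPE_MAP.get(spec.get("type"), "VARCHAR(4096)")
        constraint = " NOT NULL" if name in schema.get("required", []) else ""
        columns.append(f"  {name} {sql_type}{constraint}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"

video_played = {
    "properties": {"id": {"type": "string"}, "length": {"type": "number"}},
    "required": ["id", "length"],
}
print(schema_to_ddl("video_played_1", video_played))
```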
18. We are working on a second, real-time version of the Snowplow
data pipeline
Batch:
• Requests logged to S3
• Event data processed in Scalding / EMR and SQL
Real-time:
• Requests logged to Amazon Kinesis and Kafka
• Event data processed using the Kinesis Client Library / Samza
Kinesis and Kafka enable us to publish and consume events from a distributed stream in a real-time, robust and scalable way
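As a minimal sketch of what consuming from such a stream looks like (using the kafka-python client; the topic name and JSON payload format are assumptions for illustration, not the pipeline's actual configuration):

```python
# Illustrative consumer: read enriched events from a Kafka topic and
# decode each record as JSON.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "enriched-events",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    print(event.get("schema"), event.get("data"))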
19. With the real-time pipeline, event data can be fed into data-driven applications alongside the data warehouse
[Diagram: narrow data silos (CMS, CRM, ERP, e-commerce, search, email marketing – spread across your own data center, cloud vendors and SaaS vendors) feed events via streaming APIs / web hooks into a unified log. The unified log holds a few days’ data history with low latency and wide data coverage; it feeds low-latency data-driven applications (systems monitoring, product recommendations, fraud detection, churn prevention, APIs) and some low-latency local loops, as well as the data warehouse, which holds the full data history at higher latency and supports ad hoc analytics and management reporting.]
20. This is exciting for data scientists
• Build predictive models based on the complete history of events in the data warehouse, e.g. forecast lifetime revenue for acquired users…
• Put those models live on the same source of truth, as the data comes in in real time (see the sketch below)
• Use this approach for all types of applications: fraud detection, real-time personalization, product recommendation
[Diagram: the data warehouse and a real-time data-driven application fed from the same event stream]
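A minimal sketch of that pattern: fit a model on the full history in the warehouse, then score events with the same model as they arrive on the stream. The warehouse connection, table, columns and churn label are all hypothetical placeholders:

```python
# Illustrative only: offline training on warehouse history, online scoring
# of streaming events with the same model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sqlalchemy import create_engine

# Hypothetical warehouse connection and summary table
warehouse = create_engine("postgresql://user:password@warehouse.example.com/analytics")
history = pd.read_sql("SELECT sessions, days_active, churned FROM user_summary", warehouse)

# 1. Offline: fit the model on the complete history
model = LogisticRegression()
model.fit(history[["sessions", "days_active"]], history["churned"])

# 2. Online: score each incoming event with the same model
def score_event(event):
    features = [[event["sessions"], event["days_active"]]]
    return model.predict_proba(features)[0][1]  # e.g. probability of churn
```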
21. However, this only makes figuring out how to model and
describe events more important
• Lots of applications (not just offline reporting / data warehouse) fed off the
event stream
• All those applications are decoupled: downstream applications have no control over the structure of the data generated upstream of them
• So the better able we are to specify (and constrain) the data structures
upstream, the easier it’ll be to write downstream applications to consume the
data
22. Can we specify a standard framework / structure for events?
What about a semantic model?
23. We can extend our self-describing JSON model to encapsulate
this semantic model…
This is something we need to put
more thought / research into
24. The next few months are pretty exciting
• Building out the real-time pipeline
• Encouraging an ecosystem of developers / partners to build apps to
run on the real-time stream
• Developing the semantic model / event grammar
Any
questions?