Understanding event data

Data Insights, Cambridge April 2015
In the beginning…
3 years ago, we were pretty frustrated…
We wanted our event data to inform product / platform development decisions. But…
• Hard to identify patterns in user behavior
• Hard to identify good / bad engagements
• We were using tools (GA / Adobe) to answer questions that the tools were not designed to support
Cloud services + open source big data technology -> we can
collect and warehouse the event-level data in a linearly scalable
way…
…and perform any analysis we want on it
Snowplow v1 was born…
• Every event on your website, represented as a line of data in your own data warehouse (on EMR)
• Data queried mostly using Apache Hive, but opportunity to use any Hadoop-based data processing framework
Lots of flexibility to perform more involved / advanced web analytics
Is web analytics data a subset of a broader category of digital
event data?
• A stream of events describes “what has happened” over time
• High volume of data: one line of data per event, potentially 1000s of events per second
• In all cases, having the complete, event-level dataset available for analysis makes it possible to use the data to answer a very broad set of questions
Could we extend the pipeline we built for web data to encompass digital
event data more generally?
Yes! Just build more trackers…
…BUT – what about the structure of the data? Doesn’t this vary
by event type? And aren’t we now looking at many more event
types?
Web events
• Page view
• Order
• Add to basket
• Page activity

All events
• Game saved
• Machine broke
• Car started
• Spellcheck run
• Screenshot taken
• Fridge empty
• App crashed
• Disk full
• SMS sent
• Screen viewed
• Tweet drafted
• Player died
• Taxi arrived
• Phonecall ended
• Cluster started
• Till opened
• Product returned
There are two historic approaches to dealing with the explosion
of possible event types
• Web analytics vendors: custom variables
• Mobile and app analytics vendors: schema-less JSONs
Custom variables are very restrictive
1. Take a standard web event, like a page view: Page View
2. Add custom variables until it becomes something totally different: Page View + vehicle=taxi23 + status=arrived = a “taxi arrived” event, kind of!
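To make the restriction concrete, here is a minimal sketch of what that shoehorning looks like as data (the slot and field names are hypothetical, not any particular vendor's API):

```python
# A "taxi arrived" event forced into the page-view-plus-custom-variables mould.
# The only first-class event type is still a page view; everything that makes
# the event meaningful is squeezed into opaque, positional custom slots.
taxi_arrived_as_page_view = {
    "event_type": "page_view",          # fixed vocabulary: cannot be changed
    "page_url": "app://dispatch/home",  # required, but meaningless for a taxi
    "custom_var_1": "vehicle=taxi23",
    "custom_var_2": "status=arrived",
}
```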
Schema-less JSONs are better, but they have a different set of
problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – HBO video played versus Brightcove video played

Lots of unanswered questions about the properties:
• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?

Other issues:
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze video played events?
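For illustration, a schema-less “video played” payload might look like the sketch below (the event and property names are typical rather than taken from any specific tracker). Nothing in the payload answers the questions above, and the len / length drift goes unnoticed until analysis time:

```python
# Two "video played" events sent by the same app a few releases apart.
video_played_a = {"event": "video played",
                  "properties": {"id": "hbo-123", "length": 3600}}
video_played_b = {"event": "Video Played",                   # name drifted
                  "properties": {"id": 456, "len": "3600"}}  # key and types drifted
# Without a schema both are accepted as-is, and the analyst inherits the mess.
```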
Our approach: schema our JSONs
When a developer or analyst defines a new event in JSON, let’s
ask them to create a JSON Schema for that event
The example schema answers the questions above explicitly:
• yes, length should always be a number
• no fields other than those declared are allowed
• additional optional fields we might not know about otherwise are declared up front
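A minimal sketch of such a schema for the video played event (the fields are illustrative, not the exact schema shown on the slide): `additionalProperties: false` enforces “no other fields allowed”, `length` is pinned to a number, and the optional `quality` field stands in for an addition we might not otherwise know about.

```python
video_played_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Schema for a video played event",
    "type": "object",
    "properties": {
        "id":      {"type": "string"},
        "length":  {"type": "number"},       # always a number, never a string
        "quality": {"enum": ["sd", "hd"]},   # optional field, declared up front
    },
    "required": ["id", "length"],
    "additionalProperties": False,           # no other fields allowed
}
```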
But we need to let our event definitions evolve, so let’s
add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc
• Try to stick to backwards-compatible ADDITION upgrades as much as possible
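As a sketch of the intended semantics (this helper is illustrative, not part of Snowplow): data that validated against an old schema should still validate after an ADDITION-only bump, whereas a REVISION or MODEL bump warns consumers that something may have changed underneath them.

```python
def is_addition_only_upgrade(old: str, new: str) -> bool:
    """True if `new` differs from `old` only in the ADDITION component,
    e.g. 1-0-0 -> 1-0-2, which is backwards-compatible by convention."""
    old_model, old_rev, old_add = (int(x) for x in old.split("-"))
    new_model, new_rev, new_add = (int(x) for x in new.split("-"))
    return (new_model, new_rev) == (old_model, old_rev) and new_add > old_add

assert is_addition_only_upgrade("1-0-0", "1-0-2")      # safe to roll forward
assert not is_addition_only_upgrade("1-0-2", "1-1-0")  # REVISION bump: check consumers
```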
We make the event JSONs self-describing, with a schema header
and data body
The schema field determines where the schema can be found in Iglu, our schema repository. Schemas are namespaced…
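Concretely, a self-describing event carries the schema reference in its header and the payload in its body (the vendor and event names below are illustrative):

```python
self_describing_event = {
    # Header: an Iglu URI of the form iglu:vendor/name/format/version, which
    # names the exact schema (and schema version) this payload conforms to.
    "schema": "iglu:com.acme_media/video_played/jsonschema/1-0-1",
    # Body: the event data itself, to be validated against that schema.
    "data": {"id": "hbo-123", "length": 3600, "quality": "hd"},
}
```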
For each event being processed, we can ‘look up’ the schema
from the repo and use it to drive event validation and loading of
event data into structured data stores
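A rough sketch of that lookup-and-validate step, using the `jsonschema` library and a hypothetical repository host; this is a toy version of what the pipeline's validation stage does:

```python
import json
import urllib.request

import jsonschema  # pip install jsonschema

IGLU_BASE = "http://iglu.example.com/schemas"  # hypothetical repository host

def validate_event(event: dict) -> dict:
    """Look up the event's schema in the repository and validate its body.

    Raises jsonschema.ValidationError (or urllib.error.URLError) so that
    invalid events can be diverted rather than loaded downstream.
    """
    schema_path = event["schema"].removeprefix("iglu:")  # vendor/name/format/version
    with urllib.request.urlopen(f"{IGLU_BASE}/{schema_path}") as resp:
        schema = json.load(resp)
    jsonschema.validate(instance=event["data"], schema=schema)
    return event["data"]
```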
Being able to load data into multiple different stores is very
valuable
• Different data stores support different types of analyses
• SQL DBs -> pivoting / OLAP
• Elasticsearch -> search, simple OLAP
• Graph databases -> pathing analysis
• Many of these data stores (all except Elasticsearch) are ‘structured’
• Having the data pass through the pipeline in a schemaed format means we do not need to manually structure it ourselves, which is expensive and error-prone
We are working on a second, real-time version of the Snowplow
data pipeline
Batch:
• Requests logged to S3
• Event data processed in Scalding / EMR and SQL

Real-time:
• Requests logged to Amazon Kinesis and Kafka
• Event data processed using Kinesis Client Library / Samza

Kinesis and Kafka enable us to publish and consume events from a distributed stream in a real-time, robust and scalable way
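To make the real-time half concrete, here is a rough sketch of reading events from a Kinesis stream with `boto3` (the stream name, region, and the assumption of JSON-encoded records are all hypothetical; a production consumer would use the Kinesis Client Library for checkpointing and multi-shard coordination, as above):

```python
import json
import time

import boto3  # pip install boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")
STREAM = "enriched-events"  # hypothetical stream name

# Single-shard demo: find the first shard and start reading new records.
shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        event = json.loads(record["Data"])  # assumes JSON-encoded records
        print(event)                        # hand off to your real consumer here
    iterator = batch["NextShardIterator"]
    time.sleep(1)                           # stay under per-shard read limits
```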
With the real-time pipeline, event data can be fed into data-driven applications alongside the data warehouse
[Architecture diagram: narrow data silos in the cloud vendor / own data center and at SaaS vendors (search, e-comm, CRM, ERP, CMS, email marketing, systems monitoring) support only some low-latency local loops. Feeding them, via trackers and streaming APIs / webhooks, into a unified log gives wide data coverage: the event stream holds a few days’ history at low latency, while the data warehouse holds the full history at higher latency. Both feed consumers such as ad hoc analytics, management reporting, product recommendations, fraud detection, churn prevention and APIs.]
This is exciting for data scientists
• Build predictive models based on the complete history of events in the data warehouse, e.g. forecast revenue for acquired users over lifetime…
• Put those models live on the same source of truth, as the data comes in in real time
• Use this approach for all types of applications: fraud detection, real-time personalization, product recommendation
[Diagram: data warehouse alongside a real-time data-driven application]
However, this only makes figuring out how to model and
describe events more important
• Lots of applications (not just offline reporting / the data warehouse) are fed off the event stream
• All those applications are decoupled: downstream applications have no control over the structure of the data generated upstream of them
• So the better able we are to specify (and constrain) the data structures upstream, the easier it’ll be to write downstream applications to consume the data
Can we specify a standard framework / structure for events?
What about a semantic model?
We can extend our self-describing JSON model to encapsulate
this semantic model…
This is something we need to put
more thought / research into
The next few months are pretty exciting
• Building out the real-time pipeline
• Encouraging an ecosystem of developers / partners to build apps to run on the real-time stream
• Developing the semantic model / event grammar

Any questions?