There are a number of mature web analytics products that have been on the market for ~20 years. Big data tools have only really taken off in the last 5 years. So why use big data tools to mine web analytics data?
In this presentation, I explore the limitations of traditional approaches to web analytics, and explain how big data tools can be used to address those limitations and drive more value from the underlying data. I explain how a combination of Snowplow and Qubole can be used to do this in practice.
Why use big data tools to do web analytics? And how to do it using Snowplow and Qubole
1. Using big data tools to analyse web analytics data
Why use big data tools to analyse web analytics data?
How would you use big data tools to analyse web analytics data (with Snowplow and Qubole)?
2. Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how that varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and lifetime value
• It tells you how customers and prospective customers engage with your different marketing campaigns and how that drives subsequent behaviour
Web analytics data should be essential to driving customer development, product development and marketing decisions
3. Deriving value from web analytics data often involves very bespoke analytics
• The web is a rich and varied space! E.g.
  • Bank
  • Newspaper
  • Social network
  • Analytics application
  • Government organisation (e.g. tax office)
  • Retailer
  • Marketplace
• For each type of business you’d expect different:
  • Types of events, with different types of associated data
  • Ecosystem of customers / partners with different types of relationships
  • Product development cycle (and approach to product development)
  • Types of business questions / priorities to inform how the data is analysed
4. Web analytics tools are good at delivering the standard reports that are common across different business types…
• Where does your traffic come from e.g.
  • Sessions by marketing campaign / referrer
  • Sessions by landing page
• Understanding events common across business types (page views, transactions, ‘goals’) e.g.
  • Page views per session
  • Page views per web page
  • Conversion rate by traffic source
  • Transaction value by traffic source
• Capturing contextual data common to people browsing the web
  • Timestamps
  • Referrer data
  • Web page data (e.g. page title, URL)
  • Browser data (e.g. type, plugins, language)
  • Operating system (e.g. type, timezone)
  • Hardware (e.g. mobile / tablet / desktop, screen resolution, colour depth)
5. …but not at enabling the high-value bespoke analytics
• What is the impact of different ad campaigns and creative on the way users behave subsequently? What is the return on that ad spend?
• How do visitors use social channels (Facebook / Twitter) to interact around video content? How can we predict which content will “go viral”?
• How do updates to our product change the “stickiness” of our service? ARPU? Does that vary by customer segment?
6. That is because there are significant limitations in the way traditional web analytics programmes handle:
Data collection
• Sample-based (e.g. Google Analytics)
• Limited set of events e.g. page views, goals, transactions
• Limited set of ways of describing events (custom dim 1, custom dim 2…)
Data processing
• Data is processed ‘once’
• No validation
• No opportunity to reprocess e.g. following update to business rules
• Data is aggregated prematurely
Data access
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
• Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
• As a result, data is siloed: hard to join with other data sets
7. We built Snowplow to address those limitations and enable high value, bespoke analytics on web event data
(Diagram labels: Data pipeline, Big data store)
Snowplow is a data pipeline:
• Captures data from website via Javascript tags
• Validates, cleans, and enriches the incoming data (using Hadoop)
• Loads the cleaned / enriched data into a big data store for analysis, e.g. S3, where it can be analysed using big data tools, e.g. Qubole
8. Understanding the technology that powers the Snowplow data pipeline
The Snowplow data pipeline consists of five loosely coupled modules:
9. Understanding the technology that powers the Snowplow data pipeline
Trackers generate event data
• Javascript tracker for collecting data client-side
• No-JS / pixel tracker (e.g. for email marketing)
• Server side trackers (e.g. Lua tracker). Python / Ruby / Java / Scala on roadmap
• Mobile trackers (iOS, Android on the roadmap…)
• Internet of things (e.g. Arduino tracker)
10. Understanding the technology that powers the Snowplow data pipeline
Collectors receive data and write it to a queue for processing
• Cloudfront collector writes data to S3
• Clojure collector sets a 3rd party cookie and writes to S3
• Scala RT collector sets a 3rd party cookie and writes to S3 AND Kinesis
11. Understanding the technology that powers the Snowplow data pipeline
Enrichment validates and enriches the data
• Validates e.g. checks expected fields are set for each event type
• Enrichments e.g. categorising referrers (search / social), inferring location from IP
• Hadoop-based enrichment module (easy reprocessing of data)
• Kinesis-based enrichment module (real time processing) in development
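The validate-then-enrich step described on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not Snowplow's actual Hadoop enrichment code: the required-field rules, the referrer lookup table, and the `page_referrer` / `refr_medium` field names are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumed rules for illustration only; not Snowplow's real validation rules
# or referrer list.
REQUIRED_FIELDS = {
    "page_view": ["page_urlpath"],
    "transaction": ["tr_total"],
}
REFERRER_MEDIUMS = {
    "www.google.com": "search",
    "www.bing.com": "search",
    "www.facebook.com": "social",
    "twitter.com": "social",
}


def validate(event):
    """Return the required fields that are missing for this event type."""
    required = REQUIRED_FIELDS.get(event.get("event"), [])
    return [field for field in required if event.get(field) in (None, "")]


def enrich(event):
    """Return a copy of the event with a referrer category added."""
    enriched = dict(event)
    host = urlparse(event.get("page_referrer") or "").netloc.lower()
    if not host:
        enriched["refr_medium"] = None  # no referrer, e.g. direct traffic
    else:
        enriched["refr_medium"] = REFERRER_MEDIUMS.get(host, "unknown")
    return enriched


def process(raw_events):
    """Split raw events into enriched good rows and rejected bad rows."""
    good, bad = [], []
    for event in raw_events:
        missing = validate(event)
        if missing:
            bad.append({"event": event, "missing": missing})
        else:
            good.append(enrich(event))
    return good, bad


raw = [
    {"event": "page_view", "page_urlpath": "/home",
     "page_referrer": "http://www.google.com/search?q=x"},
    {"event": "transaction"},  # rejected: tr_total is missing
]
good, bad = process(raw)
```

Because the raw events are kept separate from the processed output, changing `REQUIRED_FIELDS` or `REFERRER_MEDIUMS` and calling `process` again is all that reprocessing needs in this sketch.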
12. Understanding the technology that powers the Snowplow data pipeline
Storage – make data available for analysis
• Store data in Amazon S3 for processing using big data tools e.g. Qubole
• Also supports storage in Amazon Redshift / PostgreSQL for analysis using traditional BI tools
13. So what does Snowplow data look like?
• A single table
• One line of data per event
• Fat table: 98 different fields (and counting)…
Type of field | Example field(s) | Description
User ID | domain_userid, network_userid | Fields to identify the user performing the browsing. 1st and 3rd party cookie IDs, browser fingerprints, IP address and a separate field for setting a custom value are all available
Web page | page_urlpath | Fields that describe the web page the event occurred on, including document size, URL, title
Traffic source | mkt_source, refr_source | Fields that indicate the source of traffic. Snowplow includes fields that can be set via utm parameters and others based on the referrer
Event (rather than context) | event, se_action, tr_total | Fields that relate to a specific event (e.g. transaction total)
User tech setup | br_type, os_name, dvce_type, br_viewheight | Fields that describe the user’s browser / OS / device setup
… | … | …
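To make the "one line of data per event" idea concrete, here is a small Python sketch that reads event lines into dictionaries. It assumes the lines are tab-separated and uses a hypothetical five-column subset and ordering of the fields named on this slide; the real table is far wider and its layout should be taken from the Snowplow documentation.

```python
import csv
import io

# Hypothetical subset and ordering of columns, for illustration only.
COLUMNS = ["domain_userid", "page_urlpath", "mkt_source", "event", "tr_total"]


def parse_events(text):
    """Parse tab-separated event lines into a list of dicts, one per event."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = []
    for values in reader:
        row = dict(zip(COLUMNS, values))
        # In this sketch an empty string means the field was not set.
        row = {key: (value if value != "" else None) for key, value in row.items()}
        if row.get("tr_total") is not None:
            row["tr_total"] = float(row["tr_total"])
        rows.append(row)
    return rows


# Two invented events: a page view and a transaction by the same user.
sample = (
    "u1\t/home\tnewsletter\tpage_view\t\n"
    "u1\t/checkout\t\ttransaction\t40.0\n"
)
rows = parse_events(sample)
```

Most fields are empty for any given event, which is why the table is "fat": each event type only fills in the columns that apply to it.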
14. How do you analyse Snowplow data with Qubole?
• Common approach: use Hive on Qubole (could also use Pig or other Hadoop-based jobs)
• Create the events table (incl. recovering partitions)
• Write highly bespoke queries directly against the complete events table
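As an illustration of what a bespoke query against event-level data looks like, the Python sketch below attributes each user's transaction value to the first marketing source seen for that user. On Qubole this would be written as a Hive query over the events table; the Python version only shows the logic on a handful of in-memory rows. The field names come from the field table on slide 13, while the sample values, the `"transaction"` event label and the first-touch attribution rule are assumptions for the example.

```python
from collections import defaultdict

# Invented sample rows, assumed to be in time order.
events = [
    {"domain_userid": "u1", "mkt_source": "newsletter", "event": "page_view", "tr_total": None},
    {"domain_userid": "u1", "mkt_source": None, "event": "transaction", "tr_total": 40.0},
    {"domain_userid": "u2", "mkt_source": "adwords", "event": "page_view", "tr_total": None},
    {"domain_userid": "u3", "mkt_source": "adwords", "event": "page_view", "tr_total": None},
    {"domain_userid": "u3", "mkt_source": None, "event": "transaction", "tr_total": 25.0},
    {"domain_userid": "u3", "mkt_source": None, "event": "transaction", "tr_total": 15.0},
]


def revenue_by_first_source(rows):
    """Credit each user's total transaction value to their first mkt_source."""
    first_source = {}
    revenue = defaultdict(float)
    for row in rows:
        user = row["domain_userid"]
        if row["mkt_source"] and user not in first_source:
            first_source[user] = row["mkt_source"]
        if row["event"] == "transaction" and row["tr_total"] is not None:
            revenue[user] += row["tr_total"]

    users = defaultdict(int)
    totals = defaultdict(float)
    for user, source in first_source.items():
        users[source] += 1
        totals[source] += revenue[user]

    return {
        source: {
            "users": users[source],
            "revenue": totals[source],
            "revenue_per_user": totals[source] / users[source],
        }
        for source in users
    }


report = revenue_by_first_source(events)
```

A report like this is hard to build from pre-aggregated data, because it needs every event for a user to be available at query time.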
16. Performing more sophisticated analysis
• Unfortunately there isn’t time in this webinar to do a deeper demo…
• …however, there are resources available, in particular the Snowplow Analytics Cookbook - http://snowplowanalytics.com/analytics/index.html