Why use big data tools to do web analytics? And how to do it using Snowplow and Qubole

Using big data tools to analyse web analytics data

• Why use big data tools to analyse web analytics data?
• How would you use big data tools to analyse web analytics data (with Snowplow and Qubole)?

Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how that varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer journey)
• How behaviour drives value

• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and lifetime value

• It tells you how customers and prospective customers engage with your different
marketing campaigns and how that drives subsequent behaviour

Web analytics data should be essential to driving customer
development, product development and marketing decisions

Deriving value from web analytics data often involves very
bespoke analytics
• The web is a rich and varied space! E.g.
• Bank
• Newspaper
• Social network
• Analytics application
• Government organisation (e.g. tax office)
• Retailer
• Marketplace

• For each type of business you’d expect different:
• Types of events, with different types of associated data
• Ecosystem of customers / partners with different types of relationships
• Product development cycle (and approach to product development)
• Types of business questions / priorities to inform how the data is analysed

Web analytics tools are good at delivering the standard reports
that are common across different business types…
• Where does your traffic come from e.g.
• Sessions by marketing campaign / referrer
• Sessions by landing page

• Understanding events common across business types (page views, transactions, ‘goals’) e.g.
• Page views per session
• Page views per web page
• Conversion rate by traffic source
• Transaction value by traffic source

• Capturing contextual data common to people browsing the web
• Timestamps
• Referrer data
• Web page data (e.g. page title, URL)
• Browser data (e.g. type, plugins, language)
• Operating system (e.g. type, timezone)
• Hardware (e.g. mobile / tablet / desktop, screen resolution, colour depth)

…but not at enabling the high-value bespoke analytics
• What is the impact of different ad campaigns and creative on the way users
behave, subsequently? What is the return on that ad spend?

• How do visitors use social channels (Facebook / Twitter) to interact around video
content? How can we predict which content will “go viral”?

• How do updates to our product change the “stickiness” of our service? ARPU?
Does that vary by customer segment?

That is because there are significant limitations in the way
traditional web analytics programmes handle:

Data collection
• Sample-based (e.g. Google Analytics)
• Limited set of events e.g. page views, goals, transactions
• Limited set of ways of describing events (custom dim 1, custom dim 2…)

Data processing
• Data is processed ‘once’
• No validation
• No opportunity to reprocess e.g. following update to business rules
• Data is aggregated prematurely
• Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)

Data access
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
• As a result, data is siloed: hard to join with other data sets

We built Snowplow to address those limitations and enable high
value, bespoke analytics on web event data

[Diagram: Data pipeline → Big data store]

Snowplow is a data pipeline:
• Captures data from website via JavaScript tags
• Validates, cleans, and enriches the incoming data (using Hadoop)
• Loads the cleaned / enriched data into a big data store (e.g. S3), where it can be analysed using big data tools e.g. Qubole

Understanding the technology that powers the Snowplow data
pipeline
The Snowplow data pipeline consists of five loosely coupled modules:

Trackers generate event data
• JavaScript tracker for collecting data client-side
• No-JS / pixel tracker (e.g. for email marketing)
• Server side trackers (e.g. Lua tracker). Python / Ruby / Java / Scala on roadmap
• Mobile trackers (iOS, Android on the roadmap…)
• Internet of things (e.g. Arduino tracker)

Collectors receive data and write it to a queue for processing
• Cloudfront collector writes data to S3
• Clojure collector sets a 3rd party cookie and writes to S3
• Scala RT collector sets a 3rd party cookie and writes to S3 AND Kinesis

Enrichment validates and enriches the data
• Validates e.g. checks expected fields are set for each event type
• Enrichments e.g. categorising referrers (search / social), inferring location from IP
• Hadoop-based enrichment module (easy reprocessing of data)
• Kinesis-based enrichment module (real time processing) in development
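
The validation and enrichment steps listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not Snowplow's actual enrichment code: the required-field rules, the referrer lookup table, the input field `referrer_url` and the output field `refr_category` are all hypothetical names chosen for this example.

```python
from urllib.parse import urlparse

# Hypothetical rules: which fields must be set for each event type.
REQUIRED_FIELDS = {
    "page_view": ["page_urlpath"],
    "transaction": ["tr_total"],
    "struct": ["se_action"],
}

# Hypothetical lookup from referrer host to a referrer category.
REFERRER_CATEGORIES = {
    "www.google.com": "search",
    "www.bing.com": "search",
    "www.facebook.com": "social",
    "twitter.com": "social",
}


def validate(event):
    """Return a list of problems; an empty list means the event passed."""
    event_type = event.get("event")
    if event_type not in REQUIRED_FIELDS:
        return [f"unknown event type: {event_type!r}"]
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[event_type]
        if event.get(name) in (None, "")
    ]


def classify_referrer(referrer_url):
    """Categorise a referrer URL as search / social / unknown / direct."""
    if not referrer_url:
        return "direct"
    host = urlparse(referrer_url).netloc.lower()
    return REFERRER_CATEGORIES.get(host, "unknown")


def enrich(raw_events):
    """Split raw events into enriched good rows and rejected bad rows."""
    good, bad = [], []
    for event in raw_events:
        problems = validate(event)
        if problems:
            # Keep the raw event so it can be reprocessed after a rule change.
            bad.append({"event": event, "problems": problems})
        else:
            category = classify_referrer(event.get("referrer_url"))
            good.append({**event, "refr_category": category})
    return good, bad
```

Keeping the rejected rows alongside their raw input is what makes reprocessing straightforward: if the rules change, the same raw events can be run through `enrich` again.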

Storage – make data available for analysis
• Store data in Amazon S3 for processing using big data tools e.g. Qubole
• Also support storage in Amazon Redshift / PostgreSQL for analysis using
traditional BI tools

So what does Snowplow data look like?
• A single table
• One line of data per event
• Fat table: 98 different fields (and counting)…

| Type of field | Example field(s) | Description |
|---|---|---|
| User ID | domain_userid, network_userid | Fields that identify the user doing the browsing: 1st and 3rd party cookie IDs, browser fingerprints, IP address, and a separate field that can be set to a custom value |
| Web page | page_urlpath | Fields that describe the web page the event occurred on, including document size, URL, title |
| Traffic source | mkt_source, refr_source | Fields that indicate the source of traffic. Snowplow includes fields that can be set via utm parameters and others based on the referrer |
| Event (rather than context) | event, se_action, tr_total | Fields that relate to a specific event (e.g. transaction total) |
| User tech setup | br_type, os_name, dvce_type, br_viewheight | Fields that describe the user’s browser / OS / device setup |
| … | … | … |

How do you analyse Snowplow data with Qubole?
• Common approach: use Hive on Qubole (could also use Pig or other Hadoop-based jobs)
• Create the events table (incl. recovering partitions)
• Write highly bespoke queries directly against the complete events table
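
To give a flavour of what a bespoke query against the complete events table might look like, here is a minimal sketch. It uses Python's built-in SQLite as a stand-in for Hive so that it runs anywhere; the aggregate itself (COUNT DISTINCT, CASE, GROUP BY) should be expressible in much the same way in a Hive query. The table holds only a handful of the fields named above, the sample rows are made up, and the event values `page_view` and `transaction` are assumptions for this example.

```python
import sqlite3

# Made-up sample rows: (domain_userid, event, page_urlpath, mkt_source, tr_total)
SAMPLE_EVENTS = [
    ("u1", "page_view", "/home", "newsletter", None),
    ("u1", "transaction", "/checkout", "newsletter", 30.0),
    ("u2", "page_view", "/home", "adwords", None),
    ("u3", "page_view", "/home", "adwords", None),
    ("u3", "transaction", "/checkout", "adwords", 12.5),
]


def build_events_table():
    """Create an in-memory events table with one row per event."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE events (
            domain_userid TEXT,
            event         TEXT,
            page_urlpath  TEXT,
            mkt_source    TEXT,
            tr_total      REAL
        )
        """
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", SAMPLE_EVENTS)
    return conn


def revenue_by_source(conn):
    """Visitors and transaction value per marketing source, highest revenue first."""
    query = """
        SELECT mkt_source,
               COUNT(DISTINCT domain_userid) AS visitors,
               SUM(CASE WHEN event = 'transaction' THEN tr_total ELSE 0 END) AS revenue
        FROM events
        GROUP BY mkt_source
        ORDER BY revenue DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for source, visitors, revenue in revenue_by_source(build_events_table()):
        print(source, visitors, revenue)
```

Because every event is kept as its own row, the same table can be re-cut by any other field (page, browser, device) simply by changing the GROUP BY column, which is the flexibility that pre-aggregated reports lack.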

Performing more sophisticated analysis
• Unfortunately there isn’t time in this webinar to do a deeper demo…
• …however, there are resources available, in particular the Snowplow Analytics Cookbook: http://snowplowanalytics.com/analytics/index.html
