By Adam Kelleher (Sr Data Scientist, BuzzFeed)
BuzzFeed has developed the technology to attribute pageviews to a referring user. Using these data, we can construct diffusion graphs for our articles. These graphs introduce a whole collection of new performance metrics, and their complexity opens the door for a new assortment of complications to go with them. I'll mention some past work that has been done to create similar data from old pageview events. Then, I'll work through how we process these data into graph objects (avoiding some pitfalls), and mention some of the new ways of looking at web analytics implied by these objects. I'll talk about how we can take advantage of the structure of these objects to make certain algorithms more efficient. Finally, I'll cover some of the future applications we're particularly excited about!
13. OLD GRAPH RECONSTRUCTION: MODEL-BASED INFERENCE
Probabilistic: You can infer connections that aren’t there!
Error Prone: Graph statistics can be susceptible to small changes in the graph
Gets larger when differences in pageview times get smaller
18. LIFE IS A LITTLE MESSY…
This is more like what the pageview graph looks like
19. PROBLEM: DATA MUNGING
• Lots of potential for heuristics!
• How do we get promotion attribution from propagations?
• Trees are important: how can we be sure we get them?
20. PROBLEM: STREAMLINING ANALYSIS
• How do we work from a common set of definitions?
• How do we avoid repeating analysis?
• How can we streamline data visualization? EDA?
• How do we share optimized analyses? And avoid inefficient (but correct) algorithms?
21. DEFINE DATA STRUCTURES!
• All data munging happens “under the hood”
• Data pre-processing is unit-tested
• No room for heuristics: standardization!
• Hard math definitions can be consistency-checked!
23. PROPAGATION SET
Pageviews to article b in time T
Pageviews to the site in time T
The simplest data structure. Just a representation of the raw pageview logs.
Represented as a generator of UserEdge objects
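A minimal sketch of what such a generator could look like. The slide only says the propagation set is a generator of UserEdge objects, so the field names (`source`, `user`, `article`, `timestamp`), the `propagation_set` helper, and the sample log rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class UserEdge:
    """One attributed pageview (field names are hypothetical)."""
    source: str       # referring user or promotion
    user: str         # user who viewed the page
    article: str      # article that was viewed
    timestamp: float  # time of the pageview


def propagation_set(
    rows: Iterable[Tuple[str, str, str, float]],
    start: float,
    end: float,
    article: Optional[str] = None,
) -> Iterator[UserEdge]:
    """Lazily yield edges for pageviews in [start, end).

    With `article` set, only pageviews to that article are kept;
    with `article=None`, all pageviews to the site are kept.
    """
    for source, user, art, ts in rows:
        if start <= ts < end and (article is None or art == article):
            yield UserEdge(source, user, art, ts)


# Made-up log rows: (source, user, article, timestamp)
log = [
    ("facebook_promo", "alice", "b", 1.0),
    ("alice", "bob", "b", 2.5),
    ("alice", "carol", "c", 3.0),
    ("bob", "dave", "b", 9.0),
]

edges_b = list(propagation_set(log, start=0.0, end=5.0, article="b"))
site_edges = list(propagation_set(log, start=0.0, end=5.0))
```

Keeping this a generator means the raw logs never have to be held in memory at once; the two calls correspond to the two cases on the slide (one article, or the whole site, in time T).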
30. PROPAGATION FOREST
The propagation graph is great, but we’d also like a concept like unique visitors!
If there is attribution ordering in the graph, we can trace content back to its source!
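As a rough illustration of tracing content back to its source, the sketch below assumes each user has been reduced to a single attributed referrer, stored in a hypothetical `parent_of` mapping. The names and the sample data are invented for the example.

```python
from collections import Counter


def trace_to_source(user, parent_of):
    """Follow referrer links upward until a node with no referrer is reached.

    Returns the path [user, ..., source]. Raises if the links loop back,
    since a loop means there is no source to find.
    """
    path = [user]
    seen = {user}
    while path[-1] in parent_of:
        nxt = parent_of[path[-1]]
        if nxt in seen:
            raise ValueError("cycle encountered while tracing %r" % user)
        seen.add(nxt)
        path.append(nxt)
    return path


# Hypothetical attribution: user -> the node that referred them
parent_of = {
    "alice": "facebook_promo",
    "bob": "alice",
    "carol": "alice",
    "dave": "twitter_promo",
}

bob_path = trace_to_source("bob", parent_of)

# Each user appears once, so counting users per source gives a
# unique-visitor style number for every promotion.
uv_per_source = Counter(trace_to_source(u, parent_of)[-1] for u in parent_of)
```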
33. RESULT: ALL GRAPHS ARE FORESTS
Promotions have 0 indegree, users have 1 indegree
total edges in connected components:
Trees!
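The counting argument can be checked mechanically. In a connected component with n nodes, one promotion of indegree 0 and n - 1 users of indegree 1 give n - 1 edges in total, and a connected graph with n - 1 edges is a tree. The sketch below tests that condition on a plain list of `(parent, child)` pairs; the function names and sample edges are assumptions for illustration.

```python
from collections import defaultdict


def component_stats(edges):
    """Return a list of (n_nodes, n_edges), one per weakly connected component."""
    edges = list(edges)
    rep = {}

    def find(x):
        # Union-find lookup with path halving
        rep.setdefault(x, x)
        while rep[x] != x:
            rep[x] = rep[rep[x]]
            x = rep[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            rep[ru] = rv

    nodes = defaultdict(set)
    n_edges = defaultdict(int)
    for u, v in edges:
        r = find(u)
        nodes[r].update((u, v))
        n_edges[r] += 1
    return [(len(nodes[r]), n_edges[r]) for r in nodes]


def is_forest(edges):
    """True if no node has indegree > 1 and every component has n - 1 edges."""
    edges = list(edges)
    indegree = defaultdict(int)
    for _, child in edges:
        indegree[child] += 1
    if any(d > 1 for d in indegree.values()):
        return False
    return all(e == n - 1 for n, e in component_stats(edges))


forest = [("promo1", "alice"), ("alice", "bob"), ("promo2", "carol")]
cyclic = [("alice", "bob"), ("bob", "alice")]
```

Here `is_forest(forest)` holds, while `cyclic` fails: both nodes have indegree 1, so the component has as many edges as nodes and no promotion at its root.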
34. CAREFUL FOR EDGE CASES: MISSING DATA?
All connected components should be rooted at a promotion source.
What happens if we lose the first edge (e.g. use the wrong T)?
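One way to spot this situation is to look for roots that are not promotions. The sketch assumes promotion sources can be recognized from a known set of identifiers, which the slides do not specify; `orphan_roots` and the sample data are hypothetical.

```python
def orphan_roots(edges, promotions):
    """Return nodes with no in-edge that are not known promotion sources.

    edges: iterable of (parent, child) pairs.
    promotions: iterable of node ids known to be promotion sources (assumed).
    """
    edges = list(edges)
    children = {child for _, child in edges}
    roots = {parent for parent, _ in edges} - children
    return roots - set(promotions)


# The edge ("promo1", "alice") fell outside the time window and was lost,
# so alice now looks like the root of her component.
edges = [("alice", "bob"), ("bob", "carol"), ("promo2", "dave")]
missing = orphan_roots(edges, promotions={"promo1", "promo2"})
```

A component that has collapsed into a cycle has no root at all, so it will not show up here.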
36. PROPAGATION FOREST: CYCLE BREAKING
Consider …
As long as they’re not equal, they can be ordered, say
Then, there is a node in the cycle with an out-edge younger than its in-edge:
The original pageview for that node must have been lost. Cut the in-edge (FPA!).
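A sketch of the cutting step, under several assumptions: edges are `(parent, child, timestamp)` triples, every node has at most one in-edge, and "younger" means an earlier timestamp (the node passed the article on before the in-edge we recorded for it). When several nodes in a cycle qualify, this version picks the source of the earliest edge in the cycle, which is only one possible choice. The function name and sample data are invented for the example.

```python
def break_cycles(edges):
    """Remove one in-edge per cycle so the result contains no cycles.

    edges: list of (parent, child, timestamp); each child appears at most once.
    """
    in_edge = {child: (parent, ts) for parent, child, ts in edges}
    state = {}   # node -> "visiting" or "done"
    cut = set()  # nodes whose in-edge will be removed

    for start in list(in_edge):
        path = []
        node = start
        # Walk upward through in-edges until we hit a root or a seen node
        while node in in_edge and node not in state:
            state[node] = "visiting"
            path.append(node)
            node = in_edge[node][0]
        if state.get(node) == "visiting":
            # The walk returned to a node on the current path: a cycle
            cycle = path[path.index(node):]
            # Child end of the earliest edge in the cycle
            earliest_child = min(cycle, key=lambda n: in_edge[n][1])
            # Its parent sent that edge before its own recorded in-edge,
            # so the parent's original pageview was lost: cut its in-edge
            cut.add(in_edge[earliest_child][0])
        for n in path:
            state[n] = "done"

    return [e for e in edges if e[1] not in cut]


# a -> b -> c -> a is a cycle; c -> d hangs off it
edges = [("a", "b", 3.0), ("b", "c", 1.0), ("c", "a", 2.0), ("c", "d", 4.0)]
acyclic = break_cycles(edges)
```

In the example, b passes the article to c at time 1.0 but is only recorded as receiving it from a at time 3.0, so the edge a -> b is cut and b becomes the root of the resulting tree. If timestamps in a cycle tie, the sketch still removes an edge, but the ordering argument from the slide no longer justifies which one.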
37. SUCCESS!
Cycle-breaking + FPA = Trees!
Each tree is the UV graph downstream from a promotion source: promotion attribution!
Additional Benefits:
Most information diffusion analyses model trees growing on graphs.
Many algorithms simplify when run on trees!
38. SUPERTREE
We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that?
Merge all the roots (promotion sources) together into one “super-node”
The whole forest becomes a SuperTree!
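The merge can be sketched on a list of `(parent, child)` pairs by rewriting every root as a single shared node. The `SUPER_ROOT` label, the function name, and the sample forest are assumptions for illustration.

```python
SUPER_ROOT = "__super__"  # hypothetical label for the merged root


def to_supertree(edges, super_root=SUPER_ROOT):
    """Replace every root (a parent that is never a child) with one super-node."""
    edges = list(edges)
    children = {child for _, child in edges}
    return [
        (super_root if parent not in children else parent, child)
        for parent, child in edges
    ]


forest = [("promo1", "alice"), ("alice", "bob"), ("promo2", "carol")]
supertree = to_supertree(forest)

# Any single-tree routine can now run once over the whole forest,
# for example counting the nodes directly below the root.
first_generation = [child for parent, child in supertree if parent == SUPER_ROOT]
```

The per-promotion identity of each root is lost in the merge, so this form suits statistics over the whole forest rather than per-source attribution.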
46. H3 LAYOUT
Each parent is in the center of a hemisphere.
Children are laid out on the surface of the hemisphere
They become centers of smaller hemispheres (if they’re parents)
Etc.
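The recursion described above can be illustrated with a much simplified Euclidean stand-in. This is not the H3 algorithm itself: it makes no attempt at hyperbolic geometry, it points every hemisphere along +z, and the spiral placement, the `shrink` factor, and all names are assumptions made for the sketch.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def hemisphere_layout(children, root, radius=1.0, shrink=0.4):
    """Return {node: (x, y, z)} for a tree given as {parent: [children]}.

    Each node's children are spread over the upper hemisphere of the given
    radius centered on the node; each child then becomes the center of a
    hemisphere whose radius is smaller by the factor `shrink`.
    """
    pos = {root: (0.0, 0.0, 0.0)}
    stack = [(root, radius)]
    while stack:
        node, r = stack.pop()
        kids = children.get(node, [])
        cx, cy, cz = pos[node]
        for i, kid in enumerate(kids):
            z = (i + 0.5) / len(kids)      # height on the unit hemisphere
            ring = math.sqrt(1.0 - z * z)  # radius of the circle at that height
            theta = i * GOLDEN_ANGLE       # spiral around the vertical axis
            pos[kid] = (
                cx + r * ring * math.cos(theta),
                cy + r * ring * math.sin(theta),
                cz + r * z,
            )
            stack.append((kid, r * shrink))
    return pos


tree = {"root": ["a", "b", "c"], "a": ["d"]}
pos = hemisphere_layout(tree, "root")
```

Every child ends up exactly one hemisphere radius away from its parent, and deeper levels sit on progressively smaller hemispheres.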