DataEngConf: The Science of Virality at BuzzFeed

THE DATA: OLD VERSION
Article being viewed
User viewing article
Time of pageview
Referring domain

THE DATA: NEW VERSION
Article being viewed
Time of pageview
Referring domain
User viewing article
Referring User

DIFFERENT PERSPECTIVE:
Pageviews are a process on a graph!

WHAT CAN YOU DO WITH
OLD PAGEVIEWS?
(Educated) Guess!

OLD GRAPH RECONSTRUCTION:
MODEL-BASED INFERENCE
Probabilistic: You can infer connections that aren’t there!
Error Prone: Graph statistics can be susceptible to small changes in the graph
Gets larger when differences in pageview times get smaller
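A minimal sketch of the kind of educated guess described above, assuming an exponential time-decay kernel (the talk does not state which model was used). The function name, the `rate` parameter, and the sample users are hypothetical.

```python
import math

def parent_probabilities(child_time, earlier_views, rate=1.0):
    """Guess who referred a pageview, given only users and timestamps.

    earlier_views maps user -> pageview time. Only users who viewed before
    child_time are candidates. The weight of a candidate grows as the time
    difference shrinks (assumed exponential decay, not the talk's model).
    """
    weights = {
        user: math.exp(-rate * (child_time - t))
        for user, t in earlier_views.items()
        if t < child_time
    }
    total = sum(weights.values())
    if not total:
        return {}
    return {user: w / total for user, w in weights.items()}

# User "c" viewed after the child, so it cannot be the referrer.
probs = parent_probabilities(10.0, {"a": 9.0, "b": 5.0, "c": 11.0})
guess = max(probs, key=probs.get)
```

Because the result is a probability rather than an observation, the guessed edge can be one that never existed, which is the error source the slide warns about.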

SIMPLIFIED VERSION:
Observe:
Guess:

SIMPLIFIED VERSION:
Guess:
Reality:

Check out a toy implementation here!
github.com/akellehe/pyconnie

NEW GRAPH RECONSTRUCTION:
TRIVIAL
These are actually Unique Visitors …

LIFE IS A LITTLE
MESSY…
This is more like what the pageview graph looks like

PROBLEM: DATA MUNGING
• Lots of potential for heuristics!
• How do we get promotion attribution from propagations?
• Trees are important: how can we be sure we get them?

PROBLEM: STREAMLINING
ANALYSIS
• How do we work from a common set of definitions?
• How do we avoid repeating analysis?
• How can we streamline data visualization? EDA?
• How do we share optimized analyses? And avoid inefficient (but correct) algorithms?

DEFINE DATA
STRUCTURES!
• All data munging happens “under the hood”
• Data pre-processing is unit-tested
• No room for heuristics: standardization!
• Hard math definitions can be consistency-checked!

PROPAGATION SET
For one article
For the site (or other set of articles, S)

PROPAGATION SET
Pageviews to article b in time T
Pageviews to the site in time T
The simplest data structure. Just a representation of the raw pageview logs.
Represented as a generator of UserEdge objects
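A minimal sketch of a propagation set as a generator of UserEdge objects. The field names and the row layout are assumptions drawn from the "new version" columns above. The actual UserEdge class is not shown in the slides.

```python
from collections import namedtuple

# Hypothetical record layout; the real UserEdge fields are not given in the talk.
UserEdge = namedtuple("UserEdge", ["referrer", "user", "article", "timestamp"])

def propagation_set(rows, article=None, t_start=float("-inf"), t_end=float("inf")):
    """Yield UserEdge objects for pageviews inside the time window T.

    rows is an iterable of (referrer, user, article, timestamp) tuples.
    Pass article=None to get the site-wide propagation set.
    """
    for referrer, user, art, ts in rows:
        if article is not None and art != article:
            continue
        if t_start <= ts < t_end:
            yield UserEdge(referrer, user, art, ts)

rows = [
    ("facebook.com", "u1", "b", 10),
    ("u1", "u2", "b", 12),
    ("u1", "u3", "c", 13),
    ("u2", "u4", "b", 50),
]
edges_b = list(propagation_set(rows, article="b", t_start=0, t_end=40))
edges_site = list(propagation_set(rows, t_start=0, t_end=40))
```

Using a generator keeps the structure a thin view over the raw logs: nothing is materialized until an analysis iterates over it.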

INFLUENCE GRAPH
Propagation graph together with a map that measures the influence of the origin user in p on the pageviewing user

PROPAGATION FOREST
The propagation graph is great, but we’d also like a concept like unique visitors!
If there is attribution ordering in the graph, we can trace content back to its source!

PROPAGATION FOREST: FIRST
PARENT ATTRIBUTION
n pageviews, one UV

PROPAGATION FOREST
gets the credit
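First-parent attribution can be sketched as keeping, for each user, only the earliest in-edge, so that n pageviews collapse to one unique visitor. The edge tuple layout `(parent, child, timestamp)` and the sample names are assumptions for illustration.

```python
def first_parent_attribution(edges):
    """Map each user to the (parent, timestamp) of their earliest pageview.

    edges is an iterable of (parent, child, timestamp) tuples. Later
    pageviews by the same user are dropped, so each user ends up with
    exactly one in-edge.
    """
    first = {}
    for parent, child, ts in edges:
        if child not in first or ts < first[child][1]:
            first[child] = (parent, ts)
    return first

edges = [
    ("promo", "u1", 1.0),
    ("u1", "u2", 3.0),
    ("u3", "u2", 2.0),   # earliest view by u2: u3 gets the credit
    ("u1", "u2", 5.0),
    ("promo", "u3", 1.5),
]
fpa = first_parent_attribution(edges)
```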

RESULT: ALL GRAPHS
ARE FORESTS
Promotions have 0 indegree,
Users have 1 indegree
total edges in connected components:
Trees!
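The tree claim can be checked numerically: when roots have indegree 0 and every other node has indegree 1, a connected component with n nodes has n - 1 edges. The sketch below counts nodes and edges per root. The `parent_of` mapping and the sample names are hypothetical.

```python
from collections import defaultdict

def component_stats(parent_of):
    """Return {root: (node_count, edge_count)} for a child -> parent mapping."""
    def root_of(node):
        seen = set()
        while node in parent_of:
            if node in seen:
                raise ValueError("cycle detected; break cycles first")
            seen.add(node)
            node = parent_of[node]
        return node

    nodes = set(parent_of) | set(parent_of.values())
    stats = defaultdict(lambda: [0, 0])
    for node in nodes:
        root = root_of(node)
        stats[root][0] += 1          # every node counts once
        if node in parent_of:
            stats[root][1] += 1      # every non-root contributes one in-edge
    return {root: tuple(counts) for root, counts in stats.items()}

forest = {"u1": "promo_a", "u2": "u1", "u3": "promo_b"}
stats = component_stats(forest)
```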

CAREFUL FOR EDGE
CASES: MISSING DATA?
All connected components should be rooted at a promotion source.
What happens if we lose the first edge (e.g. use the wrong T)?

PROPAGATION FOREST:
CYCLE BREAKING
Consider … Cycle is not broken by first-parent attribution
Traversal algorithms go on forever!

PROPAGATION FOREST:
CYCLE BREAKING
Consider …
As long as they’re not equal, they can be ordered, say
Then, there is a node in the cycle with an out-edge younger than its in-edge:
The original pageview for that node must have been lost. Cut the in-edge (FPA!).
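A minimal sketch of the cycle-breaking argument, under two assumptions: the graph has already been through first-parent attribution (one in-edge per node), and "younger" is read as "earlier timestamp". In a cycle with distinct timestamps, the node whose in-edge is the latest must have an out-edge that predates it, so that in-edge is the one cut. The function and variable names are hypothetical.

```python
def break_cycles(in_edge):
    """Cut one in-edge per cycle in a node -> (parent, timestamp) mapping."""
    result = dict(in_edge)
    done = set()
    for start in list(result):
        path, position = [], {}
        node = start
        # Walk parent pointers until we hit a root, a finished node, or a repeat.
        while node in result and node not in done:
            if node in position:
                cycle = path[position[node]:]
                # The latest in-edge belongs to a node that referred someone
                # before its own recorded pageview: its original view was lost.
                victim = max(cycle, key=lambda n: result[n][1])
                del result[victim]
                break
            position[node] = len(path)
            path.append(node)
            node = result[node][0]
        done.update(path)
    return result

# a <- c <- b <- a is a cycle; d hangs off c.
in_edge = {"a": ("c", 5), "b": ("a", 2), "c": ("b", 3), "d": ("c", 7)}
fixed = break_cycles(in_edge)
```

After the cut, the victim becomes a root that is not a promotion source, which is a useful flag for missing data.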

SUCCESS!
Cycle-breaking + FPA = Trees!
Each tree is the UV graph downstream from a promotion source: promotion attribution!
Additional Benefits:
Most information diffusion analyses model trees growing on graphs.
Many algorithms simplify when run on trees!

SUPERTREE
We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that?
Merge all the roots (promotion sources) together into one “super-node”
The whole forest becomes a SuperTree!
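The merge can be sketched on a child -> parent mapping: every edge that leaves a root is re-pointed at a single super-node. The `SUPER` label and the helper names are hypothetical.

```python
SUPER = "__super_root__"  # hypothetical label for the merged promotion sources

def to_supertree(parent_of):
    """Merge all roots of a forest (child -> parent mapping) into one node."""
    roots = set(parent_of.values()) - set(parent_of)
    return {
        child: (SUPER if parent in roots else parent)
        for child, parent in parent_of.items()
    }

def node_count(parent_of):
    """Number of distinct nodes in a child -> parent mapping."""
    return len(set(parent_of) | set(parent_of.values()))

forest = {"u1": "promo_a", "u2": "u1", "u3": "promo_b"}
supertree = to_supertree(forest)
```

Any single-tree routine (depth, size, branching statistics) can then be run once on `supertree` instead of once per promotion source.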

APPLICATION:
LARGE SCALE
DATA VIS

WHY IS IT SLOW?
Layouts often consider repelling each node from every other: O(n^2) time complexity
Good for a few thousand nodes
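To make the cost concrete, here is a minimal sketch of one all-pairs repulsion step in 2D. The inverse-distance force law and the function name are assumptions for illustration, not the layout algorithm of any particular tool. The pair counter shows the n(n - 1)/2 growth.

```python
import itertools

def repulsion_step(positions, strength=1.0):
    """Compute repulsive forces between every pair of nodes.

    positions maps node -> (x, y). Returns (forces, pair_count), where
    pair_count is n * (n - 1) / 2 for n nodes.
    """
    forces = {node: [0.0, 0.0] for node in positions}
    pair_count = 0
    for a, b in itertools.combinations(positions, 2):
        dx = positions[a][0] - positions[b][0]
        dy = positions[a][1] - positions[b][1]
        dist2 = (dx * dx + dy * dy) or 1e-9  # avoid division by zero
        fx, fy = strength * dx / dist2, strength * dy / dist2
        forces[a][0] += fx
        forces[a][1] += fy
        forces[b][0] -= fx
        forces[b][1] -= fy
        pair_count += 1
    return forces, pair_count

positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0), "d": (1.0, 1.0)}
forces, pair_count = repulsion_step(positions)
```

At 100k nodes a single step touches roughly 5 billion pairs, and a layout needs many steps.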

OPENORD: SIMULATED
ANNEALING
Linear main layout
Quadratic settling Phase
Implemented in Gephi

OPENORD
Good for ~10k Users
Slow for ~100k Users
Messy! (if you skip the quadratic step!)

TAKE ADVANTAGE OF
TREE STRUCTURE!
Traverse the tree to decide where to place nodes!

H3 LAYOUT
Each parent is in the center of a hemisphere.
Children are laid out on the surface of the hemisphere
They become centers of smaller hemispheres (if they’re parents)
Etc.
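A simplified sketch of the hemisphere idea in ordinary Euclidean space. It is not the pyh3 API and it omits the hyperbolic geometry and per-child orientation of the real H3 layout: every hemisphere here opens in the +z direction. The spiral placement, the `shrink` factor, and all names are assumptions.

```python
import math

def hemisphere_layout(children_of, root, radius=1.0, shrink=0.4):
    """Place each node's children on a hemisphere centered on that node.

    children_of maps node -> list of children. Returns node -> (x, y, z).
    One traversal of the tree, so the cost is linear in the node count.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = {root: (0.0, 0.0, 0.0)}
    stack = [(root, radius)]
    while stack:
        parent, r = stack.pop()
        px, py, pz = positions[parent]
        kids = children_of.get(parent, [])
        for i, kid in enumerate(kids):
            z = (i + 0.5) / len(kids)        # height in (0, 1): upper half only
            ring = math.sqrt(1.0 - z * z)    # radius of the circle at height z
            theta = golden_angle * i         # spiral spreads children evenly
            positions[kid] = (
                px + r * ring * math.cos(theta),
                py + r * ring * math.sin(theta),
                pz + r * z,
            )
            stack.append((kid, r * shrink))  # children get smaller hemispheres
    return positions

tree = {"root": ["a", "b", "c"], "a": ["d"]}
pos = hemisphere_layout(tree, "root")
```

No node is compared against any other node, which is why a tree-aware layout avoids the all-pairs cost of force-directed methods.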

A NEW IMPLEMENTATION
pip install pyh3

GRAPH AND TEMPORAL PROPERTIES
ARE IMPORTANT!

TEST THE INFLUENTIALS
HYPOTHESIS

FINDING THE CAUSES OF
VIRALITY
Consider Fitting a Model:
User features, content features, context features, user-pair features

UNDER CONSTRUCTION:
Online Regression!
Real-time feature weights tell which features correlate with propagation probabilities!
Drives hypothesis-building!
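One way such an online regression could look is logistic regression updated by stochastic gradient descent, one exposure at a time. This is a sketch under that assumption. The talk does not say which model is being built, and the feature names below are made up.

```python
import math

class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}

    def predict(self, features):
        """Estimated probability that the content propagates."""
        score = sum(self.weights.get(name, 0.0) * value
                    for name, value in features.items())
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, propagated):
        """Take one gradient step; propagated is 1 if a share led to a view."""
        error = propagated - self.predict(features)
        for name, value in features.items():
            self.weights[name] = (self.weights.get(name, 0.0)
                                  + self.learning_rate * error * value)

model = OnlineLogisticRegression()
for _ in range(200):
    # Hypothetical stream: propagation happens only when the pair feature is set.
    model.update({"bias": 1.0, "same_city": 1.0}, 1)
    model.update({"bias": 1.0, "same_city": 0.0}, 0)
```

The weights can be read at any point in the stream, so a feature whose weight drifts upward is a candidate hypothesis to test, not a confirmed cause.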
