4. Motivation
• CxO
– Page Views, Unique Visitors, Dollars, Subscriptions
• Editor / Product Manager
– Time Spent, Comments
• Users
– Content
What matters on a website?
5. Key Usage Metrics
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– In-App Purchase
6. The Quest for the Missing Proxy
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– User Satisfaction
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– User Satisfaction
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– User Satisfaction
– In-App Purchase
7. Question
How to measure and drive user satisfaction on large websites with very diverse usage patterns?
8. The Problem
• Newcomers from Google News
• People coming from Twitter and Facebook posts
• People coming to the website almost every day
• People who love to comment
• Foreigners
• Robots
• People fond of the sport section only
• …
BEHAVIOUR DIVERSITY
THE AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS
9. SubProblem 1: Hard Segments
• Segment users by number of visits per month
– > 20 days per month -> Engaged Users
• Segment by converted (transformed) or not
• Segment by country
10. Subproblem 2: Hard Metrics
• Newspaper
– Time spent on the website
– log(Number of page views) + Number of actions
• Search engine
– Click ratio
• E-Commerce
– Conversion (transformation) ratio
12. Semi-Supervised Learning
• Supervised Learning: training data is all labeled data -> Model
• Unsupervised Learning: training data is all unlabeled data -> Model
• Semi-Supervised Learning: training data is some labeled data plus lots of unlabeled data -> Model
13. ½ SL – Natural Language Processing
I hope I’ll enjoy Amsterdam, and not only because of Hadoop
Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop
Statistical Knowledge
Text Structure
(Unsupervised)
Aligned Corpus
(Supervised)
14. ½ SL Applied to Web Sessions
Lots of customer sessions
Not much concrete customer feedback
Subscription
15. Semi-Supervised Learning
3 Approaches
• Generative models, e.g. Gaussian fits
– Assume all data fits a Gaussian distribution with parameter X
– Find the X that best fits the distribution of both labeled and unlabeled data
• Fits with costs
– Supervised learning with a cost function that captures a distance between points, derived from the structure of the unlabeled data
• Ad-hoc: combine unsupervised, then supervised
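The ad-hoc approach can be illustrated with a minimal sketch: run any unsupervised step first, then let the few labeled points name the clusters. The function name, the majority-vote rule, and the sample data below are assumptions made for illustration; the slides do not specify how the two steps are combined.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, labels):
    """Give every point the majority label of the labeled members of its cluster.

    cluster_ids: one cluster id per point (output of any unsupervised step).
    labels: one entry per point, None when the point is unlabeled.
    """
    votes = defaultdict(Counter)
    for cid, lab in zip(cluster_ids, labels):
        if lab is not None:
            votes[cid][lab] += 1
    # Clusters without any labeled member stay unlabeled (None).
    majority = {cid: c.most_common(1)[0][0] for cid, c in votes.items()}
    return [majority.get(cid) for cid in cluster_ids]

# Hypothetical data: 6 sessions in 3 clusters, only 2 of them labeled.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["fan", None, None, None, "bot", None]
propagated = label_clusters(cluster_ids, labels)
```

Cluster 2 has no labeled member, so its session stays unlabeled; in practice that is a signal to go and label a sample from that cluster.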
19. Our Approach
1. (Lots of) data preparation to build meaningful user sessions
2. Cluster sessions and have end users validate/tag those clusters
3. Create predictive user satisfaction metrics
4. Follow those metrics!
20. Data Prep: Overview
RAW DATA
Step 1: Build Sessions (Pig)
Step 2: Parse IP/Time/.. (Custom Python)
Step 3: Parse Sequences (Hive or custom Python)
Step 4: Build user-level stats (Hive)
READY FOR ML
21. Step 1. Build Session
• Use Hive (or Pig)
• Group into “Session”
• Depending on the variable
– IP, Device: select only one per log
– URL, Event: create an ordered array that represents the sequence of events in the session
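The grouping logic is the same whichever engine runs it. Below is a small Python sketch of it, assuming log rows of the form (visitor id, timestamp in seconds, IP, device, URL) and a 30-minute inactivity timeout; both the row layout and the timeout are assumptions, since the slides do not say how session boundaries are chosen.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides

def build_sessions(log_rows, gap=SESSION_GAP_SECONDS):
    """Group (visitor_id, timestamp, ip, device, url) rows into sessions."""
    by_visitor = defaultdict(list)
    for row in log_rows:
        by_visitor[row[0]].append(row)

    sessions = []
    for visitor, rows in by_visitor.items():
        rows.sort(key=lambda r: r[1])  # order hits by timestamp
        current = None
        for _, ts, ip, device, url in rows:
            if current is None or ts - current["end"] > gap:
                # New session: keep a single IP and device for the whole session.
                current = {"visitor": visitor, "start": ts, "end": ts,
                           "ip": ip, "device": device, "urls": []}
                sessions.append(current)
            current["end"] = ts
            current["urls"].append(url)  # ordered sequence of events
    return sessions

rows = [
    ("c1", 0, "1.2.3.4", "mobile", "/home"),
    ("c1", 60, "1.2.3.4", "mobile", "/products"),
    ("c1", 4000, "1.2.3.4", "mobile", "/blog"),
    ("c2", 10, "5.6.7.8", "desktop", "/home"),
]
sessions = build_sessions(rows)
```

Visitor c1 ends up with two sessions because more than 30 minutes separate the second and third hits.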
22. Step 2 : Basic Feature
• IP Address -> Location, City
• User-Agent -> Device
• Timestamp -> User Time -> Day or night?
Python + Hadoop Streaming
Option 1 Option 2
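As a rough sketch of what such a Python step could compute per log line, the two helpers below derive a device class and a day/night flag. The substring rules, the 07:00-22:00 "day" window, and the function names are all assumptions for illustration; IP geolocation is left out because it needs an external GeoIP database.

```python
from datetime import datetime, timezone

def device_from_user_agent(user_agent):
    """Very rough device class from substrings of the User-Agent header."""
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua or "crawler" in ua:
        return "robot"
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobile" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"

def day_or_night(epoch_seconds, utc_offset_hours=0):
    """Classify a timestamp as 'day' (07:00 to 21:59 user time) or 'night'."""
    shifted = epoch_seconds + utc_offset_hours * 3600
    local = datetime.fromtimestamp(shifted, tz=timezone.utc)
    return "day" if 7 <= local.hour < 22 else "night"
```

In a Hadoop Streaming job these functions would be called from a mapper that reads one session or log line per line of standard input.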
24. Step 3: Session Signals
• Simple Signals
– Number of Page Views
– Time Spent …..
– Etc…
• Limitation
These simple signals alone might not help much to differentiate behaviours
25. More Elaborate: N-Grams Model
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A,T,T,A,.. | AT,TT,TA,AG,.. | ATT,TTA,TAG,..
NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t,.. | so,om,me,e_,_l,li,.. | som,ome,me_,e_l,_li,lik,..
NLP (word) | Word | ..some like it hot… | some, like, it | some-like, like-it | some-like-it, like-it-hot
26. N-Grams Model For Sessions
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A,T,T,A,.. | AT,TT,TA,AG,.. | ATT,TTA,TAG,..
NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t,.. | so,om,me,e_,_l,li,.. | som,ome,me_,e_l,_li,lik,..
NLP (word) | Word | ..some like it hot… | some, like, it | some-like, like-it | some-like-it, like-it-hot
Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog
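The Web Sessions row can be reproduced with a few lines of Python. This is only a sketch of the idea; the function name is an assumption, and each n-gram is returned as a tuple rather than a hyphen-joined string.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

session = ["/home", "/products", "/trynow", "/blog"]
bigrams = ngrams(session, 2)
trigrams = ngrams(session, 3)
```

A session shorter than n simply yields no n-grams of that order.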
27. Session N-Grams Analytics
Campaign / URL / Event | Detailed Token | Simple Token
utm=google_search | google-search-my-site | google-search
/home | home | home
/search?q=baseball | search-baseball | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/search?q=Mick+JONES | search-mick+jones | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player/comming | sport
/politics/home | politics-home | politics
(Detailed tokens -> fine-grain n-grams; simple tokens -> coarse-grain n-grams)
Important Tricks:
• Incorporate the first referrer / marketing campaign as FIRST TOKEN
• Build two levels of tokens: detailed, and category only
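A possible tokenizer for the two token levels is sketched below. The rules are guesses reverse-engineered from a few rows of the table (search queries, outbound clicks, plain paths); campaign tokens such as `utm=google_search` are not handled, and the `tokenize` name and its exact output format are assumptions.

```python
from urllib.parse import urlsplit, parse_qs

def tokenize(event):
    """Return (detailed_token, simple_token) for one hit of a session."""
    if event.startswith("click="):
        host = event[len("click="):]
        parts = host.split(".")
        # www.nfl.com -> nfl: second-to-last label of the host name
        name = parts[-2] if len(parts) >= 2 else host
        return "click-" + name, "click"
    url = urlsplit(event)
    segments = [s for s in url.path.split("/") if s]
    if not segments:
        return "home", "home"  # assumption: the root path is the home page
    simple = segments[0]
    if simple == "search":
        query = parse_qs(url.query).get("q", [""])[0]
        return "search-" + query.lower().replace(" ", "+"), "search"
    return "-".join(segments), simple
```

Keeping both levels lets frequent coarse patterns (search, click, sport) coexist with rarer but more specific ones (search-baseball, click-nfl).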
28. How To In Practice
• Hive query using the n-grams UDF
• Compute the LLR (Log-Likelihood Ratio) metric
• Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session
• Hint: set the frequency limit so that > 90% of sessions can be described by a non-detailed n-gram
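For reference, here is a small sketch of the log-likelihood ratio on a 2x2 contingency table, in the entropy-based form popularised by Dunning. It assumes this is the LLR the slide refers to, and the choice of counts (sessions of a group with or without a given n-gram versus all other sessions) is one plausible use, not something the slides spell out.

```python
import math

def _xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def _entropy(*counts):
    # Unnormalised entropy: N*log(N) - sum(k*log(k))
    return _xlogx(sum(counts)) - sum(_xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio of a 2x2 table of counts.

    k11: sessions in the group that contain the n-gram
    k12: sessions in the group that do not
    k21: other sessions that contain the n-gram
    k22: other sessions that do not
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

A score near zero means the n-gram is as frequent inside the group as outside it; a large score means its presence is strongly associated with the group.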
29. Step 4. Cohort-like data
• Per cookie compute metrics
– Nb. Days since first visit
– Nb visits in the last 30 days
– Average session time
– …
• Reintegrate this information
• Easily achieved with a HiveQL query
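The same per-cookie aggregation, written as a Python sketch for readability rather than as HiveQL. The input layout (cookie, integer day index, session duration in seconds) and the output field names are assumptions.

```python
from collections import defaultdict

def cookie_stats(sessions, today):
    """Compute per-cookie metrics.

    sessions: iterable of (cookie, day_number, duration_seconds), where
    day_number is an integer day index (e.g. days since epoch).
    """
    per_cookie = defaultdict(list)
    for cookie, day, duration in sessions:
        per_cookie[cookie].append((day, duration))

    stats = {}
    for cookie, visits in per_cookie.items():
        days = [d for d, _ in visits]
        stats[cookie] = {
            "days_since_first_visit": today - min(days),
            "visits_last_30_days": sum(1 for d in days if today - d < 30),
            "avg_session_seconds": sum(t for _, t in visits) / len(visits),
        }
    return stats

sessions = [("a", 100, 60), ("a", 140, 120), ("a", 149, 30), ("b", 149, 10)]
stats = cookie_stats(sessions, today=150)
```

Joining these values back onto each session row is what "reintegrate this information" amounts to.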
30. Machine Learning for HDFS Data
Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
Apache Mahout | MapReduce | ~ 10 available | Expert | TERABYTES
Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium | (10GB) 1 SERVER RAM
H2O | Separate Cluster | 1 (kMeans) | Medium | (100GB – 1TB) CLUSTER RAM
Open Source R + Hadoop | Varies | Varies | Varies | Varies
Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | (1GB) 1 Server RAM in R
Spark + MLLib | Separate Cluster | 1 | Medium | (100GB – 1TB) CLUSTER RAM
31. How Big is our data here ?
RAW DATA (5 TB)
Step 1: Build Sessions
Step 2: Parse IP/Time/..
Step 3: Parse Sequences
Step 4: Build user-level stats
READY FOR ML (10 GB)
Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month
32. Clustering With Scikit on HDFS
1. Use Pydoop to get the data onto the training server
2. Use pandas to read the data and transform it to numerical features
3. KMeans().fit()
4. IPython to draw some graphs
5. Enjoy
35. Clustering & Cluster Sampling
Take a balanced number of samples
in each cluster, close to the centroid
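To make the sampling step concrete, here is a dependency-free sketch: a tiny k-means (standing in for a library implementation such as scikit-learn's KMeans) followed by the selection of the members closest to each centroid. The function names, the fixed starting centroids, and the toy points are assumptions for illustration.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, initial_centroids, iterations=20):
    """Plain Lloyd iterations from the given starting centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        assign = [nearest(p, centroids) for p in points]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]

def sample_near_centroids(points, centroids, assign, per_cluster):
    """For each cluster, return the indices of the points closest to its centroid."""
    samples = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: math.dist(points[i], centroid))
        samples[c] = members[:per_cluster]
    return samples

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (30.0, 30.0)]
centroids, assign = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
samples = sample_near_centroids(points, centroids, assign, per_cluster=2)
```

Taking the same number of samples per cluster keeps small segments represented, and staying close to the centroid avoids showing atypical sessions (here the outlier at (30, 30)) to the people doing the labelling.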
36. Labelling
Visualizing Sessions
[Figure: timeline of one session with events at 0’00, 0’12, 1’04, 1’45 and 3’02, labelled “Search for a specific Topic”]
Labelling: I can guess what this guy was doing !!!
37. Labelling
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan who loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
38. What if ?
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan who loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
39. Supervised Learning
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan who loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
Independently of the clusters, use the labelled examples to classify each session into the predefined segments
40. Supervised Learning : e.g. in python
• Load the data and the labels in Python (pandas)
• Fit a model on the labeled sessions
• Save the model in HDFS (Python pickle)
• Run the model against all the data (Hadoop Streaming)
We’ve got a tool to help you do that in Data Science Studio. He’s called the Doctor and he’s fun to use !
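The four steps on this slide (load, fit, save, apply) can be sketched end to end without any external library. The nearest-centroid classifier below is a deliberately simple stand-in for whatever model is actually fitted; the feature columns, the labels, and the function names are made up for illustration, and the pickled bytes are kept in memory instead of being written to HDFS.

```python
import pickle
from collections import defaultdict

def fit(X, y):
    """Toy model: the mean feature vector of each label."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(model, X):
    """Assign each row to the label with the closest mean vector."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(model, key=lambda label: sq_dist(x, model[label])) for x in X]

# 1. Labeled sessions: [page views, comments] -> segment (hypothetical values)
X_train = [[2, 0], [3, 0], [15, 6], [12, 8]]
y_train = ["newcomer", "newcomer", "fan", "fan"]

# 2. Fit the model on the labeled sessions
model = fit(X_train, y_train)

# 3. Serialize; in the workflow above these bytes would be stored in HDFS
blob = pickle.dumps(model)

# 4. Apply: what each streaming task would do after loading the model
restored = pickle.loads(blob)
predictions = predict(restored, [[1, 0], [14, 5]])
```

Only step 4 has to run on the full data set, which is why the model can be trained on a single server and applied with Hadoop Streaming.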
41. Compute Metrics Per Segments
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan who loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
0.3€ per session
0.23€ acquisition costs
13k sessions
1.3€ per session
0.23€ acquisition costs
938k sessions
938k sessions
0.3€ per session
0.23€ acquisition costs
738k sessions
0.83€ per session
0.73€ acquisition costs
68k sessions
0.3€ per session
1.23€ acquisition costs
1k sessions
0€ per session
0€ acquisition costs
42. User Satisfaction Metrics
• Future-Based Metrics
– Will the user most likely subscribe/pay in the future?
• Expressed-Opinion Metrics
– Does the user seem satisfied, judging from their behaviour?
43. Opinion-Based Training For User Satisfaction
User feedback as “labels” to build a model of satisfaction
“Predict” a satisfaction score for sessions without feedback
Session Data (100 Million Sessions) + Feedback (10,000 feedbacks) -> Scored Sessions
HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS
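The hypothesis maps naturally onto a nearest-neighbour predictor: score a session with the average feedback of the most similar sessions that did leave feedback. The sketch below assumes sessions are already encoded as numeric feature vectors; the k-nearest-neighbour choice, the function name, and the toy data are assumptions, not the model used in the talk.

```python
import math

def predict_satisfaction(session, labeled, k=3):
    """Average the feedback scores of the k labeled sessions closest to `session`.

    labeled: list of (feature_vector, satisfaction_score) pairs.
    """
    neighbours = sorted(labeled, key=lambda pair: math.dist(session, pair[0]))[:k]
    return sum(score for _, score in neighbours) / len(neighbours)

# Hypothetical feedback: (page views, comments) -> score between 0 and 1
labeled = [
    ((1.0, 0.0), 0.2), ((2.0, 0.0), 0.3), ((1.5, 1.0), 0.1),
    ((10.0, 5.0), 0.9), ((11.0, 6.0), 0.8), ((9.0, 4.0), 1.0),
]
low = predict_satisfaction((1.2, 0.2), labeled)
high = predict_satisfaction((10.0, 5.5), labeled)
```

With 10,000 feedbacks for 100 million sessions, the quality of the score depends entirely on how well the features capture "similar navigation patterns".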
44. Compute Metrics Per Segments
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan who loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
0.3€ per session
0.23€ acquisition costs
13k sessions
1.3€ per session
0.23€ acquisition costs
938k sessions
938k sessions
0.3€ per session
0.23€ acquisition costs
738k sessions
0.83€ per session
0.73€ acquisition costs
68k sessions
0.3€ per session
1.23€ acquisition costs
1k sessions
0€ per session
0€ acquisition costs
SATISFACTION SCORE 0.87
SATISFACTION SCORE 0.37
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12
45. Data in Time: Smoothing
In Red: The base metric
In Blue: The smoothed metric
RAW DATA MAY VARY A LOT FROM DAY TO DAY
IT WILL SCARE PEOPLE
46. Exponential Smoothing In Hive
SELECT segment,
  moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01'))
FROM
stats
GROUP BY segment
These factors determine whether you smooth a lot or not, and over how many days
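For intuition, simple exponential smoothing fits in a few lines of Python. This is a generic sketch with a single smoothing factor `alpha`; it is not the exact formula behind the `moving_avg` UDF, whose parameters are not documented on the slide.

```python
def exponential_smoothing(values, alpha):
    """Return the smoothed series; alpha in (0, 1], smaller alpha = smoother."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        # The first point has no history, so it is used as its own baseline.
        previous = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * previous)
    return smoothed

daily_satisfaction = [0.50, 0.90, 0.10, 0.50]
smooth = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

The raw series swings between 0.10 and 0.90 while the smoothed one stays between 0.40 and 0.70, which is the kind of damping that keeps day-to-day noise from alarming readers of the dashboard.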
47. Final : Follow Smoothed Satisfaction
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan who loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
Follow Satisfaction Metric Per Segment
Damn, our latest release has diverging effects on segments
48. Thank You !
Florian Douetteau
@fdouetteau
Questions now or later:
florian.douetteau@dataiku.com
dataiku.com