SlideShare une entreprise Scribd logo
1  sur  48
½ S L using to turn
into
Semi-Supervised Learning on
Hadoop to understand user
behaviors
Hadoop Summit Amsterdam
2-3 Avril 2014
Florian Douetteau
@fdouetteau
www.dataiku.com
Data Science
Studio
Motivation
• CxO
– Pages Views, Unique Visitors, Dollars, Subscription
• Editor / Product Manager
– Time Spent, Comments
• Users
– Content
What does matter on a web site ?
Key Usage Metrics
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– Will come back tomorrow
– Click on Advertisting
• Online Game
– Time spent in the game
– Level Progress
– In-App Purchase
The Quest for the Missing Proxy
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– User Satisfaction
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– User Satisfaction
– Will come back tomorrow
– Click on Advertisting
• Online Game
– Time spent in the game
– Level Progress
– User Satisfaction
– In-App Purchase
U
S
E
R
Question
How to measure and drive user satisfaction on a
large web sites with very diverse usage patterns
?
The Problem
New Comers From
Google News
People Coming
from twitter and
Facebook Posts
People coming to
the website almost
each and everyday
People that loves
to comment
Foreigners Robots
People fond of
sport section only
…. …..
BEHAVIOUR DIVERSITY
THE AVERAGED
METRICS WOULD
HIDE
IMPORTANT
VARIATION ON
SPECIFIC SEGMENTS
SubProblem 1: Hard Segments
• Segments Users per
Number of visits per
month
– > 20 days per month
-> Engaged Users
• Segment per
transformed or not
• Segment per country
Subproblem 2: Hard Metrics
• Newspaper
Time Spent on the website
 log(Number of page
views) + Number of actions
• Search engine
Click Ratio
Click ratio
• E-Commerce
 Transformation Ratio
Limits
Hard Segments
 MISSING PART OF
THE REALITY
Hard Metrics
 ARGUING BETWEEN
TEAM
Semi-Supervised Learning
All Labeled Data
All Unlabeled Data
Some Labeled Data
Lots of Unlabeled
Data
Training Data
Supervised
Learning
Unsupervised
Learning
Semi-
Supervised
Learning
Model
Model
Model
½ SL – Natural Language Processing
I hope I’ll enjoy Amsterdam, and not only because of Hadoop
Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop
Statistical Knowledge
 Text Structure
(Unsupervised)
Aligned Corpus
(Supervised)
½ SL Applied to Web Sessions
Lots of customer sessions
Not so many concrete customer
feedbacks
Subscription
Semi-Supervised Learning
3 Approaches
• Generative Models, e.g. gaussian fits
– All Data fits a gaussian distribution with parameter X
– Find X that better fit distribution of both labeled data and
unlabeled data
• Fits with costs
– Supervised learning with a costs function that capture a
distance between point related to the unlabeled data
structure
• Ad-hoc : Combine unsupervised, then supervised
Clustering+Supervised in practice
Unlabeled training data points in grey
Labeled training data points in color
Supervised Learning Only
½ SL : Fit to the underlying structure
Our Approach
1. (Lots of ) Data preparation to build miningful
user session
2. Clustering sessions and validate/tag those
clusters by end users
3. Create Predictive User Satisfaction Metrics
4. Follow those metrics !
Data Prep: Overview
Step 1
Build Sessions
Pig
Step 2
Parse IP/Time/..
Custom Python
(or )
Step 3
Parse Sequences
Hive or Python
custom
Step 4
Build user-level
stats
Hive
RAW DATA
READY FOR ML
Step 1. Build Session
• Use Hive ( Or Pig)
• Group into “Session”
• Depending on the variable
– IP, Device  Select only one per log
– URL, Event  Create an ordered array that
represents the sequence of events in the session
Step 2 : Basic Feature
• IP Address  Location, City
• User-Agent  Device
• Timestamp  User Time  Day or night ?
Python + Hadoop Streaming
Option 1 Option 2
Extracted DataORIGINAL
ORIGINAL
ORIGINAL
NEW!!
NEW!!
NEW!!
Country From IP Device From User-AgentHour from
Country & Time
Step 3: Session Signals
• Simple Signals
– Number of Page Views
– Time Spent …..
– Etc…
• Limitation
 It might not help that much to differentiate
behaviour
More Elaborate: N-Grams Model
Field Unit Sample 1-Gram 2-Gram 3-Gram
Protein Amino
Acid
Cys-Gly-Leu Cys, Gly, Leu Cys-Gly, Gly-Leu Cys-Gly-Leu
DNA Base Pair …ATTAGCAT.. A,T,T,A AT,TT,TA,AG, ATT,TTA,TAG,..
NLP (word) Character ..some like it hot… s,o,m,e,l,i,t.. so,om,el,li,it som,ome,me_,_li,lik,..
NLP
(character)
Word ..some like it hot… some,like,it some-like,like-it some-like-it,
like-it-hote
N-Grams Model For Sessions
Field Unit Sample 1-Gram 2-Gram 3-Gram
Protein Amino
Acid
Cys-Gly-Leu Cys, Gly, Leu Cys-Gly, Gly-Leu Cys-Gly-Leu
DNA Base Pair …ATTAGCAT.. A,T,T,A AT,TT,TA,AG, ATT,TTA,TAG,..
NLP (word) Character ..some like it hot… s,o,m,e,l,i,t.. so,om,el,li,it som,ome,me_,_li,lik,..
NLP
(character)
Word ..some like it hot… some,like,it some-like,like-it some-like-it,
like-it-hote
Web Sessions Page View [/home , /products, /trynow,
/blog]
/home, /products, /trynow,
/blog
/home /products, /products
/trynow, /trynow /blog
/home-/products-/trynow,
/products-/trynow-/blog
Session N-Grams Analytics
Campaign / URL / Event Detailed Token Simple Token
utm=google_search google-search-my-site google-search
/home home home
/search?q=baseball search-baseball search
click=www.nfl.com click-nfl click
/sport/new-player-com.. sport/new-player-comming sport
/search?q=Mick+JONES search-mick+jones search
click=www.nfl.com click-nfl click
/sport/new-player-com.. sport/new-player/comming sport
/politics/home politics-home politics
Important Tricks:
• Incorporate the first referrer / marketing campaign as FIRST TOKEN
• Build two level of tokens: detailed, and category only
N-Grams Fine Grain N-Grams Coarse Grain
How To In Practice
• Hive query using the n-grams UDF
• Compute the LLR (Least-Likehood Ratio) Metrics
• Keep the most frequent n-grams of each type (detailed
/ non detailed) as features for the session
• Hint : Set the frequency limit so that > 90% session
can be described by a non-detailed n-gram
Step 4. Cohort-like data
• Per cookie compute metrics
– Nb. Days since first visit
– Nb visits in the last 30 days
– Average session time
– …
• Reintegrate this information
• Easily achieved with a HiveQL query
Machine Learning for HDFS Data
Kind Algorithms
for clustering
Simplicity TRAIN set size
Apache Mahout MapReduce ~ 10 available Expert TERABYTES
Python
(Scikit+Pandas+…
)
Out for training /
In for apply
~ 20 available
(including bi-
clustering)
Medium (10GB)
1 SERVER RAM
H2O Separate Cluster 1 (kMeans) Medium (100GB – 1TB)
CLUSTER RAM
Open Source R +
Hadoop
Varies Varies Varies Varies
Open Source R +
Pattern
(Casacding)
Out for training
/ In for apply
> 3 Medium (1GB)
1 Server RAM in
R
Spark + MLLib Separate Cluster 1 Medium (100GB – 1TB)
CLUSTER RAM
How Big is out data here ?
Step 1
Build Sessions
Step 2
Parse IP/Time/..
Step 3
Parse Sequences
Step 4
Build user-level
stats
RAW DATA
READY FOR ML
Uncompressed data size, for 1 year worth of log on a website with
10 Millions Unique Visitors per month
10 GB5TB
Clustering With Scikit on HDFS
1. Use Pydoop to get data on train server
2. Use pandas to read data transform to numerical
3. Kmeans().fit()
4. Ipython to draw some graphs
5. Enjoy
or
Session Data
Clustering
Clustering & Cluster Sampling
Take a balanced number of samples
in each cluster, close to the centroid
Labelling
0’ 00
0’ 12
1’ 04
1’ 45
3’ 02
Visualizing Sessions
Search for a
specific Topic
Labelling
I can guess what this guy was
doing !!!
Labelling
Search for a
specific Topic
Newcomer
from Google
News
Foreigner
Discovering The
Site
Fan that loves
to comment
Home Page
Wanderer
Dark Bot
(Competitor?)
What if ?
Search for a
specific Topic
Newcomer
from Google
News
Foreigner
Discovering The
Site
Fan that loves
to comment
Home Page
Wanderer
Dark Bot
(Competitor?)
Supervised Learning
Search for a
specific Topic
Newcomer
from Google
News
Foreigner
Discovering The
Site
Fan that loves
to comment
Home Page
Wanderer
Dark Bot
(Competitor?)
Independently from the clusters, used the
trained examples in order to classify each
session in the predefined segments
Supervised Learning : e.g. in python
• Load the data and the label in
python (Pandas)
• Fit the labeled sessions against
a model
• Save the model in HDFS
(python pickle)
• Run the model against all the
data (Hadoop Streaming)
We’ve got a tool to help you
do that in Data Science Studio
He’s called the Doctor and he’s
fun to use !
Compute Metrics Per Segments
Search for a
specific Topic
Newcomer
from Google
News
Foreigner
Discovering The
Site
Fan that loves
to comment
Home Page
Wanderer
Dark Bot
(Competitor?)
0.3€ per session
0.23€ acquisition costs
``
`
13k sessions
1.3€ per session
0.23€ acquisition costs
938k sessions
938k sessions
0.3€ per session
0.23€ acquisition costs
738k sessions
0.83€ per session
0.73€ acquisition costs
68k sessions
0.3€ per session
1.23€ acquisition costs
1k sessions
0€ per session
0€ acquisition costs
User Satisfaction Metrics
• Future-Based Metrics
– Will the user most
likely subscribe/pay in
the future ?
• Expressed-Opinion
– Does he like satisfied
from its behaviour ?
Opinion-Based Training For User Satisfaction
User Feedbacks as “Labels” to build a model
on satisfaction
“Predict” a satisfaction score
for non-trained session
Session Data
Feedbacks
Scored
Session
HYPOTHESIS : IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS
THEY HAVE SIMILAR USER SATISFACTION LEVELS
(100 Million Sessions)
(10.000 feedbacks)
Compute Metrics Per Segments
Search for a
specific Topic
Newcomer
from Google
News
Foreigner
Discovering The
Site
Fan that loves
to comment
Home Page
Wanderer
Dark Bot
(Competitor?)
0.3€ per session
0.23€ acquisition costs
``
`
13k sessions
1.3€ per session
0.23€ acquisition costs
938k sessions
938k sessions
0.3€ per session
0.23€ acquisition costs
738k sessions
0.83€ per session
0.73€ acquisition costs
68k sessions
0.3€ per session
1.23€ acquisition costs
1k sessions
0€ per session
0€ acquisition costs
SATISFACTION SCORE 0.87§
SATISFACTION SCORE 0.37
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12
SATISFACTION SCORE 0.28 SATISFACTION SCORE 0.12
Data in Time: Smoothing
In Red : The Base Metric
In Blue : The smoothed metricRAW DATA MAY VARY A LOT
FROM DAYS TO DAYS
IT WILL SCARE PEOPLE
Exponential Smoothing In Hive
SELECT segment
moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF(‘2014-15-01’, ‘2014-01-01’))
FROM
stats
GROUP BY segment
These factors determine
whether your smooth a lot
or not, and over how many days
Final : Follow Smoothed Satisfaction
Search for a
specific Topic
Newcomer
from Google
News
Foreigner
Discovering The
Site
Fan that loves
to comment
Home Page
Wanderer
Dark Bot
(Competitor?)
Follow Statisfaction Metric Per Segment
Damn
our latest
release
has diverging
effects
on segments
Thank You !
Florian Douetteau
@fdouetteau
Questions now or later:
florian.douetteau@dataiku.com
dataiku.com

Contenu connexe

Tendances

Intro to Cypher
Intro to CypherIntro to Cypher
Intro to CypherNeo4j
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Vanishing & Exploding Gradients
Vanishing & Exploding GradientsVanishing & Exploding Gradients
Vanishing & Exploding GradientsSiddharth Vij
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - RecommendationCataldo Musto
 
Activation function
Activation functionActivation function
Activation functionAstha Jain
 
Use of data science in recommendation system
Use of data science in  recommendation systemUse of data science in  recommendation system
Use of data science in recommendation systemAkashPatil334
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildSujit Pal
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentDatabricks
 
[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systems[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systemsFalitokiniaina Rabearison
 
Introduction to Machine Learning
Introduction to Machine Learning   Introduction to Machine Learning
Introduction to Machine Learning snehal_152
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier ananth
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks남주 김
 
Build a Recommendation Engine using Amazon Machine Learning in Real-time
Build a Recommendation Engine using Amazon Machine Learning in Real-timeBuild a Recommendation Engine using Amazon Machine Learning in Real-time
Build a Recommendation Engine using Amazon Machine Learning in Real-timeAmazon Web Services
 

Tendances (20)

Transfer Learning
Transfer LearningTransfer Learning
Transfer Learning
 
Intro to Cypher
Intro to CypherIntro to Cypher
Intro to Cypher
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Vanishing & Exploding Gradients
Vanishing & Exploding GradientsVanishing & Exploding Gradients
Vanishing & Exploding Gradients
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Activation function
Activation functionActivation function
Activation function
 
Use of data science in recommendation system
Use of data science in  recommendation systemUse of data science in  recommendation system
Use of data science in recommendation system
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
 
randomwalk.ppt
randomwalk.pptrandomwalk.ppt
randomwalk.ppt
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systems[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systems
 
Introduction to Machine Learning
Introduction to Machine Learning   Introduction to Machine Learning
Introduction to Machine Learning
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Build a Recommendation Engine using Amazon Machine Learning in Real-time
Build a Recommendation Engine using Amazon Machine Learning in Real-timeBuild a Recommendation Engine using Amazon Machine Learning in Real-time
Build a Recommendation Engine using Amazon Machine Learning in Real-time
 

Similaire à Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Parismlmeetupfinalslides 151209190037-lva1-app6892
Parismlmeetupfinalslides 151209190037-lva1-app6892Parismlmeetupfinalslides 151209190037-lva1-app6892
Parismlmeetupfinalslides 151209190037-lva1-app6892mercedes calderon
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongFastly
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 

Similaire à Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours (20)

Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Paris ML meetup
Paris ML meetupParis ML meetup
Paris ML meetup
 
Parismlmeetupfinalslides 151209190037-lva1-app6892
Parismlmeetupfinalslides 151209190037-lva1-app6892Parismlmeetupfinalslides 151209190037-lva1-app6892
Parismlmeetupfinalslides 151209190037-lva1-app6892
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Streaming Analytics
Streaming AnalyticsStreaming Analytics
Streaming Analytics
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
Python ml
Python mlPython ml
Python ml
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrong
 
kdd2015
kdd2015kdd2015
kdd2015
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 

Plus de Dataiku

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Dataiku
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Dataiku
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare IndustryDataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Dataiku
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku Dataiku
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015 Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Dataiku
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHDataiku
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuDataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Dataiku
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 

Plus de Dataiku (20)

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 

Dernier

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Dernier (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

  • 1. ½ S L using to turn into
  • 2. Semi-Supervised Learning on Hadoop to understand user behaviors Hadoop Summit Amsterdam 2-3 Avril 2014
  • 4. Motivation • CxO – Pages Views, Unique Visitors, Dollars, Subscription • Editor / Product Manager – Time Spent, Comments • Users – Content What does matter on a web site ?
  • 5. Key Usage Metrics • Publisher – Time Spent on Page – Number of pages seen – Number of comments – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – Will come back tomorrow – Click on Advertisting • Online Game – Time spent in the game – Level Progress – In-App Purchase
  • 6. The Quest for the Missing Proxy • Publisher – Time Spent on Page – Number of pages seen – Number of comments – User Satisfaction – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – User Satisfaction – Will come back tomorrow – Click on Advertisting • Online Game – Time spent in the game – Level Progress – User Satisfaction – In-App Purchase U S E R
  • 7. Question How to measure and drive user satisfaction on a large web sites with very diverse usage patterns ?
  • 8. The Problem New Comers From Google News People Coming from twitter and Facebook Posts People coming to the website almost each and everyday People that loves to comment Foreigners Robots People fond of sport section only …. ….. BEHAVIOUR DIVERSITY THE AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS
  • 9. SubProblem 1: Hard Segments • Segments Users per Number of visits per month – > 20 days per month -> Engaged Users • Segment per transformed or not • Segment per country
  • 10. Subproblem 2: Hard Metrics • Newspaper Time Spent on the website  log(Number of page views) + Number of actions • Search engine Click Ratio Click ratio • E-Commerce  Transformation Ratio
  • 11. Limits Hard Segments  MISSING PART OF THE REALITY Hard Metrics  ARGUING BETWEEN TEAM
  • 12. Semi-Supervised Learning All Labeled Data All Unlabeled Data Some Labeled Data Lots of Unlabeled Data Training Data Supervised Learning Unsupervised Learning Semi- Supervised Learning Model Model Model
  • 13. ½ SL – Natural Language Processing I hope I’ll enjoy Amsterdam, and not only because of Hadoop Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop Statistical Knowledge  Text Structure (Unsupervised) Aligned Corpus (Supervised)
  • 14. ½ SL Applied to Web Sessions Lots of customer sessions Not so many concrete customer feedbacks Subscription
  • 15. Semi-Supervised Learning 3 Approaches • Generative Models, e.g. gaussian fits – All Data fits a gaussian distribution with parameter X – Find X that better fit distribution of both labeled data and unlabeled data • Fits with costs – Supervised learning with a costs function that capture a distance between point related to the unlabeled data structure • Ad-hoc : Combine unsupervised, then supervised
  • 16. Clustering+Supervised in practice Unlabeled training data points in grey Labeled training data points in color
  • 18. ½ SL : Fit to the underlying structure
  • 19. Our Approach 1. (Lots of ) Data preparation to build miningful user session 2. Clustering sessions and validate/tag those clusters by end users 3. Create Predictive User Satisfaction Metrics 4. Follow those metrics !
  • 20. Data Prep: Overview Step 1 Build Sessions Pig Step 2 Parse IP/Time/.. Custom Python (or ) Step 3 Parse Sequences Hive or Python custom Step 4 Build user-level stats Hive RAW DATA READY FOR ML
  • 21. Step 1. Build Session • Use Hive ( Or Pig) • Group into “Session” • Depending on the variable – IP, Device  Select only one per log – URL, Event  Create an ordered array that represents the sequence of events in the session
  • 22. Step 2 : Basic Feature • IP Address  Location, City • User-Agent  Device • Timestamp  User Time  Day or night ? Python + Hadoop Streaming Option 1 Option 2
  • 23. Extracted DataORIGINAL ORIGINAL ORIGINAL NEW!! NEW!! NEW!! Country From IP Device From User-AgentHour from Country & Time
  • 24. Step 3: Session Signals • Simple Signals – Number of Page Views – Time Spent ….. – Etc… • Limitation  It might not help that much to differentiate behaviour
  • 25. More Elaborate: N-Grams Model Field Unit Sample 1-Gram 2-Gram 3-Gram Protein Amino Acid Cys-Gly-Leu Cys, Gly, Leu Cys-Gly, Gly-Leu Cys-Gly-Leu DNA Base Pair …ATTAGCAT.. A,T,T,A AT,TT,TA,AG, ATT,TTA,TAG,.. NLP (word) Character ..some like it hot… s,o,m,e,l,i,t.. so,om,el,li,it som,ome,me_,_li,lik,.. NLP (character) Word ..some like it hot… some,like,it some-like,like-it some-like-it, like-it-hote
  • 26. N-Grams Model For Sessions Field Unit Sample 1-Gram 2-Gram 3-Gram Protein Amino Acid Cys-Gly-Leu Cys, Gly, Leu Cys-Gly, Gly-Leu Cys-Gly-Leu DNA Base Pair …ATTAGCAT.. A,T,T,A AT,TT,TA,AG, ATT,TTA,TAG,.. NLP (word) Character ..some like it hot… s,o,m,e,l,i,t.. so,om,el,li,it som,ome,me_,_li,lik,.. NLP (character) Word ..some like it hot… some,like,it some-like,like-it some-like-it, like-it-hote Web Sessions Page View [/home , /products, /trynow, /blog] /home, /products, /trynow, /blog /home /products, /products /trynow, /trynow /blog /home-/products-/trynow, /products-/trynow-/blog
  • 27. Session N-Grams Analytics Campaign / URL / Event Detailed Token Simple Token utm=google_search google-search-my-site google-search /home home home /search?q=baseball search-baseball search click=www.nfl.com click-nfl click /sport/new-player-com.. sport/new-player-comming sport /search?q=Mick+JONES search-mick+jones search click=www.nfl.com click-nfl click /sport/new-player-com.. sport/new-player/comming sport /politics/home politics-home politics Important Tricks: • Incorporate the first referrer / marketing campaign as FIRST TOKEN • Build two level of tokens: detailed, and category only N-Grams Fine Grain N-Grams Coarse Grain
  • 28. How To In Practice • Hive query using the n-grams UDF • Compute the LLR (Least-Likehood Ratio) Metrics • Keep the most frequent n-grams of each type (detailed / non detailed) as features for the session • Hint : Set the frequency limit so that > 90% session can be described by a non-detailed n-gram
  • 29. Step 4. Cohort-like data • Per cookie compute metrics – Nb. Days since first visit – Nb visits in the last 30 days – Average session time – … • Reintegrate this information • Easily achieved with a HiveQL query
  • 30. Machine Learning for HDFS Data Kind Algorithms for clustering Simplicity TRAIN set size Apache Mahout MapReduce ~ 10 available Expert TERABYTES Python (Scikit+Pandas+… ) Out for training / In for apply ~ 20 available (including bi- clustering) Medium (10GB) 1 SERVER RAM H2O Separate Cluster 1 (kMeans) Medium (100GB – 1TB) CLUSTER RAM Open Source R + Hadoop Varies Varies Varies Varies Open Source R + Pattern (Casacding) Out for training / In for apply > 3 Medium (1GB) 1 Server RAM in R Spark + MLLib Separate Cluster 1 Medium (100GB – 1TB) CLUSTER RAM
  • 31. How Big is out data here ? Step 1 Build Sessions Step 2 Parse IP/Time/.. Step 3 Parse Sequences Step 4 Build user-level stats RAW DATA READY FOR ML Uncompressed data size, for 1 year worth of log on a website with 10 Millions Unique Visitors per month 10 GB5TB
  • 32. Clustering With Scikit on HDFS 1. Use Pydoop to get data on train server 2. Use pandas to read data transform to numerical 3. Kmeans().fit() 4. Ipython to draw some graphs 5. Enjoy or
  • 35. Clustering & Cluster Sampling Take a balanced number of samples in each cluster, close to the centroid
  • 36. Labelling 0’ 00 0’ 12 1’ 04 1’ 45 3’ 02 Visualizing Sessions Search for a specific Topic Labelling I can guess what this guy was doing !!!
  • 37. Labelling Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?)
  • 38. What if ? Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?)
  • 39. Supervised Learning Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) Independently from the clusters, used the trained examples in order to classify each session in the predefined segments
  • 40. Supervised Learning : e.g. in python • Load the data and the label in python (Pandas) • Fit the labeled sessions against a model • Save the model in HDFS (python pickle) • Run the model against all the data (Hadoop Streaming) We’ve got a tool to help you do that in Data Science Studio He’s called the Doctor and he’s fun to use !
  • 41. Compute Metrics Per Segments Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) 0.3€ per session 0.23€ acquisition costs `` ` 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs
  • 42. User Satisfaction Metrics • Future-Based Metrics – Will the user most likely subscribe/pay in the future ? • Expressed-Opinion – Does he like satisfied from its behaviour ?
  • 43. Opinion-Based Training For User Satisfaction User Feedbacks as “Labels” to build a model on satisfaction “Predict” a satisfaction score for non-trained session Session Data Feedbacks Scored Session HYPOTHESIS : IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS THEY HAVE SIMILAR USER SATISFACTION LEVELS (100 Million Sessions) (10.000 feedbacks)
  • 44. Compute Metrics Per Segments Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) 0.3€ per session 0.23€ acquisition costs `` ` 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs SATISFACTION SCORE 0.87§ SATISFACTION SCORE 0.37 SATISFACTION SCORE 0.28 SATISFACTION SCORE 0.12 SATISFACTION SCORE 0.28 SATISFACTION SCORE 0.12
  • 45. Data in Time: Smoothing In Red : The Base Metric In Blue : The smoothed metricRAW DATA MAY VARY A LOT FROM DAYS TO DAYS IT WILL SCARE PEOPLE
  • 46. Exponential Smoothing In Hive SELECT segment moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF(‘2014-15-01’, ‘2014-01-01’)) FROM stats GROUP BY segment These factors determine whether your smooth a lot or not, and over how many days
  • 47. Final : Follow Smoothed Satisfaction Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) Follow Statisfaction Metric Per Segment Damn our latest release has diverging effects on segments
  • 48. Thank You ! Florian Douetteau @fdouetteau Questions now or later: florian.douetteau@dataiku.com dataiku.com