“Survival” Analysis of
     Web Users
             Dell Zhang
DCSIS, Birkbeck, University of London



                                        1
Outline
• What Is It
• Why Is It Useful
• Case Study
  – The Departure Dynamics of Wikipedia Editors




                                                  2
What Is It




             3
Time-To-Event Data
• Survival Analysis is a branch of statistics which
  deals with the modelling of time-to-event data
  – The outcome variable of interest is time until an
    event occurs.
     • death, disease, failure
     • recovery, marriage
  – It is called reliability theory/analysis in
    engineering, and duration analysis/modelling in
    economics or sociology.

                                                        4
Y   X

        How to build
        a probabilistic model of Y ?

        How to build
        a probabilistic model of Y given X ?




                                               6
Censoring
• A key problem in survival analysis
  – It occurs when we have some information about
    an individual's survival time, but we don’t know
    the survival time exactly.




                                                      8
Y   X


        Options:

        1) Wait for those patients to die?

        2) Discard the censored data?

        3) Use the censored data as if they were
           not censored?

        4) ……
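
A toy sketch of why options 2 and 3 can mislead. The lifetimes and the study cut-off below are invented for illustration and are not data from the talk.

```python
# Hypothetical true lifetimes (in months) of five patients; invented numbers.
true_lifetimes = [2, 4, 6, 8, 10]
study_end = 7  # observation stops here, so longer lifetimes are right-censored

# What the study actually records: (observed time, was the death seen?)
records = [(min(t, study_end), t <= study_end) for t in true_lifetimes]


def mean(values):
    return sum(values) / len(values)


true_mean = mean(true_lifetimes)
# Option 2: discard the censored patients entirely.
mean_discard = mean([t for t, seen in records if seen])
# Option 3: pretend each censoring time was a time of death.
mean_as_event = mean([t for t, _ in records])

print(true_mean, mean_discard, mean_as_event)
```

In this toy case both shortcuts underestimate the true mean lifetime, which is the motivation for estimators that use the censored records as partial information.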




                                               10
Goals
• Survival Analysis attempts to answer
  questions such as
  – What is the fraction of a population which will
    survive past a certain time? Of those that survive,
    at what rate will they die?
  – Can multiple causes of death be taken into
    account?
  – How do particular circumstances or characteristics
    increase or decrease the odds of survival?

                                                      11
• Censoring of data
• Comparing groups
   – (1 treatment vs. 2 placebo)
• Confounding or Interaction
  factors
   – Log WBC




                                   12
Why Is It Useful

for Online Marketing etc.




                            13
The Data Are There
• Events meaningful to online marketing
  – Time to Clicking the Ad
  – Informational: Time to Finding the Wanted Info
  – Transactional: Time to Buying the Product
  – Social: Time to Joining/Leaving the Community
  – ……

                                     Time Matters!

                                                     14
Evidence-Based Marketing
• Let’s work as (real) doctors
  – Users = Patients
  – Advertisement (Marketing) = Treatment

                      Survival Analysis brings
                        the time dimension
                      back to the centre stage.



                                                  15
Predict whether, and when, a new question asked on Stack Overflow will be closed
                                                                 19
Case Study

The Departure Dynamics of
    Wikipedia Editors



                            20
About 90,000 regularly active volunteer editors around the world     21
Departure Dynamics
• Who are likely to “die”?
• How soon will they “die”?
• Why do they “die”?

     “live”= stay in the editors’ community
           = keep editing
     “die” = leave the editors’ community
           = stop editing (for 5 months)
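
A minimal sketch of how this definition could turn edit timestamps into survival records. The function name `label_editor`, the 150-day reading of "5 months", and the choice to censor still-active editors at the end of observation are assumptions made for illustration, not details taken from the talk.

```python
from datetime import date

INACTIVITY_DAYS = 150  # assumption: "5 months" approximated as 150 days


def label_editor(first_edit, last_edit, observation_end,
                 inactivity_days=INACTIVITY_DAYS):
    """Return (lifetime_in_days, died) for one editor (hypothetical helper)."""
    silent_days = (observation_end - last_edit).days
    died = silent_days >= inactivity_days
    if died:
        # "Death" observed: lifetime runs from first to last edit.
        lifetime = (last_edit - first_edit).days
    else:
        # Still "alive" when observation stops: a right-censored record.
        lifetime = (observation_end - first_edit).days
    return lifetime, died


gone = label_editor(date(2010, 1, 1), date(2010, 3, 1), date(2011, 2, 1))
active = label_editor(date(2010, 1, 1), date(2011, 1, 15), date(2011, 2, 1))
print(gone, active)
```

The second editor has not been silent long enough to count as "dead", so the record only says the lifetime is at least the observed span.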

                                              23
Who are likely to “die”?

      (WikiChallenge)




                           24
[Figure: two timelines, marked 2001-01-01, 2010-04-01, 2010-09-01 and 2001-06-01, 2010-09-01, 2011-02-01]

                                                        26
Behavioural Dynamics Features




Exponential Steps

                    months

                     Web Search (SIGIR-2009),
                     Social Tagging (WWW-2009),
                     Language Modelling (ICTIR-2009)
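
One way to read "exponential steps" is to count a user's edits in look-back windows whose lengths double each time. The sketch below follows that reading; the function name, the window lengths, and the sample data are hypothetical and are not taken from the slides.

```python
def exponential_window_counts(edit_days, cutoff_day, n_windows=4, shortest=1.0):
    """Count edits in the windows (cutoff - shortest * 2**i, cutoff].

    edit_days: edit times expressed as day offsets (hypothetical format).
    Returns one count per window, from the shortest to the longest window.
    """
    counts = []
    for i in range(n_windows):
        width = shortest * 2 ** i
        in_window = [d for d in edit_days if cutoff_day - width < d <= cutoff_day]
        counts.append(len(in_window))
    return counts


# Invented edit history: six edits, feature cut-off at day 100.
features = exponential_window_counts([99.5, 98, 95, 90, 70, 10], cutoff_day=100)
print(features)
```

Doubling windows give fine resolution for recent activity and coarse resolution for older activity with only a handful of features.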

                                                  28
Gradient Boosted Trees (GBT)




                                         32
                       © 2008-2012 ~maniraptora
Gradient Boosted Trees (GBT)
• The success of GBT in our task is probably
  attributable to
  – its ability to capture the complex nonlinear
    relationship between the target variable and the
    features,
  – its insensitivity to different feature value ranges as
    well as outliers, and
  – its resistance to overfitting via regularisation
    mechanisms such as shrinkage and subsampling
    (Friedman 1999a; 1999b).
• GBT vs. Random Forests (RF)
                                                             33
Final Result
• The 2nd best valid algorithm in the
  WikiChallenge
  – RMSLE (root mean squared logarithmic error) =
    0.862582: a 41.7% improvement over the Wikimedia
    Foundation's (WMF's) in-house solution
  – Much simpler model than the top-performing
    system: 21 behavioural dynamics features vs. 206
    features
  – WMF is now implementing this algorithm
    permanently and looks forward to using it in the
    production environment.
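
For reference, RMSLE can be computed as below. This is a generic sketch of the metric, not the competition's scoring code, and the example numbers are invented.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length lists."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Invented edit counts: predictions vs. what the editors actually did.
score = rmsle([0, 3, 10], [0, 5, 8])
print(score)
```

Because the error is taken on log(1 + x), missing a prolific editor's count by a few edits costs much less than predicting activity for an editor who made none.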

                                                    38
How soon will they “die”?




                            39
[Figure: birth & death, 110,000 random samples; axis label: January 2001]
              The evolution of Wikipedia editors' community.
                                                               40
[Figure: active editors, 110,000 random samples; axis label: January 2001]
              The evolution of Wikipedia editors' community.
                                                               41
Survival Function

What is the fraction of a population which
will survive past a certain time?




                                             42
Occasional Editors                   Customary Editors




    The histogram of Wikipedia editors' lifetime.        43
Kaplan-Meier Estimator
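
The Kaplan-Meier (product-limit) estimate multiplies, over each observed event time t_i up to t, the factor (1 - d_i / n_i), where d_i is the number of events at t_i and n_i the number still at risk just before t_i. A minimal pure-Python sketch follows; the function name and the toy data are illustrative and are not the Wikipedia data.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    durations: time until the event or until censoring.
    observed:  True if the event was seen, False if the record is censored.
    """
    deaths = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data: three observed "deaths" and two records censored at time 7.
curve = kaplan_meier([2, 4, 6, 7, 7], [True, True, True, False, False])
print(curve)
```

Censored records never lower the curve themselves, but they stay in the risk set until their censoring time, which is how the estimator uses their partial information.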




                         44
The empirical survival function.   46
Normal Distribution: Probability Plot          47
Extreme Value Distribution: Probability Plot   48
Rayleigh Distribution: Probability Plot        49
Exponential Distribution: Probability Plot     50
Lognormal Distribution: Probability Plot       51
Weibull Distribution: Probability Plot         52
The survival function.   53
Weibull distribution
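
For a Weibull distribution with shape k and scale λ, the survival function is S(t) = exp(-(t/λ)^k) and the median is λ (ln 2)^(1/k). The sketch below uses the 53-day median reported in the talk, but the shape value 0.5 is a made-up placeholder: the fitted parameters are not given in this transcript.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_median(shape, scale):
    """Time at which the survival function crosses 0.5."""
    return scale * math.log(2) ** (1.0 / shape)


shape = 0.5           # hypothetical value, not the fitted one
median_days = 53.0    # median lifetime reported in the talk
# Solve the median formula for the scale that matches the reported median.
scale = median_days / math.log(2) ** (1.0 / shape)

print(scale, weibull_survival(median_days, shape, scale))
```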




                       54
Expected Future Lifetime




              median lifetime: 53 days


                                         55
Hazard Function
Of those that survive, at what rate will they die?




   The instantaneous potential per unit time for the event to occur,
   given that the individual has survived up to time t.
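
The definition can be checked numerically: over a tiny interval dt, the hazard is roughly (S(t) - S(t + dt)) / (dt · S(t)). The sketch below applies this to two textbook survival functions with made-up parameters; it is an illustration, not the estimator used in the talk.

```python
import math


def numeric_hazard(survival, t, dt=1e-6):
    """Approximate h(t) as the conditional event probability per unit time."""
    s = survival(t)
    return (s - survival(t + dt)) / (dt * s)


def exponential_survival(t, rate=0.1):      # hypothetical rate
    return math.exp(-rate * t)


def weibull_survival(t, shape=0.5, scale=100.0):  # hypothetical parameters
    return math.exp(-((t / scale) ** shape))


times = (1.0, 10.0, 50.0)
h_exp = [numeric_hazard(exponential_survival, t) for t in times]
h_wei = [numeric_hazard(weibull_survival, t) for t in times]
print(h_exp, h_wei)
```

The exponential distribution gives a flat hazard, while a Weibull with shape below 1 gives a hazard that falls over time.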

                                                                   56
Bathtub Curve




http://en.wikipedia.org/wiki/Bathtub_curve   57
The hazard function.   58
The hazard function.   59
Conclusions
• For customary Wikipedia editors,
  – the survival function can be well described by a
    Weibull distribution (with a median lifetime of
    about 53 days);
  – there are two critical phases (0-2 weeks and 8-20
    weeks) when the hazard rate of becoming inactive
    increases;
  – more active editors tend to stay active
    for a longer time.

                                                     60
Why do they “die”?




                     61
Covariates
Last Edit




                    62
Cox Proportional Hazards Model




                                 65
Semi-Parametric
• The Cox model owes much of its popularity to
  being semi-parametric
  – The baseline hazard is left unspecified
  – Robust: its results closely approximate those of
    the correct parametric model
  – It makes a minimum of assumptions
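
In the Cox model the hazard is h(t | x) = h0(t) · exp(Σ β_i x_i), so the ratio of two individuals' hazards does not depend on the baseline h0(t). The sketch below illustrates this with invented coefficients and two arbitrary baseline hazards; none of these values come from the talk.

```python
import math


def cox_hazard(baseline_hazard, t, betas, x):
    """h(t | x) = h0(t) * exp(sum_i beta_i * x_i)."""
    linear = sum(b * xi for b, xi in zip(betas, x))
    return baseline_hazard(t) * math.exp(linear)


betas = [0.5, -0.2]   # hypothetical coefficients
x_a = [1.0, 3.0]      # individual A
x_b = [0.0, 3.0]      # individual B differs only in the first covariate


def flat_baseline(t):
    return 0.01


def rising_baseline(t):
    return 0.001 * t


ratios = [
    cox_hazard(h0, t, betas, x_a) / cox_hazard(h0, t, betas, x_b)
    for h0 in (flat_baseline, rising_baseline)
    for t in (1.0, 30.0)
]
print(ratios)
```

Every ratio equals exp(0.5) regardless of the baseline or the time, which is why the coefficients can be estimated without specifying h0(t).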




                                                      66
Cox PH vs. Logistic




                      67
Maximum Likelihood Estimation




                                68
Cox Proportional Hazards Model


Covariate               β         se        z          p
X1: namespace==Main     -0.1095   0.0172   -6.3664    0.1935e-9
X2: log(1+cur_size)     -0.0688   0.0036   -19.2474   0.0000e-9




                                                        69
Hazard Ratio
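
For a one-unit increase in a covariate, the hazard ratio is exp(β). The sketch below applies this to the β and se values reported in the Cox model coefficient table; the 95% interval uses the usual normal approximation exp(β ± 1.96·se), which is an assumption here rather than something shown in the slides.

```python
import math

# (beta, standard error) as reported in the Cox model coefficient table.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}


def hazard_ratio(beta, se, z_crit=1.96):
    """Return (HR, lower, upper) using a normal-approximation interval."""
    return (math.exp(beta),
            math.exp(beta - z_crit * se),
            math.exp(beta + z_crit * se))


results = {name: hazard_ratio(b, se) for name, (b, se) in coefficients.items()}
for name, (hr, lower, upper) in results.items():
    print(f"{name}: HR={hr:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

Both ratios come out below 1, which under the model's assumptions means each covariate is associated with a lower hazard of leaving, holding the other fixed.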




               70
Adjusted Survival Curves




                           71
Next Step




            73
Cartoon: Ron Hipschman
Data: David Hand 74
Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virginia
  – He was struck by lightning 7 times
     •   1942 (lost big-toe nail)
     •   1969 (lost eyebrows)
     •   1970 (left shoulder seared)
     •   1972 (hair set on fire)
     •   1973 (hair set on fire & legs seared)
     •   1976 (ankle injured)
     •   1977 (chest & stomach burned)
  – He committed suicide in September 1983.

                                                     75
A Lot More To Do
• Multiple Occurrences of “Death”
  – Recurrent Event Survival Analysis (e.g., based on
    Counting Process)
• Multiple Types of “Death”
  – Competing Risks Survival Analysis




                                                        76
Software Tools
• R
  – The ‘survival’ package
• Matlab
  – The ‘statistics’ toolbox
• Python
  – The ‘statsmodels’ module?




                                 77
References
• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning
  Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta
• John Wallace. How Big Data is Changing Retail Marketing Analytics.
  Webinar, Apr 2005. http://goo.gl/OlMmi
• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors
  Keep Active? In Proceedings of the 8th International Symposium on Wikis
  and Open Collaboration (WikiSym), Linz, Austria, Aug 2012.
  http://goo.gl/On3qr
• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal
  Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct
  2011. http://goo.gl/s2Dex




                                                                         78
?

    79

 

Plus de Data Science London (20)

Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
Nowcasting Business Performance
Nowcasting Business PerformanceNowcasting Business Performance
Nowcasting Business Performance
 
Numpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingNumpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunching
 
Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)
 
Big Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least SquaresBig Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least Squares
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysis
 
ACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, Today
 
Beyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignBeyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems Design
 
Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Data Science for Live Music
Data Science for Live MusicData Science for Live Music
Data Science for Live Music
 
Research at last.fm
Research at last.fmResearch at last.fm
Research at last.fm
 
Music and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryMusic and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music Industry
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with Mahout
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in Mahout
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook Users
 
Practical Magic with Incanter
Practical Magic with IncanterPractical Magic with Incanter
Practical Magic with Incanter
 

Survival Analysis of Web Users

  • 1. “Survival” Analysis of Web Users. Dell Zhang, DCSIS, Birkbeck, University of London
  • 2. Outline • What Is It • Why Is It Useful • Case Study – The Departure Dynamics of Wikipedia Editors
  • 4. Time-To-Event Data • Survival Analysis is a branch of statistics which deals with the modelling of time-to-event data – The outcome variable of interest is time until an event occurs. • death, disease, failure • recovery, marriage – It is called reliability theory/analysis in engineering, and duration analysis/modelling in economics or sociology.
  • 5. Y X How to build a probabilistic model of Y?
  • 6. Y X How to build a probabilistic model of Y? How to build a probabilistic model of Y given X?
  • 8. Censoring • A key problem in survival analysis – It occurs when we have some information about individual survival time, but we don’t know the survival time exactly.
  • 10. Y X Options: 1) Wait for those patients to die? 2) Discard the censored data? 3) Use the censored data as if they were not censored? 4) ……
  • 11. Goals • Survival Analysis attempts to answer questions such as – What is the fraction of a population which will survive past a certain time? Of those that survive, at what rate will they die? – Can multiple causes of death be taken into account? – How do particular circumstances or characteristics increase or decrease the odds of survival?
  • 12. • Censoring of data • Comparing groups – (1 treatment vs. 2 placebo) • Confounding or Interaction factors – Log WBC
  • 13. Why Is It Useful for Online Marketing etc.
  • 14. The Data Are There • Events meaningful to online marketing – Time to Clicking the Ad – Informational: Time to Finding the Wanted Info – Transactional: Time to Buying the Product – Social: Time to Joining/Leaving the Community – …… • Time Matters!
  • 15. Evidence-Based Marketing • Let’s work as (real) doctors – Users = Patients – Advertisement (Marketing) = Treatment • Survival Analysis brings the time dimension back to the centre stage.
  • 19. Predict whether a new question asked on Stack Overflow will be closed when
  • 20. Case Study: The Departure Dynamics of Wikipedia Editors
  • 21. About 90,000 regularly active volunteer editors around the world
  • 23. Departure Dynamics • Who are likely to “die”? • How soon will they “die”? • Why do they “die”? • “live” = stay in the editors’ community = keep editing • “die” = leave the editors’ community = stop editing (for 5 months)
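
The “live”/“die” definition above can be turned into one survival record per editor. The following is a minimal sketch under stated assumptions: the function name `lifetime`, the 150-day approximation of "5 months", and the choice to end a lifetime at the last edit before the first long gap are illustrative and are not taken from the slides.

```python
from datetime import date, timedelta

# Assumption: "5 months" of inactivity is approximated as 150 days.
INACTIVITY = timedelta(days=150)

def lifetime(edit_dates, observation_end, inactivity=INACTIVITY):
    """Return (duration_in_days, died) for one editor.

    The editor "dies" at the last edit before the first gap of at least
    `inactivity`. If no such gap is seen before `observation_end`, the
    record is right-censored (died=False).
    """
    edits = sorted(edit_dates)
    birth = edits[0]
    # Pair each edit with the next one; the last edit is paired with
    # the end of the observation window.
    for current, following in zip(edits, edits[1:] + [observation_end]):
        if following - current >= inactivity:
            return (current - birth).days, True
    return (observation_end - birth).days, False

# Hypothetical editors observed until 2011-01-01.
print(lifetime([date(2010, 1, 1), date(2010, 1, 20), date(2010, 2, 10)], date(2011, 1, 1)))
print(lifetime([date(2010, 11, 1), date(2010, 12, 15)], date(2011, 1, 1)))
```

An editor with no qualifying gap by the end of observation is right-censored: all we know is that the lifetime is at least the returned duration.
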
  • 24. Who are likely to “die”? (WikiChallenge)
  • 26. [Figure: timeline with the dates 2001-01-01, 2010-04-01, 2010-09-01 and 2001-06-01, 2010-09-01, 2011-02-01]
  • 28. Behavioural Dynamics Features • Exponential Steps (months) • Web Search (SIGIR-2009), Social Tagging (WWW-2009), Language Modelling (ICTIR-2009)
  • 32. Gradient Boosted Trees (GBT) (image © 2008-2012 ~maniraptora)
  • 33. Gradient Boosted Trees (GBT) • The success of GBT in our task is probably attributable to – its ability to capture the complex nonlinear relationship between the target variable and the features, – its insensitivity to different feature value ranges as well as outliers, and – its resistance to overfitting via regularisation mechanisms such as shrinkage and subsampling (Friedman 1999a; 1999b). • GBT vs RF
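
To make the two regularisation mechanisms named above concrete, here is a toy sketch of gradient boosting for squared error on a single feature, using depth-1 trees (stumps). It is not the model used in the talk: the function names, the default settings and the data are hypothetical, and a real GBT grows deeper trees over many features.

```python
import random

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all sampled x values are identical: predict the mean
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_gbt(xs, ys, n_trees=200, shrinkage=0.1, subsample=0.5, seed=0):
    """Boost stumps on residuals, with shrinkage and row subsampling."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Subsampling: each stump sees only a random fraction of the rows.
        rows = rng.sample(range(len(xs)), max(2, int(subsample * len(xs))))
        stump = fit_stump([xs[i] for i in rows], [ys[i] - preds[i] for i in rows])
        stumps.append(stump)
        # Shrinkage: only a fraction of each stump's correction is applied.
        preds = [p + shrinkage * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, shrinkage, stumps

def gbt_predict(model, x):
    base, shrinkage, stumps = model
    return base + shrinkage * sum(stump_predict(s, x) for s in stumps)

# Hypothetical nonlinear target: y = x^2 on a grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
model = fit_gbt(xs, ys)
print(gbt_predict(model, 2.0))
```

Because each split depends only on the ordering of feature values, rescaling a feature monotonically does not change the fitted stumps, which is one way to read the "insensitivity to feature value ranges" point.
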
  • 38. Final Result • The 2nd best valid algorithm in the WikiChallenge – RMSLE = 0.862582: 41.7% improvement over WMF’s in-house solution – Much simpler model than the top performing system: 21 behavioural dynamics features vs. 206 features – WMF is now implementing this algorithm permanently and looks forward to using it in the production environment.
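
For reference, RMSLE (root mean squared logarithmic error) can be computed as in the sketch below. The function name and the sample numbers are illustrative only, and the slides do not state the exact variant used for scoring; this is the common log(1 + x) form.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # log1p(x) = log(1 + x), so zero counts are handled without special cases.
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical predicted vs. actual edit counts for four editors.
print(rmsle([0, 3, 10, 120], [0, 5, 8, 200]))
```

Working on the log scale means that a miss of 80 edits for a very active editor is penalised far less than the same absolute miss for an occasional one.
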
  • 39. How soon will they “die”?
  • 40. [Figure: birth & death; 110,000 random samples; from January 2001] The evolution of Wikipedia editors' community.
  • 41. [Figure: active editors; 110,000 random samples; from January 2001] The evolution of Wikipedia editors' community.
  • 42. Survival Function • What is the fraction of a population which will survive past a certain time?
  • 43. [Figure labels: Occasional Editors, Customary Editors] The histogram of Wikipedia editors' lifetime.
  • 46. The empirical survival function.
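
The slides do not say which estimator produced the empirical survival function. One standard choice that copes with right-censored records is the Kaplan-Meier estimator, sketched below with hypothetical data; `kaplan_meier` is an illustrative name, not a library function.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: time until death or until censoring, one per individual.
    observed:  True if the death was seen, False if the record is censored.
    """
    event_times = sorted({t for t, d in zip(durations, observed) if d})
    survival, curve = 1.0, []
    for t in event_times:
        # Everyone whose duration reaches t is still at risk just before t.
        at_risk = sum(1 for u in durations if u >= t)
        deaths = sum(1 for u, d in zip(durations, observed) if u == t and d)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical lifetimes in days; False marks editors still active at the end.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 3))
```

Censored individuals never count as deaths, but they stay in the at-risk set up to their censoring time, so the information they carry is used rather than discarded.
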
  • 47. Normal Distribution: Probability Plot
  • 48. Extreme Value Distribution: Probability Plot
  • 49. Rayleigh Distribution: Probability Plot
  • 50. Exponential Distribution: Probability Plot
  • 51. Lognormal Distribution: Probability Plot
  • 52. Weibull Distribution: Probability Plot
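
A Weibull probability plot rests on a linearisation: if S(t) = exp(-(t/lam)^k), then ln(-ln S(t)) = k ln t - k ln lam, so the points (ln t, ln(-ln S(t))) fall on a straight line with slope k. The sketch below fits that line by least squares. The function names and the parameter values are hypothetical and are not the estimates from the talk.

```python
import math

def weibull_fit_from_survival(times, survival):
    """Fit a line through (ln t, ln(-ln S(t))); return (shape k, scale lam).

    Requires t > 0 and 0 < S(t) < 1 for every point.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(s)) for s in survival]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # slope = k and intercept = -k * ln(lam), hence lam = exp(-intercept / k).
    return slope, math.exp(-intercept / slope)

def weibull_median(shape, scale):
    """Time at which S(t) = 0.5 for a Weibull with the given parameters."""
    return scale * math.log(2) ** (1 / shape)

# Hypothetical survival values generated from a known Weibull curve.
true_shape, true_scale = 0.7, 100.0
times = [1, 5, 20, 80, 300]
survival = [math.exp(-((t / true_scale) ** true_shape)) for t in times]
shape, scale = weibull_fit_from_survival(times, survival)
print(shape, scale, weibull_median(shape, scale))
```

On real data the points only approximately follow a line, and how close they come is what the probability plots for the candidate distributions compare.
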
  • 55. Expected Future Lifetime • median lifetime: 53 days
  • 56. Hazard Function • Of those that survive, at what rate will they die? • The instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.
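
In symbols, writing T for the lifetime, f for its density and S for the survival function (standard notation, not taken from the slides), the verbal definition of the hazard corresponds to:

```latex
h(t) \;=\; \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     \;=\; \frac{f(t)}{S(t)}
     \;=\; -\frac{d}{dt} \ln S(t),
\qquad S(t) = P(T > t).
```

The hazard is a rate rather than a probability, so it can exceed 1.
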
  • 60. Conclusions • For customary Wikipedia editors, – the survival function can be well described by a Weibull distribution (with a median lifetime of about 53 days); – there are two critical phases (0-2 weeks and 8-20 weeks) when the hazard rate of becoming inactive increases; – more active editors tend to keep editing for a longer time.
  • 61. Why do they “die”?
  • 66. Semi-Parametric • The semi-parametric property of the Cox model => its popularity – The baseline hazard is unspecified – Robust: it will closely approximate the correct parametric model – Using a minimum of assumptions
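
For reference, the standard form of the Cox proportional hazards model, in conventional textbook notation rather than notation taken from the slides, is:

```latex
h(t \mid x) \;=\; h_0(t)\, \exp\!\Big( \sum_{i=1}^{p} \beta_i x_i \Big),
\qquad
\frac{h(t \mid x)}{h(t \mid x^{*})} \;=\; \exp\!\Big( \sum_{i=1}^{p} \beta_i \,(x_i - x_i^{*}) \Big).
```

The baseline hazard h_0(t) is left unspecified, which is the semi-parametric part. It cancels in the ratio, so the hazard ratio between two covariate profiles does not depend on t.
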
  • 67. Cox PH vs. Logistic
  • 69. Cox Proportional Hazards Model
        covariate              β         se       z          p
        X1: namespace==Main    -0.1095   0.0172   -6.3664    0.1935e-9
        X2: log(1+cur_size)    -0.0688   0.0036   -19.2474   0.0000e-9
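
Cox coefficients are usually read as hazard ratios, exp(β). The sketch below converts the two reported coefficients and adds approximate 95% Wald intervals. The helper `hazard_ratio` and the use of a normal-approximation interval are assumptions for illustration; the slides report only β, se, z and p.

```python
import math

def hazard_ratio(beta, se, z_crit=1.96):
    """Return (HR, lower, upper): exp(beta) with an approximate 95% Wald interval."""
    return (math.exp(beta),
            math.exp(beta - z_crit * se),
            math.exp(beta + z_crit * se))

# Coefficients and standard errors as reported on the slide.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}

for name, (beta, se) in coefficients.items():
    hr, lower, upper = hazard_ratio(beta, se)
    print(f"{name}: HR={hr:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Both hazard ratios are below 1, so under the usual reading each covariate is associated with a lower hazard of "dying", holding the other fixed.
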
  • 73. Next Step
  • 75. Lightning Does Strike Twice! • Roy Sullivan, a former park ranger from Virginia – He was struck by lightning 7 times • 1942 (lost big-toe nail) • 1969 (lost eyebrows) • 1970 (left shoulder seared) • 1972 (hair set on fire) • 1973 (hair set on fire & legs seared) • 1976 (ankle injured) • 1977 (chest & stomach burned) – He committed suicide in September 1983.
  • 76. A Lot More To Do • Multiple Occurrences of “Death” – Recurrent Event Survival Analysis (e.g., based on Counting Process) • Multiple Types of “Death” – Competing Risks Survival Analysis
  • 77. Software Tools • R – The ‘survival’ package • Matlab – The ‘statistics’ toolbox • Python – The ‘statsmodels’ module?
  • 78. References • David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta • John Wallace. How Big Data is Changing Retail Marketing Analytics. Webinar, Apr 2005. http://goo.gl/OlMmi • Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors Keep Active? In Proceedings of the 8th International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, Aug 2012. http://goo.gl/On3qr • Dell Zhang. Wikipedia Edit Number Prediction based on Temporal Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct 2011. http://goo.gl/s2Dex
  • 79. ?