“Survival” Analysis of
     Web Users
             Dell Zhang
DCSIS, Birkbeck, University of London



                                        1
Outline
• What Is It
• Why Is It Useful
• Case Study
  – The Departure Dynamics of Wikipedia Editors




                                                  2
What Is It




             3
Time-To-Event Data
• Survival Analysis is a branch of statistics which
  deals with the modelling of time-to-event data
  – The outcome variable of interest is time until an
    event occurs.
     • death, disease, failure
     • recovery, marriage
  – It is called reliability theory/analysis in
    engineering, and duration analysis/modelling in
    economics or sociology.

                                                        4
Y   X

        How to build
        a probabilistic model of Y ?




                                       5
Y   X

        How to build
        a probabilistic model of Y ?

        How to build
        a probabilistic model of Y given X ?




                                               6
Censoring
• A key problem in survival analysis
  – It occurs when we have some information about
    individual survival time, but we don’t know the
    survival time exactly.




                                                      8
Y   X


        Options:

        1) Wait for those patients to die?

        2) Discard the censored data?

        3) Use the censored data as if they were
           not censored?

        4) ……
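
Options 2 and 3 above can be illustrated with a small sketch. It assumes the usual representation of right-censored data as a duration plus an "event observed" flag; the `Observation` class, the helper names, and the four toy records are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    duration: float  # time under observation
    observed: bool   # True if the event happened, False if right-censored


def mean_discarding_censored(observations):
    """Option 2: drop censored records, then average the remaining durations."""
    events = [o.duration for o in observations if o.observed]
    return sum(events) / len(events)


def mean_ignoring_censoring(observations):
    """Option 3: treat every record as if the event had been observed."""
    return sum(o.duration for o in observations) / len(observations)


# Hypothetical toy data: two observed events, two subjects still alive
# when the study ended (so their true survival times exceed 12 and 20).
sample = [
    Observation(5, True),
    Observation(8, True),
    Observation(12, False),
    Observation(20, False),
]

print(mean_discarding_censored(sample))
print(mean_ignoring_censoring(sample))
```

In this toy sample both shortcuts understate the true mean survival time, because the two censored subjects are known to have survived longer than their recorded durations.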




                                               10
Goals
• Survival Analysis attempts to answer
  questions such as
  – What is the fraction of a population which will
    survive past a certain time? Of those that survive,
    at what rate will they die?
  – Can multiple causes of death be taken into
    account?
  – How do particular circumstances or characteristics
    increase or decrease the odds of survival?

                                                      11
• Censoring of data
• Comparing groups
   – (1 treatment vs. 2 placebo)
• Confounding or Interaction
  factors
   – Log WBC




                                   12
Why Is It Useful

for Online Marketing etc.




                            13
The Data Are There
• Events meaningful to online marketing
  – Time to Clicking the Ad
  – Informational: Time to Finding the Wanted Info
  – Transactional: Time to Buying the Product
  – Social: Time to Joining/Leaving the Community
  – ……

                                     Time Matters!

                                                     14
Evidence-Based Marketing
• Let’s work as (real) doctors
  – Users = Patients
  – Advertisement (Marketing) = Treatment

                      Survival Analysis brings
                        the time dimension
                      back to the centre stage.



                                                  15
Predict whether (or rather, when) a new question asked on Stack Overflow will be closed
                                                                 19
Case Study

The Departure Dynamics of
    Wikipedia Editors



                            20
About 90,000 regularly active volunteer editors around the world
Departure Dynamics
• Who are likely to “die”?
• How soon will they “die”?
• Why do they “die”?

     “live”= stay in the editors’ community
           = keep editing
     “die” = leave the editors’ community
           = stop editing (for 5 months)

                                              23
Who are likely to “die”?

      (WikiChallenge)




                           24
[Figure: two timelines. Top: 2001-01-01, 2010-04-01, 2010-09-01. Bottom: 2001-06-01, 2010-09-01, 2011-02-01.]



                                                        26
Behavioural Dynamics Features




Exponential Steps

                    months

                     Web Search (SIGIR-2009),
                     Social Tagging (WWW-2009),
                     Language Modelling (ICTIR-2009)

                                                  28
Gradient Boosted Trees (GBT)




                                         32
                       © 2008-2012 ~maniraptora
Gradient Boosted Trees (GBT)
• The success of GBT in our task is probably
  attributable to
  – its ability to capture the complex nonlinear
    relationship between the target variable and the
    features,
  – its insensitivity to different feature value ranges as
    well as outliers, and
  – its resistance to overfitting via regularisation
    mechanisms such as shrinkage and subsampling
    (Friedman 1999a; 1999b).
• GBT vs RF
                                                             33
Final Result
• The 2nd best valid algorithm in the
  WikiChallenge
  – RMSLE = 0.862582: 41.7% improvement over
    WMF’s in-house solution
  – Much simpler model than the top-performing
    system: 21 behavioural dynamics features vs. 206
    features
  – WMF is now implementing this algorithm
    permanently and looks forward to using it in the
    production environment.
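
For reference, a minimal sketch of the RMSLE metric quoted above, assuming the standard definition (root mean squared error between log(1 + predicted) and log(1 + actual)). The function name and the input checks are illustrative choices, not part of the WikiChallenge code.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical edit counts for three editors.
print(rmsle([0, 3, 10], [0, 5, 8]))
```

The log transform means the metric penalises relative rather than absolute errors, which suits heavy-tailed counts such as the number of edits per editor.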

                                                    38
How soon will they “die”?




                            39
110,000 random samples         birth & death




     January 2001


              The evolution of Wikipedia editors' community.
                                                               40
110,000 random samples         active editors




     January 2001


              The evolution of Wikipedia editors' community.
                                                               41
Survival Function

What is the fraction of a population which
will survive past a certain time?




                                             42
Occasional Editors                   Customary Editors




    The histogram of Wikipedia editors' lifetime.        43
Kaplan-Meier Estimator
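
The slide's formula is not preserved in this transcript. A minimal sketch of the standard product-limit estimator follows: S(t) is the product, over event times t_i <= t, of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i the number still at risk just before t_i. The function name and the toy data are illustrative and are not taken from the Wikipedia study.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where at least one event occurs.

    durations: time under observation for each subject
    observed:  True if the event was seen, False if right-censored
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all subjects whose recorded time equals t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical toy data: five subjects, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects never trigger a drop in the curve, but they still count in the risk set up to their censoring time, which is how the estimator uses the partial information they carry.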




                         44
The empirical survival function.   46
Normal Distribution




                      47
 Probability Plot
Extreme Value Distribution




                             48
    Probability Plot
Rayleigh Distribution




                        49
 Probability Plot
Exponential Distribution




                           50
   Probability Plot
Lognormal Distribution




                         51
  Probability Plot
Weibull Distribution




                       52
 Probability Plot
The survival function.   53
Weibull distribution
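
The slide's formula is not preserved here. As a sketch, assuming the common scale/shape parameterisation, the Weibull survival function is S(t) = exp(-(t / scale)^shape) and its median is scale * (ln 2)^(1 / shape). The parameter values below are hypothetical and are not the values fitted to the Wikipedia data.

```python
import math


def weibull_survival(t, scale, shape):
    """S(t) = exp(-(t / scale) ** shape) for t >= 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_median(scale, shape):
    """Time at which the survival function equals 0.5."""
    return scale * math.log(2) ** (1 / shape)


# Hypothetical parameters, chosen only for illustration.
scale, shape = 100.0, 0.5
median = weibull_median(scale, shape)
print(median, weibull_survival(median, scale, shape))
```

A shape parameter below 1 gives a hazard that decreases over time, a value of 1 reduces to the exponential distribution, and a value above 1 gives an increasing hazard.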




                       54
Expected Future Lifetime




              median lifetime: 53 days


                                         55
Hazard Function
Of those that survive, at what rate will they die?




   The instantaneous potential per unit time for the event to occur,
   given that the individual has survived up to time t.
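
In symbols, the standard textbook definition reads as follows, where T is the survival time, f its density and S the survival function (this notation is an assumption, since the slide's own formula is not preserved):

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)}
     = -\frac{d}{dt} \ln S(t)
```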

                                                                   56
Bathtub Curve




http://en.wikipedia.org/wiki/Bathtub_curve   57
The hazard function.   58
The hazard function.   59
Conclusions
• For customary Wikipedia editors,
  – the survival function can be well described by a
    Weibull distribution (with the median lifetime of
    about 53 days);
  – there are two critical phases (0-2 weeks and 8-20
    weeks) when the hazard rate of becoming inactive
    increases;
  – more active editors tend to stay active for a
    longer time.

                                                     60
Why do they “die”?




                     61
Covariates
Last
Edit




                    62
Cox Proportional Hazards Model
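
The slide's formula is not preserved in this transcript. The standard form of the model, with h_0(t) the unspecified baseline hazard and X_1, ..., X_p the covariates (notation assumed here), is:

```latex
h(t \mid X) = h_0(t) \, \exp\!\Big( \sum_{i=1}^{p} \beta_i X_i \Big)
```

Because h_0(t) cancels when two individuals are compared, the ratio of their hazards depends only on the covariates and the coefficients, not on time.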




                                 65
Semi-Parametric
• The semi-parametric property of the Cox
  model => its popularity
  – The baseline hazard is unspecified
  – Robust: it will closely approximate the correct
    parametric model
  – Using a minimum of assumptions




                                                      66
Cox PH vs. Logistic




                      67
Maximum Likelihood Estimation




                                68
Cox Proportional Hazards Model


      Covariate               β         se        z          p
      X1: namespace==Main   -0.1095   0.0172   -6.3664    0.1935e-9
      X2: log(1+cur_size)   -0.0688   0.0036   -19.2474   0.0000e-9




                                                        69
Hazard Ratio
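
As a sketch, the hazard ratio for a one-unit increase in a covariate is exp(β). The code below applies this to the two coefficients reported in the Cox model table; the 95% interval uses the usual Wald approximation exp(β ± 1.96·se), which is an assumption here rather than something stated on the slides.

```python
import math


def hazard_ratio(beta):
    """Hazard ratio for a one-unit increase in the covariate."""
    return math.exp(beta)


def hazard_ratio_interval(beta, se, z=1.96):
    """Approximate Wald confidence interval for the hazard ratio."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# (β, se) pairs copied from the Cox model table.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}

for name, (beta, se) in coefficients.items():
    low, high = hazard_ratio_interval(beta, se)
    print(f"{name}: HR = {hazard_ratio(beta):.4f}, "
          f"95% CI = ({low:.4f}, {high:.4f})")
```

Both ratios come out below 1 (roughly 0.896 and 0.934), so under this model each covariate is associated with a lower hazard of leaving.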




               70
Adjusted Survival Curves




                           71
Next Step




            73
Cartoon: Ron Hipschman
Data: David Hand 74
Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virginia
  – He was struck by lightning 7 times
     •   1942 (lost big-toe nail)
     •   1969 (lost eyebrows)
     •   1970 (left shoulder seared)
     •   1972 (hair set on fire)
     •   1973 (hair set on fire & legs seared)
     •   1976 (ankle injured)
     •   1977 (chest & stomach burned)
  – He committed suicide in September 1983.

                                                     75
A Lot More To Do
• Multiple Occurrences of “Death”
  – Recurrent Event Survival Analysis (e.g., based on
    Counting Process)
• Multiple Types of “Death”
  – Competing Risks Survival Analysis




                                                        76
Software Tools
• R
  – The ‘survival’ package
• Matlab
  – The ‘statistics’ toolbox
• Python
  – The ‘statsmodels’ module?




                                 77
References
• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning
  Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta
• John Wallace. How Big Data is Changing Retail Marketing Analytics.
  Webinar, Apr 2005. http://goo.gl/OlMmi
• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors
  Keep Active? In Proceedings of the 8th International Symposium on Wikis
  and Open Collaboration (WikiSym), Linz, Austria, Aug 2012.
  http://goo.gl/On3qr
• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal
  Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct
  2011. http://goo.gl/s2Dex




                                                                         78
?

    79
80

Contenu connexe

Tendances

Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Appsilon Data Science
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural NetworksNatan Katz
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringDKALab
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Shallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender SystemShallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender SystemAnoop Deoras
 
Learning a Personalized Homepage
Learning a Personalized HomepageLearning a Personalized Homepage
Learning a Personalized HomepageJustin Basilico
 
LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019Faisal Siddiqi
 
GAN - Theory and Applications
GAN - Theory and ApplicationsGAN - Theory and Applications
GAN - Theory and ApplicationsEmanuele Ghelfi
 
Practical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit AlgorithmsPractical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit AlgorithmsSC5.io
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsT212
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes ClassifierArunabha Saha
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Alexandros Karatzoglou
 
Graph Neural Network - Introduction
Graph Neural Network - IntroductionGraph Neural Network - Introduction
Graph Neural Network - IntroductionJungwon Kim
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersSeunghyun Hwang
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorJinwon Lee
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descentkandelin
 

Tendances (20)

Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networks
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative Filtering
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Shallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender SystemShallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender System
 
Learning a Personalized Homepage
Learning a Personalized HomepageLearning a Personalized Homepage
Learning a Personalized Homepage
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019
 
GAN - Theory and Applications
GAN - Theory and ApplicationsGAN - Theory and Applications
GAN - Theory and Applications
 
Practical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit AlgorithmsPractical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit Algorithms
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial
 
Graph Neural Network - Introduction
Graph Neural Network - IntroductionGraph Neural Network - Introduction
Graph Neural Network - Introduction
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descent
 
Shap
ShapShap
Shap
 

En vedette

Subscription Survival Analysis
Subscription Survival AnalysisSubscription Survival Analysis
Subscription Survival AnalysisTheDataNation
 
Amazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel AvivSelf Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel AvivAmazon Web Services
 
OAuth 2.0 refresher Talk
OAuth 2.0 refresher TalkOAuth 2.0 refresher Talk
OAuth 2.0 refresher Talkmarcwan
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
 
Architecting for Greater Security on AWS
Architecting for Greater Security on AWSArchitecting for Greater Security on AWS
Architecting for Greater Security on AWSAmazon Web Services
 
Py.test
Py.testPy.test
Py.testsoasme
 
Core deposits sensitivity and survival analysis
Core deposits sensitivity and survival analysisCore deposits sensitivity and survival analysis
Core deposits sensitivity and survival analysisLaura Roberts, Ph.D.
 
Masterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM RolesMasterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM RolesMalcolm Duncanson, CISSP
 
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS LambdaAmazon Web Services
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
 
Survival Analysis for Predicting Employee Turnover
Survival Analysis for Predicting Employee TurnoverSurvival Analysis for Predicting Employee Turnover
Survival Analysis for Predicting Employee TurnoverTom Briggs
 

En vedette (20)

Survival analysis
Survival analysisSurvival analysis
Survival analysis
 
Subscription Survival Analysis
Subscription Survival AnalysisSubscription Survival Analysis
Subscription Survival Analysis
 
Survival Analysis Project
Survival Analysis Project Survival Analysis Project
Survival Analysis Project
 
Amazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from Amazon
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel AvivSelf Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
 
OAuth 2.0 refresher Talk
OAuth 2.0 refresher TalkOAuth 2.0 refresher Talk
OAuth 2.0 refresher Talk
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 
Architecting for Greater Security on AWS
Architecting for Greater Security on AWSArchitecting for Greater Security on AWS
Architecting for Greater Security on AWS
 
Py.test
Py.testPy.test
Py.test
 
Path Analysis (Camino de Senderos)
Path Analysis (Camino de Senderos)Path Analysis (Camino de Senderos)
Path Analysis (Camino de Senderos)
 
Core deposits sensitivity and survival analysis
Core deposits sensitivity and survival analysisCore deposits sensitivity and survival analysis
Core deposits sensitivity and survival analysis
 
Masterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM RolesMasterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM Roles
 
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
 
SIP
SIP SIP
SIP
 
An introduction to weibull analysis
An introduction to weibull analysisAn introduction to weibull analysis
An introduction to weibull analysis
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Path Analysis
Path AnalysisPath Analysis
Path Analysis
 
Path analysis
Path analysisPath analysis
Path analysis
 
Survival Analysis for Predicting Employee Turnover
Survival Analysis for Predicting Employee TurnoverSurvival Analysis for Predicting Employee Turnover
Survival Analysis for Predicting Employee Turnover
 

Similaire à Survival Analysis of Web Users

From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0
From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0
From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0Xavier Llorà
 
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdfVincenzo Lomonaco
 
math bio for 1st year math students
math bio for 1st year math studentsmath bio for 1st year math students
math bio for 1st year math studentsBen Bolker
 
Segmentation for Targeting
Segmentation for TargetingSegmentation for Targeting
Segmentation for TargetingMarcelo Salup
 
Tale of the Knowledge Organization In an Age of Wicked Problems
Tale of the Knowledge Organization In an Age of Wicked ProblemsTale of the Knowledge Organization In an Age of Wicked Problems
Tale of the Knowledge Organization In an Age of Wicked ProblemsGomindSHIFT
 
Crowdsourcing for HCI Research with Amazon Mechanical Turk
Crowdsourcing for HCI Research with Amazon Mechanical TurkCrowdsourcing for HCI Research with Amazon Mechanical Turk
Crowdsourcing for HCI Research with Amazon Mechanical TurkEd Chi
 
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3Dr. Aparna Varde
 
Philosophy of Big Data: Big Data, the Individual, and Society
Philosophy of Big Data: Big Data, the Individual, and SocietyPhilosophy of Big Data: Big Data, the Individual, and Society
Philosophy of Big Data: Big Data, the Individual, and SocietyMelanie Swan
 
6 Radical Work Changes In Next Decade
6 Radical Work Changes In Next Decade6 Radical Work Changes In Next Decade
6 Radical Work Changes In Next DecadeJeff Hurt
 
TensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsTensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsSeldon
 
'Living Lab' for HCI - presentation made at HCI International 2009
'Living Lab' for HCI - presentation made at HCI International 2009'Living Lab' for HCI - presentation made at HCI International 2009
'Living Lab' for HCI - presentation made at HCI International 2009Ed Chi
 
Kain07109 google-091010182704-phpapp01 (1)
Kain07109 google-091010182704-phpapp01 (1)Kain07109 google-091010182704-phpapp01 (1)
Kain07109 google-091010182704-phpapp01 (1)Rob Blaauboer
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KBenjamin Good
 
IS Undergrads Class 19
IS Undergrads Class 19IS Undergrads Class 19
IS Undergrads Class 19Joao Cunha
 
Building a Biomedical Knowledge Garden
Building a Biomedical Knowledge Garden Building a Biomedical Knowledge Garden
Building a Biomedical Knowledge Garden Benjamin Good
 
Myp unit introduction microorganisms
Myp unit introduction microorganismsMyp unit introduction microorganisms
Myp unit introduction microorganismsaimorales
 
Stories for survival and succes in nature and in business
Stories for survival and succes in nature and in businessStories for survival and succes in nature and in business
Stories for survival and succes in nature and in businessVictor Van Rij
 

Similaire à Survival Analysis of Web Users (20)

From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0
From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0
From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0
 
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
 
math bio for 1st year math students
math bio for 1st year math studentsmath bio for 1st year math students
math bio for 1st year math students
 
Segmentation for Targeting
Segmentation for TargetingSegmentation for Targeting
Segmentation for Targeting
 
Tale of the Knowledge Organization In an Age of Wicked Problems
Tale of the Knowledge Organization In an Age of Wicked ProblemsTale of the Knowledge Organization In an Age of Wicked Problems
Tale of the Knowledge Organization In an Age of Wicked Problems
 
DeepLabCut AI Residency
DeepLabCut AI ResidencyDeepLabCut AI Residency
DeepLabCut AI Residency
 
Crowdsourcing for HCI Research with Amazon Mechanical Turk
Crowdsourcing for HCI Research with Amazon Mechanical TurkCrowdsourcing for HCI Research with Amazon Mechanical Turk
Crowdsourcing for HCI Research with Amazon Mechanical Turk
 
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
 
Philosophy of Big Data: Big Data, the Individual, and Society
Philosophy of Big Data: Big Data, the Individual, and SocietyPhilosophy of Big Data: Big Data, the Individual, and Society
Philosophy of Big Data: Big Data, the Individual, and Society
 
6 Radical Work Changes In Next Decade
6 Radical Work Changes In Next Decade6 Radical Work Changes In Next Decade
6 Radical Work Changes In Next Decade
 
TensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsTensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative models
 
'Living Lab' for HCI - presentation made at HCI International 2009
'Living Lab' for HCI - presentation made at HCI International 2009'Living Lab' for HCI - presentation made at HCI International 2009
'Living Lab' for HCI - presentation made at HCI International 2009
 
Kain07109 google-091010182704-phpapp01 (1)
Kain07109 google-091010182704-phpapp01 (1)Kain07109 google-091010182704-phpapp01 (1)
Kain07109 google-091010182704-phpapp01 (1)
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2K
 
IS Undergrads Class 19
IS Undergrads Class 19IS Undergrads Class 19
IS Undergrads Class 19
 
Building a Biomedical Knowledge Garden
Building a Biomedical Knowledge Garden Building a Biomedical Knowledge Garden
Building a Biomedical Knowledge Garden
 
Myp unit introduction microorganisms
Myp unit introduction microorganismsMyp unit introduction microorganisms
Myp unit introduction microorganisms
 
PhD defense
PhD defensePhD defense
PhD defense
 
Stories for survival and succes in nature and in business
Stories for survival and succes in nature and in businessStories for survival and succes in nature and in business
Stories for survival and succes in nature and in business
 
Change
ChangeChange
Change
 

Plus de Data Science London

Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Survival Analysis of Web Users

  • 1. “Survival” Analysis of Web Users. Dell Zhang, DCSIS, Birkbeck, University of London
  • 2. Outline • What Is It • Why Is It Useful • Case Study – The Departure Dynamics of Wikipedia Editors
  • 4. Time-To-Event Data • Survival Analysis is a branch of statistics which deals with the modelling of time-to-event data – The outcome variable of interest is the time until an event occurs. • death, disease, failure • recovery, marriage – It is called reliability theory/analysis in engineering, and duration analysis/modelling in economics or sociology.
  • 5. Y X How to build a probabilistic model of Y?
  • 6. Y X How to build a probabilistic model of Y? How to build a probabilistic model of Y given X?
  • 8. Censoring • A key problem in survival analysis – It occurs when we have some information about an individual's survival time, but we don’t know the survival time exactly.
  • 10. Y X Options: 1) Wait for those patients to die? 2) Discard the censored data? 3) Use the censored data as if they were not censored? 4) ……
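One standard answer to the question of what to do with censored observations is the Kaplan-Meier product-limit estimator, which the slide does not name. A minimal sketch follows, assuming each individual is described by a duration and a flag saying whether the event was actually observed; the function name and the toy data are illustrative only. Censored individuals stay in the risk set until their censoring time, so they are neither discarded nor treated as deaths.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    durations: time until the event or until censoring
    observed:  True if the event was seen, False if the record is censored
    """
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)   # everyone leaves the risk set at their own time
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Conditional probability of surviving past t, given survival up to t
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]      # remove both deaths and censored records
    return curve

# Toy data (hypothetical): two of the five records are censored
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"S({t}) = {s:.2f}")
```

The censored record at time 3 still counts in the denominator at times 2 and 3, which is how the partial information it carries is used.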
  • 11. Goals • Survival Analysis attempts to answer questions such as – What is the fraction of a population which will survive past a certain time? Of those that survive, at what rate will they die? – Can multiple causes of death be taken into account? – How do particular circumstances or characteristics increase or decrease the odds of survival?
  • 12. • Censoring of data • Comparing groups – (1 treatment vs. 2 placebo) • Confounding or Interaction factors – Log WBC
  • 13. Why Is It Useful for Online Marketing etc.
  • 14. The Data Are There • Events meaningful to online marketing – Time to Clicking the Ad – Informational: Time to Finding the Wanted Info – Transactional: Time to Buying the Product – Social: Time to Joining/Leaving the Community – …… Time Matters!
  • 15. Evidence-Based Marketing • Let’s work as (real) doctors – Users = Patients – Advertisement (Marketing) = Treatment Survival Analysis brings the time dimension back to the centre stage.
  • 19. Predict whether a new question asked on Stack Overflow will be closed when
  • 20. Case Study: The Departure Dynamics of Wikipedia Editors
  • 21. About 90,000 regularly active volunteer editors around the world
  • 23. Departure Dynamics • Who are likely to “die”? • How soon will they “die”? • Why do they “die”? “live” = stay in the editors’ community = keep editing; “die” = leave the editors’ community = stop editing (for 5 months)
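The "die" definition above can be turned into survival data roughly as follows. This is a sketch under assumptions that are not stated on the slide: "5 months" is approximated as 150 days, a dead editor's lifetime is measured from first to last edit, and an editor who is still active is treated as censored at the end of the observation window. The function name `lifetime` is illustrative.

```python
from datetime import date, timedelta

# Assumption: "5 months" of inactivity approximated as 150 days
INACTIVITY = timedelta(days=150)

def lifetime(edit_dates, observation_end):
    """Return (duration_in_days, died) for one editor.

    died is True if the editor has been silent for at least INACTIVITY
    by observation_end; otherwise the record is censored.
    """
    first, last = min(edit_dates), max(edit_dates)
    died = observation_end - last >= INACTIVITY
    end = last if died else observation_end
    return (end - first).days, died

end = date(2010, 9, 1)
left = lifetime([date(2010, 1, 1), date(2010, 3, 1)], end)    # silent since March
active = lifetime([date(2010, 1, 1), date(2010, 8, 1)], end)  # edited a month ago
print(left, active)
```

The second editor has not yet "died", so only a lower bound on the lifetime is known, which is exactly the censoring situation described earlier in the talk.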
  • 24. Who are likely to “die”? (WikiChallenge)
  • 26. [Figure: timeline marked with the dates 2001-01-01, 2010-04-01, 2010-09-01 and 2001-06-01, 2010-09-01, 2011-02-01]
  • 28. Behavioural Dynamics Features: Exponential Steps (months). Web Search (SIGIR-2009), Social Tagging (WWW-2009), Language Modelling (ICTIR-2009)
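The slide gives only the phrase "Exponential Steps", so the following is a guess at the general idea rather than the author's feature definition: count an editor's edits in look-back windows whose lengths double at each step. The 30-day month, the number of windows, and the function name are all assumptions made for illustration.

```python
def exponential_step_counts(days_ago, n_windows=4, base_days=30):
    """Count edits inside look-back windows of 30, 60, 120, ... days.

    days_ago: for each edit, how many days before the cut-off it happened
    Returns one count per window; windows are cumulative, not disjoint.
    """
    windows = [base_days * 2 ** k for k in range(n_windows)]
    return [sum(1 for d in days_ago if d < w) for w in windows]

# Hypothetical editor with five edits before the cut-off date
features = exponential_step_counts([1, 10, 45, 100, 200])
print(features)
```

Doubling windows give fine resolution for recent behaviour and coarse resolution for older behaviour with only a handful of features.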
  • 32. Gradient Boosted Trees (GBT) (image © 2008-2012 ~maniraptora)
  • 33. Gradient Boosted Trees (GBT) • The success of GBT in our task is probably attributable to – its ability to capture the complex nonlinear relationship between the target variable and the features, – its insensitivity to different feature value ranges as well as outliers, and – its resistance to overfitting via regularisation mechanisms such as shrinkage and subsampling (Friedman 1999a; 1999b). • GBT vs RF
  • 38. Final Result • The 2nd best valid algorithm in the WikiChallenge – RMSLE = 0.862582: 41.7% improvement over WMF’s in-house solution – Much simpler model than the top performing system: 21 behavioural dynamics features vs. 206 features – WMF is now implementing this algorithm permanently and looks forward to using it in the production environment.
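RMSLE conventionally stands for root mean squared logarithmic error. The slide does not spell out the formula, so the sketch below assumes the usual definition, the square root of the mean of (log(1 + predicted) − log(1 + actual))²; the function name and sample values are illustrative.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((math.log1p(p) - math.log1p(a)) ** 2
                for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))

# Hypothetical edit counts: predicted vs. actually observed
print(rmsle([0, 3, 10], [0, 5, 8]))
```

Working on the log scale means that an error of a few edits matters much more for a low-activity editor than for a very prolific one.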
  • 39. How soon will they “die”?
  • 40. The evolution of Wikipedia editors' community (figure labels: 110,000 random samples; birth & death; January 2001).
  • 41. The evolution of Wikipedia editors' community (figure labels: 110,000 random samples; active editors; January 2001).
  • 42. Survival Function: What is the fraction of a population which will survive past a certain time?
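In standard notation the question on this slide corresponds to the following definition. The symbols T (the random lifetime), F (its cumulative distribution function) and S are not on the slide and are introduced here for illustration.

```latex
S(t) = \Pr(T > t) = 1 - F(t), \qquad F(t) = \Pr(T \le t)
```

S(0) = 1 and S is non-increasing in t; the median lifetime is the time t at which S(t) = 0.5.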
  • 43. The histogram of Wikipedia editors' lifetime (figure labels: Occasional Editors; Customary Editors).
  • 46. The empirical survival function.
  • 47. Probability Plot: Normal Distribution
  • 48. Probability Plot: Extreme Value Distribution
  • 49. Probability Plot: Rayleigh Distribution
  • 50. Probability Plot: Exponential Distribution
  • 51. Probability Plot: Lognormal Distribution
  • 52. Probability Plot: Weibull Distribution
  • 55. Expected Future Lifetime • median lifetime: 53 days
  • 56. Hazard Function: Of those that survive, at what rate will they die? The instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.
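The verbal definition of the hazard function is usually written as a limit. The notation below is the textbook form, not taken from the slide: T is the random lifetime, f its probability density, and S(t) = Pr(T > t) its survival function.

```latex
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}
```

Unlike S(t), the hazard is a rate rather than a probability, so it can exceed 1 and can rise or fall over time.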
  • 60. Conclusions • For customary Wikipedia editors, – the survival function can be well described by a Weibull distribution (with a median lifetime of about 53 days); – there are two critical phases (0-2 weeks and 8-20 weeks) when the hazard rate of becoming inactive increases; – more active editors tend to stay active in editing for a longer time.
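To make the Weibull statement concrete, the sketch below writes out the Weibull survival and hazard functions and recovers a scale parameter from a given median. Only the 53-day median comes from the slides; the shape value is hypothetical, since the fitted parameters are not given here, and the function names are illustrative.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def scale_from_median(median, shape):
    """Solve S(median) = 0.5 for the scale parameter."""
    return median / math.log(2) ** (1.0 / shape)

MEDIAN_DAYS = 53   # median lifetime reported on the slides
SHAPE = 0.5        # hypothetical shape, NOT the fitted value

scale = scale_from_median(MEDIAN_DAYS, SHAPE)
print(f"scale = {scale:.1f} days, S(53) = {weibull_survival(53, SHAPE, scale):.3f}")
```

A single Weibull has a monotonic hazard, so the two critical phases mentioned above would have to come from the empirical hazard estimate rather than from the parametric fit alone.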
  • 61. Why do they “die”?
  • 66. Semi-Parametric • The semi-parametric property of the Cox model => its popularity – The baseline hazard is unspecified – Robust: it will closely approximate the correct parametric model – Using a minimum of assumptions
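For reference, the Cox proportional hazards model is usually written as below. The notation is the textbook form rather than a transcription of a slide: h_0(t) is the unspecified baseline hazard, X_1, …, X_p are the covariates and β_1, …, β_p their coefficients.

```latex
h(t \mid X) = h_0(t) \exp\left( \sum_{i=1}^{p} \beta_i X_i \right)
```

Because h_0(t) cancels in the ratio of two individuals' hazards, the coefficients can be estimated without specifying the baseline, which is what "semi-parametric" refers to.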
  • 67. Cox PH vs. Logistic
  • 69. Cox Proportional Hazards Model
        Covariate               β        se      z         p
        X1: namespace==Main     -0.1095  0.0172  -6.3664   0.1935e-9
        X2: log(1+cur_size)     -0.0688  0.0036  -19.2474  0.0000e-9
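Cox coefficients are often easier to read as hazard ratios, exp(β). The sketch below converts the two coefficients reported on the slide and adds a Wald 95% interval from the standard errors; the interval and the 1.96 critical value are a conventional choice assumed here, not something shown on the slide.

```python
import math

# (beta, standard error) as reported on the slide
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}

def hazard_ratio(beta, se, z_crit=1.96):
    """Return (HR, lower, upper): exp(beta) with a Wald interval."""
    return (math.exp(beta),
            math.exp(beta - z_crit * se),
            math.exp(beta + z_crit * se))

ratios = {name: hazard_ratio(b, se) for name, (b, se) in coefficients.items()}
for name, (hr, lower, upper) in ratios.items():
    print(f"{name}: HR = {hr:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Both coefficients are negative, so both hazard ratios fall below 1: under the model, a higher value of either covariate is associated with a lower hazard of leaving.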
  • 73. Next Step
  • 75. Lightning Does Strike Twice! • Roy Sullivan, a former park ranger from Virginia – He was struck by lightning 7 times • 1942 (lost big-toe nail) • 1969 (lost eyebrows) • 1970 (left shoulder seared) • 1972 (hair set on fire) • 1973 (hair set on fire & legs seared) • 1976 (ankle injured) • 1977 (chest & stomach burned) – He committed suicide in September 1983.
  • 76. A Lot More To Do • Multiple Occurrences of “Death” – Recurrent Event Survival Analysis (e.g., based on Counting Process) • Multiple Types of “Death” – Competing Risks Survival Analysis
  • 77. Software Tools • R – The ‘survival’ package • Matlab – The ‘statistics’ toolbox • Python – The ‘statsmodels’ module?
  • 78. References • David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta • John Wallace. How Big Data is Changing Retail Marketing Analytics. Webinar, Apr 2005. http://goo.gl/OlMmi • Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors Keep Active? In Proceedings of the 8th International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, Aug 2012. http://goo.gl/On3qr • Dell Zhang. Wikipedia Edit Number Prediction based on Temporal Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct 2011. http://goo.gl/s2Dex
  • 79. ?