Predicting Lifetime
Value with Hadoop
Martin Colaco, Head of Data Science | April 10, 2013
Agenda

• What is predictive modeling
• What is Lifetime Value (LTV)
• What is feature extraction - challenges
• How can we build a cohort-based predictive LTV
  model
  o Python
  o Hive with Hadoop
  o Cascalog with Hadoop
Can we predict how many attendees tonight?

• How to estimate?   Door count (after the fact)




• Is there a way to build a model that we can
  use to predict attendees?
Predicting how many attendees tonight?

Attendees = Registrations x % Attendance + Non-registrants
Attendees = 201 x 50% + 25 ≈ 125

Lots of uncertainty:
• Location
• Date & Time
• Company
• Speaker
• Title & Topic
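
The back-of-the-envelope estimate above can be written as a tiny function. This is only a sketch: the function name and the low/high scenario values are made up for illustration, while the point-estimate inputs (201, 50%, 25) come from the slide.

```python
def estimate_attendees(registrations: int, attendance_rate: float, walk_ins: int) -> float:
    """Attendees = Registrations x % Attendance + Non-registrants."""
    return registrations * attendance_rate + walk_ins


# Inputs from the slide: 201 registrations, 50% attendance, 25 non-registrants.
point_estimate = estimate_attendees(201, 0.50, 25)
print(point_estimate)  # 125.5, which the slide rounds to 125

# The attendance rate and walk-in count are guesses, so bracket the estimate
# with a pessimistic and an optimistic scenario (hypothetical values).
low = estimate_attendees(201, 0.40, 10)
high = estimate_attendees(201, 0.60, 40)
print(low, high)
```

Running a few scenarios like this makes the "lots of uncertainty" point concrete: small changes in the assumed attendance rate move the estimate by dozens of people.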
Predictive Modeling

• Know the question you want to answer
• Look at historical behavior
• Apply understanding of those behaviors to new
   situations -> new groups of users

Data -> Feature Extraction -> Model Selection -> Model Validation -> Fame, Success, Riches
Common use cases for predictive modeling
My chemical engineering roots….



In – Out = Accumulation

IN -> Δ (Accumulation) -> Out
Users: Maximizing Growth

In – Out = Accumulation

IN -> Δ = Growth -> Out   (App or Network of Apps)

In:
• Paid marketing
• Organic
• X-promotion

Out:
• Frustration?
• Boredom?
• Too expensive?
• Bad UX?
• No new content?
Money: Maximizing Profit

In – Out = Accumulation

IN -> Δ = Profit -> Out   (App or App Network or Business)

In:
• Lifetime Value (LTV)

Out (business expenses):
• Marketing costs
• Operations (servers, etc.)
• Employee costs
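
The balance "In – Out = Accumulation" applies to users and to money alike. The sketch below illustrates both; the function names and all numbers are hypothetical and do not come from the talk.

```python
def accumulation(inflow: float, outflow: float) -> float:
    """In - Out = Accumulation."""
    return inflow - outflow


def simulate_users(start_users: int, daily_installs: int,
                   daily_churn_rate: float, days: int) -> list:
    """Track the user base day by day: growth is installs minus churned users."""
    users = [float(start_users)]
    for _ in range(days):
        churned = users[-1] * daily_churn_rate
        users.append(users[-1] + accumulation(daily_installs, churned))
    return users


# Users: 1,000 to start, 100 installs per day, 5% of users lost per day.
trajectory = simulate_users(start_users=1000, daily_installs=100,
                            daily_churn_rate=0.05, days=3)
print(trajectory)

# Money: 500 acquired users at $2.00 LTV each, against $800 of expenses.
profit = accumulation(inflow=500 * 2.0, outflow=800.0)
print(profit)
```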
How Do We Estimate LTV

    Business Model                               LTV

    Download                                     Cost per Download
    Subscription                                 Avg. Price x Avg. Customer Lifetime
    Microtransactions (Ads / In-app-purchases)   ???
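
For the subscription row, the table's formula is a single multiplication. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def subscription_ltv(avg_price_per_period: float, avg_lifetime_periods: float) -> float:
    """LTV = Avg. Price x Avg. Customer Lifetime (subscription business model)."""
    return avg_price_per_period * avg_lifetime_periods


# Hypothetical: $5.00 per month, customers stay 8 months on average.
ltv = subscription_ltv(5.00, 8)
print(ltv)  # 40.0
```

No such closed form exists for microtransactions, which is why the rest of the talk builds LTV from retention and spending curves.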
LTV Modeling – Social / Mobile Games

LTV = (1 + k) * Retention * ARPU

Output variable: LTV
Features: Daily Retention Curve, ARPDAU Curve

[Figure: Daily Retention Curve. % of users retained (0% to 100%) vs. days since install (0 to 8)]
[Figure: ARPDAU Curve. ARPDAU ($0 to $0.10) vs. days since install (0 to 8)]
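
One way to read the formula is day by day: expected revenue on day d is the fraction of users still active times the revenue per active user, and LTV is the running sum, scaled by (1 + k). This reading, the function name, and the sample curves are assumptions made for illustration; the slide gives only the product form.

```python
from itertools import accumulate


def cumulative_ltv(retention, arpdau, k=0.0):
    """Cumulative expected revenue per installed user, by day since install.

    retention[d]: fraction of the cohort still active on day d
    arpdau[d]:    average revenue per daily active user on day d
    k:            viral k-factor (the talk sets it aside, so it defaults to 0)
    """
    if len(retention) != len(arpdau):
        raise ValueError("curves must cover the same days")
    daily = [(1 + k) * r * a for r, a in zip(retention, arpdau)]
    return list(accumulate(daily))


# Hypothetical curves for days 0..3 (not read off the slide's plots).
retention = [1.00, 0.50, 0.40, 0.30]
arpdau = [0.10, 0.08, 0.05, 0.05]

curve = cumulative_ltv(retention, arpdau)
print([round(v, 4) for v in curve])
```

The last element of `curve` is the LTV over the days covered; extending both input curves extends the prediction horizon.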
Predictive LTV Result

[Figure: Cumulative Spend (0 to 300) vs. Days Since Install (0 to 100)]
Challenges with this simple LTV model

• All of these parameters are moving targets
• k-factor is wildly variable (we’ll ignore k-factor in this
  presentation)
• Acquisition costs can change (as can LTV and
  retention), so compute LTV per cohort, by install
  date and install source
[Figure: ARPDAU Curve. ARPDAU ($0 to $0.10) vs. days since install (0 to 8)]
[Figure: Retention Curve. % of users retained (0% to 100%) vs. days since install (0 to 8)]
• Retention is computationally difficult to calculate
• Large games can have millions of users who spend
  money over many months/years

   How can we build out the features we need
   to model LTV by cohort?
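
To see why retention is costly, here is a small single-cohort sketch. It assumes retention on day d means "active on day d or later", and the data layout and names are hypothetical. Every event of every user has to be scanned to find that user's last active day, which is what becomes expensive with millions of users.

```python
def retention_curve(cohort, activity, max_day):
    """Fraction of the cohort active on day d or later, for d = 0..max_day.

    cohort:   set of user ids that installed on the same date
    activity: iterable of (user_id, days_since_install) events
    """
    # Every installer counts as active on day 0.
    last_seen = {user: 0 for user in cohort}
    for user, day in activity:
        if user in last_seen:  # ignore users from other cohorts
            last_seen[user] = max(last_seen[user], day)
    size = len(cohort)
    return [
        sum(1 for last in last_seen.values() if last >= d) / size
        for d in range(max_day + 1)
    ]


cohort = {"u1", "u2", "u3", "u4"}
activity = [("u1", 1), ("u1", 3), ("u2", 1), ("u3", 2), ("u9", 5)]
curve = retention_curve(cohort, activity, max_day=3)
print(curve)  # [1.0, 0.75, 0.5, 0.25]
```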
Kontagent Facts
• Founded in 2007
• 130+ employees and growing
• 100s of Customers
• 1000s of Apps Instrumented
• 250+ billion events per month
• 200MM+ MAUs
• 1 Trillion Events in 2013
How does Kontagent collect data?
•   Via a REST API
    o APA – Install message
    o EVT – Custom event message (user action)
    o MTU – Spending message
•   Yields a transaction log over time:
Feature Extraction for Predictive LTV

  Need to translate a transaction log into a table
  o   Install Date
  o   Install Source
  o   Activity Date
  o   Spend on Date
  o   Cumulative Spend to Date
  o   Users Active on Date
  o   Users Active on Date or After
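
A single-threaded Python version of this translation might look like the sketch below. The record layout `(message_type, user_id, day, value)` is a simplification invented for the example, since the talk does not show the real message fields, and the column names are likewise illustrative.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, simplified log records: (message_type, user_id, day, value).
# APA = install (value is the install source), EVT = user action,
# MTU = spend (value is the amount spent).
log = [
    ("APA", "u1", date(2011, 7, 8), "ads"),
    ("APA", "u2", date(2011, 7, 8), "ads"),
    ("EVT", "u1", date(2011, 7, 8), None),
    ("EVT", "u2", date(2011, 7, 9), None),
    ("EVT", "u1", date(2011, 7, 10), None),
    ("MTU", "u1", date(2011, 7, 8), 2.00),
    ("MTU", "u1", date(2011, 7, 10), 5.00),
]


def life_table(log):
    """Build one row per (install date, install source, activity date)."""
    installs = {u: (day, src) for kind, u, day, src in log if kind == "APA"}
    active = defaultdict(set)    # (cohort..., day) -> users with an event that day
    spend = defaultdict(float)   # (cohort..., day) -> amount spent that day
    last_seen = {}               # user -> last day with an event
    for kind, user, day, value in log:
        if kind == "APA" or user not in installs:
            continue
        key = installs[user] + (day,)
        if kind == "EVT":
            active[key].add(user)
            last_seen[user] = max(last_seen.get(user, day), day)
        elif kind == "MTU":
            spend[key] += value

    rows = []
    running = defaultdict(float)  # cohort -> cumulative spend so far
    for key in sorted(set(active) | set(spend)):
        cohort, day = key[:2], key[2]
        running[cohort] += spend.get(key, 0.0)
        remaining = sum(1 for u, last in last_seen.items()
                        if installs[u] == cohort and last >= day)
        rows.append({
            "install_date": cohort[0],
            "install_source": cohort[1],
            "activity_date": day,
            "users_active": len(active.get(key, ())),
            "users_active_on_or_after": remaining,
            "spend_on_date": spend.get(key, 0.0),
            "cumulative_spend": running[cohort],
        })
    return rows


table = life_table(log)
for row in table:
    print(row)
```

Everything here is held in in-memory dictionaries, which is the caching concern with the single-threaded approach once the log holds millions of rows.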
How can we compute this table of features?

•   Python – single thread
     o Might work in some cases but need to cache
       potentially millions of rows of data

•   Hive with Hadoop
     o Data warehouse system that allows SQL-like
       querying capabilities of distributed data structures
     o Let’s work through this….
Hive query

•   Store data in Hadoop: the transaction log is split into APA, EVT, and MTU tables

•   Query using Hive Query Language (HiveQL):

    select distinct s
    from demo_apa
    where kt_date(utc_timestamp) = '2011-07-08' and s is not null
      and month=201107
This query gets cumbersome quickly…
select sub1.gameplay_date as play_date, sub1.returned,
       sub2.spenders, sub2.total_daily_spend
from
(
  select gp.gameplay_date, count(distinct gp.s) as returned
  from
  (
    select distinct s
    from demo_apa
    where kt_date(utc_timestamp) = '2011-07-08' and s is not null
      and month=201107
  ) base
  left outer join
  (
    select s, kt_date(utc_timestamp) as gameplay_date
    from demo_evt
    where s is not null and month>=201107
  ) gp on gp.s = base.s
  group by gp.gameplay_date
) sub1
join
(
  select sp.spend_date, count(distinct sp.s) as spenders,
         sum(sp.spend)/100 as total_daily_spend
  from
  (
    select distinct s
    from demo_apa
    where kt_date(utc_timestamp) = '2011-07-08' and s is not null
      and month=201107
  ) base
  left outer join
  (
    select s, kt_date(utc_timestamp) as spend_date, v as spend
    from demo_mtu
    where s is not null and v>0 and month>=201107
  ) sp on sp.s = base.s
  group by sp.spend_date
) sub2 on sub1.gameplay_date=sub2.spend_date

Result:

play_date   returned   spenders   total_daily_spend
7/10/2011   2          1          75
7/11/2011   4          2          19
7/12/2011   1          1          0.2
Feature Extraction with HiveQL
  o   Install Date
  o   Install Source
  o   Activity Date
  o   Users Active on Date
  o   Spend on Date
  o   Users Active on Date or After
  o   Cumulative Spend to Date

 Problem: HiveQL doesn’t support non-equi-joins

 Options for improving Hive performance
 • Write tables or temp tables
 • Code up some UDFs
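
The cumulative columns are where the non-equi-join limitation matters: cumulative spend up to an activity date needs every spend row whose date is on or before that date, which is an inequality join condition. The sketch below emulates that join in Python and compares it with a single ordered pass. The function names are illustrative; the three daily totals are the ones shown in the query result table.

```python
from itertools import accumulate

# Daily spend for one install cohort, sorted by ISO date so that
# string comparison matches date order.
daily_spend = [("2011-07-10", 75.0), ("2011-07-11", 19.0), ("2011-07-12", 0.2)]


def cumulative_by_self_join(rows):
    """Emulates: a JOIN b ON b.spend_date <= a.activity_date, then SUM(b.spend).

    The join condition is an inequality, i.e. a non-equi-join.
    """
    return [
        (a_date, sum(spend for b_date, spend in rows if b_date <= a_date))
        for a_date, _ in rows
    ]


def cumulative_by_running_sum(rows):
    """Same result in one ordered pass, with no join at all."""
    totals = accumulate(spend for _, spend in rows)
    return [(d, t) for (d, _), t in zip(rows, totals)]


joined = cumulative_by_self_join(daily_spend)
running = cumulative_by_running_sum(daily_spend)
print(joined)
print(running)
```

The self-join version compares every row with every other row, so its cost grows quadratically with the number of dates, while the running sum is linear but depends on processing rows in order.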
How can we compute this table of features?

•   Python – single thread

•   Hive with Hadoop

•   Cascalog (Cascading) with Hadoop
     o Cascading is a flow-based computational model for
       Hadoop
     o Cascalog is a declarative query system built on
       Cascading
     o Let’s work through this…
Cascalog Code
;; Assumes cascalog.api (for <-) and cascalog.ops (aliased as c) are loaded.
;; tap/... and ops/... are helper namespaces that the slide does not show.

;; Install date for each user ?s, taken from the APA (install) messages.
(defn user-install-dates [api-key]
  (let [apas (tap/apa-tap api-key)]
    (<- [?s ?install-date]
        (apas ?s _ _ ?install-ts)
        (ops/ts-to-date ?install-ts :> ?install-date))))

;; Distinct users with an event on each activity date, per install cohort.
(defn active-users-by-activity-date [install-dates evts]
  (<- [?install-date ?activity-date ?active-users]
      (install-dates ?s ?install-date)
      (evts ?s _ ?ts)
      (ops/ts-to-date ?ts :> ?activity-date)
      (c/distinct-count ?s :> ?active-users)))

;; Paying users and total spend on each activity date, per install cohort.
(defn spend-by-activity-date [install-dates mtus]
  (<- [?install-date ?activity-date ?paying-users ?day-spending]
      (mtus ?s ?v _ _ _ _ ?ts)
      (install-dates ?s ?install-date)
      (ops/ts-to-date ?ts :> ?activity-date)
      (c/distinct-count ?s :> ?paying-users)
      (c/sum ?v :> ?day-spending)))

;; Users active on each date or after ("remaining" users).
(defn cumulative-active-users-by-date [install-dates evts]
  (<- [?install-date ?activity-date ?remaining-users]
      (install-dates ?s ?install-date)
      (evts ?s _ ?ts)
      (ops/project-backward ?ts :> ?activity-date)
      (c/distinct-count ?s :> ?remaining-users)))

;; Cumulative spend up to each date.
(defn cumulative-spend-by-date [install-dates mtus]
  (<- [?install-date ?activity-date ?cumulative-spend]
      (install-dates ?s ?install-date)
      (mtus ?s ?v _ _ _ _ ?ts)
      (ops/project-forward ?ts :> ?activity-date)
      (c/sum ?v :> ?cumulative-spend)))

;; Join the four sub-queries on (install date, activity date).
;; The slide binds ?cumulative-spend but outputs ?cumulative-spending;
;; the names are made consistent here.
(defn life-table [api-key]
  (let [install-dates (user-install-dates api-key)
        evts (tap/evt-tap api-key)
        mtus (tap/mtu-tap api-key)
        cumulative-spend (cumulative-spend-by-date install-dates mtus)
        activity-spend (spend-by-activity-date install-dates mtus)
        cumulative-users (cumulative-active-users-by-date install-dates evts)
        active-users (active-users-by-activity-date install-dates evts)]
    (<- [?install-date ?activity-date ?remaining-users ?active-users
         ?paying-users ?day-spending ?cumulative-spending]
        (cumulative-spend ?install-date ?activity-date ?cumulative-spending)
        (activity-spend ?install-date ?activity-date ?paying-users ?day-spending)
        (cumulative-users ?install-date ?activity-date ?remaining-users)
        (active-users ?install-date ?activity-date ?active-users))))
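
The talk does not show `ops/project-forward` or `ops/project-backward`. One plausible reading, sketched below in Python, is that each emits several dates per input row so that an ordinary sum or distinct count turns into a cumulative one. Everything in the sketch (names, integer day offsets, the `horizon` parameter) is an assumption made for illustration, not the presenter's implementation.

```python
from collections import defaultdict


def project_forward(day, horizon):
    """Emit `day` and every later day up to `horizon` (inclusive)."""
    return range(day, horizon + 1)


def project_backward(day):
    """Emit day 0 through `day` (inclusive)."""
    return range(0, day + 1)


def cumulative_spend(spends, horizon):
    """spends: (user, day, amount). Summing after project_forward
    yields a running total per day."""
    totals = defaultdict(float)
    for _user, day, amount in spends:
        for d in project_forward(day, horizon):
            totals[d] += amount
    return dict(totals)


def remaining_users(events):
    """events: (user, day). Counting distinct users after project_backward
    yields the users active on day d or later."""
    users = defaultdict(set)
    for user, day in events:
        for d in project_backward(day):
            users[d].add(user)
    return {d: len(u) for d, u in users.items()}


# Hypothetical single-cohort data; days are integers counted from install.
spend_totals = cumulative_spend([("u1", 0, 2.0), ("u1", 2, 5.0)], horizon=3)
remaining = remaining_users([("u1", 0), ("u2", 1), ("u1", 2)])
print(spend_totals)  # {0: 2.0, 1: 2.0, 2: 7.0, 3: 7.0}
print(remaining)     # {0: 2, 1: 2, 2: 1}
```

Under this reading the inequality join is replaced by row duplication followed by an equality group-by, at the cost of emitting many more rows, which would be consistent with the job being CPU limited.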
Feature Extraction with Cascalog
  o   Install Date
  o   Install Source
  o   Activity Date
  o   Users Active on Date
  o   Spend on Date
  o   Users Active on Date or After
  o   Cumulative Spend to Date

 Options for improvement
 • Code not optimized: CPU limited
What have we learned

•   Martin sucks (or is awesome) at predicting number of
    attendees at Meetups!
•   Predictive modeling (particularly around LTV) can have a
    huge impact on a business
    o Requires intuition and iteration
    o In the big data world, feature extraction can be a major
      challenge
•   Feature extraction can be done with Hadoop
    o HiveQL is nice because analysts can use it, but it can be
      inefficient and may not generate all the features we need
    o Cascading can solve most of these problems and generate the
      clean features we need
Questions?


         Need a job? We’re hiring:
http://www.kontagent.com/company/careers/

      Martin Colaco
      Head of Data Science
      martin.colaco@kontagent.com

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Feature Extraction for Predictive LTV Modeling using Hadoop, Hive, and Cascading - Kontagent

  • 1. Predicting Lifetime Value with Hadoop Martin Colaco, Head of Data Science | April 10, 2013
  • 2. Agenda • What is predictive modeling • What is Lifetime Value (LTV) • What is feature extraction - challenges • How can we build a cohort-based predictive LTV model o Python o Hive with Hadoop o Cascalog with Hadoop
  • 3. Can we predict how many attendees tonight? • How to estimate? Door count (after the fact) • Is there a way to build a model that we can use to predict attendees?
  • 4. Predicting how many attendees tonight? Attendees = Registrations x % Attendance + Non-registrants
  • 5. Predicting how many attendees tonight? Attendees = Registrations x % Attendance + Non-registrants Attendees = 201 x 50% + 25 ≈ 125 Lots of uncertainty: Location, Date & Time, Company, Speaker, Title & Topic
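The attendance estimate on slide 5 can be written as a one-line function. This is only a sketch of the slide's arithmetic; the function and parameter names are illustrative and do not come from the deck.

```python
def estimate_attendees(registrations, attendance_rate, non_registrants):
    """Attendees = Registrations x % Attendance + Non-registrants."""
    return registrations * attendance_rate + non_registrants


# Numbers from the slide: 201 registrations, 50% attendance, 25 non-registrants.
estimate = estimate_attendees(201, 0.50, 25)
print(estimate)
```

With the slide's inputs the formula evaluates to 125.5, which the slide reports as 125. Each input is itself a guess, which is where the uncertainty comes from.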
  • 6. Predictive Modeling • Know the question you want to answer • Look at historical behavior • Apply understanding of those behaviors to new situations -> new groups of users [Diagram: Data -> Feature Extraction -> Model Selection -> Model Validation -> Fame, Success, Riches]
  • 7. Common use cases for predictive modeling My chemical engineering roots… In – Out = Accumulation [Diagram: In -> Δ -> Out]
  • 8. Users: Maximizing Growth In – Out = Accumulation [Diagram: In -> Δ = Growth -> Out, for an App or Network of Apps] In: Paid marketing, Organic, X-promotion Out: Frustration? Boredom? Too expensive? Bad UX? No new content?
  • 9. Money: Maximizing Profit In – Out = Accumulation [Diagram: In -> Δ = Profit -> Out, for an App or App Network or Business] In: Lifetime Value (LTV) Out: Business expenses: Marketing costs, Operations (servers, etc.), Employee costs
  • 10. How Do We Estimate LTV [Table: Business Model -> LTV] Download -> Cost per Download; Subscription -> Avg. Price x Avg. Customer Lifetime; Microtransactions (Ads / In-app-purchases) -> ???
  • 11. LTV Modeling – Social / Mobile Games LTV = (1 + k) * Retention * ARPU (LTV is the output variable; k, Retention and ARPU are the features) [Figure: Daily Retention Curve, % of users retained vs. days since install] [Figure: ARPDAU Curve, ARPDAU vs. days since install]
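A minimal sketch of how the slide 11 formula could be evaluated from the two curves. The slide only states LTV = (1 + k) * Retention * ARPU; multiplying the curves day by day and summing is an assumed reading of it, and the function name and sample numbers are made up for illustration.

```python
def predicted_ltv(retention, arpdau, k_factor=0.0):
    """Estimate LTV per installed user from daily curves.

    retention[d] is the fraction of a cohort still active d days after
    install; arpdau[d] is the average revenue per daily active user on
    that day. Both lists are indexed by days since install.
    """
    if len(retention) != len(arpdau):
        raise ValueError("retention and arpdau must cover the same days")
    # Expected revenue per installed user on day d is retention[d] * arpdau[d].
    revenue_per_install = sum(r * a for r, a in zip(retention, arpdau))
    # The k-factor scales revenue up for users acquired virally.
    return (1 + k_factor) * revenue_per_install


# Illustrative three-day curves (not taken from the slides).
ltv = predicted_ltv([1.0, 0.5, 0.25], [0.10, 0.08, 0.04])
```

Extending the curves over more days gives a cumulative spend curve like the one plotted on the next slide.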
  • 12. Predictive LTV Result [Figure: Cumulative Spend vs. Days Since Install]
  • 13. Challenges with this simple LTV model • All of these parameters are moving targets • k-factor is wildly variable (we’ll ignore k-factor in this presentation) • Acquisition costs can change (as can LTV and retention) - Cohort LTV by install date and install source [Figure: ARPDAU Curve, ARPDAU vs. days since install] [Figure: Retention Curve, % of users retained vs. days since install]
  • 14. Challenges with this simple LTV model • All of these parameters are moving targets • k-factor is wildly variable (we’ll ignore k-factor in this presentation) • Acquisition costs can change (as can LTV and retention) - Cohort LTV by install date and install source • Retention is computationally difficult to calculate • Large games can have millions of users who spend money over many months/years How can we build out the features we need to model LTV by cohort?
  • 15. Kontagent Facts • Founded in 2007 • 130+ employees and growing • 100s of Customers • 1000s of Apps Instrumented • 250+ billion events per month • 200MM+ MAUs • 1 Trillion Events in 2013
  • 16. How does Kontagent collect data? • Via a REST API o APA – Install message o EVT – Custom event message (user action) o MTU – Spending message • Yields a transaction log over time:
  • 17. Feature Extraction for Predictive LTV Need to translate a transaction log into a table o Install Date o Install Source o Activity Date o Users Active on Date o Users Active on Date or After o Spend on Date o Cumulative Spend to Date
  • 18. How can we compute this table of features? • Python – single thread o Might work in some cases but needs to cache potentially millions of rows of data • Hive with Hadoop o Data warehouse system that allows SQL-like querying of distributed data o Let’s work through this…
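To make the single-thread Python option concrete, here is a rough sketch that turns a small transaction log into the feature table from slide 17. The tuple layout `(msg_type, user_id, day, value)` is a hypothetical stand-in for the real APA/EVT/MTU messages, the column names are illustrative, and Install Source is left out for brevity.

```python
from collections import defaultdict
from datetime import date


def life_table(log):
    """Build one feature row per (install date, activity date) pair."""
    log = list(log)

    # Pass 1: each user's install date (earliest APA message seen).
    install = {}
    for msg, user, day, _ in log:
        if msg == "APA":
            install[user] = min(install.get(user, day), day)

    # Pass 2: per-cohort activity and spend, all held in memory.
    active = defaultdict(set)   # (install date, activity date) -> user ids
    spend = defaultdict(int)    # (install date, activity date) -> spend
    last_seen = {}              # user id -> last date with an event
    for msg, user, day, value in log:
        if user not in install:
            continue
        key = (install[user], day)
        if msg == "EVT":
            active[key].add(user)
            last_seen[user] = max(last_seen.get(user, day), day)
        elif msg == "MTU":
            spend[key] += value

    rows = []
    for cohort in sorted(set(install.values())):
        days = sorted({d for c, d in list(active) + list(spend) if c == cohort})
        cumulative = 0
        for day in days:
            cumulative += spend.get((cohort, day), 0)
            # Users from this cohort seen on this date or any later date.
            remaining = sum(
                1 for user, last in last_seen.items()
                if install[user] == cohort and last >= day
            )
            rows.append({
                "install_date": cohort,
                "activity_date": day,
                "active_users": len(active.get((cohort, day), ())),
                "active_on_or_after": remaining,
                "spend_on_date": spend.get((cohort, day), 0),
                "cumulative_spend": cumulative,
            })
    return rows


# Tiny made-up log: two installs on 2011-07-08, spend values in cents.
log = [
    ("APA", "u1", date(2011, 7, 8), 0),
    ("APA", "u2", date(2011, 7, 8), 0),
    ("EVT", "u1", date(2011, 7, 10), 0),
    ("EVT", "u2", date(2011, 7, 10), 0),
    ("MTU", "u1", date(2011, 7, 10), 500),
    ("EVT", "u1", date(2011, 7, 11), 0),
    ("MTU", "u1", date(2011, 7, 11), 250),
]
rows = life_table(log)
```

Every dictionary here lives in one process's memory, which is the caching problem the slide points to once a game has millions of users.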
  • 19. Hive query • Store the transaction log data (APA, EVT, MTU) in Hadoop • Query using Hive Query Language (HiveQL):
        select distinct s
        from demo_apa
        where kt_date(utc_timestamp) = '2011-07-08'
          and s is not null
          and month=201107
  • 20. This query gets cumbersome quickly…
        select sub1.gameplay_date as play_date,
               sub1.returned,
               sub2.spenders,
               sub2.total_daily_spend
        from (
          select gp.gameplay_date, count(distinct gp.s) as returned
          from (
            select distinct s
            from demo_apa
            where kt_date(utc_timestamp) = '2011-07-08'
              and s is not null and month=201107
          ) base
          left outer join (
            select s, kt_date(utc_timestamp) as gameplay_date
            from demo_evt
            where s is not null and month>=201107
          ) gp on gp.s = base.s
          group by gp.gameplay_date
        ) sub1
        join (
          select sp.spend_date, count(distinct sp.s) as spenders,
                 sum(sp.spend)/100 as total_daily_spend
          from (
            select distinct s
            from demo_apa
            where kt_date(utc_timestamp) = '2011-07-08'
              and s is not null and month=201107
          ) base
          left outer join (
            select s, kt_date(utc_timestamp) as spend_date, v as spend
            from demo_mtu
            where s is not null and v>0 and month>=201107
          ) sp on sp.s = base.s
          group by sp.spend_date
        ) sub2 on sub1.gameplay_date=sub2.spend_date

        Result:
        play_date   returned   spenders   total_daily_spend
        7/10/2011   2          1          75
        7/11/2011   4          2          19
        7/12/2011   1          1          0.2
  • 21. Feature Extraction with HiveQL o Install Date o Install Source o Activity Date o Users Active on Date o Users Active on Date or After o Spend on Date o Cumulative Spend to Date Problem - HiveQL doesn’t support non equi-joins Options for improving Hive performance • Write tables or temp tables • Code up some UDFs
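The features that run into the non equi-join limit are the cumulative ones: "Cumulative Spend to Date" compares each activity date against every spend date on or before it. Outside HiveQL that comparison reduces to a running total. The sketch below is an illustration in Python, not the deck's method; the function name is hypothetical, and the sample values are the daily spend figures from the slide 20 result table converted to cents.

```python
from itertools import accumulate


def cumulative_spend(daily_spend):
    """Map each date to the total spend on that date and all earlier dates.

    daily_spend maps an ISO date string to the spend on that date.
    ISO strings sort chronologically, so a plain sort is enough.
    """
    days = sorted(daily_spend)
    running_totals = accumulate(daily_spend[day] for day in days)
    return dict(zip(days, running_totals))


# Daily spend for the 2011-07-08 install cohort, in cents.
totals = cumulative_spend({
    "2011-07-10": 7500,
    "2011-07-11": 1900,
    "2011-07-12": 20,
})
```

In SQL terms this is a join on `spend_date <= activity_date`, which is the kind of inequality condition the slide says HiveQL could not express directly.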
  • 22. How can we compute this table of features? • Python – single thread • Hive with Hadoop • Cascalog (Cascading) with Hadoop o Cascading is a flow-based computational model for Hadoop o Cascalog is a declarative query system built on Cascading o Let’s work through this…
  • 23. Cascalog Code
        ;; Install date for each user, taken from the APA (install) messages.
        (defn user-install-dates [api-key]
          (let [apas (tap/apa-tap api-key)]
            (<- [?s ?install-date]
                (apas ?s _ _ ?install-ts)
                (ops/ts-to-date ?install-ts :> ?install-date))))

        ;; Users active on each activity date, per install date.
        (defn active-users-by-activity-date [install-dates evts]
          (<- [?install-date ?activity-date ?active-users]
              (install-dates ?s ?install-date)
              (evts ?s _ ?ts)
              (ops/ts-to-date ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?active-users)))

        ;; Paying users and spend on each activity date, per install date.
        (defn spend-by-activity-date [install-dates mtus]
          (<- [?install-date ?activity-date ?paying-users ?day-spending]
              (mtus ?s ?v _ _ _ _ ?ts)
              (install-dates ?s ?install-date)
              (ops/ts-to-date ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?paying-users)
              (c/sum ?v :> ?day-spending)))

        ;; Users active on the activity date or after, per install date.
        (defn cumulative-active-users-by-date [install-dates evts]
          (<- [?install-date ?activity-date ?remaining-users]
              (install-dates ?s ?install-date)
              (evts ?s _ ?ts)
              (ops/project-backward ?ts :> ?activity-date)
              (c/distinct-count ?s :> ?remaining-users)))

        ;; Cumulative spend up to each activity date, per install date.
        (defn cumulative-spend-by-date [install-dates mtus]
          (<- [?install-date ?activity-date ?cumulative-spend]
              (install-dates ?s ?install-date)
              (mtus ?s ?v _ _ _ _ ?ts)
              (ops/project-forward ?ts :> ?activity-date)
              (c/sum ?v :> ?cumulative-spend)))

        ;; Join the four sub-queries on install date and activity date.
        (defn life-table [api-key]
          (let [install-dates (user-install-dates api-key)
                evts (tap/evt-tap api-key)
                mtus (tap/mtu-tap api-key)
                cumulative-spend (cumulative-spend-by-date install-dates mtus)
                activity-spend (spend-by-activity-date install-dates mtus)
                cumulative-users (cumulative-active-users-by-date install-dates evts)
                active-users (active-users-by-activity-date install-dates evts)]
            (<- [?install-date ?activity-date ?remaining-users ?active-users
                 ?paying-users ?day-spending ?cumulative-spend]
                (cumulative-spend ?install-date ?activity-date ?cumulative-spend)
                (activity-spend ?install-date ?activity-date ?paying-users ?day-spending)
                (cumulative-users ?install-date ?activity-date ?remaining-users)
                (active-users ?install-date ?activity-date ?active-users))))
  • 24. Feature Extraction with Cascalog o Install Date o Install Source o Activity Date o Users Active on Date o Users Active on Date or After o Spend on Date o Cumulative Spend to Date Options for improvement • Code not optimized – CPU limited
  • 25. What have we learned • Martin sucks (or is awesome) at predicting the number of attendees at Meetups! • Predictive modeling (particularly around LTV) can have a huge impact on a business o Requires intuition and iteration o In the big data world, feature extraction can be a major challenge • Feature extraction can be done with Hadoop o HiveQL is nice because analysts can use it, but it can be inefficient and may not generate all the features we need o Cascading can solve most of these problems and generate the clean features we need
  • 26. Questions? Need a job? We’re hiring: http://www.kontagent.com/company/careers/ Martin Colaco Head of Data Science martin.colaco@kontagent.com