Feature Extraction for Predictive LTV Modeling using Hadoop, Hive, and Cascading - Kontagent

Predicting Lifetime Value with Hadoop
Martin Colaco, Head of Data Science | April 10, 2013

Agenda

• What is predictive modeling
• What is Lifetime Value (LTV)
• What is feature extraction - challenges
• How can we build a cohort-based predictive LTV
model
o Python
o Hive with Hadoop
o Cascalog with Hadoop

Can we predict how many attendees tonight?

• How to estimate? Door count (after the fact)

• Is there a way to build a model that we can
use to predict attendees?

Predicting how many attendees tonight?

Attendees = Registrations x % Attendance + Non-registrants

Attendees = 201 x 50% + 25 ≈ 125

Lots of uncertainty: location, date & time, company, speaker, title & topic
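The attendee formula above can be written as a small function. This is only an illustrative sketch: the function and parameter names are made up here, and the 40% and 60% attendance rates used for the range are assumed values, chosen to show how sensitive the estimate is to one uncertain input.

```python
def estimate_attendees(registrations: int, attendance_rate: float, walk_ins: int) -> float:
    """Attendees = Registrations x % Attendance + Non-registrants."""
    return registrations * attendance_rate + walk_ins


# Inputs from the slide: 201 registrations, 50% attendance, 25 non-registrants.
estimate = estimate_attendees(registrations=201, attendance_rate=0.50, walk_ins=25)
print(estimate)  # 125.5, which the slide rounds to 125

# Assumed (hypothetical) bounds on the attendance rate.
low = estimate_attendees(201, 0.40, 25)
high = estimate_attendees(201, 0.60, 25)
print(low, high)
```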

Predictive Modeling

• Know the question you want to answer
• Look at historical behavior
• Apply understanding of those behaviors to new
situations -> new groups of users

Data -> Feature Extraction -> Model Selection -> Model Validation -> Success (Fame, Riches)

Common use cases for predictive modeling
My chemical engineering roots…

In – Out = Accumulation

In -> Δ (accumulation) -> Out

Users: Maximizing Growth

In -> Δ = Growth -> Out
(App or Network of Apps)

In:
o Paid marketing
o Organic
o X-promotion

Out:
o Frustration?
o Boredom?
o Too expensive?
o Bad UX?
o No new content?

Money: Maximizing Profit

In -> Δ = Profit -> Out
(App or App Network or Business)

In:
o Lifetime Value (LTV)

Out (business expenses):
o Marketing costs
o Operations (servers, etc.)
o Employee costs

How Do We Estimate LTV

Business Model                                LTV
Download                                      Cost per Download
Subscription                                  Avg. Price x Avg. Customer Lifetime
Microtransactions (Ads / In-app-purchases)    ???

LTV Modeling – Social / Mobile Games

LTV = (1 + k) * Retention * ARPU

Output variable: LTV
Features: k, Retention, ARPU

[Figure: Daily Retention Curve (% of users retained vs. days since install) and ARPDAU Curve (ARPDAU vs. days since install), days 0 to 8]
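One way to read the formula together with the two curves is as a day-by-day sum: expected revenue per installed user on day d is retention(d) x ARPDAU(d), and cumulative LTV adds those up. The sketch below assumes that reading; the summation, the function name, and the curve values are illustrative assumptions, not numbers from the talk.

```python
from itertools import accumulate


def cumulative_ltv(retention, arpdau, k=0.0):
    """Cumulative LTV per install, assuming LTV(d) = (1 + k) * sum of retention * ARPDAU up to day d."""
    if len(retention) != len(arpdau):
        raise ValueError("retention and arpdau must cover the same days")
    daily = [(1 + k) * r * a for r, a in zip(retention, arpdau)]
    return list(accumulate(daily))


# Hypothetical curves for days 0..3 since install.
retention = [1.00, 0.50, 0.40, 0.30]  # fraction of the cohort still active
arpdau = [0.10, 0.08, 0.06, 0.05]     # dollars per daily active user

# k defaults to 0, matching the talk's choice to ignore k-factor.
ltv_curve = cumulative_ltv(retention, arpdau)
print(ltv_curve)
```

The output is a non-decreasing curve of the same shape as a cumulative spend plot; a positive k simply scales it up.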

Predictive LTV Result

[Figure: cumulative spend vs. days since install, days 0 to 100]

Challenges with this simple LTV model

• All of these parameters are moving targets
• k-factor is wildly variable (we’ll ignore k-factor in this presentation)
• Acquisition costs can change (as can LTV and retention), so LTV should be tracked by cohort: by install date and install source
• Retention is computationally difficult to calculate
• Large games can have millions of users who spend
money over many months/years

How can we build out the features we need
to model LTV by cohort?

Kontagent Facts
• Founded in 2007
• 130+ employees and growing
• 100s of Customers
• 1000s of Apps Instrumented
• 250+ billion events per month
• 200MM+ MAUs
• 1 Trillion Events in 2013

How does Kontagent collect data?
• Via a REST API
o APA – Install message
o EVT – Custom event message (user action)
o MTU – Spending message
• Yields a transaction log over time:

Feature Extraction for Predictive LTV

Need to translate a transaction log into a table
o Install Date
o Install Source
o Activity Date
o Spend on Date
o Cumulative Spend to Date
o Users Active on Date
o Users Active on Date or After
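To make the target table concrete, here is a single-threaded sketch that turns a tiny in-memory log into those columns (install source is left out for brevity). The tuple layout, the sample records, and the function name are hypothetical; only the message types (APA, EVT, MTU), the user id `s`, and the spend value `v` in cents follow the talk.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy log: (message type, user id `s`, date, spend `v` in cents).
LOG = [
    ("APA", "u1", date(2011, 7, 8), 0),
    ("APA", "u2", date(2011, 7, 8), 0),
    ("EVT", "u1", date(2011, 7, 8), 0),
    ("EVT", "u2", date(2011, 7, 8), 0),
    ("EVT", "u2", date(2011, 7, 9), 0),
    ("MTU", "u2", date(2011, 7, 9), 250),
    ("EVT", "u1", date(2011, 7, 10), 0),
    ("MTU", "u1", date(2011, 7, 10), 500),
]


def life_table(log):
    """Return one row per (install date, activity date) cohort cell."""
    install = {}
    for kind, s, d, _ in log:
        if kind == "APA":
            install[s] = min(d, install.get(s, d))  # earliest install wins

    active = defaultdict(set)  # (install_date, activity_date) -> user ids
    spend = defaultdict(int)   # (install_date, activity_date) -> cents
    for kind, s, d, v in log:
        if s not in install:
            continue  # no install message for this user
        key = (install[s], d)
        if kind == "EVT":
            active[key].add(s)
        elif kind == "MTU":
            spend[key] += v

    rows = []
    for cohort in sorted(set(install.values())):
        days = sorted({d for (c, d) in list(active) + list(spend) if c == cohort})
        cumulative = 0
        for day in days:
            # Users seen on this date or any later date.
            remaining = set().union(
                *(active.get((cohort, d), set()) for d in days if d >= day)
            )
            cumulative += spend.get((cohort, day), 0)
            rows.append({
                "install_date": cohort,
                "activity_date": day,
                "active_users": len(active.get((cohort, day), set())),
                "remaining_users": len(remaining),
                "day_spend": spend.get((cohort, day), 0) / 100,
                "cumulative_spend": cumulative / 100,
            })
    return rows


rows = life_table(LOG)
for row in rows:
    print(row)
```

Everything here sits in dictionaries in one process, which is exactly what stops scaling once the log holds millions of users and events.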

How can we compute this table of features?

• Python – single thread
o Might work in some cases but need to cache
potentially millions of rows of data

• Hive with Hadoop
o Data warehouse system that allows SQL-like querying of distributed data
o Let’s work through this….

Hive query

• Store data in Hadoop
o Transaction log: APA, EVT, MTU
• Query using Hive Query Language (HiveQL)

select distinct s
from demo_apa
where kt_date(utc_timestamp) = '2011-07-08'
  and s is not null
  and month=201107

This query gets cumbersome quickly…
select sub1.gameplay_date as play_date, sub1.returned,
       sub2.spenders, sub2.total_daily_spend
from
  (select gp.gameplay_date, count(distinct gp.s) as returned
   from
     (
       select distinct s
       from demo_apa
       where kt_date(utc_timestamp) = '2011-07-08' and s is not null
         and month=201107
     ) base
   left outer join
     (
       select s, kt_date(utc_timestamp) as gameplay_date
       from demo_evt
       where s is not null and month>=201107
     ) gp on gp.s = base.s
   group by gp.gameplay_date
  ) sub1
join
  (select sp.spend_date, count(distinct sp.s) as spenders,
          sum(sp.spend)/100 as total_daily_spend
   from
     (
       select distinct s
       from demo_apa
       where kt_date(utc_timestamp) = '2011-07-08' and s is not null
         and month=201107
     ) base
   left outer join
     (
       select s, kt_date(utc_timestamp) as spend_date, v as spend
       from demo_mtu
       where s is not null and v>0 and month>=201107
     ) sp on sp.s = base.s
   group by sp.spend_date
  ) sub2 on sub1.gameplay_date=sub2.spend_date

Result:

play_date    returned    spenders    total_daily_spend
7/10/2011    2           1           75
7/11/2011    4           2           19
7/12/2011    1           1           0.2

Feature Extraction with HiveQL
o Install Date
o Activity Date
o Users Active on Date
o Spend on Date
o Cumulative Spend to Date

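Problem - HiveQL doesn’t support non-equi-joins (joins on a condition such as spend date <= activity date), which a join-based cumulative feature like Cumulative Spend to Date would rely on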
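To see why the join type matters, note that a cumulative column is conceptually a self-join on an inequality: every activity date is matched with all spend dates on or before it. The sketch below uses the three daily totals from the query result above (dates rewritten in ISO form so that string comparison orders them) and compares that inequality join with a plain running sum. The function names are illustrative only.

```python
from itertools import accumulate

# Daily totals taken from the query result shown earlier in the deck.
daily_spend = {"2011-07-10": 75.0, "2011-07-11": 19.0, "2011-07-12": 0.2}


def cumulative_by_inequality_join(daily):
    """Pair each date with every spend date <= it, then sum (a non-equi-join)."""
    return {
        d: sum(v for spend_date, v in daily.items() if spend_date <= d)
        for d in sorted(daily)
    }


def cumulative_by_running_sum(daily):
    """Same result with a single ordered pass and no join."""
    dates = sorted(daily)
    return dict(zip(dates, accumulate(daily[d] for d in dates)))


joined = cumulative_by_inequality_join(daily_spend)
running = cumulative_by_running_sum(daily_spend)
print(joined)
print(running)
```

The join form compares every pair of dates, while the running sum needs only one ordered pass, which is the kind of logic that could be moved into a UDF or a different framework.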

Options for improving Hive performance
• Write tables or temp tables
• Code up some UDFs

How can we compute this table of features?

• Python – single thread

• Hive with Hadoop

• Cascalog (Cascading) with Hadoop
o Cascading is a flow-based computational model for Hadoop
o Cascalog is a declarative query system built on Cascading
o Let’s work through this…

Cascalog Code
;; Reconstructed from a two-column slide. `c` is presumably cascalog.ops;
;; `tap` and `ops` are helper namespaces that the slide does not show.

;; One row per user: user id and install date.
(defn user-install-dates [api-key]
  (let [apas (tap/apa-tap api-key)]
    (<- [?s ?install-date]
        (apas ?s _ _ ?install-ts)
        (ops/ts-to-date ?install-ts :> ?install-date))))

;; Distinct users active on each activity date, per install cohort.
(defn active-users-by-activity-date [install-dates evts]
  (<- [?install-date ?activity-date ?active-users]
      (install-dates ?s ?install-date)
      (evts ?s _ ?ts)
      (ops/ts-to-date ?ts :> ?activity-date)
      (c/distinct-count ?s :> ?active-users)))

;; Paying users and total spend on each activity date, per install cohort.
(defn spend-by-activity-date [install-dates mtus]
  (<- [?install-date ?activity-date ?paying-users ?day-spending]
      (mtus ?s ?v _ _ _ _ ?ts)
      (install-dates ?s ?install-date)
      (ops/ts-to-date ?ts :> ?activity-date)
      (c/distinct-count ?s :> ?paying-users)
      (c/sum ?v :> ?day-spending)))

;; Users active on a date or after it.
(defn cumulative-active-users-by-date [install-dates evts]
  (<- [?install-date ?activity-date ?remaining-users]
      (evts ?s _ ?ts)
      ;; join added here: the slide leaves ?install-date unbound
      (install-dates ?s ?install-date)
      (ops/project-backward ?ts :> ?activity-date)
      (c/distinct-count ?s :> ?remaining-users)))

;; Spend accumulated up to each date.
(defn cumulative-spend-by-date [install-dates mtus]
  (<- [?install-date ?activity-date ?cumulative-spend]
      (mtus ?s ?v _ _ _ _ ?ts)
      ;; join added here: the slide leaves ?install-date unbound
      (install-dates ?s ?install-date)
      (ops/project-forward ?ts :> ?activity-date)
      (c/sum ?v :> ?cumulative-spend)))

;; Joins the four feature queries on install date and activity date.
(defn life-table [api-key]
  (let [install-dates (user-install-dates api-key)
        evts (tap/evt-tap api-key)
        mtus (tap/mtu-tap api-key)
        cumulative-spend (cumulative-spend-by-date install-dates mtus)
        activity-spend (spend-by-activity-date install-dates mtus)
        cumulative-users (cumulative-active-users-by-date install-dates evts)
        active-users (active-users-by-activity-date install-dates evts)]
    (<- [?install-date ?activity-date ?remaining-users ?active-users
         ?paying-users ?day-spending ?cumulative-spending]
        ;; the slide binds ?cumulative-spend here, which never reaches the output
        (cumulative-spend ?install-date ?activity-date ?cumulative-spending)
        (activity-spend ?install-date ?activity-date ?paying-users ?day-spending)
        (cumulative-users ?install-date ?activity-date ?remaining-users)
        (active-users ?install-date ?activity-date ?active-users))))

Feature Extraction with Cascalog
o Install Date
o Activity Date
o Users Active on Date
o Spend on Date
o Cumulative Spend to Date

Options for improvement
• Code not optimized – CPU limited

What have we learned

• Martin sucks (or is awesome) at predicting the number of attendees at Meetups!
• Predictive modeling (particularly around LTV) can have a huge impact on a business
o Requires intuition and iteration
o In the big data world, feature extraction can be a huge challenge
• Feature extraction can be done with Hadoop
o HiveQL is nice because analysts can use it, but it can be inefficient and may not generate all the features we need
o Cascading can solve most of these problems and generate the clean features we need

Questions?

Need a job? We’re hiring:
http://www.kontagent.com/company/careers/

Martin Colaco
Head of Data Science
martin.colaco@kontagent.com
