Description:
One of the biggest challenges for people building data products today is developing and refining features for modeling purposes (i.e., feature extraction) given the volume and variability of web-scale data. In this talk, Martin will discuss some of the challenges and solutions faced by Kontagent as it built out a predictive lifetime value model for its customers. As you will learn, Hadoop is critical to this feature extraction process, and Cascading is quite handy when building out more complex features than can be readily developed in a query framework like Hive.
Speaker:
Martin Colaco, Director of Data Science for Kontagent
2. Agenda
• What is predictive modeling
• What is Lifetime Value (LTV)
• What is feature extraction - challenges
• How can we build a cohort-based predictive LTV model
o Python
o Hive with Hadoop
o Cascalog with Hadoop
3. Can we predict how many attendees tonight?
• How to estimate? Door count (after the fact)
• Is there a way to build a model that we can
use to predict attendees?
4. Predicting how many attendees tonight?
Attendees = Registrations x % Attendance + Non-registrants
5. Predicting how many attendees tonight?
Attendees = Registrations x % Attendance + Non-registrants
Attendees = 201 x 50% + 25 ≈ 125
Lots of uncertainty: Location, Date & Time, Company, Speaker, Title & Topic
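The estimate on this slide is plain arithmetic. A minimal Python sketch makes the three inputs explicit; the function name `estimate_attendees` and its parameter names are hypothetical and chosen only for illustration:

```python
def estimate_attendees(registrations, attendance_rate, non_registrants):
    """Attendees = Registrations x % Attendance + Non-registrants."""
    return registrations * attendance_rate + non_registrants


# Figures from the slide: 201 registrations, 50% attendance, 25 non-registrants.
estimate = estimate_attendees(201, 0.50, 25)
print(estimate)
```

The unrounded value is 125.5, which the slide reports as 125. Each input carries its own uncertainty, so the result is a rough estimate rather than a forecast.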
6. Predictive Modeling
• Know the question you want to answer
• Look at historical behavior
• Apply understanding of those behaviors to new
situations -> new groups of users
Data -> Feature Extraction -> Model Selection -> Model Validation -> Success (Fame, Riches)
7. Common use cases for predictive modeling
My chemical engineering roots….
In – Out = Accumulation
[Diagram: In -> Δ -> Out]
8. Users: Maximizing Growth
In – Out = Accumulation
[Diagram: In -> Δ = Growth -> Out, for an App or Network of Apps]
In: Paid marketing, Organic, X-promotion
Out: Frustration? Boredom? Too expensive? Bad UX? No new content?
9. Money: Maximizing Profit
In – Out = Accumulation
[Diagram: In -> Δ = Profit -> Out, for an App, App Network, or Business]
In: Lifetime Value (LTV)
Out: Business expenses: Marketing costs, Operations (servers, etc.), Employee costs
10. How Do We Estimate LTV
Business Model | LTV
Download | Cost per Download
Subscription | Avg. Price x Avg. Customer Lifetime
Microtransactions (Ads / In-app-purchases) | ???
11. LTV Modeling – Social / Mobile Games
LTV = (1 + k) * Retention * ARPU
Output variable: LTV
Features: k, Retention, ARPU
[Figures: Daily Retention Curve (% of users retained vs. days since install) and ARPDAU Curve (ARPDAU vs. days since install)]
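One way to read the LTV formula on this slide is as a day-by-day product of the two curves, summed over days since install and scaled by (1 + k). The sketch below takes that reading as an assumption, since the slide gives only the product form. The function name `ltv_from_curves` and the curve values are hypothetical, not data from the talk:

```python
def ltv_from_curves(retention, arpdau, k=0.0):
    """Sum retention[d] * arpdau[d] over days since install, scaled by (1 + k).

    Assumption: "Retention * ARPU" is read as a per-day product of the
    retention curve and the ARPDAU curve; the slide does not spell this out.
    """
    if len(retention) != len(arpdau):
        raise ValueError("curves must cover the same days")
    return (1 + k) * sum(r * a for r, a in zip(retention, arpdau))


# Hypothetical curves for days 0..4 since install (illustrative values only).
retention = [1.00, 0.40, 0.30, 0.25, 0.20]  # fraction of users retained
arpdau = [0.10, 0.08, 0.06, 0.05, 0.05]     # dollars per daily active user

ltv = ltv_from_curves(retention, arpdau)  # k left at 0
print(round(ltv, 4))
```

Under this reading, each day's term is the expected revenue from one installed user on that day, so extending the curves further out adds more terms to the sum.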
12. Predictive LTV Result
[Figure: Cumulative Spend vs. Days Since Install]
13. Challenges with this simple LTV model
• All of these parameters are moving targets
• k-factor is wildly variable (we’ll ignore k-factor in this
presentation)
• Acquisition costs can change (as can LTV and
retention) - Cohort LTV by install date and install
source
[Figures: ARPDAU Curve (ARPDAU vs. days since install) and Retention Curve (% of users retained vs. days since install)]
14. Challenges with this simple LTV model
• All of these parameters are moving targets
• k-factor is wildly variable (we’ll ignore k-factor in this
presentation)
• Acquisition costs can change (as can LTV and
retention) - Cohort LTV by install date and install
source
• Retention is computationally difficult to calculate
• Large games can have millions of users who spend
money over many months/years
How can we build out the features we need
to model LTV by cohort?
15. Kontagent Facts
• Founded in 2007
• 130+ employees and growing
• 100s of Customers
• 1000s of Apps Instrumented
• 250+ billion events per month
• 200MM+ MAUs
• 1 Trillion Events in 2013
16. How does Kontagent collect data?
• Via a REST API
o APA – Install message
o EVT – Custom event message (user action)
o MTU – Spending message
• Yields a transaction log over time:
17. Feature Extraction for Predictive LTV
Need to translate a transaction log into a table
o Install Date
o Install Source
o Activity Date
o Users Active on Date
o Spend on Date
o Users Active on Date or After
o Cumulative Spend to Date
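To make the translation from log to table concrete, here is a single-threaded Python sketch over a tiny in-memory log. The record layout, the sample rows, and the function name `cohort_features` are all hypothetical. Treating the MTU value as cents is also an assumption, based on the division by 100 in the Hive query later in the deck. "Users Active on Date or After" is left out here to keep the sketch short.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (message type, user id, date, payload).
# APA payload = install source, EVT payload = None, MTU payload = spend in cents.
LOG = [
    ("APA", "u1", date(2011, 7, 8), "ad"),
    ("APA", "u2", date(2011, 7, 8), "ad"),
    ("APA", "u3", date(2011, 7, 9), "organic"),
    ("EVT", "u1", date(2011, 7, 10), None),
    ("EVT", "u2", date(2011, 7, 10), None),
    ("MTU", "u1", date(2011, 7, 10), 500),
    ("EVT", "u1", date(2011, 7, 11), None),
    ("MTU", "u1", date(2011, 7, 11), 250),
    ("EVT", "u3", date(2011, 7, 11), None),
]


def cohort_features(log):
    """Return rows of (install date, install source, activity date,
    users active on date, spend on date, cumulative spend to date)."""
    cohort_of = {}  # user id -> (install date, install source)
    for kind, user, day, payload in log:
        if kind == "APA":
            cohort_of[user] = (day, payload)

    active = defaultdict(set)   # (cohort, activity date) -> distinct users
    spend = defaultdict(float)  # (cohort, activity date) -> spend in dollars
    for kind, user, day, payload in log:
        cohort = cohort_of.get(user)
        if cohort is None:
            continue  # no install message seen for this user
        if kind == "EVT":
            active[(cohort, day)].add(user)
        elif kind == "MTU":
            spend[(cohort, day)] += payload / 100  # cents -> dollars

    rows = []
    running = defaultdict(float)  # cohort -> cumulative spend so far
    for cohort, day in sorted(set(active) | set(spend)):
        spent = spend.get((cohort, day), 0.0)
        running[cohort] += spent
        users = len(active.get((cohort, day), ()))
        rows.append((cohort[0], cohort[1], day, users, spent, running[cohort]))
    return rows


rows = cohort_features(LOG)
for row in rows:
    print(row)
```

Every dictionary here lives in memory, which is the limit the next slide raises: with millions of users, a single process has to cache all of this state.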
18. How can we compute this table of features?
• Python – single thread
o Might work in some cases but need to cache
potentially millions of rows of data
• Hive with Hadoop
o Data warehouse system that allows SQL-like
querying capabilities of distributed data structures
o Let’s work through this….
19. Hive query
• Store data in Hadoop (transaction log: APA, EVT, MTU)
• Query using Hive Query Language (HiveQL)
select distinct s
from demo_apa
where kt_date(utc_timestamp) = '2011-07-08' and s is not null and month=201107
20. This query gets cumbersome quickly…
select sub1.gameplay_date as play_date, sub1.returned,
       sub2.spenders, sub2.total_daily_spend
from
  (select gp.gameplay_date, count(distinct gp.s) as returned
   from
     (
       select distinct s
       from demo_apa
       where kt_date(utc_timestamp) = '2011-07-08' and s is not null
         and month=201107
     ) base
   left outer join
     (
       select s, kt_date(utc_timestamp) as gameplay_date
       from demo_evt
       where s is not null and month>=201107
     ) gp on gp.s = base.s
   group by gp.gameplay_date
  ) sub1
join
  (select sp.spend_date, count(distinct sp.s) as spenders,
          sum(sp.spend)/100 as total_daily_spend
   from
     (
       select distinct s
       from demo_apa
       where kt_date(utc_timestamp) = '2011-07-08' and s is not null
         and month=201107
     ) base
   left outer join
     (
       select s, kt_date(utc_timestamp) as spend_date, v as spend
       from demo_mtu
       where s is not null and v>0 and month>=201107
     ) sp on sp.s = base.s
   group by sp.spend_date
  ) sub2 on sub1.gameplay_date=sub2.spend_date

Result:
play_date | returned | spenders | total_daily_spend
7/10/2011 | 2 | 1 | 75
7/11/2011 | 4 | 2 | 19
7/12/2011 | 1 | 1 | 0.2
21. Feature Extraction with HiveQL
o Install Date
o Install Source
o Activity Date
o Users Active on Date
o Spend on Date
o Users Active on Date or After
o Cumulative Spend to Date
Problem - HiveQL doesn’t support non-equi-joins
Options for improving Hive performance
• Write tables or temp tables
• Code up some UDFs
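The non-equi-join limitation matters for a feature like "Users Active on Date or After". Read here as "users whose last activity falls on or after a given date" (an interpretation, not a definition from the talk), it pairs each date with every user satisfying a >= condition rather than an equality. The Python sketch below shows that condition on hypothetical data; the function name `active_on_or_after` and the sample users are made up for illustration:

```python
from datetime import date, timedelta

# Hypothetical last-activity date per user in one install cohort.
last_active = {
    "u1": date(2011, 7, 12),
    "u2": date(2011, 7, 10),
    "u3": date(2011, 7, 11),
}


def active_on_or_after(last_active, start, end):
    """Map each date in [start, end] to the number of users whose last
    activity is on or after that date. The >= comparison plays the role
    of the non-equi-join condition."""
    counts = {}
    day = start
    while day <= end:
        counts[day] = sum(1 for last in last_active.values() if last >= day)
        day += timedelta(days=1)
    return counts


counts = active_on_or_after(last_active, date(2011, 7, 10), date(2011, 7, 13))
for day, n in counts.items():
    print(day.isoformat(), n)
```

In a query engine limited to equality joins, the same result typically needs a workaround such as a cross join followed by a filter, or an intermediate table, which is where the temp tables and UDFs listed above come in.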
22. How can we compute this table of features?
• Python – single thread
• Hive with Hadoop
• Cascalog (Cascading) with Hadoop
o Cascading is a flow-based computational model for Hadoop
o Cascalog is a declarative, Clojure-based query system built on Cascading
o Let’s work through this…
24. Feature Extraction with Cascalog
o Install Date
o Install Source
o Activity Date
o Users Active on Date
o Spend on Date
o Users Active on Date or After
o Cumulative Spend to Date
Options for improvement
• Code not optimized – CPU limited
25. What have we learned
• Martin sucks (or is awesome) at predicting number of
attendees at Meetups!
• Predictive modeling (particularly around LTV) can have a
huge impact on a business
o Requires intuition and iteration
o In the big data world, feature extraction can be a huge challenge
• Feature extraction can be done with Hadoop
o HiveQL is nice because analysts can use it, but it can be inefficient and may not generate all the features we need
o Cascading can solve most of these problems and generate the
clean features we need
26. Questions?
Need a job? We’re hiring:
http://www.kontagent.com/company/careers/
Martin Colaco
Head of Data Science
martin.colaco@kontagent.com