1. The Big Data Challenges of
Computational Market Research
Frank Smadja
frank.smadja@toluna.com (@FrankieMbaye)
EVP Engineering
Toluna
April 2014
2. Toluna
Table of Contents
1. What is a Market Research study
2. The main challenge: Targeting.
3. Machine Learning Problem and Model
4. Some Experiments
5. Current and Future Work
4. Toluna
Market Research Goal:
Answering Questions for Brands
Customer/Employee Satisfaction:
• Are my customers happy?
• What can I do better for them?
• Am I getting better or worse?
Concept testing:
• Would dog owners buy my organic dog food?
• What should be my target market?
Ad testing:
• Is my advertising campaign effective?
Brand positioning:
• How is my brand doing compared to the competition?
• What are my perceived strong features?
• Where should I invest more?
And many more types of questions
9. Toluna
Market Research Main Challenge: Targeting
Select a segment of respondents (a sample) that is:
• Relevant to the question (dog owners who have one big dog
and one small dog, smokers who are trying to stop, etc.)
• Representative and balanced (not biased).
The tougher or more restrictive the targeting, the more expensive the
study.
10. Toluna
The Targeting Pipeline and Incidence Rate
Targeting process: Start → Demographics → Behavioral → Study → Complete
Demographics:
• Select the right population based on simple demographic
attributes: Age, Gender, Region, Ethnicity, Income, etc.
• Fixed set of attributes known beforehand.
Behavioral:
• Further select based on behavioral and custom attributes:
flies more than 5 times a year, uses aspirin on a daily basis, etc.
• Free-style attributes, usually unknown.
Incidence Rate:
IR = Completes / Starts
Cost grows as IR falls: the lower the IR, the more respondents must
start the study for each complete.
11. Toluna
Why is targeting hard?
Looking for 1,000 people in the UK who
“smoke,” “tried to stop in the past,”
“live around London,” “age 24-50.”
Data on UK population:
• 18% of the UK adults smoke
• 40% of smokers tried to stop
• 15% of the population is in the
London area
• 30% is between 24-50
Incidence rate:
0.18 * 0.4 * 0.15 * 0.3 ≈ 0.3 %
Sample size: about 333,333 (1,000 / 0.003)
Figure: nested populations (UK adults, London, smokers, tried to stop)
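The arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming (as the slide does) that the four attributes are independent so their rates simply multiply. The function names `incidence_rate` and `required_starts` are hypothetical.

```python
import math

def incidence_rate(rates):
    """Multiply per-attribute rates, assuming the attributes are independent."""
    ir = 1.0
    for r in rates:
        ir *= r
    return ir

def required_starts(completes, ir):
    """Number of respondents who must start so that about `completes` qualify."""
    return math.ceil(completes / ir)

# Rates from the slide: smokers, tried to stop, London area, age 24-50.
ir = incidence_rate([0.18, 0.40, 0.15, 0.30])
print(f"IR = {ir:.3%}")
print("starts needed:", required_starts(1000, ir))
```

With the unrounded rate (0.324 %) this gives roughly 308,600 starts. The 333,333 on the slide comes from rounding the rate to 0.3 % first.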
12. Toluna
State of the Art: Use Known Demographic
Features
• Basic Demographics are known: 100% incidence.
o Age and London
• Smokers: 18%
• Tried to stop: 40%
Incidence rate:
1 * 0.18 * 0.4 = 7.2 %
Sample size: about 13,900
Figure: nested populations (adults in the London area, smokers, tried to stop)
13. Toluna
New Direction: Use Known Features and
Predict Unknown Features
• Basic Demographics are known:
100% incidence.
o Age and London
• Suppose we could predict smokers with 85% accuracy
(85% of those predicted to smoke actually do).
• Tried to stop still unknown: 40%
Incidence rate:
1 * 0.85 * 0.4 = 34 %
Sample size: about 2,900
Figure: nested populations (adults in the London area who are predicted
to be smokers, smokers, tried to stop)
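The three targeting strategies can be compared side by side by recomputing each incidence rate from the rates that still have to be screened for. This is a minimal sketch under two assumptions: the rates are independent, and the 0.85 figure is read as the share of predicted smokers who really smoke. The scenario labels and the helper `starts_needed` are hypothetical.

```python
TARGET_COMPLETES = 1000

# Each scenario lists only the rates that still have to be screened for.
scenarios = {
    "no known features": [0.18, 0.40, 0.15, 0.30],
    "known demographics": [0.18, 0.40],
    "known demographics + predicted smokers": [0.85, 0.40],
}

def starts_needed(rates, completes=TARGET_COMPLETES):
    """Return (incidence rate, starts required) for one scenario."""
    ir = 1.0
    for r in rates:
        ir *= r
    return ir, round(completes / ir)

results = {name: starts_needed(rates) for name, rates in scenarios.items()}
for name, (ir, starts) in results.items():
    print(f"{name:40s} IR = {ir:6.2%}  starts = {starts:,}")
```

Each step that turns a screened attribute into a known or predicted one removes a factor below 1 from the product, which is why the required sample shrinks so quickly.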
14. Toluna
How to Predict Features?
The Space Model
Users
Features
Shirt color
Red Blue
Smokes?
Yes No
Sex, Age, Region, etc.
User 1
User 2
User 3
User 4
10^9 users
10^7 features
Sparse Matrix containing all the attributes (integer answers to
questions) we have ever asked.
Demographic
attributes
Behavioral attributes
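One way to picture this sparse matrix is as a mapping that stores only the cells that were actually answered. The sketch below is plain Python and assumes integer-coded answers. The class `SparseAnswers` and its methods are hypothetical names for illustration, not the storage layer described in the talk.

```python
from collections import defaultdict

class SparseAnswers:
    """User x feature matrix that stores only the cells that were answered."""

    def __init__(self):
        self.by_user = defaultdict(dict)     # user -> {feature: answer}
        self.by_feature = defaultdict(dict)  # feature -> {user: answer}

    def set(self, user, feature, answer):
        self.by_user[user][feature] = answer
        self.by_feature[feature][user] = answer

    def get(self, user, feature):
        """Return the stored answer, or None if the question was never answered."""
        return self.by_user.get(user, {}).get(feature)

    def users_with(self, feature, answer):
        """Set of users who gave `answer` to `feature`."""
        column = self.by_feature.get(feature, {})
        return {u for u, a in column.items() if a == answer}

    def density(self, n_users, n_features):
        """Fraction of the full matrix that is actually filled."""
        filled = sum(len(row) for row in self.by_user.values())
        return filled / (n_users * n_features)

m = SparseAnswers()
m.set("user1", "smokes", 1)       # illustrative coding: 1 = yes, 0 = no
m.set("user1", "shirt_color", 0)  # illustrative coding: 0 = red, 1 = blue
m.set("user2", "smokes", 0)
m.set("user3", "shirt_color", 1)
print(m.users_with("smokes", 1), m.density(n_users=4, n_features=2))
```

Keeping both a per-user and a per-feature index doubles the memory used, but it makes row lookups and column lookups equally cheap, and both directions are needed for the predictions discussed next.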
15. Toluna
The Learning Task - The Model
Try to predict answer to the “Smokes?” attribute based on other
attributes.
Smokes? Dog owner? Jogger? Overweight?
16. Toluna
The Learning Task - Collaborative Filtering
User correlation or Feature correlation
User correlation: High level features [William Cohen]
• If Josie and Bob both have the X feature then if Josie has the
Y feature, Bob is more likely to have the Y feature as well.
• Dog owners
• Political inclination, Taste, Lifestyle
Feature correlation:
• If Josie has the X feature, Josie is more likely to also have the
Y feature.
• Joggers (y) and Smokers (n)
• Favorite sports and Race/Ethnicity
• Income level and Education level
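The two kinds of correlation can be sketched on toy data. This is a minimal illustration, assuming binary features coded 0/1. The similarity measure (share of matching answers), the function names, and the four sample users are all hypothetical choices, not the model used in the talk.

```python
def similarity(a, b):
    """Fraction of shared features on which two users gave the same answer."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[f] == b[f] for f in shared) / len(shared)

def predict_user_based(target, others, feature):
    """User correlation: similarity-weighted vote of users who answered `feature`."""
    num = den = 0.0
    for other in others:
        if feature not in other:
            continue
        known = {f: v for f, v in other.items() if f != feature}
        w = similarity(target, known)
        num += w * other[feature]
        den += w
    return num / den if den else None

def predict_feature_based(users, x, y):
    """Feature correlation: estimate P(y = 1 | x = 1) from users who answered both."""
    both = [u for u in users if x in u and y in u and u[x] == 1]
    if not both:
        return None
    return sum(u[y] for u in both) / len(both)

users = [
    {"jogger": 1, "dog_owner": 1, "smokes": 0},
    {"jogger": 1, "dog_owner": 0, "smokes": 0},
    {"jogger": 0, "dog_owner": 0, "smokes": 1},
    {"jogger": 0, "dog_owner": 1, "smokes": 1},
]
bob = {"jogger": 1, "dog_owner": 1}  # "smokes" is unknown for Bob

score = predict_user_based(bob, users, "smokes")
p_smokes_given_jogger = predict_feature_based(users, "jogger", "smokes")
print(score, p_smokes_given_jogger)
```

In this toy data the users most similar to Bob do not smoke, so his score is low, and no jogger smokes, which mirrors the "Joggers (y) and Smokers (n)" pairing above.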
17. Toluna
First Experiments with Multiple Imputation
Smaller Task: Complete missing data on a single survey for a single
customer.
Example: On a specific survey, some respondents skip some questions and
others skip the income-level question. Use the answers provided by other
respondents to impute the missing data.
Imputation: Replace missing data with substituted values, with more or
less sophistication: mean, nearest neighbor, multiple imputation, etc.
[Andridge & Little 2011], [Rubin 1987], ...
Implementation: IBM SPSS Missing Values module. Uses an iterative
Markov chain Monte Carlo (MCMC) method and multiple imputation.
Used by the US Census Bureau.
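The two simplest strategies named on this slide can be sketched as follows. This is single imputation only (most common value and nearest neighbor), not the MCMC-based multiple imputation of the SPSS module. The function names, the distance measure, and the sample rows are hypothetical.

```python
from collections import Counter

def impute_mode(rows, feature):
    """Fill missing values with the most common observed answer."""
    observed = [r[feature] for r in rows if r.get(feature) is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [r if r.get(feature) is not None else {**r, feature: mode}
            for r in rows]

def distance(a, b, skip):
    """Share of disagreements over features that both respondents answered."""
    shared = [f for f in a
              if f != skip and a[f] is not None and b.get(f) is not None]
    if not shared:
        return float("inf")
    return sum(a[f] != b[f] for f in shared) / len(shared)

def impute_nearest(rows, feature):
    """Fill missing values with the answer of the most similar respondent."""
    donors = [r for r in rows if r.get(feature) is not None]
    filled = []
    for r in rows:
        if r.get(feature) is None:
            donor = min(donors, key=lambda d: distance(r, d, feature))
            r = {**r, feature: donor[feature]}
        filled.append(r)
    return filled

# Integer-coded answers; None marks a skipped question.
rows = [
    {"age_band": 1, "education": 1, "income": 1},
    {"age_band": 2, "education": 3, "income": 3},
    {"age_band": 2, "education": 3, "income": None},
    {"age_band": 1, "education": 1, "income": 1},
]
print(impute_mode(rows, "income")[2]["income"])
print(impute_nearest(rows, "income")[2]["income"])
```

On these rows the two strategies disagree: the most common income band is 1, while the respondent with identical age and education answers reported band 3. That gap is the reason for using the other answers at all.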
18. Toluna
First Experiments with Multiple Imputation
Some Results
Where it does not work:
• Too much missing data (over 10%)
• Too many possible answers (What are the names of your
children? What is your home city? etc.)
• Not enough data overall (less than 1,000)
Example of features that work well:
Dog owners, Smokers, Income level, Age (3 bands), etc.
Accuracy: 85% using blind tests.
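A blind test of this kind can be sketched as: hide some answers that are actually known, impute them, and count how often the imputed value matches the hidden one. The harness below is a minimal illustration, not the procedure used for the 85% figure. `blind_test`, `impute_by_group`, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def blind_test(rows, feature, impute, hide_fraction=0.1, seed=0):
    """Hide a fraction of known answers, impute them, return the hit rate."""
    rng = random.Random(seed)
    known = [i for i, r in enumerate(rows) if r.get(feature) is not None]
    hidden = rng.sample(known, max(1, int(len(known) * hide_fraction)))
    masked = [dict(r) for r in rows]  # copy so the original rows stay intact
    for i in hidden:
        masked[i][feature] = None
    filled = impute(masked, feature)
    hits = sum(filled[i][feature] == rows[i][feature] for i in hidden)
    return hits / len(hidden)

def impute_by_group(rows, feature, group="age_band"):
    """Toy imputer: most common answer among respondents in the same group.
    Assumes every group keeps at least one observed answer."""
    counts = defaultdict(Counter)
    for r in rows:
        if r.get(feature) is not None:
            counts[r[group]][r[feature]] += 1
    return [r if r.get(feature) is not None
            else {**r, feature: counts[r[group]].most_common(1)[0][0]}
            for r in rows]

# Toy data in which the answer is fully determined by the age band.
rows = [{"age_band": i % 2, "smokes": i % 2} for i in range(40)]
accuracy = blind_test(rows, "smokes", impute_by_group)
print(f"accuracy = {accuracy:.0%}")
```

The toy data is perfectly predictable by construction, so the score here says nothing about real surveys. The point is only the hide, impute, compare loop.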
19. Toluna
Current Work
Currently working on the storage component in
AWS using HBase, Elasticsearch and Hadoop.
Some queries:
• Find people who smoke, have a red shirt and are
between 22 and 34.
• Compute and store the similarity or correlation
between any pair of users.
• Compute and store the similarity between
features.
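The queries listed on this slide can be illustrated with an in-memory stand-in. This sketch deliberately uses plain Python dictionaries instead of any HBase or Elasticsearch API. The helper names, the Jaccard measure, and the sample users are assumptions made for illustration.

```python
from itertools import combinations

users = {
    "u1": {"smokes": 1, "dog_owner": 1, "shirt_color": "red", "age": 25},
    "u2": {"smokes": 1, "dog_owner": 0, "shirt_color": "blue", "age": 30},
    "u3": {"smokes": 0, "dog_owner": 1, "shirt_color": "red", "age": 40},
    "u4": {"smokes": 1, "dog_owner": 1, "shirt_color": "red", "age": 33},
}

def find(users, **conditions):
    """Ids of users matching every condition.
    A condition is an exact value or an inclusive (low, high) range."""
    def ok(value, cond):
        if isinstance(cond, tuple):
            return value is not None and cond[0] <= value <= cond[1]
        return value == cond
    return sorted(uid for uid, attrs in users.items()
                  if all(ok(attrs.get(k), c) for k, c in conditions.items()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def feature_similarities(users, features):
    """Jaccard similarity between the sets of users holding each binary feature."""
    holders = {f: {u for u, attrs in users.items() if attrs.get(f) == 1}
               for f in features}
    return {(f, g): jaccard(holders[f], holders[g])
            for f, g in combinations(features, 2)}

matches = find(users, smokes=1, shirt_color="red", age=(22, 34))
sims = feature_similarities(users, ["smokes", "dog_owner"])
print(matches, sims)
```

The same pairwise loop applied to users instead of features grows quadratically with the number of users, which is one reason the storage and batch-computation layer matters.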
20. Toluna
Future Work
• Define the model: binary features (smokes), integers
(number of children, income), strings (city,
diseases, etc.).
• Experiment on a large scale with Collaborative
Filtering algorithm and others.
• Experiment with user-based and feature-based
filtering (blend? Slope One?)
• Integrate this into the targeting methodology.
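For reference, Slope One (one of the candidates named above) predicts a missing value from the average difference between pairs of features, weighted by how many users answered both. The following is a minimal sketch of the weighted variant, assuming integer answers on a comparable scale. The question ids `q1` to `q3` and the sample answers are hypothetical.

```python
from collections import defaultdict

def slope_one_fit(answers):
    """answers: {user: {feature: value}}.
    Returns average pairwise differences and co-answer counts."""
    diffs = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row in answers.values():
        for i, vi in row.items():
            for j, vj in row.items():
                if i != j:
                    diffs[i][j] += vi - vj
                    counts[i][j] += 1
    for i in diffs:
        for j in diffs[i]:
            diffs[i][j] /= counts[i][j]
    return diffs, counts

def slope_one_predict(row, feature, diffs, counts):
    """Weighted Slope One estimate of `feature` for one user's known answers."""
    num = den = 0.0
    for j, vj in row.items():
        c = counts.get(feature, {}).get(j, 0)
        if j == feature or c == 0:
            continue
        num += (diffs[feature][j] + vj) * c
        den += c
    return num / den if den else None

answers = {
    "john": {"q1": 5, "q2": 3, "q3": 2},
    "mark": {"q1": 3, "q2": 4},
    "lucy": {"q2": 2, "q3": 5},
}
diffs, counts = slope_one_fit(answers)
lucy_q1 = slope_one_predict(answers["lucy"], "q1", diffs, counts)
print(round(lucy_q1, 2))
```

The scheme needs only the pairwise difference and count tables, which can be precomputed and updated incrementally. That property makes it a plausible candidate for a feature-based baseline.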