Hypothesis Testing: Eliminate Ideas Quickly with Offline and Online Tests

Hypothesis Testing:
How to Eliminate Ideas as Soon as Possible 
Roman Zykov
Retail Rocket
Boston, RecSys 2016

Contents
• Intro
• Offline vs Online testing
• Make offline testing shorter
• Artificial diversity metric
• Online tests

Retail Rocket
• Personalised real-time recommendations
• E-commerce only
• Multiple channels (site, email, …)
• Founded in 2012
• Offices: Amsterdam, Barcelona, Milan, Moscow
• 1000+ retail partners
• 100+ million daily events

Why is testing important?
• Highly competitive market
• It’s not hard to create your own recommendations
• Constant changes in the product and algorithms
• Fast and reliable decisions

Offline vs Online testing
Offline testing forecasts online testing results
• Relatively fast: testing minor changes takes hours
• Few resources: data, computational resources, code, 1 dev
• Hard to forecast online metrics in some cases
• Influence of an algorithm on users' behaviour is ignored
• Bad values of offline metrics prevent online implementation
Online test - final decision point
• Takes a long time: at least two decision-making cycles
• Requires many resources: design, onsite production, etc

Testing facts
• Nine out of ten ideas do not improve anything
• Most ideas have minor impact:
o add new data: extracted from text, images, etc
o adjust parameters of algorithm

Offline predicts Online
Major changes or new algorithm
• Always check by online experiment
• Find an appropriate offline metric afterwards
• Try different definitions of users’ sessions
• Try different event sequences
Minor changes
• Use offline tests if you have a proven offline metric

Make offline testing shorter
What we did
• Functional programming in Scala/Spark. Four languages (Python, Java, Pig, Hive) had previously been used.
• Research in Scala/Spark Notebooks with added R integration for graphics
• Offline evaluation framework for all of our tasks, with metric calculations. The most complicated of all Retail Rocket projects.
What we got
• It takes hours to prove or disprove any simple idea, whereas previously it could have taken days
• Research is limited by the power of our cluster and the number of data scientists

Offline framework
• Scala on Spark
• Deals with existing web logs
• Implicit feedback
• Major metrics:
o Recall, Diversity, Recall with NN, Empty Recs
• Minor metrics:
o Serendipity, Novelty, Coverage
• Different types of event sequences
• Different definitions of users’ sessions
• Personalised / Non-personalised recommendations
• Adjustable TOP of viewable recommendations
• Test panel of sites from different domains

Offline event sequences
view1 view2 view3 cart1 cart2 view4 view5 view6 purchase1
• View2View: view1 -> view2, view2 -> view3, view3 -> view4, view4 -> view5, view5 -> view6
• View2Cart: view1 -> cart1, view2 -> cart1, view3 -> cart1, view4 -> cart1, view5 -> cart2, view6 -> cart2
• View2Purchase: view1 -> purchase1, view2 -> purchase1, view3 -> purchase1, view4 -> purchase1, view5 -> purchase1, view6 -> purchase1
• Cart2Purchase: cart1 -> purchase1, cart2 -> purchase1
• Cart2Cart: cart1 -> cart2
* Events: product view, add to cart, purchase, main page view, search, catalog page, …
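
A minimal sketch of how such pairs could be extracted from one session. The slide does not state the exact pairing rule, so this sketch assumes one: each source event is paired with the first later event of the target type. Under that assumption it reproduces the View2View, View2Purchase, Cart2Purchase and Cart2Cart pairs listed above, but only the first three View2Cart pairs. The helper names `event_type` and `extract_pairs` are hypothetical.

```python
def event_type(event):
    """Strip the trailing index from an event label: 'view12' -> 'view'."""
    return event.rstrip("0123456789")


def extract_pairs(session, source, target):
    """Pair every `source` event with the first later `target` event.

    This pairing rule is an assumption made for illustration only.
    """
    pairs = []
    for i, event in enumerate(session):
        if event_type(event) != source:
            continue
        for later in session[i + 1:]:
            if event_type(later) == target:
                pairs.append((event, later))
                break
    return pairs


session = "view1 view2 view3 cart1 cart2 view4 view5 view6 purchase1".split()

view2view = extract_pairs(session, "view", "view")
view2cart = extract_pairs(session, "view", "cart")
view2purchase = extract_pairs(session, "view", "purchase")
cart2purchase = extract_pairs(session, "cart", "purchase")
cart2cart = extract_pairs(session, "cart", "cart")
```

Each list of pairs can then serve as ground truth for a different offline metric: the first element is the item the recommendation is shown on, the second is the item a good recommendation should contain.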

Offline metric examples
view1 view2 view3 cart1 cart2 view4 view5 view6 purchase1
What Customers Buy After Viewing This Item
• View2Cart
• View2Purchase
• …
Customers Who Bought This Item Also Bought
• Cart2Cart
• Cart2Purchase
• View2Cart
• …

Case: Artificial diversification

Artificial diversification
[Figure: recommendations before ("Original") and after ("After") diversification]
Problem: It’s not possible to use Recall for evaluation

Recall with Nearest Neighbours (NN)
[Figure: top 4 recs, each shown with the content-based similarity scores of its nearest neighbours, compared against the real item]
• Direct hit: 1.0
• Indirect hit: 0.5
• No hit: 0.0
Metric = Average over all sessions
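
The scoring on this slide could be sketched as follows. The slide gives the three scores and the averaging step, but not how an indirect hit is decided; this sketch assumes that an indirect hit means the real item appears among the content-based nearest neighbours of any shown recommendation. The names `session_score`, `recall_with_nn` and the `neighbours` lookup are hypothetical.

```python
def session_score(recs, real_item, neighbours, top_n=4):
    """Score one session: 1.0 direct hit, 0.5 indirect hit, 0.0 no hit.

    neighbours maps an item to the set of its content-based nearest
    neighbours; how that set is built (top-k or a similarity threshold)
    is not specified on the slide and is left to the caller.
    """
    shown = recs[:top_n]
    if real_item in shown:
        return 1.0
    if any(real_item in neighbours.get(rec, ()) for rec in shown):
        return 0.5
    return 0.0


def recall_with_nn(sessions, neighbours, top_n=4):
    """Average the per-session scores over all sessions."""
    if not sessions:
        return 0.0
    scores = [session_score(recs, real, neighbours, top_n)
              for recs, real in sessions]
    return sum(scores) / len(scores)


neighbours = {"a": {"x"}, "b": {"y"}}
sessions = [
    (["a", "b", "c", "d"], "a"),  # direct hit
    (["a", "b", "c", "d"], "x"),  # indirect hit: x is a neighbour of a
    (["a", "b", "c", "d"], "z"),  # no hit
]
metric = recall_with_nn(sessions, neighbours)
```

Because a near-duplicate of the real item still earns partial credit, a metric of this shape does not penalise a diversified list as harshly as plain Recall does.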

AA/BB tests
[Figure: the control group is split into two A groups and the test group into two B groups]

AA/BB tests
[Figure: two panels, "Ideal" and "Dirty", each showing groups A, A, B, B]
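
The slides do not say how users are assigned to the four groups. One common approach, shown here purely as an assumption, is a deterministic hash of the user id, so that a user always lands in the same group. The names `assign_group`, `GROUPS` and the `salt` value are hypothetical.

```python
import hashlib
from collections import Counter

# Two control sub-groups (A1, A2) and two test sub-groups (B1, B2).
GROUPS = ("A1", "A2", "B1", "B2")


def assign_group(user_id, salt="experiment-1"):
    """Map a user id to one of four groups, stable across calls.

    The salt is a hypothetical per-experiment string; changing it
    reshuffles users for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return GROUPS[int(digest, 16) % len(GROUPS)]


counts = Counter(assign_group(f"user-{i}") for i in range(1000))
```

With such a split, comparing A1 with A2 and B1 with B2 gives a sense of how much two identically treated groups differ, before A is compared with B.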

Bayesian approach
• Conversion rates
o Beta distribution with normal priors
• Average Order Values
o Normal distribution (after log) with normal priors
• Priors from historical data before experiment
Anything may be done with posteriors.
E.g.: There is a 95% chance that A has a 1% lift over B
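
A statement like the one above can be read off posterior samples. The sketch below is not the model from the slide: it swaps in the standard conjugate setup (binomial conversions with a Beta prior) instead of the normal priors mentioned above, and it covers conversion rates only. The function name `prob_lift_exceeds`, the `prior` parameter (standing in for priors taken from historical data) and the sample numbers are all hypothetical.

```python
import random


def prob_lift_exceeds(conv_a, n_a, conv_b, n_b,
                      min_lift=0.01, prior=(1.0, 1.0),
                      draws=20000, seed=0):
    """Estimate P(rate_A > rate_B * (1 + min_lift)) by Monte Carlo.

    conv_*, n_*: conversions and visitors per group.
    prior:       (alpha, beta) of the Beta prior; an assumption standing
                 in for priors derived from pre-experiment data.
    """
    rng = random.Random(seed)
    alpha0, beta0 = prior
    wins = 0
    for _ in range(draws):
        # Posterior of a binomial rate under a Beta prior is again Beta.
        p_a = rng.betavariate(alpha0 + conv_a, beta0 + n_a - conv_a)
        p_b = rng.betavariate(alpha0 + conv_b, beta0 + n_b - conv_b)
        if p_a > p_b * (1.0 + min_lift):
            wins += 1
    return wins / draws


clear_winner = prob_lift_exceeds(600, 10000, 500, 10000)
no_difference = prob_lift_exceeds(500, 10000, 500, 10000)
```

Here `min_lift=0.01` is a relative lift of 1%; the returned value is the estimated probability that A beats B by at least that margin.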

Conclusion
• Offline testing can predict online results
• One programming language for R&D reduces the test time
• The Scala language is a good alternative for ML tasks
• Different event sequences for offline metrics
• Recall with Nearest Neighbours (NN) metric

Thank you!
Roman Zykov
Retail Rocket
rzykov@retailrocket.net
https://github.com/RetailRocket/SparkMultiTool
