
What does it take to win the Kaggle/Yandex competition


Feedback on how we won the Kaggle/Yandex competition


  1. WHAT DOES IT TAKE TO WIN THE KAGGLE/YANDEX COMPETITION
     Christophe Bourguignat
     Kenji Lefèvre-Hasegawa
     Paul Masurel @Dataiku
     Matthieu Scordia @Dataiku
  2. OUTLINE OF THE TALK
     • Review of the Kaggle/Yandex Challenge
     • How we worked (team work & tools)
     • The winning model
  3. GOAL
     Re-rank the URLs returned by Yandex according to each user's personal preferences.
     [Diagram: original ranking url1, url2, url3, url4 re-ranked as url3, url2, url1, url4]
     ML CHALLENGE
     Predict how relevant each URL is to the user and re-rank the result set accordingly.
     Section: The Kaggle/Yandex challenge
  4. GIVEN
     • 30 days of logs (train: 27 days, test: 3 days)
     • Users' historical queries, clicks & dwell times
     • Prior activity in the test session: queries, clicks & dwell times
     [Diagram: past queries Q Q Q Q; test session: Q Q T ?]
     SIZE
     • 15 GB
  5. QUALITY METRIC
     • One test query per user, taken from the last 3 days
     • The NDCG metric penalizes relevance errors most heavily on top-ranked URLs
     • No A/B test
     [Diagram: a "Prediction" and "Another ranking" of the URLs, labeled OK / BAD by Kaggle]
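A minimal sketch of how NDCG weights errors by rank, to make the metric on this slide concrete. It assumes graded relevance labels (0, 1, 2 here are illustrative), the common gain 2^rel - 1 and a log2 rank discount; the function names are hypothetical and the slides do not give the exact formula used by Kaggle.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    # enumerate() is 0-based, so rank 1 gets discount log2(2) = 1.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The same relevant URL costs more when it is pushed down from the top.
best = ndcg([2, 0, 0, 0])      # relevant URL ranked first
second = ndcg([0, 2, 0, 0])    # relevant URL ranked second
last = ndcg([0, 0, 0, 2])      # relevant URL ranked last
```

With these definitions, `best` is 1.0 and the score drops as the relevant URL moves down the list, which is the behavior the slide describes.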
  6. TEAM DATAIKU SCIENCE STUDIO / KAGGLE
     • Christophe Bourguignat: Engineer, data enthusiast
     • Kenji Lefèvre-Hasegawa: Ph.D. in math, new to ML
     • Paul Masurel: Software Engineer @dataiku
     • Matthieu Scordia: Data Scientist @dataiku
     First meeting: October 16th, 2013
     Section: How we worked (team work & tools)
  7. WE'VE USED
     • Related papers (mainly Microsoft's)
     • 12 cores, 64 GB
     • Python scikit-learn
     • Dataiku Science Studio
     • Java RankLib
  8. DATAIKU SCIENCE STUDIO
     [Workflow diagram; node labels: Original train, Split train & validation, Features, Labels, Features & labels]
     • FEATURES CONSTRUCTION: team members work independently
     • LEARNING: team members work independently
     • DATA DRIVEN COMPUTATION
  9. HOW MUCH WORK?
     • 960+ emails
     • 360+ features
     • 50+ ideas tuned by grid search (300+ models fitted)
     • Server heavily loaded during the last 3 weeks
     • 56 Kaggle submissions
     • 196 teams, 264 players, 3570 submissions
     [Leaderboard timeline; labels: Top 25, Top 10, 5th, 1st, 3rd, 1st, "2014-01-01", "Future top 2 & 3 enter race"; intervals: 1/2 month, 1 week]
  10. PROBLEM ANALYSIS
      Query / Result Set
      • Rank
      • URL snippet quality
      • URL is skipped, clicked or missed
      Click / Reading URL
      • URL & domain relevance, judged from dwell time
      Section: The winning model
  11. FEATURES
      Features:
      • Rank
      • User habits, query specificity (entropy, frequency, …)
      • Snippet relevance
      • Missed, Skipped, Clicked
      • URL & domain relevance
      Variants of [...] & Clicked:
      • Probability, stimuli frequency, Mean Reciprocal Rank (MRR)
      • For each user: history, previous activity in the test session, and their aggregate
      • For all users
      • Computed over all queries and over the same query
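Two of the features named on this slide, query entropy and Mean Reciprocal Rank, can be sketched as follows. This is an illustration only: the input layout (a list of clicked URLs for a query, and a list of 1-based ranks at which a URL was clicked) and the function names are assumptions, not the team's actual feature code.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the clicked-URL distribution for one query.

    Low entropy: users nearly always click the same URL (a specific query).
    High entropy: clicks are spread over many URLs (an ambiguous query).
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def mean_reciprocal_rank(click_ranks):
    """Mean of 1/rank over the 1-based ranks at which a URL was clicked."""
    if not click_ranks:
        return 0.0
    return sum(1.0 / rank for rank in click_ranks) / len(click_ranks)

# Hypothetical click logs for two queries.
navigational = click_entropy(["a.com", "a.com", "a.com", "a.com"])
ambiguous = click_entropy(["a.com", "b.com", "c.com", "d.com"])
mrr = mean_reciprocal_rank([1, 2, 4])
```

The same two functions can be applied per user, over the test session only, or over all users, which matches the way the slide declines each feature over several scopes.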
  12. MODELS
      • Random Forest (predict proba) + maximize E(NDCG): Kaggle/Yandex Top 1, then 3rd
      • LambdaMART (gradient boosted trees optimized for NDCG): WINS!
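One way to read "predict proba + maximize E(NDCG)" is sketched below. It assumes a classifier has already produced, for each URL, a probability for each relevance grade (for instance the rows returned by scikit-learn's `RandomForestClassifier.predict_proba`), and it sorts URLs by expected gain. Sorting this way maximizes the expected DCG; treating the NDCG normalizer as fixed is a simplification made here, and the slides do not detail the team's exact procedure. Function names and the example probabilities are hypothetical.

```python
def expected_gain(grade_probs):
    """Expected DCG gain of one URL; grade_probs[g] is P(relevance grade = g)."""
    return sum(p * (2 ** grade - 1) for grade, p in enumerate(grade_probs))

def rerank(urls, proba_rows):
    """Order URLs by decreasing expected gain.

    The DCG discount shrinks with rank, so placing the largest expected
    gains first maximizes the expected DCG of the list.
    """
    scored = sorted(zip(urls, proba_rows),
                    key=lambda pair: expected_gain(pair[1]),
                    reverse=True)
    return [url for url, _ in scored]

# Hypothetical probabilities over grades 0, 1, 2 for three URLs.
urls = ["url1", "url2", "url3"]
proba_rows = [
    [0.7, 0.2, 0.1],  # url1: probably irrelevant
    [0.1, 0.3, 0.6],  # url2: probably highly relevant
    [0.4, 0.5, 0.1],  # url3: probably somewhat relevant
]
ranking = rerank(urls, proba_rows)
```

LambdaMART, the approach the slide says won, optimizes a ranking objective directly during training rather than re-sorting classifier outputs afterwards.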
  13. QUESTIONS?
      Thank you!
