Kaggle Days Brussels - Alberto Danese
1. Kaggle: state-of-the-art ML for real-world problems
My experience on Google's platform for data science: how it works, why it matters and why you should join!
Alberto Danese
2. Who am I (Kaggle Days Brussels, 22 / 11 / 2018)
● Alberto Danese
● Senior Data Scientist @ Cerved since 2016
● Innovation and Data Sources Team
● Background:
  ● Computer Engineer @ Politecnico di Milano (2007)
  ● Senior Consultant @ Between
  ● Manager @ EY
● Passions: data! ML, blockchain, anything between IT, applied math and business
● And Kaggle: currently 1st in Italy, ranked in the top 100 worldwide in 2018
3. Poll
How much do you know about Kaggle?
1. I'm just curious about it!
2. I registered to have a look at data / kernels / tutorials / learning modules
3. I joined a real-world competition
4. It has already turned into an addiction
4. Basics
● Leading platform for machine learning competitions since 2010
● Companies post real data and problems that can be solved with predictive modeling / machine learning / AI / some kind of magic!
● Data scientists from all over the world compete to produce the best algorithms
● Acquired by Google in 2017
● Has grown into a complete ML platform with learning modules, code-sharing features (kernels), a job board and more
6. Serious sponsors
● Over 300 competitions so far, by companies such as: [sponsor logos]
7. Relevant prizes (1/2)
● Aside from tutorial / playground competitions, most competitions have money prizes ranging from a few thousand dollars to over 1 million
● In 2017, over 4.75 million USD was awarded to data scientists[1]
● Personal hint: Kaggle does pay ;)
● Some competitions have jobs / interviews as the prize:
  ● 5 recruiting campaigns by Facebook
  ● 3 by Walmart
[1] http://blog.kaggle.com/2017/12/26/your-year-on-kaggle-most-memorable-community-stats-from-2017/
8. Relevant prizes (2/2)
● My trivial logic...
● Companies are willing to pay real money prizes, no matter how hard it is to:
  ● Define a proper competition
  ● Anonymize the data (seriously)
  ● Persuade the legal and business depts to share private data ;)
● → Kaggle competitions solve business problems that really matter!
● Large data companies (already very attractive employers) keep on posting job competitions
● → They are probably finding talent with Kaggle
9. Real data and problems
● Machine learning is a bit more than classifying badly written digits
● If I were passionate about irises and flowers, I would be a botanist
12. … and a large international community (1/2)
● A place for researchers to demonstrate, on real-world problems, the effectiveness of their brand-new algos / libraries
● Winning algos on Kaggle have become the de facto industry standard
● But also a place for data lovers who can:
  ● Prove their skills on real problems, building a portfolio of experience with data and techniques
  ● Have a public, data-driven showcase: 93,000+ kagglers are ranked based on their results and belong to 5 tiers, from Novice to Grandmaster (124 data scientists worldwide, just 11 new GMs in all of 2017)
  ● Learn from the best: many writeups and actual code from all or part of the top solutions
13. … and a large international community (2/2)
[World map: number of GMs by country]
● USA, Russia and China have the highest number of top kagglers (based on declared residence country)
● Various European countries have 2 or more GMs
15. Kaggle's framework (1/5)
● The sponsor company creates a labeled dataset
[Sample table: columns Feature 1 … Feature N plus a binary Target, one row per labeled record]
It's a classic supervised learning scenario: a simple example based on tabular data and a binary target. The same logic applies to all competitions, even if they involve computer vision, natural language processing or other ML tasks.
16. Kaggle's framework (2/5)
● The sponsor company splits the dataset into train and test: it provides the target only for the train dataset, and it defines a single metric for evaluating the predictions on the test target
[Diagram: TRAIN table with Features 1..N and Target; TEST table with the same features, but its Target column is not published and kept secret]
Performance metric: e.g. RMSE, AUC, etc.
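To make the single evaluation metric concrete, here is a minimal, dependency-free sketch (not Kaggle's actual scoring code) of AUC, one of the metrics mentioned above, computed as the probability that a randomly chosen positive is scored above a randomly chosen negative; the targets and predictions are made-up toy values:

```python
def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs where the
    positive record gets the higher score; ties count as half a win."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical hidden test targets and one submission's predicted scores
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [0.9, 0.3, 0.2, 0.6, 0.7, 0.1]
print(auc(y_true, y_pred))  # 0.875: most positives outrank the negatives
```

The pairwise formulation is O(n²) and only meant to show what the number means; real scoring uses a rank-based implementation.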
17. Kaggle's framework (3/5)
● Kagglers build their predictive models with FOSS (usually R / Python) and submit predictions on the whole test dataset (predictions only, no code)
● The score on the so-called "public leaderboard" is provided instantly, computed on only part of the test records (usually around 20-30%)
[Diagram: ML model → submission of test predictions → public LB score (instantly), e.g. AUC 0.8594]
You could easily overfit the public leaderboard by submitting enough predictions! But...
18. Kaggle's framework (4/5)
● As the competition's end approaches (usually after 60-90 days), each kaggler has to choose his/her 2 final solutions
● The final ranking (the "private leaderboard"), the only one that matters and the one on which prizes are awarded, is calculated on the other 70-80% of the test dataset
● If you overfitted the public LB, you will drop significantly
[Diagram: ML model → submission of test predictions → private LB score (at the end of the competition), e.g. AUC 0.8358]
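The public/private mechanism is easy to simulate. In this toy sketch (random data, a simple accuracy metric, everything hypothetical), submitting many random guesses and keeping whichever scores best on the public split looks like progress, but the apparent gain does not transfer to the private split:

```python
import random

random.seed(0)

n_test = 1000
n_public = 300                       # ~30% of test rows feed the public LB
truth = [random.randint(0, 1) for _ in range(n_test)]
public_idx = set(random.sample(range(n_test), n_public))
private_idx = [i for i in range(n_test) if i not in public_idx]

def accuracy(sub, idx):
    return sum(sub[i] == truth[i] for i in idx) / len(idx)

# "Submit" many random guesses, keeping whichever scores best on the public LB
best_public, best_sub = -1.0, None
for _ in range(200):
    sub = [random.randint(0, 1) for _ in range(n_test)]
    score = accuracy(sub, public_idx)
    if score > best_public:
        best_public, best_sub = score, sub

print(f"best public score: {best_public:.3f}")   # inflated purely by luck
print(f"same submission, private: {accuracy(best_sub, private_idx):.3f}")
```

The cherry-picked public score sits well above 50%, while the same submission scores close to coin-flip accuracy on the private rows: exactly the drop described above.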
19. Kaggle's framework (5/5)
● Winners (usually the top 3) have to present their work to the sponsor, submit the full code (which has to be reproducible) and finally claim their prize :)
[Sample final leaderboard] In 1st place, a team that was 14th on the public leaderboard (see the green "13", i.e. the delta between public and private LB). The team that was 1st on the public LB actually finished 4th.
You can participate in teams, splitting the prize if you win. Great experience!
21. Types of competitions
● Competitions can usually be divided into 3 categories:
  ● Classic machine learning, i.e. tabular data (one or more tables, small or "big" data, etc.)
  ● Computer vision, on still images or videos
  ● Natural language processing
● Competitions can be:
  ● Public (probably > 95%): anyone can join
  ● Private: the sponsor prefers to keep the data private; only Masters and Grandmasters can see the competition and apply
22. Example of classic ML
● Sponsor: TalkingData, the largest mobile ad provider in China
● Problem: fraud detection, i.e. predicting which clicks are generated by "click farms" in order to collect fake commissions and cheat on app download rankings
● Prize: $25,000
● Data: 180,000,000 clicks (i.e. labeled records in the training set), 60,000,000 in the test set
23. Example of computer vision
● Sponsor: Laura and John Arnold Foundation et al.
● Problem: fighting cancer, i.e. analyzing CT scans (typically with deep learning) to identify the presence of lung cancer
● Prize: $1,000,000
● Data: over 1,000 patients with over 200 images each (see the image alongside)
24. Example of NLP
● Sponsor: Jigsaw and Google (both part of Alphabet)
● Problem: classifying toxic comments, i.e. tagging offensive, discriminatory, obscene content
● Prize: $35,000
● Data: free text to be classified on different kinds of "toxicity"
25. Example of… another amazing challenge
● Sponsor: NSF and others
● Problem: classifying astronomical sources from LSST, a new telescope that "is about to revolutionize the field, discovering 10 to 100 times more astronomical sources that vary in the night sky than we've ever known"
● Prize: $25,000 + the possibility to present at conferences in California / Sydney / Paris
● Data: light curves and more (tabular data)
26. My approach
1. Find a motivating competition!
  ● For personal interest
  ● For potential use outside Kaggle (e.g. at work)
2. Understand the business problem, even with anonymized and heavily obscured features
3. Work on your own for a while: from EDA, to a good validation framework, to building a baseline model
4. [optional] Consider joining forces and creating a team:
  ● Pros: meet brilliant people, understand different approaches and tools, better final ranking
  ● Cons: coordination takes time (avoid large groups); when alone, you're forced to deal even with aspects you don't like or care much about
5. Try not to overfit the leaderboard! ...
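A good validation framework (step 3) usually starts with a k-fold split. A minimal, library-free sketch (the helper name `kfold_indices` is mine, not a standard API):

```python
import random

def kfold_indices(n_rows, k, seed=42):
    """Shuffle the row indices once, then yield (train, valid) index
    pairs so that every row lands in the validation set exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

# 10 rows, 5 folds: validation folds of 2 rows, training folds of 8
for train, valid in kfold_indices(10, 5):
    assert len(train) == 8 and len(valid) == 2
    assert not set(train) & set(valid)  # no leakage between the two
```

In practice you would use your library's k-fold utilities (stratified where the target is imbalanced), but the contract is the same: every row is scored out-of-fold exactly once.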
27. Just a hint
● The test set usually has significant differences from the training set
● Classic approach: choose the appropriate statistical test to detect different distributions between train and test, for each feature!
● A simpler, practical alternative:
  ● Drop the "real" target from the training set
  ● Create a combined dataset of train plus test, with a new target: 1 if the record belongs to train, 0 if it belongs to test
  ● Train a simple tree-based classifier and measure AUC and feature importance
● You can now answer easily: is the test distinguishable from the train? Which variables drive the difference?
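The steps above (a trick often called adversarial validation) can be sketched without any ML library by scoring each feature on its own: the per-feature AUC against the train-vs-test target stands in here for the tree-based classifier's feature importance, and the data and feature names are made up for illustration:

```python
def auc(y_true, y_score):
    """AUC: fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: feature "a" drifts between train and test, feature "b" does not
train = {"a": [1, 2, 2, 3, 1, 2], "b": [5, 6, 5, 7, 6, 5]}
test  = {"a": [7, 8, 9, 8, 7, 9], "b": [6, 5, 7, 5, 6, 7]}

is_train = [1] * 6 + [0] * 6  # the new target: does the record come from train?
for feat in ("a", "b"):
    score = auc(is_train, train[feat] + test[feat])
    # Near 0.5: train and test look alike on this feature.
    # Far from 0.5 (either direction): the distributions differ.
    print(feat, round(score, 3), "drift!" if abs(score - 0.5) > 0.2 else "ok")
```

With a real tree-based model (e.g. lightgbm) you get the interactions between features as well, but the single-feature version already answers the slide's question for the obvious cases.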
29. Pure predictive performance vs. tradeoffs
● Most Kaggle competitions can be run with no time / RAM / complexity constraints
● You have to do whatever it takes to improve the accuracy, or any given metric, by 0.01%
● Clean code is not a priority (but you have to polish it and make sure you can reproduce your solution, in case you win)
● Be smart! Focus on the test set you have to predict
30. 100% ML vs. bigger challenges (1/2)
[Diagram: time spent on different data science activities]
● A Kaggle competition: understand data → machine learning. A straightforward, objective, purely algorithmic challenge.
● A job in data science: define the business problem / data product → find the right data → understand data → machine learning → solution engineering → maintenance and evolution. Understand business problems, find and enrich data, define how to move from a prototype to a working and maintainable solution… and deal with people! (it's not a bad thing)
31. 100% ML vs. bigger challenges (2/2)
● Data comes first!
  ● Average model + Great data >> Great model + Average data
● But keep in mind that current ML algos smash traditional statistical methods both in predictive performance and in the time required to achieve a "reasonable" model
● Explainability has to be taken into account in several applications (but I'd prefer a safer self-driving car even if it is a black box)
● A strict and transparent validation setup, defining how to evaluate models before doing ML, is actually key in the workplace too
32. My Kaggle toolkit (useful in the workplace too)
● R: a good choice for ML, not so good if you are focused on deep learning
● RStudio: a jewel; still looking for a comparable IDE for Python
● Git (GitHub / GitLab): a no-brainer here
● data.table: impressive speed for handling and manipulating large data files
● dplyr: nicer syntax than data.table
● xgboost: great implementation of GBM
● LightGBM: probably an even greater implementation of GBM (open source, by the Microsoft ML team)
● ranger: multithreaded random forests; old-fashioned but worth a try
● glmnet: even more old-fashioned: regularized generalized linear models
● ggplot2: the plotting library by the RStudio team
34. My experience
● 2014: 7 years after graduation, stumbled across Kaggle. First thought: cool! Second: where do I start?
● 2015: finally decided to dedicate 100% to data science. Attended a part-time master's (Bicocca) and thought a Kaggle competition would be a good final project
● Feb 2016: first competition! Finished 18th out of 1,764
● Apr 2016: first team competition! Joined 3 Chinese guys (from Kansas City, Shanghai and San Francisco), finished in the top 10 out of 3,000 competitors
● Oct 2016: joined Cerved! The Italian data company, where data and algos have been key since long before the data science hype
● 2017: other competitions. Various gold medals; learned a lot from every new challenge
● Feb 2018: money! Won a large prize in a private, Masters-only competition
● May 2018: Grandmaster. Entered the top tier on Kaggle
● Still learning!
35. Conclusion
● Cons
  ● Competitions take time: motivation and passion are strong among kagglers and the incentives are really huge (especially for low-income countries), so staying highly competitive is complicated (especially when going "solo")
● Pros
  ● No better place to learn ML on real-world data
  ● Understand the approaches / tools that really work (the hype in DS is still growing)
  ● Understand validation frameworks, overfitting vs. robustness
  ● Skills are directly usable at work
  ● Nice community: code is a universal language, and I'm always impressed by the brilliant solutions and elegant code of many kagglers
  ● Personal branding: if data is really that important… adding some objective results to a CV is not bad for a data scientist