Kaggle: state of the art ML for real world problems
My experience on Google’s platform for data science: how it works, why it
matters and why you should join!
Alberto Danese
22 / 11 / 2018, Kaggle Days - Brussels
Who am I
● Alberto Danese
● Senior Data Scientist @ Cerved since 2016
● Innovation and Data Sources Team
● Background:
  ● Computer Engineer @ Politecnico di Milano (2007)
  ● Senior Consultant @ Between
  ● Manager @ EY
● Passions: Data! ML, blockchain, anything between IT, applied math and biz
● And Kaggle: currently 1st in Italy, ranked in top 100 worldwide in 2018
Poll
How much do you know about Kaggle?
1. I’m just curious about it!
2. I registered to have a look at data / kernels / tutorials / learning modules
3. I joined a real-world competition
4. It has already turned into an addiction
Basics
● Leading platform for machine learning competitions since 2010
● Companies post real data and problems that can be solved with predictive modeling / machine learning / AI / some kind of magic!
● Data scientists from all over the world compete to produce the best algorithms
● Acquired by Google in 2017
● Grown into a complete ML platform with learning modules, code-sharing features (kernels), a job board and more
Why should you care?
Serious sponsors
● Over 300 competitions so far, by companies such as: [sponsor logos shown on the slide]
Relevant prizes (1/2)
● Aside from tutorial / playground competitions, most competitions have cash prizes ranging from a few thousand dollars to over 1 million
● In 2017, over 4.75 million USD was awarded to data scientists [1]
● Personal hint: Kaggle does pay ;)
● Some competitions offer jobs / interviews as prizes:
  ● 5 recruiting campaigns by Facebook
  ● 3 by Walmart
[1] http://blog.kaggle.com/2017/12/26/your-year-on-kaggle-most-memorable-community-stats-from-2017/
Relevant prizes (2/2)
● My trivial logic...
  ● Companies are willing to pay real money prizes, no matter how hard it is to:
    ● Define a proper competition
    ● Anonymize the data (seriously)
    ● Persuade the legal and business depts to share private data ;)
  ● → Kaggle competitions solve business problems that really matter!
  ● Large data companies (already very attractive) keep on posting job competitions
  ● → They are probably finding talent with Kaggle
Real data and problems
● Machine learning is a bit more than classifying badly written digits
● If I were passionate about iris and flowers, I would be a botanist
Notable participants... (1/2)
Francois Chollet
https://www.kaggle.com/fchollet
Notable participants... (2/2)
Tianqi Chen
https://www.kaggle.com/tqchen
… and a large international community (1/2)
● A place for researchers who demonstrate the real-world effectiveness of their brand new algos / libraries
  ● Winning algos on Kaggle have become the de facto industry standard
● But also a place for data lovers who can:
  ● Prove their skills on real problems, building a portfolio of experience with data and techniques
  ● Have a public, data-driven showcase: 93,000+ kagglers are ranked based on their results and belong to 5 tiers, from Novice to Grandmaster (124 data scientists worldwide, just 11 new GMs in all of 2017)
  ● Learn from the best: many writeups and actual code covering all or part of top solutions
… and a large international community (2/2)
[Map: number of Grandmasters (#GMs) per country; values shown on the slide: 28, 12, 2, 6, 5, 4]
● USA, Russia and China have the highest number of top kagglers (based on declared country of residence)
● Various European countries have 2 or more GMs
How does it work?
Kaggle’s framework (1/5)
● The sponsor company creates a labeled dataset

Feature 1  Feature 2  Feature 3  ... Feature N  Target
34         a          .          .              0
123        b          324.52     .              1
523        n          34.343     .              0
3          a          431.41     .              1
43         .          .          .              1
3          .          .          .              1
425        g          .          .              1
53         f          .          .              0
653        e          42.4       .              1
63         r          121        .              0
63         y          4526.543   .              0
.          d          23.24      .              1
634        d          213.53     .              0
64         a          .          .              0

It’s a classical supervised learning scenario
Simple example based on tabular data and a binary target
The same logic applies to all competitions, even if they involve computer vision, natural language processing or other ML tasks
Kaggle’s framework (2/5)
● The sponsor company splits the dataset into train and test: it provides the target only for the train dataset and defines a single metric for evaluating predictions of the test target

TRAIN
Feat 1  Feat 2  Feat 3    ... Feat N  Target
34      a       .         .           0
123     b       324.52    .           1
523     n       34.343    .           0
3       a       431.41    .           1
43      .       .         .           1
3       .       .         .           1
425     g       .         .           1
53      f       .         .           0

TEST
Feat 1  Feat 2  Feat 3    ... Feat N  Target (not published, kept secret)
653     e       42.4      .           1
63      r       121       .           0
63      y       4526.543  .           0
.       d       23.24     .           1
634     d       213.53    .           0
64      a       .         .           0

Performance metric: e.g. RMSE, AUC, etc.
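The split above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kaggle's actual tooling: the helper names `split_competition_data` and `rmse`, the column names and the toy rows are all made up for the example.

```python
import math
import random


def split_competition_data(rows, target_key, n_test, seed=0):
    """Split labeled rows the way a sponsor might (hypothetical helper).

    Returns (train, test_features, hidden_target): train rows keep the
    target, test rows are published without it, and the test target is
    kept aside for scoring.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    test_full, train = shuffled[:n_test], shuffled[n_test:]
    test_features = [
        {k: v for k, v in row.items() if k != target_key} for row in test_full
    ]
    hidden_target = [row[target_key] for row in test_full]
    return train, test_features, hidden_target


def rmse(y_true, y_pred):
    """Root mean squared error, one of the metrics named on the slide."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy labeled dataset (made-up values): 14 rows with a binary target.
rows = [{"feat_1": i, "feat_2": i % 3, "target": i % 2} for i in range(14)]
train, test_features, hidden_target = split_competition_data(rows, "target", n_test=6)

# A participant only sees `train` and `test_features`; the organizer scores
# each submission against `hidden_target`.
constant_submission = [0.5] * len(test_features)
score = rmse(hidden_target, constant_submission)
```

Participants never receive `hidden_target`; it exists only on the organizer's side, which is what makes the single published metric a fair yardstick.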
Kaggle’s framework (3/5)
● Kagglers build their predictive models with FOSS (usually R / Python) and provide predictions on the whole test dataset (submit predictions only, no code)
● The score is immediately shown on the so-called “public leaderboard”, usually computed on around 20-30% of test records

TEST → ML → SUBMISSION → PUBLIC LB SCORE (instantly), e.g. AUC 0.8594
Feat 1  Feat 2  Feat 3    ... Feat N  Target (submitted)
653     e       42.4      .           1
63      r       121       .           1
63      y       4526.543  .           0
.       d       23.24     .           1
634     d       213.53    .           1
64      a       .         .           0

You could easily overfit the public leaderboard by submitting enough predictions! But...
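As a rough sketch of how a public leaderboard score could be computed, the snippet below scores a submission on a fixed subset of the test rows only. The hidden labels, the submitted values, the choice of public rows and the helper names are all hypothetical; AUC is computed from its pairwise definition to keep the example dependency-free.

```python
def auc(labels, scores):
    """Area under the ROC curve via its pairwise definition.

    It is the share of (positive, negative) pairs in which the positive
    row gets the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def public_lb_score(hidden_target, submission, public_rows):
    """Score a full-test submission on the public subset only."""
    labels = [hidden_target[i] for i in public_rows]
    scores = [submission[i] for i in public_rows]
    return auc(labels, scores)


# Hypothetical hidden test target (16 rows) and one submitted value per row.
hidden_target = [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
submission = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3,
              0.5, 0.6, 0.2, 0.9, 0.3, 0.4, 0.7, 0.1]

# The organizer fixes the public rows once (here 4 of 16, i.e. 25%).
public_rows = [0, 1, 4, 7]
public_score = public_lb_score(hidden_target, submission, public_rows)
print(f"public LB AUC: {public_score:.2f}")
```

Because the participant submits predictions for every test row without knowing which ones are public, the same file can later be rescored on the remaining rows.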
Kaggle’s framework (4/5)
● When the end of the competition approaches (usually after 60-90 days), a kaggler has to choose his/her 2 final solutions
● The final ranking (“private leaderboard”), the only one that matters and the one on which prizes are awarded, is calculated on the other 70-80% of the test dataset
● If you overfitted the public LB, you will drop significantly

TEST → ML → SUBMISSION → PRIVATE LB SCORE (at the end of the competition), e.g. AUC 0.8358
Feat 1  Feat 2  Feat 3    ... Feat N  Target (submitted)
653     e       42.4      .           1
63      r       121       .           1
63      y       4526.543  .           0
.       d       23.24     .           1
634     d       213.53    .           1
64      a       .         .           0
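A small simulation can illustrate why picking final submissions by public score alone is risky. It is a sketch under loud assumptions: the labels are made up, plain accuracy stands in for the competition metric, all 300 "models" are pure noise, and the better of the two selected submissions is assumed to count on the private leaderboard.

```python
import random


def accuracy(y_true, y_pred):
    """Share of correct predictions (stand-in for the competition metric)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(0)
n_rows = 200
hidden_target = [i % 2 for i in range(n_rows)]  # made-up balanced labels

# Fixed split of the test rows: 25% public, 75% private.
rows = list(range(n_rows))
rng.shuffle(rows)
public_rows, private_rows = rows[:50], rows[50:]


def lb_score(submission, subset):
    """Score a submission on a subset of the test rows."""
    return accuracy([hidden_target[i] for i in subset],
                    [submission[i] for i in subset])


# 300 submissions that are pure noise: no real predictive power at all.
submissions = [[rng.randint(0, 1) for _ in range(n_rows)] for _ in range(300)]

# Pick the 2 final submissions by public score alone (the overfitting trap).
finalists = sorted(submissions, key=lambda s: lb_score(s, public_rows), reverse=True)[:2]
best_public = lb_score(finalists[0], public_rows)
final_private = max(lb_score(s, private_rows) for s in finalists)

print(f"best public score: {best_public:.3f}")
print(f"private score of the chosen submissions: {final_private:.3f}")
```

The best public score looks far better than chance only because it is the maximum over many tries; on the private rows the same submissions fall back towards 0.5.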
Kaggle’s framework (5/5)
● Winners (usually top 3) have to present their work to the sponsor, submit the full code (that has to be reproducible) and finally claim their prize :)

Sample final leaderboard
In 1st place, a team that was 14th on the public leaderboard (see the green “13”, i.e. the delta between public and private LB).
The team that was 1st on the public LB actually finished 4th.
You can participate in teams, splitting the prize if you win: a great experience!
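The delta shown on the leaderboard is simply the public rank minus the private rank. A minimal sketch with four hypothetical teams and made-up scores:

```python
def ranks(scores):
    """Map each team to its rank (1 = best); a higher score is better."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {team: position for position, team in enumerate(ordered, start=1)}


# Hypothetical AUC scores for four teams on the two leaderboards.
public_scores = {"team_a": 0.861, "team_b": 0.859, "team_c": 0.855, "team_d": 0.850}
private_scores = {"team_a": 0.830, "team_b": 0.834, "team_c": 0.841, "team_d": 0.836}

public_rank = ranks(public_scores)
private_rank = ranks(private_scores)

# Positive delta = places gained between the public and the private leaderboard.
delta = {team: public_rank[team] - private_rank[team] for team in public_scores}

for team in sorted(private_rank, key=private_rank.get):
    print(f"{private_rank[team]}. {team} "
          f"(public: {public_rank[team]}, delta: {delta[team]:+d})")
```

With the same formula, the winning team on the slide gets 14 - 1 = +13.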
Some examples and how I approach a competition
Types of competitions
● Competitions can usually be divided into 3 categories:
  ● Classic machine learning, i.e. tabular data (one or more tables, small or “big” data, etc.)
  ● Computer vision, on still images or videos
  ● Natural language processing
● Competitions can be:
  ● Public (probably > 95%): anyone can join
  ● Private: the sponsor prefers to keep the data private, only Masters and Grandmasters can see the competition and apply
Example of classic ML
● Sponsor: TalkingData, largest mobile ad provider in China
● Problem: fraud detection, i.e. predicting which clicks are generated by “click farms” in order to generate fake commissions and to cheat on app download rankings
● Prize: $25,000
● Data: 180,000,000 clicks (i.e. labeled records in the training set), 60,000,000 in the test set
Example of Computer Vision
● Sponsor: Laura and John Arnold Foundation et al.
● Problem: fighting cancer, i.e. analyzing CT scans (typically with deep learning) to identify the presence of lung cancer
● Prize: $1,000,000
● Data: over 1,000 patients with over 200 images each (see the image on the slide)
Example of NLP
● Sponsor: Jigsaw and Google (both part of Alphabet)
● Problem: classify toxic comments, i.e. tag offensive, discriminatory, obscene content
● Prize: $35,000
● Data: free text to be classified by different kinds of “toxicity”
Example of… another amazing challenge
● Sponsor: NSF and others
● Problem: classify astronomical sources from LSST, a new telescope that “is about to revolutionize the field, discovering 10 to 100 times more astronomical sources that vary in the night sky than we've ever known”
● Prize: $25,000 + the possibility to present at conferences in California / Sydney / Paris
● Data: light curves and more (tabular data)
My approach
1. Find a motivating competition!
  ● For personal interest
  ● For potential use outside Kaggle (e.g. at work)
2. Understand the business problem, even with anonymized and heavily obscured features
3. Work on your own for a while: from EDA, to a good validation framework, to building a baseline model
4. [optional] Consider joining forces and creating a team:
  ● Pros: meet brilliant people, understand different approaches and tools, better final ranking
  ● Cons: coordination takes time (avoid large groups); when alone you’re forced to deal even with aspects you do not like / care much about
5. Try not to overfit the leaderboard! ...
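For step 3, one common way to set up a validation framework is k-fold cross-validation around a trivial baseline. The sketch below is an assumption about what such a setup might look like, not the author's actual code: the helper `k_fold_indices`, the made-up target and the "predict the training mean" baseline are all illustrative.

```python
import math
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up regression target; the baseline "model" predicts the training mean.
target = [float(i % 4) for i in range(20)]
fold_scores = []
for train_idx, valid_idx in k_fold_indices(len(target), k=5):
    mean_prediction = sum(target[j] for j in train_idx) / len(train_idx)
    fold_scores.append(rmse([target[j] for j in valid_idx],
                            [mean_prediction] * len(valid_idx)))

baseline_cv_score = sum(fold_scores) / len(fold_scores)
print(f"baseline CV RMSE: {baseline_cv_score:.3f}")
```

Any later model has to beat `baseline_cv_score` on the same folds, which gives a score to trust that does not depend on the public leaderboard.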
Just a hint
● Usually the test set differs significantly from the training set
● Classic approach: choose the appropriate statistical test to understand how the distributions differ between train and test, for each feature!
● Drop the “real” target from the training set
● Create a combined dataset with train plus test and a new target:
  ● 1 if the record belongs to train
  ● 0 if the record belongs to test
● Train a simple tree-based classifier and measure AUC and feature importance
● You can now answer easily: is the test set distinguishable from the train set (an AUC close to 0.5 means it is not)? Which variables determine this difference?
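The train-vs-test check can be sketched as follows. Note the swap: to stay dependency-free, this example does not train a tree-based classifier as the slide suggests; instead it scores each feature on its own by the AUC it reaches when separating train from test. In practice one would fit a random forest or GBM on all features at once. The data, feature names and helper names are made up.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def train_vs_test_report(train_rows, test_rows, feature_names):
    """For each feature, how well does it alone separate train from test?

    Returns {feature: AUC in [0.5, 1.0]}; 0.5 means indistinguishable.
    """
    # New target: 1 if the record belongs to train, 0 if it belongs to test.
    is_train = [1] * len(train_rows) + [0] * len(test_rows)
    combined = train_rows + test_rows
    report = {}
    for name in feature_names:
        values = [row[name] for row in combined]
        a = auc(is_train, values)
        report[name] = max(a, 1.0 - a)  # the direction of the shift is irrelevant
    return report


# Made-up data with the real target already dropped:
# "stable" has the same values in train and test, "drifting" is shifted.
train_rows = [{"stable": v, "drifting": 10 + v} for v in (1, 2, 3, 4)]
test_rows = [{"stable": v, "drifting": 20 + v} for v in (1, 2, 3, 4)]

report = train_vs_test_report(train_rows, test_rows, ["stable", "drifting"])
for name, score in report.items():
    print(f"{name}: AUC {score:.2f}")
```

Features with a score near 1.0 are the ones that make the test set distinguishable, and they are the first candidates to inspect, transform or drop.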
Kaggle vs. Job as Data Scientist
Pure predictive performance vs. tradeoffs
● Most Kaggle competitions can be run with no time / RAM / complexity constraints
● You have to do whatever it takes to improve the accuracy, or any given metric, by 0.01%
● Clean code is not a priority (but you have to polish it and make sure you can reproduce your solution, in case you win)
● Be smart! Focus on the test set you have to predict
100% ML vs. bigger challenges (1/2)
[Two charts: time spent on different data science activities]
● Kaggle competition: Understand data → Machine learning
  ● A straightforward, objective, pure algorithmic competition
● Job in Data Science: Define biz problem / data product → Find right data → Understand data → Machine learning → Solution engineering → Maintenance and evolution
  ● Understand business problems, find and enrich data, define how to move from prototype to a working and maintainable solution… and deal with people! (it’s not a bad thing)
100% ML vs. bigger challenges (2/2)
● Data comes first!
  ● Average model + Great data >> Great model + Average data
● But keep in mind that current ML algos smash traditional statistical methods both in predictive performance and in the time required to achieve a “reasonable” model
● Explainability has to be taken into account in several applications (but I prefer a safer self-driving car even if it is a black box)
● A strict and transparent validation setup, i.e. defining how to evaluate models before doing ML, is actually key in the workplace too
My Kaggle toolkit – Useful in the workplace too
● R: good choice for ML, not so good if you are focused on deep learning
● RStudio: a jewel, still looking for a comparable IDE for Python
● Git (GitHub / GitLab): a no-brainer
● data.table: impressive speed for handling and manipulating large data files
● dplyr: nicer syntax than data.table
● xgboost: great implementation of GBM
● lightgbm: probably an even greater implementation of GBM (open source, by the Microsoft ML team)
● ranger: multithreaded random forest, old-fashioned but worth a try
● glmnet: even more old-fashioned, regularized generalized linear models
● ggplot2: the plotting library by the RStudio team
Kaggle – is it worth your time?
My experience
● 2014: 7 years after graduation. Stumbled across Kaggle. First thought: cool! Second: where do I start?
● 2015: Finally decided to dedicate 100% to data science. Attended a part-time master (Bicocca) and thought a Kaggle competition would be a good final project
● Feb 2016: First competition! Finished 18th out of 1764
● Apr 2016: First team competition! Joined 3 Chinese guys (from Kansas City, Shanghai and San Francisco), finished in top 10 out of 3000 competitors
● October 2016: Joined Cerved! The Italian data company, where data and algos have been key long before the data science hype
● 2017: Other competitions. Various gold medals, learned a lot from every new challenge
● Feb 2018: Money! Won a large prize in a private, masters-only competition
● May 2018: Grandmaster. Entered the top tier in Kaggle
● Still learning!
Conclusion
● Cons
  ● Competitions take time: motivation and passion are strong among kagglers, and incentives are really huge (especially for low-income countries), so being highly competitive is hard (especially when “solo”)
● Pros
  ● No better place to learn ML on real-world data
  ● Understand approaches / tools that really work (hype in DS is still growing)
  ● Understand validation frameworks, overfitting vs. robustness
  ● Skills are directly usable at work
  ● Nice community: code is a universal language, and I am always impressed by the brilliant solutions and elegant code of many kagglers
  ● Personal branding: if data is really that important… adding some objective results to a CV is not bad for a data scientist
Thank you and happy kaggling!

Kaggle Days Brussels - Alberto Danese

  • 1. Kaggle: state of the art ML for real world problems My experience on Google’s platform for data science: how it works, why it matters and why you should join! Alberto Danese
  • 2. 22 / 11 / 2018Kaggle Days - Brussels2 Who am I ● Alberto Danese ● Senior Data Scientist @ Cerved since 2016 ● Innovation and Data Sources Team ● Background: ● Computer Engineer @ Politecnico di Milano (2007) ● Senior Consultant @ Between ● Manager @ EY ● Passions: Data! ML, blockchain, anything between IT, applied math and biz ● And Kaggle: currently 1st in Italy, ranked in top 100 worldwide in 2018
  • 3. 22 / 11 / 2018Kaggle Days - Brussels3 Poll How much do you know about Kaggle? 1. I’m just curious about it! 2. I registered to have a look at data / kernels / tutorials / learning modules 3. I joined a real-world competition 4. It has already turned to an addiction
  • 4. 22 / 11 / 2018Kaggle Days - Brussels4 Basics ● Leading platform for machine learning competitions since 2010 ● Companies post real data and problems that can be solved with predictive modeling / machine learning / AI / some kind of magic! ● Data scientists from all over the world compete to produce the best algorithms ● Acquired by Google in 2017 ● Grown to a complete ML platform with learning modules, code sharing features (kernels), job board and more
  • 6. 22 / 11 / 2018Kaggle Days - Brussels6 Serious sponsors ● Over 300 competitions so far, by companies such as:
  • 7. 22 / 11 / 2018Kaggle Days - Brussels7 Relevant prizes (1/2) ● Aside from tutorial / playground competition, most competitions have money prizes ranging from a few thousand dollars to over 1 million ● In 2017, over 4,75 million U$D awarded to data scientists[1] ● Personal hint: Kaggle does pay ;) ● Some competitions have jobs / interviews as prize: ● 5 recruiting campaigns by Facebook ● 3 by Walmart [1] http://blog.kaggle.com/2017/12/26/your-year-on-kaggle-most-memorable-community-stats-from-2017/
  • 8. 22 / 11 / 2018Kaggle Days - Brussels8 Relevant prizes (2/2) ● My trivial logic... ● Companies are willing to pay real money prizes, no matter how hard it is to: ● Define a proper competition ● Anonimize the data (seriously) ● Persuade the legal and business depts to share private data ;) ● → Kaggle competitions solve business problems that really matter! ● Large data companies (already very attractive) keep on posting job competitions ● → They are probably finding talents with Kaggle
  • 9. Real data and problems ● Machine learning is a bit more than classifying badly written digits ● If I were passionate about iris and flowers, I would be a botanist
  • 10. Notable participants... (1/2) ● François Chollet (creator of Keras) https://www.kaggle.com/fchollet
  • 11. Notable participants... (2/2) ● Tianqi Chen (creator of XGBoost) https://www.kaggle.com/tqchen
  • 12. … and a large international community (1/2) ● A place for researchers who demonstrate in the real world the effectiveness of their brand new algos / libraries ● Winning algos on Kaggle have become the de facto industry standard ● But also a place for data-lovers who can: ● Prove their skills on real problems, building a portfolio of experiences on data and techniques ● Have a public, data-driven showcase: 93,000+ kagglers are ranked based on their results and belong to 5 tiers, from Novice to Grandmaster (124 data scientists worldwide, just 11 new GMs in all of 2017) ● Learn from the best: many writeups and actual code covering all or part of top solutions
  • 13. … and a large international community (2/2) ● USA, Russia and China have the highest number of top kagglers (based on declared country of residence) ● Various European countries have 2 or more GMs ● [Map: #GMs by country; values shown: 28, 12, 2, 6, 5, 4]
  • 14. How does it work?
  • 15. Kaggle’s framework (1/5) ● The sponsor company creates a labeled dataset ● It’s a classical supervised learning scenario ● Simple example based on tabular data and a binary target ● The same logic applies to all competitions, even if they involve computer vision, natural language processing or other ML tasks
    Feature 1  Feature 2  Feature 3  ... Feature N  Target
    34         a          .          .              0
    123        b          324.52     .              1
    523        n          34.343     .              0
    3          a          431.41     .              1
    43         .          .          .              1
    3          .          .          .              1
    425        g          .          .              1
    53         f          .          .              0
    653        e          42.4       .              1
    63         r          121        .              0
    63         y          4526.543   .              0
    .          d          23.24      .              1
    634        d          213.53     .              0
    64         a          .          .              0
  • 16. Kaggle’s framework (2/5) ● The sponsor company splits the dataset into train and test: it provides the target only for the train dataset and defines a single metric for evaluating predictions of the test target ● Performance metric: e.g. RMSE, AUC, etc.
    TRAIN
    Feat 1  Feat 2  Feat 3    ... Feat N  Target
    34      a       .         .           0
    123     b       324.52    .           1
    523     n       34.343    .           0
    3       a       431.41    .           1
    43      .       .         .           1
    3       .       .         .           1
    425     g       .         .           1
    53      f       .         .           0
    TEST
    Feat 1  Feat 2  Feat 3    ... Feat N  Target (not published, kept secret)
    653     e       42.4      .           1
    63      r       121       .           0
    63      y       4526.543  .           0
    .       d       23.24     .           1
    634     d       213.53    .           0
    64      a       .         .           0
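To make the evaluation step concrete, here is a minimal sketch of the two metrics the slide names, RMSE and AUC, written in plain Python. The hidden target is the one shown on the slide; the `predictions` list is made up for illustration, and the function names are assumptions, not Kaggle's own scoring code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def auc(y_true, scores):
    """Area under the ROC curve, computed pairwise: the share of
    (positive, negative) pairs in which the positive row gets the
    higher score (ties count as half). 0.5 is random, 1.0 is perfect."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hidden test target from the slide; the predictions are hypothetical.
hidden_target = [1, 0, 0, 1, 0, 0]
predictions = [0.9, 0.2, 0.8, 0.7, 0.1, 0.3]

test_auc = auc(hidden_target, predictions)
test_rmse = rmse(hidden_target, predictions)
print(f"AUC = {test_auc:.3f}, RMSE = {test_rmse:.3f}")
```

AUC only looks at the ordering of the predictions, while RMSE penalizes the size of each error, so the same submission can look good on one metric and mediocre on the other. That is one reason the sponsor fixes a single metric up front.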
  • 17. Kaggle’s framework (3/5) ● Kagglers build their predictive models with FOSS (usually R / Python) and provide predictions on the whole test dataset (submit predictions only, no code) ● The score is immediately shown on the so-called “public leaderboard”, usually computed on around 20-30% of test records ● You could easily overfit the public leaderboard by submitting enough predictions! But...
    TEST → ML → SUBMISSION → PUBLIC LB SCORE (INSTANTLY), e.g. AUC 0.8594
    Feat 1  Feat 2  Feat 3    ... Feat N  Target (SUBMISSION)
    653     e       42.4      .           1
    63      r       121       .           1
    63      y       4526.543  .           0
    .       d       23.24     .           1
    634     d       213.53    .           1
    64      a       .         .           0
  • 18. Kaggle’s framework (4/5) ● When the end of the competition approaches (usually after 60-90 days), a kaggler has to choose his/her 2 final solutions ● The final ranking (“private leaderboard”), the only one that matters and on which prizes are awarded, is calculated on the other 70-80% of the test dataset ● If you overfitted the public LB, you will drop significantly
    TEST → ML → SUBMISSION → PRIVATE LB SCORE (AT THE END OF COMPETITION), e.g. AUC 0.8358
    Feat 1  Feat 2  Feat 3    ... Feat N  Target (SUBMISSION)
    653     e       42.4      .           1
    63      r       121       .           1
    63      y       4526.543  .           0
    .       d       23.24     .           1
    634     d       213.53    .           1
    64      a       .         .           0
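The public / private mechanism can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Kaggle's actual implementation: the 25% public fraction, the synthetic `hidden_target` and `submission`, and the helper names `split_public_private` and `leaderboard_scores` are all hypothetical.

```python
import random

def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def split_public_private(n_rows, public_fraction=0.25, seed=0):
    """Fix, once and for all, which test rows feed the public leaderboard."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_public = int(n_rows * public_fraction)
    return indices[:n_public], indices[n_public:]

def leaderboard_scores(hidden_target, submission, public_idx, private_idx):
    """Score the same submission separately on the public and private rows."""
    def score(idx):
        return auc([hidden_target[i] for i in idx], [submission[i] for i in idx])
    return score(public_idx), score(private_idx)

# Synthetic data: a noisy score that is only loosely related to the target.
rng = random.Random(42)
hidden_target = [rng.randint(0, 1) for _ in range(1000)]
submission = [0.3 * t + rng.random() for t in hidden_target]  # AUC needs ranks only

public_idx, private_idx = split_public_private(len(hidden_target))
public_auc, private_auc = leaderboard_scores(
    hidden_target, submission, public_idx, private_idx
)
print(f"public LB AUC = {public_auc:.4f}, private LB AUC = {private_auc:.4f}")
```

The two numbers generally differ even for an honest model, simply because they are computed on different rows. Tuning a model against the public number therefore chases noise that the private rows do not share.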
  • 19. Kaggle’s framework (5/5) ● Winners (usually top 3) have to present their work to the sponsor, submit the full code (which has to be reproducible) and finally claim their prize :) ● Sample final leaderboard: in 1st place, a team that was 14th in the public leaderboard (see the green “13”, i.e. the delta between public and private LB). The team that was 1st in the public LB actually finished 4th. ● You can participate in teams, splitting the prize if you win – great experience!
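The leaderboard delta mentioned above is just the difference between a team's public and private rank. A small sketch, with hypothetical team names and scores (only the two AUC values 0.8594 and 0.8358 echo the earlier slides):

```python
def ranks(scores, higher_is_better=True):
    """Map each team to its 1-based leaderboard position."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {team: position for position, team in enumerate(ordered, start=1)}

# Hypothetical AUC scores for four teams.
public_scores = {"team_a": 0.8594, "team_b": 0.8571, "team_c": 0.8560, "team_d": 0.8532}
private_scores = {"team_a": 0.8358, "team_b": 0.8402, "team_c": 0.8367, "team_d": 0.8410}

public_rank = ranks(public_scores)
private_rank = ranks(private_scores)

# Positive delta = places gained between public and private leaderboard.
delta = {team: public_rank[team] - private_rank[team] for team in public_scores}

for team in sorted(private_rank, key=private_rank.get):
    print(f"{private_rank[team]}. {team} (delta {delta[team]:+d})")
```

In this made-up example the public leader drops to last place and the last public team wins, which is the same kind of shake-up the sample leaderboard shows.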
  • 20. Some examples and how I approach a competition
  • 21. Types of competitions ● Competitions can usually be divided into 3 categories: ● Classic machine learning – i.e. tabular data (one or more tables, small or “big” data, etc.) ● Computer vision – on still images or videos ● Natural language processing ● Competitions can be: ● Public (probably > 95%): anyone can join ● Private: the sponsor prefers to keep the data private, only Masters and Grandmasters can see the competition and apply
  • 22. Example of classic ML ● Sponsor: TalkingData, largest mobile ad provider in China ● Problem: fraud detection, i.e. predicting which clicks are generated by “click farms” in order to generate fake commissions and to cheat on app download rankings ● Prize: $25,000 ● Data: 180,000,000 clicks (i.e. labeled records in training set), 60,000,000 in test set
  • 23. Example of Computer Vision ● Sponsor: Laura and John Arnold Foundation et al. ● Problem: fighting cancer, i.e. analyzing CT scans (typically with deep learning) to identify the presence of lung cancer ● Prize: $1,000,000 ● Data: over 1,000 patients with over 200 images each (see the image alongside)
  • 24. Example of NLP ● Sponsor: Jigsaw and Google (both part of Alphabet) ● Problem: classify toxic comments, i.e. tag offensive, discriminatory, obscene content ● Prize: $35,000 ● Data: free text to be classified by different kinds of “toxicity”
  • 25. Example of… another amazing challenge ● Sponsor: NSF and others ● Problem: classify astronomical sources from LSST, a new telescope that “is about to revolutionize the field, discovering 10 to 100 times more astronomical sources that vary in the night sky than we've ever known” ● Prize: $25,000 + possibility to present at conferences in California / Sydney / Paris ● Data: light curves and more (tabular data)
  • 26. My approach 1. Find a motivating competition! ● For personal interest ● For potential use outside Kaggle (e.g. at work) 2. Understand the business problem – even with anonymized and heavily obscured features 3. Work on your own for a while – from EDA, to a good validation framework, to building a baseline model 4. [optional] Consider joining forces and creating a team: ● Pros: meet brilliant people, understand different approaches and tools, better final ranking ● Cons: coordination takes time (avoid large groups), while when alone you’re forced to deal even with aspects you do not like or care much about 5. Try not to overfit the leaderboard! ...
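As an illustration of step 3, a validation framework plus a baseline can start as small as the sketch below: a k-fold split and a "model" that always predicts the training mean. Everything here is an assumption made for the example (synthetic target, 5 folds, RMSE as metric, the function names); the point is only that the evaluation loop exists before any real modeling does.

```python
import math
import random

def k_fold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cross_validate_baseline(target, n_folds=5, seed=0):
    """Baseline 'model': always predict the mean target of the training folds."""
    fold_scores = []
    for train_idx, valid_idx in k_fold_indices(len(target), n_folds, seed):
        train_mean = sum(target[i] for i in train_idx) / len(train_idx)
        y_valid = [target[i] for i in valid_idx]
        fold_scores.append(rmse(y_valid, [train_mean] * len(y_valid)))
    return fold_scores

# Synthetic regression target, for illustration only.
rng = random.Random(1)
target = [rng.gauss(10.0, 2.0) for _ in range(200)]

fold_scores = cross_validate_baseline(target)
cv_mean = sum(fold_scores) / len(fold_scores)
print("fold RMSE:", [round(s, 3) for s in fold_scores], "mean:", round(cv_mean, 3))
```

Any later model has to beat `cv_mean` on the same folds. Comparing local cross-validation scores rather than public leaderboard scores is also a practical way to follow step 5.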
  • 27. Just a hint ● Usually the test set has significant differences from the training set ● Classic approach: choose the appropriate statistical test to compare the distributions of train and test – for each feature! ● Drop the “real” target from the training set ● Create a combined dataset with train plus test and a new target: ● 1 if the record belongs to train ● 0 if the record belongs to test ● Train a simple tree-based classifier and measure AUC and feature importance ● You can now answer easily: is the test distinguishable from the train? What are the variables that determine this difference?
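The train-vs-test trick above can be sketched without any ML library. Note the simplification: instead of the tree-based classifier the slide recommends, this stdlib-only example scores each feature on its own by the AUC it achieves at separating train rows (new target 1) from test rows (new target 0). That only detects monotone shifts in a single feature, whereas a tree-based model would also pick up interactions. The data and the function name `adversarial_feature_auc` are hypothetical.

```python
import random

def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def adversarial_feature_auc(train_rows, test_rows):
    """For every feature, return how well it alone separates train from test.

    0.5 means indistinguishable; values near 1.0 mean the feature's
    distribution differs strongly between the two sets.
    """
    combined = train_rows + test_rows
    is_train = [1] * len(train_rows) + [0] * len(test_rows)  # the new target
    result = {}
    for j in range(len(combined[0])):
        column = [row[j] for row in combined]
        a = auc(is_train, column)
        result[j] = max(a, 1.0 - a)  # the direction of the shift does not matter
    return result

# Synthetic features with the "real" target already dropped:
# feature 0 has the same distribution in both sets, feature 1 is shifted in test.
rng = random.Random(7)
train_rows = [[rng.random(), rng.gauss(0.0, 1.0)] for _ in range(300)]
test_rows = [[rng.random(), rng.gauss(1.5, 1.0)] for _ in range(300)]

separability = adversarial_feature_auc(train_rows, test_rows)
for feature, score in sorted(separability.items(), key=lambda kv: -kv[1]):
    print(f"feature {feature}: train-vs-test AUC = {score:.3f}")
```

Features at the top of this list are the ones to inspect first: they may need to be dropped, transformed, or at least taken into account when building the validation split.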
  • 28. Kaggle vs. Job as Data Scientist
  • 29. Pure predictive performance vs. tradeoffs ● Most Kaggle competitions can be run with no time / RAM / complexity constraints ● You have to do whatever it takes to improve the accuracy, or any given metric, by 0.01% ● Clean code is not a priority (but you have to polish it and be sure to be able to reproduce your solution, in case you win) ● Be smart! Focus on the test set you have to predict
  • 30. 100% ML vs. bigger challenges (1/2) ● Time spent on different data science activities ● Kaggle: Understand data, Machine learning. A straightforward, objective, purely algorithmic competition ● Job in Data Science: Define biz problem / data product, Find right data, Understand data, Machine learning, Solution engineering, Maintenance and evolution. Understand business problems, find and enrich data, define how to move from prototype to working and maintainable solution… and deal with people! (it’s not a bad thing)
  • 31. 100% ML vs. bigger challenges (2/2) ● Data comes first! ● Average model + Great data >> Great model + Average data ● But keep in mind that current ML algos smash traditional statistical methods both in predictive performance and in time required to achieve a “reasonable” model ● Explainability has to be taken into account in several applications (but I prefer a safer self-driving car even if it is a black box) ● A strict and transparent validation setup – defining how to evaluate models before doing ML – is actually key in the workplace too
  • 32. My Kaggle toolkit – Useful in the workplace too ● R: good choice for ML, not so good if you are focused on deep learning ● RStudio: a jewel, still looking for a comparable IDE for Python ● Git (GitHub / GitLab): no-brainer here ● data.table: impressive speed to handle and manipulate large data files ● dplyr: nicer syntax than data.table ● XGBoost: great implementation of GBM ● LightGBM: probably an even greater implementation of GBM (open, by Microsoft ML team) ● ranger: multithreaded random forest, old-fashioned but worth a try ● glmnet: even more old-fashioned - Regularized Generalized Linear Models ● ggplot2: the plotting library by the RStudio team
  • 33. Kaggle – is it worth your time?
  • 34. My experience ● 2014: 7 years after graduation. Stumbled across Kaggle. First thought: cool! Second: where do I start? ● 2015: Finally decided to dedicate 100% to data science. Attended a part-time master's program (Bicocca) and thought a Kaggle competition would be a good final project ● Feb 2016: First competition! Finished 18th out of 1764 ● Apr 2016: First team competition! Joined 3 Chinese guys (from Kansas City, Shanghai and San Francisco), finished in top 10 out of 3000 competitors ● October 2016: Joined Cerved! The Italian data company, where data and algos have been key long before the data science hype ● 2017: Other competitions. Various gold medals, learned a lot from every new challenge ● Feb 2018: Money! Won a large prize in a private, masters-only competition ● May 2018: Grandmaster. Entered the top tier in Kaggle ● Still learning!
  • 35. Conclusion ● Cons ● Competitions take time: motivation and passion are strong among kagglers, incentives are really huge (especially for low-income countries), so being highly competitive is hard (especially when “solo”) ● Pros ● No better place to learn ML on real world data ● Understand approaches/tools that really work (hype in DS is still growing) ● Understand validation frameworks, overfitting vs. robustness ● Skills are directly usable at work ● Nice community: code is a universal language, always impressed by brilliant solutions and elegant code by many kagglers ● Personal branding: if data is really that important… adding some objective results to a CV is not bad for a data scientist
  • 36. Thank you and happy kaggling!