A presentation sharing Kaggle best practices by Mark Landry, ranked 44th among all Kaggle competitors in the world! Presented as part of the "Winning Kaggle 101" event, hosted by Machine Learning at Berkeley and the Data Science Society at Berkeley. Special thanks to the Berkeley Institute for Data Science for the venue! ML@B: ml.berkeley.edu DSSB: http://dssberkeley.org BIDS: http://bids.berkeley.edu/ H2O.ai: http://h2o.ai
4. H2O.ai
Machine Intelligence
Iterative Workflow
• Agile workflows generally outperform waterfall
methodologies
• One of the most commonly cited insights from Kaggle
employees regarding success
5. H2O.ai
Iterative Workflow: Basics
• Work quickly to develop a reasonable model early
o Model should be complete enough to produce a score under the
competition's evaluation setup
o Simple models: see how predicting the mean or the mode scores
o Confirms understanding of the problem
o Confirms validity of your internal loss calculation
• Enhance model iteratively
o Explore and add features: additional data sets and/or transformations
o Experiment with additional model classes
o Experiment with hyperparameters within algorithm class
o Ensemble
o Validate each enhancement by its improvement over the prior leading model
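The "mean and mode" baselines above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's code: the toy targets and the choice of RMSE and accuracy as metrics are assumptions made for the example, and a real competition would use its own metric.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def accuracy(y_true, y_pred):
    """Fraction of exact matches; higher is better."""
    return mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))


def mean_baseline(y_train, n):
    """Predict the training mean for each of n rows."""
    return [mean(y_train)] * n


def mode_baseline(y_train, n):
    """Predict the most common training label for each of n rows."""
    return [Counter(y_train).most_common(1)[0][0]] * n


# Hypothetical toy data: one numeric target and one class target.
y_train_num, y_valid_num = [1.0, 2.0, 3.0, 6.0], [2.0, 4.0]
y_train_cls, y_valid_cls = ["a", "a", "b", "a"], ["a", "b", "a", "a"]

baseline_rmse = rmse(y_valid_num, mean_baseline(y_train_num, len(y_valid_num)))
baseline_acc = accuracy(y_valid_cls, mode_baseline(y_train_cls, len(y_valid_cls)))

print(f"mean baseline RMSE: {baseline_rmse:.4f}")
print(f"mode baseline accuracy: {baseline_acc:.2f}")
```

Any later model should beat these numbers on the same validation rows. If the locally computed baseline score disagrees with what the same constant submission scores on the leaderboard, the internal loss calculation is the first thing to check.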
6. H2O.ai
Iterative Workflow: Benefits
• Allows the data to guide which modeling approach fits best
o Availability and quality of data may not support complex modeling ideas
• Catch mistakes or incorrect assumptions early and clearly
o If you observe no improvement after adding what you considered to be a
vital feature, you know to immediately check the accuracy of the
calculations and/or question how the model already captured that
information
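The "no improvement" check can be illustrated by scoring the same simple model with and without a candidate feature. This is a sketch under stated assumptions: the group-mean model, the column names (`store`, `region`, `promo`, `sales`) and the toy rows are all hypothetical, chosen so that `region` is fully determined by `store`.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def fit_group_means(rows, target, features):
    """A deliberately simple model: mean target per feature combination."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in features)].append(row[target])
    overall = mean(row[target] for row in rows)
    table = {key: mean(vals) for key, vals in groups.items()}
    # Unseen combinations fall back to the overall training mean.
    return lambda row: table.get(tuple(row[f] for f in features), overall)


def validation_rmse(train, valid, target, features):
    """Fit on train, then score RMSE on valid for the given feature set."""
    predict = fit_group_means(train, target, features)
    return sqrt(mean((row[target] - predict(row)) ** 2 for row in valid))


# Hypothetical data: region is fully determined by store.
train = [
    {"store": "A", "region": "N", "promo": 0, "sales": 10.0},
    {"store": "A", "region": "N", "promo": 1, "sales": 20.0},
    {"store": "B", "region": "S", "promo": 0, "sales": 30.0},
    {"store": "B", "region": "S", "promo": 1, "sales": 40.0},
]
valid = [
    {"store": "A", "region": "N", "promo": 1, "sales": 22.0},
    {"store": "B", "region": "S", "promo": 0, "sales": 28.0},
]

leader = validation_rmse(train, valid, "sales", ["store"])
with_region = validation_rmse(train, valid, "sales", ["store", "region"])
with_promo = validation_rmse(train, valid, "sales", ["store", "promo"])

print(f"store only:     {leader:.2f}")
print(f"store + region: {with_region:.2f}")
print(f"store + promo:  {with_promo:.2f}")
```

Here adding `region` leaves the score unchanged because `store` already carries that information, while `promo` lowers the error. An unchanged score after adding a feature expected to matter points either to this kind of redundancy or to a bug in how the feature was computed.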
7. H2O.ai
Framing the Problem
• The data must be made machine-learning ready
o 1 training file
o 1 row per target
o Features do not require additional methodology (unlike, e.g., text or images)
• Many Kaggle competitions arrive “ML-ready”
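As an illustration of "1 row per target", the sketch below collapses a transaction log into a single training table with one row per labeled customer. The data, the field names (`customer_id`, `amount`) and the aggregate features are hypothetical, picked only to show the reshaping step.

```python
from collections import defaultdict

# Hypothetical raw inputs: many transactions per customer, one label each.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 12.5},
]
labels = {1: 0, 2: 1, 3: 0}  # customer_id -> target


def build_training_rows(transactions, labels):
    """Aggregate transactions so each labeled customer yields exactly one row."""
    amounts = defaultdict(list)
    for t in transactions:
        amounts[t["customer_id"]].append(t["amount"])

    rows = []
    for customer_id, target in sorted(labels.items()):
        spent = amounts.get(customer_id, [])  # customers with no orders get []
        rows.append({
            "customer_id": customer_id,
            "n_orders": len(spent),
            "total_amount": sum(spent),
            "max_amount": max(spent, default=0.0),
            "target": target,
        })
    return rows


training_rows = build_training_rows(transactions, labels)
for row in training_rows:
    print(row)
```

Driving the loop from the labels rather than from the transactions keeps customers with no activity in the table, so the row count always matches the number of targets.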
8. H2O.ai
Framing the Problem, 2
• My favorite competitions are those that are not ML-ready
o They focus more heavily on solving the data problem
o More like solving a puzzle instead of tuning hyperparameters
10. H2O.ai
Learning from Kaggle
• Sharing during competition
o Kaggle Scripts
o Discussions on the forums
• Shared after the competition
o Most often, several of the top-ranking competitors share their
methodology
o Often a summary post, occasionally GitHub code
o I find this the most valuable component of learning data science