2. Jeong-Yoon Lee, Ph.D.
Chief Data Scientist, Conversion Logic
70+ Competitions
6-Time Prize Winner (KDD Cup 2012 & 2015)
8 Top-10 Finishes (Deloitte, AARP, Liberty Mutual)
Top 10, Kaggle 2015
Father of 4 boys
24. No EDA?
Most competitions provide actual labels - typical EDA
Anonymized data - more creative EDA
• People decode age, states, time intervals, income, etc.
31. Feature Engineering
Numerical: Log, Log2(1 + x), Box-Cox, Normalization, Binning
Categorical: One-hot-encoding, Label-encoding, Count, Weight-of-Evidence
Text: Bag-of-Words, TF-IDF, N-gram, Character-n-gram, K-skip-n-gram
Timeseries/Sensor data: Descriptive Statistics, Derivatives, FFT, MFCC, ERP
Network Graph: Degree, Closeness, Betweenness, PageRank
Numerical/Timeseries: Convert to categorical features using RF/GBM
Dimensionality Reduction: PCA, SVD, Autoencoder, Hashing Trick
Interaction: Addition/subtraction/multiplication/division, Hashing Trick
* For a more comprehensive overview of feature engineering, see HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
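A minimal sketch of three of the transformations above (log for numerical, one-hot and count encoding for categorical), using pandas on made-up toy data; the column names are illustrative, not from any competition:

```python
import numpy as np
import pandas as pd

# Toy data; column names are made up for illustration.
df = pd.DataFrame({
    "price": [10.0, 100.0, 1000.0, 55.0],
    "city": ["NY", "LA", "NY", "SF"],
})

# Numerical: log1p compresses a heavy right tail.
df["price_log"] = np.log1p(df["price"])

# Categorical: one-hot encoding, one indicator column per level.
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

# Categorical: count (frequency) encoding.
df["city_count"] = df["city"].map(df["city"].value_counts())
```

Count encoding is popular in competitions because it maps high-cardinality categoricals to a single numeric column that tree models split on easily.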
32. Diverse Algorithms
Gradient Boosting Machine (XGBoost, LightGBM): The most popular algorithm in competitions
Random Forests (Scikit-Learn, randomForest): Used to be popular before GBM
Extremely Randomized Trees (Scikit-Learn)
Neural Networks/Deep Learning (Keras, MXNet, Torch, CNTK): Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression (Scikit-Learn, Vowpal Wabbit): Fastest. Good for ensembles.
Support Vector Machine (Scikit-Learn)
FTRL (Vowpal Wabbit): Competitive solution for CTR estimation competitions
Factorization Machine (libFM, fastFM): Winning solution for KDD Cup 2012
Field-aware Factorization Machine (libFFM): Winning solution for CTR estimation competitions (Criteo, Avazu)
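A minimal GBM baseline, sketched with scikit-learn's `GradientBoostingClassifier` on synthetic data (XGBoost and LightGBM, named above, expose the same fit/predict interface with faster training and more tuning knobs):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data as a stand-in for competition data.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# A small GBM with common starting hyperparameters.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)

# 5-fold cross-validated AUC, the usual way to compare models locally.
scores = cross_val_score(gbm, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Swapping in `xgboost.XGBClassifier` or `lightgbm.LGBMClassifier` leaves the rest of this snippet unchanged.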
33. Cross Validation
The training data are split into five folds so that both the fold size and the class ratio (here, the dropout rate) are preserved (stratified k-fold).
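A short sketch of that split using scikit-learn's `StratifiedKFold`, on toy labels with an assumed ~30% positive ("dropout") rate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 30% positive ("dropout") rate.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rates = []
for train_idx, valid_idx in skf.split(X, y):
    # Each validation fold keeps roughly the overall positive rate.
    rates.append(y[valid_idx].mean())
print(rates)
```

With a plain `KFold`, a rare positive class can end up unevenly spread across folds, which makes local validation scores noisy; stratification avoids that.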
35. Ensemble - Stacking
* For other types of ensembles, see http://mlwave.com/kaggle-ensembling-guide/
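A minimal stacking sketch (not the exact pipeline from the talk): out-of-fold predictions from two diverse base models become the features for a simple level-2 meta-learner.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Level 1: out-of-fold probabilities from diverse base models, so the
# meta-learner never sees predictions made on a model's own training rows.
base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               GradientBoostingClassifier(random_state=0)]
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 2: a simple meta-learner stacked on the out-of-fold predictions.
meta = LogisticRegression()
meta.fit(oof, y)
auc = roc_auc_score(y, meta.predict_proba(oof)[:, 1])
print(auc)
```

Using out-of-fold (rather than in-sample) predictions as meta-features is what keeps the second level from simply memorizing the base models' overfitting.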
41. Best Practices
For fun
For experiences
For learning
For networking
41
Feature Engineering
Diverse Algorithms
Cross Validation
Ensemble
Collaboration
Why Competition
42. Things That Help
Keep competition journals and repos - both during and after competitions
Build and improve an automated pipeline and library for competitions
• https://github.com/jeongyoonlee/Kaggler
• https://gitlab.com/jeongyoonlee/allstate-claims-severity/tree/master
• http://kaggler.com/kagglers-toolbox-setup/
Be humble, and be ready to try and learn something new
Make a commitment and work on competitions regularly, no matter what
43. Resources
No Free Hunch by Kaggle
Winning Tips on Machine Learning Competitions by Marios Michailidis (KazAnova)
Feature Engineering, mlwave.com by HJ van Veen (Triskelion)
fastml.com by Zygmunt Zając (Foxtrot)
kaggler.com, facebook.com/Kaggler by Jeong-Yoon Lee @ CL and Hang Li @ Hulu
Tianqi Chen @ UW - won KDD Cup 2012 and DSB 2015; author of XGBoost, MXNet
Gilberto Titericz Junior in San Francisco - #1 on Kaggle