
Feature Engineering Hands-On by Dmitry Larko


Dmitry Larko presented these slides on July 17, 2019, at the Los Angeles Artificial Intelligence & Deep Learning meetup.



  1. Feature Engineering Hands-on by Dmitry Larko, Sr. Data Scientist @ H2O.ai
  2. Feature Engineering. "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." ~ Andrew Ng. "… some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." ~ Pedro Domingos. "Good data preparation and feature engineering is integral to better prediction." ~ Marios Michailidis (KazAnova), Kaggle Grandmaster, Kaggle #2, former #1. "You have to turn your inputs into things the algorithm can understand." ~ Shayne Miel, answer to "What is the intuitive explanation of feature engineering in machine learning?"
  3. Your data: a target column plus categorical and numerical features (diagram of a dataset split into target, categorical, and numerical columns).
  4. Follow this link: https://www.kaggle.com/c/amazon-employee-access-challenge/kernels (short link: http://bit.ly/SoCal-h2o). We are going to use these 4 kernels.
  5. Categorical features, missing values. Options: encode missing values as a unique category ("Unknown", "Missing", …), or impute with the most frequent category level (the mode). Example: a feature with values A, A, NaN, A, B, B, NaN, C, C becomes A, A, MISSING, A, B, B, MISSING, C, C with the first option and A, A, A, A, B, B, A, C, C with the second.
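A minimal pandas sketch of both options; the column name f1 and the toy values are illustrative, not from the original code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"f1": ["A", "A", np.nan, "A", "B", "B", np.nan, "C", "C"]})

# Option 1: treat missing values as their own category
df["f1_missing_cat"] = df["f1"].fillna("MISSING")

# Option 2: impute with the most frequent level (the mode), here "A"
df["f1_mode"] = df["f1"].fillna(df["f1"].mode()[0])
```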
  6. Categorical Encoding. Turn categorical features into numeric features to provide more fine-grained information; help explicitly capture non-linear relationships and interactions between the values of features; most machine learning tools only accept numbers as their input (xgboost, gbm, glmnet, libsvm, liblinear, etc.).
  7. Categorical Encoding: Label Encoding. Randomly assign an integer from 1 to N to each level; OK for tree-based methods. Example: levels A, B, C can be encoded as 0, 1, 2 in one assignment and as 0, 2, 1 in another.
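A minimal sketch of label encoding on the slide's toy values; note that scikit-learn's LabelEncoder assigns integers in sorted order of the levels rather than randomly, and pandas.factorize assigns them in order of first appearance:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["A", "A", "A", "A", "B", "B", "B", "C", "C"])

# scikit-learn: integers assigned in sorted order of the levels (A->0, B->1, C->2)
le = LabelEncoder()
encoded_sklearn = le.fit_transform(s)

# pandas: integers assigned in order of first appearance
encoded_pandas, uniques = pd.factorize(s)
```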
  8. Categorical Encoding: One Hot Encoding. Transform categories into individual binary (0 or 1) features; in Python scikit-learn: DictVectorizer, OneHotEncoder; OK for k-means, linear models, NNs, etc. Example: A becomes (1, 0, 0), B becomes (0, 1, 0), and C becomes (0, 0, 1) in the columns Feature=A, Feature=B, Feature=C.
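A minimal sketch using the two common Python routes mentioned on the slide, pandas.get_dummies and scikit-learn's OneHotEncoder (the latter can ignore levels unseen during fitting); data and column names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"f1": ["A", "A", "B", "C", "A"]})

# pandas one-liner: one binary column per level (f1_A, f1_B, f1_C)
dummies = pd.get_dummies(df["f1"], prefix="f1")

# scikit-learn equivalent, usable inside a Pipeline / ColumnTransformer
ohe = OneHotEncoder(handle_unknown="ignore")
onehot = ohe.fit_transform(df[["f1"]]).toarray()
```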
  9. Categorical Encoding: SVD encoding. For a pair of categorical features, build the co-occurrence count matrix of Feature1 levels against Feature2 levels, reweight it (IDF-style), and factorize it with truncated SVD; each level of Feature1 is then represented by its row in the resulting low-dimensional embedding (the slide walks through the Count, IDF, and SVD steps on a small two-feature example).
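A minimal sketch of that idea as read from the slide (co-occurrence counts of Feature1 vs Feature2 compressed with truncated SVD); the IDF-style reweighting step shown on the slide is omitted here for brevity, and the column names are illustrative:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame({
    "f1": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "f2": ["C", "D", "C", "C", "C", "D", "D", "D", "C", "D"],
})

# Co-occurrence (count) matrix: rows = levels of f1, columns = levels of f2
counts = pd.crosstab(df["f1"], df["f2"])

# Compress each row of the count matrix into a small numeric embedding
svd = TruncatedSVD(n_components=1, random_state=0)
embedding = pd.DataFrame(svd.fit_transform(counts), index=counts.index)

# Map each level's embedding back onto the original rows
df["f1_svd"] = df["f1"].map(embedding[0])
```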
  10. Categorical Encoding: Frequency Encoding. Encode the categorical levels of a feature as values between 0 and 1 based on their relative frequency. Example: A appears 4 times out of 9 (0.44), B 3 out of 9 (0.33), and C 2 out of 9 (0.22).
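A short pandas sketch of frequency encoding on the slide's toy data:

```python
import pandas as pd

s = pd.Series(["A", "A", "A", "A", "B", "B", "B", "C", "C"])

# Relative frequency of each level (A: 0.44, B: 0.33, C: 0.22), mapped back onto the rows
encoded = s.map(s.value_counts(normalize=True))
```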
  11. Categorical Encoding: Target mean encoding. Instead of dummy-encoding categorical variables and increasing the number of features, we can encode each level as the mean of the response. Example: A has 3 positive outcomes out of 4 (0.75), B 2 out of 3 (0.66), and C 2 out of 2 (1.00).
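A minimal sketch of this naive version on the slide's toy data; note it uses the full data, including each row's own target, which leaks target information and motivates the smoothing and noise slides that follow:

```python
import pandas as pd

df = pd.DataFrame({
    "f1":      ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "outcome": [1, 0, 1, 1, 1, 1, 0, 1, 1],
})

# Mean of the target per level (A: 0.75, B: 0.67, C: 1.00), mapped back onto the rows
level_means = df.groupby("f1")["outcome"].mean()
df["f1_mean_enc"] = df["f1"].map(level_means)
```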
  12. Categorical Encoding: Target mean encoding, smoothing. It is better to calculate a weighted average of the level mean and the overall mean of the training set: f(n) * mean(level) + (1 - f(n)) * mean(dataset), where f(n) is a monotonic weight function of the level frequency n, mean(level) is the mean response for the category level, and mean(dataset) is the mean response over the whole dataset. The weights are based on the frequency of the levels, i.e. if a category appears only a few times in the dataset then its encoded value will be close to the overall mean instead of the mean of that level.
  13. Categorical Encoding: Target mean encoding, an example of f(n). A sigmoid weight λ = 1 / (1 + exp(-(x - k) / f)), where x is the level frequency, k the inflection point, and f the steepness. https://www.researchgate.net/publication/220520258_A_Preprocessing_Scheme_for_High-Cardinality_Categorical_Attributes_in_Classification_and_Prediction_Problems
  14. Categorical Encoding: Target mean encoding, smoothing. Worked example on the data above (dataset mean = 7/9 ≈ 0.77). With λ = 1 / (1 + exp(-(x - 2) / 0.25)): A (x = 4, level mean 0.75): λ = 0.99, 0.99*0.75 + 0.01*0.77 = 0.7502; B (x = 3, level mean 0.66): λ = 0.98, 0.98*0.66 + 0.02*0.77 = 0.6622; C (x = 2, level mean 1.00): λ = 0.5, 0.5*1.00 + 0.5*0.77 = 0.885. With λ = 1 / (1 + exp(-(x - 3) / 0.25)): A: λ = 0.98, 0.98*0.75 + 0.02*0.77 = 0.7504; B: λ = 0.5, 0.5*0.66 + 0.5*0.77 = 0.715; C: λ = 0.017, 0.017*1.00 + 0.983*0.77 = 0.773.
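A minimal sketch of the smoothed encoding, using the sigmoid weight from the slide with k = 2 and f = 0.25; the data frame and column names are illustrative, and the computed λ values match the slide's rounded figures only approximately:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "f1":      ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "outcome": [1, 0, 1, 1, 1, 1, 0, 1, 1],
})

prior = df["outcome"].mean()                        # dataset mean, ~0.77
stats = df.groupby("f1")["outcome"].agg(["mean", "count"])

k, f = 2, 0.25                                      # inflection point and steepness
lam = 1 / (1 + np.exp(-(stats["count"] - k) / f))   # weight per level, in (0, 1)

# Weighted average of the level mean and the dataset mean
smoothed = lam * stats["mean"] + (1 - lam) * prior
df["f1_smooth_enc"] = df["f1"].map(smoothed)
```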
  15. Categorical Encoding: Target mean encoding, adding noise. Using 2-fold CV: split the training data into two folds; rows in fold 1 are encoded with target means computed on fold 2 and vice versa, and a level unseen in the other fold falls back to the dataset mean (0.77 in the example).
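A minimal sketch of out-of-fold target encoding with 2-fold CV, assuming a standard scikit-learn KFold split (fold assignment and therefore the exact encoded values depend on the random split); levels absent from the fitting fold fall back to the dataset mean:

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "f1":      ["B", "A", "B", "A", "A", "B", "C", "C", "A"],
    "outcome": [1, 1, 1, 0, 1, 0, 1, 1, 1],
})

prior = df["outcome"].mean()
df["f1_cv_enc"] = prior  # default for levels unseen in the other fold

kf = KFold(n_splits=2, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Means are computed on one fold and applied only to rows of the other fold
    fold_means = df.iloc[fit_idx].groupby("f1")["outcome"].mean()
    df.loc[df.index[enc_idx], "f1_cv_enc"] = (
        df.iloc[enc_idx]["f1"].map(fold_means).fillna(prior).values
    )
```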
  16. Categorical Encoding: Target mean encoding, adding noise. Another option is the expanding mean approach, applied to the same toy data (plain per-level means: A 0.75 from 3 out of 4, B 0.66 from 2 out of 3, C 1.00 from 2 out of 2).
  17. Categorical Encoding: Target mean encoding, adding noise. Expanding mean, first step: shuffle the data. Original order: A 1, A 0, A 1, A 1, B 1, B 1, B 0, C 1, C 1; after shuffling: B 1, A 1, B 1, A 0, A 1, B 0, C 1, C 1, A 1.
  18. Categorical Encoding: Target mean encoding, adding noise. Expanding mean: walk down the shuffled rows and encode each row with the mean outcome of the earlier rows of the same level, falling back to the dataset mean (0.77) when the level has not been seen yet (slides 18 through 26 build this column one row at a time). Final result: B 1 → 0.77, A 1 → 0.77, B 1 → 1.00, A 0 → 1.00, A 1 → 0.50, B 0 → 1.00, C 1 → 0.77, C 1 → 1.00, A 1 → 0.66.
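A minimal pandas sketch of the expanding-mean encoding on the shuffled toy data; the cumsum/cumcount trick excludes the current row, so each row only sees the rows above it, and the values reproduce the slide's column:

```python
import pandas as pd

df = pd.DataFrame({
    "f1":      ["B", "A", "B", "A", "A", "B", "C", "C", "A"],  # already shuffled
    "outcome": [1, 1, 1, 0, 1, 0, 1, 1, 1],
})

prior = df["outcome"].mean()  # dataset mean, ~0.77

grp = df.groupby("f1")["outcome"]
prev_sum = grp.cumsum() - df["outcome"]   # sum of earlier outcomes for the same level
prev_count = grp.cumcount()               # number of earlier rows for the same level

# The first occurrence of a level gives 0/0 = NaN, which falls back to the dataset mean
df["f1_exp_enc"] = (prev_sum / prev_count).fillna(prior)
```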
  27. Must read: Feature Engineering and Selection: A Practical Approach for Predictive Models, www.feat.engineering; O'Reilly's Feature Engineering for Machine Learning, http://shop.oreilly.com/product/0636920049081.do; a GitHub repo with 14 (!) different categorical encoding techniques, https://github.com/scikit-learn-contrib/categorical-encoding; Mean (likelihood) encoding for categorical variables with high cardinality and feature interactions, a comprehensive study with Python, https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study; a nice ensembling techniques overview, https://mlwave.com/kaggle-ensembling-guide/.
  28. Thank you! Q&A
