Practical Machine Learning in Python

https://us.pycon.org/2012/schedule/presentation/119/

  1. Practical Machine Learning in Python
     Matt Spitz via @mattspitz
  2. This is the Age of Aquarius Data
     • Data is plentiful
       • application logs
       • external APIs
         • Facebook, Twitter
       • public datasets
     • Analysis adds value
       • understanding your users
       • dynamic application decisions
     • Storage / CPU time is cheap
  3. Machine Learning in Python
     • Python is well-suited for data analysis
     • Versatile
       • quick and dirty scripts
       • full-featured, realtime applications
     • Mature ML packages
       • tons of choices (see: mloss.org)
       • plug-and-play or DIY
  4. Classification Problem: Terminology
     • Data points
       • feature set: “interesting” facts about an event/thing
       • label: a description of that event/thing
     • Classification
       • training set: a bunch of labeled feature sets
       • given a training set, build a classifier to predict labels for unlabeled feature sets
  5. SluggerML
     • Two questions
       • What features are strong predictors for home runs and strikeouts?
       • Given a particular situation, with what probability will the batter hit a home run or strike out?
     • Feature sets represent game state for a plate appearance
       • game: day vs. night, wind direction...
       • at-bat: inning, #strikes, left-right matchup...
       • batter/pitcher: age, weight, fielding position...
     • Labels represent outcome
       • HR (home run), K (strikeout), OTHER
     • Poor Man’s Sabermetrics
  6. SluggerML: Example
     • Training set
       • {game_daynight: day, batter_age: 24, pitcher_weight: 211}
         • label: HR
       • {game_daynight: day, batter_age: 36, pitcher_weight: 242}
         • label: K
       • {game_daynight: night, batter_age: 27, pitcher_weight: 195}
         • label: OTHER
     • Classifier predictions
       • {game_daynight: night, batter_age: 36, pitcher_weight: 225}
         • 2.6% HR, 15.6% K
       • {game_daynight: day, batter_age: 20, pitcher_weight: 216}
         • 2.2% HR, 19.1% K
  7. SluggerML: Gathering Data
     • Sources
       • Retrosheet
         • play-by-play logs for every game since 1956
       • Sean Lahman’s Baseball Archive
         • detailed stats about individual players
     • Coalescing
       • 1st pass, Lahman: create player database
         • shelve module
       • 2nd pass, Retrosheet: track game state, join on player db
     • Scrubbing
       • ensure consistency
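
The two-pass coalescing step on this slide can be sketched roughly as follows. This is a minimal illustration, not the SluggerML code: the record layouts, field names (`player_id`, `birth_year`, `weight`) and the helpers `build_player_db` and `join_events` are all hypothetical, and the real Retrosheet and Lahman formats are considerably richer. Only the use of the standard-library `shelve` module comes from the slide.

```python
import os
import shelve
import tempfile

# Hypothetical player records, standing in for rows from the Lahman archive.
PLAYERS = [
    {"player_id": "bat01", "birth_year": 1985, "weight": 205},
    {"player_id": "pit01", "birth_year": 1979, "weight": 230},
]

# Hypothetical play-by-play events, standing in for Retrosheet log lines.
EVENTS = [
    {"year": 2009, "batter": "bat01", "pitcher": "pit01", "outcome": "HR"},
    {"year": 2011, "batter": "bat01", "pitcher": "pit01", "outcome": "K"},
]


def build_player_db(path, players):
    """First pass: write one shelf entry per player, keyed by player id."""
    with shelve.open(path) as db:
        for player in players:
            db[player["player_id"]] = player


def join_events(path, events):
    """Second pass: yield (feature set, label) pairs joined on the player db."""
    with shelve.open(path) as db:
        for event in events:
            batter = db[event["batter"]]
            pitcher = db[event["pitcher"]]
            features = {
                "batter_age": event["year"] - batter["birth_year"],
                "pitcher_weight": pitcher["weight"],
            }
            yield features, event["outcome"]


with tempfile.TemporaryDirectory() as tmpdir:
    db_path = os.path.join(tmpdir, "players")
    build_player_db(db_path, PLAYERS)
    training_set = list(join_events(db_path, EVENTS))

for features, label in training_set:
    print(features, label)
```

Keeping the player table in a shelf rather than a dict means the second pass can look players up by id without holding the whole archive in memory.
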
  8. SluggerML: Gathering Data
     • Training set
       • regular-season games from 1980-2011
       • 5,669,301 plate appearances
       • 135,602 home runs
       • 871,226 strikeouts
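
As a quick sanity check, the counts on this slide give the base rates for each label. The short calculation below uses only the three figures quoted above; the variable names are illustrative.

```python
# Counts quoted on the slide (regular-season games, 1980-2011).
plate_appearances = 5_669_301
home_runs = 135_602
strikeouts = 871_226

p_hr = home_runs / plate_appearances
p_k = strikeouts / plate_appearances
p_other = 1 - p_hr - p_k  # every remaining plate appearance is labeled OTHER

print(f"P(HR)    = {p_hr:.2%}")
print(f"P(K)     = {p_k:.2%}")
print(f"P(OTHER) = {p_other:.2%}")
```

The labels are heavily imbalanced: roughly 2.4% of plate appearances end in a home run and roughly 15.4% in a strikeout, so a classifier's predicted probabilities are best read relative to these base rates.
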
  9. Selecting a Toolkit: Tradeoffs
     • Speed
       • offline vs. realtime
     • Transparency
       • internal visibility
       • customizability
     • Support
       • maturity
       • community
  10. Selecting a Toolkit: High-Level Options
      • External bindings
        • Python interfaces to popular packages
        • Matlab, R, Octave, SHOGUN Toolbox
        • transition legacy workflows
      • Python implementations
        • collections of algorithms
        • (mostly) Python
        • external subcomponents
      • DIY
        • building blocks
  11. Selecting a Toolkit: Python Implementations
      • nltk
        • focus on NLP
        • book: Natural Language Processing with Python (O’Reilly ’09)
      • mlpy
        • regression, classification, clustering
      • PyML
        • focus on SVM
      • PyBrain
        • focus on neural networks
  12. Selecting a Toolkit: Python Implementations
      • mdp-toolkit
        • data processing management
        • nodes represent tasks in a data workflow
        • scheduling, parallelization
      • scikit-learn
        • supervised, unsupervised, feature selection, visualization
        • heavy development, large team
        • excellent documentation
        • active community
  13. Selecting a Toolkit: Do It Yourself
      • Basic building blocks
        • NumPy
        • SciPy
      • C/C++ implementations
        • LIBLINEAR
        • LIBSVM
        • OpenCV
        • ...your own?
  14. SluggerML: Two Questions
      • What features are strong predictors for home runs and strikeouts?
      • Given a particular situation, with what probability will the batter hit a home run or strike out?
  15. SluggerML: Feature Selection
      • Identifies predictive features
        • strongly correlated with labels
        • predictive: max_benchpress
        • not predictive: favorite_cookie
      • scikit-learn: chi-square feature selection
      • Visualizing significance
        • for each well-supported value, find correlation with HR/K
        • “well-supported”: >= 0.05% of samples with feature=value
        • correlation: ( P(HR | feature=value) / P(HR) ) - 1
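
The "visualizing significance" measure on this slide can be computed with a single counting pass. The sketch below is an assumption-laden illustration: the function name `lift_by_value`, the `(features, label)` sample layout, and the toy data are all hypothetical. It covers only the correlation formula and the 0.05% support threshold; the chi-square selection done with scikit-learn is not shown.

```python
from collections import Counter

MIN_SUPPORT = 0.0005  # "well-supported": >= 0.05% of samples have feature=value


def lift_by_value(samples, feature, label, min_support=MIN_SUPPORT):
    """Return {value: P(label | feature=value) / P(label) - 1}."""
    total = len(samples)
    value_counts = Counter()  # samples with feature=value
    joint_counts = Counter()  # samples with feature=value AND the given label
    label_count = 0
    for features, outcome in samples:
        value = features[feature]
        value_counts[value] += 1
        if outcome == label:
            joint_counts[value] += 1
            label_count += 1
    if label_count == 0:
        return {}  # the label never occurs, so the ratio is undefined
    p_label = label_count / total
    result = {}
    for value, count in value_counts.items():
        if count / total < min_support:
            continue  # too few samples to trust the estimate
        result[value] = (joint_counts[value] / count) / p_label - 1
    return result


# Toy data (made up): 10 day and 10 night plate appearances.
samples = (
    [({"game_daynight": "day"}, "HR")] * 3
    + [({"game_daynight": "day"}, "OTHER")] * 7
    + [({"game_daynight": "night"}, "HR")] * 1
    + [({"game_daynight": "night"}, "OTHER")] * 9
)
lifts = lift_by_value(samples, "game_daynight", "HR")
for value, lift in sorted(lifts.items()):
    print(f"{value}: {lift:+.1%}")
```

A value of 0 means the feature value leaves the home-run rate unchanged; positive values mean home runs are more likely than the base rate, negative values less likely.
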
  16. SluggerML: Feature Selection
      [Chart] Batter: Home vs. Visiting. Correlation with Home Run and Strikeout (y-axis: -50.0% to 50.0%) for: home team, visiting team
  17. SluggerML: Feature Selection
      [Chart] Batter: Fielding Position. Correlation with Home Run and Strikeout (y-axis: -50.0% to 50.0%) for: P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH
  18. SluggerML: Feature Selection
      [Chart] Game: Temperature (˚F). Correlation with Home Run and Strikeout (y-axis: -50.0% to 50.0%) for 5-degree bins from 35-39 to 100-104
  19. SluggerML: Feature Selection
      [Chart] Game: Year. Correlation with Home Run and Strikeout (y-axis: -50.0% to 50.0%) for: 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011
  20. SluggerML: Realtime Classification
      • Given features, predict label probabilities
      • nltk: NaiveBayesClassifier
      • Web frontend
        • gunicorn, nginx
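
To make "given features, predict label probabilities" concrete, here is a from-scratch categorical naive Bayes with add-one smoothing. It is a stand-in written for illustration, not nltk's NaiveBayesClassifier and not the SluggerML code: the class name `TinyNaiveBayes`, the `pitcher_hand` feature, and the training data are all hypothetical.

```python
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Categorical naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, training_set):
        self.label_counts = Counter()
        # (label, feature name) -> Counter of values observed with that label
        self.value_counts = defaultdict(Counter)
        # feature name -> set of all values seen in training
        self.vocab = defaultdict(set)
        for features, label in training_set:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.vocab[name].add(value)

    def prob_classify(self, features):
        """Return {label: P(label | features)}, assuming independent features."""
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            score = count / total  # prior P(label)
            for name, value in features.items():
                if name not in self.vocab:
                    continue  # feature never seen in training; ignore it
                seen = self.value_counts[(label, name)][value]
                # Smoothed estimate of P(feature=value | label)
                score *= (seen + 1) / (count + len(self.vocab[name]))
            scores[label] = score
        norm = sum(scores.values())
        return {label: score / norm for label, score in scores.items()}


# Made-up training set in the (feature set, label) shape used in the talk.
training_set = [
    ({"game_daynight": "day", "pitcher_hand": "R"}, "HR"),
    ({"game_daynight": "day", "pitcher_hand": "L"}, "K"),
    ({"game_daynight": "night", "pitcher_hand": "L"}, "K"),
    ({"game_daynight": "night", "pitcher_hand": "R"}, "OTHER"),
    ({"game_daynight": "day", "pitcher_hand": "R"}, "OTHER"),
    ({"game_daynight": "night", "pitcher_hand": "L"}, "OTHER"),
]

classifier = TinyNaiveBayes(training_set)
probs = classifier.prob_classify({"game_daynight": "night", "pitcher_hand": "L"})
for label, p in sorted(probs.items()):
    print(f"{label}: {p:.1%}")
```

With millions of samples and many features, a real implementation would sum log-probabilities instead of multiplying raw probabilities, to avoid floating-point underflow.
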
  21. Tips and Tricks
      • Persistent classifier internals
        • once trained, save and reuse
        • depends on implementation
        • string representation may exist
        • create your own
      • Using generators where possible
        • avoid keeping data in memory
        • single-pass algorithms
        • conversion pass before training
      • Multicore text processing
        • scrubbing: low memory footprint
        • multiprocessing module
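
Two of these tips, streaming with generators and persisting trained state, can be sketched together. Everything specific here is an assumption: the comma-separated log format, the helper names `read_samples` and `train`, and the choice of `pickle` for persistence (the slide only says that persistence depends on the implementation). The counting "model" is a placeholder for a real trainer, and the multiprocessing tip is not shown.

```python
import io
import pickle
from collections import Counter

# Hypothetical scrubbed log, one "daynight,label" record per line.
RAW_LOG = io.StringIO("day,HR\nnight,K\n\nday,OTHER\nnight,K\n")


def read_samples(lines):
    """Yield (features, label) pairs one at a time; skip blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        daynight, label = line.split(",")
        yield {"game_daynight": daynight}, label


def train(samples):
    """Single pass over the stream: count labels per feature value."""
    model = Counter()
    for features, label in samples:
        model[(features["game_daynight"], label)] += 1
    return model


# The generator feeds the trainer directly; no list of samples is ever built.
model = train(read_samples(RAW_LOG))

# Persist the trained state and load it back instead of retraining.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print(restored)
```

In a long-running service the bytes would typically be written to a file once after training and loaded at startup, so requests never pay the training cost.
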
  22. The Fine Print™
      • Plug-and-play is easy!
      • Don’t blindly apply ML
        • understand your data
        • understand your algorithms
        • ml-class.org is an excellent resource
  23. Thanks!
      github.com/mattspitz/sluggerml
      slideshare.net/mattspitz/practical-machine-learning-in-python
      @mattspitz
