Practical Machine Learning in Python
Matt Spitz, via @mattspitz

This is the Age of Aquarius Data
• Data is plentiful
  • application logs
  • external APIs
    • Facebook, Twitter
  • public datasets
• Analysis adds value
  • understanding your users
  • dynamic application decisions
• Storage / CPU time is cheap

Machine Learning in Python
• Python is well-suited for data analysis
• Versatile
  • quick and dirty scripts
  • full-featured, realtime applications
• Mature ML packages
  • tons of choices (see: mloss.org)
  • plug-and-play or DIY

Classification Problem: Terminology
• Data points
  • feature set: “interesting” facts about an event/thing
  • label: a description of that event/thing
• Classification
  • training set: a bunch of labeled feature sets
  • given a training set, build a classifier to predict labels for
    unlabeled feature sets

SluggerML
• Two questions
   • What features are strong predictors for home runs and strikeouts?
   • Given a particular situation, with what probability will the batter
     hit a home run or strike out?
• Feature sets represent game state for a plate appearance
   • game: day vs. night, wind direction...
   • at-bat: inning, #strikes, left-right matchup...
   • batter/pitcher: age, weight, fielding position...
• Labels represent outcome
   • HR (home run), K (strikeout), OTHER
• Poor Man’s Sabermetrics

SluggerML: Example
• Training set
  • {game_daynight: day, batter_age: 24, pitcher_weight: 211}
    • label: HR
  • {game_daynight: day, batter_age: 36, pitcher_weight: 242}
    • label: K
  • {game_daynight: night, batter_age: 27, pitcher_weight: 195}
    • label: OTHER
• Classifier predictions
  • {game_daynight: night, batter_age: 36, pitcher_weight: 225}
    • 2.6% HR, 15.6% K
  • {game_daynight: day, batter_age: 20, pitcher_weight: 216}
    • 2.2% HR, 19.1% K
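
As a rough sketch, the training set on this slide can be written as a list of (feature set, label) pairs, with each feature set as a plain dict. The values are copied from the slide; the variable names are illustrative and not taken from the SluggerML source.

```python
from collections import Counter

# Each training example pairs a feature set (a dict) with its label.
# Values come from the slide; the variable names are illustrative.
training_set = [
    ({"game_daynight": "day", "batter_age": 24, "pitcher_weight": 211}, "HR"),
    ({"game_daynight": "day", "batter_age": 36, "pitcher_weight": 242}, "K"),
    ({"game_daynight": "night", "batter_age": 27, "pitcher_weight": 195}, "OTHER"),
]

# An unlabeled feature set is what the classifier is asked about.
unlabeled = {"game_daynight": "night", "batter_age": 36, "pitcher_weight": 225}

# Label frequencies give the baseline any classifier has to beat.
label_counts = Counter(label for _, label in training_set)
print(label_counts)
```

Keeping feature sets as dicts makes it cheap to add or drop a feature without touching the rest of the pipeline.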

SluggerML: Gathering Data
• Sources
  • Retrosheet
    • play-by-play logs for every game since 1956
  • Sean Lahman’s Baseball Archive
    • detailed stats about individual players
• Coalescing
  • 1st pass, Lahman: create player database
    • shelve module
  • 2nd pass, Retrosheet: track game state, join on player db
• Scrubbing
  • ensure consistency
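
The two-pass coalescing step might look roughly like the sketch below, which uses the standard-library shelve module as the player database. The record fields (`player_id`, `birth_year`, `weight`, and so on) and the function name are hypothetical stand-ins; the real Lahman and Retrosheet files carry many more fields in their own formats.

```python
import os
import shelve
import tempfile

# Hypothetical rows standing in for Lahman player records and
# Retrosheet play-by-play events.
lahman_rows = [
    {"player_id": "batter01", "birth_year": 1975, "weight": 205},
    {"player_id": "pitcher01", "birth_year": 1980, "weight": 220},
]
retrosheet_events = [
    {"year": 2005, "batter": "batter01", "pitcher": "pitcher01", "outcome": "K"},
]

db_path = os.path.join(tempfile.mkdtemp(), "players")

# 1st pass: build an on-disk player database keyed by player id.
with shelve.open(db_path) as players:
    for row in lahman_rows:
        players[row["player_id"]] = row


# 2nd pass: walk the events and join each one against the player db.
def labeled_feature_sets(events, path):
    with shelve.open(path) as players:
        for event in events:
            batter = players[event["batter"]]
            pitcher = players[event["pitcher"]]
            features = {
                "batter_age": event["year"] - batter["birth_year"],
                "pitcher_weight": pitcher["weight"],
            }
            yield features, event["outcome"]


joined = list(labeled_feature_sets(retrosheet_events, db_path))
print(joined)
```

Because the player database lives on disk, the second pass can stream through the event logs without holding every player record in memory.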

SluggerML: Gathering Data
• Training set
  • regular-season games from 1980-2011
  • 5,669,301 plate appearances
     • 135,602 home runs
     • 871,226 strikeouts

Selecting a Toolkit: Tradeoffs
• Speed
  • offline vs. realtime
• Transparency
   • internal visibility
   • customizability
• Support
  • maturity
  • community

Selecting a Toolkit: High-Level Options
• External bindings
  • python interfaces to popular packages
  • Matlab, R, Octave, SHOGUN Toolbox
  • transition legacy workflows
• Python implementations
  • collections of algorithms
  • (mostly) python
  • external subcomponents
• DIY
  • building blocks

Selecting a Toolkit: Python Implementations
• nltk
  • focus on NLP
  • book: Natural Language Processing with Python (O’Reilly, 2009)
• mlpy
  • regression, classification, clustering
• PyML
  • focus on SVM
• PyBrain
  • focus on neural networks

Selecting a Toolkit: Python Implementations
• mdp-toolkit
  • data processing management
  • nodes represent tasks in a data workflow
  • scheduling, parallelization
• scikit-learn
  • supervised, unsupervised, feature selection, visualization
  • heavy development, large team
  • excellent documentation
  • active community

Selecting a Toolkit: Do It Yourself
• Basic building blocks
  • NumPy
  • SciPy
• C/C++ implementations
  • LIBLINEAR
  • LIBSVM
  • OpenCV
  • ...your own?

SluggerML: Two Questions
• What features are strong predictors for home runs
  and strikeouts?
• Given a particular situation, with what probability will
  the batter hit a home run or strike out?

SluggerML: Feature Selection
• Identifies predictive features
  • strongly correlated with labels
  • predictive: max_benchpress
  • not predictive: favorite_cookie
• scikit-learn: chi-square feature selection
• Visualizing significance
  • for each well-supported value, find correlation with HR/K
     • “well-supported”: >= 0.05% of samples with feature=value
     • correlation: ( P(HR | feature=value) / P(HR) ) - 1
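
A minimal sketch of the per-value measure defined on this slide, assuming the training data is a list of (feature set, label) pairs. The function name, the `min_support` parameter, and the toy data are illustrative. The quantity the slide calls correlation is computed here exactly as written, which amounts to a relative lift over the base rate.

```python
from collections import Counter


def value_correlations(samples, feature, label, min_support=0.0005):
    """Return {value: P(label | feature=value) / P(label) - 1}.

    Values seen in fewer than min_support (0.05%) of samples are skipped.
    """
    total = len(samples)
    value_counts = Counter()
    hit_counts = Counter()
    label_total = 0
    for features, sample_label in samples:
        value = features[feature]
        value_counts[value] += 1
        if sample_label == label:
            hit_counts[value] += 1
            label_total += 1
    if total == 0 or label_total == 0:
        return {}  # no base rate to compare against
    p_label = label_total / total
    result = {}
    for value, count in value_counts.items():
        if count / total < min_support:
            continue  # not well-supported
        result[value] = (hit_counts[value] / count) / p_label - 1
    return result


# Toy data, invented for illustration only.
samples = (
    [({"game_daynight": "day"}, "HR")] * 3
    + [({"game_daynight": "day"}, "OTHER")] * 7
    + [({"game_daynight": "night"}, "HR")] * 1
    + [({"game_daynight": "night"}, "OTHER")] * 9
)
correlations = value_correlations(samples, "game_daynight", "HR")
print(correlations)
```

In the toy data the overall HR rate is 20%, the day rate is 30% and the night rate is 10%, so the measure comes out to about +50% for day and -50% for night.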

SluggerML: Feature Selection
• [Chart] Batter: Home vs. Visiting
  • y-axis: correlation, -50.0% to 50.0%; series: Home Run, Strikeout
  • x-axis: home team, visiting team

SluggerML: Feature Selection
• [Chart] Batter: Fielding Position
  • y-axis: correlation, -50.0% to 50.0%; series: Home Run, Strikeout
  • x-axis: P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH

SluggerML: Feature Selection
• [Chart] Game: Temperature (˚F)
  • y-axis: correlation, -50.0% to 50.0%; series: Home Run, Strikeout
  • x-axis: 5-degree bins from 35-39 to 100-104

SluggerML: Feature Selection
• [Chart] Game: Year
  • y-axis: correlation, -50.0% to 50.0%; series: Home Run, Strikeout
  • x-axis: 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011

SluggerML: Realtime Classification
• Given features, predict label probabilities
• nltk: NaiveBayesClassifier
• Web frontend
  • gunicorn, nginx
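
To show what "predict label probabilities" involves, here is a small pure-Python naive Bayes over categorical features. It is a stand-in written for illustration, not nltk's NaiveBayesClassifier and not the SluggerML code; the class name, the smoothing choice, and the `batter_hand` feature are assumptions.

```python
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Categorical naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, labeled_feature_sets):
        self.label_counts = Counter()
        # (label, feature name) -> Counter of observed values
        self.value_counts = defaultdict(Counter)
        # feature name -> set of all values seen in training
        self.values = defaultdict(set)
        for features, label in labeled_feature_sets:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[label, name][value] += 1
                self.values[name].add(value)

    def prob_classify(self, features):
        """Return {label: probability} for one unlabeled feature set."""
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            score = count / total  # prior P(label)
            for name, value in features.items():
                seen = self.value_counts[label, name][value]
                # Add-one smoothing, with one extra bucket for unseen values,
                # keeps a missing value from zeroing the whole score.
                score *= (seen + 1) / (count + len(self.values[name]) + 1)
            scores[label] = score
        norm = sum(scores.values())
        return {label: score / norm for label, score in scores.items()}


# Toy training data, invented for illustration only.
training_set = [
    ({"game_daynight": "day", "batter_hand": "L"}, "HR"),
    ({"game_daynight": "day", "batter_hand": "R"}, "K"),
    ({"game_daynight": "night", "batter_hand": "R"}, "K"),
    ({"game_daynight": "night", "batter_hand": "L"}, "OTHER"),
]
classifier = TinyNaiveBayes(training_set)
probs = classifier.prob_classify({"game_daynight": "night", "batter_hand": "R"})
print(probs)
```

Classification only needs the stored counts, so a trained model can answer each web request with a handful of dictionary lookups.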

Tips and Tricks
• Persistent classifier internals
   • once trained, save and reuse
   • depends on implementation
    • string representation may exist
    • create your own
• Using generators where possible
  • avoid keeping data in memory
    • single-pass algorithms
    • conversion pass before training
• Multicore text processing
  • scrubbing: low memory footprint
  • multiprocessing module
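
The first two tips can be sketched together: a generator feeds records to a single-pass "training" step, and the resulting state is saved for reuse. The record format and function names are hypothetical, and pickle is used here as one possible way to persist state (the slide leaves the mechanism open); only load pickles from sources you trust.

```python
import io
import pickle


def parse_events(lines):
    """Yield (features, label) pairs one at a time instead of building a list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # scrubbing: skip blank records
        daynight, label = line.split(",")
        yield {"game_daynight": daynight}, label


def count_labels(labeled_feature_sets):
    """Single-pass 'training': only running counts are kept in memory."""
    counts = {}
    for _, label in labeled_feature_sets:
        counts[label] = counts.get(label, 0) + 1
    return counts


# io.StringIO stands in for a large log file opened with open().
log = io.StringIO("day,HR\nnight,K\n\nday,K\n")
model = count_labels(parse_events(log))

# Persist the trained state once, then reload it instead of retraining.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print(restored)
```

Since `parse_events` never materializes the full dataset, memory use stays flat no matter how many plate appearances are processed.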

The Fine Print™
• Plug-and-play is easy!
• Don’t blindly apply ML
  • understand your data
  • understand your algorithms
     • ml-class.org is an excellent resource

Thanks!
github.com/mattspitz/sluggerml
slideshare.net/mattspitz/practical-machine-learning-in-python


@mattspitz

Contenu connexe

En vedette

Sample email submission
Sample email submissionSample email submission
Sample email submissionDavid Sommer
 
My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3David Sommer
 
Internationalization in Rails 2.2
Internationalization in Rails 2.2Internationalization in Rails 2.2
Internationalization in Rails 2.2Nicolas Jacobeus
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Javajbellis
 
Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)John Collins
 
mobile development platforms
mobile development platformsmobile development platforms
mobile development platformsguestfa9375
 
How to make intelligent web apps
How to make intelligent web appsHow to make intelligent web apps
How to make intelligent web appsiapain
 
My Valentine Gift - YOU Decide
My Valentine Gift - YOU DecideMy Valentine Gift - YOU Decide
My Valentine Gift - YOU DecideSizzlynRose
 
Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)John Collins
 
2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentaryalghanim
 
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)John Collins
 
The ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThe ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThoughtWorks
 
Strategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationStrategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationJohn Collins
 
Designing for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsDesigning for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsRobert Douglas
 
Stc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kitsStc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kitsDavid Sommer
 
Linguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with RailsLinguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with RailsHeatherRivers
 

En vedette (19)

Glossary
GlossaryGlossary
Glossary
 
Sample email submission
Sample email submissionSample email submission
Sample email submission
 
My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3
 
Shrunken Head
 Shrunken Head  Shrunken Head
Shrunken Head
 
Internationalization in Rails 2.2
Internationalization in Rails 2.2Internationalization in Rails 2.2
Internationalization in Rails 2.2
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Java
 
Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)
 
mobile development platforms
mobile development platformsmobile development platforms
mobile development platforms
 
How to make intelligent web apps
How to make intelligent web appsHow to make intelligent web apps
How to make intelligent web apps
 
My Valentine Gift - YOU Decide
My Valentine Gift - YOU DecideMy Valentine Gift - YOU Decide
My Valentine Gift - YOU Decide
 
Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)
 
2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary
 
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
 
The ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThe ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj Kumar
 
Strategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationStrategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful Localization
 
Designing for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsDesigning for Multiple Mobile Platforms
Designing for Multiple Mobile Platforms
 
Stc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kitsStc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kits
 
Silmeyiniz
SilmeyinizSilmeyiniz
Silmeyiniz
 
Linguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with RailsLinguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with Rails
 

Similaire à Practical ML in Python: HR/K Prediction

Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use caseFlorian Wilhelm
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use caseinovex GmbH
 
FSB: TreeWalker - SECCON 2015 Online CTF
FSB: TreeWalker - SECCON 2015 Online CTFFSB: TreeWalker - SECCON 2015 Online CTF
FSB: TreeWalker - SECCON 2015 Online CTFYOKARO-MON
 
Down the rabbit hole, profiling in Django
Down the rabbit hole, profiling in DjangoDown the rabbit hole, profiling in Django
Down the rabbit hole, profiling in DjangoRemco Wendt
 
Code instrumentation
Code instrumentationCode instrumentation
Code instrumentationBryan Reinero
 
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1Toshiaki Maki
 

Similaire à Practical ML in Python: HR/K Prediction (9)

sourav-projects
sourav-projectssourav-projects
sourav-projects
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
FSB: TreeWalker - SECCON 2015 Online CTF
FSB: TreeWalker - SECCON 2015 Online CTFFSB: TreeWalker - SECCON 2015 Online CTF
FSB: TreeWalker - SECCON 2015 Online CTF
 
專題報告
專題報告專題報告
專題報告
 
Down the rabbit hole, profiling in Django
Down the rabbit hole, profiling in DjangoDown the rabbit hole, profiling in Django
Down the rabbit hole, profiling in Django
 
Code instrumentation
Code instrumentationCode instrumentation
Code instrumentation
 
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
 
About_Moviemetr
About_MoviemetrAbout_Moviemetr
About_Moviemetr
 

Dernier

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Dernier (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Practical ML in Python: HR/K Prediction

  • 1. Practical Machine Learning in Python Matt Spitz via @mattspitz
  • 2. Practical Machine Learning in Python 2 This is the Age of Aquarius Data • Data is plentiful • application logs • external APIs • Facebook, Twitter • public datasets • Analysis adds value • understanding your users • dynamic application decisions • Storage / CPU time is cheap
  • 3. Practical Machine Learning in Python 3 Machine Learning in Python • Python is well-suited for data analysis • Versatile • quick and dirty scripts • full-featured, realtime applications • Mature ML packages • tons of choices (see: mloss.org) • plug-and-play or DIY
  • 4. Practical Machine Learning in Python 4 Classification Problem: Terminology • Data points • feature set: “interesting” facts about an event/thing • label: a description of that event/thing • Classification • training set: a bunch of labeled feature sets • given a training set, build a classifier to predict labels for unlabeled feature sets
  • 5. Practical Machine Learning in Python 5 SluggerML • Two questions • What features are strong predictors for home runs and strikeouts? • Given a particular situation, with what probability will the batter hit a home run or strike out? • Feature sets represent game state for a plate appearance • game: day vs. night, wind direction... • at-bat: inning, #strikes, left-right matchup... • batter/pitcher: age, weight, fielding position... • Labels represent outcome • HR (home run), K (strikeout), OTHER • Poor Man’s Sabermetrics
  • 6. Practical Machine Learning in Python 6 SluggerML: Example • Training set • {game_daynight: day, batter_age: 24, pitcher_weight: 211} • label: HR • {game_daynight: day, batter_age: 36, pitcher_weight: 242} • label: K • {game_daynight: night, batter_age: 27, pitcher_weight: 195} • label: OTHER • Classifier predictions • {game_daynight: night, batter_age: 36, pitcher_weight: 225} • 2.6% HR 15.6% K • {game_daynight: day, batter_age: 20, pitcher_weight: 216} • 2.2% HR 19.1% K
  • 7. Practical Machine Learning in Python 7 SluggerML: Gathering Data • Sources • Retrosheet • play-by-play logs for every game since 1956 • Sean Lahman’s Baseball Archive • detailed stats about individual players • Coalescing • 1st pass, Lahman: create player database • shelve module • 2nd pass, Retrosheet: track game state, join on player db • Scrubbing • ensure consistency
  • 8. Practical Machine Learning in Python 8 SluggerML: Gathering Data • Training set • regular-season games from 1980-2011 • 5,669,301 plate appearances • 135,602 home runs • 871,226 strikeouts
  • 9. Practical Machine Learning in Python 9 Selecting a Toolkit: Tradeoffs • Speed • offline vs. realtime • Transparency • internal visibility • customizability • Support • maturity • community
  • 10. Practical Machine Learning in Python 10 Selecting a Toolkit: High-Level Options • External bindings • python interfaces to popular packages • Matlab, R, Octave, SHOGUN Toolbox • transition legacy workflows • Python implementations • collections of algorithms • (mostly) python • external subcomponents • DIY • building blocks
  • 11. Practical Machine Learning in Python 11 Selecting a Toolkit: Python Implementations • nltk • focus on NLP • book: Natural Language Processing with Python (O’Reilly ‘09) • mlpy • regression, classification, clustering • PyML • focus on SVM • PyBrain • focus on neural networks
  • 12. Practical Machine Learning in Python 12 Selecting a Toolkit: Python Implementations • mdp-toolkit • data processing management • nodes represent tasks in a data workflow • scheduling, parallelization • scikit-learn • supervised, unsupervised, feature selection, visualization • heavy development, large team • excellent documentation • active community
  • 13. Practical Machine Learning in Python 13 Selecting a Toolkit: Do It Yourself • Basic building blocks • NumPy • SciPy • C/C++ implementations • LIBLINEAR • LIBSVM • OpenCV • ...your own?
  • 14. Practical Machine Learning in Python 14 SluggerML: Two Questions • What features are strong predictors for home runs and strikeouts? • Given a particular situation, with what probability will the batter hit a home run or strike out?
  • 15. Practical Machine Learning in Python 15 SluggerML: Feature Selection • Identifies predictive features • strongly correlated with labels • predictive: max_benchpress • not predictive: favorite_cookie • scikit-learn: chi-square feature selection • Visualizing significance • for each well-supported value, find correlation with HR/K • “well-supported”: >= 0.05% of samples with feature=value • correlation: ( P(HR | feature=value) / P(HR) ) - 1
  • 16. SluggerML: Feature Selection. [Chart] Batter: Home vs. Visiting. Correlation with Home Run and Strikeout (y-axis from -50.0% to 50.0%) for: home team, visiting team.
  • 17. SluggerML: Feature Selection. [Chart] Batter: Fielding Position. Correlation with Home Run and Strikeout (y-axis from -50.0% to 50.0%) for: P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH.
  • 18. SluggerML: Feature Selection. [Chart] Game: Temperature (˚F). Correlation with Home Run and Strikeout (y-axis from -50.0% to 50.0%) for five-degree bins from 35-39 to 100-104.
  • 19. SluggerML: Feature Selection. [Chart] Game: Year. Correlation with Home Run and Strikeout (y-axis from -50.0% to 50.0%) for: 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011.
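The correlation measure from the feature-selection slide, ( P(HR | feature=value) / P(HR) ) - 1, can be sketched in a few lines of plain Python. This is only an illustration, not the SluggerML code: the function name `label_lift`, the feature name `batter_team`, and the toy samples are all made up, and reading "well-supported" as "the value appears in at least 0.05% of all samples" is an assumption.

```python
from collections import Counter

MIN_SUPPORT = 0.0005  # 0.05% of samples (assumed reading of "well-supported")


def label_lift(samples, feature, label, min_support=MIN_SUPPORT):
    """Return {value: P(label | feature=value) / P(label) - 1}.

    `samples` is a list of (feature_dict, label) pairs. Values seen in fewer
    than `min_support` of all samples are skipped.
    """
    total = len(samples)
    label_total = sum(1 for _, lab in samples if lab == label)
    if total == 0 or label_total == 0:
        return {}
    p_label = label_total / total

    value_counts = Counter()        # how often each value occurs
    value_label_counts = Counter()  # how often it occurs together with `label`
    for features, lab in samples:
        if feature not in features:
            continue
        value = features[feature]
        value_counts[value] += 1
        if lab == label:
            value_label_counts[value] += 1

    result = {}
    for value, count in value_counts.items():
        if count / total < min_support:
            continue  # not well-supported
        p_label_given_value = value_label_counts[value] / count
        result[value] = p_label_given_value / p_label - 1
    return result


# Toy data, invented for illustration only.
samples = (
    [({"batter_team": "home"}, "HR")] * 3
    + [({"batter_team": "home"}, "OTHER")] * 7
    + [({"batter_team": "visiting"}, "HR")] * 1
    + [({"batter_team": "visiting"}, "OTHER")] * 9
)
lift = label_lift(samples, "batter_team", "HR")
```

A positive value means the label is more likely than average when the feature takes that value; a negative value means it is less likely. In the toy data, home batters homer in 30% of plate appearances against 20% overall, so the value for `home` is about +50%.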
  • 20. SluggerML: Realtime Classification
    • Given features, predict label probabilities
      • nltk: NaiveBayesClassifier
    • Web frontend
      • gunicorn, nginx
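To make "given features, predict label probabilities" concrete, here is a hand-rolled categorical naive Bayes with add-one smoothing, using only the standard library. It is not nltk's NaiveBayesClassifier and not the SluggerML code; the class name `TinyNaiveBayes` and the toy training set are hypothetical, and the sketch assumes every training sample carries every feature.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Categorical naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, labeled_featuresets):
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values_seen = defaultdict(set)       # feature -> values seen in training
        for features, label in labeled_featuresets:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values_seen[name].add(value)

    def prob_classify(self, features):
        """Return {label: P(label | features)}."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n_label in self.label_counts.items():
            score = math.log(n_label / total)  # log prior
            for name, value in features.items():
                if name not in self.values_seen:
                    continue  # feature never seen in training: ignore it
                count = self.value_counts[(label, name)][value]
                n_values = len(self.values_seen[name])
                score += math.log((count + 1) / (n_label + n_values))
            log_scores[label] = score
        # Normalise in log space so the probabilities sum to 1.
        top = max(log_scores.values())
        unnormalised = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        norm = sum(unnormalised.values())
        return {lab: u / norm for lab, u in unnormalised.items()}


# Toy training set, invented for illustration only.
train = [
    ({"game_daynight": "day"}, "HR"),
    ({"game_daynight": "day"}, "HR"),
    ({"game_daynight": "night"}, "K"),
    ({"game_daynight": "night"}, "K"),
    ({"game_daynight": "day"}, "OTHER"),
    ({"game_daynight": "night"}, "OTHER"),
]
classifier = TinyNaiveBayes(train)
probs = classifier.prob_classify({"game_daynight": "day"})
```

Working in log space avoids underflow once there are many features per plate appearance. A web frontend would call something like `prob_classify` per request and return the resulting dictionary.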
  • 21. Tips and Tricks
    • Persistent classifier internals
      • once trained, save and reuse
      • depends on implementation
        • string representation may exist
        • create your own
    • Using generators where possible
      • avoid keeping data in memory
      • single-pass algorithms
      • conversion pass before training
    • Multicore text processing
      • scrubbing: low memory footprint
      • multiprocessing module
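Two of the tips above, streaming data through generators and saving trained internals for reuse, might look roughly like this. The tab-separated line format, the function names, and the count-based "model" are assumptions made for the example; whether a given toolkit's classifier can be pickled depends on its implementation, as the slide notes.

```python
import io
import pickle
from collections import Counter


def labeled_featuresets(lines):
    """Yield (features, label) pairs one at a time.

    Assumed (hypothetical) line format: label<TAB>name=value<TAB>name=value...
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, *pairs = line.split("\t")
        features = dict(pair.split("=", 1) for pair in pairs)
        yield features, label


def train_counts(featuresets):
    """Single pass over the stream: count (label, feature, value) co-occurrences."""
    counts = Counter()
    for features, label in featuresets:
        for name, value in features.items():
            counts[(label, name, value)] += 1
    return counts


# An in-memory stand-in for a large log file; any iterable of lines works.
raw = io.StringIO(
    "HR\tgame_daynight=day\n"
    "K\tgame_daynight=night\n"
    "HR\tgame_daynight=day\n"
)
model = train_counts(labeled_featuresets(raw))

# Persist the trained internals and load them back, instead of retraining.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

Because `labeled_featuresets` is a generator, only one line is held in memory at a time. With a real file, `pickle.dump(model, f)` and `pickle.load(f)` on a file opened in binary mode play the same role as the in-memory round trip shown here.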
  • 22. The Fine Print™
    • Plug-and-play is easy!
    • Don’t blindly apply ML
      • understand your data
      • understand your algorithms
    • ml-class.org is an excellent resource
  • 23. Thanks!
    • github.com/mattspitz/sluggerml
    • slideshare.net/mattspitz/practical-machine-learning-in-python
    • @mattspitz

Editor's notes

  1. Data is everywhere: clickstream data. Users are bad at managing FB permissions; you can get a lot out of the Graph API. There’s value in learning about data: how people use your site, feature or advertisement personalization. One thing that enables this is that resources are cheap these days.
  2. Python is a fantastic programming environment for data processing and analytics. On one end of the spectrum, quick and dirty scripts... on the other, full-featured applications ready for deployment at scale. Wide variety of toolkits for off-the-shelf analysis or building out your own data processing applications.
  3. For this talk... discussing one flavor of analytics and machine learning: the classification problem. Intuition: the training set is what you know about the world; train a classifier to predict things that you don’t.
  4. As a concrete example, I started playing around with some baseball stats to illustrate how one might go about building ML applications in Python. Even if you’re not into baseball, you know that the iconic visions of success and failure are the home run and the strikeout. In all the movies, hitting a home run is equivalent to getting the girl and striking out is seen as a major setback.
  5. As with any machine learning problem, you want to get your data into a classifier-consumable format. That is, labeled feature sets. For each play in the game, keep track of the game state and output a labeled feature bundle representing the situation and its outcome: HR, K, (other)
  6. Speed: offline means a deadline of hours or days; realtime means a user is waiting on the other side (user actions => milliseconds). Transparency: seeing what’s going on with an algorithm in case the docs aren’t clear; modifying or patching an algorithm to meet your needs. Support: maturity, active development. How strong is the community around the project? Are there tutorials available?
  7. Interface with external packages if you’ve done some analysis already and want to transition to Python without throwing away code. Python toolkits provide sets of algorithms, mostly Python implementations; they often use external packages with C bindings, and some even use other toolkits. DIY: use the external packages yourself.
  8. To give a sampling of what’s available, I chose some toolkits that were last updated within a year. As a disclaimer: not exhaustive, just a sampling; some of these tools I’ve used, some I haven’t! I’m sure I’ve missed your favorite, and for that I apologize. Different packages focus on different things, so one isn’t necessarily going to suit all of your needs.
  9. buzz around scikit-learn last year - checked it out recently and it’s been built out a lot
  10. NumPy: fast and efficient arrays. SciPy: scientific tools and algorithms built on NumPy. Can also use popular C/C++ implementations using Python bindings. Python is a modular language, so you can always sub out your implementation without disrupting your workflow too much. Now, as an example of applying these toolkits...
  11. Question one: speed isn’t critical. Question two: speed is critical (imagine that you’re a coach); baseball is slow, but it’s not THAT slow.
  12. Identifies predictive features: certain values are strongly correlated with certain labels. sklearn: wasn’t clear on the documented usage, so I looked at the code.
  13. for a coach
  14. Don’t we need to train our classifier to run our web application? Save the classifier internals on disk! Pickle or pull out a textual representation (another argument for using a package that allows you to do this). Why compute things twice? Use generators: with lots and lots of data, avoid keeping it all in memory. Single-pass algorithm (Bayes). First-pass conversion to compact data (NumPy vectors, not Python objects). Not always possible, but keep it in mind. Take advantage of multiple cores: if your processing step has a minimal memory footprint (just one line at a time), do it on multiple cores. Running multiple processes on different input files works, and the multiprocessing module is great at this.
  15. You don't need to know everything about the algorithms you use... but you can't just blindly apply these things and hope that they magically work. ml-class.org: free class, provides an excellent foundation and starting point for understanding ML. In no time, you, too, can be a number muncher.
  16. Source code for SluggerML is on GitHub; it’s kind of a mess, and I’m sorry about that. And I’m @mattspitz on the twitters.