SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
Debugging Machine Learning.
Mostly for profit but with a bit of fun too!
Michał Łopuszyński
PyData Warsaw, 19.10.2017
About me
In my previous professional life, I was a theoretical physicist.
I got a PhD in solid state physics
•
For the last 5 years I work as a Data Scientist in ICM, University of Warsaw•
Agenda
4 failure modes of ML systems•
9 simple hints what to do•
Hey, my
ML system
does not
work at all...
Hint #1
Check your code
AKA it is engineering, stupid!
Write tests
Do not strive for 100% coverage, partial coverage is infinitely better
then none!
•
In Python, doctests are your friends•
"Hidden" benefits of tests•
Better code structure•
Executable documentation•
Test the fragile parts first•
Test your code with Monte Carlo / synthetic data
Try to generate a trivial case for your
ML system
•
This tests the whole pipeline (transforming/training) and allows for exploration of
your system performance under unusual conditions
•
If you have generative model,
prepare Monte Carlo data from
assumed distributions
•
Perturb the original data, by
generating the output with known
and learnable prescription
•
Code style
Single responsibility principle•
Do not repeat yourself (DRY!)•
Have and apply coding standard (PEP8)•
Consider using a linter•
Short functions and methods•
https://xkcd.com/844/
pycodestyle checker (formerly pep8)•
yapf formater•
pylint, pyflakes, pychecker
Naming
Minimal requirement - be consistent!•
Interesting software development books, offering chapters on naming•
Freely available chapter on names
Hint #2
Check your data
Data quality audits are difficult
Happy families are all alike; every unhappy family
is unhappy in its own way.
Leo Tolstoy
Like families, tidy datasets are all alike but every
messy dataset is messy in its own way.
Hadley Wickham
H. Wickham, Tidy Data, JSS 59 (2014), doi: 10.18637/jss.v059.i10
Images credit - Wikipedia
Data quality
Beware, your data providers usually
overestimate the data quality
•
Think of outliers, missing values (and how the are represented)•
Understand your data•
Do exploratory data analysis•
Visualize, visualize, visualize•
Talk to the domain expert•
Is your data correct, complete, coherent, stationary (seasonality!), deduplicated,
up-to-date, representative (unbiased)
•
OK, my ML system
works, but I think it
should perform
better...
Hint #3
Examine your features
Features
Features make a difference!•
Be creative with your features•
Try meaningful transformations,
combinations (products/ratios), decorrelation...
•
Think of good representations for non-tabular data
(text, time-series)
•
Make conscious decision about missing values•
Understand what features are important for your model•
Use ML models offering feature ranking•
Use feature selection methods•
ID F1 F2 F3
Hint #4
Examine your data points
Data points
Find difficult data points! (DDP)•
DDP = notoriously misclassified (or high error) cases
in your cross-validation loop for large variety of models
•
ID
P1
P2
P3
P4
P5
Examine DDPs, understand them!•
In the easiest case, remove DDPs from the dataset
(think outliers, mislabeled examples)
•
Influence functions
Influence functions
Best Paper Award
ICML 2017
Data points
Get more data!• ID
P1
P2
P3
P4
P5
Trick 1. Extend your set with artificial data
E.g., data augmentation in image processing,
SMOTE algorithm for imbalanced datasets
•
Good performance booster, rarely applicable•
Trick 2. Generate automatically noisy labeled
data set by heuristics, e.g. distant supervision in NLP
(requires unlabeled data!)
•
Trick 3. Semisupervised learning methods
self-training and co-training (requires unlabeled data!)
•
Hint #5
Examine your model
Why my model predicts what it predicts? (philosophical slide)
How do you answer why questions?•
Inspiring homework: watch Richard Feynman, Fun to imagine on magnets (youtube)•
Model introspection
You can answer thy why question, only for very simple models
(e.g., linear model, basic decision trees)
•
Sometimes, it is instructive to run such a simple model on your dataset, even
though it does not provide top-level performance
•
You can boost your simple model by feeding it with more advanced
(non-linearly transformed) features
•
Complex model introspection
LIME algorithm = Local Interpretable Model-agnostic Explanations
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
Complex model introspection – practical issues
Lime authors provided open source Python implementation as lime package•
http://eli5.readthedocs.io/en/latest/tutorials•
Another option eli5 package, aimed more generally at model introspection
and explanation (includes lime implementation)
•
Visualizing models
Display model in
a data space
Look at collection of
models at once
Explore the process of
model fitting
The ASA Data Science Journal 8, 203 (2015), doi: 10.1002/sam.11271
So my ML system
works on test data,
but you tell me it
fails in production?
Hint #6
Watch out for overfitting
Overfitting
If you torture the data long enough,
it will confess.
Roald Coase
Hint #7
Watch out for data leakage
Data leakage
Some time ago, I used to thing data leaks are trivial to avoid•
They are not! (Look at number of Kaggle competitions flawed by Data Leakage)•
You may introduce them yourself
E.g. meaningful identifiers, past & future separation in time series
•
You may receive them in the data from your provider•
Good paper•
Hint #8
Watch out for covariate shift
What is covariate shift?
Training data
y
X
What is covariate shift?
Model
Training data
X
y
What is covariate shift?
Model
Noiseless reality
Training data
y
X
What is covariate shift?
Model
Noiseless reality
Training data
Production data
(Test data)
y
X
Covariate shift
Unlike overfitting and data leakage, it is easier to detect•
Method: Try to build classifier differentiating train from production (test).
If you succeed, you very likely have a problem
•
Basic remedy – reweighting data points. Give production-like data higher impact
on your model
•
The quality of my
super ML system
deteriorates with
time, really?
Really really?
ML system in production
2009: Hooray we can predict flu
epidemics from Google query data!
2014: Hmm... Can we?
ML system in production
NIPS 2015 paper
Hint #9
Remember monitoring & maintenance
AKA it is engineering again, stupid!
5
Thank you!
Questions?
@lopusz

Contenu connexe

Tendances

Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLPaco Nathan
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentationehtshamelahi
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructurejoshwills
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareJustin Basilico
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandrySri Ambati
 
A Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionA Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionDatabricks
 
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Sri Ambati
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsJustin Basilico
 
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Sri Ambati
 
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDays Riga
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineeringIvano Malavolta
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningWell, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningDevFest DC
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Aseda Owusua Addai-Deseh
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsTuri, Inc.
 
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixMLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixXavier Amatriain
 

Tendances (20)

Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark Landry
 
A Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionA Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly Detection
 
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing Recommendations
 
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
 
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningWell, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixMLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
 

Similaire à Debugging machine-learning

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Machine learning 101 dkom 2017
Machine learning 101 dkom 2017Machine learning 101 dkom 2017
Machine learning 101 dkom 2017fredverheul
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation OptionsMihai Criveti
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-finalsupportlogic
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needGibDevs
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programmingJuggernaut Liu
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptxsnigdhaagrawal11
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptxAbderrahmanABID2
 
MachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert SystemsMachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert Systemsshreenathji26
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptxharikaramisetty3
 
How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...WiMLDSMontreal
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez
 
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...panagenda
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableJustin Basilico
 

Similaire à Debugging machine-learning (20)

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Machine learning 101 dkom 2017
Machine learning 101 dkom 2017Machine learning 101 dkom 2017
Machine learning 101 dkom 2017
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programming
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert SystemsMachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert Systems
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling Blunders
 
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 

Dernier

怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格q6pzkpark
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........EfruzAsilolu
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 

Dernier (20)

怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 

Debugging machine-learning

  • 1. Debugging Machine Learning. Mostly for profit but with a bit of fun too! Michał Łopuszyński PyData Warsaw, 19.10.2017
  • 2. About me In my previous professional life, I was a theoretical physicist. I got a PhD in solid state physics • For the last 5 years I work as a Data Scientist in ICM, University of Warsaw•
  • 3. Agenda 4 failure modes of ML systems• 9 simple hints what to do•
  • 4. Hey, my ML system does not work at all...
  • 5. Hint #1 Check your code AKA it is engineering, stupid!
  • 6. Write tests Do not strive for 100% coverage, partial coverage is infinitely better then none! • In Python, doctests are your friends• "Hidden" benefits of tests• Better code structure• Executable documentation• Test the fragile parts first•
  • 7. Test your code with Monte Carlo / synthetic data Try to generate a trivial case for your ML system • This tests the whole pipeline (transforming/training) and allows for exploration of your system performance under unusual conditions • If you have generative model, prepare Monte Carlo data from assumed distributions • Perturb the original data, by generating the output with known and learnable prescription •
  • 8. Code style Single responsibility principle• Do not repeat yourself (DRY!)• Have and apply coding standard (PEP8)• Consider using a linter• Short functions and methods• https://xkcd.com/844/ pycodestyle checker (formerly pep8)• yapf formater• pylint, pyflakes, pychecker
  • 9. Naming Minimal requirement - be consistent!• Interesting software development books, offering chapters on naming• Freely available chapter on names
  • 11. Data quality audits are difficult Happy families are all alike; every unhappy family is unhappy in its own way. Leo Tolstoy Like families, tidy datasets are all alike but every messy dataset is messy in its own way. Hadley Wickham H. Wickham, Tidy Data, JSS 59 (2014), doi: 10.18637/jss.v059.i10 Images credit - Wikipedia
  • 12. Data quality Beware, your data providers usually overestimate the data quality • Think of outliers, missing values (and how the are represented)• Understand your data• Do exploratory data analysis• Visualize, visualize, visualize• Talk to the domain expert• Is your data correct, complete, coherent, stationary (seasonality!), deduplicated, up-to-date, representative (unbiased) •
  • 13. OK, my ML system works, but I think it should perform better...
  • 15. Features Features make a difference!• Be creative with your features• Try meaningful transformations, combinations (products/ratios), decorrelation... • Think of good representations for non-tabular data (text, time-series) • Make conscious decision about missing values• Understand what features are important for your model• Use ML models offering feature ranking• Use feature selection methods• ID F1 F2 F3
  • 16. Hint #4 Examine your data points
  • 17. Data points Find difficult data points! (DDP)• DDP = notoriously misclassified (or high error) cases in your cross-validation loop for large variety of models • ID P1 P2 P3 P4 P5 Examine DDPs, understand them!• In the easiest case, remove DDPs from the dataset (think outliers, mislabeled examples) •
  • 20. Data points Get more data!• ID P1 P2 P3 P4 P5 Trick 1. Extend your set with artificial data E.g., data augmentation in image processing, SMOTE algorithm for imbalanced datasets • Good performance booster, rarely applicable• Trick 2. Generate automatically noisy labeled data set by heuristics, e.g. distant supervision in NLP (requires unlabeled data!) • Trick 3. Semisupervised learning methods self-training and co-training (requires unlabeled data!) •
  • 22. Why my model predicts what it predicts? (philosophical slide) How do you answer why questions?• Inspiring homework: watch Richard Feynman, Fun to imagine on magnets (youtube)•
  • 23. Model introspection You can answer thy why question, only for very simple models (e.g., linear model, basic decision trees) • Sometimes, it is instructive to run such a simple model on your dataset, even though it does not provide top-level performance • You can boost your simple model by feeding it with more advanced (non-linearly transformed) features •
  • 24. Complex model introspection LIME algorithm = Local Interpretable Model-agnostic Explanations ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
  • 25. Complex model introspection – practical issues Lime authors provided open source Python implementation as lime package• http://eli5.readthedocs.io/en/latest/tutorials• Another option eli5 package, aimed more generally at model introspection and explanation (includes lime implementation) •
  • 26. Visualizing models Display model in a data space Look at collection of models at once Explore the process of model fitting The ASA Data Science Journal 8, 203 (2015), doi: 10.1002/sam.11271
  • 27. So my ML system works on test data, but you tell me it fails in production?
  • 28. Hint #6 Watch out for overfitting
  • 29. Overfitting If you torture the data long enough, it will confess. Roald Coase
  • 30. Hint #7 Watch out for data leakage
  • 31. Data leakage Some time ago, I used to thing data leaks are trivial to avoid• They are not! (Look at number of Kaggle competitions flawed by Data Leakage)• You may introduce them yourself E.g. meaningful identifiers, past & future separation in time series • You may receive them in the data from your provider• Good paper•
  • 32. Hint #8 Watch out for covariate shift
  • 33. What is covariate shift? Training data y X
  • 34. What is covariate shift? Model Training data X y
  • 35. What is covariate shift? Model Noiseless reality Training data y X
  • 36. What is covariate shift? Model Noiseless reality Training data Production data (Test data) y X
  • 37. Covariate shift Unlike overfitting and data leakage, it is easier to detect• Method: Try to build classifier differentiating train from production (test). If you succeed, you very likely have a problem • Basic remedy – reweighting data points. Give production-like data higher impact on your model •
  • 38. The quality of my super ML system deteriorates with time, really? Really really?
  • 39. ML system in production 2009: Hooray we can predict flu epidemics from Google query data! 2014: Hmm... Can we?
  • 40. ML system in production NIPS 2015 paper
  • 41. Hint #9 Remember monitoring & maintenance AKA it is engineering again, stupid! 5