Work Hard Once
Strategy and Automation applied to
building machine learning models
Franklin Sarkett
April 2, 2018
About me: franklin.sarkett@gmail.com
Audantic Real Estate Analytics, co-founder
● http://audantic.com/
● Audantic provides customized data, analytics, and predictive analytics for
residential real estate.
Facebook
● Data scientist at Facebook; developed an algorithm for the Ads Payments team that
increased revenue by over $200 million and earned a patent.
Education
● CS degree from the University of Illinois at Urbana-Champaign
● MS in Applied Statistics from DePaul University.
Summary
Building machine learning models from data ingestion to productionalization is challenging,
with many steps.
Of all the steps, feature engineering is the biggest differentiator between models that work
and models that do not.
Using automation and strategy we can remove some of the most challenging parts, and
focus on the area of machine learning that generates the most value: feature engineering.
John Boyd and the OODA Loop
The OODA loop is the decision cycle of observe, orient, decide, and act, developed by military strategist
and United States Air Force Colonel John Boyd.
Boyd applied the concept to the combat operations process.
It is now also often applied to understand commercial operations and learning processes.
The approach favors agility over raw power in dealing with human opponents in any endeavor.
- Wikipedia
Orient (most important)
"Orient" is the key to the OODA loop.
Since one is conditioned by one's heritage, surrounding culture, existing knowledge and
learnings, the mind combines fragments of ideas, information, conjectures, impressions,
etc. to generate our orientation.
How well your orientation matches the real world is largely a function of how well you
observe.
Stages of Machine Learning
● Get raw data (SQL, CSV, API)
● Data cleaning
● Feature engineering
● Model training
● Model evaluation
● Deployment
The stages are mapped onto the OODA loop: Observe, Orient, Decide, Act.
Two guiding thoughts
A mentor of mine at Facebook was coaching me on our model building and gave me two guiding thoughts:
building models requires domain knowledge, and you should put as much data into the model as you can.
To improve the models, you need to add:
● Data quality
● Data volume
○ Breadth
○ Depth
Addressing these concerns takes Feature Engineering to the next level.
Automating the Observe stage
Many of the tasks in the Observe stage could be classified as DevOps and data engineering.
My favorite tools to use for data science:
● Docker
● Jenkins
● Luigi
Orient - Feature Engineering
“Coming up with features is difficult, time-consuming, requires expert
knowledge. 'Applied machine learning' is basically feature engineering.”
— Prof. Andrew Ng.
Orient - Feature Engineering
“The algorithms we used are very standard for Kagglers. …We spent
most of our efforts in feature engineering. … We were also very careful
to discard features likely to expose us to the risk of over-fitting our
model.”
— Xavier Conort
Orient - Feature Engineering
“Feature engineering is the process of transforming raw data into features
that better represent the underlying problem to the predictive models,
resulting in improved model accuracy on unseen data.”
— Dr. Jason Brownlee
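To make the definition above concrete, here is a minimal sketch of turning one raw record into model-ready features. The record layout and the field names (`sale_date`, `sale_price`, `sqft`, `year_built`) and the function `engineer_features` are hypothetical, chosen only to fit a residential real estate setting; they are not taken from the talk's code.

```python
from datetime import date


def engineer_features(record):
    """Derive features from a raw sale record (hypothetical schema)."""
    sale_date = date.fromisoformat(record["sale_date"])
    return {
        # Ratio feature: normalizes price by the size of the home.
        "price_per_sqft": record["sale_price"] / record["sqft"],
        # Difference feature: age of the building when it sold.
        "age_at_sale": sale_date.year - record["year_built"],
        # Calendar feature: lets a model pick up seasonality.
        "sale_month": sale_date.month,
    }


# Illustrative raw record, not real data.
raw = {
    "sale_date": "2018-04-02",
    "sale_price": 300000,
    "sqft": 1500,
    "year_built": 1988,
}
features = engineer_features(raw)
```

None of the derived values is new information, but each one states a relationship (price relative to size, age, season) that a model would otherwise have to learn from the raw columns on its own.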
Orient - Feature Engineering
At the end of the day, some machine learning projects succeed
and some fail. What makes the difference? Easily the most
important factor is the features used...It is often also one of the
most interesting parts, where intuition, creativity and “black art”
are as important as the technical stuff.
- Pedro Domingos, Professor of CS at the University of Washington
Code snippet
http://bit.ly/PyDataChi-FeatureEngineering
How do we iterate feature engineering faster?
● Create a pipeline of transforms with a final estimator.
● A Pipeline can be used to chain multiple estimators into one. This is useful because there is often a fixed
sequence of steps in processing the data, for example feature selection, normalization, and classification.
● Benefits:
○ Convenience and encapsulation: you only have to call fit and predict once on your data to fit a whole
sequence of estimators.
○ Safety: pipelines help avoid leaking statistics from your test data into the trained model in
cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
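The points above can be sketched with scikit-learn's `Pipeline`. This is an illustrative example under stated assumptions, not the code from the talk: the data is synthetic (`make_classification`), and the specific steps (`StandardScaler`, `SelectKBest`, `LogisticRegression`) are stand-ins for whatever normalization, feature selection, and classifier a real project would use.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the talk's real data is not shown).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# A fixed sequence of transforms followed by a final estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),              # normalization
    ("select", SelectKBest(f_classif, k=5)),  # feature selection
    ("clf", LogisticRegression(max_iter=1000)),  # classification
])

# Safety: cross_val_score refits the whole pipeline on each training fold,
# so the scaler and selector never see the held-out fold's statistics.
scores = cross_val_score(pipe, X, y, cv=5)

# Convenience: one fit and one predict drive every step in order.
pipe.fit(X, y)
predictions = pipe.predict(X[:3])
```

Because every step lives in one object, trying a new feature transform means editing one entry in the step list and rerunning the same cross-validation call, which is what makes each iteration cheap.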
[Slides: Feature extraction; The pipeline]
Summary
Building machine learning models from data ingestion to productionalization is hard.
Using automation and strategy we can remove some of the most challenging parts,
and focus on the area of machine learning that generates the most value: feature
engineering.
When we use automation and strategy to remove the most challenging parts of
machine learning, we can run through more OODA loops faster, generate better
models, learn more about our subject, and deliver more value.
franklin.sarkett@gmail.com
