Building Data Scientists
Machine Learning Mastery in Python
Mitch Sanders
Jan 10th 2018
Internal Use - Confidential
the Context
Industry Trends for 2018 – How what we’re doing fits into the future
Trend #1
Data Science continues to develop specialties - this means the mythical ‘full stack’ data scientist will disappear
• Data Scientist
• Data Engineer
• Algorithm Coder
• Data Storyteller
Trend #2
Non-Data Scientists will perform more fairly sophisticated analytics alongside data scientists
• Data Scientist
• Algorithm Coder
• Data Science Citizens
• Advanced Analytics
• Programmers
• Statisticians
• Business Analyst
• Coders
the Course
Machine Learning Mastery
- Understand Your Data
- Create Accurate Models
- Work Projects End-To-End
• 16 weeks – May-Oct., 2017
• 20+ class hours – 20% homework, 80% live coding
• 17 notebooks – Python code templates
• 4 Prerequisites – Coding, statistics, algorithms, thirst to learn
• 1 Textbook – Machine Learning Mastery with Python, Dr. Jason Brownlee
• 1 Teacher – Mitch Sanders w/ Assistant – Uday Waghmare
• 14 Students – global: software engineers, adv. analysts, statisticians
• Platform – Jupyter, Python 2.7, Anaconda
• Code Repository – GitHub
• NPS Survey – Survey Monkey, LTR = 90
• Awarded – “On the Spot”
the Content
Key concepts – and flow – the 17 notebooks (#1 to #17)
Prepare & Explore, Model, Improve Accuracy & Finalize
Python ML Ecosystem
• SciPy
• Scikit-learn
Crash Courses
• NumPy
• Matplotlib
• Pandas
Load Libraries & Data
Descriptive Statistics
• Attribute Data Types
• Class Distribution
• Correlation Analysis
• Skew of Univariates
Pre Processing
• Rescale
• Standardize
• Normalize
• Binarize
Feature Selection
• Tree & Univariate
• Recursive - RFE
• Principal Comp. Analysis - PCA
• Feature Importance
Resampling
• Split into Train/Test
• K-fold Cross Validation
• Leave One Out
• Repeated Random
Evaluation Metrics
• For Classification
• For Regression
Spot Check Classification Algorithms
• Linear – Logistic Regression, Linear Discriminant Analysis (LDA)
• Non-linear – K-Nearest Neighbor (KNN), Naïve Bayes, Classification & Regression Trees (CART), Support Vector Machines (SVM)
Compare Algorithms
Spot Check Regression Algorithms
• Linear – LR, LASSO, ElasticNet (EN)
• Non-Linear – CART, SVR, KNN
Automate w/ Pipelines
• Preparation Pipelines
• Feature Extraction Pipelines
• Modeling Pipelines
Ensembles - Performance Improvements
• Boosting – AdaBoost, Gradient Boosting (GBM)
• Bagging – Random Forest, Extra Trees
• Voting
Algorithm Parameter Tuning
• Parameters
• Grid Search
• Random Search
Finalize Model
• Predict on Validation Data
• Create Standalone on Entire Data
• Save Model for Production
Visualization
• Univariate Plots
• Multivariate Plots
Case Studies #1 & #2
Reference Material
the Course Syllabus
Python Ecosystem for Machine
Learning
• Python
• SciPy
• Scikit-learn
• Python Ecosystem Installation
• Summary
Crash Course in Python and SciPy
• Python Crash Course
• NumPy Crash Course
• Matplotlib Crash Course
• Pandas Crash Course
• Summary
How To Load Machine Learning Data
• Considerations When Loading CSV
Data
• Pima Indians Dataset
• Load CSV Files with the Python
Standard Library
• Load CSV Files with NumPy
• Load CSV Files with Pandas
• Summary
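As an illustration of the "Load CSV Files with the Python Standard Library" topic, the sketch below reads numeric CSV data with the csv module. The helper name load_numeric_csv and the sample rows are hypothetical and are not taken from the course notebooks or from the Pima Indians dataset; the sketch assumes every field is numeric.

```python
import csv
import io


def load_numeric_csv(handle, delimiter=",", has_header=False):
    """Read a CSV of numeric values into a list of rows of floats."""
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the line of column names
    # Blank lines come back as empty lists, so they are filtered out.
    return [[float(value) for value in row] for row in reader if row]


# Made-up rows for illustration only (assumption, not real data).
sample = io.StringIO("1.5,2.0,0\n3.0,4.5,1\n")
data = load_numeric_csv(sample)
print(len(data), len(data[0]))  # prints: 2 3
```

With a file on disk, the same function can be called inside a with open(path, newline="") block, which is the form the csv module documentation recommends.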
Understand Your Data With
Visualization
• Univariate Plots
• Multivariate Plots
• Summary
Prepare Your Data For Machine Learning
• Need For Data Pre-processing
• Data Transforms
• Rescale Data
• Standardize Data
• Normalize Data
• Binarize Data (Make Binary)
• Summary
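The four transforms named in this module can be sketched in plain Python to show the arithmetic behind each one. This is a minimal sketch with hypothetical function names; in practice scikit-learn offers MinMaxScaler, StandardScaler, Normalizer and Binarizer in sklearn.preprocessing for the same purposes.

```python
import math
import statistics


def rescale(column, low=0.0, high=1.0):
    """Min-max rescale one attribute (column) into the range [low, high]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [low for _ in column]  # a constant column carries no spread
    return [low + (high - low) * (v - lo) / span for v in column]


def standardize(column):
    """Shift and scale one attribute to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


def normalize(row):
    """Scale one observation (row) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)


def binarize(column, threshold=0.0):
    """Map values above the threshold to 1 and all other values to 0."""
    return [1 if v > threshold else 0 for v in column]
```

Note that rescale, standardize and binarize work per attribute, while normalize works per observation.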
Feature Selection For Machine Learning
• Feature Selection
• Univariate Selection
• Recursive Feature Elimination
• Principal Component Analysis
• Feature Importance
• Summary
Evaluate the Performance of Machine
Learning Algorithms with Resampling
• Evaluate Machine Learning Algorithms
• Split into Train and Test Sets
• K-fold Cross-Validation
• Leave One Out Cross-Validation
• Repeated Random Test-Train Splits
• What Techniques to Use When
• Summary
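To make the resampling idea concrete, the sketch below generates the train/test index splits used by k-fold and leave-one-out cross-validation. The function names and the default seed are hypothetical; scikit-learn provides KFold and LeaveOneOut in sklearn.model_selection, which would normally be used instead.

```python
import random


def k_fold_indices(n_samples, k=10, seed=7):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold_indices(n_samples, k=n_samples)
```

Every observation appears in exactly one test fold, so each one is used for evaluation once and for training k - 1 times.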
Machine Learning Algorithm
Performance Metrics
• Algorithm Evaluation Metrics
• Classification Metrics
• Regression Metrics
• Summary
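The classification and regression metrics covered here reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and assumes a binary problem for the confusion counts; scikit-learn's sklearn.metrics module (accuracy_score, confusion_matrix, mean_absolute_error, mean_squared_error, r2_score) is the usual tool.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```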
Spot-Check Classification Algorithms
• Algorithm Spot-Checking
• Algorithms Overview
• Linear Machine Learning Algorithms
• Nonlinear Machine Learning
Algorithms
• Summary
Spot-Check Regression Algorithms
• Algorithms Overview
• Linear Machine Learning Algorithms
• Nonlinear Machine Learning
Algorithms
• Summary
Compare Machine Learning Algorithms
• Choose The Best Machine Learning
Model
• Compare Machine Learning
Algorithms Consistently
• Summary
Automate Machine Learning Workflows
with Pipelines
• Automating Machine Learning
Workflows
• Data Preparation and Modeling
Pipeline
• Feature Extraction and Modeling
Pipeline
• Summary
Improve Performance with Ensembles
• Combine Models Into Ensemble
Predictions
• Bagging Algorithms
• Boosting Algorithms
• Voting Ensemble
• Summary
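Of the ensemble methods listed, voting is the simplest to sketch: each model predicts a class and the most common prediction wins. The function name below is hypothetical and the sketch covers hard (majority) voting only; scikit-learn's VotingClassifier in sklearn.ensemble is the usual implementation.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class predictions from several models by hard voting.

    predictions_per_model holds one list of predictions per model,
    all of the same length. On a tie, the class predicted by the
    earliest model in the list wins.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

For example, three models predicting [1, 0, 1], [1, 1, 0] and [0, 1, 1] for three observations combine to the class 1 for each observation.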
data science student questions - 1
"So you do Data Science work. What does that really involve? And how is that different from programming, statistical work or data engineering?"
"I want to learn Data Science. Between R, Python and SAS, where should I start, and what are the pros and cons of each?"
"What are OOP (object-oriented programming) and structured programming, and what's the difference between them?"
"What are the main differences between Python 2.7 and Python 3.x? And why do so many developers stay with Python 2.7?"
"What is the difference between Supervised Learning and Unsupervised Learning?"
"What different graphs might a univariate analysis use compared to a bivariate analysis? Can you graph a multivariate analysis?"
"How do you explain machine learning to an 8-year-old child?"
"What is Gradient Descent?"
"What is multicollinearity and how can you overcome it?"
data science student questions - 2
"What is the curse of dimensionality?"
"What do you understand by Hypothesis in the context of Machine Learning?"
"What's the difference between a Test Set and a Validation Set?"
"What is cross-validation and what is it used for?"
"What's the difference between a Classification and Regression Tree algorithm and a Random Forest? And when is one better than the other?"
"What are the basic assumptions to be made for linear regression?"
"Can you explain in simple language what an Eigenvalue and an Eigenvector are?"
"Do gradient descent methods always converge to the same point?"
"What's the difference between continuous, ordinal and categorical variables?"
"What is K-means? How can you select K for K-means?"
data science student questions - 3
"Why is naive Bayes so ‘naive’?"
"OLS is to linear regression as Maximum likelihood is to logistic regression. Explain the statement."
"What do you understand by the bias-variance trade-off?"
"Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?"
"When does regularization become necessary in Machine Learning?"
"Explain a model and its dimensions to an 8-year-old."
"How do you determine and deal with correlated features in your data set, and how do you reduce the dimensionality of the data?"
"During analysis, how do you treat missing values?"
"What is Regularization and what kind of problems does regularization solve?"
Extras
the Data Scientist Roles
Roles Defined by 3 different Data Science Authors
Sources: Data Scientist Core Skills; How To Build A Successful Data Science Team; The seven people you need on your Big Data team
Capture – Data Engineer
• Handyman – Expert in Dell EDW, D3, BO, Hana/BMS, other RDBMS, and ETL work
• Open Source Guru (plus Data Modeler) – Hadoop stack, Cloudera, Linux, data structures and network
Analyze – Machine Learning Expert
• Data Modeler (plus all aspects of Data Engineer and Business Analyst) – SQL, RDBMS, Teradata, Dell infrastructure
• Deep Diver – Machine Learning, R, Python, SQL, ETL work, algorithm modeling, statistics
Present – Business Analyst
• Story Teller – PowerPoint, Design, Tableau, understands the customer's business and technical language, artistic eye
• Snoop (plus Handyman skills) – Enthusiastic, deeply creative, super savvy in Dell environments, finds contacts and is not hesitant to do work-arounds
• Privacy Wonk – Meticulous about Dell policy, socially aware, foresees roadblocks

Editor's notes

  1. https://www.datasciencecentral.com/profiles/blogs/6-predictions-about-data-science-machine-learning-and-ai-for-2018