Machine Learning
Data science for beginners, session 6
Machine Learning: your 5-7 things
Defining machine learning
The Scikit-Learn library
Machine learning algorithms
Choosing an algorithm
Measuring algorithm performance
Defining Machine Learning
Machine Learning = learning models from data
Which advert is the user most likely to click on?
Who’s most likely to win this election?
Which wells are most likely to fail in the next 6 months?
Machine Learning as Predictive Analytics...
Machine Learning Process
● Get data
● Select a model
● Select hyperparameters for that model
● Fit model to data
● Validate model (and change model, if necessary)
● Use the model to predict values for new data
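The six steps above can be sketched end to end in a few lines. This is a minimal illustration rather than the course's own notebook: the choice of the iris data, a k-nearest-neighbours classifier, the 25% hold-out size and the `new_flower` measurements are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Get data
X, y = load_iris(return_X_y=True)

# Hold back some rows so the model can be validated on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2-3. Select a model and its hyperparameters (n_neighbors is a hyperparameter)
model = KNeighborsClassifier(n_neighbors=5)

# 4. Fit model to data
model.fit(X_train, y_train)

# 5. Validate: mean accuracy on the held-out rows
test_accuracy = model.score(X_test, y_test)
print('Held-out accuracy: {:.2f}'.format(test_accuracy))

# 6. Use the model to predict values for new data
new_flower = [[5.0, 3.4, 1.5, 0.2]]  # made-up measurements, in cm
predicted_class = model.predict(new_flower)
print('Predicted class: {}'.format(predicted_class))
```

If the validation score is poor, go back to steps 2 and 3 and try a different model or different hyperparameters.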
Today’s library: Scikit-Learn (sklearn)
Scikit-Learn’s example datasets
● Iris
● Digits
● Diabetes
● Boston (removed from scikit-learn in version 1.2)
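A quick way to get a feel for these datasets is to load each one and look at its size. The sketch below assumes a recent scikit-learn install; Boston is left out because recent releases no longer ship `load_boston`.

```python
from sklearn import datasets

# Each loader returns a Bunch with .data (features) and .target (labels or values)
loaders = {
    'iris': datasets.load_iris,
    'digits': datasets.load_digits,
    'diabetes': datasets.load_diabetes,
}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()
    shapes[name] = bunch.data.shape
    print('{}: {} samples, {} features'.format(name, *bunch.data.shape))
```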
Select a Model
Algorithm Types
Supervised learning
Regression: learning numbers
Classification: learning classes
Unsupervised learning
Clustering: finding groups
Dimensionality reduction: finding efficient representations
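In scikit-learn the supervised/unsupervised split shows up in how `fit` is called: supervised models take the data and the expected outputs, unsupervised models take only the data. The sketch below picks one estimator per algorithm type; the specific estimators and the use of the iris data are illustrative choices, not the only options.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: fit() takes the data AND the expected outputs
# Regression: learn a number (petal width) from the other three measurements
regressor = LinearRegression().fit(X[:, :3], X[:, 3])
# Classification: learn a class (the iris variety)
classifier = KNeighborsClassifier().fit(X, y)

# Unsupervised: fit() takes only the data
# Clustering: find groups (n_init and random_state only make the run repeatable)
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Dimensionality reduction: go from 4 features to 2
reducer = PCA(n_components=2).fit(X)

print(regressor.predict(X[:1, :3]))  # a number
print(classifier.predict(X[:1]))     # a class label
print(clusterer.labels_[:5])         # group ids
print(reducer.transform(X[:1]))      # 2 coordinates instead of 4
```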
Linear Regression: fit a line to (numerical) data
Linear Regression: First, get your data
import numpy as np
import pandas as pd
gen = np.random.RandomState(42)
num_samples = 40
x = 10 * gen.rand(num_samples)
y = 3 * x + 7 + gen.randn(num_samples)
X = pd.DataFrame(x)
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(x, y)
Linear Regression: Fit model to data
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print('Slope: {}, Intercept: {}'.format(model.coef_, model.intercept_))
Linear Regression: Check your model
Xtest = pd.DataFrame(np.linspace(-1, 11))
predicted = model.predict(Xtest)
plt.scatter(x, y)
plt.plot(Xtest, predicted)
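Besides plotting, you can check the fit numerically. Because the data was generated as y = 3x + 7 plus noise, the fitted slope and intercept should land close to 3 and 7. The sketch below rebuilds the same data with plain NumPy (no pandas or plotting) and also prints R², which is what `model.score` returns for a regression model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as above: y = 3x + 7 plus unit-variance noise
gen = np.random.RandomState(42)
num_samples = 40
x = 10 * gen.rand(num_samples)
y = 3 * x + 7 + gen.randn(num_samples)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

model = LinearRegression(fit_intercept=True).fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)  # R^2 on the data the model was fitted to

print('Slope: {:.2f} (true value 3)'.format(slope))
print('Intercept: {:.2f} (true value 7)'.format(intercept))
print('R^2: {:.3f}'.format(r_squared))
```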
Reality can be a little more like this…
Classification: Predict classes
● Well pump: [working, broken]
● CV: [accept, reject]
● Gender: [male, female, others]
● Iris variety: [iris setosa, iris virginica, iris versicolor]
Classification: The Iris Dataset
[Figure: iris flower with petal and sepal labelled]
Classification: first get your data
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Classification: Split your data
ntest = 10
np.random.seed(0)
indices = np.random.permutation(len(X))
iris_X_train = X[indices[:-ntest]]
iris_Y_train = Y[indices[:-ntest]]
iris_X_test = X[indices[-ntest:]]
iris_Y_test = Y[indices[-ntest:]]
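scikit-learn also has a helper that does the shuffle and split in one call. The sketch below is an alternative to the manual split, not a reproduction of it: `train_test_split` with `random_state=0` holds out 10 rows too, but not necessarily the same 10 rows as the permutation above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)

# An integer test_size holds out exactly that many rows;
# random_state fixes the shuffle so the split is repeatable
iris_X_train, iris_X_test, iris_Y_train, iris_Y_test = train_test_split(
    X, Y, test_size=10, random_state=0)

print('Training rows: {}, test rows: {}'.format(
    len(iris_X_train), len(iris_X_test)))
```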
Classifier: Fit Model to Data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
knn.fit(iris_X_train, iris_Y_train)
Classifier: Check your model
predicted_classes = knn.predict(iris_X_test)
print('kNN predicted classes: {}'.format(predicted_classes))
print('Real classes: {}'.format(iris_Y_test))
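Comparing two printed arrays by eye gets tedious, so it helps to boil the comparison down to a number. The sketch below repeats the split and fit from the previous slides so it runs on its own, then computes the fraction of test rows predicted correctly and lists any mistakes by variety name.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Same hold-out split as before: the last 10 rows of a seeded shuffle
ntest = 10
np.random.seed(0)
indices = np.random.permutation(len(X))
iris_X_train, iris_Y_train = X[indices[:-ntest]], Y[indices[:-ntest]]
iris_X_test, iris_Y_test = X[indices[-ntest:]], Y[indices[-ntest:]]

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
knn.fit(iris_X_train, iris_Y_train)
predicted_classes = knn.predict(iris_X_test)

# Fraction of test rows where the prediction matches the real class
accuracy = np.mean(predicted_classes == iris_Y_test)
print('Accuracy on {} test rows: {:.2f}'.format(ntest, accuracy))

# List the rows the classifier got wrong, using variety names
for predicted, real in zip(predicted_classes, iris_Y_test):
    if predicted != real:
        print('Predicted {}, really {}'.format(
            iris.target_names[predicted], iris.target_names[real]))
```

With only 10 test rows, each mistake moves the accuracy by 0.1, so treat the number as a rough guide.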
Clustering: Find groups in your data
Clustering: get your data
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
print("Xs: {}".format(X))
Clustering: Fit model to data
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(iris.data)
Clustering: Check your model
print("Generated labels: \n{}".format(k_means.labels_))
print("Real labels: \n{}".format(Y))
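One catch when comparing the two label arrays: k-means numbers its clusters arbitrarily, so cluster 0 need not correspond to class 0. The sketch below shows two ways to compare groupings that ignore the numbering. Using the adjusted Rand index here is a choice made for this example, and `n_init` and `random_state` are set only to make the run repeatable.

```python
from sklearn import cluster, datasets, metrics

iris = datasets.load_iris()
X, Y = iris.data, iris.target

k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(X)

# Rows: real classes, columns: cluster ids. A good clustering puts most of
# each row into a single column, whichever column that happens to be.
table = metrics.confusion_matrix(Y, k_means.labels_)
print(table)

# Adjusted Rand index: 1.0 for identical groupings, close to 0 for random ones
ari = metrics.adjusted_rand_score(Y, k_means.labels_)
print('Adjusted Rand index: {:.2f}'.format(ari))
```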
Dimensionality Reduction
Dimensionality reduction: Get your data
Dimensionality reduction: Fit model to data
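The code for these two slides did not survive in this transcript. As a stand-in, here is a minimal sketch that follows the same two steps, assuming principal component analysis (PCA) on the iris data; the original slides may have used a different algorithm or dataset.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

# Get your data: 150 flowers, 4 measurements each
iris = datasets.load_iris()
X = iris.data

# Fit model to data: keep the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print('Shape before: {}, after: {}'.format(X.shape, X_2d.shape))
print('Share of variance kept by each component: {}'.format(
    pca.explained_variance_ratio_))
```

The two columns of `X_2d` can be passed straight to a scatter plot to look at all four measurements at once.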
Recap: Choosing an Algorithm
Have: data and expected outputs
Want numbers? Try regression algorithms
Want classes? Try classification algorithms
Have: just data
Want to find structure? Try clustering algorithms
Want to look at it? Try dimensionality reduction
Model Validation
How well does the model fit new data?
“Holdout sets”:
split your data into training and test sets
learn your model with the training set
get a validation score for your test set
Models are rarely perfect… you might have to change parameters or model
● underfitting: model not complex enough to fit the training data
● overfitting: model too complex; it fits the training data well but does badly on the test data
Overfitting and underfitting
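One way to see both problems is to fit the same kind of model at different complexities and compare training and test scores. The sketch below is an illustration on made-up data (a noisy sine curve), using decision trees of increasing depth; the data, the model and the depths are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: a sine curve with noise
gen = np.random.RandomState(0)
x = 10 * gen.rand(200)
y = np.sin(x) + 0.3 * gen.randn(200)
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
# max_depth=None lets the tree grow until it has memorised the training data
for depth in [1, 4, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('max_depth={}: train R^2 {:.2f}, test R^2 {:.2f}'.format(
        depth, *scores[depth]))
```

A low score on both sets points to underfitting; a near-perfect training score with a noticeably lower test score points to overfitting.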
The Confusion Matrix
                   Really “true”     Really “false”
Marked “true”      True positive     False positive
Marked “false”     False negative    True negative
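scikit-learn can count the four cells for you. The sketch below uses made-up labels for a two-class problem. Note that `confusion_matrix` puts the real classes in rows and the predicted classes in columns, with labels in sorted order, so for labels 0 and 1 the layout is `[[tn, fp], [fn, tp]]`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = "true" (positive), 0 = "false" (negative)
real = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

matrix = confusion_matrix(real, predicted)
print(matrix)

# ravel() flattens the 2x2 matrix row by row: tn, fp, fn, tp
tn, fp, fn, tp = matrix.ravel()
print('tp={}, fp={}, fn={}, tn={}'.format(tp, fp, fn, tn))
```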
Test Metrics
Precision:
of all the results the classifier marked as “true”, how many were actually “true”?
Precision = tp / (tp + fp)
Recall:
of all the things that were really “true”, how many did the classifier mark as “true”?
Recall = tp / (tp + fn)
F1 score:
harmonic mean of precision and recall
F1_score = 2 * precision * recall / (precision + recall)
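To make the formulas concrete, the sketch below counts tp, fp and fn by hand on made-up labels, applies the three formulas, and then checks the results against scikit-learn's own metric functions.

```python
from sklearn import metrics

# Made-up labels: 1 = "true" (positive), 0 = "false" (negative)
real = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Count the confusion-matrix cells by hand
pairs = list(zip(real, predicted))
tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # marked true, really true
fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # marked true, really false
fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # marked false, really true

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)
print('By hand: precision {:.2f}, recall {:.2f}, F1 {:.2f}'.format(
    precision, recall, f1_score))

# The same three numbers from scikit-learn
sk_precision = metrics.precision_score(real, predicted)
sk_recall = metrics.recall_score(real, predicted)
sk_f1 = metrics.f1_score(real, predicted)
print('sklearn: precision {:.2f}, recall {:.2f}, F1 {:.2f}'.format(
    sk_precision, sk_recall, sk_f1))
```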
Iris classification: metrics
from sklearn import metrics
print(metrics.classification_report(iris_Y_test, predicted_classes))
Exercises
Explore some algorithms
Notebooks 6.x contain examples of machine learning algorithms. Run them,
play with the numbers in them, break them, think about why they might have
broken.

Session 06 machine learning.pptx

  • 1. Machine Learning Data science for beginners, session 6
  • 2. Machine Learning: your 5-7 things Defining machine learning The Scikit-Learn library Machine learning algorithms Choosing an algorithm Measuring algorithm performance
  • 4. Machine Learning = learning models from data Which advert is the user most likely to click on? Who’s most likely to win this election? Which wells are most likely to fail in the next 6 months?
  • 5. Machine Learning as Predictive Analytics...
  • 6. Machine Learning Process ● Get data ● Select a model ● Select hyperparameters for that model ● Fit model to data ● Validate model (and change model, if necessary) ● Use the model to predict values for new data
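The six steps above can be sketched end to end in a few lines. This is a minimal illustration, assuming the synthetic y = 3x + 7 data used later in the deck; holding back the last 10 samples and the names `r2` and `new_predictions` are choices made for this sketch, not part of the original slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # select a model

# Get data: synthetic points scattered around y = 3x + 7
gen = np.random.RandomState(42)
x = 10 * gen.rand(40)
y = 3 * x + 7 + gen.randn(40)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

# Hold back the last 10 samples for validation (the samples are already random)
X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]

model = LinearRegression(fit_intercept=True)  # select hyperparameters
model.fit(X_train, y_train)                   # fit model to data
r2 = model.score(X_test, y_test)              # validate: R^2 on unseen data
new_predictions = model.predict(np.array([[2.0], [5.0]]))  # predict for new data

print('R^2 on held-out data: {:.3f}'.format(r2))
print('Predictions for x=2 and x=5: {}'.format(new_predictions))
```

If the validation score were poor, the "change model" step would mean going back to the model or hyperparameter choice and repeating the fit.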
  • 8. Scikit-Learn’s example datasets ● Iris ● Digits ● Diabetes ● Boston
  • 10. Algorithm Types Supervised learning Regression: learning numbers Classification: learning classes Unsupervised learning Clustering: finding groups Dimensionality reduction: finding efficient representations
  • 11. Linear Regression: fit a line to (numerical) data
  • 12. Linear Regression: First, get your data
        import numpy as np
        import pandas as pd
        gen = np.random.RandomState(42)
        num_samples = 40
        x = 10 * gen.rand(num_samples)
        y = 3 * x + 7 + gen.randn(num_samples)
        X = pd.DataFrame(x)
        %matplotlib inline
        import matplotlib.pyplot as plt
        plt.scatter(x, y)
  • 13. Linear Regression: Fit model to data
        from sklearn.linear_model import LinearRegression
        model = LinearRegression(fit_intercept=True)
        model.fit(X, y)
        print('Slope: {}, Intercept: {}'.format(model.coef_, model.intercept_))
  • 14. Linear Regression: Check your model
        Xtest = pd.DataFrame(np.linspace(-1, 11))
        predicted = model.predict(Xtest)
        plt.scatter(x, y)
        plt.plot(Xtest, predicted)
  • 15. Reality can be a little more like this…
  • 16. Classification: Predict classes ● Well pump: [working, broken] ● CV: [accept, reject] ● Gender: [male, female, others] ● Iris variety: [iris setosa, iris virginica, iris versicolor]
  • 17. Classification: The Iris Dataset Petal Sepal
  • 18. Classification: first get your data
        import numpy as np
        from sklearn import datasets
        iris = datasets.load_iris()
        X = iris.data
        Y = iris.target
  • 19. Classification: Split your data
        ntest = 10
        np.random.seed(0)
        indices = np.random.permutation(len(X))
        iris_X_train = X[indices[:-ntest]]
        iris_Y_train = Y[indices[:-ntest]]
        iris_X_test = X[indices[-ntest:]]
        iris_Y_test = Y[indices[-ntest:]]
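Scikit-Learn also ships a helper that performs this shuffle-and-split in one call. The sketch below is an alternative to the manual permutation on the slide, not the deck's own method; passing an integer `test_size` asks for that many test rows, and the variable names are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# test_size=10 (an int) keeps exactly 10 rows for testing;
# random_state fixes the shuffle so the split is repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0)

print('{} training points, {} test points'.format(len(X_train), len(X_test)))
```

The rows selected will generally differ from the manual split on the slide, because the two approaches shuffle differently.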
  • 20. Classifier: Fit Model to Data
        from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
        knn.fit(iris_X_train, iris_Y_train)
  • 21. Classifier: Check your model
        predicted_classes = knn.predict(iris_X_test)
        print('kNN predicted classes: {}'.format(predicted_classes))
        print('Real classes: {}'.format(iris_Y_test))
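Comparing the two printed arrays by eye works for ten points, but a single number is easier to track. The sketch below repeats the split and fit from the previous slides so that it runs on its own, then computes the fraction of correct predictions; the name `accuracy` is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Same manual holdout split as on the slides
ntest = 10
np.random.seed(0)
indices = np.random.permutation(len(X))
iris_X_train, iris_Y_train = X[indices[:-ntest]], Y[indices[:-ntest]]
iris_X_test, iris_Y_test = X[indices[-ntest:]], Y[indices[-ntest:]]

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
knn.fit(iris_X_train, iris_Y_train)
predicted_classes = knn.predict(iris_X_test)

# Accuracy: fraction of test points whose predicted class matches the real one
accuracy = np.mean(predicted_classes == iris_Y_test)
print('Accuracy on the test set: {:.2f}'.format(accuracy))
```

`knn.score(iris_X_test, iris_Y_test)` returns the same quantity without the explicit comparison.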
  • 22. Clustering: Find groups in your data
  • 23. Clustering: get your data
        from sklearn import datasets
        iris = datasets.load_iris()
        X = iris.data
        Y = iris.target
        print("Xs: {}".format(X))
  • 24. Clustering: Fit model to data
        from sklearn import cluster
        k_means = cluster.KMeans(3)
        k_means.fit(iris.data)
  • 25. Clustering: Check your model
        print("Generated labels: \n{}".format(k_means.labels_))
        print("Real labels: \n{}".format(Y))
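When reading the two label arrays, bear in mind that the cluster numbers k-means assigns are arbitrary: cluster 0 need not correspond to class 0. One way to compare them regardless of numbering is the adjusted Rand index. This is a sketch of that idea, not something the slides do; `n_init` and `random_state` are set here only to make the run repeatable.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
Y = iris.target

k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(iris.data)

# Adjusted Rand index: 1.0 means identical groupings, about 0 means chance level.
# It ignores which number each cluster happens to be given.
ari = adjusted_rand_score(Y, k_means.labels_)

# Renumbering the clusters leaves the score unchanged
renumbered = (k_means.labels_ + 1) % 3
ari_renumbered = adjusted_rand_score(Y, renumbered)

print('Adjusted Rand index: {:.3f}'.format(ari))
```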
  • 29. Recap: Choosing an Algorithm Have: data and expected outputs Want numbers? Try regression algorithms Want classes? Try classification algorithms Have: just data Want to find structure? Try clustering algorithms Want to look at it? Try dimensionality reduction
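For the "want to look at it?" branch, a minimal sketch of dimensionality reduction is to project the four Iris features onto two principal components, which can then be drawn as a scatter plot. PCA is one choice among several and is an assumption of this example; the recap slide does not name a specific algorithm.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Reduce the 4 measured features to 2 derived ones
pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)

print('Reduced shape: {}'.format(X_2d.shape))
print('Variance explained per component: {}'.format(pca.explained_variance_ratio_))
```

Plotting `X_2d[:, 0]` against `X_2d[:, 1]`, coloured by `iris.target`, gives a two-dimensional view of the dataset.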
  • 31. How well does the model fit new data? “Holdout sets”: split your data into training and test sets, learn your model with the training set, then get a validation score for your test set. Models are rarely perfect… you might have to change parameters or model.
        ● underfitting: model not complex enough to fit the training data
        ● overfitting: model too complex; fits the training data well, does badly on test
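To see underfitting and overfitting in numbers, one option is to fit polynomials of increasing degree to noisy data and compare the error on the training set with the error on the holdout set. The sine-wave data, the degrees (1, 3, 9) and the helper `mse` are all assumptions made for this sketch.

```python
import numpy as np

# Noisy samples from a sine curve: a straight line cannot follow this shape
gen = np.random.RandomState(0)
x = gen.rand(40)
y = np.sin(2 * np.pi * x) + 0.2 * gen.randn(40)

# Holdout split: 30 training points, 10 test points
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the points (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print('degree {}: train MSE {:.3f}, test MSE {:.3f}'.format(degree, *results[degree]))
```

Training error can only go down as the degree rises, so it says little on its own; the test error is the number to watch when deciding whether the extra complexity is helping.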
  • 33. The Confusion Matrix True positive False positive False negative True negative
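The four cells can be counted with Scikit-Learn's `confusion_matrix`. The labels below are made-up data for illustration. Note that Scikit-Learn puts the real classes on the rows and the predicted classes on the columns, with the negative class first, so the cell order may differ from the picture on the slide.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: 1 = "true"/positive, 0 = "false"/negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1] the matrix is [[tn, fp], [fn, tp]]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print('tp={}, fp={}, fn={}, tn={}'.format(tp, fp, fn, tn))
```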
  • 34. Test Metrics
        Precision: of all the results the classifier marked as “true”, how many were actually “true”? Precision = tp / (tp + fp)
        Recall: how many of the things that were really “true” were marked as “true” by the classifier? Recall = tp / (tp + fn)
        F1 score: harmonic mean of precision and recall. F1_score = 2 * precision * recall / (precision + recall)
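As a worked example of the three formulas, here they are applied to a set of made-up counts (8 true positives, 2 false positives, 4 false negatives). The counts are illustrative only.

```python
# Made-up counts from a hypothetical classifier
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # 8 of the 10 items marked "true" really were
recall = tp / (tp + fn)     # 8 of the 12 really-"true" items were found
f1_score = 2 * precision * recall / (precision + recall)

print('Precision: {:.3f}'.format(precision))
print('Recall:    {:.3f}'.format(recall))
print('F1 score:  {:.3f}'.format(f1_score))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a plain average would.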
  • 35. Iris classification: metrics
        from sklearn import metrics
        print(metrics.classification_report(iris_Y_test, predicted_classes))
  • 37. Explore some algorithms Notebooks 6.x contain examples of machine learning algorithms. Run them, play with the numbers in them, break them, think about why they might have broken.

Editor's notes

  1. What you’re learning isn’t the data, but a model that will help you understand (and possibly also explain) it.
  2. We bother making models because we want to start asking questions, and (hopefully) making changes in our world. Image from http://www.rosebt.com/blog/descriptive-diagnostic-predictive-prescriptive-analytics
  3. AKA import-instantiate-fit-predict. Hyperparameter: things like “how many clusters of data do I think there are in this dataset?”
  4. Lots of great tutorials on http://scikit-learn.org/stable/. You import from this library, which is called “sklearn” in Python code.
  5. Iris image from Nociveglia https://www.flickr.com/photos/40385177@N07/.
  6. Supervised versus unsupervised learning: supervised = give the algorithm both input data and the answers for that data (kinda like teaching), and it learns the connection between data and answers; unsupervised = give the algorithm just the data, and it finds the structure in that data Semi-supervised learning (where you only have a few answers) does exist, but isn’t talked about much. There’s also reinforcement learning, where you know if a result is better or worse, but not how much it’s better or worse.
  7. Fit a line to a set of datapoints. Use that line to predict new values
  8. This will give you 40 random samples around the line y = 3x + 7. rand draws from a uniform distribution on [0, 1); randn draws from a standard normal distribution.
  9. Note the hyperparameter (fit_intercept). Setting it to True tells the model to fit an intercept, so the line isn’t forced through (0,0).
  10. predicted_slope = model.coef_; predicted_intercept = model.intercept_
  11. 1-feature linear regression on the Diabetes dataset. This is where you need to change your model. In this case, you’d start by trying more features, then adapting the model hyperparameters (e.g. it might not be a straight line that you need to fit) or the model that you use (e.g. linear regression might not be the best model type to use on this dataset).
  12. When there are just two classifications, it’s called binary classification.
  13. Classification: finding the link between data and classes. This is the Iris dataset. It’s one of Scikit-learn’s example datasets.
  14. print("Targets: {}".format(iris['target_names']))
        print("Target data: {}".format(Y))
        print("Features: {}".format(iris['feature_names']))
        print("Feature data: {}".format(X))
  15. Why do we split into training and test sets? This is called a “holdout” set… we save some of our data, so we can use it to check how well our classifier does on data it hasn’t seen before.
        print('{} training points, {} test points'.format(len(iris_X_train), len(iris_X_test)))
  16. This is the k nearest neighbours algorithm. For every new datapoint, it looks at the k nearest datapoints it has classifications for, and assigns the new datapoint the class that’s most common amongst them. Here, we’re using 5 neighbours. We’re also using the Minkowski distance (https://machinelearning1.wordpress.com/2013/03/25/three-famous-metrics-manhattan-euclidean-minkowski/): this tells the algorithm how to compute the distance between two points, so we can define which points are ‘closest’. Common distance metrics you’ll see in machine learning include: Manhattan, or “city block” distance: add the distance along the x axis to the distance along the y axis (“city block” because that’s how you navigate in Manhattan). Euclidean distance: calculate the straight-line distance between the two points (e.g. sqrt(dx^2 + dy^2), where dx and dy are the differences along each axis). Minkowski distance: a generalisation of both, with an exponent p: p=1 gives Manhattan distance and p=2 gives Euclidean distance. Scikit-learn’s KNeighborsClassifier defaults to p=2, so metric='minkowski' on its own means Euclidean distance.
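The three distances can be computed directly to see how they relate. The points `a` and `b` and the helper `minkowski` are made up for this sketch; Scikit-Learn computes these internally when you pass `metric` and `p` to the classifier.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

a = [1, 2]
b = [4, 6]  # differs from a by 3 along x and 4 along y

manhattan = minkowski(a, b, 1)  # 3 + 4
euclidean = minkowski(a, b, 2)  # sqrt(3^2 + 4^2)
order_3 = minkowski(a, b, 3)    # (3^3 + 4^3)^(1/3)

print('Manhattan (p=1): {}'.format(manhattan))
print('Euclidean (p=2): {}'.format(euclidean))
print('Minkowski (p=3): {:.4f}'.format(order_3))
```

As p grows, the largest single-axis difference dominates the result.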
  17. This is the digits example dataset.
  18. This is all in notebook 6.5
  19. There’s no “best” algorithm for every problem. This is also known as the “no free lunch” theorem. If you have data and only an estimate of whether a result is better or worse, try reinforcement learning. There are lots of variants on these algorithms: the Scikit-learn cheat sheet will help you choose between them: http://scikit-learn.org/stable/tutorial/machine_learning_map/
  20. Overfitting: matches the training data well, performs badly on new data… has high variance. Underfitting: doesn’t match the training data well, and usually performs poorly on new data too… has high bias. Bias/variance tradeoff: adjust your hyperparameters until the model performs well on held-out data. See e.g. http://scott.fortmann-roe.com/docs/BiasVariance.html
  21. This is all about your parameters, e.g. the difference between fitting a straight line, a quadratic curve or an nth-degree polynomial. Figures from Jake VanderPlas’ Python Data Science Handbook. We’ll talk about the bias-variance tradeoff later.
  22. False positive is also known as a “type 1 error”; false negative is also known as a “type 2 error”.
  23. These numbers are always between 0 and 1. If you want to play with F1, try it in Python, e.g.:
        import numpy as np
        p = np.array([.25, .25, .125, .5, .75])
        r = np.array([.001, .10, .7, .9, .3])
        2*p*r / (p + r)
  24. Support: how many things that are actually this class did we use to calculate these metrics? Precision: of all the results the classifier marked as “true”, how many were actually “true”? Recall: how many of the things that were really “true” were marked as “true” by the classifier? F1: combination (harmonic mean) of precision and recall.