Intro to Scikit-Learn and StatsModels for the
Absolute Beginner
Jennifer D. Davis, Ph.D.
July 2, 2015
“Risks, I like to say, always pay off. You
learn what to do, or what not to do.”
- Dr. Jonas E Salk
Outline
• Machine learning and statistics, tools of the data
scientist
• Why python?
• Popular Scikit Learn ML algorithms
• Popular StatsModels algorithms
• History of Scikit Learn and StatsModels
• A use case: Polio Rates and Vaccination in the
United States
Note: many of the slides contain lots of text or notes so you
don’t need to take written notes. At the same time, this is
a talk for absolute beginners and so we present in a fairly
non-technical manner.
Experienced audience members may find some information
lacking detail or caveats. Additional information/tutorial
will be available in a Jupyter notebook on github.
Why Python?
• Well developed scripting language that can also be
utilized for software development & is *scalable*
• Well developed machine learning libraries backed
by developers at Google among other places
• Runs on C/C++ in the background, so complex
computations can run faster or be ported to C
• Runs on big data platforms like Spark (pySpark)
• Plays nicely with other programming languages (see
Jython & Cython for porting to Java and C
respectively, other methods work too)
Why Machine Learning?
• Machine learning is a subset of Artificial
Intelligence which, as the name suggests, uses
mathematics to mimic learning and intelligence
• Machine learning takes complex data,
mathematically models it (using training data)
under tunable parameters, and allows for
predictions or assessments of *individuals* within
or compared to groups
• Machine learning includes network analysis, deep
learning, probability density graphs, supervised
learning, unsupervised learning, dimensionality
reduction (e.g. PCA) and other techniques
Why Statistics?
• Statistics applies mathematics to data to build
a ‘model’ of how observations fit into a ‘big picture’
• Statistical analyses often include correlations,
assessments of how ‘good’ a model is based on error rates
or population fits
• Statistics is an essential part of the data science
repertoire, but data scientists do not *rely* on statistics alone
• Examples include ANOVA, Pearson’s Correlation, ROC Curves
(assesses various models), Time Series, Regressions
• Statistical techniques can be applied to Machine Learning
algorithms to determine how effective, accurate or
predictive the algorithm is, but they are not the only
method
• Examples include: PPV, NPV, ROC Curves
How do Statistics & Machine Learning
Relate to One Another?
• Statistical methods are often used to assess the
performance of a machine learning algorithm, but the
statistical test itself does not need data to be ‘tuned’
• Some statistical tests can be utilized as machine
learning algorithms (e.g. log-odds regressions etc.)
• While Statistics is not generally considered part of
artificial intelligence, it can be used to determine
the accuracy, learning rate and other parameters tied
to AI & Machine Learning.
• Machine learning algorithms use training data to tune
their parameters. Remember the musician whose
instrument is out of tune? We don’t want that
(under-fitting). And we don’t want the musician
tuned only to themselves, but differently from the
rest of the band: that’s over-fitting.
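A minimal sketch of the tuning analogy, assuming synthetic data (a noisy sine curve) rather than anything from this talk. Polynomials of increasing degree are fitted to a small training set and scored on held-out data. The degrees and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic data (assumption): a sine curve plus Gaussian noise.
    x = np.sort(rng.uniform(0.0, 1.0, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)    # data used to tune the model
x_test, y_test = make_data(200)     # held-out data used only for scoring

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The straight line (degree 1) is too simple for the curve and scores poorly on both sets, which is under-fitting. A high degree always lowers the training error, but the test error is the number to watch: when it rises while the training error keeps falling, the model is over-fitting.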
The Top 5 Machine Learning Algorithms for Data Science
Available in Scikit-Learn
• PageRank (Principal Eigenvector)
• AdaBoost (Ensemble Learning)
• kNN (K-nearest neighbor Classification)
• Principal Component Analysis (dimensionality reduction)
• Neural Network Models (example, Restricted Boltzmann
machines)
The Top 5 Statistical Models for Data Science Available in
StatsModels
• Generalized linear models (e.g. ordinary least squares
regressions)
• Nonparametric estimators
• Analysis of Variance
• Time Series Analysis
• Survival Analysis
Scikit Learn: History & Development
• Project started in 2007 as a Google Summer of Code project
by David Cournapeau.
• Matthieu Brucher then took it up as part of his thesis
work.
• In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort &
Vincent Michel of INRIA took project leadership
• The first public release was February 1, 2010
• Since then, releases have appeared about every 3 months
• A great community exists, so if you’d like to contribute
your own code for machine-learning algorithms contact the
scikit-learn team.
StatsModels History & Development
• Statsmodels is a Python library that provides classes &
functions for estimating many statistical models
• It is useful for conducting tests such as ANOVA, ARMA,
time-series, various flavors of regressions
• Results are tested against existing statistical packages to
ensure accuracy
• For those of you who are used to R, you can fit models
using R-style formulas
• The modules were originally from scipy.stats written by
Jonathan Taylor. It was later expanded and moved.
• As part of the Google Summer of Code 2009, statsmodels was
tested, improved and released as a package. Since then a
team of developers from Google and AWR has supported the
development. Coding practices (e.g. use of PEP 8) are
overseen through code review by the project maintainers.
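A minimal sketch of the R-style interface, assuming a small made-up data frame (the column names `x` and `y` are hypothetical). The formula string "y ~ x" reads the same way it would in R's lm().

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data (assumption): y is roughly 1 + 2*x with a small wobble.
df = pd.DataFrame({"x": np.arange(10)})
df["y"] = 1.0 + 2.0 * df["x"] + np.tile([0.1, -0.1], 5)

# "y ~ x" means: model y as a function of x, with an intercept.
model = smf.ols("y ~ x", data=df).fit()

print(model.params)      # fitted intercept and slope
print(model.summary())   # R-like regression table
```

The fitted slope should come out close to 2 and the intercept close to 1, since that is how the data were built.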
Use Cases: Scikit-Learn
• Classification – identify which category an object or
person belongs to, e.g. spam detection or image
recognition, or which of you will pay more than $40, $75 or
$100 for a pair of shoes?
• Regression – predicting continuous-value attributes
associated with an object, e.g. patient drug response based
on other factors
• Clustering – grouping similar objects into sets, e.g.
customer segmentation, grouping experimental outcomes
• Dimensionality reduction (reducing the number of variables
included in ML analyses), see my github for example
• Model selection – comparing, cross-validating, choosing
tuning parameters & metrics
• Preprocessing (yes, this is important!!!) – feature
extraction & normalization, transforming input data such as
text, into a vector or representation that can be used by a
ML algorithm
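A minimal sketch tying three of these use cases together (preprocessing, classification and model selection), assuming the Iris dataset bundled with scikit-learn rather than data from this talk. The choice of kNN with 5 neighbors and 5-fold cross-validation is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Preprocessing (scaling) and classification chained into one estimator,
# so the scaler is re-fitted on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Model selection: 5-fold cross-validated accuracy.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.2f}")
```

Putting the scaler inside the pipeline keeps information from the held-out fold from leaking into the preprocessing step.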
Use Cases: StatsModels
• Linear regression models (I will show an example,
but not the best example)
• Plotting data to assess its fit – are you
overfitting or underfitting or just right?
• Discrete Choice Models – regressions for
categorical outcomes (e.g. logit, probit)
• Nonparametric Statistics – methods for data
not normally distributed (e.g. kernel density estimates)
• Generalized Linear Models – other flavors of
regressions
• Robust Regression – more regressions!
• Time Series Analyses – used in Fraud Detection
• Others such as ANOVA, Kernel Density & Survival
Analyses
Polio Virus
• Polio Virus (PV) is an RNA-based virus
• The first epidemic was in 1894. During the late 1940s & 1950s,
polio crippled more than 35,000 people per year in the
US
• PV is still present in the populations of some developing countries
• President Franklin D. Roosevelt, a Polio survivor,
helped to found the March of Dimes. His intent was to
raise funds to develop a Polio Vaccine.
• Vaccine was invented by Dr. Jonas Salk
• US has been polio-free since 1979
Health Data: Polio Rates and Vaccination in
the United States
• Polio is caused by an RNA virus and can lead to
myelitis, respiratory problems and sometimes
paralysis
• Vaccination started in late 1950s & early
1960s
• Some info about the dataset
– Data begins in 1916
– Gathered by Centers for Disease Control
– Downloaded from healthdata.gov
Analysis Work Flow Polio Data I
• Hypothesis 1: Polio Rates Decreased due to
Vaccination
• Take a peek at the data & check for:
– “Missing-ness”
– Number of observations and types of
observations
– Perform an initial visualization
• Perform a regression analysis to determine
whether the use of vaccines was correlated
to an exponential drop in Polio rates
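A minimal sketch of the first-look checks above, assuming a tiny stand-in table. The column names and values are placeholders made up for illustration and are not the CDC figures; with the real file you would start from pd.read_csv instead.

```python
import numpy as np
import pandas as pd

# Placeholder data (assumption): NOT the real healthdata.gov values.
df = pd.DataFrame({
    "year": [1950, 1951, 1952, 1953, 1954, 1955],
    "state": ["TX", "TX", "TX", "NY", "NY", "NY"],
    "cases_per_100k": [12.0, np.nan, 20.5, 15.1, np.nan, 9.8],
})

n_rows, n_cols = df.shape               # number of observations and columns
missing_per_column = df.isna().sum()    # "missing-ness" per column
column_types = df.dtypes                # types of observations
summary = df["cases_per_100k"].describe()

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(column_types)
print(summary)
```

For the initial visualization, df.plot(x="year", y="cases_per_100k") is one simple starting point once matplotlib is installed.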
What ALL the Data Looks Like…
Assumptions using ALL the data (aggregated data) can lead to
results that are less than interpretable or misleading…this
graph makes it seem that vaccine was irrelevant as the Polio
rates decreased exponentially before the vaccinations
started…But is that true?
Some of the Code
It’s good practice to import all the libraries and modules
you will use at the top of your code file when doing ad
hoc analyses. Jupyter notebook will be provided in github.
Some more of the Code (our regression)
Summary Outcome
• Alternate hypothesis: Rates of decreased
incidence of Polio differed by state.
• Our linear regression was not a good fit
using Ordinary Least Squares and the
aggregated data might have been misleading
• There was significant skew and kurtosis
• Either a log-odds regression with a different
distribution family chosen OR a non-
parametric test would be more appropriate for
this data considering skew; alternatively
transforming to normal distribution can be
appropriate
Analysis Work Flow Polio Data II
• Hypothesis 2: Polio rates decreased at
different rates depending upon area of the
country
• Take a peek at the data & check for:
– Perform an initial visualization based upon state
(we are keeping things simplistic by choosing a
state in the north, south, east, west)
• Perform a time-series analysis to determine
whether Polio rates were decreasing significantly
between 1945-1965 (slightly before and
slightly after vaccinations began) or whether the
decrease was constant. This analysis will be
available in the Jupyter notebook.
What the data Looks like..
The visualization of data that is not aggregated,
but rather separated by state, shows a bimodal
(two-peaked) pattern, not an exponential decline.
Insights & Future Action Points for Polio
Study
• Vaccination had an effect, which created an initial
dip in Polio levels not long after vaccination began.
• Although the rate of polio decreased in response to
vaccinations with a moderate decline, the incidences
rose again.
• Ultimately vaccination and public health measures
were able to wipe out new incidences of Polio from
the US--but not until 1979, decades after the vaccine
was first administered
• Population rates of disease do not necessarily
correlate with vaccination
• Vigilance and population-level prevention should be
supplemented (not replaced) with vaccination
Example 2: K-means Clustering of Iris
Dataset
• Quick example of visual analysis & K-means
clustering using the canonical ‘Iris’ Dataset
• This dataset includes different examples of
Iris Flowers along with their physical
features
• We are taking a simple example directly from
the scikit-learn library, but I will also add
an example of cluster analysis for the Polio
data at a later point in the Jupyter notebook
within my github repository
Some of the Code
Output for K-means Clustering
Insights: As we might have
guessed there are
3 clusters for most feature
combinations, and these are
generally separate for each
type of flower—but not always!
Can you see where this isn’t
true?
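A minimal sketch of K-means on the Iris data, assuming 3 clusters and a fixed random seed. It is not the code from the slide, which was shown as an image. Instead of a plot, it prints a table of species against cluster so the overlap can be read off directly.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# K-means never sees the species labels; it only groups the measurements.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(iris.data)

# Compare the clusters found with the true species afterwards.
species = iris.target_names[iris.target]
table = pd.crosstab(species, labels, rownames=["species"], colnames=["cluster"])
print(table)
```

Rows where the counts are spread over more than one cluster are the flowers that K-means could not separate cleanly. The cluster numbers themselves are arbitrary labels.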
The End
• Thank you ObjectRocket & Rackspace for sponsoring
PyLadies ATX and this talk!
• Where to find the data: www.healthdata.gov
• Where to find all of the Code:
https://github.com/jddavis-100/Statistics-and-Machine-Learning/wiki/Welcome-&-Table-of-Contents
• Where to find the Jupyter Notebook: I will be
providing it to Sara Safavi so contact her soon. You
can also find a static copy of it on my wiki (soon).
• Where to have fun: start on 6th & make your way to
Rainey…or out to Salt Lick Grill or ACL festival in
Zilker Park…or…any number of awesome places in ATX!
A very simplistic Confusion Matrix
Understanding the true positives, true negatives, false
positives and false negatives allows us to calculate
accuracy & precision. We can also use this analysis on
both the test and the training data. Other measures such as
marginal error are sometimes used.
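A minimal sketch of these calculations, assuming ten made-up binary labels (1 = positive, 0 = negative). It uses scikit-learn’s confusion_matrix and then computes accuracy, precision (PPV) and NPV by hand from the four counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels (assumption): true classes and a model's predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all calls that are right
precision = tp / (tp + fp)                   # PPV: how many predicted positives are real
npv = tn / (tn + fn)                         # NPV: how many predicted negatives are real

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} NPV={npv:.2f}")
```

Here there are 3 true positives, 5 true negatives, 1 false positive and 1 false negative, giving an accuracy of 0.80 and a precision of 0.75.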

Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

Intro scikitlearnstatsmodels

  • 1. Intro to Scikit-Learn and StatsModels for the Absolute Beginner Jennifer D. Davis, Ph.D. July 2, 2015
  • 2. “Risks, I like to say, always pay off. You learn what to do, or what not to do.” - Dr. Jonas E Salk
  • 3. Outline • Machine learning and statistics, tools of the data scientist • Why python? • Popular Scikit Learn ML algorithms • Popular StatsModels algorithms • History of Scikit Learn and StatsModels • A use case: Polio Rates and Vaccination in the United States Note: many of the slides contain lots of text or notes so you don’t need to take written notes. At the same time, this is a talk for absolute beginners and so we present in a fairly non-technical manner. Experienced audience members may find some information lacking detail or caveats. Additional information/tutorial will be available in a Jupyter notebook on github.
  • 4. Why Python? • Well developed scripting language that can also be utilized for software development & is *scalable* • Well developed machine learning libraries backed by developers at Google among other places • Runs on C/C++ in the background, so complex computations can run faster or be ported to C • Runs on big data platforms like Spark (pySpark) • Plays nicely with other programming languages (see Jython & Cython for porting to Java and C respectively, other methods work too)
  • 5. Why Machine Learning? • Machine learning is a subset of Artificial Intelligence that, as the name suggests, uses mathematics to mimic learning • Machine learning takes complex data, mathematically models it (using training data) under tunable parameters, and allows for predictions or assessments of *individuals* within or compared to groups • Machine learning includes network analysis, deep learning, probability density graphs, supervised learning, unsupervised learning, dimensionality reduction (e.g. PCA), support vector machines (SVM) and other techniques
  • 6. Why Statistics? • Statistics applies mathematics to data to build a ‘model’ of how observations fit into a ‘big picture’ • Statistical analyses often include correlations and assessments of how ‘good’ a model is, based on error rates or population fits • Statistics is an essential part of the data science repertoire, but data scientists do not *rely* on statistics alone • Examples include ANOVA, Pearson’s Correlation, ROC Curves (for assessing models), Time Series, Regressions • Statistical techniques can be applied to Machine Learning algorithms to determine how effective, accurate or predictive the algorithm is, but they are not the only method • Examples include: PPV, NPV, ROC Curves
  • 7. How do Statistics & Machine Learning Relate to One Another? • Statistical methods are often used to assess the performance of a machine learning algorithm, but the statistical test itself does not need data to ‘tune’ it • Some statistical tests can be utilized as machine learning algorithms (e.g. log-odds regressions etc.) • While Statistics is not generally considered part of artificial intelligence, it can be used to determine the accuracy, learning rate and other parameters tied to AI & Machine Learning. • Machine learning algorithms use training data to tune their parameters. Remember the musician whose instrument is out of tune? We don’t want that (under-fitting). And we don’t want the musician tuned only to themselves, but differently from the rest of the band; that’s over-fitting.
  • 8. The Top 5 Machine Learning Algorithms for Data Science Available in Scikit-Learn • PageRank (Principal Eigenvector) • AdaBoost (Ensemble Learning) • kNN (K-nearest neighbor Classification) • Principal Component Analysis (dimensionality reduction) • Neural Network Models (example, Restricted Boltzmann machines)
  • 9. The Top 5 Statistical Models for Data Science Available in StatsModels • Generalized linear models (e.g. ordinary least squares regressions) • Nonparametric estimators • Analysis of Variance • Time Series Analysis • Survival Analysis
  • 10. Scikit Learn: History & Development • Project started in 2007 as a Google Summer of Code project by David Cournapeau. • Matthieu Brucher then took it up as part of his thesis work. • In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort & Vincent Michel of INRIA took project leadership • The first public release was February 1, 2010 • Since then, releases have appeared about every 3 months • A great community exists, so if you’d like to contribute your own code for machine-learning algorithms, contact the scikit-learn team.
  • 11. StatsModels History & Development • Statsmodels is a Python library that provides classes & functions for estimation of many statistical models • It is useful for conducting tests such as ANOVA, ARMA, time-series, various flavors of regressions • Results are tested against existing statistical packages to ensure accuracy • For those of you who are used to R, you can fit models using R-style formulas • The modules were originally from scipy.stats written by Jonathan Taylor. It was later expanded and moved. • As part of the Google Summer of Code 2009, statsmodels was tested, improved and released as a package. Since then a team of developers from Google and AWR have supported the development. To oversee coding practices (i.e. use of PEP 8) python.org typically reviews modules/libraries.
  • 12. Use Cases: Scikit-Learn • Classification – identifying which category an object or person belongs to, e.g. spam detection or image recognition, or which of you will pay more than $40, $75 or $100 for a pair of shoes? • Regression – predicting continuous-value attributes associated with an object, e.g. patient drug response based on other factors • Clustering – grouping similar objects into sets, e.g. customer segmentation, grouping experimental outcomes • Dimensionality reduction (reducing the number of variables included in ML analyses), see my github for example • Model selection – comparing, cross-validating, choosing tuning parameters & metrics • Preprocessing (yes, this is important!!!) – feature extraction & normalization, transforming input data such as text into a vector or representation that can be used by a ML algorithm
  • 13. Use Cases: StatsModels • Linear regression models (I will show an example, but not the best example) • Plotting data to assess its fit – are you over fitting or under fitting or just right? • Discrete Choice Models – how good is your regression and other uses • Nonparametric Statistics – e.g. t-tests for data not normally distributed • General Linear Models – other flavors of regressions • Robust Regression – more regressions! • Time Series Analyses – used in Fraud Detection • Others such as ANOVA, Kernel Density & Survival Analyses
  • 14. Polio Virus • Polio Virus (PV) is an RNA-based virus • First epidemic was 1894. During the late 1940s & 1950s, polio crippled more than 35,000 people per year in the US • PV is still present in the population of 3rd world countries • President Franklin D. Roosevelt, a Polio survivor, helped to found the March of Dimes. His intent was to raise funds to develop a Polio Vaccine. • Vaccine was invented by Dr. Jonas Salk • US has been polio-free since 1979
  • 15. Health Data: Polio Rates and Vaccination in the United States • Polio is a viral RNA strand that causes myelitis, respiratory problems and sometimes paralysis • Vaccination started in late 1950s & early 1960s • Some info about the dataset – Data begins in 1916 – Gathered by Centers for Disease Control – Downloaded from healthdata.gov
  • 16. Analysis Work Flow Polio Data I • Hypothesis 1: Polio Rates Decreased due to Vaccination • Take a peek at the data & check for: – “Missing-ness” – Number of observations and types of observations – Perform an initial visualization • Perform a regression analysis to determine whether the use of vaccines was correlated to an exponential drop in Polio rates
  • 17. What ALL the Data Looks Like… Assumptions using ALL the data (aggregated data) can lead to results that are less than interpretable or misleading… this graph makes it seem that the vaccine was irrelevant, as the Polio rates decreased exponentially before the vaccinations started… But is that true?
  • 18. Some of the Code It’s good practice to import all the libraries and modules you will use at the top of your code file when doing ad hoc analyses. Jupyter notebook will be provided in github.
  • 19. Some more of the Code (our regression)
  • 20. Summary Outcome • Alternate hypothesis: Rates of decreased incidence of Polio differed by state. • Our linear regression was not a good fit using Ordinary Least Squares and the aggregated data might have been misleading • There was significant skew and kurtosis • Either a log-odds regression with a different distribution family chosen OR a non-parametric test would be more appropriate for this data considering skew; alternatively transforming to normal distribution can be appropriate
  • 21. Analysis Work Flow Polio Data 2: • Hypothesis 2: Polio rates decreased at different rates depending upon area of the country • Take a peek at the data & check for: – Perform an initial visualization based upon state (we are keeping things simplistic by choosing a state in the north, south, east, west) • Perform a time-series analysis to determine whether Polio rates were decreasing significantly between 1945-1965 (slightly before and slightly after vaccinations began) or whether it was a constant decrease. This analysis will be available in the Jupyter notebook.
  • 22. What the Data Looks Like… The visualization of data that is not aggregated, but rather separated by state, shows a bimodal distribution, not an exponential decline.
  • 23. Insights & Future Action Points for Polio Study • Vaccination had an effect, which created an initial dip in Polio levels not long after vaccination began. • Although the rate of polio decreased in response to vaccinations with a moderate decline, the incidences rose again. • Ultimately vaccination and public health measures were able to wipe out new incidences of Polio from the US--but not until 1979, decades after the vaccine was first administered • Population rates of disease do not necessarily correlate with vaccination • Vigilance and population-level prevention should be supplemented (not replaced) with vaccination
  • 24. Example 2: K-means Clustering of Iris Dataset • Quick example of visual analysis & K-means clustering using the canonical ‘Iris’ Dataset • This dataset includes different examples of Iris Flowers along with their physical features • We are taking a simple example directly from the scikit-learn library, but I will also add an example of cluster analysis for the Polio data at a later point in the Jupyter notebook within my github repository
  • 25. Some of the Code
  • 26. Output for K-means Clustering Insights: As we might have guessed there are 3 clusters for most feature combinations, and these are generally separate for each type of flower—but not always! Can you see where this isn’t true?
  • 27. The End • Thank you ObjectRocket & Rackspace for sponsoring PyLadies ATX and this talk! • Where to find the data: www.healthdata.gov • Where to find all of the Code: https://github.com/jddavis-100/Statistics-and-Machine-Learning/wiki/Welcome-&-Table-of-Contents • Where to find the Jupyter Notebook: I will be providing it to Sara Safavi so contact her soon. You can also find a static copy of it on my wiki (soon). • Where to have fun: start on 6th & make your way to Rainey…or out to Salt Lick Grill or ACL festival in Zilker Park…or…any number of awesome places in ATX!
  • 28. A very simplistic Confusion Matrix Understanding the true positives, true negatives, false positives and false negatives allows us to calculate accuracy & precision. We can also use this analysis on both the test and the training data. Other tests such as marginal error are sometimes used.
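
The confusion-matrix slide can be made concrete with a small sketch. The labels below are made up purely for illustration (they are not from the Polio or Iris data), and the helper name `confusion_counts` is hypothetical. The sketch counts the four cells of a binary confusion matrix and derives accuracy and precision from them; precision is the same quantity as the PPV mentioned on the statistics slide.

```python
# Hypothetical labels for illustration only: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]


def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return tp, tn, fp, fn


tp, tn, fp, fn = confusion_counts(y_true, y_pred)

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision (PPV): share of predicted positives that were truly positive.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Running the same counts on both the training and the test predictions, as the slide suggests, is one simple way to spot over-fitting: a large gap between the two accuracies is a warning sign.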

Editor's notes

  1. PageRank (Principal Eigenvector) was invented by Sergey Brin & Larry Page, 1998. Search ranking algorithm using hyperlinks on the web. The basis for the original Google search engine. AdaBoost (ensemble learning) – used in Ensemble Learning, a method to employ multiple ‘learners’ to solve a problem. AdaBoost is one of the most utilized ensemble algorithms, invented by Yoav Freund & Robert Schapire. kNN (K-nearest neighbor Classification) – this algorithm finds a group of ‘k’ objects in the training set that are closest (e.g. by Euclidean distance) to the test object. Elements required include (1) a set of labeled objects, (2) a similarity metric and (3) the value of k (number of nearest neighbors). Principal Component Analysis (dimensionality reduction) Neural Network Models (e.g. Restricted Boltzmann machines)
  2. The plots display firstly what a K-means algorithm would yield using three clusters. It is then shown what the effect of a bad initialization is on the classification process: by setting n_init to only 1 (default is 10), the number of times that the algorithm will be run with different centroid seeds is reduced. The next plot displays what using eight clusters would deliver, and finally the ground truth.
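
The three K-means configurations described in note 2 can be sketched roughly as follows. This is not the code from the slides (which is not reproduced in this transcript), only a minimal sketch assuming scikit-learn is installed. The seed `random_state=0`, the dictionary keys, and the choice of `init="random"` for the single-initialization run are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the four numeric features for the 150 iris flowers.
iris = load_iris()
X = iris.data

# Three configurations mirroring the plots described in the note.
# random_state=0 is an arbitrary seed chosen so that runs are repeatable.
models = {
    "3 clusters, n_init=10": KMeans(n_clusters=3, n_init=10, random_state=0),
    "3 clusters, n_init=1, random init": KMeans(
        n_clusters=3, n_init=1, init="random", random_state=0
    ),
    "8 clusters, n_init=10": KMeans(n_clusters=8, n_init=10, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X)
    # inertia_ is the within-cluster sum of squared distances (lower = tighter).
    results[name] = (model.labels_, model.inertia_)
    print(f"{name}: inertia = {model.inertia_:.2f}")
```

Comparing the printed inertia values gives a rough numeric counterpart to the plots: a single random initialization can settle on a worse solution than the best of ten, and eight clusters will always fit more tightly than three even though the data contain only three species.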