David Callender
• Finished in the top 2% (18th out of >1300) in a 3-year, $3 million machine learning competition
• Studied disease propagation in an urban setting using probabilistic graphical models at Dartmouth College
• Studied computational protein design at the University of Washington
• Studied mathematical foundations of quantum mechanics at Macalester College
Machine Learning in R
circa 2013
David Callender
a.k.a. Using R on Kaggle
• Who will end up in the hospital
• Drug effectiveness
• Computer security: determining employee access needs
• What will the salary be for a given job advertisement
Not Just Kaggle
• Movie recommendations
• Popular productions
• Product recommendations
• Good business opportunities
• The entire Internet
• Probably a lot more too
Talk Outline
• Motivation
• Concepts
• Algorithms
  • Decision Trees and Forests
  • Neural networks
• Kaggle
• Interactive session with R packages
  • randomForest
  • gbm
  • neuralnet
Supervised Learning
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 3 male 22 1 0 7.25 S
1 1 female 38 1 0 71.2833 C
1 3 female 26 0 0 7.925 S
1 1 female 35 1 0 53.1 S
0 3 male 35 0 0 8.05 S
0 3 male 33 0 0 8.4583 Q
0 1 male 54 0 0 51.8625 S
0 3 male 2 3 1 21.075 S
1 3 female 27 0 2 11.1333 S
1 2 female 14 1 0 30.0708 C
Survived Pclass Sex Age SibSp Parch Fare Embarked
? 3 male 34.5 0 0 7.8292 Q
? 3 female 47 1 0 7 S
? 2 male 62 0 0 9.6875 Q
? 3 male 27 0 0 8.6625 S
? 3 female 22 1 1 12.2875 S
? 3 male 14 0 0 9.225 S
? 3 female 30 0 0 7.6292 Q
? 2 male 26 1 1 29 S
? 3 female 18 0 0 7.2292 C
? 3 male 21 2 0 24.15 S
Train model with examples where you know the value of “survived”
Use model to predict the value of “survived”
Predicting survival for passengers of the Titanic
Column types: binary, numeric, categorical
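The train-then-predict workflow can be sketched in a few lines. This is a minimal illustration in Python rather than R, using the training rows from the slide and the first three unlabeled rows. The majority-vote-per-sex rule and the helper names `fit_majority_by_sex` and `predict` are assumptions made for the example, not the method used in the talk.

```python
from collections import Counter, defaultdict

# Training rows: (Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
TRAIN = [
    (0, 3, "male", 22, 1, 0, 7.25, "S"),
    (1, 1, "female", 38, 1, 0, 71.2833, "C"),
    (1, 3, "female", 26, 0, 0, 7.925, "S"),
    (1, 1, "female", 35, 1, 0, 53.1, "S"),
    (0, 3, "male", 35, 0, 0, 8.05, "S"),
    (0, 3, "male", 33, 0, 0, 8.4583, "Q"),
    (0, 1, "male", 54, 0, 0, 51.8625, "S"),
    (0, 3, "male", 2, 3, 1, 21.075, "S"),
    (1, 3, "female", 27, 0, 2, 11.1333, "S"),
    (1, 2, "female", 14, 1, 0, 30.0708, "C"),
]

# Rows to predict, without the Survived column:
# (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
TEST = [
    (3, "male", 34.5, 0, 0, 7.8292, "Q"),
    (3, "female", 47, 1, 0, 7, "S"),
    (2, "male", 62, 0, 0, 9.6875, "Q"),
]


def fit_majority_by_sex(rows):
    """Learn the most common Survived value for each value of Sex."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[2]][row[0]] += 1
    return {sex: counts.most_common(1)[0][0] for sex, counts in votes.items()}


def predict(model, rows):
    """Look up the learned label for each unlabeled row (Sex is at index 1)."""
    return [model[row[1]] for row in rows]


model = fit_majority_by_sex(TRAIN)
predictions = predict(model, TEST)
print(model, predictions)
```

The model here is deliberately trivial; the point is only the split between rows with a known label (used to fit) and rows without one (used to predict).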
Overfitting
http://en.wikipedia.org/wiki/File:Overfitting_on_Training_Set_Data.pdf | Tomaso Poggio
Decision Trees
http://en.wikipedia.org/wiki/File:CART_tree_titanic_survivors.png | Stephen Milborrow | Made using R
Survived Pclass Sex Age SibSp Parch Fare Embarked
? 3 male 34.5 0 0 7.8292 Q
? 3 female 47 1 0 7 S
? 2 male 62 0 0 9.6875 Q
? 3 male 27 0 0 8.7 S
? 3 female 22 1 1 12.2875 S
? 3 male 14 0 0 9.225 S
? 3 female 30 0 0 7.6292 Q
? 2 male 26 1 1 29 S
? 3 female 18 0 0 7.2292 C
? 3 male 21 2 0 24.15 S
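A decision tree is grown by repeatedly picking the split that best separates the classes. The sketch below shows that single step on the ten training rows from the Supervised Learning slide, in Python rather than R. The Gini impurity criterion and the helper names `gini`, `split_impurity`, and `best_split` are assumptions chosen for illustration; the talk does not say which criterion its tree used.

```python
# Training rows: (Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
TRAIN = [
    (0, 3, "male", 22, 1, 0, 7.25, "S"),
    (1, 1, "female", 38, 1, 0, 71.2833, "C"),
    (1, 3, "female", 26, 0, 0, 7.925, "S"),
    (1, 1, "female", 35, 1, 0, 53.1, "S"),
    (0, 3, "male", 35, 0, 0, 8.05, "S"),
    (0, 3, "male", 33, 0, 0, 8.4583, "Q"),
    (0, 1, "male", 54, 0, 0, 51.8625, "S"),
    (0, 3, "male", 2, 3, 1, 21.075, "S"),
    (1, 3, "female", 27, 0, 2, 11.1333, "S"),
    (1, 2, "female", 14, 1, 0, 30.0708, "C"),
]

# Column index of each candidate feature.
FEATURES = {"Pclass": 1, "Sex": 2, "Age": 3, "SibSp": 4,
            "Parch": 5, "Fare": 6, "Embarked": 7}


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_impurity(rows, test):
    """Size-weighted impurity of the two groups produced by a yes/no test."""
    left = [r[0] for r in rows if test(r)]
    right = [r[0] for r in rows if not test(r)]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)


def best_split(rows):
    """Try every feature/value pair and return (impurity, description)."""
    best = None
    for name, idx in FEATURES.items():
        for value in sorted({r[idx] for r in rows}):
            if isinstance(value, str):   # categorical: equality test
                test = lambda r, i=idx, v=value: r[i] == v
                desc = f"{name} == {value}"
            else:                        # numeric: threshold test
                test = lambda r, i=idx, v=value: r[i] <= v
                desc = f"{name} <= {value}"
            score = split_impurity(rows, test)
            if best is None or score < best[0]:
                best = (score, desc)
    return best


print(best_split(TRAIN))
```

On these ten rows, splitting on Sex separates survivors from non-survivors completely, so it is chosen as the root. A full tree would apply the same search recursively to each resulting group.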
Random Forest (RF)
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 3 male 22 1 0 7.25 S
1 1 female 38 1 0 71.2833 C
1 3 female 26 0 0 7.925 S
1 1 female 35 1 0 53.1 S
0 3 male 35 0 0 8.05 S
0 3 male 33 0 0 8.4583 Q
0 1 male 54 0 0 51.8625 S
0 3 male 2 3 1 21.075 S
1 3 female 27 0 2 11.1333 S
1 2 female 14 1 0 30.0708 C
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 3 male 22 1 0 7.25 S
1 1 female 38 1 0 71.2833 C
1 3 female 26 0 0 7.925 S
1 1 female 35 1 0 53.1 S
0 3 male 35 0 0 8.05 S
0 3 male 33 0 0 8.4583 Q
0 1 male 54 0 0 51.8625 S
0 3 male 2 3 1 21.075 S
1 3 female 27 0 2 11.1333 S
1 2 female 14 1 0 30.0708 C
Training: Bagging and Random Sub-Spaces
Prediction: Voting/Avg
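The two training ideas (bagging over rows, random sub-spaces over columns) and the voting step can be sketched as follows. This is an illustrative Python toy, not the `randomForest` package: each "tree" is a one-split stump, the feature subset is drawn once per tree, and the names `fit_stump`, `fit_forest`, and `predict` plus the defaults `n_trees=25` and `n_features=3` are assumptions made for the example.

```python
import random
from collections import Counter

# Training rows: (Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
TRAIN = [
    (0, 3, "male", 22, 1, 0, 7.25, "S"),
    (1, 1, "female", 38, 1, 0, 71.2833, "C"),
    (1, 3, "female", 26, 0, 0, 7.925, "S"),
    (1, 1, "female", 35, 1, 0, 53.1, "S"),
    (0, 3, "male", 35, 0, 0, 8.05, "S"),
    (0, 3, "male", 33, 0, 0, 8.4583, "Q"),
    (0, 1, "male", 54, 0, 0, 51.8625, "S"),
    (0, 3, "male", 2, 3, 1, 21.075, "S"),
    (1, 3, "female", 27, 0, 2, 11.1333, "S"),
    (1, 2, "female", 14, 1, 0, 30.0708, "C"),
]
FEATURES = {"Pclass": 1, "Sex": 2, "Age": 3, "SibSp": 4,
            "Parch": 5, "Fare": 6, "Embarked": 7}


def majority(labels, default=0):
    """Most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(rows, feature_names):
    """One-split tree restricted to the given features; returns a predictor."""
    fallback = majority([r[0] for r in rows])
    best = None
    for name in feature_names:
        idx = FEATURES[name]
        for value in sorted({r[idx] for r in rows}):
            if isinstance(value, str):
                test = lambda r, i=idx, v=value: r[i] == v
            else:
                test = lambda r, i=idx, v=value: r[i] <= v
            left = [r[0] for r in rows if test(r)]
            right = [r[0] for r in rows if not test(r)]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, test,
                        majority(left, fallback), majority(right, fallback))
    _, test, left_label, right_label = best
    return lambda r: left_label if test(r) else right_label


def fit_forest(rows, n_trees=25, n_features=3, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]           # bagging: rows with replacement
        subset = rng.sample(sorted(FEATURES), n_features)   # random sub-space: a few columns
        forest.append(fit_stump(sample, subset))
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    return majority([tree(row) for tree in forest])


forest = fit_forest(TRAIN)
# Unlabeled passenger; None stands in for the unknown Survived value.
vote = predict(forest, (None, 3, "female", 47, 1, 0, 7, "S"))
print(vote)
```

For a numeric target the vote would be replaced by an average of the trees' outputs, which is the "Avg" half of the slide's label.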
AdaBoost & Gradient Boosting
• Initialize a set of weights, one for each training example, with equal value
• Train a tree with weighted training examples
• Add tree to set of trees
• Make predictions with set of trees
• Adjust weights so that the training examples you got wrong have more weight
• Repeat
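The reweighting loop above corresponds most closely to AdaBoost, and can be sketched as below. This is a minimal Python illustration on a made-up one-dimensional dataset with labels +1/-1; the one-split "stump" learner, the toy data, and the function names are assumptions for the example, and gradient boosting (which fits each new tree to residuals instead of reweighting) is not shown.

```python
import math


def fit_stump(xs, ys, weights):
    """Pick the threshold/polarity with the lowest weighted error.

    Returns (weighted_error, threshold, polarity).
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def stump_predict(stump, x):
    _, threshold, polarity = stump
    return polarity if x <= threshold else -polarity


def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # equal initial weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(xs, ys, weights)       # train on weighted examples
        err = max(stump[0], 1e-10)               # guard against log(1/0)
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)  # this tree's say in the vote
        ensemble.append((alpha, stump))          # add tree to the set
        # Misclassified examples (y * prediction == -1) gain weight.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data (hypothetical): no single threshold separates the classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump gets more than 7 of these 10 points right, but the weighted vote of three stumps classifies all of them correctly.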
Logistic Regression
a.k.a. The Perceptron
Diagram labels: weighted sum, activation function
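The two pieces named on this slide, a weighted sum followed by an activation function, can be written out directly. The sketch below is in Python and uses the logistic (sigmoid) activation; the weights, bias, and inputs are hypothetical values picked by hand, not fitted to any data. (The classic perceptron uses a hard threshold as its activation instead of the sigmoid.)

```python
import math


def weighted_sum(weights, bias, inputs):
    """z = bias + w1*x1 + w2*x2 + ..."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_unit(weights, bias, inputs):
    """Weighted sum followed by the activation function."""
    return sigmoid(weighted_sum(weights, bias, inputs))


# Hypothetical hand-picked parameters for a two-input unit.
weights, bias = [2.0, -1.0], 0.5
z = weighted_sum(weights, bias, [1.0, 3.0])   # 0.5 + 2.0 - 3.0 = -0.5
p = logistic_unit(weights, bias, [1.0, 3.0])  # probability of the positive class
print(z, p)
```

Training logistic regression means choosing the weights and bias so that these outputs match the known labels as closely as possible; that fitting step is omitted here.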
Multilayer Feed-forward Neural Network
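A multilayer feed-forward network stacks units like the one on the previous slide: each layer's outputs become the next layer's inputs. As a rough Python sketch, the network below has two hidden units and one output unit with weights set by hand (an assumption for illustration, not learned) so that it computes XOR, a function a single unit cannot represent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus sigmoid per unit."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


# Hand-picked (hypothetical) parameters.
HIDDEN_W = [[20.0, 20.0],   # hidden unit 1 behaves like OR
            [20.0, 20.0]]   # hidden unit 2 behaves like AND
HIDDEN_B = [-10.0, -30.0]
OUT_W = [[20.0, -20.0]]     # output is roughly "OR and not AND"
OUT_B = [-10.0]


def forward(x):
    """Feed the input through the hidden layer, then the output layer."""
    hidden = layer(x, HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUT_W, OUT_B)[0]


for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(forward(x)))
```

In practice the weights are learned from data, for example by backpropagation, rather than written down by hand.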
R’s Popularity
Tools mentioned in Kaggle user profiles
From blog entry by Ben Hamner
http://blog.kaggle.com/2011/11/27/kagglers-favorite-tools/
Summary of Recent Competition Winners
Competition    Position  Algorithm  Other Algs.        Tools
Adzuna Salary  1st       NN*        -                  Python GPU
Adzuna Salary  2nd       NN         -                  C++
Adzuna Salary  3rd       NN         NB, SVM, LR        Python
Merck          1st       NN*        -                  Python GPU
Merck          2nd       GBM & SVM  RF, PCA, KNN, SVM  R & Python
Merck          3rd       RF & SVM   GBM, NN            R
Learning More
• Pedro Domingos at University of Washington
  • www.coursera.org/course/machlearning
  • www.coursera.org/uw
  • A Few Useful Things to Know about Machine Learning. Communications of the ACM
  • homes.cs.washington.edu/~pedrod
• blog.kaggle.com
• ufldl.stanford.edu/wiki/
