Data Mining the City
Big Data, Urbanism, and Web 2.0
Danil Nagy (dn2216@columbia.edu)
Wednesdays, 7:00pm-9:00pm
200 Buell
1
Week 5
A (practical) introduction
to Machine Learning
2
Data / Model
3
Meadows, et al. The Limits to Growth (1972)
4
5
Data Mining vs. Machine Learning
6
DATA → LEARNING → KNOWLEDGE
7
What is Learning?
1.	To get knowledge of something by study, experience, or being taught.
2.	To become aware by information or from observation
3.	To commit to memory
4.	To be informed of or to ascertain
5.	To receive instruction
Things learn when they change their behavior in a way that makes them
perform better in the future.
Witten, Frank, Hall. Data Mining, Practical Machine Learning Tools and Techniques, 3d edition. 9
“Telling the future, when it comes right down
to it, is not solely a human yearning. It is the
fundamental nature of any organism, and
perhaps any complex system. Telling the future
is what organisms are for.”
- Kevin Kelly, “Out of Control”
10
TRAINING DATA: Features (X1, X2, ...) + Value (y) → ‘LEARNING’ → TRAINED PREDICTOR MODEL
NEW DATA: Features (X1, X2, ...) → TRAINED PREDICTOR MODEL → Predicted Value (yp)
12
Name    Gender   Height   Income    HS Degree
Bob     Male     5’5”     $44,000   No
John    Male     6’0”     $60,000   Yes
Susan   Female   5’10”    $40,000   No
Betty   Female   5’6”     $55,000   Yes
13
Column types:
Description data: Name
Categorical data: Gender, HS Degree
Continuous data: Height, Income
14
Problem 1: Predict Income [regression]
Features (X) → value (y): Income
15
Problem 2: Predict HS Degree [classification]
Features (X) → value (y): HS Degree
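A minimal sketch of how the table above might be split into Features (X) and a value (y) for the two problems. The numeric encodings (gender and degree as 0/1, height in inches) and the choice of which columns serve as features are assumptions made for illustration; the slides do not specify them.

```python
# Toy table from the slides: (name, gender, height, income, hs_degree)
ROWS = [
    ("Bob", "Male", "5'5\"", 44000, "No"),
    ("John", "Male", "6'0\"", 60000, "Yes"),
    ("Susan", "Female", "5'10\"", 40000, "No"),
    ("Betty", "Female", "5'6\"", 55000, "Yes"),
]


def height_to_inches(text):
    """Convert a height such as 5'10" into inches (assumed encoding)."""
    feet, inches = text.rstrip('"').split("'")
    return int(feet) * 12 + int(inches)


def encode(rows, target):
    """Split rows into features X and value y for the chosen target column."""
    X, y = [], []
    for _name, gender, height, income, degree in rows:
        gender_code = 1 if gender == "Female" else 0   # categorical -> 0/1
        degree_code = 1 if degree == "Yes" else 0      # categorical -> 0/1
        inches = height_to_inches(height)              # continuous
        if target == "income":        # Problem 1: regression
            X.append([gender_code, inches, degree_code])
            y.append(income)
        elif target == "hs_degree":   # Problem 2: classification
            X.append([gender_code, inches, income])
            y.append(degree_code)
        else:
            raise ValueError("unknown target: " + target)
    return X, y


X_reg, y_reg = encode(ROWS, "income")
X_clf, y_clf = encode(ROWS, "hs_degree")
print(X_reg, y_reg)
print(X_clf, y_clf)
```

The name column is left out of X in both cases because it only describes the row. The same table yields a continuous y for regression and a 0/1 y for classification.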
16
SUPERVISED LEARNING MODEL
TRAINING DATA: Features (X1, X2, ...) + Value (y) → ‘LEARNING’ → TRAINED PREDICTOR MODEL
NEW DATA: Features (X1, X2, ...) → TRAINED PREDICTOR MODEL → Predicted Value (yp)
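The supervised flow can be sketched in a few lines of plain Python. The `LinePredictor` class and the numbers below are hypothetical stand-ins, not the model used in class: any predictor with a learning step and a prediction step fits the same pattern.

```python
class LinePredictor:
    """One-feature least-squares line: a stand-in for any supervised model."""

    def fit(self, xs, ys):
        # 'LEARNING': estimate slope and intercept from features and known values.
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, xs):
        # Apply the trained model to features whose values are unknown.
        return [self.slope * x + self.intercept for x in xs]


# Training data: features with known values (made-up numbers, y = 2x + 1).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = LinePredictor().fit(train_x, train_y)   # training data -> trained model
predicted = model.predict([4.0, 5.0])           # new data -> predicted values
print(predicted)
```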
17
UNSUPERVISED LEARNING MODEL
TRAINING DATA: Features (X1, X2, ...) only, no Value (y) → ‘LEARNING’ → TRAINED PREDICTOR MODEL
NEW DATA: Features (X1, X2, ...) → TRAINED PREDICTOR MODEL → Predicted Value (yp)
18
“Not everything that can be counted counts,
and not everything that counts can be counted.”
- William Bruce Cameron, 1967
19
Machine Learning Applications
1.	Web mining (search engine)
2.	Screening (loan customers)
3.	Image analysis (geographic detection)
4.	Load forecasting (energy companies)
5.	Diagnosis (medical and mechanical failure)
6.	Marketing and sales (retaining customers, targeting advertising, recommender systems)
7.	Science (gene detection, galaxy detection, predicting the structure of organic compounds)
8.	City design and planning?
20
DATA RECEIVED
• Image data for 5,328 colonies over 6 days (~32,000 images) at 550x550 resolution
• Table of information for 145 colonies processed by hand
• Time-lapse video of growth for one colony
Sample images: Day 1, Day 2, Day 3, Day 4, Day 5, Day 6
21
FEATURE EXTRACTION: Method 1 - Matrix Representation
Original image (550x550 pixels) → Normalized single-channel subset (30x30 pixels)
Sample matrix values: 52, 46, 37, 43, 37, 32, 37, 32, 30
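A rough sketch of the matrix representation, under assumptions the slide does not confirm: the 30x30 subset is taken here as a central crop, and normalization is taken to mean dividing 8-bit values by 255. The function name `extract_features` and the synthetic image are hypothetical.

```python
def extract_features(image, size=30, max_value=255):
    """Crop the central size x size window and scale pixel values to 0..1.

    `image` is a list of rows of grayscale (single-channel) pixel values.
    """
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    window = [row[left:left + size] for row in image[top:top + size]]
    # Flatten the matrix row by row into one feature vector.
    return [pixel / max_value for row in window for pixel in row]


# Synthetic 550x550 image standing in for a colony photograph.
image = [[(r + c) % 256 for c in range(550)] for r in range(550)]
features = extract_features(image)
print(len(features))  # one value per pixel of the 30x30 subset
```

Each image then becomes a single row of 900 features (X1, X2, ...) that a learning algorithm can consume.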
22
MSE = Σ e² / (number of values), where e is the error of each prediction
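The mean squared error can be computed directly from its definition. The sample numbers are made up for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared errors e = actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(e ** 2 for e in errors) / len(errors)


actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
# Errors are 1, 0 and -2, so the squared errors sum to 5 over 3 values.
print(mean_squared_error(actual, predicted))
```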
31
degree = 3
get the code here: http://goo.gl/ogJM3u 36
SUPERVISED LEARNING MODEL - WITH VALIDATION
TRAINING DATA (~70%): Features (X1, X2, ...) + Value (y) → ‘LEARNING’ → TRAINED PREDICTOR MODELS
VALIDATION DATA (~30%): Features (X1, X2, ...) + Value (y) → TRAINED PREDICTOR MODELS → VALIDATED MODEL
NEW DATA: Features (X1, X2, ...) → VALIDATED MODEL → Predicted Value (yp)
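A minimal sketch of the validation step in plain Python. The data, the two candidate models, and the helper names (`split_data`, `fit_constant`, `fit_line`) are hypothetical; the point is only that models are trained on ~70% of the data and compared on the held-out ~30%.

```python
import random


def split_data(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_constant(rows):
    """Candidate 1: always predict the mean training value."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Candidate 2: least-squares line through the training points."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return lambda x: slope * x + (mean_y - slope * mean_x)


def mse(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Made-up data: y = 2x + 1 for x = 0..9.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
train, validation = split_data(data)

# Train every candidate on the training set only.
candidates = {"constant": fit_constant(train), "line": fit_line(train)}

# Keep the candidate with the lowest error on the held-out validation data.
scores = {name: mse(model, validation) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(len(train), len(validation), best)
```

Scoring on data the models never saw during training is what allows an over-fitted candidate to be detected before it is used on new data.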
38
Machine Learning Algorithms
http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
Supervised Learning
1.	Instance-based learning
2.	Artificial neural network
3.	Support vector machines <-- use in class
4.	Learning automata
Unsupervised Learning
5.	K-nearest neighbor
6.	Decision trees
7.	Random forests
40
Advantages of SVM
1.	Modern
2.	Understandable
3.	Controllable
4.	Flexible
5.	Powerful
41
http://en.wikipedia.org/wiki/Support_vector_machine
SUPPORT VECTOR MACHINES
H1 does not separate the classes.
H2 does, but only with a small margin.
H3 separates them with the maximum margin.
Maximum-margin hyperplane and margins
for an SVM trained with samples from two
classes. Samples on the margin are called the
support vectors.
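To make the margin concrete, the sketch below measures how far the nearest sample lies from three hand-picked separators. The samples and the three hyperplanes are hypothetical stand-ins for H1, H2 and H3 in the figure; for this toy data, `h3` is the perpendicular bisector of the closest pair of opposite-class points.

```python
import math


def signed_distance(w, b, point):
    """Signed distance from point to the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))


def margin(w, b, samples):
    """Distance from the hyperplane to the nearest sample, or None if the
    hyperplane does not separate the two classes (labels are +1 / -1)."""
    smallest = min(label * signed_distance(w, b, x) for x, label in samples)
    return smallest if smallest > 0 else None


samples = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 4.0), +1), ((5.0, 4.0), +1)]

h1 = ((0.0, 1.0), -5.0)    # y = 5: every sample on one side, no separation
h2 = ((1.0, 0.0), -2.5)    # x = 2.5: separates, but passes close to (2, 1)
h3 = ((2.0, 3.0), -13.5)   # 2x + 3y = 13.5: midway between (2, 1) and (4, 4)

for name, (w, b) in (("H1", h1), ("H2", h2), ("H3", h3)):
    print(name, margin(w, b, samples))
```

The samples that sit exactly at the margin distance from `h3`, here (2, 1) and (4, 4), play the role of support vectors.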
42
SUPPORT VECTOR MACHINES
http://en.wikipedia.org/wiki/Support_vector_machine
Non-linear Classification
Non-linear models are useful for data that cannot be separated in its original feature
space. They are created through the ‘kernel trick’, where the data is implicitly projected
into a higher-dimensional space in which it can be separated by a hyperplane. Seen from
the original feature space, that separator appears as a non-linear boundary.
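A small illustration of the idea, using an explicit projection so the steps are visible. The data, the mapping x → (x, x²), and the threshold are assumptions chosen for the example; a real SVM never builds the lifted coordinates and only evaluates a kernel function such as `kernel` below.

```python
# 1D data: class +1 lies between two groups of class -1, so no single
# threshold on x can separate the classes.
samples = [(-3.0, -1), (-2.0, -1), (-0.5, +1), (0.0, +1), (0.5, +1),
           (2.0, -1), (3.0, -1)]


def lift(x):
    """Explicit projection into a higher-dimensional space: x -> (x, x^2)."""
    return (x, x * x)


def classify(x, threshold=2.0):
    """Linear rule in the lifted space: the horizontal line x^2 = threshold."""
    _, x_squared = lift(x)
    return +1 if x_squared < threshold else -1


def kernel(a, b):
    """Inner product of lift(a) and lift(b), computed without lifting."""
    return a * b + (a * b) ** 2


def lifted_dot(a, b):
    """The same inner product, computed the long way through lift()."""
    return sum(p * q for p, q in zip(lift(a), lift(b)))


print(all(classify(x) == label for x, label in samples))
print(kernel(2.0, 3.0), lifted_dot(2.0, 3.0))
```

In the original 1D space the straight line x² = 2 becomes two cut points at ±√2, which is the non-linear boundary.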
43
SUPPORT VECTOR MACHINES
http://en.wikipedia.org/wiki/Support_vector_machine
Soft-margin Classification
Soft-margin systems build classifiers that are allowed to ignore some misclassifications
that fall within a certain distance (ε) of the separator. They are useful for categorizing
messy or noisy data.
44
SUPPORT VECTOR MACHINES
Non-linear soft-margin SVM classification used to classify non-separable data
http://en.wikipedia.org/wiki/Support_vector_machine 45
SUPPORT VECTOR MACHINES
http://en.wikipedia.org/wiki/Support_vector_machine
Error Function
46
SUPPORT VECTOR MACHINES
http://en.wikipedia.org/wiki/Support_vector_machine
Optimization Function
Margin Width
Penalty Factor
Error Function
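The equation itself did not survive in the slide text. Assuming it is the standard soft-margin objective from the cited Wikipedia article, ½‖w‖² + C Σ ξ, the labelled parts can be sketched as follows. The function names and sample points are hypothetical.

```python
def hinge_errors(w, b, samples):
    """Error Function: how far each sample falls short of the margin."""
    errors = []
    for x, label in samples:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        errors.append(max(0.0, 1.0 - label * score))
    return errors


def objective(w, b, samples, C):
    """Optimization Function = margin-width term + Penalty Factor * errors."""
    margin_term = 0.5 * sum(wi * wi for wi in w)   # smaller ||w|| = wider margin
    penalty_term = C * sum(hinge_errors(w, b, samples))
    return margin_term + penalty_term


# Two well-placed samples and one class -1 sample on the wrong side.
samples = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
w, b = (1.0, 0.0), 0.0

print(hinge_errors(w, b, samples))
for C in (1, 10):
    print(C, objective(w, b, samples, C))
```

Training searches for the w and b that minimise this value, so a larger C makes each error count for more relative to the width of the margin.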
47
SUPPORT VECTOR MACHINES
http://www.svms.org/parameters/
Penalty Factor
The penalty factor in an SVM penalizes the model (creates higher values in the optimization)
for wrong guesses. It is driven by two parameters (which become inputs into the model):
C - a multiplier that controls the strength of the penalty factor. Higher values of C produce
larger relative penalties for misclassified points and can lead to over-fitting (high variance).
ε (epsilon) - controls the margin of error or ‘gray area’ of the model (how wrong an example
has to be before it is considered an error). Higher values produce simpler models but
may result in under-fitting (high bias).
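How the two parameters interact can be sketched with the ε-insensitive error used in support vector regression (scikit-learn's `SVR` takes both `C` and `epsilon`). The helper names and sample values are hypothetical; the ε and C values are borrowed from the regression example later in the deck.

```python
def epsilon_insensitive(actual, predicted, epsilon):
    """Errors smaller than epsilon count as zero; larger ones count only
    for the part that exceeds epsilon."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]


def penalty(actual, predicted, C, epsilon):
    """Total penalty: C times the sum of the errors outside the gray area."""
    return C * sum(epsilon_insensitive(actual, predicted, epsilon))


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 5.0, 3.0]   # absolute errors: 0.5, 0, 2, 1

for epsilon in (0.0001, 1, 3):
    print(epsilon, penalty(actual, predicted, C=50, epsilon=epsilon))
```

With ε = 3 every error falls inside the gray area and the penalty vanishes, so nothing pushes the model to follow the data closely. That is the under-fitting risk described above.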
48
BIAS-VARIANCE TRADEOFF
MODEL COMPLEXITY
HIGH BIAS
(underfitting)
HIGH VARIANCE
(overfitting)
‘RIGHT’
MODEL
49
SCIKIT-LEARN MACHINE LEARNING LIBRARY FOR PYTHON
50
2D EXAMPLE - SVR [REGRESSION]
52
2D EXAMPLE - SVR [REGRESSION]
Panel labels: e = .0001, e = 1, e = 3; C = 500, C = 50, C = 0.001
53
2D EXAMPLE - SVR [REGRESSION]
Panel labels: e = .0001, e = 1, e = 3; C = 500, C = 50, C = 0.001
Annotation: mse = 24.994
54
2D EXAMPLE - SVR [REGRESSION]
get the code here: http://goo.gl/RecPfq 55
2D EXAMPLE - SVC [CLASSIFICATION]
56
2D EXAMPLE - SVC [CLASSIFICATION]
Panel labels: e = .0001, e = 1, e = 3; C = 100, C = 10, C = 1
57
2D EXAMPLE - SVC [CLASSIFICATION]
Panel labels: g = 0.001, g = 0.1, g = 1; C = 500, C = 50, C = 1
Annotation: e = 34
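Assuming g stands for the gamma parameter of an RBF kernel (the slides do not say which kernel is used), the sketch below shows what the three values do to the similarity between two points. The points are made up.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared distance between a and b)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


point = (0.0, 0.0)
neighbour = (3.0, 4.0)   # distance 5 from point

for gamma in (0.001, 0.1, 1):
    print(gamma, rbf_kernel(point, neighbour, gamma))
```

A small gamma lets distant samples still influence each other, which gives smooth boundaries. A large gamma makes each sample matter only in its immediate neighbourhood, which gives more intricate boundaries and a higher risk of over-fitting.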
58
2D EXAMPLE - SVC [CLASSIFICATION]
get the code here: http://goo.gl/b3TGOQ 59
WEBSTACK APPLICATION - HEATMAP (DENSITY DISTRIBUTION)
60
WEBSTACK APPLICATION - INTERPOLATION (VALUE PREDICTION)
61
