Provides a brief overview of what machine learning is, how it works (theory), how to prepare data for a machine learning problem, an example case study, and additional resources.
2. Overview
Neural Networks
(How Machines Can Learn)
Collecting & Modeling Data
(How to do machine learning)
Case Studies
(What it really looks like)
So What? How do I really use this?
In order to fully appreciate the many components of machine learning,
we will explore the topic in four key areas.
3. • Regression (estimating or predicting a real value)
• Classification (classifying as true/false or 1-of-n classes)
• Optimization (scheduling / process analysis)
• Performance assessment
Types of Applications
Image by Julian Nitzsche, August 2007, CC BY-SA-3.0 (modified)
https://commons.wikimedia.org/wiki/File:Visegrad_Drina_Bridge_1.jpg
Each problem type is substantially unique and requires a
specific approach. In many cases, unique machine learning
algorithms and techniques have been developed to address
these problem types.
4. Neural Networks “The Neural Network” by rajsegar
CC License BY-NC-ND
http://rajasegar.deviantart.com/art/The-Neural-Network-177904377
5. Neural Networks
Simulates Brain physiology
• Neurons and synapses
• Pattern recognition
Classification / Regression
• Disease classification
• Stock price prediction
Noisy / Complex data
• Missing, incorrect, or irrelevant information
• Linear / non-linear
6. Regression Classification
Problem Type / Complexity Matrix
It is important to understand the
relationship between problem type
and problem complexity in the early
stages of a machine learning problem
Key questions:
• Is the data directly related?
• Are we trying to estimate a real
value or just a yes/no or
categorical classification?
7. Regression Classification
Problem Type / Complexity Matrix
Non-Linear Regression is
required where the relationship
between the variables cannot be
approximated with a straight
line. Data sets often fall into this
category.
Linear Regression (“line of best fit”)
Arguably, the simplest problem to
solve. The variables are linearly
(directly) related, allowing them to
be solved using common linear
algebra techniques.
8. Regression Classification
Problem Type / Complexity Matrix
Classification has one key distinction
from regression. Where regression
seeks a line which best fits the data,
classification seeks a boundary which
best separates the data.
Non-linear classification problems may
have unique characteristics, like one class
which is entirely contained within
another. Here, a single straight line
cannot separate them - a curved
boundary is required.
9. Artificial Neural Network (Linear)
x1
x2
y
Wx1
Wx2
This simple visual structure
represents a basic, linear neural
network.
It contains the key components
of its biological forebear,
neurons and their synaptic
interconnections.
One can also look at this as the visual representation of a mathematical function, y = f(x)
with the weighted connections (Wx1 and Wx2) describing the relationship between x and y
10. Artificial Neural Network (Linear)
x1
x2
y
Wx1
Wx2
Input Layer Output Layer
Fully-connected
feed forward
network
Fully-connected
The neurons in each layer are
connected to every neuron in the
following layer
Feed-forward
Computation begins at left with
the input and terminates at right
with the result
11. Artificial Neural Network (Linear)
x1
x2
y
Wx1
Wx2
The equation expressed in summation
notation may be literally described as:
“The sum of the inputs multiplied by
their weighted connections, then
divided by the total number of inputs”
In other words, it’s the description of
a weighted average.
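Written as a formula (a reconstruction of the description above, for n inputs, with n = 2 in the network shown):

y = \frac{1}{n} \sum_{i=1}^{n} W_{x_i} x_i = \frac{W_{x_1} x_1 + W_{x_2} x_2}{2}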
12. Artificial Neural Network (Linear)
[Diagram: inputs x1 = 2 and x2 = 4 with weights Wx1 = 1 and Wx2 = 1, giving 2 × 1 = 2 and 4 × 1 = 4; output y = 3]
Simple Average: (2 + 4) / 2 = 3
13. Artificial Neural Network (Linear)
[Diagram: the same network, with x1 = 2, x2 = 4, Wx1 = Wx2 = 1, output y = 3]
Simple Average: (2 + 4) / 2 = 3
Calculating a simple average is straightforward. With the weights set to 1,
we are effectively saying that every value in the average is equally important.
The end result will always be the simple average of any two inputs provided at left.
But suppose we want to do something more interesting?
Suppose we want to calculate the sum, rather than the average?
To do that, we only need to change the value of the weights…
15. Artificial Neural Network (Linear)
[Diagram: the same network, with x1 = 2 and x2 = 4 but Wx1 = Wx2 = 2, giving 2 × 2 = 4 and 4 × 2 = 8; output y = 6]
Summation: (4 + 8) / 2 = 6
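A minimal Python sketch of this two-input linear neuron (the function name is illustrative, not from the slides):

def weighted_neuron(x1, x2, w1, w2):
    # sum of the inputs multiplied by their weights, divided by the number of inputs
    return (w1 * x1 + w2 * x2) / 2

print(weighted_neuron(2, 4, 1, 1))   # weights = 1 -> simple average: 3.0
print(weighted_neuron(2, 4, 2, 2))   # weights = 2 -> sum of the inputs: 6.0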
It’s easy to see how a neural network’s behavior can be
drastically affected by the value of the weighted connections
between the neurons (just as the human brain’s synapses
modulate electrical pulses between neurons to affect different
results).
But most problems are too complex to guess the weights
beforehand or determine them by trial and error.
If only we could get the machine to figure out the weights
for itself…
16. Supervised Learning
Feed Forward
Back Propagate
x1
x2
y
Wx1
Wx2
Prediction
Expected Result
Error
-
Supervised Learning is a technique
used to find the optimum weights
required to fit a dataset using
the error between the prediction
and the desired answer.
It consists of the feed-forward and back-propagation phases.
Back propagation is the process by which a network takes
the error between its prediction and the expected result
and adjusts its weights to achieve a better prediction.
17. Supervised Learning
Update Rule
Increase / decrease weights
by prediction error
Convergence
Network error minimizes,
weights stabilize
Feed Forward
Back Propagate
Through many iterations, supervised learning adjusts the network
weights using a pre-determined update rule with the hope of
achieving network convergence, typically indicated by
monotonically decreasing error that eventually plateaus.
18. Supervised Learning
As a simple (trivial) example, examine the table below. In this case, Supervised Learning is
used to “train” a network designed to calculate simple averages (weights = 1) to learn to
calculate sums instead (weights = 2).
The update rule simply adjusts the weights up or down by the percentage of error
from the previous iteration. We can see how the network quickly converges on the optimal
weights, producing an accurate prediction with correspondingly low error.
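A hedged reconstruction of that toy example in Python; the exact update rule is not shown on the slide, so scaling the weights by the percentage error is one reading of "adjusts the weights up or down by the percentage of error":

# A network built to average (weights = 1) learns to sum (weights = 2).
x1, x2 = 2, 4
expected = x1 + x2            # expected result: the sum
w1 = w2 = 1.0                 # start from the simple-average weights

for step in range(10):
    prediction = (w1 * x1 + w2 * x2) / 2     # feed forward
    error = expected - prediction            # prediction error
    pct_error = error / expected             # error as a fraction of the target
    w1 *= 1 + pct_error                      # back propagate: adjust weights
    w2 *= 1 + pct_error                      #   up or down by the percentage error
    print(step, round(prediction, 4), round(w1, 4))
# The error decreases monotonically and the weights converge on 2.0.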
19. Hidden Layer
Artificial Neural Network (Non-Linear)
h1
h2
y
x1
x2
The real power of a neural network,
however, is in its ability to handle
non-linear problems.
Here, we see a non-linear network’s
key feature: the hidden layer.
Note that each layer is fully
connected to the next.
20. Artificial Neural Network (Non-Linear)
h1
h2
y
x1
x2
To better understand how the non-linear network operates, it helps to conceive of it as
a composition of several linear networks, each following the same rules for
calculation and updating.
21. Artificial Neural Network (Non-Linear)
h1
h2
y
x1
x2
A non-linear network may have any number of hidden layers and any number of
neurons in each layer. Practice shows, however, that most problems do not require
more than one hidden layer, with no more neurons than there are in the input or
output layers.
It can be mathematically demonstrated that a neural network of sufficient
complexity is capable of modelling any complex mathematical function.
This strength makes them ideally suited for modelling the indirect, subtle relationships
in a data set to provide accurate predictions where human experience fails.
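As an illustrative sketch (not from the slides), the pictured 2-2-1 network can be written as a composition of weighted sums with a non-linear activation on the hidden layer; the weights here are arbitrary, where in practice they would be learned by back propagation:

import numpy as np

def forward(x, W_hidden, W_out):
    h = np.tanh(W_hidden @ x)    # hidden layer: weighted sums through tanh
    return W_out @ h             # output layer: weighted sum of hidden activations

x = np.array([2.0, 4.0])
W_hidden = np.array([[0.5, -0.3],
                     [0.1,  0.8]])
W_out = np.array([1.0, -1.0])
print(forward(x, W_hidden, W_out))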
22. In 2012, researchers at Google Brain created
a network of 16,000 computer processors
with over 1 billion connections.
23. They then trained this network by showing it
screen captures from 10 million randomly
selected YouTube videos over three days.
24. At the end of the experiment, researchers
discovered the network was able to recognize
two things in particular.
Can you guess what they were?
25. At the end of the experiment, researchers
discovered the network was able to recognize
two things in particular.
Can you guess what they were?
Image source: www.twitter.com/realgrumpycat
Image source: http://www.chroniclelive.co.uk/
People Cats
26. What made this experiment unique was that
the researchers used unsupervised learning,
allowing the network to determine for itself the
difference between images, rather than telling it in
advance what it was looking at (supervised learning).
Image source: www.twitter.com/realgrumpycat
Image source: http://www.chroniclelive.co.uk/
28. Data Sets
• Numeric: {3.14159, 1.333, 42.0}
• Unordered Categorical: {Atlanta, Dallas, Chicago}
• Ordered Categorical: {Low, Medium, High}
• Target: what we’re trying to predict
• Features / Predictors: describe the characteristics of the dataset
29. Data Sets
• Target: what we’re trying to predict
• Features / Predictors: describe the characteristics of the dataset
When determining data and its roles, it’s important to remember that
relationships between pieces of data may be complex. For example, to predict
housing prices in a housing market, it would be necessary to visit a sample of
houses and record their listing prices (the target value).
Since the listing price is often driven by a series of features of the house (square
footage, number of bathrooms, etc.), we need to record that data as well.
Choosing which features to use can be tricky, as their relationship to the target
value may be complex. Some features may also have relationships to other
features, an undesirable dynamic to be avoided.
30. Data Sets
• Numeric: {3.14159, 1.333, 42.0}
• Unordered Categorical: {Atlanta, Dallas, Chicago}
• Ordered Categorical: {Low, Medium, High}
• Target: what we’re trying to predict
• Features / Predictors: describe the characteristics of the dataset
Categorical data can be somewhat complex to manage. It is key to note whether
or not your categorical data is ordered, as this can substantially affect model
accuracy.
Preparing text categorical data (as opposed to numbered categories) also
requires that there be no alternate spellings or misspellings.
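A hedged sketch of encoding both kinds of categorical data with pandas; the column names are illustrative, not from the slides:

import pandas as pd

df = pd.DataFrame({
    "city": ["Atlanta", "Dallas", "Chicago"],     # unordered categorical
    "demand": ["Low", "High", "Medium"],          # ordered categorical
})
df = pd.get_dummies(df, columns=["city"])         # one-hot encode the unordered categories
df["demand"] = pd.Categorical(
    df["demand"], categories=["Low", "Medium", "High"], ordered=True
).codes                                           # encode the order as 0, 1, 2
print(df)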
31. Data Sources
Data collection can be
very time consuming!
Data set sizes:
• 10 – 100 million
• 500 – 10,000 typical
The R and Python languages
are well-suited for
retrieving and managing
data.
“He who has the most data,
wins.”
[Diagram: the data set is assembled from web, spreadsheet, database, and paper / other sources]
32. Data Preparation
Clean Data
• No missing / incorrect values
• No misspelled categorical
values
• No mixed data types
Tabular Layout
• Features and targets in
columns
• Each row is an “observation”
• Avoid duplicate records
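A hedged sketch of these clean-up steps with pandas; the file and column names are illustrative:

import pandas as pd

df = pd.read_csv("observations.csv")                 # one row per observation
df = df.drop_duplicates()                            # avoid duplicate records
df = df.dropna()                                     # no missing values
df["city"] = df["city"].str.strip().str.title()      # remove alternate spellings ("atlanta " -> "Atlanta")
df["age"] = pd.to_numeric(df["age"])                 # no mixed data types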
33. Data Preparation
Normalization
• Values may vary by several orders of
magnitude
• Larger values have greater influence
• Normalization constrains all feature values to a common range.
• [0,1] and [-1,1] are common ranges.
• Generally, ~[-3, 3] is acceptable.
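A minimal sketch of min-max normalization to the [0, 1] range (scikit-learn's MinMaxScaler does the same); the feature values are illustrative:

import numpy as np

feature = np.array([540.0, 162.0, 332.5, 198.6])     # e.g. cement content in kg/m^3
scaled = (feature - feature.min()) / (feature.max() - feature.min())
print(scaled)    # every value now lies between 0 and 1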
34. “Prediction is very difficult,
especially about the
future.”
- Niels Bohr
Prediction
Image source: http://info.iqms.com/IQMS-Manufacturing-erp-expertise/bid/102603/IQMS-Quality-Assurance-Predictions-With-Carnac-the-Magnificent
35. Steps:
1. Split original data set into train and test sets (80/20).
2. Train the model with the larger portion
3. Predict with both the training and testing data.
4. Measure the error in the predictions in both data sets
5. Compare the error of the two data sets
Cross-Validation
Cross-Validation
Establishes how well a model “generalizes”
Generalization
The ability to accurately predict
using previously-unseen data
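A hedged sketch of these five steps in Python with scikit-learn, using synthetic data in place of a real feature matrix and target:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for a real feature matrix X and target vector y
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))                          # 200 observations, 3 features
y = 2 * X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

# 1. Split into train and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# 2. Train the model with the larger portion
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
# 3-4. Predict on both sets and measure the error
train_error = mean_absolute_error(y_train, model.predict(X_train))
test_error = mean_absolute_error(y_test, model.predict(X_test))
# 5. Compare: similar, low errors on both sets indicate a good fit
print(train_error, test_error)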
36. Good fit
• The network generalizes well on data it has not seen
• Performance on both data sets is similar
• Overall error is low
Underfit (high bias)
• Does not predict well on either data set
• Need more data, features, better algorithm
Overfit (high variance)
• Predicts well on the training data, but not the testing data
• Need fewer features, less-powerful algorithm
Cross-Validation
37. Measure of Success
A measure of success is:
A meaningful, context-specific statement of
how successfully the model predicts.
“On average, the model predicts within
__% of the actual value,
__% of the time.”
38. Measure of Success
Measures of success allow us to articulate in
a straightforward, simple fashion, the
accuracy of the machine learning model
without having to resort to technical jargon
that may confuse laypersons.
Articulating a measure of success also helps
us understand just how accurate the model
needs to be for the intended purposes.
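One illustrative way (not from the slides) to turn that sentence into a number in Python:

import numpy as np

# The share of predictions that land within a given percentage of the actual value
def success_rate(actual, predicted, tolerance=0.10):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    within = np.abs(predicted - actual) <= tolerance * np.abs(actual)
    return within.mean()

# "Within 10% of the actual value" for 2 of 3 predictions -> about 67% of the time
print(success_rate([30, 45, 60], [28, 47, 70]))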
40. Overview
Compressive Strength of Concrete Samples
Image source:
http://info.admet.com/blog
/topic/compression-test
Given a concrete sample’s mix design and age,
can we accurately estimate its compressive strength?
41. Data Profile
Source: University of California, Irvine (UCI) website
1,030 samples (metric units)
Non-Linear
Features:
1. Cement
2. Slag
3. FlyAsh
4. Water
5. Superplasticizer
6. Coarse Aggregate
7. Fine Aggregate
8. Age (days)
42. Predictions
Actual compressive strengths of all samples,
sorted from lowest to highest
Upper accuracy tolerance (110% of actual)
Lower accuracy tolerance (90% of actual)
43. Predictions
Shape of curve suggests data distribution is approximately
Normal / Gaussian (standard bell curve), evidenced by steeper slope
at lower and higher values (fewer data points) and flatter slope in the
midrange (more data points)
45. Predictions
Generalized Linear Regression (train)
A plot of Linear Regression’s prediction success vs. error reveals that
even where the error was relatively low in the middle-strength range,
there was little consistency in its success rate.
47. Predictions
Random Forest (training) Support Vector Machine (training)
Applying two non-linear algorithms (a Random Forest Network and a Support Vector
Machine) yielded much more favorable results, performing very well against the 90%
success metric. While these algorithms are not neural networks (and are structurally
unrelated), many of the same rules that apply to neural network training also apply here.
Note the characteristic spike in the lower end of the predictions of both algorithms. This
is likely due to one or a small number of erratic data points.
A more detailed investigation into the data set would, hopefully, identify the cause and
perhaps suggest changes that could improve the model accuracy.
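A hedged sketch of fitting these two algorithms with scikit-learn. It assumes X (the eight mix/age features) and y (compressive strength) are already loaded as numeric arrays; the model settings are illustrative, and the final check mirrors the 90%-accuracy success metric:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
svm = make_pipeline(StandardScaler(), SVR(C=10)).fit(X_train, y_train)   # SVR benefits from scaled features

for name, model in [("Random Forest", rf), ("Support Vector Machine", svm)]:
    pred = model.predict(X_test)
    within_10pct = (abs(pred - y_test) <= 0.10 * y_test).mean()   # share of predictions within 10% of actual
    print(name, within_10pct)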
48. Predictions
Random Forest (training) Support Vector Machine (training)
Also note the error / success plots for the two algorithms. Not surprisingly, low success
rates occur where there is the least data (the low and high ends). However, it’s interesting to note
that Random Forest had less error with low-strength predictions, whereas Support
Vector Machine did better with high-strength predictions.
This is a common characteristic of machine learning algorithms – each algorithm looks at
the data a bit differently. Further, we can use this trait to our advantage by building
ensemble networks – combinations of two good networks to create better predictions.
49. Model Results
[1]Test Success: Percentage of time model is at least 90% accurate on
previously-unseen data.
[2]Ensemble: Combination of SVM and RF only.
[3]Chained Ensemble: Predictions of one ensemble are used as inputs to another.
50. Model Results
Not surprisingly, linear regression performed poorly on the test data (data not previously
seen during training). SVM / RF performances were markedly better, though far too low
to be useful as production models.
The ensemble of the RF and SVM simply computed the average of their predictions.
It has been shown that a simple average of two or more well-performing models can often outperform
those models. While we did not achieve that here, it is interesting to note that the
ensemble was not substantially worse than the better model.
51. Model Results
51. The final model, the “chained ensemble”, consisted of two ensembles (two sets of
RF and SVM) linked in series.
The first ensemble simply computed the average of each network’s
predictions. The second ensemble took the original data set and added the first
ensemble’s predictions as an extra “feature”. This amounted to giving the second
ensemble a “cheat sheet” of the patterns discovered by the first, enabling it to
substantially outperform any of the other algorithms and achieve an 87% success rate.
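Continuing the previous sketch (reusing rf, svm, X_train, X_test, y_train and the scikit-learn imports), a hedged illustration of the chained ensemble, not the author's exact implementation:

import numpy as np

first_train = (rf.predict(X_train) + svm.predict(X_train)) / 2    # first ensemble: simple average
first_test = (rf.predict(X_test) + svm.predict(X_test)) / 2

X_train_aug = np.column_stack([X_train, first_train])             # add the "cheat sheet" feature
X_test_aug = np.column_stack([X_test, first_test])

rf2 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train_aug, y_train)
svm2 = make_pipeline(StandardScaler(), SVR(C=10)).fit(X_train_aug, y_train)
final_pred = (rf2.predict(X_test_aug) + svm2.predict(X_test_aug)) / 2   # second ensemble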
52. Feature Importance
Feature Importance provides a way of preprocessing a data
set to help control variability.
The depicted binary decision tree takes the data set and
splits it in two at the point where the most variation occurs.
In this case, we see that the age of the samples (specifically
at 21 days) is the first node in the tree.
This single observation alone could substantially improve
network performance, if we split our data set into two
(separating samples at the 21-day age). The trade-off,
however, is fewer data points in the resulting data sets,
which makes learning patterns more difficult.
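A hedged sketch of inspecting a tree's first split with scikit-learn, again assuming X and y hold the concrete features and strengths; the feature names are taken from the data profile slide:

from sklearn.tree import DecisionTreeRegressor, export_text

feature_names = ["Cement", "Slag", "FlyAsh", "Water",
                 "Superplasticizer", "CoarseAggregate", "FineAggregate", "Age"]
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))
# The root node shows the feature and threshold (e.g. Age <= 21) where splitting
# the data separates the most variation.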
54. So What?
What can this technology
really do for us?
The answer lies in
asking the right question.
56. So What?
The question has three key ingredients:
• The Givens (features / predictors)
• The Goal (target / prediction)
• The Accuracy (success rate)
Using this format, we can take one data set
(like the concrete sample strength data)
and use it to answer a variety of unique questions.
57. So What?
Given the strength and mix design,
can I determine the time it will take to cure
with 95% accuracy?
Answering this question is important for contractors and designers
who are trying to determine a construction schedule or anticipate
how soon a newly-constructed roadway can be opened to live traffic.
58. So What?
Given the compressive strength and cure time
can I determine the most valid mix design
with 90% accuracy?
Answering this question helps material-testing personnel identify
potential causes of substandard materials and make investigations
into chronic material quality issues more efficient.
59. So What?
So where can I learn how to use machine
learning?
Resources exist all over the internet, including:
• Online classes
• Data repositories
• Machine Learning tools and cloud-computing services
60. Additional Resources
BigML.com (http://www.bigml.com)
On-line machine learning and data
visualization tools
UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
Wide range of data sets for machine learning applications
The R Project (http://www.r-project.org/)
Free scripting language for statistical
computing and graphics
Coursera (http://www.coursera.org)
Free on-line college-level courses in
technology and other topics
Microsoft Azure / Amazon EC2
Cloud-computing that provides virtualization
and machine learning services