CSA 3702 machine learning module 1

CSA 3702 – MACHINE LEARNING
PREPARED BY,
NANDHINI S (SRMIST, RAMAPURAM CAMPUS),
BHARATHI RAJA N, MENNAKSHI COLLEGE OF ENGINEERING.

Learning
• Learning is essential for unknown environments,
- i.e., when the designer lacks omniscience.
• Learning is useful as a system construction method,
- i.e., expose the agent to reality rather than trying to write it
down.
• Learning modifies the agent's decision mechanisms to improve
performance.
Learning element
Design of a learning element is affected by
• Which components of the performance element are to be
learned
• What feedback is available to learn these components
• What representation is used for the components

What is Machine Learning..?
• Learning is any process by which a system improves performance from
experience.
• Machine learning is a set of tools that allow us to “teach” computers how to
perform tasks by providing examples of how they should be done.
• Machine Learning is the study of algorithms that
- improve their performance P
- at some task T
- with experience E.
A well-defined learning task is given by <P, T, E>.
When Do We Use Machine Learning?
- Human expertise does not exist (navigating on Mars)
- Humans can’t explain their expertise (speech recognition)
- Models must be customized (personalized medicine)
- Models are based on huge amounts of data (genomics)

Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging software

Types of Machine Learning
1. Supervised Learning,
in which the training data is labeled with the correct answers,
e.g., “spam” or “ham.”
The two most common types of supervised learning are classification
(where the outputs are discrete labels, as in spam filtering) and regression
(where the outputs are real-valued).
2. Unsupervised learning,
in which we are given a collection of unlabeled data, which we wish to
analyze and discover patterns within.
The two most important examples are dimension reduction and
clustering.
3. Reinforcement learning,
in which an agent (e.g., a robot or controller) seeks to learn the
optimal actions to take based on the outcomes of past actions.

There are many other types of machine learning as well, for
example:
1. Semi-supervised learning, in which only a subset of the training data is
labeled.
2. Time-series forecasting, such as in financial markets.
3. Anomaly detection such as used for fault-detection in factories and in
surveillance.
4. Active learning, in which obtaining data is expensive, and so an algorithm
must determine which training data to acquire.
5. Transduction, similar to supervised learning, but does not explicitly
construct a function: instead, tries to predict new outputs based on training
inputs, training outputs, and new inputs.
6. Learning to learn, where the algorithm learns its own inductive bias based
on previous experience.

Supervised Learning
• Supervised learning is commonly used in real world applications,
such as face and speech recognition, products or movie
recommendations, and sales forecasting.
• Supervised learning can be further classified into two types -
Regression and Classification.
• Regression trains on and predicts a continuous-valued response,
for example predicting real estate prices.
• Classification attempts to find the appropriate class label, for example
positive/negative sentiment, male/female persons, or
secured/unsecured loans.
• In supervised learning, learning data comes with description, labels,
targets or desired outputs and the objective is to find a general rule
that maps inputs to outputs.

• Supervised learning is the machine learning task of inferring a
function from labeled training data. The training data consist of a set
of training examples.
• In supervised learning, each example is a pair consisting of an input
object (typically a vector) and a desired output value (also called the
supervisory signal)
• A supervised learning algorithm analyzes the training data and
produces an inferred function, which can be used for mapping new
examples.
• An optimal scenario will allow for the algorithm to correctly
determine the class labels for unseen instances.
• This requires the learning algorithm to generalize from the training
data to unseen situations in a "reasonable" way.

• Supervised learning involves building a machine learning model
that is based on labeled samples.
Real-world example:
- if we build a system to estimate the price of a plot of land or a house based
on various features, such as size and location, we first need to create a
database and label it. We need to teach the algorithm which features correspond to
which prices. Based on this data, the algorithm will learn how to calculate the price of
real estate using the values of the input features.
• Supervised learning deals with learning a function from available
training data.
- a learning algorithm analyzes the training data and produces a derived
function that can be used for mapping new examples. There are many supervised
learning algorithms such as Logistic Regression, Neural networks, Support Vector
Machines (SVMs), and Naive Bayes classifiers.
• Common examples of supervised learning include classifying e-mails
into spam and not-spam categories, labeling webpages based
on their content, and voice recognition.

Fig. Machine Learning Supervised Process: Problem → Identification of Data → Data Pre-Processing → Definition of training set → Algorithm selection → Training → Evaluation with test set → OK? (YES: Classifier; NO: Parameter Tuning)
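The stages named in the figure (definition of training set, training, evaluation with test set) can be sketched in a few lines. This is only an illustration: the data values are hypothetical, and the "model" is a deliberately simple stand-in (a cut-off halfway between the two class means), not one of the algorithms listed in these notes.

```python
def train_threshold_classifier(samples):
    """Learn a cut-off halfway between the two class means (toy stand-in model)."""
    pos = [x for x, label in samples if label == 1]
    neg = [x for x, label in samples if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, samples):
    """Fraction of samples whose predicted label matches the known label."""
    correct = sum((1 if x > threshold else 0) == label for x, label in samples)
    return correct / len(samples)


# Hypothetical labeled data: (feature value, class label)
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

training_set = data[::2]   # every other example is used for training
test_set = data[1::2]      # the remaining examples are held out for evaluation

threshold = train_threshold_classifier(training_set)
test_accuracy = accuracy(threshold, test_set)
print(threshold, test_accuracy)
```

If the test accuracy were not acceptable, the figure's "NO" branch would apply: adjust parameters (or the algorithm) and train again.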

The Brain and the Neural
Networks

How do our brains work?
• The Brain is a massively parallel information processing system.
• Our brains are a huge network of processing elements.
• A typical human brain contains a network of roughly 86 billion neurons (often rounded to 100 billion).
Processing Element
Dendrites: Input
Cell body: Processor
Synapse: Link
Axon: Output

• Brain Neurons
- Many neurons possess arboreal structures called dendrites which
receive signals from other neurons via junctions called synapses.
- Some neurons communicate by means of a few synapses, others possess
thousands.
• A neuron is connected to other neurons through about 10,000 synapses
• A neuron receives input from other neurons. Inputs are combined.
• Once input exceeds a critical level, the neuron discharges a spike ‐ an
electrical pulse that travels from the body, down the axon, to the next
neuron(s)
• The axon endings almost touch the dendrites or cell body of the next
neuron.
• Transmission of an electrical signal from one neuron to the next is effected
by neurotransmitters.
• Neurotransmitters are chemicals which are released from the first neuron
and which bind to the second.
• This link is called a synapse. The strength of the signal that reaches the
next neuron depends on factors such as the amount of neurotransmitter
available.

Neural Networks: history
• The first model of neural networks was proposed in 1943 by
McCulloch and Pitts in terms of a computational model of neural
activity.
• Artificial Neural Networks (ANNs) are an abstract simulation of our
nervous system, which contains a collection of neurons that
communicate with each other through connections called axons.
• The ANN model has a certain resemblance to the axons and
dendrites in a nervous system.

Why Artificial Neural Networks?
• There are two basic reasons why we are interested in building
artificial neural networks (ANNs):
Technical viewpoint: Some problems such as
character recognition or the prediction of future
states of a system require massively parallel and
adaptive processing.
Biological viewpoint: ANNs can be used to
replicate and simulate components of the human
(or animal) brain, thereby giving us insight into
natural information processing.

Reasons to study Neural Computation
• To understand how the brain actually works.
– It's very big and very complicated and made of stuff that dies
when you poke it around. So we need to use computer simulations.
• To understand a style of parallel computation inspired by neurons
and their adaptive connections.
• To solve practical problems by using novel learning algorithms
inspired by the brain
– Learning algorithms can be very useful even if they are not
how the brain actually works.

Artificial Neural Networks
• The “building blocks” of neural networks are the neurons.
• In technical systems, we also refer to them as units or nodes.
Basically, each neuron
- receives input from many other neurons.
- changes its internal state (activation) based on the current
input.
- sends one output signal to many other neurons, possibly
including its input neurons (recurrent network).
- Information is transmitted as a series of electric impulses,
so-called spikes.
• The frequency and phase of these spikes encode the information.
• In biological systems, one neuron can be connected to as many as
10,000 other neurons.
• Usually, a neuron receives its information from other neurons in a
confined area, its so-called receptive field.

• There is the fascinating hypothesis that the brain does all of
these different things not with a thousand different programs,
but with just a single learning algorithm.
Structure of a Neural Network
• A neural network consists of:
A set of nodes (neurons) or units connected by links
A set of weights associated with links
A set of thresholds or levels of activation
• The design of a neural network requires:
The choice of the number and type of units
The determination of the morphological structure (layers)
Coding of training examples, in terms of inputs and outputs from
the network
The initialization and training of the weights on the
interconnections through the training set

Functioning of a Neuron
• It is estimated that the human brain contains on the order of 100 billion
neurons and that a neuron can have over 1000 input and output
synapses.
• Switching time of a few milliseconds (much slower than a
logic gate), but connectivity hundreds of times higher;
• A neuron transmits information to other neurons through its
axon;
• The axon transmits electrical impulses, which depend on its
potential;
• The transmitted data can be excitatory or inhibitory;
• A neuron receives input signals of various nature, which are
summed;
• If the excitatory influence is predominant, the neuron is
activated and generates informational messages to the output
synapses;

How Do ANNs Work?
• An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to simulate the
information processing capabilities of its biological exemplar.
• ANNs are typically composed of a great number of interconnected
artificial neurons. The artificial neurons are simplified models of their
biological counterparts.
• ANN is a technique for solving problems by constructing software
that works like our brains.

An artificial neuron is an imitation of a human neuron

Working Function of Single Neuron

• The output is a function of the inputs, shaped by the weights and
the transfer function.
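A minimal sketch of a single artificial neuron, assuming the usual form "weighted sum of the inputs plus a bias, passed through a transfer function". The function names, the bias term and the example numbers are illustrative assumptions, not taken from the slides.

```python
import math


def sigmoid(z):
    """S-shaped transfer function with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0, transfer=sigmoid):
    """Weighted sum of the inputs, then the transfer function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(net)


# Hypothetical inputs and weights: net input = 0.4*1.0 + (-0.2)*0.5 + 0.1 = 0.4
out = neuron_output([1.0, 0.5], [0.4, -0.2], bias=0.1)
print(round(out, 4))
```

Changing the weights or swapping the transfer function changes the output for the same inputs, which is exactly the dependence described above.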

Problems solvable with Neural Networks
• Network characteristics:
-Instances are represented by many features, which can take many
values, including real values
-The target function can be real-valued
-Examples can be noisy
-The training time can be long
-Evaluation of the learned network must be
fast
• Applications:
-Robotics
-Image understanding
-Biological systems
-Financial predictions

Linear Discriminant Analysis
• Logistic regression is a simple and powerful linear classification
algorithm.
• Logistic regression is a classification algorithm traditionally limited to
only two-class classification problems.
• If you have more than two classes then Linear Discriminant Analysis
is the preferred linear classification technique.
• Logistic regression also has limitations that suggest the need for alternate linear
classification algorithms.
– Two-Class Problems. Logistic regression is intended for two-class or binary
classification problems. It can be extended for multi-class classification, but is
rarely used for this purpose.
– Unstable With Well Separated Classes. Logistic regression can become
unstable when the classes are well separated.
– Unstable With Few Examples. Logistic regression can become unstable when
there are few examples from which to estimate the parameters.
• Even with binary-classification problems, it is a good idea to try both
logistic regression and linear discriminant analysis.

Learning LDA Models
• LDA makes some simplifying assumptions about your data:
– That your data is Gaussian, i.e. each variable is shaped like a bell curve when
plotted.
– That each attribute has the same variance, i.e. the values of each variable vary
around the mean by the same amount on average.
• With these assumptions, the LDA model estimates the mean and
variance from your data for each class.
• The mean (mu) value of each input (x) for each class (k) can be
estimated in the normal way by dividing the sum of values by the
total number of values.
muk = 1/nk * sum(x)
– Where muk is the mean value of x for the class k and nk is the number of instances
with class k (the sum runs over the instances of class k). The variance is calculated
across all classes as the average squared difference of each value from its class mean.
sigma^2 = 1 / (n-K) * sum((x - muk)^2)
– Where sigma^2 is the variance across all inputs (x), n is the number of instances,
K is the number of classes and muk is the mean of the class to which x belongs.
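The two estimates above can be sketched for a single input variable as follows. The function name and the toy data are assumptions made for illustration; the arithmetic follows the muk and sigma^2 formulas as written.

```python
from collections import defaultdict


def estimate_lda_parameters(xs, ys):
    """Return the per-class means and the pooled variance for one input variable."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)

    n = len(xs)       # number of instances
    K = len(groups)   # number of classes

    # muk = 1/nk * sum(x) over the instances of class k
    means = {k: sum(values) / len(values) for k, values in groups.items()}

    # sigma^2 = 1/(n-K) * sum of squared differences from each value's class mean
    squared_diffs = sum((x - means[y]) ** 2 for x, y in zip(xs, ys))
    variance = squared_diffs / (n - K)
    return means, variance


# Hypothetical data: class 0 clusters around 2, class 1 around 7
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
means, variance = estimate_lda_parameters(xs, ys)
print(means, variance)
```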

Making Predictions with LDA
• LDA makes predictions by estimating the probability that a new set
of inputs belongs to each class. The class that gets the highest
probability is the output class and a prediction is made.
• The model uses Bayes Theorem to estimate the probabilities. It can
be used to estimate the probability of the output class (k) given the
input (x) using the probability of each class and the probability of the
data belonging to each class:
P(Y=k|X=x) = (PIk * fk(x)) / sum(PIl * fl(x))
– Where PIk refers to the base probability of each class (k) observed in your
training data (e.g. 0.5 for a 50-50 split in a two class problem). In Bayes’
Theorem this is called the prior probability.
PIk = nk/n
– The f(x) above is the estimated probability of x belonging to the class.
– A Gaussian distribution function is used for f(x). Plugging the Gaussian into the
above equation and simplifying we end up with the equation below. This is called
a discriminant function, and the class with the largest value will
be the output classification (y):
Dk(x) = x * (muk/sigma^2) - (muk^2/(2*sigma^2)) + ln(PIk)
– Dk(x) is the discriminant function for class k given input x; muk, sigma^2 and
PIk are all estimated from your data.
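A short sketch of prediction with the discriminant function for one input variable. The class means, shared variance and priors below are hypothetical values assumed to have been estimated already; the helper names are illustrative.

```python
import math


def discriminant(x, mu_k, sigma2, pi_k):
    """Dk(x) = x * (muk / sigma^2) - muk^2 / (2 * sigma^2) + ln(PIk)."""
    return x * (mu_k / sigma2) - (mu_k ** 2) / (2 * sigma2) + math.log(pi_k)


def predict(x, means, sigma2, priors):
    """Return the class whose discriminant value is largest."""
    return max(means, key=lambda k: discriminant(x, means[k], sigma2, priors[k]))


# Assumed estimates for a two-class problem with a 50-50 split
means = {0: 2.0, 1: 7.0}
sigma2 = 1.0
priors = {0: 0.5, 1: 0.5}

print(predict(3.0, means, sigma2, priors))  # closer to the class 0 mean
print(predict(6.0, means, sigma2, priors))  # closer to the class 1 mean
```

With equal priors and a shared variance, the decision boundary falls halfway between the two class means (4.5 in this example).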

How to Prepare Data for LDA
• Here are some suggestions you may consider when preparing your
data for use with LDA.
• Classification Problems.
– This might go without saying, but LDA is intended for classification problems
where the output variable is categorical. LDA supports both binary and multi-
class classification.
• Gaussian Distribution.
– The standard implementation of the model assumes a Gaussian distribution of
the input variables. Consider reviewing the univariate distributions of each
attribute and using transforms to make them more Gaussian-looking (e.g. log
and root for exponential distributions and Box-Cox for skewed distributions).
• Remove Outliers.
– Consider removing outliers from your data. These can skew the basic statistics
used to separate classes in LDA such as the mean and the standard deviation.
• Same Variance.
– LDA assumes that each input variable has the same variance. It is almost
always a good idea to standardize your data before using LDA so that it has a
mean of 0 and a standard deviation of 1.
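Standardizing one input column to a mean of 0 and a standard deviation of 1 might look like the sketch below. It uses the population standard deviation, which is an assumption (the slides do not say which variant to use), and the sample values are made up.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population std)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # would be 0 for a constant column
    return [(v - mean) / std for v in values]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical attribute values
scaled = standardize(raw)
print(scaled)
```

A column whose values are all identical has zero standard deviation and would need to be dropped or handled separately before this step.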

Extensions to LDA
• Quadratic Discriminant Analysis (QDA):
– Each class uses its own estimate of variance (or covariance when there
are multiple input variables).
• Flexible Discriminant Analysis (FDA):
– Where non-linear combinations of inputs are used, such as splines.
• Regularized Discriminant Analysis (RDA):
– Introduces regularization into the estimate of the variance (actually
covariance), moderating the influence of different variables on LDA.

Perceptron
• In machine learning, the perceptron is an algorithm for supervised
learning of binary classifiers.
• A binary classifier is a function which can decide whether or not an
input, represented by a vector of numbers, belongs to some specific
class.
• It is a type of linear classifier, i.e. a classification algorithm that
makes its predictions based on a linear predictor function combining
a set of weights with the feature vector.
• The perceptron was intended to be a machine, rather than a
program.
• Designed for image recognition: it had an array of 400 photocells,
randomly connected to the "neurons". Weights were encoded
in potentiometers, and weight updates during learning were
performed by electric motors.
• The perceptron is a simplified model of a biological neuron. While
the complexity of biological neuron models is often required to fully
understand neural behavior, research suggests a perceptron-like
linear model can produce some behavior seen in real neurons.

Perceptron Function
• The perceptron is a milestone of neural networks
• Tries to simulate the operation of the single neuron
-The output values are Boolean: 0 or 1
-The inputs xi and weights wi are positive or negative real
values
-Three elements: inputs, sum, threshold
-The learning is to select weights and threshold
- The perceptron is an algorithm for learning a binary classifier
called a threshold function:
- a function that maps its input (a real-valued vector) to an output
value (a single binary value):
f(x) = 1 if ∑ wi xi > 0 (the threshold is absorbed into the sum as a bias weight)
f(x) = 0 otherwise

• (From Fig.) The Perceptron receives multiple input signals, and if the
sum of the input signals exceeds a certain threshold, it outputs a
signal; otherwise it does not return an output.
• In the context of supervised learning and classification, this can then
be used to predict the class of a sample.
• The value of f(x) (0 or 1) is used to classify x as either a positive or a
negative instance, in the case of a binary classification problem.
• If the bias b is negative, then the weighted combination of inputs must
produce a positive value greater than |b| in order to push the classifier
neuron over the 0 threshold.
• The perceptron learning algorithm does not terminate if the learning
set is not linearly separable.
• If the vectors are not linearly separable learning will never reach a
point where all vectors are classified properly. The most famous
example of the perceptron's inability to solve problems with linearly
non-separable vectors is the Boolean exclusive-or problem.

• There are two types of Perceptrons: Single layer and Multilayer.
• Single layer Perceptrons can learn only linearly separable patterns.
• Multilayer Perceptrons or feedforward neural networks with two or
more layers have greater processing power.
• The Perceptron algorithm learns the weights for the input signals in
order to draw a linear decision boundary.
• This enables you to distinguish between the two linearly separable
classes +1 and -1.

Perceptron Learning Rule
• With the Perceptron Learning Rule, the algorithm automatically
learns suitable weight coefficients from the training data. The input
features are then multiplied by these weights to determine whether a
neuron fires or not.

Perceptron Function
• Perceptron is a function that maps its input “x”, multiplied
by the learned weight coefficients, to an output value “f(x)”:
f(x) = 1 if w · x + b > 0, and 0 otherwise, where w · x = ∑ wi xi for i = 1…m
• In the equation given above:
– “w” = vector of real-valued weights
– “b” = bias (an element that adjusts the boundary away from origin without any
dependence on the input value)
– “x” = vector of input x values
– “m” = number of inputs to the Perceptron
• The output can be represented as “1” or “0.” It can also be
represented as “1” or “-1” depending on which activation function is
used.

Inputs of a Perceptron
• A Perceptron accepts inputs, moderates them with certain weight
values, then applies the transformation function to output the final
result.
• A Boolean output is based on inputs such as salaried, married, age,
past credit profile, etc. It has only two values: Yes and No or True
and False.
• The summation function “∑” multiplies all inputs of “x” by weights “w”
and then adds them up as follows:
∑ wi xi = w1x1 + w2x2 + … + wnxn

Activation Functions of Perceptron
• The activation function applies a step rule (convert the numerical
output into +1 or -1) to check if the output of the weighting function is
greater than zero or not.
• For example,
– If ∑ wixi> 0 => then final output “o” = 1 (issue bank loan)
– Else, final output “o” = -1 (deny bank loan)
– Step function gets triggered above a certain value of the neuron output; else it outputs
zero.
– Sign Function outputs +1 or -1 depending on whether neuron output is greater than
zero or not.
– Sigmoid is the S-curve and outputs a value between 0 and 1.
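The three activation functions just listed can be written out directly. This is a sketch; the function names and the default threshold of zero are assumptions.

```python
import math


def step(z, threshold=0.0):
    """Step function: 1 above the threshold, otherwise 0."""
    return 1 if z > threshold else 0


def sign(z):
    """Sign function: +1 if the neuron output is greater than zero, else -1."""
    return 1 if z > 0 else -1


def sigmoid(z):
    """Sigmoid: S-shaped curve with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-2.0, 0.5):
    print(z, step(z), sign(z), round(sigmoid(z), 3))
```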

Output of Perceptron
• Perceptron with a Boolean output:
• Inputs: x1…xn
• Output: o(x1….xn)
• Weights: wi=> contribution of input xi to the Perceptron output;
• w0=> bias or threshold
• If ∑w.x > 0, output is +1, else -1. The neuron gets triggered only when
weighted input reaches a certain threshold value.
• An output of +1 specifies that the neuron is triggered. An output of -1
specifies that the neuron did not get triggered.
• “sgn” stands for sign function with output +1 or -1.

Error in Perceptron
• In the Perceptron Learning Rule, the predicted output is compared
with the known output.
• If it does not match, the error is propagated backward to allow
weight adjustment to happen.
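The compare-and-adjust loop described above can be sketched as follows. The slides do not spell out the update formula, so this uses the standard perceptron rule as an assumption: each weight is changed by learning_rate * (known output - predicted output) * input. The learning rate, the epoch limit and the function names are illustrative choices.

```python
def predict(weights, bias, x):
    """Threshold unit: 1 if the weighted sum plus bias is above 0, else 0."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net > 0 else 0


def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    """Return (weights, bias, converged) after applying the perceptron rule."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    converged = False
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # known minus predicted
            if error != 0:
                errors += 1
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if errors == 0:  # a full pass with no mistakes: stop
            converged = True
            break
    return weights, bias, converged


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, b_and, and_converged = train_perceptron(AND)
w_xor, b_xor, xor_converged = train_perceptron(XOR)
print(and_converged, xor_converged)
```

On the linearly separable AND data the loop stops after an error-free pass. On XOR it only stops because of the epoch limit, which matches the statement that the algorithm does not terminate when the learning set is not linearly separable.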
Functions represented by the perceptron
• The perceptron can represent all the primitive Boolean functions AND,
OR, NAND and NOR
• Some Boolean functions cannot be represented by a single perceptron
–E.g. the XOR function (that is 1 if and only if x1 ≠ x2)
requires more perceptrons.
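One way to see that XOR "requires more perceptrons" is to combine three threshold units by hand: an OR unit and a NAND unit feed an AND unit. The weights and thresholds below are one hand-picked choice for illustration, not learned values and not taken from the slides.

```python
def step(z):
    """Threshold activation: 1 if z is above 0, else 0."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    """XOR built from three perceptron-style units in two layers."""
    h_or = step(x1 + x2 - 0.5)         # fires if at least one input is 1
    h_nand = step(-x1 - x2 + 1.5)      # fires unless both inputs are 1
    return step(h_or + h_nand - 1.5)   # AND of the two hidden units


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```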

Multi Layer Perceptron
• A Multi-Layer Perceptron is a more complex architecture of
artificial neural networks. It is formed from multiple layers
of perceptrons.
• The diagrammatic representation of multi-layer perceptron learning is
as shown below
• An MLP consists of at least three layers of nodes:
– an input layer,
– a hidden layer
– an output layer.

• Except for the input nodes, each node is a neuron that uses a
nonlinear activation function.
• Multi Layer perceptron (MLP) is a feedforward neural network with one
or more layers between input and output layer.
• Feedforward means that data flows in one direction from input to output
layer (forward). This type of network is trained with the backpropagation
learning algorithm.
• MLPs are widely used for pattern classification, recognition, prediction
and approximation. Multi Layer Perceptron can solve problems which
are not linearly separable.
• MLP utilizes a supervised learning technique called backpropagation for
training.
• Its multiple layers and non-linear activation distinguish MLP from a
linear perceptron. It can distinguish data that is not linearly separable.
• MLP networks are usually used for supervised learning. The typical
learning algorithm for MLP networks is the backpropagation
algorithm.

Activation function
• If a multilayer perceptron has a linear activation function in all
neurons, that is, a linear function that maps the weighted inputs to
the output of each neuron, then linear algebra shows that any
number of layers can be reduced to a two-layer input-output model.
• In MLPs some neurons use a nonlinear activation function that was
developed to model the frequency of action potentials, or firing, of
biological neurons.
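The claim about linear activations can be checked numerically: two layers with purely linear activation give the same output as a single layer whose weight matrix is the product of the two. The matrices and helper names below are hypothetical, and biases are left out to keep the sketch short.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 3 hidden units -> 2 outputs
x = [2.0, 3.0]

two_layer = matvec(W2, matvec(W1, x))    # pass through both linear layers
collapsed = matvec(matmul(W2, W1), x)    # one layer with the combined weights
print(two_layer, collapsed)
```

Because the two results match, extra linear layers add no representational power, which is why MLPs rely on nonlinear activation functions.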
Learning
• Learning occurs in the perceptron by changing connection weights
after each piece of data is processed, based on the amount of error
in the output compared to the expected result.
• This is an example of supervised learning, and is carried out through
backpropagation, a generalization of the least mean squares
algorithm in the linear perceptron.

Terminology
• The term "multilayer perceptron" does not refer to a single
perceptron that has multiple layers. Rather, it contains many
perceptrons that are organized into layers.
• An alternative is "multilayer perceptron network". Moreover, MLP
"perceptrons" are not perceptrons in the strictest possible sense.
• True perceptrons are formally a special case of artificial neurons that
use a threshold activation function such as the Heaviside step
function.
• MLP perceptrons can employ arbitrary activation functions. A true
perceptron performs binary classification (either this or that), an MLP
neuron is free to either perform classification or regression,
depending upon its activation function.
• The term "multilayer perceptron" later was applied without respect to
nature of the nodes/layers, which can be composed of arbitrarily
defined artificial neurons, and not perceptrons specifically.
• This interpretation avoids the loosening of the definition of
"perceptron" to mean an artificial neuron in general.

Applications
• MLPs are useful in research for their ability to solve problems
stochastically, which often allows approximate solutions for
extremely complex problems.
• Can be used to create mathematical models by regression analysis.
• Used in Speech recognition, Image recognition, Machine translation
software.

Linear Separability
• If a set of patterns can be correctly classified by some perceptron,
then such a set of patterns is said to be linearly separable.
• The term "linear" is used because the perceptron is a linear device.
• The net input is a linear function of the individual inputs and the
output is a linear function of the net input.
• Linear means that there are no square (x^2) or cube (x^3), etc. terms in
the formulas.
• The idea is to
- check if you can separate points in an n-dimensional space
using only n-1 dimensions.

One Dimension
• Example: Let's say you're on a number line. You take any two
numbers. Now, there are two possibilities:
1. You can choose two different numbers
2. You can choose the same number
• If you choose two different numbers, you can always find another
number between them. This number "separates" the two numbers
you chose.
• So, you say that these two numbers are "linearly separable".
• But, if both numbers are the same, you simply cannot separate them.
They're the same. So, they're "linearly inseparable".

Two Dimensions
• On extending this idea to two dimensions, some more possibilities
come into existence. Consider the following:
Fig. Two classes of points

• Here, separate the point (1,1) from the other points. You can see
that there exists a line that does this.
• In fact, there exist infinitely many such lines. So, these two "classes" of
points are linearly separable.
• The first class consists of the point (1,1) and the other class has
(0,1), (1,0) and (0,0).
• In contrast, for the arrangement of black and red points in the figure, you
cannot use one single line to separate the two classes. So, they are
linearly inseparable.

Three Dimensions
• Extending the above example to three dimensions, you need a
plane for separating the two classes.
Fig. Linear separability in 3D space
• The dashed plane separates the red point from the other blue
points. So it's linearly separable.
• If the bottom right point on the opposite side was red too, it would
become linearly inseparable.
Extending to n dimensions
• To separate classes in n-dimensions, we need an n-1 dimensional
"hyperplane".

Linear Regression
• Linear regression is a linear model, i.e. a model that assumes a
linear relationship between the input variables (x) and the single
output variable (y).
• More specifically, that y can be calculated from a linear combination
of the input variables (x).
- When there is a single input variable (x), the method is
referred to as simple linear regression.
- When there are multiple input variables, literature from
statistics often refers to the method as multiple linear regression.

Linear Regression Model Representation
• The representation is a linear equation that combines a specific set
of input values (x); its solution is the predicted output (y) for
that set of input values. As such, both the input values (x) and
the output value are numeric.
• The linear equation assigns one scale factor to each input value or
column, called a coefficient and represented by the capital Greek
letter Beta (B).
• One additional coefficient is also added, giving the line an additional
degree of freedom (e.g. moving up and down on a two-dimensional
plot) and is often called the intercept or the bias coefficient.
• For example, in a simple regression problem (a single x and a
single y), the form of the model would be:
y = B0 + B1*x
• In higher dimensions when we have more than one input (x), the line
is called a plane or a hyper-plane.

• The representation therefore is the form of the equation and the
specific values used for the coefficients (e.g. B0 and B1 in the above
example).
• When a coefficient becomes zero, it effectively removes the
influence of the input variable on the model and therefore from the
prediction made from the model (0 * x = 0).

Linear Regression Learning the Model
1. Simple Linear Regression
• With simple linear regression when we have a single input, we can
use statistics to estimate the coefficients.
• This requires that you calculate statistical properties from the data
such as means, standard deviations, correlations and covariance.
All of the data must be available to traverse and calculate statistics.
• This is fun as an exercise in Excel, but not really useful in practice.
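A sketch of estimating the two coefficients from statistics of the data. The slides do not give the formulas, so the textbook ones are assumed here: B1 = covariance(x, y) / variance(x) and B0 = mean(y) - B1 * mean(x). The data values are made up.

```python
def simple_linear_regression(xs, ys):
    """Estimate B0 and B1 for y = B0 + B1*x from means, covariance and variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of covariance and variance cancel in the ratio
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    b1 = covariance / variance_x
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = simple_linear_regression(xs, ys)
print(b0, b1)
```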
2. Ordinary Least Squares
• When we have more than one input we can use Ordinary Least
Squares to estimate the values of the coefficients.
• The Ordinary Least Squares procedure seeks to minimize the sum
of the squared residuals.
• This means that given a regression line through the data we
calculate the distance from each data point to the regression line,
square it, and sum all of the squared errors together. This is the
quantity that ordinary least squares seeks to minimize.
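The quantity being minimized can be computed directly for any candidate line. The sketch below assumes the "distance" is the vertical difference between the observed and predicted y (the residual); the data and the two candidate lines are hypothetical.

```python
def sum_squared_residuals(b0, b1, xs, ys):
    """Sum over all points of (observed y - predicted y) squared."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 1 + 2x with small noise

sse_good = sum_squared_residuals(1.0, 2.0, xs, ys)   # close to the data
sse_poor = sum_squared_residuals(0.0, 2.5, xs, ys)   # a worse candidate line
print(round(sse_good, 2), round(sse_poor, 2))
```

Ordinary least squares picks the coefficients for which this sum is as small as possible.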

3. Gradient Descent
• When there are one or more inputs you can use a process of
optimizing the values of the coefficients by iteratively minimizing the
error of the model on your training data.
• This operation is called Gradient Descent and works by starting with
random values for each coefficient.
• The sum of the squared errors is calculated over all pairs of input
and output values.
• A learning rate is used as a scale factor and the coefficients are
updated in the direction towards minimizing the error.
• The process is repeated until a minimum sum squared error is
achieved or no further improvement is possible.
• When using this method, you must select a learning rate (alpha)
parameter that determines the size of the improvement step to take
on each iteration of the procedure.
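A minimal gradient descent sketch for a single input. Two simplifications are assumptions of this example: the coefficients start at zero rather than at random values (so the run is reproducible), and the mean squared error is used so that the step size does not depend on the number of examples. The learning rate and iteration count were chosen for this toy data only.

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    """Fit y = b0 + b1*x by repeatedly stepping against the error gradient."""
    b0, b1 = 0.0, 0.0   # assumption: start at zero instead of random values
    n = len(xs)
    for _ in range(epochs):
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to b0 and b1
        grad_b0 = 2 * sum(errors) / n
        grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # The learning rate alpha scales the size of each update
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # hypothetical data on y = 1 + 2x
b0, b1 = gradient_descent(xs, ys)
print(round(b0, 4), round(b1, 4))
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small needs many more iterations.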

4. Regularization
• There are extensions of the training of the linear model called
regularization methods. These seek both to minimize the sum of the
squared error of the model on the training data (using ordinary least
squares) and to reduce the complexity of the model (such as the
number of coefficients or the total absolute size of the coefficients in the model).
• Two popular examples of regularization procedures for linear
regression are:
– Lasso Regression: where Ordinary Least Squares is modified to also minimize
the sum of the absolute values of the coefficients (called L1 regularization).
– Ridge Regression: where Ordinary Least Squares is modified to also minimize
the sum of the squared coefficients (called L2 regularization).
• These methods are effective to use when there is collinearity in your
input values and ordinary least squares would overfit the training
data.
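
The two penalties can be written out directly, and for a single input the effect of Ridge regression can be seen in closed form. This is a sketch under assumptions: the function names, the toy data and the penalty strength lam are made up for this example, and the intercept is handled by centering the data so that it is not penalized.

```python
def l1_penalty(coeffs):
    """Lasso (L1) penalty: sum of the absolute values of the coefficients."""
    return sum(abs(b) for b in coeffs)


def l2_penalty(coeffs):
    """Ridge (L2) penalty: sum of the squared coefficients."""
    return sum(b * b for b in coeffs)


def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 on centered data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # lam = 0 gives the ordinary least squares slope; larger lam shrinks it.
    return sxy / (sxx + lam)


# Toy data (assumed): generated from y = 1 + 2 * x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
ols_slope = ridge_slope(xs, ys, lam=0.0)
shrunk_slope = ridge_slope(xs, ys, lam=10.0)
print(ols_slope, shrunk_slope)
print(l1_penalty([0.5, -2.0]), l2_penalty([0.5, -2.0]))
```

The larger the penalty strength, the more the coefficient is pulled towards zero, which is how regularization limits the size of the coefficients.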

Making Predictions with Linear Regression
• Given the representation is a linear equation, making predictions is
as simple as solving the equation for a specific set of inputs.
• Let’s make this concrete with an example. Imagine we are predicting
weight (y) from height (x). Our linear regression model
representation for this problem would be:
y = B0 + B1 * x
or
weight = B0 + B1 * height
– Where B0 is the bias coefficient and B1 is the coefficient for the height column.
We use a learning technique to find a good set of coefficient values.
– Once found, we can plug in different height values to predict the weight.

• For example, let's use B0 = 0.1 and B1 = 0.5. Let’s plug them in
and calculate the weight (in kilograms) for a person with a height
of 182 centimeters.
weight = 0.1 + 0.5 * 182
weight = 91.1
• You can see that the above equation could be plotted as a line in
two dimensions. B0 is the intercept: the starting point of the line,
regardless of what height we have. We can run through a range of
heights from 100 to 250 centimeters, plug them into the equation and
get weight values, creating our line.
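
The worked example above can be reproduced in code. The function name predict_weight and the 50 cm step used to trace the line are assumptions for illustration; the coefficients B0 = 0.1 and B1 = 0.5 are the ones used in the example.

```python
def predict_weight(height_cm, b0=0.1, b1=0.5):
    """Predict weight (kg) from height (cm) with weight = B0 + B1 * height."""
    return b0 + b1 * height_cm


# The single prediction from the example: a height of 182 cm.
weight = predict_weight(182)
print(round(weight, 1))

# Heights from 100 to 250 cm give the points that make up the line.
line = [(h, predict_weight(h)) for h in range(100, 251, 50)]
print(line)
```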

Preparing Data For Linear Regression
• Linear Assumption. Linear regression assumes that the
relationship between your input and output is linear. It does not
support anything else. This may be obvious, but it is good to
remember when you have a lot of attributes.
• You may need to transform data to make the relationship linear (e.g.
log transform for an exponential relationship).
• Remove Noise. Linear regression assumes that your input and
output variables are not noisy. Consider using data cleaning
operations that let you better expose and clarify the signal in your
data.
• This is most important for the output variable and you want to
remove outliers in the output variable (y) if possible.

• Remove Collinearity. Linear regression will over-fit your data when
you have highly correlated input variables. Consider calculating
pairwise correlations for your input data and removing the most
correlated.
• Gaussian Distributions. Linear regression will make more reliable
predictions if your input and output variables have a Gaussian
distribution. You may get some benefit from using transforms (e.g. log
or Box-Cox) on your variables to make their distribution more
Gaussian looking.
• Rescale Inputs. Linear regression will often make more reliable
predictions if you rescale input variables using standardization or
normalization.
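
Three of the preparation steps above (rescaling, checking pairwise correlation, and a log transform) can be sketched with the standard library. The function names and the toy height data are assumptions for this example; in practice a data analysis library would normally be used instead.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy data (assumed): heights in cm and an output that grows exponentially.
heights = [150, 160, 170, 180, 190]
exp_y = [math.exp(0.02 * h) for h in heights]

# Rescale inputs: the standardized heights have mean 0 and std 1.
scaled = standardize(heights)

# Linear assumption: a log transform turns the exponential relationship
# into a straight line, so the correlation with height becomes 1.
log_y = [math.log(v) for v in exp_y]
print(pearson(heights, exp_y), pearson(heights, log_y))

# Collinearity: a second input that is just a multiple of the first is
# perfectly correlated with it and is a candidate for removal.
heights_in_m = [h / 100 for h in heights]
print(pearson(heights, heights_in_m))
```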

Examples of Machine Learning
• Machine learning has been used in multiple fields and industries.
For example medical diagnosis, image processing, prediction,
classification, learning association, regression etc.
1. Image Recognition
• Image recognition is one of the most common uses of machine
learning. There are many situations where an object in a digital
image needs to be classified.
• Machine learning can be used for face detection in an image as
well. There is a separate category for each person in a database of
several people.
• Machine learning is also used for character recognition, to discern
handwritten as well as printed letters. We can segment a piece of
writing into smaller images, each containing a single character.

2. Medical diagnosis
• Machine learning can be used in techniques and tools that help in
the diagnosis of diseases.
• It is used to analyze clinical parameters and their combinations for
prognosis (for example, predicting disease progression), to extract
medical knowledge for outcome research, and for therapy planning
and patient monitoring.
• These are successful applications of machine learning methods.
• It can help in the integration of computer-based systems in the
healthcare sector.

3. Financial Services
• Machine learning has a lot of potential in the financial and banking
sector. It is the driving force behind the popularity of the financial
services.
• Machine learning can help banks and financial institutions to make
smarter decisions.
• Machine learning can help financial services to spot an account
closure before it occurs. It can also track the spending patterns of
customers.
• It can also perform market analysis. Smart machines can be
trained to track spending patterns. The algorithms can identify
trends easily and can react in real time.

Real Time Examples of Machine Learning
Netflix
• At Netflix, ML has been used continually to improve
recommendations and personalization.
• ML has also expanded into other areas such as content
promotion, price modeling, content delivery and marketing.
• Roughly 80% of activity on the platform reportedly runs through the
recommendation engine.
• The neural network keeps track of user behavior and program
content.
• These are then combined to create multiple taste groups on which
the recommendation engine works.

Uber
• ML is a fundamental part of this tech giant. From estimating arrival
time to determining how far your cab is from your given location,
everything is driven by ML.
• It uses algorithms to determine all of these effectively. It does this
by analyzing the data from previous trips and applying it to the
present situation. Another branch of the company, UberEATS,
does the same. It takes into account various factors like food
preparation time to estimate the delivery time.
Google
• We are already familiar with how Google showcases its ML
products in action with Google Assistant and Google Camera.
It has now extended ML to Gmail and Google Photos too.
• Gmail now has a Smart Reply feature, which suggests brief
responses to an email you have received, based on the content
of that email.

Back propagation of error
• Backpropagation is a supervised learning algorithm, for training
Multi-layer Perceptrons.
Why Do We Need Backpropagation?
• Calculate the error – How far is your model output from the actual
output.
• Minimum Error – Check whether the error is minimized or not.
• Update the parameters – If the error is large, update the
parameters (weights and biases). After that, check the error again.
Repeat the process until the error reaches a minimum.
• Model is ready to make a prediction – Once the error reaches a
minimum, you can feed some inputs to your model and it will
produce the output.

What is Backpropagation..?
• The Backpropagation algorithm looks for the minimum value of the
error function in weight space using a technique called the delta rule
or gradient descent.
• The weights that minimize the error function are then considered to be
a solution to the learning problem.
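
The loop described above (calculate the error, update the weights and biases, repeat) can be sketched for a very small Multi-layer Perceptron. Everything specific here is an assumption made for illustration: the XOR training data, the three sigmoid hidden units, the learning rate alpha = 0.5, the number of epochs and the random seed. The sketch only shows the mechanics of the forward and backward passes; it does not claim that this configuration reaches the best possible error.

```python
import math
import random

# Toy training set (assumed): the XOR function.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(hidden=3, alpha=0.5, epochs=2000, seed=1):
    """Train a 2-input, one-hidden-layer network with backpropagation."""
    rng = random.Random(seed)
    # w_h[j] = [weight for x1, weight for x2, bias] of hidden unit j.
    w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    # w_o = hidden-to-output weights, followed by the output bias.
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    losses = []
    for _ in range(epochs):
        grad_h = [[0.0] * 3 for _ in range(hidden)]
        grad_o = [0.0] * (hidden + 1)
        total = 0.0
        for x, target in DATA:
            # Forward pass: compute hidden activations and the output.
            h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
            out = sigmoid(sum(w_o[j] * h[j] for j in range(hidden)) + w_o[hidden])
            error = out - target
            total += error ** 2
            # Backward pass: propagate the error back through the layers.
            delta_o = error * out * (1 - out)
            for j in range(hidden):
                delta_h = delta_o * w_o[j] * h[j] * (1 - h[j])
                grad_o[j] += delta_o * h[j]
                grad_h[j][0] += delta_h * x[0]
                grad_h[j][1] += delta_h * x[1]
                grad_h[j][2] += delta_h
            grad_o[hidden] += delta_o
        # Gradient descent step on all weights and biases.
        for j in range(hidden):
            w_o[j] -= alpha * grad_o[j]
            for i in range(3):
                w_h[j][i] -= alpha * grad_h[j][i]
        w_o[hidden] -= alpha * grad_o[hidden]
        losses.append(total)
    return losses


losses = train()
print(losses[0], losses[-1])
```

The recorded sum of squared errors should be lower at the end of training than at the start, which is the "check the error, update, repeat" cycle in action.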
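Two Passes in Back Propagation Training
1. Forward-propagation: the inputs are passed through the network to
compute the output and the error.
2. Back-propagation: the error is passed backwards through the network
to update the weights.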
