This document summarizes a talk on machine learning given by Subrat Panda at the SACON conference in Pune, India on May 18-19, 2018. The talk covered the classical definitions of machine learning, the main types of machine learning (supervised, unsupervised, and reinforcement learning), and specific algorithms such as linear regression, logistic regression, and support vector machines.
6. SACON
Gartner Says By 2020, Artificial Intelligence Will Create More Jobs Than It Eliminates
7. SACON
What this talk can motivate people to do
§ STUDENTS:
§ Motivates them to participate in data science competitions
§ Encourages further learning and adding the expertise to their résumés
§ Final-year and fun projects
§ PROFESSIONALS:
§ Find interesting data in your current project and apply machine learning
§ Motivates further learning and a profession change. Data scientists / machine
learning engineers are highly paid professionals :)
§ TEACHERS:
§ Motivates teachers to spread knowledge in their universities
§ Conduct hackathons
18. SACON
Linear Regression
• In supervised learning, our goal is, given a training set, to learn a function h : X
→ Y so that h(x) is a “good” predictor for the corresponding value of y.
Living Area (sq. ft.) | Year Built | Price ($1000s)
--------------------- | ---------- | --------------
2104                  | 2012       | 400
1600                  | 2013       | 300
2400                  | 2014       | 369
1416                  | 2013       | 232
3000                  | 2015       | 540
...                   | ...        | ...
• Let's consider the housing data above. Each x is a two-dimensional vector
(living area, year built) and y is the price of the house.
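A minimal scikit-learn sketch of learning such an h on this data (only the five rows from the table above are used; the query point is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [living area (sq. ft.), year built]; target: price in $1000s
X = np.array([[2104, 2012], [1600, 2013], [2400, 2014],
              [1416, 2013], [3000, 2015]])
y = np.array([400, 300, 369, 232, 540])

h = LinearRegression().fit(X, y)   # learn h : X -> Y
print(h.predict([[2000, 2014]]))   # predicted price for an unseen house
```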
20. SACON
Cost Function I
• Let's approximate y as a linear function of x. The hypothesis function is then
given by:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$

• θ's are the parameters (also called weights) parameterizing the space of linear
functions mapping from X to Y.
• How do we pick, or learn, the parameters θ? One reasonable method is to make
h(x) close to y, at least for the training examples we have. The cost function is
given by:

$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• This is the least-squares cost function that gives rise to the ordinary least
squares regression model.
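A direct NumPy transcription of J(θ) (a sketch: it assumes the design matrix X already contains an intercept column of ones):

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum((h_theta(x) - y)^2)."""
    residual = X @ theta - y   # h_theta(x^(i)) - y^(i) for all examples
    return 0.5 * residual @ residual
```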
22. SACON
Gradient Descent
§ We should use a search algorithm that starts with some “initial guess” for θ, and that
repeatedly changes θ to make J(θ) smaller, until we converge to a value of θ that
minimizes J(θ).
§ The algorithm we choose is the Gradient Descent Algorithm, which starts with some
initial θ and repeatedly performs the following update:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

§ If we calculate the partial derivative, we get the following update for a single
training example:

$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$

α = learning rate.
If α is too small: slow convergence.
If α is too large: J(θ) may not decrease on every iteration and thus may
not converge.
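A minimal NumPy sketch of batch gradient descent on the least-squares cost (averaging the gradient over m and the default step size are my own choices; features should be scaled for a stable step size):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on the least-squares cost J(theta)."""
    X = np.c_[np.ones(len(X)), X]    # prepend x0 = 1 for the intercept term
    theta = np.zeros(X.shape[1])     # some "initial guess" for theta
    for _ in range(iters):
        error = X @ theta - y        # h_theta(x^(i)) - y^(i) for all examples
        theta -= alpha * (X.T @ error) / len(y)   # simultaneous update of all theta_j
    return theta
```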
27. SACON
Normal Equation
§ Given a training set with m examples and n features, define the
design matrix X to be the m-by-n matrix whose rows are the training inputs:

$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$

§ Let y be the m-dimensional vector containing all the target values
from the training set:

$y = \left( y^{(1)}, \ldots, y^{(m)} \right)^T$

§ Then the value of θ that minimizes J(θ) is given in closed form by:

$\theta = (X^T X)^{-1} X^T y$
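The closed form in a few lines of NumPy (a sketch; np.linalg.solve is used instead of forming the inverse explicitly, which is numerically safer):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^{-1} X^T y."""
    X = np.c_[np.ones(len(X)), X]             # intercept column
    return np.linalg.solve(X.T @ X, X.T @ y)  # solve instead of inverting
```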
52. SACON
Nonlinear SVM - Overview
§ An SVM locates a separating hyperplane in the feature space and classifies points in
that space.
§ It does not need to represent the feature space explicitly; it only needs a kernel
function defined on it.
§ The kernel function plays the role of the dot product in the feature space.
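A minimal scikit-learn sketch of a kernelized SVM (the toy dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data; the RBF (Gaussian) kernel stands in for
# the dot product, so the feature space is never represented explicitly.
X, y = make_moons(noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(clf.score(X, y))   # training accuracy
```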
60. SACON
Some Issues
§ Choice of kernel
§ Gaussian or polynomial kernel is default
§ If ineffective, more elaborate kernels are needed
§ Domain experts can give assistance in formulating appropriate
similarity measures
§ Choice of kernel parameters
§ e.g. σ in Gaussian kernel
§ σ is the distance between closest points with different classifications
§ In the absence of reliable criteria, applications rely on the use of a
validation set or cross-validation to set such parameters.
§ Optimization criterion – hard margin vs. soft margin
§ typically settled by a lengthy series of experiments in which various
parameters are tested
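A sketch of the usual cross-validation recipe for setting such parameters (in scikit-learn's RBF kernel, gamma plays the role of 1/(2σ²); the grid values and dataset are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)
# Try every (gamma, C) pair with 5-fold cross-validation and keep the best.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"gamma": [0.01, 0.1, 1, 10], "C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```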
62. SACON
k-Nearest Neighbor Classification
(kNN)
§ Unlike all the previous learning methods, kNN does not build a model
from the training data.
§ To classify a test instance d, define the k-neighborhood P as the k nearest
neighbors of d.
§ Count the number n of training instances in P that belong to class cj.
§ Estimate Pr(cj|d) as n/k.
§ No training is needed, but classification time is linear in the training set size
for each test case.
63. SACON
kNN Algorithm
§ k is usually chosen empirically via a validation set
or cross-validation by trying a range of k values.
§ The distance function is crucial, but depends on
the application.
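A minimal sketch of choosing k by cross-validation (the candidate values and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Try a range of k values and keep the one with the best CV accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean() for k in (1, 3, 5, 7, 9)}
print(max(scores, key=scores.get), scores)
```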
65. SACON
Discussions
§ kNN can deal with complex and arbitrary decision boundaries.
§ Despite its simplicity, researchers have shown that the classification
accuracy of kNN can be quite strong and, in many cases, as accurate as that
of more elaborate methods.
§ kNN is slow at the classification time
§ kNN does not produce an understandable model
68. SACON
TYPES OF CLUSTERING
§ Hierarchical algorithms: these find successive clusters using previously
established clusters.
§ Agglomerative ("bottom-up"): Agglomerative algorithms begin with each element as
a separate cluster and merge them into successively larger clusters.
§ Divisive ("top-down"): Divisive algorithms begin with the whole set and proceed to
divide it into successively smaller clusters.
[Figure: cluster dendrogram]
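A minimal SciPy sketch of agglomerative ("bottom-up") clustering (the random toy data is illustrative; Ward linkage is one common merge criterion), plotting a dendrogram like the one referenced above:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(20, 2))
Z = linkage(X, method="ward")   # successively merge the closest clusters
dendrogram(Z)                   # the merge tree drawn as a dendrogram
plt.show()
```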
71. SACON
§ The maximum norm is given by: $d(x, y) = \max_i |x_i - y_i|$
§ The Mahalanobis distance corrects data for different scales and
correlations in the variables.
§ Inner product space: The angle between two vectors can be used as a
distance measure when clustering high dimensional data
§ Hamming distance (sometimes edit distance) measures the minimum
number of substitutions required to change one member into another.
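These distance measures are all available in scipy.spatial.distance; a short sketch (the vectors and the covariance estimate are illustrative; note that SciPy's hamming returns the fraction, not the count, of differing positions):

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cosine, hamming, mahalanobis

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 1.0])
print(chebyshev(x, y))                 # maximum norm: max_i |x_i - y_i|
print(cosine(x, y))                    # 1 - cos(angle between the vectors)
print(hamming([1, 0, 1], [0, 0, 1]))   # fraction of positions that differ

data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance
print(mahalanobis(x, y, VI))           # corrects for scale and correlation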
73. SACON
§ An algorithm for partitioning (or clustering) N data points into K disjoint
subsets S_j so as to minimize the sum-of-squares criterion

$J = \sum_{j=1}^{K} \sum_{n \in S_j} \left\| x_n - \mu_j \right\|^2$

where $x_n$ is a vector representing the nth data point and $\mu_j$ is the
geometric centroid of the data points in $S_j$.
§ Simply speaking, k-means clustering is an algorithm to categorize or group
objects into K groups based on their attributes/features.
§ K is a positive integer.
§ The grouping is done by minimizing the sum of squared distances
between the data points and the corresponding cluster centroid.
74. SACON
HOW K-MEANS CLUSTERING WORKS?
§ Step 1: Begin with a decision on the value of k = number of
clusters.
§ Step 2: Put any initial partition that classifies the data into k
clusters. You may assign the training samples randomly, or
systematically as follows:
§ Take the first k training samples as single-element
clusters.
§ Assign each of the remaining (N − k) training samples to the
cluster with the nearest centroid. After each assignment,
recompute the centroid of the gaining cluster.
§ Step 3: Take each sample in sequence and compute its distance
from the centroid of each of the clusters. If a sample is not
currently in the cluster with the closest centroid, switch the
sample to that cluster and update the centroids of the cluster
gaining the new sample and the cluster losing the sample.
§ Step 4: Repeat Step 3 until convergence is achieved, that is, until a
pass through the training samples causes no new assignments (see
the sketch below).
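A compact NumPy sketch of these steps (it implements the batch variant, reassigning all points before recomputing centroids, and does not handle the corner case of a cluster becoming empty):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Batch k-means: assign each point to its nearest centroid, recompute."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # Steps 1-2
    for _ in range(iters):
        # Step 3: distance of every point to every centroid, then reassign
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # Step 4: stop when nothing moves
            break
        centroids = new
    return labels, centroids
```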
81. SACON
Definitions
• Overfitting: too much reliance on the training data
• Underfitting: a failure to learn the relationships in the training data
• High Variance: model changes significantly based on training data
• High Bias: assumptions about model lead to ignoring training data
• Overfitting and underfitting cause poor generalization on the test set
• A validation set for model tuning can help prevent underfitting and overfitting
82. SACON
Ways to Deal with
Overfitting and Underfitting
§ Underfitting:
§ Easier to resolve
§ Try different machine learning models
§ Try stronger models with higher capacity (hyperparameter
tuning)
§ Try more features
§ Overfitting
§ Use a resampling technique like K-fold cross validation
§ Improve the feature quality or remove some features
§ Training with more data
§ Early stopping
§ Regularization
§ Ensembling
[Figure: early stopping]
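Early stopping is built into several scikit-learn estimators; a minimal sketch (the dataset and stopping parameters are illustrative): training halts once the score on a held-out validation fraction stops improving.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Hold out 20% as a validation set; stop after 5 epochs with no improvement.
clf = SGDClassifier(early_stopping=True, validation_fraction=0.2,
                    n_iter_no_change=5, random_state=0).fit(X, y)
print(clf.n_iter_)   # epochs actually run before stopping
```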
84. SACON
L1 and L2 Regularization
§ In L2 regularization, we have:

$J(\theta) = \text{Loss} + \lambda \sum_j \theta_j^2$

§ Here, lambda (λ) is the regularization parameter: a hyperparameter whose
value is tuned for better results. L2 regularization is also known as weight
decay, as it forces the weights to decay towards zero (but not exactly zero).
§ In L1 regularization, we have:

$J(\theta) = \text{Loss} + \lambda \sum_j |\theta_j|$

§ Here we penalize the absolute value of the weights. Unlike L2, the weights may
be reduced exactly to zero.
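The two penalties side by side in scikit-learn (a sketch; alpha plays the role of λ, and the synthetic dataset is illustrative). Note how only the L1 model zeroes weights out:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks weights towards zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives some weights exactly to zero
print((ridge.coef_ == 0).sum(), (lasso.coef_ == 0).sum())
```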
86. SACON
Artificial Neural Networks
§ A Single Neuron: The basic unit of computation in a neural network is
the neuron, often called a node or unit.
§ The function f is non-linear and is called the Activation Function.
§ The idea of ANNs is based on the belief that the working of the human brain,
which makes the right connections between neurons, can be imitated using
silicon and wires in place of living neurons and dendrites.
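A single neuron in a few lines of NumPy (a sketch; the sigmoid is just one common choice of the activation function f):

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: weighted sum of inputs passed through a
    non-linear activation function f (here, the sigmoid)."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))   # f = sigmoid activation

print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1))
```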
90. SACON
Back-Propagation
§ The back-propagation (BP) algorithm works by
determining the loss (or error) at the output and then
propagating it back into the network.
§ The weights are updated to minimize the error
resulting from each neuron.
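A minimal NumPy sketch of back-propagation for a two-layer network on a toy task (the architecture, learning rate, and task are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # XOR-like toy labels
W1, W2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 1))

for _ in range(2000):
    # Forward pass: compute the network output
    h = np.tanh(X @ W1)
    p = 1 / (1 + np.exp(-(h @ W2)))
    # Backward pass: determine the error at the output, propagate it back
    d_out = p - y                          # gradient of the logistic loss
    d_h = (d_out @ W2.T) * (1 - h ** 2)    # chain rule through tanh
    W2 -= 0.1 * h.T @ d_out / len(X)       # weight updates reduce the error
    W1 -= 0.1 * X.T @ d_h / len(X)

print(((p > 0.5) == y).mean())             # training accuracy
```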
96. SACON
Methods of splitting: Information gain
Which node can be described easily?
§ Information theory gives a measure of this degree of disorganization in a
system, known as entropy. For a binary node it is

$\text{Entropy} = -p \log_2 p - q \log_2 q$

where p and q are the probabilities of success and failure, respectively, in that
node.
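The formula as a small Python function (a sketch; the 0·log 0 := 0 convention is handled explicitly):

```python
import numpy as np

def entropy(p):
    """Binary entropy: -p*log2(p) - q*log2(q), with q = 1 - p."""
    q = 1.0 - p
    terms = [x * np.log2(x) for x in (p, q) if x > 0]  # 0*log(0) := 0
    return -sum(terms)

print(entropy(0.5))   # 1.0 bit: maximally disorganized node
print(entropy(0.9))   # ~0.469: purer node, easier to describe
```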
97. SACON
Other Tree based methods
§ Ensemble methods combine a group of predictive models to
achieve better accuracy and model stability, managing the
trade-off between bias and variance errors.
§ Bagging is a simple ensembling technique in which we
build many independent predictors/models/learners and
combine them using some model-averaging technique.
§ Random Forest: multiple trees instead of a
single tree; it is a bagging method.
§ To classify a new object based on its
attributes, each tree gives a classification,
and we say the tree “votes” for that class.
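A minimal scikit-learn sketch of the voting idea (the dataset and number of trees are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# 100 independently trained (bagged) trees vote on the class of each sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))
```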
98. SACON
Other Tree based methods
§ Gradient Boosting is a tree ensemble technique that creates a strong classifier
from a number of weak classifiers.
§ It builds an additive model out of weak learners, with each new learner
fitted to correct the errors of the ensemble built so far.
§ Boosting is an ensemble technique in which the predictors are not built
independently, but sequentially.
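A minimal scikit-learn sketch (the dataset and hyperparameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
# Shallow trees (weak learners) are added sequentially; each one corrects
# the mistakes of the ensemble built so far.
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                 learning_rate=0.1).fit(X, y)
print(gbm.score(X, y))
```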
99. SACON
Iris Dataset
§ Three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
§ Four features were measured from each sample: the length and the width of
the sepals and petals, in centimeters.
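The dataset ships with scikit-learn, so it is easy to inspect (a minimal sketch):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)   # sepal/petal length and width, in centimeters
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
```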
100. SACON
References
• Andrew Ng’s Coursera Course
• Scikit Learn Training example on Google
• Nvidia
• Sebastian Ruder’s blog
• HBR
• MIT Tech Review
• Lots of Others
• AI community in general
• IDLI Community