The document discusses machine learning and provides information about several key concepts:
1) Machine learning allows computer systems to learn from data without being explicitly programmed by using statistical techniques to identify patterns in large amounts of data.
2) There are three main approaches to machine learning: supervised learning, which uses labeled data to build predictive models; unsupervised learning, which finds patterns in unlabeled data; and reinforcement learning, which learns from successes and failures.
3) Effective machine learning requires balancing model complexity, amount of training data, and ability to generalize to new examples in order to avoid underfitting or overfitting the data. Learning algorithms aim to minimize these risks.
2. Can we make computers learn?
Machine learning is a field of artificial intelligence that
uses statistical techniques to give computer systems the
ability to "learn" from data, without being explicitly
programmed.
- Arthur Samuel
Here, by learning we mean progressively improving
performance, through experience, on a specific task.
Machine Learning
3. The Need for Algorithms
To solve a problem on a computer, we need an algorithm.
Searching
Sorting
Numerical computations
However, there are problems for which algorithms do
not exist
Recognizing characters
Distinguishing spam emails from legitimate ones.
This lack of knowledge is made up for by data.
4. Programming and Machine Learning
The difference between “Programming” and “Machine
Learning” is similar to that between “Mathematics” and
“Statistics”.
It is believed that there is an unknown underlying
process that explains the data we observe. By analyzing
the available data, we intend to understand this process
as much as possible. Though identifying the complete
process may not be possible, we can still detect certain
patterns or regularities. This detection is the function of
machine learning.
Machine Learning is essentially learning from data.
5. Learning Program
Definition: A computer program is said to learn from
experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
To formulate a well-defined learning problem, we must
identify
The class of tasks T
The measure of performance P
The source of experience E
-Tom M. Mitchell
6. ML Applications
Alexa
Alexa is a virtual assistant developed by Amazon. It is
used in Echo Smart Speakers and many other devices.
Watson
IBM Watson was created as a question-answering
computer system. It won the quiz show Jeopardy! in
2011 against legendary champions.
Driverless cars
Machine learning algorithms are extensively used in
driverless cars.
7. Machine Learning Approaches
Supervised learning
In supervised learning, the goal is to learn to predict the
value of an outcome measure based on a number of
input measures, for which the correct output values are
provided by a supervisor.
Supervised learning is like using data to build predictive
models.
8. ML Approaches
Unsupervised learning
In unsupervised learning, there is no such supervisor,
and no outcome measure to be predicted. We only have
input data, and the goal is to find and describe
regularities in the input. We want to see ‘what generally
happens, and what does not’.
Unsupervised learning is like identifying probability
distributions from data.
9. ML Approaches
Reinforcement learning
Reinforcement learning is learning what to do so as to
maximize a numerical reward. The learner is not told
which actions to take; instead, the learner must discover
which actions are most rewarding by trying them.
The actions may affect not only the immediate reward
but also the next situation and, through that, all
subsequent rewards.
Reinforcement learning is learning from successes and
mistakes.
10. ML and Data Mining
The manual extraction of patterns from data has been
practiced for centuries. The techniques that have been
used for this purpose come in the purview of Statistics.
The term Data Mining, in this context, refers to the
automatic (or semi-automatic) analysis of large
quantities of data to extract previously unknown,
interesting patterns.
11. Data Mining as a tool for ML
Machine learning uses a variety of techniques that are
part of Data Mining. In that sense, Data Mining
provides tools for machine learning.
While data mining involves discovery of knowledge
from data, machine learning takes the journey forward
by utilizing the discovered knowledge for developing
intelligent systems.
The use of learning algorithms is what distinguishes
‘Machine Learning’ from ‘Data Mining’. Learning
algorithms utilize the knowledge discovered by applying
Data Mining.
12. ML as a tool for Data Mining
Recall that machine learning makes the machines learn
to perform specified tasks.
If the task to be performed is ‘Mining’ of patterns,
machine learning can be used as a tool to make
machines learn to mine patterns.
It is this vice-versa relationship between the two fields,
which has led to a lot of confusion between ‘Data
Mining’ and ‘Machine Learning’ in the community.
13. Artificial Intelligence and ML
Artificial Intelligence
Artificial intelligence (AI) is the intelligence
demonstrated by machines.
It refers to the ability of a computer (or a computer-
controlled machine) to perform tasks that are commonly
associated with intelligent beings.
This term is frequently used to describe the projects of
developing systems with abilities that are characteristic
of humans, such as the ability to reason, discover
meaning, generalize, or learn from past experience.
14. The discipline of AI was born during a summer
workshop organized by John McCarthy at Dartmouth
College in 1956, named “The Dartmouth Summer
Research Project on Artificial Intelligence”.
The basis of the workshop was the conjecture that
Every aspect of learning or any other feature of
intelligence can, in principle, be so precisely described
that a machine can be made to simulate it.
Machine learning is an area of AI that deals with
designing systems that possess the trait of learning.
15. Supervised Learning
The goal of supervised learning is to predict the values of
output using the values of features.
In statistical terminology, features are called predictors.
Similarly, the outputs are called responses or dependent
variables.
A prediction task is called regression when the output is a
quantitative measure, and for qualitative output, the
prediction task is called classification.
16. Prediction Problem as a Decision Problem
It is useful to think of a prediction problem as a decision
problem. We know that the optimal solutions to
decision problems depend on the loss function under
consideration.
The most common choices of loss functions are
For numerical output: the squared error loss function
For qualitative output: the probability of wrong prediction
However, there can be several other loss functions that
are more suitable in a given context.
17. Best prediction
Suppose we are faced with the problem of predicting the
value of a random variable Y. Further, suppose we know
the probability distribution P of Y.
If Y is numeric, and the loss function is the squared error
loss function, we know that the expected loss is
minimized when Ŷ = E_P(Y).
If Y is qualitative, and the loss function is the probability
of wrong prediction, we know that the expected loss is
minimized when Ŷ = Mode_P(Y).
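Both facts can be checked numerically. A small sketch (the distribution and class counts below are made up for illustration):

```python
import random

random.seed(0)

# Numeric case: for a constant prediction under squared-error loss,
# shifting away from the sample mean always increases the expected loss.
ys = [random.gauss(5.0, 2.0) for _ in range(100_000)]
mean_y = sum(ys) / len(ys)

def expected_squared_loss(pred):
    return sum((y - pred) ** 2 for y in ys) / len(ys)

loss_at_mean = expected_squared_loss(mean_y)
loss_shifted = expected_squared_loss(mean_y + 1.0)

# Qualitative case: predicting the most frequent class (the mode) gives the
# smallest probability of a wrong prediction among constant predictions.
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
mode = max(set(labels), key=labels.count)

def error_rate(pred):
    return sum(1 for y in labels if y != pred) / len(labels)
```

Here `loss_at_mean` is smaller than `loss_shifted`, and the mode "a" has the lowest error rate of the three classes.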
18. Best prediction in supervised learning
Recall that ‘supervised learning’ is a problem of
predicting the value of output Y for the given values of
features X1, X2, …, Xp. Consequently,
In a regression/ classification problem, we need to
estimate the conditional probability distribution of Y
given X1, X2, …, Xp and use the estimate of the expected
value/ mode of this conditional distribution as the
prediction.
Different strategies for estimating this conditional
distribution lead to different machine learning
solutions.
19. Supervised Learning Process
Let Y be a quantitative response, and X1,X2, . . .,Xp be the p
different predictors. We assume that there is some
relationship between Y and X = (X1,X2, . . .,Xp), which can be
written as
Y = f(X) + ε
Here f(x) = E(Y|X=x) is some fixed but unknown function of
x, and ε is an error term having some probability distribution
with mean 0 (zero).
Supervised learning refers to a set of approaches for learning
(estimating) f using a training dataset. Once an estimate f̂ is
learnt, Y can be predicted for a given input X = x as
Ŷ = f̂(x)
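As an illustration of this process (a sketch with a made-up linear f and noise level, not taken from the slides): simulate Y = f(X) + ε, estimate f by least squares, and predict at a new input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (our own, for illustration): the true but "unknown"
# function is f(x) = 2 + 3x, and Y = f(X) + eps with E[eps] = 0.
n = 1000
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 1, size=n)
y = 2 + 3 * x + eps

# Estimate f from the training data by least squares: f_hat(x) = b0 + b1 * x.
X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Predict Y for a new input X = 4 using the learnt f_hat.
y_hat = b0 + b1 * 4.0
```

The estimates (b0, b1) recover the true parameters (2, 3) up to sampling noise, and y_hat is close to f(4) = 14.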
20. Three components of Risk
The expected loss (risk) in using Ŷ = f̂(x) as a prediction
for Y at a given value x of X is
E[(Y − f̂(X))² | X = x]
which can be decomposed into two terms as
E[(Y − f̂(X))² | X = x] = E[(Y − f(X))² | X = x] + [f(x) − f̂(x)]²
The first of the two terms is called the irreducible error,
and the second one is called the reducible error.
The reducible error can be further decomposed into
variance and bias.
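The decomposition can be verified by a quick Monte Carlo check (our own numeric sketch; the function, noise level, x value, and imperfect prediction are all made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assume (for illustration) f(x) = 2 + 3x, noise sd sigma = 1, and a fixed,
# imperfect prediction f_hat(x0) at the point x0 = 4 (true f(x0) = 14).
def f(x):
    return 2 + 3 * x

x0, sigma = 4.0, 1.0
f_hat_x0 = 13.5

# Monte Carlo estimate of the risk E[(Y - f_hat(x0))^2 | X = x0].
y = f(x0) + rng.normal(0, sigma, size=1_000_000)
risk = np.mean((y - f_hat_x0) ** 2)

irreducible = sigma ** 2                # E[(Y - f(x0))^2 | X = x0]
reducible = (f(x0) - f_hat_x0) ** 2     # (f(x0) - f_hat(x0))^2
```

The simulated risk matches irreducible + reducible = 1.0 + 0.25 = 1.25 up to Monte Carlo error.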
21. Estimating f
There are several possible functions to choose from.
We need to select a function f̂ that can explain the
examples in the training data set and, at the same time,
is capable of making good predictions for examples
not in the training data set.
The above requirement implies that every new training
example reduces the search space.
In principle, enough training examples should,
therefore, lead to only one function f̂ at the end. In
that case, f̂ is the solution to the problem.
22. The Inductive bias
In practice, however, the data by itself is not sufficient to
find a unique solution function f̂. Such a learning
problem is said to be an ill-posed problem.
Because learning problems are generally ill-posed, we
need to make some extra assumptions to obtain a unique
solution from the data we have.
The set of assumptions we make to make learning
possible is said to introduce the inductive bias of the
learning algorithm.
In general, learning is not possible without inductive bias.
23. The triple trade-off
While choosing a model with minimum inductive
bias, it is important to understand that the goal of
learning is to achieve the best generalization rather than
the best explanation of the training data.
Avoid underfitting and overfitting.
Typically, we need to balance the triple trade-off between
The complexity of the class F
The amount of training data
The generalization error
24. Two approaches of learning
Parametric approach
In this approach, we assume (inductive bias) that the
conditional probability distribution F of the outcome Y given
the predictors X1, X2, …, Xp belongs to a parametric family F.
The problem of learning F then reduces to that of learning the
model parameters, using approaches such as ML estimation.
E.g., Linear Regression
Non-parametric approach
In this approach, F is directly constructed from the training
data.
E.g., KNN regression
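A side-by-side sketch of the two approaches on the same made-up data (the numbers and names below are ours): linear regression via the normal equations, versus a hand-rolled KNN regressor built directly from the training points.

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy linear training data (true relation y = 1 + 2x, our assumption).
x = rng.uniform(0, 10, size=500)
y = 1 + 2 * x + rng.normal(0, 0.5, size=500)

# (a) Parametric: learn just two parameters (intercept, slope).
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

def linreg_predict(x0):
    return beta[0] + beta[1] * x0

# (b) Non-parametric: average the y-values of the k nearest training points.
def knn_predict(x0, k=10):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()
```

Because the true relation here really is linear, both predictors give similar answers near the middle of the data; they would diverge on nonlinear data, where KNN adapts and the linear model cannot.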
25. Classifiers
KNN Classifiers
F(y|X=x) is estimated by the frequency distribution of Y
from the training examples in the neighborhood of x.
Bayesian classifier
The posterior distribution F(y|X=x) is estimated using
Bayes’ theorem, where both the prior distribution F(y) and
the conditional distribution F(x|y) are estimated as
frequency distributions from the training data.
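A sketch of the KNN classifier on made-up 1-D data (all names and numbers are ours): estimate F(y|X=x) by the class frequencies among the k nearest training examples, then predict the mode of that estimated distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two synthetic classes: class 0 centered at 0, class 1 centered at 4.
x_train = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
y_train = np.array([0] * 100 + [1] * 100)

def knn_classify(x0, k=15):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    counts = np.bincount(y_train[nearest], minlength=2)
    freq = counts / k               # estimated conditional distribution of Y
    return int(np.argmax(freq))     # its mode is the predicted class
```

A point near 0 is classified as class 0 and a point near 4 as class 1, since the local frequency distribution is dominated by the corresponding class.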
26. Classifiers
Logistic regression
F(Y|X=x) is assumed to be Bernoulli/ Multinomial with
parameters as parametric functions of X. These
parameters are learnt using approaches such as ML
estimation.
Classification Trees
The training space is iteratively partitioned using one
feature at a time until each partition is reasonably pure
with respect to Y (i.e., most examples in the partition
belong to one class).
27. Ensemble Learning
When multiple estimators are available for a quantity or
function, an appropriate combination of these estimators
is often better than any individual estimator.
This principle is exploited by Ensemble Learning.
Bagging: This technique uses bootstrap resampling.
Boosting: This technique uses adaptive resampling/reweighting.
Random forest: Ensemble learning for decision trees.
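A minimal bagging sketch (our illustration, with a crude "stump"-style base learner on made-up data; a real random forest would use full decision trees): fit the base learner on bootstrap resamples of the training set and average the predictions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data: y = sin(x) plus noise (our assumption).
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(0, 0.3, size=300)

def stump_predict(xs, ys, x0):
    # Crude base learner: split at the median of xs and predict the mean y
    # of the training points on the same side as x0.
    t = np.median(xs)
    side = ys[xs <= t] if x0 <= t else ys[xs > t]
    return side.mean()

def bagged_predict(x0, n_boot=50):
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))   # bootstrap resample
        preds.append(stump_predict(x[idx], y[idx], x0))
    return float(np.mean(preds))
```

Averaging over bootstrap resamples reduces the variance of the unstable base learner, which is the core idea behind bagging.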
28. Feedback loops
The learning process should result in improvement of
performance as more experience is accumulated.
This requires the incorporation of a feedback loop: the
feedback loop updates (improves) the learned model,
which results in better performance.
29. Updating regression estimators
When a new example x becomes available, we can update the
learned values of the regression coefficients as
β̂_new = (X′X)_new⁻¹ X′_new Y_new
where, by the rank-one update formula,
(X′X)_new⁻¹ = (X′X + x x′)⁻¹
= (X′X)⁻¹ − [(X′X)⁻¹ x] [(X′X)⁻¹ x]′ / (1 + x′ (X′X)⁻¹ x)
Instead of updating the model with every new example, one
can also adopt the strategy of updating the model after every
batch of examples.
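The rank-one update (the Sherman–Morrison formula) can be sanity-checked numerically; this sketch with made-up data compares the updated inverse against recomputing it from scratch:

```python
import numpy as np

rng = np.random.default_rng(5)

# Old design matrix X (50 examples, 3 features) and its Gram inverse.
n, p = 50, 3
X = rng.normal(size=(n, p))
A_inv = np.linalg.inv(X.T @ X)          # (X'X)^{-1} from the old data

x_new = rng.normal(size=(p, 1))         # one new example, as a column vector

# (X'X + x x')^{-1} = (X'X)^{-1} - [(X'X)^{-1} x][(X'X)^{-1} x]' / (1 + x'(X'X)^{-1} x)
u = A_inv @ x_new
A_inv_new = A_inv - (u @ u.T) / (1.0 + (x_new.T @ u).item())

# Brute-force recomputation for comparison.
A_inv_direct = np.linalg.inv(X.T @ X + x_new @ x_new.T)
```

The updated inverse agrees with the brute-force one to numerical precision, while costing only O(p²) per new example instead of O(p³).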