This document discusses bias and variance in machine learning models. It begins by introducing bias as a stronger force that is always present and harder to eliminate than variance. Several examples of bias are provided. Through simulations of sampling from a normal distribution, it is shown that sample statistics like the mean and standard deviation are always biased compared to the population parameters. Sample size also impacts bias, with larger samples having lower bias. Variance refers to a model's ability to generalize, with higher variance indicating overfitting. The tradeoff between bias and variance is that reducing one increases the other. Several techniques for optimizing this tradeoff are discussed, including cross-validation, bagging, boosting, dimensionality reduction, and changing the model complexity.
What next?
This document is for those who have met the prerequisites for a course on Machine
Learning. You must be familiar with basic principles and ideas drawn from
Statistics, Linear Algebra, Calculus and Probability theory. This is not a
beginner's document, even though the material presented is very basic.
Now that we have played around with individual supervised learners, we have an
obligation to pause and ponder. If everything we have learned in the last 2 months
is so promising, why did we never achieve 100% accuracy, even on the small toy
datasets we worked with?
What is the reason?
We may attribute it to one of several plausible underlying causes. Perhaps we can
think of them as process-related (the manner in which we conducted these exercises)
or as intrinsic problems unrelated to the process.
So let us begin our exploration with intrinsic factors that prevent us
from learning to perfection. They are Bias and Variance. Understanding Bias
and Variance at a deeper level is our goal for this lecture.
Bias is the stronger force: it is always present and it is much harder to
eliminate. There are several forms of bias, and most of them are attributable not
to the process but to the data we are trying to learn from, the exception being
inductive bias. Bias is present in the data for any number of reasons --
selection bias (the Landon vs. FDR election polling of 1936), survivorship bias
(Outliers by Gladwell, Wald's work with the Navy). There are many other forms of
bias; I encourage you to consult our collective consciousness, the www.
In Statistics, bias is indicated when the expected value of a sample statistic
differs from the population parameter. Let us conduct a small experiment to
understand bias.
You have seen the notation (mean, variance) for the Normal Distribution. What
exactly does this notation refer to? Population or sample? My answer would be the
population, as it encompasses any and every observation that is part of the normal
distribution with that mean and variance.
You have also seen the R function rnorm(100, mean, sigma), which returns a sample
of 100 observations from a normal distribution with those parameters (note that the
third argument is the standard deviation, not the variance).
Let us compute the mean of the sample observations and compare it with the
theoretical mean.
# I am omitting the seed so that we all get different samples; our numbers will
# differ, but we will all come to the same conclusion below.
s1 <- rnorm(100, 10, 8)
paste("Sample is ", ifelse(mean(s1) != 10, "biased. ", "unbiased. "),
      "Sample statistic is mean.", sep = "")
# Let us consider another sample statistic, sqrt(variance): the standard deviation.
paste("Sample is ", ifelse(sd(s1) != 8, "biased. ", "unbiased. "),
      "Sample statistic is sd.", sep = "")
This is a single sample. What would happen if we did this 1000 times? We can think
of the average over those repetitions as an approximation to the expected value of
the statistic.
# Let us find the mean of a sample of 1000, averaged over 1000 samples
N <- 1000
L <- unlist(lapply(1:N, FUN = function(x) { s <- rnorm(1000, 10, 8)
  list(mean = mean(s), sd = sd(s)) }))
# L interleaves the two statistics: means sit at the odd positions, sds at the even
mean(L[seq(1, 2 * N, 2)])   # average of the 1000 sample means
mean(L[seq(2, 2 * N, 2)])   # average of the 1000 sample sds
hist(L[seq(1, 2 * N, 2)])
# one run gave 10.0252 for the mean and 7.997218 for the sd
Every time we run this, we will get different results. That is the point. The
sample mean is essentially never exactly the theoretical mean of 10. This
difference is the bias. Every dataset we will ever work with is a sample drawn from
some unknown distribution. Accordingly, we anticipate bias in any statistic we
estimate from a given sample. We cannot avoid that.
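To connect this back to the definition above, bias is E[statistic] - parameter. A
minimal sketch that approximates the expectation by brute force (the replication
count and sample size here are arbitrary choices of mine):
# Monte Carlo approximation of the bias of the sample mean and the sample sd
bias.mean <- mean(replicate(1000, mean(rnorm(100, 10, 8)))) - 10
bias.sd   <- mean(replicate(1000, sd(rnorm(100, 10, 8))))  - 8
bias.mean; bias.sd   # both small; bias.sd is typically slightly negative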
Bias and sample size
How does Bias vary with Sample Size?
# What can we expect if the sample size is 100, but averaged over 1000 samples as
# before?
N <- 1000
sz <- 100
L <- unlist(lapply(1:N, FUN = function(x, m = sz) { s <- rnorm(m, 10, 8)
  list(mean = mean(s), sd = sd(s)) }))
paste("Sample is ", ifelse(mean(L[seq(1, 2 * N, 2)]) != 10, "biased. ", "unbiased. "),
      "Sample statistic is mean.", sep = "")
paste("Sample is ", ifelse(mean(L[seq(2, 2 * N, 2)]) != 8, "biased. ", "unbiased. "),
      "Sample statistic is sd.", sep = "")
We can plot a histogram of the means
hist(L[seq(1, 2 * N, 2)])
Also, this is an opportune time to digress a bit toward the CLT and the LLN.
CLT -- regardless of the underlying distribution, the sampling distribution of the
sample mean approaches the normal curve.
We can run shapiro.test on our collection of sample means as an informal check.
shapiro.test(L[seq(1, 2 * N, 2)])
        Shapiro-Wilk normality test
data:  L[seq(1, 2 * N, 2)]
W = 0.99663, p-value = 0.3825
The null hypothesis of Shapiro's test is that the data come from a normally
distributed population. Since the p-value is > 0.05, we cannot reject the null.
That is, the sample means are not significantly different from a normal
distribution, which is consistent with the CLT.
LLN -- as the sample size grows, the sample statistic converges to the population
parameter.
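A quick sketch of the LLN at work (the sample sizes below are arbitrary): the
sample mean settles toward the population mean of 10 as n grows.
# sample mean approaches the population mean (10) as the sample size grows
sapply(c(10, 100, 1000, 10000, 100000), function(n) mean(rnorm(n, 10, 8)))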
When the training error is large and accuracy is close to 50%, we consider the
learner unable to learn: the hypothesis is too simple. This is referred to as
underfitting. We attribute underfitting to bias, because the model is too simple to
estimate the parameters and learn the structure of the data presented.
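As an illustration (my own sketch, not from the lecture code), here is an underfit
model: a straight line forced onto clearly quadratic data leaves most of the
structure unexplained, no matter how much data we supply.
# Underfitting sketch: a linear hypothesis is too simple for quadratic data
set.seed(1)                       # seeded only to make the illustration reproducible
x <- seq(-3, 3, length.out = 200)
y <- x^2 + rnorm(200, 0, 0.5)     # the true structure is quadratic
fit <- lm(y ~ x)                  # the hypothesis is a straight line
summary(fit)$r.squared            # very low R^2 even on the training data -> high bias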
https://becominghuman.ai/machine-learning-bias-vs-variance-641f924e6c57
https://www.mygreatlearning.com/blog/bias-variance-trade-off-in-machine-learning/
https://www.mygreatlearning.com/blog/overfitting-and-underfitting-in-machine-learning/
Now the Variance
As shown above, there is slight variation from sample to sample; no two samples are
alike. That comes with iid sampling.
When a learner is unable to generalize, that is, when the model cannot perform as
well on never-seen-before data as it did on the training set, we attribute that to
variance, and this is generally referred to as over-fitting.
Another way to think of this: given different data, our classifier performs
differently. In a classification problem, the same observation is classified
differently given different training sets; the prediction fluctuates around the
true class label, depending on the training set.
The model is more complex than necessary and captures the noise.
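A sketch of the same idea in code (the polynomial degrees, sample sizes and noise
level are arbitrary choices of mine): a very flexible model chases the noise in its
training set and typically does worse on fresh data.
# Over-fitting sketch: degree-15 vs. degree-2 polynomial on noisy quadratic data
set.seed(2)
x <- runif(50, -3, 3);     y <- x^2 + rnorm(50)          # training set
x.new <- runif(50, -3, 3); y.new <- x.new^2 + rnorm(50)  # never-seen-before data
f2  <- lm(y ~ poly(x, 2))
f15 <- lm(y ~ poly(x, 15))
c(mean(resid(f2)^2), mean(resid(f15)^2))   # training error falls with complexity ...
c(mean((y.new - predict(f2,  data.frame(x = x.new)))^2),   # ... but the error on new
  mean((y.new - predict(f15, data.frame(x = x.new)))^2))   # data typically rises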
Total Error = bias^2 + variance + irreducible error
Most introductory books on M/L establish this constraint algebraically using first
principles. Given this constraint, variance and bias cannot be simultaneously
reduced.
The optimal point is where the variance and the (squared) bias curves cross, when
plotted as a function of model complexity.
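The decomposition can also be checked empirically. Below is a minimal sketch of my
own (the sin() target, noise level and polynomial degrees are arbitrary
assumptions): refit a simple and a flexible model on many training sets and split
the error at one test point into bias^2 and variance.
# Empirical bias^2 and variance at one test point x0, over many training sets
set.seed(3)
f.true <- function(x) sin(2 * x)
x0 <- 1.5
decomp <- function(degree, reps = 500) {
  p <- replicate(reps, {
    x <- runif(40, -3, 3); y <- f.true(x) + rnorm(40, 0, 1)
    predict(lm(y ~ poly(x, degree)), data.frame(x = x0))
  })
  c(bias.sq = (mean(p) - f.true(x0))^2, variance = var(p))
}
sapply(c(simple = 1, flexible = 9), decomp)
# expect bias.sq to dominate for the simple model and variance for the flexible one;
# the irreducible error is the noise variance (here 1) in either case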
http://scott.fortmann-roe.com/docs/BiasVariance.html
https://www.analyticsvidhya.com/blog/2020/08/bias-and-variance-tradeoff-machine-learning/
Bias/Variance Tradeoff
As mentioned before, variance and bias cannot be simultaneously reduced. If we seek
low variance, we will end up with high bias, and vice versa. That is the
Bias/Variance tradeoff. By this logic, a model with high bias will have low
variance and a model with high variance will have low bias. This gives us a handle
for improving classifiers. We can start with a high-variance (low-bias) model and
reduce its variance, ending up with a model that has low bias and lower variance.
Or we can start with a low-variance (high-bias) model and seek to reduce its bias,
ending up with a model that has low variance and lower bias.
What are our options?
What can we vary or tune? What are the variables? Our dataset has N observations
and p features. We can vary them by considering different training sets. We also
now know numerous strategies to classify. Therefore, we have the following
variables, and we can vary them to optimize performance:
1. Vary the training set (keep N constant, but change the observations)
2. Consider different features (p) in our training set
3. Train different models
Let us consider some well known techniques.
Methods to optimize bias/variance:
1. Cross-Validation: Cross-validation in its simplest form is a one-round
   validation, where we leave one sample out for validation and use the rest for
   training the model. This form of CV is called LOO-CV (leave-one-out CV).
   There is another family of cross-validation where we split the data into a
   number of equal-sized folds; 10, 5, and 3 are common. In a k-fold CV, we
   divide the dataset into k disjoint folds. One fold is kept as the testing set
   and the other k-1 folds are used as the training set. This process is repeated
   over all k folds and the average is taken as the CV metric. The difference
   between LOO and k-fold CV is that in LOO-CV there are as many folds as there
   are observations: each fold is 1 observation. Every observation participates
   in k-1 training sessions and once as a test observation. From a Big Data
   perspective cross-validation lends itself to parallelization, and MapReduce is
   an appropriate strategy. (A short k-fold sketch appears after this list.)
2. Boosting combines many "weak" individual models in an ensemble that has
lower bias than the individual models. Boosting is what we do naturally.
When something goes wrong, we try to fix those parts that are erroneous and
iteratively eliminate all the errors making the solution error free.
Boosting therefore is iterative and does not lend itself to
parallelization. Not a good candidate for Big data experimentation.
3. Bagging combines "weak" learners in a way that reduces their variance. In
   bagging, we generate a number of bootstrap samples, train a model on each,
   and then average the predictions in regression or take the majority vote in
   classification. Bagging is therefore highly parallelizable and a good
   candidate for Big Data experimentation.
4. In k-nearest neighbor models, a high value of k leads to high bias and low
   variance. Think of a dataset with two classes, and assume the class
   proportions are 70/30. If we set k to N, the size of the dataset, kNN will
   classify all new observations as belonging to the majority class: a biased
   classifier. kNNs are also amenable to parallelization; given an unknown
   observation, the distances to the training observations can be computed in
   parallel.
5. Early Stopping : Early stopping rules provide guidance as to how many
iterations can be run before the learner begins to over-fit. This technique
is often used in neural nets and Tree algorithms. Tree algorithms are
parallelizable.
6. Pruning : Pruning is used extensively while building CART (Tree) models. In
decision trees, the depth of the tree determines the variance. Decision
trees are commonly pruned to control variance.
7. Regularization: Linear and Generalized linear models can be regularized to
decrease their variance at the cost of increasing their bias. Iterative and
therefore not parallelizable.
8. Dimensionality reduction and feature selection can decrease variance by
   simplifying models and getting rid of correlated features. Dimensionality
   reduction methods rely on linear algebra techniques involving matrix
   inversion and multiplication. Such operations are parallelizable and hence
   good candidates for Big Data experimentation.
9. Adding features (predictors) tends to decrease bias, at the expense of
introducing additional variance.
10. A larger training set tends to decrease variance. Thus, a higher fold
cross validation results in lower variance.
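Here is the k-fold sketch promised in item 1. It uses only base R; the built-in
mtcars data, the mpg ~ wt + hp model and k = 5 are arbitrary choices for
illustration, not part of the lecture material.
# Minimal k-fold cross-validation, written out by hand
set.seed(4)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))    # random fold assignment
cv.mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]                         # k-1 folds for training
  test  <- mtcars[folds == i, ]                         # 1 fold held out
  fit <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, test))^2)
})
mean(cv.mse)   # the CV metric is the average over the k held-out folds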
More experimentation, varying sample size:
sd10<-unlist(lapply(1:1000,FUN=function(x)sd(rnorm(10,mean=10,sd=4))))
sd100<-unlist(lapply(1:1000,FUN=function(x)sd(rnorm(100,mean=10,sd=4))))
sd1000<-unlist(lapply(1:1000,FUN=function(x)sd(rnorm(1000,mean=10,sd=4))))
sd10000<-unlist(lapply(1:1000,FUN=function(x)sd(rnorm(10000,mean=10,sd=4))))
adf<-data.frame(ex10=sd10,ex100=sd100,ex1000=sd1000,ex10000=sd10000)
apply(adf,2,mean)
ex10 ex100 ex1000 ex10000
3.893439 3.980448 4.001862 3.997796
apply(adf,2,sd)
ex10 ex100 ex1000 ex10000
0.94593649 0.28605537 0.08785607 0.02829969
Note that the spread of the sd estimates (the variance, as measured by their sd) is
much lower for larger samples than for smaller samples. Note also that the average
estimate moves closer to the true value of 4 as the sample size grows.
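A quick way to see this shrinkage visually, reusing the adf data frame built above:
# the spread of the 1000 sd estimates shrinks as the sample size grows
boxplot(adf, xlab = "sample size used for each estimate", ylab = "estimated sd")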