Dimensionality Reduction
1 Introduction
In recent times, data has grown very large, and many applications in
data mining require deriving a classifier or a function estimate from an
extremely large data set. Such data sets provide a large number of labeled
examples, which serve as the basis for classifying data that may arrive in
the future. A labeled data set consists of a large number of features, some
of which may be irrelevant and sometimes even misleading; this is a problem
for an algorithm attempting to generalize from the data. Data sets with
extremely complex feature sets slow down any algorithm that attempts to
classify them and make it difficult to find an optimal result. To decrease
the burden on classifiers and function estimators, we reduce the dimensionality
of the data so that the number of features shrinks substantially. Thus,
dimensionality reduction simplifies data so that it can be processed
efficiently.
Apart from aiding visualization, dimensionality reduction helps reveal the
main features governing a data set. For example, suppose we want to classify
an email as spam or non-spam. A common approach is to represent the email as
a vector of the words appearing in it; the dimensionality of this vector
could easily be in the hundreds. A dimensionality reduction approach may
reveal that there are only a few telling features, such as the words "free"
and "donate", which suffice to classify the email as spam.
There are two broad approaches by which we can reduce the dimensionality of
a given data set:
1. Linear Dimensionality Reduction
2. Non-linear Dimensionality Reduction
2 Linear Dimensionality Reduction
The most popular algorithm for dimensionality reduction is Principal
Component Analysis (PCA). Given a data set, PCA finds the directions along
which the data has maximum variance, along with the relative importance of
these directions. An example explains PCA more intuitively. Suppose the data
we have is the surface of a teapot, and we need to capture the most
information about the 3D teapot.
To achieve this, we rotate the teapot into the position that gives the most
visual information. The method is as follows: first, find the axis along
which the object has the largest average extent (the red axis). Next, rotate
the object around the first axis to find the axis that is perpendicular to
the first and along which the object has the largest average extent (the
green axis).

Figure 1: Finding the Principal Components

The two axes found are the first and second principal components, and the
average extents along these axes are called the eigenvalues.
Mathematically, the steps involved in PCA are as follows. Suppose we have n
documents and m terms overall:
1. Construct an m × n term-document matrix A. Each document is represented
as a column vector of m dimensions.

2. Compute the empirical mean of each term.

3. Compute the normalized matrix by subtracting the empirical mean from each
data dimension; the mean subtracted is the average across that dimension.

4. Calculate the m × m term covariance matrix from the normalized matrix.

5. Calculate the eigenvectors and eigenvalues of the covariance matrix.
Since the covariance matrix is square, its eigenvectors and eigenvalues can
always be computed. It is important to note that these eigenvectors are unit
eigenvectors, which is essential for PCA.
6. Once the eigenvectors are found from the covariance matrix, the next step
is to order them by eigenvalue, highest to lowest. This gives the components
in order of significance. Now, if we wish, we can ignore the components of
lower significance; we may lose some information, but if the eigenvalues are
small, we do not lose much. The final data set will then have fewer
dimensions. To be precise, if you originally have n dimensions in your data,
you calculate n eigenvectors and n eigenvalues; if you then choose only the
first p eigenvectors, the final data set has only p dimensions. The value of
p can be decided by computing the cumulative energy of the eigenvalues:
choose p such that the cumulative energy is above a certain threshold, say
90% of the overall energy.
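These steps translate directly into code. Below is a minimal NumPy sketch of
the procedure, assuming the term-document setup described above; the function
name pca and the choice to return the projected data rather than the
eigenvectors are illustrative assumptions, not part of any standard library.

```python
import numpy as np

def pca(A, p):
    """PCA sketch following the steps above.

    A: m x n term-document matrix (each column is a document),
    p: number of principal components to keep.
    """
    mean = A.mean(axis=1, keepdims=True)          # step 2: empirical mean of each term
    B = A - mean                                   # step 3: normalized matrix
    C = np.cov(B)                                  # step 4: m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # step 5: eigh, since C is symmetric
    order = np.argsort(eigvals)[::-1]              # step 6: sort by eigenvalue, high to low
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    energy = np.cumsum(eigvals) / np.sum(eigvals)  # cumulative energy of the eigenvalues
    # p could instead be chosen as the smallest value with energy[p-1] >= 0.9
    return eigvecs[:, :p].T @ B                    # p x n projection of the data
```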
Figure 2: Principal Components Analysis on 3D data
Despite PCA’s popularity, it has a number of drawbacks. One of the major
drawbacks is the requirement that the data lie on a linear subspace. For
example, in Figure 4 below (known as a swiss roll), the data is actually a
2-dimensional manifold, but PCA will not correctly extract this structure.
There are other approaches to reducing the number of dimensions. It has been
observed that high-dimensional data is often much simpler than its apparent
dimensionality suggests. In other words, a high-dimensional data set may
contain many features that are all measurements of the same underlying cause
and are thus very closely related; consider, for example, video footage of a
single object taken from multiple angles simultaneously. The features of
such a data set contain a lot of overlapping information. This idea is
formalized using the notion of a manifold.
3 Non-Linear Dimensionality Reduction
3.1 Manifolds
A manifold is, informally, a low-dimensional Euclidean subspace onto which a
higher-dimensional space can be mapped. A more general topological manifold
can be described as a topological space that, on a small enough scale,
resembles the Euclidean space of a specific dimension, called the dimension
of the manifold. Thus, a line and a circle are one-dimensional manifolds, a
plane and a sphere are two-dimensional manifolds, and so on.
Figure 3: Manifolds. (a) The sphere (the surface of a ball) is a
two-dimensional manifold, since it can be represented by a collection of
two-dimensional maps. (b) A 1D manifold embedded in 3D. Source: Wikipedia
In Figure 3(a), notice that the triangle drawn on the 3D globe can actually
be represented linearly in a 2D space. In Figure 3(b), notice that although
the curve sits in 3D, it has zero volume and zero area; the 3D
representation is therefore somewhat misleading, since the curve can be
represented as a line (1D).
3.2 Manifold Learning
Manifold learning is one of the most popular approaches to non-linear
dimensionality reduction. The algorithms used for this task are based on the
idea that the data actually lies in a low-dimensional space that is embedded
in a high-dimensional space, where the low-dimensional space reflects the
underlying parameters. Manifold learning algorithms try to recover these
parameters in order to find a low-dimensional representation of the data.
Some widely used algorithms for this purpose are Isomap, Locally Linear
Embedding, Laplacian Eigenmaps, and Semidefinite Embedding. The best example
used to explain manifold learning is the swiss roll, a 2D manifold embedded
in 3D, shown in the figure below.
Figure 4: Swiss Roll Manifold
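For experimenting with the algorithms below, the swiss roll can be sampled
directly. The following is a small sketch under an assumed parameterization;
the scaling constants and the name make_swiss_roll are illustrative choices,
not taken from the text.

```python
import numpy as np

def make_swiss_roll(n=1000, seed=0):
    """Sample n points from a 2D swiss-roll manifold embedded in 3D.

    The two underlying manifold parameters are t (roll angle) and h (height).
    """
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.random(n))  # angle along the roll (assumed range)
    h = 21.0 * rng.random(n)                   # position along the roll's axis
    X = np.column_stack((t * np.cos(t), h, t * np.sin(t)))
    return X, t                                # t parameterizes the manifold
```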
3.2.1 Isomap
Isomap, short for isometric feature mapping, was one of the first algorithms
introduced for manifold learning, and it remains one of the most widely
applied procedures for the problem.
Isomap consists of two main steps:
1. Estimate the geodesic distances (distances along the manifold) between
points, using shortest-path distances on the data set’s k-nearest-neighbour
graph.

2. Use Multidimensional Scaling (MDS) to map the distances obtained in the
first step onto a low-dimensional Euclidean space, preserving the interpoint
distances computed in the first step.
Estimating geodesic distances

A geodesic is defined as a curve that locally minimizes the distance between
two points on a mathematically defined space, such as a curved manifold.
Equivalently, it is a path of minimal curvature. In non-curved
three-dimensional space, the geodesic is a straight line.

We assume that the data lies in D dimensions and that the manifold has
dimension d. Isomap further assumes that there is a chart that preserves the
distances between points: if $x_i, x_j$ are points on the manifold and
$G(x_i, x_j)$ is the geodesic distance between them, then there is a chart
$f : M \to \mathbb{R}^d$ such that
$$\|f(x_i) - f(x_j)\| = G(x_i, x_j),$$
and the manifold is smooth enough that the geodesic distances between nearby
points are approximately linear.
Multidimensional Scaling (MDS)

After finding the geodesic distances, Isomap finds points whose Euclidean
distances are equal to these geodesic distances; since the manifold is
isometrically embedded, such points exist. Multidimensional Scaling is a
classical technique that may be used to find them. From a matrix of
dissimilarities, MDS constructs a set of points whose interpoint Euclidean
distances closely match the distances in the data’s actual dimension D.
Isomap uses the classical MDS (cMDS) algorithm to minimize the cost. The
cMDS algorithm takes as input a matrix giving the dissimilarities between
pairs of items and outputs a coordinate matrix whose configuration minimizes
a loss function called strain.
Hence, first compute the pairwise distances for a given set of m vectors
$(x_1, x_2, \ldots, x_m)$ in n-dimensional space:
$$\Delta = \begin{pmatrix} 0 & \delta_{1,2} & \cdots & \delta_{1,m} \\ \delta_{2,1} & 0 & \cdots & \delta_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ \delta_{m,1} & \delta_{m,2} & \cdots & 0 \end{pmatrix}$$
Then, map the vectors onto a manifold of lower dimension $k \ll n$, subject
to the following optimization criterion:
$$\min_{x_1, x_2, \ldots, x_m} \sum_{i<j} \left( \|x_i - x_j\| - \delta_{i,j} \right)^2$$
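A minimal sketch of the cMDS step, assuming a symmetric dissimilarity matrix
Delta as constructed above; the function name cmds and the clipping of small
negative eigenvalues are illustrative choices.

```python
import numpy as np

def cmds(Delta, k):
    """Classical MDS sketch: recover k-dimensional coordinates from an
    m x m matrix of pairwise dissimilarities Delta."""
    m = Delta.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m           # centering matrix
    B = -0.5 * J @ (Delta ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # B is symmetric
    top = np.argsort(eigvals)[::-1][:k]           # k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[top], 0))  # clip tiny negative eigenvalues
    return eigvecs[:, top] * scale                # m x k coordinate matrix
```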
How Isomap works

The Isomap algorithm estimates the geodesic distances using shortest-path
algorithms and then finds an embedding of these distances in Euclidean space
using the cMDS algorithm.
Algorithm 1: Isomap
input: $x_1, x_2, \ldots, x_n \in \mathbb{R}^D$, k

1. Form the k-nearest-neighbour graph with edge weights
$W_{ij} := \|x_i - x_j\|$ for neighbouring points $x_i, x_j$.

2. Compute the shortest-path distances between all pairs of points using
Dijkstra’s or Floyd’s algorithm. Store the squares of these distances in the
Euclidean distance matrix D.

3. Return Y := cMDS(D).
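The pseudocode above can be realized with NumPy and SciPy. The sketch below
reuses the cmds() function from the earlier MDS sketch (which squares its
input internally, so the unsquared geodesic distances are passed in); the
kNN-graph construction and the choice of Dijkstra’s method are one
straightforward reading of the algorithm, not a reference implementation.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, k_neighbors, d):
    """Isomap sketch; reuses cmds() from the sketch above.

    X: n x D data matrix, k_neighbors: neighbourhood size, d: target dimension.
    Assumes the kNN graph is connected (a standard Isomap caveat).
    """
    n = X.shape[0]
    dist = squareform(pdist(X))                    # pairwise Euclidean distances
    # Step 1: k-nearest-neighbour graph (np.inf marks a missing edge).
    graph = np.full((n, n), np.inf)
    for i in range(n):
        nearest = np.argsort(dist[i])[1:k_neighbors + 1]
        graph[i, nearest] = dist[i, nearest]
    # Step 2: geodesic distances as shortest paths (Dijkstra's algorithm).
    geo = shortest_path(graph, method="D", directed=False)
    # Step 3: embed the geodesic distances with classical MDS.
    return cmds(geo, d)
```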
Figure 5: Isomap Manifold Learning
One particularly helpful feature of Isomap - not found in some of the
other algorithms - is that it automatically provides an estimate of the dimen-
sionality of the underlying manifold. In particular, the number of non-zero
eigenvalues found by Classical MDS (cMDS) gives the underlying dimension-
ality.
3.2.2 Locally Linear Embedding (LLE)
LLE assumes the manifold to be a collection of overlapping coordinate
patches; if the neighbourhood sizes are small and the manifold is smooth,
the patches can be regarded as almost linear. LLE also begins by finding a
set of nearest neighbours for each point. It then computes, for each point,
a set of weights that best reconstructs that point from its neighbours.
Finally, it uses an eigenvector-based optimization to find the
low-dimensional embedding of the points, such that the reconstruction
weights are preserved and the nonlinear structure of the manifold is
maintained in the low-dimensional space.
Figure 6: LLE algorithm
To be more precise, the LLE algorithm is given as inputs an n × p data
matrix X with rows $x_i$, a desired number of dimensions q < p, and an
integer k for finding local neighbourhoods, where k ≥ q + 1. The output is
an n × q matrix Y with rows $y_i$.

The steps involved in the LLE algorithm are given below:
Algorithm 2: Locally Linear Embedding (LLE)

1. For each $x_i$, find the k nearest neighbours.

2. Find the weight matrix W which minimizes the residual sum of squares for
reconstructing each $x_i$ from its neighbours,
$$RSS(W) \equiv \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{n} w_{ij} x_j \Big\|^2,$$
where $w_{ij} = 0$ unless $x_j$ is one of $x_i$’s k nearest neighbours, and
for each i, $\sum_j w_{ij} = 1$.

3. Find the coordinates Y which minimize the reconstruction error under
those weights,
$$\Phi(Y) \equiv \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} w_{ij} y_j \Big\|^2,$$
subject to the constraints that $\sum_i Y_{ij} = 0$ for each j and that
$Y^T Y = I$.
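A minimal NumPy sketch of Algorithm 2. The regularization of the local Gram
matrix (the reg parameter) is an assumption added for numerical stability,
as practical LLE implementations commonly do; it is not part of the
algorithm as stated above.

```python
import numpy as np

def lle(X, k, q, reg=1e-3):
    """LLE sketch: k neighbours, q output dimensions."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]   # step 1: k nearest neighbours
    # Step 2: reconstruction weights minimizing the RSS, rows summing to 1.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                     # neighbours centred on x_i
        C = Z @ Z.T                                    # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)             # regularize (assumption)
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbors[i]] = w / w.sum()               # enforce sum-to-one
    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W); the constant
    # eigenvector (eigenvalue ~ 0) is discarded, giving the n x q embedding Y.
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:q + 1]
```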
3.2.3 Isomap vs Locally Linear Embedding (LLE)
Embedding Type

Isomap looks for an isometric embedding, i.e. it assumes that there is a
coordinate chart from the parameter space to the high-dimensional space that
preserves interpoint distances, and it attempts to uncover this chart. LLE
looks for conformal mappings, i.e. mappings which preserve local distances
between points but not the distances between all points.

Local vs Global

Isomap is a global method because it considers the geodesic distances
between all pairs of points on the manifold. LLE is a local method because
it constructs an embedding considering only the placement of each point with
respect to its neighbours.
3.3 Applications of Manifold Learning
Manifold learning methods are adaptable data-representation techniques that
enable dimensionality reduction and processing tasks in meaningful spaces.
Their success in medical image analysis, as well as in other scientific
fields, lies in both their flexibility and the simplicity of their
application. In medical imaging, manifold learning has been successfully
used to visualize, cluster, classify, and fuse high-dimensional data, as
well as for Content-Based Image Retrieval (CBIR), segmentation,
registration, statistical population analysis, and shape modeling and
classification.
1. Patient position detection in MRI: manifold learning is used to detect
the patient’s position from the low-resolution images acquired during the
initial placement of the patient in the scanner.

2. Prediction of protein quaternary structure using the Isomap method.

3. Medical image analysis, with applications to video endoscopy and 4D
imaging.

4. Spectral clustering.

5. Identifying the growth or shrinkage of diseased cells or tumours in
neuroimaging applications.

6. Character recognition.

7. Image and video indexing: research is ongoing into using manifold
learning for image and video indexing. There are millions of videos on the
internet, stored in repositories along with information about the people who
create and share them. When a person queries for an image or a video, we
need to effectively identify duplication and copyright status of the image
or videos. For this purpose, manifold learning is being used for image/video
analysis, indexing, and searching.