Advanced Data Analysis Techniques

1
Module 6 RM
Advanced data analysis techniques
1) Correlation and regression analysis (PL SEE PDF)
Regression and correlation analysis:
Regression analysis involves identifying the relationship between a dependent variable
and one or more independent variables. A model of the relationship is hypothesized, and
estimates of the parameter values are used to develop an estimated regression equation.
Various tests are then employed to determine if the model is satisfactory. If the model is
deemed satisfactory, the estimated regression equation can be used to predict the value of
the dependent variable given values for the independent variables.
Regression model.
In simple linear regression, the model used to describe the relationship between a single
dependent variable y and a single independent variable x is y = a0 + a1x + k. a0and a1
are referred to as the model parameters, and is a probabilistic error term that accounts for
the variability in y that cannot be explained by the linear relationship with x. If the error
term were not present, the model would be deterministic; in that case, knowledge of the
value of x would be sufficient to determine the value of y.
Least squares method.
Either a simple or multiple regression model is initially posed as a hypothesis concerning
the relationship among the dependent and independent variables. The least squares
method is the most widely used procedure for developing estimates of the model
parameters.
As an illustration of regression analysis and the least squares method, suppose a
university medical centre is investigating the relationship between stress and blood
pressure. Assume that both a stress test score and a blood pressure reading have been
recorded for a sample of 20 patients. The data are shown graphically in the figure below,
called a scatter diagram. Values of the independent variable, stress test score, are given
on the horizontal axis, and values of the dependent variable, blood pressure, are shown on
the vertical axis. The line passing through the data points is the graph of the estimated
regression equation: y = 42.3 + 0.49x. The parameter estimates, b0 = 42.3 and b1 = 0.49,
were obtained using the least squares method.

2
Correlation.
Correlation and regression analysis are related in the sense that both deal with
relationships among variables. The correlation coefficient is a measure of linear
association between two variables. Values of the correlation coefficient are always
between -1 and +1. A correlation coefficient of +1 indicates that two variables are
perfectly related in a positive linear sense; a correlation coefficient of -1 indicates that
two variables are perfectly related in a negative linear sense, and a correlation coefficient
of 0 indicates that there is no linear relationship between the two variables. For simple
linear regression, the sample correlation coefficient is the square root of the coefficient of
determination, with the sign of the correlation coefficient being the same as the sign of
b1, the coefficient of x1 in the estimated regression equation.
Neither regression nor correlation analyses can be interpreted as establishing cause-and-
effect relationships. They can indicate only how or to what extent variables are associated
with each other. The correlation coefficient measures only the degree of linear
association between two variables. Any conclusions about a cause-and-effect relationship
must be based on the judgment of the analyst.
What is the difference between correlation and linear regression?
Correlation and linear regression are not the same.
What is the goal?
Correlation quantifies the degree to which two variables are related. Correlation does not
fit a line through the data points. You simply are computing a correlation coefficient (r)
that tells you how much one variable tends to change when the other one does. When r is
0.0, there is no relationship. When r is positive, there is a trend that one variable goes up
as the other one goes up. When r is negative, there is a trend that one variable goes up as
the other one goes down.
Linear regression finds the best line that predicts Y from X.
What kind of data?
Correlation is almost always used when you measure both variables. It rarely is
appropriate when one variable is something you experimentally manipulate.
Linear regression is usually used when X is a variable you manipulate (time,
concentration, etc.)

3
Does it matter which variable is X and which is Y?
With correlation, you don't have to think about cause and effect. It doesn't matter which
of the two variables you call "X" and which you call "Y". You'll get the same correlation
coefficient if you swap the two.
The decision of which variable you call "X" and which you call "Y" matters in
regression, as you'll get a different best-fit line if you swap the two. The line that best
predicts Y from X is not the same as the line that predicts X from Y (however both those
lines have the same value for R2)
Assumptions
The correlation coefficient itself is simply a way to describe how two variables vary
together, so it can be computed and interpreted for any two variables. Further inferences,
however, require an additional assumption -- that both X and Y are measured, and both
are sampled from Gaussian distributions. This is called a bivariate Gaussian distribution.
If those assumptions are true, then you can interpret the confidence interval of r and the P
value testing the null hypothesis that there really is no correlation between the two
variables (and any correlation you observed is a consequence of random sampling).
With linear regression, the X values can be measured or can be a variable controlled by
the experimenter. The X values are not assumed to be sampled from a Gaussian
distribution. The vertical distances of the points from the best-fit line (the residuals) are
assumed to follow a Gaussian distribution, with the SD of the scatter not related to the X
or Y values.
Relationship between results
Correlation computes the value of the Pearson correlation coefficient, r. Its value ranges
from -1 to +1.
Linear regression quantifies goodness of fit with r2, sometimes shown in uppercase as
R2. If you put the same data into correlation (which is rarely appropriate; see above), the
square of r from correlation will equal r2 from regression
2) INTRODUCTION TO FACTOR ANALYSIS
A Brief Introduction to Factor Analysis
1. Introduction
Factor analysis attempts to represent a set of observed variables X1, X2 …. Xn in terms
of a number of 'common' factors plus a factor which is unique to each variable. The
common factors (sometimes called latent variables) are hypothetical variables which
explain why a number of variables are correlated with each other -- it is because they
have one or more factors in common.
A concrete physical example may help. Say we measured the size of various parts of the
body of a random sample of humans: for example, such things as height, leg, arm, finger,
foot and toe lengths and head, chest, waist, arm and leg circumferences, the distance

4
between eyes, etc. We'd expect that many of the measurements would be correlated, and
we'd say that the explanation for these correlations is that there is a common underlying
factor of body size. It is this kind of common factor that we are looking for with factor
analysis, although in psychology the factors may be less tangible than body size.
To carry the body measurement example further, we probably wouldn't expect body size
to explain all of the variability of the measurements: for example, there might be a
lankiness factor, which would explain some of the variability of the circumference
measures and limb lengths, and perhaps another factor for head size which would have
some independence from body size (what factors emerge is very dependent on what
variables are measured). Even with a number of common factors such as body size,
lankiness and head size, we still wouldn't expect to account for all of the variability in the
measures (or explain all of the correlations), so the factor analysis model includes a
unique factor for each variable which accounts for the variability of that variable which is
not due to any of the common factors.
Why carry out factor analyses? If we can summarise a multitude of measurements with a
smaller number of factors without losing too much information, we have achieved some
economy of description, which is one of the goals of scientific investigation. It is also
possible that factor analysis will allow us to test theories involving variables which are
hard to measure directly. Finally, at a more prosaic level, factor analysis can help us
establish that sets of questionnaire items (observed variables) are in fact all measuring the
same underlying factor (perhaps with varying reliability) and so can be combined to form
a more reliable measure of that factor.
There are a number of different varieties of factor analysis: the discussion here is limited
to principal axis factor analysis and factor solutions in which the common factors are
uncorrelated with each other. It is also assumed that the observed variables are
standardized (mean zero, standard deviation of one) and that the factor analysis is based
on the correlation matrix of the observed variables.
2. The Factor Analysis Model
If the observed variables are X1, X2 …. Xn, the common factors are F1, F2 … Fm and
the unique factors are U1, U2 …Un , the variables may be expressed as linear functions
of the factors:
X1 = a11F1 + a12F2 + a13F3 + … + a1mFm + a1U1
X2 = a21F1 + a22F2 + a23F3 + … + a2mFm + a2U2
….
Xn = an1F1 + an2F2 + an3F3 + … + anmFm + anUn (1)
Each of these equations is a regression equation; factor analysis seeks to find the
coefficients a11, a12 … anm which best reproduce the observed variables from the
factors. The coefficients a11, a12 … anm are weights in the same way as regression
coefficients (because the variables are standardised, the constant is zero, and so is not
shown). For example, the coefficient a11 shows the effect on variable X1 of a one-unit
increase in F1. In factor analysis, the

5
coefficients are called loadings (a variable is said to 'load' on a factor) and, when the
factors are uncorrelated, they also show the correlation between each variable and a given
factor. In the model above, a11 is the loading for variable X1 on F1, a23 is the loading
for variable X2 on F3, etc.
When the coefficients are correlations, i.e., when the factors are uncorrelated, the sum of
the squares of the loadings for variable X1, namely a11 2 + a12 2 + … + a13 2, shows
the proportion of the variance of variable X1 which is accounted for by the common
factors. This is called the communality. The larger the communality for each variable, the
more successful a factor analysis solution is.
By the same token, the sum of the squares of the coefficients for a factor -- for F1 it
would be [a112 + a212 + … + an12] -- shows the proportion of the variance of all the
variables which is accounted for by that factor.
Why use factor analysis?
Factor analysis is a useful tool for investigating variable relationships for complex
concepts such as socioeconomic status, dietary patterns, or psychological scales.
It allows researchers to investigate concepts that are not easily measured directly by
collapsing a large number of variables into a few interpretable underlying factors.
What is a factor?
The key concept of factor analysis is that multiple observed variables have similar
patterns of responses because of their association with an underlying latent variable, the
factor, which cannot easily be measured.
For example, people may respond similarly to questions about income, education, and
occupation, which are all associated with the latent variable socioeconomic status.
In every factor analysis, there are the same number of factors as there are variables. Each
factor captures a certain amount of the overall variance in the observed variables, and the
factors are always listed in order of how much variation they explain.
The eigenvalue is a measure of how much of the variance of the observed variables a
factor explains. Any factor with an eigenvalue ≥1 explains more variance than a single
observed variable.
So if the factor for socioeconomic status had an eigenvalue of 2.3 it would explain as
much variance as 2.3 of the three variables. This factor, which captures most of the
variance in those three variables, could then be used in other analyses.
The factors that explain the least amount of variance are generally discarded. Deciding
how many factors are useful to retain will be the subject of another post.

6
What are factor loadings?
The relationship of each variable to the underlying factor is expressed by the so-called
factor loading. Here is an example of the output of a simple factor analysis looking at
indicators of wealth, with just six variables and two resulting factors.
Variables Factor 1 Factor 2
Income 0.65 0.11
Education 0.59 0.25
Occupation 0.48 0.19
House value 0.38 0.60
Number of public parks in
neighborhood
0.13 0.57
Number of violent crimes per
year in neighborhood
0.23 0.55
The variable with the strongest association to the underlying latent variable. Factor 1, is
income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one
could also say that the variable income has a correlation of 0.65 with Factor 1. This
would be considered a strong association for a factor analysis in most research fields.
Two other variables, education and occupation, are also associated with Factor 1. Based
on the variables loading highly onto Factor 1, we could call it “Individual socioeconomic
status.”
House value, number of public parks, and number of violent crimes per year, however,
have high factor loadings on the other factor, Factor 2. They seem to indicate the overall
wealth within the neighborhood, so we may want to call Factor 2 “Neighborhood
socioeconomic status.”
Notice that the variable house value also is marginally important in Factor 1 (loading =
0.38). This makes sense, since the value of a person’s house should be associated with his
or her income.
3) DISCRIMINANT ANALYSIS
The purposes of discriminant analysis (DA)
Discriminant Function Analysis (DA) undertakes the same task as multiple linear
regression by predicting an outcome. However, multiple linear regression is limited to
cases where the dependent variable on the Y axis is an interval variable so that the
combination of predictors will, through the regression equation, produce estimated mean
population numerical . Y values for given values of weighted combinations of X values.
But many interesting variables are categorical, such as political party voting intention,
migrant/non-migrant status, making a profit or not, holding a particular credit card,
owning, renting or paying a mortgage for a house, employed/unemployed, satisfied
versus dissatisfied employees, which customers are likely to buy a product or not buy,

7
what distinguishes Stellar Bean clients from Gloria Beans clients, whether a person is a
credit risk or not, etc.
DA is used when:
 the dependent is categorical with the predictor IV’s at interval level such as age,
income, attitudes, perceptions, and years of education, although dummy variables
can be used as predictors as in multiple regression. Logistic regression IV’s can be
of any level of measurement.
 there are more than two DV categories, unlike logistic regression, which is limited
to a dichotomous dependent variable.
Discriminant analysis linear equation
DA involves the determination of a linear equation like regression that will predict which
group the case belongs to. The form of the equation or function is:
D v X v X v X ........v X a 1 1
variable
variables
This function is similar to a regression equation or function. The v’s are unstandardized
discriminant coefficients analogous to the b’s in the regression equation. These v’s
maximize the distance between the means of the criterion (dependent) variable.
Standardized discriminant coefficients can also be used like beta weight in regression.
Good predictors tend to have large weights. What you want this function to do is
maximize the distance between the categories, i.e. come up with an equation that has
strong discriminatory power between groups. After using an existing set of data to
calculate the discriminant function and classify cases, any new cases can then be
classified. The number of discriminant functions is one less the number of groups. There
is only one function for the basic two group discriminant analysis.
Assumptions of discriminant analysis
The major underlying assumptions of DA are:
 the observations are a random sample;
 each predictor variable is normally distributed
 each of the allocations for the dependent categories in the initial classification are
correctly classified;
 there must be at least two groups or categories, with each case belonging to only
one group so that the groups are mutually exclusive and collectively exhaustive
(all cases can be placed in a group);
 each group or category must be well defined, clearly differentiated from any other
group(s) and natural. Putting a median split on an attitude scale is not a natural
way to form groups. Partitioning quantitative variables is only justifiable if there
are easily identifiable gaps at the points of division;

8
 for instance, three groups taking three available levels of amounts of housing loan;
the groups or categories should be defined before collecting the data;
 the attribute(s) used to separate the groups should discriminate quite clearly
between
 the groups so that group or category overlap is clearly non-existent or minimal;
 group sizes of the dependent should not be grossly different and should be at least
five times the number of independent variables.
There are several purposes of DA:
 To investigate differences between groups on the basis of the attributes of the
cases, indicating which attributes contribute most to group separation. The
descriptive technique successively identifies the linear combination of attributes
known as canonical discriminant functions (equations) which contribute
maximally to group separation.
 Predictive DA addresses the question of how to assign new cases to groups. The
DA function uses a person’s scores on the predictor variables to predict the
category to which the individual belongs.
 To determine the most parsimonious way to distinguish between groups.
 To classify cases into groups. Statistical significance tests using chi square enable
you to see how well the function separates the groups.
 To test theory whether cases are classified as predicted.
Discriminant Analysis
Discriminant Analysis may be used for two objectives: either we want to assess the
adequacy of classification, given the group memberships of the objects under study; or
we wish to assign objects to one of a number of (known) groups of objects. Discriminant
Analysis may thus have a descriptive or a predictive objective.
In both cases, some group assignments must be known before carrying out the
Discriminant Analysis. Such group assignments, or labelling, may be arrived at in any
way. Hence Discriminant Analysis can be employed as a useful complement to Cluster
Analysis (in order to judge the results of the latter) or Principal Components Analysis.
Alternatively, in star-galaxy separation, for instance, using digitised images, the analyst
may define group (stars, galaxies) membership visually for a conveniently small training
set or design set.
Methods implemented in this area are Multiple Discriminant Analysis, Fisher's Linear
Discriminant Analysis, and K-Nearest Neighbours Discriminant Analysis.
Multiple Discriminant Analysis
(MDA) is also termed Discriminant Factor Analysis and Canonical Discriminant
Analysis. It adopts a similar perspective to PCA: the rows of the data matrix to be
examined constitute points in a multidimensional space, as also do the group mean
vectors. Discriminating axes are determined in this space, in such a way that
optimal separation of the predefined groups is attained. As with PCA, the problem

9
becomes mathematically the eigen reduction of a real, symmetric matrix. The
eigen values represent the discriminating power of the associated eigenvectors.
Then Ygroups lie in a space of dimension at most nY - 1. This will be the number
of discriminant axes or factors obtainable in the most common practical case when
n > m > nY (where n is the number of rows, and m the number of columns of the
input data matrix).
Linear Discriminant Analysis
is the 2-group case of MDA. It optimally separates two groups, using the
Mahalanobis metric or generalized distance. It also gives the same linear
separating decision surface as Bayesian maximum likelihood discrimination in the
case of equal class covariance matrices.
K-NNs Discriminant Analysis
: Non-parametric (distribution-free) methods dispense with the need for
assumptions regarding the probability density function. They have become very
popular especially in the image processing area. The K-NNs method assigns an
object of unknown affiliation to the group to which the majority of its K nearest
neighbours belongs.
There is no best discrimination method. A few remarks concerning the advantages and
disadvantages of the methods studied are as follows.
 Analytical simplicity or computational reasons may lead to initial consideration of
linear discriminant analysis or the NN-rule.
 Linear discrimination is the most widely used in practice. Often the 2-group
method is used repeatedly for the analysis of pairs of multigroup data (yielding
decision surfaces for k groups).
 To estimate the parameters required in quadratic discrimination more computation
and data is required than in the case of linear discrimination. If there is not a great
difference in the group covariance matrices, then the latter will perform as well as
quadratic discrimination.
 The k-NN rule is simply defined and implemented, especially if there is
insufficient data to adequately define sample means and covariance matrices.
 MDA is most appropriately used for feature selection. As in the case of PCA, we
may want to focus on the variables used in order to investigate the differences
between groups; to create synthetic variables which improve the grouping ability
of the data; to arrive at a similar objective by discarding irrelevant variables; or to
determine the most parsimonious variables for graphical representational
purposes.

10
4) CLUSTER ANALYSIS
Cluster analysis is a convenient method for identifying homogenous groups of objects
called clusters. Objects (or cases, observations) in a specific cluster share many
characteristics, but are very dissimilar to objects not belonging to that cluster.
The objective of cluster analysis is to identify groups of objects (in this case, customers)
that are very similar with regard to their price consciousness and brand loyalty and assign
them into clusters. After having decided on the clustering variables (brand loyalty and
price consciousness), we need to decide on the clustering procedure to form our groups of
objects. This step is crucial for the analysis, as different procedures require different
decisions prior to analysis. There is an abundance of different approaches and little
guidance on which one to use in practice.
An important problem in the application of cluster analysis is the decision regarding how
many clusters should be derived from the data
Steps in a cluster analysis
Decide on the clustering variables
Decide on the clustering procedure
Hierarchical methods Partitioning methods
Two-step clustering
Select a measure of similarity or dissimilarity
Select a measure of similarity or dissimilarity
Decide on the number of clusters
Validate and interpret the cluster solution
Choose a clusteringalgorithm

11
5) MULTIDIMENSIONAL SCALING
Multidimensional Scaling
 General Purpose
 Logic of MDS
 Computational Approach
 How many dimensions to specify?
 Interpreting the Dimensions
 Applications
 MDS and Factor Analysis
General Purpose
Multidimensional scaling (MDS) can be considered to be an alternative to factor analysis
(see Factor Analysis). In general, the goal of the analysis is to detect meaningful
underlying dimensions that allow the researcher to explain observed similarities or
dissimilarities (distances) between the investigated objects. In factor analysis, the
similarities between objects (e.g., variables) are expressed in the correlation matrix. With
MDS, you can analyze any kind of similarity or dissimilarity matrix, in addition to
correlation matrices.
Logic of MDS
The following simple example may demonstrate the logic of an MDS analysis. Suppose
we take a matrix of distances between major US cities from a map. We then analyze this
matrix, specifying that we want to reproduce the distances based on two dimensions. As a
result of the MDS analysis, we would most likely obtain a two-dimensional
representation of the locations of the cities, that is, we would basically obtain a two-
dimensional map.
In general then, MDS attempts to arrange "objects" (major cities in this example) in a
space with a particular number of dimensions (two-dimensional in this example) so as to
reproduce the observed distances. As a result, we can "explain" the distances in terms of
underlying dimensions; in our example, we could explain the distances in terms of the
two geographical dimensions: north/south and east/west.
Orientation of axes. As in factor analysis, the actual orientation of axes in the final
solution is arbitrary. To return to our example, we could rotate the map in any way we
want, the distances between cities remain the same. Thus, the final orientation of axes in
the plane or space is mostly the result of a subjective decision by the researcher, who will
choose an orientation that can be most easily explained. To return to our example, we
could have chosen an orientation of axes other than north/south and east/west; however,
that orientation is most convenient because it "makes the most sense" (i.e., it is easily
interpretable).

12
Computational Approach
MDS is not so much an exact procedure as rather a way to "rearrange" objects in an
efficient manner, so as to arrive at a configuration that best approximates the observed
distances. It actually moves objects around in the space defined by the requested number
of dimensions, and checks how well the distances between objects can be reproduced by
the new configuration. In more technical terms, it uses a function minimization algorithm
that evaluates different configurations with the goal of maximizing the goodness-of-fit
(or minimizing "lack of fit").
Measures of goodness-of-fit: Stress. The most common measure that is used to evaluate
how well (or poorly) a particular configuration reproduces the observed distance matrix is
the stress measure. The raw stress value Phi of a configuration is defined by:
Phi = [dij - f ( ij)]2
In this formula, dij stands for the reproduced distances, given the respective number of
dimensions, and ij (deltaij) stands for the input data (i.e., observed distances). The
expression f ( ij) indicates a nonmetric, monotone transformation of the observed input
data (distances). Thus, it will attempt to reproduce the general rank-ordering of distances
between the objects in the analysis.
There are several similar related measures that are commonly used; however, most of
them amount to the computation of the sum of squared deviations of observed distances
(or some monotone transformation of those distances) from the reproduced distances.
Thus, the smaller the stress value, the better is the fit of the reproduced distance matrix to
the observed distance matrix.
Shepard diagram. You can plot the reproduced distances for a particular number of
dimensions against the observed input data (distances). This scatterplot is referred to as a
Shepard diagram. This plot shows the reproduced distances plotted on the vertical (Y)
axis versus the original similarities plotted on the horizontal (X) axis (hence, the
generally negative slope). This plot also shows a step-function. This line represents the
so- called D-hat values, that is, the result of the monotone transformation f( ) of the
input data. If all reproduced distances fall onto the step-line, then the rank-ordering of
distances (or similarities) would be perfectly reproduced by the respective solution
(dimensional model). Deviations from the step-line indicate lack of fit.
How Many Dimensions to Specify?
If you are familiar with factor analysis, you will be quite aware of this issue. If you are
not familiar with factor analysis, you may want to read the Factor Analysis section in the
manual; however, this is not necessary in order to understand the following discussion. In
general, the more dimensions we use in order to reproduce the distance matrix, the better
is the fit of the reproduced matrix to the observed matrix (i.e., the smaller is the stress). In
fact, if we use as many dimensions as there are variables, then we can perfectly reproduce
the observed distance matrix. Of course, our goal is to reduce the observed complexity of

13
nature, that is, to explain the distance matrix in terms of fewer underlying dimensions. To
return to the example of distances between cities, once we have a two-dimensional map it
is much easier to visualize the location of and navigate between cities, as compared to
relying on the distance matrix only.
Sources of misfit. Let's consider for a moment why fewer factors may produce a worse
representation of a distance matrix than would more factors. Imagine the three cities A,
B, and C, and the three cities D, E, and F; shown below are their distances from each
other.
A B C D E F
A
B
C
0
90
90
0
90 0
D
E
F
0
90
180
0
90 0
In the first matrix, all cities are exactly 90 miles apart from each other; in the second
matrix, cities D and F are 180 miles apart. Now, can we arrange the three cities (objects)
on one dimension (line)? Indeed, we can arrange cities D, E, and F on one dimension:
D---90 miles---E---90 miles---F
D is 90 miles away from E, and E is 90 miles away from F; thus, D is 90+90=180 miles
away from F. If you try to do the same thing with cities A, B, and C you will see that
there is no way to arrange the three cities on one line so that the distances can be
reproduced. However, we can arrange those cities in two dimensions, in the shape of a
triangle:
A
90 miles 90 miles
B 90 miles C
Arranging the three cities in this manner, we can perfectly reproduce the distances
between them. Without going into much detail, this small example illustrates how a
particular distance matrix implies a particular number of dimensions. Of course, "real"
data are never this "clean," and contain a lot of noise, that is, random variability that
contributes to the differences between the reproduced and observed matrix.
Scree test. A common way to decide how many dimensions to use is to plot the stress
value against different numbers of dimensions. This test was first proposed by Cattell
(1966) in the context of the number-of-factors problem in factor analysis (see Factor
Analysis); Kruskal and Wish (1978; pp. 53-60) discuss the application of this plot to
MDS.
Cattell suggests to find the place where the smooth decrease of stress values (eigenvalues
in factor analysis) appears to level off to the right of the plot. To the right of this point,
you find, presumably, only "factorial scree" - "scree" is the geological term referring to
the debris which collects on the lower part of a rocky slope.
Interpretability of configuration. A second criterion for deciding how many
dimensions to interpret is the clarity of the final configuration. Sometimes, as in our

14
example of distances between cities, the resultant dimensions are easily interpreted. At
other times, the points in the plot form a sort of "random cloud," and there is no
straightforward and easy way to interpret the dimensions. In the latter case, you should
try to include more or fewer dimensions and examine the resultant final configurations.
Often, more interpretable solutions emerge. However, if the data points in the plot do not
follow any pattern, and if the stress plot does not show any clear "elbow," then the data
are most likely random "noise."
Interpreting the Dimensions
The interpretation of dimensions usually represents the final step of the analysis. As
mentioned earlier, the actual orientations of the axes from the MDS analysis are arbitrary,
and can be rotated in any direction. A first step is to produce scatterplots of the objects in
the different two-dimensional planes.
Three-dimensional solutions can also be illustrated graphically, however, their
interpretation is somewhat more complex.
In addition to "meaningful dimensions," you should also look for clusters of points or
particular patterns and configurations (such as circles, manifolds, etc.). For a detailed
discussion of how to interpret final configurations, see Borg and Lingoes (1987), Borg
and Shye (in press), or Guttman (1968).
Use of multiple regression techniques. An analytical way of interpreting dimensions
(described in Kruskal & Wish, 1978) is to use multiple regression techniques to regress
some meaningful variables on the coordinates for the different dimensions. Note that this
can easily be done via Multiple Regression.

15
Applications
The "beauty" of MDS is that we can analyze any kind of distance or similarity matrix.
These similarities can represent people's ratings of similarities between objects, the
percent agreement between judges, the number of times a subjects fails to discriminate
between stimuli, etc. For example, MDS methods used to be very popular in
psychological research on person perception where similarities between trait descriptors
were analyzed to uncover the underlying dimensionality of people's perceptions of traits
(see, for example Rosenberg, 1977). They are also very popular in marketing research, in
order to detect the number and nature of dimensions underlying the perceptions of
different brands or products & Carmone, 1970).
In general, MDS methods allow the researcher to ask relatively unobtrusive questions
("how similar is brand A to brand B") and to derive from those questions underlying
dimensions without the respondents ever knowing what is the researcher's real interest.
MDS and Factor Analysis
Even though there are similarities in the type of research questions to which these two
procedures can be applied, MDS and factor analysis are fundamentally different methods.
Factor analysis requires that the underlying data are distributed as multivariate normal,
and that the relationships are linear. MDS imposes no such restrictions. As long as the
rank-ordering of distances (or similarities) in the matrix is meaningful, MDS can be used.
In terms of resultant differences, factor analysis tends to extract more factors
(dimensions) than MDS; as a result, MDS often yields more readily, interpretable
solutions. Most importantly, however, MDS can be applied to any kind of distances or
similarities, while factor analysis requires us to first compute a correlation matrix. MDS
can be based on subjects' direct assessment of similarities between stimuli, while factor
analysis requires subjects to rate those stimuli on some list of attributes (for which the
factor analysis is performed).
In summary, MDS methods are applicable to a wide variety of research designs because
distance measures can be obtained in any number of ways (for different examples, refer
to the references provided at the beginning of this section).
6) DESCRIPTIVE STATISTICS
Descriptive statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a
collection of data, or the quantitative description itself. Descriptive statistics are
distinguished from inferential statistics (or inductive statistics), in that descriptive
statistics aim to summarize a sample, rather than use the data to learn about the
population that the sample of data is thought to represent. This generally means that
descriptive statistics, unlike inferential statistics, are not developed on the basis of
probability theory. Even when a data analysis draws its main conclusions using
inferential statistics, descriptive statistics are generally also presented. For example in a
paper reporting on a study involving human subjects, there typically appears a table
giving the overall sample size, sample sizes in important subgroups (e.g., for each

16
treatment or exposure group), and demographic or clinical characteristics such as the
average age, the proportion of subjects of each sex, and the proportion of subjects with
related comorbidities.
Some measures that are commonly used to describe a data set are measures of central
tendency and measures of variability or dispersion. Measures of central tendency include
the mean, median and mode, while measures of variability include the standard deviation
(or variance), the minimum and maximum values of the variables, kurtosis and skewness.
Contents
 1 Use in statistical analysis
o 1.1 Univariate analysis
o 1.2 Bivariate analysis
Use in statistical analysis
Descriptive statistics provides simple summaries about the sample and about the
observations that have been made. Such summaries may be either quantitative, i.e.
summary statistics, or visual, i.e. simple-to-understand graphs. These summaries may
either form the basis of the initial description of the data as part of a more extensive
statistical analysis, or they may be sufficient in and of themselves for a particular
investigation.
For example, the shooting percentage in basketball is a descriptive statistic that
summarizes the performance of a player or a team. This number is the number of shots
made divided by the number of shots taken. For example, a player who shoots 33% is
making approximately one shot in every three. The percentage summarizes or describes
multiple discrete events. Consider also the grade point average. This single number
describes the general performance of a student across the range of their course
experiences.
The use of descriptive and summary statistics has an extensive history and, indeed, the
simple tabulation of populations and of economic data was the first way the topic of
statistics appeared. More recently, a collection of summarisation techniques has been
formulated under the heading of exploratory data analysis: an example of such a
technique is the box plot.
In the business world, descriptive statistics provide a useful summary of security returns
when researchers perform empirical and analytical analysis, as they give a historical
account of return behavior.
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its
central tendency (including the mean, median, and mode) and dispersion (including the
range and quantiles of the data-set, and measures of spread such as the variance and
standard deviation). The shape of the distribution may also be described via indices such
as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted
in graphical or tabular format, including histograms and stem-and-leaf display.

17
Bivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to
describe the relationship between pairs of variables. In this case, descriptive statistics
include:
 Cross-tabulations and contingency tables
 Graphical representation via scatterplots
 Quantitative measures of dependence
 Descriptions of conditional distributions
The main reason for differentiating univariate and bivariate analysis is that bivariate
analysis is not only simple descriptive analysis, but also it describes the relationship
between two different variables.[5] Quantitative measures of dependence include
correlation (such as Pearson's r when both variables are continuous, or Spearman's rho if
one or both are not) and covariance (which reflects the scale variables are measured on).
The slope, in regression analysis, also reflects the relationship between variables. The
unstandardised slope indicates the unit change in the criterion variable for a one unit
change in the predictor. The standardised slope indicates this change in standardised (z-
score) units. Highly skewed data are often transformed by taking logarithms. Use of
logarithms makes graphs more symmetrical and look more similar to the normal
distribution, making them easier to interpret intuitively.
7) INFERENTIAL STATISTICS
Inferential Statistics
Unlike descriptive statistics, which are used to describe the characteristics (i.e.
distribution, central tendency, and dispersion) of a single variable, inferential statistics are
used to make inferences about the larger population based on the sample. Since a sample
is a small subset of the larger population (or sampling frame), the inferences are
necessarily error prone. That is, we cannot say with 100% confidence that the
characteristics of the sample accurately reflect the characteristics of the larger population
(or sampling frame) too. Hence, only qualified inferences can be made, within a degree
of certainty, which is often expressed in terms of probability (e.g., 90% or 95%
probability that the sample reflects the population).
Typically, inferential statistics deals with analyzing two (called BIVARIATE analysis) or
more (called MULTIVARIATE analysis) variables. In this discussion, we will limit
ourselves to 2 variables, i.e. BIVARIATE ANALYSIS.
There are different types of inferential statistics that are used. The type of inferential
statistics used depends on the type of variable (i.e. NOMINAL, ORDINAL, INTERVAL/
RATIO). While the type of statistical analysis is different for these variables, the main
idea is the same: we try to determine how one variable compares to another. Values of
one variable could be systematically higher/ lower/ or the same as the other (e.g., men's
and women's wages). Alternatively, there could be a relationship between the two (e.g.
age and wages), in which case, we find the correlation between them. The different types
of analysis could be summarized as below:

18
Type of Variables Inferential statistics
Nominal (e.g. GENDER, male and female)
Compare the DISTRIBUTION, CENTRAL
TENDENCY
[Carry out separate test to check the validity
(i.e. margin of error) of above comparison, in
which DISPERSION measures are used]
Ordinal (e.g. class grades)
Beyond scope [should be taught in Statistics
class]
Ratio/ Interval (e.g. AGE and WAGE) Regression Analysis
Comparing Nominal variables: Differences in Central Tendency
Often, we need to compare two nominal variables. For example, we might want to find
out if MEN earn more than WOMEN. In this example, MEN and WOMEN are nominal
values of GENDER. We are comparing their EARNINGS, which has RATIO values.
Hence, in this case, we might compare the CENTRAL TENDENCY for MEN's
EARNINGS to WOMEN's EARNINGS. What measure of CENTRAL TENDENCY (i.e.
mean, median, or mode) do you think will be most appropriate to compare in this case?
[Hint: Obviously, this will depend on FREQUENCY distribution of EARNINGS.
Typically, it is a SKEWED distribution.] As mentioned earlier, we cannot be 100 percent
confident of this comparison. To verify if the comparison is variable, we need to calculate
the t-statistic [should be covered in Statistics class].
Comparing Ratio variables: Regression Analysis
Regression analysis is used to measure the degree of relationship between two or more
RATIO variables. Consider any two RATIO variables, for example AGE and WAGES.
One might reasonably expect that WAGES might increase as AGE increases, based on
the hypothesis that one's experience increases with age. Thus, consider the following
hypothesis:
Hypothesis: WAGES are positively related to AGE. [That is, higher the AGE, higher the
WAGES; lower the AGE, lower the WAGES.]
Of course, AGE is not the only factor that determines WAGES. There might be other
factors. GENDER is often such a factor (Census Bureau figures reveal that women earn
less than women); EDUCATION might be another; and so on. Despite such other factors,
we may reasonably be inclined to test the above hypothesis to see if it is indeed true. This
is a BIVARIATE analysis since we are using only 2 variables. We could use regression
analysis to find out the relationship between AGE and WAGES, i.e. test whether there is
indeed a relationship between the two variables.
In the above example, clearly, AGE is the INDEPENDENT variable, and WAGES is the
DEPENDENT variable. In regression analysis, the DEPENDENT variable is generically
denoted by Y, and the INDEPENDENT variable is denoted by X. [Below, whenever I
refer to X, it is the independent variable; Y is the dependent variable.]

19
FIRST STEP
The first step in the regression analysis is to chart the X and Y values graphically to
visually see if there is indeed a relationship between the X and the Y. X is typically on
the horizontal (x) axis; Y is typically on the verical (y) axis. This chart of plotted values
is called a scatterplot. The scatterplot should give you a good visual clue as to whether X
and Y are related or not. See the charts below. A POSITIVE association between AGE
and WAGES would have an upward trend (positive slope), where higher WAGES
correspond to higher AGE and lower WAGES correspond to lower AGE. A NEGATIVE
association would be indicated by the opposite effect (negative slope), where the older
individuals (i.e. higher AGE) have lower WAGES than the younger individuals (i.e.
lower AGE) (this could arguably apply in computer programming, which is a relatively
young field). A RANDOM association (i.e. zero association) is one where the scatterplot
does not indicate any trend (i.e. either positive or negative). In this case, young as well as
old individuals may expect to earn high or low earnings (i.e. the trend would be flat).
There are, however, many cases where the relationship between X and Y may not be as
linear; the relationship may be curvilinear, e.g., U or reverse U. For example, WAGES
might rise with AGE upto a certain number of years (say, retirement), and decrease after
that (a reverse U). All of this information can be visually gleaned from the scatterplot.
Examine the following scatterplots.
1. Positive Correlation 2. Negative Correlation
3. Random (i.e. NO)
correlation
4. Non-linear (reverse U)
correlation
Obviously, if the hypothesis stated above is true, we should expect to see Figure 1 if we
drew a scatterplot of AGE and WAGES. If we somehow get any of the other scatterplots,
understandably, the hypothesis may not be true.
SECOND STEP
The above scatterplots give a good idea of the overall type of relationship between X
(Independent) and Y (Dependent) variables. Yet, they do not give us a precise idea (i.e.
mathematically accurate) idea of the relationship between the two variables. Hence, the

20
second step is to test the relationship mathematically. We will deal only with LINEAR
relationships here. In a linear relationship, if you recall high school mathematics, the
relationship between X and Y can be described by a single line. A line is given by the
equation:
Y = A + B * X, where
Y = Dependent
variable;
X = Independent
variable;
A = Intercept on Y
axis;
B = Slope (or gradient)
[In different books you
might see slope is
represented by m, and
the intercept is
represented by c]
I will not get into the statistical procedures for how to calculate the values for A and B;
these are covered in the class on statistics [You can simply calculate this using Excel, as
shown in class]. Here, my interest is more in explaining and interpreting what these
values mean. From the scatterplot and the regression line, you should be able to more
precisely understand the relationship between X and Y. There are several likely
scenarios:
(a) the line is at 45 degrees (i.e. B = 1), which means that X and Y have a perfect
relationship (i.e. for 1 unit increase in X, there is a corresponding 1 unit increase in Y).
That means our hypothesis is fully true. However, this is rarely the case in most social
science studies;
(b) the line is off from 45 degrees but is inclined close to it (i.e. B~1), which means X
and Y are indeed related (i.e. for 1 unit increase in X, there is a fractional increase in Y).
If there is a positive slope (i.e. the line is inclined upward), the hypothesis holds true; if
there is a negative slope, the hypothesis does not hold true. This is more likely to be the
case in many occasions.
(c) the line is vertical or horizontal (i.e. B=0 or infinite), which means X and Y are not
related. This means our hypothesis is not true.
Thus the value B tells much about the relationship between the Independent and
Dependent variables.
The regression equation is really useful in predicting the value of Y for a given value of
X. That is, in the above example of relationship between AGE and WAGE, you will be
able to predict what WAGE one will earn at a particular AGE, when the values of A and
B are given. Thus, suppose the regression equation between AGE and WAGE is given as
(A= -6; B= 0.9):
WAGE = -6 + 0.9 * AGE [WAGE is hourly; AGE is in years]

21
Then, at the AGE 45, the person could expect to receive: -6 + 0.9 * 45 = -6 + 40.5 =
$32.5 per hour.
[The value A is the value of Y when X = 0. This value is of no statistical use unless X can
actually take values near 0.]
THIRD STEP
Obviously, from the scatterplot and regression equation, you should now be able to
predict if there is indeed any relationship between the Independent and Dependent
variables. The third step tells you how much of an effect the Independent variable has on
the Dependent variable. Here, we calculate the Correlation coefficient. This coefficient,
also called Pearson's R, gives the strength of relationship between the two variables.
[Again, I am not describing how to calculate; this should be covered in Statistics class;
you can simply do this using Excel as showed in class]. The value of Pearson's R could
range anywhere between 0 and 1. Generally, in social science, a value of R above 0.6
indicates a strong relationship between the two variables. A value between 0.3 and 0.6
indicates a moderate relationship. Anything below 0.3 indicates a weak relationship.
More generally, the value of R-squared (i.e. the squared value of Pearson's R) is
calculated to give the percentage strength of relationship between the independent and
dependent variables. Similar to R, R-squared value could be anywhere between 0 and 1.
Let's say in the above example, the Pearson's R is 0.7. This value indicates that there is a
strong relationship between AGE and WAGES. The R-squared value is 0.7 * 0.7 = 0.49.
This means that AGE represents 49% of the increase in one's WAGES. [The other 51
percent could be other factors, such as education, etc.]
There are additional steps required to test if the values of R and R-squared above are
indeed reliable; these should be covered in your Statistics class.
of your sample
tionship between your independent (causal)
variables, and you dependent (effect) variables
Why use inferential statistics?
-tiered journals will not publish articles that do NOT use inferential statistics.
s to the larger population.
yours.
and you dependent (effect) variables.
ss the relative impact of various program inputs on your program
outcomes/objectives
When are inferential statistics utilized?
 Inferential statistics can only be used under the following conditions
 You have a complete list of the members of the population.

22
 You draw a random sample from this population
 Using a pre-established formula, you determine that your sample size is
large enough
The following types of inferential statistics are relatively common and relatively easy to
interpret
 One sample test of difference/One sample hypothesis test
 Confidence Interval
 Contingency Tables and Chi Square Statistic
 T-test or Anova
 Pearson Correlation
 Bi-variate Regression
 Multi-variate Regression
8) MULTIDIMENTIONAL MEASUREMENT AND FACTOR ANALYSIS
Factor analysis
INTRODUCTION
Factor analysis is a method for investigating whether a number of variables of interest
Y1, Y2, : : :, Yl, are linearly related to a smaller number of unobservable factors F1, F2, :
: :, Fk .
The fact that the factors are not observable disquali¯es regression and other methods
previously examined. We shall see, however, that under certain conditions the
hypothesized factor model has certain implications, and these implications in turn can be
tested against the observations. Exactly what these conditions and implications are, and
how the model can be tested, must be explained with some care.
Factor Analysis
Factor analysis is a statistical method used to study the dimensionality of a set of
variables. In factor analysis, latent variables represent unobserved constructs and are
referred to as factors or dimensions.
• Exploratory Factor Analysis (EFA)
Used to explore the dimensionality of a measurement instrument by finding the smallest
number of interpretable
factors needed to explain the correlations among a set of variables – exploratory in the
sense that it places no structure on the linear relationships between the observed variables
and on the linear relationships between the observed variables and the factors but only
specifies the number of latent variables
• Confirmatory Factor Analysis (CFA)
Used to study how well a hypothesized factor model fits a new sample from the same
population or a sample from a different population – characterized by allowing
restrictions on the parameters of the model

23
Applications of Factor Analysis
• Personality and cognition in psychology
• Child Behavior Checklist (CBCL)
• MMPI
• Attitudes in sociology, political science, etc.
• Achievement in education
• Diagnostic criteria in mental health
Issues
• History of EFA versus CFA
• Can hypothesized dimensions be found?
• Validity of measurements
A Possible Research Strategy For Instrument Development
1. Pilot study 1
• Small n, EFA
• Revise, delete, and add items
2. Pilot study 2
• Small n, EFA
• Formulate tentative CFA model
3. Pilot study 3
• Larger n, CFA
• Test model from Pilot study 2 using random half of the sample
• Revise into new CFA model
• Cross-validate new CFA model using other half of data
4. Large scale study, CFA
5. Investigate other populations
Multidimensional measurement NO MATERIAL FOUND

Advanced Data Analysis Techniques

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Advanced Data Analysis Techniques

Similaire à Advanced Data Analysis Techniques (20)

Plus de Kishor Ade

Plus de Kishor Ade (20)

Dernier

Dernier (20)

Advanced Data Analysis Techniques