WHAT IS MULTIVARIATE ANALYSIS?
Today businesses must be more profitable, react more quickly, and offer higher-quality products and services,
and do it all with fewer people and at lower cost. An essential requirement in this process is
effective knowledge creation and management. There is no lack of information, but there is a dearth
of knowledge. As Tom Peters said in his book Thriving on Chaos, “We are drowning in information
and starved for knowledge” [7].
The information available for decision making has exploded in recent years, and it will continue to
grow, probably even faster. Until recently, much of that information simply disappeared:
it was either not collected or discarded. Today this information is being collected and stored in data
warehouses, and it is available to be “mined” for improved decision making. Some of that information
can be analyzed and understood with simple statistics, but much of it requires more complex,
multivariate statistical techniques to convert these data into knowledge.
MULTIVARIATE ANALYSIS IN STATISTICAL TERMS
Multivariate analysis techniques are popular because they enable organizations to create knowledge
and thereby improve their decision making. Multivariate analysis refers to all statistical techniques
that simultaneously analyze multiple measurements on individuals or objects under investigation.
Thus, any simultaneous analysis of more than two variables can be loosely considered multivariate
analysis.
Many multivariate techniques are extensions of univariate analysis (analysis of single-variable
distributions) and bivariate analysis (cross-classification, correlation, analysis of variance, and simple
regression used to analyze two variables). For example, simple regression (with one predictor
variable) is extended in the multivariate case to include several predictor variables. Likewise, the
single dependent variable found in analysis of variance is extended to include multiple dependent
variables in multivariate analysis of variance. Some multivariate techniques (e.g., multiple regression
and multivariate analysis of variance) provide a means of performing in a single analysis what
once took multiple univariate analyses to accomplish. Other multivariate techniques, however, are
uniquely designed to deal with multivariate issues, such as factor analysis, which identifies the structure
underlying a set of variables, or discriminant analysis, which differentiates among groups based
on a set of variables.
Principal Components and Common Factor Analysis
Factor analysis, including both principal component analysis and common factor analysis, is a statistical
approach that can be used to analyze interrelationships among a large number of variables
and to explain these variables in terms of their common underlying dimensions (factors). The objective
is to find a way of condensing the information contained in a number of original variables into
a smaller set of variates (factors) with a minimal loss of information. By providing an empirical estimate
of the structure of the variables considered, factor analysis becomes an objective basis for creating
summated scales.
A researcher can use factor analysis, for example, to better understand the relationships
between customers’ ratings of a fast-food restaurant. Assume you ask customers to rate the restaurant
on the following six variables: food taste, food temperature, freshness, waiting time, cleanliness,
and friendliness of employees. The analyst would like to combine these six variables into a
smaller number. By analyzing the customer responses, the analyst might find that the variables food
taste, temperature, and freshness combine together to form a single factor of food quality, whereas
the variables waiting time, cleanliness, and friendliness of employees combine to form another
single factor, service quality.
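A minimal sketch of this example follows, using scikit-learn’s FactorAnalysis on placeholder data; the column names and the two-factor solution are assumptions for illustration, not results from the text.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
cols = ["taste", "temperature", "freshness",
        "waiting_time", "cleanliness", "friendliness"]
# Placeholder ratings standing in for real customer responses.
ratings = pd.DataFrame(rng.normal(size=(200, 6)), columns=cols)

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(ratings)

# Loadings matrix: each row is a variable, each column an extracted factor.
loadings = pd.DataFrame(fa.components_.T, index=cols,
                        columns=["factor_1", "factor_2"])
print(loadings.round(2))
```

With real restaurant data, the analyst would look for the food items loading together on one factor and the service items on the other.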
Multiple Regression
Multiple regression is the appropriate method of analysis when the research problem involves a single
metric dependent variable presumed to be related to two or more metric independent variables.
The objective of multiple regression analysis is to predict the changes in the dependent variable in
response to changes in the independent variables. This objective is most often achieved through the
statistical rule of least squares.
Whenever the researcher is interested in predicting the amount or size of the dependent variable,
multiple regression is useful. For example, monthly expenditures on dining out (dependent
variable) might be predicted from information regarding a family’s income, its size, and the age of
the head of household (independent variables). Similarly, the researcher might attempt to predict a
company’s sales from information on its expenditures for advertising, the number of salespeople,
and the number of stores carrying its products.
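The dining-out example might be sketched with ordinary least squares as follows; the data are simulated and the coefficients are placeholders, not estimates from any real study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([
    rng.normal(60_000, 15_000, n),   # family income
    rng.integers(1, 6, n),           # family size
    rng.integers(25, 70, n),         # age of head of household
])
# Hypothetical monthly dining-out spend with some noise.
y = 0.002 * X[:, 0] + 30 * X[:, 1] + rng.normal(0, 20, n)

model = LinearRegression().fit(X, y)
print("weights:", model.coef_.round(4), "intercept:", round(model.intercept_, 2))
print("predicted spend for a new family:", model.predict([[75_000, 4, 45]]))
```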
Multiple Discriminant Analysis and Logistic Regression
Multiple discriminant analysis (MDA) is the appropriate multivariate technique if the single dependent
variable is dichotomous (e.g., male–female) or multichotomous (e.g., high–medium–low) and
therefore nonmetric. As with multiple regression, the independent variables are assumed to be metric.
Discriminant analysis is applicable in situations in which the total sample can be divided into
groups based on a nonmetric dependent variable characterizing several known classes. The primary
objectives of multiple discriminant analysis are to understand group differences and to predict the
likelihood that an entity (individual or object) will belong to a particular class or group based on
several metric independent variables.
Discriminant analysis might be used to distinguish innovators from noninnovators according to
their demographic and psychographic profiles. Other applications include distinguishing heavy product
users from light users, males from females, national-brand buyers from private-label buyers, and
good credit risks from poor credit risks. Even the Internal Revenue Service uses discriminant analysis
to compare selected federal tax returns with a composite, hypothetical, normal taxpayer’s return
(at different income levels) to identify the most promising returns and areas for audit.
Logistic regression models, often referred to as logit analysis, are a combination of multiple
regression and multiple discriminant analysis. This technique is similar to multiple regression analysis
in that one or more independent variables are used to predict a single dependent variable. What distinguishes
a logistic regression model from multiple regression is that the dependent variable is nonmetric,
as in discriminant analysis. The nonmetric scale of the dependent variable requires differences in the
estimation method and assumptions about the type of underlying distribution, yet in most other facets it
is quite similar to multiple regression. Thus, once the dependent variable is correctly specified and the
appropriate estimation technique is employed, the basic factors considered in multiple regression are
used here as well. Logistic regression models are distinguished from discriminant analysis primarily in
that they accommodate all types of independent variables (metric and nonmetric) and do not require the
assumption of multivariate normality. However, in many instances, particularly with more than two
levels of the dependent variable, discriminant analysis is the more appropriate technique.
Assume financial advisors were trying to develop a means of selecting emerging firms for
start-up investment. To assist in this task, they reviewed past records and placed firms into one of
two classes: successful over a five-year period or unsuccessful after five years. For each firm, they
also had a wealth of financial and managerial data. They could then use a logistic regression model
to identify the financial and managerial data that best differentiated between the successful and
unsuccessful firms in order to select the best candidates for investment in the future.
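A hedged sketch of this screening idea, using scikit-learn’s logistic regression on simulated financial indicators; the predictors and the success rule are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.normal(0.1, 0.3, n),   # e.g., profit margin
    rng.normal(1.5, 0.5, n),   # e.g., current ratio
    rng.normal(10, 4, n),      # e.g., years of management experience
])
# Invented rule defining "successful" (1) versus "unsuccessful" (0) firms.
y = (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * X[:, 2]
     + rng.normal(0, 0.3, n) > 0.8).astype(int)

clf = LogisticRegression().fit(X, y)
print("coefficients:", clf.coef_.round(2))   # direction/strength of each predictor
print("P(success) for a new firm:", clf.predict_proba([[0.2, 1.8, 12]])[0, 1])
```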
Canonical Correlation
Canonical correlation analysis can be viewed as a logical extension of multiple regression analysis.
Recall that multiple regression analysis involves a single metric dependent variable and several metric
independent variables. With canonical analysis the objective is to correlate simultaneously several metric
dependent variables and several metric independent variables. Whereas multiple regression involves a
single dependent variable, canonical correlation involves multiple dependent variables. The underlying
principle is to develop a linear combination of each set of variables (both independent and dependent) in
a manner that maximizes the correlation between the two sets. Stated in a different manner, the procedure
involves obtaining a set of weights for the dependent and independent variables that provides the maximum
simple correlation between the set of dependent variables and the set of independent variables. An
additional chapter is available at www.pearsonhigher.com/hair and www.mvstats.com that provides an
overview and application of the technique.
Assume a company conducts a study that collects information on its service quality based on
answers to 50 metrically measured questions. The study uses questions from published service quality
research and includes benchmarking information on perceptions of the service quality of “world-class
companies” as well as the company for which the research is being conducted. Canonical correlation
could be used to compare the perceptions of the world-class companies on the 50 questions with the
perceptions of the company. The research could then conclude whether the perceptions of the company
are correlated with those of world-class companies. The technique would provide information on
the overall correlation of perceptions as well as the correlation between each of the 50 questions.
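A minimal sketch with scikit-learn’s CCA, assuming two small placeholder sets of metric variables rather than the full 50-item battery described above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))   # e.g., ratings of the company itself
Y = rng.normal(size=(150, 5))   # e.g., ratings of "world-class" benchmarks

cca = CCA(n_components=2).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# Canonical correlations: correlation between each pair of canonical variates.
canon_corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)]
print([round(r, 3) for r in canon_corrs])
```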
Multivariate Analysis of Variance and Covariance
Multivariate analysis of variance (MANOVA) is a statistical technique that can be used to simultaneously
explore the relationship between several categorical independent variables (usually referred
to as treatments) and two or more metric dependent variables. As such, it represents an
extension of univariate analysis of variance (ANOVA). Multivariate analysis of covariance
(MANCOVA) can be used in conjunction with MANOVA to remove (after the experiment) the
effect of any uncontrolled metric independent variables (known as covariates) on the dependent
variables. The procedure is similar to that involved in bivariate partial correlation, in which
the effect of a third variable is removed from the correlation. MANOVA is useful when the
researcher designs an experimental situation (manipulation of several nonmetric treatment variables)
to test hypotheses concerning the variance in group responses on two or more metric dependent
variables.
Assume a company wants to know whether a humorous ad will be more effective with its customers
than a nonhumorous ad. It could ask its ad agency to develop two ads, one humorous and
one nonhumorous, and then show a group of customers both ads. After seeing the ads, the
customers would be asked to rate the company and its products on several dimensions, such as
modern versus traditional or high quality versus low quality. MANOVA would be the technique to
use to determine the extent of any statistical differences between the perceptions of customers who
saw the humorous ad and those who saw the nonhumorous one.
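A hedged sketch of this two-ad design with statsmodels’ MANOVA; the column names and ratings are placeholders, not data from the text.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "ad_type": rng.choice(["humorous", "nonhumorous"], n),   # nonmetric treatment
    "modern": rng.normal(5, 1, n),                           # metric dependent 1
    "quality": rng.normal(5, 1, n),                          # metric dependent 2
})

fit = MANOVA.from_formula("modern + quality ~ ad_type", data=df)
print(fit.mv_test())   # Wilks' lambda, Pillai's trace, etc. for the treatment effect
```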
Conjoint Analysis
Conjoint analysis is an emerging dependence technique that brings new sophistication to the evaluation
of objects, such as new products, services, or ideas. The most direct application is in new product
or service development, allowing for the evaluation of complex products while maintaining a
realistic decision context for the respondent. The market researcher is able to assess the importance
of attributes as well as the levels of each attribute while consumers evaluate only a few product
profiles, which are combinations of product levels.
Assume a product concept has three attributes (price, quality, and color), each at three possible
levels (e.g., red, yellow, and blue). Instead of having to evaluate all 27 (3 × 3 × 3) possible
combinations, a subset (9 or more) can be evaluated for their attractiveness to consumers, and the
researcher knows not only how important each attribute is but also the importance of each level
(e.g., the attractiveness of red versus yellow versus blue). Moreover, when the consumer evaluations
are completed, the results of conjoint analysis can also be used in product design simulators, which
show customer acceptance for any number of product formulations and aid in the design of the
optimal product.
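One simple ratings-based way to estimate part-worths in such a design is to regress profile ratings on dummy-coded attribute levels; the sketch below uses invented profiles and ratings and is only one of several estimation approaches used in conjoint studies.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Nine invented profiles from a 3 x 3 x 3 design, plus one respondent's ratings.
profiles = pd.DataFrame({
    "price":   ["low", "medium", "high", "low", "medium", "high", "low", "medium", "high"],
    "quality": ["basic", "standard", "premium", "standard", "premium", "basic",
                "premium", "basic", "standard"],
    "color":   ["red", "yellow", "blue", "blue", "red", "yellow", "yellow", "blue", "red"],
})
ratings = [7, 6, 3, 8, 7, 2, 9, 4, 4]

# Dummy-code the levels (one level per attribute dropped as the reference).
X = pd.get_dummies(profiles, drop_first=True)
model = LinearRegression().fit(X, ratings)

# Each coefficient is an estimated part-worth relative to the reference level.
print(pd.Series(model.coef_, index=X.columns).round(2))
```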
Cluster Analysis
Cluster analysis is an analytical technique for developing meaningful subgroups of individuals or
objects. Specifically, the objective is to classify a sample of entities (individuals or objects) into a
small number of mutually exclusive groups based on the similarities among the entities. In cluster
analysis, unlike discriminant analysis, the groups are not predefined. Instead, the technique is used
to identify the groups.
Cluster analysis usually involves at least three steps. The first is the measurement of some
form of similarity or association among the entities to determine how many groups really exist in
the sample. The second step is the actual clustering process, whereby entities are partitioned into
groups (clusters). The final step is to profile the persons or variables to determine their composition.
Many times this profiling may be accomplished by applying discriminant analysis to the groups
identified by the cluster technique.
As an example of cluster analysis, let’s assume a restaurant owner wants to know whether
customers are patronizing the restaurant for different reasons. Data could be collected on perceptions
of pricing, food quality, and so forth. Cluster analysis could be used to determine whether
some subgroups (clusters) of customers are strongly motivated by low prices, whereas others are
much less motivated by price when deciding to come to the restaurant.
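A minimal k-means sketch of this idea, clustering customers on simulated perception scores and then profiling each cluster by its means; the three perception variables are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder perception scores: price sensitivity, food quality, atmosphere.
perceptions = rng.normal(size=(200, 3))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(perceptions)

# Profile each cluster by its size and mean perception scores.
for k in range(3):
    members = perceptions[labels == k]
    print(f"cluster {k}: n={len(members)}, means={members.mean(axis=0).round(2)}")
```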
Perceptual Mapping
In perceptual mapping (also known as multidimensional scaling), the objective is to transform consumer
judgments of similarity or preference (e.g., preference for stores or brands) into distances
represented in multidimensional space. If objects A and B are judged by respondents as being the
most similar compared with all other possible pairs of objects, perceptual mapping techniques will
position objects A and B in such a way that the distance between them in multidimensional space is
smaller than the distance between any other pairs of objects. The resulting perceptual maps show
the relative positioning of all objects, but additional analyses are needed to describe or assess which
attributes predict the position of each object.
As an example of perceptual mapping, let’s assume the owner of a Burger King franchise wants
to know whether the strongest competitor is McDonald’s or Wendy’s. A sample of customers is given
a survey and asked to rate the pairs of restaurants from most similar to least similar. The results show
that Burger King is rated most similar to Wendy’s, so the owner concludes that Wendy’s is the
strongest competitor because it is perceived as the most similar. Follow-up analysis can identify
what attributes influence perceptions of similarity or dissimilarity.
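A hedged sketch of this example with metric multidimensional scaling in scikit-learn, using a made-up dissimilarity matrix (0 = identical, larger = less similar).

```python
import numpy as np
from sklearn.manifold import MDS

names = ["Burger King", "McDonald's", "Wendy's"]
dissim = np.array([        # invented pairwise dissimilarities
    [0.0, 0.7, 0.3],
    [0.7, 0.0, 0.6],
    [0.3, 0.6, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)   # each row is a point on the perceptual map
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```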
Correspondence Analysis
Correspondence analysis is a recently developed interdependence technique that facilitates the
perceptual mapping of objects (e.g., products, persons) on a set of nonmetric attributes. Researchers
are constantly faced with the need to “quantify the qualitative data” found in nominal variables.
Correspondence analysis differs from the interdependence techniques discussed earlier in its ability
to accommodate both nonmetric data and nonlinear relationships.
In its most basic form, correspondence analysis employs a contingency table, which is the
cross-tabulation of two categorical variables. It then transforms the nonmetric data to a metric level
and performs dimensional reduction (similar to factor analysis) and perceptual mapping.
Correspondence analysis provides a multivariate representation of interdependence for nonmetric
data that is not possible with other methods.
As an example, respondents’ brand preferences can be cross-tabulated on demographic variables
(e.g., gender, income categories, occupation) by indicating how many people preferring each
brand fall into each category of the demographic variables. Through correspondence analysis, the
association, or “correspondence,” of brands and the distinguishing characteristics of those preferring
each brand are then shown in a two- or three-dimensional map of both brands and respondent
characteristics. Brands perceived as similar are located close to one another. Likewise, the most
distinguishing characteristics of respondents preferring each brand are also determined by the
proximity of the demographic variable categories to the brand’s position.
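A bare-bones sketch of the underlying computation, assuming an invented brand-by-income contingency table and the standard singular value decomposition of the standardized residual matrix (dedicated packages wrap these steps).

```python
import numpy as np

counts = np.array([     # rows: brands A, B, C; columns: low/mid/high income
    [30, 20, 10],
    [10, 25, 25],
    [20, 15, 45],
], dtype=float)

P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)              # row and column masses
S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)
U, s, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates place brands (rows) and income groups (columns) in one map.
row_coords = np.diag(r**-0.5) @ U * s
col_coords = np.diag(c**-0.5) @ Vt.T * s
print(row_coords[:, :2].round(3))
print(col_coords[:, :2].round(3))
```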
Structural Equation Modeling and Confirmatory Factor Analysis
Structural equation modeling (SEM) is a technique that allows separate relationships for each of a set
of dependent variables. In its simplest sense, structural equation modeling provides the appropriate
and most efficient estimation technique for a series of separate multiple regression equations estimated
simultaneously. It is characterized by two basic components: (1) the structural model and (2) the
measurement model. The structural model is the path model, which relates independent to dependent
variables. In such situations, theory, prior experience, or other guidelines enable the researcher to
distinguish which independent variables predict each dependent variable. Models discussed previously
that accommodate multiple dependent variables (multivariate analysis of variance and canonical
correlation) are not applicable in this situation because they allow only a single relationship
between dependent and independent variables.
The measurement model enables the researcher to use several variables (indicators) for a single
independent or dependent variable. For example, the dependent variable might be a concept represented
by a summated scale, such as self-esteem. In a confirmatory factor analysis the researcher can assess the
contribution of each scale item as well as incorporate how well the scale measures the concept (reliability).
The scales are then integrated into the estimation of the relationships between dependent and
independent variables in the structural model. This procedure is similar to performing a factor analysis
(discussed in a later section) of the scale items and using the factor scores in the regression.
A study by management consultants identified several factors that affect worker satisfaction:
supervisor support, work environment, and job performance. In addition to this relationship, they
noted a separate relationship wherein supervisor support and work environment were unique predictors
of job performance. Hence, they had two separate, but interrelated, relationships. Supervisor
support and the work environment not only affected worker satisfaction directly, but had possible
indirect effects through the relationship with job performance, which was also a predictor of worker
satisfaction. In attempting to assess these relationships, the consultants also developed multi-item
scales for each construct (supervisor support, work environment, job performance, and worker
satisfaction). SEM provides a means of not only assessing each of the relationships simultaneously
rather than in separate analyses, but also incorporating the multi-item scales in the analysis to
account for measurement error associated with each of the scales.
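How such a model might be specified in code, sketched with the third-party semopy package; the construct names follow the example, but the item names (ss1 through sat3), the simulated data, and the package choice are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from semopy import Model

rng = np.random.default_rng(0)
items = ["ss1", "ss2", "ss3", "we1", "we2", "we3",
         "jp1", "jp2", "jp3", "sat1", "sat2", "sat3"]
# Placeholder item scores standing in for the consultants' survey data.
df = pd.DataFrame(rng.normal(size=(300, 12)), columns=items)

# Measurement model (=~) ties each construct to its items; the structural
# model (~) encodes the two interrelated relationships described above.
description = """
support      =~ ss1 + ss2 + ss3
environment  =~ we1 + we2 + we3
performance  =~ jp1 + jp2 + jp3
satisfaction =~ sat1 + sat2 + sat3
performance  ~ support + environment
satisfaction ~ support + environment + performance
"""

model = Model(description)
model.fit(df)            # estimates loadings and path coefficients together
print(model.inspect())   # parameter estimates and their significance
```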
Histogram Graphical display of the distribution of a single variable. By forming frequency
counts in categories, the shape of the variable’s distribution can be shown. Used to make a visual
comparison to the normal distribution.
Homoscedasticity When the variance of the error terms (e) appears constant over a range of
predictor variables, the data are said to be homoscedastic. The assumption of equal variance of
the population error E (where E is estimated from e) is critical to the proper application of many
multivariate techniques. When the error terms have increasing or modulating variance, the data
are said to be heteroscedastic. Analysis of residuals best illustrates this point.
Normal distribution Purely theoretical continuous probability distribution in which the horizontal
axis represents all possible values of a variable and the vertical axis represents the probability
of those values occurring. The scores on the variable are clustered around the mean in a symmetrical,
unimodal pattern known as the bell-shaped, or normal, curve.
Residual Portion of a dependent variable not explained by a multivariate technique. Associated
with dependence methods that attempt to predict the dependent variable, the residual represents
the unexplained portion of the dependent variable. Residuals can be used in diagnostic procedures
to identify problems in the estimation technique or to identify unspecified relationships.
Scatterplot Representation of the relationship between two metric variables portraying the joint
values of each observation in a two-dimensional graph.
HOMOSCEDASTICITY The next assumption is related primarily to dependence relationships
between variables. Homoscedasticity refers to the assumption that dependent variable(s) exhibit
equal levels of variance across the range of predictor variable(s). Homoscedasticity is desirable
because the variance of the dependent variable being explained in the dependence relationship
should not be concentrated in only a limited range of the independent values. In most situations, we
have many different values of the dependent variable at each value of the independent variable.
For this relationship to be fully captured, the dispersion (variance) of the dependent variable values
must be relatively equal at each value of the predictor variable. If this dispersion is unequal across
values of the independent variable, the relationship is said to be heteroscedastic. Although the
dependent variables must be metric, this concept of an equal spread of variance across independent
variables can be applied when the independent variables are either metric or nonmetric:
• Metric independent variables. The concept of homoscedasticity is based on the spread of
dependent variable variance across the range of independent variable values, which is encountered
in techniques such as multiple regression. The dispersion of values for the dependent
variable should be as large at small values of the independent variable as it is at moderate and
large values. In a scatterplot, it is seen as an elliptical distribution of points.
• Nonmetric independent variables. In these analyses (e.g., ANOVA and MANOVA) the focus
now becomes the equality of the variance (single dependent variable) or the variance/covariance
matrices (multiple dependent variables) across the groups formed by the nonmetric independent
variables. The equality of variance/covariance matrices is also seen in discriminant
analysis, but in this technique the emphasis is on the spread of the independent variables
across the groups formed by the nonmetric dependent measure.
In each of these instances, the purpose is the same: to ensure that the variance used in explanation
and prediction is distributed across the range of values, thus allowing for a “fair test” of the
relationship across all values of the nonmetric variables. The two most common sources of
heteroscedasticity are the following:
• Variable type. Many types of variables have a natural tendency toward differences in dispersion.
For example, as a variable increases in value (e.g., units ranging from near zero to
millions), a naturally wider range of answers is possible for the larger values. Also, when
percentages are used the natural tendency is for many values to be in the middle range, with
few in the lower or higher values.
The result of heteroscedasticity is to cause the predictions to be better at some levels of the
independent variable than at others. This variability affects the standard errors and makes hypothesis
tests either too stringent or too insensitive. The effect of heteroscedasticity is also often related to
sample size, especially when examining the variance dispersion across groups. For example, in
ANOVA or MANOVA the impact of heteroscedasticity on the statistical test depends on the sample
sizes associated with the groups of smaller and larger variances. In multiple regression analysis,
similar effects would occur in highly skewed distributions where there were disproportionate numbers
of respondents in certain ranges of the independent variable.
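One common residual-based check is the Breusch–Pagan test; the sketch below applies statsmodels’ version to simulated data whose error spread deliberately grows with the predictor.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.1 + 0.2 * x)   # error spread grows with x

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # a small p-value flags heteroscedasticity
```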
LINEARITY An implicit assumption of all multivariate techniques based on correlational measures
of association, including multiple regression, logistic regression, factor analysis, and structural
equation modeling, is linearity. Because correlations represent only the linear association between
variables, nonlinear effects will not be represented in the correlation value. This omission results in
an underestimation of the actual strength of the relationship. It is always prudent to examine all relationships
to identify any departures from linearity that may affect the correlation.
Identifying Nonlinear Relationships. The most common way to assess linearity is to examine
scatterplots of the variables and to identify any nonlinear patterns in the data. Many scatterplot programs
can show the straight line depicting the linear relationship, enabling the researcher to better
identify any nonlinear characteristics. An alternative approach is to run a simple regression analysis
and to examine the residuals. The residuals reflect the unexplained portion of the dependent variable;
thus, any nonlinear portion of the relationship will show up in the residuals. A third approach is to
explicitly model a nonlinear relationship by testing alternative model specifications (also known
as curve fitting) that reflect the nonlinear elements.
Remedies for Nonlinearity. If a nonlinear relationship is detected, the most direct approach
is to transform one or both variables to achieve linearity. A number of available transformations are
discussed later in this chapter. An alternative to data transformation is the creation of new variables
to represent the nonlinear portion of the relationship.
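A simple numeric version of the second and third approaches above: fit a straight line, then see whether adding a squared term (one hypothetical nonlinear specification) explains meaningfully more variance. The data are simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 2, 200)   # deliberately nonlinear

linear = sm.OLS(y, sm.add_constant(x)).fit()
curved = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

print(f"R-squared, straight line:  {linear.rsquared:.3f}")
print(f"R-squared, with x squared: {curved.rsquared:.3f}")
# A large jump in fit (or a clear pattern in linear.resid) flags nonlinearity.
```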
WHAT IS FACTOR ANALYSIS?
Factor analysis is an interdependence technique whose primary purpose is to define the underlying
structure among the variables in the analysis. Obviously, variables play a key role in any multivariate
analysis. Whether we are making a sales forecast with regression, predicting success or failure
of a new firm with discriminant analysis, or using any other multivariate technique, we must have
a set of variables upon which to form relationships (e.g., What variables best predict sales or success/
failure?). As such, variables are the building blocks of relationships.
As we employ multivariate techniques, by their very nature, the number of variables
increases. Univariate techniques are limited to a single variable, but multivariate techniques can
have tens, hundreds, or even thousands of variables. But how do we describe and represent all of
these variables? Certainly, if we have only a few variables, they may all be distinct and different. As
we add more and more variables, more and more overlap (i.e., correlation) is likely among the variables.
In some instances, such as when we are using multiple measures to overcome measurement
error by multivariable measurement, the researcher even strives for correlation among the variables.
As the variables become correlated, the researcher now needs ways in which to manage these
variables: grouping highly correlated variables together, labeling or naming the groups, and perhaps
even creating a new composite measure that can represent each group of variables.
We introduce factor analysis as our first multivariate technique because it can play a
unique role in the application of other multivariate techniques. Broadly speaking, factor analysis
provides the tools for analyzing the structure of the interrelationships (correlations) among a
large number of variables (e.g., test scores, test items, questionnaire responses) by defining sets
of variables that are highly interrelated, known as factors. These groups of variables (factors),
which are by definition highly intercorrelated, are assumed to represent dimensions within the
data. If we are only concerned with reducing the number of variables, then the dimensions can
guide in creating new composite measures. However, if we have a conceptual basis for understanding
the relationships between variables, then the dimensions may actually have meaning
for what they collectively represent. In the latter case, these dimensions may correspond to concepts
that cannot be adequately described by a single measure (e.g., store atmosphere is defined
by many sensory components that must be measured separately but are all interrelated). We will
see that factor analysis presents several ways of representing these groups of variables for use in
other multivariate techniques.
OBJECTIVES OF FACTOR ANALYSIS
The starting point in factor analysis, as with other statistical techniques, is the research problem. The
general purpose of factor analytic techniques is to find a way to condense (summarize) the information
contained in a number of original variables into a smaller set of new, composite dimensions or variates
(factors) with a minimum loss of information; that is, to search for and define the fundamental constructs
or dimensions assumed to underlie the original variables [18, 33]. In meeting its objectives, factor
analysis is keyed to four issues: specifying the unit of analysis, achieving data summarization and/or
data reduction, variable selection, and using factor analysis results with other multivariate techniques.
DESIGNING A FACTOR ANALYSIS
The design of a factor analysis involves three basic decisions: (1) calculation of the input data
(a correlation matrix) to meet the specified objectives of grouping variables or respondents; (2)
design of the study in terms of number of variables, measurement properties of variables, and the
types of allowable variables; and (3) the sample size necessary, both in absolute terms and as a function
of the number of variables in the analysis.
Rotation of Factors
Perhaps the most important tool in interpreting factors is factor rotation. The term rotation means
exactly what it implies. Specifically, the reference axes of the factors are turned about the origin
until some other position has been reached. As indicated earlier, unrotated factor solutions extract
factors in the order of their variance extracted. The first factor tends to be a general factor with
almost every variable loading significantly, and it accounts for the largest amount of variance. The
second and subsequent factors are then based on the residual amount of variance. Each accounts for
successively smaller portions of variance. The ultimate effect of rotating the factor matrix is to
redistribute the variance from earlier factors to later ones to achieve a simpler, theoretically more
meaningful factor pattern.
The simplest case of rotation is an orthogonal factor rotation, in which the axes are maintained
at 90 degrees. It is also possible to rotate the axes and not retain the 90-degree angle between
the reference axes. When not constrained to being orthogonal, the rotational procedure is called an
oblique factor rotation. Orthogonal and oblique factor rotations are demonstrated in Figures 7 and 8,
respectively.
Figure 7, in which five variables are depicted in a two-dimensional factor diagram, illustrates
factor rotation. The vertical axis represents unrotated factor II, and the horizontal axis represents
unrotated factor I. The axes are labeled with 0 at the origin and extend outward to +1.0 or -1.0. The
numbers on the axes represent the factor loadings. The five variables are labeled V1, V2, V3, V4, and
V5. The factor loading for variable 2 (V2) on unrotated factor II is determined by drawing a dashed
line horizontally from the data point to the vertical axis for factor II. Similarly, a vertical line is drawn
from variable 2 to the horizontal axis of unrotated factor I to determine the loading of variable 2 on
factor I. A similar procedure followed for the remaining variables determines the factor loadings for
the unrotated and rotated solutions, as displayed in Table 1 for comparison purposes. On the unrotated
first factor, all the variables load fairly high. On the unrotated second factor, variables 1 and 2
are very high in the positive direction. Variable 5 is moderately high in the negative direction, and
variables 3 and 4 have considerably lower loadings in the negative direction.
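The effect of rotation can also be illustrated numerically; the sketch below contrasts unrotated and varimax-rotated loadings from scikit-learn’s FactorAnalysis on a simulated five-variable data set (not the data behind Figure 7 or Table 1).

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 300))
X = np.column_stack([                      # five variables built from two latent factors
    f1 + 0.1 * rng.normal(size=300),
    f1 + 0.1 * rng.normal(size=300),
    f2 + 0.1 * rng.normal(size=300),
    f2 + 0.1 * rng.normal(size=300),
    -0.5 * f1 + 0.5 * f2 + 0.1 * rng.normal(size=300),
])

unrotated = FactorAnalysis(n_components=2).fit(X)
rotated = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print("unrotated loadings:\n", unrotated.components_.T.round(2))
print("varimax loadings:\n", rotated.components_.T.round(2))
```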
WHAT IS MULTIPLE REGRESSION ANALYSIS?
Multiple regression analysis is a statistical technique that can be used to analyze the relationship
between a single dependent (criterion) variable and several independent (predictor) variables.
The objective of multiple regression analysis is to use the independent variables whose values are
known to predict the single dependent value selected by the researcher. Each independent variable
is weighted by the regression analysis procedure to ensure maximal prediction from the set
of independent variables. The weights denote the relative contribution of the independent variables
to the overall prediction and facilitate interpretation as to the influence of each variable in
making the prediction, although correlation among the independent variables complicates the
interpretative process. The set of weighted independent variables forms the regression variate, a
linear combination of the independent variables that best predicts the dependent variable.
The regression variate, also referred to as the regression equation or regression model, is the most
widely known example of a variate among the multivariate techniques.
Multiple regression analysis is a dependence technique. Thus, to use it you must be able to
divide the variables into dependent and independent variables. Regression analysis is also a statistical
tool that should be used only when both the dependent and independent variables are metric.
However, under certain circumstances it is possible to include nonmetric data either as independent
variables (by transforming either ordinal or nominal data with dummy variable coding) or as the
dependent variable (by the use of a binary measure in the specialized technique of logistic regression).
In summary, to apply multiple regression analysis: (1) the data must be metric or appropriately
transformed, and (2) before deriving the regression equation, the researcher must decide
which variable is to be dependent and which remaining variables will be independent.
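Dummy variable coding, as mentioned above, can be sketched as follows; the small data set and the region categories are invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "income": [40, 55, 62, 48, 75, 58, 66, 51],
    "region": ["north", "south", "west", "north", "west", "south", "west", "north"],
    "spend":  [12, 18, 22, 14, 28, 19, 24, 15],
})

# Convert the nominal predictor into k-1 dummy (0/1) variables.
X = pd.concat([df[["income"]],
               pd.get_dummies(df["region"], drop_first=True)], axis=1)
model = LinearRegression().fit(X, df["spend"])
print(dict(zip(X.columns, model.coef_.round(3))))
```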
WHAT IS DISCRIMINANT ANALYSIS?
In attempting to choose an appropriate analytical technique, we sometimes encounter a problem
that involves a categorical dependent variable and several metric independent variables. For example,
we may wish to distinguish good from bad credit risks. If we had a metric measure of credit
risk, then we could use multiple regression. In many instances we do not have the metric measure
necessary for multiple regression. Instead, we are only able to ascertain whether someone is in a
particular group (e.g., good or bad credit risk).
Discriminant analysis is the appropriate statistical technique when the dependent variable is
a categorical (nominal or nonmetric) variable and the independent variables are metric variables.
In many cases, the dependent variable consists of two groups or classifications, for example, male
versus female or high versus low. In other instances, more than two groups are involved, such as
low, medium, and high classifications. Discriminant analysis is capable of handling either two
groups or multiple (three or more) groups. When two classifications are involved, the technique is
referred to as two-group discriminant analysis. When three or more classifications are identified,
the technique is referred to as multiple discriminant analysis (MDA). Logistic regression is limited
in its basic form to two groups, although other formulations can handle more groups.
Discriminant Analysis
Discriminant analysis involves deriving a variate. The discriminant variate is the linear combination
of the two (or more) independent variables that will discriminate best between the objects (persons,
firms, etc.) in the groups defined a priori. Discrimination is achieved by calculating the
variate’s weights for each independent variable to maximize the differences between the groups
(i.e., the between-group variance relative to the within-group variance). The variate for a discriminant
analysis, also known as the discriminant function, is derived from an equation much like that
seen in multiple regression. It takes the following form:
Zjk = a + W1X1k + W2X2k + . . . + WnXnk

where

Zjk = discriminant Z score of discriminant function j for object k
a = intercept
Wi = discriminant weight for independent variable i
Xik = independent variable i for object k
As with the variate in regression or any other multivariate technique, we see the discriminant
score for each object in the analysis (person, firm, etc.) being a summation of the values
obtained by multiplying each independent variable by its discriminant weight. What is unique
about discriminant analysis is that more than one discriminant function may be present,
resulting in each object possibly having more than one discriminant score. We will discuss what
determines the number of discriminant functions later, but here we see that discriminant
analysis has both similarities and unique elements when compared to other multivariate
techniques.
Discriminant analysis is the appropriate statistical technique for testing the hypothesis that the
group means of a set of independent variables for two or more groups are equal. By averaging the
discriminant scores for all the individuals within a particular group, we arrive at the group mean.
This group mean is referred to as a centroid.
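A minimal sketch of these ideas, using scikit-learn’s linear discriminant analysis on a simulated two-group credit-risk sample to obtain discriminant weights, Z scores, and group centroids; the data and group labels are placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
good = rng.normal([1.0, 1.0], 0.5, size=(50, 2))    # e.g., good credit risks
bad = rng.normal([-1.0, -1.0], 0.5, size=(50, 2))   # e.g., poor credit risks
X = np.vstack([good, bad])
y = np.array([1] * 50 + [0] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
z = lda.transform(X)                       # discriminant Z score for each object
print("weights:", lda.coef_.round(2))      # analogous to the Wi in the equation above
for group in (0, 1):
    print(f"centroid of group {group}: {z[y == group].mean():.2f}")
```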
OBJECTIVES OF DISCRIMINANT ANALYSIS
A review of the objectives for applying discriminant analysis should further clarify its nature.
Discriminant analysis can address any of the following research objectives:
1. Determining whether statistically significant differences exist between the average score
profiles on a set of variables for two (or more) a priori defined groups
2. Determining which of the independent variables most account for the differences in the
average score profiles of the two or more groups
3. Establishing the number and composition of the dimensions of discrimination between
groups formed from the set of independent variables
4. Establishing procedures for classifying objects (individuals, firms, products, etc.) into groups
on the basis of their scores on a set of independent variables
WHAT IS LOGISTIC REGRESSION?
Logistic regression, along with discriminant analysis, is the appropriate statistical technique when
the dependent variable is a categorical (nominal or nonmetric) variable and the independent variables
are metric or nonmetric variables. When compared to discriminant analysis, logistic regression
is limited in its basic form to two groups for the dependent variable, although other formulations can
handle more groups. It does have the advantage, however, of easily incorporating nonmetric variables
as independent variables, much like in multiple regression.
In a practical sense,logistic regression may be preferred for two reasons.First, discriminant
analysis relies on strictly meeting the assumptions ofmultivariate normality and equal
variance–covariance matrices across groups—assumptions that are not met in many situations.
Logistic regression does not face these strict assumptions and is much more robust when these
assumptions are not met, making its application appropriate in many situations.Second,even if the
assumptions are met, many researchers prefer logistic regression because it is similar to multiple
regression.It has straightforward statistical tests,similar approaches to incorporating metric and
nonmetric variables and nonlinear effects, and a wide range of diagnostics.Thus,for these and more
technical reasons,logistic regression is equivalent to two-group discriminant analysis and may be
more suitable in many situations.
OBJECTIVES OF LOGISTIC REGRESSION
Logistic regression is identical to discriminant analysis in terms of the basic objectives it can
address. Logistic regression is best suited to address two research objectives:
• Identifying the independent variables that impact group membership in the dependent
variable
• Establishing a classification system based on the logistic model for determining group
membership
The first objective is quite similar to the primary objectives of discriminant analysis and even
multiple regression in that emphasis is placed on the explanation of group membership in terms of
the independent variables in the model. In the classification process, logistic regression, like discriminant
analysis, provides a basis for classifying not only the sample used to estimate the discriminant
function but also any other observations that have values for all the independent variables.
In this way, the logistic regression analysis can classify other observations into the defined groups.
USE OF THE LOGISTIC CURVE Because the binary dependent variable has only the values of 0 and 1,
the predicted value (probability) must be bounded to fall within the same range. To define a relationship
bounded by 0 and 1, logistic regression uses the logistic curve to represent the relationship between the
independent and dependent variables (see Figure 1). At very low levels of the independent variable, the
probability approaches 0 but never reaches it. Likewise, as the independent variable increases, the predicted
values increase up the curve, but then the slope starts decreasing, so that at any level of the independent
variable the probability will approach 1.0 but never exceed it. The linear models of regression
cannot accommodate such a relationship, because it is inherently nonlinear. The linear relationship of
regression, even with additional terms or transformations for nonlinear effects, cannot guarantee that the
predicted values will remain within the range of 0 and 1.
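A brief numeric sketch of the logistic curve (the coefficients here are arbitrary) shows how the predicted probability stays strictly between 0 and 1 no matter how extreme the independent variable becomes.

```python
# Minimal sketch of the logistic (S-shaped) curve.
import numpy as np

def logistic(x, b0=0.0, b1=1.0):
    """P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x))); b0 and b1 are arbitrary here."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

x = np.linspace(-6, 6, 7)
print(np.round(logistic(x), 3))  # approaches 0 on the left and 1 on the right, never reaching either
```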
UNIQUE NATURE OF THE DEPENDENT VARIABLE The binary nature of the dependent variable
(0 or 1) has properties that violate the assumptions of multiple regression. First, the error term of a
discrete variable follows the binomial distribution instead of the normal distribution, thus invalidating
all statistical testing based on the assumptions of normality. Second, the variance of a dichotomous
variable is not constant, creating instances of heteroscedasticity as well. Moreover, neither
violation can be remedied through transformations of the dependent or independent variables.
Logistic regression was developed specifically to deal with these issues. Its unique relationship
between dependent and independent variables, however, requires a somewhat different
approach in estimating the variate, assessing goodness-of-fit, and interpreting the coefficients when
compared to multiple regression.
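As a hedged illustration of that different estimation approach, the sketch below fits a binary logistic regression to simulated data with scikit-learn; the data-generating step and every parameter value are invented for the example.

```python
# Minimal sketch: estimating a binary logistic regression on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                          # two metric independent variables
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)   # binary (0/1) dependent variable

model = LogisticRegression().fit(X, y)
print("intercept:", model.intercept_, "coefficients:", model.coef_)
print("predicted probabilities (first 5):", model.predict_proba(X[:5])[:, 1].round(3))
```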
WHAT IS CONJOINT ANALYSIS?
Conjoint analysis is a multivariate technique developed specifically to understand how respondents
develop preferences for any type of object (products, services, or ideas). It is based on the simple
premise that consumers evaluate the value of an object (real or hypothetical) by combining the
separate amounts of value provided by each attribute. Moreover, consumers can best provide their
estimates of preference by judging objects formed by combinations of attributes.
Utility, a subjective judgment of preference unique to each individual, is the most fundamental
concept in conjoint analysis and the conceptual basis for measuring value. The researcher using
conjoint analysis to study what determines utility should consider several key issues:
• Utility encompasses all features of the object, both tangible and intangible, and as such is a
measure of an individual’s overall preference.
• Utility is assumed to be based on the value placed on each of the levels of the attributes.
In doing so, respondents react to varying combinations of attribute levels (e.g., different
prices, features, or brands) with varying levels of preference.
• Utility is expressed by a relationship reflecting the manner in which the utility is formulated for
any combination of attributes. For example, we might sum the utility values associated with each
feature of a product or service to arrive at an overall utility (as sketched in the example after this
list). Then we would assume that products or services with higher utility values are more
preferred and have a better chance of choice.
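Here is that additive rule in a few lines of Python; the part-worth utilities are hypothetical numbers chosen only to show how the level values sum to an overall utility.

```python
# Minimal sketch of additive (part-worth) utility in conjoint analysis.
part_worths = {  # hypothetical part-worth utilities for each attribute level
    "brand": {"brand X": 1.0, "brand Y": 0.4},
    "price": {"39 cents": 0.8, "49 cents": 0.5, "59 cents": 0.2, "69 cents": 0.0},
}

def total_utility(profile):
    """Overall utility = sum of the part-worths of the profile's attribute levels."""
    return sum(part_worths[attribute][level] for attribute, level in profile.items())

print(total_utility({"brand": "brand X", "price": "49 cents"}))  # 1.0 + 0.5 = 1.5
print(total_utility({"brand": "brand Y", "price": "69 cents"}))  # 0.4 + 0.0 = 0.4
```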
To be successful in defining utility, the researcher must be able to describe the object in terms of
both its attributes and all relevant values for each attribute. To do so, the researcher develops a conjoint
task, which not only identifies the relevant attributes, but defines those attributes so hypothetical choice
situations can be constructed. In doing so, the researcher faces four specific questions:
1. What are the important attributes that could affect preference? In order to accurately measure
preference, the researcher must be able to identify all of the attributes, known as factors, that
provide utility and form the basis for preference and choice. Factors represent the specific
attributes or other characteristics of the product or service.
2. How will respondents know the meaning of each factor? In addition to specifying the factors,
the researcher must also define each factor in terms of levels, which are the possible values for
that factor. These values enable the researcher to then describe an object in terms of its levels
on the set of factors characterizing it. For example, brand name and price might be two factors
in a conjoint analysis. Brand name might have two levels (brand X and brand Y), whereas
price might have four levels (39 cents, 49 cents, 59 cents, and 69 cents).
3. What do the respondents actually evaluate? After the researcher selects the factors and the
levels to describe an object, they are combined (one level from each factor) into a profile,
which is similar to a stimulus in a traditional experiment. Therefore, a profile for our simple
example might be brand X at 49 cents.
4. How many profiles are evaluated? Conjoint analysis is unique among the multivariate
methods, as will be discussed later, in that respondents provide multiple evaluations. In terms
of the conjoint task, a respondent will evaluate a number of profiles in order to provide a basis
for understanding their preferences. The process of deciding on the actual number of profiles
and their composition is contained in the design (a simple construction of profiles is sketched
after this list).
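Under the simplest possible design, the set of profiles is just every combination of one level from each factor, as the sketch below shows for the brand-and-price example; in practice, fractional designs are commonly used to keep the number of profiles a respondent must evaluate manageable.

```python
# Minimal sketch: building full-factorial conjoint profiles from factors and levels.
from itertools import product

factors = {
    "brand": ["brand X", "brand Y"],
    "price": ["39 cents", "49 cents", "59 cents", "69 cents"],
}

profiles = [dict(zip(factors, combination)) for combination in product(*factors.values())]
print(len(profiles))   # 2 brands x 4 prices = 8 profiles
print(profiles[0])     # {'brand': 'brand X', 'price': '39 cents'}
```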
These four questions are focused on ensuring that the respondent is able to perform a realistic
task—choosing among a set of objects (profiles). Respondents need not tell the researcher anything
else, such as how important an individual attribute is to them or how well the object performs on any
specific attribute. Because the researcher constructed the hypothetical objects in a specific manner,
the influence of each attribute and each value of each attribute on the utility judgment of a respondent
can be determined from the respondents’ overall ratings.
THE OBJECTIVES OF CONJOINT ANALYSIS
As with any statistical analysis, the starting point is the research question. In understanding
consumer decisions, conjoint analysis meets two basic objectives:
1. To determine the contributions of predictor variables and their levels in the determination of
consumer preferences. For example, how much does price contribute to the willingness to buy
a product? Which price level is the best? How much change in the willingness to buy soap can
be accounted for by differences between the levels of price?
2. To establish a valid model of consumer judgments. Valid models enable us to predict the consumer
acceptance of any combination of attributes, even those not originally evaluated by
consumers. In doing so, the issues addressed include the following: Do the respondents’
choices indicate a simple linear relationship between the predictor variables and choices? Is a
simple model of adding up the value of each attribute sufficient, or do we need to include
more complex evaluations of preference to mirror the judgment process adequately?
The respondent reacts only to what the researcher provides in terms of profiles (attribute combinations).
Are these the actual attributes used in making a decision? Are other attributes, particularly
attributes of a more qualitative nature, such as emotional reactions, important as well? These and
other considerations require the research question to be framed around two major issues:
• Is it possible to describe all the attributes that give utility or value to the product or service
being studied?
• What are the key attributes involved in the choice process for this type of product or service?
These questions need to be resolved before proceeding into the design phase of a conjoint analysis
because they provide critical guidance for key decisions in each stage.
WHAT IS CLUSTER ANALYSIS?
Cluster analysis is a group of multivariate techniques whose primary purpose is to group objects based
on the characteristics they possess. It has been referred to as Q analysis, typology construction, classification
analysis, and numerical taxonomy. This variety of names is due to the usage of clustering
methods in such diverse disciplines as psychology, biology, sociology, economics, engineering, and
business. Although the names differ across disciplines, the methods all have a common dimension:
classification according to relationships among the objects being clustered [1, 2, 4, 10, 22, 27]. This
common dimension represents the essence of all clustering approaches—the classification of data as
suggested by natural groupings of the data themselves. Cluster analysis is comparable to factor analysis
in its objective of assessing structure. Cluster analysis differs from factor analysis, however, in that
cluster analysis groups objects, whereas factor analysis is primarily concerned with grouping variables.
Additionally, factor analysis makes the groupings based on patterns of variation (correlation) in
the data, whereas cluster analysis makes groupings on the basis of distance (proximity).
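To make the contrast concrete, the short sketch below (the data matrix is invented) computes the two different starting points: a correlation matrix among variables, the input to factor analysis, and a distance matrix among objects, the input to cluster analysis.

```python
# Minimal sketch: correlations among variables vs. distances among objects.
import numpy as np

# Five objects (rows) measured on three variables (columns); values are invented.
X = np.array([[1.0, 2.0, 3.0],
              [1.2, 2.1, 2.9],
              [4.0, 0.5, 1.0],
              [3.8, 0.7, 1.2],
              [2.0, 2.0, 2.0]])

corr_between_variables = np.corrcoef(X, rowvar=False)   # 3 x 3 matrix used by factor analysis
dist_between_objects = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))  # 5 x 5 Euclidean distances

print(corr_between_variables.shape, dist_between_objects.shape)  # (3, 3) (5, 5)
```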
HOW DOES CLUSTER ANALYSIS WORK?
Cluster analysis performs a task innate to all individuals—pattern recognition and grouping. The
human ability to process even slight differences in innumerable characteristics is a cognitive process
inherent in human beings that is not easily matched by all of our technological advances. Take, for
example, the task of analyzing and grouping human faces. Even from birth, individuals can quickly
identify slight differences in facial expressions and group different faces into homogeneous groups
while considering hundreds of facial characteristics. Yet facial recognition programs still struggle
to accomplish the same task. The process of identifying natural groupings is one that can
become quite complex rather quickly.
To demonstrate how cluster analysis operates, we examine a simple example that illustrates
some of the key issues: measuring similarity, forming clusters, and deciding on the number of clusters
that best represent structure. We also briefly discuss the balance of objective and subjective
considerations that must be addressed by any researcher.
Objectives of Cluster Analysis
The primary goal of cluster analysis is to partition a set of objects into two or more groups based on
the similarity of the objects for a set of specified characteristics (the cluster variate). In fulfilling this
basic objective, the researcher must address two key issues: the research questions being addressed
in this analysis and the variables used to characterize objects in the clustering process. We will
discuss each issue in the following section.
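As a rough illustration of this basic objective, the sketch below partitions simulated objects into two groups based on their proximity on two clustering variables; k-means is used here simply as one of many possible clustering methods, and all values are invented.

```python
# Minimal sketch: grouping objects by proximity with k-means on simulated data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2)),   # 30 objects near (0, 0)
               rng.normal(loc=[3.0, 3.0], scale=0.5, size=(30, 2))])  # 30 objects near (3, 3)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centroids:\n", kmeans.cluster_centers_.round(2))
```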
WHAT IS MULTIDIMENSIONAL SCALING?
Multidimensional scaling (MDS), also known as perceptual mapping, is a procedure that enables a
researcher to determine the perceived relative image of a set of objects (firms, products, ideas, or other
items associated with commonly held perceptions). The purpose of MDS is to transform consumer
judgments of overall similarity or preference (e.g., preference for stores or brands) into distances represented
in multidimensional space.
Comparing Objects
Multidimensional scaling is based on the comparison of objects (e.g., product, service, person,
aroma). MDS differs from other multivariate methods in that it uses only a single, overall measure
of similarity or preference. To perform a multidimensional scaling analysis, the researcher performs
three basic steps:
1. Gather measures of similarity or preference across the entire set of objects to be analyzed.
2. Use MDS techniques to estimate the relative position of each object in multidimensional
space.
3. Identify and interpret the axes of the dimensional space in terms of perceptual and/or objective
attributes.
Assume that objects A and B are judged by respondents to be the most similar compared with
all other possible pairs of objects (AC, BC, AD, and so on). MDS techniques will position objects
A and B so that the distance between them in multidimensional space is smaller than the distance
between any other two pairs of objects. The resulting perceptual map, also known as a spatial
map, shows the relative positioning of all objects, as shown in Figure 1.
A SIMPLIFIED LOOK AT HOW MDS WORKS
To facilitate a better understanding of the basic procedures in multidimensional scaling, we first
present a simple example to illustrate the basic concepts underlying MDS and the procedure by
which it transforms similarity judgments into the corresponding spatial positions. We will follow
the three basic steps described previously.
Gathering Similarity Judgments
The first step is to obtain similarity judgments from one or more respondents. Here we will ask
respondents for a single measure of similarity for each pair of objects.
Market researchers are interested in understanding consumers’ perceptions of six candy bars
that are currently on the market. Instead of trying to gather information about consumers’ evaluations
of the candy bars on a number of attributes, the researchers instead gather only perceptions of
overall similarities or dissimilarities. The data are typically gathered by having respondents give
simple global responses to statements such as these:
• Rate the similarity of products A and B on a 10-point scale.
• Product A is more similar to B than to C.
• I like product A better than product B.
Creating a Perceptual Map
From these simple responses, a perceptual map can be drawn that best portrays the overall pattern of
similarities among the six candy bars. We illustrate the process of creating a perceptual map with
the data from a single respondent, although this process could also be applied to multiple respondents
or to the aggregate responses of a group of consumers.
The data are gathered by first creating a set of 15 unique pairs of the six candy bars (6 × 5 ÷
2 = 15 pairs). After tasting the candy bars, the respondent is asked to rank the 15 candy bar pairs,
where a rank of 1 is assigned to the pair of candy bars that is most similar and a rank of 15
indicates the pair that is least alike. The results (rank orders) for all pairs of candy bars for one
respondent are shown in Table 1. This respondent thought that candy bars D and E were the most
similar, candy bars A and B were the next most similar, and so forth until candy bars E and F were
the least similar.
If we want to illustrate the similarity among candy bars graphically, a first attempt would be
to draw a single similarity scale and fit all the candy bars to it. In this one-dimensional portrayal of
similarity, distance represents similarity. Thus, objects closer together on the scale are more similar
and those farther away less similar. The objective is to position the candy bars on the scale so
that the rank orders are best represented (rank order of 1 is closest, rank order of 2 is next closest,
and so forth).
Let us try to see how we would place some of the objects. Positioning two or three candy bars
is fairly simple. The first real test comes with four objects. We choose candy bars A, B, C, and D.
Table 1 shows that the rank order of the pairs is as follows: AB < AD < BD < CD < BC < AC (each
pair of letters refers to the distance [similarity] between the pair). From these values, we must place
the four candy bars on a single scale so that the most similar (AB) are the closest and the least similar
(AC) are the farthest apart. Figure 2a contains a one-dimensional perceptual map that matches the
orders of pairs. If the person judging the similarity between the candy bars had been thinking of a
simple rule of similarity that involved only one attribute (dimension), such as amount of chocolate,
then all the pairs could be placed on a single scale that reproduces the similarity values.
Although a one-dimensional map can accommodate four objects, the task becomes increasingly
difficult as the number of objects increases. The interested reader is encouraged to attempt this
task with six objects. When a single dimension is employed with the six objects, the actual ordering
varies substantially from the respondent’s original rank orders.
Because one-dimensional scaling does not fit the data well, a two-dimensional solution should
be attempted. It allows for another scale (dimension) to be used in configuring the six candy bars.
The procedure is quite tedious to attempt by hand. The two-dimensional solution produced by
an MDS program is shown in Figure 2b. This configuration matches the rank orders of Table 1 exactly,
supporting the notion that the respondent most probably used two dimensions in evaluating the candy
bars. The conjecture that at least two attributes (dimensions) were considered is based on the inability
to represent the respondent’s perceptions in one dimension. However, we are still not aware of what
attributes the respondent used in this evaluation.
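For readers who want to see step 2 in action, the sketch below passes a matrix of pairwise dissimilarities for six objects to scikit-learn's MDS to obtain a two-dimensional configuration. The rank orders are hypothetical (chosen to be consistent with the ordering described in the text, not copied from Table 1); a strictly nonmetric solution, which uses only the rank order of the dissimilarities, can be requested with metric=False.

```python
# Minimal sketch: a two-dimensional MDS configuration from pairwise dissimilarities.
import numpy as np
from sklearn.manifold import MDS

labels = list("ABCDEF")
# Hypothetical rank-order dissimilarities (1 = most similar pair, 15 = least similar).
D = np.array([
    [0,  2, 13,  4,  3,  8],
    [2,  0, 12,  6,  5,  9],
    [13, 12, 0, 10, 11, 14],
    [4,  6, 10,  0,  1,  7],
    [3,  5, 11,  1,  0, 15],
    [8,  9, 14,  7, 15,  0],
], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
for name, (x, y) in zip(labels, coords):
    print(f"{name}: ({x:6.2f}, {y:6.2f})")
```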
COMPARING MDS TO OTHER INTERDEPENDENCE TECHNIQUES
Multidimensional scaling can be compared to the other interdependence techniques such as factor
and cluster analysis based on its approach to defining structure:
• Factor analysis: Defines structure by grouping variables into variates that represent underlying
dimensions in the original set of variables. Variables that highly correlate are grouped together.
• Cluster analysis: Defines structure by grouping objects according to their profile on a set of variables
(the cluster variate) in which objects in close proximity to each other are grouped together.
MDS differs from factor and cluster analysis in two key aspects: (1) a solution can be obtained for
each individual and (2) it does not use a variate.
OBJECTIVES OF MDS
Perceptual mapping, and MDS in particular, is most appropriate for achieving two objectives:
1. An exploratory technique to identify unrecognized dimensions affecting behavior
2. A means of obtaining comparative evaluations of objects when the specific bases of comparison
are unknown or undefined
In MDS, it is not necessary to specify the attributes of comparison for the respondent. All that is
required is to specify the objects and make sure that the objects share a common basis of comparison.
This flexibility makes MDS particularly suited to image and positioning studies in which the dimensions
of evaluation may be too global or too emotional and affective to be measured by conventional
scales. MDS methods combine the positioning of objects and subjects in a single overall map, making
the relative positions of objects and consumers for segmentation analysis much more direct.
WHAT IS PERCEPTUAL MAPPING?
Perceptual mapping is a set of techniques that attempt to identify the perceived relative image of a
set of objects (firms, products, ideas, or other items associated with commonly held perceptions).
The objective of any perceptual mapping approach is to use consumer judgments of similarity or
preference to represent the objects (e.g., stores or brands) in a multidimensional space.
The three basic elements of any perceptual mapping process are defining objects, defining a measure
of similarity, and establishing the dimensions of comparison. Objects can be any entity that can be
evaluated by respondents. Although we normally think of tangible objects (e.g., products, people, etc.),
an object can also be something very intangible (e.g., political philosophies, cultural beliefs, etc.).
The second aspect of perceptual mapping is the concept of similarity—a relative judgment of
one object versus another. A key characteristic of similarity is that it can be defined differently by
each respondent. All that is required is that the respondent can make a comparison between the
objects and form some perception of similarity.
With the similarity judgments in hand, the perceptual mapping technique then forms
dimensions of comparison. Dimensions are unobserved characteristics that allow for the objects to
be arrayed in a multidimensional space (perceptual map) that replicates the respondent’s similarity
judgments. Dimensions are similar to the factors formed in factor analysis in that they are combinations
of a number of specific characteristics, but they differ in that the underlying characteristics that
“define” the dimensions are unknown.
Readers are encouraged to also review multidimensional scaling to gain a broader perspective
on the overall perceptual mapping process and specific MDS techniques. Although there are quite
specific differences between correspondence analysis and multidimensional scaling, they share a
number of common objectives and approaches to perceptual mapping.
PERCEPTUAL MAPPING WITH CORRESPONDENCE ANALYSIS
Correspondence analysis (CA) is an increasingly popular interdependence technique for dimensional
reduction and perceptual mapping [1, 2, 3, 5, 8]. It is also known as optimal scaling or scoring,
reciprocal averaging, or homogeneity analysis. When compared to MDS techniques,
correspondence analysis has three distinguishing characteristics.
First, it is a compositional method rather than a decompositional approach, because the
perceptual map is based on the association between objects and a set of descriptive characteristics or
attributes specified by the researcher. A decompositional method, like MDS techniques, requires
that the respondent provide a direct measure of similarity. Second, most applications of CA involve
the correspondence of categories of variables, particularly those measured on nominal measurement
scales. This correspondence is then the basis for developing perceptual maps. Finally, the unique
benefits of CA lie in its abilities for simultaneously representing rows and columns, for example,
brands and attributes, in joint space.
Differences from Other Multivariate Techniques
Among the compositional techniques, factor analysis is the most similar because it defines composite
dimensions (factors) of the variables (e.g., attributes) and then plots objects (e.g., products) on
their scores on each dimension. In discriminant analysis, products can be distinguished by their profiles
across a set of variables and plotted in a dimensional space as well. Correspondence analysis
extends beyond either of these two compositional techniques:
• CA can be used with nominal data (e.g., frequency counts of preference for objects across a set of
attributes) rather than metric ratings of each object on each attribute. This capability enables CA to
be used in many situations in which the more traditional multivariate techniques are inappropriate.
• CA creates perceptual maps in a single step, where variables and objects are simultaneously
plotted in the perceptual map based directly on the association of variables and objects. The
relationships between objects and variables are the explicit objective of CA.
We first examine a simple example of CA to gain some perspective on its basic principles. Then
we discuss CA in terms of a six-stage decision-making process. The emphasis is on those unique elements
of CA as compared to the decompositional methods of MDS.
Creating the Perceptual Map
The similarity (signed chi-square) values provide a standardized measure of association, much like
the similarity judgments used in MDS methods. With this association/similarity measure, CA creates
a perceptual map by using the standardized measure to estimate orthogonal dimensions upon which
the categories can be placed to best account for the strength of association represented by the
chi-square distances.
As done in MDS techniques, we first consider a lower-dimensional solution (e.g., one or two
dimensions) and then expand the number of dimensions and continue until we reach the maximum
number of dimensions. In CA, the maximum number of dimensions is one less than the smaller of
the number of rows or columns.
Looking back at the signed chi-square values in Table 2, which act as a similarity measure, the
perceptual map should place certain combinations with positive values closer together on the perceptual
map (e.g., young adults close to product B, middle age close to product A, and mature individuals
close to product C). Moreover, certain combinations should be farther apart given their
negative values (young adults farther from product C, middle age farther from product B, and
mature individuals farther from product A).
In our example, we can only have two dimensions (the smaller of the number of rows or
columns minus one, or 3 - 1 = 2), so the two-dimensional perceptual map is as shown in Figure 1.
Corresponding to the similarity values just described, the positions of the age groups and products
represent the positive and negative associations between age group and products very well. The
researcher can examine the perceptual map to understand the product preferences among age groups
based on their sales patterns.
In contrast to MDS, however, we can relate age groups with product positions. We can
see that we now have an additional tool that can relate different characteristics (in this case, products
and the age characteristics of their buyers) in a single perceptual map. Just as easily, we could have
related products to attributes, benefits, or any other characteristic. We should note, however, that in
CA the positions of objects are now dependent on the characteristics they are associated with.
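For readers who want to see the mechanics, the sketch below carries out the core CA computation on a hypothetical contingency table of products by age groups (the counts are invented and chosen only to mirror the associations just described, not taken from Table 2): the standardized residuals are decomposed with a singular value decomposition to give coordinates for rows and columns on the orthogonal dimensions.

```python
# Minimal sketch: correspondence analysis coordinates from a contingency table.
import numpy as np

counts = np.array([    # rows: products A, B, C; columns: young adults, middle age, mature
    [20, 60, 20],      # product A sells mostly to the middle-age group
    [70, 20, 10],      # product B sells mostly to young adults
    [10, 25, 65],      # product C sells mostly to mature individuals
], dtype=float)

P = counts / counts.sum()                             # correspondence matrix
r = P.sum(axis=1)                                     # row masses
c = P.sum(axis=0)                                     # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # standardized residuals

U, d, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * d) / np.sqrt(r)[:, None]            # principal coordinates of the rows
col_coords = (Vt.T * d) / np.sqrt(c)[:, None]         # principal coordinates of the columns

# Maximum number of nontrivial dimensions = min(rows, columns) - 1 = 2
print(np.round(row_coords[:, :2], 2))
print(np.round(col_coords[:, :2], 2))
```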
STAGE 1: OBJECTIVES OF CA
Researchers are constantly faced with the need to quantify the qualitative data found in nominal
variables. CA differs from other MDS techniques in its ability to accommodate both nonmetric data
and nonlinear relationships. It performs dimensional reduction similar to multidimensional scaling
and a type of perceptual mapping, in which categories are represented in the multidimensional
space. Proximity indicates the level of association among row or column categories. CA can address
either of two basic objectives:
1. Association among only row or column categories. CA can be used to examine the association
among the categories of just a row or column. A typical use is the examination of the categories
of a scale, such as the Likert scale (five categories from “strongly agree” to “strongly disagree”)
or other qualitative scales (e.g., excellent, good, poor, bad). The categories can be compared to
see whether two can be combined (i.e., they are in close proximity on the map) or whether they
provide discrimination (i.e., they are located separately in the perceptual space).
2. Association between both row and column categories. In this application, interest lies in
portraying the association between categories of the rows and columns, such as our example
of product sales by age group. This use is most similar to the previous example of MDS and
has propelled CA into more widespread use across many research areas.
The researcher must determine the specific objectives of the analysis because certain decisions are
based on which type of objective is chosen. CA provides a multivariate representation of interdependence
for nonmetric data not possible with other methods. With a compositional method, the researcher
must be aware that the results are based on the descriptive characteristics (e.g., object attributes or
respondent characteristics) of the objects included in the analysis. This contrasts with decompositional
MDS procedures, in which only the overall measure of similarity between objects is needed.
Defining a Model
A model is a representation of a theory. Theory can be thought of as a systematic set of relationships
providing a consistent and comprehensive explanation of phenomena. From this definition, we see
that theory is not the exclusive domain of academia, but can be rooted in experience and practice
obtained by observation of real-world behavior. A conventional model in SEM terminology consists
of two models: the measurement model (representing how measured variables come together
to represent constructs) and the structural model (showing how constructs are associated with each
other).
IMPORTANCE OF THEORY A model should not be developed without some underlying theory.
Theory is often a primary objective of academic research, but practitioners may develop or propose
a set of relationships that are as complex and interrelated as any academically based theory. Thus,
researchers from both academia and industry can benefit from the unique analytical tools provided
by SEM. We will discuss in a later section specific issues in establishing a theoretical base for your
SEM model, particularly as it relates to establishing causality. In all instances SEM analyses should
be dictated first and foremost by a strong theoretical base.
COVARIATION Because causality means that a change in a cause brings about a corresponding
change in an effect, systematic covariance (correlation) between the cause and effect is necessary,
but not sufficient, to establish causality. Just as is done in multiple regression by estimating the
statistical significance of coefficients of independent variables that affect the dependent variable,
SEM can determine systematic and statistically significant covariation between constructs. Thus,
statistically significant estimated paths in the structural model (i.e., relationships between
constructs) provide evidence that covariation is present. Structural relationships between constructs
are typically the paths for which causal inferences are most often hypothesized.
PATH ANALYSIS
IDENTIFYING PATHS
The first step is to identify all relationships that connect any two constructs. Path analysis enables us
to decompose the simple (bivariate) correlation between any two variables into the sum of the compound
paths connecting these points. The number and types of compound paths between any two variables
are strictly a function of the model proposed by the researcher.
A compound path is a path along the arrows of a path diagram that follows three rules:
1. After going forward on an arrow, the path cannot go backward again; but the path can go backward
as many times as necessary before going forward.
2. The path cannot go through the same variable more than once.
3. The path can include only one curved arrow (correlated variable pair).
When applying these rules, each path or arrow represents a path. If only one arrow links two constructs
(path analysis can also be conducted with variables), then the relationship between those two is equal
to the parameter estimate between those two constructs. For now, this relationship can be called a
direct relationship. If there are multiple arrows linking one construct to another, as in X → Y → Z,
then the effect of X on Z is equal to the product of the parameter estimates for each arrow and is
termed an indirect relationship. This concept may seem quite complicated, but an example makes it
easy to follow:
Figure A-1 portrays a simple model with two exogenous constructs (X1 and X2) causally related to the
endogenous construct (Y1). The correlational path A is X1 correlated with X2, path B is the effect of X1
predicting Y1, and path C shows the effect of X2 predicting Y1. The value for Y1 can be stated simply
with a regression-like equation:
Y1 = b1X1 + b2X2
We can now identify the direct and indirect paths in our model. For ease in referring to the paths, the
causal paths are labeled A, B, and C.
Direct paths            Indirect paths
A = X1 to X2
B = X1 to Y1            AC = X1 to Y1
C = X2 to Y1            AB = X2 to Y1
ESTIMATING THE RELATIONSHIP
With the direct and indirect paths now defined, we can
represent the correlation between each construct as the
sum of the direct and indirect paths.
The three unique correlations among the constructs
can be shown to be composed of direct and indirect paths
as follows:
First, the correlation of X1 and X2 is simply equal to A. The correlation of X1 and Y1 (CorrX1,Y1) can
be represented as two paths: B and AC. The symbol B represents the direct path from X1 to Y1, and the
other path (a compound path) follows the curved arrow from X1 to X2 and then to Y1. Likewise, the
correlation of X2 and Y1 can be shown to be composed of two causal paths: C and AB.
Once all the correlations are defined in terms of paths, the values of the observed correlations can
be substituted and the equations solved for each separate path. The paths then represent either the
causal relationships between constructs (similar to a regression coefficient) or correlational estimates.
Using the correlations as shown in Figure A-1, we can solve the equations for each correlation:
CorrX1X2 = A
CorrX1Y1 = B + AC
CorrX2Y1 = C + AB
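Given a set of observed correlations (the values below are hypothetical, not those of Figure A-1), the three equations can be solved numerically for A, B, and C, as this short sketch shows.

```python
# Minimal sketch: solving the path equations for A, B, and C.
import numpy as np

corr_x1x2, corr_x1y1, corr_x2y1 = 0.50, 0.60, 0.70   # hypothetical observed correlations

A = corr_x1x2                                         # Corr(X1, X2) = A
# Corr(X1, Y1) = B + A*C and Corr(X2, Y1) = A*B + C form a linear system in B and C.
coefficients = np.array([[1.0, A],
                         [A, 1.0]])
B, C = np.linalg.solve(coefficients, np.array([corr_x1y1, corr_x2y1]))
print(f"A = {A:.2f}, B = {B:.3f}, C = {C:.3f}")       # B and C play the role of regression-like path estimates
```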
WHAT IS STRUCTURAL EQUATION MODELING?
Structural equation modeling (SEM) is a family of statistical models that seek to explain the relationships
among multiple variables. In doing so, it examines the structure of interrelationships
expressed in a series of equations, similar to a series of multiple regression equations. These equations
depict all of the relationships among constructs (the dependent and independent variables)
involved in the analysis. Constructs are unobservable or latent factors represented by multiple variables
(much like variables representing a factor in factor analysis). Thus far, each multivariate technique
has been classified either as an interdependence or dependence technique. SEM can be
thought of as a unique combination of both types of techniques because SEM’s foundation lies in
two familiar multivariate techniques: factor analysis and multiple regression analysis.
SEM is known by many names: covariance structure analysis, latent variable analysis, and
sometimes it is even referred to by the name of the specialized software package used (e.g., a LISREL
or AMOS model). Although SEM models can be tested in different ways, all structural equation
models are distinguished by three characteristics:
1. Estimation of multiple and interrelated dependence relationships
2. An ability to represent unobserved concepts in these relationships and account for measurement
error in the estimation process
3. Defining a model to explain the entire set of relationships
WHAT IS A STRUCTURAL MODEL?
The goal of measurement theory is to produce ways of measuring concepts in a reliable
and valid manner. Measurement theories are tested by how well the indicator variables
of theoretical constructs relate to one another. The relationships between the indicators
are captured in a covariance matrix. CFA tests a measurement theory by providing
evidence on the validity of individual measures based on the model’s overall fit and other evidence
of construct validity. CFA alone is limited in its ability to examine the nature of relationships
between constructs beyond simple correlations. Thus, a measurement theory is often a means to the
end of examining relationships between constructs, not an end in itself.
A structural theory is a conceptual representation of the structural relationships between
constructs. It can be expressed in terms of a structural model that represents the theory with a set
of structural equations and is usually depicted with a visual diagram. The structural relationship
between any two constructs is represented empirically by the structural parameter estimate, also
known as a path estimate. Structural models are referred to by several terms, including a theoretical
model or, occasionally, a causal model. A causal model infers that the relationships meet the
conditions necessary for causation. The researcher should be careful not to depict the model as having
causal inferences unless all of the conditions are met.
Differences Between MANOVA and Discriminant Analysis. We noted earlier that in
statistical testing MANOVA employs a discriminant function, which is the variate of dependent
variables that maximizes the difference between groups. The question may arise: What is the
difference between MANOVA and discriminant analysis? In some aspects, MANOVA and
discriminant analysis are mirror images. The dependent variables in MANOVA (a set of metric
variables) are the independent variables in discriminant analysis, and the single nonmetric
dependent variable of discriminant analysis becomes the independent variable in MANOVA.
Moreover, both use the same methods in forming the variates and assessing the statistical significance
between groups.
OBJECTIVES OF MANOVA
The selection of MANOVA is based on the desire to analyze a dependence relationship represented
as the differences in a set of dependent measures across a series of groups formed by one or more
categorical independent measures. As such, MANOVA represents a powerful analytical tool suitable
to a wide array of research questions. Whether used in actual or quasi-experimental situations
(i.e., field settings or survey research for which the independent measures are categorical),
MANOVA can provide insights into not only the nature and predictive power of the independent
measures but also the interrelationships and differences seen in the set of dependent measures.

 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 

Data Analytics Notes

  • 1. WHAT IS MULTIVARIATE ANALYSIS? Today businesses must be more profitable, react quicker, and offer higher-quality products and services, and do it all with fewer people and at lower cost.An essential requirement in this process is effective knowledge creation and management. There is no lack of information, but there is a dearth of knowledge. As Tom Peters said in his book Thriving on Chaos,“We are drowning in information and starved for knowledge” [7]. The information available for decision making exploded in recent years, and will continue to do so in the future, probably even faster. Until recently, much of that information just disappeared. It was either not collected or discarded.Today this information is being collected and stored in data warehouses,and it is available to be “mined” for improved decision making. Some of that information can be analyzed and understood with simple statistics,but much of it requires more complex, multivariate statistical techniques to convert these data into knowledge. MULTIVARIATE ANALYSIS IN STATISTICAL TERMS Multivariate analysis techniques are popular because they enable organizations to create knowledge and thereby improve their decision making. Multivariate analysis refers to all statistical techniques that simultaneously analyze multiple measurements on individuals or objects under investigation. Thus,any simultaneous analysis of more than two variables can be loosely considered multivariate analysis. Many multivariate techniques are extensions of univariate analysis (analysis of single-variable distributions)and bivariate analysis (cross-classification, correlation, analysis of variance, and simple regression used to analyze two variables). For example, simple regression (with one predictor variable) is extended in the multivariate case to include several predictor variables. Likewise, the single dependent variable found in analysis of variance is extended to include multiple dependent variables in multivariate analysis of variance. Some multivariate techniques (e.g., multiple regression and multivariate analysis of variance) provide a means of performing in a single analysis what once took multiple univariate analyses to accomplish. Other multivariate techniques,however, are uniquely designed to deal with multivariate issues,such as factor analysis, which identifies the structure underlying a set of variables, or discriminant analysis, which differentiates among groups based on a set of variables. Principal Components and Common Factor Analysis Factor analysis, including both principal component analysis and common factor analysis,is a statistical approach that can be used to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions (factors). The objective is to find a way of condensing the information contained in a number of original variables into a smaller set of variates (factors) with a minimal loss of information. By providing an empirical estimate of the structure of the variables considered,factor analysis becomes an objective basis for creating summated scales. A researcher can use factor analysis, for example, to betterunderstand the relationships between customers’ ratings of a fast-food restaurant. Assume you ask customers to rate the restaurant on the following six variables: food taste, food temperature, freshness,waiting time, cleanliness, and friendliness of employees. 
The analyst would like to combine these six variables into a smaller number. By analyzing the customer responses, the analyst might find that the variables food taste, temperature, and freshness combine together to form a single factor of food quality, whereas the variables waiting time, cleanliness, and friendliness of employees combine to form another single factor, service quality.
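To make the idea concrete, the sketch below fits a two-factor model to hypothetical ratings data with scikit-learn; the DataFrame `ratings`, its column names, and the choice of two factors are illustrative assumptions, not part of the original example.

```python
# Minimal sketch: factor analysis of six restaurant ratings (hypothetical data).
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
food_quality = rng.normal(size=n)      # assumed latent "food quality" dimension
service_quality = rng.normal(size=n)   # assumed latent "service quality" dimension

ratings = pd.DataFrame({
    "food_taste":       food_quality + rng.normal(scale=0.5, size=n),
    "food_temperature": food_quality + rng.normal(scale=0.5, size=n),
    "freshness":        food_quality + rng.normal(scale=0.5, size=n),
    "waiting_time":     service_quality + rng.normal(scale=0.5, size=n),
    "cleanliness":      service_quality + rng.normal(scale=0.5, size=n),
    "friendliness":     service_quality + rng.normal(scale=0.5, size=n),
})

fa = FactorAnalysis(n_components=2, random_state=0).fit(ratings)
loadings = pd.DataFrame(fa.components_.T, index=ratings.columns,
                        columns=["factor_1", "factor_2"])
print(loadings.round(2))  # each variable should load mainly on one factor
```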
Multiple Regression
Multiple regression is the appropriate method of analysis when the research problem involves a single metric dependent variable presumed to be related to two or more metric independent variables. The objective of multiple regression analysis is to predict the changes in the dependent variable in response to changes in the independent variables. This objective is most often achieved through the statistical rule of least squares. Whenever the researcher is interested in predicting the amount or size of the dependent variable, multiple regression is useful. For example, monthly expenditures on dining out (dependent variable) might be predicted from information regarding a family's income, its size, and the age of the head of household (independent variables). Similarly, the researcher might attempt to predict a company's sales from information on its expenditures for advertising, the number of salespeople, and the number of stores carrying its products.
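As a brief illustration of least-squares estimation, the sketch below regresses monthly dining-out spending on income, family size, and age of the head of household using statsmodels; the DataFrame `families`, its column names, and the simulated values are hypothetical.

```python
# Minimal sketch: multiple regression with statsmodels (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
families = pd.DataFrame({
    "income": rng.normal(60, 15, n),        # assumed units: thousands per year
    "family_size": rng.integers(1, 6, n),
    "age_head": rng.integers(25, 70, n),
})
# Simulated dependent variable: dining-out spending per month
families["dining_out"] = (2.5 * families["income"]
                          - 10 * families["family_size"]
                          - 0.5 * families["age_head"]
                          + rng.normal(0, 20, n))

X = sm.add_constant(families[["income", "family_size", "age_head"]])
model = sm.OLS(families["dining_out"], X).fit()
print(model.params)      # estimated regression weights (the variate)
print(model.rsquared)    # share of variance in spending explained
```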
Multiple Discriminant Analysis and Logistic Regression
Multiple discriminant analysis (MDA) is the appropriate multivariate technique if the single dependent variable is dichotomous (e.g., male–female) or multichotomous (e.g., high–medium–low) and therefore nonmetric. As with multiple regression, the independent variables are assumed to be metric. Discriminant analysis is applicable in situations in which the total sample can be divided into groups based on a nonmetric dependent variable characterizing several known classes. The primary objectives of multiple discriminant analysis are to understand group differences and to predict the likelihood that an entity (individual or object) will belong to a particular class or group based on several metric independent variables. Discriminant analysis might be used to distinguish innovators from noninnovators according to their demographic and psychographic profiles. Other applications include distinguishing heavy product users from light users, males from females, national-brand buyers from private-label buyers, and good credit risks from poor credit risks. Even the Internal Revenue Service uses discriminant analysis to compare selected federal tax returns with a composite, hypothetical, normal taxpayer's return (at different income levels) to identify the most promising returns and areas for audit.
Logistic regression models, often referred to as logit analysis, are a combination of multiple regression and multiple discriminant analysis. This technique is similar to multiple regression analysis in that one or more independent variables are used to predict a single dependent variable. What distinguishes a logistic regression model from multiple regression is that the dependent variable is nonmetric, as in discriminant analysis. The nonmetric scale of the dependent variable requires differences in the estimation method and assumptions about the type of underlying distribution, yet in most other facets it is quite similar to multiple regression. Thus, once the dependent variable is correctly specified and the appropriate estimation technique is employed, the basic factors considered in multiple regression are used here as well. Logistic regression models are distinguished from discriminant analysis primarily in that they accommodate all types of independent variables (metric and nonmetric) and do not require the assumption of multivariate normality. However, in many instances, particularly with more than two levels of the dependent variable, discriminant analysis is the more appropriate technique.
Assume financial advisors were trying to develop a means of selecting emerging firms for start-up investment. To assist in this task, they reviewed past records and placed firms into one of two classes: successful over a five-year period, and unsuccessful after five years. For each firm, they also had a wealth of financial and managerial data. They could then use a logistic regression model to identify those financial and managerial data that best differentiated between the successful and unsuccessful firms in order to select the best candidates for investment in the future.
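A minimal sketch of the two approaches on a binary outcome, assuming a hypothetical DataFrame `firms` with two metric ratios and a success label; scikit-learn's estimators stand in for the textbook's generic MDA and logit procedures.

```python
# Minimal sketch: discriminant analysis vs. logistic regression on a binary outcome.
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 300
success = rng.integers(0, 2, n)                         # 1 = successful after five years
firms = pd.DataFrame({
    "profit_margin": rng.normal(5 + 4 * success, 2, n), # hypothetical financial ratios
    "rd_intensity":  rng.normal(2 + 1 * success, 1, n),
    "successful": success,
})

X = firms[["profit_margin", "rd_intensity"]]
y = firms["successful"]

lda = LinearDiscriminantAnalysis().fit(X, y)
logit = LogisticRegression().fit(X, y)

print("LDA discriminant weights:", lda.coef_.ravel())
print("Logit coefficients:      ", logit.coef_.ravel())
print("LDA accuracy:  ", lda.score(X, y))
print("Logit accuracy:", logit.score(X, y))
```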
Canonical Correlation
Canonical correlation analysis can be viewed as a logical extension of multiple regression analysis. Recall that multiple regression analysis involves a single metric dependent variable and several metric independent variables. With canonical analysis the objective is to correlate simultaneously several metric dependent variables and several metric independent variables. Whereas multiple regression involves a single dependent variable, canonical correlation involves multiple dependent variables. The underlying principle is to develop a linear combination of each set of variables (both independent and dependent) in a manner that maximizes the correlation between the two sets. Stated in a different manner, the procedure involves obtaining a set of weights for the dependent and independent variables that provides the maximum simple correlation between the set of dependent variables and the set of independent variables. An additional chapter is available at www.pearsonhigher.com/hair and www.mvstats.com that provides an overview and application of the technique.
Assume a company conducts a study that collects information on its service quality based on answers to 50 metrically measured questions. The study uses questions from published service quality research and includes benchmarking information on perceptions of the service quality of "world-class companies" as well as the company for which the research is being conducted. Canonical correlation could be used to compare the perceptions of the world-class companies on the 50 questions with the perceptions of the company. The research could then conclude whether the perceptions of the company are correlated with those of world-class companies. The technique would provide information on the overall correlation of perceptions as well as the correlation between each of the 50 questions.
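The sketch below estimates the weights that maximize the correlation between two small sets of metric variables using scikit-learn's CCA; the two hypothetical blocks `X_set` and `Y_set` stand in for the independent and dependent variable sets.

```python
# Minimal sketch: canonical correlation between two sets of metric variables.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(3)
n = 200
shared = rng.normal(size=(n, 1))                 # common signal linking the two sets
X_set = shared @ rng.normal(size=(1, 3)) + rng.normal(scale=0.5, size=(n, 3))
Y_set = shared @ rng.normal(size=(1, 2)) + rng.normal(scale=0.5, size=(n, 2))

cca = CCA(n_components=1).fit(X_set, Y_set)
X_c, Y_c = cca.transform(X_set, Y_set)           # canonical variates for each set

# The canonical correlation is the simple correlation between the two variates.
print("First canonical correlation:", np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
print("X weights:", cca.x_weights_.ravel())
print("Y weights:", cca.y_weights_.ravel())
```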
Multivariate Analysis of Variance and Covariance
Multivariate analysis of variance (MANOVA) is a statistical technique that can be used to simultaneously explore the relationship between several categorical independent variables (usually referred to as treatments) and two or more metric dependent variables. As such, it represents an extension of univariate analysis of variance (ANOVA). Multivariate analysis of covariance (MANCOVA) can be used in conjunction with MANOVA to remove (after the experiment) the effect of any uncontrolled metric independent variables (known as covariates) on the dependent variables. The procedure is similar to that involved in bivariate partial correlation, in which the effect of a third variable is removed from the correlation. MANOVA is useful when the researcher designs an experimental situation (manipulation of several nonmetric treatment variables) to test hypotheses concerning the variance in group responses on two or more metric dependent variables.
Assume a company wants to know if a humorous ad will be more effective with its customers than a nonhumorous ad. It could ask its ad agency to develop two ads, one humorous and one nonhumorous, and then show a group of customers the two ads. After seeing the ads, the customers would be asked to rate the company and its products on several dimensions, such as modern versus traditional or high quality versus low quality. MANOVA would be the technique to use to determine the extent of any statistical differences between the perceptions of customers who saw the humorous ad versus those who saw the nonhumorous one.
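A minimal sketch of the humorous-ad example with statsmodels' MANOVA; the rating columns, group labels, and simulated differences are hypothetical, and `mv_test()` reports the usual multivariate test statistics (Wilks' lambda, Pillai's trace, and so on).

```python
# Minimal sketch: MANOVA comparing two ad treatments on two metric ratings.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(4)
n_per_group = 100
ads = pd.DataFrame({
    "ad_type": ["humorous"] * n_per_group + ["nonhumorous"] * n_per_group,
    # Hypothetical ratings; the humorous ad is simulated to score slightly higher.
    "modern":  np.r_[rng.normal(7.0, 1.0, n_per_group), rng.normal(6.5, 1.0, n_per_group)],
    "quality": np.r_[rng.normal(6.8, 1.0, n_per_group), rng.normal(6.4, 1.0, n_per_group)],
})

manova = MANOVA.from_formula("modern + quality ~ ad_type", data=ads)
print(manova.mv_test())   # multivariate tests for the ad_type effect
```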
Conjoint Analysis
Conjoint analysis is an emerging dependence technique that brings new sophistication to the evaluation of objects, such as new products, services, or ideas. The most direct application is in new product or service development, allowing for the evaluation of complex products while maintaining a realistic decision context for the respondent. The market researcher is able to assess the importance of attributes as well as the levels of each attribute while consumers evaluate only a few product profiles, which are combinations of product levels.
Assume a product concept has three attributes (price, quality, and color), each at three possible levels (e.g., red, yellow, and blue). Instead of having to evaluate all 27 (3 × 3 × 3) possible combinations, a subset (9 or more) can be evaluated for their attractiveness to consumers, and the researcher knows not only how important each attribute is but also the importance of each level (e.g., the attractiveness of red versus yellow versus blue). Moreover, when the consumer evaluations are completed, the results of conjoint analysis can also be used in product design simulators, which show customer acceptance for any number of product formulations and aid in the design of the optimal product.
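As a simplified sketch, a traditional full-profile conjoint can be estimated by regressing profile ratings on dummy-coded attribute levels; the nine hypothetical profiles, levels, and ratings below are illustrative, and the fitted coefficients play the role of part-worth utilities.

```python
# Minimal sketch: part-worth estimation for a 3-attribute conjoint via dummy-coded OLS.
import pandas as pd
import statsmodels.formula.api as smf

# Nine hypothetical profiles (a fraction of the 27 possible combinations) and ratings.
profiles = pd.DataFrame({
    "price":   ["low", "low", "low", "medium", "medium", "medium", "high", "high", "high"],
    "quality": ["basic", "standard", "premium", "basic", "standard", "premium",
                "basic", "standard", "premium"],
    "color":   ["red", "yellow", "blue", "yellow", "blue", "red", "blue", "red", "yellow"],
    "rating":  [6, 7, 9, 5, 6, 8, 3, 4, 6],   # one respondent's preference ratings
})

# C(...) expands each attribute into dummy variables relative to a baseline level.
model = smf.ols("rating ~ C(price) + C(quality) + C(color)", data=profiles).fit()
print(model.params)   # part-worths: the utility of each level relative to its baseline
```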
Cluster Analysis
Cluster analysis is an analytical technique for developing meaningful subgroups of individuals or objects. Specifically, the objective is to classify a sample of entities (individuals or objects) into a small number of mutually exclusive groups based on the similarities among the entities. In cluster analysis, unlike discriminant analysis, the groups are not predefined. Instead, the technique is used to identify the groups. Cluster analysis usually involves at least three steps. The first is the measurement of some form of similarity or association among the entities to determine how many groups really exist in the sample. The second step is the actual clustering process, whereby entities are partitioned into groups (clusters). The final step is to profile the persons or variables to determine their composition. Many times this profiling may be accomplished by applying discriminant analysis to the groups identified by the cluster technique.
As an example of cluster analysis, let's assume a restaurant owner wants to know whether customers are patronizing the restaurant for different reasons. Data could be collected on perceptions of pricing, food quality, and so forth. Cluster analysis could be used to determine whether some subgroups (clusters) are highly motivated by low prices versus those who are much less motivated to come to the restaurant based on price considerations.
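A minimal sketch of the restaurant example using k-means on two hypothetical perception scores; the number of clusters, the column names, and the standardization step are illustrative choices rather than part of the original example.

```python
# Minimal sketch: k-means clustering of customers on two perception variables.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Two hypothetical segments: price-driven customers and quality-driven customers.
price_driven = np.column_stack([rng.normal(8, 1, 80), rng.normal(5, 1, 80)])
quality_driven = np.column_stack([rng.normal(4, 1, 80), rng.normal(8, 1, 80)])
customers = pd.DataFrame(np.vstack([price_driven, quality_driven]),
                         columns=["price_importance", "food_quality_importance"])

X = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
customers["cluster"] = kmeans.labels_

# Profile the clusters by comparing their mean perception scores.
print(customers.groupby("cluster").mean().round(2))
```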
Perceptual Mapping
In perceptual mapping (also known as multidimensional scaling), the objective is to transform consumer judgments of similarity or preference (e.g., preference for stores or brands) into distances represented in multidimensional space. If objects A and B are judged by respondents as being the most similar compared with all other possible pairs of objects, perceptual mapping techniques will position objects A and B in such a way that the distance between them in multidimensional space is smaller than the distance between any other pairs of objects. The resulting perceptual maps show the relative positioning of all objects, but additional analyses are needed to describe or assess which attributes predict the position of each object.
As an example of perceptual mapping, let's assume an owner of a Burger King franchise wants to know whether the strongest competitor is McDonald's or Wendy's. A sample of customers is given a survey and asked to rate the pairs of restaurants from most similar to least similar. The results show that Burger King is most similar to Wendy's, so the owners know that the strongest competitor is the Wendy's restaurant because it is thought to be the most similar. Follow-up analysis can identify what attributes influence perceptions of similarity or dissimilarity.
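A minimal sketch that turns an averaged similarity judgment into a two-dimensional map with metric MDS; the three restaurants are from the example, but the dissimilarity values are hypothetical.

```python
# Minimal sketch: perceptual map from a dissimilarity matrix via multidimensional scaling.
import numpy as np
from sklearn.manifold import MDS

brands = ["Burger King", "McDonald's", "Wendy's"]
# Hypothetical average dissimilarity judgments (0 = identical, larger = less similar).
dissimilarity = np.array([
    [0.0, 0.7, 0.3],   # Burger King vs. others
    [0.7, 0.0, 0.6],   # McDonald's vs. others
    [0.3, 0.6, 0.0],   # Wendy's vs. others
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)
for brand, (x, y) in zip(brands, coords):
    print(f"{brand:12s}  x={x:6.3f}  y={y:6.3f}")
```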
Correspondence Analysis
Correspondence analysis is a recently developed interdependence technique that facilitates the perceptual mapping of objects (e.g., products, persons) on a set of nonmetric attributes. Researchers are constantly faced with the need to "quantify the qualitative data" found in nominal variables. Correspondence analysis differs from the interdependence techniques discussed earlier in its ability to accommodate both nonmetric data and nonlinear relationships. In its most basic form, correspondence analysis employs a contingency table, which is the cross-tabulation of two categorical variables. It then transforms the nonmetric data to a metric level and performs dimensional reduction (similar to factor analysis) and perceptual mapping. Correspondence analysis provides a multivariate representation of interdependence for nonmetric data that is not possible with other methods.
As an example, respondents' brand preferences can be cross-tabulated on demographic variables (e.g., gender, income categories, occupation) by indicating how many people preferring each brand fall into each category of the demographic variables. Through correspondence analysis, the association, or "correspondence," of brands and the distinguishing characteristics of those preferring each brand are then shown in a two- or three-dimensional map of both brands and respondent characteristics. Brands perceived as similar are located close to one another. Likewise, the most distinguishing characteristics of respondents preferring each brand are also determined by the proximity of the demographic variable categories to the brand's position.
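A minimal sketch of simple correspondence analysis built directly from the singular value decomposition of a contingency table's standardized residuals; the brand-by-income counts are hypothetical.

```python
# Minimal sketch: simple correspondence analysis of a brand-by-income contingency table.
import numpy as np
import pandas as pd

# Hypothetical cross-tabulation: counts of respondents preferring each brand by income group.
table = pd.DataFrame(
    [[40, 25, 10],
     [20, 35, 30],
     [15, 20, 45]],
    index=["Brand A", "Brand B", "Brand C"],
    columns=["low income", "middle income", "high income"],
)

P = table.to_numpy(dtype=float)
P /= P.sum()                                          # correspondence matrix
r = P.sum(axis=1)                                     # row masses
c = P.sum(axis=0)                                     # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]           # principal coordinates for brands
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]        # principal coordinates for income groups

print(pd.DataFrame(row_coords[:, :2], index=table.index, columns=["dim1", "dim2"]).round(2))
print(pd.DataFrame(col_coords[:, :2], index=table.columns, columns=["dim1", "dim2"]).round(2))
```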
Structural Equation Modeling and Confirmatory Factor Analysis
Structural equation modeling (SEM) is a technique that allows separate relationships for each of a set of dependent variables. In its simplest sense, structural equation modeling provides the appropriate and most efficient estimation technique for a series of separate multiple regression equations estimated simultaneously. It is characterized by two basic components: (1) the structural model and (2) the measurement model. The structural model is the path model, which relates independent to dependent variables. In such situations, theory, prior experience, or other guidelines enable the researcher to distinguish which independent variables predict each dependent variable. Models discussed previously that accommodate multiple dependent variables (multivariate analysis of variance and canonical correlation) are not applicable in this situation because they allow only a single relationship between dependent and independent variables. The measurement model enables the researcher to use several variables (indicators) for a single independent or dependent variable. For example, the dependent variable might be a concept represented by a summated scale, such as self-esteem. In a confirmatory factor analysis the researcher can assess the contribution of each scale item as well as incorporate how well the scale measures the concept (reliability). The scales are then integrated into the estimation of the relationships between dependent and independent variables in the structural model. This procedure is similar to performing a factor analysis (discussed in a later section) of the scale items and using the factor scores in the regression.
A study by management consultants identified several factors that affect worker satisfaction: supervisor support, work environment, and job performance. In addition to this relationship, they noted a separate relationship wherein supervisor support and work environment were unique predictors of job performance. Hence, they had two separate, but interrelated relationships. Supervisor support and the work environment not only affected worker satisfaction directly, but had possible indirect effects through the relationship with job performance, which was also a predictor of worker satisfaction. In attempting to assess these relationships, the consultants also developed multi-item scales for each construct (supervisor support, work environment, job performance, and worker satisfaction). SEM provides a means of not only assessing each of the relationships simultaneously rather than in separate analyses, but also incorporating the multi-item scales in the analysis to account for measurement error associated with each of the scales.
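The sketch below is not full SEM; it is the two-stage analogue the passage mentions (a factor score for each multi-item scale, then the two interrelated regressions), built on hypothetical survey items. A dedicated SEM package (e.g., semopy in Python or lavaan in R) would estimate the measurement and structural models simultaneously and account for measurement error.

```python
# Minimal sketch: two-stage analogue of the worker-satisfaction path model
# (factor scores per multi-item scale, then the two interrelated regressions).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(6)
n = 250
# Hypothetical latent constructs wired up as in the example.
support = rng.normal(size=n)
environment = rng.normal(size=n)
performance = 0.5 * support + 0.4 * environment + rng.normal(scale=0.7, size=n)
satisfaction = (0.3 * support + 0.3 * environment + 0.4 * performance
                + rng.normal(scale=0.7, size=n))
latent = {"support": support, "environment": environment,
          "performance": performance, "satisfaction": satisfaction}

# Three hypothetical survey items per construct (the multi-item scales).
survey = pd.DataFrame({f"{name}_{i}": values + rng.normal(scale=0.6, size=n)
                       for name, values in latent.items() for i in (1, 2, 3)})

# Step 1 (measurement): one factor score per scale (factor score signs are arbitrary).
scores = pd.DataFrame({
    name: FactorAnalysis(n_components=1, random_state=0)
          .fit_transform(survey[[f"{name}_{i}" for i in (1, 2, 3)]]).ravel()
    for name in latent
})

# Step 2 (structural): job performance first, then worker satisfaction.
perf = sm.OLS(scores["performance"],
              sm.add_constant(scores[["support", "environment"]])).fit()
satis = sm.OLS(scores["satisfaction"],
               sm.add_constant(scores[["support", "environment", "performance"]])).fit()
print(perf.params.round(2))
print(satis.params.round(2))
```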
KEY TERMS
Histogram: Graphical display of the distribution of a single variable. By forming frequency counts in categories, the shape of the variable's distribution can be shown. Used to make a visual comparison to the normal distribution.
Homoscedasticity: When the variance of the error terms (e) appears constant over a range of predictor variables, the data are said to be homoscedastic. The assumption of equal variance of the population error E (where E is estimated from e) is critical to the proper application of many multivariate techniques. When the error terms have increasing or modulating variance, the data are said to be heteroscedastic. Analysis of residuals best illustrates this point.
Normal distribution: Purely theoretical continuous probability distribution in which the horizontal axis represents all possible values of a variable and the vertical axis represents the probability of those values occurring. The scores on the variable are clustered around the mean in a symmetrical, unimodal pattern known as the bell-shaped, or normal, curve.
Residual: Portion of a dependent variable not explained by a multivariate technique. Associated with dependence methods that attempt to predict the dependent variable, the residual represents the unexplained portion of the dependent variable. Residuals can be used in diagnostic procedures to identify problems in the estimation technique or to identify unspecified relationships.
Scatterplot: Representation of the relationship between two metric variables portraying the joint values of each observation in a two-dimensional graph.
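The sketch below ties several of these terms together on hypothetical data: it fits a simple regression, then draws a histogram of the residuals and a residuals-versus-fitted scatterplot, the standard visual checks for normality and constant error variance.

```python
# Minimal sketch: residual diagnostics with a histogram and a scatterplot.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(scale=1.5, size=200)      # hypothetical linear relationship

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=20)                     # histogram: compare to a bell shape
axes[0].set_title("Histogram of residuals")
axes[1].scatter(fitted, residuals, alpha=0.6)        # scatterplot: look for a constant band
axes[1].axhline(0, linestyle="--")
axes[1].set_title("Residuals vs. fitted values")
plt.tight_layout()
plt.show()
```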
HOMOSCEDASTICITY
The next assumption is related primarily to dependence relationships between variables. Homoscedasticity refers to the assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s). Homoscedasticity is desirable because the variance of the dependent variable being explained in the dependence relationship should not be concentrated in only a limited range of the independent values. In most situations, we have many different values of the dependent variable at each value of the independent variable. For this relationship to be fully captured, the dispersion (variance) of the dependent variable values must be relatively equal at each value of the predictor variable. If this dispersion is unequal across values of the independent variable, the relationship is said to be heteroscedastic. Although the dependent variables must be metric, this concept of an equal spread of variance across independent variables can be applied when the independent variables are either metric or nonmetric:
• Metric independent variables. The concept of homoscedasticity is based on the spread of dependent variable variance across the range of independent variable values, which is encountered in techniques such as multiple regression. The dispersion of values for the dependent variable should be as large for small values of the independent values as it is for moderate and large values. In a scatterplot, it is seen as an elliptical distribution of points.
• Nonmetric independent variables. In these analyses (e.g., ANOVA and MANOVA) the focus now becomes the equality of the variance (single dependent variable) or the variance/covariance matrices (multiple dependent variables) across the groups formed by the nonmetric independent variables. The equality of variance/covariance matrices is also seen in discriminant analysis, but in this technique the emphasis is on the spread of the independent variables across the groups formed by the nonmetric dependent measure.
In each of these instances, the purpose is the same: to ensure that the variance used in explanation and prediction is distributed across the range of values, thus allowing for a "fair test" of the relationship across all values of the nonmetric variables. The two most common sources of heteroscedasticity are the following:
• Variable type. Many types of variables have a natural tendency toward differences in dispersion. For example, as a variable increases in value (e.g., units ranging from near zero to millions) a naturally wider range of answers is possible for the larger values. Also, when percentages are used the natural tendency is for many values to be in the middle range, with few in the lower or higher values.
• Skewed distribution of one or both variables. A skewed variable concentrates observations in one part of its range, so the dispersion of the dependent variable differs across the rest of that range.
The result of heteroscedasticity is to cause the predictions to be better at some levels of the independent variable than at others. This variability affects the standard errors and makes hypothesis tests either too stringent or too insensitive. The effect of heteroscedasticity is also often related to sample size, especially when examining the variance dispersion across groups. For example, in ANOVA or MANOVA the impact of heteroscedasticity on the statistical test depends on the sample sizes associated with the groups of smaller and larger variances. In multiple regression analysis, similar effects would occur in highly skewed distributions where there were disproportionate numbers of respondents in certain ranges of the independent variable.
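Two common statistical checks, sketched on hypothetical data: a Breusch-Pagan test on regression residuals for the metric case, and Levene's test for equality of group variances in the ANOVA-style case.

```python
# Minimal sketch: statistical checks for heteroscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import levene

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, 300)
# Hypothetical heteroscedastic data: error spread grows with x.
y = 5 + 2 * x + rng.normal(scale=0.5 * x, size=300)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Metric independent variable: Breusch-Pagan test on the residuals.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", round(lm_pvalue, 4))    # small p-value -> heteroscedastic

# Nonmetric independent variable: Levene's test for equal variances across two groups.
group_a = rng.normal(0, 1.0, 100)
group_b = rng.normal(0, 2.5, 100)
stat, pvalue = levene(group_a, group_b)
print("Levene p-value:", round(pvalue, 4))              # small p-value -> unequal variances
```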
LINEARITY
An implicit assumption of all multivariate techniques based on correlational measures of association, including multiple regression, logistic regression, factor analysis, and structural equation modeling, is linearity. Because correlations represent only the linear association between variables, nonlinear effects will not be represented in the correlation value. This omission results in an underestimation of the actual strength of the relationship. It is always prudent to examine all relationships to identify any departures from linearity that may affect the correlation.
Identifying Nonlinear Relationships. The most common way to assess linearity is to examine scatterplots of the variables and to identify any nonlinear patterns in the data. Many scatterplot programs can show the straight line depicting the linear relationship, enabling the researcher to better identify any nonlinear characteristics. An alternative approach is to run a simple regression analysis and to examine the residuals. The residuals reflect the unexplained portion of the dependent variable; thus, any nonlinear portion of the relationship will show up in the residuals. A third approach is to explicitly model a nonlinear relationship by testing alternative model specifications (also known as curve fitting) that reflect the nonlinear elements.
Remedies for Nonlinearity. If a nonlinear relationship is detected, the most direct approach is to transform one or both variables to achieve linearity. A number of available transformations are discussed later in this chapter. An alternative to data transformation is the creation of new variables to represent the nonlinear portion of the relationship.
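A minimal sketch of the second and third approaches on hypothetical curved data: inspect the residuals of a straight-line fit, then add a squared term as a new variable representing the nonlinear portion.

```python
# Minimal sketch: detecting nonlinearity via residuals and remedying it with a squared term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 1 + 0.5 * df["x"] ** 2 + rng.normal(scale=2, size=200)   # hypothetical curved relationship

linear = smf.ols("y ~ x", data=df).fit()
curved = smf.ols("y ~ x + I(x**2)", data=df).fit()    # new variable for the nonlinear portion

# A systematic pattern: residuals of the straight-line fit still relate to x squared.
print("Corr(residuals, x^2):", round(float(np.corrcoef(linear.resid, df["x"] ** 2)[0, 1]), 3))
print("R-squared, linear fit:", round(linear.rsquared, 3))
print("R-squared, with squared term:", round(curved.rsquared, 3))
```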
WHAT IS FACTOR ANALYSIS?
Factor analysis is an interdependence technique whose primary purpose is to define the underlying structure among the variables in the analysis. Obviously, variables play a key role in any multivariate analysis. Whether we are making a sales forecast with regression, predicting success or failure of a new firm with discriminant analysis, or using any other multivariate technique, we must have a set of variables upon which to form relationships (e.g., What variables best predict sales or success/failure?). As such, variables are the building blocks of relationships.
As we employ multivariate techniques, by their very nature, the number of variables increases. Univariate techniques are limited to a single variable, but multivariate techniques can have tens, hundreds, or even thousands of variables. But how do we describe and represent all of these variables? Certainly, if we have only a few variables, they may all be distinct and different. As we add more and more variables, more and more overlap (i.e., correlation) is likely among the variables. In some instances, such as when we are using multiple measures to overcome measurement error by multivariable measurement, the researcher even strives for correlation among the variables. As the variables become correlated, the researcher now needs ways in which to manage these variables: grouping highly correlated variables together, labeling or naming the groups, and perhaps even creating a new composite measure that can represent each group of variables.
We introduce factor analysis as our first multivariate technique because it can play a unique role in the application of other multivariate techniques. Broadly speaking, factor analysis provides the tools for analyzing the structure of the interrelationships (correlations) among a large number of variables (e.g., test scores, test items, questionnaire responses) by defining sets of variables that are highly interrelated, known as factors. These groups of variables (factors), which are by definition highly intercorrelated, are assumed to represent dimensions within the data. If we are only concerned with reducing the number of variables, then the dimensions can guide the creation of new composite measures. However, if we have a conceptual basis for understanding the relationships between variables, then the dimensions may actually have meaning for what they collectively represent. In the latter case, these dimensions may correspond to concepts that cannot be adequately described by a single measure (e.g., store atmosphere is defined by many sensory components that must be measured separately but are all interrelated). We will see that factor analysis presents several ways of representing these groups of variables for use in other multivariate techniques.
OBJECTIVES OF FACTOR ANALYSIS
The starting point in factor analysis, as with other statistical techniques, is the research problem. The general purpose of factor analytic techniques is to find a way to condense (summarize) the information contained in a number of original variables into a smaller set of new, composite dimensions or variates (factors) with a minimum loss of information; that is, to search for and define the fundamental constructs or dimensions assumed to underlie the original variables [18, 33]. In meeting its objectives, factor analysis is keyed to four issues: specifying the unit of analysis, achieving data summarization and/or data reduction, variable selection, and using factor analysis results with other multivariate techniques.
DESIGNING A FACTOR ANALYSIS
The design of a factor analysis involves three basic decisions: (1) calculation of the input data (a correlation matrix) to meet the specified objectives of grouping variables or respondents; (2) design of the study in terms of the number of variables, measurement properties of variables, and the types of allowable variables; and (3) the sample size necessary, both in absolute terms and as a function of the number of variables in the analysis.
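Because the input data for a factor analysis of variables is their correlation matrix, the sketch below computes and screens one for a small hypothetical item set; the DataFrame, column names, and the 0.30 screening threshold are illustrative assumptions.

```python
# Minimal sketch: the correlation matrix that serves as the input data for a
# factor analysis of variables (hypothetical survey responses).
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
latent = rng.normal(size=(150, 1))
survey = pd.DataFrame(latent @ rng.normal(size=(1, 4)) + rng.normal(scale=0.7, size=(150, 4)),
                      columns=["item_1", "item_2", "item_3", "item_4"])

corr = survey.corr()
print(corr.round(2))

# A quick screen: items with few sizable correlations share little common
# variance with the rest and are weak candidates for factoring.
print((corr.abs() > 0.30).sum() - 1)        # count of sizable correlations per item
```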
Rotation of Factors
Perhaps the most important tool in interpreting factors is factor rotation. The term rotation means exactly what it implies. Specifically, the reference axes of the factors are turned about the origin until some other position has been reached. As indicated earlier, unrotated factor solutions extract factors in the order of their variance extracted. The first factor tends to be a general factor with almost every variable loading significantly, and it accounts for the largest amount of variance. The second and subsequent factors are then based on the residual amount of variance. Each accounts for successively smaller portions of variance. The ultimate effect of rotating the factor matrix is to redistribute the variance from earlier factors to later ones to achieve a simpler, theoretically more meaningful factor pattern.
The simplest case of rotation is an orthogonal factor rotation, in which the axes are maintained at 90 degrees. It is also possible to rotate the axes and not retain the 90-degree angle between the reference axes. When not constrained to being orthogonal, the rotational procedure is called an oblique factor rotation. Orthogonal and oblique factor rotations are demonstrated in Figures 7 and 8, respectively.
Figure 7, in which five variables are depicted in a two-dimensional factor diagram, illustrates factor rotation. The vertical axis represents unrotated factor II, and the horizontal axis represents unrotated factor I. The axes are labeled with 0 at the origin and extend outward to +1.0 or -1.0. The numbers on the axes represent the factor loadings. The five variables are labeled V1, V2, V3, V4, and V5. The factor loading for variable 2 (V2) on unrotated factor II is determined by drawing a dashed line horizontally from the data point to the vertical axis for factor II. Similarly, a vertical line is drawn from variable 2 to the horizontal axis of unrotated factor I to determine the loading of variable 2 on factor I. A similar procedure followed for the remaining variables determines the factor loadings for the unrotated and rotated solutions, as displayed in Table 1 for comparison purposes. On the unrotated first factor, all the variables load fairly high. On the unrotated second factor, variables 1 and 2 are very high in the positive direction. Variable 5 is moderately high in the negative direction, and variables 3 and 4 have considerably lower loadings in the negative direction.
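A minimal sketch contrasting unrotated and varimax-rotated loadings on hypothetical two-factor data; the variable names mirror the V1 to V5 labels above, and the `rotation` option is assumed to be available in the installed scikit-learn version (it was added in release 0.24).

```python
# Minimal sketch: unrotated vs. varimax-rotated factor loadings.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(11)
n = 300
f1, f2 = rng.normal(size=n), rng.normal(size=n)
data = pd.DataFrame({
    "V1": f1 + 0.3 * f2 + rng.normal(scale=0.5, size=n),
    "V2": f1 + 0.3 * f2 + rng.normal(scale=0.5, size=n),
    "V3": 0.3 * f1 + f2 + rng.normal(scale=0.5, size=n),
    "V4": 0.3 * f1 + f2 + rng.normal(scale=0.5, size=n),
    "V5": 0.3 * f1 + f2 + rng.normal(scale=0.5, size=n),
})

unrotated = FactorAnalysis(n_components=2, random_state=0).fit(data)
rotated = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(data)

cols = ["factor_I", "factor_II"]
print(pd.DataFrame(unrotated.components_.T, index=data.columns, columns=cols).round(2))
print(pd.DataFrame(rotated.components_.T, index=data.columns, columns=cols).round(2))
# After rotation each variable should load mainly on a single factor (simple structure).
```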
WHAT IS MULTIPLE REGRESSION ANALYSIS?
Multiple regression analysis is a statistical technique that can be used to analyze the relationship between a single dependent (criterion) variable and several independent (predictor) variables. The objective of multiple regression analysis is to use the independent variables whose values are known to predict the single dependent value selected by the researcher. Each independent variable is weighted by the regression analysis procedure to ensure maximal prediction from the set of independent variables. The weights denote the relative contribution of the independent variables to the overall prediction and facilitate interpretation as to the influence of each variable in making the prediction, although correlation among the independent variables complicates the interpretative process. The set of weighted independent variables forms the regression variate, a linear combination of the independent variables that best predicts the dependent variable. The regression variate, also referred to as the regression equation or regression model, is the most widely known example of a variate among the multivariate techniques.
Multiple regression analysis is a dependence technique. Thus, to use it you must be able to divide the variables into dependent and independent variables. Regression analysis is also a statistical tool that should be used only when both the dependent and independent variables are metric. However, under certain circumstances it is possible to include nonmetric data either as independent variables (by transforming either ordinal or nominal data with dummy variable coding) or as the dependent variable (by the use of a binary measure in the specialized technique of logistic regression). In summary, to apply multiple regression analysis: (1) the data must be metric or appropriately transformed, and (2) before deriving the regression equation, the researcher must decide which variable is to be dependent and which remaining variables will be independent.
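The sketch below illustrates the dummy-coding route mentioned above: a nominal predictor (region) is converted into indicator variables alongside a metric predictor before fitting the regression. The data and column names are hypothetical.

```python
# Minimal sketch: including a nonmetric independent variable via dummy coding.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 120
df = pd.DataFrame({
    "ad_spend": rng.uniform(10, 100, n),                  # metric predictor
    "region": rng.choice(["north", "south", "west"], n),  # nominal predictor
})
region_effect = df["region"].map({"north": 0, "south": 15, "west": 30})
df["sales"] = 50 + 1.2 * df["ad_spend"] + region_effect + rng.normal(0, 10, n)

# Dummy coding: drop one category as the reference level to avoid perfect collinearity.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=float)
X = sm.add_constant(pd.concat([df[["ad_spend"]], dummies], axis=1))
print(sm.OLS(df["sales"], X).fit().params.round(2))
```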
  • 14. WHAT IS DISCRIMINANT ANALYSIS? In attempting to choose an appropriate analytical technique, we sometimes encounter a problem that involves a categorical dependent variable and several metric independent variables. For example, we may wish to distinguish good from bad credit risks. If we had a metric measure of credit risk, then we could use multiple regression. In many instances we do not have the metric measure necessary for multiple regression. Instead, we are only able to ascertain whether someone is in a particular group (e.g., good or bad credit risk). Discriminant analysis is the appropriate statistical technique when the dependent variable is a categorical (nominal or nonmetric) variable and the independent variables are metric. In many cases, the dependent variable consists of two groups or classifications, for example, male versus female or high versus low. In other instances, more than two groups are involved, such as low, medium, and high classifications. Discriminant analysis is capable of handling either two groups or multiple (three or more) groups. When two classifications are involved, the technique is referred to as two-group discriminant analysis. When three or more classifications are identified, the technique is referred to as multiple discriminant analysis (MDA). Logistic regression is limited in its basic form to two groups, although other formulations can handle more groups. Discriminant Analysis Discriminant analysis involves deriving a variate. The discriminant variate is the linear combination of the two (or more) independent variables that will discriminate best between the objects (persons, firms, etc.) in the groups defined a priori. Discrimination is achieved by calculating the variate's weights for each independent variable to maximize the differences between the groups (i.e., the between-group variance relative to the within-group variance). The variate for a discriminant analysis, also known as the discriminant function, is derived from an equation much like that seen in multiple regression. It takes the following form:

Zjk = a + W1X1k + W2X2k + ... + WnXnk

where
Zjk = discriminant Z score of discriminant function j for object k
a = intercept
Wi = discriminant weight for independent variable i
Xik = independent variable i for object k

As with the variate in regression or any other multivariate technique, the discriminant score for each object in the analysis (person, firm, etc.) is a summation of the values obtained by multiplying each independent variable by its discriminant weight. What is unique about discriminant analysis is that more than one discriminant function may be present, resulting in each object possibly having more than one discriminant score. We will discuss what determines the number of discriminant functions later, but here we see that discriminant analysis has both similarities and unique elements when compared to other multivariate techniques. Discriminant analysis is the appropriate statistical technique for testing the hypothesis that the group means of a set of independent variables for two or more groups are equal. By averaging the discriminant scores for all the individuals within a particular group, we arrive at the group mean. This group mean is referred to as a centroid. OBJECTIVES OF DISCRIMINANT ANALYSIS A review of the objectives for applying discriminant analysis should further clarify its nature. Discriminant analysis can address any of the following research objectives:
1. Determining whether statistically significant differences exist between the average score profiles on a set of variables for two (or more) a priori defined groups
2. Determining which of the independent variables most account for the differences in the average score profiles of the two or more groups
3. Establishing the number and composition of the dimensions of discrimination between groups formed from the set of independent variables
4. Establishing procedures for classifying objects (individuals, firms, products, etc.) into groups on the basis of their scores on a set of independent variables
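The sketch below shows a two-group discriminant analysis in the spirit of the credit-risk example, using scikit-learn's LinearDiscriminantAnalysis on simulated data; the two predictors (credit score and debt ratio) and all numeric values are assumptions for illustration only.

```python
# Minimal two-group discriminant analysis sketch; the credit-risk data are simulated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
good = rng.normal([650, 0.3], [40, 0.1], size=(100, 2))   # credit score, debt ratio
bad = rng.normal([560, 0.6], [40, 0.1], size=(100, 2))
X = np.vstack([good, bad])
y = np.array([0] * 100 + [1] * 100)                       # a priori groups (0 = good, 1 = bad)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)           # discriminant weights and constant
z = lda.decision_function(X)               # discriminant Z score for each object
print(z[y == 0].mean(), z[y == 1].mean())  # group centroids on the discriminant function
```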
  • 15. WHAT IS LOGISTIC REGRESSION? Logistic regression, along with discriminant analysis, is the appropriate statistical technique when the dependent variable is a categorical (nominal or nonmetric) variable and the independent variables are metric or nonmetric variables. Compared to discriminant analysis, logistic regression is limited in its basic form to two groups for the dependent variable, although other formulations can handle more groups. It does have the advantage, however, of easily incorporating nonmetric variables as independent variables, much as in multiple regression. In a practical sense, logistic regression may be preferred for two reasons. First, discriminant analysis relies on strictly meeting the assumptions of multivariate normality and equal variance–covariance matrices across groups—assumptions that are not met in many situations. Logistic regression does not face these strict assumptions and is much more robust when they are not met, making its application appropriate in many situations. Second, even if the assumptions are met, many researchers prefer logistic regression because it is similar to multiple regression. It has straightforward statistical tests, similar approaches to incorporating metric and nonmetric variables and nonlinear effects, and a wide range of diagnostics. Thus, for these and more technical reasons, logistic regression is equivalent to two-group discriminant analysis and may be more suitable in many situations. OBJECTIVES OF LOGISTIC REGRESSION Logistic regression is identical to discriminant analysis in terms of the basic objectives it can address. Logistic regression is best suited to address two research objectives: • Identifying the independent variables that impact group membership in the dependent variable • Establishing a classification system based on the logistic model for determining group membership The first objective is quite similar to the primary objectives of discriminant analysis and even multiple regression in that emphasis is placed on the explanation of group membership in terms of the independent variables in the model. In the classification process, logistic regression, like discriminant analysis, provides a basis for classifying not only the sample used to estimate the model but also any other observations that have values for all the independent variables. In this way, logistic regression can classify other observations into the defined groups. USE OF THE LOGISTIC CURVE Because the binary dependent variable has only the values of 0 and 1, the predicted value (probability) must be bounded to fall within the same range. To define a relationship bounded by 0 and 1, logistic regression uses the logistic curve to represent the relationship between the independent and dependent variables (see Figure 1). At very low levels of the independent variable, the probability approaches 0 but never reaches it. Likewise, as the independent variable increases, the predicted values increase up the curve, but then the slope starts decreasing so that at any level of the independent variable the probability approaches 1.0 but never exceeds it. The linear models of regression cannot accommodate such a relationship, because it is inherently nonlinear.
The linear relationship of regression, even with additional terms or transformations for nonlinear effects, cannot guarantee that the predicted values will remain within the range of 0 and 1. UNIQUE NATURE OF THE DEPENDENT VARIABLE The binary nature of the dependent variable (0 or 1) has properties that violate the assumptions of multiple regression. First, the error term of a discrete variable follows the binomial distribution instead of the normal distribution, thus invalidating all statistical testing based on the assumption of normality. Second, the variance of a dichotomous variable is not constant, creating instances of heteroscedasticity as well. Moreover, neither violation can be remedied through transformations of the dependent or independent variables. Logistic regression was developed specifically to deal with these issues. Its unique relationship between dependent and independent variables, however, requires a somewhat different approach to estimating the variate, assessing goodness-of-fit, and interpreting the coefficients when compared to multiple regression.
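A minimal sketch of these points with statsmodels follows; the single predictor and the binary outcome are simulated, so the coefficients themselves are not meaningful, but the predicted values illustrate the bounded logistic curve.

```python
# Minimal logistic regression sketch; simulated data, predicted probabilities stay in (0, 1).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=200)
true_p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # true logistic relationship
y = rng.binomial(1, true_p)                   # binary (0/1) dependent variable

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(fit.params)                             # intercept and slope on the logit scale
probs = fit.predict(sm.add_constant(x))
print(probs.min(), probs.max())               # bounded below by 0 and above by 1
```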
  • 17. WHAT IS CONJOINT ANALYSIS? Conjoint analysis is a multivariate technique developed specifically to understand how respondents develop preferences for any type of object (products, services, or ideas). It is based on the simple premise that consumers evaluate the value of an object (real or hypothetical) by combining the separate amounts of value provided by each attribute. Moreover, consumers can best provide their estimates of preference by judging objects formed by combinations of attributes. Utility, a subjective judgment of preference unique to each individual, is the most fundamental concept in conjoint analysis and the conceptual basis for measuring value. The researcher using conjoint analysis to study what determines utility should consider several key issues: • Utility encompasses all features of the object, both tangible and intangible, and as such is a measure of an individual's overall preference. • Utility is assumed to be based on the value placed on each of the levels of the attributes; respondents react to varying combinations of attribute levels (e.g., different prices, features, or brands) with varying levels of preference. • Utility is expressed by a relationship reflecting the manner in which utility is formulated for any combination of attributes. For example, we might sum the utility values associated with each feature of a product or service to arrive at an overall utility, and then assume that products or services with higher utility values are more preferred and have a better chance of being chosen. To be successful in defining utility, the researcher must be able to describe the object in terms of both its attributes and all relevant values for each attribute. To do so, the researcher develops a conjoint task, which not only identifies the relevant attributes but also defines those attributes so that hypothetical choice situations can be constructed. In doing so, the researcher faces four specific questions: 1. What are the important attributes that could affect preference? In order to accurately measure preference, the researcher must be able to identify all of the attributes, known as factors, that provide utility and form the basis for preference and choice. Factors represent the specific attributes or other characteristics of the product or service. 2. How will respondents know the meaning of each factor? In addition to specifying the factors, the researcher must also define each factor in terms of levels, which are the possible values for that factor. These values enable the researcher to describe an object in terms of its levels on the set of factors characterizing it. For example, brand name and price might be two factors in a conjoint analysis. Brand name might have two levels (brand X and brand Y), whereas price might have four levels (39 cents, 49 cents, 59 cents, and 69 cents). 3. What do the respondents actually evaluate? After the researcher selects the factors and the levels to describe an object, they are combined (one level from each factor) into a profile, which is similar to a stimulus in a traditional experiment. Therefore, a profile for our simple example might be brand X at 49 cents. 4. How many profiles are evaluated? Conjoint analysis is unique among the multivariate methods, as will be discussed later, in that respondents provide multiple evaluations. In terms of the conjoint task, a respondent evaluates a number of profiles in order to provide a basis for understanding their preferences.
The process of deciding on the actual number of profiles and their composition is contained in the design. These four questions focus on ensuring that the respondent is able to perform a realistic task—choosing among a set of objects (profiles). Respondents need not tell the researcher anything else, such as how important an individual attribute is to them or how well the object performs on any specific attribute. Because the researcher constructed the hypothetical objects in a specific manner, the influence of each attribute and of each value of each attribute on the utility judgment of a respondent can be determined from the respondents' overall ratings. THE OBJECTIVES OF CONJOINT ANALYSIS As with any statistical analysis, the starting point is the research question. In understanding consumer decisions, conjoint analysis meets two basic objectives: 1. To determine the contributions of predictor variables and their levels in the determination of consumer preferences. For example, how much does price contribute to the willingness to buy a product? Which price level is the best? How much change in the willingness to buy soap can be accounted for by differences between the levels of price? 2. To establish a valid model of consumer judgments. Valid models enable us to predict the consumer acceptance of any combination of attributes, even those not originally evaluated by consumers. In doing so, the issues addressed include the following: Do the respondents' choices indicate a simple linear relationship between the predictor variables and choices? Is a simple model of adding up the value of each attribute sufficient, or do we need to include more complex evaluations of preference to mirror the judgment process adequately? The respondent reacts only to what the researcher provides in terms of profiles (attribute combinations). Are these the actual attributes used in making a decision? Are other attributes, particularly attributes of a more qualitative nature, such as emotional reactions, important as well? These and other considerations require the research question to be framed around two major issues: • Is it possible to describe all the attributes that give utility or value to the product or service being studied? • What are the key attributes involved in the choice process for this type of product or service? These questions need to be resolved before proceeding into the design phase of a conjoint analysis because they provide critical guidance for key decisions in each stage.
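One common way to estimate part-worth utilities is to regress a respondent's profile ratings on dummy-coded factor levels. The sketch below does this for the brand and price factors mentioned earlier; the eight ratings are invented for illustration and do not come from any study.

```python
# Minimal conjoint sketch: part-worth utilities via dummy-coded regression.
# The ratings for the eight brand-by-price profiles are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

profiles = pd.DataFrame(
    [(brand, price) for brand in ["X", "Y"] for price in [39, 49, 59, 69]],
    columns=["brand", "price"],
)
profiles["rating"] = [8, 6, 5, 3, 7, 5, 4, 2]   # one respondent's preference ratings

fit = smf.ols("rating ~ C(brand) + C(price)", data=profiles).fit()
print(fit.params)   # estimated part-worths for each brand and price level
```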
  • 18. WHAT IS CLUSTER ANALYSIS? Cluster analysis is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess. It has been referred to as Q analysis, typology construction, classification analysis, and numerical taxonomy. This variety of names is due to the usage of clustering methods in such diverse disciplines as psychology, biology, sociology, economics, engineering, and business. Although the names differ across disciplines, the methods all have a common dimension: classification according to relationships among the objects being clustered [1, 2, 4, 10, 22, 27]. This common dimension represents the essence of all clustering approaches—the classification of data as suggested by natural groupings of the data themselves. Cluster analysis is comparable to factor analysis in its objective of assessing structure. Cluster analysis differs from factor analysis, however, in that cluster analysis groups objects, whereas factor analysis is primarily concerned with grouping variables. Additionally, factor analysis makes the groupings based on patterns of variation (correlation) in the data, whereas cluster analysis makes groupings on the basis of distance (proximity). HOW DOES CLUSTER ANALYSIS WORK? Cluster analysis performs a task innate to all individuals—pattern recognition and grouping. The human ability to process even slight differences in innumerable characteristics is a cognitive process that is not easily matched by all of our technological advances. Take, for example, the task of analyzing and grouping human faces. Even from birth, individuals can quickly identify slight differences in facial expressions and sort different faces into homogeneous groups while considering hundreds of facial characteristics. Yet facial recognition programs still struggle to accomplish the same task. The process of identifying natural groupings is one that can become quite complex rather quickly. To demonstrate how cluster analysis operates, we examine a simple example that illustrates some of the key issues: measuring similarity, forming clusters, and deciding on the number of clusters that best represent structure. We also briefly discuss the balance of objective and subjective considerations that must be addressed by any researcher. Objectives of Cluster Analysis The primary goal of cluster analysis is to partition a set of objects into two or more groups based on the similarity of the objects for a set of specified characteristics (the cluster variate). In fulfilling this basic objective, the researcher must address two key issues: the research questions being addressed in this analysis and the variables used to characterize objects in the clustering process. We will discuss each issue in the following section.
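As a simple demonstration of forming clusters, the sketch below applies k-means to simulated objects measured on two characteristics; the number of clusters (three) and all data values are assumptions made for the example.

```python
# Minimal cluster analysis sketch: k-means on two standardized characteristics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal([1.0, 1.0], 0.3, size=(50, 2)),   # three simulated "natural" groupings
    rng.normal([4.0, 1.0], 0.3, size=(50, 2)),
    rng.normal([2.5, 4.0], 0.3, size=(50, 2)),
])
X_std = StandardScaler().fit_transform(X)        # distance-based grouping needs comparable scales

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
print(km.labels_[:10])        # cluster membership for the first ten objects
print(km.cluster_centers_)    # cluster profiles on the standardized characteristics
```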
WHAT IS MULTIDIMENSIONAL SCALING? Multidimensional scaling (MDS), also known as perceptual mapping, is a procedure that enables a researcher to determine the perceived relative image of a set of objects (firms, products, ideas, or other items associated with commonly held perceptions). The purpose of MDS is to transform consumer judgments of overall similarity or preference (e.g., preference for stores or brands) into distances represented in multidimensional space.
  • 19. Comparing Objects Multidimensional scaling is based on the comparison of objects (e.g., a product, service, person, or aroma). MDS differs from other multivariate methods in that it uses only a single, overall measure of similarity or preference. To perform a multidimensional scaling analysis, the researcher performs three basic steps: 1. Gather measures of similarity or preference across the entire set of objects to be analyzed. 2. Use MDS techniques to estimate the relative position of each object in multidimensional space. 3. Identify and interpret the axes of the dimensional space in terms of perceptual and/or objective attributes. Assume that objects A and B are judged by respondents to be the most similar compared with all other possible pairs of objects (AC, BC, AD, and so on). MDS techniques will position objects A and B so that the distance between them in multidimensional space is smaller than the distance between any other pair of objects. The resulting perceptual map, also known as a spatial map, shows the relative positioning of all objects, as shown in Figure 1.
  • 20. A SIMPLIFIED LOOK AT HOW MDS WORKS To facilitate a better understanding of the basic procedures in multidimensional scaling, we first present a simple example to illustrate the basic concepts underlying MDS and the procedure by which it transforms similarity judgments into the corresponding spatial positions. We will follow the three basic steps described previously. Gathering Similarity Judgments The first step is to obtain similarity judgments from one or more respondents. Here we will ask respondents for a single measure of similarity for each pair of objects. Market researchers are interested in understanding consumers' perceptions of six candy bars that are currently on the market. Instead of trying to gather information about consumers' evaluations of the candy bars on a number of attributes, the researchers gather only perceptions of overall similarities or dissimilarities. The data are typically gathered by having respondents give simple global responses to statements such as these: • Rate the similarity of products A and B on a 10-point scale. • Product A is more similar to B than to C. • I like product A better than product B. Creating a Perceptual Map From these simple responses, a perceptual map can be drawn that best portrays the overall pattern of similarities among the six candy bars. We illustrate the process of creating a perceptual map with the data from a single respondent, although this process could also be applied to multiple respondents or to the aggregate responses of a group of consumers. The data are gathered by first creating the set of 15 unique pairs of the six candy bars (6 × 5 ÷ 2 = 15 pairs). After tasting the candy bars, the respondent is asked to rank the 15 candy bar pairs, where a rank of 1 is assigned to the pair of candy bars that is most similar and a rank of 15 indicates the pair that is least alike. The results (rank orders) for all pairs of candy bars for one respondent are shown in Table 1. This respondent thought that candy bars D and E were the most similar, candy bars A and B were the next most similar, and so forth until candy bars E and F were the least similar. If we want to illustrate the similarity among candy bars graphically, a first attempt would be to draw a single similarity scale and fit all the candy bars to it. In this one-dimensional portrayal of similarity, distance represents similarity. Thus, objects closer together on the scale are more similar and those farther apart less similar. The objective is to position the candy bars on the scale so that the rank orders are best represented (a rank order of 1 is closest, a rank order of 2 is next closest, and so forth).
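The count of 15 pairs can be verified directly; the short snippet below simply enumerates the unique pairs of six labeled objects.

```python
# Six objects yield 6 * 5 / 2 = 15 unique pairs.
from itertools import combinations

bars = ["A", "B", "C", "D", "E", "F"]
pairs = list(combinations(bars, 2))
print(len(pairs))   # 15
print(pairs[:3])    # [('A', 'B'), ('A', 'C'), ('A', 'D')]
```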
  • 21. Let us try to see how we would place some of the objects. Positioning two or three candy bars is fairly simple. The first real test comes with four objects. We choose candy bars A, B, C, and D. Table 1 shows that the rank order of the pairs is as follows: AB < AD < BD < CD < BC < AC (each pair of letters refers to the distance [similarity] between the pair). From these values, we must place the four candy bars on a single scale so that the most similar (AB) are the closest and the least similar (AC) are the farthest apart. Figure 2a contains a one-dimensional perceptual map that matches the ordering of the pairs. If the person judging the similarity between the candy bars had been thinking of a simple rule of similarity that involved only one attribute (dimension), such as amount of chocolate, then all the pairs could be placed on a single scale that reproduces the similarity values. Although a one-dimensional map can accommodate four objects, the task becomes increasingly difficult as the number of objects increases. The interested reader is encouraged to attempt this task with six objects. When a single dimension is employed with the six objects, the resulting ordering varies substantially from the respondent's original rank orders. Because one-dimensional scaling does not fit the data well, a two-dimensional solution should be attempted. It allows another scale (dimension) to be used in configuring the six candy bars. The procedure is quite tedious to attempt by hand. The two-dimensional solution produced by an MDS program is shown in Figure 2b. This configuration matches the rank orders of Table 1 exactly, supporting the notion that the respondent most probably used two dimensions in evaluating the candy bars. The conjecture that at least two attributes (dimensions) were considered is based on the inability to represent the respondent's perceptions in one dimension. However, we still do not know what attributes the respondent used in this evaluation.
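A two-dimensional configuration of this kind can be recovered with nonmetric MDS. The sketch below uses scikit-learn's MDS with a precomputed dissimilarity matrix; the 6-by-6 matrix of ranks is hypothetical (it is not the Table 1 data), so the resulting map is only illustrative.

```python
# Minimal nonmetric MDS sketch on a hypothetical 6x6 matrix of dissimilarity ranks
# (1 = most similar pair, 15 = least similar pair).
import numpy as np
from sklearn.manifold import MDS

ranks = np.array([
    [ 0,  2,  9,  5,  7, 13],
    [ 2,  0, 11,  6,  8, 14],
    [ 9, 11,  0,  4, 10, 12],
    [ 5,  6,  4,  0,  1,  3],
    [ 7,  8, 10,  1,  0, 15],
    [13, 14, 12,  3, 15,  0],
], dtype=float)

mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(ranks)     # coordinates of the six objects on the perceptual map
print(np.round(coords, 2))
print(round(mds.stress_, 3))          # stress: how poorly the map reproduces the ranks
```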
  • 22. COMPARING MDS TO OTHER INTERDEPENDENCE TECHNIQUES Multidimensional scaling can be compared to other interdependence techniques such as factor and cluster analysis based on its approach to defining structure: • Factor analysis: Defines structure by grouping variables into variates that represent underlying dimensions in the original set of variables. Variables that correlate highly are grouped together. • Cluster analysis: Defines structure by grouping objects according to their profile on a set of variables (the cluster variate), in which objects in close proximity to each other are grouped together. MDS differs from factor and cluster analysis in two key aspects: (1) a solution can be obtained for each individual, and (2) it does not use a variate. OBJECTIVES OF MDS Perceptual mapping, and MDS in particular, is most appropriate for achieving two objectives: 1. An exploratory technique to identify unrecognized dimensions affecting behavior 2. A means of obtaining comparative evaluations of objects when the specific bases of comparison are unknown or undefined In MDS, it is not necessary to specify the attributes of comparison for the respondent. All that is required is to specify the objects and make sure that the objects share a common basis of comparison. This flexibility makes MDS particularly suited to image and positioning studies in which the dimensions of evaluation may be too global or too emotional and affective to be measured by conventional scales. MDS methods combine the positioning of objects and subjects in a single overall map, making the relative positions of objects and consumers much more direct for segmentation analysis.
  • 23. WHAT IS PERCEPTUAL MAPPING? Perceptual mapping is a set of techniques that attempt to identify the perceived relative image of a set of objects (firms, products, ideas, or other items associated with commonly held perceptions). The objective of any perceptual mapping approach is to use consumer judgments of similarity or preference to represent the objects (e.g., stores or brands) in a multidimensional space. The three basic elements of any perceptual mapping process are defining objects, defining a measure of similarity, and establishing the dimensions of comparison. Objects can be any entity that can be evaluated by respondents. Although we normally think of tangible objects (e.g., products, people), an object can also be something quite intangible (e.g., political philosophies, cultural beliefs). The second aspect of perceptual mapping is the concept of similarity—a relative judgment of one object versus another. A key characteristic of similarity is that it can be defined differently by each respondent. All that is required is that the respondent can make a comparison between the objects and form some perception of similarity. With the similarity judgments in hand, the perceptual mapping technique then forms dimensions of comparison. Dimensions are unobserved characteristics that allow the objects to be arrayed in a multidimensional space (perceptual map) that replicates the respondent's similarity judgments. Dimensions resemble the factors formed in factor analysis in that they combine a number of specific characteristics, but they differ in that the underlying characteristics that "define" the dimensions are unknown. Readers are encouraged to also review multidimensional scaling to gain a broader perspective on the overall perceptual mapping process and specific MDS techniques. Although there are quite specific differences between correspondence analysis and multidimensional scaling, they share a number of common objectives and approaches to perceptual mapping. PERCEPTUAL MAPPING WITH CORRESPONDENCE ANALYSIS Correspondence analysis (CA) is an increasingly popular interdependence technique for dimensional reduction and perceptual mapping [1, 2, 3, 5, 8]. It is also known as optimal scaling or scoring, reciprocal averaging, or homogeneity analysis. When compared to MDS techniques, correspondence analysis has three distinguishing characteristics. First, it is a compositional method rather than a decompositional approach, because the perceptual map is based on the association between objects and a set of descriptive characteristics or attributes specified by the researcher. A decompositional method, like MDS techniques, requires that the respondent provide a direct measure of similarity. Second, most applications of CA involve the correspondence of categories of variables, particularly those measured on nominal scales. This correspondence is then the basis for developing perceptual maps. Finally, the unique benefits of CA lie in its ability to simultaneously represent rows and columns, for example, brands and attributes, in joint space. Differences from Other Multivariate Techniques Among the compositional techniques, factor analysis is the most similar because it defines composite dimensions (factors) of the variables (e.g., attributes) and then plots objects (e.g., products) on their scores on each dimension. In discriminant analysis, products can be distinguished by their profiles across a set of variables and plotted in a dimensional space as well.
Correspondence analysis extends beyond either of these two compositional techniques: • CA can be used with nominal data (e.g., frequency counts of preference for objects across a set of attributes) rather than metric ratings of each object on each attribute. This capability enables CA to be used in many situations in which the more traditional multivariate techniques are inappropriate. • CA creates perceptual maps in a single step, where variables and objects are simultaneously plotted in the perceptual map based directly on the association of variables and objects. The relationships between objects and variables are the explicit objective of CA. We first examine a simple example of CA to gain some perspective on its basic principles. Then we discuss CA in terms of a six-stage decision-making process. The emphasis is on the unique elements of CA as compared to the decompositional methods of MDS. Creating the Perceptual Map The similarity (signed chi-square) values provide a standardized measure of association, much like the similarity judgments used in MDS methods. With this association/similarity measure, CA creates a perceptual map by using the standardized measure to estimate orthogonal dimensions upon which the categories can be placed to best account for the strength of association represented by the chi-square distances.
  • 24. As is done in MDS techniques, we first consider a lower-dimensional solution (e.g., one or two dimensions) and then expand the number of dimensions until we reach the maximum number of dimensions. In CA, the maximum number of dimensions is one less than the smaller of the number of rows or the number of columns. Looking back at the signed chi-square values in Table 2, which act as a similarity measure, the perceptual map should place certain combinations with positive values closer together on the map (e.g., young adults close to product B, middle-age close to product A, and mature individuals close to product C). Moreover, certain combinations should be farther apart given their negative values (young adults farther from product C, middle-age farther from product B, and mature individuals farther from product A). In our example, we can have only two dimensions (the smaller of the number of rows or columns minus one, or 3 - 1 = 2), so the two-dimensional perceptual map is as shown in Figure 1. Corresponding to the similarity values just described, the positions of the age groups and products represent the positive and negative associations between age group and product very well. The researcher can examine the perceptual map to understand the product preferences among age groups based on their sales patterns. In contrast to MDS, however, we can relate age groups to product positions. We now have an additional tool that can relate different characteristics (in this case, products and the age characteristics of their buyers) in a single perceptual map. Just as easily, we could have related products to attributes, benefits, or any other characteristic. We should note, however, that in CA the positions of objects depend on the characteristics they are associated with.
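Because CA reduces to a singular value decomposition of the standardized residuals of a contingency table, a small perceptual map can be computed with numpy alone. The product-by-age-group counts below are invented for illustration and are not the Table 2 data.

```python
# Minimal correspondence analysis sketch via SVD of standardized residuals.
# The contingency table of product sales by age group is hypothetical.
import numpy as np

counts = np.array([
    [20, 60, 20],   # product A bought by young / middle-age / mature customers
    [70, 20, 10],   # product B
    [10, 30, 80],   # product C
], dtype=float)

P = counts / counts.sum()                 # correspondence matrix
r = P.sum(axis=1)                         # row masses
c = P.sum(axis=0)                         # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates of the products
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal coordinates of the age groups
print(np.round(row_coords[:, :2], 2))           # at most min(rows, cols) - 1 = 2 dimensions
print(np.round(col_coords[:, :2], 2))
```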
  • 25. STAGE 1: OBJECTIVES OF CA Researchers are constantly faced with the need to quantify the qualitative data found in nominal variables. CA differs from other MDS techniques in its ability to accommodate both nonmetric data and nonlinear relationships. It performs dimensional reduction similar to multidimensional scaling and a type of perceptual mapping in which categories are represented in the multidimensional space. Proximity indicates the level of association among row or column categories. CA can address either of two basic objectives: 1. Association among only row or column categories. CA can be used to examine the association among the categories of just a row or column. A typical use is the examination of the categories of a scale, such as the Likert scale (five categories from "strongly agree" to "strongly disagree") or other qualitative scales (e.g., excellent, good, poor, bad). The categories can be compared to see whether two can be combined (i.e., they are in close proximity on the map) or whether they provide discrimination (i.e., they are located separately in the perceptual space). 2. Association between both row and column categories. In this application, interest lies in portraying the association between the categories of the rows and columns, such as our example of product sales by age group. This use is most similar to the previous example of MDS and has propelled CA into more widespread use across many research areas. The researcher must determine the specific objectives of the analysis because certain decisions are based on which type of objective is chosen. CA provides a multivariate representation of interdependence for nonmetric data that is not possible with other methods. With a compositional method, the researcher must be aware that the results are based on the descriptive characteristics (e.g., object attributes or respondent characteristics) of the objects included in the analysis. This contrasts with decompositional MDS procedures, in which only an overall measure of similarity between objects is needed. Defining a Model A model is a representation of a theory. Theory can be thought of as a systematic set of relationships providing a consistent and comprehensive explanation of phenomena. From this definition, we see that theory is not the exclusive domain of academia; it can be rooted in experience and practice obtained by observation of real-world behavior. A conventional model in SEM terminology really consists of two models: the measurement model (representing how measured variables come together to represent constructs) and the structural model (showing how constructs are associated with each other). IMPORTANCE OF THEORY A model should not be developed without some underlying theory. Theory is often a primary objective of academic research, but practitioners may develop or propose a set of relationships that are as complex and interrelated as any academically based theory. Thus, researchers from both academia and industry can benefit from the unique analytical tools provided by SEM. We will discuss in a later section specific issues in establishing a theoretical base for your SEM model, particularly as it relates to establishing causality. In all instances, SEM analyses should be dictated first and foremost by a strong theoretical base. COVARIATION Because causality means that a change in a cause brings about a corresponding change in an effect, systematic covariance (correlation) between the cause and effect is necessary, but not sufficient, to establish causality.
Just as is done in multiple regression by estimating the statistical significance of the coefficients of independent variables that affect the dependent variable, SEM can determine systematic and statistically significant covariation between constructs. Thus, statistically significant estimated paths in the structural model (i.e., relationships between constructs) provide evidence that covariation is present. Structural relationships between constructs are typically the paths for which causal inferences are most often hypothesized.
  • 26. PATH ANALYSIS IDENTIFYING PATHS The first step is to identify all relationships that connect any two constructs. Path analysis enables us to decompose the simple (bivariate) correlation between any two variables into the sum of the compound paths connecting those points. The number and types of compound paths between any two variables are strictly a function of the model proposed by the researcher. A compound path is a path along the arrows of a path diagram that follows three rules: 1. After going forward on an arrow, the path cannot go backward again; but the path can go backward as many times as necessary before going forward. 2. The path cannot go through the same variable more than once. 3. The path can include only one curved arrow (correlated variable pair). When applying these rules, each path or arrow represents a path. If only one arrow links two constructs (path analysis can also be conducted with variables), then the relationship between those two is equal to the parameter estimate between them. For now, this relationship can be called a direct relationship. If there are multiple arrows linking one construct to another, as in X → Y → Z, then the effect of X on Z is equal to the product of the parameter estimates for each arrow and is termed an indirect relationship. This concept may seem quite complicated, but an example makes it easy to follow: Figure A-1 portrays a simple model with two exogenous constructs (X1 and X2) causally related to the endogenous construct (Y1). The correlational path A is X1 correlated with X2, path B is the effect of X1 predicting Y1, and path C shows the effect of X2 predicting Y1. The value for Y1 can be stated with a simple regression-like equation:

Y1 = b1X1 + b2X2

We can now identify the direct and indirect paths in our model. For ease in referring to the paths, the causal paths are labeled A, B, and C.

Direct paths: A = X1 to X2, B = X1 to Y1, C = X2 to Y1
Indirect (compound) paths: AC = X1 to Y1, AB = X2 to Y1

ESTIMATING THE RELATIONSHIP With the direct and indirect paths now defined, we can represent the correlation between each pair of constructs as the sum of the direct and indirect paths. The three unique correlations among the constructs can be shown to be composed of direct and indirect paths as follows: First, the correlation of X1 and X2 is simply equal to A. The correlation of X1 and Y1 (CorrX1,Y1) can be represented as two paths: B and AC. The symbol B represents the direct path from X1 to Y1, and the other path (a compound path) follows the curved arrow from X1 to X2 and then to Y1. Likewise, the correlation of X2 and Y1 can be shown to be composed of two paths: C and AB. Once all the correlations are defined in terms of
  • 27. paths, the values of the observed correlations can be substituted and the equations solved for each separate path. The paths then represent either the causal relationships between constructs (similar to a regression coefficient) or correlational estimates. Using the correlations shown in Figure A-1, we can solve the equations for each correlation:

CorrX1X2 = A
CorrX1Y1 = B + AC
CorrX2Y1 = C + AB
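Given observed correlations, these three equations can be solved for the path estimates. The sketch below does this symbolically with sympy; the three correlation values are assumed for illustration and do not come from the text.

```python
# Minimal path analysis sketch: solve CorrX1X2 = A, CorrX1Y1 = B + AC, CorrX2Y1 = C + AB.
# The observed correlations below are hypothetical.
import sympy as sp

A, B, C = sp.symbols("A B C")
corr_x1x2, corr_x1y1, corr_x2y1 = 0.50, 0.60, 0.70

solution = sp.solve(
    [sp.Eq(A, corr_x1x2),
     sp.Eq(B + A * C, corr_x1y1),
     sp.Eq(C + A * B, corr_x2y1)],
    [A, B, C],
    dict=True,
)
print(solution)   # estimated correlational path A and causal paths B and C
```

With these assumed correlations, the solution is A = 0.50, B ≈ 0.33, and C ≈ 0.53.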
  • 28. WHAT IS STRUCTURAL EQUATION MODELING? Structural equation modeling (SEM) is a family of statistical models that seek to explain the relationships among multiple variables. In doing so, it examines the structure of interrelationships expressed in a series of equations, similar to a series of multiple regression equations. These equations depict all of the relationships among constructs (the dependent and independent variables) involved in the analysis. Constructs are unobservable or latent factors represented by multiple variables (much like variables representing a factor in factor analysis). Thus far, each multivariate technique has been classified as either an interdependence or a dependence technique. SEM can be thought of as a unique combination of both types because SEM's foundation lies in two familiar multivariate techniques: factor analysis and multiple regression analysis. SEM is known by many names: covariance structure analysis, latent variable analysis, and sometimes it is even referred to by the name of the specialized software package used (e.g., a LISREL or AMOS model). Although SEM models can be tested in different ways, all structural equation models are distinguished by three characteristics: 1. Estimation of multiple and interrelated dependence relationships 2. An ability to represent unobserved concepts in these relationships and account for measurement error in the estimation process 3. Defining a model to explain the entire set of relationships WHAT IS A STRUCTURAL MODEL? The goal of measurement theory is to produce ways of measuring concepts in a reliable and valid manner. Measurement theories are tested by how well the indicator variables of the theoretical constructs relate to one another. The relationships between the indicators are captured in a covariance matrix. CFA tests a measurement theory by providing evidence on the validity of individual measures based on the model's overall fit and other evidence of construct validity. CFA alone is limited in its ability to examine the nature of relationships between constructs beyond simple correlations. Thus, a measurement theory is often a means to the end of examining relationships between constructs, not an end in itself. A structural theory is a conceptual representation of the structural relationships between constructs. It can be expressed in terms of a structural model that represents the theory with a set of structural equations and is usually depicted with a visual diagram. The structural relationship between any two constructs is represented empirically by the structural parameter estimate, also known as a path estimate. Structural models are referred to by several terms, including theoretical model and, occasionally, causal model. A causal model implies that the relationships meet the conditions necessary for causation. The researcher should be careful not to depict the model as supporting causal inferences unless all of those conditions are met. Differences Between MANOVA and Discriminant Analysis We noted earlier that in statistical testing MANOVA employs a discriminant function, which is the variate of dependent variables that maximizes the difference between groups. The question may arise: What is the difference between MANOVA and discriminant analysis? In some respects, MANOVA and discriminant analysis are mirror images.
The dependent variables in MANOVA (a set of metric variables) are the independent variables in discriminant analysis, and the single nonmetric dependent variable of discriminant analysis becomes an independent variable in MANOVA. Moreover, both use the same methods in forming the variates and assessing statistical significance between groups. OBJECTIVES OF MANOVA The selection of MANOVA is based on the desire to analyze a dependence relationship represented as the differences in a set of dependent measures across a series of groups formed by one or more categorical independent measures. As such, MANOVA represents a powerful analytical tool suitable for a wide array of research questions. Whether used in actual or quasi-experimental situations (i.e., field settings or survey research for which the independent measures are categorical), MANOVA can provide insights into not only the nature and predictive power of the independent measures but also the interrelationships and differences seen in the set of dependent measures.
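A minimal MANOVA sketch with statsmodels is shown below: two metric dependent measures are compared across a two-level categorical factor. The group labels, measure names, and data are all simulated assumptions for illustration.

```python
# Minimal MANOVA sketch: two dependent measures across one categorical factor.
# All data are simulated; variable names are illustrative only.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(5)
half = 30
df = pd.DataFrame({
    "group": np.repeat(["control", "treatment"], half),
    "satisfaction": np.concatenate([rng.normal(5, 1, half), rng.normal(6, 1, half)]),
    "loyalty": np.concatenate([rng.normal(4, 1, half), rng.normal(5, 1, half)]),
})

mv = MANOVA.from_formula("satisfaction + loyalty ~ group", data=df)
print(mv.mv_test())   # Wilks' lambda, Pillai's trace, etc., for the group effect
```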