Probabilistic Latent Factor Induction and
Statistical Factor Analysis

A Comparison of Methods



Stefan Conrady, stefan.conrady@conradyscience.com

Dr. Lionel Jouffe, jouffe@bayesia.com

April 7, 2011




Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting




Table of Contents

Introduction
       About the Authors
         Stefan Conrady
         Lionel Jouffe

  Key Concepts from Information Theory
       Entropy
       Chain Rule Theorem
       Conditional Entropy
       Mutual Information
       Relative Entropy (Kullback-Leibler Divergence)
         Example 1
         Example 2

Comparison of Methods
       Approach
       Notation
       Key Terminology
       Data Set

  Probabilistic Latent Factor Induction with BayesiaLab
       Data Import
       Variable Clustering
       Latent Factor Induction

  Statistical Factor Analysis

  Factor Analysis with STATISTICA

  Conclusion

  References

  Contact Information
         Conrady Applied Science, LLC
         Bayesia SAS

  Copyright





Introduction

Bayesian networks have been gaining prominence among scientists over the past decade, and the new insights generated by this powerful research approach can now be found in studies that circulate well beyond the academic communities. As a result, many practitioners and managerial decision-makers see more and more references to Bayesian networks in all kinds of scientific and business research, ranging from biostatistics to marketing analytics.

It is not surprising that the new Bayesian network paradigm prompts comparisons to more conventional methods. In the field of market research, for instance, long-established methods, such as factor analysis, remain in daily use today. Given that there exists a direct counterpart to factor analysis in the Bayesian network framework, we want to highlight similarities as well as fundamental differences. The objective of this paper is to present both methods side-by-side and thus help researchers to correctly compare and interpret the respective results. More specifically, we want to establish the semantic equivalents between the traditional statistical factor analysis approach and BayesiaLab’s method based on Bayesian networks, which we refer to as Probabilistic Latent Factor Induction.

Factor Analysis is a statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors. It is possible, for example, that variations in three or four observed variables mainly reflect the variations in a single unobserved variable, or in a reduced number of unobserved variables. The observed variables can be seen as manifestations of abstract underlying (and unobserved) dimensions or (latent) factors.

Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other applied sciences that deal with a large number of variables in their data.

Probabilistic Latent Factor Induction is a workflow within the BayesiaLab software package, which has the same objective as traditional factor analysis, i.e. variable reduction, but works entirely within the framework of Bayesian networks and is based on principles derived from information theory.

It is important to point out that this comparison is not meant to favor one approach over the other (and to declare a winner and a loser), although it is clearly in the authors’ interest to promote Bayesian networks in general and BayesiaLab in particular. Rather, this paper should serve as a reference for research practitioners and those who use research results in their decision-making processes, so they can correctly interpret insights generated with either approach.







About the Authors

Stefan Conrady
Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America.

Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to establishing his own firm, he headed the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe
Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.








Key Concepts from Information Theory
Before we proceed to the direct comparison of methods, it is important to establish several key concepts relating to how knowledge is represented in Bayesian networks.


Entropy
The concept of entropy provides the underpinning for all structural learning and analysis algorithms in BayesiaLab.
Entropy measures the uncertainty inherent in the distribution of a random variable.

The entropy H(X) of a random variable X is defined as:

H(X) = -\sum_{x \in X} p(x) \log_2 p(x),



where x stands for the states that variable X can take. Note that the log is taken to base 2 and the value of entropy is expressed in bits.

An example can perhaps illustrate this: If variable X represents the outcome of a coin toss, X can have one of two states, Heads and Tails, i.e. the set of potential outcomes is X = {Heads, Tails}. Given that the coin toss is fair, the probabilities of Heads and Tails will both be 0.5, i.e. p(Heads) = 0.5 and p(Tails) = 0.5.

We can now compute the entropy H(X_fair), based on these values:

H(X_{fair}) = -p(Heads) \log_2 p(Heads) - p(Tails) \log_2 p(Tails) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \text{ bit}

This means our uncertainty prior to a fair coin toss is equivalent to an entropy value of 1 bit, which is the maximum entropy for a two-state variable and is attained under the uniform distribution.

If we had a biased coin instead, with p(Heads) = 0.7 and p(Tails) = 0.3, it is intuitive that the uncertainty would be lower, as one state of the coin toss is more probable, and, indeed, computing the entropy H(X_biased) yields a lower value.


H(X_{biased}) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 \approx 0.881 \text{ bits}
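To make the arithmetic concrete, the two entropy values above can be reproduced with a few lines of Python (a minimal sketch; the function name is ours and is not part of BayesiaLab):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # 0 * log2(0) is taken to be 0, so drop zero-probability states
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(entropy([0.7, 0.3]))   # biased coin -> approx. 0.881 bits
```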

To complete this idea, we can also plot H(X) as a function of the bias, p(Heads) = 1 - p(Tails), with p(Heads) ∈ [0, 1], i.e. ranging from impossible, p(Heads) = 0, to certain, p(Heads) = 1.






Entropy H(X) as a Function of p(Heads)


Clearly, anything other than a perfectly fair coin reduces the entropy and thus our uncertainty regarding the outcome of
the coin toss.


Chain Rule Theorem
The chain rule for joint entropy states that the total uncertainty about the value of X and Y is equal to the uncertainty
about X plus the (average) uncertainty about Y once you know X.


H(X,Y) = H(X) + H(Y|X)

The proof of this theorem follows:


H(X,Y) = -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(x,y)

= -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 \left[ p(y|x)\, p(x) \right]

= -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(y|x) - \sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(x)

= -\sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 p(y|x) - \sum_{x \in X} p(x) \log_2 p(x)

= H(Y|X) + H(X)



Conditional Entropy
Perhaps the single most important concept for computations in BayesiaLab is conditional entropy. Conditional entropy
refers to the entropy of a random variable when we have information on another variable.

The conditional entropy H(Y|X) is defined as







H(Y|X) = \sum_{x \in X} p(x)\, H(Y|X=x)

= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)

= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(y|x)




The conditional entropy of Y given X is thus the entropy of Y given a value of X, averaged over the distribution of X.


Mutual Information
The mutual information I(X,Y) measures how much (on average) the observation of random variable Y tells us about the uncertainty of X, i.e. by how much the entropy of X is reduced if we have information on Y.

I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Note that the mutual information is a symmetric metric, which reflects the uncertainty reduction of X by knowing Y as well as of Y by knowing X.
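The chain rule, conditional entropy, and mutual information can all be checked numerically on a small joint distribution. The sketch below uses an arbitrary 2x2 joint table chosen purely for illustration (it is not taken from the Perfume Study data):

```python
import numpy as np

# Arbitrary joint distribution p(x, y) for two binary variables (rows: x, columns: y).
pxy = np.array([[0.30, 0.20],
                [0.10, 0.40]])

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

px, py = pxy.sum(axis=1), pxy.sum(axis=0)      # marginal distributions
H_X, H_Y, H_XY = H(px), H(py), H(pxy.ravel())
H_Y_given_X = H_XY - H_X                       # chain rule: H(X,Y) = H(X) + H(Y|X)
I_XY = H_Y - H_Y_given_X                       # mutual information I(X,Y) = H(Y) - H(Y|X)

print(round(H_XY, 4), round(H_X + H_Y_given_X, 4))   # identical, by the chain rule
print(round(I_XY, 4), round(H_X + H_Y - H_XY, 4))    # two equivalent ways to obtain I(X,Y)
```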


Relative Entropy (Kullback-Leibler Divergence)
A closely related concept is the relative entropy, also referred to as the Kullback-Leibler Divergence (D_KL) or sometimes cross entropy. The Kullback-Leibler Divergence is a measure of the difference between two probability distributions p and q.

For probability distributions p and q of a discrete random variable X, their K-L divergence is defined to be

D_{KL}(p(X) \,\|\, q(X)) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}

In words, it is the expected value, taken under p, of the logarithmic difference between the probability distributions p(X) and q(X). In contrast to the mutual information, the relative entropy is non-symmetric.

Example 1
We once again use tossing coins as an example. By default, we would expect that any given coin is fair and assume a model q(Heads) = q(Tails) = 0.5. As it turns out, in repeated coin tosses, we observe a probability of p(Heads) = 0.75 and of p(Tails) = 0.25. We can now use the Kullback-Leibler Divergence to establish the “distance” or “distortion” between the originally assumed distribution q(x) and the observed distribution p(x).


D_{KL}(p(X) \,\|\, q(X)) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}

= p(Heads) \log_2 \frac{p(Heads)}{q(Heads)} + p(Tails) \log_2 \frac{p(Tails)}{q(Tails)} = 0.75 \log_2 \frac{0.75}{0.5} + 0.25 \log_2 \frac{0.25}{0.5}

= 0.188722 \text{ bits}
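The same computation, written as a small Python sketch (the function is ours, not a BayesiaLab routine); reversing the arguments also illustrates that the divergence is not symmetric:

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                     # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.75, 0.25]   # observed coin
q = [0.50, 0.50]   # assumed fair coin
print(kl_bits(p, q))   # approx. 0.1887 bits, as computed above
print(kl_bits(q, p))   # approx. 0.2075 bits -- D_KL is non-symmetric
```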






Example 2
For another illustration we use an example from the field of meteorology. More specifically, we look at the rainfall in two cities in the state of Victoria, Australia. We used daily rainfall data measured at Geelong Airport and at Melbourne Tullamarine Airport, which are approximately 80 kilometers apart, over the entire calendar year of 2010. Given the proximity of those locations, one would generally expect similar weather. Perhaps the Geelong weather isn’t reported in the Melbourne newspapers, and so a traveler wants to use the Melbourne weather as a proxy. However, the actual weather station observations tell us that there is rain in Melbourne on 40.3% of the days, whereas Geelong sees rainfall on 47.4% of the days in the year.

We can now compute the Kullback-Leibler Divergence for these two distributions, where p_Geelong(x) stands for the Geelong and p_Melbourne(x) for the Melbourne rain probability distribution.



D_{KL}(p_{Geelong}(X) \,\|\, p_{Melbourne}(X)) = \sum_{x \in X} p_{Geelong}(x) \log_2 \frac{p_{Geelong}(x)}{p_{Melbourne}(x)}

= p_{Geelong}(No\ Rain) \log_2 \frac{p_{Geelong}(No\ Rain)}{p_{Melbourne}(No\ Rain)} + p_{Geelong}(Rain) \log_2 \frac{p_{Geelong}(Rain)}{p_{Melbourne}(Rain)}

= 0.526 \log_2 \frac{0.526}{0.597} + 0.474 \log_2 \frac{0.474}{0.403} = 0.0148958 \text{ bits}

D_{KL}(p_{Melbourne}(X) \,\|\, p_{Geelong}(X)) = \sum_{x \in X} p_{Melbourne}(x) \log_2 \frac{p_{Melbourne}(x)}{p_{Geelong}(x)}

= p_{Melbourne}(Rain) \log_2 \frac{p_{Melbourne}(Rain)}{p_{Geelong}(Rain)} + p_{Melbourne}(No\ Rain) \log_2 \frac{p_{Melbourne}(No\ Rain)}{p_{Geelong}(No\ Rain)}

= 0.403 \log_2 \frac{0.403}{0.474} + 0.597 \log_2 \frac{0.597}{0.526} = 0.0147077 \text{ bits}


BayesiaLab’s primary metric, the Arc Force, is directly proportional to the relative entropy and describes the strength of the directional link between two variables. More specifically, it describes the difference between the joint probability distributions with and without the particular arc.








Comparison of Methods

Approach
We believe that we can best facilitate a comparison of statistical factor analysis and latent factor induction by working through an example. We draw upon the familiar data set from the previously presented case study from the perfume industry, hereafter referred to as the “Perfume Study.” 1

We begin our tutorial with the Data Import process for BayesiaLab, although it is not meant to be at the core of the
comparison. It is important though to spell out the data pre-processing steps in BayesiaLab, as they highlight some of
the fundamental differences between probabilistic and statistical approaches.

Once the data preparation is complete, we first present the probabilistic latent factor induction workflow with BayesiaLab and then provide an example of a statistical factor analysis. For the statistical factor analysis, we will use STATISTICA 10 as the software platform, although most steps are fairly generic and could be reproduced with a number of other statistical software packages as well.


Notation
To clearly distinguish between natural language, software-specific functions and study-specific variable names, the following notation is used:

• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.

• Names of attributes, variables, nodes and factors are italicized.

• At appropriate points in the text, grey boxes highlight parallels between the two presented methods:



      Probabilistic Latent Factor Induction                 Statistical Factor Analysis




Key Terminology
• “Observed” and “manifest” are used interchangeably and describe those random variables that have been measured by the researcher.

• The terms “latent” and “unobserved” are used interchangeably in the context of hidden concepts or factors, which cannot be measured directly, but can potentially be extracted or induced. In our context, the term factor stands exclusively for latent variables. Consequently, the terms “factor”, “factor variable”, “latent variable” and “unobserved variable” are equivalent.




1   Conrady and Jouffe (2010)





Data Set
The Perfume Study is based on a monadic consumer survey about a range of fragrances, which was conducted in
France. In this example we use survey responses from 1,321 women, who have evaluated a total of 11 fragrances on a
wide range of attributes:

• 27 ratings on fragrance-related attributes, such as “sweet”, “flowery”, “feminine”, etc., measured on a 1-to-10 scale.

• 12 ratings on projected imagery related to someone who would be wearing the respective fragrance, e.g. “is sexy”, “is modern”, measured on a 1-to-10 scale.

• 1 variable for Intensity, reflecting the perceived level of intensity, measured on a 1-to-5 scale.

• 1 variable for Purchase Intent, measured on a 1-to-6 scale.

• 1 nominal variable, Product, for product identification purposes.








Probabilistic Latent Factor Induction with BayesiaLab

Data Import
To start the process with BayesiaLab, we first import the data set, which is formatted as a CSV file.2 With Data > Open Data Source > Text File, we start the Data Import wizard, which immediately provides a preview of the data file.




The table displayed in the Data Import wizard shows the individual variables as columns and the survey responses as
rows. There are a number of options available, e.g. for sampling. However, this is not necessary in our example given
the relatively small size of the database.

Clicking the Next button prompts a data type analysis, which provides BayesiaLab’s best guess regarding the data type of each variable.

Furthermore, the Information box provides a brief summary regarding the number of records, the number of missing values, filtered states, etc.3




2   CSV stands for “comma-separated values”, a common format for text-based data files.
3   There are no missing values in our database and filtered states are not applicable in this survey.






For this example, we will need to override the default data type for the Product variable, as each value is a nominal product identifier rather than a numerical scale value. We can change the data type by highlighting the Product variable and clicking the Discrete check box, which changes the color of the Product column to red.




We will also define Purchase Intent and Intensity as discrete variables, as the default number of states of these variables is already adequate for our purposes.4

The next screen provides options as to how to treat any missing values. In our case, there are no missing values so the
corresponding panel is grayed-out.

Clicking the small upside-down triangle next to the variable names brings up a window with key statistics of the
selected variable, in this case Fresh.




4   The desired number of variable states is largely a function of the analyst’s judgment.






The next step is the Discretization and Aggregation dialogue, which allows the analyst to determine the type of discretization to be performed on all continuous variables.5 For this survey, and given the number of observations, it is appropriate to reduce the number of states from the original 10 (1 through 10) to a smaller number. One could, for instance, bin the 1-to-10 rating into low, mid and high, or apply any other method deemed appropriate by the analyst.




The screenshot shows the dialogue for the Manual selection of discretization steps, which permits selecting binning thresholds by point-and-click.




5   BayesiaLab requires discrete distributions for all variables.





 Note

 For choosing discretization algorithms beyond this example, the following rule of thumb may be helpful:

 • For supervised learning, choose Decision Tree.

 • For unsupervised learning, choose, in the order of priority, K-Means, Equal Distances or Equal Frequencies.




For this particular example, we select Equal Distances with 5 intervals for all continuous variables. This was the
analyst’s choice in order to be consistent with prior research.
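Outside of BayesiaLab, the effect of an Equal Distances discretization with 5 intervals can be approximated with generic tools. The sketch below uses pandas purely as an illustration; the file name and column name are placeholders, and this is not the BayesiaLab import wizard:

```python
import pandas as pd

df = pd.read_csv("perfume_survey.csv")          # placeholder file name

# Equal-width ("equal distances") binning of one 1-to-10 rating into 5 intervals.
df["Fresh_binned"] = pd.cut(df["Fresh"], bins=5)

print(df["Fresh_binned"].value_counts().sort_index())
```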




Clicking Select All Continuous followed by Finish completes the import process and the 49 variables (columns) from
our database are now shown as blue nodes in the Graph Panel, which is the main window for network editing. By
default, all variables are represented as nodes. This initial view represents a fully unconnected Bayesian network.








In the above graph, two variables play a fundamentally different role. The values of Product represent categories, and Purchase Intent is the overall target variable, i.e. the dependent variable of the Perfume Study. Thus both will be excluded from the factor generation process.

While correlation and covariance are the central measures for statistical factor analysis, learning Bayesian networks with BayesiaLab (and thus probabilistic factor induction) is based on measures from information theory, such as the Kullback-Leibler Divergence, which was introduced in the first chapter.

The Kullback-Leibler Divergence can be obtained after learning an initial Bayesian network with one of BayesiaLab’s unsupervised learning algorithms. “Unsupervised” implies that the learning algorithm searches for an overall representation of the joint distribution of the underlying data rather than the characterization of an individual target variable.

In our example, we use BayesiaLab’s EQ algorithm to obtain a Bayesian network.








As this view of the network is not easily readable, BayesiaLab has numerous built-in layout algorithms, of which the Force Directed Layout is perhaps the most commonly used. It can be invoked by View > Automatic Layout > Force Directed Layout or alternatively through the keyboard shortcut “p”.

The resulting network will look similar to the following screenshot.







                                     Completed Bayesian Network upon EQ Learning




With the network established, we can now further examine the probabilistic relationships between the nodes, which are represented as arcs.6 By selecting Analysis > Graphic > Arc Force, we can show the probabilistic strength of the arcs, which is visualized by the thickness of the arcs.




6   “Arcs” are directed links or edges between nodes, which appear as arrows in the graph.






                                                 Network with Arc Force




The numeric values of the Arc Force can be shown by selecting View > Display Arc Comments. In the network shown below, the Arc Force values are presented in yellow boxes attached to each arc.







                                                 Network with Arc Force




 Arc Force        Covariance
 In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure
 for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance
 matrix play the equivalent role.







Variable Clustering
With Arc Force established as the key measure across the entire network, BayesiaLab can determine clusters of variables which are “close” in a probabilistic sense. This can be initiated from the menu via Analysis > Graphic > Variable Clustering.




The clustering algorithm is iterative and starts with the two variables whose connecting arc has the strongest Arc Force. The following sequence of screenshots illustrates this algorithm conceptually in “slow motion,” as the analyst would not see these individual steps in the actual workflow.

As a starting point, every manifest variable is treated as a distinct cluster, and so we have 47 clusters. Using the Kullback-Leibler Divergence as a measure, the “closest” variables are then merged into one concept. As a result, we first obtain 46 clusters, then 45, etc., as shown in the array of dendrograms below. BayesiaLab proposes to conclude this algorithm upon finding 15 clusters. However, the analyst has the ability to override this automatic selection. As the choice of clusters appears to be generally compatible with our interpretation of the variable names, we accept this recommendation.







                                                 Sequence of Dendrograms (from 47 clusters down to 15)




Because of the importance of this process, we will also show it from another angle, i.e. by looking at sequential views of
the graph.







                                                   Step 0 - 47 Clusters




                                Step 1 - 46 Clusters: Pleasure merged with Corresponds




The strongest Arc Force exists between Pleasure and Corresponds, and BayesiaLab will form an interim concept from them. The next-highest Arc Force then determines whether another variable is merged with the first concept or whether a new concept is created. In our case, Radiant and In Love are combined as a new concept.






                                   Step 2 - 45 Clusters: Radiant merged with In Love




In the third step, we see Sensual and Romantic merged into a new latent concept, and so on.

                                  Step 3 - 44 Clusters: Sensual merged with Romantic




Upon completion of this process, BayesiaLab forms variable/node clusters from these common concepts and color-codes
them accordingly.
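BayesiaLab’s variable clustering operates on Arc Force, but the same bottom-up merging idea can be sketched with generic tools: compute a pairwise similarity between variables (here, mutual information as a stand-in for Arc Force), turn it into a distance, and run an agglomerative clustering. The sketch below is an analogy under our own assumptions, not BayesiaLab’s algorithm; file and column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

ratings = pd.read_csv("perfume_survey.csv").drop(columns=["Product", "Purchase Intent"])  # placeholders
binned = ratings.apply(lambda s: pd.cut(s, bins=5, labels=False))    # discretize as before

cols = binned.columns
n = len(cols)
mi = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        mi[i, j] = mi[j, i] = mutual_info_score(binned[cols[i]], binned[cols[j]])

dist = mi.max() - mi                    # crude conversion: high mutual information -> small distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist), method="average")      # bottom-up merging, closest pair first
clusters = fcluster(Z, t=15, criterion="maxclust")   # cut the tree at 15 clusters
print(dict(zip(cols, clusters)))
```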







                                      Network with Color-Coded Variable Clusters




By clicking the Validate Clustering button, we can now formally fix the new latent factor variables. The new latent factors are shown in the following table with their associated observed variables. By default, they are given the name “Factor” plus a numeric suffix.







Latent Factor Induction

Upon definition of the new latent factor variables, we now want to make them available for modeling purposes. Although these latent factors exist as new concepts and are conceptually linked to the manifest variables, the factors do not yet have any values or states.

This will now happen in the Multiple Clustering process, which creates discrete states for each latent factor variable by performing data clustering over the linked manifest variables.

More specifically, the states of each latent factor will be created in such a way that they best summarize the joint probability distribution defined by the manifest variables. Factor 0 and its linked manifest variables are shown below.

                                                                  Subnetwork for Factor 0







The following Monitors display the marginal probability distributions of the variables associated with Factor 1; highlighted in red, Factor 1 itself and its states are also shown. We can see that 5 states were created for Factor 1, labelled C1 through C5, and each has an expected value, which is shown in parentheses. For instance, state C2 has an expected value of 9.21. That means, given that C2 is observed, the mean value of the manifest variables, weighted by their relation with C2, is equal to 9.21. In other words, C2 corresponds to high ratings with regard to those 5 dimensions.
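The way Multiple Clustering derives factor states can be mimicked, very loosely, by clustering the respondents on the manifest variables linked to one factor and inspecting each cluster’s mean rating. The sketch below uses scikit-learn’s KMeans with 5 clusters purely as an analogy, not as BayesiaLab’s actual procedure; the file name and the list of manifest variables are assumptions made for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("perfume_survey.csv")                                        # placeholder file name
factor_vars = ["Pleasure", "Corresponds", "Radiant", "In Love", "Romantic"]   # assumed members of one factor

km = KMeans(n_clusters=5, n_init=10, random_state=0)
states = km.fit_predict(df[factor_vars])   # one discrete "state" per respondent

# Mean rating per state, analogous to the expected values shown in the Monitors (e.g. 9.21 for C2).
print(df[factor_vars].groupby(states).mean().mean(axis=1))
```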




By selecting specific states of Factor 0 in the Monitor Panel, we can see the conditional distributions of the manifest variables. The states C2 and C3 are displayed for reference below. They can be easily interpreted by looking at the associated values, e.g. state C2 appears to reflect high ratings of the manifest variables, whereas state C3 captures very low ratings.




A more general analysis of the relationships between manifest variables and latent factors can be obtained through Analysis > Reports > Relationship Analysis:




This chart summarizes the values of key clustering measures, such as the Kullback-Leibler Divergence, for every manifest variable associated with Factor 0. For reference only, it also includes Pearson’s Correlation Coefficient R.







 Relationship Analysis             Factor Loadings
 This summary of clustering measures in the Relationship Analysis allows an interpretation very similar to what is provided by factor loadings.



It is also possible to visualize the mean values of the manifest variables (x-axis) along with the Mutual Information (y-
axis, left panel) and the Standardized Total Effect (y-axis, right panel) for the latent factor variable.




Although we have now defined new factor variables, we have not yet seen the original matrix of survey responses in terms of the new factor variables. For instance, every respondent record has a value for Active, Fulfilled, Trust, etc., as these variables were observed and recorded in the survey, but how do we find the values (or states) of the new latent factors for each respondent record?

Actually, at the conclusion of the Multiple Clustering process, BayesiaLab has already introduced the new factors into the original network. By using BayesiaLab’s imputation process, which is based on maximum likelihood, they were added as new nodes to the graph and also saved as new columns (or fields) to the database.







                                         Latent Factors Introduced into Network




 Factor Induction           Saving Factor Scores
 Introducing the new latent factors into the network is equivalent to adding the factor scores to the original observation matrix.



We can easily verify that each new factor has a value for each respondent record. We start Inference > Interactive Inference, which allows us to scroll through the survey records and view the values of any variable, including the values of the new latent factors.








For instance, survey record #0 is expressed as state C4 in terms of Factor 0. The states of the manifest variables are
shown for reference.




Record #8, for example, is assigned to state C3:




Now we have the entire set of respondent records re-expressed in terms of 15 latent factors, which allows us to use
them for all kinds of modeling purposes.







Given the importance of latent factors for interpretation, we will assign descriptive labels to each of them. BayesiaLab can visually aid in this process by showing the latent factors and their relationships to the original manifest variables. This means we will simply learn a new network that includes both factor variables and manifest variables.







                                 Network including Latent Factors and Manifest Variables




The emerging network structure clearly lends itself to defining descriptive labels, which are applied to the factors in the following graph.7




7   See Conrady and Jouffe (2010) for a more detailed explanation of the interpretation process.






                       Network including Latent Factors and Manifest Variables plus Factor Labels




It is important to reiterate that the latent factors generated here are not orthogonal, which means that probabilistic relationships exist between the factors. For illustration purposes, we can highlight the latent factors and exclude the manifest variables from being displayed. In addition, the following graph also displays the Arc Forces between the latent factors, providing further confirmation that the latent factors are not independent.








                                      Network with Latent Factors and Arc Forces








Statistical Factor Analysis
Perhaps the most common approach for extracting factors from a set of observed variables is Principal Components Analysis (PCA), and it is frequently considered a synonym for factor analysis.8 For our purpose, we look at PCA as a prototypical tool for factor extraction, which lends itself to comparison with the latent factor induction with BayesiaLab presented earlier.

Principal Component Analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations, represented by matrix X, of possibly correlated variables into a set of values of uncorrelated variables called principal components, represented by a new matrix Y. The goal of this transformation is to minimize redundancy (measured by covariance) and to maximize the signal (measured by variance).

This transformation is defined in such a way that the first principal component has the highest possible variance, i.e. it accounts for as much of the variability in the data as possible. In turn, each succeeding component has the next-highest variance while being orthogonal to (uncorrelated with) the preceding components.



                                  Conceptual Illustration of Principal Component Vectors




More formally, PCA creates a re-expression of the original data set on the basis of a new set of orthonormal vectors,
replacing the original set of “naive” basis vectors, which resulted from the choice of measurements.9

In matrix notation, this can be expressed as follows:

PX = Y

8   There are differences between PCA and the more general concept of factor analysis, but explaining those goes beyond
the scope of this paper.
9   Any observed variable automatically establishes a basis vector. Measuring 47 variables would thus result in a 47-dimensional coordinate system.





with X being the matrix of original observations and P being a yet-to-be-determined orthonormal matrix that transforms X into Y. Interpreted geometrically, P is a rotation and a stretch that generates Y. The rows of P, {p_1, ..., p_m}, are the new set of basis vectors for expressing the columns of X. Writing out the explicit dot products may better illustrate this.


PX = \begin{pmatrix} p_1 \\ \vdots \\ p_m \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}

Y = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix}


This provides us with the general framework, but we have yet to determine what matrix P should be.

This is the point where we need to introduce the concept of the covariance matrix (C_X). It is defined as

C_X = \frac{1}{n-1} X X^T

• C_X is a square and symmetric m × m matrix.

• The elements on the diagonal of C_X represent the variance of the observed variables.

• The off-diagonal elements of C_X represent the covariance between observed variables.

As a result, C_X captures the correlations between all possible pairs of observed variables.

This obviously relates to our objective of minimizing redundancy (measured by covariance) and maximizing the signal (measured by variance) of the target matrix Y. Optimally achieving these goals would imply a diagonal covariance matrix of Y, i.e. with all off-diagonal elements being zero, and our objective thus translates into stipulating that C_Y must be diagonal. Fortunately, linear algebra provides several tools for diagonalizing a matrix.

More formally, the objective becomes finding some orthonormal matrix P where Y = PX such that C_Y is diagonalized. The rows of P are then the principal components.

Without providing further detail, the solution is as follows (a small numerical sketch follows the list below):

• The principal components of X are the eigenvectors of XX^T, i.e. the rows of P.

• The i-th diagonal value of C_Y is the variance of X along p_i.
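The recipe above can be illustrated numerically with NumPy. The sketch uses a synthetic random data matrix rather than the survey data, follows the document's variables-as-rows convention, and mean-centers each variable, which the covariance-based derivation implicitly assumes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(47, 1321))        # stand-in data: 47 variables (rows) x 1,321 observations (columns)

X = X - X.mean(axis=1, keepdims=True)  # center each variable (row)
C_X = (X @ X.T) / (X.shape[1] - 1)     # covariance matrix, as defined above

eigvals, eigvecs = np.linalg.eigh(C_X) # eigh, because C_X is symmetric
order = np.argsort(eigvals)[::-1]      # sort components by decreasing variance
eigvals = eigvals[order]
P = eigvecs[:, order].T                # rows of P are the principal components

Y = P @ X                              # re-expressed data: Y = PX
C_Y = (Y @ Y.T) / (Y.shape[1] - 1)     # diagonal up to rounding; its diagonal equals the eigenvalues
print(np.allclose(np.diag(C_Y), eigvals))      # True
print((eigvals / eigvals.sum())[:6].round(3))  # share of total variance per component
```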







Factor Analysis with STATISTICA
Upon loading the survey data into STATISTICA, the respondent records will be presented as a data table, with the variable names shown as column headers and case numbers shown as row headers.10 This represents our observation matrix X.

                                                    Observation Matrix X




As a starting point of the PCA process, we can display C_X, the covariance matrix of X:




10   We will skip a detailed description of the data import steps, as they are fairly generic, and we assume that readers may use any of a wide array of statistical programs.





                                                     Covariance Matrix




 Arc Force          Covariance
 In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure
 for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance
 matrix play the equivalent role.


As expected, there is a high amount of covariance, i.e. redundancy, between many of the observed variables. To get a
better sense of the magnitude of these pairwise relationships, it helps to display the correlation matrix for reference:







                                                   Correlation Matrix




STATISTICA, like many other statistical software packages, has built-in routines that can perform the computation of the matrix P of principal components automatically. There are several methods available for solving the PCA, including the approach using the eigenvectors of the covariance matrix, which was shown earlier.

Regardless of the computational method used, the solution of the PCA provides as many eigenvalues as there are observed variables. The sum of all eigenvalues equals the number of observed variables, in our case 47 (this holds when the analysis is based on the correlation matrix, i.e. on standardized variables). This allows us to determine the share of variance attributable to each factor. For instance, the first factor has an eigenvalue of 29.6, which means that it accounts for 29.6/47 = 62.98% of the variance. Proceeding down the list, the eigenvalues decline in value, and so does their contribution to the total variance.








                                                     List of Eigenvalues




Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain, as the overall objective of this exercise is variable reduction. The precise number of factors to be retained is ultimately an arbitrary decision of the analyst, but factors with eigenvalues greater than 1 are typically considered candidates. A scree plot11 is commonly used to illustrate the eigenvalues of the extracted factors. Sometimes this provides a visual indication of a natural cutoff point between higher and lower eigenvalues. Here such a distinction cannot be made easily, so we defer to the rule of thumb and retain eigenvalues greater than 1.




11   The name “scree plot” is a metaphorical expression, as “scree” is the term for the accumulation of broken rock at the
base of mountain cliffs. In the scree plot we want to distinguish the substantial eigenvalues from the “rubble” at the
bottom.






                                                       Scree Plot




In the next step we turn to the interpretation of the extracted factors. The table below shows the factor loadings, which
are the correlations of each observed variable with the extracted factors.



                                                    Factor Loadings







Given the high eigenvalue of factor 1, it is not surprising that many variables are highly correlated with it. In our particular case, however, this correlation is mostly negative, which may be counterintuitive for interpretation purposes.

It is common practice to rotate factors in order to aid in the interpretation process. Intuitively speaking, the rotation is typically chosen in such a way that the principal factor, i.e. factor 1, aligns with what is commonly understood as the “positive x-axis.”

Such a factor rotation, for which several methods exist, was also performed with STATISTICA, and the results appear in the table below. In addition, factor loadings higher than 0.7 are highlighted.
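STATISTICA offers several rotation schemes, varimax being the most common. As an illustration only, and not a reproduction of the STATISTICA output, scikit-learn’s FactorAnalysis (version 0.24 or later) fits a maximum-likelihood factor model and can apply a varimax rotation to the extracted factors; the file name and dropped columns below are placeholders, and the loadings will not match the PCA-based numbers exactly:

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis

df = pd.read_csv("perfume_survey.csv").drop(columns=["Product", "Purchase Intent"])  # placeholders

fa = FactorAnalysis(n_components=6, rotation="varimax", random_state=0)
scores = fa.fit_transform(df)              # factor scores: one row per respondent, one column per factor

loadings = pd.DataFrame(fa.components_.T, index=df.columns)    # variables x factors
print(loadings[(loadings.abs() > 0.7).any(axis=1)].round(2))   # variables with at least one high loading
```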



                                               Loadings on Rotated Factors




 Relationship Analysis            Factor Loadings
 The summary of clustering measures in BayesiaLab’s Relationship Analysis allows an interpretation which is very similar to what is provided with factor loadings.



The analyst can now use these factor loadings to assign meaningful names to each factor. Some are quite obvious in their characterization, such as factor 3, which could be called “pleasant”, or factor 4, which is quite obviously “classical.” It is also interesting to see that only one variable, i.e. Intensity, has a high loading on factor 2. This implies that






perhaps Intensity is a standalone concept, which has little redundancy. At the other extreme, many variables have high loadings on factor 1, which makes identifying a distinct concept more elusive.

Without completing this interpretation process, we turn to the “reduction” part by introducing the extracted factors as variables into the original data set, i.e. replacing 47 variables with 6 variables. This is often referred to as “saving factor scores,” with the factor scores being the coordinates of the original observations in the new coordinate system created by the extracted factors. Our observations now have coordinates in a 6-dimensional coordinate system rather than in one with 47 dimensions.

                                                        Factor Scores




 Latent Factor Induction              Saving Factor Scores
 Introducing the latent factors into the network is equivalent to adding the factor scores to the original observation
 matrix.



We now have the ability to create a wide range of models, for instance, modeling Purchase Intent as a function of the 6 new factors. This will undoubtedly be easier to interpret than a model that includes all 47 original observed variables.








Conclusion
Although fundamentally different in their frameworks, statistical factor analysis and probabilistic latent factor induction have many parallels, which lend themselves to direct comparative interpretation. Given these parallels, analysts familiar with either domain should find it easy to translate their research workflow from one framework into the other. Equally, end users of research results, who may be less familiar with the underlying computations, should be in a position to interpret the findings from both methods in a very similar manner.







References
Conrady, Stefan, and Lionel Jouffe. “Driver Analysis & Product Optimization, A Case Study from the Perfume Industry”, December 1, 2010. http://www.conradyscience.com/index.php/driver-analysis.
Cover, T. M., and J. A. Thomas. “Entropy, Relative Entropy and Mutual Information.” Elements of Information Theory (1991): 12–49.
Kachigan, Sam Kash. Multivariate Statistical Analysis: A Conceptual Introduction. 2nd ed. Radius Press, 1991.
MacKay, David J. C. Information Theory, Inference and Learning Algorithms. 1st ed. Cambridge University Press, 2003.
Shlens, J. “A Tutorial on Principal Component Analysis.” Systems Neurobiology Laboratory, University of California at San Diego (2005).
StatSoft, Inc. “Electronic Statistics Textbook.” Electronic Statistics Textbook, 2011. http://www.statsoft.com/textbook/.







Contact Information

Conrady Applied Science, LLC
312 Hamlet’s End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com



Copyright
© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.

Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:

• You may print or download this document for your personal and noncommercial use only.

• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady
  Applied Science, LLC and Bayesia SAS as the source of the material.

• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may
  you transmit it or store it in any other website or other form of electronic retrieval system.





 

Plus de Bayesia USA

vehicle_safety_v20b
vehicle_safety_v20bvehicle_safety_v20b
vehicle_safety_v20bBayesia USA
 
Impact Analysis V12
Impact Analysis V12Impact Analysis V12
Impact Analysis V12Bayesia USA
 
Causality for Policy Assessment and 
Impact Analysis
Causality for Policy Assessment and 
Impact AnalysisCausality for Policy Assessment and 
Impact Analysis
Causality for Policy Assessment and 
Impact AnalysisBayesia USA
 
Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...
Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...
Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...Bayesia USA
 
Bayesian Networks & BayesiaLab
Bayesian Networks & BayesiaLabBayesian Networks & BayesiaLab
Bayesian Networks & BayesiaLabBayesia USA
 
Breast Cancer Diagnostics with Bayesian Networks
Breast Cancer Diagnostics with Bayesian NetworksBreast Cancer Diagnostics with Bayesian Networks
Breast Cancer Diagnostics with Bayesian NetworksBayesia USA
 
Modeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian NetworksModeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian NetworksBayesia USA
 
BayesiaLab 5.0 Introduction
BayesiaLab 5.0 IntroductionBayesiaLab 5.0 Introduction
BayesiaLab 5.0 IntroductionBayesia USA
 
Car And Driver Hk Interview
Car And Driver Hk InterviewCar And Driver Hk Interview
Car And Driver Hk InterviewBayesia USA
 

Plus de Bayesia USA (9)

vehicle_safety_v20b
vehicle_safety_v20bvehicle_safety_v20b
vehicle_safety_v20b
 
Impact Analysis V12
Impact Analysis V12Impact Analysis V12
Impact Analysis V12
 
Causality for Policy Assessment and 
Impact Analysis
Causality for Policy Assessment and 
Impact AnalysisCausality for Policy Assessment and 
Impact Analysis
Causality for Policy Assessment and 
Impact Analysis
 
Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...
Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...
Vehicle Size, Weight, and Injury Risk: High-Dimensional Modeling and
 Causal ...
 
Bayesian Networks & BayesiaLab
Bayesian Networks & BayesiaLabBayesian Networks & BayesiaLab
Bayesian Networks & BayesiaLab
 
Breast Cancer Diagnostics with Bayesian Networks
Breast Cancer Diagnostics with Bayesian NetworksBreast Cancer Diagnostics with Bayesian Networks
Breast Cancer Diagnostics with Bayesian Networks
 
Modeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian NetworksModeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian Networks
 
BayesiaLab 5.0 Introduction
BayesiaLab 5.0 IntroductionBayesiaLab 5.0 Introduction
BayesiaLab 5.0 Introduction
 
Car And Driver Hk Interview
Car And Driver Hk InterviewCar And Driver Hk Interview
Car And Driver Hk Interview
 

Probabilistic Latent Factor Induction and
 Statistical Factor Analysis

  • 3. Probabilistic Factor Induction and Statistical Factor Analysis Introduction Bayesian networks have been gaining prominence among scientists over the recent decade and the new insights gener- ated by this powerful research approach can now be found in studies that circulate well beyond the academic communi- ties. As a result, many practitioners and managerial decision-makers see more and more references to Bayesian networks in all kinds of scienti c and business research, ranging from biostatistics to marketing analytics. It is not surprising that the new Bayesian network paradigm prompts comparisons to more conventional methods. In the eld of market research, for instance, long-established methods, such as factor analysis remain in daily use today. Given that there exists a direct counterpart to factor analysis in the Bayesian network framework, we want to highlight similarities as well as fundamental differences. The objective of this paper is to present both methods side-by-side and thus help researchers to correctly compare and interpret the respective results. More speci cally, we want to establish the semantic equivalents between the traditional statistical factor analysis approach and BayesiaLab’s method based on Bayesian networks, which we refer to as Probabilistic Latent Factor Induction. Factor Analysis is a statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors. It is possible, for example, that variations in three or four ob- served variables mainly re ect the variations in a single unobserved variable, or in a reduced number of unobserved variables. The observed variables can be seen as manifestations of abstract underlying (and unobserved) dimensions or (latent) factors. Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product man- agement, operations research, and other applied sciences that deal with a large number of variables in their data. Probabilistic Latent Factor Induction is a work ow within the BayesiaLab software package, which has the same objec- tive as the traditional factor analysis, i.e. variable reduction, but works entirely with the framework of Bayesian net- works and is based on principles derived from information theory. It is important to point out that this comparison is not meant to favor one approach over the other (and to declare a winner and loser), although it is clearly in the authors’ interest to promote Bayesian networks in general and BayesiaLab in particular. Rather, this paper should serve as reference for research practitioners and those who use research results in their decision-making processes, so they can correctly interpret insights generated with either approach. www.conradyscience.com | www.bayesia.com iii
  • 4. About the Authors

Stefan Conrady
Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America. Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to establishing his own firm, he was heading the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe
Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia's strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.
  • 5. Key Concepts from Information Theory

Before we proceed to the direct comparison of methods, it is important to establish several key concepts relating to knowledge representation in Bayesian networks.

Entropy
The concept of entropy provides the underpinning for all structural learning and analysis algorithms in BayesiaLab. Entropy measures the uncertainty inherent in the distribution of a random variable. The entropy H(X) of a random variable X is defined as

H(X) = -\sum_{x \in X} p(x) \log_2 p(x),

where x stands for the states that variable X can take. Note that the logarithm is to the base of 2 and that entropy is expressed in bits.

An example can perhaps illustrate this: If variable X represents the outcome of a coin toss, X can have one of two states, Heads and Tails, i.e. the set of potential outcomes is X = {Heads, Tails}. Given that the coin toss is fair, the probabilities of Heads and Tails are both 0.5, i.e. p(Heads) = 0.5 and p(Tails) = 0.5. We can now compute the entropy H(X_fair) based on these values:

H(X_fair) = -p(Heads) \log_2 p(Heads) - p(Tails) \log_2 p(Tails) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 bit

This means our uncertainty prior to a fair coin toss is equivalent to an entropy value of 1 bit, which is the maximum entropy, given the uniform distribution of a variable with two states.

If we had a biased coin instead, with p(Heads) = 0.7 and p(Tails) = 0.3, it is intuitive that the uncertainty would be lower, as one outcome of the coin toss is more probable. Indeed, computing the entropy H(X_biased) yields a lower value:

H(X_biased) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 = 0.881 bits
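As a quick illustration (not part of the original workflow), the two entropy values above can be reproduced with a few lines of Python:

```python
import math

def entropy(probs):
    # Shannon entropy in bits of a discrete distribution given as a list of probabilities.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(entropy([0.7, 0.3]))  # biased coin -> ~0.881 bits
```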
  • 6. To complete this idea, we can also plot H(X) as a function of the bias, p(Heads) = 1 - p(Tails), with p(Heads) ranging from impossible, p(Heads) = 0, to certain, p(Heads) = 1.

[Figure: entropy H(X) in bits as a function of p(Heads)]

Clearly, anything other than a perfectly fair coin reduces the entropy and thus our uncertainty regarding the outcome of the coin toss.

Chain Rule Theorem
The chain rule for joint entropy states that the total uncertainty about the values of X and Y is equal to the uncertainty about X plus the (average) uncertainty about Y once you know X:

H(X,Y) = H(X) + H(Y|X)

The proof of this theorem follows:

H(X,Y) = -\sum_{y \in Y}\sum_{x \in X} p(x,y) \log_2 p(x,y)
       = -\sum_{y \in Y}\sum_{x \in X} p(x,y) \log_2 [p(y|x) p(x)]
       = -\sum_{y \in Y}\sum_{x \in X} p(x,y) \log_2 p(y|x) - \sum_{y \in Y}\sum_{x \in X} p(x,y) \log_2 p(x)
       = -\sum_{y \in Y}\sum_{x \in X} p(x,y) \log_2 p(y|x) - \sum_{x \in X} p(x) \log_2 p(x)
       = H(Y|X) + H(X)

Conditional Entropy
Perhaps the single most important concept for computations in BayesiaLab is conditional entropy. Conditional entropy refers to the entropy of a random variable when we have information on another variable. The conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{x \in X} p(x) H(Y|X=x)
       = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)
       = -\sum_{x \in X}\sum_{y \in Y} p(x,y) \log_2 p(y|x)

The conditional entropy of Y given X thus refers to the expected entropy of Y conditional on the value of X.
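As a quick numerical check of the chain rule and the conditional entropy definition, the following sketch uses a small, made-up joint distribution p(x, y); the numbers are illustrative only and do not come from the Perfume Study data.

```python
import math

def H(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in (0, 1)}

# H(Y|X) = sum_x p(x) * H(Y | X = x)
H_Y_given_X = sum(p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1)]) for x in (0, 1))

H_XY = H(list(p_xy.values()))
H_X = H(list(p_x.values()))
print(round(H_XY, 6), round(H_X + H_Y_given_X, 6))  # both sides of the chain rule agree
```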
  • 7. Mutual Information
The mutual information I(X,Y) measures how much (on average) the observation of random variable Y tells us about the uncertainty of X, i.e. by how much the entropy of X is reduced if we have information on Y:

I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Note that mutual information is a symmetric metric, which reflects the uncertainty reduction of X by knowing Y as well as of Y by knowing X.

Relative Entropy (Kullback-Leibler Divergence)
A closely related concept is the relative entropy, also referred to as the Kullback-Leibler Divergence (D_KL) or sometimes cross entropy. The Kullback-Leibler Divergence is a measure of the difference between two probability distributions p and q. For probability distributions p and q of a discrete random variable X, their K-L divergence is defined as

D_KL(p(X) || q(X)) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}

In words, it is the expected value of the logarithmic difference between the probability distributions p(X) and q(X). In contrast to mutual information, relative entropy is non-symmetric.

Example 1
We once again use tossing coins as an example. By default, we would expect that any given coin is fair and assume a model q(Heads) = q(Tails) = 0.5. As it turns out, in repeated coin tosses we observe a probability of p(Heads) = 0.75 and p(Tails) = 0.25. We can now use the Kullback-Leibler Divergence to establish the "distance" or "distortion" between the originally assumed distribution q(x) and the observed distribution p(x):

D_KL(p(X) || q(X)) = p(Heads) \log_2 \frac{p(Heads)}{q(Heads)} + p(Tails) \log_2 \frac{p(Tails)}{q(Tails)}
                   = 0.75 \log_2 \frac{0.75}{0.5} + 0.25 \log_2 \frac{0.25}{0.5} = 0.188722 bits
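Example 1 can be verified with a short sketch of the formula above; this is only an illustration, not BayesiaLab functionality.

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) in bits for two discrete distributions over the same states.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

observed = [0.75, 0.25]  # observed p(Heads), p(Tails)
assumed = [0.50, 0.50]   # fair-coin model q
print(kl_divergence(observed, assumed))  # ~0.188722 bits, as in Example 1
```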
  • 8. Example 2
For another illustration we use an example from the field of meteorology. More specifically, we look at rainfall in two cities in the state of Victoria, Australia. We used daily rainfall data measured at Geelong Airport and at Melbourne Tullamarine Airport, which are approximately 80 kilometers apart, over the entire calendar year of 2010. Given the proximity of these locations, one would generally expect similar weather. Perhaps the Geelong weather isn't reported in the Melbourne newspapers, and so a traveler wants to use the Melbourne weather as a proxy. However, the actual weather station observations tell us that there is rain in Melbourne on 40.3% of the days, whereas Geelong sees rainfall on 47.4% of the days in the year.

We can now compute the Kullback-Leibler Divergence for these two distributions, where p_Geelong(x) stands for the Geelong and p_Melbourne(x) for the Melbourne rain probability distribution:

D_KL(p_Geelong(X) || p_Melbourne(X)) = p_Geelong(No Rain) \log_2 \frac{p_Geelong(No Rain)}{p_Melbourne(No Rain)} + p_Geelong(Rain) \log_2 \frac{p_Geelong(Rain)}{p_Melbourne(Rain)}
                                     = 0.526 \log_2 \frac{0.526}{0.597} + 0.474 \log_2 \frac{0.474}{0.403} = 0.0148958 bits

D_KL(p_Melbourne(X) || p_Geelong(X)) = p_Melbourne(Rain) \log_2 \frac{p_Melbourne(Rain)}{p_Geelong(Rain)} + p_Melbourne(No Rain) \log_2 \frac{p_Melbourne(No Rain)}{p_Geelong(No Rain)}
                                     = 0.403 \log_2 \frac{0.403}{0.474} + 0.597 \log_2 \frac{0.597}{0.526} = 0.0147077 bits

BayesiaLab's primary metric, the Arc Force, is directly proportional to the relative entropy and describes the strength of the directional link between two variables. More specifically, it describes the difference between the joint probability distributions with and without the particular arc.
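The two rainfall divergences above can be checked with the same helper used for Example 1; the sketch below makes the asymmetry of the measure explicit.

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) in bits for two discrete distributions over the same states.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

geelong = [0.526, 0.474]    # p(No Rain), p(Rain) at Geelong
melbourne = [0.597, 0.403]  # p(No Rain), p(Rain) at Melbourne

print(kl_divergence(geelong, melbourne))  # ~0.0149 bits
print(kl_divergence(melbourne, geelong))  # ~0.0147 bits -- note the asymmetry
```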
  • 9. Comparison of Methods

Approach
We believe that we can best facilitate a comparison of statistical factor analysis and latent factor induction by working through an example. We draw upon the familiar dataset from the previously presented case study from the perfume industry, hereafter referred to as the "Perfume Study." [1]

We begin our tutorial with the Data Import process for BayesiaLab, although it is not meant to be at the core of the comparison. It is important, though, to spell out the data pre-processing steps in BayesiaLab, as they highlight some of the fundamental differences between probabilistic and statistical approaches.

Once the data preparation is complete, we first present the probabilistic latent factor induction workflow with BayesiaLab and then provide an example of a statistical factor analysis. For the statistical factor analysis, we will use STATISTICA 10 as the software platform, although most steps are fairly generic and could be reproduced with a number of other statistical software packages as well.

Notation
To clearly distinguish between natural language, software-specific functions and study-specific variable names, the following notation is used:
• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.
• Names of attributes, variables, nodes and factors are italicized.
• At appropriate points in the text, grey boxes highlight parallels between the two presented methods: Probabilistic Latent Factor Induction | Statistical Factor Analysis

Key Terminology
• "Observed" and "manifest" are used interchangeably and describe those random variables which have been measured by the researcher; each variable measured in the survey is thus a manifest variable.
• The terms "latent" and "unobserved" are used interchangeably in the context of hidden concepts or factors, which cannot be measured directly but can potentially be extracted or induced. In our context, the term factor stands exclusively for latent variables. Consequently, the terms "factor", "factor variable", "latent variable" and "unobserved variable" are equivalent.

[1] Conrady and Jouffe (2010)
  • 10. Data Set
The Perfume Study is based on a monadic consumer survey about a range of fragrances, which was conducted in France. In this example we use survey responses from 1,321 women, who have evaluated a total of 11 fragrances on a wide range of attributes:
• 27 ratings on fragrance-related attributes, such as "sweet", "flowery", "feminine", etc., measured on a 1-to-10 scale.
• 12 ratings on projected imagery related to someone who would be wearing the respective fragrance, e.g. "is sexy", "is modern", measured on a 1-to-10 scale.
• 1 variable for Intensity, a measure reflecting the level of intensity, measured on a 1-to-5 scale.
• 1 variable for Purchase Intent, measured on a 1-to-6 scale.
• 1 nominal variable, Product, for product identification purposes.
  • 11. Probabilistic Latent Factor Induction with BayesiaLab

Data Import
To start the process with BayesiaLab, we first import the data set, which is formatted as a CSV file. [2] With Data > Open Data Source > Text File, we start the Data Import wizard, which immediately provides a preview of the data file. The table displayed in the Data Import wizard shows the individual variables as columns and the survey responses as rows. There are a number of options available, e.g. for sampling; however, this is not necessary in our example given the relatively small size of the database.

Clicking the Next button prompts a data type analysis, which provides BayesiaLab's best guess regarding the data type of each variable. Furthermore, the Information box provides a brief summary regarding the number of records, the number of missing values, filtered states, etc. [3]

[2] CSV stands for "comma-separated values", a common format for text-based data files.
[3] There are no missing values in our database, and filtered states are not applicable in this survey.
  • 12. For this example, we will need to override the default data type for the Product variable, as each value is a nominal product identifier rather than a numerical scale value. We can change the data type by highlighting the Product variable and clicking the Discrete check box, which changes the color of the Product column to red. We will also define Purchase Intent and Intensity as discrete variables, as the default number of states of these variables is already adequate for our purposes. [4]

The next screen provides options as to how to treat any missing values. In our case, there are no missing values, so the corresponding panel is grayed out. Clicking the small upside-down triangle next to the variable names brings up a window with key statistics of the selected variable, in this case Fresh.

[4] The desired number of variable states is largely a function of the analyst's judgment.
  • 13. The next step is the Discretization and Aggregation dialogue, which allows the analyst to determine the type of discretization to be performed on all continuous variables. [5] For this survey, and given the number of observations, it is appropriate to reduce the number of states from the original 10 (1 through 10) to a smaller number. One could, for instance, bin the 1-to-10 ratings into low, mid and high, or apply any other method deemed appropriate by the analyst. The screenshot shows the dialogue for the Manual selection of discretization steps, which permits selecting binning thresholds by point-and-click.

[5] BayesiaLab requires discrete distributions for all variables.
  • 14. Note: For choosing discretization algorithms beyond this example, the following rule of thumb may be helpful:
• For supervised learning, choose Decision Tree.
• For unsupervised learning, choose, in order of priority, K-Means, Equal Distances or Equal Frequencies.

For this particular example, we select Equal Distances with 5 intervals for all continuous variables. This was the analyst's choice in order to be consistent with prior research. Clicking Select All Continuous followed by Finish completes the import process, and the 49 variables (columns) from our database are now shown as blue nodes in the Graph Panel, which is the main window for network editing. By default, all variables are represented as nodes. This initial view represents a fully unconnected Bayesian network.
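For readers who want to mirror the Equal Distances discretization chosen above outside of BayesiaLab, the following sketch bins a 1-to-10 rating column into 5 equal-width intervals with pandas; the column values are made up, as the Perfume Study data is not reproduced here.

```python
import pandas as pd

# Hypothetical 1-to-10 rating column; not the actual survey data.
ratings = pd.Series([1, 3, 4, 6, 7, 7, 8, 9, 10, 10], name="Fresh")

# Equal Distances with 5 intervals: cut the observed range into 5 equal-width bins.
binned = pd.cut(ratings, bins=5, labels=False)  # discrete states 0..4
print(binned.tolist())
```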
  • 15. In the above graph, two variables play a fundamentally different role: the values of Product represent categories, and Purchase Intent is the overall target variable, i.e. the dependent variable of the Perfume Study. Both will therefore be excluded from the factor generation process.

While correlation and covariance are the central measures for statistical factor analysis, learning Bayesian networks with BayesiaLab (and thus probabilistic factor induction) is based on measures from information theory, such as the Kullback-Leibler Divergence, which was introduced in the first chapter. The Kullback-Leibler Divergence can be obtained after learning an initial Bayesian network with one of BayesiaLab's unsupervised learning algorithms. "Unsupervised" implies that the learning algorithm searches for an overall representation of the joint distribution of the underlying data rather than the characterization of an individual target variable. In our example, we use BayesiaLab's EQ algorithm to obtain a Bayesian network.
  • 16. As this view of the network is not easily readable, BayesiaLab offers numerous built-in layout algorithms, of which the Force Directed Layout is perhaps the most commonly used. It can be invoked via View > Automatic Layout > Force Directed Layout, or alternatively through the keyboard shortcut "p". The resulting network will look similar to the following screenshot.
  • 17. Completed Bayesian Network upon EQ Learning

With the network established, we can now further examine the probabilistic relationships between the nodes, which are represented as arcs. [6] By selecting Analysis > Graphic > Arc Force, we can show the probabilistic strength of the arcs, which is visualized by their thickness.

[6] "Arcs" are directed links or edges between nodes, which appear as arrows in the graph.
  • 18. Network with Arc Force

The numeric values of the Arc Force can be shown by selecting View > Display Arc Comments. In the network shown below, the Arc Force values are presented in yellow boxes attached to each arc.
  • 19. Network with Arc Force

Arc Force | Covariance
In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.
  • 20. Variable Clustering
With Arc Force established as the key measure across the entire network, BayesiaLab can determine clusters of variables which are "close" in a probabilistic sense. This can be initiated from the menu via Analysis > Graphic > Variable Clustering. The clustering algorithm is iterative and starts with the two variables whose connecting arc has the strongest Arc Force. The following sequence of screenshots illustrates this algorithm conceptually in "slow motion," as the analyst would not see these individual steps in the actual workflow.

As a starting point, every manifest variable is treated as a distinct cluster, and so we have 47 clusters. Using the Kullback-Leibler Divergence as a measure, the "closest" variables are then merged into one concept. As a result, we first obtain 46 clusters, then 45, etc., as shown in the array of dendrograms below. BayesiaLab proposes to conclude this algorithm upon finding 15 clusters; however, the analyst has the ability to override this automatic selection. As the choice of clusters appears to be generally compatible with our interpretation of the variable names, we accept this recommendation.
  • 21. Sequence of Dendrograms (from 47 clusters down to 15)

Because of the importance of this process, we will also show it from another angle, i.e. by looking at sequential views of the graph.
  • 22. Step 0 - 47 Clusters
Step 1 - 46 Clusters: Pleasure merged with Corresponds
The strongest Arc Force exists between Pleasure and Corresponds, and BayesiaLab forms an interim concept from them. The next-highest Arc Force then determines whether another variable is merged with the first concept or whether a new concept is created. In our case, Radiant and In Love are combined as a new concept.
  • 23. Step 2 - 45 Clusters: Radiant merged with In Love
In the third step, we see Sensual and Romantic merged into a new latent concept, and so on.
Step 3 - 44 Clusters: Sensual merged with Romantic
Upon completion of this process, BayesiaLab forms variable/node clusters from these common concepts and color-codes them accordingly.
  • 24. Network with Color-Coded Variable Clusters

By clicking the Validate Clustering button, we can now formally fix the new latent factor variables. The new latent factors are shown in the following table with their associated observed variables. By default, they are given the name "Factor" plus a numeric suffix.
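A rough, simplified analogue of this variable clustering step can be sketched with standard Python libraries: compute a pairwise mutual-information matrix between the discretized manifest variables and cluster it hierarchically. This is only an approximation under our own assumptions; BayesiaLab's algorithm works on the Arc Forces of the learned network, not on a raw mutual-information matrix.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def variable_clusters(df, n_clusters=15):
    # df: DataFrame of discretized manifest variables (one column per variable).
    cols = list(df.columns)
    n = len(cols)
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_info_score(df[cols[i]], df[cols[j]])
    dist = mi.max() - mi              # turn similarity into a distance
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return dict(zip(cols, fcluster(z, t=n_clusters, criterion="maxclust")))
```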
  • 25. Latent Factor Induction
Upon definition of the new latent factor variables, we now want to make them available for modeling purposes. Although these latent factors exist as new concepts and are conceptually linked to the manifest variables, the factors do not yet have any values or states. This happens in the Multiple Clustering process, which creates discrete states for each latent factor variable by performing data clustering over the linked manifest variables. More specifically, the states of each latent factor are created in such a way that they best summarize the joint probability distribution defined by the manifest variables. Factor 0 and its linked manifest variables are shown below.

Subnetwork for Factor 0
  • 26. The following Monitors display the marginal probability distributions of the variables associated with Factor 1, plus, highlighted in red, Factor 1 itself and its states. We can see that 5 states were created for Factor 1, labelled C1 through C5, and each has an expected value, which is shown in parentheses. For instance, state C2 has an expected value of 9.21. That means, given that C2 is observed, the mean value of the manifest variables, weighted by their relation with C2, is equal to 9.21. In other words, C2 corresponds to high ratings with regard to those 5 dimensions.

By selecting specific states of Factor 0 in the Monitor Panel, we can see the conditional distributions of the manifest variables. The states C2 and C3 are displayed for reference below. They can be easily interpreted by looking at the associated values, e.g. state C2 appears to reflect high ratings of the manifest variables, whereas state C3 captures very low ratings.

A more general analysis of the relationships between manifest variables and latent factors can be obtained through Analysis > Reports > Relationship Analysis. This chart summarizes the values of key clustering measures, such as the Kullback-Leibler Divergence, for every manifest variable associated with Factor 0. For reference only, it also includes Pearson's Correlation Coefficient R.
  • 27. Relationship Analysis | Factor Loadings
The summary of clustering measures in the Relationship Analysis allows an interpretation which is very similar to what is provided by factor loadings.

It is also possible to visualize the mean values of the manifest variables (x-axis) along with the Mutual Information (y-axis, left panel) and the Standardized Total Effect (y-axis, right panel) for the latent factor variable.

Although we have now defined new factor variables, we have not yet seen the original matrix of survey responses in terms of the new factor variables. For instance, every respondent record has a value for Active, Fulfilled, Trust, etc., as these variables were observed and recorded in the survey, but how do we find the values (or states) of the new latent factors for each respondent record? Actually, at the conclusion of the Multiple Clustering process, BayesiaLab has introduced the new factors into the original network. By using BayesiaLab's imputation process, which is based on maximum likelihood, they were added as new nodes to the graph and also saved as new columns (or fields) in the database.
  • 28. Latent Factors Introduced into Network

Factor Induction | Saving Factor Scores
Introducing the new latent factors into the network is equivalent to adding the factor scores to the original observation matrix.

We can easily verify that each new factor has a value for each respondent record. We start Inference > Interactive Inference, which allows us to scroll through the survey records and view the values of any variable, including the values of the new latent factors.
  • 29. For instance, survey record #0 is expressed as state C4 in terms of Factor 0. The states of the manifest variables are shown for reference. Record #8, for example, is assigned to state C3.

Now we have the entire set of respondent records re-expressed in terms of 15 latent factors, which allows us to use them for all kinds of modeling purposes.
  • 30. Given the importance of the latent factors for interpretation, we will assign descriptive labels to each of them. BayesiaLab can visually aid in this process by showing the latent factors and their relationships to the original manifest variables. This means we simply learn a new network which includes both the factor variables and the manifest variables.
  • 31. Network including Latent Factors and Manifest Variables

The emerging network structure clearly lends itself to defining descriptive labels, which are applied to the factors in the following graph. [7]

[7] See Conrady and Jouffe (2010) for a more detailed explanation of the interpretation process.
  • 32. Network including Latent Factors and Manifest Variables plus Factor Labels

It is important to reiterate that the latent factors generated here are not orthogonal, which means that probabilistic relationships exist between the factors. For illustration purposes, we can highlight the latent factors and exclude the manifest variables from being displayed. In addition, the following graph also displays the Arc Force between the latent factors, providing further confirmation that the latent factors are not independent.
  • 33. Network with Latent Factors and Arc Forces
  • 34. Statistical Factor Analysis
Perhaps the most common approach for extracting factors from a set of observed variables is Principal Components Analysis (PCA), and it is frequently considered a synonym for factor analysis. [8] For our purpose, we look at PCA as a prototypical tool for factor extraction, which lends itself to being compared to the latent factor induction with BayesiaLab presented earlier.

Principal Component Analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations, represented by matrix X, of possibly correlated variables into a set of values of uncorrelated variables called principal components, represented by a new matrix Y. The goal of this transformation is to minimize redundancy (measured by covariance) and to maximize the signal (measured by variance). The transformation is defined in such a way that the first principal component has the highest possible variance, i.e. it accounts for as much of the variability in the data as possible. In turn, each succeeding component has the next-highest variance while being orthogonal to (uncorrelated with) the preceding components.

Conceptual Illustration of Principal Component Vectors

More formally, PCA creates a re-expression of the original data set on the basis of a new set of orthonormal vectors, replacing the original set of "naive" basis vectors which resulted from the choice of measurements. [9] In matrix notation, this can be expressed as follows:

PX = Y

[8] There are differences between PCA and the more general concept of factor analysis, but explaining those goes beyond the scope of this paper.
[9] Any observed variable automatically establishes a basis vector. Measuring 47 variables would thus result in a 47-dimensional coordinate system.
  • 35. Here X is the matrix of original observations and P is a yet-to-be-determined orthonormal matrix that transforms X into Y. Interpreting this geometrically, P is a rotation and stretch that generates Y. The rows of P, {p_1, ..., p_m}, are the new set of basis vectors for expressing the columns of X. Writing out the explicit dot products may better illustrate this:

PX = \begin{pmatrix} p_1 \\ \vdots \\ p_m \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}

Y = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix}

This provides us with the general framework, but we have yet to determine what matrix P should be. This is the point where we need to introduce the concept of the covariance matrix C_X. It is defined as

C_X = \frac{1}{n-1} X X^T

• C_X is a square and symmetric m × m matrix.
• The elements on the diagonal of C_X represent the variances of the observed variables.
• The off-diagonal elements of C_X represent the covariances between observed variables.

As a result, C_X captures the correlations between all possible pairs of observed variables. This relates directly to our objective of minimizing redundancy (measured by covariance) and maximizing the signal (measured by variance) of the target matrix Y. The optimum achievement of these goals would imply a diagonal covariance matrix for Y, i.e. with all off-diagonal elements being zero, and our objective thus translates into stipulating that C_Y must be diagonal. Fortunately, linear algebra provides several tools for diagonalizing a matrix.

More formally, the objective becomes finding some orthonormal matrix P, where Y = PX, such that C_Y is diagonal. The rows of P are then the principal components. Without providing further detail, the solution is:
• The principal components of X are the eigenvectors of XX^T, i.e. the rows of P.
• The i-th diagonal value of C_Y is the variance of X along p_i.
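The eigenvector solution described above can be written out directly in NumPy. The sketch below assumes X is arranged with observations in rows and variables in columns (the transpose of the notation above); standardizing the columns first would make C the correlation matrix, in which case the eigenvalues sum to the number of variables.

```python
import numpy as np

def pca_via_eigendecomposition(X):
    # X: (n_observations x m_variables) data matrix.
    Xc = X - X.mean(axis=0)                    # center each observed variable
    C = np.cov(Xc, rowvar=False)               # m x m covariance matrix C_X
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]      # sort components by decreasing variance
    eigenvalues = eigenvalues[order]
    P = eigenvectors[:, order].T               # rows of P are the principal components
    Y = Xc @ P.T                               # re-expressed observations (component scores)
    return eigenvalues, P, Y
```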
  • 36. Factor Analysis with STATISTICA
Upon loading the survey data into STATISTICA, the respondent records are presented as a data table, with the variable names shown as column headers and case numbers shown as row headers. [10] This represents our observation matrix X.

Observation Matrix X

As a starting point of the PCA process, we can display C_X, the covariance matrix of X.

[10] We will skip a detailed description of the data import steps, as they are fairly generic, and we assume that readers use a wide array of statistical programs.
  • 37. Covariance Matrix

Arc Force | Covariance
In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.

As expected, there is a high amount of covariance, i.e. redundancy, between many of the observed variables. To get a better sense of the magnitude of these pairwise relationships, it helps to display the correlation matrix for reference.
  • 38. Correlation Matrix

STATISTICA, like many other statistical software packages, has built-in routines which can perform the computation of the matrix P of principal components automatically. There are several methods available for solving the PCA, including the approach using the eigenvectors of the covariance matrix, which was shown earlier.

Regardless of the computational method used, the solution of the PCA provides as many eigenvalues as there are observed variables. The sum of all eigenvalues equals the number of observed variables, in our case 47. This allows us to determine the share of variance attributable to each factor. For instance, the first factor has an eigenvalue of 29.6, which means that it accounts for 29.6/47 = 62.98% of the variance. Proceeding down the list, the eigenvalues decline in value, and correspondingly so does their contribution to the total variance.
  • 39. List of Eigenvalues

Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain, as the overall objective of this exercise is variable reduction. The precise number of factors to be retained is ultimately an arbitrary decision of the analyst, but factors with eigenvalues greater than 1 are typically considered candidates. A scree plot [11] is typically used to illustrate the eigenvalues of the extracted factors. Sometimes this provides a visual indication of a natural cutoff point between higher and lower eigenvalues. Here such a distinction cannot be made easily, so we defer to the rule of thumb and retain eigenvalues greater than 1.

[11] The name "scree plot" is a metaphorical expression, as "scree" is the term for the accumulation of broken rock at the base of mountain cliffs. In the scree plot we want to distinguish the substantial eigenvalues from the "rubble" at the bottom.
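The variance shares and the eigenvalue-greater-than-1 retention rule discussed above translate into two small helpers; only the 29.6-out-of-47 figure quoted earlier comes from the report, the code itself is a generic sketch.

```python
import numpy as np

def explained_variance_share(eigenvalues):
    # For a correlation-matrix PCA the eigenvalues sum to the number of observed variables.
    ev = np.asarray(eigenvalues, dtype=float)
    return ev / ev.sum()

def kaiser_retained(eigenvalues):
    # Indices of factors kept under the eigenvalue > 1 rule of thumb.
    return [i for i, ev in enumerate(eigenvalues) if ev > 1.0]

# With the first reported eigenvalue of 29.6 and 47 variables: 29.6 / 47 = 0.6298...
```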
  • 40. Scree Plot

In the next step we turn to the interpretation of the extracted factors. The table below shows the factor loadings, which are the correlations of each observed variable with the extracted factors.

Factor Loadings
  • 41. Given the high eigenvalue of factor 1, it is not surprising that many variables are highly correlated with it. In our particular case, however, this correlation is mostly negative, which may be counterintuitive for interpretation purposes. It is common practice to rotate factors in order to aid in the interpretation process. Intuitively speaking, the rotation is typically chosen in such a way that the principal factor, i.e. factor 1, aligns with what is commonly understood as the "positive x-axis." Such a factor rotation, for which several methods exist, was also performed with STATISTICA, and the results appear in the table below. In addition, factor loadings higher than 0.7 are highlighted.

Loadings on Rotated Factors

Relationship Analysis | Factor Loadings
The summary of clustering measures in BayesiaLab's Relationship Analysis allows an interpretation which is very similar to what is provided by factor loadings.

The analyst can now use these factor loadings to assign meaningful names to each factor. Some are quite obvious in their characterization, such as factor 3, which could be called "pleasant", or factor 4, which is quite obviously "classical." It is also interesting to see that only one variable, i.e. Intensity, has a high loading on factor 2. This implies that
  • 42. perhaps Intensity is a standalone concept, which has little redundancy. At the other extreme, many variables have high loadings on factor 1, which makes identifying a distinct concept more elusive.

Without completing this interpretation process, we turn to the "reduction" part by introducing the extracted factors as variables into the original data set, i.e. replacing 47 variables with 6 variables. This is often referred to as "saving factor scores," with the factor scores being the values of the original observations in the new coordinate system created by the extracted factors. Our observations now have coordinates in a 6-dimensional coordinate system rather than in one with 47 dimensions.

Factor Scores

Latent Factor Induction | Saving Factor Scores
Introducing the latent factors into the network is equivalent to adding the factor scores to the original observation matrix.

We now have the ability to create a wide range of models, for instance, modeling Purchase Intent as a function of the 6 new factors. This will undoubtedly be easier to interpret than a model which includes all of the 47 original observed variables.
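Outside of STATISTICA, a comparable rotate-and-score step can be sketched with scikit-learn; the file name and column layout below are assumptions rather than part of the original study, and FactorAnalysis with rotation="varimax" requires scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Assumed input: one row per respondent, one column per observed rating variable.
df = pd.read_csv("perfume_ratings.csv")  # hypothetical file name

fa = FactorAnalysis(n_components=6, rotation="varimax")
scores = fa.fit_transform(df)                                # factor scores per respondent
loadings = pd.DataFrame(fa.components_.T, index=df.columns)  # loadings per observed variable

# "Saving factor scores": append the 6 factor columns to the original observation matrix.
factors = pd.DataFrame(scores, columns=[f"Factor{i + 1}" for i in range(6)])
df_with_factors = pd.concat([df, factors], axis=1)
```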
  • 43. Conclusion
Although fundamentally different in their frameworks, statistical factor analysis and probabilistic latent factor induction have many parallels, which lend themselves to direct comparative interpretation. Given these parallels, analysts familiar with either domain should find it easy to translate their research workflow from one framework into the other. Equally, end users of research results, who may be less familiar with the underlying computations, should be in a position to interpret the findings from both methods in a very similar manner.
  • 44. References
Conrady, Stefan, and Lionel Jouffe. "Driver Analysis and Product Optimization, A Case Study from the Perfume Industry", December 1, 2010. http://www.conradyscience.com/index.php/driver-analysis.
Cover, T. M., and J. A. Thomas. "Entropy, Relative Entropy and Mutual Information." Elements of Information Theory (1991): 12-49.
Kachigan, Sam Kash. Multivariate Statistical Analysis: A Conceptual Introduction. 2nd ed. Radius Press, 1991.
MacKay, David J. C. Information Theory, Inference and Learning Algorithms. 1st ed. Cambridge University Press, 2003.
Shlens, J. "A Tutorial on Principal Component Analysis." Systems Neurobiology Laboratory, University of California at San Diego (2005).
StatSoft, Inc. "Electronic Statistics Textbook." 2011. http://www.statsoft.com/textbook/.
  • 45. Contact Information

Conrady Applied Science, LLC
312 Hamlet's End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com

Copyright
© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:
• You may print or download this document for your personal and noncommercial use only.
• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia SAS as the source of the material.
• You may not, except with our express written permission, distribute or commercially exploit the content, nor may you transmit it or store it in any other website or other form of electronic retrieval system.