Bayesian Co-clustering
  Authors: Hanhuai Shan, Arindam Banerjee
  Dept of Computer Science & Engineering
    University of Minnesota, Twin Cities
             2008 IEEE ICDM
       Presenter: Rui-Zhe Liu on 9/7
Outline
     Introduction
     Related work
       Generative mixture models
     Method
       Bayesian Co-clustering
       Inference and learning
     Experiments and results
     Conclusion




Introduction(1/2)
     Traditional clustering algorithms do not perform well on
      dyadic data because they are unable to exploit the relationship
      between the two types of entities underlying a data matrix.
     In comparison, co-clustering can achieve much better
      performance at discovering the structure of the data and
      predicting missing values, by taking advantage of the
      relationship between the two entities.




Introduction(2/2)
     While these techniques work reasonably on real data, one
      important restriction is that almost all of these algorithms are
      partitional.
       i.e., a row/column belongs to only one row/column cluster.
     In this paper, we propose Bayesian co-clustering (BCC) by
      viewing co-clustering as a generative mixture modeling
      problem.
     We assume each row and each column has its own mixed
      membership, from which row and column clusters are
      generated.

Generative mixture models(1/4)
     Finite mixture models (FMM)
       The density function of a data point x in FMMs is given by:
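       In standard notation, a k-component FMM draws a latent
       component z from mixing weights π shared by all data points,
       and then draws x from that component:

           p(x \mid \pi, \Theta) = \sum_{z=1}^{k} \pi_z \, p(x \mid \theta_z)

       The same π being shared by every data point is the assumption
       that LDA relaxes on the next slide.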




Generative mixture models(2/4)
     Latent Dirichlet Allocation
       LDA relaxes this assumption by assuming there is a separate
        mixing weight π for each data point, and π is sampled from a
        Dirichlet distribution Dir(α).
       For a sequence of tokens x = x1 · · · xd, LDA with k components
        has a density of the form
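       Written out, the LDA density integrates over the per-sequence
       mixing weights π (the standard LDA marginal):

           p(x \mid \alpha, \Theta) = \int p(\pi \mid \alpha)
               \prod_{j=1}^{d} \Big( \sum_{z_j=1}^{k} p(z_j \mid \pi)\, p(x_j \mid \theta_{z_j}) \Big)\, d\pi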




Generative mixture models(3/4)
     Bayesian Naive Bayes
       LDA can only handle discrete data as tokens.
       Bayesian naive Bayes (BNB) generalizes LDA to allow the model to
        work with arbitrary exponential family distributions.
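       By analogy with the LDA form above, a sketch of the BNB
       density, assuming a per-feature exponential-family component
       p_ψ(x_j | θ_{jz}) over features j = 1..d (a reconstruction,
       not quoted from the paper):

           p(x \mid \alpha, \Theta) = \int p(\pi \mid \alpha)
               \prod_{j=1}^{d} \Big( \sum_{z=1}^{k} p(z \mid \pi)\, p_\psi(x_j \mid \theta_{jz}) \Big)\, d\pi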




       BNB is able to deal with different types of data, and is designed to
        handle sparsity.
Generative mixture models(4/4)
     Co-clustering based on generative mixture models
       The existing literature has a few examples of generative models
        for co-clustering.
       Existing models have one or more of the following limitations:
         (a) The model only handles binary relationships;
         (b) The model deals with relations within a single type of entity,
          such as a social network among people;
         (c) There is no computationally efficient algorithm to do inference, and
          one has to rely on stochastic approximation based on sampling.
       The proposed BCC model has none of these limitations, and
        actually goes much further by leveraging the good ideas in such
        models.

Bayesian Co-Clustering(1/4)
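       This slide presents the BCC model (the plate diagram of
       Figure 1). In outline, the generative process for a matrix
       with rows u and columns v is:

           \pi_{1u} \sim \mathrm{Dir}(\alpha_1)  (for each row u),
           \pi_{2v} \sim \mathrm{Dir}(\alpha_2)  (for each column v),
           z_1 \sim \mathrm{Disc}(\pi_{1u}), \quad z_2 \sim \mathrm{Disc}(\pi_{2v}),
           x_{uv} \sim p_\psi(x \mid \theta_{z_1 z_2})  (for each observed entry (u, v)).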




Bayesian Co-Clustering(2/4)
      The overall joint distribution over all observable and latent
       variables is given by
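       A sketch of that joint distribution, following the generative
       process above, with δ_{uv} = 1 for non-missing entries and 0
       otherwise:

           p(X, z_1, z_2, \pi_1, \pi_2 \mid \alpha_1, \alpha_2, \Theta)
             = \prod_u p(\pi_{1u} \mid \alpha_1) \prod_v p(\pi_{2v} \mid \alpha_2)
               \prod_{u,v} \big[ p(z_{1uv} \mid \pi_{1u})\, p(z_{2uv} \mid \pi_{2v})\,
                                 p_\psi(x_{uv} \mid \theta_{z_{1uv} z_{2uv}}) \big]^{\delta_{uv}}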




Bayesian Co-Clustering(3/4)
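       Marginalizing out the latent variables gives the probability
       of the observed matrix (a sketch); the coupled sum over
       (z_1, z_2) inside the integral is what makes exact inference
       intractable:

           p(X \mid \alpha_1, \alpha_2, \Theta)
             = \int p(\pi_1 \mid \alpha_1)\, p(\pi_2 \mid \alpha_2)
               \prod_{u,v} \Big( \sum_{i,j} \pi_{1ui}\, \pi_{2vj}\,
                                 p_\psi(x_{uv} \mid \theta_{ij}) \Big)^{\delta_{uv}} d\pi_1\, d\pi_2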




Bayesian Co-Clustering(4/4)
      It is easy to see (Figure 1) that one-way Bayesian clustering
         models such as BNB and LDA are special cases of BCC: with a
         single cluster on one side, BCC degenerates to Bayesian
         clustering of the other side alone.
Inference and Learning
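       In outline: learning estimates (α1, α2, Θ) by maximizing
       log p(X | α1, α2, Θ). Since the marginal likelihood above is
       intractable, the algorithm instead maximizes a variational
       lower bound L, alternating inference over variational
       parameters (the E-step) with parameter estimation (the
       M-step), as detailed on the following slides.
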
Variational Approximation(1/2)
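       A sketch of the fully factorized variational family, assuming
       Dirichlet factors q(π | γ) and discrete factors q(z | φ), with
       one φ_{1u} per row and one φ_{2v} per column (consistent with
       the per-row/per-column φ described on the co-embedding slide):

           q(z_1, z_2, \pi_1, \pi_2 \mid \gamma_1, \gamma_2, \phi_1, \phi_2)
             = \prod_u q(\pi_{1u} \mid \gamma_{1u}) \prod_v q(\pi_{2v} \mid \gamma_{2v})
               \prod_{u,v} q(z_{1uv} \mid \phi_{1u})\, q(z_{2uv} \mid \phi_{2v})
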
Variational Approximation(2/2)
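       By the standard Jensen's-inequality argument, q yields a lower
       bound on the log-likelihood, with gap equal to the KL
       divergence from q to the true posterior:

           \log p(X \mid \alpha_1, \alpha_2, \Theta) \;\ge\; \mathcal{L}
             = \mathbb{E}_q[\log p(X, z_1, z_2, \pi_1, \pi_2 \mid \alpha_1, \alpha_2, \Theta)]
               - \mathbb{E}_q[\log q(z_1, z_2, \pi_1, \pi_2)]
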
Inference(1/3)
      In the inference step, given a choice of model parameters
       (α1,α2,Θ), the goal is to get the tightest lower bound to log
       p(X|α1,α2,Θ).
     While there is no closed form, by taking derivatives of L and
      setting them to zero, the solution can be obtained by iterating
      over the following set of equations:
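       A plausible form of the row-side updates, adapted from the
       analogous LDA updates (a sketch, not the paper's exact
       equations; Ψ is the digamma function, δ_{uv} marks observed
       entries):

           \phi_{1ui} \propto \exp\Big( \Psi(\gamma_{1ui}) - \Psi\big(\textstyle\sum_{i'} \gamma_{1ui'}\big)
               + \sum_{v,j} \delta_{uv}\, \phi_{2vj} \log p_\psi(x_{uv} \mid \theta_{ij}) \Big)
           \gamma_{1ui} = \alpha_{1i} + \sum_v \delta_{uv}\, \phi_{1ui}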




Inference(2/3)
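       The column-side updates for φ_{2vj} and γ_{2vj} are symmetric
       to the row-side updates above, with the roles of rows and
       columns exchanged.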




Inference(3/3)

Parameter Estimation(1/3)
      To estimate the Dirichlet parameters (α1,α2), one can use an
       efficient Newton update as shown in [7, 5] for LDA and BNB.
     One potential issue with such an update is that an
      intermediate iterate α(t) can fall outside the feasible region
      α > 0.
     In our implementation, we avoid such a situation using an
      adaptive line search: the Newton step for α1 is damped until
      the iterate stays feasible.
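       A minimal sketch of such a damped Newton step in Python,
       assuming the Dirichlet gradient g and a diag(q) + c·11ᵀ
       Hessian structure (as in LDA-style Newton updates) have
       already been computed; all names here are hypothetical:

           import numpy as np

           def damped_newton_update(alpha, g, q, c):
               """One Newton step for Dirichlet parameters, backtracking so
               that the iterate stays in the feasible region alpha > 0."""
               # Newton direction for H = diag(q) + c * 1 1^T (Sherman-Morrison):
               b = np.sum(g / q) / (1.0 / c + np.sum(1.0 / q))
               step = (g - b) / q          # step = H^{-1} g
               eta = 1.0
               while np.any(alpha - eta * step <= 0.0):
                   eta *= 0.5              # adaptive line search: halve until feasible
               return alpha - eta * step

       Iterating such updates to convergence gives α1; α2 is updated
       in the same way.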




Parameter Estimation(2/3)
      For estimating Θ, in principle, a closed form solution is
       possible for all exponential family distributions.
      We first consider a special case when the component
       distributions are univariate Gaussians.
     The update equations for Θ (the co-cluster Gaussian means and
      variances) take the weighted form sketched below.
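       A sketch of the natural weighted-average form of these
       updates, with weights given by the variational co-cluster
       responsibilities (δ_{uv} marks observed entries):

           \mu_{ij} = \frac{\sum_{u,v} \delta_{uv}\, \phi_{1ui}\, \phi_{2vj}\, x_{uv}}
                           {\sum_{u,v} \delta_{uv}\, \phi_{1ui}\, \phi_{2vj}}, \qquad
           \sigma_{ij}^2 = \frac{\sum_{u,v} \delta_{uv}\, \phi_{1ui}\, \phi_{2vj}\, (x_{uv} - \mu_{ij})^2}
                                {\sum_{u,v} \delta_{uv}\, \phi_{1ui}\, \phi_{2vj}}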




Parameter Estimation(3/3)
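       For a general exponential family p_ψ, the same weights appear
       through moment matching of the sufficient statistic T(x); a
       sketch of the standard M-step form:

           \mathbb{E}_{p_\psi(\cdot \mid \theta_{ij})}[T(x)]
             = \frac{\sum_{u,v} \delta_{uv}\, \phi_{1ui}\, \phi_{2vj}\, T(x_{uv})}
                    {\sum_{u,v} \delta_{uv}\, \phi_{1ui}\, \phi_{2vj}}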




EM Algorithm(1/2)
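       In outline, the variational EM loop alternates the two steps
       above until the lower bound L converges: the E-step iterates
       the φ and γ updates for all rows and columns; the M-step
       applies the damped Newton updates to (α1, α2) and the
       closed-form updates to Θ.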




EM Algorithm(2/2)




Experiments and results
Simulated data set(1/2)
      Three 80 × 100 data matrices are generated with 4 row
       clusters and 5 column clusters, i.e., 20 co-clusters in total.
      Each co-cluster generates a 20 × 20 submatrix.
     We use Gaussian, Bernoulli, and Poisson as the generative
      models for the three data matrices respectively.
       Each submatrix is generated from its generative model with a
        predefined parameter, which is set to be different for different
        submatrices.
     For each data matrix, we do semi-supervised initialization by
      using 5% of the data in each co-cluster.

Simulated data set(2/2)
      Cluster accuracy (CA) for
       rows/columns is defined as:
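       Putting the definitions below into a formula (the standard
       cluster-accuracy measure):

           \mathrm{CA} = \frac{1}{n} \sum_{i=1}^{k} n_{c_i}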


       where n is the number of rows/columns, k is the number of
        row/column clusters, and nci is, for the ith resulting
        row/column cluster, the largest number of its rows/columns
        that fall into the same true cluster.

Real data set(1/2)
     Three real datasets are used in our experiments: Movielens,
      Foodmart, and Jester.
       (a) Movielens: MovieLens is a movie recommendation dataset
        created by the GroupLens Research Project.
         It contains 100,000 ratings (1-5, with 5 the best) in a sparse
          data matrix for 1682 movies rated by 943 users.
         We also construct a binarized dataset such that entries whose
          ratings are higher than 3 become 1 and the others become 0.




Real data set(2/2)
       (b) Jester: Jester is a joke rating dataset.
         The original dataset contains 4.1 million continuous ratings
          (-10 to +10, with +10 the best) of 100 jokes from 73,421 users.
          We pick 1000 users who rated all 100 jokes and use this dense
          data matrix in our experiment.
         We also binarize the dataset such that non-negative entries
          become 1 and negative entries become 0.
       (c) Foodmart: Foodmart data comes with Microsoft SQL Server.
         It contains transaction data for a fictitious retailer. There
           are 164,558 sales records in a sparse data matrix for 7803
           customers and 1559 products. Each record is the number of
           products bought by the customer.
         Again, we binarize the dataset such that entries whose number
           of products is below the median become 0 and the others 1.



Methodology for real data(1/3)
     There are two steps in our evaluation:
         (a) Combine training and test data together and do inference (E-step)
          to obtain variational parameters;
         (b) Use model parameters and variational parameters to obtain the
          perplexity on the test set.
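       A sketch of the standard definition being used, with N the
       number of (non-missing) entries in the set evaluated:

           \mathrm{perplexity}(X) = \exp\big( -\log p(X) / N \big)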




         Perplexity decreases monotonically as log-likelihood
           increases, so lower perplexity is better:
           a higher log-likelihood on the training set means the model
            fits the data better,
           and a higher log-likelihood on the test set means the model
            explains unseen data better.


Methodology for real data(2/3)
      Let Xtrain and Xtest be the original training and test sets
         respectively.




      The comparison of BCC with LDA is done only on binarized
         data sets since LDA is not designed to handle real values.


Methodology for real data(3/3)
      We compare BCC with BNB and LDA in terms of perplexity
       and prediction performance.
      To ensure a fair comparison, we do not use simulated
       annealing for BCC in these experiments because there is no
       simulated annealing in BNB and LDA either.




Results
      In this section, we present three main experimental results:
        (a) Perplexity comparison among BCC, BNB and LDA;
        (b) The prediction performance comparison between BCC and
         LDA;
        (c) The visualization obtained from BCC.




Perplexity comparison(1/4)
     We compare the perplexity of BCC, BNB and LDA, varying the
      number of row clusters from 5 to 25 in steps of 5, with the
      number of column clusters for BCC fixed at 20, 10 and 5 for
      Movielens, Foodmart and Jester respectively.
     The results are reported as the average perplexity over
      10-fold cross-validation in Figures 4, 5 and Table 3.




Perplexity comparison(2/4)




       [Figures 4 and 5: perplexity plots for BCC, BNB and LDA.]
Perplexity comparison(3/4)
      (a) For BCC and LDA,
        the perplexities of BCC on both training and test sets are 2-3
         orders of magnitude lower than those of LDA,
        and the paired t-test shows that the distinction is statistically
         significant with an extremely small p-value.
        The lower perplexity of BCC seems to indicate that BCC fits and
         explains the data substantially better than LDA.
        However, one must be careful in drawing such conclusions since BCC
         and LDA work on different variants of the data.
      (b) For BCC and BNB,
        although BNB sometimes has a lower perplexity than BCC on training
         sets, on test sets, the perplexities of BCC are lower than BNB in all
         cases.
        Again, the difference is significant based on the paired t-test.
        BNB’s high perplexities on test sets indicate over-fitting, especially on
         the original Movielens data.
Perplexity comparison(4/4)
      In comparison, BCC behaves much better than BNB on test
       sets, possibly because of two reasons:
        (i) BCC uses far fewer variational parameters than BNB,
         which helps it avoid overfitting;
        (ii) BCC is able to capture the co-cluster structure that is
         missing in BNB.




Prediction comparison(1/5)
     We only compare prediction on the binarized data, which is a
      reasonable simplification: in real recommendation systems, we
      usually only need to know whether the user likes (1) the
      movie/product/joke or not (0) to decide whether to recommend it.
     To add noise to the binarized data, we flip entries from 1 to 0
      and from 0 to 1.




Prediction comparison(2/5)
      As shown in Figure 6, all three lines go up steadily with an
       increasing percentage of test data modified.
      This is a surprisingly good result, implying that our model is
       able to detect increasing noise and convey the message
       through increasing perplexities.




Prediction comparison(3/5)




       [Perplexity-versus-noise curves for BCC and LDA, discussed on
       the next slide.]
Prediction comparison(4/5)
      In both figures, the first row is for adding noise at steps of 0.01%
       and the second row is for adding noise at steps of 0.1%.
      The trends of the perplexity curves show the prediction
       performance.
     When the perplexity curves go up and down from time to time, it
      means that the model sometimes fits the data with more noise
      better than the data with less noise, indicating a lower
      prediction accuracy.
      The decreasing perplexity with addition of noise also indicates
       LDA does not have a good prediction performance on Movielens.



Prediction comparison(5/5)
      Given a binary dataset, BCC works on all non-missing entries,
       but LDA only works on the entries with value 1. Therefore,
       BCC and LDA actually work on different data, and hence
       their perplexities cannot be compared directly.
     However, the prediction results show that BCC indeed does much
      better than LDA, no matter which part of the dataset each
      works on.




Visualization(1/5)
     Figure 9 is an example of user-movie co-clusters on Movielens.
     There are 10 × 20 sub-blocks, corresponding to 10 user clusters
      and 20 movie clusters.
     A darker sub-block indicates a larger parameter of the Bernoulli
      distribution.
         Since the parameter of a Bernoulli distribution is the
          probability of generating an outcome 1 (a rating of 4 or 5),
         the darker the sub-block, the more the corresponding movie
          cluster is preferred by the user cluster.
Visualization(2/5)
      The variational parameters φ1, with dimension k1 for rows,
       and φ2, with dimension k2 for columns, give a low-
       dimensional representation for all the row and column
       objects.
      We call the low-dimensional vectors φ1 and φ2 a “co-
       embedding” since they are two inter-dependent low-
       dimensional representations of the row and column objects
       derived from the original data matrix.
      To visualize the co-embedding, we apply ISOMAP [21] on φ1
       and φ2 to further reduce the space to 2 dimensions.


Visualization(3/5)
      In Figure 10(a) and 10(c), we mark four users and four
       movies, and extract their “signatures”.
     In our experiment, we do the following: for each user, we count
      the movies she rates "1" in each of the movie clusters 1-20 and
      normalize the counts to obtain her signature.
      The results of co-embedding for users and movies on
       binarized Movielens are shown in Figure 10(a) and 10(c).




Visualization(4/5)




       [Figure 10: 2-D co-embedding plots and signatures for users
       and movies on binarized Movielens.]
Visualization(5/5)
      Each point in the figure denotes one user/movie. We mark
       three clusters with red, blue and green for users and movies
       respectively; other points are colored pink.
      The numbers on the right are user/movie IDs corresponding
       to those marked points in co-embedding plots, showing
       where they are located.
     We can see that each signature is quite different from the
      others in terms of the value of each component.




Conclusion(1/2)
      In this paper, we have proposed Bayesian co-clustering (BCC)
       which views co-clustering as a generative mixture modeling
       problem.
      BCC inherits the strengths and robustness of Bayesian
       modeling, is designed to work with sparse matrices, and can
       use any exponential family distribution as the generative
       model, thereby making it suitable for a wide range of
       matrices.




Conclusion(2/2)
      Unlike existing partitional co-clustering algorithms, BCC
       generates mixed memberships for rows and columns, which
       seem more appropriate for a variety of applications.
      A key advantage of the proposed variational approximation
       approach for BCC is that it is expected to be significantly
       faster than a stochastic approximation based on sampling,
       making it suitable for large matrices in real-life applications.
      Finally, the co-embedding obtained from BCC can be
       effectively used for visualization, subsequent predictive
       modeling, and decision making.


Thank you for listening.
