EXPERT SYSTEMS AND SOLUTIONS
Email: expertsyssol@gmail.com, expertsyssol@yahoo.com
Cell: 9952749533
www.researchprojects.info
PAIYANOOR, OMR, CHENNAI

Call for research projects: final-year students of B.E. in EEE, ECE, EI; M.E. (Power Systems), M.E. (Applied Electronics), M.E. (Power Electronics); Ph.D. Electrical and Electronics. Students can assemble their hardware in our research labs. Experts will be guiding the projects.
Classification of Microarray
   Gene Expression Data

                Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience
                 University of Queensland
Institute for Molecular Bioscience,
University of Queensland
“A wide range of supervised and unsupervised learning methods have been considered to better organize data, be it to infer coordinated patterns of gene expression, to discover molecular signatures of disease subtypes, or to derive various predictions.”

Statistical Methods for Gene Expression: Microarrays and Proteomics
Outline of Talk
• Introduction

• Supervised classification of tissue
  samples – selection bias

• Unsupervised classification
  (clustering) of tissues – mixture
  model-based approach
Vital Statistics
by C. Tilstone
Nature 424, 610-612, 2003.

“DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope?”

Branching out: cluster analysis can group samples that show similar patterns of gene expression.
MICROARRAY DATA
REPRESENTED by a p × n matrix

   (x1, ..., xn)

xj contains the gene expressions for the p genes of the jth tissue sample (j = 1, ..., n).

p = no. of genes (10^3 – 10^4)
n = no. of tissue samples (10 – 10^2)

STANDARD STATISTICAL METHODOLOGY APPROPRIATE FOR n >> p
HERE p >> n
Two Groups in Two Dimensions. All cluster information would
 be lost by collapsing to the first principal component. The
principal ellipses of the two groups are shown as solid curves.
bioArray News (2, no. 35, 2002)
Arrays Hold Promise for Cancer Diagnostics
Oncologists would like to use arrays to predict whether or not a cancer is going to spread in the body, how likely it is to respond to a certain type of treatment, and how long the patient will probably survive.
It would be useful if the gene expression
signatures could distinguish between subtypes of
tumours that standard methods, such as
histological pathology from a biopsy, fail to
discriminate, and that require different treatments.
van’t Veer & De Jong (2002, Nature Medicine 8)
The microarray way to tailored cancer treatment
   In principle, gene activities that determine the
   biological behaviour of a tumour are more
   likely to reflect its aggressiveness than general
   parameters such as tumour size and age of the
   patient.
(indistinguishable disease states in diffuse large B-cell
lymphoma unravelled by microarray expression profiles
– Shipp et al., 2002, Nature Med. 8)
Microarray to be used as routine
clinical screen
by C. M. Schubert
Nature Medicine
9, 9, 2003.



 The Netherlands Cancer Institute in Amsterdam is to become the first institution
 in the world to use microarray techniques for the routine prognostic screening of
 cancer patients. Aiming for a June 2003 start date, the center will use a panoply
 of 70 genes to assess the tumor profile of breast cancer patients and to
 determine which women will receive adjuvant treatment after surgery.
Microarrays also to be used in the
prediction of breast cancer by Mike West
(Duke University) and the Koo
Foundation Sun Yat-Sen Cancer Centre,
Taipei

   Huang et al. (2003, The Lancet, Gene
   expression predictors of breast cancer).
CLASSIFICATION OF TISSUES
          SUPERVISED CLASSIFICATION
           (DISCRIMINANT ANALYSIS)
 We OBSERVE the CLASS LABELS y1, …, yn where
 yj = i if jth tissue sample comes from the ith class
 (i=1,…,g).
AIM: TO CONSTRUCT A CLASSIFIER C(x) FOR
PREDICTING THE UNKNOWN CLASS LABEL y
OF A TISSUE SAMPLE x.
  e.g.    g = 2 classes   G1 - DISEASE-FREE
                          G2 - METASTASES
LINEAR CLASSIFIER

FORM

   C(x) = β0 + β^T x
        = β0 + β1 x1 + ... + βp xp

for the prediction of the group label y of a future entity with feature vector x.
FISHER’S LINEAR DISCRIMINANT FUNCTION

   y = sign C(x)

where

   β = S^{-1}(x̄1 − x̄2)

   β0 = −½ (x̄1 + x̄2)^T S^{-1}(x̄1 − x̄2)

and x̄1, x̄2, and S are the sample means and pooled sample covariance matrix found from the training data.
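As a concrete illustration, here is a minimal numpy sketch of Fisher's rule (the function and variable names are assumptions made for the example). In the p >> n setting of microarray data the pooled covariance matrix S is singular, so in practice the genes would first be reduced to a small subset, as discussed later.

```python
import numpy as np

def fisher_rule(X1, X2):
    """Fit Fisher's linear discriminant function from two training samples.
    X1, X2: arrays of shape (n1, p) and (n2, p) for groups G1 and G2."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled (within-group) sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, xbar1 - xbar2)   # beta = S^{-1}(xbar1 - xbar2)
    beta0 = -0.5 * (xbar1 + xbar2) @ beta      # cut-off midway between the group means
    return beta0, beta

def C(x, beta0, beta):
    # allocate x to G1 if sign C(x) = +1, to G2 otherwise
    return beta0 + beta @ x
```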
SUPPORT VECTOR CLASSIFIER
Vapnik (1995)

   C(x) = β0 + β1 x1 + ... + βp xp

where β0 and β are obtained as follows:

   min over β, β0 of   ½ ||β||² + γ Σ_{j=1}^n ξj

   subject to  ξj ≥ 0,   yj C(xj) ≥ 1 − ξj   (j = 1, ..., n)

ξ1, ..., ξn relate to the slack variables;
γ = ∞ corresponds to the separable case.
   β̂ = Σ_{j=1}^n α̂j yj xj

with non-zero α̂j only for those observations j for which the constraints are exactly met (the support vectors).

   C(x) = Σ_{j=1}^n α̂j yj xj^T x + β̂0
        = Σ_{j=1}^n α̂j yj ⟨xj, x⟩ + β̂0
Support Vector Machine (SVM)

REPLACE x by h(x):

   C(x) = Σ_{j=1}^n α̂j yj ⟨h(xj), h(x)⟩ + β̂0
        = Σ_{j=1}^n α̂j yj K(xj, x) + β̂0

where the kernel function K(xj, x) = ⟨h(xj), h(x)⟩ is the inner product in the transformed feature space.
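To make the kernel substitution concrete, a minimal sketch using scikit-learn's SVC on simulated data (scikit-learn and the toy data are assumptions; the talk is not tied to any particular software). The penalty parameter C here plays the role of γ in the formulation above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))          # toy n x p matrix with p >> n
y = np.repeat([-1, 1], 20)
X[y == 1] += 0.5                        # shift one group so the classes differ

# linear support vector classifier: C(x) = beta0 + beta^T x
lin = SVC(kernel="linear", C=1.0).fit(X, y)

# kernel trick: replace x_j^T x by K(x_j, x), e.g. a radial basis function
rbf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)

print(lin.n_support_)                   # number of support vectors per class
print(lin.decision_function(X[:3]))     # values of C(x) for the first three samples
```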
HASTIE et al. (2001, Chapter 12)

The Lagrange (primal) function is

   LP = ½ ||β||² + γ Σ_{j=1}^n ξj − Σ_{j=1}^n αj [yj C(xj) − (1 − ξj)] − Σ_{j=1}^n λj ξj        (1)

which we minimize w.r.t. β, β0, and the ξj.

Setting the respective derivatives to zero, we get

   β = Σ_{j=1}^n αj yj xj          (2)

   0 = Σ_{j=1}^n αj yj             (3)

   αj = γ − λj   (j = 1, ..., n)   (4)

with αj ≥ 0, λj ≥ 0, and ξj ≥ 0 (j = 1, ..., n).

By substituting (2) to (4) into (1), we obtain the Lagrangian dual function

   LD = Σ_{j=1}^n αj − ½ Σ_{j=1}^n Σ_{k=1}^n αj αk yj yk xj^T xk        (5)

We maximize (5) subject to 0 ≤ αj ≤ γ and Σ_{j=1}^n αj yj = 0.

In addition to (2) to (4), the constraints include

   αj [yj C(xj) − (1 − ξj)] = 0        (6)
   λj ξj = 0                           (7)
   yj C(xj) − (1 − ξj) ≥ 0             (8)

for j = 1, ..., n.

Together, equations (2) to (8) uniquely characterize the solution to the primal and dual problems.
Leo Breiman (2001)
           Statistical modeling:
     the two cultures (with discussion).
      Statistical Science 16, 199-231.

Discussants include Brad Efron and David Cox
Selection bias in gene extraction on the
  basis of microarray gene-expression data
                 Ambroise and McLachlan


Proceedings of the National Academy of Sciences
      Vol. 99, Issue 10, 6562-6566, May 14, 2002

 http://www.pnas.org/cgi/content/full/99/10/6562
GUYON, WESTON, BARNHILL & VAPNIK
      (2002, Machine Learning)

• COLON Data (Alon et al., 1999)


• LEUKAEMIA Data (Golub et al., 1999)
Since p>>n, consideration given to
selection of suitable genes


SVM: FORWARD or BACKWARD (in terms of
     magnitude of weight βi)
 RECURSIVE FEATURE ELIMINATION (RFE)


FISHER: FORWARD ONLY (in terms of CVE)
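As an illustration of backward elimination by the magnitude of the weights βi, a sketch using scikit-learn's RFE wrapped around a linear SVC (an assumed implementation, not the code of Guyon et al.):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(1)
X = rng.normal(size=(62, 2000))            # toy stand-in for the colon data
y = np.repeat([0, 1], [40, 22])

# Recursive feature elimination: repeatedly drop the genes whose weights
# beta_i have the smallest magnitude, here halving the gene set at each step.
svc = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svc, n_features_to_select=4, step=0.5).fit(X, y)

selected = np.where(rfe.support_)[0]       # indices of the retained genes
print(selected)
```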
GUYON et al. (2002)

LEUKAEMIA DATA:
    Only 2 genes are needed to obtain a zero
         CVE (cross-validated error rate)


COLON DATA:
    Using only 4 genes, CVE is 2%
GUYON et al. (2002)

“The success of the RFE indicates that RFE has a
  built in regularization mechanism that we do not
  understand yet that prevents overfitting the
  training data in its selection of gene subsets.”
Figure 1: Error rates of the SVM rule with RFE procedure
 averaged over 50 random splits of colon tissue samples
Figure 2: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of leukemia tissue samples
Figure 3: Error rates of Fisher’s rule with stepwise forward
        selection procedure using all the colon data
Figure 4: Error rates of Fisher’s rule with stepwise forward
      selection procedure using all the leukemia data
Figure 5: Error rates of the SVM rule averaged over 20 noninformative
 samples generated by random permutations of the class labels of the
                           colon tumor tissues
Error Rate Estimation
Suppose there are two groups G1 and G2

C(x) is a classifier formed from the
data set
   (x1, x2, ..., xn)
The apparent error is the proportion of
the data set misallocated by C(x).
Cross-Validation

From the original data set, remove x1 to give the reduced set

   (x2, x3, ..., xn).

Then form the classifier C(1)(x) from this reduced set.

Use C(1)(x1) to allocate x1 to either G1 or G2.

Repeat this process for the second data point, x2, so that this point is assigned to either G1 or G2 on the basis of the classifier C(2)(x2), formed without x2.

And so on up to xn.
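A minimal sketch of leave-one-out cross-validation in which any gene selection is redone within each fold, so the deleted point plays no part in choosing the genes (scikit-learn and the simple F-statistic filter are assumptions used only for illustration). Selecting the genes once on all n samples and then cross-validating only the classifier is what produces the selection bias discussed above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 500))            # toy expression matrix, n << p
y = np.repeat([0, 1], 15)

# gene selection (here: top 10 genes by F-statistic) is part of the pipeline,
# so it is repeated on each leave-one-out training set of size n - 1
rule = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="linear"))
acc = cross_val_score(rule, X, y, cv=LeaveOneOut())
print("cross-validated error rate:", 1 - acc.mean())
```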
Figure 1: Error rates of the SVM rule with RFE procedure
 averaged over 50 random splits of colon tissue samples
ADDITIONAL REFERENCES

Selection bias ignored:
    XIONG et al. (2001, Molecular Genetics and Metabolism)
    XIONG et al. (2001, Genome Research)
    ZHANG et al. (2001, PNAS)

Aware of selection bias:
      SPANG et al. (2001, In Silico Biology)
      WEST et al. (2001, PNAS)
      NGUYEN and ROCKE (2002)
BOOTSTRAP APPROACH

Efron’s (1983, JASA) .632 estimator

   B.632 = .368 × AE + .632 × B1

where B1 is the bootstrap error when the rule R*_k is applied to a point not in the training sample.

A Monte Carlo estimate of B1 is

   B1 = Σ_{j=1}^n Ej / n

where

   Ej = Σ_{k=1}^K Ijk Qjk / Σ_{k=1}^K Ijk

with

   Ijk = 1 if xj ∉ kth bootstrap sample, 0 otherwise
   Qjk = 1 if R*_k misallocates xj, 0 otherwise.
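A rough Monte Carlo sketch of B1 and the .632 estimator following the definitions above (the choice of a linear SVC as the rule R*_k and the toy inputs are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def b632(X, y, K=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    AE = np.mean(SVC(kernel="linear").fit(X, y).predict(X) != y)   # apparent error
    Q = np.zeros(n)      # running sum of Q_jk over bootstrap samples omitting x_j
    I = np.zeros(n)      # running sum of I_jk
    for k in range(K):
        boot = rng.integers(0, n, size=n)                  # kth bootstrap sample
        out = np.setdiff1d(np.arange(n), boot)             # points x_j not in the sample
        if out.size == 0 or np.unique(y[boot]).size < 2:
            continue
        rule = SVC(kernel="linear").fit(X[boot], y[boot])  # the rule R*_k
        Q[out] += (rule.predict(X[out]) != y[out])
        I[out] += 1
    B1 = np.mean(Q[I > 0] / I[I > 0])                      # average of the E_j
    return 0.368 * AE + 0.632 * B1
```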
Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR

   A(w) = (1 − w) AE + w CV2E,

where w = 0.5.

McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups.

The value of w0 was found to range between 0.6 and 0.7, depending on the values of p, Δ, and n1/n2.
.632+ estimate of Efron & Tibshirani (1997, JASA)

   B.632+ = (1 − w) AE + w B1

where

   w = .632 / (1 − .368 r)

   r = (B1 − AE) / (γ − AE)        (relative overfitting rate)

   γ = Σ_{i=1}^g pi (1 − qi)        (estimate of the no-information error rate)

If r = 0, then w = .632, and so B.632+ = B.632;
if r = 1, then w = 1, and so B.632+ = B1.
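Continuing the sketch, the .632+ combination can be computed directly from AE, B1 and the observed class and assignment proportions; the two-group form of γ and the simple clipping of r to [0, 1] are simplifications made for illustration.

```python
def b632_plus(AE, B1, p1, q1):
    """p1: observed proportion of group G1; q1: proportion of samples the
    rule assigns to G1 (two-group case, illustrative inputs)."""
    gamma = p1 * (1 - q1) + (1 - p1) * q1        # no-information error rate
    r = (B1 - AE) / (gamma - AE) if gamma > AE else 0.0
    r = min(max(r, 0.0), 1.0)                    # relative overfitting rate in [0, 1]
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * AE + w * B1
```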
One concern is the heterogeneity of the tumours
 themselves, which consist of a mixture of normal
 and malignant cells, with blood vessels in between.
 Even if one pulled out some cancer cells from a
 tumour, there is no guarantee that those are the cells
 that are going to metastasize, just because tumours
 are heterogeneous.
“What we really need are expression profiles from
hundreds or thousands of tumours linked to relevant,
and appropriate, clinical data.”
                                 John Quackenbush
UNSUPERVISED CLASSIFICATION
      (CLUSTER ANALYSIS)

   INFER CLASS LABELS y1, …, yn of x1, …, xn


Initially, hierarchical distance-based methods
of cluster analysis were used to cluster the
tissues and the genes

Eisen, Spellman, Brown, & Botstein (1998, PNAS)
Hierarchical (agglomerative) clustering algorithms
 are largely heuristically motivated and there exist a
 number of unresolved issues associated with their
 use, including how to determine the number of
 clusters.
    “in the absence of a well-grounded statistical
    model, it seems difficult to define what is
    meant by a ‘good’ clustering algorithm or the
    ‘right’ number of clusters.”
(Yeung et al., 2001, Model-Based Clustering and Data Transformations
for Gene Expression Data, Bioinformatics 17)
Attention is now turning towards a model-based
approach to the analysis of microarray data
For example:
• Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9.

• Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18.

• Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7.

• Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3.

• Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17.
The notion of a cluster is not easy to define.
There is a very large literature devoted to
clustering when there is a metric known in
advance; e.g. k-means. Usually, there is no a
priori metric (or equivalently a user-defined
distance matrix) for a cluster analysis.
That is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
In this case, one attractive feature of
adopting mixture models with elliptically
symmetric components such as the normal
or t densities, is that the implied clustering
is invariant under affine transformations of
the data (that is, under operations relating
to changes in location, scale, and rotation
of the data).
Thus the clustering process does not
depend on irrelevant factors such as the
units of measurement or the orientation of
the clusters in space.
   x = (Height, Weight, BP)^T   →   (H + W, H − W, BP)^T
MIXTURE OF g NORMAL COMPONENTS

   f(x) = π1 φ(x; μ1, Σ1) + ... + πg φ(x; μg, Σg)

where

   −2 log φ(x; μ, Σ) = (x − μ)^T Σ^{-1} (x − μ) + constant

   (x − μ)^T Σ^{-1} (x − μ)     MAHALANOBIS DISTANCE

   (x − μ)^T (x − μ)            EUCLIDEAN DISTANCE
MIXTURE OF g NORMAL COMPONENTS

   f(x) = π1 φ(x; μ1, Σ1) + ... + πg φ(x; μg, Σg)

k-means:

   Σ1 = ... = Σg = σ² I

SPHERICAL CLUSTERS
Equal spherical covariance matrices
Crab Data

Figure 6: Plot of Crab Data

Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.
With a mixture model-based approach to
clustering, an observation is assigned
outright to the ith cluster if its density in
the ith component of the mixture
distribution (weighted by the prior
probability of that component) is greater
than in the other (g-1) components.

   f(x) = π1 φ(x; μ1, Σ1) + ... + πi φ(x; μi, Σi) + ... + πg φ(x; μg, Σg)
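As an illustration of this outright (maximum a posteriori) assignment, a small sketch with scikit-learn's GaussianMixture on simulated data; scikit-learn is substituted here purely to keep the example compact and is not the EMMIX software used in the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])      # two toy clusters

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)

tau = gmm.predict_proba(X)      # posterior probabilities pi_i phi(x; mu_i, Sigma_i) / f(x)
labels = gmm.predict(X)         # each x assigned outright to the component with the
                                # largest weighted density (largest posterior)
print(tau[:3].round(3), labels[:10])
```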
http://www.maths.uq.edu.au/~gjm

  McLachlan and Peel (2000),
  Finite Mixture Models. Wiley.
Estimation of Mixture Distributions
 It was the publication of the seminal paper of
 Dempster, Laird, and Rubin (1977) on the
 EM algorithm that greatly stimulated interest
 in the use of finite mixture distributions to
 model heterogeneous data.

 McLachlan and Krishnan (1997, Wiley)
• If need be, the normal mixture model can
be made less sensitive to outlying
observations by using t component densities.

• With this t mixture model-based approach,
the normal distribution for each component
in the mixture is embedded in a wider class
of elliptically symmetric distributions with an
additional parameter called the degrees of
freedom.
The advantage of the t mixture model is that,
although the number of outliers needed for
breakdown is almost the same as with the
normal mixture model, the outliers have to
be much larger.
Two Clustering Problems:
• Clustering of genes on basis of tissues –
      genes not independent

• Clustering of tissues on basis of genes -
    latter is a nonstandard problem in
   cluster analysis (n << p)
Mixture Software
McLachlan, Peel, Adams, and Basford (1999)
  http://www.maths.uq.edu.au/~gjm/emmix/emmix.html
EMMIX for Windows




http://www.maths.uq.edu.au/~gjm/EMMIX_Demo/emmix.html
PROVIDES A MODEL-BASED
             APPROACH TO CLUSTERING

     McLachlan, Bean, and Peel, 2002, A Mixture Model-
      Based Approach to the Clustering of Microarray
        Expression Data, Bioinformatics 18, 413-422


http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
Example: Microarray Data
Colon Data of Alon et al. (1999)
    n=62 (40 tumours; 22 normals)
    tissue samples of
    p=2,000 genes in a
    2,000 × 62 matrix.
Mixture of 2 normal components
Mixture of 2 t components
Mixture of 2 t components
Mixture of 3 t components
In this process, the genes are being treated
anonymously.

May wish to incorporate existing biological
information on the function of genes into
the selection procedure.
Lottaz and Spang (2003, Proceedings of 54th Meeting of the ISI)

They structure the feature space by using a functional grid
provided by the Gene Ontology annotations.
Clustering of COLON Data
Genes using EMMIX-GENE
Grouping for Colon Data (heat maps of groups 1–20)
Clustering of COLON Data
Tissues using EMMIX-GENE
Grouping for Colon Data (heat maps of groups 1–20)
Mixtures of Factor Analyzers
A normal mixture model without restrictions
on the component-covariance matrices may
be viewed as too general for many situations
in practice, in particular, with high
dimensional data.

One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).
   f(xj) = Σ_{i=1}^g πi φ(xj; μi, Σi),

where

   Σi = Bi Bi^T + Di   (i = 1, ..., g),

Bi is a p × q matrix of factor loadings and Di is a diagonal matrix.
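A small numpy sketch of the implied reduction in the number of covariance parameters (the dimensions echo the talk; the random loadings are purely illustrative):

```python
import numpy as np

p, q = 2000, 4                            # genes and factors, as in the talk
rng = np.random.default_rng(4)

B_i = rng.normal(size=(p, q))             # p x q matrix of factor loadings
D_i = np.diag(rng.uniform(0.5, 1.5, p))   # diagonal matrix of uniquenesses

Sigma_i = B_i @ B_i.T + D_i               # component-covariance matrix Sigma_i

# roughly p*q + p free covariance parameters per component, instead of p*(p+1)/2
print(p * q + p, "vs", p * (p + 1) // 2)
```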
Number of Components
    in a Mixture Model
Testing for the number of components,
g, in a mixture is an important but very
difficult problem which has not been
completely resolved.
Order of a Mixture Model
A mixture density with g components might
be empirically indistinguishable from one
with either fewer than g components or
more than g components. It is therefore
sensible in practice to approach the question
of the number of components in a mixture
model in terms of an assessment of the
smallest number of components in the
mixture compatible with the data.
Likelihood Ratio Test Statistic
An obvious way of approaching the
problem of testing for the smallest value of
the number of components in a mixture
model is to use the LRTS, -2logλ.
Suppose we wish to test the null hypothesis

   H0: g = g0   versus   H1: g = g1

for some g1 > g0.

We let Ψ̂i denote the MLE of Ψ calculated under Hi (i = 0, 1). Then the evidence against H0 will be strong if λ is sufficiently small, or equivalently, if −2 log λ is sufficiently large, where

   −2 log λ = 2{log L(Ψ̂1) − log L(Ψ̂0)}.
Bootstrapping the LRTS

McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

   H0: g = g0   versus   H1: g = g1

for a specified value of g0.
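A sketch of the resampling idea, with scikit-learn's GaussianMixture standing in for the EM fitting; the number of bootstrap replications and the random restarts are illustrative choices, not the settings of McLachlan (1987).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrts(X, g0, g1):
    """-2 log lambda for H0: g = g0 versus H1: g = g1, from EM fits."""
    n = len(X)
    L0 = GaussianMixture(g0, covariance_type="full", n_init=5, random_state=0).fit(X).score(X)
    L1 = GaussianMixture(g1, covariance_type="full", n_init=5, random_state=0).fit(X).score(X)
    return 2 * n * (L1 - L0)

def bootstrap_pvalue(X, g0, g1, B=99, seed=0):
    obs = lrts(X, g0, g1)
    # fit under H0, then simulate B samples of size n from that fitted mixture
    rs = np.random.RandomState(seed)
    fit0 = GaussianMixture(g0, covariance_type="full", n_init=5, random_state=rs).fit(X)
    count = sum(lrts(fit0.sample(len(X))[0], g0, g1) >= obs for _ in range(B))
    return (count + 1) / (B + 1)        # resampling assessment of the P-value
```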
Bayesian Information Criterion

The Bayesian information criterion (BIC) of Schwarz (1978) is given by

   −2 log L(Ψ̂) + d log n

as the penalized log likelihood criterion to be minimized in model selection, including the present situation of choosing the number of components g in a mixture model.
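A sketch of choosing g by BIC, again with scikit-learn's GaussianMixture as a stand-in for EMMIX; its bic method computes exactly the −2 log L(Ψ̂) + d log n criterion above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 1, size=(60, 2)),
               rng.normal(+2, 1, size=(60, 2))])      # toy two-component data

for g in range(1, 5):
    gmm = GaussianMixture(g, covariance_type="full", n_init=5, random_state=0).fit(X)
    print(g, round(gmm.bic(X), 1))    # smallest BIC indicates the preferred g
```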
Gap statistic (Tibshirani et al., 2001)



Clest (Dudoit and Fridlyand, 2002)
Analysis of LEUKAEMIA Data
    using EMMIX-GENE
Grouping for Leukemia Data (heat maps of groups 1–40)
Breast cancer data set in van’t Veer et al.
(van’t Veer et al., 2002, Gene Expression Profiling Predicts
Clinical Outcome Of Breast Cancer, Nature 415)
  These data were the result of microarray experiments
  on three patient groups with different classes of
  breast cancer tumours.

  The overall goal was to identify a set of genes that
  could distinguish between the different tumour
  groups based upon the gene expression information
  for these groups.
The Economist (US), February 2, 2002
   The chips are down; Diagnosing
   breast cancer (Gene chips have shown
   that there are two sorts of breast cancer)
Nature (2002, 4 July Issue, 418)

News feature (Ball)

Data visualization: Picture this
Colour-coded: this plot of gene-expression data
 shows breast tumours falling into two groups
Microarray data from 98 patients with
primary breast cancers with p = 24,881
genes
• 44 from good prognosis group
  (remained metastasis free after a period
  of more than 5 years)
• 34 from poor prognosis group (developed
  distant metastases within 5 years)
• 20 with hereditary form of cancer
  (18 with BRCA1; 2 with BRCA2)
Pre-processing filter of van’t Veer et al.

Only genes with both:
   • a P-value less than 0.01; and
   • at least a two-fold difference in more than 5 out of the 98 tissues
were retained.

This reduces the data set to 4,869 genes.
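A sketch of one reading of this filter, assuming the log10 expression ratios and per-measurement P-values are available as genes × tissues arrays (the array names and file layout are hypothetical; the thresholds are those quoted above):

```python
import numpy as np

# hypothetical inputs: one row per gene, one column per tissue (24,881 x 98)
logratio = np.load("logratio.npy")     # log10 expression ratios
pval = np.load("pval.npy")             # P-values for significant regulation

# a measurement counts if it is significant (P < 0.01) AND at least two-fold
informative = (pval < 0.01) & (np.abs(logratio) >= np.log10(2))

# keep genes informative in more than 5 of the 98 tissues
keep = informative.sum(axis=1) > 5
print(keep.sum(), "genes retained")    # 4,869 genes in van't Veer et al.
```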
Heat Map Displaying the Reduced Set of 4,869 Genes
         on the 98 Breast Cancer Tumours
Unsupervised Classification Analysis Using
               EMMIX-GENE
Steps used in the application of EMMIX-GENE:
  1. Select the most relevant genes from this filtered set
     of 4,869 genes. The set of retained genes is thus
     reduced to 1,867.

 2. Cluster these 1,867 genes into forty groups. The
    majority of gene groups produced were reasonably
    cohesive and distinct.

 3. Using these forty group means, cluster the tissue
    samples into two and three components using a
    mixture of factor analyzers model with q = 4 factors.
Heat Map of Top 1,867 Genes
(Heat maps of the 40 gene groups, numbered 1–40)
i    mi   Ui      i   mi    Ui       i mi     Ui        i mi     Ui
1    146 112.98   11   66   25.72    21 44    13.77     31 53    9.84
2    93   74.95   12   38   25.45    22 30    13.28     32 36    8.95
3    61   46.08   13   28   25.00    23 25    13.10     33 36    8.89
4    55   35.20   14   53   21.33    24 67    13.01     34 38    8.86
5    43   30.40   15   47   18.14    25 12    12.04     35 44    8.02
6    92   29.29   16   23   18.00    26 58    12.03     36 56    7.43
7    71   28.77   17   27   17.62    27 27    11.74     37 46    7.21
8    20   28.76   18   45   17.51    28 64    11.61     38 19    6.14
9    23   28.44   19   80   17.28    29 38    11.38     39 29    4.64
10   23   27.73   20   55   13.79    30 21    10.72     40 35    2.44
                                    where     i = group number
                                             mi = number in group i
                                             Ui = -2 log λi
Heat Map of Genes in Group G1
Heat Map of Genes in Group G2
Heat Map of Genes in Group G3
1. A change in gene expression is apparent between
   the sporadic (first 78 tissue samples) and hereditary
   (last 20 tissue samples) tumours.
2. The final two tissue samples (the two BRCA2
   tumours) show consistent patterns of expression.
   This expression is different from that exhibited by
   the set of BRCA1 tumours.
3. The problem of trying to distinguish between the
   two classes, patients who were disease-free after 5
   years Π1 and those with metastases within 5 years
   Π2, is not straightforward on the basis of the gene
   expressions.
Selection of Relevant Genes
We compared the genes selected by EMMIX-
GENE with those genes retained in the original
study by van’t Veer et al. (2002).
van’t Veer et al. used an agglomerative hierarchical algorithm to organise the genes into dominant gene groups. Two of these groups were highlighted in their paper, with their genes corresponding to biologically significant features.
Number of matches

             Identification of van’t Veer et al.                    Number of genes   Number with genes retained
                                                                                       by select-genes
Cluster A    containing genes co-regulated with the ER-α gene             40                   24
             (ESR1)
Cluster B    containing “co-regulated genes that are the molecular        40                   23
             reflection of extensive lymphocytic infiltrate, and
             comprise a set of genes expressed in T and B cells”

We can see that of the 80 genes identified by van’t Veer et al., only 47 are retained by the select-genes step of the EMMIX-GENE algorithm.
Comparing Clusters from Hierarchical Algorithm with those from EMMIX-GENE Algorithm

             Cluster index    Number of genes    Percentage
             (EMMIX-GENE)     matched            matched (%)
Cluster A          2                21              87.5
                   3                 2              8.33
                  14                 1              4.17
Cluster B         17                18              78.3
                  19                 1              4.35
                  21                 4              17.4

Subsets of these 47 genes appeared inside several of the 40 groups produced by the cluster-genes step of EMMIX-GENE.
Genes Retained by EMMIX-GENE Appearing in Cluster A
    (vertical blue lines indicate the three groups of tumours)
Genes Rejected by EMMIX-GENE Appearing in Cluster A
Genes Retained by EMMIX-GENE Appearing in Cluster B
Genes Rejected by EMMIX-GENE Appearing in Cluster B
Assessing the Number of Tissue Groups
To assess the number of components g to be used in
the normal mixture the likelihood ratio statistic λ
was adopted, and the resampling approach used to
assess the P-value.

By proceeding sequentially, testing the null hypothesis H0: g = g0 versus the alternative hypothesis H1: g = g0 + 1, starting with g0 = 1 and continuing until a non-significant result was obtained, it was concluded that g = 3 components were adequate for this data set.
Clustering Tissue Samples on the Basis of Gene
         Groups using EMMIX-GENE

 Tissue samples can be subdivided into two groups
 corresponding to 78 sporadic tumours and 20
 hereditary tumours.
 When the two cluster assignment of EMMIX-GENE
 is compared to this genuine grouping, only 1 of the
 20 hereditary tumour patients is misallocated,
 although 37 of the sporadic tumour patients are
 incorrectly assigned to the hereditary tumour
 cluster.
Using a mixture of factor analyzers model with q = 8
factors, we would misallocate:
       7 out of the 44 members of Π1;
       24 out of the 34 members of Π2; and
       1 of the 18 BRCA1 samples.
The misallocation rate of 24/34 for the second class,
Π2, is not surprising given both the gene expressions
as summarized in the groups of genes and that we are
classifying the tissues in an unsupervised manner
without using the knowledge of their true
classification.
Supervised Classification
When knowledge of the groups’ true classification is
used (van’t Veer et al.), the reported error rate was
approximately 50% for members of Π2 when
allowance was made for the selection bias in forming a
classifier on the basis of an optimal subset of the
genes.

Further analysis of this data set in a supervised context
confirms the difficulty in trying to discriminate
between the disease-free class Π1 and the metastases
class Π2. (Tibshirani and Efron, 2002, “Pre-Validation and Inference in
Microarrays”, Statistical Applications In Genetics And Molecular Biology 1)
Investigating Underlying Signatures With
           Other Clinical Indicators

The three clusters constructed by EMMIX-GENE were investigated in order to determine whether they followed a pattern contingent upon the clinical predictors of histological grade, angioinvasion, oestrogen receptor status, and lymphocytic infiltrate.
Microarrays have become promising
diagnostic tools for clinical applications.
However,       large-scale   screening
approaches in general and microarray
technology in particular, inescapably
lead to the challenging problem of
learning from high-dimensional data.
Hope to see you in Cairns in 2004!

Contenu connexe

Tendances

04 structured prediction and energy minimization part 1
04 structured prediction and energy minimization part 104 structured prediction and energy minimization part 1
04 structured prediction and energy minimization part 1zukun
 
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Alex (Oleksiy) Varfolomiyev
 
Prediction of Financial Processes
Prediction of Financial ProcessesPrediction of Financial Processes
Prediction of Financial ProcessesSSA KPI
 
Sparsity with sign-coherent groups of variables via the cooperative-Lasso
Sparsity with sign-coherent groups of variables via the cooperative-LassoSparsity with sign-coherent groups of variables via the cooperative-Lasso
Sparsity with sign-coherent groups of variables via the cooperative-LassoLaboratoire Statistique et génome
 
Estimation and Prediction of Complex Systems: Progress in Weather and Climate
Estimation and Prediction of Complex Systems: Progress in Weather and ClimateEstimation and Prediction of Complex Systems: Progress in Weather and Climate
Estimation and Prediction of Complex Systems: Progress in Weather and Climatemodons
 
Numerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodNumerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodAlexander Decker
 
Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031frdos
 
Csr2011 june14 14_00_agrawal
Csr2011 june14 14_00_agrawalCsr2011 june14 14_00_agrawal
Csr2011 june14 14_00_agrawalCSR2011
 
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton TensorDual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton TensorSebastian De Haro
 
Module iii sp
Module iii spModule iii sp
Module iii spVijaya79
 
NIPS2009: Sparse Methods for Machine Learning: Theory and Algorithms
NIPS2009: Sparse Methods for Machine Learning: Theory and AlgorithmsNIPS2009: Sparse Methods for Machine Learning: Theory and Algorithms
NIPS2009: Sparse Methods for Machine Learning: Theory and Algorithmszukun
 
Presentation cm2011
Presentation cm2011Presentation cm2011
Presentation cm2011antigonon
 
Chapter 3 projection
Chapter 3 projectionChapter 3 projection
Chapter 3 projectionNBER
 

Tendances (18)

Holographic Cotton Tensor
Holographic Cotton TensorHolographic Cotton Tensor
Holographic Cotton Tensor
 
04 structured prediction and energy minimization part 1
04 structured prediction and energy minimization part 104 structured prediction and energy minimization part 1
04 structured prediction and energy minimization part 1
 
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
 
Metric Embeddings and Expanders
Metric Embeddings and ExpandersMetric Embeddings and Expanders
Metric Embeddings and Expanders
 
Prediction of Financial Processes
Prediction of Financial ProcessesPrediction of Financial Processes
Prediction of Financial Processes
 
Sparsity with sign-coherent groups of variables via the cooperative-Lasso
Sparsity with sign-coherent groups of variables via the cooperative-LassoSparsity with sign-coherent groups of variables via the cooperative-Lasso
Sparsity with sign-coherent groups of variables via the cooperative-Lasso
 
Estimation and Prediction of Complex Systems: Progress in Weather and Climate
Estimation and Prediction of Complex Systems: Progress in Weather and ClimateEstimation and Prediction of Complex Systems: Progress in Weather and Climate
Estimation and Prediction of Complex Systems: Progress in Weather and Climate
 
Numerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodNumerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis method
 
Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031
 
Gz3113501354
Gz3113501354Gz3113501354
Gz3113501354
 
Csr2011 june14 14_00_agrawal
Csr2011 june14 14_00_agrawalCsr2011 june14 14_00_agrawal
Csr2011 june14 14_00_agrawal
 
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton TensorDual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
 
Intraguild mutualism
Intraguild mutualismIntraguild mutualism
Intraguild mutualism
 
Module iii sp
Module iii spModule iii sp
Module iii sp
 
NIPS2009: Sparse Methods for Machine Learning: Theory and Algorithms
NIPS2009: Sparse Methods for Machine Learning: Theory and AlgorithmsNIPS2009: Sparse Methods for Machine Learning: Theory and Algorithms
NIPS2009: Sparse Methods for Machine Learning: Theory and Algorithms
 
Presentation cm2011
Presentation cm2011Presentation cm2011
Presentation cm2011
 
Chapter 3 projection
Chapter 3 projectionChapter 3 projection
Chapter 3 projection
 
Lei
LeiLei
Lei
 

En vedette

Expert systems and solutions - B.E Projects ECE, EEE, EIE
Expert systems and solutions - B.E Projects ECE, EEE, EIEExpert systems and solutions - B.E Projects ECE, EEE, EIE
Expert systems and solutions - B.E Projects ECE, EEE, EIESenthil Kumar
 
Como ser un gran maestro
Como ser un gran maestroComo ser un gran maestro
Como ser un gran maestroWerton Bastos
 
Economic dispatch using fuzzy logic
Economic dispatch using fuzzy logicEconomic dispatch using fuzzy logic
Economic dispatch using fuzzy logicSenthil Kumar
 
2015 AT&T Developer Summit
2015 AT&T Developer Summit2015 AT&T Developer Summit
2015 AT&T Developer SummitDoug Sillars
 
Project titles for eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects
Project titles for  eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects Project titles for  eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects
Project titles for eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects Senthil Kumar
 
Case analysis pharma offshoring drgorad
Case analysis pharma offshoring  drgoradCase analysis pharma offshoring  drgorad
Case analysis pharma offshoring drgoradDeepak R Gorad
 
Carteles y Obra de Cassandre
Carteles y Obra de CassandreCarteles y Obra de Cassandre
Carteles y Obra de CassandreBrian Lurex
 
Deepak ib solved paper
Deepak ib solved paperDeepak ib solved paper
Deepak ib solved paperDeepak R Gorad
 
The Social Challenge of 1.5°C Webinar: Ioan Fazey
The Social Challenge of 1.5°C Webinar: Ioan FazeyThe Social Challenge of 1.5°C Webinar: Ioan Fazey
The Social Challenge of 1.5°C Webinar: Ioan Fazeytewksjj
 
Powerpoint maasdijkmarathon 2013
Powerpoint maasdijkmarathon 2013Powerpoint maasdijkmarathon 2013
Powerpoint maasdijkmarathon 2013pgvanderpoel
 
The Social Challenge of 1.5°C Webinar: Frank Biermann
The Social Challenge of 1.5°C Webinar: Frank BiermannThe Social Challenge of 1.5°C Webinar: Frank Biermann
The Social Challenge of 1.5°C Webinar: Frank Biermanntewksjj
 

En vedette (20)

Expert systems and solutions - B.E Projects ECE, EEE, EIE
Expert systems and solutions - B.E Projects ECE, EEE, EIEExpert systems and solutions - B.E Projects ECE, EEE, EIE
Expert systems and solutions - B.E Projects ECE, EEE, EIE
 
Embedded projects
Embedded projectsEmbedded projects
Embedded projects
 
Como ser un gran maestro
Como ser un gran maestroComo ser un gran maestro
Como ser un gran maestro
 
Calculo leithold
Calculo leitholdCalculo leithold
Calculo leithold
 
Embedded projects
Embedded projectsEmbedded projects
Embedded projects
 
Economic dispatch using fuzzy logic
Economic dispatch using fuzzy logicEconomic dispatch using fuzzy logic
Economic dispatch using fuzzy logic
 
2015 AT&T Developer Summit
2015 AT&T Developer Summit2015 AT&T Developer Summit
2015 AT&T Developer Summit
 
Project titles for eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects
Project titles for  eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects Project titles for  eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects
Project titles for eee, ece, eie - M.E, B.E, Ph.D, EEE, ECE, EIE Projects
 
Case analysis pharma offshoring drgorad
Case analysis pharma offshoring  drgoradCase analysis pharma offshoring  drgorad
Case analysis pharma offshoring drgorad
 
Carteles y Obra de Cassandre
Carteles y Obra de CassandreCarteles y Obra de Cassandre
Carteles y Obra de Cassandre
 
Mcs DRGORAD
Mcs DRGORADMcs DRGORAD
Mcs DRGORAD
 
Rainbow Salad
Rainbow SaladRainbow Salad
Rainbow Salad
 
Deepak ib solved paper
Deepak ib solved paperDeepak ib solved paper
Deepak ib solved paper
 
Entrepreneurship
EntrepreneurshipEntrepreneurship
Entrepreneurship
 
Mscc抜粋版
Mscc抜粋版Mscc抜粋版
Mscc抜粋版
 
Mechanical projects
Mechanical projectsMechanical projects
Mechanical projects
 
The Social Challenge of 1.5°C Webinar: Ioan Fazey
The Social Challenge of 1.5°C Webinar: Ioan FazeyThe Social Challenge of 1.5°C Webinar: Ioan Fazey
The Social Challenge of 1.5°C Webinar: Ioan Fazey
 
Powerpoint maasdijkmarathon 2013
Powerpoint maasdijkmarathon 2013Powerpoint maasdijkmarathon 2013
Powerpoint maasdijkmarathon 2013
 
The Social Challenge of 1.5°C Webinar: Frank Biermann
The Social Challenge of 1.5°C Webinar: Frank BiermannThe Social Challenge of 1.5°C Webinar: Frank Biermann
The Social Challenge of 1.5°C Webinar: Frank Biermann
 
Abstract8
Abstract8Abstract8
Abstract8
 

Similaire à Symmetrical2

Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...
Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...
Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...mathsjournal
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...mathsjournal
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...mathsjournal
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...mathsjournal
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...mathsjournal
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...mathsjournal
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...mathsjournal
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...mathsjournal
 
Litv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdfLitv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdfAlexander Litvinenko
 
5. cem granger causality ecm
5. cem granger causality  ecm 5. cem granger causality  ecm
5. cem granger causality ecm Quang Hoang
 
Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...tuxette
 
ISI MSQE Entrance Question Paper (2008)
ISI MSQE Entrance Question Paper (2008)ISI MSQE Entrance Question Paper (2008)
ISI MSQE Entrance Question Paper (2008)CrackDSE
 
Jam 2006 Test Papers Mathematical Statistics
Jam 2006 Test Papers Mathematical StatisticsJam 2006 Test Papers Mathematical Statistics
Jam 2006 Test Papers Mathematical Statisticsashu29
 
A direct method for estimating linear non-Gaussian acyclic models
A direct method for estimating linear non-Gaussian acyclic modelsA direct method for estimating linear non-Gaussian acyclic models
A direct method for estimating linear non-Gaussian acyclic modelsShiga University, RIKEN
 
Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingUSC
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster AnalysisSSA KPI
 

Similaire à Symmetrical2 (20)

Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...
Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...
Discretization of a Mathematical Model for Tumor-Immune System Interaction wi...
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
 
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
 
Litv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdfLitv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdf
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
 
5. cem granger causality ecm
5. cem granger causality  ecm 5. cem granger causality  ecm
5. cem granger causality ecm
 
T tests anovas and regression
T tests anovas and regressionT tests anovas and regression
T tests anovas and regression
 
Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...
 
E028047054
E028047054E028047054
E028047054
 
ISI MSQE Entrance Question Paper (2008)
ISI MSQE Entrance Question Paper (2008)ISI MSQE Entrance Question Paper (2008)
ISI MSQE Entrance Question Paper (2008)
 
KAUST_talk_short.pdf
KAUST_talk_short.pdfKAUST_talk_short.pdf
KAUST_talk_short.pdf
 
Jam 2006 Test Papers Mathematical Statistics
Jam 2006 Test Papers Mathematical StatisticsJam 2006 Test Papers Mathematical Statistics
Jam 2006 Test Papers Mathematical Statistics
 
A direct method for estimating linear non-Gaussian acyclic models
A direct method for estimating linear non-Gaussian acyclic modelsA direct method for estimating linear non-Gaussian acyclic models
A direct method for estimating linear non-Gaussian acyclic models
 
Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modeling
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 

Dernier

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
  • 1. EXPERT SYSTEMS AND SOLUTIONS Email: expertsyssol@gmail.com expertsyssol@yahoo.com Cell: 9952749533 www.researchprojects.info PAIYANOOR, OMR, CHENNAI Call For Research Projects Final year students of B.E in EEE, ECE, EI, M.E (Power Systems), M.E (Applied Electronics), M.E (Power Electronics) Ph.D Electrical and Electronics. Students can assemble their hardware in our Research labs. Experts will be guiding the projects.
  • 2. Classification of Microarray Gene Expression Data Geoff McLachlan Department of Mathematics & Institute for Molecular Bioscience University of Queensland
  • 3. Institute for Molecular Bioscience, University of Queensland
  • 4. “A wide range of supervised and unsupervised learning methods have been considered to better organize data, be it to infer coordinated patterns of gene expression, to discover molecular signatures of disease subtypes, or to derive various predictions. ” Statistical Methods for Gene Expression: Microarrays and Proteomics
  • 5. Outline of Talk • Introduction • Supervised classification of tissue samples – selection bias • Unsupervised classification (clustering) of tissues – mixture model-based approach
  • 6. Vital Statistics by C. Tilstone Nature 424, 610-612, 2003. “DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these Branching out: cluster analysis can group researchers have the samples that show similar patterns of gene statistical know-how to expression. cope?”
  • 7. MICROARRAY DATA REPRESENTED by a p × n matrix ( x1 ,  , x n ) xj contains the gene expressions for the p genes of the jth tissue sample (j = 1, …, n). p = No. of genes (103 - 104) n = No. of tissue samples (10 - 102) STANDARD STATISTICAL METHODOLOGY APPROPRIATE FOR n >> p HERE p >> n
  • 8.
  • 9. Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.
  • 10. bioArray News (2, no. 35, 2002) Arrays Hold Promise for Cancer Diagnostics Oncologists would like to use arrays to predict whether or not a cancer is going to spread in the body, how likely it will respond to a certain type of treatment, and how long the patient will probably survive. It would be useful if the gene expression signatures could distinguish between subtypes of tumours that standard methods, such as histological pathology from a biopsy, fail to discriminate, and that require different treatments.
  • 11. van’t Veer & De Jong (2002, Nature Medicine 8) The microarray way to tailored cancer treatment In principle, gene activities that determine the biological behaviour of a tumour are more likely to reflect its aggressiveness than general parameters such as tumour size and age of the patient. (indistinguishable disease states in diffuse large B-cell lymphoma unravelled by microarray expression profiles – Shipp et al., 2002, Nature Med. 8)
  • 12. Microarray to be used as routine clinical screen by C. M. Schubert Nature Medicine 9, 9, 2003. The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.
  • 13. Microarrays are also to be used in the prediction of breast cancer by Mike West (Duke University) and the Koo Foundation Sun Yat-Sen Cancer Centre, Taipei. Huang et al. (2003, The Lancet, Gene expression predictors of breast cancer).
  • 14. CLASSIFICATION OF TISSUES SUPERVISED CLASSIFICATION (DISCRIMINANT ANALYSIS) We OBSERVE the CLASS LABELS y1, …, yn where yj = i if jth tissue sample comes from the ith class (i=1,…,g). AIM: TO CONSTRUCT A CLASSIFIER C(x) FOR PREDICTING THE UNKNOWN CLASS LABEL y OF A TISSUE SAMPLE x. e.g. g = 2 classes G1 - DISEASE-FREE G2 - METASTASES
  • 15.
  • 16. LINEAR CLASSIFIER FORM: $C(x) = \beta_0 + \beta^T x = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$ for the prediction of the group label $y$ of a future entity with feature vector $x$.
  • 17. FISHER’S LINEAR DISCRIMINANT FUNCTION: $y = \operatorname{sign} C(x)$, where $\beta = S^{-1}(\bar{x}_1 - \bar{x}_2)$ and $\beta_0 = -\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)^T S^{-1}(\bar{x}_1 - \bar{x}_2)$, and $\bar{x}_1$, $\bar{x}_2$, and $S$ are the sample means and pooled sample covariance matrix found from the training data.
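A minimal numpy sketch of the rule on this slide, assuming two training matrices X1 and X2 (one row per tissue sample) are available and that the pooled covariance matrix is nonsingular; all names are illustrative.

```python
import numpy as np

def fisher_rule(X1, X2):
    """Return (beta0, beta) for Fisher's linear discriminant, using the
    sample means and the pooled sample covariance matrix."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)        # S^{-1}(xbar1 - xbar2)
    beta0 = -0.5 * (m1 + m2) @ beta           # -(1/2)(xbar1 + xbar2)^T S^{-1}(xbar1 - xbar2)
    return beta0, beta

def classify(x, beta0, beta):
    # assign to group G1 if C(x) > 0, otherwise to G2
    return 1 if beta0 + x @ beta > 0 else 2
```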
  • 18. SUPPORT VECTOR CLASSIFIER (Vapnik, 1995): $C(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, where $\beta_0$ and $\beta$ are obtained as the solution of
    $\min_{\beta,\,\beta_0} \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j$
    subject to $\xi_j \ge 0$ and $y_j C(x_j) \ge 1 - \xi_j$ $(j = 1, \ldots, n)$, where $\xi_1, \ldots, \xi_n$ are the slack variables; $\gamma = \infty$ corresponds to the separable case.
  • 19. The solution has the form $\hat{\beta} = \sum_{j=1}^{n} \hat{\alpha}_j y_j x_j$, with nonzero $\hat{\alpha}_j$ only for those observations $j$ for which the constraints are exactly met (the support vectors). Hence
    $C(x) = \sum_{j=1}^{n} \hat{\alpha}_j y_j x_j^T x + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j y_j \langle x_j, x \rangle + \hat{\beta}_0.$
  • 20. Support Vector Machine (SVM): REPLACE $x$ by $h(x)$, so that
    $C(x) = \sum_{j=1}^{n} \hat{\alpha}_j y_j \langle h(x_j), h(x) \rangle + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j y_j K(x_j, x) + \hat{\beta}_0,$
    where the kernel function $K(x_j, x) = \langle h(x_j), h(x) \rangle$ is the inner product in the transformed feature space.
  • 21. HASTIE et al. (2001, Chapter 12). The Lagrange (primal) function is
    $L_P = \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j - \sum_{j=1}^{n} \alpha_j [y_j C(x_j) - (1 - \xi_j)] - \sum_{j=1}^{n} \lambda_j \xi_j \quad (1)$
    which we minimize w.r.t. $\beta$, $\beta_0$, and the $\xi_j$. Setting the respective derivatives to zero, we get
    $\beta = \sum_{j=1}^{n} \alpha_j y_j x_j \quad (2)$
    $0 = \sum_{j=1}^{n} \alpha_j y_j \quad (3)$
    $\alpha_j = \gamma - \lambda_j \quad (j = 1, \ldots, n) \quad (4)$
    with $\alpha_j \ge 0$, $\lambda_j \ge 0$, and $\xi_j \ge 0$ $(j = 1, \ldots, n)$.
  • 22. By substituting (2) to (4) into (1), we obtain the Lagrangian dual function
    $L_D = \sum_{j=1}^{n} \alpha_j - \tfrac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k y_j y_k x_j^T x_k \quad (5)$
    We maximize (5) subject to $0 \le \alpha_j \le \gamma$ and $\sum_{j=1}^{n} \alpha_j y_j = 0$. In addition to (2) to (4), the constraints include
    $\alpha_j [y_j C(x_j) - (1 - \xi_j)] = 0 \quad (6)$
    $\lambda_j \xi_j = 0 \quad (7)$
    $y_j C(x_j) - (1 - \xi_j) \ge 0 \quad (8)$
    for $j = 1, \ldots, n$. Together, equations (2) to (8) uniquely characterize the solution to the primal and dual problems.
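As a hedged illustration of the support vector classifier described on the last few slides, the scikit-learn fit below exposes the support vectors, the products $\hat{\alpha}_j y_j$, and $\hat{\beta}_0$; the data are placeholders, not the microarray sets analysed in the talk.

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(40, 100)              # placeholder expression data (40 samples, 100 genes)
y = np.where(np.arange(40) < 20, 1, -1)   # placeholder class labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0)         # C plays the role of gamma above
clf.fit(X, y)

print(clf.support_)                       # indices of the support vectors
print(clf.dual_coef_)                     # the nonzero products alpha_j * y_j
print(clf.intercept_)                     # beta_0
# For a nonlinear SVM, replace kernel="linear" by e.g. kernel="rbf", so that
# C(x) = sum_j alpha_j y_j K(x_j, x) + beta_0 with a radial basis kernel.
```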
  • 23. Leo Breiman (2001) Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox
  • 24. Selection bias in gene extraction on the basis of microarray gene-expression data Ambroise and McLachlan Proceedings of the National Academy of Sciences Vol. 99, Issue 10, 6562-6566, May 14, 2002 http://www.pnas.org/cgi/content/full/99/10/6562
  • 25. GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning) • COLON Data (Alon et al., 1999) • LEUKAEMIA Data (Golub et al., 1999)
  • 26. Since p >> n, consideration is given to the selection of suitable genes. SVM: FORWARD or BACKWARD selection (in terms of the magnitude of the weight βi); RECURSIVE FEATURE ELIMINATION (RFE). FISHER: FORWARD selection only (in terms of the CVE).
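A sketch of SVM-based recursive feature elimination using scikit-learn's RFE wrapper; the step size and the number of retained genes are illustrative choices, not those of Guyon et al.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

X = np.random.randn(40, 100)                       # placeholder data (40 samples, 100 genes)
y = np.where(np.arange(40) < 20, 1, -1)            # placeholder class labels

svm = SVC(kernel="linear", C=1.0)                  # the linear SVM supplies the weights beta_i
selector = RFE(estimator=svm, n_features_to_select=4, step=0.1)
selector.fit(X, y)                                 # repeatedly drops the lowest-weight genes

kept_genes = selector.get_support(indices=True)    # indices of the retained genes
print(kept_genes)
```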
  • 27. GUYON et al. (2002) LEUKAEMIA DATA: Only 2 genes are needed to obtain a zero CVE (cross-validated error rate) COLON DATA: Using only 4 genes, CVE is 2%
  • 28. GUYON et al. (2002) “The success of the RFE indicates that RFE has a built in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.”
  • 29. Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples
  • 30. Figure 2: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of leukemia tissue samples
  • 31. Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data
  • 32. Figure 4: Error rates of Fisher’s rule with stepwise forward selection procedure using all the leukemia data
  • 33. Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues
  • 34. Error Rate Estimation Suppose there are two groups G1 and G2 C(x) is a classifier formed from the data set (x1, x2, x3,……………, xn) The apparent error is the proportion of the data set misallocated by C(x).
  • 35. Cross-Validation From the original data set, remove x1 to give the reduced set (x2, x3,……………, xn) Then form the classifier C(1)(x ) from this reduced set. Use C(1)(x1) to allocate x1 to either G1 or G2.
  • 36. Repeat this process for the second data point, x2. So that this point is assigned to either G1 or G2 on the basis of the classifier C(2)(x2). And so on up to xn.
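The leave-one-out procedure of the last two slides can be written directly; this sketch assumes a generic make_classifier(X, y) function returning an object with a predict method (for example, the Fisher rule or SVM sketched earlier), so all names are illustrative.

```python
import numpy as np

def leave_one_out_error(X, y, make_classifier):
    """Leave-one-out cross-validated error rate.

    Each point x_j is classified by the rule C^(j) formed from the
    remaining n - 1 points, and the errors are averaged."""
    n = len(y)
    errors = 0
    for j in range(n):
        keep = np.arange(n) != j
        rule = make_classifier(X[keep], y[keep])    # classifier formed without x_j
        if rule.predict(X[j:j + 1])[0] != y[j]:
            errors += 1
    return errors / n
```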
  • 37. Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples
  • 38. ADDITIONAL REFERENCES Selection bias ignored: XIONG et al. (2001, Molecular Genetics and Metabolism) XIONG et al. (2001, Genome Research) ZHANG et al. (2001, PNAS) Aware of selection bias: SPANG et al. (2001, Silico Biology) WEST et al. (2001, PNAS) NGUYEN and ROCKE (2002)
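The selection bias arises when the gene-selection step is done once on all the data and only the final rule is cross-validated. A hedged scikit-learn sketch of the corrected ("external") procedure, with a simple univariate filter standing in for RFE, repeats the selection inside every fold:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, LeaveOneOut

X = np.random.randn(62, 2000)             # placeholder expression matrix (62 tissues, 2000 genes)
y = np.r_[np.ones(40), -np.ones(22)]      # placeholder class labels

# Gene selection sits inside the pipeline, so it is redone on every training fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=16)),   # illustrative univariate filter
    ("svm", SVC(kernel="linear", C=1.0)),
])

scores = cross_val_score(pipe, X, y, cv=LeaveOneOut())     # external (unbiased) LOOCV
print("external LOOCV error:", 1.0 - scores.mean())
# Selecting the genes once on ALL the data and cross-validating only the SVM
# would reproduce the optimistic (internal) estimates criticized above.
```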
  • 39. BOOTSTRAP APPROACH: Efron’s (1983, JASA) .632 estimator is
    $B.632 = 0.368 \times AE + 0.632 \times B1,$
    where $B1$ is the bootstrap error when the rule $R_k^*$ is applied to a point not in the training sample. A Monte Carlo estimate of $B1$ is
    $B1 = \frac{1}{n} \sum_{j=1}^{n} E_j, \qquad E_j = \sum_{k=1}^{K} I_{jk} Q_{jk} \Big/ \sum_{k=1}^{K} I_{jk},$
    with $I_{jk} = 1$ if $x_j \notin$ the $k$th bootstrap sample (0 otherwise) and $Q_{jk} = 1$ if $R_k^*$ misallocates $x_j$ (0 otherwise).
  • 40. Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR $A(w) = (1 - w)AE + w\,CV2E$ with $w = 0.5$. McLachlan (1977) proposed $w = w_0$, where $w_0$ is chosen to minimize the asymptotic bias of $A(w)$ in the case of two homoscedastic normal groups. The value of $w_0$ was found to range between 0.6 and 0.7, depending on the values of $p$, $\Delta$, and $n_1/n_2$.
  • 41. The .632+ estimate of Efron & Tibshirani (1997, JASA) is
    $B.632{+} = (1 - w)AE + w\,B1, \qquad w = \frac{0.632}{1 - 0.368\,r},$
    where $r = (B1 - AE)/(\gamma - AE)$ is the relative overfitting rate and $\gamma = \sum_{i=1}^{g} p_i (1 - q_i)$ is an estimate of the no-information error rate. If $r = 0$, then $w = 0.632$ and $B.632{+} = B.632$; if $r = 1$, then $w = 1$ and $B.632{+} = B1$.
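A compact sketch of the .632 and .632+ estimators defined on the last three slides, again assuming a generic make_classifier interface; the clipping of r to [0, 1] follows the usual practical convention and is an assumption here.

```python
import numpy as np

def bootstrap_632(X, y, make_classifier, K=100, seed=None):
    rng = np.random.default_rng(seed)
    n = len(y)

    full_rule = make_classifier(X, y)
    pred_full = full_rule.predict(X)
    AE = np.mean(pred_full != y)                      # apparent error

    # B1: average error on points left out of each bootstrap sample
    err_sum, out_count = np.zeros(n), np.zeros(n)
    for _ in range(K):
        idx = rng.integers(0, n, size=n)              # bootstrap sample indices
        out = np.setdiff1d(np.arange(n), idx)         # points not in the bootstrap sample
        if out.size == 0:
            continue
        rule = make_classifier(X[idx], y[idx])
        err_sum[out] += (rule.predict(X[out]) != y[out])
        out_count[out] += 1
    B1 = np.mean(err_sum[out_count > 0] / out_count[out_count > 0])

    B632 = 0.368 * AE + 0.632 * B1

    # .632+ correction
    classes = np.unique(y)
    p_i = np.array([np.mean(y == c) for c in classes])
    q_i = np.array([np.mean(pred_full == c) for c in classes])
    gamma = np.sum(p_i * (1 - q_i))                   # no-information error rate
    r = (B1 - AE) / (gamma - AE) if gamma > AE else 0.0
    r = float(np.clip(r, 0.0, 1.0))
    w = 0.632 / (1 - 0.368 * r)
    B632_plus = (1 - w) * AE + w * B1
    return B632, B632_plus
```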
  • 42. One concern is the heterogeneity of the tumours themselves, which consist of a mixture of normal and malignant cells, with blood vessels in between. Even if one pulled out some cancer cells from a tumour, there is no guarantee that those are the cells that are going to metastasize, just because tumours are heterogeneous. “What we really need are expression profiles from hundreds or thousands of tumours linked to relevant, and appropriate, clinical data.” John Quackenbush
  • 43. UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS) INFER CLASS LABELS y1, …, yn of x1, …, xn Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes Eisen, Spellman, Brown, & Botstein (1998, PNAS)
  • 44. Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. “in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a ‘good’ clustering algorithm or the ‘right’ number of clusters.” (Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)
  • 45. Attention is now turning towards a model-based approach to the analysis of microarray data. For example: • Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9 • Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18 • Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7 • Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3 • Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17
  • 46. The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.
  • 47. In this case, one attractive feature of adopting mixture models with elliptically symmetric components such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.
  • 48. Illustration of an affine transformation of the data: the feature vector $x = (\text{Height}, \text{Weight}, \text{BP})^T$ may be re-expressed as $(\text{H} + \text{W},\ \text{H} - \text{W},\ \text{BP})^T$.
  • 49. MIXTURE OF g NORMAL COMPONENTS: $f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$, where $-2 \log \phi(x; \mu, \Sigma) = (x - \mu)^T \Sigma^{-1} (x - \mu) + \text{constant}$; that is, clustering is based on the MAHALANOBIS DISTANCE $(x - \mu)^T \Sigma^{-1} (x - \mu)$ rather than the EUCLIDEAN DISTANCE $(x - \mu)^T (x - \mu)$.
  • 50. MIXTURE OF g NORMAL COMPONENTS: $f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$. k-means corresponds to the restriction $\Sigma_1 = \cdots = \Sigma_g = \sigma^2 I$ (SPHERICAL CLUSTERS).
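The k-means connection can be illustrated with scikit-learn; note that covariance_type="spherical" estimates one variance per component rather than a single common σ², so the correspondence is approximate. Data are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)),
               rng.standard_normal((50, 2)) + 4.0])    # two toy clusters

km = KMeans(n_clusters=2, n_init=10).fit(X)

# Normal mixture with spherical component covariances (cf. Sigma_i = sigma^2 I)
gmm = GaussianMixture(n_components=2, covariance_type="spherical").fit(X)

print(km.cluster_centers_)
print(gmm.means_)
print(gmm.covariances_)    # one variance per component under the spherical restriction
```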
  • 52. Crab Data Figure 6: Plot of Crab Data
  • 53. Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.
  • 54. With a mixture model-based approach to clustering, an observation is assigned outright to the $i$th cluster if its density in the $i$th component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other $(g-1)$ components, where $f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_i \phi(x; \mu_i, \Sigma_i) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$.
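The outright assignment amounts to taking, for each observation, the component with the largest posterior probability $\tau_i(x_j) \propto \pi_i \phi(x_j; \mu_i, \Sigma_i)$; a brief sketch with placeholder data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50, 2)),
               rng.standard_normal((50, 2)) + 4.0])   # placeholder data

gmm = GaussianMixture(n_components=2).fit(X)
tau = gmm.predict_proba(X)      # posterior probabilities tau_i(x_j), one row per observation
labels = tau.argmax(axis=1)     # outright assignment to the most probable component
# gmm.predict(X) returns the same hard assignment directly.
```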
  • 55. http://www.maths.uq.edu.au/~gjm McLachlan and Peel (2000), Finite Mixture Models. Wiley.
  • 56. Estimation of Mixture Distributions It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data. McLachlan and Krishnan (1997, Wiley)
  • 57. • If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities. • With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.
  • 58. The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.
  • 59.
  • 60. Two Clustering Problems: • Clustering of genes on the basis of tissues – the genes are not independent • Clustering of tissues on the basis of genes – the latter is a nonstandard problem in cluster analysis (n << p)
  • 61. Mixture Software McLachlan, Peel, Adams, and Basford (1999) http://www.maths.uq.edu.au/~gjm/emmix/emmix.html
  • 63. PROVIDES A MODEL-BASED APPROACH TO CLUSTERING McLachlan, Bean, and Peel, 2002, A Mixture Model- Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422 http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
  • 64.
  • 65. Example: Microarray Data Colon Data of Alon et al. (1999) n=62 (40 tumours; 22 normals) tissue samples of p=2,000 genes in a 2,000 × 62 matrix.
  • 66.
  • 67. Mixture of 2 normal components
  • 68. Mixture of 2 t components
  • 69. Mixture of 2 t components
  • 70. Mixture of 3 t components
  • 71.
  • 72.
  • 73. In this process, the genes are being treated anonymously. May wish to incorporate existing biological information on the function of genes into the selection procedure. Lottaz and Spang (2003, Proceedings of 54th Meeting of the ISI) They structure the feature space by using a functional grid provided by the Gene Ontology annotations.
  • 74.
  • 75. Clustering of COLON Data Genes using EMMIX-GENE
  • 76. Grouping for Colon Data [heat maps of the 20 gene groups]
  • 77.
  • 78.
  • 79. Clustering of COLON Data Tissues using EMMIX-GENE
  • 80. Grouping for Colon Data [heat maps of the 20 gene groups]
  • 81. Mixtures of Factor Analyzers A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular, with high dimensional data. One approach for reducing the number of parameters is to work in a lower dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).
  • 82. $f(x_j) = \sum_{i=1}^{g} \pi_i \phi(x_j; \mu_i, \Sigma_i)$, where $\Sigma_i = B_i B_i^T + D_i$ $(i = 1, \ldots, g)$, $B_i$ is a $p \times q$ matrix of factor loadings and $D_i$ is a diagonal matrix.
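A minimal numpy sketch of evaluating the mixture-of-factor-analyzers density, assuming the parameters $\pi_i$, $\mu_i$, $B_i$, and $D_i$ have already been estimated (for example by an EM algorithm); the toy values below are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mfa_density(x, pis, mus, Bs, Ds):
    """Mixture-of-factor-analyzers density f(x) = sum_i pi_i phi(x; mu_i, B_i B_i^T + D_i)."""
    total = 0.0
    for pi_i, mu_i, B_i, D_i in zip(pis, mus, Bs, Ds):
        Sigma_i = B_i @ B_i.T + np.diag(D_i)       # constrained component covariance
        total += pi_i * multivariate_normal.pdf(x, mean=mu_i, cov=Sigma_i)
    return total

# Toy example with g = 2, p = 5, q = 2
rng = np.random.default_rng(0)
p, q = 5, 2
pis = [0.4, 0.6]
mus = [np.zeros(p), np.ones(p)]
Bs = [rng.standard_normal((p, q)) for _ in range(2)]
Ds = [np.full(p, 0.5), np.full(p, 0.5)]
print(mfa_density(np.zeros(p), pis, mus, Bs, Ds))
```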
  • 83. Number of Components in a Mixture Model Testing for the number of components, g, in a mixture is an important but very difficult problem which has not been completely resolved.
  • 84. Order of a Mixture Model A mixture density with g components might be empirically indistinguishable from one with either fewer than g components or more than g components. It is therefore sensible in practice to approach the question of the number of components in a mixture model in terms of an assessment of the smallest number of components in the mixture compatible with the data.
  • 85. Likelihood Ratio Test Statistic An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, -2logλ. Suppose we wish to test the null hypothesis, H 0 : g = g 0 versus H1 : g = g1 for some g1>g0.
  • 86. We let $\hat{\Psi}_i$ denote the MLE of $\Psi$ calculated under $H_i$ $(i = 0, 1)$. Then the evidence against $H_0$ will be strong if $\lambda$ is sufficiently small, or equivalently, if $-2 \log \lambda$ is sufficiently large, where $-2 \log \lambda = 2\{\log L(\hat{\Psi}_1) - \log L(\hat{\Psi}_0)\}$.
  • 87. Bootstrapping the LRTS McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing H 0 : g = g0 v H1 : g = g1 for a specified value of g0.
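A hedged sketch of that resampling assessment, using scikit-learn Gaussian mixtures as a stand-in for the normal mixture model fitted by EMMIX: the LRTS is computed on the data and then recomputed on parametric bootstrap samples generated under H0.

```python
from sklearn.mixture import GaussianMixture

def lrts(X, g0, g1):
    """-2 log lambda = 2 { log L(Psi_1_hat) - log L(Psi_0_hat) }."""
    n = len(X)
    ll0 = GaussianMixture(n_components=g0, n_init=5).fit(X).score(X) * n
    ll1 = GaussianMixture(n_components=g1, n_init=5).fit(X).score(X) * n
    return 2.0 * (ll1 - ll0)

def bootstrap_pvalue(X, g0, g1, B=99):
    observed = lrts(X, g0, g1)
    null_fit = GaussianMixture(n_components=g0, n_init=5).fit(X)   # model fitted under H0
    exceed = 0
    for _ in range(B):
        Xb, _ = null_fit.sample(len(X))        # parametric bootstrap sample under H0
        if lrts(Xb, g0, g1) >= observed:
            exceed += 1
    return (exceed + 1) / (B + 1)              # resampling estimate of the P-value
```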
  • 88. Bayesian Information Criterion: The Bayesian information criterion (BIC) of Schwarz (1978) is given by $-2 \log L(\hat{\Psi}) + d \log n$, the penalized log likelihood criterion to be minimized in model selection, including the present situation of choosing the number of components $g$ in a mixture model.
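Since $-2 \log L + d \log n$ is what scikit-learn's GaussianMixture.bic returns, BIC-based choice of g can be sketched in a few lines (placeholder data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))                  # placeholder data
bic = {g: GaussianMixture(n_components=g, n_init=5).fit(X).bic(X) for g in range(1, 6)}
g_best = min(bic, key=bic.get)                     # the smallest BIC is preferred
print(bic, g_best)
```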
  • 89. Gap statistic (Tibshirani et al., 2001) Clest (Dudoit and Fridlyand, 2002)
  • 90. Analysis of LEUKAEMIA Data using EMMIX-GENE
  • 91.
  • 92. Grouping for Leukemia Data [heat maps of gene groups 1-20]
  • 93. [heat maps of leukaemia gene groups 21-40]
  • 94.
  • 95.
  • 96. Breast cancer data set in van’t Veer et al. (van’t Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome Of Breast Cancer, Nature 415) These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.
  • 97. The Economist (US), February 2, 2002 The chips are down; Diagnosing breast cancer (Gene chips have shown that there are two sorts of breast cancer)
  • 98. Nature (2002, 4 July Issue, 418) News feature (Ball): Data visualization: Picture this
  • 99. Colour-coded: this plot of gene-expression data shows breast tumours falling into two groups
  • 100. Microarray data from 98 patients with primary breast cancers with p = 24,881 genes • 44 from the good prognosis group (remained metastasis-free for more than 5 years) • 34 from the poor prognosis group (developed distant metastases within 5 years) • 20 with a hereditary form of cancer (18 with BRCA1; 2 with BRCA2)
  • 101. Pre-processing filter of van’t Veer et al.: only genes with both • a P-value less than 0.01, and • at least a two-fold difference in more than 5 out of the 98 tissues were retained. This reduces the data set to 4,869 genes.
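A hedged sketch of this two-part filter, assuming the expression matrix holds log10 intensity ratios (one row per gene) and that per-gene P-values are supplied separately; the exact criteria of van't Veer et al. may differ in detail.

```python
import numpy as np

def filter_genes(log_ratios, p_values, p_cut=0.01, fold=2.0, min_tissues=5):
    """Retain genes with P-value < p_cut AND at least a `fold`-fold change
    in more than `min_tissues` of the tissue samples.

    log_ratios : (n_genes, n_tissues) array of log10 expression ratios
    p_values   : (n_genes,) array of per-gene P-values
    """
    fold_change = np.abs(log_ratios) >= np.log10(fold)     # two-fold up- or down-regulation
    enough_tissues = fold_change.sum(axis=1) > min_tissues
    keep = (p_values < p_cut) & enough_tissues
    return np.where(keep)[0]                               # indices of the retained genes
```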
  • 102. Heat Map Displaying the Reduced Set of 4,869 Genes on the 98 Breast Cancer Tumours
  • 103. Unsupervised Classification Analysis Using EMMIX-GENE Steps used in the application of EMMIX-GENE: 1. Select the most relevant genes from this filtered set of 4,869 genes. The set of retained genes is thus reduced to 1,867. 2. Cluster these 1,867 genes into forty groups. The majority of gene groups produced were reasonably cohesive and distinct. 3. Using these forty group means, cluster the tissue samples into two and three components using a mixture of factor analyzers model with q = 4 factors.
  • 104. Heat Map of the Top 1,867 Genes
  • 105.
  • 106. [heat maps of breast cancer gene groups 1-20]
  • 107. [heat maps of breast cancer gene groups 21-40]
  • 108. Summary of the 40 gene groups:

     i   mi    Ui        i   mi    Ui        i   mi    Ui        i   mi    Ui
     1   146  112.98    11   66   25.72     21   44   13.77     31   53    9.84
     2    93   74.95    12   38   25.45     22   30   13.28     32   36    8.95
     3    61   46.08    13   28   25.00     23   25   13.10     33   36    8.89
     4    55   35.20    14   53   21.33     24   67   13.01     34   38    8.86
     5    43   30.40    15   47   18.14     25   12   12.04     35   44    8.02
     6    92   29.29    16   23   18.00     26   58   12.03     36   56    7.43
     7    71   28.77    17   27   17.62     27   27   11.74     37   46    7.21
     8    20   28.76    18   45   17.51     28   64   11.61     38   19    6.14
     9    23   28.44    19   80   17.28     29   38   11.38     39   29    4.64
    10    23   27.73    20   55   13.79     30   21   10.72     40   35    2.44

    where i = group number, mi = number of genes in group i, and Ui = -2 log λi.
  • 109. Heat Map of Genes in Group G1
  • 110. Heat Map of Genes in Group G2
  • 111. Heat Map of Genes in Group G3
  • 112. 1. A change in gene expression is apparent between the sporadic (first 78 tissue samples) and hereditary (last 20 tissue samples) tumours. 2. The final two tissue samples (the two BRCA2 tumours) show consistent patterns of expression. This expression is different from that exhibited by the set of BRCA1 tumours. 3. The problem of trying to distinguish between the two classes, patients who were disease-free after 5 years Π1 and those with metastases within 5 years Π2, is not straightforward on the basis of the gene expressions.
  • 113. Selection of Relevant Genes We compared the genes selected by EMMIX- GENE with those genes retained in the original study by van’t Veer et al. (2002). van’t Veer et al. used an agglomerative hierarchical algorithm to organise the genes into dominant genes groups. Two of these groups were highlighted in their paper, with their genes corresponding to biologically significant features.
  • 114. Comparison with the two gene clusters highlighted by van’t Veer et al.:
    Cluster A (40 genes, containing genes co-regulated with the ER-α gene, ESR1): 24 matches with the genes retained by select-genes.
    Cluster B (40 genes, containing “co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells”): 23 matches with the genes retained by select-genes.
    We can see that of the 80 genes identified by van’t Veer et al., only 47 are retained by the select-genes step of the EMMIX-GENE algorithm.
  • 115. Comparing Clusters from the Hierarchical Algorithm with those from the EMMIX-GENE Algorithm

    Cluster matched   EMMIX-GENE group index   Number of genes   Percentage matched (%)
    Cluster A          2                       21                87.5
                       3                        2                 8.33
                      14                        1                 4.17
    Cluster B         17                       18                78.3
                      19                        1                 4.35
                      21                        4                17.4

    Subsets of these 47 genes appeared inside several of the 40 groups produced by the cluster-genes step of EMMIX-GENE.
  • 116. Genes Retained by EMMIX-GENE Appearing in Cluster A (vertical blue lines indicate the three groups of tumours)
  • 117. Genes Rejected by EMMIX-GENE Appearing in Cluster A
  • 118. Genes Retained by EMMIX-GENE Appearing in Cluster B
  • 119. Genes Rejected by EMMIX-GENE Appearing in Cluster B
  • 120. Assessing the Number of Tissue Groups: To assess the number of components g to be used in the normal mixture, the likelihood ratio statistic λ was adopted, and the resampling approach was used to assess the P-value. Proceeding sequentially, testing the null hypothesis H0: g = g0 versus the alternative hypothesis H1: g = g0 + 1, starting with g0 = 1 and continuing until a non-significant result was obtained, it was concluded that g = 3 components were adequate for this data set.
  • 121. Clustering Tissue Samples on the Basis of Gene Groups using EMMIX-GENE: The tissue samples can be subdivided into two groups corresponding to 78 sporadic tumours and 20 hereditary tumours. When the two-cluster assignment of EMMIX-GENE is compared to this genuine grouping, only 1 of the 20 hereditary tumour patients is misallocated, although 37 of the sporadic tumour patients are incorrectly assigned to the hereditary tumour cluster.
  • 122. Using a mixture of factor analyzers model with q = 8 factors, we would misallocate: 7 out of the 44 members of Π1; 24 out of the 34 members of Π2; and 1 of the 18 BRCA1 samples. The misallocation rate of 24/34 for the second class, Π2, is not surprising given both the gene expressions as summarized in the groups of genes and that we are classifying the tissues in an unsupervised manner without using the knowledge of their true classification.
  • 123. Supervised Classification When knowledge of the groups’ true classification is used (van’t Veer et al.), the reported error rate was approximately 50% for members of Π2 when allowance was made for the selection bias in forming a classifier on the basis of an optimal subset of the genes. Further analysis of this data set in a supervised context confirms the difficulty in trying to discriminate between the disease-free class Π1 and the metastases class Π2. (Tibshirani and Efron, 2002, “Pre-Validation and Inference in Microarrays”, Statistical Applications In Genetics And Molecular Biology 1)
  • 124.
  • 125. Investigating Underlying Signatures With Other Clinical Indicators The three clusters constructed by EMMIX- GENE were investigated in order to determine whether they followed a pattern contingent upon the clinical predictors of histological grade, angioinvasion, oestrogen receptor, lymphocytic infiltrate.
  • 126.
  • 127. Microarrays have become promising diagnostic tools for clinical applications. However, large-scale screening approaches in general and microarray technology in particular, inescapably lead to the challenging problem of learning from high-dimensional data.
  • 128. Hope to see you in Cairns in 2004!