Development of multivariate classifiers in cancer

Multivariate Algorithms and
Classifiers in Cancer
Micro-RNA profiles help predict distant disease-free survival in breast cancer
Bits and pieces of bioinformatics workflow

Mehis Pold, MD
October 18, 2013

Feature Selection & algorithm development

Training samples

Iterative process

Internal Algorithm Validation

Validation samples

Clinical Validation
Training and validation datasets in each step don’t
overlap
Rule of thumb: validation always produces weaker
statistics than training
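
The non-overlap requirement can be illustrated with a minimal Python sketch. The sample IDs, the seed, and the function name are hypothetical; the sizes (210 samples, 110 for training) simply mirror the counts on the later slides and are not the author's actual assignment.

```python
import random

def split_non_overlapping(sample_ids, n_train, seed=0):
    """Shuffle sample IDs once and cut them into two disjoint sets."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# Hypothetical sample identifiers, for illustration only
sample_ids = [f"S{i:03d}" for i in range(210)]
train_ids, validation_ids = split_non_overlapping(sample_ids, n_train=110)

# No sample may appear in both sets
assert set(train_ids).isdisjoint(validation_ids)
```

Because each ID is assigned exactly once after a single shuffle, the two sets cannot share a sample.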

• Analysis of early primary breast cancer to identify prognostic
markers and associated pathways: mRNA and miRNA profiling
• GEO (Gene Expression Omnibus) accession ID: GSE22220
• Technology platform: ILLUMINA
• 733 micro-RNA
• 210 breast cancer samples
• 131 complete pathological response (pCR) to chemotherapy; 79
recurrent disease samples (RD)
• Data collected up to 10 years after start of chemotherapy
Buffa et al. microRNA-Associated Progression Pathways and Potential
Therapeutic Targets Identified by Integrated mRNA and microRNA
Expression Profiling in Breast Cancer. Cancer Res. 2011, 71:5635

BIOINFORMATICS WORKFLOW
Multiple statistical
approaches to
maximize outcome

TRAINING SET:
36 RD
74 pCR

VALIDATION SET:
43 RD
57 pCR

Kaplan-Meier & ROC

Sensitivity (Se)
Specificity (Sp)
Positive Predictive Value (PPV)
Negative Predictive Value (NPV)

Comparison of two
algorithms and
classification by kNN
Custom-scripting (R, VBA)
Standard Software : MS Excel
Medical Statistics: MedCalc
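
The four performance measures listed on this slide all derive from a 2x2 confusion matrix. A minimal Python sketch follows; the slides name R, VBA, Excel and MedCalc as the actual tools, so Python is used here only for illustration, and the counts are made up. It assumes every denominator is nonzero.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Se, Sp, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among actual positives
        "specificity": tn / (tn + fp),  # true negatives among actual negatives
        "ppv": tp / (tp + fp),          # true positives among predicted positives
        "npv": tn / (tn + fn),          # true negatives among predicted negatives
    }

# Made-up counts, not taken from the study
metrics = diagnostic_metrics(tp=30, fp=20, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and specificity depend only on the classifier, whereas PPV and NPV also shift with the proportion of positives in the sample set.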

FEATURE SELECTION
Reduction of dimensionality from n = 733 to n = 1
Approach 1: iterative clustering

Approach 2: T-test combined with enriching for weak
inter-profile correlation
Significance of feature selection evaluated by Kaplan-Meier survival analysis and ROC (receiver operating characteristic) curve
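
The slides do not spell out how Approach 2 was implemented. One possible reading, sketched here in Python under loud assumptions, is to rank features by the absolute Welch t statistic between RD and pCR and then greedily keep only features whose profile correlates weakly with those already kept. The function names, the `max_abs_r` cutoff, and the feature names are all hypothetical.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two lists of values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_features(profiles, labels, max_abs_r=0.5, top_n=2):
    """Assumed scheme: rank by |t|, then skip features correlated with kept ones."""
    def by_group(values):
        rd = [v for v, lab in zip(values, labels) if lab == "RD"]
        pcr = [v for v, lab in zip(values, labels) if lab == "pCR"]
        return rd, pcr

    ranked = sorted(profiles,
                    key=lambda f: abs(welch_t(*by_group(profiles[f]))),
                    reverse=True)
    selected = []
    for f in ranked:
        if all(abs(pearson(profiles[f], profiles[g])) <= max_abs_r for g in selected):
            selected.append(f)
        if len(selected) == top_n:
            break
    return selected

# Toy data with hypothetical feature names
labels = ["RD"] * 4 + ["pCR"] * 4
profiles = {
    "miR-a": [5.0, 5.2, 4.9, 5.1, 1.0, 1.2, 0.9, 1.1],
    "miR-b": [4.8, 5.3, 4.7, 5.2, 1.2, 1.1, 0.8, 1.3],  # nearly redundant with miR-a
    "miR-c": [3.0, 1.0, 3.5, 1.5, 1.0, 2.0, 0.5, 2.5],  # weakly correlated with miR-a
}
print(select_features(profiles, labels))
```

In this toy case the redundant profile is skipped in favour of a weakly correlated one, even though its t statistic is larger. A real analysis would also apply a significance cutoff to the t-test, which this sketch omits.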

[Figure labels: RD, pCR; Up, Down]
KAPLAN-MEIER SURVIVAL CURVE
The Kaplan–Meier estimator, also known as the product-limit estimator,
estimates the survival function from lifetime data. In medical
research, it is often used to measure the fraction of patients living for a certain
amount of time after treatment. In economics, it can be used to measure the
length of time people remain unemployed after a job loss. In engineering, it can
be used to measure the time until failure of machine parts. In ecology, it can be
used to estimate how long fleshy fruits remain on plants before they are removed
by frugivores. The estimator is named after Edward L. Kaplan and Paul Meier.
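
The product-limit idea can be sketched in a few lines of Python: at each distinct event time, the running survival estimate is multiplied by (1 - d/n), where d is the number of events at that time and n the number of patients still at risk. The function name and the toy follow-up data are illustrative assumptions, not values from the study.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  follow-up time per patient
    events: 1 = event observed, 0 = censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set
        at_risk -= removed
    return curve

# Toy data: six patients, two of them censored
times = [2, 3, 3, 5, 8, 10]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 4))
```

Censored patients never lower the curve directly; they only shrink the risk set for later event times.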

Receiver operating characteristic (ROC)
In signal detection theory, a receiver operating characteristic (ROC), or simply
ROC curve, is a graphical plot which illustrates the performance of a binary
classifier system as its discrimination threshold is varied. It is created by plotting
the fraction of true positives out of the total actual positives (TPR = true positive
rate) vs. the fraction of false positives out of the total actual negatives (FPR =
false positive rate), at various threshold settings. TPR is also known as sensitivity
(also called recall in some fields), and FPR is one minus the specificity or true
negative rate.
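
A minimal Python sketch of this construction: sweep the threshold over the observed scores, record (FPR, TPR) at each step, and integrate with the trapezoidal rule to obtain the area under the curve. The function names and the toy scores are assumptions for illustration; a sample is called positive when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from (0, 0) to (1, 1); labels: 1 = positive, 0 = negative."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step through every distinct score
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy classifier scores, not taken from the study
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(round(auc(points), 4))
```

An area of 0.5 corresponds to a classifier no better than chance, and 1.0 to perfect separation of the two classes.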

ITERATIVE CLUSTERING TO BINARY OUTCOME

T-TEST ENRICHED TOWARD WEAK CORRELATIONS

Nearest Neighbor Classification - kNN
• Based on a measure of distance between observations (e.g.
Euclidean distance or one minus correlation).
• k-nearest neighbor rule (Fix and Hodges, 1951) classifies an
observation X as follows:
– find the k closest observations in the training data,
– predict the class by majority vote, i.e. choose the class that is
most common among those k neighbors.
[Figure: classification of data in 2D space, K=3 and K=5]
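
The k-nearest neighbor rule described above can be sketched in Python with Euclidean distance and a majority vote. The 2D points, class labels, and function name are hypothetical and chosen so that the predicted class changes between k = 3 and k = 5.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 2D training data, for illustration only
train_points = [(2.2, 2.0), (1.8, 2.1), (2.0, 2.4), (1.0, 2.0), (2.0, 3.1), (5.0, 5.0)]
train_labels = ["RD", "RD", "pCR", "pCR", "pCR", "RD"]
query = (2.0, 2.0)

print(knn_predict(train_points, train_labels, query, k=3))
print(knn_predict(train_points, train_labels, query, k=5))
```

With k = 3 the two closest RD points win the vote; with k = 5 the three pCR points outnumber them. The choice of k is therefore a tuning decision, and an odd k avoids ties in a two-class problem.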

SUMMARY
ITERATIVE CLUSTERING TO BINARY OUTCOME
Rows: TRAINING (Kaplan-Meier, ROC); VALIDATION (Kaplan-Meier, ROC); CLASSIFICATION (kNN)
Column headers: p-value, AUC, Sensitivity, Specificity, PPV, NPV
Values in slide order: 0.0001, <.0001, 0.773, 72, 0.67, 65, 0.61, 0.50, 0.63, 65, 0.51, 68, 0.0002, 0.0024

T-TEST ENRICHED FOR WEAK CORRELATIONS
Rows: TRAINING (Kaplan-Meier, ROC); VALIDATION (Kaplan-Meier, ROC); CLASSIFICATION (kNN)
Column headers: p-value, AUC, Sensitivity, Specificity, PPV, NPV
Values in slide order: <.0001, <.0001, 0.898, 83, 0.624, 58, 0.86, 0.65, 0.64, 56, 0.35, 82, 0.012, 0.0334

CONCLUDING REMARKS
• There is no single ‘right’ approach to algorithm development.
• Validation always produces weaker statistics than training.
• The significance of training statistics and that of validation
statistics do not correlate well.
• Algorithms are only as stable and significant as upstream
R&D data. The better standardized and controlled the wet-bench work,
the more stable and significant the algorithms and
eventual clinical validation.
