1. Extreme Spatio Temporal Data Analysis in
Biomedical Informatics
Joel Saltz MD, PhD
Director Center for Comprehensive
Informatics
2. Contributions
• Computer Science: Methods and middleware
Center for Comprehensive Informatics
for analysis, classification of very large
datasets from low dimensional spatio-
temporal sensors; methods to carry out
comparisons and change detection between
sensor datasets
• Biomedical: Mine whole slide image datasets
to better predict outcome and response to
treatments, generate basic insights into
pathophysiology and identify new treatment
targets
3. Outline of Talk
• Pathology: Analysis of Digitized Tissue for Research
Center for Comprehensive Informatics
and Practice
• Feature Clustering: Morphologic Tumor Subtypes in
GBM Brain Tumors and Relationship to “omic”
classifications
• Whole Slide Image Analysis in Clinical Practice:
Neuroblastoma
• Tissue Flow: Multiplex Quantum Dot
• HPC/BIGDATA Feature Pipeline
• Pathology data analytic tools and techniques
6. Computerized Classification System
for Grading Neuroblastoma
Initialization Yes
Image Tile Background? Label
I=L
• Background Identification
No
Create Image I(L)
• Image Decomposition (Multi-
Training Tiles resolution levels)
Segmentation I = I -1 • Image Segmentation
Down-sampling
(EMLDA)
Segmentation
Feature Construction
• Feature Construction (2nd
Yes
No order statistics, Tonal
Feature Extraction I > 1?
Feature Construction
Features)
Feature Extraction
Classification
• Feature Extraction (LDA) +
Classification (Bayesian)
Classifier Training
No
• Multi-resolution Layer
Within Confidence
Region ? Controller (Confidence
Yes
TRAINING Region)
TESTING
8. Direct Study of Relationship Between
vs
Center for Comprehensive Informatics
9. In Silico Brain Tumor Center
Anaplastic Astrocytoma
(WHO grade III)
Glioblastoma
(WHO grade IV)
10. Morphological Tissue Classification
Whole Slide Imaging Cellular Features
Center for Comprehensive Informatics
Nuclei Segmentation
Lee Cooper,
Jun Kong
11. Nuclear Features Used to Classify GBMs
Center for Comprehensive Informatics
50
3 2 1
20
1
45
40
Silhouette Area
40 60
Cluster
80
2
35
100
120
30
140
3
25 160
2 3 4 5 6 7 20 40 60 80 100 120 140 160
# Clusters 0 0.5 1
Silhouette Value
Consensus clustering of morphological
signatures
Study includes 200 million nuclei taken from 480
slides corresponding to 167 distinct patients
Each possibility evaluated using 2000 iterations of K-
means to quantify co-clustering
12. Clustering identifies three morphological groups
Center for Comprehensive Informatics
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes:
Cell Cycle (CC), Chromatin Modification (CM),
Protein Biosynthesis (PB)
• Prognostically-significant (logrank p=4.5e-4)
CC CM PB
1
CC
10 0.8 CM
PB
20
Feature Indices
0.6
Survival
30 0.4
40 0.2
50
0
0 500 1000 1500 2000 2500 3000
Days
13. Gene Expression Class Associations
Center for Comprehensive Informatics
• Cox proportional hazards
– Gene expression class not significant p=0.58
– Morphology clustering p=5.0e-3
100
Classical
Mesenchymal
80
Subtype Percentage (%)
Neural
Proneural
60
40
20
0
CC CM PB
Cluster
14. Clustering Validation
Center for Comprehensive Informatics
• Separate set of 84 GBMs from Henry Ford Hospital
• ClusterRepro: CC p=7.2e-3, CM p=1.3e-2
CC Mixed CM
1
10 CC
0.8 Mixed
Feature Indices
20 CM
0.6
30 Survival
0.4
40
0.2
50
0
0 20 40 60 80 100
Months
19. Example Application: Cancer Stem Cell
Niche
• Cancer stem cells
– Rare(?), proliferative cells, regenerative
– Do they prefer to live near blood vessels, or necrosis?
20. Extreme Spatio-Temporal Sensor Data Analytics
Center for Comprehensive Informatics
• Leverage exascale data and
computer resources to
squeeze the most out of
image, sensor or simulation
data
• Run lots of different
algorithms to derive same
features
• Run lots of algorithms to
derive complementary
features
• Data models and data
management infrastructure
to manage data products,
feature sets and results from
classification and machine
learning algorithms
21. Application Targets
Center for Comprehensive Informatics
• Multi-dimensional spatial-temporal datasets
– Microscopy image analyses
– Biomass monitoring using satellite imagery
– Weather prediction using satellite and ground sensor
data
– Large scale simulations
• Can we analyze 100,000+ microscopy images per
hour?
• Correlative and cooperative analysis of data from
multiple sensor modalities and sources
• What-if scenarios and multiple design choices or
initial conditions
22. Biomass Monitoring (joint with ORNL)
• Investigate changes in vegetation and land use
Center for Comprehensive Informatics
• Hierarchical, multi-resolution coarse/fine-grained analytics into a
unified framework
• Changes identified using high temporal/low spatial resolution MODIS
data
• Segmentation and classification methods used to characterize
changes using higher resolution data (e.g. multitemporal AWiFS
data)
• Segmentation and classification to identify man-made structures.
24. Core Transformations
• Data Cleaning and Low Level Transformations
• Data Subsetting, Filtering, Subsampling
• Spatio-temporal Mapping and Registration
• Object Segmentation
• Feature Extraction, Object Classification
• Spatio-temporal Aggregation
• Change Detection, Comparison, and Quantification
25. Extreme DataCutter
DataCutter
Pipeline of filters connected though logical streams
In transit processing
Flow control between filters and streams
Developed 1990s-2000s; led to IBM System S
Extreme DataCutter
Two level hierarchical pipeline framework
In transit processing
Coarse grained components coordinated by Manager that
coordinates work on pipeline stages between nodes
Fine grained pipeline operations managed at the node level
Both levels employ filter/stream paradigm
27. Node Level Work Scheduling
Center for Comprehensive Informatics
• Features of Node Level Architectures
– Nodes contain CPUs, GPUs
– Each CPU contains multiple cores
– GPU has complex internal architecture
– Data locality within node
– Data paths between CPUs and GPUs
Keeneland Node
28. Node Level Work Scheduling
Center for Comprehensive Informatics
• Attempt to minimize data movement
• Identify and assign operations that perform
well on GPU
• Balance load between CPUs and GPUs
• Prefetch data
• Identify and use high bandwidth CPU/GPU
data paths
• Schedule exclusive GPU access for
components (e.g. morphological
reconstruction) requiring fine grained
parallelism
30. Brain Tumor Pipeline Scaling on Keeneland
(100 Nodes)
Center for Comprehensive Informatics
31. Control Structures for Handling Fine
Grained/Runtime Dependent Parallelism in GPUs
Center for Comprehensive Informatics
Morphological Reconstruction:
8-15 Fold speedup vis one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070
and GTX580 GPUs
32. Large Scale Data Management
Center for Comprehensive Informatics
Implemented with IBM DB2 for large scale
pathology image metadata (~million markups per
slide)
Represented by a complex data model capturing
multi-faceted information including markups,
annotations, algorithm provenance, specimen, etc.
Support for complex relationships and spatial
query: multi-level granularities, relationships
between markups and annotations, spatial and
nested relationships
Highly optimized spatial query and analyses
33. Spatial Centric – Pathology Imaging “GIS”
Point query: human marked point Window query: return markups
inside a nucleus contained in a rectangle
.
Containment query: nuclear feature Spatial join query: algorithm
aggregation in tumor regions validation/comparison
34. PAIS
PAIS (Pathology Analytical Imaging Standards)
Supported by caBIG, R01 and ACTSI
class Domain Mo...
PAIS Logical Model
WholeSlideImageReference TMAImageReference
Patient
Region
62 UML classes
MicroscopyImageReference
Subject
DICOMImageReference
0..1
1
0..1 markups, annotations,
1 Specimen
0..*
ImageReference 1 0..*
0..1
imageReferences,
1 0..1 AnatomicEntity
User
0..1
0..* 1
1
provenance
1 1 Equipment
Group
PAIS Data Representation
0..1
1
0..1 1 PAIS
1 0..*
Project AnnotationReference
1
XML (compressed) or HDF5
0..1
0..* 1 1 1
Collection 0..* 0..*
0..*
1
Markup
0..*
0..*
0..1
Annotation 0..*
PAIS Databases
0..*
GeometricShape 1
loading, managing and
1 1
0..* 0..* 0..*
1
Observation Calculation Inference
Surface Field
1 1
querying and sharing data
0..1 0..1 0..1
Provenance
0..*
Native XML DBMS or
RDBMS + SDBMS
35. PAIS: Example Queries
Example Query for Integrative Studies
• Find mean nuclear feature vector and covariance on
Center for Comprehensive Informatics
tumor regions for each patient grouped by tumor
subtype
SELECT c.pais_uid, pc.subtype,
AVG(area), AVG(perimeter), AVG(eccentricity),
COVARIANCE(area, perimeter), COVARIANCE(area, eccentricity)
FROM pais.calculation_flat c,TCGA.PATIENT_CHARACTERISTIC pc,
pais.patient p
WHERE p.patientid = pc.patient_id AND
p.pais_uid = c.pais_uid
GROUP BY c.pais_uid, pc.subtype;
2 1 3 4
1
1 Cluster 1
20 10 0.9 Cluster 2
20 Cluster 3
40 0.8
Cluster 4
30
0.7
60 40
2
Feature Indices
0.6
Cluster
50
Survival
80
0.5
60
100
0.4
70
3
120 80 0.3
140 90 0.2
100 0.1
160 4
110
50 100 150 0
0 500 1000 1500 2000 2500 3000
0 0.2 0.4 0.6 0.8 1 Days
Silhouette Value
37. VLDB 2012
Center for Comprehensive Informatics
Change Detection, Comparison, and Quantification
38. Summary and Perspective
• Large scale integrative data analytic methods and
Center for Comprehensive Informatics
tools to integrate clinical, molecular, Pathology,
Radiology data
• Characterize new cancer subtypes and biomarkers,
predict outcome, treatment response
• Algorithms to quantify Pathology classification
• HPC/BIGDATA analysis pipelines
39. Importance:
• Computer Science: general approaches to analysis
Center for Comprehensive Informatics
and classification of very large datasets from low
dimensional spatio-temporal sensors
• Biomedical: generate basic insights into
pathophysiology, clues to new treatments, better
ways of evaluating existing treatments and core
infrastructure needed for comparative effectiveness
research studies
40. Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David
Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel
Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
• caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin
Kurc, Himanshu Rathod Emory leads
• caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon,
Daniel Rubin, Fred Prior, Larry Tarbox and many others
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul
Pantalone
• Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony
Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun
Hu, Lin Yang, David J. Foran (Rutgers)
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima
Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos
Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani,
Carl Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc,
Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-
Ramos
• NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman,
Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi
Metadata about imagesMetadata about image targets, how images are derived (patient, specimen, anatomicEntity, etc)3) Metadata about analyses (the purpose of the analysis, who performed the analysis, etc) 4) Image markups -- a markup delineates a spatial region (e.g., as points, lines, polygons, multi-polygons) in images5) Annotation: Image features: a type of annotation calculated or derived from the markups6) Annotation: observation -- an annotation associates semantic meaning to markup entities through coded or free text terms that provide explanatory or descriptive information7) provenance information, i.e., the derivation history of a markup or annotation, including algorithm information, parameters, and inputsNative XML database based approachSmall sized PAIS documents, e.g., organ, tissue, or region level annotationsNo mapping needed, support standard XML queriesRelational and spatial database approachFor large scale PAIS documents, e.g., analysis results at cellular or subcellular level Data mapped into relational tables and spatial objectsHighly efficient on storage and queries