grizzly - informal overview - pydata boston 2013

grizzly
(graphs for reproducible interactive visualization and analysis)
statistical analysis with multidimensional dataflows in python
Adrian Heilbut
Boston University and Broad Institute
http://www.empiricist.ca
PyData Boston 2013

1. Motivation
Biological discovery from complex, multidimensional data; common features of complex biological data and analyses
2. Problems and Goals
Reproducible, efficient, elegant, collaborative, interactive analysis; data + analysis evolving over time
3. Toy Dataset
A simple dataset with hierarchical and temporal structure
4. Strategies
Separate concerns; represent types and structure explicitly; abstract away data management; formalize
5. Inspirations
OLAP and data cube models; declarative visualization grammars; scientific workflow systems
6. Core Ideas
Dataflows + temporal graphs + multidimensional types + syntactic syrup
7. Toy Demos
8. Implementation
9. Biology application
Mechanisms of drug side effects in Parkinson’s Disease
10. Summary and Conclusions

Motivation
• Common and unique features of scientific data
• Examples of complex datasets and analyses in computational biology
• Data analysis desiderata

Biological data is increasingly complex;
Many datasets and analyses share
common structural features
• High-dimensional measurements
• Longitudinal / time-course measurements
• Hierarchical structure of dimensions
• Multiple modalities
(expression, protein concentration, phosphorylation)
• Complex experimental designs
• Complex analysis designs
• Complex pre-processing pipelines
• Many parameter choices
• Many cell types
• Many treatments
• Many organisms
• Many patients
• Many replicates

Ex 1. Cancer Profiling and Signatures
Cancer Cell Line Encyclopedia (CCLE)
Broad / Novartis, Barretina 2012
1000 cell lines
expression for 20,000 genes, mutation status, drug response

[Figure: developmental time-course designs. Timepoints: P0, P07, P12, P18, P21, P56; P0, P07, P15, P21; E0, E11, E15, E18. Stages: proliferation, differentiation, migration & patterning. 3 reps, 40k probes.]

[Figure: experimental design. Lesion: 6-OHDA or Ascorbate. Readouts: Expression + AIM at Day 1 and Day 8.]
First panel (CP73), groups as labeled: Saline Acute (9); Low Dose Levodopa Chronic (12); Saline Chronic (11); High Dose Levodopa Acute (10); High Dose Levodopa Chronic (11); Saline Chronic (10)
Second panel (CP101, Day 8), groups as labeled: Low Levodopa Chronic (8); Saline Chronic (7); High Levodopa Chronic (8); Saline Chronic (10)
Statistics (per gene)
• Change in expression between treatment groups
• Expression vs. AIM (correlation) within treatment groups / cell types
• Expression vs. AIM (correlation) within combined treatment groups
~ 23,000 x 200 matrix of stats for different contrasts between groups

Unique characteristics of scientific data
• Relatively short half-life of data and projects
• Uncertain and complex analysis methods
• Constantly changing data
• Lots of internal and external structure over dimensions
• Teams with diverse backgrounds and skills over multiple institutions and locations
• Communication of data is a primary goal
• High risk and high value outcomes: project selection / experimental followup, clinical decisions
The distinctive characteristics, uses, and problems of scientific data analysis motivate the need for tailored abstractions and tools

Desiderata for Data Analysis
• Correctness
• Thoroughness (scientific hypothesis space + analysis space)
• Reproducibility
• Verifiability (analysis and underlying data, others and oneself)
• Clarity
• Provenance (of the data, and of the analysis)
• Interactivity (Exploration, Drill-down)
• Computational Efficiency
• Scientist Efficiency

Vision
Every figure, every table, and every quantitative claim in a scientific analysis or publication should be verifiable and explorable
• it should link to an understandable, executable, modifiable representation of the data analysis pipeline by which it was generated
• one should be able to trace back all the way to the primary experimental data
• it should be easy and fun to play with

Problems and Goals
Errors have serious consequences
Practical problems in day-to-day analysis
Unmet need for better tools

Mistakes even happen in Cambridge...
Reinhart / Rogoff; Herndon, Ash, Pollin
[Figure panels: Original, Correct]

it’s even worse than it appears...
Kimball, 2013
ability to easily drill down to view and assess the underlying data is critical

Elements of statistical analysis
[Diagram components: input data, statistical algorithms, output data, visualizations, summary tables]

Version 2.
[Diagram: many copies of input data, several statistical algorithms, and many output datasets]

Version 247...
(ah_2013_09_13_v247_3-17am)
[Diagram: still more input data, statistical algorithms, and output datasets]

v247_figs.pdf, 75mb (450 pages)
v247_table_1.tab (shown 10 times)

Toy Dataset
Multidimensional profiling of fermentation metabolites of S. cerevisiae

Beer ratings
BeerAdvocate.com & RateBeer.com,
via Stanford SNAP & a very kind blogger
Multidimensional: Appearance, Aroma,
Palate, Taste, Overall
Hierarchies:
Location -> Brewery -> Beer
Beer style -> Beer
Temporal

Strategies
• Separate concerns
• Abstract away data management problems
• Formalize
• Optimize representations
(logical and physical)

Separation of Concerns
• Each of these components evolves over time
• Each may be modified independently by different people with different goals
[Diagram components: input data, statistical algorithms, output data, visualizations, summary tables]

Abstract and automate data management
Deciding and remembering how to name columns and files and track changes over time is not what I’d like to spend time on
Especially since I’ll probably do it inconsistently with what I decided to do last week
If the system is responsible for persisting data, caching and memoization can be done automatically.

Logical and physical representations matter
• Choice of representation and notation has a major effect on the ease and efficiency with which concepts can be manipulated, by either a person or a computer
• Given our goals for an analysis system, and the engineering instinct to separate independent concerns, what are optimal representations for
• data?
• analysis programs?
• visualizations and summary tables?

[Diagram: many input datasets, statistical algorithms, and output datasets]
How do scientists actually think about analyses?

Inspirations (and their deficiencies..)
1. OLAP (On-Line Analytical Processing) and MDX
(Multidimensional Expressions)
2. Tableau / Polaris
3. Scientific workflow systems
VisTrails, KNIME
Galaxy, GenePattern

1: OLAP
(on-line analytical processing)

2. Declarative Visualization Grammars
(Polaris/Tableau; Stolte 2003)
• key idea: declarative specification of visualizations is possible and works well
• recent focus has been on business analytics, rather than statistical graphics
• assumes a static, structured database (i.e. OLAP star schema) (Stolte 2000)

3. Scientific Workflow Systems
VisTrails

Hypothesis
Careful design and selection of representations for data, programs, and visualizations will make it possible to satisfy our data analysis objectives:
• multidimensional cubes with static, semantic types for conceptual representation of data
• directed acyclic graphs of functions with static, multidimensional input and output type signatures for our statistical programs
• declarative queries to generate summary tables
• declarative visualization grammar to generate graphics
(this is not how most researchers represent their analyses today)
Correctness
Thoroughness
Reproducibility
Verifiability
Clarity
Provenance
Interactivity
Computational Efficiency
Scientist Efficiency

Core Ideas
• Multidimensional Cubes and OLAP
• Semantic Types
• Dataflow Programming

Data consists of facts about the world.
1 5.5 3 3 4 5
2 6 2 3 2 2
3 8 5 5 4 4.5
ceci n’est pas data (“this is not data”)

Data consists of facts about the world.
BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5

Facts lie in specific domains defined by the structure of the real world or experimental design
Column   Type     Domain
BeerID   integer  BeerAdvocate BeerID
ABV      float    % EtOH
Smell    ordinal  1-5, 5 is best
Color    ordinal  1-5, 5 is best
Taste    ordinal  1-5, 5 is best
Overall  ordinal  1-5, 5 is best

BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5

There are a number of possible representations; logically but not practically equivalent
Wide (columns typed as before: BeerID integer; ABV float, % EtOH; Smell, Color, Taste, Overall ordinal 1-5, 5 is best):
BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5
≈
Long:
BeerID  Measure  Value
1       ABV      5.5
1       Smell    3
1       Color    3
1       Taste    4
1       Overall  5
2       ABV      6
2       Smell    2
2       Color    3
2       Taste    2
2       Overall  2
3       ABV      8
3       Smell    5
3       Color    5
3       Taste    4
3       Overall  4.5
cf. pandas reshape, plyr melt/cast
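The two layouts above can be converted into each other with pandas. The following is a minimal sketch using the three beers from the slide; the variable names (`wide`, `long`, `round_trip`) are chosen here for illustration.

```python
import pandas as pd

# Wide format: one row per beer, one column per measure (values from the slide).
wide = pd.DataFrame(
    {
        "BeerID": [1, 2, 3],
        "ABV": [5.5, 6.0, 8.0],
        "Smell": [3, 2, 5],
        "Color": [3, 3, 5],
        "Taste": [4, 2, 4],
        "Overall": [5.0, 2.0, 4.5],
    }
)

# Wide -> long: one row per (BeerID, Measure) pair.
long = wide.melt(id_vars="BeerID", var_name="Measure", value_name="Value")

# Long -> wide again, restoring the original column order.
measures = ["ABV", "Smell", "Color", "Taste", "Overall"]
round_trip = (
    long.pivot(index="BeerID", columns="Measure", values="Value")[measures]
    .reset_index()
)
round_trip.columns.name = None
```

The round trip preserves the values but not the per-column types: the long form stores every measure in a single float column, so the integer ratings come back as floats. That is one concrete sense in which the two forms are logically but not practically equivalent.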

Data Representations
• Scientific / statistical data is usually in matrix format, and it must be for efficient storage and computation
• Relational model is good for precisely encoding logical structure of data, but
• moving between relations and matrices is cumbersome
• defining a relational schema for all intermediate data would be a lot of work, especially as it changes over time
• on its own, the relational model does not explicitly represent semantics and units

Conceptual Model:
OLAP Data Cubes
Cartesian product of a set of dimensions (finite discrete sets) defines an N-dimensional grid
A multidimensional dataset is a function mapping locations in that grid to typed values called measures (identities of the measures can also be considered as just a special kind of dimension)
[Figure: two example cubes. Dimensions Beer ID x UserID x Time, measure: overall beer rating. Dimensions Gene x Brain Region x Stage of Development, measure: log2 gene expression.]

Conceptual Model:
Data Cubes as functions mapping dimensions
to measures
def BeerRatingsByUser(UserID, BeerID):
    return (Taste, Smell, Color, Texture, Overall)
def BeerRatingsByBeer(BeerID):
    return (mean Taste, mean Smell, mean Color, mean Texture, mean Overall)
def ExpressionBySample(Gene, Region, SampleID):
    return (log2 expression)
def ExpressionByRegionTime(Gene, Region, Timepoint):
    return (median expression, mean expression, std deviation, median abs deviation, # replicates)
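The signatures above are conceptual. A runnable toy version of the first two, assuming a plain dictionary as the backing store and made-up ratings with only three measures, might look like this:

```python
from statistics import mean

# Hypothetical ratings cube: (UserID, BeerID) -> (Taste, Smell, Overall).
# The values are invented for illustration.
ratings = {
    ("u1", 1): (4, 3, 5),
    ("u2", 1): (2, 3, 3),
    ("u1", 2): (2, 2, 2),
}


def beer_ratings_by_user(user_id, beer_id):
    """Look up the measures at one location in the (UserID, BeerID) grid."""
    return ratings[(user_id, beer_id)]


def beer_ratings_by_beer(beer_id):
    """Collapse the UserID dimension by averaging each measure per beer."""
    rows = [value for (user, beer), value in ratings.items() if beer == beer_id]
    return tuple(mean(column) for column in zip(*rows))
```

Here `beer_ratings_by_beer` is derived from the same cube by aggregating away one dimension, which is the relationship between the first two signatures on the slide.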

Hierarchies
Dimensions are related to each other in structures that reflect:
• the nature of the world
• experimental methods and designs
• analysis processes and decisions
These hierarchical relationships are critical to understanding and performing analyses, and need to be represented explicitly.
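As a small illustration of why an explicit hierarchy is useful, the sketch below rolls a measure up the Location -> Brewery -> Beer hierarchy from the toy dataset. The beer, brewery, and location names and the ratings are hypothetical, and `roll_up` is a helper written for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hierarchy: each child maps to its parent one level up.
beer_to_brewery = {"beer_a": "brewery_x", "beer_b": "brewery_x", "beer_c": "brewery_y"}
brewery_to_location = {"brewery_x": "Boston", "brewery_y": "Boston"}

# Hypothetical measure: overall rating per beer.
overall = {"beer_a": 4.0, "beer_b": 3.0, "beer_c": 5.0}


def roll_up(values, child_to_parent, aggregate=mean):
    """Aggregate a measure from one hierarchy level to the level above it."""
    groups = defaultdict(list)
    for child, value in values.items():
        groups[child_to_parent[child]].append(value)
    return {parent: aggregate(vals) for parent, vals in groups.items()}


by_brewery = roll_up(overall, beer_to_brewery)
by_location = roll_up(by_brewery, brewery_to_location)
```

Note that the location value here is a mean of brewery means (4.25), which differs from the mean over all three beers (4.0). With the hierarchy represented explicitly, that choice is visible in the code rather than hidden in a column name.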

Multidimensional Semantic Types
1970s / 80s: Semantic Database formalisms
Specify different kinds of relationships and interactions between objects (e.g. containment, is-a, relations / cross-products)
Overshadowed by the ER model and, later, UML
1990s: OLAP

Dataflow
Lots of domains model computation as ‘declarative’ dataflows
circuit design
audio / video processing

Grizzly Computation Model
Directed Acyclic Graph of processing nodes
Inputs and outputs of every node are typed cubes
Function nodes add type information to describe their output dimensions
‘Apply’ nodes propagate to their outputs the types of any input dimensions that they do not modify
Computation is declarative / intensional, not imperative; nodes
automatically process whatever is on their inputs, like an electrical circuit
CalcMedianRatings
  input:  (ReviewID, BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)
  output: (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)
Apply
  input:  (ReviewID, BeerID, SourceID) --> (Appearance, Aroma, Palate, Taste, Overall)
  output: (SourceID, BeerID) --> (MedianAppearance, MedianAroma, MedianPalate, MedianTaste, MedianOverall)
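The behaviour of an ‘Apply’ node can be sketched in plain Python. This is not Grizzly's implementation: cubes are modelled as dictionaries keyed by tuples, only the Overall measure is kept for brevity, and `apply_node` and its `extra_dim_index` parameter are names invented for this example.

```python
from collections import defaultdict
from statistics import median


def calc_median_ratings(cube):
    """(ReviewID, BeerID) -> Overall  becomes  BeerID -> median Overall."""
    by_beer = defaultdict(list)
    for (review_id, beer_id), overall in cube.items():
        by_beer[beer_id].append(overall)
    return {beer_id: median(vals) for beer_id, vals in by_beer.items()}


def apply_node(func, cube, extra_dim_index):
    """Run func once per value of a dimension it does not know about,
    and carry that dimension through to the output keys."""
    slices = defaultdict(dict)
    for key, value in cube.items():
        extra = key[extra_dim_index]
        rest = key[:extra_dim_index] + key[extra_dim_index + 1:]
        slices[extra][rest] = value
    out = {}
    for extra, sub_cube in slices.items():
        for out_key, out_value in func(sub_cube).items():
            out[(extra, out_key)] = out_value
    return out


# Hypothetical reviews: (ReviewID, BeerID, SourceID) -> Overall.
reviews = {
    ("r1", 1, "BeerAdvocate"): 4.0,
    ("r2", 1, "BeerAdvocate"): 5.0,
    ("r3", 1, "BeerAdvocate"): 3.0,
    ("r4", 1, "RateBeer"): 2.0,
    ("r5", 2, "RateBeer"): 4.5,
}

# Result is keyed by (SourceID, BeerID), matching the output type on the slide.
medians_by_source = apply_node(calc_median_ratings, reviews, extra_dim_index=2)
```

`calc_median_ratings` never mentions SourceID; the wrapper is what propagates that dimension to the output.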

Advantages of DAG representation
• Static type specifications allow precise and clear modeling / design of an analysis pipeline before having to write all the code needed to implement it
• Model can be turned into an actual working program, instead of just being a schematic diagram
• Provenance tracking without extra instrumentation
• Memoization of intermediate results is easy because data dependencies are already explicit
• Easier to understand, reason about, and explain to others
• Easier to track modification history as graph edits
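Two of these advantages, memoization and provenance, follow almost directly from having explicit edges. The sketch below is a toy evaluator, not Grizzly's code: `Node`, `Graph`, and the fingerprinting scheme are assumptions made for illustration, and the fingerprint only covers node names and graph structure, not function source or data contents.

```python
import hashlib
import json


class Node:
    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = tuple(inputs)


class Graph:
    def __init__(self):
        self.cache = {}   # fingerprint -> memoized result
        self.calls = []   # names of nodes that were actually executed

    def fingerprint(self, node):
        # Identify a node by its name plus the fingerprints of its inputs.
        payload = json.dumps([node.name, [self.fingerprint(n) for n in node.inputs]])
        return hashlib.sha256(payload.encode()).hexdigest()

    def evaluate(self, node):
        key = self.fingerprint(node)
        if key not in self.cache:
            args = [self.evaluate(n) for n in node.inputs]
            self.calls.append(node.name)
            self.cache[key] = node.func(*args)
        return self.cache[key]

    def provenance(self, node):
        # The lineage of a result is just the subgraph upstream of it.
        return {node.name: [self.provenance(n) for n in node.inputs]}


raw = Node("raw_ratings", lambda: [4.0, 5.0, 3.0])
med = Node("median", lambda xs: sorted(xs)[len(xs) // 2], [raw])
spread = Node("range", lambda xs: max(xs) - min(xs), [raw])
report = Node("report", lambda m, r: {"median": m, "range": r}, [med, spread])

graph = Graph()
result = graph.evaluate(report)
graph.evaluate(report)  # second call is served entirely from the cache
```

`raw_ratings` runs once even though two nodes depend on it, and `graph.provenance(report)` returns the full upstream tree without any extra instrumentation in the node functions.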

Syntactic Syrup: CubeApply
Takes the cross-product of a set of input cubes / vectors and applies a function to all results
BeerRank
  inputs: (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall); (AppWeight, AromaWt, PalWt, TasteWt, OverallWt)
  output: (BeerID) --> (RankScore)
CubeApply
  inputs: (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall); (RankModelID) --> (AppWt, AromaWt, PalWt, TasteWt, OverallWt)
  output: (BeerID, RankModelID) --> (RankScore)
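A minimal sketch of the CubeApply idea, assuming cubes are dictionaries from a key to a tuple of measures. The function names `beer_rank` and `cube_apply`, the ratings, and the two weighting models are all hypothetical.

```python
from itertools import product


def beer_rank(ratings, weights):
    """Weighted sum of one beer's ratings under one weighting model."""
    return sum(r * w for r, w in zip(ratings, weights))


def cube_apply(func, *cubes):
    """Apply func to every combination of entries drawn from the input cubes.
    The output is keyed by the tuple of input keys."""
    out = {}
    for combo in product(*(cube.items() for cube in cubes)):
        keys = tuple(key for key, _ in combo)
        values = [value for _, value in combo]
        out[keys] = func(*values)
    return out


# BeerID -> (Appearance, Aroma, Palate, Taste, Overall), invented values.
beers = {1: (3, 3, 3, 4, 5), 2: (3, 2, 3, 2, 2)}

# RankModelID -> (AppWt, AromaWt, PalWt, TasteWt, OverallWt), invented models.
models = {"equal": (1, 1, 1, 1, 1), "taste_only": (0, 0, 0, 1, 0)}

# (BeerID, RankModelID) -> RankScore
scores = cube_apply(beer_rank, beers, models)
```

`beer_rank` is written for a single beer and a single set of weights; `cube_apply` lifts it over both dimensions, so comparing ranking models needs no extra loop in user code.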

Slicing, Dicing
Since semantic type data is always propagated, in principle we can define the schema for any intermediate data (including hierarchy structure) and make use of existing OLAP tools to run declarative queries

Implementation
• Type system
• DAGs
• Execution
• Data Management
• Visualizations
• ...queries?

Requirements for a practical system
• Programmable and extensible, without requiring discontinuous changes to existing habits
• OLAP systems not general enough; energy barrier to setting up a ‘data warehouse’ for a particular scientific analysis is too high; arbitrary, complex statistics not supported
• System must be deployable over the web, so analyses and results can be easily shared with geographically dispersed collaborators and the scientific community
• Free and open source

Current Support for Hierarchies in
Pandas
• Hierarchical dataframes only support ‘uniform’ hierarchies
• lots of real analysis requires comparisons across many
different types
• Metadata is unstructured
• can’t compute effectively on column names
• Manual management
• consistency of column naming and interpretation depends
entirely on programmer discipline

Simple Semantic Types over Pandas
['[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bh"], ["st", "pval"], ["tt", "welch ttest"]]',
'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "nominal"], ["st", "pval"], ["tt", "student ttest"]]',
(remaining entries are truncated on the slide)
["ct", "cp73"], ["mc", "bonf"], ["st", "pval"],
["ct", "cp73"], ["mc", "bh"], ["st", "pval"],
["ct", "cp73"], ["st", "pval"], ["tt", "levene"]]
[Diagram: dimensions and their values]
ct: CP73, CP101
tt: student ttest, welch ttest
st: pval, t-stat
mc: bonf, bh, nom
ct X tt X mc X cmp X st
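The strings above read like pandas column labels that are JSON-encoded lists of [dimension, value] pairs, sorted by dimension key. Assuming that reading, the sketch below shows how such labels could be built, parsed, and queried. The helpers `encode_label`, `decode_label`, and `select`, the gene names, and the p-values are all hypothetical; only the dimension keys and values come from the slide.

```python
import json

import pandas as pd


def encode_label(**dims):
    """Serialize dimension values to a column label; keys are sorted so the
    same set of values always produces the same string."""
    return json.dumps(sorted(dims.items()))


def decode_label(label):
    """Recover the {dimension: value} mapping from a column label."""
    return {dim: value for dim, value in json.loads(label)}


def select(df, **criteria):
    """Keep the columns whose decoded label matches every given criterion."""
    keep = [
        col
        for col in df.columns
        if all(decode_label(col).get(dim) == value for dim, value in criteria.items())
    ]
    return df[keep]


cols = [
    encode_label(ct="cp73", mc="bh", st="pval", tt="welch ttest"),
    encode_label(ct="cp73", mc="nominal", st="pval", tt="student ttest"),
    encode_label(ct="cp101", mc="bh", st="pval", tt="welch ttest"),
]

# Invented statistics for two placeholder genes.
stats = pd.DataFrame(
    [[0.01, 0.20, 0.50], [0.30, 0.04, 0.90]],
    index=["geneA", "geneB"],
    columns=cols,
)

bh_only = select(stats, mc="bh")
cp73_welch = select(stats, ct="cp73", tt="welch ttest")
```

Because the labels are structured, selection is a computation over column metadata rather than string matching on ad hoc names.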

Temporal Graph Database
• Types, ‘programs’, and pointers to data are all represented canonically as typed property graphs (DAGs) that can hold JSON payloads
• The full edit history of the graph is recorded, so the user can rewind / replay and branch
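One simple way to get rewind, replay, and branching is to store the graph as an append-only log of edits and rebuild any past state by replaying a prefix of the log. The sketch below illustrates that idea only; the `TemporalGraph` class and its methods are invented for this example and say nothing about how GZDB is actually built.

```python
class TemporalGraph:
    def __init__(self, log=None):
        self.log = list(log or [])  # append-only list of (op, a, b) edits

    def add_node(self, node_id, payload):
        self.log.append(("add_node", node_id, payload))

    def add_edge(self, src, dst):
        self.log.append(("add_edge", src, dst))

    def remove_node(self, node_id):
        self.log.append(("remove_node", node_id, None))

    def state(self, version=None):
        """Replay the first `version` edits (all of them if None)."""
        nodes, edges = {}, set()
        for op, a, b in self.log[:version]:
            if op == "add_node":
                nodes[a] = b
            elif op == "add_edge":
                edges.add((a, b))
            elif op == "remove_node":
                nodes.pop(a, None)
                edges = {edge for edge in edges if a not in edge}
        return nodes, edges

    def branch(self, version):
        """Start a new history from an earlier point; the original is untouched."""
        return TemporalGraph(self.log[:version])


g = TemporalGraph()
g.add_node("ratings", {"type": "(ReviewID, BeerID) --> (Overall)"})
g.add_node("median", {"func": "CalcMedianRatings"})
g.add_edge("ratings", "median")
g.remove_node("median")

current_nodes, current_edges = g.state()       # after all four edits
earlier_nodes, earlier_edges = g.state(3)      # rewind to before the removal
alt = g.branch(3)                              # branch from that point
alt.add_node("mean", {"func": "CalcMeanRatings"})
```

Node payloads are plain dictionaries, so they serialize to JSON directly.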

Generic Visualization Components
to compose visualizations & reports

Architecture Overview
[Diagram components]
Grizzly components: GZDB, GZData, GZFlow, Graph Editor, Grizzly Webapp, HTML Viz Widgets
Underlying libraries and storage: SQLAlchemy, Postgres, IPython, Pandas, CherryPy, D3, SlickGrid, Flot, jsPlumb, Filesystem

Bio Example 1: Striatal Gene Expression with L-DOPA
Summary tables
Drilldown and provenance from summary tables to primary data

Drilldown from summary to statistical tables

Drilldown from statistical tables to plots of primary data

Bio Example 2: Complex, interactive visualizations: BOMBASTIC
Subspace clustering of time-series data
A. Define blocks and an ordering
B. Cluster each block independently
C. Represent resulting clusters in a tree and explore/filter interactively
Each (predefined) subspace has unique information; we want to understand patterns both within and between blocks

Summary
Increasing complexity of biological data presents critical requirements for better systems for collaborative analysis of high-dimensional, multi-factor, dynamic data
A dataflow computation model with semantic, multidimensional types offers significant advantages for meeting these requirements
Grizzly defines a simple, formal model for multidimensional data and DAGs of operations on that data, adapting and combining ideas from OLAP, declarative visualization, and dataflow programming.
Proof-of-concept implementation in Python establishes feasibility
Applications to analysis of real biological experiments (PD, Neuro, Cancer) will evaluate practical utility and benefits
Correctness
Thoroughness
Reproducibility
Verifiability
Clarity
Provenance
Interactivity
Computational Efficiency
Scientist Efficiency

Acknowledgements: Software
• IPython
• NumPy
• Pandas
• Statsmodels
• Patsy
• CherryPy
• SQLAlchemy
• postgres
• NetworkX
• igraph
• backbone
• underscore
• jsPlumb
• flot
• D3.js

@adrian_h
http://www.grizzly.io
