grizzly
statistical analysis with
multidimensional dataflows in python
Adrian Heilbut
Boston University and Broad Institute
http://www.empiricist.ca
(graphs for reproducible
interactive visualization and analysis)
PyData Boston 2013
1. Motivation
   Biological discovery from complex, multidimensional data;
   common features of complex biological data and analyses
2. Problems and Goals
   Reproducible, efficient, elegant, collaborative, interactive analysis;
   data + analysis evolving over time
3. Toy Dataset
   A simple dataset with hierarchical and temporal structure
4. Strategies
   Separate concerns; represent types and structure explicitly;
   abstract away data management; formalize
5. Inspirations
   OLAP and data cube models;
   declarative visualization grammars;
   scientific workflow systems
6. Core Ideas
   Dataflows + Temporal Graphs +
   Multidimensional Types + Syntactic syrup
7. Toy Demos
8. Implementation
9. Biology application
   Mechanisms of drug side effects in Parkinson’s Disease
10. Summary and Conclusions
Motivation
• Common and unique features of scientific data
• Examples of complex datasets and analyses in
computational biology
• Data analysis desiderata
Motivation Problem & Goals Toy Dataset Strategies Inspirations Core Ideas Implementation Demo Biological Application
Biological data is increasingly complex;
Many datasets and analyses share
common structural features
• High-dimensional measurements
• Longitudinal / time-course measurements
• Hierarchical structure of dimensions
• Multiple modalities
(expression, protein concentration, phosphorylation)
• Complex experimental designs
• Complex analysis designs
• Complex pre-processing pipelines
• Many parameter choices
• Many cell types
• Many treatments
• Many organisms
• Many patients
• Many replicates
Ex 1. Cancer Profiling and Signatures
Cancer Cell Line Encyclopedia (CCLE)
Broad / Novartis, Barretina 2012
1000 cell lines
expression for 20,000 genes, mutation status, drug response
[Figure: developmental time courses with time points (P0, P07, P12, P18, P21, P56), (P0, P07, P15, P21), and (E0, E11, E15, E18); phases labeled proliferation, differentiation, migration & patterning; 3 reps, 40k probes]
[Figure: experimental design. Labels in extraction order, group sizes in parentheses: Saline Acute (9); Low Dose Levodopa Chronic (12); Saline Chronic (11); 6-OHDA / Ascorbate; Day 1: Expression + AIM; CP73; Day 8: Expression + AIM; High Dose Levodopa Acute (10); High Dose Levodopa Chronic (11); Saline Chronic (10); Low Levodopa Chronic (8); Saline Chronic (7); 6-OHDA / Ascorbate; CP101; Day 8: Expression + AIM; High Levodopa Chronic (8); Saline Chronic (10)]
Statistics (per gene)
• Change in Expression between treatment groups
• Expression vs. AIM (correlation) within treatment groups / cell types
• Expression vs. AIM (correlation) within combined treatment groups
~ 23,000 x 200 matrix of stats for different contrasts between groups
Unique characteristics of scientific data
• Relatively short half-life of data and projects
• Uncertain and complex analysis methods
• Constantly changing data
• Lots of internal and external structure over dimensions
• Teams with diverse backgrounds and skills over multiple institutions
and locations
• Communication of data is a primary goal
• High risk and high value outcomes
project selection / experimental followup
clinical decisions
The distinctive characteristics, uses, and problems of scientific
data analysis motivate the need for tailored abstractions and tools
Desiderata for Data Analysis
• Correctness
• Thoroughness (scientific hypothesis space + analysis space)
• Reproducibility
• Verifiability (analysis and underlying data, others and oneself)
• Clarity
• Provenance (of the data, and of the analysis)
• Interactivity (Exploration, Drill-down)
• Computational Efficiency
• Scientist Efficiency
Vision
Every figure, every table, and every quantitative claim in a scientific
analysis or publication should be verifiable and explorable
it should link to an understandable, executable,
modifiable representation of the data analysis pipeline by
which it was generated
one should be able to trace back all the way to the primary
experimental data
it should be easy and fun to play with
Problems and Goals
Errors have serious consequences
Practical problems in day-to-day analysis
Unmet need for better tools
Mistakes even happen in Cambridge...
Reinhart / Rogoff (original) vs. Herndon, Ash, Pollin (correct)
it’s even worse than it appears...
Kimball, 2013
ability to easily
drill down to view
and assess the
underlying data is
critical
Elements of statistical analysis
statistical
algorithms
output
data
Input data
visualizations
summary
tables
Version 2.
[Diagram: multiple input datasets and statistical algorithms producing many output datasets]
Version 247...
(ah_2013_09_13_v247_3-17am)
[Diagram: many input datasets and statistical algorithms producing many output datasets, v247_figs.pdf (75mb, 450 pages), and many copies of v247_table_1.tab]
Toy Dataset
Multidimensional profiling of fermentation
metabolites of S. cerevisiae
Beer ratings
BeerAdvocate.com & RateBeer.com,
via Stanford SNAP & a very kind blogger
Multidimensional: Appearance, Aroma,
Palate, Taste, Overall
Hierarchies:
Location -> Brewery -> Beer
Beer style -> Beer
Temporal
Strategies
• Separate concerns
• Abstract away data management problems
• Formalize
• Optimize representations
(logical and physical)
Separation of Concerns
• Each of these components evolves over time
• Each may be modified independently by different
people with different goals
statistical
algorithms
output
data
Input data
visualizations
summary
tables
Abstract and automate data
management
Deciding and remembering how to name columns and files and
track changes over time is not what I’d like to spend time on
Especially since I’ll probably do it inconsistently with what I
decided to do last week
If the system is responsible for persisting data, caching and
memoization can be done automatically.
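The point about automatic caching can be sketched in a few lines. This is only an illustration under stated assumptions: the `cached` decorator, the pickle-on-disk store, and the `mean_rating` function are hypothetical names, not part of Grizzly.

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache directory; a real system would choose its own storage.
CACHE_DIR = tempfile.mkdtemp(prefix="analysis_cache_")
CALLS = {"count": 0}  # counts real executions, for illustration only


def cached(func):
    """Persist results on disk, keyed by function name and arguments."""
    def wrapper(*args):
        key_bytes = pickle.dumps((func.__name__, args))
        key = hashlib.sha256(key_bytes).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as fh:
                return pickle.load(fh)
        result = func(*args)
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


@cached
def mean_rating(ratings):
    CALLS["count"] += 1
    return sum(ratings) / len(ratings)


first = mean_rating((3, 4, 5))
second = mean_rating((3, 4, 5))  # served from disk; the body is not re-run
```

Because the system, not the analyst, decides where results live, nobody has to invent or remember a file name for the intermediate result.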
Logical and physical
representations matter
• Choice of representation and notation has a major effect
on ease and efficiency with which concepts can be
manipulated, by either a person or a computer
• Given our goals for an analysis system, and engineering
instinct to separate independent concerns, what are
optimal representations for
• data?
• analysis programs?
• visualizations and summary tables?
[Diagram: many input datasets and statistical algorithms producing many output datasets]
How do scientists actually think about
analyses?
Inspirations (and their deficiencies..)
1. OLAP (On-Line Analytical Processing) and MDX
(Multidimensional Expressions)
2. Tableau / Polaris
3. Scientific workflow systems
VisTrails, KNIME
Galaxy, Genepattern
1: OLAP
(on-line analytical processing)
2. Declarative Visualization Grammars
(Polaris/Tableau; Stolte 2003)
• key idea: declarative specification of visualizations is possible and works well
• recent focus has been on business analytics rather than statistical graphics
• assumes a static, structured database (i.e. OLAP star schema) Stolte 2000
3. Scientific Workflow Systems
VisTrails
Hypothesis
Careful design and selection of representations for data,
programs, and visualizations will make it possible to
satisfy our data analysis objectives:
• multidimensional cubes with static, semantic types
for conceptual representation of data
• directed acyclic graphs of functions with static,
multidimensional input and output type signatures
for our statistical programs
• declarative queries
to generate summary tables
• declarative visualization grammar
to generate graphics
(this is not how most researchers represent their analyses today)
Correctness
Thoroughness
Reproducibility
Verifiability
Clarity
Provenance
Interactivity
Computational Efficiency
Scientist Efficiency
Multidimensional Cubes
and OLAP
Semantic Types
Dataflow Programming
Core Ideas
Data consists of facts about the world.
1 5.5 3 3 4 5
2 6 2 3 2 2
3 8 5 5 4 4.5
ceci n’est pas data
Data consists of facts about the world.
BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5
Facts lie in specific domains defined by the
structure of the real world or experimental design
BeerID:  Integer (BeerAdvocate BeerID)
ABV:     float (%EtOH)
Smell:   ordinal (1-5), 5 is best
Color:   ordinal (1-5), 5 is best
Taste:   ordinal (1-5), 5 is best
Overall: ordinal (1-5), 5 is best

BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5
There are a number of possible representations;
logically but not practically equivalent
Wide form (BeerID: Integer; ABV: float, %EtOH; Smell, Color, Taste, Overall: ordinal 1-5, 5 is best):
BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5
≈
Long form:
BeerID  Measure  Value
1       ABV      5.5
1       Smell    3
1       Color    3
1       Taste    4
1       Overall  5
2       ABV      6
2       Smell    2
2       Color    3
2       Taste    2
2       Overall  2
3       ABV      8
3       Smell    5
3       Color    5
3       Taste    4
3       Overall  4.5
cf. pandas reshape, R reshape melt/cast
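The equivalence of the two layouts can be shown with a small round trip. The sketch below uses plain Python rather than pandas so that it has no dependencies; the `melt` and `cast` helpers are hypothetical stand-ins written for this example, and the values are the ones on the slide.

```python
# Wide form: one row per beer, one column per measure (values from the slide).
wide = [
    {"BeerID": 1, "ABV": 5.5, "Smell": 3, "Color": 3, "Taste": 4, "Overall": 5},
    {"BeerID": 2, "ABV": 6, "Smell": 2, "Color": 3, "Taste": 2, "Overall": 2},
    {"BeerID": 3, "ABV": 8, "Smell": 5, "Color": 5, "Taste": 4, "Overall": 4.5},
]


def melt(rows, id_col):
    """Wide -> long: one (id, measure, value) triple per cell."""
    return [
        (row[id_col], measure, value)
        for row in rows
        for measure, value in row.items()
        if measure != id_col
    ]


def cast(triples, id_col):
    """Long -> wide: regroup triples into one dict per id."""
    by_id = {}
    for key, measure, value in triples:
        by_id.setdefault(key, {id_col: key})[measure] = value
    return list(by_id.values())


long_form = melt(wide, "BeerID")
round_trip = cast(long_form, "BeerID")
```

The round trip loses nothing, which is the sense in which the forms are logically equivalent; which one is convenient depends on the computation.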
Data Representations
• Scientific / statistical data is usually in matrix format, and it must
be for efficient storage and computation
• Relational model is good for precisely encoding logical
structure of data, but
• moving between relations and matrices is cumbersome
• defining a relational schema for all intermediate data would
be a lot of work, especially as it changes over time
• on its own, the relational model does not explicitly represent
semantics and units
Conceptual Model:
OLAP Data Cubes
Cartesian product of a set of
dimensions (finite discrete sets)
defines an N-dimensional grid
A multidimensional dataset is a
function mapping locations in that
grid to typed values called
measures (identities of the
measures can also be considered as
just a special kind of dimension)
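As a rough sketch of this definition, a cube can be modeled as a dictionary keyed by locations in the grid. The dimension members, the expression values, and the `lookup` helper below are invented for illustration and are not Grizzly's representation.

```python
from itertools import product

# Dimensions are finite discrete sets; these members are made up.
dimensions = {
    "Gene": ["geneA", "geneB"],
    "Region": ["striatum", "cortex"],
    "Stage": ["E15", "P07", "P56"],
}

# The grid is the Cartesian product of the dimensions.
grid = list(product(*dimensions.values()))

# A cube maps grid locations to measures; only some cells are filled here.
expression = {
    ("geneA", "striatum", "E15"): 7.8,
    ("geneA", "cortex", "P56"): 2.3,
}


def lookup(cube, **coords):
    """Return the measure at a grid location, or None if the cell is empty."""
    key = tuple(coords[name] for name in dimensions)
    if key not in grid:
        raise KeyError(f"{key} is outside the cube's grid")
    return cube.get(key)
```

A call such as `lookup(expression, Gene="geneA", Region="striatum", Stage="E15")` addresses a cell by dimension name rather than by matrix position.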
[Figure: two example cubes. Dimensions Beer ID x UserID x Time, measure: overall beer rating. Dimensions Gene x Brain Region x Stage of Development, measure: log2 gene expression.]
Conceptual Model:
Data Cubes as functions mapping dimensions
to measures
# Pseudocode signatures: the returned names stand for the measures of each cube
def BeerRatingsByUser(UserID, BeerID):
    return (Taste, Smell, Color, Texture, Overall)

def BeerRatingsByBeer(BeerID):
    return (mean_Taste, mean_Smell, mean_Color, mean_Texture, mean_Overall)

def ExpressionBySample(Gene, Region, SampleID):
    return log2_expression

def ExpressionByRegionTime(Gene, Region, Timepoint):
    return (median_expression, mean_expression, std_deviation,
            median_abs_deviation, n_replicates)
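One way to read these signatures is that the by-beer cube is derived from the by-user cube by aggregating the UserID dimension away. A minimal sketch of that step follows, assuming cubes stored as dictionaries; the rating values and the `aggregate_out_user` helper are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# (UserID, BeerID) -> (Taste, Smell, Color, Texture, Overall); made-up values.
ratings_by_user = {
    ("u1", 1): (4, 3, 3, 4, 5),
    ("u2", 1): (2, 3, 5, 4, 3),
    ("u1", 2): (2, 2, 3, 2, 2),
}


def aggregate_out_user(cube):
    """Collapse the UserID dimension, leaving (BeerID) -> mean measures."""
    grouped = defaultdict(list)
    for (_user, beer), values in cube.items():
        grouped[beer].append(values)
    return {
        beer: tuple(mean(column) for column in zip(*rows))
        for beer, rows in grouped.items()
    }


ratings_by_beer = aggregate_out_user(ratings_by_user)
```

The output cube has one dimension fewer than the input, and each of its measures is the mean of the corresponding input measure.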
Hierarchies
Dimensions are related to each
other in structures that reflect:
• the nature of the world
• experimental methods
and designs
• analysis processes and
decisions
These hierarchical relationships are critical to understanding and
performing analyses, and need to be represented explicitly.
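An explicit hierarchy can be as simple as a child-to-parent mapping per level, which is enough to roll a measure up from beers to breweries to locations. The names, values, and `roll_up` helper below are invented for illustration.

```python
from collections import defaultdict

# Child -> parent links for one hierarchy (Beer -> Brewery -> Location).
# All names are invented for illustration.
brewery_of = {"beer1": "breweryA", "beer2": "breweryA", "beer3": "breweryB"}
location_of = {"breweryA": "Boston", "breweryB": "Boston"}

overall = {"beer1": 5, "beer2": 2, "beer3": 4.5}


def roll_up(values, parent_of):
    """Group child values under their parent and average them."""
    groups = defaultdict(list)
    for child, value in values.items():
        groups[parent_of[child]].append(value)
    return {parent: sum(vs) / len(vs) for parent, vs in groups.items()}


by_brewery = roll_up(overall, brewery_of)
# Note: this second step averages brewery means, not the underlying beers.
by_location = roll_up(by_brewery, location_of)
```

Whether a mean of means is the right summary is exactly the kind of analysis decision that an explicit hierarchy makes visible.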
Multidimensional Semantic Types
1970s / 80s: Semantic Database formalisms
Specify different kinds of relationships and interactions between objects
(eg. containment, is-a, relations / cross-products)
Overshadowed by ER model and later, UML..
1990s: OLAP
Dataflow
Lots of domains model computation as ‘declarative’ dataflows
circuit design
audio / video processing
Grizzly Computation Model
Directed Acyclic Graph of processing nodes
Inputs and outputs of every node are typed cubes
Function nodes add type information to describe their output dimensions
‘Apply’ nodes propagate to their outputs the types of any input dimensions
that they do not modify
Computation is declarative / intensional, not imperative; nodes
automatically process whatever is on their inputs, like an electrical circuit
[Diagram: the function node CalcMedianRatings and its use through an Apply node]
CalcMedianRatings:
  input:  (ReviewID, BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)
  output: (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)
Apply:
  input:  (ReviewID, BeerID, SourceID) --> (Appearance, Aroma, Palate, Taste, Overall)
  output: (SourceID, BeerID) --> (MedianAppearance, MedianAroma, MedianPalate, MedianTaste, MedianOverall)
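The type propagation in this example can be sketched at the level of signatures alone, without touching any data. The `CubeType`, `FunctionNode`, and `apply_type` names are hypothetical and written for this illustration; they are not Grizzly's API, and giving the function's own output the Median-prefixed measure names is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CubeType:
    dims: tuple      # dimension names
    measures: tuple  # measure names


@dataclass(frozen=True)
class FunctionNode:
    name: str
    input_type: CubeType
    output_type: CubeType


def apply_type(node, actual):
    """Infer the output type when `node` is applied to a wider cube.

    Dimensions the function does not mention pass through unchanged.
    """
    missing = set(node.input_type.dims) - set(actual.dims)
    if missing or actual.measures != node.input_type.measures:
        raise TypeError(f"{node.name} cannot accept {actual}")
    extra = tuple(d for d in actual.dims if d not in node.input_type.dims)
    return CubeType(extra + node.output_type.dims, node.output_type.measures)


RATINGS = ("Appearance", "Aroma", "Palate", "Taste", "Overall")
MEDIANS = tuple("Median" + m for m in RATINGS)

calc_median = FunctionNode(
    "CalcMedianRatings",
    CubeType(("ReviewID", "BeerID"), RATINGS),
    CubeType(("BeerID",), MEDIANS),
)

reviews = CubeType(("ReviewID", "BeerID", "SourceID"), RATINGS)
result = apply_type(calc_median, reviews)
```

The function never mentions SourceID, yet the inferred output is indexed by (SourceID, BeerID), which is the behavior described for ‘Apply’ nodes.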
Advantages of DAG representation
• Static type specifications allow precise and clear modeling /
design of an analysis pipeline before having to write all the
code needed to implement it
• Model can be turned into an actual working program, instead
of just being a schematic diagram
• Provenance tracking without extra instrumentation
• Memoization of intermediate results is easy because data
dependencies are already explicit
• Easier to understand, reason about, and explain to others
• Easier to track modification history as graph edits
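Provenance without extra instrumentation follows from the edges themselves: walking them upstream gives the lineage of any result, and walking them downstream gives what must be recomputed after a change. The graph and node names below are illustrative only.

```python
# Each node lists the nodes it reads from; names are illustrative.
depends_on = {
    "raw_reviews": [],
    "median_ratings": ["raw_reviews"],
    "rank_weights": [],
    "rank_scores": ["median_ratings", "rank_weights"],
    "summary_table": ["rank_scores"],
}


def provenance(node, graph):
    """Return every upstream node of `node`, in dependency order."""
    seen = []

    def visit(name):
        for parent in graph[name]:
            visit(parent)
        if name not in seen:
            seen.append(name)

    visit(node)
    return seen[:-1]  # drop the node itself


def downstream(node, graph):
    """Return the nodes that must be recomputed when `node` changes."""
    return sorted(n for n in graph if node in provenance(n, graph))


lineage = provenance("summary_table", depends_on)
stale = downstream("median_ratings", depends_on)
```

Here `lineage` traces the summary table back to the raw reviews, and `stale` lists the cached results to invalidate if the median calculation is edited.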
Syntactic Syrup: CubeApply
Takes cross-product of a set of input cubes /
vectors and applies function to all results
[Diagram: the function node BeerRank and its use through CubeApply]
BeerRank:
  inputs: (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall);
          (AppWeight, AromaWt, PalWt, TasteWt, OverallWt)
  output: (BeerID) --> (RankScore)
CubeApply:
  inputs: (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall);
          (RankModelID) --> (AppWt, AromaWt, PalWt, TasteWt, OverallWt)
  output: (BeerID, RankModelID) --> (RankScore)
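A sketch of the cross-product idea, under the assumption that BeerRank is a weighted sum of the five ratings (the slides do not define it). The `cube_apply` helper, the model names, and all values are hypothetical.

```python
from itertools import product

# (BeerID) -> (Appearance, Aroma, Palate, Taste, Overall); made-up values.
beers = {
    1: (3, 3, 3, 4, 5),
    2: (3, 2, 3, 2, 2),
}

# (RankModelID) -> one weight per rating measure; made-up values.
models = {
    "equal": (0.2, 0.2, 0.2, 0.2, 0.2),
    "taste_only": (0.0, 0.0, 0.0, 1.0, 0.0),
}


def beer_rank(ratings, weights):
    """Weighted sum of one beer's ratings: an assumed stand-in for BeerRank."""
    return sum(r * w for r, w in zip(ratings, weights))


def cube_apply(func, left, right):
    """Apply `func` to every pairing of cells from two input cubes."""
    return {
        (lk, rk): func(left[lk], right[rk])
        for lk, rk in product(left, right)
    }


# (BeerID, RankModelID) -> RankScore
rank_scores = cube_apply(beer_rank, beers, models)
```

The function is written for one beer and one weight vector; the cross-product supplies the extra RankModelID dimension of the output.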
Slicing, Dicing
Since semantic type data is always propagated, in principle we
can define the schema for any intermediate data (including
hierarchy structure) and make use of existing OLAP tools to run
declarative queries
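For readers unfamiliar with the OLAP terms, slicing fixes a dimension to one member and drops it, while dicing restricts dimensions to subsets of members. The sketch below implements both over a dictionary cube; the helpers and values are made up and stand in for what an existing OLAP tool would provide.

```python
# (BeerID, RankModelID) -> RankScore; made-up values.
DIMS = ("BeerID", "RankModelID")
cube = {
    (1, "equal"): 3.6, (1, "taste_only"): 4.0,
    (2, "equal"): 2.4, (2, "taste_only"): 2.0,
}


def slice_cube(cube, dims, **fixed):
    """Fix one or more dimensions to a single member and drop them."""
    keep = [i for i, d in enumerate(dims) if d not in fixed]
    out = {}
    for key, value in cube.items():
        if all(key[dims.index(d)] == m for d, m in fixed.items()):
            out[tuple(key[i] for i in keep)] = value
    return out


def dice_cube(cube, dims, **allowed):
    """Restrict dimensions to subsets of members, keeping all dimensions."""
    return {
        key: value for key, value in cube.items()
        if all(key[dims.index(d)] in ms for d, ms in allowed.items())
    }


equal_scores = slice_cube(cube, DIMS, RankModelID="equal")
beer_one = dice_cube(cube, DIMS, BeerID={1})
```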
Implementation
• Type system
• DAGs
• Execution
• Data Management
• Visualizations
• ...queries?
Requirements for a practical system
• Programmable and extensible, without requiring discontinuous
changes to existing habits
• OLAP systems not general enough; energy barrier to setting up
a ‘data warehouse’ for a particular scientific analysis is too
high; arbitrary, complex statistics not supported
• System must be deployable over the web, so analyses and
results can be easily shared with geographically dispersed
collaborators and the scientific community
• Free and open source
Current Support for Hierarchies in
Pandas
• Hierarchical dataframes only support ‘uniform’ hierarchies
• lots of real analysis requires comparisons across many
different types
• Metadata is unstructured
• can’t compute effectively on column names
• Manual management
• consistency of column naming and interpretation depends
entirely on programmer discipline
Simple Semantic Types over Pandas
['[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]],
["ct", "cp73"],
["mc", "bh"],
["st", "pval"],
["tt", "welch ttest"]]',
'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]],
["ct", "cp73"],
["mc", "nominal"],
["st", "pval"],
["tt", "student ttest"]]',
'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]],
["ct", "cp73"],
["mc", "bonf"],
["st", "pval"],
["tt", "student ttest"]]',
'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]],
["ct", "cp73"],
["mc", "bh"],
["st", "pval"],
["tt", "student ttest"]]',
'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]],
["ct", "cp73"],
["st", "pval"],
["tt", "levene"]]
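[Figure: the column index as a cross-product (X) of typed dimensions]
ct: CP73, CP101
tt: student ttest, welch ttest
st: pval, t-stat
mc: bonf, bh, nom
Cross-product (X) over: ct, tt, mc, cmp, st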
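The column labels shown above look like JSON-encoded lists of sorted [key, value] pairs. A sketch of how such labels could be built and queried follows; the `encode_label`, `decode_label`, and `select` helpers are hypothetical, and only the label format is taken from the slide.

```python
import json


def encode_label(**fields):
    """Serialize typed fields into a canonical string usable as a column name."""
    return json.dumps(sorted(fields.items()))


def decode_label(label):
    """Recover the typed fields from an encoded column name."""
    return {key: value for key, value in json.loads(label)}


def select(labels, **wanted):
    """Return the labels whose fields match every requested value."""
    return [
        label for label in labels
        if all(decode_label(label).get(k) == v for k, v in wanted.items())
    ]


columns = [
    encode_label(ct="cp73", mc="bh", st="pval", tt="welch ttest"),
    encode_label(ct="cp73", mc="bonf", st="pval", tt="student ttest"),
    encode_label(ct="cp73", mc="bh", st="pval", tt="student ttest"),
]

bh_columns = select(columns, mc="bh")
```

Sorting the fields before encoding makes the label canonical, so the same set of typed fields always yields the same column name and columns can be selected by type rather than by string matching.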
Temporal Graph Database
• Canonical representations for types, ‘programs’, and pointers to data
are all typed property graphs (DAGs) that can hold JSON payloads
• All edit history to the graph is recorded, so the user can rewind /
replay and branch
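Rewind, replay, and branch all fall out of keeping the graph as an append-only log of edits. The class below is a minimal in-memory sketch of that idea; `TemporalGraph` and its methods are hypothetical names, not the GZDB interface.

```python
class TemporalGraph:
    """Append-only log of graph edits; any past state can be rebuilt."""

    def __init__(self, log=None):
        self.log = list(log or [])

    def add_node(self, node_id, payload):
        self.log.append(("add_node", node_id, payload))

    def add_edge(self, src, dst):
        self.log.append(("add_edge", src, dst))

    def state_at(self, version=None):
        """Replay the first `version` edits (all of them if None)."""
        nodes, edges = {}, set()
        for op, a, b in self.log[:version]:
            if op == "add_node":
                nodes[a] = b
            else:
                edges.add((a, b))
        return nodes, edges

    def branch(self, version):
        """Start a new history that shares the first `version` edits."""
        return TemporalGraph(self.log[:version])


g = TemporalGraph()
g.add_node("ratings", {"type": "cube"})
g.add_node("median", {"type": "function"})
g.add_edge("ratings", "median")

alt = g.branch(2)  # rewind to before the edge was added
alt.add_node("mean", {"type": "function"})
```

The branch diverges from version 2 onward while the original history stays intact, so alternative analyses can be explored without overwriting earlier ones.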
Generic Visualization Components
to compose visualizations & reports
Architecture Overview
GZDB
Graph
Editor
Grizzly Webapp
SQLAlchemy
Postgres
IPython
Pandas
HTML Viz
Widgets
GZData
GZFlow
CherryPy
D3, SlickGrid, Flot, jsPlumb
Filesystem
Biological Applications
Bio Example 1: Striatal Gene
Expression w. L-DOPA
Summary tables
Drilldown and provenance from summary tables to primary data
Drilldown from summary to statistical
tables
Drilldown from statistical tables to plots
of primary data
Bio Example 2: Complex,
interactive visualizations:
BOMBASTIC
Subspace clustering of time-series data
A. Define blocks and an ordering
B. Cluster each block
independently
C. Represent resulting clusters in a
tree and explore/filter interactively
Each (predefined) subspace
has unique information; we
want to understand patterns
both within and between
blocks
Summary
Increasing complexity of biological data presents critical
requirements for better systems for collaborative analysis of high-
dimensional, multi-factor, dynamic data
A dataflow computation model with semantic, multidimensional
types offers significant advantages for meeting these requirements
Grizzly defines a simple, formal model for multidimensional data and
DAGs of operations on that data, adapting and combining ideas
from OLAP, declarative visualization, and dataflow programming.
Proof-of-concept implementation in python establishes feasibility
Applications to analysis of real biological experiments (PD, Neuro,
Cancer) will evaluate practical utility and benefits
Correctness
Thoroughness
Reproducibility
Verifiability
Clarity
Provenance
Interactivity
Computational Efficiency
Scientist Efficiency
Acknowledgements: Software
• IPython
• NumPy
• Pandas
• Statsmodels
• Patsy
• CherryPy
• SQLAlchemy
• postgres
• NetworkX
• igraph
• backbone
• underscore
• jsPlumb
• flot
• D3.js
Acknowledgements
@adrian_h
http://www.grizzly.io

 
The challenges of Analytical Data Management in R&D
The challenges of Analytical Data Management in R&DThe challenges of Analytical Data Management in R&D
The challenges of Analytical Data Management in R&D
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabad
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Agile development of data science projects | Part 1
Agile development of data science projects | Part 1
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data science
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science
 

Dernier

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Dernier (20)

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

grizzly - informal overview - pydata boston 2013

  • 1. grizzly statistical analysis with multidimensional dataflows in python Adrian Heilbut Boston University and Broad Institute http://www.empiricist.ca (graphs for reproducible interactive visualization and analysis) PyData Boston 2013
  • 2. 1. Motivation: Biological discovery from complex, multidimensional data; common features of complex biological data and analyses. 2. Problems and Goals: Reproducible, efficient, elegant, collaborative, interactive analysis; data + analysis evolving over time. 3. Toy Dataset: A simple dataset with hierarchical and temporal structure. 4. Strategies: Separate concerns; represent types and structure explicitly; abstract away data management; formalize. 5. Inspirations: OLAP and data cube models; declarative visualization grammars; scientific workflow systems. 6. Core Ideas: Dataflows + Temporal Graphs + Multidimensional Types + Syntactic syrup. 7. Toy Demos. 8. Implementation. 9. Biology application: Mechanisms of drug side effects in Parkinson’s Disease. 10. Summary and Conclusions.
  • 3. Motivation • Common and unique features of scientific data • Examples of complex datasets and analyses in computational biology • Data analysis desiderata Motivation Problem & Goals Toy Dataset Strategies Inspirations Core Ideas Implementation Demo Biological Application
  • 4. Biological data is increasingly complex; Many datasets and analyses share common structural features • High-dimensional measurements • Longitudinal / time-course measurements • Hierarchical structure of dimensions • Multiple modalities (expression, protein concentration, phosphorylation) • Complex experimental designs • Complex analysis designs • Complex pre-processing pipelines • Many parameter choices • Many cell types • Many treatments • Many organisms • Many patients • Many replicates
  • 5. Ex 1. Cancer Profiling and Signatures: Cancer Cell Line Encyclopedia (CCLE), Broad / Novartis, Barretina 2012. 1000 cell lines; expression for 20,000 genes; mutation status; drug response
  • 6. [Figure: time-course design. Time points P0, P07, P12, P18, P21, P56; P0, P07, P15, P21; E0, E11, E15, E18. Phases: proliferation, differentiation, migration & patterning. 3 reps, 40k probes]
  • 7. Saline Acute (9) Low Dose Levodopa Chronic (12) Saline Chronic (11) 6-OHDA Ascorbate Day 1 Expression + AIM CP73 Day 8 Expression + AIM High Dose Levodopa Acute (10) High Dose Levodopa Chronic (11) Saline Chronic (10) Low Levodopa Chronic (8) Saline Chronic (7) 6-OHDA Ascorbate CP101 Day 8 Expression + AIM High Levodopa Chronic (8) Saline Chronic (10) Change in Expression between treatment groups Expression vs. AIM (correlation) within treatment groups / cell types Statistics (per gene) Expression vs. AIM (correlation) within combined treatment groups ~ 23,000 x 200 matrix of stats for different contrasts between groups
  • 8. Unique characteristics of scientific data • Relatively short half-life of data and projects • Uncertain and complex analysis methods • Constantly changing data • Lots of internal and external structure over dimensions • Teams with diverse backgrounds and skills over multiple institutions and locations • Communication of data is a primary goal • High-risk and high-value outcomes: project selection / experimental follow-up, clinical decisions. The distinctive characteristics, uses, and problems of scientific data analysis motivate the need for tailored abstractions and tools
  • 9. Desiderata for Data Analysis • Correctness • Thoroughness (scientific hypothesis space + analysis space) • Reproducibility • Verifiability (analysis and underlying data, others and oneself) • Clarity • Provenance (of the data, and of the analysis) • Interactivity (Exploration, Drill-down) • Computational Efficiency • Scientist Efficiency
  • 10. Vision: Every figure, every table, and every quantitative claim in a scientific analysis or publication should be verifiable and explorable; it should link to an understandable, executable, modifiable representation of the data analysis pipeline by which it was generated; one should be able to trace back all the way to the primary experimental data; it should be easy and fun to play with
  • 11. Problems and Goals Errors have serious consequences Practical problems in day-to-day analysis Unmet need for better tools
  • 13. Mistakes even happen in Cambridge... Reinhart / Rogoff; Herndon, Ash, Pollin. [Figure panels: Original, Correct]
  • 14. It’s even worse than it appears... (Kimball, 2013). The ability to easily drill down to view and assess the underlying data is critical
  • 15. Elements of statistical analysis: input data, statistical algorithms, output data, visualizations, summary tables
  • 16. Version 2. [Diagram: many input datasets feeding several statistical algorithms, which produce many output datasets]
  • 19. Toy Dataset Multidimensional profiling of fermentation metabolites of S. cerevisiae
  • 20. Toy Dataset: Multidimensional profiling of fermentation metabolites of S. cerevisiae. Beer ratings from BeerAdvocate.com & RateBeer.com, via Stanford SNAP & a very kind blogger. Multidimensional: Appearance, Aroma, Palate, Taste, Overall. Hierarchies: Location -> Brewery -> Beer; Beer style -> Beer. Temporal.
  • 21. Strategies • Separate concerns • Abstract away data management problems • Formalize • Optimize representations (logical and physical)
  • 22. Separation of Concerns • Each of these components evolves over time • Each may be modified independently by different people with different goals. Components: input data, statistical algorithms, output data, visualizations, summary tables
  • 23. Abstract and automate data management: Deciding and remembering how to name columns and files and track changes over time is not what I’d like to spend time on, especially since I’ll probably do it inconsistently with what I decided to do last week. If the system is responsible for persisting data, caching and memoization can be done automatically.
  • 24. Logical and physical representations matter • Choice of representation and notation has a major effect on ease and efficiency with which concepts can be manipulated, by either a person or a computer • Given our goals for an analysis system, and engineering instinct to separate independent concerns, what are optimal representations for • data? • analysis programs? • visualizations and summary tables?
  • 26. Inspirations (and their deficiencies...) 1. OLAP (On-Line Analytical Processing) and MDX (Multidimensional Expressions) 2. Tableau / Polaris 3. Scientific workflow systems: VisTrails, KNIME, Galaxy, GenePattern
  • 28. 2. Declarative Visualization Grammars (Polaris/Tableau; Stolte 2003) • key idea: declarative specification of visualizations is possible and works well • recent focus has been on business analytics, rather than statistical graphics • assumes a static, structured database (i.e. OLAP star schema) Stolte 2000
  • 29. 3. Scientific Workflow Systems VisTrails
  • 30. Hypothesis: Careful design and selection of representations for data, programs, and visualizations will make it possible to satisfy our data analysis objectives: • multidimensional cubes with static, semantic types for conceptual representation of data • directed acyclic graphs of functions with static, multidimensional input and output type signatures for our statistical programs • declarative queries to generate summary tables • declarative visualization grammar to generate graphics (this is not how most researchers represent their analyses today). Objectives: Correctness, Thoroughness, Reproducibility, Verifiability, Clarity, Provenance, Interactivity, Computational Efficiency, Scientist Efficiency
  • 31. Multidimensional Cubes and OLAP Semantic Types Dataflow Programming Core Ideas
  • 32. Data consists of facts about the world. 1 5.5 3 3 4 5 2 6 2 3 2 2 3 8 5 5 4 4.5 ceci n’est pas data
  • 33. Data consists of facts about the world. Columns (BeerID, ABV, Smell, Color, Taste, Overall); rows: (1, 5.5, 3, 3, 4, 5); (2, 6, 2, 3, 2, 2); (3, 8, 5, 5, 4, 4.5)
  • 34. Facts lie in specific domains defined by the structure of the real world or experimental design. BeerID: integer (BeerAdvocate BeerID); ABV: float (% EtOH); Smell, Color, Taste, Overall: ordinal (1-5), 5 is best. Rows: (1, 5.5, 3, 3, 4, 5); (2, 6, 2, 3, 2, 2); (3, 8, 5, 5, 4, 4.5)
  • 35. There are a number of possible representations; logically but not practically equivalent. Wide form, columns (BeerID, ABV, Smell, Color, Taste, Overall) with the same types as above; rows: (1, 5.5, 3, 3, 4, 5); (2, 6, 2, 3, 2, 2); (3, 8, 5, 5, 4, 4.5). ≈ Long form, columns (BeerID, Measure, Value); rows: (1, ABV, 5.5); (1, Smell, 3); (1, Color, 3); (1, Taste, 4); (1, Overall, 5); (2, ABV, 6); (2, Smell, 2); (2, Color, 3); (2, Taste, 2); (2, Overall, 2); (3, ABV, 8); (3, Smell, 5); (3, Color, 5); (3, Taste, 4); (3, Overall, 4.5). cf. pandas reshape, plyr melt/cast
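The wide and long layouts on slide 35 can be converted into each other mechanically. The following is a minimal sketch using pandas `melt` and `pivot` with the values shown on the slide; the variable names are illustrative and not part of Grizzly.

```python
import pandas as pd

# Wide form: one row per beer, one column per measure (values from slide 35).
wide = pd.DataFrame(
    {
        "BeerID": [1, 2, 3],
        "ABV": [5.5, 6.0, 8.0],
        "Smell": [3, 2, 5],
        "Color": [3, 3, 5],
        "Taste": [4, 2, 4],
        "Overall": [5.0, 2.0, 4.5],
    }
)

# Long form: one row per (BeerID, Measure) pair.
long = wide.melt(id_vars="BeerID", var_name="Measure", value_name="Value")

# Pivot back to the wide layout; both forms carry the same facts.
roundtrip = (
    long.pivot(index="BeerID", columns="Measure", values="Value")
    .reset_index()[list(wide.columns)]
)
```

The round trip loses the per-column types (everything in `Value` becomes a float), which is one practical way the two representations are not equivalent.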
  • 36. Data Representations • Scientific / statistical data is usually in matrix format, and it must be for efficient storage and computation • The relational model is good for precisely encoding the logical structure of data, but • moving between relations and matrices is cumbersome • defining a relational schema for all intermediate data would be a lot of work, especially as it changes over time • on its own, the relational model does not explicitly represent semantics and units
  • 37. Conceptual Model: OLAP Data Cubes. The Cartesian product of a set of dimensions (finite discrete sets) defines an N-dimensional grid. A multidimensional dataset is a function mapping locations in that grid to typed values called measures (identities of the measures can also be considered as just a special kind of dimension). [Figure: two example cubes. Dimensions (BeerID, UserID, Time) with measure: overall beer rating; dimensions (Gene, Brain Region, Stage of Development) with measure: log2 gene expression]
  • 38. Conceptual Model: Data Cubes as functions mapping dimensions to measures. def BeerRatingsByUser(UserID, BeerID): return (Taste, Smell, Color, Texture, Overall); def BeerRatingsByBeer(BeerID): return (mean Taste, mean Smell, mean Color, mean Texture, mean Overall); def ExpressionBySample(Gene, Region, SampleID): return (log2 expression); def ExpressionByRegionTime(Gene, Region, Timepoint): return (median expression, mean expression, std deviation, median abs deviation, # replicates)
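The pseudocode on slide 38 can be made concrete. This is a hedged sketch that treats a pandas DataFrame indexed by its dimensions as the cube and wraps lookups in functions; the rating values are made up for illustration, and only two of the slide's measures are kept for brevity.

```python
import pandas as pd

# Hypothetical ratings; dimension and measure names follow the slide, values are invented.
ratings = pd.DataFrame(
    {
        "UserID": [10, 10, 11, 11],
        "BeerID": [1, 2, 1, 2],
        "Taste": [4, 2, 5, 3],
        "Overall": [5, 2, 4, 3],
    }
).set_index(["UserID", "BeerID"])


def beer_ratings_by_user(user_id, beer_id):
    """Cube as a function: (UserID, BeerID) -> (Taste, Overall)."""
    row = ratings.loc[(user_id, beer_id)]
    return (row["Taste"], row["Overall"])


# Aggregating away the UserID dimension yields a new cube over BeerID alone.
by_beer = ratings.groupby(level="BeerID").mean()


def beer_ratings_by_beer(beer_id):
    """Cube as a function: (BeerID) -> (mean Taste, mean Overall)."""
    row = by_beer.loc[beer_id]
    return (row["Taste"], row["Overall"])
```

Calling `beer_ratings_by_beer(1)` averages over the two users who rated beer 1, which mirrors how `BeerRatingsByBeer` on the slide drops the `UserID` dimension.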
  • 39. Hierarchies Dimensions are related to each other in structures that reflect: • the nature of the world • experimental methods and designs • analysis processes and decisions These hierarchical relationships are critical to understanding and performing analyses, and need to be represented explicitly.
  • 40. Multidimensional Semantic Types 1970s / 80s: Semantic Database formalisms Specify different kinds of relationships and interactions between objects (eg. containment, is-a, relations / cross-products) Overshadowed by ER model and later, UML.. 1990s: OLAP
  • 41. Dataflow Lots of domains model computation as ‘declarative’ dataflows circuit design audio / video processing
  • 42. Grizzly Computation Model: Directed Acyclic Graph of processing nodes. Inputs and outputs of every node are typed cubes. Function nodes add type information to describe their output dimensions. ‘Apply’ nodes propagate to their outputs the types of any input dimensions that they do not modify. Computation is declarative / intensional, not imperative; nodes automatically process whatever is on their inputs, like an electrical circuit. Example: the function node CalcMedianRatings maps (ReviewID, BeerID) --> (Appearance, Aroma, Palate, Taste, Overall) to (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall); wrapped in an Apply node, the input (ReviewID, BeerID, SourceID) --> (Appearance, Aroma, Palate, Taste, Overall) yields (SourceID, BeerID) --> (MedianAppearance, MedianAroma, MedianPalate, MedianTaste, MedianOverall)
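The type propagation described on slide 42 can be sketched in a few lines. This is an illustration under stated assumptions, not Grizzly's implementation: cubes are plain dicts from coordinate tuples to a single measure value, and `CubeType`, `FunctionNode`, and `apply_node` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Dict, Tuple

Cube = Dict[tuple, float]  # coordinates -> one measure value (scalar for brevity)


@dataclass(frozen=True)
class CubeType:
    dims: Tuple[str, ...]
    measure: str


@dataclass(frozen=True)
class FunctionNode:
    in_type: CubeType
    out_type: CubeType
    fn: Callable[[Cube], Cube]


def calc_median(cube):
    """(ReviewID, BeerID) -> Overall becomes (BeerID) -> MedianOverall."""
    by_beer = {}
    for (_review, beer), value in cube.items():
        by_beer.setdefault(beer, []).append(value)
    return {(beer,): median(vals) for beer, vals in by_beer.items()}


CALC_MEDIAN = FunctionNode(
    CubeType(("ReviewID", "BeerID"), "Overall"),
    CubeType(("BeerID",), "MedianOverall"),
    calc_median,
)


def apply_node(node, in_type, cube):
    """Run node once per value of the dimensions it does not declare,
    and carry those dimensions through to the output type."""
    extra_pos = [i for i, d in enumerate(in_type.dims) if d not in node.in_type.dims]
    known_pos = [in_type.dims.index(d) for d in node.in_type.dims]
    slices = {}
    for coords, value in cube.items():
        extra_key = tuple(coords[i] for i in extra_pos)
        inner_key = tuple(coords[i] for i in known_pos)
        slices.setdefault(extra_key, {})[inner_key] = value
    out = {}
    for extra_key, sub in slices.items():
        for coords, value in node.fn(sub).items():
            out[extra_key + coords] = value
    extra_dims = tuple(in_type.dims[i] for i in extra_pos)
    return CubeType(extra_dims + node.out_type.dims, node.out_type.measure), out


# Input carries an extra SourceID dimension that calc_median knows nothing about.
reviews_type = CubeType(("ReviewID", "BeerID", "SourceID"), "Overall")
reviews = {(1, 7, "BA"): 4.0, (2, 7, "BA"): 5.0, (3, 7, "RB"): 3.0, (4, 8, "RB"): 2.0}
out_type, medians = apply_node(CALC_MEDIAN, reviews_type, reviews)
```

Here `out_type.dims` comes out as `("SourceID", "BeerID")`, matching the output signature on the slide, without `calc_median` having to mention `SourceID` at all.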
  • 43. Advantages of DAG representation • Static type specifications allow precise and clear modeling / design of an analysis pipeline before having to write all the code needed to implement it • Model can be turned into an actual working program, instead of just being a schematic diagram • Provenance tracking without extra instrumentation • Memoization of intermediate results is easy because data dependencies are already explicit • Easier to understand, reason about, and explain to others • Easier to track modification history as graph edits
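To illustrate why explicit dependencies make memoization easy, here is a small sketch of a DAG runner that evaluates nodes in topological order and skips any node whose result is already cached. It assumes Python 3.9+ for `graphlib`; the node names and the `run_dag` helper are hypothetical, not part of Grizzly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(nodes, deps, cache):
    """nodes: name -> function; deps: name -> list of upstream node names.
    Results are stored in cache and reused on later runs."""
    for name in TopologicalSorter(deps).static_order():
        if name not in cache:
            args = [cache[d] for d in deps.get(name, [])]
            cache[name] = nodes[name](*args)
    return cache


calls = []  # records which nodes actually executed


def load():
    calls.append("load")
    return [4.0, 5.0, 3.0]


def mean(xs):
    calls.append("mean")
    return sum(xs) / len(xs)


def spread(xs):
    calls.append("spread")
    return max(xs) - min(xs)


nodes = {"load": load, "mean": mean, "spread": spread}
deps = {"load": [], "mean": ["load"], "spread": ["load"]}

cache = {}
run_dag(nodes, deps, cache)
run_dag(nodes, deps, cache)  # second run finds everything cached and calls nothing
```

A real system would also need to invalidate a node and everything downstream of it when the node or its inputs change; that step is omitted here.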
  • 44. Syntactic Syrup: CubeApply. Takes the cross-product of a set of input cubes / vectors and applies a function to all results. Example: BeerRank takes ratings (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall) and weights (AppWeight, AromaWt, PalWt, TasteWt, OverallWt) and returns (BeerID) --> (RankScore). With CubeApply over the ratings cube (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall) and the weights cube (RankModelID) --> (AppWt, AromaWt, PalWt, TasteWt, OverallWt), the result is (BeerID, RankModelID) --> (RankScore)
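A minimal sketch of the cross-product behaviour described on slide 44, assuming cubes are dicts from coordinate tuples to measure tuples. The function `cube_apply`, the weight model names, and the numbers are illustrative assumptions; three measures are used instead of five to keep it short.

```python
from itertools import product


def cube_apply(fn, *cubes):
    """Apply fn to every combination of cells drawn from the input cubes.
    Output coordinates are the concatenation of the input coordinates."""
    out = {}
    for combo in product(*(c.items() for c in cubes)):
        coords = tuple(k for key, _ in combo for k in key)
        out[coords] = fn(*(values for _, values in combo))
    return out


def beer_rank(ratings, weights):
    """Weighted sum of one beer's measures under one rank model."""
    return sum(r * w for r, w in zip(ratings, weights))


# (BeerID) -> (Aroma, Taste, Overall); invented values.
ratings = {(1,): (3, 4, 5), (2,): (2, 2, 2)}
# (RankModelID) -> (AromaWt, TasteWt, OverallWt); invented models.
weights = {("equal",): (1.0, 1.0, 1.0), ("taste_only",): (0.0, 1.0, 0.0)}

# Result is a cube over (BeerID, RankModelID) -> RankScore.
scores = cube_apply(beer_rank, ratings, weights)
```

The scoring function itself stays unaware of `RankModelID`; the extra dimension appears only because a second cube was supplied.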
  • 45. Slicing, Dicing Since semantic type data is always propagated, in principle we can define the schema for any intermediate data (including hierarchy structure) and make use of existing OLAP tools to run declarative queries
  • 46. Implementation • Type system • DAGs • Execution • Data Management • Visualizations • ...queries?
  • 47. Requirements for a practical system • Programmable and extensible, without requiring discontinuous changes to existing habits • OLAP systems not general enough; energy barrier to setting up a ‘data warehouse’ for a particular scientific analysis is too high; arbitrary, complex statistics not supported • System must be deployable over the web, so analyses and results can be easily shared with geographically dispersed collaborators and the scientific community • Free and open source
  • 48. Current Support for Hierarchies in Pandas • Hierarchical dataframes only support ‘uniform’ hierarchies • lots of real analysis requires comparisons across many different types • Metadata is unstructured • can’t compute effectively on column names • Manual management • consistency of column naming and interpretation depends entirely on programmer discipline
  • 49. Simple Semantic Types over Pandas ['[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bh"], ["st", "pval"], ["tt", "welch ttest"]]', '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "nominal"], ["st", "pval"], ["tt", "student ttest"]]', '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bonf"], ["st", "pval"], ["tt", "student ttest"]]', '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bh"], ["st", "pval"], ["tt", "student ttest"]]', '[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["st", "pval"], ["tt", "levene"]]'] [Figure: dimension values. ct: CP73, CP101; tt: student ttest, welch ttest; st: pval, t-stat; mc: bonf, bh, nom; cube dimensions: ct, tt, mc, cmp, st]
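The column labels on slide 49 look like JSON-encoded, sorted lists of (dimension, value) pairs. Assuming that encoding, the sketch below shows how such labels could be produced, decoded, and used to select columns by dimension value. The helpers `encode`, `decode`, and `select`, and the p-values, are hypothetical.

```python
import json

import pandas as pd


def encode(tags):
    """Serialize a dict of dimension -> value as a sorted JSON list of pairs."""
    return json.dumps(sorted(tags.items()))


def decode(name):
    """Recover the dimension -> value dict from an encoded column name."""
    return dict(json.loads(name))


def select(df, **where):
    """Keep columns whose decoded tags match every given dimension=value."""
    keep = [
        c for c in df.columns
        if all(decode(c).get(k) == v for k, v in where.items())
    ]
    return df[keep]


cols = [
    encode({"ct": "cp73", "st": "pval", "tt": "welch ttest", "mc": "bh"}),
    encode({"ct": "cp73", "st": "pval", "tt": "student ttest", "mc": "bonf"}),
    encode({"ct": "cp101", "st": "pval", "tt": "student ttest", "mc": "bh"}),
]
# One gene, three statistics columns; the numbers are invented.
stats = pd.DataFrame([[0.01, 0.20, 0.03]], index=["geneA"], columns=cols)

student = select(stats, tt="student ttest")
```

Because the tags are structured rather than embedded in free-form names, selection does not depend on a naming convention being followed by hand.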
  • 50. Temporal Graph Database • Canonical representation for types, ‘programs’, and pointers to data are all as typed property graphs (DAGs) that can hold JSON payloads • All edit history to the graph is recorded, so user can rewind / replay and branch
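The rewind / replay / branch behaviour on slide 50 can be pictured as an append-only edit log from which any past state is rebuilt. The class below is a sketch of that idea only; `TemporalGraph` and its methods are hypothetical and say nothing about how GZDB actually stores graphs.

```python
class TemporalGraph:
    """Append-only edit log; any past version is rebuilt by replaying a prefix."""

    def __init__(self, log=None):
        self.log = list(log or [])

    def add_node(self, node_id, payload):
        self.log.append(("add_node", node_id, payload))

    def add_edge(self, src, dst):
        self.log.append(("add_edge", src, dst))

    def remove_node(self, node_id):
        self.log.append(("remove_node", node_id, None))

    def at(self, version=None):
        """Replay the first `version` edits (all edits if None)."""
        nodes, edges = {}, set()
        for op, a, b in self.log[:version]:
            if op == "add_node":
                nodes[a] = b  # payload is a JSON-like dict
            elif op == "add_edge":
                edges.add((a, b))
            elif op == "remove_node":
                nodes.pop(a, None)
                edges = {e for e in edges if a not in e}
        return nodes, edges

    def branch(self, version):
        """Start a new history that shares the first `version` edits."""
        return TemporalGraph(self.log[:version])


g = TemporalGraph()
g.add_node("ratings", {"type": "cube"})
g.add_node("median", {"type": "function"})
g.add_edge("ratings", "median")
g.remove_node("median")

now_nodes, now_edges = g.at()      # current state
past_nodes, past_edges = g.at(3)   # rewind to before the removal
alt = g.branch(3)                  # branch from that earlier version
alt.add_node("mean", {"type": "function"})
```

Replaying from the start is linear in the length of the log; a practical store would likely add snapshots or indexes, which are out of scope for this sketch.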
  • 51. Generic Visualization Components to compose visualizations & reports
  • 52. Architecture Overview. Components: GZDB, Graph Editor, Grizzly Webapp, SQLAlchemy, Postgres, IPython, Pandas, HTML Viz Widgets, GZData, GZFlow, CherryPy, D3, SlickGrid, Flot, jsPlumb, Filesystem
  • 54. Bio Example 1: Striatal Gene Expression w. L-DOPA Summary tables Drilldown and provenance from summary tables to primary data
  • 55. Drilldown from summary to statistical tables
  • 56. Drilldown from statistical tables to plots of primary data
  • 57. Bio Example 2: Complex, interactive visualizations: BOMBASTIC Subspace clustering of time-series data A. Define blocks and an ordering B. Cluster each block independently C. Represent resulting clusters in a tree and explore/filter interactively Each (predefined) subspace has unique information; we want to understand patterns both within and between blocks
  • 59. Summary: Increasing complexity of biological data presents critical requirements for better systems for collaborative analysis of high-dimensional, multi-factor, dynamic data. A dataflow computation model with semantic, multidimensional types offers significant advantages for meeting these requirements. Grizzly defines a simple, formal model for multidimensional data and DAGs of operations on that data, adapting and combining ideas from OLAP, declarative visualization, and dataflow programming. A proof-of-concept implementation in Python establishes feasibility. Applications to analysis of real biological experiments (PD, Neuro, Cancer) will evaluate practical utility and benefits. Objectives: Correctness, Thoroughness, Reproducibility, Verifiability, Clarity, Provenance, Interactivity, Computational Efficiency, Scientist Efficiency
  • 60. Acknowledgements: Software • IPython • NumPy • Pandas • Statsmodels • Patsy • CherryPy • SQLAlchemy • postgres • NetworkX • igraph • backbone • underscore • jsPlumb • flot • D3.js