Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration
1. Creating an Urban Legend:
A System for Electrophysiology Data
Management and Exploration
Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
3. Life is complicated!
1. Interspecies variability > A specimen is not a species!
2. Gene expression variability > Knowing genes is not
knowing how they are expressed!
3. Microbiome > An animal is an ecosystem!
4. Systems biology > Whole is more than the sum of its parts!
5. Models vs. experiment > Are we talking about the same
things? In a way we can all use?
6. Dynamics > Life is not in equilibrium!
=> Reductionism doesn’t
work for living systems!
http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg
4. Statistics could help…
With enough observations, trends and anomalies can be
detected:
• “Here we present resources from a population of 242
healthy adults sampled at 15 or 18 body sites up to three
times, which have generated 5,177 microbial taxonomic
profiles from 16S ribosomal RNA genes and over 3.5
terabases of metagenomic sequence so far.”
The Human Microbiome Project Consortium, Structure, function and diversity of
the healthy human microbiome, Nature 486, 207–214 (14 June 2012)
doi:10.1038/nature11234
• “The large sample size — 4,298 North Americans of
European descent and 2,217 African Americans — has
enabled the researchers to mine down into the human
genome.”
Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing
study emphasizes importance of rare variants in disease.
5. …but biological research is insular.
• Biology is small: size 10^-5 – 10^2 m,
scientist can work alone (‘King’ and
‘subjects’).
• Biology is messy: it doesn’t
happen behind a terminal.
• Biology is competitive: many
people with similar skill sets,
vying for the same grants
• In summary: the structure of biological
research does not inherently promote
collaboration (vs., for instance, HE physics or
astronomy (and they’re not all they’re cracked up to be,
either…)).
[Figure: research cycle with the stages Prepare, Observe, Analyze, Ponder, Communicate]
6. What if we could connect experiments?
Across labs, experiments:
track reagents and how
they are used
[Figure: connected experiment cycles with the stages Prepare, Observations, Analyze, Communicate]
7. What if we could connect experiments?
Compare outcome of
interactions with these
entities
[Figure: connected experiment cycles with the stages Prepare, Observations, Analyze, Communicate]
8. What if we could connect experiments?
Build a ‘virtual reagent
spectrogram’ by comparing
how different entities
interacted in different
experiments
Reason collectively!
[Figure: connected experiment cycles with the stages Prepare, Observations, Analyze, Think, Communicate]
9. Research Data Management today:
Using antibodies
and squishy bits
Grad Students experiment
and enter details into their
lab notebook.
The PI then tries to make
sense of their slides,
and writes a paper.
End of story.
10. An Urban Legend is born:
• How can we make a standard neuroscience
wet lab more data-sharing savvy?
• Incorporate structured workflows into the daily
practice of a typical electrophysiology lab (the
Urban Lab at CMU)
– What does it take?
– Where are points of conflict?
• 1-year pilot, funded by Elsevier RDS:
– CMU: Shreejoy Tripathy, manage/user test
– Elsevier: development, UI, project management
11. Goal: Enable Effective data sharing:
• Effective data sharing = “someone who is not the
person who collected the data can understand the
experiment and data” (Shreejoy’s definition)
– So datasets should be more or less self-describing
– > 90% of data sharing use cases are an experimentalist
sharing data with a future version of herself or with a
labmate
• Not just the experimental data file, but also the
experimental metadata:
– What was done? What does this variable mean?
– This is usually stored in paper lab notebooks,
understandable only by the experimenter
12. Main Assumptions:
1. Effective data sharing
includes raw data files +
experimental metadata
(typically stored in a lab
notebook)
2. You know most about an
experiment while you’re
performing it
3. Improved data practices can
make labs more productive
and more creative
SDB_MC_12_voltages.mat
15. Data integration:
• Syncing of metadata
app and
electrophysiology data
acquisition via server
• Each trace of
experimental data
annotated with
metadata
• Currently IGOR Pro specific;
support for pClamp and other
acquisition packages to be
added later as needed
17. Semantic Integration:
The entity table uses a scope field and
an attributes field to create
a NoSQL-like, hierarchical
key/value structure in
PostgreSQL with the built-in
hstore extension.
Ontology tables (normalized
SQL) map
keys, values & scopes to
ontology information.
Entity
ID : UUID
Investigator : references investigators
table
created : timestamp
last_modified : timestamp
scope : string ~ /[A-Z]\d+(::[A-Z]\d+)*/
attributes : hstore (string → string
mapping)
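A minimal Python sketch of the Entity record listed above, as an in-memory stand-in rather than the project's actual PostgreSQL schema. The class, the field defaults, and the example values are assumptions for illustration; only the field names and the scope format (a letter followed by digits, chained with "::") come from the slide.

```python
import re
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Scope format from the slide: a capital letter plus digits, chained with "::".
SCOPE_PATTERN = re.compile(r"[A-Z]\d+(::[A-Z]\d+)*")


def _now():
    """Current UTC time, used for the two timestamp fields."""
    return datetime.now(timezone.utc)


@dataclass
class Entity:
    """Illustrative stand-in for one row of the entity table."""
    investigator: str   # stands in for the reference to the investigators table
    scope: str          # e.g. "P1::S1::R3"
    attributes: dict = field(default_factory=dict)  # hstore-like string mapping
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created: datetime = field(default_factory=_now)
    last_modified: datetime = field(default_factory=_now)

    def __post_init__(self):
        # Reject scopes that do not match the L#::L#::... format.
        if not SCOPE_PATTERN.fullmatch(self.scope):
            raise ValueError(f"invalid scope: {self.scope!r}")
        # hstore holds text only, so keys and values are coerced to strings.
        self.attributes = {str(k): str(v) for k, v in self.attributes.items()}


# Hypothetical example values, not taken from the project.
run = Entity("example_investigator", "P1::S1::R3",
             {"electrode_name": "E1", "animal_age_days": 21})
```

Because hstore maps strings to strings, numeric values such as the age above come back as text and need to be parsed again on the way out.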
18. Data dashboard (planned):
• Use collected metadata to sort
experiments: organize by
mouse strain, neuron type,
animal age
• Enable in-browser analyses:
track provenance of analyzed
data back to raw data: “what
was that outlier?”
• Simple link in to publishing/data
sharing tools: “we can publish
papers no one else can”
19. Next steps Urban Legend Project:
• Populate data server with many experiments:
– Are people using it? Why/why not?
– What questions can we answer now that we
couldn’t before?
• Export data to neuroscience databases: NIF, INCF
Dataspace, neuroelectro.org
• How adaptable is this solution for use in other labs?
• Can we scale this up and make it sustainable?
• Software is available! We are ready to swap this simple system
for something better: the point is the process!
• How does it fit into a larger data infrastructure within
the institution/nationally/internationally?
20. Elsevier Research Data Services:
• Main goal: make research data optimally available,
discoverable and reusable
• Collaboration is tailored to partner’s unique needs:
– Working with a few domain-specific and institutional
repositories and institutions
– Aspects where collaboration is needed are discussed
– Collaboration plan is drawn up using SLA: agree on time,
conditions, etc.
• 2013/2014: series of pilots, studies and reports to
enable feasibility study:
– What are key needs?
– Can Elsevier play a role: skillsets, partnerships?
– Is there a (transparent) business model for this?
22. Data Initiatives:
• Data Citation group:
– Synthesize principles of proper data citation
– ‘Declaration of Data Citation Principles’, 8 principles of
successful data citation: http://www.force11.org/datacitation
• Resource Identification Initiative:
– Promote research resource identification, discovery, and reuse
– Resource Identification Portal http://scicrunch.com/resources
– Central location for obtaining research resource identifiers
(RRIDs) for materials and software used in biomedical research
• Antibody: Abgent Cat# AP7251E, ABR:AB_2140114
• Tool: CellProfiler Image Analysis Software, NIFRegistry:nif-0000-00280
• Organism: MGI:MGI:3840442
23. Summary:
• Life is complicated: knowledge needs to
be connected!
• A small pilot: “Urban Legend”
• Context and next steps:
– Working with institutions and databases to piece
together this puzzle
– Force11 is contributing some pieces
24. Thank you!
Collaborations and discussions gratefully acknowledged:
• CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Rick
Gerkin, Santosh Chandrasekaran, Matthew Geramita, Eduard Hovy
• UCSD: Phil Bourne, Brian Schottlaender, David Minor, Declan
Fleming, Ilya Zaslavsky
• NIF/Force11: Maryann Martone, Anita Bandrowski
• OHSU: Melissa Haendel, Nicole Vasilevsky
• California Digital Library: Carly Strasser, John Kunze, Stephen
Abrams
• Elsevier: Mark Harviston, Jez Alder, David Marques
25. Questions?
Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
26. Scopes
A scope follows the format L#::L#::L#...
where L is a letter identifier and # is any number of decimal
digits.
Example: P1::S1::R3 = Animal Prep 1, Slice 1, Run 3
The letter need not be globally unique, only unique within its chain.
Example: P1::S1::E1 (Electrode) is different from P1::S1::R1::E1
(Run-Electrode)
Scopes are 1-indexed.
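The scope rules above can be sketched in Python as follows. The function names `parse_scope` and `letter_chain` are hypothetical and not part of the project's code; the sketch only illustrates the format, the 1-indexing, and why a letter's meaning depends on the chain that precedes it.

```python
import re

# One scope level: a single capital letter followed by decimal digits.
LEVEL = re.compile(r"([A-Z])(\d+)")


def parse_scope(scope):
    """Split a scope such as 'P1::S1::R3' into (letter, index) pairs."""
    levels = []
    for part in scope.split("::"):
        match = LEVEL.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed scope level: {part!r}")
        index = int(match.group(2))
        if index < 1:
            raise ValueError("scopes are 1-indexed")
        levels.append((match.group(1), index))
    return levels


def letter_chain(scope):
    """Return the chain of letters, which is what gives the last letter its meaning."""
    return "::".join(letter for letter, _ in parse_scope(scope))


levels = parse_scope("P1::S1::R3")            # animal prep 1, slice 1, run 3
slice_electrode = letter_chain("P1::S1::E1")      # E directly under a slice
run_electrode = letter_chain("P1::S1::R1::E1")    # E under a run
```

Here `slice_electrode` and `run_electrode` differ, so the two `E1` entries refer to different kinds of electrode record even though they share a letter.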
27. Attributes
Each scope has an attributes field that consists
of multiple key/value pairs.
The keys are unique and not tied to scope (e.g.
electrode_name instead of name).
A key can be a choice, a scalar (with units), or a free-text field; which one is determined by the
ontology tables.
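As a rough illustration of how an ontology lookup could decide what kind of value a key accepts, here is a small Python sketch. The `ONTOLOGY` dictionary, its keys, its allowed values, and the "number, space, unit" encoding of scalars are all assumptions made for this example; the slides do not describe the actual table layout.

```python
# Hypothetical stand-in for the ontology tables: key -> kind of value allowed.
ONTOLOGY = {
    "mouse_strain": {"kind": "choice", "choices": {"C57BL/6", "other"}},
    "animal_age": {"kind": "scalar", "unit": "days"},
    "notes": {"kind": "freetext"},
}


def validate_attribute(key, value, ontology=ONTOLOGY):
    """Return the value unchanged if it fits the key's kind, else raise ValueError."""
    if key not in ontology:
        raise ValueError(f"unknown key: {key!r}")
    entry = ontology[key]
    if entry["kind"] == "choice":
        if value not in entry["choices"]:
            raise ValueError(f"{value!r} is not an allowed choice for {key!r}")
    elif entry["kind"] == "scalar":
        # Assumed encoding: "<number> <unit>", e.g. "21 days".
        number, _, unit = value.partition(" ")
        float(number)  # raises ValueError if the numeric part is not a number
        if unit != entry["unit"]:
            raise ValueError(f"expected unit {entry['unit']!r}, got {unit!r}")
    # Free-text values are accepted as they are.
    return value


age = validate_attribute("animal_age", "21 days")
```

Keeping the values as strings matches the string-to-string attributes mapping, with the ontology entry supplying the interpretation.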
28. Downsides to Flexible Schema
Converting between the flat scopes and a true hierarchy
(say, in JSON) is rather complicated and led to many
errors in the app.
It is very easy to get corrupted data in the app.
The schema is closely aligned to the way the Lua app did
things.
A flexible schema was a good choice, but scopes were not a good
way to represent hierarchies.
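To make the conversion problem concrete, here is a minimal Python sketch of going from flat scopes to a nested, JSON-ready structure and back. The `nest` and `flatten` helpers, the "children"/"attributes" layout, and the sample keys are assumptions for illustration, not the app's actual code.

```python
import json


def nest(flat):
    """Turn {'P1': {...}, 'P1::S1': {...}} into a nested dictionary."""
    root = {}
    for scope, attributes in flat.items():
        node = root
        # Walk (and create) one level of the tree per scope part.
        for part in scope.split("::"):
            node = node.setdefault("children", {}).setdefault(part, {})
        node["attributes"] = dict(attributes)
    return root


def flatten(tree, prefix=""):
    """Inverse of nest: rebuild the flat scope -> attributes mapping."""
    flat = {}
    for part, child in tree.get("children", {}).items():
        scope = f"{prefix}::{part}" if prefix else part
        if "attributes" in child:   # intermediate nodes may have no record
            flat[scope] = dict(child["attributes"])
        flat.update(flatten(child, scope))
    return flat


# Hypothetical records keyed by scope.
flat_records = {
    "P1": {"mouse_strain": "C57BL/6"},
    "P1::S1": {"slice_thickness": "300 um"},
    "P1::S1::R3": {"electrode_name": "E1"},
}
tree = nest(flat_records)
as_json = json.dumps(tree, indent=2)
round_trip = flatten(tree)
```

Even in this small version, a child record whose parent scope was never stored produces an intermediate node with no attributes, which is one plausible way that inconsistent data could creep in.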
29. Raw Data
For use in data-dashboard.
Standardized on HDF5.
Files uploaded via FTP.
Username, filename, and metadata within the
HDF5 file are used to identify the associated metadata
records.
Batch or individually uploaded.
Editor's Notes
Walk through the pieces one by one; also mention that this is very much a work in progress.