Eamonn Maguire's talk on "The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe" at ISCB-Asia, December 17th 2012
1. The Open Source ISA Metadata Tracking Framework:
From Data Curation and Management at the Source, to the Linked Data Universe
Eamonn Maguire
Lead Software Engineer
Oxford University
eamonn.maguire@oerc.ox.ac.uk
ISCB-Asia, 17th December 2012
2. What is ISA all about?
We want to enable better reporting of experiments...
We want to make it easier for submitters...
We want to provide tooling which biologists will want to use...
5. What’s the problem?
Analogy time.
Each can is an experiment.
We have no labels, so no indication of what is in the can.
Could be beans. Could be peas. Could be soup.
可能是豌豆 ("could be peas" in Chinese) - a different representation, in a non-Latin script
Might be petit pois - a different terminology
(Tin can analogy borrowed from Norman Morrison and converted from ontologies to metadata transfer standards.)
In biology, things aren’t quite as bad as this: we have some labels, but they aren’t all in the same language.
1. There is fragmentation in formats: the formats used to describe experiments are different, e.g. MAGE-Tab, PRIDE-ML, SRA-XML.
2. Different formats often capture different information - often not enough to actually repeat an experiment correctly.
3. The terminologies used to describe an experiment are different, e.g. humans vs Homo sapiens or rat vs Rattus norvegicus, making search more difficult.
8. 1. There is fragmentation in formats
Can you imagine having to translate everything you write into a different language in order to submit your data?
In Chinese: 你能想象有翻译成不同的语言编写的一切,以提交 的数据吗?即使转换工具,像谷歌,翻译弄错了。
In Irish: An féidir leat a shamhlú go bhfuil gach rud a scríobh tú a aistriú isteach i dteanga eile d'fhonn a chur isteach do chuid sonraí? Fiú uirlisí chomhshó, cosúil le google translate a fháilsé mícheart.
(Both translations repeat the question above and add, roughly: "Even conversion tools, like Google Translate, get it wrong.")
9. 1. There is fragmentation in formats: our solution
Repositories are making it difficult for biologists to submit data, and for others to use it.
This is particularly true for those performing multi-omic experiments: to submit, say, proteomic and transcriptomic data, one must provide slightly different information in two very different formats. Why?
Our solution is one general-purpose, flexible format, referred to here as ISA-Tab:
a domain-agnostic format to capture experimental metadata in omic experiments (transcriptomic, genomic, proteomic, metabolomic) as well as traditional experiments such as clinical chemistry and histology.
It already works in lots of domains: nutrigenomics, toxicogenomics, public health, etc.
10. 1. There is fragmentation in formats: our solution
investigation: high-level concept to link related studies
study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.
assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
data: external files in native or other formats; the assays hold pointers to data file names/locations
Biologists like tab. They don’t like XML. Through basic inference... ISA-Tab is good :)
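The investigation / study / assay hierarchy described on this slide can be pictured as a small nested data structure. The sketch below is only an illustration under stated assumptions: the class names mirror the slide's terms, but the attribute names (`measurement`, `data_files`, `characteristics`, `treatments`) and the example values are hypothetical and are not taken from the ISA-Tab specification or the ISA tools code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    """A test performed on the subject or on material taken from it."""
    measurement: str  # hypothetical label for what was measured
    data_files: List[str] = field(default_factory=list)  # pointers only, not the data


@dataclass
class Study:
    """The central unit: subject, characteristics, treatments and assays."""
    identifier: str
    characteristics: Dict[str, str] = field(default_factory=dict)
    treatments: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level concept that links related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def data_files(self) -> List[str]:
        """Collect every data file pointer reachable from this investigation."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Illustrative values only: one study with a transcriptomic and a proteomic assay.
investigation = Investigation(
    title="Example multi-omic investigation",
    studies=[
        Study(
            identifier="S1",
            characteristics={"organism": "Rattus norvegicus"},
            treatments=["compound X, low dose"],
            assays=[
                Assay("transcription profiling", ["arrays/sample1.CEL"]),
                Assay("protein expression profiling", ["ms/sample1.mzML"]),
            ],
        )
    ],
)
print(investigation.data_files())
```

The point of the sketch is that one study can carry assays from different technologies, with the metadata holding only pointers to the external data files.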
11. 2. Different formats often capture different information
...But there are lots of similarities.
MIBBI: Minimum Information for Biological and Biomedical Investigations.
The information captured by a format is determined by a ‘checklist’: ideally, a list of fields that together provide the minimal amount of information required to reproduce an experiment.
MIBBI is trying to harmonise these checklists to reduce redundancy and make them interoperable.
There are 32 checklists at present, because what is deemed important differs depending on the experiment being performed.
12. Now integrated in
Helping to demystify the unwieldy world of standards...
Find out what standards are out there (MI checklists, ontologies and formats), plus what domains they are suited to.
Find out about data sharing policies, from NIH for example.
Find out about databases, which standards they use, etc.
13. Now integrated in
In biology, things aren’t quite as bad as this: we have some labels, but they aren’t all in the same language. What do I mean by this? Well...
1. There is fragmentation in formats.
2. Different formats often capture different information.
3. The terminologies used to describe an experiment are different: we promote the use of ontologies to harmonize the recording of experiments.
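To make the terminology point concrete, here is a minimal sketch of how mapping free-text labels onto one ontology identifier makes search work across "humans" and "Homo sapiens". The synonym table, the function names and the record layout are all hypothetical and written for this illustration; a real tool would look terms up in an ontology service rather than a hard-coded dictionary. The identifiers are the NCBI Taxonomy IDs for the two species.

```python
from typing import Dict, List, Optional

# Hypothetical, hand-written synonym table (assumption for illustration only).
SYNONYMS: Dict[str, str] = {
    "human": "NCBITaxon:9606",
    "humans": "NCBITaxon:9606",
    "homo sapiens": "NCBITaxon:9606",
    "rat": "NCBITaxon:10116",
    "rattus norvegicus": "NCBITaxon:10116",
}


def normalise_organism(label: str) -> Optional[str]:
    """Return the ontology identifier for a free-text organism label, or None."""
    return SYNONYMS.get(label.strip().lower())


def search(records: List[Dict[str, str]], query: str) -> List[str]:
    """Return ids of records whose organism maps to the same term as the query."""
    target = normalise_organism(query)
    if target is None:
        return []
    return [r["id"] for r in records if normalise_organism(r["organism"]) == target]


# Illustrative records: three experiments annotated with inconsistent labels.
records = [
    {"id": "E1", "organism": "humans"},
    {"id": "E2", "organism": "Homo sapiens"},
    {"id": "E3", "organism": "rat"},
]
print(search(records, "human"))
```

A plain string match on "human" would miss E2; matching on the shared identifier finds both.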
14. The ISA tools...
The ISA tools bring together a common representation, MI checklists and ontologies.
15. The ISA tools
Developed on top of the ISA-Tab format...modular, configurable, open source, Java based*
See them all at isa-tools.org
16. The ISA tools... a tool for all your needs
17. Configurable...
We need to support lots of different checklists, and it should be easy for people to change their requirements should they need to.
So, our infrastructure is built upon XML files. These are created by the ISAConfigurator.
A configuration XML file describes the fields (or checklist) required to describe a particular experiment and any ontologies to be used.
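As a rough illustration of the idea, the sketch below reads a checklist from an XML configuration and reports which required fields a metadata row is missing. The XML layout here (`configuration`, `field`, `header`, `required`, `ontology`) is invented for this example and is not the actual schema written by the ISAConfigurator; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical configuration layout (assumption, NOT the real ISAConfigurator schema).
CONFIG_XML = """
<configuration name="example-transcriptomics">
  <field header="Sample Name" required="true"/>
  <field header="Characteristics[organism]" required="true" ontology="NCBITaxon"/>
  <field header="Comment[notes]" required="false"/>
</configuration>
"""


def required_headers(config_xml: str) -> List[str]:
    """List the column headers that the configuration marks as required."""
    root = ET.fromstring(config_xml)
    return [f.get("header") for f in root.findall("field") if f.get("required") == "true"]


def missing_fields(config_xml: str, row: Dict[str, str]) -> List[str]:
    """Return required headers that are absent or empty in a metadata row."""
    return [h for h in required_headers(config_xml) if not row.get(h, "").strip()]


# A row that names the sample but omits the organism.
print(missing_fields(CONFIG_XML, {"Sample Name": "sample1"}))
```

Because the checklist lives in the configuration rather than in the code, changing the requirements means editing the XML file, not the tool.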
19. isacreator
Create & Edit ISA-Tab
20. The ISAcreator...
Developed to be a user-friendly way to enter standards-compliant metadata: it has lots of features...
file chooser
publication searcher
visualization
ontology search
QR code generator
automated ontology tagging (powered by NCBO Annotator)
spreadsheet-like interface
But these are just some of them...we also have a data entry wizard and an import utility...
23. Make sure the ISA-Tab is correct
24. validate from the dedicated tool...
or...
validate from the command line...
or...
within ISAcreator directly...
25. Convert to or from differing formats
26. The converters
Fully endorsed by ArrayExpress, PRIDE and the European Nucleotide Archive (ENA)...
Converts MAGE-Tab to ISA-Tab.
This converter is still in beta, but we are getting close to a fully working version. We’ve successfully created validated ISA-Tab for ~90% of the 21k experiments in ArrayExpress.
Available as a web service and a web interface; the source is also available for running conversions locally.
http://isatab.sourceforge.net/magetoisa/
28. The converters...semantic web
• Make the semantics of ISA-Tab explicit, including materials & data entities & processes
• Exploit the semantic annotations available in ISA-Tab datasets
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate querying, data integration & knowledge discovery/reasoning
29. The converters...semantic web
Notes in lab books (information for humans)
Spreadsheets & tables (ISA-Tab metadata)
Facts as RDF statements (information for machines)
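The step from spreadsheet rows to RDF statements can be sketched in a few lines. This is an illustration only, not the ISA converter: the `http://example.org/isa/` namespace, the way predicates are derived from column headers, and the helper names are all assumptions made for the example. Each non-empty cell becomes one subject-predicate-object statement in N-Triples syntax.

```python
import csv
import io
from typing import Dict, List

BASE = "http://example.org/isa/"  # hypothetical namespace, not an official ISA vocabulary

# A tiny tab-delimited table standing in for ISA-Tab metadata.
TABLE = "Sample Name\tCharacteristics[organism]\nsample1\tRattus norvegicus\n"


def slug(text: str) -> str:
    """Replace characters that are awkward in IRIs with underscores."""
    return "".join(c if c.isalnum() else "_" for c in text)


def row_to_triples(row: Dict[str, str], subject_column: str = "Sample Name") -> List[str]:
    """Turn one table row into N-Triples lines, one per non-empty cell."""
    subject = f"<{BASE}sample/{slug(row[subject_column])}>"
    triples = []
    for header, value in row.items():
        if header == subject_column or not value:
            continue
        predicate = f"<{BASE}property/{slug(header)}>"
        # Escape backslashes and double quotes inside the literal.
        literal = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
        triples.append(f"{subject} {predicate} {literal} .")
    return triples


rows = csv.DictReader(io.StringIO(TABLE), delimiter="\t")
triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

A real conversion would map headers to terms from existing ontologies instead of minting predicates from column names, which is what makes the resulting statements usable for integration and reasoning.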
30. Get ISA-Tab into a database
Share it (or don’t) with the world
31. Database & Web Application
35. Last but not least...
Analysis
36. Package to read ISA-Tab into R, especially Bioconductor, to run analysis scripts on your data...
It can automatically call microarray, mass spec and flow cytometry analysis packages on appropriate datasets...
Available from Bioconductor...
There is also a script to create Galaxy libraries from ISA-Tab. Brad Chapman is working on this at HSPH.
Dedicated ISAcreator mode: allows for persistence and perusal of ISA experiments in GenomeSpace.
37. isacommons
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
Stem Cell Commons
Nanotechnology Informatics Working Group
39. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level
Philippe Rocca-Serra; Marco Brandizi; Eamonn Maguire; Nataliya Sklyar; Chris Taylor; Kimberly Begley; Dawn Field; Stephen Harris; Winston Hide; Oliver Hofmann; Steffen Neumann; Peter Sterk; Weida Tong; Susanna-Assunta Sansone
Bioinformatics 2010 26: 2354-2356
Toward Interoperable Bioscience Data
Sansone SA, Rocca-Serra P, Field D, Maguire E et al.
Nature Genetics 2012
40. Thanks for listening...
Questions??
You can email us...
isatools@googlegroups.com
View our website
http://www.isa-tools.org
View our Git repo & contribute
http://github.com/ISA-tools
View our blog
http://isatools.wordpress.com
Follow us on Twitter
@isatools