A talk given at the DCC Digital Curation 101 workshop that illustrates how to curate and manage scientific data, considering the content, syntax and semantics of the data.
1. a centre of expertise in data curation and preservation
Create or Receive Scientific data
Dr. Frank Gibson and Dr. Phillip Lord
Frank.Gibson@newcastle.ac.uk
Phillip.Lord@newcastle.ac.uk
Funded by:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Digital Curation 101, October 6th-10th, 2008, NeSC, Edinburgh
2.
“In the standard model, one collects data, publishes a paper or papers and then gradually loses the original dataset.”
- Geoffrey Bowker
3.
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
4.
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
5.
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
6.
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
7. If we have a paper, who cares about the data?
http://flickr.com/photos/nicmcphee/2756494307/
8.
A paper = a claim (or claims)
The full record that supports that claim should be available for detailed examination and critique.
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
9.
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
10.
1000+ databases
11. Biocuration: Databases
12. Biocuration: Wiki
13.
Slide by Cameron Neylon http://www.slideshare.net/CameronNeylon
14.
15. Funders
http://flickr.com/photos/luismimunoznajar/2093185804/
16.
Create or Receive
17.
Curation aims
Amenable
Preservable
Ownable
Accessible
Citable
18.
Significant Properties of Data
Content
Syntax
Semantics
19.
Content
20.
Publisher
Type
Title
Creator
Source Identifier
Date
Rights
21. Simple Dublin Core
Type
Format
Title
Identifier
Creator
Source
Subject
Language
Description
Relation
Publisher
Coverage
Contributor
Rights
Date
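As a rough illustration of the fifteen Simple Dublin Core elements listed above, the following Python sketch serialises a metadata record as XML. The element names and the `http://purl.org/dc/elements/1.1/` namespace are Dublin Core's; the `record` wrapper element, the `build_dc_record` helper and all example values are assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen Simple Dublin Core elements (lower-case, as used in XML).
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def build_dc_record(fields):
    """Build an XML element holding the given Dublin Core fields.

    `fields` maps element names to string values. Every element is
    optional, but names outside the fifteen elements are rejected.
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Simple Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for name in DC_ELEMENTS:
        if name in fields:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return root


# Invented example values, for illustration only.
example = {
    "title": "Example gel electrophoresis dataset",
    "creator": "A. Researcher",
    "date": "2008-10-06",
    "type": "Dataset",
    "format": "text/xml",
    "identifier": "example:dataset/0001",
    "rights": "CC BY-NC-SA 2.5 UK: Scotland",
}

xml_text = ET.tostring(build_dc_record(example), encoding="unicode")
print(xml_text)
```

Keeping the element list in one place makes it easy to reject misspelt or non-standard fields before a record leaves the laboratory.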
22.
Content:
Domain Specific
23.
Syntax
24.
25.
Choosing a Syntax
• Openness
  - Is there an open, publicly available specification for the format? Are its specifications in the public domain? Is it unencrypted?
• Portability
  - Is the format independent of hardware, operating system, or other software? Is it independent of particular institutions, groups, or events? Is it in widespread current use? Does it contain little or no built-in functionality?
• Quality
  - Is it robust, simple, highly tested, loss-free?
26.
Semantics
27.
Semantics can be complex
One meaning = many words (synonyms)
One word = many meanings (e.g. "fly")
28.
• Excel data example – do I need it?
• Zeeberg et al. BMC Bioinformatics 2004 5:80 doi:10.1186/1471-2105-5-80
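The paper cited above reports that Excel can silently convert some gene symbols into dates (for example, SEPT2 becoming 2-Sep). As a hedged sketch of how such damage might be spotted after the fact, the Python below flags date-like values in a column of gene symbols. The regular expression, the `find_suspect_symbols` helper and the sample column are assumptions for illustration; real spreadsheets may show other date formats depending on locale.

```python
import re

# Three-letter month abbreviations as Excel commonly displays them.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches values such as "2-Sep" or "Sep-02": likely dates, not gene symbols.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)


def find_suspect_symbols(symbols):
    """Return the values that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Invented sample column, as it might look after a round trip through Excel.
column = ["TP53", "2-Sep", "BRCA1", "1-Dec", "MARCH1"]
suspects = find_suspect_symbols(column)
print(suspects)
```

A check like this only detects the problem; the original symbol cannot always be recovered, which is one reason to keep the raw data in a format that does not reinterpret its content.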
29. What is fly?
• Fly: http://en.wikipedia.org/wiki/Image:Air_india_b747-400_vt-esn_arp.jpg
• Fly: http://en.wikipedia.org/wiki/Image:MuscuDomestica.jpg
• Fly: http://en.wikipedia.org/wiki/Image:Green_Highlander_salmon_fly.jpg
• Fly: http://en.wikipedia.org/wiki/Image:Fly_poster.jpg
30.
Ontology
• A controlled vocabulary is an association between formal names (identifiers) and their definitions.
• An ontology is a controlled vocabulary augmented with logical constraints that describe their interrelationships.
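The distinction above can be sketched in a few lines of Python: a controlled vocabulary is a mapping from identifiers to names and definitions, and an ontology adds relationships between the terms. Only a simple `is_a` relation is modelled here, which is far weaker than the logical constraints a real ontology language offers. The `EX:` identifiers, the terms and the definitions are invented for illustration.

```python
# Controlled vocabulary: identifier -> (name, definition). All entries are
# hypothetical examples, not taken from any published ontology.
vocabulary = {
    "EX:0001": ("organism", "A living individual."),
    "EX:0002": ("insect", "An organism belonging to the class Insecta."),
    "EX:0003": ("house fly", "An insect of the species Musca domestica."),
    "EX:0004": ("artefact", "An object made by people."),
    "EX:0005": ("fishing fly", "An artificial lure used in fly fishing."),
}

# Ontology layer: is_a relationships between the vocabulary terms.
is_a = {
    "EX:0002": ["EX:0001"],
    "EX:0003": ["EX:0002"],
    "EX:0005": ["EX:0004"],
}


def ancestors(term_id):
    """Return every term reachable from term_id by following is_a links."""
    seen = set()
    stack = list(is_a.get(term_id, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_kind_of(term_id, candidate):
    """True if term_id is the candidate term or one of its descendants."""
    return term_id == candidate or candidate in ancestors(term_id)


# Both terms could be called "fly", but only one of them is an organism.
print(is_kind_of("EX:0003", "EX:0001"))
print(is_kind_of("EX:0005", "EX:0001"))
```

Because the relationships are explicit, a program can answer questions such as "is this kind of fly an organism?" without relying on the ambiguous word itself.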
31.
Ontologies for Life science
• Emergence has occurred for two reasons
  - Consistent annotation of data
  - To add meaning and understanding that can be interpreted computationally
• Bio-ontologies registered on the OBO Foundry
32.
Application of Significant Properties in Proteomics
33.
Minimum Information about a Proteomics Experiment (MIAPE)
• Sufficiency.
  - The MIAPE guidelines should require sufficient information about a dataset and its experimental context to allow a reader to understand and critically evaluate the interpretation and conclusions, and to support their experimental corroboration.
• Practicability.
  - Achieving compliance with MIAPE should not be so burdensome as to prohibit its widespread use.
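A minimum reporting guideline of this kind can be treated as a checklist, and a checklist can be tested automatically. The Python sketch below reports which required items are missing from an experiment description. The field names in `REQUIRED_FIELDS` are invented placeholders and are not taken from any MIAPE module; a real check would use the items listed in the relevant guideline.

```python
# Hypothetical checklist items; NOT the actual content of a MIAPE module.
REQUIRED_FIELDS = [
    "sample_description",
    "protocol",
    "instrument",
    "contact",
]


def missing_fields(report):
    """Return the required fields that are absent or blank in a report."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(report.get(field, "")).strip()
    ]


# Invented example report with one blank and one absent field.
report = {
    "sample_description": "Yeast whole-cell lysate",
    "protocol": "2D gel electrophoresis",
    "instrument": "   ",
}

gaps = missing_fields(report)
print(gaps)
```

A short list of required items keeps the check in line with the practicability principle: it asks for what a reader needs and nothing more.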
34.
35.
Minimum reporting guidelines
• Describe content
• Implementation independent
• Impacts
• Publication
• Syntax
• Semantics
36.
Syntax for proteomics
• The content in MIAPE GE needs to be structured to facilitate
  - dissemination
  - transfer
  - storage
• A community development process to agree on a syntax
  - building upon the FuGE data model
  - A pre-existing community-developed representation of scientific experiments
  - Interoperable
37.
FuGE
• Model of common components in science investigations, such as materials, data, protocols, equipment and software.
• Provides a framework for capturing complete laboratory workflows, enabling the integration of pre-existing data formats.
38.
UML/XML/RDBMS
• UML gives structure (but not syntax)
  - Very abstract, very general
• XML provides a concrete syntax
  - Meta language is interoperable, checkable, viable and has basic metadata support (language, character encoding and so on).
  - Tends toward the verbose. Not (very) searchable in itself.
  - Therefore, a transfer and archive format.
• RDBMS
  - SQL is (sort of) a standard
  - Highly computationally amenable form; very good for searching
  - Conversion from XML is possible, but can be done in a number of ways.
  - Hard work; nice to have an off-the-shelf implementation.
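The division of labour described above (XML for transfer and archive, a relational database for searching) can be sketched with the Python standard library. The XML element names, the table layout and the `load_protocols` helper are assumptions invented for this example; they are not GelML or FuGE, and this is only one of the several possible XML-to-relational mappings.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented XML document standing in for a transfer/archive file.
XML_DOC = """<experiment id="exp1">
  <protocol id="p1" name="Sample preparation"/>
  <protocol id="p2" name="Gel electrophoresis"/>
</experiment>"""


def load_protocols(xml_text, conn):
    """Parse the XML and copy each protocol into a relational table."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protocol "
        "(id TEXT PRIMARY KEY, experiment_id TEXT, name TEXT)"
    )
    rows = [
        (p.get("id"), root.get("id"), p.get("name"))
        for p in root.iter("protocol")
    ]
    conn.executemany("INSERT INTO protocol VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_protocols(XML_DOC, conn)

# Once the data is in the database, searching is a single SQL query.
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM protocol WHERE name LIKE ? ORDER BY id",
        ("%Gel%",),
    )
]
print(count, names)
```

Even in this tiny case a design decision was needed (one table, with the experiment identifier copied onto each row), which is why an off-the-shelf implementation of the mapping is attractive.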
39. GelML
40.
Semantics for Gels
41. Semantics for science
42.
Curation of Gel experiments
[Diagram columns: Laboratory; Data entry and transfer; Public repositories]
Data entry and transfer routes:
I) GelML data entry tools
II) Direct database submission
III) Automated export of GelInfoML
Labels shown: GelML, MIAPE GE, MIAPE GI, sepCV
43. Discoverability and reuse
• Persistent Identifiers
• Rights management
44.
Persistent Identifiers
• A name for a resource which will remain the same regardless of where the resource is located
• In biology typically assigned to data upon publication
• Type of identifier dependent on publication method
• Description and Representation Information provides more information about persistent identifiers
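The first point above, a name that stays the same wherever the resource lives, amounts to a level of indirection between identifier and location. The Python sketch below shows that idea with a minimal in-memory resolver. The `Resolver` class, the identifier and the URLs are hypothetical; real schemes rely on a maintained resolution service rather than a dictionary.

```python
class Resolver:
    """Minimal in-memory mapping from persistent identifiers to locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, location):
        """Assign an identifier to a resource; identifiers are never reused."""
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already registered")
        self._locations[identifier] = location

    def relocate(self, identifier, new_location):
        """Record that the resource moved; the identifier is unchanged."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_location

    def resolve(self, identifier):
        """Return the current location of the identified resource."""
        return self._locations[identifier]


# Invented identifier and URLs, for illustration only.
resolver = Resolver()
resolver.register("example:dataset/0001", "https://repo-a.example.org/data/0001")
before = resolver.resolve("example:dataset/0001")

# The dataset moves to another repository; citations keep the same name.
resolver.relocate("example:dataset/0001", "https://repo-b.example.org/d/0001")
after = resolver.resolve("example:dataset/0001")
print(before, after)
```

Anyone citing the data records only the identifier, so a move between repositories does not break the citation as long as the resolver is kept up to date.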
45.
Rights management
• Difficult to determine
• Lots of legal issues
• In biology/bioinformatics tends to be open access
• Creative Commons
46. Receiving data for curation
Content
Syntax
Semantics
47. Route map
Who will receive it?
What are their policies on Content, Syntax, Semantics?
Plan your experiment to conform to Content, Syntax, Semantics
Implement experiment to:
Collect appropriate Content
Structure in appropriate Syntax
Ensure Semantics are preserved
Curate…
48.
Meta Route Map
• How to build the map if you don’t have one yet.
49.
Appraise and Select
• Investigates the evaluation and selection of data for long-term curation and preservation
50.
Acknowledgments
• The CARMEN project
• www.carmen.org.uk
• The Proteomics Standards Initiative (PSI)
• http://psidev.info
• Colleagues at Newcastle University
• Phillip Lord, Anil Wipat, Allyson Lister
51.