Academic scientists need a tool that captures the science they do so that it can be shared as open science, integrated with linked data, and searched. Eureka is an evolving platform for doing this.
1. Eureka Research Workbench: An Open Source eScience Laboratory Notebook
Stuart J. Chalk
Department of Chemistry
University of North Florida
schalk@unf.edu
2014 Spring ACS Meeting – CINF Paper 38
2. Big Data
Electronic Notebooks
The Eureka Research Workbench
Experiment Markup Language
ExptML Schema and Files
Semantic Data and Ontologies
File Storage
Eureka Interface
Web Interface
Conclusion
Outline
3. Current buzzword for “let's bring together lots of data and
build tools on top to extract knowledge”
This is great, except…
…how do we do that for science?
Platform, data structures, and exchange protocols to
capture, identify, and disseminate scientific information
Research Data Alliance (https://rd-alliance.org/)
“Research Data Sharing without barriers”
Fran Berman at RPI is the NSF-funded co-chair of RDA
Big Data
4. Scientists need to move to digital notebooks...
...and record not just the data but the flow and context
How science is done is important for searching, aggregation, and meta-analysis
We need more than an electronic version of a notebook
We need a science version of “Second Life” (SciLife?)
Electronic Notebooks
5. Started in 2006 after getting involved in the
Analytical Information Markup Language (AnIML) project
Store all research notes/data in a digital format
Capture the workflow of scientists
Writing in a lab notebook is equivalent to
“multi-type” blogging in the digital world
How to capture information? Many data types! (ExptML)
How to store files “online”? (Fedora-Commons)
How to access files in the browser? (CakePHP)
How to represent laboratory resources? (ExptML)
How to link data together? (RDF in Fedora-Commons)
Eureka Research Workbench
6. A specification (written in XML) that describes different
types of information recorded during the scientific process
(http://exptml.sourceforge.net)
Experiment Markup Language (ExptML)
Annotation
Api
Calculation
Chemical
Citation
Customer
Data
Dataset
Definition
Element
Equipment
Event
Experiment
Group
Message
Project
Protocol
Quote
Report
Result
Sample
Solution
Space
Specimen
Substance
Task
Template
Timeline
User
Vendor
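As an illustration of what an XML record for one of these types might look like, the following Python sketch builds and reads back a small "chemical" record. The element names (chemical, name, formula) and the id attribute are hypothetical and chosen only for illustration; the actual ExptML schema may structure this information differently.

```python
import xml.etree.ElementTree as ET


def build_chemical(pid, name, formula):
    """Build a small XML record. Element names are hypothetical, not taken from ExptML."""
    root = ET.Element("chemical", attrib={"id": pid})
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "formula").text = formula
    return root


def read_chemical(xml_text):
    """Parse the record back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "formula": root.findtext("formula"),
    }


# Serialize to text, then read the values back out of the XML.
xml_text = ET.tostring(build_chemical("exptml:chm1", "water", "H2O"), encoding="unicode")
record = read_chemical(xml_text)
```

Keeping each type of information in its own small, typed record is what would let a notebook entry be searched and aggregated by type rather than as free text.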
10. Data are connected to other data – ‘Linked Data’
(http://www.w3.org/standards/semanticweb/data)
The ‘Semantic Web’ approach to contextualize data
Proposed storage of ‘relationships’ between data is the
Resource Description Framework (RDF - http://www.w3.org/RDF/)
Semantic Data
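RDF expresses each relationship as a subject-predicate-object triple. The sketch below shows the idea in plain Python, without an RDF library: triples are stored as tuples, serialized as N-Triples style lines, and queried by subject and predicate. All URIs and the predicate names (usesChemical, hasDataset) are made up for illustration and are not taken from Eureka or any published ontology.

```python
# Hypothetical URIs, for illustration only.
EXPERIMENT = "http://example.org/obj/exp1"
CHEMICAL = "http://example.org/obj/chm1"
DATASET = "http://example.org/obj/ds1"
USES_CHEMICAL = "http://example.org/rel#usesChemical"
HAS_DATASET = "http://example.org/rel#hasDataset"

# Each relationship is one (subject, predicate, object) triple.
triples = [
    (EXPERIMENT, USES_CHEMICAL, CHEMICAL),
    (EXPERIMENT, HAS_DATASET, DATASET),
]


def to_ntriples(triples):
    """Serialize URI-only triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


serialized = to_ntriples(triples)
chemicals = objects_of(triples, EXPERIMENT, USES_CHEMICAL)
```

Because every statement has the same three-part shape, new kinds of relationships can be added without changing how the data are stored or queried.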
11. Digital repository software http://fedora-commons.org/
Creation and management of online digital libraries
Fedora ‘Digital Object’ consists of metadata + streams
Metadata stored as Dublin Core (DC stream)
ExptML file stored as EXPTML stream
Other files (PDFs, Images, Word etc.) stored as streams
Relationships stored as RDF (RELS-EXT stream)
Features: Version control, Checksumming, Archiving
Built-in search of objects and relationships
Add-on for file content search (Fedora GSearch)
Fedora Commons
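The "metadata + streams" structure can be pictured with a small data model. This is a sketch in plain Python, not the Fedora API: the class names, the stream contents, and the choice of SHA-256 for the checksum are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Stream:
    """One named stream of a digital object (illustrative model only)."""
    stream_id: str
    mime_type: str
    content: bytes

    def checksum(self):
        # SHA-256 is an assumption; a repository may be configured differently.
        return hashlib.sha256(self.content).hexdigest()


@dataclass
class DigitalObject:
    """A digital object: an identifier plus any number of named streams."""
    pid: str
    streams: dict = field(default_factory=dict)

    def add(self, stream):
        self.streams[stream.stream_id] = stream


# Placeholder contents; real streams would hold Dublin Core, ExptML, and RDF documents.
obj = DigitalObject("exptml:chm1")
obj.add(Stream("DC", "text/xml", b"<dc/>"))
obj.add(Stream("EXPTML", "text/xml", b"<chemical/>"))
obj.add(Stream("RELS-EXT", "application/rdf+xml", b"<rdf/>"))

checksums = {name: s.checksum() for name, s in obj.streams.items()}
```

Storing a checksum per stream is what makes it possible to detect later whether an archived file has been altered or corrupted.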
12. Fedora-Commons defines and works on digital objects
In the definition of a Fedora object an ExptML file is just
one stream of many. By default each object also has a
“DC” stream of metadata and an “RELS-EXT” stream of
relationships
Each Fedora object can have any number of additional
streams for
Paper PDFs, product/sample pictures,
binary file formats (if a conversion has been done)
Video, audio, RDF, anything…
You can export individual streams or the whole Fedora
object with streams binary encoded (Sharing/archiving)
Fedora for File Storage
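Exporting a whole object with its streams binary encoded can be sketched as follows. This example packs the streams into a single JSON document using Base64 and then restores them. The JSON layout and function names are hypothetical and are not Fedora's actual export format; the point is only that binary streams survive a round trip through a text container.

```python
import base64
import json


def export_object(pid, streams):
    """Pack an identifier and its binary streams into one text document."""
    encoded = {
        name: base64.b64encode(data).decode("ascii")
        for name, data in streams.items()
    }
    return json.dumps({"pid": pid, "streams": encoded}, sort_keys=True)


def import_object(text):
    """Restore the identifier and the original bytes of every stream."""
    payload = json.loads(text)
    streams = {
        name: base64.b64decode(data)
        for name, data in payload["streams"].items()
    }
    return payload["pid"], streams


# Placeholder stream contents, including non-text bytes.
original = {
    "DC": b"<dc/>",
    "EXPTML": b"<chemical/>",
    "PHOTO": bytes([0, 255, 16, 128]),
}
package = export_object("exptml:chm1", original)
pid, restored = import_object(package)
```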
14. Web interface written in PHP using the CakePHP Framework
Communicates with Fedora-Commons API to create,
retrieve, update and delete (CRUD) ExptML and other files
Representational State Transfer (REST) format for URLs
E.g. http://example.com/chemicals/view/exptml:chm1
Creation of ExptML via interface
Provides search via Fedora and GSearch
Can extract data out of XML files
Can gather data from other websites (via API controller)
and integrate into ExptML files
Eureka Web Application
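The example URL above follows the controller/action/identifier pattern. The Python sketch below splits such a URL into its parts to make the pattern explicit. The function name and the returned dictionary are illustrative only; in the real application this routing would be handled by CakePHP itself.

```python
from urllib.parse import urlparse


def parse_route(url):
    """Split a /controller/action/param... path into its named parts."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("expected at least /controller/action")
    controller, action, *params = segments
    return {"controller": controller, "action": action, "params": params}


# The identifier segment keeps its "exptml:" prefix intact.
route = parse_route("http://example.com/chemicals/view/exptml:chm1")
```

Here the controller names the data type (chemicals), the action names the operation (view), and the remaining segment carries the object identifier.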
15. Eureka Website – Group View
Only data types related to the research group show up on left
16. Eureka Website – Bench View
Clicking on the “Add” menu on the right
allows you to add a comment or link to data
21. Web Application
Server: Fedora 4, JSON-LD, ElasticSearch
Client: CakePHP 3/HTML5, Recline.js, Annotator, JQuery
Standards
Linked Data Platform (http://www.w3.org/TR/ldp/)
Datapackage/Simple Data Format (http://dataprotocols.org/)
Markup Languages: AnIML, UnitsML, CML
Other Molecular File Formats: MOL/SDF/CDX/CIF/PDB etc.
Open Framework for Laboratory Data (Allotrope Foundation)
Datasources
ChemSpider, CIR, PubChem, Google Scholar, CrossRef, VIVO
ExchangeNetwork (EPA), NIST, SDBS (no APIs yet)
Tools
Marvin for JS, JSXGraph, JSpecView, Chemicalize.org
Eureka Technology Stack
22. Implement ingest of all data types, file-based (if appropriate) and web-based
In browser processing of data -> dataset -> result, report writing
Extraction of file based legacy data -> ExptML format data
Open access to data/spectra, ‘available data’ page (browser only)
Access to data/spectra via linked data server (discovery/indexing)
Publishing of packaged datasets with authenticated download option
Automated ingestion of data from instruments/sensors
Collaborative research: authentication and data exchange
Timeframe? Depends on securing funding
Eureka Roadmap
23. Eureka: Web application to create ExptML files
Built on ExptML to capture data/resources/workflows
Reliable storage/archiving system for ExptML files (Fedora)
Storage of relationships between data (RDF)
TODO
Provide mechanism for sharing of data (different levels)
Add tools to find, visualize and work on science data
Integration into the RDA model for sharing research data
Get the word out and test system with many users
Conclusion