Slides for a talk at the Farmbio BioScience Seminar on May 18, 2018 (http://farmbio.uu.se), introducing Linked Data as a way to manage research data that keeps better track of provenance, makes the semantics of the data more explicit, and makes the data easier to integrate with other data and easier to consume, for both humans and machines.
Linked Data for improved organization of research data
1. Linked Data
for improved organization
of research data
Farmbio BioScience Seminar May 18, 2018
Samuel Lampa @smllmp
PhD Student in Pharm. Bioinformatics @ pharmb.io / farmbio.uu.se
2. Research interests
● Large datasets
● Automation
● Scientific workflows
● Machine Learning
● Semantic data
● Reasoning
● Query systems
● Something user friendly
● And hopefully usable
● “Answer all the (computational) research questions”
6. Database to the rescue?
● Same problems with losing data identity on export
● So, put all data in the same database?
● One database can’t fit all the world’s data!
● What to do?
7. What to do?
What if all data could:
● Be easy to share
● Be self-describing
● Use the same (underlying) format
● Be easy to integrate with other data
(In other words: FAIR – Findable, Accessible, Interoperable, Reusable)
9. Linked data – Basic ideas
● Use URIs (“https://”) to identify things
● Make URIs into dereferenceable links
(So one can visit them to find relevant data)
● Refer to other data using their links
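As a rough illustration of what "dereferencing" a URI can look like in practice, the sketch below builds an HTTP request that asks for an RDF representation of a resource. The URI is the illustrative example.org one used on the next slide, and the use of an `Accept: text/turtle` header (content negotiation) is an assumption about how a Linked Data server might be set up, not something stated in the talk. The request is only constructed, not sent.

```python
from urllib.request import Request


def build_dereference_request(uri, accept="text/turtle"):
    """Build a GET request asking for an RDF serialization of the resource."""
    return Request(uri, headers={"Accept": accept})


# Illustrative URI; example.org does not actually serve this data.
req = build_dereference_request("http://example.org/myontology/Sweden")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual lookup against a server that supports it.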
10. What about the linking?
Triple model*:
– Subject (URI), Predicate (URI), Object (URI or literal value)
@prefix ex: <http://example.org/myontology/> .
ex:Sweden ex:hasPopulation 9000000 .
ex:Sweden ex:hasCapital ex:Stockholm .
* For more info: Check “RDF: Resource Description Framework”
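The triple model can be sketched in a few lines of Python: each statement is a (subject, predicate, object) tuple, where the object is either another URI or a literal value. The helper names `is_uri` and `to_ntriples` are made up for this illustration, and the serialization is a simplified take on the N-Triples format (no string escaping), not a full RDF library.

```python
EX = "http://example.org/myontology/"

# Each triple: (subject URI, predicate URI, object URI or literal value)
triples = [
    (EX + "Sweden", EX + "hasPopulation", 9000000),
    (EX + "Sweden", EX + "hasCapital", EX + "Stockholm"),
]


def is_uri(value):
    """Treat http(s) strings as URIs; everything else is a literal."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


def to_ntriples(triples):
    """Serialize triples as simplified N-Triples lines (no escaping)."""
    lines = []
    for s, p, o in triples:
        if is_uri(o):
            obj = f"<{o}>"
        elif isinstance(o, int):
            obj = f'"{o}"^^<http://www.w3.org/2001/XMLSchema#integer>'
        else:
            obj = f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```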
12. Example data
Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES.
Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi:10.1186/2041-1480-2-S1-S6.
Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
13. Example data
<http://[...]/nmrshiftdb/?moleculeId=234>
    dc:title "warburganal";
    chem:casnumber "62994-47-2";
    nmr:moleculeId "234";
    nmr:hasSpectrum <http://[...]/nmrshiftdb/?spectrumId=4735>.
<http://[...]/nmrshiftdb/?spectrumId=4735>
    nmr:field "50";
    nmr:hasPeak <http://[...]/nmrshiftdb/?s4735p0>,
        <http://[...]/nmrshiftdb/?s4735p1>,
        <http://[...]/nmrshiftdb/?s4735p2>,
        <http://[...]/nmrshiftdb/?s4735p3>;
    nmr:solvent "Chloroform-D1 (CDCl3)";
    nmr:spectrumId "4735";
    nmr:spectrumType "13C";
    nmr:temperature "298".
<http://[...]/nmrshiftdb/?s4735p1>
    nmr:hasShift 18.3;
    a nmr:peak.
<http://[...]/nmrshiftdb/?s4735p2>
    nmr:hasShift 22.6;
    a nmr:peak.
<http://[...]/nmrshiftdb/?s4735p3>
    nmr:hasShift 26.5;
    a nmr:peak.
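To show how links in the example data can be followed, here is a minimal sketch that walks from a spectrum to its peaks and on to their shift values. The host part of the URIs is elided in the slides, so `BASE` below is a hypothetical stand-in, the `nmr:` names are kept as plain strings rather than expanded URIs, and `match` and `shifts_for_spectrum` are illustrative helpers, not part of any tool mentioned in the talk.

```python
BASE = "http://example.org/nmrshiftdb/"  # hypothetical stand-in for the elided host
SPECTRUM = BASE + "?spectrumId=4735"

# A subset of the example data as (subject, predicate, object) tuples
triples = {
    (SPECTRUM, "nmr:spectrumType", "13C"),
    (SPECTRUM, "nmr:hasPeak", BASE + "?s4735p1"),
    (SPECTRUM, "nmr:hasPeak", BASE + "?s4735p2"),
    (SPECTRUM, "nmr:hasPeak", BASE + "?s4735p3"),
    (BASE + "?s4735p1", "nmr:hasShift", 18.3),
    (BASE + "?s4735p2", "nmr:hasShift", 22.6),
    (BASE + "?s4735p3", "nmr:hasShift", 26.5),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def shifts_for_spectrum(triples, spectrum):
    """Follow spectrum -> hasPeak -> hasShift and collect the shift values."""
    shifts = []
    for _, _, peak in match(triples, s=spectrum, p="nmr:hasPeak"):
        shifts.extend(o for _, _, o in match(triples, s=peak, p="nmr:hasShift"))
    return sorted(shifts)


print(shifts_for_spectrum(triples, SPECTRUM))  # [18.3, 22.6, 26.5]
```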
15. What to do? - Linked Data!
What if all data could:
● Be easy to share – Yes, RDF is a web-based format
● Be self-describing – Yes, links in the data describe the data
● Use the same (underlying) format – Yes, RDF triples
● Be easy to integrate with other data – Yes, just create links
(In other words: FAIR – Findable, Accessible, Interoperable, Reusable)
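The "just create links" point can be illustrated with a small sketch: two independently produced sets of triples are merged with a plain set union, and a single linking triple connects a resource in one to a resource in the other. The second namespace (`GEO`), its `label` predicate, and the helper functions are hypothetical, invented only for this example; `owl:sameAs` is one common way to state that two URIs denote the same thing.

```python
EX = "http://example.org/myontology/"
GEO = "http://example.com/geo/"  # hypothetical second data source
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

dataset_a = {(EX + "Sweden", EX + "hasCapital", EX + "Stockholm")}
dataset_b = {(GEO + "Stockholm", GEO + "label", "Stockholm")}

# Integration step: one extra triple links the two datasets together
links = {(EX + "Stockholm", SAME_AS, GEO + "Stockholm")}
merged = dataset_a | dataset_b | links


def objects(triples, subject, predicate):
    """All objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def label_of_capital(triples, country):
    """Follow hasCapital, then sameAs, then read the label from the other source."""
    labels = []
    for capital in objects(triples, country, EX + "hasCapital"):
        for other in objects(triples, capital, SAME_AS):
            labels.extend(objects(triples, other, GEO + "label"))
    return labels


print(label_of_capital(merged, EX + "Sweden"))  # ['Stockholm']
```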
17. What we did (1/3):
Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES.
Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi:10.1186/2041-1480-2-S1-S6.
Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
← SWI-Prolog for querying
… Integrated into Bioclipse
Pros / Cons:
+ Powerful querying
+ Easy to integrate into other software
=> Powerful interactive environment
+ Excellent performance
- No support for really large datasets
(exceeding RAM size)
18. What we did (2/3):
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O.
RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Semantics. 2017;8(35):1-13. doi:10.1186/s13326-017-0136-y.
Semantic MediaWiki as a collaborative and
interactive platform for playing around with
data, summarizing and visualizing using SMW’s
Ask query language →
Pros / Cons:
+ Collaboration supported
+ Versioned data storage
+ UI generation included in SMW
- Performance concerns
- Lack of expressiveness and power
in the SMW “Ask” query language
21. What we did (3/3): urisolve
● A simple web server to resolve, or “dereference” URIs
● Returns any data / triples for the URI in question
● Based on data in a triplestore (semantic database)
or an RDF-HDT file (compressed, indexed file format)
● Source code: github.com/pharmbio/urisolve
Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O.
A confidence predictor for logD using conformal regression and a support-vector machine. J Cheminform. 2018;10(1):17. doi:10.1186/s13321-018-0271-1
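The idea of resolving a URI to its triples can be sketched as follows. This is not the urisolve implementation (see the linked repository for that); it is a simplified, in-memory stand-in in which a request path is mapped to a URI and every triple with that URI as subject is returned. The base URI, the store contents, and the function names are all hypothetical.

```python
BASE = "http://example.org/data/"  # hypothetical base URI

# In-memory stand-in for a triplestore or RDF-HDT file (illustrative data)
STORE = [
    (BASE + "Sweden", BASE + "hasCapital", BASE + "Stockholm"),
    (BASE + "Sweden", BASE + "hasPopulation", "9000000"),
    (BASE + "Stockholm", BASE + "locatedIn", BASE + "Sweden"),
]


def format_term(term):
    """Wrap URIs in angle brackets and literals in quotes (no escaping)."""
    if term.startswith(("http://", "https://")):
        return f"<{term}>"
    return f'"{term}"'


def resolve(path, store=STORE, base=BASE):
    """Map a request path to a URI and return (status, body) with its triples."""
    uri = base + path.lstrip("/")
    matches = [t for t in store if t[0] == uri]
    if not matches:
        return 404, ""
    body = "\n".join(
        " ".join(format_term(x) for x in triple) + " ." for triple in matches
    )
    return 200, body


status, body = resolve("/Sweden")
print(status)
print(body)
```

A real resolver would sit behind an HTTP server and query the store on each request instead of scanning a list.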
22. Conclusions
● Linked Data makes data self-describing
● It is extremely flexible to work with
● Lowers the barriers to data entry
23. Vision: A central workbench for Linked Data
SWISH: SWI-Prolog Notebook: swish.swi-prolog.org
… to access all data sources, and
“answer all the (computational) research questions”
24. Thank you
Samuel Lampa @smllmp
PhD Student in Pharm. Bioinformatics @ pharmb.io / farmbio.uu.se