Producing, publishing and consuming linked data - CSHALS 2013
1. Producing, Publishing and Consuming
Linked Data
Three Lessons from the Bio2RDF Project
François Belleau
Centre de recherche du CHUQ, Laval University
Québec, Canada
@bio2rdf
2. • Looking backward to 2004
• Lessons :
1) How to produce RDF
2) How to publish Linked Data
3) How to consume SPARQL endpoints
• Looking forward to the next decade
3. The story of two images
or Bio2RDF fairy tale
2004 vision 2011 reality
16. How to produce RDF
• The Bio2RDF project transforms existing public
databases into RDF;
• Transforming a data format into RDF triples is
simple to do;
• The transformation needs to handle many
kinds of formats (CSV, XML, JSON, HTML,
relational databases).
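The flat-file case above can be sketched in a few lines. A minimal example, assuming a hypothetical two-column gene table (the column names are illustrative, not the real schema of any Bio2RDF source):

```python
import csv
import io

# Hypothetical two-column gene table; the columns are illustrative,
# not an actual source-database schema.
SAMPLE_CSV = "id,symbol\n15275,Hoxa2\n15276,Hoxa3\n"

def row_to_ntriples(row):
    """Turn one CSV row into N-Triples lines using Bio2RDF-style URIs."""
    subject = "<http://bio2rdf.org/geneid:%s>" % row["id"]
    return [
        '%s <http://www.w3.org/2000/01/rdf-schema#label> "%s" .'
        % (subject, row["symbol"]),
    ]

triples = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    triples.extend(row_to_ntriples(row))

print("\n".join(triples))
```

The same row-to-triples mapping applies whether the rows come from CSV, an XML parser, or a relational query; only the extraction step changes.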
17. Methods
• 2006 – Converting XML and HTML documents
from the web using the JSP JSTL library
• 2007-2010 – Perl scripts, JSP web pages
• 2012 – Release 2.0 rdfizers are written in PHP
• 2013 – Use Talend ETL jobs
18. ETL definition from Wikipedia
In computing, Extract, Transform and Load
(ETL) refers to a process in database usage
and especially in data warehousing that
involves:
Extracting data from outside sources
Transforming it to fit operational needs (which
can include quality levels)
Loading it into the end target (database, more
specifically, operational data store, data mart
or data warehouse)
http://en.wikipedia.org/wiki/Extract,_transform,_load
19. Why not use ETL software
to rdfize existing data?
20. Talend Open Studio for Data Integration
free, open-source ETL software built on Eclipse
http://www.talend.com/
21. HGNC 2 Bio2RDF example
EXTRACT from the web
TRANSFORM to RDF
LOAD into triplestore
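The three ETL steps can be sketched as three functions. This is a simplified outline, not the actual Talend job: the source URL, tab-separated layout and endpoint are placeholders, and only the offline TRANSFORM step is demonstrated below.

```python
import urllib.parse
import urllib.request

def extract(url):
    """EXTRACT: download the source file from the web."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def transform(tsv_text):
    """TRANSFORM: turn tab-separated rows into N-Triples lines."""
    triples = []
    for line in tsv_text.strip().splitlines()[1:]:  # skip the header row
        hgnc_id, symbol = line.split("\t")[:2]
        s = "<http://bio2rdf.org/hgnc:%s>" % hgnc_id
        triples.append(
            '%s <http://www.w3.org/2000/01/rdf-schema#label> "%s" .'
            % (s, symbol))
    return triples

def load(triples, update_endpoint, graph):
    """LOAD: push the triples to a triplestore via SPARQL 1.1 Update."""
    update = "INSERT DATA { GRAPH <%s> { %s } }" % (graph, "\n".join(triples))
    data = urllib.parse.urlencode({"update": update}).encode()
    return urllib.request.urlopen(
        urllib.request.Request(update_endpoint, data=data))

# Offline demo of the TRANSFORM step only (placeholder TSV layout):
demo = transform("hgnc_id\tsymbol\n5\tA1BG\n")
print(demo[0])
```

In the real job, `extract` would fetch the HGNC dump, and `load` would target the triplestore's update endpoint.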
25. This rdfizer is available on myExperiment
http://www.myexperiment.org/workflows/3420.html
26. Lesson #1
• Use an existing ETL tool, like Talend, to do fast
and efficient transformations to the RDF
N-Triples format.
• Talend could be extended with new Semantic
Web components to ease RDF transformation
and simplify SPARQL query submission.
27. How to publish
Linked Data
• Design your URI pattern;
• Publish a SPARQL endpoint on the Internet;
• Offer a search engine and a browser;
• Register it in an official registry like CKAN;
• Advertise it in SPARQL endpoint lists;
• Describe your triples with an ontology, or the way
Bio2RDF does;
• Publish SPARQL query examples;
• Index your data in a semantic search service like Sindice.
28. Design your URI pattern
• Bio2RDF uses Banff Manifesto URIs
• http://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Banff_Manifesto
• Example: http://bio2rdf.org/geneid:15275
• Apply the four Linked Data rules
• http://www.w3.org/DesignIssues/LinkedData.html
• Be polite with other URIs
• http://hackathon3.dbcls.jp/wiki/URI
• Example: http://purl.uniprot.org/uniprot/P05067
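The Banff Manifesto pattern is simple enough to state in one function: a single `http://bio2rdf.org/<namespace>:<id>` template for every dataset. A minimal sketch:

```python
def bio2rdf_uri(namespace, identifier):
    """Build a Banff Manifesto-style Bio2RDF URI: one pattern,
    http://bio2rdf.org/<namespace>:<id>, shared by every dataset."""
    return "http://bio2rdf.org/%s:%s" % (namespace.lower(), identifier)

# Reproduces the example URI from this slide:
print(bio2rdf_uri("geneid", "15275"))
```

Because the pattern is uniform, the reverse mapping (URI back to namespace and identifier) is a single string split, which is what makes these URIs easy to be "polite" with.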
29. Publish SPARQL endpoint on the
Internet
• Choose a triplestore technology
• http://en.wikipedia.org/wiki/Triplestore
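Once published, the endpoint is consumable with nothing more than HTTP: a GET with a `query` parameter, asking for the standard SPARQL JSON results format. A minimal client-side sketch (the endpoint URL and query are illustrative, and the demo parses a canned response so it runs offline):

```python
import json
import urllib.parse
import urllib.request

def sparql_request(endpoint, query):
    """Build the HTTP request a published endpoint expects:
    GET ?query=... with the standard SPARQL JSON results media type."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})

def bindings(results_json):
    """Flatten a SPARQL JSON result set into plain dicts."""
    return [{k: v["value"] for k, v in row.items()}
            for row in json.loads(results_json)["results"]["bindings"]]

req = sparql_request("http://bio2rdf.org/sparql",
                     "SELECT * WHERE { ?s ?p ?o } LIMIT 1")

# Offline demo with a canned response in the standard result format:
canned = ('{"results": {"bindings": [{"s": {"type": "uri", '
          '"value": "http://bio2rdf.org/geneid:15275"}}]}}')
print(bindings(canned))
```

Passing `req` to `urllib.request.urlopen` would execute the query against a live endpoint.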
36. Lesson #2
• To be present in the Linked Data cloud, just
publish your data through a SPARQL
endpoint.
• Register it to public resources, describe its
content and suggest SPARQL queries.
• We have used the OpenLink Virtuoso free edition
since 2007. Without this first-class triplestore
software there would be no Bio2RDF
service.
37. How to consume
SPARQL endpoints
Two principles:
1. To answer a specific question first build a
mashup using public or private SPARQL
endpoints.
2. Then, ask your questions to the mashup.
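The two principles above can be sketched with in-memory triple sets standing in for remote SPARQL endpoints (the fetching step is elided, and the example data is invented for illustration):

```python
# Two toy "endpoints" represented as local triple sets; in practice
# each would be filled by querying a remote SPARQL endpoint.
endpoint_a = {("pmid:1", "cites", "pmid:2")}         # e.g. a citation source
endpoint_b = {("pmid:2", "label", "Bio2RDF paper")}  # e.g. a publication source

# Principle 1: build the mashup by merging triples from every source.
mashup = endpoint_a | endpoint_b

# Principle 2: ask the question against the mashup, not the sources.
def labels_of_cited(triples):
    """Which labelled publications are cited by something in the mashup?"""
    cited = {o for s, p, o in triples if p == "cites"}
    return sorted(o for s, p, o in triples if p == "label" and s in cited)

print(labels_of_cited(mashup))
```

The point of the merge step is that the question spans sources: neither endpoint alone can answer it.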
38. How to build a semantic mashup
• 2005 - Import RDF files into Protégé.
• 2006 - Use the ELMO RDF crawler to import RDF
data into the Sesame triplestore.
• 2007 - We implemented an import function in
Sesame based on dereferenceable URIs.
• 2008 - Use the Virtuoso Sponger option and Perl
scripts.
• 2009 - Use the Taverna workflow engine to fetch
triples from SPARQL endpoints.
• 2012 - Use a Talend workflow consuming
SPARQL endpoints.
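The dereferenceable-URI approach in the timeline amounts to content negotiation: ask the URI itself for RDF instead of HTML. A minimal sketch (the media type and URI are illustrative; no request is actually sent):

```python
import urllib.request

def dereference(uri, media_type="application/rdf+xml"):
    """Build a content-negotiated request: the same URI serves HTML to
    browsers and RDF to crawlers, selected by the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": media_type})

req = dereference("http://bio2rdf.org/geneid:15275")
print(req.full_url, req.get_header("Accept"))
```

A crawler repeats this for every URI it discovers in the returned triples, which is how the 2007 Sesame import function populated its store.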
39. Who is influential at CSHALS?
http://cshals.mashup.bio2rdf.org/relfinder/
http://cshals.mashup.bio2rdf.org/sparql
40. Talend workflow to create the
needed semantic mashup
• Do a full-text search for each author (~80)
who has talked at CSHALS since 2007 and get
their publications;
• For each publication, get its XML
description (~1,000) and rdfize it;
• For each publication, get its citation list;
• For each publication citing a previous one,
get its description (~10,000).
41. Global workflow in 3 steps
Full text search
Describe publication
Describe citing
publication
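The three workflow steps compose naturally as functions. A sketch of the pipeline shape only: the fetchers below are stubs returning invented sample data, where the real Talend job calls a full-text search service, publication records and citation lists.

```python
def full_text_search(author):
    """Step 1: find an author's publications (stubbed sample data)."""
    sample = {"Belleau F": ["pmid:100", "pmid:101"]}
    return sample.get(author, [])

def describe_publication(pmid):
    """Step 2: fetch and rdfize a publication record (stubbed)."""
    return [(pmid, "a", "pubmed:Publication")]

def citing_publications(pmid):
    """Step 3: fetch the publications citing this one (stubbed)."""
    sample = {"pmid:100": ["pmid:200"]}
    return sample.get(pmid, [])

def build_mashup(authors):
    """Chain the three steps into one triple-producing workflow."""
    triples = []
    for author in authors:
        for pmid in full_text_search(author):           # step 1
            triples += describe_publication(pmid)       # step 2
            for citing in citing_publications(pmid):    # step 3
                triples += describe_publication(citing)
                triples.append((citing, "cites", pmid))
    return triples

print(build_mashup(["Belleau F"]))
```

Each step fans out over the results of the previous one, which is why the volumes grow from ~80 authors to ~1,000 descriptions to ~10,000 citing records.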
45. Then query the mashup
• What is the CSHALS conference about?
• Who are the most influential researchers in
the community?
• Which articles in semantics have been most
cited?
46. What is the CSHALS conference about?
select ?label2 as ?mesh count(*) as ?count
where {
?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .
?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .
?xMesh rdfs:label "Semantics" .
?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh2 .
?xMesh2 rdfs:label ?label2 .
}
order by desc(?count)
47. Who are the most influential
researchers in the community?
select ?l3 as ?author count(distinct ?pubmed) as ?citation
where {
?s a <http://bio2rdf.org/pubmed_vocabulary#searchResults> .
?s rdfs:label ?l .
?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .
?pubmed <http://bio2rdf.org/pubmed_vocabulary#xCitedIn> ?xCitedIn .
?pubmed rdfs:label ?l2 .
?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .
?xMesh rdfs:label "Semantics" .
?pubmed <http://bio2rdf.org/pubmed_vocabulary#xPerson> ?xPerson .
?xPerson rdfs:label ?l3 .
}
order by desc(?citation)
48. Which articles in semantics have
been most cited?
select ?l2 as ?title count(?xCitedIn) as ?count
where {
?s a <http://bio2rdf.org/pubmed_vocabulary#searchResults> .
?s rdfs:label ?l .
?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .
?pubmed <http://bio2rdf.org/pubmed_vocabulary#xCitedIn> ?xCitedIn .
?pubmed rdfs:label ?l2 .
?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .
?xMesh rdfs:label "Semantics" .
} order by desc(?count)
49. What is the relation between François
Belleau and Michel Dumontier?
53. Lesson #3
• To answer a specific question build a mashup
from SPARQL endpoints and query it.
• To build your semantic mashup, use a
workflow, which can be created with an ETL
tool like Talend.
• Explore the mashup with semantic software
like Virtuoso faceted browser, RelFinder,
Gruff or Sentient.
54. Projects
• Add new data sources to the Bio2RDF collection
of SPARQL endpoints;
• Develop a Talend ETL Semantic Web extension
to ease rdfizing and the SPARQL endpoint
consumption needed to build mashups;
• Create a mobile application to browse
Bio2RDF or other SPARQL data sources.
55. Looking forward to the next decade
• More data providers will expose their data as SPARQL endpoints,
but Bio2RDF is still needed.
• Now that the data has been converted to RDF (a dirty job) we need
to ask useful questions of the Linked Data cloud (a hard one).
SPARQL queries will not be sufficient and reasoners will be
essential.
• Semantic software for browsing, visualisation and editing will be
created, and SPARQL federated query engines will become
available. This will be the next game changer.
• Intuitive mobile applications will give access to Semantic Web
data in a user-friendly manner.
• The data integration experience will be successful for scientist users
only if our enthusiast community gets organized, so governance for
Linked Data in Life Science is a major issue.
56. LSSEC - Life Science
SPARQL Endpoint Club
https://groups.google.com/d/forum/life-science-sparql-endpoint-club
A private club for SPARQL endpoint
publishers to gather and discuss their
concerns about Linked Data, ontologies
and promotion of the Semantic Web in
the Life Science community.
To become a member you need to
publish RDF or host a SPARQL endpoint
of interest to the Life Science
community.
57. Acknowledgements
• Bio2RDF is a community project available at http://bio2rdf.org
• The community can be joined at
https://groups.google.com/forum/?fromgroups#!forum/bio2rdf
• This work was done under the supervision of Dr Arnaud Droit,
assistant professor and director of the Centre de Biologie
Computationnelle du CRCHUQ at Laval University, where Bio2RDF
is hosted.
• Michel Dumontier, from the Dumontier Lab at Carleton University, is
also hosting a Bio2RDF server, and his team created the new release 2.
• Thanks to all the members of the Bio2RDF community, and
especially Marc-Alexandre Nolin and Peter Ansell, the initial developers.
• This work was supported by Ministère du Développement
Economique, Innovation Exportation (MDEIE).