1. Datalift: A Catalyser for the Web of Data
François Scharffe
LIRMM/CNRS/University of Montpellier
francois.scharffe@lirmm.fr
@lechatpito
With the help of the Datalift team
And the support of the French National Research Agency
FOSDEM 5/02/2011
7. Making your data 5 stars
http://www.w3.org/DesignIssues/LinkedData.html
8. So, how to lift data?
How to publish data on the Web as linked data?
● Basic principles: Tim Berners-Lee [2006] (Design Issues)
– Use URIs to identify things (not only documents)
– Use HTTP URIs
– When dereferencing URIs, return a description of the resource
– Include links to other resources on the Web
9. Welcome aboard the data lift
Published and interlinked data on the Web
Applications
Interlinking
Publication infrastructure
Data conversion
Vocabulary selection
Raw data
10. Datalift
Datasets publication
R&D to automate the publication process
Tool suite to help publish data
Training, tutorials, data publication camps
11. 1st floor - Selection
SemWebPro 18/01/2011 11
12. The vocabularies of my friends …
Ø What is a (good) vocabulary for linked data?
§ Usability criteria
Simplicity, visibility, sustainability, integration, coherence …
Ø Different types of vocabularies
§ Metadata, reference, domain, general …
§ The pillars of Linked Data: Dublin Core, FOAF, SKOS
Ø Good and less good practices
§ Ex: BBC Programmes vs legislation.gov.uk
§ Vocabulary of a Friend: networked vocabularies
Ø Linguistic problems
§ 99% of existing vocabularies are in English
§ Terminological approach: which vocabularies for « Event », « Organization »?
13. Did you say « vocabulary »?
… And why not « ontology »?
§ Or « schema » or « metadata schema »?
§ Or « model » (data? world?)
Ø All these terms are used and justifiable
They are all « vocabularies »
§ They define types of objects (or classes)
and the properties (or attributes) attached to these objects.
§ Types and attributes are logically defined
and named using natural language
§ A (semantic) vocabulary
is an explicit formalization
of concepts existing in natural language
14. Vocabularies for linked data
Ø Are meant to describe resources in RDF
Ø Are based on one of the standard W3C languages
§ RDF Schema (RDFS)
• For vocabularies without too much logical complexity
§ OWL
• For more complex ontological constructs
§ These two languages are (almost) compatible
Ø They can be composed « ad libitum »
§ One can reuse a few elements of a vocabulary
§ The original semantics have to be followed
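As an illustration of composing vocabularies, the sketch below describes a document with Dublin Core terms and its author with FOAF terms, then prints the result as N-Triples. The example.org URIs and the `to_ntriples` helper are hypothetical, and the serializer is deliberately simplistic (it treats any object starting with "http" as a URI).

```python
# Namespaces of two widely reused vocabularies.
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines.

    Simplification: objects starting with 'http' are written as URIs,
    everything else as a plain string literal.
    """
    lines = []
    for s, p, o in triples:
        if o.startswith("http"):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


# Hypothetical resources, for illustration only.
doc = "http://example.org/doc/report"
author = "http://example.org/person/alice"

triples = [
    (doc, DCTERMS + "title", "Annual report"),   # Dublin Core term
    (doc, DCTERMS + "creator", author),          # Dublin Core term
    (author, RDF_TYPE, FOAF + "Person"),         # FOAF class
    (author, FOAF + "name", "Alice"),            # FOAF property
]

output = to_ntriples(triples)
print(output)
```

Only a few elements of each vocabulary are reused here, and each is used with its original meaning (a title is a title, a creator is a creator).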
15. What makes a good vocabulary?
Ø A good vocabulary is a used vocabulary
§ Data published on CKAN give an idea of vocabulary usage
§ Example: list of datasets using FOAF http://xmlns.com/foaf/0.1/
Ø Other usability criteria
§ Simplicity and readability in natural language
§ Documentation of elements (definition in natural language)
§ Visibility and sustainability of the publication
§ Flexibility and extensibility
§ Semantic integration (with other vocabularies)
§ Social integration (with the user community)
16. A vocabulary is also a community
Ø Bad (but common) practice
§ Building a vocabulary in isolation
– For example as a research project
– Without basing it on any existing vocabulary
§ Publishing it (or not) and then forgetting about it
§ Not caring about its users
Ø A good vocabulary has an organic life
§ Users and use cases
§ Revisions and extensions
§ Like a « natural » vocabulary
17. Types of vocabularies
Ø Metadata vocabularies
§ Used to annotate other vocabularies
• Dublin Core, Vann, cc REL, Status
Ø Reference vocabularies
§ Provide « common » classes and properties
• FOAF, Event, Time, Org Ontology
Ø Domain vocabularies
§ Specific to a domain of knowledge
• Geonames, Music Ontology, WildLife Ontology
Ø « General » vocabularies
§ Describe « everything » at an arbitrary detail level
• DBpedia Ontology, Cyc Ontology, SUMO
18. Vocabulary of a Friend
Ø http://www.mondeca.com/foaf/voaf
Ø A simple vocabulary...
Ø To represent interconnections between vocabularies
Ø A single entry point to the vocabularies and datasets of
the Linked Data Cloud
Ø Ongoing work in Datalift
20. URL Design and URL Patterns
Ø Good practices for linked data
§ Resource: http://dbpedia.org/resource/Paris
§ Document: http://dbpedia.org/page/Paris
§ Data: http://dbpedia.org/data/Paris
Ø … served using content negotiation
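The pattern above can be sketched as a function that picks the URL a server would redirect to, depending on the Accept header sent by the client. This is a simplified illustration, not DBpedia's actual implementation: it ignores q-values, takes the first media type it recognizes, and the list of RDF media types is an assumption.

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"
PAGE_PREFIX = "http://dbpedia.org/page/"
DATA_PREFIX = "http://dbpedia.org/data/"

# Assumed set of media types treated as "the client wants data".
RDF_MEDIA_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def negotiate(resource_uri, accept_header):
    """Return the document or data URL to redirect to for a resource URI."""
    if not resource_uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a resource URI")
    name = resource_uri[len(RESOURCE_PREFIX):]
    # Drop parameters such as ';q=0.9' and keep the bare media types in order.
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in accepted:
        if media_type in RDF_MEDIA_TYPES:
            return DATA_PREFIX + name
        if media_type in HTML_MEDIA_TYPES:
            return PAGE_PREFIX + name
    # Fall back to the human-readable page.
    return PAGE_PREFIX + name


browser = negotiate("http://dbpedia.org/resource/Paris", "text/html,application/xml;q=0.9")
agent = negotiate("http://dbpedia.org/resource/Paris", "application/rdf+xml")
print(browser)
print(agent)
```

A browser asking for HTML ends up on the page URL, while a client asking for RDF ends up on the data URL. In both cases the resource URI itself keeps identifying the thing (the city), not a document.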
21. URI Pattern in REST
Ø REST (Representational State Transfer) services
manipulate resources, and URLs are mainly used to
address these resources
Ø A base URI:
§ http://www.example.com/bookstore/
Ø A resource has a unique URL: (retrieve, update,
create, delete)
§ http://www.example.com/bookstore/books/ISBN123
Ø Notion of collection: (list, replace, create, delete)
§ http://www.example.com/bookstore/books
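A minimal sketch of this pattern: a URL under the base URI is classified as a collection or a single resource, and the HTTP method then selects the operation. The mapping of HTTP methods to the operations listed on the slide is a common convention assumed here, and `classify` and `operation` are hypothetical helper names.

```python
BASE = "http://www.example.com/bookstore/"

# Assumed, conventional mapping of HTTP methods to the operations on the slide.
OPERATIONS = {
    "collection": {"GET": "list", "PUT": "replace", "POST": "create", "DELETE": "delete"},
    "resource": {"GET": "retrieve", "PUT": "update or create", "DELETE": "delete"},
}


def classify(url):
    """Return (kind, collection_name, identifier) for a URL under BASE."""
    if not url.startswith(BASE):
        raise ValueError("URL is outside the base URI")
    segments = [s for s in url[len(BASE):].split("/") if s]
    if len(segments) == 1:
        return ("collection", segments[0], None)
    if len(segments) == 2:
        return ("resource", segments[0], segments[1])
    raise ValueError("unsupported URL pattern")


def operation(method, url):
    """Name the operation a method performs on the addressed collection or resource."""
    kind, _, _ = classify(url)
    try:
        return OPERATIONS[kind][method.upper()]
    except KeyError:
        raise ValueError(f"{method} is not supported on a {kind}") from None


print(operation("GET", BASE + "books"))
print(operation("GET", BASE + "books/ISBN123"))
```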
22. Tools for conversion to RDF
Ø In what form is the raw data to be converted?
§ Relational database?
§ (Semi-)structured formats?
§ Programmatic access (API)?
Ø There are solutions for all cases
24. Triplify: Relational data to JSON/RDF
Ø Extract the Triplify folder into your web application:
http://sourceforge.net/projects/triplify/
Ø Modify a config file:
§ SQL query … URI pattern
§ PHP lover!
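The idea of pairing an SQL query with a URI pattern can be sketched in Python as follows. This is not Triplify's configuration format (Triplify itself is a PHP tool); it only illustrates the principle under stated assumptions: the first column of the query is the row identifier, the remaining column aliases are used as property names in the FOAF namespace, and the table, base URI, and `rows_to_triples` helper are hypothetical.

```python
import sqlite3

BASE = "http://example.org/"           # hypothetical base URI of the web application
VOCAB = "http://xmlns.com/foaf/0.1/"   # column aliases are read as FOAF property names


def rows_to_triples(conn, query, uri_pattern):
    """Turn each row of `query` into triples.

    The first column fills the {id} slot of `uri_pattern`; every other
    non-NULL column becomes one (subject, property, literal) triple.
    """
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        subject = BASE + uri_pattern.format(id=row[0])
        for column, value in zip(columns[1:], row[1:]):
            if value is not None:
                triples.append((subject, VOCAB + column, str(value)))
    return triples


# A tiny in-memory database standing in for the web application's tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT, full_name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice', NULL)")
conn.execute("INSERT INTO users VALUES (2, 'bob', 'Bob Martin')")

triples = rows_to_triples(
    conn,
    "SELECT id, login AS nick, full_name AS name FROM users ORDER BY id",
    "user/{id}",
)
for triple in triples:
    print(triple)
```

Renaming columns in the SELECT clause is what ties the relational schema to vocabulary terms, so the mapping lives entirely in the query and the URI pattern.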
27. RDF extension for Google Refine
Ø A graphical extension for Google Refine that exports
the cleaned data as RDF
http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/
Name | Job Title | Grade | Organization | Annual pay rate - including taxable benefits and allowances | Notes
Stephan Wilcke | Chief Executive Officer | | Asset Protection Agency | £150,000 - £154,999 |
Jens Bech | Chief Risk Officer | | Asset Protection Agency | £165,000 - £169,999 | No pension
Ion Dagtoglou | Chief Invesment Officer | | Asset Protection Agency | £165,000 - £169,999 | No pension
Brian Scammell | Chief Credit Officer | | Asset Protection Agency | £130,000 - £134,999 | 4 days per week
30. Publication components
[Diagram: Querying, Browsing; SPARQL endpoint, REST; Inference engine; RDF storage; Data loading]
A few products: Virtuoso, Sesame, Mulgara, 4store, OWLIM, AllegroGraph, Bigdata, Jena
31. Named graphs
[Figure: graph of nodes numbered 1 to 16]
Ø RDF graphs are bags of triples, everything is mixed
Ø Delete operates on a graph
Ø SPARQL queries define graphs
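To make the idea concrete, here is a minimal in-memory sketch of a store that keeps triples per named graph, so that one graph can be dropped without touching the others. The `QuadStore` class and the example.org graph names are hypothetical. Real RDF stores expose this through SPARQL (for instance, SPARQL 1.1 Update has a DROP GRAPH operation).

```python
from collections import defaultdict


class QuadStore:
    """Triples grouped by the name (URI) of the graph they belong to."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph, subject, predicate, obj):
        self._graphs[graph].add((subject, predicate, obj))

    def triples(self, graph=None):
        """Triples of one named graph, or the merge of all graphs if graph is None."""
        if graph is not None:
            return set(self._graphs.get(graph, set()))
        merged = set()
        for graph_triples in self._graphs.values():
            merged |= graph_triples
        return merged

    def drop(self, graph):
        """Remove a whole named graph at once."""
        self._graphs.pop(graph, None)


store = QuadStore()
store.add("http://example.org/graph/2010", "ex:Paris", "ex:population", "2243833")
store.add("http://example.org/graph/2011", "ex:Paris", "ex:mayor", "ex:Delanoe")

# Without named graphs both triples would sit in one undifferentiated bag.
store.drop("http://example.org/graph/2010")
remaining = store.triples()
print(remaining)
```

Keeping track of which graph a triple came from is what makes it possible to update or retract one source without rebuilding the whole store.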
32. Inference
[Figure: graph of nodes numbered 1 to 16]
Ø Generating triples from other triples
Ø Deduction mechanism
§ Men are mortal, Socrates is a man, so Socrates is mortal
Ø Avoids exhaustive enumeration and gives meaning to
defined hierarchies
Ø Constraints: cardinality, NFPs, ...
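The Socrates example can be sketched as naive forward chaining over two RDFS-style rules: subclass transitivity and type propagation along the class hierarchy. This is a toy illustration with prefixed names kept as plain strings, not how a production inference engine is built.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Return the input triples plus everything derivable by the two rules.

    Rule 1: (A subClassOf B) and (B subClassOf C)  =>  (A subClassOf C)
    Rule 2: (x type A) and (A subClassOf B)        =>  (x type B)
    """
    closure = set(triples)
    while True:
        derived = set()
        for s, p, o in closure:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in closure:
                if p2 == SUBCLASS and s2 == o:
                    derived.add((s, SUBCLASS, o2))   # rule 1
                if p2 == TYPE and o2 == s:
                    derived.add((s2, TYPE, o))       # rule 2
        if derived <= closure:
            return closure   # fixpoint: nothing new can be derived
        closure |= derived


facts = {
    ("ex:Man", SUBCLASS, "ex:Mortal"),   # men are mortal
    ("ex:Socrates", TYPE, "ex:Man"),     # Socrates is a man
}
closure = infer(facts)
print(("ex:Socrates", TYPE, "ex:Mortal") in closure)
```

Only the class hierarchy and the most specific type need to be stated; the more general types are generated, which is what makes defining hierarchies worthwhile.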
33. Analysis of RDF stores: the QSOS method
Ø Qualification and Selection of Open Source Software
§ An open source project about open source solutions
§ http://www.qsos.org
Ø Goals of QSOS
§ Qualify software
§ Compare solutions after defining requirements and weighting the criteria
§ Select the product best suited to a given need
Ø QSOS provides
§ An objective and formalized method
§ A repository of available studies
§ Tools that make the method easier to apply
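The comparison step (weighting criteria, then ranking candidates) can be illustrated with a small weighted-average computation. Everything below is an assumption made for the example: the criteria names, the weights, the 0 to 2 scoring scale, and the two anonymous candidates. It is not output of the QSOS tools and does not evaluate any real product.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores."""
    total_weight = sum(weights.values())
    return sum(scores[criterion] * weight for criterion, weight in weights.items()) / total_weight


# Hypothetical requirements: a higher weight means the criterion matters more.
weights = {"sparql_support": 3, "inference": 1, "documentation": 2}

# Hypothetical evaluations, each criterion scored on an assumed 0 to 2 scale.
candidates = {
    "store_a": {"sparql_support": 2, "inference": 0, "documentation": 1},
    "store_b": {"sparql_support": 1, "inference": 2, "documentation": 2},
}

ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
for name in ranking:
    print(name, round(weighted_score(candidates[name], weights), 2))
```

Changing the weights changes the ranking, which is why the requirements have to be defined before the solutions are compared.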
35. Linked data and interlinking
Ø Without links there is no Web, only data silos
Ø Links can be part of the dataset design (reference
datasets)
Ø Links can be found after publication: equivalence
links between resources
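Finding equivalence links after publication can be sketched as follows: two datasets are compared on normalized labels, and an owl:sameAs link is proposed for each match. Exact label matching is a naive strategy chosen for brevity and can produce wrong links on real data. The datasets, URIs, and helper names are hypothetical.

```python
import unicodedata

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(label):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def find_links(dataset_a, dataset_b):
    """Propose sameAs links between resources whose normalized labels are equal."""
    index = {}
    for uri, label in dataset_b.items():
        index.setdefault(normalize(label), []).append(uri)
    links = []
    for uri, label in dataset_a.items():
        for target in index.get(normalize(label), []):
            links.append((uri, OWL_SAME_AS, target))
    return links


# Hypothetical datasets: resource URI -> label.
dataset_a = {
    "http://example.org/city/1": "Montpellier",
    "http://example.org/city/2": "Orléans",
}
dataset_b = {
    "http://example.com/place/orleans": "ORLEANS ",
    "http://example.com/place/lyon": "Lyon",
}

links = find_links(dataset_a, dataset_b)
print(links)
```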
37. Tools
Ø RKB-CRS: a coreference resolution service for the RKB
knowledge base
Ø LD-mapper: a linkage tool for datasets described using the
Music Ontology
Ø ODD Linker: a linkage tool based on SQL
Ø RDF-AI: multi-purpose data linkage and fusion
Ø Silk and Silk LSL: linkage tool and linkage specification language
Ø KnoFuss architecture: dataset linkage and fusion
40. Towards automated interlinking services
Ø The linkage specification could be simplified
§ Using alignments between vocabularies
§ Detection of discriminating properties
§ Indicating comparison methods by attaching metadata to
ontologies
Ø Work in progress in Datalift
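One possible reading of "detection of discriminating properties" is sketched below: a property is considered discriminating when its values tend to identify a single resource, so it is a good candidate for comparing resources across datasets. The ratio used here (distinct values divided by the number of resources carrying the property) is an assumption made for the example, not the method developed in Datalift, and the sample data is hypothetical.

```python
from collections import defaultdict


def discriminability(triples):
    """For each property, distinct values / number of subjects that have it.

    A ratio of 1.0 means every subject has its own value (key-like property).
    The ratio can exceed 1.0 when subjects carry several values for a property.
    """
    values = defaultdict(set)
    subjects = defaultdict(set)
    for s, p, o in triples:
        values[p].add(o)
        subjects[p].add(s)
    return {p: len(values[p]) / len(subjects[p]) for p in values}


# Hypothetical book descriptions.
triples = [
    ("ex:book1", "isbn", "ISBN123"),
    ("ex:book2", "isbn", "ISBN456"),
    ("ex:book3", "isbn", "ISBN789"),
    ("ex:book1", "language", "fr"),
    ("ex:book2", "language", "fr"),
    ("ex:book3", "language", "en"),
]

scores = discriminability(triples)
best = max(scores, key=scores.get)
print(best, scores)
```

Here the ISBN separates all three books while the language does not, so a linkage specification would compare ISBNs rather than languages.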