State of the art of adding web semantics and linked data in a CMS like Jahia 6.6
Presentation made at the JahiaOne event in Feb 2014 with A. Di Mascio (Logilab)
4. Background: who are we?
Thomas Delerm (SIGMA) and Adrien Di Mascio (Logilab) will explain the value of
the semantic web in modern web applications for making the best use of your data.
They’ll give the recipes that make Jahia an appropriate CMS for the semantic
and linked data web, a.k.a. "web 3.0".
Adrien DI MASCIO - Semantic Web Director
Company : Logilab
Thomas DELERM - Web Architect
Company : SIGMA
Worked in cell and IPTV content startups
www.sigma.fr
5. How the web evolved
Web « 1 » was about documents and links
Web « 2.0 » is about social and users
https://web.archive.org/web/19991116151216/http://www4.yahoo.com/
7. Failures of Web 2.0
Databases and APIs all sit in silos, so searches are limited
Results are documents, not objects
Are my results up to date and reliable?
Example: Renault. Too many combinations when you want to buy a car: more than 10^20 [1]
[1] http://www.semweb.pro/talk/2474
8. Failures of Web 2.0
Web 2.0 is far from perfect :
User tags:
– Different spellings
– Different meanings for the same spelling (Hollande)
– No relationships between tags
You cannot (in one request) answer complex queries like “List on my website 10 products whose producer is Samsung and price under $50”
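For contrast, here is a minimal sketch of how that question becomes a single query once the data is available as structured statements. The product data, the `ex:` vocabulary and the function name are hypothetical, chosen only for illustration; the SPARQL string shows what the same request could look like against an RDF store.

```python
# Hypothetical product data expressed as (subject, predicate, object) statements.
TRIPLES = [
    ("product:1", "producer", "Samsung"),
    ("product:1", "price", 45.0),
    ("product:2", "producer", "Samsung"),
    ("product:2", "price", 120.0),
    ("product:3", "producer", "Nokia"),
    ("product:3", "price", 30.0),
]

# The same question as a SPARQL query (illustrative; the ex: vocabulary is made up).
SPARQL_QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?product WHERE {
  ?product ex:producer "Samsung" ;
           ex:price ?price .
  FILTER (?price < 50)
}
LIMIT 10
"""


def products_by_producer_under(triples, producer, max_price, limit=10):
    """Return up to `limit` subjects made by `producer` and priced below `max_price`."""
    made_by = {s for s, p, o in triples if p == "producer" and o == producer}
    cheap = {s for s, p, o in triples if p == "price" and o < max_price}
    # A product must satisfy both conditions, hence the set intersection.
    return sorted(made_by & cheap)[:limit]


print(products_by_producer_under(TRIPLES, "Samsung", 50))
```

The point of the sketch is that both conditions are evaluated against the same data set in one pass, instead of being stitched together from several silo searches.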
9. We have a solution
There is always a technical evolution
– From PC to Web: WWW and links
– From Web to Web 2.0: AJAX (dynamic web sites)
– From Web 2.0 to Web 3.0: semantic properties and linked data
So let’s learn what the semantic web is!
11. Semantic Web – (Anti)definitions
Today, Semantic Web is not:
Magic
Natural Language Processing
Automatic image processing
A new protocol
It's a worldwide network of data built upon a set of interoperable standards that
use URLs to identify data and link them together.
12. No Natural Language Processing
A human reads:
<h1>Semantic Web</h1>
<p>Semantic Web is a worldwide network of data invented by <a
href="http://w3.org/People/Berners-Lee">Tim Berners Lee</a> in
1994.</p>
A machine reads:
<h1> ????????????</h1>
<p> ??????????????????????????????????????????????????
????? <a href="http://w3.org/People/Berners-Lee"> ???????????????</a> ????????</p>
13. If only ...
… The machine could read:
SemanticWeb is_a network
SemanticWeb was_created_by TimBernersLee
SemanticWeb was_created_in 1994
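A minimal sketch of why this form is machine-friendly: each statement is a (subject, predicate, object) triple, so a program can answer questions by simple pattern matching. The `match` helper is hypothetical and only illustrates the idea; it is not an RDF library.

```python
# The three statements from the slide, stored as (subject, predicate, object) tuples.
FACTS = [
    ("SemanticWeb", "is_a", "network"),
    ("SemanticWeb", "was_created_by", "TimBernersLee"),
    ("SemanticWeb", "was_created_in", "1994"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in facts
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "Who created the Semantic Web?" becomes a pattern with one unknown.
for _, _, creator in match(FACTS, subject="SemanticWeb", predicate="was_created_by"):
    print(creator)
```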
14. Annotate your document
Use RDFa or schema.org
<p itemscope itemtype="Concept">
<span itemprop="name">Semantic Web</span> is a
<span itemprop="description">worldwide network of data</span>
invented by
<a itemprop="creator" href="http://w3.org/People/Berners-Lee">
Tim Berners Lee</a>
in <span itemprop="creation_date">1994</span>.</p>
15. Publish another representation
Publish RDF and use HTTP content-negotiation
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://mysite.com/SemanticWeb>
a skos:Concept ;
skos:closeMatch <http://data.bnf.fr/ark:/12148/cb119328992> ;
dc:creator <http://w3.org/People/Berners-Lee/> ;
dc:date "1994" .
More familiar with JSON ? Take a look at JSON-LD
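As an illustration, the same resource could be written as a JSON-LD document. This is a minimal sketch built with Python's standard `json` module; it assumes that `dc:` refers to the Dublin Core elements namespace and `skos:` to the SKOS core namespace, and the function name is hypothetical.

```python
import json


def concept_as_jsonld():
    """Build a JSON-LD document describing the Semantic Web concept from the slide."""
    return {
        "@context": {
            # Assumed namespaces for the skos: and dc: prefixes.
            "skos": "http://www.w3.org/2004/02/skos/core#",
            "dc": "http://purl.org/dc/elements/1.1/",
        },
        "@id": "http://mysite.com/SemanticWeb",
        "@type": "skos:Concept",
        # {"@id": ...} marks the value as a link to another resource, not a string.
        "skos:closeMatch": {"@id": "http://data.bnf.fr/ark:/12148/cb119328992"},
        "dc:creator": {"@id": "http://w3.org/People/Berners-Lee/"},
        "dc:date": "1994",
    }


print(json.dumps(concept_as_jsonld(), indent=2))
```

A server doing content negotiation could return this document when the client asks for `application/ld+json`, and the HTML page otherwise.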
16. Vocabularies, ontologies
An ontology is a structured set of terms and concepts.
Each term and concept is also identified by a URL
There are quite a few standard ontologies for various domains
(social interactions, libraries, music, events, etc.)
17. Make it happen now !
RDF is nice
Some database engines store RDF graphs
- You can query them with the SPARQL language
Standardized by W3C
You don't necessarily need to change your technology stack
If your data is structured, publishing RDF is easy
- Choosing an ontology or a vocabulary can be hard
- Making your relational database answer a SPARQL query is hard
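To illustrate the "publishing RDF is easy" point, here is a minimal sketch that turns rows of structured data into N-Triples lines. The base URL, the row layout and the column-to-property mapping are all assumptions made for the example; choosing the right vocabulary remains the hard part.

```python
BASE = "http://example.org/"  # hypothetical base URL for the published resources

# Hypothetical rows, as they might come out of a relational table.
ROWS = [
    {"id": 1, "name": "Semantic Web", "created": "1994"},
]

# Hypothetical mapping from column names to vocabulary properties.
COLUMN_TO_PROPERTY = {
    "name": "http://www.w3.org/2004/02/skos/core#prefLabel",
    "created": "http://purl.org/dc/elements/1.1/date",
}


def escape_literal(value):
    """Escape the characters that N-Triples requires to be escaped in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(row):
    """Return one N-Triples line per mapped, non-empty column of the row."""
    subject = f"<{BASE}concept/{row['id']}>"
    lines = []
    for column, prop in COLUMN_TO_PROPERTY.items():
        if row.get(column) is not None:
            literal = escape_literal(str(row[column]))
            lines.append(f'{subject} <{prop}> "{literal}" .')
    return lines


for row in ROWS:
    print("\n".join(row_to_ntriples(row)))
```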
19. It's all about data
Publishing structured data:
Helps search engines
Better indexing
Better page rank
Eases external data integration
Importing a CSV file requires a preliminary agreement on its structure
Maintaining data is expensive: reuse published data (DBpedia, Freebase, GeoNames)
22. Client case : Bpi
One goal: use state-of-the-art Semantic Web technology, since Bpi is a library
(Bibliothèque Publique d’information)
3 main needs:
– Input data easily for contents and within contents
– Store data in a safe, RDF-friendly manner
– Output data
• On every page for SEO (RDFa)
• In searches
• In exports (RDF)
Good news : Jahia fits !
23. The choice of Jahia
Input :
- Jahia lets you define clear content definitions (CND files) with inheritance.
- Jahia is content-centric
Enrich within contents: CKEditor
On contents: contribution or edit (GWT) modes
24. The choice of Jahia : storage and output
Storage: you need a framework that can abstract different sources of data:
enter JCR
– Unique repository for all content
– External data sources are abstracted: LDAP, files, other databases…
Output:
– Graph structure + XML format fit for metadata
– JSP views can be easily tailored for special export formats
26. Input : CKEditor and categories
Make sure text data is stored as plain HTML
- Properties file to map schema.org HTML code
- In-content schema.org properties: we created a CKEditor plugin
Triple categorization of contents
– Categories (closed list)
– Tags (open)
– Authorities (closed, linked with BnF)
Next steps
– Need for a triple store?
– Categorization through automatic spider browsing?
27. Content structure
Directories per category
The semantic mapping is transparent: no additional field to fill in
Properties files map each field to its semantic exports (Dublin Core, FOAF...)
Challenges met
– Where to store the metadata of a file? Extend jnt:file
– How to create a sub-content while creating its parent? Edit the Spring GWT XML
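The properties-file mapping can be sketched as follows. This is only an illustration of the idea, not the actual Bpi or Jahia implementation: the field names, the `dc:` property names and both helper functions are hypothetical, and the `dc` prefix is assumed to be declared elsewhere on the page.

```python
from html import escape

# Hypothetical properties file: content field = vocabulary property.
PROPERTIES = """
# field = semantic property
title = dc:title
author = dc:creator
date = dc:date
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping


def meta_tags(content, mapping):
    """Return one <meta> tag per mapped field that is present in the content."""
    return [
        f'<meta property="{escape(prop)}" content="{escape(str(content[field]))}">'
        for field, prop in mapping.items()
        if field in content
    ]


mapping = parse_properties(PROPERTIES)
for tag in meta_tags({"title": "Semantic Web", "author": "A & B"}, mapping):
    print(tag)
```

Because the mapping lives in a separate file, contributors fill in the usual fields and the semantic output follows without any extra input on their side.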
28. Vocabularies used
Page                                | Schema.org | OpenGraph | Dublin Core | FOAF
Lists                               | No         | No        | No          | No
Details on short and long contents  | Yes        | Yes       | Yes         | Partial
Details: events, IT resource [file] | Yes        | No        | Yes         | No
Authors, place                      | No         | No        | Yes         | Yes
In HTML                             | Everywhere | Header    | Header      | Everywhere
Format in HTML                      | RDFa       | Meta      | Meta        | RDFa
In RDF
Yes
Yes, one line per meta
Automatic (mapping)
Yes, native
Contributed by
Yes, one line per meta
Automatic + Manual Bpi
Automatic (mapping)
Automatic (mapping)
29. Output
We chose RDFa because it is currently more widely used than microdata
Debate: should enrichment be done manually, automatically, or through a mix of both?
The mapping of fields to dc:xxx properties will be used to improve search results
“ARK” URIs are used to exchange objects between repositories (internal, Jahia, external like BnF)
30. Future
Free your data!
Put them together
Share them between applications and externally
This forces you to organize your IT differently
31. Future : Facebook
Facebook is gradually promoting the posts that contain Open Graph data [1]
« Facebook testing more uses for Open Graph » [2]
[1] http://newsroom.fb.com/News/787/News-Feed-FYI-WhatHappens-When-You-See-More-Updates-fromFriends (January 21, 2014)
[2] http://allfacebook.com/add-to-my-movies-link_b128387
33. Conclusion
“If you’re not paying for it, you are the product” [1]
The Semantic Web is going to be imposed by internet giants because they need it to know you better
Take the first step to enrich your data, don’t miss the train!
Jahia 7 catches it:
– External data provider
– Quality, extensible editor
[1] http://blogs.law.harvard.edu/futureoftheinternet/2012/03/21/meme-patrol-when-something-online-is-free-youre-not-the-customer-youre-the-product/
34. Questions & Answers
Webography:
New W3C Blog on Semantic Web & linked data : http://www.w3.org/blog/data/
http://fr.slideshare.net/AntidotNet/time2-market-lyon-13nov2013-slideshare#
http://fr.slideshare.net/terraces/technologies-du-web-smantique-pour-lentreprise-20
http://fr.slideshare.net/AntidotNet/web-smantique-web-de-donnes-web-30-linked-dataquelques-repres-pour-sy-retrouver
Editor's notes
19 July 2013 at Google: Knowledge Graph expansion. More than a quarter of all searches started showing some kind of Knowledge Graph result after this date.
20 August 2013: Google Hummingbird focuses on conversational and semantic search to try to deliver correct answers to broad-meaning questions.
We deliberately chose not to output semantics on list pages.