This talk was presented in Leipzig at the SEMANTiCS 2014 conference in September. It gives an overview of how Information Content theory metrics can be applied to the Semantic Web, and especially to vocabularies. The results of the proposed ranking metrics can be applied in three areas: (i) vocabulary life-cycle management, (ii) Semantic Web visualizations and (iii) the interlinking process.
Information Content based Ranking Metric for Linked Open Vocabularies
1. Information Content based Ranking Metric for Linked Open Vocabularies
Ghislain A. Atemezing (@gatemezing)
Raphaël Troncy (@rtroncy)
2. Goal and Agenda
Goal: Present a new ranking metric for reusing vocabularies
Motivation
Combine Information Theory with metadata information
Find a new assessment metric for vocabularies
Current situation
Only popularity-based metrics are available (e.g. prefix.cc or LODStats)
Only ONE dimension is used for assessing vocabularies
Proposal: compute the informativeness of LOV terms
Experiments and Results
Applications
2014/09/05 SEMANTICS 2014 - Leipzig, Germany
3. Vocabulary Purpose
Model to understand a domain’s semantics
Vocabulary terms contain information
A term = Class, Object Property, Data Property
Essential for publishing data on the Web
How to quantify the value of a term?
Informativeness value: inversely related to the term's probability
4. Existing catalogs of vocabularies
Some catalogs of vocabularies
5. Linked Open Vocabularies (LOV)
A curated list of vocabularies
More than 420 vocabularies
Each of them is described with the Vocabulary of a Friend (VOAF) schema
Track the (temporal) evolution of vocabularies
Some related services
SPARQL endpoint: http://lov.okfn.org/endpoint/lov
Search function: http://lov.okfn.org/dataset/lov/search
An aggregator endpoint: http://lov.okfn.org/endpoint/lov_aggregator
An intelligent bot agent for updates: http://lov.okfn.org/dataset/lov/bot
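As a minimal sketch of how the SPARQL endpoint listed above could be queried, the snippet below builds a SPARQL Protocol GET URL without sending it. The query (counting resources typed as voaf:Vocabulary) and the helper name `build_query_url` are assumptions for illustration, and the endpoint address is the one given on the slide, which may have moved since the talk.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint URL as given on the slide; it may no longer be live.
LOV_ENDPOINT = "http://lov.okfn.org/endpoint/lov"

# Hypothetical query: count the resources typed as voaf:Vocabulary.
COUNT_VOCABULARIES = """
PREFIX voaf: <http://purl.org/vocommons/voaf#>
SELECT (COUNT(DISTINCT ?vocab) AS ?n)
WHERE { ?vocab a voaf:Vocabulary }
"""


def build_query_url(endpoint: str, sparql: str) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql.strip()})


url = build_query_url(LOV_ENDPOINT, COUNT_VOCABULARIES)
```

The resulting URL can be fetched with any HTTP client; asking for JSON results through an `Accept: application/sparql-results+json` header is the usual choice.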
6. LOV DESCRIPTION: http://lov.okfn.org/dataset/lov/
CORE FEATURES OF THE FRAMEWORK
Domain: General
Intended use: Promote and facilitate the reuse of vocabularies in the linked data ecosystem
Collection: Submitted by any user via the LOV-Suggest tool
Gatekeeping: Manual curation and automatic URI validation
Number of ontologies: 450+
Dynamics: Growing
Search metadata: Yes, with visual depiction
Search within ontology: Yes
Search across ontologies: Keyword-based; structured search (query-based)
Navigation criteria: Ordered by prefix, namespace, title and visual links navigation
Metrics: Reuse popularity on the LOD Cloud
Comments and review: N/A - Only by the curators
Ranking: Metric-based
Web service access: API
SPARQL endpoint: Yes
Content available: Ontology metadata, URI
Read/Write: Read
Ontology directory: Yes
Ontology registry: Yes
Application platform: Yes
LOV DESCRIPTION WITH THE FRAMEWORK OF [d’Aquin-Noy2012-Survey]
7. LOV Evolution since March, 2011
Near-linear growth, starting from 75 vocabularies
The glitch in 2012 corresponds to the migration to OKFN
8. Proposal: Metrics for Ranking LOV
Metrics
Information Content Metric (IC): value of information associated with a given entity
Partition Information Content Metric (PIC)
Proposal: a ranking based on IC and PIC
Method
Adapt the IC and PIC functions to vocabulary semantics
Select candidate vocabularies in the LOV catalog
Compute the scores
9. Information Content Metrics for LOV
Information Content
Formula:
N = maximum value of term occurrence in LOV
φ(t) = number of occurrences of term t in LOV
Partitioned IC
LOV is a semantic network of resources
Formula:
wf = weight for vocabulary f
+objectURI+ = owl:ObjectProperty / owl:DatatypeProperty; rdf:Property
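To make the IC definitions above concrete, here is a minimal sketch. The formula itself is not given in the text, so the code assumes the common information-theoretic form IC(t) = -log2(φ(t) / N), written as log2(N / φ(t)); the base-2 logarithm, the function name and the toy occurrence counts are all assumptions for illustration, not the authors' exact definition.

```python
import math


def information_content(occurrences: dict) -> dict:
    """Assumed form: IC(t) = log2(N / phi(t)), N being the maximum occurrence count.

    Terms with zero occurrences are skipped because the logarithm is undefined.
    """
    n = max(occurrences.values())
    return {
        term: math.log2(n / count)
        for term, count in occurrences.items()
        if count > 0
    }


# Hypothetical occurrence counts phi(t) for three terms.
toy_counts = {"ex:often": 8, "ex:sometimes": 2, "ex:rare": 1}
ic = information_content(toy_counts)
```

Under this assumed form, the most frequent term gets an IC of 0 and rarer terms get higher values, which matches the inverse relation between informativeness and probability stated earlier in the talk.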
10. Information Content Metrics for LOV
(Light)weighting scheme
wf = 2 if datasets use the vocabulary
wf = 1 if the vocabulary reuses other vocabularies
wf = 3 if the vocabulary is reused elsewhere
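The weighting scheme above can be sketched as a small function. The slide does not say what happens when several conditions hold at once or when none holds, so this sketch assumes the highest applicable weight wins and that a vocabulary matching no condition gets 0; the function and parameter names are hypothetical.

```python
def vocabulary_weight(reuses_others: bool, used_by_datasets: bool,
                      reused_elsewhere: bool) -> int:
    """Return wf for a vocabulary under the three-level scheme.

    Assumption: when several conditions hold, the highest weight applies.
    """
    if reused_elsewhere:
        return 3  # the vocabulary is reused elsewhere
    if used_by_datasets:
        return 2  # datasets use the vocabulary
    if reuses_others:
        return 1  # the vocabulary reuses other vocabularies
    return 0      # assumption: this case is not covered by the slide


# Hypothetical vocabulary that reuses others and is also used by datasets.
w = vocabulary_weight(reuses_others=True, used_by_datasets=True,
                      reused_elsewhere=False)
```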
14. Comparison
Relatively stable position of foaf in the prefix.cc, vocab.cc and LODStats catalogues.
LOV-PIC/LODStats: skos and dcterms have a “relatively” stable ranking.
List of “most popular” vocabularies: foaf, skos, dcterms, time, dce, prov.
15. Applications of the Ranking Metrics
Vocabulary life-cycle management
Help assess the use of terms and vocabulary updates
Monitor the use of http://www.w3.org/2003/06/sw-vocab-status/ns#term_status or owl:deprecated
Semantic Web applications
Vocabularies with higher PIC could be proposed to users preferentially, e.g. for choosing properties to display in a faceted browsing interface
Interlinking datasets
Generate sameAs links with data based on vocabulary terms with lower IC value
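For the life-cycle monitoring use case above, one possible approach is a SPARQL query listing the terms of a vocabulary that carry a vs:term_status or owl:deprecated annotation. The sketch below only builds the query text; the function name, the namespace filter and the example namespace are assumptions for illustration.

```python
VS_NS = "http://www.w3.org/2003/06/sw-vocab-status/ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"


def build_status_query(vocab_namespace: str) -> str:
    """Build a SPARQL 1.1 query for status annotations on a vocabulary's terms.

    Assumption: terms are recognised by their IRI starting with the namespace.
    """
    return f"""PREFIX vs: <{VS_NS}>
PREFIX owl: <{OWL_NS}>
SELECT DISTINCT ?term ?status ?deprecated
WHERE {{
  {{ ?term vs:term_status ?status }}
  UNION
  {{ ?term owl:deprecated ?deprecated }}
  FILTER(STRSTARTS(STR(?term), "{vocab_namespace}"))
}}"""


# Hypothetical namespace used only to show the call.
query = build_status_query("http://example.org/vocab#")
```

Running such a query at regular intervals and comparing the results would show which terms change status between vocabulary versions.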
16. Conclusion and Future Work
We have presented new metrics for ranking vocabularies
By applying the Information Content concept to LOV
By taking more dimensions into account in the ranking metrics
The metrics can be applied to vocabulary reuse, ontology modelling and visualizations
Future work
Add equivalence axioms to the ranking model
Compare (P)IC with other graph-based rankings (e.g. PageRank)
Investigate the dependency ranking between vocabularies
Answering these use cases requires screen scraping the various HTML pages of data portals where the applications are described, sometimes with little information or description.
* The Apps4Europe project recently released a large catalog of apps in Europe; the goal now is to write scripts for populating DVIA.
* Towards a Linked Open Visualizations (LOVIZ) catalog