Online Learning Linked Data Profiling Best Practices
1. Online Learning and Linked Data
Lessons Learned and Best Practices
Dataset Profiling
3 April 2014, Besnik Fetahu
2. LinkedUp: Data Catalog Features
34 linked datasets of educational relevance (http://datahub.io/dataset?organization=linked-education)
VoID representations of datasets include the following information:
Manual dataset schema alignments
Accessibility information, i.e. SPARQL endpoint URL
http://purl.org/ontology/bibo/Thesis owl:equivalentClass http://purl.org/ontology/bibo/Thesis
http://swrc.ontoware.org/ontology#Article owl:equivalentClass http://purl.org/ontology/bibo/AcademicArticle
http://data.linkededucation.org/linkedup/dataset/data-open-ac-uk void:sparqlEndpoint http://data.open.ac.uk/query
Co-occurrence graph of data types in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties
Assessing the Educational Linked Data Landscape, D’Aquin, M.,
Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris,
France, May 2013.
3. LinkedUp: Data Catalog Features
VoID representations of datasets include the following information:
Datasets’ resources type graph
Datasets’ Topic Extraction (Dataset Profiling)
[Figure labels: morelab, OpenCourseWare]
4. LinkedUp: Data Catalog Features
VoID representations of datasets include the following information:
Federated query interface:
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
# Endpoints of datasets that declare aiiso:School instances, directly or in a subset
SELECT DISTINCT ?endpoint WHERE {
  ?ds void:sparqlEndpoint ?endpoint .
  { ?ds void:classPartition [ void:class aiiso:School ] }
  UNION
  { ?ds void:subset [ void:classPartition [ void:class aiiso:School ] ] }
}
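The matching logic of the federated query above can be sketched in Python over a handful of in-memory triples. This is only an illustration: the dataset identifiers (`ds:a`, `ds:b`, `ds:c`), the blank-node labels, and the `example.org` endpoints are made up, and a real catalog would be queried through its SPARQL endpoint instead.

```python
# Hypothetical VoID descriptions as (subject, predicate, object) triples.
VOID = "http://rdfs.org/ns/void#"
AIISO = "http://purl.org/vocab/aiiso/schema#"

triples = [
    ("ds:a", VOID + "sparqlEndpoint", "http://example.org/a/sparql"),
    ("ds:a", VOID + "classPartition", "_:p1"),
    ("_:p1", VOID + "class", AIISO + "School"),
    ("ds:b", VOID + "sparqlEndpoint", "http://example.org/b/sparql"),
    ("ds:b", VOID + "subset", "ds:b1"),
    ("ds:b1", VOID + "classPartition", "_:p2"),
    ("_:p2", VOID + "class", AIISO + "School"),
    ("ds:c", VOID + "sparqlEndpoint", "http://example.org/c/sparql"),
    ("ds:c", VOID + "classPartition", "_:p3"),
    ("_:p3", VOID + "class", AIISO + "Course"),
]


def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def has_class_partition(node, cls):
    """True if node has a void:classPartition whose void:class is cls."""
    return any(
        cls in objects(part, VOID + "class")
        for part in objects(node, VOID + "classPartition")
    )


def endpoints_for_class(cls):
    """Endpoints of datasets declaring cls directly or via a void:subset."""
    found = set()
    for ds, predicate, endpoint in triples:
        if predicate != VOID + "sparqlEndpoint":
            continue
        direct = has_class_partition(ds, cls)
        via_subset = any(
            has_class_partition(sub, cls) for sub in objects(ds, VOID + "subset")
        )
        if direct or via_subset:
            found.add(endpoint)
    return sorted(found)


print(endpoints_for_class(AIISO + "School"))
```

The two branches of the function mirror the two sides of the UNION: a class partition on the dataset itself, or on one of its subsets.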
5. LinkedUp: Why dataset profiling?
A few characteristics of linked datasets (from the Linked Open Data Cloud):
Growing number of datasets: 227 datasets
Data represented as triples: 31 billion triples
Multi-lingual content: 18 languages
Broad set of topics covered
Inter-dataset links
Domain                  Datasets  Triples         %        (Out-)Links  %
Media                   25        1,841,852,061   5.82 %   50,440,705   10.01 %
Geographic              31        6,145,532,484   19.43 %  35,812,328   7.11 %
Government              49        13,315,009,400  42.09 %  19,343,519   3.84 %
Publications            87        2,950,720,693   9.33 %   139,925,218  27.76 %
Cross-domain            41        4,184,635,715   13.23 %  63,183,065   12.54 %
Life sciences           41        3,036,336,004   9.60 %   191,844,090  38.06 %
User-generated content  20        134,127,413     0.42 %   3,449,143    0.68 %
Total                   295       31,634,213,770           503,998,829
Domains covered by “lod-cloud” datasets
6. LinkedUp: Why dataset profiling?
How do I find information about “renewable energy”?
31 billion resources, 18 languages, 180 organisations
How can we do that?
Check datasets that cover such a topic? 38 out of 228 datasets contain topic coverage information.
Use a SPARQL filter clause? A regex(*) filter clause has to check every triple for the specific keyword.
What are all possible forms of renewable energy? Solar energy, wind energy, geothermal…...
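The limitation of keyword filtering can be illustrated with a small sketch. The resources and literal values below are invented for the example; the function plays the role of a case-insensitive regex filter applied to every literal.

```python
import re

# Hypothetical (resource, literal value) pairs standing in for dataset triples.
literals = [
    ("ex:plant1", "A renewable energy company operating wind farms"),
    ("ex:plant2", "Solar energy park with 20 MW capacity"),
    ("ex:plant3", "Geothermal power station"),
    ("ex:plant4", "Coal-fired power plant"),
]


def keyword_matches(pattern):
    """Scan every literal and return the resources whose text matches."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [resource for resource, text in literals if rx.search(text)]


print(keyword_matches("renewable energy"))
```

Only `ex:plant1` matches, although `ex:plant2` and `ex:plant3` also describe forms of renewable energy. Every literal has to be scanned, and the caller would still need to enumerate all forms of the topic by hand.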
7. LinkedUp: How to profile Linked Data?
What is a linked data profile?
A linked dataset profile consists of structured information describing the dataset's topic coverage. A profile
is represented as a graph whose vertices are datasets, resources, and topics. Edges connect
‹dataset, resource› pairs and ‹resource, topic› pairs. The edges between resources and topics are
weighted, conveying the relevance of a topic for a dataset.
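One possible in-memory shape for such a profile graph is sketched below. The class name `ProfileGraph`, its methods, and the choice to aggregate topic weights per dataset by summation are assumptions made for illustration, not the representation used by the authors.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProfileGraph:
    # dataset -> set of resources (unweighted edges)
    dataset_resources: dict = field(default_factory=lambda: defaultdict(set))
    # resource -> {topic: weight} (weighted edges)
    resource_topics: dict = field(default_factory=lambda: defaultdict(dict))

    def add_resource(self, dataset, resource):
        self.dataset_resources[dataset].add(resource)

    def add_topic(self, resource, topic, weight):
        self.resource_topics[resource][topic] = weight

    def topics_of(self, dataset):
        """Sum topic weights over a dataset's resources (assumed aggregation)."""
        totals = defaultdict(float)
        for resource in self.dataset_resources[dataset]:
            for topic, weight in self.resource_topics[resource].items():
                totals[topic] += weight
        return dict(totals)


graph = ProfileGraph()
graph.add_resource("ds:energy", "ex:plant1")
graph.add_resource("ds:energy", "ex:plant2")
graph.add_topic("ex:plant1", "Category:Renewable_energy", 0.5)
graph.add_topic("ex:plant2", "Category:Renewable_energy", 0.25)
graph.add_topic("ex:plant2", "Category:Solar_energy", 0.5)
print(graph.topics_of("ds:energy"))
```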
Profile Definition
A dataset consists of a set of resource instances:
<resource_uri_1>
<resource_uri_2>
……
<resource_uri_n>
A resource is represented by a set of triples:
<resource_uri_1> ?predicate_x value
<resource_uri_1> ?predicate_y value
<resource_uri_1> ?predicate_z value
A topic is equivalent to a DBpedia category, associated with one of the resource values.
8. LinkedUp: Profiling Linked Data
i. Metadata extraction
ii. Sampling of resource instances
iii. Entity and topic extraction
iv. Topic ranking (PageRank with Priors, HITS with Priors and K-Step Markov)
v. Weighted dataset-topic profile graphs
vi. Profiles representation
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles. Besnik Fetahu,
Stefan Dietze, Bernardo Pereira Nunes, Marco Antonio Casanova, Davide Taibi, and Wolfgang Nejdl. In
Proceedings of the 11th Extended Semantic Web Conference, Springer, 2014 (to appear).
9. Profiling Linked Data – (I)
i. Metadata extraction:
DataHub’s CKAN API
ii. Sampling of resource instances
weighted, random, centrality
iii. Entity and topic extraction
Consider only the textual values assigned to a resource
NER: Disambiguate and extract named entities (DBpedia Spotlight)
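Two of the sampling strategies named in step ii can be sketched as follows. The slide does not say how the weights are defined, so the weighting used here (resources with more literal values are more likely to be picked) is an assumption, as are the example resources. Centrality-based sampling is left out because it would need a definition of centrality that the slide does not give.

```python
import random

# Hypothetical resources mapped to the textual values attached to them.
resources = {
    "ex:r1": ["Wind farm in Denmark", "Operated since 2009"],
    "ex:r2": ["Solar park"],
    "ex:r3": [],
    "ex:r4": ["Biogas plant", "Capacity 5 MW", "Located in Bavaria"],
}


def random_sample(resources, k, seed=0):
    """Uniform sample of k resources without replacement."""
    rng = random.Random(seed)
    return rng.sample(sorted(resources), k)


def weighted_sample(resources, k, seed=0):
    """Sample without replacement, favouring resources with more literals.

    The weight (number of literal values + 1) is an assumed heuristic.
    """
    rng = random.Random(seed)
    pool = sorted(resources)
    chosen = []
    while pool and len(chosen) < k:
        weights = [len(resources[r]) + 1 for r in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen


print(random_sample(resources, 2))
print(weighted_sample(resources, 2))
```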
10. Profiling Linked Data – (II)
iv. Topic ranking (PageRank with Priors, HITS with Priors and K-Step Markov)
Rank topics for each dataset, and compute their relevance with respect to the
associated resources
v. Weighted dataset-topic profile graph
The computed topic weights for each dataset represent the weights of the
edges <dataset, topic>
vi. Profiles representation (Vocabulary of Interlinked Datasets (VoID) and
Vocabulary of Links (VoL))
VoID: Captures information about a Linked Dataset as a set of links
VoL: Defines a link (of entity or topic type), along with the provenance
information and the relevance score of such a link
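To give an idea of the topic ranking step, here is a minimal sketch of a PageRank-with-priors style iteration (a random walk that jumps back to a prior distribution with probability `beta`). The undirected entity-topic graph, the uniform priors on entities, and the values of `beta` and `iterations` are assumptions for the example and are not taken from the authors' setup.

```python
def pagerank_with_priors(edges, priors, beta=0.15, iterations=50):
    """Power iteration for a random walk with restart to a prior distribution.

    edges:  iterable of (node, node) pairs, treated as undirected
    priors: {node: non-negative prior weight}; at least one must be positive
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    nodes = sorted(neighbours)

    total = sum(priors.get(n, 0.0) for n in nodes)
    prior = {n: priors.get(n, 0.0) / total for n in nodes}

    score = dict(prior)
    for _ in range(iterations):
        # With probability beta the walk restarts from the prior.
        nxt = {n: beta * prior[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            share = (1 - beta) * score[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        score = nxt
    return score


# Hypothetical entities linked to the DBpedia categories extracted for them.
edges = [
    ("e1", "Category:Renewable_energy"),
    ("e2", "Category:Renewable_energy"),
    ("e3", "Category:Renewable_energy"),
    ("e1", "Category:Wind_power"),
]
priors = {"e1": 1.0, "e2": 1.0, "e3": 1.0}
scores = pagerank_with_priors(edges, priors)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy graph the category linked to all three entities receives a higher score than the one linked to a single entity, which is the kind of signal a dataset-topic weight could be derived from.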
11. Profiling Linked Data: Representation Example
Dataset Profile Metadata
Dataset’s Profile and Index
Entity Type Link: extracted entity; provenance information (resources) for the entity link
Topic Type Link: extracted topic; topic relevance score; provenance information (entities) for the topic link
12. How are the profiles useful?
How do I find information about “renewable energy”?
• “Renewable Energy” comes in different forms:
• Solar Energy
• Wind farms
• Biogas
• Hydroelectricity etc.
# Datasets with a topic link to Category:Renewable_energy, the entities the
# topic was derived from, and the resources those entities were found in
SELECT ?dataset ?link ?score ?link_1 ?entity ?resource WHERE {
  ?dataset a void:Linkset .
  ?dataset vol:hasLink ?link .
  ?link vol:linksResource <http://dbpedia.org/resource/Category:Renewable_energy> .
  ?link vol:derivedFrom ?entity .
  ?link vol:hasScore ?score .
  ?link_1 vol:linksResource ?entity .
  ?dataset vol:hasLink ?link_1 .
  ?link_1 vol:derivedFrom ?resource
}
ORDER BY DESC(?score)
http://enipedia.tudelft.nl/wiki/Windmar_Renewable_Energy
http://enipedia.tudelft.nl/data/page/eGRID/Plant/57050
http://enipedia.tudelft.nl/wiki/Us_Energy_Biogas_Corp
http://www.reegle.info/profiles/JP
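A profile query like the one on this slide could be submitted over HTTP using the standard SPARQL protocol, where the query text is passed in the `query` parameter of a GET request. The sketch below only builds the request URL and does not send it. The endpoint is the "LOD Profile Data" address listed among the resources in this deck, and whether it is still reachable is not something this example assumes. The short query used here is a simplified stand-in.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as listed on the resources slide of this deck.
ENDPOINT = "http://data-observatory.org/lod-profiles/sparql"

# Simplified stand-in query; the full profile query additionally needs
# a PREFIX declaration for the vol: vocabulary.
QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset WHERE { ?dataset a void:Linkset } LIMIT 5
"""


def build_query_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL (SPARQL protocol)."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client, typically with an `Accept` header such as `application/sparql-results+json` to request JSON results.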
13. Profiling Linked Data: Evaluation
3 April 2014, Stefan Dietze
[Figure] Profiling accuracy for the different ranking approaches using the full sample of analysed resource instances, with the NDCG score averaged over all datasets.
[Figure] The correlation between ranking accuracy (averaged over all datasets and for ∆NDCG) and ranking time.
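For readers unfamiliar with the metric, NDCG (normalised discounted cumulative gain) compares a ranking against the ideal ordering of the same items. The sketch below uses one common variant with linear gain and a log2 discount; the relevance grades are invented, and the exact variant used in the evaluation is not stated on the slide.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of the topics returned for two datasets.
rankings = {"ds:a": [3, 2, 1], "ds:b": [1, 3, 2]}
per_dataset = {ds: ndcg(rels) for ds, rels in rankings.items()}
average = sum(per_dataset.values()) / len(per_dataset)
print(per_dataset, average)
```

Averaging the per-dataset scores, as in the last lines, corresponds to the "averaged over all datasets" reporting mentioned in the captions.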
14. Profiling Linked Data: Example use cases
Type-specific views on datasets/categories
“Document” (foaf:Document)
“Person” (foaf:Person)
“Course” (aiiso:Course)
LinkedUp Catalog only (as schema mappings are already available there)
Exploratory functionalities over the dataset profiles
Available for the LinkedUp catalog and the LOD-Cloud.
15. Online Learning and Linked Data
Lessons Learned and Best Practices
Cite4Me and Linked Challenge
16. Semantic Search and Retrieval of Publications
Semantic Search
Graph Search
Paper Recommendation
In-depth Analysis
Cite4Me: A Semantic Search and Retrieval Web Application for Scientific Publications. Bernardo
Pereira Nunes, Besnik Fetahu, Stefan Dietze, and Marco Antonio Casanova. Proceedings of the 12th
International Semantic Web Conference, Sydney, Australia, (2013)
18. Demos and Other Resources
Cite4Me: A Semantic Search and Retrieval Web Application for Scientific Publications. Bernardo
Pereira Nunes, Besnik Fetahu, Stefan Dietze, and Marco Antonio Casanova. Proceedings of the 12th
International Semantic Web Conference, Sydney, Australia, (2013)
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles. Besnik Fetahu,
Stefan Dietze, Bernardo Pereira Nunes, Marco Antonio Casanova, Davide Taibi, and Wolfgang Nejdl. In
Proceedings of the 11th Extended Semantic Web Conference, Springer, 2014 (to appear).
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web
Science 2013 (WebSci2013), Paris, France, May 2013.
LinkedUp Catalog: http://data.linkededucation.org/linkedup/catalog/
DevTalk LinkedUp: http://data.linkededucation.org/linkedup/devtalk/
LOD Profile Data: http://data-observatory.org/lod-profiles/sparql
LOD Profile Explorer: http://data-observatory.org/lod-profiles/profile-explorer
Cite4Me Application: http://www.cite4me.com/
LinkedUp Challenge: http://linkedup-challenge.org/