1. The Linked Data Life-Cycle
Jens Lehmann
Quan Nguyen
Sebastian Hellmann
Claus Stadler
Lorenz Bühmann
contributors:
Sören Auer
Anja Jentzsch
Christina Unger
Richard Cyganiak
Dimitris Kontokostas
Daniel Gerber
Axel Ngonga
2013-08-23
Lehmann, Bühmann (Univ. Leipzig)
The Linked Data Life-Cycle
2013-08-23
1 / 252
2. Outline
1 Introduction to Linked Data
2 Linked Dataset Example: DBpedia
3 Linked Data Life-Cycle Overview
4 Knowledge Extraction
5 Data Integration / Linking
6 Enrichment
7 Repair
8 Knowledge Base Exploration / Querying
[Figure: Linked Data Lifecycle diagram with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]
4. The Linked Data Principles
The term Linked Data refers to a set of best practices for publishing and
interlinking structured data on the Web.
Linked Data principles:
1 Use URIs as names for things.
2 Use HTTP URIs, so that people can look up those names.
3 When someone looks up a URI, provide useful information, using the
standards (RDF, SPARQL).
4 Include links to other URIs, so that they can discover more things.
6. Linked Data Principles Detailed: 1 + 2
1 URI references identify not just Web documents and digital
content, but also real-world objects and abstract concepts
tangible things: people, places
abstract things: the relationship type of knowing somebody
2 HTTP URIs enable re-use of the Web architecture; Linked Data
emphasises the "Web" in Semantic Web
Resource dereferencing
Re-use of standard tools for security, load balancing etc.
9. Principles Detailed: 3 Content Negotiation
Humans and machines should be able to retrieve appropriate
representations of resources: HTML for humans, RDF for machines
Achievable using an HTTP mechanism called content negotiation
Basic idea: HTTP clients send HTTP headers with each request to
indicate what kinds of documents they prefer
Servers can inspect the headers and select an appropriate response
Two strategies:
303 URIs
Hash URIs
Both ensure that objects and the documents that describe them are
not confused, and that humans and machines can retrieve appropriate
representations
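The header inspection described above can be sketched in a few lines of Python. This is a simplified illustration, not a complete implementation of HTTP content negotiation: the list of available media types and the fallback to HTML are assumptions, and partial wildcards such as `text/*` are ignored.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an HTTP Accept header into (media type, q-value) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


# Hypothetical set of representations this server can produce.
AVAILABLE = ["text/html", "application/rdf+xml", "text/turtle"]


def negotiate(accept_header: str, available=AVAILABLE, default="text/html") -> str:
    """Pick the representation the client prefers most among those available."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return default
        if media_type in available:
            return media_type
    return default  # simplification: a strict server could answer 406 instead
```

A browser sending `text/html,application/xhtml+xml;q=0.9` would get HTML from this sketch, while an RDF client sending `application/rdf+xml` would get RDF/XML.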
10. 303 URIs
303 Redirect: instead of sending the object itself over the network,
the server responds to the client with the HTTP response code
303 See Other and the URI of a Web document which describes the
real-world object
Second step: the client dereferences the new URI and gets a Web document
describing the real-world object
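The two steps can be simulated without any network access. In the sketch below the "server" is a plain Python function, and all URIs are hypothetical (example.org is a reserved domain); it only illustrates the redirect logic, not a real HTTP stack.

```python
# Hypothetical URIs for illustration only.
DESCRIBED_BY = {
    "http://example.org/resource/Leipzig": "http://example.org/data/Leipzig",
}
DOCUMENTS = {
    "http://example.org/data/Leipzig": "<RDF description of the city of Leipzig>",
}


def respond(uri: str) -> tuple[int, dict, str]:
    """Toy server: return (status code, headers, body) for a GET on `uri`."""
    if uri in DESCRIBED_BY:
        # A real-world object cannot be sent over the network:
        # answer 303 See Other and point to a document describing it.
        return 303, {"Location": DESCRIBED_BY[uri]}, ""
    if uri in DOCUMENTS:
        return 200, {}, DOCUMENTS[uri]
    return 404, {}, ""


def dereference(uri: str, max_redirects: int = 5) -> tuple[str, str]:
    """Follow 303 redirects; return (URI of the final document, body)."""
    for _ in range(max_redirects + 1):
        status, headers, body = respond(uri)
        if status == 303:
            uri = headers["Location"]  # second step: look up the new URI
            continue
        if status == 200:
            return uri, body
        raise LookupError(f"{uri} returned HTTP {status}")
    raise LookupError("too many redirects")
```

Calling `dereference("http://example.org/resource/Leipzig")` takes two requests: the first yields the 303 response, the second returns the describing document.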
11. Hash URIs
The hash URI strategy builds on the characteristic that URIs may contain a
special part (the fragment identifier) separated from their base part by a
hash symbol (#)
The HTTP protocol requires the fragment part to be stripped off before
requesting the URI from the server
→ a URI that includes a hash cannot be retrieved directly and
therefore does not necessarily identify a Web document
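The fragment stripping can be observed with Python's standard library. The vocabulary URIs below are hypothetical; the point of the sketch is that several hash URIs collapse to one document URI on the wire.

```python
from urllib.parse import urldefrag

# Hypothetical hash URIs that share one document (example.org is a reserved domain).
thing_uris = [
    "http://example.org/vocab#Person",
    "http://example.org/vocab#knows",
]


def document_uri(uri: str) -> str:
    """Return the URI that is actually requested from the server."""
    url, _fragment = urldefrag(uri)  # the fragment never leaves the client
    return url


# Both things are described by the same retrievable document.
requested = {document_uri(u) for u in thing_uris}
```

Here `requested` contains only `http://example.org/vocab`, which is why the descriptions of all resources sharing that base part arrive together.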
12. Hash versus 303
Hash URIs
(+) Reduced number of necessary HTTP round-trips → reduces access
latency
(-) Descriptions of all resources sharing the same non-fragment URI
part are always returned to the client together → can lead to large
amounts of data being unnecessarily transmitted to the client
303 URIs
(+) Flexible because the redirection target can be configured
separately for each resource (usually points to a single document for
each resource, but could also summarise several resources)
(-) Requires two HTTP requests to retrieve a single description of a
real-world object
13. Principles Detailed: 4 Links
If an RDF triple connects URIs in different namespaces/datasets, it is
called a link (no unique syntactical definition of link exists)
Basic idea of Linked Data: apply the general hyperlink-based
architecture of the World Wide Web to the task of sharing structured
data on a global scale
Research challenge: efficient creation of links with high precision and
recall
15. Why Linked Data?
Problem:
Try to search for these things on the current Web:
Apartments near German-Russian bilingual childcare in Leipzig.
ERP service providers with offices in Vienna and London.
Researchers working on multimedia topics in Eastern Europe.
Information is available on the Web, but opaque to current Web
search.
Solution: complement text on Web pages with structured linked open data
and intelligently combine/integrate such structured information from
different sources
16. How to get there?
17. Tim Berners-Lee's 5-star plan
Tim Berners-Lee's 5-star plan for an open web of data
1 star: Make data available on the Web under an open license
2 stars: Make it available as structured data
3 stars: Use a non-proprietary format
4 stars: Use URIs to identify things
5 stars: Link your data to other people's data to provide context
18. The 0th star
Data catalog with good metadata
Make your data findable
19. Data on the Web, Open License
20. Data on the Web, Open License
Open vs. Closed:
Data used to be closed by default
In the future, it may be open by default.
21. Data on the Web, Open License
Publishers: sharing data to make it more visible
22. Data on the Web, Open License
E-Commerce: Data sharing for increasing traffic
23. Data on the Web, Open License
Community: Collaboratively created databases
24. Good reasons against opening data
Privacy
Competitive advantage
Producing data and charging for it as business model
Can't get license from upstream
25. Structured Data
Enabling re-use:
Delivering data to end users in different forms
Combining data with other data
3rd party analysis of data
26. Structured Data
Formats:
Good for re-use / Structured: MS Excel, CSV, XML, JSON, Microdata
Not so good for re-use: Pure websites, MS Word
Bad for re-use: PDF
Really bad for re-use: Only charts/maps without numbers
28. Non-Proprietary Formats
Specialist tools often have specialist formats
Few people have the tools
Expensive
Difficult to re-use
(Geospatial tools, statistics packages, etc.)
Non-proprietary:
CSV (dead simple)
XML
JSON
RDF (good for 4+5 stars)
31. URIs as Identifiers
URI-Design: prefer stable, implementation-independent URIs
32. URIs as Identifiers
Turning local identifiers into URIs. Why?
Make them globally unique
Clarify authority
Make them resolvable
Make them linkable
34. Links to Other Data
Hyperlinks are the soul of the Web. The Web of Data is no different.
36. Summary
Linked Data Principles:
1 Use URIs to name things (not only documents, but also people,
locations, concepts, etc.)
2 To enable agents (human users and machine agents alike) to look up
those names, use HTTP URIs
3 When someone looks up a URI, provide useful information
(structured data in RDF, SPARQL).
4 Include links to other URIs allowing agents to discover more things
5-Star-Data:
Five-star plan for realising an emerging web of data, dataset by
dataset
2 stars: re-usable data
3 stars: open standards
4+5 stars: connect data silos
38. DBpedia
Community effort to extract structured information from Wikipedia
and to make this information available on the Web
Allows sophisticated queries to be asked against Wikipedia, and
other data sets on the Web to be linked to Wikipedia data
Semi-structured Wiki markup → structured information
39. Wikipedia Limitations
Simple questions that are hard to answer with Wikipedia:
What do Innsbruck and Leipzig have in common?
Who are the mayors of central European towns elevated more than
1000m?
Which movies star both Brad Pitt and Angelina Jolie?
All soccer players who played as goalkeeper for a club that has a
stadium with more than 40,000 seats and who were born in a country
with more than 10 million inhabitants
41. DBpedia Information Extraction Framework
DBpedia Information Extraction Framework (DIEF)
Started in 2007
Hosted on Sourceforge and Github
Initially written in PHP, but fully rewritten in Scala and Java
Around 40 contributors
See https://www.ohloh.net/p/dbpedia for a detailed overview
Can potentially be adapted to other MediaWikis
Currently Wiktionary: http://wiktionary.dbpedia.org
43. DIEF - Raw Infobox Extractor
WikiText syntax
{{Infobox Korean settlement
|title = Busan Metropolitan City
...
|area_km2 = 763.46
|pop = 3635389
|region = [[Yeongnam]]
}}
RDF serialization
dbp:Busan dbp:title "Busan Metropolitan City"
dbp:Busan dbp:area_km2 "763.46"^^xsd:float
dbp:Busan dbp:pop "3635389"^^xsd:int
dbp:Busan dbp:region dbp:Yeongnam
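The idea behind the raw infobox extraction can be sketched in Python. This is a toy illustration, not the DIEF code (which is written in Scala and Java): the regular expressions, the datatype guessing and the helper names `to_object` and `extract` are all assumptions made for this example.

```python
import re

WIKITEXT = """{{Infobox Korean settlement
|title = Busan Metropolitan City
|area_km2 = 763.46
|pop = 3635389
|region = [[Yeongnam]]
}}"""


def to_object(value: str) -> str:
    """Guess an RDF object term from a raw infobox value."""
    link = re.fullmatch(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", value)
    if link:  # wiki link -> resource
        return "dbp:" + link.group(1).strip().replace(" ", "_")
    if re.fullmatch(r"[+-]?\d+", value):
        return f'"{value}"^^xsd:int'
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return f'"{value}"^^xsd:float'
    return f'"{value}"'  # anything else stays a plain literal


def extract(subject: str, wikitext: str) -> list[tuple[str, str, str]]:
    """Turn each `|key = value` line of an infobox into one triple."""
    triples = []
    for line in wikitext.splitlines():
        match = re.match(r"\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", line)
        if match and match.group(2):
            key, value = match.groups()
            predicate = "dbp:" + key.replace(" ", "_")
            triples.append((subject, predicate, to_object(value)))
    return triples


for triple in extract("dbp:Busan", WIKITEXT):
    print(*triple)
```

Because every infobox key becomes its own property, spelling variants such as birth_place and birthplace end up as different properties, which is the diversity problem the mapping-based extractor addresses.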
44. DIEF - Raw Infobox Extractor/Diversity
45. DIEF - Raw Infobox Extractor/Diversity
46. DIEF - Mapping-Based Infobox Extractor
Cleaner data:
Combine what belongs together (birth_place, birthplace)
Separate what is different (bornIn, birthplace)
Correct handling of datatypes
Mappings Wiki: http://mappings.dbpedia.org
Everybody can contribute new mappings or improve existing ones
≈ 170 editors
47. DIEF - Mapping-Based Infobox Extractor
48. URI/IRI schemes
http://{lang.}dbpedia.org is the main domain
For every article there exists a DBpedia resource of the form:
http://{lang.}dbpedia.org/resource/{ArticleName}
Properties from the raw infobox extractor use the
http://{lang.}dbpedia.org/property/ namespace
The ontology is global for all languages and lives under the
http://dbpedia.org/ontology/ namespace
Note: for the English language no language code is used
http://dbpedia.org as main domain
http://dbpedia.org/resource/{title} for articles
http://dbpedia.org/property/{title} for properties
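The scheme can be written down as a few small Python helpers. The function names are made up for this sketch, and replacing spaces with underscores is the only title normalisation applied here; any further encoding of special characters is left out.

```python
def dbpedia_domain(lang: str = "en") -> str:
    """English uses no language code; other editions use a subdomain."""
    return "http://dbpedia.org" if lang == "en" else f"http://{lang}.dbpedia.org"


def resource_uri(article: str, lang: str = "en") -> str:
    """URI of the resource extracted from a Wikipedia article."""
    return f"{dbpedia_domain(lang)}/resource/{article.replace(' ', '_')}"


def property_uri(name: str, lang: str = "en") -> str:
    """URI of a property produced by the raw infobox extractor."""
    return f"{dbpedia_domain(lang)}/property/{name}"


def ontology_uri(name: str) -> str:
    """URI of an ontology term; the ontology is global for all languages."""
    return f"http://dbpedia.org/ontology/{name}"
```

For example, `resource_uri("Dresden")` gives `http://dbpedia.org/resource/Dresden`, while `resource_uri("Dresden", "de")` gives `http://de.dbpedia.org/resource/Dresden`.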
49. Linked Data Publication via 303 Redirects
http://dbpedia.org/resource/Dresden - URI of the city of Dresden
http://dbpedia.org/page/Dresden - information resource
describing the city of Dresden in HTML format
http://dbpedia.org/data/Dresden - information resource
describing the city of Dresden in RDF/XML format
further formats supported, e.g. http://dbpedia.org/data/Dresden.n3 for N3
50. DBpedia Links
Data set                  Predicate               Count     Tool
Amsterdam Museum          owl:sameAs              627       S
BBC Wildlife Finder       owl:sameAs              444       S
Book Mashup               rdf:type, owl:sameAs    9 100
Bricklink                 dc:publisher            10 100
CORDIS                    owl:sameAs              314       S
Dailymed                  owl:sameAs              894       S
DBLP Bibliography         owl:sameAs              196       S
DBTune                    owl:sameAs              838       S
Diseasome                 owl:sameAs              2 300     S
Drugbank                  owl:sameAs              4 800     S
EUNIS                     owl:sameAs              3 100     S
Eurostat (Linked Stats)   owl:sameAs              253       S
Eurostat (WBSG)           owl:sameAs              137
CIA World Factbook        owl:sameAs              545       S
51. DBpedia Links
Data set                  Predicate                 Count       Tool
flickr wrappr             dbp:hasPhotoCollection    3 800 000   C
Freebase                  owl:sameAs                3 600 000   C
GADM                      owl:sameAs                1 900
GeoNames                  owl:sameAs                86 500      S
GeoSpecies                owl:sameAs                16 000      S
GHO                       owl:sameAs                196         L
Project Gutenberg         owl:sameAs                2 500       S
Italian Public Schools    owl:sameAs                5 800       S
LinkedGeoData             owl:sameAs                103 600     S
LinkedMDB                 owl:sameAs                13 800      S
MusicBrainz               owl:sameAs                23 000
New York Times            owl:sameAs                9 700
OpenCyc                   owl:sameAs                27 100      C
OpenEI (Open Energy)      owl:sameAs                678         S
52. DBpedia Links
Data set      Predicate           Count        Tool
Revyu         owl:sameAs          6
Sider         owl:sameAs          2 000        S
TCMGeneDIT    owl:sameAs          904
UMBEL         rdf:type            896 400
US Census     owl:sameAs          12 600
WikiCompany   owl:sameAs          8 300
WordNet       dbp:wordnet_type    467 100
YAGO2         rdf:type            18 100 000
Sum                               27 211 732
(S: Silk, L: LIMES, C: custom script, missing: no regeneration)
53. DBpedia Links - Query Example
Compare funding per year (from FTS) and country with the gross domestic
product of a country (from DBpedia)
SELECT * {
  {
    SELECT ?ftsyear ?dbpcountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?ftscountry .
      ?ftscountry owl:sameAs ?dbpcountry .
    }
  }
  {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER (?ftsyear = str(?gdpyear))
}
54. Infrastructure
DBpedia has two extraction modes:
Wikipedia-database-dump-based extraction
DBpedia Live synchronisation (more later)
DBpedia Dumps:
The DBpedia Dump archive is located in:
http://downloads.dbpedia.org/
The latest downloads are described at: http://dbpedia.org/Downloads
Official Endpoint (by OpenLink): http://dbpedia.org/sparql
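A query is sent to such an endpoint via the SPARQL protocol, where a GET request carries the query text in the `query` parameter. The sketch below only builds the request URL and does not contact the server; the example query and the helper name `sparql_get_url` are illustrative assumptions, and the result format a real endpoint returns depends on its configuration and the Accept header sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Illustrative query: films starring both actors.
QUERY = """
SELECT ?film WHERE {
  ?film <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Brad_Pitt> .
  ?film <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Angelina_Jolie> .
}
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL-protocol GET URL with the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(ENDPOINT, QUERY)
# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL could then be fetched with any HTTP client, for instance `urllib.request.urlopen(url)`.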
55. Query Answering
Back to our Wikipedia questions:
What do Innsbruck and Leipzig have in common?
Who are the mayors of central European towns elevated more than
1000m?
Which movies star both Brad Pitt and Angelina Jolie?
All soccer players who played as goalkeeper for a club that has a
stadium with more than 40,000 seats and who were born in a country
with more than 10 million inhabitants
Using the data extracted from Wikipedia and the public SPARQL endpoint,
DBpedia can answer these questions.
56. DBpedia Live
DBpedia dumps are generated on a bi-annual basis
Wikipedia has around 100,000 to 150,000 page edits per day
DBpedia Live pulls page updates in real time and the extraction results
update the triple store
In practice, a 5 minute update delay increases performance by 15%
Links
SPARQL Endpoint: http://live.dbpedia.org/sparql
Documentation: http://wiki.dbpedia.org/DBpediaLive
Statistics: http://live.dbpedia.org/LiveStats/
57. DBpedia Live - Overview
58. DBpedia Internationalization (I18n)
DBpedia Internationalization Committee founded:
http://wiki.dbpedia.org/Internationalization
Available DBpedia language editions in:
Korean, Greek, German, Polish, Russian, Dutch, Portuguese, Spanish,
Italian, Japanese, French
Use the corresponding Wikipedia language edition for input
Mappings available for 23 languages
59. DBpedia I18n - Overview
60. Applications: Disambiguation
Named entity recognition and disambiguation
Tools such as: DBpedia Spotlight, AlchemyAPI, Semantic API, Open Calais,
Zemanta and Apache Stanbol
61. Applications: Question Answering
DBpedia is the primary target for several QA systems in the Question
Answering over Linked Data (QALD) workshop series
IBM Watson also relied on DBpedia
63. Applications: Search and Querying
Query Builder
RelFinder
SemLens
64. Applications: Digital Libraries and Archives
Virtual International Authority Files (VIAF) project as Linked Data
VIAF added a total of 250,000 reciprocal authority links to Wikipedia.
DBpedia can also provide:
Context information for bibliographic and archive records (e.g. an
author's demographics, a film's homepage, an image etc.)
Stable and curated identifiers for linking.
The broad range of Wikipedia topics can form the basis for a thesaurus
for subject indexing.
65. Applications: DBpedia Mobile
DBpedia Mobile is a location-centric DBpedia client application for mobile
devices consisting of a map view, the Marbles Linked Data Browser and a
GPS-enabled launcher application.
66. Applications: DBpedia Wiktionary
Wiktionary is a Wikimedia project: http://wiktionary.org
171 languages, 3M words for English.
Extracted using the DBpedia Information Extraction Framework
Easily configurable for every Wiktionary language edition
Pre-configured for German, Greek, English, Russian and French.
http://wiktionary.dbpedia.org
100 million triples
Lemon model
69. Linked Data - Achievements and Challenges
Achievements:
1 Extension of the Web with a data commons (50B facts)
2 Vibrant, global RTD community
3 Industrial uptake begins (e.g. BBC, Thomson Reuters, Eli Lilly,
NY Times, Facebook, Google, Yahoo)
4 Governmental adoption in sight
5 Establishing Linked Data as a deployment path for the Semantic
Web.
Challenges:
1 Coherence: Relatively few, expensively maintained links
2 Quality: partly low quality data and inconsistencies
3 Performance: Still substantial penalties compared to relational
4 Data consumption: large-scale processing, schema mapping and
data fusion still in its infancy
5 Usability: Missing direct end-user tools and network effect.
These issues are closely related and should ultimately lead to an
ecosystem of interlinked knowledge!
72. Extraction
From unstructured sources
Formats: plain text
Methods: NLP, text mining, ontology learning
From semi-structured sources
Formats: wiki markup, tags
Tools: DBpedia framework (Wikipedia, Wiktionary)
From structured sources
Formats: databases, spreadsheets, XML
RDB2RDF tools: Sparqlify, D2R, Triplify
CSV converters: RDF extension of Google Refine
73. Extraction Challenges
From unstructured sources
Improve F-Measure of existing NLP approaches (OpenCalais, Ontos
API)
Develop standardized, LOD enabled interfaces between NLP tools
(NLP2RDF)
From semi-structured sources
Efficient bi-directional synchronization
From structured sources
Declarative syntax and semantics of data model transformations (W3C
WG RDB2RDF)
Orthogonal challenges
Using LOD as background knowledge
Provenance
75. RDF Data Management
From unstructured sources
SPARQL/RDF access is still slower than relational data management
by a factor of 2-10
Performance increases steadily
Comprehensive, well-supported open-source and commercial
implementations are available:
OpenLink's Virtuoso (os+commercial)
OWLIM-Lite (free), OWLIM-SE, OWLIM-Enterprise
Talis (hosted)
Bigdata (distributed)
Allegrograph (commercial)
Mulgara (os)
76. Storage and Querying Challenges
Reduce the performance gap between relational and RDF data
management
SPARQL Query extensions: Spatial/semantic/temporal data
management
View maintenance / adaptive reorganization based on common access
patterns
More realistic benchmarks
82. Interlinking
The Data Web is an uncontrolled environment → proliferation of equivalent
or similar entities → need for links / merging
Currently only few RDF triples are links
Manual Link Discovery:
Sindice Integration, LODStats, Semantic Pingback
Tool supported / Semi-Automatic:
SILK, LIMES, COMA, RDF-AI
Usually via mapping specifications / heuristics
Machine Learning / Automatic:
RAVEN, EAGLE, SILK GP
83. Interlinking Challenges
Apply work in the de-duplication/record linkage literature
Consider the open world nature of Linked Data
Use LOD background knowledge
Zero-configuration linking
Explore active learning approaches, which integrate users in a feedback
loop
Maintain a 24/7 linking service: Linked Open Data Around-The-Clock
project (http://latc-project.eu/)
85. Enrichment
Currently, lack of knowledge bases with sophisticated schema
information and instance data adhering to this schema
Goal: powerful reasoning, consistency checking and querying
Manual:
Via ontology editors, DBpedia mappings
(Semi-)Automatic:
DL-Learner, Statistical Schema Induction
86. Enrichment: Example
Given: knowledge base with property birthPlace (i.e. triples using that
property) but no information on the semantics of birthPlace
Possible enrichment:
ObjectProperty: birthPlace
Characteristics: Functional
Domain: Person
Range: Place
SubPropertyOf: hasBeenAt
Benefits:
axioms serve as documentation for purpose and correct usage of
schema elements
additional implicit information can be inferred
improve the applicability of schema debugging techniques
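How such axioms might be suggested from instance data can be sketched with a naive heuristic in Python. This is not the method used by DL-Learner or Statistical Schema Induction; the toy triples and the function `suggest_axioms` are assumptions for illustration. Under the open world assumption the output is only a candidate for review, since the absence of a counterexample does not prove an axiom.

```python
from collections import Counter

# Toy data; all names are made up for illustration.
TRIPLES = [
    ("Alice", "rdf:type", "Person"),
    ("Bob", "rdf:type", "Person"),
    ("Leipzig", "rdf:type", "Place"),
    ("Innsbruck", "rdf:type", "Place"),
    ("Alice", "birthPlace", "Leipzig"),
    ("Bob", "birthPlace", "Innsbruck"),
]


def suggest_axioms(triples, prop):
    """Suggest axioms for `prop` that hold for every usage in `triples`."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    usages = [(s, o) for s, p, o in triples if p == prop]

    # Functional candidate: no subject has two different values.
    values = {}
    for s, o in usages:
        values.setdefault(s, set()).add(o)
    functional = all(len(v) == 1 for v in values.values())

    def common(items):
        """Classes that every one of the given resources belongs to."""
        counts = Counter(t for i in items for t in types.get(i, ()))
        return sorted(t for t, n in counts.items() if n == len(items))

    return {
        "functional": functional,
        "domain": common({s for s, _ in usages}),
        "range": common({o for _, o in usages}),
    }


suggestion = suggest_axioms(TRIPLES, "birthPlace")
```

On the toy data the suggestion matches the first three axioms of the example: birthPlace is a functional candidate with domain Person and range Place.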
87. Repair
Ontology Debugging: OWL reasoning to detect inconsistencies and
unsatisfiable classes + detect the most likely sources for the problems
basic task: provide feedback to user for resolving undesired entailments
a justification J ⊆ O of an entailment is a minimal set of axioms from
which the entailment can be drawn
89. Linked Data Quality Analysis
Quality on the Data Web varies a lot
Hand-crafted or expensively curated knowledge bases (e.g. DBLP,
UMLS) vs. knowledge extracted from text or Web 2.0 sources (DBpedia)
Quality = Fitness for use
Often not necessary to fix all problems, but to know about them
30+ quality dimensions defined in a recent survey
Research Challenge
Establish measures for assessing the authority, provenance and reliability of
Data Web resources
91. KB Evolution
Tasks:
Performing knowledge base changes / refactoring
Ensuring consistency of related knowledge
Managing changes, e.g. undo operations
Update materialized inferred data upon changes
Update materialised links to other data upon changes
Tools:
Protégé - PROMPT and change management plugins
EvoPat - easily re-usable and sharable evolution patterns defined via
SPARQL
PatOMat - ontology transformation framework
93. Exploration
RDF data can be complex (as discussed by Pascal Hitzler)
Exploration phase aims to make data accessible to non-experts
Options:
Faceted Browsing
Question Answering
Query Builders
Visualisation of statistical or geospatial data
...
98. Make the Web a Linked Data Washing Machine
99. Tool Support for Life-Cycle?
Many SW tools support one or more life-cycle stages
Linked Data Stack (http://stack.linkeddata.org) provides a
consolidated repository of such tools
Each tool is a Debian package
Lightweight integration between tools via common vocabularies and
SPARQL
Demonstrator interfaces for showing tools in combination
Developed by LOD2 and GeoKnow EU projects
101. Knowledge Extraction
Knowledge Extraction is the creation of knowledge from structured
(relational databases, XML) and unstructured (text, documents, images)
sources.
Resulting knowledge needs to be in a machine-readable and
machine-interpretable format and facilitate inferencing
Similar to Information Extraction (NLP) and ETL (Data Warehouse),
but main difference: extraction result goes beyond the creation of
structured information or the transformation into a relational schema
Requires re-use of existing formal knowledge (reusing ontologies) or
the generation of a schema based on the source data
102. Categorisation of Approaches
Source - Examples: plain text, relational databases, XML, CSV
Exposition - How is the extracted knowledge made explicit? How can
you query and perform inference?
Synchronization - Is the knowledge extraction process executed once
to produce a dump or is the result synchronized with the source? Are
changes to the result written back (Bi-directional)?
Reuse of Vocabularies - Can popular ontologies (Good Relations,
FOAF, . . . ) be re-used to simplify global data integration?
Automatisation - manual, semi-automatic, automatic
Domain Ontology Required - Does the approach require a
pre-defined ontology or can it create a schema from the source
(e.g. ontology learning)?
103. Extraction from Structured Sources to RDF
Simple mappings from RDB tables/views to RDF
Direct mapping of the model of relational databases to RDF
Table → OWL class
Row → Instance s of this class
Cell with value o in column p → Triple (s, p, o)
Details: http://www.w3.org/TR/rdb-direct-mapping/
Complex mappings of relational databases to RDF
Additional refinements can be applied to the 1:1 mapping to improve the
usefulness of the RDF output
Extract or learn an OWL schema from the given database schema
Map the schema and its contents to a pre-existing domain ontology
Powerful mapping languages: R2RML, SML
XML
XML tree structure can be directly converted to RDF graph structure
Complex mappings possible, e.g. via XSLT processors
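The simple direct mapping (table to class, row to instance, cell to triple) can be sketched in Python as follows. It is a simplified take on the idea and not the full W3C algorithm: the base IRI, the table and the function `direct_map` are hypothetical, only single-column primary keys are handled, and values are kept as Python objects rather than typed RDF literals.

```python
# Hypothetical base IRI (example.org is a reserved domain).
BASE = "http://example.org/db/"


def direct_map(table, primary_key, rows):
    """Map the rows (dicts) of one table to (s, p, o) triples."""
    triples = []
    class_iri = f"{BASE}{table}"  # table -> class
    for row in rows:
        # row -> instance s of this class, identified via its primary key
        subject = f"{BASE}{table}/{primary_key}={row[primary_key]}"
        triples.append((subject, "rdf:type", class_iri))
        for column, value in row.items():
            if value is None:
                continue  # NULL cells produce no triple
            # cell with value o in column p -> triple (s, p, o)
            triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


people = [{"ID": 7, "fname": "Bob", "addr": None}]
triples = direct_map("People", "ID", people)
```

The single row above yields one rdf:type triple plus one triple per non-NULL cell; mapping the result to an existing domain ontology is what the more complex mapping languages add on top.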
104. Extraction from Natural Language Sources
80% of the information in business documents is in unstructured
natural language [1]
(-) Increased complexity and decreased quality of extraction
(+) Potential for a massive acquisition of extracted knowledge
Traditional Information Extraction (IE)
Recognize and categorise elements in text
Techniques: Named Entity Recognition (NER), Coreference Resolution
(CO), . . .
Ontology Learning (OL) from Text
Learn whole ontologies from natural language text
Usually extracted (semi-)automatically
[1] Wimalasuriya, Dou. Ontology-based information extraction: [. . . ] Journal of Information Science
105. LinkedGeoData + Sparqlify
Example: LinkedGeoData Knowledge Extraction Project using Sparqlify
Structure
Motivation
OpenStreetMap
LGD Architecture
Mapping
Access (How LinkedGeoData is published)
Use Cases
106. Motivation
Ease information integration tasks that require spatial knowledge, such as
Offerings of bakeries next door
Map of distributed branches of a company
Historical sights along a bicycle track
LOD cloud contains data sets with spatial features
e.g. Geonames, DBpedia, US census, EuroStat
But: they are restricted to popular or large entities like countries,
famous places etc. or specific regions
Therefore they lack buildings, roads, mailboxes, etc.
107. OpenStreetMap - Datamodel
Basic entities are:
Nodes: Latitude, Longitude.
Ways: Sequence of nodes.
Relations: Associations between any number of nodes, ways and
relations. Every member in a relation plays a certain role.
Each entity may be described with tags (= key-value pairs)
A way is closed if the ID of the last referenced node equals that of the
first one.
Whether a closed way denotes a linear ring or a polygon (i.e. whether
the enclosed area is part of the respective OSM entity) depends on the
tags.
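The closedness rule for ways can be written down directly. The sketch below assumes a way is given as a plain list of node IDs; the function name `is_closed` and the requirement of at least two node references are assumptions of this example, not part of the OSM data model description above.

```python
def is_closed(way_node_ids):
    """A way is closed if its last referenced node ID equals its first one."""
    # Assumption: a way with fewer than two node references is not closed.
    return len(way_node_ids) > 1 and way_node_ids[0] == way_node_ids[-1]

print(is_closed([10, 11, 12, 10]))  # first and last node ID are equal
print(is_closed([10, 11, 12]))      # open way
```

Whether such a closed way is then treated as a linear ring or a polygon still has to be decided from its tags.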
112. Tag Mappings
Key-value pairs will be assigned to RDF resources
Each pair (k, v) can be annotated with datatypes, language tags, classes
Mappings are themselves tables
Example table: lgd_map_literal
k          property       lang
name       rdfs:label
name:en    rdfs:label     en
alt_label  skos:altLabel
note       rdfs:comment
...        ...            ...
113. View Definition
RDF mapping of the data from a PostgreSQL database
Create View lgd_nodes As
Construct {
?n a lgdm:Node .
?n geom:geometry ?g .
?g ogc:asWKT ?o .
}
With
?n = uri(lgd:node, ?id)
?g = uri(lgd-geom:node, ?id)
?o = typedLiteral(?geom, ogc:wktLiteral)
From
nodes
114. Sparqlify
SPARQL-SQL Rewriter
Rewrites SPARQL queries according to the view definition
Platform module offers SPARQL endpoint and Linked Data interface
https://github.com/AKSW/Sparqlify
115. Rest-API
Offers REST methods for frequent queries
Based on SPARQL (Virtuoso) endpoint
116. Downloads
RDF dataset for download
Generated using Construct { ?s ?p ?o }
http://downloads.linkedgeodata.org
117. Ontology
Enriched classes and properties with multilingual labels from TranslateWiki
http://translatewiki.net
Imported icons for 90 classes from the freely available icon collection
of the SJJB Management
http://www.sjjb.co.uk/mapicons/
118. SML Mapping Examples
The following slides demonstrate how to map relational data to RDF
with the Sparqlification Mapping Language (SML).
The following prefixes are used:
Prefix     IRI
rdfs       http://www.w3.org/2000/01/rdf-schema#
ogc        http://www.opengis.net/ont/geosparql#
geom       http://geovocab.org/geometry#
lgd        http://linkedgeodata.org/triplify/
lgd-geom   http://linkedgeodata.org/geometry/
119. SML - Mapping Example I: The Goal (1/4)
How to map tables to RDF?
How to introduce the distinction between feature and geometry that is
commonly used in GIS?
Input Table: nodes
id  geom
1   POINT(0 0)
2   POINT(1 1)
Target RDF Output
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
lgd:node1 geom:geometry lgd-geom:node1 .
lgd:node2 geom:geometry lgd-geom:node2 .
lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .
lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .
120. SML - Mapping Example I: SML Syntax Outline (2/4)
Input Table: nodes
id  geom
1   POINT(0 0)
2   POINT(1 1)
Create View myNodesView As
Construct {
...
}
With
...
From
...
Target RDF Output
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
lgd:node1 geom:geometry lgd-geom:node1 .
lgd:node2 geom:geometry lgd-geom:node2 .
lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .
lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .
121. SML - Mapping Example I: Construct and From (3/4)
Input Table: nodes
id  geom
1   POINT(0 0)
2   POINT(1 1)
Create View myNodesView As
Construct {
?n geom:geometry ?g .
?g ogc:asWKT ?o
}
With
...
From
nodes
Target RDF Output
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
lgd:node1 geom:geometry lgd-geom:node1 .
lgd:node2 geom:geometry lgd-geom:node2 .
lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .
lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .
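The remaining With clause binds the variables ?n, ?g and ?o to RDF terms built from the table columns, in the same way as the lgd_nodes view shown earlier (uri(lgd:node, ?id), uri(lgd-geom:node, ?id), typedLiteral(?geom, ogc:wktLiteral)). The Python sketch below only emulates the effect of such a view on the two example rows. The helpers `uri`, `typed_literal` and `nodes_view` are hypothetical names for this illustration and are not part of Sparqlify.

```python
# Emulates what the SML view produces for the example "nodes" table.
# Helper names are hypothetical; they mimic the SML term constructors.
PREFIXES = {
    "lgd": "http://linkedgeodata.org/triplify/",
    "lgd-geom": "http://linkedgeodata.org/geometry/",
    "geom": "http://geovocab.org/geometry#",
    "ogc": "http://www.opengis.net/ont/geosparql#",
}

def uri(prefix, local, suffix=""):
    """Concatenate namespace, local name and an optional column value."""
    return f"<{PREFIXES[prefix]}{local}{suffix}>"

def typed_literal(value, datatype_iri):
    """Build a typed literal in N-Triples syntax."""
    return f'"{value}"^^{datatype_iri}'

def nodes_view(rows):
    """Instantiate the Construct template once per row of the table."""
    lines = []
    for row in rows:
        n = uri("lgd", "node", row["id"])        # ?n = uri(lgd:node, ?id)
        g = uri("lgd-geom", "node", row["id"])   # ?g = uri(lgd-geom:node, ?id)
        o = typed_literal(row["geom"], uri("ogc", "wktLiteral"))
        lines.append(f"{n} {uri('geom', 'geometry')} {g} .")
        lines.append(f"{g} {uri('ogc', 'asWKT')} {o} .")
    return lines

for line in nodes_view([{"id": 1, "geom": "POINT(0 0)"},
                        {"id": 2, "geom": "POINT(1 1)"}]):
    print(line)
```

Unlike this eager conversion, Sparqlify keeps the data in the relational database and uses the view definition to rewrite incoming SPARQL queries to SQL.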
123. SML Mapping Examples
A more complex example, which demonstrates the use of an SQL
mapping table and an SQL helper view.
124. SML - Mapping Example II: The Goal (1/8)
Input Table: node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig
Target RDF Output
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lgd: <http://linkedgeodata.org/triplify/> .
lgd:node1 rdfs:label "Universitaet Leipzig" .
lgd:node1 rdfs:label "University of Leipzig"@en .
125. SML - Mapping Example II: Source Data (2/8)
OSM Table: node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig
126. SML - Mapping Example II: Mapping Table (3/8)
OSM Table: node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig
RDF Mapping Table: lgd_map_literal
k          property       lang
name       rdfs:label
name:en    rdfs:label     en
alt_label  skos:altLabel
note       rdfs:comment
...        ...            ...
127. SML - Mapping Example II: Helper View (4/8)
OSM Table: node_tags
id  k            v
1   name         Universitaet Leipzig
1   name:en      University of Leipzig
1   amenity      university
1   addr:street  Augustusplatz
1   addr:city    Leipzig
RDF Mapping Table: lgd_map_literal
k          property       lang
name       rdfs:label
name:en    rdfs:label     en
alt_label  skos:altLabel
note       rdfs:comment
...        ...            ...
Helper View: lgd_node_tags_literal
id   property    v                      lang
1    rdfs:label  Universitaet Leipzig
1    rdfs:label  University of Leipzig  en
...  ...         ...                    ...
SELECT id, property, v, lang FROM node_tags, lgd_map_literal
WHERE node_tags.k = lgd_map_literal.k
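The helper view can be tried out with an in-memory SQLite database. The table and column names follow the slides; storing an empty language as NULL, the use of SQLite instead of PostgreSQL and the `ORDER BY v` clause (added only to make the output order deterministic) are assumptions of this sketch.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database used by LinkedGeoData.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node_tags (id INTEGER, k TEXT, v TEXT);
CREATE TABLE lgd_map_literal (k TEXT, property TEXT, lang TEXT);
INSERT INTO node_tags VALUES
  (1, 'name', 'Universitaet Leipzig'),
  (1, 'name:en', 'University of Leipzig'),
  (1, 'amenity', 'university'),
  (1, 'addr:street', 'Augustusplatz'),
  (1, 'addr:city', 'Leipzig');
INSERT INTO lgd_map_literal VALUES
  ('name', 'rdfs:label', NULL),
  ('name:en', 'rdfs:label', 'en'),
  ('alt_label', 'skos:altLabel', NULL),
  ('note', 'rdfs:comment', NULL);
-- The helper view joins each tag with its mapping entry.
CREATE VIEW lgd_node_tags_literal AS
  SELECT id, property, v, lang FROM node_tags, lgd_map_literal
  WHERE node_tags.k = lgd_map_literal.k;
""")

rows = conn.execute(
    "SELECT id, property, v, lang FROM lgd_node_tags_literal ORDER BY v"
).fetchall()
for row in rows:
    print(row)
```

Only the tags that have an entry in the mapping table survive the join; amenity and the address tags are dropped because lgd_map_literal covers literal-valued labels only.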
128. SML - Mapping Example II: SML View (5/8)
Logical Table: lgd_node_tags_literal
id   property    v            lang
1    rdfs:label  Univ. L.
1    rdfs:label  Univ. of L.  en
...  ...         ...          ...
SML View
Create View lgd_node_tags_text As
Construct {
129. SML - Mapping Example II: SML View (6/8)
Logical Table: lgd_node_tags_literal
id   property    v            lang
1    rdfs:label  Univ. L.
1    rdfs:label  Univ. of L.  en
...  ...         ...          ...
SML View
Create View lgd_node_tags_text As
Construct {
?s ?p ?o .
}
With
...
From
lgd_node_tags_literal
130. SML - Mapping Example II: SML View (7/8)
Logical Table: lgd_node_tags_literal
id   property    v            lang
1    rdfs:label  Univ. L.
1    rdfs:label  Univ. of L.  en
...  ...         ...          ...
SML View
Create View lgd_node_tags_text As
Construct {
?s ?p ?o .
}
With
?s = uri(lgd:node, ?id)
?p = uri(?property)
?o = plainLiteral(?v, ?lang)
From
lgd_node_tags_literal
131. SML - Mapping Example II: SML View (8/8)
Logical Table: lgd_node_tags_literal
id   property    v            lang
1    rdfs:label  Univ. L.
1    rdfs:label  Univ. of L.  en
...  ...         ...          ...
+
SML View
Create View lgd_node_tags_text As
Construct {
?s ?p ?o .
}
With
?s = uri(lgd:node, ?id)
?p = uri(?property)
?o = plainLiteral(?v, ?lang)
From
lgd_node_tags_literal
Resulting RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lgd: <http://linkedgeodata.org/triplify/> .
lgd:node1 rdfs:label "Universitaet Leipzig" .
lgd:node1 rdfs:label "University of Leipzig"@en .
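How the view turns logical-table rows into the resulting triples can be emulated in Python. The functions `plain_literal` and `tags_view` are hypothetical helpers for this sketch that imitate the effect of plainLiteral(?v, ?lang); they are not Sparqlify functions, and a missing language is assumed to be represented as None.

```python
def plain_literal(value, lang=None):
    """Mimic plainLiteral(?v, ?lang): add a language tag only if one is given."""
    return f'"{value}"@{lang}' if lang else f'"{value}"'

def tags_view(rows):
    """Instantiate the template ?s ?p ?o once per logical-table row."""
    return [f"lgd:node{node_id} {prop} {plain_literal(v, lang)} ."
            for node_id, prop, v, lang in rows]

logical_table = [
    (1, "rdfs:label", "Universitaet Leipzig", None),
    (1, "rdfs:label", "University of Leipzig", "en"),
]
result = tags_view(logical_table)
for line in result:
    print(line)
```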
135. Statistics (15 August 2013)
Complete OSM planet file corresponds to ∼ 20.000.000.000 triples
Virtual access via Sparqlify
Downloads limited to selected classes.
292.780.188 triples
153.613.243 triples of Nodes
139.166.945 triples of Ways
Relations not yet available for download
Among them
532.812 PlaceOfWorship
82.788 RailwayStation
72.091 Toilets
71.613 Town
19.937 City
136. Access
Materialized SPARQL Endpoint (based on Virtuoso DB, download
datasets loaded)
http://linkedgeodata.org/sparql
http://linkedgeodata.org/snorql
Virtual SPARQL Endpoint (based on Sparqlify, access to 20B triples,
limited SPARQL 1.0 support)
http://linkedgeodata.org/vsparql
http://linkedgeodata.org/vsnorql
REST Interface (based on the Virtual SPARQL Endpoint)
Supports limited queries (e.g. circular/rectangular area, filtering by
labels)
Downloads
http://downloads.linkedgeodata.org
Monthly updates on the above datasets envisioned
137. Use Cases Augmented Reality
138. Use Cases Generic Browsing
140. Outline
1 Introduction to Linked Data
2 Linked Dataset Example: DBpedia
3 Linked Data Life-Cycle Overview
4 Knowledge Extraction
5 Data Integration / Linking
6 Enrichment
7 Repair
8 Knowledge Base Exploration / Querying
[Figure: Linked Data Lifecycle with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]
141. Why Link Discovery?
1 Fourth Linked Data principle
2 Links are central for
Cross-ontology QA
Data Integration
Reasoning
Federated Queries
...
3 2011 topology of the LOD Cloud:
31+ billion triples
≈ 0.5 billion links
owl:sameAs in most cases
142. Why is it difficult?
1 Time complexity
Large number of triples
Quadratic a-priori runtime
69 days for mapping cities from DBpedia to Geonames (1ms per comparison)
decades for linking DBpedia and LGD
...
Definition (Link Discovery)
Given sets S and T of resources and relation R
Task: Find $M = \{(s, t) \in S \times T : R(s, t)\}$
Common approaches:
Find $M = \{(s, t) \in S \times T : \sigma(s, t) \geq \theta\}$
Find $M = \{(s, t) \in S \times T : \delta(s, t) \leq \theta\}$
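The quadratic a-priori runtime comes from comparing every element of S with every element of T. The sketch below shows this naive baseline for the similarity-threshold variant of the definition. The toy similarity function `sim` (Jaccard overlap of character sets) and the example labels are assumptions chosen only to keep the example short. The last lines translate the quoted 69 days at 1 ms per comparison back into a number of comparisons.

```python
from itertools import product

def brute_force_links(S, T, sigma, theta):
    """Return M = {(s, t) : sigma(s, t) >= theta} by comparing all |S||T| pairs."""
    return [(s, t) for s, t in product(S, T) if sigma(s, t) >= theta]

def sim(a, b):
    """Toy similarity (an assumption): Jaccard overlap of the character sets."""
    chars_a, chars_b = set(a.lower()), set(b.lower())
    return len(chars_a & chars_b) / len(chars_a | chars_b)

links = brute_force_links(["Leipzig", "Ulm"], ["leipzig", "Hamburg"], sim, 0.9)
print(links)

# 69 days at 1 ms per comparison correspond to this many comparisons:
comparisons = 69 * 24 * 60 * 60 * 1000
print(comparisons)
```

Every approach discussed next tries to avoid most of these |S||T| comparisons without losing links.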
143. Why is it difficult?
2 Complexity of specifications
Combination of several attributes required for high precision
Tedious discovery of most adequate mapping
Dataset-dependent similarity functions
146. Runtime Optimization
Reduce the number of comparisons $C(A) \geq |M|$ (assuming we need all σ/θ values for links)
Maximize reduction ratio: $RR(A) = 1 - \frac{C(A)}{|S||T|}$
Question
Can we devise lossless approaches with guaranteed RR?
Advantages
Space management
Runtime prediction
Resource scheduling
149. RR Guarantee
Best achievable reduction ratio: $RR_{max} = 1 - \frac{|M|}{|S||T|}$
Approach $H(\alpha)$ fulfills the RR guarantee criterion, iff:
$\forall r < RR_{max}, \exists \alpha : RR(H(\alpha)) \geq r$
Here, we use the relative reduction ratio (RRR):
$RRR(A) = \frac{RR_{max}}{RR(A)}$
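The three ratios can be computed directly from the number of comparisons C(A), the number of matches |M| and the sizes of S and T. The function names and the example numbers below are assumptions for illustration; the formulas are RR(A) = 1 - C(A)/(|S||T|), RRmax = 1 - |M|/(|S||T|) and RRR(A) = RRmax/RR(A).

```python
def rr(comparisons, size_s, size_t):
    """Reduction ratio: share of the |S||T| comparisons that was avoided."""
    return 1 - comparisons / (size_s * size_t)

def rr_max(matches, size_s, size_t):
    """Best achievable reduction ratio: only the |M| true links are compared."""
    return 1 - matches / (size_s * size_t)

def rrr(comparisons, matches, size_s, size_t):
    """Relative reduction ratio; 1.0 is optimal, larger values are worse."""
    return rr_max(matches, size_s, size_t) / rr(comparisons, size_s, size_t)

# Hypothetical run: 1000 x 1000 resources, 1000 links, 10000 comparisons.
print(rr(10_000, 1000, 1000))
print(rr_max(1000, 1000, 1000))
print(rrr(10_000, 1000, 1000, 1000))
```

Since C(A) is at least |M| for a lossless approach, RRR(A) is never below 1, which is why the formal goal is to get arbitrarily close to 1 from above.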
150. Goal
Formal Goal
Devise $H(\alpha)$ : $\forall r > 1, \exists \alpha : RRR(H(\alpha)) \leq r$
151. Restrictions
Minkowski Distance
$\delta(s, t) = \sqrt[p]{\sum_{i=1}^{n} |s_i - t_i|^p}, \quad p \geq 2$
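For reference, the Minkowski distance of order p is the p-th root of the sum of the p-th powers of the coordinate differences. A minimal sketch, with the function name `minkowski` chosen for this example:

```python
def minkowski(s, t, p=2):
    """Minkowski distance of order p between two points of equal dimension."""
    if len(s) != len(t):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(si - ti) ** p for si, ti in zip(s, t)) ** (1 / p)

print(minkowski((0, 0), (3, 4)))        # p = 2 is the Euclidean distance
print(minkowski((0, 0), (3, 4), p=3))
```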
152. Space Tiling
HYPPO
$\delta(s, t) \leq \theta$ describes a hypersphere
Approximate hypersphere by using a hypercube
Easy to compute
No loss of recall (blocking)
155. Space Tiling
Set width of single hypercube to $\Delta = \theta/\alpha$
Tile $\Omega = S \cup T$ into the adjacent cubes $C$
Coordinates: $(c_1, \ldots, c_n) \in \mathbb{N}^n$
Contains points $\omega \in \Omega : \forall i \in \{1 \ldots n\}, c_i \Delta \leq \omega_i < (c_i + 1)\Delta$
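The tiling step amounts to integer division of every coordinate by the cube width. In the sketch below, the function names `cube_of` and `tile` and the example points are assumptions; points with negative coordinates would yield negative cube coordinates, which the definition with natural numbers does not cover.

```python
from collections import defaultdict
from math import floor

def cube_of(point, delta):
    """Coordinates (c_1, ..., c_n) with c_i * delta <= point_i < (c_i + 1) * delta."""
    return tuple(floor(x / delta) for x in point)

def tile(points, delta):
    """Group all points of S and T by the hypercube that contains them."""
    cubes = defaultdict(list)
    for point in points:
        cubes[cube_of(point, delta)].append(point)
    return dict(cubes)

theta, alpha = 1.0, 2
delta = theta / alpha                       # cube width
grid = tile([(1.2, 0.4), (1.3, 0.1), (3.0, 3.0)], delta)
for cube, members in grid.items():
    print(cube, members)
```

A larger α gives smaller cubes and therefore a finer approximation of the hypersphere, at the cost of more cubes to manage.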
156. HYPPO
Combine $(2\alpha + 1)^n$ hypercubes around $C(\omega)$ to approximate the hypersphere
$RRR(HYPPO(\alpha)) = \frac{(2\alpha + 1)^n}{\alpha^n S(n)}$
$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{2^n}{S(n)}$
157. HYPPO
[Figure: RRR(HYPPO) for p = 2, n = 2, 3, 4 and 2 ≤ α ≤ 50]
159. HR3 : Idea
$index(C, \omega) = \begin{cases} 0 & \text{if } \exists i : |c_i - c(\omega)_i| \leq 1,\ 1 \leq i \leq n, \\ \sum_{i=1}^{n} (|c_i - c(\omega)_i| - 1)^p & \text{else.} \end{cases}$
160. HR3 : Idea
Compare $C(\omega)$ with $C$ iff $index(C, \omega) \leq \alpha^p$
[Figure: example for α = 4, p = 2]
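The index and the resulting filter can be sketched as follows. This is a simplified reading of the slides, not the LIMES implementation: cube coordinates are integer tuples, the candidates are first restricted to the (2α + 1)^n cubes that HYPPO would consider, and the function names are assumptions of this example.

```python
from itertools import product

def hr3_index(cube, ref, p=2):
    """index(C, omega): 0 if some coordinate differs by at most 1 from the
    reference cube, otherwise the sum of (|c_i - c(omega)_i| - 1)^p."""
    diffs = [abs(c - r) for c, r in zip(cube, ref)]
    if any(d <= 1 for d in diffs):
        return 0
    return sum((d - 1) ** p for d in diffs)

def hr3_candidates(ref, alpha, p=2):
    """Cubes around `ref` that still have to be compared: index <= alpha^p."""
    offsets = range(-alpha, alpha + 1)
    cubes = (tuple(r + d for r, d in zip(ref, off))
             for off in product(offsets, repeat=len(ref)))
    return [c for c in cubes if hr3_index(c, ref, p) <= alpha ** p]

# Setting of the example figure: alpha = 4, p = 2, two dimensions.
kept = hr3_candidates((0, 0), alpha=4)
print(len(kept), "of", (2 * 4 + 1) ** 2, "cubes are compared")
```

In this two-dimensional setting only the four corner cubes are discarded; the saving relative to HYPPO grows with the number of dimensions and with α.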
161. HR3 : Idea
Lemma
$\forall s \in S : index(C, s) > \alpha^p$ implies that all $t \in C$ are non-matches
Claims
No loss of recall
$\lim_{\alpha \to \infty} RRR(HR^3(\alpha)) = 1$
166. HR3 : Idea
Theorem
$\lim_{\alpha \to \infty} RRR(HR^3(\alpha)) = 1$
Claims
No loss of recall
$\lim_{\alpha \to \infty} RRR(HR^3(\alpha)) = 1$
167. HR3 : Experiments
Compare HR3 with LIMES 0.5's HYPPO and SILK 2.5.1
Experimental Setup:
Deduplicating DBpedia places by minimum elevation, elevation and
maximum elevation (θ = 49m, 99m).
Geonames and LinkedGeoData by longitude and latitude (θ = 1°, 9°)
64-bit computer with a 2.8GHz i7 processor with 8GB RAM.
172. HR3 : Summary
Mission
New category of algorithms for link discovery
Presented HR3
Link discovery in affine spaces with Minkowski measures
Outperforms the state of the art (runtime, comparisons)
Optimal reduction ratio
Integrated in LIMES
181. Learning Complex Specifications
Insight
Choice of right example is key for learning
So far, only use of informativeness
Question
Can we do better by using more information?
Higher F-measure
Often slower
182. Basic Idea
Use similarity of link candidates when selecting most informative
examples (intra + inter class similarity)
185. Similarity of Candidates
Link candidate $x = (s, t)$ can be regarded as vector $(\sigma_1(x), \ldots, \sigma_n(x)) \in [0, 1]^n$.
Similarity of link candidates x and y:
$sim(x, y) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} (\sigma_i(x) - \sigma_i(y))^2}}$   (1)
Allows exploiting both intra- and inter-class similarity
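A small sketch of the candidate similarity, assuming the Euclidean form in which the similarity is 1 divided by 1 plus the Euclidean distance between the two similarity vectors. The function name and the example vectors are assumptions of this illustration.

```python
from math import sqrt

def candidate_similarity(x, y):
    """sim(x, y) = 1 / (1 + Euclidean distance of the similarity vectors)."""
    return 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))

# Each link candidate is represented by its similarity values (sigma_1, ..., sigma_n).
x = (0.9, 0.8)
y = (0.9, 0.8)
z = (0.1, 0.2)
print(candidate_similarity(x, y))  # identical vectors are maximally similar
print(candidate_similarity(x, z))
```

Because the vectors lie in [0, 1]^n, the value is always in (0, 1] and equals 1 exactly for identical vectors.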
186. Graph Clustering
Rationale: Use intra-class similarity
Approach
Cluster elements of S+ and S− independently
Choose one element per cluster as representative
Present oracle with most informative representatives
[Figure: weighted similarity graphs for S+ and S−]
187. BorderFlow
$G = (V, E, \omega)$ with $V = S^+$ or $V = S^-$
$\omega(x, y) = sim(x, y)$
Keep best ec edges for each $x \in V$
193. Conclusion
Can be combined with arbitrary active learning ML algorithms
Was experimentally combined with EAGLE (genetic programming) and
RAVEN (linear classifier) and shown to outperform the plain
informativeness function in terms of F-measure
Choice of example important to minimise user effort
Contact me for detailed experimental results
Longer runtimes (up to 2×)
194. Summary
Linking is a crucial task in the Web of Data
Two key problems
1 Efficient execution of link specifications
2 Creation of link specifications
Presented HR3 to handle the first problem
Presented COALA as building block for the second problem
195. Outline
1 Introduction to Linked Data
2 Linked Dataset Example: DBpedia
3 Linked Data Life-Cycle Overview
4 Knowledge Extraction
5 Data Integration / Linking
6 Enrichment
7 Repair
8 Knowledge Base Exploration / Querying
[Figure: Linked Data Lifecycle with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]
196. Motivation
rise in the availability and usage of knowledge bases
still a lack of knowledge bases that consist of sophisticated schema
information and instance data adhering to this schema
e.g. in the life sciences several knowledge bases
only consist of schema information
to a large extent, a collection of facts without a clear structure
(e.g. information extracted from databases)
combination of sophisticated schema and instance data would allow
powerful reasoning, consistency checking, and improved querying
→ create schemata based on existing data
197. Example
dbr:Brad_Pitt :birthPlace dbr:Shawnee,_Oklahoma ;
a :Person .
dbr:Angela_Merkel :birthPlace dbr:Hamburg ;
a :Person .
dbr:Albert_Einstein :birthPlace dbr:Ulm ;
a :Person .
dbr:Shawnee,_Oklahoma a :Place .
dbr:Hamburg a :Place .
dbr:Ulm a :Place .
Suggestions:
ObjectProperty: birthPlace
Characteristics: Functional
Domain: Person
Range: Place
SubPropertyOf: hasBeenAt
198. Benefits of an expressive schema
Axioms serve as documentation for the purpose and correct usage of
schema elements
Additional implicit information can be inferred
Improved query optimisation
Improve/allow the application of schema debugging techniques
199. Each person was only born at one place?!
205. birthPlace is functional
[Figure: graph with two birthPlace edges]
SELECT ?s WHERE {
?s dbo:birthPlace ?o1 .
?s dbo:birthPlace ?o2 .
FILTER (?o1 != ?o2)
}
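The same check can be run over an in-memory list of triples: a subject violates the functionality of a property as soon as it has two distinct values for it. The function name and the sample triples below are assumptions for illustration and are not taken from DBpedia.

```python
from collections import defaultdict

def functional_violations(triples, prop):
    """Subjects with more than one distinct object for a functional property."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return sorted(s for s, objects in values.items() if len(objects) > 1)

# Hypothetical sample data in the style of the example.
triples = [
    ("dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"),
    ("dbr:Angela_Merkel", "dbo:birthPlace", "dbr:Hamburg"),
    ("ex:Some_Person", "dbo:birthPlace", "ex:Town_A"),
    ("ex:Some_Person", "dbo:birthPlace", "ex:Town_B"),
]
print(functional_violations(triples, "dbo:birthPlace"))
```

Not every hit is an error in the data: two different URIs may denote the same place, or one place may be contained in the other.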
206. Where was Julia Nannie Wallace born?
207. Julia Nannie Wallace was born in Lacrosse?
208. No, Julia Nannie Wallace was born in La Crosse!
217. birthPlace has range Place
[Figure: graph with birthPlace and rdf:type edges to City and Place]
SELECT ?s ?place WHERE {
?s dbo:birthPlace ?place .
?place rdf:type/rdfs:subClassOf* ?type1 .
?type2 rdfs:subClassOf* dbo:Place .
?type1 owl:disjointWith ?type2 .
}
birthPlace :range Place
Place disjointWith Sport
222. 3 Steps to get a schema
3-Phase Enrichment Learning Approach:
Input: Entity URI, Axiom Type, Knowledge Base (SPARQL Endpoint)
1. obtain schema information (only executed once per knowledge base)
Result: Background Knowledge
2. obtain axiom type and entity specific data (sample data if necessary)
Result: Background Knowledge + Relevant Instance Data
3. run machine learning algorithm
Result: List of Axiom Suggestions + Metadata
Iterate over all axiom types and schema entities for full enrichment
Components: SPARQL Endpoint, Reasoner (optional invocation), Learner (DL-Learner), Enrichment Ontology
226. Step 1 - Obtaining Schema Information
CONSTRUCT WHERE {
?sub rdfs:subClassOf ?sup .
}
ORDER BY DESC(?sub) LIMIT 1000 OFFSET 1000
dbo:Disease rdfs:subClassOf owl:Thing .
dbo:Book rdfs:subClassOf dbo:WrittenWork .
dbo:WrittenWork rdfs:subClassOf dbo:Work .
dbo:Work rdfs:subClassOf owl:Thing .
dbo:Philosopher rdfs:subClassOf dbo:Person .
dbo:Person rdfs:subClassOf dbo:Agent .
dbo:Agent rdfs:subClassOf owl:Thing .
dbo:Sport rdfs:subClassOf dbo:Activity .
dbo:Activity rdfs:subClassOf owl:Thing .
dbo:Fish rdfs:subClassOf dbo:Animal .
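The LIMIT and OFFSET clauses indicate that the schema is fetched page by page, which keeps each request small enough for public SPARQL endpoints. The sketch below only generates the query strings for consecutive pages; the function name, the default page size and the number of pages are assumptions, and no endpoint is contacted.

```python
def paged_queries(pattern, order_var, page_size=1000, pages=3):
    """Yield CONSTRUCT queries that walk through the result set page by page."""
    for page in range(pages):
        yield (f"CONSTRUCT WHERE {{ {pattern} }} "
               f"ORDER BY DESC({order_var}) "
               f"LIMIT {page_size} OFFSET {page * page_size}")

queries = list(paged_queries("?sub rdfs:subClassOf ?sup .", "?sub"))
for query in queries:
    print(query)
```

In practice the loop would stop as soon as a page comes back with fewer than `page_size` results; the ORDER BY clause is what makes the pages stable across requests.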
229. Step 2 - Obtain axiom type and entity specific data
SELECT ?type (COUNT(DISTINCT ?s) AS ?cnt) WHERE {
?s dbo:author ?o .
?s a ?type .
} GROUP BY ?type ORDER BY DESC(?cnt)
type                 cnt
owl:Thing            30284
dbo:Work             30284
schema:CreativeWork  30284
dbo:WrittenWork      25730
dbo:Book             24673
schema:Book          24673
dbo:TelevisionShow   2567
dbo:Play             1057
...                  ...
231. Step 2 - Obtain axiom type and entity specific data
CONSTRUCT WHERE {
?ind dbo:author ?o .
?ind a ?type .
}
ORDER BY DESC(?ind) LIMIT 1000 OFFSET 2000
...
dbpedia:The_Adventures_of_Tom_Sawyer dbo:author dbpedia:Mark_Twain ;
rdf:type dbo:Book .
dbpedia:The_Zombie_Survival_Guide dbo:author dbpedia:Max_Brooks ;
rdf:type dbo:WrittenWork .
dbpedia:Web_Therapy dbo:author dbpedia:Lisa_Kudrow ;
rdf:type dbo:Book .
...
235. Step 3 - Scoring
dbpedia:The_Adventures_of_Tom_Sawyer dbo:author dbpedia:Mark_Twain ;
rdf:type dbo:Book .
dbpedia:The_Zombie_Survival_Guide dbo:author dbpedia:Max_Brooks ;
rdf:type dbo:WrittenWork .
dbpedia:Web_Therapy dbo:author dbpedia:Lisa_Kudrow ;
rdf:type dbo:Book .
Without inference:
Score(Domain(dbo:author, dbo:Book)) = 2/3 ≈ 66.7%
Score(Domain(dbo:author, dbo:WrittenWork)) = 1/3 ≈ 33.3%
With the schema axiom
dbo:Book rdfs:subClassOf dbo:WrittenWork .
every dbo:Book is also a dbo:WrittenWork, so:
Score(Domain(dbo:author, dbo:WrittenWork)) = 3/3 = 100%
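The effect of the subclass axiom on the score can be reproduced with a few lines of Python. The data structures and function names are assumptions of this sketch (DL-Learner works against a SPARQL endpoint or a reasoner instead): the score of a domain axiom is the share of subjects of the property that are instances of the candidate class, optionally after following rdfs:subClassOf.

```python
def superclasses(cls, sub_class_of):
    """The class itself plus everything reachable via rdfs:subClassOf."""
    seen, stack = {cls}, [cls]
    while stack:
        for sup in sub_class_of.get(stack.pop(), ()):
            if sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen

def domain_score(types_by_subject, cls, sub_class_of=None):
    """Share of subjects whose asserted (or inferred) types include `cls`."""
    sub_class_of = sub_class_of or {}
    hits = sum(
        1 for types in types_by_subject.values()
        if any(cls in superclasses(t, sub_class_of) for t in types)
    )
    return hits / len(types_by_subject)

authored = {
    "dbpedia:The_Adventures_of_Tom_Sawyer": ["dbo:Book"],
    "dbpedia:The_Zombie_Survival_Guide": ["dbo:WrittenWork"],
    "dbpedia:Web_Therapy": ["dbo:Book"],
}
schema = {"dbo:Book": ["dbo:WrittenWork"]}

print(domain_score(authored, "dbo:Book"))                 # 2 of 3 subjects
print(domain_score(authored, "dbo:WrittenWork"))          # 1 of 3 without inference
print(domain_score(authored, "dbo:WrittenWork", schema))  # 3 of 3 with inference
```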
238. Step 3 - Scoring(2)
Problem: support for axiom in KB not taken into account
→ no difference between 3 out of 3 and 100 out of 100
Solution: Average of 95% confidence interval (Wald method)
$p' = \frac{s + 2}{m + 4}$, with $s$ = #success and $m$ = #total
$\left[ \max\left(0,\ p' - 1.96 \cdot \sqrt{\frac{p' \cdot (1 - p')}{m + 4}}\right),\ \min\left(1,\ p' + 1.96 \cdot \sqrt{\frac{p' \cdot (1 - p')}{m + 4}}\right) \right]$
In 95% of the intervals the true value is between ... and ...
Score(Domain(dbo:author, dbo:Book)) ≈ 57.3%
Score(Domain(dbo:author, dbo:WrittenWork)) ≈ 69.1%
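The interval-based score can be checked numerically. The sketch below is one reading of the formula: the constants 2 and 4 are treated as rounded values of z²/2 and z² for z = 1.96, because with the unrounded values the computation reproduces the two percentages above, whereas the rounded constants give roughly 57.1% and 69.0%. The function name and this interpretation are assumptions, not a description of the DL-Learner code.

```python
from math import sqrt

Z = 1.96  # normal quantile for a 95% confidence interval

def interval_score(success, total, z=Z):
    """Average of the clipped bounds of the adjusted Wald interval."""
    n_adj = total + z * z                 # approximately total + 4
    p = (success + z * z / 2) / n_adj     # approximately (success + 2) / (total + 4)
    half_width = z * sqrt(p * (1 - p) / n_adj)
    lower = max(0.0, p - half_width)
    upper = min(1.0, p + half_width)
    return (lower + upper) / 2

print(round(100 * interval_score(2, 3), 1))      # Domain(dbo:author, dbo:Book)
print(round(100 * interval_score(3, 3), 1))      # Domain(dbo:author, dbo:WrittenWork)
print(round(100 * interval_score(100, 100), 1))  # same ratio, far more support
```

The last line shows the intended effect: 100 out of 100 scores clearly higher than 3 out of 3, although both ratios are 100%.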
239. More Complex Axioms
Pattern Based Knowledge Base Enrichment, ISWC 2013
240. Outlook and Summary
Schema in the Linked Data Web often shallow
→ tools needed to support knowledge engineers
Showed some techniques for learning OWL axioms on large knowledge
bases available as SPARQL endpoints
More complex axioms require:
OWL-SPARQL rewriting or
Fragment extraction
Small- and medium-sized knowledge bases can be handled via
techniques from Inductive Logic Programming
All algorithms implemented in DL-Learner framework
(http://dl-learner.org)
241. Outline
1 Introduction to Linked Data
2 Linked Dataset Example: DBpedia
3 Linked Data Life-Cycle Overview
4 Knowledge Extraction
5 Data Integration / Linking
6 Enrichment
7 Repair
8 Knowledge Base Exploration / Querying
[Figure: Linked Data Lifecycle with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]
242. Motivation
increasing number of knowledge bases in the
Semantic Web (see e.g. LOD cloud)
maintenance of knowledge bases with
expressive semantics is challenging
243. (Automatically) Detectable Ontology Problems
Common problems:
Syntactic Problems
Structural Problems
Semantic Problems (focus of talk)
Task Based Problems:
Reasoning Related Problems
Linked Data Related Problems
244. Syntactic Problems
Syntactic errors are mainly violations of conventions of the language in
which the ontology is modelled.
Example (Validity of XML)
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/">
    <dc:title>World Wide Web Consortium</dc:title>
</rdf:RDF>
FatalError: The element type "rdf:Description" must be terminated by the
matching end-tag "</rdf:Description>". [Line = 7, Column = 3]
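A syntactic check of this kind can be sketched with Python's standard XML parser. The document string repeats the slide's example with the missing end tag; the helper name `well_formedness_error` is made up for illustration, and the parser's error wording differs from the validator message quoted above.

```python
import xml.etree.ElementTree as ET

# The slide's example: the closing </rdf:Description> tag is missing.
BROKEN = """<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/">
    <dc:title>World Wide Web Consortium</dc:title>
</rdf:RDF>"""

# Repaired variant: the missing end tag is added before </rdf:RDF>.
FIXED = BROKEN.replace("</rdf:RDF>", "  </rdf:Description>\n</rdf:RDF>")


def well_formedness_error(document):
    """Return the parser's error message, or None if the XML is well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        return str(err)
    return None


print(well_formedness_error(BROKEN))  # a parser error message
print(well_formedness_error(FIXED))   # None
```

This only tests XML well-formedness; whether the parsed document is meaningful RDF is a separate question.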
245. Structural Problems
Problems in the taxonomy
Example (Circularities)
A ⊑ B, B ⊑ C, C ⊑ A
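Circularities in a taxonomy can be found with a depth-first search over the subclass graph. The following is a minimal sketch, assuming the taxonomy is given as a dictionary from each class to its direct superclasses; `find_cycle` is a hypothetical helper, not part of any tool mentioned in the slides.

```python
def find_cycle(subclass_of):
    """Return one cycle in the subclass graph as a list of classes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    colour = {}
    path = []

    def visit(cls):
        colour[cls] = GREY
        path.append(cls)
        for parent in subclass_of.get(cls, ()):
            state = colour.get(parent, WHITE)
            if state == GREY:      # parent is on the current path: cycle
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        colour[cls] = BLACK
        return None

    for cls in list(subclass_of):
        if colour.get(cls, WHITE) == WHITE:
            cycle = visit(cls)
            if cycle:
                return cycle
    return None


print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))  # ['A', 'B', 'C', 'A']
print(find_cycle({"A": ["B"], "B": ["C"]}))              # None
```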
246. Reasoning Related Problems
Problems which negatively affect the performance of reasoning over
expressive knowledge bases
Example (A named concept is equivalent to an AllValues restriction)
A ≡ ∀r.C
Reasoning complexity:
A universal restriction does not require a property value to exist; it only
restricts the values of existing property values
Any concept B whose instances cannot have r-fillers satisfies the
restriction, i.e. B ⊑ ∀r.C, and becomes a subclass of A
Typically leads to unintended inferences, and the additional inferences may
eventually slow down reasoning performance
Can be checked via Pellint (part of Pellet)
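The vacuous-truth effect behind this problem can be illustrated with a tiny closed-world check over explicit facts. This is only an illustration under that assumption: OWL reasoning is open-world, and the names `satisfies_all_values_from`, `role_assertions` and `members` are invented for the example.

```python
def satisfies_all_values_from(individual, role, concept, role_assertions, members):
    """True if every `role`-filler of `individual` belongs to `concept`."""
    fillers = role_assertions.get((individual, role), set())
    return all(filler in members[concept] for filler in fillers)


role_assertions = {("x", "r"): {"y"}}  # x has the r-filler y, z has none
members = {"C": set()}                 # y is not an instance of C

print(satisfies_all_values_from("x", "r", "C", role_assertions, members))  # False
print(satisfies_all_values_from("z", "r", "C", role_assertions, members))  # True
```

The individual z has no r-fillers at all, so it satisfies ∀r.C trivially and would be classified under A, which is rarely what the modeller intended.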
247. Linked Data Related Problems
Problems which are specific to publishing RDF using the Linked Data
principles
Incorrect implementation of content negotiation
Mixing up information and non-information resources
248. Semantic Problems
Logical contradictions in the underlying knowledge base
Example (Unsatisfiable classes)
O = {A ⊑ B ⊓ C, C ⊑ ¬B} |= A ⊑ ⊥
Example (Inconsistent ontology)
O = {A ⊑ B ⊓ C, C ⊑ ¬B, A(x)} |= ⊥
Usually handled by Ontology Debugging
252. Ontology Debugging
Problem: We have undesirable entailments
Solution:
Repair (Delete/Modify) responsible axioms
Question: Which axioms?
253. Justification
Justification
For an ontology O and an entailment η where O |= η, a set of axioms J is
a justification for η in O if J ⊆ O, J |= η, and if J′ ⊂ J then J′ ⊭ η.
Minimal subsets of an ontology that are sufficient for a given
entailment to hold
Synonyms: MUPS (Minimal Unsatisfiability Preserving Sub-TBoxes),
MinAs (Minimal Axiom sets), Kernels
Observations:
there can be multiple justifications for a single entailment
an axiom can be part of multiple justifications
258. Justification - Example
O = {
(1) B ⊑ ∃r.D
(2) B ⊑ ∀r.¬D
(3) A ⊑ B ⊓ C
(4) B ⊑ ¬C
(5) A ⊑ ¬E
(6) A ⊑ E ⊓ F
} |= A ⊑ ⊥
J1 = {1, 2, 3}
J2 = {5, 6}
J3 = {3, 4}
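The justifications of this example can be checked by brute force. The sketch below assumes the six axioms read (1) B ⊑ ∃r.D, (2) B ⊑ ∀r.¬D, (3) A ⊑ B ⊓ C, (4) B ⊑ ¬C, (5) A ⊑ ¬E, (6) A ⊑ E ⊓ F. It replaces a DL reasoner by a propositional truth-table test, which is adequate for this small example only: ∃r.D is treated as an opaque atom X, and ∀r.¬D as its negation. All names in the code are illustrative.

```python
from itertools import combinations, product

ATOMS = ["A", "B", "C", "E", "F", "X"]  # X stands for the concept ∃r.D

# Each axiom "L ⊑ R" becomes the implication L -> R for a single element.
AXIOMS = {
    1: lambda v: not v["B"] or v["X"],               # B ⊑ ∃r.D
    2: lambda v: not v["B"] or not v["X"],           # B ⊑ ∀r.¬D, i.e. ¬∃r.D
    3: lambda v: not v["A"] or (v["B"] and v["C"]),  # A ⊑ B ⊓ C
    4: lambda v: not v["B"] or not v["C"],           # B ⊑ ¬C
    5: lambda v: not v["A"] or not v["E"],           # A ⊑ ¬E
    6: lambda v: not v["A"] or (v["E"] and v["F"]),  # A ⊑ E ⊓ F
}


def unsatisfiable(cls, axiom_ids):
    """True if no assignment makes `cls` true while respecting the axioms."""
    for values in product([False, True], repeat=len(ATOMS)):
        v = dict(zip(ATOMS, values))
        if v[cls] and all(AXIOMS[i](v) for i in axiom_ids):
            return False
    return True


def justifications(cls):
    """All minimal axiom subsets that entail cls ⊑ ⊥, smallest first."""
    found = []
    for size in range(1, len(AXIOMS) + 1):
        for subset in combinations(sorted(AXIOMS), size):
            candidate = set(subset)
            if any(j <= candidate for j in found):
                continue  # contains a smaller justification: not minimal
            if unsatisfiable(cls, candidate):
                found.append(candidate)
    return found


print(justifications("A"))  # [{3, 4}, {5, 6}, {1, 2, 3}]
print(justifications("B"))  # [{1, 2}]
```

Enumerating all subsets is exponential in the number of axioms, which is why practical systems use the glass-box and black-box techniques instead.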
259. Justification Based Repair
For a repair, at least one axiom from every justification needs to be
removed.
For a repair plan, all justifications are needed.
260. Justification Algorithms
Single justification:
Glass Box: Modifying underlying reasoning algorithm (tableau tracing)
Black-Box: Using reasoner as oracle
All justifications:
Reiter's Hitting Set Tree Algorithm (HST)
261. Black-Box
Expansion-Contraction Strategy
Expansion: Add axioms to empty set until entailment holds
Contraction: Remove axioms from set such that set becomes minimal
and entailment still can be derived.
[Figure 3.1: A Depiction of a Black-Box Expand-Contract Strategy (key: axiom, axiom in justification, selected axiom)]
Source: M. Horridge: Justification Based Explanation in Ontologies (PhD Thesis)
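The expand-contract strategy can be sketched independently of any particular reasoner. In the sketch below, `entails` is an assumed oracle that says whether a set of axioms still yields the entailment; the toy oracle `toy_entails` is a hypothetical stand-in for a reasoner call and simply knows three justifications over axioms numbered 1 to 6.

```python
def single_justification(axioms, entails):
    """Expand-contract search for one justification, using `entails` as oracle."""
    if not entails(set(axioms)):
        return None  # the entailment does not hold in the whole ontology

    # Expansion: add axioms one by one until the entailment holds.
    candidate = []
    for axiom in axioms:
        candidate.append(axiom)
        if entails(set(candidate)):
            break

    # Contraction: drop every axiom that is not needed for the entailment.
    for axiom in list(candidate):
        rest = [a for a in candidate if a != axiom]
        if entails(set(rest)):
            candidate = rest
    return set(candidate)


# Hypothetical oracle: the entailment holds iff a known justification is present.
KNOWN = [{1, 2, 3}, {3, 4}, {5, 6}]


def toy_entails(axiom_set):
    return any(j <= axiom_set for j in KNOWN)


print(single_justification([1, 2, 3, 4, 5, 6], toy_entails))  # {1, 2, 3}
print(single_justification([6, 5, 4, 3, 2, 1], toy_entails))  # {5, 6}
```

Which justification is returned depends on the order in which axioms are added. Real implementations usually expand along the signature of the entailment and contract in larger chunks to save reasoner calls.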
262. Hitting Set Tree Algorithm
from field of Model Based Diagnosis
given a faulty system (ontology), it constructs a finite tree whose
nodes are labelled with conflict sets (justifications), and whose
edges are labelled with components (axioms)
finds all minimal hitting sets, which represent diagnoses for the
conflict sets in the system
diagnosis = repair
263. Hitting Set Tree Algorithm - Example
O = {A ⊑ B, B ⊑ D, A ⊑ ∃R.C, ∃R.⊤ ⊑ D} |= A ⊑ D
J1 = {A ⊑ B, B ⊑ D}
J2 = {A ⊑ ∃R.C, ∃R.⊤ ⊑ D}
[Figure 3.2: An Example of a Hitting Set Tree. The root is labelled J1; its two edges are labelled A ⊑ B and B ⊑ D and each leads to a node labelled J2, whose edges A ⊑ ∃R.C and ∃R.⊤ ⊑ D lead to leaves labelled {}]
Source: M. Horridge: Justification Based Explanation in Ontologies (PhD Thesis)
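The tree construction can be sketched as follows. This is a simplified version without Reiter's pruning and node-reuse optimisations; `find_justification` is an assumed black-box routine that returns one justification for the entailment in the given axiom set, or None if the entailment no longer holds. The toy version used here is hypothetical and just looks up the two justifications of the example, read as J1 = {A ⊑ B, B ⊑ D} and J2 = {A ⊑ ∃R.C, ∃R.⊤ ⊑ D}.

```python
def hitting_set_tree(ontology, find_justification):
    """Explore a hitting set tree; return (justifications, minimal repairs)."""
    justifications, hitting_sets = [], []
    stack = [frozenset()]   # each entry: axioms removed on the path so far
    visited = set()
    while stack:
        removed = stack.pop()
        if removed in visited:
            continue
        visited.add(removed)
        label = find_justification(ontology - removed)
        if label is None:
            hitting_sets.append(removed)  # entailment gone: path is a repair
            continue
        if label not in justifications:
            justifications.append(label)
        for axiom in label:               # one child per axiom of the label
            stack.append(removed | {axiom})
    minimal = [h for h in hitting_sets
               if not any(other < h for other in hitting_sets)]
    return justifications, minimal


J1 = frozenset({"A ⊑ B", "B ⊑ D"})
J2 = frozenset({"A ⊑ ∃R.C", "∃R.⊤ ⊑ D"})
ONTOLOGY = J1 | J2


def toy_find_justification(axioms):
    # Hypothetical stand-in for a black-box justification finder.
    for justification in (J1, J2):
        if justification <= axioms:
            return justification
    return None


justs, repairs = hitting_set_tree(ONTOLOGY, toy_find_justification)
print(len(justs), len(repairs))  # 2 4
```

Each of the four repairs removes one axiom from J1 and one from J2, matching the four leaves labelled {} in the figure.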
264. Justification Scenarios
A user can be faced with the following situations:
Small number of small justifications
Easy and pleasant to inspect
Small number of large justifications
Better than alternatives
Large number of justifications
Pretty hopeless with current mechanisms
Idea: Find source of unsatisfiability
265. Root Unsatisfiability - Definitions
A root UC is a class whose unsatisfiability does not depend on another
class, otherwise it is a derived UC.
A derived UC for which there is some justification that is not a strict
superset of a justification for another UC is a partial derived UC.
Root Unsatisfiable Class
A class A is a root unsatisfiable class if there is no justification
J |= A ⊑ ⊥ such that J is a strict superset of a justification for some other
unsatisfiable class.
266. Root Unsatisfiability - Approaches
Approaches:
1: compute all justifications for each unsatisfiable class and apply the
definition → computationally often too expensive
2: heuristics for structural analysis of axioms
Debugging Unsatisfiable Classes in OWL Ontologies, Kalyanpur, Parsia, Sirin, Hendler,
J. Web Sem, 2005.
274. Root Unsatisfiability - Example
O = {
(1) B ⊑ ∃r.D
(2) B ⊑ ∀r.¬D
(3) A ⊑ B ⊓ C
(4) B ⊑ ¬C
(5) A ⊑ ¬E
(6) A ⊑ E ⊓ F
}
|= A ⊑ ⊥ with J1 = {1, 2, 3}, J2 = {5, 6}, J3 = {3, 4}: partial (J4 ⊂ J1)
|= B ⊑ ⊥ with J4 = {1, 2}: root
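Once all justifications are known, the definitions can be applied mechanically (approach 1). The following sketch assumes the justifications are already computed and given as sets of axiom numbers; the function name `classify` and the label strings are made up for illustration.

```python
def classify(justifications):
    """Label each unsatisfiable class as root, partial derived or derived.

    `justifications` maps a class name to the list of its justifications,
    each given as a set of axiom identifiers.
    """
    result = {}
    for cls, own in justifications.items():
        others = [j for c, js in justifications.items() if c != cls for j in js]
        # A justification is dependent if it strictly contains a
        # justification for some other unsatisfiable class.
        dependent = [any(other < j for other in others) for j in own]
        if not any(dependent):
            result[cls] = "root"
        elif not all(dependent):
            result[cls] = "partial derived"
        else:
            result[cls] = "derived"
    return result


example = {
    "A": [{1, 2, 3}, {5, 6}, {3, 4}],  # J1, J2, J3
    "B": [{1, 2}],                     # J4
}
print(classify(example))  # {'A': 'partial derived', 'B': 'root'}
```

A is only partially derived: repairing B removes J1, but J2 and J3 make A unsatisfiable independently of B.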
275. Axiom Relevance
resolving a justification requires deleting or editing axioms
ranking methods highlight the most probable causes for problems
methods:
frequency
syntactic relevance
semantic relevance
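The frequency method can be illustrated in a few lines: an axiom is ranked higher the more justifications it occurs in. This sketch assumes the three justifications J1 = {1, 2, 3}, J2 = {5, 6} and J3 = {3, 4} for A ⊑ ⊥ from the earlier example; ties are broken by axiom number, which is an arbitrary choice made here.

```python
from collections import Counter

justifications = [{1, 2, 3}, {5, 6}, {3, 4}]  # J1, J2, J3 for A ⊑ ⊥

# Count in how many justifications each axiom occurs.
frequency = Counter(axiom for j in justifications for axiom in j)

# Most frequent axioms first; ties are ordered by axiom number.
ranking = sorted(frequency, key=lambda axiom: (-frequency[axiom], axiom))

print(ranking)       # [3, 1, 2, 4, 5, 6]
print(frequency[3])  # 2
```

Axiom 3 occurs in two justifications, so removing it resolves J1 and J3 at once, but J2 still requires removing axiom 5 or 6.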
276. Repair Consequences
after the repair process, axioms have been deleted or modified
→ desired entailments may be lost or new entailments obtained
→ user can decide to preserve them
(including inconsistencies!)
277. SPARQL Endpoint Support
Previously mentioned approaches are implemented in the ORE tool
(http://ore-tool.net)
ORE supports using SPARQL endpoints
implements an incremental load procedure
knowledge base is loaded in small chunks:
count number of axioms by type
priority based loading procedure
e.g. disjointness axioms have higher priority than class assertion axioms
uses Pellet incremental reasoning
Learning of OWL Class Descriptions on Very Large Knowledge Bases,
Hellmann, Lehmann, Auer, Int. Journal Semantic Web Inf. Syst, 2009
279. SPARQL Endpoint Support II
algorithm performs sanity checks, e.g. SPARQL queries which probe
for typical inconsistent axiom sets
can fetch additional Linked Data
different termination criteria
overall:
ORE allows applying state-of-the-art ontology debugging methods on a
larger scale than was possible previously
aims at stronger support for the web aspect of the Semantic Web
and the high popularity of the Web of Data initiative
280. DBpedia Live Demo
Inconsistency in DBpedia Live:
Individual: dbr:Purify_(album)
Facts: dbo:artist dbr:Axis_of_Advance
Individual: dbr:Axis_of_Advance
Types: dbo:Organisation
Class: dbo:Organisation
DisjointWith dbo:Person
ObjectProperty: dbo:artist
Range: dbo:Person
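The pattern behind this inconsistency (a property range that is disjoint with an asserted type of the object) can be probed with a simple check. The sketch below works on in-memory Python structures holding the four statements of the example and is not how ORE implements its sanity checks; it also ignores subclass reasoning. The function name `range_violations` is hypothetical.

```python
facts = [("dbr:Purify_(album)", "dbo:artist", "dbr:Axis_of_Advance")]
types = {"dbr:Axis_of_Advance": {"dbo:Organisation"}}
ranges = {"dbo:artist": "dbo:Person"}
disjoint = {frozenset({"dbo:Organisation", "dbo:Person"})}


def range_violations(facts, types, ranges, disjoint):
    """Yield facts whose object has a type disjoint with the property range."""
    for subject, prop, obj in facts:
        expected = ranges.get(prop)
        if expected is None:
            continue
        for asserted in types.get(obj, ()):
            if frozenset({asserted, expected}) in disjoint:
                yield (subject, prop, obj, asserted, expected)


for violation in range_violations(facts, types, ranges, disjoint):
    print(violation)
```

The single reported violation corresponds to the justification shown above: the range axiom makes dbr:Axis_of_Advance a dbo:Person, which contradicts its asserted type dbo:Organisation.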