1. Large-Scale Multilingual Knowledge Extraction,
Publishing and Quality Assessment:
The Case of DBpedia
Defense of the dissertation
for the attainment of the academic degree
Doktor-Ingenieur (Dr.-Ing.) in the field of Computer Science
Dimitrios Kontokostas, MSc
3. DBpedia
● Crowdsourced community effort
● Extract structured content from the information created in various
Wikimedia projects
● Assemble an open knowledge graph
● Allows extracted information to be collected, organised, shared, searched
and utilised
http://wiki.dbpedia.org/about
Lehmann, Isele, Jakob, Jentzsch, Kontokostas, Mendes, Hellmann, Morsey, Kleef, Auer, and Bizer (2015)
5. DBpedia Impact: Broad Outreach
● Serves as the central hub of LOD
● 9.5 billion facts (latest release)
○ interdisciplinary and multilingual in nature
● Mentioned in 22K scientific articles
○ 3.6K in 2017 alone
● 7.7M daily hits from 380K distinct IPs
● Strong community
● Strong industrial presence
https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71
6. Knowledge Extraction
Extracting semantically enriched information from semi-structured or unstructured documents. Knowledge extraction usually involves many challenging sub-tasks,
- e.g. named entity recognition (and disambiguation), coreference resolution, template element construction, template relation construction & scenario template production
Cunningham, 2005
Evaluating the results of knowledge extraction is not straightforward:
- Recall: the fraction of the available knowledge that is extracted
- Precision: the fraction of the extracted knowledge that is correct
Olson and Delen, 2008
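With these definitions, the two measures reduce to simple set arithmetic over extracted facts versus a gold standard; a minimal illustration in Python (the data is a toy example, not from the thesis):

```python
def precision_recall(extracted, gold):
    """Precision: fraction of extracted facts that are correct.
    Recall: fraction of the available (gold) facts that were extracted."""
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Toy gold standard and extraction output as (subject, predicate, object) facts
gold = {("Tesla", "birthPlace", "Smiljan"),
        ("Tesla", "birthDate", "1856-07-10"),
        ("Tesla", "deathDate", "1943-01-07")}
extracted = {("Tesla", "birthPlace", "Smiljan"),
             ("Tesla", "birthDate", "1943-01-07")}  # second fact is wrong

p, r = precision_recall(extracted, gold)
print(p, r)  # 0.5 0.333...
```

Note the tension the slides point out: adding new sources raises recall but risks precision, which is why the extraction work below is paired with quality assessment.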
7. [I18n] Knowledge Extraction From Wikipedia
Information in Wikipedia is curated by very diverse and highly dynamic
communities
Roth, Taraborelli, and Gilbert, 2008
- No strict global coordination for uniformity of the data
- Different Wikipedia language editions (en, de, nl, …)
- Different Wikimedia projects (Commons, Wikidata)
Community-, language- and project-specific conventions for defining information
8. Knowledge Extraction & Usefulness
The results of an extraction process can only be evaluated on the basis of their usefulness in a specific context, i.e. fitness for use
Juran, 1974
Data with the same or similar quality indicators can be suitable for one application but not for another
- e.g. completeness may matter more in one case than another
9. Quality Assessment
Many possible requirements for extracted data,
e.g. schema consistency, exhaustive coverage, correctness, etc.
Schema & constraint conformance is a common requirement, e.g.:
A person must have a birth date and at most one death date
The death date of a person cannot be before the person's birth date
Dates must be formatted as xsd:date
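Checks like these can be sketched over an in-memory set of (subject, predicate, object) triples; a minimal, illustrative Python example (the function name and data are hypothetical, this is not RDFUnit's API):

```python
import re
from collections import defaultdict

def check_person(triples):
    """Validate the three example constraints over (s, p, o) triples.
    Returns a list of human-readable violations. Illustrative only."""
    props = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        props[s][p].append(o)
    date_re = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # xsd:date lexical form (no timezone)
    violations = []
    for s, pv in props.items():
        births = pv.get("dbo:birthDate", [])
        deaths = pv.get("dbo:deathDate", [])
        if not births:
            violations.append(f"{s}: missing birth date")
        if len(deaths) > 1:
            violations.append(f"{s}: more than one death date")
        for d in births + deaths:
            if not date_re.match(d):
                violations.append(f"{s}: '{d}' is not xsd:date formatted")
        # ISO 8601 dates compare correctly as plain strings
        if births and deaths and deaths[0] < births[0]:
            violations.append(f"{s}: death date before birth date")
    return violations

data = [
    ("dbr:Alice", "dbo:birthDate", "1990-04-01"),
    ("dbr:Alice", "dbo:deathDate", "1980-01-01"),  # violates the ordering constraint
]
print(check_person(data))  # → ["dbr:Alice: death date before birth date"]
```

In the thesis such constraints are expressed declaratively (SPARQL, SHACL, ShEx) rather than hand-coded per rule; the sketch only shows what a violation check computes.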
No (constraint) schema language for RDF existed at the time the thesis started
10. [Constraint] schema languages for RDF (1/2)
● OWL/RDFS
○ Meant for reasoning / open-world assumption
○ Constraining with closed-world semantics
■ Some constraints are not possible, e.g. language tags
● SPARQL
○ Meant for querying RDF data
○ Constraining by querying for errors
○ Difficult syntax but very expressive
● SPIN
○ Vendor-driven, based on SPARQL queries wrapped in a vocabulary
○ Difficult to write constraints but very expressive
● IBM Resource Shapes, Dublin Core Profiles
○ Vendor-driven, high level RDF syntax
○ Easy to write constraints, but not very expressive
11. Constraint schema languages for RDF (2/2)
● Shape Expressions (ShEx): (http://shex.io)
○ High-level language with compact syntax
○ Very expressive (though not as expressive as SPARQL), well-defined schema recursion semantics
● Shape constraint language (SHACL): (https://www.w3.org/TR/shacl/)
○ High-level RDF syntax
○ Allows embedding SPARQL queries with SHACL-SPARQL
○ Good expressivity, undefined recursion semantics
● SHACL is a W3C recommendation as of July 2017
● ShEx is also expected to be standardized (W3C / Other body)
● Both ShEx & SHACL were developed after the beginning of this thesis
12. Research question
This thesis focuses on both large-scale multilingual knowledge extraction and quality assessment
● Evaluating the usability of the extracted knowledge is essential
How can we increase recall & precision in knowledge extraction?
● DBpedia as the use case, generalizing where possible
13. Increasing recall & precision in knowledge extraction
● The knowledge available for extraction is unknown
○ A fully generic approach cannot be created
● Focus on increasing the extracted knowledge by
○ Adding new data sources
○ New approaches for improving extraction from existing sources
○ Improving the existing infrastructure
● Result: better data coverage
16. DBpedia & I18n*
● English Wikipedia was (and still is) the most abundant in information
○ main focus of early versions of DBpedia
● Limited support for extracting non-English Wikipedias
○ Used the English configuration
○ Discarded pages without an English interwiki link
○ Represented entities with English labels
● English-biased… however:
○ Local information is usually better than the English version
○ Not all local information has an English page
(*) I18n is short for "internationalization": 'i', followed by 18 characters, followed by 'n'
17. DBpedia I18n results
● I18n extractor extensions
● I18n parser extensions
● Articles without enwiki links
61.26% increase in triples (DBpedia version 3.7)
Kontokostas, Bratsas, Auer, Hellmann, Antoniou, and Metakides (2012)
18. Incorporation of Wikimedia Commons
● Wikimedia's media backend
● Extended the DBpedia Framework
○ File pages, media extractors
○ Extended the mapping process
● Incorporated
○ Galleries, image annotations
○ Media licensing, geo-data
○ Media metadata
● Resulted in
○ 1.4 billion statements
○ 25M images, 600K artworks, 50K videos, etc.
○ 43M license statements
Vaidya, Kontokostas, Knuth, Lehmann, and Hellmann (2015)
19. Incorporation of Wikidata
● Wikimedia's structured-data backend
○ Maintains interwiki links
○ Structured data related to pages
● Extended the DBpedia Framework
○ JSON parser
○ JSON-to-RDF mappings
○ Wikidata extractors
● Incorporated Wikidata
○ Mapped Wikidata data to the DBpedia ontology
● Resulted in
○ 1.4 billion statements
○ Up to 2.7K daily visitors
Ismayilov, Kontokostas, Auer, Lehmann, and Hellmann (2016)
20. DBTax
Unsupervised Learning of an Extensive and Usable DBpedia Taxonomy
● Only 2.8M out of 4.9M entities are classified in the DBpedia ontology
○ e.g. persons, organizations, places, etc.
● Data-driven approach
● Use the Wikipedia categories for classification
● Generated:
○ 1.9K classes (T-Box)
○ 10.7M instance-of assertions (A-Box)
○ Types for 4.26M entities (2.32M previously had no type)
Fossati, Kontokostas, and Lehmann, 2015
21. Increasing recall & precision in knowledge extraction
Results:
Circa 2x increase in data coverage
● 1.4B triples from DBpedia Commons
● 1.4B triples from DBpedia Wikidata
● 62% increase from I18n enhancements
● DBTax
22. Increasing recall & precision in knowledge extraction
● Quality assessment of RDF data
● Focus on validating data against schemas
○ Automate the test case elicitation process
○ Unify existing approaches under a common methodology
■ RDFS, OWL, RS, DCMI Profiles, SPIN, SHACL, ShEx
○ Generic approach that can be used for
■ General-purpose validation
■ Domain-specific validation
■ Diverse use cases
24. Quality assessment, why?
● Unprecedented volume of structured data on the Web
● Datasets are of varying quality
● OWL schemas are often not sufficiently developed or exploited for quality evaluation
Software development promotes testing code (TDD); why not test data?
Test-driven quality assessment methodology (TDQAM)
25. Test-Driven Development (Software)
● Test case: input on which the program under test is executed during
testing
● Test suite: a set of test cases for testing a program
● Status: Success, Fail or Error
Test cases are implemented largely manually or with limited programmatic
support
Zhu, Hall, and May, 1997
26. Test-Driven Quality Assessment (RDF)
● Test case: a data constraint that involves one or more triples
● Test suite: a set of test cases for testing a dataset
● Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
○ Fail: Violation, warning or notice
RDF: basis for both data and schema
● Unified model facilitates automatic test case generation
● SPARQL serves as the test case definition language
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri, (2014)
27. Example Test Case
A person should never have a birth date after a death date
Test cases are written in SPARQL:
SELECT ?s WHERE {
  ?s dbo:birthDate ?v1 .
  ?s dbo:deathDate ?v2 .
  FILTER ( ?v1 > ?v2 ) }
We query for errors:
● Success: the query returns an empty result set
● Fail: the query returns results
○ Every returned result is a violation, warning or notice instance
28. Data Quality Test Patterns (DQTP)
Abstract patterns that can be refined into concrete data quality test cases using test pattern bindings
Pattern:
SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 ) }
Bindings (mapping of variables to valid pattern replacements):
P1 => dbo:birthDate
P2 => dbo:deathDate
OP => >
Instantiated test case:
SELECT ?s WHERE {
  ?s dbo:birthDate ?v1 .
  ?s dbo:deathDate ?v2 .
  FILTER ( ?v1 > ?v2 ) }
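The binding step is essentially template substitution over the `%%VAR%%` placeholders; a minimal sketch in Python (illustrative, not RDFUnit's actual implementation):

```python
# A DQTP with %%VAR%% placeholders, as on the slide
PATTERN = """SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 ) }"""

def instantiate(pattern, bindings):
    """Replace every %%VAR%% placeholder with its bound value,
    yielding a concrete SPARQL test case."""
    query = pattern
    for var, value in bindings.items():
        query = query.replace(f"%%{var}%%", value)
    return query

query = instantiate(PATTERN, {"P1": "dbo:birthDate",
                              "P2": "dbo:deathDate",
                              "OP": ">"})
print(query)
```

One pattern plus many bindings yields many concrete test cases, which is what makes automatic test generation (next slide) cheap.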
29. Test Auto Generators (TAG)
● Query schema for supported constraints
○ Every result creates a binding to a pattern & instantiates a test case
● Supported schemas at the moment:
○ RDFS
○ OWL
○ IBM Resource Shapes 2.0
○ DCMI Profiles
○ SHACL
● Users write in their preferred (constraint) language, and we translate
● Data-driven; supports common axioms / definitions
○ Limited support for the long tail
30. Test coverage of a test suite for a dataset
Coverage computation function f: Q → 2^E
takes a SPARQL query q ∈ Q corresponding to a test case pattern binding as input and returns a set of entities.
Coverage metrics for:
● Property domain (dom) & range (ran): F(Q_S, D) = Σ_{p ∈ F(Q_S)} pfreq(p)
● Property dependency (pdep) and class dependency (cdep)
● Property cardinality (card) and class instance (mem)
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri (2014)
31. Test case reusability
● Tests can be associated with either a dataset or a schema
○ e.g. tests specific to English DBpedia: Articles with template X must have value Y
○ e.g. test about the DBpedia ontology: dbo:birthDate must be before dbo:deathDate
● Automatic and manual tests associated with schemas are reusable
○ e.g. any dataset reusing the DBpedia ontology can reuse the DBpedia test cases
● A dataset can be tested against the schemas that it reuses
○ e.g. dbo, foaf, skos, etc
33. TDQAM ontology
● Input / Output entirely in RDF
● Models the methodology in OWL
○ test suites, test cases, patterns, TAGs, etc
● Error reporting
○ Four different granularity formats
■ From aggregated to detailed
○ Influenced SHACL
34. Test case reusability (evaluation)
Linked Open Vocabularies, http://lov.okfn.org (as of 10/2013)
● Describes 400 vocabularies in RDF (prefix, URI, description, etc.)
● Ran the OWL/RDFS TAGs on all vocabularies
● 32K unique reusable test cases
35. TDQAM: general-purpose evaluation
● Implemented the methodology in the RDFUnit tool
● Tested on 3 crowdsourced and 2 library datasets
○ dbpedia.org
○ nl.dbpedia.org
○ linkedgeodata.org
○ id.loc.gov
○ datos.bne.es
● Defined manual test cases for
○ the DBpedia ontology (22), the Linked Geo Data ontology (6), SKOS (20)
36. Evaluation overview per source
[Chart: errors from automatically (TAG) and manually generated constraints; errors per subject]
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri (2014)
38. TDQAM: domain-specific validation improvement
● Quickly improving domain-specific validation for vocabularies provided by
vocabulary maintainers
● Proof of concept in the NLP domain
○ LEMON: models lexica and machine-readable dictionaries
○ NIF: brings interoperability between NLP tools, resources & annotations
39. Lemon & NIF test case elicitation
● Existing maintainer validation
○ Lemon: 24 structural tests in Python; poor reporting and slow
○ NIF: 11 SPARQL queries; incomplete
● Test auto-generation based on the schema (TAGs)
● Manual SPARQL test cases for the remaining constraints
40. Evaluation
[Charts: errors based on manual test cases (marked as Errors, Warnings or Info) and errors based on TAGs]
Kontokostas, Brümmer, Hellmann, Lehmann, and Ioannidis, 2014
41. TDQAM Summary
● Schema-driven design
○ Automates most of the test case elicitation
○ Reusable schema-based test cases across different datasets
● Methodology is accompanied by
○ Open source tool (RDFUnit)
○ Detailed ontology
● Evaluation revealed a substantial number of errors
○ Even for the library datasets in the general-purpose evaluation
○ Improved maintainer validation in the domain-specific evaluation
● Provided feedback for fixing many errors in DBpedia
43. TDQA - Mappings Quality Assessment
● RDF is most often generated from non-RDF data through a set of mappings
○ Code to RDF
○ High-level mappings to RDF (e.g. the RDF Mapping Language - RML)
● Violations related to the dataset schema are very frequent
○ Incorrect datatypes, wrong domain & range, etc. are usually mapping errors
45. Discover violations before they are generated
● Apply TDQAM directly on the mappings
○ Using schema information (e.g. domain & range of foaf:age)
● Use RML as a prototype mapping language
○ Extended RML for the DBpedia infobox-to-ontology mappings
● Automatic detection and refinement suggestions
○ Class & property disjointness, range, language tags, domain, and deprecation
● Some violations are inherent to the data
○ Cardinality, functionality, symmetry, irreflexivity
49. Qualitative feedback from experts & users
WKD staff, software & data development: expert & usability interview evaluation
● Productivity
○ Almost immediate feedback
● Quality
○ The process is helpful for spotting and addressing errors
○ Successful tests are less significant
● Agility
○ Easy to add new constraints or adjust existing ones
○ Fully automated process
○ Adding more documents increases total time
Kontokostas, Mader, Dirschl, Eck, Leuthold, Lehmann, and Hellmann, 2016
50. Increasing recall & precision in knowledge extraction
Results:
● Test-driven quality assessment methodology
○ Unifies pre-existing data constraining approaches
○ Influenced SHACL
● Methodology applicable for
○ General purpose
○ Domain-specific validation
○ Diverse use cases
● RDFUnit Implementation
● 38% increase in DBpedia schema conformance in DBpedia English
○ Measured for ALIGNED (EU project) mid-term report deliverable
○ From Spring 2015 to Summer 2016
52. Thesis impact on DBpedia
● Doubled the amount of extracted triples
● 38% increase in schema conformance
Given the uptake of DBpedia, this should have a big network effect
but… usage scenarios are rarely known ($$$)
however, traffic keeps increasing
53. TDQA & RDFUnit
● ALIGNED Project / WP 4: Data Quality Engineering
● Part of Linked Data Stack (GeoKnow & LOD2 EU projects)
● Uptake by the research community
○ circa 200 total citations
● Known industrial adoption
○ Wolters Kluwer, Semantic Web Company (ALIGNED)
○ Ontotext, GeoPhy, Springer Nature, Oxford University Press, Europeana, Eccenca, ...
54. Standardization
SHACL
● Active member of the W3C Shapes Working Group
● Became one of the specification editors in March 2016
● SHACL transitioned to a W3C Recommendation in July 2017
ShEx
● Originally planned to merge with SHACL
● Diverged due to core differences in language semantics
● Served as the community group chair for the past year
55. Thesis publications
● 11 peer-reviewed conference papers
● 4 Journal Publications
● 5 peer-reviewed demo-posters
● 4 workshop proceedings
○ Linked Data Quality (LDQ) workshop series
● 1 Journal Special Issue
○ SWJ: Quality Management of Semantic Web Assets
● 1 Book Chapter
● 1 Book (Validating RDF Data, co-author)
○ Synthesis Lectures on the Semantic Web: Theory and Technology
● 1 W3C Standard (SHACL, co-editor)
56. Overview
Comprehensive set of research and engineering tasks for increasing both
precision and recall in large-scale multilingual knowledge extraction, with a
focus on DBpedia.
The results of this thesis have already been contributed back to the scientific & industrial community through an improved DBpedia open data stack, open-source tools, services and specifications
57. Thank you for your attention!
Looking forward to answering
all your questions...
59. TDQAM vs Reasoners
● SPARQL test cases detect a subset of the validation errors detectable by an OWL reasoner. Limited by:
○ the SPARQL endpoint's reasoning support
○ limitations of the OWL-to-SPARQL translation
● SPARQL test cases detect validation errors not expressible in OWL
● OWL reasoning is often not feasible on large datasets
● Datasets are already deployed and accessible via SPARQL endpoints
● The pattern library is a more user-friendly approach for building validation rules than modelling OWL axioms
○ still requires familiarity
○ uncommon validations require manual SPARQL test cases
60. Increasing recall & precision in knowledge extraction
Limitations of TDQAM:
● Recursion
○ Partial support planned using SPARQL property paths
● Disjunctive constraints
○ In progress
● Complete language conversion (long-tail)
○ e.g. OWL to SPARQL algorithms
61. Evolution & quality
Data evolves
so do ontologies
so do RDF mappings
so does code
so do SPARQL queries
so do constraints
http://aligned-project.eu
66. TAG Example
INVFUNC pattern (a general pattern that we want to auto-generate):
SELECT ?s WHERE {
  ?a %%P1%% ?resource .
  ?b %%P1%% ?resource .
  FILTER (?a != ?b) }
Test Auto Generator (queries the schema for bindings):
SELECT DISTINCT ?P1 WHERE {
  ?P1 rdf:type owl:InverseFunctionalProperty . }
Bindings (for every result of a TAG, bind values to the pattern and instantiate test cases):
INVFUNC BIND (P1 -> foaf:homepage)
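End to end, a TAG can be pictured as a function that scans the schema for a supported axiom and emits one bound test case per match; an illustrative Python sketch of the INVFUNC generator (function and variable names are hypothetical, not RDFUnit's API):

```python
# The INVFUNC pattern: two distinct subjects sharing a value for an
# inverse-functional property is a violation.
INVFUNC = """SELECT ?s WHERE {
  ?a %%P1%% ?resource .
  ?b %%P1%% ?resource .
  FILTER (?a != ?b) }"""

def invfunc_tag(schema_triples):
    """For every inverse-functional property declared in the schema,
    bind it to the INVFUNC pattern and return the concrete test queries."""
    tests = []
    for s, p, o in schema_triples:
        if p == "rdf:type" and o == "owl:InverseFunctionalProperty":
            tests.append(INVFUNC.replace("%%P1%%", s))
    return tests

# Toy schema: only foaf:homepage is declared inverse-functional
schema = [("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
          ("foaf:name", "rdf:type", "owl:DatatypeProperty")]
for q in invfunc_tag(schema):
    print(q)
```

The real TAGs run the schema query against a SPARQL endpoint rather than a Python list, but the bind-and-instantiate step is the same.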