1. Large-Scale Multilingual Knowledge Extraction,
Publishing and Quality Assessment:
The Case of DBpedia
Defense of the dissertation
for the attainment of the academic degree
Doktor-Ingenieur (Dr.-Ing.) in the field of Computer Science
Dimitrios Kontokostas, MSc
3. DBpedia
● Crowdsourced community effort
● Extract structured content from the information created in various
Wikimedia projects
● Assemble an open knowledge graph
● Allows extracted information to be collected, organised, shared, searched
and utilised
http://wiki.dbpedia.org/about
Lehmann, Isele, Jakob, Jentzsch, Kontokostas, Mendes, Hellmann, Morsey, Kleef, Auer, and Bizer (2015)
5. DBpedia Impact: Broad Outreach
● Serves as the central hub of LOD
● 9.5 billion facts (latest release)
○ interdisciplinary and multilingual in nature
● Mentioned in 22K scientific articles
○ 3.6K in 2017 alone
● 7.7M daily hits from 380K distinct IPs
● Strong community
● Strong industrial presence
https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71
6. Knowledge Extraction
Extracting semantically enriched information from semi-structured or unstructured documents. Knowledge extraction usually involves many challenging sub-tasks,
- e.g. named entity recognition (and disambiguation), coreference resolution, template element construction, template relation construction & scenario template production
Cunningham, 2005
Evaluating the results of knowledge extraction is not straightforward:
- Recall: the fraction of the available knowledge that is extracted
- Precision: the fraction of the extracted knowledge that is correct
Olson and Delen, 2008
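With these definitions, the two measures reduce to simple set arithmetic over extracted facts versus a gold standard; a minimal illustration in Python (the data is a toy example, not from the thesis):

```python
def precision_recall(extracted, gold):
    """Precision: fraction of extracted facts that are correct.
    Recall: fraction of the available (gold) facts that were extracted."""
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Toy gold standard and extraction output as (subject, predicate, object) facts
gold = {("Tesla", "birthPlace", "Smiljan"),
        ("Tesla", "birthDate", "1856-07-10"),
        ("Tesla", "deathDate", "1943-01-07")}
extracted = {("Tesla", "birthPlace", "Smiljan"),
             ("Tesla", "birthDate", "1943-01-07")}  # second fact is wrong

p, r = precision_recall(extracted, gold)
print(p, r)  # 0.5 0.333...
```

Note the tension the slides point out: adding new sources raises recall but risks precision, which is why the extraction work below is paired with quality assessment.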
7. [I18n] Knowledge Extraction From Wikipedia
Information in Wikipedia is curated by very diverse and highly dynamic
communities
Roth, Taraborelli, and Gilbert, 2008
- No strict global coordination for uniformity of the data
- Different Wikipedia language editions (en, de, nl, …)
- Different Wikimedia projects (Commons, Wikidata)
Community-, language- and project-specific conventions for defining information
8. Knowledge Extraction & Usefulness
The results of an extraction process can only be evaluated on the basis of their usefulness in a specific context, i.e. fitness for use
Juran, 1974
Data with the same or similar quality indicators can be suitable for one application but not for another
- e.g. completeness may matter more in one case than another
9. Quality Assessment
Many possible requirements for extracted data,
e.g. schema consistency, exhaustive coverage, correctness, etc.
Schema & constraint conformance is a common requirement, e.g.:
A person must have a birth date and at most one death date
The death date of a person cannot be before the person's birth date
Dates must be formatted as xsd:date
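Checks like these can be sketched over an in-memory set of (subject, predicate, object) triples; a minimal, illustrative Python example (the function name and data are hypothetical, this is not RDFUnit's API):

```python
import re
from collections import defaultdict

def check_person(triples):
    """Validate the three example constraints over (s, p, o) triples.
    Returns a list of human-readable violations. Illustrative only."""
    props = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        props[s][p].append(o)
    date_re = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # xsd:date lexical form (no timezone)
    violations = []
    for s, pv in props.items():
        births = pv.get("dbo:birthDate", [])
        deaths = pv.get("dbo:deathDate", [])
        if not births:
            violations.append(f"{s}: missing birth date")
        if len(deaths) > 1:
            violations.append(f"{s}: more than one death date")
        for d in births + deaths:
            if not date_re.match(d):
                violations.append(f"{s}: '{d}' is not xsd:date formatted")
        # ISO 8601 dates compare correctly as plain strings
        if births and deaths and deaths[0] < births[0]:
            violations.append(f"{s}: death date before birth date")
    return violations

data = [
    ("dbr:Alice", "dbo:birthDate", "1990-04-01"),
    ("dbr:Alice", "dbo:deathDate", "1980-01-01"),  # violates the ordering constraint
]
print(check_person(data))  # → ["dbr:Alice: death date before birth date"]
```

In the thesis such constraints are expressed declaratively (SPARQL, SHACL, ShEx) rather than hand-coded per rule; the sketch only shows what a violation check computes.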
No (constraint) schema language for RDF existed at the time the thesis started
10. [Constraint] schema languages for RDF (1/2)
● OWL/RDFS
○ Meant for reasoning / open-world assumption
○ Constraining with closed-world semantics
■ Some constraints are not possible, e.g. language tags
● SPARQL
○ Meant for querying RDF data
○ Constraining by querying for errors
○ Difficult syntax but very expressive
● SPIN
○ Vendor-driven, based on SPARQL queries wrapped in a vocabulary
○ Difficult to write constraints but very expressive
● IBM Resource Shapes, Dublin Core Profiles
○ Vendor-driven, high level RDF syntax
○ Easy to write constraints, but not very expressive
11. Constraint schema languages for RDF (2/2)
● Shape Expressions (ShEx): (http://shex.io)
○ High-level language with compact syntax
○ Very expressive (though not as expressive as SPARQL), well-defined schema recursion semantics
● Shape constraint language (SHACL): (https://www.w3.org/TR/shacl/)
○ High-level RDF syntax
○ Allows embedding SPARQL queries with SHACL-SPARQL
○ Good expressivity, undefined recursion semantics
● SHACL is a W3C recommendation as of July 2017
● ShEx is also expected to be standardized (W3C / Other body)
● Both ShEx & SHACL were developed after the beginning of this thesis
12. Research question
This thesis focuses on both large-scale multilingual knowledge extraction and quality assessment
● Evaluating the usability of the extracted knowledge is essential
How can we increase recall & precision in knowledge extraction?
● DBpedia as the use case, generalizing where possible
13. Increasing recall & precision in knowledge extraction
● The knowledge available for extraction is unknown
○ A fully generic approach cannot be created
● Focus on increasing the extracted knowledge by
○ Adding new data sources
○ New approaches for improving extraction from existing sources
○ Improving the existing infrastructure
● Result: better data coverage
16. DBpedia & I18n*
● English Wikipedia was (and still is) the most abundant in information
○ main focus of early versions of DBpedia
● Limited support for extracting non-English Wikipedias
○ Used the English configuration
○ Discarded pages without an English interwiki link
○ Represented entities with English labels
● English-biased… however:
○ Local information is usually better than the English version
○ Not all local information has an English page
(*) I18n is short for "internationalization": 'i', followed by 18 characters, followed by 'n'
17. DBpedia I18n results
● I18n extractor extensions
● I18n parser extensions
● Articles without enwiki links
61.26% increase in triples (DBpedia version 3.7)
Kontokostas, Bratsas, Auer, Hellmann, Antoniou, and Metakides (2012)
18. Incorporation of Wikimedia Commons
● Wikimedia's media backend
● Extended the DBpedia Framework
○ File pages, media extractors
○ Extended the mapping process
● Incorporated
○ Galleries, image annotations
○ Media licensing, geo-data
○ Media metadata
● Resulted in
○ 1.4 billion statements
○ 25M images, 600K artworks, 50K videos, etc.
○ 43M license statements
Vaidya, Kontokostas, Knuth, Lehmann, and Hellmann (2015)
19. Incorporation of Wikidata
● Wikimedia's structured-data backend
○ Maintains interwiki links
○ Structured data related to pages
● Extended the DBpedia Framework
○ JSON parser
○ JSON-to-RDF mappings
○ Wikidata extractors
● Incorporated Wikidata
○ Mapped Wikidata data to the DBpedia ontology
● Resulted in
○ 1.4 billion statements
○ Up to 2.7K daily visitors
Ismayilov, Kontokostas, Auer, Lehmann, and Hellmann (2016)
20. DBTax
Unsupervised Learning of an Extensive and Usable DBpedia Taxonomy
● Only 2.8M out of 4.9M entities are classified in the DBpedia ontology
○ e.g. persons, organizations, places, etc.
● Data-driven approach
● Use the Wikipedia categories for classification
● Generated:
○ 1.9K classes (T-Box)
○ 10.7M instance-of assertions (A-Box)
○ Types for 4.26M entities (2.32M previously had no type)
Fossati, Kontokostas, and Lehmann, 2015
21. Increasing recall & precision in knowledge extraction
Results:
Circa 2x increase in data coverage
● 1.4B triples from DBpedia Commons
● 1.4B triples from DBpedia Wikidata
● 62% increase from I18n enhancements
● DBTax
22. Increasing recall & precision in knowledge extraction
● Quality assessment of RDF data
● Focus on validating data against schemas
○ Automate the test case elicitation process
○ Unify existing approaches under a common methodology
■ RDFS, OWL, RS, DCMI Profiles, SPIN, SHACL, ShEx
○ Generic approach that can be used for
■ General-purpose validation
■ Domain-specific validation
■ Diverse use cases
24. Quality assessment, why?
● Unprecedented volume of structured data on the Web
● Datasets are of varying quality
● OWL schemas are often not sufficiently developed or exploited for quality evaluation
Software development promotes testing code (TDD); why not test data?
Test-driven quality assessment methodology (TDQAM)
25. Test-Driven Development (Software)
● Test case: input on which the program under test is executed during
testing
● Test suite: a set of test cases for testing a program
● Status: Success, Fail or Error
Test cases are implemented largely manually or with limited programmatic
support
Zhu, Hall, and May, 1997
26. Test-Driven Quality Assessment (RDF)
● Test case: a data constraint that involves one or more triples
● Test suite: a set of test cases for testing a dataset
● Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
○ Fail: Violation, warning or notice
RDF: basis for both data and schema
● Unified model facilitates automatic test case generation
● SPARQL serves as the test case definition language
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri, (2014)
27. Example Test Case
A person should never have a birth date after a death date
Test cases are written in SPARQL:
SELECT ?s WHERE {
  ?s dbo:birthDate ?v1 .
  ?s dbo:deathDate ?v2 .
  FILTER ( ?v1 > ?v2 ) }
We query for errors:
● Success: the query returns an empty result set
● Fail: the query returns results
○ Every returned result is a violation, warning or notice instance
28. Data Quality Test Patterns (DQTP)
Abstract patterns that can be refined into concrete data quality test cases using test pattern bindings
Pattern:
SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 ) }
Bindings (mapping of variables to valid pattern replacements):
P1 => dbo:birthDate
P2 => dbo:deathDate
OP => >
Instantiated test case:
SELECT ?s WHERE {
  ?s dbo:birthDate ?v1 .
  ?s dbo:deathDate ?v2 .
  FILTER ( ?v1 > ?v2 ) }
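The binding step is essentially template substitution over the `%%VAR%%` placeholders; a minimal sketch in Python (illustrative, not RDFUnit's actual implementation):

```python
# A DQTP with %%VAR%% placeholders, as on the slide
PATTERN = """SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 ) }"""

def instantiate(pattern, bindings):
    """Replace every %%VAR%% placeholder with its bound value,
    yielding a concrete SPARQL test case."""
    query = pattern
    for var, value in bindings.items():
        query = query.replace(f"%%{var}%%", value)
    return query

query = instantiate(PATTERN, {"P1": "dbo:birthDate",
                              "P2": "dbo:deathDate",
                              "OP": ">"})
print(query)
```

One pattern plus many bindings yields many concrete test cases, which is what makes automatic test generation (next slide) cheap.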
29. Test Auto Generators (TAG)
● Query schema for supported constraints
○ Every result creates a binding to a pattern & instantiates a test case
● Supported schemas at the moment:
○ RDFS
○ OWL
○ IBM Resource Shapes 2.0
○ DCMI Profiles
○ SHACL
● Users write in their preferred (constraint) language, and we translate
● Data-driven; supports common axioms / definitions
○ Limited support for the long tail
30. Test coverage of a test suite for a dataset
Coverage computation function f: Q → 2^E
takes a SPARQL query q ∈ Q corresponding to a test case pattern binding as input and returns a set of entities.
Coverage metrics for:
● Property domain (dom) & range (ran): F(Q_S, D) = Σ_{p ∈ F(Q_S)} pfreq(p)
● Property dependency (pdep) and class dependency (cdep)
● Property cardinality (card) and class instance (mem)
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri (2014)
31. Test case reusability
● Tests can be associated with either a dataset or a schema
○ e.g. tests specific to English DBpedia: Articles with template X must have value Y
○ e.g. test about the DBpedia ontology: dbo:birthDate must be before dbo:deathDate
● Automatic and manual tests associated with schemas are reusable
○ e.g. any dataset reusing the DBpedia ontology can reuse the DBpedia test cases
● A dataset can be tested against the schemas that it reuses
○ e.g. dbo, foaf, skos, etc
33. TDQAM ontology
● Input / Output entirely in RDF
● Models the methodology in OWL
○ test suites, test cases, patterns, TAGs, etc
● Error reporting
○ Four different granularity formats
■ From aggregated to detailed
○ Influenced SHACL
34. Test case reusability (evaluation)
Linked Open Vocabularies, http://lov.okfn.org (as of 10/2013)
● Describes 400 vocabularies in RDF (prefix, URI, description, etc.)
● Ran the OWL/RDFS TAGs on all vocabularies
● 32K unique reusable test cases
35. TDQAM: general-purpose evaluation
● Implemented the methodology in the RDFUnit tool
● Tested on 3 crowdsourced and 2 library datasets
○ dbpedia.org
○ nl.dbpedia.org
○ linkedgeodata.org
○ id.loc.gov
○ datos.bne.es
● Defined manual test cases for
○ the DBpedia ontology (22), the Linked Geo Data ontology (6), SKOS (20)
36. Evaluation overview per source
[Chart: errors from automatically (TAG) and manually generated constraints; errors per subject]
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri (2014)
38. TDQAM: domain-specific validation improvement
● Quickly improving domain-specific validation for vocabularies provided by
vocabulary maintainers
● Proof of concept in the NLP domain
○ LEMON: models lexica and machine-readable dictionaries
○ NIF: brings interoperability between NLP tools, resources & annotations
39. Lemon & NIF test case elicitation
● Existing maintainer validation
○ Lemon: 24 structural tests in Python; poor reporting and slow
○ NIF: 11 SPARQL queries; incomplete
● Test auto-generation based on the schema (TAGs)
● Manual SPARQL test cases for the remaining constraints
40. Evaluation
[Charts: errors based on manual test cases (marked as Errors, Warnings or Info) and errors based on TAGs]
Kontokostas, Brümmer, Hellmann, Lehmann, and Ioannidis, 2014
41. TDQAM Summary
● Schema-driven design
○ Automates most of the test case elicitation
○ Reusable schema-based test cases across different datasets
● Methodology is accompanied by
○ Open source tool (RDFUnit)
○ Detailed ontology
● Evaluation revealed a substantial number of errors
○ Even for the library datasets in the general-purpose evaluation
○ Improved maintainer validation in the domain-specific evaluation
● Provided feedback for fixing many errors in DBpedia
43. TDQA - Mappings Quality Assessment
● RDF is most often generated from non-RDF data through a set of mappings
○ Code to RDF
○ High-level mappings to RDF (e.g. the RDF Mapping Language - RML)
● Violations related to the dataset schema are very frequent
○ Incorrect datatypes, wrong domain & range, etc. are usually mapping errors
45. Discover violations before they are generated
● Apply TDQAM directly on the mappings
○ Using schema information (e.g. domain & range of foaf:age)
● Use RML as a prototype mapping language
○ Extended RML for the DBpedia infobox-to-ontology mappings
● Automatic detection and refinement suggestions
○ Class & property disjointness, range, language tags, domain, and deprecation
● Some violations are inherent to the data
○ Cardinality, functionality, symmetry, irreflexivity
49. Qualitative feedback from experts & users
WKD staff, software & data development: expert & usability interview evaluation
● Productivity
○ Almost immediate feedback
● Quality
○ The process is helpful for spotting and addressing errors
○ Successful tests are less significant
● Agility
○ Easy to add new constraints or adjust existing ones
○ Fully automated process
○ Adding more documents increases total time
Kontokostas, Mader, Dirschl, Eck, Leuthold, Lehmann, and Hellmann, 2016
50. Increasing recall & precision in knowledge extraction
Results:
● Test-driven quality assessment methodology
○ Unifies pre-existing data constraining approaches
○ Influenced SHACL
● Methodology applicable for
○ General purpose
○ Domain-specific validation
○ Diverse use cases
● RDFUnit Implementation
● 38% increase in DBpedia schema conformance in DBpedia English
○ Measured for ALIGNED (EU project) mid-term report deliverable
○ From Spring 2015 to Summer 2016
52. Thesis impact on DBpedia
● Doubled the amount of extracted triples
● 38% increase in schema conformance
Given the uptake of DBpedia, this should have a big network effect
but… usage scenarios are rarely known ($$$)
however, traffic keeps increasing
53. TDQA & RDFUnit
● ALIGNED Project / WP 4: Data Quality Engineering
● Part of Linked Data Stack (GeoKnow & LOD2 EU projects)
● Uptake by the research community
○ circa 200 total citations
● Known industrial adoption
○ Wolters Kluwer, Semantic Web Company (ALIGNED)
○ Ontotext, GeoPhy, Springer Nature, Oxford University Press, Europeana, Eccenca, ...
54. Standardization
SHACL
● Active member of the W3C Shapes Working Group
● Became one of the specification editors in March 2016
● SHACL transitioned to a W3C Recommendation in July 2017
ShEx
● Originally planned to merge with SHACL
● Diverged due to core differences in language semantics
● Served as the community group chair for the past year
55. Thesis publications
● 11 peer-reviewed conference papers
● 4 Journal Publications
● 5 peer-reviewed demo-posters
● 4 workshop proceedings
○ Linked Data Quality (LDQ) workshop series
● 1 Journal Special Issue
○ SWJ: Quality Management of Semantic Web Assets
● 1 Book Chapter
● 1 Book (Validating RDF Data, co-author)
○ Synthesis Lectures on the Semantic Web: Theory and Technology
● 1 W3C Standard (SHACL, co-editor)
56. Overview
Comprehensive set of research and engineering tasks for increasing both
precision and recall in large-scale multilingual knowledge extraction, with a
focus on DBpedia.
The results of this thesis have already been contributed back to the scientific & industrial community through an improved DBpedia open data stack, open-source tools, services and specifications
57. Thank you for your attention!
Looking forward to answering
all your questions...
59. TDQAM vs Reasoners
● SPARQL test cases detect a subset of the validation errors detectable by an OWL reasoner. Limited by:
○ the SPARQL endpoint's reasoning support
○ limitations of the OWL-to-SPARQL translation
● SPARQL test cases detect validation errors not expressible in OWL
● OWL reasoning is often not feasible on large datasets
● Datasets are already deployed and accessible via SPARQL endpoints
● The pattern library is a more user-friendly approach for building validation rules than modelling OWL axioms
○ still requires familiarity
○ uncommon validations require manual SPARQL test cases
60. Increasing recall & precision in knowledge extraction
Limitations of TDQAM:
● Recursion
○ Partial support planned using SPARQL property paths
● Disjunctive constraints
○ In progress
● Complete language conversion (long-tail)
○ e.g. OWL to SPARQL algorithms
61. Evolution & quality
Data evolves
so do ontologies
so do RDF mappings
so does code
so do SPARQL queries
so do constraints
http://aligned-project.eu
66. TAG Example
INVFUNC pattern (a general pattern that we want to auto-generate):
SELECT ?s WHERE {
  ?a %%P1%% ?resource .
  ?b %%P1%% ?resource .
  FILTER (?a != ?b) }
Test Auto Generator (queries the schema for bindings):
SELECT DISTINCT ?P1 WHERE {
  ?P1 rdf:type owl:InverseFunctionalProperty . }
Bindings (for every result of a TAG, bind values to the pattern and instantiate test cases):
INVFUNC BIND (P1 -> foaf:homepage)
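End to end, a TAG can be pictured as a function that scans the schema for a supported axiom and emits one bound test case per match; an illustrative Python sketch of the INVFUNC generator (function and variable names are hypothetical, not RDFUnit's API):

```python
# The INVFUNC pattern: two distinct subjects sharing a value for an
# inverse-functional property is a violation.
INVFUNC = """SELECT ?s WHERE {
  ?a %%P1%% ?resource .
  ?b %%P1%% ?resource .
  FILTER (?a != ?b) }"""

def invfunc_tag(schema_triples):
    """For every inverse-functional property declared in the schema,
    bind it to the INVFUNC pattern and return the concrete test queries."""
    tests = []
    for s, p, o in schema_triples:
        if p == "rdf:type" and o == "owl:InverseFunctionalProperty":
            tests.append(INVFUNC.replace("%%P1%%", s))
    return tests

# Toy schema: only foaf:homepage is declared inverse-functional
schema = [("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
          ("foaf:name", "rdf:type", "owl:DatatypeProperty")]
for q in invfunc_tag(schema):
    print(q)
```

The real TAGs run the schema query against a SPARQL endpoint rather than a Python list, but the bind-and-instantiate step is the same.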