Large-Scale Multilingual Knowledge Extraction,
Publishing and Quality Assessment:
The Case of DBpedia
Defense of the dissertation
for the attainment of the academic degree
Doktor-Ingenieur (Dr.-Ing.) in the field of Computer Science
Dimitrios Kontokostas, MSc
1
Motivation
2
DBpedia
● Crowdsourced community effort
● Extract structured content from the information created in various
Wikimedia projects
● Assemble an open knowledge graph
● Allows extracted information to be collected, organised, shared, searched
and utilised
http://wiki.dbpedia.org/about
3
Lehmann, Isele, Jakob, Jentzsch, Kontokostas, Mendes, Hellmann, Morsey, Kleef, Auer, and Bizer, (2015)
4
● Serves as the central hub of the Linked Open Data (LOD) cloud
● 9.5 billion facts (latest release)
○ interdisciplinary and multilingual nature
● Mentioned in 22K scientific articles
○ 3.6K in 2017 alone
● 7.7M daily hits by 380K different IPs
● Strong community
● Strong industrial presence
Broad Outreach...
DBpedia impact
5
https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71
Knowledge Extraction
6
Extracting semantically enriched information from semi-structured or
unstructured documents. Knowledge Extraction usually involves many
challenging sub-tasks
- e.g. named entity recognition (and disambiguation), coreference resolution, template element
construction, template relation construction & scenario template production
Cunningham, 2005
Evaluating the results of Knowledge Extraction is not straightforward
- Recall as the fraction of available knowledge that is extracted
- Precision as the fraction of extracted knowledge that is correct
Olson and Delen, 2008
[I18n] Knowledge Extraction From Wikipedia
Information in Wikipedia is curated by very diverse and highly dynamic
communities
Roth, Taraborelli, and Gilbert, 2008
- No strict global coordination for uniformity of the data
- Different Wikipedia language editions (en, de, nl, …)
- Different Wikimedia projects (Commons, Wikidata)
Community-, language- & project-specific conventions for defining information
7
Knowledge Extraction & Usefulness
The results of an extraction process can only be evaluated on the basis of their
usefulness in a specific context, i.e. fitness for use
Juran, 1974
Data with similar or identical quality indicators can be suitable for one
application but not for another
- e.g. completeness may matter more in one case than in another
8
Quality Assessment
Many possible requirements for extracted data
e.g. schema consistency, exhaustive coverage, correctness, etc.
The most common is schema & constraint conformance, e.g.:
A person must have a birth date and at most one death date
The death date of a person cannot be before the person’s birth date
Dates must be formatted with the xsd:date format
No (constraint) schema language existed for RDF (at the time the thesis started)
9
[Constraint] schema languages for RDF (1/2)
● OWL/RDFS
○ Meant for reasoning / open-world assumption
○ Constraining with closed-world semantics
■ Some constraints are not possible, e.g. language tags
● SPARQL
○ Meant for querying RDF data
○ Constraining by querying for errors
○ Difficult syntax but very expressive
● SPIN
○ Vendor-driven, based on SPARQL queries wrapped in a vocabulary
○ Difficult to write constraints but very expressive
● IBM Resource Shapes, Dublin Core Profiles
○ Vendor-driven, high level RDF syntax
○ Easy to write constraints, not very expressive
10
Constraint schema languages for RDF (2/2)
● Shape Expressions (ShEx): (http://shex.io)
○ High-level language with compact syntax
○ Very expressive (though not as expressive as SPARQL), well-defined schema recursion semantics
● Shapes Constraint Language (SHACL): (https://www.w3.org/TR/shacl/)
○ High-level RDF syntax
○ Allows embedding SPARQL queries with SHACL-SPARQL
○ Good expressivity, undefined recursion semantics
● SHACL is a W3C Recommendation as of July 2017
● ShEx is also expected to be standardized (W3C / Other body)
● Both ShEx & SHACL were developed after the beginning of this thesis
11
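To make the comparison concrete, here is a minimal sketch (not part of the thesis) of the death-before-birth constraint from slide 9 expressed as a SHACL shape and checked with the pyshacl Python library; the shape and instance IRIs (ex:PersonShape, ex:someone) are invented for illustration.

from rdflib import Graph
from pyshacl import validate

# Shape: a person's birth date must be before the death date, at most one death date
SHAPES = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix ex:  <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
  sh:targetClass dbo:Person ;
  sh:property [ sh:path dbo:birthDate ; sh:lessThan dbo:deathDate ] ;
  sh:property [ sh:path dbo:deathDate ; sh:maxCount 1 ] .
"""

# One deliberately broken instance: death date before birth date
DATA = """
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:someone a dbo:Person ;
  dbo:birthDate "1990-05-01"^^xsd:date ;
  dbo:deathDate "1980-01-01"^^xsd:date .
"""

conforms, _, report = validate(
    Graph().parse(data=DATA, format="turtle"),
    shacl_graph=Graph().parse(data=SHAPES, format="turtle"))
print(conforms)  # False: the sh:lessThan constraint is violated
print(report)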
Research question
This thesis focuses on both large-scale multilingual knowledge extraction
and quality assessment
● Evaluating the usability of the extracted knowledge is essential
How can we increase recall & precision in knowledge extraction?
● DBpedia as use case, generalize where possible
12
Increasing recall & precision in knowledge extraction
13
● Available knowledge to be extracted is unknown
○ Cannot create generic approach
● Focus on increasing extracted knowledge by
○ Adding new data sources
○ New approaches for improving extraction of existing sources
○ Improving existing infrastructure
● Better data coverage
Large-scale Multilingual Knowledge
Extraction [in DBpedia]
14
DBpedia information extraction framework
15
© 1st & 2nd generation
of DBpedia devs
Lehmann, Isele, Jakob, Jentzsch, Kontokostas, Mendes, Hellmann, Morsey, Kleef, Auer, and Bizer, (2015)
DBpedia & I18n*
● English Wikipedia was (and still is) the most abundant in information
○ main focus of early versions of DBpedia
● Limited support for extraction of non-English Wikipedias
○ Using the English configuration
○ Discarding pages with no English interwiki links
○ Represented with English labels
● English-biased… however:
○ local information is usually better than the English version
○ Not all local information has an English page
(*)
I18n stands for “internationalization (and localization)”: the letter ‘i’, followed by 18 characters, followed by ‘n’
16
DBpedia I18n results
● I18n extractor extensions
● I18n parser extensions
● Articles without enwiki links
61.26% increase in triples
(DBpedia version 3.7)
17
Kontokostas, Bratsas, Auer, Hellmann,
Antoniou, and Metakides, (2012)
Incorporation of Wikimedia Commons
● Wikimedia Media backend
● Extend DBpedia Framework
○ File pages, Media extractors
○ Extend mapping process
● Incorporate
○ Galleries, image annotations
○ Media licensing, geo-data
○ Media metadata
● Resulted in
○ 1.4 billion statements
○ 25M images , 600K artwork,
50K Videos, etc
○ 43M license statements
18
Vaidya, Kontokostas, Knuth, Lehmann, and Hellmann, (2015)
Incorporation of Wikidata
● Wikimedia Structured data backend
○ Maintains interwiki links
○ Structured data related to pages
● Extend DBpedia Framework
○ JSON parser
○ JSON-to-RDF mappings
○ Wikidata extractors
● Incorporate Wikidata
○ Map Wikidata data into the DBpedia ontology
● Resulted in
○ 1.4 billion statements
○ Up to 2.7K daily visitors
19
Ismayilov, Kontokostas, Auer, Lehmann, and Hellmann, (2016)
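To illustrate the JSON-to-RDF mapping step, here is a minimal sketch (not the actual DBpedia Wikidata extractor): a strongly simplified excerpt of a Wikidata entity’s JSON is mapped to a DBpedia ontology triple with rdflib. The property map, the simplified JSON layout and the wikidata.dbpedia.org subject namespace are assumptions made for this example.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import XSD

DBO = "http://dbpedia.org/ontology/"
# Hypothetical, tiny mapping from Wikidata property IDs to DBpedia ontology properties
PROPERTY_MAP = {"P569": URIRef(DBO + "birthDate")}

# Strongly simplified excerpt of a Wikidata entity's JSON (real dumps carry far more structure)
entity = {
    "id": "Q42",
    "claims": {
        "P569": [{"mainsnak": {"datavalue": {"value": {"time": "+1952-03-11T00:00:00Z"}}}}]
    },
}

g = Graph()
subject = URIRef("http://wikidata.dbpedia.org/resource/" + entity["id"])
for pid, statements in entity["claims"].items():
    predicate = PROPERTY_MAP.get(pid)
    if predicate is None:
        continue  # Wikidata property without a DBpedia ontology mapping: skip
    for statement in statements:
        time_value = statement["mainsnak"]["datavalue"]["value"]["time"]
        date = time_value.lstrip("+")[:10]  # keep only the yyyy-mm-dd part
        g.add((subject, predicate, Literal(date, datatype=XSD.date)))

print(g.serialize(format="turtle"))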
DBTax
Unsupervised Learning of an Extensive and Usable DBpedia Taxonomy
● Only 2.8M out of 4.9M entities are classified in the DBpedia ontology
○ e.g. persons, organizations, places, etc.
● Data-driven approach
● Use the Wikipedia categories for classification
● Generated:
○ 1.9K classes (T-Box)
○ 10.7M instance-of assertions (A-Box)
○ 4.26M typed entities (2.32M of which previously had no type)
Fossati, Kontokostas, and Lehmann, 2015
20
Increasing recall & precision in knowledge extraction
21
Results:
Circa 2x increase in data coverage
● 1.4B triples from DBpedia Commons
● 1.4B triples from DBpedia Wikidata
● 62% increase from I18n enhancements
● DBTax
Increasing recall & precision in knowledge extraction
22
● Quality assessment of RDF data
● Focus on validating data against schemas
○ Automate the test case elicitation process
○ Unify existing approaches under a common methodology
■ RDFS, OWL, RS, DCMI Profiles, SPIN, SHACL, ShEx
○ Generic approach that can be used for
■ General-purpose validation
■ Domain-specific validation
■ Diverse use cases
Quality Assessment of
RDF & Linked Data
the Test-driven Quality Assessment Methodology
23
Quality assessment, why?
● Unprecedented volume of structured data on the Web
● Datasets are of varying quality
● OWL schemas are often not sufficiently developed or exploited for quality
evaluation
Software development promotes testing code (TDD); why not test data as well?
Test-driven quality assessment methodology (TDQAM)
24
Test-Driven Development (Software)
● Test case: input on which the program under test is executed during
testing
● Test suite: a set of test cases for testing a program
● Status: Success, Fail or Error
Test cases are implemented largely manually or with limited programmatic
support
Zhu, Hall, and May, 1997
25
Test-Driven Quality Assessment (RDF)
● Test case: a data constraint that involves one or more triples
● Test suite: a set of test cases for testing a dataset
● Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
○ Fail: Violation, warning or notice
RDF: basis for both data and schema
● Unified model facilitates automatic test case generation
● SPARQL serves as the test case definition language
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri, (2014)
26
Example Test Case
A person should never have a birth date after a death date
Test Cases are written in SPARQL:
We query for errors:
● Success: Query returns empty result set
● Fail: Query returns results
○ Every result we get is a violation, warning or notice instance
SELECT ?s WHERE {
?s dbo:birthDate ?v1 .
?s dbo:deathDate ?v2 .
FILTER ( ?v1 > ?v2 ) }
27
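A minimal sketch of how such a test case could be executed (illustrative only, not the RDFUnit implementation): the query is sent to the public DBpedia SPARQL endpoint with the SPARQLWrapper Python library, and a non-empty result set is reported as a set of violations.

from SPARQLWrapper import SPARQLWrapper, JSON

TEST_CASE = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?s WHERE {
  ?s dbo:birthDate ?v1 .
  ?s dbo:deathDate ?v2 .
  FILTER ( ?v1 > ?v2 ) }
"""

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery(TEST_CASE)
endpoint.setReturnFormat(JSON)

violations = endpoint.query().convert()["results"]["bindings"]
if not violations:
    print("Success: no resource has a birth date after its death date")
else:
    print(f"Fail: {len(violations)} violations, e.g. {violations[0]['s']['value']}")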
Data Quality Test Patterns (DQTP)
abstract patterns, which can be further refined into concrete data quality test
cases using test pattern bindings
Bindings
mappings of variables to valid pattern replacements
SELECT ?s WHERE {
?s %%P1%% ?v1 .
?s %%P2%% ?v2 .
FILTER ( ?v1 %%OP%% ?v2 ) }
SELECT ?s WHERE {
?s dbo:birthDate ?v1.
?s dbo:deathDate ?v2.
FILTER ( ?v1 > ?v2 ) }
P1 => dbo:birthDate
P2 => dbo:deathDate
OP => >
28
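The binding mechanism can be pictured as plain placeholder substitution. The sketch below is illustrative only (RDFUnit’s actual implementation differs): the DQTP is turned into the concrete birth/death test case by replacing the %%…%% variables with the bound values.

DQTP = """SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 ) }"""

def instantiate(pattern: str, bindings: dict) -> str:
    """Replace every %%VAR%% placeholder in the pattern with its binding."""
    query = pattern
    for variable, value in bindings.items():
        query = query.replace(f"%%{variable}%%", value)
    return query

test_case = instantiate(DQTP, {"P1": "dbo:birthDate", "P2": "dbo:deathDate", "OP": ">"})
print(test_case)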
Test Auto Generators (TAG)
● Query schema for supported constraints
○ Every result creates a binding to a pattern & instantiates a test case
● Supported schemas at the moment:
○ RDFS
○ OWL
○ IBM Resource Shapes 2.0
○ DCMI Profiles
○ SHACL
● Users write in their preferred (constraint) language, and we translate
● Data-driven, supports common axioms / definitions
○ limited support for the long tail
29
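A minimal sketch of the idea behind a TAG (an illustration under stated assumptions, not RDFUnit code): the schema is queried for rdfs:domain axioms, and each result instantiates a domain test case from a pattern, shown here for a tiny FOAF excerpt with rdflib.

from rdflib import Graph

SCHEMA = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
foaf:age rdfs:domain foaf:Agent .
"""

# TAG query: one binding per rdfs:domain axiom found in the schema
TAG_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?P1 ?T1 WHERE { ?P1 rdfs:domain ?T1 . }
"""

# Domain pattern: subjects using the property but not typed with the declared domain
DOMAIN_PATTERN = """SELECT ?s WHERE {
  ?s <%%P1%%> ?v .
  FILTER NOT EXISTS { ?s a <%%T1%%> } }"""

schema = Graph().parse(data=SCHEMA, format="turtle")
test_cases = [
    DOMAIN_PATTERN.replace("%%P1%%", str(p1)).replace("%%T1%%", str(t1))
    for p1, t1 in schema.query(TAG_QUERY)
]
print(test_cases[0])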
Test coverage of test suite for a dataset
Coverage computation function f: Q → 2^E
Takes a SPARQL query q∈Q corresponding to a test case pattern binding as
input and returns a set of entities.
Coverage metrics for:
● Property domain (dom) & range (ran): F(QS, D) = Σ_{p ∈ F(QS)} pfreq(p)
● Property dependency (pdep) and class dependency (cdep)
● Property cardinality (card) and class instance (mem)
30
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri, (2014)
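One possible reading of the pfreq-based coverage metrics, as a sketch (the exact normalisation in the thesis may differ): pfreq(p) counts the triples using predicate p, and domain coverage is taken here as the fraction of triples whose predicate is checked by at least one domain test case.

from collections import Counter
from rdflib import Graph

def property_frequency(data: Graph) -> Counter:
    """pfreq(p): number of triples in the dataset that use predicate p."""
    return Counter(p for _, p, _ in data)

def domain_coverage(covered_properties, data: Graph) -> float:
    """Fraction of triples whose predicate has at least one domain test case
    in the test suite (normalisation by dataset size is an assumption here)."""
    pfreq = property_frequency(data)
    covered = sum(pfreq[p] for p in covered_properties)
    return covered / max(sum(pfreq.values()), 1)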
Test case reusability
● Tests can be associated with either a dataset or a schema
○ e.g. tests specific to English DBpedia: Articles with template X must have value Y
○ e.g. test about the DBpedia ontology: dbo:birthDate must be before dbo:deathDate
● Automatic and manual tests associated with schemas are reusable
○ e.g. any dataset reusing the DBpedia ontology can reuse the DBpedia test cases
● A dataset can be tested against the schemas that it reuses
○ e.g. dbo, foaf, skos, etc
31
Test-driven quality assessment methodology (TDQAM)
Workflow
32
TDQAM ontology
● Input / Output entirely in RDF
● Models the methodology in OWL
○ test suites, test cases, patterns, TAGs, etc
● Error reporting
○ Four different granularity formats
■ From aggregated to detailed
○ Influenced SHACL
33
Test case reusability (evaluation)
Linked Open Vocabularies, http://lov.okfn.org (as of 10/2013)
● Describes 400 vocabularies in RDF (prefix, URI, description, etc.)
● Run OWL/RDFS TAGs on all vocabularies
● 32K unique reusable test cases
34
TDQAM: general-purpose evaluation
● Implement the methodology in RDFUnit tool
● Tested on 3 crowdsourced and 2 library datasets
○ dbpedia.org
○ nl.dbpedia.org
○ linkedgeodata.org
○ id.loc.gov
○ datos.bne.es
● Defined manual test cases for
○ DBpedia ontology (22), Linked Geo Data ontology (6), SKOS (20)
35
Evaluation overview per source
36
Errors from automatically (TAG) and manually generated constraints; errors per subject
Kontokostas, Westphal, Auer, Hellmann, Lehmann, Cornelissen, and Zaveri, (2014)
Evaluation overview aggregated per schema
Schema & test reusability
37
TDQAM: domain-specific validation improvement
● Quickly improve the domain-specific validation that vocabulary maintainers
provide for their vocabularies
● POC with NLP Domain
○ LEMON: models lexicons and machine-readable dictionaries
○ NIF: Brings interoperability between NLP tools, resources & annotations
38
Lemon & NIF test case elicitation
● Existing maintainer validation
○ Lemon: 24 structural tests in Python; poor reporting, slow
○ NIF: 11 SPARQL queries, incomplete
● Test auto generation based on the schema (TAGs)
● Manual SPARQL test cases for remaining constraints
39
Evaluation
40
Based on Manual TCs
marked as Errors, Warnings or Info
Based on
TAGs
Kontokostas, Brümmer, Hellmann, Lehmann, and Ioannidis, 2014
TDQAM Summary
● Schema-driven design
○ Automates most of the test case elicitation
○ Reusable schema-based test cases across different datasets
● Methodology is accompanied by
○ Open source tool (RDFUnit)
○ Detailed ontology
● Evaluation revealed a substantial number of errors
○ Even for library datasets in the general purpose evaluation
○ Improving maintainer validation
● Feedback for fixing many errors in DBpedia
41
Test-Driven Quality Assessment
in Other Domains
42
TDQA - Mappings Quality Assessment
● RDF is most often generated from non-RDF data and a set of mappings
○ Code to RDF
○ High-level mappings to RDF (e.g. the RDF Mapping Language - RML)
● Violations that are related to the dataset schema are very frequent
○ Incorrect datatypes, wrong domain & range, etc. are usually mapping errors
43
Mapping error propagation
[Diagram: a source table (Name/Age: Dimitris 36, John 31, Mary 37) is mapped with a faulty
mapping that types ex:{Name} as foaf:Project and emits {Age} as xsd:float for foaf:age, although
the schema declares foaf:age with domain foaf:Agent and range xsd:int; the error is repeated in
every generated instance (ex:Dimitris, ex:John, ex:Mary), while correcting the mapping once
(foaf:Agent, xsd:int) fixes all of them.]
44
Discover violations before they are generated
● Apply TDQAM directly on the mappings (see the sketch after this slide)
○ Using schema information (e.g. domain & range of foaf:age)
● Use RML as a prototype mapping language
○ Extend RML for the DBpedia infobox to ontology mappings
● Automatic detection and refinement suggestions
○ class & property disjointness, range, language tags, domain, and deprecation
● Some violations are inherent to the data
○ Cardinality, functionality, symmetricity, irreflexivity
45
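A minimal sketch of the mapping-level check (illustrative assumptions only; the thesis uses RML and the full TDQAM machinery): the datatype a simplified mapping would emit for foaf:age is compared against the rdfs:range declared in the schema, so the violation from the diagram above is caught before any instance data is generated.

from rdflib import Graph, URIRef
from rdflib.namespace import RDFS, XSD

SCHEMA = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
foaf:age rdfs:domain foaf:Agent ; rdfs:range xsd:int .
"""

# Simplified stand-in for one predicate-object map of an RML mapping:
# which property it emits and with which datatype
mapping = {"predicate": URIRef("http://xmlns.com/foaf/0.1/age"), "datatype": XSD.float}

schema = Graph().parse(data=SCHEMA, format="turtle")
declared_range = schema.value(subject=mapping["predicate"], predicate=RDFS.range)
if declared_range is not None and declared_range != mapping["datatype"]:
    print(f"Mapping violation: {mapping['predicate']} is mapped to {mapping['datatype']}, "
          f"but the schema declares range {declared_range}")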
Mapping TDQAM evaluation
46
Dimou, Kontokostas, Freudenberg, Verborgh,
Lehmann, Mannens, Hellmann, and Walle, 2015
[Evaluation chart: reported values of 8%, 13% and 98%]
TDQAM @ Wolters Kluwer
● Continuous...
○ Triplification
○ Schema changes and enhancements
● Data quality assurance
○ Manual
Goal:
● Schema-driven testing
● Automated
● Integrated in engineering workflow
47
Workflow
TDQAM / RDFUnit
- In Jenkins CI
- JUnit extension
48
Qualitative feedback from experts & users
WKD staff, software & data development: expert & usability interview evaluation
● Productivity
○ Almost immediate feedback
● Quality
○ Process helpful to spot and address errors
○ Successful tests are less significant
● Agility
○ Easy to add new constraints or adjust existing ones
○ Fully automated process
○ Adding more documents increases total time
Kontokostas, Mader, Dirschl, Eck, Leuthold, Lehmann, and Hellmann, 2016
49
Increasing recall & precision in knowledge extraction
50
Results:
● Test-driven quality assessment methodology
○ Unifies pre-existing data constraining approaches
○ Influenced SHACL
● Methodology applicable for
○ General purpose
○ Domain-specific validation
○ Diverse use cases
● RDFUnit Implementation
● 38% increase in schema conformance in the English DBpedia
○ Measured for ALIGNED (EU project) mid-term report deliverable
○ From Spring 2015 to Summer 2016
Overview & Impact
51
Thesis impact on DBpedia
Doubled the number of extracted triples
38% increase in schema conformance
Given the uptake of DBpedia this should have a big network effect
but… usage scenarios are rarely known ($$$)
however, traffic keeps increasing
52
TDQA & RDFUnit
● ALIGNED Project / WP 4: Data Quality Engineering
● Part of Linked Data Stack (GeoKnow & LOD2 EU projects)
● Uptake by the research community
○ circa 200 total citations
● Known Industrial adoption
○ Wolters Kluwer, Semantic Web Company (ALIGNED)
○ Ontotext, GeoPhy, Springer Nature, Oxford University Press, Europeana, Eccenca, ...
53
Standardization
SHACL
● Active member of the Shapes W3C Working Group
● Became one of the specification editors in March 2016
● SHACL transitioned to a W3C Recommendation in July 2017
ShEx
● Originally planned to merge with SHACL
● Diverged due to core differences in language semantics
● Serving as the ShEx Community Group chair for the last year
54
Thesis publications
● 11 peer-reviewed conference papers
● 4 Journal Publications
● 5 peer-reviewed demo-posters
● 4 workshop proceedings
○ Linked Data Quality (LDQ) workshop series
● 1 Journal Special Issue
○ SWJ: Quality Management of Semantic Web Assets
● 1 Book Chapter
● 1 Book (Validating RDF Data, co-author)
○ Synthesis Lectures on the Semantic Web: Theory and Technology
● 1 W3C Standard (SHACL, co-editor)
55
Overview
Comprehensive set of research and engineering tasks for increasing both
precision and recall in large-scale multilingual knowledge extraction, with a
focus on DBpedia.
The results of this thesis have already been contributed back to the scientific &
industrial community through an improved DBpedia open data stack, open
source tools, services and specifications
56
Thank you for your attention!
Looking forward to answering
all your questions...
57
Skipped - Backup Slides
58
TDQAM vs Reasoners
● SPARQL test cases detect a subset of validation errors detectable by an
OWL reasoner. Limited by:
○ SPARQL endpoint reasoning support
○ limitations of the OWL-to-SPARQL translation.
● SPARQL test cases detect validation errors not expressible in OWL
● OWL reasoning is often not feasible on large datasets.
● Datasets are already deployed and accessible via SPARQL endpoints
● Pattern library: a more user-friendly approach for building validation rules
compared to modelling OWL axioms
○ still requires some familiarity
○ uncommon validations require manual SPARQL test cases
59
Increasing recall & precision in knowledge extraction
60
Limitations of TDQAM:
● Recursion
○ Partial support planned using SPARQL property paths
● Disjunctive constraints
○ In progress
● Complete language conversion (long-tail)
○ e.g. OWL to SPARQL algorithms
Evolution & quality
Data evolves
so do ontologies
so do RDF mappings
so does code
so do SPARQL queries
so do constraints
http://aligned-project.eu
DBpedia mapping validation
62
http://mappings.dbpedia.org/validation/
TDQAM ontology - Definition & generation
63
TDQAM ontology - Result representation
64
[SHACL]
Schema Enrichment (optional)
65
TAG Example
INVFUNC pattern
A general pattern from which we want to auto-generate test cases
Test Auto Generators
Query the schema for bindings
Bindings
for every result of a TAG, bind values
to a pattern and instantiate a test case
SELECT ?s WHERE {
?a %%P1%% ?resource .
?b %%P1%% ?resource .
FILTER (?a != ?b)}
SELECT DISTINCT ?P1 WHERE {
?P1 rdf:type
owl:InverseFunctionalProperty.}
INVFUNC
BIND (P1 -> foaf:homepage)
66