Semantically enhanced quality assurance in the jurion business use case
1. Semantically Enhanced
Quality Assurance in the
JURION Business Use Case
Dimitris Kontokostas, Christian Mader,
Christian Dirschl, Katja Eck, Michael Leuthold,
Jens Lehmann, Sebastian Hellmann
2. ESWC 2016
Overview
● Wolters Kluwers overview
● Use Case Tools
● Challenges
● Solutions
● Evaluation
● Future Work
3. ESWC 2016
Wolters Kluwers
Wolters Kluwer provides solutions to customers
in over 170 countries and provides content in at
least a dozen languages.
Focusing on legal, tax, finance and health
industries.
11. ESWC 2016
● TDDD: Test Driven (Data) Development
○ Methodology, definitions & Tools
● SPARQL
● Reusable unit tests for
○ vocabularies
○ datasets
○ applications
● Test Auto Generators
○ OWL
○ IBM Shapes
○ DSP (Dublin Core Set Profiles)
○ W3c Shapes (in progress)
● Open Source (Apache license)
● Stable tool, used in many research & industrial settings
http://rdfunit.aksw.org
12. ESWC 2016
https://www.poolparty.biz
● Commercial product developed by Semantic Web Company
● Thesauri development in a collaborative way
○ From scratch / by extraction of terms from a document corpus
● Compliance to the 5-star Open Data principles (RDF & SKOS)
● Automatically retrieve potential additional concepts for inclusion into the
thesauri by querying SPARQL endpoints (e.g. DBpedia)
● identify and link to related resources from local / remote projects
● Simple ontology editing (rdf:type, rdfs:subClassOf, rdfs:domain/range,...)
● Automated quality assurance mechanisms
○ Conformance to SKOS or a custom schema
○ Enforcement level of some quality metrics can be configured by the user so that it is, e.g.,
possible to get an alert if circular hierarchical relation
○ Check a taxonomy “as a whole” against a set of potential quality violations
14. ESWC 2016
Metadata RDF Conversion Verification
Existing Infrastructure
● Platform Content Interface (PCI) ontology
○ proprietary schema that describes legal documents and metadata in OWL
● PCI revisions => verify data conforms to PCI
● Proprietary SOAP-based validation service
○ Package based validation => hard error detection
○ Asynchronous & complex web service => hard to use
○ Network dependency => potentially unstable
15. ESWC 2016
Metadata RDF Conversion Verification
Continuous & high quality triplification of semi-structured data is a common
problem in the information industry. Schema changes and enhancements are
routine tasks, but ensuring data quality is still very often purely manual effort.
So any automation will support a lot of real-life use cases in different domains.
Goal: Based on the schema, test cases should automatically be created, which
are run on a regular basis against the data that needs to be transformed. The
errors detected will lead to refinements and changes of the XSLT scripts and
sometimes also to schema changes, which impose again new automatically
created test cases
18. ESWC 2016
Quality Control in Thesaurus Management
● WKD develops multiple controlled vocabularies for annotating documents (e.g.,
court decision, labour law,...) using PoolParty
● Interconnected to each other
● Consistency and quality must be ensured over all vocabularies
● Various quality issues, e.g.,
○ Duplicates
○ Links to deprecated (deleted) concepts
○ Unresolvable links
● Up to now curated manually in deployed system, regular errors in production
versions
19. ESWC 2016
Quality Control in Thesaurus Management
The creation and maintenance of knowledge models is gaining importance in the
Web of Data. These tasks are increasingly being executed by SME’s in the domain,
not in knowledge modelling and IT as such. Therefore, better automatic support of
these processes will directly help achieving quality and efficiency gains.
● Automated quality checks over multiple vocabularies
● Improved notifications: email on changes performed by users
● Additional statistics on, e.g, vocabulary dependencies, changes, etc
20. ESWC 2016
Vocabulary link validation (PoolParty)
● Uses project metadata to identify linked vocabularies
● Link is invalid if target concept is either deprecated or deleted
● Creates a report for human curators
● Vocabulary repair still manual process
Quality Control in Thesaurus Management
21. ESWC 2016
Results & Evaluation
The analysis is based on measured metrics and the qualitative feedback of
experts and users.
Participants of the evaluation study were selected from WKD staff in the fields of
software development and data development. There were seven participants in
total: four involved in the expert evaluation and three content experts involved in
the usability/interview evaluation.
● Productivity
● Quality
● Agility
22. ESWC 2016
Productivity (RDFUnit)
● Total time for quality checks and error detection
● The time need for manual interaction.
What we measured:
● 1ms to 50ms per single test (depending on the document / ontology size)
○ as close to real-time as possible, currently a couple of minutes
● Quality checks can be triggered by manual execution, but they are always
verified automatically by the CI build system
● A total of 44.000 tests with a total duration of 11 minutes
○ may scale-up easily when parallelized or clustered
23. ESWC 2016
Quality (RDFUnit)
What kind of errors can be detected and is categorization possible?
● Experts concluded that it is helpful to spot errors introduced by changes,
since issues spotted in this way can be assumed to point to really existing
errors; the causes of which can be identified and addressed
● Successful tests are less significant as we are not yet able to evaluate
whether and how the measurements taken correspond to target measures and
these tests do not point to concrete errors.
○ Coverage & other metrics needed
24. ESWC 2016
Agility (RDFUnit)
… time to include new requirements
● Including new constraints or adapting existing constraints works by adding
new reference documents to the input dataset to make the test environment
as representative as possible.
● The process of generating tests and testing is fully automated, it adapts very
easily to changed parameters.
● Adding more documents to the input dataset increases the total runtime
25. ESWC 2016
Productivity (PoolParty)
● The number of checked links
● The number of violations
● The total time
What we measured:
The presentation of the results was well understood. In general, the tool was
received well by the experts, which was reflected by their feedback in the
interviews.
27. ESWC 2016
Agility (PoolParty)
… integration, configuration time and extension
● Very useful for getting an overview
● cases it is desired to limit the link lookups and adapt the way links to external
datasets are detected
○ Use custom base URI or regular expression-based techniques
● Re-configuration is possible but recompiling the application might be needed
○ Plans to delegate this process to unified views
28. ESWC 2016
Future Work
● Error analysis (statistics, time to fix an issue, regressions)
● Test coverage and better metrics
● Improve the UI of the Link Validation tool
● Provide more advanced settings
● Inter-repository Link Validation
29. ESWC 2016
Thank You!
Questions ?
(You might want to) take a look at…
RDF and XML Interoperability W3c Community group
https://www.w3.org/community/rax/