* A talk by Alena Vasilevich, Computational Linguist at Coreon.
In realm of data-driven businesses, formalized knowledge is a valuable resource for AI projects, created at great expense.
IATE, with almost one million concepts storing multilingual terms and metadata, holds a large part of the textual knowledge of the EU. However, it can only be accessed lexically, and the database concepts stand alone.
Taxonomization is linking a flat set of concepts into a hierarchical knowledge graph. So if IATE were converted in a full-fledged ontology, its data could not only be consumed by linguists, but would also become accessible for machines through e.g. a SPARQL endpoint.
In this talk, we will present our approach to a semi-automatic generation of taxonomised concept maps, elevating a sub-domain of IATE terminology into a multilingual knowledge graph. We taxonomized a flat list of concepts within the COVID sub-domain, benchmarking two approaches to tackle this task: automatic concept map creation using an enhanced ML-powered language model and manual creation of the graph by a linguist expert.
We will dwell on performance and resource-saving advantages of our collaborative method, made easy by Coreon user-friendly UI, and show how the achieved productivity rate can make the taxonomization of even larger terminology databases economically viable.
To demonstrate empirically the effectiveness of the semi-automatic approach in a typical industry use case scenario, the resulting IATE/Covid graph was used to initialize a CNN for a multilingual document classification task. Leveraging the created taxonomy, we got a classification granularity that is not reachable by state-of-the-art models, such as non-initialised CNNs and zero-shot classifiers.
2. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Coreon: Backbone for the Data
convert data into knowledge
incorporate terminologies
taxonomies
ontologies
thesauri
vocabularies
into one Knowledge Graph
concept-oriented and language-agnostic
data model
intuitive, lightweight UI
3. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH 3
Agenda
Structured data and IATE as a resource
Manual Taxonomization
Automatic Taxonomization
Collaborative-AI Approach
Industry use case: topic classification
4. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Perks of Structured Data
4
A Powerful Resource for AI/ML projects
Cross-lingual Data Analysis
Enterprise Search
Actionable intelligence
Cross-border Interoperability
5. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Interactive Terminology for Europe, IATE
5
Introduced in 2004, used by most EU Institutions,
covers all EU domains
Recent focus on healthcare, financial crisis,
environment, fisheries, and migration
EuroVoc for domain classification system
Number of concepts: 961 116
Number of terms: 7 992 325
New terms last week: 1 646
6. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Study Objective
TODO: draft a deeply-structured taxonomy, _fast_ 🚀
INPUT: a flat set of COVID concepts, no relations between them
OUTPUT: a hierarchical knowledge graph,
consumable via REST/SPARQL
6
7. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Two Tested Approaches
7
export 424 Covid concepts
into TBX file
agree on
• top level nodes
• max leaf size
study domain study domain
measure and compare
• time
• edit actions
• taxonomies
load into Coreon
build taxonmy from
scratch
taxonomize
automatically
name generated concepts
move wrongly grouped
semi-automatic manual
8. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH 8
Manual Taxonomization
Top level nodes, temporary
helper buckets, and lots to
do…
Concept card displaying
important metadata
9. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Manual Taxonomization: Editing Actions
9
dragging ‘inflammation’
from ‘diseases’
to ‘immune system’
drag’n’drop
pin
filter
10. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Manual Taxonomization: Result
10
load into Coreon
build taxonomy
from scratch
11. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Auto-Taxonomization:
Data + WE + Community Detection Algorithm
11
12. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Human Revision of AI-Drafted Taxonomy
12
initial situation
after automatic
taxonomization
13. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Good Clustering, Bad Clustering?
13
55 clusters, majority pretty accurate
some clusters are off, and we blame WE:
‘interstitial space’ and ‘hospital pharmacy’
spaces appearing in similar “semantic neighborhoods”
some existing IATE concepts became parents of concept clusters
14. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Collaborative Taxonomization: Result
14
automatic
taxonomization using
ML algorithm
name auto concepts
move wrong concepts
load into Coreon
15. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Effort and Perfomance
15
Metrics
Taxonomization
Manual Semi-Automatic
Curator‘s recorded time (hours) 40 8
Relations created / changed 1 147 432
Concepts created 115 28
Intermediate structural nodes renamed — 45
Overall relations 679 470
16. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Resulting Taxonomies
16
load into Coreon
build taxonmy
from scratch
automatic
taxonomization using
ML algorithm
name auto concepts
move wrong concepts
load into Coreon
17. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Taxonomization Benefits
17
Effective way to add structure to data
Improve data quality
avoid duplicates and overlapping concepts
associative relations
Easier and safer data maintenance
Formalize multilingual knowledge,
make it machine-digestible
Boost performance of AI algorithms,
priming them with structured data
18. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Text Classification
18
label
learn
tune
test
F1 score
classify
training
documents
production
documents
split Training/Dev/Test
1500 / 300 / 800
19. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
CNN Predictions and True Labels / Test Set
19
21. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Taxonomized Data to Enhance AI Performance
21
Label documents automatically
Boost CNN with taxonomy
Enjoy finer granularity in document
classification
Get multilingual for free