SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
Alena Vasilevich
Computational Linguist@Coreon
ML-powered Taxonomization:
AI Lends Taxonomists a Hand
alena@coreon.com
https://www.linkedin.com/company/coreon-gmbh
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Coreon: Backbone for the Data
convert data into knowledge
incorporate terminologies
taxonomies
ontologies
thesauri
vocabularies
into one Knowledge Graph
concept-oriented and language-agnostic
data model
intuitive, lightweight UI
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH 3
Agenda
Structured data and IATE as a resource
Manual Taxonomization
Automatic Taxonomization
Collaborative-AI Approach
Industry use case: topic classification
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Perks of Structured Data
4
A Powerful Resource for AI/ML projects
Cross-lingual Data Analysis
Enterprise Search
Actionable intelligence
Cross-border Interoperability
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Interactive Terminology for Europe, IATE
5
Introduced in 2004, used by most EU Institutions,
covers all EU domains
Recent focus on healthcare, financial crisis,
environment, fisheries, and migration
EuroVoc for domain classification system
Number of concepts: 961 116
Number of terms: 7 992 325
New terms last week: 1 646
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Study Objective
TODO: draft a deeply-structured taxonomy, _fast_ 🚀
INPUT: a flat set of COVID concepts, no relations between them
OUTPUT: a hierarchical knowledge graph,
consumable via REST/SPARQL
6
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Two Tested Approaches
7
export 424 Covid concepts
into TBX file
agree on
• top level nodes
• max leaf size
study domain study domain
measure and compare
• time
• edit actions
• taxonomies
load into Coreon
build taxonmy from
scratch
taxonomize
automatically
name generated concepts
move wrongly grouped
semi-automatic manual
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH 8
Manual Taxonomization
 Top level nodes, temporary
helper buckets, and lots to
do…
 Concept card displaying
important metadata
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Manual Taxonomization: Editing Actions
9
dragging ‘inflammation’
from ‘diseases’
to ‘immune system’
drag’n’drop
pin
filter
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Manual Taxonomization: Result
10
load into Coreon
build taxonomy
from scratch
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Auto-Taxonomization:
Data + WE + Community Detection Algorithm
11
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Human Revision of AI-Drafted Taxonomy
12
initial situation
after automatic
taxonomization
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Good Clustering, Bad Clustering?
13
 55 clusters, majority pretty accurate
 some clusters are off, and we blame WE:
 ‘interstitial space’ and ‘hospital pharmacy’
 spaces appearing in similar “semantic neighborhoods”
 some existing IATE concepts became parents of concept clusters
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Collaborative Taxonomization: Result
14
automatic
taxonomization using
ML algorithm
name auto concepts
move wrong concepts
load into Coreon
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Effort and Perfomance
15
Metrics
Taxonomization
Manual Semi-Automatic
Curator‘s recorded time (hours) 40 8
Relations created / changed 1 147 432
Concepts created 115 28
Intermediate structural nodes renamed — 45
Overall relations 679 470
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Resulting Taxonomies
16
load into Coreon
build taxonmy
from scratch
automatic
taxonomization using
ML algorithm
name auto concepts
move wrong concepts
load into Coreon
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Taxonomization Benefits
17
Effective way to add structure to data
Improve data quality
avoid duplicates and overlapping concepts
associative relations
Easier and safer data maintenance
Formalize multilingual knowledge,
make it machine-digestible
Boost performance of AI algorithms,
priming them with structured data
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Text Classification
18
label
learn
tune
test
F1 score
classify
training
documents
production
documents
split Training/Dev/Test
1500 / 300 / 800
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
CNN Predictions and True Labels / Test Set
19
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Document Classification: Metrics
20
Classifiers
Metrics (micro-averages), %
Precision Recall F-1
Non-initialized CNN 81.6 75.4 78.4
Initialized CNN 82.5 78.8 80.6
🤗 Zero-shot 0.95 threshold 15.3 37.8 21.7
🤗 Zero-shot 0.97 threshold 15.0 26.3 19.1
🤗 Zero-shot 0.99 threshold 12.0 10.2 11.0
15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH
Taxonomized Data to Enhance AI Performance
21
Label documents automatically
Boost CNN with taxonomy
Enjoy finer granularity in document
classification
Get multilingual for free
Thank You!
22
@coreonapp
@lennyvasilevich
https://www.linkedin.com/in/alenavasilevich
alena@coreon.com

Contenu connexe

Plus de Connected Data World

In Search of the Universal Data Model
In Search of the Universal Data ModelIn Search of the Universal Data Model
In Search of the Universal Data Model
Connected Data World
 
Graph Realities
Graph RealitiesGraph Realities
Graph Realities
Connected Data World
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
Connected Data World
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property Graphs
Connected Data World
 
Ontology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesOntology Services for the Biomedical Sciences
Ontology Services for the Biomedical Sciences
Connected Data World
 
One Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACLOne Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACL
Connected Data World
 
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
Connected Data World
 

Plus de Connected Data World (20)

In Search of the Universal Data Model
In Search of the Universal Data ModelIn Search of the Universal Data Model
In Search of the Universal Data Model
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
 
Graph Realities
Graph RealitiesGraph Realities
Graph Realities
 
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
 
Semantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scaleSemantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scale
 
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
 
Schema, Google & The Future of the Web
Schema, Google & The Future of the WebSchema, Google & The Future of the Web
Schema, Google & The Future of the Web
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property Graphs
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
 
Graph for Good: Empowering your NGO
Graph for Good: Empowering your NGOGraph for Good: Empowering your NGO
Graph for Good: Empowering your NGO
 
What are we Talking About, When we Talk About Ontology?
What are we Talking About, When we Talk About Ontology?What are we Talking About, When we Talk About Ontology?
What are we Talking About, When we Talk About Ontology?
 
Ontology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesOntology Services for the Biomedical Sciences
Ontology Services for the Biomedical Sciences
 
Develop A Basic Recommendation System using Cypher
Develop A Basic Recommendation System using CypherDevelop A Basic Recommendation System using Cypher
Develop A Basic Recommendation System using Cypher
 
A Semi-Automatic Tool for Linked Data Integration
A Semi-Automatic Tool for Linked Data IntegrationA Semi-Automatic Tool for Linked Data Integration
A Semi-Automatic Tool for Linked Data Integration
 
One Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACLOne Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACL
 
Dow Jones: Reimagining the News as a Knowledge Graph
Dow Jones: Reimagining the News as a Knowledge GraphDow Jones: Reimagining the News as a Knowledge Graph
Dow Jones: Reimagining the News as a Knowledge Graph
 
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
 
Graph intelligence: the future of data-driven investigations
Graph intelligence: the future of data-driven investigationsGraph intelligence: the future of data-driven investigations
Graph intelligence: the future of data-driven investigations
 

Dernier

Dernier (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Machine Learning-powered Taxonomization: AI Lends Taxonomists a Hand

  • 1. Alena Vasilevich Computational Linguist@Coreon ML-powered Taxonomization: AI Lends Taxonomists a Hand alena@coreon.com https://www.linkedin.com/company/coreon-gmbh
  • 2. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Coreon: Backbone for the Data convert data into knowledge incorporate terminologies taxonomies ontologies thesauri vocabularies into one Knowledge Graph concept-oriented and language-agnostic data model intuitive, lightweight UI
  • 3. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH 3 Agenda Structured data and IATE as a resource Manual Taxonomization Automatic Taxonomization Collaborative-AI Approach Industry use case: topic classification
  • 4. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Perks of Structured Data 4 A Powerful Resource for AI/ML projects Cross-lingual Data Analysis Enterprise Search Actionable intelligence Cross-border Interoperability
  • 5. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Interactive Terminology for Europe, IATE 5 Introduced in 2004, used by most EU Institutions, covers all EU domains Recent focus on healthcare, financial crisis, environment, fisheries, and migration EuroVoc for domain classification system Number of concepts: 961 116 Number of terms: 7 992 325 New terms last week: 1 646
  • 6. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Study Objective TODO: draft a deeply-structured taxonomy, _fast_ 🚀 INPUT: a flat set of COVID concepts, no relations between them OUTPUT: a hierarchical knowledge graph, consumable via REST/SPARQL 6
  • 7. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Two Tested Approaches 7 export 424 Covid concepts into TBX file agree on • top level nodes • max leaf size study domain study domain measure and compare • time • edit actions • taxonomies load into Coreon build taxonmy from scratch taxonomize automatically name generated concepts move wrongly grouped semi-automatic manual
  • 8. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH 8 Manual Taxonomization  Top level nodes, temporary helper buckets, and lots to do…  Concept card displaying important metadata
  • 9. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Manual Taxonomization: Editing Actions 9 dragging ‘inflammation’ from ‘diseases’ to ‘immune system’ drag’n’drop pin filter
  • 10. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Manual Taxonomization: Result 10 load into Coreon build taxonomy from scratch
  • 11. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Auto-Taxonomization: Data + WE + Community Detection Algorithm 11
  • 12. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Human Revision of AI-Drafted Taxonomy 12 initial situation after automatic taxonomization
  • 13. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Good Clustering, Bad Clustering? 13  55 clusters, majority pretty accurate  some clusters are off, and we blame WE:  ‘interstitial space’ and ‘hospital pharmacy’  spaces appearing in similar “semantic neighborhoods”  some existing IATE concepts became parents of concept clusters
  • 14. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Collaborative Taxonomization: Result 14 automatic taxonomization using ML algorithm name auto concepts move wrong concepts load into Coreon
  • 15. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Effort and Perfomance 15 Metrics Taxonomization Manual Semi-Automatic Curator‘s recorded time (hours) 40 8 Relations created / changed 1 147 432 Concepts created 115 28 Intermediate structural nodes renamed — 45 Overall relations 679 470
  • 16. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Resulting Taxonomies 16 load into Coreon build taxonmy from scratch automatic taxonomization using ML algorithm name auto concepts move wrong concepts load into Coreon
  • 17. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Taxonomization Benefits 17 Effective way to add structure to data Improve data quality avoid duplicates and overlapping concepts associative relations Easier and safer data maintenance Formalize multilingual knowledge, make it machine-digestible Boost performance of AI algorithms, priming them with structured data
  • 18. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Text Classification 18 label learn tune test F1 score classify training documents production documents split Training/Dev/Test 1500 / 300 / 800
  • 19. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH CNN Predictions and True Labels / Test Set 19
  • 20. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Document Classification: Metrics 20 Classifiers Metrics (micro-averages), % Precision Recall F-1 Non-initialized CNN 81.6 75.4 78.4 Initialized CNN 82.5 78.8 80.6 🤗 Zero-shot 0.95 threshold 15.3 37.8 21.7 🤗 Zero-shot 0.97 threshold 15.0 26.3 19.1 🤗 Zero-shot 0.99 threshold 12.0 10.2 11.0
  • 21. 15-04-2021 AI Lends Taxonomists a Hand Alena Vasilevich, Coreon GmbH Taxonomized Data to Enhance AI Performance 21 Label documents automatically Boost CNN with taxonomy Enjoy finer granularity in document classification Get multilingual for free