Entity Typing Using Distributional Semantics and DBpedia

Presentation given at the NLP&DBpedia workshop on 18 October 2016. It accompanies the work described in: https://nlpdbpedia2016.files.wordpress.com/2016/09/nlpdbpedia2016_paper_9.pdf

Marieke van Erp and Piek Vossen

Conclusions
• Fine-grained entity typing is necessary for semantic queries over text
• The search space for Word2Vec is large; topics help
• Combining distributional semantics with DBpedia can help overcome NIL and dark entities
https://github.com/MvanErp/entity-typing/

Dark entities: little or no information available in KB

Distributional Semantics
• Similar concepts (denoted by words) occur in similar contexts
• Word2Vec (Mikolov et al., 2013) is a popular implementation of this notion
[Figure: example concepts: Sushi, Teriyaki, Udon, Okonomiyaki, Soba, Sashimi, Kimono, Yukata, Nemaki, KFC, Steak, Hamburger, McDonald’s, Jeans, T-shirt, Skirt]
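
As a rough illustration of "similar concepts occur in similar contexts", the sketch below measures cosine similarity between word vectors. The three-dimensional vectors are hypothetical values invented for this example; real Word2Vec vectors are learned from a corpus and have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (assumption): not taken from any trained model.
toy_vectors = {
    "Sushi": [0.9, 0.1, 0.0],
    "Sashimi": [0.8, 0.2, 0.1],
    "Kimono": [0.1, 0.9, 0.2],
}

sim_food = cosine(toy_vectors["Sushi"], toy_vectors["Sashimi"])
sim_clothing = cosine(toy_vectors["Sushi"], toy_vectors["Kimono"])

print(f"Sushi ~ Sashimi: {sim_food:.3f}")
print(f"Sushi ~ Kimono:  {sim_clothing:.3f}")
```

With these made-up values, the two food concepts score closer to each other than the food and clothing concepts do, which is the property the typing approach relies on.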

Research Question:
• Can we predict the type of the concept ‘Sushi’ by
modelling it in a distributional semantics space and
comparing its vector to the vectors of concepts for which
we do know the type?

Setup
• 7 named entity linking benchmark datasets (AIDA-YAGO, 2014 NEEL, 2015 NEEL, OKE2015, RSS500, WES2015, Wikinews)
• 3 Word2Vec models: GoogleNews, English Wikipedia, Reuters RCV1*
• Compare all entities within a dataset to each other and return the highest-ranking type (as taken from DBpedia)
* AIDA-YAGO is part of Reuters RCV1
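
The comparison step in the setup above could be sketched as follows. This is only one possible reading, not the authors' implementation: it assumes each typed entity votes for its type with its cosine similarity to the target, and types are ranked by the summed votes. The function name `rank_types`, the toy vectors and the type labels are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_types(target, vectors, known_types):
    """Rank candidate types for `target` by summed similarity to typed entities.

    Assumption: summed cosine similarity is the ranking score; the slides do
    not specify the scoring function.
    """
    scores = defaultdict(float)
    for entity, entity_type in known_types.items():
        # Skip the target itself and entities without a vector.
        if entity == target or entity not in vectors:
            continue
        scores[entity_type] += cosine(vectors[target], vectors[entity])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy vectors and illustrative type labels (assumptions).
vectors = {
    "Sushi": [0.9, 0.1, 0.0],
    "Sashimi": [0.8, 0.2, 0.1],
    "Udon": [0.7, 0.1, 0.2],
    "Kimono": [0.1, 0.9, 0.2],
    "Jeans": [0.2, 0.8, 0.1],
}
known_types = {
    "Sashimi": "Food",
    "Udon": "Food",
    "Kimono": "Clothing",
    "Jeans": "Clothing",
}

ranking = rank_types("Sushi", vectors, known_types)
print(ranking)
```

Summing similarities favours types with many members, so an average or a top-k vote may be a fairer score when the type distribution is skewed.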

Initial results
• Not so great?

Initial results (some footnotes)
• The ranking approach favours fine-grained entity types
• The Word2Vec corpus matters! NEEL2014 and NEEL2015 are derived from tweets, which typically have low coverage when querying news corpora
• Smaller datasets (Wikinews, WES2015, OKE2015) do better?

Let’s zoom in on topics
• Initially, all entities within a benchmark dataset were compared to all other entities.
• What if we only compare entities from sports documents to other entities from sports documents?
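
The effect of restricting comparisons to a topic can be sketched by counting candidate pairs. The entity names and topic labels below are hypothetical, and the sketch simplifies by giving each entity a single topic, whereas in the setup described here topics belong to documents.

```python
from collections import defaultdict
from itertools import combinations


def pairs_all(entities):
    """Every unordered pair of entities in the dataset."""
    return list(combinations(sorted(entities), 2))


def pairs_within_topic(entity_topics):
    """Only the unordered pairs whose two entities share a topic."""
    by_topic = defaultdict(list)
    for entity, topic in entity_topics.items():
        by_topic[topic].append(entity)
    pairs = []
    for members in by_topic.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


# Hypothetical entity-to-topic assignment (assumption, for illustration only).
entity_topics = {
    "Ajax": "sports",
    "Feyenoord": "sports",
    "PSV": "sports",
    "Reuters": "business",
    "KFC": "business",
    "McDonald's": "business",
}

all_pairs = pairs_all(entity_topics)
topic_pairs = pairs_within_topic(entity_topics)

print(f"all comparisons:          {len(all_pairs)}")
print(f"within-topic comparisons: {len(topic_pairs)}")
```

Pairs that cross topics are never compared, so the candidate set for each entity shrinks to entities from the same kind of document.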
[Figure: six result panels for AIDA-YAGO, titled "AIDA-YAGO Coarsegrained Categories GoogleNews Fine", "AIDA-YAGO Coarsegrained Categories RCV1 Fine", "AIDA-YAGO Coarsegrained Categories Wikipedia Fine", "AIDA-YAGO Finegrained Categories GoogleNews Fine", "AIDA-YAGO Finegrained Categories RCV1 Fine" and "AIDA-YAGO Finegrained Categories Wikipedia Fine". The coarse-grained panels number their categories 1 to 22 and the fine-grained panels 1 to 69; the remaining axis ticks are 20 to 100 and 1, 5, 10.]

Conclusions and Future Work
• Difficult task, but topics help
• Ranking needs to be improved
• Multi-class classification (KFC: food & organisation, Arnold Schwarzenegger: actor & politician)
• What else can we discover beyond type?

Thank you!

This research was made possible by the CLARIAH-CORE project financed by NWO.
http://www.clariah.nl
