Linking Entities for Enriching and Structuring Social Media Content
Raphaël Troncy <raphael.troncy@eurecom.fr>
@rtroncy
12/04/2016 NLPIT Workshop @ WWW 2016 - 2
Extracting and Linking Entities (NER/NEL)
“Tampa Bay Lightning vs Canadiens in Montreal tonight with @erikmannens #hockey #NHL”
https://www.youtube.com/watch?v=Rmug-PUyIzI
Part of Speech (GATE Twitter POS)
Tampa NNP
Bay NNP
Lightning NNP
vs CC
Canadiens NNP
in IN
Montreal NNP
tonight NN
with IN
@erikmannens USR
#hockey HT
#NHL HT
https://gate.ac.uk/wiki/twitter-postagger.html
NER: What is NHL?
NEL: Which Montreal are we talking about?
What is #NHL? Type Ambiguity
Sports League
Organization
Place
Railway Line
What is #NHL? Type Ambiguity
http://schema.org/SportsEvent
http://dbpedia.org/ontology/Event
http://schema.org/Organization
http://dbpedia.org/ontology/IceHockeyLeague
Different infobox templates
Named Entity Recognition (NER)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC
tonight NN O
with IN O
@erikmannens USR PER
#hockey HT THG
#NHL HT ORG
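The token-level labels above can be grouped into entity mentions. The sketch below assumes that adjacent tokens carrying the same non-O label form a single mention; the slide does not state the chunking scheme, so this rule and the function name `group_mentions` are illustrative only.

```python
from itertools import groupby

# (token, POS tag, entity label) triples copied from the slide; "O" means "not an entity".
TAGGED = [
    ("Tampa", "NNP", "ORG"), ("Bay", "NNP", "ORG"), ("Lightning", "NNP", "ORG"),
    ("vs", "CC", "O"), ("Canadiens", "NNP", "ORG"), ("in", "IN", "O"),
    ("Montreal", "NNP", "LOC"), ("tonight", "NN", "O"), ("with", "IN", "O"),
    ("@erikmannens", "USR", "PER"), ("#hockey", "HT", "THG"), ("#NHL", "HT", "ORG"),
]


def group_mentions(tagged):
    """Merge runs of adjacent tokens that share the same non-O label (assumed rule)."""
    mentions = []
    for label, run in groupby(tagged, key=lambda triple: triple[2]):
        if label != "O":
            mentions.append((" ".join(token for token, _, _ in run), label))
    return mentions


if __name__ == "__main__":
    for text, label in group_mentions(TAGGED):
        print(label, text)
```

Under this rule "Tampa Bay Lightning" becomes one ORG mention, while "Canadiens" stays separate because the O-labelled "vs" sits between them.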
What is Montreal? Name Ambiguity
Montréal, Ardèche
Montréal, Aude
Montréal, Gers
Montreal, Wisconsin
Mont-ral, Catalonia
Named Entity Linking (NEL)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC http://dbpedia.org/resource/Montreal
tonight NN O
with IN O
@erikmannens USR PER NIL
#hockey HT THG
#NHL HT ORG
NERD: a framework for comparing NER APIs
• NER
Stanford CoreNLP
• Web APIs
http://nerd.eurecom.fr/
NERD: AlchemyAPI
Incorrect boundaries
No disambiguation
No dereferencing for @mention
NERD: Dandelion
Everything is a Thing
No dereferencing for @mention
NERDML
No dereferencing for @mention
Research Questions
• How can an entity linking system be adapted to different criteria?
• How can an entity linking system be designed to process a large amount of data in near real time?
ADEL: Adaptive Framework for NER
• POS Tagger:
  ◦ uses a bidirectional dependency network
  ◦ combines left-to-right and right-to-left CMMs
• NER:
  ◦ uses a CRF with Gibbs sampling (Monte Carlo for approximate inference) to take n words into account instead of only the previous and next ones
ADEL: Overlap Resolution
• Detect overlaps among extractors using the boundaries of the entities
• Different heuristics can be applied:
  ◦ Merge (default behavior): “United States” and “States of America” => “United States of America”
  ◦ Simple Substring: “Florence” and “Florence May Harding” => “Florence” and “May Harding”
  ◦ Smart Substring: “Giants of New York” and “New York” => “Giants” and “New York”
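The Merge heuristic can be sketched with character offsets. This is a minimal illustration, assuming each extractor reports mentions as (start, end) offsets into the same text; the function name `merge_overlaps` is hypothetical, and the two substring heuristics are not covered.

```python
def merge_overlaps(spans):
    """Merge heuristic: fuse spans whose character ranges overlap into their union."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previously kept span: extend it to cover both.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    text = "United States of America"
    extractor_a = [(0, 13)]   # "United States"
    extractor_b = [(7, 24)]   # "States of America"
    for start, end in merge_overlaps(extractor_a + extractor_b):
        print(text[start:end])
```

Spans that merely touch (one ends exactly where the next starts) are left separate here; whether ADEL treats them as overlapping is not stated on the slide.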
ADEL: KB Indexing
• Create an index from DBpedia and Wikipedia
• Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
ADEL: Adaptive Framework for NEL
• Generate candidate links for all extracted mentions:
  ◦ If any candidates exist, they go to the linking method
  ◦ If not, the mention is linked to NIL
• Linking method:
  ◦ ADEL linear formula:
    r(l): the score of the candidate l
    L: the Levenshtein distance
    m: the extracted mention
    title: the title of the candidate l
    R: the set of redirect pages associated with the candidate l
    D: the set of disambiguation pages associated with the candidate l
    PR: PageRank associated with the candidate l
    a, b and c are weights with the properties a > b > c and a + b + c = 1
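The formula itself is not reproduced in this transcript, so the sketch below is only one plausible way of combining the listed symbols. It assumes a weighted sum of string similarities (mention vs. title, best redirect, best disambiguation label) scaled by PageRank. The normalisation of Levenshtein distance into a similarity, the default weights, the candidate dictionaries and the PageRank values are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a.lower(), b.lower()) / longest


def score(mention, title, redirects, disambiguations, pagerank, a=0.5, b=0.3, c=0.2):
    """Hypothetical r(l): weighted similarities scaled by the candidate's PageRank."""
    assert a > b > c and abs(a + b + c - 1.0) < 1e-9
    best_r = max((similarity(mention, r) for r in redirects), default=0.0)
    best_d = max((similarity(mention, d) for d in disambiguations), default=0.0)
    return (a * similarity(mention, title) + b * best_r + c * best_d) * pagerank


def link(mention, candidates):
    """Return the best-scoring candidate URI, or "NIL" when there is no candidate."""
    if not candidates:
        return "NIL"
    return max(candidates, key=lambda cand: score(mention, **cand["features"]))["uri"]


# Illustrative candidates only: the PageRank values below are made up.
CANDIDATES = [
    {"uri": "http://dbpedia.org/resource/Montreal",
     "features": {"title": "Montreal", "redirects": ["Montréal"],
                  "disambiguations": ["Montreal (disambiguation)"], "pagerank": 0.9}},
    {"uri": "http://dbpedia.org/resource/Montreal,_Wisconsin",
     "features": {"title": "Montreal, Wisconsin", "redirects": [],
                  "disambiguations": ["Montreal (disambiguation)"], "pagerank": 0.1}},
]

if __name__ == "__main__":
    print(link("Montreal", CANDIDATES))
    print(link("@erikmannens", []))
```

A mention without candidates, such as @erikmannens in the running example, falls through to NIL, matching the candidate-generation rule above.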
ADEL: Pruning for NER/NEL
• k-NN machine learning algorithm
• Why a pruning module?
  ◦ Useful to correct errors from the extractor by removing wrong annotations. Example:
    France played against Russia for a friendly match
    Yesterday, I went to see Against in concert
  ◦ Useful to adapt the annotations to a given guideline. Example: suppose we are participating in two different challenges; the first one counts dates as entities, and the second one does not
    NEEL challenge: Jimmy Page was born the January 9th, 1944.
    OKE challenge: Jimmy Page was born the January 9th, 1944.
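A pruning step of this kind could look like the sketch below: a plain k-NN majority vote that decides whether an annotation is kept. The slide does not list the features ADEL uses, so the two features here (whether the mention is capitalised, and a linking score) and the training examples are purely hypothetical.

```python
import math
from collections import Counter

# Hypothetical training set: ((is_capitalised, linking_score), keep_annotation).
TRAINING = [
    ((1, 0.9), True), ((1, 0.8), True), ((1, 0.7), True),
    ((0, 0.1), False), ((0, 0.2), False), ((0, 0.0), False),
]


def knn_keep(features, training=TRAINING, k=3):
    """Majority vote among the k nearest labelled annotations (Euclidean distance)."""
    nearest = sorted(training, key=lambda example: math.dist(features, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # "against" in "France played against Russia": lower-case, weak link -> pruned.
    print(knn_keep((0, 0.15)))
    # "Against" as a band name: capitalised, strong link -> kept.
    print(knn_keep((1, 0.85)))
```

Adapting to a different guideline (for example, one that counts dates as entities) would then amount to retraining on annotations labelled under that guideline.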
ADEL Evaluation
• #Micropost2014 NEEL Challenge – ADEL v1
• #Micropost2015 NEEL Challenge – ADEL v1
• #Micropost2016 NEEL Challenge – ADEL v2
• OKE2015 Challenge – ADEL v1
• OKE2016 Challenge – ADEL v2

System      E2E    UTwente  DataTXT  ADEL   AIDA   Hyberabad  SAP
F-measure   70.06  54.93    49.9     46.29  45.37  45.23      39.02

System      ADEL   FOX    FRED
F-measure   60.75  49.88  34.73

System      ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure   76.2   52.3      47.9  46.4   41.5      31.6  0

System      ADEL
F-measure   78.8

System      ADEL
F-measure   56.5
ADEL Live Demo
Social Media: some definitions
• Media Item: a photo or a video that is shared on a social network
• Micropost: a text status message that can optionally accompany a media item
• Social Network: an online service that focuses on building and reflecting social relationships among people sharing interests or activities
• Media Sharing Platforms: emphasis on sharing media, but with blurred boundaries with social networks, since users are encouraged to react to media content (like, comment, favorite, etc.)
Media Server
• Composition of media item extractors (12 SNs)
• Relies on search APIs plus a fixed 30 s timeout window to provide results
• Falls back on screen scraping when necessary (Twitter ecosystem)
• Implemented as a NodeJS server
• Serializes results in a common schema (JSON)
https://github.com/tomayac/media-server
Deep link
Permalink
Clean text for NLP processing
Aggregate view of ALL social interactions
12 Social Networks
Media Finder (www2013)
Media Finder (zooming on media items)
Media Finder (timeline view)
Media Finder Architecture
• Media items harvesting using the Media Server
  http://eventmedia.eurecom.fr/media-server/search/{combined}/{term}
  https://github.com/vuknje/media-server (@tomayac fork)
• Image near de-duplication
  DCT signature on image and video frame, Hamming distance between image pairs
• Clustering and disambiguation
  Named Entity Extraction using NERD
  Topic Generation using LDA
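The near de-duplication step can be sketched as follows. This is a generic DCT-based bit signature, not necessarily the one Media Finder uses: it assumes a small square grayscale thumbnail given as a list of rows, keeps the low-frequency DCT coefficients, thresholds them at their median, and compares signatures by Hamming distance. The parameters `keep` and `max_bits` are hypothetical.

```python
import math


def dct_2d(block):
    """Naive 2-D DCT-II of a square matrix (fine for tiny thumbnails)."""
    n = len(block)

    def basis(k, i):
        return math.cos(math.pi * (2 * i + 1) * k / (2 * n))

    return [[sum(block[x][y] * basis(u, x) * basis(v, y)
                 for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]


def signature(gray, keep=4):
    """Bit signature: low-frequency DCT coefficients thresholded at their median."""
    coeffs = dct_2d(gray)
    low = [coeffs[u][v] for u in range(keep) for v in range(keep)][1:]  # skip DC term
    median = sorted(low)[len(low) // 2]
    return sum(1 << i for i, c in enumerate(low) if c > median)


def hamming(a, b):
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")


def near_duplicates(sig_a, sig_b, max_bits=2):
    """Treat two images as near duplicates when few signature bits differ."""
    return hamming(sig_a, sig_b) <= max_bits


if __name__ == "__main__":
    thumbnail = [[(3 * x * x + 5 * y) % 256 for y in range(8)] for x in range(8)]
    sig = signature(thumbnail)
    print(hamming(sig, sig), near_duplicates(sig, sig))
```

For video, the same signature would be computed per sampled frame; how frames are sampled is not specified on the slide.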
Media Finder (named entities clustering)
Media Finder (zooming in a cluster)
Media Finder
• Live Topic Generation from Event Streams
Published at WWW 2013 Demo Track
http://www.youtube.com/watch?v=8iRiwz7cDYY
Tracking an event: Italian Election
• Repeated queries over a period of time
  We tracked and analyzed media posts tagged as elezioni2013 from 2013-02-26 to 2013-03-03
  Cron job: every 30 minutes over the 6 days
  Data sliced into 24-hour slots
• Research question:
  Can we re-create the news headlines?
• Storyboarding:
  http://mediafinder.eurecom.fr/story/elezioni2013
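Slicing the harvested posts into 24-hour slots can be sketched as below. The input shape, a list of (timestamp, text) pairs, and the function name `slice_by_day` are assumptions; only the start date and the slot length come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slice_by_day(posts, start, slot=timedelta(hours=24)):
    """Bucket (timestamp, text) pairs into consecutive fixed-length slots from `start`."""
    slots = defaultdict(list)
    for timestamp, text in posts:
        index = (timestamp - start) // slot  # integer slot number, 0 for the first day
        slots[index].append(text)
    return dict(slots)


if __name__ == "__main__":
    start = datetime(2013, 2, 26)
    posts = [
        (datetime(2013, 2, 26, 10, 0), "first day"),
        (datetime(2013, 2, 27, 0, 30), "second day"),
        (datetime(2013, 3, 3, 23, 30), "last day"),
    ]
    print(slice_by_day(posts, start))
```

With the tracking window on the slide, slot indices run from 0 (2013-02-26) to 5 (2013-03-03), giving the six daily slices.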
Tracking an event: Italian Election
• Dataset:
  ~16501 microposts containing (duplicate) media items
  ~21087 Named Entities extracted
• Clustering
  NER and LDA
  Generate Bag of Entities (BOE) disambiguated with a DBpedia URI
• Examples:
  Monti, Bersani, Italia, Berlusconi, Grillo, Stelle
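One plausible reading of a Bag of Entities is a count of disambiguated entity URIs across the microposts of a time slot. The sketch below assumes each micropost has already been reduced to a list of linked URIs, with "NIL" for unlinked mentions; the sample posts are invented for illustration.

```python
from collections import Counter

# Hypothetical input: one list of linked entity URIs per micropost ("NIL" = unlinked).
POSTS = [
    ["http://dbpedia.org/resource/Mario_Monti", "http://dbpedia.org/resource/Italy"],
    ["http://dbpedia.org/resource/Silvio_Berlusconi", "NIL"],
    ["http://dbpedia.org/resource/Mario_Monti"],
]


def bag_of_entities(posts):
    """Count occurrences of each linked entity URI, ignoring unlinked (NIL) mentions."""
    return Counter(uri for post in posts for uri in post if uri != "NIL")


if __name__ == "__main__":
    for uri, count in bag_of_entities(POSTS).most_common():
        print(count, uri)
```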
Tracking an event: Italian Election
• Tracking and Analyzing The 2013 Italian Election
Published at ESWC 2013 Demo Track
http://www.youtube.com/watch?v=jIMdnwMoWnk
Searching and browsing TED Talks
GO!
MF: Chapters
“This is Nikita, a security guard from one of the bars in St. Petersburg.”
“This is Nikita, a security guard from one of the bars in St. Petersburg.”
NER
Example taken from the transcript of
https://www.ted.com/talks/2089
PERSON / FUNCTION / LOCATION
Category: type in the NER task.
Natural Language Processing (NLP)
Task → disambiguating URL in a knowledge base, e.g. http://dbpedia.org/resource/Saint_Petersburg.
Annotations: Named Entities
1. Clustering of consecutive chapters which talk about similar topics and entities
2. Ordering of those fragments based on annotation relevance (TF-IDF)
3. Filtering: Hot Spots are fragments whose relative relevance falls under the first quarter of the final score distribution
MF: Hot Spots
Hot Spot 1
Chapters
Hot Spot 2
Hot Spots
Hyperlink: Indexing TED Talks
http://www.slideshare.net/troncy

More Related Content

Viewers also liked

InVID Project Presentation 3rd release March 2016
InVID Project Presentation 3rd release March 2016InVID Project Presentation 3rd release March 2016
InVID Project Presentation 3rd release March 2016InVID Project
 
Summer Training In Dotnet
Summer Training In DotnetSummer Training In Dotnet
Summer Training In DotnetDUCC Systems
 
LinkedIn for Students and Graduates - how to start networking and checking al...
LinkedIn for Students and Graduates - how to start networking and checking al...LinkedIn for Students and Graduates - how to start networking and checking al...
LinkedIn for Students and Graduates - how to start networking and checking al...Charles Hardy
 
Presentazione Davide Geraci Mutuo BNL 2in1
Presentazione Davide Geraci Mutuo BNL 2in1Presentazione Davide Geraci Mutuo BNL 2in1
Presentazione Davide Geraci Mutuo BNL 2in1davide geraci
 
Financial accounting MCQ (ledger)
Financial accounting MCQ (ledger)Financial accounting MCQ (ledger)
Financial accounting MCQ (ledger)Md Yeakub Hossain
 
A replication study of the top performing systems in SemEval twitter sentimen...
A replication study of the top performing systems in SemEval twitter sentimen...A replication study of the top performing systems in SemEval twitter sentimen...
A replication study of the top performing systems in SemEval twitter sentimen...Raphael Troncy
 

Viewers also liked (9)

PROVA DE BIOLOGIA
PROVA DE BIOLOGIA PROVA DE BIOLOGIA
PROVA DE BIOLOGIA
 
Aula01 senac
Aula01 senacAula01 senac
Aula01 senac
 
InVID Project Presentation 3rd release March 2016
InVID Project Presentation 3rd release March 2016InVID Project Presentation 3rd release March 2016
InVID Project Presentation 3rd release March 2016
 
Edital Técnico de Enfermagem 2016
Edital Técnico de Enfermagem 2016Edital Técnico de Enfermagem 2016
Edital Técnico de Enfermagem 2016
 
Summer Training In Dotnet
Summer Training In DotnetSummer Training In Dotnet
Summer Training In Dotnet
 
LinkedIn for Students and Graduates - how to start networking and checking al...
LinkedIn for Students and Graduates - how to start networking and checking al...LinkedIn for Students and Graduates - how to start networking and checking al...
LinkedIn for Students and Graduates - how to start networking and checking al...
 
Presentazione Davide Geraci Mutuo BNL 2in1
Presentazione Davide Geraci Mutuo BNL 2in1Presentazione Davide Geraci Mutuo BNL 2in1
Presentazione Davide Geraci Mutuo BNL 2in1
 
Financial accounting MCQ (ledger)
Financial accounting MCQ (ledger)Financial accounting MCQ (ledger)
Financial accounting MCQ (ledger)
 
A replication study of the top performing systems in SemEval twitter sentimen...
A replication study of the top performing systems in SemEval twitter sentimen...A replication study of the top performing systems in SemEval twitter sentimen...
A replication study of the top performing systems in SemEval twitter sentimen...
 

More from Raphael Troncy

K CAP 2019 Opening Ceremony
K CAP 2019 Opening CeremonyK CAP 2019 Opening Ceremony
K CAP 2019 Opening CeremonyRaphael Troncy
 
Semantic Technologies for Connected Vehicles in a Web of Things Environment
Semantic Technologies for Connected Vehicles in a Web of Things EnvironmentSemantic Technologies for Connected Vehicles in a Web of Things Environment
Semantic Technologies for Connected Vehicles in a Web of Things EnvironmentRaphael Troncy
 
HyperTED: exploring video lectures at the fragment levels for enhancing learning
HyperTED: exploring video lectures at the fragment levels for enhancing learningHyperTED: exploring video lectures at the fragment levels for enhancing learning
HyperTED: exploring video lectures at the fragment levels for enhancing learningRaphael Troncy
 
Location Embeddings for Next Trip Recommendation
Location Embeddings for Next Trip RecommendationLocation Embeddings for Next Trip Recommendation
Location Embeddings for Next Trip RecommendationRaphael Troncy
 
Contextualizing Events in TV News Shows - SNOW 2014
Contextualizing Events in TV News Shows - SNOW 2014Contextualizing Events in TV News Shows - SNOW 2014
Contextualizing Events in TV News Shows - SNOW 2014Raphael Troncy
 
Modeling Geometry and Reference Systems on the Web of Data - LGD 2014
Modeling Geometry and Reference Systems on the Web of Data - LGD 2014Modeling Geometry and Reference Systems on the Web of Data - LGD 2014
Modeling Geometry and Reference Systems on the Web of Data - LGD 2014Raphael Troncy
 
NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...Raphael Troncy
 
Deep-linking into Media Assets at the Fragment Level SMAM 2013
Deep-linking into Media Assets at the Fragment Level SMAM 2013Deep-linking into Media Assets at the Fragment Level SMAM 2013
Deep-linking into Media Assets at the Fragment Level SMAM 2013Raphael Troncy
 
Describing Media Assets: Media Fragment Specification and Description
Describing Media Assets: Media Fragment Specification and DescriptionDescribing Media Assets: Media Fragment Specification and Description
Describing Media Assets: Media Fragment Specification and DescriptionRaphael Troncy
 
Semantics at the multimedia fragment level SSSW 2013
Semantics at the multimedia fragment level SSSW 2013Semantics at the multimedia fragment level SSSW 2013
Semantics at the multimedia fragment level SSSW 2013Raphael Troncy
 
Semantic structuring and linking of event-centric data in the social web
Semantic structuring and linking of event-centric data in the social webSemantic structuring and linking of event-centric data in the social web
Semantic structuring and linking of event-centric data in the social webRaphael Troncy
 
Live topic generation from event streams
Live topic generation from event streamsLive topic generation from event streams
Live topic generation from event streamsRaphael Troncy
 
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the CrowdMediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the CrowdRaphael Troncy
 
EventMedia Live: Exploring Events Connections in Real-Time to Enhance Content
EventMedia Live: Exploring Events Connections in Real-Time to Enhance ContentEventMedia Live: Exploring Events Connections in Real-Time to Enhance Content
EventMedia Live: Exploring Events Connections in Real-Time to Enhance ContentRaphael Troncy
 
Extracting Media Items from Multiple Social Networks
Extracting Media Items from Multiple Social NetworksExtracting Media Items from Multiple Social Networks
Extracting Media Items from Multiple Social NetworksRaphael Troncy
 
Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...Raphael Troncy
 
MediaEval 2012 SED Opening
MediaEval 2012 SED OpeningMediaEval 2012 SED Opening
MediaEval 2012 SED OpeningRaphael Troncy
 
DeRiVE 2011 workshop opening
DeRiVE 2011 workshop openingDeRiVE 2011 workshop opening
DeRiVE 2011 workshop openingRaphael Troncy
 
MediaEval 2011 SED Opening
MediaEval 2011 SED OpeningMediaEval 2011 SED Opening
MediaEval 2011 SED OpeningRaphael Troncy
 
ShareIt: Mining SocialMedia Activities for Detecting Events
ShareIt: Mining SocialMedia Activities for Detecting EventsShareIt: Mining SocialMedia Activities for Detecting Events
ShareIt: Mining SocialMedia Activities for Detecting EventsRaphael Troncy
 

More from Raphael Troncy (20)

K CAP 2019 Opening Ceremony
K CAP 2019 Opening CeremonyK CAP 2019 Opening Ceremony
K CAP 2019 Opening Ceremony
 
Semantic Technologies for Connected Vehicles in a Web of Things Environment
Semantic Technologies for Connected Vehicles in a Web of Things EnvironmentSemantic Technologies for Connected Vehicles in a Web of Things Environment
Semantic Technologies for Connected Vehicles in a Web of Things Environment
 
HyperTED: exploring video lectures at the fragment levels for enhancing learning
HyperTED: exploring video lectures at the fragment levels for enhancing learningHyperTED: exploring video lectures at the fragment levels for enhancing learning
HyperTED: exploring video lectures at the fragment levels for enhancing learning
 
Location Embeddings for Next Trip Recommendation
Location Embeddings for Next Trip RecommendationLocation Embeddings for Next Trip Recommendation
Location Embeddings for Next Trip Recommendation
 
Contextualizing Events in TV News Shows - SNOW 2014
Contextualizing Events in TV News Shows - SNOW 2014Contextualizing Events in TV News Shows - SNOW 2014
Contextualizing Events in TV News Shows - SNOW 2014
 
Modeling Geometry and Reference Systems on the Web of Data - LGD 2014
Modeling Geometry and Reference Systems on the Web of Data - LGD 2014Modeling Geometry and Reference Systems on the Web of Data - LGD 2014
Modeling Geometry and Reference Systems on the Web of Data - LGD 2014
 
NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...
 
Deep-linking into Media Assets at the Fragment Level SMAM 2013
Deep-linking into Media Assets at the Fragment Level SMAM 2013Deep-linking into Media Assets at the Fragment Level SMAM 2013
Deep-linking into Media Assets at the Fragment Level SMAM 2013
 
Describing Media Assets: Media Fragment Specification and Description
Describing Media Assets: Media Fragment Specification and DescriptionDescribing Media Assets: Media Fragment Specification and Description
Describing Media Assets: Media Fragment Specification and Description
 
Semantics at the multimedia fragment level SSSW 2013
Semantics at the multimedia fragment level SSSW 2013Semantics at the multimedia fragment level SSSW 2013
Semantics at the multimedia fragment level SSSW 2013
 
Semantic structuring and linking of event-centric data in the social web
Semantic structuring and linking of event-centric data in the social webSemantic structuring and linking of event-centric data in the social web
Semantic structuring and linking of event-centric data in the social web
 
Live topic generation from event streams
Live topic generation from event streamsLive topic generation from event streams
Live topic generation from event streams
 
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the CrowdMediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
 
EventMedia Live: Exploring Events Connections in Real-Time to Enhance Content
EventMedia Live: Exploring Events Connections in Real-Time to Enhance ContentEventMedia Live: Exploring Events Connections in Real-Time to Enhance Content
EventMedia Live: Exploring Events Connections in Real-Time to Enhance Content
 
Extracting Media Items from Multiple Social Networks
Extracting Media Items from Multiple Social NetworksExtracting Media Items from Multiple Social Networks
Extracting Media Items from Multiple Social Networks
 
Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...
 
MediaEval 2012 SED Opening
MediaEval 2012 SED OpeningMediaEval 2012 SED Opening
MediaEval 2012 SED Opening
 
DeRiVE 2011 workshop opening
DeRiVE 2011 workshop openingDeRiVE 2011 workshop opening
DeRiVE 2011 workshop opening
 
MediaEval 2011 SED Opening
MediaEval 2011 SED OpeningMediaEval 2011 SED Opening
MediaEval 2011 SED Opening
 
ShareIt: Mining SocialMedia Activities for Detecting Events
ShareIt: Mining SocialMedia Activities for Detecting EventsShareIt: Mining SocialMedia Activities for Detecting Events
ShareIt: Mining SocialMedia Activities for Detecting Events
 

Recently uploaded

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...

Linking Entities for Enriching and Structuring Social Media Content

  • 1. Linking Entities for Enriching and Structuring Social Media Content Raphaël Troncy <raphael.troncy@eurecom.fr> @rtroncy
  • 2. 12/04/2016 NLPIT Workshop @ WWW 2016 - 2
  • 3. Extracting and Linking Entities (NER/NEL): “Tampa Bay Lightning vs Canadiens in Montreal tonight with @erikmannens #hockey #NHL” https://www.youtube.com/watch?v=Rmug-PUyIzI
  • 4. Part of Speech (GATE Twitter POS): https://gate.ac.uk/wiki/twitter-postagger.html
        Token          POS
        Tampa          NNP
        Bay            NNP
        Lightning      NNP
        vs             CC
        Canadiens      NNP
        in             IN
        Montreal       NNP
        tonight        NN
        with           IN
        @erikmannens   USR
        #hockey        HT
        #NHL           HT
      • NER: What is NHL?
      • NEL: Which Montreal are we talking about?
  • 5. What is #NHL? Type Ambiguity: Sports League, Organization, Place, Railway Line
  • 6. What is #NHL? Type Ambiguity: http://schema.org/SportsEvent, http://dbpedia.org/ontology/Event, http://schema.org/Organization, http://dbpedia.org/ontology/IceHockeyLeague (different infobox templates)
  • 7. Named Entity Recognition (NER)
        Token          POS   NER
        Tampa          NNP   ORG
        Bay            NNP   ORG
        Lightning      NNP   ORG
        vs             CC    O
        Canadiens      NNP   ORG
        in             IN    O
        Montreal       NNP   LOC
        tonight        NN    O
        with           IN    O
        @erikmannens   USR   PER
        #hockey        HT    THG
        #NHL           HT    ORG
  • 8. What is Montreal? Name Ambiguity: Montréal, Ardèche; Montréal, Aude; Montréal, Gers; Montreal, Wisconsin; Mont-ral, Catalonia
  • 9. Named Entity Linking (NEL)
        Token          POS   NER   Link
        Tampa          NNP   ORG
        Bay            NNP   ORG
        Lightning      NNP   ORG
        vs             CC    O
        Canadiens      NNP   ORG
        in             IN    O
        Montreal       NNP   LOC   http://dbpedia.org/resource/Montreal
        tonight        NN    O
        with           IN    O
        @erikmannens   USR   PER   NIL
        #hockey        HT    THG
        #NHL           HT    ORG
  • 10. NERD: a framework for comparing NER APIs (http://nerd.eurecom.fr/)
      • NER: Stanford CoreNLP
      • Web APIs
  • 11. NERD: AlchemyAPI (incorrect boundaries, no disambiguation, no dereferencing for @mention)
  • 12. NERD: Dandelion (everything is a Thing, no dereferencing for @mention)
  • 13. NERDML (no dereferencing for @mention)
  • 14. Research Questions
      • How to adapt an entity linking system depending on different criteria?
      • How to design an entity linking system in order to be able to process a large amount of data in near real time?
  • 15. ADEL: Adaptive Framework for NER
      • POS Tagger:
          • use a bidirectional dependency network
          • combine left-to-right and right-to-left CMMs
      • NER:
          • use a CRF with Gibbs sampling (Monte Carlo for approximate inference) to take n words into account instead of only the previous and next one
  • 16. ADEL: Overlap Resolution
      • Detect overlaps among extractors using the boundaries of the entities
      • Different heuristics can be applied:
          • Merge (default behavior): “United States” and “States of America” => “United States of America”
          • Simple Substring: “Florence” and “Florence May Harding” => “Florence” and “May Harding”
          • Smart Substring: “Giants of New York” and “New York” => “Giants” and “New York”
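The three heuristics on the Overlap Resolution slide can be sketched over character spans. This is a minimal illustration, not ADEL's implementation: the span representation, the function names and the `CONNECTORS` word list are all assumptions made for the example, and "Smart Substring" is interpreted here as "Simple Substring plus trimming connector words such as 'of' from the leftover part".

```python
from typing import List, Tuple

Span = Tuple[int, int]  # [start, end) character offsets into the text

# Hypothetical connector words trimmed by the "smart" variant (assumption).
CONNECTORS = {"of", "the", "in", "at"}


def strip_ws(text: str, span: Span) -> Span:
    """Shrink a span so that it neither starts nor ends with whitespace."""
    start, end = span
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end)


def merge(a: Span, b: Span) -> List[Span]:
    """Merge: replace two overlapping spans by their union."""
    return [(min(a[0], b[0]), max(a[1], b[1]))]


def simple_substring(text: str, a: Span, b: Span) -> List[Span]:
    """Simple Substring: keep the shorter span and what is left of the longer one."""
    short, long_ = sorted((a, b), key=lambda s: s[1] - s[0])
    result = [short]
    # Leftovers of the longer span before and after the shorter one.
    for leftover in ((long_[0], short[0]), (short[1], long_[1])):
        start, end = strip_ws(text, leftover)
        if start < end:
            result.append((start, end))
    return sorted(result)


def trim_connectors(text: str, span: Span) -> Span:
    """Drop leading/trailing connector words ("Giants of" -> "Giants")."""
    start, end = span
    while start < end:
        words = text[start:end].split()
        if words and words[-1].lower() in CONNECTORS:
            end -= len(words[-1])
        elif words and words[0].lower() in CONNECTORS:
            start += len(words[0])
        else:
            break
        start, end = strip_ws(text, (start, end))
    return (start, end)


def smart_substring(text: str, a: Span, b: Span) -> List[Span]:
    """Smart Substring: as Simple Substring, but clean up the leftover span."""
    short = min(a, b, key=lambda s: s[1] - s[0])
    result = []
    for span in simple_substring(text, a, b):
        if span != short:
            span = trim_connectors(text, span)
        if span[0] < span[1]:
            result.append(span)
    return result
```

With the slide's own examples, `merge((0, 13), (7, 24))` on "United States of America" yields the single span covering the whole string, while `smart_substring("Giants of New York", (0, 18), (10, 18))` yields the spans for "Giants" and "New York".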
  • 17. ADEL: KB Indexing
      • Create index from DBpedia and Wikipedia
      • Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
  • 18. ADEL: Adaptive Framework for NEL
      • Generate candidate links for all extracted mentions:
          • If any, they go to the linking method
          • If not, they are linked to NIL
      • Linking method: ADEL linear formula, with the following symbols:
          • r(l): the score of the candidate l
          • L: the Levenshtein distance
          • m: the extracted mention
          • title: the title of the candidate l
          • R: the set of redirect pages associated to the candidate l
          • D: the set of disambiguation pages associated to the candidate l
          • PR: PageRank associated to the candidate l
          • a, b and c are weights following the properties: a > b > c and a + b + c = 1
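The formula itself is an image on the slide and is not part of this transcript, so the following is only a sketch of how the listed symbols could fit together. It assumes (this is not confirmed by the transcript) that the score is a weighted sum of string similarities to the title, the redirects and the disambiguation pages, multiplied by PageRank, and that the Levenshtein distance is turned into a normalized similarity so that a higher score is better. The `Candidate` class, the weight values and the PageRank numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Candidate:
    uri: str
    title: str
    pagerank: float
    redirects: List[str] = field(default_factory=list)
    disambiguations: List[str] = field(default_factory=list)


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (cs != ct)))   # substitution
        previous = current
    return previous[-1]


def similarity(s: str, t: str) -> float:
    """Assumption: normalize the distance to [0, 1], 1.0 meaning identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s.lower(), t.lower()) / longest


def best_similarity(mention: str, labels: Iterable[str]) -> float:
    return max((similarity(mention, label) for label in labels), default=0.0)


def score(mention: str, cand: Candidate,
          a: float = 0.5, b: float = 0.3, c: float = 0.2) -> float:
    """Assumed form: (a*title + b*redirects + c*disambiguations) * PageRank."""
    if not (a > b > c) or abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights must satisfy a > b > c and a + b + c = 1")
    textual = (a * similarity(mention, cand.title)
               + b * best_similarity(mention, cand.redirects)
               + c * best_similarity(mention, cand.disambiguations))
    return textual * cand.pagerank


def link(mention: str, candidates: List[Candidate]) -> Optional[str]:
    """Return the best candidate URI, or None (NIL) when there is no candidate."""
    if not candidates:
        return None
    return max(candidates, key=lambda cand: score(mention, cand)).uri


# Hypothetical candidates; the PageRank values are made up for illustration.
montreal_candidates = [
    Candidate("http://dbpedia.org/resource/Montreal", "Montreal", 0.9,
              redirects=["Montréal"]),
    Candidate("http://dbpedia.org/resource/Montreal,_Wisconsin",
              "Montreal, Wisconsin", 0.1),
]
best = link("Montreal", montreal_candidates)
```

The `None` return value plays the role of NIL on the slide: a mention such as @erikmannens with no candidate in the index is left unlinked.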
  • 19. ADEL: Pruning for NER/NEL
      • k-NN machine learning algorithm
      • Why a pruning module?
          • Useful to correct the errors from the extractor by removing wrong annotations. Example:
              • France played against Russia for a friendly match
              • Yesterday, I went to see Against in concert
          • Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges; the first one counts dates as entities and the second one does not
              • NEEL challenge: Jimmy Page was born the January 9th, 1944.
              • OKE challenge: Jimmy Page was born the January 9th, 1944.
  • 20. ADEL Evaluation
      • #Micropost2014 NEEL Challenge – ADEL v1
      • #Micropost2015 NEEL Challenge – ADEL v1
      • #Micropost2016 NEEL Challenge – ADEL v2
      • OKE2015 Challenge – ADEL v1
      • OKE2016 Challenge – ADEL v2
      • Result tables (F-measure), in the order they appear on the slide:
          • E2E 70.06, UTwente 54.93, DataTXT 49.9, ADEL 46.29, AIDA 45.37, Hyderabad 45.23, SAP 39.02
          • ADEL 60.75, FOX 49.88, FRED 34.73
          • ousia 76.2, acubelab 52.3, ADEL 47.9, uniba 46.4, ualberta 41.5, uva 31.6, cen_neel 0
          • ADEL 78.8
          • ADEL 56.5
  • 21. ADEL Live Demo
  • 22. Social Media: some definitions
      • Media Item: a photo or a video that is shared on a social network
      • Micropost: a text status message that can optionally accompany a media item
      • Social Network: an online service that focuses on building and reflecting social relationships among people sharing interests or activities
      • Media Sharing Platforms: emphasis on sharing media, but with blurred boundaries with social networks since users are encouraged to react to media content (like, comment, favorite, etc.)
  • 23. Media Server (https://github.com/tomayac/media-server)
      • Composition of media item extractors (12 SNs)
      • Relies on search APIs + a fixed 30 s timeout window to provide results
      • Falls back on screen scraping when necessary (Twitter ecosystem)
      • Implemented as a NodeJS server
      • Serializes results in a common schema (JSON)
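The "composition of extractors with a fixed timeout window" idea can be sketched as a concurrent fan-out that returns whatever has arrived when the window closes. The real Media Server is a NodeJS application; the Python below is only an illustration of the pattern, and the extractor functions, network names and item fields (`network`, `mediaUrl`, `micropost`) are hypothetical rather than the project's actual schema.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

MediaItem = Dict[str, str]
Extractor = Callable[[str], Awaitable[List[MediaItem]]]


async def search_all(term: str, extractors: Dict[str, Extractor],
                     timeout: float = 30.0) -> Dict[str, List[MediaItem]]:
    """Query every extractor concurrently; keep only answers within the window."""
    results: Dict[str, List[MediaItem]] = {name: [] for name in extractors}
    if not extractors:
        return results
    tasks = {asyncio.create_task(extract(term)): name
             for name, extract in extractors.items()}
    done, pending = await asyncio.wait(tasks.keys(), timeout=timeout)
    for task in pending:            # too slow: give up on this network
        task.cancel()
    for task in done:
        if task.exception() is None:    # a failing extractor yields no items
            results[tasks[task]] = task.result()
    return results


# Two fake extractors standing in for real social network search APIs.
async def fast_network(term: str) -> List[MediaItem]:
    await asyncio.sleep(0.01)
    return [{"network": "fast", "mediaUrl": f"https://example.org/{term}.jpg",
             "micropost": f"photo about {term}"}]


async def slow_network(term: str) -> List[MediaItem]:
    await asyncio.sleep(5)          # longer than the window used below
    return [{"network": "slow", "mediaUrl": f"https://example.org/{term}.mp4",
             "micropost": f"video about {term}"}]


# A short window instead of 30 s so that the example finishes quickly.
results = asyncio.run(search_all(
    "hockey", {"fast": fast_network, "slow": slow_network}, timeout=0.2))
```

Returning an empty list for a network that times out or fails keeps the merged result in one common shape, which is the point of serializing all extractors into a single schema.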
  • 24. Deep link; permalink; clean text for NLP processing; aggregate view of ALL social interactions; 12 social networks
  • 25. Media Finder (www2013)
  • 26. Media Finder (zooming on media items)
  • 27. Media Finder (timeline view)
  • 28. Media Finder Architecture
      • Media items harvesting using the Media Server
          • http://eventmedia.eurecom.fr/media-server/search/{combined}/{term}
          • https://github.com/vuknje/media-server (@tomayac fork)
      • Image near de-duplication
          • DCT signature on image and video frame, Hamming distance between image pairs
      • Clustering and disambiguation
          • Named Entity Extraction using NERD
          • Topic Generation using LDA
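The near de-duplication step (a DCT signature per image, compared by Hamming distance) can be sketched in pure Python as below. The slide does not say how the signature is built, so several choices here are assumptions: an unnormalized 2D DCT-II over a small square grayscale matrix, a signature made of the lowest `size` x `size` frequencies without the DC term, bits set by comparison with the median coefficient, and a `max_distance` threshold of 2.

```python
import math
import random
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # square grayscale matrix


def dct_2d(image: Image) -> List[List[float]]:
    """Unnormalized 2D DCT-II, computed along rows and then along columns."""
    n = len(image)
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(n)]
    rows = [[sum(image[y][x] * basis[u][x] for x in range(n)) for u in range(n)]
            for y in range(n)]
    return [[sum(rows[y][u] * basis[v][y] for y in range(n)) for u in range(n)]
            for v in range(n)]


def dct_signature(image: Image, size: int = 4) -> List[int]:
    """Bit signature from the low-frequency DCT block (DC term excluded)."""
    coeffs = dct_2d(image)
    low = [coeffs[v][u] for v in range(size) for u in range(size)][1:]
    median = sorted(low)[len(low) // 2]
    return [1 if value > median else 0 for value in low]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length signatures differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_duplicates(images: Dict[str, Image],
                    max_distance: int = 2) -> List[Tuple[str, str]]:
    """Pairs of image names whose signatures differ in at most max_distance bits."""
    signatures = {name: dct_signature(img) for name, img in images.items()}
    return [(a, b) for a, b in combinations(signatures, 2)
            if hamming(signatures[a], signatures[b]) <= max_distance]


# Example: a synthetic 8x8 "image" and a uniformly brighter copy of it.
rng = random.Random(7)
base = [[rng.randint(0, 200) for _ in range(8)] for _ in range(8)]
brighter = [[pixel + 10 for pixel in row] for row in base]
pairs = near_duplicates({"original": base, "brighter": brighter})
```

Dropping the DC term makes the signature insensitive to a uniform brightness change, since adding a constant to every pixel only moves the zero-frequency coefficient. In practice each image or video frame would first be resized to the small square matrix used here.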
  • 29. Media Finder (named entities clustering)
  • 30. Media Finder (zooming in a cluster)
  • 31. Media Finder
      • Live Topic Generation from Event Streams
      • Published at WWW 2013 Demo Track
      • http://www.youtube.com/watch?v=8iRiwz7cDYY
  • 32. Tracking an event: Italian Election
      • Repeated queries over a period of time
          • We have tracked and analyzed media posts tagged as elezioni2013 from 2013-02-26 to 2013-03-03
          • Cron job: every 30 minutes over the 6 days
          • Slice the data into 24-hour slots
      • Research question: Can we re-create the news headlines?
      • Storyboarding: http://mediafinder.eurecom.fr/story/elezioni2013
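The collection schedule and the 24-hour slicing can be illustrated with a few lines of date arithmetic. This is a sketch only: it assumes the tracking window runs from 2013-02-26 00:00 up to, but not including, 2013-03-04 00:00 (six full days), and the `Post` tuples with their texts are made-up examples, not items from the dataset.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterator, List, Tuple

Post = Tuple[datetime, str]  # (publication time, micropost text)


def poll_times(start: datetime, end: datetime,
               every: timedelta = timedelta(minutes=30)) -> Iterator[datetime]:
    """Times at which the periodic job would run, end excluded."""
    current = start
    while current < end:
        yield current
        current += every


def slice_into_slots(posts: List[Post], start: datetime,
                     slot: timedelta = timedelta(hours=24)) -> Dict[int, List[Post]]:
    """Group posts by the index of the slot they fall in, counted from start."""
    slots: Dict[int, List[Post]] = defaultdict(list)
    for published, text in posts:
        if published >= start:
            slots[(published - start) // slot].append((published, text))
    return dict(slots)


start = datetime(2013, 2, 26)
end = datetime(2013, 3, 4)      # assumption: 2013-03-03 is included in full
runs = list(poll_times(start, end))

# Hypothetical microposts used only to show the slicing.
posts = [
    (datetime(2013, 2, 26, 10, 0), "first example post #elezioni2013"),
    (datetime(2013, 2, 27, 0, 0), "second example post #elezioni2013"),
    (datetime(2013, 3, 3, 23, 30), "last example post #elezioni2013"),
]
slots = slice_into_slots(posts, start)
```

Under these assumptions the job runs 288 times (48 runs per day over 6 days), and the three example posts land in slots 0, 1 and 5.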
  • 33. Tracking an event: Italian Election
      • Dataset:
          • ~16501 microposts containing (duplicate) media items
          • ~21087 Named Entities extracted
      • Clustering
          • NER and LDA
          • Generate Bag of Entities (BOE) disambiguated with a DBpedia URI
      • Examples: Monti, Bersani, Italia, Berlusconi, Grillo, Stelle
  • 34. Tracking an event: Italian Election
      • Tracking and Analyzing The 2013 Italian Election
      • Published at ESWC 2013 Demo Track
      • http://www.youtube.com/watch?v=jIMdnwMoWnk
  • 37. Annotations: Named Entities
      • Example: “This is Nikita, a security guard from one of the bars in St. Petersburg.” (taken from the transcript of https://www.ted.com/talks/2089)
      • NER, a Natural Language Processing (NLP) task: PERSON, FUNCTION, LOCATION
      • Category: type in the NER task
      • Disambiguation: URL in a knowledge base, e.g. http://dbpedia.org/resource/Saint_Petersburg
  • 38. MF: Hot Spots
      1. Clustering of consecutive chapters which talk about similar topics and entities
      2. Ordering of those fragments based on annotation relevance (TF-IDF)
      3. Filtering: Hot Spots are fragments whose relative relevance falls under the first quarter of the final score distribution
      [Figure: Chapters, Hot Spot 1, Hot Spot 2]