Julien Plu
julien.plu@eurecom.fr
@julienplu
Knowledge extraction in Web
media: at the frontier of NLP,
Machine Learning and Semantics
Use Case: Bringing Context to Documents
2016/04/14 - PhD Symposium WWW 2016 - Montréal
NEWSWIRES
TWEETS
SEARCH QUERIES
SUBTITLES
Use Case: Bringing Context to Documents
James Patrick Page, OBE (born 9 January 1944)
is an English musician, songwriter, and record
producer who achieved international success as
the guitarist and founder of the rock band Led
Zeppelin. Know More
Sort name: Page, Jimmy
Type: Person
Gender: Male
Born: 1944-01-09 (72 years ago)
Born in: Heston, Hounslow, London,
United Kingdom
Country of origin: United Kingdom
Musical genre: Blues rock, psychedelic rock
Years active: 1962-1968 and since 1992
Labels: Columbia
The Yardbirds are a British rock band from the 1960s, formed in May 1963 in London, England, whose guitarists were Eric Clapton, Jeff Beck and then Jimmy Page. Know More
Six Different Problems
1. Identity of an entity
Ø Arena; Arena (magazine); Arena (TV series)
Ø Bucks County, Pennsylvania; Milwaukee Bucks
2. Knowledge bases have different coverage
Yannick Noah is a tennis player and a singer
3. Heavy computation to handle large streams
4. Various types for an entity (granularity)
5. Different types of documents written in multiple languages
6. Are all phrases entities? (e.g. dates or roles)
Research Questions
1. How can an entity linking system be adapted to different criteria?
2. How can an entity linking system be designed to process a large amount of data in near real time?
State Of The Art
§ The key role of entities:
Ø 70% of search queries contain at least one entity [1]
Ø Bring context to videos [2]
Ø Help generate summaries [3]
§ Current systems (e.g. TagME [4], AIDA [5], Babelfy [6] or DBpedia Spotlight [7]) offer few parameters and often cannot be adapted to even one of the previous criteria
§ Those solutions are often not able to handle large streams of text
[1] Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
[2] José Luis Redondo García, Giuseppe Rizzo, Raphaël Troncy: The Concentric Nature of News Semantic Snapshots: Knowledge Extraction for Semantic Annotation of News Items. K-CAP 2015
[3] Shruti Chhabra, Srikanta Bedathur: Towards Generating Text Summaries for Entity Chains. ECIR 2014
[4] Paolo Ferragina, Ugo Scaiella: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). CIKM 2010
[5] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, Gerhard Weikum: AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. PVLDB 4(12)
[6] Andrea Moro, Alessandro Raganato, Roberto Navigli: Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL 2014
[7] Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer: DBpedia Spotlight: shedding light on the web of documents. I-SEMANTICS 2011
Methodology
We have split this thesis into six tasks; the number in parentheses is the research question each task addresses:
(1) Text adaptivity
(1) Entity type adaptivity
(1) Knowledge base adaptivity
(1) Language adaptivity
(1-2) ADEL modular framework
(2) Distributed and scalable architecture
[Timeline figure: Start thesis, Today, End thesis]
ADEL: Modular Framework (Extractors)
§ POS Tagger:
Ø Bidirectional CMM (left to right and right to left)
§ NER Combiner:
Ø Uses a combination of CRF models with Gibbs sampling (a Monte Carlo method used for graph inference). A simple CRF model could be:

Jimmy  Page  ,  connaissant  le  profesionnalisme  de  John  Paul  Jones
PER    PER   O  O            O   O                 O   PER   PER   PER
X      X     X  X            X   X                 X   X     X     X

X is the set of features for the current word: word is capitalized, previous word is “de”, next word is an NNP, …
Suppose P(PER | X, PER, O, LOC) = P(PER | X, neighbors(PER)); then X together with PER forms a CRF.
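The feature set X for a single token can be sketched as follows. This is only an illustration of the three example features named on the slide, assuming pre-tokenized input and POS tags supplied by some tagger; the function name `word_features`, the feature keys and the hand-assigned tags are hypothetical and not taken from ADEL.

```python
from typing import Dict, List


def word_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, bool]:
    """Build the feature set X for the token at position i.

    Only the three features mentioned on the slide are computed; a real
    NER model would use many more. All names are illustrative assumptions.
    """
    return {
        # The current word starts with an uppercase letter.
        "word_capitalized": tokens[i][:1].isupper(),
        # The previous word is "de".
        "prev_word_is_de": i > 0 and tokens[i - 1].lower() == "de",
        # The next word is tagged as a proper noun (NNP).
        "next_pos_is_nnp": i + 1 < len(tokens) and pos_tags[i + 1] == "NNP",
    }


tokens = ["Jimmy", "Page", ",", "connaissant", "le",
          "profesionnalisme", "de", "John", "Paul", "Jones"]
# Hand-assigned tags for illustration only, not the output of a real tagger.
pos_tags = ["NNP", "NNP", ",", "VBG", "DT", "NN", "IN", "NNP", "NNP", "NNP"]

features_for_john = word_features(tokens, pos_tags, tokens.index("John"))
```

For the token "John", all three features fire, which is the kind of evidence a CRF can weigh in favor of the PER label.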
ADEL: Modular Framework (Overlap Resolution)
§ Detect overlaps among extractors using the boundaries of the entities
§ Different heuristics can be applied:
Ø Merge: (“United States” and “States of America” => “United States of America”), the default behavior
Ø Simple Substring: (“Florence” and “Florence May Harding” => “Florence” and “May Harding”)
Ø Smart Substring: (“Giants of New York” and “New York” => “Giants” and “New York”)
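The Merge and Simple Substring heuristics can be sketched on character offsets as follows. This is a minimal sketch, assuming mentions are represented as (start, end) spans over the input text; the helper names are hypothetical and the code is not ADEL's implementation. Smart Substring is left out because the slide does not state the rule that drops the connecting word.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def find_span(text: str, mention: str) -> Span:
    """Locate the first occurrence of a mention (helper for the examples)."""
    start = text.index(mention)
    return (start, start + len(mention))


def overlaps(a: Span, b: Span) -> bool:
    """Two spans overlap when they share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge(a: Span, b: Span) -> Span:
    """Merge heuristic: replace two overlapping spans by their union."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def simple_substring(text: str, a: Span, b: Span) -> List[Span]:
    """Simple Substring heuristic: keep the contained span and whatever is
    left of the containing span, with surrounding whitespace trimmed."""
    inner, outer = sorted((a, b), key=lambda s: s[1] - s[0])
    if not (outer[0] <= inner[0] and inner[1] <= outer[1]):
        raise ValueError("one span must contain the other")
    pieces = [(outer[0], inner[0]), inner, (inner[1], outer[1])]
    result = []
    for start, end in pieces:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:  # drop empty leftovers
            result.append((start, end))
    return result


def surface(text: str, spans: List[Span]) -> List[str]:
    """Return the surface strings covered by the spans."""
    return [text[start:end] for start, end in spans]


text = "the United States of America"
first = find_span(text, "United States")
second = find_span(text, "States of America")
merged = merge(first, second) if overlaps(first, second) else None
```

Working on offsets rather than strings keeps the heuristics independent of which extractor produced each mention.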
ADEL: Modular Framework (Indexing)
§ Create an index from DBpedia and Wikipedia
§ Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
ADEL: Modular Framework (Linking)
§ Generate candidate links for all extracted mentions:
Ø If a mention has candidates, they go to the linking method
Ø If not, the mention is linked to NIL
§ Linking method:
Ø ADEL linear formula:
$$r(l) = \left(a \cdot L(m, \mathit{title}) + b \cdot \max(L(m, R)) + c \cdot \max(L(m, D))\right) \cdot PR(l)$$
r(l): the score of the candidate l
L: the Levenshtein distance
m: the extracted mention
title: the title of the candidate l
R: the set of redirect pages associated with the candidate l
D: the set of disambiguation pages associated with the candidate l
PR: the PageRank associated with the candidate l
a, b and c are weights with the following properties: a > b > c and a + b + c = 1
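The linear formula above can be sketched in a few lines. The slide defines L as the Levenshtein distance without saying how it is normalized; this sketch assumes a similarity derived from it (1 minus the distance divided by the longer length) so that a higher score means a better candidate. That normalization, the function names and the weights 0.5, 0.3 and 0.2 are assumptions chosen only to satisfy a > b > c and a + b + c = 1.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def score(mention: str, title: str, redirects: Iterable[str],
          disambiguations: Iterable[str], pagerank: float,
          a: float = 0.5, b: float = 0.3, c: float = 0.2) -> float:
    """r(l) = (a*L(m, title) + b*max L(m, R) + c*max L(m, D)) * PR(l)."""
    if not (a > b > c and abs(a + b + c - 1.0) < 1e-9):
        raise ValueError("weights must satisfy a > b > c and a + b + c = 1")
    # An empty set of redirect or disambiguation pages contributes 0.
    best_redirect = max((similarity(mention, r) for r in redirects), default=0.0)
    best_disambiguation = max((similarity(mention, d) for d in disambiguations), default=0.0)
    return (a * similarity(mention, title)
            + b * best_redirect
            + c * best_disambiguation) * pagerank


exact = score("Jimmy Page", "Jimmy Page", [], [], pagerank=2.0)
other = score("Jimmy Page", "Jimmy Wales", [], [], pagerank=2.0)
```

With equal PageRank, the candidate whose title matches the mention exactly receives the higher score; PageRank then acts as a popularity prior that scales the string evidence.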
ADEL: Modular Framework (Pruning)
§ k-NN machine learning algorithm
§ Why a pruning module?
Ø Useful to correct errors from the extractor by removing wrong annotations. Example:
- France played against Russia for a friendly match.
- Yesterday, I went to see Against in concert.
Ø Useful to adapt the annotations to follow a given guideline. Example: suppose we are participating in two different challenges: NEEL2014, which counts dates as entities, and OKE2015, which does not.
- 1st challenge (dates are entities): Jimmy Page was born on January 9th, 1944.
- 2nd challenge (dates are not entities): Jimmy Page was born on January 9th, 1944.
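A k-NN pruner can be sketched as a majority vote among the closest labeled annotations. The slide does not say which features ADEL uses, so the two features here (a capitalization flag and a linking score), the toy training examples and the names `knn_predict` and `prune` are all hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[float, ...], str]  # (feature vector, "keep" or "prune")


def knn_predict(train: Sequence[Example], query: Tuple[float, ...], k: int = 3) -> str:
    """Label a query by majority vote among its k nearest training examples."""
    nearest = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def prune(train: Sequence[Example],
          annotations: List[Tuple[str, Tuple[float, ...]]]) -> List[str]:
    """Keep only the mentions that the classifier labels as "keep"."""
    return [mention for mention, features in annotations
            if knn_predict(train, features) == "keep"]


# Toy training data, invented for illustration: (is_capitalized, linking_score).
TRAIN: List[Example] = [
    ((1.0, 0.9), "keep"), ((1.0, 0.8), "keep"), ((1.0, 0.7), "keep"),
    ((0.0, 0.3), "prune"), ((0.0, 0.2), "prune"), ((0.0, 0.1), "prune"),
]

kept = prune(TRAIN, [("France", (1.0, 0.85)),
                     ("against", (0.0, 0.15)),
                     ("Russia", (1.0, 0.8))])
```

Retraining the same classifier on annotations that follow a different guideline is one way the pruning step could adapt to a challenge that does or does not count dates as entities.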
Text Adaptivity
§ Experiments on different kinds of text by benchmarking ADEL over different challenges
Ø Tweets: NEEL2014, NEEL2015 and NEEL2016
Ø News articles: OKE2015 and OKE2016
§ Need to adapt the extractors to use a proper model for each kind of text
Ø Retrain the NER extractor with a training dataset
Type Adaptivity
§ Challenges have their own definition of types
§ In ADEL, types come from the NER extractor and from the knowledge base used
Ø NER types differ from KB types
Ø NER types and KB types differ from challenge types
§ Need a mapping between those different types. It is currently created manually.

Challenge               Types
OKE2015 and OKE2016     Person, Place, Organization, Role
NEEL2015 and NEEL2016   Person, Location, Organization, Product, Event, Thing
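A manually written type mapping of this kind can be represented as a lookup table per challenge. The source types below (Stanford-style NER labels and DBpedia ontology classes) and the individual mappings are assumptions for illustration; only the target type names come from the table above.

```python
from typing import Dict, Optional

# Hypothetical hand-written mapping: challenge -> source type -> challenge type.
TYPE_MAPPING: Dict[str, Dict[str, str]] = {
    "OKE": {
        "PERSON": "Person",
        "LOCATION": "Place",
        "ORGANIZATION": "Organization",
        "dbo:Person": "Person",
        "dbo:Place": "Place",
        "dbo:Organisation": "Organization",
    },
    "NEEL": {
        "PERSON": "Person",
        "LOCATION": "Location",
        "ORGANIZATION": "Organization",
        "dbo:Person": "Person",
        "dbo:Place": "Location",
        "dbo:Organisation": "Organization",
        "dbo:Event": "Event",
    },
}


def map_type(source_type: str, challenge: str) -> Optional[str]:
    """Translate an NER or KB type into the challenge's own type.

    Returns None when the challenge has no corresponding type, which lets
    the caller decide whether to drop or retype the annotation.
    """
    return TYPE_MAPPING[challenge].get(source_type)


oke_type = map_type("LOCATION", "OKE")
neel_type = map_type("LOCATION", "NEEL")
```

The same extracted location ends up as Place for OKE and Location for NEEL, which is the kind of divergence the mapping has to absorb.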
Knowledge Base Adaptivity
§ Joint work with Vrije Universiteit Amsterdam
§ ReCon: defines several heuristics to re-rank the candidate links provided by our system on newswire articles
Ø H1: process the article text first and disambiguate the article title at the end, because titles are often too ambiguous
Ø H2: detect co-referential entities throughout the article
Ø H3: use topic modeling to exploit a contextual knowledge base about the detected topic
Language Adaptivity
§ No results yet. The goal is to let the user choose the natural language used in the text
§ Test the framework on ETAPE, a NER challenge on French TV content from 2012
Distributed and Scalable Architecture
§ No results yet. The goal is to deploy the framework so that tasks run in a distributed and scalable way
§ Make each task (extraction, linking and pruning) independent of the others and move them out of the global architecture (taking the way Docker is developed as a model)
§ Stress test the new architecture over large streams, such as the Twitter streaming API, to detect possible bottlenecks
Evaluation Over Multiple Datasets in Linking
§ 2014 NEEL Challenge with ADEL v1 using the neleval scorer

            E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
F-measure   70.06  54.93    49.9     46.29  45.37  45.23      39.02

§ 2015 NEEL Challenge with ADEL v1 using the neleval scorer

            ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure   76.2   52.3      47.9  46.4   41.5      31.6  0

§ 2016 NEEL Challenge with ADEL v2 using the neleval scorer

            ADEL   kea    Insight  mit    ju     unimib
F-measure   61.98  54.86  38.28    36.09  35.48  33.53

§ OKE2015 Challenge with ADEL v1 using the GERBIL scorer

            ADEL   FOX    FRED
F-measure   60.75  49.88  34.73

§ OKE2016 Challenge with ADEL v2 using the neleval scorer

            ADEL
F-measure   56.5
Conclusions
§ Combining multiple techniques coming from different
domains for entity recognition and linking
§ Having developed different methods in order to make an
entity linking system adaptive to one or multiple criteria
§ Bringing a new approach with ADEL while also reusing
existing approaches with the POS and NER extractors
§ Testing ADEL over different datasets and participating in
challenges
Future Work
§ Knowledge base adaptivity
Ø Further evaluate the knowledge base and text adaptive features using the ERD dataset
Ø Evaluate the knowledge base adaptive feature using the TAC KBP dataset
Ø Experiment with the knowledge base adaptive feature using the 3cixty and ad-hoc tourism datasets
§ Language adaptivity
Ø Evaluate the language adaptive feature using the ETAPE and TAC KBP datasets
§ Modular Framework
Ø Improve the linking and the pruning with new methods (e.g. evaluate deep learning methods)
§ Type adaptivity
Ø Further evaluate the approach over more fine-grained types using the ETAPE challenge. This will raise more issues, especially with the scorers
§ Engineer and evaluate a distributed and scalable architecture on large data streams
Questions?
Thank you for listening!