Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics

Julien Plu
julien.plu@eurecom.fr
@julienplu

Use Case: Bringing Context to Documents
2016/04/14 - PhD Symposium WWW 2016 - Montréal
NEWSWIRES
TWEETS
SEARCH QUERIES
SUBTITLES

Use Case: Bringing Context to Documents
James Patrick Page, OBE (born 9 January 1944)
is an English musician, songwriter, and record
producer who achieved international success as
the guitarist and founder of the rock band Led
Zeppelin. Know More
Sort name: Page, Jimmy
Type: Person
Gender: Male
Born: 1944-01-09 (72 years ago)
Born in: Heston, Hounslow, London, United Kingdom
Country of origin: United Kingdom
Musical genre: Blues rock, psychedelic rock
Years active: 1962-1968 and since 1992
Labels: Columbia
The Yardbirds are a British rock band of the 1960s, formed in May 1963 in London, England, whose guitarists were Eric Clapton, Jeff Beck and then Jimmy Page. Know More

Six Different Problems
1. Identity of an entity
Ø Arena; Arena (magazine); Arena (TV series)
Ø Bucks County, Pennsylvania; Milwaukee Bucks
2. Knowledge bases have different coverage
3. High computation to handle large streams
4. Various types for an entity (granularity)
Ø Yannick Noah is a Tennis Player and a Singer
5. Different types of documents written in multiple languages
6. Are all phrases entities? (e.g. dates or roles)

Research Questions
1. How to adapt an entity linking system depending on
different criteria?
2. How to design an entity linking system in order to
be able to process a large amount of data in near
real time?

State Of The Art
§ The key role of entities:
Ø 70% of search queries contain at least one entity [1]
Ø Bring context to videos [2]
Ø Help generate summaries [3]
§ Current systems (e.g. TagME [4], AIDA [5], Babelfy [6] or DBpedia Spotlight [7]) are hard to parameterize and often cannot be adapted to at least one of the previous criteria
§ Those solutions are often not able to handle large streams of text
[1] Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
[2] José Luis Redondo García, Giuseppe Rizzo, Raphaël Troncy: The Concentric Nature of News Semantic Snapshots: Knowledge
Extraction for Semantic Annotation of News Items. K-CAP 2015
[3] Shruti Chhabra, Srikanta Bedathur: Towards Generating Text Summaries for Entity Chains. ECIR 2014
[4] Paolo Ferragina, Ugo Scaiella: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). CIKM 2010
[5] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, Gerhard Weikum: AIDA: An Online Tool for Accurate
Disambiguation of Named Entities in Text and Tables. PVLDB 4(12)
[6] Andrea Moro, Alessandro Raganato, Roberto Navigli: Entity Linking meets Word Sense Disambiguation: a Unified Approach.
TACL 2014
[7] Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer: DBpedia spotlight: shedding light on the web of documents.
I-SEMANTICS 2011

Methodology
We have split up this thesis into six tasks (the number in parentheses is the research question each task addresses):
(1) Text adaptivity
(1) Entity type adaptivity
(1) Knowledge base adaptivity
(1) Language adaptivity
(1-2) ADEL Modular framework
(2) Distributed and scalable architecture
[Timeline figure: Start thesis, Today, End thesis]

ADEL: Modular Framework (Extractors)
§ POS Tagger:
Ø Bidirectional CMM (left to right and right to left)
§ NER Combiner:
Ø Use a combination of CRF models with Gibbs sampling (Monte Carlo as graph inference method). A simple CRF model could be:
Labels:   PER    PER   O  O            O   O                  O   PER   PER   PER
Features: X      X     X  X            X   X                  X   X     X     X
Words:    Jimmy  Page  ,  connaissant  le  professionnalisme  de  John  Paul  Jones
X: set of features for the current word (word capitalized, previous word is “de”, next word is an NNP, …). Suppose P(PER | X, PER, O, LOC) = P(PER | X, neighbors(PER)); then X with PER is a CRF.
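The feature set X described above can be sketched as a small function. This is only an illustration, not ADEL's actual feature extractor: the function name `word_features`, the feature keys, and the hand-assigned POS tags are assumptions made for the example; only the three features themselves come from the slide.

```python
def word_features(tokens, pos_tags, i):
    """Return the feature set X for the token at position i.

    Only the three features named on the slide are computed. The
    function name and the dictionary keys are hypothetical.
    """
    return {
        "capitalized": tokens[i][:1].isupper(),
        "prev_is_de": i > 0 and tokens[i - 1].lower() == "de",
        "next_is_nnp": i + 1 < len(tokens) and pos_tags[i + 1] == "NNP",
    }


# Sentence from the slide. The POS tags below are assigned by hand for
# illustration; they are not the output of a real tagger.
tokens = ["Jimmy", "Page", ",", "connaissant", "le",
          "professionnalisme", "de", "John", "Paul", "Jones"]
pos_tags = ["NNP", "NNP", ",", "VBG", "DT", "NN", "IN", "NNP", "NNP", "NNP"]

features_john = word_features(tokens, pos_tags, 7)
features_connaissant = word_features(tokens, pos_tags, 3)
```

In a real CRF these dictionaries would be computed for every token and passed to the sequence model together with the neighbouring labels.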

ADEL: Modular Framework (Overlap Resolution)
§ Detect overlaps among extractors using the boundaries of the entities
§ Different heuristics can be applied:
Ø Merge (default behavior): “United States” and “States of America” => “United States of America”
Ø Simple Substring: “Florence” and “Florence May Harding” => “Florence” and “May Harding”
Ø Smart Substring: “Giants of New York” and “New York” => “Giants” and “New York”
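The default Merge heuristic can be sketched with character offsets. This is a minimal sketch, assuming each extractor reports a mention as a `(start, end)` pair with an exclusive end; the function name `merge_overlaps` is hypothetical and the two substring heuristics are not covered.

```python
def merge_overlaps(mentions):
    """Merge overlapping mentions into their union.

    mentions: list of (start, end) character offsets, end exclusive.
    Spans that merely touch (end == next start) are left separate.
    """
    merged = []
    for start, end in sorted(mentions):
        if merged and start < merged[-1][1]:
            # Overlap with the previous span: extend it to cover both.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = "United States of America"
# "United States" from one extractor, "States of America" from another.
spans = [(0, 13), (7, 24)]
merged = merge_overlaps(spans)
surface_forms = [text[s:e] for s, e in merged]
```

Sorting by start offset first means each span only has to be compared with the last merged one.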

ADEL: Modular Framework (Indexing)
§ Create index from DBpedia and Wikipedia
§ Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute

ADEL: Modular Framework (Linking)
§ Generate candidate links for all extracted mentions:
Ø If candidates exist, the mention goes to the linking method
Ø If not, the mention is linked to NIL
§ Linking method:
Ø ADEL linear formula:
$r(l) = \left(a \cdot L(m, \mathit{title}) + b \cdot \max L(m, R) + c \cdot \max L(m, D)\right) \cdot PR(l)$
r(l): the score of the candidate l
L: the Levenshtein distance
m: the extracted mention
title: the title of the candidate l
R: the set of redirect pages associated to the candidate l
D: the set of disambiguation pages associated to the candidate l
PR: PageRank associated to the candidate l
a, b and c are weights following the properties: a > b > c and a + b + c = 1
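The linking step can be sketched as follows. Several points are assumptions rather than facts from the slides: L is taken to be a similarity in [0, 1] derived from the Levenshtein distance (so that taking the max over redirects picks the closest one), the weighted sum is multiplied as a whole by PageRank, the weights 0.5, 0.3 and 0.2 are arbitrary values satisfying the stated constraints, and the candidate dictionaries, URIs and PageRank values are made up for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumption: turn the distance into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def score(mention, cand, a=0.5, b=0.3, c=0.2):
    """r(l) for one candidate; a > b > c and a + b + c = 1."""
    assert a > b > c and abs(a + b + c - 1.0) < 1e-9
    best_r = max((similarity(mention, r) for r in cand["redirects"]), default=0.0)
    best_d = max((similarity(mention, d) for d in cand["disambiguations"]), default=0.0)
    weighted = a * similarity(mention, cand["title"]) + b * best_r + c * best_d
    return weighted * cand["pagerank"]


def link(mention, candidates):
    """Return the best candidate URI, or NIL when there is no candidate."""
    if not candidates:
        return "NIL"
    return max(candidates, key=lambda cand: score(mention, cand))["uri"]


# Hypothetical candidates; the PageRank values are invented.
candidates = [
    {"uri": "dbr:Jimmy_Page", "title": "Jimmy Page",
     "redirects": ["James Patrick Page"], "disambiguations": ["Page"],
     "pagerank": 3.0},
    {"uri": "dbr:Page", "title": "Page",
     "redirects": [], "disambiguations": ["Page"], "pagerank": 1.0},
]
best = link("Jimmy Page", candidates)
```

Because PageRank multiplies the string-based score, a popular page can outrank a closer string match, which is why the pruning module that follows is useful.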

ADEL: Modular Framework (Pruning)
§ k-NN machine learning algorithm
§ Why a pruning module?
Ø Useful to correct the errors from the extractor by removing wrong annotations. Example:
• France played against Russia for a friendly match. (here “against” is not an entity)
• Yesterday, I went to see Against in concert. (here “Against” is an entity)
Ø Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges, NEEL2014, which counts dates as entities, and OKE2015, which does not (brackets mark the annotations kept).
• 1st challenge: [Jimmy Page] was born on [January 9th, 1944].
• 2nd challenge: [Jimmy Page] was born on January 9th, 1944.
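A k-NN pruner can be sketched in a few lines. The slides name the algorithm but not its features or training data, so everything specific here is an assumption: the three features (capitalized, tagged as proper noun, linking score), the toy training set, and the function name `knn_keep`.

```python
import math
from collections import Counter


def knn_keep(train, query, k=3):
    """Decide whether to keep an annotation by majority vote of its
    k nearest training examples (Euclidean distance).

    train: list of (feature_vector, keep_flag) pairs.
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(keep for _, keep in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (is_capitalized, is_proper_noun, linking_score).
# The examples are invented for illustration only.
train = [
    ((1.0, 1.0, 0.9), True),
    ((1.0, 1.0, 0.7), True),
    ((1.0, 0.0, 0.6), True),
    ((0.0, 0.0, 0.2), False),
    ((0.0, 0.0, 0.1), False),
    ((0.0, 1.0, 0.1), False),
]

# "against" in "France played against Russia": lowercase, weak link.
keep_preposition = knn_keep(train, (0.0, 0.0, 0.15))
# "Against" in "I went to see Against in concert": capitalized, strong link.
keep_band = knn_keep(train, (1.0, 1.0, 0.8))
```

Adapting to a guideline such as "dates are not entities" would then amount to retraining on examples annotated under that guideline.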

Text Adaptivity
§ Experiments on different kinds of text by benchmarking ADEL over different challenges
Ø Tweets: NEEL2014, NEEL2015 and NEEL2016
Ø News articles: OKE2015 and OKE2016
§ Need to adapt the extractors to use a proper model to handle different kinds of text
Ø Retrain the NER extractor with a training dataset

Type Adaptivity
§ Challenges have their own definition of types
§ In ADEL, types come from the NER extractor and the knowledge base used
Ø NER types differ from KB types
Ø NER types and KB types differ from challenge types
§ Need a mapping between those different types. It is currently built manually.
Challenge                Types
OKE2015 and OKE2016      Person, Place, Organization, Role
NEEL2015 and NEEL2016    Person, Location, Organization, Product, Event, Thing
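A manual type mapping of this kind can be sketched as a lookup table. The challenge type sets are taken from the slide; the source-side labels (`PERSON`, `LOCATION`, `ORGANIZATION`, `dbo:Person`, `dbo:Place`, `dbo:Organisation`) and the individual mapping entries are illustrative assumptions, not ADEL's actual mapping.

```python
# Target types per challenge family, as listed on the slide.
CHALLENGE_TYPES = {
    "OKE": {"Person", "Place", "Organization", "Role"},
    "NEEL": {"Person", "Location", "Organization", "Product", "Event", "Thing"},
}

# Hypothetical hand-written mapping from NER and KB types to challenge types.
TYPE_MAPPING = {
    "OKE": {
        "PERSON": "Person",
        "LOCATION": "Place",
        "ORGANIZATION": "Organization",
        "dbo:Person": "Person",
        "dbo:Place": "Place",
        "dbo:Organisation": "Organization",
    },
    "NEEL": {
        "PERSON": "Person",
        "LOCATION": "Location",
        "ORGANIZATION": "Organization",
        "dbo:Person": "Person",
        "dbo:Place": "Location",
        "dbo:Organisation": "Organization",
    },
}


def map_type(source_type, challenge):
    """Return the challenge type for a NER or KB type, or None if unmapped."""
    return TYPE_MAPPING[challenge].get(source_type)


oke_location = map_type("LOCATION", "OKE")
neel_location = map_type("dbo:Place", "NEEL")
unmapped = map_type("DATE", "OKE")
```

Returning `None` for unmapped types leaves the decision (drop the annotation or fall back to a generic type) to the caller.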

Knowledge Base Adaptivity
§ Joint work with Vrije Universiteit Amsterdam
§ ReCon: defines several heuristics to re-rank the candidate links provided by our system on newswire articles
Ø H1: process the article text first and disambiguate the article title at the end, because titles are often too ambiguous
Ø H2: detect co-referential entities throughout the article
Ø H3: use topic modeling to exploit a contextual knowledge base about the topic found

Language Adaptivity
§ No results yet. The goal is to let the user choose the natural language used in the text
§ Test the framework on ETAPE, which is a NER challenge on French TV content from 2012

Distributed and Scalable Architecture
§ No results yet. The goal is to be able to deploy the framework so that the tasks run in a distributed and scalable way
§ Make each task (extraction, linking and pruning) independent of the others and move it out of the global architecture (taking the way Docker is designed as a model)
§ Stress test the new architecture over large streams, such as the Twitter streaming API, to detect possible bottlenecks

Evaluation Over Multiple Datasets in Linking
§ 2014 NEEL Challenge with ADEL v1 using the neleval scorer
§ OKE2015 Challenge with ADEL v1 using the GERBIL scorer
§ OKE2016 Challenge with ADEL v2 using the neleval scorer

System:     E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
F-measure:  70.06  54.93    49.9     46.29  45.37  45.23      39.02

System:     ADEL   FOX    FRED
F-measure:  60.75  49.88  34.73

System:     ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure:  76.2   52.3      47.9  46.4   41.5      31.6  0

System:     ADEL   kea    Insight  mit    ju     unimib
F-measure:  61.98  54.86  38.28    36.09  35.48  33.53

System:     ADEL
F-measure:  56.5

Conclusions
§ Combining multiple techniques coming from different
domains for entity recognition and linking
§ Having developed different methods in order to make an
entity linking system adaptive to one or multiple criteria
§ Bringing a new approach with ADEL while also reusing
existing approaches with the POS and NER extractors
§ Testing ADEL over different datasets and participating in
challenges

Future Work
§ Knowledge base adaptivity
Ø Further evaluate the knowledge base and text adaptive features using the ERD dataset
Ø Evaluate the knowledge base adaptive feature using the TAC KBP dataset
Ø Experiment with the knowledge base adaptive feature using the 3cixty and ad-hoc tourism datasets
§ Language adaptivity
Ø Evaluate the language adaptive feature using the ETAPE and TAC KBP datasets
§ Modular Framework
Ø Improve the linking and the pruning with new methods (e.g. evaluate deep learning methods)
§ Type adaptivity
Ø Further evaluate the approach over more fine-grained types using the ETAPE challenge. This will raise more issues, especially with the scorers
§ Engineer and evaluate a distributed and scalable architecture on large data streams

Questions?
Thank you for listening!
