Seed Selection for Distantly Supervised Web-Based Relation Extraction
Isabelle Augenstein
Department of Computer Science, University of Sheffield, UK
i.augenstein@dcs.shef.ac.uk
August 24, 2014
Semantic Web for Information Extraction (SWAIE) Workshop, COLING 2014
Motivation
•  Goal: extraction of relations in text on Web pages (e.g.
Mashable) with respect to a knowledge base (e.g. Freebase)
Seed Selection for Distantly Supervised Web-Based
Relation Extraction (SWAIE Workshop)
Isabelle Augenstein
•  What are possible methodologies?
•  Supervised learning: manually annotate text, train machine learning
classifier -> manual effort
•  Unsupervised learning: extract language patterns, cluster similar ones
-> difficult to map to KB, lower precision than supervised method
•  Semi-supervised learning: start with a small number of language
patterns, iteratively learn more (bootstrapping)
-> still manual effort, semantic drift (unwanted shift in meaning)
•  Distant supervision: automatically label text with relations from
knowledge base, train machine learning classifier
-> allows relations to be extracted with respect to the KB, reasonably high
precision, no manual effort
Distant Supervision
•  Pipeline: creating positive & negative training examples -> feature
extraction -> classifier training -> prediction of new relations
•  Distant supervision = supervised learning + automatically generated
training data
Generating training data
“If two entities participate in a relation, any sentence that contains those two
entities might express that relation.” (Mintz, 2009)
Example sentences:
•  Amy Jade Winehouse was a singer and songwriter known for her eclectic
mix of musical genres including R&B, soul and jazz.
•  Blur helped to popularise the Britpop genre.
•  Beckham rose to fame with the all-female pop group Spice Girls.
Knowledge base entries:
Name | Genre | …
Amy Winehouse, Amy Jade Winehouse, Wino, … (different lexicalisations) | R&B, soul, jazz, … | …
Blur, … | Britpop, … | …
Spice Girls, … | pop, … | …
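The labelling heuristic quoted above can be sketched in a few lines. This is only an illustration, assuming a toy in-memory knowledge base and plain substring matching; the names `KB` and `label_sentence` are hypothetical and are not taken from the original system.

```python
# Toy knowledge base: (subject lexicalisations, relation, object lexicalisations).
# The entries mirror the slide example; the data structure is an assumption.
KB = [
    ({"Amy Winehouse", "Amy Jade Winehouse", "Wino"}, "genre", {"R&B", "soul", "jazz"}),
    ({"Blur"}, "genre", {"Britpop"}),
    ({"Spice Girls"}, "genre", {"pop"}),
]

def label_sentence(sentence):
    """Return (subject, relation, object) triples whose subject and
    object lexicalisations both occur in the sentence."""
    labels = []
    for subjects, relation, objects in KB:
        for subj in sorted(subjects):
            if subj not in sentence:
                continue
            for obj in sorted(objects):
                if obj in sentence:
                    labels.append((subj, relation, obj))
    return labels

print(label_sentence("Blur helped to popularise the Britpop genre."))
# prints [('Blur', 'genre', 'Britpop')]
```

Substring matching is deliberately naive here: it shows why ambiguous lexicalisations produce noisy labels, which is the problem the seed selection methods address.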
Generating training data: is it that easy?
Let It Be is the twelfth album by The Beatles which contains their hit
single Let It Be.
Name | Album | Track
The Beatles, … | Let It Be, … | Let It Be, …
•  Use ‘Let It Be’ mentions as positive training examples for
album or for track?
•  Problem: if both mentions of ‘Let It Be’ are used to extract
features for both album and track, wrong weights are learnt
•  How can such ambiguous examples be detected?
•  Develop methods to detect, then automatically discard
potentially ambiguous positive and negative training data
Seed Selection: ambiguity within an entity
•  Example: Let It Be is the twelfth album by The Beatles
which contains their hit single Let It Be.
•  Let It Be can be both an album and a track of the musical artist
The Beatles
•  For every relation, consisting of a subject, a property and an
object (s, p, o), is the subject related to (at least) two different
objects with the same lexicalisation which express two
different relations?
•  Unam:
•  Retrieve the number of such senses using the Freebase API
•  Discard the lexicalisation of the object as positive training data if it has at
least two different senses within an entity
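A minimal sketch of the Unam idea, assuming the knowledge base is available as an in-memory list of (subject, property, object lexicalisation) triples instead of being queried through the Freebase API. The function name `unam_filter`, the simplification of counting distinct properties as senses, and the extra "Abbey Road" triple are illustrative assumptions.

```python
from collections import defaultdict

def unam_filter(triples):
    """Drop triples whose object lexicalisation is linked to the same
    subject by two or more different properties."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[(s, o)].add(p)
    # Keep a triple only if its (subject, lexicalisation) pair is unambiguous.
    return [(s, p, o) for s, p, o in triples if len(props[(s, o)]) < 2]

triples = [
    ("The Beatles", "album", "Let It Be"),
    ("The Beatles", "track", "Let It Be"),
    ("The Beatles", "album", "Abbey Road"),  # illustrative extra triple
]

print(unam_filter(triples))
# prints [('The Beatles', 'album', 'Abbey Road')]
```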
Seed Selection: ambiguity across classes
•  Example: common names of book authors or common genres,
e.g. “Jack mentioned that he read On the Road”, in which
Jack is falsely recognised as the author Jack Kerouac.
•  Stop: discard object lexicalisations that are stopwords
•  Stat: Estimate how ambiguous a lexicalisation of an object is
compared to other lexicalisations of objects of the same relation
•  For every lexicalisation of an object of a relation, retrieve the number of
senses using the Freebase API (example: for Jack n=1066)
•  Compute frequency distribution per relation with min, max, median (50th
percentile), lower (25th percentile) and upper quartile (75th percentile)
(example: for author: min=0, max=3059, median=10, lower=4, upper=32)
•  For every lexicalisation of an object of a relation, if the number of senses >
upper quartile (or the lower quartile, or median, depending on the model),
discard it (example: 1066 > 32 -> Jack will be discarded)
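The Stat filter can be sketched as below, assuming the sense counts have already been retrieved into a dictionary. Only the count for "Jack" (1066) comes from the slide; the other counts, the placeholder names and the function name `stat_filter` are made up for illustration, and the quartiles are computed with Python's `statistics.quantiles`, which may not match the exact quartile definition used in the original work.

```python
import statistics

def stat_filter(sense_counts, percentile=75):
    """Keep lexicalisations whose number of senses does not exceed the
    chosen percentile (25, 50 or 75) of the per-relation distribution."""
    if percentile not in (25, 50, 75):
        raise ValueError("percentile must be 25, 50 or 75")
    q1, q2, q3 = statistics.quantiles(
        list(sense_counts.values()), n=4, method="inclusive"
    )
    threshold = {25: q1, 50: q2, 75: q3}[percentile]
    return {lex: n for lex, n in sense_counts.items() if n <= threshold}

# Hypothetical sense counts for object lexicalisations of one relation.
sense_counts = {"Jack": 1066, "name_a": 1, "name_b": 2, "name_c": 5, "name_d": 12}

print(sorted(stat_filter(sense_counts, 75)))
# prints ['name_a', 'name_b', 'name_c', 'name_d']
```

Passing `percentile=25` or `percentile=50` corresponds to the stricter Stat25 and Stat50 variants, which discard more training data.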
Seed Selection: discarding negative seeds
•  Creating negative training data: all entities which appear in the
sentence with the subject, but are not in a relation with it, will
be used as negative training data
•  Problem: knowledge bases are incomplete
•  Idea: object lexicalisations are often shared across entities,
e.g. for the relation genre
•  Check if an unknown lexicalisation is a lexicalisation of a
different relation
•  Incomp: for every lexicalisation l of a property, discard it as
negative training data if any of the properties of the class we
examine has an object lexicalisation l
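A minimal sketch of the Incomp check, assuming exact string matching and an in-memory mapping from each property of the class to its known object lexicalisations. The names `incomp_filter` and `musical_artist`, and the candidate "London", are hypothetical.

```python
def incomp_filter(negative_candidates, class_object_lexicalisations):
    """Discard negative candidates that are a known object lexicalisation
    of any property of the class under examination."""
    known = set().union(*class_object_lexicalisations.values())
    return [c for c in negative_candidates if c not in known]

# Toy object lexicalisations per property for the class Musical Artist.
musical_artist = {
    "genre": {"R&B", "soul", "jazz", "Britpop", "pop"},
    "album": {"Let It Be"},
}

print(incomp_filter(["jazz", "London", "Let It Be"], musical_artist))
# prints ['London']
```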
Distant Supervision system: corpus
•  Web crawl corpus, created using entity-specific search
queries, consisting of 450k Web pages
Class | Property / Relation
Book | author, characters, publication date, genre, ISBN, original language
Musical Artist | album, active (start), active (end), genre, record label, origin, track
Politician | birthdate, birthplace, educational institution, nationality, party, religion, spouses
Distant Supervision system: relation candidate identification
•  Two step process: recognise entities, then check if they
appear in Freebase
•  Use Stanford 7-class NERC to identify named entities (NEs)
•  Problem: domain-specific entities (e.g. album, track) are often
not recognised
•  Solution: use heuristic to recognise (but not classify) more
NEs
•  Get all sequences of capitalised words, and noun sequences
•  Every subsequence of those sequences is a potential NE,
“pop music” -> “pop music”, “pop”, “music”
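The subsequence step can be illustrated as follows, assuming that "subsequence" means a contiguous run of tokens and that the input is already tokenised. The function name `candidate_subsequences` is hypothetical.

```python
def candidate_subsequences(tokens):
    """Return every contiguous subsequence of a token sequence,
    longest first, as space-joined strings."""
    n = len(tokens)
    spans = []
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            spans.append(" ".join(tokens[start:start + length]))
    return spans

print(candidate_subsequences(["pop", "music"]))
# prints ['pop music', 'pop', 'music']
```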
Distant Supervision system
•  Features: standard relation extraction features, including BOW
features, POS-based features, NE class
•  Classifier: first-order CRF
•  Model: one model per Freebase class to classify a relation
candidate (occurrence) into one of the different properties or
NONE, apply to respective part of corpus
•  Seed Selection: apply different seed selection methods Unam,
Stop, Stat75, Stat50, Stat25, Incomp
•  Merging and Ranking: aggregate predictions of occurrences
with same surface form
•  E.g.: Dublin could have predictions MusicalArtist:album, origin and NONE
•  Compute the mean of the confidence values per label, select the highest ranked one
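The merging and ranking step might look like the sketch below, assuming each occurrence is a (surface form, predicted label, confidence) tuple. The confidence values and the function name `merge_predictions` are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def merge_predictions(occurrences):
    """Average the confidences per (surface form, label) and return the
    highest-ranked label with its mean confidence for each surface form."""
    scores = defaultdict(lambda: defaultdict(list))
    for surface, label, confidence in occurrences:
        scores[surface][label].append(confidence)
    best = {}
    for surface, by_label in scores.items():
        averaged = {label: mean(vals) for label, vals in by_label.items()}
        top = max(averaged, key=averaged.get)
        best[surface] = (top, averaged[top])
    return best

# Hypothetical predictions for occurrences of the surface form "Dublin".
occurrences = [
    ("Dublin", "origin", 0.9),
    ("Dublin", "origin", 0.7),
    ("Dublin", "album", 0.4),
    ("Dublin", "NONE", 0.5),
]

label, confidence = merge_predictions(occurrences)["Dublin"]
print(label)  # prints origin
```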
Evaluation
•  Split corpus equally for training and test
•  Hand-annotate the portion of the test corpus which has NONE
prediction (no representation in Freebase)
Results: precision per model, ranked by confidence
[Figure: precision (y-axis, 0.7 to 1) against minimum confidence (x-axis,
0 to 0.9; e.g. 0.1 covers all occurrences with confidence >= 0.1), one
curve per model]
Overall precision:
•  unam_stop_stat25: 0.896
•  unam_stop_stat50: 0.882
•  unam_stop_stat75: 0.873
•  unam_stop: 0.842
•  stop: 0.84
•  baseline: 0.825
•  incomp: 0.74
Results Summary
•  Best-performing model (unam_stop_stat25) has a precision of
0.896, compared to baseline model with precision of 0.825,
reducing the error rate by 35%
•  However, these seed selection methods all come at the cost of a
small loss in the number of extractions (20%), because they reduce
the amount of training data
•  Removing potentially false negative training data (incomp)
does not perform well
•  Too many training examples are removed
•  The training examples which are removed are lexicalisations which
have the same types of values; those are crucial for learning
•  Especially poor performance for numerical values
Related work on dealing with noise for distant supervision
•  At-least-one models:
•  Relaxed distant supervision assumption: assume that at least one of the
relation mentions is a true relation mention
•  Graphical models, inference based on ranking
•  Very challenging to train
•  Hierarchical topic models:
•  Only learn from positive training examples
•  Pre-processing with multi-layer topic model to group extraction patterns
to determine which ones are specific for each relation and which are not
•  Pattern correlations:
•  Probabilistic graphical model to group extraction patterns
•  Information Retrieval approach:
•  Pseudo relevance feedback, re-rank extraction patterns
Comparison of our approach to dealing with ambiguity with
related approaches
•  Related approaches all try to solve the problem of ambiguity
within the machine learning model
•  Our approach deals with ambiguity as a pre-processing step for
creating training data
•  While related approaches try to address the problem of noisy
data by using more complicated models, we explored how to
exploit background data from the KB even further
•  We explored how simple statistical methods based on
data already present in the knowledge base can help to filter
unreliable training data
Conclusions
•  Simple, statistical methods based on background knowledge
present in KB perform well at detecting ambiguous training
examples
•  Error reduction of up to 35% can be achieved by strategically
selecting seed data
•  The increase in precision is encouraging; however, it comes at
the expense of the number of extractions (20% fewer
extractions)
•  Higher recall could be achieved by increasing the number of
training instances initially
•  Use a bigger corpus
•  Make better use of knowledge contained in corpus
Future Work
•  Distantly supervised named entity classification for relation
extraction to improve performance for non-standard entities,
joint models for NER and RE
•  Relax distant supervision assumption to achieve a higher
number of extractions: extract relations across sentence
boundaries, coreference resolution
•  Combined extraction models for information from text, lists
and tables on Web pages to improve precision and recall
Thank you for
your attention!
Questions?
Isabelle Augenstein

Contenu connexe

En vedette

Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine ReadingIsabelle Augenstein
 
Lodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextLodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextIsabelle Augenstein
 
Seed Selection & Storage
Seed Selection & StorageSeed Selection & Storage
Seed Selection & StorageSeeds
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Andre Freitas
 
Information Extraction with Linked Data
Information Extraction with Linked DataInformation Extraction with Linked Data
Information Extraction with Linked DataIsabelle Augenstein
 
Natural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebNatural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebIsabelle Augenstein
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 
Semantic Search Over The Web
Semantic Search Over The WebSemantic Search Over The Web
Semantic Search Over The Webalierkan
 
Management information system question and answers
Management information system question and answersManagement information system question and answers
Management information system question and answerspradeep acharya
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question AnsweringSujit Pal
 
Intro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringIntro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringTraian Rebedea
 
agriculture of pakistan
agriculture of pakistanagriculture of pakistan
agriculture of pakistanUmair Riaz
 
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...Glen Cathey
 
Web 3.0 The Semantic Web
Web 3.0 The Semantic WebWeb 3.0 The Semantic Web
Web 3.0 The Semantic WebHatem Mahmoud
 

En vedette (19)

Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
 
Lodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextLodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured Text
 
Seed Selection & Storage
Seed Selection & StorageSeed Selection & Storage
Seed Selection & Storage
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
Corporate farming pakistan
Corporate farming pakistanCorporate farming pakistan
Corporate farming pakistan
 
Mapping Keywords to
Mapping Keywords to Mapping Keywords to
Mapping Keywords to
 
Human Neural Machine
Human Neural MachineHuman Neural Machine
Human Neural Machine
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
Information Extraction with Linked Data
Information Extraction with Linked DataInformation Extraction with Linked Data
Information Extraction with Linked Data
 
Natural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebNatural Language Processing for the Semantic Web
Natural Language Processing for the Semantic Web
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
Semantic Search Over The Web
Semantic Search Over The WebSemantic Search Over The Web
Semantic Search Over The Web
 
Management information system question and answers
Management information system question and answersManagement information system question and answers
Management information system question and answers
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question Answering
 
Intro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringIntro to Deep Learning for Question Answering
Intro to Deep Learning for Question Answering
 
agriculture of pakistan
agriculture of pakistanagriculture of pakistan
agriculture of pakistan
 
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
 
Agriculture in pakistan
Agriculture in pakistanAgriculture in pakistan
Agriculture in pakistan
 
Web 3.0 The Semantic Web
Web 3.0 The Semantic WebWeb 3.0 The Semantic Web
Web 3.0 The Semantic Web
 

Similaire à Seed Selection for Distantly Supervised Web-Based Relation Extraction

Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Epistemic networks for Epistemic Commitments
Epistemic networks for Epistemic CommitmentsEpistemic networks for Epistemic Commitments
Epistemic networks for Epistemic CommitmentsSimon Knight
 
Discovery Hub: on-the-fly linked data exploratory search
Discovery Hub: on-the-fly linked data exploratory searchDiscovery Hub: on-the-fly linked data exploratory search
Discovery Hub: on-the-fly linked data exploratory searchFabien Gandon
 
Modeling missing data in distant supervision for information extraction (Ritt...
Modeling missing data in distant supervision for information extraction (Ritt...Modeling missing data in distant supervision for information extraction (Ritt...
Modeling missing data in distant supervision for information extraction (Ritt...Naoaki Okazaki
 
Metric Learning for Music Discovery with Source and Target Playlists
Metric Learning for Music Discovery with Source and Target PlaylistsMetric Learning for Music Discovery with Source and Target Playlists
Metric Learning for Music Discovery with Source and Target PlaylistsYing-Shu Kuo
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSaeedeh Shekarpour
 
Ashford university gen 103 week 3
Ashford university gen 103 week 3Ashford university gen 103 week 3
Ashford university gen 103 week 3leesa marteen
 
Sergey Nikolenko and Anton Alekseev User Profiling in Text-Based Recommende...
Sergey Nikolenko and  Anton Alekseev  User Profiling in Text-Based Recommende...Sergey Nikolenko and  Anton Alekseev  User Profiling in Text-Based Recommende...
Sergey Nikolenko and Anton Alekseev User Profiling in Text-Based Recommende...AIST
 
Watson at RPI - Summer 2013
Watson at RPI - Summer 2013Watson at RPI - Summer 2013
Watson at RPI - Summer 2013James Hendler
 
Universal Dependencies
Universal DependenciesUniversal Dependencies
Universal DependenciesTeresa Lynn
 
Data Science 101
Data Science 101Data Science 101
Data Science 101ideatoipo
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisStuart Wrigley
 

Similaire à Seed Selection for Distantly Supervised Web-Based Relation Extraction (15)

Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Epistemic networks for Epistemic Commitments
Epistemic networks for Epistemic CommitmentsEpistemic networks for Epistemic Commitments
Epistemic networks for Epistemic Commitments
 
Discovery Hub: on-the-fly linked data exploratory search
Discovery Hub: on-the-fly linked data exploratory searchDiscovery Hub: on-the-fly linked data exploratory search
Discovery Hub: on-the-fly linked data exploratory search
 
Modeling missing data in distant supervision for information extraction (Ritt...
Modeling missing data in distant supervision for information extraction (Ritt...Modeling missing data in distant supervision for information extraction (Ritt...
Modeling missing data in distant supervision for information extraction (Ritt...
 
Metric Learning for Music Discovery with Source and Target Playlists
Metric Learning for Music Discovery with Source and Target PlaylistsMetric Learning for Music Discovery with Source and Target Playlists
Metric Learning for Music Discovery with Source and Target Playlists
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
 
Ashford university gen 103 week 3
Ashford university gen 103 week 3Ashford university gen 103 week 3
Ashford university gen 103 week 3
 
Sergey Nikolenko and Anton Alekseev User Profiling in Text-Based Recommende...
Sergey Nikolenko and  Anton Alekseev  User Profiling in Text-Based Recommende...Sergey Nikolenko and  Anton Alekseev  User Profiling in Text-Based Recommende...
Sergey Nikolenko and Anton Alekseev User Profiling in Text-Based Recommende...
 
Watson at RPI - Summer 2013
Watson at RPI - Summer 2013Watson at RPI - Summer 2013
Watson at RPI - Summer 2013
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Universal Dependencies
Universal DependenciesUniversal Dependencies
Universal Dependencies
 
Data Science 101
Data Science 101Data Science 101
Data Science 101
 
Survey on Open IE
Survey on Open IESurvey on Open IE
Survey on Open IE
 
NLP & DBpedia
 NLP & DBpedia NLP & DBpedia
NLP & DBpedia
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log Analysis
 

Plus de Isabelle Augenstein

Beyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationBeyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationIsabelle Augenstein
 
Automatically Detecting Scientific Misinformation
Automatically Detecting Scientific MisinformationAutomatically Detecting Scientific Misinformation
Automatically Detecting Scientific MisinformationIsabelle Augenstein
 
Accountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingAccountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingIsabelle Augenstein
 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science CommunicationIsabelle Augenstein
 
Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Isabelle Augenstein
 
Towards Explainable Fact Checking
Towards Explainable Fact CheckingTowards Explainable Fact Checking
Towards Explainable Fact CheckingIsabelle Augenstein
 
Tracking False Information Online
Tracking False Information OnlineTracking False Information Online
Tracking False Information OnlineIsabelle Augenstein
 
What can typological knowledge bases and language representations tell us abo...
What can typological knowledge bases and language representations tell us abo...What can typological knowledge bases and language representations tell us abo...
What can typological knowledge bases and language representations tell us abo...Isabelle Augenstein
 
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Isabelle Augenstein
 
Learning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyondLearning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyondIsabelle Augenstein
 
Learning to read for automated fact checking
Learning to read for automated fact checkingLearning to read for automated fact checking
Learning to read for automated fact checkingIsabelle Augenstein
 
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...Isabelle Augenstein
 
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...Isabelle Augenstein
 
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...Isabelle Augenstein
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Isabelle Augenstein
 
Extracting Relations between Non-Standard Entities using Distant Supervision ...
Extracting Relations between Non-Standard Entities using Distant Supervision ...Extracting Relations between Non-Standard Entities using Distant Supervision ...
Extracting Relations between Non-Standard Entities using Distant Supervision ...Isabelle Augenstein
 

Plus de Isabelle Augenstein (17)

Beyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationBeyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific Communication
 
Automatically Detecting Scientific Misinformation
Automatically Detecting Scientific MisinformationAutomatically Detecting Scientific Misinformation
Automatically Detecting Scientific Misinformation
 
Accountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingAccountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact Checking
 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science Communication
 
Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)
 
Explainability for NLP
Explainability for NLPExplainability for NLP
Explainability for NLP
 
Towards Explainable Fact Checking
Towards Explainable Fact CheckingTowards Explainable Fact Checking
Towards Explainable Fact Checking
 
Tracking False Information Online
Tracking False Information OnlineTracking False Information Online
Tracking False Information Online
 



Seed Selection for Distantly Supervised Web-Based Relation Extraction

  • 1. Seed Selection for Distantly Supervised Web-Based Relation Extraction Isabelle Augenstein Department of Computer Science, University of Sheffield, UK i.augenstein@dcs.shef.ac.uk August 24, 2014 Semantic Web for Information Extraction (SWAIE) Workshop, COLING 2014
  • 2. Motivation
      •  Goal: extraction of relations in text on Web pages (e.g. Mashable) with respect to a knowledge base (e.g. Freebase)
  • 3. Motivation
      •  Goal: extraction of relations in text on Web pages (e.g. Mashable) with respect to a knowledge base (e.g. Freebase)
      •  What are possible methodologies?
          •  Supervised learning: manually annotate text, train machine learning classifier
          •  Unsupervised learning: extract language patterns, cluster similar ones
          •  Semi-supervised learning: start with a small number of language patterns, iteratively learn more (bootstrapping)
          •  Distant supervision: automatically label text with relations from knowledge base, train machine learning classifier
  • 4. Motivation
      •  Goal: extraction of relations in text on Web pages (e.g. Mashable) with respect to a knowledge base (e.g. Freebase)
      •  What are possible methodologies?
          •  Supervised learning: manually annotate text, train machine learning classifier
             -> manual effort
          •  Unsupervised learning: extract language patterns, cluster similar ones
             -> difficult to map to KB, lower precision than supervised methods
          •  Semi-supervised learning: start with a small number of language patterns, iteratively learn more (bootstrapping)
             -> still manual effort, semantic drift (unwanted shift in meaning)
          •  Distant supervision: automatically label text with relations from knowledge base, train machine learning classifier
             -> allows relations to be extracted with respect to the KB, reasonably high precision, no manual effort
  • 5. Distant Supervision
      Pipeline stages, in order:
      •  Creating positive & negative training examples
      •  Feature Extraction
      •  Classifier Training
      •  Prediction of New Relations
  • 6. Distant Supervision
      Pipeline stages, in order:
      •  Creating positive & negative training examples
      •  Feature Extraction
      •  Classifier Training
      •  Prediction of New Relations
      Distant supervision = automatically generated training data + supervised learning
  • 7. Generating training data
      “If two entities participate in a relation, any sentence that contains those two entities might express that relation.” (Mintz, 2009)
      Example sentences:
      •  Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
      •  Blur helped to popularise the Britpop genre.
      •  Beckham rose to fame with the all-female pop group Spice Girls.
      Knowledge base excerpt (the Name column lists different lexicalisations of the same entity):
      Name                                         Genre                …
      Amy Winehouse, Amy Jade Winehouse, Wino, …   R&B, soul, jazz, …   …
      Blur, …                                      Britpop, …           …
      Spice Girls, …                               pop, …               …
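The labelling heuristic on slide 7 can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the toy `KB` and `SUBJECT_NAMES` dictionaries, the function name `label_sentence`, and the naive substring matching are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: subject -> property -> object lexicalisations.
KB: Dict[str, Dict[str, List[str]]] = {
    "Amy Winehouse": {"genre": ["R&B", "soul", "jazz"]},
    "Blur": {"genre": ["Britpop"]},
}

# Hypothetical table of the different lexicalisations of each subject.
SUBJECT_NAMES: Dict[str, List[str]] = {
    "Amy Winehouse": ["Amy Winehouse", "Amy Jade Winehouse", "Wino"],
    "Blur": ["Blur"],
}


def label_sentence(sentence: str) -> List[Tuple[str, str, str]]:
    """Return (subject, property, object) triples the sentence *might* express.

    A triple is proposed whenever a subject lexicalisation and one of its
    known object lexicalisations both occur in the sentence (naive substring
    matching, which a real system would replace with entity recognition).
    """
    labels = []
    for subject, names in SUBJECT_NAMES.items():
        if not any(name in sentence for name in names):
            continue
        for prop, objects in KB[subject].items():
            for obj in objects:
                if obj in sentence:
                    labels.append((subject, prop, obj))
    return labels


sentence = ("Amy Jade Winehouse was a singer and songwriter known for her "
            "eclectic mix of musical genres including R&B, soul and jazz.")
print(label_sentence(sentence))
```

Every triple returned this way is only a candidate positive example; the later slides are about deciding which of these candidates are too ambiguous to keep.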
  • 8. Generating training data: is it that easy?
      Example: Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
      Knowledge base excerpt:
      Name            Album          Track
      The Beatles     …              …
                      Let It Be      Let It Be
                      …              …
  • 9. Generating training data: is it that easy?
      Example: Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
      Knowledge base excerpt:
      Name            Album          Track
      The Beatles     …              …
                      Let It Be      Let It Be
                      …              …
      •  Use ‘Let It Be’ mentions as positive training examples for album or for track?
      •  Problem: if both mentions of ‘Let It Be’ are used to extract features for both album and track, wrong weights are learnt
      •  How can such ambiguous examples be detected?
      •  Develop methods to detect, then automatically discard potentially ambiguous positive and negative training data
  • 10. Seed Selection: ambiguity within an entity
      •  Example: Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
      •  Let It Be can be both an album and a track of the musical artist The Beatles
      •  For every relation, consisting of a subject, a property and an object (s, p, o): is the subject related to (at least) two different objects with the same lexicalisation which express two different relations?
      •  Unam:
          •  Retrieve the number of such senses using the Freebase API
          •  Discard the lexicalisation of the object as positive training data if it has at least two different senses within an entity
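A rough sketch of the Unam filter from slide 10 is given below. It assumes the facts for one subject are already available as a local dictionary instead of being fetched through the Freebase API; the function name `unam_filter` and the extra album and track entries are illustrative additions, not part of the slides.

```python
from collections import defaultdict
from typing import Dict, List


def unam_filter(entity_facts: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop object lexicalisations that are ambiguous within one entity.

    entity_facts maps each property of a single subject to the object
    lexicalisations known for it. A lexicalisation that appears under two
    or more properties has at least two senses for this subject and is
    discarded as positive training data.
    """
    properties_per_lex = defaultdict(set)
    for prop, lexicalisations in entity_facts.items():
        for lex in lexicalisations:
            properties_per_lex[lex].add(prop)
    return {
        prop: [lex for lex in lexicalisations if len(properties_per_lex[lex]) < 2]
        for prop, lexicalisations in entity_facts.items()
    }


# Toy facts for The Beatles; "Abbey Road" and "Hey Jude" are added for illustration.
beatles = {
    "album": ["Let It Be", "Abbey Road"],
    "track": ["Let It Be", "Hey Jude"],
}
print(unam_filter(beatles))
```

In this toy case "Let It Be" is removed from both properties, so neither the album nor the track classifier is trained on a mention whose sense cannot be determined.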
  • 11. Seed Selection: ambiguity across classes
      •  Example: common names of book authors or common genres, e.g. “Jack mentioned that he read On the Road”, in which Jack is falsely recognised as the author Jack Kerouac.
      •  Stop: remove common words that are stopwords
      •  Stat: estimate how ambiguous a lexicalisation of an object is compared to other lexicalisations of objects of the same relation
          •  For every lexicalisation of an object of a relation, retrieve the number of senses using the Freebase API (example: for Jack n=1066)
          •  Compute frequency distribution per relation with min, max, median (50th percentile), lower quartile (25th percentile) and upper quartile (75th percentile) (example: for author: min=0, max=3059, median=10, lower=4, upper=32)
          •  For every lexicalisation of an object of a relation, if the number of senses > upper quartile (or the lower quartile, or median, depending on the model), discard it (example: 1066 > 32 -> Jack will be discarded)
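The Stat filter from slide 11 could be sketched as below. The sense counts are a toy distribution: only the value for "Jack" (1066) comes from the slide, while the names `lex_a` to `lex_d`, their counts, the function name `stat_filter`, and the choice of the "inclusive" quantile method are assumptions made for illustration.

```python
import statistics
from typing import Dict, List


def stat_filter(sense_counts: Dict[str, int], percentile: int = 75) -> List[str]:
    """Keep lexicalisations whose sense count does not exceed a quartile threshold.

    sense_counts maps each object lexicalisation of one relation to its
    number of senses. percentile must be 25, 50 or 75, mirroring the
    Stat25, Stat50 and Stat75 models.
    """
    if percentile not in (25, 50, 75):
        raise ValueError("percentile must be 25, 50 or 75")
    lower, median, upper = statistics.quantiles(
        sense_counts.values(), n=4, method="inclusive"
    )
    threshold = {25: lower, 50: median, 75: upper}[percentile]
    return [lex for lex, n in sense_counts.items() if n <= threshold]


# Toy counts for the relation "author"; only Jack's count is taken from the slide.
author_senses = {"Jack": 1066, "lex_a": 2, "lex_b": 4, "lex_c": 10, "lex_d": 32}
print(stat_filter(author_senses, percentile=75))
```

Lowering the percentile makes the filter stricter: with this toy data Stat75 discards only "Jack", whereas Stat25 keeps just the two least ambiguous lexicalisations.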
  • 12. Seed Selection: discarding negative seeds
      •  Creating negative training data: all entities which appear in the sentence with the subject, but are not in a relation with it, will be used as negative training data
      •  Problem: knowledge bases are incomplete
      •  Idea: object lexicalisations are often shared across entities, e.g. for the relation genre
      •  Check if an unknown lexicalisation is a lexicalisation of a different relation
      •  Incomp: for every lexicalisation l of a property, discard it as negative training data if any of the properties of the class we examine has an object lexicalisation l
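One possible reading of the Incomp rule on slide 12 is sketched here. The data layout (`class_facts` as entity -> property -> lexicalisations), the function name `incomp_filter`, and the candidate "London" are hypothetical; the slides do not specify how the lookup is implemented.

```python
from typing import Dict, List

# Hypothetical layout: entity -> property -> object lexicalisations, for one class.
ClassFacts = Dict[str, Dict[str, List[str]]]


def incomp_filter(negative_candidates: List[str], class_facts: ClassFacts) -> List[str]:
    """Discard negative candidates that are known object lexicalisations.

    A candidate that appears as an object of any property of any entity in
    the class may be a true but missing fact, so it is not used as negative
    training data.
    """
    known_objects = {
        lex
        for properties in class_facts.values()
        for lexicalisations in properties.values()
        for lex in lexicalisations
    }
    return [c for c in negative_candidates if c not in known_objects]


musical_artists = {
    "Blur": {"genre": ["Britpop"]},
    "Spice Girls": {"genre": ["pop"]},
}
# Candidates found in a sentence about Blur that are not related to Blur in the KB.
print(incomp_filter(["pop", "London"], musical_artists))
```

Here "pop" is dropped because it is a genre of another artist and could simply be missing for Blur, while "London" is kept as a negative example.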
  • 13. Distant Supervision system: corpus
      •  Web crawl corpus, created using entity-specific search queries, consisting of 450k Web pages
      Class            Property / Relation
      Book             author, characters, publication date, genre, ISBN, original language
      Musical Artist   album, active (start), active (end), genre, record label, origin, track
      Politician       birthdate, birthplace, educational institution, nationality, party, religion, spouses
  • 14. Distant Supervision system: relation candidate identification
      •  Two step process: recognise entities, then check if they appear in Freebase
      •  Use Stanford 7-class NERC to identify named entities (NEs)
      •  Problem: domain-specific entities (e.g. album, track) are often not recognised
      •  Solution: use heuristic to recognise (but not classify) more NEs
          •  Get all sequences of capitalised words, and noun sequences
          •  Every subsequence of those sequences is a potential NE, “pop music” -> “pop music”, “pop”, “music”
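The subsequence step of the heuristic on slide 14 can be illustrated as follows. The sketch assumes that "subsequence" means a contiguous run of tokens and that the input has already been tokenised; the function name `candidate_entities` is hypothetical.

```python
from typing import List


def candidate_entities(tokens: List[str]) -> List[str]:
    """Return every contiguous token subsequence, longest first.

    tokens is one sequence of capitalised words or nouns; each contiguous
    subsequence is treated as a potential named entity.
    """
    candidates = []
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            candidates.append(" ".join(tokens[start:start + length]))
    return candidates


print(candidate_entities(["pop", "music"]))  # ['pop music', 'pop', 'music']
```

A sequence of n tokens yields n(n+1)/2 candidates, which are then checked against Freebase in the second step.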
  • 15. Distant Supervision system
      •  Features: standard relation extraction features, including BOW features, POS-based features, NE class
      •  Classifier: first-order CRF
      •  Model: one model per Freebase class to classify a relation candidate (occurrence) into one of the different properties or NONE, apply to respective part of corpus
      •  Seed Selection: apply different seed selection methods Unam, Stop, Stat75, Stat50, Stat25, Incomp
      •  Merging and Ranking: aggregate predictions of occurrences with same surface form
          •  E.g.: Dublin could have predictions MusicalArtist:album, origin and NONE
          •  Compute the mean of the confidence values, select the highest ranked one
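The merging and ranking step on slide 15 might look roughly like this. The sketch assumes that confidences are averaged per predicted label across all occurrences of a surface form; the confidence values, the tuple layout and the function name `merge_predictions` are made up for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def merge_predictions(
    occurrences: List[Tuple[str, str, float]]
) -> Dict[str, Tuple[str, float]]:
    """Aggregate per-occurrence predictions into one label per surface form.

    occurrences holds (surface_form, predicted_label, confidence) tuples.
    For each surface form the confidences are averaged per label and the
    label with the highest mean confidence is selected.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for surface_form, label, confidence in occurrences:
        grouped[surface_form][label].append(confidence)

    merged = {}
    for surface_form, by_label in grouped.items():
        means = {label: mean(values) for label, values in by_label.items()}
        best = max(means, key=means.get)
        merged[surface_form] = (best, means[best])
    return merged


# Hypothetical confidences for occurrences of the surface form "Dublin".
predictions = [
    ("Dublin", "MusicalArtist:album", 0.5),
    ("Dublin", "MusicalArtist:origin", 1.0),
    ("Dublin", "MusicalArtist:origin", 0.5),
    ("Dublin", "NONE", 0.25),
]
print(merge_predictions(predictions))
```

With these toy numbers "origin" wins with a mean confidence of 0.75, ahead of "album" at 0.5 and NONE at 0.25.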
  • 16. Evaluation
      •  Split corpus equally for training and test
      •  Hand-annotate the portion of the test corpus which has NONE prediction (no representation in Freebase)
  • 17. Results: precision per model, ranked by confidence
      [Figure: precision (y-axis) against minimum confidence (x-axis; e.g. 0.1 covers all occurrences with confidence >= 0.1), one curve per model: unam_stop_stat25, unam_stop_stat50, unam_stop_stat75, unam_stop, stop, baseline, incomp]
      Overall precision:
      Model                Precision
      unam_stop_stat25     0.896
      unam_stop_stat50     0.882
      unam_stop_stat75     0.873
      unam_stop            0.842
      stop                 0.84
      baseline             0.825
      incomp               0.74
  • 18. Results Summary
      •  Best-performing model (unam_stop_stat25) has a precision of 0.896, compared to baseline model with precision of 0.825, reducing the error rate by 35%
      •  However, those seed selection methods all come at a small loss of the number of extractions (20%) because they reduce the amount of training data
      •  Removing potentially false negative training data (incomp) does not perform well
          •  Too many training examples are removed
          •  The training examples which are removed are lexicalisations which have the same types of values, those are crucial for learning
          •  Especially poor performance for numerical values
  • 19. Related work on dealing with noise for distant supervision
      •  At-least-one models:
          •  Relaxed distant supervision assumption: assume that at least one of the relation mentions is a true relation mention
          •  Graphical models, inference based on ranking
          •  Very challenging to train
      •  Hierarchical topic models:
          •  Only learn from positive training examples
          •  Pre-processing with multi-layer topic model to group extraction patterns to determine which ones are specific for each relation and which are not
      •  Pattern correlations:
          •  Probabilistic graphical model to group extraction patterns
      •  Information Retrieval approach:
          •  Pseudo relevance feedback, re-rank extraction patterns
  • 20. Comparison of our approach to deal with ambiguity with related approaches
      •  Related approaches all try to solve the problem of ambiguity within the machine learning model
      •  Our approach deals with ambiguity as a pre-processing step for creating training data
      •  While related approaches try to address the problem of noisy data by using more complicated models, we explored how to exploit background data from the KB even further
      •  We explored how simple statistical methods based on data already present in the knowledge base can help to filter unreliable training data
  • 21. Conclusions
      •  Simple statistical methods based on background knowledge present in the KB perform well at detecting ambiguous training examples
      •  Error reduction of up to 35% can be achieved by strategically selecting seed data
      •  The increase in precision is encouraging; however, it comes at the expense of the number of extractions (20% fewer extractions)
      •  Higher recall could be achieved by increasing the number of training instances initially
          •  Use a bigger corpus
          •  Make better use of knowledge contained in corpus
  • 22. Future Work
      •  Distantly supervised named entity classification for relation extraction to improve performance for non-standard entities, joint models for NER and RE
      •  Relax distant supervision assumption to achieve a higher number of extractions: extract relations across sentence boundaries, coreference resolution
      •  Combined extraction models for information from text, lists and tables on Web pages to improve precision and recall
  • 23. Thank you for your attention! Questions?