SlideShare une entreprise Scribd logo
1  sur  21
Télécharger pour lire hors ligne
Exploiting Structure in
Representation of Named Entities
using Active Learning
Nikita Bhutani*
Kun Qian†
Yunyao Li†
H. V. Jagadish*
Mauricio A. Hernandez†
Mitesh Vasa†
* University of Michigan, Ann Arbor
† IBM Research, Almaden
1
Entities lack unique representation
……..D. Freysinger, is
the Graduate Program
Coordinator at …
Freysinger, Dawn is
affiliated with UMich..
…………………….
Entity Linking/Resolution/De-duplication
!2
Company Academic Title
Barclays
GE Corporation
IBM UK Ltd.
Kumagai Professor of Engineering
The Helen L. Crocker Faculty Scholar
Professor of Public Policy
IBM - United Kingdom Kumagai Prof. of Engg.
Entities have an internal structured representation
⟨name⟩
⟨loc⟩
⟨suffix⟩
⟨prefix⟩
⟨position⟩
⟨specialty⟩
⟨name⟩⟨loc⟩⟨suffix⟩
⟨name⟩⟨suffix⟩
⟨name⟩
….
⟨prefix⟩⟨position⟩⟨specialty⟩
⟨prefix⟩⟨position⟩
⟨position⟩⟨specialty⟩
….
!3
Barclays
GE Corporation
IBM UK Ltd.
Kumagai Professor of Engineering
The Helen L. Crocker Faculty Scholar
Professor of Public Policy
IBM - United Kingdom Kumagai Prof. of Engg.
Company Academic Title
Structural similarity is more reliable than textual similarity
reasoning over structured representations is more robust!
General Electric Corporation General Electric Corporation
⟨name⟩ ⟨suffix⟩
abbreviate
GE Corp.
abbreviate
General Electric China Corporation
General Electric Corporation
GE Corp.
textual similarity can be misleading!
!4
How do we obtain these structured representations?
⟨name⟩⟨suffix⟩
⟨name⟩
⟨name⟩⟨subsidiary⟩⟨suffix⟩...
structured representations programs
+
“General Electric Corp.”
⟨name⟩ ⟨suffix⟩
?!
Manually1
- incorporate domain knowledge (e.g. ⟨suffix⟩ lexicon)
- error-prone, specialized skills, expensive tuning
Programmable Framework2
- directly manipulate representation of entities
- user has to define a program of grammar rules to parse each mention
1 [Campos et al., 2015], 2 [Arasu and Kaushik, 2009]!5
Reducing user-effort
1. help discover structured representations
2. reduce manual effort in learning structured
representations and their programs
3. incorporate domain knowledge in programs
GE Corp.
!6
IBM Ltd.
Barclays
Microsoft Asia
GE Corp.
⟨name⟩
⟨name⟩⟨suffix⟩
⟨name⟩⟨loc⟩
⟨name⟩⟨suffix⟩
Program
⟨name⟩⟨suffix⟩
Program
Key notations and task
Learn a model of mapping rules with minimal user effort by:
- Iteratively seeking labels for informative mentions (Active Learning)
- Automatically infer mapping rules from user labels (Rule Generation)
General Electric Corp.General Electric Corp. ⟨name⟩ ⟨suffix⟩
matcher1
(regex)
matcher2
(dictsuffix)
mapping rule
semantic unit
!7
⟨name⟩ ⟨suffix⟩
IBM Ltd.
⟨name⟩ ⟨suffix⟩
LUSTRE System
Labeled
Mentions
Parsing
Labeled Mention
Wrong Predictions
Structured
Representations
Partly-labeled Mention
Intermediate Predictions
Learned
Model
Indexing
Enriched Unlabeled
Mentions
User Interface
Rule Generation
PRE-PROCESSING
Mentions
Dictionaries
•———
•———
•———
Candidate Selection
caps, alphaNum, num..
Mapping Rule
TRAINING
!8
Inputs and Pre-processing
Evaluate matchers against unlabeled mentions for candidate selection and rule generation
General China Corporation
dsuffix
caps
alphanum
Electric
dcountry
caps
alphanum
caps
alphanum
caps
alphanum
Rank matchers to resolve ties: dconcept > caps > alphanum > num > special > wild
Unlabeled Mentions Domain Dictionaries
•———
•———
•———
Built-in regex matchers
caps, alphanum,num,special..
matchers
Inputs
Preprocessing
!9
Selecting Informative Mention - Query Strategy
Similar structure as unlabeled mentions
e.g. IBM Ltd. ~ Apple Inc., GE Corp.
Unknown or Uncertain structure
e.g. GE Oil & Gas
Correlation Score: Uncertainty Score:
Utility Score:
c(mi) = g(sim(si, su)), where u ∈ U
sim(si, su) = 1 - edit distance(si, su)
max edit distance
where si is the structure of mi
edit distance(IBM Ltd., GE Corp.) = 0
edit distance(IBM Ltd., Microsoft Asia Inc.) = 1
f(mi) = f(Prs
)
where rs is the mapping rule of s
Higher the reliability of a mapping rule,
lower the uncertainty of its structure
u(mi) = c(mi) x f(mi)
m* = argmax u(mi)
mi
Informative Mention
!10
Seeking user labels for selected mentions
General China CorporationElectric
⟨suffix⟩⟨country⟩
Partly labeled mention
General China CorporationElectric
⟨suffix⟩⟨country⟩⟨name⟩
Additional feedback on intermediate predictions
General Motors IBM UK Barclays
✔ 𐄂 ✔
⟨name⟩ ⟨name⟩ ⟨suffix⟩ ⟨name⟩
!11
Generating mapping rule
Non-Trivial: semantic units can span multiple tokens and matchers
Solution: reliable rule as the sequence of most selective3 matchers
where selectivity is expected number of matches of a matcher in a dataset
General China CorporationElectric
⟨suffix⟩⟨country⟩⟨name⟩
dsuffix
caps
alphanum
dcountry
caps
alphanum
caps
alphanum
caps
alphanum
⟨name:: caps{1,2}⟩ ⟨country:: dcountry⟩ ⟨suffix::dsuffix⟩
3 [Li et al., 2008]!12
Updating model with learned rule
Rule Reliability: for query strategy and for resolving structural ambiguities
3 [Li et al., 2008]
Prs
= 1 - selectivity(p*)
where p* = argmin selectivity(pi | pi ∈ rs)
i
For a new rule, estimate as a function of
selectivity of matchers in the rule
P
j
rs
= P
i
rs
x (1 - 𝝺 frac. incorrect pred)
For a learned rule, update based on the fraction of
predictions of the rule marked incorrect by user
!13
Experiments - Datasets, Baselines and Metrics
Datasets
Baselines
Evaluation Metric
STG1: hand-crafted programs used in production
Linear-Chain CRF: sequence labeling with matchers as features
LUSTREt: LUSTRE with tf-idf based query strategy
1 [Campos et al., 2015]
Precision, P: fraction of predictions that are correct
Recall, R: fraction of correct structures that are predicted
Manual effort, ɑ: F-score of method X
# labels requested by X
, where X ∈ {CRF, LUSTRE}
Type Train In-Domain Out-of-domain
Person 200 200 200
Company 200 100 200
Tournament 50 50 -
Academic Title 175 175 -
ACE 2005, Freebase
!14
Experiments - Qualitative Analysis
Type Method
In-Domain Out-of-domain
P R F1 P R F1
Person
STG 0.92 0.92 0.92 0.85 0.85 0.85
CRF 0.97 0.97 0.97 0.90 0.90 0.90
LUSTREt 0.98 0.95 0.96 0.92 0.90 0.91
LUSTRE 0.99 0.97 0.98 0.92 0.95 0.93
Company
STG 0.83 0.83 0.83 0.79 0.79 0.79
CRF 0.87 0.87 0.87 0.85 0.85 0.85
LUSTREt 0.84 0.77 0.80 0.78 0.60 0.68
LUSTRE 0.95 0.86 0.90 0.91 0.85 0.88
Tournament
CRF 0.70 0.70 0.70 - - -
LUSTREt 0.96 0.68 0.79 - - -
LUSTRE 0.96 0.90 0.93 - - -
Academic Title
CRF 0.69 0.69 0.69 - - -
LUSTREt 0.36 0.23 0.28 - - -
LUSTRE 0.79 0.65 0.72 - - -
Learned Models (LUSTRE, CRF) > Manually Crafted (STG)
!15
Experiments - Qualitative Analysis
Complex entities have more variations. LUSTRE outperforms other methods for complex types.
Type Method
In-Domain Out-of-domain
P R F1 P R F1
Person
STG 0.92 0.92 0.92 0.85 0.85 0.85
CRF 0.97 0.97 0.97 0.90 0.90 0.90
LUSTREt 0.98 0.95 0.96 0.92 0.90 0.91
LUSTRE 0.99 0.97 0.98 0.92 0.95 0.93
Company
STG 0.83 0.83 0.83 0.79 0.79 0.79
CRF 0.87 0.87 0.87 0.85 0.85 0.85
LUSTREt 0.84 0.77 0.80 0.78 0.60 0.68
LUSTRE 0.95 0.86 0.90 0.91 0.85 0.88
Tournament
CRF 0.70 0.70 0.70 - - -
LUSTREt 0.96 0.68 0.79 - - -
LUSTRE 0.96 0.90 0.93 - - -
Academic Title
CRF 0.69 0.69 0.69 - - -
LUSTREt 0.36 0.23 0.28 - - -
LUSTRE 0.79 0.65 0.72 - - -
!16
Experiments - Qualitative Analysis
Type Method
In-Domain Out-of-domain
P R F1 P R F1
Person
STG 0.92 0.92 0.92 0.85 0.85 0.85
CRF 0.97 0.97 0.97 0.90 0.90 0.90
LUSTREt 0.98 0.95 0.96 0.92 0.90 0.91
LUSTRE 0.99 0.97 0.98 0.92 0.95 0.93
Company
STG 0.83 0.83 0.83 0.79 0.79 0.79
CRF 0.87 0.87 0.87 0.85 0.85 0.85
LUSTREt 0.84 0.77 0.80 0.78 0.60 0.68
LUSTRE 0.95 0.86 0.90 0.91 0.85 0.88
Tournament
CRF 0.70 0.70 0.70 - - -
LUSTREt 0.96 0.68 0.79 - - -
LUSTRE 0.96 0.90 0.93 - - -
Academic Title
CRF 0.69 0.69 0.69 - - -
LUSTREt 0.36 0.23 0.28 - - -
LUSTRE 0.79 0.65 0.72 - - -
Good out-of-domain performance indicates LUSTRE captures structures regardless of data source.
!17
Experiments - Effectiveness
Precision
0
0.2
0.4
0.6
0.8
1
Iterations
1 2 3 4 5 6 7 8 9 10 11 12 13
Person Company Tournament Title
Recall
0
0.2
0.4
0.6
0.8
1
Iterations
1 2 3 4 5 6 7 8 9 10 11 12 13
Constant precision indicates "quality" rules are learned - effective program synthesis
Increasing recall indicates new rules are learned - effective query strategy
Few iterations (8-13) indicate low manual effort
Type LUSTRE CRF
Person 0.089 0.005
Company 0.125 0.004
Tournament 0.072 0.014
Title 0.060 0.004
Manual effort, ɑ
!18
Experiments - Usefulness
Relation Extraction
4 [Qian et al., 2017], 5 [Hoffmann et al., 2011]
MULTIR5: uses weak supervision data created by exact matching
textual mentions to Freebase entities
matching variations: textual mentions to variations of Freebase
entities of type Person and Company
# of exact matches: 24,882 sentences
# of matches to variations: 34,197 sentences
F1-score: Increased by 3%
Entity Resolution4 Refer paper for more details
!19
Demo Paper: An Interactive System for Entity Structured Representation and Variant Generation,
ICDE 2018
Video Link: https://www.youtube.com/watch?v=llaT4Sz6uI4
Conclusion
Framework to reason about name variations of entities based on their structured representations
An active-learning approach to learn structures for an entity type with minimal human input
Automatically synthesize generalizable programs from human-understandable labels for structures
!20
Thank You
!21

Contenu connexe

Tendances

Alpine Tech Talk: System ML by Berthold Reinwald
Alpine Tech Talk: System ML by Berthold ReinwaldAlpine Tech Talk: System ML by Berthold Reinwald
Alpine Tech Talk: System ML by Berthold ReinwaldChester Chen
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7Paul Lo
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTLaura Chiticariu
 
Proven ETL Developer Interview Questions to Assess and Hire ETL Developers
Proven ETL Developer Interview Questions to Assess and Hire ETL DevelopersProven ETL Developer Interview Questions to Assess and Hire ETL Developers
Proven ETL Developer Interview Questions to Assess and Hire ETL DevelopersInterview Mocha
 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBigML, Inc
 
The Hunt for Analytics and Finance & Accounting Talent in India
The Hunt for Analytics and Finance & Accounting Talent in IndiaThe Hunt for Analytics and Finance & Accounting Talent in India
The Hunt for Analytics and Finance & Accounting Talent in IndiaAnurag Goel
 

Tendances (6)

Alpine Tech Talk: System ML by Berthold Reinwald
Alpine Tech Talk: System ML by Berthold ReinwaldAlpine Tech Talk: System ML by Berthold Reinwald
Alpine Tech Talk: System ML by Berthold Reinwald
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
 
Proven ETL Developer Interview Questions to Assess and Hire ETL Developers
Proven ETL Developer Interview Questions to Assess and Hire ETL DevelopersProven ETL Developer Interview Questions to Assess and Hire ETL Developers
Proven ETL Developer Interview Questions to Assess and Hire ETL Developers
 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 Sessions
 
The Hunt for Analytics and Finance & Accounting Talent in India
The Hunt for Analytics and Finance & Accounting Talent in IndiaThe Hunt for Analytics and Finance & Accounting Talent in India
The Hunt for Analytics and Finance & Accounting Talent in India
 

Similaire à Exploiting Structure in Representation of Named Entities using Active Learning

The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsFrancesca Lazzeri, PhD
 
Evaluating Role Mining Algorithms
Evaluating Role Mining Algorithms Evaluating Role Mining Algorithms
Evaluating Role Mining Algorithms Onur Yılmaz
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Sujit Pal
 
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...IRJET Journal
 
Mechanisms for Database Intrusion Detection and Response
Mechanisms for Database Intrusion Detection and ResponseMechanisms for Database Intrusion Detection and Response
Mechanisms for Database Intrusion Detection and ResponseAshish Kamra
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationMongoDB
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Supporting Change in Product Lines within the Context of Use Case-driven Deve...
Supporting Change in Product Lines within the Context of Use Case-driven Deve...Supporting Change in Product Lines within the Context of Use Case-driven Deve...
Supporting Change in Product Lines within the Context of Use Case-driven Deve...Lionel Briand
 
Towards better software quality assurance by providing intelligent support
Towards better software quality assurance by providing intelligent supportTowards better software quality assurance by providing intelligent support
Towards better software quality assurance by providing intelligent supportConcordia University
 
Mastering AIOps with Deep Learning
Mastering AIOps with Deep LearningMastering AIOps with Deep Learning
Mastering AIOps with Deep LearningJorge Cardoso
 
Requirements vs design vs runtime
Requirements vs design vs runtimeRequirements vs design vs runtime
Requirements vs design vs runtimebdemchak
 
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016MLconf
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkSigOpt
 
EUGM 2014 - Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...
EUGM 2014 -  Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...EUGM 2014 -  Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...
EUGM 2014 - Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...ChemAxon
 
Overview of Movie Recommendation System using Machine learning by R programmi...
Overview of Movie Recommendation System using Machine learning by R programmi...Overview of Movie Recommendation System using Machine learning by R programmi...
Overview of Movie Recommendation System using Machine learning by R programmi...IRJET Journal
 
Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...Bob Fourer
 
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...MLconf
 

Similaire à Exploiting Structure in Representation of Named Entities using Active Learning (20)

Data day2017
Data day2017Data day2017
Data day2017
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 
Evaluating Role Mining Algorithms
Evaluating Role Mining Algorithms Evaluating Role Mining Algorithms
Evaluating Role Mining Algorithms
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
 
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...
 
Mechanisms for Database Intrusion Detection and Response
Mechanisms for Database Intrusion Detection and ResponseMechanisms for Database Intrusion Detection and Response
Mechanisms for Database Intrusion Detection and Response
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + Optimization
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Supporting Change in Product Lines within the Context of Use Case-driven Deve...
Supporting Change in Product Lines within the Context of Use Case-driven Deve...Supporting Change in Product Lines within the Context of Use Case-driven Deve...
Supporting Change in Product Lines within the Context of Use Case-driven Deve...
 
Towards better software quality assurance by providing intelligent support
Towards better software quality assurance by providing intelligent supportTowards better software quality assurance by providing intelligent support
Towards better software quality assurance by providing intelligent support
 
Mastering AIOps with Deep Learning
Mastering AIOps with Deep LearningMastering AIOps with Deep Learning
Mastering AIOps with Deep Learning
 
Requirements vs design vs runtime
Requirements vs design vs runtimeRequirements vs design vs runtime
Requirements vs design vs runtime
 
Ikc 2015
Ikc 2015Ikc 2015
Ikc 2015
 
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott Clark
 
EUGM 2014 - Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...
EUGM 2014 -  Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...EUGM 2014 -  Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...
EUGM 2014 - Brock Luty (Dart Neuroscience): A ChemAxon/KNIME based tool for ...
 
MDE in Practice
MDE in PracticeMDE in Practice
MDE in Practice
 
Overview of Movie Recommendation System using Machine learning by R programmi...
Overview of Movie Recommendation System using Machine learning by R programmi...Overview of Movie Recommendation System using Machine learning by R programmi...
Overview of Movie Recommendation System using Machine learning by R programmi...
 
Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...
 
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
 

Plus de Yunyao Li

The Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsYunyao Li
 
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopBuilding, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopYunyao Li
 
Meaning Representations for Natural Languages: Design, Models and Applications
Meaning Representations for Natural Languages:  Design, Models and ApplicationsMeaning Representations for Natural Languages:  Design, Models and Applications
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
 
Towards Deep Table Understanding
Towards Deep Table UnderstandingTowards Deep Table Understanding
Towards Deep Table UnderstandingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Towards Universal Language Understanding
Towards Universal Language UnderstandingTowards Universal Language Understanding
Towards Universal Language UnderstandingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Yunyao Li
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
 
An In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaAn In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
 
K-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingK-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
 
Coling poster
Coling posterColing poster
Coling posterYunyao Li
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
 
The Power of Declarative Analytics
The Power of Declarative AnalyticsThe Power of Declarative Analytics
The Power of Declarative AnalyticsYunyao Li
 
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesEnterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesYunyao Li
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionYunyao Li
 
Information Extraction --- An one hour summary
Information Extraction --- An one hour summaryInformation Extraction --- An one hour summary
Information Extraction --- An one hour summaryYunyao Li
 

Plus de Yunyao Li (20)

The Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
 
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopBuilding, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
 
Meaning Representations for Natural Languages: Design, Models and Applications
Meaning Representations for Natural Languages:  Design, Models and ApplicationsMeaning Representations for Natural Languages:  Design, Models and Applications
Meaning Representations for Natural Languages: Design, Models and Applications
 
Towards Deep Table Understanding
Towards Deep Table UnderstandingTowards Deep Table Understanding
Towards Deep Table Understanding
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Towards Universal Language Understanding
Towards Universal Language UnderstandingTowards Universal Language Understanding
Towards Universal Language Understanding
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural Languages
 
An In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaAn In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social Media
 
K-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingK-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role Labeling
 
Coling poster
Coling posterColing poster
Coling poster
 
Coling demo
Coling demoColing demo
Coling demo
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
 
The Power of Declarative Analytics
The Power of Declarative AnalyticsThe Power of Declarative Analytics
The Power of Declarative Analytics
 
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesEnterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity Detection
 
Information Extraction --- An one hour summary
Information Extraction --- An one hour summaryInformation Extraction --- An one hour summary
Information Extraction --- An one hour summary
 

Dernier

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Dernier (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Exploiting Structure in Representation of Named Entities using Active Learning

  • 1. Exploiting Structure in Representation of Named Entities using Active Learning Nikita Bhutani* Kun Qian† Yunyao Li† H. V. Jagadish* Mauricio A. Hernandez† Mitesh Vasa† * University of Michigan, Ann Arbor † IBM Research, Almaden 1
  • 2. Entities lack unique representation ……..D. Freysinger, is the Graduate Program Coordinator at … Freysinger, Dawn is affiliated with UMich.. ……………………. Entity Linking/Resolution/De-duplication !2 Company Academic Title Barclays GE Corporation IBM UK Ltd. Kumagai Professor of Engineering The Helen L. Crocker Faculty Scholar Professor of Public Policy IBM - United Kingdom Kumagai Prof. of Engg.
  • 3. Entities have an internal structured representation ⟨name⟩ ⟨loc⟩ ⟨suffix⟩ ⟨prefix⟩ ⟨position⟩ ⟨specialty⟩ ⟨name⟩⟨loc⟩⟨suffix⟩ ⟨name⟩⟨suffix⟩ ⟨name⟩ …. ⟨prefix⟩⟨position⟩⟨specialty⟩ ⟨prefix⟩⟨position⟩ ⟨position⟩⟨specialty⟩ …. !3 Barclays GE Corporation IBM UK Ltd. Kumagai Professor of Engineering The Helen L. Crocker Faculty Scholar Professor of Public Policy IBM - United Kingdom Kumagai Prof. of Engg. Company Academic Title
  • 4. Structural similarity is more reliable than textual similarity reasoning over structured representations is more robust! General Electric Corporation General Electric Corporation ⟨name⟩ ⟨suffix⟩ abbreviate GE Corp. abbreviate General Electric China Corporation General Electric Corporation GE Corp. textual similarity can be misleading! !4
  • 5. How do we obtain these structured representations? ⟨name⟩⟨suffix⟩ ⟨name⟩ ⟨name⟩⟨subsidiary⟩⟨suffix⟩... structured representations programs + “General Electric Corp.” ⟨name⟩ ⟨suffix⟩ ?! Manually1 - incorporate domain knowledge (e.g. ⟨suffix⟩ lexicon) - error-prone, specialized skills, expensive tuning Programmable Framework2 - directly manipulate representation of entities - user has to define a program of grammar rules to parse each mention 1 [Campos et al., 2015], 2 [Arasu and Kaushik, 2009]!5
  • 6. Reducing user-effort 1. help discover structured representations 2. reduce manual effort in learning structured representations and their programs 3. incorporate domain knowledge in programs GE Corp. !6 IBM Ltd. Barclays Microsoft Asia GE Corp. ⟨name⟩ ⟨name⟩⟨suffix⟩ ⟨name⟩⟨loc⟩ ⟨name⟩⟨suffix⟩ Program ⟨name⟩⟨suffix⟩ Program
  • 7. Key notations and task Learn a model of mapping rules with minimal user effort by: - Iteratively seeking labels for informative mentions (Active Learning) - Automatically infer mapping rules from user labels (Rule Generation) General Electric Corp.General Electric Corp. ⟨name⟩ ⟨suffix⟩ matcher1 (regex) matcher2 (dictsuffix) mapping rule semantic unit !7 ⟨name⟩ ⟨suffix⟩ IBM Ltd. ⟨name⟩ ⟨suffix⟩
  • 8. LUSTRE System Labeled Mentions Parsing Labeled Mention Wrong Predictions Structured Representations Partly-labeled Mention Intermediate Predictions Learned Model Indexing Enriched Unlabeled Mentions User Interface Rule Generation PRE-PROCESSING Mentions Dictionaries •——— •——— •——— Candidate Selection caps, alphaNum, num.. Mapping Rule TRAINING !8
  • 9. Inputs and Pre-processing Evaluate matchers against unlabeled mentions for candidate selection and rule generation General China Corporation dsuffix caps alphanum Electric dcountry caps alphanum caps alphanum caps alphanum Rank matchers to resolve ties: dconcept > caps > alphanum > num > special > wild Unlabeled Mentions Domain Dictionaries •——— •——— •——— Built-in regex matchers caps, alphanum,num,special.. matchers Inputs Preprocessing !9
  • 10. Selecting Informative Mention - Query Strategy Similar structure as unlabeled mentions e.g. IBM Ltd. ~ Apple Inc., GE Corp. Unknown or Uncertain structure e.g. GE Oil & Gas Correlation Score: Uncertainty Score: Utility Score: c(mi) = g(sim(si, su)), where u ∈ U sim(si, su) = 1 - edit distance(si, su) max edit distance where si is the structure of mi edit distance(IBM Ltd., GE Corp.) = 0 edit distance(IBM Ltd., Microsoft Asia Inc.) = 1 f(mi) = f(Prs ) where rs is the mapping rule of s Higher the reliability of a mapping rule, lower the uncertainty of its structure u(mi) = c(mi) x f(mi) m* = argmax u(mi) mi Informative Mention !10
  • 11. Seeking user labels for selected mentions General China CorporationElectric ⟨suffix⟩⟨country⟩ Partly labeled mention General China CorporationElectric ⟨suffix⟩⟨country⟩⟨name⟩ Additional feedback on intermediate predictions General Motors IBM UK Barclays ✔ 𐄂 ✔ ⟨name⟩ ⟨name⟩ ⟨suffix⟩ ⟨name⟩ !11
  • 12. Generating mapping rule Non-Trivial: semantic units can span multiple tokens and matchers Solution: reliable rule as the sequence of most selective3 matchers where selectivity is expected number of matches of a matcher in a dataset General China CorporationElectric ⟨suffix⟩⟨country⟩⟨name⟩ dsuffix caps alphanum dcountry caps alphanum caps alphanum caps alphanum ⟨name:: caps{1,2}⟩ ⟨country:: dcountry⟩ ⟨suffix::dsuffix⟩ 3 [Li et al., 2008]!12
  • 13. Updating model with learned rule Rule Reliability: for query strategy and for resolving structural ambiguities 3 [Li et al., 2008] Prs = 1 - selectivity(p*) where p* = argmin selectivity(pi | pi ∈ rs) i For a new rule, estimate as a function of selectivity of matchers in the rule P j rs = P i rs x (1 - 𝝺 frac. incorrect pred) For a learned rule, update based on the fraction of predictions of the rule marked incorrect by user !13
  • 14. Experiments - Datasets, Baselines and Metrics Datasets Baselines Evaluation Metric STG1: hand-crafted programs used in production Linear-Chain CRF: sequence labeling with matchers as features LUSTREt: LUSTRE with tf-idf based query strategy 1 [Campos et al., 2015] Precision, P: fraction of predictions that are correct Recall, R: fraction of correct structures that are predicted Manual effort, ɑ: F-score of method X # labels requested by X , where X ∈ {CRF, LUSTRE} Type Train In-Domain Out-of-domain Person 200 200 200 Company 200 100 200 Tournament 50 50 - Academic Title 175 175 - ACE 2005, Freebase !14
  • 15. Experiments - Qualitative Analysis Type Method In-Domain Out-of-domain P R F1 P R F1 Person STG 0.92 0.92 0.92 0.85 0.85 0.85 CRF 0.97 0.97 0.97 0.90 0.90 0.90 LUSTREt 0.98 0.95 0.96 0.92 0.90 0.91 LUSTRE 0.99 0.97 0.98 0.92 0.95 0.93 Company STG 0.83 0.83 0.83 0.79 0.79 0.79 CRF 0.87 0.87 0.87 0.85 0.85 0.85 LUSTREt 0.84 0.77 0.80 0.78 0.60 0.68 LUSTRE 0.95 0.86 0.90 0.91 0.85 0.88 Tournament CRF 0.70 0.70 0.70 - - - LUSTREt 0.96 0.68 0.79 - - - LUSTRE 0.96 0.90 0.93 - - - Academic Title CRF 0.69 0.69 0.69 - - - LUSTREt 0.36 0.23 0.28 - - - LUSTRE 0.79 0.65 0.72 - - - Learned Models (LUSTRE, CRF) > Manually Crafted (STG) !15
  • 16. Experiments - Qualitative Analysis Complex entities have more variations. LUSTRE outperforms other methods for complex types. Type Method In-Domain Out-of-domain P R F1 P R F1 Person STG 0.92 0.92 0.92 0.85 0.85 0.85 CRF 0.97 0.97 0.97 0.90 0.90 0.90 LUSTREt 0.98 0.95 0.96 0.92 0.90 0.91 LUSTRE 0.99 0.97 0.98 0.92 0.95 0.93 Company STG 0.83 0.83 0.83 0.79 0.79 0.79 CRF 0.87 0.87 0.87 0.85 0.85 0.85 LUSTREt 0.84 0.77 0.80 0.78 0.60 0.68 LUSTRE 0.95 0.86 0.90 0.91 0.85 0.88 Tournament CRF 0.70 0.70 0.70 - - - LUSTREt 0.96 0.68 0.79 - - - LUSTRE 0.96 0.90 0.93 - - - Academic Title CRF 0.69 0.69 0.69 - - - LUSTREt 0.36 0.23 0.28 - - - LUSTRE 0.79 0.65 0.72 - - - !16
  • 17. Experiments - Qualitative Analysis Type Method In-Domain Out-of-domain P R F1 P R F1 Person STG 0.92 0.92 0.92 0.85 0.85 0.85 CRF 0.97 0.97 0.97 0.90 0.90 0.90 LUSTREt 0.98 0.95 0.96 0.92 0.90 0.91 LUSTRE 0.99 0.97 0.98 0.92 0.95 0.93 Company STG 0.83 0.83 0.83 0.79 0.79 0.79 CRF 0.87 0.87 0.87 0.85 0.85 0.85 LUSTREt 0.84 0.77 0.80 0.78 0.60 0.68 LUSTRE 0.95 0.86 0.90 0.91 0.85 0.88 Tournament CRF 0.70 0.70 0.70 - - - LUSTREt 0.96 0.68 0.79 - - - LUSTRE 0.96 0.90 0.93 - - - Academic Title CRF 0.69 0.69 0.69 - - - LUSTREt 0.36 0.23 0.28 - - - LUSTRE 0.79 0.65 0.72 - - - Good out-of-domain performance indicates LUSTRE captures structures regardless of data source. !17
  • 18. Experiments - Effectiveness Precision 0 0.2 0.4 0.6 0.8 1 Iterations 1 2 3 4 5 6 7 8 9 10 11 12 13 Person Company Tournament Title Recall 0 0.2 0.4 0.6 0.8 1 Iterations 1 2 3 4 5 6 7 8 9 10 11 12 13 Constant precision indicates "quality" rules are learned - effective program synthesis Increasing recall indicates new rules are learned - effective query strategy Few iterations (8-13) indicate low manual effort Type LUSTRE CRF Person 0.089 0.005 Company 0.125 0.004 Tournament 0.072 0.014 Title 0.060 0.004 Manual effort, ɑ !18
  • 19. Experiments - Usefulness Relation Extraction 4 [Qian et al., 2017], 5 [Hoffmann et al., 2011] MULTIR5: uses weak supervision data created by exact matching textual mentions to Freebase entities matching variations: textual mentions to variations of Freebase entities of type Person and Company # of exact matches: 24,882 sentences # of matches to variations: 34,197 sentences F1-score: Increased by 3% Entity Resolution4 Refer paper for more details !19
  • 20. Demo Paper: An Interactive System for Entity Structured Representation and Variant Generation, ICDE 2018 Video Link: https://www.youtube.com/watch?v=llaT4Sz6uI4 Conclusion Framework to reason about name variations of entities based on their structured representations An active-learning approach to learn structures for an entity type with minimal human input Automatically synthesize generalizable programs from human-understandable labels for structures !20