Many of the most robust Human Language Technologies, including statistical part-of-speech taggers and entity extractors, are developed primarily using high-quality newswire data sources. The performance of these technologies on texts in other genres, including short texts like tweets and even sub-genres of news like market summaries, is typically poor. Adapting such technologies to these increasingly important genres remains difficult and is an active area of commercial and academic research. In this presentation, Mr. Stewart will highlight the ways in which newswire-trained modules typically fail on the most important emerging text genres, outline the most effective and lowest-cost adaptation methods that researchers and practitioners have discovered, and offer guidance on what degree of improvement users can expect to see in the short to medium term.
2. Introduction
Product Manager for Text Analytics, including:
– Rosette Linguistics Platform
– Entity Analytics
– Name Indexing and Translation
– Chat Translator
– Highlight
Questing for:
– Quality: accuracy, performance
– Coverage: languages, domains, genres
– Integration: tasks, workflows, UX
– Innovation: new aggregates, functions
3. Overview
[Diagram: four processing stages, each pairing a Source with Tasks, Technologies and Adaptations]
– Tasks: Description, Problem(s), Challenge(s), Approach(es), Solution, Signal(s)
– Technologies: Action (Input/Output), Components, Process, Adaptation Opportunities
– Adaptations: “Out of Box” Comparison, Suggested Adaptations, Potential Benefits, Costs
Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT.
Reaching “state of the art” in practice means adapting to source, task and user.
5. Task: Triage
Triage: should we process further and/or urgently?
Too few trained, trusted linguists to review all the documents in time
Enable non-linguists to do a linguist’s job
Gisting: MT All vs. MT Names alone
Combine Entity Extraction with Specialized Machine Translation
Integrate into Triage workflow
Signal: Documents Selected (How are guidelines interpreted?)
8. Technology: Entity Extraction (3)
[Diagram: extraction pipeline]
Input Text →
– Deterministic Extractor: Pattern Match (Regex) with User Defined Patterns; Exact Match (Gazetteer) with User Defined Lists
– Probabilistic Extractor: Supervised Model; Unsupervised Model (built from Domain Text and Tagged Text)
– Entity Redactor: Overlap Adjudication, Entity Joining, Filtering
→ Output Text
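The redaction stage of the pipeline above can be sketched in a few lines. This is a minimal illustration, not the product's algorithm: the `Span` type and the "prefer longer, then heavier" adjudication policy are assumptions for the example.

```python
# Sketch of overlap adjudication: keep a non-overlapping subset of the
# candidate spans produced by the deterministic and probabilistic
# extractors, preferring longer spans, then higher-weight ones.
from dataclasses import dataclass

@dataclass
class Span:
    start: int
    end: int
    etype: str
    weight: float  # e.g. 1.0 for gazetteer hits, model probability otherwise

def redact(spans):
    """Adjudicate overlaps: longer spans win, then heavier ones."""
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), -s.weight))
    kept = []
    for s in ordered:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)

spans = [
    Span(0, 14, "PERSON", 0.9),   # probabilistic: "Alberto Fernandez"
    Span(8, 14, "PERSON", 1.0),   # gazetteer: "Fernandez" (subsumed, dropped)
    Span(20, 26, "WEAPON", 1.0),  # user-defined list hit (kept)
]
print(redact(spans))
```

A real redactor would also apply the confidence thresholds and entity-joining rules the later notes mention; this shows only the overlap step.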
9. Adaptation: Entity Extraction to Triage
Out of the box:
– False positives/negatives because contextual cues are fewer/different.
– Weapon in this document missed, because not a default entity type.
Adaptation:
– Add custom entity type(s) via deterministic extractor, e.g. weapons list
Benefit:
– Highlights important documents that might otherwise be missed.
– Fast and unlikely to affect performance of other components
Difficulties:
– Requires forethought and maintenance of lists and patterns in many languages, but much less work than developing a new model
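The suggested adaptation, adding a custom entity type via an exact-match gazetteer, can be sketched as below. The term list and the longest-match-first scan are illustrative assumptions, not the product's implementation.

```python
# Minimal gazetteer extractor for a custom entity type (e.g. a weapons
# list). Longer terms are tried first so "rocket launcher" beats "rocket".
import re

def gazetteer_extractor(terms, etype):
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, sorted(terms, key=len, reverse=True))) + r")\b",
        re.IGNORECASE,
    )
    def extract(text):
        return [(m.start(), m.end(), etype, m.group(0)) for m in pattern.finditer(text)]
    return extract

extract_weapons = gazetteer_extractor(["RPG", "rocket launcher", "AK-47"], "WEAPON")
print(extract_weapons("Cache contained an AK-47 and a rocket launcher."))
```

Because the matcher is deterministic and runs beside the statistical model, it adds coverage without retraining, which is the "fast and unlikely to affect other components" benefit above.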
10. Task: Translation
Produce standardized, “user language” versions of the source document
Too few translators; name standardization particularly labor intensive
Speed up translation without compromising quality
MT All reduces translation productivity
NER, Coref and Name Translation/Standardization
Signals: Resource Selections, Corrections, Resolutions
11. Adaptation: Extraction to Translation (1)
Out of the box:
– Same problems as in the Gisting case, only now they matter more.
Adaptation:
– Train unsupervised model to help with form and domain differences
– Tune co-reference algorithm to most important entity types
– Develop form/domain specific resource sets, and allow users to select them.
Benefit:
– Fewer errors in highlighting should mean translation actually speeds up
Difficulties:
– Often hard to amass a big enough corpus of like material for model building.
– Form/Domain may be ephemeral
12. Adaptation: Extraction to Translation (2)
Unsupervised algorithm clusters words with distributional similarities together
Word cluster ID is one feature used in learning the sequence model
Based on Collins & Singer (1999)
Part of REX Field Training Kit
Shown: random sample of words clustered with “Aleppo” in a ~10GB English model
Note they’re almost all LOCs
Would an annotated training corpus ever cover so many remote entities?
Thanks: Itai Rolnick
~$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head
Loveland -- City in Colorado
Svetogorsk -- Town in Russia
MASSOUD -- probably also a village?
Atiak -- Town in Uganda
Waltha -- typo for Waltham? - town in Mass.
BASILICA -- type of church?
Sapukai -- Town in Paraguay
Yeisk -- Town in Russia
Descoberto -- Town in Brazil
SINKHOLE -- a pub in Belgium??
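The point of the cluster is that the cluster ID, not the rare word itself, feeds the sequence model, so an unseen place name inherits the behavior of known ones. A toy sketch, in which the cluster table and feature names are illustrative assumptions:

```python
# Word-cluster IDs as features: unseen LOC-like words share a cluster
# with known places, so the sequence model can generalize to them.
CLUSTERS = {
    "aleppo": 543, "loveland": 543, "yeisk": 543, "atiak": 543,  # LOC-like cluster
    "president": 12, "minister": 12,                              # title-like cluster
}

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word": word.lower(),
        "word_class": CLUSTERS.get(word.lower(), -1),  # -1 = unclustered
        "prior_word_class": CLUSTERS.get(tokens[i - 1].lower(), -1) if i else -1,
    }

tokens = ["Fighting", "near", "Atiak", "continued"]
print(token_features(tokens, 2))
```

A feature like `prior_word_class=543` is exactly the kind of implicit feature the later notes list for the probabilistic components.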
13. Task: Cataloging
Distill content into an index, to facilitate search and further refinement at scale
Impossible to annotate more than a tiny fraction of documents by hand
High quality automated enrichment that makes efficient use of knowledge resources and structure in data
Many approaches, e.g. LSI, topic modeling, document classification
Entity resolution is a robust extension of NER; data and knowledge driven.
Signals: mentions/aliases, shallow relationships between entities
14. Technology: Entity Resolution (1)
[Diagram: mentions of “Alberto Fernandez” and variants (“Alberto M. Fernandez”, “Albert Fernandez”, “Alberto Fernandiz”, “Alburto Fernandez”, “Alberto Amos Fernandez”, “Alberto Fernandez de la Puebla”) grouped into three candidate entities:]
– … born in Cuba … US Ambassador
– … Chief of Cabinet … Argentina … Prof of Criminal Law …
– … born Sept 7, 1984 … cycling … Madrid
Example questions over the resolved entities: Sportsmen? YES. Ratio of Politicians to Sportsmen? 2:1. Nickname “El Galleta”? ?
18. Technology: Entity Resolution (5)
[Diagram: Resolution Engine]
1. Knowledge Base → 2. Entity Index (Learned or Seeded)
Entity Mention → 3. Candidate Selection → 4. Ranking → Link or Ghost
19. Adaptation: EntRes to Cataloging (1)
Out of the box:
– Quality dependent on output of extraction and order of input
– Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document
– Seeding context selection may not be suited to domain
Adaptations:
– Custom KB, sized and suited to the domain and languages
– Seeding using context most likely to match in your domain
– Choose Linking or Learning mode
– Choose evidence factoring scheme that meets your operational needs
Benefits:
– Linking throughput is high, accuracy is high, ghosts are informative (because fewer confounders)
– System can maintain low latency after ingestion of many documents
– Linking accuracy can remain high after ingestion of many documents
Difficulties:
– Each element requires experimentation and thought
– Changes likely to cause discontinuities unless you re-index
20. Adaptation: Ent Res to Cataloging (2)
In Linking mode:
– Link to existing KB or declare unknown, discarding context
– State size is constant, latency stable
In Learning mode:
– Link to existing KB or create New, storing context
– State size increases, increasing latency
– Semantic drift
– Confidence measure gets complicated
Scaling with learning introduces the need to factor evidence.
Evidence factoring schemes need to be customized to use cases.
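The Linking/Learning trade-off above can be made concrete with a toy resolver. The in-memory index, Jaccard scoring and threshold are assumptions for illustration; the real engine's scoring is not shown here.

```python
# Linking mode discards context for unknowns (constant state); Learning
# mode stores it as a new entity (state grows, and latency with it).
class Resolver:
    def __init__(self, kb, mode="linking", threshold=0.5):
        self.index = dict(kb)   # entity -> set of context feature words
        self.mode = mode
        self.threshold = threshold

    def resolve(self, mention, context):
        best, best_score = None, 0.0
        for entity, feats in self.index.items():
            s = len(context & feats) / len(context | feats)  # Jaccard
            if s > best_score:
                best, best_score = entity, s
        if best_score >= self.threshold:
            return best
        if self.mode == "learning":
            self.index[mention] = set(context)  # state grows
            return mention
        return None  # "ghost": unknown, context discarded

kb = {"Aleppo": {"syria", "city"}}
linker = Resolver(kb, mode="linking")
learner = Resolver(kb, mode="learning")
linker.resolve("Atiak", {"uganda", "town"})
learner.resolve("Atiak", {"uganda", "town"})
print(len(linker.index), len(learner.index))  # 1 2
```

Even in this toy, the Learning-mode index grows with every unresolved mention, which is why factoring (and eventually discarding) evidence becomes necessary at scale.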
21. Task: Retrieval
Find relevant information for further analysis
String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.
Deliver search modalities that are more productive but still interpretable and correctable
Search using entity-driven facets, as well as keywords
Signals: query log, click through, curation, corrections
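Entity-driven faceting can be illustrated with a tiny document index. The documents and entity IDs below are invented for the example; the point is that filtering by a resolved entity ID disambiguates where the surface string cannot.

```python
# Keyword search vs. entity facet: the ambiguous string "Fernandez"
# matches all three documents, but the entity facet isolates the cyclist.
docs = [
    {"id": 1, "text": "Fernandez speaks on Cuba", "entities": {"Q:AlbertoFernandez(diplomat)"}},
    {"id": 2, "text": "Fernandez wins in Madrid", "entities": {"Q:AlbertoFernandez(cyclist)"}},
    {"id": 3, "text": "Fernandez chairs cabinet", "entities": {"Q:AlbertoFernandez(politician)"}},
]

def facet_search(keyword, entity=None):
    hits = [d for d in docs if keyword.lower() in d["text"].lower()]
    if entity:
        hits = [d for d in hits if entity in d["entities"]]
    return [d["id"] for d in hits]

print(facet_search("fernandez"))                                        # [1, 2, 3]
print(facet_search("fernandez", entity="Q:AlbertoFernandez(cyclist)"))  # [2]
```

This is the productivity gain the slide claims: the user selects a facet once instead of hand-building Boolean queries over every alias.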
22. Adaptation: EntRes to Retrieval
Out of the box:
– Entity labels not in the user’s language are confusing
– Returns results that can’t be easily summarized as a Boolean, cf. aliases
– Complex, potentially misleading measure of confidence
Adaptations:
– Use name translation for non user-language labels, e.g. from KB
– Present users with cues to expansion in string terms, e.g. mentions
– Present confidence measure carefully
Benefits:
– User spends less time confused, search is more productive
Difficulties:
– Users still want to do things like exclude certain mentions.
23. Summary
News-trained NER is adequate for Triage, but adding entity types via lists and patterns could improve results considerably.
Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved and time wasted.
Cataloguing by resolved entities enables powerful search, but relies on high quality extraction; Learning mode requires evidence factoring at scale.
Entity-based search is far more productive than Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.
24. Remaining Challenges
Current reality: even “simple” adaptation can be difficult:
– Too much knowledge and experience required
– Too much data required, e.g. 10GB for unsupervised
– Mostly “out of band”
– Usually offline
Through the REX Field Training Kit and Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks and users today.
Integration of explicit signals, e.g. corrections, and implicit signals, e.g. selections, is ongoing.
Getting “high quality” entities from text. Doing it quickly and accurately. Guiding people in their use.
Adapt to your needs: a system that adapts to data. Could use click-through data to factor evidence.
Note: time and place are missing from the diagram – they affect both vocabulary and grammar.
Operational priorities change too quickly to merit the development of a model for interest, and the learned model would probably miss many things that we wanted to see. (Slide fields: Task, Problem(s), Challenge(s), Approach(es), Solution)
Traditionally, finding (putative) entity mentions in text:
– Mark spans that we think refer to something “in the world”
– For each, make a guess about the kind of thing it refers to, e.g. PERSON, PLACE, ORGANISATION
– Optionally, group the mentions that you think co-refer into chains
Most often called Named Entity Recognition (NER). An embarrassingly good method combines a statistical B-I-O sequence model with lists and known patterns. The statistical model is typically trained using local features over annotated newswire text: abundance, quality.
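The B-I-O scheme mentioned in the notes can be shown in miniature: each token receives B-TYPE (begins an entity), I-TYPE (inside one), or O (outside). The example sentence and span indices are invented for illustration.

```python
# Convert character-free, token-level entity spans into B-I-O tags,
# the label scheme the statistical sequence model is trained on.
def spans_to_bio(tokens, spans):
    """spans: list of (start_token, end_token_exclusive, type)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["Hyon", "Song-wol", "visited", "Chosun", "Ilbo"]
print(spans_to_bio(tokens, [(0, 2, "PERSON"), (3, 5, "ORGANIZATION")]))
# ['B-PERSON', 'I-PERSON', 'O', 'B-ORGANIZATION', 'I-ORGANIZATION']
```

Training data in this form is what makes annotated newswire so valuable, and what makes new genres expensive: every genre shift can require re-annotation.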
Deterministic or Explicit Components:
– Gazetteers: Lists, e.g. company names, product names
– Regular Expressions: Patterns, e.g. …
Probabilistic or Implicit Components:
– Training and testing data, e.g. annotated newswire, raw domain text
– Features, e.g. metadata_subject=markets, prior_word_class=543
– Learners
– Model(s) – what the learner outputs
Combiner/Redactor:
– Adjudication between component outputs
– Entity Joining/In-document Co-reference Resolution
– Modify joining rules, set confidence thresholds, identify entity types consistently, set weight or length preferences
Easier:
– Novel entities with a small number of forms – gazetteer lists
– Novel, highly productive but structured entities – regular expressions
– Forms we know aren’t entities – blacklists
– Broad vocabulary and style shift – using unsupervised word class models
Harder:
– New Entity Types – additional annotated data, and feature engineering
– Structure change – additional annotated data if within bounds set by features
– Fine Grained Entities – lots of data and annotation
Extraction and Co-Reference performance varies greatly by entity type and language.
Brittle to changes in domain and genre:
– Distribution of Entity Types
– Vocabulary Differences
– “Grammar” or Structure, inc. document length, abbreviation
Data sparsity means:
– Fine grained, rarer entities can be very difficult to extract
– Performance on very short texts is typically very low
Entity types decided up front/embedded in models.
First step is to build a representation of the entity-base structured to make feature evaluation easy, so we can learn to link. Our system begins by building an index of the information in the knowledge base – the entity-base can be anything from a list, to a database, to a graph, to a rich, semi-structured text resource like Wikipedia. For each entity in the knowledge base, we create an entry in the index containing information that is known to be useful for efficiently differentiating it from other entities (called features), e.g. the non-stop words in a canonical mention sentence, like “president” and “USA” in the opening line of Barack Obama’s Wikipedia page.
(AT LEFT) Let’s focus on four of these coreference chains: Hyon Song-wol, Ri Sol-ju, Wangjeasan Light Music Band and Chosun. In a first pass, we compare the surface form of the mentions in each chain with the labels of the entities in the index; this generates a small number of candidates for closer consideration. In a second pass, we score the degree of similarity of these mentions and their surrounding context, like the non-stop words in the sentences they appear in, with the contents of the candidate entity entries. We can think of this as trying to find which entity or entities the mentions are closest to in some space, called the feature space. Here we can see that the Hyon Song-wol and Wangjeasan groups are quite closely associated with the entities we would expect them to be; that Chosun is equally well associated with two entities; and that Ri Sol-ju is not particularly closely associated with any of the known entities.
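The two-pass scheme described here can be sketched with a toy knowledge base. The KB entries, alias sets and Jaccard scoring below are illustrative assumptions standing in for the real index and feature space.

```python
# Pass 1: candidate selection by surface-form/alias match.
# Pass 2: rank candidates by context overlap with indexed feature words.
KB = {
    "Chosun Ilbo": {"aliases": {"chosun", "chosun ilbo"},
                    "context": {"newspaper", "daily", "seoul"}},
    "Korea": {"aliases": {"korea", "chosun", "joseon"},
              "context": {"peninsula", "seoul", "pyongyang"}},
    "Hyon Song-wol": {"aliases": {"hyon song-wol"},
                      "context": {"singer", "band"}},
}

def candidates(mention):
    return [e for e, rec in KB.items() if mention.lower() in rec["aliases"]]

def score(context_words, entity):
    feats = KB[entity]["context"]
    return len(context_words & feats) / len(context_words | feats)  # Jaccard

context = {"newspaper", "reported", "daily"}
ranked = sorted(candidates("Chosun"), key=lambda e: score(context, e), reverse=True)
print(ranked)  # both candidates survive pass 1; context ranks "Chosun Ilbo" first
```

A mention with no alias match anywhere, like Ri Sol-ju in the example, yields an empty candidate list and becomes a ghost.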
(AT LEFT) In this example, our scoring resolves Hyon Song-wol and Wangjeasan correctly to the respective entries in Wikipedia; correctly identifies that Ri Sol-ju is a genuinely new name or “ghost” entity which we may wish to create a knowledge base entry for; but incorrectly associates Chosun with the Wikipedia page for Korea, rather than the news agency Chosun Ilbo. Had Chosun been correctly tagged as an ORG at the NER stage, it would almost certainly have been resolved correctly. This example emphasizes how important high quality foundational linguistic components are for higher level tasks, and how flexibility must be built into downstream algorithms to prevent the errors that do occur from being unrecoverable.
If recent TAC data is anything to go by, entity linking is expected to associate very different strings.
REX Field Training Kit: a package of tools and processes for English, Pashto.
Provides guidelines for:
– Effective use of gazetteer and regular expression components
– Annotation of data and training of supervised models
Clustering tool allows adaptation to domain vocabulary for languages that have word class data.
Slated for 2.0:
– Coverage for all languages REX supports, inc. Korean, Arabic
– Seed resources for specific domains
Less data: better balance between stability of models and volume of data available for adaptation.
Less effort: automated adaptation, tools and UIs for annotation projects.
Inline, e.g. task performance as feedback, in addition to correction.
Online, e.g. dynamic knowledge sources without discontinuities.