Cross Document Coreference
1. Cross document coreference
Kepa Joseba Rodríguez
Seminar on EXtreme Information Extraction
Rovereto, 25. March 2009
2. Outline
Background.
Intra-doc/cross-doc coreference tasks.
Overview of a system.
Unsupervised personal name disambiguation.
Generation of extraction patterns.
Algorithm of (Ravichandran & Hovy, 2002)
Generation of vectors and clustering.
Evaluation
Optional: Disambiguation of geographic names.
Optional: Clustering of news.
3. The task of CDC
Cross document coreference occurs when the same person,
place, event or concept is discussed in more than one text
source. (Bagga & Baldwin 1998)
4. Intra-document vs. cross-document coreference
There are substantial differences between intra-document
and cross document coreference resolution.
In a document there is a certain consistency that we
cannot expect across documents.
Most underlying principles of linguistics and discourse
contexts cannot be applied across documents.
There are some links between both.
The resolution of intra-document coreference helps in
the resolution of cross document coreference.
The resolution of cross document coreference can help in
the resolution of intra-document coreference (Haghighi
& Klein, 2007).
7. Unsupervised personal name disambiguation (1)
A personal name can refer to thousands of different
entities in the real world.
E.g.: for the name Jim Clark, Google returns 76,000
different web sites (Mann & Yarowsky, 2003):
1 Jim Clark Race car driver from Scotland
2 Jim Clark Clock-maker from Colorado
3 Jim Clark Film editor
4 Jim Clark Netscape founder
5 Jim Clark Disaster survivor
6 Jim Clark Car salesman in Kansas
... Jim Clark ...
Each entry has features that may be helpful to
disambiguate the entity.
8. Unsupervised personal name disambiguation (2)
Earlier approaches to personal name disambiguation use
vector representations of the context.
Distinction between instances with identical name based
on potentially indicative words.
Jim Clark - car
Jim Clark - film
Jim Clark - Netscape
Jim Clark - Colorado
For personal names there is more precise
information available than for other kinds of entities.
9. Unsupervised personal name disambiguation (3)
Use of information extraction techniques can add
categorial information like:
Age/date of birth.
Nationality.
Profession.
The space of associated names can be used:
As a vector-based bag-of-words model.
With extracted specific types of association, such as:
family relationships: son, wife, married to...
employment relationships: manager of, etc.
...
10. Generation of extraction patterns
Patterns are automatically generated from data.
It is possible to get good performance without using a
parser or other language-specific resources.
Automatic generation can be applied more flexibly to
new languages.
Potentially higher precision and recall than patterns
introduced by hand.
11. (R & H) algorithm for pattern extraction (1)
Select items for the query (e.g. +Mozart, +1756)
Search a document collection for documents that
contain both terms.
Extract the sentences in which both terms are contained.
Search for the longest matches between sentences. For the
sentences:
The great composer Mozart (1756-1791) achieved fame
at a young age.
Mozart (1756-1791) was a genius.
The whole world would always be indebted to the great
music of Mozart (1756-1791).
the longest matching substring is “Mozart (1756-1791)”
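The longest-match step can be sketched in Python (an illustrative sketch, not the authors' code; `longest_common_substring` and `longest_match` are my own helper names):

```python
from itertools import combinations

def longest_common_substring(s1, s2):
    # Dynamic programming: prev[j] holds the length of the common
    # substring ending at s1[i-1] and s2[j-1] from the previous row.
    best, best_end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return s1[best_end - best:best_end]

def longest_match(sentences):
    # Longest substring shared by any pair of the extracted sentences.
    return max((longest_common_substring(a, b)
                for a, b in combinations(sentences, 2)), key=len)

sentences = [
    "The great composer Mozart (1756-1791) achieved fame at a young age.",
    "Mozart (1756-1791) was a genius.",
    "The whole world would always be indebted to the great music of Mozart (1756-1791).",
]
print(longest_match(sentences).strip())  # Mozart (1756-1791)
```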
12. (R & H) algorithm for pattern extraction (2)
Repeat the same procedure for other terms like
+Newton +1642
+Gandhi +1869
...
For BIRTHDATE the algorithm produces this output:
born in <ANSWER>, <NAME>
<NAME> was born in <ANSWER>
<NAME> (<ANSWER> -
<NAME> (<ANSWER> -)
...
13. (R & H) algorithm to calculate precision (1)
Build a collection of documents that contain the question
term (the name).
Query a search engine using only the question term
Download the top 1000 web documents.
Extract the sentences that contain the question term.
For each extracted pattern, check its presence in the
extracted sentences in two ways:
Presence of the pattern with the <ANSWER> tag matched by
any word (Ca),
e.g.: Mozart was born in <WORD>.
Presence of the pattern with the <ANSWER> tag matched by
the correct term (Co),
e.g.: Mozart was born in 1756.
P = Co / Ca
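A minimal sketch of this precision computation (the function name and the single-token treatment of <ANSWER> are my simplifications):

```python
import re

def pattern_precision(pattern, sentences, name, correct_answer):
    # P = Co / Ca: Co counts matches where <ANSWER> is the correct
    # term, Ca counts matches where <ANSWER> is filled by any word.
    regex = re.compile(
        re.escape(pattern)
        .replace(re.escape("<NAME>"), re.escape(name))
        .replace(re.escape("<ANSWER>"), r"(\S+)")
    )
    c_any = c_correct = 0
    for sentence in sentences:
        for match in regex.finditer(sentence):
            c_any += 1
            if match.group(1).strip(".,") == correct_answer:
                c_correct += 1
    return c_correct / c_any if c_any else 0.0

sentences = ["Mozart was born in 1756.", "Mozart was born in Salzburg."]
precision = pattern_precision("<NAME> was born in <ANSWER>",
                              sentences, "Mozart", "1756")
print(precision)  # Ca = 2, Co = 1 -> 0.5
```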
14. (R & H) algorithm to calculate precision (2)
Example: precision for the extracted patterns for BIRTHDATE.
1.0 <NAME> (<ANSWER> -)
0.85 <NAME> was born on <ANSWER>
0.6 <NAME> was born in <ANSWER>
0.59 <NAME> was born <ANSWER>
0.53 <ANSWER> <NAME> was born
15. Unsupervised Clustering
(Mann & Yarowsky, 2003)
Clustering method: bottom-up centroid agglomerative
clustering.
Each document is represented by a vector of
automatically extracted features.
The two most similar vectors are merged to produce a
new cluster.
The new cluster is represented by a vector equal to the
centroid of the clustered vectors.
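The merge loop can be sketched in plain Python (a simplified, unoptimised illustration under my own naming, using cosine similarity between centroids):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def agglomerative_centroid(vectors, n_clusters):
    # Bottom-up: repeatedly merge the two most similar clusters, then
    # represent the merged cluster by the centroid of its members.
    clusters = [[i] for i in range(len(vectors))]
    cents = [list(v) for v in vectors]
    while len(clusters) > n_clusters:
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: cosine(cents[p[0]], cents[p[1]]),
        )
        clusters[i] += clusters[j]
        cents[i] = centroid([vectors[m] for m in clusters[i]])
        del clusters[j], cents[j]
    return clusters

docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 1], [0, 0.8, 1.2]]
print(agglomerative_centroid(docs, 2))  # [[0, 1], [2, 3]]
```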
16. Cluster refactoring
Unsupervised agglomerative clustering can lead to
problems.
The most similar pages are clustered at the beginning of
the process.
The less similar pages are added as stragglers to the top
levels of the cluster tree.
The top-level clusters are less discriminative than the
clusters at the bottom of the tree.
The refactoring:
Clustering is stopped when a percentage of the
documents have been classified and clusters have
achieved a given size.
The rest of the documents are assigned to the clusters
with the closest distance measure.
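The straggler-assignment step might look like this (a sketch; the function name and the similarity used in the demo are my assumptions):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign_stragglers(clusters, centroids, stragglers, similarity):
    # After stopping agglomeration early, attach each remaining
    # document to the cluster whose centroid is most similar to it.
    for doc_id, vector in stragglers:
        best = max(range(len(centroids)),
                   key=lambda k: similarity(vector, centroids[k]))
        clusters[best].append(doc_id)
    return clusters

clusters = [[0, 1], [2, 3]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
stragglers = [(4, [0.9, 0.2]), (5, [0.1, 0.7])]
print(assign_stragglers(clusters, centroids, stragglers, dot))
# [[0, 1, 4], [2, 3, 5]]
```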
17. Methods for vector generation
Baseline
Techniques of selective term weighting.
Term Frequency / Inverse Document Frequency
(tf-idf)
Mutual Information (mi).
Biographical features (feat)
Extended biographical features (extfeat)
Cluster refactoring.
18. Baseline
The term vectors are composed of only proper nouns.
The similarity between vectors is computed using
standard cosine similarity.
cos(a, b) = (a · b) / (||a|| × ||b||)
19. TF-IDF
Techniques of selective term weighting.
TF-IDF weight (Term Frequency - Inverse Document
Frequency)
Measure used to evaluate how important a word is to a
document in a collection.
The importance increases proportionally to the number
of times a word appears in a document, but it is offset
by the frequency of the word in the collection.
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
idf_i = log( |D| / |{d : t_i ∈ d}| )
tfidf_{i,j} = tf_{i,j} × idf_i
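A small worked example of this weight (toy documents of my own; the helper name is an assumption):

```python
from math import log

def tf_idf(term, doc, collection):
    # tf: relative frequency of the term in the document; idf: log of
    # (collection size / number of documents containing the term).
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in collection if term in d)
    return tf * log(len(collection) / df) if df else 0.0

docs = [
    ["mozart", "was", "a", "composer"],
    ["newton", "was", "a", "physicist"],
    ["mozart", "wrote", "operas"],
]
print(round(tf_idf("mozart", docs[0], docs), 3))  # 0.101
```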
20. Mutual Information
Mutual Information: Measure used to evaluate the
mutual dependence between random variables.
Given a document collection c, for each word w we
compute I(w; c) = p(w|c) / p(w)
We select words that
appear more than 20 times in the collection,
have I(w; c) > 10.
These words are added to the document's feature vector
with a weight equal to log(I(w; c)).
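A sketch of this selection rule with toy counts (`mi_features` is my own name; the target/background split is invented):

```python
from math import log
from collections import Counter

def mi_features(target_docs, all_docs, min_count=20, min_ratio=10):
    # I(w; c) = p(w|c) / p(w): ratio of a word's relative frequency in
    # the target collection c to its relative frequency overall.
    # Words appearing > min_count times with I(w; c) > min_ratio get
    # weight log(I(w; c)).
    in_c = Counter(w for d in target_docs for w in d)
    overall = Counter(w for d in all_docs for w in d)
    n_c, n_all = sum(in_c.values()), sum(overall.values())
    features = {}
    for w, count in in_c.items():
        ratio = (count / n_c) / (overall[w] / n_all)
        if count > min_count and ratio > min_ratio:
            features[w] = log(ratio)
    return features

target = [["trumpeter"] * 25 + ["the"] * 25]  # toy target pages
background = target + [["the"] * 950]         # plus background text
feats = mi_features(target, background)
print(sorted(feats))  # ['trumpeter']
```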
21. Extracted biographical features (feat)
Use of biographical features extracted with the algorithm
of (Ravichandran & Hovy, 2002).
Biographical information is used to link the documents:
documents which contain similar extracted features have
the same referent.
The extracted biographical features help to improve
disambiguation: documents with different extracted
features belong to different clusters.
22. Extracted biographical features (feat)
Type Extracted feature
birth place Midland (4), Texas (3), Alton (1), Illinois (1)
birth year 1926 (9), 1967 (3), 1973 (2), 1947 (1),
1958 (1), 1969 (1)
occupation actor (11), trumpeter (9), heavyweight (2), ...
spouse Demi Moore (1)
Table: feat features extracted for the Davis/Harrelson pseudoname
23. Extended biographical features (extfeat)
In this method the system gives higher weight to words
that appear as fillers of extraction patterns.
Example:
The system recognises 1756 as a birth-year using surface
patterns.
Then when it is found in context outside of an extraction
pattern, it is given a higher weight and added to the
document vector as a potential biographical feature.
For the experiment this was applied to words which appear
more than a threshold of 4 times.
The value of the weight is the log of the number of
times the word was found as an extracted feature.
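The weighting rule can be sketched as follows (threshold as stated above; the example counts and the function name are invented):

```python
from math import log
from collections import Counter

def extended_feature_weights(extracted_occurrences, threshold=4):
    # Words found filling an extraction pattern more than `threshold`
    # times enter the document vector with weight log(count).
    counts = Counter(extracted_occurrences)
    return {w: log(c) for w, c in counts.items() if c > threshold}

# Toy pattern hits: "1756" extracted 9 times, "salzburg" only 3 times.
hits = ["1756"] * 9 + ["salzburg"] * 3
print(extended_feature_weights(hits))  # only '1756' clears the threshold
```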
24. word w(mi) w(extfeat)
adderley 3.50 0
snipes 5.16 0
coltrane 5.06 0
bitches 4.99 0
danson 4.97 0
hemp 4.97 0
mullally 4.95 0
porgy 4.94 0
remastered 4.92 0
actor 3.50 2.40
1926 0 2.20
trumpeter 0 2.20
midland 0 1.39
Table: the 10 words with highest mutual information with the document
collection, and all extfeat words, for the Davis/Harrelson pseudoname
25. Experiments: the data set
The data set consisted of web pages collected using
Google for a set of target personal names.
Not more than 1000 pages for each target name.
No requirement that the web-page was focused on the
name.
No minimum number of occurrences of the name in the
page.
26. Evaluation on pseudonames
Pseudonames are created as follows:
Take retrieval results from two different people.
Replace all references to each name by a unique shared
pseudoname.
Resulting collection consists of documents which are
ambiguous as to whom they are talking about.
The aim of the clustering is to distinguish the introduced
pseudoname.
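Pseudoname creation is straightforward to sketch (the pseudoname string, documents, and function name here are invented):

```python
import re

def make_pseudoname_corpus(docs_a, name_a, docs_b, name_b,
                           pseudo="Davis-Harrelson"):
    # Conflate two people's pages under one shared pseudoname; the
    # clustering task is then to recover the two underlying referents.
    def substitute(docs, name):
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        return [pattern.sub(pseudo, d) for d in docs]
    return substitute(docs_a, name_a) + substitute(docs_b, name_b)

corpus = make_pseudoname_corpus(
    ["Miles Davis played the trumpet."], "Miles Davis",
    ["Woody Harrelson is an actor."], "Woody Harrelson",
)
print(corpus)
```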
27. Evaluation on pseudonames
Select a set of 8 different people:
Historical figures.
Figures from media and pop culture.
Non-famous people with similar backgrounds (birthdate,
profession, etc.)
Submit Google queries and retrieve up to 1000 pages
about each person.
Select a maximum of 100 pages for each person.
Evaluation of two granularities of feature extraction:
Use high precision rules to extract occupation, birthday,
spouse, birth location and school.
Use high recall rules to extract the same terms and add
parent/child relationships.
28. Evaluation on pseudonames
Method Accuracy
nnp 79.7
nnp + tfidf 79.7
nnp + mi 82.9
Table: Disambiguation accuracy of different clustering methods
29. Evaluation on pseudonames
feature set size of the extracted features
Method small large
nnp+feat 82.5 85.1
nnp+feat+extfeat 82.0 84.6
nnp+feat+mi 85.6 85.3
nnp+feat+tfidf 82.9 86.4
Table: Disambiguation accuracy of different clustering methods
and different size of feature sets
30. Evaluation on naturally ambiguous names
Start with a selection of 4 polysemous names, with an
average of 60 different instances each.
Manual annotation with name-ID numbers.
The occurrences of each name are classified into 3
clusters:
The 2 automatically derived first-pass majority seed sets.
The residual set for “other uses”.
Weighting method Precision Recall
TF-IDF .81 .70
Mutual Information .88 .73
31. Conclusions
The results of the clustering are improved by:
Learning and using automatically extracted biographic
information.
The use of weighting techniques.
The produced clusters can be used as seeds for
disambiguating further entities.
32. Disambiguating geographic
names in a digital library
33. Outline
Task of the Perseus project.
Problems of the task domain.
External knowledge sources.
Identification and classification of proper names.
First disambiguation of geographical names.
Simple characterisation of the document context.
Final disambiguation.
34. Task of the Perseus Project
Task of the Perseus Project (Smith & Crane, 2002)
Library with historical data in the humanities, from
ancient Greece to 19th-century America.
Over a million toponym references.
The task consists of:
Identification of geographic names.
Link the names to information about location, type,
dates of occupation, relation to other places,
inhabitants, etc.
Link the names to a position in a map.
35. Problems of the domain
The introduction of an entity by an unambiguous mention
is less common than in newspaper articles.
There are great differences between the documents, such as:
Different document sizes.
Lack of standard structures.
Different registers and dialects are used.
Historical variations: borders, names associated with
different political systems, etc.
Long-distance anaphora.
The resolution process is more similar to cross-document
coreference resolution on the web than in corpora.
36. Knowledge sources
The system uses external knowledge sources. The most
important are:
Getty Thesaurus of Geographic Names.
Cruchley's gazetteer of London, which was built for
geocoding.
Lists of authors of the entries in the Dictionary of
National Biography, which help to add additional
information to the documents.
37. Identification and classification of proper names
The task of identifying proper names and giving them a
first classification is done using simple heuristics:
Capitalisation and punctuation conventions.
Markup added by the editor of the document.
Language specific honorifics (Mr., Dr., etc).
Generic topographic labels are taken as “moderate”
evidence that the name may be geographic.
Rocky Mountains
Charles River
Stand-alone names are preferably classified as personal
names.
John (personal name vs. village in Louisiana or Virginia)
38. Disambiguation (1)
Based on local context.
Explicit disambiguating tags put after the names,
e.g. “Lancaster, PA”, “Vienna, Austria”, post code, etc.
If an ambiguous place name is mentioned together
with other place names, the most likely
interpretation is that it is geographically near
the others.
e.g. if “Philadelphia” and “Harrisburg” appear in the same
paragraph, the preferred interpretation of “Lancaster” will be
the town in Pennsylvania, and not the town in England or
Arizona.
39. Disambiguation (2)
Based on document context.
Preponderance of geographic references in the entire
document.
For short documents, like newspaper articles, document
context and local context are considered the same.
Based on world knowledge.
Captured from gazetteers and other reference works.
Facts about a place, like political coordinates, size, etc.
40. Simple characterisation of the document context
Aggregate all of the possible locations for all the
toponyms in the document onto a one-by-one degree grid.
Assign weights for the number of mentions of each
toponym.
Prune the grid based on general world knowledge.
Compute the centroid of this weighted map.
Compute the standard deviation of the distance of the
points from this centroid.
Discard points more than two times the standard deviation
away from the centroid.
Calculate a new centroid.
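A planar sketch of this procedure (real geographic distance would use great circles; the coordinates, weights, and function name below are illustrative):

```python
from math import sqrt

def context_centroid(points, weights, n_std=2.0):
    # Weighted centroid of all candidate locations, then discard
    # points farther than n_std standard deviations and recompute.
    def centroid(pts, ws):
        total = sum(ws)
        return (sum(w * x for (x, _), w in zip(pts, ws)) / total,
                sum(w * y for (_, y), w in zip(pts, ws)) / total)
    cx, cy = centroid(points, weights)
    dists = [sqrt((x - cx) ** 2 + (y - cy) ** 2) for x, y in points]
    mean_d = sum(dists) / len(dists)
    std = sqrt(sum((d - mean_d) ** 2 for d in dists) / len(dists))
    kept = [(p, w) for p, w, d in zip(points, weights, dists)
            if std == 0 or d <= n_std * std]
    return centroid([p for p, _ in kept], [w for _, w in kept])

# Three Pennsylvania-area mentions plus one London outlier
# (weights = mention counts).
points = [(40.0, -76.0), (40.2, -76.4), (39.9, -76.1), (51.5, -0.1)]
weights = [3, 2, 2, 1]
cx, cy = context_centroid(points, weights)
print(round(cx, 2), round(cy, 2))  # 40.03 -76.14
```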
41. Final disambiguation.
Local context of a toponym is represented by a moving
window of the four previous and four following toponyms
in the text.
Only non-ambiguous or already disambiguated toponyms are
considered.
Each possible interpretation of the ambiguous
toponym is scored using:
Geographical proximity to the toponyms around it.
Proximity to the centroid for the document.
Relative importance.
The interpretation that achieves the highest score is
selected.
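A sketch of the scoring (the weights, the inverse-distance scoring form, and the importance values are my assumptions; the coordinates are approximate):

```python
def score_interpretation(candidate, window_points, doc_centroid, importance,
                         w_local=1.0, w_doc=0.5, w_imp=0.25):
    # Combine (1) proximity to surrounding disambiguated toponyms,
    # (2) proximity to the document centroid, (3) relative importance.
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    local = sum(1.0 / (1.0 + dist(candidate, p)) for p in window_points)
    doc = 1.0 / (1.0 + dist(candidate, doc_centroid))
    return w_local * local + w_doc * doc + w_imp * importance

# Approximate (lat, lon) of Harrisburg and Philadelphia as the window.
window = [(40.27, -76.88), (39.95, -75.17)]
doc_centroid = (40.1, -76.0)
candidates = {
    "Lancaster, PA": ((40.04, -76.31), 0.8),       # importance values
    "Lancaster, England": ((54.05, -2.80), 0.9),   # are invented
}
best = max(candidates,
           key=lambda n: score_interpretation(candidates[n][0], window,
                                              doc_centroid, candidates[n][1]))
print(best)  # Lancaster, PA
```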
42. Evaluation (1)
The system has been evaluated on 5 hand-annotated corpora.
Corpus PCat Prec Rec F1
Greek 0.98 0.93 0.99 0.96
Roman 0.99 0.91 1.00 0.95
London 0.92 0.86 0.96 0.91
California 0.92 0.83 0.96 0.89
Upper Midwest 0.89 0.74 0.89 0.81
43. Evaluation (2)
Categorisation performed on Greek and Roman history
texts is better than on texts about more recent topics.
In places with a high density of population we find
more toponyms that are ambiguous with other names.
Mistakes occur where ethnonyms are used as geo-political
entities (like “The Germans” in Cæsar's Gallic War).
Proper names are usually not inflected in English.
We can add rules by hand to correct this, but the precision
of the system could decrease.
44. Conclusions
Simple heuristic categorisation seems to work properly for
the categorisation of entities that appear in certain kinds
of texts.
The evaluation procedure is not very clear.
There are cases that are not covered properly by the
gazetteers, but the use of huge fine-grained gazetteers
leads to higher recall but lower precision.
An alternative is the use of linguistic processing and
machine learning techniques for restricted cases and
collections of documents.
45. NewsExplorer: multilingual
coreference resolution
46. NewsExplorer
NewsExplorer (Steinberger & Pouliquen, 2008) is an
application that gathers and aggregates extracted
information for 19 languages.
Each entity is displayed on a dedicated web-site.
For each entity the user gets:
A list of the latest news clusters in which the entity has
been mentioned.
A list of other entities found in the same clusters.
Titles and other phrases describing the entity.
Quotations by or about the entity.
Photograph if available.
Wikipedia site about the entity if available.
47. Text analysis components of the system (1)
Monolingual document clustering.
Named entity recognition.
Person.
Organisation.
Geographical location.
Named entity disambiguation.
Quotation recognition and reference resolution for name
parts.
Identification and mapping of name variants for the same
person.
Topic detection and tracking.
48. Text analysis components of the system (2)
Categorisation of documents according to a multilingual
thesaurus.
Cluster similarity calculation:
monolingual.
across languages.
49. Language independent rules for geo-tagging
Use of document context:
If a name can be either a personal name or a place
name, and it has been mentioned as a person earlier,
then the preferred reading is that it is a person.
If a country has been mentioned in the text and a
polysemous item then appears, resolve the ambiguity in
favour of a place in the mentioned country.
Prefer locations that are physically close to other,
non-ambiguous locations that have been mentioned in the
context.
50. Language independent rules for geo-tagging
In case of polysemy, the most important places are
preferred.
Ignore places that cannot be disambiguated.
Combine the rules giving different weights.
51. Inflection and regular variations (1)
Hyphen/space alternations (Jean-Marie / Jean Marie).
Diacritic variations (Schröder / Schroder).
Name inversion: change of position between first and last
name.
Typos: relatively frequent in names like Condoleezza
Rice, often written as Condoleza, Condolezza, etc.
Simplification: Condoleezza Rice and George W. Bush are
frequently simplified as Ms. Rice and President Bush.
52. Inflection and regular variations (2)
Morphological declensions: use of prefixes and suffixes in
several languages.
Transliteration from other alphabets:
there is no 1:1 mapping between letters.
there are different conventions.
Vowel variations, especially in transliterations from and
into Arabic.
53. Identification of name variants
Some of these variants can be predicted and generated
using sets of regular expressions.
e.g. the declension of personal names in Slovene:
s/[aeo]?/(e|a|o|u|om|em|m|ju|jem|ja)?/
For every frequent name in the database a pattern like
the following is generated:
Pierr(e|a|o|u|om|em|m|ju|jem|ja)?
Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?
For cases that cannot be resolved by the regular
expressions:
Normalise the names, translating them to a
language-independent representation.
Compute the edit distance between the name variant and
the normalised names.
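A sketch of both steps: generating a pattern with the Slovene endings listed above, and a normalised-similarity fallback (difflib's ratio here stands in for a proper edit distance; the function names are mine):

```python
import re
from difflib import SequenceMatcher

SLOVENE_SUFFIX = r"(e|a|o|u|om|em|m|ju|jem|ja)?"

def variant_pattern(name):
    # Strip a possible final vowel from the name, then allow any of
    # the Slovene case endings listed above.
    stem = re.sub(r"[aeo]?$", "", name)
    return re.compile(re.escape(stem) + SLOVENE_SUFFIX + r"$")

def name_similarity(a, b):
    # Fallback for variants the regexes miss: compare normalised names.
    normalise = lambda s: s.lower().replace("-", " ")
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

pattern = variant_pattern("Pierre")
print(bool(pattern.match("Pierrom")), bool(pattern.match("Gemayel")))
print(round(name_similarity("Condoleezza Rice", "Condoleza Rice"), 2))
```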
54. Doc. categorisation with multilingual thesaurus (1)
Eurovoc Thesaurus: hierarchically organised controlled
vocabulary developed by European institutions and
national parliaments of different countries.
It is used in public administrations for cataloguing, search
and retrieval of large multilingual collections.
The thesaurus consists of 6000 descriptors organised in 21
fields and at the second level into 127 micro-thesauri.
55. Doc. categorisation with multilingual thesaurus (2)
NewsExplorer produces a ranked set of words statistically
related to the descriptor.
These sets of words were produced on the basis of a large
amount of hand-annotated documents, by comparing the
word frequencies of the subset of texts indexed with each
descriptor with the word frequencies of the whole
training corpus.
This model is completed with a list of stop words to
prevent irrelevant words from having an impact on the
categorisation task.
56. Thanks
57. References (1)
Bagga, A. and Baldwin, B. (1998). Entity-based cross-document
coreferencing using the vector space model. In Proceedings
of the 36th Annual Meeting of the Association for
Computational Linguistics.
Haghighi, A. and Klein, D. (2007). Unsupervised
Coreference Resolution in a Nonparametric Bayesian
Model. In Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics.
Mann, G.S. and Yarowsky, D. (2003). Unsupervised
Personal Name Disambiguation. In Proceedings of
CoNLL.
58. References (2)
Ravichandran, D. and Hovy, E. (2002). Learning surface
text patterns for a question answering system. In
Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics.
Smith, D.A. and Crane, G. (2002). Disambiguating
geographic names in a historical digital library. In
Proceedings of ECDL.
Steinberger, R. & Pouliquen, B. (2008): NewsExplorer -
combining various text analysis tools to allow multilingual
news linking and exploration. Lecture notes for the
lecture held at the SORIA Summer School “Cursos de
Tecnologías Lingüísticas”.