Cross Document Coreference
1. Cross document coreference
Kepa Joseba Rodríguez
Seminar on EXtreme Information Extraction
Rovereto, 25. March 2009
2. Outline
Background.
Intra-doc/cross-doc coreference tasks.
Overview of a system.
Unsupervised personal name disambiguation.
Generation of extraction patterns.
Algorithm of (Ravichandran & Hovy, 2002)
Generation of vectors and clustering.
Evaluation
Optional: Disambiguation of geographic names.
Optional: Clustering of news.
3. The task of CDC
Cross document coreference occurs when the same person,
place, event or concept is discussed in more than one text
source. (Bagga & Baldwin 1998)
4. Intra-document vs. cross-document coreference
There are substantial differences between intra-document
and cross document coreference resolution.
In a document there is a certain consistency that we
cannot expect across documents.
Most underlying principles of linguistics and discourse
contexts cannot be applied across documents.
There are some links between both.
The resolution of intra-document coreference helps in
the resolution of cross document coreference.
The resolution of cross document coreference can help in
the resolution of intra-document coreference (Haghighi
& Klein, 2007).
7. Unsupervised personal name disambiguation (1)
A personal name can refer to thousands of different
entities in the real world.
E.g.: for the name Jim Clark, Google returns 76,000
different web sites (Mann & Yarowsky, 2003):
1 Jim Clark Race car driver from Scotland
2 Jim Clark Clock-maker from Colorado
3 Jim Clark Film editor
4 Jim Clark Netscape founder
5 Jim Clark Disaster survivor
6 Jim Clark Car salesman in Kansas
... Jim Clark ...
Each entry has features that may be helpful to
disambiguate the entity.
8. Unsupervised personal name disambiguation (2)
Earlier approaches to personal name disambiguation use
vector representations of the context.
Distinction between instances with identical name based
on potentially indicative words.
Jim Clark - car
Jim Clark - film
Jim Clark - Netscape
Jim Clark - Colorado
For personal names there is more precise
information available than for other kinds of entities.
9. Unsupervised personal name disambiguation (3)
Use of information extraction techniques can add
categorial information like:
Age/date of birth.
Nationality.
Profession.
The space of associated names can be used:
As a vector-based bag-of-words model.
With extracted specific types of association, such as:
family relationships: son, wife, married to...
employment relationships: manager of, etc.
...
10. Generation of extraction patterns
Patterns are automatically generated from data.
It is possible to get good performance without using a
parser or other language-specific resources.
Automatic generation can be applied more flexibly to
new languages.
Potentially higher precision and recall than patterns
introduced by hand.
11. (R & H) algorithm for pattern extraction (1)
Select items for the query (e.g. +Mozart, +1756)
Search a document collection for documents that
contain both terms.
Extract the sentences in which both terms are contained.
Search for the longest matches between sentences. For the
sentences:
The great composer Mozart (1756-1791) achieved fame
at a young age.
Mozart (1756-1791) was a genius.
The whole world would always be indebted to the great
music of Mozart (1756-1791).
the longest matching substring is “Mozart (1756-1791)”
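The longest-match step can be sketched in Python (an illustrative sketch, not the authors' code; `longest_common_substring` and `longest_match` are my own helper names):

```python
from itertools import combinations

def longest_common_substring(s1, s2):
    # Dynamic programming: prev[j] holds the length of the common
    # substring ending at s1[i-1] and s2[j-1] from the previous row.
    best, best_end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return s1[best_end - best:best_end]

def longest_match(sentences):
    # Longest substring shared by any pair of the extracted sentences.
    return max((longest_common_substring(a, b)
                for a, b in combinations(sentences, 2)), key=len)

sentences = [
    "The great composer Mozart (1756-1791) achieved fame at a young age.",
    "Mozart (1756-1791) was a genius.",
    "The whole world would always be indebted to the great music of Mozart (1756-1791).",
]
print(longest_match(sentences).strip())  # Mozart (1756-1791)
```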
12. (R & H) algorithm for pattern extraction (2)
Repeat the same procedure for other terms like
+Newton +1642
+Gandhi +1869
...
For BIRTHDATE the algorithm produces this output:
born in <ANSWER>, <NAME>
<NAME> was born in <ANSWER>
<NAME> (<ANSWER> -
<NAME> (<ANSWER> -)
...
13. (R & H) algorithm to calculate precision (1)
Build a collection of documents that contain the question
term (the name).
Query a search engine using only the question term
Download the top 1000 web documents.
Extract the sentences that contain the question term.
For each extracted pattern, check its presence in the
extracted sentences in two ways:
Presence of the pattern with the <ANSWER> tag matched by
any word (Ca),
e.g.: Mozart was born in <WORD>.
Presence of the pattern with the <ANSWER> tag matched by
the correct term (Co),
e.g.: Mozart was born in 1756.
P = Co / Ca
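A minimal sketch of this precision computation (the function name and the single-token treatment of <ANSWER> are my simplifications):

```python
import re

def pattern_precision(pattern, sentences, name, correct_answer):
    # P = Co / Ca: Co counts matches where <ANSWER> is the correct
    # term, Ca counts matches where <ANSWER> is filled by any word.
    regex = re.compile(
        re.escape(pattern)
        .replace(re.escape("<NAME>"), re.escape(name))
        .replace(re.escape("<ANSWER>"), r"(\S+)")
    )
    c_any = c_correct = 0
    for sentence in sentences:
        for match in regex.finditer(sentence):
            c_any += 1
            if match.group(1).strip(".,") == correct_answer:
                c_correct += 1
    return c_correct / c_any if c_any else 0.0

sentences = ["Mozart was born in 1756.", "Mozart was born in Salzburg."]
precision = pattern_precision("<NAME> was born in <ANSWER>",
                              sentences, "Mozart", "1756")
print(precision)  # Ca = 2, Co = 1 -> 0.5
```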
14. (R & H) algorithm to calculate precision (2)
Example: precision for the extracted patterns for BIRTHDATE.
1.0 <NAME> (<ANSWER> -)
0.85 <NAME> was born on <ANSWER>
0.6 <NAME> was born in <ANSWER>
0.59 <NAME> was born <ANSWER>
0.53 <ANSWER> <NAME> was born
15. Unsupervised Clustering
(Mann & Yarowsky, 2003)
Clustering method: bottom-up centroid agglomerative
clustering.
Each document is represented by a vector of
automatically extracted features.
The two most similar vectors are merged to produce a
new cluster.
The new cluster is represented by a vector equal to the
centroid of the clustered vectors.
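The merge loop can be sketched in plain Python (a simplified, unoptimised illustration under my own naming, using cosine similarity between centroids):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def agglomerative_centroid(vectors, n_clusters):
    # Bottom-up: repeatedly merge the two most similar clusters, then
    # represent the merged cluster by the centroid of its members.
    clusters = [[i] for i in range(len(vectors))]
    cents = [list(v) for v in vectors]
    while len(clusters) > n_clusters:
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: cosine(cents[p[0]], cents[p[1]]),
        )
        clusters[i] += clusters[j]
        cents[i] = centroid([vectors[m] for m in clusters[i]])
        del clusters[j], cents[j]
    return clusters

docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 1], [0, 0.8, 1.2]]
print(agglomerative_centroid(docs, 2))  # [[0, 1], [2, 3]]
```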
16. Cluster refactoring
Unsupervised agglomerative clustering can lead to
problems.
The most similar pages are clustered at the beginning of
the process.
The less similar pages are added as stragglers to the top
levels of the cluster tree.
The top-level clusters are less discriminative than the
clusters at the bottom of the tree.
The refactoring:
Clustering is stopped when a percentage of the
documents have been classified and clusters have
achieved a given size.
The rest of the documents are assigned to the clusters
with the closest distance measure.
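The straggler-assignment step might look like this (a sketch; the function name and the similarity used in the demo are my assumptions):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign_stragglers(clusters, centroids, stragglers, similarity):
    # After stopping agglomeration early, attach each remaining
    # document to the cluster whose centroid is most similar to it.
    for doc_id, vector in stragglers:
        best = max(range(len(centroids)),
                   key=lambda k: similarity(vector, centroids[k]))
        clusters[best].append(doc_id)
    return clusters

clusters = [[0, 1], [2, 3]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
stragglers = [(4, [0.9, 0.2]), (5, [0.1, 0.7])]
print(assign_stragglers(clusters, centroids, stragglers, dot))
# [[0, 1, 4], [2, 3, 5]]
```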
17. Methods for vector generation
Baseline
Techniques of selective term weighting.
Term Frequency / Inverse Document Frequency
(tf-idf)
Mutual Information (mi).
Biographical features (feat)
Extended biographical features (extfeat)
Cluster refactoring.
18. Baseline
The term vectors are composed of only proper nouns.
The similarity between vectors is computed using
standard cosine similarity.
cos(a, b) = (a · b) / (||a|| × ||b||)
19. TF-IDF
Techniques of selective term weighting.
TF-IDF weight (Term Frequency - Inverse Document
Frequency)
Measure used to evaluate how important a word is to a
document in a collection.
The importance increases proportionally to the number
of times a word appears in a document, but it is offset
by the frequency of the word in the collection.
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
idf_i = log( |D| / |{d : t_i ∈ d}| )
tfidf_{i,j} = tf_{i,j} × idf_i
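A small worked example of this weight (toy documents of my own; the helper name is an assumption):

```python
from math import log

def tf_idf(term, doc, collection):
    # tf: relative frequency of the term in the document; idf: log of
    # (collection size / number of documents containing the term).
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in collection if term in d)
    return tf * log(len(collection) / df) if df else 0.0

docs = [
    ["mozart", "was", "a", "composer"],
    ["newton", "was", "a", "physicist"],
    ["mozart", "wrote", "operas"],
]
print(round(tf_idf("mozart", docs[0], docs), 3))  # 0.101
```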
20. Mutual Information
Mutual Information: Measure used to evaluate the
mutual dependence between random variables.
Given a document collection c, for each word w we
compute I(w; c) = p(w|c) / p(w)
We select words that
appear more than 20 times in the collection,
have I(w; c) > 10.
These words are added to the document's feature vector
with a weight equal to log(I(w; c)).
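A sketch of this selection rule with toy counts (`mi_features` is my own name; the target/background split is invented):

```python
from math import log
from collections import Counter

def mi_features(target_docs, all_docs, min_count=20, min_ratio=10):
    # I(w; c) = p(w|c) / p(w): ratio of a word's relative frequency in
    # the target collection c to its relative frequency overall.
    # Words appearing > min_count times with I(w; c) > min_ratio get
    # weight log(I(w; c)).
    in_c = Counter(w for d in target_docs for w in d)
    overall = Counter(w for d in all_docs for w in d)
    n_c, n_all = sum(in_c.values()), sum(overall.values())
    features = {}
    for w, count in in_c.items():
        ratio = (count / n_c) / (overall[w] / n_all)
        if count > min_count and ratio > min_ratio:
            features[w] = log(ratio)
    return features

target = [["trumpeter"] * 25 + ["the"] * 25]  # toy target pages
background = target + [["the"] * 950]         # plus background text
feats = mi_features(target, background)
print(sorted(feats))  # ['trumpeter']
```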
21. Extracted biographical features (feat)
Use of biographical features extracted with the algorithm
of (Ravichandran & Hovy, 2002).
Biographical information is used to link the documents:
documents which contain similar extracted features have
the same referent.
The extracted biographical features help to improve
disambiguation: documents with different extracted
features belong to different clusters.
22. Extracted biographical features (feat)
Type Extracted feature
birth place Midland (4), Texas (3), Alton (1), Illinois (1)
birth year 1926 (9), 1967 (3), 1973 (2), 1947 (1),
1958 (1), 1969 (1)
occupation actor (11), trumpeter (9), heavyweight (2), ...
spouse Demi Moore (1)
Table: feat features extracted for the Davis/Harrelson pseudoname
23. Extended biographical features (extfeat)
In this method the system gives higher weight to words
that appear as fillers of extraction patterns.
Example:
The system recognises 1756 as a birth-year using surface
patterns.
Then when it is found in context outside of an extraction
pattern, it is given a higher weight and added to the
document vector as a potential biographical feature.
For the experiment this was applied to words which appear
more than a threshold of 4 times.
The value of the weight is the log of the number of
times the word was found as an extracted feature.
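The weighting rule can be sketched as follows (threshold as stated above; the example counts and the function name are invented):

```python
from math import log
from collections import Counter

def extended_feature_weights(extracted_occurrences, threshold=4):
    # Words found filling an extraction pattern more than `threshold`
    # times enter the document vector with weight log(count).
    counts = Counter(extracted_occurrences)
    return {w: log(c) for w, c in counts.items() if c > threshold}

# Toy pattern hits: "1756" extracted 9 times, "salzburg" only 3 times.
hits = ["1756"] * 9 + ["salzburg"] * 3
print(extended_feature_weights(hits))  # only '1756' clears the threshold
```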
24. word w(mi) w(extfeat)
adderley 3.50 0
snipes 5.16 0
coltrane 5.06 0
bitches 4.99 0
danson 4.97 0
hemp 4.97 0
mullally 4.95 0
porgy 4.94 0
remastered 4.92 0
actor 3.50 2.40
1926 0 2.20
trumpeter 0 2.20
midland 0 1.39
Table: the 10 words with highest mutual information with the document
collection, and all extfeat words, for the Davis/Harrelson pseudoname
25. Experiments: the data set
The data set consisted of web pages collected using
Google for a set of target personal names.
Not more than 1000 pages for each target name.
No requirement that the web-page was focused on the
name.
No minimum number of occurrences of the name in the
page.
26. Evaluation on pseudonames
Pseudonames are created as follows:
Take retrieval results from two different people.
Replace all references to each name by a unique shared
pseudoname.
Resulting collection consists of documents which are
ambiguous as to whom they are talking about.
The aim of the clustering is to distinguish the introduced
pseudoname.
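Pseudoname creation is straightforward to sketch (the pseudoname string, documents, and function name here are invented):

```python
import re

def make_pseudoname_corpus(docs_a, name_a, docs_b, name_b,
                           pseudo="Davis-Harrelson"):
    # Conflate two people's pages under one shared pseudoname; the
    # clustering task is then to recover the two underlying referents.
    def substitute(docs, name):
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        return [pattern.sub(pseudo, d) for d in docs]
    return substitute(docs_a, name_a) + substitute(docs_b, name_b)

corpus = make_pseudoname_corpus(
    ["Miles Davis played the trumpet."], "Miles Davis",
    ["Woody Harrelson is an actor."], "Woody Harrelson",
)
print(corpus)
```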
27. Evaluation on pseudonames
Select a set of 8 different people:
Historical figures.
Figures from media and pop culture.
Non-famous people with similar backgrounds (birthdate,
profession, etc.)
Submit Google queries and retrieve up to 1000 pages
about each person.
Select a maximum of 100 pages for each person.
Evaluation of two granularities of feature extraction:
Use high precision rules to extract occupation, birthday,
spouse, birth location and school.
Use high recall rules to extract the same terms and add
parent/child relationships.
28. Evaluation on pseudonames
Method Accuracy
nnp 79.7
nnp + tfidf 79.7
nnp + mi 82.9
Table: Disambiguation accuracy of different clustering methods
29. Evaluation on pseudonames
feature set size of the extracted features
Method small large
nnp+feat 82.5 85.1
nnp+feat+extfeat 82.0 84.6
nnp+feat+mi 85.6 85.3
nnp+feat+tfidf 82.9 86.4
Table: Disambiguation accuracy of different clustering methods
and different size of feature sets
30. Evaluation on naturally ambiguous names
Start with a selection of 4 polysemous names, with an
average of 60 different instances each.
Manual annotation with name-ID numbers.
The occurrences of each name are classified into 3
clusters:
The 2 automatically derived first-pass majority seed sets.
The residual set for “other uses”.
Weighting method Precision Recall
TF-IDF .81 .70
Mutual Information .88 .73
31. Conclusions
The results of the clustering are improved by:
Learning and using automatically extracted biographic
information.
The use of weighting techniques.
The produced clusters can be used as seeds for
disambiguating further entities.
32. Disambiguating geographic
names in a digital library
33. Outline
Task of the Perseus project.
Problems of the task domain.
External knowledge sources.
Identification and classification of proper names.
First disambiguation of geographical names.
Simple characterisation of the document context.
Final disambiguation.
34. Task of the Perseus Project
Task of the Perseus Project (Smith & Crane, 2002)
Library with historical data in the humanities, from
ancient Greece to 19th-century America.
Over a million toponym references.
The task consists of:
Identification of geographic names.
Link the names to information about location, type,
dates of occupation, relation to other places,
inhabitants, etc.
Link the names to a position in a map.
35. Problems of the domain
The introduction of an entity by an unambiguous mention
is less common than in newspaper articles.
There are great differences between the documents, such as:
Different document sizes.
Lack of standard structures.
Different registers and dialects are used.
Historical variations: borders, names associated with
different political systems, etc.
Long-distance anaphora.
The resolution process is more similar to cross-document
coreference resolution on the web than in corpora.
36. Knowledge sources
The system uses external knowledge sources. The most
important are:
Getty Thesaurus of Geographic Names.
Cruchley's gazetteer of London, which was built for
geocoding.
Lists of authors of the entries in the Dictionary of
National Biography, which help to add additional
information to the documents.
37. Identification and classification of proper names
The task of identifying proper names and giving them a
first classification is done using simple heuristics:
Capitalisation and punctuation conventions.
Markup added by the editor of the document.
Language specific honorifics (Mr., Dr., etc).
Generic topographic labels are taken as “moderate”
evidence that the name may be geographic.
Rocky Mountains
Charles River
Stand-alone names are preferably classified as personal
names.
John (personal name vs. village in Louisiana or Virginia)
38. Disambiguation (1)
Based on local context.
Explicit disambiguating tags put after the names,
e.g. “Lancaster, PA”, “Vienna, Austria”, post code, etc.
If an ambiguous place name is mentioned together
with other place names, the most likely
interpretation is that it is geographically near
the others.
e.g. if “Philadelphia” and “Harrisburg” appear in the same
paragraph, the preferred interpretation of “Lancaster” will be
the town in Pennsylvania, and not the town in England or
Arizona.
39. Disambiguation (2)
Based on document context.
Preponderance of geographic references in the entire
document.
For short documents, like newspaper articles, document
context and local context are considered the same.
Based on world knowledge.
Captured from gazetteers and other reference works.
Facts about a place, like political coordinates, size, etc.
40. Simple characterisation of the document context
Aggregate all of the possible locations for all the
toponyms in the document onto a one-by-one degree grid.
Assign weights for the number of mentions of each
toponym.
Prune the grid based on general world knowledge.
Compute the centroid of this weighted map.
Compute the standard deviation of the distance of the
points from this centroid.
Discard points more than two times the standard deviation
away from the centroid.
Calculate a new centroid.
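A planar sketch of this procedure (real geographic distance would use great circles; the coordinates, weights, and function name below are illustrative):

```python
from math import sqrt

def context_centroid(points, weights, n_std=2.0):
    # Weighted centroid of all candidate locations, then discard
    # points farther than n_std standard deviations and recompute.
    def centroid(pts, ws):
        total = sum(ws)
        return (sum(w * x for (x, _), w in zip(pts, ws)) / total,
                sum(w * y for (_, y), w in zip(pts, ws)) / total)
    cx, cy = centroid(points, weights)
    dists = [sqrt((x - cx) ** 2 + (y - cy) ** 2) for x, y in points]
    mean_d = sum(dists) / len(dists)
    std = sqrt(sum((d - mean_d) ** 2 for d in dists) / len(dists))
    kept = [(p, w) for p, w, d in zip(points, weights, dists)
            if std == 0 or d <= n_std * std]
    return centroid([p for p, _ in kept], [w for _, w in kept])

# Three Pennsylvania-area mentions plus one London outlier
# (weights = mention counts).
points = [(40.0, -76.0), (40.2, -76.4), (39.9, -76.1), (51.5, -0.1)]
weights = [3, 2, 2, 1]
cx, cy = context_centroid(points, weights)
print(round(cx, 2), round(cy, 2))  # 40.03 -76.14
```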
41. Final disambiguation.
Local context of a toponym is represented by a moving
window of the four previous and four following toponyms
in the text.
Only non-ambiguous or already disambiguated toponyms are
considered.
Each possible interpretation of the ambiguous
toponym is scored using:
Geographical proximity to the toponyms around it.
Proximity to the centroid for the document.
Relative importance.
The interpretation that achieves the highest score is
selected.
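A sketch of the scoring (the weights, the inverse-distance scoring form, and the importance values are my assumptions; the coordinates are approximate):

```python
def score_interpretation(candidate, window_points, doc_centroid, importance,
                         w_local=1.0, w_doc=0.5, w_imp=0.25):
    # Combine (1) proximity to surrounding disambiguated toponyms,
    # (2) proximity to the document centroid, (3) relative importance.
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    local = sum(1.0 / (1.0 + dist(candidate, p)) for p in window_points)
    doc = 1.0 / (1.0 + dist(candidate, doc_centroid))
    return w_local * local + w_doc * doc + w_imp * importance

# Approximate (lat, lon) of Harrisburg and Philadelphia as the window.
window = [(40.27, -76.88), (39.95, -75.17)]
doc_centroid = (40.1, -76.0)
candidates = {
    "Lancaster, PA": ((40.04, -76.31), 0.8),       # importance values
    "Lancaster, England": ((54.05, -2.80), 0.9),   # are invented
}
best = max(candidates,
           key=lambda n: score_interpretation(candidates[n][0], window,
                                              doc_centroid, candidates[n][1]))
print(best)  # Lancaster, PA
```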
42. Evaluation (1)
The system has been evaluated on 5 hand-annotated corpora.
Corpus PCat Prec Rec F1
Greek 0.98 0.93 0.99 0.96
Roman 0.99 0.91 1.00 0.95
London 0.92 0.86 0.96 0.91
California 0.92 0.83 0.96 0.89
Upper Midwest 0.89 0.74 0.89 0.81
43. Evaluation (2)
Categorisation performed on Greek and Roman history
texts is better than on texts about more recent topics.
In places with a high density of population we find
more toponyms that are ambiguous with other names.
Mistakes occur where ethnonyms are used as geo-political
entities (like “The Germans” in Cæsar's Gallic War).
Proper names are usually not inflected in English.
We can add rules by hand to correct this, but the precision
of the system could decrease.
44. Conclusions
Simple heuristic categorisation seems to work properly for
the categorisation of entities that appear in certain kinds
of texts.
The evaluation procedure is not very clear.
There are cases that are not covered properly by the
gazetteers, but the use of huge fine-grained gazetteers
leads to higher recall but lower precision.
An alternative is the use of linguistic processing and
machine learning techniques for restricted cases and
collections of documents.
45. NewsExplorer: multilingual
coreference resolution
46. NewsExplorer
NewsExplorer (Steinberger & Pouliquen, 2008) is an
application that gathers and aggregates extracted
information for 19 languages.
Each entity is displayed on a dedicated web-site.
For each entity the user gets:
A list of the latest news clusters in which the entity has
been mentioned.
A list of other entities found in the same clusters.
Titles and other phrases describing the entity.
Quotations by or about the entity.
Photograph if available.
Wikipedia site about the entity if available.
47. Text analysis components of the system (1)
Monolingual document clustering.
Named entity recognition.
Person.
Organisation.
Geographical location.
Named entity disambiguation.
Quotation recognition and reference resolution for name
parts.
Identification and mapping of name variants for the same
person.
Topic detection and tracking.
48. Text analysis components of the system (2)
Categorisation of documents according to a multilingual
thesaurus.
Cluster similarity calculation:
monolingual.
across languages.
49. Language independent rules for geo-tagging
Use of document context:
If a name can be either a personal name or a place
name, and it has been mentioned as a person earlier,
then the preferred reading is that it is a person.
If a country has been mentioned in the text and a
polysemous item then appears, resolve the ambiguity in
favour of a place in the mentioned country.
Prefer locations that are physically close to other,
non-ambiguous locations that have been mentioned in the
context.
50. Language independent rules for geo-tagging
In case of polysemy, the most important places are
preferred.
Ignore places that cannot be disambiguated.
Combine the rules giving different weights.
51. Inflection and regular variations (1)
Hyphen/space alternations (Jean-Marie / Jean Marie).
Diacritic variations (Schröder / Schroder).
Name inversion: change of position between first and last
name.
Typos: relatively frequent in names like Condoleezza
Rice, often written as Condoleza, Condolezza, etc.
Simplification: Condoleezza Rice and George W. Bush are
frequently simplified as Ms. Rice and President Bush.
52. Inflection and regular variations (2)
Morphological declensions: use of prefixes and suffixes in
several languages.
Transliteration from other alphabets:
there is no 1:1 mapping between letters.
there are different conventions.
Vowel variations, especially in transliterations from and
into Arabic.
53. Identification of name variants
Some of these variants can be predicted and generated
using sets of regular expressions.
e.g. the declension of personal names in Slovene:
s/[aeo]?/(e|a|o|u|om|em|m|ju|jem|ja)?/
For every frequent name in the database a pattern like
the following is generated:
Pierr(e|a|o|u|om|em|m|ju|jem|ja)?
Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?
For cases that cannot be resolved by the regular
expressions:
Normalise the names, translating them to a
language-independent representation.
Compute the edit distance between the name variant and
the normalised names.
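A sketch of both steps: generating a pattern with the Slovene endings listed above, and a normalised-similarity fallback (difflib's ratio here stands in for a proper edit distance; the function names are mine):

```python
import re
from difflib import SequenceMatcher

SLOVENE_SUFFIX = r"(e|a|o|u|om|em|m|ju|jem|ja)?"

def variant_pattern(name):
    # Strip a possible final vowel from the name, then allow any of
    # the Slovene case endings listed above.
    stem = re.sub(r"[aeo]?$", "", name)
    return re.compile(re.escape(stem) + SLOVENE_SUFFIX + r"$")

def name_similarity(a, b):
    # Fallback for variants the regexes miss: compare normalised names.
    normalise = lambda s: s.lower().replace("-", " ")
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

pattern = variant_pattern("Pierre")
print(bool(pattern.match("Pierrom")), bool(pattern.match("Gemayel")))
print(round(name_similarity("Condoleezza Rice", "Condoleza Rice"), 2))
```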
54. Doc. categorisation with multilingual thesaurus (1)
Eurovoc Thesaurus: hierarchically organised controlled
vocabulary developed by European institutions and
national parliaments of different countries.
It is used in public administrations for cataloguing, search
and retrieval of large multilingual collections.
The thesaurus consists of 6000 descriptors organised in 21
fields and at the second level into 127 micro-thesauri.
55. Doc. categorisation with multilingual thesaurus (2)
NewsExplorer produces a ranked set of words statistically
related to the descriptor.
These sets of words were produced on the basis of a large
amount of hand-annotated documents, by comparing the
word frequencies of the subset of texts indexed with each
descriptor with the word frequencies of the whole
training corpus.
This model is completed with a list of stop words to
prevent irrelevant words from having an impact on the
categorisation task.
56. Thanks
57. References (1)
Bagga, A. and Baldwin, B. (1998). Entity-based cross-document
coreferencing using the vector space model. In Proceedings
of the 36th Annual Meeting of the Association for
Computational Linguistics.
Haghighi, A. and Klein, D. (2007). Unsupervised
Coreference Resolution in a Nonparametric Bayesian
Model. In Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics.
Mann, G.S. and Yarowsky, D. (2003). Unsupervised
Personal Name Disambiguation. In Proceedings of
CoNLL.
58. References (2)
Ravichandran, D. and Hovy, E. (2002). Learning surface
text patterns for a question answering system. In
Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics.
Smith, D.A. and Crane, G. (2002). Disambiguating
geographic names in a historical digital library. In
Proceedings of ECDL.
Steinberger, R. & Pouliquen, B. (2008): NewsExplorer -
combining various text analysis tools to allow multilingual
news linking and exploration. Lecture notes for the
lecture held at the SORIA Summer School “Cursos de
Tecnologías Lingüísticas”.