Cross document coreference

                               Kepa Joseba Rodríguez
                      Seminar on EXtreme Information Extraction


                                     Rovereto, 25 March 2009




Outline

               Background.
                       Intra-document and cross-document coreference tasks.
                       Overview of a system.
               Unsupervised personal name disambiguation.
               Generation of extraction patterns.
                       The algorithm of Ravichandran & Hovy (2002).
               Generation of vectors and clustering.
               Evaluation.
               Optional: disambiguation of geographic names.
               Optional: clustering of news.

The task of CDC



      Cross document coreference occurs when the same person,
      place, event or concept is discussed in more than one text
      source. (Bagga & Baldwin 1998)




Intra-document vs. cross-document coreference
               There are substantial differences between intra-document
               and cross-document coreference resolution.
                       Within a document there is a consistency that we
                       cannot expect across documents.
                       Most underlying principles of linguistics and discourse
                       context cannot be applied across documents.
               Nevertheless, the two tasks are linked.
                       Resolving intra-document coreference helps in
                       the resolution of cross-document coreference.
                       Resolving cross-document coreference can help in
                       the resolution of intra-document coreference (Haghighi
                       & Klein, 2007).

Unsupervised personal name disambiguation (1)
               A personal name can refer to thousands of different
               entities in the real world.
                       Ex: for the name Jim Clark, Google returns 76,000
                       different web sites (Mann & Yarowsky, 2003):
                       1      Jim   Clark        Race car driver from Scotland
                       2      Jim   Clark        Clock-maker from Colorado
                       3      Jim   Clark        Film editor
                       4      Jim   Clark        Netscape founder
                       5      Jim   Clark        Disaster survivor
                       6      Jim   Clark        Car salesman in Kansas
                       ...    Jim   Clark        ...
               Each entry has features that may be helpful to
               disambiguate the entity.
Unsupervised personal name disambiguation (2)

               Earlier approaches to personal name disambiguation use
               representations of the context, such as vectors.
               Instances with identical names are distinguished based
               on potentially indicative words.
                       Jim   Clark     -   car
                       Jim   Clark     -   film
                       Jim   Clark     -   Netscape
                       Jim   Clark     -   Colorado
               For personal names, more precise information is
               available than for other kinds of entities.


Unsupervised personal name disambiguation (3)

               Use of information extraction techniques can add
               categorial information like:
                       Age/date of birth.
                       Nationality.
                       Profession.
               A space of associated names, which can be used:
                       as a vector-based bag-of-words model, or
                       with extracted specific types of association, such as:
                              family relationships: son, wife, married to, ...
                              employment relationships: manager of, etc.
                              ...


Generation of extraction patterns


               Patterns are automatically generated from data.
               Good performance is possible without using a
               parser or other language-specific resources.
               Automatically generated patterns are easier to port to
               new languages.
               Potentially higher precision and recall than
               hand-written patterns.




(R & H) algorithm for pattern extraction (1)
               Select terms for the query (e.g. +Mozart, +1756).
               Search a document collection for documents that
               contain both terms.
               Extract the sentences in which both terms occur.
               Search for the longest matches between sentences. For the
               sentences:
                              The great composer Mozart (1756-1791) achieved fame
                              at a young age.
                              Mozart (1756-1791) was a genius.
                              The whole world would always be indebted to the great
                              music of Mozart (1756-1791).

               the longest matching substring is “Mozart (1756-1791)”
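The longest-match step can be sketched with Python's standard library. This is a minimal pairwise illustration, not the paper's implementation: `longest_common_substring` and the sentence list are our own names, and an efficient system would use a more scalable structure than pairwise comparison.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by strings a and b."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

sentences = [
    "The great composer Mozart (1756-1791) achieved fame at a young age.",
    "Mozart (1756-1791) was a genius.",
    "The whole world would always be indebted to the great music of Mozart (1756-1791).",
]

# Collect the longest substring common to each pair of sentences
# and keep the overall longest one.
candidates = [
    longest_common_substring(s1, s2)
    for i, s1 in enumerate(sentences)
    for s2 in sentences[i + 1:]
]
print(max(candidates, key=len).strip())  # Mozart (1756-1791)
```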
(R & H) algorithm for pattern extraction (2)

               Repeat the same procedure for other terms like
                              +Newton +1642
                              +Gandhi +1869
                              ...

               For BIRTHDATE the algorithm produces this output:
                       born in        <ANSWER>, <NAME>
                       <NAME>         was born in <ANSWER>
                       <NAME>         (<ANSWER> -
                       <NAME>         (<ANSWER> -)
                       ...


(R & H) algorithm to calculate precision (1)
               Build a collection of documents that contain the question
               term (the name).
                       Query a search engine using only the question term.
                       Download the top 1000 web documents.
               Extract the sentences that contain the question term.
               For each extracted pattern, check in the extracted
               sentences for:
                       presence of the pattern with the <ANSWER> tag matched by
                       any word (Ca)
                       e.g.: Mozart was born in <WORD>.
                       presence of the pattern with the <ANSWER> tag matched by
                       the correct term (Co)
                       e.g.: Mozart was born in 1756.

                                                 P = Co / Ca
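The P = Co / Ca computation can be illustrated as follows. This is a toy sketch: `pattern_precision` is a hypothetical helper, and compiling the pattern into a single-word regex is a simplification of the actual matching.

```python
import re

def pattern_precision(pattern, sentences, correct_answer, name):
    """Estimate pattern precision as Co / Ca: Ca counts matches with
    <ANSWER> filled by any word, Co counts matches with the correct term."""
    escaped = re.escape(pattern).replace(re.escape("<NAME>"), re.escape(name))
    rx = re.compile(escaped.replace(re.escape("<ANSWER>"), r"(\w+)"))
    c_any = c_correct = 0
    for sentence in sentences:
        for match in rx.finditer(sentence):
            c_any += 1                      # <ANSWER> matched by any word
            if match.group(1) == correct_answer:
                c_correct += 1              # <ANSWER> matched by the correct term
    return c_correct / c_any if c_any else 0.0

sents = ["Mozart was born in 1756.", "Mozart was born in Salzburg."]
print(pattern_precision("<NAME> was born in <ANSWER>", sents, "1756", "Mozart"))  # 0.5
```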
(R & H) algorithm to calculate precision (2)


      Example: precision for the extracted patterns for BIRTHDATE.

                         1.0        <NAME> (<ANSWER> -)
                         0.85       <NAME> was born on <ANSWER>
                         0.6        <NAME> was born in <ANSWER>
                         0.59       <NAME> was born <ANSWER>
                         0.53       <ANSWER> <NAME> was born




Unsupervised Clustering

      (Mann & Yarowsky, 2003)
               Clustering method: bottom-up centroid agglomerative
               clustering.
               Each document is represented by a vector of
               automatically extracted features.
               The two most similar vectors are merged to produce a
               new cluster.
               The new cluster is represented by a vector equal to the
               centroid of the clustered vectors.
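A minimal sketch of bottom-up centroid agglomerative clustering over dense vectors. The similarity threshold and the toy document vectors are illustrative assumptions, not values from the paper.

```python
def cosine(a, b):
    """Standard cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(vectors, threshold=0.5):
    """Repeatedly merge the two most similar clusters; each new cluster
    is represented by the centroid of its members' vectors."""
    clusters = [(v, [i]) for i, v in enumerate(vectors)]  # (centroid, members)
    while len(clusters) > 1:
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: cosine(clusters[p[0]][0], clusters[p[1]][0]),
        )
        if cosine(clusters[i][0], clusters[j][0]) < threshold:
            break  # no sufficiently similar pair left
        (ci, mi), (cj, mj) = clusters[i], clusters[j]
        members = mi + mj
        # Size-weighted centroid of the two merged clusters.
        centroid = [(len(mi) * x + len(mj) * y) / len(members) for x, y in zip(ci, cj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((centroid, members))
    return [sorted(m) for _, m in clusters]

docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1], [0, 0.1, 0.9]]
print(agglomerate(docs))  # [[0, 1], [2, 3]]
```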


Cluster refactoring
               Unsupervised agglomerative clustering can lead to
               problems.
                       The most similar pages are clustered at the beginning
                       of the process.
                       The less similar pages are added as stragglers at the top
                       levels of the cluster tree.
                       The top-level clusters are less discriminative than the
                       clusters at the bottom of the tree.
               The refactoring:
                       Clustering is stopped when a percentage of the
                       documents has been classified and clusters have
                       reached a given size.
                       The remaining documents are assigned to the clusters
                       with the closest distance measure.
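The refactoring step can be sketched like this. A minimal illustration: `assign_stragglers` is our hypothetical name, and cosine similarity stands in for "the closest distance measure".

```python
def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def assign_stragglers(centroids, leftovers):
    """After stopping agglomeration early, attach each remaining document
    vector to the cluster whose centroid is most similar."""
    return {
        i: max(range(len(centroids)), key=lambda k: cosine(doc, centroids[k]))
        for i, doc in enumerate(leftovers)
    }

centroids = [[1, 0], [0, 1]]   # two early, tight clusters
print(assign_stragglers(centroids, [[0.8, 0.2], [0.1, 0.7]]))  # {0: 0, 1: 1}
```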
Methods for vector generation


               Baseline
               Techniques of selective term weighting.
                       Term Frequency / Inverse Document Frequency
                       (tf-idf)
                       Mutual Information (mi).
               Biographical features (feat)
               Extended biographical features (extfeat)
               Cluster refactoring.



Baseline


               The term vectors are composed of only proper nouns.
               The similarity between vectors is computed using
               standard cosine similarity.
                                  cos(a, b) = (a · b) / (||a|| × ||b||)




TF-IDF

               A technique of selective term weighting.
               TF-IDF weight (Term Frequency - Inverse Document
               Frequency):
                       a measure of how important a word is to a
                       document in a collection.
                       The importance increases proportionally with the number
                       of times a word appears in the document, but is offset
                       by the frequency of the word in the collection.

          tf_i,j = n_i,j / Σ_k n_k,j        idf_i = log(|D| / |{d : t_i ∈ d}|)        tfidf_i,j = tf_i,j × idf_i
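These formulas translate directly into code. A self-contained sketch; the toy documents are our own example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf (frequency in the document) times
    idf (log of inverse document frequency in the collection)."""
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (n / len(doc)) * math.log(len(docs) / df[term])
            for term, n in counts.items()
        })
    return vectors

docs = [["netscape", "clark"], ["film", "clark"], ["netscape", "clark", "film", "netscape"]]
vecs = tfidf_vectors(docs)
# "clark" occurs in every document, so its idf (and hence its weight) is 0.
print(vecs[0])
```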



Mutual Information

               Mutual information: a measure of the
               mutual dependence between random variables.
               Given a document collection c, for each word w we
               compute I(w; c) = p(w|c) / p(w).
               We select words that
                       appear more than 20 times in the collection and
                       have I(w; c) > 10.
               These words are added to the document's feature vector
               with a weight equal to log(I(w; c)).
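A sketch of this weighting. The function name and the packaging of the counts are our own; the score I(w; c) = p(w|c) / p(w) and the cut-offs (more than 20 occurrences, ratio above 10) follow the slide.

```python
import math

def mi_weights(coll_counts, coll_total, bg_counts, bg_total,
               min_count=20, min_ratio=10):
    """Weight words by log I(w; c), where I(w; c) = p(w|c) / p(w):
    keep words seen more than min_count times in the collection
    whose ratio exceeds min_ratio."""
    weights = {}
    for word, count in coll_counts.items():
        if count <= min_count:
            continue                         # too rare in the collection
        p_w_given_c = count / coll_total
        p_w = bg_counts.get(word, 0) / bg_total
        if p_w > 0 and p_w_given_c / p_w > min_ratio:
            weights[word] = math.log(p_w_given_c / p_w)
    return weights

# "trumpeter" is far more frequent in the name-specific collection
# than in the background corpus; "the" is not.
w = mi_weights({"trumpeter": 40, "the": 500}, 600,
               {"trumpeter": 50, "the": 100000}, 1000000)
print(sorted(w))  # ['trumpeter']
```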


Extracted biographical features (feat)


               Use of biographical features extracted with the algorithm
               of Ravichandran & Hovy (2002).
               Biographical information is used to link documents:
               documents that contain similar extracted features have
               the same referent.
               The extracted biographical features also help to improve
               disambiguation: documents with different extracted
               features belong to different clusters.



Extracted biographical features (feat)


         Type                       Extracted feature
         birth place                Midland (4), Texas (3), Alton (1), Illinois (1)
         birth year                 1926 (9), 1967 (3), 1973 (2), 1947 (1),
                                    1958 (1), 1969 (1)
         occupation                 actor (11), trumpeter (9), heavyweight (2), ...
         spouse                     Demi Moore (1)
        Table: feat features extracted for the Davis/Harrelson pseudoname




Extended biographical features (extfeat)
               In this method the system gives higher weight to words
               that appear as pattern fillers.
               Example:
                       The system recognises 1756 as a birth year using surface
                       patterns.
                       When 1756 is then found in a context outside of an
                       extraction pattern, it is given a higher weight and added
                       to the document vector as a potential biographical feature.
               In the experiment this was applied to words that appear
               more than a threshold of 4 times.
               The value of the weight is the log of the number of
               times the word was found as an extracted feature.
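A sketch of the weighting step. The function name and example counts are our own (the counts echo the Davis/Harrelson table on the next slide).

```python
import math

def extfeat_weights(filler_counts, threshold=4):
    """Give a word seen as a pattern filler (e.g. a recognised birth year)
    weight log(times extracted), if extracted more than `threshold` times."""
    return {word: math.log(n) for word, n in filler_counts.items() if n > threshold}

weights = extfeat_weights({"1926": 9, "trumpeter": 9, "midland": 4})
print(weights)  # log(9) ≈ 2.20 for "1926" and "trumpeter"; "midland" filtered out
```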
                                   word               w(mi)       w(extfeat)
                                   adderley            3.50           0
                                   snipes              5.16           0
                                   coltrane            5.06           0
                                   bitches             4.99           0
                                   danson              4.97           0
                                   hemp                4.97           0
                                   mullally            4.95           0
                                   porgy               4.94           0
                                   remastered          4.92           0
                                   actor               3.50         2.40
                                   1926                 0           2.20
                                   trumpeter            0           2.20
                                   midland              0           1.39
      Table: 10 words with highest mutual information with the document
      collection, and all extfeat words, for the Davis/Harrelson pseudoname

Experiments: the data set


               The data set consisted of web pages collected using
               Google for a set of target personal names.
                       No more than 1000 pages for each target name.
                       No requirement that the web page be focused on the
                       name.
                       No minimum number of occurrences of the name in the
                       page.




Evaluation on pseudonames


               Pseudonames are created as follows:
                       Take the retrieval results for two different people.
                       Replace all references to each name with a single shared
                       pseudoname.
               The resulting collection consists of documents that are
               ambiguous as to whom they are talking about.
               The aim of the clustering is to separate the two people
               behind the introduced pseudoname.
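Building such a pseudoname test set is straightforward to sketch. `make_pseudoname_corpus`, the documents and the pseudoname string are illustrative, not from the paper.

```python
import re

def make_pseudoname_corpus(docs_a, name_a, docs_b, name_b, pseudo="Davis_Harrelson"):
    """Merge the retrieval results for two people, replacing every mention
    of either name with one shared pseudoname; return the documents and
    the gold cluster labels."""
    merged, gold = [], []
    for label, (docs, name) in enumerate([(docs_a, name_a), (docs_b, name_b)]):
        for doc in docs:
            merged.append(re.sub(re.escape(name), pseudo, doc))
            gold.append(label)
    return merged, gold

docs, gold = make_pseudoname_corpus(
    ["Miles Davis played the trumpet."], "Miles Davis",
    ["Woody Harrelson is an actor."], "Woody Harrelson",
)
print(docs[0])  # Davis_Harrelson played the trumpet.
```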



Evaluation on pseudonames
               Selected a set of 8 different people:
                       historical figures,
                       figures from media and pop culture,
                       non-famous people with similar backgrounds (birthdate,
                       profession, etc.).
               Submit Google queries and retrieve up to 1000 pages
               for each person.
               Select a maximum of 100 pages for each person.
               Evaluate two granularities of feature extraction:
                       high-precision rules to extract occupation, birthday,
                       spouse, birth location and school;
                       high-recall rules to extract the same terms, plus
                       parent/child relationships.
Evaluation on pseudonames


                                      Method      Accuracy
                                      nnp           79.7
                                      nnp + tfidf   79.7
                                      nnp + mi      82.9
         Table: Disambiguation accuracy of different clustering methods




Evaluation on pseudonames

                             extracted features                feature set size
                                                                small      large
                             nnp+feat                            82.5      85.1
                             nnp+feat+extfeat                    82.0      84.6
                             nnp+feat+mi                         85.6      85.3
                             nnp+feat+tfidf                      82.9      86.4
      Table: Disambiguation accuracy of different clustering methods
      and different sizes of feature sets




Evaluation on naturally ambiguous names

               Start with a selection of 4 polysemous names, with an
               average of 60 different instances each.
               Manually annotate occurrences with name-ID numbers.
               The occurrences of each name are classified into 3
               clusters:
                       the 2 automatically derived first-pass majority seed sets,
                       and a residual set for “other uses”.

                             Weighting method                     Precision Recall
                             TF-IDF                               .81       .70
                             Mutual Information                   .88       .73

Conclusions


               The results of the clustering are improved by:
                       learning and using automatically extracted biographic
                       information, and
                       the use of term-weighting techniques.
               The produced clusters can be used as seeds for
               disambiguating further entities.




Disambiguating geographic
       names in a digital library



Outline


               Task of the Perseus project.
               Problems of the task domain.
               External knowledge sources.
               Identification and classification of proper names.
               First disambiguation of geographical names.
               Simple characterisation of the document context.
               Final disambiguation.



Task of the Perseus project

               The task of the Perseus Project (Smith & Crane, 2002).
               A digital library of historical data in the humanities, from
               ancient Greece to 19th-century America.
               Over a million toponym references.
               The task consists of:
                       identifying geographic names;
                       linking the names to information about location, type,
                       dates of occupation, relation to other places,
                       inhabitants, etc.;
                       linking the names to a position on a map.


Problems of the domain

               The introduction of an entity by an unambiguous mention
               is less common than in newspaper articles.
               There are great differences between the documents:
                       different document sizes;
                       lack of standard structures;
                       different registers and dialects;
                       historical variations: borders, names associated with
                       different political systems, etc.
               Long-distance anaphora.
               The resolution process is more similar to cross-document
               coreference resolution on the web than in corpora.

Knowledge sources


      The system uses external knowledge sources. The most
      important are:
          the Getty Thesaurus of Geographic Names;
          Cruchley's gazetteer of London, which was built for
          geocoding;
          lists of authors of the entries in the Dictionary of
          National Biography, which help to add additional
          information to the documents.



Identification and classification of proper names
      The task of identifying proper names and giving them a first
      classification is done using simple heuristics:
           capitalisation and punctuation conventions;
           markup added by the editor of the document;
           language-specific honorifics (Mr., Dr., etc.);
           generic topographic labels, taken as “moderate”
           evidence that the name may be geographic:
                              Rocky Mountains
                              Charles River

               Stand-alone names are preferentially classified as personal
               names:
                              John (personal name vs. village in Louisiana or Virginia)
Disambiguation (1)

               Based on local context.
                       Explicit disambiguating tags placed after the names,
                       e.g. “Lancaster, PA”, “Vienna, Austria”, a post code, etc.
                       If an ambiguous place name is mentioned together
                       with other place names, the most likely
                       interpretation is one that is geographically near
                       the others.
                       E.g. if “Philadelphia” and “Harrisburg” appear in the same
                       paragraph, the preferred interpretation of “Lancaster” will be
                       the town in Pennsylvania, not the town in England or
                       Arizona.


Disambiguation (2)


               Based on document context.
                       Preponderance of geographic references in the entire
                       document.
                       For short documents, such as newspaper articles, document
                       context and local context are considered the same.
               Based on world knowledge.
                       Captured from gazetteers and other reference works.
                       Facts about a place such as political coordinates, size, etc.




Simple characterisation of the document context
               Aggregate all possible locations for all the
               toponyms in the document onto a one-by-one degree grid.
               Assign weights according to the number of mentions of each
               toponym.
               Prune the grid based on general world knowledge.
               Compute the centroid of this weighted map.
               Compute the standard deviation of the distances of the
               points from this centroid.
               Discard points more than two times the standard deviation
               away from the centroid.
               Calculate a new centroid.
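The centroid-and-prune steps can be sketched as follows. A simplification: coordinates are treated as points on a flat plane rather than a one-by-one-degree lat/long grid, and the toy points and weights are our own.

```python
def refine_centroid(points, weights, cutoff=2.0):
    """Compute a weighted centroid, drop points farther than
    cutoff * (standard deviation of distances) from it, and recompute."""
    def centroid(pts, ws):
        total = sum(ws)
        return (sum(w * x for (x, _), w in zip(pts, ws)) / total,
                sum(w * y for (_, y), w in zip(pts, ws)) / total)

    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    c = centroid(points, weights)
    dists = [dist(p, c) for p in points]
    mean = sum(dists) / len(dists)
    std = (sum((d - mean) ** 2 for d in dists) / len(dists)) ** 0.5
    # Keep only the points within cutoff standard deviations of the centroid.
    kept = [(p, w) for p, w, d in zip(points, weights, dists) if d <= cutoff * std]
    return centroid([p for p, _ in kept], [w for _, w in kept])

# Three candidate locations cluster together; one outlier is discarded.
print(refine_centroid([(0, 0), (1, 0), (0, 1), (10, 10)], [3, 1, 1, 1]))  # (0.2, 0.2)
```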
Final disambiguation
               The local context of a toponym is represented by a moving
               window of the four previous and four following toponyms
               in the text.
               Only unambiguous or already disambiguated toponyms are
               considered.
               Each possible interpretation of the ambiguous
               toponym is scored using:
                       geographical proximity to the toponyms around it;
                       proximity to the centroid of the document;
                       relative importance.
               The interpretation that achieves the highest score is
               selected.
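The scoring can be sketched as a weighted combination. The weight values and the `score_interpretation` helper are illustrative assumptions, not the Perseus system's actual parameters, and distances are plain Euclidean over lat/long.

```python
def score_interpretation(candidate, window_points, doc_centroid, importance,
                         w_window=1.0, w_centroid=0.5, w_importance=0.1):
    """Score one candidate location for an ambiguous toponym: closer to the
    surrounding disambiguated toponyms, closer to the document centroid,
    and more 'important' is better."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    avg_window = sum(dist(candidate, p) for p in window_points) / len(window_points)
    return (-w_window * avg_window
            - w_centroid * dist(candidate, doc_centroid)
            + w_importance * importance)

# "Lancaster" with Philadelphia and Harrisburg nearby in the window:
# the Pennsylvania reading wins despite lower "importance".
candidates = {
    "Lancaster, PA": ((40.0, -76.3), 5),
    "Lancaster, England": ((54.0, -2.8), 7),
}
window = [(39.95, -75.17), (40.27, -76.88)]   # Philadelphia, Harrisburg
doc_centroid = (40.1, -76.0)
best = max(candidates, key=lambda k: score_interpretation(
    candidates[k][0], window, doc_centroid, candidates[k][1]))
print(best)  # Lancaster, PA
```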
Evaluation (1)

      The system has been evaluated using 5 hand-annotated corpora.

                       Corpus                       PCat          Prec   Rec     F1
                       Greek                        0.98          0.93   0.99   0.96
                       Roman                        0.99          0.91   1.00   0.95
                       London                       0.92          0.86   0.96   0.91
                       California                   0.92          0.83   0.96   0.89
                       Upper Midwest                0.89          0.74   0.89   0.81




Evaluation (2)

               Categorisation performs better on the Greek and
               Roman history texts than on texts about more
               recent topics.
                       In densely populated places we find
                       more toponyms that are ambiguous with other names.
               Mistakes occur where ethnonyms are used as geo-political
               entities (like “The Germans” in Cæsar's Gallic War).
                       Proper names are usually not inflected in English.
                       Rules could be added by hand to correct this, but the
                       precision of the system could decrease.


Conclusions

               Simple heuristic categorisation seems to work well for
               the categorisation of entities that appear in certain kinds
               of texts.
                       The evaluation procedure is not very clear.
               There are cases that are not covered properly by the
               gazetteers, but the use of huge fine-grained gazetteers
               leads to higher recall at the cost of lower precision.
               An alternative is the use of linguistic processing and
               machine learning techniques for restricted cases and
               document collections.


NewsExplorer: multilingual
         coreference resolution



NewsExplorer
               NewsExplorer (Steinberger & Pouliquen, 2008) is an
               application that gathers and aggregates extracted
               information in 19 languages.
               Each entity is displayed on a dedicated web page.
               For each entity the user gets:
                       a list of the latest news clusters in which the entity
                       has been mentioned;
                       a list of other entities found in the same clusters;
                       titles and other phrases describing the entity;
                       quotations by the entity or about it;
                       a photograph, if available;
                       the Wikipedia page about the entity, if available.

Text analysis components of the system (1)
               Monolingual document clustering.
               Named entity recognition.
                       Person.
                       Organisation.
                       Geographical location.
               Named entity disambiguation.
               Quotation recognition and reference resolution for name
               parts.
               Identification and mapping of name variants for the same
               person.
               Topic detection and tracking.
Text analysis components of the system (2)



               Categorisation of documents according to a multilingual
               thesaurus.
               Cluster similarity calculation:
                       monolingual.
                       across languages.




Language-independent rules for geo-tagging
               Use of document context:
                       If a name can be either a personal name or the name of
                       a place, and it has been mentioned as a person earlier,
                       then the preferred reading is that it is a person.
                       If a country has been mentioned in the text and a
                       polysemous item then appears, resolve the ambiguity in
                       favour of a place in the mentioned country.
                       Prefer locations that are physically close to other,
                       unambiguous locations that have been mentioned in the
                       context.
Language independent rules for geo-tagging



               In case of polysemy, prefer the most important (larger or
               better-known) places.
               Ignore places that cannot be disambiguated.
               Combine the rules, giving each a different weight.
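Taken together, these rules amount to a weighted scoring of the candidate places for an ambiguous name. A minimal sketch of such a combination — the rule weights, the threshold, and the candidate fields (`country`, `coords`, `importance`) are illustrative assumptions, not NewsExplorer's actual values:

```python
import math

def distance(a, b):
    # Plain Euclidean distance on (lat, lon) pairs; enough for ranking.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def score_candidate(candidate, context):
    """Combine the geo-tagging rules with (assumed) weights."""
    score = 0.0
    # Rule: prefer places in a country already mentioned in the text.
    if candidate["country"] in context["mentioned_countries"]:
        score += 3.0
    # Rule: prefer places close to non-ambiguous locations in the context.
    if context["nearby_unambiguous"]:
        min_dist = min(distance(candidate["coords"], c)
                       for c in context["nearby_unambiguous"])
        score += 2.0 / (1.0 + min_dist)
    # Rule: in case of polysemy, prefer more important places.
    score += 1.0 * candidate["importance"]
    return score

def disambiguate(candidates, context, threshold=1.0):
    """Return the best-scoring candidate, or None if no candidate is
    convincing enough (rule: ignore unresolvable places)."""
    scored = sorted(candidates, key=lambda c: score_candidate(c, context),
                    reverse=True)
    if not scored or score_candidate(scored[0], context) < threshold:
        return None
    return scored[0]
```

For example, an ambiguous "Paris" would resolve to Paris, France when France is mentioned in the document, but to Paris, Texas when only nearby Texan locations appear in the context.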




Inflection and regular variations (1)


               Hyphen/space alternations (Jean-Marie / Jean Marie).
               Diacritic variations (Schröder / Schroder).
               Name inversion: change of position between first and last
               name.
               Typos: relatively frequent in names like Condoleezza
               Rice, often written as Condoleza, Condolezza, etc.
               Simplification: Condoleezza Rice and George W. Bush are
               frequently simplified as Ms. Rice and President Bush.
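Several of these regular variations (hyphen/space alternations, diacritics, name inversion) can be collapsed by mapping each name to a normalised key. A sketch of such a hypothetical helper — the normalisation steps are assumptions about one reasonable design, not the system's exact procedure:

```python
import unicodedata

def normalise_name(name):
    """Map a name to a language-independent key: strip diacritics,
    replace hyphens and commas with spaces, lowercase, and sort the
    name parts so inverted first/last names collapse to one key."""
    # Decompose accented characters and drop the combining marks.
    nfkd = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in nfkd if not unicodedata.combining(ch))
    stripped = stripped.replace("-", " ").replace(",", " ").lower()
    parts = sorted(p for p in stripped.split() if p)
    return " ".join(parts)
```

Under this scheme "Jean-Marie", "Jean Marie", "Schröder, Gerhard", and "Gerhard Schroder" all reduce to shared keys, so the variants can be matched by simple key equality.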



Inflection and regular variations (2)


               Morphological declensions: use of prefixes and suffixes in
               several languages.
               Transliteration from other alphabets:
                       there is no one-to-one mapping between letters.
                       there are different conventions.
               Vowel variations, especially in transliterations from and
               into Arabic.




Identification of name variants
               Some of these variants can be predicted and generated
               using sets of regular expressions.
               e.g. declension of personal names in Slovene:
                       s/[aeo]?$/(e|a|o|u|om|em|m|ju|jem|ja)?/
                       For every frequent name in the database, a pattern
                       is generated, e.g.
                       Pierr(e|a|o|u|om|em|m|ju|jem|ja)?
                       Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?
               For cases that cannot be resolved by the regular
               expressions:
                       Normalise the names, translating them to a
                       language-independent representation.
                       Compute the edit distance between the name variant
                       and the normalised names.
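The two-step matching described above can be sketched as follows. The Slovene suffix alternation is taken from the slide; the function names, the use of Python's `difflib.SequenceMatcher` similarity as a stand-in for edit distance, and the 0.8 threshold are illustrative assumptions:

```python
import re
from difflib import SequenceMatcher

# Slovene case endings, as given on the slide.
SLOVENE_SUFFIX = r"(e|a|o|u|om|em|m|ju|jem|ja)?"

def declension_pattern(name):
    """Build the slide's pattern for one base name: strip a final
    a/e/o and allow any of the Slovene case endings."""
    stem = re.sub(r"[aeo]$", "", name)
    return re.compile(re.escape(stem) + SLOVENE_SUFFIX + r"$")

def matches_variant(base, variant, threshold=0.8):
    # First try the regular-expression route...
    if declension_pattern(base).match(variant):
        return True
    # ...otherwise fall back to a string-similarity score between the
    # lowercased forms (standing in for edit distance on normalised names).
    return SequenceMatcher(None, base.lower(), variant.lower()).ratio() > threshold
```

So "Pierrom" and "Gemayelju" match their base names via the generated patterns, while unrelated names fall below the similarity threshold.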
Doc. categorisation with multilingual thesaurus (1)


               Eurovoc Thesaurus: hierarchically organised controlled
               vocabulary developed by European institutions and
               national parliaments of different countries.
               It is used in public administrations for cataloguing, search
               and retrieval of large multilingual collections.
               The thesaurus consists of 6,000 descriptors organised into
               21 fields and, at the second level, into 127 micro-thesauri.




Doc. categorisation with multilingual thesaurus (2)

               NewsExplorer produces a ranked set of words statistically
               related to the descriptor.
               These sets of words were produced from a large collection
               of hand-annotated documents, by comparing the word
               frequencies of the subset of texts indexed with each
               descriptor against the word frequencies of the whole
               training corpus.
               This model is complemented with a list of stop words to
               prevent irrelevant words from influencing the
               categorisation task.
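The frequency comparison above can be sketched with a simple relative-frequency ratio. The scoring formula here is an assumption for illustration (the slide does not give NewsExplorer's exact statistic), and the add-one smoothing is likewise an assumed choice:

```python
from collections import Counter

def ranked_descriptor_words(descriptor_docs, corpus_docs, stop_words, top_n=5):
    """Rank words by how much more frequent they are in the texts
    indexed with a descriptor than in the whole training corpus."""
    sub = Counter(w for doc in descriptor_docs
                  for w in doc if w not in stop_words)
    whole = Counter(w for doc in corpus_docs for w in doc)
    sub_total = sum(sub.values()) or 1
    whole_total = sum(whole.values()) or 1
    # Ratio of relative frequencies, with add-one smoothing on the
    # corpus count to avoid division by zero.
    scores = {w: (c / sub_total) / ((whole[w] + 1) / whole_total)
              for w, c in sub.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

For a descriptor such as "fisheries", words like "fishing" that are over-represented in its indexed texts would rank at the top, while corpus-wide frequent words score near 1 and stop words are excluded outright.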


Thanks




References (1)

               Bagga, A. and Baldwin, B. (1998). Entity-based cross
               document coreferencing using the vector space model. In
               Proceedings of the 36th Annual Meeting of the
               Association for Computational Linguistics.
               Haghighi, A. and Klein, D. (2007). Unsupervised
               Coreference Resolution in a Nonparametric Bayesian
               Model. In Proceedings of the 45th Annual Meeting of the
               Association for Computational Linguistics.
               Mann, G.S. and Yarowsky, D. (2003). Unsupervised
               Personal Name Disambiguation. In Proceedings of CoNLL.

References (2)
               Ravichandran, D. and Hovy, E. (2002). Learning surface
               text patterns for a question answering system. In
               Proceedings of the 40th Annual Meeting of the
               Association for Computational Linguistics.
               Smith, D.A. and Crane, G. (2002). Disambiguating
               geographic names in a historical digital library. In
               Proceedings of ECDL.
               Steinberger, R. and Pouliquen, B. (2008). NewsExplorer -
               combining various text analysis tools to allow multilingual
               news linking and exploration. Lecture notes for the
               lecture held at the SORIA Summer School “Cursos de
               Tecnologías Lingüísticas”.
Kepa Joseba Rodr´
                ıguez Seminar on EXtreme Information Extraction
Cross document coreference

Contenu connexe

En vedette

Coreference Resolution
Coreference ResolutionCoreference Resolution
Coreference Resolutionwushumin
 
LVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYLVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYdmbaturin
 
Parallel Community Detection for Massive Graphs
Parallel Community Detection for Massive GraphsParallel Community Detection for Massive Graphs
Parallel Community Detection for Massive GraphsJason Riedy
 
MTAAP12: Scalable Community Detection
MTAAP12: Scalable Community DetectionMTAAP12: Scalable Community Detection
MTAAP12: Scalable Community DetectionJason Riedy
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta PyData
 
Physics Inspired Approaches to Community Detection
Physics Inspired Approaches to Community DetectionPhysics Inspired Approaches to Community Detection
Physics Inspired Approaches to Community Detectionfuzzysphere
 
Tutorial on Coreference Resolution
Tutorial on Coreference Resolution Tutorial on Coreference Resolution
Tutorial on Coreference Resolution Anirudh Jayakumar
 
Dynamic Knowledge-Base Alignment for Coreference Resolution
Dynamic Knowledge-Base Alignment for Coreference ResolutionDynamic Knowledge-Base Alignment for Coreference Resolution
Dynamic Knowledge-Base Alignment for Coreference ResolutionJinho Choi
 
Scalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmScalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmNavid Sedighpour
 
PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.Andrii Soldatenko
 
ابزارهای پردازش زبان طبیعی
ابزارهای پردازش زبان طبیعیابزارهای پردازش زبان طبیعی
ابزارهای پردازش زبان طبیعیEhsan Asgarian
 
Use of graphs for political analysis
Use of graphs for political analysisUse of graphs for political analysis
Use of graphs for political analysisGraph-TA
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutioneXascale Infolab
 
Python for text processing
Python for text processingPython for text processing
Python for text processingXiang Li
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlBen Healey
 
Community detection in graphs
Community detection in graphsCommunity detection in graphs
Community detection in graphsNicola Barbieri
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Prakash Pimpale
 

En vedette (19)

Coreference Resolution
Coreference ResolutionCoreference Resolution
Coreference Resolution
 
LVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYLVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLY
 
Parallel Community Detection for Massive Graphs
Parallel Community Detection for Massive GraphsParallel Community Detection for Massive Graphs
Parallel Community Detection for Massive Graphs
 
MTAAP12: Scalable Community Detection
MTAAP12: Scalable Community DetectionMTAAP12: Scalable Community Detection
MTAAP12: Scalable Community Detection
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
 
Physics Inspired Approaches to Community Detection
Physics Inspired Approaches to Community DetectionPhysics Inspired Approaches to Community Detection
Physics Inspired Approaches to Community Detection
 
Tutorial on Coreference Resolution
Tutorial on Coreference Resolution Tutorial on Coreference Resolution
Tutorial on Coreference Resolution
 
L1
L1L1
L1
 
Dynamic Knowledge-Base Alignment for Coreference Resolution
Dynamic Knowledge-Base Alignment for Coreference ResolutionDynamic Knowledge-Base Alignment for Coreference Resolution
Dynamic Knowledge-Base Alignment for Coreference Resolution
 
Scalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmScalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithm
 
Text analysis using python
Text analysis using pythonText analysis using python
Text analysis using python
 
PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.
 
ابزارهای پردازش زبان طبیعی
ابزارهای پردازش زبان طبیعیابزارهای پردازش زبان طبیعی
ابزارهای پردازش زبان طبیعی
 
Use of graphs for political analysis
Use of graphs for political analysisUse of graphs for political analysis
Use of graphs for political analysis
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
 
Python for text processing
Python for text processingPython for text processing
Python for text processing
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco Control
 
Community detection in graphs
Community detection in graphsCommunity detection in graphs
Community detection in graphs
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
 

Similaire à Cross Document Coreference

A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
Open hpi semweb-06-part7
Open hpi semweb-06-part7Open hpi semweb-06-part7
Open hpi semweb-06-part7Nadine Ludwig
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialLeeFeigenbaum
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 
Translating Natural Language into SPARQL for Neural Question Answering
Translating Natural Language into SPARQL for Neural Question AnsweringTranslating Natural Language into SPARQL for Neural Question Answering
Translating Natural Language into SPARQL for Neural Question AnsweringTommaso Soru
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are AlgorithmsInfluxData
 
Using topic modelling frameworks for NLP and semantic search
Using topic modelling frameworks for NLP and semantic searchUsing topic modelling frameworks for NLP and semantic search
Using topic modelling frameworks for NLP and semantic searchDawn Anderson MSc DigM
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
 
Choices, modelling and Frankenstein Ontologies
Choices, modelling and Frankenstein OntologiesChoices, modelling and Frankenstein Ontologies
Choices, modelling and Frankenstein Ontologiesbenosteen
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
Content Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text MiningContent Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text MiningFindwise
 
DB-IR-ranking
DB-IR-rankingDB-IR-ranking
DB-IR-rankingFELIX75
 
DATA641 Lecture 3 - Word meaning.pptx
DATA641 Lecture 3 - Word meaning.pptxDATA641 Lecture 3 - Word meaning.pptx
DATA641 Lecture 3 - Word meaning.pptxDrPraveenPawar
 
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge BasesExplanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge BasesDaniel Sonntag
 
Using NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureUsing NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureDatabricks
 
Build Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to OmegaBuild Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to OmegaRavi Mynampaty
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesIan Foster
 

Similaire à Cross Document Coreference (20)

A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
Open hpi semweb-06-part7
Open hpi semweb-06-part7Open hpi semweb-06-part7
Open hpi semweb-06-part7
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
 
Translating Natural Language into SPARQL for Neural Question Answering
Translating Natural Language into SPARQL for Neural Question AnsweringTranslating Natural Language into SPARQL for Neural Question Answering
Translating Natural Language into SPARQL for Neural Question Answering
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are Algorithms
 
Using topic modelling frameworks for NLP and semantic search
Using topic modelling frameworks for NLP and semantic searchUsing topic modelling frameworks for NLP and semantic search
Using topic modelling frameworks for NLP and semantic search
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
Choices, modelling and Frankenstein Ontologies
Choices, modelling and Frankenstein OntologiesChoices, modelling and Frankenstein Ontologies
Choices, modelling and Frankenstein Ontologies
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 
Content Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text MiningContent Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text Mining
 
DB-IR-ranking
DB-IR-rankingDB-IR-ranking
DB-IR-ranking
 
DATA641 Lecture 3 - Word meaning.pptx
DATA641 Lecture 3 - Word meaning.pptxDATA641 Lecture 3 - Word meaning.pptx
DATA641 Lecture 3 - Word meaning.pptx
 
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge BasesExplanations in Dialogue Systems through Uncertain RDF Knowledge Bases
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases
 
Using NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureUsing NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 Literature
 
Build Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to OmegaBuild Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to Omega
 
DB and IR Integration
DB and IR IntegrationDB and IR Integration
DB and IR Integration
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 

Plus de Kepa J. Rodriguez

LOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish StudiesLOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish StudiesKepa J. Rodriguez
 
Use case: data edited as a book !!!
Use case: data edited as a book !!!Use case: data edited as a book !!!
Use case: data edited as a book !!!Kepa J. Rodriguez
 
Building a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationBuilding a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationKepa J. Rodriguez
 
Design and prototype of a Help Desk System for EHRI: an Information Retrieval...
Design and prototype of a Help Desk System for EHRI: an Information Retrieval...Design and prototype of a Help Desk System for EHRI: an Information Retrieval...
Design and prototype of a Help Desk System for EHRI: an Information Retrieval...Kepa J. Rodriguez
 
Information Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchInformation Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchKepa J. Rodriguez
 
Named entity extraction tools for raw OCR text
Named entity extraction tools for raw OCR textNamed entity extraction tools for raw OCR text
Named entity extraction tools for raw OCR textKepa J. Rodriguez
 
Active Annotation of Corpora.
Active Annotation of Corpora.Active Annotation of Corpora.
Active Annotation of Corpora.Kepa J. Rodriguez
 

Plus de Kepa J. Rodriguez (7)

LOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish StudiesLOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish Studies
 
Use case: data edited as a book !!!
Use case: data edited as a book !!!Use case: data edited as a book !!!
Use case: data edited as a book !!!
 
Building a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationBuilding a 3-gram model for Language Identification
Building a 3-gram model for Language Identification
 
Design and prototype of a Help Desk System for EHRI: an Information Retrieval...
Design and prototype of a Help Desk System for EHRI: an Information Retrieval...Design and prototype of a Help Desk System for EHRI: an Information Retrieval...
Design and prototype of a Help Desk System for EHRI: an Information Retrieval...
 
Information Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchInformation Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical Research
 
Named entity extraction tools for raw OCR text
Named entity extraction tools for raw OCR textNamed entity extraction tools for raw OCR text
Named entity extraction tools for raw OCR text
 
Active Annotation of Corpora.
Active Annotation of Corpora.Active Annotation of Corpora.
Active Annotation of Corpora.
 

Dernier

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Dernier (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames

Cross Document Coreference

  • 1. Cross document coreference. Kepa Joseba Rodríguez. Seminar on EXtreme Information Extraction. Rovereto, 25 March 2009.
  • 2. Outline: Background (intra-document vs. cross-document coreference tasks; overview of a system). Unsupervised personal name disambiguation. Generation of extraction patterns: the algorithm of (Ravichandran & Hovy, 2002). Generation of vectors and clustering. Evaluation. Optional: disambiguation of geographic names. Optional: clustering of news.
  • 3. The task of CDC: Cross-document coreference occurs when the same person, place, event or concept is discussed in more than one text source (Bagga & Baldwin, 1998).
  • 4. Intra-document vs. cross-document coreference: There are substantial differences between intra-document and cross-document coreference resolution. Within a document there is a certain consistency that we cannot expect across documents, and most underlying principles of linguistics and discourse context cannot be applied across documents. There are nevertheless links between the two tasks: the resolution of intra-document coreference helps in the resolution of cross-document coreference, and the resolution of cross-document coreference can help in the resolution of intra-document coreference (Haghighi & Klein, 2007).
  • 5. (slide contains a figure only)
  • 6. (slide contains a figure only)
  • 7. Unsupervised personal name disambiguation (1): A personal name can refer to thousands of different entities in the real world. E.g. for the name Jim Clark, Google shows 76,000 different web sites (Mann & Yarowsky, 2003): (1) Jim Clark, race car driver from Scotland; (2) Jim Clark, clock-maker from Colorado; (3) Jim Clark, film editor; (4) Jim Clark, Netscape founder; (5) Jim Clark, disaster survivor; (6) Jim Clark, car salesman in Kansas; ... Each entry has features that may be helpful to disambiguate the entity.
  • 8. Unsupervised personal name disambiguation (2): Earlier approaches to personal name disambiguation use representations of the context such as vectors, distinguishing instances with an identical name by potentially indicative words: Jim Clark - car; Jim Clark - film; Jim Clark - Netscape; Jim Clark - Colorado. In the case of personal names, more precise information is available than for other kinds of entities.
  • 9. Unsupervised personal name disambiguation (3): Information extraction techniques can add categorial information such as age/date of birth, nationality, profession, and a space of associated names. This can be used as a vector-based bag-of-words model, or with extracted specific types of association, such as family relationships (son, wife, married to, ...) or employment relationships (manager of, etc.).
  • 10. Generation of extraction patterns: Patterns are automatically generated from data. It is possible to get good performance without using a parser or other language-specific resources. Automatic generation is more flexible for application to new languages, and gives potentially higher precision and recall than hand-written patterns.
  • 11. (R & H) algorithm for pattern extraction (1): Select terms for the query (e.g. +Mozart, +1756). Search a document collection for documents that contain both terms. Extract the sentences in which both terms occur. Search for the longest matches between sentences. For the sentences "The great composer Mozart (1756-1791) achieved fame at a young age", "Mozart (1756-1791) was a genius", and "The whole world would always be indebted to the great music of Mozart (1756-1791)", the longest matching substring is "Mozart (1756-1791)".
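The longest-match step can be sketched with Python's difflib; this is a minimal illustration (the function names are illustrative, not from the paper):

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Longest contiguous match between two strings."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def candidate_pattern(sentences):
    """Reduce a list of sentences (all containing the query terms)
    to their longest shared substring."""
    pattern = sentences[0]
    for s in sentences[1:]:
        pattern = longest_common_substring(pattern, s)
    return pattern.strip()

sentences = [
    "The great composer Mozart (1756-1791) achieved fame at a young age.",
    "Mozart (1756-1791) was a genius.",
    "The whole world would always be indebted to the great music of Mozart (1756-1791).",
]
print(candidate_pattern(sentences))  # Mozart (1756-1791)
```

In a full system the shared substring would then be generalised by replacing the query terms with <NAME> and <ANSWER> slots.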
  • 12. (R & H) algorithm for pattern extraction (2): Repeat the same procedure for other term pairs, e.g. +Newton +1642, +Gandhi +1869, ... For BIRTHDATE the algorithm produces output such as: born in <ANSWER>, <NAME>; <NAME> was born in <ANSWER>; <NAME> (<ANSWER> -; <NAME> (<ANSWER> -).
  • 13. (R & H) algorithm to calculate precision (1): Build a collection of documents that contain the question term (the name): query a search engine using only the question term, download the top 1000 web documents, and extract the sentences that contain the question term. For each extracted pattern, check its presence in the extracted sentences in two forms: the pattern with the <ANSWER> tag matched by any word (Ca), e.g. "Mozart was born in <WORD>", and the pattern with the <ANSWER> tag matched by the correct term (Co), e.g. "Mozart was born in 1756". Precision is P = Co / Ca.
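The precision computation can be sketched as follows. This is a simplified illustration: the pattern-to-regex conversion is an assumption, and fillers are matched as single words:

```python
import re

def pattern_precision(pattern, sentences, name, answer):
    """Precision of one surface pattern:
    Ca = matches with the <ANSWER> slot filled by any word,
    Co = matches where the filler is the correct answer."""
    # Turn e.g. "<NAME> was born in <ANSWER>" into a regex capturing the filler.
    regex = re.escape(pattern).replace(re.escape("<NAME>"), re.escape(name))
    regex = regex.replace(re.escape("<ANSWER>"), r"(\w+)")
    c_any = c_correct = 0
    for s in sentences:
        for filler in re.findall(regex, s):
            c_any += 1
            if filler == answer:
                c_correct += 1
    return c_correct / c_any if c_any else 0.0

sents = [
    "Mozart was born in 1756 in Salzburg.",
    "Mozart was born in Salzburg.",
]
print(pattern_precision("<NAME> was born in <ANSWER>", sents, "Mozart", "1756"))  # 0.5
```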
  • 14. (R & H) algorithm to calculate precision (2): Example: precision of the extracted patterns for BIRTHDATE.
        1.00  <NAME> (<ANSWER> -)
        0.85  <NAME> was born on <ANSWER>
        0.60  <NAME> was born in <ANSWER>
        0.59  <NAME> was born <ANSWER>
        0.53  <ANSWER> <NAME> was born
  • 15. Unsupervised clustering (Mann & Yarowsky, 2003): The clustering method used is bottom-up centroid agglomerative clustering. Each document is represented by a vector of automatically extracted features. The two most similar vectors are merged to produce a new cluster, which is represented by a vector equal to the centroid of the clustered vectors.
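A minimal sketch of bottom-up centroid agglomerative clustering as described above (cosine similarity, centroid as weighted mean; brute-force search, not the authors' implementation):

```python
import math

def centroid_agglomerative(vectors, n_clusters):
    """Repeatedly merge the two most cosine-similar clusters and
    represent the merged cluster by the centroid of its members."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: cos(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (ids_a, ca), (ids_b, cb) = clusters[i], clusters[j]
        na, nb = len(ids_a), len(ids_b)
        merged = (ids_a + ids_b,
                  [(x * na + y * nb) / (na + nb) for x, y in zip(ca, cb)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(ids) for ids, _ in clusters]

docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1], [0, 0.1, 0.9]]
print(centroid_agglomerative(docs, 2))  # [[0, 1], [2, 3]]
```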
  • 16. Cluster refactoring: Unsupervised agglomerative clustering can lead to problems. The most similar pages are clustered at the beginning of the process, and the less similar pages are added as stragglers at the top levels of the cluster tree, so the top-level clusters are less discriminative than the clusters at the bottom of the tree. The refactoring: clustering is stopped when a percentage of the documents have been classified and clusters have reached a given size; the rest of the documents are then assigned to the clusters with the closest distance measure.
  • 17. Methods for vector generation: Baseline. Techniques of selective term weighting: term frequency / inverse document frequency (tf-idf) and mutual information (mi). Biographical features (feat). Extended biographical features (extfeat). Cluster refactoring.
  • 18. Baseline: The term vectors are composed of proper nouns only. The similarity between vectors is computed using standard cosine similarity: cos(a, b) = a·b / (||a|| × ||b||).
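The baseline similarity is just the cosine of the angle between two term vectors:

```python
import math

def cosine(a, b):
    """Standard cosine similarity between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Two documents sharing one proper noun out of two.
print(round(cosine([1, 1, 0], [1, 0, 1]), 3))  # 0.5
```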
  • 19. TF-IDF: Techniques of selective term weighting. The TF-IDF weight (term frequency - inverse document frequency) is a measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection: tf_ij = n_ij / Σ_k n_kj; idf_i = log(|D| / |{d : t_i ∈ d}|); tfidf_ij = tf_ij × idf_i.
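A small illustration of the weighting (plain tf-idf as defined above, over tokenised documents):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf_ij = n_ij / sum_k n_kj; idf_i = log(|D| / |{d : t_i in d}|)."""
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

docs = [["mozart", "salzburg", "mozart"], ["mozart", "vienna"]]
vecs = tfidf_vectors(docs)
print(vecs[0]["mozart"])                # 0.0 (appears in every document)
print(round(vecs[0]["salzburg"], 3))    # 0.231
```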
  • 20. Mutual Information: a measure used to evaluate the mutual dependence between random variables. Given a document collection c, for each word w we compute I(w; c) = p(w|c) / p(w). We select words that appear more than 20 times in the collection and have I(w; c) > 10; these words are added to the document's feature vector with a weight equal to log(I(w; c)).
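A sketch of the mi selection step, assuming a background corpus is available for estimating p(w) (the slide does not specify how p(w) is estimated):

```python
import math
from collections import Counter

def mi_features(collection_tokens, background_tokens, min_count=20, min_ratio=10):
    """Select salient words for a name's document collection:
    I(w; c) = p(w | c) / p(w); keep words appearing more than min_count
    times in the collection with I > min_ratio, weighted by log I."""
    coll = Counter(collection_tokens)
    bg = Counter(background_tokens)
    n_coll, n_bg = sum(coll.values()), sum(bg.values())
    feats = {}
    for w, c in coll.items():
        if c <= min_count:
            continue
        p_w_c = c / n_coll
        p_w = bg[w] / n_bg if bg[w] else 0
        if p_w and p_w_c / p_w > min_ratio:
            feats[w] = math.log(p_w_c / p_w)
    return feats

collection = ["trumpeter"] * 25 + ["the"] * 975   # pages about one Jim Clark
background = ["trumpeter"] * 2 + ["the"] * 9998   # whole corpus
feats = mi_features(collection, background)
print(round(feats["trumpeter"], 2))  # 4.83
```

The frequent but uninformative word "the" is filtered out by the ratio threshold, while "trumpeter" survives with a high weight.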
  • 21. Extracted biographical features (feat): Use of biographical features extracted with the algorithm of (Ravichandran & Hovy, 2002). Biographical information is used to link documents: documents which contain similar extracted features have the same referent. The extracted biographical features also help to improve disambiguation: documents with different extracted features belong to different clusters.
  • 22. Extracted biographical features (feat): features extracted for the Davis/Harrelson pseudoname.
        birth place: Midland (4), Texas (3), Alton (1), Illinois (1)
        birth year: 1926 (9), 1967 (3), 1973 (2), 1947 (1), 1958 (1), 1969 (1)
        occupation: actor (11), trumpeter (9), heavyweight (2), ...
        spouse: Demi Moore (1)
  • 23. Extended biographical features (extfeat): In this method the system gives a higher weight to words that appear filling patterns. Example: the system recognises 1756 as a birth year using surface patterns; when it is then found in a context outside of an extraction pattern, it is given a higher weight and added to the document vector as a potential biographical feature. In the experiment this was applied to words which appear more than a threshold of 4 times; the value of the weight is the log of the number of times the word was found as an extracted feature.
  • 24. The 10 words with the highest mutual information with the document collection, and all extfeat words, for the Davis/Harrelson pseudoname (word: w(mi), w(extfeat)):
        adderley: 3.50, 0     snipes: 5.16, 0      coltrane: 5.06, 0
        bitches: 4.99, 0      danson: 4.97, 0      hemp: 4.97, 0
        mullally: 4.95, 0     porgy: 4.94, 0       remastered: 4.92, 0
        actor: 3.50, 2.40     1926: 0, 2.20        trumpeter: 0, 2.20
        midland: 0, 1.39
  • 25. Experiments: the data set. The data set consisted of web pages collected using Google for a set of target personal names: not more than 1000 pages for each target name, no requirement that the web page be focused on the name, and no minimum number of occurrences of the name in the page.
  • 26. Evaluation on pseudonames: Pseudonames are created as follows: take retrieval results for two different people and replace all references to each name by a unique shared pseudoname. The resulting collection consists of documents which are ambiguous as to whom they are talking about; the aim of the clustering is to distinguish the introduced pseudonames.
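The pseudoname construction can be sketched as follows (the helper name and the pseudoname string are illustrative):

```python
def make_pseudoname_corpus(docs_a, name_a, docs_b, name_b, pseudo="Davis_Harrelson"):
    """Build a pseudoname evaluation set: replace both target names with a
    shared artificial name, keeping gold labels for scoring the clustering."""
    corpus, gold = [], []
    for docs, name, label in ((docs_a, name_a, 0), (docs_b, name_b, 1)):
        for d in docs:
            corpus.append(d.replace(name, pseudo))
            gold.append(label)
    return corpus, gold

corpus, gold = make_pseudoname_corpus(
    ["Miles Davis played trumpet."], "Miles Davis",
    ["Woody Harrelson starred in Cheers."], "Woody Harrelson")
print(corpus[0])  # Davis_Harrelson played trumpet.
```

The clustering output can then be scored against the gold labels without any manual annotation.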
  • 27. Evaluation on pseudonames: Select a set of 8 different people: historical figures, figures from media and pop culture, and non-famous people with similar backgrounds (birthdate, profession, etc.). Submit Google queries and retrieve up to 1000 pages about each person; select a maximum of 100 pages for each person. Two granularities of feature extraction are evaluated: high-precision rules to extract occupation, birthday, spouse, birth location and school, and high-recall rules to extract the same attributes plus parent/child relationships.
  • 28. Evaluation on pseudonames: disambiguation accuracy of different clustering methods.
        nnp: 79.7    nnp + tfidf: 79.7    nnp + mi: 82.9
  • 29. Evaluation on pseudonames: disambiguation accuracy of different clustering methods with different sizes of the extracted feature set (small / large).
        nnp + feat: 82.5 / 85.1
        nnp + feat + extfeat: 82.0 / 84.6
        nnp + feat + mi: 85.6 / 85.3
        nnp + feat + tfidf: 82.9 / 86.4
  • 30. Evaluation on naturally ambiguous names: Start with a selection of 4 polysemous names with an average of 60 different instances each, manually annotated with name-ID numbers. The occurrences of each name are classified into 3 clusters: the 2 automatically derived first-pass majority seed sets, and a residual set for "other uses". Results (precision / recall): TF-IDF .81 / .70; Mutual Information .88 / .73.
  • 31. Conclusions: The results of the clustering are improved by learning and using automatically extracted biographic information, and by the use of weighting techniques. The produced clusters can be used as seeds for disambiguating further entities.
  • 32. Disambiguating geographic names in a digital library
  • 33. Outline: Task of the Perseus project. Problems of the task domain. External knowledge sources. Identification and classification of proper names. First disambiguation of geographical names. Simple characterisation of the document context. Final disambiguation.
  • 34. Task of the Perseus project (Smith & Crane, 2002): A library with historical data in the humanities, from ancient Greece to 19th-century America, with over a million toponym references. The task consists of: identification of geographic names; linking the names to information about location, type, dates of occupation, relation to other places, inhabitants, etc.; and linking the names to a position on a map.
  • 35. Problems of the domain: The introduction of an entity by an unambiguous mention is less common than in newspaper articles. There are great differences between the documents: different document sizes, lack of standard structures, different registers and dialects, and historical variations (borders, names associated with different political systems, etc.). Long-distance anaphora. The resolution process is more similar to the resolution of cross-document coreference on the web than in corpora.
  • 36. Knowledge sources: The system uses external knowledge sources. The most important are: the Getty Thesaurus of Geographic Names; Cruchley's gazetteer of London, which was built for geocoding; and lists of authors of the entries in the Dictionary of National Biography, which help to add additional information to the documents.
  • 37. Identification and classification of proper names: The task of identifying proper names and their first classification is done using simple heuristics: capitalisation and punctuation conventions, markup added by the editor of the document, and language-specific honorifics (Mr., Dr., etc.). Generic topographic labels are taken as "moderate" evidence that the name may be geographic: Rocky Mountains, Charles River. Stand-alone names are preferably classified as personal names: John (personal name vs. village in Louisiana or Virginia).
  • 38. Disambiguation (1): Based on local context. Explicit disambiguating tags put after the names, e.g. "Lancaster, PA", "Vienna, Austria", a post code, etc. If an ambiguous place name is mentioned together with other place names, the most likely interpretation is one that is geographically near the others: e.g. if "Philadelphia" and "Harrisburg" appear in the same paragraph, the preferred interpretation of "Lancaster" will be the town in Pennsylvania, not the town in England or Arizona.
  • 39. Disambiguation (2): Based on document context: preponderance of geographic references in the entire document; for short documents, like newspaper articles, document context and local context are considered the same. Based on world knowledge: captured from gazetteers and other reference works; facts about a place like political coordinates, size, etc.
  • 40. Simple characterisation of the document context: Aggregate all the possible locations for all the toponyms in the document onto a one-by-one degree grid. Assign weights for the number of mentions of each toponym. Prune the grid based on general world knowledge. Compute the centroid of this weighted map, and the standard deviation of the distance of the points from this centroid. Discard points more than two times the standard deviation away from the centroid, and calculate a new centroid.
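The centroid-and-pruning steps can be sketched as follows (a simplification: plain Euclidean distance in degrees rather than geodesic distance, and the world-knowledge pruning step is omitted):

```python
import math

def prune_and_centroid(points):
    """Weighted centroid of candidate toponym locations, discarding points
    more than 2 standard deviations from it, then recomputing.
    points are (lat, lon, weight) triples."""
    def centroid(pts):
        w = sum(p[2] for p in pts)
        return (sum(p[0] * p[2] for p in pts) / w,
                sum(p[1] * p[2] for p in pts) / w)

    c = centroid(points)
    dists = [math.hypot(p[0] - c[0], p[1] - c[1]) for p in points]
    mean = sum(dists) / len(dists)
    sd = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    kept = [p for p, d in zip(points, dists) if d <= 2 * sd]
    return centroid(kept) if kept else c

# Three readings around Pennsylvania and one stray reading in England.
pts = [(40.0, -76.3, 2), (40.3, -76.9, 1), (39.95, -75.2, 1), (53.0, -2.8, 1)]
lat, lon = prune_and_centroid(pts)
print(round(lat, 1), round(lon, 1))  # the English outlier is discarded
```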
  • 41. Final disambiguation: The local context of a toponym is represented by a moving window of the four previous and four following toponyms in the text; only unambiguous or already disambiguated toponyms are considered. Each possible interpretation of the ambiguous toponym is scored using: geographical proximity to the toponyms around it, proximity to the centroid for the document, and relative importance. The interpretation that achieves the highest score is selected.
  • 42. Evaluation (1): The system has been evaluated on 5 hand-annotated corpora (PCat / Prec / Rec / F1).
        Greek: 0.98 / 0.93 / 0.99 / 0.96
        Roman: 0.99 / 0.91 / 1.00 / 0.95
        London: 0.92 / 0.86 / 0.96 / 0.91
        California: 0.92 / 0.83 / 0.96 / 0.89
        Upper Midwest: 0.89 / 0.74 / 0.89 / 0.81
  • 43. Evaluation (2): Categorisation performed on the Greek and Roman history texts is better than on texts about more recent topics. In places with a high density of population we find more toponyms that are ambiguous with other names. Mistakes occur where ethnonyms are used as geo-political entities (like "The Germans" in Cæsar's Gallic War). Proper names are usually not inflected in English; rules could be added by hand to correct this, but the precision of the system could decrease.
  • 44. Conclusions: Simple heuristic categorisation seems to work properly for the categorisation of entities that appear in certain kinds of texts. The evaluation procedure is not very clear. There are cases that are not covered properly by the gazetteers, but the use of huge fine-grained gazetteers leads to higher recall but lower precision. An alternative is the use of linguistic processing and machine learning techniques for restricted cases and collections of documents.
  • 45. NewsExplorer: multilingual coreference resolution
  • 46. NewsExplorer: NewsExplorer (Steinberger & Pouliquen, 2008) is an application that gathers and aggregates extracted information for 19 languages. Each entity is displayed on a dedicated web site, where the user gets: a list of the latest news clusters in which the entity has been mentioned; a list of other entities found in the same clusters; titles and other phrases describing the entity; quotations by the entity or about it; a photograph if available; and the Wikipedia site about the entity if available.
  • 47. Text analysis components of the system (1): Monolingual document clustering. Named entity recognition: person, organisation, geographical location. Named entity disambiguation. Quotation recognition and reference resolution for name parts. Identification and mapping of name variants for the same person. Topic detection and tracking.
  • 48. Text analysis components of the system (2): Categorisation of documents according to a multilingual thesaurus. Cluster similarity calculation: monolingual and across languages.
  • 49. Language-independent rules for geo-tagging: Use of document context. If a name can be a personal name or the name of a place, and it has been mentioned as a person earlier, then the preferred reading is that it is a person. If a country has been mentioned in the text and a polysemous item then appears, resolve the ambiguity in favour of a place in the mentioned country. Prefer locations that are physically closer to other, non-ambiguous locations that have been mentioned in the context.
  • 50. Language-independent rules for geo-tagging: In case of polysemy, the most important places are preferred. Places that cannot be disambiguated are ignored. The rules are combined with different weights.
  • 51. Inflection and regular variations (1): Hyphen/space alternations (Jean-Marie / Jean Marie). Diacritic variations (Schröder / Schroder). Name inversion: change of position between first and last name. Typos: relatively frequent in names like Condoleezza Rice, often written as Condoleza, Condolezza, etc. Simplification: Condoleezza Rice and George W. Bush are frequently simplified as Ms. Rice and President Bush.
  • 52. Inflection and regular variations (2): Morphological declensions: use of prefixes and suffixes in several languages. Transliteration from other alphabets: there is no one-to-one mapping between letters, and there are different conventions. Vowel variations, especially in transliterations from and into Arabic.
  • 53. Identification of name variants: Some of these variants can be predicted and generated using sets of regular expressions, e.g. declension of personal names in Slovene: s/[aeo]?$/(e|a|o|u|om|em|m|ju|jem|ja)?/. For every frequent name in the database a pattern is generated, such as Pierr(e|a|o|u|om|em|m|ju|jem|ja)? or Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?. For cases that cannot be resolved by the regular expressions: normalise the names, translating them to a language-independent representation, and compute the edit distance between name variant and normalised names.
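The pattern generation can be sketched as follows (the end-of-string anchor in the substitution is an assumption about the intended rule, inferred from the Pierr(...) example):

```python
import re

# Slovene declension suffix alternation from the slide.
SUFFIXES = "(e|a|o|u|om|em|m|ju|jem|ja)?"

def variant_pattern(name):
    """Generate a regex matching declined variants of a base name,
    i.e. apply s/[aeo]?$/(e|a|o|u|om|em|m|ju|jem|ja)?/ to the name."""
    # count=1 avoids a second replacement at the zero-width end-of-string match.
    return re.sub(r"[aeo]?$", SUFFIXES, name, count=1)

print(variant_pattern("Pierre"))   # Pierr(e|a|o|u|om|em|m|ju|jem|ja)?
print(variant_pattern("Gemayel"))  # Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?
for v in ("Pierre", "Pierrom", "Pierrju"):
    print(v, bool(re.fullmatch(variant_pattern("Pierre"), v)))
```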
  • 54. Document categorisation with a multilingual thesaurus (1): Eurovoc Thesaurus: a hierarchically organised controlled vocabulary developed by European institutions and the national parliaments of different countries. It is used in public administrations for cataloguing, search and retrieval of large multilingual collections. The thesaurus consists of 6000 descriptors organised in 21 fields and, at the second level, into 127 micro-thesauri.
  • 55. Document categorisation with a multilingual thesaurus (2): NewsExplorer produces a ranked set of words statistically related to each descriptor. These sets of words were produced on the basis of a large amount of hand-annotated documents, by comparing the word frequencies of the subset of texts indexed with each descriptor with the word frequencies of the whole training corpus. The model is completed with a list of stop words, to avoid irrelevant words having an impact on the categorisation task.
  • 56. Thanks
  • 57. References (1): Bagga, A. and Baldwin, B. (1998). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics. Haghighi, A. and Klein, D. (2007). Unsupervised coreference resolution in a nonparametric Bayesian model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Mann, G.S. and Yarowsky, D. (2003). Unsupervised personal name disambiguation. In Proceedings of CoNLL.
  • 58. References (2): Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Smith, D.A. and Crane, G. (2002). Disambiguating geographic names in a historical digital library. In Proceedings of ECDL. Steinberger, R. and Pouliquen, B. (2008). NewsExplorer: combining various text analysis tools to allow multilingual news linking and exploration. Lecture notes for the lecture held at the SORIA Summer School "Cursos de Tecnologías Lingüísticas".