Automated Assignment of
Topics to OCRed Historical
Texts
Florian Fink, Christoph Ringlstetter, Klaus U. Schulz
CIS - Center for Information and Language Processing
University of Munich
Motivation
Standard (modern) repositories in libraries
• documents come with metadata describing subjects and topics covered
in the texts (deep subject classification, e.g. UDC)
• subjects often primary key for bringing order to large repositories
• supporting users interested in particular fields
OCRed historical texts from digitization tsunami
• mostly poor metadata, no subject classification
• missing order on whole collection, only keyword search
• missing survey: what IS the collection about, what can I hope to find?
Can we automatically find subjects/topics covered?
Automated topic assignment
Task
Automatically compute all topics/fields that adequately describe contents
of given document, add hierarchical order to topics.
Challenges
• huge number of topics and fields, encyclopedic coverage
• hierarchical order, from general fields to very specific topics
science -> mathematics -> algebra -> group theory -> permutation groups
Comparison: document classification
• small number of given disjoint fields (e.g., politics, science, sports,..)
• Task: find best label(s) for document
Not only „replacement“ for manual topic assignment but
New Visions!
• assigning topics to document parts on all levels of granularity
(chapters, pages, paragraphs, ….)
• horizontal access – automated linking of documents and
document parts using topics found
• detecting „topic reuse“, parallelisms and differences across
repositories and subrepositories
• time lines & trend analysis
• ……..
Method used
TopicZoom
• university spin-off founded by our group in 2008
• topic assignment to texts (head hunting, trend analysis, ...)
Background technology
• huge semantic net: 120,000 nodes (topics, persons, organizations,
events, geographic locations, time periods)
• ordered as a directed acyclic graph
• topic names come with linguistic variants; many multi-word
expressions
German (main focus) and English
Free web service
• users send texts (manually or via an XML interface)
• receive topics found in texts
• ranked using two relevance scores
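The hierarchical order in a directed acyclic graph can be illustrated with a minimal sketch. The edges below are assumptions built from the example chain "science -> mathematics -> algebra -> group theory -> permutation groups"; they are not the actual TopicZoom net, and `ancestors` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical child -> parents edges; in a DAG a node may have several parents.
PARENTS = {
    "permutation groups": ["group theory"],
    "group theory": ["algebra"],
    "algebra": ["mathematics"],
    "mathematics": ["science"],
    "science": [],
}

def ancestors(topic, parents=PARENTS):
    """Return all more general topics reachable from `topic`, nearest first."""
    seen, order = set(), []
    queue = deque(parents.get(topic, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a node reachable via two parents is reported only once
        seen.add(node)
        order.append(node)
        queue.extend(parents.get(node, []))
    return order

print(ancestors("group theory"))
```

A breadth-first walk is used here so that closer, more specific ancestors are listed before very general fields.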
Example
Input text: “The 2014 South African general election will be held on 7 May 2014 to elect a new National Assembly and new provincial legislatures in each province.” (Wikipedia)
Weight   Degree of generality   Significance   Topic
1        8                      7.31492196     South Africa
1        4                      5.26957792     Elections
1        7                      4.60475280     African countries
1        6                      4.45792943     Africa
1        3                      3.91069886     Political events
1        2                      1.84870472     Politics
Questions asked
Can this technology be used to bring order to
collections of OCRed historical texts?
• How is topic assignment affected by OCR errors?
• How is topic assignment affected by historical
orthography?
• TopicZoom hierarchy („modern topics“) suitable for
topics found in historical texts?
Historical corpus - Zedler lexicon
Johann Heinrich Zedler „Grosses vollständiges
Universallexicon aller Wissenschafften und Künste“
(Great Complete Encyclopedia of All Sciences and Arts)
• largest and most famous 18th century German encyclopedia
• 64 volumes plus four supplements
• ca. 284,000 articles
• 63,000 two column pages
• article sizes extremely unbalanced
• accessible on the web
Images (tif) received from Bavarian State Library
Experiment
• started with scans from 14 pages of Zedler
• prepared three versions:
1. OCRed page (FineReader)
2. ground truth
3. ground truth with modernized orthography
• manually assigned topics to the 14 pages
• automated topic assignment for the three versions of each page
• looked at recall and precision obtained for three page versions
• analysis of results and problems
OCR quality
• percentage of correctly recognized words (tokens)
average 75.03%, for words of length > 3: 71.12%
• for OCR versus ground truth with modernized orthography
average 68.37%, for words of length >3: 62.31%
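A word-level accuracy of this kind could be computed roughly as follows. This is a sketch under assumptions: the slides do not say how OCR tokens were aligned with the ground truth, so alignment with `difflib.SequenceMatcher` and the choice of ground-truth tokens as the denominator are both assumed here, and the sample strings are invented.

```python
from difflib import SequenceMatcher

def word_accuracy(ocr_text, truth_text, min_len=0):
    """Share of ground-truth tokens longer than `min_len` that the OCR reproduced exactly."""
    ocr, truth = ocr_text.split(), truth_text.split()
    matcher = SequenceMatcher(a=truth, b=ocr, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        # each block marks a run of identical tokens in both sequences
        matched.update(range(block.a, block.a + block.size))
    relevant = [i for i, tok in enumerate(truth) if len(tok) > min_len]
    if not relevant:
        return 0.0
    return sum(i in matched for i in relevant) / len(relevant)

truth = "die Zeugen vor dem Gericht"   # invented sample line
ocr = "die Zengen vor dem Gericht"     # invented OCR output with one error
print(word_accuracy(ocr, truth))             # all tokens
print(word_accuracy(ocr, truth, min_len=3))  # only tokens of length > 3
```

Restricting the count to longer tokens, as in the "length > 3" figures, removes short function words that OCR usually gets right and that carry little topical information.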
Zedler manually assigned topics
Average: 25 topics assigned per page
• Main topic (lemma) „Zeugen“ (witnesses)
law and justice, contracts, last will, marriage, rights, courts, judges,
handicapped persons, laws, children, teenagers, corruption,
civil law, childhood, adolescence.
• Several lemmata…
peoples, plague, language, gypsies, eviction, paper production,
hunting helpers, hunting, mines, mining, grammar, rhetoric,
Zeugma (city), bridges, Roman Empire, Romans, Euphrates,
Alexander the Great, nations, France, Spain, Netherlands.
• Main topic (lemma) historiography („giving witness“).
history, historiography, historians, Heinrich Cornelius Agrippa,
Jews, diluvian, genesis, Adam and Eve, biblical figures, Persia,
Romulus and Remus, Jesus Christ, Arabs, Koran, Bible, Fables,
Mecca, Mosques, The Franks, Christianity, Paganism, Plutarch.
• ………….
Recall – average values
Recall: percentage of manually assigned topics found among computed topics
[Figure: average recall for the OCRed, ground truth, and modernized ground truth versions at TopicZoom significance thresholds 1.0, 0.6, 0.3 and 0.0; a 50% mark is shown, the bar values are not preserved]
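The recall definition with a significance threshold can be sketched as below. The function name, the inclusive comparison `>=`, and the sample topics and scores are assumptions for illustration; they are not measured values from the experiment.

```python
def recall_at_threshold(manual_topics, computed, threshold):
    """Fraction of manually assigned topics found among the computed topics
    whose significance is at least `threshold`."""
    kept = {topic for topic, significance in computed if significance >= threshold}
    manual = set(manual_topics)
    if not manual:
        return 0.0
    return len(manual & kept) / len(manual)

# Invented sample data: (topic, significance) pairs and a manual topic list.
computed = [("law", 0.9), ("courts", 0.5), ("rivers", 0.2)]
manual = ["law", "courts", "marriage", "children"]

for t in (1.0, 0.6, 0.3, 0.0):
    print(t, recall_at_threshold(manual, computed, t))
```

Lowering the threshold keeps more computed topics, so recall can only stay equal or rise, usually at some cost in precision.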
Notion of recall not fully adequate
Often, for a missed topic, a closely related topic is found in the answer set.
E.g., on page 1 the topics “children” and “teenagers” are missed, but “childhood” and
“adolescence” are found.
Intuitively, “felt recall” is larger than computed recall.
[Figure: manually assigned spatial areas compared with computed spatial areas; recall: 20%]
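One way to approximate such a "felt recall" is to count a missed topic as found when a related topic appears in the answer set. The sketch below is an assumption, not a measure used in the talk: `relaxed_recall` and the `related` mapping are hypothetical, and in practice the related topics might be taken from neighbours in the semantic net.

```python
def relaxed_recall(manual_topics, computed_topics, related):
    """Count a manual topic as found if it, or a topic listed as related to it, was computed."""
    computed = set(computed_topics)
    manual = list(manual_topics)
    if not manual:
        return 0.0
    hits = sum(
        1 for topic in manual
        if topic in computed or computed & set(related.get(topic, ()))
    )
    return hits / len(manual)

manual = ["children", "teenagers", "courts"]
computed = ["childhood", "adolescence", "law"]
# Hypothetical relatedness pairs, following the page 1 example.
related = {"children": ["childhood"], "teenagers": ["adolescence"]}

print(relaxed_recall(manual, computed, {}))       # strict recall
print(relaxed_recall(manual, computed, related))  # relaxed recall
```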
Real problems for recall
• Very rare topics
Zedler treats rare topics, such as “civet” and “camphor”, that are not represented
in the TopicZoom semantic net.
• Changing world
Topics from parts of world that have dramatically changed
Old professions, habits, and techniques etc.,
E.g. “paper production”, “hunting helpers”, “perfume
manufacture”, “potency means”, “brick oil”
Many old professions (e.g., “Drechsler”, turner) are now very common family
names.
Average precision values
[Figure: shares of correct, questionable, and wrong topics for the OCR, ground truth, and modernized ground truth versions; threshold: significance 0.6; values not preserved]
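With a three-way judgement of computed topics, precision can be reported in two ways, depending on how questionable topics are counted. The sketch below assumes this reading; the labels "strict" and "lenient" and the sample judgements are invented for illustration and are not results from the talk.

```python
from collections import Counter

def precision_summary(judgements):
    """judgements: 'correct', 'questionable' or 'wrong', one label per computed topic."""
    counts = Counter(judgements)
    total = sum(counts.values())
    if total == 0:
        return {"strict": 0.0, "lenient": 0.0}
    return {
        # strict: only topics judged correct count as hits
        "strict": counts["correct"] / total,
        # lenient: questionable topics are also accepted
        "lenient": (counts["correct"] + counts["questionable"]) / total,
    }

sample = ["correct"] * 6 + ["questionable"] * 2 + ["wrong"] * 2  # invented labels
print(precision_summary(sample))
```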
Problems for precision
• Wrong time periods
OCR had problems recognizing years -> wrong time periods assigned
• Wrong resolution of ambiguous words
Words of the texts confused with the names of small villages -> several wrong topics
• Language changes beyond the level of orthography
• e.g., “Flüsse” (rivers) used twice for liquids of the nose and the eyes
-> several wrong topics (rivers and more general geographic objects)
• e.g., “Verstopfung” (main modern meaning: constipation) referring to problems
of the brain, nose, and ears (interpretation hardly found in modern texts)
-> several wrong topics, all related to diseases of the digestive tract
• e.g. “Blattern” used for a problem of the eyes. Modern language and
TopicZoom net: “Blattern” synonym for “Pocken” (smallpox)
-> several wrong topics
Summary
Unavoidable subjectivity of evaluation
• manually assigned topics
• classifying computed topics into correct, questionable, wrong
Do not rely primarily on the numbers! Form your own impression!
Automated topic assignment
• valuable and useful if some errors are considered acceptable
• insufficient if errors cannot be tolerated
• combination with social tagging (e.g., error elimination)?
Significant improvements, in particular for precision, would be possible with
minor modifications of the underlying semantic net
Future work
• extend empirical basis
• realize easy improvements
• combine with social tagging
• look at new visions
• assigning topics to document parts
• interlink documents based on topical similarity
• detection of topic parallelism
• time line analysis and topic trends
Thanks for your attention!
… special thanks to Bavarian State Library …

Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts
