This document summarizes an experiment on automatically assigning topics to text from a historical encyclopedia that had been digitized with optical character recognition (OCR).
The researchers tested automated topic assignment on 14 OCRed pages from an 18th-century German encyclopedia. They analyzed the recall and precision of topic assignment on the OCRed text, the ground-truth text, and the ground-truth text with modernized spelling. Topic assignment was challenging due to OCR errors and historical topics not represented in their topic hierarchy.
While automated topic assignment showed some value in organizing the historical texts, errors limited its usefulness if precision needed to be high. The researchers identified ways to improve precision, such as updating the topic hierarchy, and proposed combining it with social tagging to
Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts
1. Automated Assignment of Topics to OCRed Historical Texts
Florian Fink, Christoph Ringlstetter, Klaus U. Schulz
CIS - Center for Information and Language Processing
University of Munich
2. Motivation
Standard (modern) repositories in libraries
• documents come with metadata describing subjects and topics covered
in the texts (deep subject classification, e.g. UDC)
• subjects often primary key for bringing order to large repositories
• supporting users interested in particular fields
OCRed historical texts from digitization tsunami
• mostly poor metadata, no subject classification
• missing order on whole collection, only keyword search
• missing survey: what IS the collection about, what can I hope to find?
Can we automatically find subjects/topics covered?
3. Automated topic assignment
Task
Automatically compute all topics/fields that adequately describe contents
of given document, add hierarchical order to topics.
Challenges
• huge number of topics and fields, encyclopedic coverage
• hierarchical order, from general fields to very specific topics
science -> mathematics -> algebra -> group theory -> permutation groups
Comparison: document classification
• small number of given disjoint fields (e.g., politics, science, sports,..)
• Task: find best label(s) for document
4. New Visions!
Not only a „replacement“ for manual topic assignment, but also:
• assigning topics to document parts on all levels of granularity
(chapters, pages, paragraphs, ….)
• horizontal access – automated linking of documents and
document parts using topics found
• detecting „topic reuse“, parallelisms and differences across
repositories and subrepositories
• time lines & trend analysis
• ……..
5. Method used
TopicZoom
• university spin-off founded by our group in 2008
• topic assignment to texts (head hunting, trend analysis, ...)
Background technology
• huge semantic net: 120,000 nodes (topics, persons, organizations,
events, geographic locations, time periods)
• ordered as a directed acyclic graph
• topic names come with linguistic variants; many multi-word
expressions
German (main focus) and English
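A minimal sketch of how a topic hierarchy stored as a directed acyclic graph can yield the more general topics above a specific one. The `PARENTS` mapping and the `ancestors` function are hypothetical illustrations, not TopicZoom's actual data model; the edges are taken from the example chain on the task slide. Because each topic maps to a list of parents, the same traversal also works when a topic has several parents, which is what distinguishes a DAG from a tree.

```python
from collections import deque

# Hypothetical toy fragment of a topic DAG: child topic -> list of parent topics.
PARENTS = {
    "permutation groups": ["group theory"],
    "group theory": ["algebra"],
    "algebra": ["mathematics"],
    "mathematics": ["science"],
    "science": [],
}


def ancestors(topic, parents=PARENTS):
    """Return all more general topics reachable from `topic`, nearest first."""
    seen = []
    queue = deque(parents.get(topic, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a DAG node can be reached along several paths
        seen.append(node)
        queue.extend(parents.get(node, []))
    return seen


print(ancestors("permutation groups"))
```

Breadth-first order is used so that the nearest, most specific ancestors come first.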
Free web service
• users send (manually or XML interface) texts
• receive topics found in texts
• ranked using two relevance scores
6. Example
Input text (Wikipedia):
“The 2014 South African general election will be held on 7 May 2014 to elect a new National Assembly and new provincial legislatures in each province.”
Topics found:
Weight  Degree of Generality  Significance  Topic
1       8                     7.31492196    South Africa
1       4                     5.26957792    Elections
1       7                     4.60475280    African countries
1       6                     4.45792943    Africa
1       3                     3.91069886    Political events
1       2                     1.84870472    Politics
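The example result can be handled as a small ranked list. The sketch below is an assumption about how a client might filter and sort such output; the `TopicHit` type, its field names, and `select_topics` are hypothetical and not part of the TopicZoom service. The numbers are the ones from the example slide, and the cut-off of 4.0 is an arbitrary value chosen for illustration.

```python
from typing import NamedTuple


class TopicHit(NamedTuple):
    """One returned topic; field names are hypothetical."""
    weight: int
    generality: int
    significance: float
    topic: str


# Values copied from the example slide.
HITS = [
    TopicHit(1, 8, 7.31492196, "South Africa"),
    TopicHit(1, 4, 5.26957792, "Elections"),
    TopicHit(1, 7, 4.60475280, "African countries"),
    TopicHit(1, 6, 4.45792943, "Africa"),
    TopicHit(1, 3, 3.91069886, "Political events"),
    TopicHit(1, 2, 1.84870472, "Politics"),
]


def select_topics(hits, min_significance):
    """Keep hits at or above the threshold, most significant first."""
    kept = [h for h in hits if h.significance >= min_significance]
    return sorted(kept, key=lambda h: h.significance, reverse=True)


for hit in select_topics(HITS, 4.0):
    print(f"{hit.significance:.2f}  {hit.topic}")
```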
7. Questions asked
Can this technology be used to bring order to
collections of OCRed historical texts?
• How is topic assignment affected by OCR errors?
• How is topic assignment affected by historical
orthography?
• TopicZoom hierarchy („modern topics“) suitable for
topics found in historical texts?
8. Historical corpus - Zedler lexicon
Johann Heinrich Zedler „Grosses vollständiges
Universallexicon aller Wissenschafften und Künste“
(Great Complete Encyclopedia of All Sciences and Arts)
• largest and most famous 18th century German encyclopedia
• 64 volumes plus four supplements
• ca. 284,000 articles
• 63,000 two column pages
• article sizes extremely unbalanced
• accessible on the web
Images (tif) received from Bavarian State Library
9. Experiment
• started with scans from 14 pages of Zedler
• prepared three versions:
1. OCRed page (Finereader)
2. ground truth
3. ground truth with modernized orthography
• manually assigned topics to the 14 pages
• automated topic assignment for the three versions of each page
• looked at recall and precision obtained for three page versions
• analysis of results and problems
OCR quality
• percentage of correctly recognized words (tokens)
average 75.03%, for words of length > 3: 71.12%
• for OCR versus ground truth with modernized orthography
average 68.37%, for words of length > 3: 62.31%
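A minimal sketch of how a token-level OCR quality figure like the ones above could be computed. It assumes the OCR output and the reference text have already been aligned token by token, which the slides do not describe; the function name `token_accuracy` and the sample tokens are made up for illustration.

```python
def token_accuracy(ocr_tokens, truth_tokens, min_length=0):
    """Percentage of reference tokens longer than `min_length` that the OCR got exactly right.

    Assumes both lists are aligned one-to-one (same length, same positions).
    """
    if len(ocr_tokens) != len(truth_tokens):
        raise ValueError("token lists must be aligned one-to-one")
    pairs = [(o, t) for o, t in zip(ocr_tokens, truth_tokens) if len(t) > min_length]
    if not pairs:
        return 0.0
    correct = sum(1 for o, t in pairs if o == t)
    return 100.0 * correct / len(pairs)


# Made-up sample: one OCR error in four tokens.
truth = ["Zeugen", "vor", "dem", "Gericht"]
ocr = ["Zeugcn", "vor", "dem", "Gericht"]

print(token_accuracy(ocr, truth))                # all tokens
print(token_accuracy(ocr, truth, min_length=3))  # only words of length > 3
```

Restricting the count to longer words usually lowers the figure, as on the slide, because short function words are easier to recognize.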
10. Zedler manually assigned topics
Average: 25 topics assigned per page
• Main topic (lemma) „Zeugen“ (witnesses)
law and justice, contracts, last will, marriage, rights, courts, judges,
handicapped persons, laws, children, teenagers, corruption,
civil law, childhood, adolescence.
• Several lemmata…
peoples, plague, language, gypsies, eviction, paper production,
hunting helpers, hunting, mines, mining, grammar, rhetoric,
Zeugma (city), bridges, Roman Empire, Romans, Euphrates,
Alexander the Great, nations, France, Spain, Netherlands.
• Main topic (lemma) historiography („giving witness“).
history, historiography, historians, Heinrich Cornelius Agrippa,
Jews, diluvian, Genesis, Adam and Eve, biblical figures, Persia,
Romulus and Remus, Jesus Christ, Arabs, Koran, Bible, Fables,
Mecca, Mosques, The Franks, Christianity, Paganism, Plutarch.
• ………….
11. Recall – average values
Recall: percentage of manually assigned topics found among computed topics
[Figure: average recall for the OCRed, ground truth, and modernized ground truth versions, at TopicZoom significance thresholds 1.0, 0.6, 0.3, and 0.0; axis label: 50%]
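The recall definition used here can be sketched as follows. This is an illustration under assumptions, not the authors' evaluation code: the function name, the use of `>=` for the threshold comparison, and the sample significance values are all made up; only the four threshold values come from the slide.

```python
def recall(manual_topics, computed, threshold):
    """Percentage of manually assigned topics found among computed topics.

    `computed` maps each computed topic to its significance value; only
    topics at or above `threshold` count as found (an assumed convention).
    """
    manual = set(manual_topics)
    if not manual:
        return 0.0
    found = {topic for topic, score in computed.items() if score >= threshold}
    return 100.0 * len(manual & found) / len(manual)


# Made-up sample page: four manual topics, four computed topics.
manual = ["law and justice", "courts", "judges", "children"]
computed = {"courts": 1.2, "judges": 0.7, "childhood": 0.4, "law and justice": 0.2}

for threshold in (1.0, 0.6, 0.3, 0.0):
    print(threshold, recall(manual, computed, threshold))
```

Lowering the threshold can only keep recall equal or raise it, since more computed topics are admitted.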
12. Notion of recall not fully adequate
Often, for a missed topic, a closely related one is found in the answer set.
E.g. on page 1, the topics “children” and “teenagers” are missed, but “childhood” and
“adolescence” are found.
Intuitively, the “felt recall” is larger than the computed recall.
[Figure: manually assigned spatial areas compared with computed spatial areas; recall: 20%]
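One way to make the gap between computed and “felt” recall concrete is to count a manual topic as found when a closely related topic appears in the answer set. The sketch below is an assumption for illustration only: the slides do not define such a measure, and the `RELATED` mapping and function name are hypothetical. The related pairs are the ones mentioned for page 1; the other topics are sample values.

```python
# Hypothetical mapping from a manual topic to topics accepted as closely related.
RELATED = {
    "children": {"childhood"},
    "teenagers": {"adolescence"},
}


def strict_and_relaxed_recall(manual_topics, computed_topics, related=RELATED):
    """Return (strict, relaxed) recall in percent.

    Strict: the manual topic itself was computed.
    Relaxed: the manual topic or one of its related topics was computed.
    """
    manual = set(manual_topics)
    computed = set(computed_topics)
    if not manual:
        return 0.0, 0.0
    strict = sum(1 for t in manual if t in computed)
    relaxed = sum(
        1 for t in manual if t in computed or related.get(t, set()) & computed
    )
    return 100.0 * strict / len(manual), 100.0 * relaxed / len(manual)


manual = ["children", "teenagers", "courts", "judges"]
computed = ["childhood", "adolescence", "courts"]
print(strict_and_relaxed_recall(manual, computed))
```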
13. Real problems for recall
• Very rare topics
Zedler treats rare topics such as “civet” and “camphor” that are not represented
in the TopicZoom semantic net.
• Changing world
Topics from parts of the world that have changed dramatically
Old professions, habits, techniques, etc.
E.g. “paper production”, “hunting helpers”, “perfume
manufacture”, “potency means”, “brick oil”
Many old professions (e.g. “Drechsler”, turner) are now very common family
names.
14. Average precision values
[Figure: average shares of computed topics judged correct, questionable, or wrong, shown for the OCR, ground truth, and ground truth with modernized orthography versions; threshold: significance 0.6]
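The three-way judgement behind these precision values can be tallied as in the sketch below. It is a hedged illustration: the label strings, the function name, and the sample judgements are assumptions, and the resulting numbers are not results from the experiment.

```python
from collections import Counter

# Assumed label names for the three judgement categories.
LABELS = ("correct", "questionable", "wrong")


def judgement_shares(judgements):
    """Return the percentage of computed topics in each judgement category."""
    unknown = set(judgements) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(judgements)
    total = len(judgements)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Made-up judgements for ten computed topics on one page version.
sample = ["correct"] * 6 + ["questionable"] * 2 + ["wrong"] * 2
print(judgement_shares(sample))
```

Reporting “questionable” separately keeps the subjectivity of the evaluation visible instead of forcing each topic into correct or wrong.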
15. Problems for precision
• Wrong time periods
OCR had problems recognizing years -> wrong time periods assigned
• Wrong resolution of ambiguous words
Words of the texts confused with the names of small villages -> several wrong topics
• Language changes beyond the level of orthography
• e.g., “Flüsse” (rivers) used twice for liquids of the nose and the eyes
-> several wrong topics (rivers and more general geographic objects)
• e.g. “Verstopfung” (main modern meaning: constipation) referring to problems
of the brain, nose, and ears (interpretation hardly found in modern texts)
-> several wrong topics, all related to diseases of the digestive tract
• e.g. “Blattern” used for a problem of the eyes. Modern language and
TopicZoom net: “Blattern” synonym for “Pocken” (smallpox)
-> several wrong topics
16. Summary
Unavoidable subjectivity of evaluation
• manually assigned topics
• classifying computed topics into correct, questionable, wrong
Do not rely primarily on the numbers! Form your own impression!
Automated topic assignment
• valuable and useful if some errors are considered acceptable
• insufficient if errors cannot be tolerated
• combination with social tagging (e.g., error elimination)?
Significant improvements, in particular for precision, would be possible with
minor modifications of the underlying semantic net
17. Future work
• extend empirical basis
• realize easy improvements
• combine with social tagging
• look at new visions
• assigning topics to document parts
• interlink documents based on topical similarity
• detection of topic parallelism
• time line analysis and topic trends
18. Thanks for your attention!
… special thanks to Bavarian State Library …