The document discusses the transformation of humanities research through digital technologies and optical character recognition (OCR). It describes efforts to extract over 2,000 years of Latin text from digitized books and track linguistic changes over time using machine learning techniques. Computational analysis is helping scholars build dynamic digital editions and study underrepresented languages on a massive scale.
OCR and Digital Humanities
1. OCR and the Transformation of the Humanities. Gregory Crane and David Bamman (Tufts University); Bruce Robertson (Mount Allison University); John Darlington and Brian Fuchs (Imperial College London)
9. Towards Dynamic Variorum Editions
51. c. 100 CE papyrus from Euclid (c. 300 BCE) http://www.math.ubc.ca/~cass/Euclid/papyrus/papyrus.html
52. 800-1000 CE: Greek into Arabic Hunayn Ibn Ishaq (809–873), Arabic version of the Prognosticon from the Hippocratic Corpus http://www.nlm.nih.gov/exhibition/odysseyofknowledge/
53. c. 1200-1300: Arabic into Latin Medieval Translation of the Prognosticon from Arabic into Latin
54. Return of Greek sources c. 1500 This first edition of Dioscorides' Greek text, printed in Venice in 1499 by Aldo Manuzio (ca. 1447–1515)
74. Polysemy: words have many senses. Lead: (verb) cause to go; be in command; (noun) position of advantage; chief part in play; element Pb; graphite in pencil. Iron: (verb) to smooth with an iron; (noun) element Fe; tool with flat steel base used to smooth clothes; golf club. Oratio: (noun) speech; prayer.
91. Digitizing and Viewing Difficult Texts: Lessons From Ancient Greek. The 19th century produced a vast array of editions of Greek text, many still very useful, yet they could not be accessed digitally. What tools and workflows might help us digitize diverse texts such as these? What applications can we create to make the resulting OCR data useful to researchers and students?
128. Plato’s Republic and the Guardians; the Islamic Republic of Iran and the Guardianship of Islamic Jurists
129. Sometimes Greek philosophy does have an impact... Plato’s Republic and the Guardians; the Islamic Republic of Iran and the Guardianship of Islamic Jurists
Checked Shakespeare Quarterly 61.1 (Spring 2011): of 100 randomly selected footnotes with secondary-source citations, 99 cited English-language sources and 1 cited a French one. A handful cited Greek and Latin sources, but in each case through English translations. There were 4 articles and a review essay.
So one solution that we’re presenting here is the idea of transferring markup from a richly annotated source text to a plain-text translation (which may be nothing more than badly OCR’d words). At a high level, this is done in two steps: an alignment step and a projection step. First we align the source document with the target document in a cascading process; once that alignment is complete, we project the XML tags across the documents in a way that exploits the similarity in linguistic structure between the text pair.
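As a rough illustration of the two steps, here is a minimal Python sketch. It is not the presenters' system: the cascading alignment is reduced to a single word-level pass driven by a hypothetical bilingual `lexicon`, and the function names (`align`, `project`), the `placeName` tag, and the sample sentence are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# (tag name, first source token index, one past the last source token index)
Span = Tuple[str, int, int]


def align(source: List[str], target: List[str],
          lexicon: Dict[str, str]) -> Dict[int, int]:
    """Alignment step (toy version): link each source token to the first
    unused target token that equals its lexicon gloss."""
    links: Dict[int, int] = {}
    used = set()
    for i, word in enumerate(source):
        gloss = lexicon.get(word.lower())
        if gloss is None:
            continue  # no known translation, leave the token unaligned
        for j, candidate in enumerate(target):
            if j not in used and candidate.lower() == gloss:
                links[i] = j
                used.add(j)
                break
    return links


def project(spans: List[Span], links: Dict[int, int],
            target: List[str]) -> str:
    """Projection step: wrap the target tokens aligned to each annotated
    source span (from the leftmost to the rightmost hit) in the same tag."""
    out = list(target)
    for tag, start, end in spans:
        hits = sorted(links[i] for i in range(start, end) if i in links)
        if not hits:
            continue  # nothing aligned, so this tag is not transferred
        out[hits[0]] = f"<{tag}>{out[hits[0]]}"
        out[hits[-1]] = f"{out[hits[-1]]}</{tag}>"
    return " ".join(out)


# Hypothetical data: an annotated Latin source and a plain English target.
source = ["Gallia", "est", "omnis", "divisa"]
target = ["All", "Gaul", "is", "divided"]
lexicon = {"gallia": "gaul", "est": "is", "omnis": "all", "divisa": "divided"}
spans = [("placeName", 0, 1)]  # "Gallia" is tagged in the source

links = align(source, target, lexicon)
tagged = project(spans, links, target)
print(tagged)
```

A real pipeline would presumably align at coarser levels first (document, section, sentence) before reaching words, and would need to tolerate OCR noise in the target; the sketch only shows how an alignment, once available, lets tags be carried across.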
1. The OCR outputs are aligned by a multiple sequence alignment dynamic programming algorithm (similar to BLAST, Clustal, etc., used in bioinformatics for DNA alignment). 2. The Bayesian classifier determines, character by character, which character is most reliable, taking into account the probability that a particular engine provides the good character when the other engines provide their bad characters (ex.: e1=R, e2=P, e3=B: R is the most probable on e1's output when e2 provides P and e3 provides B). This is the reason that I don't use a 2-to-1 method but a Bayesian classifier: many times we have THREE DIFFERENT THINGS! 3. When we have wrong sequences, the spellchecker provides suggestions. Only the (first) suggestion whose characters each appear, in the correct position, in the output of at least one of the three engines is accepted (ex.: e1=zoure, e2=poqse, e3=hoxxe gives house: h-o-u-s-e are each provided, in the correct position, by at least one engine: h in e3, o in every engine, u in e1, etc.).
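Steps 2 and 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the speaker's implementation: the outputs are taken to be already aligned (step 1), the classifier is a naive Bayes model that treats the three engines as conditionally independent given the true character, and the function names, smoothing constants, and training samples are invented for the example. The code also reads "(first) suggestion" as the first suggestion in the spellchecker's list that passes the position check.

```python
from collections import Counter, defaultdict
from typing import Iterable, Optional, Sequence, Tuple


def train(samples: Iterable[Tuple[str, Sequence[str]]]):
    """Count, for each true character, what each of the three engines
    printed. `samples` holds (true_char, (e1_char, e2_char, e3_char))."""
    prior: Counter = Counter()
    confusion = [defaultdict(Counter) for _ in range(3)]
    for truth, outputs in samples:
        prior[truth] += 1
        for k, seen in enumerate(outputs):
            confusion[k][truth][seen] += 1
    return prior, confusion


def pick_char(outputs: Sequence[str], prior, confusion,
              alpha: float = 0.1, alphabet_size: int = 64) -> str:
    """Step 2: choose the candidate c that maximizes
    P(c) * product over engines k of P(engine k printed outputs[k] | c).
    `alpha` and `alphabet_size` are assumed smoothing parameters."""
    best, best_score = outputs[0], -1.0
    for cand in dict.fromkeys(outputs):  # distinct candidates, stable order
        score = prior[cand] + alpha
        for k, seen in enumerate(outputs):
            count = confusion[k][cand][seen]
            score *= (count + alpha) / (prior[cand] + alpha * alphabet_size)
        if score > best_score:
            best, best_score = cand, score
    return best


def supported(word: str, outputs: Sequence[str]) -> bool:
    """True if every character of `word` appears at the same position
    in at least one (equal-length, aligned) engine output."""
    if any(len(o) != len(word) for o in outputs):
        return False
    return all(any(o[i] == ch for o in outputs) for i, ch in enumerate(word))


def first_supported(suggestions: Sequence[str],
                    outputs: Sequence[str]) -> Optional[str]:
    """Step 3: accept the first spellchecker suggestion that is supported."""
    return next((s for s in suggestions if supported(s, outputs)), None)


# Hypothetical training data: engine 2 misreads R as P, engine 3 as B.
samples = ([("R", ("R", "P", "B"))] * 5
           + [("P", ("P", "P", "P"))] * 5
           + [("B", ("B", "B", "B"))] * 5)
prior, confusion = train(samples)
chosen = pick_char(("R", "P", "B"), prior, confusion)

# The example from the notes, with an invented suggestion list.
engine_words = ["zoure", "poqse", "hoxxe"]
accepted = first_supported(["horse", "house"], engine_words)
print(chosen, accepted)
```

With three engines disagreeing three ways, a majority vote has nothing to count, which is why the per-engine error statistics matter here: the classifier prefers R because that particular pattern of outputs is what the engines tend to produce when the true character is R.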