16. Historical variant and OCR error patterns Historical Variants OCR Error patterns teil theil theil iheil
17. Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’ Absolute frequency: Pattern was found 120 times in the current document.
18.
19.
20.
21.
22.
23.
24.
25. Evaluation: Measures (1) Global Profiles Percentage of matches for the first 10 patterns in the ranked output lists Two Values: Historical Patterns, OCR Patterns (2) OCR Error Detection Precision and Recall for the OCR errors detected by the Profiler (3) Indirect evaluation (For instance, by means of the postcorrection system)
26. Evaluation: Data preparation (1) Deep Evaluation: For each token of the evaluation document the historical interpretation and the OCR interpretation have been manually annotated. ++ fully accurate -- manual work (2) Shallow Evaluation: The OCR’ed document is automatically aligned with its re-typed ground truth; For each token of the evaluation document the historical and the OCR interpretation is automatically assigned from the ground truth. ++ no manual work – not completely accurate
27. Evaluation: Data Deep: Eckartshausen 100 pages Briefkunst 40 pages Shallow: 5 books each, 16 th , 17 th and 18 th century