READ Presentation for the #DHMASTERCLASS18 at the DHI Paris
1. Kanton Zürich
Direktion der Justiz und des Innern
Transkribus
Workshop
DHI Paris
Staatsarchiv
Tobias Hodel (Zürich)
2.
Making archival (esp. handwritten) documents more accessible
Research infrastructure – Transkribus
Funded until mid-2019 by the European Union (H2020)
15 European partners
What is READ
Recognition and Enrichment of Archival Documents
3.
Recognition of layout and text structures
Recognition of handwriting (Handwritten Text Recognition)
Text recognition with dictionaries
Writer identification
Best practices for recognising large volumes of documents
Digital Humanities in archives and scholarly practices
Research perspectives of READ
4.
11:30-11:45 Introduction to READ and Transkribus
11:45-12:45 Using Transkribus
12:45-13:00 Adding your documents to the mix
13:00-14:00 Lunch
14:00-15:30 Keyword spotting, training of HTR models and layout analysis, crowdsourcing, best practices
Program of the Workshop
5.
University of Innsbruck (co-ordinator / Austria) → Transkribus
Universitat Politecnica de Valencia (Spain) → HTR
University College London (United Kingdom) → Dissemination, e-Learning
National Center for Scientific Research “Demokritos” (Greece) → Layout Analysis
Democritus University of Thrace (Greece) → Layout Analysis
University of London Computer Centre (United Kingdom) → Web interface
Vienna University of Technology (Austria) → Layout Analysis, Writer Identification, ScanTent
University of Rostock (Germany) → HTR, Layout Analysis
Leipzig University (Germany) → Dictionaries
Naver Labs (France) → Document Understanding
Ecole Polytechnique Federale de Lausanne (Switzerland) → Large Scale Demonstrator
National Archives Finland (Finland) → Large Scale Demonstrator
Passau Diocesan Archives (Germany) → Large Scale Demonstrator
READ Partners
9.
Machine learning using neural networks
Processes writing by line, rather than by character
Needs to be trained by being shown document images and transcripts
More training data → more accurate recognition
Create a model to transcribe and search a collection of documents
Automated Text Recognition
10. Bentham model
• Based on Jeremy Bentham’s papers (18th-19th century English)
• Written by Bentham and his secretaries
• Trained on 896 pages, using transcripts submitted by volunteers
• 5-10% CER is possible
11.
12.
1 writer, 150 pages of training material: 10%
1 writer, 450 pages of training material: 4.4%
Same writer, 10 years later, without training material: 9.2%
1 writer, 1132 pages of training material: 3%
Text Recognition: What to expect
(Character Error Rate)
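The error rates on this slide are character error rates. The slides do not give the exact formula used in Transkribus, so the following is only a minimal sketch assuming the common definition: the edit (Levenshtein) distance between the recognised text and a reference transcript, divided by the length of the reference. The function names are illustrative and not part of any Transkribus API.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example line: one wrong character out of ten.
    print(character_error_rate("abcdefghij", "abcdefghix"))
```

Under this definition, a CER of 10% means roughly one character in ten has to be corrected by hand, while 3% means about three corrections per hundred characters.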
13.
Neural networks can also process printed text – with less training data!
Transcribe documents or use OCR engine in Transkribus
Use these transcripts to train a model
Results with 1-2% CER are possible
Recognising printed text
15.
(Web-)Interfaces
Transcription (beta)
Crowdsourcing (beta)
Correction (beta)
Search/extract (under development)
E-Learning
ScanApp
Preview of Transcription WebUI:
https://transkribus.eu/longan/sandbox/transcriber/?test=0&colId=20688&docId=74458&pageId=1
16.
Transkribus: transkribus.eu / email@transkribus.eu / transkribus.eu/wiki/
READ: read.transkribus.eu (also for news)
Staatsarchiv Zürich: tobias.hodel@ji.zh.ch
17.
Register: Create Username/Login
10 Steps Guide
10 Steps Video
Transkribus Wiki
Please fill out our feedback form:
http://bit.ly/dhd2018
Up Next: Transkribus
18.
By UPVLC (Bentham writings)
http://prhlt-carabela.prhlt.upv.es/bentham/
Live Demo in Transkribus
Bundesratsprotokolle
Keyword Spotting
20.
Transkribus export to EVT (by HumaReC)
http://humarec-viewer.vital-it.ch
Transkribus and EVT
Edition Visualization Technology
21.
Beta-Test: https://transkribus.eu/read/library/
Alpha-Test: https://transkribus.eu/readTest/
Transkribus WebUI