Kanton Zürich
Direktion der Justiz und des Innern
Transkribus
Workshop
DHI Paris
Staatsarchiv
Tobias Hodel (Zürich)

What is READ
Recognition and Enrichment of Archival Documents
• Making archival (esp. handwritten) documents more accessible
• Research infrastructure – Transkribus
• Funded until mid-2019 by the European Union (H2020)
• 15 European partners

Research perspectives of READ
• Recognition of layout and text structures
• Recognition of handwriting (Handwritten Text Recognition, HTR)
• Text recognition with dictionaries
• Writer identification
• Best practices for recognition of large amounts of documents
• Digital Humanities in archives and scholarly practices
Workshop Program
11:30-11:45 Introduction to READ and Transkribus
11:45-12:45 Using Transkribus
12:45-13:00 Adding your documents to the mix
13:00-14:00 Lunch
14:00-15:30 Keyword Spotting, Training of HTR Models and Layout Analysis, Crowdsourcing, Best Practices

READ Partners
• University of Innsbruck (co-ordinator / Austria) → Transkribus
• Universitat Politècnica de València (Spain) → HTR
• University College London (United Kingdom) → Dissemination, e-Learning
• National Center for Scientific Research “Demokritos” (Greece) → Layout Analysis
• Democritus University of Thrace (Greece) → Layout Analysis
• University of London Computer Centre (United Kingdom) → Web Interface
• Vienna University of Technology (Austria) → Layout Analysis, Writer Identification, ScanTent
• University of Rostock (Germany) → HTR, Layout Analysis
• Leipzig University (Germany) → Dictionaries
• Naver Labs (France) → Document Understanding
• École Polytechnique Fédérale de Lausanne (Switzerland) → Large Scale Demonstrator
• National Archives Finland (Finland) → Large Scale Demonstrator
• Passau Diocesan Archives (Germany) → Large Scale Demonstrator
Projects with a Memorandum of Understanding
Automated Text Recognition?

Automated Text Recognition
• Machine learning using neural networks
• Processes writing by line, rather than by character
• Needs to be trained by being shown document images and transcripts
• More training data → more accurate recognition
• Create a model to transcribe and search a collection of documents
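Training on images plus transcripts, line by line, can be pictured as a set of (line image, transcript) pairs, with part of the set held back to check the model. The following is only a minimal sketch of that idea: the class LineSample, the function split_samples, the file names and the 10% validation share are hypothetical illustrations, not the format or procedure Transkribus uses internally.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LineSample:
    """One training example: a single text line and its transcript (hypothetical structure)."""
    image_path: str   # assumed path to the cropped image of one text line
    transcript: str   # the human transcription of exactly that line


def split_samples(samples, validation_share=0.1, seed=0):
    """Shuffle the samples reproducibly and hold back a share for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_share)
    return shuffled[n_validation:], shuffled[:n_validation]


# Illustrative data only: 20 invented line samples.
samples = [LineSample(f"line_{i:03d}.png", f"transcript of line {i}") for i in range(20)]
training, validation = split_samples(samples)
print(len(training), "lines for training,", len(validation), "lines held back")
```

The held-back lines are never shown to the model during training, so the error measured on them gives a fairer picture of what to expect on unseen pages.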
Bentham model
• Based on Jeremy Bentham’s papers (18th-19th century English)
• Written by Bentham and his secretaries
• Trained on 896 pages, using transcripts submitted by volunteers
• A Character Error Rate (CER) of 5-10% is possible
Text Recognition: What to expect (Character Error Rate)
1 writer, 150 pages of material for training: 10%
1 writer, 450 pages of material for training: 4.4%
Same writer, 10 years later, without material for training: 9.2%
1 writer, 1132 pages of material for training: 3%
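For orientation, a Character Error Rate is commonly defined as the number of character edits (insertions, deletions, substitutions) needed to turn the recognised text into the correct transcript, divided by the length of the correct transcript. The sketch below follows that common definition; the slides do not say how Transkribus computes its figures, so the function names and the sample strings are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: one dropped and one extra character in a 26-character line.
reference = "Recognition of handwriting"
hypothesis = "Recogniton of handwritting"
print(f"CER: {character_error_rate(reference, hypothesis):.1%}")  # 2 edits / 26 characters
```

Read this way, a CER of 3% means roughly three wrong characters in every hundred, and 10% means roughly one in ten.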

Recognising printed text
• Neural networks can also process printed text – with less training data!
• Transcribe documents or use OCR engine in Transkribus
• Use these transcripts to train a model
• Results with 1-2% CER are possible
(Web-)Interfaces

• Transcription (beta)
• Crowdsourcing (beta)
• Correction (beta)
• Search/extract (under development)
• E-Learning
• ScanApp
Preview of Transcription WebUI:
https://transkribus.eu/longan/sandbox/transcriber/?test=0&colId=20688&docId=74458&pageId=1
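The preview link carries three numeric query parameters: colId, docId and pageId. Judging by their names alone, they appear to identify a collection, a document and a page; that reading is an assumption, not something the slides confirm. A small sketch using only the Python standard library shows how such a link can be taken apart (the function name page_reference is hypothetical):

```python
from urllib.parse import parse_qs, urlsplit

PREVIEW_URL = (
    "https://transkribus.eu/longan/sandbox/transcriber/"
    "?test=0&colId=20688&docId=74458&pageId=1"
)


def page_reference(link: str) -> dict:
    """Extract the colId, docId and pageId query parameters as integers."""
    query = parse_qs(urlsplit(link).query)
    wanted = ("colId", "docId", "pageId")
    return {key: int(query[key][0]) for key in wanted if key in query}


print(page_reference(PREVIEW_URL))
```

Changing pageId in the link would presumably open another page of the same document, but that is untested here.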
Transkribus: transkribus.eu / email@transkribus.eu / transkribus.eu/wiki/
READ: read.transkribus.eu (also for news)
Staatsarchiv Zürich: tobias.hodel@ji.zh.ch

Up Next: Transkribus
• Register: Create Username/Login
• 10 Steps Guide
• 10 Steps Video
• Transkribus Wiki
Please fill out our feedback form:
http://bit.ly/dhd2018
Keyword Spotting
By UPVLC (Bentham writings): http://prhlt-carabela.prhlt.upv.es/bentham/
Live demo in Transkribus: Bundesratsprotokolle (minutes of the Swiss Federal Council)
Transkribus and EVT (Edition Visualization Technology)
Transkribus export to EVT (by HumaReC): http://humarec-viewer.vital-it.ch
Transkribus WebUI
Beta test: https://transkribus.eu/read/library/
Alpha test: https://transkribus.eu/readTest/


READ Presentation for the #DHMASTERCLASS18 at the DHI Paris
