AI for digitized cultural heritage
Qurator.ai @ Berlin State Library
Clemens Neudecker (@cneudecker)
EuropeanaTech x AI webinar
21 May 2021
Berlin State Library (SBB)
● Established 1661 in Berlin (then Brandenburg-Prussia)
● Largest research library in Germany
(25M media objects, 2.5 petabytes of digital data storage)
● Forms part of the larger LAM legal entity
Prussian Cultural Heritage Foundation (SPK)
● https://staatsbibliothek-berlin.de/
● In-house Digitization Center since 2007
○ ~80 concurrent digitization projects
○ ~2M scanned images produced annually
● Digital collections give access to ~185k digitized documents
(mostly Public Domain)
● https://digital.staatsbibliothek-berlin.de/
Qurator.ai @ SBB
● SBB responsible for sub-project 10: “AI for digitized cultural heritage”
● Main goal: improve the quality and efficiency of (document) digitization
● Full recognition and enrichment
pipeline for digitized documents
● Development of open source tools
https://github.com/qurator-spk
● Publication of open datasets
https://zenodo.org/communities/stabi
● Releases of trained models
https://qurator-data.de/
● Showcases (only available in German)
https://qurator.ai/innovationlab/staatsbibliothek-zu-berlin/
Image Preprocessing: Binarization
● Binarization (i.e. the conversion of colour/greyscale images to purely black and white pixels) can be used to
increase the contrast between background (paper) and foreground (ink) and to remove defects, noise,
etc., which improves subsequent processing steps
● Many OCR engines require (or work best with) binarized images for recognition
● Training of autoencoder model for document image binarization
https://github.com/qurator-spk/sbb_binarization
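To illustrate what binarization does, here is a minimal sketch using classical Otsu thresholding on a tiny greyscale image. This is not the autoencoder approach of sbb_binarization; it is a simple global-threshold baseline, and the function names `otsu_threshold` and `binarize` are chosen for this example only.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 8-bit grey values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of the grey values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a 2D list of grey values to 0 (ink) or 255 (paper)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[0 if p <= t else 255 for p in row] for row in image]


# Toy 2x3 "page": low values are ink, high values are paper.
page = [[30, 40, 200],
        [220, 35, 210]]
binary_page = binarize(page)
```

A single global threshold like this tends to fail on unevenly lit or stained pages, which is one motivation for learned, pixelwise approaches.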
Document Image Analysis
● High-quality analysis of document layout is key for all subsequent tasks
● Training of multiple ResNet50-U-Net models for pixelwise segmentation
● 1st iteration (“pure” ML)
○ some problems with headings,
drop capitals, reading order
● 2nd iteration (“hybrid”)
○ additional heuristics deliver
improvements for textlines
and reading order detection
https://github.com/qurator-spk/eynollah
[Figure: example segmentation results, showing text regions and text lines]
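The kind of heuristic that can complement pixelwise segmentation for reading order can be sketched as follows. This is an illustrative toy, not eynollah's actual logic: it assumes axis-aligned bounding boxes, a simple multi-column page, and a hypothetical `column_gap` parameter.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    name: str
    left: int
    top: int
    right: int
    bottom: int


def reading_order(regions: List[Region], column_gap: int = 50) -> List[Region]:
    """Order regions column by column (left to right), top to bottom within each column."""
    columns: List[List[Region]] = []
    for region in sorted(regions, key=lambda r: r.left):
        # A region joins the current column if its left edge is within
        # column_gap of the left edge of that column's first region;
        # otherwise it opens a new column.
        if columns and region.left - columns[-1][0].left <= column_gap:
            columns[-1].append(region)
        else:
            columns.append([region])

    ordered: List[Region] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda r: r.top))
    return ordered


regions = [
    Region("A", 100, 500, 500, 900),
    Region("B", 100, 100, 500, 450),
    Region("C", 600, 100, 1000, 350),
    Region("D", 610, 400, 1000, 900),
]
order = [r.name for r in reading_order(regions)]
```

Headings that span several columns, drop capitals, and marginalia break such a simple rule, which is where the problems noted for the first iteration typically arise.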
Image (Similarity) Search
● Document layout analysis provides (pixel coordinate) information about image content contained in
the digitized documents
● Extraction (and release) of ~600k graphical elements from document images
● Training an image classification
model on the basis of ImageNet
● Detection of regions of interest (ROI) within an image using YOLOv3
● Approximate nearest neighbour
search for similar images
● Offers an alternative search and browse
entry point to the digitised collections
https://github.com/qurator-spk/sbb_images
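Similarity search boils down to comparing feature vectors. The sketch below performs an exact, brute-force cosine-similarity search over a toy index; approximate nearest neighbour indexes aim to return (nearly) the same ranking much faster on large collections. The image IDs and three-dimensional vectors are made up for illustration; real feature vectors from a classification model have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, index, k=2):
    """Return the k (image_id, similarity) pairs closest to the query vector."""
    scored = [(image_id, cosine_similarity(query, vec)) for image_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical feature vectors for three extracted illustrations.
index = {
    "map_01": [0.9, 0.1, 0.0],
    "portrait_07": [0.1, 0.8, 0.3],
    "map_02": [0.8, 0.2, 0.1],
}
hits = most_similar([1.0, 0.0, 0.0], index, k=2)
```

Brute force compares the query against every indexed vector, so its cost grows linearly with the collection size; that is the cost an approximate index trades a little accuracy to avoid.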
OCR / Text Recognition
● Traditionally, OCR for historical documents is hard
(Fraktur fonts, complex layouts, defects and
damages, historical spelling)
● Thanks to deep learning for OCR (Calamari) and
public ground truth (GT) datasets (GT4HistOCR),
nearly error-free OCR is now possible!
● A single (language-independent) OCR model can be
applied to both Fraktur and Antiqua (also mixed)
● Initial evaluations show a reduction of the
character error rate (CER) from ~20% to ~2%
https://github.com/qurator-spk/ocrd_calamari
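The character error rate is commonly computed as the edit (Levenshtein) distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch of that definition; evaluation tools may additionally normalize Unicode or whitespace, which is left out here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep on match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr: str, ground_truth: str) -> float:
    """Edit distance divided by the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr, ground_truth) / len(ground_truth)


# One wrong character out of 16.
cer = character_error_rate("Staatsbibliothef", "Staatsbibliothek")
```

With this definition, a CER of ~2% corresponds to roughly one wrong character in every fifty.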
OCR Postcorrection
● Even with highly accurate OCR, there remain a few recognition errors
● Idea: train a machine translation model to “translate” OCR errors to correct words
● Challenges:
○ retain historical spelling variants
○ avoid introducing new errors
● Two-step model (seq2seq LSTM):
○ First, detect the parts of text with errors
(this helps artificially increase the error
density in the input for step two)
○ Translate (i.e. correct) errors in the OCR text
● Relative OCR accuracy improvement: 18%
https://github.com/qurator-spk/sbb_ocr_postcorrection
Named Entity Recognition
● Named Entity Recognition (NER) is used to identify proper names of persons, locations,
organizations in unstructured text (here: OCR results)
● Unsupervised Pre-Training of BERT model on the digitized historical documents
● Supervised Training of BERT model for NER with labeled data for German NER
● Results are state of the art, with an F1 score of 85.6%
https://github.com/qurator-spk/sbb_ner
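For readers unfamiliar with the metric, an entity-level F1 score can be sketched as below. The sketch assumes exact-match scoring, where a predicted entity counts only if its span and type both match the gold annotation; the slides do not state which scoring variant was used, and the example entities are invented.

```python
def entity_f1(predicted, gold):
    """Return (precision, recall, f1) for two collections of (start, end, type) entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Token spans with entity types; the second prediction has the wrong type.
gold = [(0, 1, "PER"), (4, 5, "LOC"), (7, 9, "ORG")]
predicted = [(0, 1, "PER"), (4, 5, "ORG"), (7, 9, "ORG")]
precision, recall, f1 = entity_f1(predicted, gold)
```

F1 is the harmonic mean of precision and recall, so it only rises when a model both finds most entities and avoids spurious ones.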
Named Entity Disambiguation and Linking
● Entities recognized by NER can be ambiguous
● Example: “Paris is in France”
- Paris the city or Paris (Hilton) the person?
● Necessary to determine the correct entity by context
● Establishing a knowledge base for comparison based on Wikidata/Wikipedia
(harvesting of all articles for the corresponding categories)
● Training of a “context-comparison” BERT embeddings model that decides for a given entity
in the OCR text whether it is similar to a Wikipedia lemma
● Enrichment of the OCR text with links to Wikidata IDs and geo-coordinates for toponyms
https://github.com/qurator-spk/sbb_ned
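The idea of deciding by context can be shown with a deliberately simple stand-in: instead of BERT embeddings, this sketch compares bag-of-words overlap (Jaccard similarity) between the sentence and a short description of each candidate. The candidate keys and descriptions are placeholders for this example, not real Wikidata IDs or knowledge base entries.

```python
def tokens(text):
    """Lower-cased word set with surrounding punctuation stripped."""
    return {word.strip(".,;:!?()\"'").lower() for word in text.split()} - {""}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_entity(context, candidates):
    """Return the key of the candidate whose description best matches the context."""
    context_tokens = tokens(context)
    return max(candidates, key=lambda key: jaccard(context_tokens, tokens(candidates[key])))


# Placeholder knowledge base entries for the ambiguous mention "Paris".
candidates = {
    "paris_city": "Paris is the capital city of France",
    "paris_person": "Paris Hilton is an American media personality",
}
best = link_entity("Paris is in France", candidates)
```

Word overlap ignores meaning and word order, so it fails as soon as the context uses different vocabulary than the description; a learned embedding comparison is meant to handle exactly those cases.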
Data Annotation
● neat (named entity annotation tool) for data annotation (and OCR correction)
● Simple, browser-based JavaScript tool
(no installation or rights required)
● TSV (tab-separated-values)
as internal working format
● Embeds image snippets
via IIIF Image API to aid with annotation
● Due to popular demand (i.e. Covid-19),
neat can now also be used for OCR correction
or transcription (e.g. to create GT)
https://github.com/qurator-spk/neat
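Image snippets can be requested from any IIIF Image API server by encoding the pixel region in the URL, following the pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below builds such a URL; the server address and image identifier are hypothetical, and `size="max"` assumes a server that supports that keyword (older Image API 2.0 servers may expect `full` instead).

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, x, y, width, height, size="max", fmt="jpg"):
    """Build an IIIF Image API URL for a rectangular pixel region of an image."""
    region = f"{x},{y},{width},{height}"
    # Rotation is fixed to 0 and quality to "default" in this sketch.
    return (
        f"{base_url.rstrip('/')}/{quote(identifier, safe='')}"
        f"/{region}/{size}/0/default.{fmt}"
    )


# Hypothetical server and identifier; the coordinates would come from the
# word or line bounding box produced by OCR.
snippet_url = iiif_region_url(
    "https://iiif.example.org/image/", "page-0001", 100, 200, 300, 50
)
```

Because the region is part of the URL, an annotation tool only needs to store coordinates alongside each token and can fetch the matching snippet on demand.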
Future Work
● Processing all the digitized documents in SBB with the Qurator pipeline would yield greatly
improved data for extending this work and for training better models
● But AI/ML is computationally demanding: with our current server (36 CPU cores, 2x V100 GPUs,
192 GiB RAM) this would take years. What can we do to increase throughput without sacrificing
performance?
● Methods that combine computer vision
(document image analysis) and natural
language processing (OCR text content)
features promise further improvements
● Extending current developments to other
languages and scripts (esp. Asian) and layouts (e.g. right-to-left, vertical)
● Provision of interactive demos in our SBB LAB https://lab.sbb.berlin/
Thank you for your attention!
Questions?
