IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 
Workflow Development for OCR 
(and beyond) 
Clemens Neudecker, KB National Library of the Netherlands 
Creating and Communicating Digital Content Conference 
Umeå, 26 May 2011
IMPACT – Improving access to text 
• Funded by the EC as part of the 7th Framework Programme
• Coordinated by KB – National Library of the Netherlands
• EU funding: € 12 100 000
• 26 partners: Libraries, Research Institutes, Industry Partners
• Start date: 1 January 2008
• Duration: 48 months → 2012: Centre of Competence
• Project website: www.impact-project.eu
• IMPACT blog: http://impactocr.wordpress.com/
• Twitter: @impactocr, #impactproject
• Join us on LinkedIn!
A familiar scene? 
VVt Venetien den 1.Junij, Anno 1618. 
DJgn i f paffato te S' aö'Jifeert mo? 
üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te / 
sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met 
beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
OCR: A multitude of challenges… 
I. OCR challenges (gothic fonts, bleed-through, warping, etc.) 
OCR: A multitude of challenges… 
II. Language challenges (spelling variants, inflection, and many more!) 
Example: historical variants of the Dutch word ‘wereld’ (world): 
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt 
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels 
zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts 
werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts 
werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled 
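
One common way to cope with such variation in retrieval is a lexicon that maps attested historical spellings to a modern lemma. The sketch below is illustrative only: the small `VARIANTS` set is copied from a few of the forms listed above, and `normalise` and `expand_query` are hypothetical helpers, not functions of any IMPACT tool.

```python
# Illustrative only: a hand-made lexicon built from a few of the variants listed above.
MODERN_LEMMA = "wereld"
VARIANTS = {"werelt", "weerelt", "waereld", "vvereld", "werlt", "weireld"}

# Map each attested historical spelling to its modern lemma.
LEXICON = {variant: MODERN_LEMMA for variant in VARIANTS}


def normalise(token: str) -> str:
    """Return the modern lemma for a known historical variant, else the token unchanged."""
    return LEXICON.get(token.lower(), token)


def expand_query(lemma: str) -> set[str]:
    """Return the lemma plus every known historical variant, e.g. for full-text search."""
    return {lemma} | {variant for variant, target in LEXICON.items() if target == lemma}
```

With such a table, a search for the modern form can also match pages that only contain the older spellings.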
And a multitude of solutions! 
• 22 different ‘tools’ from diverse WPs and developers:
  OCR (C++, C#),
  Image Processing & Lexica (DLL),
  Command Line Tools (Win/Linux),
  Java, Ruby, PHP, Perl, etc.
  + 3rd party software!
“One ring to rule them all...”
→ IMPACT Interoperability Framework (IIF)
Requirement: Interoperability Framework 
• Interoperability vs. integration
• Web based vs. local installation/platform
• Most important: flexible, scalable, user friendly
• Java 6
• Apache Axis2
• Apache Tomcat
• Apache Synapse (optional)
• Taverna Workflow Engine
Generic Web Service Wrapper 
Only requirement: Command Line Application → HTML form
Available on OPFlabs:
https://github.com/openplanets/scape/tree/master/xa-toolwrapper
→ Minimise integration effort: developers can focus on their application and have to worry less about integration = higher quality software
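
The core idea of wrapping a command-line application can be sketched in a few lines. This is not the xa-toolwrapper code and does not use its configuration format; `build_command` and `run_tool` are hypothetical names, the `{placeholder}` template syntax is an assumption, and the web service layer that would receive the parameters is left out.

```python
import shlex
import subprocess
import sys


def build_command(template: str, params: dict[str, str]) -> list[str]:
    """Split the template into arguments first, then fill {placeholders} from the request."""
    # Splitting before substitution keeps each parameter value a single argument,
    # even if it contains spaces or shell metacharacters.
    return [part.format(**params) for part in shlex.split(template)]


def run_tool(template: str, params: dict[str, str], timeout: float = 60.0) -> str:
    """Run the wrapped command-line tool and return its standard output."""
    argv = build_command(template, params)
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in "tool": the Python interpreter upper-casing its first argument.
    demo = '"' + sys.executable + '" -c "import sys; print(sys.argv[1].upper())" {text}'
    print(run_tool(demo, {"text": "hello world"}))
```

Because the tool is never passed through a shell, the developer of the wrapped application only has to document its command line; quoting and process handling stay in the wrapper.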
Service Oriented Architecture 
• Java as programming language = platform independence
• Standard Apache components = easy to maintain, well supported
• Synapse as enterprise service bus = load balancing & failover
• HTTPS encryption & authentication = secure
→ Minimise deployment effort: scalability, hot deployment/update
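
What load balancing with failover means for a caller can be illustrated with a small round-robin sketch. It is a concept illustration only and does not reflect how Synapse is configured or implemented; the class name, the `send` callback, and the endpoint URLs in the usage note are all assumptions.

```python
from typing import Callable


class RoundRobinBalancer:
    """Rotate requests over several service endpoints and skip endpoints that fail."""

    def __init__(self, endpoints: list[str]) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = endpoints
        self._next = 0  # index of the endpoint to try first on the next call

    def call(self, send: Callable[[str], str]) -> str:
        """Try each endpoint once, starting with the next one in rotation."""
        errors = []
        for offset in range(len(self._endpoints)):
            index = (self._next + offset) % len(self._endpoints)
            endpoint = self._endpoints[index]
            try:
                result = send(endpoint)
            except ConnectionError as exc:
                # Failover: remember the error and move on to the next endpoint.
                errors.append(f"{endpoint}: {exc}")
                continue
            self._next = (index + 1) % len(self._endpoints)
            return result
        raise ConnectionError("all endpoints failed: " + "; ".join(errors))
```

A caller would pass a function that sends one request to a given endpoint, for example `balancer.call(lambda url: post_image(url, image))`, where `post_image` is whatever client code talks to the service.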
Workflow development 
• OCR workflow = data pipeline
• Building blocks = processing steps (nodes)
• Integration = interaction between nodes (mashup)
→ Maximise usability
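
The pipeline view above can be sketched as plain function composition. This is a minimal illustration, not Taverna's workflow model: `make_workflow` is a hypothetical helper, and `binarise` and `recognise` are stand-in nodes that only record that they ran.

```python
from functools import reduce
from typing import Callable

# A node takes the output of the previous step and returns its own output.
Step = Callable[[str], str]


def make_workflow(*steps: Step) -> Step:
    """Chain processing steps so that each node's output feeds the next node's input."""
    def workflow(data: str) -> str:
        return reduce(lambda value, step: step(value), steps, data)
    return workflow


# Stand-in nodes (assumptions): real nodes would call image processing or OCR services.
def binarise(image: str) -> str:
    return f"binarise({image})"


def recognise(image: str) -> str:
    return f"recognise({image})"


ocr_workflow = make_workflow(binarise, recognise)
```

Calling `ocr_workflow("page.tif")` passes the input through both nodes in order; swapping or adding a node only changes the argument list of `make_workflow`.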
Workflow management 
• Web 2.0 style registry: myExperiment
• Local client: Taverna Workbench
• Web client: project website
Workflow registry 
• Share resources and experience
• Rate/tag/comment workflows
• Organised in groups
Workflow modules 
• “Basic” workflow = wraps exactly one software tool/web service
• Documented inputs/outputs
Complex workflows 
• Tool/data pipeline
• Easily derived from workflow modules
• Task/goal oriented
• Reusable
Local client: Taverna Workbench 
http://www.taverna.org.uk/ 
• Background: BioSciences
• Developed and maintained by myGrid, UK
• Available for Windows/Linux/OSX and as open source
• Funding secured until 2014
Web client: Taverna Server/Workflow Parser
• SOAP/REST API
• Remote execution of workflows (webapp)
Use case: Workflows for Evaluation 
• Tool A vs Tool B (Tool A(v1) vs Tool A(v2))
• Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)
• Workflow X vs previously digitised material
→ Users identify optimal workflow for source material/project
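
A comparison of this kind needs a score per workflow. The slides do not name a metric; the sketch below assumes character error rate (edit distance divided by ground-truth length) against a manually corrected transcription, and all function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance between OCR output and ground truth, relative to ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


def best_workflow(outputs: dict[str, str], ground_truth: str) -> str:
    """Return the name of the workflow whose output has the lowest character error rate."""
    return min(outputs, key=lambda name: character_error_rate(outputs[name], ground_truth))
```

Given the text produced by Workflow X and Workflow Y for the same page, `best_workflow({"X": text_x, "Y": text_y}, truth)` returns the name of the better-scoring one; the same function compares two versions of a single tool.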
Other examples 
• Workflows for Digitisation → IMPACT
• Workflows for Linguistic Analysis → CLARIN
• Workflows for Preservation → SCAPE
• Interface for automatic storage of results, based on DAV, realised as a workflow module (native BeanShell support)
• And there are many more…
Benefits & Outlook 
• Modular
• Transparent
• Expandable
• Scalable
• Platform independent
• User friendly
• Growing interest in workflow management in the cultural heritage (CH) sector
• Easy to set up and deploy, free (open source)
• Domain independent
Thank you! Questions?

Workflow Development for OCR (and beyond)

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Workflow Development for OCR (and beyond) Clemens Neudecker, KB National Library of the Netherlands Creating and Communicating Digital Content Conference Umea, 26 May 2011
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. IMPACT – Improving access to text  Funded by the EC as part of the 7th Framework Programme  Coordinated by KB – National Library of the Netherlands  EU funding: € 12 100 000  26 partners: Libraries, Research Institutes, Industry Partners  Start date: 1 January 2008  Duration: 48 Months  2012: Centre of Competence 2  Project website: www.impact-project.eu  IMPACT blog: http://impactocr.wordpress.com/  Twitter: @impactocr, #impactproject  Join us on LinkedIn!
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 3 A familiar scene? VVt Venetien den 1.Junij, Anno 1618. DJgn i f paffato te S' aö'Jifeert mo? üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te / sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR: A multitude of challenges… I. OCR challenges (gothic fonts, bleed-through, warping, etc.) 4
  • 5. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR: A multitude of challenges… II. Language challenges (spelling variants, inflection, and many more!) Example: historical variants of the Dutch word ‘wereld’ (world): werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled 5
  • 6. And a multitude of solutions!
      • 22 different ‘tools’ from diverse work packages (WPs) and developers: OCR (C++, C#), Image Processing & Lexica (DLL), Command Line Tools (Win/Linux), Java, Ruby, PHP, Perl, etc. + 3rd party software!
      • “One ring to rule them all...”: IMPACT Interoperability Framework (IIF)
  • 7. Requirement: Interoperability Framework
      • Interoperability vs. integration
      • Web based vs. local installation/platform
      • Most important: flexible, scalable, user friendly
      • Components: Java 6, Apache Axis2, Apache Tomcat, Apache Synapse (optional), Taverna Workflow Engine
  • 8. Generic Web Service Wrapper
      • Only requirement: Command Line Application
      • HTML form
      • Available on OPFlabs: https://github.com/openplanets/scape/tree/master/xa-toolwrapper
      • Minimise integration effort: developers can focus on their application and have to worry less about integration = higher quality software
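The idea behind slide 8, that any command line application can be exposed through a uniform interface, can be sketched in a few lines. This is not the xa-toolwrapper code: it wraps a command in a plain Python callable rather than a web service, and `make_service` and `to_upper` are hypothetical names. The wrapped "tool" is the Python interpreter itself, so the example runs without installing anything.

```python
import subprocess
import sys


def make_service(argv_template):
    """Wrap a command line, given as a list of argument templates, in a callable."""
    def service(**params):
        # Fill placeholders such as "{text}" in each argument separately,
        # so values containing spaces still arrive as a single argument.
        argv = [part.format(**params) for part in argv_template]
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        return result.stdout
    return service


# Hypothetical stand-in for a real tool: upper-cases its first argument.
to_upper = make_service(
    [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{text}"]
)
```

A call such as `to_upper(text="ocr")` runs the command and returns its standard output. In a real wrapper the same template would be driven by the fields of a request instead of keyword arguments.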
  • 9. Service Oriented Architecture
      • Java as programming language = platform independence
      • Standard Apache components = easy to maintain, well supported
      • Synapse as enterprise service bus = load balancing & fail over
      • HTTPS encryption & authentication = secure
      • Minimise deployment effort: scalability, hot deployment/update
  • 10. Workflow development
      • OCR workflow = data pipeline
      • Building blocks = processing steps (nodes)
      • Integration = interaction between nodes (mashup)
      • Maximise usability
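The "workflow = data pipeline" view on slide 10 can be illustrated with a minimal sketch in which each node is a function and the workflow is simply the ordered list of nodes. The step names and the document dictionary are hypothetical placeholders; real nodes would call the corresponding tools or web services.

```python
from functools import reduce


def make_node(step_name):
    """Build a placeholder node that records its step name in the document history."""
    def node(document):
        return {**document, "history": document["history"] + [step_name]}
    return node


def run_workflow(nodes, document):
    """Feed the document through every node in order, like a data pipeline."""
    return reduce(lambda doc, node: node(doc), nodes, document)


# Hypothetical three-step workflow; each node's output is the next node's input.
workflow = [make_node("binarisation"), make_node("ocr"), make_node("post-correction")]
result = run_workflow(workflow, {"source": "page-0001.tif", "history": []})
```

Because every node has the same shape (document in, document out), steps can be swapped or reordered without touching the others, which is what makes the building-block approach practical.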
  • 11. Workflow management
      • Web 2.0 style registry: myExperiment
      • Local client: Taverna Workbench
      • Web client: project website
  • 12. Workflow registry
      • Share resources and experience
      • Rate/tag/comment workflows
      • Organised in groups
  • 13. Workflow modules
      • “Basic” workflows = each wraps exactly one software tool/web service
      • Documented inputs/outputs
  • 14. Complex workflows
      • Tool/data pipeline
      • Easily derived from workflow modules
      • Task/goal oriented
      • Reusable
  • 15. Local client: Taverna Workbench (http://www.taverna.org.uk/)
      • Background: BioSciences
      • Developed and maintained by myGrid, UK
      • Available for Windows/Linux/OSX and as open source
      • Funding secured until 2014
  • 16. Web client: Taverna Server / Workflow Parser
      • SOAP/REST API
      • Remote execution of workflows (webapp)
  • 17. Use case: Workflows for Evaluation
      • Tool A vs Tool B (Tool A(v1) vs Tool A(v2))
      • Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)
      • Workflow X vs previously digitised material
      • Users identify optimal workflow for source material/project
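The comparisons on slide 17 need some measure of output quality. The slide does not name one, so the sketch below assumes that a ground-truth transcription is available and uses the character error rate (edit distance divided by ground-truth length) as the metric. The function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


def best_workflow(outputs, ground_truth):
    """Return the name of the workflow whose output has the lowest error rate."""
    return min(outputs, key=lambda name: character_error_rate(outputs[name], ground_truth))
```

For example, `best_workflow({"X": "werelt", "Y": "wereld"}, "wereld")` picks `"Y"`, since its output matches the ground truth exactly.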
  • 18. Other examples
      • Workflows for Digitisation: IMPACT
      • Workflows for Linguistic Analysis: CLARIN
      • Workflows for Preservation: SCAPE
      • Interface for automatic storage of results, based on DAV, realised as a workflow module (native beanshell support)
      • And there are many more…
  • 19. Benefits & Outlook
      • Modular
      • Transparent
      • Expandable
      • Scalable
      • Platform independent
      • User friendly
      • Growing interest in workflow management in the cultural heritage (CH) sector
      • Easy to set up and deploy, free (open source)
      • Domain independent
  • 20. Thank you! Questions?
