Wroclaw University Library 
Grażyna Piotrowicz
Wroclaw University Library:
1. is one of the biggest academic libraries in Poland. Its collection holds about 2.4 million volumes, including 0.5 million special-collection items (e.g. manuscripts, old printed books, incunabula, maps, graphic and music collections);
2. is a member of IFLA, CERL, IAML, and Technical Committee No 242 (for Information and Documentation) at the Polish Committee for Standardization;
3. has participated in many research projects (European, international, national, etc.);
4. has a staff team with long-standing experience in digitising printed items and in processing and presenting digital objects;
5. began digitising its own physical holdings in 2000, launched the Digital Library of the University of Wroclaw (DLUW) in 2005 and, in 2013/2014, the university repository (Repository of the University of Wroclaw, RUW).
Thanks to a sound policy of human resources development, purchases of optical and electronic equipment and computers (hardware and software), and participation in many projects, the Wroclaw University Library has the experienced staff and the technological base it needs to cooperate within the Impact Centre of Competence in Digitisation.
Use Case and Tools 
To improve the digitisation workflow in DLUW, we needed tools that could speed up and optimize the processes.
For that purpose two tools were tested. First, Scan Tailor was chosen as the post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, and adding or removing borders. It was applied to raw scans and produced pages ready to be printed or assembled into PDF or DjVu files.
The second was Tesseract, an open-source OCR engine that, combined with the Leptonica image processing library, can read a wide variety of image formats and convert them to text in over 60 languages.
Both tools were tested while preparing presentation versions of 12 selected old printed books (from the 16th to the 18th century), all with a single-column text layout, printed in different languages (e.g. Latin, Italian, German, Romance) and in different font types (e.g. Gothic, Roman). The tests aimed to work out the production line and workflow for digitising, processing, and presenting good-quality delivery files in the DLUW. For the evaluation, ground truth in plain text format was used (5 pages from each selected document).
The evaluation was performed by: 1. comparing OCR output with ground truth and measuring the character error rate (CER); 2. comparing OCR output with ground truth and measuring the word error rate (WER); 3. comparing OCR output from different engines.
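The two error rates named above can be illustrated with a minimal sketch. It assumes the common definition of CER and WER as the Levenshtein (edit) distance between OCR output and ground truth, divided by the length of the ground truth in characters or words. The slides do not state which evaluation tool or exact normalisation was used, so the function names and the sample strings here are purely illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth: str, ocr_text: str) -> float:
    """Character error rate relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth: str, ocr_text: str) -> float:
    """Word error rate relative to the number of ground-truth words."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


# Hypothetical sample: two misread characters fall in two different words.
truth = "in principio erat verbum"
ocr = "in princlpio erat uerbum"
print(f"CER = {cer(truth, ocr):.2%}, WER = {wer(truth, ocr):.2%}")
```

A single wrong character makes the whole word count as an error, which is why WER is normally well above CER for the same page.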
Use Case and Tools 
The research process was carried out on a server in the following 3 steps:
1st step: execution of the Scan Tailor program with default settings.
After Scan Tailor had finished processing, an operator had to check the results visually and manually correct any wrongly processed files.
This made it possible to bring the parameters of later processing up to a satisfactory level. We aimed for the best possible quality of "post master" files, both for subsequent OCR processing and for aesthetically pleasing digital presentations of the originals in DLUW.
2nd step: saving manual corrections on the server. Only the files that had to be corrected by the operator were saved; the remaining results of Scan Tailor's automatic operations were left unchanged. A dedicated website on the server supported this step.
3rd step: execution of the Tesseract program. The appropriate dictionaries were chosen beforehand. We used only the dictionaries shipped with Tesseract, and no additional training tools were applied. Small font sizes turned out to be a major problem for Tesseract. In addition, it lacks tools for precisely marking out the text layout and separating it from graphic areas. As a result, it attempts text recognition on graphical objects such as frames, floral ornaments, seals, etc.
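The third step could be scripted roughly as follows. This is a sketch under stated assumptions, not the project's actual setup: it assumes the `tesseract` command-line program is installed and on the PATH, and that a Latin language pack (`lat`) is available. The file names are hypothetical. The helper only assembles the standard invocation `tesseract <image> <output base> -l <language> [hocr]`; running it is left to a separate function.

```python
import subprocess
from typing import List


def build_tesseract_command(image: str, output_base: str,
                            lang: str = "lat", hocr: bool = True) -> List[str]:
    """Assemble a Tesseract CLI call; 'hocr' selects hOCR instead of plain text output."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if hocr:
        cmd.append("hocr")
    return cmd


def run_ocr(image: str, output_base: str, lang: str = "lat", hocr: bool = True) -> None:
    """Run Tesseract on one post-processed page (requires Tesseract to be installed)."""
    subprocess.run(build_tesseract_command(image, output_base, lang, hocr), check=True)


# Hypothetical page produced by the Scan Tailor step; only the command is printed here.
print(" ".join(build_tesseract_command("page_001.tif", "page_001", lang="lat")))
```

Which language codes are installed depends on the local Tesseract distribution, so the `lang` value has to match the dictionaries actually present on the server.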
Evaluation Results 
Integrating the currently dispersed digitisation and data processing steps into a single solution can significantly decrease costs and increase the efficiency of creating digital resources in the DLUW. The tests carried out on Scan Tailor and Tesseract are of great importance for preparing and organizing a production line for data processing in the cloud. Procedures and interfaces still need to be worked out so that our staff can support the remote processes.
Scan Tailor can carry out the following tasks automatically and efficiently: splitting master files into single pages, rotating the split pages to level the text, removing margins and artifacts, and generating files ready for the OCR process. The only problem is correct recognition of the text area, so this task cannot be completed automatically without a control step. This imperfection does not disqualify Scan Tailor, and it will be used in WUL as an important tool in data processing.
The Tesseract program appears to be a very promising tool, and we will certainly run trials to implement it in support of the digitisation of selected types of library materials. It is essential, however, to improve the quality of document layout analysis as well as the handling of graphical elements and small fonts.
Evaluation Results 
The results of text recognition can be saved as "txt" or "hocr" files. A "hocr" file contains the recognized text, its location relative to the original image, and style information. These data are encoded as markup in an HTML or XHTML file.
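To make the structure of such a file concrete, the sketch below reads words and their bounding boxes from a small hOCR fragment using only the Python standard library. It assumes the usual hOCR convention of `ocrx_word` spans whose `title` attribute carries `bbox x0 y0 x1 y1`. The sample markup and coordinates are invented for illustration and do not come from the project.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from spans of class 'ocrx_word'."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr: str) -> List[Tuple[str, BBox]]:
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Invented fragment in the usual hOCR shape: page > line > words.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1200 1800'>
 <span class='ocr_line' title='bbox 100 200 620 240'>
  <span class='ocrx_word' title='bbox 100 200 160 240; x_wconf 93'>In</span>
  <span class='ocrx_word' title='bbox 180 200 420 240; x_wconf 88'>principio</span>
 </span>
</div>
"""
for text, bbox in parse_hocr_words(SAMPLE):
    print(text, bbox)
```

Because each word keeps its coordinates on the page image, the same file can drive both full-text search and the text layer of a hybrid PDF or DjVu.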
From an archiving point of view, "hocr" files appear to be a good storage format. Each "hocr" file is assigned to a specific image file. This makes it possible to correct individual pages of a document, so the correction process can be organized more flexibly. Hybrid publications (PDF, DjVu) can be created automatically by the server, and "hocr" files can serve as a basis for further preparation of electronic publications. We see the potential of this solution and intend to use the tools created during the project in the near future.
In addition, while carrying out another project on 19th-century newspapers printed in Gothic type, we observed very satisfactory OCR results from Tesseract on objects converted to 1-bit (black and white) images (see http://www.bibliotekacyfrowa.pl/publication/59368).
We also repeated the recognition of samples from object 319708 using the prepared monochrome (1-bit) files. Tesseract results: CER 7.80% and WER 19.67%, versus CER 20.58% and WER 35.56% for Tesseract in our final report (a relative reduction of about 62% in CER and 45% in WER).
A good black-and-white image thus proved to be an essential factor with a very positive effect on OCR results.
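The slides do not say how the 1-bit images were produced. As one possible illustration, the sketch below uses Otsu's method, a widely used global thresholding technique, chosen here as an assumption and not as the project's actual procedure. It works on a flat list of 8-bit grayscale values, and the sample pixel values are invented.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]           # pixels with value <= t
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue                # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break                   # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels: Sequence[int]) -> List[int]:
    """Map each pixel to 0 (black) or 1 (white) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 1 for p in pixels]


# Invented sample: three dark "ink" pixels and three light "paper" pixels.
sample = [20, 25, 30, 200, 210, 220]
print(otsu_threshold(sample), binarize(sample))
```

A single global threshold can fail on unevenly lit or stained pages, where locally adaptive binarization is often preferred, so any such step would need checking against the actual scans.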
