Wroclaw University Library presentation at "Succeed in Digitisation. Spreading Excellence" Conference. Validation and take-up of text digitisation tools.
2. Wroclaw University Library:
1. is one of the biggest academic libraries in Poland. Its collection comprises ca. 2.4 million volumes, including 0.5 million special-collection items (manuscripts, old printed books, incunabula, maps, graphic collection, music collection, etc.);
2. is a member of IFLA, CERL, IAML and Technical Committee No. 242 (for Information and Documentation) at the Polish Committee for Standardization;
3. has participated in many research projects (European, international, national, etc.);
4. has a staff team with long-standing experience in digitising printed items and in processing and presenting digital objects;
5. has been digitising its own physical holdings since 2000, launched the Digital Library of University of Wroclaw (DLUW) in 2005 and, in 2013/2014, the university repository (Repository of University of Wroclaw, RUW).
Owing to an appropriate human resources development policy, purchases of optical and electronic equipment and computers (hardware and software), and participation in many projects, the Wroclaw University Library has at its disposal the experienced staff and technological base that enable it to cooperate within the framework of the Impact Centre of Competence in Digitisation.
3. Use Case and Tools
To improve the digitisation workflow in DLUW, tools were needed that could speed up and optimise the processes.
Two tools were tested for that purpose. The first, Scan Tailor, was chosen as the post-processing tool for scanned pages. It performs operations such as page splitting, deskewing and adding or removing borders. It was applied to raw scans and produced pages ready to be printed or assembled into PDF or DjVu files.
The second was Tesseract, an open-source OCR engine that, combined with the Leptonica Image Processing Library, can read a wide variety of image formats and convert them to text in over 60 languages.
Both tools were tested while preparing presentation versions of 12 selected old printed books (from the 16th to the 18th century), all with a single-column text layout, printed in different languages (e.g. Latin, Italian, German, Romance) and with different font types (e.g. Gothic, Roman). The tests were intended to work out the technological line and workflow for digitising, processing and presenting good-quality delivery files in DLUW. For the evaluation, ground truth in plain-text format was used (5 pages from each selected document).
The evaluation was performed by: 1. comparing OCR output with ground truth and measuring the character error rate (CER); 2. comparing OCR output with ground truth and measuring the word error rate (WER); 3. comparing OCR output from different engines.
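The two error rates can be illustrated with a short sketch. It assumes the common definition (edit distance between OCR output and ground truth, divided by the length of the ground truth); the presentation does not say which tool or exact formula was used in the tests, and the function names and sample strings below are hypothetical.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_text):
    """Character error rate: character edits per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth, ocr_text):
    """Word error rate: word edits per ground-truth word (whitespace tokens)."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return levenshtein(ref_words, ocr_text.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical sample: two misread characters in two different words.
    truth = "in principio erat verbum"
    ocr = "in principlo erat uerbum"
    print(f"CER = {cer(truth, ocr):.2%}, WER = {wer(truth, ocr):.2%}")
```

A single misread character counts once in the CER but makes the whole word wrong in the WER, which is why the WER figures are typically the higher of the two.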
4. Use Case and Tools
The research process was carried out on a server in the following 3 steps:
1st step – running Scan Tailor with default settings.
After Scan Tailor had finished processing, the operator had to visually check the results and manually correct wrongly processed files.
This operation made it possible to improve the parameters of the later processing to a satisfactory level. We wanted to obtain the best possible quality of "post master" files for subsequent OCR processing and for aesthetic digital presentation of the originals in DLUW.
2nd step – saving manual corrections on the server. Only the files that had to be corrected by the operator were saved on the server. The remaining results of Scan Tailor's automatic operations were left unchanged. A dedicated website on the server was used to support this step.
3rd step – running Tesseract. The appropriate dictionaries were chosen beforehand. We used only the dictionaries distributed with Tesseract, and no additional training tools were applied. Small font sizes turned out to be a serious problem for Tesseract. In addition, it has no tools for precisely marking out the text layout and separating it from graphic areas. As a result, it attempts to recognise text in graphical objects such as frames, floral ornaments, seals, etc.
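The 3rd step can be illustrated with a small wrapper around the Tesseract command line. This is only a sketch: the file names and the language codes (`lat`, `deu`) are assumptions chosen for illustration, not the settings used in the tests, and the helper names are hypothetical. The `-l` option and the `hocr` output configuration are standard Tesseract command-line features.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages, output_format="hocr"):
    """Build the argument list for one Tesseract run.

    languages: list of language codes for installed language data
               (several codes are joined with '+', e.g. 'lat+deu').
    output_format: 'hocr' for hOCR output, 'txt' for the default plain text.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    cmd = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if output_format == "hocr":
        cmd.append("hocr")  # config name that switches the output to hOCR
    elif output_format != "txt":
        raise ValueError(f"unsupported output format: {output_format}")
    return cmd


def run_ocr(image_path, output_base, languages, output_format="hocr"):
    """Run Tesseract; requires the 'tesseract' binary to be installed."""
    cmd = build_tesseract_command(image_path, output_base, languages, output_format)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    # Hypothetical file names; prints the command instead of running it.
    print(" ".join(build_tesseract_command("page_001.tif", "page_001", ["lat"])))
```

Keeping command construction separate from execution makes it easy to log or review the exact call for each page before a batch is started on the server.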
5. Evaluation Results
Implementing a new solution that integrates dispersed digitisation and data-processing steps can significantly decrease the costs and increase the efficiency of creating digital resources in DLUW. The tests carried out on Scan Tailor and Tesseract are of great importance for preparing and organising a technological line for data processing in the cloud. Procedures and interfaces still have to be worked out that allow our staff to support the remote processes.
Scan Tailor can carry out the following tasks automatically and efficiently: splitting master files into single pages, rotating the split pages to level the text (deskewing), removing margins and rejecting artifacts, and generating files prepared for the OCR process. The only problem is correct recognition of the text area; because of it, this task cannot be completed automatically without a control step. This imperfection does not disqualify Scan Tailor, and it will be used at Wroclaw University Library (WUL) as an important data-processing tool.
Tesseract seems to be a very promising tool, and trials will certainly be made to implement it in support of the digitisation of selected types of library materials. It is essential, however, to refine and improve the quality of document layout analysis as well as the recognition of graphical elements and small fonts.
6. Evaluation Results
The results of text recognition can be saved as "txt" or "hocr" files. An "hocr" file contains the recognised text, its location relative to the original image, and style information. These data are saved as XML in the form of an HTML or XHTML file.
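To show what this looks like in practice, the sketch below reads recognised words and their bounding boxes from an "hocr" fragment using only the Python standard library. The sample markup is a hand-written, simplified fragment in the style of Tesseract's hOCR output (word elements with class `ocrx_word` and a `bbox` entry in the `title` attribute); the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Hand-written, simplified hOCR fragment (illustrative values only).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1200 1800'>
 <span class='ocr_line' title='bbox 100 200 900 250'>
  <span class='ocrx_word' title='bbox 100 200 180 250; x_wconf 91'>In</span>
  <span class='ocrx_word' title='bbox 200 200 460 250; x_wconf 84'>principio</span>
 </span>
</div>
"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        tokens = prop.split()
        if len(tokens) >= 5 and tokens[0] == "bbox":
            return tuple(int(v) for v in tokens[1:5])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every element of class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

Because each word keeps its position on the page image, the same data can be used both for proofreading a single page and for placing a text layer in a PDF or DjVu file.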
For archiving purposes, "hocr" files seem to be a good storage format. Each "hocr" file is assigned to a specific image file. This makes it possible to correct individual pages of a document, so the correction process can be organised more flexibly. Hybrid publications (PDF, DjVu) can be created automatically by the server, and "hocr" files can serve as a basis for the further preparation of electronic publications. We see the potential of this solution and intend to use the tools created during the project in the near future.
Additionally, while carrying out another project on the processing of 19th-century newspapers printed in Gothic fonts, we obtained very satisfactory OCR results with Tesseract applied to objects converted to 1-bit (black-and-white) versions: http://www.bibliotekacyfrowa.pl/publication/59368.
We also repeated the recognition of samples of object 319708 from the prepared monochrome (1-bit) files. Tesseract then achieved a CER of 7.80% and a WER of 19.67%, compared with a CER of 20.58% and a WER of 35.56% for Tesseract in our final report.
Thus, creating a good black-and-white image turned out to be an essential element that has a very positive effect on OCR results.
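The presentation does not say how the 1-bit images were produced. As an illustration only, the sketch below uses Otsu's method, a widely used global thresholding technique, to turn greyscale pixel values (0 to 255) into a black-and-white image. The choice of method, the function names and the tiny sample image are all assumptions.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) that maximises between-class variance.

    Pixels with value <= t are treated as dark (ink), the rest as light (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue  # one class is empty, variance is undefined
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(rows):
    """Convert rows of grey values to 1-bit rows: 0 = black, 1 = white."""
    t = otsu_threshold([p for row in rows for p in row])
    return [[0 if p <= t else 1 for p in row] for row in rows]


if __name__ == "__main__":
    # Hypothetical 2x3 greyscale patch: dark ink on light paper.
    patch = [[30, 40, 200],
             [210, 35, 220]]
    print(binarize(patch))
```

A single global threshold is the simplest option; unevenly lit or stained pages of old prints may need a locally adaptive method instead.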