Slides of the paper Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software by Kimmo Kettunen, Teemu Ruokolainen, Erno Liukkonen, Pierrick Tranouez, Daniel Antelme and Thierry Paquet at the 3rd Edition of the DATeCH2019 International Conference
1. Detecting Articles in a Digitized Finnish
Historical Newspaper Collection 1771–1929:
Early Results Using the PIVAJ Software
Kimmo Kettunen, Teemu Pääkkönen, Erno Liukkonen
The National Library of Finland, Mikkeli unit,
DH Projects
Pierrick Tranouez, Daniel Antelme, Thierry Paquet
LITIS laboratory
University of Rouen Normandy
France
2. Background
• The National Library of Finland (NLF) has a large
digitized collection of historical periodicals from 1771–1929,
openly available on the Web (digi.kansalliskirjasto.fi): 7.58 M pages,
mainly in Finnish and Swedish
• The data has been digitized on page basis, and there is
no article structure in the data
• The page is not an informational unit for the user;
the article or news item is
• How to produce articles from the existing content?
DATeCH2019, May 8–10, 2019, Brussels, Belgium
3. An example: a page from Päivälehti from 1904: 8
columns, a relatively clear layout with some ads
4. An example: same issue’s ad pages
5. Messy data
• The layout of the pages varies quite a lot even inside one
issue of a newspaper and layouts keep changing every
3-5 years in a single newspaper
• Different newspapers may also have different styles,
even though they usually follow the same principles
6. Article extraction out of the pages: state-of-the-art
results
• Article extraction from digitized newspaper pages with complex
layouts is not an easy task. Results of the biennial
ICDAR competition on historical newspaper layout
analysis show that current algorithms segment and label
about 80–85 % of the pages correctly at best.
7. Article extraction out of the pages: state-of-the-art
results
• The latest round of ICDAR competition, ICDAR 2017, considered
comparative evaluation of page segmentation and region
classification methods for documents with complex layouts.
The presented results covered the evaluation of seven
methods: five submitted to the competition, and two state-of-the-art
systems: commercial ABBYY FineReader® Engine 11 and
open-source Tesseract 3.04. In a combined task of segmentation and
classification of page contents, all but one of the systems remain in
the range of 72.5 to 83% correctness.
• The best performing system, MHS 2017, joint work of Ho Chi Minh
City University of Technology (Ho Chi Minh City, Viet Nam) and
Chonnam National University (Gwangju, Republic of Korea),
achieves a clearly better performance of 90.6%.
8. Tools for the work
• There are several existing tools for article extraction,
both commercial and research outputs
• After some groundwork examining the tools, we chose
PIVAJ, a machine learning based platform developed at the
LITIS laboratory of the University of Rouen Normandy
9. PIVAJ in brief
• PIVAJ has two parts: an online application and an
offline system. The offline system analyzes newspapers’
digitization images in order to rebuild their logical
structure, from issues to sections to articles.
• The online system allows for the display of the resulting
analysis on the Web, as well as additional functionalities
such as transcription corrections.
• We use only PIVAJ’s offline system for newspaper image
analysis and article marking.
10. Training data
• From a user perspective, PIVAJ’s article annotation system is
machine learning software that is first trained on sample page
images annotated for layout structure with entities such as
title, body, vertical separator, horizontal separator, pictures etc.
• We chose one of our most used newspapers, Uusi Suometar, to work
with
• We established a collection of 224 pages from 1869-1898: 56
issues, 4 pages each
• This data was marked with PIVAJ’s editor
• 168 pages of the data were used for training and 56 pages for
evaluation
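The page counts above can be checked with a small sketch. It assumes the split was made by whole issues (42 for training, 14 for evaluation) and at random; the slides state only the page totals, so the issue-level split, the shuffling and the names `split_issues` and `N_EVAL_ISSUES` are illustrative assumptions, not the authors' procedure.

```python
import random

PAGES_PER_ISSUE = 4   # from the slide: 56 issues, 4 pages each
N_ISSUES = 56
N_EVAL_ISSUES = 14    # assumption: 14 whole issues x 4 pages = 56 evaluation pages


def split_issues(issue_ids, n_eval, seed=0):
    """Shuffle issue ids and hold out n_eval whole issues for evaluation."""
    ids = list(issue_ids)
    random.Random(seed).shuffle(ids)
    return ids[n_eval:], ids[:n_eval]


train_issues, eval_issues = split_issues(range(N_ISSUES), N_EVAL_ISSUES)
train_pages = len(train_issues) * PAGES_PER_ISSUE
eval_pages = len(eval_issues) * PAGES_PER_ISSUE
print(train_pages, eval_pages)
```

Splitting by issue rather than by page keeps all pages of one issue on the same side of the split, so the evaluation pages do not share an issue layout with the training pages.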
11. Usage statistics for the top 20 newspapers in Digi:
Uusi Suometar (US) is the most used Finnish-language newspaper
12. A marked page for PIVAJ to learn
13. Training: effect of the number of columns
• We first trained an individual model on the issues
containing 3 columns in the training set and evaluated
the model on the issue in the development set with 3
columns. Subsequently, we repeated the same for the
rest of the column numbers from 4 to 9.
• PIVAJ was able to learn and provide meaningful
predictions (measured by visual inspection) on the
development issues.
14. Training: effect of the number of columns
• We then trained a single model using all the 21 issues with varying
column numbers and evaluated the resulting model on all the 7
issues in the development set.
• Our hypothesis was that if PIVAJ had difficulty
handling the varying number of columns, we would observe
anomalies during training or, more importantly, in the predictions on
the development set. However, we saw no evidence of this: PIVAJ
again provided meaningful predictions and showed no undesirable
behavior on the development set (measured by visual inspection).
• Therefore, we adopted the single model approach for the primary
experiments conducted on all 56 issues.
15. Evaluation of results
• We trained PIVAJ successfully, ran the experiments and evaluated
the physical segmentation results using the layout evaluation
software developed by the PRImA research laboratory of the University
of Salford, which can be applied after converting the ALTO XML
structure used by PIVAJ to PAGE XML.
• The PRImA software has been employed for evaluation of the
biennial ICDAR competitions (2011/13/15/17).
• However, the specifics of the official competition evaluation are not
publicly available and, thus, the competition evaluation is not
replicable. Therefore, we instead followed the evaluation presented
in Clausner et al. 2011 with three evaluation scenarios
• https://www.primaresearch.org/tools
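The ALTO-to-PAGE conversion step mentioned above can be illustrated with a minimal sketch. It is not the converter used in this work: it only maps the rectangle of each ALTO `TextBlock` (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) to a PAGE-style `TextRegion` with a `Coords points` polygon, assumes the ALTO measurements are already in pixels, and leaves out namespaces and all other PAGE metadata. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def rect_to_points(hpos, vpos, width, height):
    """Turn an ALTO rectangle into a PAGE-style polygon string (clockwise)."""
    x0, y0 = round(hpos), round(vpos)
    x1, y1 = round(hpos + width), round(vpos + height)
    return f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"


def alto_blocks_to_page_regions(alto_xml):
    """Return one TextRegion element per ALTO TextBlock found in the input."""
    root = ET.fromstring(alto_xml)
    regions = []
    for el in root.iter():
        # Compare the local tag name so that any ALTO namespace version works.
        if el.tag.rsplit("}", 1)[-1] != "TextBlock":
            continue
        box = [float(el.attrib[k]) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        region = ET.Element("TextRegion", id=el.attrib.get("ID", ""))
        ET.SubElement(region, "Coords", points=rect_to_points(*box))
        regions.append(region)
    return regions


# Hypothetical one-block ALTO fragment, for illustration only.
SAMPLE = (
    '<alto><Layout><Page><PrintSpace>'
    '<TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="50"/>'
    '</PrintSpace></Page></Layout></alto>'
)
regions = alto_blocks_to_page_regions(SAMPLE)
print(ET.tostring(regions[0], encoding="unicode"))
```

A real conversion would also have to carry over region types (title, body, separator, picture), reading order and the page-level metadata that the PAGE schema requires.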
16. Different evaluation scenarios
• The General recognition scenario is used to measure the pure
segmentation performance. Therefore, misclassification errors are
ignored completely. Miss and partial miss errors are considered worst
and have the highest weights. The weights for merge and split errors
are set to 50%, whereas false detection, as the least important error
type, has a weight of only 10%.
• The Text structure scenario evaluates region classification, in the
context of a typical OCR system, focusing primarily on text but not
ignoring the non-text regions. Accordingly, this profile is similar to
the first but misclassification of text is weighted highest and all
other misclassification weights are set to 10%.
• The third scenario, Indexing, is based on the OCR profile but
focuses solely on text, ignoring non-text regions.
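The weighting idea behind the General recognition scenario can be sketched as a small profile. This is a simplification for illustration, not the PRImA evaluation itself: the slide says only that miss and partial miss errors have "the highest weights", so the value 1.0 is an assumption, and the error amounts in the example are made up. The names `GENERAL_RECOGNITION` and `weighted_error` are hypothetical.

```python
# Weights per error type for the General recognition scenario.
GENERAL_RECOGNITION = {
    "miss": 1.0,               # assumption: "highest weight" taken as 100%
    "partial_miss": 1.0,       # assumption: "highest weight" taken as 100%
    "merge": 0.5,              # from the slide: 50%
    "split": 0.5,              # from the slide: 50%
    "false_detection": 0.1,    # from the slide: 10%
    "misclassification": 0.0,  # from the slide: ignored completely
}


def weighted_error(error_amounts, weights):
    """Sum the per-type error amounts, each scaled by its profile weight."""
    return sum(w * error_amounts.get(kind, 0.0) for kind, w in weights.items())


# Made-up error amounts for one page, for illustration only.
page_errors = {"merge": 10, "miss": 4, "false_detection": 20, "misclassification": 30}
print(weighted_error(page_errors, GENERAL_RECOGNITION))
```

With this profile the 30 units of misclassification contribute nothing, which is what makes the scenario a measure of segmentation alone. The other two scenarios would use the same function with different weight tables.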
17. Evaluation results: on average, the three evaluation scenarios
reach success rates of 67.9%, 76.1%, and 92.2% over the whole data set
of 56 pages.
[Chart: success rate (scale 0–100) per evaluated issue, arithmetic mean
over pages, for the General recognition, Text structure and Indexing
scenarios. Issues: 1872/01/05, 1873/01/07, 1877/06/04, 1877/07/02,
1880/01/05, 1880/06/02, 1886/06/05, 1887/01/05, 1889/07/11, 1889/08/01,
1893/07/04, 1894/01/09, 1898/01/08, 1898/07/03]
18. Examples: a complex page successfully divided, recognition rates
of 81.86, 89.95, and 96.54 - above the averages
19. Examples: bad performance, recognition rates of 57.4,
73.02, and 78.73 - clearly below the averages
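The reported averages are arithmetic means over pages, computed separately for each scenario. As a small worked sketch, the same aggregation is applied here to the two example pages only. The triples are kept in the order given on the slides, which this sketch assumes is the same scenario order for both pages. The function name `mean_per_scenario` is hypothetical.

```python
from statistics import mean

# Per-page success rates of the two example pages, in slide order.
example_pages = [
    (81.86, 89.95, 96.54),  # the complex page that was divided successfully
    (57.4, 73.02, 78.73),   # the page with bad performance
]


def mean_per_scenario(page_rates):
    """Average each scenario's success rate over all pages."""
    return tuple(mean(column) for column in zip(*page_rates))


for scenario_index, value in enumerate(mean_per_scenario(example_pages), start=1):
    print(f"scenario {scenario_index}: {value:.2f}")
```

Run over all 56 evaluation pages instead of these two, the same aggregation would give the averages reported for the whole data set.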
20. Results overall are fair, if not remarkable
• If a page contains longer sections made up of short news items
of a few lines each, the individual items are not extracted well;
only the larger section is found, unless the short items have
clear starts, e.g. in bold.
• The mean quality of the results is lowered by PIVAJ’s behavior
on advertisement heavy pages. LITIS is currently working on
new solutions to overcome this weakness.
• On the whole, the results seem reasonable considering the
varying layouts of the different issues of Uusi Suometar along
the time scale of the data.
• N.B. a quarter or even more of the pages of a US issue can
consist of advertisements
21. Comparison with the performance of docWorks on the same
pages: docWorks finds 150 articles, PIVAJ 1013
[Chart: number of articles found per issue (scale 0–170) by docWorks
and by PIVAJ. Issues: 1872/01/05, 1873/01/07, 1877/06/04, 1877/07/02,
1880/01/05, 1880/06/02, 1886/06/05, 1887/01/05, 1889/07/11, 1889/08/01,
1893/07/04, 1894/01/09, 1898/01/08, 1898/07/03]
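The two article totals can be put in proportion with a few lines of arithmetic. The per-page figures assume that "the same pages" means the 56 evaluation pages, which the slide does not state explicitly.

```python
DOCWORKS_ARTICLES = 150   # from the slide
PIVAJ_ARTICLES = 1013     # from the slide
PAGES = 56                # assumption: the 56 evaluation pages

ratio = PIVAJ_ARTICLES / DOCWORKS_ARTICLES
pivaj_per_page = PIVAJ_ARTICLES / PAGES
docworks_per_page = DOCWORKS_ARTICLES / PAGES

print(f"PIVAJ finds {ratio:.2f} times as many articles as docWorks")
print(f"articles per page: PIVAJ {pivaj_per_page:.1f}, docWorks {docworks_per_page:.1f}")
```

The counts alone show how finely each tool segments the pages; they do not show which segmentation is closer to the true article structure.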
22. Benefits for the user: clippings (so far manual and
image only)
23. A way to use extracted articles: automated clippings
for the user
24. Continuation
• A single newspaper like Uusi Suometar can consist of a
lot of pages: 86 068 pages from 1869–1918, so it takes some
time to run these through PIVAJ
• After that we shall start work with other popular
newspapers like Aamulehti, Satakunnan Kansa etc.
• The next development task may be a new PIVAJ page model
for new newspapers (not yet known).