Detecting Articles in a Digitized Finnish
Historical Newspaper Collection 1771–1929:
Early Results Using the PIVAJ Software
Kimmo Kettunen, Teemu Pääkkönen, Erno Liukkonen
The National Library of Finland, Mikkeli unit,
DH Projects
Pierrick Tranouez, Daniel Antelme, Thierry Paquet
LITIS laboratory
University of Rouen Normandy
France
Background
• The National Library of Finland (NLF) has a large
digitized collection of historical periodicals from 1771-1929,
openly available on the Web (digi.kansalliskirjasto.fi):
7.58 M pages, mainly in Finnish and Swedish
• The data has been digitized page by page, and it
contains no article structure
• The page is not an informational unit for the user;
the article or news item is
• How to produce articles from the existing content?
DATeCH2019, May 8–10, 2019, Brussels, Belgium
An example: a page from Päivälehti from 1904: 8
columns, a relatively clear layout with some ads
An example: ad pages from the same issue
Messy data
• The layout of the pages varies considerably even within one
issue of a newspaper, and the layout of a single newspaper
changes every 3-5 years
• Different newspapers may also have different styles,
even though they usually follow the same principles
Article extraction out of the pages: state-of-the-art
results
• Article extraction from digitized newspaper pages with
complex layouts is not an easy task. Results of the biennial
ICDAR competition on historical newspaper layout
analysis show that current algorithms segment and label
about 80–85 % of the pages correctly at best.
Article extraction out of the pages: state-of-the-art
results
• The latest round of the ICDAR competition, ICDAR 2017, was a
comparative evaluation of page segmentation and region
classification methods for documents with complex layouts.
The results covered seven methods: five submitted to the
competition and two state-of-the-art systems, the commercial
ABBYY FineReader® Engine 11 and the open-source Tesseract 3.04.
In the combined task of segmenting and classifying page contents,
all but one of the systems remain in the range of 72.5 to 83%
correctness.
• The best performing system, MHS 2017, joint work of Ho Chi Minh
City University of Technology (Ho Chi Minh City, Viet Nam) and
Chonnam National University (Gwangju, Republic of Korea),
achieves a clearly better performance of 90.6%.
Tools for the work
• There are several existing tools for article extraction,
both commercial products and research outputs
• After some groundwork examining the tools, we chose
PIVAJ, a machine-learning-based platform developed at the
LITIS laboratory of the University of Rouen Normandy
PIVAJ in brief
• PIVAJ has two parts: an online application and an offline
system. The offline system analyzes the digitized images of
newspapers in order to rebuild their logical
structure, from issues to sections to articles.
• The online system displays the resulting
analysis on the Web and offers additional functions
such as transcription correction.
• We use only PIVAJ’s offline system for newspaper image
analysis and article marking.
Training data
• From a user perspective, the PIVAJ article annotation system is
machine learning software that is first taught with sample page
images that have been marked for layout structure with entities such
as title, body, vertical separator, horizontal separator, and picture.
• We chose one of our most used newspapers, Uusi Suometar, to work
with
• We established a collection of 224 pages from 1869-1898: 56
issues, 4 pages each
• This data was marked with PIVAJ’s editor
• 168 pages of the data were used for training and 56 pages for
evaluation
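The page counts above can be reproduced with a small sketch. It assumes, purely for illustration, that the split is made by whole issues (42 issues for training, 14 for evaluation) so that the four pages of an issue stay together; the slides give only the page totals. The issue identifiers, the random seed, and the function name are hypothetical.

```python
import random

PAGES_PER_ISSUE = 4  # from the slides: 56 issues, 4 pages each


def split_by_issue(issue_ids, n_eval_issues, seed=0):
    """Split issue identifiers into training and evaluation sets.

    Assumption (not stated in the slides): whole issues are assigned to
    one side, so all pages of an issue end up in the same set.
    """
    ids = sorted(issue_ids)
    random.Random(seed).shuffle(ids)
    eval_ids = set(ids[:n_eval_issues])
    train_ids = set(ids[n_eval_issues:])
    return train_ids, eval_ids


# Hypothetical identifiers standing in for the 56 annotated issues.
issues = [f"issue_{i:02d}" for i in range(56)]
train, evaluation = split_by_issue(issues, n_eval_issues=14)

train_pages = len(train) * PAGES_PER_ISSUE
eval_pages = len(evaluation) * PAGES_PER_ISSUE
print(train_pages, eval_pages)
```

Splitting by issue rather than by page keeps pages with the same layout period from leaking between the two sets, which is one plausible reason to organise the data this way.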
Usage statistics for the top 20 newspapers in Digi:
Uusi Suometar (US) is the most used Finnish-language title
An annotated page for PIVAJ to learn from
Training: effect of the number of columns
• We first trained an individual model on the issues
with 3 columns in the training set and evaluated
the model on the 3-column issue in the development set.
We then repeated the same procedure for each of the
other column counts, from 4 to 9.
• PIVAJ was able to learn and to provide meaningful
predictions (judged by visual inspection) on the
development issues.
Training: effect of the number of columns
• We then trained a single model using all 21 issues with varying
column numbers and evaluated the resulting model on all 7
issues in the development set.
• Our hypothesis was that if PIVAJ had difficulty
handling the varying number of columns, we would observe
anomalies during training or, more importantly, in the predictions on
the development set. We saw no evidence of this: PIVAJ again
provided meaningful predictions and showed no undesirable
behavior on the development set (judged by visual inspection).
• Therefore, we adopted the single-model approach for the primary
experiments conducted on all 56 issues.
Evaluation of results
• We trained PIVAJ successfully, ran the experiments, and evaluated
the physical segmentation results with the layout evaluation
software developed by the PRImA research laboratory of the University
of Salford. The software can be applied after converting the ALTO XML
structure used by PIVAJ to PAGE XML.
• The PRImA software has been used to evaluate the
biennial ICDAR competitions (2011/13/15/17).
• However, the specifics of the official competition evaluation are not
publicly available, so the competition evaluation cannot be
replicated. We therefore followed the evaluation presented
in Clausner et al. 2011, with three evaluation scenarios
• https://www.primaresearch.org/tools
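The ALTO to PAGE conversion step mentioned above can be illustrated with a minimal sketch. This is not the converter used in the work; it only copies the bounding box of each ALTO `TextBlock` into a PAGE `TextRegion` with a rectangular `Coords` polygon. It assumes the ALTO coordinates are already in pixels, matches ALTO elements by local name so that any ALTO schema version is accepted, and uses the 2013-07-15 PAGE namespace as an assumed target version. The function name and the default image filename are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed target schema version for the PAGE output.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def local(tag):
    """Return an element's tag name without its namespace part."""
    return tag.rsplit("}", 1)[-1]


def alto_blocks_to_page(alto_xml, image_filename="page.png"):
    """Convert ALTO TextBlock boxes into PAGE TextRegion rectangles.

    Assumption: HPOS/VPOS/WIDTH/HEIGHT are given in pixels.
    """
    root = ET.fromstring(alto_xml)
    alto_page = next(el for el in root.iter() if local(el.tag) == "Page")

    ET.register_namespace("", PAGE_NS)  # serialise with a default namespace
    pc_gts = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page = ET.SubElement(
        pc_gts,
        f"{{{PAGE_NS}}}Page",
        imageFilename=image_filename,
        imageWidth=str(int(float(alto_page.get("WIDTH")))),
        imageHeight=str(int(float(alto_page.get("HEIGHT")))),
    )

    blocks = [el for el in root.iter() if local(el.tag) == "TextBlock"]
    for i, block in enumerate(blocks):
        x = int(float(block.get("HPOS")))
        y = int(float(block.get("VPOS")))
        w = int(float(block.get("WIDTH")))
        h = int(float(block.get("HEIGHT")))
        region = ET.SubElement(
            page, f"{{{PAGE_NS}}}TextRegion", id=block.get("ID", f"r{i}")
        )
        # Rectangle corners in clockwise order, starting at the top left.
        points = f"{x},{y} {x + w},{y} {x + w},{y + h} {x},{y + h}"
        ET.SubElement(region, f"{{{PAGE_NS}}}Coords", points=points)

    return ET.tostring(pc_gts, encoding="unicode")
```

A real conversion would also need to carry over region types (title, body, separators, pictures) and any unit conversion, which this sketch leaves out.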
Different evaluation scenarios
• General recognition measures pure segmentation performance, so
misclassification errors are ignored completely. Miss and partial
miss errors are considered worst and have the highest weights.
Merge and split errors are weighted at 50%, whereas false
detection, the least important error type, is weighted at only 10%.
• Text structure evaluates region classification in the context of a
typical OCR system, focusing primarily on text without ignoring
non-text regions. This profile is similar to General recognition,
but misclassification of text is weighted highest and all other
misclassification weights are set to 10%.
• Indexing, the third scenario, is based on the OCR profile but
focuses solely on text, ignoring non-text regions.
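How such error weights turn into a score can be shown with a toy calculation. This is not the PRImA evaluation formula, which is considerably more detailed; it only illustrates the idea of weighting error types differently per scenario. The 50% and 10% weights come from the slide, while the value 1.0 for the "highest" weights, the error-fraction inputs, and all names are assumptions made for illustration.

```python
# Weights per error type. 0.5 and 0.1 are from the slide; 1.0 for the
# "highest" weights is an assumption made for this illustration.
GENERAL_RECOGNITION = {
    "miss": 1.0,
    "partial_miss": 1.0,
    "merge": 0.5,
    "split": 0.5,
    "false_detection": 0.1,
    "text_misclassification": 0.0,   # misclassification is ignored
    "other_misclassification": 0.0,
}

# Same as above, but misclassification now counts.
TEXT_STRUCTURE = dict(
    GENERAL_RECOGNITION,
    text_misclassification=1.0,
    other_misclassification=0.1,
)


def toy_success_rate(error_fractions, weights):
    """Toy score: 100 * (1 - weighted sum of error fractions), floored at 0.

    error_fractions maps an error type to the fraction of the page area
    affected by it (hypothetical input, between 0 and 1).
    """
    penalty = sum(
        weights.get(kind, 0.0) * fraction
        for kind, fraction in error_fractions.items()
    )
    return 100.0 * (1.0 - min(1.0, penalty))


# Hypothetical page: the same errors scored under two scenarios.
errors = {
    "miss": 0.05,
    "merge": 0.10,
    "false_detection": 0.20,
    "text_misclassification": 0.10,
}
general = toy_success_rate(errors, GENERAL_RECOGNITION)
text_structure = toy_success_rate(errors, TEXT_STRUCTURE)
print(round(general, 1), round(text_structure, 1))
```

The same page scores lower under the text structure weights than under general recognition, because the misclassified text only counts in the former.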
Evaluation results: on average, the three evaluation scenarios
reach success rates of 67.9, 76.1, and 92.2 % over the whole
evaluation set of 56 pages.
[Chart: success rate (0–100) per issue, arithmetic mean over pages, for the General recognition, Text structure, and Indexing scenarios; 14 issues dated 1872/01/05 to 1898/07/03]
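The per-issue values in the chart are arithmetic means over the pages of each issue. A minimal sketch of that aggregation follows; the page-level scores in the example are invented placeholders, not figures from the evaluation, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_rate_per_issue(page_scores):
    """Average page-level success rates per issue.

    page_scores is an iterable of (issue_date, success_rate) pairs,
    one pair per evaluated page.
    """
    by_issue = defaultdict(list)
    for issue_date, rate in page_scores:
        by_issue[issue_date].append(rate)
    return {issue: mean(rates) for issue, rates in sorted(by_issue.items())}


# Placeholder scores for two issues (hypothetical values).
scores = [
    ("1872/01/05", 80.0),
    ("1872/01/05", 90.0),
    ("1873/01/07", 60.0),
    ("1873/01/07", 70.0),
]
per_issue = mean_rate_per_issue(scores)
print(per_issue)
```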
Examples: a complex page successfully divided, recognition rates
of 81.86, 89.95, and 96.54 - above average
Examples: bad performance, recognition rates of 57.4,
73.02, and 78.73 - clearly below the averages
Results overall are fair, if not remarkable
• If a page contains long sections made up of short news
items of only a few lines, the individual items are not extracted
well; only the larger section is found, unless the short items
have clear starts, e.g. in bold.
• The mean quality of the results is lowered by PIVAJ’s behavior
on advertisement-heavy pages. LITIS is currently working on
new solutions to overcome this weakness.
• On the whole, the results seem reasonable considering how much
the layouts of the different issues of Uusi Suometar vary over
the time span of the data.
• N.B. a quarter or more of the pages of an issue of Uusi Suometar
can consist of advertisements
Comparison to performance of docWorks with the same
pages: docWorks finds 150 articles, PIVAJ 1013
[Chart: number of articles found per issue (axis 0–170) by docWorks and by PIVAJ; 14 issues dated 1872/01/05 to 1898/07/03]
Benefits for the user: clippings (so far manual and
image only)
A way to use extracted articles: automated clippings for the
user
Continuation
• A single newspaper like Uusi Suometar can consist of a
lot of pages: 86,068 pages from 1869-1918, so it takes some
time to run these through PIVAJ
• After that we shall start work on other popular
newspapers such as Aamulehti and Satakunnan Kansa.
• The next development task may be a
new page model for PIVAJ for new newspapers (not
known yet).
Thank you!
Kimmo.Kettunen@helsinki.fi
The National Library of Finland, DH Projects
