Slides for the paper "Towards the Extraction of Statistical Information from Digitised Numerical Tables: The Medical Officer of Health Reports Scoping Study" by Christian Clausner, Apostolos Antonacopoulos, Christy Henshaw and Justin Hayes, presented at DATeCH 2019, the 3rd edition of the DATeCH International Conference
1. Towards the Extraction of Statistical Information
from Digitised Numerical Tables
The Medical Officer of Health Reports Scoping Study
Christian Clausner, Apostolos Antonacopoulos,
Christy Henshaw, Justin Hayes
University of Salford
Wellcome Collection
25/09/2019, DATeCH 2019, Brussels
2. The Medical Officer of Health Reports
• Wellcome Collection holds UK’s
largest collection of Medical
Officer of Health reports
• 130 years
• Over 70,000 reports
• All digitised and OCRed
https://wellcomelibrary.org/moh/
3. The Medical Officer of Health Reports
• Narrative textual content + tabular content
• Topics:
• Birth and death statistics
• Notifiable diseases
• General population statistics
• Causes of death
• School health
• Food inspections
• …
4. The Medical Officer of Health Reports
• OCRed and post-corrected data
available for Greater London
• Individual tables provided in
special format
• Statistical data difficult to
extract
5. Current Practices
• Standard OCR not sufficient for
extraction of numerical data
• Need accuracy for values AND
context (column / row)
• Common:
• Only indexing and providing access
to images with tables
• Manual correction and provision of
tables in dedicated formats
• Rare / very difficult or expensive:
• Full extraction and integration to
provide faceted searches / data
analysis etc.
[Image: 1961 Census of England and Wales]
6. The MOH Scoping Study (2018)
• Gain understanding of tabular
data available in the reports
• Investigate ways of data
extraction
• Scope out users’ needs and
expectations
• Based on Greater London data
7. Identification of table topics
• Text-based analysis of table
captions and headers
• Grouping instances by text
similarity
• Using a tool that was created for
social media analysis
Topic                                        Table count (approx.)
Mortality / Cause of Death                   2530
General statistics / demographics            1900
Infectious Diseases / Notifiable Diseases    1720
Inspections / conditions                     4360
Minor ailments, dental, etc.                 710
Financial                                    470
Food                                         330
Births                                       240
Meteorological                               100
Legal                                        190
Immunisation                                 60
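The grouping of captions and headers by text similarity can be illustrated with a small sketch. The study used a tool originally built for social media analysis, which the slides do not name; the code below swaps in Python's standard `difflib` purely for illustration. The function names, the 0.8 threshold and the sample captions are all assumptions.

```python
from difflib import SequenceMatcher


def normalise(caption: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(caption.lower().split())


def group_by_similarity(captions, threshold=0.8):
    """Greedy grouping: a caption joins the first group whose first member
    is similar enough, otherwise it starts a new group.
    The threshold is an assumed value, not one taken from the study."""
    groups = []
    for caption in captions:
        text = normalise(caption)
        for group in groups:
            ratio = SequenceMatcher(None, normalise(group[0]), text).ratio()
            if ratio >= threshold:
                group.append(caption)
                break
        else:
            groups.append([caption])
    return groups


# Hypothetical captions, invented for this example.
captions = [
    "Causes of death during the year 1925",
    "Causes of Death during the year 1931",
    "Notifiable diseases notified in 1925",
    "Notifiable diseases notified in 1930",
]
groups = group_by_similarity(captions)
```

Comparing each caption only against the first member of a group keeps the sketch short; a real pipeline would more likely use a proper clustering step.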
8. Identification of table topics
• Geographies:
• Mostly districts
• Also smaller areas (sub-districts,
wards)
• Considerable variety of
• Information content
• Physical structure
• Across many
• Locations
• Years
§ Demographics
§ Age
§ Sex
§ Births
§ Deaths
§ Causes of death
§ Infant death
§ Ailments
§ Diseases
§ Infectious diseases
§ Notifiable diseases
§ Immunisations
§ Environmental
§ Inspections
§ Food
§ Conditions
§ Meteorological
§ Financial
§ Legal
9. Extraction of tabular data
• Can remaining data be extracted
in a less costly way?
• Available for experiments:
• OCR results in ALTO XML format
(Greater London)
• Ran ABBYY FineReader Engine 11
ourselves
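As a rough idea of what working with the ALTO XML results involves, the sketch below reads recognised words and their page coordinates from an ALTO document. `String`, `CONTENT`, `HPOS` and `VPOS` are part of the ALTO schema; the sample document, the function name and the choice to ignore the namespace version are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample in the ALTO v2 namespace; real MOH files are far larger.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine>
        <String CONTENT="Births" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
        <String CONTENT="1,234" HPOS="400" VPOS="200" WIDTH="60" HEIGHT="20"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def read_words(alto_xml: str):
    """Return (text, x, y) for every String element, whatever the ALTO version."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Tags look like '{namespace}String'; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "String":
            words.append((element.get("CONTENT"),
                          float(element.get("HPOS")),
                          float(element.get("VPOS"))))
    return words


words = read_words(SAMPLE_ALTO)
```

The coordinates are what make it possible to assign a recognised number to a row and column, which plain text output cannot do.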
10. Extraction of tabular data
• Tests with ABBYY FineReader
• Very inconsistent results
• But column and row headers
sufficiently recognised
11. Extraction of tabular data
• Prototype: Flexible matching to
locate rows and columns of
interest
• Ignore other data that is less
consistent
• Order of headers usually stable
across geographies
• Variation across the years, but
doable
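One way to picture flexible matching is fuzzy comparison of the wanted header text against the OCRed headers, so that small recognition errors do not prevent a hit. This is a minimal sketch and not the prototype itself: the function names, the 0.7 threshold and the sample table are hypothetical, and `difflib` stands in for whatever matching the prototype actually uses.

```python
from difflib import SequenceMatcher


def best_match(target, candidates, min_ratio=0.7):
    """Return the index of the candidate most similar to target, or None
    if nothing reaches min_ratio (an assumed threshold)."""
    best_index, best_ratio = None, min_ratio
    for index, candidate in enumerate(candidates):
        ratio = SequenceMatcher(None, target.lower(), candidate.lower()).ratio()
        if ratio >= best_ratio:
            best_index, best_ratio = index, ratio
    return best_index


def extract_cell(table, row_name, column_name):
    """table[0] holds the column headers; the first cell of every row
    is treated as that row's header."""
    column = best_match(column_name, table[0])
    row = best_match(row_name, [cells[0] for cells in table])
    if row is None or column is None:
        return None
    return table[row][column]


# Invented OCR output with typical character confusions (1 for l, b for h).
ocr_table = [
    ["Cause of Death", "Ma1es", "Fema1es", "Tota1"],
    ["Measles", "3", "2", "5"],
    ["Whooping Cougb", "4", "1", "5"],
    ["Scarlet Fever", "0", "1", "1"],
]
value = extract_cell(ocr_table, "Whooping Cough", "Females")
missing = extract_cell(ocr_table, "Tuberculosis", "Females")
```

Only the requested row and column are located; every other cell is ignored, in line with the idea of skipping less consistent data.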
12. Extraction of tabular data
• Large proportion of tabular data
could be extracted in an automated
way
• Quality assurance using row /
column totals and geographical
summations
• OCR quality good enough
• Limitations: some rare tables
• Ingestion into database for online
access…
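The quality assurance idea of checking extracted values against printed totals can be sketched as follows. The helper names and the sample rows are invented for illustration; the slides do not describe how the checks were implemented.

```python
def parse_count(text):
    """Convert an OCRed cell such as '1,234' to int; return None if not numeric."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def total_ok(cells, total_cell):
    """True only if every cell parses and the components sum to the printed total."""
    values = [parse_count(cell) for cell in cells]
    total = parse_count(total_cell)
    if total is None or any(value is None for value in values):
        return False
    return sum(values) == total


def check_rows(rows):
    """rows maps a row label to (component cells, printed total cell)."""
    return {label: total_ok(cells, total) for label, (cells, total) in rows.items()}


# Hypothetical extracted rows: (male and female counts, printed total).
rows = {
    "Measles": (["3", "2"], "5"),
    "Scarlet Fever": (["10", "1"], "1"),       # OCR misread: '0' became '10'
    "Diphtheria": (["1,2O4", "96"], "1,300"),  # letter O instead of zero
}
report = check_rows(rows)
```

The same `total_ok` check could be applied to a column and its total, or to sub-district values and the printed district figure, which corresponds to the geographical summations mentioned above.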
13. User consultation
• Online survey and informal meeting with
researchers
• Findings
• Mixed level of awareness of MOH reports
• Current access functionality useful (search by
topic and time period)
• Wide range of audiences would be interested in
tabular statistical data
[Chart: Interest in quantitative MOH data; visible label: Very interested]
14. User consultation
• Findings
• Main interest in basic demographics, mortality
and cause of death, ailments, fertility
• Comparative analyses of large subsets of data
would be of interest (e.g. for epidemiologists)
[Chart: Priority of topics]
15. Conclusion
• There is interest in statistical numerical data
• Automated extraction is viable alternative to
manual transcription (with limitations)
• Flexible detection and recognition approaches in
combination with data integration and validation
• Queryable large-scale data enables new research
• Deep insights
• Context for other (qualitative) research
16. Future work
• Creating an index of existing transcribed MOH
tables for better accessibility.
• Create integrated data resource from London
MOH tables for online search across locations and
time.
• Indexing and data extraction across all MOH
reports based on structured OCR results.
• Testing / developing improved table recognition
algorithms (e.g. based on deep learning /
convolutional neural networks).
17. Questions?
In other news
The 5th International Workshop
on Historical Document Imaging
and Processing
Paper submission deadline: 01 June
primaresearch.org/hip2019