Towards the Extraction of Statistical Information
from Digitised Numerical Tables
The Medical Officer of Health Reports Scoping Study
Christian Clausner, Apostolos Antonacopoulos,
Christy Henshaw, Justin Hayes
University of Salford
Wellcome Collection
DATeCH 2019, Brussels, 25/09/2019
The Medical Officer of Health Reports
• Wellcome Collection holds the UK’s
largest collection of Medical
Officer of Health reports
• 130 years
• Over 70,000 reports
• All digitised and OCRed
https://wellcomelibrary.org/moh/
The Medical Officer of Health Reports
• Narrative textual content + tabular content
• Topics:
• Birth and death statistics
• Notifiable diseases
• General population statistics
• Causes of death
• School health
• Food inspections
• …
The Medical Officer of Health Reports
• OCRed and post-corrected data
available for Greater London
• Individual tables provided in
special format
• Statistical data difficult to
extract
Current Practices
• Standard OCR not sufficient for
extraction of numerical data
• Need accuracy for values AND
context (column / row)
• Common:
• Only indexing and providing access
to images with tables
• Manual correction and provision of
tables in dedicated formats
• Rare / very difficult or expensive:
• Full extraction and integration to
provide faceted searches / data
analysis etc.
1961 Census of England and Wales
The MOH Scoping Study (2018)
• Gain understanding of tabular
data available in the reports
• Investigate ways of data
extraction
• Scope out users’ needs and
expectations
• Based on Greater London data
Identification of table topics
• Text-based analysis of table
captions and headers
• Grouping instances by text
similarity
• Using a tool that was created for
social media analysis
Topic                                       Table count (approx.)
Mortality / Cause of Death                  2530
General statistics / demographics           1900
Infectious Diseases / Notifiable Diseases   1720
Inspections / conditions                    4360
Minor ailments, dental, etc.                710
Financial                                   470
Food                                        330
Births                                      240
Meteorological                              100
Legal                                       190
Immunisation                                60
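The grouping of captions by text similarity could be sketched roughly as follows. This is only an illustration: the slides do not name the tool that was used, so the greedy grouping strategy, the `difflib` similarity measure, the 0.75 threshold and the sample captions are all assumptions.

```python
from difflib import SequenceMatcher


def normalise(caption):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(caption.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised captions."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def group_captions(captions, threshold=0.75):
    """Greedy grouping: a caption joins the first group whose first member
    is similar enough, otherwise it starts a new group."""
    groups = []
    for caption in captions:
        for group in groups:
            if similarity(caption, group[0]) >= threshold:
                group.append(caption)
                break
        else:
            groups.append([caption])
    return groups


# Hypothetical captions, not taken from the MOH reports.
captions = [
    "Causes of death during the year",
    "Causes of Death during the Year.",
    "Notifiable diseases",
    "Notifiable diseases.",
    "Meteorological observations",
]
groups = group_captions(captions)
for group in groups:
    print(len(group), group[0])
```

Comparing against the first member of each group keeps the sketch short; a real analysis would more likely use a proper clustering method and a hand-checked threshold.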
Identification of table topics
• Geographies:
• Mostly districts
• Also smaller areas (sub-districts,
wards)
• Considerable variety of
• Information content
• Physical structure
• Across many
• Locations
• Years
• Demographics
• Age
• Sex
• Births
• Deaths
• Causes of death
• Infant death
• Ailments
• Diseases
• Infectious diseases
• Notifiable diseases
• Immunisations
• Environmental
• Inspections
• Food
• Conditions
• Meteorological
• Financial
• Legal
Extraction of tabular data
• Can remaining data be extracted
in a less costly way?
• Available for experiments:
• OCR results in ALTO XML format
(Greater London)
• Ran ABBYY FineReader Engine 11
ourselves
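As a rough illustration of working with OCR results in ALTO XML, the sketch below reads the recognised words and their positions from a small, made-up ALTO fragment. The `String` element and its `CONTENT`, `HPOS` and `VPOS` attributes are part of the ALTO schema; the sample content, the namespace version and the helper names are assumptions, and the study's actual processing is not described in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment; real files carry far more structure and metadata.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Deaths" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
      <String CONTENT="1,234" HPOS="400" VPOS="200" WIDTH="60" HEIGHT="20"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so any ALTO schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def read_words(alto_xml):
    """Return every recognised word with its top-left page coordinates."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT", ""),
                "x": float(element.get("HPOS", 0)),
                "y": float(element.get("VPOS", 0)),
            })
    return words


words = read_words(SAMPLE_ALTO)
for word in words:
    print(word)
```

Words that share a similar `y` value can then be treated as candidates for one table row, and similar `x` values as candidates for one column.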
Extraction of tabular data
• Tests with ABBYY FineReader
• Very inconsistent results
• But column and row headers
sufficiently recognised
Extraction of tabular data
• Prototype: Flexible matching to
locate rows and columns of
interest
• Ignore other data that is less
consistent
• Order of headers usually stable
across geographies
• Variation across the years, but
doable
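One way to picture flexible matching is fuzzy lookup of the row and column headers of interest, tolerating OCR errors and ignoring every other cell. The sketch below is an assumption about how this might look, not the prototype itself: the function names, the 0.7 cutoff and the sample table (with deliberate OCR-style errors such as "1" for "l") are invented for illustration.

```python
from difflib import get_close_matches


def find_header(target, ocr_headers, cutoff=0.7):
    """Return the index of the OCR header closest to target, or None."""
    cleaned = [header.lower().strip() for header in ocr_headers]
    matches = get_close_matches(target.lower(), cleaned, n=1, cutoff=cutoff)
    return cleaned.index(matches[0]) if matches else None


def extract_cell(table, row_target, col_target):
    """Locate one cell by fuzzy-matching its row and column headers.
    The first table row holds column headers, the first column row headers."""
    header, *rows = table
    col = find_header(col_target, header)
    row = find_header(row_target, [r[0] for r in rows])
    if row is None or col is None:
        return None
    return rows[row][col]


# Hypothetical OCR output for one table.
table = [
    ["Cause of death", "Ma1es", "Fema1es", "Tota1"],
    ["Measles", "12", "9", "21"],
    ["Whooping c0ugh", "7", "11", "18"],
    ["Scar1et fever", "3", "2", "5"],
]

found = extract_cell(table, "Whooping cough", "Females")
missing = extract_cell(table, "Diphtheria", "Females")
print(found, missing)
```

Returning `None` for headers that cannot be matched lets less consistent tables be skipped rather than misread.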
Extraction of tabular data
• Large proportion of tabular data
could be extracted in an automated
way
• Quality assurance using row /
column totals and geographical
summations
• OCR quality good enough
• Limitations: some rare tables
• Ingestion into database for online
access…
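The quality assurance idea of using row / column totals and geographical summations can be sketched as a pair of consistency checks. The function names and all figures below are made up for illustration; the slides do not say how the checks were implemented.

```python
def parse_int(cell):
    """Parse an OCRed number, tolerating thousands separators.
    Returns None when the cell is not a clean number."""
    digits = cell.replace(",", "").replace(" ", "")
    return int(digits) if digits.isdigit() else None


def check_row_totals(rows):
    """rows: (label, [value strings], stated total string).
    Returns the labels whose values do not add up to the stated total."""
    failures = []
    for label, values, total in rows:
        numbers = [parse_int(value) for value in values]
        stated = parse_int(total)
        if stated is None or None in numbers or sum(numbers) != stated:
            failures.append(label)
    return failures


def check_geographic_sum(sub_area_values, district_value):
    """Sub-area figures (e.g. wards) should add up to the district figure."""
    return sum(sub_area_values) == district_value


# Hypothetical extracted rows: the second has a wrong total, the third an OCR error.
rows = [
    ("Measles", ["12", "9"], "21"),
    ("Whooping cough", ["7", "11"], "13"),
    ("Scarlet fever", ["3", "Z"], "5"),
]
failures = check_row_totals(rows)

# Hypothetical ward-level deaths against a district total.
wards = {"North": 120, "South": 95, "East": 143}
district_ok = check_geographic_sum(wards.values(), 358)
print(failures, district_ok)
```

Rows that fail a check can be routed to manual review instead of being ingested automatically.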
User consultation
• Online survey and informal meeting with
researchers
• Findings
• Mixed level of awareness of MOH reports
• Current access functionality useful (search by
topic and time period)
• Wide range of audiences would be interested in
tabular statistical data
[Chart: Interest in quantitative MOH data]
User consultation
• Findings
• Main interest in basic demographics, mortality
and cause of death, ailments, fertility
• Comparative analyses of large subsets of data
would be of interest (e.g. for epidemiologists)
[Chart: Priority of topics]
Conclusion
• There is interest in statistical numerical data
• Automated extraction is a viable alternative to
manual transcription (with limitations)
• Flexible detection and recognition approaches in
combination with data integration and validation
• Queryable large-scale data enables new research
• Deep insights
• Context for other (qualitative) research
Future work
• Creating an index of existing transcribed MOH
tables for better accessibility.
• Creating an integrated data resource from London
MOH tables for online search across locations and
time.
• Indexing and data extraction across all MOH
reports based on structured OCR results.
• Testing / developing improved table recognition
algorithms (e.g. based on deep learning /
convolutional neural networks).
Questions?
In other news
The 5th International Workshop on Historical Document Imaging and Processing
Paper submission deadline: 01 June
primaresearch.org/hip2019
