Named entity extraction tools for raw OCR text
      Kepa J. Rodriguez
      GCDH-colloquium
      04.07.2012
Outline


•   Context of the experiments at the EHRI project
•   Description of the experiment
•   Corpus data
•   Creation and composition of the corpus
•   Results of the NE extraction
•   Conclusions




                            GCDH Colloquium – 11.07.2012
Context in the EHRI project

•   Archival institutions hold large amounts of non-digitized documents and
    descriptions
•   EHRI will provide its partners with an OCR service that:
     – Extracts text from image files of the documents
     – Yields text that can be used to index the documents and improve the
       quality of the search
     – Produces indexes that collection and archive specialists can later
       validate and improve
•   What kind of indexes can be obtained from this noisy text?
•   Quality of OCR transcripts is very low for humans, but … is it useful for
    machines?




Experiment


•   Evaluation of four existing NE extraction tools:
     – Stanford NER
     – OpenCalais
     – OpenNLP
     – Alchemy

•   Extracted entity types: PER, LOC, ORG
     – Good coverage by the selected tools.
     – Highly relevant for Shoah research and contemporary historical
        research in general.




Experiment

•   Different tools use different annotation tagsets.
     – Output has to be normalized
•   Stanford NER and OpenNLP use Person, Location and Organization as
    annotation categories.
     – Direct mapping to PER, LOC and ORG
•   OpenCalais:
     – Country, City and NaturalFeature merged into LOC
     – Organization and Facility merged into ORG
•   Alchemy:
     – Organization, Facility and Company merged into ORG
     – City and Continent merged into LOC
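
The normalization described on this slide could be sketched as a lookup table. This is only an illustration: the tool keys, the function name and the case-insensitive lookup are assumptions, and the `person` entries for OpenCalais and Alchemy are not stated on the slide.

```python
# Hypothetical normalization table built from the mappings listed on this slide.
# Keys are lowercased so that "Person", "PERSON" and "person" are treated alike.
TAG_MAP = {
    "stanford": {"person": "PER", "location": "LOC", "organization": "ORG"},
    "opennlp": {"person": "PER", "location": "LOC", "organization": "ORG"},
    "opencalais": {
        "person": "PER",  # assumption: not stated on the slide
        "country": "LOC", "city": "LOC", "naturalfeature": "LOC",
        "organization": "ORG", "facility": "ORG",
    },
    "alchemy": {
        "person": "PER",  # assumption: not stated on the slide
        "organization": "ORG", "facility": "ORG", "company": "ORG",
        "city": "LOC", "continent": "LOC",
    },
}


def normalize_tag(tool, tag):
    """Return PER, LOC or ORG for a tool-specific tag, or None if it is unmapped."""
    mapping = TAG_MAP.get(tool.lower(), {})
    return mapping.get(tag.lower())


print(normalize_tag("OpenCalais", "NaturalFeature"))
print(normalize_tag("Alchemy", "Company"))
```

Entity types that are not in the table return `None`, so they can be dropped before the evaluation.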




Corpus data

•   Two datasets of typewritten, monospaced text
•   Wiener Library
     – 17 pages of testimonies of Shoah survivors
     – OCR word accuracy 93%
•   King's College London's Serving Soldier Archive
     – 33 newsletters written for the crew of the warship H.M.S. Kelly
     – OCR word accuracy 92.5%
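
The slides do not say how word accuracy was measured. One common definition, assumed here, is one minus the word-level edit distance divided by the number of words in the corrected reference. A minimal sketch, with hypothetical function names:

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the OCR output
                            curr[j - 1] + 1,      # extra word in the OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, ocr_output):
    """Assumed definition: 1 - (word edit distance / number of reference words)."""
    ref = reference.split()
    hyp = ocr_output.split()
    if not ref:
        raise ValueError("reference text is empty")
    return 1.0 - word_edit_distance(ref, hyp) / len(ref)


# The reference string is a hypothetical manual correction of an OCR fragment.
print(word_accuracy("the tenants of the house had assembled",
                    "the tenqnts or the house had assembled"))
```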




Corpus data (WL)

¢3o
had been sold, and we dependedgxhe last night of our stay on the
friendliness of this neighbour. III!! The landlord Mr.and Mrs.
Wolkewitz, who had always gone out of their way to be kind to us,
had a collection arranged to us, and_wn finally left - on the
night of July 4-5, 1939 - all the tenqnts or the house had
assembled, and we all cried.
All people mentioned so for have either been friends or
acqndintanoes. There were others e.g. the grocer and the laundry
who refused payment before our departure, end there are two
indidente with German officials which I would like to tell:




Corpus data (KCL)

:_
» I |“- _
li; A 1 U g  _:__ L, £g!g;“' »
“K” D. F. NEws.,p
No. 24,~ "Monday, 18th September, 1959.
KELLY at Sea. _ ' P
KINGSTQN at portsmouth, Remainder of "K" Flotilla building.
THE "K" D.E. NEwS IS NCT To EE TAKEN ASHCRE NCR ARE ANY or ITS
CONTENTS To EE CCNRUNICATED CUTSIEE THE SHIP UNTIL THE MAR IS
OVER, wHEN ARRANGEMENTS CAN EE MADE To SUPPLY BACE CCPIES PCR
THE PRICE CR THE PAPER oN WHICH THEY ARE PRINTED.
`________________________as--sauna-__-as-_un-_._-»_.__--.`¢___.-_-
n__________..¢.__
THE KELLY'S HUNT - SEPTENEER Ietn/Ivtn,



Corpus data (KCL)
Although the events of Saturday night and Sunday
morning are Weil known to the KELLY shipis Company. they are
included here as being of interest to the rest of the Flotilla. `
Shortly after dark information was received which enabled
Course to be altered to close a German submarine on the surface.
Before the KELLY could arrive the submarine had dived, but a
Pemarkably good contact was obtained, and an att
C0ntact was maintained all night in order that the final attack
Sh0uld be carried out by daylight- Unfortunately no Oil, wreckage
'OP Survivors came to the surface, but air bUbb1€S appeared after the
1&St attack, which makes it possible, although by no means certain,
that the submarine was destroyed. - _
THE KINGSTON’S PROGRAIME. ~ -
Today the KINGSTON will be inspected by the Commander-
in~Chief, Portsmouth, and will then proceed to sea for acceptance


Construction of the corpus

•   Generate two copies of each dataset
•   Manual correction of one of the copies
     – Used to evaluate the impact of the noise on the NE extraction
•   Tokenization and POS tagging using TreeTagger
•   Conversion of the TreeTagger output into stand-off standard XML
•   Import of the data into the MMAX2 annotation tool
•   Manual annotation of the named entities
•   Control of the reliability of the annotation using the Kappa coefficient
     – K = 0.93
     – K > 0.8 is considered reliable
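
The slides do not state which Kappa variant was used. Assuming Cohen's kappa for two annotators who label the same tokens, the agreement check could be sketched as follows; the function name and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sequence of items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels ("O" marks tokens outside any entity).
annotator_1 = ["PER", "PER", "LOC", "O", "O", "ORG"]
annotator_2 = ["PER", "LOC", "LOC", "O", "O", "ORG"]
print(cohen_kappa(annotator_1, annotator_2))
```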


Corpus data (WL and KCL)


            Wiener Library          KCL

            Raw      Corrected      Raw      Corrected

Files       17       17             33       33
Words       4415     4398           16982    15693
PER         75       83             82       80
LOC         60       63             170      178
ORG         13       13             52       60
Total       148      159            305      319

Results of the NE extraction

(AL = Alchemy, OC = OpenCalais, ON = OpenNLP, ST = Stanford NER;
 P = precision, R = recall)

         Raw                       Corrected

         P       R       F1        P       R       F1

  AL     0.61    0.38    0.47      0.63    0.38    0.48
  OC     0.75    0.29    0.41      0.69    0.30    0.42
  ON     0.42    0.12    0.19      0.53    0.13    0.21
  ST     0.57    0.52    0.54      0.60    0.61    0.60
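
F1 is the harmonic mean of precision (P) and recall (R). The short sketch below recomputes it from the raw-text P and R values in the table above. Because those inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall on raw OCR text, copied from the table above.
RAW_SCORES = {
    "AL": (0.61, 0.38),
    "OC": (0.75, 0.29),
    "ON": (0.42, 0.12),
    "ST": (0.57, 0.52),
}

for tool, (p, r) in RAW_SCORES.items():
    print(tool, round(f1_score(p, r), 2))
```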




Results of the NE extraction

 •   The tools perform poorly on both corrected and raw text
 •   Our data are quite different from the data used for training and
     evaluation of the tools.
 •   PER: non-standard forms such as
      – [Last name, First name]
            • “Wa1ter, Klaus”
      – Parentheses together with initials of the name
            • “Captain (D)”
      – Some cases can be resolved using simple heuristics in preprocessing
 •   Names of persons and locations are used for other kinds of entities:
            • Warships have been annotated as PER
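
A preprocessing heuristic of the kind mentioned above might look like the sketch below. It is not the project's implementation. It assumes two rules: a digit 0 or 1 between two letters is an OCR confusion for "o" or "l", and a candidate string of the form "Last, First" is a person name to be reordered. All names in the code are hypothetical.

```python
import re

# Assumed OCR confusions: a 0 or 1 surrounded by letters stands for "o" or "l".
INNER_ZERO = re.compile(r"(?<=[A-Za-z])0(?=[A-Za-z])")
INNER_ONE = re.compile(r"(?<=[A-Za-z])1(?=[A-Za-z])")
# A candidate name written as "Last, First".
LAST_FIRST = re.compile(r"([A-Z][a-z]+), ([A-Z][a-z]+)")


def fix_digit_confusions(text):
    """Replace digits 0 and 1 that occur between two letters."""
    text = INNER_ZERO.sub("o", text)
    return INNER_ONE.sub("l", text)


def reorder_name(candidate):
    """Turn "Last, First" into "First Last"; leave any other string unchanged."""
    match = LAST_FIRST.fullmatch(candidate.strip())
    if match is None:
        return candidate
    return f"{match.group(2)} {match.group(1)}"


print(reorder_name(fix_digit_confusions("Wa1ter, Klaus")))
print(fix_digit_confusions("C0ntact was maintained all night"))
```

The reordering is applied to a single candidate string rather than to running text, because a pattern such as "Portsmouth, Remainder" would otherwise be rewritten as well.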




Results of the NE extraction

•   Performance of the extraction of entities of type ORG is very low
     – F1 between 0.11 and 0.32
     – Names of organizations appear in non-standard forms
     – Some of the organizations no longer exist and are not part of the
        knowledge used to train the systems.
          • SS and other relevant Nazi organizations have not been detected.
•   Spelling errors and typos in the original files:
     – OpenCalais used general knowledge to resolve this problem
     – Use of general knowledge may be problematic.
          • “Klan, Walter” → “Ku Klux Klan”




Conclusions

•   Manual correction of OCR output does not significantly improve the
    performance.
     – Raw output is enough to obtain provisional index candidates
•   Focus in the near term:
     – Identify the most frequent error patterns
     – Implement a preprocessing pipeline using simple heuristics and
        pattern matching tools
•   Focus in the longer term:
     – Use domain-specific knowledge in the form of authority files to
        validate and correct the output of NE extraction tools.
     – Explore the possibility of combining different NE extraction tools
        and selecting output using a voting algorithm
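
The voting idea in the last point is listed as future work, so no algorithm is given. One simple option, shown here only as an assumption, is to keep an entity when at least two tools report exactly the same span with the same label. The data layout and the example annotations are hypothetical.

```python
from collections import Counter


def vote(annotations, min_votes=2):
    """Keep (start, end, label) spans reported by at least `min_votes` tools.

    `annotations` maps a tool name to the spans it produced for one text.
    Only exact matches of offsets and label count as agreement.
    """
    counts = Counter()
    for spans in annotations.values():
        counts.update(set(spans))  # each tool votes at most once per span
    return {span for span, votes in counts.items() if votes >= min_votes}


# Hypothetical output of the four tools for the same text.
tool_output = {
    "ST": {(0, 6, "PER"), (10, 20, "LOC")},
    "OC": {(0, 6, "PER")},
    "AL": {(10, 20, "ORG")},
    "ON": set(),
}
print(vote(tool_output))
```

Requiring exact span and label agreement is strict. A more tolerant variant could count overlapping spans as agreement, at the cost of more complex bookkeeping.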




Thanks





